Considering the programming semantics and the threading model, from warps to blocks to grids: was that programming model born from the architecture, or is it a bit more abstract than that?

There's a fair amount of influence from characteristics of the chip that already existed, because it's designed first for graphics. But the whole idea of warps and blocks and shared memory and the synchronisation was also designed from the top down, and the chip was made to fit that in a way. We have a bunch of people with great experience of designing and building other parallel machines in the past, so CUDA was designed as both the software and the underlying hardware components together. We wanted a programming model such that people with experience writing data-parallel programs would recognise the CUDA model. If you look at older data-parallel machines, you'll see things in those systems that are present in our model with CUDA.
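As a rough illustration of the pieces being described -- threads grouped into blocks, a grid of blocks, per-block shared memory, and barrier synchronisation -- here is a minimal CUDA kernel sketch. The kernel name and the 256-element tile size are arbitrary choices for the example, not anything from the interview.

```cuda
// Each block reverses a 256-element tile through shared memory, with a
// barrier so every load completes before any thread stores its result.
__global__ void reverseTiles(const float *in, float *out)
{
    __shared__ float tile[256];                  // per-block shared memory

    int tid = threadIdx.x;                       // thread index within the block
    int idx = blockIdx.x * blockDim.x + tid;     // global index across the grid

    tile[tid] = in[idx];                         // each thread loads one element
    __syncthreads();                             // block-wide barrier

    out[idx] = tile[blockDim.x - 1 - tid];       // write the tile out reversed
}

// Host-side launch: a grid of N/256 blocks, each with 256 threads.
// reverseTiles<<<N / 256, 256>>>(d_in, d_out);
```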

Is there any scope, when you set up a CUDA program and you have to be aware of your occupancy on the GPU to do the right things to fill the chip and keep it busy, for the runtime to decide some of those dimensions of execution, so that your program scales automatically on the underlying GPU, which might not be G80 in the future? Under certain conditions the runtime might have a better idea about GPU occupancy than the programmer, so the question is born from that.

You know, it's a dilemma. Do you want things to be done automatically, which might make people mad: "I told you I wanted 16 threads, so give me 16 threads dammit!"? But on the other hand, if you can make things parameterised, then the runtime and the chip can know things that you don't. And that's right: a year from now, with the next generation, some things may change, and you didn't know that when you wrote your code.

That's the basic idea, so imagine a situation where you've profiled to just fill G80, and then you run on a future GPU. It won't run slower than it does right now, but it might not run any faster either, and the runtime could maybe help you out there.
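A hedged sketch of the "parameterise rather than hard-code" idea being discussed: the host queries the device at runtime and derives the launch configuration from what it finds, instead of baking in numbers tuned for one chip. The kernel, the grid-stride loop, and the sizing heuristic here are assumptions for illustration only.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel using a grid-stride loop, so any grid size covers n elements.
__global__ void scaleArray(float *data, float s, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}

void launchScale(float *d_data, float s, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // ask the runtime about the chip

    // Derive the configuration from the device rather than hard-coding it.
    int threads = 256;                           // a common, portable block size
    int blocks  = prop.multiProcessorCount * 4;  // heuristic: keep every multiprocessor busy

    scaleArray<<<blocks, threads>>>(d_data, s, n);
}
```

The point is only that the same binary adapts its launch dimensions when the multiprocessor count changes on a future part.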

Hopefully that's not the case! It's possible, but hopefully, programming for CUDA now, you realise that it's not just about one chip.

So could we possibly see that change, where you don't really specify any static occupancy at all, and you just let the runtime deal with filling the chip up?

It's a similar idea to being creative with shared memory: do the chip and the compiler know enough about the resources available in the machine to let us do that? So just as an example, take Photoshop and a possible use of CUDA -- and I don't know if they're using it or not, but assume they are for this! -- they want to have one binary that runs on all of the G8x family automagically and runs well, so how do they do that? We'll see.
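One way the "one binary across the G8x family" scenario can be approached is by not compiling resource choices in at all: below is a sketch that sizes dynamically allocated shared memory from what the runtime reports for the installed chip. The kernel and the fallback policy are hypothetical, purely to illustrate the shape of the idea.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block stages a tile in dynamically sized shared
// memory. The tile size is not compiled in; the host chooses it per device.
__global__ void copyThroughTile(const float *in, float *out, int n)
{
    extern __shared__ float tile[];              // sized by the host at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) tile[threadIdx.x] = in[idx];
    __syncthreads();
    if (idx < n) out[idx] = tile[threadIdx.x];
}

void launchCopy(const float *d_in, float *d_out, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int    threads = 256;
    size_t bytes   = threads * sizeof(float);
    int    blocks  = (n + threads - 1) / threads;

    // Same binary on any G8x part: check the chip's actual shared-memory
    // limit at runtime rather than baking one chip's limits into the code.
    if (bytes > prop.sharedMemPerBlock) return;  // illustrative fallback

    copyThroughTile<<<blocks, threads, bytes>>>(d_in, d_out, n);
}
```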

So the reason CUDA is synchronous right now was just to get it up and running?

Yeah, when the full release comes out you'll see async. I mean, it's working now; it just took a little longer than expected to get right, so you didn't see it in the first beta.
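For reference, this is roughly what asynchronous usage looks like with the stream API that later shipped in the CUDA runtime: copies and kernel launches are queued on a stream and return immediately, with an explicit synchronisation point when the host actually needs the results. The kernel and sizes here are assumptions for the sketch.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, launched asynchronously in a stream.
__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

void runAsync(float *h_data, int n)   // h_data assumed page-locked (cudaMallocHost)
{
    float *d_data;
    cudaStream_t stream;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaStreamCreate(&stream);

    // These calls return immediately; the work is queued on the stream and
    // the CPU is free until the explicit synchronisation below.
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 2.0f, n);
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);    // wait for the queued work to finish

    cudaStreamDestroy(stream);
    cudaFree(d_data);
}
```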