So you teach developers to maximise occupancy for any given chip, but to think a bit bigger than that in terms of future scaling when they bring their codes to a new one?

Yeah, and it's usually the size of the problem, the data, that defines how they'll scale. It's about how many threads they kick off and how data-parallel they are, and generally those problems will just fill up the machine to maximum occupancy and beyond on current chips, so they'll scale. If you don't have that much data, or don't kick off enough threads, then that's when you probably won't even eat up G80, never mind a future processor.
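
As a rough illustration of that point, here's a minimal sketch (the kernel, sizes and names are hypothetical, not from the interview) where the data size, rather than the chip, decides how many threads get launched:

```
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard so surplus threads in the last block do nothing
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 22;                // the problem size drives the launch, not the chip
    const int threadsPerBlock = 256;      // a common block size
    float *d_data;

    cudaMalloc((void **)&d_data, n * sizeof(float));

    // The grid grows with n, so a large enough problem saturates G80 today and
    // simply spreads across more multiprocessors on a wider future part.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaThreadSynchronize();
    printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);

    cudaFree(d_data);
    return 0;
}
```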

And if an application does hit a performance wall, as long as the algorithm is suited to exploiting data parallelism then it's not too hard to adjust it to make it scale a bit better?

That's right, and it really hasn't been a problem for us. It's only really come up when, say, we've said, "well, let's find the smallest problem that'll give maximum occupancy on the GPU" for our regression testing, and then that doesn't scale too well. But our customers generally aren't writing those kinds of apps, so theirs will scale really well.
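
For a sense of scale, here's a back-of-the-envelope sketch of that "smallest fully-occupying problem" idea; the G80 figures are my assumption of the relevant limits, and a real kernel only reaches them if its register and shared memory usage allow full occupancy:

```
#include <stdio.h>

int main(void)
{
    // Assumed G80 limits: 16 multiprocessors, up to 768 resident threads each.
    const int multiprocessors    = 16;
    const int threadsPerSM       = 768;
    const int minResidentThreads = multiprocessors * threadsPerSM;   // 12,288

    // With one element per thread, a ~12K-element problem just fills the chip --
    // exactly the kind of launch with no headroom left to scale onto a bigger part.
    printf("smallest fully-occupying problem: ~%d threads\n", minResidentThreads);
    return 0;
}
```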

So what's coming in the 1.0 release and what can we look forward to there?

The emphasis there was 64-bit support for host platforms, and that was a big ask from our customers so they could set CUDA free on certain problems with large sets of data. So we have the compiler set up for that, and as long as you have clean 32-bit code it shouldn't have a problem running nicely in a 64-bit CUDA environment. The other big thing there is async. The aim of the beta really was to get feedback, and we were pretty happy with that and with what it's fed into 1.0.
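
As a hedged sketch of the asynchronous behaviour being referred to: a kernel launch returns to the host immediately, so the CPU can keep working while the GPU runs, and the host only blocks when it explicitly synchronises. The kernel name and sizes here are illustrative, not from the interview.

```
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void busyKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (float)i * 0.5f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    busyKernel<<<(n + 255) / 256, 256>>>(d_out, n);   // returns to the CPU right away

    // ... host-side work can run here, overlapped with the GPU ...

    cudaThreadSynchronize();   // block until the kernel has finished before using results
    printf("kernel done\n");

    cudaFree(d_out);
    return 0;
}
```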

So is there a push to provide more of the CUDA infrastructure to developers now that we're getting close to 1.0? More tools and profilers and ways to work with PTX, for example?

So if you look at our toolchain just now, the front end is nvcc, but there's also a PTX assembler called ptxas, and we'll provide the specs for PTX so that people can play with it. If you look at the PTX assembly, it's a little weird for things like register allocation, but the rest of PTX is pretty straightforward, so there's opportunity there for people to do some stuff with it if they want to target PTX from an application in whatever way.

So it seems that you don't really have to translate or parse or process PTX in any way to get a good look at the ISA, you can pretty much just read it and get good insight into what's going on with a particular application?

Everything's just simple text mode with us, haha, so there's no need to decompile or disassemble PTX or what have you to take a look at it like that, you just open it up and read it if that's what you want to do. And the toolchain always feeds through the PTX layer, so there's nothing really hidden from view; you can see what we ask the chip to execute.
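
A small sketch of how one might actually go and look at that PTX: compile a kernel to the intermediate text and read it directly. The nvcc and ptxas invocations in the comments are my assumption of the usual flags, and the kernel itself is just an example.

```
// Emit human-readable PTX, then optionally assemble it to a native binary:
//
//   nvcc -ptx saxpy.cu -o saxpy.ptx
//   ptxas saxpy.ptx -o saxpy.cubin
//
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // the multiply-add shows up plainly in the .ptx text
}
```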

But there is a PTX-to-native compiler?

Yeah, and we provide a native binary compiler for two reasons. The first is so companies can ship apps along with a runtime version and have predictable performance on a GPU that they can rely on. The second is for performance, where we just load compiled binaries into a code cache and execute out of there if nothing's changed, so we don't have to compile again every time. Depending on the flags you ask for when you compile, you can get both behaviours.
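
Here's a hedged sketch of the "ship a precompiled binary" path being described, using the driver API to load a cubin built ahead of time (for instance with nvcc -cubin). The file and kernel names are hypothetical, and the launch details are elided.

```
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load the native binary shipped with the app; no compilation happens at run
    // time, so the performance characteristics are fixed when the app is built.
    if (cuModuleLoad(&mod, "kernel.cubin") != CUDA_SUCCESS) {
        fprintf(stderr, "failed to load kernel.cubin\n");
        return 1;
    }
    cuModuleGetFunction(&fn, mod, "myKernel");

    // ... set parameters and launch fn here ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```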

I think that's something they do on the graphics side with shader compilation, so it seems like a reasonable optimisation to make if you're executing unchanged code.