OK, so tons, as you can imagine! One of the things we recognised at the very beginning of this project, when we looked at a lot of the old-school GPGPU shaders, was that they tended to be really long, and it became clear that we would need to do a lot of global optimisation, not just within a basic block but across basic blocks. That motivated using Open64 as the compiler front-end, so that's the first stage of the path. We turn on a bunch of the optimisations in Open64, and they're tunables, so we can turn them on and off at will. And it's a challenge, compiling for GPUs versus CPUs. There isn't just a fixed register file, so that adds a whole new twist to the optimisation opportunities you have. So we do a lot of global optimisation in that Open64 stage, and that compiler core is the same core that's used for Itanium.
So the beta was fully synchronous, just so we could get it out there with something working, and the sequential semantics of the API make it really straightforward to program in that sense. What we've done for the next release is relax that a little bit, so it's a bit like OpenGL. Any calls to the GPU are going to return immediately, and like with OpenGL, where you have glFinish(), there's a CUDA version to help you sync on return. I don't recommend you put it in your code unless you're doing timing, but it's there to help you synchronise. We also provide a profiler for that, which will spit out GPU timings. I want to keep going towards asynchronous in CUDA, where you're hitting on three things at the same time: you're transferring data, running the CPU and running the GPU. That introduces a level of complexity for the programmer though, since at the moment it's nice and clean and sequential, but to get the most out of the programming model you need to move to async.
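The "returns immediately, then sync when you need the result" pattern described above can be sketched on the host alone. The following is only an analogy in standard C++, not the CUDA API: `launch_sum` and `sync_and_get` are hypothetical names, and a `std::future` stands in for work queued on the GPU.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for a call that queues work on a device.
// std::launch::async starts the work on another thread, so the caller
// gets a future back without waiting for the sum to be computed.
std::future<long long> launch_sum(std::vector<int> data) {
    return std::async(std::launch::async, [d = std::move(data)]() {
        return std::accumulate(d.begin(), d.end(), 0LL);
    });
}

// Hypothetical explicit sync point, playing the role of a finish-style
// call: block until the queued work is done, then hand back the result.
long long sync_and_get(std::future<long long>& pending) {
    assert(pending.valid());  // a result may only be collected once
    pending.wait();           // the only place the caller blocks
    return pending.get();
}
```

Between `launch_sum` and `sync_and_get` the calling thread is free to do other work, which is the overlap the asynchronous model is after. Timing code would need the sync before reading the clock, since the launch itself returns before the work has finished.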
Well, yes and no. There are two levels of parallelism in CUDA, as you know. There's the number of threads per block: on the GeForce 8 architecture, as long as you have at least 128 to 256 threads active inside a thread processor, that keeps the ALUs pretty darn busy. And going forward we're conscious of that being a pretty hard-coded number. But there's another level of growth where you have more thread processors, so more grids that you can queue up, and it will boil down to how much data you have, to exploit data parallelism. So right now, if you hard-code for one chip and feed it the bare minimum to exploit its performance, to just exactly fit, then you might not scale to a new chip. But most of our apps, say if you look at our BLAS lib or our FFT lib, will scale very nicely on a new chip.
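As a rough illustration of the scaling point, the number of blocks can be derived from the data size instead of being fixed for one chip. This is a minimal host-side sketch in plain C++; `blocks_for` and `waves_for` are hypothetical helpers, and the "blocks in flight" figure is an assumed, simplified model of a chip rather than a real hardware number.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: blocks needed to cover n elements when each block
// runs threads_per_block threads (ceiling division, one thread per element).
std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: passes a chip needs to work through the blocks if it
// can keep blocks_in_flight of them running at once (assumed model).
std::size_t waves_for(std::size_t blocks, std::size_t blocks_in_flight) {
    assert(blocks_in_flight > 0);
    return (blocks + blocks_in_flight - 1) / blocks_in_flight;
}
```

With 1,000,000 elements and 256 threads per block, `blocks_for` gives 3907 blocks. Under this model a chip holding 16 blocks in flight needs 245 passes and one holding 32 needs 123, so the same launch gets faster on the bigger part. A launch sized to exactly 16 blocks would take one pass on either chip and leave half of the larger one idle.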