NVIDIA CUDA 1.0 released

Monday 25th June 2007, 09:09:00 PM, written by Tim

NVIDIA has released version 1.0 of its CUDA programming framework, with a large number of new features including asynchronous kernel calls and 64-bit Linux support. As with 0.8, Red Hat Enterprise Linux, OpenSUSE and SUSE Linux Enterprise Desktop appear to be supported, as well as 32-bit Windows XP.

If you look at the release notes, you'll see a number of interesting things as well as the new features. First, there's all the new device support (all G80s as well as G84--no G86, though). There's a brief mention of PTX, which is the intermediate ISA generated by the CUDA compiler. There are also numerous improvements in the FFT and BLAS libraries, plus more performance and stability improvements.

Asynchronous kernel calls come with new calling conventions, so you should definitely check out the programming guide to see how to use them. Also note that there are now two different versions of CUDA-capable chips: G80 is v1.0, G84 is v1.1. At the moment, the only feature that seems to separate the two is atomic functions, which are only available in 1.1. However, this does mean that you could write a global mutex for your G84 (there is atomic compare-and-swap)...
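
To make the global mutex idea concrete, here is a minimal sketch of a spin lock built on atomic compare-and-swap. It is an illustration under assumptions, not something taken from the release notes or the SDK: the names `lock`, `unlock`, `increment` and `run_counter_demo` are hypothetical, only one thread per block takes the lock (threads of the same warp spinning on one lock can hang), and `__threadfence()` comes from CUDA releases later than 1.0. It needs a part with atomic support, so v1.1 or newer.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: spin until the lock word flips from 0 (free) to 1 (held).
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0) {
        // busy-wait; another block currently holds the lock
    }
    __threadfence();  // assumption: fence from later CUDA releases, orders the critical section after the acquire
}

// Hypothetical helper: publish our writes, then mark the lock free again.
__device__ void unlock(int *mutex) {
    __threadfence();
    atomicExch(mutex, 0);
}

// Each block bumps a shared counter inside the critical section.
// The counter is volatile so every access really goes to global memory.
__global__ void increment(int *mutex, volatile int *counter) {
    if (threadIdx.x == 0) {           // one thread per block takes the lock
        lock(mutex);
        *counter = *counter + 1;      // deliberately non-atomic read-modify-write
        unlock(mutex);
    }
}

// Host-side driver: launches `blocks` blocks and returns the final counter,
// or -1 if the result could not be copied back.
int run_counter_demo(int blocks) {
    int *d_mutex = nullptr;
    int *d_counter = nullptr;
    int result = -1;
    if (cudaMalloc(&d_mutex, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_counter, sizeof(int)) != cudaSuccess) {
        cudaFree(d_mutex);
        return -1;
    }
    cudaMemset(d_mutex, 0, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    increment<<<blocks, 32>>>(d_mutex, d_counter);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&result, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_mutex);
    cudaFree(d_counter);
    return result;
}
```

Calling `run_counter_demo(64)` from your own `main` should hand back 64 if the lock really does serialize the blocks. The same increment could of course be done with a single `atomicAdd`; the point of the sketch is only to show compare-and-swap guarding an arbitrary critical section.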

Look for more CUDA coverage soon; we'll be exploring version 1.0 as well!


Latest Thread Comments (10 total)
Posted by Tim Murray on Tuesday, 26-Jun-07 03:28:06 UTC
I was all excited about writing global mutexes too, until John Stone told us that it wasn't supported by G80. :(

Posted by Geo on Tuesday, 26-Jun-07 04:58:43 UTC
You need to see this as an opportunity. "But, John, I'm going to need you guys to give me an 8600 then so we can cover it the way it deserves. . . "

Posted by Tim Murray on Tuesday, 26-Jun-07 06:19:23 UTC
Quoting Geo
You need to see this as an opportunity. "But, John, I'm going to need you guys to give me an 8600 then so we can cover it the way it deserves. . . "
Er, John Stone is the guy at UIUC who wrote all of their CUDA stuff (and probably knows more about making CUDA apps go fast than anyone who's not on the CUDA team proper). :p

Posted by nutball on Tuesday, 26-Jun-07 08:57:24 UTC
Awesome news! Slightly worrying that two parts from the same family of hardware can have differing functionality like that though -- going to make for some compatibility-issues-from-hell situations. Oh well, that's progress I suppose!

Posted by Tim Murray on Tuesday, 26-Jun-07 15:28:00 UTC
Quoting nutball
Awesome news! Slightly worrying that two parts from the same family of hardware can have differing functionality like that though -- going to make for some compatibility-issues-from-hell situations. Oh well, that's progress I suppose!
I really don't think it will. Like I said, the only thing that is different right now is the support for atomic functions, and I still can't really figure out why you'd ever want to use them. Performance is probably completely abysmal, for one, and I would imagine that you could do all of it on the CPU much faster.

Posted by Geo on Tuesday, 26-Jun-07 15:41:16 UTC
And yet they added them to the more recent part. That says something.

Posted by silent_guy on Tuesday, 26-Jun-07 17:17:31 UTC
Quoting Tim Murray
I really don't think it will. Like I said, the only thing that is different right now is the support for atomic functions, and I still can't really figure out why you'd ever want to use them. Performance is probably completely abysmal, for one, and I would imagine that you could do all of it on the CPU much faster.
At this moment, the only way for multiple blocks to interact with each other is by using really ugly hacks. With atomic functions, it's much easier and you can now implement semaphores and polling loops to align all blocks and restart calculating without the overhead of the CPU having to reissue a kernel.

If you only use __syncthreads intra-warp and 1 atomic function per warp for inter-warp synchronization, then maybe performance won't be all that bad?

I had a quick look at the SDK this morning and grepped for 'atomic': they have the histogram64 example where they use atomics on a 1.1 shader and reduction on a 1.0 shader. It would be nice if someone with an 8600 could try both and compare the execution speeds.

Posted by Tim Murray on Tuesday, 26-Jun-07 17:25:22 UTC
Quoting silent_guy
At this moment, the only way for multiple blocks to interact with each other is by using really ugly hacks. With atomic functions, it's much easier and you can now implement semaphores and polling loops to align all blocks and restart calculating without the overhead of the CPU having to reissue a kernel.
I'm just not convinced that it's going to be faster than just using the CPU to perform global synchronization. I also wonder how it's implemented (whether it costs two memory operations or what).

Posted by silent_guy on Wednesday, 27-Jun-07 05:12:04 UTC
Quoting Tim Murray
I'm just not convinced that it's going to be faster than just using the CPU to perform global synchronization. I also wonder how it's implemented (whether it costs two memory operations or what).
I assume you're hinting at using L2 caches in the ROPs to prevent full external memory round trips?

As for being faster or not than CPU based synchronization: It will probably depend on the number of warps in play? For a smaller number, atomic operations will definitely have a lower overhead than a CPU relaunch (PCIe latency etc.).
For a large number, atomic ops may have too many collisions and eventually CPU overhead will be smaller. My feeling is that you should be able to go pretty far with atomics, before you hit a wall, by having multiple synchronization stages.

Anyway, it's definitely nice to have the option. Since I just want to play around with it, absolute speed is not my top concern: I may buy an 8600 instead of an 8800 just for this feature.

Posted by armchair_architect on Wednesday, 27-Jun-07 05:22:03 UTC
Quoting Tim Murray
I'm just not convinced that it's going to be faster than just using the CPU to perform global synchronization. I also wonder how it's implemented (whether it costs two memory operations or what).
No idea if this is how they've implemented it of course, but in the graphics pipeline, the z/stencil tests and color blending are all atomic RMW operations. So this is very similar to something they've optimized heavily before.

