A Brief History of CUDA


When NVIDIA's G80 launched in November 2006, there was a brief mention of a new toolkit that would greatly simplify GPU computing development. Called CUDA (for Compute Unified Device Architecture), it was described as a C derivative that would run on the GPU without using any 3D API as an intermediary. We also knew that the lead architect for CUDA was Ian Buck, a student of the legendary Pat Hanrahan at Stanford and one of the authors of the original BrookGPU paper. Considering its pedigree, we were very excited to see what he could do with G80.

While we waited to get our hands on CUDA, we saw three major advantages to NVIDIA's approach. First, by bypassing 3D APIs entirely, there's no risk of a future driver release breaking an application, a problem that has plagued GPU computing in the past; consider Folding@Home's initial release on R580 and the continued absence of G80 support as an example. Second, it makes GPU computing more accessible by allowing developers to write their applications in a potentially more familiar manner, rather than shoehorning them to fit a 3D API's paradigm. Finally, it gives developers access to portions of the chip that they couldn't use directly through a 3D API.

In February, NVIDIA released a beta of CUDA to the public. Our ideas about the advantages of the CUDA approach were confirmed, especially with the exposure of the parallel data cache, which reduces DRAM accesses and accelerates algorithms that would previously have been limited by memory bandwidth or latency. Of course, it wasn't perfect: among other limitations, it supported only single-precision (32-bit) floating point, which isn't sufficient for many applications. Still, the beta showed enormous promise, and it was obvious that GPU computing would rapidly become a major part of NVIDIA's business.

Today, NVIDIA launches its third brand of GPU products, Tesla, for GPU computing.



The Tesla Lineup


At the moment, NVIDIA Tesla is primarily focused on the highest of the high-end, namely the oil, gas, and computational finance industries. That's important to keep in mind, because the Tesla introduction has answered another question we had when we first looked at CUDA: would it be limited to professional cards, even though the consumer GeForce cards would be capable of using CUDA? The answer is a resounding no. CUDA will be available across all product lines, although eventually some features specific to GPU computing will only be available through the Tesla brand. Tesla, like Quadro, will instead be positioned as a total solution: workstations and software will be qualified to work with Tesla, with the same types of support as Quadro.

The basic unit of the current Tesla line, the Tesla C870, should be very familiar to anyone who's seen the GeForce 8800. It's essentially an 8800 GTX--a 575MHz core clock and 128 SPs at 1.35GHz--with 1.5GiB of GDDR3 RAM. Of course, it's not quite an 8800 GTX--there are no display outputs at all on the card, even though it has a new version of the NVIO chip.

NVIDIA states that the C870 has a peak performance of 518 GFlops. Careful readers might already realize the implications of this number: the missing MUL is apparently now available in CUDA, increasing theoretical peak performance by 50% over what was previously stated for the 8800 GTX in CUDA. However, the conditions necessary for the MUL in the SFU to be used are unknown, and we don't know whether the SFU MUL will ever be available in 3D applications. Still, the difference between a 50x and a 100x speedup matters far less than the difference between 1x and 50x, so we aren't too concerned about the missing MUL. The C870 has an MSRP of $1,299 and should be available in August.
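That 518 GFlops figure is easy to sanity-check from the specifications above, assuming each SP retires a MADD (two flops) plus the SFU MUL (one flop) per shader clock:

```python
# Back-of-the-envelope check of the published Tesla peak-performance figures.
SP_COUNT = 128           # stream processors in G80 / Tesla C870
SHADER_CLOCK_GHZ = 1.35  # SP clock
FLOPS_PER_CLOCK = 3      # MADD (2 flops) + the SFU MUL (1 flop)

c870_gflops = SP_COUNT * SHADER_CLOCK_GHZ * FLOPS_PER_CLOCK
madd_only_gflops = SP_COUNT * SHADER_CLOCK_GHZ * 2  # without the SFU MUL

print(f"C870 peak:        {c870_gflops:.1f} GFlops")       # 518.4 GFlops
print(f"MADD-only peak:   {madd_only_gflops:.1f} GFlops")  # 345.6 GFlops
print(f"D870 (2x C870):   {2 * c870_gflops / 1000:.3f} TFlops")
print(f"GPU Server (4x):  {4 * c870_gflops / 1000:.3f} TFlops")
```

The 518.4 GFlops result rounds to NVIDIA's quoted 518, and 345.6 GFlops is exactly the MADD-only figure previously cited for the 8800 GTX in CUDA, which is where the 50% increase comes from. NVIDIA's 1.036 and 2.072 TFlops figures for the larger products appear to multiply the rounded 518 GFlops number.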

The second product in the current Tesla line is the NVIDIA D870, dubbed the "Deskside Supercomputer." It's very similar to the Quadro Plex: two Tesla C870 cards in an external unit. Like the Quadro Plex, the D870 will connect to a host computer via an external PCIe 8x or 16x connection.

Because it's simply two C870 cards in a more convenient form factor, the peak theoretical performance of the D870 is 1.036 TFlops. Keep in mind that CUDA doesn't use any multi-chip interface like SLI; instead, one thread on the CPU controls one CUDA device. In the case of the D870, then, there are two CUDA devices, and two CPU threads will be used to control them. As a result, if the data set can be spread across the two devices, there's a linear increase in speed. With no SLI-style overhead--nothing beyond PCIe bandwidth--the D870 really will be about twice as fast as the C870. The D870 has an MSRP of $7,500 and, like the C870, is scheduled for availability in August.
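The host-side pattern CUDA requires here--one CPU thread per device, each working on its own slice of the data, with the host combining results at the end--can be sketched with ordinary threads. This is an illustration of the programming model only: plain Python worker functions stand in for CUDA devices, and no real CUDA API is used.

```python
import threading

def process_on_device(device_id, data_slice, results):
    """Stand-in for the work dispatched to one CUDA device.

    In a real CUDA program, this thread would bind itself to its device,
    copy data_slice over PCIe, launch kernels, and copy the result back.
    Here we just sum squares on the CPU to show the structure.
    """
    results[device_id] = sum(x * x for x in data_slice)

def run_on_devices(data, num_devices=2):
    # Split the data set evenly across devices -- the D870 exposes two.
    chunk = (len(data) + num_devices - 1) // num_devices
    results = [0] * num_devices
    threads = [
        threading.Thread(
            target=process_on_device,
            args=(i, data[i * chunk:(i + 1) * chunk], results),
        )
        for i in range(num_devices)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The devices never talk to each other; the host merges partial results.
    return sum(results)

data = list(range(1000))
print(run_on_devices(data, num_devices=2))  # same answer as one device
```

Because the two workers share nothing and only the final partial results cross back to the host, a problem that splits this way scales linearly with device count, which is exactly why the D870 ends up about twice as fast as a single C870.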

Finally, there's the utter beast of the family: the Tesla GPU Server. With availability targeted for Q4 and a price of $12,000, it packs four G80-based boards, making it twice as fast as the D870. The really terrifying thing here, however, is the form factor: it fits those four G80s into a single 1U rackmount chassis.


Again, the peak performance is double that of the D870: 2.072 TFlops. The GPU Server will consume about 550W of power on average (with a peak of 800W) and, like the D870, will be connected to a host machine via an external PCI Express 8x or 16x interface. A host machine will be able to control a single GPU Server. Still, for markets that can leverage this kind of computing power, the GPU Server has to be unbelievably attractive, as it would fit into the existing server ecosystem without any special considerations.

But, just what are those markets, and why are they able to use GPU computing to such effect? NVIDIA has been working closely with some developers since CUDA first became available, and we've seen some of the fruits of their labor.