NVIDIA Tesla: GPU computing gets its own brand

Wednesday 20th June 2007, 08:30:00 PM, written by Rys

Being able to witness the earnest birth of a new computing industry is a special thing. In the last few years, efforts to use programmable commodity graphics hardware for things other than graphics have gained pace, and now there's a legit multi-million dollar industry surrounding it.

Today sees NVIDIA further legitimise what they call GPU computing, as they introduce Tesla, their third brand based around their discrete PC GPUs. Built on the famous G80 graphics processor, the first round of Tesla products joins GeForce and Quadro in the product ranks. CUDA is the conduit, of course, and we have an article that lets you know what Tesla is all about.

We've also got interviews with Andy Keane, General Manager of the GPU Computing Group at NVIDIA, and Dave Kirk, Chief Scientist, regarding Tesla, CUDA, the future of GPU computing, and better incorporating parallel programming into education, among myriad other things.

Following CUDA's 1.0 release in a week or so, we'll cover that and interview Ian Buck, Software Manager for CUDA, to talk more about the software side.


Tagging

nvidia: tesla, gpu computing, gpgpu, cuda, kirk, keane


Latest Thread Comments (24 total)
Posted by silent_guy on Friday, 22-Jun-07 05:41:18 UTC
Quoting Arun
Depending on how it is implemented, if part of the extra FP64 cost was neatly divided as another logic block (I'm really not sure how that'd work but heh) then they could just make it redundant, so that even if it increased die size by 5%, it wouldn't affect yields but only the number of chips on the wafer.

As I said, I completely fail to see how that kind of division could be implemented, but perhaps an EE would have a better idea of whether it is possible. Another thing to take into consideration is that only G92 will be sold as a GPGPU part. G94 and G98 will not. So this could just mean they're not including FP64 on the die of G94 and G98.
I don't think 'redundant' is the right word here. 'Ignored' may be a better fit.

In theory, you could create test vectors that don't cover the lower bits of the multipliers, adders etc., and a strap that ties the outputs of the LSBs to zero. So the unnecessary logic wouldn't lower yield.

In practice, it may not be worth the effort.

Also, stuck-at DFT coverage is typically in the upper 99%, say 99.6%. That's high, but it still leaves millions of connections uncovered. However, one counts on the fact that, statistically, a fault will also impact surrounding locations and that this increases effective coverage.
For an ALU, the unnecessary logic is probably placed very close to necessary logic. Masking out unused logic may indirectly also mask real problems in useful logic.
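To put a rough number on 'millions of connections': taking G80's roughly 681 million transistors as a crude stand-in for the stuck-at fault-site count (an assumption for illustration only, not a figure from this thread), 99.6% coverage still leaves about (1 - 0.996) × 6.8 × 10^8 ≈ 2.7 million fault sites unexercised by the test vectors.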

Posted by _xxx_ on Friday, 22-Jun-07 07:31:15 UTC
I just hope there will be a Simulink library (autocode generation) for this, I'd be all over it.

Posted by Jawed on Friday, 22-Jun-07 12:33:01 UTC
Quoting Rufus
I'm thinking burning 5% on the ALUs (so like 2% area on the whole chip?) is cheaper than redoing a custom layout.
But why bother to have that overhead on lower-performance GPUs? As far as NVIDIA is concerned, single-GPU Tesla is a "low-performance", entry-level evaluation kit. The real McCoy is a rack of 'em. A single G86 is nothing more than a "rounding error" in comparison to that... Jawed

Posted by Voltron on Friday, 22-Jun-07 13:21:20 UTC
Great article and interview!!

When you guys asked whether NVIDIA would support CUDA desktop applications built using GeForce, were you envisioning anything specific?

Obviously NVIDIA is focusing on HPC right now, but could you sense, either from those interviews or just chatting, whether or not they were excited about the potential of CUDA on the desktop?

Offline video processing for movies was an interview topic, but was there any discussion of processing for networked video such as YouTube or Joost?

Posted by Megadrive1988 on Friday, 22-Jun-07 15:38:08 UTC
Quote
Because it's simply two C870 cards in a more convenient form factor, the peak theoretical performance of the D870 is 1.036 TFlops. *Keep in mind that CUDA doesn't use any multi-chip interface like SLI. Instead, one thread on the CPU controls one CUDA device. So, in the case of the D870, there are two CUDA devices, and two CPU threads will be used to control them. As a result, if the data set can be spread across the two devices, there's a linear increase in speed. There's not any overhead from SLI or anything other than PCIe bandwidth, so the D870 really will be about twice as fast as the C870.*
This is something that I would like to see in next-gen consoles, next-gen arcade platforms or next-gen consumer computers (PC or non-PC) -- 2 or 4 GPUs working together providing a linear increase in performance without the overhead of SLI or CrossFire.
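As a back-of-the-envelope check on the 1.036 TFlops figure quoted above: NVIDIA's peak-rate accounting for a G80-class board counts 128 stream processors at a 1.35 GHz shader clock issuing 3 flops per SP per clock (MAD plus MUL), so two C870-class devices in a D870 come to 2 × (128 × 1.35 GHz × 3) = 2 × 518.4 GFlops ≈ 1.036 TFlops. The per-device breakdown is our reconstruction of the usual peak-rate arithmetic, not something stated in the article.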

Posted by silent_guy on Friday, 22-Jun-07 15:43:26 UTC
Quoting Megadrive1988
This is something that I would like to see in next-gen consoles, next-gen arcade platforms or next-gen consumer computers (PC or non-PC) -- 2 or 4 GPUs working together providing a linear increase in performance without the overhead of SLI or CrossFire.
I'm sure everybody wants to see this. The problem is in the 'working together' part. :wink:

Parallel Tesla machines are not linked to each other with a physical link. As soon as you have dependencies and need to exchange data between them, performance increases won't be linear anymore.

Posted by Tim Murray on Friday, 22-Jun-07 16:23:24 UTC
Quoting silent_guy
I'm sure everybody wants to see this. The problem is in the 'working together' part. :wink: Parallel Tesla machines are not linked to each other with a physical link. As soon as you have dependencies and need to exchange data between them, performance increases won't be linear anymore.
To elaborate on this: right now, with CUDA, you create a device context that is specific to a single chip. There's no way to scale this context (at the moment) using SLI or anything, so once you have multiple chips, you need to create multiple contexts. However, I think you can only have one context per device per process (with one context per CPU thread as well). So, you can run multiple CUDA apps on the same chip (because they are different processes), but you can't have multiple CUDA contexts from the same app on the same chip. You can have multiple CUDA contexts in the same app, though, if each context has its own chip. Alternatively, you could have some number of apps and have each one use a different CUDA device.

I think this is part of the reason Tesla's being pushed at the high end; they have no problem targeting multiple code paths and being certified on different hardware.
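To make the one-context-per-device, one-thread-per-device pattern Tim describes concrete, here's a minimal host-side sketch in CUDA C. It assumes the runtime API plus POSIX threads; the kernel, array sizes, device count and absent error handling are illustrative assumptions of ours, not anything taken from the article or the thread.

```cuda
// Hypothetical sketch (not from the article): one host thread per CUDA device,
// each thread owning its own context and processing its own slice of the data.
// Kernel, sizes and missing error checks are illustrative only.
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N_PER_DEVICE (1 << 20)   // elements handled by each device (arbitrary)

// Trivial kernel: scale each element of the slice in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

struct worker_args {
    int    device;      // CUDA device this host thread drives
    float *host_chunk;  // this device's slice of the host data
};

static void *worker(void *p)
{
    struct worker_args *a = (struct worker_args *)p;
    size_t bytes = N_PER_DEVICE * sizeof(float);
    float *d_data;

    // Binding this host thread to a device gives it its own context,
    // matching the one-context-per-device-per-process rule described above.
    cudaSetDevice(a->device);

    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, a->host_chunk, bytes, cudaMemcpyHostToDevice);

    scale<<<(N_PER_DEVICE + 255) / 256, 256>>>(d_data, N_PER_DEVICE, 2.0f);

    cudaMemcpy(a->host_chunk, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    int devices = 0;
    cudaGetDeviceCount(&devices);             // e.g. 2 on a D870-style setup
    if (devices < 1) { fprintf(stderr, "no CUDA devices\n"); return 1; }

    float *host_data = (float *)malloc((size_t)devices * N_PER_DEVICE * sizeof(float));
    for (size_t i = 0; i < (size_t)devices * N_PER_DEVICE; ++i)
        host_data[i] = 1.0f;

    pthread_t          *threads = (pthread_t *)malloc(devices * sizeof(pthread_t));
    struct worker_args *args    = (struct worker_args *)malloc(devices * sizeof(*args));

    // One host thread per device: if the data set splits cleanly like this,
    // throughput should scale roughly linearly with the number of devices.
    for (int d = 0; d < devices; ++d) {
        args[d].device     = d;
        args[d].host_chunk = host_data + (size_t)d * N_PER_DEVICE;
        pthread_create(&threads[d], NULL, worker, &args[d]);
    }
    for (int d = 0; d < devices; ++d)
        pthread_join(threads[d], NULL);

    printf("first element after scaling: %f\n", host_data[0]);
    free(args); free(threads); free(host_data);
    return 0;
}
```

Built with nvcc and linked against pthreads, each host thread ends up driving its own GPU through its own context, which is the multiple-contexts-in-one-app arrangement described above; if the data really does split cleanly per device, that is where the roughly linear scaling comes from.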

Posted by Voxilla on Saturday, 23-Jun-07 19:00:00 UTC
Does anybody know if these Tesla boards support DirectX or OpenGL besides CUDA?
This would be useful for server-based rendering.

Posted by Tim Murray on Saturday, 23-Jun-07 20:26:06 UTC
Quoting Voxilla
Does anybody know if these Tesla boards support DirectX or OpenGL besides CUDA?
This would be useful for server-based rendering.
They do. (I think I say that in the article. :p ) No SLI, though.

Posted by Osamar on Monday, 25-Jun-07 09:36:50 UTC
This is probably a silly question, but could the 'graphics surprise' comment be hinting at a RenderMan renderer for Tesla?


