NVIDIA Fermi: new GPU architecture, starting with GF100

Wednesday 30th September 2009, 11:43:00 PM, written by Rys

At their GPU Technology Conference earlier this evening, NVIDIA announced their next-generation graphics architecture, codenamed Fermi.  Graphics doesn't seem to be the primary focus of the first implementation of Fermi, though, with GF100 going for everyone else's jugular in the general purpose GPU compute industry.

With a brand new shader core, Fermi's compute clusters each comprise a single shader multiprocessor (SM) this time.  Each SM is capable of issuing two independent instructions per clock to two different warps, across two clocks, with each instruction run by a 16-way SIMD block capable of single precision FMAs at full rate, and doubles at half rate.
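As a rough sketch of the per-SM arithmetic implied above: two 16-way SIMD blocks fed in parallel give 32 single precision FMAs per clock, and half that for doubles. The 32-thread warp width is an assumption taken from CUDA rather than from the announcement, and the function names are purely illustrative. It is one reading of "across two clocks", not a statement of how the hardware schedules.

```python
# Assumption: a warp is 32 threads wide (the CUDA convention);
# the announcement itself does not state the warp width.
WARP_SIZE = 32


def sm_fma_per_clock(simd_width=16, issue_slots=2, dp_rate=0.5):
    """Return (SP, DP) FMAs per clock for one SM.

    simd_width and issue_slots come from the text (16-way SIMD, two
    instructions issued per clock); dp_rate is "doubles at half rate".
    """
    sp = simd_width * issue_slots
    dp = int(sp * dp_rate)
    return sp, dp


def clocks_per_warp_instruction(simd_width=16, warp_size=WARP_SIZE):
    """Clocks one warp-wide instruction occupies a single SIMD block."""
    return warp_size // simd_width


if __name__ == "__main__":
    sp, dp = sm_fma_per_clock()
    print(f"SP FMA/clock per SM: {sp}, DP FMA/clock per SM: {dp}")
    print(f"Clocks per warp instruction: {clocks_per_warp_instruction()}")
```

Multiplying these per-SM figures by an SM count and a shader clock would give chip-level peak rates, but neither number is given here, so the sketch stops at the SM.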

The memory hierarchy is new: a coherent, unified L2 cache serves all SMs with no partitions, and a new unified memory space allows each SM to talk to not just its own local registers and shared memory, but L2 and beyond, all the way out into system memory (up to 1 TiB, backed by a hardware TLB).

Various other compute-friendly facets of performance are improved versus their last Tesla architecture chips, with GT200/T10 the pinnacle at the moment.  Atomic instruction throughput is up, everything is backed by ECC, and the hardware can sustain peak SP and DP FMA instruction throughput.

We've got a short look at GF100 in the forums, including speculation on some of the graphics features, pending a proper look at things.  Our friends at The Tech Report and Real World Tech also have pieces up, by virtue of early briefings.


Latest Thread Comments (4019 total)
Posted by trinibwoy on Monday, 02-Aug-10 03:08:24 UTC
Did Nvidia beef up GF104's texture units? Was just browsing Damien's English review (http://www.behardware.com/articles/795-4/report-nvidia-geforce-gtx-460.html) and it seems FP16 and RGB9E5 are now full speed as opposed to half speed on GF100.

Image: http://img153.imageshack.us/img153/3681/texturing.png

Posted by Alexko on Monday, 02-Aug-10 08:39:45 UTC
Damien's reviews usually deserve a bit more than a quick browsing… :p

Quote
Moreover, the texturing units have been improved to filter FP16 textures (as well as FP11, FP10 and RGB9E5) at full speed.
http://www.behardware.com/articles/795-2/report-nvidia-geforce-gtx-460.html

Posted by trinibwoy on Tuesday, 03-Aug-10 04:05:24 UTC
Of course, thanks. Saw it on my second read through :) Wonder why they bothered.

Posted by Chalnoth on Tuesday, 03-Aug-10 06:23:29 UTC
Quoting trinibwoy
Of course, thanks. Saw it on my second read through :) Wonder why they bothered.
My first guess would be that it was something that was intended for the GF100 all along, but there was a bug in the hardware implementation that forced them to implement these modes with reduced performance.

As for why they would have wanted to go this route in the first place, well, that would make sense if they feel that these modes will become more and more common as time goes forward, and if the added hardware cost was minimal.

Posted by mczak on Tuesday, 03-Aug-10 12:54:13 UTC
Maybe the full-speed fp16 was just a later addition which didn't make it for GF100.
That said, it would imho make more sense for GF100 than GF104, since GF100 has lower tex:alu ratio (and also higher memory bandwidth / tex). Unless you think it doesn't matter for GF100 since it looks more useful for non-gaming usages anyway..

Posted by ShaidarHaran on Tuesday, 03-Aug-10 14:00:00 UTC
Interesting that the fp formats have seen performance increases from GF100->GF104, but the int formats have seen performance decreases. Also, there appears to be a hard cap @ 33.3 GTexels/s for 3 of the formats. Any thoughts as to what might be causing this? Is it a lack of cache or cache bandwidth? Some other architectural limitation? I don't think it's a lack of VRAM or VRAM bandwidth since GF104 out-performs GT200b in 2 of the 3 formats.

Posted by TKK on Tuesday, 03-Aug-10 15:38:10 UTC
Quoting ShaidarHaran
I don't think it's a lack of VRAM or VRAM bandwidth since GF104 out-performs GT200b in 2 of the 3 formats.
Also, if that were the cause there should be a difference between the two GTX 460 variants, and there isn't.

Posted by Gipsel on Tuesday, 03-Aug-10 16:57:42 UTC
Quoting ShaidarHaran
Also, there appears to be a hard cap @ 33.3 GTexels/s for 3 of the formats. Any thoughts as to what might be causing this? Is it a lack of cache or cache bandwidth? Some other architectural limitation? I don't think it's a lack of VRAM or VRAM bandwidth since GF104 out-performs GT200b in 2 of the 3 formats.
It's the theoretical max throughput of the 56 TMUs * 0.675 GHz = 37.8 GTexel/s. Obviously the efficiency (88%) is slightly lower than on AMD GPUs (~98% or so) for these simple tasks.

Posted by mczak on Tuesday, 03-Aug-10 18:34:21 UTC
Quoting Gipsel
It's the theoretical max throughput of the 56 TMUs * 0.675 GHz = 37.8 GTexel/s. Obviously the efficiency (88%) is slightly lower than on AMD GPUs (~98% or so) for these simple tasks.
I think the more interesting comparison is GTX470/480 - 60 TMUs *0.7 GHz = 42 GTexels/s and it is achieving 41.4 GTexels/s (for int8 only though) - 99%. So for some odd reason GF104 can achieve less of the peak potential of the tmus.

Posted by CarstenS on Tuesday, 03-Aug-10 19:20:50 UTC
I'm showing (almost) the same here. 33.8 GTex/s is the maximum I can get out of a stock GF104 with bilinear filtering. With trilinear it's a more expected 18.9 GTex/s. Together with the point sampling result of - again - 33.8 GTex/s, I'm guessing it's maybe interpolation or address bound.

An HD5830 is literally miles away at 43.6 and 22.4 GTex/s.
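
For reference, the texel-rate arithmetic used in the posts above (peak = TMUs * core clock, efficiency = measured / peak) can be sketched as follows. The TMU counts, clocks and measured rates are the ones quoted in the thread. The function names are illustrative only, and nothing here explains why GF104 falls short of its peak.

```python
def peak_gtexels(tmus, clock_ghz):
    """Theoretical bilinear texel rate in GTexels/s: one texel per TMU per clock."""
    return tmus * clock_ghz


def efficiency(measured_gtexels, tmus, clock_ghz):
    """Fraction of the theoretical peak that a measured rate achieves."""
    return measured_gtexels / peak_gtexels(tmus, clock_ghz)


if __name__ == "__main__":
    # (label, TMUs, clock in GHz, measured GTexels/s), all as quoted in the thread
    cases = [
        ("GF104, review figure", 56, 0.675, 33.3),
        ("GF104, CarstenS bilinear", 56, 0.675, 33.8),
        ("GF100, int8", 60, 0.700, 41.4),
    ]
    for label, tmus, clock, measured in cases:
        peak = peak_gtexels(tmus, clock)
        eff = efficiency(measured, tmus, clock)
        print(f"{label}: peak {peak:.1f} GTexels/s, efficiency {eff:.0%}")
```

With the thread's numbers this reproduces the roughly 88% figure for GF104 and roughly 99% for GF100.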

