GT200: General Architecture Notes
We mentioned that the big remaining questions mostly concern scheduling changes, and how memory access differs compared to prior implementations of the same basic architecture in the G8x and G9x family of GPUs. On the scheduling front, the obvious thing to wonder is whether the 'missing' MUL is finally available for general shading (along with the revelation about its inclusion in G8x and G9x, which we might one day share).
We've been able to verify freer issue of the instruction in general shading, but nowhere near the theoretical peak when the chip is executing graphics codes. NVIDIA mention improvements to register allocation and scheduling as the reason behind the freer execution of the MUL, and we believe them. However, it looks likely that the unit can only retire a result every second clock in graphics mode because of operand fetch, effectively halving its throughput. In CUDA mode, operand fetch seems more flexible, with throughput nearer peak, although we've not spent enough time with the hardware yet to be perfectly sure. Regardless, at this point it seems impossible to extract the peak figure of 933 Gflops FP32 with our in-house graphics codes. How much this matters depends on whether you can use the MUL implicitly through attribute interpolation the rest of the time, which we aren't sure about just yet either.
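For those who like to see where the headline number comes from, here's the arithmetic as a quick snippet. The 240 SPs and 1296MHz shader clock are GTX 280 figures we're supplying ourselves for the sake of the sum, and the halved-MUL case is just our working theory from above, not anything NVIDIA have confirmed.

    /* Back-of-the-envelope peak arithmetic for GT200's FP32 throughput.
       The 240 SPs and 1296MHz hot clock are GTX 280 figures we're
       assuming here, not numbers quoted in this section. */
    #include <stdio.h>

    int main(void)
    {
        const double sps        = 240.0;     /* streaming processors           */
        const double hot_clock  = 1.296e9;   /* shader clock in Hz             */
        const double mad_flops  = 2.0;       /* multiply-add, 2 flops/SP/clock */
        const double mul_flops  = 1.0;       /* the 'missing' MUL              */

        /* Theoretical peak: MAD and MUL dual-issued every clock. */
        printf("peak: %.0f Gflops\n",
               sps * hot_clock * (mad_flops + mul_flops) / 1e9);        /* ~933 */

        /* If the MUL only retires a result every second clock in graphics
           mode, as we suspect, the effective ceiling drops accordingly. */
        printf("graphics-mode estimate: %.0f Gflops\n",
               sps * hot_clock * (mad_flops + 0.5 * mul_flops) / 1e9);  /* ~778 */
        return 0;
    }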
After that it's probably best to worry about GS performance in D3D10 graphics applications, which we'll do when it comes time to benchmark the hardware. The increase in output buffer size is one of the bigger architectural differences, maybe even more so than the addition of the extra SM per cluster. Adoption of the GS stage in the D3D10 pipe has undoubtedly been held back a little by the typical NVIDIA tactic of building just enough in silicon to make a feature work, but too little to make it immediately useful.
The increase in register file, a doubling of the per-SM registers available compared to G8x and G9x chips, means there's less pressure on the chip to reduce the number of in-flight threads, letting latency hiding for the sampler hardware (it's the same 200+ cycle latency to DRAM as with G80, from the core clock's point of view) become more effective than it ever has been with this architecture. Performance becomes freer and easier, in other words, with the schedulers more able to keep the cluster busy under heavy shading loads. Developers now need to worry less about their utilisation of the chip, not that we suspect many were really worrying with G80 and G92. The other G8x and G9x parts have different performance traits for a developer to consider there, given how NVIDIA (annoyingly so in the low-end, from a developer's perspective) scaled them down from the grandfather parts.
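A quick sketch of the effect, ignoring allocation granularity: take the per-SM register files as 8192 entries on G8x/G9x versus 16384 on GT200, with per-SM thread caps of 768 and 1024 respectively (all our own working figures, not restated by NVIDIA here), and a made-up shader using 16 registers per thread. The doubled file lets the SM hold twice as many threads with which to hide that DRAM latency.

    /* Rough sketch of why the doubled register file helps latency hiding:
       more threads can be resident per SM before registers run out. The
       register-file sizes, thread caps and 16-registers-per-thread kernel
       are our assumptions for illustration, not measured figures. */
    #include <stdio.h>

    static unsigned threads_in_flight(unsigned regfile_size,
                                      unsigned regs_per_thread,
                                      unsigned hw_thread_cap)
    {
        unsigned limited_by_registers = regfile_size / regs_per_thread;
        return limited_by_registers < hw_thread_cap ? limited_by_registers
                                                    : hw_thread_cap;
    }

    int main(void)
    {
        const unsigned regs_per_thread = 16;  /* hypothetical shader */
        printf("G80/G92 SM: %u threads in flight\n",
               threads_in_flight(8192, regs_per_thread, 768));    /* 512  */
        printf("GT200 SM:   %u threads in flight\n",
               threads_in_flight(16384, regs_per_thread, 1024));  /* 1024 */
        return 0;
    }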
That per-SM shared memory didn't increase is interesting too. The way the CUDA programming model works means that a static shared memory size across generations is attractive for the application developer, who doesn't have to retune their codes to make the best use of GT200. You could argue that CUDA codes will have to be rewritten for GT200 anyway if the developer wants to make serious use of FP64 support... ah, but that's comparatively slow in GT200, and heck, 16KiB for every SM is a fair aggregate chunk of SRAM when multiplied out across the whole chip. 1.4B transistors sounds like room to breathe, but we doubt NVIDIA see it as an excuse to be blasé about on-chip SRAM pools, even if they are inherently redundant parts of the chip which will help yields of the beast.
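To put the stability point in code form, here's a minimal CUDA sketch, with kernel and buffer names of our own invention: a kernel that stages data through a statically sized shared memory tile, sized against the 16KiB budget, works identically on G80 and GT200 without retuning. (And multiplied out, 16KiB across GT200's 30 SMs is 480KiB of SRAM for shared memory alone.)

    // Minimal CUDA sketch: a kernel tuned to G80's 16KiB per-SM shared
    // memory budget carries over to GT200 untouched, because the budget
    // hasn't moved. Names are ours, for illustration only.
    #include <cuda_runtime.h>

    __global__ void scale_through_smem(const float *in, float *out, float k, int n)
    {
        // 3584 floats = 14KiB, comfortably inside the 16KiB per-SM budget
        // on both generations; assumes blockDim.x <= 3584.
        __shared__ float tile[3584];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage through shared memory
        __syncthreads();
        if (i < n)
            out[i] = tile[threadIdx.x] * k;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *in = 0, *out = 0;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));

        scale_through_smem<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(in);
        cudaFree(out);
        return 0;
    }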
Minor additional notes about the processing architecture include improvements to how the input assembler communicates with the DRAM devices through the memory crossbar, allowing more efficient indexing into memory when fetching primitive data, and a larger post-transform cache to help feed the rasteriser a bit better. Primitive setup rate is unchanged, which is a little disappointing given how much you can be limited there during certain common and intensive graphics operations. Assuming there's no catch, that unchanged setup rate is likely one of the big reasons why performance improvements over G80 are more impressive at ultra-high-end resolutions, where setup takes a smaller share of frame time (along with the improved bilinear filtering and ALU performance, which also become more important there).
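To make the resolution argument concrete, here's a rough sketch of the sums, assuming one primitive per core clock at GTX 280's 602MHz core clock; those figures are ours, not restated by NVIDIA in this context.

    /* Why an unchanged setup rate shows up more at lower resolutions: a
       quick estimate assuming one primitive per core clock at a 602MHz
       core clock (GTX 280-class figures we're supplying ourselves). */
    #include <stdio.h>

    int main(void)
    {
        const double setup_rate = 602e6;  /* triangles per second, 1 per clock */
        const double fps        = 60.0;

        /* Triangle budget per frame before setup alone becomes the wall --
           a figure that doesn't grow with resolution. */
        printf("setup-limited budget: ~%.1f Mtris per frame at %.0f fps\n",
               setup_rate / fps / 1e6, fps);

        /* Pixel work does grow with resolution, so at ultra-high-end modes
           the shader, filtering and bandwidth improvements dominate instead. */
        printf("pixels per frame: %.2fM at 2560x1600 vs %.2fM at 1280x1024\n",
               2560.0 * 1600.0 / 1e6, 1280.0 * 1024.0 / 1e6);
        return 0;
    }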