Beyond3D - NVIDIA GT200 GPU and Architecture Analysis

NVIDIA GT200 GPU and Architecture Analysis - Page 4

Published on 16th Jun 2008, written by Rys for Consumer Graphics - Last updated: 16th Jun 2008

GT200: Sampling and the ROP

For data fetch and filtering, each cluster is connected to its own discrete sampler unit (NVIDIA calls the cluster plus its samplers a texture processing cluster, or TPC), with each one able to calculate 8 sample addresses and bilinearly filter 8 samples per clock. That's unchanged compared to G92, but it's worth pointing out that prior hardware could never reach the bilinear peak outside of (strangely enough) scalar FP32 textures. It's now obtainable (or at least much closer) thanks to, according to NVIDIA, tweaks to the thread scheduler and sampler I/O. We still heavily suspect, though, that one of the key reasons is additional shared INT16 hardware for what we imagine is actually a shared addressing/filtering unit. Either way, each sampler has a dedicated L1 cache, which is likely 16KiB, and all sampler units share a global L2 cache that we believe is 256KiB, double the size of that in G80. The sampler hardware runs at the chip base clock, whereas the shading units run at the chip hot clock, which is most easily thought of as being 2x the scheduler clock. Along with the memory clock, those clocks comprise the main domains in GT200, just like they did in G80.
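As a rough illustration of what that per-sampler rate adds up to across the chip, here's a minimal sketch of the peak bilinear arithmetic. The 8 samples per clock figure comes from the paragraph above; the TPC count and base clock passed in at the bottom are illustrative assumptions, not figures taken from this page.

```python
# Bilinear samples filtered per clock by each sampler unit (from the text).
SAMPLES_PER_CLOCK_PER_TPC = 8


def peak_bilinear_rate(tpc_count, base_clock_hz):
    """Theoretical peak bilinear samples per second across all sampler units.

    The samplers run at the chip base clock, so that is the clock used here.
    """
    return tpc_count * SAMPLES_PER_CLOCK_PER_TPC * base_clock_hz


# Assumed example inputs: 10 TPCs and a 600MHz base clock (both illustrative).
example_rate = peak_bilinear_rate(tpc_count=10, base_clock_hz=600_000_000)
print(f"{example_rate / 1e9:.1f} Gsamples/sec peak bilinear")
```

This is only the theoretical ceiling; as noted above, how close real workloads get to it depends on the scheduler and sampler I/O.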

The hardware is advertised as supporting D3D10.0, since its architecture is marginally incapable of supporting 10.1, by virtue of the ROP hardware. D3D10 compliance means the ability in hardware to recycle data from the GS stage of the computation model back through the chip for another pass. The output buffer for that is six times larger in GT200 than in G80, although NVIDIA don't disclose the exact size. Given that the GS stage is capable of data amplification (and de-amplification of course), the increased buffer size represents a significant change in what the architecture is capable of in a performance sense, if not a theoretical sense. The same per-thread output limits are present, but more GS threads can now be run at the same time.
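To see why a bigger output buffer means more GS threads in flight even with unchanged per-thread limits, here's a minimal sketch. It assumes the number of concurrent GS threads is bounded by how many worst-case per-thread outputs fit in the buffer, which is our simplification rather than anything NVIDIA has documented. The G80 buffer size is a made-up placeholder, since the real sizes aren't disclosed; the cap of 1024 32-bit values per GS invocation is the D3D10 limit.

```python
# D3D10 caps a geometry shader invocation at 1024 32-bit output values.
D3D10_GS_MAX_OUTPUT_SCALARS = 1024
BYTES_PER_SCALAR = 4


def gs_threads_in_flight(buffer_bytes, declared_output_scalars):
    """How many GS threads fit if each reserves its declared worst-case output.

    Assumption: the output buffer is the only limit on concurrency.
    """
    per_thread_bytes = declared_output_scalars * BYTES_PER_SCALAR
    return buffer_bytes // per_thread_bytes


# Placeholder only: the real G80 and GT200 buffer sizes are not public.
HYPOTHETICAL_G80_BUFFER_BYTES = 64 * 1024
# The one disclosed fact: GT200's buffer is six times larger than G80's.
GT200_BUFFER_BYTES = 6 * HYPOTHETICAL_G80_BUFFER_BYTES

for scalars in (128, 512, D3D10_GS_MAX_OUTPUT_SCALARS):
    g80 = gs_threads_in_flight(HYPOTHETICAL_G80_BUFFER_BYTES, scalars)
    gt200 = gs_threads_in_flight(GT200_BUFFER_BYTES, scalars)
    print(f"{scalars:4d} scalars/thread: G80 {g80:3d} threads, GT200 {gt200:3d} threads")
```

Whatever the real buffer size turns out to be, the ratio is the interesting part: under this model the heavier the declared amplification, the fewer threads fit, and GT200 fits roughly six times as many as G80 at any given output size.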

That covers the changes to on-chip memories that each cluster has access to. Quickly returning to the front of the chip, it appears that the hardware can still only set up a single triangle per clock, and the rasteriser is largely unchanged. Remember that in G80, the rasteriser worked on 32 pixel blocks, correlating to the pixel batch size. GT200 continues to work on the same size pixel blocks as it sends the screen down through the clusters as screen tiles for shading.

At the back of the chip, after computation via each TPC, the same basic ROP architecture as G80 is present. With the external memory bus 512 bits wide this time and each 64-bit memory channel serving a ROP partition, that means 8 ROP partitions, each partition housing a quartet of ROP units. 32 in total then. Each ROP is now capable of a full-speed INT8 or FP16 channel blend per cycle, whereas G80 needed two cycles to complete the same operations. This guarantees that blending isn't ROP limited, which could already be the case on G80 and would have become even more of a problem with a higher memory/core clock ratio. It might initially seem odd that FP16 is also supported at full speed despite certainly being bandwidth limited, but remember that full-speed FP16 also means that 32-bit floating point pixels, made up of three FP10 channels for colour and 2 bits for alpha, go faster for free, and that's not easy to do otherwise.
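The ROP arithmetic above can be sketched in a few lines. The 512-bit bus, 64-bit channels, four ROPs per partition and the one-versus-two cycle blend figures are all from the text; G80's 384-bit bus isn't stated on this page and is our assumption for the comparison.

```python
CHANNEL_WIDTH_BITS = 64   # one memory channel serves one ROP partition
ROPS_PER_PARTITION = 4    # a quartet of ROP units per partition


def rop_layout(bus_width_bits):
    """Return (partition count, total ROP count) for a given external bus width."""
    partitions = bus_width_bits // CHANNEL_WIDTH_BITS
    return partitions, partitions * ROPS_PER_PARTITION


def blends_per_clock(rop_count, cycles_per_blend):
    """Peak INT8/FP16 channel blends the whole ROP array completes per clock."""
    return rop_count / cycles_per_blend


gt200_partitions, gt200_rops = rop_layout(512)
# Assumption: G80 uses a 384-bit bus with the same channel and partition layout.
g80_partitions, g80_rops = rop_layout(384)

print("GT200:", gt200_partitions, "partitions,", gt200_rops, "ROPs,",
      blends_per_clock(gt200_rops, cycles_per_blend=1), "blends/clock")
print("G80:  ", g80_partitions, "partitions,", g80_rops, "ROPs,",
      blends_per_clock(g80_rops, cycles_per_blend=2), "blends/clock")
```

Under those assumptions the per-clock blend peak rises by more than the ROP count alone would suggest, because the per-ROP rate doubles at the same time.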

The ROP partitions talk to GDDR3 memory only in GT200. We mention that in passing since it affects how the architecture works due to burst length, where you need to be sure to match what the DRAM wants every time you feed it or ask for data in any given clock cycle, especially when sampling. GDDR4 support seems non-existent, and we're certain there's no GDDR5 support in the physical interface (PHY) either. The number of ROP partitions means that with suitably fast memory, GT200 easily joins that exclusive club of microprocessors with more than 100GB/sec to their external DRAM devices. No other class of processor in consumer computing enjoys that at the time of writing.
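To put a number on "suitably fast memory", here's a minimal sketch of the bandwidth arithmetic for a 512-bit bus. It takes GB to mean 10^9 bytes and counts GDDR3 as two transfers per pin per memory clock; the 2.0GT/sec data rate in the second half is an illustrative assumption rather than a quoted SKU specification.

```python
BUS_WIDTH_BITS = 512          # GT200's external memory bus width (from the text)
TRANSFERS_PER_CLOCK = 2       # GDDR3 is double data rate


def bandwidth_bytes_per_sec(bus_width_bits, transfers_per_sec):
    """Peak bandwidth: bytes moved per transfer times transfers per second."""
    return (bus_width_bits // 8) * transfers_per_sec


def min_transfers_per_sec(bus_width_bits, target_bytes_per_sec):
    """Per-pin data rate needed to reach a target peak bandwidth."""
    return target_bytes_per_sec / (bus_width_bits / 8)


# Data rate and memory clock needed to reach 100GB/sec on a 512-bit bus.
needed_rate = min_transfers_per_sec(BUS_WIDTH_BITS, 100e9)
needed_clock_mhz = needed_rate / TRANSFERS_PER_CLOCK / 1e6
print(f"100GB/sec needs {needed_rate / 1e9:.4f} GT/sec ({needed_clock_mhz:.2f}MHz memory clock)")

# Illustrative assumption: memory running at 2.0 GT/sec per pin.
example_bw = bandwidth_bytes_per_sec(BUS_WIDTH_BITS, 2_000_000_000)
print(f"At 2.0 GT/sec: {example_bw / 1e9:.0f}GB/sec peak")
```

The threshold works out to a memory clock below 800MHz, which is why the wide bus clears 100GB/sec so comfortably with GDDR3 alone.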

The ROP also improves on peak compression performance compared to both G80 and G92, allowing it to do more with the available memory bandwidth. Not that a 512-bit bus and fast graphics DRAMs leave GT200-based SKUs short of the stuff; more on that later.

That's largely it in terms of the chip's new or changed architectural traits in a basic sense. The questions posed now mostly become ones of scheduling changes, and how memory access differs when compared to prior implementations of the same basic architecture in the G8x and G9x family of GPUs.
