ROP, Display Pipe and PureVideo


Ah, the ROP hardware. In terms of base pixel quality it's able to perform 8x multisampling using rotated or jittered subsamples laid over a 4-bit subpixel grid, looping through the ROP taking 4 multisamples per cycle. It can multisample from all backbuffer formats too, NVIDIA providing full orthogonality, including sampling from pixels maintained in a non-linear colour space or in floating point surface formats. Thus the misunderstood holy grail of "HDR+AA" is achieved by the hardware with no real developer effort. Further, it can natively blend pixels in integer and floating point formats, including FP32, at rates that depend on the format and on the bandwidth available through the ROP (INT8 and FP16 at full speed, as measured, and FP32 at half speed). Each pair of ROPs shares a single blender (so 12 blends per cycle across the chip), as far as our empirical testing shows.
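As a back-of-the-envelope illustration of those blend figures, here's a minimal Python sketch. It assumes the 12-blender arrangement and half-rate FP32 described above, and uses the 575MHz reference GeForce 8800 GTX core clock; other SKUs and the real bandwidth-limited behaviour will differ.

```python
# Rough peak ROP blend throughput sketch for a full G80.
# Assumptions: 24 ROPs, one blender shared per ROP pair, FP32 at half rate,
# 575MHz reference core clock. Real rates are bandwidth-limited.

ROPS = 24
CORE_CLOCK_MHZ = 575
BLENDERS = ROPS // 2          # one blender per ROP pair -> 12

def peak_blends_per_second(fp32: bool = False) -> float:
    """Peak blends/sec; FP32 assumed to run the blender at half rate."""
    rate_per_clock = BLENDERS * (0.5 if fp32 else 1.0)
    return rate_per_clock * CORE_CLOCK_MHZ * 1e6

print(f"INT8/FP16: {peak_blends_per_second() / 1e9:.2f} Gblends/s")
print(f"FP32     : {peak_blends_per_second(fp32=True) / 1e9:.2f} Gblends/s")
```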

Sample rates in the ROP are hugely increased over previous-generation NVIDIA GPUs: empirically, a G80 ROP can sample Z 8 times per clock (4x more than G7x ever could), with that value scaling for every discrete subsample position, per pixel, bandwidth permitting of course. It's therefore no great stretch to conclude that the ROP offers 'free' 4xMSAA given enough bandwidth, and the advantages to certain rendering algorithms become very clear. The 24 ROPs in a full G80 are divided into partitions of 4 each, each partition connecting to a 64-bit memory channel out to the DRAM pool for intermediate or final storage.
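To make the arithmetic behind those numbers explicit, a short sketch follows. It simply multiplies out the figures quoted above (24 ROPs, 8 Z samples per ROP per clock, 6 partitions on 64-bit channels) at the assumed 575MHz reference clock; it's an illustration of the peak, not a measurement.

```python
# Z-rate and memory bus width arithmetic for a full G80, using the figures
# from the text: 24 ROPs, 8 Z samples/ROP/clock, partitions of 4 ROPs each
# on 64-bit channels. 575MHz is the reference 8800 GTX core clock (assumed).

ROPS = 24
Z_SAMPLES_PER_ROP_PER_CLOCK = 8
CORE_CLOCK_MHZ = 575
ROPS_PER_PARTITION = 4
CHANNEL_WIDTH_BITS = 64

partitions = ROPS // ROPS_PER_PARTITION                   # 6 partitions
bus_width_bits = partitions * CHANNEL_WIDTH_BITS           # 384-bit bus
peak_z_samples = ROPS * Z_SAMPLES_PER_ROP_PER_CLOCK * CORE_CLOCK_MHZ * 1e6

print(f"Memory bus width : {bus_width_bits}-bit")
print(f"Peak Z-only rate : {peak_z_samples / 1e9:.1f} Gsamples/s (bandwidth permitting)")
```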

Further to the traditional multisampling implementation, the ROP also assists in a newly exposed AA method called coverage sampling. Combined with multisampling, CSAA takes extra binary coverage samples within the pixel to determine whether triangles intersect it at those sample points. If they do, the sample locations are used to influence the sample resolve and thus the colour blended into the framebuffer (or so we understand at least; NVIDIA are somewhat vague). The coverage data consumes much less storage space per pixel than full multisamples, and the hardware falls back to the basic multisample mode should coverage sampling fail for the pixel (we presume when a threshold of samples show no intersections at all, or possibly if rendering is done in a way other than front-to-back). The ROPs share access to an L2 data cache that's likely 128KiB, with access likely optimised for reading data aligned on 8-byte boundaries (a full FP16 pixel).
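To give a feel for why coverage samples are so cheap to store, here's an illustrative comparison. The sizes and the 4-stored-plus-16-coverage layout for the 16x mode are our assumptions for an FP16 framebuffer with D24S8 depth/stencil, not NVIDIA-confirmed numbers, and the real implementation almost certainly differs in detail.

```python
# Illustrative per-pixel storage: a full multisample carries colour plus
# depth/stencil, while an extra coverage sample is essentially one bit of
# intersection information. All sizes below are assumptions.

COLOUR_BYTES_FP16 = 8      # 4 x FP16 components
DEPTH_STENCIL_BYTES = 4    # assumed D24S8

def multisample_bytes(samples: int) -> int:
    """Storage for full colour + depth/stencil multisamples."""
    return samples * (COLOUR_BYTES_FP16 + DEPTH_STENCIL_BYTES)

def csaa_bytes(stored_samples: int, coverage_samples: int) -> float:
    """Full samples plus roughly 1 bit per extra coverage sample."""
    return multisample_bytes(stored_samples) + coverage_samples / 8

print("8xQ MSAA :", multisample_bytes(8), "bytes/pixel")
print("16x CSAA :", csaa_bytes(4, 16), "bytes/pixel")  # assumed 4 stored + 16 coverage
```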

Stencil update rates only scale with ROP count compared to G71, NVIDIA not spending any area improving that facet of performance. We measure 96 stencil tests per clock with G80 and 64 with G71, proving that's the case. Depth tests needed for CSAA are 2 for 2x, 4 for 4x, 8 for 8x, 8 for 8xQ, 12 for 16x, and ~20 for 16xQ, but we'll discuss that in the IQ piece. Finally, as far as this somewhat quick look at the new ROP goes -- it's easier to talk about a ROP that sees little restriction on what it can process, after all -- it'll also perform the traditional per-pixel depth check for occlusion and forego the write to memory if it fails, saving memory bandwidth (but not saving on shading resources or ROP bandwidth any more than prior architectures have done). We leave D3D10 (and D3D10.1) AA considerations for a chip like G80 until we have that pesky D3D10 driver.
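A quick sanity check of the stencil claim, assuming an unchanged 4 stencil tests per ROP per clock on both chips: scaling by ROP count alone reproduces exactly what we measure.

```python
# Check that measured stencil rates scale purely with ROP count:
# G80 has 24 ROPs, G71 has 16, both assumed to do 4 stencil tests
# per ROP per clock.

STENCIL_TESTS_PER_ROP = 4   # assumption consistent with the measurements

for name, rops in (("G80", 24), ("G71", 16)):
    print(f"{name}: {rops * STENCIL_TESTS_PER_ROP} stencil tests/clock")
# G80: 96, G71: 64 -- matching the measurements, so per-ROP throughput is unchanged.
```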

Display Pipe and PureVideo

While G80 itself contains no display logic, it still has to present display data to NVIO, so the internal workings of the chip in a colour precision sense are definitely worth talking about. G80 is able to work at a 10-bit resolution per colour component from data input to data output, supporting 10bpc framebuffers without issue and allowing direct transmission and scanout from those via NVIO.
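For readers unfamiliar with what 10 bits per component implies for storage, a short sketch: three 10-bit colour channels plus 2 bits of alpha fit in a single 32-bit word, the familiar RGB10A2 arrangement. This is a generic illustration of one common packing, not NVIDIA scanout code, and the exact bit ordering is our assumption.

```python
# Generic 10-bit-per-component packing sketch (one common RGB10A2 layout,
# bit ordering assumed): 1024 levels per channel vs 256 at 8bpc.

def pack_rgb10a2(r: int, g: int, b: int, a: int) -> int:
    """Pack 10-bit R/G/B and 2-bit A into a 32-bit word."""
    assert all(0 <= c <= 1023 for c in (r, g, b)) and 0 <= a <= 3
    return r | (g << 10) | (b << 20) | (a << 30)

print(hex(pack_rgb10a2(1023, 512, 0, 3)))
```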

As far as video processing goes, that's where you'll find some of the only carried-over functional blocks from previous GPUs in G80. While nearly 100% of the transistors for 3D processing are new according to NVIDIA (and we're inclined to believe them), what NVIDIA call PureVideo -- their on-GPU logic primarily for the decode of motion video -- is (mostly) carried over from G7x, or so we're told. As fixed-function logic operating in the chip's lower clock domain, at frequencies in the same range as older NVIDIA GPUs, carrying it over makes some sense. As with G7x, the shading core augments what largely amounts to post-decode processing in G80, improving video quality (perhaps using just one cluster to provide the baseline, although there's no honest reason for NVIDIA to hardcode it that way, especially with high-resolution video support).

Before we wrap up for the time being, we've got room to talk about some theoretical rate testing we did, G80's application to GPGPU, and image quality.