New Video Technology in G84

G84 introduces a set of brand new logic blocks to NVIDIA's GPUs for video decode and post-processing. NVIDIA claim that the new VP2 (Video Processor 2; VP1 is present in GeForce 7 GPUs and G80), BSP (Bitstream Processor) and AES128 blocks will together accelerate the entire set of H.264 AVC decode stages. We aren't sure whether the Bitstream Processor can handle VC-1's high-order entropy coding, though, as none of the docs we've seen really focus on VC-1 acceleration. Even if it can't, it should be noted that VC-1's entropy decode is much less computationally expensive to do on the CPU than AVC's CABAC.

Since G84 doesn't make use of NVIO for display logic, that's rolled into the GPU this time around. NVIO's 49mm² is around 75% blank, so the remaining ~12mm² (roughly a quarter of 49mm²) of display and data I/O logic is absorbed by the GPU.

VP2

VP2 is responsible for handling the iDCT, motion compensation (which includes frame lookaround and the buffering for that) and the deblocking filter for the supported video CODECs the processor will decode, including VC-1 and H.264. The deblocking filter is specified and mandatory in H.264, so VP2 implements it to that spec, given NVIDIA's design goals. Performance and image quality are said to be improved over the same functionality in VP1.

We'll have more on some of the exact features of the VP2 block in a future piece. For now we'll leave you with NVIDIA's statement that VP2 has enough throughput to decode H.264 at the full 40Mbit/sec that Blu-Ray requires.
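To put that bitrate in context, here is a back-of-the-envelope sketch of the bit budget the decoder has to get through. It reads the quoted figure as 40 megabits per second, and the 1920x1080 resolution and 24 frames per second are our assumptions for a typical film title, not numbers from NVIDIA. The function name `bit_budget` is purely illustrative.

```python
def bit_budget(bitrate_bits_per_s, width, height, fps):
    """Rough per-frame and per-macroblock bit budget for a video stream.

    H.264 codes pictures in 16x16 macroblocks, so frame dimensions are
    rounded up to the next multiple of 16 before counting.
    """
    mbs_wide = (width + 15) // 16
    mbs_high = (height + 15) // 16
    mbs_per_frame = mbs_wide * mbs_high
    bits_per_frame = bitrate_bits_per_s / fps
    return {
        "macroblocks_per_frame": mbs_per_frame,
        "macroblocks_per_second": mbs_per_frame * fps,
        "bits_per_frame": bits_per_frame,
        "bits_per_macroblock": bits_per_frame / mbs_per_frame,
    }


# Assumed workload: 1080p film content at 24 frames per second.
budget = bit_budget(40_000_000, 1920, 1080, 24)
for name, value in budget.items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions each frame holds 8,160 macroblocks and, at peak rate, averages a little over 200 bits of compressed data per macroblock, all of which has to pass through entropy decode before VP2 can do its work.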

BSP

The BSP, or Bitstream Processor, is responsible for the entropy decode stage of G84's H.264 decode ability. H.264 makes use of two entropy coding schemes: CABAC (which further makes use of exp-Golomb codes) and the less processor-intensive CAVLC. Both operate on the variable-length symbols used to encode the video data, and they provide a good chunk of H.264's compression prowess.
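For readers who haven't met exp-Golomb codes before, the following is a minimal sketch of decoding the order-0 unsigned and signed variants. It works on a string of '0' and '1' characters purely for readability, which is nothing like how a hardware block such as the BSP would do it, and the function names `decode_ue` and `decode_se` are our own.

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned order-0 exp-Golomb code.

    bits is a string of '0'/'1' characters. Returns (value, next_pos).
    A code is N leading zeros, a '1' marker, then N suffix bits.
    """
    zeros = bits.index("1", pos) - pos   # raises ValueError if no marker bit
    start = pos + zeros + 1
    suffix = bits[start:start + zeros]
    if len(suffix) != zeros:
        raise ValueError("truncated exp-Golomb code")
    value = (1 << zeros) - 1 + (int(suffix, 2) if zeros else 0)
    return value, start + zeros


def decode_se(bits, pos=0):
    """Decode one signed exp-Golomb code (values map 0, +1, -1, +2, -2, ...)."""
    k, pos = decode_ue(bits, pos)
    value = (k + 1) // 2 if k % 2 else -(k // 2)
    return value, pos


# Four codes back to back: '1', '010', '011', '00100'.
stream = "101001100100"
pos, values = 0, []
while pos < len(stream):
    value, pos = decode_ue(stream, pos)
    values.append(value)
print(values)  # [0, 1, 2, 3]
```

Note that the length of each code is only known once its leading zeros have been counted, so the start of the next symbol can't be found without decoding the current one. That's a mild version of the serial dependency that makes entropy decode awkward to parallelise.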

CABAC decode (very simply put) requires looking at neighbouring pixels' luminance and motion data (in 4x4 pixel blocks in the reference decoder). That means lots of iteration over per-pixel data and repeated macroblock processing, with heavy branching based on state, which makes it an inherently serial task and one not suited to parallel processing on a GPU's shader core.
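The serial nature is easier to see in code. What follows is a toy adaptive binary arithmetic coder, emphatically not CABAC itself: it uses exact fractions and a single probability model that adapts by counting, where real CABAC uses integer ranges, table-driven state machines and many contexts. All names are hypothetical. The point is only that each decoded bit changes both the interval and the model, so bit n+1 can't be decoded until bit n is known.

```python
from fractions import Fraction


def _p_zero(counts):
    """Current model estimate of the probability that the next bit is 0."""
    return Fraction(counts[0], counts[0] + counts[1])


def toy_encode(bits):
    """Arithmetic-code a list of 0/1 bits into a single exact fraction."""
    low, width = Fraction(0), Fraction(1)
    counts = [1, 1]                      # adaptive model state
    for b in bits:
        split = width * _p_zero(counts)
        if b == 0:
            width = split                # keep the lower sub-interval
        else:
            low += split                 # keep the upper sub-interval
            width -= split
        counts[b] += 1                   # model adapts after every bit
    return low + width / 2               # any point inside the final interval


def toy_decode(code, n):
    """Recover n bits from the fraction produced by toy_encode."""
    low, width = Fraction(0), Fraction(1)
    counts = [1, 1]
    out = []
    for _ in range(n):
        # This iteration needs low, width and counts exactly as the
        # previous iteration left them: a loop-carried dependency.
        split = width * _p_zero(counts)
        if code < low + split:
            b = 0
            width = split
        else:
            b = 1
            low += split
            width -= split
        counts[b] += 1
        out.append(b)
    return out


message = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
assert toy_decode(toy_encode(message), len(message)) == message
```

Every pass through the decode loop branches on a comparison whose operands were produced by the pass before it, which is exactly the kind of work that a wide array of shader ALUs does badly and a small dedicated block does well.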

It's unclear what the bandwidth through the BSP is, and how it's limited by the rest of the GPU, including the interconnects to other parts of the chip and out to video memory. It's also unclear whether the BSP contains its own small local store or whether it shares a cache with another GPU function, but we'd expect it to have its own dedicated cache logic, given the probable goals of the implementation.

In terms of what feeds what, the BSP sits before VP2 in the processing chain for video decode.

AES128

This decryption block is there to assist with AACS decryption and to provide part of the protected video path in Microsoft Windows Vista. There's no word from NVIDIA on the decryption rate the block can sustain, but presumably it's enough for full-spec support of AACS as implemented by Blu-Ray and HD-DVD.
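As a rough lower bound on what "enough" means, the sketch below converts a stream bitrate into AES blocks per second. The only hard fact used is that AES works on 128-bit blocks; the 40 megabits per second input is the video bitrate quoted earlier in this article, and a full disc stream with audio and other data multiplexed in would be higher. The function name is hypothetical.

```python
AES_BLOCK_BITS = 128  # AES always operates on 128-bit blocks


def aes_blocks_per_second(stream_bits_per_s):
    """Number of 128-bit AES blocks that must be processed each second."""
    return stream_bits_per_s / AES_BLOCK_BITS


video_only = aes_blocks_per_second(40_000_000)  # video bitrate only (assumption)
print(f"{video_only:,.0f} AES blocks per second")
```

That works out to 312,500 blocks per second for the video portion alone, which is a modest rate for fixed-function hardware.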

Post-processing

Video image quality enhancements in the form of post-processing are performed by the GPU's shader core (and, most likely, a programmable DSP-like unit similar to the one in VP1), so that should be fully programmable and adjustable by the driver over time. NVIDIA don't currently state how the shader core in G84 is used to improve video quality, other than to say that a G84 scores nearly full marks (128/130) in the popular HQV standard definition video benchmark.

Let's look at decode performance then.