Beyond3D - NVIDIA G80: Architecture and GPU Analysis

NVIDIA G80: Architecture and GPU Analysis - Page 7

Published on 8th Nov 2006, written by Rys for Consumer Graphics - Last updated: 25th Apr 2007

Threading and Branching

We talked a bit earlier about how threaded the hardware is, so a little more detail about that and its branching performance is prudent. NVIDIA gives the technology the marketing banner of GigaThread: threads can jump onto and off of their cluster for free, with a new one ready for processing every cycle if need be, and branching is 'free' via dedicated branch units (relieving the SPs from performing the calculation). Assuming, say, 10 stages each for MADD, MUL and interpolation in the SP, let's call that 30 batches of 16 objects in flight per cluster on any given cycle. And just in case you didn't catch the D3D10 o'clock news, three thread types exist for D3D10 (vertex, geometry and pixel), and just two for D3D9 (vertex and pixel).

There's a global scheduler for managing the core, but also a local scheduler per cluster that manages the individual threads being executed there, with thousands maintained 'in flight' by the hardware at any one time. Given ALU pipelining and other considerations, we think 4K would be a good guess for the count of active SP threads.
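The 4K guess falls out of the earlier pipeline-depth assumption with some quick arithmetic. The sketch below just multiplies our guessed numbers together; the stage counts are the assumptions stated above, and the cluster count of 8 is the full G80 (GeForce 8800 GTX) configuration rather than anything stated on this page. None of it is a measured figure.

```python
# Back-of-envelope estimate of in-flight SP threads, built from the
# article's guesses. Every constant here is an assumption, not a spec.
STAGES_PER_UNIT = 10   # guessed pipeline depth for each unit
UNITS = 3              # MADD, MUL and interpolation
BATCH_SIZE = 16        # objects per batch
CLUSTERS = 8           # assumption: full G80 (8800 GTX) configuration


def in_flight_per_cluster(stages=STAGES_PER_UNIT, units=UNITS, batch=BATCH_SIZE):
    """One batch per pipeline stage, so stages * units batches of `batch` objects."""
    return stages * units * batch


def in_flight_total(clusters=CLUSTERS):
    """Scale the per-cluster estimate across the whole shader core."""
    return clusters * in_flight_per_cluster()


print(in_flight_per_cluster())  # 480 objects per cluster
print(in_flight_total())        # 3840, i.e. roughly 4K
```

Change the guessed stage count and the total moves in steps of 384, so '4K' is best read as an order-of-magnitude figure.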

That threaded execution core is paired with branching hardware that works at a granularity of either 16 (for vertex data) or 32 (for pixels) objects; we measured it at 32 for pixels in our tests, as mentioned earlier. Branches happen in one cycle for all thread types, and it means branching penalties are minimised, at up to 32 objects per clock across the entire shader core. Contrast that with prior Shader Model 3.0 NVIDIA hardware, with its minimum branching granularity of 880 pixels, and NVIDIA catch up with ATI in the branching performance stakes on a modern GPU (16 pixels for R520, 48 for R580).
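To see why granularity matters, consider a toy model: a batch whose objects disagree about a branch has to run both sides, so every object in that batch pays for the path it didn't want. The sketch below is our own simplification, not how any of these chips are actually implemented. It treats objects as a 1D run (real pixel batches are 2D tiles), and the function name and the 1760-object test pattern are made up for illustration.

```python
def objects_paying_both_paths(taken, granularity):
    """Count objects that sit in a batch where the branch outcome is mixed.

    taken: one bool per object, True if that object takes the branch.
    granularity: number of objects that must branch together.
    """
    paying = 0
    for start in range(0, len(taken), granularity):
        batch = taken[start:start + granularity]
        # A mixed batch executes both sides of the branch.
        if any(batch) and not all(batch):
            paying += len(batch)
    return paying


# Hypothetical workload: 1760 objects, with a single branch edge at object 100.
taken = [i < 100 for i in range(1760)]

# 16 = R520, 32 = G80 pixels, 48 = R580, 880 = prior NVIDIA SM3.0 hardware
for granularity in (16, 32, 48, 880):
    print(granularity, objects_paying_both_paths(taken, granularity))
# 16 16
# 32 32
# 48 48
# 880 880
```

With one branch edge, exactly one batch straddles it, so the cost is simply the batch size. More edges mean more mixed batches, and the coarse-grained hardware falls further behind.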

There's a shared register file per cluster, but no register file for the global scheduler (just FIFOs). Each file is likely big enough to hold state for several times the number of active threads, hiding sampler latency by allowing the cluster to grab a thread's data, process it and put it back 'for free' while the core executes sampler threads in parallel. One assumes again that if the register file gets full, the hardware just reduces the in-flight thread count until register pressure is relieved, with thread count creeping back up over a number of cycles. The heuristics for that should be fairly simple, we imagine.
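The simplest form of that heuristic is a plain capacity check: divide the register file by the registers each thread needs, and cap the result at whatever the scheduler can track. The sketch below is our speculation written out as code, nothing more. The 8192-entry register file is a made-up number for illustration, and the 480-thread ceiling reuses the per-cluster guess from earlier on this page.

```python
def resident_threads(register_file_entries, regs_per_thread, max_threads):
    """Threads a cluster could keep in flight before the register file fills.

    All three inputs are hypothetical; NVIDIA has not disclosed the real values.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    fit_in_file = register_file_entries // regs_per_thread
    return min(max_threads, fit_in_file)


REGISTER_FILE_ENTRIES = 8192  # hypothetical size, for illustration only
MAX_THREADS = 480             # per-cluster guess: 30 batches of 16 objects

print(resident_threads(REGISTER_FILE_ENTRIES, 10, MAX_THREADS))  # 480, scheduler-limited
print(resident_threads(REGISTER_FILE_ENTRIES, 32, MAX_THREADS))  # 256, register-limited
```

A shader that uses few registers leaves the thread count pinned at the scheduler's ceiling, while a register-hungry one pulls it down, which is the behaviour we'd expect to observe as lower latency tolerance on heavy shaders.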

We'll cover just what the hardware runs on its shading core on the next page.
