Threading and Branching

What we call the cluster dispatch processor controls execution across the clusters (AMD call them SIMDs) in the shader core, and a similar processor does the same for the sampler array. Those logic blocks implement the threading model R600 uses to hide latency and exploit unit pipelining, maintaining instruction and data throughput across the entire chip. We'll go through the shader core's processing first.

Input from the setup engine fills a triplet of command queues, one for each thread type, containing the threads the dispatch hardware runs on the shader core. Each cluster has a pair of arbiters that run a pair of object threads at a time, allowing four clocks of execution before new threads are swapped on and run in their place. Thread tracking is handled by a scoreboarding system that lets the hardware run threads out of order on the cluster, tracking dependencies and other parameters (likely the ops being run, and the registers being written to and read from) to decide what gets executed next in place of the currently running threads.
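
To make the scoreboarding description more concrete, here's a minimal, hypothetical sketch of how an arbiter might pick the next ready thread out of order based on tracked register dependencies. The Thread and Scoreboard structures are our own illustration of the idea, not R600's actual hardware state.

```python
# Hypothetical sketch of out-of-order thread selection via a scoreboard.
# Names and structures are illustrative, not R600's real dispatch logic.
from dataclasses import dataclass, field

@dataclass
class Thread:
    tid: int
    reads: set = field(default_factory=set)    # registers the next op reads
    writes: set = field(default_factory=set)   # registers the next op writes
    waiting_on_fetch: bool = False             # asleep until sampler data returns

class Scoreboard:
    def __init__(self):
        self.pending_writes = set()            # registers with in-flight writes

    def ready(self, t: Thread) -> bool:
        # Ready if the thread is awake and none of its source or destination
        # registers collide with writes that haven't retired yet.
        return not t.waiting_on_fetch and not ((t.reads | t.writes) & self.pending_writes)

    def issue(self, t: Thread):
        self.pending_writes |= t.writes

    def retire(self, t: Thread):
        self.pending_writes -= t.writes

def pick_next(queue, board):
    # Scan the command queue oldest-first, but issue the first *ready* thread,
    # which need not be the oldest -- i.e. out-of-order selection.
    for t in queue:
        if board.ready(t):
            return t
    return None  # nothing ready: everything is stalled or asleep
```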

Threads undergoing arbitration also carry granular priority, which feeds into the decision about what runs next. The basic heuristic is to put shader threads waiting on sampler data to sleep, covering the fetch latency by running shading ops unhindered in the meantime; it's what you'll find any heavily parallel, heavily threaded design doing in its threading model. The sequencer pair (one per arbiter) inside the dispatch processor keeps track of how far a thread has progressed through its block of execution. Data from the sequencer feeds back into the arbiter to let it know when a thread is about to finish running, so new threads can be prepared to take the finishing thread's place.
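
As a rough illustration of that heuristic, here's a hypothetical arbitration step, not the real arbiter logic: threads blocked on a texture fetch are put to sleep, and the remaining runnable threads compete on priority, with age breaking ties so nothing starves.

```python
# Hypothetical priority-based arbitration (illustrative only). ThreadSlot is
# our own stand-in for whatever per-thread state the hardware tracks.
from dataclasses import dataclass

@dataclass
class ThreadSlot:
    tid: int
    priority: int               # coarse scheduling priority
    age: int                    # clocks spent waiting; prevents starvation
    waiting_on_fetch: bool = False

def arbitrate(slots):
    runnable = [s for s in slots if not s.waiting_on_fetch]
    if not runnable:
        return None             # everything is asleep waiting on the sampler
    # Highest priority wins; among equals, the oldest thread goes first.
    return max(runnable, key=lambda s: (s.priority, s.age))

def on_texture_fetch(slot):
    # Sleep the thread; ALU work from other threads hides the fetch latency.
    slot.waiting_on_fetch = True

def on_fetch_complete(slot):
    slot.waiting_on_fetch = False
```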

Like competing architectures, R600 will scale back the number of threads in flight when there's severe register pressure, so that no thread stalls because the register file would otherwise be full. Branch granularity is 64 pixels, and our branching test measures it to be fastest when working on 4x4 blocks of screen-aligned quads (so 8x8 pixels). Testing the same branching shader with other tiling arrangements for the texture we sample shows that layouts like 32x2 or 16x4 (or 2x32 or 4x16) are slower.
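
A 64-pixel granularity means an entire 8x8 tile takes a branch as one unit: if even a single pixel disagrees with its neighbours, both sides of the branch have to be evaluated and the results masked per pixel. A simple, hypothetical cost model of that behaviour:

```python
# Hypothetical cost model for 64-pixel branch granularity (illustrative only).
def tile_branch_cost(predicates, cost_if, cost_else):
    """predicates: 64 booleans, one per pixel in the 8x8 tile."""
    assert len(predicates) == 64
    if all(predicates):
        return cost_if              # coherent: only the 'if' side runs
    if not any(predicates):
        return cost_else            # coherent: only the 'else' side runs
    return cost_if + cost_else      # divergent: pay for both paths

# Screen-aligned 8x8 tiles keep neighbouring pixels -- which tend to take the
# same branch -- together, whereas long, thin 32x2 or 16x4 tilings are more
# likely to straddle a branch boundary and pay the divergent cost.
```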

Instruction blocks and constants are kept by the dispatch processor on chip, inside dedicated, virtualised cache memories. They're there to maximise efficiency and let the hardware grab thread state as quickly as possible during execution. Any cache miss here forces the thread that needs the data to sleep while another thread is swapped on in its place and the needed data is fetched into the cache, ready for when it wakes. A miss can also reduce the thread's priority, effectively moving it down the command queue, so it might not be the very next thread woken behind the one that took its place when it missed.
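
Tying that to the arbitration sketch earlier (and reusing its hypothetical ThreadSlot fields), a miss in these caches might be handled along the following lines; the policy shown is our illustration of the behaviour described, not the hardware's actual logic.

```python
# Hypothetical handling of an instruction/constant cache miss (illustrative).
def handle_cache_miss(slot, pending_fills):
    slot.waiting_on_fetch = True    # sleep until the needed line arrives
    slot.priority -= 1              # demote: it may not be the very next
                                    # thread woken once the data is resident
    pending_fills.append(slot)      # the line is fetched in the background

def handle_cache_fill(pending_fills):
    # Data is resident again: wake the thread and let it re-enter arbitration.
    slot = pending_fills.pop(0)
    slot.waiting_on_fetch = False
    return slot
```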

In terms of threads in flight just for the shader core, the hardware usually maintains a count in the thousands (which can cover tens of thousands of objects), depending of course on the resources each thread requires. For the sampling threads the heuristics are simpler, because there's no register pressure to account for: the only requirement is that the right data is in the right cache at the right time, so a refetch after a cache miss doesn't occur and chip throughput stays high, hiding the latency of the data fetch and any post-fetch filtering.
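
The relationship between register use and threads in flight is straightforward to state: the more registers a thread's shader needs, the fewer threads the register file can hold resident at once. A hypothetical back-of-the-envelope version of that trade-off, with placeholder figures rather than R600's real register file size or thread limits:

```python
# Hypothetical occupancy calculation under register pressure. The figures
# below are placeholders, not disclosed R600 numbers.
def max_threads_in_flight(regfile_entries, regs_per_thread, hw_thread_limit):
    # Each resident thread keeps its working registers allocated for its
    # whole lifetime on the cluster, so the register file caps occupancy.
    by_registers = regfile_entries // max(regs_per_thread, 1)
    return min(by_registers, hw_thread_limit)

# A light shader leaves the hardware thread limit as the bound...
print(max_threads_in_flight(16384, regs_per_thread=4,  hw_thread_limit=2048))  # 2048
# ...while a register-heavy shader scales the thread count back.
print(max_threads_in_flight(16384, regs_per_thread=32, hw_thread_limit=2048))  # 512
```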

The threading model working well is also a function of the memory controller's ability to marshal data around the chip, from the main pools (including the DRAM pool on the board) to the myriad clients that want to consume and store data. We move on to the shader core next.