Thread Dispatch

If the command processor can be looked at as a CEO of sorts, the thread dispatcher is the foreman who must keep everything working at full tilt. Leaving the realm of forced comparisons, in the context of a GPU the dispatcher's role is to ensure maximum utilisation of the available units through effective scheduling of processing threads, masking DRAM access latency by switching between ready in-flight threads. All in all, its goal is to achieve the highest possible computational throughput.

The inputs provided by the setup engine are batched into threads of 64 elements, based on their thread type, and placed into command queues. Each SIMD has an arbiter-sequencer & sequencer pair assigned in the dispatcher, which allows for the interleaved execution of two threads per SIMD, with threads running for four clocks before being slept. Branch granularity sits (obviously) at 64 objects.
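To make the batching and interleaving a little more tangible, here is a minimal Python sketch. It is only a toy model, not ATI's implementation: the function names `batch_into_threads` and `interleave` are ours, the strict A/B alternation is a simplification, and the flushing of partially filled threads at the end is an assumption the hardware may not share.

```python
from collections import defaultdict

THREAD_SIZE = 64    # elements per thread, as stated in the text
SLICE_CLOCKS = 4    # clocks a thread runs before being slept, as stated in the text


def batch_into_threads(elements):
    """Group (thread_type, payload) tuples into per-type command queues.

    Each queue entry is one thread holding up to THREAD_SIZE payloads.
    """
    pending = defaultdict(list)   # elements not yet forming a full thread
    queues = defaultdict(list)    # per-type command queues of complete threads
    for thread_type, payload in elements:
        pending[thread_type].append(payload)
        if len(pending[thread_type]) == THREAD_SIZE:
            queues[thread_type].append(pending.pop(thread_type))
    # Assumption: leftovers are flushed as partially filled threads.
    for thread_type, rest in pending.items():
        if rest:
            queues[thread_type].append(rest)
    return dict(queues)


def interleave(thread_a, thread_b, total_clocks):
    """Return which of two threads owns the SIMD on each clock.

    Simplification: the two threads alternate strictly in SLICE_CLOCKS slices.
    """
    threads = (thread_a, thread_b)
    return [threads[(clock // SLICE_CLOCKS) % 2] for clock in range(total_clocks)]
```

Feeding this 130 pixel elements and 64 vertex elements yields three pixel threads (two full, one partial) and one vertex thread, while `interleave("A", "B", 12)` hands the SIMD to A, then B, then A again, four clocks at a time.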

Arbitration has a number of policies in place, taking into account aspects such as the thread's runtime or the availability of resources. Threads that issue high-latency fetch requests (typically involving DRAM) are put to sleep and kept in their frozen state until the resource becomes available. When a situation arises for which no clear arbitration policy is in place, pixel threads are favoured. Arbitration policies are programmable, and finding the optimal one to keep the hardware occupied is typically a case of multi-criteria optimisation -- we'd like to know more about this bit, but the data is unique and valuable to ATI, sadly but obviously.
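Since the actual policies aren't public, the best we can offer is a hedged sketch of the two rules that are described: sleeping threads are skipped, and pixel threads win when nothing else separates the candidates. Everything else here is hypothetical, including the `Thread` class, the `arbitrate` function and the single `age` field standing in for whatever criteria the real, programmable policy weighs.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    kind: str                       # e.g. "pixel" or "vertex"
    age: int                        # hypothetical stand-in for a policy criterion such as runtime
    waiting_on_fetch: bool = False  # True while asleep on a high-latency fetch


def arbitrate(threads):
    """Pick the next thread to run, or None if every thread is asleep."""
    # Threads waiting on a high-latency fetch stay frozen and are not considered.
    ready = [t for t in threads if not t.waiting_on_fetch]
    if not ready:
        return None
    # Assumed policy: the oldest thread wins; on a tie, a pixel thread is favoured.
    return max(ready, key=lambda t: (t.age, t.kind == "pixel"))
```

With a vertex and a pixel thread of equal age both ready, the pixel thread is chosen; an older thread that is still waiting on its fetch is ignored until it wakes up.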

Each arbiter gets data about thread completion status from the sequencers it's paired with (along with the CP as mentioned), so that new threads can be prepared to be dispatched as prior ones finish execution. Fully virtualised dedicated instruction and constant caches are in place. A miss in either of these forces the offending thread to sleep, and also downgrades its priority in the command queue.
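The consequence of a cache miss can be modelled in a few lines. This is a sketch under stated assumptions rather than a description of the hardware: `QueuedThread`, `fetch_instruction`, the numeric `priority` field and the `penalty` value are all illustrative, and we assume the miss triggers a fill so that a later retry hits.

```python
from dataclasses import dataclass


@dataclass
class QueuedThread:
    name: str
    priority: int         # hypothetical numeric priority; higher runs sooner
    asleep: bool = False


def fetch_instruction(thread, address, icache, penalty=1):
    """Look up an instruction address; on a miss, sleep and demote the thread.

    `icache` is a plain set of resident addresses (an illustrative model only).
    Returns True on a hit, False on a miss.
    """
    if address in icache:
        return True
    thread.asleep = True          # a miss forces the offending thread to sleep
    thread.priority -= penalty    # ...and downgrades its priority in the queue
    icache.add(address)           # assumption: the miss triggers a cache fill
    return False
```

The same shape would apply to the constant cache; only the lookup key changes.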

Arbitration and sequencing are also performed for sampler-executed instructions, and this is probably where the RV740 (and the R7xx generation in general) differs from prior parts, since the way these are handled has changed along with the sampler arrangement. However, compared to processing threads, policies are simplified here because there's no register pressure to take into account.

If all of this appears to be eerily familiar, it's because you've probably gone through these very same motions before, around the R600 launch -- there aren't many changes beyond the improvements to the sequencer to accommodate the increased SIMD count, and the modification of how fetch threads are handled. It's not our fault ATI opted against reinventing the wheel, really! At any rate, prepare to be dispatched to a SIMD.