The Thread Dispatcher

A statement made by the same Richard Huddy a while ago will probably haunt this part of the architecture forever, as people will keep trying to find where the doubled complexity Richard mentioned is hidden. We don't think it's as glamorous as most assume: the quote most likely referred to the extension needed to handle scheduling for 10 extra SIMDs/TUs, as well as the juggling of the new Hull and Domain shader thread types. Those two new thread types mean two extra command queues, for a grand total of 5 (Vertex, Hull, Domain, Geometry and Pixel). Threads are still 64 elements in size, which we've verified in practice.
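As a toy illustration of that queue layout (the names and data structures here are ours for illustration, not anything AMD has disclosed):

```python
from collections import deque

WAVEFRONT_SIZE = 64  # threads are dispatched in 64-element groups

# One command queue per thread type; Hull and Domain are the two
# additions that tessellation brings, for five queues in total.
command_queues = {kind: deque() for kind in
                  ("vertex", "hull", "domain", "geometry", "pixel")}
```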

Other than that, we're reasonably certain that the dispatcher retains an internal arrangement quite similar to what we've seen before: dual arbiter-scheduler pairs per SIMD, permitting interleaved execution of two threads, with each thread getting 4 cycles of GPU time before being swapped out, plus an extra arbiter-scheduler pair dedicated to the TUs.
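The interleaving described above can be modelled with a trivial round-robin sketch; the 4-cycle slot comes from the text, while the function itself is purely our illustration:

```python
def interleave(thread_a, thread_b, slot=4, total=16):
    """Simulate two threads sharing one SIMD, each running for
    `slot` cycles before the dispatcher swaps in the other."""
    timeline = []
    threads = (thread_a, thread_b)
    for cycle in range(total):
        timeline.append(threads[(cycle // slot) % 2])
    return timeline
```

Running `interleave("A", "B")` yields four cycles of A, four of B, and so on, which is the alternation pattern the arbiter-scheduler pair enforces.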

In terms of effective arbitration, the dispatcher takes into account a thread's age and the availability of the resources it requires. Threads that rely on resources which must be fetched from memory, a high-latency operation, are kept in a sleep state until those resources become available. When no clear arbitration winner emerges – think two threads of equal age, both with their resources available – pixel threads are favoured over the other thread types. Arbitration policies have been programmable ever since the R600, and it's safe to assume that they've been slightly reworked for Cypress; getting explicit details on this is hard though, since no one's willing to discuss precise aspects.
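The heuristic above can be sketched as a small selection function. This is a toy model under our own assumptions (the thread representation, the exact tie-break ordering), not the programmable policy AMD actually ships:

```python
from dataclasses import dataclass

@dataclass
class Thread:
    kind: str    # "vertex", "hull", "domain", "geometry" or "pixel"
    age: int     # how long the thread has been waiting; higher = older
    ready: bool  # have its memory resources arrived yet?

def arbitrate(threads):
    """Pick the next thread to dispatch.

    Sleeping (not-ready) threads are skipped entirely; among ready
    threads the oldest wins, and on an age tie pixel threads are
    favoured over the other types.
    """
    candidates = [t for t in threads if t.ready]
    if not candidates:
        return None
    # Sort key: older first, then pixel beats non-pixel on a tie.
    return max(candidates, key=lambda t: (t.age, t.kind == "pixel"))
```

With a pool containing a ready vertex thread and a ready pixel thread of equal age, plus an older but still-sleeping geometry thread, the pixel thread gets dispatched.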

Sequencers feed thread completion status back to their arbiters, so preparation work on the threads about to be dispatched can commence as the currently executing ones finish. Register pressure conditions the number of threads in flight: the higher the pressure, the lower the thread count. Both (decoded) instructions and constants are kept in dedicated on-chip caches, and a thread that misses in either is immediately put to sleep until the instructions or constants it needs are fetched into the caches. A further side-effect of such a miss is that the offending thread may be downgraded in priority, so it may not re-enter execution immediately after the thread that replaced it.
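The register-pressure relationship is simple to express numerically. A back-of-the-envelope sketch, with all figures purely illustrative rather than measured Cypress parameters:

```python
def threads_in_flight(regfile_per_simd, regs_per_element, wavefront_size=64):
    """How many threads (wavefronts) a SIMD can keep resident.

    Each in-flight thread pins wavefront_size * regs_per_element
    registers, so the register file bounds the total: the more
    registers a shader uses, the fewer threads are available to
    hide memory latency.
    """
    per_thread = wavefront_size * regs_per_element
    return regfile_per_simd // per_thread
```

For example, with a hypothetical 16384-register file per SIMD, a shader using 8 registers per element leaves room for 32 threads, while one using 32 registers per element cuts that to 8.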

We have a number of evil things in mind, aimed at confronting the dispatcher with some serious challenges and seeing how it behaves, but these will sadly have to wait for a further piece, due to time (and resource) constraints. With that said, time to go over 9000, and look at the 1600 ALUs inhabiting and empowering the shader core.