Shader Dispatch Processor

With R520 featuring the same number of quads and the same basic Pixel Shader ALU arrangement as both R420 and R480 it looks, on the surface, as though ATI are going to be leaning purely on clock speed for performance increases with R520. This is not the case, however, and rather than going the route of increasing the raw operations per cycle capabilities they have instead gone the route of attempting to make the units that they do have more efficient by increasing their utilisation further than just the shader compiler can.


Click for a bigger version

Ultra Threaded Dispatch Processor


Two key elements to maintaining efficient utilisation of the available computational resources are hiding latency within operations and avoiding wasted cycles - these are becoming more important as clock speeds increase. A commonly known area of processing that has a high latency associated with it is texture processing, which can waste hundreds of cycles if the texture data is not located in the texture cache and needs to be addressed from local, or even worse, system RAM, so it is important to ensure that other operations can be achieved on the Shader ALU's whilst such latency ridden operations are being executed.

R520's "Ultra Threaded Dispatch Processor" is the element that ATI have designed in order to better increase the overall utilisation of the Shader ALU's by breaking the tasks down into smaller chunks that can be interleaved more effectively. The central Pixel Shader dispatch unit first breaks down each of the shaders into "threads" (batches) of 16 pixels (4x4) and can track and distribute up to 128 threads per pixel quad. When one of the shader quads becomes idle, due to a completion of a task or waiting for other data, the dispatch engine will assign the quad with another task to do in the meantime, with the overall result being a greater utilisation of the shader units, theoretically.

With up to 128 threads per quad, there will plenty of temporary values (all at FP32 precision) that will need to be retrieved quickly in order to switch threads efficiently and to this end ATI have implemented a very large general purpose register array in the R520 series. The register array is multi-ported, allowing for multiple read and writes, and has a high bandwidth connection to the shader ALU's in each of the quads. In total there is enough register space for 32 128-bit (FP32 per component) pixels per thread, however should a thread require more register space, rather than dumping the extra requirements out to memory, the systems dispatch processor just reduces the number of active threads - this system should balance out the performance a little more as threads that have very high register utilisation should be fairly long and hence there is less need to balance out any latency operations anyway.

Breaking down each of the pixel shader commands into small threads also brings another benefit specific to the Shader Model 3.0 and above specification, and that is in the region of dynamic, per pixel branching. Dynamic Branching in the pixel shader allows such techniques as "early out", whereby redundant parts of a shader can be skipped, and can also reduce state change overheads by grouping lined shaders in a single, branched, shader. Developers may also like to use dynamic branching in a similar manner but for different reasons - using dynamic branching to group shaders together can reduce their code maintenance and hence their development costs.


Branching Costs                                    Branch Execution Unit

Graphics boards have always worked on batches of pixels, meaning that when a branch occurs it can occur anywhere within that batch - if a batch contains a branch that happens to occur such that one half of the pixels carry on down the initial path and the other half go down another path pixels down then all the pixels will still need that instruction executed, but the the pixels down the new path will also need to go back and will eventually become a new batch meaning that, in total, for that instruction 1.5 times the work will have been carried out because of the batch. R520's batch size, being only 4x4 pixels large, should be very efficient for large batch sizes, at least in relation to NVIDIA's G70 which is described as having batch sizes of 64x16 (1024) pixels. R520's Pixel Shader architecture also has a specific Branch Execution Unit which means that ALU cycles aren't burned just calculating the branch alone for each pixel.

Something to note on the execution unit is that R520 (4 quads) is described as having 512 "threads" in flight at any one time, whilst RV515 (1 quad) has just 128 threads - which basically means that each of these chips have 128 threads allocated per quad. Although RV530 is described as having 12 shader pipelines, it still only has 128 threads in flight at once - in this case each of the 12 shader pipelines exist within a single (quad) "pipeline" and will still operate over three separate quad fragments, but do so with larger batch sizes. For the parts in the R5xx series that have three times the number of shader pipelines to ROP's the batch sizes are 48 (4x12) pixels large, so the efficiency drops a little.