Thread Handling

Keeping all of the available units active as efficiently as possible requires some fairly complex thread management, and the diagram ATI displayed doesn't attempt to do justice to what is actually occurring.

The thread processing does bear similarities to the patent unearthed previously, which has two "reservation stations", one for pixel shader instructions and another for vertex shaders. Beyond that, however, there are multiple arbiters and sequencers for each of the different workload types (ALU instruction operations, texture fetches and vertex fetches). These arbiters and sequencers interleave execution from the pools of instructions in the reservation stations in order to optimise the utilisation of the available processing elements whilst hiding the latencies of dependent operations within each thread, be they texture or shader instruction oriented. Additionally, there are algorithms designed to prioritise the thread execution order and to transition threads from one workload type to another (e.g. a shader program first requiring some texture data input and then requiring ALU instruction operations).
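The latency-hiding idea in the paragraph above can be illustrated with a toy model. The sketch below is not ATI's sequencer logic, which has not been disclosed in detail. It assumes a single ALU that issues one instruction per cycle, threads that each wait a fixed number of cycles on a dependent fetch after every ALU instruction, and an arbiter that picks the thread that has been ready the longest. The names `Thread`, `simulate`, `alu_ops` and `fetch_latency` are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    tid: int
    alu_ops_left: int   # ALU instructions this thread still has to issue
    ready_at: int = 0   # first cycle at which the thread may issue again


def simulate(num_threads, alu_ops=4, fetch_latency=8):
    """Return ALU utilisation (busy cycles / total cycles) for the toy model.

    Assumption: after each ALU instruction a thread stalls for
    `fetch_latency` cycles on a dependent fetch before it is ready again.
    """
    threads = [Thread(t, alu_ops) for t in range(num_threads)]
    cycle = busy = 0
    while any(t.alu_ops_left for t in threads):
        # The arbiter only considers threads that are ready this cycle.
        ready = [t for t in threads if t.alu_ops_left and t.ready_at <= cycle]
        if ready:
            # Pick the thread that became ready earliest (ties go to lowest id).
            t = min(ready, key=lambda th: (th.ready_at, th.tid))
            t.alu_ops_left -= 1
            t.ready_at = cycle + 1 + fetch_latency
            busy += 1
        cycle += 1
    return busy / cycle


if __name__ == "__main__":
    for n in (1, 4, 9):
        print(f"{n} thread(s): utilisation {simulate(n):.2f}")
```

In this model a lone thread leaves the ALU idle for most of the time, while enough interleaved threads to cover the fetch latency keep it issuing every cycle. The real hardware juggles several workload types and many more threads, but the principle is the same.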

Given this complex organisation, the threading mechanisms, and the number of threads that are either active or ready to become active, the system hides latency effectively: ATI's testing indicates an average of about 95% efficiency over the shader array in general purpose graphics usage conditions. The throughput of the system is such that ATI expect to be able to achieve two loops, two texture instructions and six ALU instructions per pixel, per cycle at Xenos's peak fill-rate.

Shader Type Load Balancing

When trying to prioritise one pixel shader program over another, or one vertex shader program over another, the best choice is nearly always first in, first out. With a unified shader architecture, though, where the same ALUs will be presented with both pixel and vertex shader programs over time, deciding whether vertex shading or pixel shading should be done next is a little more complex.

ATI, perhaps understandably, weren't too keen on giving out many details of the prioritisation methodology, probably because there is some fairly proprietary logic behind it, but also because for the most part you shouldn't need to know much about it other than "it happens". From ATI's comments it sounds like a fairly complicated procedure, but conceptually it appears to monitor the vertex buffer and the pixel export buffer (just before the transfer to the daughter die) and, depending on the application's program mix, an equation prioritises between pixel shading and vertex shading dependent on the size of the buffers and how full they are.

This load balancing equation is inherently weighted; however, information from the OS or even the application itself, which is obviously under the control of the developer, can alter that weighting a little in order to affect the prioritisation of the vertex and pixel shader programs. ATI's experiments show that the algorithm gives quite optimal throughput, and they expect that only a few tier-1 developers will actually look into altering the weighting of the algorithm.

ATI states that there will never be an unused shader array (or texture sampler, for that matter) if there are any threads available to use it. Whilst the load balancing acts like an arbiter, it only considers threads that are ready to go when making that decision.
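Since ATI has not published the balancing equation, the following is only a guess at its general shape, written to make the description in this section concrete. It assumes that a fuller vertex buffer argues for pixel work, that a fuller pixel export buffer argues for vertex work, and that a developer-supplied bias can nudge the result. The function `pick_shader_type`, its parameters and the scoring formula are all hypothetical.

```python
def pick_shader_type(vertex_fill, export_fill, ready_vertex, ready_pixel,
                     pixel_bias=0.0):
    """Choose which shader type to issue next: "pixel", "vertex" or None.

    vertex_fill, export_fill: buffer occupancy as fractions from 0.0 to 1.0.
    ready_vertex, ready_pixel: number of threads of each type ready to run.
    pixel_bias: hypothetical developer weighting; positive favours pixels.
    """
    # Only ready threads take part in the decision, and a unit is never
    # left idle while any ready thread could use it.
    if not ready_vertex and not ready_pixel:
        return None
    if not ready_pixel:
        return "vertex"
    if not ready_vertex:
        return "pixel"

    # Invented scoring rule: plenty of queued vertices pushes towards pixel
    # shading, a backed-up export buffer pushes towards vertex shading.
    score = vertex_fill - export_fill + pixel_bias
    return "pixel" if score > 0 else "vertex"


if __name__ == "__main__":
    print(pick_shader_type(0.9, 0.1, ready_vertex=3, ready_pixel=5))
    print(pick_shader_type(0.1, 0.9, ready_vertex=3, ready_pixel=5))
    print(pick_shader_type(0.1, 0.9, ready_vertex=0, ready_pixel=5))
```

The last call shows the fallback: even when the score favours vertex work, pixel threads are issued if no vertex thread is ready, which matches ATI's statement that nothing sits idle while usable threads exist.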