The Shader Pipeline Array

As we mentioned before, the graphics pipeline has been an evolving entity, but in the shader-dominated environment we are approaching, shader ALU (Arithmetic Logic Unit) capabilities and organisation are fast becoming one of the most decisive factors in the overall performance of the graphics processor. If the processor has a poor shader pipeline, it doesn't matter how fast the rest of the chip is, as the mix of graphics usage in today's titles is shifting more and more towards the shader pipeline.

With its unified shader pipeline, Xenos differs fundamentally from virtually all current shader-capable graphics processors: whilst shader processing has been enveloping both the geometry and raster pipelines, until now processors have handled those two elements distinctly. Xenos makes a logical break from the old OpenGL pipeline, which is basically the evolution path most graphics processors have followed. Geometry and pixel shader processing are now moved on to a single processing element of the chip, and all of the shader ALUs can dynamically be tasked with either vertex shader programs or pixel shader programs.


At this year's E3, ATI held a press conference to briefly outline what they did for the Xbox 360 platform and to highlight a few of the capabilities of the Xenos processor. The diagram above emerged from this press conference and gives an indication of the functional arrangement of the Xenos processor; however, it is not entirely reflective of how the unified shader pipeline actually operates. Some other misconceptions about the pipeline's operation have also arisen since then, so here we'll go into a little more depth to explain the unified processing that occurs in Xenos.

It's been said that Xenos's shader processor is an array of 48 ALUs; however, it is more correct to say that it is 3 separate arrays of 16 SIMD (Single Instruction Multiple Data) ALUs. Each one of the 48 ALUs can co-issue a vector (Vec4) and a scalar instruction simultaneously, essentially allowing a "5D" operation per cycle. Each ALU is a complete duplicate of the others in terms of instructions, and all are single precision, IEEE 32-bit floating point compliant. The ALUs process everything at FP32 internal precision, and there is no internal requirement for FP16 partial precision. In addition to the 48 ALUs there is specific logic that performs all the pixel shader interpolation calculations, which ATI suggests equates to about an extra 33% of pixel shader computational capability.
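To make the arithmetic in the paragraph above concrete, here is a minimal Python sketch that multiplies out the figures quoted in the text (3 arrays, 16 ALUs per array, one Vec4 plus one scalar co-issued per ALU per cycle). The constant and function names are purely illustrative, and the result is a peak count of component operations per cycle under the assumption that every ALU co-issues on every cycle; it is not a measured throughput figure.

```python
# Figures taken from the article text; all names here are illustrative only.
NUM_ARRAYS = 3        # separate SIMD arrays
ALUS_PER_ARRAY = 16   # ALUs in each array
VECTOR_WIDTH = 4      # Vec4 instruction
SCALAR_WIDTH = 1      # co-issued scalar instruction


def total_alus() -> int:
    """Total ALU count across all arrays."""
    return NUM_ARRAYS * ALUS_PER_ARRAY


def components_per_alu_per_cycle() -> int:
    """Components one ALU can operate on per cycle (the "5D" operation)."""
    return VECTOR_WIDTH + SCALAR_WIDTH


def peak_components_per_cycle() -> int:
    """Peak component operations per cycle, assuming every ALU co-issues."""
    return total_alus() * components_per_alu_per_cycle()


if __name__ == "__main__":
    print("ALUs:", total_alus())
    print("Components per ALU per cycle:", components_per_alu_per_cycle())
    print("Peak components per cycle:", peak_components_per_cycle())
```

The dedicated interpolation logic is deliberately left out of this count, since the text only gives ATI's approximate estimate of its contribution.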

The arrows on ATI's diagram above suggest some dependency from one shader array to another, almost as though they are pipelined. This is in fact not the case: each ALU array works independently of the others, and data is not pipelined between them. Consequently, there is no dependency between the programs, or types of programs, being executed on each of the three ALU arrays. At a snapshot in time they could, potentially, all be vertex processing, all be pixel processing, or there could be a mixture of vertex processing and pixel processing occurring across the three 16-ALU arrays.
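The independence of the three arrays can be illustrated by enumerating what a snapshot in time might look like. The Python sketch below is a toy model only: it assumes each array is busy with exactly one of two workload types (it ignores idle arrays and says nothing about how the hardware actually schedules work), and the names `WORKLOADS`, `snapshots` and `classify` are hypothetical.

```python
from itertools import product

# Toy model: each of the three arrays independently runs one workload type.
WORKLOADS = ("vertex", "pixel")
NUM_ARRAYS = 3


def snapshots() -> list[tuple[str, ...]]:
    """Every possible assignment of a workload type to each array."""
    return list(product(WORKLOADS, repeat=NUM_ARRAYS))


def classify(snapshot: tuple[str, ...]) -> str:
    """Label a snapshot as all vertex, all pixel, or mixed."""
    kinds = set(snapshot)
    if kinds == {"vertex"}:
        return "all vertex"
    if kinds == {"pixel"}:
        return "all pixel"
    return "mixed"


if __name__ == "__main__":
    for snap in snapshots():
        print(snap, "->", classify(snap))
```

Because no array constrains another, every combination is allowed in this model, which covers the three cases the text describes: all vertex, all pixel, or a mixture.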