Pixel Pipeline & Pixel Shaders 2.0

Radeon 9700 PRO is the first consumer graphics accelerator to feature 8 parallel rendering pipelines. For the past few years 4 pipelines have been the favoured configuration; ATI have opted to double that, giving twice the potential pixel output per clock.



R300 Pixel Engine


This does come with a caveat, though: only one texture sampling unit has been included per pipe, as opposed to the 2 texture units included in many configurations over the past few years, including Radeon 8500.
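To put that trade-off in numbers, here is a minimal sketch in Python. The 300MHz clock is a hypothetical round figure used for both configurations, not a specification of any chip, and `fill_rates` is an illustrative helper of our own.

```python
def fill_rates(pipelines, tmus_per_pipe, clock_mhz):
    """Peak per-clock and per-second output for a pipelines x TMUs layout."""
    pixels_per_clock = pipelines
    texels_per_clock = pipelines * tmus_per_pipe
    return {
        "pixels_per_clock": pixels_per_clock,
        "texels_per_clock": texels_per_clock,
        "mpixels_per_s": pixels_per_clock * clock_mhz,
        "mtexels_per_s": texels_per_clock * clock_mhz,
    }


CLOCK_MHZ = 300  # hypothetical clock, identical for both layouts

eight_by_one = fill_rates(8, 1, CLOCK_MHZ)  # R300-style: 8 pipes, 1 texture unit each
four_by_two = fill_rates(4, 2, CLOCK_MHZ)   # older style: 4 pipes, 2 texture units each

for name, rates in (("8x1", eight_by_one), ("4x2", four_by_two)):
    print(name, rates)
```

At equal clocks the 8x1 layout doubles the peak pixel rate, while the peak texel rate is the same as the 4x2 layout; the gain shows up mainly where pixels need only one texture or are not texture-bound.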


R300 Pixel Shader

As included in the DirectX9 specifications, each of the pipelines is able to sample up to 16 textures in a single geometry pass. As there is only one texture unit per pipeline, this is achieved via a process commonly termed 'loopback': the first texture is sampled in one cycle and the result is stored on chip, then the second texture is read in the next clock, and this process is repeated up to 16 times. The textures can be any combination of one, two or three dimensional textures, with bilinear, trilinear or anisotropic filtering applied arbitrarily.
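The loopback idea can be sketched as a simple loop. This is an illustration only, under two assumptions of ours: every sample costs exactly one clock (more expensive filtering modes would likely cost more), and the layers are combined by multiplication, which is just one possible blend. `loopback_sample` and the lambda "textures" are hypothetical names.

```python
MAX_TEXTURES_PER_PASS = 16  # per-pass limit quoted in the text


def loopback_sample(textures, u, v):
    """Sample each texture in turn through a single texture unit.

    `textures` is a list of callables (u, v) -> value in [0, 1].
    Returns the combined result and the number of clocks spent sampling.
    """
    if not 1 <= len(textures) <= MAX_TEXTURES_PER_PASS:
        raise ValueError("a single pass handles between 1 and 16 textures")
    result = 1.0  # running value kept "on chip" between samples
    clocks = 0
    for texture in textures:
        result *= texture(u, v)  # assumed blend: modulate with the running value
        clocks += 1              # assumed cost: one clock per sample
    return result, clocks


# Four flat 50% grey layers stand in for real texture maps.
layers = [lambda u, v: 0.5] * 4
value, clocks = loopback_sample(layers, 0.25, 0.75)
print(value, clocks)
```

The point of the sketch is that the texture count sets the number of trips through the one texture unit, with no extra geometry pass needed until the 16 texture limit is exceeded.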

As can be seen from the above diagram, R300's pixel pipeline is also able to apply 3 instructions per clock: one texture look-up, one texture address operation, and one colour operation. Pixel Shaders often consist of this mixture of operations, which should ensure, according to ATI, that R300 performs optimally during pixel shading operations.

One of the biggest advancements in the R300 pipeline is the move from 32bit integer pipelines to 128bit floating point processing. The point of having such a large range is to increase accuracy and to maintain it when a lot of processing is occurring. If you think back to the 16bit age, you may remember that even with simple multitexturing or colour blending 'banding' occurred in the output; this happened because the low accuracy of 16bit caused visible errors to build up over multiple operations. 32bit moved to alleviate this, and during normal texture and colour combine operations accuracy issues were not that evident. With Pixel Shaders, however, the number of operations being carried out has increased significantly, especially with newer Pixel Shader functionality such as PS2.0.
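The way low precision produces banding over a chain of operations can be shown with a small experiment. This is a sketch under our own assumptions: a 256 step grey ramp is pushed through eight multiplies by 0.8, and the intermediate value is rounded to 5 bits per channel (as in 16bit colour), to 8 bits per channel (as in 32bit colour), or left in floating point. The operation count and factor are arbitrary choices, not taken from any real shader.

```python
def quantize(x, bits):
    """Round x in [0, 1] to the nearest level representable with `bits` bits."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels


def run_chain(x, bits, ops=8, factor=0.8):
    """Apply `ops` multiplies, storing the result at the given precision each time.

    bits=None means "keep full floating point between operations".
    """
    for _ in range(ops):
        x = x * factor
        if bits is not None:
            x = quantize(x, bits)
    return x


def distinct_levels(bits, samples=256):
    """Count how many different output shades survive for a smooth input ramp."""
    return len({run_chain(i / (samples - 1), bits) for i in range(samples)})


for label, bits in (("5 bits/channel", 5), ("8 bits/channel", 8), ("float", None)):
    print(label, distinct_levels(bits))
```

The fewer distinct shades that survive, the wider the visible bands in what should be a smooth gradient. Longer shader programs make the integer cases worse, while the floating point case keeps the ramp intact.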

Another benefit of floating point accuracy is that it can also increase the dynamic range of the output, bringing truly photo-realistic effects that much closer to reality, which is why we've seen the idea of 'cinematic realism' come to the fore recently.

The screenshot below illustrates the difference between integer and floating point rendering precision. Take note of the quality of the reflection in the main globe between the two rendering formats, and also the 'over-bright' lighting seen on the floating point side.



Natural Light Demo
Without / With Floating Point
Pixel Accuracy

Although DirectX9 supports full 128bit (32bit per component) floating point precision, it only really cares about the input and output stages. Internally R300 has a hybrid pixel pipeline, with 128bit accuracy for some of the texture sampling stages and 96bit (24bit per component) for the rest of the shader stages. ATI believe that, certainly for the foreseeable future, this level of accuracy is enough (and evidently Microsoft agreed with them), given that the final output will either be a 10:10:10:2 format or a texture address, and that normal shader operations will not show any loss of accuracy with a 24bit per component internal format.
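To make the bit counts concrete: four components at 32 bits each give 128 bits, four at 24 bits give 96, and a 10:10:10:2 pixel fits in a single 32bit word. The sketch below packs a floating point colour into such a word. The bit order (red in the lowest bits, alpha in the top two) is an assumption made for illustration, and `pack_1010102` is a hypothetical helper rather than any API call.

```python
COMPONENTS = 4
print(COMPONENTS * 32, COMPONENTS * 24)  # 128 and 96 bits per pixel


def to_bits(x, bits):
    """Clamp x to [0, 1] and scale it to an unsigned integer of `bits` bits."""
    x = min(max(x, 0.0), 1.0)
    return round(x * ((1 << bits) - 1))


def pack_1010102(r, g, b, a):
    """Pack float RGBA into 32 bits: 10 bits each for R, G, B and 2 for alpha.

    Assumed layout, from most to least significant bit: A2 B10 G10 R10.
    """
    return (
        (to_bits(a, 2) << 30)
        | (to_bits(b, 10) << 20)
        | (to_bits(g, 10) << 10)
        | to_bits(r, 10)
    )


print(hex(pack_1010102(1.0, 0.0, 0.0, 0.0)))
```

Each colour channel ends up with 1024 possible levels at the output, which is the figure the internal shader precision has to comfortably exceed.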

R300 can process multiple 128 and 64bit integer texture formats for both reads and writes, and full filtering is supported on formats up to 64bit. A 64bit floating point texture format is supported, though no filtering is applied to this format. Externally the 10:10:10:2 format is supported in the frame buffer and also through to the RAMDAC. R300 can also use non-displayable buffers in various formats, such as 32bit, 64bit or 128bit, to store the intermediate results of multipass operations. When a 128bit buffer is called for, the output is converted from the 96bit internal precision to 128bit. These types of non-displayable buffers have to be explicitly requested by the developer.

To take that a step further, if you were playing an older game on Radeon 9700 PRO that forces an extra pass after a certain number of texture layers, such as Quake3, the intermediate results would not be stored in any exotic format such as 128bit, but would remain at 32bit (or 16bit if you are running in that colour mode). This means that the game will still run without the extra bandwidth that 128bit buffers would require. If, on the other hand, a developer believes that their shader routine will require more processing than the API or hardware can handle in a single pass, they can specify that the intermediate results be stored in a full precision buffer to maintain accuracy between the passes.
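The bandwidth cost of the wider buffers follows directly from their size. As a rough sketch, assuming a hypothetical 1024x768 render target (the resolution is our choice, and `buffer_bytes` is an illustrative helper):

```python
def buffer_bytes(width, height, bits_per_pixel):
    """Size in bytes of one uncompressed buffer at the given pixel format."""
    return width * height * bits_per_pixel // 8


WIDTH, HEIGHT = 1024, 768  # hypothetical resolution

for bits in (32, 64, 128):
    size = buffer_bytes(WIDTH, HEIGHT, bits)
    print(f"{bits}bit buffer: {size / (1024 * 1024):.0f} MiB per write or read")
```

A 128bit intermediate buffer is four times the size of a 32bit one, and it is written by one pass and read back by the next, so the cost is only worth paying when the shader genuinely needs the precision between passes.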

As with the Vertex Shaders, DirectX9's Pixel Shaders 2.0 allow a greater level of functionality and programmability to be exposed. For instance, as already mentioned, DX9 supports up to 16 texture inputs per pass; in addition, 32 texture address instructions can be used, and 64 vector + 64 scalar colour instructions are available. However, once again R300 goes beyond these specifications in some areas, such as offering 32 temporary registers in the Pixel Shader pipe. R300 also supports full component swizzling in the pixel shader pipeline and, like the Vertex Shader, DX9 macros, allowing advanced operations such as SIN/COS to be performed.
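Those per-pass limits can be thought of as a simple budget that a shader either fits within or not. The following sketch checks a shader's instruction counts against the figures quoted above; the dictionary keys and the `exceeded_limits` helper are hypothetical names of our own, not part of DirectX.

```python
# Per-pass limits as quoted in the text for DX9 Pixel Shaders 2.0.
PS20_LIMITS = {
    "texture_inputs": 16,
    "texture_address_instructions": 32,
    "vector_colour_instructions": 64,
    "scalar_colour_instructions": 64,
}


def exceeded_limits(shader_counts, limits=PS20_LIMITS):
    """Return the names of any limits a shader's counts go over."""
    return [
        name
        for name, limit in limits.items()
        if shader_counts.get(name, 0) > limit
    ]


# A made-up shader that uses too many texture address instructions.
example_shader = {
    "texture_inputs": 8,
    "texture_address_instructions": 40,
    "vector_colour_instructions": 50,
    "scalar_colour_instructions": 10,
}
print(exceeded_limits(example_shader))
```

A shader that goes over budget has to be split across passes, which is where the full precision intermediate buffers described earlier come in.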

To demonstrate the power of the Shaders in R300, the technique from the Stanford Ray Tracing Paper, which was written before the availability of hardware such as R300, was shown running fully on Radeon 9700 PRO at SIGGRAPH 2002. In this instance the Pixel Shaders were running hundreds of instructions.

The R300 Pixel Shader pipeline and DX9 also support Multiple Render Targets. Previously only one value could be output from the pixel pipeline, whereas now up to 4 different values can be output from an individual shader. One use ATI give as an example is image filtering or post processing on a 3D scene.



Multiple Render Targets

DirectX9 and R300 also feature two sided stencil acceleration. Stencil buffers are becoming more common for creating realistic shadow effects, and it sounds as though DoomIII will take this to extremes. Using the vertex shaders, the edges of models are determined from the point of view of a light source. Similar to the method described within our recent John Carmack interview, the geometry is rendered without any textures or shading so that the z-buffer information is filled for the scene, and then the stencil buffer can be filled via two passes: the first pass renders the forward facing polygons of the shadow volume, incrementing the stencil values, and the second renders the back facing polygons, decrementing the values from the first pass.

Normally the incremented and decremented values would negate each other. However, because the objects in the scene were already rendered to the Z-buffer, some parts of the shadow volume will be hidden behind these objects as it is rendered, and therefore those pixels will not be cancelled out. The values that were not cancelled out in the stencil buffer are shadow values, which are used to create a shadow mask that is filled with a dark colour. This process would normally be done for every light that casts a shadow within the scene. When all the light sources have been calculated, the texturing and shading of the rest of the scene is done and then the shadow values are blended into the scene.

The two passes required for the stencil calculation per light source can consume plenty of bandwidth, so to this end both DX9 and R300 introduce the ability to complete both the stencil increment and decrement stages in a single pass rather than two, thus saving bandwidth and increasing performance.
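The counting scheme is easier to follow for a single pixel. The sketch below is a toy model, not how the hardware or any engine implements it: `Face`, `stencil_two_pass` and `stencil_two_sided` are hypothetical names, depth is a plain number where smaller means closer to the camera, and the stencil value is an unbounded Python integer, so the wrapping or clamping behaviour of a real stencil buffer is ignored.

```python
from dataclasses import dataclass


@dataclass
class Face:
    """One shadow volume polygon covering the pixel being examined."""
    depth: float        # distance from the camera; smaller is closer
    front_facing: bool  # True if the polygon faces the camera


def stencil_two_pass(faces, scene_depth):
    """Classic method: one pass for front faces (+1), one for back faces (-1)."""
    stencil = 0
    passes = 0
    for want_front in (True, False):
        passes += 1
        for face in faces:
            visible = face.depth < scene_depth  # depth test against the pre-filled z-buffer
            if face.front_facing == want_front and visible:
                stencil += 1 if want_front else -1
    return stencil, passes


def stencil_two_sided(faces, scene_depth):
    """Two sided stencil: both facings handled in a single pass over the volume."""
    stencil = 0
    for face in faces:
        if face.depth < scene_depth:
            stencil += 1 if face.front_facing else -1
    return stencil, 1


# A shadow volume entering at depth 2 and leaving at depth 6.
volume = [Face(2.0, True), Face(6.0, False)]

for scene_depth in (1.0, 4.0, 8.0):
    value, passes = stencil_two_sided(volume, scene_depth)
    print(scene_depth, "in shadow" if value != 0 else "lit", passes)
```

A surface at depth 4 sits inside the volume, so only the front face passes the depth test and the count stays at 1, marking the pixel as shadowed. Surfaces in front of or behind the volume end up at 0. Both functions give the same count, but the two sided version submits the shadow volume geometry once instead of twice.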