Part Two: Pixel Shader

Pixel shaders, like vertex shaders, were introduced in DirectX 8.0. In DirectX 9.0, pixel shaders (PS) gain more over the previous version than vertex shaders (VS) do. Besides the shared enhancements, such as more instruction slots and new instructions, the way instructions are executed has changed fundamentally. The pixel processing units in R300 and NV30 are no longer a series of simple logical switches (stages) that can only handle simple texturing and color blending operations, but true processing units that can overcome the dramatic drop in pixel fillrate when running a long pixel shading program.



ATI's R300 Natural Light demo demonstrates the DirectX 9 floating-point pixel shader pipeline



Pixel Processing Power

Pixel Shader Unit                      1.4    2.0    2.0     2.0+     3.0
Float precision                        -      ✓      ✓       ✓        ✓
Max texture input (per pass)           6      16     16      16       16
Max texture addressing instructions    12     32     32      1024*    1024*
Max arithmetic/color instructions      16     64     128**   1024*    1024*
Instruction slots                      28     96     160     1024     1024
Max runtime instruction number         28     96     160     1024     64k
Unlimited texture fetches              -      -      -       ✓        ?
Static flow control                    -      -      -       -        ✓
Dynamic flow control                   -      -      -       -        ✓
Per-channel masking                    -      -      -       ✓        ✓
Multi render target                    -      ✓      4       -        ✓

Notes:
1) - Not supported; ✓ Supported; ? Unknown.
2) * Universal instruction slots: color and texture instructions share the same instruction space.
3) ** R300 has 32 texture instruction slots, plus 64 ALU instruction slots each for scalar and vector operations. It can issue one instruction from each of these instruction sets every cycle, and consequently execute 3 instructions per cycle. A nice design!
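To make the difference between separate and universal instruction slots concrete, here is a minimal sketch that checks whether a shader with a given number of texture and arithmetic instructions fits within the limits listed in the table. The limits are taken from the 1.4, first 2.0, and 2.0+ columns; the names `PixelShaderLimits`, `LIMITS`, and `fits` are hypothetical and exist only for this illustration, and the check is a simplification that ignores every other constraint a real driver or runtime would apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelShaderLimits:
    """Instruction limits for one column of the table (illustrative only)."""
    max_texture: int     # max texture addressing instructions
    max_arithmetic: int  # max arithmetic/color instructions
    slots: int           # total instruction slots
    universal: bool      # True when both kinds share one pool (note 2)


# Values copied from the table above; the dictionary itself is an assumption
# made for this sketch, not part of any DirectX API.
LIMITS = {
    "1.4": PixelShaderLimits(12, 16, 28, universal=False),
    "2.0": PixelShaderLimits(32, 64, 96, universal=False),
    "2.0+": PixelShaderLimits(1024, 1024, 1024, universal=True),
}


def fits(profile: str, texture_count: int, arithmetic_count: int) -> bool:
    """Return True if the instruction mix fits the given profile's limits."""
    lim = LIMITS[profile]
    if lim.universal:
        # Universal slots: only the combined total matters.
        return texture_count + arithmetic_count <= lim.slots
    # Separate pools: each kind of instruction has its own ceiling.
    return (texture_count <= lim.max_texture
            and arithmetic_count <= lim.max_arithmetic)
```

Under these assumptions, a shader with 40 texture instructions and 10 arithmetic instructions is rejected by `fits("2.0", 40, 10)` because the texture pool holds only 32, even though the total of 50 is well under 96 slots. The same shader passes `fits("2.0+", 40, 10)`, since universal slots only constrain the combined count.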

It seems that DX9 PS 2.0 follows the R300 specification exactly. The strength of the NV30 PS specification is impressive: 1024 instruction slots against 160 is a qualitative difference, and one of significant value to DCC (digital content creation) applications. The influence of NV30 is visible in DX9 PS 3.0 as well.

Another interesting point about NV30 is that its PS instructions are stored in local video memory rather than on the chip itself. This cuts both ways: it makes managing large numbers of fragment programs cheap, but it also adds pressure to the already limited memory bandwidth.