Part Two: Pixel Shader
Pixel shaders, like vertex shaders, were introduced in DirectX 8.0. In DirectX
9.0, pixel shaders (PS) gain more over the previous version than vertex shaders
(VS) do. Besides the enhancements both share, such as more instruction slots and
new instructions, there is a revolution in how instructions are executed. The
pixel processing units in R300 and NV30 are no longer a series of simple logical
switches (stages) that can only handle simple texturing and color blending
operations, but true processing units that can overcome the dramatic drop in
pixel fillrate when running a long pixel shading program.
Feature | DX8 PS1.4 | DX9 PS2.0 | R300 | NV30 | DX9 PS3.0 |
Pixel Shader Unit | 1.4 | 2.0 | 2.0 | 2.0+ | 3.0 |
Float precision | - | ✓ | ✓ | ✓ | ✓ |
Max texture input (per pass) | 6 | 16 | 16 | 16 | 16 |
Max texture addressing instructions | 12 | 32 | 32 | 1024* | 1024* |
Max arithmetic/color instructions | 16 | 64 | 128** | 1024* | 1024* |
Instruction Slots | 28 | 96 | 160 | 1024 | 1024 |
Max Runtime Instruction Number | 28 | 96 | 160 | 1024 | 64k |
Unlimited texture fetches | - | - | - | ✓ | ? |
Static flow control | - | - | - | - | ✓ |
Dynamic flow control | - | - | - | - | ✓ |
Per channel masking | - | - | - | ✓ | ✓ |
Multi render target | - | ✓ | 4 | - | ✓ |
Note:
1) - Not supported; ✓ Supported; ? Unknown.
2) * Universal instruction slots: color and texture instructions share the same
instruction space.
3) ** R300 has 32 texture instructions plus 64 ALU instructions each for scalar
and vector operations (32 + 64 + 64 = 160 slots). R300 can issue one instruction
from each of these three sets every cycle, and so execute 3 instructions per
cycle. A nice design!
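To make the difference between separate and universal instruction limits concrete, here is a minimal Python sketch that checks whether a shader with a given instruction mix fits a profile. The limits are copied from the table above; the names `PixelShaderLimits`, `PROFILES` and `fits` are invented for this illustration and are not part of any DirectX API. R300 is left out because its separate scalar and vector ALU budgets are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelShaderLimits:
    """Static instruction limits for one profile (values from the table)."""
    max_texture: int     # max texture addressing instructions
    max_arithmetic: int  # max arithmetic/color instructions
    slots: int           # total instruction slots


# Hypothetical lookup table; only the numbers come from the article.
PROFILES = {
    "PS1.4": PixelShaderLimits(max_texture=12, max_arithmetic=16, slots=28),
    "PS2.0": PixelShaderLimits(max_texture=32, max_arithmetic=64, slots=96),
    # NV30: universal slots, so both kinds draw on one pool of 1024.
    "NV30": PixelShaderLimits(max_texture=1024, max_arithmetic=1024, slots=1024),
}


def fits(profile: str, texture_ops: int, arithmetic_ops: int) -> bool:
    """Return True if the instruction mix stays within the profile's limits."""
    limits = PROFILES[profile]
    return (
        texture_ops <= limits.max_texture
        and arithmetic_ops <= limits.max_arithmetic
        # For universal slots this total is the constraint that matters.
        and texture_ops + arithmetic_ops <= limits.slots
    )


if __name__ == "__main__":
    for name in PROFILES:
        print(name, fits(name, texture_ops=40, arithmetic_ops=100))
```

With separate limits (PS1.4, PS2.0) each instruction type has its own ceiling, so a shader with 40 texture instructions is rejected by PS2.0 no matter how few arithmetic instructions it uses. With universal slots only the total matters in practice: 40 + 100 fits NV30 easily, while 600 + 600 does not, even though each count alone is below 1024.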
DX9 PS2.0 appears to follow the R300 specification closely. The strength of the NV30 PS specification is impressive: 1024 instruction slots against R300's 160 is a qualitative difference, and one of significant value to DCC (digital content creation) applications. The influence of NV30 is visible in DX9 PS3.0 too.
Another interesting point about NV30 is that its PS instructions are stored in local video memory rather than on chip. This cuts both ways: it makes managing many fragment programs cheap, but it also adds pressure on the already limited memory bandwidth.