Pixel Shader

The texture and shader core are SIMD in nature, as is usually the case with Pixel Shader cores. With the two ALUs of the shader core it appears that any vestiges of NVIDIA's legacy register combiner structure are now fully removed, with the second ALU taking over from the combiners. With 16 pipelines and two ALUs per pipe this gives NV40 a total of 32 pixel shader ALUs.

The above diagrams highlight that, as with NV3x, one of the two shader ALUs is required for some texture operations. If a texture is being sampled, only the second ALU is available for arithmetic; if there are no texture accesses, instructions can be dual-issued across both ALUs. Each ALU also has what NVIDIA describe as "mini ALU capabilities" which, from talking to Emmett, appear to be PS1.4 modifiers, so modifier operations could come for free in NV4x's ALUs. While the majority of the instructions are FP32 native, the first ALU does have a specific FP16 normalise instruction, which makes this operation free. Each of the two ALUs is capable of complex instructions, but their instruction sets are not quite direct copies of one another.

NVIDIA are describing each of the ALUs as "Super Scalar". A vector ALU usually operates on 4 components at any one time, yet an increasing number of shader instructions do not operate on all components in a single instruction. The ALUs of NV4x are capable of co-issuing instructions so that they utilise all the components available: if one instruction doesn't use all components and there is another that can fit in the remaining components, both can be executed on a single ALU in one cycle. We've previously seen a co-issue split of one 3-component vector operation and one scalar operation simultaneously, though NV4x's pipeline is a little more flexible in that it can handle any combination of two instructions up to a total of 4 components (i.e. 1 + 1, 1 + 2, 2 + 2, 1 + 3). Most hardware that is capable of 3 + 1 co-issue can also do the other combinations of 1 + 1 or 2 + 1 by way of write masks and swizzles, so, really, the only new mode is 2 + 2.
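The co-issue rule described above can be illustrated with a short enumeration. This is only a sketch of the packing constraint as stated in the text (two instructions, at least one component each, at most 4 components in total); the function names are hypothetical and it says nothing about how the hardware or compiler actually schedules instructions.

```python
from itertools import combinations_with_replacement

VECTOR_WIDTH = 4  # components handled by one vector ALU per cycle


def co_issue_splits(width=VECTOR_WIDTH):
    """List every pair of component counts (a <= b) that fits in one ALU."""
    return [
        (a, b)
        for a, b in combinations_with_replacement(range(1, width), 2)
        if a + b <= width
    ]


def can_co_issue(first, second, width=VECTOR_WIDTH):
    """True if two instructions using this many components fit in one ALU."""
    return first >= 1 and second >= 1 and first + second <= width


if __name__ == "__main__":
    for a, b in co_issue_splits():
        print(f"{a} + {b}")
```

The enumeration yields the same four splits listed in the text, and a pairing such as 3 + 2 is rejected because it would need five components.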

In all non-texture situations, each of NV40's pipelines can execute up to 4 instructions per cycle (two co-issued instructions on each of the two ALUs) across up to 8 components. Also note that, as with the Vertex Shader, the hardware itself imposes no limit on the number of instructions executed; any limits come from the API.
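As a rough check on these figures, the per-pipeline peaks can be derived from the numbers given in the text. The sketch below is a toy model, not NVIDIA's description: it assumes that a texture access occupies one ALU for the whole cycle, and the chip-wide totals are simple multiplications of the per-pipe peaks rather than quoted specifications. All names are hypothetical.

```python
PIPELINES = 16          # pixel pipelines in NV40 (from the text)
ALUS_PER_PIPE = 2       # shader ALUs per pipeline (from the text)
COMPONENTS_PER_ALU = 4  # vector width of each ALU
CO_ISSUE_PER_ALU = 2    # at most two co-issued instructions per ALU


def peak_per_pipe(texture_access):
    """Return peak (instructions, components) per cycle for one pipeline."""
    # Assumption: a texture access ties up one ALU for the whole cycle.
    free_alus = ALUS_PER_PIPE - 1 if texture_access else ALUS_PER_PIPE
    return free_alus * CO_ISSUE_PER_ALU, free_alus * COMPONENTS_PER_ALU


def peak_per_chip(texture_access):
    """Scale the per-pipe peak across all pipelines (theoretical maximum)."""
    instructions, components = peak_per_pipe(texture_access)
    return PIPELINES * instructions, PIPELINES * components


if __name__ == "__main__":
    print("total ALUs:", PIPELINES * ALUS_PER_PIPE)
    print("per pipe, no texture:", peak_per_pipe(False))
    print("per pipe, texture:", peak_per_pipe(True))
    print("per chip, no texture:", peak_per_chip(False))
```

Under these assumptions the non-texture case reproduces the 4 instructions and 8 components per pipeline quoted above, while the texture case halves both. Real shaders will rarely reach these peaks, since they depend on every ALU slot being filled each cycle.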

The shader core is somewhat different to the NV3x shader core, though it should already be able to make use of the shader compiler optimiser from NV3x, and as time goes on more performance may be extracted as the NV4x pipeline becomes better understood.

One optional element of DirectX9 that NV3x’s pixel shader pipeline didn’t support was Multiple Render Targets (MRTs), which allow up to 4 values to be output from the pixel shader as opposed to just one. NV4x does support MRTs, and the above image is taken from ATI’s Depth of Field demonstration, which utilises MRTs to store colour information in one render target and depth / blur factor in another. This demo fails to operate on NV3x, but runs fine on NV40.