Shadow Map Acceleration and Fetch4

Standard filtered texture sampling in most DirectX operations takes 4 texture samples (texels) and filters them to give a final (R, G, B, A) colour value. However, there is another mode available, primarily implemented in NVIDIA hardware, that steps outside these bounds a little, known as Percentage Closer Filtering (PCF).

PCF is often used within shadow mapping techniques, whereby the scene depths are rendered to a single format texture, or Depth Stencil Texture (DST), for each light. When rendering the colour pass, the depth texture is sampled to determine whether the current pixel should be shadowed from the light source or not. PCF operates specifically on these DST formats, by sampling 4 texels, each giving a 0 (not shadowed) or 1 (shadowed) value. These four values are then averaged to give a percentage value (0%, 25%, 50%, 75%, 100%) of shade for that pixel. The entire operation is done within the texture sampler, with the final result being the percentage filter value. The net result is that soft-edged shadows can be created fairly quickly with 4 gradients of shadow, and more shadow gradients can be generated by sampling multiple times.
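As a rough illustration of the compare-then-average step described above, here is a minimal Python sketch. It follows the article's simplified description (an unweighted average of four 0/1 results); the function name `pcf_2x2`, the `bias` parameter, the clamped addressing and the sample depth values are all assumptions made for the example, not anything taken from the hardware or from DirectX.

```python
def pcf_2x2(depth_map, x, y, pixel_depth, bias=0.0):
    """Return the shadowed fraction for the 2x2 texel block whose top-left texel is (x, y)."""
    height = len(depth_map)
    width = len(depth_map[0])
    shadowed = 0
    for dy in (0, 1):
        for dx in (0, 1):
            # Clamp to the texture edge (an assumed addressing mode).
            tx = min(x + dx, width - 1)
            ty = min(y + dy, height - 1)
            # 1 = shadowed: the depth stored in the map is nearer to the light.
            if pixel_depth - bias > depth_map[ty][tx]:
                shadowed += 1
    # Average of four 0/1 results: 0.0, 0.25, 0.5, 0.75 or 1.0.
    return shadowed / 4.0


# Hypothetical light-space depths: an occluder at 0.30 in the top-left corner.
depth_map = [
    [0.30, 0.30, 0.90],
    [0.30, 0.90, 0.90],
    [0.90, 0.90, 0.90],
]
print(pcf_2x2(depth_map, 0, 0, 0.5))  # 3 of the 4 texels are nearer than 0.5
```

On PCF hardware all of this happens inside the texture sampler, so the shader would only ever see the final fraction.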

PCF use has proved to be a bit of a problem for vendors such as ATI, though. The operation was not documented within DirectX and, indeed, is atypical texture behaviour. However, its use has been popularised by its implementation within the original XBOX, which uses NVIDIA’s NV2A processor and a closed API, where the behaviour was fully documented for developer use. Although DirectX initially didn’t make much note of this functionality directly, it did still function on NVIDIA’s PC graphics processors under DirectX, making for easy porting of titles with full soft shadowing from XBOX to PC. Another issue is that the technique is actually SGI’s intellectual property, which NVIDIA have access to by way of a technology agreement the two companies reached many years ago.

Hardware without such PCF capabilities left developers with the choice of sampling just once from the shadow texture, which would result in hard-edged shadows in the final frame, or taking several single samples and averaging the results within the pixel shader. Either of these scenarios has ramifications for quality and/or performance relative to PCF, so to this end ATI have implemented Fetch4 in their latest hardware.

The name describes its operation effectively since, basically, what it does is sample 4 values (texels) from the depth map at once. It does this by using the RGBA channels, which would usually carry the colour values of a single texel in a multi-format texture, to read in 4 adjacent texels of information from a single format texture, thus providing a 4-fold sampling rate (or equivalent to PCF). This functionality becomes quite important for a design such as R580, given its texture sampling rate relative to its programmable ALU capabilities.

While Fetch4 is similar to PCF, it is not the same. Although similar code can be used, developers will need to alter things between Fetch4 and PCF operations should they want to use it for effects such as shadow mapping; it is something that needs to be specifically coded for. Fetch4 also stops short of doing any mathematical operations automatically, only providing the 4 texels of data in a single sample, so the depth compare and averaging processes will need to be done in the shader. However, with the ALU capabilities of R580 (and RV530) this only costs around a couple of cycles, providing a 2x-3x performance increase over sampling 4 different texels across multiple clocks and then doing the math in the shader (although, in both cases, some of those operations will be hidden a little by the instruction scheduler since they are reliant on both texture and math operations).
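To make the split of work concrete, the following Python sketch emulates the idea: one lookup hands back four neighbouring depth texels, and the compare and average are then done "in the shader". The names `fetch4`, `shadow_from_fetch4` and `fetches_for_kernel`, the channel ordering and the clamped addressing are assumptions for illustration only and do not describe ATI's actual interface.

```python
def fetch4(depth_map, x, y):
    """Emulate one Fetch4 lookup: four neighbouring texels returned as one 4-tuple."""
    height, width = len(depth_map), len(depth_map[0])
    x1 = min(x + 1, width - 1)
    y1 = min(y + 1, height - 1)
    # The channel order here is an assumption made for this sketch.
    return (depth_map[y][x], depth_map[y][x1], depth_map[y1][x], depth_map[y1][x1])


def shadow_from_fetch4(texels, pixel_depth):
    """The part Fetch4 leaves to the shader: depth compare, then average."""
    compared = [1.0 if pixel_depth > t else 0.0 for t in texels]
    return sum(compared) / 4.0


def fetches_for_kernel(size, use_fetch4):
    """Texture fetches needed for a size x size PCF kernel (size assumed even)."""
    texels = size * size
    return texels // 4 if use_fetch4 else texels


# Hypothetical 2x2 depth map with one occluding texel.
depth_map = [
    [0.30, 0.90],
    [0.90, 0.90],
]
texels = fetch4(depth_map, 0, 0)
print(shadow_from_fetch4(texels, 0.5))
print(fetches_for_kernel(8, False), fetches_for_kernel(8, True))
```

The fetch count is the point of the exercise: an 8x8 kernel covers 64 texels either way, but with four texels per lookup only a quarter as many lookups are issued.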


Performance cost and IQ of filtering methods

Filtering method  Performance  JPG version  PNG version
Single tap  693fps  link  link
2x2 PCF  628fps  link  link
8x8 PCF  159fps  link  link
Fetch4 2x2 PCF  689fps  link  link
Fetch4 8x8 PCF  263fps  link  link
12-tap rotated Poisson disc PCF  119fps  link  link
Fetch4 12-tap rotated Poisson disc PCF  256fps  link  link
Fetch4 12-tap randomly rotated Poisson disc PCF with edge processing and depth extent  338fps  link  link

The images above are from an ATI test application that demonstrates a variety of shadow mapping techniques, including ATI's Fetch4 capabilities. The first image shows a single sample shadow map, which results in a very hard-edged, aliased shadow. The next image, "2x2 PCF", takes a 2x2 block of samples from the shadow map and applies the percentage closer filter to them, giving a slightly softer shadow edge, but with only 3 intermediate gradients. This is taking 4 separate texture samples and then performing the depth comparison and PCF calculation in the shader. The 8x8 PCF performs the PCF calculation on a total of 64 samples, and we can see that the shadow edge quality is much better, but performance drops to around a quarter of the 2x2 PCF case. The next test again uses 2x2 PCF filtering, giving the same quality as the first 2x2 PCF case, but utilises Fetch4 for the texture sampling, so performance is close to the initial single tap case. Using Fetch4 in the 8x8 PCF still carries a high performance penalty, but is significantly faster than without Fetch4.
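The relative costs quoted above can be checked directly from the table. The short Python snippet below only restates the published frame rates and divides them; the dictionary layout and the `ratio` helper are just conveniences for the example.

```python
# Frame rates copied from the table above (frames per second).
fps = {
    "Single tap": 693,
    "2x2 PCF": 628,
    "8x8 PCF": 159,
    "Fetch4 2x2 PCF": 689,
    "Fetch4 8x8 PCF": 263,
}


def ratio(a, b):
    """Frame rate of method a divided by the frame rate of method b."""
    return fps[a] / fps[b]


# 8x8 PCF relative to 2x2 PCF: roughly a quarter of the frame rate.
print(round(ratio("8x8 PCF", "2x2 PCF"), 2))
# Gain from Fetch4 at each kernel size.
print(round(ratio("Fetch4 2x2 PCF", "2x2 PCF"), 2))
print(round(ratio("Fetch4 8x8 PCF", "8x8 PCF"), 2))
```

The gain from Fetch4 is small at 2x2, where the test is already close to the single tap frame rate, and much larger at 8x8, where texture fetches dominate.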

The last 3 images are examples of some higher quality filtering techniques that ATI have demonstrated. The first of these, the 12 Tap Rotated Poisson Disk, with each tap using 2x2 PCF, uses a non-uniform sampling pattern within the disk, which is rotated per pixel to give slightly different sampling positions. This results in smoother gradients than the 8x8 (fixed) PCF, despite there being fewer shadow map samples overall. The next case uses the same technique, but the 2x2 samples are achieved with Fetch4, putting the performance very close to Fetch4 8x8 PCF.
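A per-pixel rotated disc of taps might be sketched as follows. This is only an assumption-laden illustration in Python: the twelve offsets are hand-picked placeholders rather than the pattern used in ATI's demo, and `pixel_angle` is a hypothetical hash invented here to give each pixel its own rotation. In the technique described above, each resulting tap position would then feed a 2x2 PCF lookup.

```python
import math

# Twelve hand-picked offsets inside the unit disc. These are illustrative
# placeholders, not the pattern used by ATI's demo.
DISC_OFFSETS = [
    (0.00, 0.00), (0.50, 0.10), (-0.45, 0.25), (0.10, -0.55),
    (-0.20, 0.60), (0.75, -0.40), (-0.80, -0.30), (0.35, 0.80),
    (-0.55, -0.70), (0.90, 0.30), (-0.10, -0.95), (-0.90, 0.35),
]


def pixel_angle(px, py):
    """Hypothetical per-pixel rotation angle derived from a cheap integer hash."""
    h = (px * 73856093) ^ (py * 19349663)
    return (h % 1024) / 1024.0 * 2.0 * math.pi


def rotated_taps(px, py, radius):
    """Rotate the fixed disc by this pixel's angle and scale it to `radius` texels."""
    angle = pixel_angle(px, py)
    c, s = math.cos(angle), math.sin(angle)
    return [((ox * c - oy * s) * radius, (ox * s + oy * c) * radius)
            for ox, oy in DISC_OFFSETS]


# Neighbouring pixels get the same disc at different orientations.
for tap in rotated_taps(10, 20, 2.0)[:3]:
    print(tap)
```

Because the rotation changes from pixel to pixel, the regular banding of a fixed kernel is traded for noise, which is why fewer taps can still give smoother-looking gradients.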

Shadow mapping is also a very good case for dynamic branching, as a branch test can determine whether a pixel is fully in shadow, fully out of shadow or on a shadow edge, with the code branching only when it is on the edge. The final test shown above performs just a single tap PCF lookup and calculation for pixels that are fully in or out of shadow, but where a pixel is detected to be on the edge the code branches so that the same 12 Tap Rotated Poisson Disk filter from the previous test is used (with Fetch4). Because the high tap filtering is applied only where it is actually required, rather than on all pixels, performance increases.
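The branching idea can be sketched in Python as below. How ATI's demo actually detects an edge is not described here, so the test used in this sketch (a 2x2 block whose four comparisons disagree) is an assumption, as are the names `shade_pixel` and `wide_pcf`; the latter is a plain box filter standing in for the 12-tap Poisson disc filter.

```python
def wide_pcf(depth_map, x, y, pixel_depth, half=1):
    """Stand-in for the costly filter: box PCF over a (2*half+2)^2 texel window."""
    height, width = len(depth_map), len(depth_map[0])
    hits = total = 0
    for ty in range(y - half, y + half + 2):
        for tx in range(x - half, x + half + 2):
            cx = min(max(tx, 0), width - 1)   # clamp to the texture edge
            cy = min(max(ty, 0), height - 1)
            hits += 1 if pixel_depth > depth_map[cy][cx] else 0
            total += 1
    return hits / total


def shade_pixel(depth_map, x, y, pixel_depth, expensive_filter):
    """Return (shadow_fraction, used_expensive_path) for one pixel."""
    height, width = len(depth_map), len(depth_map[0])
    x1, y1 = min(x + 1, width - 1), min(y + 1, height - 1)
    texels = (depth_map[y][x], depth_map[y][x1], depth_map[y1][x], depth_map[y1][x1])
    hits = sum(1 for t in texels if pixel_depth > t)
    if hits == 0:
        return 0.0, False   # fully lit: cheap path
    if hits == 4:
        return 1.0, False   # fully shadowed: cheap path
    # Mixed results: assumed to be a shadow edge, so take the costly branch.
    return expensive_filter(depth_map, x, y, pixel_depth), True


# Hypothetical depth map: occluder on the left half, open ground on the right.
depth_map = [[0.3, 0.3, 0.9, 0.9] for _ in range(4)]
for x in (0, 1, 2):
    print(x, shade_pixel(depth_map, x, 1, 0.5, wide_pcf))
```

Only the pixel straddling the boundary pays for the wide filter, which mirrors why the final test in the table runs faster than the unbranched Fetch4 Poisson disc case.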

Fetch4 is not just implemented within R580 but in RV530 and RV515 as well, although curiously not R520. Because of the relatively low shader capabilities of R520 in relation to R580, it's more likely to be shader bound on operations such as these anyway, so the increase in sample time is less likely to be an issue. With R580, though, its math capability is so high in relation to its number of texture samplers that it's more important for its texture utilisation to be optimised, so wasting 3 cycles on single precision texture formats is going to bottleneck it more.

ATI have supported the 16-bit integer depth texture format DF16 since RV350 (Radeon 9600), but with R580, RV530 and RV515 ATI also support the 24-bit depth format DF24. Because DF24 support was introduced at the same time as Fetch4, developers can test for the DF24 format and will know that Fetch4 is also supported. Both DF24 and Fetch4 have been accessible to developers since the release of products based on these chips, and in fact ATI have been handing RV530s to developers as a basis for R580 development for some time.

Because Fetch4 is not tied to any specific kind of operation, just returning 4 samples in a single cycle from a single format texture, it can be used for tasks other than shadow map filtering. Example uses beyond shadow mapping include higher quality filtering than bilinear, Perlin noise evaluation, and edge filtering.