Z Rejection Performance

Historically, graphics boards have suffered from not knowing which pixels will actually be seen on screen and which will be overwritten by others, because the depth of the final pixel isn't known until it is rendered. The cost of this only increases in shader environments, where many more clock cycles are spent on each individual pixel. Graphics chips are therefore spending more and more transistors on early Z testing schemes that reject pixels as early as possible, minimising the number of pixels that are rendered only to be overwritten later.

When we sat down with S3 at CeBIT they highlighted that the Z rejection and performance schemes were one of the areas they have significantly revisited since DeltaChrome, and they list the following elements in relation to it:

  • Sophisticated Hierarchical Z and advanced Z occlusion culling
  • Zero latency – ultra fast zero-cycle clear
  • Optimised Z buffer write and read

A Hierarchical Z buffer uses numerous layers of increasingly coarse (low resolution) Z buffer data, with the screen divided into tiles and each tile storing the highest Z value of the pixels within that tile. If the incoming pixels to be rendered all have a Z value higher than the value in the corresponding Hierarchical Z buffer tile, they can all be rejected without any render work occurring. If some have a lower value, they can be tested against a higher resolution layer until it is decided whether they need to be rendered. Read our explanation here for more details on the operation of a Hierarchical Z buffer.
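As a rough illustration of the tile test described above, the Python sketch below keeps a single coarse layer on top of a full resolution Z buffer. The tile size, the function names and the simplification of drawing a whole tile at one constant depth are assumptions made for the example; it does not describe how S3's hardware is actually built.

```python
TILE = 4  # hypothetical tile edge, in pixels


def build_tile_max(depth, tile=TILE):
    """Build the coarse layer: the highest (farthest) Z stored in each tile."""
    h, w = len(depth), len(depth[0])
    return [
        [
            max(
                depth[y][x]
                for y in range(ty, min(ty + tile, h))
                for x in range(tx, min(tx + tile, w))
            )
            for tx in range(0, w, tile)
        ]
        for ty in range(0, h, tile)
    ]


def draw_block(depth, tile_max, tx, ty, z, tile=TILE):
    """Draw one tile-sized block at constant depth z; return pixels written."""
    # Coarse test: the block is behind every pixel already stored in this
    # tile, so all of its pixels are rejected with no per-pixel work.
    if z > tile_max[ty][tx]:
        return 0

    h, w = len(depth), len(depth[0])
    written = 0
    new_max = float("-inf")
    for y in range(ty * tile, min((ty + 1) * tile, h)):
        for x in range(tx * tile, min((tx + 1) * tile, w)):
            # Fine test against the full resolution Z buffer.
            if z < depth[y][x]:
                depth[y][x] = z
                written += 1
            new_max = max(new_max, depth[y][x])
    # Keep the coarse layer in step with the fine one.
    tile_max[ty][tx] = new_max
    return written
```

Starting from a buffer cleared to 1.0, drawing a block at depth 0.5 writes every pixel in the tile and lowers the tile's stored maximum to 0.5; a later block at depth 0.8 in the same tile is then thrown away by the coarse test alone.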

We'll put the Z rejection performance to the test here first using the test application GL_REME.

Z testing checks the depth of a pixel to assess whether or not it is visible. Early Z rejection schemes can reject the pixel before it is written to the buffer, or in some cases before any work is carried out in the pixel / fragment shader pipeline. For this to work optimally, the visible pixels would be rendered first so that any pixels that won't be visible can be rejected early; rendering can rarely occur that way, though, and on balance the order is often much more random. GL_REME tests rendering performance in full back-to-front order (the worst case, as every pixel will be rendered regardless of whether it's seen), front-to-back order (the best case) and random order.
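To see why submission order matters so much, the Python sketch below counts how many overlapping layers survive a less-than depth test at a single pixel. It is a toy model only: the eight full-screen layers and the function name are assumptions for illustration and are not taken from GL_REME itself.

```python
import random


def shaded_fragments(layer_depths):
    """Count the layers that pass a less-than depth test, in submission order."""
    nearest = float("inf")
    shaded = 0
    for z in layer_depths:
        if z < nearest:  # visible so far: it has to be shaded and written
            nearest = z
            shaded += 1
        # otherwise an early Z scheme can reject it before any shading
    return shaded


# Eight overlapping layers (hypothetical, loosely echoing an overdraw factor of 8).
layers = [i / 8 for i in range(1, 9)]

front_to_back = sorted(layers)                # nearest layer submitted first
back_to_front = sorted(layers, reverse=True)  # farthest layer submitted first
shuffled = layers[:]
random.Random(0).shuffle(shuffled)            # fixed seed so runs are repeatable

print("front-to-back:", shaded_fragments(front_to_back))
print("back-to-front:", shaded_fragments(back_to_front))
print("random:", shaded_fragments(shuffled))
```

Front-to-back shades a single layer per pixel and back-to-front shades all eight, with a random order falling somewhere in between.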

Board      Overdraw Factor  Back-to-Front  Front-to-Back  Random  Front-to-Back Gain  Random Gain
S18 Nitro  3                485.5          1113.8         718.3   129%                48%
S18 Nitro  8                198.3          1020.9         510.6   415%                157%
X600 PRO   3                390.9          570.2          479.2   46%                 23%
X600 PRO   8                142.4          238.2          205.3   67%                 44%
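The two percentage figures in each row are consistent with the gain of the front-to-back and random results over the back-to-front result. A minimal Python check of that arithmetic, using the S18 Nitro figures from the table (the function name is our own):

```python
def gain_percent(baseline, value):
    """Percentage increase of value over baseline."""
    return (value / baseline - 1.0) * 100.0


# S18 Nitro, overdraw factor 3: back-to-front, front-to-back, random
back_to_front, front_to_back, random_order = 485.5, 1113.8, 718.3

print(round(gain_percent(back_to_front, front_to_back)))  # 129
print(round(gain_percent(back_to_front, random_order)))   # 48
```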

We can see that rendering performance increases significantly on the S18 Nitro going from back-to-front to front-to-back rendering, with lower gains for random ordering, as would be expected. The S18's gains are significantly greater than the X600 PRO's: the X600 only features a pixel level Z-reject and, unlike the higher end Radeons, no Hierarchical Z buffer. This highlights that the S18's early Z reject schemes are working more effectively than the X600's.

Note: The two GeForce 6 boards would not run this test with the drivers being used, which is why their results are excluded.

VillageMark is another application that tests the Z rejection performance by rendering multiple overdrawn layers with multiple texture layers per surface.



VillageMark results:
S18 Nitro  371.0   244.0   149.0   91.0    61.0
X600 PRO   251.0   166.0   108.0   65.0    44.0
6600       395.0   264.0   172.0   106.0   74.0
6200       62.7    48.0    34.2    24.7    16.8

S18 Nitro difference relative to:
X600 PRO   47.8%   47.0%   38.0%   40.0%   38.6%
6600       -6.1%   -7.6%   -13.4%  -14.2%  -17.6%
6200       491.5%  408.1%  335.1%  268.4%  263.0%

In this test the S18 Nitro is second only to the 6600. The 6200 lacks a number of bandwidth saving features, which is why there is such a performance gap between it and the S18 Nitro. The 6600, on the other hand, has ZCULL capabilities, which do a similar job to the Hierarchical Z by removing groups of pixels before they begin rendering if they fail a coarse Z test, and this scales with the number of internal shader pipelines (8 in the case of the 6600). The difference between the texture rates of the S18 Nitro and the 6600 (which correspond to the number of internal pipelines) is higher than the performance difference here, which indicates that the S18 Nitro's early Z rejection schemes are working fairly effectively.