General Shading Performance

There's little to say about general shading performance for G80, other than it's very good. With simple shaders you can push instructions through the hardware at close to peak rates (more on that shortly), with the driver doing a good job of reordering, assembling and compiling the shaders you present in order to reach that goal.

In the original performance piece we mentioned that the dual-issue MUL capability NVIDIA presented as part of G80's shader processor makeup wasn't present. Rather, we concluded that the second 'missing' MUL was actually part of the special function and interpolation ALU, in series with SF/interpolation and thus with the general shading ALUs in each cluster.

In short, during general shading on G80 in the beginning, you couldn't push more than one MUL per clock per SP through the hardware. However, with recent driver builds and a dependent MUL instruction stream, it's possible to see about 1.15x the MUL rate expected from the SPs alone (tested with HLSL only). That indicates the driver is opening up the MUL unit to general shading when interpolated attributes aren't needed and special function calculations need not be performed. We test using a full-screen quad (two triangles aligned to screen space) and just run our simple pixel shader code across those pixels.

That 1.15x happens with a short-ish shader, and falls off at very high resolutions and with long shaders. That points to bottlenecks elsewhere in the architecture (4MP is a lot of pixels to shade, even with a simple test), resource allocation issues, or a choke condition somewhere in the hardware or driver.

Concluding, and all things considered, it's safe to assume that you won't encounter huge performance cliffs in general shading. It's not hard to run around 160 scalar Gflops in general shading with common instructions on 8800 GTX, and that doesn't count special function execution, interpolation horsepower, filtering flops or other processing that'd count towards G80's final total in 8800 GTX form.

Of course that depends on your shaders and how you sample, as it does with all architectures, but the basic performance is there. For special function shading, single precision float performance is capped at a quarter of the general shading rate, as mentioned in the architecture analysis back in November.
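As a rough sanity check on those rates, the arithmetic can be sketched in a few lines of Python. The SP count (128) and shader clock (1.35GHz) used here are assumed figures for 8800 GTX and aren't stated in this section, and the function name is purely illustrative; treat it as a back-of-the-envelope sketch rather than a measurement.

```python
# Assumed 8800 GTX figures; not stated in this section of the article.
SP_COUNT = 128            # scalar shader processors (assumption)
SHADER_CLOCK_GHZ = 1.35   # shader clock domain in GHz (assumption)


def peak_gflops(flops_per_sp_per_clock: float) -> float:
    """Theoretical rate in Gflops for a given per-SP, per-clock throughput."""
    return SP_COUNT * SHADER_CLOCK_GHZ * flops_per_sp_per_clock


# One scalar op per SP per clock is the general shading baseline.
general_peak = peak_gflops(1.0)

# Special function work is capped at a quarter of the general rate.
special_function_cap = general_peak / 4

print(f"general shading baseline: {general_peak:.1f} Gflops")
print(f"special function cap:     {special_function_cap:.1f} Gflops")
```

Under those assumptions, the "around 160 scalar Gflops" quoted above sits a little below the general shading baseline, which fits the claim that simple shaders run close to peak.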

Measured performance with the 101.41 driver under Vista x64 with GeForce 8800 GTX is as follows, for some common instructions (all performed on singles of course, since the shader core is natively FP32). Remember the second MUL ALU is part of the SFU, hence the > 1x ratio versus what's available for general shading in the SP:

  • MUL: 199 Gflops (~1.15x)
  • DP4: 136 Gflops (~0.78x)
  • DP3: 165 Gflops (~0.95x)
  • ADD: 167 Gflops (~0.97x)
  • LRP: 161 Gflops (~0.93x)
  • SIN: 41 Gflops (~0.95x)
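To see what baseline those ratios imply, each measured figure can be divided by its quoted ratio. The short Python sketch below does that using only the numbers in the list; the variable names are illustrative, and reading the SIN ratio as relative to the quarter-rate special function cap is our interpretation of the figures rather than something the list states.

```python
# Measured Gflops and quoted ratio for each instruction, taken from the list above.
measured = {
    "MUL": (199, 1.15),
    "DP4": (136, 0.78),
    "DP3": (165, 0.95),
    "ADD": (167, 0.97),
    "LRP": (161, 0.93),
    "SIN": (41, 0.95),
}

# Baseline implied by each entry: measured rate divided by its ratio.
implied = {name: gflops / ratio for name, (gflops, ratio) in measured.items()}

# Average the general shading entries, keeping the special function (SIN) apart.
general = [rate for name, rate in implied.items() if name != "SIN"]
general_mean = sum(general) / len(general)
sin_fraction = implied["SIN"] / general_mean

for name, rate in implied.items():
    print(f"{name}: implied baseline {rate:.1f} Gflops")
print(f"SIN baseline as a fraction of the general baseline: {sin_fraction:.2f}")
```

The general instructions all imply roughly the same baseline, and the SIN figure implies one about a quarter of that size, which lines up with the special function cap described earlier.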