While we have an entire performance piece in the works, there's some architecture-specific performance data we'd like to share before we publish that part of the G80 article series.
NVIDIA's documentation for G80 states that each SP can dual-issue a scalar MADD and a scalar MUL per cycle, retiring a result from each every cycle. The thing is, we couldn't find the MUL, and we know another Belgian graphics analyst who's having the same problem. No matter the dependent instruction window in the shader, the peak MUL issue rate (publicly quoted by NVIDIA at Editor's Day) never appears during general shading.
We can push almost every other instruction through the hardware at close to peak rates, with minor bubbles or inefficiencies here and there, but dual-issuing that MUL is proving difficult. It turns out that the MUL isn't part of the SP ALU; rather, it sits serial to the interpolator/SF hardware, executing after it, leaving it (currently) for attribute interpolation and perspective correction. Since RCP was a free op on G7x, you got 1/w for nothing on that architecture, helping set up texture coordinates while shading. It's not free any more, so on G80 it's calculated at the beginning of every shader and stored in a register instead. It's not outwith the bounds of possibility that the MUL will be exposed for general shading in future driver revisions, depending on instruction mix and the particular shader.
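Some rough arithmetic shows what that MUL is worth on paper. Using the public 8800 GTX configuration (128 SPs at a 1.35GHz shader clock; assumed here rather than measured, and counting a MADD as 2 flops), a quick sketch:

```python
# Rough peak programmable shading rates for G80, assuming the public
# 8800 GTX figures: 128 SPs at a 1.35GHz shader clock. A MADD counts
# as 2 flops per clock; the documented co-issued MUL would add 1 more.
SP_COUNT = 128
SHADER_CLOCK_GHZ = 1.35

madd_only_gflops = SP_COUNT * SHADER_CLOCK_GHZ * 2            # MADD per SP per clock
madd_plus_mul_gflops = SP_COUNT * SHADER_CLOCK_GHZ * (2 + 1)  # if the MUL dual-issues

print(f"MADD only:     {madd_only_gflops:.1f} GFLOPS")        # 345.6
print(f"MADD + MUL:    {madd_plus_mul_gflops:.1f} GFLOPS")    # 518.4
print(f"MUL shortfall: {madd_plus_mul_gflops - madd_only_gflops:.1f} GFLOPS")
```

In other words, roughly a third of the quoted peak rides on that MUL co-issuing, which is why its absence during general shading is worth chasing.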
Bar the 'missing' MUL, getting (close to) peak base ALU rates was fairly elementary with our own shaders, and different instruction mixes testing dependent and non-dependent instruction groups all run as you'd expect. Significant time spent with special function rates prompted the hallelujah-esque discovery one evening that special function and interpolation are tied together, something NVIDIA's specification doesn't tell you.
It seems the shader compiler/assembler is in a decent state, then. Getting 150-170G scalar instructions/sec through the hardware is easy, hiding texturing latency works as expected (you texture for free given enough non-dependent math), and seeing special functions execute at quarter speed (remember, they take 4 cycles to execute, a consequence of the quadratic approximation used) isn't difficult.
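That quadratic approximation is worth a moment's illustration: the special function unit looks up per-interval coefficients for the input and evaluates c0 + c1·dx + c2·dx². A toy sketch of the idea for RCP over [1,2), with per-interval Taylor coefficients standing in for whatever fixed tables the real hardware uses (an assumption on our part, for illustration only):

```python
# Toy piecewise-quadratic RCP over [1, 2), in the spirit of G80's special
# function hardware. Real silicon uses fixed coefficient tables; we substitute
# per-interval Taylor coefficients of 1/x at the interval midpoint.
INTERVALS = 64
WIDTH = 1.0 / INTERVALS

def rcp_approx(x: float) -> float:
    assert 1.0 <= x < 2.0
    i = int((x - 1.0) / WIDTH)          # table index (think: high mantissa bits)
    m = 1.0 + (i + 0.5) * WIDTH         # interval midpoint
    c0, c1, c2 = 1/m, -1/m**2, 1/m**3   # quadratic coefficients for 1/x at m
    dx = x - m
    return c0 + c1*dx + c2*dx*dx        # one short MADD chain per evaluation

# Even a small 64-entry table keeps worst-case error tiny across the range
err = max(abs(rcp_approx(1.0 + k/4096) - 1/(1.0 + k/4096)) for k in range(4096))
print(f"max abs error: {err:.2e}")
```

The short multiply-add chain per evaluation is also a plausible reading of why the unit needs its 4 cycles rather than one.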
Having hit our stride once instruction issue testing was mostly put to bed, we've spent (and continue to spend) good time verifying the performance of the ROP hardware, including verification of near 70Gpixels/sec of Z-only fill, performant 4xMSAA as expected (and 8x too as it turns out, but you need to choose your test wisely since the mix of CSAA can throw you off), and blend rates roughly equal to what NVIDIA quote. We're close to completing blend rate testing, with just FP32 to go.
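As a sanity check on those ROP numbers, the basic fill-rate arithmetic is simple. Assuming the public 8800 GTX configuration (24 ROPs at a 575MHz core clock) and one colour pixel per ROP per clock as the baseline, the Z-only figure we measure implies each ROP handles several Z samples per clock:

```python
# Fill-rate sanity check, assuming the public 8800 GTX config:
# 24 ROPs at a 575MHz core clock, one colour pixel per ROP per clock.
ROPS = 24
CORE_CLOCK_GHZ = 0.575

colour_fill = ROPS * CORE_CLOCK_GHZ   # Gpixels/sec colour baseline
measured_z_only = 70.0                # Gpixels/sec, roughly what we measure

print(f"colour fill:        {colour_fill:.1f} Gpixels/sec")            # 13.8
print(f"implied Z-only rate: {measured_z_only / colour_fill:.1f}x colour")
```

The ~5x ratio falling out of the measurement is derived from our numbers, not from NVIDIA's documentation, so treat the per-ROP breakdown as inference rather than spec.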
Texturing performance was up next (or was it before ROP performance? Too many late nights testing...), and verifying the sometimes astounding sample rates with filtering is something we've been working on recently. The hardware can and will filter FP32 textures, and we're seeing close to expected performance. Along with all that, we've also measured performance in a number of modern games, so we don't go completely overboard on the theory. G80 is damn fast in most modern games, and it seems tuned for 2560x1600, sustaining high performance at that resolution and largely doubling (or more) what any other current single board is capable of. You can see some numbers from that testing over at HEXUS. In short, we've worked hard to make sure the hardware does what it says it can, and we're happy that G80 doesn't cut major corners. We'll sum it all up in the Performance piece in the near future.