Architecture Performance Analysis

While game performance will be covered in a follow-up article, for this architecture analysis we've taken a close look at base architecture performance, particularly the shader core and the filtering hardware.

Shader Core Performance

Given the way the shader core works, with each shader unit processing a single object per clock using 5 independent ALUs, our existing shaders for measuring maximum available throughput needed to be rewritten. We ran the old and new shaders through two new test apps, one driving them through D3D9 and the other through D3D10, to check for any API-level performance differences. The shaders aren't like anything you'd find in a shipping game title or production app; they're explicitly designed to test the hardware's peak shader throughput.

The old code contained streams of dependent instructions, mostly to defeat compiler optimisations, and that obviously trips R600 up when the operands are single-component, dropping it to roughly 1/5th of peak ALU throughput. All the shaders run as pixel shaders, something we'll mix up in future versions of the tool to test load balancing and mixed-thread-type throughput.
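To make the dependent-versus-independent distinction concrete, here's a toy model of a 5-way shader unit like R600's: per clock the scheduler can issue up to 5 scalar instructions, but only ones whose inputs have already been computed. This is purely an illustrative sketch, not AMD's actual compiler or scheduler.

```python
def clocks_needed(instructions):
    """instructions: list of (name, set_of_dependency_names).
    Greedily packs up to 5 ready instructions per clock."""
    done = set()
    pending = list(instructions)
    clocks = 0
    while pending:
        # only instructions whose dependencies are already complete may issue
        issue = [i for i in pending if i[1] <= done][:5]
        if not issue:
            raise ValueError("dependency cycle")
        for name, _ in issue:
            done.add(name)
        pending = [i for i in pending if i not in issue]
        clocks += 1
    return clocks

# 10 independent MADs: 5 pack per clock, so 2 clocks (peak rate)
independent = [(f"i{n}", set()) for n in range(10)]
# 10 MADs in a serial chain: 1 per clock, so 10 clocks (1/5th peak)
dependent = [(f"d{n}", {f"d{n-1}"} if n else set()) for n in range(10)]
```

Under this model the dependent stream runs at exactly one fifth of the independent stream's rate, which matches the behaviour we measured.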

So the basic performance tenets of the shader core hold up: instruction throughput is effectively the same via each API, and the compiler struggles only a little with our independent MAD shader in reaching peak (some 236.8 Ginst/sec at 740MHz). The dependent shader tests show that what an architecture like G80 will eat up, an architecture like R600 will struggle with, the hardware unable to schedule free instructions on the remaining ALUs. Of course, these are best and worst cases for scalar throughput; look at a shipping game shader and you'll find varying channel widths, massively variant instruction mixes and other opportunities for hardware like R600 to schedule intelligently.
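The peak figure falls straight out of the ALU count and clock. A quick sanity check, assuming the commonly cited organisation of 64 shader units of 5 scalar ALUs each (320 lanes total):

```python
# Peak scalar instruction rate implied by the measured 236.8 Ginst/sec:
# 64 shader units x 5 scalar ALUs = 320 lanes, at 740MHz.
shader_units = 64
alus_per_unit = 5
clock_hz = 740e6

peak_inst_per_sec = shader_units * alus_per_unit * clock_hz
print(peak_inst_per_sec / 1e9)  # 236.8 Ginst/sec

# A fully dependent single-component stream can keep only 1 of the
# 5 ALUs per unit busy each clock:
worst_case_per_sec = shader_units * 1 * clock_hz
print(worst_case_per_sec / 1e9)  # ~47.4 Ginst/sec, 1/5th of peak
```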

While we've not had time to check everything out, we also note that transcendentals all seem to run at 1 per clock, barring RCP, which seems to run at half rate (at least in our simple shader). Float-to-int conversion takes one clock per scalar channel converted (as does the reverse, int to float), and the hardware will do a dp4 in one clock if all four thin ALUs are free (likewise dp3 and dp2, of course). We've got a more extensive set of integer benchmark tests on the way that we'll run back to back on G80 and R600. Theoretically, maximum integer throughput should be very high on R600 for ADD, when the hardware can schedule it.
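The dot product behaviour can be modelled as one multiply-accumulate lane per component: a dpN issues in a single clock only when it can claim a thin ALU per component. The multi-clock fallback below is our assumption for illustration; the text only confirms the one-clock case when all four thin ALUs are free.

```python
import math

def dp_clocks(width, free_thin_alus):
    """Clocks to issue a dpN (width = 2, 3 or 4) given how many of the
    unit's 4 thin ALUs are free this clock. Assumes the op can spread
    over several clocks when fewer lanes are available (illustrative)."""
    return math.ceil(width / free_thin_alus)

print(dp_clocks(4, 4))  # 1: dp4 in a single clock, all thin ALUs free
print(dp_clocks(3, 4))  # 1: dp3 likewise
print(dp_clocks(4, 2))  # 2: only two lanes free, op takes two clocks
```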

The thing to take away from base architecture performance testing is that you can push maximum theoretical throughput through the shader core for any given instruction in isolation. We haven't tested every instruction, but we're confident that's the case. Of course, the architecture means the compiler has to get scheduling right, but that's been true of all of their previous programmable architectures, so we're sure they can handle the most common optimisation cases for the chip, and it'll likely get better over time as they look at different instruction mixes.

We were able to test filtering and ROP throughput too, so that's next.