Analysis: ROP Throughput

We've managed to get far more work done with the ROPs, analysing the impact of differing levels of AA on colour and Z-only fill, as well as the cost of blending, then taking them for a spin in a closer to real-world scenario. First, let's see how Cypress handles pure fill tests:

Outside of filling a 32-bit INT frame buffer without doing any blends, where it has enough bandwidth to service its ROPs and doubles the 4890 at equal clocks, Cypress doesn't get a decisive advantage. Whilst we don't think anyone would do heavy blending with a 1920x1200 FP RT like we do in the least impressive test here, we can envision that in some scenarios bandwidth will become a limiting factor. Time to see how the ROPs cope with MSAA:
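To make the ROP-versus-bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately naive model: one pixel per ROP per clock, no colour compression, no caching, no format-dependent ROP rates, and blending simply doubling memory traffic (a read plus a write). The function name `peak_fill_rate` and the figures in the example (32 ROPs, 0.85 GHz, 150 GB/s) are hypothetical illustrations, not measurements from our tests.

```python
def peak_fill_rate(rops, core_clock_ghz, bandwidth_gbs, bytes_per_pixel, blending=False):
    """Return (Gpixels/s, limiter) for a simple fill test.

    Naive model (an assumption, not how any specific GPU behaves):
    one pixel per ROP per clock, no compression, no caching.
    """
    # Rate the ROPs could sustain if memory were infinitely fast.
    rop_limit = rops * core_clock_ghz
    # Blending reads the destination before writing it back, doubling traffic.
    traffic_per_pixel = bytes_per_pixel * (2 if blending else 1)
    # Rate the memory subsystem could sustain if the ROPs were infinitely fast.
    bandwidth_limit = bandwidth_gbs / traffic_per_pixel
    if rop_limit <= bandwidth_limit:
        return rop_limit, "ROP"
    return bandwidth_limit, "bandwidth"


# Hypothetical part: 32 ROPs at 0.85 GHz with 150 GB/s of memory bandwidth.
int_no_blend = peak_fill_rate(32, 0.85, 150.0, bytes_per_pixel=4)
fp16_blend = peak_fill_rate(32, 0.85, 150.0, bytes_per_pixel=8, blending=True)
print(int_no_blend)  # limited by the ROPs
print(fp16_blend)    # limited by bandwidth
```

Under these assumptions the 32-bit INT case without blending is limited by the ROPs, whereas blending into a 64-bit FP target is limited by memory bandwidth well before the ROPs run out of steam. That is the general shape of the behaviour described above, though real hardware will deviate from such a simple model.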

If we look at pure numbers, Cypress doubles the 4890 across the board, but it's a bit more interesting to look at the performance curves, since they show that the ROPs' characteristics are quite close between the two chips...which is not necessarily bad, because one needn't change something that works well enough. Neither the 4890 nor the 5870 hits its maximum Z-only write rate, but in this case the 5870 seems better able to use what's available to it. An interesting improvement that we've discovered involves writing Z to a D32F format (all our tests up to now were done with a 24-bit INT depth buffer, with no stencil):

We stumbled upon the 4890's “problems” with D32F a while back, and they extend to all R7xx boards we're aware of. When we discussed this with ATI, they were adamant that whilst there had been a bug, it should've been fixed a few driver revisions ago, and that it's probably our application that's doing evil things...which we openly admit is possible!

What this means though is that, at worst, Cypress solves a corner case that our app was exposing, and thus it may benefit other apps that were doing things in a similar fashion or, at best, we had actually uncovered a more general limitation that it solves. Based on some experimenting we've done, it would appear that for whatever reason Z-compression gets turned off on R7xx boards when outputting to D32F (at least in our application).

Putting it all together, we decided to use a rather old ATI demo (Rachel, to be precise) and see what the cost of AA is when doing something other than rendering fullscreen quads:

In (almost) practical situations, the cost of MSAA seems to be lower for the new GPU, as a consequence of doubled colour/z-caches (we presume). We'll see if this holds up when Crysis FPS become involved, in the performance analysis coming later (you must be quite tired of these “coming later” bits by now, but this wouldn't be B3D if they were missing!).
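For reference, one simple way to express the "cost of MSAA" is as the relative increase in frame time when AA is switched on. The sketch below assumes that definition; the helper name `msaa_cost` and the frame rates in the example are made up for illustration and are not results from the Rachel demo.

```python
def msaa_cost(fps_no_aa, fps_with_aa):
    """Relative frame-time increase caused by enabling MSAA (0.25 == 25% slower)."""
    if fps_no_aa <= 0 or fps_with_aa <= 0:
        raise ValueError("frame rates must be positive")
    # Work in milliseconds per frame, since costs add up in time, not in FPS.
    frame_time_no_aa = 1000.0 / fps_no_aa
    frame_time_aa = 1000.0 / fps_with_aa
    return (frame_time_aa - frame_time_no_aa) / frame_time_no_aa


# Made-up example: 200 FPS without AA dropping to 160 FPS with AA.
print(msaa_cost(200.0, 160.0))
```

Comparing this ratio between two GPUs, rather than the raw FPS drop, avoids penalising the faster part simply for having a higher baseline.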