ROPs


Rys wrote about R600's ROPs as being one of its most controversial parts, and he wasn't far from the truth. If the samplers were kicked around, the ROPs were nuked from orbit on many internet forums. We won't rehash the info contained in the above piece, or replay the many debates that have taken place since then about the merits, or lack thereof, of resolving less than fully compressed tiles on the shader core. What's important for today is that the ROPs were significantly overhauled for R7xx as well.

First, as a basis of comparison, take RV670: its ROPs supported up to 8X MSAA and could test 2 positions per cycle, orthogonally with all supported surface formats. It was also limited to 8 pixels per cycle when writing to an FP16 render target, as opposed to 16 for INT8 ones (aggregate values). The blenders could do single-cycle FP16 blending, with FP32 costing 2 cycles -- bandwidth permitting, of course! We've outlined the RV670 rather than the R600 because the R600 may be slightly more potent on this front, at least based on the documentation available at its release. Whilst we had an RV670 on hand to verify those numbers in practice, an R600 was nowhere to be seen, so we opted against extending what's valid for the RV670 to it as well.

When it comes to Z-only writes, these operated at twice the colour write rate, with a maximum Z/stencil compression ratio of 16:1 without MSAA (this scales with MSAA level). Resolve was done on the shader core for partially compressed tiles (effectively, the ones that needed it): decompression happened in the ROPs, the decompressed data was fed into the shader core, and the resolved values were pushed back through the ROPs for the write.

Moving to the R7xx generation (and therefore RV740), we'll start at the bottom by informing you that the ROPs work again, regaining the ability to do resolve themselves, although the funkier CFAA modes remain a shader-driven affair. The general goal for the ROP re-working seems to have been getting a 2x increase versus RV670 on (almost) all fronts: MSAA remains at a maximum of 8X, but now 4 positions can be tested per cycle, and FP16 surfaces are filled at a rate of 16 pixels/cycle, versus 8 before. Z-only rates have also been doubled, operating at 4x the colour rates in the R7xx GPUs. Compression algorithms and blend rates remained unchanged, at least as far as we can tell.
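To make the "2x on (almost) all fronts" claim concrete, here's a minimal sketch that tabulates the per-clock figures quoted above and computes the R7xx-to-RV670 ratios. The `RopRates` class and its field names are ours, purely for illustration. One assumption to flag loudly: the text doesn't quote an INT8 colour rate for R7xx, so we take it to be unchanged at 16 pixels/cycle, and we take the Z-only multiplier to apply to that INT8 rate. Under those assumptions, "Z-only doubled" and "4x the colour rate" turn out to be the same statement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RopRates:
    """Per-clock ROP figures (aggregate values) as quoted in the text."""
    msaa_positions_per_clk: int   # MSAA positions tested per cycle
    fp16_px_per_clk: int          # pixels/cycle written to an FP16 target
    int8_px_per_clk: int          # pixels/cycle written to an INT8 target
    z_multiplier: int             # Z-only rate as a multiple of the colour rate

    @property
    def z_only_per_clk(self) -> int:
        # ASSUMPTION: "colour rate" here means the INT8 colour rate.
        return self.int8_px_per_clk * self.z_multiplier


RV670 = RopRates(msaa_positions_per_clk=2, fp16_px_per_clk=8,
                 int8_px_per_clk=16, z_multiplier=2)

# ASSUMPTION: the INT8 rate for R7xx is not stated; we keep it at 16.
R7XX = RopRates(msaa_positions_per_clk=4, fp16_px_per_clk=16,
                int8_px_per_clk=16, z_multiplier=4)


def speedups(old: RopRates, new: RopRates) -> dict:
    """Ratio of new to old per-clock rate for each quoted metric."""
    return {
        "msaa_positions": new.msaa_positions_per_clk / old.msaa_positions_per_clk,
        "fp16_fill": new.fp16_px_per_clk / old.fp16_px_per_clk,
        "int8_fill": new.int8_px_per_clk / old.int8_px_per_clk,
        "z_only": new.z_only_per_clk / old.z_only_per_clk,
    }


if __name__ == "__main__":
    for metric, ratio in speedups(RV670, R7XX).items():
        print(f"{metric}: {ratio:.1f}x")
```

These are per-clock ratios only; actual throughput also depends on clock speed and available bandwidth, neither of which is modelled here.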

These changes presented an interesting opportunity to investigate just how high the cost of doing shader resolve was. Since the CFAA modes still rely on it, we could compare the cost of enabling a custom resolve filter on the RV740 with the cost of enabling the same filter, at the same base sample count, on the RV670. If we assume that the cost of adding the custom filter is the cost of shader resolve plus the cost of the filter on the RV670, but only the cost of the filter on the RV740, then the difference between the relative performance degradation in the two cases isolates the cost of shader resolve. That's quite a few assumptions, but it seemed like an interesting experiment, so we performed it and we'll show you the results in just a few pages.
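The arithmetic behind that experiment can be sketched in a few lines. This is only an illustration of the subtraction described above, under the same assumptions; the function names are ours, and the frame rates in the usage example are placeholders, not measured results.

```python
def relative_degradation(base_fps: float, filtered_fps: float) -> float:
    """Fraction of performance lost when the custom resolve filter is enabled."""
    if base_fps <= 0:
        raise ValueError("base_fps must be positive")
    return (base_fps - filtered_fps) / base_fps


def isolated_shader_resolve_cost(rv670_base: float, rv670_cfaa: float,
                                 rv740_base: float, rv740_cfaa: float) -> float:
    """Difference between the two relative degradations.

    ASSUMPTION (taken from the text): the RV670 delta covers shader
    resolve plus the filter, while the RV740 delta covers the filter
    only, so subtracting one from the other leaves shader resolve.
    """
    rv670_loss = relative_degradation(rv670_base, rv670_cfaa)
    rv740_loss = relative_degradation(rv740_base, rv740_cfaa)
    return rv670_loss - rv740_loss


if __name__ == "__main__":
    # Placeholder numbers for illustration only, NOT benchmark data.
    cost = isolated_shader_resolve_cost(100.0, 70.0, 100.0, 80.0)
    print(f"Estimated shader resolve cost: {cost:.1%} of base performance")
```

Working with relative rather than absolute drops matters here, since the two chips don't start from the same frame rate; it still assumes the filter itself costs the same fraction of performance on both parts.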

Before that, though, we need to properly wrap up our trip through the logic and cache blocks by looking at the memory interface and UVD2.