ROPs


Cypress breaks away from a rule that had seemed truly immutable, started by the R420 many moons ago: thou shalt not have more than 16 ROPs! It took years, and 4 generations of hardware (we're not counting refreshes), but finally the ROP count budged and it sits at a rather comfortable 32. ATI's ROPs are organized in blocks of 4 dubbed Render Back Ends or RBEs, so you get 8 RBEs per Cypress GPU.

The GPU can write 32 INT8 pixels or 32 FP16 ones per cycle into the frame buffer, and test 4 AA sub-sample positions per cycle per ROP. Samples are laid out in the same manner we've come to know and love ever since the R600 was introduced, using a 4-bit subpixel grid. Full orthogonality with supported surface formats should be present; however, the drivers we had available didn't seem to support MSAA on INT32/FP32 surfaces. Z/stencil-only writes run at 4 times the rate of colour ones, for a pretty funky 128 pixel/cycle throughput... bandwidth permitting, of course.
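The per-cycle figures above follow from simple arithmetic, which the minimal sketch below spells out. The ROP count, the 4-ROPs-per-RBE grouping and the 4x Z/stencil rate come from the text; the 850 MHz core clock is an assumption used only to turn per-cycle numbers into a peak rate, and the function names are purely illustrative.

```python
# Figures taken from the text.
ROPS = 32                  # ROPs per Cypress GPU
ROPS_PER_RBE = 4           # ROPs grouped into one Render Back End
Z_RATE_MULTIPLIER = 4      # Z/stencil-only writes vs. colour writes

# ASSUMPTION: core clock is not stated in this section; 850 MHz is used
# here only as an illustrative value.
ASSUMED_CORE_CLOCK_MHZ = 850


def rbe_count(rops=ROPS, rops_per_rbe=ROPS_PER_RBE):
    """Number of Render Back Ends for a given ROP count."""
    return rops // rops_per_rbe


def colour_pixels_per_clock(rops=ROPS):
    """Each ROP writes one INT8 or FP16 colour pixel per cycle."""
    return rops


def z_pixels_per_clock(rops=ROPS, multiplier=Z_RATE_MULTIPLIER):
    """Z/stencil-only writes run at a multiple of the colour rate."""
    return rops * multiplier


def peak_rate_gpixels(pixels_per_clock, clock_mhz=ASSUMED_CORE_CLOCK_MHZ):
    """Theoretical peak in Gpixels/s, ignoring bandwidth limits."""
    return pixels_per_clock * clock_mhz / 1000.0


if __name__ == "__main__":
    print("RBEs:", rbe_count())
    print("Colour pixels/clock:", colour_pixels_per_clock())
    print("Z-only pixels/clock:", z_pixels_per_clock())
    print("Peak colour Gpixels/s (assumed clock):",
          peak_rate_gpixels(colour_pixels_per_clock()))
```

These are theoretical peaks only; as noted above, real throughput depends on available bandwidth.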

CFAA also remains roughly unchanged, outside of the probable change in how samples for resolve are fetched, which we already discussed in the sampler section. To be frank, the single CFAA mode that's truly interesting is Edge-Detect. Like the tent modes, Edge-Detect accounts for sub-samples outside of the pixel's area when resolving (in fact, for resolve it uses an even wider tent filter that samples 3 sub-samples from neighbouring pixels), but it avoids the loss of signal frequency in places where you would not want it, which is characteristic of the straight tent filters. To do so, a rather expensive and exhaustive edge-detection shader must be run on partially compressed tiles (the ones that are likely to contain edges), with the resolve kernel also being influenced by an edge's direction, which is derived from the shader's output. The results are quite good in practice, but the cost is high and it remains a proposition for high-end solutions more than anything. CFAA used to be a DX9-only affair, in order to avoid conflicts with DX10/10.1/11 code that already does its own custom resolve via shaders. It has been pointed out to us (thanks Thomas!) that this is no longer the case, and we've verified in practice that it can now be used in DX10 as well - verification under DX11 will have to wait, sadly. (NOTE: this phrase is changed from the original writeup.)
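To make the difference between a plain resolve and a tent-style resolve concrete, here is a generic sketch. It is not ATI's actual resolve kernel, whose weights and footprint are not documented here; the linear falloff, the `radius` parameter and the sample distances are all assumptions chosen for illustration.

```python
def box_resolve(samples):
    """Plain resolve: unweighted average of the sub-samples given.

    `samples` is a list of (distance_from_pixel_centre, value) pairs;
    the distance is ignored here.
    """
    return sum(value for _, value in samples) / len(samples)


def tent_resolve(samples, radius):
    """Tent resolve: weight each sub-sample by a linear falloff.

    Samples at the pixel centre get weight 1, samples at `radius` or
    beyond get weight 0. Because `radius` may exceed the pixel's own
    extent, sub-samples from neighbouring pixels can contribute.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for distance, value in samples:
        weight = max(0.0, 1.0 - distance / radius)
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Hypothetical data: two white sub-samples inside the pixel and one
    # black sub-sample borrowed from a neighbouring pixel.
    inside = [(0.25, 1.0), (0.25, 1.0)]
    with_neighbour = inside + [(0.75, 0.0)]
    print("box :", box_resolve(inside))
    print("tent:", tent_resolve(with_neighbour, radius=1.0))
```

The tent result is pulled towards the neighbouring sample's value everywhere, not just on edges, which is the softening that an edge-aware mode tries to confine to pixels that actually contain edges.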

Speaking of DX9-only affairs, Cypress also exposes supersampling as a selectable AA mode. Supersampling has long been revered for unclear reasons, many having to do with the fact that it was one of the late 3dfx's last “weapons” in a losing battle. We'll not express our opinion about the merits of doing full-screen SS in this day and age, since that would be a bit beside the point. However, here's how we think ATI is implementing SS: from DX10.1 onwards, the hardware is there to allow running pixel shaders at sub-sample frequency when MSAA is enabled.

Whilst developers tend to use it only on edge pixels (which requires edge-detection/masking in code), or on alpha textures, ATI is likely to do it for all pixels, thus effectively providing SSAA. The observable sample pattern, which is a 1:1 match to the one used with MSAA, seems to support this assertion. The performance penalty is great, bringing back the good old days in which enabling n-sample AA had the effect of dividing performance by n. On the flip side, it can deal with “noisy” shaders quite well, and provides quality alpha-texture AA. Going forward, we expect developers to handle these things elegantly, in code, only supersampling shaders that need it, but for certain titles that are already here, in which there is performance to burn, SSAA has the potential to be useful.
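The "divide performance by n" rule of thumb can be illustrated with a rough count of pixel shader invocations. This is a deliberately simplified model: it assumes MSAA shades each pixel once (ignoring pixels covered by several triangles) and that sample-frequency shading runs the shader once per sub-sample, and the resolution used is arbitrary.

```python
def pixel_shader_invocations(width, height, samples, mode):
    """Rough shader invocation count for one full-screen pass.

    mode "msaa": shader runs once per pixel (simplified model).
    mode "ssaa": shader runs once per sub-sample, i.e. at sample frequency.
    """
    pixels = width * height
    if mode == "msaa":
        return pixels
    if mode == "ssaa":
        return pixels * samples
    raise ValueError("mode must be 'msaa' or 'ssaa'")


def relative_shading_cost(samples):
    """SSAA shading work relative to MSAA at the same sample count."""
    # Resolution cancels out, so any non-zero size gives the same ratio.
    msaa = pixel_shader_invocations(1, 1, samples, "msaa")
    ssaa = pixel_shader_invocations(1, 1, samples, "ssaa")
    return ssaa / msaa


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"{n}x: SSAA does {relative_shading_cost(n):.0f}x "
              "the shading work of MSAA")
```

In a shader-bound scene this ratio is what pushes frame rates down by roughly a factor of n; selectively supersampling only the shaders that need it keeps most pixels at the cheaper per-pixel rate.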

Up to now, nothing has changed in a significant fashion compared to the RV770, outside of the major increase in ROP count, but that's not all, folks. The most significant change for the ROPs, in our opinion, is the doubling of the Z/stencil and colour caches versus the RV770, which can easily be overlooked since, as far as we've seen, it's not mentioned anywhere in the official documentation. We've not been made aware of any tweaking of the chip's colour/z-compression mechanisms, although practical experience would seem to indicate that at least z-compression has been slightly improved.

We'll move on next to the one aspect of the chip that left us scratching our heads a bit, namely the memory interface.