ROP

The ROP stage is likely to be one of the most controversial parts of the R600 design for some time to come, given how AMD make use of it in the shipping configurations. Basic abilities first: the ROP now supports 8x multisample AA using programmable sample grids, testing 4 positions per cycle, with samples laid out on a 4-bit grid in each of X and Y (16 discrete positions per axis, 256 per pixel in total). It's able to test any supported surface format, including float surfaces. That means the hardware's basic multisample abilities match (and, by virtue of the programmability, you could argue exceed) those of NVIDIA's G80.
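To make the grid concrete, here's a minimal sketch of how programmable sample positions on such a grid could be encoded. The decode function and the 8x pattern positions are purely illustrative assumptions, not AMD's actual encoding or sample layout.

```python
def decode_sample(x4: int, y4: int) -> tuple[float, float]:
    """Map a 4-bit (x, y) grid position to a sub-pixel offset in [0, 1).

    16 discrete positions per axis gives 256 possible locations per pixel.
    """
    assert 0 <= x4 < 16 and 0 <= y4 < 16
    return (x4 / 16.0, y4 / 16.0)

# An illustrative 8x pattern: eight (x, y) grid positions per pixel.
pattern_8x = [(1, 7), (5, 1), (3, 13), (7, 5),
              (9, 11), (11, 3), (13, 15), (15, 9)]

offsets = [decode_sample(x, y) for x, y in pattern_8x]
```

A driver exposing programmable grids would, in principle, just upload a different table of these 4-bit pairs per AA mode.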

The blender is new, too, with one per ROP that's able to do any FP16 blend op in a single cycle, and FP32 at half speed. Finished pixels leave the ROP back-end at one per clock per ROP, the hardware making good use of available board bandwidth to sustain writes of floating point pixels into the framebuffer at that rate. Remember, it takes a couple of GB/sec just to sustain 60fps of FP16 pixel fill at 2560x1600, never mind the memory bandwidth needed for everything else going on on the chip at the same time to generate those frames!
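That bandwidth figure is easy to check on the back of an envelope: a 2560x1600 frame of FP16 RGBA pixels is 8 bytes per pixel, written 60 times a second.

```python
width, height = 2560, 1600
bytes_per_pixel = 8          # 4 channels x 16-bit float
fps = 60

# Bytes per second just for colour writes, ignoring Z, texturing, etc.
bandwidth = width * height * bytes_per_pixel * fps
print(bandwidth / 1e9)       # roughly 1.97 GB/s (decimal)
```

So colour writes alone eat nearly 2 GB/sec before any other traffic on the memory bus is accounted for.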

The ROP will sustain a Z-only write rate of 2x that of colour writes, even with AA enabled. Compression-wise, AMD report that the depth and stencil compression logic for tiles is improved to the tune of 16:1, up from 8:1 in R5xx, and it scales with the AA level (so a theoretical peak of 128:1 at 8xMSAA). The hardware will actively prefetch and cache tile compression data on-chip, to make sure it's available when decompressing for sample resolve. And that brings us on to probably the most controversial part of the chip: how it does(n't) do MSAA sample resolve.
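The scaling works out simply if, as AMD's figures imply, the base tile ratio multiplies by the sample count. A trivial sketch of that assumed relationship:

```python
BASE_RATIO = 16  # AMD's quoted R600 tile compression, up from 8:1 on R5xx

def peak_depth_compression(msaa_samples: int) -> int:
    """Assumed peak depth/stencil compression: base ratio times sample count."""
    return BASE_RATIO * msaa_samples

for samples in (1, 2, 4, 8):
    print(f"{samples}xMSAA: {peak_depth_compression(samples)}:1 peak")
```

The 8xMSAA case is where the quoted 128:1 theoretical peak comes from.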

Custom Filter AA

Custom Filter AA, or CFAA for short, is AMD implementing non-box filters that look outside the pixel being processed in order to calculate final colour and antialias the image. The sample resolve for that is performed on the shader core, data passed in so the programmable hardware can do the math, with the filter function defined by the driver. That means AMD can implement a pluggable user filter system if they want, but even if they don't they can update, add and remove filters from the driver at will whenever they see fit.

The big advantage is the ability to perform better filtering than the standard hardware resolve. However, the disadvantages include the possibility of badly implemented filters, and speed issues, because the driver now has to issue (and the hardware run) resolve calculation threads which chew up available shader core cycles. Ah, but there's still the regular hardware resolve if you want maximum speed and the regular quality from sampling just a single pixel, right? Well no, that's not always the case.

Even for the basic box filter resolves, where every sample inside the pixel is weighted equally, R600 performs the calculations to resolve the samples on the shader core. The exception is when compression for a tile is at maximum: every sample then holds the same colour, so the resolve would just return that colour anyway and there's no filtering math to do. Currently that points to the hardware resolve either being broken, at least under some conditions (when compression is less than maximum), or to it simply being easier to maintain a single code path in the driver if you're doing other filters on the shader core anyway. We lean towards the former, rather than the latter, since the performance deficit for shader core resolve seems significant even in the basic box filter cases. That can be improved in the driver, however, and the hardware helps here by likely being able to feed decompressed sample data for a pixel into the shader core at a high rate.
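The behaviour described above can be sketched in a few lines: an equal-weight box average over a pixel's samples, plus the shortcut taken when a tile is fully compressed. This is an illustration of the described logic, not AMD's driver code.

```python
def resolve_pixel(samples: list[float], fully_compressed: bool) -> float:
    """Box-filter MSAA resolve for one pixel's samples (single channel).

    At maximum tile compression all samples are known identical, so the
    resolve degenerates to a copy and no filtering math is needed.
    """
    if fully_compressed:
        return samples[0]
    # Otherwise: equal-weight average of all samples in the pixel.
    return sum(samples) / len(samples)
```

The compressed path is free; it's the averaging path, run as shader threads, that costs the cycles the text describes.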

The current filters are wide and narrow tent filters, where samples from outside the pixel being processed are weighted linearly based on their distance from that pixel's centroid, with the linear falloff adjusted by the wide or narrow choice. There's also an edge-detect mode that analytically looks for geometry edges intersecting the pixel, based on an even wider search around it, and resolves with weights based on where the samples lie relative to the edge and its direction (once determined). The cost of such a filter is pretty significant, and we'll examine it in the performance piece (it was only enabled in a usable driver bare days ago, at the time of writing).
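A tent filter of this kind reduces to a very small amount of math per sample: weight falls linearly from 1.0 at the centroid to zero at some radius. The radii below are assumed relative values for illustration; AMD's actual filter widths aren't public here.

```python
def tent_weight(distance: float, radius: float) -> float:
    """Linear falloff: 1.0 at the pixel centroid, 0.0 at 'radius' and beyond."""
    return max(0.0, 1.0 - distance / radius)

def tent_resolve(samples: list[tuple[float, float]], radius: float) -> float:
    """Resolve (colour, distance-from-centroid) pairs with a tent filter."""
    weights = [tent_weight(d, radius) for _, d in samples]
    total = sum(weights)
    return sum(c * w for (c, _), w in zip(samples, weights)) / total

NARROW, WIDE = 1.0, 2.0  # assumed relative filter radii, in pixel widths
```

The wide radius admits more (and more distant) neighbouring samples at non-trivial weight, which is exactly where the blur concern discussed later comes from.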

Given the filter type, and the samples it takes from the pixel being processed and its surrounding neighbours, AMD still use fairly familiar nomenclature for naming the modes.

AMD R600 Custom Filter AA

Mode  Filter
4x    2x + Narrow Tent
6x    2x + Wide Tent or 4x + Narrow Tent
8x    4x + Wide Tent
12x   8x + Narrow Tent or 4x + Edge Detect
16x   8x + Wide Tent
24x   8x + Edge Detect

Because of sample positioning and the filter used, two of the modes can be achieved using different combinations of sample count and resolve filter, with the number denoting the total number of samples the filter works with. And because those two modes don't have fixed quality (there being two ways to reach the same sample count, with different filters), you can't use the number in a CFAA mode's name to judge whether it's better quality than a mode with a higher or lower number, nor can you make any determination of speed.
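Treating the table above as data makes the ambiguity easy to see: two mode names map to more than one (sample count, filter) pairing.

```python
# CFAA mode name -> list of (hardware samples, resolve filter) combinations,
# transcribed from the table above.
CFAA_MODES = {
    "4x":  [(2, "Narrow Tent")],
    "6x":  [(2, "Wide Tent"), (4, "Narrow Tent")],
    "8x":  [(4, "Wide Tent")],
    "12x": [(8, "Narrow Tent"), (4, "Edge Detect")],
    "16x": [(8, "Wide Tent")],
    "24x": [(8, "Edge Detect")],
}

# The modes reachable two different ways, hence with non-fixed quality/speed.
ambiguous = [mode for mode, combos in CFAA_MODES.items() if len(combos) > 1]
print(ambiguous)
```

So "12x" might be a cheap tent resolve over 8 samples or an expensive edge-detect over 4, which is why the mode number alone tells you nothing definite.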

It's also quite easy to imagine the tent filters, because of the way they look outside the target pixel, blurring the image by weighting the contribution from samples in neighbouring pixels too heavily. At this stage in the real-time graphics game, we think it's advantageous to image quality as a whole if AA filter quality only ever gets better than the accepted norm, and never worse.

We're also not sure if the system will ever become fully custom in the sense that users will be able to write their own filters, and we're not sure if AMD will ever change, remove or add filters over time (indeed AMD aren't sure themselves at this point, since the functionality in the driver is actually quite new). We'll take a quick look at AA image quality in a couple of pages' time, and a full one in the IQ article.