Data Sampling

Click for a bigger version

Given the fact that ATI has retained the 1 sampler (they call them Texture Units) per SIMD ratio, sampling capacity has followed the doubling of the shader core. Functionally, a Cypress sampler is quite close to an RV770 sampler, with extended capacity required to achieve DX11 compliancy: support for 16384x16384 textures, and support for BC6 and 7 compressed formats. Another addition is the ability to access pixel compression data in multisampled colour buffers, which translates into sample fetching for CFAA being now done via the samplers, rather than by using the mysterious RBE-Shader Core fast-path introduced in the R600, that we have a hunch had become something of a limiting factor.

This also allows saving some wiring and transistors on the chip, by doing away with the old fast-path. In the current arrangement, the chip can theoretically fetch 272 GSamples/s to be resolved on the ALUs, which is more than enough for any scenario. For 2560x1600 and 8X AA you'd have to fetch only 1.96 Gsamples/s at 60 FPS...and that's assuming you fetch all 8 samples for all tiles, whereas given the fact that samplers can now read compression flags you'd only fetch more than 1 sample per tile for partially compressed tiles, which are potentially in the minority at high resolution. All in all, this should bring a nice speed-boost for CFAA in cases where sample-fetch was the limiting factor, as opposed to ones where the math cost of CFAA was the bottleneck...so we expect to see more benefits for the tent modes, rather than for Edge-Detect. Keep in mind that the above only applies to CFAA, traditional resolve using a box-filter is still done via dedicated HW on the ROPs, but let's not get ahead of ourselves.

Getting back to the samplers themselves, each can set up 4 addresses per cycle and fetch 16 FP32 samples for filtering within the same cycle, if one uses Fetch/Gather4, otherwise only 4 FP32 samples can be fetched per cycle. Filtering is done at a rate of 4 bilerps per cycle for 8-bit per channel INT formats, with the rate being halved for 16-bit per channel FP ones, just like with the RV770. Samplers fetch from an exclusive L1 cache, that's been halved versus the RV770 sitting at 8 KB whereas it was 16 KB before. Something had to take a step back with all the doubling going on in order to fit within die-size targets (which may have been slightly surpassed anyhow, but that's another story). This L1 is fully associative, and we're told it uses 256 bit cachelines, maintaining a cacheline size tradition that was started with the R300. You'll probably see a cute 1 TB/s fetch rate being quoted for the L1, but keep in mind that's aggregate, a sampler can only fetch from the L1 at a rate of ~54.4GB/s, and aggregate doesn't make much sense here since one sampler can't exactly fetch from another's L1. Going further down the cache hierarchy, a shared 512 KB (aggregate) L2 cache can be found, split in 4 memory controller aligned 128 KB banks, which are accessed via a 1024-bit crossbar sitting between themselves and the L1s. The crossbar is probably characterized by a crosspoint complexity of 4*20=80, a diameter of 1 and is strictly nonblocking – otherwise it wouldn't quite be a full-fledged crossbar. Each L2 is uniquely mapped to a given memory location, each holding unique data, and data can be fetched from any L2 bank into any of the L1 caches, and cacheline size here also sits at 256b, in a rather unsurprising fashion. In order to ensure that the texture caches hold the latest frame buffer data, a semaphore is used which allows caching to go on only after a surface is written. In a compute environment this is handled explicitly in code, by using uncached reads for data that is known to have been modified.

You may recall that once upon a time there also used to be a vertex cache attached to the data request bus – however, this is no longer the case. It seems that fetching vertex data through L1 gives better results than using a separate vertex cache, and since it's unlikely to be huge, and thus won't have a significant footprint in the cache, ATI opted for doing away with the dedicated solution, probably saving a bit of die real-estate as well. Interestingly, this also applies to the prior R7XX family of chips, where the vertex cache, although physically present, has been disabled in later drivers.

Filtering is fully orthogonal for all supported formats, and it's received a bit of a makeover too. Trilinear is better than what you could get on an RV770, although whether or not this is merely a software change is hard to gauge. Anisotropic is the more interesting topic though, and we've not yet completely peeled it ourselves: on one hand, it's now doing angle-invariant level selection, which means that each angles get the same level of filtering/same tap count. However, the count of texels sampled for generating a pixel is something of a confounding factor, since our early measurements indicate that it has actually gone slightly down compared to the RV770, which could cause problems for certain high-frequency surfaces. All in all, we still have a lot of work left to do here, so an ample discussion of the topic will have to wait until we publish our IQ investigation. In practical use, we can't say we've noticed any filtering quality issues, but we've had little time with the board before being separated from it, so we can't make a decisive statement one way or the other. And now, on to ROPs!