Data Sampling

If you've been following things over the last couple of years, you've probably noted that the samplers in R600 were kicked around more than the ball in a Liverpool-Manchester United face-off, often being pointed out as the reason for the mediocre performance achieved by those parts. Whilst that is a simplistic view, it does hold merit, as the samplers were rather disconnected from the realities of the 3D rendering landscape of the time. They were powerful enough to perform single-cycle FP16 (per-channel) bilinear filtering, and had extra addressing capability for unfiltered textures (probably a back-firing bet on DX10 encouraging unfiltered accesses to various surface formats), but were stuck at a unit count that was worryingly low compared to the competition.

Couple that with an interesting yet unwieldy arrangement as a sampler array, which limited scaling possibilities (scaling sampler count upwards would've required either increasing SIMD width or significant reworking), and with the fact that they were pretty inefficient with regards to die space, and it's not exactly surprising that ATI opted for a complete overhaul on this front.

Out went the overachieving solution and in came a more balanced one, which sees each SIMD tied to its own discrete sampler unit, with each sampler having a separate, exclusive L1 texture cache. Each sampler unit sets up addresses at a rate of 4 per cycle (no more filtered/unfiltered dichotomy), can fetch 16 FP32 samples from the L1 cache per cycle (compared to 16 bilinear + 4 point-sampled before) and can bilinearly filter 4 INT8 (per-channel) or 2 FP16 (per-channel) of those samples per cycle. So, no more single-cycle FP16 bilinear filtering, but unit count doubles from the R600/RV670 to the RV740, thus FP16 bilinear throughput is equal at equal clocks, whereas INT8 throughput is twice as high. Of course the sampler units are fully D3D10 and D3D10.1 compliant, thus supporting all of the features demanded by those APIs.
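The generational trade-off above is easy to sanity-check with some back-of-the-envelope arithmetic. A quick sketch, assuming 4 sampler units on R600/RV670 and 8 on RV740 (the doubling noted above), each with the per-unit bilinear rates quoted:

```python
# Aggregate bilinear filtering throughput (texels/clock) per format,
# comparing R600/RV670 against RV740. Unit counts (4 vs 8) are our
# assumption based on the doubling described in the text.

def bilinear_throughput(units, int8_per_unit, fp16_per_unit):
    """Total bilinearly filtered texels per clock, per channel format."""
    return {"INT8": units * int8_per_unit, "FP16": units * fp16_per_unit}

r600  = bilinear_throughput(units=4, int8_per_unit=4, fp16_per_unit=4)  # single-cycle FP16
rv740 = bilinear_throughput(units=8, int8_per_unit=4, fp16_per_unit=2)  # half-rate FP16

print(r600)   # {'INT8': 16, 'FP16': 16}
print(rv740)  # {'INT8': 32, 'FP16': 16}
```

As expected, FP16 throughput per clock comes out identical across the two designs, while INT8 throughput doubles.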

Getting back to the L1 cache we already mentioned, we've measured it to be 16 KiB per sampler unit, backed by an ~128 KiB L2 split into 2 equal-sized, memory-channel-aligned banks that are accessed via a crossbar switch. The L2 also holds vertex data, which is fetched into a separate vertex cache (probably around 32 KiB in size) that is likewise read through the sampler units. The L1 is fully associative, whereas we're not sure about the L2. You've probably already mentally calculated that L1 fetch bandwidth equals 48 GB/s per sampler unit (16 FP32 samples × 4 bytes/cycle × 750 MHz), whilst each L2 bank can provide data at a rate of 96 GB/s (128 bytes/cycle × 750 MHz).
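For those who'd rather not do the mental maths, here's the same arithmetic spelled out. Note that the 128 bytes/cycle L2 bank width is inferred from the 96 GB/s figure rather than a documented spec:

```python
# Sanity-checking the cache-bandwidth figures above. The L1 per-cycle
# width follows directly from the quoted fetch rate (16 FP32 samples);
# the L2 bank width is our inference from the 96 GB/s number.

CLOCK_HZ = 750e6  # RV740 core clock

l1_bytes_per_cycle = 16 * 4        # 16 FP32 samples x 4 bytes each
l2_bank_bytes_per_cycle = 128      # inferred: 96 GB/s / 750 MHz

l1_bw = l1_bytes_per_cycle * CLOCK_HZ / 1e9        # GB/s per sampler unit
l2_bw = l2_bank_bytes_per_cycle * CLOCK_HZ / 1e9   # GB/s per L2 bank

print(f"L1: {l1_bw:.0f} GB/s per sampler, L2: {l2_bw:.0f} GB/s per bank")
# → L1: 48 GB/s per sampler, L2: 96 GB/s per bank
```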

Much to some image-quality aficionados' disdain, the anisotropic filtering implementation hasn't changed between generations, so it remains slightly lower quality than NVIDIA's current solution. Whilst it's still at a disadvantage when it comes to pure data-sampling muscle versus its immediate competitors, the RV740 is hardly a slouch here, unlike RV670, and we've seen it be quite competitive on this front, albeit hampered by bandwidth constraints – more on that later.

After you've sampled data and chewed it in the shader core, you must actually write it to memory (ignore the Stream-Out alternative for artistic effect), and that task falls to the ROPs, which constitute the next topic.