Beyond3D - AMD R600 Architecture and GPU Analysis

AMD R600 Architecture and GPU Analysis - Page 9

Published on 14th May 2007, written by Rys for Consumer Graphics - Last updated: 14th May 2007

Data Sampling

A significant upgrade in the shader core on any modern GPU, compared to a previous architecture generation, has to be matched elsewhere on the chip. It's therefore no surprise that R600 has improved sampling ability compared to R580 and R520, and you can argue it needs it given the hardware available on the competition's G80 GPU. R600 features a quartet of what AMD calls texture units, but which we call samplers, to reflect that they can read and filter more than what you'd consider a traditional texture surface.

Each sampler unit can set up 8 addresses per clock, fetch 16 FP32 values for bilinear filtering and 4 FP32 values for point sampling per clock, and then bilinearly filter those fetched values at a rate of four INT8 or FP16 bilerps per cycle. Don't believe what you read elsewhere about FP16 being half speed in R600, since it's simply not the case; it's 3- and 4-channel INT16 that isn't full speed, with NVIDIA getting a little confused on that point. Our in-house sampler benchmark tool shows equivalent rates for INT8 and FP16 bilinear filtering with 1- to 4-channel fetches for small and large surfaces, which we'll show you later.
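
To put the per-unit figures in chip-wide terms, here is a minimal Python sketch that multiplies them by the four sampler units. The constants come straight from the text above; the function names and the `clock_mhz` parameter are our own for illustration, and the clock has to be supplied by the caller since this page doesn't quote one.

```python
# Chip-wide sampler rates per clock, built only from the per-unit figures above.
SAMPLER_UNITS = 4                # "a quartet" of sampler units
ADDRESSES_PER_UNIT = 8           # addresses set up per clock
BILINEAR_FETCHES_PER_UNIT = 16   # FP32 values fetched for bilinear filtering
POINT_FETCHES_PER_UNIT = 4       # FP32 values fetched for point sampling
BILERPS_PER_UNIT = 4             # INT8 or FP16 bilerps per clock


def chip_rates(units=SAMPLER_UNITS):
    """Return per-clock totals across all sampler units."""
    return {
        "addresses": units * ADDRESSES_PER_UNIT,
        "bilinear_fetch_values": units * BILINEAR_FETCHES_PER_UNIT,
        "point_fetch_values": units * POINT_FETCHES_PER_UNIT,
        "bilerps": units * BILERPS_PER_UNIT,
    }


def bilerps_per_second(clock_mhz, units=SAMPLER_UNITS):
    """Peak bilerp rate for a caller-supplied clock (hypothetical parameter)."""
    return units * BILERPS_PER_UNIT * clock_mhz * 1_000_000


if __name__ == "__main__":
    for name, value in chip_rates().items():
        print(f"{name}: {value} per clock")
```

These are peak figures only; whether the chip sustains them depends on bandwidth and cache behaviour.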

If you want to focus on unfiltered data fetches in your shader, the hardware can perform a Fetch4 fetch in place of the bilinear fetch, per clock. As mentioned on the previous page, each unit is tied to a certain sub 'quad' in a shader cluster, feeding the same one in each cluster in the shader core. 32-bit RGBE filtering is supported (as it has to be to comply with D3D10), for developers looking to use that shared exponent format for HDR rendering in their latest engines, and the chip can access very large textures up to 8Kx8K for any fetch or filtering ops.
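
The difference between a Fetch4 fetch and a bilinear fetch is easiest to see side by side. The sketch below is a software model under our own assumptions, not a description of the hardware: a single-channel texture stored as a list of rows, texel centres at integer + 0.5, clamped edges, and an arbitrary ordering of the four returned values (we make no claim about the component order the real hardware uses). `footprint`, `fetch4` and `bilerp` are hypothetical helper names.

```python
import math


def footprint(texture, u, v):
    """Return the 2x2 texel footprint around (u, v) and the blend weights.

    texture is a list of rows of single-channel floats; u and v are in
    texel units with texel centres at integer + 0.5. Edges are clamped.
    """
    height, width = len(texture), len(texture[0])
    x, y = u - 0.5, v - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(ix, iy):
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    quad = (texel(x0, y0), texel(x0 + 1, y0),
            texel(x0, y0 + 1), texel(x0 + 1, y0 + 1))
    return quad, fx, fy


def fetch4(texture, u, v):
    """Fetch4-style read: hand back the four raw texels, unfiltered."""
    quad, _, _ = footprint(texture, u, v)
    return quad


def bilerp(texture, u, v):
    """Bilinear read: blend the same four texels into one value."""
    (t00, t10, t01, t11), fx, fy = footprint(texture, u, v)
    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy


if __name__ == "__main__":
    tex = [[0.0, 1.0], [2.0, 3.0]]
    print(fetch4(tex, 1.0, 1.0), bilerp(tex, 1.0, 1.0))
```

Both paths touch the same four texels; Fetch4 just leaves the weighting to the shader, which is what makes it handy for custom filters.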

All of the fetch and filtering capabilities are available to each thread type, making the samplers completely agnostic about what's using them. Each sampler unit is cached locally by a 32KiB local store, with a 256KiB shared L2 that holds on to data fetched for L1 misses, reducing the miss penalty for future requests that stride across a collection of nearby addresses. The hardware will also use the vertex cache for accelerating unfiltered fetches. R600 also supports the same decompression formats out of cache that R580 did, for the likes of the DXT formats and ATI's 1- and 2-channel depth formats.
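
For a rough feel of what those cache sizes mean, the sketch below counts how many texels of a given format fit in each level. It rests on assumptions of ours: that data sits in cache in its stored format (so DXT stays compressed until it's read out), and that tag and line overhead can be ignored. The format table and `texels_that_fit` are illustrative, not a list of what the hardware caches.

```python
L1_BYTES = 32 * 1024     # per-sampler local store, from the text above
L2_BYTES = 256 * 1024    # shared L2, from the text above

# Bytes per texel. DXT1 packs a 4x4 block into 8 bytes, DXT5 into 16 bytes.
BYTES_PER_TEXEL = {
    "RGBA8": 4.0,
    "RGBA16F": 8.0,
    "RGBA32F": 16.0,
    "DXT1": 8 / 16,
    "DXT5": 16 / 16,
}


def texels_that_fit(cache_bytes, fmt):
    """Texels of format fmt that fit in cache_bytes, ignoring overhead."""
    return int(cache_bytes / BYTES_PER_TEXEL[fmt])


if __name__ == "__main__":
    for fmt in BYTES_PER_TEXEL:
        print(fmt, texels_that_fit(L1_BYTES, fmt), texels_that_fit(L2_BYTES, fmt))
```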

All filtering levels are available on all supported formats, including those in non-linear space, and with AMD supporting the same level selection and tap count for aniso that R580 did, image quality will remain subjectively high, but objectively lower than that of the current competition. And despite whinging about its use in 3DMark, AMD now also support acceleration of depth stencil textures and PCF in R600, giving them a healthy per-clock boost in that benchmark if anyone cares about it.
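
For readers unfamiliar with PCF (percentage-closer filtering), the idea is to run the depth comparison on each of the four texels first and then bilinearly blend the pass/fail results, rather than filtering the depth values themselves. A minimal sketch follows; the less-or-equal comparison and the `pcf_bilinear` name are our assumptions, and it's a generic model of the technique rather than of R600's implementation.

```python
def pcf_bilinear(depth_quad, fx, fy, reference_depth):
    """Blend four depth-test results into a lit fraction between 0 and 1.

    depth_quad is (d00, d10, d01, d11) from the shadow map, fx and fy are
    the bilinear weights, reference_depth is the depth being tested.
    """
    # Compare first (assumed less-or-equal test), then filter the results.
    lit = [1.0 if reference_depth <= d else 0.0 for d in depth_quad]
    top = lit[0] + (lit[1] - lit[0]) * fx
    bottom = lit[2] + (lit[3] - lit[2]) * fx
    return top + (bottom - top) * fy


if __name__ == "__main__":
    # Top row of texels occludes the point, bottom row doesn't.
    print(pcf_bilinear((0.5, 0.5, 0.9, 0.9), 0.5, 0.25, 0.7))
```

Done in the shader this costs four fetches, four compares and the blend; done in the sampler it's a single filtered fetch, which is where the per-clock gain comes from.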

The sampler unit count and the performance available from each unit mean that while fetch and filter performance is increased over what's available in R580, it's one of the major areas where R600 is (sometimes heavily) deficient compared to NVIDIA G80 on a per-clock basis, with the sampler units running at base clock like the rest of the chip. As with G80, and given the orthogonality of the sampler hardware, filtering power simply becomes a product of available bilerps per clock, bandwidth permitting.
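
Treating filtering power as a bilerp budget can be sketched in a few lines. The cost model here is the usual textbook approximation and an assumption on our part: one bilerp for a bilinear sample, two for trilinear, multiplied by the number of aniso taps actually taken. Real hardware picks the tap count per pixel, so `aniso_taps` should be read as a worst case, and the function names are hypothetical.

```python
BILERPS_PER_CLOCK = 16  # 4 sampler units x 4 bilerps each, from the figures earlier


def bilerps_per_sample(trilinear=False, aniso_taps=1):
    """Bilerps consumed by one filtered sample under the simple cost model."""
    per_tap = 2 if trilinear else 1
    return per_tap * aniso_taps


def filtered_samples_per_clock(trilinear=False, aniso_taps=1,
                               bilerps_per_clock=BILERPS_PER_CLOCK):
    """Peak filtered samples per clock, bandwidth permitting."""
    return bilerps_per_clock / bilerps_per_sample(trilinear, aniso_taps)


if __name__ == "__main__":
    print("bilinear:", filtered_samples_per_clock())
    print("trilinear:", filtered_samples_per_clock(trilinear=True))
    print("16-tap trilinear aniso:",
          filtered_samples_per_clock(trilinear=True, aniso_taps=16))
```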

Of course the sampler hardware is fully threaded, with a dedicated arbiter and sequencer to manage execution, like the similar hardware that runs each R600 shader cluster, but with different heuristics to govern running threads because of the latency involved. Indeed the massively threaded nature of the sampler hardware is a means to hide latency, using the memory controller to make sure the shader core -- as a client of the sampler array via the MC -- is kept busy. Remember we said that the shader core would sleep threads while waiting on sampler data? The two schedulers work together to make sure that's the case, since a trip to DRAM to open a page and get data, and then feed it through the filtering logic, can mean hundreds of clocks of wait time. You don't want to stall the shader core for that, at all.
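
A back-of-the-envelope way to see why so many threads are needed: while one thread sleeps on a fetch, each of the others has to supply a slice of ALU work to fill the gap. The sketch below is that arithmetic only. The latency and ALU figures in the example are illustrative numbers of ours, not R600 measurements, and `threads_to_hide_latency` is a hypothetical helper.

```python
import math


def threads_to_hide_latency(fetch_latency_clocks, alu_clocks_between_fetches):
    """Threads needed so the shader core never idles on a sampler fetch.

    One thread is asleep for fetch_latency_clocks; every other thread
    contributes alu_clocks_between_fetches of work before it too sleeps.
    """
    if alu_clocks_between_fetches <= 0:
        raise ValueError("each thread must do some ALU work between fetches")
    others = math.ceil(fetch_latency_clocks / alu_clocks_between_fetches)
    return 1 + others


if __name__ == "__main__":
    # Illustrative: a 400-clock fetch with 20 clocks of ALU work per thread.
    print(threads_to_hide_latency(400, 20))
```

The less math a shader does per fetch, the more threads the schedulers need in flight to cover the same wait.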

The ROP hardware is next.
