Architecture Performance Analysis

Filtering Performance

Our filtering throughput tester runs through a range of formats, varying the channel count for some, and reports results in texture ops per second. In this article we look at the bilinear and point sampling results for INT8, INT16 and FP16, concentrating on 1- and 4-channel results. In our performance analysis piece we'll take a look at RGBE (9:9:9:5) and FP10 as well, and also flesh out the 2- and 3-channel results, since there's some interesting data there too. We'll also add in 8800 GTS and X1950 XTX. We focus on INT8 and FP16 here as common surface formats, and INT16 because it's seen recent high-profile use in a D3D10 application.

We sample from a 4x4 texture to get maximum performance, keeping texel fetches in the texture cache as much as possible. Compared to G80, R600 has one quarter the INT8 bilerp rate available, and one half the FP16 bilerp rate, per clock. Remember that Radeon HD 2900 XT runs the sampler hardware at 740MHz, and we use GeForce 8800 GTX as the initial foil, since it's the highest clocked G80 we have access to at the time of writing.
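To put the per-clock ratios into absolute terms, here is a minimal sketch of the arithmetic. Only the 740MHz R600 sampler clock and the one quarter and one half ratios come from the text above; the 575MHz G80 sampler clock and the figure of 16 bilerps per clock for R600 are assumptions made for illustration, and the function names are hypothetical.

```python
# Peak bilinear rate sketch. Only the 740 MHz R600 sampler clock and the
# per-clock ratios (1/4 for INT8, 1/2 for FP16) come from the article text.
R600_CLOCK_MHZ = 740.0
G80_CLOCK_MHZ = 575.0          # assumption: 8800 GTX sampler clock
R600_BILERPS_PER_CLOCK = 16.0  # assumption: illustrative per-clock rate

# R600's per-clock bilerp rate as a fraction of G80's, as stated in the text.
R600_VS_G80_PER_CLOCK = {"INT8": 0.25, "FP16": 0.5}


def peak_rate_gtexels(per_clock: float, clock_mhz: float) -> float:
    """Peak bilinear samples per second, in billions."""
    return per_clock * clock_mhz * 1e6 / 1e9


def compare(fmt: str) -> dict:
    """Absolute peak rates for both chips, plus R600's share of G80's rate."""
    ratio = R600_VS_G80_PER_CLOCK[fmt]
    g80_per_clock = R600_BILERPS_PER_CLOCK / ratio
    r600 = peak_rate_gtexels(R600_BILERPS_PER_CLOCK, R600_CLOCK_MHZ)
    g80 = peak_rate_gtexels(g80_per_clock, G80_CLOCK_MHZ)
    return {"r600": r600, "g80": g80, "r600_over_g80": r600 / g80}


if __name__ == "__main__":
    for fmt in ("INT8", "FP16"):
        print(fmt, compare(fmt))
```

Under those assumptions, R600's higher sampler clock claws back some of the per-clock deficit, but the absolute peak rate still sits well below G80's for both formats.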

As claimed, R600 is full speed for INT8 and FP16 bilinear filtering, including with 4-channel surfaces, when bandwidth permits (we're just reading out of cache here for all 16 texels). For other uncompressed formats, the hardware will scale back to half speed for 4-channel versus 1, although with G80 scaling differently it enjoys higher 1-channel FP32 filtering performance because of its clock (they run at the same rate, clocks normalised).

So even though 4-channel FP16 is free on R600, INT16 doesn't enjoy the same performance because the sampler hardware doesn't have sufficient precision for it (there's not enough mantissa available), and it drops back to half speed as noted. G80's sampler hardware has sufficient precision to maintain the rate. In terms of D32F filtering (Depth32 on the graph), R600 runs full speed, and the format is useful for pretty much every shadowing implementation out there, bar VSM.
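The channel scaling described above can be summarised as a small lookup. This is a sketch of the behaviour as we describe it, not of the hardware itself: the function name is hypothetical, rates are expressed relative to the same format's 1-channel rate, and 2- and 3-channel inputs are rejected because they aren't covered here.

```python
# R600 bilinear filtering rate relative to the same format's 1-channel rate,
# for the cases described in the text. 2- and 3-channel behaviour is not
# covered there, so those inputs are rejected rather than guessed.
FULL_SPEED_AT_4_CHANNELS = {"INT8", "FP16"}


def r600_relative_bilinear_rate(fmt: str, channels: int) -> float:
    """Return 1.0 for full speed, 0.5 for half speed.

    `fmt` is assumed to name an uncompressed format, e.g. "INT16".
    """
    if channels not in (1, 4):
        raise ValueError("only 1- and 4-channel results are described")
    if channels == 1 or fmt in FULL_SPEED_AT_4_CHANNELS:
        return 1.0
    # Other uncompressed formats (INT16 included) halve at 4 channels.
    return 0.5


if __name__ == "__main__":
    for fmt in ("INT8", "FP16", "INT16"):
        print(fmt, r600_relative_bilinear_rate(fmt, 4))
```

D32F is a single-channel format, so it falls under the 1-channel case in this model.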

Point sampling shows much the same picture for the formats we test.

ROP Throughput

We've done a little work with the ROPs, too, to check out throughput. We test colour, colour+Z and Z-only fill, full screen.
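For readers unfamiliar with how a full-screen fill figure is usually derived, here is a minimal sketch. It assumes the common approach of drawing a number of full-screen layers and dividing the pixels written by the elapsed time; the function name and the example inputs are hypothetical and are not results from our tester.

```python
def fill_rate_gpixels(width: int, height: int, layers: int, seconds: float) -> float:
    """Fill rate in billions of pixels per second.

    Assumes `layers` full-screen quads were drawn at width x height in
    `seconds` of GPU time, with every pixel of every layer written.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return width * height * layers / seconds / 1e9


if __name__ == "__main__":
    # Made-up inputs, purely to show the arithmetic.
    print(fill_rate_gpixels(1600, 1200, 1000, 0.25))
```

The same calculation applies to colour, colour+Z and Z-only runs; only the render state differs between them.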

The sustained double Z rate claimed is in evidence, with colour writes turned off, and you can see Z-only fill drop off as expected with 8xMSAA enabled (it's ¼ with 4xMSAA, not shown). You'll see how ROP throughput translates to real-world performance in the IQ and Performance parts of our fuller R600 analysis.