Data fetch and filtering

In our diagram, we showed how each cluster has effectively dedicated data fetch and filtering logic (let's call that sampling, to save some keystrokes) to service it. Rather than a global sampler array, each cluster gets its own, reducing peak texturing performance per thread (one SP thread can't use all of the sampler hardware, even if the other samplers are idle) but making the chip easier to build.

The sampler hardware in each cluster runs in a separate, slower clock domain than the SPs, and because the chip supports D3D10 and thus constant buffers as a data pool to fetch from, each sampler section has a bus to and from L1 and to and from dedicated constant buffer storage. Measured L1 size appears to be 8KiB. We measure cache sizes crudely: we fetch texture surfaces of varying sizes and component counts in a shader, measure performance, and estimate capacity from the point where performance starts to drop as cache misses kick in. It's essentially a guess about size, but one we're confident in.
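The inference step of that probing approach can be sketched as follows. This is a toy model, not our actual tool: the function name and the synthetic cost numbers are illustrative, and on real hardware the "cost" column would come from timed shader runs.

```python
def estimate_cache_size(sizes, costs, threshold=1.5):
    """Estimate cache capacity from a sweep of working-set sizes.

    sizes: working-set sizes in bytes, ascending.
    costs: measured per-access cost at each size (arbitrary units).
    Returns the largest size whose cost stays within `threshold`
    times the fastest observed cost -- i.e. the knee where cache
    misses start to dominate.
    """
    base = min(costs)
    last_fast = sizes[0]
    for size, cost in zip(sizes, costs):
        if cost > base * threshold:
            break
        last_fast = size
    return last_fast

# Synthetic sweep: cheap hits up to 8KiB, then misses dominate.
sizes = [1024, 2048, 4096, 8192, 16384, 32768]
costs = [1.0, 1.0, 1.1, 1.2, 3.0, 3.1]
print(estimate_cache_size(sizes, costs))  # → 8192
```

The real measurement is noisier than this, of course, which is why we call the result a confident guess rather than a hard number.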

Data fetch and filtering is implicitly tied to the GPU's memory controller and caching schemes, and to the DRAMs the chip is connected to and their performance. We wonder whether any dedicated sampler hardware exists for constant fetch, which is unfiltered and has different access patterns across memory compared to the other data accesses the chip is already being asked to perform. We therefore suggest that, under D3D10, base filter rates will decrease while constant buffers are accessed through the same logic, but also that constant fetch is optimised well enough to make the performance hit acceptable.

Filtering-wise, the sampler hardware provides bilinear as the base mode, with trilinear and anisotropic (non-box, perspectively correct) filtering available for all surface types the samplers can access. Up to 16x anisotropic filtering is again available, and the out-of-the-box setting with all shipping drivers is a high level of angle invariance. Put simply, more surfaces in the scene receive the high filtering levels requested by the application, or by the user via the control panel, by default, raising the minimum level of image quality with surface filtering enabled by a significant amount.

We've been banging the image quality drum for some time now, with a view to the main IHVs raising defaults whenever the hardware has a surfeit of performance. That happens with G80, big time, and it should not be underestimated or glossed over. While some base filtering optimisations remain in the driver's default settings, the return to almost angle-invariant filtering with G80 and GeForce 8-series products is most welcome. The hardware's filtering is also notably orthogonal, which deserves a mention: the chip can filter any integer or floating point surface the sampler hardware can access, including surfaces with non-linear colour spaces. Filtering rates essentially just become a product of consumption of the available per-cycle bilerps.
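To make the bilerp accounting concrete, here is a minimal sketch. The interpolation itself is the standard bilinear formula, and the cost figures are the usual textbook accounting (one bilerp for bilinear, two for trilinear, multiplied by the tap count for anisotropic), not vendor-published numbers for G80.

```python
def bilerp(c00, c10, c01, c11, fx, fy):
    """One bilinear interpolation over a 2x2 texel quad:
    lerp across x on both rows, then lerp the results across y."""
    top = c00 * (1.0 - fx) + c10 * fx
    bot = c01 * (1.0 - fx) + c11 * fx
    return top * (1.0 - fy) + bot * fy

def bilerps_per_sample(mode, aniso_taps=1):
    """Bilerp cost per filtered sample under the usual accounting:
    trilinear touches two mipmap levels (plus a cheap lerp between
    them); each anisotropic tap costs a full filter at each level."""
    levels = 2 if mode == "trilinear" else 1
    return levels * aniso_taps

# Sampling dead centre of a quad averages the four texels.
print(bilerp(0.0, 1.0, 0.0, 1.0, 0.5, 0.5))
```

Under this accounting, 16x anisotropic trilinear consumes 32 bilerps per sample, which is why sustained filter rate divides down so directly from the per-cycle bilerp budget.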

The sampler hardware runs at base clock, as do the on-chip memories and the back end of the chip, which is conveniently next on the list.