Architecture Discussion - G84 versus G80 per clock

G84 is effectively one quarter of a G80 in terms of general shading ability per clock. It features two SP clusters to G80's eight, and thus sports 32 of the same SP units that G80 has. Aggregate PDC size is one quarter as a result, too. As a reminder, each SP is theoretically able to dual-issue a MADD and a MUL per clock, with G84 running the general shading core at a much higher clock than the rest of the chip (1.45GHz vs 675MHz). Interpolation and special function shading power per clock is unchanged, per cluster, compared to G80.
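As a rough sanity check on those shading numbers, the theoretical peak arithmetic rate can be worked out from the SP count and the shader clock. This is only a sketch: it assumes a MADD counts as two floating-point operations and a MUL as one, and the function and variable names are ours for illustration.

```python
# Theoretical peak shader throughput for G84.
# Assumption: MADD = 2 flops, MUL = 1 flop, both issued every clock.
FLOPS_PER_SP_PER_CLOCK = 2 + 1
SPS_PER_CLUSTER = 16  # 32 SPs across two clusters


def peak_gflops(sp_count: int, shader_clock_ghz: float) -> float:
    """Peak GFLOPS if every SP dual-issues MADD + MUL on every clock."""
    return sp_count * FLOPS_PER_SP_PER_CLOCK * shader_clock_ghz


g84_sps = 2 * SPS_PER_CLUSTER  # two clusters
g80_sps = 8 * SPS_PER_CLUSTER  # eight clusters

shading_ratio = g84_sps / g80_sps      # per-clock shading relative to G80
g84_peak = peak_gflops(g84_sps, 1.45)  # shader core at 1.45GHz
```

Under those assumptions `shading_ratio` comes out at one quarter, and `g84_peak` is an upper bound rather than a rate you should expect to see in real shaders.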

In terms of per-clock ROP throughput, G84 is one third of G80. It has two of the same ROP partitions, each providing a quad of ROPs and a 64-bit channel to main memory via two connected DRAMs. Per-partition ROP ability is unchanged, with an 8x Z-only rate and up to 16xCSAA, and the same blend rate as G80, bandwidth permitting. For a closer look at the ROP hardware G84 inherits, see our architecture and image quality pieces concerning G80.
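The ROP and memory bus totals follow directly from the partition count. A minimal sketch, assuming G80 has six such partitions (which is what the one-third figure implies); the helper name is hypothetical:

```python
# ROP count and memory bus width from the number of ROP partitions.
ROPS_PER_PARTITION = 4        # "a quad of ROPs"
BUS_BITS_PER_PARTITION = 64   # one 64-bit channel per partition


def rop_config(partitions: int) -> dict:
    """Total ROPs and total memory bus width for a given partition count."""
    return {
        "rops": partitions * ROPS_PER_PARTITION,
        "bus_bits": partitions * BUS_BITS_PER_PARTITION,
    }


g84 = rop_config(2)
g80 = rop_config(6)  # assumption: six partitions, implied by the 1/3 ratio

rop_ratio = g84["rops"] / g80["rops"]
```

That gives G84 eight ROPs on a 128-bit bus, one third of the G80 figures under the stated assumption.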

In terms of data sampling and filtering, G84's per-clock ability differs from G80's. Each cluster now has 8 'pixels' per clock of data addressing, to pair with the same 8 INT8 bilerps (FP16 at half speed again) per clock for fetch and filter. You therefore get one bilerp to burn per data address the chip sets up, rather than the two you get with G80's 1:2 ratio. That means no more effectively free trilinear or 2x bilinear AF, but a boost in the address hardware instead to return the ratio to 1:1.

That gives G84 a sampling rate of 10.8 Gsamples per second at the basic 675MHz clock, and 11.68 Gsamples per second for the XFX XXX edition we have on test. So compared to G80 that's 1/4 of the general shading, 1/3 of the ROPs and, thanks to the doubling of per-cluster addressing, 1/2 of the sample data addressing, all per clock.
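The sampling figures can be reproduced with a few lines of arithmetic. This is a sketch using only the per-cluster numbers quoted above; G80's four addresses per cluster is inferred from the stated doubling, the XFX core clock is back-derived from its quoted rate rather than taken from a spec sheet, and all names are ours.

```python
# Per-clock sampling resources: clusters, addresses and INT8 bilerps per cluster.
G84 = {"clusters": 2, "addr": 8, "bilerps": 8}
G80 = {"clusters": 8, "addr": 4, "bilerps": 8}  # addr inferred from the doubling


def per_clock(chip: dict, unit: str) -> int:
    """Chip-wide units of the given kind available each core clock."""
    return chip["clusters"] * chip[unit]


def sample_rate_msamples(chip: dict, core_clock_mhz: int) -> int:
    """INT8 bilinear sample rate in Msamples/s at the given core clock."""
    return per_clock(chip, "bilerps") * core_clock_mhz


reference_rate = sample_rate_msamples(G84, 675)  # 10800 Msamples/s

# Back-derive the XFX core clock from its quoted 11.68 Gsamples/s.
XFX_RATE_MSAMPLES = 11_680
implied_xfx_clock_mhz = XFX_RATE_MSAMPLES / per_clock(G84, "bilerps")

address_ratio = per_clock(G84, "addr") / per_clock(G80, "addr")
filter_ratio = per_clock(G84, "bilerps") / per_clock(G80, "bilerps")
```

Run as written, the quoted XFX rate implies a 730MHz core clock, and the per-clock ratios against G80 come out at 1/2 for addressing and 1/4 for filtering.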

Outside of the basic architecture outlined above, G84 adds further logic compared to G80 that deals with video decoding and processing, so let's talk about that a bit more.