Benchmarks


Theoretical Throughputs

From the hardware specifications we know what the theoretical limits of the GeForce FX are billed as, so let's see how they stand up to 3DMark2001SE's theoretical throughput tests.

                       Pixel Fill   Texel Fill   High Poly     High Poly     Vertex
                       (MPix/s)     (MTex/s)     (1L, MTri/s)  (8L, MTri/s)  Shader
GeForce 4 Ti4600         1050.4       2315.3         60.4          12.6        98.0
GeForce FX               1578.0       3477.9        105.4          30.9       178.3
GeForce FX (400/400)     1293.5       2803.5         88.0          24.6       143.1

Gain over GeForce 4 Ti4600:
GeForce FX                  50%          50%          75%          145%         82%
GeForce FX (400/400)        23%          21%          46%           95%         46%

The first thing to notice here is the large disparity between the single- and multi-texturing fill-rates, shown in the Pixel and Texel Fill columns respectively. NVIDIA state that GeForce FX, like Radeon 9700, is an '8x1' design, meaning that it has 8 pixel pipelines each capable of sampling one texture per clock; the theoretical pixel fill-rate should therefore equal the theoretical multi-texture fill-rate (multiple textures are applied over multiple clock cycles, a process commonly termed 'loopback'). Here, though, we can see that the pixel fill-rate is well below both the multi-texture fill-rate and the theoretical hardware fill-rate of 4G pixels/sec.
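The arithmetic behind that expectation is simple enough to sketch. A minimal calculation, assuming the FX 5800 Ultra's 500 MHz core clock and NVIDIA's claimed 8x1 layout (the clock figure and the measured result are taken as stated elsewhere in this review):

```python
def theoretical_fill(core_mhz, pipes):
    """Peak single-texture fill in Mpixels/s for an 'N x 1' design:
    each pipe writes one pixel per clock, sampling one texture per clock."""
    return core_mhz * pipes

# NVIDIA's claimed 8x1 layout at an assumed 500 MHz core clock:
peak = theoretical_fill(500, 8)        # 4000 Mpixels/s
measured = 1578.0                      # single-texture result from the table
print(peak, measured / peak)           # the card achieves only ~39% of peak
```

With loopback, adding texture layers should cost extra clocks rather than pixel pipes, which is why the texel figure can stay near peak while the pixel figure falls.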

We had expected the situation here to be similar to what we have seen with Radeon 9500 PRO. Radeon 9500 PRO uses the same 8x1 chip as Radeon 9700, but where the 9700 sits on a 256-bit memory bus the 9500 PRO sits on a 128-bit double data rate bus, which cannot supply enough bandwidth for eight 32-bit pixels plus everything else per clock cycle. At a 128-bit width, with DDR, the bus could just accommodate eight 32-bit pixel writes in a single cycle, but only at the cost of any other memory accesses (such as Z reads/writes and colour blends) required in the same cycle.
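A back-of-envelope budget shows why the colour writes alone can saturate such a bus. This is a rough illustration assuming a 500 MHz DDR memory clock (the FX 5800 Ultra's figure) and a 500 MHz core clock:

```python
# 128-bit DDR bus: two 128-bit transfers per memory clock = 32 bytes/clock.
bus_bytes_per_clock = 128 // 8 * 2
mem_bandwidth_gbs = bus_bytes_per_clock * 500e6 / 1e9     # 16.0 GB/s available

# Eight 32-bit (4-byte) colour writes per core clock at 500 MHz:
colour_write_gbs = 8 * 4 * 500e6 / 1e9                    # 16.0 GB/s demanded

# Colour writes alone would consume the entire bus, leaving no headroom
# for Z traffic, blending or texture fetches in the same cycle.
print(mem_bandwidth_gbs, colour_write_gbs)
```

The supply and the colour-write demand come out identical, which is exactly the "possible, but at the cost of everything else" situation described above.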

Like Radeon 9500 PRO, GeForce FX also uses a 128-bit memory interface, albeit with both the core and memory running at a much higher rate, so we expected to see a similar pattern. Under the single-texturing fill-rate test the bus cannot accommodate all 8 pixels plus any other memory accesses in a single cycle, so the memory bus is flooded and the pipeline stalls, reducing the single-texture fill-rate to about half its theoretical ability. In the multi-texturing test, because 'loopback' texturing is in operation, the chip is able to balance its resources better: as the texture layers are applied over multiple cycles, the memory bus is not flooded all the time. When a fully textured group of pixels is ready to be passed to the frame buffer the bus will not be able to sustain all memory accesses in one cycle, so anything that doesn't occur on the first cycle stays in the FIFO buffer ready for the next; while the rest of the pipeline is busy applying extra texture layers to the next group of pixels, memory usage is low and the remaining accesses can be completed then. At least, this is what we had seen with the 9500 PRO -- though not, as it turns out, with GeForce FX.
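That balancing act can be captured in a deliberately crude steady-state model: the pipeline retires one group of pixels every max(texturing clocks, write clocks). The "two clocks of bus time per colour write" figure below is an assumption chosen purely to illustrate the mechanism, not a measured value:

```python
def relative_rate(texture_cycles, write_cycles):
    """Groups retired per clock, relative to a one-group-per-clock peak."""
    return 1.0 / max(texture_cycles, write_cycles)

# Single texturing: one clock of texturing, but ~2 clocks of bus time per
# write once Z and other traffic is counted -- the bus is the bottleneck.
pixel_rate_single = relative_rate(1, 2)       # 0.5 -> half the peak pixel fill

# Dual texturing via loopback: two clocks of texturing hide the two clocks
# of write traffic, so the texel rate stays at full speed.
texel_rate_multi = 2 * relative_rate(2, 2)    # 1.0 -> full peak texel fill
print(pixel_rate_single, texel_rate_multi)
```

Under these assumptions the model reproduces the 9500 PRO pattern: single-texture pixel fill at roughly half of peak, multi-texture texel fill near peak.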

                       Pixel Fill   Texel Fill
                       (MPix/s)     (MTex/s)
Theoretical              4000.0       4000.0
3DMark2001 32-bit        1578.0       2315.0
3DMark2001 16-bit        1919.6       3709.4

Shortfall from theoretical:
3DMark2001 32-bit          -61%         -42%
3DMark2001 16-bit          -52%          -7%

Above is the fill-rate performance under 16-bit modes. Running the benchmark in 16-bit alleviates the bandwidth issues slightly, and for a card capable of producing 8 pixels per clock we would expect single-texturing performance to rise well above half the theoretical rate. In this instance, though, the single-texturing result is still in line with what we would expect from a card outputting 4 pixels per clock, while the texturing rate indicates that there are two texturing units in operation on each of those pipelines. We've had a long and wide-ranging discussion on the pipeline arrangement, including a number of other tests, all of which indicate that the configuration of NV30 is similar to GeForce4: a 4 pipeline architecture with two texture sampling units per pixel pipe, at least when straight texture-dependent operations are taking place.
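Dividing the measured 16-bit rates by the core clock makes the inference concrete (500 MHz core clock assumed; the measured figures are from the table above):

```python
core_mhz = 500.0
pixels_per_clock = 1919.6 / core_mhz   # ~3.84 -> consistent with 4 pixel pipes
texels_per_clock = 3709.4 / core_mhz   # ~7.42 -> ~2 texture units per pipe
print(pixels_per_clock, texels_per_clock)
```

Neither figure is anywhere near the 8 pixels per clock an 8x1 design should manage once bandwidth pressure is eased, while the texel figure sits close to 8 samples per clock -- the signature of a 4x2 arrangement.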

Since that discussion NVIDIA have issued the following statement with regard to the number of operations that NV30 can handle per clock (in a theoretical sense):

"Geforce FX 5800 and 5800 Ultra run at 8 Pixels per clock for all of the following:

a) z-rendering
b) Stencil operations
c) Texture operations
d) shader operations"

You’ll note that missing from that list is the number of colour values written to the frame buffer – the principal value required for calculating theoretical fill-rate, unless the criterion for defining a pixel is itself being redefined.

They go on to further state:

"For most advanced applications (such as Doom3) most of the time is spent in these modes because of the advanced shadowing techniques that use shadow buffers, stencil testing and next generation shaders that are longer and therefore make apps “shading bound” rather than “color fillrate bound”. Only Z+color rendering is calculated at 4 pixels per clock, all other modes (z, stencil, texture, shading) run at 8 pixels per clock. The more advanced the application the less percentage of the total rendering is color, because more time is spent texturing, shading and doing advanced shadowing/lighting"

So, there we have it: it does indeed appear that NV30 is fixed to 4 colour writes per clock cycle, which is the primary metric for calculating theoretical fill-rate – so the fill-rate table on the second page should read “2000M pixels/sec, 4000M texels/sec” according to NVIDIA's statement.

The question is, why would NVIDIA claim this operates as an 8 pipeline card when the evidence points to the contrary? Well, first off, as they say, stencil and Z buffer writes can occur at 8 per clock, and NVIDIA detailed an optimised stencil path (and likewise Z rendering) to developers at their recent Dawn-Till-Dusk developer conferences -- though neither of these can really be termed 'pixels'. Another point is that the pixel shader Arithmetic Logic Units appear to come two per pipeline, so each pipeline can be performing multiple pixel shader calculations simultaneously, even though only 4 pixels can be written in a single cycle. And this is a key difference: with the inclusion of multiple pixel processing units we are moving more towards a 'pipelined' CPU approach -- while one ALU is operating on the first pixel, the next ALU is operating on the next -- so NV30 would appear to operate as a pipelined 4 pixel pipe card that can logically be working on 8 pixels at any one point in time, but can only write 4 per clock. However, ATI's R300 architecture features 3 shading units per pipeline, which are also pipelined, meaning that in total 24 pixels can be operated upon in a single cycle -- yet this isn't classified as a 24 pipe architecture!
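The distinction between pixels in flight and pixels retired per clock can be summarised numerically, taking the article's reading of both architectures (4 pipes x 2 shader ALUs for NV30, 8 pipes x 3 shader units for R300):

```python
def shading_model(pipes, units_per_pipe):
    """Return (pixels being shaded at once, colour writes per clock)."""
    in_flight = pipes * units_per_pipe   # pixels in flight across the ALUs
    writes = pipes                       # only one write port per pipe
    return in_flight, writes

print(shading_model(4, 2))   # NV30: 8 pixels in flight, 4 written per clock
print(shading_model(8, 3))   # R300: 24 pixels in flight, 8 written per clock
```

Counting pixels in flight would make R300 a "24 pipe" part by the same logic, which is why writes per clock is the more honest pipeline count.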

While Beyond3D is interested in the technology, and the inner workings of the chip, many of you may be asking whether this really matters. With the public revelations of NV30’s apparent configuration the stance has become “it’s the performance in games that really matters” and for those that are buying this for gaming purposes that is very true. So will a 4x2 configuration, as NV30 appears to be, affect game playing? Well, probably not in many cases and this configuration is relatively well tuned to the bus width of the board. There may be instances, however, where odd numbers of textures are used, which may not be optimal for NV30, but this will not be too much of a performance penalty in most cases, especially if Trilinear filtering is enabled.

One area that may be of concern is the configuration of shader instruction execution. We've already had comments from John Carmack stating that shader execution speed is lower than he would have expected because the shader compiler is 'twitchy', though driver updates may cure this. Given the issues we've seen with 3DMark03 and the performance differences between driver revisions on NV30, this may be a characterising feature of NV30's shader execution. If the driver-level compiler cannot be developed to automatically order instructions so that they perform optimally over NV30's pipeline, NVIDIA's developer relations and driver teams may be kept busy optimising on a game-by-game basis – hopefully the former will be the case.

Well, let's move on to see what its gaming performance is actually like...