Our GeForce FX preview raised a number of questions during testing. We put these questions to NVIDIA and Tony Tamasi took some time out to reply.


We've seen the official response concerning the pipeline arrangement, and to some extent it would seem that you are attempting to redefine how 'fill-rate' is classified. For instance, you say that Z and stencil operations run at 8 per cycle; however, neither of these is a colour value rendered to the frame buffer (which is how we would normally calculate fill-rate), but rather an off-screen sample that merely contributes to the generation of the final image. If we start counting these as 'pixels', it potentially opens the floodgates to all kinds of samples that could be classed as pure 'fill-rate', such as FSAA samples, and we end up in a confusing mess of numbers. Even though we are moving into a more programmable age, don't we still need to stick to some basic, fundamental specifications?

No, we need to make sure that the definitions/specifications that we do use to describe these architectures reflect the capabilities of the architecture as accurately as possible.

Using antiquated definitions to describe modern architectures results in inaccuracies and causes people to draw bad conclusions. This issue is amplified for you as a journalist, because you will communicate your conclusions to your readership. This is an opportunity for you to educate your readers on the new metrics for evaluating the latest technologies.

Let's step through some math. At 1600x1200 resolution, there are about 2 million pixels on the screen. If we have a 4ppc GPU running at 500MHz, our "fill rate" is 2.0Gp/sec. So, our GPU could draw the screen 1000 times per second if depth complexity is zero (2.0G divided by 2.0M). That is clearly absurd. Nobody wants a simple application that runs at 1000 frames per second (fps). What they do want is fancier programs that run at 30-100 fps.
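
As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python; the figures are the round numbers quoted in the answer, not measured values.

```python
# Reproducing the fill-rate arithmetic above (round figures from the answer).
pixels_per_clock = 4             # assumed pixel pipelines
clock_hz = 500e6                 # 500 MHz core clock
screen_pixels = 1600 * 1200      # roughly 2 million pixels

fill_rate = pixels_per_clock * clock_hz          # 2.0 Gpixels/s
redraws_per_second = fill_rate / screen_pixels   # ~1000 with zero overdraw

print(f"Fill rate: {fill_rate / 1e9:.1f} Gpixels/s")
print(f"Theoretical screen redraws per second: {redraws_per_second:.0f}")
```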

So, modern applications render the Z buffer first. Then they render the scene to various 'textures' such as depth maps, shadow maps, stencil buffers, and more. These various maps are heavily biased toward Z and stencil rendering. Then the application does the final rendering pass on the visible pixels only. In fact, these pixels are rendered at a rate that is well below the 'peak' fill rate of the GPU because lots of textures and shading programs are used. In many cases, the final rendering is performed at an average throughput of 1 pixel per clock or less because sophisticated shading algorithms are used. One great example is the paint shader for NVIDIA's Time Machine demo. That shader uses up to 14 textures per pixel.
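
As a rough illustration of the pass structure described here, consider the following sketch; the render_* functions are stand-in stubs, not any real engine or API.

```python
# Schematic frame structure for a modern multi-pass renderer, as described
# above. The render_* functions are illustrative stubs, not a real API.

def render_depth_only(scene):            # Z-only pre-pass, no colour writes
    pass

def render_shadow_map(scene, light):     # depth-only pass from the light's view
    return f"shadow map for {light}"

def render_shaded(scene, shadow_maps):   # final colour pass, visible pixels only
    pass

def render_frame(scene, lights):
    render_depth_only(scene)             # dominated by Z/stencil rate
    shadow_maps = [render_shadow_map(scene, l) for l in lights]
    render_shaded(scene, shadow_maps)    # dominated by textures and shader math

render_frame(scene="demo scene", lights=["sun", "lamp"])
```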

And, I want to emphasize that what end users care most about is not pixels per clock, but actual game performance. The NV30 GPU is the world's fastest GPU. It delivers better game performance across the board than any other GPU. Tom's Hardware declared "NVIDIA takes the crown" and HardOCP observed that NV30 outpaces the competition across a variety of applications and display modes.

Our testing concludes that the pipeline arrangement of NV30, certainly for texturing operations, is similar to that of NV25, with two texture units per pipeline - this can even be seen when applying odd numbers of textures, which incur the same performance drop as even numbers of textures. I also attended the 'Dawn-Till-Dusk' developer event in London and sat in on a number of the presentations in which developers were informed that the second texture comes for free (again indicating two texture units) and that ddx and ddy work by just looking at the values in the neighbouring pixels' shaders, as this is a 2x2 pipeline configuration - which it is unlikely to be if it were a true 8 pipe design (unless it operated as two 2x2 pipelines!). In what circumstances, if any, can it operate beyond a 4 pipe x 2 texture configuration, bearing in mind that Z and stencils do not require texture sampling (in this instance it's 8x0!)?

Not all pixels are textured, so it is inaccurate to say that fill rate requires texturing.

For Z+stencil rendering, NV30 is 8 pixels per clock. This is in fact performed as two 2x2 areas as you mention above.

For texturing, NV30 can have 16 active textures and apply 8 textures per clock to the active pixels. If an object has 4 textures applied to it, then NV30 will render it at 4 pixels per 2 clocks because it takes 2 clock cycles to apply 4 textures to a single pixel.
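
To make that arithmetic explicit, here is a minimal sketch assuming 8 texture fetches per clock shared across the active pixels; the per-clock pixel cap for colour rendering is left as a parameter, since that figure is one of the points being debated in this interview.

```python
# Effective pixel rate when texture fetches are the bottleneck (a sketch under
# the assumptions stated above, not a statement of NV30's actual configuration).

def effective_pixels_per_clock(textures_per_pixel,
                               tex_fetches_per_clock=8,   # per the answer above
                               colour_pixel_cap=4,        # assumed cap for colour rendering
                               z_stencil_rate=8):         # untextured Z/stencil rate
    if textures_per_pixel == 0:
        return z_stencil_rate
    return min(colour_pixel_cap, tex_fetches_per_clock / textures_per_pixel)

for t in (0, 1, 2, 4, 8, 16):
    print(f"{t:2d} textures per pixel -> {effective_pixels_per_clock(t):g} pixels per clock")
```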

NV25 had 4 Z checking units per pipeline, which are used when calculating multiple MSAA samples per pipeline, or (possibly, as we understand it) rejecting multiple pixels per pipeline. It appears that a similar configuration has been carried across to NV30, but tweaked so that two of the units can Z sample and two can stencil sample when stencil ops are taking place - is this the case? If so, what happens to the stencil rate when MSAA is used?

For the case of anti-aliasing, both Z and stencil operations are performed at the fragment (the AA sub-pixel sample) level. Otherwise, you would get errors at the fragment level, and those errors would propagate to the pixel level.
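
For illustration, here is a minimal sketch of per-sample ('fragment') depth testing for one multisampled pixel; this is purely schematic and not a description of NV30's hardware.

```python
# Each AA sample keeps its own depth and colour; the depth test is resolved per
# sample, and the final pixel colour averages the samples. Schematic only.

def shade_pixel(samples, coverage, new_depth, new_colour):
    """samples: list of [depth, colour]; coverage: which samples the triangle hits."""
    for i, covered in enumerate(coverage):
        if covered and new_depth < samples[i][0]:   # per-sample depth test
            samples[i] = [new_depth, new_colour]
    return sum(colour for _, colour in samples) / len(samples)  # colour resolve

pixel = [[1.0, 0.0] for _ in range(4)]              # 4x AA: background depth 1.0, colour 0.0
print(shade_pixel(pixel, coverage=[True, True, False, False],
                  new_depth=0.5, new_colour=1.0))   # half-covered edge pixel -> 0.5
```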

When we interviewed Geoff Ballew at the London launch of NV30 he mentioned that there is an 'integer' path alongside the floating point path. Given that operations are moving over to the floating point path, including the integer path seems like a redundant design, not least because integer values can be converted even under FP16. ATI's R300 employs only an FP24 (minimum) pipeline for each of the 8 pixel processing units, meaning that regardless of the operation being performed it will always internally be calculated at a minimum of FP24 precision - why wasn't a similar configuration adopted with NV30? As it stands, if you're running a legacy app over the integer pipe then you have a big chunk of silicon for FP units doing nothing, and vice versa when calculating FP ops.

This arrangement allows NV30 to maximize throughput under a variety of conditions. As a consumer, you don't really care whether or not all of the transistors are busy or idle at any particular time - in fact, for lower power and quieter operation, you actually want some of the chip to be idle some of the time if it means that the parts that are active are more efficient at processing the current tasks. If the only tool in your toolbox is a hammer, everything gets treated like a nail, but a power screwdriver is a better tool for driving screws into wood. What end users really care about is application (game) performance. By dedicating processing elements to exactly what is required, we can maximize performance for both legacy (8-bit integer) apps and for newer 32-bit floating point apps.

One other thing to recognize is that even in new floating point applications, there is still a mix of integer, short floating point, and full floating point operations. When both data types and operation precisions are present, we can use BOTH the integer and floating point pipelines and get twice the throughput. 16-bit floating point is sufficient precision for most color blending operations, and in fact it's even overkill for some. 32-bit floating point precision is absolutely required for texture address calculations and texture lookups for dependent texture reads that are used for bump mapping, reflections, etc; 24-bit is not enough to get the right answer. 24-bit is also not enough for doing geometry operations such as displacement mapping and morphing. Given that the NV30 pipeline is 32-bit floating point top-to-bottom, you can process data uniformly through the pipeline, including rendering vertex data to the frame buffer. You just can't do that with a 24-bit floating point pipeline.
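
To put rough numbers on the precision argument, here is a small sketch comparing the step size near 1.0 for the three formats discussed; the mantissa widths used (10 bits for FP16, 16 for an R300-style FP24, 23 for FP32) are the commonly cited ones, and the 4096-texel texture is just an illustrative case.

```python
# Smallest representable step (ulp) near 1.0 for each format, and the
# corresponding worst-case error in texel units when that value addresses a
# 4096-texel texture axis. Format widths are the assumptions noted above.

formats = {"FP16": 10, "FP24": 16, "FP32": 23}   # mantissa bits
texture_size = 4096                              # texels along one axis (illustrative)

for name, mantissa_bits in formats.items():
    ulp = 2.0 ** -mantissa_bits
    texel_error = ulp * texture_size
    print(f"{name}: ulp near 1.0 ~ {ulp:.2e}, ~{texel_error:.4f} texels of error")
```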

Testing with the DirectX _PP precision hint enabled and disabled under the 42.68 drivers shows no difference in performance. It was quoted that FP16 is able to run at twice the rate of FP32; considering there is no difference in speed, this would suggest either that the current drivers are forcing FP16, meaning that DirectX developers and applications will not be able to benefit from the increased precision afforded by FP32, or that the drivers are currently unable to effectively schedule FP16 processing. Which is it?

The application chooses the level of precision and may or may not use it as efficiently as possible.

We also continue to optimize our drivers over time. NVIDIA's driver strategy is to have an initial driver that ships with new GPUs that is fast, stable and compatible. Once that initial driver is released, we go back and do another round of performance tuning to create a "Detonator" release that brings even more performance to both newer and older NVIDIA GPUs. Our end users love it because they know the board they buy today will deliver even more performance in the future.

During the Dusk-Till-Dawn event and from subsequent discussions I've had with developers, there appears to be some confusion as to the status of FP32 in the DirectX drivers. Is it not the case that texture addressing needs to take place at at least FP24 (meaning FP32 for NV30) precision to meet DirectX 9 compliancy? Would your drivers fail WHQL certification if they were not running at this precision at this stage of the shader pipe?

According to the DirectX 9 spec, FP24 is allowed, but we at NVIDIA believe it is insufficient to avoid errors for a variety of texture and geometry calculations. So we designed our products to use FP32. For example, NVIDIA uses FP32 for texture address calculation. In the case of dependent texture reads (e.g. a bumpy, shiny object with a reflection map in a scene), full precision (FP32) for the texture address calculation is critical for getting a high-quality result.

It is unfortunate that the spec is lax on this point because FP32 is also important to match the FP precision for any work that is done on the CPU (whether Pentium4 or Athlon). Some applications still do load balancing by performing some geometry processing on the CPU. If some math is done in FP32 on the CPU (CPUs don't recognize the FP24 format) and other math is done at FP24 on the GPU, you will get errors. Because pixel shading units are now capable of rendering geometry, they need to have the same precision as the vertex shading unit. Advanced pixel shading programs can render data to a texture that is subsequently read back into the Vertex Shading Engine for additional geometric processing. For this reason, it is much better to have the same precision through the entire process.
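
As a small illustration of the CPU/GPU mismatch point, the sketch below truncates a float32 mantissa from 23 to 16 bits as a crude stand-in for an FP24-style format (exponent differences are ignored) and compares the result against the FP32 value; it only demonstrates that the two representations disagree, nothing more.

```python
import struct

def to_float32(x):
    """Round a Python float to FP32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_fp24_like(x):
    """Crude FP24-style rounding: keep only 16 mantissa bits of the float32
    representation (sign/exponent handling ignored for simplicity)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x7F & 0xFFFFFFFF))[0]

cpu_result = to_float32(0.123456789)    # the value as the CPU holds it in FP32
gpu_result = to_fp24_like(cpu_result)   # the same value squeezed into 16 mantissa bits
print(cpu_result, gpu_result, cpu_result - gpu_result)
```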

And, of course, drivers would fail WHQL if they fail to meet minimum WHQL requirements such as FP24 or higher for texture address calculation.

Comparing 'Application' and 'Balanced' filtering, it would appear that Trilinear filtering on 'Balanced' is a lot "looser", in that the mip-map blending is only done partially at the mip-map boundaries, and much of the image is only using one mip-map level. This would appear to be a reasonably easy way to gain performance under texture-intensive applications; is this why it has been done? Would you classify this as 'true' Trilinear?

The techniques used in the "Balanced" and "Aggressive" settings are there specifically to give the end user more performance. NVIDIA added these settings to give end users more options to balance quality versus performance. I would classify this as "adaptive" Trilinear. If the user wants the classic Trilinear algorithm applied to every pixel, then the user chooses the "Application" setting and uses the video configuration options in the application itself to choose Trilinear. If the user wants to use the adaptive algorithms in NV30, then he or she can choose the "Balanced" setting, which is more conservative, or the "Aggressive" setting.
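
For comparison, here is a sketch of how classic trilinear differs from the kind of 'looser', adaptive blend described above, in terms of the mip blend weight; this is a guess at the behaviour being discussed, not NVIDIA's actual algorithm.

```python
# Mip blend weight versus fractional LOD. Classic trilinear blends everywhere;
# the "adaptive" variant sketched here blends only within a narrow band around
# the mip transition and otherwise samples a single mip level. Illustrative guess only.

def classic_trilinear_weight(lod):
    return lod - int(lod)                  # blend fraction between adjacent mips

def adaptive_trilinear_weight(lod, band=0.25):
    f = lod - int(lod)
    lo, hi = 0.5 - band / 2, 0.5 + band / 2
    if f <= lo:
        return 0.0                         # finer mip only
    if f >= hi:
        return 1.0                         # coarser mip only
    return (f - lo) / band                 # blend only near the transition

for lod in (2.1, 2.4, 2.5, 2.6, 2.9):
    print(f"LOD {lod}: classic {classic_trilinear_weight(lod):.2f}, "
          f"adaptive {adaptive_trilinear_weight(lod):.2f}")
```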

The level of anisotropy with 'Balanced' and 'Aggressive' gets less effective as the anisotropic filtering level goes up; this is amplified with aggressive filtering, to the point that there is very little difference in the parts that are filtered between 4x and 8x Balanced AF. Is there a worthwhile difference between these modes? Also, these modes appear to form a square filtering pattern; with the aggressive mode the filtering levels do not increase a great deal directly in front of the viewer - is this effective? Competing 'performance' solutions, such as that of the Radeon 8500, at least apply maximum levels of anisotropy at the 90-degree angles; why wasn't a similar pattern adopted with NV30?

You mentioned that ATI delivers maximum anisotropy at the 90 degree angles but you failed to highlight the fact that they deliver lower anisotropy at other angles (most notably 45 degrees). This is a clever trick but the consequence of this is image quality that varies by the angle of the polygon. In contrast, NVIDIA's solution delivers equivalent anisotropy (and therefore equivalent *quality*) at every polygon angle.

ATI's solution is tailored for older first person shooter applications that have square walls and a flat floor. NVIDIA's solution is tailored for modern applications that use a lot more geometry to make the environment richer. Rounded walls, outdoor terrain and vegetation are key elements of next-generation applications and those features cause more polygons to intersect the screen at non-90-degree angles.

NV30's implementation is different from Radeon 8500 because it was designed by NVIDIA, with a different design philosophy. No two 3D architectures are exactly the same between different vendors.

The question of which implementation is better is determined entirely by the end user's preferences. It is a very individualized choice. NVIDIA's goal was to offer the user lots of choices that were slightly different (application settings, quality settings and aggressive settings), and let them choose the setting that they like best.