The numbers in the press documentation are consistent with GeForce FX utilising a 128-bit memory bus, is that the case?

It is still a 128-bit bus.

So, other people on the market have a 256-bit bus, right, and some are fast and some are slow -- just having 256-bit doesn't mean you are fast. What does make you fast is a good architectural balance between your core and your memory, combined with using very fast memory. We have the world's fastest memory, 500MHz, which on data rate is 1GHz. So, in rough comparison, ATI is wider, but we're a hell of a lot faster. And making sure that you're tuned with the pipeline is even more important.
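
As a rough illustration of the width-versus-clock trade-off he describes, peak theoretical bandwidth is simply bus width times effective data rate. The 1GHz figure comes from the answer above; the 620MHz clock for the 256-bit example is an assumption for the sake of the sketch, not a quoted specification:

    # Back-of-the-envelope peak memory bandwidth: bus width (bits) x effective data rate.
    def peak_bandwidth_gb_s(bus_bits, data_rate_mhz):
        # bytes moved per transfer times transfers per second
        return (bus_bits / 8) * (data_rate_mhz * 1e6) / 1e9

    print(peak_bandwidth_gb_s(128, 1000))  # 128-bit bus at a 1GHz data rate -> 16.0 GB/s
    print(peak_bandwidth_gb_s(256, 620))   # a wider but slower 256-bit bus -> ~19.8 GB/s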

We know that GeForce FX has a total of 8 pixel pipelines running at 500MHz, but how many texture mapping units does it feature per pipeline?

Well, as we move into programmable shading the old conventions of fixed pipelines are becoming less important and less accurate, so with that caveat let me go back and answer your question.

We have 8 pipelines and they can each apply one texture per clock so we can apply 8 textures per clock.
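
Taken at face value, that gives a simple peak texel rate from the figures quoted above (a back-of-the-envelope calculation, not a measured number):

    # Peak texel rate = pipelines x textures applied per pipeline per clock x core clock.
    pipelines = 8
    textures_per_pipe = 1
    core_clock_hz = 500e6
    print(pipelines * textures_per_pipe * core_clock_hz / 1e9, "gigatexels/s")  # 4.0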

Does that apply to both FP32 (128-bit) and FP16 (64-bit)? For example, can FP16 do two texture reads per clock?

No, you're limited to 8 textures applied per clock. However, we can do addressing for 16 textures at any one time. So even though you can only rigidly apply 8 per clock, you can line up the next 8 and then apply them in the next clock. That level of flexibility is very important as we move to procedural shading and sophisticated shader programs, so the key thing is how many textures you can queue up and address, and how quickly you can apply them in a shader program.
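
A toy model of that queue-and-apply idea, purely to illustrate the claim (the real pipelining is obviously more involved than this):

    # Up to 16 textures can be addressed (queued) at once, but at most 8 are
    # applied per clock, so the application rate sets the floor on clocks per pixel.
    def clocks_to_apply(num_textures, applied_per_clock=8):
        return -(-num_textures // applied_per_clock)  # ceiling division

    for n in (8, 14, 16, 32):
        print(f"{n} textures -> {clocks_to_apply(n)} clock(s) of application")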

There was talk that FP16 (64-bit floating point rendering) could run at twice the speed of FP32 (128-bit floating point rendering), is that the case?

Yes, it is, because we have native support in our hardware for FP16 and FP32. So, every pipeline is wide enough to accommodate the full 128 bits through the entire thing -- in the Vertex Shader, in the Pixel Shader and out to the frame buffer. Because we support 128-bit throughout the entire pipeline, we added some extra control lines and we can split those 128-bit channels into 64-bit channels. Now, that's only in the shading architecture, so we don't get twice as many pixels, but you get twice as many 64-bit instructions. Also, if you want to use FP16 you'll have a smaller frame buffer, so it has a lower footprint in memory as well.

With the FP16 mode you still get all the benefits of floating point, so you get the high precision. I think that for a lot of the first person shooter games, or games that want triple digit frame rates, the FP16 modes are really going to be important.
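
To put a number on the footprint point, per-pixel storage follows directly from the channel widths; the resolution below is just an example value:

    # 4-channel (RGBA) floating point frame buffer footprint.
    # FP32: 4 x 32 bits = 16 bytes/pixel; FP16: 4 x 16 bits = 8 bytes/pixel.
    def framebuffer_mib(width, height, bits_per_channel, channels=4):
        return width * height * channels * bits_per_channel / 8 / 2**20

    w, h = 1600, 1200
    print(f"FP32: {framebuffer_mib(w, h, 32):.1f} MiB")  # ~29.3 MiB
    print(f"FP16: {framebuffer_mib(w, h, 16):.1f} MiB")  # ~14.6 MiB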

I assume that anything currently available using the 32-bit format will be run in FP16 mode?

Actually, no. We have native support for 32-bit integer, which is how we get the performance on the older apps. If we were to run them as FP16 then they wouldn't run as fast. So we have dedicated hardware with native support for 32-bit per pixel integer, 64-bit per pixel floating and 128-bit per pixel floating.
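
The three native formats he lists map onto familiar channel layouts; a quick sketch of the per-pixel sizes (the colour value is arbitrary):

    # 32-bit integer  : 4 x 8-bit channels  (classic RGBA8)   -> 4 bytes/pixel
    # 64-bit floating : 4 x 16-bit channels (FP16)            -> 8 bytes/pixel
    # 128-bit floating: 4 x 32-bit channels (FP32)            -> 16 bytes/pixel
    import struct

    rgba = (1.0, 0.5, 0.25, 1.0)
    int8_pixel = bytes(round(c * 255) for c in rgba)
    fp16_pixel = struct.pack("<4e", *rgba)
    fp32_pixel = struct.pack("<4f", *rgba)
    print(len(int8_pixel), len(fp16_pixel), len(fp32_pixel))  # 4 8 16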

With the shader lengths GeForce FX allows, if they were utilised to their full extent wouldn't the performance be fairly catastrophic for real time applications at this point?

Yes, it's true to say that if every pixel used thousands of shader instructions then the performance wouldn't be playable, but what we are trying to achieve here is not to give the developer any limitations. For instance, if there is some detail that a developer wishes to use hundreds of instructions on, then our architecture will allow them to do so and not limit them.
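
For a very rough sense of why thousands of instructions per pixel aren't playable, assume (purely for illustration) that each of the 8 pipelines retires about one shader instruction per clock at 500MHz:

    # Crude upper bound: total instruction issue rate divided by shader length.
    pipelines, clock_hz = 8, 500e6

    def peak_pixels_per_second(instructions_per_pixel):
        return pipelines * clock_hz / instructions_per_pixel

    for n in (10, 100, 350, 1000):
        print(f"{n:4d} instr/pixel -> {peak_pixels_per_second(n)/1e6:6.1f} Mpixels/s")
    print(f"1024x768 @ 60fps needs ~{1024*768*60/1e6:.0f} Mpixels/s per shaded pass")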

What were the shader lengths used in the demos shown at the launch?

There were a variety, but we had some in the neighbourhood of 350 instructions. The Time Machine aging truck demo also uses 14 texture layers.

The developer documentation was quoting vertex throughput of 1.5 times GeForce4's on a clock-for-clock basis, which would equate to 3 Vertex Shaders.

The 1.5 number is old, because we actually have 3 times the vertex performance of Ti4600, at 500MHz.

As we moved to more programmability, instead of implementing a smaller Vertex Shader and replicating it, we tackled the problem from a different direction: we have a pool of calculating units and an intelligent scheduling mechanism at the front of it. So instead of an entire Vertex Shader, what we have made is a sea of vertex math engines that are all glued together with this interface.
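
A purely conceptual sketch of the "pool plus scheduler" idea he describes -- the engine count and the round-robin policy below are invented for illustration and say nothing about the actual design:

    from collections import deque

    def schedule_vertices(vertices, num_engines=6):
        """Hand each vertex to the next engine in a shared pool of math units."""
        pool = deque(range(num_engines))
        assignments = []
        for v in vertices:
            engine = pool[0]   # naive round-robin stand-in for the real scheduler
            pool.rotate(-1)
            assignments.append((v, engine))
        return assignments

    print(schedule_vertices([f"v{i}" for i in range(7)], num_engines=3))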

3Dlabs Wildcat VP features an array of 16 scalar processors for its Vertex Shader, is GeForce FX's arrangement something similar?

It is fundamentally a different architecture, but some of the benefits of having smaller instruction units and more of them apply to our Vertex Shader architecture as well, but it's only a vague, high-level association.