The Claim


3dfx claims that 1024 x 768 is the standard resolution for running games today, and a recent poll on this website confirms it. On top of that, 32-bit color rendering is gaining in importance, and the ideal frame rate for an optimal gaming experience is around 60 frames per second. The conclusion, then, is that we need 1024 x 768 x 32 bits x 60 Hz. The high-end products of the Voodoo3 series (and many other third-generation products) approach this ideal; the only missing effect is 32-bit color (although the post-filtered 16-bit color comes close). Now, 3dfx wants to add T-Buffer effects (mainly full-scene anti-aliasing) to this must-have list, so the ideal situation becomes 1024 x 768 x 32 bits x 60 Hz with T-Buffer effects. In short, 3dfx wants to take the Voodoo3 performance level, 1024 x 768 x 16 bits x 60 Hz, add T-Buffer effects, and move to 32-bit color.
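To put some numbers on that target, here is a quick back-of-the-envelope calculation; the resolution, color depth, and refresh figures are the ones quoted above, the rest is simple arithmetic:

```python
# Pixel throughput implied by the "ideal" target of 1024 x 768 x 32 bits x 60 Hz.
WIDTH, HEIGHT = 1024, 768
BYTES_PER_PIXEL = 4  # 32-bit color
FPS = 60

pixels_per_second = WIDTH * HEIGHT * FPS
framebuffer_bytes_per_second = pixels_per_second * BYTES_PER_PIXEL

print(f"{pixels_per_second / 1e6:.1f} Mpixels/s")          # 47.2 Mpixels/s
print(f"{framebuffer_bytes_per_second / 2**20:.0f} MB/s")  # 180 MB/s
```

Note that the 180 MB/s covers framebuffer writes alone; texture reads, Z-buffer traffic, and overdraw all come on top of it.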

All these effects and improvements are very interesting, and they will definitely increase image quality, but what about speed? As we all know, nothing comes for free; doing extra work costs extra, so we need...

More Power


As I explained in the previous paragraph, the T-Buffer effects use 4 sub-samples for every single pixel. So you are basically rendering four complete images of the exact same scene, and these are combined to form the final on-screen frame (the final image with T-Buffer effects). Instead of rendering just one image, we need to render four. This means that a new product that maintains the 1024 x 768 x 16 bits x 60 Hz claim of Voodoo3 will need at least four times the power, simply because there is four times more work. But there is even more. Moving from 16- to 32-bit color is not trivial either: most of today's products that support 32-bit color see a performance hit of 20 to 50% when moving from full 16-bit to full 32-bit color! So we don't have four times more work, we have eight times more work!
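That "eight times" figure is just two multipliers stacked; the factor of 2 for 32-bit color is an assumption (a worst-case rounding of the 20-50% hits quoted above up to doubled work), not a measured number:

```python
# Rough workload multiplier relative to the Voodoo3 baseline (1024 x 768 x 16 bits x 60 Hz).
SUBSAMPLES = 4  # T-Buffer: 4 sub-samples rendered per final on-screen pixel
COLOR_COST = 2  # assumption: full 32-bit color roughly doubles the work

total_multiplier = SUBSAMPLES * COLOR_COST
print(total_multiplier)  # 8 -> the "eight times more work" of the text
```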

Now, the amount of work that can be done is usually defined by the amount of bandwidth available. Let me illustrate this with an analogy. Imagine we have a large storage room. This storage room contains raw materials used in production, but it also contains finished products. We have one large machine that turns raw materials into finished products, and this machine needs a constant flow of raw materials to process. So there is a stream of incoming raw materials and a stream of outgoing products. This company is rather poor, however, and has only one conveyor belt between the storage room and the machine, so raw materials and products must be transported alternately (the belt changes direction).

To avoid hiccups in the stream to the machine, we need some buffers, which can be seen as small local storage. What happens is this: we send a whole batch of raw material to the local input buffer, which fills up since we provide material faster than the machine can consume it. Meanwhile, the machine is already producing goods, which are stored in the output buffer. Once the output buffer starts to fill up, we need to empty it (we don't want a full buffer, since that would stall the machine). So we change the direction of the belt (losing some time) and start emptying the output buffer. During this time the machine works from the raw material in the input buffer; luckily the belt is fast and removes finished goods faster than the machine can produce them. Once the input buffer is almost empty, we change direction again and resupply it, and the cycle continues.

In this story, the bandwidth is defined by the speed and size of the conveyor belt. The faster the belt runs, the more raw material we can deliver in the same amount of time. And the wider the belt, the more it can carry at once, which also speeds up transport.

Now, it's pretty obvious that timing is critical in this example: if the input buffer runs empty, the machine stops, since it has no raw materials to work with. The same happens when the output buffer is full: there is no room to store finished products, so the machine has to stop. As long as the bandwidth is large enough, this should never happen. But all companies try to make a profit and want to make products faster (time = $$$), so our company decides to add a second machine. And since all companies are cheap, they don't add an extra belt: the same buffers and belt now supply two machines, which, as you can guess, increases the risk of the buffers running empty or full. In another situation, the company might replace the original machine with a new one that makes better-quality products; higher quality requires more material, so again the buffers run out faster and the risk of a stall increases.

Naturally, if the problem occurs, we can increase the buffer sizes, increase the speed of the belt, or make the belt wider. All of these actions increase the bandwidth to the machine(s).
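The belt-and-buffer behavior described above can be sketched as a toy simulation. All rates and sizes here are invented for illustration; this is not a model of any real chip:

```python
def simulate(belt_rate, machine_rate, buf_size, ticks=200):
    """Toy half-duplex belt: it alternates between filling the input buffer
    and draining the output buffer, and the machine stalls when it is
    starved of raw material or has no room for finished goods."""
    inp, out, made, stalls = buf_size, 0, 0, 0
    direction = "drain"  # start by shipping finished goods out
    for _ in range(ticks):
        # The belt moves items in only one direction per tick.
        if direction == "fill":
            inp = min(buf_size, inp + belt_rate)
            if inp == buf_size:
                direction = "drain"
        else:
            out = max(0, out - belt_rate)
            if out == 0:
                direction = "fill"
        # The machine consumes raw material and produces finished goods.
        if inp >= machine_rate and out + machine_rate <= buf_size:
            inp -= machine_rate
            out += machine_rate
            made += machine_rate
        else:
            stalls += 1  # buffer empty or full: the machine waits
    return made, stalls

# A belt twice as fast as the machine keeps it fed without stalls...
made_ok, stalls_ok = simulate(belt_rate=4, machine_rate=2, buf_size=8)
# ...but double the machine's appetite (more machines, or higher quality)
# without touching the belt, and stalls appear.
made_slow, stalls_slow = simulate(belt_rate=4, machine_rate=4, buf_size=8)
print(made_ok, stalls_ok)      # no stalls
print(made_slow, stalls_slow)  # stalls > 0: "machine stops"
```

Doubling the machine's demand without adding belt capacity is exactly the "buffers run empty or full" problem described above; the stall count is the analogue of the lost frame rate.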

Now, why did I tell this story? After all, we aren't talking about 3D acceleration. Well, 3D chips and production lines are very similar. The production line is equivalent to the whole 3D rendering pipeline, the machine to the texturing units, the belt to the data path between local on-board memory and the chip, the buffers to the on-chip caches, and the large storage room to the on-board local memory. The Voodoo1 was an example of a single-machine chip: it had one texture unit. Voodoo2 introduced two machines, but also two belts. Voodoo3 introduced two machines sharing a single belt, but a belt with double the width (larger physical size) and a higher speed.

TNT2 is an example of two machines and a single belt, but its machines can run in a low-quality and a high-quality mode, also known as 16-bit and 32-bit mode. This chip shows the bandwidth problem clearly. In 16-bit mode everything is fine and fast, but move to 32-bit mode and the chip slows down. The higher-quality setting requires more raw material, yet the chip's bandwidth stays the same, so there is a bigger risk that the buffers run empty and the machines have to stop and wait. These machine stops result in lower frame rates.

Now, there is one flaw in this whole story: a chip doesn't actually consume raw material. The raw materials don't disappear, and there is no real transformation as on a production line. A chip processes data; data are numbers, and these numbers undergo mathematical operations to create output, which are again numbers. The machine is more like a large calculator. The big difference this introduces is that the same numbers can be used several times. Mathematics contains things like this: a+b=c, a+d=e. As you can see, both formulas use the value a. The transportation problem remains, since you need all the values (a, b and d) to find the solutions (c and e), but the whole picture becomes more complex. You don't always need new data from the storage room; often the data you need is already in the fast local storage. This means that even if you add extra machines or better-quality production units, you might not need to increase the bandwidth.

But this is only true if there is a lot of re-use of the data in the buffers (the same value used multiple times in the calculation). Unfortunately, predicting this is very hard, since different games create different situations: some show a lot of data re-use, while others show very little (much like different math assignments use different formulas).

By now, you are probably wondering what that data is. Well, the input data is texture information: the textures that define the colors of the surfaces of an object. The output is the final rendered on-screen pixel (the stuff you see on your screen). So if we re-use the same texture information, the local on-chip buffer works very efficiently, since the same data serves many pixels.
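The effect of texture re-use on that on-chip buffer can be illustrated with a toy LRU cache model. The cache size and access patterns here are invented for illustration; real texture caches are organized differently:

```python
from collections import OrderedDict

def hit_rate(accesses, cache_size):
    """Fraction of texel fetches served from a small LRU on-chip cache
    instead of going out over the 'belt' to local memory."""
    cache, hits = OrderedDict(), 0
    for texel in accesses:
        if texel in cache:
            hits += 1
            cache.move_to_end(texel)  # mark as most recently used
        else:
            cache[texel] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)

# Heavy re-use: a small set of texels fetched over and over (e.g. a tiled texture).
reuse = [t % 16 for t in range(1000)]
# No re-use: every fetch asks for a texel we have never seen before.
no_reuse = list(range(1000))

print(hit_rate(reuse, 32))     # 0.984 -> almost every fetch stays on-chip
print(hit_rate(no_reuse, 32))  # 0.0   -> every fetch consumes memory bandwidth
```

In the first pattern the belt barely has to move; in the second, every single texel ride the belt, which is the situation where bandwidth becomes the bottleneck.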

Stay tuned for part 2, which will take a closer look at the T-Buffer, bandwidth, and the different ways it can work with the Voodoo5.