The Claim
3dfx claims that 1024 x 768 is the standard resolution for running games
today, and a recent poll on this website has confirmed this. On top of
that, 32-bit color rendering is gaining in importance, and the ideal frame
rate for an optimal gaming experience is around 60 frames per second. The
conclusion is that we need 1024 x 768 x 32 bits x 60 Hz. The high-end
products in the Voodoo3 series (and many other third-generation products)
approach this ideal situation; the only missing feature is 32-bit color
(although post-filtered 16-bit color comes close). Now, 3dfx wants to add
T-Buffer effects (mainly full-scene anti-aliasing) to this must-have
list, so the ideal situation becomes 1024 x 768 x 32 bits x 60 Hz with
T-Buffer effects. In short, 3dfx wants to take the Voodoo3 performance
of 1024 x 768 x 16 bits x 60 Hz, add T-Buffer effects and move to 32-bit
color.
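To put this target in numbers, here is a minimal sketch of the raw display
arithmetic. It counts only the pixels that end up on screen and ignores
overdraw, Z-buffer and texture traffic, so the real memory demand is
higher. The function names are my own and purely illustrative.

```python
def pixels_per_second(width, height, fps):
    """Pixels that have to reach the screen every second."""
    return width * height * fps


def frame_bytes(width, height, bits_per_pixel):
    """Size of a single color buffer in bytes."""
    return width * height * bits_per_pixel // 8


print(pixels_per_second(1024, 768, 60))   # 47185920 pixels per second
print(frame_bytes(1024, 768, 16))         # 1572864 bytes (1.5 MiB)
print(frame_bytes(1024, 768, 32))         # 3145728 bytes (3 MiB)
```

Moving from 16 to 32 bits doubles the size of every frame before any
T-Buffer effect is added.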
All these effects and improvements are very interesting and they will
definitely increase the image quality, but what about speed? As we all
know, nothing comes for free; doing extra things costs extra, so we need...
More Power
As I explained in the previous paragraph, the T-Buffer effects use 4
sub-pixels for every single pixel. It works out that you are effectively
rendering four complete images of the exact same scene, which are then
combined to form the final on-screen frame (the final image with T-Buffer
effects). Instead of rendering just one image, we need to render 4. This
means that a new product that maintains the 1024 x 768 x 16 bits x 60 Hz
claim of Voodoo3 will need at least 4 times the power, simply because
there is 4 times more work. But there is more. Moving from 16-bit to
32-bit color is not trivial either: most of today's products that support
32-bit color see a performance hit of 20 to 50% when moving from full
16-bit color to full 32-bit color! A 50% hit means the frame rate halves,
which is the same as doing twice the work per frame. So, in the worst
case, we don't have 4 times more work, we have 8 times more work!
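The multiplication above can be sketched in a few lines. This assumes that
work scales linearly with the number of sub-pixels, and that a frame-rate
loss of h is equivalent to 1 / (1 - h) times the work per frame; both are
simplifications of mine, and the function name is hypothetical.

```python
def work_factor(sub_samples, perf_hit):
    """Work per frame relative to single-sample 16-bit rendering.

    perf_hit is the fraction of the frame rate lost when moving from
    16-bit to 32-bit color (0.5 means the frame rate halves).
    """
    color_factor = 1.0 / (1.0 - perf_hit)
    return sub_samples * color_factor


print(round(work_factor(4, 0.20), 2))   # 5.0 (the mild case, a 20% hit)
print(round(work_factor(4, 0.50), 2))   # 8.0 (the worst case, a 50% hit)
```

Under these assumptions the extra work lands somewhere between 5 and 8
times that of a Voodoo3-class chip, with 8 as the upper bound.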
Now, the amount of work that can be done is usually limited by the amount
of bandwidth available. Let me illustrate this using an analogy. Imagine
we have a large storage room that contains the raw materials used in
production, but also the finished products. We have one large machine
that turns raw materials into finished products, and it needs a constant
flow of raw materials to process. So there is a stream of incoming raw
materials and a stream of outgoing products. This company is rather poor
and has only one conveyor belt between the storage room and the machine,
so raw materials and products have to take turns on it (the belt changes
direction). To avoid hiccups in the stream to the machine, we need some
buffers, which can be seen as small local storage. What happens is this:
we send a whole bunch of raw material to the local input buffer, which
fills up because we provide material faster than the machine can consume
it. Meanwhile, the machine is already producing goods, and these are
stored in the output buffer. Once the output buffer starts to fill up we
need to empty it (we don't want a full buffer, since that would stall
the machine), so we change the direction of the belt (we lose some time
here) and start to empty the output buffer. During this time the machine
uses the raw material from the input buffer; luckily the belt is fast
and removes finished goods faster than the machine can produce them.
When the input buffer is almost empty we change direction again and
supply raw material to the input buffer, and the cycle continues.
Now, in this story the bandwidth is defined by the speed and size of the
conveyor belt. The faster the belt goes, the more raw material we can
move in the same amount of time. The wider the belt is, the more it
carries at once, so a wider belt also moves more material in the same
amount of time.
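In hardware terms the belt's width is the width of the memory bus and its
speed is the memory clock. A minimal sketch of that relationship follows;
the bus widths and clock rates are made-up round numbers, not the
specifications of any real board.

```python
def bandwidth_bytes_per_second(bus_width_bits, clock_hz):
    """Peak bandwidth: bytes moved per clock times clocks per second."""
    return bus_width_bits // 8 * clock_hz


base = bandwidth_bytes_per_second(64, 100_000_000)     # hypothetical 64-bit bus at 100 MHz
wider = bandwidth_bytes_per_second(128, 100_000_000)   # same speed, twice the width
faster = bandwidth_bytes_per_second(64, 200_000_000)   # same width, twice the speed

print(base, wider, faster)   # 800000000 1600000000 1600000000
```

Doubling the width and doubling the speed have the same effect on peak
bandwidth, which is why both count as "a bigger belt" in the story.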
Now, it's pretty obvious that timing is very critical in this example:
if the input buffer runs empty the machine stops, since it has no raw
materials to work with. The same happens when the output buffer is full:
there is no room to store finished products, so the machine has to stop.
As long as the bandwidth is large enough this should never happen. Now,
as all companies try to make a profit, they want to make products faster
(time = $$$), so the company decides to add a second machine. As all
companies are cheap they don't add an extra belt, so the same buffers
and belt now supply two machines - as you can guess, this increases the
risk of buffers being empty or full.
In another situation, the company might decide to replace the original
machine with a new one that makes better-quality products. But higher
quality requires more material, so again the buffers will run out faster
and the risk increases.
Naturally, if the problem occurs we can try to increase the buffer sizes
(so the belt has to change direction less often), we can increase the
speed of the belt, or we can make the belt wider (larger in size). All
these actions increase the effective bandwidth to the machine(s).
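The factory story is easy to try out as a toy simulation. The sketch below
is my own simplification with invented numbers: every machine turns one
unit of raw material into one product per tick, the belt moves belt_rate
units per tick in one direction, and reversing it costs switch_cost idle
ticks. It models only the analogy, not any real chip.

```python
def simulate(machines, belt_rate, cap_in=16, cap_out=16,
             switch_cost=1, steps=1000):
    """Return (produced, stalls) for machines sharing one reversible belt."""
    inp, out = cap_in, 0           # start with a full input buffer
    direction = "in"               # the belt is feeding the input buffer
    switch_left = 0                # idle ticks left while the belt reverses
    produced = stalls = 0
    low, high = cap_in // 4, 3 * cap_out // 4

    for _tick in range(steps):
        # 1. The belt moves material one way, or idles while reversing.
        if switch_left > 0:
            switch_left -= 1
        elif direction == "in":
            inp = min(cap_in, inp + belt_rate)
        else:
            out = max(0, out - belt_rate)

        # 2. Each machine works if it has material and room for its output.
        for _machine in range(machines):
            if inp >= 1 and out < cap_out:
                inp -= 1
                out += 1
                produced += 1
            else:
                stalls += 1

        # 3. Reverse the belt when a buffer needs attention.
        if direction == "in" and out >= high:
            direction, switch_left = "out", switch_cost
        elif direction == "out" and inp <= low:
            direction, switch_left = "in", switch_cost

    return produced, stalls


print(simulate(machines=1, belt_rate=3))   # one machine: the belt keeps up
print(simulate(machines=2, belt_rate=3))   # two machines, same belt: stalls appear
print(simulate(machines=2, belt_rate=6))   # two machines, belt twice as wide
```

With these particular numbers a single machine never stalls, two machines
on the same belt stall regularly, and doubling the belt's capacity makes
the stalls disappear again.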
Now, why did I tell this story… after all, aren't we talking about 3D
acceleration? Well, 3D chips and production lines are very similar. The
production line is equivalent to the whole 3D render pipeline, a machine
is equivalent to a texturing unit, the belt is equivalent to the data
path between local on-board memory and the chip, the buffers are the
on-chip cache buffers, and the large storage area is the on-board local
memory. The Voodoo1 was an example of a single-machine chip: there was
one texture unit. Voodoo2 introduced two machines, but also two belts.
Voodoo3 introduced two machines using a single belt, but a belt with
double the width (larger physical size) and higher speed.
TNT2 is an example of two machines and a single belt, but the machines
can run in low-quality and in high-quality mode, also known as 16-bit
mode and 32-bit mode. This chip shows the bandwidth problem clearly. If
you run in 16-bit mode everything is fine and fast, but if you move to
32-bit mode the chip slows down. This is because the higher-quality
setting requires more raw material… but the chip's bandwidth stays the
same, so there is a bigger risk that a buffer will run empty and the
machine has to stop for a while and wait. These machine stops result in
lower frame rates.
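A rough way to see this effect is to let the frame rate be set by
whichever is slower, the chip or the memory path. The figures below are
invented for illustration (a chip that can render 100 Mpixels/s, 1.2 GB/s
of memory bandwidth, and 8 or 16 bytes of memory traffic per pixel); they
are not measurements of the TNT2 or of any other product.

```python
def frames_per_second(pixels_per_frame, bytes_per_pixel,
                      bandwidth_bytes_per_s, chip_pixels_per_s):
    """Frame rate set by the slower of the chip and the memory path."""
    memory_pixels_per_s = bandwidth_bytes_per_s / bytes_per_pixel
    return min(chip_pixels_per_s, memory_pixels_per_s) / pixels_per_frame


PIXELS = 1024 * 768
fps_16 = frames_per_second(PIXELS, 8, 1.2e9, 100e6)    # assumed 16-bit traffic per pixel
fps_32 = frames_per_second(PIXELS, 16, 1.2e9, 100e6)   # assume the traffic doubles

print(round(fps_16, 1), round(fps_32, 1))
print(round(1 - fps_32 / fps_16, 2))   # 0.25, the fraction of frame rate lost
```

In 16-bit mode this hypothetical chip is limited by its own speed; in
32-bit mode the same belt can only feed three quarters of that, so the
machine waits and the frame rate drops by a quarter.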
Now, there is one flaw in this whole story and that is that a chip doesn't
consume raw material, the raw materials don't disappear, and there is
no real transformation as with a production line. A chip processes data,
data are numbers, and these numbers undergo mathematical operations to
create output, which are again numbers. The machine is a bit like a large
calculator. Now, the big difference introduced by this is that the same
numbers can be used several times. You know that mathematics contains
stuff like this: a+b=c, a+d=e. As you can see, both of the formulas use
the value a. Nevertheless, the problem of transportation remains, since
you need all values (a, b and d) to be able to find the solutions (c and
e), but the whole idea becomes more complex. You don't always need new
data from the storage room; often the data you need is already in the
fast local buffer. This means that even if you add extra machines or
better-quality production units you might not need to increase the
bandwidth.
But, this is only true if there is a lot of re-use of the data in the
buffers. Data has to be re-used enough to make sure that this is true
(same value used multiple times in the calculation). Unfortunately, predicting
this is very hard since different games create different situations, and
in some of these situations there will be a lot of data re-use, while
in others there might not be a lot of data re-use (kind of like different
math assignments will use different formulas).
By now, you are probably wondering what that data is… well, the input
data is texture information: the texture that defines the color of the
surfaces of an object. The output is a final rendered on-screen pixel
(the stuff you see on your screen). So if we re-use the same texture
information a lot, the local on-chip buffer is used very efficiently and
we need fewer trips to the on-board memory.
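The effect of re-use can be sketched with a tiny cache model. This is an
assumption-heavy illustration: it uses a least-recently-used replacement
policy and a 16-entry buffer, neither of which is taken from any real
chip, and it treats each texture value as a simple number.

```python
from collections import OrderedDict


def memory_fetches(accesses, cache_size):
    """Count the accesses that miss a small LRU buffer and go to memory."""
    cache = OrderedDict()
    misses = 0
    for texel in accesses:
        if texel in cache:
            cache.move_to_end(texel)        # hit: re-use the on-chip copy
        else:
            misses += 1                     # miss: fetch it over the belt
            cache[texel] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used
    return misses


high_reuse = [i % 8 for i in range(1000)]   # the same 8 values over and over
low_reuse = list(range(1000))               # every value is new

print(memory_fetches(high_reuse, cache_size=16))   # 8
print(memory_fetches(low_reuse, cache_size=16))    # 1000
```

Both runs process 1000 values, but the first needs only 8 trips to the
storage room while the second needs 1000.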
Stay tuned for part 2, which will take a closer look at the T-Buffer, bandwidth and the different ways it can work with the Voodoo5.