Now even with geometry compression, you're still going to have bandwidth
issues. Currently, 3D accelerators are limited to around 3 GB/sec of memory
bandwidth and a 128-bit data path, and some of the next-gen boards actually
have less. At the same time, newer 3D accelerators have to push more geometry
data, higher-resolution textures and more fill-rate through that same path.
So what can be done about this? Well, the first thing is to reduce the size
of textures through texture compression. Basically, with both geometry and
texture data, you want to get as much data as possible through a small pipe,
and the only real option is to reduce the size of the information, which is
what compression does. Besides compression, there are new types of local
memory that can help, such as DDR (double data rate, which effectively acts
as though it were clocked at 2x the actual clock) memory and embedded memory.
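Just to put a number behind that "around 3 GB/sec" figure, here's a quick
back-of-the-envelope calculation in C. The 183 MHz memory clock, the texture
traffic figure and the 4:1 compression ratio are assumptions for illustration
only; actual boards and compression schemes vary.

#include <stdio.h>

/* Back-of-the-envelope check of the "around 3 GB/sec" figure.
 * Peak bandwidth = bus width (in bytes) * memory clock * transfers per clock.
 * The 183 MHz clock and the 4:1 compression ratio are assumed examples. */
int main(void)
{
    double bus_bytes = 128.0 / 8.0;          /* 128-bit data path        */
    double clock_hz  = 183e6;                /* assumed SDR memory clock */

    double peak = bus_bytes * clock_hz;      /* SDR: one transfer/clock  */
    printf("peak bandwidth: %.2f GB/sec\n", peak / 1e9);

    /* With a 4:1 texture compression scheme, the same textures need only
     * a quarter of that pipe, leaving room for geometry and fill-rate.  */
    double textures = 1.6e9;                 /* assumed texture bytes/sec */
    printf("texture traffic: %.2f GB/sec uncompressed, %.2f compressed\n",
           textures / 1e9, textures / 4.0 / 1e9);
    return 0;
}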
DDR memory is really interesting, because in theory it offers twice the
bandwidth of SDR memory at the same clock (how real-world this is, I don't
know). What DDR memory does is allow for two data transfers per clock cycle,
one on the rising edge and one on the falling edge. In the case of the
GeForce, this lets the 128-bit external data path (note: internally the chip
is still 256-bit) behave like an effective 256-bit path. This gives it the
desperately needed bandwidth to handle the fill-rate and T&L. Unfortunately,
DDR memory costs considerably more, so we won't see it used as much as
SDR, or at least not for a while. In the case of the GeForce, DDR boards
cost about $100 more than SDR boards.
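As a rough sketch of what "two transfers per cycle" buys you, here is the
same bandwidth arithmetic with SDR and DDR side by side. The 150 MHz clock
is an assumed example figure, not the spec of any particular board.

#include <stdio.h>

/* SDR vs. DDR at the same clock and bus width.  DDR transfers data on
 * both the rising and the falling edge of the clock, so it gets two
 * transfers per cycle.  The 150 MHz clock is an assumed example figure. */
int main(void)
{
    double bus_bytes = 128.0 / 8.0;               /* 128-bit external bus  */
    double clock_hz  = 150e6;                     /* assumed memory clock   */

    double sdr = bus_bytes * clock_hz * 1.0;      /* one transfer per clock */
    double ddr = bus_bytes * clock_hz * 2.0;      /* rise + fall            */

    printf("SDR peak: %.2f GB/sec\n", sdr / 1e9);
    printf("DDR peak: %.2f GB/sec (acts like an effective 256-bit path)\n",
           ddr / 1e9);
    return 0;
}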
Embedded memory is memory actually built into the 3D chip's silicon. This
offers 3-4x the bandwidth we have today, and possibly more, because we no
longer have to worry about the limitations of going off-chip to external
memory (compare a CPU with a massive internal cache to one that only has
regular system memory). We should start seeing 3D solutions with a limited
amount of embedded memory as early as the first half of next year.
One problem with 3D accelerators involves efficiency, as we've discussed
in the past. 3D accelerators are really designed to render large triangles;
when you start using smaller ones, your efficiency goes down. Currently,
3D accelerators average about 80% efficiency. Now when T&L comes into
the picture, you're going to start dealing with many more triangles, and
smaller ones at that. That means more information has to be sent through,
and it will require more bandwidth. Well, it seems the obvious answer would
simply be to widen the data pipe: go from 64-bit to 128-bit, or 128-bit to
256-bit. While this may help your overall bandwidth problem some, it is not
a good solution at all. As you widen the data path from 128-bit to, say,
256-bit, your theoretical bandwidth increases, but your efficiency drops,
because each memory access now moves a larger chunk and the small, scattered
accesses generated by tiny triangles leave more of every transfer unused.
In other words, while you're getting all this extra bandwidth, you can't
use it as well.
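Here's a small, purely illustrative calculation of that trade-off. The 80%
figure is the one quoted above for today's accelerators; the efficiency
assumed for a 256-bit bus is a made-up number meant only to show the trend,
not a measurement.

#include <stdio.h>

/* Usable bandwidth is peak bandwidth times efficiency.  The 80% figure is
 * the one quoted above; the 55% for a 256-bit bus is an assumed number,
 * there only to illustrate the trend. */
int main(void)
{
    double peak_128 = 2.9, eff_128 = 0.80;   /* 128-bit bus               */
    double peak_256 = 5.8, eff_256 = 0.55;   /* 256-bit bus, assumed eff. */

    printf("128-bit effective: %.2f GB/sec\n", peak_128 * eff_128);
    printf("256-bit effective: %.2f GB/sec\n", peak_256 * eff_256);
    /* Twice the theoretical bandwidth, but nowhere near twice the usable
     * bandwidth -- small, scattered triangle accesses waste more of every
     * wider transfer. */
    return 0;
}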
So now that we know about the efficiency problem, what can we do to resolve
it? Well, there are several different approaches that can be taken. The
first of these is caching, something everyone will probably be doing to
some extent in the near future. It works a lot like the caches we're already
familiar with: instead of sending out a bunch of small packets of information
(which is inefficient), we store the information in a cache and then send it
all out as one larger packet of data, resulting in a much more efficient
process. A rough sketch of the idea follows.
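Below is a minimal sketch of that batching idea, assuming a hypothetical
32-byte burst size and made-up function names. It isn't how any particular
chip implements its cache; it only shows the flush-when-full behavior
described above.

#include <string.h>
#include <stdio.h>

/* Instead of writing every small piece of pixel data to the frame buffer
 * as it arrives (lots of tiny, inefficient transfers), collect the writes
 * in an on-chip buffer and flush them as one large burst. */

#define BURST_BYTES 32   /* assumed size of one efficient memory burst */

static unsigned char cache[BURST_BYTES];
static int cached = 0;

/* Stand-in for the actual bus transfer -- one big, efficient transaction. */
static void burst_write(const unsigned char *data, int len)
{
    printf("burst of %d bytes sent to the frame buffer\n", len);
    (void)data;
}

static void flush_cache(void)
{
    if (cached > 0) {
        burst_write(cache, cached);
        cached = 0;
    }
}

/* Small writes accumulate in the cache; only a full cache goes out. */
static void write_pixel_data(const unsigned char *data, int len)
{
    if (cached + len > BURST_BYTES)
        flush_cache();
    memcpy(cache + cached, data, len);
    cached += len;
}

int main(void)
{
    unsigned char pixel[4] = { 0xff, 0x00, 0x00, 0xff }; /* one 32-bit pixel */
    for (int i = 0; i < 20; i++)
        write_pixel_data(pixel, sizeof pixel);  /* 20 tiny writes...  */
    flush_cache();                              /* ...become 3 bursts */
    return 0;
}

In real hardware all of this happens in silicon, of course; the point is
simply that many tiny transfers become a few large ones.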
Another option, which can be done along with caching, is to tile your memory.
But what is this? When we talk about tiles, everybody immediately thinks of
PowerVR, since it was referred to as a tile-based 3D accelerator. Now, what
I'll talk about here is related to PowerVR, but it's not how PowerVR works.
The techniques described here are frame buffer optimizations that can be
used on traditional architectures like the Voodoo series and the TNT series.