Now even with geometry compression, you're still going to have bandwidth issues. Currently, 3D accelerators are limited to around 3 GB/sec of memory bandwidth and a 128-bit data path; some of the next-gen boards actually have less. With newer 3D accelerators we have geometry data, higher-res textures and more fill-rate all competing for that bandwidth. So what can be done about this? The first thing is to reduce the size of textures through texture compression. Basically, with both geometry and texture data, you want to get the most data possible through a small pipe. The only real option is to reduce the size of the information, and as we discussed, compression does exactly that. Besides compression, there are new types of local memory that can help, such as DDR (double data rate, which effectively acts as though it were clocked at 2x the actual clock) memory and embedded memory. 

DDR memory is really interesting, because in theory, it offers twice the performance of SDR memory at the same clock (how close real-world gains come to that, I don't know). What DDR memory does is allow for 2 data transfers per cycle, one on the rising edge of the clock and one on the falling edge. In the case of the GeForce, this lets the external 128-bit datapath behave like an effective 256-bit one (note: internally it is already 256-bit). This gives it the desperately needed bandwidth to handle the fill-rate and T&L. Unfortunately, DDR memory costs considerably more, so we won't see it used as much as SDR, at least not for a while. In the case of the GeForce, DDR boards cost about $100 more than SDR boards.
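The theoretical doubling is easy to see as arithmetic. This is a minimal sketch; the 150 MHz clock is an illustrative round number, not an exact GeForce memory clock:

```python
def peak_bandwidth_gb(bus_bits, clock_mhz, transfers_per_cycle):
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_cycle / 1e9

# A 128-bit bus at an illustrative 150 MHz:
sdr = peak_bandwidth_gb(128, 150, 1)  # one transfer per cycle
ddr = peak_bandwidth_gb(128, 150, 2)  # rising + falling edge
print(sdr, ddr)  # 2.4 vs 4.8 GB/s
```

Same clock, same physical bus width, twice the peak bandwidth, which is why DDR on a 128-bit bus gets described as "effectively 256-bit."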

Embedded memory is memory actually built into the 3D chip's silicon. This offers 3-4x the bandwidth we currently have today, and possibly more, because we don't have to worry about any of the limitations associated with external memory (compare a CPU with a massive internal cache to one that only has regular system memory). We should start seeing 3D solutions with a limited amount of embedded memory as early as the first half of next year.

One problem with 3D accelerators involves efficiency, as we've discussed in the past. 3D accelerators are really designed to render large triangles. When you start using smaller ones, your efficiency goes down. Currently, 3D accelerators average about 80% efficiency. Now when T&L comes into the picture, you're going to start dealing with many more triangles, and smaller ones at that. This is going to require more information to be sent through, and it will require more bandwidth. Well, it seems the obvious answer would simply be to widen the data pipe: go from 64-bit to 128-bit, or 128-bit to 256-bit. While this may help your overall bandwidth problem some, it is not a good solution at all. As you widen the datapath from 128-bit to, say, 256-bit, your theoretical bandwidth increases, but your efficiency drops, because small transfers can't fill the wider bus and more of each cycle's capacity goes to waste. In other words, while you're getting all this extra bandwidth, you can't use it as well.
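The efficiency drop falls straight out of the math if you assume each small transfer occupies whole bus cycles (a simplification; real memory controllers are more complicated). The 96-bit packet below is a hypothetical small vertex, just for illustration:

```python
import math

def bus_utilization(packet_bits, bus_bits):
    """Fraction of the bus capacity actually carrying data when a
    packet occupies whole bus cycles (simplifying assumption)."""
    cycles = math.ceil(packet_bits / bus_bits)
    return packet_bits / (cycles * bus_bits)

# A hypothetical 96-bit packet:
print(bus_utilization(96, 128))  # 0.75  on a 128-bit bus
print(bus_utilization(96, 256))  # 0.375 on a 256-bit bus
```

Doubling the bus doubled the theoretical bandwidth but halved the utilization for this packet, so the effective gain is nothing. That's why simply widening the pipe is a poor answer to a workload of many small triangles.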

So now that we know about the efficiency problem, what can we do to resolve it? Well, there are several different approaches that can be taken. The first of these would be caching, something everyone will probably be doing to some extent in the near future. Caching here works a lot like the caches we are familiar with. Instead of sending a bunch of small packets of information (inefficient), we accumulate the information in a cache and then send it all out as one larger packet, resulting in a more efficient process. Another option (which can be done along with caching as well) is to tile your memory. But what is this?
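The accumulate-then-send idea can be sketched as a toy write-combining buffer. The class name and interface here are invented for illustration, not any real driver's API:

```python
class WriteCombiner:
    """Toy write combiner: collect small writes, then flush them
    downstream as one large burst (hypothetical interface)."""
    def __init__(self, burst_size, send):
        self.burst_size = burst_size
        self.send = send          # callback that performs the big transfer
        self.pending = []

    def write(self, data):
        self.pending.append(data)
        if sum(len(d) for d in self.pending) >= self.burst_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send(b"".join(self.pending))  # one large packet
            self.pending = []

bursts = []
wc = WriteCombiner(burst_size=32, send=bursts.append)
for _ in range(8):
    wc.write(b"\x00" * 8)   # eight small 8-byte writes...
print(len(bursts))          # ...reach memory as two 32-byte bursts
```

The memory sees two full-width bursts instead of eight tiny transfers, which is exactly the efficiency win the caching approach is after.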

When we talk about tiles, everybody immediately thinks of PowerVR, since it was referred to as a tile-based 3D accelerator. Now, what I'll talk about here is related to PowerVR, but it's not how PowerVR works. The techniques described here are optimizations for the frame buffer that can be used on traditional architectures like the Voodoo-series and the TNT-series.
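The core of frame-buffer tiling is an address mapping: instead of storing the frame buffer row by row, you store it as small square blocks, so pixels that are close together on screen end up close together in memory. This is an illustrative scheme with a made-up 8x8 tile size, not any specific chip's layout:

```python
def linear_addr(x, y, width):
    """Row-major (linear) frame buffer address, in pixels."""
    return y * width + x

def tiled_addr(x, y, width, tile=8):
    """Address when the frame buffer is split into tile x tile
    blocks stored contiguously (illustrative, assumes width is
    a multiple of the tile size)."""
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    offset = (y % tile) * tile + (x % tile)
    return tile_index * tile * tile + offset

# Two vertically adjacent pixels in a 640-wide buffer:
print(linear_addr(3, 4, 640) - linear_addr(3, 3, 640))  # 640 pixels apart
print(tiled_addr(3, 4, 640) - tiled_addr(3, 3, 640))    # only 8 apart
```

Since a small triangle touches pixels in a compact 2D region, tiling turns its scattered row-major accesses into accesses within one or two contiguous tiles, which is much friendlier to burst-oriented memory.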