How does this influence T-Buffer effects?


As I explained before, adding T-Buffer effects and moving from 16 to 32-bit color does not come for free. The T-Buffer effect requires (at least) 4 sub-samples per pixel, so we need 4 times as much output as we have today. Moving from 16 to 32-bit output also increases the bandwidth needs (like a higher quality mode for the machine). Let me illustrate this last increase with some numbers:

Today we render in full 16-bit mode. This means we use 16-bit textures, a 16-bit Z-Buffer and a 16-bit frame buffer (read/write and RAMDAC access). In full 32-bit mode we use 32-bit textures, a 32-bit Z-Buffer and a 32-bit frame buffer, so basically all our information goes from 16-bit to 32-bit. In our machine analogy this would be similar to our machine using twice as many raw materials and producing final products that are twice the size of the original ones. So, all transport doubles; it's pretty obvious that this can't happen without problems unless you increase the bandwidth.
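To put rough numbers on the doubling, here is a minimal back-of-the-envelope sketch. The per-pixel access pattern (one texture read, one Z read, one Z write, one color write) is a simplification I am assuming for illustration; real chips batch and cache these accesses:

```python
# Simplified per-pixel memory traffic: one texture read, one Z read,
# one Z write and one color write per rendered pixel (illustrative model).
def bytes_per_pixel(bits):
    accesses = 4  # texture read, Z read, Z write, color write
    return accesses * bits // 8

print(bytes_per_pixel(16))  # full 16-bit mode:  8 bytes per pixel
print(bytes_per_pixel(32))  # full 32-bit mode: 16 bytes per pixel, exactly double
```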

Now, some of you might say: but my xxxx-card does 16-bit and 32-bit at the same frame rate. Yes, this might be true, but that's because there are other limitations. Assume your machine is rather slow, actually so slow that the conveyor belt runs empty sometimes… This means that if you increase quality, the flow increases, but you have space left on the belt, so there is no slowdown. In 3D chips the main cause of these empty belts is the main CPU: the CPU is not fast enough in sending commands to the machine (something like "make a blue object a red object"). If the machine doesn't get commands, it can't produce; if it doesn't produce, the belt is empty. This problem is known as being CPU limited. The problem we described is being bandwidth limited, and you can almost always see it at higher resolutions (1024x768 and above). Right now, there is no product that can render 1600x1200 at 32 bits at the same speed as 1600x1200 at 16 bits, simply because the bandwidth is too low (the belt isn't fast enough). And I mean FULL 32 bits, not just a 32-bit frame buffer.
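A quick sketch shows why high resolutions expose the bandwidth limit. The frame rate, overdraw and bus figures below are illustrative assumptions, not measurements:

```python
# Rough raw-traffic estimate for 1600x1200 in full 32-bit mode.
# Assumed for illustration: 60 fps target, overdraw factor of 2,
# and 4 memory accesses of 4 bytes per rendered pixel.
width, height, fps, overdraw = 1600, 1200, 60, 2
accesses, bytes_per_access = 4, 4
traffic = width * height * fps * overdraw * accesses * bytes_per_access
print(traffic / 1e9)  # ~3.7 GB/s; a 128-bit bus at 166MHz peaks at ~2.7 GB/s
```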

So, now 3dfx needs to come up with a solution for the bandwidth problem, because they need 4 times the output (and thus 4 times the input) and they need to move from 16 to 32 bits, which, again, doubles the problem. Basically, the increase in bandwidth is thus a factor of 8!
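The factor of 8 is just the product of the two increases:

```python
# Four sub-samples per pixel, each twice as wide as before.
subsample_factor = 4    # T-Buffer needs (at least) 4 sub-samples per pixel
depth_factor = 32 / 16  # everything moves from 16-bit to 32-bit
print(subsample_factor * depth_factor)  # 8.0
```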

Now, truly increasing the bandwidth by a factor of 8, so running the belt 8 times faster or making it 8 times wider, is not necessary because of the re-use of data in the buffers. Exactly how much more bandwidth we need depends on the efficiency of re-use. Guessing this re-use is very difficult since it depends on the game/application. Because of this, companies like 3dfx run simulations. During these simulations (where they run actual games on simulated hardware) they can see exactly how efficient the re-use of data is. Based on these tests, they can decide how much bandwidth they really need. Naturally, they can't test every game, so there is always the risk of underestimating the need. A good example of such a failure is the TNT1/2 when running the original Unreal. Unreal was programmed in a way that clashed with the re-use system of the TNT1/2, resulting in very low performance. The re-use system, or cache management, decides what data should be kept in the local buffer and what data should be replaced with new data transported in from the large storage. What happened was that 3dfx managed to get a lot of re-use and, thus, no bandwidth problems, while the system of NVIDIA failed to get good re-use of buffer data, meaning that the bandwidth fell short and the processor had to wait, giving the well-known poor frame rate numbers... Yet today Unreal runs fine on TNT1/2 boards. This is not due to new caching algorithms or driver/hardware changes; it's due to the fact that Unreal was changed to be more NVIDIA cache friendly. This kind of problem can occur with all cache algorithms. Breaking cache performance is very easy, so game coders have to pay a lot of attention to make sure that caching doesn't break… Epic learned this the hard way.
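To see how much re-use matters, here is a tiny illustration of how the cache hit rate changes the external bandwidth actually required. The raw-traffic figure and hit rates are invented for illustration; the real figures are exactly what the simulations described above are meant to find:

```python
# Effective external bandwidth need as a function of cache hit rate.
raw_traffic_gb = 8.0  # hypothetical traffic with zero re-use, in GB/s

for hit_rate in (0.0, 0.5, 0.9):
    external = raw_traffic_gb * (1.0 - hit_rate)
    print(f"hit rate {hit_rate:.0%}: {external:.1f} GB/s must cross the bus")
```

A game that happens to break the caching scheme pushes the hit rate towards zero, and the full raw traffic lands on the bus again, which is exactly what happened to the TNT1/2 with the original Unreal.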

How can we increase the bandwidth?


I already mentioned in the analogy some possibilities to increase the bandwidth and overall speed, more specifically increasing the width or the speed of the conveyor belt. In more technical terms, this means increasing the width, a.k.a. the number of bits of the memory bus, or the speed, which is expressed in MHz. Voodoo1, for example, used a width of 64 bits while Voodoo3 uses 128 bits. Usually, this is mentioned as the memory interface. The speed is determined by the clock speed of the memory (which can be different from the render core speed) and hovers today between 125MHz and 235MHz. Speeds of 183MHz and lower are, commercially, the most interesting; speeds above 200MHz are still rather pricey.
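Peak bandwidth is simply bus width times memory clock. A minimal sketch; the clock figures below are assumed for illustration, only the bus widths come from the text above:

```python
# Peak bandwidth = bus width in bytes x memory clock in MHz -> MB/s.
def peak_bandwidth_mb(bus_bits, clock_mhz):
    return bus_bits / 8 * clock_mhz

print(peak_bandwidth_mb(64, 50))    # Voodoo1-class: 64-bit bus,  assumed 50MHz
print(peak_bandwidth_mb(128, 166))  # Voodoo3-class: 128-bit bus, assumed 166MHz
```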

A move to a new memory technology can also solve the problem. Today's SDRAM is of the SDR type. SDR stands for Single Data Rate, and it means that one packet of information is delivered per memory clock, so for each clock tick a packet of 128 bits (more or less, depending on the width of the memory interface) is delivered. A new type of SDRAM is becoming available, which is known as DDR. DDR stands for Double Data Rate, which means that two packets of information are delivered per clock: one on the rising clock edge and one on the falling edge (the clock is a square wave). This means that at a memory clock of 200MHz DDR you will get the equivalent of 400MHz SDR. So, it's basically possible to double the memory bandwidth using this technology. New chips like the NVIDIA GeForce256 support this new type of SDRAM. Unfortunately, this new type is very expensive (relative to SDR) and it's thus not always a valid option for products in the casual gamer price range. Do note that this situation can change rather rapidly. Another new technology is Rambus; all 3D companies are evaluating this type of RAM. Rambus is very quick and highly pipelined. It does suffer from some latency problems, but this can be overcome through good design of the memory interface. Note that the recent negative reports about the performance of Rambus as main memory are not necessarily true for 3D accelerator uses, since the access is very different: the latency problem is much more severe for general applications than for 3D applications. Videologic is actively looking into Rambus technology and announced support for it many months ago.
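The SDR versus DDR difference is easy to put in numbers. A minimal sketch, assuming the 128-bit bus mentioned earlier:

```python
# SDR delivers one transfer per clock; DDR delivers two (rising + falling edge).
def bandwidth_mb(bus_bits, clock_mhz, transfers_per_clock):
    return bus_bits / 8 * clock_mhz * transfers_per_clock  # MB/s

print(bandwidth_mb(128, 200, 1))  # 200MHz SDR: 3200 MB/s
print(bandwidth_mb(128, 200, 2))  # 200MHz DDR: 6400 MB/s, like 400MHz SDR
```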

Another trick for getting more bandwidth is to move to a segmented memory structure. Today's accelerators use a unified memory model, which means that all data streams come from and go to the same memory pool: Frame Buffer, Z-Buffer, texture data and RAMDAC all access the same memory chips. A segmented structure splits these accesses up. This was the case for the Voodoo2 boards. These chips had 2 separate texture memories and a separate frame buffer memory, each with a 64-bit memory interface, giving the Voodoo2 access to a 3 x 64 = 192-bit wide bus. By separating those memory streams, high efficiency was possible.
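In code form, the segmentation is just independent interfaces adding up. A small sketch of the Voodoo2 layout described above:

```python
# Voodoo2-style segmented memory: three independent 64-bit interfaces,
# so no stream has to compete with another for the same bus.
segments = {"texture memory 0": 64, "texture memory 1": 64, "frame buffer": 64}
total_bits = sum(segments.values())
print(total_bits)  # 192-bit aggregate bus width
```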

Another possibility is the use of completely separate designs that are combined by a bridge chip. Again, Voodoo2 is a good example of this, more specifically the Voodoo2 SLI set-up. Voodoo2 SLI uses two boards; each board has its own ICs and memory interfaces. In the end, the results of both boards are combined. The advantage here is that performance doubles completely: you get dual processing units and dual memory access. This is the safest and surest way to double your performance. It's a bit like having two factories: two whole production lines and storage rooms! Obviously this doubles your output but... with a hefty price tag... double the hardware means double the cost.
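As a final sketch, the doubling is as simple as it sounds. The single-board fill rate below is an assumed figure for illustration:

```python
# SLI: two complete boards share the work (each renders alternate scanlines),
# so fill rate and memory bandwidth both scale with the board count.
def sli_fill_rate(single_board_mpixels, boards=2):
    return single_board_mpixels * boards

print(sli_fill_rate(90))  # an assumed 90 Mpixel/s board pair gives 180 Mpixel/s
```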