The hardware implementation of the 3dfx T-Buffer


There has been a lot of confusion lately as to how the Voodoo5 works when dealing with both SLI and the T-buffer/FSAA. Well, now we are going to attempt to set the record straight once and for all, explaining in detail exactly how it works. To do this, we've also included diagrams of everything but the standard SLI with no T-buffer effects. 

As we know, SLI on the Voodoo5 is not truly SLI in the original scan-line-interleave sense, but rather a marketing term. What 3dfx now does is banding: a band of scan-lines is handled by a single chip, and then a different chip handles the next band of lines. In the case of SLI without T-buffer effects, each chip renders every other band on a 2-chip board, and every fourth band on a 4-chip board. What about under different conditions?
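
The band assignment described above can be sketched in a few lines of Python. This is only an illustration: the function name, the round-robin order, and the band height of 4 scan-lines are assumptions made for the example, since the text does not say how tall a band actually is.

```python
def chip_for_scanline(y, num_chips, band_height):
    """Return the index of the chip that renders scan-line y.

    Assumes bands are handed out to the chips in simple round-robin
    order, which matches "every other band" on a 2-chip board and
    "every fourth band" on a 4-chip board.
    """
    band_index = y // band_height   # which band this scan-line falls in
    return band_index % num_chips   # bands rotate through the chips


BAND_HEIGHT = 4  # hypothetical value, chosen only for illustration

# Chip index for the first 16 scan-lines on each board type.
two_chip = [chip_for_scanline(y, 2, BAND_HEIGHT) for y in range(16)]
four_chip = [chip_for_scanline(y, 4, BAND_HEIGHT) for y in range(16)]

print(two_chip)
print(four_chip)
```

On the 2-chip board the chips alternate from band to band, while on the 4-chip board each chip only comes around again every fourth band.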

Well, let us start by looking at a 2-chip board. Assume that we want to render 2 sub-pixels for every pixel. This gives us some anti-aliasing, but not full T-buffer effects, while still keeping good performance. Here, each chip renders its own set of bands through SLI. Within a band, a single chip renders 2 sub-pixels per clock (each pipeline creating one of the sub-samples), which basically cuts that chip's performance in half. However, keep in mind that SLI is still running: in the next band the other chip is also rendering 2 sub-pixels per clock, at half its full performance as well. Because each chip renders half the screen, you basically get the effective performance of a single chip, 2 pixels/clock, while still getting anti-aliasing.

In terms of frame-buffer use, each chip works out to holding the equivalent of one complete frame-buffer. How so? For every pixel you're actually storing 2 sub-samples, so there is 2x the data per pixel and the frame-buffer is twice as big. But because SLI is running, each chip only renders half the screen, so it only needs the frame-buffer data for that half. Twice the data per pixel over half the pixels means each chip ends up holding an amount of memory equal to the normal frame-buffer size.
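
The arithmetic for this 2-chip, 2-sub-sample case can be written out as a small Python sketch. The function names are made up for the example, and the model is deliberately simplified: it only counts sub-samples per clock and the fraction of the screen each chip covers, as described in the text, and ignores everything else about the hardware.

```python
PIPELINES_PER_CHIP = 2  # from the text: each pipeline produces one sub-sample per clock


def effective_pixels_per_clock(num_chips, samples_per_pixel):
    """Finished pixels per clock for the whole board (simplified model)."""
    total_samples_per_clock = num_chips * PIPELINES_PER_CHIP
    return total_samples_per_clock / samples_per_pixel


def per_chip_framebuffer_factor(samples_per_chip, screen_fraction):
    """Per-chip frame-buffer size as a multiple of the normal frame-buffer.

    samples_per_chip: sub-samples this chip stores for each pixel it covers
    screen_fraction:  share of the screen this chip covers (0.5 with 2-chip SLI)
    """
    return samples_per_chip * screen_fraction


# 2-chip board, 2 sub-samples per pixel, SLI still active.
sli_2x_rate = effective_pixels_per_clock(num_chips=2, samples_per_pixel=2)
sli_2x_factor = per_chip_framebuffer_factor(samples_per_chip=2, screen_fraction=0.5)

print("pixels/clock:", sli_2x_rate)
print("frame-buffer per chip (x normal):", sli_2x_factor)
```

Under these assumptions the board delivers 2 pixels/clock, and each chip holds one normal-sized frame-buffer, matching the figures in the text.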





Now, what if you're running a 2-chip board and you want 4 sub-samples for full T-buffer effects? Well, here you lose the ability to use SLI. Why? Full T-buffer effects require 4 sub-pixels per pixel. Because each chip can only render 2 sub-pixels in a single pass, both chips are required to produce the complete 4 sub-pixels, so SLI isn't used. You get about half the performance of a single chip, or 1 pixel/clock. Depending on the game, this may or may not be enough. The real problem is the frame-buffer: both chips must now store data for the complete screen rather than half of it, so each chip actually stores 2x the frame-buffer. If your frame-buffer is normally 10 megs, then a chip rendering 2 sub-samples will store 20 megs. Keeping in mind that we need 4 sub-samples per pixel, the other chip also has to render 2 sub-samples to go along with the first chip's 2. This means that both chips will be storing 20 megs of frame-buffer data each, for a total of 40 megs.
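
To make the memory numbers concrete, here is a small Python sketch of the 10 meg example, with the 2 sub-sample SLI mode alongside for comparison. The helper name and its parameters are hypothetical, and the calculation is a back-of-the-envelope model based only on the figures given in the text.

```python
def framebuffer_megs(base_megs, samples_per_chip, screen_fraction, num_chips):
    """Return (megs per chip, megs for the whole board).

    base_megs:        size of the normal frame-buffer with no sub-sampling
    samples_per_chip: sub-samples each chip stores per pixel it covers
    screen_fraction:  share of the screen each chip covers
    """
    per_chip = base_megs * samples_per_chip * screen_fraction
    return per_chip, per_chip * num_chips


BASE_MEGS = 10  # the example figure used in the text

# 4 sub-samples: no SLI, each chip covers the whole screen with 2 sub-samples.
per_chip_4x, total_4x = framebuffer_megs(BASE_MEGS, 2, 1.0, 2)

# 2 sub-samples: SLI active, each chip covers half the screen with 2 sub-samples.
per_chip_2x, total_2x = framebuffer_megs(BASE_MEGS, 2, 0.5, 2)

print("4 sub-samples:", per_chip_4x, "megs per chip,", total_4x, "megs total")
print("2 sub-samples:", per_chip_2x, "megs per chip,", total_2x, "megs total")
```

With these inputs the 4 sub-sample mode comes out at 20 megs per chip and 40 megs in total, twice what the 2 sub-sample SLI mode needs.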