There is yet another thing that worries me, and that is the co-operation between the hardware transform and lighting engine and the actual rendering units. Today's accelerators reach an efficiency of about 80% when rendering a 16-bit testbench. That testbench is highly optimized and very cache friendly, which means that in-game the efficiency probably drops to 60-70%. This lack of efficiency is mainly caused by the render core's highly random access to memory. First of all, let me explain why random memory access is bad.

Memory can be seen as a notebook: you know, those small books with about 10-15 lines on each small page? Now assume that each line in that notebook has a number, for example 1 to 15 on the first page, 16 to 30 on the second, and so on. If I ask you to read the word on line 12, you can just look at the first page and tell me. But if I then ask you to read the word on line 212, you will have to spend some time looking for the page that contains line 212. Memory works just like that. Memory (8, 16 or 32 MB) is split up into pages of a couple of kilobytes. As long as you access data on the same page you can get new data every clock cycle. But if you suddenly jump to another place in memory, you end up on another page, and just like you have to search the notebook for that other page, the memory chip also has to search. This search costs a 2 to 3 cycle penalty, during which you simply wait until the memory has located the data. The more jumps you make, the less efficient the memory becomes. As I said, today's accelerators hover around 70% efficiency. Now what happens if we add T&L?
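To make the notebook analogy a bit more concrete, here is a small toy model in C. The page size and the miss penalty are assumptions I picked for illustration, not the specs of any real memory chip, but the idea is exactly what I described: stay on the same page and every access costs one cycle; jump to another page and you pay a few extra cycles first.

```c
/* Toy model of the "notebook" analogy: sequential accesses stay on the same
 * memory page and cost 1 cycle, while a jump to another page adds a 2-3
 * cycle penalty. Page size and penalty are illustrative assumptions. */
#include <stdio.h>

#define PAGE_SIZE    2048  /* bytes per memory page (assumed)      */
#define MISS_PENALTY    3  /* extra cycles for opening a new page  */

/* Count the cycles needed to read a list of byte addresses. */
static int count_cycles(const unsigned addrs[], int n)
{
    int cycles = 0;
    unsigned current_page = (unsigned)-1;

    for (int i = 0; i < n; i++) {
        unsigned page = addrs[i] / PAGE_SIZE;
        if (page != current_page) {   /* "search for the new notebook page" */
            cycles += MISS_PENALTY;
            current_page = page;
        }
        cycles += 1;                  /* the access itself */
    }
    return cycles;
}

int main(void)
{
    unsigned sequential[4] = { 0, 4, 8, 12 };          /* all on page 0     */
    unsigned scattered[4]  = { 0, 5000, 12000, 100 };  /* four page changes */

    printf("sequential: %d cycles\n", count_cycles(sequential, 4)); /* 4 + 3   */
    printf("scattered:  %d cycles\n", count_cycles(scattered, 4));  /* 4 + 4*3 */
    return 0;
}
```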

T&L increases the triangle throughput dramatically. Today a game uses maybe 10,000 polygons in a scene; with T&L this can become 100,000 or even more. More polygons mean more detail, but they also mean that polygons get smaller: after all, detail is small, and small things are represented using small polygons. Today a polygon covers between 100 and 1,000 pixels. Smaller polygons can appear, but they are rare; they are usually removed, and the fact that they are missing is hidden in fog. There is a very good reason for this: more and smaller polygons decrease the efficiency of the accelerator, more specifically the actual render part of the pipeline.
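Just to give a feel for the numbers, here is a quick back-of-the-envelope calculation. The screen resolution and the overdraw factor are assumptions of mine, not measurements from any particular game, but they show how quickly the average triangle shrinks once the polygon count goes up.

```c
/* Back-of-the-envelope: how big is the average triangle?  Assume an
 * 800x600 screen drawn roughly twice over (overdraw of 2) - illustrative
 * numbers, not measured from any particular game. */
#include <stdio.h>

int main(void)
{
    const double pixels_drawn = 800.0 * 600.0 * 2.0;  /* ~960,000 pixels */

    printf("10,000 triangles  -> ~%.0f pixels each\n", pixels_drawn / 10000.0);
    printf("100,000 triangles -> ~%.0f pixels each\n", pixels_drawn / 100000.0);
    /* ~96 pixels per triangle today versus ~10 pixels at T&L-level counts */
    return 0;
}
```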

A simple example to illustrate this:

The frame buffer is stored in a linear way, meaning left to right and line by line downwards. When you want to draw a triangle in this memory layout, you end up drawing a few pixels on one line, some more on the next line, and so on. The jump from one line to the next is rather large, and often it is large enough to cause a page break, which is what happens when the data is not on the same "notebook" page (cf. earlier). When you have many small triangles you end up with a lot of line jumps, far more than with fewer but bigger triangles: with a large triangle you write many pixels on the same line, while with a small triangle you write only a few pixels per line. Since you need to fill the same surface (the screen), you thus have many more triangles filling the same space and many more jumps. Each jump decreases your efficiency.
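Here is a minimal sketch of that effect. I've simplified triangles to rectangles of a fixed size and picked the resolution and triangle sizes myself, so treat the exact numbers as illustrative only; the point is how many scanline jumps (each of them a potential page break) it takes to fill the same screen with small triangles instead of large ones.

```c
/* Filling the same number of pixels with small triangles forces far more
 * scanline-to-scanline jumps, and every jump risks a page break.  Triangle
 * shapes are simplified to rectangles; all numbers are assumptions. */
#include <stdio.h>

static void report(const char *label, int tri_width, int tri_height,
                   long pixels_to_fill)
{
    long pixels_per_tri = (long)tri_width * tri_height;
    long triangles      = pixels_to_fill / pixels_per_tri;
    long row_jumps      = triangles * tri_height;  /* one jump per scanline,
                                                      plus the jump to the
                                                      next triangle */
    printf("%s: %ld triangles, %ld row jumps\n", label, triangles, row_jumps);
}

int main(void)
{
    const long screen = 640L * 480L;   /* pixels to fill (assumed 640x480) */

    report("large (25x40 px) triangles", 25, 40, screen);  /* ~ 12,280 jumps */
    report("small ( 3x 4 px) triangles",  3,  4, screen);  /* ~102,400 jumps */
    return 0;
}
```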

And this is not all… Memory is written in a linear way: writes go out in full chunks of 128 (or 256) bits at a time. Suppose you have a 10-pixel triangle. On the first line you write 4 pixels. Four 16-bit pixels equal 64 bits, but you still need to write 128 (or even 256) bits because that's the way memory works. You thus write 128 (or 256) bits of which only 64 are really useful. This means efficiency is only 50% (or 25% over a 256-bit bus)! On the next line you write 3 pixels, on the line below that 2, and on the last line 1 pixel. Efficiency drops like a rock. Up to today we've used large triangles, and with large triangles you write enough data per line to achieve a reasonable efficiency.
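The arithmetic above wrapped in a tiny helper: how much of each 128-bit or 256-bit write actually carries pixel data. The 16-bit pixels and the 4+3+2+1 scanlines of the 10-pixel triangle come straight from the example; the code itself is just an illustration.

```c
/* Fraction of each memory write that is useful pixel data, for the
 * 10-pixel triangle from the text (4+3+2+1 pixels per scanline, 16 bpp). */
#include <stdio.h>

/* Percentage of a burst that is useful when writing `pixels` 16-bit pixels. */
static double burst_efficiency(int pixels, int burst_bits)
{
    int useful_bits = pixels * 16;
    /* round up to whole bursts, since memory is written in full bursts */
    int bursts = (useful_bits + burst_bits - 1) / burst_bits;
    return 100.0 * useful_bits / (bursts * burst_bits);
}

int main(void)
{
    int per_line[4] = { 4, 3, 2, 1 };   /* the 10-pixel triangle's scanlines */

    for (int i = 0; i < 4; i++)
        printf("%d pixels: %5.1f%% of a 128-bit write, %5.1f%% of a 256-bit write\n",
               per_line[i],
               burst_efficiency(per_line[i], 128),
               burst_efficiency(per_line[i], 256));
    return 0;
}
```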

An early conclusion would be that because of T&L we need a much faster render engine, since efficiency drops. But isn't there a way to solve this inefficiency? Well, yes there is, and it's called Tile Based rendering (note that this is not the same as PowerVR's Tile Based Deferred rendering). Tile Based rendering is a trick to keep memory access local and limited to a small zone, a zone so small that you can buffer it on chip. You split the screen up into small zones, let's say 32 by 16 pixels (they can be larger or smaller… it depends on your bus bandwidth and buffer size). Before you render anything, you sort the polygons per tile (a tile is one of those small zones), so basically you make a list of which polygons are located in which tile. Once you know which triangles fall in which tile, you can render each tile in one go: you render all the polygons that are in, say, the top-right tile, and once they are all rendered the result sits in a small buffer on the chip. At that point you write this buffer to the main memory. This reduces the number of jumps caused by external memory access: the local RAM is very fast and flexible and doesn't suffer from page breaks. This technique can keep performance at an acceptable level. It's unknown whether NVIDIA uses such a principle. The Bitboys' Glaze chip uses a technique like this to achieve an acceptable efficiency.
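To show the principle, here is a rough two-pass sketch in C. The 32 by 16 tile size is the one I mentioned above; everything else (the structures, the names, the bounding-box binning, the placeholder rasterizer) is my own assumption for illustration and certainly not how NVIDIA or Bitboys actually implement it.

```c
/* Tile based rendering in two passes: bin triangles per tile first, then
 * rasterize each tile into a small on-chip buffer and write that buffer to
 * the frame buffer in one sequential, page-friendly go.  All names and
 * sizes besides the 32x16 tile are illustrative assumptions. */
#include <string.h>

#define TILE_W        32
#define TILE_H        16
#define SCREEN_W     640
#define SCREEN_H     480
#define TILES_X     (SCREEN_W / TILE_W)
#define TILES_Y     (SCREEN_H / TILE_H)
#define MAX_PER_TILE 256

typedef struct { int min_x, min_y, max_x, max_y; } Triangle; /* bounding box only */

static int tile_lists[TILES_Y][TILES_X][MAX_PER_TILE];
static int tile_counts[TILES_Y][TILES_X];
static unsigned short tile_buffer[TILE_H][TILE_W];    /* the small on-chip buffer */
static unsigned short frame_buffer[SCREEN_H][SCREEN_W];

/* Pass 1: make a list of which triangles touch which tile.
 * (Assumes each bounding box lies entirely on screen.) */
static void bin_triangles(const Triangle *tris, int n)
{
    for (int i = 0; i < n; i++)
        for (int ty = tris[i].min_y / TILE_H; ty <= tris[i].max_y / TILE_H; ty++)
            for (int tx = tris[i].min_x / TILE_W; tx <= tris[i].max_x / TILE_W; tx++)
                if (tile_counts[ty][tx] < MAX_PER_TILE)
                    tile_lists[ty][tx][tile_counts[ty][tx]++] = i;
}

/* Pass 2: render each tile completely, then write it out once. */
static void render_tiles(const Triangle *tris)
{
    for (int ty = 0; ty < TILES_Y; ty++)
        for (int tx = 0; tx < TILES_X; tx++) {
            memset(tile_buffer, 0, sizeof tile_buffer);

            /* rasterize only the triangles binned to this tile
               (placeholder: a real rasterizer would shade pixels here) */
            for (int i = 0; i < tile_counts[ty][tx]; i++)
                (void)tris[tile_lists[ty][tx][i]];

            /* one sequential copy per tile row into external memory */
            for (int y = 0; y < TILE_H; y++)
                memcpy(&frame_buffer[ty * TILE_H + y][tx * TILE_W],
                       tile_buffer[y], TILE_W * sizeof(unsigned short));
        }
}

int main(void)
{
    Triangle tris[2] = { { 0, 0, 40, 20 }, { 600, 400, 630, 470 } };
    bin_triangles(tris, 2);
    render_tiles(tris);
    return 0;
}
```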