There is yet another thing that worried me, and that is the co-operation
between the hardware transform and lighting engine and the actual rendering
units. Today's accelerators reach an efficiency of about 80% when rendering
a 16-bit testbench. Such a testbench is highly optimized and very cache
friendly, which means that in-game the efficiency probably drops to 60-70%.
This lack of efficiency is mainly caused by the highly random memory access
pattern of the render core. First of all, let me explain why random memory
access is bad.
Memory can be seen as a notebook, you know, those small books with about
10-15 lines on each small page? Now assume that each line in that notebook
has a number: 1 to 15 on the first page, 16 to 30 on the second, and so on.
If I ask you to read the word on line 12, you can just look at the first
page and tell me. But if I then ask you to read the word on line 212, you
will have to spend some time looking for the page that contains line 212.
Memory works just like that. Memory (8, 16 or 32 MB) is split up into pages
of a couple of KB. As long as you access data on the same page, you can get
new data every clock cycle. But if you suddenly jump to another place in
memory, you end up on another page, and just like you have to search the
notebook for that other page, the memory component also has to search. This
search causes a 2 to 3 cycle penalty during which you simply have to wait
until the memory has located the info. The more jumps you make, the less
efficient the memory becomes. As I told you, today's accelerators hover
around 70% efficiency in-game.
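To make that concrete, here is a minimal sketch of the penalty model in C.
The page size and the penalty are assumed values picked for illustration;
real memory chips differ:

#include <stdio.h>

#define PAGE_SIZE    2048  /* bytes per memory page: assumed for illustration */
#define PAGE_PENALTY    3  /* extra cycles to locate a new page: assumed      */

/* Count the cycles needed to serve a sequence of byte addresses when
   staying on the open page costs 1 cycle and changing page costs extra. */
static long cycles_for(const long *addr, int n)
{
    long cycles = 0;
    long open_page = -1;
    for (int i = 0; i < n; i++) {
        long page = addr[i] / PAGE_SIZE;
        if (page != open_page) {       /* "searching the notebook" */
            cycles += PAGE_PENALTY;
            open_page = page;
        }
        cycles += 1;                   /* the access itself */
    }
    return cycles;
}

int main(void)
{
    long sequential[4] = { 0, 4, 8, 12 };        /* all on one page */
    long scattered[4]  = { 0, 5000, 100, 9000 }; /* four page hops  */
    printf("sequential: %ld cycles\n", cycles_for(sequential, 4)); /* 7  */
    printf("scattered:  %ld cycles\n", cycles_for(scattered, 4));  /* 16 */
    return 0;
}

The same four accesses take more than twice as long when they hop from
page to page.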
Now what happens if we add T&L? T&L increases the triangle throughput
dramatically. Today a game uses maybe 10,000 polygons in a scene; with T&L
this can become 100,000 or even more. More polygons mean more detail, but
they also mean that polygons get smaller: after all, detail is small, and
small things are represented using small polygons. Today a polygon covers
between 100 and 1000 pixels. Smaller polygons can appear, but they are
rare; they are usually removed, and the fact that they are missing is
hidden in fog. Now there is a very good reason for this: more and smaller
polygons decrease the efficiency of the accelerator, more specifically of
the actual render part of the pipeline.
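A quick back-of-the-envelope sketch shows why the polygons must shrink,
assuming a hypothetical 1024x768 screen with every polygon visible and no
overdraw:

#include <stdio.h>

int main(void)
{
    const long pixels = 1024L * 768L;  /* assumed display, ~786k pixels */
    printf(" 10,000 triangles: ~%ld pixels each\n", pixels / 10000);   /* ~78 */
    printf("100,000 triangles: ~%ld pixels each\n", pixels / 100000);  /* ~7  */
    return 0;
}

Ten times the polygons over the same screen area means roughly one tenth
the pixels per polygon.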
A simple example to illustrate this:
The frame-buffer is stored in a linear way: left to right, and line per
line downwards. When you draw a triangle into this memory structure, you
end up writing some pixels on one line, some more on the next line, and so
on. The jump from line to line is rather large. Often this jump is large
enough to cause a page break, which occurs when the data is not on the same
"notebook" page (cfr. earlier). Now, when you have many small triangles you
end up making far more line jumps than when you have fewer but bigger
triangles. With a large triangle you write many pixels on the same line,
while with a small triangle you write only a few pixels per line. Since you
need to fill the same surface (the screen), you thus have many more
triangles filling the same space and many more jumps. Each jump decreases
your efficiency.
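The sketch below counts these page breaks for the same 32x32-pixel area,
filled once by one big triangle and once by 64 tiny ones. The screen width,
pixel depth and page size are assumed values; with this particular pitch
every new scanline happens to start a new page, which keeps the numbers
easy to follow:

#include <stdio.h>

#define WIDTH     1024  /* pixels per scanline: assumed        */
#define BPP          2  /* 16-bit color                        */
#define PAGE_SIZE 2048  /* bytes per page: assumed; here one
                           scanline is exactly one page        */

/* Page transitions when filling a w-by-h block of pixels at byte
   offset 'base' in a linear frame-buffer. */
static long breaks_for_block(long base, int w, int h)
{
    long breaks = 0;
    long open_page = -1;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            long page = (base + (long)y * WIDTH * BPP + x * BPP) / PAGE_SIZE;
            if (page != open_page) { breaks++; open_page = page; }
        }
    return breaks;
}

int main(void)
{
    /* One 32x32 "triangle" versus an 8x8 grid of 4x4 ones. */
    long big = breaks_for_block(0, 32, 32);
    long small = 0;
    for (int ty = 0; ty < 8; ty++)
        for (int tx = 0; tx < 8; tx++)
            small += breaks_for_block((long)ty * 4 * WIDTH * BPP + tx * 4 * BPP,
                                      4, 4);
    printf("one 32x32 triangle:  %ld page breaks\n", big);   /* 32  */
    printf("64  4x4  triangles:  %ld page breaks\n", small); /* 256 */
    return 0;
}

The same 1024 pixels cost eight times as many page breaks when they arrive
as tiny triangles.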
And this is not all… Memory is written to in fixed-size chunks: the bus
moves 128 bits (or 256 bits) at a time, whether you need all of them or
not. Suppose you have a 10-pixel triangle. On the first line you write 4
pixels, and 4 pixels in 16-bit color equal 64 bits. But you have to write a
full 128 (or even 256) bits because that's the way memory works. You thus
write 128 (or 256) bits of which only 64 are really useful, which means the
efficiency is only 50% (or 25% over a 256-bit bus)! On the next line you
write 3 pixels, on the line below 2, and on the last line 1 pixel.
Efficiency drops like a rock. Up to today we've used large triangles, and
with large triangles you write enough data per line to achieve a reasonable
efficiency.
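Here is the same arithmetic as a small C sketch, so you can play with other
span sizes and bus widths; the 10-pixel triangle and the 16-bit pixels are
taken from the example above:

#include <stdio.h>

/* Useful bits versus bits actually written when every scanline span is
   padded to a whole number of bus-wide memory transactions. */
static void efficiency(const int *span_px, int spans, int bus_bits)
{
    int useful = 0, written = 0;
    for (int i = 0; i < spans; i++) {
        int bits   = span_px[i] * 16;                   /* 16-bit pixels */
        int bursts = (bits + bus_bits - 1) / bus_bits;  /* round up      */
        useful  += bits;
        written += bursts * bus_bits;
    }
    printf("%3d-bit bus: %d/%d bits useful = %d%%\n",
           bus_bits, useful, written, 100 * useful / written);
}

int main(void)
{
    int tri[4] = { 4, 3, 2, 1 };   /* the 10-pixel triangle above */
    efficiency(tri, 4, 128);       /* 160/512  = 31% */
    efficiency(tri, 4, 256);       /* 160/1024 = 15% */
    return 0;
}

Over the whole triangle the efficiency is even worse than the 50% of the
first line: barely 31% on a 128-bit bus and 15% on a 256-bit one.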
An early conclusion would be that due to T&L we need a much faster
render engine, since efficiency drops. But isn't there a way to solve this
inefficiency? Well, yes, there is, and it's called Tile Based rendering
(note that this is not equal to PowerVR's Tile Based Deferred rendering).
Tile Based rendering is a trick to keep memory access local and limited to
a small zone, a zone so small that you can buffer it on chip. What you do
is split the screen up into small zones, let's say 32 by 16 pixels (they
can be larger or smaller… it depends on your bus bandwidth and buffer
size). Before you render anything, you sort the polygons per tile (a tile
is one of those small zones). So basically you make a list of which
polygons are located in which tile. Once you know which triangles are
located in which tile, you can render each tile in one go. You start by
rendering all polygons that are in the tile in the top right corner. When
all its polygons are rendered, the result is located in a small buffer on
the chip, and at that point you write this buffer to the main memory in one
sequential transfer. This reduces the number of jumps in external memory
access. The local on-chip RAM is very fast and flexible, and it doesn't
suffer from page breaks. This technique can keep the performance at an
acceptable level. It's unknown whether NVIDIA uses such a principle. The
Bitboys' Glaze chip uses a technique like this to achieve an acceptable
efficiency.
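To wrap up, here is a minimal sketch of that binning idea in C. It is not
NVIDIA's or the Bitboys' actual implementation: a bounding box stands in
for real triangle rasterization, and the sizes are the example values from
the text or plain assumptions:

#include <stdio.h>
#include <string.h>

#define SCREEN_W 640
#define SCREEN_H 480
#define TILE_W    32            /* the 32x16 zone size from the text */
#define TILE_H    16
#define TILES_X  (SCREEN_W / TILE_W)
#define TILES_Y  (SCREEN_H / TILE_H)
#define MAX_TRIS 1024

/* A triangle reduced to its bounding box: enough to show the binning. */
typedef struct { int x0, y0, x1, y1; unsigned short color; } Tri;

static Tri tris[MAX_TRIS];
static int tri_count;
static int bins[TILES_Y][TILES_X][MAX_TRIS]; /* triangle indices per tile */
static int bin_len[TILES_Y][TILES_X];

/* Pass 1: list which triangles touch which tile. */
static void bin_triangles(void)
{
    for (int i = 0; i < tri_count; i++)
        for (int ty = tris[i].y0 / TILE_H; ty <= tris[i].y1 / TILE_H; ty++)
            for (int tx = tris[i].x0 / TILE_W; tx <= tris[i].x1 / TILE_W; tx++)
                bins[ty][tx][bin_len[ty][tx]++] = i;
}

/* Pass 2: render tile by tile into a small local buffer, then flush the
   finished tile to external memory in one sequential write. */
static void render_tiles(unsigned short *fb)
{
    unsigned short tile_buf[TILE_H][TILE_W];   /* the "on-chip" buffer */
    for (int ty = 0; ty < TILES_Y; ty++)
        for (int tx = 0; tx < TILES_X; tx++) {
            memset(tile_buf, 0, sizeof tile_buf);
            for (int n = 0; n < bin_len[ty][tx]; n++) {
                const Tri *t = &tris[bins[ty][tx][n]];
                /* Clip the box to this tile: all writes stay local. */
                int x0 = t->x0 > tx * TILE_W ? t->x0 : tx * TILE_W;
                int y0 = t->y0 > ty * TILE_H ? t->y0 : ty * TILE_H;
                int x1 = t->x1 < (tx + 1) * TILE_W - 1 ? t->x1 : (tx + 1) * TILE_W - 1;
                int y1 = t->y1 < (ty + 1) * TILE_H - 1 ? t->y1 : (ty + 1) * TILE_H - 1;
                for (int y = y0; y <= y1; y++)
                    for (int x = x0; x <= x1; x++)
                        tile_buf[y - ty * TILE_H][x - tx * TILE_W] = t->color;
            }
            for (int y = 0; y < TILE_H; y++)   /* sequential flush */
                memcpy(&fb[(ty * TILE_H + y) * SCREEN_W + tx * TILE_W],
                       tile_buf[y], sizeof tile_buf[y]);
        }
}

int main(void)
{
    static unsigned short framebuffer[SCREEN_W * SCREEN_H];
    tris[tri_count++] = (Tri){ 100, 40, 180, 90, 0xF800 };
    bin_triangles();
    render_tiles(framebuffer);
    printf("pixel (120,60) = 0x%04X\n", framebuffer[60 * SCREEN_W + 120]);
    return 0;
}

Note how external memory only ever sees whole tiles written out line by
line: the scattered per-pixel traffic all lands in the local buffer, which
pays no page-break penalty.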