A Q&A with NVIDIA's Nick Triantos
1. T&L can have a negative impact on the render efficiency
of an accelerator mainly because most traditional renderers are optimized
to handle larger polygons. T&L, however, introduces many small triangles, and
that makes the "external" memory access very inefficient... With efficiency
on large triangles hovering around 80% today, what will happen to the
efficiency once we introduce T&L? How can we avoid such a drop?
I'm not really sure what you mean by 80% efficiency on larger triangles.
There are cases where more data needs to be sent across the AGP bus to
the graphics part, but well-written apps perform a rendering step called
view frustum culling that should eliminate this case. In my mind, most
apps are incredibly fill-limited. That comes from two things: 1. High
resolutions and multiple passes, 2. Needing lots of textures. Once you
can tessellate a scene down into smaller triangles, you can frequently
eliminate one or more of the passes commonly used in today's games, which
is the "lighting" pass where light maps are rendered on top of the base
textures. It's my hope that now that lighting is available, some apps
will realize how nice the lighting can be when combined with the base
textures.
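The view frustum culling step mentioned in this answer can be sketched as a bounding-sphere test against the six frustum planes. This is only a minimal illustration of the idea, not NVIDIA's or any particular engine's code; the plane layout and names are assumptions:

```c
#include <assert.h>

/* A plane ax + by + cz + d = 0, with the normal pointing into the frustum. */
typedef struct { float a, b, c, d; } Plane;

/* Illustrative frustum: the unit cube [-1,1]^3, normals pointing inward. */
static const Plane unit_cube[6] = {
    { 1, 0, 0, 1}, {-1, 0, 0, 1},
    { 0, 1, 0, 1}, { 0,-1, 0, 1},
    { 0, 0, 1, 1}, { 0, 0,-1, 1}
};

/* Cull a bounding sphere: returns 1 if it lies entirely outside the
   frustum (safe to skip), 0 if it intersects or is inside (must draw). */
int sphere_outside_frustum(const Plane planes[6],
                           float cx, float cy, float cz, float radius)
{
    for (int i = 0; i < 6; i++) {
        float dist = planes[i].a * cx + planes[i].b * cy
                   + planes[i].c * cz + planes[i].d;
        if (dist < -radius)   /* fully behind one plane -> outside */
            return 1;
    }
    return 0;
}
```

An app runs this test per object before submitting geometry, which is what keeps off-screen triangles from ever crossing the AGP bus.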
2. Many people hope that T&L will remove the need to upgrade
to the latest processors and motherboards from Intel. Now NVIDIA
mentions in its Fast Write paper a bandwidth of 90 bytes per triangle between
the CPU and the 3D card. With 10 million polygons this turns into a 900MB/sec
stream, which is way more than any older memory architecture can handle...
Does this mean that older systems will be limited by the bandwidth between
the main CPU (or main memory) and the 3D card?
I think that you're going to want an AGP4X motherboard for the best performance
of data-bound applications. By my measurements, 10M polygons, rendered
efficiently (through a triangle mesh built via a vertex buffer / vertex
array), can be achieved with less than 2 million vertices. (In a well-formed
mesh, interior vertices are typically shared by 6 triangles). It might
be better when talking about transform and lighting limits to talk about
vertices, though, so I'll do that. For a vertex that supplies X, Y, Z,
Nx, Ny, Nz, Tx, Ty (position + normal + 1 texture coordinate pair), you are
looking at roughly 32 bytes per vertex. With two textures this can go up;
with pre-lit or unlit vertices this will go down. That's only 320MB/sec, which would be
possible in some apps with AGP2X, and in more apps with AGP4X. With 3
vertices per triangle, though, this number could go over 900MB/sec required
bandwidth. Yikes. Anyway, there will be some applications that will be
AGP bus bound, even on an AGP2X system. In fact, there will be some apps
that are bus bound even on an AGP4X system. Apps will realize this, though,
and will likely tune the apps to not push these limits so hard if they
want to appeal to the broadest customer base possible. They'll instead
make the app scalable, so that users with AGP4X systems will simply get
more data than those with AGP2X systems.
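The arithmetic in this answer can be checked directly. The constants below are the figures quoted above (32 bytes per vertex, interior vertices shared by roughly 6 triangles, so vertex count approaches half the triangle count), not new measurements:

```c
#include <assert.h>

enum { BYTES_PER_VERTEX = 32 };   /* X,Y,Z + Nx,Ny,Nz + Tx,Ty as floats */

/* Worst case: every triangle ships its own 3 vertices across the bus. */
long bandwidth_unshared(long tris_per_sec)
{
    return tris_per_sec * 3L * BYTES_PER_VERTEX;
}

/* Well-formed mesh: interior vertices shared by ~6 triangles, so the
   vertex count approaches tris * 3 / 6 = tris / 2. */
long bandwidth_shared(long tris_per_sec)
{
    return (tris_per_sec / 2) * BYTES_PER_VERTEX;
}
```

At 10M triangles/sec the unshared case lands at 960MB/sec, which is the "over 900MB/sec" figure in the answer; good vertex sharing cuts it to 160MB/sec, comfortably inside AGP2X.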
3. I talked to 2 of your developer support guys and they told
me that you use a vertex buffer (or cache) in local memory. Could you
describe the basic principle of this? As I understand it the game can
upload static geometry data (object per object) at the beginning of the
game and use this data very quickly (just send commands that tell the
hardware what object from local RAM should appear where - so translate,
rotate, scale object X)?
Sure, I started talking about this above. NVIDIA has been able to natively
draw triangle strips and fans since the RIVA 128. But, as I mentioned
above, a more efficient data structure is typically a triangle mesh. The
GeForce 256 supports this by letting the application build a buffer or
table of data that contains the vertices, then handing us a list of indices
into the table, and having our chip pull in all the data as needed. What
this means is that our driver does not have to transfer the 900MB we
talked about above. Instead, we just write the indices (typically 1 32-bit
integer per vertex) across the bus, and let the hardware pull in all the
data as it needs it. In an AGP bus-bound app, we could even put the vertex
buffer in video memory, so that each frame all that gets transferred across
the AGP bus is these little indices, instead of the X, Y, Z, RGBA, Nx,
Ny, Nz, Specular, Fog, Tex0X, Tex0Y, Tex1X, Tex1Y, etc. that an app might
need.
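The index-table scheme described in this answer can be sketched in plain C. The quad mesh and names below are illustrative only; they just mimic the lookup the chip performs when it pulls vertex data by index:

```c
#include <assert.h>

typedef struct { float x, y, z; } Vertex;

/* Unique vertices live once in a table: a quad as two triangles needs
   only 4 entries, not 6. */
static const Vertex vertex_table[4] = {
    {0,0,0}, {1,0,0}, {1,1,0}, {0,1,0}
};

/* Triangles are described by small indices into the table; only these
   (typically one 32-bit integer per vertex) cross the bus. */
static const unsigned int index_list[6] = { 0, 1, 2,   0, 2, 3 };

/* What the hardware does per triangle corner: index lookup, then a
   fetch of the full vertex data from the buffer. */
Vertex fetch_corner(int triangle, int corner)
{
    return vertex_table[index_list[triangle * 3 + corner]];
}
```

At the API level this corresponds to indexed vertex arrays - glDrawElements in OpenGL, or an indexed vertex buffer drawn with DrawIndexedPrimitive in Direct3D - with the vertex buffer itself placed in video or AGP memory as the answer describes.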