A Q&A with NVIDIA's Nick Triantos

1. T&L can have a negative impact on the rendering efficiency of an accelerator, mainly because most traditional renderers are optimized to handle larger polygons. Now T&L introduces small triangles, and that makes the "external" memory access very inefficient... With today's efficiency on large triangles hovering around 80%, what will happen to that efficiency once we introduce T&L? How can we avoid such a drop?

I'm not really sure what you mean by 80% efficiency on larger triangles. There are cases where more data needs to be sent across the AGP bus to the graphics part, but well-written apps perform a rendering step called view frustum culling that should eliminate this case. In my mind, most apps are incredibly fill-limited. That comes from two things: 1. High resolutions and multiple passes, 2. Needing lots of textures. Once you can tessellate a scene down into smaller triangles, you can frequently eliminate one or more of the passes commonly used in today's games, in particular the "lighting" pass where light maps are rendered on top of the base textures. It's my hope that now that hardware lighting is available, some apps will realize how nice the lighting can be when combined with the base textures.
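
For readers unfamiliar with the term, view frustum culling tests each object's bounding volume against the six planes of the camera frustum and skips anything that lies entirely outside, so it never gets sent across the bus at all. Below is a minimal sketch in C, assuming a bounding-sphere test with normalized plane equations; the Sphere and Plane types and the function name are illustrative, not from any particular engine or API.

```c
#include <stdbool.h>

/* Illustrative types -- assumed for this sketch, not from any real API. */
typedef struct { float x, y, z, radius; } Sphere;  /* object bounding sphere */
typedef struct { float a, b, c, d; } Plane;        /* ax + by + cz + d = 0, unit
                                                      normal pointing into the frustum */

/* Returns false when the sphere lies entirely outside any one of the six
 * frustum planes, meaning the object can be skipped for this frame. */
bool sphere_in_frustum(const Sphere *s, const Plane frustum[6])
{
    for (int i = 0; i < 6; ++i) {
        float dist = frustum[i].a * s->x
                   + frustum[i].b * s->y
                   + frustum[i].c * s->z
                   + frustum[i].d;
        if (dist < -s->radius)   /* completely behind this plane */
            return false;
    }
    return true;                 /* inside or intersecting: draw it */
}
```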

2. Many people hope that T&L will remove the need to upgrade to the newest and latest processors and motherboards from Intel. Now NVIDIA mentions in its Fast Write paper a bandwidth of 90 bytes per triangle between the CPU and the 3D card. With 10 million polygons this turns into a 900MB stream, which is way more than any older memory subsystem can handle... Does this mean that older systems will be limited by the bandwidth between the main CPU (or main memory) and the 3D card?

I think that you're going to want an AGP4X motherboard for the best performance of data-bound applications. By my measurements, 10M polygons, rendered efficiently (through a triangle mesh built via a vertex buffer / vertex array), can be achieved with fewer than 2 million vertices. (In a well-formed mesh, interior vertices are typically shared by 6 triangles.) It might be better when talking about transform and lighting limits to talk about vertices, though, so I'll do that. For a vertex that supplies X, Y, Z, Nx, Ny, Nz, Tx, Ty (position + normal + 1 texture coordinate pair), you are looking at roughly 32 bytes per vertex. With two textures this can go up; with pre-lit or unlit vertices this will go down. That's only 320MB/sec, which would be possible in some apps with AGP2X, and in more apps with AGP4X. With 3 independent vertices per triangle, though, this number could go over 900MB/sec of required bandwidth. Yikes. Anyway, there will be some applications that will be AGP bus bound, even on an AGP2X system. In fact, there will be some apps that are bus bound even on an AGP4X system. Developers will realize this, though, and will likely tune their apps not to push these limits so hard if they want to appeal to the broadest customer base possible. They'll instead make the app scalable, so that users with AGP4X systems will simply get more data than those with AGP2X systems.
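
To make the arithmetic concrete, here is a back-of-the-envelope sketch in C. It assumes the 32-byte vertex layout described above, reads the 320MB/sec figure as 10 million vertices per second, and takes the worst case as 3 independent vertices for each of 10 million triangles; the numbers are the ones discussed here, but the framing of the calculation is my own.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed layout: position (3 floats) + normal (3 floats)
     * + one texture coordinate pair (2 floats) = 8 floats. */
    const int bytes_per_vertex = 8 * (int)sizeof(float);            /* 32 bytes */

    const double vertices_per_sec  = 10e6;  /* shared-vertex reading of the 10M figure */
    const double triangles_per_sec = 10e6;  /* worst case: 3 independent verts each    */

    double shared_mb = vertices_per_sec * bytes_per_vertex / 1e6;       /* ~320 MB/sec */
    double flat_mb   = triangles_per_sec * 3 * bytes_per_vertex / 1e6;  /* ~960 MB/sec */

    printf("bytes per vertex:      %d\n", bytes_per_vertex);
    printf("shared vertices:       ~%.0f MB/sec\n", shared_mb);
    printf("independent triangles: ~%.0f MB/sec\n", flat_mb);
    return 0;
}
```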

3. I talked to two of your developer support guys and they told me that you use a vertex buffer (or cache) in local memory. Could you describe the basic principle of this? As I understand it, the game can upload static geometry data (object by object) at the beginning of the game and then use this data very quickly (just send commands that tell the hardware which object from local RAM should appear where - so translate, rotate, scale object X)?

Sure, I started talking about this above. NVIDIA has been able to natively draw triangle strips and fans since the RIVA 128. But, as I mentioned above, a more efficient data structure is typically a triangle mesh. The GeForce 256 supports this by letting the application build a buffer or table of data that contains the vertices, then handing us a list of indices into the table, and having our chip pull in all the data as needed. What this means is that our driver does not have to transfer the 900MB we talked about above. Instead, we just write the indices (typically one 32-bit integer per vertex) across the bus, and let the hardware pull in all the data as it needs it. In an AGP bus-bound app, we could even put the vertex buffer in video memory, so that each frame all that gets transferred across the AGP bus is these little indices, instead of the X, Y, Z, RGBA, Nx, Ny, Nz, Specular, Fog, Tex0X, Tex0Y, Tex1X, Tex1Y, etc. that an app might need.
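
As an illustration of this indexed model, here is a minimal OpenGL 1.1 vertex-array sketch in C. The vendor-specific vertex buffer extensions of the time are left out, and the array names and the Vertex layout are assumptions for this example; the point is simply that only the small index list describes each draw call while the vertex table stays put.

```c
#include <GL/gl.h>

/* Interleaved vertex: position, normal, one texture coordinate pair.
 * 8 floats = 32 bytes, matching the estimate above. */
typedef struct {
    GLfloat x, y, z;     /* position */
    GLfloat nx, ny, nz;  /* normal   */
    GLfloat s, t;        /* texcoord */
} Vertex;

/* Assumed to be filled in elsewhere (e.g. at level-load time). */
extern Vertex  verts[];      /* the vertex table, built once          */
extern GLuint  indices[];    /* one 32-bit index per vertex reference */
extern GLsizei num_indices;  /* 3 * number of triangles               */

void draw_mesh(void)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);

    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), &verts[0].x);
    glNormalPointer(GL_FLOAT,    sizeof(Vertex), &verts[0].nx);
    glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), &verts[0].s);

    /* Only the index list describes this frame's geometry; the hardware
     * pulls the referenced vertices from the table as it needs them. */
    glDrawElements(GL_TRIANGLES, num_indices, GL_UNSIGNED_INT, indices);
}
```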