NV40 Pipeline Overview
As we mentioned earlier, one of the key elements of NV40 is parallelism, and as such it features lots of vertex engines and lots of pipelines. Let's take a look at an overview of the NV40 rendering pipeline:
What we can see from the pipeline diagram is that NV40 features 6 distinct Vertex Shader engines, which equates to roughly twice the performance per clock of NV30, and 16 rendering pipelines. As we can see, despite the closeness of the Shader Model 3.0 Pixel and Vertex Shader sets, NVIDIA have opted to stick with distinct Vertex and Pixel Shader engines, as opposed to a unified Vertex and Pixel Shader ALU (Arithmetic Logic Unit) structure. In fact, when we asked David Kirk about the potential use of a unified structure he suggested that, as far as NVIDIA are concerned, this isn't a route they are pursuing as it has performance implications such as cache thrashing, which raises the question of whether they are looking at a unified pipeline structure even beyond NV4x.
The diagram above, though, isn't particularly accurate, as the pixel pipelines aren't entirely independent but are grouped in quads, as most pipelines have been since DirectX8. A "quad" refers to a 2x2 pixel grouping from a triangle. Once a triangle is set up and its size and position have been resolved to screen space, the triangle is split into 2x2 pixel tiles which are then dispatched to a quad pipeline (or 4 pixel pipes); all the associated colour, texture and pixel shader program work is then executed on that group of 2x2 pixels. Rendering in this fashion is done because it can be more efficient for the pipeline, and because some instructions in the later APIs have dependencies on the data in the neighbouring pixels in a quad. [Note: As the diagram above suggests, rendering in 2x2 pixel quads can leave potential inefficiencies, since at the edges of the triangle some pixels in a quad may fall outside the area that requires rendering. The pipes assigned to those pixels effectively sit idle, as they can't operate on other triangles while the other pipes in the quad group are occupied. Yet the benefits of a quad pipeline organisation usually outweigh this pitfall.]
With 16 pixel pipelines, NV40 can be rendering 4 quads at any one time. NV40 will dispatch quads from a triangle to each available quad pipeline in order until the triangle is fully dispatched, then quads from the next triangles will be dispatched as quad pipelines become available. One advantage here is that, certainly for larger triangles, the quad pipelines will often be rendering from the same triangle, which means they will often be accessing the same textures and pixel shader programs, resulting in good cache utilisation across all 16 pixel pipelines. With this quad structuring in mind we'd suggest the pipeline organisation would go something more like this:
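As an aside, the edge inefficiency of quad rendering noted above can be put into rough numbers with a short Python sketch. This is only an illustration under our own assumptions, not a description of how NV40 rasterises: the function names (`edge`, `covered`, `quad_stats`) are hypothetical, pixels are sampled at their centres, and a centre lying exactly on a triangle edge is counted as covered, whereas real hardware applies a stricter fill rule.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: positive when (px, py) lies to the left of edge a -> b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def covered(tri, px, py):
    """True when the point is inside the triangle (edges inclusive, either winding)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    w0 = edge(x0, y0, x1, y1, px, py)
    w1 = edge(x1, y1, x2, y2, px, py)
    w2 = edge(x2, y2, x0, y0, px, py)
    all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
    all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
    return all_non_negative or all_non_positive


def quad_stats(tri, width, height):
    """Walk the screen in 2x2 quads and return (quads, live_pixels, idle_lanes).

    A quad is dispatched if at least one of its four pixel centres is covered;
    every dispatched quad occupies four pixel pipes, covered or not.
    """
    quads = 0
    live = 0
    for qy in range(0, height, 2):
        for qx in range(0, width, 2):
            hits = sum(
                covered(tri, qx + dx + 0.5, qy + dy + 0.5)
                for dy in (0, 1)
                for dx in (0, 1)
            )
            if hits:
                quads += 1
                live += hits
    return quads, live, quads * 4 - live


# Right-angled triangle covering half of an 8x8 pixel area.
quads, live, idle = quad_stats([(0, 0), (8, 0), (0, 8)], 8, 8)
print(quads, live, idle)  # 10 quads dispatched, 36 live pixels, 4 idle lanes
```

For this triangle only the four quads along the diagonal edge carry an idle lane, so 36 of 40 lanes do useful work. Smaller or thinner triangles touch proportionally more edge quads, which is where the inefficiency becomes noticeable.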
With parallelism such as is on display here it's fairly easy to see how other devices in the NV4x range could be divided up. Although we don't yet know how other parts may be configured, we can guess that the entry level parts will consist of one quad, the mainstream parts one or two quads and the performance parts two or three quads, with the enthusiast segment filled by the 4 quad solution seen here. It may also be the case that rebranded NV40 ASICs will, possibly only temporarily, fill out one or two of these segments.
The die size of NV40 is very large in comparison to other graphics chips, and it's very unlikely that all the dies on a wafer will come back clean and working (it may even be the case that dies operating with all 16 pipelines / 4 full quads will be in the minority); however, with multiple quads on the die, NVIDIA may choose to operate a redundancy scheme in order to minimise the cost of dies that are not fully functional to NV40 specification. As ATI did with the R300 ASIC and the Radeon 9700 and 9500 products, should a defect occur in a pixel pipeline, which would probably be one of the largest elements of the die in these chips, then rather than wasting the entire ASIC the quad that the pipeline sits in can be disabled. The part could then be sold as a 12 pipeline / 3 quad product, so that at least some revenue comes in rather than the die being wasted, which would push up the price of the fully operational dies. We pushed David Kirk on this point on a couple of occasions, but he wasn't willing to confirm that NVIDIA would actually be doing this, although his smile suggested that they would! [Edit - recent revelations suggest that the non-Ultra GeForce 6800 will indeed feature 12 pipelines enabled, which is one quad block turned off.]
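To illustrate why disabling a defective quad is attractive, here is a minimal Python sketch of the arithmetic. It rests entirely on our own assumptions: each of the four quads is taken to fail independently with the same probability, defects elsewhere on the die are ignored, and the probability used is a purely hypothetical input, since we have no yield figures for NV40. The function name `quad_yield` is ours.

```python
from math import comb


def quad_yield(p_quad_defect, quads=4):
    """Return {working_quads: probability} for one die.

    Assumes each quad is defective independently with probability
    p_quad_defect (a hypothetical figure, not a measured one).
    """
    good = 1.0 - p_quad_defect
    return {
        k: comb(quads, k) * good ** k * p_quad_defect ** (quads - k)
        for k in range(quads + 1)
    }


# Hypothetical example: a 20% chance that any given quad has a defect.
dist = quad_yield(0.2)
full_parts = dist[4]              # dies usable as 16 pipeline / 4 quad parts
salvaged = dist[4] + dist[3]      # dies usable if a 12 pipeline part also exists
print(f"full: {full_parts:.1%}, full or 3-quad: {salvaged:.1%}")
```

Under these assumptions the dies with exactly one bad quad form a sizeable group that would otherwise be scrapped, which is the economic case for a 12 pipeline product.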