The Shader Core – Texture Units And The GEOD (Geometry Engine Of Doom)

As already noted, each SM also includes 4 samplers (TUs), each of which is capable of generating one address and fetching at most 4 texture samples per cycle (we say at most because not all texture formats are created equal, as you'll soon see). The TUs run at base clock and support all filtering modes from bilinear to anisotropic, with low angle dependence. They meet all DX11 requirements – 16384 U/V dimensions for 1D and 2D textures, fully addressable by the filtering hardware, and support for the new BC6H and BC7 block-compression formats. Finally, each quad of TUs is backed by a 12KB streaming texture cache, with texture data also being cached in the L2 and apparently being able to occupy almost all of it (we believe that the maximum is a bit under total L2 size).

There's a mysterious feature that was presented at the Fermi launch, and one that we've not been able to trigger in practice yet, namely four-offset Gather4, which again is somewhat misleading nomenclature. It's true that there's an overloaded version of the Texture Object member function, Gather, that applies an offset prior to sampling, but the offset is applied to the base sample coordinate around which the sample kernel is applied, rather than on a per kernel sample coordinate basis – you're still practically sampling a quad around a coordinate, rather than 4 offset points in the neighbourhood of that coordinate.

What's more likely is that the compiler picks up a sequence of sample function calls that use the offset overload and, if restrictions are met, coalesces them into a single texture fetch instruction. The key restriction is that the coordinates must be contained in a 64x64 grid: (maxU-minU) <= 64 && (maxV-minV) <= 64. We think this indirectly hints at how texture data fetching is potentially done, in blocks of 64x64 texels, at least for the L2, since a 64x64 block wouldn't always fit in the L1 cache depending on texture format (note that like prior NVIDIA GPUs, Fermi keeps textures compressed in its L1 cache).
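To make the restriction concrete, here is a minimal Python sketch of the window test as we understand it. The function name `fits_coalescing_window` and the list-of-tuples input are our own illustrative choices, not anything exposed by the driver or compiler, and the real check may well involve further conditions we can't see.

```python
def fits_coalescing_window(coords, window=64):
    """Return True if every (u, v) coordinate lies within a window x window grid.

    Mirrors the condition quoted in the text:
    (maxU - minU) <= 64 && (maxV - minV) <= 64.
    `coords` is a hypothetical list of integer texel coordinates produced by
    a run of offset-overloaded sample calls.
    """
    if not coords:
        return False  # nothing to coalesce
    us = [u for u, _ in coords]
    vs = [v for _, v in coords]
    return (max(us) - min(us)) <= window and (max(vs) - min(vs)) <= window
```

Four samples spread over a few texels pass the test, while a pair of samples 65 texels apart in U would have to be issued as separate fetches.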

And now, for the grand closing, let's talk about the GEOD (yes, we know, its marketing name is something else, but we like ours better). This has quite probably been the most talked about Fermi architectural detail, since it's a big departure from the status quo and, depending on where one's sensibilities lie, can be regarded as the best thing ever or as over-engineered and ultimately useless. We'll reiterate right from the start that we think it's primarily an elegant way to use architectural traits that were already decided upon because of where NVIDIA's envisioned evolutionary path leads.

DirectX, just like OpenGL before it, relies on strict ordering semantics – commands must appear to execute in the order they are issued in code, with emphasis on appear. This has been one of the main stumbling blocks with regard to truly parallelising geometry processing, since any parallel hardware implementation must be able to maintain API ordering. You could parallelise, but at some point you had to re-order – this was traditionally done before the setup stage, which has been a somewhat natural serialisation point for GPUs since, well, they came into being. Fermi changes this in a number of ways, moving away from the sort-last fragment approach with one setup block and going for something that, at least to us, looks quite a bit like sort-middle with distributed setup (for those interested, [Molnar et al, 1994] is the taxonomy we're relying on in our differentiation).

Basically, there are two levels of parallelism involved: all work up to rasterisation is parallelised across SMs, and rasterisation is parallelised across GPCs. The first part is not necessarily novel (remember though that the GTE works to balance out this work too), although it differs significantly from ATI's solution in that it parallelises the fixed function tessellator stage (TS) too, whereas on Cypress there's a single hardware block for it, which implicitly means another serialisation/choke point.

To be more precise, we think that on Slimer, tessellation is entirely handled on the ALUs – the first phase which deals with tessellation factor processing/conditioning being a native fit, since it uses 32-bit FP math, and the second, which actually dices the domain and uses a 16-bit fraction fixed-point format, leveraging the more than adequate INT throughput available. It's possible that there are actually 16 separate dedicated hardware blocks implementing this stage, but we're not sure how beneficial it would be, since it would require extra wiring/routing, as well as some extra transistors (not too many, mind you, a tessellator is a rather simple/cheap thing overall).
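To show why the dicing phase maps comfortably onto integer ALUs, here is a small Python sketch that splits one domain edge into equal segments using nothing but integer math on a 16-bit fraction. This is our own simplification for illustration: it only covers evenly spaced integer partitioning, it is not the D3D11 reference tessellator, and the names `FRAC_BITS`, `dice_edge` and `to_float` are hypothetical.

```python
FRAC_BITS = 16            # 16-bit fraction, as mentioned in the text
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def dice_edge(segments):
    """Return the fixed-point parametric positions of the points on one edge.

    Uses only integer multiply and divide, the kind of work an INT pipeline
    handles natively. `segments` is an assumed, already-conditioned integer
    tessellation factor.
    """
    if segments < 1:
        raise ValueError("segments must be at least 1")
    return [(i * ONE) // segments for i in range(segments + 1)]


def to_float(fixed):
    """Convert a fixed-point position back to a float in [0, 1]."""
    return fixed / ONE
```

With 4 segments the positions come out as exact multiples of 16384, i.e. 0.0, 0.25, 0.5, 0.75 and 1.0 once converted back to floating point.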

Viewport transform (we include clipping and culling under this umbrella) is parallelised too, which is another difference from prior designs. What is certain is that it's damn fast. What is uncertain is whether it's implemented on the ALUs as well, or if the tiny hardware block NVIDIA has been using to implement this functionality for a long time simply got replicated across SMs, and set to run at base clock. We lean towards the latter variant, but would not be completely surprised if the former proved to be true.

After viewport transform we have visible triangles (points/lines) sitting nicely in screen-space that we want to actually rasterise.  Remember, Slimer has 4 rasterisers, one per GPC. Each rasteriser uniquely owns a set of screen tiles, runs at base clock, and can generate coverage information for at most 8 pixels per cycle (remember that attribute interpolation is handled on the SMs, by the SFUs, so the output of the rasteriser is coverage data/interpolation parameters).

We include Z-cull here too, by the way, and our measurements would suggest that it's able to discard at least 512 fragments per cycle if they are found to be conservatively occluded by pixels already in the framebuffer (the check is done against all pixel tiles that are at least partially inside the current primitive, generated in an early coarse rasterisation step). The last bit gives us a direct indication of the tile size, namely 8 pixels.

We'll later show this assumption to be correct, with tiles being probably 4 pixels wide by 2 pixels tall as in previous designs, with the somewhat small size being used to hopefully reduce imbalances (pathological cases can still manifest, which can't be circumvented by small tiles or reasonable buffering). Now that we know this, we can get back to our screen-space triangles, which we've processed in parallel up to now. The first thing that must be done is to distribute them to the rasterisers, which is done using each triangle's bounding box (in this case practically a rectangle, we're in a 2D space after all).

Triangles that cross multiple tiles get distributed to owners of crossed tiles, and work gets replicated. Once this GTE-controlled distribution of triangles is performed, these get buffered and re-ordered at their destination GPCs, prior to rasterisation, to return to API ordering. Once this is done, rasterisation can proceed, and no further sorts are needed.
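The distribute-then-re-order step can be sketched in a few lines of Python. Treat this as a toy model under stated assumptions: the 4x2 tile size is our guess from above, the tile-to-rasteriser mapping in `owner` is entirely made up (the real ownership pattern isn't public), and `distribute` is a hypothetical helper rather than a description of what the GTE does in hardware.

```python
TILE_W, TILE_H = 4, 2     # tile size we suspect, in pixels
NUM_RASTERISERS = 4       # one per GPC on Slimer


def owner(tile_x, tile_y):
    """Hypothetical static mapping from a screen tile to its rasteriser."""
    return (tile_x + tile_y) % NUM_RASTERISERS


def distribute(triangles):
    """Bin triangles to rasterisers by bounding box, then restore API order.

    `triangles` is an iterable of (api_index, (min_x, min_y, max_x, max_y)),
    with inclusive pixel bounds, arriving in whatever order the parallel
    geometry work finished. Returns one list of api_index values per
    rasteriser, sorted back into submission order.
    """
    queues = [[] for _ in range(NUM_RASTERISERS)]
    for api_index, (x0, y0, x1, y1) in triangles:
        owners = set()
        for ty in range(y0 // TILE_H, y1 // TILE_H + 1):
            for tx in range(x0 // TILE_W, x1 // TILE_W + 1):
                owners.add(owner(tx, ty))
        # Work is replicated: every owner of a crossed tile gets the triangle.
        for r in owners:
            queues[r].append(api_index)
    # Re-order at the destination so rasterisation sees API order.
    for q in queues:
        q.sort()
    return queues
```

Feeding it three triangles that arrive out of order, one of which has a tiny 2x2 pixel bounding box straddling a tile corner, yields that small triangle in three separate queues, each queue sorted by submission index.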

Data is kept on chip as much as possible, after the initial vertex fetch, with the lowest level in the memory hierarchy that gets hit being the L2, for data marshaling between pipeline stages, and the post geometry processing pre-rasterisation re-ordering. It is our humble opinion that having the L2 was the key to making the parallel approach to geometry tasks feasible, and the rest is mostly peanuts by comparison.

Once you have a fat enough conduit to marshal data around so that you can re-order when needed, and also enough buffering to make the re-ordering possible, the rest is easy as pie. In theory, all of this means that a Slimer should be capable of a rasterisation rate of 4 triangles per base clock.

In practice, even small triangles can cross tile boundaries (the screen is statically partitioned, there's no rule against a 2 pixel triangle crossing two tiles), there are multiple attributes/multi-channel attributes to interpolate per pixel (so even if interpolation parameters for 4 triangles were generated per base clock, the rate at which fragments belonging to those triangles became available in the PS would be lower than expected), and TANSTAAFL still applies to the data-routing, buffering and re-ordering even if it all happens on chip, so the rate is significantly under that, even in good cases.

We've managed to hit about 2.07 triangles per base clock ourselves. Finally, on GeForces there's an additional artificial limit (amusingly enough, Teslas get it too) that we'll discuss when presenting our results.
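For those who like to see the arithmetic spelled out, the following Python snippet works through the peak and measured figures quoted in this section. The base clock is left as a parameter since we don't restate it here, and the function names are ours, purely for illustration.

```python
RASTERISERS = 4                  # one per GPC
PIXELS_PER_RASTERISER = 8        # peak coverage per rasteriser per base clock
PEAK_TRIS_PER_CLOCK = 4.0        # theoretical rate: one triangle per rasteriser
MEASURED_TRIS_PER_CLOCK = 2.07   # the best figure we managed to hit


def peak_coverage_pixels_per_clock():
    """Aggregate coverage rate across all rasterisers, per base clock."""
    return RASTERISERS * PIXELS_PER_RASTERISER


def raster_efficiency(measured=MEASURED_TRIS_PER_CLOCK, peak=PEAK_TRIS_PER_CLOCK):
    """Fraction of the theoretical triangle rate actually achieved."""
    return measured / peak


def triangles_per_second(tris_per_clock, base_clock_hz):
    """Scale a per-clock rate by a caller-supplied base clock (an assumption)."""
    return tris_per_clock * base_clock_hz
```

In other words, the measured rate sits at a bit over half of the theoretical one, against an aggregate peak of 32 pixels of coverage per base clock.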

On that somewhat hypnotic note, our incursion into Slimer's shader core is concluded. We could have written more, and actually we will, but in the spirit of not blowing the entire load in one fell swoop, we'll stop here and move to the last piece of the puzzle, the ROPs and the memory interface.