The Shader Core – Caches, ALUs And All Things Compute

For any modern GPU, most of its magic lies hidden in its shader core, and Fermi is no different. NVIDIA's architects clearly set out with a slew of lofty goals when designing it, trying to round off rough corners that hampered their prior works of art. Let's start with the GPCs, since they're both the first split in the chip's hierarchy and the main coagulating elements of the shader core.

As we've already mentioned, a GPC aggregates a rasteriser and at most 4 SMs (differing SKUs get some SMs disabled, thus the "at most" part). We're reasonably certain that the interface with the L2 is per GPC, rather than per SM, so by way of consequence some buffering must be in place, to absorb read/write requests from/to SMs. Figuring out the actual width of this interface is non-trivial, because it didn't have the luck of being one of the things NVIDIA wanted to advertise intensely, and in-depth documentation on Fermi is scarce.

The only relevant figure we've seen is a mention of 230 GB/s of L2 bandwidth in a more recent presentation, made with regard to a Tesla C2050, and we're quite sure it's rounded for convenience. However, this did not deter us from engineering various tests aimed at getting a more solid grasp of what the link width is. Our current best (and somewhat educated) guess is that each GPC has a 1024-bit interface with the L2, for a grand total of 4096 bits across all 4 GPCs.

The interface itself is probably a 4-way crossbar (not necessarily the only possible topology, something like a multi-stage network being another possibility, for example), with a crosspoint complexity of 16 and a diameter of 1. If you're thinking that's a lot of wires you'd be correct, but a narrower width would be somewhat hard to reconcile with the 128B cache line size used in both the L1s and the L2.

Granted, it's not impossible (we flirted with an 896-bit per GPC width for a while, and there are actually some valid reasons for assuming that), but we're fond of nicely aligned numbers ourselves. Obviously, the enabled SM count impacts the total usable L2 bandwidth, so the theoretical maximum has to be adjusted accordingly. We're not sure if NVIDIA is reusing the same SRAM cells they use for the L1/shared memory pool, though that's somewhat likely, as is the use of banking, with bank width sitting at 32 bits.

The L2 services all memory transactions with 32-byte granularity. It's impossible to disable it and get uncached access, so all VRAM accesses go through it. The 32-byte granularity also indirectly indicates that each SM gets a 32-byte path out of the GPC, which means that it takes 4 base clocks to transfer a 128-byte cache line between the L2 and an SM's L1.
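The width guess and the transfer time fall out of simple arithmetic. Here is a minimal sketch in Python; the constant and function names are ours, and the per-SM path width is the estimate discussed above rather than a documented figure.

```python
# Figures taken from the text; the interface widths are educated guesses, not specs.
CACHE_LINE_BYTES = 128   # L1/L2 line size
SM_PATH_BYTES = 32       # assumed per-SM path out of the GPC, per base clock
SMS_PER_GPC = 4
GPCS = 4

def clocks_per_cache_line(line_bytes=CACHE_LINE_BYTES, path_bytes=SM_PATH_BYTES):
    """Base clocks needed to move one cache line over one SM's path."""
    return -(-line_bytes // path_bytes)  # ceiling division

gpc_width_bits = SMS_PER_GPC * SM_PATH_BYTES * 8
total_width_bits = GPCS * gpc_width_bits
crosspoints = GPCS * GPCS  # a full n-by-n crossbar needs n*n crosspoints

print(clocks_per_cache_line(), gpc_width_bits, total_width_bits, crosspoints)
```

Plugging in the 896-bit alternative (28 bytes per SM) gives 5 clocks per line with a partly empty last beat, which is one way to see why the aligned number is the more comfortable guess.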

Focusing on one of the SMs, we observe that it consists of 32 ALUs, each with access to a portion of the register file consisting of 1024 general purpose 32-bit registers (for a total of 32768 32-bit entries per SM, or 128 KB depending on how you want to look at things), 4 special function units (SFU), 16 load/store units (LSU), 2 multithreaded instruction units each consisting of a scheduler-dispatch pair, an instruction cache (we've not determined its size yet), and 64 KB of SRAM that can be partitioned between L1 cache and Shared Memory, in either a 48 KB/16 KB or a 16 KB/48 KB split.

There's also 64 KB of read-only constant/uniform cache and 4 texture units backed by 12 KB of texture cache -- the latter will be discussed in its own subsection, mind you. Also, there are independent input and output buffers for vertex/hull/domain/geometry/pixel shaders, with workload arrival and departure being independent of thread execution -- this helps with mapping the logical graphics API pipeline to the actual hardware implementation.

Moving from the bottom upwards, we'll start with the 64 KB of L1/shared memory. We're reasonably sure it's arranged as 64 32-bit wide banks. There's not much to say about the shared memory facet, since it's pretty much what you'd expect from a DirectX 11 compliant implementation and an evolution of what prior NVIDIA chips had. It's worth noting that DirectX 11 mandates 32 KB of shared memory, and as such in graphics mode the split is locked at 48 KB shared memory and 16 KB L1. With only 32 KB exposed, the effective, software-accessible arrangement is thus 32 32-bit wide banks.

This raises some interesting behind the scenes optimisation opportunities – for example, duplicating the first 16 banks in the extra 16 available, and hence maintaining full throughput even though the access pattern would generate conflicts in theory (threads in the same warp would hit different addresses in the same bank). Broadcasting is supported, with the further refinement of allowing multiple words to be broadcast in a single transaction, which is probably tied to how the L1 aspect of the duality works.
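To make the conflict and broadcast rules concrete, here is a simplified model in Python. It assumes the 32 software-visible banks described above and treats any number of broadcast words as free; the function name is ours, and the speculative bank duplication is not modelled.

```python
from collections import defaultdict

NUM_BANKS = 32   # software-visible banks, per the text
BANK_WIDTH = 4   # bytes (32-bit wide banks)

def conflict_degree(byte_addresses, num_banks=NUM_BANKS):
    """Number of serialised passes one warp-wide access would need.

    Threads reading the same 32-bit word share one broadcast, so only
    distinct words that map to the same bank count as a conflict.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())

warp = range(32)
print(conflict_degree([4 * t for t in warp]))        # unit stride
print(conflict_degree([8 * t for t in warp]))        # stride of two words
print(conflict_degree([0 for _ in warp]))            # every thread reads one word
print(conflict_degree([4 * (t % 2) for t in warp]))  # two words, both broadcast
```

Only the two-word stride comes out as a 2-way conflict in this model; the other three patterns complete in a single pass.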

The L1 is more interesting: it uses a 128-byte line size (accesses with finer granularity, for example 32-byte ones, won't hit in the L1), and it primarily serves as a register spillover cushion, making performance degradation with spills more graceful. We believe that the shared memory/L1 duality is achieved via muxing, in what is dubbed the interconnection network. The pool of SRAM runs at base clock, irrespective of partitioning, which means that bandwidth per SM equals 77.696 GB/s to shared memory/L1, for a chip aggregate of 1087.74 GB/s for our GTX 470.
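The bandwidth figures can be reproduced with a couple of multiplications. The sketch below assumes one 128-byte line per base clock per SM; the 607 MHz base clock and the 14 enabled SMs are not stated in this section, they are simply the GTX 470 values that yield the quoted numbers.

```python
BASE_CLOCK_HZ = 607e6   # GTX 470 base clock, roughly half of the 1.215 GHz hot clock
BYTES_PER_CLOCK = 128   # assumption: one 128-byte line per base clock per SM
ENABLED_SMS = 14        # GTX 470

per_sm_gb_s = BASE_CLOCK_HZ * BYTES_PER_CLOCK / 1e9
chip_gb_s = per_sm_gb_s * ENABLED_SMS
print(f"{per_sm_gb_s:.3f} GB/s per SM, {chip_gb_s:.2f} GB/s aggregate")
```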

It's unclear whether L1 and shared memory accesses can happen in parallel, albeit it seems likely. The L1 services transactions with 128-byte granularity. Throughput for the constant cache is similar, with the restriction that all threads in a warp read the same address (basically the equivalent of a broadcast from shared memory), since otherwise access gets serialised.

Hardware batch size (the so-called warp – this can be different from the thread group dimensions declared in code) is 32 threads of the same type (possibly 16 for vertex threads), just like with previous architectures. The SM manages a pool of 48 warps, with threads within the same warp sharing the same initial instruction counter. Fermi does dual-issuing right, as opposed to the contorted "missing MUL" effort, using its dual instruction units. Those run at the base clock, and on every issue cycle each of them selects a warp that is ready for execution (the next instruction in its instruction stream can be issued, with all dependencies satisfied) and issues the instruction to the active threads of that warp. Instructions can be SP float, DP float, load, store or SFU, with the caveat that DP instructions can't be co-issued.

One scheduler handles odd warps, whereas the other handles even ones. To determine warp readiness for execution, the schedulers use a register dependency scoreboard. Ready warps are prioritised based on type (our tests would suggest that pixel threads are higher priority than other thread types in the context of graphics, and rightly so), instruction type and fairness, amongst other things.  Given warp independence, instructions for differing warps can be issued on differing issue cycles.
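The selection logic can be sketched as a toy model in Python. Everything here is illustrative: the `Warp` class, the round-robin fairness rule and the register names are our own inventions, priorities by thread and instruction type are left out, and writeback has to be triggered by hand.

```python
from dataclasses import dataclass, field

@dataclass
class Warp:
    wid: int
    program: list                              # list of (dst, srcs) tuples
    pc: int = 0
    pending: set = field(default_factory=set)  # registers with writes in flight

    def ready(self):
        """Next instruction exists and touches no register still being written."""
        if self.pc >= len(self.program):
            return False
        dst, srcs = self.program[self.pc]
        return not (self.pending & ({dst} | set(srcs)))

def issue_cycle(warps, last_issued):
    """One issue cycle: each scheduler picks at most one ready warp of its parity."""
    issued = []
    for parity in (0, 1):                      # scheduler 0: even warps, 1: odd warps
        mine = [w for w in warps if w.wid % 2 == parity]
        # Round-robin: start searching just after the warp issued last time.
        start = next((i + 1 for i, w in enumerate(mine)
                      if w.wid == last_issued[parity]), 0)
        for w in mine[start:] + mine[:start]:
            if w.ready():
                dst, _ = w.program[w.pc]
                w.pending.add(dst)             # scoreboard marks the destination
                w.pc += 1
                last_issued[parity] = w.wid
                issued.append(w.wid)
                break
    return issued

prog = [("r2", ("r0", "r1")), ("r3", ("r2",))]  # second op depends on the first
warps = [Warp(i, list(prog)) for i in range(4)]
last = {0: None, 1: None}
print(issue_cycle(warps, last))  # one even and one odd warp issue
print(issue_cycle(warps, last))  # the other two warps take their turn
print(issue_cycle(warps, last))  # every warp now waits on r2, so nothing issues
warps[0].pending.clear()         # pretend warp 0's write to r2 has completed
print(issue_cycle(warps, last))  # warp 0 can issue its dependent instruction
```

The third cycle is the interesting one: with only four warps and a dependent instruction stream, both schedulers stall, which is exactly the situation a deep pool of 48 warps is meant to avoid.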

We've spoken of warps, but we've not yet mentioned how they come into being - this aspect is handled by the multiprocessor controller. Its role is to take in input data and tasks, arbitrate access to shared resources, the I/O paths and memory. It accumulates and packs tasks into warps, allocates a free warp in the SM pool, allocates registers for the threads in the warp, allocates shared memory and barriers in the case of compute threads, and starts warp execution only when the aforementioned allocations can be done. Upon warp completion (all threads exit), the controller unpacks results and frees resources.

Each fully-pipelined ALU is hardware multithreaded, supporting up to 48 threads. Each can execute one scalar instruction per thread per hot clock, which in the case of the GTX 470 is 1.215 GHz, for a total of 544.32 GInstr/s for most common FP instructions. The per-ALU RF gets partitioned amongst threads assigned to that particular ALU. The SM can manage and execute up to 1536 threads, assuming that the per thread register requirements fit into the available RF (for example, if each thread needs 21 registers, the SM is fully populated, but at 22 registers only 1489 threads can run concurrently, or ~46 threads per ALU). Maximum theoretical register allocation per thread is 63 32-bit registers.  Each ALU executes an instruction for two individual threads of a warp using two hot-clocks, and that's the basic execution profile of the hardware.
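The occupancy and throughput numbers above can be checked with a few lines of Python. This is a sketch that ignores any register allocation granularity the hardware may impose (real allocations are presumably made per warp), and the 14 SM count is the GTX 470 value rather than something stated in this paragraph.

```python
RF_ENTRIES_PER_SM = 32 * 1024  # 32 ALUs x 1024 32-bit registers
MAX_THREADS_PER_SM = 48 * 32   # 48 warps of 32 threads
ALUS_PER_SM = 32

def resident_threads(regs_per_thread):
    """Threads that fit in the register file, ignoring allocation granularity."""
    return min(MAX_THREADS_PER_SM, RF_ENTRIES_PER_SM // regs_per_thread)

def peak_ginstr_per_s(sms=14, hot_clock_ghz=1.215):
    """One scalar instruction per ALU per hot clock."""
    return sms * ALUS_PER_SM * hot_clock_ghz

for regs in (21, 22, 63):
    n = resident_threads(regs)
    print(regs, n, round(n / ALUS_PER_SM, 2))
print(round(peak_ginstr_per_s(), 2))
```

At the 63-register maximum only 520 threads fit, or about 16 per ALU, which gives a feel for how quickly register pressure eats into latency hiding.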

NVIDIA claims that each ALU has a fully pipelined fully capable 32-bit integer block alongside the floating point one. This is a bit misleading if one assumes based on it that the chip can issue 544.32 GInstr/s for INT operands – in practice, the INT block appears to be half-rate, so an INT warp is processed across 4 hot-clocks. Another potential explanation for the anomaly involves the half-rate DP support that Fermi also touts, for which there are a few implementation possibilities: either some loopback or ALU "fusing" magic (which is no doubt interesting in theory, but can raise some serious practical challenges) or having 16 of the 32 ALUs in a SM be actually full DP ALUs, with them also handling the INT math, since they'd be quite adequately equipped for that too.

To be honest, we lean towards the latter variant. The no co-issuing for DP restriction still makes sense even in that arrangement, once we consider that a DP operand takes two 32-bit registers, and occupies two slots in the input and output buffers respectively – basically it's equivalent to processing two SP warps in parallel with regards to those particular metrics. We look forward to getting NVIDIA peeps on the Beyond3D mic to discuss this, amongst other things.
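As a quick sanity check on the half-rate observation, the sketch below relates issue rate to hot clocks per warp. It assumes each scheduler drives 16 of the 32 ALUs, which is our reading of the two-threads-per-ALU execution profile, and it takes the 544.32 GInstr/s figure from the text as an input.

```python
WARP_SIZE = 32
LANES_PER_SCHEDULER = 16   # assumption: each scheduler drives half of the 32 ALUs
FULL_RATE_GINSTR = 544.32  # GTX 470 figure quoted in the text

def hot_clocks_per_warp(rate):
    """Hot clocks to push one warp instruction through at a fraction of full rate."""
    return WARP_SIZE / (LANES_PER_SCHEDULER * rate)

def effective_ginstr(rate):
    """Chip-wide instruction rate scaled by the same fraction."""
    return FULL_RATE_GINSTR * rate

for name, rate in (("SP float", 1.0), ("INT, as measured", 0.5)):
    print(name, hot_clocks_per_warp(rate), effective_ginstr(rate))
```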

Fermi's SP/DP math is fully IEEE 754-2008 compliant, even including denormals.  Another change is the move from the traditional MAD that we've known and loved with so many GPUs in the past to the more precise FMA. This means that the multipliers in each FP block are wider than merely the mantissa bits would imply, in order to accommodate guard and round bits. No mechanism for FP exception detection is in place, and as such the GPU behaves as if exceptions were masked, delivering the masked response according to IEEE 754 specifications, in the case of a floating point exception being thrown.
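The practical difference between MAD and FMA is one rounding step. A minimal illustration in Python follows; note that it runs in double precision on the CPU and emulates the fused operation with exact rationals, so it demonstrates the principle rather than Fermi's actual datapath.

```python
from fractions import Fraction

def mad(a, b, c):
    """Multiply, round, then add and round again: two roundings."""
    return (a * b) + c

def fma(a, b, c):
    """Exact a*b + c rounded once. Emulated with rationals, not a hardware FMA."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0
print(mad(a, b, c))  # the product rounds to 1.0, so the tiny term is lost
print(fma(a, b, c))  # the -2**-60 term survives the single rounding
```

The exact product here is 1 - 2^-60, which does not fit in a double's mantissa; only the fused version carries it through to the addition.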

The 16 LSUs run at hot-clock too, with all address generation happening here. Another significant departure with Fermi is the move to a generic, unified address space that encompasses all memory from VRAM to the L1 cache – thus no longer needing specialised load/store instructions for separate address spaces, albeit they stick to using integer byte addressing with register+offset address arithmetic. The address space itself is 40 bits virtual and physical (prior GPUs were already using a 40 bit virtual address space too, mind you).

We're willing to bet that load/store instructions actually use 32-bit byte addresses which get extended to 40 bits via offsetting, with an MMU handling virtual to physical translation. They're also fair game for co-issuing, mind you, so overall moving all the addressing hoopla to dedicated units is likely to be a win, saving ALU throughput and making it easier for the compiler to achieve optimal scheduling.

Moreover, Fermi makes a further step towards RISC-ism, being a proper load/store architecture, with all operands having to be moved into/out of registers, an example being shared memory: older architectures could use shared memory operands directly, whereas Slimer uses register load/store.

Finally, we come to the 4 SFUs, whose lineage can be traced back to [Oberman and Siu, 2005]. An SFU can compute either transcendental functions or planar attribute interpolations. For transcendental approximation it uses quadratic interpolation based on enhanced minimax approximations. Three lookup tables holding coefficients for interpolation are used, with each approximated transcendental having its own set of tables, for a total of 22.75 Kb per SFU for table storage. Accuracy for the resulting approximation ranges from 22 to 24 good bits.
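Table-driven quadratic interpolation is easy to sketch in Python. Be aware of what is swapped in here: the coefficients come from a plain three-point fit per segment rather than the enhanced minimax approximation the SFU uses, the 64-segment table depth is a hypothetical choice, and the arithmetic is ordinary double precision rather than fixed-point hardware.

```python
SEGMENTS = 64  # hypothetical table depth, not the SFU's actual configuration

def build_table(f, lo=1.0, hi=2.0, segments=SEGMENTS):
    """Per-segment (c0, c1, c2) so that f(x0 + t) ~= c0 + c1*t + c2*t*t."""
    h = (hi - lo) / segments
    table = []
    for i in range(segments):
        x0 = lo + i * h
        f0, f1, f2 = f(x0), f(x0 + h / 2), f(x0 + h)
        c1 = (-3 * f0 + 4 * f1 - f2) / h
        c2 = 2 * (f0 - 2 * f1 + f2) / (h * h)
        table.append((f0, c1, c2))
    return table, h

def approx(table, h, x, lo=1.0):
    """Upper bits of x pick the segment, the remainder feeds the quadratic."""
    i = min(int((x - lo) / h), len(table) - 1)
    t = x - (lo + i * h)
    c0, c1, c2 = table[i]
    return c0 + t * (c1 + t * c2)

rcp_table, step = build_table(lambda x: 1.0 / x)
worst = max(abs(approx(rcp_table, step, 1.0 + k / 4096) - 1.0 / (1.0 + k / 4096))
            for k in range(4096))
print(worst)
```

Even this naive fit gets the reciprocal on [1, 2) to within roughly 2^-22 with only 64 entries of three coefficients each, which hints at why a small set of carefully tuned tables is enough for the accuracy quoted above.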

The SFU approximates reciprocal, reciprocal square-root, log2(x), 2^x, sine and cosine respectively (square root isn't directly approximated, but rather the result of doing RCP(RSQRT)), and its throughput for transcendentals is one per cycle per SFU, for a total of 4 per SM. For planar attribute interpolations it can interpolate one 32-bit attribute for a quad of pixels per cycle, thus up to 16 per-pixel attributes per SM.
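The square root composition and the interpolation rate can be spelled out in a few lines of Python. The `rsqrt` and `rcp` functions below are exact stand-ins for the SFU's approximations, valid for positive inputs only, and the names are ours.

```python
def rsqrt(x):
    """Stand-in for the SFU's reciprocal square root approximation."""
    return x ** -0.5

def rcp(x):
    """Stand-in for the SFU's reciprocal approximation."""
    return 1.0 / x

def sqrt_via_sfu(x):
    """Square root composed from the two primitives: 1 / (1 / sqrt(x))."""
    return rcp(rsqrt(x))

SFUS_PER_SM = 4
PIXELS_PER_QUAD = 4
attrs_per_cycle = SFUS_PER_SM * PIXELS_PER_QUAD  # one attribute per quad per SFU

print(sqrt_via_sfu(2.0), attrs_per_cycle)
```

Since the composition needs two SFU operations, a square root presumably costs two passes through the unit, although we have not measured that directly.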

Since this section is rapidly approaching unprecedented length, and since we are strongly opposed to putting our audience to sleep, we'll split it right here and move on to the less compute-heavy aspects of the SM.