Beyond3D would like to thank Henry Moreton, Steve Molnar et al., for making it possible for us to shorten this bit significantly versus prior efforts!  If that doesn't make any sense at all, we'd urge you to recall that we used to discuss things like the setup engine and, recently, tessellation.  Well, for Fermi that discussion gets moved to the shader core, since both aspects get shuffled per GPC, with bits going as granular as per SM.  That doesn't mean that we don't have anything to talk about here, mind you!

To bring a GF100 to life, and have it perform tricks, one fills up a command buffer on the host side.  Whether via natural means (it gets full) or unnatural ones (you call a flush), the commands must then make their way to the GPU, which in the present case happens when the host interface reads the command buffer (translated into NV's RISCy ISA) via the PCIe interface.
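The submission path above can be modelled with a toy sketch; the class, its capacity and the command names are entirely our invention, purely for illustration of the fill-or-flush behaviour:

```python
class CommandBuffer:
    """Toy host-side command buffer: hands work to the 'GPU' either when
    it fills up (the natural flush) or on an explicit flush (the unnatural one)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = []      # commands accumulated host-side
        self.submitted = []    # stands in for the GPU-visible buffer read over PCIe

    def push(self, cmd):
        self.pending.append(cmd)
        if len(self.pending) >= self.capacity:  # natural means: it got full
            self.flush()

    def flush(self):                            # unnatural means: explicit flush
        self.submitted.extend(self.pending)
        self.pending.clear()

cb = CommandBuffer(capacity=2)
cb.push("DRAW")
cb.push("DRAW")       # second push fills the buffer, triggering auto-submission
cb.push("PRESENT")
cb.flush()            # explicit flush pushes the straggler through
print(cb.submitted)   # ['DRAW', 'DRAW', 'PRESENT']
```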

The macro-scheduler/dispatcher NVIDIA uses -- dubbed somewhat pompously the GigaThread Engine (GTE) -- then handles the fetching of the needed data from host memory to VRAM. This is as good a place as any to mention the addition of a second DMA engine (something the competition has had since the R600, if memory serves, not that it helped that chip suck any less), thus making it possible to have simultaneous reads from and writes to host memory. This holds more relevance for compute applications, where such exchanges are likely to be somewhat more frequent, and to involve larger data sets.
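The benefit of that second DMA engine is easy to quantify under a crude model (the transfer times below are made up for illustration): with one engine, an upload and a readback serialise; with two, they overlap and the pair completes in the time of the longer transfer.

```python
upload_ms = 3.0     # hypothetical host -> VRAM copy
readback_ms = 2.0   # hypothetical VRAM -> host copy

# One DMA engine: the two transfers queue behind each other.
one_engine = upload_ms + readback_ms
# Two DMA engines: reads and writes proceed concurrently.
two_engines = max(upload_ms, readback_ms)

print(one_engine, two_engines)  # 5.0 3.0
```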

That's only the first ball the GTE has to juggle. The more complicated part of its work involves handling of up to 24576 simultaneously active threads (this gives an indirect indication of the size of the queue used), their assemblage into thread groups and their dispatch to SMs for actual processing.
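That 24576 figure falls straight out of the per-SM limits; a quick sanity check, assuming the full 16-SM GF100 configuration with Fermi's 1536 resident threads per SM:

```python
SM_COUNT = 16          # SMs in a full GF100
THREADS_PER_SM = 1536  # maximum resident threads per SM on Fermi
WARP_SIZE = 32         # threads per warp

total_threads = SM_COUNT * THREADS_PER_SM
total_warps = total_threads // WARP_SIZE

print(total_threads)             # 24576 chip-wide
print(total_warps)               # 768 warps chip-wide
print(total_warps // SM_COUNT)   # 48 warps per SM
```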

A hierarchy of threads is maintained, based on aspects like resource availability, time spent in the wait queue or the nature of the thread, and GTE scheduling is driven by it. This includes the distribution of post-VS primitives (post-HS patches and tessellation factors when tessellation is enabled) to SMs for parallel processing of all stages up to and including viewport transform (just another type of thread, really). For this case it's also probable that it tracks the assigned primitive count per SM, and factors that into the scheduling too (once a certain count is reached, switch to another SM, so as not to have excessive input buffering needs).
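The per-SM primitive cap we speculate about above would look something like this in miniature; the cap, SM count and round-robin policy are our assumptions, not disclosed behaviour:

```python
def distribute_primitives(primitives, sm_count=16, cap=8):
    """Hand primitives to SMs in submission order, rotating to the next SM
    once the current one has received `cap` of them, so no single SM
    accumulates an excessive input buffer."""
    assignments = {sm: [] for sm in range(sm_count)}
    sm = 0
    for prim in primitives:
        assignments[sm].append(prim)
        if len(assignments[sm]) >= cap:   # cap reached: move on to the next SM
            sm = (sm + 1) % sm_count
    return assignments

batches = distribute_primitives(range(20), sm_count=4, cap=8)
print([len(batches[sm]) for sm in range(4)])  # [8, 8, 4, 0]
```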

Post viewport transform, it manages the redistribution of screen-space primitives to GPCs based on their bounding boxes: recall, each GPC has a rasteriser, and rasterisers uniquely own regions of screen space, so the assignment follows this ownership. Any primitive that crosses tiles is sent to multiple rasterisers, duplicating some work. Pre-raster primitives are buffered, and API submission order is reconstructed (a requirement of APIs like DirectX or OpenGL).  All of this back and forth happens via the L2 cache.
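The tile-ownership redistribution can be sketched as follows; the tile size and the interleaved ownership pattern are our assumptions (the real mapping is undisclosed), but the duplication of tile-crossing primitives is exactly what the text describes:

```python
TILE = 16       # hypothetical tile edge, in pixels
GPC_COUNT = 4

def owning_gpc(tx, ty):
    # Assume tiles are interleaved across GPC rasterisers in a simple pattern.
    return (tx + ty) % GPC_COUNT

def gpcs_for_primitive(xmin, ymin, xmax, ymax):
    """Return the set of GPCs whose rasterisers own tiles touched by the
    primitive's screen-space bounding box."""
    gpcs = set()
    for ty in range(ymin // TILE, ymax // TILE + 1):
        for tx in range(xmin // TILE, xmax // TILE + 1):
            gpcs.add(owning_gpc(tx, ty))
    return gpcs

# A small primitive within one tile hits a single rasteriser...
print(gpcs_for_primitive(0, 0, 10, 10))   # {0}
# ...while one spanning several tiles is duplicated to multiple rasterisers.
print(gpcs_for_primitive(0, 0, 40, 10))   # {0, 1, 2}
```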

Two significant improvements are touted for the GTE versus its prior incarnations: one is reduced context switch cost, the other is support for concurrent kernel execution. For the former, NVIDIA claims a cost of under 25 μs per context switch. The latter is another novelty, and basically amounts to kernels belonging to different streams of the same context executing in parallel, as long as enough resources are available.  Kernels from the same stream still execute serially, obviously.
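The stream semantics can be modelled with a small scheduler sketch (the tick-based model and the capacity figure are invented for illustration): kernels within a stream retain their order, while the heads of different streams may co-issue when capacity allows.

```python
def schedule(streams, capacity=2):
    """Greedy tick-based model of concurrent kernel execution: each tick,
    launch the head kernel of every stream, in order, while free slots remain.
    Within a stream, a kernel only becomes eligible once its predecessor
    has completed (i.e. in a later tick), preserving stream order."""
    heads = [list(s) for s in streams]
    timeline = []
    while any(heads):
        running = []
        for queue in heads:
            if queue and len(running) < capacity:
                running.append(queue.pop(0))
        timeline.append(running)
    return timeline

# Two streams of two kernels each, with room for two kernels at once:
print(schedule([["A1", "A2"], ["B1", "B2"]]))
# [['A1', 'B1'], ['A2', 'B2']] -- A1 and B1 run concurrently, A2 waits on A1.
```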

Let's head to the place little thread groups go after being scheduled, namely the shader core.