Richard Huddy

Richard Huddy was up first, discussing the Graphics Products Group's latest hardware and where they think the industry is headed as process nodes get ever smaller. The early part of his presentation centered around the best use of their R6-family of processors for both D3D9 and D3D10. Richard indicated that given the performance of RV610 in D3D10, it might not be unreasonable to consider it a good D3D9 GPU and stop there, essentially asking developers to consider leaving RV610 out of the loop when it came to D3D10 GPU targets.

That advice ties into one of the things that favours D3D10 development in the first place: the near-total absence of capability checking. Every D3D10 hardware implementation is guaranteed to meet the API's base specification without exception, with Microsoft working alongside the IHVs to make that base spec as feature-rich as possible. That common feature base means game developers can concentrate on scaling the game experience across different PC hardware by rendering fidelity, not rendering ability.
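
By way of illustration, here's a rough sketch (our own, not code from the presentation) of the per-device capability interrogation that D3D9 demands and D3D10 does away with; the specific caps queried are arbitrary picks of ours.

    #include <d3d9.h>

    // A sketch of the per-GPU capability interrogation D3D9 requires and D3D10
    // removes. The caps queried here are illustrative picks of our own.
    void PickRenderPath(IDirect3DDevice9* device9)
    {
        D3DCAPS9 caps = {};
        device9->GetDeviceCaps(&caps);

        const bool hasSM30     = caps.PixelShaderVersion >= D3DPS_VERSION(3, 0);
        const bool hasFourMRTs = caps.NumSimultaneousRTs >= 4;

        // ...branch into different rendering paths depending on what this
        // particular chip can do. A D3D10 device has no equivalent query for
        // core features: if the device was created at all, the full base spec
        // is guaranteed, so the remaining decisions are about performance,
        // not ability.
    }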

The (hopefully) more predictable performance of D3D10 accelerators, which all offer hardware unified shading and transparent load balancing, means developers can get away with offering simple resolution-, GPU- and asset-based quality controls. Turning down texture or shadow map size, for example, or enabling lesser levels of surface filtering and MSAA via the API and user control, lets the end user tailor the gaming experience to his or her machine. Huddy's advice for RV610 goes one step further: a developer following it might not offer D3D10 execution on that GPU at all, because of its performance limitations.
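
As a concrete example of the MSAA side of that, a minimal sketch (again ours, assuming a D3D10 device and an R8G8B8A8 back buffer) of walking a user's requested sample count down to whatever the hardware will actually do might look like this:

    #include <d3d10.h>

    // Walk the user's requested MSAA sample count down to something the device
    // supports for the back-buffer format. Format and strategy are our choices.
    UINT PickSampleCount(ID3D10Device* device, UINT requested)
    {
        for (UINT samples = requested; samples > 1; samples /= 2)
        {
            UINT qualityLevels = 0;
            if (SUCCEEDED(device->CheckMultisampleQualityLevels(
                    DXGI_FORMAT_R8G8B8A8_UNORM, samples, &qualityLevels)) &&
                qualityLevels > 0)
            {
                return samples;   // highest supported count at or below the request
            }
        }
        return 1;                 // fall back to no multisampling
    }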

Crossfire

Crossfire was next on the agenda, Richard talking about how important multi-GPU is going to be going forward. At this point it's one of those quietly whispered secrets that AMD's next generation graphics architecture is going to centre around a dual-chip offering for the high-end, with R700 likely comprising two discrete GPU dice.

With that in mind, Huddy's advice about Crossfire maybe signals intent on AMD's part to push multi-GPU as graphics progresses deep into D3D10 and then on to D3D11 and beyond.

Huddy says that AMD simply aren't interested in building large monolithic GPUs any more, instead preferring to scale their offerings via multi-GPU. Speaking about a future where GPUs are built on process nodes approaching 22nm and beyond, he noted that limitations of the process itself start to encroach on how the IHVs build their chips. Huddy argues that chip size and pad count are among the biggest things the GPG architects are weighing up going forward, with no real chance of 512-bit external memory buses on tiny chips.

We asked him why you wouldn't just build an R600-sized GPU at 22nm, encompassing way north of a billion transistors while retaining the large external bus size, but he was firm that the future of graphics scaling lay not in large monolithic designs, despite their inherent attractions. Instead the future lies with multi-GPU, and AMD are looking to implement consumer rendering systems where more than two GPUs take part in the rendering process. Richard mentioned that the software to drive it gets more complex, both at the application level and at the driver level, but AMD will ensure their part of the software stack works and ISVs should be prepared to do the same.

Richard urged developers to test on Crossfire systems now, since they're available and sport some of the basic scaling traits that AMD expect to see in future multi-GPU implementations. His tips for developers centred on render target usage, and on the temporal reuse of rendered frames by feedback algorithms such as exposure systems and the like.

On a multi-GPU system rendering in round-robin AFR fashion, each GPU renders a discrete frame on its own. If one frame depends on one (or more) rendered before it, the contents of that earlier frame have to be shuffled across to the GPU drawing the current one, which costs performance in multiple (and sometimes subtle) ways.
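
A rough sketch of the classic offender, an exposure loop that feeds on the previous frame's measured luminance, might look something like the following; the structure and names are our illustration rather than code shown in the talk.

    #include <d3d10.h>

    // The classic AFR offender: an exposure (eye adaptation) pass that samples
    // the average luminance measured on the PREVIOUS frame. Names illustrative.
    void ToneMapPass(ID3D10Device* device,
                     ID3D10ShaderResourceView* previousFrameLuminance)
    {
        // Under AFR the previous frame was rendered by the other GPU, so before
        // this bind can be honoured the driver has to copy that surface across,
        // partially serialising GPUs that should be working independently.
        device->PSSetShaderResources(0, 1, &previousFrameLuminance);

        // ...draw the tone-mapped scene here, then measure this frame's average
        // luminance into a new render target for the next frame to consume,
        // perpetuating the frame-to-frame dependency.
    }

The general workarounds (our summary rather than anything prescribed in the talk) are to break the dependency entirely, or to accept luminance data that's a frame or two stale so the copy never becomes a hard synchronisation point.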

The message was a clear one: Crossfire is important now, and it will only become more so in the future, especially as AMD introduce new architectures and deliver consumer rendering systems with more than two GPUs contributing to a game's rendering. We can't help but think that designing around multiple hardware components contributing to rendering will become a basic tenet of the graphics industry going forward, and not just at AMD.

Back to the R6-family

Coming back to the R6 family, Richard noted various properties of the current shipping hardware range: L2 texel cache sizes (256KiB on R600, half that on RV630 and none at all on RV610), the fact that each SIMD can only run one program type per clock group, and the advice that you keep the chip, its memory controller and its on-chip memories happy in D3D10 by indexing RT arrays sensibly (use RTs 0, 1 and 2 rather than 0, 4 and 7, for example, if possible). The chip will speculatively fetch across sensible memory locations into cache, so you can reduce misses by packing data not just per-surface but across groups of surfaces.
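
Reading that RT advice as being about the slots used for multiple render target output (our interpretation), a minimal D3D10 sketch of keeping them contiguous could be as simple as this, with the G-buffer layout being purely illustrative:

    #include <d3d10.h>

    // Keep multiple render targets in one contiguous run of slots. The
    // G-buffer layout here is purely illustrative.
    void BindGBuffer(ID3D10Device* device,
                     ID3D10RenderTargetView* albedo,
                     ID3D10RenderTargetView* normals,
                     ID3D10RenderTargetView* depth,
                     ID3D10DepthStencilView* depthStencil)
    {
        // Slots 0, 1 and 2 rather than scattering the same three targets
        // across slots 0, 4 and 7.
        ID3D10RenderTargetView* targets[3] = { albedo, normals, depth };
        device->OMSetRenderTargets(3, targets, depthStencil);
    }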

Before we get to the tessellator chatter, it's worth noting that AMD recommend a 1:4 input:output relationship as an upper bound for GS amplification (in other words, a geometry shader emitting no more than four primitives per input primitive) to keep output primitive data on chip, and thus performance high, before the hardware has to regularly spill to board memory via SMX.

Tessellation

The R6-family tessellator's light burned strongly in Richard's presentation. A programmable primitive tessellator will become part of the official D3D spec with DirectX 11, but AMD are first out of the gate with a pre-11 implementation. They look to expose it first in D3D9 applications if developers are interested, with D3D10 support provided to the ISVs probably closer to 2008.

Richard said the tessellator's main use is for added graphical fidelity, rather than for situations where the tessellated geometry might interact with gameplay in a crucial, game-changing way. The unit's performance also got a mention: triangle throughput upwards of a billion per second is pretty trivial for the tessellator to push down to the rest of the hardware.