The Xenos ROPs with their eDRAM and limited resolution targets were obviously (as Richard Huddy remarked long ago) never going to be part of a PC market approach. How did the need to provide new ROPs for R600 impact design choices for the ROP, given it's now tied to a unified core?

The render back-end blocks are still largely a set of fixed-function elements (from both HW and API perspectives). Consequently, adding back-end elements is still a very modular process, regardless of the shader core. We had options on the R6xx design to leverage the R5xx render back-ends or invest in a new design. From a design standpoint, taking features and schedule into account, we invested in new designs for the R600. But that was not necessarily linked to having a new shader design.

Were any features of the core designed explicitly with the GPU computing market in mind?

GPGPU (GPU computing) has been influencing us for some time, and it continues to do so (you should see the number of daily emails I get from various folks!). When we designed the shader cache memory import/export, for example, we envisioned both DX10 items (GPR virtualization, GS overflow) and GPGPU uses (inter-thread communication, memory writes). In general, we target real HW with both classic graphics and more GPGPU-centric items in mind. We've not had many features that are GPGPU-centric and cannot be leveraged in graphics. A few come to mind (such as improved shader precision), but there aren't any significant ones. Our head of application research (Raja Koduri) is always telling me that the two are intertwined and should not be viewed separately; they enhance each other, and there are no real cases where a benefit to one doesn't help the other.

I've been stubborn, but I believe him now. He's always right in the end; quite annoying.
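
To make the "inter-thread communication, memory writes" point concrete, here is a minimal CPU-side sketch in plain C (not R600 shader code, and not any AMD API) contrasting the classic pixel-shader output model, where a thread can only write colour to its own raster position, with the arbitrary-address scatter writes that a memory export path enables for GPGPU work. All names and sizes are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define WIDTH  8
#define HEIGHT 8

/* Classic pixel-shader style output: each "thread" (x, y) may only
   write colour to its own location in the render target. */
static void pixel_style_write(uint32_t rt[HEIGHT][WIDTH], int x, int y,
                              uint32_t colour)
{
    rt[y][x] = colour;               /* destination fixed by raster position */
}

/* GPGPU-style memory export: a thread computes an arbitrary destination
   address (here a histogram bin) and scatters a value to it. This is the
   kind of access pattern a pixel-centric backend cannot express. */
static void scatter_style_write(uint32_t *buffer, size_t buffer_len,
                                uint32_t value)
{
    size_t index = value % buffer_len;   /* arbitrary, data-dependent address */
    buffer[index] += 1;                  /* e.g. building a histogram */
}

int main(void)
{
    uint32_t render_target[HEIGHT][WIDTH] = {{0}};
    uint32_t histogram[16] = {0};

    for (int y = 0; y < HEIGHT; ++y)
        for (int x = 0; x < WIDTH; ++x) {
            uint32_t colour = (uint32_t)(x * 31 + y * 7);
            pixel_style_write(render_target, x, y, colour);
            scatter_style_write(histogram, 16, colour);
        }

    printf("histogram[0] = %u\n", (unsigned)histogram[0]);
    return 0;
}
```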

When was 80HS chosen in the design cycle, and did you know about the static leakage properties before you committed to production on that node?

When we selected 80HS, we selected it based on faster transistors, better density, and the fact that 65nm would not be available in time for R600 production. 65nm did get pulled in, so the schedule aspect was probably a little off. However, the transistors are faster and leakier, and we were aware of this. We ended up setting the final clock based on TDP for worst-case scenarios, which usually means there's a lot of headroom left for overclocking. Most boards seem to get their engine clocks over 800 MHz without any problems. With a good bottle of liquid nitrogen, you can get over 1 GHz ;-)
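
As a rough illustration of why setting the final clock from worst-case TDP leaves headroom on typical parts, here is a small C sketch using the textbook dynamic-power relation (P ≈ C·V²·f) plus a per-bin static leakage term. Every constant here is invented for the example and is not real R600 characterisation data.

```c
#include <stdio.h>

/* Toy power model: total = dynamic (C * V^2 * f) + static leakage.
   All numbers are illustrative, not real R600 characterisation data. */
static double total_power_w(double ceff_nf, double volts, double freq_mhz,
                            double leakage_w)
{
    double dynamic_w = ceff_nf * 1e-9 * volts * volts * freq_mhz * 1e6;
    return dynamic_w + leakage_w;
}

/* Pick the highest clock (in 5 MHz steps) that keeps the leakiest bin
   inside the board power budget. Typical parts then have headroom. */
static double max_clock_mhz(double tdp_w, double ceff_nf, double volts,
                            double leakage_w)
{
    double f = 0.0;
    while (total_power_w(ceff_nf, volts, f + 5.0, leakage_w) <= tdp_w)
        f += 5.0;
    return f;
}

int main(void)
{
    const double tdp_w = 215.0;  /* board power budget (illustrative)  */
    const double ceff  = 135.0;  /* effective switched capacitance, nF */
    const double volts = 1.2;

    double worst_bin = max_clock_mhz(tdp_w, ceff, volts, 70.0); /* leaky die   */
    double typical   = max_clock_mhz(tdp_w, ceff, volts, 45.0); /* typical die */

    printf("worst-case bin clock: %.0f MHz\n", worst_bin);
    printf("typical part clock:   %.0f MHz (headroom above shipping clock)\n",
           typical);
    return 0;
}
```

The leakier worst-case die spends more of the budget on static leakage, so the clock it can sustain inside the TDP is lower than what a typical die could reach; that gap is where the overclocking headroom on most boards comes from.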

Given how it turned out on 80HS, how much was the final product affected in terms of yields and shipping clocks? We get the sense that ~750 MHz was the low end of the scale for the final target clock, for example, but that 80HS held you back.

We knew we were going to hit the power envelope with the higher-leakage parts, and we could have done some more binning to create more SKUs. 742 MHz was in the range of our expectations, though we initially thought we would be a little higher (~800 MHz). Between simplifying the binning process and various design decisions, 742 MHz ended up as a good middle ground that gives a competitive part in this price range with good yields.

What kind of changes to the memory controller were made for R600? Did the ring bus investment made during the R520 era pay off more or less than you'd hoped, and is it looking like an important piece of the optimisation puzzle going forward?

The R600 uses an evolutionary approach to the memory controller with respect to the R5xx MC. We went fully distributed on both reads and writes, with arbitration decentralized to pairs of channels, and it supports full virtualization, including all elements of the Windows advanced driver model. We went to 8 channels of 64-bit data, giving the 512-bit memory subsystem. I think this evolutionary approach is actually what allowed a 512-bit interface to be built and used to our advantage; without it, the physical layout complexity would have been too substantial. We absolutely need to keep optimizing its settings to get better performance out of current and future games, and we've only just started that work. The advantage of being flexible and programmable is that you can adapt to many different situations and get great performance. The disadvantage is that doing that work requires post-silicon effort and a team to optimize, and it's not an easy or quick thing to do. I expect further performance improvements as time goes on and as the teams get to know this new architecture.
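
To put the "8 channels of 64-bit data" figure in perspective, here is a short C sketch that maps a physical address to one of eight channels using a simple round-robin interleave and works out peak bandwidth for an assumed memory data rate. The stripe size, mapping, and data rate are illustrative assumptions, not the actual R600 MC behaviour.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS        8     /* 8 channels x 64 bits = 512-bit interface   */
#define CHANNEL_WIDTH_BITS  64
#define INTERLEAVE_BYTES    256   /* assumed stripe size per channel (made up)  */

/* Simple round-robin interleave: consecutive 256-byte stripes go to
   consecutive channels. The real MC mapping is more sophisticated. */
static unsigned address_to_channel(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / INTERLEAVE_BYTES) % NUM_CHANNELS);
}

/* Peak bandwidth = bus width in bytes * memory data rate (transfers/s). */
static double peak_bandwidth_gbs(double data_rate_mtps)
{
    double bus_bytes = (double)(NUM_CHANNELS * CHANNEL_WIDTH_BITS) / 8.0;
    return bus_bytes * data_rate_mtps * 1e6 / 1e9;
}

int main(void)
{
    /* Show how consecutive stripes spread across the 8 channels. */
    for (uint64_t addr = 0; addr < 8 * INTERLEAVE_BYTES; addr += INTERLEAVE_BYTES)
        printf("address 0x%04llx -> channel %u\n",
               (unsigned long long)addr, address_to_channel(addr));

    /* e.g. GDDR3 at an assumed 1650 MT/s effective data rate. */
    printf("peak bandwidth at 1650 MT/s: %.1f GB/s\n",
           peak_bandwidth_gbs(1650.0));
    return 0;
}
```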