Why go superscalar in the ALU structure? R580's fragment shader core is a result of shader analysis for instruction properties, so is the superscalar design in R600 an extension of that work?

Yes, it was a result of our continued analysis of shaders. R5xx was pretty much a direct reflection of DX9 SM2/3, with some tweaks for enhancements. The R600 has a scalar design structure, which is optimized for what we see in current shaders and what we see in upcoming shaders. More and more work gets done in scalar paths, and focusing on this gives you the best flexibility to address that. Basically, you can do vector perfectly and deal with any scalar paths too. As well, from a physical standpoint, focusing on a scalar design is significantly easier than having to deal with vector datapaths.
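Editor's illustration: the utilization argument in the answer above can be sketched with a toy cycle count. This is not a model of R600's actual scheduler. The function names, the 4-lane width, and the instruction mix are all hypothetical, and the scalar case assumes, optimistically, that component operations have no dependencies and can be packed freely.

```python
import math

def vector_cycles(op_widths):
    """Cycles on a vector ALU that issues one instruction per cycle,
    no matter how many of its lanes that instruction actually uses."""
    return len(op_widths)

def scalar_cycles(op_widths, lanes=4):
    """Cycles on `lanes` independent scalar units.
    Assumption: component ops are independent and pack perfectly."""
    return math.ceil(sum(op_widths) / lanes)

def utilization(op_widths, cycles, lanes=4):
    """Fraction of lane-cycles that did useful work."""
    return sum(op_widths) / (cycles * lanes)

# Hypothetical shader: number of components each instruction operates on.
mix = [4, 4, 1, 1, 3, 1, 2, 1]

v = vector_cycles(mix)
s = scalar_cycles(mix)
print("vector:", v, "cycles, utilization", utilization(mix, v))
print("scalar:", s, "cycles, utilization", utilization(mix, s))
```

With this made-up mix the vector unit needs 8 cycles and the scalar arrangement needs 5. A purely 4-wide workload such as `[4, 4, 4]` takes 3 cycles on both, which is the "do vector perfectly" case.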

How soon in the design process did you make the decision to focus the real performance improvement on the shader core, arguably leaving the sampler and ROP hardware lagging compared to the performance available in R580?

Around halfway through the design (some time in mid 2005), we decided to scale back ROPs to keep costs down and balance things to a GDDR3/512b memory subsystem. Scalability of ROPs is pretty simple to pick and choose, to hit a specific price/performance point. We ended up tweaking things a little later, but that was the main timeline. The samplers were designed to be 64b samplers from nearly the beginning, and matching that to BW and keeping the 4:1 ratio on ALU:TEX was the design choice made. In the latest games, where ALU:TEX ratios hit 15~20, this really shines.
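Editor's illustration: the ALU:TEX balance point mentioned above can be sketched as a simple bottleneck calculation. This is a toy throughput model under stated assumptions, not a description of the real hardware. `cycles_per_pixel` and its parameters are hypothetical names, and the model ignores latency hiding, bandwidth, and everything else that matters in practice.

```python
def cycles_per_pixel(alu_ops, tex_ops, alu_per_clock, tex_per_clock):
    """Return (cycles, limiter) for one pixel under a pure throughput model.

    The slower of the two pipelines sets the pace; the other one idles.
    """
    alu_time = alu_ops / alu_per_clock
    tex_time = tex_ops / tex_per_clock
    limiter = "ALU" if alu_time >= tex_time else "TEX"
    return max(alu_time, tex_time), limiter

# Hypothetical hardware with a 4:1 ALU:TEX throughput ratio.
HW = dict(alu_per_clock=4, tex_per_clock=1)

# Shaders described by their ALU:TEX instruction ratio.
for ratio in (2, 4, 16):
    cycles, limiter = cycles_per_pixel(alu_ops=ratio, tex_ops=1, **HW)
    print(f"shader ALU:TEX {ratio}:1 -> {cycles} cycles, {limiter}-bound")
```

In this model a shader at 2:1 is limited by texturing, 4:1 is the balance point, and a shader at 16:1 is limited by the ALUs. That last case is why extra ALU throughput pays off most in ALU-heavy games.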

How did scalability concerns impact the architectural discussions, and how did those tradeoffs manifest in the actual designs across the R6 family? Did DX10 and unified shaders impact those decisions in ways that weren't a factor with earlier families?

Of course. Inherent in every one of our architectures is the question of scalability. It's a worthless architecture if it cannot scale from the integrated space all the way to the high end multi-GPU systems. Having said that, DX10 certainly is a change from DX9. Obvious things such as vertex / pixel shader scalability are now gone. We only have one shader core now. However, we've made it scalable in 2 dimensions now, both in terms of pixels processed at the same time within a SIMD array, as well as the number of parallel SIMDs we can put into the array. That's just an example. We are also scalable on ROPs, textures and in various other internal ways. Also, there's functionality that is modular, such as HiZ and UVD, which can be present or not.

 

NVIDIA's 8800 GTS part is quite a ways "down" performance and units wise from their GTX. Is there anything in particular about R600 that allowed you to avoid that kind of approach with 2900 XT (versus a possible and imagined XTX)?

All of our designs have had redundancy (pipeline and shader) with fuse support since R420. Basically, if a quad pipeline was detected to be bad, any one of them, or more than one, could be turned off by fuse. This has led to many of our mainstream parts in the past. The same thing existed on R5xx parts, with the R580 and R530 also capable of scaling the number of SIMDs per quad pipe, under electronic fuse control. The R6xx has expanded beyond this, with quad pipes, SIMDs and various other aspects controlled by both redundancy and the ability to turn off subsets. The R600, for example, has redundant shader scalar cores that replace broken ones internally, to maintain the same performance but increase yield.

As well, we can disable SIMDs or quad pipes internally, to recover parts that redundancy cannot fix. Overall, this strategy has allowed us to have a high end solution and then also capture back parts for lower end designs (the funny thing is that as yields improve, over time, this solution actually doesn't do as well, as it forces the average sell price of the die down to the lower end part; just a note). At this time, ATI Radeon HD 2900 XT is a full-featured part and no other parts have been announced.
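Editor's illustration: the yield benefit of redundant units described in this answer can be sketched with a binomial model. The numbers here are invented for illustration and say nothing about actual R600 yields. The sketch assumes each unit fails independently with the same probability, and `yield_with_spares` is a hypothetical helper.

```python
from math import comb

def yield_with_spares(units_needed, spares, p_defect):
    """Probability that a block still has `units_needed` working units.

    Assumption: every unit (including spares) is defective independently
    with probability `p_defect`. The block survives if the number of
    defective units does not exceed the number of spares.
    """
    total = units_needed + spares
    return sum(
        comb(total, k) * p_defect**k * (1 - p_defect) ** (total - k)
        for k in range(spares + 1)
    )

# Hypothetical block: 4 units required, 10% defect chance per unit.
no_spare = yield_with_spares(4, 0, 0.10)
one_spare = yield_with_spares(4, 1, 0.10)
print(f"no spare:  {no_spare:.4f}")
print(f"one spare: {one_spare:.4f}")
```

Under these made-up figures, one spare unit lifts the fraction of fully working blocks from about 66% to about 92%. Blocks that still fail are the ones that fusing off SIMDs or quad pipes would recover as lower-end parts.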

Why did you eventually choose to not compete in the ultra-high end space, against 8800 GTX? Does AMD intend to compete for the single-card performance crown in the future?

Right now, our plan is to target the high end performance level with the R600, and the ultra high end will be covered by the CrossFire configurations. Today, the CrossFire R600 beats the high end ultra-super-expensive competition by a significant margin, while being cheaper. For every price point we target, we always try to get the very best performance!