NVIDIA have gone from a minor reliance on clock domains in their last generation to a rather heavy reliance on them in their current generation. Do you see that as an approach that AMD might find useful in the future? If not, why not?

Well, I think we have over 30 clock domains in our chip, so asynchronous and pseudo-synchronous interfaces are well understood by us. But the concept of running significant parts of the chip at higher clock speeds than others is generally a good idea. You do need to balance the benefits against the costs, but it's certainly something we could do if it made sense. In the R600, we decided to run most of the design at a high clock, which gives it some unique properties, such as sustaining polygon rates of 700 Mpoly/sec in some of our tessellation demos. There are benefits and costs that need to be analyzed for every product.
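As a back-of-the-envelope check on that figure (the 742 MHz clock below is the commonly quoted HD 2900 XT core clock, and one-primitive-per-clock setup is an assumption; neither number comes from this interview), the quoted 700 Mpoly/sec sits just under the theoretical peak of a setup engine running at core clock:

```python
core_clock_mhz = 742   # assumption: commonly quoted HD 2900 XT core clock
prims_per_clock = 1    # assumption: one primitive through setup per clock
peak_mpolys = core_clock_mhz * prims_per_clock
sustained = 700        # figure quoted in the interview
print(f"peak ~{peak_mpolys} Mpoly/s, sustained {sustained} Mpoly/s "
      f"({100 * sustained / peak_mpolys:.0f}% of peak)")
```

Sustaining roughly 94% of the per-clock setup peak is what makes running the whole design at the high clock pay off here.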

This almost feels like promoting an urban legend to ask this question, but it seems to keep coming up in the community. So, here we go anyway: Has the Fast14 technology that AMD (in its ATI days) licensed from Intrinsity actually made an appearance in any of AMD's parts (particularly PC GPUs) yet? Is it in R600? If not, might it still show up at some point in AMD's future PC GPUs?

It has not appeared in any public product that I'm aware of right now. Who knows about the future ;-)

As you increase your resolution, your relative texture size goes down, thus decreasing the amount of trilinear and anisotropic filtering that needs to be applied. Your competitor has opted to reduce the ALU:TEX ratio in its low/mid-range derivatives in a much more drastic way than you did. What's your take on that? Does it make their architecture more balanced for that target market, or just less future-proof?

We selected our mid-range and low-end products to hit specific price/performance targets, and I think we did a great job on the RV610- and RV630-based products; they are very popular. Our architecture is flexible in how we scale down ALUs, texture ops, and other elements. We simply picked the ratio of ALU to texture to Z to color ops that gives us the best bang-for-buck. Along with that, we selected similar ratios across all our parts, so that developers working on one would have substantially the same experience, scaled, on all parts. I'm not sure what ratios the competition used, or what restrictions in their architecture could force a sub-optimal solution.
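The "best bang-for-buck" ratio selection described above can be sketched as a small optimization: maximize throughput under a die-area budget, where throughput is limited by whichever unit type is most oversubscribed. All numbers here (relative areas, per-pixel demand ratios, the budget, and the unit-count ranges) are illustrative assumptions, not actual R6xx figures:

```python
# Hypothetical sketch of unit-ratio selection for a derivative part.
# Areas and demand ratios are made-up illustrative values.
AREA = {"alu": 1.0, "tex": 2.5, "rop": 3.0}   # relative die area per unit
DEMAND = {"alu": 4, "tex": 1, "rop": 1}       # ops needed per pixel of work

def throughput(cfg):
    # Pixels/clock limited by the scarcest unit type for this workload mix.
    return min(cfg[k] / DEMAND[k] for k in cfg)

def area(cfg):
    return sum(AREA[k] * cfg[k] for k in cfg)

budget = 60.0
best = None
for alus in range(8, 65, 8):
    for texs in range(2, 17, 2):
        for rops in range(2, 17, 2):
            cfg = {"alu": alus, "tex": texs, "rop": rops}
            if area(cfg) <= budget:
                t = throughput(cfg)
                if best is None or t > best[0]:
                    best = (t, cfg)
print(best[1])  # the winning config lands near the 4:1:1 demand ratio
```

Under these assumptions the search settles on unit counts matching the workload's demand ratio, which is the same reasoning behind keeping similar ALU:texture:Z:color ratios across a product line.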

Higher resolutions also have better coherence (better compression, more cache hits) and thus lower memory bandwidth requirements for their computation. High levels of multisampling should also compress very nicely. Given that, why do both your products and your competitor's have much more bandwidth and sampling capability at the high end than in the mid-range? Is it performance at all costs for enthusiasts, with less focus on performance/price than in other markets?

I think it does boil down to “performance at all costs” rather than best performance per dollar. If we reduced the amount of cache, we would have, for example, worse performance on texture minification – and worse performance at the high end is “Bad” ™. We also have more units in high-end chips, which consume more bandwidth and more cache, so we need to keep those fed. That's another reason for “more of”.
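The resolution/minification point raised in the question can be made concrete with the standard log2 mip-selection rule (the texture size and on-screen extents below are hypothetical, chosen only to show the trend):

```python
import math

# Hypothetical: a 1024x1024 texture mapped onto a surface that spans more
# screen pixels as display resolution rises. Fewer texels per pixel means
# a lower mip LOD, i.e. less minification (and less trilinear/aniso work).
tex_size = 1024
for screen_px in (256, 512, 1024, 2048):   # on-screen width of the surface
    texels_per_pixel = tex_size / screen_px
    lod = max(0.0, math.log2(texels_per_pixel))  # standard mip-LOD selection
    print(f"{screen_px:4d} px wide -> {texels_per_pixel:4.2f} texels/px, "
          f"mip LOD {lod:.1f}")
```

Once the surface spans as many pixels as the texture has texels, LOD reaches 0 and filtering degenerates toward magnification, which is the sense in which higher resolutions ease the filtering load.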

Some R6xx derivatives have different ratios of ROPs-samplers-ALUs. Are there intrinsic reasons why these ratios might make more sense for these specific market segments? These things are presumably also tweaked to hit a target die size?

I answered this a little above. We target the best bang-for-buck for all the derivative parts, and that sometimes means tweaking ratios. We also want to give developers an experience similar to what they have on the high end, which sometimes leads to non-linear reductions in various elements. In the end, I feel that our derivatives offer great performance per mm^2. In fact, the RV610 will offer the best DX10 performance at the lowest price point and the lowest power. Hard to beat that.