When we tested RV530 against RV515 we noted that on many current tests, even pixel-shader-specific ones, a 3x performance gain was not achieved, and that will certainly be the case with the games that these parts will be reviewed and tested with. In your opinion, are there many current apps that will really see the benefit of the 3x ALU increase?

[Eric Demers] If you look at apps such as 3dmark05, FEAR, D3/Q4, and many new upcoming titles, you'll see that the RV530 often doubles (or more) the performance against RV515, at the same clock. Same is true for X1900 vs X1800. Having said that, if you're running older apps, yes, the extra shaders might not benefit you as much, but again, we designed these parts for excellent performance on current games, and even better performance on the newest and upcoming games. It's a forward-looking design. Also, both RV530 & R580 support more advanced depth shadow buffer features (well, in a generic single component texture way), and so, once these start being used by developers, they should give even more performance.
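As an editorial aside, the gap between a 3x ALU increase and the "doubles (or more)" figure quoted above can be sketched with a simple Amdahl-style estimate. The function name, and the assumption that frame time splits cleanly into an ALU-bound share and everything else, are ours for illustration and not a description of how ATI measures performance.

```python
def overall_speedup(alu_bound_fraction, alu_scale=3.0):
    """Amdahl-style estimate: only the ALU-bound share of frame time scales.

    alu_bound_fraction is a hypothetical input, not a measured figure.
    """
    other = 1.0 - alu_bound_fraction
    return 1.0 / (other + alu_bound_fraction / alu_scale)


# Sweep a few illustrative ALU-bound shares
for fraction in (0.25, 0.50, 0.75, 1.00):
    print(f"{fraction:.0%} ALU-bound -> {overall_speedup(fraction):.2f}x")
```

Under this simplified model, a doubling at the same clock would correspond to roughly three quarters of the frame time being ALU-bound, which is one way to read why older, less shader-heavy titles see smaller gains.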

To triple the pixel shader ALUs without tripling the rest of the corresponding raster resources, the shaders still use the same number of dispatch processors (four for both R520 and R580, each handling 128 batches/"threads"); instead, the number of pixels within a batch has been increased, from 16 for R520 to 48 for R580. Increasing the batch size should have ramifications for both overall shader efficiency and dynamic branching. How much of an impact do you see here?

[Eric Demers] Yes, we did have to triple the granularity size. We also did have to triple the GPR resources, since we did not want to reduce the number of threads. Having said that, we have not seen a single case where R580 doesn't significantly beat an R520 in real shaders with flow control. The reality is that 16 pixels or 48 pixels, from a granularity standpoint, are very similar when there are millions of pixels out there being rendered. Both R520 and R580 run circles around any other architecture out there that supports the most advanced shaders with flow control. In fact, R580 runs circles around anything :-) We could construct artificial cases where the granularity increase would reduce performance, but those cases would probably have to have a good amount of ALU, and so would still be much faster due to the 3x ALU count, anyway...
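To make the granularity argument above more concrete, here is a minimal sketch of a toy branching model. It assumes a batch must run every branch path that any of its pixels needs, treats the frame as a flat 1D stream of pixels, uses invented path costs (10 and 1), and assumes purely for illustration that the ALU count equals the batch size. None of these numbers or names come from ATI.

```python
def batch_work(takes_branch, batch_size, cost_if=10, cost_else=1):
    """Pixel-slots of ALU work when each batch runs every path its pixels need."""
    total = 0
    for start in range(0, len(takes_branch), batch_size):
        batch = takes_branch[start:start + batch_size]
        cost = 0
        if any(batch):        # at least one pixel takes the expensive path
            cost += cost_if
        if not all(batch):    # at least one pixel takes the cheap path
            cost += cost_else
        total += cost * batch_size  # the whole batch pays for each path run
    return total


PIXELS = 4800  # hypothetical stream length, divisible by both 16 and 48

# Case 1: one coherent region takes the branch (a single edge at pixel 1000)
coherent = [i < 1000 for i in range(PIXELS)]

# Case 2: artificial worst case, the branch flips every 16 pixels
alternating = [(i // 16) % 2 == 0 for i in range(PIXELS)]

for name, pattern in (("coherent", coherent), ("alternating", alternating)):
    for size in (16, 48):
        work = batch_work(pattern, size)
        # Relative time if ALU count equals batch size (illustrative assumption)
        print(f"{name:11s} batch={size:2d} work={work:6d} time={work / size:8.1f}")
```

In the coherent case the extra work from the larger batch is confined to the single batch that straddles the edge. In the artificial alternating case the 48-pixel batches do about twice the work of the 16-pixel ones, yet under the assumed 3x ALU count the relative time still comes out lower, which matches the shape of the argument in the answer above.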

When we initially tested R520 we noted that the new batch processor (“Ultra Threaded Dispatch Processor”) worked best with shorter shaders that had high texture utilisation, but did very little relative to R420's per-clock performance for longer shaders. Can you give us a little more understanding of the nature of the changes behind the batch processor, and whether some of these changes were required to make R580's and RV530's structure even feasible?

[Eric Demers] The new dispatcher was required to allow for a linearly scalable ALU architecture (say that 5 times fast!). The R3xx/R4xx sequencer was never designed with this in mind, so it had to be redesigned for that, at least. But it's more than that. With triple the ALU demand relative to texture resources, we need to be even more efficient at hiding the texture fetch latency (as well as flow control), and the high thread counts of the X1K architecture easily allow for that. We've found that the R580 efficiency is on par with the R520's, which indicates that our design and dispatcher are capable of pretty amazing efficiencies (and weren't even taxed that hard on R520).
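The latency-hiding point above can be sketched with a common back-of-the-envelope model: if each thread alternates a burst of ALU work with a texture fetch, the ALUs stay busy as long as enough other threads are ready to run in the meantime. The cycle counts below are invented for illustration; only the figure of 128 threads per dispatch processor comes from the question earlier in this interview.

```python
import math


def alu_utilisation(threads, alu_cycles, fetch_latency):
    """Fraction of time the ALUs are busy in a simple round-robin model.

    Each thread does alu_cycles of work, then waits fetch_latency cycles.
    """
    return min(1.0, threads * alu_cycles / (alu_cycles + fetch_latency))


def threads_to_hide(alu_cycles, fetch_latency):
    """Smallest thread count that keeps the ALUs fully busy in this model."""
    return math.ceil((alu_cycles + fetch_latency) / alu_cycles)


# Hypothetical figures: 4 ALU cycles between fetches, 96 cycles of latency
for threads in (1, 8, 25, 128):
    busy = alu_utilisation(threads, alu_cycles=4, fetch_latency=96)
    print(f"{threads:3d} threads -> {busy:.0%} ALU utilisation")

print("threads needed:", threads_to_hide(alu_cycles=4, fetch_latency=96))
```

With these assumed numbers, a pool of 128 threads has a wide margin over the count needed to cover the fetch latency, which is consistent with the remark that the dispatcher was not heavily taxed.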

 

Our thanks to Eric and Richard for taking the time to answer these questions.

