Despite references to 192 processing elements in the ROPs within the eDRAM, we can actually resolve that to 8 pixel writes per cycle, with the capability to double the Z rate when there are no colour operations. However, as the ROPs have been targeted to provide 4x Multi-Sampling FSAA at no penalty, this equates to a total capability of 32 colour samples or 64 Z and stencil operations per cycle.
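The per-cycle figures above are simple multiplication. A minimal sketch of the arithmetic, with constant and function names of our own choosing:

```python
# Per-clock ROP throughput as described in the text.
# All names here are illustrative, not taken from ATI documentation.
PIXELS_PER_CLOCK = 8      # pixel writes per cycle
SAMPLES_PER_PIXEL = 4     # 4x Multi-Sampling FSAA at no penalty
Z_ONLY_MULTIPLIER = 2     # Z rate doubles when there are no colour operations


def rop_throughput(pixels=PIXELS_PER_CLOCK, samples=SAMPLES_PER_PIXEL):
    """Return (colour samples per clock, Z/stencil operations per clock)."""
    colour_samples = pixels * samples
    z_operations = colour_samples * Z_ONLY_MULTIPLIER
    return colour_samples, z_operations


print(rop_throughput())  # (32, 64)
```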
Most PC graphics processors have to balance their output with the available bandwidth, and as such their ROP units usually only cater for 2 Multi-Samples per pixel in a single cycle; the Z output doesn't double with the number of Multi-Samples being produced either. Z and colour compression techniques are also employed in order to get close to the output capabilities with the bandwidth available. ATI's calculations lead to a colour and Z bandwidth demand of around 26-134GB/s at 8 pixels with 4x Multi-Sampling AA enabled at High Definition TV resolutions. The lower end of that bandwidth figure is derived from having 4:1 colour and Z compression. However, the lossless compression techniques are only optimal when there are no triangle edges intersecting a pixel, and with the presumed high geometry detail within next generation console titles the opportunities for achieving this compression ratio across the entire frame will be reduced. So, with 256GB/s of bandwidth available in the eDRAM frame buffer, there should always be sufficient bandwidth for achieving 8 pixels per clock with 4x Multi-Sampling FSAA enabled, which also means that Xenos does not need any lossless compression routines for Z or colour when writing to the eDRAM frame buffer.
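As a rough sanity check on the 256GB/s figure, the sketch below multiplies out the per-sample traffic. Note that the 500MHz clock, the 4 bytes each of colour and Z/stencil per sample, and the assumption that every sample is both read and written each clock are ours, chosen for illustration rather than taken from ATI's own calculation.

```python
def peak_framebuffer_bandwidth(
    pixels_per_clock=8,
    samples_per_pixel=4,
    clock_hz=500e6,     # ASSUMPTION: 500MHz ROP clock
    colour_bytes=4,     # ASSUMPTION: 32-bit colour per sample
    z_bytes=4,          # ASSUMPTION: 32-bit Z and stencil per sample
    accesses=2,         # ASSUMPTION: one read plus one write per sample
):
    """Uncompressed frame buffer traffic in bytes per second."""
    samples_per_second = pixels_per_clock * samples_per_pixel * clock_hz
    return samples_per_second * (colour_bytes + z_bytes) * accesses


peak_gb_per_s = peak_framebuffer_bandwidth() / 1e9
print(peak_gb_per_s)          # 256.0
print(peak_gb_per_s >= 134)   # True: above the top of ATI's quoted 26-134GB/s range
```

Under these assumptions the worst case, with no compression at all, lands exactly on the available eDRAM bandwidth, which is consistent with the claim that compression is unnecessary on the daughter die.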
So, as far as the operation is concerned, once pixel data has come through the shader array and is ready to be processed into colour values in memory, the Z data of the pixel is matched with the correct colour data coming out of the shaders. Xenos supports an "Alpha to Mask" feature, which allows for the use of Multi-Sampling for sort-independent translucency. All of this processing is performed on the parent die and the pixels are then transferred to the daughter die in the form of source colour per pixel and lossless compressed Z per 2x2 pixel quad. The interconnect bandwidth between the parent and daughter die is only an eighth of the eDRAM bandwidth because the source colour value is common to all samples of a pixel at this point, and the Z is compressed. Once on the daughter die the pixels are unpacked to their Multi-Sample level, each sample is driven through its Z and Alpha computations, and the final data is stored in the eDRAM until either the entire frame or the current tile (we'll cover this in more detail later) being rendered is finished.
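To illustrate why only one colour per pixel needs to cross the interconnect, here is a simplified sketch of the unpack step for a single pixel. The function name, the coverage mask argument, the less-than depth test, and the omission of alpha blending are all our assumptions; the actual daughter die logic is not documented here.

```python
SAMPLES = 4  # 4x Multi-Sampling


def write_pixel(stored, colour, sample_z, coverage):
    """Expand one source colour to the covered samples that pass the Z test.

    stored   -- list of SAMPLES (colour, z) tuples already held for this pixel
    colour   -- the single source colour sent across from the parent die
    sample_z -- per-sample Z values after decompression (hypothetical input)
    coverage -- bit mask, bit i set when the triangle covers sample i
    """
    out = list(stored)
    for i in range(SAMPLES):
        covered = (coverage >> i) & 1
        # ASSUMPTION: a simple less-than depth test with no alpha blending.
        if covered and sample_z[i] < stored[i][1]:
            out[i] = (colour, sample_z[i])
    return out


cleared = [((0, 0, 0), 1.0)] * SAMPLES
# A red triangle covering only samples 0 and 1 of the pixel.
first = write_pixel(cleared, (255, 0, 0), [0.5, 0.5, 0.6, 0.6], 0b0011)
# A green triangle behind it covering every sample.
second = write_pixel(first, (0, 255, 0), [0.7, 0.7, 0.7, 0.7], 0b1111)
print(second)
```

The second write leaves samples 0 and 1 red, since the green fragment fails the depth test there, and fills samples 2 and 3 with green.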
When the frame or tile has finished rendering, the colour data will then be resolved on the daughter die, with the Multi-Samples being blended down to their pixel level. The resolved buffer information is then passed back from the daughter die to the parent, which then outputs to system RAM such that, when all the tiles are finished, this can then be output to the display device. Although the resolved colour data has to be stored in system RAM, which uses some bandwidth during the transfer, the efficiency of the write as the resolved data comes out of the daughter die to be written to system RAM is very high. This high efficiency is due to the fact that it is dealing with a significant quantity of non-fragmented data and the bus isn't as busy with lots of other bandwidth consuming, high frequency and inefficient frame buffer read / write / modify operations for the back buffer. This helps in alleviating the fact that the parent die is also handling system memory requests. Also note that data can be written to the eDRAM at the same time as it is being cleared of the previous data that resided there, meaning there should be little to no wait when removing the previous data from the eDRAM. (We've heard comments from developers familiar with both designs that this element of Xenos bears similarities to the "Flipper" design for Nintendo's GameCube, a part that was originally designed by ArtX, who of course were subsequently purchased by ATI. However, ATI are keen to point out that while there may be apparent similarities the designs are entirely independent, as there are distinct virtual and physical barriers between the groups working on the various console developments, past and present, and no members of the Flipper architecture team were involved in Xenos's development.)
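The resolve step described above can be pictured as a per-pixel blend of the stored samples. The sketch below assumes an equal-weight average with integer truncation; the text only says the samples are "blended down", so the exact filter and rounding are our assumptions.

```python
def resolve_pixel(samples):
    """Average a pixel's colour samples, channel by channel (assumed box filter)."""
    count = len(samples)
    return tuple(sum(channel) // count for channel in zip(*samples))


def resolve_tile(tile):
    """Resolve a 2D tile in which each entry is a list of per-sample colours."""
    return [[resolve_pixel(pixel) for pixel in row] for row in tile]


# An edge pixel: two samples covered by a red triangle, two left at black.
edge = [(255, 0, 0), (255, 0, 0), (0, 0, 0), (0, 0, 0)]
# An interior pixel: all four samples agree.
interior = [(255, 0, 0)] * 4

print(resolve_tile([[edge, interior]]))  # [[(127, 0, 0), (255, 0, 0)]]
```

The resolved tile holds one colour per pixel, so it is a quarter of the size of the 4x sample data it was built from, which is part of why the write out to system RAM is comparatively cheap.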
As all the sampling units for frame buffer operations are multiplied to work optimally with 4x FSAA, this is actually the maximum mode available. Although the developer can choose to use 2x or no FSAA, there are no FSAA levels available higher than 4x. The sampling pattern is not programmable but fixed, although it does use a sample pattern that doesn't have any of the sample points intersecting one another on either the vertical or horizontal axis. Although we don't know the exact sample pattern shape, we suspect it will be similar to that seen on other sparse sampled / jittered / rotated grid FSAA mechanisms we've seen over the past few years.
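The property described here, that no two sample points share a row or a column, is easy to test for any candidate pattern. The rotated grid positions below are a hypothetical example of such a pattern, not the actual Xenos layout, which we don't know.

```python
def is_sparse(pattern):
    """True when no two sample points share an x or a y coordinate."""
    xs = [x for x, _ in pattern]
    ys = [y for _, y in pattern]
    return len(set(xs)) == len(xs) and len(set(ys)) == len(ys)


# HYPOTHETICAL rotated grid pattern inside a unit pixel, for illustration only.
rotated_grid = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]
# An ordered grid, where samples line up in two rows and two columns.
ordered_grid = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]

print(is_sparse(rotated_grid))  # True
print(is_sparse(ordered_grid))  # False
```

A sparse pattern gives four distinct positions along each axis, so near-horizontal and near-vertical edges get four gradient steps rather than the two an ordered grid would provide.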
The ROPs can handle several different formats, including a special FP10 mode. FP10 is a floating point precision mode in the format of 10-10-10-2 (bits for Red, Green, Blue, Alpha). The 10-bit colour storage has a 3-bit exponent and 7-bit mantissa, with an available range of -32.0 to 32.0. Whilst this mode does have some limitations, it can offer HDR effects at the same cost in performance and size as standard 32-bit (8-8-8-8) integer formats, which will probably result in this format being used quite frequently in XBOX 360 titles. Other formats such as INT16 and FP16 are also available, but they obviously have space implications. Like the resolution of the MSAA samples, there is a conversion step to change the front buffer format to a displayable 8-8-8-8 format when moving the completed frame buffer portion from the eDRAM memory out to system RAM.
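The space implications are easy to see by adding up the bits per channel. In the sketch below, the four-channel, 16-bits-per-channel layouts for FP16 and INT16 are our assumption, as are the format labels.

```python
# Bits per channel (Red, Green, Blue, Alpha) for each colour format.
# ASSUMPTION: FP16 and INT16 are four-channel formats at 16 bits per channel.
FORMATS = {
    "8-8-8-8 integer": (8, 8, 8, 8),
    "10-10-10-2 FP10": (10, 10, 10, 2),
    "FP16": (16, 16, 16, 16),
    "INT16": (16, 16, 16, 16),
}


def bytes_per_sample(channel_bits):
    """Colour storage needed for one sample, in bytes."""
    return sum(channel_bits) // 8


footprint = {name: bytes_per_sample(bits) for name, bits in FORMATS.items()}
print(footprint)
# {'8-8-8-8 integer': 4, '10-10-10-2 FP10': 4, 'FP16': 8, 'INT16': 8}
```

FP10 fits in the same 4 bytes as the standard integer format, whereas the 16-bit formats double the colour footprint of every sample held in the eDRAM.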
The ROPs are fully orthogonal, so Multi-Sampling can operate with all pixel formats supported.
Render to texture operations will also be rendered out to the eDRAM first and then read out to UMA memory, when complete, in order to be used as a texture surface for the final frame rendering. Render to texture operations can also have Multi-Sample FSAA applied and the result can either be resolved on the way out to system memory or kept at the high resolution Multi-Sample level. As with standard pixel operations, the eDRAM memory can be written to with either another render to texture operation or pixel data whilst the data from the previous render to texture is being pushed out to UMA memory.