Beyond3D - ATI Xenos: Xbox 360 Graphics Demystified

ATI Xenos: Xbox 360 Graphics Demystified - Page 5

Published on 13th Jun 2005, written by Dave Baumann for Consumer Graphics - Last updated: 21st Mar 2007

Z-Only Rendering Pass

Some games these days make use of a graphics chip's ability to quickly reject work based on Z information. Engines such as Doom 3 or Source can, on each frame, run a geometry-only pass whose purpose is to pre-fill the Z buffer with the final Z depths of that frame. When the full frame is then rendered, pixel information that has a higher Z depth than the value in the Z buffer is rejected before any pixel operations are carried out on it, meaning that no pixel writes are wasted on overdraw. This Z-only prepass is expected to be commonly used on Xenos as it has additional advantages for tiling, explained later.
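To make the overdraw saving concrete, here is a minimal sketch in Python of a Z-only prepass on a toy "frame". It is not how any engine or Xenos implements the pass: the fragment list, the function names and the convention that a smaller depth is closer are all assumptions made for illustration.

```python
# Toy model (illustrative only): a fragment is (pixel_index, depth),
# and a smaller depth value is assumed to be closer to the viewer.

def shaded_without_prepass(fragments, num_pixels):
    """Depth-test in submission order; count fragments that get shaded."""
    zbuf = [float("inf")] * num_pixels
    shaded = 0
    for pixel, depth in fragments:
        if depth < zbuf[pixel]:
            zbuf[pixel] = depth
            shaded += 1  # shaded now, possibly overwritten later (overdraw)
    return shaded

def shaded_with_prepass(fragments, num_pixels):
    """Pass 1 fills the Z buffer only; pass 2 shades what survives."""
    zbuf = [float("inf")] * num_pixels
    for pixel, depth in fragments:  # Z-only pass, no shading work
        if depth < zbuf[pixel]:
            zbuf[pixel] = depth
    # Full pass: anything deeper than the final Z value is rejected.
    return sum(1 for pixel, depth in fragments if depth <= zbuf[pixel])

# Worst case for overdraw: geometry submitted back to front.
fragments = [(0, 0.9), (0, 0.5), (0, 0.1), (1, 0.7), (1, 0.3)]
print(shaded_without_prepass(fragments, 2), shaded_with_prepass(fragments, 2))
```

With the prepass, each of the two pixels is shaded exactly once; without it, every back-to-front fragment passes the depth test and is shaded before being overwritten.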

A geometry pass to populate Z information is going to gain from a processor that has double the Z compare / write units in relation to its pure pixel fill-rate, which Xenos does. However, another factor is that this pass requires geometry processing in the vertex shaders. In a traditional shader-capable graphics processor the number of vertex units can often be many times smaller than the number of pixel shader ALUs; in the case of Xenos, however, all of the shader units will be tasked purely with geometry processing, which should also ensure fast operation of this early Z pass.

As with ATI's current desktop parts, Xenos features a Hierarchical Z buffer. Hierarchical Z buffers contain "coarser" Z information than the full resolution Z buffer - usually they are tiled-down versions of the full resolution Z buffer, with the highest Z value of each tile stored for that group of pixels. In Xenos's case the Hierarchical Z Buffer stores down to 16-sample groups, which equates to 2x2 pixel groupings with 4x FSAA enabled. Once a triangle is set up, its pixel coverage areas can be compared against the Hierarchical Z buffer: if all of their Z values are greater than the value stored for the tile then they can all be rejected before any work is carried out; if some are lower then they are compared against the full resolution Z buffer. Because the Hierarchical Z Buffer exists on chip, the checking operation is very fast and can reject numerous pixel groups in a single cycle; Xenos can discard up to 64 pixels per clock cycle based on Hierarchical Z. As the Hierarchical Z buffer is populated during the Z-only pass, it will hold the final Z values for its tile coverage when the full pass is done. This makes for more efficient use of the Hierarchical Z buffer than on normal (PC) graphics processors running software that doesn't have an early Z-only rendering pass built into the engine.
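The coarse test described above can be sketched in a few lines of Python. This is only an illustration of the idea (store the highest Z per tile, reject when the incoming geometry is entirely behind it); the function names, the list-of-rows Z buffer and the 2x2 tile size used in the example are assumptions, not a description of the Xenos hardware.

```python
def build_hiz(zbuf, tile):
    """Reduce a full-resolution Z buffer (list of rows) to one value per
    tile x tile block: the highest (farthest) Z in that block.
    Assumes the buffer dimensions are multiples of `tile`."""
    rows, cols = len(zbuf) // tile, len(zbuf[0]) // tile
    return [[max(zbuf[ty * tile + y][tx * tile + x]
                 for y in range(tile) for x in range(tile))
             for tx in range(cols)]
            for ty in range(rows)]

def coarse_reject(hiz, tx, ty, prim_min_z):
    """True if every pixel the primitive covers in tile (tx, ty) must fail
    the depth test, i.e. its nearest Z is behind the tile's farthest Z."""
    return prim_min_z > hiz[ty][tx]

zbuf = [[0.2, 0.3, 0.9, 0.9],
        [0.2, 0.4, 0.9, 0.8],
        [0.5, 0.5, 0.1, 0.1],
        [0.5, 0.6, 0.1, 0.2]]
hiz = build_hiz(zbuf, 2)

print(coarse_reject(hiz, 0, 0, 0.5))   # entirely behind this tile
print(coarse_reject(hiz, 1, 1, 0.15))  # not rejected: needs the full Z test
```

The second query shows why the coarse test is conservative: a primitive at depth 0.15 is behind three of the four pixels in that tile, but because it is in front of the tile's farthest value it still has to be checked against the full resolution Z buffer.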

Something to note here is that with current PC parts the size of the on-chip ZCULL capabilities usually scales with the number of quads it is processing, and at least has to cater for the range of common PC resolutions, which are larger than those of even high definition TV sets. Being designed directly for the needs of a console Xenos can make some die savings in this area as the on chip Hierarchical Z buffer only needs to cater for a Z buffer size of these high definition TV resolutions.

Tiled Rendering

When FSAA is involved, the pixels always have to be stored at their sample levels until the frame is fully rendered. As the scene is rendered, blends will be occurring on samples and, as the sub-samples for a pixel can contain different colour values, pixels cannot be down-sampled temporarily and then up-sampled again if more blends occur. Basically, the down-sampling (resolve) step can only occur once it is known that all the operations for a given pixel are finished for a frame. The upshot of this is that in most traditional rendering cases the FSAA resolve is only done once the frame is finished and the back-buffer is written to the front-buffer (or even directly in the DACs in some cases).

With the eDRAM being the primary rendering target for Xenos there looks to be a potential issue with rendering FSAA at High Definition TV (HDTV) resolutions: space. With only 10MB of rendering space available, the resolutions and FSAA depths that can be natively supported by the eDRAM could be limited. If we look back to our 512MB Radeon X800 XL review we see that the calculation for the size of frame-buffer requirements with FSAA goes along the following lines:

Back-Buffer = Pixels * FSAA Depth * (Pixel Colour Depth + Z Buffer Depth)
Front-Buffer = Pixels * (Pixel Colour Depth + Z Buffer Depth)
Total = Back-Buffer + Front-Buffer

Now, in the case of Xenos the front-buffer only exists in UMA memory, so only the back-buffer size is of concern for the eDRAM space.

At the moment XBOX 360 is supporting 720p (progressive scan) and 1080i (interlaced) resolutions. 720p equates to 1280x720 pixels and 1080i equates to 1920x1080 pixels; however, interlacing means that only the odd horizontal lines are refreshed on one cycle and the even lines on the next, so the frame buffer only ever needs to handle 1920x540 pixels per refresh.

Here are the frame-buffer sizes for these HDTV resolutions and 640x480 with a colour depth of 32-bit (which will cover both the standard integer 32-bit format and the FP10) and a 32-bit Z/stencil buffer. Naturally, the sizes will increase if a higher Z-Buffer depth or a higher bit colour depth is used:

Framebuffer sizes (MiB)

Resolution	No FSAA	2xFSAA	4xFSAA
640x480	2.3	4.7	9.4
1280x720	7.0	14.1	28.1
1920x540	7.9	15.8	31.6
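As a quick check, the back-buffer formula above can be evaluated directly. The short Python sketch below uses the stated 32-bit colour and 32-bit Z/stencil depths and treats the table's unit as MiB (1024 x 1024 bytes); the function name and defaults are illustrative, and the results agree with the table to within rounding.

```python
MIB = 1024 * 1024

def back_buffer_bytes(width, height, fsaa_samples, colour_bits=32, z_bits=32):
    """Back-Buffer = Pixels * FSAA Depth * (Pixel Colour Depth + Z Buffer Depth)."""
    return width * height * fsaa_samples * (colour_bits + z_bits) // 8

for width, height in [(640, 480), (1280, 720), (1920, 540)]:
    sizes = [back_buffer_bytes(width, height, s) / MIB for s in (1, 2, 4)]
    print(f"{width}x{height}", "  ".join(f"{size:.1f}" for size in sizes))
```

Passing a larger `colour_bits` or `z_bits` shows how quickly the requirement grows with deeper formats.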

As we can see, with these colour and Z depths all of the resolutions will fit into the 10MB of eDRAM without FSAA, and at 640x480 even 4x FSAA stays within the eDRAM memory size. However, at HDTV resolutions nothing fits into the 10MB of eDRAM with any FSAA mode enabled. Xenos was specifically designed to perform very well in these cases by dividing the screen into multiple portions that fit within the eDRAM render buffer space. This is similar to prior tile-based renderers, but with a much larger base tile and with additional functionality to optimise the tiling approach.

Tiling mechanisms can operate in a number of ways. With immediate mode rendering (i.e. the pixels being rendered are for the same frame as the geometry being sent) it is never known what pixels the geometry is going to be mapped to when the commands begin processing. This is not known until all the vertex processing is complete, setup has occurred and each primitive is scan converted. So if you wanted to tile the screen with an immediate mode rendering system, the geometry may need to be processed, set up and then discarded if it is found not to relate to pixels that are to be rendered in the current buffer space. The net result here is that geometry needs to be recalculated multiple times, once for each of the buffers. Another method for tiling would be to use Tile Based Deferred Rendering, which processes the geometry and "bins" it into graphics RAM, saving which render "tile" the geometry affects as it does so - these mechanisms have traditionally operated by deferring the actual rendering by a frame in order to parallelise the geometry processing / binning and the rendering (you may wish to take a refresher on PowerVR's tile based deferred rendering process in our article here).

ATI and Microsoft decided to take advantage of the Z-only rendering pass, which is the expected performance path independent of tiling. They found a way to use this Z-only pass to assist with tiling the screen to optimise the eDRAM utilisation. During the Z-only rendering pass the maximum extents of each object within screen space are calculated and saved in order to avoid having to calculate the geometry multiple times. Each command is tagged with a header recording which screen tile(s) it will affect. After the Z-only rendering pass the Hierarchical Z Buffer is fully populated for the entire screen, which means that render order is no longer an issue. When rendering a particular tile the command fetching processor looks at the header that was applied in the Z-only rendering pass to see whether the command's resultant data will fall into the tile it is currently processing; if so it will queue it, if not it will discard it until the next tile is ready to render. This process is repeated for each tile that requires rendering. Once the first tile has been fully rendered it can be resolved (FSAA down-sample) and that tile of the back-buffer data can be written to system RAM; the next tile can begin rendering whilst the first is still being resolved. In essence this process has similarities with tile based deferred rendering, except that it is not deferring for a frame and that the "tile" it is operating on is orders of magnitude larger than most other tilers have utilised before.
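The tagging scheme can be sketched as follows. This is a simplified Python model, not the actual command format: it assumes the screen is cut into equal horizontal bands, that the saved extents are just a minimum and maximum pixel row per command, and that the header is a bitmask with one bit per tile. All names here are hypothetical.

```python
def tile_mask(y_min, y_max, screen_height, num_tiles):
    """Bitmask of the horizontal tile bands touched by rows y_min..y_max
    (inclusive). Bit t is set if the command affects tile t."""
    band = -(-screen_height // num_tiles)  # rows per tile, rounded up
    first = max(0, y_min) // band
    last = min(screen_height - 1, y_max) // band
    mask = 0
    for t in range(first, last + 1):
        mask |= 1 << t
    return mask

def commands_for_tile(tagged, tile):
    """Replay only the commands whose header says they affect this tile."""
    return [name for name, mask in tagged if mask & (1 << tile)]

# Extents as they might be gathered during the Z-only pass: (name, y_min, y_max).
extents = [("sky", 0, 200), ("character", 200, 500), ("floor", 480, 719)]
tagged = [(name, tile_mask(lo, hi, 720, 3)) for name, lo, hi in extents]

for tile in range(3):
    print(tile, commands_for_tile(tagged, tile))
```

In this example the "character" command spans all three bands and so is processed three times, while "sky" and "floor" are each processed once, which is the trade-off discussed in the next paragraph.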

There is going to be an increase in cost here, as the resultant data of some objects in the command queue may intersect multiple tiles, in which case the geometry will be processed for each tile (note that once it is transformed and set up, the pixels that fall outside of the current rendering tile can be clipped and no further processing is required). However, the very large size of the tiles will, for the most part, reduce the number of commands that span multiple tiles and need to be processed more than once. Bear in mind that going from one FSAA depth to the next one up at the same resolution shouldn't affect Xenos too much in terms of sample processing, as the ROPs and bandwidth are designed to operate with 4x FSAA all the time; there is no extra cost in terms of sub-sample read / write / blends, although there is a small cost in the shaders, where extra colour samples will need to be calculated for pixels that cover geometry edges. So in terms of supporting FSAA, developers really only need to care about whether they wish to utilise this tiling solution or not when deciding what depth of FSAA to use (with consideration to the depth of the buffers they require as well). ATI have been quoted as suggesting that 720p resolutions with 4x FSAA, which would require three tiles, have about 95% of the performance of 2x FSAA.

Taking the previous sampling requirements, the memory quantities resolve to the following number of tiles being required:

Number of tiles required

Resolution	No FSAA	2xFSAA	4xFSAA
640x480	1	1	1
1280x720	1	2	3
1920x540	1	2	4
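The tile counts follow from dividing the back-buffer size by the eDRAM capacity and rounding up. The Python sketch below assumes 10MB means 10 x 1024 x 1024 bytes and 8 bytes per sample (32-bit colour plus 32-bit Z/stencil); it ignores any alignment or tile-shape constraints the real hardware may impose, so treat it as a lower bound rather than a statement of how tiles are actually laid out.

```python
EDRAM_BYTES = 10 * 1024 * 1024  # assumption: 10MB of eDRAM taken as 10 MiB

def tiles_needed(width, height, fsaa_samples, bytes_per_sample=8):
    """Smallest number of eDRAM-sized tiles that can hold the back-buffer."""
    size = width * height * fsaa_samples * bytes_per_sample
    return -(-size // EDRAM_BYTES)  # ceiling division

for width, height in [(640, 480), (1280, 720), (1920, 540)]:
    print(f"{width}x{height}", [tiles_needed(width, height, s) for s in (1, 2, 4)])
```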

Render to texture operations that have space requirements beyond 10MB can also operate in the tiled mode; however, given that Xenos is going into a closed box environment, it's likely that developers for the system will consider what best fits the design of the console when they are developing their titles.