In addition to its other capabilities, Xenos has a special instruction that is presently unique to this graphics processor and may not necessarily be available even in WGF2.0: the MEMEXPORT function. In simple terms, MEMEXPORT is a method by which Xenos can push and pull vectorised data directly to and from system RAM. This becomes very useful with vertex shader programs because, with the capability to scatter and gather to and from system RAM, the graphics processor suddenly becomes a very wide processor for general purpose floating point operations. For instance, a shader operation could be run with its results passed out to memory, and then another shader could be run on that output, with the first shader's results becoming the input to the subsequent shader.
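The chaining of shader passes through memory described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the flat list standing in for system RAM, the `run_pass` helper and the two kernels are all hypothetical names invented for illustration, and none of this reflects the actual Xenos instruction set or API.

```python
# Toy model of two shader passes chained through memory.
# "memory" is a plain list standing in for system RAM; every name here is hypothetical.

def run_pass(kernel, memory, src, dst, count):
    """Fetch `count` elements starting at src, apply kernel, write results starting at dst."""
    # Real hardware would process many elements in parallel; a loop is enough for a sketch.
    for i in range(count):
        value = memory[src + i]          # fetch (gather) from an arbitrary address
        memory[dst + i] = kernel(value)  # export (scatter) the result back to memory

def scale_pass(v):
    """First 'shader': scale each vertex by 2."""
    x, y, z = v
    return (2.0 * x, 2.0 * y, 2.0 * z)

def offset_pass(v):
    """Second 'shader': offset each vertex by 1, consuming the first pass's output."""
    x, y, z = v
    return (x + 1.0, y + 1.0, z + 1.0)

vertices = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
count = len(vertices)
memory = vertices + [None] * (2 * count)  # input buffer followed by two output buffers

run_pass(scale_pass, memory, 0, count, count)           # pass 1 exports to the second buffer
run_pass(offset_pass, memory, count, 2 * count, count)  # pass 2 reads what pass 1 exported
final = memory[2 * count:]
print(final)
```

The point of the sketch is simply that the second pass never touches the original input: it reads only what the first pass exported.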
MEMEXPORT extends the graphics pipeline further forward, in a general purpose and programmable way. For instance, it could be used to tessellate an object as well as skin it: a shader is applied to a vertex buffer, the results are written to memory as another vertex buffer, that buffer is used in a tessellation render, and another vertex shader is then run on the result for skinning. MEMEXPORT could potentially be used to provide input to the tessellation unit itself, by running a shader that transforms the edges to screen space, calculates a tessellation factor for each edge dependent on its extent in screen space, and feeds those results into the tessellation unit, resulting in a dynamic, screen space based tessellation routine. Other uses could include image based operations such as compositing, animating particles, or even operations that alternate between the CPU and the graphics processor.
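As a rough illustration of the screen space tessellation idea, the sketch below projects the two endpoints of an edge, measures the edge in pixels and turns that length into a clamped factor. Everything specific here is an assumption made for the example: the simple perspective projection, the 1280x720 target, the 8 pixels per segment goal and the upper clamp of 16 are hypothetical values, not figures from ATI or Microsoft.

```python
import math

def project(point, focal_length, width, height):
    """Project a view-space point (z > 0 assumed) to pixel coordinates."""
    x, y, z = point
    sx = (x * focal_length / z) * (width / 2) + width / 2
    sy = (y * focal_length / z) * (height / 2) + height / 2
    return sx, sy

def edge_tess_factor(a, b, focal_length=1.0, width=1280, height=720,
                     pixels_per_segment=8.0, max_factor=16.0):
    """Hypothetical per-edge factor: screen-space edge length over a target segment size."""
    ax, ay = project(a, focal_length, width, height)
    bx, by = project(b, focal_length, width, height)
    length_px = math.hypot(bx - ax, by - ay)
    # Clamp so that tiny edges still get one segment and huge edges stay bounded.
    return max(1.0, min(max_factor, length_px / pixels_per_segment))

# The same edge close to the camera and far from it.
near = edge_tess_factor((-1.0, 0.0, 2.0), (1.0, 0.0, 2.0))
far = edge_tess_factor((-1.0, 0.0, 100.0), (1.0, 0.0, 100.0))
print(near, far)
```

In a real implementation the factors would be written to a buffer that the tessellation unit reads; here they are just returned, which is enough to show that the same edge receives fewer subdivisions as it moves away from the viewer.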
With the capability to fetch from anywhere in memory, perform arbitrary ALU operations and write the results back to memory, in conjunction with the raw floating point performance of the large shader ALU array, the MEMEXPORT facility can achieve a wide range of fairly complex and general purpose operations. Basically, any operation that can be mapped to a wide SIMD array can be achieved fairly efficiently, and in comparison to previous graphics pipelines it is achieved in fewer cycles and with lower latencies. For instance, this is probably the first time that general purpose physics calculation would be achievable, with a reasonable degree of success, on a graphics processor, and it is a big step towards the graphics processor becoming much more like a vector co-processor to the CPU.
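To make the "maps to a wide SIMD array" point a little more concrete, here is a minimal sketch of the kind of physics step that fits this model: the same small kernel is applied independently to every particle in a buffer, and the results are written out as a new buffer. The integrator (semi-implicit Euler), the data layout and the function names are assumptions chosen for illustration and are not taken from any Xenos documentation.

```python
def step_particle(particle, dt, gravity=-9.81):
    """Advance one particle by dt using semi-implicit Euler (an assumed integrator)."""
    (px, py, pz), (vx, vy, vz) = particle
    vy = vy + gravity * dt  # update velocity first
    # Then move the particle using the updated velocity.
    return ((px + vx * dt, py + vy * dt, pz + vz * dt), (vx, vy, vz))

def step_all(buffer, dt, gravity=-9.81):
    """Apply the same kernel to every particle; each element is independent of the others."""
    return [step_particle(p, dt, gravity) for p in buffer]

# One particle at height 10, moving along x, with a rounded gravity value for easy checking.
particles = [((0.0, 10.0, 0.0), (1.0, 0.0, 0.0))]
after = step_all(particles, dt=0.5, gravity=-10.0)
print(after)
```

Because no particle depends on any other within a step, the loop in `step_all` could in principle be spread across as many ALUs as are available, which is what makes this sort of workload a natural fit.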
Seeing as MEMEXPORT operates over the unified shader array, the capability is also available to pixel shader programs; however, the data would be represented without colour or Z information, which is likely to limit its usefulness.
ATI indicate that MEMEXPORT functions can still operate in parallel with both vertex fetch and filtered texture operations.
Generally speaking, we are used to the graphics processor being responsible for the display output capabilities; however, that is not the case with the XBOX 360. Xenos itself outputs the frame-buffer digitally to another display device of Microsoft's choosing.
ATI state that Xenos is a fairly low power consumption design for several reasons. For starters, the ALUs are designed to operate in a way that reduces latencies which, if fully successful, should increase the efficiency of the chip's operation. Likewise, a unified pipeline increases efficiency by removing cases where the vertex shader is idle, waiting for the pixel shader to have available slots, or the pixel shader is idle, waiting for the vertex shader to produce data. If such efficiencies are fully realised relative to current graphics processing methodologies, the result can be either a smaller (hence cheaper) chip with the same performance as larger chips with a traditional architecture, or a chip of the same size with more ALUs dedicated to processing, hence higher performance. Of course, this does depend on exactly how "inefficient" current graphics processors really are, and on whether future processors that use distinct vertex and pixel shaders find alternative methods of increasing efficiency. Although it could be implemented in future designs, current graphics processors may not be able to clock gate between the vertex and pixel shader units, which results in power being burnt if one end of the pipeline is waiting for the other; this inherently isn't an issue with a unified platform.
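The efficiency argument can be illustrated with a deliberately idealised back-of-the-envelope model. It assumes that vertex and pixel work can each be expressed as a number of ALU-cycles, that a split design runs both halves concurrently and finishes when the slower half does, and that a unified pool balances the load perfectly. The 16/32 split and the workload figures are hypothetical numbers chosen for the example, and the model ignores scheduling overheads and pipeline dependencies entirely.

```python
def split_time(vertex_work, pixel_work, vertex_alus, pixel_alus):
    """Cycles for a fixed split: the frame finishes when the slower half finishes."""
    return max(vertex_work / vertex_alus, pixel_work / pixel_alus)

def unified_time(vertex_work, pixel_work, total_alus):
    """Cycles for a unified pool, assuming perfect load balancing (an idealisation)."""
    return (vertex_work + pixel_work) / total_alus

def utilisation(vertex_work, pixel_work, total_alus, cycles):
    """Fraction of available ALU-cycles that did useful work."""
    return (vertex_work + pixel_work) / (total_alus * cycles)

# Hypothetical pixel-heavy workload on 48 ALUs, split 16 vertex / 32 pixel.
vertex_work, pixel_work = 100.0, 1100.0
t_split = split_time(vertex_work, pixel_work, vertex_alus=16, pixel_alus=32)
t_unified = unified_time(vertex_work, pixel_work, total_alus=48)
u_split = utilisation(vertex_work, pixel_work, 48, t_split)
u_unified = utilisation(vertex_work, pixel_work, 48, t_unified)
print(t_split, t_unified, u_split, u_unified)
```

With these made-up numbers the vertex units in the split design sit idle for most of the frame, while the unified pool stays busy throughout; how large the gap is in practice depends entirely on how unbalanced real workloads are, which is exactly the open question raised above.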
ATI also believe that Xenos specifically has the most advanced power management features of any chip they have produced so far. There is a top level power management system, controllable by the OS, that allows various elements of the pipeline to be turned off for certain operations, such as DVD playback. There are low power modes that regulate the speeds and voltages and, when inactive, the data is held in stasis rather than just switching transistors on and off to keep the data. Within the graphics core itself, there are also localized power management techniques applied at the block level to minimize power consumption during idle or low usage periods.
When we factor in the savings in both power and die size, we can see that this potentially has some advantages over traditional architectures. In the case of the XBOX 360, not only does this result in a relatively small die for a fairly high level of performance, but it also means that the graphics processor need only be air cooled, without the use of its own additional fan. Beyond the immediate application, we can see that unified designs bound for the PC could have smaller die sizes for performance equivalent to current discrete solutions, or could dedicate more silicon either to additional ALUs for higher performance or to other functionality.