Memory Interface

An important aspect of any architecture is its memory bus, which can have a great impact not only on overall performance but also, depending on how it is arranged, on what the architecture can achieve.

On straight performance / bandwidth terms, the P10 architecture features an asynchronous, true 256-bit DDR memory interface which, at current DDR speeds, can allow for in excess of 20GB/s of onboard memory bandwidth. It has long been suggested that 256-bit memory interfaces are not particularly feasible on consumer desktop chips, because of the increased board cost of routing the extra trace lines a 256-bit bus requires; however, with the introduction of small package BGA memory chips, the higher connector density has largely negated this issue and driven the costs down. 3Dlabs estimate that the board cost of a 256-bit bus using BGA memory packaging is comparable to that of current 128-bit boards using conventional memory (at similar speeds), and, as such, this will filter down to Creative's P10 based consumer products later in the year, as well as the Workstation class products 3Dlabs will produce themselves.
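As a rough check on the bandwidth figure, the arithmetic can be sketched as below. The 325MHz memory clock is purely an assumption chosen for illustration; the article only says "current DDR speeds", and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, clock_mhz: float,
                        transfers_per_clock: int = 2) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    DDR memory moves data on both clock edges, hence two transfers per clock.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# Assumed clock for illustration only; not a figure quoted by 3Dlabs.
ASSUMED_CLOCK_MHZ = 325.0

wide = peak_bandwidth_gb_s(256, ASSUMED_CLOCK_MHZ)
narrow = peak_bandwidth_gb_s(128, ASSUMED_CLOCK_MHZ)
print(f"256-bit DDR: {wide:.1f} GB/s, 128-bit DDR: {narrow:.1f} GB/s")
```

At the same clock, doubling the bus width doubles the theoretical peak; real sustained bandwidth will be lower and depends on how well the memory controller keeps the bus busy.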

P10 does not, however, feature a 'Crossbar' system akin to the memory controller in nVIDIA's GeForce3/4 architectures, though it takes a similar (but not identical) approach, which 3Dlabs class as 'patch' or 'cache' oriented. The P10 chip features a significant level of multilevel caching on-chip, sized according to the 8x8 pixel tile (patch) in which the system operates, so it will pack the caches until there is enough data to maximize the potential of the 256-bit bus; this is fairly CPU-like in operation. 3Dlabs are trying to avoid the PR acronym trap, feeling that the actual implementation of a memory controller is really an architectural hardware detail and not something that should concern the end user. The Crossbar system may well be optimal for the organization of GeForce3/4's architecture, but that doesn't necessarily hold true for P10's architectural organization. All the end user should be concerned with, according to 3Dlabs, is the overall performance of the chip. 3Dlabs say they have put a lot of care into the caching mechanism of P10 and feel it is a very efficient system.

‘Virtual Memory’

Beyond the straight hardware memory bus implementation, 3Dlabs feel the 'Virtual Memory' system employed in P10 is of much more significance and potential impact to the 3D market and is, in fact, something that id Software's John Carmack has been requesting in hardware for a long time. The concept of Virtual Memory is again very similar to the approach taken in CPUs: it breaks down the barriers between the various memory subsystems in a PC, such as local frame buffer, main system RAM, or even hard disk space, and allows the 3D processor to access them all freely.




In the Virtual Memory system of P10 there is a logical address space of up to 16GB, which is broken down into 4KB pages. The onboard RAM essentially becomes a large L2 cache to the chip, a system which is easy for compilers to understand.
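To make those numbers concrete, here is a minimal sketch of how a logical address in such a space would split into a page number and an offset within the page. It assumes binary units (16GB = 2^34 bytes, 4KB = 2^12 bytes); the names are hypothetical and say nothing about how P10 lays out its page tables.

```python
PAGE_SIZE = 4 * 1024             # 4KB pages, assuming binary kilobytes
ADDRESS_SPACE = 16 * 1024 ** 3   # 16GB logical space, assuming binary gigabytes

OFFSET_BITS = PAGE_SIZE.bit_length() - 1       # bits addressing a byte in a page
ADDRESS_BITS = ADDRESS_SPACE.bit_length() - 1  # bits in a full logical address
PAGE_COUNT = ADDRESS_SPACE // PAGE_SIZE        # pages in the logical space


def split_address(addr: int) -> tuple[int, int]:
    """Split a logical address into (page number, offset within the page)."""
    if not 0 <= addr < ADDRESS_SPACE:
        raise ValueError("address outside the logical address space")
    return addr >> OFFSET_BITS, addr & (PAGE_SIZE - 1)


page, offset = split_address(0x12345)
print(f"{ADDRESS_BITS}-bit addresses, {OFFSET_BITS}-bit offsets, {PAGE_COUNT} pages")
print(f"address 0x12345 -> page {page}, offset {offset}")
```

Under these assumptions the logical space holds roughly four million pages, so any page table or lookup structure has to be considerably larger than the set of pages resident in onboard RAM at any one time.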

Hardware can do much finer grain texture management than software can hope to do; only the hardware knows specifically which texture pages are being accessed, so with virtual memory the graphics memory on the card ceases to be a finite resource that has to be managed by the software and is doomed to run out. This can also bring performance improvements; for example, in a flight simulator where an aircraft appears over the horizon, P10, unlike other architectures, does not have to load all of the aircraft's textures up front; the more detailed mip-map levels only need to be fetched if the plane comes closer. If it never comes closer then you don't waste time transferring those textures around the system.

At the very least, an advantage of a Virtual Memory system is that memory fragmentation does not become an issue. Under the Virtual Memory system everything is ordered and accessed by pages; hence, the various elements do not need to be organised contiguously in physical memory, and you shouldn't get lots of memory fragmentation.
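The on-demand behaviour described above can be sketched as a small page cache: a page is only brought into onboard RAM when it is first touched, and something else is evicted when the cache is full. The least-recently-used policy and the `PageCache` class are assumptions made for illustration; the article does not say which replacement policy P10 uses.

```python
from collections import OrderedDict


class PageCache:
    """Toy model of onboard RAM acting as a cache of 4KB pages (LRU assumed)."""

    def __init__(self, capacity_pages: int) -> None:
        self.capacity = capacity_pages
        self.resident = OrderedDict()  # page number -> True, oldest first
        self.faults = 0                # pages fetched from slower memory

    def touch(self, page: int) -> bool:
        """Access a page; return True on a hit, False if it had to be fetched."""
        if page in self.resident:
            self.resident.move_to_end(page)    # mark as most recently used
            return True
        self.faults += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used
        self.resident[page] = True
        return False


cache = PageCache(capacity_pages=4)
for page in [100, 101, 100, 102, 103, 104]:
    cache.touch(page)
print(f"faults: {cache.faults}, resident: {list(cache.resident)}")
```

Pages that are never touched, such as the detailed mip levels of a plane that stays on the horizon, never generate a fault and so are never transferred at all.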

The original 3Dlabs Oxygen boards featured Virtual Memory for texture addressing; however, P10 breaks down all the barriers, allowing anything that requires buffering (be it frame buffer, vertex buffers, display lists, textures, etc.) access to Virtual Memory. Naturally, there is a priority system to stop, say, a large texture pushing some of the frame buffer onto the system hard disk, as you would always want the frame buffer in the fastest RAM possible to maintain performance in real time applications. However, there could be non-real time applications for which the system would be very useful: if someone wished to create an unfeasibly large frame buffer that required more than the RAM available on the board, it (or more specifically, the parts that overflow) could be paged into system RAM to allow the creation of the buffer, at the obvious cost of performance. Likewise, this also facilitates the use of massive textures even if they don't entirely fit into main system memory.
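One way such a priority system could work is to pick eviction victims by buffer type first and by recency second. The sketch below is only an illustration of that idea: the priority ordering, the `Page` record and `choose_victim` are all hypothetical, and the article gives no detail on how P10 actually ranks its buffers beyond keeping the frame buffer in fast memory.

```python
from dataclasses import dataclass

# Hypothetical ranking: a higher number means "keep in fast memory longer".
PRIORITY = {"texture": 0, "display_list": 1, "vertex_buffer": 2, "frame_buffer": 3}


@dataclass
class Page:
    kind: str        # which kind of buffer the page belongs to
    number: int      # page number within the logical address space
    last_used: int   # timestamp of the most recent access


def choose_victim(resident: list[Page]) -> Page:
    """Pick the page to evict: lowest-priority kind first, then least recently used."""
    return min(resident, key=lambda p: (PRIORITY[p.kind], p.last_used))


resident = [
    Page("frame_buffer", 0, last_used=1),  # oldest access, but highest priority
    Page("texture", 7, last_used=5),
    Page("texture", 8, last_used=9),
]
victim = choose_victim(resident)
print(f"evict {victim.kind} page {victim.number}")
```

With this ordering a frame buffer page is only paged out once no lower-priority page is left to evict, which matches the intent described above even though the real mechanism may differ.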

The functionality of Virtual Memory should be API independent, although DirectX does have a tendency to like to know where its buffers are defined and located. However, Microsoft is interested in the Virtual Memory system and is integrating it into DirectX9 in order to resolve upcoming issues, not least that of a 3D based user interface (Longhorn).

Other Chip Information

Each stage of the P10 pipeline is separated from the next by a FIFO buffer. There should be enough buffering between all stages to ensure optimal throughput and to make sure that one stage does not stall because the next stage is temporarily busy.
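The effect of FIFO depth can be illustrated with a toy two-stage model: a producer that emits one work item per cycle and a consumer whose items take a varying number of cycles. The model, the `simulate` function and the cost values are all assumptions for illustration and are not based on P10's actual stage timings or buffer sizes.

```python
from collections import deque


def simulate(costs: list[int], fifo_depth: int) -> int:
    """Return the number of cycles the producer stalls on a full FIFO.

    costs[i] is how many cycles the consumer stage needs for item i.
    """
    fifo = deque()
    pending = deque(costs)  # items the producer has not yet emitted
    busy = 0                # cycles left on the consumer's current item
    stalls = 0
    while pending or fifo or busy:
        # Consumer: start the next item if idle, then do one cycle of work.
        if busy == 0 and fifo:
            busy = fifo.popleft()
        if busy:
            busy -= 1
        # Producer: emit one item per cycle if the FIFO has room.
        if pending:
            if len(fifo) < fifo_depth:
                fifo.append(pending.popleft())
            else:
                stalls += 1
    return stalls


costs = [1, 1, 4, 1, 1, 1]  # one slow item in an otherwise fast stream
shallow = simulate(costs, fifo_depth=1)
deep = simulate(costs, fifo_depth=4)
print(f"stalls with depth 1: {shallow}, with depth 4: {deep}")
```

In this model the deeper FIFO absorbs the single slow item, so the upstream stage keeps working instead of waiting; a buffer only helps with bursts, though, and cannot hide a stage that is slower on average.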

P10 also features a 10-bit RAMDAC. A 10-bit DAC allows for greater output quality because of its smaller quantisation error: with an 8-bit DAC it is possible to see banding, whereas with a 10-bit DAC this is largely removed because of the extra precision available.
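The difference in precision is easy to quantify. The sketch below counts how many distinct output codes each DAC width has available across a dark gradient covering the lowest quarter of the intensity range; the ramp, its 2000 samples and the helper name are arbitrary choices for illustration.

```python
def quantise(value: float, bits: int) -> int:
    """Map an intensity in [0.0, 1.0] to the nearest code of an n-bit DAC."""
    max_code = (1 << bits) - 1
    return round(value * max_code)


# A smooth ramp over the darkest quarter of the range, finely sampled.
ramp = [0.25 * i / 1999 for i in range(2000)]

codes_8 = len({quantise(v, 8) for v in ramp})
codes_10 = len({quantise(v, 10) for v in ramp})
print(f"8-bit: {codes_8} distinct steps, 10-bit: {codes_10} distinct steps")
```

Roughly four times as many steps are available over the same gradient, so each visible band is about a quarter of the width, which is why banding becomes much harder to see.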

P10 does not feature any Z compression routines or 'fast Z-clear', though there are some Z-clear optimisations; however, these are not in the same sense as those employed by Radeon or GeForce3/4 chips. With a 256-bit memory bus on the P10 chip, 3Dlabs may not have deemed this entirely necessary. P10 does have early Z-Check routines to depth test a pixel prior to the texture & shader stages and retire it early if it is not visible. No other overdraw removal routines, such as Hierarchical Z-buffering, are employed, yet that's not to say it wouldn't be possible to program the chip to do it.
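The early Z-check can be sketched in software as a depth test performed before any texture or shader work is done for a fragment. This is a simplified model with hypothetical names: fragments are plain tuples, and the "shading" is a stand-in counter rather than anything resembling P10's pipeline.

```python
def render(fragments, width, height, early_z=True):
    """Draw (x, y, depth, colour) fragments; return (colour buffer, shaded count).

    Smaller depth values are closer to the viewer.
    """
    depth = [[1.0] * width for _ in range(height)]
    colour = [[None] * width for _ in range(height)]
    shaded = 0
    for x, y, z, material in fragments:
        if early_z and z >= depth[y][x]:
            continue              # hidden: retired before texture/shader stages
        result = material         # stand-in for expensive texture/shader work
        shaded += 1
        if z < depth[y][x]:       # conventional (late) depth test and write
            depth[y][x] = z
            colour[y][x] = result
    return colour, shaded


# Two fragments on the same pixel, submitted front to back.
frags = [(0, 0, 0.2, "near"), (0, 0, 0.8, "far")]
image_early, shaded_early = render(frags, 1, 1, early_z=True)
image_late, shaded_late = render(frags, 1, 1, early_z=False)
print(f"shaded with early Z: {shaded_early}, without: {shaded_late}")
```

Both paths produce the same image, but the early test skips the shading work for the hidden fragment. The saving depends on draw order: if geometry arrives back to front, every fragment passes the early test and nothing is skipped.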

Presently only AGP4X rates and the PCI bus are supported by P10.