Rasterisation Pipeline

As shown in the earlier pipeline diagram, the P10 VPU architecture also has programmable texture and pixel pipelines.



Texture Processor

The texture processor consists of 128 32-bit SIMD processors: 64 floating point processors allocated to the texture coordinate generation element of the pipeline, and a further 64 integer processors allocated to the shader stage.



As with the Vertex processing array, the key here is the programmability of the pipeline, although a few things are 'hardwired' for the purposes of speed. Listed below are a few specifications of the texture pipeline's abilities:

  • Bilinear & Trilinear filtering ‘hardwired’
  • Eight simultaneous textures per pass, each map independently 1D, 2D, 3D or cube, with independent size, wrapping and filtering
  • Up to 8Kx8K or 1Kx1Kx1K texels per texture with arbitrary sizing i.e. no power of 2 restrictions (unless doing mip-mapping)
  • Programmable texture formats; compressed DXT1-5 and YUV422 are supported natively, others can be programmed

Again, the ability to loop allows greater flexibility in what can be programmed. There is enough caching capacity and enough data-paths to support 8 simultaneous textures in a single pass; however, should there be a need to combine more, it would be possible to write a shader routine to combine as many textures as required. The limit of 8 is presently an API limitation rather than a hardware one. For example, OpenGL 2.0 will allow for as many simultaneous textures as required, and P10 will facilitate this by looping through the texture registers. Once a higher level language such as OpenGL 2.0 is in place, it will be left to the compiler to decide how many loops through the shader program are required.
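As a rough illustration of the looping idea described above, the following Python sketch counts how many loops are needed for a given number of textures and combines samples in batches of eight. It is not P10 shader code: the function names are hypothetical, and combining by simple summation is an assumption chosen only to keep the example short.

```python
import math

TEXTURES_PER_PASS = 8  # simultaneous textures per pass, as quoted in the article


def loops_required(texture_count, per_pass=TEXTURES_PER_PASS):
    """Number of loops through the shader needed to visit every texture."""
    if texture_count < 0:
        raise ValueError("texture_count must be non-negative")
    return math.ceil(texture_count / per_pass)


def combine_textures(samples, per_pass=TEXTURES_PER_PASS):
    """Combine one sample per texture, per_pass textures at a time.

    Summation stands in for whatever blend the shader would really apply
    (an assumption made purely for illustration).
    """
    total = 0
    for start in range(0, len(samples), per_pass):
        batch = samples[start:start + per_pass]  # textures bound for this loop
        total += sum(batch)
    return total
```

With 20 textures, for example, `loops_required(20)` gives 3: two full loops of eight textures and a final loop of four.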

Although bilinear and trilinear filtering are the only texture filtering algorithms 'hardwired' into the chip, other filtering schemes, such as anisotropic or bi-cubic filtering, can be programmed. In the case of anisotropic filtering, a shader routine can be written with the degree of anisotropy as a parameter, and the shader will simply loop through the sampling stages for the number of samples that degree of anisotropy requires. A secondary advantage is that non-standard anisotropic sampling patterns can be used: most hardware uses a rectangular pattern of sampling positions, but a keystone pattern will also be possible, since 3Dlabs are currently working on a shader routine to do this, which will be exposed through the drivers. Coming from the workstation environment, 3Dlabs are occasionally asked to provide other forms of non-standard texture sampling routines. Previously these needed to be hardwired, so it was not economical to provide them; with P10's programmability, almost any filtering algorithm can be written as a shader, so users will be able to have them without 3Dlabs building specialised, expensive hardware. The programmability of the texturing pipeline also allows for other advanced functionality such as wavelets, ray casting into volumetric textures and advanced video operations.
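The anisotropic case described above, a routine that takes the degree of anisotropy as a parameter and loops over that many samples, can be sketched in Python as follows. This is only a software model under stated assumptions: the texture is a plain 2D list of floats, `bilinear` stands in for the hardwired filter, the taps are spread evenly along a caller-supplied axis, and none of the names correspond to a real 3Dlabs interface.

```python
def bilinear(tex, x, y):
    """Bilinear fetch at texel-space position (x, y), clamped to the edges."""
    h, w = len(tex), len(tex[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def anisotropic(tex, x, y, axis, degree):
    """Average `degree` bilinear taps spread along `axis`, centred on (x, y).

    `axis` is the (dx, dy) extent of the footprint in texels; how real
    hardware derives it is outside the scope of this sketch.
    """
    if degree < 1:
        raise ValueError("degree must be at least 1")
    ax, ay = axis
    total = 0.0
    for i in range(degree):  # one loop iteration per sample
        t = (i + 0.5) / degree - 0.5  # evenly spaced offsets in (-0.5, 0.5)
        total += bilinear(tex, x + ax * t, y + ay * t)
    return total / degree
```

A different sampling pattern, such as the keystone shape mentioned above, would only change how the per-tap offsets are computed inside the loop.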

One omission from this part of the pipeline is that the texture shader stage is currently an integer based process, while DirectX9 and onwards will be moving to a floating point texture stage. Floating point texture shaders will allow for a greater dynamic range than is currently available with integer shaders, which enables more advanced algorithms to be produced without having to worry about whether the dynamic range is going to be exceeded. As the power of 3D chips is used more for arbitrary processing rather than just texturing, it will be useful to have floating point texture shader stages; however, 3Dlabs feel that this presently isn't much of an issue. The complexity required for floating point stages is prohibitive on a .15µ silicon process, as the chip would become too complex (P10 is already one of the more complicated .15µ chips out there), so this will likely come with .13µ processes, which 3Dlabs feel will not be economical until 2003 (although it seems nVIDIA will be bringing a .13µ-based NV30 later in the year).

Pixel Processor

The Pixel processor also consists of an array of 64 SIMD Scalar processors which allow for programmability in the Pixel pipeline.




The processor array, being 32-bit, means the native pixel format is also 32-bit; however, the R10:G10:B10:A2 format, which is a requirement for DX9, is also natively supported and the shipping drivers will use this format. Moreover, because the pipeline is programmable, any arbitrary format can be supported by packing bits on top of each other. For instance, should 64-bit colour processing become a requirement (as John Carmack has called for a number of times), each of the components can be broken into chunks and processed over two loops, at the cost of performance. 3Dlabs will not have 64-bit colour processing enabled in their initial drivers, though they expect some developers to request this format at some point.
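To make the R10:G10:B10:A2 format concrete, here is a minimal Python sketch that packs three 10-bit colour components and a 2-bit alpha into one 32-bit word and unpacks them again. The bit positions used (alpha in the top two bits, then red, green and blue) are an assumption for illustration; the article does not say how P10 orders the fields, and the function names are hypothetical.

```python
def pack_r10g10b10a2(r, g, b, a):
    """Pack 10-bit r, g, b and 2-bit a into a single 32-bit integer."""
    for name, value, limit in (("r", r, 1023), ("g", g, 1023),
                               ("b", b, 1023), ("a", a, 3)):
        if not 0 <= value <= limit:
            raise ValueError(f"{name}={value} outside 0..{limit}")
    # Assumed layout: aa rrrrrrrrrr gggggggggg bbbbbbbbbb (most to least significant)
    return (a << 30) | (r << 20) | (g << 10) | b


def unpack_r10g10b10a2(word):
    """Inverse of pack_r10g10b10a2: return the (r, g, b, a) tuple."""
    return ((word >> 20) & 0x3FF,
            (word >> 10) & 0x3FF,
            word & 0x3FF,
            (word >> 30) & 0x3)
```

Compared with 8 bits per channel, each colour component gains two bits of precision while the whole pixel still fits in the processors' native 32-bit word, at the cost of reducing alpha to four levels.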

The pixel pipeline also supports many different Anti-Aliasing formats. It supports 1-16 sampled, OpenGL edge AA, a 64-bit hardware accumulation buffer (which can support as many samples as required, at the cost of performance), SuperSampling to an unlimited number of samples (programmable, which, again, costs performance), and MultiSampling. MultiSampling takes a leaf from Wildcat's SuperScene AA, and though it doesn't feature Wildcat's cunning storage mechanism, it does feature a sparse matrix sampling pattern whereby a maximum of 8 samples can be taken anywhere in a grid of 16x16 sample positions. P10's MultiSampling scheme offers several levels of sampling pattern support: for instance, the developer could program their own sampling positions; sample pattern maps could be used, with the best position from the map chosen based on various parameters; or the chip could simply be left to do stochastic sampling.
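The sparse sampling constraint described above (at most 8 samples placed anywhere in a 16x16 grid) can be modelled with a short Python sketch. It checks a developer-supplied pattern against those limits and generates a random pattern as a stand-in for stochastic sampling. The function names are hypothetical, and the uniform random choice is an assumption; the article does not describe how the chip actually selects stochastic positions.

```python
import random

GRID = 16        # sample positions per axis, from the 16x16 grid in the article
MAX_SAMPLES = 8  # maximum samples per pixel, as quoted in the article


def validate_pattern(positions):
    """Raise ValueError unless positions is a legal sparse sample pattern."""
    if not 1 <= len(positions) <= MAX_SAMPLES:
        raise ValueError(f"need 1..{MAX_SAMPLES} samples, got {len(positions)}")
    if len(set(positions)) != len(positions):
        raise ValueError("duplicate sample positions")
    for x, y in positions:
        if not (0 <= x < GRID and 0 <= y < GRID):
            raise ValueError(f"position {(x, y)} is outside the {GRID}x{GRID} grid")


def stochastic_pattern(count, rng):
    """Pick `count` distinct grid cells uniformly at random (an assumption)."""
    if not 1 <= count <= MAX_SAMPLES:
        raise ValueError(f"count must be in 1..{MAX_SAMPLES}")
    cells = rng.sample(range(GRID * GRID), count)  # distinct cell indices
    return [(cell % GRID, cell // GRID) for cell in cells]
```

Passing a seeded generator such as `random.Random(1)` makes a pattern reproducible, which is convenient when comparing image quality between patterns.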

Again, the programmability of the pixel processor enables other things to be done, such as Photoshop filters or different types of dithering.

The P10 chip is said to have '4 pixel pipes', though this may not physically be the case; rather, overall, only 4 pixels can be produced per clock in theoretical terms. The 64-processor arrays in the pixel and texture pipelines are arranged in an 8x8 block, which is the basic unit of processing and memory transfer. 3Dlabs refer to this block as a 'tile' or 'patch'.
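To give a feel for what an 8x8 tile means in practice, the Python sketch below maps a pixel to its tile and counts the tiles needed to cover a surface. It assumes tiles are aligned to the top-left corner of the surface and that partial tiles at the edges are rounded up; the article does not state either detail, and the function names are hypothetical.

```python
TILE = 8  # tile edge in pixels, from the 8x8 block described above


def tile_of(x, y):
    """Return ((tile_x, tile_y), (local_x, local_y)) for pixel (x, y)."""
    if x < 0 or y < 0:
        raise ValueError("pixel coordinates must be non-negative")
    return (x // TILE, y // TILE), (x % TILE, y % TILE)


def tiles_for(width, height):
    """Tiles needed to cover a width x height surface, rounding up at the edges."""
    return (-(-width // TILE), -(-height // TILE))
```

Under these assumptions a 1024x768 frame is covered by 128 x 96 tiles, and pixel (13, 9) falls in tile (1, 1) at local position (5, 1).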