Every instruction can take between 2 and 4 arguments: one destination and up to 3 sources. Both sources and destination are registers. There are 9 different types of registers, as indicated in the table below:
Register Type | VS1.0 | VS1.1 | VS2.0 | VS3.0 (*) |
Vertex Input | 16 | 16 | 16 | 16 |
Vertex Output | 13 | 13 | 13 | 13 |
Temporary Register | 12 | 12 | 16 | 32 |
Float Constants | 96 | 96 | 256 | 256 |
Integer Constants | N/A | N/A | 16 | 16 |
Boolean Constants | N/A | N/A | 16 | 16 |
Address Register | N/A | 1 (1D) | 1 (4D) | 1 (4D) |
Loop Register | N/A | N/A | 1 | 1 |
Sampler Register | N/A | N/A | 1 | 4 |
The “Vertex Input” (Vn) registers contain data taken directly from the Vertex Stream, as described in the Fixed Function section: vertex position, texture coordinates, etc. The “Vertex Output” (On) registers contain output data such as transformed position, vertex colours and transformed texture coordinates. The “Temporary Registers” (Rn) can be freely read and written as part of the program. There are 3 types of constants used for different purposes: Floats, Integers and Booleans. These registers are constant for a large set of vertices; their values are set by special instructions at the start of the vertex program or through special function calls in the API, and the Vertex Shader itself cannot update them dynamically. The Address Register (A) has a special function related to relative addressing (addressing data based on a fixed address and a factor) and can be 1D (one value) or 4D (4 values). The Loop Register (AL) has a special function related to loops: it is essentially a counter which increments each time a loop is executed. The Sampler Registers allow texture data to be sampled in the Vertex Shader, a capability often identified with “displacement mapping”. They contain the texture ID (which texture to sample), the texture type (1D, 2D, Cube, Volume) and texture filtering information (Point, BiLin, TriLin). The behaviour and functionality of the Sampler Registers differ between VS2.0 and VS3.0: VS2.0 only supports static texture coordinates, whereas VS3.0 allows the Vertex Shader to specify texture coordinates prior to sampling, which is essentially a very powerful dependent read functionality.
Just like the number of instructions, these registers are limited in number because they translate directly into silicon space and thus chip cost.
As part of instruction encoding we need to be able to encode the address of all these registers. To come up with an efficient addressing mechanism we split the registers into 3 groups: read-only, read/write and write-only. The Vertex Output Registers are write-only and the Vertex Input Registers are read-only. Constants are read-only, and Sampler Registers act as constants and are therefore also read-only (the Vertex Shader cannot change the filtering, texture ID, etc.). The Address Register is special: it can be written to and read from, but only using special instructions. The Loop Register cannot be accessed directly; it is part of the loop instruction set. The Temporary Registers are read/write. These read/write limitations influence the address encoding of the Source and Destination parts of our instructions: there is no use in allowing a constant to be addressed as a destination or an output register as a source.
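The grouping described above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary name, the key names and the helper functions are hypothetical, and the Loop Register is left out because the text says it cannot be addressed directly.

```python
# Access class per register type, following the grouping in the text.
# All names here are illustrative; they are not taken from any API.
ACCESS = {
    "vertex_input": "read-only",
    "vertex_output": "write-only",
    "temporary": "read/write",
    "float_constant": "read-only",
    "integer_constant": "read-only",
    "boolean_constant": "read-only",
    "sampler": "read-only",
    "address": "read/write",  # only through special instructions
}


def usable_as_source(register_type):
    """A source operand must be readable."""
    return ACCESS[register_type] in ("read-only", "read/write")


def usable_as_destination(register_type):
    """A destination operand must be writable."""
    return ACCESS[register_type] in ("write-only", "read/write")


sources = sorted(r for r in ACCESS if usable_as_source(r))
destinations = sorted(r for r in ACCESS if usable_as_destination(r))

print("sources:", sources)
print("destinations:", destinations)
```

Only the register types listed under sources need a source address, and only those listed under destinations need a destination address, which is why the two address fields can have different widths.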
For the Source arguments we need to be able to address all the registers that are readable: 16 Input Registers, 32 Temporary Registers, 256 Float Constants, 16 Integer Constants, 16 Boolean Constants, 1 Address Register and 4 Sampler Registers. We also need to take into account relative addressing for the constants. This means that rather than direct addressing, e.g. FloatConstant[10], you can use FloatConstant[A.x + 5]. There are up to 5 different addressing modes for the constant area: direct addressing and 4 different relative addressing bases (the Address Register is 4D, so its x, y, z and w components can each be used as an addressing base). We can approximate the number of addressing bits required by simply adding up all the locations: 16 + 32 + 256*5 (addressing modes) + 16 + 16 + 1 + 4 = 1365 locations, which translates to 11-bit addresses allowing 2048 locations to be addressed.
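The same count can be reproduced in a few lines. This is a rough sketch of the approximation made in the text, not a description of how real hardware encodes addresses; the variable and function names are made up for illustration.

```python
# Readable register counts for the VS3.0 column of the table.
READABLE = {
    "vertex_input": 16,
    "temporary": 32,
    "float_constant": 256,
    "integer_constant": 16,
    "boolean_constant": 16,
    "address": 1,
    "sampler": 4,
}

# Float constants can be addressed directly or relative to A.x, A.y, A.z, A.w.
FLOAT_CONSTANT_MODES = ["direct", "A.x", "A.y", "A.z", "A.w"]


def bits_needed(locations):
    """Smallest number of bits that can address `locations` distinct slots."""
    return (locations - 1).bit_length()


source_locations = 0
for name, count in READABLE.items():
    modes = len(FLOAT_CONSTANT_MODES) if name == "float_constant" else 1
    source_locations += count * modes

source_bits = bits_needed(source_locations)
print(source_locations, "locations ->", source_bits, "bits,",
      2 ** source_bits, "addressable")
```

Treating every addressing mode as its own block of locations is deliberately generous; a real encoding could use a separate mode field instead, but the resulting bit count would be of a similar order.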
For the Destination arguments we need to be able to address all the registers that are writable: 13 Output Registers, 32 Temporary Registers and 1 Address Register, for a total of 46 locations. This translates to 6 bits, which allows us to address 64 locations.
Thinking back to our instruction format, we had an OpCode (6 bits), a single Destination (6 bits) and up to 3 Source Arguments (11 bits each), for a total of 6+6+33=45 bits per instruction.
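As a quick check of this arithmetic, the sketch below derives the destination width from the writable register counts and adds up the basic instruction fields. The constant names are illustrative; the 6-bit OpCode and 11-bit source address are simply taken from the text.

```python
# Writable register counts for the VS3.0 column of the table.
WRITABLE = {"vertex_output": 13, "temporary": 32, "address": 1}

OPCODE_BITS = 6    # from the text
SOURCE_BITS = 11   # from the text
MAX_SOURCES = 3


def bits_needed(locations):
    """Smallest number of bits that can address `locations` distinct slots."""
    return (locations - 1).bit_length()


destination_locations = sum(WRITABLE.values())
destination_bits = bits_needed(destination_locations)
base_instruction_bits = OPCODE_BITS + destination_bits + MAX_SOURCES * SOURCE_BITS

print(destination_locations, "writable locations ->", destination_bits, "bits")
print("base instruction size:", base_instruction_bits, "bits")
```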
Things are more complex, since every input also has to support Swizzling and replicate. Swizzling is data re-ordering and replicate is data replication. Let me explain this with some examples. As you know, each register holds four float components (xyzw). Swizzling allows you to juggle these components around, and replicate allows you to send the same component to several different positions. For example:
Assume R0 = (1,2,3,4) then: |
R0.xyzw = (1,2,3,4) |
R0.wzyx = (4,3,2,1) |
R0.xxzz = (1,1,3,3) |
R0.wwww = (4,4,4,4) |
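The examples above can be reproduced with a tiny helper. This is a minimal sketch that models a register as a Python tuple; the `swizzle` function is hypothetical and only mimics the behaviour described in the text.

```python
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(register, pattern):
    """Return the components of `register` selected by a 4-letter pattern.

    Re-ordering (e.g. "wzyx") and replication (e.g. "wwww") are both just
    a per-component choice of which source component to read.
    """
    return tuple(register[COMPONENT_INDEX[letter]] for letter in pattern)


r0 = (1, 2, 3, 4)
for pattern in ("xyzw", "wzyx", "xxzz", "wwww"):
    print(f"R0.{pattern} =", swizzle(r0, pattern))
```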
If each input has to support complete Swizzling and replicate, then we need two encoding bits per component. There are 4 components per input, so that is 8 bits per input, which means we need 3x8=24 bits just to encode full modifiers (Swizzling and replicate are referred to as modifiers). Another modifier that has to be supported is “negate”. This requires one additional bit per input, bringing the modifiers to a total of 27 bits. If we add this to the 45 bits we already have we get a total of 72 bits. The output has to support masking (write enable/disable), which requires 1 bit per component or 4 bits in total, bringing our instruction bit count to 76 bits.
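To make the "two bits per component" idea concrete, here is one possible way to pack a swizzle into 8 bits, followed by the running bit total. The bit layout (component 0 in the lowest two bits) is an assumption chosen for illustration and is not claimed to match any particular hardware; all names are hypothetical.

```python
COMPONENTS = "xyzw"


def encode_swizzle(pattern):
    """Pack a 4-letter swizzle into 8 bits, 2 bits per output component."""
    code = 0
    for position, letter in enumerate(pattern):
        code |= COMPONENTS.index(letter) << (2 * position)
    return code


def decode_swizzle(code):
    """Recover the 4-letter swizzle from its 8-bit code."""
    return "".join(COMPONENTS[(code >> (2 * position)) & 0b11]
                   for position in range(4))


# Bit budget as counted in the text.
BASE_BITS = 45        # opcode + destination + 3 source addresses
MAX_SOURCES = 3
SWIZZLE_BITS = 8      # 4 components x 2 bits
NEGATE_BITS = 1       # per source
WRITE_MASK_BITS = 4   # 1 bit per destination component

total_bits = (BASE_BITS
              + MAX_SOURCES * (SWIZZLE_BITS + NEGATE_BITS)
              + WRITE_MASK_BITS)

print(f"xyzw -> {encode_swizzle('xyzw'):08b}")
print(f"wwww -> {encode_swizzle('wwww'):08b}")
print("total instruction bits:", total_bits)
```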
I won’t claim that this number is 100% correct: there might be more constants to address, more addressing modes, more modifiers, smarter ways of encoding all of this, etc. The main point of this little counting exercise is to show you that the size of a single Vertex Shader instruction, and as such of the whole program, should “not” be underestimated. The size of the biggest instruction determines the silicon space needed for storing the 256-instruction program. If we assume that these numbers are correct and round up to whole bytes, we need 10 bytes (80 bits) per instruction, or a total of 2560 bytes (2.5 Kbytes) just for program storage. This number is obviously peanuts compared to a texture upload, but remember that over a 256-bit (32-byte) bus this still results in an 80-cycle stall, which is not something you want hundreds of times during a frame when it’s not necessary, and it is even worse if the data has to come over the AGP bus.
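The storage and upload figures follow directly from the instruction size. The sketch below repeats that calculation under the same simplifying assumptions as the text, plus one of its own: the bus is assumed to deliver one full-width transfer per cycle with no protocol overhead. The names are illustrative.

```python
INSTRUCTION_BITS = 76      # estimate from the counting exercise
PROGRAM_LENGTH = 256       # instructions per program
BUS_WIDTH_BITS = 256       # internal bus width used in the text


def ceil_div(numerator, denominator):
    """Integer division rounded up."""
    return -(-numerator // denominator)


bytes_per_instruction = ceil_div(INSTRUCTION_BITS, 8)
program_bytes = bytes_per_instruction * PROGRAM_LENGTH
bus_bytes_per_cycle = BUS_WIDTH_BITS // 8
upload_cycles = ceil_div(program_bytes, bus_bytes_per_cycle)

print(bytes_per_instruction, "bytes per instruction")
print(program_bytes, "bytes per program =", program_bytes / 1024, "Kbytes")
print(upload_cycles, "bus cycles to upload one program")
```

Changing `INSTRUCTION_BITS` or `BUS_WIDTH_BITS` shows how sensitive the upload cost is to the encoding: any instruction size between 73 and 80 bits rounds to the same 10 bytes and therefore the same 80 cycles.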