Programmable Vertex Unit

To describe the structure of programmable vertex units I’ll start from the specifications as we know them from the Siggraph 2002 presentations and the various DX9 specs available on the World Wide Web. Compared to the fixed function pipeline, the goal is a design with full functional flexibility, meaning that we can do virtually anything we want with the vertex data, no matter how exotic (so transformations are no longer limited to matrix operations and one fixed lighting model), while maintaining high performance and efficient use of costly silicon space.

Because the goal includes high flexibility in terms of functionality, it’s no surprise that a programmable approach was taken to tackle the problem. This means the Vertex Shader will have a typical CPU-like structure based on the execution of a program, built from an instruction set, with registers to store intermediate data.


Instruction Encoding

The first element is the program. According to the latest public information, a DX9 Vertex Shader needs to support at least 256 static instructions. The number of instructions actually executed can be much higher, because looping and function calls allow the same “static” instruction to be executed multiple times. Some might wonder why the number of static instructions is limited to “only” 256, just double that of the old DX8 spec. The main reason is silicon space. A Vertex Shader has to be capable of processing hundreds of millions of vertices per second, and at that speed it is not realistic to place the program in external memory as a regular CPU does. Imagine a program of just 10 instructions. If a single instruction had a size of only 1 byte (we’ll see later that a single instruction is much bigger!), the program would be 10 bytes. These 10 bytes would have to be read and executed for each of hundreds of millions of vertices per second, which translates into a dataflow of gigabytes per second just for the instructions of the vertex program.

Because the same vertex program is likely to be used on a large set of vertices, and because bandwidth is in very short supply on a 3D accelerator, it is logical to store the whole program (or at least cache a large part of it) on the chip. Such storage comes at a premium: more storage means more silicon space, and more silicon space means a higher cost. This reality limits the size of Vertex Shader programs and also holds back the release of very low cost budget solutions.
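The bandwidth estimate above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the vertex rate of 300 million per second is my own pick for "hundreds of millions", and the 16-byte instruction size in the second case is purely hypothetical, chosen to show how quickly things get worse once an instruction is bigger than a single byte.

```python
def instruction_bandwidth(program_bytes: int, vertices_per_sec: int) -> int:
    """Bytes per second fetched if the whole program is re-read for every vertex."""
    return program_bytes * vertices_per_sec


# Assumed vertex rate: 300 million vertices/s (one value for "hundreds of millions").
VERTEX_RATE = 300_000_000

# The toy case from the text: 10 instructions of 1 byte each.
toy = instruction_bandwidth(10 * 1, VERTEX_RATE)

# Hypothetical case: the same 10 instructions at an assumed 16 bytes each.
wide = instruction_bandwidth(10 * 16, VERTEX_RATE)

print(f"toy program:  {toy / 1e9:.1f} GB/s")
print(f"wide program: {wide / 1e9:.1f} GB/s")
```

Even the unrealistically small toy program already asks for gigabytes per second of instruction traffic alone, which is why keeping the program on the chip is the obvious choice.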

The size of the instructions also points at a potential problem that developers need to take into account: don’t change the Vertex Shader program unless you really have to, because uploading a new vertex program comes at the expense of bandwidth. Toggling between two Vertex Shader programs is thus definitely not a good idea for high throughput. Future Vertex Shader implementations might move to a loading (caching) mechanism for instructions, but to make this possible the instruction size has to be kept as small as possible and the instruction re-use has to be substantial (e.g. processing the same instruction on a large set of different vertices before moving to the next instruction).
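One way to picture the advice above is to count how often the bound program changes over a frame. The sketch below is an illustration under a simplifying assumption: every change of program costs one full upload (so no on-chip caching of more than one program). The shader and mesh names are made up for the example.

```python
def count_program_uploads(draw_calls):
    """Count how many times the bound vertex program changes over a draw sequence."""
    uploads = 0
    current = None
    for program, _mesh in draw_calls:
        if program != current:      # a different program has to be uploaded
            uploads += 1
            current = program
    return uploads


# Hypothetical frame: (vertex program, mesh) pairs in submission order.
draws = [
    ("skinning", "hero"),
    ("water", "lake"),
    ("skinning", "enemy"),
    ("water", "river"),
]

# Grouping the draws by program removes the toggling between the two programs.
grouped = sorted(draws, key=lambda draw: draw[0])

print(count_program_uploads(draws))    # 4
print(count_program_uploads(grouped))  # 2
```

Whether the draws can actually be reordered like this depends on the application (transparency and render-state ordering may forbid it), so treat the sort as an illustration of the cost, not as a recipe.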

The second element to look at is the set of actual instructions that form the program. Each instruction has a similar structure:

OpCode    Destination, Input 1, (Input 2, Input 3)

Each instruction starts with an OpCode, which identifies the operation to be executed. The Destination receives the result, and the Inputs (up to three, depending on the operation) supply the operands. The operation can be something as simple as an addition or multiplication, or something as complicated as a reciprocal square root or a sine. Below is a table of the NVIDIA CineFX instruction set:

Add & Multiply

ADD, MUL, DP3, DP4, DPH, MAD, MOV, SUB

Math Functions

ABS, COS, EX2, EXP, FLR, FRC, LG2, LOG, RCP, RSQ, SIN

“Set on” Functions

SEQ, SFL, SGE, SGT, SLE, SLT, SNE, STR

Branching Instructions

BRA, CAL, RET

Address register instructions

ARL, ARA

Graphics-oriented instructions

DST, LIT, RCC, SSG

Minimum / maximum Instructions

MAX, MIN
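To make the “OpCode Destination, Input 1, (Input 2, Input 3)” layout described above a bit more tangible, here is a minimal Python sketch that splits one line of shader text into those fields. The register names (R0, R1, ...) are placeholders I made up for the example, and the sketch only covers the arithmetic form with a destination; instructions such as the branching ones would need their own handling.

```python
from typing import NamedTuple, Tuple


class Instruction(NamedTuple):
    opcode: str                # e.g. "MAD"
    destination: str           # where the result is written
    inputs: Tuple[str, ...]    # one to three source operands


def parse_instruction(text: str) -> Instruction:
    """Split 'OpCode Destination, Input 1, (Input 2, Input 3)' into its fields."""
    fields = text.split(None, 1)   # opcode first, then the operand list
    if len(fields) != 2:
        raise ValueError(f"missing operands: {text!r}")
    opcode, operand_text = fields
    operands = [part.strip() for part in operand_text.split(",")]
    # One destination plus one to three inputs, none of them empty.
    if not 2 <= len(operands) <= 4 or not all(operands):
        raise ValueError(f"expected a destination and 1 to 3 inputs: {text!r}")
    return Instruction(opcode.upper(), operands[0], tuple(operands[1:]))


# Placeholder register names, for illustration only.
print(parse_instruction("MAD R0, R1, R2, R3"))
print(parse_instruction("MOV R0, R1"))
```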

One thing to note is that these instructions do not necessarily match the instruction set supported by the actual Vertex Shader hardware. Quite possibly some of them are macro instructions: instructions that can be written as a group of simpler instructions. A silly example: “MAD”, which stands for “multiply and add”, can be written as a combination of the MUL (multiply) and ADD commands. If we assume that the table above represents the real, complete instruction set of the hardware, then we need a bit code for the OpCodes capable of representing at least 38 different instructions. That means an OpCode of 6 bits, which can represent up to 64 different instructions (5 bits would only cover 32).
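Both points from the paragraph above can be checked with a small Python sketch. The group sizes are simply counted from the table, the helper names are my own, and the MAD expansion works on plain 4-component tuples, which is an assumption about the data layout made only for illustration (real hardware may also round a combined multiply-add differently from two separate steps).

```python
# Number of mnemonics per group, counted from the table above.
GROUP_SIZES = {
    "add & multiply": 8,
    "math functions": 11,
    "set on": 8,
    "branching": 3,
    "address register": 2,
    "graphics-oriented": 4,
    "minimum / maximum": 2,
}


def opcode_bits(instruction_count: int) -> int:
    """Smallest number of bits that gives every instruction a distinct code."""
    return (instruction_count - 1).bit_length()


def mul(a, b):
    """Component-wise multiply of two vectors."""
    return tuple(x * y for x, y in zip(a, b))


def add(a, b):
    """Component-wise add of two vectors."""
    return tuple(x + y for x, y in zip(a, b))


def mad(a, b, c):
    """Multiply-and-add as one operation: a * b + c per component."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


total = sum(GROUP_SIZES.values())
bits = opcode_bits(total)
print(total, bits, 2 ** bits)   # 38 6 64

a = (1.0, 2.0, 3.0, 4.0)
b = (0.5, 0.5, 0.5, 0.5)
c = (1.0, 1.0, 1.0, 1.0)
# The macro view: MAD gives the same result as MUL followed by ADD.
print(mad(a, b, c) == add(mul(a, b), c))   # True
```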