Add & multiply instruction: Both R300 and NV30 meet the requirements of DX9 VS3.0, and each has its own advantage: NV30 provides the DPH instruction, which was already available in NV2x, and R300 provides the new MADDX2 instruction, which can be used as a helper in reflection vector calculations.
Math functions: The highlight of NV30 here is its native SIN and COS
instructions. R300 can emulate some of the functions offered by NV30 through
macros, such as the SINCOS macro, which does the same job as the SIN and COS
instructions in NV30, but it needs more instruction slots to do so. That can
dramatically reduce not only the number of static instruction slots left for
the rest of the shader but also efficiency and accuracy. Here is an
implementation of the SINCOS macro in DX9 VS2.0 using a Taylor series.
    ; An implementation of sincos macro instruction
    ; SRC2 should be constant (1.f/(7!*128), 1.f/(6!*64), 1.f/(4!*16), 1.f/(5!*16) )
    ; SRC3 should be constant (1.f/(3!*8), 1.f/(2!*8), 1.f, 0.5f )
    VECTOR v1 = EvalSource(SRC1);
    VECTOR v2 = EvalSource(SRC2);
    VECTOR v3 = EvalSource(SRC3);
    VECTOR v;
    MUL v.z, v1.w, v1.w          ; x*x
    MAD v.xy, v.z, v2.xy, v2.wz
    MAD v.xy, v.xy, v.z, v3.xy
    MAD v.xy, v.xy, v.z, v3.wz   ; Partial sin(x/2) and final cos(x/2)
    MUL v.x, v.x, v1.w           ; sin(x/2)
    MUL v.xy, v.xy, v.x          ; compute sin(x/2)*sin(x/2) and sin(x/2)*cos(x/2)
    ADD v.xy, v.xy, v.xy         ; 2*sin(x/2)*sin(x/2) and 2*sin(x/2)*cos(x/2)
    ADD v.x, -v.x, v3.z          ; cos(x) and sin(x)
    WriteResult(v, DST);
Of course, R300 may use a more optimized method to implement the SINCOS macro, but we would need evidence to confirm it. DX9 VS3.0 requires that the SINCOS macro take no more than 2 slots, and NV30, with its native instructions, goes beyond that requirement.
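To make the data flow of the macro easier to follow, here is a minimal Python sketch that traces the same steps with scalars: evaluate sin(x/2) and cos(x/2) by Horner's rule in x*x, then apply the double-angle identities. The function name `sincos_macro` and the coefficient tuples are illustrative assumptions; the coefficients are derived directly from the Taylor series, with the alternating signs folded in, rather than copied from the comment in the listing.

```python
import math

# Taylor coefficients in powers of x*x, highest order first.
# Assumption: derived from the series for sin(x/2)/x and cos(x/2),
# with signs folded in; not copied from the listing's comment.
SIN_HALF = (-1.0 / (math.factorial(7) * 128), 1.0 / (math.factorial(5) * 32),
            -1.0 / (math.factorial(3) * 8), 0.5)
COS_HALF = (-1.0 / (math.factorial(6) * 64), 1.0 / (math.factorial(4) * 16),
            -1.0 / (math.factorial(2) * 4), 1.0)


def sincos_macro(x):
    """Return (cos(x), sin(x)) for x in [-pi, pi], mirroring the macro steps."""
    z = x * x                            # MUL: x*x
    s = SIN_HALF[0] * z + SIN_HALF[1]    # MAD (sin lane)
    c = COS_HALF[0] * z + COS_HALF[1]    # MAD (cos lane)
    s = s * z + SIN_HALF[2]              # MAD
    c = c * z + COS_HALF[2]
    s = s * z + SIN_HALF[3]              # MAD: partial sin(x/2)
    c = c * z + COS_HALF[3]              #      final cos(x/2)
    s = s * x                            # MUL: sin(x/2)
    ss, sc = s * s, c * s                # MUL: sin^2(x/2), sin(x/2)*cos(x/2)
    ss, sc = ss + ss, sc + sc            # ADD: 2*sin^2(x/2), sin(x)
    return 1.0 - ss, sc                  # ADD: cos(x) = 1 - 2*sin^2(x/2)
```

Working on the half angle keeps the polynomial argument within [-pi/2, pi/2], where a short series converges quickly; the price is the extra multiplies and adds at the end, which is where the macro's eight slots go.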
Both R300 and NV30 natively support POW, which takes 3 slots in DX9 VS2.0/3.0 and is useful in lighting computations. In addition, R300 provides a new instruction, EXPE, which can be used for fog computations.
Although R300 seems to have no support for ABS, it can achieve the same result with the MAX instruction in one cycle: MAX(A, -A). For NV30, ABS is in fact of almost no use, because any instruction can take the absolute value of each operand at no cost through the _abs modifier.
Floating point precision is an interesting point here. In NV30, except for the two approximate math functions kept only for compatibility with old games and applications, LOG and EXP, all other math function instructions are accurate to at least 22 bits (true IEEE-32 precision).
By comparison, DX9 VS2.0/3.0 requires at least 21 bits of precision for EXP and LOG, at least 15 bits for the POW macro, at least 10 bits for EXPP and LOGP, and a maximum absolute error of no more than 0.002 for the SINCOS macro.
RCC in NV30 is a clamped RCP for avoiding overflow and underflow issues.
Set-on instructions: R300 doesn't provide SGT and SLE, but it can provide the same functions by using SLT and SGE with operands A and B swapped. NV30 supports a few more instructions here.
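The two emulations mentioned above (ABS through MAX, and SGT/SLE through operand swapping) can be sketched in a few lines of Python. The list-based vectors and the helper names are assumptions made for illustration only; they model the component-wise behaviour, not any real hardware interface.

```python
def slt(a, b):
    """SLT: 1.0 where a < b, else 0.0, per component."""
    return [1.0 if x < y else 0.0 for x, y in zip(a, b)]


def sge(a, b):
    """SGE: 1.0 where a >= b, else 0.0, per component."""
    return [1.0 if x >= y else 0.0 for x, y in zip(a, b)]


def sgt(a, b):
    """SGT emulated as SLT with the operands swapped (a > b is b < a)."""
    return slt(b, a)


def sle(a, b):
    """SLE emulated as SGE with the operands swapped (a <= b is b >= a)."""
    return sge(b, a)


def abs_via_max(a):
    """ABS emulated as MAX(A, -A), per component."""
    return [max(x, -x) for x in a]
```

Neither emulation costs an extra instruction: the swap is only a change in operand order, and the negation in MAX(A, -A) is a source modifier.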
Flow control instructions: It seems that NV30 provides only 3 flow
control instructions, and two of them are CAL and RET, used for subroutines.
Compared with the many more flow control instructions, including condition, jump
and loop instructions, provided by VS2.0/VS3.0 and R300, BRA, the only "true"
flow control instruction offered by NV30, seems very inferior. However, do not be
fooled by the façade. In fact, BRA is very powerful: it can use a branch table
and execute data-dependent jumps when necessary. That is to say, depending on
the condition code register, BRA implements not only all the functions of the
DX9 VS2.0 static flow control instructions but also those of VS3.0 dynamic flow
control. The truly poor implementations are those of VS2.0 and VS3.0. For anyone
curious about how BRA implements a loop, the following code calls a function
named "function" the number of times held in c[0].x.
    # c[0].x holds the number of iterations to execute.
    # c[1].x holds the constant 1.0.
    MOVC R15.x, c[0].x;
    startLoop:
    CAL function (GT.x);           # if (counter > 0) function();
    SUBC R15.x, R15.x, c[1].x;     # counter = counter - 1;
    BRA startLoop (GT.x);          # if (counter > 0) goto start;
    endLoop:
    ...
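The control flow of this listing can be traced with a short Python sketch. It assumes a single scalar counter, models only the GT flag of the condition code, and assumes the called function does not itself change the condition code; `run_loop` is a hypothetical name used for illustration.

```python
def run_loop(iterations, function):
    """Mimic the MOVC / CAL / SUBC / BRA listing with a scalar counter."""
    counter = float(iterations)   # MOVC R15.x, c[0].x
    cc_gt = counter > 0.0         # the "C" suffix also updates the condition code
    while True:                   # startLoop:
        if cc_gt:                 # CAL function (GT.x)
            function()
        counter -= 1.0            # SUBC R15.x, R15.x, c[1].x
        cc_gt = counter > 0.0     # condition code refreshed by SUBC
        if not cc_gt:             # BRA startLoop (GT.x) is not taken
            break                 # fall through to endLoop
    return counter
```

The conditional CAL matters only for the first pass: it keeps the function from running when the iteration count starts at zero, since the backward BRA is tested after the body.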
NV30 also supports nested subroutines up to 4 levels deep, and it seems that NV30 can also do nested loops.
Address register instructions: By combining its two address registers with the unique ARA instruction, which adds pairs of components of an address register and is useful for looping and other operations, NV30 provides the most flexible relative addressing.
Graphics-oriented instructions: SSG in NV30 adds a new "set sign" operation, which produces a vector holding a negative one for negative components, zero for components with a value of zero, and a positive one for positive components. Equivalent results could be achieved (less efficiently) with multiple SLT, SGE, and arithmetic instructions. For the functions of DST and LIT, see the Pixel Shader part.
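As an illustration of the less efficient route, here is one possible combination, sketched in Python: two SLT comparisons against zero followed by a subtraction. This is only one of several ways to build the result, and the helper names are assumptions.

```python
def slt(a, b):
    """SLT: 1.0 where a < b, else 0.0, per component."""
    return [1.0 if x < y else 0.0 for x, y in zip(a, b)]


def ssg_emulated(v):
    """Set-sign built from two SLTs and one ADD with a negated operand."""
    zero = [0.0] * len(v)
    positive = slt(zero, v)   # 1.0 where the component is > 0
    negative = slt(v, zero)   # 1.0 where the component is < 0
    return [p - n for p, n in zip(positive, negative)]
```

That is three instructions where NV30 needs a single SSG.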
Texture instructions: These are introduced by DX9 VS3.0 for texture lookup, and they are one of the key reasons why the NV30 VS cannot be called VS3.0 but only VS2.0+. In VS3.0, four separate texture sampler stages (distinct from the displacement map sampler and the sixteen texture samplers in the pixel engine) exist in the vertex engine and can be used to sample textures set at those stages. The functionality is identical to that of pixel textures, except that there is no anisotropic texture filtering support and no rate-of-change information.
Predicate register instruction: Last but not least, every instruction provided by NV30, except the three flow control instructions, can update the condition code register (NV30's predicate) at the same time, if and only if the optional suffix "C" is added. For example, there are two instructions to perform vector addition, "ADD" and "ADDC". In NV30, each component of the destination register is updated with the result of the vertex program instruction if and only if the component is enabled for writes by the component write mask and by the optional condition code mask (if applicable). Otherwise, the component of the destination register remains unchanged.
In VS3.0, a similar function is implemented by an instruction modifier, which costs one additional instruction slot. The syntax is as follows:
    [([!]p[.swizzle])] InstOpcode Instruction_Parameters
    e.g. (p.x) add_sat r0.xy, r1, r2
    e.g. (!p) mul r0, r1, r2
The destination write mask is "and"ed with the per-channel predicate boolean value, and the data is then written back into the destination (after the usual application of the instruction modifiers). This update has no side effects, i.e. it does not change the predicate register, which can only be modified by the SETP instruction. The only swizzles allowed inside the instruction modifier are full (.xyzw) or replicate (.x, .y, .z, .w). The presence of a (!) inside the modifier reverses the meaning of the predicate bits. Note: The instruction modifier is syntactically present before the opcode, but in the binary format it is the trailing token. In VS3.0, all instructions except flow control and SETP can be predicated.
Generally speaking, we must admit that the NV30 instruction set even exceeds VS3.0 in several respects, such as BRA. It seems that NV30 has the better instruction set.
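The predicated write described here can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: registers are four-element lists, the write mask is a string of channel letters, and `predicated_write` is a hypothetical helper, not part of any real API.

```python
CHANNELS = "xyzw"


def predicated_write(dst, result, write_mask, pred, swizzle="xyzw", negate=False):
    """Return the new destination after a predicated, masked write.

    pred is the four-boolean predicate register; it is only read, never changed.
    """
    if swizzle not in ("xyzw", "x", "y", "z", "w"):
        raise ValueError("only full or replicate swizzles are allowed")
    # A replicate swizzle applies one predicate channel to all four channels.
    select = swizzle * 4 if len(swizzle) == 1 else swizzle
    out = list(dst)
    for i, channel in enumerate(CHANNELS):
        bit = pred[CHANNELS.index(select[i])]
        if negate:                           # "!" reverses the predicate bits
            bit = not bit
        if channel in write_mask and bit:    # write mask AND predicate
            out[i] = result[i]
    return out
```

For the first example above, `(p.x) add_sat r0.xy, r1, r2`, the call would pass `write_mask="xy"` and `swizzle="x"`, with `result` holding the saturated sum: r0.x and r0.y are written only if p.x is true, and r0.z and r0.w are never touched.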
Vertex Processing Unit
We all know the details of the vertex processing engine in R300, but we know almost nothing about the architecture of the vertex processor in NV30, and no information about it has leaked. However, the "NV30 OpenGL Extensions presentation" may give us a clue.
NVIDIA claims that the vertex program performance of NV30 is over 3x faster than NV20 and over 1.5x faster than NV25, clock for clock. As we all know, NV20 has one VS unit and NV25 has two, so if throughput scales with the number of units, we may assume that NV30 has three VS units. On that assumption, the question has a clear answer.
The VS unit in NV30 must surely have a complex and optimized architecture too, because NVIDIA claims that older VS1.0 and VS1.1 programs also run faster. More detail would require additional information from NVIDIA, though it is almost impossible that NV30 still retains the antiquated hardware T&L unit.
So, as we can see, R300, with its four vertex units against the three assumed here for NV30, has more raw power than NV30 in theory, clock for clock.