Nicolas Thibieroz

Nick is one of the GPG devrel ninjas, coming to AMD by way of ATI and PowerVR before that. From an ISV devrel perspective, Nick's had a hand in more AAA games than we care to mention, and D3D10's his favourite D3D so far. So much so, in fact, that he decided to reprise his D3D10 For Techies presentation from Develop last year and present even more information.

Good D3D10 Practices And Things To Remember

Nick's talk centered around porting a D3D9-level renderer to D3D10, outlining some of the pitfalls and things to watch out for as you make that transition. He started by urging developers to separate their D3D10 shaders into opaque and transparent versions, running different shaders depending on whether the alpha channel is a consideration when shading, and to remember to set render state correctly at each point.

The geometry shader was talked about as the replacement for point sprites, where you emit the primitives for the sprite in the GS at the screen-space point you require. Nick also reaffirmed that there's no longer a half-texel offset when mapping texture UVs to geometry, so D3D9-era shaders you might port to D3D10 that take that into account will need to be corrected. Input semantics for the vertex shader stage were also mentioned, with D3D10 more strict about your vertex structures and which shaders you feed them to. Make sure your input layouts match to ensure correct binding semantics, so that D3D10 can do the right thing with shader caching and binding.
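As a rough sketch of that point-sprite replacement (the struct names and g_SpriteSize constant are invented for illustration), a GS can expand each incoming point into a two-triangle strip:

```hlsl
struct VSOut { float4 pos : SV_Position; };
struct GSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

cbuffer SpriteCB { float2 g_SpriteSize; }; // half-size of the sprite in clip space

[maxvertexcount(4)]
void GS_PointSprite(point VSOut input[1], inout TriangleStream<GSOut> stream)
{
    const float2 corners[4] = { float2(-1,  1), float2(1,  1),
                                float2(-1, -1), float2(1, -1) };
    for (int i = 0; i < 4; ++i)
    {
        GSOut o;
        // Offset in clip space; scale by pos.w instead for a constant
        // screen-space size under perspective.
        o.pos = input[0].pos + float4(corners[i] * g_SpriteSize, 0, 0);
        o.uv  = corners[i] * float2(0.5, -0.5) + 0.5;
        stream.Append(o);
    }
    stream.RestartStrip();
}
```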

Nick went on to discuss how D3D10 might have a base level spec, but there are still additions to that base spec that hardware might not implement, which require asking the API to check for compatibility before you perform the operation. Surface filtering, RT blending, MSAA on RTs and MSAA resolve for an RT were all mentioned in that group, and it's up to you to make sure the hardware can deal with the surface format you want to use for those operations. For example, current Radeon D3D10 hardware can't perform MSAA on the 96-bit RGB float surface type (DXGI_FORMAT_R32G32B32_FLOAT), but can if you make use of DXGI_FORMAT_R32G32B32A32_FLOAT instead, with its extra channel. Full orthogonality isn't there yet, so make sure the formats you use are compatible with what you're doing.
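A minimal sketch of that capability check (device creation and error handling omitted) before creating, say, a 4x MSAA render target in a given format:

```cpp
UINT support = 0, qualityLevels = 0;
device->CheckFormatSupport(DXGI_FORMAT_R32G32B32_FLOAT, &support);

bool canMsaaRT  = (support & D3D10_FORMAT_SUPPORT_MULTISAMPLE_RENDERTARGET) != 0;
bool canResolve = (support & D3D10_FORMAT_SUPPORT_MULTISAMPLE_RESOLVE) != 0;

device->CheckMultisampleQualityLevels(DXGI_FORMAT_R32G32B32_FLOAT, 4, &qualityLevels);

if (!canMsaaRT || !canResolve || qualityLevels == 0)
{
    // Fall back, e.g. to DXGI_FORMAT_R32G32B32A32_FLOAT, or drop MSAA here.
}
```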

Talking about D3D10's improved small batch performance, Nick highlighted that the improvements are harder to spot if you port your D3D9 engine naively and don't make the best use of the features on offer. Making good and correct use of constant buffers, state objects, the geometry shader, texture arrays and resource views, and geometry instancing (to name a few key considerations) is key to extracting the best driver-level and runtime performance from the D3D10 subsystem, and to reducing small batch overhead.

The general idea Nick pushed in that section of his talk is that you need to give the driver and runtime the very best picture of what you're trying to achieve in your application for any given frame, so that they can combine to drive the hardware efficiently. Make sure you only update key resources when you need to, which might not be every frame. That leads nicely on to constant buffer management.

Constant Buffer Management

Somewhat surprisingly, at least to us, Nick says that the number one cause of poor performance in the first D3D10 games is bad CB management. Developers forget that when any part of a constant buffer is updated, the entire buffer is resent to the GPU, not just the changed values. Between one and four CBs per shader is optimal on current hardware, with Nick mentioning that doesn't just apply to Radeon, but GeForce too. Nick also mentioned that you should keep your constants in explicitly named CBs, since letting them fall into the global CB is a quick way to worse performance.
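A minimal sketch of that advice (buffer and constant names are invented): group constants into a small number of buffers by update frequency, so that touching a per-object value doesn't resend per-frame data too. Declaring everything inside named cbuffers also keeps constants out of the implicit $Globals buffer:

```hlsl
cbuffer PerFrame : register(b0)   // updated once per frame
{
    float4x4 g_ViewProj;
    float3   g_EyePos;
    float    g_Time;
};

cbuffer PerObject : register(b1)  // updated per draw call
{
    float4x4 g_World;
    float4   g_MaterialColour;
};
```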

Best practice with CBs also centers around the position of constants in the buffer itself, to help caching work properly. If a shader fetches, say, four constants from one of your CBs at a particular point, make sure those four constants are adjacent in the CB. The hardware speculatively fetches into the constant cache, so neighbouring constant values are more likely to be available as soon as possible than if it has to jump all over the constant space for the values you require. Think about ordering constants in your CB by their size, too, and profile your app before and after a CB order change if possible, to check how performance was affected.
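The layout idea can be sketched with a hypothetical C++ mirror of an HLSL cbuffer: constants the shader reads together sit in adjacent float4 registers, and the rarely-touched value lives at the end, out of the hot path:

```cpp
#include <cstddef>

// Hypothetical C++ mirror of an HLSL cbuffer. Each float[4] maps to one
// 16-byte float4 constant register.
struct alignas(16) LightingCB
{
    // Fetched together by the lighting code, so packed side by side.
    float lightPos[4];      // register c0
    float lightColour[4];   // register c1
    float lightAtten[4];    // register c2
    float lightDir[4];      // register c3
    // Rarely-read value sits at the end, away from the hot constants.
    float debugTint[4];     // register c4
};

static_assert(offsetof(LightingCB, lightColour) == 16, "adjacent registers");
static_assert(offsetof(LightingCB, lightAtten)  == 32, "adjacent registers");
static_assert(sizeof(LightingCB) % 16 == 0, "whole float4 registers");
```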

More D3D10 Tips

Next up was a note that you need to specify the winding order for output primitives in the GS, both for best performance and to make sure the hardware actually draws your geometry. A mistake he's seen in current games is failing to specify the winding order and then wondering why geometry is missing, so be sure to tell the API how it should instruct the hardware to draw your GS output.

If you're making use of streamout from your geometry stage, be it VS or GS (you can use VS streamout by setting a null or passthrough GS), hide latency in that stage of the pipeline by doing more math. Nick mentions that you can move math ops around the pipeline in certain cases, to help balance your workload and mask latency. If your VS is bound by streamout latency for example, think about giving it some of your PS work to do if that's possible.

After that was the reminder that the GS isn't only capable of amplifying geometry; it can also cull or deamplify your geometry load. If there's scope for it in your application, consider using the GS for culling before rasterisation. General hints for programming the GS were sensible ones: the smaller the output struct the faster things will run, and the lower your amplification level the better as well.
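A toy sketch of the cull-in-the-GS mechanism (VSOut is a minimal invented struct; the rasteriser would clip this particular case anyway, so a real test would reject work the fixed-function stages can't): emitting nothing for a triangle deamplifies the load before rasterisation:

```hlsl
struct VSOut { float4 pos : SV_Position; };

[maxvertexcount(3)]
void GS_TrivialCull(triangle VSOut tri[3], inout TriangleStream<VSOut> stream)
{
    // Reject triangles entirely to the right of the frustum; they never
    // reach the rasteriser. Everything else passes through unchanged.
    if (tri[0].pos.x > tri[0].pos.w &&
        tri[1].pos.x > tri[1].pos.w &&
        tri[2].pos.x > tri[2].pos.w)
        return; // culled: emit nothing

    stream.Append(tri[0]);
    stream.Append(tri[1]);
    stream.Append(tri[2]);
}
```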

MSAA, Z and S access in your D3D10 shader

Moving on to MSAA, depth and stencil considerations in a D3D10 app, Nick talked about 10.0's restriction that you can only access depth and stencil buffers in your shader when MSAA is off, a restriction that's removed in D3D10.1. Depth can't be bound in your shader while depth writing is on in your render state, and you have to remember to bind the stencil buffer with an integer format for the view, not float.
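A sketch of the resource setup that makes this work (most of the texture description omitted): the depth-stencil surface is created typeless, then viewed with a depth format for testing, a float-reading format for depth in the shader, and an integer format for stencil:

```cpp
D3D10_TEXTURE2D_DESC td = {};
td.Format    = DXGI_FORMAT_R24G8_TYPELESS; // typeless storage
td.BindFlags = D3D10_BIND_DEPTH_STENCIL | D3D10_BIND_SHADER_RESOURCE;
// ... width/height/mips/usage omitted ...

// View formats over the same resource:
//   DSV:         DXGI_FORMAT_D24_UNORM_S8_UINT     (depth/stencil testing)
//   depth SRV:   DXGI_FORMAT_R24_UNORM_X8_TYPELESS (read depth in a shader)
//   stencil SRV: DXGI_FORMAT_X24_TYPELESS_G8_UINT  (integer view of stencil)
```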

For custom MSAA resolve in your shader, Nick made note that in a modern game engine with HDR pixel output, it's not always correct to let the hardware perform the resolve since that might take place before a tonemapping pass on the frame. Instead, you'll want to tonemap first to filter colour values into the right range before resolve, which you perform yourself in the shader.

Output Merger

Lastly, talking about the OM stage of the D3D10 pipe, Nick encouraged developers to remember that the blender can take two input colours, as long as you observe some semantic rules about which rendertarget you output to. Dual-source colour blending as implemented by D3D10 has some restrictions in terms of the blend ops you have available, but it's the beginnings of being able to fully program the ROP's blender which is on the D3D roadmap for the future.
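On the shader side that looks something like the sketch below (PSIn and ShadeSomething are hypothetical): the pixel shader writes two colours, and the blender consumes them as the src0 and src1 operands once the host-side blend state selects a SRC1 factor such as D3D10_BLEND_SRC1_COLOR. Note that only one rendertarget can be bound in this mode; the second output feeds the blender rather than a surface:

```hlsl
struct PSOut
{
    float4 colour  : SV_Target0; // src0 of the blend equation
    float4 factors : SV_Target1; // src1: per-channel blend factors
};

PSOut PS_DualSource(PSIn input)
{
    PSOut o;
    o.colour  = ShadeSomething(input);     // hypothetical shading function
    o.factors = float4(0.5, 0.5, 0.5, 1);  // e.g. per-channel transparency
    return o;
}
```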

Alpha-to-coverage was mentioned too, with Nick explaining how it works regardless of whether MSAA is on or off: the hardware takes the pixel's alpha value as a coverage mask, determining how your bound rendertargets are blended in the output stage of the pipe.

Essentially, Nick's talk and slides were all about using D3D10 sensibly, while keeping in mind what the hardware can do for you at the various render stages. Chances are if you don't abuse the API you won't abuse the hardware, and common sense prevails. Being mostly hardware agnostic, the advice for ISVs looking to make the most of their D3D10 investment, be it a fresh codebase or a D3D9 port, will apply to development on other hardware, not just Radeons.