7. What's your opinion of how DirectX evolved, from DirectX 9's initial release up to DirectX 11? Which were the most significant changes/evolutions?

MH : DX9 had a wide-ranging impact and really moved the industry forward. I think DX11's major contribution was the addition of DirectCompute, which has allowed for some amazing effects (see #8).

AL : Overall I think the API has been evolving in a good direction. Arguably the two most important trends have been towards increased orthogonality and less abstraction. DirectX 10 brought the move to immutable state objects and a common feature set across surface formats. This was probably the largest step since DirectX 9. Compute shader is pretty important too, but it hasn’t yet had a large impact.
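A minimal sketch of the immutable state objects AL refers to (the function name and the particular settings are illustrative, not from the interview): in D3D10/11 the whole rasterizer configuration is baked into one object at creation time, so the runtime and driver can validate it once rather than on every state change, and binding it later via RSSetState is essentially a pointer swap.

```cpp
#include <d3d11.h>

// Create an immutable rasterizer state object; every field is fixed up front.
ID3D11RasterizerState* CreateSolidCullBackState(ID3D11Device* device)
{
    D3D11_RASTERIZER_DESC desc = {};
    desc.FillMode        = D3D11_FILL_SOLID;
    desc.CullMode        = D3D11_CULL_BACK;
    desc.DepthClipEnable = TRUE;

    ID3D11RasterizerState* state = nullptr; // immutable once created
    device->CreateRasterizerState(&desc, &state);
    return state; // bind later with context->RSSetState(state)
}
```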

8. Arguably, with the advent of Compute Shaders, we're already moving in a lower-level access direction: we have to start caring about memory layouts, IHV-specific differences with regard to vector widths, peculiarities of memory hierarchies, building custom data structures, etc. Why isn't it enough?

MH : Compute Shaders do get us lower, but not all that much lower than the performance impacts already visible in pixel shading. For maximal performance, you do have to care about how you are touching memory, how you are branching, how your instructions are being scheduled, your compute-to-bandwidth ratios, etc. Compute Shaders expose much more of the machine in terms of scratch pads and instructions, but we could easily expose those to other shader stages if people found a good use for them. I think the answer to "why isn't it enough?" is that people want more control over the graphics pipeline: how stages are put together, scheduling, command buffer submission and management. You can do a lot with various hacky, non-portable techniques, but it's hard to use those in production.

But, if you look at how things are changing in many games, compute shaders are becoming the place where a lot of the heavy lifting of graphics is done. As Aaron Lefohn likes to say, "The killer application for GPU compute is graphics". We are seeing people use compute shaders for custom anti-aliasing, data structure manipulation, hybrid rendering (raster + ray-traced effects), deferred rendering, advanced lighting, etc.

AL : Even with compute shader, there’s still a lot that is abstracted. The abstraction of SIMD width, while sometimes convenient, does significantly inhibit the ability of expert programmers to write optimal code. Furthermore, while the memory layout might appear to be explicit now, it really isn’t. DirectX makes no guarantees on how things are stored; even things like array-of-structures to structure-of-arrays conversions happen behind the user’s back, which can greatly affect optimal access patterns. When mixing access to resources from both rendering and compute, it’s almost impossible to know what the memory layout of a given surface will be.
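To make the layout point concrete, here is a small illustrative sketch (names and sizes are mine) of the two layouts AL mentions. A loop tuned for one layout strides through memory if the runtime silently stores the data the other way:

```cpp
#include <cstddef>

struct ParticleAoS { float x, y, z, mass; };      // array of structures:
ParticleAoS aos[1024];                            // x,y,z,mass interleaved

struct ParticlesSoA {                             // structure of arrays:
    float x[1024], y[1024], z[1024], mass[1024];  // each field contiguous
};

float SumMassSoA(const ParticlesSoA& p)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < 1024; ++i)
        total += p.mass[i]; // contiguous reads: a cache line serves 16 floats
    return total;
}

float SumMassAoS(const ParticleAoS* p)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < 1024; ++i)
        total += p[i].mass; // strided reads: only 1/4 of each cache line is used
    return total;
}
```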

9. Microsoft tried to reduce CPU overhead significantly in DirectX 10 and DirectX 11 - how successful do you think they have been (and/or OpenGL if that's more your cup of tea) and where is there still room to improve, either incrementally or with a low-level API?

MH : The OS and the driver architecture can still sometimes get in the way. For example, if you look at the "small batch problem", it doesn't exist on the console to nearly the same degree as on the PC, despite the underlying HW being extremely similar. Multi-GPU support doesn't exist in very useful ways yet in either GL or DX. The tension is really how to maintain portability and system stability while still getting out of the way.
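For readers unfamiliar with it, the "small batch problem" looks roughly like the sketch below (the Mesh type and draw loop are hypothetical): with thousands of tiny draws, the per-call runtime and driver work (validation, state tracking) dominates, even though the GPU work per draw is trivial.

```cpp
#include <d3d11.h>
#include <vector>

struct Mesh {                             // hypothetical per-object data
    ID3D11ShaderResourceView* texture;
    UINT indexCount;
};

void DrawScene(ID3D11DeviceContext* context, const std::vector<Mesh>& meshes)
{
    for (const Mesh& m : meshes) {                       // thousands of small meshes
        context->PSSetShaderResources(0, 1, &m.texture); // per-draw state change
        context->DrawIndexed(m.indexCount, 0, 0);        // a few dozen triangles each
    }
}
```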

AL : Progress has been made but there’s still a ways to go. Things like changing textures and shaders are still more expensive than they should be. There’s a variety of things to blame for this – not just the API. There will always be an abstraction penalty paid when using portable APIs, but I don’t think it has to be as high as it is right now.

10. There are random numbers quoted with regard to how much performance is left on the table when using a graphics API, but let's move beyond that (BeyondRNG!), and do some proper analysis. In your experience, just how many extra mspf does going through ID3D11DeviceContext bring versus something like manually filling out a command buffer and DMAing it to the GPU?

MH : Let's look at latency of submission. It's O(100) usec to submit an empty kernel (set output value) and wait for completion, i.e. a full round trip. Direct submission is O(10) usec. Talking in usec, this may not be enough to impact FPS greatly. However, using the API calls to build an identical command buffer, validate it, and submit it can be expensive depending on the state changes, which may be opaque through the API. Allowing more direct command buffer submission, control, and management would speed things up significantly and likely get us closer to console performance. On the other hand, the OS is designed differently for the PC and console. Consoles don't have to deal with multi-user/multi-app constraints, so they can better optimize memory allocation, layout, scheduling, etc.
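One way to measure the round trip MH describes (this is an assumed methodology, not necessarily his): submit a trivial dispatch with a compute shader already bound, flush, then spin on an event query until the GPU reports completion; the CPU-side elapsed time is the full submit-to-completion latency.

```cpp
#include <windows.h>
#include <d3d11.h>

// Returns the round-trip latency of one dispatch, in microseconds.
// Assumes a trivial compute shader is already bound on `context`.
double MeasureRoundTripUsec(ID3D11Device* device, ID3D11DeviceContext* context)
{
    D3D11_QUERY_DESC qd = { D3D11_QUERY_EVENT, 0 };
    ID3D11Query* done = nullptr;
    device->CreateQuery(&qd, &done);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    context->Dispatch(1, 1, 1); // "empty" kernel
    context->End(done);         // marks the point to wait for
    context->Flush();           // force submission to the GPU

    BOOL finished = FALSE;      // spin until the GPU reaches the marker
    while (context->GetData(done, &finished, sizeof(finished), 0) != S_OK) {}

    QueryPerformanceCounter(&t1);
    done->Release();
    return (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart;
}
```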

AL : I don’t actually have those numbers, and even if I did they would depend heavily on the specific hardware requirements and command buffer format. There’s certainly performance left on the table, but exactly how much depends on a large number of factors.

11. What did Microsoft get right with DirectX 11? What would you have done differently, in order to prevent negative legacy from building up?

MH : API cleanup and back-level support (DX9 through DX11) were the major base improvements. DirectCompute has provided a way for developers to access the compute capabilities of the hardware, and developers are again starting to push the limits of what the hardware can do and are using compute to differentiate on rendering quality.

AL : The move to compiled/immutable state in DirectX 10/11 was a good shift, as was the move to constant buffers over individual registers. Both of these resources are tied to lifetimes and versioning that can be most efficiently managed by an application with knowledge of the data being stored, rather than by an opaque runtime as in DirectX 9. There are a few things that were added that restrict the way we can build rendering pipelines in the future, like UAV writes from shaders, but it’s always a tough balance between exposing low-level features and maintaining performance portability.
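A sketch of the constant-buffer model AL credits (the struct layout and names are illustrative): the application groups constants by update frequency and versions the whole block itself, instead of setting individual registers one at a time through the runtime as in the DirectX 9 path.

```cpp
#include <d3d11.h>
#include <cstring>

struct PerFrameConstants { float viewProj[16]; float timeSec; float pad[3]; };

ID3D11Buffer* CreatePerFrameCB(ID3D11Device* device)
{
    D3D11_BUFFER_DESC bd = {};
    bd.ByteWidth      = sizeof(PerFrameConstants);   // must be a multiple of 16
    bd.Usage          = D3D11_USAGE_DYNAMIC;         // CPU writes every frame
    bd.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    bd.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* cb = nullptr;
    device->CreateBuffer(&bd, nullptr, &cb);
    return cb;
}

void UpdatePerFrameCB(ID3D11DeviceContext* context, ID3D11Buffer* cb,
                      const PerFrameConstants& data)
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    context->Map(cb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped); // new version, no stall
    std::memcpy(mapped.pData, &data, sizeof(data));
    context->Unmap(cb, 0);
    context->VSSetConstantBuffers(0, 1, &cb); // one call binds the whole block
}
```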

12. Most AAA PC engines already have different paths for different GPU architectures, although they generally share the vast majority of the code. When porting a game from another platform, how much of a difference do you think that can make? How do you think the development effort required for different levels of low-level access compares to this?

MH : Most game engines share the majority of their code across vendors, except for a handful of shaders that have very different performance. Depending on the nature of the algorithm and HW, there can be a large difference in performance. Level of effort really depends on how low-level we are trying to go. For command buffer submission, if we had an API with abstract controls for building a command buffer that the user could cache, manage, and then submit with perhaps a patch list, the effort would be reasonably low. If they are going to manually build a command buffer, the effort of getting all of the state management and setup right and dealing with hardware and software workarounds can be extremely daunting across vendors and multiple generations.

AL : For conventional render pipeline stuff – say DirectX 9 and earlier – you can get moderate differences in the performance of features across vendors and generations of hardware. For instance, the first hardware that supported dynamic control flow did so slowly enough that it was typically better to just avoid it. That said, since the API was generally higher level, there were more opportunities for the hardware vendors to implement features in different ways. In DirectX 10/11 hardware, there are some really large performance differences to be aware of. For instance, I’ve had algorithm performance vary by a factor of four between two architectures because of differences in the speed of local memory atomic operations. That’s not a minor architecture-specific detail anymore; that’s the difference between an algorithm being usable or not! Once you have to use entirely different algorithms on different hardware generations or brands, the API is largely incidental.
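AL doesn't name the algorithm, but a per-group histogram is a typical example of code dominated by local-memory atomics: on hardware with slow groupshared atomics, the InterlockedAdd below can bottleneck the entire dispatch. The HLSL is kept as a C++ string for compilation with D3DCompile; the resource bindings are illustrative.

```cpp
const char* kHistogramCS = R"hlsl(
Buffer<uint>   gInput  : register(t0);  // one value per thread
RWBuffer<uint> gOutput : register(u0);  // 256 global bins

groupshared uint sHistogram[256];       // local (on-chip) memory

[numthreads(256, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    sHistogram[gi] = 0;
    GroupMemoryBarrierWithGroupSync();

    // Local-memory atomic: the operation whose speed varies so much by architecture.
    InterlockedAdd(sHistogram[gInput[dtid.x] & 0xFF], 1u);
    GroupMemoryBarrierWithGroupSync();

    // Fold this group's counts into the global bins.
    InterlockedAdd(gOutput[gi], sHistogram[gi]);
}
)hlsl";
```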

13. Does DirectX 9-level programmability make low-level APIs more or less important? Is it mostly about performance now, or are there still major features on the current generation that you think developers would love (and take the time) to use if only they were exposed?

MH : I think you can do some really amazing stuff with DX9 + Compute Shaders (or OpenGL + OpenCL). It seems this is what many ISVs are doing, so perhaps the focus should be on making those work more tightly together. I think many of the performance challenges really come from how we are communicating with the HW from the applications, and not from the programmability we have or don't have.

AL : I don’t think it really affects it much. In the sense that modern graphics hardware can perform arbitrary computations, everything is a question of performance or, more realistically, power. Hardware features can enable more power-efficient algorithms, and those features may require low-level APIs to expose. Given the complexity of the graphics pipeline, I’m sure there are, and will continue to be, useful features that are not exposed in the portable APIs. One need only look at the consoles to see that game developers are quite willing to adapt their implementations and algorithms to run more efficiently on hardware with sufficient market penetration.