Remaining Limitations & Future
As we said previously, FP64 computations are not currently supported on any available GPUs, but this is expected to change later in 2007. There remain a number of other limitations, however, and the following list should cover most of them:
- Recursive functions are not supported at all. This is a scheduler limitation, as there currently are no real "functions" on the hardware side of things, and even if there were, there is no stack to push/pop arguments from either, unless you want to use uncached global memory for that - which is unlikely to be very fast. Recursive algorithms therefore have to be rewritten iteratively, as sketched after this list.
- There is no efficient way to do "global synchronisation" on the GPU, which likely forces you to divide the kernel and synchronise on the CPU side; a sketch of that kernel-splitting approach also follows the list. Given the variable number of multiprocessors and other factors, there may not even be a perfect solution to this problem.
- There are various deviations from the IEEE-754 standard, even where the precision level is identical. For example, neither denormals nor signalling NaNs are supported, the rounding mode cannot be changed, and division and square root are not implemented in a fully standard way.
- Functions cannot have a variable number of arguments. This is the same problem as for recursion.
- Conversion of floating point numbers to integers is done differently than on x86 CPUs; intrinsics are available to make the rounding mode explicit, as shown below.
- The bus bandwidth and latency between the CPU and the GPU may be a bottleneck; minimising round trips and using page-locked host memory, as sketched at the end of this list, can help.
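
To make the recursion limitation concrete, here is a minimal sketch of our own (using a toy factorial, purely for illustration) of how a recursive computation gets rewritten as a loop so it can live in a __device__ function without needing a call stack:

```cuda
// Hypothetical example: a recursive factorial cannot be expressed as a
// __device__ function, but the same computation written iteratively
// compiles and runs fine, since it needs no call stack.
__device__ unsigned int factorial_iterative(unsigned int n)
{
    unsigned int result = 1;
    for (unsigned int i = 2; i <= n; ++i)
        result *= i;
    return result;
}

__global__ void factorial_kernel(unsigned int *out, unsigned int n)
{
    // Each thread computes the same toy value here; a real kernel would
    // obviously distribute different work across the threads instead.
    out[threadIdx.x] = factorial_iterative(n);
}
```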
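
As for global synchronisation, the usual workaround is to end the kernel and launch another one, since launches on the same stream execute in order. The sketch below uses a hypothetical two-pass sum reduction of our own devising; it assumes a power-of-two block size and no more partial sums than threads in the final block:

```cuda
#include <cuda_runtime.h>

// Pass 1 reduces each block's slice of the input into a per-block partial
// sum. __syncthreads() is only a barrier within one block, which is
// exactly why a second, separate launch is needed to combine the partials.
__global__ void reduce_partial(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Host side: the boundary between the two launches acts as the "global
// barrier" the hardware itself lacks. Assumes blocks <= threads.
void sum_on_gpu(const float *d_in, float *d_partial, float *d_out,
                int n, int blocks, int threads)
{
    size_t smem = threads * sizeof(float);
    reduce_partial<<<blocks, threads, smem>>>(d_in, d_partial, n);
    // Implicit global synchronisation point between the two launches.
    reduce_partial<<<1, threads, smem>>>(d_partial, d_out, blocks);
}
```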
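
On the float-to-int point, CUDA exposes conversion intrinsics that make the rounding mode explicit; the kernel below is our own illustrative example of using them rather than relying on whatever the host CPU happens to do:

```cuda
// Illustrative sketch: __float2int_rn and __float2int_rz make the rounding
// behaviour of a float->int conversion explicit in the source code.
__global__ void convert(const float *in, int *out_rn, int *out_rz, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_rn[i] = __float2int_rn(in[i]);  // round to nearest even
        out_rz[i] = __float2int_rz(in[i]);  // round toward zero (C-style cast)
    }
}
```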
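
And regarding bus bandwidth, here is a rough sketch of the two usual mitigations: page-locked ("pinned") host memory via cudaMallocHost, which typically transfers faster over PCI Express than pageable memory, and keeping intermediate data resident on the GPU between kernels to avoid round trips. Buffer names and sizes are just placeholders:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 20;
    float *h_data;
    float *d_data;

    // Page-locked host buffer: generally faster CPU<->GPU transfers
    // than an ordinary malloc'd (pageable) buffer.
    cudaMallocHost((void **)&h_data, n * sizeof(float));
    cudaMalloc((void **)&d_data, n * sizeof(float));

    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // One upload, then as many kernels as needed operate on d_data in
    // place, followed by a single download at the end - rather than
    // shuttling intermediate results across the bus between every step.
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    /* ... kernel launches working on d_data ... */
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```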
Many of these problems will likely be resolved, or at least improved upon, in the future. For example, PCI Express 2.0 should debut in a few months, and you would also expect some improvements for global synchronisation over time. Other things, including support for denormals or signalling NaNs, likely won't be seen for a while longer, if ever. Using an on-chip or cached stack would also be a tad ridiculous, given how many threads are in flight at any given time.
Going forward, however, the prospects of GPU computing are very exciting. In addition to further feature set improvements, we should expect performance gains that significantly outpace Moore's Law in the next few years. Both transistor counts and clock speeds will keep rising and, in addition to that, the percentage of transistors dedicated to arithmetic operations is bound to increase further.
A good example of this trend is the R580 which, compared to the R520, increased its transistor count and die size by only 22%, but tripled its number of ALUs and the size of its register file. The G80 most likely already has a higher ALU ratio than the R520, but we believe that there remains plenty of room for NVIDIA to double or triple that proportion in the coming years. As such, barring any architectural revolution on the CPU front, the performance advantage of GPGPU solutions is likely to grow in the future, rather than shrink.
Closing Remarks
Today, NVIDIA publicly released the beta for CUDA, as well as the related FFT and BLAS libraries. Because of time constraints and the lack of a timely heads up on the beta's target launch date, we will not be looking at how to program in CUDA today, nor will we examine performance for real-world algorithms. Expect more on that from us in the coming days and weeks.
For now, hopefully this preview will have given you a good grasp of what CUDA is all about, and what it brings to the table. These certainly are exciting times for those interested in the GPGPU paradigm, and based on what's coming in the near future, we can't help but be extremely excited by the prospects of that industry segment. It will also be very interesting to see what AMD brings to the table in this area when they finally launch their next-generation R600 architecture.
Finally, we would like to take this opportunity to urge NVIDIA (and AMD) not to artificially restrict GPGPU development on consumer cards. While we fully understand the push for higher average selling prices (ASPs) in the GPGPU market, we would point out that emerging markets require developer innovation. Allowing anyone with some coding skills and a cool idea to experiment on real hardware can be a key catalyst for both short-term and long-term innovation.
An exaggerated focus on the high-ASP part of the market would hinder that innovation, and reduce the perceived performance/dollar advantage which makes current GPGPU solutions so attractive. Furthermore, such limitations would completely block GPU computing from becoming a full part of AAA game middleware solutions and other non-game consumer applications. Applications that would be willing to implement CUDA and CTM paths instead of a single DX10 path should be able to do so.
While we do not have any specific reason to believe NVIDIA and AMD are considering artificially limiting their GPGPU solutions in the consumer space, this is so important from so many points of view that we absolutely had to point it out here. Beyond the level of support provided, professional solutions should be differentiated only by their FP64 performance, memory sizes, multi-GPU functionality, and other similar factors - not by their feature set. Most server CPUs have been differentiating themselves based on even less than that and, apparently, they're doing just fine!
