Making parallelism ‘easy’ (for multithreading…)

Fundamentally, there doesn’t seem to be any easy solution to the problems described on the previous page except changing the software paradigm. The goal must be to make data dependencies much clearer to the hardware, effectively making the instruction stream describe more than just serial operations. One approach is ‘VLIW’ as described on the previous page, but it has limitations which we won’t examine in this article. Another very simple yet surprisingly effective approach is data parallelism (as in GPUs), where, by definition, you know each thread’s data is independent of every other thread’s. Since instructions from different threads are also independent of each other (again by definition), the problem just got a whole lot easier.
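To make this concrete, here is a minimal data-parallel sketch in plain C++ (our own illustration, not tied to any particular GPU API): each thread is handed a disjoint slice of an array, so no inter-thread data dependency can exist by construction.

```cpp
// Minimal data-parallel sketch: each thread owns a disjoint slice of the
// data, so inter-thread dependencies are impossible by construction.
#include <algorithm>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    const unsigned num_threads =
        std::max(1u, std::thread::hardware_concurrency());
    std::vector<float> data(n, 1.0f);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread touches only its own contiguous chunk; threads
            // never read or write each other's elements.
            const std::size_t begin = t * n / num_threads;
            const std::size_t end = (t + 1) * n / num_threads;
            for (std::size_t i = begin; i < end; ++i)
                data[i] = data[i] * 2.0f + 1.0f;
        });
    }
    for (auto& w : workers) w.join();
}
```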

This can then be gradually extended through things like shared memory and synchronisation points, allowing for much greater flexibility while still never risking any unexpected data dependency. Not only does this improve the hardware’s raw performance and make it relatively straightforward to use, it also forces the programmer to think in such a way that his algorithm cannot have any bad data dependency.
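As a toy illustration of that extension (our own, assuming a C++20 compiler for std::barrier): threads first each write their own slot of a shared buffer, and only after a synchronisation point, once every write is guaranteed complete, do they read each other’s results.

```cpp
// Sketch of data parallelism extended with shared data plus a
// synchronisation point: reads of other threads' results happen only
// after the barrier, so no unexpected data dependency can occur.
#include <barrier>
#include <thread>
#include <vector>

int main() {
    const unsigned num_threads = 4;
    std::vector<float> partial(num_threads);  // shared between threads
    std::vector<float> totals(num_threads);
    std::barrier sync(num_threads);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            partial[t] = float(t + 1);   // phase 1: write only your own slot
            sync.arrive_and_wait();      // wait until all phase-1 writes land
            float sum = 0.0f;            // phase 2: safely read everyone's data
            for (unsigned i = 0; i < num_threads; ++i)
                sum += partial[i];
            totals[t] = sum;
        });
    }
    for (auto& w : workers) w.join();
}
```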

Just about every processor startup that proposes an approach to simplify parallel programming has the same idea at its core: force the programmer to think in such a way that bad data dependencies between threads are naturally minimised, while optimizing the heck out of the hardware for the ones that are still allowed (like shared memory on GPUs). Obviously the approaches to do that vary quite a bit, as do the algorithms they are seemingly best suited for.

And as we tried to imply above, the most important factor isn’t how limited the architecture is, but rather how it forces the programmer to think about the algorithm. That means there’s nothing preventing similar approaches from being used on x86 CPUs, and indeed data parallelism, for example, works relatively well on them. But there’s a catch: x86 needs to be able to support every possible approach and isn’t very optimized for any one of them. So it’ll never be as fast or as cost/power-efficient as more specialised hardware (before we even consider the other traditional sources of x86 inefficiency).
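For instance, the same ‘one operation on many independent elements’ idea maps directly onto x86 SIMD. Here is a sketch using AVX intrinsics (it assumes an AVX-capable CPU and an appropriate compiler flag such as -mavx; the function name is our own):

```cpp
// Data parallelism expressed with x86 AVX intrinsics: one instruction
// applies the same operation to eight independent floats at once.
#include <immintrin.h>
#include <cstddef>

void scale_and_offset(float* data, std::size_t n) {
    const __m256 two = _mm256_set1_ps(2.0f);
    const __m256 one = _mm256_set1_ps(1.0f);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);           // load 8 floats
        v = _mm256_add_ps(_mm256_mul_ps(v, two), one);  // v = v * 2 + 1
        _mm256_storeu_ps(data + i, v);                  // store 8 floats
    }
    for (; i < n; ++i)  // scalar tail for any leftover elements
        data[i] = data[i] * 2.0f + 1.0f;
}
```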

The problem for specialised parallel processors is obviously hitting the right sweet spot between efficiency and flexibility. Not flexible enough and you’re useless. Not specialised and efficient enough and you don’t have much of an advantage anymore: your hardware is complex and expensive, and the programmer isn’t really forced to think in terms that minimize data dependencies. Unless, that is, you have a software suite that tries to force him to do so anyway – but if that is less flexible than your hardware, you’re potentially wasting some of your silicon.

We’re not pretending this isn’t an immensely complex debate; it obviously is. Nobody knows of anything that even slightly resembles a perfect solution, one that is both easy and efficient for every parallel problem. It is tempting to believe that future solutions will be an extremely natural and incremental evolution of today’s approaches, but that kind of thinking has very often been contradicted by history.

However, examining approaches that work for a large subset of the problem never hurt anybody. And that’s pretty much the goal of this article, and especially of the upcoming architecture-specific analysis pieces. It’s not realistic to expect everyone to agree with our expectations of future trends, but it’s much harder to have a fundamental disagreement about what has already been proven to work.