‘Serial’ Performance at All Costs (or some, anyway…)

First, let’s be honest with ourselves: if any algorithm were truly serial every step of the way, then something similar to our programmable NAND gate described on the first page would provably be the fastest computational architecture you could design for the problem*. When a superscalar out-of-order execution processor executes more than one instruction per cycle, it’s really just trying to extract the inherent parallelism that exists in the algorithm. And when a hardware engineer designs a hardware ADD or MUL unit, for example, they’re also implicitly extracting the parallelism inherent in that mathematical operation.

With that preliminary out of the way, we can now think about the ways in which modern programmable cores try to extract parallelism out of a given program (to improve perceived ‘serial’ performance), either with or without the help of the compiler and/or the instruction set. There are two main strategies to achieve this: either you pack more processing into every (internal) instruction, or you execute more instructions in parallel. The former is frequent in DSPs and ‘configurable’ architectures such as Tensilica’s, but fairly rare in general purpose CPUs. The latter can be achieved through a technique known as VLIW, where multiple instructions are packed into a single word or packet and are all guaranteed by the programmer or compiler to be independent of each other. Finally, the processor itself can try to figure out at runtime which instructions in a seemingly serial program are safe to execute in parallel.
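To make that distinction a little more concrete, here is a minimal C sketch (the function names are purely illustrative): the first function is made of operations that don’t depend on each other, so a VLIW compiler or an out-of-order core is free to issue them in the same cycle; the second is a chain where every step needs the previous result, and no amount of instruction packing will help it.

    void independent(const float *a, const float *b, float *out) {
        /* No result feeds into any other: these four multiplies can issue in parallel. */
        out[0] = a[0] * b[0];
        out[1] = a[1] * b[1];
        out[2] = a[2] * b[2];
        out[3] = a[3] * b[3];
    }

    float dependent(float x) {
        /* Each line needs the previous one: a purely serial chain. */
        float r = x;
        r = r * r + 1.0f;
        r = r * r + 1.0f;
        r = r * r + 1.0f;
        return r;
    }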

On the other hand, the basic problem of parallel programming is roughly as follows: the programmer (and/or compiler) can in theory achieve a perfect understanding of the instructions being executed and how they depend on each other (e.g. whether they use the same registers). However, they generally don’t have a good understanding of the data being processed and which instructions depend on which pieces of data. For example, they cannot easily find out what possible values a pointer could take, so one read might depend on another write in a non-obvious way (two reads, on the other hand, will never pose a problem to each other). So from the programmer’s point of view, dividing the program into multiple threads (which basically makes coarse instruction dependencies explicit) is very error-prone, as they could easily miss a data dependency; and if they do, it can be very hard to debug, as the program just risks misbehaving randomly.
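A minimal C sketch of that pointer problem (the names are illustrative, and this is of course a simplification): nothing in the code itself says whether dst and src overlap, so whether the loop iterations are truly independent of each other depends entirely on what the pointers happen to contain at runtime.

    void scale(float *dst, const float *src, float k, int n) {
        /* If dst and src point into the same array, the read of src[i] may
           depend on an earlier write to dst[i - 1]; splitting this loop across
           threads is then a silent bug. If they never overlap, every iteration
           is independent and the loop parallelises trivially. */
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

This is precisely the gap the C99 ‘restrict’ qualifier tries to close: it lets the programmer promise that no such overlap exists, which is information the compiler can rarely prove on its own.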

This is a relatively similar problem to the one a modern out-of-order processor faces when it needs to automatically extract parallel instructions from a seemingly serial instruction stream. It can use more or less advanced techniques to analyse instruction dependencies; the instructions are static, so in principle that’s fairly easy to do. But once things become dependent on data, as they often are, it gets more difficult. This is basically because an instruction is decoded before its data is available: when a memory read/write instruction is found, the register holding the memory address hasn’t been computed yet. And when a branch instruction decides whether to skip a block of instructions based on a boolean value, the register storing that value hasn’t been computed yet either.
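A small sketch of why that is hard, using nothing more exotic than a linked-list walk: the address of every load is only produced by the load before it, and the branch outcome is only known once the value has actually arrived from memory.

    struct node { int value; struct node *next; };

    int count_positive(const struct node *head) {
        int count = 0;
        for (const struct node *n = head; n != NULL; n = n->next) {
            /* The address of the next node is unknown until this load completes,
               and the branch below can only be resolved once 'value' has arrived. */
            if (n->value > 0)
                count++;
        }
        return count;
    }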

There are two possible solutions for the processor (or rather, for the hardware engineer): either it assumes there is a real dependency and stalls execution if necessary until everything is safe, or it speculates based on statistics and throws away a lot of work if its bet was wrong. The latter is known as branch prediction for conditionals, and as memory disambiguation for loads and stores. Branch prediction is practically always used, with the only distinction being how advanced the speculation mechanism is. On the other hand, most processors simply assume memory operations must be run in order and stall execution if necessary; the only exceptions are Intel processors starting with the Core 2 Duo as well as VIA’s Nano (aka Isaiah). Once again, this is only a problem for memory writes; executing multiple memory reads at the same time with no writes outstanding is never an issue, as no conflict is possible between them.
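Here is the store-to-load question in miniature (a deliberately trivial, hypothetical function): whether the read of a[j] can begin before the write to a[i] has completed depends on whether i and j are equal, which the core cannot know until both addresses have been computed. It must either stall the load until the store address is known, or speculate that they don’t collide and replay the load (and everything that consumed it) if that guess turns out to be wrong.

    int store_then_load(int *a, int i, int j, int x) {
        a[i] = x;        /* store: its address depends on i */
        return a[j];     /* load: may or may not alias the store above */
    }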

Especially stunning is the effect of disambiguation on memory performance in application benchmarks: despite lacking an integrated memory controller, the Core 2 Duo ends up less dependent on low memory latency than even AMD’s K8, because it doesn’t have to stall on every potential conflict! But speculation will only get you so far, and implicit parallelism extraction has limits that prevent it from magically being as fast as, say, a manually multi-threaded program running on multiple cores where the instructions in each thread are guaranteed by the programmer to be independent. Hardware speculation also gets very expensive if you want it to be any good, and by definition speculation based on statistics can never be as accurate as a detailed analysis based on an in-depth knowledge of the specific situation at hand.

In theory, there is not necessarily a hard line between serial and parallel computation; the former is just a way to hide the complexity of parallelism from the programmer, moving all of that complexity to the compiler and, more frequently, the hardware. In practice, the distinction still makes a lot of sense: in just about every application domain, from gaming to telecommunications and everything in between, there is a small amount of code which is executed many times and warrants extensive optimization, and a lot of code where performance matters less and which would be much more difficult to optimize. In gaming, that could be the physics engine and the gameplay code respectively; in wireless, it could be the physical layer and the protocol stack, where the former relies mostly on a small number of algorithms while the latter can be more than a million lines of interdependent code with strict reliability requirements.

Therefore, processors and programming environments that focus exclusively on ‘serial’ performance will likely always remain useful, despite either their relatively low performance at the low end or the obvious efficiency penalty associated with their complex speculation mechanisms for both instruction and data dependencies at the high end. But on the other hand, the performance gap between these processors and those focused on more explicit forms of parallelism is only going to get bigger.

And so the questions we must ask ourselves are how to help the programmer minimize data dependencies, how to make their analysis less error-prone when dividing the program into groups of independent instructions (i.e. threads), and how we can design hardware that benefits most effectively from explicit parallelization.
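As an illustration of just how error-prone that division already is, here is the classic missed dependency in a deliberately naive pthreads sketch: both threads act as if they own ‘total’, but the increment is a non-atomic read-modify-write, so the final value changes from run to run, which is exactly the ‘misbehaving randomly’ failure mode described earlier.

    #include <pthread.h>
    #include <stdio.h>

    static long total = 0;               /* shared, but nothing marks the dependency */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            total++;                     /* non-atomic read-modify-write: a data race */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("total = %ld (rarely the expected 2000000)\n", total);
        return 0;
    }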

(*): Advanced: Actually, let’s try being even more honest: in the specific case of semiconductor electronics, where many logic gates may be placed in series within a given stage, the fastest design would be one that chains the maximum number of NAND gates one after another in a single stage; this would reduce the chip’s maximum clock frequency, but initially not as fast as gates are being added (so there’s a clear sweet spot). And given that very few algorithms are optimally implemented using only NAND gates, placing all possible kinds of logic gates in parallel would also improve true serial performance.