Future ARM Architecture Evolution

The Cortex-A15 is still a 32-bit processor using the ARMv7 ISA. It adds support for 40-bit physical addressing (mostly for the sake of virtualisation) but that's only a stopgap to 64-bit - so the next step is ARMv8 with 64-bit support (and not one second too soon - tablets and flagship phones already have 1GB of RAM after all). There might be some noteworthy incremental changes to the main ISA (e.g. the M3/R4 support hardware integer division but the A5/A8/A9 do not; it was only finally added on the A15!) but the more noticeable additions will likely be on the NEON side - especially FP64 support (which is currently only available on the VFP) and instructions targeted at specific algorithms and applications. Might NEON even evolve into a 256-bit ISA like Intel's AVX? That seems unlikely: silicon partners will not readily embrace an expensive feature that results in no performance improvement for any existing program (and which is not useful for most anyway), and ARM is already selling handheld GPU IP with a strong emphasis on OpenCL.

But what's much more interesting than the ISA changes is how ARM might increase performance further while remaining within the power budget of a smartphone. The A15 back-end is already very powerful (8 issue ports!) so the first priority should be to improve the front-end. There are two ways to do that: increase the number of instruction decoders to dispatch more instructions, or add simultaneous multithreading (SMT aka Hyper-Threading). In fact, ARM has indicated that SMT is on their roadmap (only one source but it's credible and we've heard the same).

SMT: Why now?

This is actually a big reversal for ARM: back in 2006, John Goodacre criticised SMT for thrashing the L1 cache and said (as paraphrased by The Inquirer) that "SMT was only really useful as a way to rescue a modicum of performance from when you have a mismatch between the processor's speed and the rate at which you can feed it something to crunch on". Exactly! And that's precisely why SMT makes more sense for ARM today than it did back then with the ARM11 MPCore and Cortex-A9 MPCore for which Goodacre was responsible.

There are many possible reasons why you can't always feed the processor. One critical problem for the Pentium 4 and its very long pipeline was branch misprediction: everything must be flushed, which means the execution units will be starved for many cycles unless another thread has instructions ready to issue. This wouldn't help ARM as much with its shorter pipeline but could at least indirectly save the cost of an even more expensive branch predictor (which also suffers from diminishing returns). With Out-of-Order Execution you could also get away with a smaller window size for executing around data dependencies (and since an in-order core doesn't look ahead at all, SMT helps it even more).

Another possible reason is instruction mix. It is very common for SIMD instructions (and to a lesser extent load/stores and multiplies) to be executed in bursts: for 10000 cycles you might be running optimised physics code on the NEON unit, and then nothing for another 10000 cycles. So half the time you're starving your integer units and the other half you're starving NEON; with SMT, you could theoretically double your NEON utilisation (although less in practice as you're limited by the instruction decoders).
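The burst argument can be made concrete with a toy model. The sketch below is a deliberately idealised illustration rather than a simulator: the function name `neon_utilisation` and the phase lists are hypothetical, and it assumes the front-end never limits issue (which, as noted, it would in practice).

```python
def neon_utilisation(phases_a, phases_b=None):
    """Fraction of cycles in which the shared NEON unit has work to do.

    Each phase list holds (kind, cycles) tuples, where kind is "neon" or "int".
    With one thread the unit is busy only during that thread's NEON phases;
    with two SMT threads it is busy whenever either thread is in a NEON phase.
    Assumption: the decoders are never the bottleneck.
    """
    def expand(phases):
        timeline = []
        for kind, cycles in phases:
            timeline.extend([kind == "neon"] * cycles)
        return timeline

    a = expand(phases_a)
    if phases_b is None:
        return sum(a) / len(a)
    b = expand(phases_b)
    n = min(len(a), len(b))
    busy = sum(1 for i in range(n) if a[i] or b[i])
    return busy / n


# Two threads with the 10000-cycle bursts from the text, exactly out of phase.
thread_a = [("neon", 10000), ("int", 10000)]
thread_b = [("int", 10000), ("neon", 10000)]

print(neon_utilisation(thread_a))            # 0.5
print(neon_utilisation(thread_a, thread_b))  # 1.0
```

The doubling only appears because the two threads are perfectly out of phase; if both threads burst at the same time (`neon_utilisation(thread_a, thread_a)`), the model reports no gain at all.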

This is also AMD's justification for sharing a single FP block between two INT blocks on Bulldozer. In many ways, Bulldozer is SMT on steroids: on the front-end it's extremely similar (including a shared instruction cache) but then on the back-end it sports per-thread integer units and data caches. The idea of replicating the execution units is interesting but it's not clear that it makes sense in ARM's case (e.g. RealWorldTech mentions sharing the front-end x86 overhead as one justification) and it's probably more appealing to simply increase multithreaded performance at relatively low cost. But given ARM's comments about SMT cache thrashing, one trick that could be borrowed from Bulldozer is dedicated L1 data caches per thread (although there's an even more elegant possibility: dynamically partition L1 based on each thread's behaviour and cache hit/miss rates rather than leaving it up to the cache controller, as proposed by some academic papers that remain untested in practice).

However, the biggest and most obvious reason is memory latency: for an in-order processor even a load that always hits L1 cache is very expensive if it stalls your pipeline because the next instruction depends on it (it's the compiler's job to prevent that but it's not always possible); this is why the Intel Atom benefits so much from SMT. And while a good OoOE processor can easily reorder many things around an L1 or even L2 cache miss, there's not much it can do when waiting for external memory that is effectively more than 100 cycles away; this is the main benefit from SMT for an already very advanced and balanced architecture like Intel's Nehalem. Better cache miss tolerance is a big deal and this becomes even more important as CPU frequencies go up (the Cortex-A15 on 20nm will probably be clocked 3x as high as the Cortex-A8 on 65nm) while memory latency remains roughly constant (so for a given latency in nanoseconds, the CPU will experience a higher latency in cycles).
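The frequency argument is simple arithmetic, sketched below. The 100 ns latency and the 800 MHz / 2400 MHz clocks are hypothetical figures chosen only to reproduce the 3x clock ratio mentioned above; they are not measurements of any real part.

```python
def miss_latency_cycles(latency_ns, clock_mhz):
    """Convert a fixed memory latency in nanoseconds into CPU cycles."""
    return latency_ns * clock_mhz / 1000


# Hypothetical numbers: same DRAM latency, clock scaled by 3x.
DRAM_LATENCY_NS = 100
slow_core = miss_latency_cycles(DRAM_LATENCY_NS, 800)
fast_core = miss_latency_cycles(DRAM_LATENCY_NS, 2400)

print(slow_core, fast_core, fast_core / slow_core)  # 80.0 240.0 3.0
```

With the latency in nanoseconds held constant, the number of cycles a stalled thread wastes scales linearly with the clock, which is exactly the window a second SMT thread could fill.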

An optimistic guess

Given all these considerations (and more) it could be argued that SMT would make little sense on the ARM11/A5/A9, a lot of sense on the A8/A15, and even more sense on an architecture with more instruction decoders and even higher frequencies (thanks mostly to process technology). So if SMT looks like a sure bet, what else might change? One attractive feature of the Cortex-A9 and Cortex-A15 is their ability to share L2 cache for up to 4 cores; but that's also a weakness as it limits the performance of that cache for any individual core which only becomes a bigger problem with SMT. Intel went completely in the other direction with Nehalem with a small but very fast dedicated L2 and a much larger shared L3 cache - it's possible that ARM will also be forced down that road (or they might at least only allow sharing the cache between 2 cores/4 threads to reduce contention).

So if this educated speculation is correct, what is a possible architecture for ARM's next-generation CPU IP? Here's what we might expect if we were optimistic:
- 15+ Stages Out-of-Order with SMT and 4 Instruction Decoders
- 3xSimple/1xBranch/3xNEON-VFP/1xMultiply/2xLoad-1xStore
- Choice of basic (1x2xFMAC) or fast (2x4xFMAC) NEON
- 64KB Partitioned Data L1 (static or dynamic?)
- Dedicated Per-Core L2

In a way, that's a natural evolution of the Cortex-A15. But if we step back a little we see it's a monster: in theory we're talking per-clock performance higher than anything AMD has ever produced, and if it's true that the Cortex-A15 uses about 15-16 gates per pipeline stage (as ARM implies) then ARM could already hit clock speeds comparable to most x86 processors if implemented on the same process with the same design techniques. Even if it's based on ARM instead of x86 and optimised for power efficiency every step of the way, would something this big really make sense for smartphones? I'm not sure it would in the 20nm/14nm timeframe.

And ARM might not want to push the performance envelope beyond what the market requires as their partners will be forced to license their new processor IP anyway because of ARMv8 and 64-bit support. It's probably better to start with something more conservative and then go for the kill later. What we described above looks more like a next-next-generation part in retrospect. Or maybe it could still be a next-generation architecture - for NVIDIA's Project Denver which targets PCs and HPC (but not smartphones). It's certainly the level of performance you'd expect but the architecture itself might be quite different.

A conservative guess

So what would that more conservative evolution look like? It could basically be a Cortex-A15 with ARMv8 and SMT plus minor improvements. You'd still have 3 instruction decoders and a similar back-end and the discussion about potentially partitioned/dedicated L1 Data caches still applies (although it might also be considered not worth the cost/trouble). Besides various microarchitectural tweaks here and there, a simple idea to improve performance very slightly would be to increase the number of simple ALUs from 2 to 3; however this may not be worth the incremental cost with only 3 instruction decoders.

One more interesting area to consider is the load/store unit: the Cortex-A15 can issue 1 load and 1 store simultaneously but not 2 loads; this is a fair trade-off between complexity and performance but given the ratio of stores-to-loads in real code, it's still an area for improvement (e.g. Sandy Bridge went for 2 loads+1 store). If the L1 Data Cache is partitioned rather than shared between the threads, it shouldn't be as expensive to support 2 loads (reusing the store port) if and only if they originate from 2 different threads - but once again, that's not necessarily worth the cost.
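To see why the load/store mix matters, here is a rough throughput-only sketch. It assumes a hypothetical instruction mix with twice as many loads as stores, ignores dependencies and cache misses, and also ignores the restriction that the two loads come from different threads, so it is an upper bound on the benefit rather than a prediction. Both function names are made up for illustration.

```python
import math


def cycles_split_ports(loads, stores):
    """One load-only port plus one store-only port (1 load + 1 store per cycle)."""
    return max(loads, stores)


def cycles_shared_port(loads, stores):
    """One load-only port plus one port that accepts either a load or a store."""
    # Stores can only use the shared port; total traffic is spread over two ports.
    return max(stores, math.ceil((loads + stores) / 2))


# Hypothetical mix: 200 loads and 100 stores.
print(cycles_split_ports(200, 100))  # 200
print(cycles_shared_port(200, 100))  # 150
```

When loads and stores are perfectly balanced, both arrangements need the same number of cycles, so the extra flexibility only pays off for load-heavy code.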

The Monster's Little Brother

Furthermore, it's not difficult to imagine products that don't need that level of performance eventually wanting ARMv8 for 64-bit support, so what might this core's smaller derivative look like? The Cortex-A5 is already quite slow today and coming up with yet another effectively single-issue design several years later would be ridiculous. So if you've got 2 instruction decoders, you've got 3 plausible design choices: an In-Order core like the Cortex-A8, an Out-of-Order core like the Cortex-A9, or an In-Order core with SMT like the Intel Atom. The last two options are more likely by far, and personally I believe the last one makes the most sense for both market segmentation and technical reasons. If a customer really cared about single-threaded performance (and also wanted ARMv8), you'd be better off selling them a single high-end core, so it makes more sense to focus on aggregate performance here.

One factor we didn't previously consider is variable pipeline length and OoOE variants. Before the Cortex-A9, all processors from ARM had an in-order fixed length pipeline (and the Cortex-A5 still does). But ARM skipped multiple possible steps on the way from the A5/A8 to the A9 (resulting in something very similar to most x86 processors but with smaller buffers). There are three parts to consider separately: the Reorder Buffer (ROB), Out-of-Order Issue, and register renaming. Depending on your exact definition of the terms, it's possible to have only the first (Marvell PJ1), only the second (Qualcomm Snapdragon), only the first two (maybe Marvell PJ4), only the last two (maybe MIPS 74K), or all three (Cortex-A9/A15). The analysis of these variants goes beyond the scope of this article (but this page is not a bad overview).

So here's one strong contender for the lower-end ARMv8 core:
- 9+ Stages In-Order with SMT and 2 Instruction Decoders
- Variable length pipeline (requires basic ROB)
- ALU+MUL/ALU/1xLoad-Store/1xNEON
- Choice of VFP or basic NEON
- Shared L2 (partitioned L1?)

Mixing It Up

This brings us to our final (and possibly most important) idea: support for heterogeneous multiprocessing. Marvell recently introduced this with the Armada 628 - it's not a new idea, but it's rarely applied and it's the first time anyone tried it for ARM. The Armada 628 has three PJ4 cores (more on that later) but they're not identical despite sharing their L2 cache: two of them have been optimised for speed (up to 1.5GHz) with high leakage transistors and more aggressive synthesis, while the third has been optimised for low power (up to 624MHz) with low leakage transistors and power-optimised synthesis. If you wonder why this is a very good thing, just consider background multitasking: even when your Android phone is sleeping, various applications frequently wake up the CPU. They don't need a lot of processing so just a few tens of MHz could be enough, but performance-optimised processors are very inefficient at low frequencies (their leakage is very high and their voltage cannot scale lower than the minimum, which is already enough for hundreds of MHz).
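The inefficiency at low frequencies can be illustrated with a first-order power model. This is only a sketch under stated assumptions: the leakage and dynamic-power figures are invented for illustration (they are not Armada 628 data), voltage is assumed to be pinned at its minimum so dynamic power scales linearly with frequency, and the core is assumed to stay powered for the whole task.

```python
def task_energy_mj(work_mcycles, freq_mhz, leak_mw, dyn_mw_per_mhz):
    """Energy in millijoules to run work_mcycles million cycles at freq_mhz."""
    seconds = work_mcycles / freq_mhz
    power_mw = leak_mw + dyn_mw_per_mhz * freq_mhz
    return power_mw * seconds


def leakage_share(freq_mhz, leak_mw, dyn_mw_per_mhz):
    """Fraction of total power that is leakage at a given frequency."""
    return leak_mw / (leak_mw + dyn_mw_per_mhz * freq_mhz)


# Hypothetical cores: all figures below are assumptions, not vendor data.
FAST = dict(leak_mw=100, dyn_mw_per_mhz=0.5)   # high-leakage, speed-optimised
SLOW = dict(leak_mw=5, dyn_mw_per_mhz=0.25)    # low-leakage, power-optimised

# A background task of 50 million cycles paced at 50 MHz.
print(task_energy_mj(50, 50, **FAST))  # 125.0
print(task_energy_mj(50, 50, **SLOW))  # 17.5
print(leakage_share(50, **FAST))       # 0.8
```

In this model leakage is 80% of the fast core's power at 50 MHz but only about 12% at 1500 MHz, which is why it pays to keep light background work off the speed-optimised cores.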

So heterogeneous cores are a perfect solution, and indirectly they allow you to get away with higher leakage transistors (resulting in even higher frequencies) for your high performance cores, since those don't need to be activated until they can be kept reasonably busy (by moving the thread from the low power core to a high performance core when it's saturated). In Marvell's implementation, the 1.5GHz cores would almost never be clocked below 624MHz so their leakage is never a disproportionate part of their power consumption. This is something that ARM partners who license standard cores like Texas Instruments and Samsung probably couldn't do without some custom engineering from ARM - Marvell's advantage from being an architecture/ISA licensee is evident here. NVIDIA might be able to do it on their own (they implement more of the peripheral IP around the processor themselves than most partners, e.g. in-house L2 cache controller) but even that is far from certain. Either way it won't happen for the early Cortex-A15 SoCs and at best we can hope that second-generation ones on 20nm pull the same trick as Marvell.

But why stop there? A more radical implementation could use not only the same processor synthesised in different ways but also different processors with the same ISA. While not strictly necessary with PJ4s, when you're talking about a monster like the Cortex-A15 you're leaving more power efficiency on the table for the sake of performance you don't need. So it becomes attractive to couple it with a much simpler architecture like the Cortex-A5 (or our proposed ARMv8 derivative above). It might even be appealing to couple multiple A15s with multiple A5s and always power the A5s first for highly multithreaded workloads. Obviously that's not possible with this generation of processors but it's an interesting possibility in the long term, and heterogeneous multiprocessing is a good example of the kind of innovation that will be necessary to bring performance to the next level within a smartphone's power envelope.

Processor innovation in handhelds isn't dead yet... far from it!