Early History


Texas Instruments licensed the ARM6 core in 1993, using it in the first single-chip digital baseband for GSM networks in 1995. This deal and the ones that followed gave the ARM ISA enough momentum to become the uncontested standard for handhelds for many years to come, achieving close to 100% market share in what would become the highest-growth market of the semiconductor industry.

ARM had great technology and executed well, but what is it that 'locked in' the industry to the ARM ISA, and eventually led even Intel to acquire their own ARM processor core as part of a lawsuit settlement with DEC (StrongARM and later XScale, which they sold to Marvell in 2006)? Some of it must have been luck. In the 1990s, another company started licensing low-cost/low-power RISC cores: MIPS, then part of Silicon Graphics. But they focused on higher-end markets (including set-top boxes in the mid-90s, where they still dominate today) and so lost the first-mover advantage in phones. Many of the early proprietary real-time operating systems for phones standardised on ARM, and by then it was apparently too late to do anything about it.

To turn back the tide, MIPS decided to target what they hoped would be a very high-growth market: Windows CE from Microsoft, which supported not only ARM but also MIPS, x86, and several other ISAs, including Hitachi's SuperH. And indeed plenty of high-end products were released with MIPS chips at first, but eventually the handheld part of the WinCE market was dominated by ARM and, at the high end, especially by Intel's XScale - in that way, Intel actually helped a bit in consolidating ARM's dominance.

Operating Systems


At the same time, Nokia, Ericsson and Motorola formed a joint venture with Psion to create Symbian, which would become the leading smartphone operating system for more than a decade - and given ARM's dominance in existing phones at these manufacturers (especially Nokia via Texas Instruments), Symbian also standardised on ARM. Once an operating system that allows third-party applications standardises on a single ISA, it's usually game over: Windows XP might support Intel's Itanium architecture, but existing software doesn't work on it, so PCs would likely never have transitioned to Itanium even if it had been competitive with x86. But ISA barriers are not as critical today, and this creates an opportunity both for Intel/x86 in handhelds and for ARM in PCs. We'll discuss Windows 8 for ARM shortly, but let's focus on handhelds first.

Two factors can make an OS fairly ISA-independent: Universal binaries (Multi-ISA) and Just-In-Time compilation (.NET/Java/Silverlight). MeeGo is an example of the former (ARM+x86 native), Android is a mix of the two (Java and ARM+x86 native), and Windows Phone 7 is an example of the latter (.NET for 3rd party, but ARM-only for now). MeeGo has the best ARM+x86 support today, whereas Android x86 is expected to improve significantly but remains subpar for now (version releases lag behind, Google support is limited, and universal binaries are basically nonexistent). It is likely that the immaturity of Android x86 will hurt Intel's smartphone ambitions with its Medfield SoC in 2H11, but that should progressively become a much smaller factor in 2012 and beyond (assuming Google cares enough, e.g. it remains to be seen whether they will push for universal binaries).

In addition to the official support for ARM and (to a lesser extent) x86, Android is also gaining unofficial support for other ISAs such as MIPS and smaller players like META from Imagination Technologies and ARC from Synopsys. All Java-based applications can be supported in theory, although native applications will generally not be cross-compiled for these ISAs alongside ARM/x86. That is arguably irrelevant for markets like set-top boxes, internet radios, or eReaders. On the other hand, it does mean that none of these companies has any real chance in the smartphone applications processor market.

Should this increased competition from Intel and others worry ARM? It certainly should keep them on their toes, but it's not as if they were resting on their laurels before. Their processors have evolved tremendously in the last 15+ years and given ARM's dominant position today, it's only fair that we begin this article with an in-depth analysis of their past, present and future processor IPs.

The Past & Present Architectures of ARM CPUs


From the ARM1 in 1985 to the original ARM7 in 1994, all ARM cores had the same simple architecture: a 3-stage pipeline (Fetch/Decode/Execute) with a mostly single-cycle 32-bit ALU (multi-cycle multiplication starting in ARM2) and a von Neumann memory architecture. The improvements between generations kept the basic pipeline unchanged and were mostly about modes/interrupts, bus/cache support, and other such things (with full 32-bit addressing in ARM6 being noteworthy). This level of performance was still impressive for its time; single-cycle execution and a few clever tricks in the ISA (especially free predication of nearly every instruction, which avoids many short branches) meant it achieved performance per MHz somewhere between a 386 (which required two cycles even for a basic addition) and a 486.
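
As an aside (our own illustration, not from the original text), predication lets a compiler remove a short branch entirely; the ARM sequence in the comment is the kind of code a typical ARM compiler might emit for this C function, shown here purely as a sketch:

```c
/* Hypothetical example: conditional accumulate without a branch.
 * On ARM, a compiler can typically turn the 'if' into a predicated
 * instruction (e.g. CMP r0, #0 ; ADDNE r1, r1, r2), keeping the
 * short 3-stage pipeline full instead of flushing it on a branch. */
int accumulate_if_nonzero(int flag, int acc, int value)
{
    if (flag != 0)
        acc += value;   /* candidate for a predicated ADDNE */
    return acc;
}
```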

ARM7 & Cortex-M3: Thumb & Thumb-2


The ARM7 had many versions (including both synthesisable cores and hard macros) but three variants are noteworthy: ARM7DI (JTAG and ICE for debugging), ARM7TDMI (Thumb 16-bit ISA and single-cycle 8x32 multiplier) and ARM7EJ (5-stage pipeline, Jazelle for Java, and ARMv5 with DSP ISA extensions). The ARM7TDMI is especially important as it apparently remains ARM's second most successful processor ever (beaten only by the Cortex-M3 which is a direct replacement!) - it is used in a ridiculous number of products and market segments, and this remains true even today. An extreme example is PortalPlayer, which used two 90MHz ARM7TDMIs in the PP5002 chip that was selected by Apple for the original iPod. They were acquired by NVIDIA in 2006, and Tegra2 still uses an ARM7TDMI as a cheap coprocessor for system control (whereas TI uses the Cortex-M3 in OMAP4).

A key feature of the ARM7TDMI is the Thumb 16-bit ISA for higher code density. This is achieved mostly by limiting access to 8 of the 16 registers, removing predication, and supporting fewer instructions (and some use a source register as the destination, à la x86). This is a big improvement in terms of raw code density, but you'd often need more instructions to do the same thing, which mitigates the density benefit somewhat and implies a fairly substantial performance cost. You can easily switch back to the ARM ISA (in fact this is required for some things) as part of a branch instruction (BX), but this often wastes an instruction (and a cycle). Thumb is very valuable for many things, but it's clearly a trade-off.
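
A quick way to see the trade-off for yourself (a hedged sketch, not from the original article): build the same C file for ARM and for Thumb and compare the resulting code sizes. The file name and function are only illustrative; -marm and -mthumb are standard GCC options for ARM targets.

```c
/* density_test.c - compile twice and compare .text size, e.g.:
 *   arm-none-eabi-gcc -Os -marm   -c density_test.c -o arm.o   && size arm.o
 *   arm-none-eabi-gcc -Os -mthumb -c density_test.c -o thumb.o && size thumb.o
 * The Thumb object is usually noticeably smaller, at the cost of
 * extra instructions for operations Thumb cannot encode directly. */
unsigned int checksum(const unsigned char *buf, unsigned int len)
{
    unsigned int sum = 0;
    while (len--)
        sum = (sum << 1) + *buf++;  /* simple shift-and-add checksum */
    return sum;
}
```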

Thumb-2 is quite something else. It adds back predication with a new 16-bit instruction (IT sets the condition for up to four following instructions), but most importantly it adds a full set of 32-bit instructions comparable to the ARM ISA that are self-sufficient (including interrupt handling) and require no extra instruction to switch to or from the 16-bit encodings. This results in excellent code density at performance very close to the ARM ISA. It was first released in 2004 on the Cortex-M3, which is not backwards compatible with the original ARM ISA. It is otherwise quite comparable to the ARM7TDMI in its core design (3-stage pipeline, 32x32 multiplier, etc.), although its memory architecture is closer to the ARM9.
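
To make the IT mechanism concrete, here is a small hedged sketch (our own, not from the original article): the comment shows the sort of Thumb-2 sequence a compiler might generate, where a single 16-bit IT instruction predicates the two instructions that follow it.

```c
/* Hypothetical example of code that maps naturally onto an IT block.
 * A Thumb-2 compiler might emit something along the lines of:
 *     CMP   r0, #0
 *     ITE   GE          ; 'Then' if r0 >= 0, 'Else' otherwise
 *     MOVGE r0, #1
 *     MOVLT r0, #0
 * i.e. both outcomes are handled without any branch. */
int is_non_negative(int x)
{
    return (x >= 0) ? 1 : 0;
}
```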

ARM9 & ARM11 and MPCore


ARM launched the ARM9 in 1997 (we'll ignore both the ARM8 and ARM10 here due to their lack of commercial success). They increased its frequency by moving to a 5-stage pipeline (Fetch/Decode/Execute/Memory/Writeback) and, even more importantly, they switched to a Harvard memory architecture (separate instruction and data caches). The combination of the two meant higher performance per MHz, as a data load/store instruction could execute at the same time as the fetch stage was loading an instruction. This is also the first generation with optional floating-point support through the VFP9 coprocessor (similar to, but slower than, the more common VFP11 on the ARM11). Despite being essentially scalar, it has a bizarre and rarely used vector mode (now obsolete) to improve utilisation and code density. It also supports FMAC but isn't fully IEEE 754 compliant at full speed.

When ARM moved to an 8-stage pipeline (2xFetch/2xDecode/3xExecute/Writeback) for the ARM11 in 2002 (4-5 years since the ARM9!) they added dynamic branch prediction (supposedly 85% accuracy). Other major additions include out-of-order completion (e.g. ALU execution can continue before completing a store or ahead of a stalled load - this might also imply more register ports) and integer SIMD instructions (superseded by NEON with the Cortex-A8 but still used in other markets e.g. the Cortex-R4).

The memory system architecture was overhauled to improve performance and paved the way for the ARM11 MPCore in 2004. It was never implemented as more than a single core in handhelds (Tegra1), but it was a stepping stone to the Cortex-A9. It supported a shared L2 cache and snooping/moving data between the L1 CPU caches to minimise external memory accesses, and the Cortex-A9 added hardware processor coherence, which maintains cache and virtual memory map coherence without any costly interrupt. The ARM11 MPCore delivered excellent scaling with independent processes (e.g. multitasking), whereas the A9 improved efficiency further for multiple threads within the same process.

One aspect that highlights the subtle differences between these cores is branching. The ARM7 takes 1 cycle for untaken branches (to process the instruction) and 3 cycles for taken branches (pipeline flush so decode/execute are empty). Surprisingly the ARM9 also usually takes 1 and 3 cycles despite having a 5-stage pipeline because the branch can get resolved in the Execute stage before Memory/Writeback, and the result can be immediately forwarded to the Fetch stage. The Cortex-M3 still doesn't predict branches, but it speculatively fetches instructions for both possible paths (this works because most instructions are 16-bit but it can fetch 32-bit per cycle) so taken branches usually only take 2 cycles (1 cycle to process the branch and 1 flushed stage). Finally, the ARM11 does dynamic prediction and resolves branches in the 3rd Execute stage or before, so branch misprediction penalty is 5-7 cycles.

Cortex-A8


The Cortex-A8 (launched in 2005) was ARM's first superscalar CPU with dual execution pipelines. It has a 13-stage integer pipeline (3xFetch/5xDecode/5xExecute + Writeback - that's really 14 but this is when ARM decided to follow competitors and no longer count the Writeback stage) and an additional 10-stage NEON pipeline (4xDecode/6xExecute) for a total of 23 stages worst-case as NEON is always executed last (so that all branches/exceptions have already been resolved which is the only way to keep everything in-order - and all L1/most L2 cache latency gets hidden for free).

Branch prediction accuracy has supposedly increased from 85% to 95% to counter the 13-cycle branch mispredict penalty, and there are obviously plenty of forwarding paths in the integer pipelines so that instruction results can get to the next instructions quickly despite the 6 execution stages. Only one branch and one load/store instruction may issue per cycle, but there are no restrictions on what each of the two issued instructions can do (the ALUs are symmetric) except for multiplication, which is on the first pipeline (which also means the compiler must position it first in an instruction pairing to dual-issue, but this is a trivial limitation).
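
To put those accuracy numbers in perspective, the small sketch below (our own, using only the figures quoted in this article) turns prediction accuracy and misprediction penalty into an average cost per branch: at 95% accuracy a 13-cycle penalty averages out to roughly 0.65 cycles per branch, versus about 0.9 cycles for the ARM11's 85% accuracy and 5-7 cycle penalty.

```c
#include <stdio.h>

/* Average cycles lost per branch = miss rate x misprediction penalty.
 * The accuracy and penalty figures are the ones quoted in the text. */
static double avg_branch_cost(double accuracy, double penalty_cycles)
{
    return (1.0 - accuracy) * penalty_cycles;
}

int main(void)
{
    printf("ARM11     (85%%, ~6-cycle penalty): %.2f cycles/branch\n",
           avg_branch_cost(0.85, 6.0));
    printf("Cortex-A8 (95%%, 13-cycle penalty): %.2f cycles/branch\n",
           avg_branch_cost(0.95, 13.0));
    return 0;
}
```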

Finally, there's NEON. It is a 128-bit SIMD engine with support for 4xFP32/4xINT32/8xINT16/16xINT8 multiply-accumulate but it also has 64-bit instructions (half vector width reusing the same register file). NEON is implemented as a hybrid 64/128-bit engine on the Cortex-A8 and A9 (little known fact: 128-bit for non-multiplication integer & 64-bit for integer multiplication and floating-point) so 64-bit and 128-bit FP32 instructions run at full and half speed respectively. But on the Cortex-A15 (and Qualcomm's Snapdragon) it is implemented as a full 128-bit engine so both run at full speed (wasting half the peak performance with 64-bit). NEON can dual issue one data-processing instruction and one load/store/permute instruction on the A8, but while the processing pipelines share a single issue port they appear largely independent in terms of silicon.

There are 6 in total: integer MAC, integer shift, integer ALU, FMUL, FADD, load/store/permute (FMAC is done by using FMUL and FADD consecutively). NEON's FP32 is not IEEE compliant so a separate VFPLite unit exists in the A8's NEON to run 'legacy' code (and double precision), but it is unpipelined and therefore slow compared to the older VFP11. The Cortex-A9 recognises this issue and adds a separate high performance VFP which is mandatory if NEON is present and can also be used on its own, but that's a trivial detail compared to the A9's other changes.
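
For readers who have not used NEON, the fragment below is a minimal, hedged sketch of the kind of 4xFP32 multiply-accumulate work described above, written with the standard arm_neon.h C intrinsics (the function name and loop are our own illustration, not ARM's code):

```c
#include <arm_neon.h>

/* dst[i] += a[i] * b[i] for n floats, four lanes at a time.
 * vmlaq_f32 maps onto NEON's 128-bit FP32 multiply-accumulate, which
 * the A8/A9 execute at half speed and the A15 at full speed (as
 * discussed in the text). Requires a NEON-enabled toolchain, e.g.
 * -mfpu=neon; n is assumed to be a multiple of 4 for brevity. */
void madd_f32(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);    /* load 4 x FP32   */
        float32x4_t vb   = vld1q_f32(b + i);
        float32x4_t vacc = vld1q_f32(dst + i);
        vacc = vmlaq_f32(vacc, va, vb);         /* acc += a * b    */
        vst1q_f32(dst + i, vacc);               /* store 4 x FP32  */
    }
}
```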

Cortex-A9


The Cortex-A9 (launched in late 2007) is a huge architectural jump because it supports Out-of-Order Execution (OoOE) with register renaming. The execution units have practically not changed (ALU+MUL/ALU/Load-Store/64-bit NEON), but everything else is extremely different. The integer pipeline is now a much shorter 8+ stages (3xFetch/3xDecode/Issue/1-3xExecute/Writeback; variable-length execution is a given with OoOE and reduces the branch mispredict penalty by making most instructions single-cycle, although multiplies and load-stores take 3 cycles). NEON now works in parallel with integer execution rather than after it to minimise branching cost (made possible by the variable-length pipeline); this is more efficient, and moving data between NEON and core registers is faster. On the other hand, load/store/permute can no longer be co-issued and cache latency cannot be hidden as much as on the A8, so NEON performance is a mixed bag and probably worse on optimised routines.

While only 2 instructions can be decoded per cycle, the dispatcher can issue 3 (+1 to the branch resolution unit), so maximum utilisation is possible when the issue queue has the right mix of instructions available (e.g. after a stall). The processor also has a fast-loop mode (à la Intel's Loop Stream Detector) for short loops (<64 bytes) that removes the branch cost and bypasses the instruction cache to save power. Despite the shorter pipeline and all of these improvements, the Cortex-A9 achieves clock speeds comparable to the Cortex-A8 (although lower than Qualcomm's Snapdragon), and performance per clock seems to have increased by more than 25% in the real world (not as much in Dhrystone or CoreMark because OoOE helps less when all your data fits in the L1 cache).
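
As a rough illustration of the kind of loop the fast-loop mode targets (our own hedged example, not ARM's): a tight kernel like the one below typically compiles to well under 64 bytes of Thumb-2 code, so the whole loop body can be replayed from the loop buffer without re-fetching from the instruction cache.

```c
/* A short, hot loop: likely small enough (well under 64 bytes of
 * Thumb-2 code with -Os) to fit the Cortex-A9's fast-loop buffer. */
unsigned int sum_u8(const unsigned char *p, unsigned int n)
{
    unsigned int sum = 0;
    for (unsigned int i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
```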

Finally, it's interesting (but barely mentioned by ARM) that register renaming works with a data-less ROB and maps the 32x32-bit architectural registers (16x32-bit visible at any given time) into a 56x32-bit Physical Register File (PRF) (à la Intel P4/Sandy Bridge and AMD Bobcat/Bulldozer, but unlike Intel P6/AMD K7 derivatives). This saves area/power at the cost of an additional pipeline stage. Further technical analysis of the Cortex-A9 is beyond the scope of this article, but one intriguing and little-known fact is who designed it. The core team consisted of only 40 engineers in Sophia Antipolis, France. That design centre previously worked on SecurCore, TrustZone, and the ARM11 MPCore.
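
For readers unfamiliar with PRF-style renaming, here is a deliberately simplified toy sketch (our own illustration, in no way ARM's implementation): architectural registers are just names that point into a larger physical register file, a write allocates a fresh physical register from a free list, and the data itself never needs to be copied at retirement because the ROB carries no data.

```c
/* Toy PRF renaming: 16 visible architectural regs, 56 physical regs.
 * Real hardware also tracks in-flight mappings so it can recover on
 * mispredicts/exceptions; that bookkeeping is omitted here. */
#include <assert.h>

#define ARCH_REGS 16
#define PHYS_REGS 56

static int rename_map[ARCH_REGS];      /* arch reg -> phys reg      */
static int free_list[PHYS_REGS];       /* stack of free phys regs   */
static int free_top;

static void rename_init(void)
{
    for (int a = 0; a < ARCH_REGS; a++)
        rename_map[a] = a;             /* identity mapping at reset */
    free_top = 0;
    for (int p = ARCH_REGS; p < PHYS_REGS; p++)
        free_list[free_top++] = p;     /* remaining regs are free   */
}

/* A source operand simply reads the current mapping. */
static int rename_src(int arch_reg) { return rename_map[arch_reg]; }

/* A destination gets a fresh physical register; the old one is
 * recycled once the instruction retires (not modelled here). */
static int rename_dst(int arch_reg)
{
    assert(free_top > 0);              /* otherwise: stall rename   */
    int phys = free_list[--free_top];
    rename_map[arch_reg] = phys;
    return phys;
}
```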

For more information on the A9's architecture, these are by far the three most detailed presentations from ARM: Details of a New Cortex Processor Revealed / Cortex-A9 Processor Microarchitecture / ARM MPCore Architecture Performance Enhancement.

Cortex-A15


If the A9 was revolutionary, the Cortex-A15 (launched in late 2010) is an even bigger evolutionary step (very big indeed at about twice the die area). The pipeline has grown to 14+ stages (5xFetch/7xDecode/Issue/1-4xExecute/Writeback) and while some of that growth is due to increased complexity, a fair bit is also due to higher clock targets as indicated by the multiplier which has increased from 3 stages in both the A8 and A9 to 4 stages in the A15. ARM claims a 40%+ performance improvement per MHz for integer code versus the A9 but interestingly the basic execution units are very similar: two simple ALUs and a multiplier. It's all about increasing their utilisation by boosting everything else (including buffer sizes and branch prediction).

There are plenty of improvements everywhere, but perhaps the most significant is that there is no longer a single issue queue; instead there is one per 'execution cluster' (somewhat similar to AMD's K7/K8/K10). There are 5 execution clusters with 8 independent issue ports in total (2xSimple/1xBranch/2xNEON-VFP/1xMultiply/2xLoad-Store) being fed by 3 instruction decoders (versus 2 in the A9). So it can (very rarely) co-issue 2 simple ALU instructions along with a multiply, a load plus a store (but not 2 loads or 2 stores), and 2 NEON instructions - all in a single cycle. And there's plenty of instruction fetch bandwidth to match (fetch could be a bottleneck on the A8/A9 because of how they handle taken branches).
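
Purely as an illustration of such a mix (our own hedged sketch, not ARM's code): a loop body like the one below offers independent simple ALU operations, a multiply, a load and a store in the same scheduling window, which is the kind of instruction-level parallelism those separate issue queues are designed to keep busy. Whether any of them actually issue together in a given cycle depends on the compiler's schedule and on dependencies.

```c
/* Independent work per iteration: one load, one multiply, simple ALU
 * ops for the accumulator/index updates, and one store. This only
 * illustrates the kind of mix the A15's issue ports can exploit. */
void scale_and_bias(int *dst, const int *src, int scale, int bias, int n)
{
    for (int i = 0; i < n; i++) {
        int v = src[i];            /* load                        */
        v = v * scale;             /* multiply pipeline           */
        v = v + bias;              /* simple ALU                  */
        dst[i] = v;                /* store                       */
    }                              /* loop counter/branch: ALU+Br */
}
```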

As for NEON, the A15 runs 128-bit instructions at full speed as previously mentioned. However the claimed dual-issue support is very ambiguous. As we have seen, the A8's NEON could dual-issue one data-processing instruction and one load/store/permute instruction whereas the A9's could not. It seems likely that ARM not only added back that functionality but also generalised it (so that you can co-issue to any two execution pipelines e.g. ADD+MUL). There are many other possibilities (e.g. 2nd integer 128-bit non-MUL ALU) but we will not enumerate them here.

Another related improvement is out-of-order issue for loads and NEON/VFP (the A9 was in-order for both). While stores issue in-order and loads cannot bypass stores, stores only require the address to be issued (the store unit will wait for the data - a fairly unique approach), so they can be issued more rapidly and are less likely to stall loads or other stores. The progression here is very similar to that from AMD's K8 (in-order loads/stores) to K10 (which cannot issue stores before the data is ready but can issue loads ahead of stores if the address is available). On the other hand, the fast-loop mode evolved similarly to Intel's from Conroe to Nehalem (it now disables the decode stages as well, not only fetch). And strangely enough, ARM's presentation implies register renaming uses a 'result queue' rather than a PRF like the A9, but this seems unlikely and might be an attempt to mislead the competition.

Power Efficiency & Cortex-R4/A5 (+R5/R7)


Overall, the Cortex-A15 is a bigger increase in both performance and cost than the Cortex-A9 was compared to the Cortex-A8. The A9 is closest to AMD's Bobcat out of all existing x86 designs (but with some key differences and slightly lower performance), whereas the A15 is more like an AMD K10 after an ultra-strict diet. But even then, there's still plenty of muscle... and bone. While the A15 is certain to be much more power efficient at extreme performance levels than a significantly overclocked A9, it should still be less power efficient at the performance level of today's A9s if it were implemented on the same process.

This is because (to simplify a complex issue) its die size increased significantly more than its performance. This is to be expected when increasing single-threaded performance per clock (diminishing returns) and the idea is that (up to a certain point) this will still be more efficient than increasing the frequency. But a simpler design nearly always wins at lower performance targets and the same is true for the Cortex-A9 versus the older ARM11. This is even more obvious when we consider ARM11's direct replacements.

The Cortex-R4 and Cortex-A5 are both close relatives of the ARM11: all three have an 8-stage pipeline with a single ALU, dynamic branch prediction (85/90/95% for the ARM11/R4/A5) and an optional VFP. But that's about where the similarities end. The R4 and A5 both support ARMv7/Thumb-2 (and the A5 has a NEON option) and remove the Writeback stage (merged with Execute) in favour of a 3rd Fetch stage (+1 cycle of misprediction penalty). More surprisingly, they are dual-issue: the R4 can co-issue an integer instruction with either a branch, the VFP, or 32-bit loads/stores (but not 64-bit ones), whereas the A5 can only co-issue with a branch. Since the A5 is much more recent (2009) than the R4 (2006), it seems ARM concluded the R4's approach was not worth the cost with a single ALU.

Their selling points are also very different: the A5 has an MMU so it can run a full OS (e.g. Linux), whereas the R4 must run a Real-Time Operating System but has optional reliability features (redundant second core for error checking, ECC/parity, etc.) which make it very attractive in automotive, for example. One large wireless customer is Nokia's baseband group (now part of Renesas), which uses it where nearly everyone else would still have an ARM9/ARM11 and a small DSP - that means it's probably also used in chips like the ST-Ericsson U8500 that are based on Nokia's modem IP.

Shortly before this article's publication, ARM launched the Cortex-R5 and R7. The former is basically an R4 with some new market-specific features (mostly to simplify software development) and support for true dual-core configurations (double the performance, with cache coherency, rather than only a redundant core for error checking). The latter is a Cortex-R version of the Cortex-A9 (also limited to dual-core) with even more market-specific functionality than the R4 (e.g. hard error detection). Strangely, ARM claims an 11-stage pipeline, unlike the A9's 8-11 stages, so maybe it no longer supports a variable pipeline length (because of the real-time market's requirements?) - either way, both are about what you'd expect them to be and will ultimately be found in some LTE and LTE Advanced basebands to replace the existing ARM9/ARM11/R4s.

Is this the end of the road?


ARM's sales pitch for the Cortex-A5 is that its die size and power consumption are comparable to an ARM9's despite achieving higher performance than an ARM11 (and this appears credible). On the other hand, the Cortex-A15's main selling point is extremely high performance and high power efficiency - but it's important to understand that this power efficiency is only relative to what other architectures might need to reach that level of performance. While Dhrystone is severely outdated as a CPU benchmark, it's still instructive to compare Dhrystone MIPS per milliwatt (DMIPS/mW) for both the A9 and A5 at their respective clock targets on TSMC 40G: about 8 DMIPS/mW versus more than 20 DMIPS/mW. In fact the difference is even larger in CoreMark, so it's probably at least 2x in the real world, and it will only get worse with the Cortex-A15.
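
To make the efficiency gap tangible, the toy calculation below (our own sketch, using only the rounded figures quoted above) estimates the power each core would need for the same fixed Dhrystone workload: at roughly 8 versus more than 20 DMIPS/mW, the A5 needs well under half the power of the A9, whenever its performance is actually sufficient for that workload.

```c
#include <stdio.h>

/* Power needed (mW) for a given sustained workload, assuming the
 * quoted efficiency figures hold at that operating point (a big
 * simplification - real efficiency varies with clock and voltage). */
static double power_mw(double workload_dmips, double dmips_per_mw)
{
    return workload_dmips / dmips_per_mw;
}

int main(void)
{
    const double workload = 1000.0;                 /* 1000 DMIPS */
    printf("Cortex-A9 (~8 DMIPS/mW):  %.0f mW\n", power_mw(workload, 8.0));
    printf("Cortex-A5 (>20 DMIPS/mW): <%.0f mW\n", power_mw(workload, 20.0));
    return 0;
}
```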

That's why a quad-core A9 would be more power efficient than a dual-core A15 (and the same is true for 2xA5 vs 1xA9). But obviously they are only comparable for highly multithreaded workloads - the rest of the time the A9s might only be half as fast! Still, given the diminishing returns of further performance boosts and the Cortex-A15's ability to scale beyond four cores, is this the end of the road for CPU architecture evolution in handhelds? Not quite!