ARM Cortex-A9
The ARM Cortex-A9 is a high-performance, power-efficient 32-bit processor core developed by Arm, implementing the ARMv7-A architecture and designed for embedded applications in low-power, thermally constrained, and cost-sensitive devices.[1] Introduced on March 31, 2008, with its initial revision (r0p0), it supports the ARM, Thumb, and Thumb-2 instruction sets, enabling versatile execution in single-core or multi-core configurations.[2] Key features of the Cortex-A9 include a dual-issue, partially out-of-order 8-stage superscalar pipeline for enhanced instruction throughput, dynamic branch prediction, and configurable L1 caches of 16KB, 32KB, or 64KB per core, with support for an optional unified L2 cache up to 8MB.[1] It incorporates the ARMv7 Memory Management Unit (MMU) for virtual memory handling, TrustZone security extensions for protected execution environments, and optional NEON Advanced SIMD and Vector Floating-Point (VFPv3) units for multimedia and signal processing acceleration.[2] The multiprocessor variant, known as Cortex-A9 MPCore, scales to up to four cores with cache coherency via the Accelerator Coherency Port (ACP) and a Snoop Control Unit (SCU), facilitating symmetric multiprocessing (SMP) in systems requiring parallel performance.[1] In terms of performance, the Cortex-A9 delivers over 50% improvement in single-core efficiency compared to its predecessor, the Cortex-A8, while maintaining low power consumption suitable for battery-operated devices; it also integrates CoreSight components for comprehensive debug and trace capabilities.[1] Widely deployed since its launch, the core powers applications in smartphones, digital TVs, consumer electronics, and enterprise systems, with notable implementations in devices such as the NVIDIA Tegra 2, STMicroelectronics SPEAr1300, and Texas Instruments OMAP4 SoCs.[3] Its maturity and configurability as either speed-optimized or power-optimized IP have made it a foundational choice for Arm-based 
system-on-chips (SoCs) in the late 2000s and early 2010s.[1]

Introduction and History
Development Timeline
The ARM Cortex-A9 was developed by ARM Holdings as part of the ARMv7-A architecture family, succeeding the single-core Cortex-A8 and emphasizing multi-core scalability to address increasing performance needs in mobile devices.[4] ARM officially announced the Cortex-A9 single-core and MPCore multi-core processors on October 8, 2007, at the ARM Developers' Conference in Santa Clara, California, highlighting their support for up to four cache-coherent cores based on the ARMv7 instruction set.[5][6] The initial processor release occurred in 2008, with first silicon samples becoming available in late 2009; early demonstrations included ST-Ericsson's multiprocessing implementation running Symbian OS at a private event in February 2009.[7][8] Commercial availability began in 2010, as volume shipments of Cortex-A9-based silicon entered multiple market segments, including smartphones and embedded systems, with key partnerships such as ST-Ericsson enabling rapid adoption through early implementations like the U8500 platform.[7][9]

Position in ARM Portfolio
The ARM Cortex-A9 serves as a high-performance, out-of-order processor core within the ARMv7-A architecture profile, designed specifically for applications processors in devices requiring robust computational capabilities while maintaining power efficiency.[3][10] It introduced partial out-of-order execution to the ARM portfolio, marking a significant advancement over its predecessor, the Cortex-A8, which relied on an in-order pipeline and emphasized single-core implementations for simpler mobile applications.[11] In contrast, the Cortex-A9 supported multi-core configurations, paving the way for its successor, the Cortex-A15, which further refined out-of-order processing with enhanced superscalar capabilities for even higher performance demands.[4] Targeted at markets such as smartphones, tablets, and embedded systems, the Cortex-A9 balanced high performance with low power consumption, making it suitable for thermally constrained and cost-sensitive environments where multimedia and general-purpose computing were key.[3] Within the broader ARMv7-A family, it was positioned above lower-power options like the Cortex-A5, optimized for minimal area and energy use in basic embedded tasks, and the Cortex-A7, which focused on efficiency for entry-level devices with performance approaching the A9 in a smaller footprint.[12][13] ARM offered the Cortex-A9 under a flexible licensing model, providing it as synthesizable intellectual property (soft core) in RTL format for custom integration across various process nodes, or as pre-optimized hard macros tailored for specific manufacturing processes to accelerate time-to-market and ensure performance guarantees.[14][3] This approach enabled scalability, including dual-core configurations, to meet diverse system requirements without overhauling the core design.[1]

Core Architecture
Processor Microarchitecture
The ARM Cortex-A9 processor employs an out-of-order superscalar microarchitecture to deliver high performance in embedded and mobile applications, implementing the ARMv7-A architecture with support for the Thumb-2 instruction set for efficient code density.[1] This design incorporates dynamic scheduling, allowing instructions to execute out of program order when dependencies permit, thereby maximizing resource utilization and reducing stalls in the execution pipeline. The integer pipeline consists of up to 8 stages, enabling efficient handling of speculative execution while balancing power and area constraints typical of ARM's application processors.[1][15] A key aspect of the microarchitecture is its support for dual-issue in integer operations, where up to two instructions can be dispatched per cycle from a variable-length decoder that processes the mixed 16- and 32-bit Thumb-2 encodings.[1] This partially out-of-order model applies primarily to integer execution, with load/store operations also benefiting from dynamic reordering to overlap memory accesses effectively. Branch prediction is handled by a two-level dynamic predictor built around a configurable Global History Buffer (GHB) of 1024, 2048, 4096, 8192, or 16384 entries, complemented by a Branch Target Address Cache (BTAC) and a return stack to anticipate control flow and minimize misprediction penalties.[16][17] The core's scalability allows configuration as a single processor or in multi-core setups, such as the dual-core variant in the Cortex-A9 MPCore, where coherence between cores is maintained by a Snoop Control Unit attached to the AMBA AXI interconnect.[18] This flexibility enables designers to tailor the processor for varying performance needs while integrating with AMBA-based system buses for instruction, data, and peripheral access.

Pipeline and Execution Units
The ARM Cortex-A9 features an 8-stage integer pipeline designed for out-of-order execution, enabling superscalar processing with up to two instructions issued per cycle in optimal conditions.[2] The pipeline stages consist of fetch, where instructions are retrieved from the instruction cache; decode, which can process up to two instructions simultaneously; rename, for register renaming to handle dependencies; dispatch, allocating instructions to appropriate queues; issue, scheduling ready instructions to execution units; execute, performing the computations; writeback, returning results to the register file; and retire, committing instructions in program order while handling exceptions.[2] This structure supports speculative execution to minimize stalls from branches and dependencies.[2] The execution units include two integer arithmetic logic units (ALUs) for handling address calculations and general-purpose operations, a dedicated multiply-accumulate (MAC) unit for multiplication and accumulation tasks, and a load/store unit capable of one load and one store operation per cycle.[2] These units allow for concurrent processing of up to four instructions in a cycle, including two ALU operations, one memory access, and one branch, enhancing throughput in integer workloads.[2] Floating-point operations are supported through an integrated VFPv3 unit, which features a separate pipeline for scalar floating-point instructions compliant with IEEE 754.[19] The VFPv3 unit sustains one double-precision multiply-accumulate (VMLA) operation every two cycles; note that VFPv3 multiply-accumulate is a chained operation with intermediate rounding, as the fully fused VFMA instruction arrived only with the later VFPv4 extension.[19][20] In multi-core configurations, the Snoop Control Unit (SCU) manages cache coherence by implementing a snooping protocol that ensures data consistency across up to four cores through directed snoop requests and responses.
Power efficiency is enhanced via clock gating, which disables clocks to inactive pipeline stages and units, and power gating, allowing individual cores to enter low-power states while supporting dynamic voltage and frequency scaling.

Memory Hierarchy
The ARM Cortex-A9 processor features a multi-level memory hierarchy optimized for high-performance embedded applications, comprising Level 1 (L1) caches tightly integrated with the core, an optional external Level 2 (L2) unified cache, a two-level Translation Lookaside Buffer (TLB) for address translation, and a Memory Management Unit (MMU) for virtual memory support. This design balances low-latency access with scalability in single- and multi-core configurations, leveraging the ARMv7-A architecture.[2] The L1 caches are Harvard-style, with separate instruction and data caches that are configurable in size to 16 KB, 32 KB, or 64 KB per cache. Both are 4-way set-associative with 32-byte cache lines, enabling efficient prefetching and branch target buffering integration. The data cache operates in write-back mode to minimize bus traffic, supporting write-allocate policies for cacheable regions.[2] The L2 cache is a unified, external structure implemented via the ARM PrimeCell PL310 controller, configurable from 128 KB to 8 MB in 128 KB increments and typically organized as 16-way set-associative. It connects to the core through dedicated AXI master interfaces, providing shared access in multi-core setups and supporting exclusive caching modes to avoid data duplication between L1 and L2 levels. The TLB architecture uses a two-level hierarchy to reduce MMU lookup overhead. The first level includes separate micro-TLBs: a 32-entry fully associative data micro-TLB and a configurable 32- or 64-entry instruction micro-TLB. 
The second-level main TLB is unified for instruction and data, implemented as a configurable 2-way set-associative array of 64 to 512 entries plus four fully associative lockable entries, allowing selective retention of critical translations.[2] The MMU provides comprehensive virtual-to-physical address translation and protection, supporting 4 KB small pages as the base granule, along with 64 KB large pages, 1 MB sections, and 16 MB supersections in the standard ARMv7 short-descriptor configuration. In multi-core variants, the Cortex-A9 employs AMBA AXI interfaces—typically two 64-bit AXI masters per cluster—for all external memory accesses, with the Snoop Control Unit (SCU) ensuring cache coherence by snooping transactions and broadcasting invalidations across cores. This AXI-based setup supports system-level interconnects while maintaining low-latency coherence for up to four cores.

Key Features
SIMD and Vector Processing
The ARM Cortex-A9 incorporates the NEON advanced SIMD extension as part of its ARMv7-A architecture, providing a dedicated media processing engine for vector operations. The NEON unit is 128-bit wide, enabling parallel processing of multiple data elements within this vector length, and features a register file consisting of 32 64-bit registers (equivalent to 16 full 128-bit vectors) that support both integer and floating-point operations. These registers are shared with the VFPv3 unit, allowing seamless integration between scalar and vector floating-point computations. Integer operations handle unsigned and signed data types from 8-bit to 64-bit, including polynomial arithmetic over GF(2), while floating-point support focuses on single-precision (32-bit) formats, with limited double-precision scalar capabilities.[21] NEON instructions enable efficient vector arithmetic, such as VADD for element-wise addition and VMUL for multiplication, operating on vectors with up to 16 elements (e.g., sixteen 8-bit integers or four 32-bit floats per 128-bit vector). These instructions incorporate saturation modes to prevent overflow by clamping results to the representable range, and rounding modes for precise shifts and conversions, enhancing accuracy in signal processing tasks. Integration with VFPv3 extends this to vectorized floating-point operations, including multiply-accumulate (VMLA) instructions that compute a*b + c; on the Cortex-A9 these are chained operations with intermediate rounding, as the fully fused VFMA instruction was introduced only with the later VFPv4 and Advanced SIMDv2 extensions. Multiply-accumulate applies to both scalar and vector forms, supporting up to four single-precision elements per instruction.[22] In terms of performance, the NEON unit can achieve up to 8 single-precision floating-point operations per cycle when leveraging the Cortex-A9's dual-issue capability, where two NEON instructions (e.g., a multiply followed by an add) are dispatched simultaneously to the execution pipelines.
Such throughput is realized in multimedia acceleration scenarios, such as H.264 video decoding, where NEON handles motion compensation and inverse transforms on multiple pixel blocks in parallel, and 3D graphics processing, including vertex shading and texture filtering. These capabilities make NEON particularly suited for embedded applications requiring efficient handling of audio, video, and image data streams.[23][24]

Integer and Floating-Point Operations
The ARM Cortex-A9 processor implements scalar integer operations as part of the ARMv7-A architecture, supporting both the traditional 32-bit ARM instruction set and the Thumb-2 instruction set, which combines 16-bit and 32-bit instructions to achieve better code density while maintaining performance comparable to ARM instructions.[2] All scalar integer operations feature conditional execution, enabling instructions to execute only if specified conditions (such as equality or greater-than) are met, which helps minimize branching and improve pipeline efficiency. Additionally, the architecture includes media-oriented instructions for digital signal processing (DSP) tasks, such as SMLAD, which performs two 16-bit signed multiplies followed by a 32-bit addition, useful for audio and image processing applications. Cycle timings for integer operations vary by instruction type but emphasize low latency for common arithmetic. Basic data-processing instructions like ADD and SUB complete in a single cycle, allowing high throughput in sequential computations.[25] Multiply operations, such as MUL for 32-bit results, typically require 3-5 cycles depending on operand size and whether accumulation is involved, balancing precision with performance.[25] The Cortex-A9 does not implement the optional SDIV and UDIV hardware divide instructions, which appeared on later ARMv7-A cores such as the Cortex-A7 and Cortex-A15; integer division is instead performed by compiler-supplied library routines built from multiplies and shifts. These timings assume in-order execution without interlocks; out-of-order execution in the Cortex-A9 can further optimize overall performance by scheduling dependent operations.[2] For floating-point operations, the Cortex-A9 integrates an optional Vector Floating-Point (VFPv3) unit that handles single-precision (32-bit) and double-precision (64-bit) computations in compliance with the IEEE 754 standard, providing robust support for scientific and graphics workloads.
The VFPv3 unit includes chained multiply-accumulate operations (VMLA and VMLS), which combine multiplication and addition in a single instruction; unlike the fused VFMA of the later VFPv4 extension, they round the intermediate product before the addition. Floating-point addition and subtraction require 3 cycles, enabling efficient scalar math in loops, while division operations range from 14 cycles for single-precision to 28 cycles for double-precision, due to the iterative algorithm employed. These timings position the VFPv3 as a high-performance coprocessor when enabled, though it can be disabled for power savings in integer-only applications. The Cortex-A9 also supports the optional Jazelle extension, which accelerates Java bytecode execution by allowing direct hardware interpretation of most bytecodes as a third execution state alongside ARM and Thumb modes, though it is rarely utilized in modern implementations due to advancements in just-in-time compilation.[26]

Security and Virtualization Support
The ARM Cortex-A9 processor incorporates ARM TrustZone technology, which provides hardware-enforced isolation between a secure world for sensitive operations, such as cryptographic processing, and a normal world for general-purpose computing. This separation is achieved through a dedicated secure state in the processor, where the secure world maintains exclusive access to protected resources while the normal world operates under restricted privileges. All bus transactions originating from the processor include a Non-Secure (NS) bit, which tags accesses as secure or non-secure, enabling peripherals and memory systems to enforce isolation at the hardware level.[27][28] The Cortex-A9 does not implement the ARMv7 Virtualization Extensions: hardware-assisted two-stage address translation, the hypervisor execution mode, and the Large Physical Address Extension (LPAE), which widens physical addresses to 40 bits, were introduced only with later ARMv7-A cores such as the Cortex-A15 and Cortex-A7. Hypervisors targeting the Cortex-A9 therefore rely on software techniques such as paravirtualization or trap-and-emulate, and the core's physical address space is limited to 32 bits (4 GB). World switching between the secure and normal states is facilitated by Secure Monitor Calls (SMC), which trigger an exception into monitor mode, a privileged secure state dedicated to handling transitions and maintaining isolation. The processor's interrupt controller integrates TrustZone by routing interrupts to either secure or non-secure handlers based on configuration, with secure interrupts typically delivered as FIQs to keep them protected from normal-world software. This dedicated handling prevents unauthorized access and supports real-time secure operations.[27] Integration with the Memory Management Unit (MMU) extends these capabilities by maintaining separate translation table state for the secure and non-secure worlds, where the NS bit determines which world's tables and memory are used during address resolution, preventing non-secure software from tampering with secure mappings.[27][28]

Implementations
Single-Core Configurations
The ARM Cortex-A9 single-core processor, also known as the uniprocessor variant, is implemented as a standalone high-performance core without multi-core clustering, targeting embedded and mobile applications requiring scalable performance. ARM offers this configuration in both synthesizable RTL and hard macro forms to facilitate integration into system-on-chips (SoCs) on advanced process nodes. Hard macros are available on 40 nm and 28 nm processes, enabling optimized area and power for production designs.[29] In terms of operating frequencies, the single-core Cortex-A9 achieves up to 2.5 GHz in speed-optimized hard macro implementations on 28 nm, supporting demanding workloads while maintaining compatibility with ARMv7-A architecture. Typical clock speeds in mobile deployments range from 1 to 2 GHz, balancing performance and thermal constraints in battery-powered devices. Power consumption for a single core is approximately 500 mW at 1 GHz in power-optimized variants, contributing to energy-efficient operation.[30][31] Configuration flexibility is a key aspect of single-core setups, allowing designers to tailor the processor to specific needs. L1 caches can be configured as 16 KB, 32 KB, or 64 KB for both instruction and data sides, with four-way set associativity. An optional unified L2 cache, managed via the L2C-310 controller, supports sizes up to 8 MB for improved memory bandwidth. Additional options include Jazelle hardware acceleration for direct Java bytecode execution and ThumbEE extensions for just-in-time compilation in dynamic environments.[32] ARM delivers the single-core Cortex-A9 as intellectual property (IP) suitable for standalone use, often integrated via the uniprocessor package that excludes multi-core interconnects. This out-of-order execution design enables efficient instruction throughput, supporting the high clock rates observed in these configurations.[1]

Multi-Core Variants
The ARM Cortex-A9 MPCore implements multi-core configurations to enable symmetric multiprocessing (SMP), with support for up to four cores in a single cluster for enhanced parallelism while maintaining cache coherence.[33] The dual-core variant is the most prevalent implementation, favored in many designs for its balance of performance gains and power efficiency, as quad-core setups can increase thermal and energy demands without proportional benefits in typical embedded workloads.[18][34] In dual-core MPCore setups, the two Cortex-A9 processors share a unified L2 cache configurable up to 8 MB via the PL310 controller, which provides low-latency access and supports speculative linefills to optimize bandwidth.[35] The Snoop Control Unit (SCU) ensures coherency among the L1 data caches of the cores using a snoop-based mechanism that broadcasts cache operations to maintain data consistency across the cluster.[33] This SCU also arbitrates L2 cache accesses and handles evictions, integrating with the cores' AXI interfaces for efficient memory transactions.[33] Cache coherency in multi-core Cortex-A9 systems follows a MESI-like protocol for intra-cluster L1 interactions, while the Accelerator Coherency Port (ACP) allows external AXI masters, such as DMA engines, to issue accesses that are kept coherent with the cores' caches.[33] The integrated Generic Interrupt Controller (GIC) version 1.0 distributes interrupts across cores, supporting up to 224 shared peripheral interrupts (SPIs) with per-core private interrupts for timers and watchdogs, facilitating efficient task scheduling in SMP environments.[33] Performance scaling in dual-core configurations demonstrates near-linear gains in threaded applications, with representative implementations achieving almost 2x the single-core throughput while consuming only about 40% more power, highlighting the architecture's efficiency for parallel workloads.[34]

Integration in SoCs
The ARM Cortex-A9 core was widely integrated into system-on-chips (SoCs) for mobile and embedded applications during the early 2010s, leveraging its ARMv7-A compatibility to enable efficient multi-core processing in power-constrained devices.[1] NVIDIA's Tegra 2, released in 2010, featured a dual-core Cortex-A9 configuration clocked at 1 GHz, marking one of the first mobile SoCs with symmetric multi-processing support for enhanced performance in graphics-intensive tasks. This SoC powered early Android tablets such as the Motorola Xoom and Samsung Galaxy Tab 10.1, combining the CPU with an integrated GeForce GPU for multimedia applications.[36][34] Samsung's Exynos 4210, introduced in 2011 and manufactured on a 45 nm process, incorporated a dual-core Cortex-A9 setup operating at 1.4 GHz, paired with a Mali-400 MP4 GPU to deliver improved graphics rendering for smartphones. It was prominently used in the Samsung Galaxy S II, supporting high-definition video playback and multitasking in mobile environments.[37][38] Apple's A5 SoC, also launched in 2011 on a 45 nm process (later revised to 32 nm), utilized a dual-core Cortex-A9 design clocked at 800 MHz in its iPhone 4S variant, with a higher 1 GHz speed in the iPad 2 configuration; this implementation included custom optimizations for power efficiency alongside a PowerVR SGX543MP2 GPU. The A5 enabled seamless integration in iOS devices, facilitating features like Siri and improved graphics in games.[39][40] Texas Instruments' OMAP 4 series, spanning models like the OMAP4430 and OMAP4460 from 2011 onward, employed dual-core Cortex-A9 processors scalable up to 1.5 GHz, targeted at both consumer mobile devices and industrial embedded systems. 
These SoCs included dedicated hardware accelerators for imaging and video, making them suitable for applications in smartphones like the Motorola Droid RAZR and automotive infotainment.[41][42] An example of a quad-core implementation is the Freescale (later NXP) i.MX 6Quad, released in 2012 on a 40 nm process, featuring four Cortex-A9 cores at 1.0 GHz with integrated 2D/3D graphics acceleration. It has been widely adopted in industrial, automotive, and consumer embedded systems for applications requiring higher parallelism.[43] Other notable integrations included low-cost SoCs for budget tablets, such as Rockchip's RK3066 from 2012, which featured a dual-core Cortex-A9 at up to 1.6 GHz with a Mali-400 GPU to support affordable Android media consumption devices. Some early entrants, such as Allwinner's A10, targeted similar markets with a single Cortex-A8 core instead, underscoring the Cortex-A9's role in bridging performance and cost in emerging consumer electronics.[44]

Applications and Performance
Device Adoption
The ARM Cortex-A9 processor powered several first-generation 4G smartphones, including the Motorola Atrix 4G featuring the Nvidia Tegra 2. These devices marked early adoption in high-speed mobile connectivity, enabling advanced multimedia and multitasking capabilities in the Android ecosystem. In the tablet market, the Cortex-A9 saw significant uptake through the Apple iPad 2, which utilized the custom A5 SoC with a dual-core Cortex-A9 configuration, contributing to over 30 million units sold during its lifecycle and establishing tablets as mainstream consumer devices.[45] Similarly, the Samsung Galaxy Tab 10.1 employed the Tegra 2 SoC with dual-core Cortex-A9, enhancing portability and performance for media consumption in early Android tablets.[46] The processor also appeared in set-top boxes and early smart televisions, notably powering Google TV platforms such as LG's L9 chipset-based models, which integrated a dual-core Cortex-A9 for seamless streaming and app integration.[47] These implementations brought internet-connected features to home entertainment systems, with LG's early Google TV devices like the 47LM6700 series exemplifying the shift toward smart home interfaces. In automotive and embedded applications, the Freescale (now NXP) i.MX6 series, based on single- to quad-core Cortex-A9 configurations, was widely used in infotainment systems for features like navigation, media playback, and connectivity.[48] The i.MX6's scalability supported rugged environments, powering dashboards in vehicles from manufacturers adopting Android Automotive OS precursors.[49] The Cortex-A9 reached its market peak as the dominant processor in the 2011-2013 Android ecosystem, with widespread shipments across licensees enabling billions of devices in smartphones, tablets, and embedded systems.[11] This era solidified its role in driving the explosion of mobile computing.

Benchmark Comparisons
The ARM Cortex-A9 processor exhibits substantial performance gains over the Cortex-A8, delivering more than 50% higher overall performance in single-core setups due to its out-of-order execution and dual-issue pipeline.[3] In integer workloads, it achieves roughly twice the performance of the Cortex-A8 at equivalent clock speeds, while multimedia tasks utilizing NEON SIMD extensions show up to three times the throughput, benefiting from enhanced vector processing and reduced pipeline stalls.[50][51] Benchmark results from Geekbench 2 indicate dual-core Cortex-A9 configurations scoring approximately 800-1000 points, placing them on par with the Intel Atom N450 in contemporary netbook applications.[52][53] Compared to the later Cortex-A15, the A9 is 30-50% slower in CPU-intensive tasks per clock cycle but consumes less power, making it suitable for efficiency-focused designs.[54] Power efficiency stands out at around 1000 DMIPS per watt in 28 nm processes, as evaluated via Dhrystone metrics, with the core rated at 2.5 DMIPS/MHz.[55][56]

| Benchmark | Cortex-A9 result | Comparison context |
|---|---|---|
| Dhrystone | 2.5 DMIPS/MHz | Baseline for power-normalized efficiency at 28 nm.[56] |
| Geekbench 2 (dual-core, ~1 GHz) | ~800-1000 points | Comparable to the Intel Atom N450 in multi-threaded loads.[52][53] |