
ARM Cortex-A72

The ARM Cortex-A72 is a high-performance 64-bit central processing unit (CPU) core developed by Arm Holdings, implementing the ARMv8-A architecture and designed primarily for premium mobile devices, embedded systems, and automotive applications. Announced on February 3, 2015, it serves as the successor to the Cortex-A57, enabling sustained operation at frequencies up to 2.5 GHz on a 16 nm FinFET process. The core supports configurations of 1 to 4 symmetrical multiprocessing (SMP) cores per cluster, each with a dedicated 48 KB L1 instruction cache and 32 KB L1 data cache, paired with a shared unified L2 cache of up to 4 MB, and is optimized for big.LITTLE heterogeneous computing when combined with efficiency cores like the Cortex-A53.

Key microarchitectural enhancements in the Cortex-A72 include a widened out-of-order superscalar pipeline with 3-wide decode, 8-wide issue, and 5-wide dispatch of micro-operations per cycle, alongside an advanced branch predictor that reduces energy waste from mispredictions. It incorporates low-latency floating-point units, such as a 3-cycle multiply (FMUL) with 40% reduced latency versus the Cortex-A57, and an enhanced load/store unit with prefetching that boosts memory bandwidth by more than 50%. Additional features encompass TrustZone for security, Advanced SIMD (NEON) and VFPv4 extensions for media processing, hardware virtualization support, and compatibility with AMBA 5 CHI or AMBA 4 ACE interconnects for cache-coherent systems.

The Cortex-A72 delivers a 3.5× uplift in sustained CPU performance over 2014-era Cortex-A15-based devices while achieving 75% lower energy consumption at matched workloads, making it suitable for demanding tasks like video decoding at 120 fps, console-quality gaming, and advanced driver-assistance systems (ADAS). First commercial implementations appeared in 2016 SoCs targeting smartphones, tablets, and high-end embedded platforms, with error-correcting code (ECC) support for reliability in automotive and storage environments. Its design emphasizes power efficiency through dynamic voltage and frequency scaling, individual core power-down modes, and dormant states, in line with the trend toward denser, more capable computing devices.

History and Development

Announcement and Release

The ARM Cortex-A72 was publicly announced by ARM on February 3, 2015, during a press event unveiling a suite of IP targeted at premium mobile experiences. Positioned as the successor to the Cortex-A57, the core was designed to deliver greater than 90 percent single-thread performance uplift at the same power envelope compared to the A57, or 20 percent lower power consumption for equivalent performance. It also enabled devices up to 3.5 times faster than those based on the earlier Cortex-A15, with 75 percent lower energy consumption at matched performance levels. Licensing for the Cortex-A72 became available immediately following the announcement, with initial partners including HiSilicon, MediaTek, and Rockchip, and over ten licensees reported by early 2015.

Architectural details were further disclosed in April 2015, highlighting a 20 to 60 percent increase in instructions per cycle over the Cortex-A57, alongside support for clock speeds up to 2.7 GHz. First silicon implementations emerged in 28 nm processes by late 2015, with 16 nm FinFET-based systems-on-chip shipping in mobile devices during 2016, primarily fabricated by TSMC. The Cortex-A72 was primarily developed at ARM's Austin design center in Texas. Concurrent announcements emphasized its role in ARM's big.LITTLE architecture, where it pairs with efficient Cortex-A53 cores to extend performance and reduce energy consumption by an additional 40 to 60 percent across varied workloads.

Design Evolution

The ARM Cortex-A72 was developed as a direct successor to the Cortex-A57, primarily to rectify the predecessor's efficiency problems, which became apparent in high-performance applications. The A57, while delivering strong single-threaded performance, suffered from elevated power consumption due to its aggressive design, including its branch prediction mechanisms, which led to suboptimal energy use in thermally constrained environments. To address this, ARM's design team undertook a comprehensive redesign of key components, including an enhanced branch prediction unit that improved accuracy by approximately 20% and execution units with reduced latencies, resulting in a 20-60% uplift in instructions per cycle (IPC) without proportionally increasing power draw.

Central to the Cortex-A72's design were deliberate trade-offs to balance premium performance for demanding tasks, such as console-level gaming, with stringent power requirements for mobile and embedded systems. The core targeted over 3.5 times the performance of the 28 nm Cortex-A15 at equivalent power levels, while achieving 15-35% better efficiency compared to the A57 through optimizations like suppressing unnecessary register accesses and performing early tag lookups. This efficiency focus enabled the A72 to operate sustainably at frequencies of 2.5-2.7 GHz on 16 nm FinFET processes, providing a 75% reduction in energy consumption relative to 2014-era devices at matched workloads, thus extending battery life in big.LITTLE configurations paired with efficiency cores like the Cortex-A53.

The A72's architecture drew influences from early explorations of flexible core clustering concepts that foreshadowed ARM's later DynamIQ technology, though it remained firmly rooted in the big.LITTLE paradigm with support for homogeneous or heterogeneous multi-core setups via the CoreLink CCI-500 interconnect. Development faced significant challenges in shrinking the die area by 15-20% compared to the A57, accomplished through re-optimization of every logical block, while preserving the 64-bit instruction set focus and backward compatibility with 32-bit code. Conceptual work on internal prototypes began around 2013-2014, aligning with ARM's roadmap for next-generation premium mobile SoCs, and led up to the February 2015 announcement and the detailed architectural reveal in April 2015.

Microarchitecture

Pipeline Design

The ARM Cortex-A72 features a 15-stage out-of-order pipeline designed for high-performance integer and floating-point processing, balancing performance and power consumption in mobile and embedded applications. The pipeline supports 3-wide decode and dispatch of up to five micro-ops per cycle, allowing multiple instructions to progress simultaneously through the front-end stages. Unlike simpler in-order designs, it can reorder instructions to hide latency from dependencies. This out-of-order capability is facilitated by a reorder buffer holding up to 128 entries, ensuring instructions complete in program order despite parallel execution.

Instruction fetch in the Cortex-A72 operates on 16-byte aligned windows, enabling the core to acquire 16 bytes per cycle under optimal conditions without taken branches, thus sustaining high throughput from the 48 KB L1 instruction cache. The decode stage follows with a 3-wide decoder, fusing certain operations into macro-ops that generate an average of 1.08 micro-ops per instruction to enhance parallelism and reduce front-end bottlenecks. For variable-length instruction encodings, the design recognizes instruction boundaries within and across fetch windows, minimizing stalls in dense code sequences.

Branch prediction employs an advanced hybrid mechanism incorporating TAgged GEometric (TAGE)-like components, which learn from branch histories to predict both direction and targets with high accuracy, thereby reducing misprediction penalties. The predictor includes a Branch Target Buffer (BTB) supporting 2,000 large or 4,000 small entries, alongside indirect and return stack predictors, allowing the front-end to redirect fetch efficiently and keep the pipeline full. Power optimizations, such as conditional disabling of the predictor in predictable workloads, further integrate with this unit to lower energy use without compromising performance.

The retirement stage commits up to four instructions per cycle in program order, coordinating with dispatch logic widened to 5 micro-ops per cycle for improved throughput over prior designs. To support this, the Cortex-A72 allocates 64 physical rename registers for integer operations and 128 for floating-point and SIMD operations, compared to the Cortex-A57's unified 128 rename registers, enabling deeper out-of-order windows and better handling of complex dependencies. This rename capacity feeds the execution units described in the next section.
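The front-end figures quoted above can be combined into a rough per-cycle budget. The sketch below is only back-of-the-envelope arithmetic on those figures, assuming fixed 4-byte A64 instructions and no taken branches; the constant and function names are illustrative and do not come from ARM documentation.

```python
# Rough front-end budget using the figures quoted in this section.
FETCH_BYTES_PER_CYCLE = 16    # one 16-byte aligned fetch window
A64_INSTRUCTION_BYTES = 4     # A64 instructions are a fixed 4 bytes
DECODE_WIDTH = 3              # instructions decoded per cycle
UOPS_PER_INSTRUCTION = 1.08   # average micro-ops per instruction
DISPATCH_WIDTH = 5            # micro-ops dispatched per cycle


def frontend_budget():
    """Return (fetched instructions, decoded instructions, average micro-ops) per cycle."""
    fetched = FETCH_BYTES_PER_CYCLE // A64_INSTRUCTION_BYTES
    decoded = min(fetched, DECODE_WIDTH)       # decode is the narrower stage
    avg_uops = decoded * UOPS_PER_INSTRUCTION  # micro-ops handed to dispatch
    return fetched, decoded, avg_uops


fetched, decoded, avg_uops = frontend_budget()
print(f"fetch {fetched}/cycle, decode {decoded}/cycle, ~{avg_uops:.2f} micro-ops/cycle")
print("dispatch has headroom:", avg_uops < DISPATCH_WIDTH)
```

Under these assumptions fetch can supply four instructions per cycle, so decode is the narrower stage. The average micro-op stream also stays below the 5-wide dispatch limit, which leaves headroom for instructions that expand into several micro-ops.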

Execution Units

The ARM Cortex-A72 employs a superscalar, out-of-order engine with specialized hardware units for integer, floating-point/SIMD, and memory operations, enabling efficient instruction throughput while balancing power consumption. These units integrate with the processor's dispatch logic to issue up to three integer or two floating-point operations per cycle, supporting the ARMv8-A instruction set for both 32-bit and 64-bit computations.

Integer execution is handled by three arithmetic logic units (ALUs): two simple ALUs for basic arithmetic, logical, and shift operations, and one complex ALU dedicated to multiplications and divisions. All ALUs support 64-bit operations, with simple ALU instructions typically exhibiting 1-cycle latency and a throughput of two per cycle, while complex operations like 64-bit multiplies incur 3-4 cycle latencies. This configuration allows the core to sustain high integer throughput in compute-intensive and general-purpose workloads.

Floating-point and SIMD processing is managed by a NEON unit capable of 128-bit vector operations, paired with a dual-lane floating-point unit (FPU) for scalar and vector computations in single- and double-precision formats. The FPU achieves up to 4x throughput relative to scalar code for common operations like additions and multiplications on 32-bit elements within 128-bit vectors, with multiply latency reduced to 3 cycles and fused multiply-add latency reduced to 6 cycles compared to prior generations. This setup supports vectorized code for multimedia and scientific applications, issuing two 64-bit scalar FP operations or one 128-bit NEON vector operation per cycle across the two lanes.

Memory operations are executed via a load/store unit with one load port and one store port, enabling up to one 16-byte (128-bit) load or one 8-byte (64-bit) store per cycle, including support for non-temporal hints to bypass caching for sequential data access. A dedicated address generation unit (AGU) calculates effective addresses for these operations, handling unaligned accesses without performance penalties as well as the instructions used for synchronization in multithreaded environments. Store-to-load forwarding occurs with a 7-cycle latency, optimizing dependent chains.

Out-of-order execution is facilitated by a 128-entry reorder buffer (ROB) that tracks dependencies and ensures in-order commit, providing a window for reordering up to 128 macro-operations to hide latencies from branches and memory accesses. This larger ROB, compared to predecessors, helps with irregular code patterns.
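The per-cycle load and store widths translate into peak L1 bandwidth once a clock frequency is chosen. The following sketch multiplies the quoted widths by the 2.5 GHz figure used elsewhere in this article. It gives a theoretical ceiling rather than a measured result, and the function name is illustrative.

```python
def peak_bandwidth_gb_per_s(bytes_per_cycle, freq_ghz):
    """Peak bandwidth in GB/s: bytes per cycle times cycles per nanosecond."""
    return bytes_per_cycle * freq_ghz


LOAD_BYTES_PER_CYCLE = 16   # one 128-bit load per cycle
STORE_BYTES_PER_CYCLE = 8   # one 64-bit store per cycle
FREQ_GHZ = 2.5              # frequency quoted for 16 nm FinFET implementations

load_bw = peak_bandwidth_gb_per_s(LOAD_BYTES_PER_CYCLE, FREQ_GHZ)
store_bw = peak_bandwidth_gb_per_s(STORE_BYTES_PER_CYCLE, FREQ_GHZ)
print(f"peak load bandwidth:  {load_bw:.1f} GB/s")
print(f"peak store bandwidth: {store_bw:.1f} GB/s")
```

Real workloads fall short of these ceilings because of cache misses, bank conflicts, and dependent address calculations.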

Memory and Cache System

Cache Hierarchy

The ARM Cortex-A72 features a multi-level cache hierarchy designed to balance performance and power efficiency in multi-core configurations. Each core includes private Level 1 (L1) caches, consisting of a 48 KB instruction cache and a 32 KB data cache. The L1 instruction cache is 3-way set-associative with 64-byte cache lines and employs a least recently used (LRU) replacement policy; it is physically indexed and physically tagged (PIPT) for efficient instruction fetch. The L1 data cache is 2-way set-associative with 64-byte cache lines, also using LRU and PIPT organization; it supports non-blocking loads, allowing up to six outstanding 64-byte requests to improve tolerance for memory latency. Both L1 caches include optional error protection, with parity for the instruction cache and error-correcting code (ECC) for the data cache, to enhance reliability in demanding applications.

The Level 2 (L2) cache is a unified cache shared by a cluster of up to four cores, configurable in sizes of 512 KB, 1 MB, 2 MB, or 4 MB, and implemented as 16-way set-associative with 64-byte lines. It maintains strict inclusion with respect to the L1 caches, ensuring that all L1 data is also present in L2, which simplifies coherence management but requires careful sizing to avoid excessive power draw from redundant storage. The cache uses a software-programmable replacement policy, with a pseudo-random algorithm among the selectable options, allowing system designers to optimize for specific workloads such as those with predictable access patterns or high randomness. It uses a write-back policy with read allocation on loads and write allocation on stores, and supports up to 128-bit transfers for improved throughput. The cache also features optional ECC protection and a prefetcher that can generate 0 to 3 additional requests per miss, configurable through the CPU Extended Control Register (CPUECTLR), to anticipate sequential accesses in instruction and data streams.

Coherence across the cache hierarchy is maintained through the AMBA ACE (AXI Coherency Extensions) protocol for the L2 interface, enabling efficient snooping and consistency in multi-core clusters without requiring software intervention for inner-shareable domains. The L1 data cache operates under a MESI (Modified, Exclusive, Shared, Invalid) protocol, while the L2 supports an extended MOESI variant (adding the Owned state) via its snoop tag array, ensuring data visibility across cores. The Cortex-A72 does not include a private Level 3 cache; instead, it relies on system-level caches or interconnects provided by the SoC integrator for outer-level sharing, with the point of coherency defined at L2 for uniprocessor operation and potentially L3 for multiprocessor setups, as indicated by the Cache Level ID Register (CLIDR_EL1). In multi-core configurations, the L2 cache's banked structure, divided into two tag banks each with four data banks, facilitates parallel access and supports partitioning strategies to mitigate contention, though explicit lockdown is not provided.

Bandwidth in the hierarchy is matched to the core's pipeline, with the L1 instruction cache connected via a 128-bit interface for fetches and the L1 data cache sustaining up to 16 bytes per cycle on loads, though aggregate peaks can be higher through non-blocking operations. The L2 provides 32 bytes per cycle of bandwidth to the cores, using its wider internal paths to service multiple requests simultaneously and reduce latency for L1 misses. These characteristics contribute to the core's emphasis on energy-efficient performance in mobile and embedded systems, where cache hit rates directly affect overall energy consumption.
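The sizes, associativities, and line length given above determine how many sets each cache has, since capacity equals sets × ways × line size. The sketch below applies that standard relation to the quoted configurations. It treats each cache as a single logical array and ignores the L2 banking; the helper names are illustrative.

```python
KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a multiple of ways * line size")
    return sets


def index_bits(sets):
    """Address bits needed to select a set (sets must be a power of two)."""
    if sets & (sets - 1):
        raise ValueError("set count is not a power of two")
    return sets.bit_length() - 1


l1i_sets = cache_sets(48 * KB, ways=3)   # L1 instruction cache
l1d_sets = cache_sets(32 * KB, ways=2)   # L1 data cache
print("L1I sets:", l1i_sets, "index bits:", index_bits(l1i_sets))
print("L1D sets:", l1d_sets, "index bits:", index_bits(l1d_sets))
for size in (512 * KB, 1 * MB, 2 * MB, 4 * MB):
    sets = cache_sets(size, ways=16)     # unified L2 cache
    print(f"L2 {size // KB} KB: {sets} sets, {index_bits(sets)} index bits")
```

The 3-way organization explains the unusual 48 KB instruction cache size: it keeps the set count at a power of two (256), the same as the 32 KB, 2-way data cache.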

Memory Management

The ARM Cortex-A72 implements a memory management unit (MMU) compliant with the ARMv8-A architecture, providing stage-1 translation from virtual to intermediate physical addresses and stage-2 translation from intermediate physical to physical addresses to support virtualization. This dual-stage mechanism enables efficient guest OS execution under a host hypervisor, with stage-1 handling per-process mappings and stage-2 managing host-level mappings. The MMU incorporates the Large Physical Address Extension (LPAE), supporting physical addresses up to 40 bits (1 TB) and page sizes ranging from 4 KB to 2 MB in standard configurations, with additional support for 16 MB and 1 GB pages through extended table walks.

The Translation Lookaside Buffers (TLBs) in the Cortex-A72 form a two-level hierarchy to accelerate address translations. Each core features a 48-entry fully associative L1 instruction TLB (ITLB) that caches translations for 4 KB, 64 KB, and 1 MB pages, optimized for instruction fetch streams. The L1 data TLB (DTLB) is a 32-entry fully associative structure supporting the same native page sizes for load and store operations, ensuring low-latency access for data references. A 1024-entry, 4-way set-associative unified L2 TLB in each core backs both L1 TLBs, caching translations across a broader range of page sizes, including 4 KB, 64 KB, 1 MB, 2 MB, 16 MB, and 1 GB, to handle diverse mappings. To minimize latency during TLB misses, the MMU includes dedicated walk caches that store intermediate translation table entries encountered during table traversal, improving efficiency in virtualized workloads.

For system-level integration, the Cortex-A72 connects to external memory subsystems via configurable interfaces supporting the AMBA 4 ACE protocol or the AMBA 5 CHI protocol, enabling cache-coherent operation in multi-core clusters. These interfaces support buses up to 128 bits wide, facilitating high-bandwidth data transfers while maintaining coherence through snoop-based protocols. Multi-core coherence is further optimized by an integrated snoop filter in the L2 controller, which tracks cache line ownership across cores to reduce unnecessary snoops and interconnect traffic. The memory system can handle 16 to 32 outstanding memory misses per core, depending on configuration, allowing sustained bandwidth for parallel memory accesses without stalling the pipeline excessively.

Memory protection in the Cortex-A72 relies on Address Space Identifiers (ASIDs) of up to 16 bits and 8-bit Virtual Machine Identifiers (VMIDs), which tag TLB entries to isolate address spaces without full flushes on context switches. ASIDs distinguish processes within an OS, while VMIDs separate virtual machines under a hypervisor, preventing cross-context pollution. TLB invalidations are broadcast via TLB invalidate (TLBI) instructions, which propagate across the TLBs to maintain consistency during dynamic operations like page faults or remapping.
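One way to read the TLB sizes above is in terms of "reach", the amount of memory that can be mapped without a TLB miss, which is the entry count multiplied by the page size. The sketch below computes it for the quoted entry counts. It assumes every entry holds a page of the same size, which is a best case, and the names are illustrative.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_bytes


TLBS = {"L1 ITLB": 48, "L1 DTLB": 32, "L2 TLB": 1024}

for name, entries in TLBS.items():
    small = tlb_reach_bytes(entries, 4 * KB)    # 4 KB pages
    large = tlb_reach_bytes(entries, 64 * KB)   # 64 KB pages
    print(f"{name}: {small // KB} KB with 4 KB pages, {large // MB} MB with 64 KB pages")
```

With 4 KB pages the L1 structures cover only a few hundred kilobytes, which is why the larger L2 TLB and the walk caches matter for workloads with big working sets.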

Features and Capabilities

Instruction Set Extensions

The ARM Cortex-A72 implements the full ARMv8-A architecture, providing native support for the AArch64 execution state and its 64-bit A64 instruction set, while maintaining backward compatibility with the AArch32 execution state through the A32 (ARM) and T32 (Thumb) instruction sets. This baseline compliance enables 64-bit addressing, enhanced security features, and a unified exception model across execution states.

In addition to the core instruction set, the Cortex-A72 includes the Advanced SIMD (NEON) extension as defined in ARMv8-A, supporting vector processing with 128-bit wide registers and operations on integer, fixed-point, and floating-point data types. Later revisions of the ARMv8-A architecture introduced enhancements to Advanced SIMD, such as dot-product instructions (e.g., UDOT, SDOT), but these are not available in the Cortex-A72 implementation.

The processor fully supports the Virtual Memory System Architecture (VMSA) of ARMv8-A, facilitating hardware-assisted virtualization through Exception Level 2 (EL2) for non-secure virtual machines under a hypervisor and Exception Level 3 (EL3) for secure monitor functionality. This includes stage-2 address translation, virtualized interrupt handling via the Generic Interrupt Controller (GIC), and context isolation between secure and non-secure worlds.

Optionally, licensees can include the ARMv8-A Cryptography Extensions, which integrate with the NEON unit to accelerate common cryptographic algorithms through dedicated instructions: AES operations (AESE, AESD, and AESMC for encryption, decryption, and mix columns), SHA-1 hashing (SHA1C, SHA1P, SHA1M, SHA1H, SHA1SU0, SHA1SU1), SHA-256 hashing (SHA256H, SHA256H2, SHA256SU0, SHA256SU1), and PMULL for carryless polynomial multiplication used in modes like AES-GCM. Their presence is reported in the ID_AA64ISAR0_EL1 register, and they require separate licensing; when they are absent, the cryptography fields in ID_AA64ISAR0_EL1 (AES, SHA1, and SHA2) report 0b0000.

The Cortex-A72 does not support the Scalable Vector Extension (SVE), which was introduced with ARMv8.2-A for scalable vector lengths up to 2048 bits. Similarly, the Reliability, Availability, and Serviceability (RAS) extensions from ARMv8.2-A, including error record registers and injection mechanisms, are not implemented as a core feature, though custom integrations may vary by licensee.
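Software typically detects the optional cryptography instructions by decoding the 4-bit fields of ID_AA64ISAR0_EL1. The sketch below shows one way to do that on a raw register value, using the architectural field positions (AES at bits [7:4], SHA1 at [11:8], SHA2 at [15:12]). It also decodes CRC32 at bits [19:16], a field not discussed above, purely for illustration. The two sample values are assumptions chosen to represent a core with and without the extension; they are not read from hardware, and the helper names are hypothetical.

```python
# Bit offsets of selected 4-bit fields in ID_AA64ISAR0_EL1.
FIELD_SHIFTS = {"AES": 4, "SHA1": 8, "SHA2": 12, "CRC32": 16}


def decode_isar0(value):
    """Extract each 4-bit field from a raw ID_AA64ISAR0_EL1 value."""
    return {name: (value >> shift) & 0xF for name, shift in FIELD_SHIFTS.items()}


def has_crypto(value):
    """True when the AES, SHA1 and SHA2 fields are all non-zero."""
    fields = decode_isar0(value)
    return all(fields[name] != 0 for name in ("AES", "SHA1", "SHA2"))


# Assumed sample values for illustration only.
WITH_CRYPTO = 0x00011120     # AES=0b0010 (AES plus PMULL), SHA1=0b0001, SHA2=0b0001
WITHOUT_CRYPTO = 0x00010000  # all three cryptography fields read as 0b0000

print(decode_isar0(WITH_CRYPTO), has_crypto(WITH_CRYPTO))
print(decode_isar0(WITHOUT_CRYPTO), has_crypto(WITHOUT_CRYPTO))
```

On a real system the value comes from an MRS read of the register at a suitable privilege level or from an operating-system interface. The decoding step itself is the same.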

Power and Efficiency Optimizations

The ARM Cortex-A72 incorporates fine-grained clock gating mechanisms to minimize dynamic power consumption by selectively disabling clock signals to inactive hardware units. This includes per-unit gating of front-end logic, which shuts off when instruction windows exceed 16 bytes to avoid unnecessary activity, as well as dedicated clock gating in the decode and integer execute stages for additional power savings. Separate power domains are provided for the integer unit and the floating-point and SIMD logic, allowing unused sections to be powered down independently during operation.

Dynamic voltage and frequency scaling (DVFS) is supported through architectural hooks that enable the operating system to adjust voltage and clock speeds based on workload demands, facilitating efficient power management. The core is designed to operate at up to 2.5 GHz on a 16 nm FinFET process node, with implementations on 28 nm nodes typically achieving lower frequencies such as 1.8 GHz, balancing performance and thermal constraints.

Efficiency targets for the Cortex-A72 emphasize high performance per clock, achieving approximately 4.7 DMIPS/MHz in typical implementations, about a 20% improvement over the predecessor Cortex-A57. In multi-core configurations, a shared L2 cache per cluster reduces power redundancy by minimizing data movement between cores, while support for core parking allows idle cores to enter low-power states, further optimizing overall system energy use. Thermal management is aided by the integrated Performance Monitor Unit (PMU), which provides counters for events such as stalls that can contribute to thermal buildup, enabling proactive throttling when necessary.
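The benefit of DVFS follows from the usual first-order model of CMOS dynamic power, P ≈ α·C·V²·f, in which power scales with the square of the supply voltage and linearly with frequency. The sketch below compares two operating points under that model. The voltages are hypothetical values chosen for illustration, not figures for any real Cortex-A72 implementation, and leakage power is ignored.

```python
def relative_dynamic_power(v, f, v_ref, f_ref):
    """Dynamic power at (v, f) relative to (v_ref, f_ref), using P ~ C * V^2 * f."""
    return (v / v_ref) ** 2 * (f / f_ref)


# Hypothetical operating points (volts, GHz); the voltages are assumptions.
HIGH = (1.0, 2.5)
LOW = (0.8, 1.8)

ratio = relative_dynamic_power(LOW[0], LOW[1], HIGH[0], HIGH[1])
print(f"frequency drops to {LOW[1] / HIGH[1]:.0%} of the high point")
print(f"dynamic power drops to about {ratio:.0%} of the high point")
```

Because voltage enters quadratically, lowering voltage together with frequency reduces power faster than it reduces performance, which is why governors prefer the lowest operating point that still meets demand.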

Implementations and Adoption

Licensing Model

The ARM Cortex-A72 was offered under ARM's traditional processor licensing model, which enabled semiconductor companies to integrate the core into their custom system-on-chip (SoC) designs for applications such as mobile devices and embedded systems. Licensees typically acquired the IP through a non-exclusive agreement that included an upfront access fee, often ranging from $1 million to $10 million depending on the scope of use and configuration options selected. This model provided two primary delivery formats: pre-configured binary cores optimized for specific manufacturing processes, or synthesizable register-transfer level (RTL) source code for broader integration flexibility.

In addition to the initial licensing fee, ARM charged royalties on a per-device basis for each chip incorporating the Cortex-A72 that was manufactured and shipped. These royalties followed a percentage-of-selling-price structure, typically 1.5% to 2% for Cortex-A series processors like the A72, which equated to roughly $0.40 to $2 per unit based on chip prices of $25 to $100, before volume discounts for high-production runs. The exact rate varied with factors such as total core count per chip and any additional IP bundled, but it incentivized widespread adoption by scaling down with larger deployment volumes. Starting in 2024, ARM shifted to a per-device selling price model for new agreements, increasing potential royalties compared to the traditional per-chip approach.

Customization under the standard license was limited to configuration parameters rather than deep architectural changes, allowing licensees to adjust elements such as L2 cache size (512 KB, 1 MB, 2 MB, or 4 MB). Deeper modifications were restricted to prevent compatibility issues. The core was designed for traditional big.LITTLE symmetrical clusters. These options balanced performance tailoring with ARM's architectural integrity.

The Cortex-A72 entered ARM's IP portfolio in February 2015, coinciding with the release of its Technical Reference Manual (TRM) for revision r0p1, which is available to licensees via ARM's resources. Vendor agreements were non-exclusive, fostering competition among partners; at launch, at least ten companies committed to designs using the core, including HiSilicon, MediaTek, and Rockchip, leading to several known integrations across various high-performance SoCs.
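The per-unit royalty range quoted above is the royalty percentage multiplied by the chip's selling price. The short sketch below reproduces that arithmetic for the quoted endpoints. It ignores volume discounts and any bundled IP, and the function name is illustrative.

```python
def royalty_per_unit(chip_price_usd, rate):
    """Royalty owed per chip under a percentage-of-selling-price model."""
    return chip_price_usd * rate


LOW_RATE, HIGH_RATE = 0.015, 0.02     # 1.5% to 2%
LOW_PRICE, HIGH_PRICE = 25.0, 100.0   # chip price range in USD

low = royalty_per_unit(LOW_PRICE, LOW_RATE)
high = royalty_per_unit(HIGH_PRICE, HIGH_RATE)
print(f"royalty range: ${low:.3f} to ${high:.2f} per unit")
```

The low end comes out slightly under forty cents and the high end at two dollars, so the range quoted in the text should be read as an approximation.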

Notable SoCs and Devices

The Snapdragon 820 and 821 SoCs, which powered flagship smartphones launched in 2016, use custom ARMv8-A CPU cores rather than the stock Cortex-A72.

HiSilicon's Kirin 950 and 955 SoCs employ a big.LITTLE configuration with four Cortex-A72 cores paired with four Cortex-A53 cores, fabricated on a 16 nm FinFET+ process for balanced power and performance. They were integrated into Huawei's Mate 8 in 2015 and the P9 in 2016, supporting premium features like high-resolution displays and long battery life in these devices.

MediaTek's Helio X20 and X25 SoCs introduced a tri-cluster design featuring two Cortex-A72 prime cores alongside efficiency clusters of Cortex-A53 cores, an innovative approach to deca-core processing on a 20 nm node. These chips appeared in mid-range smartphones, including the Meizu Pro 6 in 2016, which benefited from enhanced multitasking and graphics capabilities via the integrated Mali-T880 GPU.

Rockchip's RK3399 SoC uses a dual Cortex-A72 and quad Cortex-A53 configuration, optimized for multimedia and computing tasks in resource-constrained environments. It has been widely adopted in single-board computers and embedded systems, such as the Orange Pi RK3399 and various Rock Pi models, serving media players and development boards since 2016.

The AWS Graviton1 processor, launched in 2018, features 16 Cortex-A72 cores and was used in EC2 instances, demonstrating the core's applicability in server environments. The core's design also influenced broader adoption in high-end embedded processors during the mid-2010s.

Over time, the Cortex-A72 was largely phased out in favor of successors like the A76 and A78 for mobile flagships, yet it persists in embedded devices for its reliable performance and low power profile as of 2025.

Performance Analysis

Benchmark Results

The ARM Cortex-A72 core delivers solid performance in standard benchmarks, particularly when implemented in big.LITTLE configurations with efficiency cores. In reference designs like the Kirin 950 (four A72 cores at up to 2.3 GHz on 16 nm), the core demonstrates competitive integer processing capabilities. SPECint2006 scores for the A72 reach approximately 11.8 per core in normalized tests, reflecting its out-of-order design and improved branch prediction, which enable efficient handling of complex workloads. On 16 nm processes at 2.0 GHz, typical scores range from 8 to 10, scaling to over 12 on advanced 10 nm nodes due to higher clock speeds and density improvements.

In synthetic CPU tests like Geekbench 4, the A72 in the Kirin 950 achieves single-core scores of approximately 1700 and multi-core scores up to 5300 in an eight-core setup, highlighting its strength in single-threaded tasks common to mobile applications. These results stem from the core's out-of-order pipeline and enhanced floating-point units, which contribute to balanced performance across Geekbench 4 and 5 variants without excessive thermal throttling. For overall system-level metrics, AnTuTu v6 scores for A72-based devices like the Mate 8 (Kirin 950) fall in the 83,000 to 94,000 range, with CPU subscores reflecting the core's role in driving responsive user interfaces and multitasking.

Power efficiency remains a key attribute, with the A72 consuming 2 to 4 W per core at peak loads in mobile SoCs, enabling sustained operation at frequencies up to 2.5 GHz on 16 nm FinFET processes. ARM reports energy efficiency gains of 18% to 30% over the predecessor A57 at iso-performance.

Performance varies across implementations: reference designs like the Kirin 950 yield baseline results, while process shrinks to 10 nm in later chips boost scores by 20% to 50% through better voltage scaling and thermal headroom. Custom variants from licensees, though less common for the A72 itself, further optimize outcomes via tailored cache hierarchies and interconnects, while major adopters like Samsung favored proprietary cores (e.g., Mongoose in the Exynos 8890) over the stock A72 for specific workloads.

Comparisons with Other Cores

The ARM Cortex-A72 offers significant improvements over its predecessor, the Cortex-A57, particularly in power efficiency and area optimization. At the same power envelope, the A72 delivers up to 90% higher performance, achieved through a combination of architectural enhancements enabling higher instructions per clock (IPC) and support for 10% higher clock speeds. Additionally, it features a 15% smaller die area compared to the A57, enabling more compact implementations in mobile and embedded systems. These enhancements, including a more balanced pipeline design, make the A72 better suited for sustained workloads, where the A57's less efficient out-of-order execution could lead to thermal throttling under prolonged loads.

In comparison to the Cortex-A73, the A72 provides similar overall performance levels but with a focus on peak throughput rather than sustained efficiency. The A73 achieves 30% higher sustained performance and over 20% better efficiency than the A72 at the same node and frequency, emphasizing broader workload optimization including memory-intensive tasks. However, the A72 maintains an advantage in peak throughput, making it preferable for bursty, compute-bound operations where absolute speed trumps long-term power savings.

Relative to 2016 x86 contemporaries like Intel's Skylake mobile cores, the Cortex-A72 achieves comparable performance in the constrained thermal and power budgets typical of mobile devices, using its efficient Armv8-A design to match or exceed efficiency in low-power scenarios. However, it trails in absolute performance, with Skylake delivering higher peak speeds under unconstrained conditions due to larger caches and a more aggressive microarchitecture. This positions the A72 as a strong contender for battery-limited applications but less ideal for high-power desktops of the era.

Against successors like the Cortex-A76 and A78, the A72 lags significantly in performance, with the A76 offering about 35-40% higher performance than the A73 (cumulatively ~70% over the A72) at the same power level and the A78 providing an additional 20% uplift over the A76 through enhanced branch prediction and vector processing. Compounded, these figures put the A78 at roughly twice the performance of the A72, reflecting generational advances in microarchitecture. Despite this, the A72 remains viable for cost-sensitive applications as of 2025, where its mature design supports legacy software without the complexity or licensing costs of newer cores, including continued use in industrial IoT systems.

Key trade-offs for the A72 include its strength in legacy applications, benefiting from broad software compatibility and scalar integer efficiency, but relative weakness in machine learning and vector workloads due to the absence of the Scalable Vector Extension (SVE) and of dot-product instructions. Later cores like the A78 add dot-product instructions for improved ML acceleration, highlighting the A72's niche in traditional computing tasks over emerging demands.
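Generation-over-generation gains multiply rather than add, so the percentages quoted above can be chained to estimate how far the A72 sits behind its successors. The sketch below does this compounding, taking the quoted uplifts at face value (A73 +30% over the A72, A76 +35% to +40% over the A73, A78 +20% over the A76). It is arithmetic on marketing-style figures that were measured under different conditions, not a benchmark result, and the function name is illustrative.

```python
def compound(*uplifts):
    """Chain fractional uplifts (0.30 means +30%) into one overall speedup factor."""
    factor = 1.0
    for uplift in uplifts:
        factor *= 1.0 + uplift
    return factor


A73_OVER_A72 = 0.30
A76_OVER_A73 = (0.35, 0.40)   # quoted range
A78_OVER_A76 = 0.20

for a76_step in A76_OVER_A73:
    a76 = compound(A73_OVER_A72, a76_step)
    a78 = compound(A73_OVER_A72, a76_step, A78_OVER_A76)
    print(f"A76 vs A72: {a76:.2f}x, A78 vs A72: {a78:.2f}x")
```

On these inputs the A76 lands at about 1.8 times the A72 and the A78 at a little over 2 times, subject to the caveat that the underlying figures come from different process nodes and workloads.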

  15. [15]
    Cortex-A72 - Arm Developer
    It supports dynamic and static branch prediction. The instruction fetch unit includes: L1 instruction cache that is a 48KB 3-way set-associative cache with a 64 ...Missing: precursors DynamIQ
  16. [16]
    [PDF] ARM® Cortex®-A72 MPCore Processor Technical Reference Manual
    Apr 4, 2023 · 0003-05. 22 April 2016. Non-Confidential. First release for r0p3. 0003-06. 01 December 2016. Non-Confidential. Second release for r0p3. Non- ...<|separator|>
  17. [17]
    About the L2 memory system - Arm Developer
    Configurable L2 cache size of 512KB, 1MB, 2MB and 4MB. · Fixed line length of 64 bytes. · Physically indexed and tagged cache. · 16-way set-associative cache ...
  18. [18]
    Cache Level ID Register, EL1 - Arm Developer
    Indicates the Level of Unification Uniprocessor for the cache hierarchy. This value is: Indicates the Level of Coherency for the cache hierarchy.
  19. [19]
    Per-Bank Bandwidth Regulation of Shared Last-Level Cache ... - arXiv
    Jul 21, 2025 · For instance, the LLC of the ARM Cortex-A72 processor has two independent tag banks, each of which is further divided into four data banks [1] .
  20. [20]
  21. [21]
  22. [22]
  23. [23]
    ID_AA64ISAR0_EL1: AArch64 Instruction Set Attribute Register 0
    From Armv8.4, the only permitted value is 0b0010 . TS, bits [55:52]. Indicates support for flag manipulation instructions ...Missing: A72 | Show results with:A72
  24. [24]
    ARM Cortex-A72 MPCore Processor Technical Reference Manual ...
    Security state · Can access both the Secure and the Non-secure memory address space. · When executing at EL3, can access all the system control resources.<|control11|><|separator|>
  25. [25]
    About the Cortex-A72 processor Cryptography engine
    The Cryptography Extensions add new instructions that the Advanced SIMD can use to accelerate the execution of AES, SHA1, and SHA2-256 algorithms.
  26. [26]
    AArch64 Instruction Set Attribute Register 0, EL1 - Arm Developer
    Provides information about the Cryptography Extension instruction set that the processor can support. Note. The optional Cryptography engine is not included in ...
  27. [27]
    ARM processors DMIPS/MHz comparison - bluelogic - 博客园
    Jan 13, 2020 · DMIPS/MHz, DMIPS/MHz*. ARM11, v7-A, 32, 1.25. Cortex-A7, v7-A, 32, 1.9 ... Cortex-A72, v8-A, 32/64, 5.4, 4.7. Cortex-A73, v8-A, 32/64, 7.0, 4.8.
  28. [28]
    ARM Business Model - Strategyzer
    The licensing fees vary between an estimated $1 million to 10 million. The royalty is usually 1 to 2% of the selling price of the chip. 4. Scale without ...Missing: Cortex- A72 types customization
  29. [29]
    How ARM Makes Money - The ARM Diaries, Part 1 - AnandTech
    Jun 28, 2013 · Royalties make up roughly 50% of ARM's total revenues, licensing fees are just over 33% and the remainder is equally distributed between ...Missing: A72 customization<|separator|>
  30. [30]
    Arm to Change Pricing Model Ahead of IPO | TechPowerUp Forums
    Mar 27, 2023 · Currently, the old model charges around 1-2 percents per chip in each smartphone, considering the ASP of smartphone chips to be $40 for Qualcomm ...
  31. [31]
    Cache organization - Arm Developer
    The cache sizes are configurable with sizes of 512KB, 1MB, 2MB, and 4MB. You can configure the L2 memory system pipeline to insert wait states to take into ...
  32. [32]
    Kryo: Qualcomm's Last In-House Mobile Core - Chips and Cheese
    Jul 12, 2023 · Cortex A72 also has 5 cycle multiplication latency, and only one port available for integer multiplies. With just two ALUs, port contention is ...
  33. [33]
    Kirin 950 Chipset | HiSilicon Official Site
    16 nm FinFET+ process, 4x Cortex-A72 + 4x Cortex-A53 big.LITTLE architecture, and Mali-T880 GPU combine to offer unparalleled performance and standby time.
  34. [34]
    HUAWEI Mate 8 Smartphone Review: Enter the Kirin 950
    Aug 26, 2016 · Processor, HiSilicon Kirin 950 Octa-core SoC (4x 2.3 GHz Cortex-A72 + 4x 1.8 GHz Cortex A53) ; Graphics, Mali-T880 MP4 ; Memory, 3GB ; Display, 6.0 ...
  35. [35]
    MediaTek Launches Helio X20 | Tri-Cluster CPU SoC
    May 12, 2015 · The Tri-Cluster CPU consists of one cluster of two ARM Cortex-A72 cores (running at 2.1GHz - 2.3GHz for extreme performance) and two ...
  36. [36]
    Meizu's 10-core phone gets a 10-LED camera flash - Engadget
    Apr 13, 2016 · Indeed, today Meizu announced its Pro 6 smartphone which has nabbed exclusivity over the flagship Helio X25, yet it only starts from 2,499 yuan ...
  37. [37]
    RK3399 - Rockchip open source Document
    Aug 29, 2019 · Based on Big.Little architecture, it integrates dual-core Cortex-A72 and quad-core Cortex-A53 with separate NEON coprocessor. Many embedded ...Overview · SoC Features
  38. [38]
    Orange Pi RK3399 - Orangepi
    Orange Pi RK3399 is an open-source single board computer with dual-band wireless WiFi and Bluetooth 4.1. It is highly compact with a dimension of 99X129mm.
  39. [39]
    ARM Cortex-A72 VS Cortex-A76 Processors - ARMxy SBC
    Apr 13, 2025 · Both ARM Cortex-A72 and Cortex-A76 are high-performance ARM processors. ... ✓ Memory dependence predictor (20% lower latency). Retained ...
  40. [40]
    Huawei Kirin 950 SoC beats Exynos 7420 in leaked GeekBench score
    Nov 3, 2015 · The new Kirin 950, obliterates Samsung's current crown jewel – the Exynos 7420, which can do about 1486 points on single core tests and 4970 on multi-core.
  41. [41]
    The Kirin 950 SoC goes official, posts a record AnTuTu score
    Nov 5, 2015 · Still Huawei did show off a stellar test score of almost 83 000 points on AnTuTu with a demo rig.
  42. [42]
    Huawei HiSilicon Kirin 950 Antutu Benchmark score
    The Antutu benchmark score of Huawei HiSilicon Kirin 950 is around 94164. Its rank and results are higher than its competitors, for example the Qualcomm ...
  43. [43]
    New ARM Cortex-A73 Processor drives efficiency, performance for ...
    May 27, 2016 · The Cortex-A73 delivers 30% more sustained performance than our most recent previous high-performance CPU, the Cortex-A72.<|separator|>
  44. [44]
    Arm's Cortex A73: Resource Limits, What are Those?
    Jul 18, 2024 · Cortex A72 might have a higher IPC ceiling, but A73 enjoys better code fetch bandwidth from L2. 1 IPC certainly isn't great, but it's better ...
  45. [45]
    ARM Cortex-A72 vs Cortex-A76 Processors - BLIIoT
    Sep 18, 2025 · Execution: Floating-point throughput doubled (256-bit NEON). Memory: L1D cache increased to 64KB (vs 32KB on A72). 3. Performance. Test Item ...
  46. [46]
    None
    ### Summary of Cortex-A72 vs A57, A73, A76, A78 Comparison