ARM Cortex-X3
The ARM Cortex-X3 is a high-performance CPU core developed by Arm Holdings as part of its Cortex-X custom program, implementing the Armv9.0-A 64-bit architecture and targeting premium mobile devices such as flagship smartphones and laptops.[1] Announced on June 28, 2022, alongside the Cortex-A715 and Cortex-A510 as components of Arm's Total Compute Solutions 2022, it emphasizes single-threaded performance gains through architectural enhancements such as a widened execution pipeline and improved branch prediction.[1] The core supports DynamIQ cluster configurations, enabling integration with other CPU types for heterogeneous computing in systems-on-chip (SoCs).[2]

Key features of the Cortex-X3 include the AArch64 execution state across exception levels EL0 to EL3, the Scalable Vector Extension (SVE) with 128-bit vector length and SVE2 support, Advanced SIMD and floating-point units, and an optional Cryptographic Extension for enhanced security.[3] It incorporates a Memory Management Unit (MMU) with 40-bit physical and 48-bit virtual addressing, separate L1 instruction and data caches, a private unified L2 cache, and error correction via parity or ECC on caches and translation structures.[3] Additional capabilities encompass the Generic Interrupt Controller (GIC) v4.1 interface, 64-bit Generic Timers, Reliability, Availability, and Serviceability (RAS) extensions, an Activity Monitoring Unit (AMU) for power profiling, a Performance Monitoring Unit (PMU), and debug features such as the Embedded Trace Extension (ETE) and Statistical Profiling Extension (SPE).[3]

The Cortex-X3 delivers up to 25% higher single-threaded performance than the Cortex-X2 in benchmarks relevant to Android flagship smartphones, and 34% gains compared to mainstream laptops, marking the third consecutive year of double-digit instructions-per-cycle (IPC) uplift.[1] It features 50% larger L1 and L2 branch target buffers and a 10x larger L0 branch target buffer for better prediction accuracy, alongside support for the DynamIQ Shared Unit-110 (DSU-110), which enables clusters of up to 12 cores with 16 MB of L3 cache.[1] These advancements prioritize compute-intensive tasks such as AI processing and gaming while maintaining power efficiency in big.LITTLE configurations.[1]

Since its release, the Cortex-X3 has been adopted in leading mobile SoCs, including the Qualcomm Snapdragon 8 Gen 2, whose prime core is clocked at up to 3.2 GHz for devices such as the Samsung Galaxy S23 series, and the MediaTek Dimensity 9200 series, with speeds exceeding 3 GHz for high-end Android phones.[4][5] It also powers the Google Tensor G3 in Pixel 8 devices, contributing to advancements in on-device machine learning and multimedia workloads.[6] The core's design facilitates customization by partners, supporting innovations in areas such as ray-traced graphics and extended reality applications.[7]

Overview
General characteristics
The ARM Cortex-X3 is a high-performance CPU core designed by Arm Holdings and launched in 2022, with its official announcement on June 28.[1] It serves as the flagship processor in Arm's portfolio, targeting demanding workloads in mobile and compute devices.[8] The core implements the Armv9.0-A instruction set architecture, operating exclusively in 64-bit AArch64 mode without support for 32-bit AArch32, which enables optimizations for modern software ecosystems.[2] It features a 48-bit virtual address space and a 40-bit physical address space, supporting efficient memory management in high-end systems.[3] The microarchitecture, known as Makalu-ELP, represents a performance-oriented evolution within Arm's Cortex-X series.

As part of Arm's Total Compute Solutions 2022 (TCS22), the Cortex-X3 integrates alongside the efficiency-focused Cortex-A715 and Cortex-A510 cores, the Immortalis-G715 GPU, and CoreLink system IP to form scalable big.LITTLE configurations.[1] It supports up to 12 cores per cluster in DynamIQ configurations using the DSU-110 shared unit, facilitating flexible multi-core setups for high-performance applications such as flagship smartphones and tablets.[1] The core leverages DynamIQ shared memory technology for coherent multi-core operation.[2]

Design and announcement
The ARM Cortex-X3 was announced on June 28, 2022, as part of Arm's Total Compute Solutions 2022 (TCS22) roadmap, which outlined advancements in CPU, GPU, and system IP for next-generation mobile and computing devices.[9] This third-generation high-performance core in the Cortex-X series succeeded the Cortex-X2 and was positioned to push the boundaries of single-threaded performance in premium segments.[8]

The design goals centered on achieving a 25% uplift in peak single-threaded performance compared to the Cortex-X2 at identical power and process conditions, with a strong emphasis on sustained performance for AI and machine learning workloads as well as large-screen experiences in premium mobile devices.[8] Development prioritized a balance of high instructions per cycle (IPC) and power efficiency within the Armv9 ecosystem, incorporating support for Scalable Vector Extension 2 (SVE2) to enable advanced vector computing capabilities.[3] The core bears the codename Makalu-ELP.[10]

IP licensing for the Cortex-X3 became available to SoC designers starting in late 2022, enabling integration into custom silicon designs, with the first commercial products featuring the core emerging in 2023.[6]

Architecture
Core microarchitecture
The ARM Cortex-X3 core utilizes an out-of-order execution microarchitecture featuring an 8-wide rename and dispatch stage, capable of processing up to 8 macro-operations (MOPs) and dispatching up to 16 micro-operations (μOPs) per cycle under specific constraints.[11] This design enables efficient handling of instruction-level parallelism by dynamically scheduling operations based on data dependencies. The reorder buffer (ROB) accommodates 320 entries, facilitating robust tracking and retirement of instructions to maintain precise exception handling and ordering.[12]

Branch prediction in the Cortex-X3 relies on a two-level adaptive predictor, augmented by an indirect target buffer and enhancements to the return stack, which improve prediction accuracy for indirect branches and function returns in intricate code sequences.[13] For integer computations, the core incorporates 6 arithmetic logic units (ALUs) to execute parallel operations such as additions, multiplications, and logical functions, enhancing throughput for general-purpose workloads.[12] The load/store unit comprises dual load pipelines and dual store pipelines, collectively supporting up to 6 memory operations per cycle, including handling of unaligned accesses with minimal penalties when aligned properly.[11] To sustain this parallelism, the register file includes 192 physical integer registers and 160 floating-point registers, allowing extensive renaming to reduce stalls from register pressure.[10]

The Cortex-X3 fully implements the Armv9.0-A architecture in the AArch64 execution state, incorporating extensions such as dot product instructions (e.g., for INT8 and FP16) and complex number support to accelerate matrix multiplications and signal processing in machine learning applications.[2] The core integrates into Arm's DynamIQ technology for flexible multi-core scaling in heterogeneous configurations.[14]

Instruction pipeline and execution
The instruction pipeline of the ARM Cortex-X3 employs a superscalar, out-of-order execution paradigm to maximize throughput while maintaining in-order retirement for precise exception handling. Instructions are initially fetched from the L1 instruction cache and proceed through the frontend, where they are decoded into internal macro-operations (MOPs); these MOPs may further split into micro-operations (μOPs) in subsequent stages. The renamed and dispatched operations are then issued out-of-order to a set of 17 execution pipelines, encompassing integer, load/store, branch, and floating-point/Advanced SIMD units, before results are committed in program order.[11]

The frontend provides enhanced decode bandwidth, processing up to 6 instructions per cycle and handling dense code sequences efficiently through macro-op fusion, which combines adjacent operations. Dual decode clusters facilitate this fusion, and the frontend can fetch up to 12 instructions per cycle from the instruction cache while integrating branch prediction mechanisms for speculative execution. This design supports the Armv9.0-A instruction set, with dynamic scheduling to minimize stalls from dependencies or mispredictions.[13]

In the backend, the Cortex-X3 can retire up to 8 instructions per cycle, leveraging a pipeline tuned for high-frequency operation while kept short enough to limit misprediction penalties; dynamic scheduling ensures operands are ready before operations issue to the appropriate pipelines.
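The speculative execution described above hinges on the branch predictor. As a rough illustration of the two-level adaptive scheme mentioned earlier, the sketch below models a textbook predictor with an assumed 4-bit global history register indexing a table of 2-bit saturating counters; this is an educational model, not Arm's proprietary design.

```python
# Sketch of a textbook two-level adaptive branch predictor (illustrative
# only; history depth and table sizes are assumptions, not Arm's design).

class TwoLevelPredictor:
    def __init__(self, history_bits: int = 4):
        self.history_bits = history_bits
        self.history = 0                           # global branch-history register
        self.counters = [1] * (1 << history_bits)  # 2-bit counters, start weakly not-taken

    def predict(self) -> bool:
        """Predict taken if the counter selected by the history is in its upper half."""
        return self.counters[self.history] >= 2

    def update(self, taken: bool) -> None:
        """Train the selected counter, then shift the outcome into the history."""
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A loop-closing branch (taken, taken, taken, ...) is learned quickly:
p = TwoLevelPredictor()
for _ in range(8):
    p.update(True)
print(p.predict())  # True after training on the repeating pattern
```

The 2-bit counters provide hysteresis, so a single anomalous outcome does not immediately flip a well-established prediction; real designs such as the Cortex-X3's add indirect-target and return-stack structures on top of this basic idea.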
For vector and floating-point workloads, 4-wide FP/NEON pipes handle scalar and SIMD operations, augmented by SVE2 support with 128-bit vector length for scalable vector processing in machine learning and multimedia tasks.[11][13]

Efficiency during varying loads is enhanced by fine-grained power gating and clocking domains, which isolate and deactivate idle sections of the pipeline, such as frontend clusters or backend units, reducing leakage power without impacting active execution paths. This allows seamless transitions between full-performance mode and low-activity states, preserving battery life in mobile applications.[15]

Memory subsystem
The memory subsystem of the ARM Cortex-X3 core employs a hierarchical cache structure optimized for low latency and high throughput in demanding workloads, integrated within the DynamIQ shared unit framework.[16] Each core features a split L1 cache consisting of a 64 KiB instruction cache and a 64 KiB data cache, both 4-way set associative, with parity or ECC protection for reliability.[3][17] The private L2 cache per core is configurable as 512 KiB or 1 MiB, is 8-way set associative, and is inclusive of L1 contents to ensure efficient data reuse without duplication overhead.[18][10] In a DynamIQ cluster, cores share an L3 cache ranging from 512 KiB to 16 MiB, controlled by the DSU-110 unit, which enforces coherence through snoop-based protocols to maintain data consistency across multiple cores.[19]

Memory access capabilities include up to 64 bytes per cycle for loads and 48 bytes per cycle for stores, with native support for 64-bit DDR5 and LPDDR5 interfaces to deliver high system-level bandwidth.[16] Address translation is handled by a 64-entry L1 instruction TLB and a 64-entry L1 data TLB per core, complemented by a 4096-entry unified L2 TLB for larger working sets.[16] Key features encompass non-temporal stores for bypassing caches in streaming scenarios, hardware prefetchers that detect sequential or strided access streams to proactively load data into the L1 and L2 caches, and adherence to the Coherent Hub Interface (CHI) protocol for scalable, coherent interconnects with external memory controllers and other cluster components.[20][21]

Improvements from previous generation
Key enhancements over Cortex-X2
The ARM Cortex-X3 introduces several architectural modifications compared to the Cortex-X2, primarily aimed at increasing instruction throughput and out-of-order execution capacity while optimizing for high-performance workloads. A key change is the expansion of decode bandwidth from 5-wide to 6-wide, enabling the core to process up to six instructions per cycle for improved overall throughput. This adjustment is combined with a micro-op (MOP) cache reduced to 1,500 entries (half of the X2's 3,000), which maintains high bandwidth at 8 MOPs per cycle while shortening the pipeline from 10 stages to 9, reducing front-end latency.[12][13]

In the execution backend, the Cortex-X3 increases the number of integer ALUs from four to six (four simple and two complex units, with one supporting integer division), enhancing integer computation capabilities. Load/store resources are also upgraded, shifting from the X2's two-load/two-store queues to three address generation units (AGUs), two read/write and one read-only, raising load bandwidth from 24 to 32 bytes per cycle. These changes support dual load and dual store operations with higher throughput, better handling memory-intensive tasks. The reorder buffer (ROB) is expanded from 288 to 320 entries, allowing a larger window of out-of-order instructions (up to 640 in flight), which improves speculation and reduces stalls in complex code sequences.[12][13]

Branch prediction receives significant attention, with an enhanced predictor featuring a 10x larger L0 branch target buffer (BTB), over 50% larger L1 and L2 BTBs, and a new dedicated indirect branch predictor paired with an expanded indirect target array. This contributes to an overall 6.1% reduction in branch mispredictions and a 12.2% decrease in cycles lost to mispredictions compared to the X2.[13][12]

Together, these enhancements deliver an 11% instructions-per-cycle (IPC) gain within the same power envelope, balancing performance uplift against efficiency in premium mobile designs. Smaller L1 caches and optional L2 configurations (512 KiB or 1 MiB) further allow tuning for power savings, with the larger L2 reducing cache misses by up to 27%.[22][13]

Performance optimizations
The ARM Cortex-X3 core achieves an average 11% uplift in instructions per cycle (IPC) compared to the Cortex-X2 when evaluated under identical frequency, process node (such as TSMC 4 nm), and cache configurations.[13][6] This improvement stems from architectural refinements that enhance instruction throughput without increasing power draw under iso-conditions.[12]

In peak performance scenarios, the Cortex-X3 delivers up to 25% higher single-threaded performance than the Cortex-X2, operating at 3.3 GHz versus the prior core's 2.9 GHz in typical smartphone implementations.[6][12] This gain reflects the combined benefits of IPC increases and higher sustainable clock speeds, enabling better handling of demanding tasks such as application launches and computations.[1] Optimizations such as a smaller L1 cache and configurable L2 (512 KiB or 1 MiB) further contribute by cutting cache refill and writeback activity by up to 26.9%, lowering overall energy consumption in prolonged scenarios.[13]

Benchmark results highlight these gains in integer-heavy tasks, with SPECint 2017 showing approximately 8.5-11% improvements over the Cortex-X2, in line with the core's IPC uplift.[12] In real-world testing on implementations such as the Qualcomm Snapdragon 8 Gen 2, Geekbench 6 single-core scores reach 1,985-2,100 points at stock clock speeds, demonstrating strong responsiveness for everyday computing.[23][24] For machine learning inference workloads, the Cortex-X3 leverages SVE2 extensions to accelerate vector-based operations common in AI tasks.[13] These optimizations, including improved branch prediction that cuts mispredictions by 6.1%, reduce the cycles lost to mispredictions by about 12.2% compared to the prior generation.[13][12]

Implementations and usage
Adoption in mobile SoCs
The ARM Cortex-X3 core saw rapid adoption in flagship mobile system-on-chip (SoC) designs following its announcement in mid-2022, primarily through ARM's DynamIQ licensing model, which allows SoC vendors to customize and integrate the IP into heterogeneous core clusters. This flexibility enabled configurations such as 1+3+4 or 1+4+3 clusters, combining one Cortex-X3 prime or ultra core with mid-tier and efficiency cores for balanced performance and power efficiency in premium smartphones.[16]

MediaTek was first to unveil a Cortex-X3 product, incorporating a single Cortex-X3 ultra core in the Dimensity 9200 (MT6985) on November 8, 2022, followed by its refreshed variant, the Dimensity 9200+, announced on May 10, 2023.[26][27] Qualcomm featured the core as the prime core in the Snapdragon 8 Gen 2 (SM8550) SoC, announced on November 15, 2022.[25] Google also adopted the core in its Tensor G3 SoC, announced on October 4, 2023, for the Pixel 8 series, utilizing one Cortex-X3 core within a 9-core (1+4+4) arrangement.[28]

The first commercial devices powered by Cortex-X3-equipped SoCs emerged in late 2022, with the Vivo X90 Pro launching on December 6, 2022, as the initial handset featuring the MediaTek Dimensity 9200.[29] Adoption expanded widely in 2023 flagship smartphones, including the Samsung Galaxy S23 series, released globally on February 17, 2023, with the Qualcomm Snapdragon 8 Gen 2; the Oppo Find X6, available from March 24, 2023, powered by the Dimensity 9200; and the Google Pixel 8 series, which shipped starting October 12, 2023, with the Tensor G3.[30][31][28]

Clock speeds and configurations
The ARM Cortex-X3 core is typically implemented as a single prime core in mobile system-on-chip (SoC) designs, clocked between 2.9 GHz and 3.35 GHz depending on the vendor's tuning for performance and thermal constraints.[32][33][34]

In Qualcomm's Snapdragon 8 Gen 2, the configuration features one Cortex-X3 core at up to 3.2 GHz in the standard variant (SM8550), up to 3.36 GHz in the enhanced 'for Galaxy' variant (SM8550-AC), or 3.19 GHz in the SM8550-AB variant, paired in a 1+2+2+3 cluster with two Cortex-A715 cores at 2.8 GHz, two Cortex-A710 cores at 2.8 GHz, and three Cortex-A510 efficiency cores at 2.0 GHz.[33][35] MediaTek's Dimensity 9200 employs one Cortex-X3 core at 3.05 GHz within a standard 1+3+4 arrangement, including three Cortex-A715 cores at 2.85 GHz and four Cortex-A510 cores at 1.8 GHz, while the overclocked Dimensity 9200+ variant boosts the X3 to 3.35 GHz with the A715 cores reaching 3.0 GHz.[36][34][37] Google's Tensor G3 adopts an unusual 1+4+4 configuration emphasizing efficiency, with one Cortex-X3 core at 2.91 GHz, four Cortex-A715 cores at 2.37 GHz, and four Cortex-A510 cores at around 1.7 GHz, reflecting a focus on sustained operation over peak speeds in Pixel devices.[32]

These clock speeds enable short bursts of high performance, but thermal limits in mobile form factors typically cause throttling after brief sustained loads, with the X3 core configurable through Arm's DynamIQ Shared Unit for cluster interconnect and power gating to manage heat. The microarchitecture carries the internal codename Makalu-ELP, and implementations are tuned for balanced performance and power in premium mobile profiles.[10] While most implementations use a single X3 core to avoid excessive thermal buildup, dual-X3 configurations have been explored for ultra-premium SoCs but remained rare in 2023-2024 production due to power and heat challenges in compact devices.[13]

Comparisons
Versus Cortex-X2
The Cortex-X3 offers targeted improvements over the Cortex-X2, focusing on wider execution resources and enhanced branch prediction to deliver higher single-threaded performance in demanding workloads such as mobile gaming and productivity applications. While both cores share the Armv9.0-A instruction set architecture, the X3 shifts emphasis toward broader parallelism in integer and floating-point operations, building on the X2's foundational Armv9 adoption to prioritize scalability in DynamIQ configurations. This evolution enables the X3 to achieve an 11% increase in instructions per cycle (IPC) under identical process, clock, and cache conditions.[13]

Key architectural specifications highlight the X3's expanded capabilities for out-of-order execution and resource allocation:

| Feature | Cortex-X3 | Cortex-X2 |
|---|---|---|
| Decode width | 6 instructions | 5 instructions |
| Integer ALUs | 6 | 4 |
| Reorder buffer (ROB) | 320 entries | 288 entries |
| L2 cache range | 512–1024 KiB | 512–1024 KiB |
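The headline figures quoted in this article can be cross-checked with back-of-envelope arithmetic: the roughly 25% peak single-thread gain decomposes into the 11% iso-frequency IPC uplift multiplied by the clock increase from 2.9 GHz to 3.3 GHz in typical smartphone implementations. This is an illustrative calculation using the article's own numbers, not an Arm-published derivation.

```python
# Arithmetic cross-check of the performance figures quoted above
# (illustrative only): peak uplift ~= IPC uplift x clock-frequency ratio.

ipc_uplift = 1.11          # Cortex-X3 vs. Cortex-X2 at identical frequency
freq_ratio = 3.3 / 2.9     # typical X3 vs. X2 smartphone clocks, ~1.138
combined = ipc_uplift * freq_ratio

print(f"combined uplift: {(combined - 1) * 100:.1f}%")  # prints "combined uplift: 26.3%"
```

The result slightly exceeds the quoted 25% figure, which is consistent with Arm's "up to 25%" framing referring to measured workloads rather than a pure multiplicative ideal.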