Uncore
In computer architecture, particularly within Intel's multi-core processor designs, the uncore refers to the integrated hardware components on the processor die that operate outside the individual CPU cores, encompassing shared resources essential for system performance.[1] These components manage inter-core communication, memory access, and input/output operations, distinguishing them from the core execution units that handle instruction processing.[2] The term was introduced by Intel with the Nehalem microarchitecture in 2008, marking a shift toward on-die integration of previously off-chip elements like the memory controller.[3][4]
Key uncore components typically include the last-level cache (LLC), on-chip interconnects (such as the ring bus in earlier designs or the mesh topology in newer ones), memory controllers for DRAM access, and caching/home agents (CHA) for maintaining cache coherence across cores.[4][5] Additional elements often encompass I/O stacks for PCIe and other peripherals, as well as Ultra Path Interconnect (UPI) blocks for multi-socket communication in server processors.[2] The architecture of these components has evolved across generations; for instance, Nehalem-era uncores centered on QuickPath Interconnect (QPI) and integrated memory controllers, while modern implementations like those in Sapphire Rapids emphasize modular designs for enhanced scalability and power efficiency.[6][7]
The uncore plays a critical role in overall processor performance by handling bandwidth-intensive tasks, reducing latency for shared data access, and enabling features like Intel Data Direct I/O (DDIO) for efficient I/O processing via the LLC.[2] It also supports power management through uncore frequency scaling, which dynamically adjusts clock speeds for components like the LLC and interconnects to optimize energy consumption without impacting core performance.[4] Performance monitoring units (PMUs) in the uncore allow developers to track events such as memory traffic and cache misses, aiding in workload optimization for high-performance computing and data centers.[5] As processor core counts increase, the uncore's design continues to influence scalability, with recent advancements focusing on 3D stacking and heterogeneous integration to address wire delays and thermal constraints.[8]
Definition and Terminology
Core vs. Uncore Distinction
In modern multicore processor architectures, particularly those developed by Intel, the processor die is conceptually divided into the cores and the uncore, a fundamental separation of responsibilities intended to enhance overall system performance and efficiency. The cores serve as the primary execution units, each handling instruction fetch, decode, and execution through arithmetic logic units (ALUs) and floating-point units (FPUs), as well as managing private, low-level caches such as the L1 and L2. These private resources are dedicated to individual cores to minimize latency for single-threaded operations and to ensure isolation between execution contexts.[9][10]
In contrast, the uncore encompasses all non-execution-core elements integrated on the same CPU die, including shared resources such as higher-level caches, memory controllers, and I/O logic, which serve multiple cores collectively rather than performing direct computation. This distinction allows the uncore to manage system-wide functions that are independent of individual core activities, such as maintaining cache coherency across cores and routing data between execution units, memory, and peripherals, thereby preventing bottlenecks in multi-threaded workloads. Unlike core-specific operations focused on instruction processing, uncore tasks ensure seamless coordination without interfering with per-core execution pipelines.[9][11]
The integration of the uncore with the cores on a single die yields significant architectural benefits, particularly in reducing latency for inter-core communication through shared on-die interconnects and caches, which eliminates the need for slower off-chip transfers. It also improves bandwidth for memory access and I/O operations by embedding controllers directly on the die, enabling faster data movement and higher throughput in bandwidth-intensive applications. This design promotes scalability in multicore systems, where the uncore offloads global resource management from the cores, allowing them to focus on computation while maintaining power efficiency and system coherence.[9][1]
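The core/uncore split is directly observable from software in the cache topology the operating system reports. The following C sketch (Linux-specific, and assuming the standard sysfs cache interface under /sys/devices/system/cpu is present) prints, for CPU 0, each cache level together with the CPUs that share it; private L1/L2 entries typically list only a core's own hyperthread siblings, while the L3/LLC entry lists every core attached to the same uncore.

/* Sketch: inspect the Linux sysfs cache hierarchy for CPU 0 to see which
 * cache levels are private to a core and which are shared (the shared LLC
 * lives in the uncore). Paths are the standard Linux sysfs layout. */
#include <stdio.h>
#include <string.h>

static int read_line(const char *path, char *buf, size_t len) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(buf, (int)len, f)) { fclose(f); return -1; }
    buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
    fclose(f);
    return 0;
}

int main(void) {
    char path[256], level[16], type[32], shared[256];
    for (int idx = 0; idx < 8; idx++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
        if (read_line(path, level, sizeof level) != 0) break;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/type", idx);
        read_line(path, type, sizeof type);
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
        read_line(path, shared, sizeof shared);
        /* Private L1/L2 caches list only sibling hyperthreads; the LLC
         * lists every CPU sharing the uncore cache. */
        printf("L%s %-12s shared with CPUs: %s\n", level, type, shared);
    }
    return 0;
}

Historical Naming Conventions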
The term "uncore" was first introduced by Intel in 2008 to describe the non-core components of its processors, specifically in the context of the Nehalem microarchitecture, which integrated elements like the memory controller and interconnects on-die.[12] This nomenclature highlighted the modular separation between processing cores and supporting logic, enabling independent power and frequency management.[12]
With the release of the Sandy Bridge microarchitecture in 2011, Intel shifted to the term "System Agent" in public and marketing materials, emphasizing the enhanced integration of I/O interfaces, PCIe support, and power management features within this subsystem.[13] This renaming reflected a broader architectural focus on the uncore's role as a centralized agent for system-level operations, including graphics and interconnect handling, while aligning with Intel's branding for more holistic processor ecosystems.[13]
Despite the public transition, "uncore" persisted in Intel's technical documentation and performance analysis resources, as evidenced by its explicit use in the 2010 Intel Technology Journal article detailing modular designs for high-performance cores.[14] No formal deprecation occurred, and both terms continue to coexist in modern references through 2025, with "uncore" appearing in performance monitoring guides for recent Xeon processors and "System Agent" in datasheets for client architectures.[15][16]
Historical Development
Origins in Early Integrated Designs
In the pre-2000s era, x86 processor designs typically separated core compute logic from key system components such as the memory controller and I/O interfaces, which were implemented in a discrete northbridge chip connected via the front-side bus (FSB). This off-chip arrangement introduced substantial overhead, as memory requests had to cross multiple chip boundaries, resulting in high DRAM access latencies; for example, during the Pentium 4 (NetBurst) period around 2000–2004, main memory latencies reached approximately 100 ns due to the external northbridge's role in handling DRAM transactions.[17]
The mid-2000s saw the beginnings of a shift toward on-die integration to address these bottlenecks, with AMD leading the change by incorporating a memory controller directly onto the CPU die in its Opteron processors launched in 2003. This on-die integrated memory controller (IMC) bypassed the traditional northbridge for memory operations, significantly lowering access latencies and improving bandwidth efficiency in multiprocessor systems, a move that pressured competitors like Intel to reevaluate their architectures.[18]
Intel's initial responses maintained much of the external structure but laid groundwork for deeper integration, as seen in the 2006 Core 2 architecture, which enhanced the FSB for better core-to-system communication while keeping the memory controller and I/O in off-chip chipsets like the 965 Express. This hybrid approach reduced some FSB-related delays but still suffered from the latencies inherent to external memory handling, highlighting the need for further consolidation to match advancing core performance.
The transition culminated in Intel's 2008 Nehalem architecture, which marked the company's first full on-die integration of the IMC, enabling direct CPU access to DDR3 memory channels and cutting overall system latency by removing northbridge intermediaries; benchmarks showed memory access times dropping by roughly 30–50% compared to prior FSB-based designs.[19]
Introduction in Nehalem Architecture
The uncore architecture debuted with Intel's Nehalem microarchitecture in 2008, which brought Intel's first x86 implementation of an integrated on-die memory controller (IMC) combined with a shared last-level (L3) cache, as seen in the Bloomfield (desktop) and Gainestown (server) variants such as the Core i7 and Xeon 5500 series.[14] This design integrated key non-core elements directly onto the processor die, fabricated on a 45 nm process, to address limitations of prior off-chip configurations and enable better multi-core scalability. The uncore encompassed the IMC for direct DRAM access, an inclusive 8 MB L3 cache shared among up to four cores, and interfaces for inter-socket communication, fundamentally shifting away from the traditional front-side bus (FSB) model.[20][21]
A pivotal integration was the QuickPath Interconnect (QPI), a packet-based, point-to-point serial link that replaced the FSB for multi-socket systems, providing up to 25.6 GB/s of full-duplex bandwidth per link at 6.4 GT/s and supporting the MESIF cache coherency protocol.[14] The IMC supported three DDR3 channels with up to 32 GB/s aggregate bandwidth per socket, while the L3 cache, organized as 16-way associative, minimized data replication and handled coherency on-die to reduce power and latency overhead. The uncore's modular structure, including a global queue acting as a crossbar between cores and uncore elements, allowed for separate power and clock domains, with the uncore frequency dynamically adjustable relative to the core clock for balanced operation.[20][21]
These changes yielded notable performance benefits, including a greater than 25% reduction in memory latency compared to FSB-based predecessors, dropping local DRAM access from approximately 70 ns off-chip to around 60 ns on-die, alongside improved bandwidth scalability for multi-threaded workloads.[14][21] The shared L3 further contributed to about 30% lower effective latency for cache-coherent accesses, enhancing overall system efficiency without the bottlenecks of external chipsets. This initial uncore design laid the foundation for subsequent Intel architectures by prioritizing on-die integration for lower power consumption and higher throughput in server and desktop environments.[14]
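The quoted 25.6 GB/s figure follows directly from the link parameters: QPI carries 16 data bits per direction per transfer, so at 6.4 GT/s each direction moves 12.8 GB/s. A minimal worked example in C (illustrative arithmetic only):

/* Worked example: how the commonly quoted 25.6 GB/s QPI figure follows
 * from the link parameters above. A QPI link has 20 lanes per direction,
 * of which 16 carry payload bits, so each transfer moves 2 bytes per
 * direction. */
#include <stdio.h>

int main(void) {
    double gt_per_s      = 6.4;  /* billions of transfers per second */
    double bytes_per_dir = 2.0;  /* 16 payload bits per transfer */
    double one_way   = gt_per_s * bytes_per_dir;  /* 12.8 GB/s */
    double full_dup  = one_way * 2.0;             /* both directions */
    printf("QPI per link: %.1f GB/s each way, %.1f GB/s full duplex\n",
           one_way, full_dup);
    return 0;
}

Evolution from Sandy Bridge Onward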
With the introduction of the Sandy Bridge microarchitecture in 2011, Intel renamed the uncore to "System Agent" to better reflect its expanded role in managing non-core functions and to align with industry terminology.[22] This redesign integrated the graphics processing unit directly onto the die alongside the CPU cores and last-level cache, enabling the Power Control Unit to dynamically allocate power and thermal budgets between them for enhanced overall efficiency.[22] The System Agent also incorporated PCIe 2.0 support for I/O connectivity, operating at up to 5 GT/s per lane.[22]
Subsequent iterations from Ivy Bridge (2012) to Skylake (2015) focused on bandwidth enhancements and scalability. Ivy Bridge introduced PCIe 3.0 controller support in the System Agent, doubling the per-lane transfer rate to 8 GT/s compared to Sandy Bridge, which improved data transfer rates for peripherals and storage.[23] For multi-socket server configurations, QuickPath Interconnect (QPI) speeds rose through these Xeon generations to a peak of 9.6 GT/s, facilitating higher inter-processor communication throughput.[24] Broadwell (2014), built on a 14 nm process, expanded the last-level cache capacity to 35 MB in high-end Xeon variants, reducing latency for shared data access across cores.[25] Dynamic uncore frequency scaling (UFS), refined in the Skylake generation, allows the uncore clock to adjust independently based on workload demands, balancing performance with power efficiency across varying usage scenarios.[26]
From Coffee Lake (2017) to Ice Lake (2019), uncore designs emphasized integration for mobile platforms and interconnect evolution. Coffee Lake processors for laptops adopted soldered BGA-1528 packaging, integrating the System Agent more tightly with the platform to reduce form factor and improve thermal management in thin designs.[27] Ice Lake, Intel's first high-volume 10 nm client architecture, incorporated native support for a Wi-Fi 6 (802.11ax) controller into the Platform Controller Hub (PCH) via the CNVi interface, enabling high-speed wireless connectivity without discrete components and optimizing power for always-connected devices.[28][29] In parallel server advancements, the Skylake-SP generation (2017) and its Cascade Lake successor (2019) replaced QPI with the Ultra Path Interconnect (UPI), operating at up to 10.4 GT/s to provide scalable, point-to-point links with improved latency and energy efficiency over QPI.[30] By Comet Lake (2019), uncore power management features, including advanced gating mechanisms, contributed to overall idle power reductions through finer-grained control of inactive domains, supporting Intel's efficiency goals in the 14 nm refreshes.[31]
Key Components
Last-Level Cache
The last-level cache (LLC), also known as the L3 cache, serves as a primary shared resource within the uncore domain of Intel processors, providing a unified storage layer accessible to all cores on the die for improved data locality and reduced off-chip memory accesses.[32] This shared structure enables efficient data sharing among cores while minimizing inter-core communication overhead, positioning the LLC as a cornerstone of uncore functionality since its introduction in the Nehalem microarchitecture.[33]
In early designs like Nehalem, the LLC adopted an inclusive policy, ensuring that all data in the private L1 and L2 caches of individual cores was also present in the LLC to simplify coherency management.[32] This approach persisted through Sandy Bridge, where the LLC remained inclusive of the lower-level caches, facilitating straightforward invalidations and probes across cores.[33] With the Skylake server generation, however, Intel shifted to a non-inclusive design, treating the LLC as a victim cache that primarily holds data evicted from the L2 caches, which reduced redundancy and allowed for larger effective capacity without duplicating core-private data.[34] This evolution continued in later generations, such as Alder Lake, which maintain non-inclusivity to optimize for heterogeneous core layouts while preserving shared access.[35]
The LLC's capacity has scaled significantly over generations to accommodate increasing core counts and workload demands, starting at 8 MB in Nehalem for quad-core configurations and expanding to 30 MB or more in Alder Lake's hybrid designs, with multi-tile implementations in modern server processors reaching 96 MB or beyond through distributed slicing.[32] To balance bandwidth and latency, the LLC is partitioned into slices, typically one per core or core cluster, enabling parallel access and load balancing across the uncore fabric.[36]
Coherency in the LLC is maintained via Intel's MESIF protocol, an extension of the standard MESI scheme that adds a Forward state designating exactly one sharer of a cache line to respond to requests, enabling efficient sharing without unnecessary broadcasts.[37] A key uncore feature, the snoop filter, resides alongside the LLC to track cache-line states across cores, filtering out redundant snoops and minimizing core-to-core traffic by directing probes only to relevant locations.[9]
Performance characteristics of the LLC include hit latencies of approximately 26–40 cycles, varying with core proximity to the cache slice and interconnect distance, which underscores its role in bridging core-private caches and main memory.[38] Bandwidth capabilities have advanced to over 100 GB/s in Alder Lake-era uncore implementations, supporting high-throughput data movement via the ring or mesh interconnect while sustaining peak rates of around 32 bytes per cycle under optimal conditions.[39]
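Latencies of this kind are commonly estimated in software with a dependent pointer chase, in which each load address depends on the previous load's result. The C sketch below is a minimal, illustrative version: the 4 MB buffer size and hop count are assumptions chosen so the working set overflows typical private L2 caches while still fitting in the LLC, and the reported nanoseconds per hop only approximate the cycle figures above once divided by the core clock period.

/* Minimal pointer-chasing sketch for estimating LLC load-to-use latency.
 * A random cyclic permutation of cache-line-sized elements defeats the
 * hardware prefetchers, and the dependent load chain prevents out-of-order
 * overlap, so time per hop approximates the cache hit latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct { size_t next; char pad[56]; } line_t; /* 64 B = 1 cache line */

#define BUF_BYTES (4u << 20)                 /* 4 MB: past L2, inside most LLCs */
#define N         (BUF_BYTES / sizeof(line_t))
#define HOPS      (100u * 1000u * 1000u)

int main(void) {
    line_t *buf   = malloc(N * sizeof *buf);
    size_t *order = malloc(N * sizeof *order);
    if (!buf || !order) return 1;

    /* Build a random cyclic permutation over the cache lines. */
    for (size_t i = 0; i < N; i++) order[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[order[i]].next = order[(i + 1) % N];
    free(order);

    size_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < HOPS; i++) p = buf[p].next;  /* dependent chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    /* Printing p keeps the chase from being optimized away. */
    printf("~%.1f ns per dependent load (final index %zu)\n", ns / HOPS, p);
    free(buf);
    return 0;
}

Integrated Memory Controller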
The Integrated Memory Controller (IMC), a core element of the uncore, manages data transfers between the processor cores and off-chip DRAM, optimizing access patterns and ensuring efficient operation of the memory subsystem. First integrated into Intel's architecture with the Nehalem microarchitecture in 2008, the IMC eliminated the need for a separate northbridge chip by placing memory control directly on the CPU die, initially supporting three-channel DDR3 configurations for enhanced bandwidth in both consumer and server platforms. This design reduced latency compared to prior external controllers and laid the foundation for scalable memory handling within the uncore.[40][21]
Subsequent generations expanded IMC capabilities to accommodate evolving DRAM standards and higher densities. By the Alder Lake architecture in 2021, the IMC supported dual-channel DDR5 for desktop systems and up to four channels of LPDDR5 in mobile variants, while maintaining compatibility with dual-channel DDR4 (only one memory type can be populated per system). High-end server and workstation IMCs, such as those in Xeon Scalable processors, introduced quad-channel support starting with Sandy Bridge-E in 2011, enabling greater parallelism for demanding workloads. These configurations allow flexible population of DIMMs or soldered memory, with the uncore distributing addresses across channels to balance load.[41]
To enhance reliability and performance, the IMC incorporates features like error-correcting code (ECC) for single-bit error detection and correction in supported server environments, hardware prefetching to proactively load anticipated data into the memory pipeline, and rank interleaving, which stripes data across multiple ranks within channels to maximize throughput by enabling concurrent bank accesses. In multi-socket systems, the uncore IMC coordinates across nodes via QPI or UPI links to form Non-Uniform Memory Access (NUMA) domains, directing remote requests to the appropriate local controller while maintaining cache coherence. Access latencies through the IMC typically range from 60 to 80 ns, reflecting the combined effects of DRAM timing and controller overhead, while peak bandwidth scales to approximately 25.6 GB/s per channel in DDR4-3200 setups, providing critical context for bandwidth-intensive applications.[42][21][43][44]
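The per-channel figure quoted above follows from the interface width: a DDR channel transfers 8 bytes per memory transfer, so peak bandwidth is the transfer rate multiplied by 8. A short illustrative calculation in C:

/* Worked example: peak DRAM bandwidth through the IMC. A DDR channel is
 * 64 bits (8 bytes) wide, so peak bytes/s = transfer rate x 8; DDR4-3200
 * therefore peaks at 25.6 GB/s per channel, matching the figure above.
 * Aggregate bandwidth scales with the channel count. */
#include <stdio.h>

static double peak_gbs(double mt_per_s, int channels) {
    return mt_per_s * 8.0 * channels / 1000.0;  /* MT/s -> GB/s */
}

int main(void) {
    printf("DDR4-3200, 1 channel : %.1f GB/s\n", peak_gbs(3200, 1));
    printf("DDR4-3200, 2 channels: %.1f GB/s\n", peak_gbs(3200, 2));
    printf("DDR4-3200, 4 channels: %.1f GB/s\n", peak_gbs(3200, 4));
    return 0;
}

I/O and Interconnect Interfaces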
The uncore in Intel processors incorporates high-speed interconnect interfaces to enable efficient communication between multi-chip configurations and peripheral subsystems, supporting cache coherency and data transfer without relying on memory-specific pathways. These interfaces handle point-to-point links for inter-processor traffic and chipset connectivity, ensuring low-latency operation in scalable systems.[45]
The Intel QuickPath Interconnect (QPI), deployed from 2008 to 2017, served as a point-to-point serial interface for multi-socket coherency in Xeon processors. It operated at speeds ranging from 6.4 GT/s to 9.6 GT/s, providing bandwidth up to 38.4 GB/s per link in bidirectional configurations to carry request, snoop, response, and data transfers across sockets. QPI's packetized protocol supported MESIF-based cache coherency with options for source or home snoop modes, optimizing for both small-scale and large-scale systems.[45]
Succeeding QPI in 2017, the Intel Ultra Path Interconnect (UPI) enhances multi-chip scalability with similar point-to-point links, achieving up to 16 GT/s in the Sapphire Rapids generation for improved coherency traffic. UPI maintains backward compatibility with QPI protocols while integrating with the uncore's internal ring-bus or mesh topology to route external socket-to-socket communications efficiently. This design supports up to three or four links per processor, enabling high-bandwidth transfers in dual- or multi-socket environments.[46]
The Direct Media Interface (DMI) provides the uncore's primary link to the chipset for I/O subsystem access, operating at 8 GT/s in modern Xeon Scalable generations. As a PCIe-derived interface with up to eight lanes, DMI handles non-coherent traffic to peripherals and power management signals, bridging the processor's mesh domain to external controllers.[47]
Within the uncore, the router box (R-box) arbitrates traffic across internal ports, including those connected to QPI/UPI agents, to manage intra- and inter-processor flows via a crossbar structure. It employs multi-level arbitration, at queue, port, and global levels, to select and route packets without intermediate storage, supporting performance monitoring of occupancy and stalls to maintain efficient data movement.[48]
Peripheral Integration
The uncore in Intel processors integrates the PCIe root complex within the System Agent, enabling direct connectivity for high-speed peripherals. This root complex supports PCIe generations evolving from Gen3 in 2012 with Ivy Bridge to Gen5 by 2021 in Alder Lake architectures, with per-lane transfer rates scaling from 8 GT/s to 32 GT/s. Typical configurations allocate 16 to 28 lanes depending on the processor family, with desktop variants often featuring 16 lanes dedicated to graphics or storage, while server models like Xeon Scalable offer up to 64 lanes per socket. Bifurcation allows these lanes to be split into multiple independent links, such as x16 into x8+x8 or x4+x4+x4+x4, enabling simultaneous use by multiple devices without performance bottlenecks.
Integrated graphics processing units (iGPUs) have been embedded on-die, attached to the same interconnect as the cores and uncore, since the Sandy Bridge architecture in 2011, marking a shift from discrete graphics integration. This placement enables the iGPU to directly access the last-level cache (LLC) and share system memory bandwidth with the CPU cores, reducing latency for graphics workloads compared to external GPUs connected via PCIe. For instance, the iGPU uses portions of the LLC for texture and framebuffer data, optimizing coherence in unified memory architectures without dedicated VRAM.
Beyond PCIe and graphics, the platform integrates controllers for advanced peripherals, including Thunderbolt, available from Thunderbolt 3 on Skylake-era platforms (2015) and evolving to integrated support in mobile SoCs such as Ice Lake and later. USB 3.x ports are likewise exposed through integrated controllers, providing up to 10 Gbps per port in configurations like those of Coffee Lake platforms. In the Tiger Lake architecture (2020), Wi-Fi 6E integration occurs through the CNVi interface within the platform's uncore ecosystem, supporting tri-band operation up to the 6 GHz band with modules like the AX210 for enhanced wireless throughput. Uncore designs further support dynamic PCIe lane allocation to balance integrated and discrete components; for example, disabling the iGPU in the BIOS can free platform resources for the discrete GPU or other PCIe endpoints, boosting overall I/O flexibility in hybrid setups.
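The generation-to-generation scaling mentioned above becomes concrete once line encoding is taken into account: PCIe Gen3 through Gen5 use 128b/130b encoding, so usable bandwidth per lane sits slightly below the raw bit rate. A short illustrative calculation in C:

/* Worked example: usable per-lane PCIe bandwidth for the generations
 * discussed above. Each lane moves one bit per transfer; 128b/130b
 * encoding (Gen3 onward) leaves 128/130 of the raw rate as payload. */
#include <stdio.h>

int main(void) {
    struct { const char *gen; double gts; } g[] = {
        { "Gen3",  8.0 },
        { "Gen4", 16.0 },
        { "Gen5", 32.0 },
    };
    double eff = 128.0 / 130.0;                     /* encoding efficiency */
    for (int i = 0; i < 3; i++) {
        double lane_gbs = g[i].gts * eff / 8.0;     /* Gbit/s -> GB/s */
        printf("%s: %.2f GB/s per lane, %.1f GB/s for a x16 link\n",
               g[i].gen, lane_gbs, lane_gbs * 16);
    }
    return 0;
}

Architectural Design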
Modular Unit Structure
The uncore in Intel processors is organized into a modular structure of specialized units, or "boxes," that handle distinct aspects of cache coherence, memory management, and interconnect routing. These units interconnect via on-die fabrics, ring buses in earlier designs or mesh networks in later ones, enabling scalable communication among cores, caches, and I/O components. This modular approach allows for distributed processing of uncore tasks, with each box responsible for specific protocol handling and traffic management.[9]
C-boxes, or caching boxes, function as cache controllers, with one dedicated to each slice of the last-level cache (LLC). They manage snoop requests from cores and other agents, enforce directory-based coherency protocols, and interface between the core complex and the LLC to process incoming transactions such as reads, writes, and coherence probes. Each C-box includes queues for tracking requests and responses, ensuring ordered delivery and conflict resolution within its cache slice. In Haswell-based processors such as the Xeon E5 v3 family (2014), configurations feature one C-box per LLC slice, typically 4 to 8 per socket on mainstream parts and more on higher core-count dies.[9][7]
The Home Agent (HA) serves as the central coordinator for memory-side operations, managing incoming memory requests from the ring or mesh interconnect, tracking cache-line states in the directory, and interfacing with the integrated memory controller to fulfill DRAM accesses. It handles coherence for remote sockets in multi-socket systems, processes snoop filtering, and maintains ordering rules for memory transactions to prevent conflicts. In earlier architectures like Haswell, a single HA per socket oversees all channels, but this evolved into the Caching and Home Agent (CHA) in Skylake and later generations, where HA functionality is distributed across multiple integrated units for improved scalability. Each CHA combines the caching-agent and home-agent roles, with one instance per LLC slice or tile.[9][47]
The R-box acts as a router for intra-uncore traffic, facilitating packet routing and protocol translation between uncore units and external interfaces like PCIe or inter-socket links. It manages credit-based flow control and serialization of messages on the on-die interconnect, ensuring efficient data movement without bottlenecks. Comprising sub-units such as R2PCIe for PCIe traffic and R3QPI/UPI variants for socket-to-socket communication, the R-box connects key elements like the C-boxes/CHAs and the HA/CHA to the broader system fabric. Meanwhile, the Power Control Unit (PCU) provides global coordination across uncore modules, acting as a centralized agent for resource arbitration and state synchronization among boxes. Operating via an internal microcontroller, the PCU interfaces with all uncore components to maintain system-wide consistency of operation.[9][47]
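On Linux, these boxes are visible as separate uncore PMU devices, which offers a convenient way to inspect the modular structure of a running system. The C sketch below simply enumerates them; it assumes a perf-enabled kernel, and the unit names (for example uncore_cha_0 for a CHA slice or uncore_imc_0 for a memory-controller channel) vary by processor generation.

/* Sketch: list the uncore performance-monitoring units the Linux kernel
 * exposes, which mirror the box structure described above. Requires a
 * kernel with uncore PMU support for the installed processor. */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void) {
    const char *base = "/sys/bus/event_source/devices";
    DIR *d = opendir(base);
    if (!d) { perror(base); return 1; }
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        /* Uncore PMUs are registered with an "uncore_" name prefix. */
        if (strncmp(e->d_name, "uncore_", 7) == 0)
            printf("%s\n", e->d_name);
    }
    closedir(d);
    return 0;
}

Clocking Mechanisms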
The Uncore operates within an independent clock domain known as the Uncore clock (UCLK), which drives key components such as the last-level cache and interconnects. In early designs like the Nehalem architecture, UCLK is derived from the base clock (BCLK, typically 133 MHz) via a configurable ratio reported in the CURRENT_UCLK_RATIO MSR, with stock ratios often yielding frequencies around 2.4–2.66 GHz (e.g., a ratio of 18–20).[49] Across generations, UCLK typically ranges from 2 GHz to 3.5 GHz at stock settings, scaling with processor advancements while remaining decoupled from the core clocks for optimized operation.[50]
The ring bus interconnect, which links the coherency boxes (C-boxes) in pre-Skylake Uncore implementations, operates at the UCLK frequency to facilitate data transfer between cores, cache, and I/O.[48] In later generations starting with Skylake server (Skylake-SP) and high-end desktop (Skylake-X) processors, the ring bus topology evolves into a 2D mesh interconnect, still clocked by UCLK but offering improved scalability for higher core counts by distributing traffic across a grid-like structure.[51]
Uncore frequency supports dynamic scaling to adapt to workload demands, with internal algorithms monitoring activity and adjusting UCLK accordingly; Turbo modes enable boosts above the base frequency when thermal and power limits allow.[52] The UCLK is computed as UCLK = BCLK × ratio; for instance, a BCLK of 100 MHz with a ratio of 26 yields 2.6 GHz.[53] In Ivy Bridge processors, certain BIOS implementations unlock the UCLK multiplier for overclocking, allowing frequencies up to approximately 3.4 GHz via elevated ratios (e.g., 34× at 100 MHz BCLK).[54]
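On server parts that expose uncore ratio limits, the UCLK = BCLK × ratio relationship can be checked from software through the MSR interface. The C sketch below reads MSR 0x620, documented as UNCORE_RATIO_LIMIT on several Xeon generations; the register address and field layout should be treated as assumptions to verify against Intel's documentation for a given part, and the msr kernel module plus root privileges are required.

/* Sketch: derive the UCLK range from the ratio limits the hardware
 * reports, via the Linux msr driver (modprobe msr; run as root).
 * Assumption: MSR 0x620 holds the max ratio in bits 6:0 and the min
 * ratio in bits 14:8, in units of the 100 MHz BCLK -- check the Intel
 * SDM for the specific processor before relying on this layout. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    uint64_t val;
    if (pread(fd, &val, sizeof val, 0x620) != (ssize_t)sizeof val) {
        perror("pread MSR 0x620");
        close(fd);
        return 1;
    }
    close(fd);
    unsigned max_ratio = (unsigned)(val & 0x7f);
    unsigned min_ratio = (unsigned)((val >> 8) & 0x7f);
    /* UCLK = BCLK x ratio, with BCLK assumed to be 100 MHz here. */
    printf("uncore ratio %u-%u -> UCLK %u-%u MHz\n",
           min_ratio, max_ratio, min_ratio * 100, max_ratio * 100);
    return 0;
}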
Modular Uncore units, including C-boxes and the integrated memory controller, are synchronized to the UCLK domain for cohesive operation.[9]