Intel Ultra Path Interconnect
The Intel Ultra Path Interconnect (UPI) is a cache-coherent, point-to-point serial interconnect technology developed by Intel for enabling high-bandwidth, low-latency communication between multiple processors in scalable server systems, primarily within the Intel Xeon Scalable processor family.[1] It supports shared memory coherence across sockets, facilitating tasks such as data sharing, I/O routing, and system configuration in multi-processor environments.[2] Introduced on July 11, 2017, alongside the first-generation Intel Xeon Scalable processors (codename Skylake-SP), UPI replaced the earlier QuickPath Interconnect (QPI) to provide improved scalability and efficiency for data center workloads.[3] Subsequent generations have evolved the technology, with speeds increasing to 11.2 GT/s in Ice Lake-SP (third generation, 2021) and the introduction of UPI 2.0 in Sapphire Rapids (fourth generation, 2023) at up to 16 GT/s, further enhanced in Emerald Rapids (fifth generation, late 2023) at up to 20 GT/s and Granite Rapids (sixth generation, 2024) at up to 24 GT/s for greater inter-socket bandwidth.[4][5][6] UPI employs a layered architecture consisting of physical (PHY), link, routing, and protocol layers, with early versions utilizing 20 differential lanes per link for bidirectional data transfer and embedded clocking to minimize latency (UPI 2.0 uses 24 lanes).[4] Key specifications include support for up to three UPI links per processor in early generations (typically two in dual-socket systems, rising to as many as six links in later generations), up to 46-bit physical addressing in early versions (52-bit in UPI 2.0), and home-snoop coherency protocols.[1] Bandwidth varies by version and processor: early implementations operate at 9.6 or 10.4 GT/s (approximately 19.2 or 20.8 GB/s per direction per link), while UPI 2.0 in Sapphire Rapids supports speeds up to 16 GT/s (up to ~38.4 GB/s per direction per link) for enhanced performance in demanding applications, with later enhancements to 24 GT/s.[4][2] Advanced features in UPI 2.0 encompass 16-bit CRC for protocol protection, error detection and logging, viral error modes for reliability, optional cryptography for secure memory transactions including encryption, decryption, and authentication, and CXL compatibility.[2]
Overview
Introduction
The Intel Ultra Path Interconnect (UPI) is a point-to-point, packetized, cache-coherent interconnect designed for multi-socket CPU systems, enabling efficient communication between processors in server environments.[1] It serves as the primary mechanism for linking multiple processor sockets, supporting shared address spaces and maintaining data consistency across sockets.[1] UPI plays a crucial role in facilitating scalable shared-memory architectures for high-performance computing and data center applications, allowing systems to expand from two to eight sockets while preserving coherency.[1] At its core, it operates via bidirectional serial links that transfer data packets between processors, ensuring low-latency exchanges for cache lines and system requests.[1] Introduced in 2017 as a replacement for the QuickPath Interconnect (QPI), UPI debuted with the first-generation Intel Xeon Scalable processors.[1] Modern implementations support speeds up to 24 GT/s, enhancing inter-socket bandwidth for demanding workloads.[7]
Key Features
The Intel Ultra Path Interconnect (UPI) exclusively employs a directory-based home snoop cache coherency protocol, enabling scalable multi-socket systems to maintain data consistency across processors in a shared address space without relying on broadcast snooping mechanisms.[1] This approach contrasts with prior interconnects by focusing solely on directory-based operations for efficient coherency management in high-core-count environments.[1] UPI features a redesigned packetization format that enhances data transfer efficiency through structured flits and virtual channel flow control, supporting low-power states such as L0p to reduce energy consumption during idle periods.[1] These packets facilitate quick transaction completions and integrate with the link layer for optimized bandwidth utilization in coherent traffic.[2] A core architectural element of UPI is the integration of distributed Caching and Home Agents (CHA) within each core and Last Level Cache (LLC) bank, allowing for scalable resource distribution across the processor's mesh interconnect.[1] Each CHA combines coherency agent functionality for request generation and snoop servicing with home agent logic for conflict resolution, minimizing bottlenecks in multi-socket configurations.[2] UPI supports Sub-NUMA Clustering (SNC), which partitions the processor into localized domains for address space interleaving, mapping memory addresses to specific LLC slices to optimize latency-sensitive workloads.[1] This feature enables finer-grained control over NUMA topology, improving performance by reducing remote access overhead within a socket.[8]
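Intel does not publish the actual address-to-slice hash, so the following sketch is purely conceptual: the slice count N_CHA, the modulo hash, and the cha_for_address helper are illustrative assumptions used only to show how SNC narrows interleaving from all LLC slices down to the slices of one local domain.

```python
# Conceptual sketch (not Intel's actual, undocumented hash): how Sub-NUMA
# Clustering (SNC) narrows the set of LLC slices / CHAs that a physical
# address may map to, keeping coherency traffic within a local domain.

N_CHA = 40          # illustrative number of CHA/LLC slices on the die
SNC_DOMAINS = 4     # e.g., an SNC-4 configuration with four clusters

def cha_for_address(phys_addr: int, snc_enabled: bool, domain: int) -> int:
    """Map a physical address to a CHA/LLC slice index (illustrative hash)."""
    line = phys_addr >> 6                      # 64-byte cache-line granularity
    if not snc_enabled:
        return line % N_CHA                    # interleave across all slices
    slices_per_domain = N_CHA // SNC_DOMAINS
    # With SNC, addresses owned by a domain interleave only across that
    # domain's slices, shortening average mesh distance for local accesses.
    return domain * slices_per_domain + (line % slices_per_domain)

if __name__ == "__main__":
    addr = 0x1_2345_6780
    print("global interleave ->", cha_for_address(addr, False, 0))
    print("SNC domain 2      ->", cha_for_address(addr, True, 2))
```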
History and Development
Origins and Replacement of QPI
The Intel QuickPath Interconnect (QPI) was introduced in 2008 as part of Intel's Nehalem microarchitecture, marking the first implementation in 45-nm processors produced in the second half of that year.[9] Designed as a high-speed, packetized, point-to-point interconnect, QPI replaced the traditional front-side bus (FSB) to enable a distributed shared-memory architecture with improved bandwidth and reduced latency in multi-processor systems.[9] It supported flexible cache coherency protocols, including source snoop for low-latency operations in smaller configurations and home snoop with directory-based tracking for better scalability in larger systems, allowing targeted communication between caching agents rather than broadcasts.[9] Despite its advancements, QPI faced challenges in power consumption and efficiency as server demands grew, particularly in large-scale data center environments where multi-socket configurations amplified energy use and interconnect overhead.[1] The protocol's support for multiple snoop modes, while versatile, could lead to increased traffic and power draw in expansive setups, limiting overall system optimization for hyperscale computing.[1] By 2017, these limitations contributed to QPI reaching end-of-life, as Intel shifted focus to more efficient alternatives for enterprise workloads.[10] The development of the Intel Ultra Path Interconnect (UPI) was motivated by the need to address QPI's shortcomings through enhanced power efficiency and superior scalability tailored for data centers.[1] Key improvements included the introduction of a low-power L0p state to reduce idle energy consumption and a streamlined directory-only coherency protocol, which minimized unnecessary snoop traffic by relying solely on home snoop with directory tracking, simplifying design and boosting performance in multi-socket systems.[1] UPI's initial implementation supported up to three links per processor at 10.4 GT/s, enabling configurations from two to eight sockets with optimized bandwidth allocation.[1] The transition from QPI to UPI began with the launch of the Skylake-SP-based Intel Xeon Scalable processors in July 2017, phasing out QPI entirely in new server platforms as UPI became the standard interconnect for subsequent generations.[1][10] This shift aligned with Intel's broader strategy to support denser, more power-efficient computing infrastructures.[1]
Introduction in First-Generation Xeon Scalable
The Intel Ultra Path Interconnect (UPI) debuted in July 2017 with the first-generation Intel Xeon Scalable processors, codenamed Skylake-SP, marking Intel's shift to a new high-speed, cache-coherent inter-processor communication protocol designed for multi-socket server environments.[11] This launch replaced the previous QuickPath Interconnect (QPI) to support enhanced scalability in data center workloads.[12] UPI 1.0 operated at a transfer rate of 10.4 GT/s per link, with each processor supporting up to three links to enable flexible connectivity, though initial implementations emphasized two-socket systems for optimal performance.[4] It integrated across the Xeon Scalable lineup, including the Platinum, Gold, Silver, and Bronze series, and complemented the on-die mesh architecture by providing efficient off-die extensions for inter-core and inter-socket data movement.[13] In typical two-socket configurations, UPI delivered 20.8 GB/s of bandwidth per direction on each link. Early adoption of UPI-equipped first-generation Xeon Scalable processors focused on data center deployments, where they powered virtualization platforms and high-performance computing (HPC) applications by offering improved I/O scalability and memory bandwidth for demanding enterprise tasks.[14] Systems from major vendors like Dell, HPE, and Lenovo rapidly incorporated these processors, enabling up to 1.65x performance gains over prior generations in virtualization-heavy environments.[15]
Evolution in Subsequent Generations
The second-generation Intel Xeon Scalable processors, codenamed Cascade Lake-SP and launched in 2019, retained the original Intel Ultra Path Interconnect (UPI) 1.0 specification from the first generation, operating at 10.4 GT/s per link with support for up to three links per processor to maintain multi-socket scalability.[16][17] This configuration ensured compatibility with existing platforms while introducing enhancements like support for Intel Optane DC persistent memory, which leveraged UPI for efficient data sharing across sockets in memory-intensive workloads.[18] In the third-generation Xeon Scalable processors, codenamed Ice Lake-SP and released in 2021, UPI increased link speeds to 11.2 GT/s for improved inter-processor bandwidth, with up to three links per processor.[19][20] These advancements, built on a 10 nm process, also incorporated protocol optimizations for better efficiency without altering the core coherency model.[21] The fourth-generation Xeon Scalable processors, codenamed Sapphire Rapids and introduced in 2023, marked a significant upgrade with UPI 2.0, boosting link speeds to 16 GT/s across four links per processor to support up to eight-socket configurations.[22][23] A key addition was an integrated cryptography engine within the UPI module, enabling inline security features such as data encryption and integrity checks for secure multi-socket communication in enterprise environments.[2] This version also improved signaling integrity and error correction to handle higher data rates reliably.[24] Building on UPI 2.0, the fifth-generation Xeon Scalable processors, codenamed Emerald Rapids and launched later in 2023, raised link speeds to 20 GT/s while retaining four links per processor for consistent multi-socket performance.[25][26] Integration with DDR5 memory channels complemented these UPI enhancements, allowing faster data movement between sockets and memory subsystems in bandwidth-sensitive applications.[26][27] The sixth-generation Xeon Scalable processors, including the performance-oriented Granite Rapids and efficiency-focused Sierra Forest variants launched in 2024 and early 2025, further advanced UPI 2.0 to 24 GT/s per link with support for up to six links per processor, enabling enhanced scalability in two-socket systems for AI and edge computing workloads.[28][29] These improvements facilitate higher inter-socket throughput for distributed AI inference and training, where rapid data exchange between cores is critical.[30][31] Across these generations, UPI has trended toward progressively higher link speeds—from 10.4 GT/s to 24 GT/s—and increased link counts for greater multi-socket bandwidth, alongside power efficiency gains through advanced low-power states and reduced latency in coherency traffic.[32] These evolutions have optimized UPI for denser core counts and emerging workloads like AI, without disrupting backward compatibility in Xeon platforms.[23]
Technical Architecture
Coherency and Protocol
The Intel Ultra Path Interconnect (UPI) employs a directory-based home snoop coherency protocol to maintain cache consistency across multi-socket systems. This approach utilizes distributed home agents within the Caching and Home Agent (CHA) modules to track the state of cache lines in a directory structure, enabling targeted snoop probes rather than broadcasting to all sockets. By directing coherency traffic only to relevant agents, the protocol minimizes unnecessary inter-socket communication and reduces latency in scaled configurations.[1][2] The UPI protocol stack consists of a protocol layer and a link layer, each handling distinct aspects of transaction processing and transmission. The protocol layer, implemented primarily in the CHA, PCIe root complex, and configuration agent (Ubox), manages high-level operations such as injecting, generating, and servicing transactions for memory access and interrupts. It processes message classes including requests (REQ), snoops (SNP), and responses (RSP) to enforce the MESIF (Modified, Exclusive, Shared, Invalid, Forward) cache states. The link layer, in contrast, operates below this to convert protocol messages into fixed-size flits for transmission, incorporating error detection via cyclic redundancy checks (CRC) and retry mechanisms for corrupted packets.[2][33] Key transaction types in UPI include read and write requests issued by caching agents (CA) for data access, coherency probes (snoops) generated by home agents (HA) to query remote caches, and data responses that complete transfers with optional writebacks (WB) for dirty lines. Flow control is enforced through a credit-based system at the link layer, where virtual channels organized into virtual networks (VN0 and VN1) allocate buffers and prevent overflows by tracking available credits per peer. This ensures reliable ordering and prevents deadlocks in bidirectional traffic.[33][2] Unlike its predecessor, the QuickPath Interconnect (QPI), which supported multiple snoop modes—including no-snoop, early snoop, home snoop, and directory—UPI is optimized exclusively for directory-based operation. This simplification eliminates the overhead of mode selection and preallocation, reducing protocol complexity and enabling better scalability in systems with many sockets by avoiding broadcast storms. The directory-only design lowers coherency latency in large-scale deployments, as home agents resolve states locally without probing unnecessary nodes.[1] The coherency overhead in UPI can be modeled conceptually: directory lookups enable constant-time resolution per transaction via the home agent, whereas traditional snooping scales with the number of sockets n, because the directory maintains sharer and owner pointers that let snoops fan out selectively rather than to all peers.
\text{Directory Lookup Time} \approx O(1) \quad \text{vs.} \quad \text{Snooping Time} \approx O(n)
Such a model highlights UPI's advantage in multi-socket environments, where n grows beyond small configurations.[1]
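As a concrete illustration of the targeted-snoop behavior described above, the following sketch models a home agent directory with simplified owner/sharer tracking. The class names, the reduced state model, and the example addresses are illustrative assumptions; the actual protocol operates on MESIF states carried in flits over credit-managed virtual channels.

```python
# Minimal sketch of directory-based home-snoop coherency: the home agent
# records which socket owns or shares a line, so snoops and invalidations
# target only those sockets instead of being broadcast to all peers.

from dataclasses import dataclass, field
from typing import Dict, Optional, Set

@dataclass
class DirectoryEntry:
    owner: Optional[int] = None          # socket holding the line exclusively
    sharers: Set[int] = field(default_factory=set)

class HomeAgent:
    """Home agent for the address range owned by one socket (illustrative)."""
    def __init__(self) -> None:
        self.directory: Dict[int, DirectoryEntry] = {}

    def read_request(self, addr: int, requester: int) -> str:
        entry = self.directory.setdefault(addr, DirectoryEntry())
        if entry.owner is not None and entry.owner != requester:
            # Targeted snoop: only the recorded owner is probed (O(1)),
            # rather than broadcasting to every socket (O(n)).
            action = f"snoop socket {entry.owner}, forward data"
            entry.sharers.add(entry.owner)
            entry.owner = None
        else:
            action = "serve from local memory"
        entry.sharers.add(requester)
        return action

    def write_request(self, addr: int, requester: int) -> str:
        entry = self.directory.setdefault(addr, DirectoryEntry())
        # Invalidate only the recorded sharers/owner, not all peers.
        owner_set = {entry.owner} if entry.owner is not None else set()
        targets = (entry.sharers | owner_set) - {requester}
        entry.owner, entry.sharers = requester, set()
        return f"invalidate sockets {sorted(targets)}" if targets else "grant ownership directly"

if __name__ == "__main__":
    ha = HomeAgent()
    print(ha.read_request(0x1000, requester=1))   # serve from local memory
    print(ha.write_request(0x1000, requester=2))  # invalidate sockets [1]
    print(ha.read_request(0x1000, requester=3))   # snoop socket 2, forward data
```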
Physical Layer and Links
The physical layer of Intel Ultra Path Interconnect (UPI) implements high-speed serial point-to-point links designed for cache-coherent communication between processors. These links employ differential signaling using complementary positive (DP) and negative (DN) signal pairs to transmit data reliably over short distances within multi-socket server platforms. Embedded clocking is integrated into the data stream, eliminating the need for separate clock lines and enabling synchronous operation at data rates up to 24 GT/s in recent generations, which simplifies board design and reduces pin count.[34][4][35] Data integrity on UPI links is maintained through proprietary encoding at the physical layer for reliable transmission and cyclic redundancy checks (CRC) at the link layer for error detection, with retry mechanisms to handle corrupted flits. Each link direction supports up to 24 lanes in UPI 2.0 implementations, with earlier versions using 20 lanes; these lanes operate in parallel to achieve aggregate bandwidth, where a full-width link aggregates the throughput of all active lanes after encoding efficiency is applied.[34][4] Link training occurs during system initialization to establish reliable communication, involving a sequence of reset, detection, speed negotiation, and equalization phases. The physical layer logic negotiates the highest supported data rate—such as 10.4 GT/s in first-generation UPI or 24 GT/s in later versions—while applying adaptive equalization to compensate for signal degradation over traces. This process ensures optimal lane alignment and margining for error-free operation across varying platform conditions.[36][37] In multi-socket configurations, each CPU typically supports 2 to 6 UPI links, depending on the processor generation and platform topology, with links configurable to form a mesh network for balanced interconnectivity. For instance, two-socket systems often use two links per CPU for direct connectivity, while larger four- or eight-socket setups employ up to six links to maintain low-latency paths in a non-blocking mesh arrangement.[23][37] Power management in the UPI physical layer includes low-power states to optimize energy efficiency during idle or light-load scenarios. The L0 state represents full active operation with maximum performance, while the L0p state provides a partial-width low-power mode by idling a subset of lanes or reducing voltage without fully entering deeper sleep states, enabling quick reactivation for bursty workloads. These states integrate with overall platform power policies to balance performance and thermal constraints.[2][4]
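The bring-up sequence just described can be summarized as a simple state progression. The sketch below is a conceptual model only: the state names, the negotiation rule, the example rate lists, and the train_link helper are illustrative assumptions rather than Intel's register-level training procedure.

```python
# Conceptual model of UPI link bring-up: reset, detect, speed negotiation,
# equalization, then the active L0 state (with L0p as the partial-width
# low-power state). Names and example rates are illustrative.

from enum import Enum, auto

class LinkState(Enum):
    RESET = auto()
    DETECT = auto()               # link partner presence detected
    SPEED_NEGOTIATION = auto()    # agree on the highest common data rate
    EQUALIZATION = auto()         # adaptive equalization at the chosen rate
    L0 = auto()                   # full-width, full-speed operation
    L0P = auto()                  # partial-width low-power state

def train_link(local_rates, remote_rates):
    """Walk the training phases and return the state trace plus negotiated GT/s."""
    trace = [LinkState.RESET, LinkState.DETECT, LinkState.SPEED_NEGOTIATION]
    rate = max(set(local_rates) & set(remote_rates))  # highest common rate
    trace += [LinkState.EQUALIZATION, LinkState.L0]
    return trace, rate

if __name__ == "__main__":
    # Two hypothetical link partners that both support 10.4 and 11.2 GT/s.
    trace, rate = train_link([9.6, 10.4, 11.2], [10.4, 11.2])
    print([s.name for s in trace], rate)   # ends in L0 at 11.2 GT/s
```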
Integration with Processor Design
The Intel Ultra Path Interconnect (UPI) is deeply integrated into the processor's uncore architecture, serving as the primary interface for coherent communication between sockets in multi-socket configurations while leveraging the on-die mesh fabric for internal routing.[1] In designs like the 4th Gen Intel Xeon Scalable processors (codename Sapphire Rapids), UPI controllers are placed near dedicated mesh stops on the periphery of the tile-based structure, allowing efficient access to the overall system fabric without excessive hops.[38] This placement ensures that UPI traffic integrates seamlessly with the processor's modular layout, where compute and I/O elements are connected via embedded multi-die interconnect bridges (EMIB).[39] Caching and Home Agents (CHAs) are distributed across the processor tiles to maintain balanced load distribution, with one CHA typically associated per core tile and last-level cache (LLC) slice.[1] This arrangement enables scalable handling of memory requests and snoops, as each CHA connects directly to the mesh interconnect, facilitating uniform access to shared resources like the LLC and memory controllers.[40] In tile-based implementations, such as those in Sapphire Rapids, this per-tile CHA setup supports the logical monolithicity of multi-tile packages, where all cores can access global resources transparently.[39] On-die connectivity for UPI relies on the bidirectional mesh fabric, which routes packets from core tiles to I/O dies hosting the UPI links.[38] In Sapphire Rapids' chiplet design, EMIB bridges extend the mesh across tiles, minimizing latency for UPI hops by treating the multi-die package as a unified interconnect domain.[23] This mesh-based routing replaces earlier ring topologies, providing higher bandwidth and scalability for intra-package traffic destined for inter-socket UPI transmission.[1] UPI's design enables scalability from 2 to 8 sockets through support for optimized topologies, such as the 8-socket 4-UPI configuration, and NUMA-aware routing that maps memory domains across sockets via UPI links.[39] This allows processors to maintain coherent shared address spaces in large-scale systems, with the mesh distributing UPI-related traffic to prevent bottlenecks.[1] In Granite Rapids-based processors, this extends to up to 6 UPI links per socket, further enhancing multi-socket connectivity in 1- to 8-socket setups.[41] The integration of UPI has evolved from monolithic die designs in Skylake-based processors, where the mesh directly interfaced UPI on a single silicon expanse, to multi-tile architectures in Sapphire Rapids and Granite Rapids.[1] In these later generations, chiplet-based tiles—comprising compute, memory, and I/O elements linked by EMIB—allow UPI to scale with higher core counts while preserving low-latency access through extended mesh routing.[29] This progression supports denser packaging and improved resource sharing without compromising coherency.[41]
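One way to see how the per-socket link count shapes these topologies is the simple relation that L links per socket can fully connect at most L + 1 sockets in a single hop; beyond that, requests must traverse intermediate sockets. The snippet below is an illustrative calculation under that relation, not a list of supported platform configurations.

```python
# Illustrative relation between per-socket UPI link count and the largest
# system that can be fully connected (every socket one hop from every other).
# Larger systems rely on routed, multi-hop UPI paths.

def max_fully_connected_sockets(links_per_socket: int) -> int:
    # A full mesh of S sockets requires S - 1 links on each socket.
    return links_per_socket + 1

if __name__ == "__main__":
    for links in (2, 3, 4, 6):
        print(f"{links} UPI links per socket -> up to "
              f"{max_fully_connected_sockets(links)} sockets single-hop")
    # With 3 links a 4-socket system is fully meshed; an 8-socket system
    # with 3 or 4 links per socket therefore requires multi-hop routing.
```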
Performance and Specifications
Bandwidth Capabilities
The Intel Ultra Path Interconnect (UPI) provides high-speed, full-duplex data transfer capabilities between multi-socket processors, with bandwidth scaling across versions through increased transfer rates. In its initial UPI 1.0 implementation, each link operates at 10.4 GT/s, delivering approximately 20.8 GB/s of unidirectional throughput (41.6 GB/s bidirectional).[11][42] This configuration supports symmetric upload and download speeds, enabling efficient cache-coherent communication in dual-socket systems. Third-generation Ice Lake-SP processors increased the per-link speed to 11.2 GT/s, yielding about 22.4 GB/s unidirectional (44.8 GB/s bidirectional), an improvement that enhanced inter-socket data movement without altering the fundamental link architecture. UPI 2.0, introduced with fourth-generation Sapphire Rapids processors, raised operational speeds to 16 GT/s, equating to roughly 32 GB/s unidirectional per link (64 GB/s bidirectional), and sixth-generation Intel Xeon 6 processors (Granite Rapids, launched 2024) reach peak speeds of 24 GT/s, achieving approximately 48 GB/s unidirectional (96 GB/s bidirectional).[4][34][35][42] These rates maintain full-duplex operation, ensuring balanced throughput in both directions across each link. The per-link figures quoted here correspond to roughly two bytes of usable payload per transfer, i.e., effective bandwidth per direction ≈ transfer rate (GT/s) × 2 B, the same convention behind the commonly cited 20.8 GB/s figure at 10.4 GT/s; actual usable bandwidth also depends on lane width and flit overhead, which differ between UPI versions.[4][35][42] Aggregate bandwidth depends on the number of links per processor, typically three in early generations, four in the fourth generation, and up to six in sixth-generation Xeon 6. For example, UPI 1.0 systems with three links per socket provide 62.4 GB/s total unidirectional bandwidth (3 × 20.8 GB/s), sufficient for many server workloads. In contrast, sixth-generation Xeon 6 processors with six UPI 2.0 links at 24 GT/s deliver around 288 GB/s aggregate unidirectional throughput (6 × 48 GB/s), significantly expanding multi-socket scalability. Throughput is further influenced by protocol and framing overhead, which the link layer keeps small while preserving signal integrity over the differential pairs, so raw transfer rates translate closely to usable data rates, though actual performance can vary slightly based on protocol overhead and link configuration. Multi-link configurations also invite comparison with other coherent interconnects such as NVLink, but detailed comparisons fall outside the scope of these throughput figures. The table below summarizes the per-link and aggregate figures; a worked example of the calculation follows the table.
| UPI Version | Max GT/s per Link | Unidirectional Bandwidth per Link (GB/s) | Typical Links per Socket | Aggregate Unidirectional (GB/s) |
|---|---|---|---|---|
| 1.0 | 10.4 | 20.8 | 3 | 62.4 |
| 1.0 (Ice Lake-SP) | 11.2 | 22.4 | 3 | 67.2 |
| 2.0 (Sapphire Rapids) | 16 | 32 | 4 | 128 |
| 2.0 (Granite Rapids) | 24 | 48 | 6 | 288 |
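A minimal sketch of the calculation behind the table, assuming the two-bytes-per-transfer convention described above; per_link_gbps and aggregate_gbps are illustrative helper names, and real usable bandwidth also varies with lane width and flit overhead.

```python
# Worked example of the per-link and aggregate bandwidth figures tabulated
# above, using the convention of roughly two bytes of payload per transfer
# (the basis for the commonly quoted 20.8 GB/s at 10.4 GT/s).

def per_link_gbps(transfer_rate_gt_s: float, bytes_per_transfer: float = 2.0) -> float:
    """Unidirectional payload bandwidth of one link in GB/s."""
    return transfer_rate_gt_s * bytes_per_transfer

def aggregate_gbps(transfer_rate_gt_s: float, links: int) -> float:
    """Aggregate unidirectional bandwidth across all links of one socket."""
    return per_link_gbps(transfer_rate_gt_s) * links

if __name__ == "__main__":
    configs = [("UPI 1.0 (Skylake-SP)", 10.4, 3),
               ("Ice Lake-SP", 11.2, 3),
               ("UPI 2.0 (Sapphire Rapids)", 16.0, 4),
               ("UPI 2.0 (Granite Rapids)", 24.0, 6)]
    for label, rate, links in configs:
        print(f"{label}: {per_link_gbps(rate):.1f} GB/s per link, "
              f"{aggregate_gbps(rate, links):.1f} GB/s aggregate")
    # UPI 1.0 (Skylake-SP): 20.8 GB/s per link, 62.4 GB/s aggregate
    # UPI 2.0 (Granite Rapids): 48.0 GB/s per link, 288.0 GB/s aggregate
```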