Intel Ultra Path Interconnect
The Intel Ultra Path Interconnect (UPI) is a cache-coherent, point-to-point serial interconnect technology developed by Intel for enabling high-bandwidth, low-latency communication between multiple processors in scalable server systems, primarily within the Intel Xeon Scalable processor family.[1] It supports shared memory coherence across sockets, facilitating tasks such as data sharing, I/O routing, and system configuration in multi-processor environments.[2] Introduced on July 11, 2017, alongside the first-generation Intel Xeon Scalable processors (codename Skylake-SP), UPI replaced the earlier QuickPath Interconnect (QPI) to provide improved scalability and efficiency for data center workloads.[3] Subsequent generations have evolved the technology, with speeds increasing to 11.2 GT/s in Ice Lake-SP (third generation, 2021) and the introduction of UPI 2.0 in Sapphire Rapids (fourth generation, 2023) at up to 16 GT/s, further enhanced in Emerald Rapids (fifth generation, late 2023) at up to 20 GT/s and Granite Rapids (sixth generation, 2024) at up to 24 GT/s for greater inter-socket bandwidth.[4][5][6] UPI employs a layered architecture consisting of physical (PHY), link, routing, and protocol layers, with early versions utilizing 20 differential lanes per link for bidirectional data transfer and embedded clocking to minimize latency (UPI 2.0 uses 24 lanes).[4] Key specifications include support for up to three UPI links per processor in early generations (typically two in dual-socket systems, rising to as many as six links in later generations), up to 46-bit physical addressing in early versions (52-bit in UPI 2.0), and home-snoop coherency protocols.[1] Bandwidth varies by version and processor: early implementations operate at 9.6 or 10.4 GT/s (approximately 19.2 or 20.8 GB/s per direction per link), while UPI 2.0 in Sapphire Rapids supports speeds up to 16 GT/s (up to ~38.4 GB/s per direction per link) for enhanced performance in demanding applications, with later enhancements to 24 GT/s.[4][2] Advanced features in UPI 2.0 encompass 16-bit CRC for protocol protection, error detection and logging, viral error modes for reliability, optional cryptography for secure memory transactions including encryption, decryption, and authentication, and CXL compatibility.[2]
Overview
Introduction
The Intel Ultra Path Interconnect (UPI) is a point-to-point, packetized, cache-coherent interconnect designed for multi-socket CPU systems, enabling efficient communication between processors in server environments.[1] It serves as the primary mechanism for linking multiple processor sockets, supporting shared address spaces and maintaining data consistency across sockets.[1] UPI plays a crucial role in facilitating scalable shared-memory architectures for high-performance computing and data center applications, allowing systems to expand from two to eight sockets while preserving coherency.[1] At its core, it operates via bidirectional serial links that transfer data packets between processors, ensuring low-latency exchanges for cache lines and system requests.[1] Introduced in 2017 as a replacement for the QuickPath Interconnect (QPI), UPI debuted with the first-generation Intel Xeon Scalable processors.[1] Modern implementations support speeds up to 24 GT/s, enhancing inter-socket bandwidth for demanding workloads.[7]
Key Features
The Intel Ultra Path Interconnect (UPI) exclusively employs a directory-based home snoop cache coherency protocol, enabling scalable multi-socket systems to maintain data consistency across processors in a shared address space without relying on broadcast snooping mechanisms.[1] This approach contrasts with prior interconnects by focusing solely on directory-based operations for efficient coherency management in high-core-count environments.[1] UPI features a redesigned packetization format that enhances data transfer efficiency through structured flits and virtual channel flow control, supporting low-power states such as L0p to reduce energy consumption during idle periods.[1] These packets facilitate quick transaction completions and integrate with the link layer for optimized bandwidth utilization in coherent traffic.[2] A core architectural element of UPI is the integration of distributed Caching and Home Agents (CHA) within each core and Last Level Cache (LLC) bank, allowing for scalable resource distribution across the processor's mesh interconnect.[1] Each CHA combines coherency agent functionality for request generation and snoop servicing with home agent logic for conflict resolution, minimizing bottlenecks in multi-socket configurations.[2] UPI supports Sub-NUMA Clustering (SNC), which partitions the processor into localized domains for address space interleaving, mapping memory addresses to specific LLC slices to optimize latency-sensitive workloads.[1] This feature enables finer-grained control over NUMA topology, improving performance by reducing remote access overhead within a socket.[8]
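Intel does not publish the actual address-to-slice hash, so the following sketch is purely conceptual: the slice count N_CHA, the modulo hash, and the cha_for_address helper are illustrative assumptions used only to show how SNC narrows interleaving from all LLC slices down to the slices of one local domain.

```python
# Conceptual sketch (not Intel's actual, undocumented hash): how Sub-NUMA
# Clustering (SNC) narrows the set of LLC slices / CHAs that a physical
# address may map to, keeping coherency traffic within a local domain.

N_CHA = 40          # illustrative number of CHA/LLC slices on the die
SNC_DOMAINS = 4     # e.g., an SNC-4 configuration with four clusters

def cha_for_address(phys_addr: int, snc_enabled: bool, domain: int) -> int:
    """Map a physical address to a CHA/LLC slice index (illustrative hash)."""
    line = phys_addr >> 6                      # 64-byte cache-line granularity
    if not snc_enabled:
        return line % N_CHA                    # interleave across all slices
    slices_per_domain = N_CHA // SNC_DOMAINS
    # With SNC, addresses owned by a domain interleave only across that
    # domain's slices, shortening average mesh distance for local accesses.
    return domain * slices_per_domain + (line % slices_per_domain)

if __name__ == "__main__":
    addr = 0x1_2345_6780
    print("global interleave ->", cha_for_address(addr, False, 0))
    print("SNC domain 2      ->", cha_for_address(addr, True, 2))
```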
History and Development
Origins and Replacement of QPI
The Intel QuickPath Interconnect (QPI) was introduced in 2008 as part of Intel's Nehalem microarchitecture, marking the first implementation in 45-nm processors produced in the second half of that year.[9] Designed as a high-speed, packetized, point-to-point interconnect, QPI replaced the traditional front-side bus (FSB) to enable a distributed shared-memory architecture with improved bandwidth and reduced latency in multi-processor systems.[9] It supported flexible cache coherency protocols, including source snoop for low-latency operations in smaller configurations and home snoop with directory-based tracking for better scalability in larger systems, allowing targeted communication between caching agents rather than broadcasts.[9] Despite its advancements, QPI faced challenges in power consumption and efficiency as server demands grew, particularly in large-scale data center environments where multi-socket configurations amplified energy use and interconnect overhead.[1] The protocol's support for multiple snoop modes, while versatile, could lead to increased traffic and power draw in expansive setups, limiting overall system optimization for hyperscale computing.[1] By 2017, these limitations contributed to QPI reaching end-of-life, as Intel shifted focus to more efficient alternatives for enterprise workloads.[10] The development of the Intel Ultra Path Interconnect (UPI) was motivated by the need to address QPI's shortcomings through enhanced power efficiency and superior scalability tailored for data centers.[1] Key improvements included the introduction of a low-power L0p state to reduce idle energy consumption and a streamlined directory-only coherency protocol, which minimized unnecessary snoop traffic by relying solely on home snoop with directory tracking, simplifying design and boosting performance in multi-socket systems.[1] UPI's initial implementation supported up to three links per processor at 10.4 GT/s, enabling configurations from two to eight sockets with optimized bandwidth allocation.[1] The transition from QPI to UPI began with the launch of the Skylake-SP-based Intel Xeon Scalable processors in July 2017, phasing out QPI entirely in new server platforms as UPI became the standard interconnect for subsequent generations.[1][10] This shift aligned with Intel's broader strategy to support denser, more power-efficient computing infrastructures.[1]
Introduction in First-Generation Xeon Scalable
The Intel Ultra Path Interconnect (UPI) debuted in July 2017 with the first-generation Intel Xeon Scalable processors, codenamed Skylake-SP, marking Intel's shift to a new high-speed, cache-coherent inter-processor communication protocol designed for multi-socket server environments.[11] This launch replaced the previous QuickPath Interconnect (QPI) to support enhanced scalability in data center workloads.[12] UPI 1.0 operated at a transfer rate of 10.4 GT/s per link, with each processor supporting up to three links to enable flexible connectivity, though initial implementations emphasized two-socket systems for optimal performance.[4] It integrated across the Xeon Scalable lineup, including the Platinum, Gold, Silver, and Bronze series, and complemented the on-die mesh architecture by providing efficient off-die extensions for inter-core and inter-socket data movement.[13] In typical two-socket configurations, UPI delivered 20.8 GB/s of bandwidth per direction on each link. Early adoption of UPI-equipped first-generation Xeon Scalable processors focused on data center deployments, where they powered virtualization platforms and high-performance computing (HPC) applications by offering improved I/O scalability and memory bandwidth for demanding enterprise tasks.[14] Systems from major vendors like Dell, HPE, and Lenovo rapidly incorporated these processors, enabling up to 1.65x performance gains over prior generations in virtualization-heavy environments.[15]
Evolution in Subsequent Generations
The second-generation Intel Xeon Scalable processors, codenamed Cascade Lake-SP and launched in 2019, retained the original Intel Ultra Path Interconnect (UPI) 1.0 specification from the first generation, operating at 10.4 GT/s per link with support for up to three links per processor to maintain multi-socket scalability.[16][17] This configuration ensured compatibility with existing platforms while introducing enhancements like support for Intel Optane DC persistent memory, which leveraged UPI for efficient data sharing across sockets in memory-intensive workloads.[18] In the third-generation Xeon Scalable processors, codenamed Ice Lake-SP and released in 2021, UPI increased link speeds to 11.2 GT/s for improved inter-processor bandwidth, with up to three links per processor.[19][20] These advancements, built on a 10 nm process, also incorporated protocol optimizations for better efficiency without altering the core coherency model.[21] The fourth-generation Xeon Scalable processors, codenamed Sapphire Rapids and introduced in 2023, marked a significant upgrade with UPI 2.0, boosting link speeds to 16 GT/s across four links per processor to support up to eight-socket configurations.[22][23] A key addition was an integrated cryptography engine within the UPI module, enabling inline security features such as data encryption and integrity checks for secure multi-socket communication in enterprise environments.[2] This version also improved signaling integrity and error correction to handle higher data rates reliably.[24] Building on UPI 2.0, the fifth-generation Xeon Scalable processors, codenamed Emerald Rapids and launched later in 2023, raised link speeds to 20 GT/s while retaining four links per processor for consistent multi-socket performance.[25][26] Integration with DDR5 memory channels complemented these UPI enhancements, allowing faster data movement between sockets and memory subsystems in bandwidth-sensitive applications.[26][27] The sixth-generation Xeon Scalable processors, including the performance-oriented Granite Rapids and efficiency-focused Sierra Forest variants launched in 2024 and early 2025, further advanced UPI 2.0 to 24 GT/s per link with support for up to six links per processor, enabling enhanced scalability in two-socket systems for AI and edge computing workloads.[28][29] These improvements facilitate higher inter-socket throughput for distributed AI inference and training, where rapid data exchange between cores is critical.[30][31] Across these generations, UPI has trended toward progressively higher link speeds—from 10.4 GT/s to 24 GT/s—and increased link counts for greater multi-socket bandwidth, alongside power efficiency gains through advanced low-power states and reduced latency in coherency traffic.[32] These evolutions have optimized UPI for denser core counts and emerging workloads like AI, without disrupting backward compatibility in Xeon platforms.[23]
Technical Architecture
Coherency and Protocol
The Intel Ultra Path Interconnect (UPI) employs a directory-based home snoop coherency protocol to maintain cache consistency across multi-socket systems. This approach utilizes distributed home agents within the Caching and Home Agent (CHA) modules to track the state of cache lines in a directory structure, enabling targeted snoop probes rather than broadcasting to all sockets. By directing coherency traffic only to relevant agents, the protocol minimizes unnecessary inter-socket communication and reduces latency in scaled configurations.[1][2] The UPI protocol stack consists of a protocol layer and a link layer, each handling distinct aspects of transaction processing and transmission. The protocol layer, implemented primarily in the CHA, PCIe root complex, and configuration agent (Ubox), manages high-level operations such as injecting, generating, and servicing transactions for memory access and interrupts. It processes message classes including requests (REQ), snoops (SNP), and responses (RSP) to enforce the MESIF (Modified, Exclusive, Shared, Invalid, Forward) cache states. The link layer, in contrast, operates below this to convert protocol messages into fixed-size flits for transmission, incorporating error detection via cyclic redundancy checks (CRC) and retry mechanisms for corrupted packets.[2][33] Key transaction types in UPI include read and write requests issued by caching agents (CA) for data access, coherency probes (snoops) generated by home agents (HA) to query remote caches, and data responses that complete transfers with optional writebacks (WB) for dirty lines. Flow control is enforced through a credit-based system at the link layer, where virtual channels organized into virtual networks (VN0 and VN1) allocate buffers and prevent overflows by tracking available credits per peer. This ensures reliable ordering and prevents deadlocks in bidirectional traffic.[33][2] Unlike its predecessor, the QuickPath Interconnect (QPI), which supported multiple snoop modes—including no-snoop, early snoop, home snoop, and directory—UPI is optimized exclusively for directory-based operation. This simplification eliminates the overhead of mode selection and preallocation, reducing protocol complexity and enabling better scalability in systems with many sockets by avoiding broadcast storms. The directory-only design lowers coherency latency in large-scale deployments, as home agents resolve states locally without probing unnecessary nodes.[1] The coherency overhead in UPI can be modeled conceptually: directory lookups enable constant-time resolution per transaction via the home agent, whereas traditional snooping scales with the number of sockets n, because the directory maintains sharer and owner pointers that let snoops fan out selectively rather than to all peers.
\text{Directory Lookup Time} \approx O(1) \quad \text{vs.} \quad \text{Snooping Time} \approx O(n)
Such a model highlights UPI's advantage in multi-socket environments, where n grows beyond small configurations.[1]
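As a concrete illustration of the targeted-snoop behavior described above, the following sketch models a home agent directory with simplified owner/sharer tracking. The class names, the reduced state model, and the example addresses are illustrative assumptions; the actual protocol operates on MESIF states carried in flits over credit-managed virtual channels.

```python
# Minimal sketch of directory-based home-snoop coherency: the home agent
# records which socket owns or shares a line, so snoops and invalidations
# target only those sockets instead of being broadcast to all peers.

from dataclasses import dataclass, field
from typing import Dict, Optional, Set

@dataclass
class DirectoryEntry:
    owner: Optional[int] = None          # socket holding the line exclusively
    sharers: Set[int] = field(default_factory=set)

class HomeAgent:
    """Home agent for the address range owned by one socket (illustrative)."""
    def __init__(self) -> None:
        self.directory: Dict[int, DirectoryEntry] = {}

    def read_request(self, addr: int, requester: int) -> str:
        entry = self.directory.setdefault(addr, DirectoryEntry())
        if entry.owner is not None and entry.owner != requester:
            # Targeted snoop: only the recorded owner is probed (O(1)),
            # rather than broadcasting to every socket (O(n)).
            action = f"snoop socket {entry.owner}, forward data"
            entry.sharers.add(entry.owner)
            entry.owner = None
        else:
            action = "serve from local memory"
        entry.sharers.add(requester)
        return action

    def write_request(self, addr: int, requester: int) -> str:
        entry = self.directory.setdefault(addr, DirectoryEntry())
        # Invalidate only the recorded sharers/owner, not all peers.
        owner_set = {entry.owner} if entry.owner is not None else set()
        targets = (entry.sharers | owner_set) - {requester}
        entry.owner, entry.sharers = requester, set()
        return f"invalidate sockets {sorted(targets)}" if targets else "grant ownership directly"

if __name__ == "__main__":
    ha = HomeAgent()
    print(ha.read_request(0x1000, requester=1))   # serve from local memory
    print(ha.write_request(0x1000, requester=2))  # invalidate sockets [1]
    print(ha.read_request(0x1000, requester=3))   # snoop socket 2, forward data
```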
Physical Layer and Links
The physical layer of Intel Ultra Path Interconnect (UPI) implements high-speed serial point-to-point links designed for cache-coherent communication between processors. These links employ differential signaling using complementary positive (DP) and negative (DN) signal pairs to transmit data reliably over short distances within multi-socket server platforms. Embedded clocking is integrated into the data stream, eliminating the need for separate clock lines and enabling synchronous operation at data rates up to 24 GT/s in recent generations, which simplifies board design and reduces pin count.[34][4][35] Data integrity on UPI links is maintained through proprietary encoding at the physical layer for reliable transmission and cyclic redundancy checks (CRC) at the link layer for error detection, with retry mechanisms to handle corrupted flits. Each link direction supports up to 24 lanes in UPI 2.0 implementations, with earlier versions using 20 lanes; these lanes operate in parallel to achieve aggregate bandwidth, where a full-width link aggregates the throughput of all active lanes after encoding efficiency is applied.[34][4] Link training occurs during system initialization to establish reliable communication, involving a sequence of reset, detection, speed negotiation, and equalization phases. The physical layer logic negotiates the highest supported data rate—such as 10.4 GT/s in first-generation UPI or 24 GT/s in later versions—while applying adaptive equalization to compensate for signal degradation over traces. This process ensures optimal lane alignment and margining for error-free operation across varying platform conditions.[36][37] In multi-socket configurations, each CPU typically supports 2 to 6 UPI links, depending on the processor generation and platform topology, with links configurable to form a mesh network for balanced interconnectivity. For instance, two-socket systems often use two links per CPU for direct connectivity, while larger four- or eight-socket setups employ up to six links to maintain low-latency paths in a non-blocking mesh arrangement.[23][37] Power management in the UPI physical layer includes low-power states to optimize energy efficiency during idle or light-load scenarios. The L0 state represents full active operation with maximum performance, while the L0p state provides a partial-width low-power mode by idling a subset of lanes or reducing voltage without fully entering deeper sleep states, enabling quick reactivation for bursty workloads. These states integrate with overall platform power policies to balance performance and thermal constraints.[2][4]
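The bring-up sequence just described can be summarized as a simple state progression. The sketch below is a conceptual model only: the state names, the negotiation rule, the example rate lists, and the train_link helper are illustrative assumptions rather than Intel's register-level training procedure.

```python
# Conceptual model of UPI link bring-up: reset, detect, speed negotiation,
# equalization, then the active L0 state (with L0p as the partial-width
# low-power state). Names and example rates are illustrative.

from enum import Enum, auto

class LinkState(Enum):
    RESET = auto()
    DETECT = auto()               # link partner presence detected
    SPEED_NEGOTIATION = auto()    # agree on the highest common data rate
    EQUALIZATION = auto()         # adaptive equalization at the chosen rate
    L0 = auto()                   # full-width, full-speed operation
    L0P = auto()                  # partial-width low-power state

def train_link(local_rates, remote_rates):
    """Walk the training phases and return the state trace plus negotiated GT/s."""
    trace = [LinkState.RESET, LinkState.DETECT, LinkState.SPEED_NEGOTIATION]
    rate = max(set(local_rates) & set(remote_rates))  # highest common rate
    trace += [LinkState.EQUALIZATION, LinkState.L0]
    return trace, rate

if __name__ == "__main__":
    # Two hypothetical link partners that both support 10.4 and 11.2 GT/s.
    trace, rate = train_link([9.6, 10.4, 11.2], [10.4, 11.2])
    print([s.name for s in trace], rate)   # ends in L0 at 11.2 GT/s
```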
Integration with Processor Design
The Intel Ultra Path Interconnect (UPI) is deeply integrated into the processor's uncore architecture, serving as the primary interface for coherent communication between sockets in multi-socket configurations while leveraging the on-die mesh fabric for internal routing.[1] In designs like the 4th Gen Intel Xeon Scalable processors (codename Sapphire Rapids), UPI controllers are placed near dedicated mesh stops on the periphery of the tile-based structure, allowing efficient access to the overall system fabric without excessive hops.[38] This placement ensures that UPI traffic integrates seamlessly with the processor's modular layout, where compute and I/O elements are connected via embedded multi-die interconnect bridges (EMIB).[39] Caching and Home Agents (CHAs) are distributed across the processor tiles to maintain balanced load distribution, with one CHA typically associated per core tile and last-level cache (LLC) slice.[1] This arrangement enables scalable handling of memory requests and snoops, as each CHA connects directly to the mesh interconnect, facilitating uniform access to shared resources like the LLC and memory controllers.[40] In tile-based implementations, such as those in Sapphire Rapids, this per-tile CHA setup supports the logical monolithicity of multi-tile packages, where all cores can access global resources transparently.[39] On-die connectivity for UPI relies on the bidirectional mesh fabric, which routes packets from core tiles to I/O dies hosting the UPI links.[38] In Sapphire Rapids' chiplet design, EMIB bridges extend the mesh across tiles, minimizing latency for UPI hops by treating the multi-die package as a unified interconnect domain.[23] This mesh-based routing replaces earlier ring topologies, providing higher bandwidth and scalability for intra-package traffic destined for inter-socket UPI transmission.[1] UPI's design enables scalability from 2 to 8 sockets through support for optimized topologies, such as the 8-socket 4-UPI configuration, and NUMA-aware routing that maps memory domains across sockets via UPI links.[39] This allows processors to maintain coherent shared address spaces in large-scale systems, with the mesh distributing UPI-related traffic to prevent bottlenecks.[1] In Granite Rapids-based processors, this extends to up to 6 UPI links per socket, further enhancing multi-socket connectivity in 1- to 8-socket setups.[41] The integration of UPI has evolved from monolithic die designs in Skylake-based processors, where the mesh directly interfaced UPI on a single silicon expanse, to multi-tile architectures in Sapphire Rapids and Granite Rapids.[1] In these later generations, chiplet-based tiles—comprising compute, memory, and I/O elements linked by EMIB—allow UPI to scale with higher core counts while preserving low-latency access through extended mesh routing.[29] This progression supports denser packaging and improved resource sharing without compromising coherency.[41]
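One way to see how the per-socket link count shapes these topologies is the simple relation that L links per socket can fully connect at most L + 1 sockets in a single hop; beyond that, requests must traverse intermediate sockets. The snippet below is an illustrative calculation under that relation, not a list of supported platform configurations.

```python
# Illustrative relation between per-socket UPI link count and the largest
# system that can be fully connected (every socket one hop from every other).
# Larger systems rely on routed, multi-hop UPI paths.

def max_fully_connected_sockets(links_per_socket: int) -> int:
    # A full mesh of S sockets requires S - 1 links on each socket.
    return links_per_socket + 1

if __name__ == "__main__":
    for links in (2, 3, 4, 6):
        print(f"{links} UPI links per socket -> up to "
              f"{max_fully_connected_sockets(links)} sockets single-hop")
    # With 3 links a 4-socket system is fully meshed; an 8-socket system
    # with 3 or 4 links per socket therefore requires multi-hop routing.
```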
Performance and Specifications
Bandwidth Capabilities
The Intel Ultra Path Interconnect (UPI) provides high-speed, full-duplex data transfer capabilities between multi-socket processors, with bandwidth scaling across versions through increased transfer rates. In its initial UPI 1.0 implementation, each link operates at 10.4 GT/s, delivering approximately 20.8 GB/s of unidirectional throughput (41.6 GB/s bidirectional).[11][42] This configuration supports symmetric upload and download speeds, enabling efficient cache-coherent communication in dual-socket systems. Third-generation Ice Lake-SP processors increased the per-link speed to 11.2 GT/s, yielding about 22.4 GB/s unidirectional (44.8 GB/s bidirectional), an improvement that enhanced inter-socket data movement without altering the fundamental link architecture. UPI 2.0, introduced with fourth-generation Sapphire Rapids processors, raised operational speeds to 16 GT/s, equating to roughly 32 GB/s unidirectional per link (64 GB/s bidirectional), and sixth-generation Intel Xeon 6 processors (Granite Rapids, launched 2024) reach peak speeds of 24 GT/s, achieving approximately 48 GB/s unidirectional (96 GB/s bidirectional).[4][34][35][42] These rates maintain full-duplex operation, ensuring balanced throughput in both directions across each link. The per-link figures quoted here correspond to roughly two bytes of usable payload per transfer, i.e., effective bandwidth per direction ≈ transfer rate (GT/s) × 2 B, the same convention behind the commonly cited 20.8 GB/s figure at 10.4 GT/s; actual usable bandwidth also depends on lane width and flit overhead, which differ between UPI versions.[4][35][42] Aggregate bandwidth depends on the number of links per processor, typically three in early generations, four in the fourth generation, and up to six in sixth-generation Xeon 6. For example, UPI 1.0 systems with three links per socket provide 62.4 GB/s total unidirectional bandwidth (3 × 20.8 GB/s), sufficient for many server workloads. In contrast, sixth-generation Xeon 6 processors with six UPI 2.0 links at 24 GT/s deliver around 288 GB/s aggregate unidirectional throughput (6 × 48 GB/s), significantly expanding multi-socket scalability. Throughput is further influenced by protocol and framing overhead, which the link layer keeps small while preserving signal integrity over the differential pairs, so raw transfer rates translate closely to usable data rates, though actual performance can vary slightly based on protocol overhead and link configuration. Multi-link configurations also invite comparison with other coherent interconnects such as NVLink, but detailed comparisons fall outside the scope of these throughput figures. The table below summarizes the per-link and aggregate figures; a worked example of the calculation follows the table.
| UPI Version | Max GT/s per Link | Unidirectional Bandwidth per Link (GB/s) | Typical Links per Socket | Aggregate Unidirectional (GB/s) |
|---|---|---|---|---|
| 1.0 | 10.4 | 20.8 | 3 | 62.4 |
| 1.0 (Ice Lake-SP) | 11.2 | 22.4 | 3 | 67.2 |
| 2.0 (Sapphire Rapids) | 16 | 32 | 4 | 128 |
| 2.0 (Granite Rapids) | 24 | 48 | 6 | 288 |
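A minimal sketch of the calculation behind the table, assuming the two-bytes-per-transfer convention described above; per_link_gbps and aggregate_gbps are illustrative helper names, and real usable bandwidth also varies with lane width and flit overhead.

```python
# Worked example of the per-link and aggregate bandwidth figures tabulated
# above, using the convention of roughly two bytes of payload per transfer
# (the basis for the commonly quoted 20.8 GB/s at 10.4 GT/s).

def per_link_gbps(transfer_rate_gt_s: float, bytes_per_transfer: float = 2.0) -> float:
    """Unidirectional payload bandwidth of one link in GB/s."""
    return transfer_rate_gt_s * bytes_per_transfer

def aggregate_gbps(transfer_rate_gt_s: float, links: int) -> float:
    """Aggregate unidirectional bandwidth across all links of one socket."""
    return per_link_gbps(transfer_rate_gt_s) * links

if __name__ == "__main__":
    configs = [("UPI 1.0 (Skylake-SP)", 10.4, 3),
               ("Ice Lake-SP", 11.2, 3),
               ("UPI 2.0 (Sapphire Rapids)", 16.0, 4),
               ("UPI 2.0 (Granite Rapids)", 24.0, 6)]
    for label, rate, links in configs:
        print(f"{label}: {per_link_gbps(rate):.1f} GB/s per link, "
              f"{aggregate_gbps(rate, links):.1f} GB/s aggregate")
    # UPI 1.0 (Skylake-SP): 20.8 GB/s per link, 62.4 GB/s aggregate
    # UPI 2.0 (Granite Rapids): 48.0 GB/s per link, 288.0 GB/s aggregate
```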