
Systolic array

A systolic array is a homogeneous network of tightly coupled processing elements that rhythmically produce and pass data through the system in a pipelined manner, enabling efficient parallel computation with minimal memory access, much like the synchronous pumping of blood in the human circulatory system. The architecture features simple, regular communication paths between identical processors, supporting high throughput for data-intensive operations such as matrix multiplication and linear algebra transformations. Introduced in 1978 by H. T. Kung and Charles E. Leiserson at Carnegie Mellon University, systolic arrays were conceived to leverage the scaling of very-large-scale integration (VLSI) technology, addressing the growing demand for specialized hardware in compute-bound applications by emphasizing modularity, scalability, and reduced data movement overhead. Historically, the design evolved from earlier parallel array processors, shifting focus toward pipelined structures that minimize global communication and exploit spatial locality to achieve predictable performance.

Key principles include synchronous operation via a global clock, repeatability of processing elements for ease of fabrication, and data flow that enters from the boundaries, propagates inward for computation, and exits without buffering, which optimizes the architecture for repetitive algorithms in signal processing and beyond. Early implementations targeted problems like dense matrix computations and convolution, demonstrating potential for very high computational density in VLSI chips.

In modern computing, systolic arrays have seen a resurgence in application-specific accelerators, particularly for deep neural networks (DNNs) and convolutional neural networks (CNNs), where they excel at the matrix-heavy workloads of inference and training. Google's Tensor Processing Unit (TPU), introduced in 2016, exemplifies this with a 256×256 systolic array of multiply-accumulate units that performs 92 tera-operations per second at low power, achieving up to 83 times better performance-per-watt than contemporary CPUs for neural network tasks by reusing data across adjacent elements without frequent memory fetches. Other notable applications include convolution for image and audio analysis, cryptography algorithms, and computer vision tasks, underscoring the architecture's enduring efficiency in domains requiring high parallelism and predictable latency. Despite advantages in throughput and energy efficiency, challenges such as limited flexibility for non-matrix operations and high design complexity persist, driving ongoing research into hybrid and scalable variants.

Fundamentals

Definition and Principles

A systolic array is a homogeneous network of tightly coupled processing elements (PEs), forming a parallel computing architecture in which data flows in a pipelined, rhythmic manner from external memories through the array, while computations occur synchronously in a pulse-like fashion. The term "systolic" derives from the physiological concept of systole, referring to the rhythmic contraction of the heart that pumps blood through the body, mirroring how data is rhythmically computed and passed through the network of processors. This design enables efficient handling of regular, data-intensive algorithms, such as matrix operations, by emphasizing concurrent data movement and computation without reliance on complex control mechanisms.

At its core, each PE in a systolic array performs simple, localized operations—typically multiply-accumulate (MAC) functions—on data received from adjacent neighbors, ensuring that processing remains decentralized. There is no global controller directing operations; instead, the architecture depends on nearest-neighbor communication for data exchange, with only the boundary PEs interfacing with external streams to inject inputs into the array and extract results. This pipelined flow allows multiple computations to proceed simultaneously across the array, akin to a wavefront propagating through the structure, promoting high throughput for repetitive numerical tasks.

Key characteristics of systolic arrays include their regularity, manifested in the uniform arrangement and identical functionality of PEs, which simplifies design and fabrication; modularity, enabling the array to be scaled by adding more PEs to meet varying performance needs; and local interconnects, which restrict communication to short, nearest-neighbor links to reduce latency and wiring complexity. These features collectively support a highly synchronized, pipelined environment optimized for VLSI implementation.

For illustration, consider a simple one-dimensional systolic array: a linear chain of identical PEs connected sequentially, ideal for vector operations such as convolution or dot products. In this setup, one data stream (e.g., a sequence of signal samples) enters from the left boundary PE, while another (e.g., filter coefficients) is supplied at the top of each PE; each PE multiplies the incoming values, adds the product to an accumulated partial sum passed from the previous PE, and forwards the result to the next PE, with the final output emerging from the right end after a delay proportional to the array length.
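A minimal cycle-level sketch of such a chain is shown below, assuming a common weight-stationary arrangement in which each PE holds one coefficient, partial sums advance one PE per clock, and input samples pass through two registers per PE so that each sample meets every coefficient; the function name and register layout are illustrative rather than a specific published design.

```python
def systolic_fir(x, w):
    """
    Cycle-level sketch of a linear systolic chain computing
    y[n] = sum_p w[p] * x[n - p] for n = 0 .. len(x) - 1.
    Assumed arrangement: PE p holds coefficient w[p]; partial sums move
    right one PE per cycle; samples move right through two registers per PE.
    """
    M = len(w) - 1
    X1 = [0.0] * (M + 1)   # first x-register in each PE (tapped by the multiplier)
    X2 = [0.0] * (M + 1)   # second x-register in each PE
    Y = [0.0] * (M + 1)    # partial-sum register at the output of each PE
    out = []
    latency = M + 2                          # cycles before the first result
    for t in range(len(x) + latency):
        x_in = x[t] if t < len(x) else 0.0   # primary input (zeros to drain)
        # Each PE computes its next register values from the *current*
        # register contents, in lockstep and with only local data.
        Y_next = [(Y[p - 1] if p > 0 else 0.0) + w[p] * X1[p] for p in range(M + 1)]
        X1_next = [x_in if p == 0 else X2[p - 1] for p in range(M + 1)]
        X2_next = list(X1)
        # Clock edge: all registers update simultaneously.
        X1, X2, Y = X1_next, X2_next, Y_next
        if t + 1 >= latency:                 # rightmost PE emits one result per cycle
            out.append(Y[M])
    return out[:len(x)]

print(systolic_fir([1, 2, 3, 4], [1, 1]))    # [1.0, 3.0, 5.0, 7.0]
```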

Historical Development

The concept of the systolic array was proposed in the late 1970s by H. T. Kung and Charles E. Leiserson at Carnegie Mellon University, drawing inspiration from the rhythmic pumping action of blood flow in the heart—termed "systole" by physiologists—and the emerging trends in very-large-scale integration (VLSI) technology that favored modular, pipelined designs. This approach addressed the growing limitations of von Neumann architectures, which suffered from memory-computation bottlenecks in handling parallel, compute-intensive tasks such as signal processing and numerical computations. A foundational milestone came in 1978 with Kung and Leiserson's paper, "Systolic Arrays (for VLSI)," which outlined the architecture as a network of processors that rhythmically compute and pass data to exploit VLSI's potential for massive parallelism while minimizing global communication. Throughout the 1980s, theoretical models evolved into practical designs, with Kung's group at Carnegie Mellon pioneering hardware prototypes; a key example was the Warp machine, a linear programmable systolic array with a prototype completed in 1985 and full 10-cell systems in 1986, capable of 10 MFLOPS per cell and targeted at signal and image processing applications. These developments integrated systolic principles with VLSI design paradigms, enabling efficient implementations for digital signal processing (DSP) tasks like convolution and matrix operations. By the 1990s, systolic arrays shifted from custom application-specific integrated circuits (ASICs) to more flexible programmable systems, exemplified by the iWarp project—a collaboration between Carnegie Mellon and Intel—that deployed its first 64-cell system in 1990, supporting both systolic communication and shared-memory models for scalable numerical computations. This evolution maintained the initial emphasis on signal processing and linear algebra problems, laying groundwork for broader adoption while highlighting the architecture's adaptability to advancing semiconductor technologies.

Architectural Design

Core Components

A systolic array is fundamentally composed of processing elements (PEs), which serve as the basic computational units. Each PE is an identical cell equipped with an arithmetic logic unit (ALU) for performing operations such as multiplication and accumulation, registers for temporary data storage, and local control logic to manage autonomous operation. These elements operate in a rhythmic, pipelined manner, receiving inputs from neighbors, computing locally, and passing results onward without requiring global coordination.

The interconnects form the structural backbone, consisting of fixed, nearest-neighbor wiring patterns that facilitate data transfer between adjacent PEs. Common topologies include linear arrays for sequential operations, 2D mesh configurations for matrix computations, and hexagonal arrangements for more complex data flows, all avoiding global buses to minimize latency and wiring complexity. This regular, local interconnection ensures modularity and ease of fabrication in VLSI implementations.

Boundary conditions define the interfaces at the array's edges, where I/O ports connect to external memory or host systems for data loading and result extraction. Unlike traditional architectures, systolic arrays lack an internal memory hierarchy, relying instead on these boundary ports to inject initial data streams and retrieve outputs, with unused boundary outputs often ignored to simplify design.

Scalability is achieved by composing multiple PEs into 1D, 2D, or even 3D configurations, allowing the array size to match computational demands. For instance, a 2D array of PEs can efficiently handle matrix multiplication by aligning data flows along rows and columns, enabling pipelined processing of larger problems through repeated boundary injections. This modular extension supports high throughput without redesigning the core PE logic.
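A minimal sketch of such a PE, assuming a two-operand MAC cell with registered operands and a local accumulator (the class and field names are illustrative, not drawn from a particular implementation):

```python
from dataclasses import dataclass

@dataclass
class PE:
    a: float = 0.0        # operand register (e.g., value flowing rightward)
    b: float = 0.0        # operand register (e.g., value flowing downward)
    acc: float = 0.0      # local accumulator / partial-sum register

    def step(self, a_in: float, b_in: float) -> tuple[float, float]:
        """One clock cycle: latch the incoming operands, multiply-accumulate,
        and expose the latched operands on the output ports for the neighbors."""
        self.a, self.b = a_in, b_in      # register the incoming operands
        self.acc += self.a * self.b      # ALU: multiply-accumulate
        return self.a, self.b            # forwarded to the adjacent PEs

pe = PE()
pe.step(2.0, 3.0)
pe.step(4.0, 5.0)
print(pe.acc)    # 26.0
```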

Data Flow and Synchronization

In systolic arrays, data flows rhythmically through the network of processing elements (PEs), with inputs typically entering from the array boundaries and propagating unidirectionally or bidirectionally between adjacent PEs in a pipelined manner. This systolic rhythm mimics a heartbeat, where data pulses through the array without requiring global broadcasts, enabling efficient data reuse as operands interact locally to perform computations like inner products. Outputs generally exit from the opposite boundaries, and partial pipelining—achieved through registers within PEs—minimizes idle cycles or bubbles by staging data transfers to overlap computation and movement.

Synchronization in systolic arrays is achieved via a global clock that drives simultaneous activations across all PEs, ensuring that data transfers and computations occur in lockstep without the need for complex handshaking, owing to the fixed, local interconnect paths. Registers, often referred to as shadow or latching registers in hardware implementations, temporarily hold data during transfers between PEs, preventing race conditions and maintaining timing integrity as the clock edge triggers both computation and shifting. This clock-driven approach simplifies control, as each PE follows a predefined sequence of operations synchronized to the array's pulsations, avoiding the overhead of asynchronous handshaking.

Common flow patterns include forward flows for input operands entering from one side and backward flows for partial results propagating in the reverse direction, as seen in linear arrays for matrix-vector multiplication where vectors move oppositely to align multiplications. In two-dimensional arrays, skewing techniques stagger input data—such as delaying each row by one clock cycle—to create diagonal wavefronts that align computations across the array, ensuring that dependent operations arrive precisely when needed for pipelined execution. Data dependencies in iterative algorithms are handled through local buffering within PEs, where small registers store intermediate results to resolve hazards without stalling the global flow, preserving the regularity of the dependence graph while enabling wavefront propagation. This local resolution leverages the array's modular structure, allowing dependencies to be mapped spatially so that data arrives just in time, thus maintaining high throughput without global intervention.
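The skewing technique can be made concrete with a small scheduling helper; the sketch below (illustrative naming) lists which element of a matrix enters which row's boundary port on which cycle, delaying row i by i cycles so that operands meet along diagonal wavefronts.

```python
def skewed_schedule(A):
    """Return (cycle, row, value) injection events for matrix A, where
    element A[i][k] enters row i's boundary PE at cycle i + k (skew of i)."""
    events = []
    for i, row in enumerate(A):
        for k, value in enumerate(row):
            events.append((i + k, i, value))
    return sorted(events)

for cycle, row, value in skewed_schedule([[1, 2, 3], [4, 5, 6], [7, 8, 9]]):
    print(f"cycle {cycle}: inject {value} into row {row}")
```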

Goals and Benefits

Performance Objectives

Systolic arrays are designed to maximize throughput by leveraging pipelined data flows that enable a constant rate of computation across the array, particularly for linear algebra operations such as matrix multiplication. This approach targets the use of O(n²) processors to achieve O(n) execution time for problems like solving dense systems of n linear equations, where data is rhythmically pumped through the network to overlap input, computation, and output phases. By ensuring a steady stream of data via simple, regular communication paths, systolic arrays sustain high data rates without bottlenecks from global broadcasts or reductions.

Latency reduction forms another core objective, achieved through local interconnects that minimize communication overhead between neighboring processing elements. The goal is constant-time execution per operation across the array, as data moves only to adjacent cells in a mesh or hexagonal topology, avoiding long-distance wires that dominate delay in conventional designs. For instance, in banded matrix-vector multiplication, this pipelining reduces overall latency from O(wn) in sequential implementations to O(2n + w), where w is the matrix bandwidth.

Scalability targets emphasize linear speedup proportional to array size for streaming tasks, such as banded matrix operations on fixed-size arrays that handle inputs of arbitrary length. Additionally, fault tolerance is incorporated via modular replacement strategies, where redundant processing elements and bypass links allow reconfiguration to isolate and substitute faulty units, thereby maintaining performance and extending operational lifetime in large-scale implementations. Techniques like virtual arrays with 25-100% redundancy achieve linear improvements in reliability metrics, independent of array dimensions.

Resource efficiency objectives prioritize optimization under VLSI area and time constraints, favoring uniform processor layouts and local connections to facilitate dense, low-cost fabrication. Regular grid structures minimize wiring complexity and support modular expansion, aligning with the systolic principle of simplicity for high-volume production. This design enables efficient silicon utilization, where the number of processors scales with computational demands without proportional increases in interconnection overhead.
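For a sense of scale, the latency expressions above can be compared directly; the figures below use hypothetical problem sizes chosen only to illustrate the gap.

```python
# Hypothetical problem sizes, used only to compare the latency formulas above.
n, w = 1024, 7                  # vector length and matrix bandwidth (assumed values)
sequential_cycles = w * n       # O(wn): one multiply-accumulate at a time
systolic_cycles = 2 * n + w     # O(2n + w): pipelined linear systolic array
print(sequential_cycles, systolic_cycles)   # 7168 vs. 2055
```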

Advantages Over Conventional Architectures

Systolic arrays address key limitations of conventional von Neumann architectures, which suffer from a central bottleneck where data must shuttle between a shared memory and a single processor, limiting scalability for compute-intensive tasks. By distributing both storage and computation across a grid of processing elements (PEs) connected via local links, systolic arrays enable multiple computations per memory access, with data flowing rhythmically through the system to minimize global communication overhead. This distributed approach supports massive parallelism without relying on shared buses, allowing for efficient handling of data-parallel algorithms like matrix operations.

The simplicity and regularity of systolic array designs offer significant advantages over irregular multiple-instruction multiple-data (MIMD) architectures, which often require complex control logic and irregular interconnects that complicate implementation. Systolic arrays employ a regular arrangement of identical PEs with local, nearest-neighbor connections, facilitating straightforward very-large-scale integration (VLSI) fabrication and reducing design complexity. This regularity also leads to lower power consumption, as data movement is confined to short, predictable paths, avoiding the energy overhead of long-distance wiring in conventional designs.

Performance in systolic arrays is highly predictable due to their fixed data paths and deterministic synchronization, contrasting with cache-dependent systems where execution times vary based on unpredictable memory hits and misses. The rhythmic data flow ensures near-100% utilization of PEs for well-suited algorithms, as computations proceed in lockstep without idle cycles from resource contention.

Systolic arrays achieve cost-effectiveness through their modular structure, which allows scaling by simply adding more PEs without redesigning the core architecture, making per-operation costs low for high-volume numerical processing. This scalability aligns system cost proportionally with required performance, enabling efficient deployment in specialized hardware for tasks like signal processing.

Operational Mechanics

Processing Cycles

In a systolic array, each clock cycle follows a rhythmic pattern in which processing elements (PEs) synchronously receive input from neighboring PEs, perform a local operation—typically a multiply-accumulate such as updating a partial sum with C \leftarrow C + A \times B—temporarily store the results in local registers, and then forward the data to adjacent PEs for the next cycle. This structure ensures that data flows without global communication, mimicking a heartbeat-like pulsation across the array.

In two-dimensional systolic arrays, such as those used for matrix multiplication, the computations advance via wavefront propagation, where active processing spreads diagonally from the top-left corner toward the bottom-right, progressively activating diagonals of PEs as input operands align. This diagonal wavefront enables efficient overlap of data movement and computation, with each cycle advancing the frontier by one position along the anti-diagonals.

The operational phases include pipeline filling, steady-state execution, and draining: initial cycles load input data into the array, often injecting zeros or boundary values to initialize partial results; steady state follows, where the array reaches full utilization, with every PE contributing and throughput proportional to the input bandwidth; and final cycles output completed results as the data exits the array boundaries. For an N \times N array, filling and draining typically require on the order of 2N cycles, bounding the transient overhead.

For algorithms with iterative loops, such as successive approximation methods, the systolic array handles multiple iterations by pipelining computations across cycles, allowing new iterations to enter as prior ones drain, or through data recirculation where outputs are fed back as inputs for subsequent passes. This approach maintains rhythmic data flow without halting the array, enabling continuous processing of looped workloads.
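As a concrete illustration of these phases, the following cycle-by-cycle sketch models an output-stationary 2D array computing C = A \times B with skewed boundary injection; it is a generic simulation under assumed conventions (A operands moving right, B operands moving down), not the design of any particular chip, and the function name is illustrative.

```python
def systolic_matmul(A, B):
    """
    Cycle-by-cycle sketch of an output-stationary 2-D systolic array computing
    C = A @ B: A-values flow rightward, B-values flow downward, and each PE(i,j)
    accumulates C[i][j] in place. Inputs are skewed so operand pairs meet on
    diagonal wavefronts; fill, steady state, and drain together span the loop.
    """
    n, k_dim, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]          # accumulator inside each PE
    a_reg = [[0] * m for _ in range(n)]      # A operand currently held in each PE
    b_reg = [[0] * m for _ in range(n)]      # B operand currently held in each PE

    total_cycles = n + m + k_dim - 2         # fill + steady state + drain
    for t in range(total_cycles):
        # Boundary injection with skew: row i / column j are delayed by i / j cycles.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < k_dim else 0
        for j in range(m):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < k_dim else 0

        # All PEs fire one multiply-accumulate in lockstep (the "pulse").
        for i in range(n):
            for j in range(m):
                C[i][j] += a_reg[i][j] * b_reg[i][j]

        # Synchronous shifts: A moves one PE to the right, B one PE down.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
    return C

assert systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```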

Mathematical Foundations

The fundamental operation within a processing element (PE) of a systolic array is the multiply-accumulate (MAC), expressed as c \leftarrow c + a \cdot b, where a and b are scalar inputs received from neighboring PEs or array boundaries, and c represents the partial result passed onward after the computation. This recursive form enables pipelined accumulation across the array, with each PE performing one MAC per cycle while data flows rhythmically through local interconnects.

Systolic arrays are synthesized by mapping algorithms, modeled as dependence graphs, onto array topologies via systolic transformations. A dependence graph consists of nodes representing index-domain computations and directed edges denoting dependencies with associated delays. The mapping selects a scheduling vector \mathbf{s}^T to assign execution times to nodes and a projection vector \mathbf{p}^T to map multi-dimensional index spaces to the array's lower-dimensional geometry, ensuring that no two dependent computations are scheduled simultaneously on the same PE. For instance, in matrix-vector multiplication \mathbf{c} = A \mathbf{b}, where A is an n \times m matrix and \mathbf{b} is an m-vector, the dependence graph captures the partial sums c_i^{(k+1)} = c_i^{(k)} + a_{i k} b_k for i = 1 to n and k = 1 to m. A linear projection, such as \mathbf{p}^T = [1 \, 0], maps this to a 1D array of n PEs, with dependencies resolved by propagating b_k and the partial c_i along the chain. The validity of the mapping requires \mathbf{s}^T \mathbf{d} \geq 1 for each dependence vector \mathbf{d}, minimizing latency while respecting the systolic constraint of no global communication.

Timing in systolic arrays is governed by node scheduling functions that dictate wavefront propagation. A standard scheduling function for 2D arrays is \psi(i,j) = i + j, which assigns time steps along diagonals, enabling simultaneous execution of independent computations in a wavefront that advances one step per cycle. This creates rhythmic data movement, where inputs propagate without stalling, provided the PE index is derived from the projection \mathbf{p}^T (i,j)^T within the array bounds. To synchronize arrivals, data skewing introduces initial delays into the input streams; for example, in the matrix-vector case, elements of \mathbf{b} are skewed by offsets proportional to the row indices, ensuring that a_{i k} and b_k meet at PE i exactly when needed. The skew is computed as t = \mathbf{s}^T \mathbf{e} for each input edge \mathbf{e}, aligning the flows to the schedule.

Complexity analysis reveals tradeoffs between processors and time under systolic constraints. For matrix-vector multiplication of an n \times m matrix mapped onto a 1D linear systolic array of size \min(n,m), the execution time is T = n + m - 1 cycles: m cycles for the first result to accumulate across the chain, plus n - 1 further cycles during which the remaining results emerge one per cycle. This derives from the wavefront model, where pipelining overlaps computation and data movement, yielding total latency T = n + m - 1. The processor-time product is p \cdot T \approx n m, matching the O(n m) arithmetic operations, but systolic designs bound p at O(\min(n,m)) while keeping T = O(n + m), outperforming sequential O(n m) time with p = 1. For 2D extensions like matrix multiplication, scaling to p = O(n^2) yields T = O(n), with the constant factor (e.g., 3 for square matrices) determined by the schedule's interaction with the projection. These tradeoffs are formalized by minimizing T = \max_{\mathbf{i} \in I} \mathbf{s}^T \mathbf{i} over the index domain I, subject to the processor count p = |\mathbf{P} I|, where \mathbf{P} is the projection matrix.
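A small script can verify the mapping properties stated above for the matrix-vector example; the helper below (illustrative naming) checks the dependency condition \mathbf{s}^T \mathbf{d} \geq 1, the injectivity of the (time, PE) assignment, and the T = n + m - 1 step count for the schedule \psi(i,k) = i + k with projection (i,k) \mapsto i.

```python
from itertools import product

def check_mapping(n, m):
    schedule = lambda i, k: i + k         # s^T = [1, 1]
    placement = lambda i, k: i            # p^T = [1, 0]: index (i, k) -> PE i
    nodes = list(product(range(n), range(m)))

    # 1. Dependencies: c_i^(k+1) depends on c_i^(k) (d = [0, 1]) and on the
    #    b_k value forwarded from the previous PE      (d = [1, 0]).
    for d in [(0, 1), (1, 0)]:
        assert schedule(*d) >= 1          # s^T d >= 1: operand is ready in time

    # 2. No two computations share the same PE at the same time step.
    slots = {(schedule(i, k), placement(i, k)) for i, k in nodes}
    assert len(slots) == len(nodes)

    # 3. Total number of time steps spanned by the schedule.
    times = [schedule(i, k) for i, k in nodes]
    assert max(times) - min(times) + 1 == n + m - 1

check_mapping(n=4, m=3)      # passes: T = 4 + 3 - 1 = 6 steps on 4 PEs
```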

Examples

Polynomial Multiplication

Systolic arrays provide an efficient mapping for multiplying two polynomials of degree n, denoted as A(x) = \sum_{i=0}^n a_i x^i and B(x) = \sum_{i=0}^n b_i x^i, to compute the product C(x) = A(x) B(x) = \sum_{k=0}^{2n} c_k x^k. This is achieved using a linear systolic array consisting of n+1 processing elements (PEs) arranged in a one-dimensional chain. Each PE performs local multiply-accumulate operations on incoming coefficient streams, leveraging the rhythmic data flow inherent to systolic designs to compute the convolution of the coefficient sequences without global communication.

The mapping exploits the structure of polynomial multiplication as a convolution, where the coefficients of A(x) are input from one end of the array (e.g., the left), flowing rightward, while the coefficients of B(x) enter from the opposite end (e.g., the right), flowing leftward. Each PE, indexed from 0 to n, receives one coefficient from each polynomial at appropriate timings, multiplies them to form a partial product, and accumulates it into a local register that tracks contributions shifted by the PE's position in the array. This ensures that partial products for each degree k of the result are systematically gathered across the PEs, with boundary conditions handled by preloading zeros for leading and trailing coefficients beyond degree n (e.g., a_i = 0 for i > n and similarly for b_i). The design maintains constant-distance communication, as each PE exchanges data only with its immediate neighbors.

Execution proceeds in a pipelined manner over roughly 2n+1 cycles, with the first coefficient c_0 emerging after about n+1 cycles and subsequent coefficients output one per cycle. In each cycle after the initial latency, data propagation aligns the inputs such that the next result emerges from a designated output port, typically the rightmost PE or a dedicated accumulator stage. The core computation for each output coefficient follows the convolution c_k = \sum_{i=0}^k a_i b_{k-i} for k = 0 to 2n, with the systolic scheduling ensuring that all terms in the sum for a given k are accumulated without conflicts. Boundary conditions are managed by injecting zero values at the array ends after the initial coefficients, preventing extraneous contributions and ensuring exact degree-2n results.

The architecture can be visualized as a linear chain of n+1 PEs, with forward data paths for the A coefficients (left to right) and backward data paths for the B coefficients (right to left), connected via nearest-neighbor links. Each PE includes input buffers, a multiplier, an accumulator, and output ports, with control logic to synchronize the flows and handle zero padding for boundaries. This setup exemplifies the systolic principle of localized computation and rhythmic pulsing, enabling high throughput for polynomial operations in VLSI implementations.
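The convolution scheduling can be checked with a cycle-level simulation. The sketch below models a closely related classical linear convolver—B coefficients held stationary in the PEs, A coefficients streaming rightward every other cycle, and partial sums counter-flowing leftward—rather than the exact bidirectional variant described above; this variant emits one coefficient every other cycle, and the function name and register layout are illustrative.

```python
def systolic_poly_mul(a, b):
    """
    Cycle-level sketch of a linear systolic convolver computing the coefficient
    sequence of C(x) = A(x) * B(x). Assumed arrangement: PE p holds b[p]
    stationary; a-coefficients stream right (interleaved with zeros); partial
    sums stream left and leave the array as finished c-coefficients.
    """
    K = len(b)                       # one PE per b-coefficient
    deg_c = len(a) + len(b) - 2      # degree of the product
    a_reg = [0] * K                  # a-value currently held in each PE
    c_reg = [0] * K                  # partial sum currently held in each PE
    out = []

    for t in range(2 * deg_c + 1):
        # Inject the next a-coefficient (every other cycle) at the left boundary.
        a_reg[0] = a[t // 2] if t % 2 == 0 and t // 2 < len(a) else 0

        # Every PE performs its multiply-accumulate simultaneously.
        for p in range(K):
            c_reg[p] += b[p] * a_reg[p]

        # A finished coefficient leaves the leftmost PE on even cycles.
        if t % 2 == 0:
            out.append(c_reg[0])

        # Synchronous shifts: a moves right, partial sums move left,
        # and a fresh (zero) partial sum enters at the right boundary.
        a_reg = [0] + a_reg[:-1]
        c_reg = c_reg[1:] + [0]

    return out   # coefficients c_0 .. c_{2n}

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
assert systolic_poly_mul([1, 2], [3, 4]) == [3, 10, 8]
```

Interleaving two independent coefficient streams into the idle cycles is the standard refinement that restores full per-cycle output in such counter-flowing designs.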

Convolution Operations

Systolic arrays provide an efficient architecture for performing convolution operations, particularly in signal and image processing tasks involving finite impulse response (FIR) filters. The core algorithm computes the output sequence y[n] = \sum_{k=0}^{M} h[k] \cdot x[n - k], where x represents the input signal and h is the fixed filter kernel of length M+1. This computation is mapped onto a linear systolic array comprising M+1 processing elements (PEs), each responsible for a specific coefficient.

In this configuration, the input signal x flows unidirectionally through the array, while the kernel coefficients h propagate in the opposite direction to align with the sliding window. Each PE executes a multiply-accumulate operation, multiplying the arriving x and h values and adding the product to an incoming partial sum, which is then forwarded to the adjacent PE. This pipelined data flow ensures that the dependencies of the convolution are resolved locally within the array. Once the pipeline is filled, typically after an initial latency of O(M) cycles, the array generates one output sample y[n] per clock cycle, achieving constant-time processing per sample independent of kernel size. This design is optimized for FIR filters but extends to infinite impulse response (IIR) variants by incorporating feedback paths in the PE structure to handle recursive dependencies.

For two-dimensional convolution, common in image filtering, the architecture scales to a 2D mesh of PEs, leveraging separable kernels to decompose the operation into two sequential one-dimensional passes. In the first stage, horizontal linear arrays perform row-wise convolutions across the image, producing intermediate maps. These intermediates then undergo column-wise convolutions in a vertical pass using the same or a reconfigured array, effectively computing the full 2D output y[i,j] = \sum_{p=0}^{M} \sum_{q=0}^{M} h[p,q] \cdot x[i-p, j-q] for kernel size (M+1) \times (M+1). This separable approach on a mesh maintains high utilization, with data streaming through rows and columns in a systolic manner to minimize global communication. Efficiency mirrors the 1D case, yielding constant time per output pixel after pipeline fill, while the mesh parallelism scales throughput with the array dimensions.
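The two-pass decomposition can be sketched functionally; the code below assumes a separable kernel h[p,q] = h_row[p] \cdot h_col[q] and uses a plain 1D convolution in place of each linear array, so it reproduces the arithmetic of the two passes rather than the cycle-level hardware behavior (all names are illustrative).

```python
def fir_1d(x, h):
    """y[n] = sum_k h[k] * x[n - k]; the computation one linear array performs."""
    M = len(h) - 1
    return [sum(h[k] * x[n - k] for k in range(M + 1) if 0 <= n - k < len(x))
            for n in range(len(x) + M)]

def separable_conv2d(image, h_row, h_col):
    # First stage: row-wise 1-D convolutions (the horizontal linear arrays).
    rows = [fir_1d(row, h_row) for row in image]
    # Second stage: column-wise 1-D convolutions over the intermediate maps.
    cols = list(zip(*rows))                      # transpose to walk columns
    out_cols = [fir_1d(list(col), h_col) for col in cols]
    return [list(r) for r in zip(*out_cols)]     # transpose back

img = [[1, 2], [3, 4]]
print(separable_conv2d(img, [1, 1], [1, 1]))     # full 2-D convolution result
```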

Sorting Algorithms

Systolic arrays implement parallel comparison-based sorting by mapping algorithms like Batcher's odd-even merge sort or the bitonic sort onto a two-dimensional mesh structure. These algorithms construct sorting networks composed of compare-exchange operations, which are ideally suited to the regular, pipelined data flow of systolic architectures, enabling simultaneous comparisons across multiple data paths.

In the mapping, input data elements enter the array boundaries and propagate through an arrangement of processing elements (PEs) organized into sequential stages that mirror the layers of the sorting network. Each PE functions as a compare-exchange unit, receiving two data values from neighboring PEs or inputs, comparing them, and swapping if the values are out of order to direct smaller elements toward one output (e.g., upward or leftward) while larger ones move oppositely. This localized operation ensures no global communication is needed, maintaining the systolic principle of rhythmic data pulsing without buffering delays. Synchronization occurs via the array's global clock, with data advancing one step per cycle across the stages.

The execution proceeds in a series of logarithmic stages, where the odd-even merge sort recursively divides the input into halves, sorts them, and merges via alternating odd- and even-indexed comparisons. For an n × n mesh sorting up to n² elements, the process requires log₂ n merging levels, with each level comprising up to n/2 compare-exchange operations per row or column; the overall sort completes in O((log n)²) cycles, achieving high throughput as multiple data streams can be pipelined through the network. The bitonic sort variant follows a similar staged structure but builds bitonic sequences (increasing then decreasing) before merging, offering comparable complexity and adaptability to the mesh.

Variants of these implementations employ truncated networks for tasks like finding extrema or performing selection, where only a partial set of stages is activated to isolate the k-th smallest element without completing the full sort. This reduces computational overhead, as the portion of the network up to the desired position suffices, leveraging the same compare-exchange PEs but with fewer cycles and potentially reconfigurable connections in the systolic mesh.
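A bitonic sorting network can be expressed directly as stages of independent compare-exchange operations, which is the role each systolic PE plays; the sketch below (assuming a power-of-two input length, with illustrative naming) enumerates those stages sequentially, whereas the mesh executes each stage's comparisons in parallel.

```python
def bitonic_sort(data):
    """Batcher's bitonic sorting network expressed as stages of independent
    compare-exchange operations. Assumes len(data) is a power of two."""
    a = list(data)
    n = len(a)
    k = 2
    while k <= n:                  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:              # one network stage: its compare-exchanges
            for i in range(n):     # are independent, hence one PE each
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

assert bitonic_sort([7, 3, 9, 1, 5, 8, 2, 6]) == [1, 2, 3, 5, 6, 7, 8, 9]
```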

Applications

Traditional Signal Processing

Systolic arrays found early and prominent applications in traditional digital signal processing (DSP), particularly for tasks requiring high-throughput computation that exceeded the capabilities of general-purpose CPUs at the time. These architectures were especially suited to operations like filtering, where linear systolic arrays could efficiently implement finite impulse response (FIR) filters by pipelining data through processing elements, each performing multiply-accumulate operations on incoming streams. This design allowed for sustained high sample rates, addressing the limitations of conventional architectures that suffered from memory-computation bottlenecks in handling continuous data flows from sensors or audio inputs.

In audio processing, linear systolic arrays were employed to realize digital filters, enabling low-latency equalization and filtering in real-time systems. For instance, a linear array of processing cells could compute filtered outputs every few clock cycles, supporting applications in telephony where voice signals demanded precise temporal processing without delays. Similarly, two-dimensional systolic arrays extended this capability to image enhancement in early video systems, performing convolutions across pixel grids to sharpen or denoise frames, which was critical for broadcast and video technologies in the 1980s. These configurations leveraged the rhythmic data flow inherent to systolic designs, ensuring modular scalability for varying filter lengths or image sizes.

Fast Fourier Transform (FFT) computation also benefited from systolic arrays, with linear or hexagonal configurations mapping the Cooley-Tukey algorithm to achieve pipelined spectral analysis. Such implementations were vital for frequency-domain processing in communications, where systolic structures reduced latency compared to sequential FFT methods, enabling efficient real-time transforms at high data rates. In adaptive beamforming for radar and sonar systems, systolic arrays facilitated the computation of optimal weights for multiple sidelobe cancellers, enhancing spatial filtering to suppress interference and focus on target signals in noisy environments. This was particularly impactful in 1980s defense applications, where adaptability was essential for detecting submerged threats or airborne objects.

The adoption of systolic arrays surged in the 1980s within defense and telecommunications sectors, driven by the need for specialized hardware to handle intensive DSP workloads. For example, systolic chips were developed for speech recognition tasks, using dynamic-programming algorithms mapped to linear arrays for connected word recognition. Projects like wafer-scale systolic integrations supported radar processing in defense initiatives, while telecom firms explored them for high-speed equalization. These early realizations underscored systolic arrays' role in overcoming CPU limitations for sample rates exceeding millions per second, paving the way for dedicated DSP silicon.

Modern Machine Learning Accelerators

Systolic arrays play a central role in modern machine learning accelerators by efficiently handling the matrix multiplications and general matrix multiply (GEMM) operations that dominate deep learning inference and training. These architectures enable high-throughput computation with minimal data movement, addressing the bottlenecks in conventional processors. For instance, Google's Tensor Processing Units (TPUs), introduced in 2016, leverage 2D systolic arrays to perform tensor operations at scale in datacenters. The first-generation TPU (TPUv1) features a single 256×256 systolic array of multiply-accumulate (MAC) units optimized for 8-bit integer operations, delivering up to 92 tera-operations per second while consuming roughly 40 W, primarily for inference workloads. Subsequent versions expanded this design: TPUv2 and TPUv3 incorporate multiple 128×128 systolic arrays per chip (equivalent in compute capacity to a 256×256 array), supporting bfloat16 precision for training and achieving higher throughput through pipelined execution. TPUv4 (2021) builds on this with four 128×128 arrays per chip, adding support for int8 and structured sparsity in embeddings to reduce compute for pruned models, while maintaining compatibility with low-precision formats like FP16 and INT8 for broader acceleration. Later generations continued these advancements: TPUv5 (2023) nearly doubled v4 performance with enhanced systolic arrays; TPUv6e (Trillium, generally available May 2025) offers improved efficiency for large-scale training; and TPUv7 (Ironwood, announced April 2025, generally available Q4 2025) provides over 4x the performance of Trillium, emphasizing inference with advanced systolic designs for generative AI. These optimizations enable TPUs to process large-scale models, such as those in Google's search and translation services, with high utilization on dense tasks.

Beyond Google, other vendors have integrated systolic-like units into their AI accelerators. Apple's Neural Engine, debuting in the A11 Bionic chip in 2017 and evolving through M-series processors, employs compute units based on systolic arrays to accelerate on-device inference for tasks like image recognition and natural language processing. Performance has scaled significantly, from 15.8 TOPS on the A15 (2021) to 38 TOPS on the M4 (2024), with the M5 series (October 2025) featuring an enhanced 16-core Neural Engine for even greater efficiency in mobile deployment. Similarly, Huawei's Ascend series uses DaVinci cores with Cube units—systolic array-based matrix multiply accelerators—complemented by vector and scalar processing for both training and inference, with recent models like the Ascend 910C (2024/2025) emphasizing scalability in datacenter environments and the upcoming Ascend 950 (2026) promising further petaflop-scale performance.

In the 2020s, systolic arrays have seen renewed adoption in FPGA and ASIC designs for edge AI, where reconfigurable fabrics like those in the Versal AI Edge Series Gen 2 (2023) implement sparse systolic slices to handle resource-constrained inference on devices such as smart sensors, offering flexibility over fixed ASICs. Recent advancements focus on hybrid systolic designs that combine systolic arrays with other elements, such as vector units or sparse cores, to enhance versatility for diverse workloads. For example, hybrid systolic architectures integrate reconfigurable dataflows to support both dense and sparse operations, reducing energy consumption by up to 8% in inference while boosting throughput by 57% for models like transformers. These hybrids prioritize energy efficiency for mobile and datacenter applications, enabling systolic arrays to achieve 2.3× energy savings over prior designs in low-power scenarios, thus supporting sustainable scaling of AI deployment.

Implementations and Debates

Hardware Realizations

Early prototypes of systolic arrays emerged in the mid-1980s, with Carnegie Mellon University's Warp project delivering a notable example. A prototype of the Warp system, featuring a 2-cell linear systolic array, was completed in June 1985, with the full 10-cell version completed in early 1986; each cell was capable of 10 MFLOPS, serving as a programmable attached processor for high-performance signal and image processing tasks. This prototype emphasized systolic data flow to achieve efficient pipelined computations, demonstrating feasibility for attached processors in research environments.

Building on Warp, the iWarp project in the 1990s advanced scalable multicomputer architectures with systolic principles. Developed collaboratively by Carnegie Mellon and Intel, iWarp integrated a single-chip component combining a processor core, local memory interface, and communication support into a mesh-connected system; the first 64-cell production systems were delivered in 1990, enabling high-speed parallel computation for scientific applications. These systems supported systolic communication patterns across nodes, scaling to hundreds of cells while maintaining low-latency, rhythmic data flow.

Commercial realizations in the 1980s and 1990s included transputer-based systems and specialized ASICs. Transputers, developed by INMOS, were adapted into systolic arrays, for example in multilayered implementations where one- and two-dimensional transputer configurations achieved efficient data pipelining for model reduction and feedback control tasks. Additionally, systolic ASICs such as those for image processing emerged, with designs leveraging orthogonal interconnects to handle convolutions and correlations in dedicated hardware. Intel's iWarp, commercialized from the academic prototype, represented a key ASIC evolution, packaging systolic elements into scalable, off-the-shelf multicomputers.

In modern hardware, systolic arrays have seen widespread adoption in AI accelerators, exemplified by Google's Tensor Processing Unit (TPU) series since 2015. The first TPU featured a 256×256 systolic array of 8-bit multiply-accumulate units, optimized for matrix multiplications in neural networks, delivering up to 92 tera-operations per second while minimizing data movement. Subsequent iterations, such as TPU v3 with dual 128×128 arrays (2018), and later versions up to the seventh-generation TPU v7 (Ironwood, announced in 2025), have further scaled systolic array designs for cloud-scale training and advanced AI workloads, integrating systolic computation with host interfaces for tensor operations. Recent research has explored scalable systolic arrays for sparse neural networks and energy-efficient edge devices, with prototypes demonstrating up to 68% power savings through optimized dataflows (as of 2024). For prototyping, open-source FPGA implementations have proliferated, such as parametric systolic array generators for matrix multiplication and hardened designs with error detection, enabling customizable prototypes for DNN acceleration.

Despite advances, hardware realizations face challenges in fabrication and integration. Large systolic arrays suffer from reduced yields due to increased defect probabilities in dense interconnects and processing elements, necessitating fault-tolerant designs such as spare elements with bypass links to maintain reliability. Integration with GPUs or CPUs introduces overheads from data transfer latencies and mismatched execution paradigms, often requiring system architectures to balance systolic efficiency with general-purpose flexibility.

Classification Controversies

The classification of systolic arrays has sparked ongoing debates within the parallel computing literature, primarily revolving around strict versus loose interpretations of the core definition. In its original formulation, a systolic array is defined as a homogeneous network of processors that rhythmically compute and pass data through local interconnections, with no global control lines and synchronous operation to enable pipelined throughput without additional logic overhead. This strict view, emphasizing pure systolic data flow where data "pumps" through the array like blood through the heart, excludes designs incorporating hybrid control mechanisms, such as centralized scheduling or external intervention, which later extensions have introduced to enhance flexibility.

Controversies arise over the inclusion of reconfigurable arrays, which allow dynamic reconfiguration of processing elements (PEs) during operation, potentially violating the rhythmic, fixed-flow principle of pure systolic designs. Similarly, arrays employing global clocks for synchronization face criticism, as clock skew and distribution challenges in large-scale implementations can disrupt the intended local, self-timed rhythm, leading some researchers to question their systolic purity. Disputes extend to emerging contexts like neuromorphic computing, where systolic-inspired structures simulate neural dynamics but incorporate non-local feedback loops, blurring the boundaries of traditional definitions. In quantum computing, analogous array-based approaches for quantum circuit simulation have been proposed, yet they often hybridize with classical control, prompting debates on whether such adaptations retain the systolic essence.

Key arguments trace back to the 1980s and 1990s literature, where papers questioned the fit of non-homogeneous PEs—varying functions within the array—arguing that they compromise the uniformity and regularity central to systolic efficiency. For instance, Flynn's taxonomy controversially places systolic arrays under MISD (multiple instruction, single data), but this is widely critiqued as problematic, since the modification of data at each step aligns more closely with SIMD-like behavior, rendering the MISD label "nonsensical" in refined taxonomies. Modern perspectives, such as those on Google's Tensor Processing Units (TPUs), describe them as "systolic-inspired" rather than purely systolic due to integrated host interfaces and hybrid data management that deviate from unadulterated local flow.

These debates have significant implications for taxonomy and benchmarking in computer architecture, as ambiguous boundaries complicate comparisons across architectures and hinder standardized evaluation. The issues remain unresolved amid evolving hardware paradigms, with no consensus on delineating pure systolic designs from their extensions.