A systolic array is a homogeneous network of tightly coupled processing elements that rhythmically produce and pass data through the system in a pipelined manner, enabling efficient parallel computation with minimal memory access, much like the synchronous pumping of blood in the human circulatory system.[1] The architecture features simple, regular communication paths between identical processors, supporting high throughput for data-intensive operations such as matrix multiplication and linear algebra transformations.[1] Introduced in 1978 by H. T. Kung and Charles E. Leiserson at Carnegie Mellon University, systolic arrays were conceived to leverage the scaling of very-large-scale integration (VLSI) technology, addressing the growing demand for specialized hardware in compute-bound applications by emphasizing modularity, scalability, and reduced data movement overhead.[1]
Historically, the design evolved from earlier parallel array processors like the ILLIAC IV, shifting focus toward pipelined structures that minimize global synchronization and exploit spatial locality to achieve predictable performance.[2] Key principles include synchronous operation via a global clock, repeatability of processing elements for ease of fabrication, and data flow that enters from the boundaries, propagates inward for computation, and exits without buffering, which optimizes for repetitive algorithms in signal processing and beyond.[1] Early implementations targeted problems like dense matrix computations and polynomial evaluation, demonstrating potential for very high computational density in VLSI chips.[1]
In modern computing, systolic arrays have seen a resurgence in application-specific accelerators, particularly for deep neural networks (DNNs) and convolutional neural networks (CNNs), where they excel in handling the matrix-heavy workloads of machine learning inference and training.[2] Google's Tensor Processing Unit (TPU), introduced in 2016, exemplifies this with a 256×256 systolic array of multiply-accumulate units that performs 92 tera-operations per second at low power, achieving up to 83 times better performance-per-watt than contemporary CPUs for neural network tasks by reusing data across adjacent elements without frequent memory fetches.[3] Other notable applications include digital signal processing for image and audio analysis, cryptography algorithms, and computer vision tasks like object detection, underscoring the architecture's enduring efficiency in domains requiring high parallelism and predictable latency.[2] Despite advantages in throughput and energy efficiency, challenges such as limited flexibility for non-matrix operations and high design complexity persist, driving ongoing research into hybrid and scalable variants.[2]
Fundamentals
Definition and Principles
A systolic array is a homogeneous network of tightly coupled processing elements (PEs), forming a parallel computing architecture in which data flows in a pipelined, rhythmic manner from external memories through the array, while computations occur synchronously in a pulse-like fashion.[1] The term "systolic" derives from the physiological concept of systole, referring to the rhythmic contraction of the heart that pumps blood through the body, mirroring how data is rhythmically computed and passed through the network of processors.[1] This design enables efficient handling of regular, data-intensive algorithms, such as matrix operations, by emphasizing concurrent data movement and computation without reliance on complex control mechanisms.[4]
At its core, each PE in a systolic array performs simple, localized operations—typically multiply-accumulate (MAC) functions—on data received from adjacent neighbors, ensuring that processing remains decentralized.[4] There is no global control unit directing operations; instead, the architecture depends on nearest-neighbor communications for data exchange, with only the boundary PEs interfacing with external input/output streams to inject data into the array and extract results.[1] This pipelined data flow allows multiple computations to proceed simultaneously across the array, akin to a wavefront propagating through the structure, promoting high throughput for repetitive numerical tasks.
Key characteristics of systolic arrays include their regularity, manifested in the uniform arrangement and identical functionality of PEs, which simplifies design and fabrication; modularity, enabling the array to be scaled by adding more PEs to meet varying performance needs; and local interconnects, which restrict communication to short, nearest-neighbor links to reduce latency and wiring complexity. These features collectively support a highly synchronized, pipelined multiprocessing environment optimized for VLSI implementation.[4]
For illustration, consider a simple one-dimensional systolic array: a linear chain of identical PEs connected sequentially, ideal for vector operations such as convolution or dot products. In this setup, one data stream (e.g., a vector) enters from the left boundary PE, while another (e.g., filter coefficients) enters from the top of each PE; each PE multiplies the incoming values, adds to an accumulated partial sum passed from the previous PE, and forwards the results to the next, with the final output emerging from the right end after a delay equal to the array length.[1]
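As a minimal sketch of this linear chain, the code below models a single dot product at cycle level; it uses a deliberately simplified variant in which both operands of PE j arrive on its top port at cycle j and only the partial sum travels along the chain. The function name systolic_dot_product and the exact timing convention are illustrative assumptions rather than a description of any specific design.

```python
def systolic_dot_product(a, b):
    """Cycle-level sketch of a 1-D systolic chain computing a dot product.
    At cycle j, PE j receives a[j] and b[j] on its top port, adds a[j]*b[j]
    to the partial sum arriving from its left neighbour, and forwards the
    sum to the right; the result leaves the rightmost PE after n cycles."""
    n = len(a)
    links = [0] * (n + 1)      # links[j]: partial sum on the wire entering PE j
    for cycle in range(n):
        j = cycle              # skewed feeding: PE j is activated at cycle j
        links[j + 1] = links[j] + a[j] * b[j]   # multiply-accumulate, pass right
    return links[n]            # value emerging from the right boundary

# Example: systolic_dot_product([1, 2, 3], [4, 5, 6]) == 32
```

In a streaming setting, a new operand pair can enter the chain every cycle, so the pipeline stays full and one result emerges per cycle after the initial fill.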
Historical Development
The concept of the systolic array was proposed in the late 1970s by H. T. Kung and Charles Leiserson at Carnegie Mellon University, drawing inspiration from the rhythmic pumping action of blood flow in the human body—termed "systole" by physiologists—and the emerging trends in very large-scale integration (VLSI) technology that favored modular, pipelined designs.[1] This approach addressed the growing limitations of von Neumann architectures, which suffered from memory-computation bottlenecks in handling parallel, compute-intensive tasks such as signal processing and numerical computations.[5]
A foundational milestone came in 1978 with Kung and Leiserson's technical report, "Systolic Arrays (for VLSI)," which outlined the architecture as a network of processors that rhythmically compute and pass data to exploit VLSI's potential for massive parallelism while minimizing global communication.[1] Throughout the 1980s, theoretical models evolved into practical designs, with Kung's group at Carnegie Mellon pioneering hardware prototypes; a key example was the Warp machine, a linear programmable systolic array with a prototype completed in 1985 and full 10-cell systems in 1986, capable of 10 MFLOPS per cell and targeted at signal and image processing applications.[6] These developments integrated systolic principles with VLSI paradigms, enabling efficient implementations for digital signal processing (DSP) tasks like convolution and matrix operations.
By the 1990s, systolic arrays shifted from custom application-specific integrated circuits (ASICs) to more flexible programmable systems, exemplified by the iWarp project—a collaboration between Carnegie Mellon and Intel—that deployed its first 64-cell system in 1990, supporting both systolic communication and shared-memory models for scalable numerical computations.[7] This evolution maintained an initial emphasis on DSP and linear algebra problems, laying groundwork for broader adoption in high-performance computing while highlighting the architecture's adaptability to advancing semiconductor technologies.[7]
Architectural Design
Core Components
A systolic array is fundamentally composed of processing elements (PEs), which serve as the basic computational units. Each PE is an identical cell equipped with an arithmetic logic unit (ALU) for performing operations such as multiplication and accumulation, registers for temporary data storage, and local control logic to manage autonomous operation.[1] These elements operate in a rhythmic, pipelined manner, receiving inputs from neighbors, computing locally, and passing results onward without requiring global coordination.[8]
The interconnects form the structural backbone, consisting of fixed, nearest-neighbor wiring patterns that facilitate data transfer between adjacent PEs. Common topologies include linear arrays for sequential operations, 2D mesh configurations for matrix computations, and hexagonal arrangements for more complex data flows, all avoiding global buses to minimize latency and wiring complexity.[1] This regular, local interconnection ensures modularity and ease of fabrication in VLSI implementations.[8]
Boundary conditions define the interfaces at the array's edges, where input/output ports connect to external memory or host systems for data loading and result extraction. Unlike traditional architectures, systolic arrays lack an internal memory hierarchy, relying instead on these boundary ports to inject initial data streams and retrieve outputs, with unused boundary outputs often ignored to simplify design.[1]
Scalability is achieved by tiling multiple PEs into 1D, 2D, or even 3D configurations, allowing the array size to match computational demands. For instance, a 2D mesh of PEs can efficiently handle matrix multiplication by aligning data flows along rows and columns, enabling pipelined processing of larger problems through repeated boundary injections.[1] This modular extension supports high throughput without redesigning the core PE logic.[8]
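For concreteness, the sketch below models one such PE as a small data structure with a held operand, pipeline registers, and a multiply-accumulate step. The field names (weight, x_reg, psum_reg) and the one-register-per-link timing are illustrative assumptions, not a description of any particular chip; the surrounding dataflow and input skewing determine which algorithm a grid of such PEs actually realizes.

```python
from dataclasses import dataclass

@dataclass
class PE:
    """Minimal sketch of a processing element: an ALU limited to a
    multiply-accumulate plus registers for the values being forwarded."""
    weight: float = 0.0    # locally held operand (e.g., a filter coefficient)
    x_reg: float = 0.0     # register latching the operand received last cycle
    psum_reg: float = 0.0  # register latching the partial sum produced last cycle

    def step(self, x_in: float, psum_in: float):
        """One clock tick: emit last cycle's registered values toward the next
        neighbour, then latch the new inputs and perform the MAC locally."""
        x_out, psum_out = self.x_reg, self.psum_reg
        self.x_reg = x_in
        self.psum_reg = psum_in + self.weight * x_in
        return x_out, psum_out
```

A full array would instantiate a grid of these cells and, on every global clock tick, call step on each cell with the outputs its neighbors produced in the previous cycle.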
Data Flow and Synchronization
In systolic arrays, data flows rhythmically through the network of processing elements (PEs), with inputs typically entering from the array boundaries and propagating unidirectionally or bidirectionally between adjacent PEs in a pipelined manner.[1] This systolic rhythm mimics a heartbeat, where data pulses through the array without requiring global broadcasting, enabling efficient reuse as operands interact locally to perform computations like inner products.[1] Outputs generally exit from the opposite boundaries, and partial pipelining—achieved through registers within PEs—minimizes idle cycles or bubbles by staging data transfers to overlap computation and movement.[9]
Synchronization in systolic arrays is achieved via a global clock that drives simultaneous activations across all PEs, ensuring that data transfers and computations occur in lockstep without the need for complex arbitration due to the fixed, local interconnect paths.[1] Registers, often referred to as shadow or latching registers in implementations, temporarily hold data during transfers between PEs, preventing race conditions and maintaining timing integrity as the clock edge triggers both computation and shifting.[10] This clock-driven approach simplifies control, as each PE follows a predefined sequence of operations synchronized to the array's pulsations, avoiding the overhead of asynchronous handshaking.[9]
Common flow patterns include forward flows for input operands entering from one side and backward flows for partial results propagating in the reverse direction, as seen in linear arrays for matrix-vector multiplication where vectors move oppositely to align multiplications.[1] In two-dimensional arrays, skewing techniques stagger input data—such as delaying each row by one clock cycle—to create diagonal wavefronts that align computations across the array, ensuring that dependent operations arrive precisely when needed for pipelined execution.[11]
Data dependencies in iterative algorithms are handled through local buffering within PEs, where small registers store intermediate results to resolve hazards without stalling the global flow, preserving the regularity of the dependence graph while enabling wavefront propagation.[9] This local resolution leverages the array's modular structure, allowing dependencies to be mapped spatially so that data arrives just in time, thus maintaining high throughput without global intervention.[1]
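The helper below illustrates the skewing step described above: row i of an input matrix is delayed by i clock cycles so that a diagonal wavefront of operands enters the 2D array. The name skew_rows and the use of a pad value to mark idle cycles are illustrative choices.

```python
def skew_rows(matrix, pad=0):
    """Stagger row i of the input by i clock cycles.  Row i of the result is
    the sequence of values the i-th boundary port sees over time; `pad`
    marks cycles in which that port is idle."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    total = n_cols + n_rows - 1            # cycles needed to stream all rows
    skewed = [[pad] * total for _ in range(n_rows)]
    for i, row in enumerate(matrix):
        for t, value in enumerate(row):
            skewed[i][i + t] = value       # element enters i cycles after row 0
    return skewed

# Example: skew_rows([[1, 2], [3, 4]]) == [[1, 2, 0], [0, 3, 4]]
```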
Goals and Benefits
Performance Objectives
Systolic arrays are designed to maximize throughput by leveraging pipelined data flows that enable a constant rate of computation across the array, particularly for linear algebra operations such as matrix multiplication. This approach targets the use of O(n²) processors to achieve O(n) time complexity for problems like solving dense systems of n linear equations, where data is rhythmically pumped through the network to overlap input, computation, and output phases.[1] By ensuring a steady stream of data via simple, regular communication paths, systolic arrays sustain high data rates without bottlenecks from global broadcasts or reductions.[1]
Latency reduction forms another core objective, achieved through local interconnects that minimize communication overhead between neighboring processing elements. The goal is constant-time execution per operation across the array, as data moves only to adjacent cells in a mesh or hexagonal topology, avoiding long-distance wires that dominate delay in conventional designs.[1] For instance, in banded matrix-vector multiplication, this pipelining reduces overall latency from O(wn) in a sequential implementation to O(2n + w), where w is the bandwidth of the band matrix.[1]
Scalability targets emphasize linear speedup proportional to array size for embarrassingly parallel tasks, such as banded matrix operations on fixed-size arrays that handle inputs of arbitrary length.[1] Additionally, fault tolerance is incorporated via modular replacement strategies, where redundant processing elements and bypass links allow reconfiguration to isolate and substitute faulty units, thereby maintaining performance and extending mean time between failures in large-scale implementations.[12] Techniques like virtual arrays with 25-100% redundancy achieve linear improvements in reliability metrics, independent of array dimensions.[12]
Resource efficiency objectives prioritize optimization under VLSI area and time constraints, favoring uniform processor layouts and local connections to facilitate dense, low-cost fabrication. Regular grid structures minimize wiring complexity and support modular expansion, aligning with the systolic principle of simplicity for high-volume production.[1] This design enables efficient silicon utilization, where the number of processors scales with computational demands without proportional increases in interconnection overhead.[1]
Advantages Over Conventional Architectures
Systolic arrays address key limitations of conventional von Neumann architectures, which suffer from a central memory bottleneck where data must shuttle between a shared memory and a single processor, limiting scalability for compute-intensive tasks. By distributing both storage and computation across a network of processing elements (PEs) connected via local links, systolic arrays enable multiple computations per memory access, with data flowing rhythmically through the system to minimize global communication overhead.[13] This distributed approach supports massive parallelism without relying on shared buses, allowing for efficient handling of data-parallel algorithms like matrix operations.[13]
The simplicity and regularity of systolic array designs offer significant advantages over irregular multiple-instruction multiple-data (MIMD) architectures, which often require complex control logic and irregular interconnects that complicate implementation. Systolic arrays employ a uniform grid of identical PEs with local, nearest-neighbor connections, facilitating straightforward very-large-scale integration (VLSI) fabrication and reducing design complexity.[14] This regularity also leads to lower power consumption, as data movement is confined to short, predictable paths, avoiding the energy overhead of long-distance wiring in conventional designs.
Performance in systolic arrays is highly predictable due to their fixed data paths and deterministic synchronization, contrasting with cache-dependent systems where execution times vary based on unpredictable memory hits and misses. The rhythmic data flow ensures near-100% utilization of PEs for well-suited algorithms, as computations proceed in lockstep without idle cycles from resource contention.[13]
Systolic arrays achieve cost-effectiveness through their modular structure, which allows scaling by simply adding more PEs without redesigning the core architecture, making per-operation costs low for high-volume numerical processing. This scalability aligns system cost proportionally with required performance, enabling efficient deployment in specialized hardware for tasks like signal processing.[13][15]
Operational Mechanics
Processing Cycles
In a systolic array, each clock cycle follows a rhythmic pattern where processing elements (PEs) synchronously receive input data from neighboring PEs, perform a local computation—typically a multiply-accumulate operation such as updating a partial sum with C \leftarrow C + A \times B—temporarily store the results in local registers, and then forward the data to adjacent PEs for the next cycle.[1] This structure ensures that data flows without global communication, mimicking a heartbeat-like pulsation across the array.[1]
In two-dimensional systolic arrays, such as those used for matrix multiplication, the computations advance via wavefront propagation, where active processing spreads diagonally from the top-left corner toward the bottom-right, progressively activating diagonals of PEs as input operands align.[16] This diagonal wavefront enables efficient overlap of data movement and computation, with each cycle advancing the frontier by one position along the anti-diagonals.[16]
The operational phases include pipeline filling, steady-state execution, and draining: initial cycles load input data into the array, often injecting zeros or boundary values to initialize partial results; steady state follows, where the array reaches full utilization with every PE contributing to throughput proportional to the input bandwidth; and final cycles output completed results as the wavefront exits the array boundaries.[1] For an N \times N array, filling and draining typically require on the order of 2N cycles, bounding the transient overhead.[16]
For algorithms with iterative loops, such as successive approximation methods, the systolic array handles multiple iterations by pipelining computations across cycles, allowing new iterations to enter as prior ones drain, or through data recirculation where outputs are fed back as inputs for subsequent passes.[1] This approach maintains rhythmic data flow without halting the array, enabling continuous processing of looped workloads.[1]
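The cycle-accurate model below illustrates these phases for one common (output-stationary) dataflow: skewed rows of A enter from the left, skewed columns of B from the top, and each PE accumulates one element of C while forwarding its operands right and down. The function name, the choice of NumPy, and the specific dataflow are assumptions for illustration; real arrays differ in precision, control, and I/O details.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array computing A @ B.
    Rows of A stream in from the left and columns of B from the top, each skewed
    by one cycle per row/column; every PE keeps one C element and passes its A
    operand rightward and its B operand downward on each clock tick."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    a_reg = np.zeros((n, m))            # A operand currently latched in each PE
    b_reg = np.zeros((n, m))            # B operand currently latched in each PE
    cycles = n + m + k - 2              # pipeline fill + steady state + drain
    for t in range(cycles):
        a_reg = np.roll(a_reg, 1, axis=1)   # operands shift one PE to the right...
        b_reg = np.roll(b_reg, 1, axis=0)   # ...and one PE downward
        for i in range(n):                  # boundary injection, skewed by row
            s = t - i
            a_reg[i, 0] = A[i, s] if 0 <= s < k else 0.0
        for j in range(m):                  # boundary injection, skewed by column
            s = t - j
            b_reg[0, j] = B[s, j] if 0 <= s < k else 0.0
        C += a_reg * b_reg                  # one MAC in every PE per cycle
    return C

# After n + m + k - 2 cycles the result matches an ordinary product:
# np.allclose(systolic_matmul(X, Y), X @ Y) for conforming X, Y.
```

Tracing the loop shows the diagonal wavefront directly: at cycle t, the only PEs doing useful work are those with i + j + s = t for some valid reduction index s.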
Mathematical Foundations
The fundamental operation within a processing element (PE) of a systolic array is the multiply-accumulate (MAC), expressed as c \leftarrow c + a \cdot b, where a and b are scalar inputs received from neighboring PEs or boundaries, and c represents the partial result passed onward after the computation. This recursive form enables pipelined accumulation across the array, with each PE performing one MAC per cycle while data flows rhythmically through local interconnects.[1]
Systolic arrays are synthesized by mapping algorithms modeled as dependence graphs onto hardware topologies via systolic projection. A dependence graph consists of nodes representing index-domain computations and directed edges denoting data dependencies with associated delays. The mapping process selects a scheduling vector \mathbf{s}^T to assign execution times to nodes and a projection vector \mathbf{p}^T to map multi-dimensional index spaces to the array's lower-dimensional geometry, ensuring no two dependent computations are scheduled simultaneously on the same processor. For instance, in matrix-vector multiplication \mathbf{c} = A \mathbf{b}, where A is an n \times m matrix and \mathbf{b} is an m-vector, the dependence graph captures the partial sums c_i^{(k+1)} = c_i^{(k)} + a_{ik} b_k for i = 1 to n and k = 1 to m. A linear projection, such as \mathbf{p}^T = [1 \; 0], maps this to a 1D array of n PEs, with dependencies resolved by propagating b_k and the partial c_i along the chain. The validity of the mapping requires \mathbf{s}^T \mathbf{d} \geq 1 for each dependency vector \mathbf{d}, minimizing latency while respecting the systolic constraint of no global communication.[1]
Timing in systolic arrays is governed by node scheduling functions that dictate wavefront propagation. A standard scheduling function for 2D arrays is \psi(i,j) = i + j, which assigns time steps along diagonals, enabling simultaneous execution of independent computations in a wavefront that advances one step per cycle. This creates rhythmic data movement, where inputs propagate without stalling, provided the processor index is derived from the projection (i,j) \cdot \mathbf{p} within the array bounds. To synchronize arrivals, data skewing introduces initial delays in input vectors; for example, in the matrix-vector case, elements of \mathbf{b} are skewed by offsets proportional to row indices, ensuring that a_{ik} and b_k meet at PE i exactly when needed. The skewing offsets are computed as \mathbf{t} = \mathbf{s}^T \mathbf{e} for edge vectors \mathbf{e}, aligning the flows to the schedule.
Complexity analysis reveals tradeoffs between processors and time under systolic constraints. For matrix-vector multiplication of an n \times m matrix on a 1D linear systolic array, the time complexity is T = n + m - 1 cycles: each PE performs m accumulation steps, and pipeline fill delays the last PE's start by n - 1 cycles (with the roles of n and m exchanged if m < n). This follows from the wavefront model: input propagation alone would take \max(n,m) steps, but pipelining overlaps computation with data movement, yielding total latency T = n + m - 1. The processor-time product is p \cdot T \approx nm, matching the O(nm) arithmetic operations, and systolic designs keep p at O(\min(n,m)) while bounding T = O(n + m), outperforming the sequential O(nm) time obtained with p = 1. For 2D extensions like matrix multiplication, scaling to p = O(n^2) yields T = O(n), with the constant factor (e.g., 3 for square matrices) determined by the schedule's dot product |\mathbf{s}^T \mathbf{p}|.
These tradeoffs are formalized by minimizing T = \max_{\mathbf{i} \in I} \mathbf{s}^T \mathbf{i} over the index domain I, subject to the processor count p = |\{\mathbf{P}\mathbf{i} : \mathbf{i} \in I\}|, where \mathbf{P} is the projection matrix.[1]
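A small simulation can check the T = n + m - 1 figure. The sketch below uses an output-stationary variant, in which each PE keeps its own c_i while b streams down the chain, whereas the projection discussed above propagates the partial c_i; the name systolic_matvec and this design choice are illustrative assumptions.

```python
def systolic_matvec(A, b):
    """Cycle-level sketch of a 1-D systolic array computing c = A b.
    PE i keeps accumulator c[i]; b[k] enters PE 0 at cycle k and shifts one
    PE down the chain per cycle, while a[i][k] is supplied to PE i from its
    top port at cycle i + k.  Total running time: n + m - 1 cycles."""
    n, m = len(A), len(b)
    c = [0] * n                   # output-stationary accumulators, one per PE
    b_pipe = [None] * n           # b value currently latched in each PE
    cycles = n + m - 1
    for t in range(cycles):
        # shift b down the chain and inject the next element at the boundary
        b_pipe = [b[t] if t < m else None] + b_pipe[:-1]
        for i, b_val in enumerate(b_pipe):
            if b_val is not None:         # PE i sees b[t - i] this cycle
                c[i] += A[i][t - i] * b_val
    return c, cycles

# Example: a 3x2 matrix finishes in 3 + 2 - 1 = 4 cycles:
# systolic_matvec([[1, 2], [3, 4], [5, 6]], [7, 8]) == ([23, 53, 83], 4)
```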
Examples
Polynomial Multiplication
Systolic arrays provide an efficient mapping for multiplying two polynomials of degree n, denoted as A(x) = \sum_{i=0}^{n} a_i x^i and B(x) = \sum_{i=0}^{n} b_i x^i, to compute the product C(x) = A(x) B(x) = \sum_{k=0}^{2n} c_k x^k. This is achieved using a linear systolic array consisting of n+1 processing elements (PEs) arranged in a one-dimensional chain. Each PE performs local multiply-accumulate operations on incoming coefficient streams, leveraging the rhythmic data flow inherent to systolic designs to compute the convolution of the coefficient sequences without global communication.
The mapping exploits the structure of polynomial multiplication as a convolution, where the coefficients of A(x) are input from one end of the array (e.g., the left), flowing rightward, while the coefficients of B(x) enter from the opposite end (e.g., the right), flowing leftward. Each PE, indexed from 0 to n, receives one coefficient from each polynomial at appropriate timings, multiplies them to form a partial product, and accumulates it into a local register that tracks contributions shifted by the PE's position in the array. This ensures that partial products for each degree k of the result are systematically gathered across the PEs, with boundary conditions handled by preloading zeros for coefficients beyond degree n (i.e., a_i = 0 for i > n and similarly for b_i). The design maintains constant bandwidth, as each PE communicates only with its immediate neighbors.
Execution proceeds in a pipelined manner over 2n+1 output cycles, with the first coefficient c_0 emerging after an initial latency of about n+1 cycles and subsequent coefficients output one per cycle. In each cycle after the initial latency, data propagation aligns the inputs such that the next result coefficient emerges from a designated output port, typically the rightmost PE or a dedicated accumulator chain. The core computation for each coefficient follows the convolution formula c_k = \sum_{i=0}^{k} a_i b_{k-i} for k = 0 to 2n, with the systolic scheduling ensuring that all terms in the sum for a given k are accumulated without conflicts. Boundary conditions are managed by injecting zero values at the array ends after the initial coefficients, preventing extraneous contributions and ensuring exact degree-2n results.
The architecture can be visualized as a linear chain of n+1 PEs, with forward data paths for the A coefficients (left to right) and backward data paths for the B coefficients (right to left), connected via nearest-neighbor links. Each PE includes input buffers, a multiplier, an accumulator, and output ports, with control logic to synchronize the flows and handle zero padding for boundaries. This setup exemplifies the systolic principle of localized computation and rhythmic pulsing, enabling high throughput for polynomial operations in VLSI implementations.
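The simulation below illustrates the coefficient convolution on a linear array. As a hedge, it implements a different but standard variant from the counter-flowing design described above: the b coefficients are held stationary (one per PE), the a coefficients stream rightward through two registers per PE, and the partial results stream rightward through one register per PE, which yields the same c_k = \sum_i a_i b_{k-i}. The function name systolic_poly_mul is illustrative.

```python
def systolic_poly_mul(a, b):
    """Cycle-level sketch of polynomial multiplication on a linear systolic array.
    Variant used here: PE j holds b[j]; a-coefficients move rightward through two
    delay registers per PE (half speed) while partial sums move rightward through
    one register per PE, so the partial sum injected at cycle k picks up
    b[j] * a[k - j] at PE j and leaves the chain as c[k]."""
    n = len(b)                          # number of PEs
    a_regs = [0.0] * (2 * n)            # two a-delay registers per PE
    c_regs = [0.0] * n                  # one partial-sum register per PE
    out = []
    cycles = len(a) + len(b) + n - 2    # all products plus pipeline drain
    for t in range(cycles):
        a_in = a[t] if t < len(a) else 0.0          # zero padding past deg(a)
        new_a, new_c = [0.0] * (2 * n), [0.0] * n
        for j in range(n):
            x = a_in if j == 0 else a_regs[2 * j - 1]   # a value entering PE j
            s = 0.0 if j == 0 else c_regs[j - 1]        # partial sum entering PE j
            new_c[j] = s + b[j] * x                     # multiply-accumulate
            new_a[2 * j] = x                            # first a-delay register
            new_a[2 * j + 1] = a_regs[2 * j]            # second a-delay register
        a_regs, c_regs = new_a, new_c
        out.append(c_regs[-1])          # value leaving the rightmost PE this cycle
    return out[n - 1:]                  # discard pipeline-fill cycles

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2:
# systolic_poly_mul([1, 2], [3, 4]) == [3.0, 10.0, 8.0]
```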
Convolution Operations
Systolic arrays provide an efficient architecture for performing convolution operations, particularly in signal and image processing tasks involving finite impulse response (FIR) filters. The core algorithm computes the output sequence y[n] = \sum_{k=0}^{M} h[k] \cdot x[n - k], where x represents the input signal and h is the fixed kernel of length M+1. This computation is mapped onto a linear systolic array comprising M+1 processing elements (PEs), each responsible for a specific kernel coefficient.[1]
In this configuration, the input signal x flows unidirectionally through the array, while the kernel coefficients h propagate in the opposite direction to align with the sliding window. Each PE executes a multiply-accumulate operation, multiplying the arriving x and h values and adding the product to an incoming partial sum, which is then forwarded to the adjacent PE. This pipelined data flow ensures that the dependencies of the convolution are resolved locally within the array.[1]
Once the pipeline is filled, typically after an initial latency of O(M) cycles, the array generates one output sample per clock cycle, achieving constant-time processing per sample independent of kernel size. This design is optimized for FIR filters but extends to infinite impulse response (IIR) variants by incorporating feedback paths in the PE structure to handle recursive dependencies.
For two-dimensional convolution, common in image filtering, the architecture scales to a 2D mesh of PEs, leveraging separable kernels to decompose the operation into two sequential one-dimensional passes. In the first stage, horizontal linear arrays perform row-wise convolutions across the image, producing intermediate feature maps. These intermediates then undergo column-wise convolutions in a vertical pass using the same or reconfigured mesh, effectively computing the full 2D output y[i,j] = \sum_{p=0}^{M} \sum_{q=0}^{M} h[p,q] \cdot x[i-p, j-q] for a kernel of size (M+1) \times (M+1).[17]
This separable approach on a 2D mesh maintains high utilization, with data streaming through rows and columns in a systolic manner to minimize global communication. Efficiency mirrors the 1D case, yielding constant time per output pixel after pipeline fill, while the mesh parallelism scales throughput with the array dimensions.
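The functional model below (not cycle-accurate; the names conv1d and separable_conv2d are illustrative) shows the algebra behind the two-pass scheme: for a separable kernel h[p][q] = col_kernel[p] \cdot row_kernel[q], convolving every row with row_kernel and then every column of the intermediate with col_kernel reproduces the full 2D convolution that the mesh computes.

```python
def conv1d(signal, kernel):
    """Plain 1-D convolution, y[n] = sum_k kernel[k] * signal[n - k];
    this is the computation each linear systolic pass performs."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for n in range(len(out)):
        for k, h in enumerate(kernel):
            if 0 <= n - k < len(signal):
                out[n] += h * signal[n - k]
    return out

def separable_conv2d(image, row_kernel, col_kernel):
    """Two-pass model of separable 2-D convolution with kernel
    h[p][q] = col_kernel[p] * row_kernel[q]: horizontal arrays convolve the
    rows, then vertical arrays convolve the columns of the intermediate."""
    rows = [conv1d(row, row_kernel) for row in image]          # pass 1: row-wise
    cols = [conv1d(list(col), col_kernel) for col in zip(*rows)]  # pass 2: column-wise
    return [list(r) for r in zip(*cols)]                       # transpose back
```

A non-separable kernel would instead require the full double sum in every PE column, which is why separable decompositions are attractive on mesh-structured hardware.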
Sorting Algorithms
Systolic arrays implement parallel comparison-based sorting by mapping algorithms like Batcher's odd-even merge sort or the bitonic sorter onto a two-dimensional mesh structure. These algorithms construct sorting networks composed of compare-exchange operations, which are ideally suited to the regular, pipelined data flow of systolic architectures, enabling simultaneous processing across multiple paths.[18][19]
In the mapping, input data elements enter the array boundaries and propagate through an arrangement of processing elements (PEs) organized into sequential stages that mirror the layers of the sorting network. Each PE functions as a compare-exchange unit, receiving two data values from neighboring PEs or inputs, comparing them, and swapping them if they are out of order so that smaller elements move toward one output direction (e.g., upward or leftward) while larger ones move oppositely. This localized operation requires no global communication, maintaining the systolic principle of rhythmic data pulsing without buffering delays. Synchronization occurs via the array's clock, with data advancing one step per cycle across the mesh.[18]
Execution proceeds through a series of stages: the odd-even merge sort recursively divides the input into halves, sorts them, and merges the results via alternating comparisons of odd- and even-indexed elements. For n input elements, the merge phase comprises log₂ n levels, each level containing up to n/2 independent compare-exchange operations, and the complete network has O((log n)²) stages; because each stage is a fixed layer of PEs, successive input sets can be pipelined through the network for high throughput. The bitonic sorter variant follows a similar staged structure but first builds bitonic sequences (increasing then decreasing) before merging, offering comparable complexity and adaptability to the mesh topology.[18][19]
Variants of these implementations employ truncated sorting networks for tasks like median finding or selection, where only a partial set of stages is activated to isolate the k-th smallest element without completing the full sort. This reduces computational overhead, as the network up to the median position suffices, leveraging the same compare-exchange PEs with fewer cycles and potentially reconfigurable connections in the systolic mesh.[19][20]
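A compact way to see the staged structure is to generate the layers of a bitonic sorting network explicitly. In the sketch below (bitonic_sort_stages is an illustrative name, and a power-of-two input size is assumed), each layer is a set of independent compare-exchange pairs that a systolic mapping could evaluate in one clock cycle, and the number of layers grows as O((log n)²).

```python
def bitonic_sort_stages(data):
    """Generate the layers of a bitonic sorting network and apply them.
    Each layer is a list of (i, partner, ascending) triples describing
    independent compare-exchange units that can operate in parallel."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "power-of-two input size assumed"
    values = list(data)
    stages = []
    k = 2
    while k <= n:                     # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                 # one network layer per value of j
            layer = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    layer.append((i, partner, (i & k) == 0))
            stages.append(layer)
            j //= 2
        k *= 2
    for layer in stages:              # apply layers in order; comparisons within
        for i, p, ascending in layer:  # a layer are independent (parallel PEs)
            if (values[i] > values[p]) == ascending:
                values[i], values[p] = values[p], values[i]
    return values, len(stages)

# Example: bitonic_sort_stages([5, 1, 4, 2, 8, 7, 3, 6]) returns the sorted
# list and 6 stages for n = 8, i.e. log2(n) * (log2(n) + 1) / 2 layers.
```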
Applications
Traditional Signal Processing
Systolic arrays found early and prominent applications in traditional digital signal processing (DSP), particularly for tasks requiring high-throughput real-time computation that exceeded the capabilities of general-purpose CPUs at the time. These architectures were especially suited to operations like real-time filtering, where linear systolic arrays could efficiently implement finite impulse response (FIR) filters by pipelining data through processing elements, each performing multiply-accumulate operations on incoming streams. This design allowed for sustained high sample rates, addressing the limitations of von Neumann architectures that suffered from memory-computation bottlenecks in handling continuous data flows from sensors or audio inputs.[21]
In audio processing, linear systolic arrays were employed to realize FIR filters, enabling low-latency equalization and noise reduction in real-time systems. For instance, a linear array of processing cells could compute filtered outputs every few clock cycles, supporting applications in telecommunications where voice signals demanded precise temporal processing without delays. Similarly, two-dimensional systolic arrays extended this capability to image enhancement in early video systems, performing convolutions across pixel grids to sharpen or denoise frames, which was critical for broadcast and surveillance technologies in the 1980s. These configurations leveraged the rhythmic data flow inherent to systolic designs, ensuring modular scalability for varying filter lengths or image sizes.[21]
Fast Fourier Transform (FFT) computation also benefited from systolic arrays, with linear or hexagonal configurations mapping the Cooley-Tukey algorithm to achieve pipelined, real-time spectral analysis. Such implementations were vital for frequency-domain processing in communications, where systolic structures reduced latency compared to sequential FFT methods, enabling efficient modulation and demodulation at high data rates. In adaptive beamforming for radar and sonar systems, systolic arrays facilitated the computation of optimal weights for multiple sidelobe cancellers, enhancing spatial filtering to suppress interference and focus on target signals in military environments. This was particularly impactful in 1980s defense applications, where real-time adaptability was essential for detecting submerged threats or airborne objects.[21][22]
The adoption of systolic arrays surged in the 1980s within military and telecommunications sectors, driven by the need for specialized hardware to handle intensive DSP workloads. For example, systolic chips were developed for speech recognition tasks, using dynamic time warping algorithms mapped to linear arrays for connected word processing in secure communication systems. Projects like wafer-scale systolic integrations supported radar signal processing in defense initiatives, while telecom firms explored them for high-speed channel equalization. These early realizations underscored systolic arrays' role in overcoming CPU limitations for sample rates exceeding millions per second, paving the way for dedicated DSP silicon.[23][24]
Modern Machine Learning Accelerators
Systolic arrays play a central role in modern machine learning accelerators by efficiently handling the matrix multiplications and general matrix multiply (GEMM) operations that dominate deep neural network training and inference. These architectures enable high-throughput computation with minimal data movement, addressing the memory bandwidth bottlenecks in conventional processors. For instance, Google's Tensor Processing Units (TPUs), introduced in 2016, leverage 2D systolic arrays to perform tensor operations at scale in datacenters.[25]
The first-generation TPU (TPUv1) features a single 256×256 systolic array of multiply-accumulate (MAC) units optimized for 8-bit integer operations, delivering up to 92 TOPS while consuming 40 W, primarily for inference workloads. Subsequent versions expanded this design: TPUv2 and TPUv3 incorporate multiple 128×128 systolic arrays per chip (equivalent in compute capacity to 256×256), supporting bfloat16 precision for training and achieving higher energy efficiency through pipelined dataflow. TPUv4 (2021) builds on this with four 128×128 arrays per chip, adding support for int8 and structured sparsity in embeddings to reduce compute for pruned models, while maintaining compatibility with low-precision formats like FP16 and INT8 for broader neural network acceleration. Later generations continued these advancements: TPUv5 (2023) nearly doubled v4 performance with enhanced systolic arrays; TPUv6e (Trillium, GA May 2025) offers improved efficiency for large-scale training; and TPUv7 (Ironwood, announced April 2025, GA Q4 2025) provides over 4x the performance of Trillium, emphasizing inference with advanced systolic designs for generative AI. These optimizations enable TPUs to process large-scale models, such as those in Google's search and translation services, with high utilization on dense GEMM tasks.[3][26][27][28]
Beyond Google, other vendors have integrated systolic-like units into their AI accelerators. Apple's Neural Engine, debuting in the A11 Bionic chip in 2017 and evolving through M-series processors, employs matrix multiplication units based on systolic arrays to accelerate on-device inference for tasks like image recognition and natural language processing. Performance has scaled significantly, from 15.8 TOPS on the A15 (2021) to 38 TOPS on the M4 (2024), with the M5 series (October 2025) featuring an enhanced 16-core Neural Engine for even greater efficiency in mobile AI deployment. Similarly, Huawei's Ascend series uses DaVinci cores with Cube units—systolic array-based matrix multiply accelerators—that support vector and scalar processing for both training and inference, with recent models like the Ascend 910C (2024/2025) emphasizing scalability in datacenter environments and the upcoming Ascend 950 (2026) promising further petaflop-scale performance. In the 2020s, systolic arrays have seen renewed adoption in FPGA and ASIC designs for edge AI, where reconfigurable fabrics like those in the AMD Versal AI Edge Series Gen 2 (2023) implement sparse systolic slices to handle resource-constrained inference on devices such as IoT sensors, offering flexibility over fixed ASICs.[29][30][31]
Recent advancements focus on hybrid systolic designs that combine arrays with other accelerator elements, such as vector units or sparse cores, to enhance versatility for diverse ML workloads.
For example, hybrid systolic architectures integrate reconfigurable dataflows to support both dense and sparse operations, reducing energy consumption by up to 8% in edge inference while boosting throughput by 57% for models like transformers. These hybrids prioritize energy efficiency for mobile and datacenter applications, enabling systolic arrays to achieve 2.3× energy savings over prior accelerators in low-power scenarios, thus supporting sustainable scaling of AI deployment.[32]
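To make the GEMM mapping concrete, the sketch below shows the usual tiling scheme by which a matrix product larger than the physical array is decomposed into tile-sized passes. The 128-element tile size echoes the array dimensions cited above but is otherwise arbitrary, np.matmul stands in for one pass through the hardware array, and the function name tiled_gemm is illustrative rather than any vendor's API.

```python
import numpy as np

def tiled_gemm(A, B, tile=128):
    """Sketch of mapping a large GEMM onto a fixed tile x tile systolic array:
    operands are cut into tile-sized blocks, each block product is one pass
    through the array (modelled by @ here), and partial results are
    accumulated into the corresponding output block."""
    A = np.asarray(A, dtype=np.float32)
    B = np.asarray(B, dtype=np.float32)
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=np.float32)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for s in range(0, k, tile):
                # one systolic-array pass: a tile of A meets a tile of B and
                # the partial products accumulate into C's (i, j) tile
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, s:s + tile] @ B[s:s + tile, j:j + tile]
                )
    return C

# Sanity check for any conforming X, Y:
# np.allclose(tiled_gemm(X, Y, tile=32), X.astype(np.float32) @ Y.astype(np.float32))
```

In practice the loop order and which operand stays resident in the array (weight-, input-, or output-stationary dataflow) are chosen to maximize data reuse and minimize off-chip traffic.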
Implementations and Debates
Hardware Realizations
Early prototypes of systolic arrays emerged in the mid-1980s, with Carnegie Mellon University's Warp project delivering a notable example. A prototype of the Warp system, featuring a 2-cell linear systolic array, was completed in June 1985, and the full 10-cell version followed in early 1986; each cell was capable of 10 MFLOPS, serving as a programmable coprocessor for high-performance signal and image processing tasks.[6] This prototype emphasized systolic data flow to achieve efficient pipelined computations, demonstrating feasibility for attached processors in research environments.[33]
Building on Warp, the iWarp project in the 1990s advanced scalable multicomputer architectures based on systolic principles. Developed collaboratively by Carnegie Mellon and Intel, iWarp integrated a single-chip component combining a processor core, floating-point unit, and communication support into a mesh-connected system; the first 64-cell production systems were delivered in 1990, enabling high-speed parallel computing for scientific applications.[7] These systems supported systolic communication patterns across nodes, scaling to hundreds of cells while maintaining low-latency, rhythmic data flow.[34]
Commercial realizations in the 1980s and 1990s included transputer-based systems and specialized ASICs. Transputers, developed by INMOS, were adapted into systolic arrays for parallel processing, such as in multilayered neural network implementations where one- and two-dimensional transputer configurations achieved efficient data pipelining for model reduction and feedback control tasks.[35] Additionally, systolic ASICs for image processing emerged, with designs leveraging orthogonal interconnects to handle convolutions and correlations in dedicated chips.[36] Intel's iWarp, commercialized from the academic prototype, represented a key step in this evolution, packaging systolic elements into scalable, off-the-shelf multicomputers.[37]
In modern hardware, systolic arrays have seen widespread adoption in AI accelerators, exemplified by Google's Tensor Processing Unit (TPU) series since 2015. The first TPU featured a 256×256 systolic array of 8-bit multiply-accumulate units, optimized for matrix multiplications in neural networks, delivering up to 92 tera-operations per second while minimizing data movement.[3] Subsequent iterations, such as TPU v3 with dual 128×128 arrays (2018) and later versions up to the seventh-generation TPU v7 (Ironwood, 2025), have further scaled systolic array designs for cloud-scale machine learning and advanced AI workloads, integrating systolic computation with host interfaces for tensor operations.[38][39] Recent research has explored scalable systolic arrays for sparse neural networks and energy-efficient edge devices, with prototypes demonstrating up to 68% power savings through optimized dataflows (as of 2024).[40] For research, open-source FPGA implementations on Xilinx platforms have proliferated, such as parametric systolic array generators for matrix multiplication and hardened designs with error detection, enabling customizable prototypes for DNN acceleration.[41][42]
Despite these advances, hardware realizations face challenges in fabrication and system integration.
Large systolic arrays suffer from reduced yields due to increased defect probabilities in dense interconnects and processing elements, necessitating fault-tolerant designs like redundancy to maintain reliability.[43] Integration with GPUs or CPUs introduces overheads from data transfer latencies and mismatched paradigms, often requiring hybrid architectures to balance systolic efficiency with general-purpose flexibility.[44]
Classification Controversies
The classification of systolic arrays has sparked ongoing debates within the parallel computing literature, primarily revolving around strict versus loose interpretations of the core definition. In its original formulation, a systolic array is defined as a homogeneous network of processors that rhythmically compute and pass data through local interconnections, with no global control lines and synchronous operation to enable pipelined throughput without additional logic overhead.[1] This strict view, emphasizing pure systolic data flow where data "pumps" through the array like blood through the heart, excludes designs incorporating hybrid control mechanisms, such as centralized scheduling or external intervention, which later extensions have introduced to enhance flexibility.
Controversies arise over the inclusion of reconfigurable arrays, which allow dynamic reconfiguration of processing elements (PEs) during operation, potentially violating the rhythmic, fixed-flow principle of pure systolic designs.[45] Similarly, arrays employing global clocks for synchronization face criticism, as clock skew and distribution challenges in large-scale implementations can disrupt the intended local, self-timed rhythm, leading some researchers to question their systolic purity.[46] Disputes extend to emerging contexts like neuromorphic computing, where systolic-inspired structures simulate neural dynamics but incorporate non-local feedback loops, blurring the boundaries of traditional definitions.[47] In quantum computing, analogous array-based approaches for quantum circuit simulation have been proposed, yet they often hybridize with classical control, prompting debates on whether such adaptations retain the systolic essence.[48]
Key arguments trace back to 1980s and 1990s literature, where papers questioned the fit of non-homogeneous PEs—varying processor functions within the array—arguing that they compromise the modularity and regularity central to systolic efficiency.[49] For instance, Flynn's taxonomy controversially places systolic arrays under MISD (multiple instruction, single data), but this is widely critiqued as problematic since data modification at each node aligns more closely with SIMD-like behavior, rendering MISD "nonsensical" in refined taxonomies.[49] Modern perspectives, such as those on Google's Tensor Processing Units (TPUs), describe them as "systolic-inspired" rather than purely systolic due to integrated host interfaces and hybrid data management that deviate from unadulterated local flow.[50]
These debates have significant implications for taxonomy in parallel computing, as ambiguous boundaries complicate comparisons across architectures and hinder standardized benchmarking. The issues remain unresolved amid evolving hardware paradigms, with no consensus on delineating pure systolic designs from their extensions.[49]