
Prefix sum

In computer science and mathematics, a prefix sum, also known as a cumulative sum, scan, or partial sum, is an operation that computes a sequence of partial sums from a given sequence of numbers, where each element in the resulting array represents the sum of all elements up to that position (inclusive or exclusive variants exist). For an input array A = [a_0, a_1, \dots, a_{n-1}], the inclusive prefix sum array P satisfies P[0] = a_0 and P[i] = P[i-1] + a_i for i = 1 to n-1, enabling efficient range queries where the sum from index i to j is P[j] - P[i-1] (or P[j+1] - P[i] for exclusive sums). Prefix sums form a fundamental data structure and algorithmic primitive, with sequential computation requiring O(n) time and space, making them ideal for preprocessing arrays to answer multiple subarray sum queries in constant time thereafter. The concept has origins in classical mathematics but gained prominence in parallel computing. Beyond basic range queries, prefix sums underpin numerous applications across algorithms, including radix sort, quicksort partitioning, multi-precision arithmetic, polynomial evaluation, lexical analysis, image component labeling, and processor allocation in parallel systems. In modern contexts such as online analytical processing (OLAP) and data cubes, precomputed prefix sums support fast aggregation over multidimensional ranges, while parallel variants extend to GPU computing and SIMD instructions for high-performance tasks such as sorting and stream processing. Their associative nature, which generalizes to any associative binary operator, ensures broad utility in both serial and distributed environments.

Fundamentals

Definition and Notation

In mathematics, the prefix sum, also known as the cumulative sum, of a finite sequence of numbers a_1, a_2, \dots, a_n is a sequence of partial sums defined by s_k = a_1 + a_2 + \dots + a_k = \sum_{i=1}^k a_i for each k = 1, 2, \dots, n, where s_0 = 0 by convention. In computational contexts, prefix sums are commonly represented using arrays, where for a zero-indexed input a[0 \dots n-1], the prefix sum array s satisfies s[i] = \sum_{j=0}^{i} a[j] for i = 0 to n-1. This array-based notation extends to vectorized forms in programming languages, such as cumulative-sum functions that produce the entire prefix sum array in one operation. Prefix sums exhibit basic properties rooted in the linearity of addition: for any scalar c, the prefix sum of the scaled sequence satisfies \sum_{i=1}^k (c \cdot a_i) = c \cdot s_k. Additionally, since addition is an associative operation, prefix sums can be evaluated in non-sequential orders without altering the result, a property that facilitates efficient parallel computation. For example, given the sequence [1, 2, 3, 4], the prefix sums are [1, 3, 6, 10]. The naive approach of recomputing each prefix sum independently from the start of the sequence incurs O(n^2) time in the worst case, whereas a single pass through the sequence computes the entire prefix sum sequence in O(n) time.
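For instance, Python's itertools.accumulate and NumPy's cumsum each produce the entire inclusive prefix sum array in a single call:
python
# Inclusive prefix sums via standard cumulative functions.
from itertools import accumulate
import numpy as np

a = [1, 2, 3, 4]
print(list(accumulate(a)))  # [1, 3, 6, 10]
print(np.cumsum(a))         # [ 1  3  6 10]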

Inclusive and Exclusive Prefix Sums

Prefix sums can be computed in two primary variants: inclusive and exclusive. The inclusive prefix sum of an array A = [a_1, a_2, \dots, a_n] at index k is defined as s_k = \sum_{i=1}^k a_i, which incorporates the k-th element itself. This variant directly accumulates all elements up to and including the current position, making it suitable for scenarios requiring total sums that include the endpoint. In contrast, the exclusive prefix sum at index k is s_k = \sum_{i=1}^{k-1} a_i, excluding the k-th element, with the convention that s_1 = 0. This form provides the sum of all preceding elements only, often prepending a zero to the result array for consistency in indexing. The two variants are closely related and can be converted between each other efficiently. Specifically, the exclusive prefix sum can be obtained from the inclusive one by shifting: e_k = s_{k-1} for k \geq 2, with e_1 = 0; conversely, the inclusive can be derived from the exclusive as s_k = e_k + a_k. Consider an example array A = [1, 2, 3]. The inclusive prefix sums are [1, 3, 6], while the exclusive prefix sums are [0, 1, 3]. Inclusive prefix sums are commonly used for computing cumulative totals, such as in cumulative distribution functions where the full sum up to a point is needed. Exclusive prefix sums, however, are preferred in parallel operations like scatter or stream compaction, where they help determine destinations by summing preceding counts without self-inclusion, avoiding overlaps during element placement.
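The conversions above are mechanical; the following minimal sketch (the helper names to_exclusive and to_inclusive are illustrative, not standard) applies the shift and the s_k = e_k + a_k identity:
python
def to_exclusive(inclusive):
    # e[0] = 0 and e[k] = s[k-1]: shift right, prepending the identity 0.
    return [0] + inclusive[:-1]

def to_inclusive(exclusive, a):
    # s[k] = e[k] + a[k].
    return [e + x for e, x in zip(exclusive, a)]

a = [1, 2, 3]
print(to_exclusive([1, 3, 6]))     # [0, 1, 3]
print(to_inclusive([0, 1, 3], a))  # [1, 3, 6]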

Sequential Computation

The sequential computation of prefix sums is performed by a single-pass algorithm that iterates through the input array once, maintaining a running total to build the output array. For an inclusive prefix sum, where the output satisfies s[i] = \sum_{j=0}^{i} a[j], the process initializes s[0] = a[0] and then, for each subsequent index i from 1 to n-1, computes s[i] = s[i-1] + a[i]. This approach directly applies the recursive definition of prefix sums in a linear traversal. An exclusive prefix sum variant shifts the computation by initializing s[0] = 0 and then computing s[i] = s[i-1] + a[i-1] for i = 1 to n, effectively making s[i] = \sum_{j=0}^{i-1} a[j]. The algorithm requires O(n) time due to the single pass over n elements and O(n) space for the output array, assuming separate storage. To achieve O(1) extra space, an in-place variant modifies the input array directly by iterating from left to right and updating each element as a[i] = a[i-1] + a[i] for i = 1 to n-1, transforming a into its inclusive prefix sum; the original data is preserved only if backed up beforehand. This maintains the O(n) time bound but overwrites the input, which suits scenarios where the original array is not needed after the computation. Edge cases are handled straightforwardly: an empty array yields an empty prefix sum; a single-element array results in s[0] = a[0] (inclusive) or s[0] = 0 (exclusive); and negative numbers are supported, since addition is associative and commutative over the integers. Pseudocode for the inclusive version using extra space is as follows:
function prefixSums(A, n):
    B = new array of size n
    B[0] = A[0]
    for i = 1 to n-1:
        B[i] = B[i-1] + A[i]
    return B
This yields O(n) time and space. A practical implementation in Python for the in-place inclusive prefix sum is:
python
def in_place_prefix_sum(arr):
    for i in range(1, len(arr)):
        arr[i] += arr[i - 1]
    return arr
For example, input [1, 2, 3, 4] becomes [1, 3, 6, 10]. This code assumes a mutable list and handles negatives correctly, such as [-1, 2] yielding [-1, 1].
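For completeness, a sketch of the exclusive variant described above (function name illustrative); note the output has n+1 entries, with the total sum last:
python
def exclusive_prefix_sum(a):
    # s[0] = 0; s[i] = s[i-1] + a[i-1], so s[i] = sum of a[0..i-1].
    s = [0] * (len(a) + 1)
    for i in range(1, len(a) + 1):
        s[i] = s[i - 1] + a[i - 1]
    return s

print(exclusive_prefix_sum([1, 2, 3, 4]))  # [0, 1, 3, 6, 10]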

Parallel Algorithms

Hillis-Steele Algorithm

The Hillis-Steele algorithm, developed by W. Daniel Hillis and Guy L. Steele Jr. in 1986, provides a parallel approach to computing prefix sums tailored for fine-grained architectures such as the Connection Machine. It leverages data parallelism by assigning one processor per array element and uses a repeated-doubling strategy to propagate partial sums across increasing distances, enabling logarithmic-time execution with n processors. The algorithm operates in \log_2 n phases for an array of size n (a power of 2), where the offset distance doubles each iteration, starting from 1. In each phase, processors execute in parallel: for index i (0-based), if i \geq offset, the value at i is updated by adding the value at i - offset from the previous phase; otherwise, it remains unchanged. Double buffering (using separate input and output arrays) prevents read-write conflicts, ensuring all reads occur from the prior phase's results. This yields a span of O(\log n) but total work of O(n \log n) due to overlapping computations. Unlike sequential prefix sum computation, which requires O(n) work, this method emphasizes parallelism over work efficiency. To illustrate, consider the array [1, 2, 3, 4, 5, 6, 7, 8], where the target inclusive sums are [1, 3, 6, 10, 15, 21, 28, 36]. The iterations proceed as follows:
Offset    Updated array
Initial   [1, 2, 3, 4, 5, 6, 7, 8]
1         [1, 3, 5, 7, 9, 11, 13, 15]
2         [1, 3, 6, 10, 14, 18, 22, 26]
4         [1, 3, 6, 10, 15, 21, 28, 36]
Each phase activates nearly all processors, demonstrating high parallelism, but redundant additions (e.g., early elements recomputed multiple times) contribute to the elevated work complexity. The algorithm's strengths lie in its straightforward implementation and maximal parallelism, with all processors potentially active per step, making it ideal for SIMD machines. However, the redundant operations lead to inefficiency, performing \Theta(n \log n) additions versus the optimal O(n). Pseudocode for the algorithm, adapted for 0-based indexing with double buffering:
for (offset = 1; offset < n; offset *= 2) {
    parallel for (i = 0; i < n; i++) {
        if (i >= offset) {
            out[i] = in[i - offset] + in[i];
        } else {
            out[i] = in[i];
        }
    }
    // Copy out to in or swap pointers
    for (i = 0; i < n; i++) {
        in[i] = out[i];
    }
}
This structure directly reflects the parallel propagation central to the original design.
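A serial Python simulation of the phases above (illustrative, with a list comprehension standing in for the parallel processors and a fresh buffer per phase playing the role of double buffering):
python
def hillis_steele_scan(a):
    # Each pass reads only the previous phase's buffer, so no element
    # is read after being overwritten, mirroring the double buffering.
    n = len(a)
    cur = list(a)
    offset = 1
    while offset < n:
        cur = [cur[i] if i < offset else cur[i - offset] + cur[i]
               for i in range(n)]
        offset *= 2
    return cur

print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]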

Blelloch Algorithm

The Blelloch algorithm is a work-efficient parallel prefix sum computation that achieves optimal asymptotic performance by employing a two-phase approach on a virtual binary tree structure over the input array of size n (assumed to be a power of two). The first phase, known as the up-sweep or reduce phase, constructs partial sums by progressively aggregating elements toward the root of the tree. The second phase, the down-sweep, initializes the root and then propagates the prefix sums back down the tree to compute the final prefixes for all elements. This method ensures that the total number of operations remains linear in n, matching the sequential algorithm's efficiency while enabling parallelism. In the up-sweep phase, the algorithm performs pairwise additions across increasing strides, effectively halving the number of active elements at each level. Specifically, for d = 1 to \log_2 n: in parallel, for each offset k from 0 to n-1 in steps of 2^d, add the value at position k + 2^{d-1} - 1 to the value at position k + 2^d - 1, storing the result in the latter. This builds the tree bottom-up: at level d=1, adjacent pairs are summed; at higher levels, sums of those pairs are combined, until the root at position n-1 holds the total sum after \log_2 n steps. Each step operates in constant time with n / 2^d parallel additions, contributing to the overall linear work. The down-sweep phase then distributes the prefixes starting from the root. For an exclusive scan, first set the root value at position n-1 to 0. Then, for d = \log_2 n down to 1: in parallel, for each offset k from 0 to n-1 in steps of 2^d, temporarily store the left-child value t = x[k + 2^{d-1} - 1], set the left child to the current parent value x[k + 2^d - 1], and update position k + 2^d - 1 to t plus the old parent value. This propagates prefixes without overwriting needed partial sums. For an inclusive scan, the result can be obtained by adding the original input elements to the exclusive outputs (or by adjusting the down-sweep so that each right child receives the left child's up-sweep value plus the parent's value). The phase mirrors the up-sweep in structure but uses swaps and additions for distribution. The algorithm's complexity is O(n) work (the up-sweep performs n - 1 additions, matching sequential computation, and the down-sweep performs another n - 1) and O(\log n) span (the depth of the tree, with \log_2 n parallel steps per phase), making it suitable for parallel random-access machines (PRAM). This work-efficiency distinguishes it from less optimal parallel scans, as it avoids redundant computations while scaling with processor count up to n. To illustrate the exclusive scan, consider the standard input array of size n=8: [3, 1, 7, 0, 4, 1, 6, 3]. Up-sweep:
  • d=1: [3, 4, 7, 7, 4, 5, 6, 9]
  • d=2: [3, 4, 7, 11, 4, 5, 6, 14]
  • d=3: [3, 4, 7, 11, 4, 5, 6, 25]
Set root to 0: [3, 4, 7, 11, 4, 5, 6, 0]
Down-sweep:
  • After d=3: [3, 4, 7, 0, 4, 5, 6, 11]
  • After d=2: [3, 0, 7, 4, 4, 11, 6, 16]
  • After d=1: [0, 3, 4, 11, 11, 15, 16, 22]
The final output is the exclusive prefix sums [0, 3, 4, 11, 11, 15, 16, 22]. For inclusive prefix sums, add the original input elements to the corresponding exclusive outputs: [3, 4, 11, 11, 15, 16, 22, 25]. The tree-like computation is evident in how partial sums at internal nodes enable efficient prefix distribution. The Blelloch algorithm was first described by Guy E. Blelloch in his 1990 technical report (published in 1993), specifically designed for the parallel random-access machine (PRAM) model to support efficient scans as a primitive for building more complex parallel algorithms.
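The two phases can be checked against the worked example with a serial Python simulation (the inner loops stand in for the parallel operations at each tree level):
python
def blelloch_exclusive_scan(a):
    # Assumes len(a) is a power of two; the loop variable d here equals
    # the stride 2^{d-1} of the prose description, doubling each level.
    x = list(a)
    n = len(x)
    d = 1
    # Up-sweep (reduce): pairwise sums at doubling strides.
    while d < n:
        for k in range(0, n, 2 * d):
            x[k + 2*d - 1] += x[k + d - 1]
        d *= 2
    x[n - 1] = 0  # clear the root for the exclusive scan
    # Down-sweep: swap-and-add pushes prefixes down the implicit tree.
    d = n // 2
    while d >= 1:
        for k in range(0, n, 2 * d):
            t = x[k + d - 1]
            x[k + d - 1] = x[k + 2*d - 1]
            x[k + 2*d - 1] += t
        d //= 2
    return x

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]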

Comparison of Parallel Approaches

Parallel prefix sum algorithms are evaluated using key complexity metrics in the work-depth model: work, which measures the total number of operations across all processors; span, which denotes the length of the critical path, i.e., the time along the longest chain of dependent operations; and parallelism, defined as the ratio of work to span, indicating the potential speedup with sufficient processors. These metrics help assess efficiency relative to the sequential O(n) baseline, where n is the array size. The Hillis-Steele algorithm exhibits O(n log n) work and O(log n) span, yielding O(n) parallelism but incurring redundancy through repeated operations on elements, making it less efficient for large inputs compared to the sequential bound. In contrast, the Blelloch algorithm achieves optimal O(n) work and O(log n) span, providing O(n / log n) parallelism and matching the sequential work complexity while enabling logarithmic depth for parallel execution. Other variants offer trade-offs in these metrics. The Ladner-Fischer algorithm balances work and span through a parameterized family of circuits, achieving O(log n) span with work ranging from O(n) to O(n log n) depending on the fan-out parameter, allowing customization for depth-sensitive or resource-constrained environments. The Kogge-Stone algorithm maximizes parallelism with O(log n) span and minimal depth via dense connectivity, but at the cost of O(n log n) work due to high fan-out and wiring complexity, suiting applications that prioritize low latency over work efficiency. Practical factors influencing algorithm choice include cache efficiency, where algorithms with localized memory access, such as Blelloch's, reduce cache misses by minimizing global communication; synchronization overhead, as multi-phase designs such as Blelloch's require barriers at each tree level that scale poorly with thread count; and scalability with processor count p, where work-efficient methods such as Blelloch's maintain performance as p increases up to n / log n, while high-parallelism variants such as Hillis-Steele saturate earlier due to communication bottlenecks. Empirically, the Hillis-Steele algorithm performs well for small arrays (up to 1,024 elements) due to its simplicity and low synchronization in single-block executions, but Blelloch excels for large-scale inputs, delivering up to 20× speedup over sequential CPU scans on GPUs by leveraging work efficiency and block-level parallelism.

Implementations in Memory Models

In shared memory environments, such as multi-core CPUs, prefix sum computations often employ a two-level hierarchy to leverage thread-level parallelism while minimizing synchronization overhead. Each thread first computes a local prefix sum over its assigned partition of the input array, ensuring cache-friendly access patterns by sizing partitions to fit within L2 cache limits, such as 512 KB. A global merge phase then combines these local results using a Blelloch-style scan on the partition sums, adding offsets to adjust the local scans for the full array. This approach, implemented with POSIX threads or OpenMP, achieves up to 3x speedup over standard parallel libraries like GNU Parallel STL on multicore systems by reducing memory bandwidth demands in the second pass. For example, the computation can be organized in OpenMP as a per-thread local scan, a scan of the per-thread block sums, and a final offset pass, with a barrier for synchronization:
cpp
// Assumes <omp.h> is included and that in[], out[], block_sum[] and n
// are declared in the enclosing scope.
#pragma omp parallel
{
    int t = omp_get_thread_num();
    int nt = omp_get_num_threads();
    // Phase 1: each thread scans its own contiguous chunk.
    int low = (int)((long long)n * t / nt);
    int high = (int)((long long)n * (t + 1) / nt);
    double sum = 0;
    for (int i = low; i < high; i++) {
        sum += in[i];
        out[i] = sum;
    }
    block_sum[t] = sum;
    #pragma omp barrier
    // Phase 2: exclusive scan of block sums (small; done per thread).
    double offset = 0;
    for (int b = 0; b < t; b++) offset += block_sum[b];
    // Phase 3: add the block offset to the local results.
    for (int i = low; i < high; i++) out[i] += offset;
}
This method balances work across cores without requiring the number of threads to be a power of two, outperforming classic Blelloch implementations by up to 50% for array sizes fitting in L2 cache on dual-socket quad-core processors.

In distributed memory settings, such as clusters using MPI, the hypercube all-reduce algorithm enables logarithmic-time prefix sums across nodes by exploiting a topology in which the processors form a d-dimensional hypercube (p = 2^d nodes). Each node iteratively exchanges partial sums along each dimension i from 0 to d-1: it sends its current sum m to the neighbor differing in bit i, receives m', updates m = m ⊗ m', and, if its i-th bit is set, adjusts its local result x = m' ⊗ x. This computes both the prefix sum at each node and the global total in O(log p · (α + nβ)) communication time, where α is startup latency, β is per-word cost, and n is the message size per node (a serial simulation appears at the end of this section). For non-power-of-two processor counts, variants such as the binomial tree can approximate this with similar logarithmic rounds. MPI implementations expose this functionality through collective operations such as MPI_Scan, with the hypercube variant excelling when p is a power of two.

For large-scale or streaming data in GPU environments, prefix sums adapt to high-bandwidth but latency-sensitive memory models using pipelined binary tree structures or decoupled look-back techniques to hide communication costs. On GPUs, CUDA implementations divide the array into thread blocks (e.g., 128-1024 elements) fitting in shared memory, where each block performs an intra-block scan via up-sweep (reduce to root) and down-sweep (propagate prefixes) phases, avoiding bank conflicts through padding and loop unrolling. Block sums are then scanned in a separate kernel, and offsets added back to blocks for the global result, handling arbitrary sizes by padding to power-of-two multiples. Recent optimizations in NVIDIA's CUB library employ single-pass decoupled look-back scans, which use redundant computation to propagate prefixes with roughly 2n data movement, achieving throughput equivalent to memcpy (e.g., 18 GB/s) and 1.1-2.3x speedup over prior libraries such as Thrust for large inputs. GPU-specific tiling loads data into shared memory tiles to coalesce global accesses, reducing latency by 4-7x in applications like stream compaction. A simplified CUDA kernel for an exclusive block scan illustrates this:
cuda
__global__ void block_scan(float *g_odata, float *g_idata, int n) {
    // n is the number of elements scanned per block (2 * blockDim.x),
    // a power of two; BLOCK_SIZE is the compile-time thread-block size.
    __shared__ float temp_storage[2 * BLOCK_SIZE];
    int tid = threadIdx.x;
    int offset = 1;
    int base = 2 * blockIdx.x * blockDim.x;
    // Load two elements per thread into shared memory
    temp_storage[2*tid]     = g_idata[base + 2*tid];
    temp_storage[2*tid + 1] = g_idata[base + 2*tid + 1];
    // Up-sweep (reduce): build partial sums in place
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            temp_storage[bi] += temp_storage[ai];
        }
        offset *= 2;
    }
    // Clear the root for an exclusive scan
    if (tid == 0) temp_storage[n - 1] = 0;
    // Down-sweep: distribute prefixes back down the tree
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            float t = temp_storage[ai];
            temp_storage[ai] = temp_storage[bi];
            temp_storage[bi] += t;
        }
    }
    __syncthreads();
    // Write two results per thread
    g_odata[base + 2*tid]     = temp_storage[2*tid];
    g_odata[base + 2*tid + 1] = temp_storage[2*tid + 1];
}
Performance considerations emphasize bandwidth saturation and latency hiding: on multi-core CPUs, two-level scans yield 1.4-3x speedups over sequential code; GPUs achieve up to 20x over CPU baselines for n = 10^7 elements, with CUB scans saturating memory bandwidth; and in distributed MPI setups, pipelined variants for large messages (>1 MB) deliver up to 2x over non-pipelined trees on clusters of 32-72 nodes by overlapping communication with computation. These adaptations ensure scalability across memory models, prioritizing work efficiency and minimal synchronization overhead.
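As noted above, the hypercube exchange can be checked serially; a Python sketch simulating all nodes in one process, assuming p is a power of two and using addition for the generic operator ⊗:
python
def hypercube_prefix_sums(values):
    # m[node]: running partial total; x[node]: inclusive prefix result.
    # Round i exchanges m with the neighbor whose id differs in bit i.
    p = len(values)            # number of nodes, a power of two
    m = list(values)
    x = list(values)
    for i in range(p.bit_length() - 1):
        bit = 1 << i
        m_next = [m[node] + m[node ^ bit] for node in range(p)]
        for node in range(p):
            if node & bit:     # i-th bit set: fold in the lower half
                x[node] += m[node ^ bit]
        m = m_next
    return x                   # m[node] also holds the global total

print(hypercube_prefix_sums([3, 1, 7, 0]))  # [3, 4, 11, 11]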

Advanced Uses

Data Structures

The Fenwick tree, also known as the binary indexed tree, is a data structure that employs prefix sums to enable efficient range sum queries and point updates in O(log n) time for an array of n elements. It achieves this by storing cumulative sums in a tree-like layout in which each index represents a range governed by the binary representation of its position, allowing updates and queries to traverse the tree using least-significant-bit (LSB) manipulations. Construction of a Fenwick tree involves initializing an array of zeros and iteratively updating each position with its value, propagating changes up the tree via repeated LSB-based additions in a loop until the index exceeds n, resulting in O(n log n) total time. The segment tree is a hierarchical data structure that organizes prefix sums (or other associative operations) across intervals in a binary tree, where leaf nodes hold individual elements and internal nodes store aggregates of their children. This enables range sum queries in O(log n) time by decomposing the query interval into O(log n) disjoint segments covered by tree nodes, and supports point updates in O(log n) time by propagating changes from leaves to the root. Building a segment tree requires O(n) time, achieved through a bottom-up or recursive process that computes sums for all nodes in a single pass. Compared to the segment tree, the Fenwick tree is more space-efficient, requiring only a single array of n+1 entries, versus the segment tree's O(n) space with a larger constant factor due to its full structure (typically 4n elements). While both support sum queries and updates in O(log n) time, the segment tree extends more readily to additional operations like range minimum or maximum queries by storing different aggregates in nodes, whereas the Fenwick tree is primarily optimized for sums. For example, consider constructing a Fenwick tree for the array A = [1, 2, 3, 4] (1-indexed). Initialize the tree ft of size 5 as [0, 0, 0, 0, 0]. Update index 1 with value 1: add 1 to ft[1] (now 1), then idx=2, add 1 to ft[2] (now 1), then idx=4, add 1 to ft[4] (now 1), yielding ft = [0, 1, 1, 0, 1]. Update index 2 with 2: add 2 to ft[2] (now 3), then idx=4, add 2 to ft[4] (now 3), yielding ft = [0, 1, 3, 0, 3]. Update index 3 with 3: add 3 to ft[3] (now 3), then idx=4, add 3 to ft[4] (now 6), yielding ft = [0, 1, 3, 3, 6]. Update index 4 with 4: add 4 to ft[4] (now 10), yielding ft = [0, 1, 3, 3, 10]. A prefix sum query for the first k elements accumulates ft[k] while repeatedly subtracting the least significant bit from k until k reaches 0. Modern variants, such as wavelet trees, extend prefix sum concepts to compressed representations of sequences over large alphabets, using balanced binary trees of bitvectors to support rank/select operations in O(log σ) time (where σ is the alphabet size) while achieving near-optimal space usage for succinct data structures.
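A compact Python version of the structure, reproducing the worked example above (class name illustrative):
python
class FenwickTree:
    # 1-indexed binary indexed tree; ft[i] covers a range whose length
    # equals the least significant bit of i.
    def __init__(self, n):
        self.n = n
        self.ft = [0] * (n + 1)

    def update(self, i, delta):
        # Climb the tree by repeatedly adding the LSB of i.
        while i <= self.n:
            self.ft[i] += delta
            i += i & (-i)

    def prefix_sum(self, k):
        # Accumulate ft[k] while stripping the LSB of k.
        total = 0
        while k > 0:
            total += self.ft[k]
            k -= k & (-k)
        return total

t = FenwickTree(4)
for i, v in enumerate([1, 2, 3, 4], start=1):
    t.update(i, v)
print(t.ft)             # [0, 1, 3, 3, 10], as derived above
print(t.prefix_sum(3))  # 6 = 1 + 2 + 3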

Applications

Prefix sums find widespread application in algorithmic problem-solving, particularly for efficient range sum queries on static arrays. By precomputing the cumulative sum up to each index, the sum of elements between indices i and j can be retrieved in constant time as the difference between the prefix sums at j and i-1, reducing query time from O(n) to O(1) after O(n) preprocessing. This technique is essential in competitive programming and data structure optimizations, such as answering multiple subarray sum queries efficiently. Additionally, prefix sums enable the reconstruction of arrays from their difference arrays, which is crucial for range update operations; updates are applied to the difference array in O(1) time per update, and the updated array is recovered via a prefix sum computation in linear time, as sketched below. In statistics and data analysis, prefix sums facilitate the computation of empirical cumulative distribution functions (CDFs), providing a step-wise approximation of the underlying distribution for tasks such as sampling or quantile estimation. They also support efficient calculation of moving averages, where the average over a sliding window of size k ending at index j is derived from prefix sums as (P[j] - P[j-k])/k, enabling smoothing in audio processing or sensor data streams without redundant summation. Computer graphics leverages prefix sums in order-independent transparency (OIT) rendering techniques to handle semi-transparent surfaces without requiring fragment sorting. In per-pixel fragment list methods, a prefix sum scan computes the count of fragments per pixel during a first pass, generating offsets for a second pass that blends colors and opacities accurately, achieving real-time performance on GPUs for complex scenes like molecular visualizations. In machine learning, prefix sums enhance the efficiency of attention mechanisms in transformer models, notably in the Performer architecture, where they enable an unbiased approximation of softmax attention via a prefix sum over random feature projections, reducing the quadratic complexity of standard attention to near-linear while preserving expressiveness for long-sequence tasks like language modeling. This approach has been integrated into deep learning frameworks for scalable training of large models. Furthermore, in finance, prefix sums compute running totals for metrics such as cumulative returns or cash flows, allowing quick assessments of performance over historical periods without iterative re-addition.
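A short sketch of the two array techniques described above (the helper name build_prefix is illustrative): O(1) range sum queries after linear preprocessing, and range updates via a difference array:
python
def build_prefix(a):
    # Exclusive form: p[i] = a[0] + ... + a[i-1], so p has len(a)+1 entries.
    p = [0]
    for x in a:
        p.append(p[-1] + x)
    return p

a = [5, 2, 8, 1, 9]
p = build_prefix(a)
i, j = 1, 3
print(p[j + 1] - p[i])   # 11 = 2 + 8 + 1, in O(1) per query

# Range update: add v to a[l..r] in O(1) on the difference array,
# then recover the updated array with one prefix-sum pass.
d = [0] * (len(a) + 1)
l, r, v = 0, 2, 10
d[l] += v
d[r + 1] -= v
updated = [x + s for x, s in zip(a, build_prefix(d)[1:])]
print(updated)           # [15, 12, 18, 1, 9]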
