LCP array
The LCP array (Longest Common Prefix array) is an auxiliary data structure in string algorithms that accompanies a suffix array, which is a permutation of all suffixes of a given string sorted in lexicographical order. For a string of length n, the LCP array H of size n stores, at each index i \geq 1, the length of the longest common prefix between the (i-1)-th and i-th suffixes in the suffix array (with H[0] = 0); this captures the shared prefix length for adjacent suffixes, enabling efficient computations that would otherwise require direct string comparisons.[1][2]
The LCP array facilitates linear-time construction from a precomputed suffix array using algorithms like Kasai's method, which exploits the property that if two suffixes share a prefix of length h, their one-character extensions share at least h-1 characters, allowing comparisons to skip matched portions and bound the total work to O(n) by performing at most 3n character comparisons across all pairs.[3] This efficiency stems from traversing suffixes in text order while maintaining a "height" (LCP) value that decreases predictably, ensuring mismatches contribute to an amortized constant cost per position. Parallel variants, such as par-PLCP, further optimize construction to O(n + K l_{\max}) work and O(n/K + l_{\max}) depth for K processors, achieving speedups on multicore systems for large-scale data.[2]
Key applications of the LCP array include identifying the longest repeated substring in a string by finding the maximum entry in O(n) time after construction, simulating suffix tree traversals for pattern matching without explicit tree storage, and supporting compressed text indexes like the Burrows-Wheeler transform for data compression.[1] In bioinformatics and information retrieval, it enables efficient queries such as finding all z occurrences of a pattern of length m in O(m + \log n + z) time using range minimum queries on the LCP array to compute lowest common ancestors in an implicit suffix tree, while its space efficiency (typically O(n \log n) bits) makes it preferable over full suffix trees for massive datasets like genomes.[2]
Fundamentals
Definition
The longest common prefix (LCP) array is an auxiliary data structure that complements the suffix array in string algorithms, introduced alongside the suffix array itself.[4]
For a string S of length n over a finite alphabet, appended with a unique sentinel character $ that is lexicographically smaller than any symbol in the [alphabet](/page/Alphabet) (so S[n-1] = $ ), the [suffix array](/page/Suffix_array) SA is a [permutation](/page/Permutation) of the indices \{0, 1, \dots, n-1\} such that the suffixes satisfy S[SA[0] \dots n-1] \leq S[SA[1] \dots n-1] \leq \dots \leq S[SA[n-1] \dots n-1] in lexicographical order. The LCP array LCP[0 \dots n-1] satisfies LCP[0] = 0 , and for each i = 1, \dots, n-1 , LCP[i] is the length of the longest common prefix between the suffixes starting at SA[i-1] and SA[i] ; formally, LCP[i] = \max \{ k \geq 0 \mid S[SA[i-1] + j] = S[SA[i] + j] \ \forall j = 0, \dots, k-1 \} .[4]
The primary purpose of the LCP array is to measure the degree of similarity between consecutive suffixes in the sorted order of the suffix array, thereby facilitating efficient string processing tasks by avoiding repeated substring comparisons during queries or traversals.[4]
Indexing conventions for the LCP array vary in the literature, with 0-based indexing as used above being standard in many implementations.[6] The sentinel $ guarantees distinct suffixes and correct lexicographical ordering by preventing any suffix from being a proper prefix of another. Each entry LCP[i] is a non-negative integer at most n-2 , representing the maximum possible prefix length before divergence (since suffixes are distinct due to the sentinel). The LCP array depends solely on the suffix array SA and the original string S , with no independent construction beyond these inputs.[7]
Example
To illustrate the LCP array, consider the string S = \text{banana\$} , where the dollar sign ($) is a unique terminator smaller than any character to ensure proper suffix sorting. The suffixes of S are:
- Position 0: banana$
- Position 1: anana$
- Position 2: nana$
- Position 3: ana$
- Position 4: na$
- Position 5: a$
- Position 6: $
These suffixes, when sorted lexicographically (assuming 0-based indexing), yield the suffix array \text{SA} = [6, 5, 3, 1, 0, 4, 2], corresponding to the ordered suffixes: $, a$, ana$, anana$, banana$, na$, nana$.[2]
The LCP array is then computed by finding the length of the longest common prefix between each pair of consecutive suffixes in this sorted order, with \text{LCP}[0] = 0 by convention. For S = \text{banana\$} , the resulting LCP array is [0, 0, 1, 3, 0, 0, 2] . This is obtained through direct pairwise [prefix](/page/Prefix) comparisons: for instance, the suffixes at positions 5 (a$) and 3 (ana$) share the single character "a"; the suffixes at positions 3 (ana$) and 1 (anana$) share the three characters "ana"; and the suffixes at positions 4 (na$) and 2 (nana$) share the two characters "na". All other consecutive pairs share no characters.[2]
The following table aligns the suffix array ranks, starting positions, suffixes, and LCP values for clarity:
| Rank i | SA[i] | Suffix | LCP[i] |
|---|---|---|---|
| 0 | 6 | $ | 0 |
| 1 | 5 | a$ | 0 |
| 2 | 3 | ana$ | 1 |
| 3 | 1 | anana$ | 3 |
| 4 | 0 | banana$ | 0 |
| 5 | 4 | na$ | 0 |
| 6 | 2 | nana$ | 2 |
This example demonstrates how the LCP array quantifies the shared prefixes among sorted suffixes, providing a compact representation of their similarities.[2]
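The table above can be reproduced with a short brute-force sketch (function names are illustrative):

```python
# Brute-force reconstruction of the banana$ example: sort all suffixes,
# then compare each adjacent pair character by character.
def suffix_array_naive(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array_naive(s, sa):
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = s[sa[i - 1]:], s[sa[i]:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

s = "banana$"
sa = suffix_array_naive(s)
print(sa)                        # [6, 5, 3, 1, 0, 4, 2]
print(lcp_array_naive(s, sa))    # [0, 0, 1, 3, 0, 0, 2]
```

Sorting by full suffix comparison works here because '$' has a smaller character code than any letter, matching the sentinel convention used throughout.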
Construction
Naive Approach
The naive approach to constructing the LCP array assumes that the suffix array SA for a string S of length n (terminated by a unique sentinel character) is already computed. This method computes each LCP value by directly comparing the corresponding suffixes character by character until a mismatch is found. Specifically, for each index i from 1 to n-1 (0-based indexing), LCP[i] is set to the length of the longest common prefix between the suffixes starting at positions SA[i-1] and SA[i] in S.
To implement this, a loop iterates over the positions in the suffix array, and for each pair of adjacent suffixes, an inner loop advances a counter k while S[\text{SA}[i-1] + k] = S[\text{SA}[i] + k], stopping at the first mismatch or string end, with the final k becoming LCP[i]. LCP[0] is conventionally set to 0, as there is no preceding suffix. The following pseudocode illustrates this process:
```
function naive_LCP(S, SA):
    n = length(S)
    LCP = array of size n, initialized to 0
    for i = 1 to n-1:
        k = 0
        while SA[i-1] + k < n and SA[i] + k < n and S[SA[i-1] + k] == S[SA[i] + k]:
            k += 1
        LCP[i] = k
    return LCP
```
This direct comparison yields a space complexity of O(n) for storing the LCP array itself, in addition to the space for SA and S. However, the time complexity is O(n^2) in the worst case, as there are n-1 pairs and each comparison may examine up to O(n) characters, particularly in repetitive strings like all identical characters.
Efficient Algorithms
The construction of the LCP array can be performed in linear time relative to the string length n, assuming the suffix array is already available. The seminal algorithm for this purpose, introduced by Kasai et al. in 2001, scans the suffixes in their original positional order while leveraging the inverse suffix array to identify consecutive pairs in the sorted order and incrementally computing LCP values based on a key monotonicity property.[7] This method defines a height array h[0 \dots n-1], where h[i] represents the LCP length between the suffixes starting at positions SA[i] and SA[i-1] (with h[0] = 0), enabling efficient pairwise comparisons without recomputing prefixes from scratch.[7]
Kasai's algorithm proceeds as follows: First, compute the inverse suffix array ISA[0 \dots n-1], where ISA[i] is the rank of the suffix starting at position i in the suffix array (i.e., SA[ISA[i]] = i). Initialize h[0 \dots n-1] = 0 and a carry-over variable k = 0. Then, for each starting position i = 0 to n-1:
- If ISA[i] = 0 (the first suffix in sorted order), set k = 0 and continue.
- Let j = SA[ISA[i] - 1], the starting position of the suffix immediately preceding the one at i in sorted order.
- While i + k < n, j + k < n, and the characters at i + k and j + k match, increment k.
- Set h[ISA[i]] = k.
- Set k = \max(k - 1, 0).
This process exploits the property that the LCP between a suffix and its predecessor in sorted order is at least one less than the LCP computed for the previous suffix pair, ensuring that the total number of character comparisons across all iterations of the inner while loop is bounded by O(n).[7] The overall time complexity is thus O(n), with O(n) space usage beyond the input suffix array.[7]
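The steps above translate directly into a short implementation; the following Python sketch assumes the suffix array is given:

```python
# Kasai's algorithm: O(n) LCP array from the string and its suffix array.
def kasai(s, sa):
    n = len(s)
    isa = [0] * n                  # inverse suffix array: isa[sa[r]] = r
    for r, pos in enumerate(sa):
        isa[pos] = r
    h = [0] * n
    k = 0                          # LCP carried over from the previous position
    for i in range(n):             # scan suffixes in text order
        if isa[i] == 0:            # first suffix in sorted order: no predecessor
            k = 0
            continue
        j = sa[isa[i] - 1]         # suffix preceding suffix i in sorted order
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        h[isa[i]] = k
        k = max(k - 1, 0)          # monotonicity: next LCP is at least k - 1
    return h

print(kasai("banana$", [6, 5, 3, 1, 0, 4, 2]))  # [0, 0, 1, 3, 0, 0, 2]
```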
Subsequent works have proposed refinements and alternatives that maintain linear time while improving practical performance or reducing space. For instance, Ohlebusch et al. (2011) introduced lightweight algorithms inspired by Kärkkäinen's DC3 suffix array construction, which integrate LCP computation during the inductive sorting phases to achieve faster runtimes on large inputs with minimal additional memory.[8] These methods preserve the core inverse-scan logic of Kasai's approach but optimize for cache efficiency and external memory scenarios.[8]
Applications
Pattern Matching
The LCP array facilitates efficient pattern matching in a text T of length n by augmenting the suffix array, which sorts all suffixes of T in lexicographic order. To locate occurrences of a pattern P of length m, perform a binary search on the suffix array to identify the contiguous range [L, R] of indices where the suffixes starting at positions SA[L] to SA[R] (with SA denoting the suffix array) have P as a prefix. The number of occurrences equals R - L + 1, and their starting positions are SA[k] for k = L to R. This approach leverages the sorted order of suffixes to isolate matching prefixes without scanning the entire text.
The LCP array enhances this process by enabling faster comparisons during binary search. In each step, when probing a midpoint suffix, compute the length of the common prefix between P and the probed suffix by first finding the minimum LCP value in the relevant range of the suffix array (using the LCP array's entries between the current bounds and the midpoint), which avoids full character-by-character scans. If the minimum LCP exceeds the current matched length, adjust the search bounds accordingly; otherwise, compare additional characters from P only as needed. This preprocessing reduces the total character comparisons across all binary search iterations.
Without the LCP array, each binary search comparison requires up to O(m) time, yielding an overall search time of O(m \log n). With the LCP array, the total time improves to O(m + \log n), as each character of P is compared at most once, while the binary search contributes O(\log n) steps, each accelerated by constant-time minimum LCP queries (precomputable in linear space via sparse tables or similar structures).
Consider the text T = "banana" , where the suffix array is SA = [5, 3, 1, 0, 4, 2] (suffixes starting at positions 5: "a", 3: "ana", 1: "anana", 0: "banana", 4: "na", 2: "nana") and the LCP array is LCP = [0, 1, 3, 0, 0, 2]. Searching for P = "ana" via binary search identifies the range [1, 2] (suffixes at positions 3 and 1), confirming two occurrences at positions 1 and 3 in T. The LCP values, particularly the minimum of 3 in the probed range, verify the prefix match efficiently without re-scanning "ana" repeatedly.
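A minimal sketch of the binary search on the "banana" example (without the LCP acceleration, so each probe may compare up to m characters) might look like:

```python
# Binary search over the suffix array for the range of suffixes having
# `pattern` as a prefix; returns a half-open rank interval.
# Simplified O(m log n) sketch: the LCP-accelerated variant described
# above would avoid re-comparing already matched characters.
def sa_search(text, sa, pattern):
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                           # leftmost suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                           # leftmost suffix not prefixed by pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] == pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo                          # occurrences are sa[left:lo]

text, sa = "banana", [5, 3, 1, 0, 4, 2]
l, r = sa_search(text, sa, "ana")
print(sorted(sa[l:r]))                       # [1, 3]
```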
Suffix Tree Construction
The LCP array facilitates the efficient construction of a suffix tree from a given suffix array by encoding the necessary branching information through consecutive common prefix lengths between sorted suffixes. This conversion produces an equivalent suffix tree that explicitly represents the hierarchical structure of all suffixes, enabling advanced string processing operations. The process traverses the suffix array in order, using LCP values to identify points where suffixes diverge, thereby defining internal nodes and their depths. Internal nodes are created for groups of suffixes sharing a common prefix of length LCP, with the tree's topology reflecting the minimal set of such groupings.[9]
A prominent approach adapts ideas from Ukkonen's linear-time suffix tree algorithm to the suffix array context by constructing a Cartesian tree directly from the LCP array, where node depths correspond to LCP heights and leaves represent suffixes in suffix array order. In this Cartesian tree, the structure enforces a min-heap property on LCP values along paths, capturing the lowest common ancestor depths that mirror the suffix tree's internal node placements. This adaptation leverages the sorted nature of the suffix array to build the tree bottom-up, avoiding the need for incremental suffix insertions as in the original Ukkonen method. The resulting tree is isomorphic to the suffix tree, with LCP values serving as the metric for node heights.[10]
The construction proceeds in linear time through the following steps: initialize the leaves as the suffixes referenced by the suffix array; traverse the array from left to right, maintaining a stack representing the rightmost path of the current subtree. For each position i, while the stack's top has a depth greater than LCP[i], pop nodes to resolve branches; if LCP[i] exceeds the current top's depth, push a new internal node with depth LCP[i] and attach the previous subtree as its left child. The current suffix becomes the right child or leaf attachment point. Edge labels between nodes are derived from the original string, spanning the difference in depths between a node and its parent: for example, the edge into the leaf for suffix SA[i] is labeled by the substring starting at position SA[i] plus the parent's depth. This stack-based traversal ensures each LCP change creates or resolves internal nodes precisely where branching occurs.[11]
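Under the 0-based convention with LCP[0] = 0 used earlier, the stack-based traversal can be sketched as follows (a topology-only illustration using plain dicts; edge labels are omitted and the representation is an assumption for this sketch):

```python
# Stack-based construction of the suffix tree topology from SA and LCP.
# Internal nodes are dicts {"depth": d, "children": [...]}; leaves are
# suffix start positions (ints).
def tree_from_sa_lcp(sa, lcp):
    root = {"depth": 0, "children": []}
    stack = [root]                           # rightmost path, root at bottom
    for i, suf in enumerate(sa):
        l = lcp[i] if i > 0 else 0
        while stack[-1]["depth"] > l:
            stack.pop()                      # completed subtree; already linked
        top = stack[-1]
        if top["depth"] < l:
            # branching point: a new internal node at depth l adopts the
            # previous suffix's subtree (top's most recent child)
            node = {"depth": l, "children": [top["children"].pop()]}
            top["children"].append(node)
            stack.append(node)
            top = node
        top["children"].append(suf)          # leaf for suffix sa[i]
    return root

root = tree_from_sa_lcp([6, 5, 3, 1, 0, 4, 2], [0, 0, 1, 3, 0, 0, 2])
```

For "banana$" this yields a root with children $, the "a" node (depth 1, containing a$ and the "ana" node at depth 3), banana$, and the "na" node (depth 2), matching the suffix tree's branching structure.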
Once built, the suffix tree has O(n) nodes and edges, matching the theoretical minimum for representing n suffixes, and supports O(n) total construction time following the computation of the suffix array and LCP array. This efficiency enables seamless integration with applications requiring tree-based navigation, such as dynamic pattern queries, while preserving the space advantages of the suffix array format.[6]
Longest Repeated Substrings
The longest repeated substring in a string can be identified using the LCP array in conjunction with the suffix array. The length of this substring is given by the maximum value in the LCP array, specifically \max_{i=1}^{n-1} \mathrm{LCP}[i], where n is the length of the string. This maximum represents the longest common prefix between any two suffixes, which corresponds to the longest repeated substring. The actual substring is the common prefix of length \mathrm{LCP}[k] starting from the positions \mathrm{SA}[k-1] and \mathrm{SA}[k] in the original string, where k is the index achieving the maximum.[7]
To find all repeated substrings exceeding a specified length threshold t, the LCP array can be scanned to identify all indices i where \mathrm{LCP}[i] > t. For each such i, the repeated substring of length \mathrm{LCP}[i] occurs at original positions \mathrm{SA}[i-1] and \mathrm{SA}[i]. This scan requires O(n) time after the suffix array and LCP array have been constructed.[7]
For the string "banana$", the suffix array is [6, 5, 3, 1, 0, 4, 2] and the LCP array is [0, 0, 1, 3, 0, 0, 2], yielding a maximum LCP value of 3 at index 3 (0-based). This corresponds to the substring "ana", which repeats starting at positions 1 and 3 in the original string.[7]
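This maximum-entry lookup can be sketched in a few lines (function name illustrative):

```python
# Longest repeated substring: the maximum LCP entry gives its length,
# and the corresponding suffix array entry gives one occurrence.
def longest_repeated_substring(s, sa, lcp):
    k = max(range(1, len(lcp)), key=lambda i: lcp[i])
    return s[sa[k]:sa[k] + lcp[k]]           # empty string if no repeat

print(longest_repeated_substring("banana$",
                                 [6, 5, 3, 1, 0, 4, 2],
                                 [0, 0, 1, 3, 0, 0, 2]))  # ana
```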
To identify distinct repeats that are non-overlapping or separated by at least a minimum distance d, an additional check is applied during the scan: for each candidate pair at \mathrm{SA}[i-1] and \mathrm{SA}[i], verify that |\mathrm{SA}[i-1] - \mathrm{SA}[i]| \geq \mathrm{LCP}[i] + d. This ensures the occurrences do not overlap excessively or violate the separation requirement, while preserving the O(n) time complexity.[7]
Extensions
LCP Queries for Arbitrary Suffixes
To compute the longest common prefix (LCP) between any two arbitrary suffixes of a string, which may not be consecutive in the lexicographically sorted order defined by the suffix array, one can leverage the precomputed LCP array through range minimum queries (RMQ).[6] Specifically, for two suffixes starting at positions i and j (with i < j) in the original string, their LCP length equals the minimum value in the LCP array over the range between their positions in the sorted suffix array.[12] This approach avoids direct string comparisons, which would be inefficient for large strings, and instead exploits the property that the LCP between non-adjacent sorted suffixes is bounded by the smallest LCP value among the intervening consecutive pairs.[6]
The core technique involves building an RMQ data structure on the LCP array to support fast minimum queries. A widely used method is the sparse table, which preprocesses the LCP array by computing minimum values over dyadic ranges (powers of two).[13] The sparse table construction takes O(n \log n) time and space, where n is the string length, by filling a 2D array where each entry st[k][i] stores the minimum in the range starting at index i of length 2^k.[13] Once built, any RMQ can be answered in O(1) time by selecting two overlapping dyadic ranges that cover the query interval and taking their minimum.[13]
The full algorithm for an LCP query between suffixes i and j proceeds as follows: First, determine the ranks r_1 and r_2 of the suffixes in the sorted order using the inverse suffix array (where the inverse maps original positions to sorted indices). Assume without loss of generality that r_1 < r_2; then, the LCP is the minimum in the LCP array from index r_1 + 1 to r_2.[12] This range minimum is retrieved via the sparse table query. The inverse suffix array can be computed in O(n) time during suffix array construction.[6] Overall, preprocessing for these queries requires O(n \log n) time (dominated by the sparse table build after the LCP array is available), while each individual query runs in O(1) time.[12]
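The scheme above can be sketched as follows (all function names are illustrative): a sparse table is built over the LCP array, and each query maps two text positions to ranks via the inverse suffix array before taking a range minimum.

```python
# Sparse table over the LCP array: O(n log n) preprocessing, O(1) range
# minima, giving O(1) LCP queries between arbitrary suffixes.
def build_sparse_table(a):
    n = len(a)
    st = [list(a)]                           # st[k][i] = min of a[i : i + 2**k]
    k = 1
    while (1 << k) <= n:
        prev, half = st[-1], 1 << (k - 1)
        st.append([min(prev[i], prev[i + half])
                   for i in range(n - (1 << k) + 1)])
        k += 1
    return st

def range_min(st, lo, hi):                   # min of a[lo..hi], inclusive
    k = (hi - lo + 1).bit_length() - 1
    return min(st[k][lo], st[k][hi - (1 << k) + 1])

def lcp_of_suffixes(isa, lcp, st, i, j):
    # LCP of the suffixes starting at text positions i and j (i != j)
    r1, r2 = sorted((isa[i], isa[j]))
    return range_min(st, r1 + 1, r2)

# banana$ example: ranks come from the inverse suffix array
sa = [6, 5, 3, 1, 0, 4, 2]
isa = [0] * len(sa)
for r, p in enumerate(sa):
    isa[p] = r
lcp = [0, 0, 1, 3, 0, 0, 2]
st = build_sparse_table(lcp)
print(lcp_of_suffixes(isa, lcp, st, 1, 3))   # 3  ("ana")
```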
This capability extends the utility of the LCP array beyond consecutive suffixes, enabling efficient computation of LCPs for unsorted or arbitrarily selected pairs without recomputing prefix matches from scratch.[6] It is particularly valuable in applications requiring repeated LCP evaluations on diverse suffix pairs, such as advanced pattern matching or compression algorithms that operate on non-adjacent string segments.[12]
Variants and Generalizations
Variants of the longest common prefix (LCP) array extend its utility to specialized string models and data structures, particularly those handling repetitive or multi-source texts. In highly repetitive collections, such as genomic datasets with duplicated regions, the LCP array can be compressed using run-length encoding to exploit long sequences of identical LCP values, significantly reducing space usage while maintaining query efficiency. This approach integrates with compressed suffix trees and FM-indexes, where the run-length encoded LCP array replaces the standard representation, enabling navigation in structures like the Burrows-Wheeler transform (BWT) with o(n) bits per LCP value for repetitive strings. For instance, in the construction of fully-compressed suffix trees for repetitive texts, the LCP array is stored in a wavelet tree or Elias-Fano monotonic sequence, but run-length encoding further optimizes it by grouping equal values, achieving up to 20-50% space savings on DNA collections with high repetitiveness.[14][15]
Another generalization addresses colored strings, where each suffix is assigned a color representing its origin, such as different genomes in a pan-genome dataset. The colored LCP (cLCP) array extends the standard LCP by computing, for each suffix in the generalized suffix array, the longest common prefix between that suffix and the nearest suffixes from other colors (documents). This is formalized through upper and lower colored LCP arrays (UcLCP and LcLCP), which track LCP values over subranges to the nearest positions of specific other colors, allowing efficient computation via sequential scans of the suffix array in O(n) time for a collection of n symbols. In bioinformatics, this variant supports tasks like finding conserved regions across multiple genomes by identifying common substrings between different genomes, with applications to bacterial genomes.[16][17]
Generalized suffix arrays (GSAs) adapt the LCP array to multi-string sets by concatenating strings with unique separators smaller than the alphabet, ensuring suffixes from different strings are distinguishable. The LCP array for a GSA is computed similarly to the single-string case, with Kasai's linear-time algorithm extended to skip separator positions and handle cross-string comparisons only when prefixes match up to boundaries. This enables LCP values to reflect common prefixes across multiple documents, supporting applications like multi-pattern matching in document retrieval. For wildcards, the GSA incorporates don't-care symbols by modifying the comparison during sorting, preserving LCP computation integrity for approximate matching. Algorithms for external-memory construction of such GSAs, including LCP, process terabyte-scale collections in O(n log n) time using disk-based induced sorting.[18][19]
The suffix array induced sorting (SA-IS) algorithm integrates LCP computation directly into the SA construction process, avoiding a separate post-processing step. In the SA-IS framework, after inducing the order of L-type and S-type suffixes, the LCP array is built by propagating LCP values during the final ranking phase, achieving simultaneous O(n) time and space for both structures. This optimization is particularly effective for large-scale texts, as it reuses the induced buckets to compute pairwise LCPs without additional scans. Recent variants refine this by almost pure induced sorting, computing LCP entries in parallel during bucket sorting for enhanced performance on multi-core systems.[20]
In modern bioinformatics, LCP array variants underpin compressed indexes like the r-index for pan-genomes, where the run-length encoded BWT pairs with sampled LCP queries to support pattern matching across thousands of genomes in space proportional to the amount of distinct sequence rather than the total length n. The r-index enables LCP-bounded backward search, computing maximal exact matches (MEMs) in time dependent on the LCP length, which scales efficiently for repetitive pan-genomes like human haplotype collections. For example, indexing approximately 100 human genome assemblies requires around 100 GB, with MEM queries faster than traditional FM-indexes on repetitive data. These generalizations facilitate pan-genome alignments and variant detection by generalizing LCP to graph-based representations, integrating with tools for non-hierarchical full-text indexing.[21][22]