Longest common substring
In computer science, the longest common substring of two or more strings is defined as the longest contiguous sequence of characters that appears as a substring in all of them.[1] Given two strings S and T of lengths up to n over an alphabet of size \sigma, the problem requires identifying this maximal-length common fragment.[2]
This classic string problem can be solved using dynamic programming in O(n^2) time and space, where a 2D table tracks the lengths of common suffixes ending at each pair of positions in S and T. More efficient solutions build a generalized suffix tree of the input strings and locate its deepest node whose subtree contains suffixes of both strings, giving O(n) time for constant-size alphabets and O(n \log \sigma) in general.[1] Advanced variants achieve even smaller complexities, such as O(n \log \sigma / \sqrt{\log n}) time using O(n \log \sigma / \log n) space.[2]
The problem has significant applications in bioinformatics, including DNA sequence alignment, microarray analysis, and identifying common molecular subsequences in genomic data.[1] It also supports text processing tasks like plagiarism detection, where it identifies copied contiguous passages in documents by computing maximal overlapping substrings.[3]
Fundamentals
A substring of a string is defined as a contiguous sequence of characters within that string.
The longest common substring problem, given two strings S and T over an alphabet \Sigma, consists of finding the longest string U that is a substring of both S and T, along with its length |U| and the starting positions where it occurs in S and T.[4] The notation \mathrm{LCSubstring}(S, T) is commonly used to denote this longest string U, while |\mathrm{LCSubstring}(S, T)| refers to its length.[4]
Special cases arise depending on the input strings. If either S or T is the empty string, then \mathrm{LCSubstring}(S, T) is the empty string with length 0.[5] If S and T are identical, then \mathrm{LCSubstring}(S, T) = S = T, so the length equals |S|. If S and T share no common non-empty substring (e.g., because they are drawn from disjoint alphabets), the length is 0, corresponding to the empty string.[5]
The problem generalizes to k \geq 2 strings S_1, S_2, \dots, S_k, where the goal is to find the longest string that is a substring of every S_i for i = 1 to k, again including its length and positions in each string.[6] In this setting, \mathrm{LCSubstring}(S_1, \dots, S_k) denotes the resulting string, and the empty string serves as the trivial solution with length 0 when no non-empty common substring exists.[6] This generalization can be computed using methods such as dynamic programming extended to multiple dimensions.[6]
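The two-string definition can be checked directly with a short brute-force sketch (Python; the function name `lcsubstring` is ours and purely illustrative):

```python
def lcsubstring(s, t):
    """Return one longest common substring of s and t (brute force).

    Enumerates substrings of s from longest to shortest and returns
    the first one that also occurs contiguously in t.
    """
    for length in range(len(s), 0, -1):
        for start in range(len(s) - length + 1):
            candidate = s[start:start + length]
            if candidate in t:  # contiguous occurrence in t
                return candidate
    return ""  # covers empty inputs and disjoint alphabets

# Special cases from the text:
assert lcsubstring("", "ABC") == ""            # empty input
assert lcsubstring("ABAB", "ABAB") == "ABAB"   # identical strings
assert lcsubstring("AAA", "BBB") == ""         # no common non-empty substring
```

This enumeration is far slower than the algorithms discussed later, but it matches the definition term for term.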
Key Properties
The longest common substring (LCS) of two strings is not necessarily unique, as multiple distinct substrings of the maximum length may occur in both strings simultaneously. In practice, identifying any one such substring suffices for most applications, since the problem typically seeks the length or a representative instance rather than an exhaustive enumeration.
A defining characteristic of the LCS is its contiguous nature: the common characters must appear consecutively in both input strings without gaps. This distinguishes it from the longest common subsequence, where matched elements need not be adjacent, allowing more flexible alignments in sequence comparison tasks. The contiguity constraint simplifies certain structural analyses, such as those using suffix-based representations.
Occurrences of the LCS within the input strings can overlap, meaning the same maximal substring may appear in multiple positions that share characters. This overlapping property arises naturally in strings with repetitive patterns and influences how multiple instances are detected in advanced string processing.
The LCS length demonstrates monotonicity: if string U is a substring of string V, then for any string T, the LCS length between U and T is at most the LCS length between V and T. This follows directly from the substring inclusion, as every common substring of U and T remains a valid common substring of V and T.
The LCS length remains invariant under certain transformations, such as reversing both input strings; the length of the LCS between S and T equals that between the reverses of S and T. This invariance holds because reversal preserves the relative order and adjacency of characters within substrings, merely flipping their orientation without altering lengths.
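Both the monotonicity and reversal-invariance properties can be sanity-checked with a small brute-force helper (Python; `lcs_len` is a hypothetical name used only for this sketch):

```python
def lcs_len(s, t):
    """Length of the longest common substring of s and t (brute force)."""
    return max((length
                for length in range(1, len(s) + 1)
                for start in range(len(s) - length + 1)
                if s[start:start + length] in t), default=0)

# Monotonicity: U = "ANA" is a substring of V = "BANANA".
U, V, T = "ANA", "BANANA", "CANAL"
assert lcs_len(U, T) <= lcs_len(V, T)

# Reversal invariance: reversing both strings preserves the length.
S = "ABAB"
assert lcs_len(S, T) == lcs_len(S[::-1], T[::-1])
```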
Examples and Illustrations
Basic Examples with Two Strings
Consider two strings S = "ABAB" and T = "BABA". To identify the longest common substring manually, examine contiguous sequences starting from each position in S and check for matches in T. For instance, starting at position 1 in S ("ABA"), this sequence matches positions 2 through 4 in T ("ABA"). Similarly, starting at position 2 in S ("BAB"), it matches positions 1 through 3 in T ("BAB"). Each of these substrings has length 3, establishing the maximum length for common contiguous segments in this pair. Shorter common substrings, such as "AB" (length 2, appearing multiple times in both), are discarded because the problem requires the longest possible match.
The following alignment visualizes one such longest common substring ("ABA"):
S: A B A B
   ↑ ↑ ↑
T: B A B A
     ↑ ↑ ↑
Here, the arrows mark the positions of the contiguous match, emphasizing that the substring must appear consecutively without gaps in both strings.
Another illustrative example involves S = "abcdxyz" and T = "xyzabcd". Manual inspection reveals that "abcd" appears contiguously at the beginning of S (positions 1-4) and at the end of T (positions 4-7). This substring of length 4 is the longest common one, surpassing shorter matches like "xyz" (length 3). As with the previous case, the focus remains on maximizing length while ensuring contiguity, recalling the formal requirement that substrings be consecutive sequences.[7]
The alignment for "abcd" is shown below:
| S Positions | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| S | a | b | c | d | x | y | z |
| T Positions | 4 | 5 | 6 | 7 | | | |
| T | a | b | c | d | | | |
This table underscores the positional overlap, where the common segment aligns directly without interruption. A common pitfall in such manual checks is overlooking overlapping possibilities or settling for non-maximal matches, such as the length-3 "xyz" here, which is valid but not optimal.
Examples with Multiple Strings
When extending the longest common substring problem to multiple strings, the goal is to find the longest contiguous sequence that appears in every string in the set, representing the intersection of common substrings across all inputs. This generalization is particularly useful in computational biology for identifying conserved motifs in sets of DNA or protein sequences from related organisms. The longest common substring length typically decreases as more strings are added, since the substring must satisfy the consecutive matching condition in each one.
Consider three strings: S1 = "BANANA", S2 = "BAYAN", and S3 = "BABA". The longest common substring to all three is "BA", with length 2. To illustrate the distinction from pairwise comparisons, note that the pairwise longest common substring lengths are also 2: "BA" or "AN" between S1 and S2, "BA" between S1 and S3, and "BA" between S2 and S3. However, no substring of length 3 is shared by all three, highlighting how the multi-string case requires stricter intersection.
A similar example arises in DNA sequence analysis, where short strings represent segments of genetic material. For S1 = "ABABC", S2 = "BABCA", and S3 = "ABCBA" (using letters analogous to nucleotide bases A, B, C for simplicity), the longest common substring is "ABC", with length 3. Here, pairwise longest common substrings can be longer—for instance, "BABC" (length 4) between S1 and S2—but the overall longest common substring shortens to 3 when intersecting with S3. Such patterns help identify evolutionarily conserved regions across multiple genomes.
Adding more strings often reduces the longest common substring length, as the probability of a long contiguous match across all of them decreases. For manual computation on small sets like the above, one approach is to generate all possible substrings of the shortest string (e.g., enumerate lengths from 1 upward in S3 = "BABA"), then check for exact consecutive presence in the others using string search (e.g., verify that "BA" appears contiguously in S1 and S2). This brute-force method takes time polynomial in the string lengths (quadratically many candidate substrings, each checked against every string), far slower than dedicated algorithms, but it verifies the result transparently for illustration.
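The brute-force procedure just described can be written out as follows (an illustrative Python sketch; the function name is ours):

```python
def lcs_multi(strings):
    """Longest substring common to every string in `strings` (brute force).

    Enumerates substrings of the shortest string from longest to
    shortest and returns the first one present in all the strings.
    """
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

# The two examples from the text:
assert lcs_multi(["BANANA", "BAYAN", "BABA"]) == "BA"   # length 2
assert lcs_multi(["ABABC", "BABCA", "ABCBA"]) == "ABC"  # length 3
```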
The following table compares pairwise longest common substring lengths against the overall for the DNA-analogous example:
| String Pair | LCS Length |
|---|---|
| S1 and S2 | 4 |
| S1 and S3 | 3 |
| S2 and S3 | 3 |
| All three (S1, S2, S3) | 3 |
This demonstrates how pairwise maxima exceed the global intersection in multi-string scenarios.[8]
Algorithms
Dynamic Programming Algorithm
The dynamic programming approach for finding the longest common substring (LCS) between two strings S and T of lengths m and n respectively utilizes a two-dimensional table to compute the lengths of common substrings ending at each pair of positions. The entry dp[i][j] represents the length of the longest common substring ending at position i in S and position j in T, where indices are 1-based for clarity.[9]
The recurrence relation for filling the table is defined as follows: if S[i] = T[j], then dp[i][j] = dp[i-1][j-1] + 1; otherwise, dp[i][j] = 0. This ensures that only contiguous matches are extended, distinguishing it from the longest common subsequence problem. The base cases are dp[i][0] = 0 for all i and dp[0][j] = 0 for all j, as no substring can end before the start of either string. The maximum value in the table gives the length of the LCS.[9]
To retrieve the actual substring, backtracking starts from the cell with the maximum value and traces diagonally where characters match until reaching a zero. The following pseudocode illustrates the table construction and backtracking process:
function LCSLengthAndSubstring(S, T):
    m = length(S)
    n = length(T)
    dp = array of size (m+1) x (n+1), initialized to 0
    max_length = 0
    end_i = 0
    end_j = 0
    for i from 1 to m:
        for j from 1 to n:
            if S[i-1] == T[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
                if dp[i][j] > max_length:
                    max_length = dp[i][j]
                    end_i = i
                    end_j = j
            else:
                dp[i][j] = 0
    // Backtrack to extract the substring from S
    substring = empty string
    i = end_i
    j = end_j
    while i > 0 and j > 0 and dp[i][j] > 0:
        substring = S[i-1] + substring
        i -= 1
        j -= 1
    return max_length, substring
This algorithm runs in O(mn) time due to the nested loops filling the table.[9]
Space usage can be optimized from O(mn) to O(\min(m, n)) by maintaining only the previous row of the table, as each cell depends solely on the prior row and column. This is achieved by using two one-dimensional arrays or a single array updated in place, making it practical for longer strings where one dimension is smaller.[9]
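A sketch of this row-optimized dynamic program in Python (illustrative; variable names are ours):

```python
def lcs_dp(s, t):
    """Return (length, substring) of a longest common substring of s and t.

    Keeps only the previous row of the DP table, so extra space is
    O(min(len(s), len(t))) once the shorter string indexes the columns.
    """
    if len(t) > len(s):
        s, t = t, s  # let t (the columns) be the shorter string
    prev = [0] * (len(t) + 1)
    max_len, end_i = 0, 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the diagonal match
                if curr[j] > max_len:
                    max_len, end_i = curr[j], i
        prev = curr
    return max_len, s[end_i - max_len:end_i]

assert lcs_dp("ABAB", "BABA") == (3, "ABA")
assert lcs_dp("abcdxyz", "xyzabcd") == (4, "abcd")
```

Note that with only one row retained, full-table backtracking is unavailable; recording the end position of the best match instead lets the substring be sliced out directly.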
Suffix Tree Algorithm
The suffix tree algorithm for finding the longest common substring (LCS) between two strings S and T relies on constructing a generalized suffix tree for the concatenated string S\#T\$, where \# and \$ are distinct terminal symbols not appearing in S or T. This ensures that suffixes from S and T are uniquely identifiable: leaves corresponding to suffixes of S terminate at \#, and those of T at \$.[10]
The generalized suffix tree is built in linear time, specifically O(m + n) where m = |S| and n = |T|, using Ukkonen's on-line construction algorithm, which incrementally adds characters while maintaining the tree structure.[11] Once constructed, the LCS is found by traversing the tree to identify the deepest internal node whose subtree contains leaves from both S and T. The string depth (length of the path label from the root to that node) gives the LCS length, and the path label itself is the substring; this traversal visits each node once in O(m + n) time.[10]
To compute the deepest such node, annotate each node during a post-order traversal by checking the origins of its leaf descendants. A node qualifies if its subtree includes at least one leaf from S and one from T, and among all qualifying nodes, the one with maximum string depth is selected. The following pseudocode outlines this process:
function FindLCS(Tree tree, string S, string T):
    max_depth = 0
    lcs_node = null
    function annotate(node v):
        if v is leaf:
            if v.suffix starts in S:
                v.has_S = true
                v.has_T = false
            else:
                v.has_S = false
                v.has_T = true
            return
        v.has_S = false
        v.has_T = false
        for child in v.children:
            annotate(child)
            if child.has_S:
                v.has_S = true
            if child.has_T:
                v.has_T = true
        if v.has_S and v.has_T:
            // update the enclosing max_depth and lcs_node
            current_depth = string_depth(v)
            if current_depth > max_depth:
                max_depth = current_depth
                lcs_node = v
    annotate(root of tree)
    return path_label(lcs_node)
This approach achieves an overall linear time complexity of O(m + n).[10]
For the LCS among k > 2 strings, the method extends by concatenating all strings with distinct terminal separators (e.g., S_1 \#_1 S_2 \#_2 \dots S_k \#_k) and building a generalized suffix tree, where each leaf is marked with its originating string identifier. The deepest node is then the one whose subtree contains leaves from all k strings, found via a similar bottom-up annotation to compute the number of distinct identifiers C(v) in each subtree, selecting the maximum-depth node with C(v) = k. This maintains O(N) time and space, where N is the total length of the k strings.[8]
The primary advantage of the suffix tree algorithm is its linear-time performance, which preprocesses the strings once for multiple queries or extensions like finding all occurrences of the LCS, outperforming quadratic methods in large-scale applications.[8]
Suffix Array Algorithm
The suffix array algorithm provides an efficient method for computing the longest common substring (LCS) between two or more strings by leveraging a sorted array of all suffixes from a concatenated input, avoiding the hierarchical structure of suffix trees. To apply this approach, the input strings are first concatenated into a single string S of length m + n + 1 (for two strings of lengths m and n), separated by a unique sentinel character not present in the alphabet to distinguish suffixes from different original strings. The suffix array SA is then constructed by sorting all suffixes of S in lexicographical order, which can be achieved in O((m+n) \log (m+n)) time using a comparison-based sorting algorithm that compares suffixes efficiently.[12][13]
Once the suffix array is built, the longest common prefix (LCP) array is computed, where LCP[i] stores the length of the longest common prefix between the suffixes at positions SA[i-1] and SA[i] in the sorted order. This LCP array enables the identification of the LCS without explicit suffix comparisons. It can be constructed in linear time O(m+n) using Kasai's algorithm, which iterates through the suffixes in original text order while using the suffix array to track previous matches and extend prefixes incrementally.[14]
To find the LCS, the algorithm scans the LCP array for the maximum value LCP[i] such that the suffixes starting at SA[i-1] and SA[i] originate from different original strings (i.e., they straddle the sentinel separator). Either of those two suffix positions gives a starting index of the LCS in the concatenated string, and the LCP value gives its length. This scanning step takes O(m+n) time, making the overall LCS computation dominated by the suffix array construction.[15]
Kasai's algorithm for LCP computation proceeds as follows (pseudocode adapted for clarity):
function KasaiLCP(S, SA, n):
    rank = array of size n  # inverse of SA
    for i = 0 to n-1:
        rank[SA[i]] = i
    lcp = array of size n, initialized to 0
    h = 0
    for i = 0 to n-1:
        if rank[i] > 0:
            j = SA[rank[i] - 1]
            while i + h < n and j + h < n and S[i + h] == S[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0  # no predecessor in sorted order; restart matching
    return lcp
This implementation ensures each character is compared at most once across iterations, achieving the linear time bound.[14]
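Putting the steps together, the whole pipeline can be sketched in Python (illustrative; suffix sorting is done naively by comparison for brevity, so construction is slower than the bounds quoted above, and '#' is assumed absent from both inputs):

```python
def lcs_suffix_array(s, t):
    """Longest common substring of s and t via suffix array + LCP array."""
    text = s + "#" + t  # '#' is a sentinel assumed absent from s and t
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix sort
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai's algorithm for the LCP array
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # Scan for the largest LCP whose two suffixes straddle the sentinel.
    # The sentinel suffix itself has LCP 0 with all neighbors, so it never wins.
    best_len, best_start = 0, 0
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        if (a < len(s)) != (b < len(s)) and lcp[r] > best_len:
            best_len, best_start = lcp[r], a
    return text[best_start:best_start + best_len]

assert lcs_suffix_array("abcdxyz", "xyzabcd") == "abcd"
```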
For the extension to multiple strings, the input strings are concatenated with distinct sentinel separators (e.g., characters of decreasing values not in the alphabet), forming a generalized suffix array. The LCP array is then scanned for the maximum prefix length over windows of suffixes that together cover all k original strings, often tracked via color arrays or by grouping consecutive suffixes by origin during the scan. This post-construction scan runs in time linear in the total input length N, with the separators ensuring that no common prefix spans a string boundary.[16]
Compared to suffix trees, which offer linear-time construction and queries but demand more complex implementation and higher space for pointers, the suffix array approach is simpler and more space-efficient (using only integer arrays of size O(m+n)), though its construction time is typically higher unless linear-time builders like DC3 are employed.[12]
Analysis and Applications
Computational Complexity
The dynamic programming algorithm for finding the longest common substring between two strings of lengths m and n exhibits a time complexity of O(mn), as it fills a two-dimensional table where each entry depends on previous computations to track matching suffix lengths. The space complexity in this basic implementation is also O(mn), corresponding to the full table storage. However, space can be optimized to O(\min(m, n)) by maintaining only the current and previous rows of the table during computation, reducing memory usage while preserving the time bound.
In contrast, the suffix tree approach constructs a generalized suffix tree for the concatenated strings (with distinct terminators), enabling identification of the deepest node with leaves from both strings in linear time. This yields an overall time and space complexity of O(m + n), leveraging the linear-time construction of suffix trees via Ukkonen's algorithm. The suffix array method, which sorts suffixes implicitly, requires O((m + n) \log (m + n)) time for construction using standard sorting techniques, followed by an O(m + n) traversal using the longest common prefix array to find the maximum matching depth between suffixes from each string.[16]
A fundamental lower bound for any longest common substring algorithm is \Omega(m + n) in the comparison-based model, as the entire input must be examined to verify potential matches. More advanced conditional lower bounds, based on the strong exponential time hypothesis (SETH) and the hardness of sorting permutations or strings, imply that no significantly subquadratic algorithm exists for related variants, underscoring the near-optimality of linear-time methods like suffix trees. For the generalization to k strings, a naive extension of dynamic programming incurs exponential time complexity in k, specifically O(n^k) for strings of equal length n, due to the multi-dimensional table. However, generalized suffix trees or suffix arrays reduce this to O(N) time, where N is the total input length, by building a single structure over all strings and seeking the deepest multi-labeled node.[17][18][8]
In practice, the dynamic programming method is often preferred for short strings (e.g., lengths up to a few thousand) due to its straightforward implementation and low constant factors, despite the quadratic scaling. For longer strings, suffix tree or suffix array approaches become advantageous, as their linear or near-linear complexities dominate, particularly in scenarios with large alphabets or genomic-scale data where memory trade-offs are manageable.
Practical Applications
In bioinformatics, the longest common substring (LCS) problem is employed to identify conserved regions in DNA sequences, facilitating gene sequence alignment by detecting contiguous matching segments that indicate functional similarities across species.[19] For instance, in analyzing microbial genomes, LCS algorithms help pinpoint shared genetic elements by finding the longest identical contiguous sequences between bacterial strains.[20]
In text processing, LCS serves as a core component in plagiarism detection systems, where it identifies extended copied passages in documents or source code by computing the longest contiguous matching strings between suspicious and reference texts. Similarly, version comparison tools, such as those in revision control systems like RCS, repeatedly locate long common contiguous blocks between document revisions to minimize the delta representation, enabling efficient storage and display of changes.[21]
For data compression, LCS underpins dictionary-based methods like LZ77, which scans input streams to locate the longest backward-matching substring, replacing it with a reference to reduce redundancy and achieve lossless compression ratios, particularly effective for repetitive data such as text files or genomic sequences.
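The longest-match search at the heart of LZ77 can be sketched as follows (illustrative Python; real compressors bound the window, e.g. 32 KB in DEFLATE, and use hash chains rather than this quadratic scan):

```python
def longest_backward_match(data, pos, window=32):
    """Find the longest match for data[pos:] starting in the last
    `window` characters before pos. Returns (distance, length)."""
    best_dist, best_len = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        # LZ77 permits the match to run into the lookahead (overlap),
        # so we only require cand < pos, not cand + length < pos.
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len

assert longest_backward_match("abcabcabc", 3) == (3, 6)  # overlapping match
assert longest_backward_match("abcdef", 3) == (0, 0)     # no match
```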
In software engineering, the longest common substring supports code clone detection by identifying duplicated contiguous code fragments, aiding in refactoring and maintenance.[22]
Standard libraries incorporate LCS variants for these purposes; for example, Python's difflib module uses a longest matching block algorithm—essentially an LCS computation—to generate sequence differences and similarity ratios, supporting applications from diff utilities to approximate string matching.[23]
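For instance, a longest matching block between two strings can be obtained directly from difflib:

```python
from difflib import SequenceMatcher

a, b = "ABAB", "BABA"
# find_longest_match returns a Match(a, b, size) named tuple giving the
# start indices in each string and the length of the matching block.
m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
print(a[m.a:m.a + m.size])  # a length-3 block such as "ABA"
```

Note that SequenceMatcher applies an "autojunk" heuristic to long inputs, so its result on large texts may differ from a strict longest-common-substring computation.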