Longest common subsequence
In computer science and related fields, the longest common subsequence (LCS) of two sequences is the longest sequence that can be derived from both by deleting some elements without changing the order of the remaining elements; the retained elements need not be contiguous in the original sequences.[1] For example, given the sequences "ABAZDC" and "BACBAD", one possible LCS is "ABAD" of length 4, preserving the relative order of the matched characters while allowing gaps between them.[2] The problem is a fundamental example of dynamic programming, which computes the LCS length, and the subsequence itself, in O(mn) time for sequences of lengths m and n.[3]
The standard algorithm for LCS employs a dynamic programming table C of size (m+1) by (n+1), where C[i,j] represents the length of the LCS of the first i elements of the first sequence and the first j elements of the second sequence.[2] The recurrence relation is: if the i-th element of the first sequence equals the j-th element of the second, then C[i,j] = C[i-1,j-1] + 1; otherwise, C[i,j] = max(C[i-1,j], C[i,j-1]); the base cases set C[i,0] = 0 and C[0,j] = 0 for all i and j.[1] This bottom-up approach fills the table in O(mn) time and space; the subsequence is recovered via traceback through the full table, while the space for the length computation alone can be reduced to O(min(m,n)).[3] The extension to more than two sequences is NP-hard, underscoring the efficiency of the pairwise case.[2]
LCS has significant applications in bioinformatics, where it facilitates the alignment of DNA or protein sequences to identify similarities, such as conserved regions in nucleic acids.[4] It also underpins tools for document comparison, like the Unix "diff" utility, which computes minimal edits related to LCS differences, and spell-checking systems that detect similar words.[5] Additionally, LCS algorithms support plagiarism detection by measuring textual similarity and contribute to computational genomics for evolutionary analysis.[6]
Definition and Basics
In computer science, the longest common subsequence (LCS) of two or more sequences is defined as the longest sequence that can be derived from each of them by deleting some or no elements without altering the order of the remaining elements.[7] This subsequence must appear in the same relative order in all input sequences but is not required to be contiguous.[7]
Formally, for two finite sequences X = (x_1, x_2, \dots, x_m) and Y = (y_1, y_2, \dots, y_n) over some alphabet, a sequence Z = (z_1, z_2, \dots, z_k) is a common subsequence if there exist strictly increasing indices 1 \leq i_1 < i_2 < \dots < i_k \leq m and 1 \leq j_1 < j_2 < \dots < j_k \leq n such that z_t = x_{i_t} = y_{j_t} for each 1 \leq t \leq k.[7] The LCS is then any such Z of maximum length k, where 0 \leq k \leq \min(m, n); no common subsequence of greater length exists.[7] This formulation generalizes to sets of more than two sequences, where Z must be a subsequence of every input sequence.[8]
Consider the sequences X = "ABCBDAB" (length 7) and Y = "BDCABA" (length 6). One LCS is "BCBA" (length 4), obtained from positions 2, 3, 4, and 6 in X, and positions 1, 3, 5, and 6 in Y. Here, the matching elements are non-contiguous in both sequences—for instance, the "B" and "C" in "BCBA" are adjacent in X but separated by "D" in Y—highlighting that an LCS differs from a longest common substring, which demands consecutive positions within each original sequence.
Key Properties
The longest common subsequence (LCS) problem exhibits optimal substructure, a fundamental property that allows the solution for longer sequences to be constructed from solutions to shorter prefix subproblems. Specifically, consider two sequences X = \langle x_1, x_2, \dots, x_m \rangle and Y = \langle y_1, y_2, \dots, y_n \rangle, and let Z = \langle z_1, z_2, \dots, z_k \rangle be any LCS of X and Y. If x_m = y_n, then z_k = x_m = y_n and Z_{1..k-1} (the prefix of Z excluding the last element) is an LCS of the prefixes X_{1..m-1} and Y_{1..n-1}. In this case, the LCS of X_{1..m} and Y_{1..n} is formed by appending the matching character to the LCS of the shorter prefixes. Conversely, if x_m \neq y_n, then either z_k \neq x_m, in which case Z is an LCS of X_{1..m-1} and Y_{1..n}, or z_k \neq y_n, in which case Z is an LCS of X_{1..m} and Y_{1..n-1}. This property ensures that the LCS length for the full sequences can be derived by considering whether the last elements match and recursively applying the solution to reduced instances.
A direct consequence of optimal substructure is the non-decreasing nature of LCS lengths as sequence prefixes extend. For any prefixes X_{1..i} and Y_{1..j}, the LCS length satisfies \text{LCS}(X_{1..i}, Y_{1..j}) \geq \text{LCS}(X_{1..i-1}, Y_{1..j}) and \text{LCS}(X_{1..i}, Y_{1..j}) \geq \text{LCS}(X_{1..i}, Y_{1..j-1}), because any common subsequence of the shorter prefixes remains a common subsequence of the longer ones, and extending the sequences cannot decrease the maximum possible length. Equality holds if the added character does not contribute to a longer common subsequence, such as when it has no matching counterpart in the opposing prefix. This monotonicity implies that \text{LCS}(X_{1..i}, Y_{1..j}) \leq \text{LCS}(X, Y) for all i \leq m and j \leq n, as the full sequences encompass all possible prefix combinations. These relations underpin the bounded growth of LCS lengths and facilitate efficient computation by ensuring progressive increases in solution quality.
The LCS is also maximal in a strict sense: no additional matching element can be inserted into it without violating the increasing index order required of subsequences, since any valid extension would produce a common subsequence longer than the proven maximum. The converse does not hold, however: a common subsequence can be unextendable yet shorter than the LCS. For X = "ABC" and Y = "CAB", the common subsequence "C" admits no extension, while the LCS "AB" has length 2. The LCS thus represents the longest alignment achievable under the subsequence constraints, distinguishing it from other maximal common subsequences that fall short of this length.
Dynamic Programming Solution
Prefix Table Construction
In the dynamic programming approach to the longest common subsequence (LCS) problem, prefix table construction begins by defining prefixes of the input sequences. For sequences X = x_1 x_2 \dots x_m and Y = y_1 y_2 \dots y_n, the prefix X[1..i] consists of the first i elements x_1 x_2 \dots x_i where 0 \leq i \leq m, and similarly Y[1..j] = y_1 y_2 \dots y_j for 0 \leq j \leq n.
A two-dimensional table C of dimensions (m+1) \times (n+1) is initialized, where each entry C[i][j] stores the length of an LCS of the prefixes X[1..i] and Y[1..j].
The base cases are established as C[0][j] = 0 for all 0 \leq j \leq n and C[i][0] = 0 for all 0 \leq i \leq m, reflecting that any prefix paired with an empty sequence yields an LCS of length zero.
The table is then filled iteratively, typically proceeding row-by-row from i = 1 to m, and within each row, column-by-column from j = 1 to n; this process leverages the monotonic non-decreasing property of LCS lengths with respect to prefix extensions to ensure consistent updates across cells.
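A minimal sketch of this setup and fill order in Python, assuming the inputs are strings and using 0-based indexing (so x[i-1] is the i-th element); the cell update rule is the recurrence developed in the next subsection:

```python
def build_lcs_table(x, y):
    m, n = len(x), len(y)
    # (m+1) x (n+1) table; row 0 and column 0 hold the base cases C[i][0] = C[0][j] = 0
    c = [[0] * (n + 1) for _ in range(m + 1)]
    # Row-major fill: each cell depends only on cells above and to its left.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c
```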
Recurrence Relation
The recurrence relation for the longest common subsequence (LCS) problem is derived from the key properties of subsequences, specifically the cases where the last elements of the prefixes either match or do not. Consider two sequences X = x_1 x_2 \dots x_m and Y = y_1 y_2 \dots y_n, and let C[i][j] denote the length of an LCS of the prefixes X[1..i] and Y[1..j]. If x_i = y_j, then an optimal LCS for these prefixes can be formed by taking an LCS of the prefixes X[1..i-1] and Y[1..j-1] and appending the matching pair, yielding C[i][j] = C[i-1][j-1] + 1. If x_i \neq y_j, then the optimal LCS must omit either x_i or y_j, so C[i][j] is the maximum of the LCS lengths obtained by skipping one or the other: C[i][j] = \max(C[i-1][j], C[i][j-1]).
This leads to the full recurrence relation:

C[i][j] =
\begin{cases}
C[i-1][j-1] + 1 & \text{if } x_i = y_j \\
\max(C[i-1][j], C[i][j-1]) & \text{otherwise}
\end{cases}

with base cases C[0][j] = 0 for all j = 0, \dots, n and C[i][0] = 0 for all i = 0, \dots, m, reflecting that the LCS of an empty prefix with any prefix is empty.
The correctness of this recurrence follows from the optimal substructure property of the LCS problem: an optimal solution to the overall problem contains optimal solutions to its subproblems. To sketch the proof by induction on i + j: for the base cases where i = 0 or j = 0, the relation holds trivially as C[0][j] = C[i][0] = 0. Assume it holds for all smaller prefixes (induction hypothesis). For C[i][j], if x_i = y_j, any common subsequence longer than C[i-1][j-1] + 1 would, after removing its final matched character, yield a common subsequence of X[1..i-1] and Y[1..j-1] longer than C[i-1][j-1], contradicting the optimality of that subproblem. If x_i \neq y_j, an LCS cannot end by matching x_i with y_j, so it omits at least one of them and is therefore an LCS of X[1..i-1] and Y[1..j] or of X[1..i] and Y[1..j-1], both covered by the max. Thus, the recurrence computes the true LCS length.
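The recurrence transcribes directly into a top-down memoized function. The following Python sketch mirrors the cases above; the function name is illustrative, x[i-1] is the i-th element under 0-based indexing, and very long inputs may require raising Python's recursion limit:

```python
from functools import lru_cache

def lcs_length_recursive(x, y):
    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 or j == 0:                  # base cases: an empty prefix
            return 0
        if x[i - 1] == y[j - 1]:              # last elements match: extend the diagonal subproblem
            return c(i - 1, j - 1) + 1
        return max(c(i - 1, j), c(i, j - 1))  # otherwise skip one last element or the other
    return c(len(x), len(y))

print(lcs_length_recursive("ABCBDAB", "BDCAB"))  # -> 4
```

Memoization caches each (i, j) pair once, so this evaluates the same O(mn) subproblems as the bottom-up table fill.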
Worked Example for Two Sequences
To illustrate the dynamic programming solution for computing the length of the longest common subsequence (LCS), consider the sequences X = \text{ABCBDAB} (of length m=7) and Y = \text{BDCAB} (of length n=5). The LCS length for these sequences is 4; one LCS achieving it is "BCAB".
The process begins by constructing an (m+1) \times (n+1) table C, where C[i][j] denotes the LCS length for the prefixes X[1..i] and Y[1..j]. All boundary values are initialized to 0: C[0][j] = 0 for j = 0 to n, and C[i][0] = 0 for i = 0 to m. The table is filled row by row using the recurrence relation, incrementing by 1 when characters match and otherwise taking the maximum from the adjacent cells above or to the left.
For the first row (i=1, x_1 = \text{A}):
- At j=1 (y_1 = \text{B}): \text{A} \neq \text{B}, so C[1][1] = \max(C[0][1], C[1][0]) = \max(0, 0) = 0.
- At j=2 (y_2 = \text{D}): \text{A} \neq \text{D}, so C[1][2] = \max(C[0][2], C[1][1]) = 0.
- At j=3 (y_3 = \text{C}): \text{A} \neq \text{C}, so C[1][3] = 0.
- At j=4 (y_4 = \text{A}): \text{A} = \text{A}, so C[1][4] = C[0][3] + 1 = 1.
- At j=5 (y_5 = \text{B}): \text{A} \neq \text{B}, so C[1][5] = \max(C[0][5], C[1][4]) = \max(0, 1) = 1.
This row reflects the single match at position 4, propagating the value rightward where no further matches occur.
For the second row (i=2, x_2 = \text{B}):
- At j=1: \text{B} = \text{B}, so C[2][1] = C[1][0] + 1 = 1.
- At j=2: \text{B} \neq \text{D}, so C[2][2] = \max(C[1][2], C[2][1]) = \max(0, 1) = 1.
- At j=3: \text{B} \neq \text{C}, so C[2][3] = \max(C[1][3], C[2][2]) = 1.
- At j=4: \text{B} \neq \text{A}, so C[2][4] = \max(C[1][4], C[2][3]) = \max(1, 1) = 1.
- At j=5: \text{B} = \text{B}, so C[2][5] = C[1][4] + 1 = 2.
Here, matches at columns 1 and 5 build on prior values, with the maximum propagating horizontally and vertically in non-match cases.
Continuing this process for subsequent rows:
- Third row (i=3, x_3 = \text{C}): no match at j=1 or j=2 (values 1); the match at j=3 yields C[3][3] = C[2][2] + 1 = 2; non-matches at j=4 and j=5 take maxima, resulting in 2.
- Fourth row (i=4, x_4 = \text{B}): the match at j=1 gives 1; non-matches give 1 at j=2 and 2 at j=3 and j=4; the match at j=5 gives C[4][5] = C[3][4] + 1 = 3.
- Fifth row (i=5, x_5 = \text{D}): non-match at j=1 (1); the match at j=2 gives C[5][2] = C[4][1] + 1 = 2; non-matches yield 2, 2, and 3 at j=3 through j=5.
- Sixth row (i=6, x_6 = \text{A}): non-matches propagate 1, 2, 2; the match at j=4 gives C[6][4] = C[5][3] + 1 = 3; the non-match at j=5 yields 3.
- Seventh row (i=7, x_7 = \text{B}): the match at j=1 gives 1; non-matches yield 2 at j=2, 2 at j=3, and 3 at j=4; the match at j=5 gives C[7][5] = C[6][4] + 1 = 4.
Each decision leverages the recurrence: matches extend the diagonal predecessor by 1, while non-matches take the larger neighboring value, preserving the optimal substructure. The bottom-right cell C[7][5] = 4 gives the LCS length.
The completed table is as follows:
|         | ∅ | B | D | C | A | B |
|---------|---|---|---|---|---|---|
| ∅       | 0 | 0 | 0 | 0 | 0 | 0 |
| A       | 0 | 0 | 0 | 0 | **1** | 1 |
| AB      | 0 | **1** | 1 | 1 | 1 | **2** |
| ABC     | 0 | 1 | 1 | **2** | 2 | 2 |
| ABCB    | 0 | **1** | 1 | 2 | 2 | **3** |
| ABCBD   | 0 | 1 | **2** | 2 | 2 | 3 |
| ABCBDA  | 0 | 1 | 2 | 2 | **3** | 3 |
| ABCBDAB | 0 | **1** | 2 | 2 | 3 | **4** |
This table highlights key match positions (bolded) that contribute to the accumulating length, demonstrating how the algorithm systematically builds the solution from smaller subproblems.
LCS Length Computation
The dynamic programming approach to compute the length of the longest common subsequence (LCS) between two sequences X = \langle x_1, x_2, \dots, x_m \rangle and Y = \langle y_1, y_2, \dots, y_n \rangle builds upon the prefix table construction and recurrence relation, filling a table C where C[i][j] stores the LCS length for the prefixes X[1..i] and Y[1..j].
The algorithm initializes an (m+1) \times (n+1) table C with zeros along the boundaries: C[i][0] = 0 for i = 0 to m, and C[0][j] = 0 for j = 0 to n. It then iterates over i from 1 to m and j from 1 to n, setting C[i][j] according to the recurrence: if x_i = y_j, then C[i][j] = C[i-1][j-1] + 1; otherwise, C[i][j] = \max(C[i-1][j], C[i][j-1]). The LCS length is finally retrieved as C[m][n].
Here is the pseudocode for this process:
function LCS_LENGTH(X[1..m], Y[1..n]):
    create table C[0..m][0..n]
    for i ← 0 to m:
        C[i][0] ← 0
    for j ← 0 to n:
        C[0][j] ← 0
    for i ← 1 to m:
        for j ← 1 to n:
            if x_i == y_j:
                C[i][j] ← C[i-1][j-1] + 1
            else:
                C[i][j] ← max(C[i-1][j], C[i][j-1])
    return C[m][n]
This implementation runs in O(mn) time, as it requires a single pass to fill the entire table with constant-time operations per cell. The space complexity is also O(mn), dominated by the storage for the table C.
In practice, this algorithm is straightforward to implement in languages such as Python, using nested loops and a 2D list for the table, or in C++, employing a vector of vectors or a raw 2D array for efficiency.
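A direct Python rendering of the pseudocode above, as an illustrative sketch:

```python
def lcs_length(x, y):
    m, n = len(x), len(y)
    # The comprehension zero-fills the table, which covers the base cases.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]

print(lcs_length("ABCBDAB", "BDCAB"))  # -> 4
```

The same nested-loop structure carries over to C++ with a vector of vectors, or with a flat array indexed as i*(n+1)+j for better locality.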
Reconstruction and Variations
Traceback for Sequence Recovery
Once the dynamic programming table C has been constructed to compute the length of the longest common subsequence (LCS) between two sequences X = x_1 x_2 \dots x_m and Y = y_1 y_2 \dots y_n, where C[i][j] denotes the LCS length for prefixes X[1..i] and Y[1..j], the actual subsequence can be recovered via a traceback process starting from C[m][n].[9][10] This process traces backward through the table, building the subsequence in reverse order by following the decisions that led to the optimal length.
The traceback algorithm proceeds iteratively or recursively as follows: begin at position (i, j) = (m, n). While i > 0 and j > 0, check whether x_i = y_j; if so, include x_i (equivalently y_j) in the LCS and move diagonally to (i-1, j-1), since C[i][j] = C[i-1][j-1] + 1 in this case. Otherwise, move to a predecessor cell that preserves the maximum value: if C[i][j] = C[i][j-1], move left to (i, j-1); else move up to (i-1, j). This preference for moving left (rather than up) when both directions yield the same length selects one specific path among potentially multiple optimal paths.[9][10] The resulting subsequence is one valid LCS of length C[m][n], though other LCSs of the same length may exist if the table has ties in predecessor values.[9]
The following pseudocode implements the iterative traceback to construct one LCS, assuming 1-based indexing for the sequences and table (with C[0][*] = 0 and C[*][0] = 0):
LCS = empty list
i = m
j = n
while i > 0 and j > 0:
    if x_i == y_j:
        append x_i to LCS
        i = i - 1
        j = j - 1
    else if C[i][j] == C[i][j-1]:
        j = j - 1
    else:
        i = i - 1
reverse LCS to get the forward subsequence
This runs in O(m + n) time after the O(mn) table construction.[9]
To illustrate, consider the worked example from the LCS length computation section, where the DP table for sequences X and Y yields an LCS length of 4. Starting from the bottom-right cell C[7][5] and following the traceback rules (matching equal characters diagonally and preferring left moves when tied) recovers the subsequence "BDAB": the final 'B' and then 'A' are matched diagonally, the tie at C[5][3] is resolved by moving left, and 'D' and the initial 'B' are then matched before the boundary is reached. Preferring upward moves on ties instead recovers the alternative LCS "BCAB"; both have length 4.[10]
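A Python sketch combining the table fill with the iterative traceback; on the worked example it returns "BDAB" under the left-preferring tie-break described above:

```python
def lcs_traceback(x, y):
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c[i][j] = c[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1] \
                else max(c[i - 1][j], c[i][j - 1])
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:        # diagonal move: character belongs to the LCS
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i][j] == c[i][j - 1]:    # left move preferred on ties
            j -= 1
        else:                           # up move
            i -= 1
    return "".join(reversed(out))

print(lcs_traceback("ABCBDAB", "BDCAB"))  # -> "BDAB"
```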
Finding All LCSs
When multiple longest common subsequences (LCSs) exist between two sequences, identifying all distinct ones requires extending the standard dynamic programming (DP) framework beyond computing a single LCS length or recovering one instance via traceback. The DP table used for LCS length computation can be interpreted as a directed acyclic graph (DAG), where each cell (i, j) represents the prefixes of the input sequences up to positions i and j, and edges correspond to valid backtracking moves: horizontal (skipping a character in the second sequence), vertical (skipping in the first), or diagonal (matching characters). Paths in this graph from the bottom-right cell (m, n) to the top-left (0, 0) that preserve the maximum LCS length correspond exactly to the LCS alignments, as each such path defines a unique sequence of matches. This graph-based view allows systematic exploration of all possibilities while leveraging the O(mn) preprocessing for the length table.
To count the number of LCS alignments (i.e., the number of ways to achieve the maximum LCS length) without enumerating them explicitly, a secondary DP table C is constructed alongside the standard length table L, where C[i][j] stores the number of such alignments for the prefixes ending at i and j. The base cases are C[i][0] = 1 and C[0][j] = 1 for all i and j (counting the empty alignment). For i > 0 and j > 0, initialize C[i][j] = 0. Then, if X_i = Y_j and L[i][j] = L[i-1][j-1] + 1, add C[i-1][j-1] to C[i][j]. If L[i-1][j] = L[i][j], add C[i-1][j]. If L[i][j-1] = L[i][j], add C[i][j-1]. This sums over all optimal predecessor directions and ensures C[m][n] gives the total number of LCS alignments in O(mn) time and space.
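A sketch of this counting scheme in Python; the boundary counts stay 1 because an empty prefix admits exactly one (empty) alignment:

```python
def count_lcs_alignments(x, y):
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]   # LCS lengths
    C = [[1] * (n + 1) for _ in range(m + 1)]   # alignment (traceback-path) counts
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = x[i - 1] == y[j - 1]
            L[i][j] = L[i - 1][j - 1] + 1 if match else max(L[i - 1][j], L[i][j - 1])
            cnt = 0
            if match and L[i][j] == L[i - 1][j - 1] + 1:
                cnt += C[i - 1][j - 1]          # diagonal (match) edge
            if L[i][j] == L[i - 1][j]:
                cnt += C[i - 1][j]              # vertical edge: skip x_i
            if L[i][j] == L[i][j - 1]:
                cnt += C[i][j - 1]              # horizontal edge: skip y_j
            C[i][j] = cnt
    return C[m][n]
```

Note that this counts optimal traceback paths; the number of distinct LCS strings can be smaller, since different paths may spell the same subsequence.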
For explicit enumeration of all LCS alignments, depth-first search (DFS) or breadth-first search (BFS) can traverse the DAG defined by the length table, starting from (m, n) and following only length-preserving edges while building the subsequence via diagonal matches. To obtain distinct LCS strings (accounting for possible duplicates from repeated characters), the traversal can use set structures to track unique outputs; alternatively, the graph can be refined during preprocessing to merge equivalent paths using adjacency lists and contour representations. This yields an output-sensitive algorithm that lists all unique LCSs in time linear in the output size plus O(mn) preprocessing, extending the single-traceback method by exploring all viable branches rather than one. A simple implementation constructs adjacency lists row-by-row, ensuring no redundant paths.
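A Python sketch that enumerates the distinct LCS strings by memoized recursion over the length table, following only length-preserving edges; at a match only the diagonal is followed, which is safe because every LCS of such prefixes must end with the matched character:

```python
from functools import lru_cache

def all_lcs(x, y):
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            L[i][j] = L[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1] \
                else max(L[i - 1][j], L[i][j - 1])

    @lru_cache(maxsize=None)
    def back(i, j):
        # set of distinct LCS strings for the prefixes x[:i] and y[:j]
        if i == 0 or j == 0:
            return {""}
        if x[i - 1] == y[j - 1]:
            return {s + x[i - 1] for s in back(i - 1, j - 1)}
        out = set()
        if L[i - 1][j] == L[i][j]:      # upward edge preserves the optimum
            out |= back(i - 1, j)
        if L[i][j - 1] == L[i][j]:      # leftward edge preserves the optimum
            out |= back(i, j - 1)
        return out

    return back(m, n)

print(sorted(all_lcs("ABCBDAB", "BDCAB")))  # -> ['BCAB', 'BDAB']
```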
The time complexity for both counting and enumeration is O(mn) for preprocessing, but enumeration can require exponential time in the worst case, as the number of LCS alignments is bounded above by the binomial coefficient \binom{m+n}{m} but achieves \Theta(2^{\min(m,n)/2}) in adversarial inputs like alternating sequences of two characters.[11] Space can be optimized to O(n) via Hirschberg's technique adapted for the graph. These methods are practical when the number of LCSs is small but scale poorly for cases with many ambiguities.[11]
Diff Computation
The longest common subsequence (LCS) serves as the foundation for computing differences between two sequences, such as text files in version control systems, by identifying the maximal preserved structure and highlighting deviations as insertions or deletions.[12] This approach aligns the sequences along the LCS, inserting gaps in non-matching regions to visualize changes compactly, which is more intuitive than enumerating all possible edits.[12]
The algorithm begins with the dynamic programming table for LCS length, followed by a traceback to recover one LCS alignment.[12] During traceback, matching characters (diagonal moves in the table) are marked as common and unchanged. Horizontal moves indicate deletions from the first sequence (X), while vertical moves signify insertions from the second sequence (Y). Gaps are implicitly handled by skipping non-common elements, and the output formats these as prefixed lines or symbols (e.g., '-' for deletions, '+' for insertions) relative to the LCS backbone.[12] This method, originally implemented in the Unix diff utility, efficiently produces a minimal set of operations by leveraging the LCS to minimize the reported changes.[12]
For example, consider sequences X = "ABCBDAB" and Y = "BDCAB", with an LCS of "BCAB" (length 4). The alignment based on traceback might yield the following diff output, showing insertions (+) and deletions (-) relative to X:
- Delete 'A' (initial in X)
- Common 'B'
- Insert 'D' (from Y)
- Common 'C'
- Delete 'B' (from X)
- Delete 'D' (from X)
- Common 'A'
- Common 'B'
This results in a compact representation like: -A +D -B -D, with common parts omitted for brevity, illustrating how the LCS anchors the stable elements.[12]
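A Python sketch of this LCS-anchored diff; breaking ties toward deletions during traceback reproduces the walkthrough above, with '-' marking deletions from X, '+' insertions from Y, and a leading space marking common elements:

```python
def lcs_diff(x, y):
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c[i][j] = c[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1] \
                else max(c[i - 1][j], c[i][j - 1])
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
            ops.append(" " + x[i - 1])                       # common element
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or c[i - 1][j] >= c[i][j - 1]):
            ops.append("-" + x[i - 1])                       # deletion from x
            i -= 1
        else:
            ops.append("+" + y[j - 1])                       # insertion from y
            j -= 1
    return list(reversed(ops))

print(lcs_diff("ABCBDAB", "BDCAB"))
# -> ['-A', ' B', '+D', ' C', '-B', '-D', ' A', ' B']
```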
The LCS length also relates directly to the minimum number of insertions and deletions needed to transform one sequence into the other, given by |X| + |Y| - 2 \times LCS length, which equals the edit distance under an insertion-deletion-only model (a restricted form of Levenshtein distance); for the example above, 7 + 5 - 2 \times 4 = 4 operations (three deletions and one insertion).[13]
Complexity Analysis
Time and Space Complexity
The standard dynamic programming algorithm for computing the length of the longest common subsequence (LCS) between two sequences of lengths m and n requires O(mn) time. This complexity arises from filling a (m+1) \times (n+1) table where each of the mn cells is computed in constant time using the recurrence relation based on matching prefixes.[14]
In terms of space complexity, the algorithm uses O(mn) space to store the dynamic programming table, which captures the solutions to all distinct subproblems defined by the prefixes of the input sequences.[15]
The time complexity of O(mn) is tight for the standard approach, applying uniformly in both the best and worst cases, as the algorithm always processes every cell in the table irrespective of the input sequences' similarity. While space usage remains O(mn) in the basic implementation, it can be reduced to O(\min(m,n)) through careful storage of only necessary rows or diagonals without affecting the time bound.[14]
Under the Strong Exponential Time Hypothesis (SETH), no algorithm exists for LCS that runs in O(n^{2-\epsilon}) time for any constant \epsilon > 0, even on binary alphabets; this conditional lower bound of \Omega(n^{2}/\mathrm{polylog}\, n) matches the O(n^2 / \log n) upper bound achieved via the method of Four Russians for constant-sized alphabets.[16][17]
Optimizations for Space and Time
The standard dynamic programming approach for computing the length of the longest common subsequence (LCS) of two sequences of lengths m and n (with m \leq n) requires O(mn) space to store the entire table. However, this can be optimized to O(m) space by retaining only the previous and current rows during computation, as each entry in the current row depends solely on the immediately preceding row and the current row's left neighbor. This technique discards completed rows after use, making it suitable for memory-constrained environments while preserving the O(mn) time complexity.[18]
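A sketch of the two-row scheme in Python; swapping the inputs first keeps each row sized by the shorter sequence, for O(\min(m, n)) space:

```python
def lcs_length_linear_space(x, y):
    if len(y) > len(x):
        x, y = y, x                      # rows now have min(m, n) + 1 entries
    prev = [0] * (len(y) + 1)            # row i-1 of the table
    for xi in x:
        curr = [0] * (len(y) + 1)        # row i, with curr[0] = 0 as base case
        for j, yj in enumerate(y, start=1):
            curr[j] = prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1])
        prev = curr                      # completed rows are discarded
    return prev[-1]

print(lcs_length_linear_space("ABCBDAB", "BDCAB"))  # -> 4
```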
For recovering the actual LCS sequence (not just its length), Hirschberg's algorithm achieves linear O(n) space through a divide-and-conquer strategy. It recursively splits one sequence at its midpoint, computes LCS lengths from the halves to the other sequence using the two-row method, and identifies the optimal split point to balance the recursion tree, ensuring the full sequence can be constructed without storing the entire table. This approach maintains O(mn) time overall and is particularly effective for long sequences where full reconstruction is needed.
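A compact Python sketch of Hirschberg's divide-and-conquer, built on the two-row length routine; the helper name _lcs_row is an illustrative choice:

```python
def _lcs_row(x, y):
    # final row of LCS lengths for x versus y, computed in O(len(y)) space
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            curr[j] = prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1])
        prev = curr
    return prev

def hirschberg(x, y):
    if not x:
        return ""
    if len(x) == 1:                         # trivial half: a single character
        return x if x in y else ""
    mid = len(x) // 2
    fwd = _lcs_row(x[:mid], y)              # forward lengths for the first half
    rev = _lcs_row(x[mid:][::-1], y[::-1])  # backward lengths for the second half
    # split y where the combined forward and backward lengths are maximal
    k = max(range(len(y) + 1), key=lambda j: fwd[j] + rev[len(y) - j])
    return hirschberg(x[:mid], y[:k]) + hirschberg(x[mid:], y[k:])

print(hirschberg("ABCBDAB", "BDCAB"))  # one LCS of length 4
```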
Several practical optimizations target time efficiency in the dynamic programming computation. For instance, the Hunt-Szymanski algorithm reduces the number of comparisons by preprocessing the matching character positions into sorted lists and processing only those pairs, skipping non-matching positions entirely; this yields O(r \log n) time where r is the number of matching pairs (r \leq mn), a significant speedup when matches are sparse. Bit-parallelism packs a row of the DP table into machine words and updates w cells per word-level operation, reducing the effective time to O(mn / w) where w is the word width (typically 64 bits); an entire row fits in a single word when the shorter sequence is no longer than w.[14]
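A Python sketch in the Hunt-Szymanski spirit: LCS reduces to a longest strictly increasing subsequence over the match positions, maintained by binary search so that only the r matching pairs are touched:

```python
from bisect import bisect_left
from collections import defaultdict

def lcs_length_hunt_szymanski(x, y):
    positions = defaultdict(list)       # ascending positions of each symbol in y
    for j, symbol in enumerate(y):
        positions[symbol].append(j)
    # tails[k] = smallest y-position ending a common subsequence of length k+1
    tails = []
    for symbol in x:
        # descending order prevents one element of x from matching twice in a pass
        for j in reversed(positions.get(symbol, ())):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)

print(lcs_length_hunt_szymanski("ABCBDAB", "BDCAB"))  # -> 4
```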
Implementation-level tweaks further enhance performance on modern hardware. Filling the DP table in row-major order aligns memory accesses with cache-line granularity, minimizing cache misses and improving locality, which can yield 2-5x speedups in practice without altering the asymptotic complexity.
Advanced theoretical techniques achieve subquadratic time. The method of Four Russians preprocesses all possible small subproblems (blocks of size \Theta(\log n)) into lookup tables, allowing the DP to be computed block-wise with constant-time lookups per block, resulting in O(n^2 / \log n) time for constant-sized alphabets. Under the word RAM model, refinements leverage bit operations to pack larger effective blocks into machine words, improving this toward O(n^2 / \log^2 n) while remaining applicable to bioinformatics and text processing tasks.
Applications and Extensions
Relation to Edit Distance and Other Problems
The Levenshtein distance, also known as the edit distance, between two strings of lengths m and n under unit costs for insertions, deletions, and substitutions, satisfies the relation that it is at most m + n - 2 \times \ell, where \ell is the length of the longest common subsequence (LCS); equality holds when the optimal alignment requires no substitutions, effectively reducing to insertions and deletions only.[19] This connection arises because the LCS identifies the maximum number of exact matches that can be preserved without cost, while the remaining characters require operations to transform one string into the other.[20] The dynamic programming formulation for both problems shares structural similarities, with the LCS computation serving as a subroutine or bound in edit distance algorithms.[21]
The longest increasing subsequence (LIS) problem can be reduced to the LCS problem through a simple transformation: given a sequence S, construct a second sequence T by sorting the elements of S in non-decreasing order (handling duplicates appropriately by position if needed); the length of the LCS between S and T equals the length of the LIS in S.[22] This works because any common subsequence must respect the increasing order in T, ensuring it forms an increasing subsequence in S, and the LCS maximizes its length.[23] Such reductions highlight how LCS serves as a foundational model for ordering-constrained subsequence problems, enabling the application of LCS dynamic programming techniques to LIS with minor adaptations.[24]
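A Python sketch of the reduction; deduplicating the sorted copy yields the strictly increasing variant, while sorting without deduplication gives the non-decreasing one:

```python
def lis_length_via_lcs(seq):
    target = sorted(set(seq))            # sorted, duplicate-free copy of seq
    m, n = len(seq), len(target)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c[i][j] = c[i - 1][j - 1] + 1 if seq[i - 1] == target[j - 1] \
                else max(c[i - 1][j], c[i][j - 1])
    return c[m][n]                       # LCS(seq, target) = LIS length

print(lis_length_via_lcs([3, 1, 4, 1, 5, 9, 2, 6]))  # -> 4 (e.g., 1, 4, 5, 9)
```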
In bioinformatics, the LCS problem models ungapped sequence alignment, where the goal is to identify the longest exact matching subsequence without allowing mismatches or gaps within the matched regions, serving as a core primitive for more complex alignments.[20] This is particularly useful in tools like BLAST for seeding alignments with exact matches before extending to gapped regions, providing an efficient way to quantify similarity in DNA or protein sequences without penalizing internal gaps in the common parts.[25] The LCS length directly corresponds to the score of such alignments under a unit match reward and no gap penalties within matches, facilitating rapid initial comparisons in large genomic datasets.[26]
The LCS problem exhibits connections to the 0/1 knapsack problem through shared dynamic programming paradigms: choosing which matching positions to include in an LCS mirrors the binary item-selection decisions of knapsack, though LCS additionally enforces order preservation, which standard knapsack lacks.[27] In RNA folding and structure prediction, LCS variants are employed to compare secondary or tertiary structures by finding conserved subsequences that indicate functional motifs, aiding in alignment of folded RNA molecules for evolutionary analysis.[28] For instance, methods like LCS-TA extend LCS to torsion-angle-aware comparisons of RNA 3D fragments, identifying similar structural elements across molecules.[29]
Extensions of LCS, such as the weighted LCS (WLCS) where matches carry varying costs based on character weights, are NP-hard even for integer weights bounded by a small constant, contrasting with the polynomial-time solvability of unweighted LCS.[23] This hardness arises from reductions to problems like longest common dense subsequence, establishing tight conditional lower bounds under hypotheses like the Strong Exponential Time Hypothesis (SETH).[16] WLCS finds applications in approximate string matching with non-uniform similarities, underscoring how weighting introduces combinatorial intractability to the otherwise tractable LCS framework.[30]
Longest Palindromic Subsequence
The longest palindromic subsequence (LPS) of a string S of length n is the longest subsequence that forms a palindrome, reading the same forwards and backwards. A key algorithmic reduction computes the length of the LPS by finding the length of the longest common subsequence (LCS) between S and its reverse S^R, since any palindromic subsequence of S must appear as a common subsequence between S and S^R, and vice versa for common subsequences that are palindromic. This equivalence holds because a palindrome in S corresponds to matching positions symmetrically in the reversed string.
To compute the LPS length, apply the standard dynamic programming algorithm for LCS on inputs S and S^R, which fills a table dp where dp[i][j] represents the LCS length for the prefixes S[1..i] and S^R[1..j]. The value dp[n][n] then gives the LPS length directly, with the same O(n^2) time and space complexity as the base LCS problem. Recovering the actual LPS subsequence requires careful traceback in the DP table to ensure the selected common subsequence is palindromic, as not all LCS backtracking paths guarantee this property; suitable paths, such as those following diagonal, up, and left moves in specific orders, always yield a valid palindrome.
For example, consider S = "character". Its reverse is S^R = "retcarahc". One LCS of S and S^R is "carac", of length 5, which is itself a palindrome ("carac" reversed is "carac"). This matches the LPS of S, as no longer palindromic subsequence exists.[31]
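A Python sketch of the reduction, returning only the length; recovering a guaranteed-palindromic subsequence would need the careful traceback noted above:

```python
def lps_length(s):
    t = s[::-1]                          # the reversed string
    n = len(s)
    c = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            c[i][j] = c[i - 1][j - 1] + 1 if s[i - 1] == t[j - 1] \
                else max(c[i - 1][j], c[i][j - 1])
    return c[n][n]                       # LCS(s, reverse(s)) = LPS length

print(lps_length("character"))  # -> 5 ("carac")
```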
LPS computation via this reduction finds applications in bioinformatics, where palindromic subsequences in DNA or protein sequences signal regulatory elements like promoters and introns, aiding gene activity analysis and detection of genomic instabilities linked to cancer.[32] It also supports pattern recognition in sequences for self-repair mechanisms in chromosomes.[32]
An extension is the weighted LPS, where characters or positions carry weights, maximizing the total weight of the palindromic subsequence; this variant uses automata-based representations like palindromic subsequence automata to model and compute solutions efficiently for weighted cases.
Behavior on Random Strings
The behavior of the longest common subsequence (LCS) on random strings has been extensively studied to understand its average-case performance, particularly in contrast to worst-case analyses. For two independent uniform random binary strings of length n, the expected LCS length grows linearly: E[LCS(X, Y)] / n converges to the Chvátal–Sankoff constant \gamma, whose exact value remains unknown but is bounded between approximately 0.79 and 0.83 by theoretical and empirical estimates.[33] The LCS of unrelated random strings is thus a constant fraction \gamma < 1 of their length, reflecting that uncorrelated data shares long subsequences only partially.
Simulations on uniform random binary strings confirm this linear growth, with extensive Monte Carlo experiments providing high-precision estimates of \gamma \approx 0.8122.[34] The distribution of LCS lengths under this model is concentrated around the mean: since changing a single character alters the LCS length by at most one, martingale concentration inequalities bound the typical fluctuations by O(\sqrt{n}), while the exact order of the variance remains an open question.[35]
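A Monte Carlo sketch of such an experiment in Python, using a linear-space LCS routine; estimates at moderate n sit slightly below the limiting constant:

```python
import random

def lcs_length(x, y):
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            curr[j] = prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

def estimate_gamma(n=500, trials=10, alphabet="01"):
    # average LCS(X, Y)/n over independent uniform random strings of length n
    ratios = [
        lcs_length(random.choices(alphabet, k=n), random.choices(alphabet, k=n)) / n
        for _ in range(trials)
    ]
    return sum(ratios) / trials

print(estimate_gamma())  # around 0.8 for binary strings at this size
```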
In models incorporating sequence similarity, such as when the strings are generated with a match probability p between aligned positions, the growth constant interpolates between two regimes: for small p the expected LCS length stays near the random-string value \gamma n, while as p approaches 1 it approaches \min(m, n), corresponding to a near-complete, essentially gap-free alignment.[36] This threshold behavior marks the shift from partial, gapped matching to linear alignment.
Recent studies up to 2025 have refined these bounds using advanced probabilistic methods, such as coupling arguments and concentration inequalities, yielding tighter lower bounds on \gamma (e.g., 0.793) and extensions to generalized constants for multiple strings or larger alphabets.[37] In genomics, these random string models inform null distributions for assessing the significance of observed LCS lengths in alignments of non-homologous DNA sequences, helping distinguish true evolutionary signals from chance matches in large-scale comparative analyses.[38]