Earley parser
The Earley parser is a dynamic programming algorithm for parsing context-free grammars (CFGs), published by Jay Earley in 1970 as an efficient general-purpose method for recognizing and constructing parse trees from input strings.[1] Named after its creator, it operates as a chart parser that builds an incremental table of partial parses, enabling it to handle arbitrary CFGs, including those with left recursion, ambiguity, and empty productions, without requiring grammar modifications or backtracking.[1] The algorithm's time complexity is O(n³) in the worst case for an input string of length n, but it reduces to O(n²) for unambiguous grammars and O(n) for many practical, deterministic cases such as programming-language grammars.[1]
At its core, the Earley parser maintains a chart consisting of sets S₀ through Sₙ, where each Sⱼ holds states representing partial parses ending at position j in the input.[2] A state takes the form [i, j, A → α • β], indicating that the portion α of the production A → α β has been matched against the substring from position i to j, with the dot marking how far the parser has progressed through the right-hand side.[2] The parser populates the chart by iteratively applying three inference rules:
- Predict: When the dot precedes a nonterminal B, add states for all productions of B starting at the current position.
- Scan: When the dot precedes a terminal that matches the next input symbol, advance the dot and move to the subsequent position.
- Complete: When a state completes a production (dot at the end), use it to advance any prior states waiting for that nonterminal.[2]
Overview and Background
Introduction
The Earley parser is an efficient algorithm for recognizing and parsing strings according to general context-free grammars (CFGs), capable of handling ambiguity and left recursion without requiring grammar transformations.[1] Developed by Jay Earley at Carnegie Mellon University, it was first published in 1970 as part of his work on context-free language processing.[1] The parser stands out for its generality: it applies to any CFG, in contrast to more specialized methods such as LR parsers, which demand deterministic grammars.[1] Its time complexity is O(n^3) in the worst case, where n is the input string length, but drops to O(n^2) for unambiguous grammars and to linear time for many practical grammars, such as those defining programming languages.[1] In Earley's evaluation, these properties made it superior to earlier general top-down and bottom-up approaches across broad classes of CFGs.[1] Employing dynamic programming in a chart-parsing framework, the Earley parser supports robust syntax analysis in fields such as natural language processing (NLP) and compiler construction.[1] In NLP, extensions enable probabilistic models for tasks such as psycholinguistic simulation and sentence parsing.[4]
History and Development
The Earley parser was invented by Jay Earley as part of his PhD thesis at Carnegie Mellon University, completed in 1968.[5] This work introduced an efficient algorithm for recognizing and parsing context-free grammars, building on dynamic programming techniques similar to those in Daniel Younger's 1967 algorithm and incorporating state sets and lookahead mechanisms derived from Donald Knuth's 1965 LR(k) parsing framework.[5] Earley's thesis laid the foundation for a general-purpose parser capable of handling arbitrary context-free grammars without requiring Chomsky normal form or other restrictions common in prior methods.[5] Earley published the algorithm in 1970 in the Communications of the ACM as "An Efficient Context-Free Parsing Algorithm," emphasizing its O(n³) worst-case time complexity and its better performance for unambiguous grammars.[1] The publication quickly gained attention in theoretical computer science for providing a unified approach that combined top-down prediction with bottom-up recognition, addressing limitations of earlier dynamic programming parsers such as the Cocke-Kasami-Younger (CKY) algorithm.[5] Following its introduction, the Earley parser was integrated into parser generator tools as Yacc-compatible variants, enabling faster execution for general grammars than traditional LR parsers allow; directly executable Earley implementations have also been developed.[6] In modern software it powers libraries such as Python's Lark parsing toolkit, which employs the Earley algorithm for robust handling of context-free grammars in applications like natural language processing.[7] The algorithm retains lasting recognition in computational linguistics as a cornerstone for parsing unrestricted context-free grammars, having served for over half a century as a benchmark in natural language analysis owing to its ability to manage ambiguity and complex syntactic structures efficiently.
Core Algorithm
Earley Recognizer
The Earley recognizer is a dynamic programming algorithm for determining whether a string is accepted by a context-free grammar (CFG), employing a hybrid of top-down and bottom-up strategies to achieve efficient recognition. Developed by Jay Earley, it processes the input string from left to right, avoiding the exponential time of naive recursive descent by memoizing partial parses in a chart-like structure. This approach ensures quadratic time for unambiguous grammars and cubic time in the worst case for general CFGs, making it more efficient than many contemporary algorithms for broad classes of grammars.
The algorithm builds a sequence of state sets S_0, S_1, \dots, S_n, where n is the length of the input string and each S_j collects partial parses that consume the prefix up to position j. Items within these sets take the form [A \to \alpha \bullet \beta, i], indicating that a production A \to \alpha \beta has matched \alpha from position i to j, with the dot marking the boundary before the yet-to-be-processed \beta. Recognition begins by seeding S_0 with the initial item for the augmented grammar's start production. Processing each state set involves iteratively applying three core phases, prediction, scanning, and completion, until the set stabilizes with no new items added. Prediction expands nonterminals immediately after the dot by adding their productions to the current set, anticipating future derivations. Scanning advances the dot over a terminal symbol if it matches the next input token, shifting the item to the subsequent state set. Completion reduces a fully parsed nonterminal by moving the dot in earlier items that anticipated it, propagating the progress backward.
The grammar is typically augmented with a new start symbol S' and production S' \to S, where S is the original start symbol, to guarantee a unique root for acceptance. Recognition succeeds if S_n contains the completed item [S' \to S \bullet, 0], confirming that the entire string derives from the start symbol. Epsilon rules, or empty productions A \to \epsilon, are accommodated by allowing the dot to advance over nullable nonterminals during prediction and completion, enabling zero-length matches without consuming input positions. This handling prevents infinite loops by tracking origins and adding each item only once per unique configuration, preserving termination even with nullable symbols.
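A minimal sketch of the recognizer in Python makes the three phases concrete. The encoding here (a dict mapping each nonterminal to its list of right-hand-side tuples, items as (lhs, rhs, dot, origin) tuples, and a reserved "_START" symbol) is an illustrative choice, not part of Earley's formulation:

```python
def earley_recognize(grammar, start, tokens):
    """Sketch of an Earley recognizer; returns (accepted, chart).

    grammar: dict mapping each nonterminal to a list of right-hand
    sides (tuples of symbols); a symbol is a nonterminal exactly when
    it is a key of the dict. Items are (lhs, rhs, dot, origin) tuples.
    """
    n = len(tokens)
    aug = "_START"                            # assumed absent from the grammar
    rules = {aug: [(start,)], **grammar}      # augment with S' -> S

    chart = [set() for _ in range(n + 1)]     # chart[j] = S_j
    chart[0].add((aug, (start,), 0, 0))

    for j in range(n + 1):
        agenda = list(chart[j])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in rules:      # predict
                for gamma in rules[rhs[dot]]:
                    item = (rhs[dot], gamma, 0, j)
                    if item not in chart[j]:
                        chart[j].add(item)
                        agenda.append(item)
            elif dot < len(rhs):                          # scan
                if j < n and tokens[j] == rhs[dot]:
                    chart[j + 1].add((lhs, rhs, dot + 1, origin))
            else:                                         # complete
                for waiting in list(chart[origin]):
                    w_lhs, w_rhs, w_dot, w_orig = waiting
                    if w_dot < len(w_rhs) and w_rhs[w_dot] == lhs:
                        item = (w_lhs, w_rhs, w_dot + 1, w_orig)
                        if item not in chart[j]:
                            chart[j].add(item)
                            agenda.append(item)

    # Note: the corner case where a nullable nonterminal completes before
    # a later prediction in the same set needs extra care in full
    # implementations (see Aycock and Horspool's treatment of epsilon).
    return (aug, (start,), 1, 0) in chart[n], chart
```

For example, earley_recognize({"S": [("a", "S"), ("a",)]}, "S", ["a", "a"]) accepts, and the returned chart records the completed items for every intermediate constituent.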
Extension to Parser
To extend the Earley recognizer into a full parser capable of producing parse trees or forests, the core state items are augmented with additional metadata that tracks derivation paths. Specifically, each item of the form [A \to \alpha \cdot B \beta, j], which indicates that the prefix \alpha of production A \to \alpha B \beta has been recognized starting from position j in the input, is modified to include backpointers or origin links pointing to the states where the relevant subderivations began.[5] These backpointers record the starting position f (where 0 \leq f \leq n+1, with n being the input length) for the instance of the production being parsed, enabling reconstruction of the derivation structure by tracing links backward from completed items.[5]
The prediction and scanning rules remain largely unchanged from the recognizer, but the completer rule is enhanced to store these links during reductions. When a nonterminal B is completed in an item [B \to \gamma \cdot, k] (where \gamma spans from j to k), the completer advances any waiting items [A \to \alpha \cdot B \beta, i] to [A \to \alpha B \cdot \beta, i] while adding a backpointer from the new item to the completed B item, effectively linking the subparse for B to the parent production.[5] This additional storage captures parse actions only during completion, distinguishing the parser from the pure recognizer, which merely checks acceptance without recording structure; the parser thus incurs extra space proportional to the number of such links, up to O(n^3) in the worst case, while preserving the cubic time complexity.[5]
Ambiguity in the grammar is handled naturally through this mechanism: multiple valid completions of the same nonterminal B (arising from different predictions or scans) result in multiple backpointers from the advanced items, allowing all alternative derivations to be represented without duplicating shared substructures.[5] For instance, if a nonterminal admits multiple production instances over the same span, each completion adds a distinct link, enabling the parser to enumerate or pack all possibilities. In modern implementations these links are organized into a shared packed parse forest (SPPF), whose nodes represent substrings and whose alternatives are packed so that the forest occupies at most cubic space.[8]
The output of the extended parser is a parse tree for unambiguous inputs, or a representation of all possible derivations (such as a forest) for ambiguous ones, constructed by traversing the backpointers from the final completed item for the start symbol spanning the entire input (e.g., [S \to \gamma \cdot, 0] in S_n).[5] This traversal yields the full syntactic structure, with each path corresponding to a valid derivation, providing not just acceptance but a concrete basis for semantic analysis or further processing.[8]
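A hedged sketch of this enhanced completer, reusing the item shape from the recognizer sketch above as a small dataclass (the names Item, complete_with_links, and the links dictionary are illustrative, not Earley's notation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    lhs: str
    rhs: tuple
    dot: int
    origin: int

def complete_with_links(completed, j, chart, links):
    """Advance items waiting on completed.lhs and record backpointers.

    `completed` is an item [B -> gamma ., k] finished in S_j. Every
    waiting item [A -> alpha . B beta, i] in S_k is advanced, and the
    advanced item is linked back to its (waiting, completed) pair.
    Several entries under one key represent ambiguity: distinct ways
    of deriving the same advanced item over the same span.
    """
    for waiting in list(chart[completed.origin]):
        if waiting.dot < len(waiting.rhs) and waiting.rhs[waiting.dot] == completed.lhs:
            advanced = Item(waiting.lhs, waiting.rhs, waiting.dot + 1, waiting.origin)
            chart[j].add(advanced)
            links.setdefault((advanced, j), []).append((waiting, completed))
```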
Formal Mechanics
State Sets and Items
In the Earley parsing algorithm, the fundamental data structure is the Earley item, formally defined as a tuple [A \to \alpha \bullet \beta, i], where A is a nonterminal, A \to \alpha \beta is a production in the grammar, the dot \bullet marks the current position in the right-hand side (with \alpha already parsed and \beta remaining to be parsed), and i is the position in the input string where the derivation of \alpha began.[9][1] This notation captures partial progress toward recognizing a sentential form, allowing the algorithm to track multiple possible derivations simultaneously without backtracking. Items represent hypotheses about how portions of the input align with grammar rules, enabling efficient reuse of subcomputations through dynamic programming.
State sets, denoted S_j for 0 \leq j \leq n where n is the input length, are collections of all Earley items reachable after processing the first j symbols of the input.[9] Each S_j forms an unordered set to avoid duplicates, aggregating active items (where the dot precedes a symbol) and ensuring that only unique configurations are stored, which underpins the algorithm's O(n^3) worst-case time complexity and its O(n^2) space bound.[1] The initial set S_0 seeds the parsing process, and subsequent sets build incrementally.
To facilitate recognition of complete parses, the grammar is augmented by introducing a new start symbol S' and adding the production S' \to S (or S' \to S \# with an end marker \# for explicit termination).[9] This isolates the original start symbol S and lets the parser verify that the entire input is consumed. Items in state sets exhibit key properties: they encode derivation origins for ambiguity resolution, their set structure eliminates redundant computations, and completed items (where the dot is at the end, denoted [A \to \alpha \bullet, i]) are distinguished from in-progress ones, with completed items signaling viable reductions.[1] These structures are populated through the predictive and advancement operations described next.
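Concretely, in the hypothetical tuple encoding used in the sketches above, the augmentation and the seeded state sets for a toy grammar look like:

```python
# Toy grammar S -> a S | a, to be run on the input "aa".
grammar = {"S": [("a", "S"), ("a",)]}

# Augment with a fresh start symbol S' and the production S' -> S.
augmented = {"S'": [("S",)], **grammar}

# One state set per input position: S_0 .. S_n (here n = 2).
state_sets = [set() for _ in range(2 + 1)]
state_sets[0].add(("S'", ("S",), 0, 0))   # the item [S' -> . S, 0] seeds S_0
```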
Prediction, Scanning, and Completion Rules
The three core rules of the Earley parser, prediction, scanning, and completion, form the operational mechanism for populating the state sets S_k, where each S_k contains items representing partial parses up to input position k. These rules are applied iteratively to existing items in a state set, adding new items to the appropriate sets (the current one or the next) only if they are not already present, thereby ensuring efficiency and preventing infinite loops from recursive productions. The process for each S_k repeats until no additional items can be derived, at which point the parser advances to S_{k+1} via scanning.
The prediction rule handles nonterminals immediately following the dot in an item. For an item [A \to \alpha \bullet B \beta, j] \in S_k, where B is a nonterminal and j denotes the origin position of the item, add the item [B \to \bullet \gamma, k] to S_k for every production B \to \gamma in the grammar. This rule expands the current state set by anticipating all possible ways to derive B starting at position k, enabling top-down exploration of the grammar without consuming input.[1]
The scanning rule consumes matching terminals from the input string. For an item [A \to \alpha \bullet t \beta, j] \in S_k, where t is a terminal, if the input symbol at position k+1 equals t, then add [A \to \alpha t \bullet \beta, j] to S_{k+1}. This advances the dot past the terminal and shifts the item to the next state set, matching the parse against the actual input. Scanning is the only rule that moves items forward in the chart.[1]
The completion rule resolves finished subparses and propagates them backward. For a completed item [B \to \gamma \bullet, m] \in S_k and any item [A \to \alpha \bullet B \beta, j] \in S_m (where m is the origin of the completed item), add [A \to \alpha B \bullet \beta, j] to S_k. This moves the dot past the nonterminal B in items that were waiting in the earlier state set S_m, integrating the successful derivation of B into its enclosing productions bottom-up. Completion thus connects parse fragments across state sets.[1]
These rules operate in an interleaved manner within the loop for S_k: prediction and completion add items to the current set (completion consulting earlier sets for the waiting items), while scanning queues items for the next. Duplicates are rejected by a set-membership check before insertion, which halts recursion in cases like left recursion (e.g., A \to A \alpha) because identical items are never re-added. This disciplined application explores all valid paths without redundancy, achieving recognition for any context-free grammar.[10]
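Stated in display form, the three rules are inference rules over items, consistent with the notation above (a_{k+1} is the next input symbol and P the production set):

```latex
% The three Earley rules as inference rules over items.
\frac{[A \to \alpha \bullet B \beta,\, j] \in S_k \qquad B \to \gamma \in P}
     {[B \to \bullet \gamma,\, k] \in S_k}\;\textsc{predict}
\qquad
\frac{[A \to \alpha \bullet t \beta,\, j] \in S_k \qquad a_{k+1} = t}
     {[A \to \alpha t \bullet \beta,\, j] \in S_{k+1}}\;\textsc{scan}
\qquad
\frac{[B \to \gamma \bullet,\, m] \in S_k \qquad [A \to \alpha \bullet B \beta,\, j] \in S_m}
     {[A \to \alpha B \bullet \beta,\, j] \in S_k}\;\textsc{complete}
```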
Implementation Details
Pseudocode
The Earley parser algorithm processes a context-free grammar G = (V, \Sigma, P, S) and an input string w = a_1 a_2 \dots a_n \in \Sigma^* to determine whether w is derivable from the start symbol S, producing a set of states (items) across chart positions from 0 to n. The algorithm maintains a chart C consisting of sets S_j for j = 0 to n, where each S_j contains Earley items of the form [A \to \alpha \cdot \beta, i], denoting a production A \to \alpha \beta with the dot after \alpha and origin at position i. Epsilon rules are handled naturally through prediction and completion without special cases, and the empty string (n = 0) is accepted if the start item completes in S_0. The grammar need not be in Chomsky normal form, though some implementations assume it for simplicity; the general form supports arbitrary productions. Acceptance occurs if [S' \to S \cdot, 0] \in S_n, where S' is an augmented start symbol with production S' \to S. Optional parse links can be stored in items for forest construction.
The high-level outline initializes S_0 with the start item and builds each S_j by applying the predictor, scanner, and completer rules until no new items are added (a fixed point), processing positions sequentially from 0 to n. This explores all possible derivations efficiently, with worst-case time complexity O(n^3) overall but O(n^2) for unambiguous grammars.[1] Below is a detailed pseudocode implementation in a functional style, using sets to avoid duplicates; items are represented as tuples (A \to \alpha \cdot \beta, origin), and the grammar's productions are indexed by left-hand side for efficient lookup.

```
function EarleyParse(G, w):
    n = length(w)
    C = array of n+1 empty sets              // C[j] = S_j
    add_item(C[0], (S' → ·S, 0))             // augmented start: S' → S
    for j from 0 to n:
        agenda = queue of all items in C[j]
        while agenda is not empty:
            item = agenda.pop()
            if the dot in item precedes a nonterminal:
                apply_predictor(item, j, C, G, agenda)
            else if the dot in item precedes a terminal:
                apply_scanner(item, j, w, C)
            else:                            // dot at the end
                apply_completer(item, j, C, agenda)
    if (S' → S·, 0) in C[n]:
        return "Accepted"                    // optionally return parse forest via links
    else:
        return "Rejected"
```

The predictor rule adds anticipated items for nonterminals immediately after the dot; an epsilon production B → ε enters as an already-completed item and is resolved by the completer within the same set. New items are added only if not already present, avoiding duplicates:

```
function apply_predictor(item, j, C, G, agenda):
    (A → α · B β, i) = item                  // B is a nonterminal
    for each production B → γ in G.productions[B]:
        new_item = (B → ·γ, j)
        if new_item not in C[j]:
            add_item(C[j], new_item)
            agenda.push(new_item)            // processed within the current position
```

The scanner rule advances the dot over a terminal matching the next input symbol, adding the new item to the next chart position S_{j+1}. It applies only when the symbol after the dot is a terminal and j < n:

```
function apply_scanner(item, j, w, C):
    (A → α · a β, i) = item                  // a is a terminal
    if j < n and w[j+1] == a:                // w is 1-based, so a_{j+1} is the next symbol
        new_item = (A → α a · β, i)
        if new_item not in C[j+1]:
            add_item(C[j+1], new_item)       // no agenda push; handled at position j+1
```

The completer rule propagates completions backward: when an item [B \to \gamma \cdot, k] is reached, it advances all items in S_k that were waiting for B. This performs reductions and is the key step for recognizing completed constituents:

```
function apply_completer(item, j, C, agenda):
    (B → γ ·, k) = item                      // dot at the end; k is the item's origin
    for each prior_item of the form (A → α · B β, i) in C[k]:
        new_item = (A → α B · β, i)
        if new_item not in C[j]:
            add_item(C[j], new_item)
            agenda.push(new_item)            // reprocessed within the current position
```

For the empty string case (n = 0), the loop runs only for j = 0, and epsilon rules propagate through prediction and completion until the start item can complete in S_0. In practice, implementations use hash sets for each S_j to make membership checks O(1), and a per-set agenda avoids redundant rule applications. This pseudocode directly translates the original formulation, enabling straightforward implementation in languages like Python or Haskell.
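As a quick check of the epsilon and empty-string behavior, the earley_recognize sketch from the Core Algorithm section (an assumed helper, not part of the pseudocode above) can be exercised on a grammar that is both left-recursive and nullable:

```python
# Represent an epsilon rule A -> ε as the empty tuple ().
grammar = {"S": [("S", "a"), ()]}     # S -> S a | ε

accepted, _ = earley_recognize(grammar, "S", [])
assert accepted                       # empty input: [S' -> S ., 0] completes in S_0

accepted, _ = earley_recognize(grammar, "S", ["a", "a", "a"])
assert accepted                       # left recursion terminates via the item sets
```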
Worked Example
To illustrate the operation of the Earley algorithm, consider a small ambiguous context-free grammar for arithmetic expressions, which admits both left- and right-associative parses for chains of additions:
- S → E
- E → E + E
- E → num

The input is 1 + 2 + 3, tokenized as num + num + num with positions 0 through 5. State set S_0 begins with:
- [S → • E, 0] (initial item)

Prediction adds the E productions:
- [E → • E + E, 0]
- [E → • num, 0]
Scanning from [E → • num, 0] in S_0 (the next token, 1, matches num) adds to S_1:
- [E → num •, 0]

Completion from this item (recognizing E from 0 to 1) advances the items in S_0 that expect • E, adding to S_1:
- [S → E •, 0]
- [E → E • + E, 0]
Scanning from [E → E • + E, 0] in S_1 (the next token is +) adds to S_2:
- [E → E + • E, 0]

Prediction from this item (• E precedes a nonterminal) adds to S_2:
- [E → • E + E, 2]
- [E → • num, 2]
Scanning from [E → • num, 2] in S_2 (the next token, 2, matches num) adds to S_3:
- [E → num •, 2]

Completion from this item (recognizing E from 2 to 3) advances the items in S_2 that expect • E, adding to S_3:
- [E → E + E •, 0]
- [E → E • + E, 2]

Further completion from [E → E + E •, 0] (recognizing E from 0 to 3) advances the items in S_0 that expect • E, adding to S_3:
- [S → E •, 0]
- [E → E • + E, 0]
Scanning on + from S_3 (the next token is +) advances both items of the form [E → E • + E, …], adding to S_4:
- [E → E + • E, 0]
- [E → E + • E, 2]

Prediction from these items (• E precedes a nonterminal) adds to S_4:
- [E → • E + E, 4]
- [E → • num, 4]
Scanning from [E → • num, 4] in S_4 (the next token, 3, matches num) adds to S_5:
- [E → num •, 4]

Completion from this item (recognizing E from 4 to 5) advances the items in S_4 that expect • E, adding to S_5:
- [E → E + E •, 0] (from [E → E + • E, 0]; the left-associative parse (1 + 2) + 3)
- [E → E + E •, 2] (from [E → E + • E, 2]; recognizing 2 + 3 as an E from 2 to 5)

Further completion from [E → E + E •, 2] advances [E → E + • E, 0], still waiting in S_2, and derives [E → E + E •, 0] in S_5 a second time; this path is the right-associative parse 1 + (2 + 3). As an item, [E → E + E •, 0] appears only once in the set S_5, but a parser records one backpointer per completion path. Completing [E → E + E •, 0] (recognizing E from 0 to 5) finally advances the items in S_0 that expect • E, adding to S_5:
- [S → E •, 0]
- [E → E • + E, 0]

The item [S → E •, 0] in S_5 spans the whole input, so the string is accepted, with the ambiguity captured by the two completion paths into [E → E + E •, 0].
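Running the worked example through the earley_recognize sketch from the Core Algorithm section (again an assumed helper) confirms acceptance and the doubly-derived completion item:

```python
grammar = {
    "S": [("E",)],
    "E": [("E", "+", "E"), ("num",)],
}
tokens = ["num", "+", "num", "+", "num"]      # "1 + 2 + 3"

accepted, chart = earley_recognize(grammar, "S", tokens)
assert accepted
# The ambiguous item [E -> E + E ., 0] appears once in S_5 even though
# two completion paths derive it; a parser would attach two backpointers.
assert ("E", ("E", "+", "E"), 3, 0) in chart[5]
assert ("S", ("E",), 1, 0) in chart[5]        # [S -> E ., 0]
```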
Advanced Features
Building the Parse Forest
The parse forest in the Earley parser is constructed as a directed acyclic graph (DAG) that compactly represents all possible parse trees for an input string, with nodes corresponding to nonterminals or terminals and edges representing productions from the context-free grammar.[8][10] This structure, often implemented as a shared packed parse forest (SPPF), ensures that common subparses are reused across multiple derivations, preventing exponential growth in size even for highly ambiguous grammars.[8]
To build the forest, the algorithm traverses the completed Earley items backward from the final state set, using origins (start and end positions of substrings) and backpointers (links to predecessor items) stored during recognition.[5][10] For a completed item such as A \to \alpha \cdot, the backpointer indicates the originating item in an earlier state set that advanced the dot; subtrees are assembled recursively by following these links. Predicted items carry null pointers, scanned terminals create leaf nodes, and reductions combine subforests from multiple completed sub-items.[8][10] This recursive process, often formalized as a function like build_tree that yields branches or leaves depending on the pointer type (predecessor pointers for scans, reduction pointers for completions), constructs the DAG in cubic time, matching the recognizer's complexity.[10]
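A compact way to realize this traversal in Python is sketched below. Rather than following stored backpointers, it re-derives split points directly from a completed chart of (lhs, rhs, dot, origin) items (such as the one returned by the recognizer sketch earlier); this is a named simplification that yields the same trees, with a busy set pruning cyclic derivations:

```python
def trees(sym, i, j, chart, rules, tokens, busy=frozenset()):
    """Yield parse trees (as nested tuples) for `sym` spanning tokens[i:j]."""
    if (sym, i, j) in busy:                  # prune cyclic derivations
        return
    busy = busy | {(sym, i, j)}
    for (lhs, rhs, dot, origin) in chart[j]:
        if lhs == sym and origin == i and dot == len(rhs):
            for kids in splits(list(rhs), i, j, chart, rules, tokens, busy):
                yield (sym, *kids)

def splits(rhs, i, j, chart, rules, tokens, busy):
    """Yield child sequences matching the symbols of `rhs` to tokens[i:j]."""
    if not rhs:
        if i == j:
            yield ()
        return
    head, rest = rhs[0], rhs[1:]
    if head not in rules:                    # terminal: must be the next token
        if i < j and tokens[i] == head:
            for kids in splits(rest, i + 1, j, chart, rules, tokens, busy):
                yield (head, *kids)
    else:                                    # nonterminal: try every split point
        for k in range(i, j + 1):
            for sub in trees(head, i, k, chart, rules, tokens, busy):
                for kids in splits(rest, k, j, chart, rules, tokens, busy):
                    yield (sub, *kids)
```

On the worked example's chart, list(trees("S", 0, 5, chart, grammar, tokens)) yields exactly the two associativities; because the number of trees can grow exponentially, practical parsers return the packed forest rather than enumerating.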
Shared structures address ambiguity by packing alternative derivations into nodes labeled with triples (nonterminal, start, end), where edges (e.g., left/right child or pivot splits) link to shared sub-DAGs, ensuring the forest size remains polynomial.[8] The forest can be represented using adjacency lists for nodes or recursive data structures like Branch N (tree list), facilitating traversal for tree extraction or semantic processing.[10]
For the simple ambiguous grammar S \to a S \mid S a \mid a parsing the string "aa", full tree expansion duplicates shared subtrees across the two parses (S \to a S with S \to a, or S \to S a with S \to a), and for longer strings of a's the number of distinct trees grows exponentially. The DAG instead shares a single node per subderivation: a packed node (S, 0, 2) carries two alternatives, one with children (a, 0, 1) and (S, 1, 2) and the other with children (S, 0, 1) and (a, 1, 2), so both derivations are represented in a structure of linear size for this input.[8]
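The sharing can be made concrete with a small illustrative node type (the names here are hypothetical; real SPPF implementations add intermediate nodes for productions longer than two symbols):

```python
from dataclasses import dataclass, field

@dataclass
class SPPFNode:
    """A node in a shared packed parse forest (sketch).

    label: (symbol, start, end), a nonterminal or terminal over a span.
    families: each entry is one alternative derivation, given as a
    tuple of child nodes; more than one entry means ambiguity is
    packed into this node.
    """
    label: tuple
    families: list = field(default_factory=list)

# The two parses of "aa" under S -> a S | S a | a share their leaves:
a01 = SPPFNode(("a", 0, 1))
a12 = SPPFNode(("a", 1, 2))
s01 = SPPFNode(("S", 0, 1), [(a01,)])    # S -> a over positions 0..1
s12 = SPPFNode(("S", 1, 2), [(a12,)])    # S -> a over positions 1..2
s02 = SPPFNode(("S", 0, 2), [
    (a01, s12),    # alternative 1: S -> a S
    (s01, a12),    # alternative 2: S -> S a
])
```

Both derivations reuse the leaf nodes a01 and a12, so the structure grows linearly here where explicit tree enumeration would duplicate them.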