
Earley parser

The Earley parser is a dynamic programming algorithm for context-free grammars (CFGs), invented by Jay Earley in 1968 as an efficient general-purpose solution for recognizing and constructing parse trees from input strings. Named after its creator, it operates as a chart parser that incrementally builds a chart of partial parses, enabling it to handle arbitrary CFGs, including those with left recursion, ambiguity, and empty productions, without requiring grammar modifications or backtracking. The algorithm's time complexity is O(n³) in the worst case for an input string of length n, but it reduces to O(n²) for unambiguous grammars and O(n) for many practical, deterministic cases like those in programming languages. At its core, the Earley parser maintains a chart consisting of state sets S₀ through Sₙ, where each Sⱼ holds states representing valid partial parses ending at position j in the input. A state in Sⱼ takes the form [A → α • β, i], indicating that nonterminal A may derive the substring from position i to j via production A → α β, with the dot marking progress: α has been matched and β remains. The parser advances through three inference rules applied iteratively to populate the chart:
  • Predict: When the dot precedes a nonterminal B, add states for all productions of B starting at the current position.
  • Scan: When the dot precedes a terminal that matches the next input symbol, advance the dot and move to the subsequent position.
  • Complete: When a state completes a production (dot at the end), use it to advance any prior states waiting for that nonterminal.
This mechanism combines top-down prediction with bottom-up completion, packing shared substructures to avoid redundant computation and enabling recognition of the start symbol spanning the full input upon success. The Earley parser is more general than table-driven parsers like LL(k) or LR(k), which demand grammars free of certain constructs (e.g., left recursion in LL), since it handles any CFG without preprocessing. Its space complexity is O(n²), making it suitable for moderate-length inputs, though optimizations like those addressing epsilon rules enhance practicality for real-world use. Primarily applied in computational linguistics, it excels at parsing ambiguous natural language structures, and modern variants incorporate weights or probabilities for tasks like probabilistic parsing in NLP pipelines.

Overview and Background

Introduction

The Earley parser is an efficient algorithm for recognizing and parsing strings according to general context-free grammars (CFGs), capable of handling ambiguity and left recursion without requiring grammar transformations. Developed by Jay Earley at Carnegie Mellon University, it was first published in 1970 as part of his doctoral research on parsing. This parser stands out for its generality, as it applies to any CFG, in contrast to more specialized methods like LR parsers that demand deterministic grammars. Its time complexity is O(n^3) in the worst case, where n is the input length, but drops to O(n^2) for unambiguous grammars and linear time for many practical grammars, such as those defining programming languages. These properties made it more efficient than earlier general-purpose top-down and bottom-up approaches for broad CFG applications. Employing dynamic programming in a chart-parsing framework, the Earley parser supports robust syntax analysis in fields like natural language processing (NLP) and compiler construction. In computational linguistics, extensions enable probabilistic models for tasks such as psycholinguistic simulation and sentence comprehension.

History and Development

The Earley parser was invented by Jay Earley as part of his doctoral thesis at Carnegie Mellon University, completed in 1968. This work introduced an efficient algorithm for recognizing and parsing context-free grammars, building on dynamic programming techniques similar to those in Daniel Younger's 1967 algorithm and incorporating state sets and lookahead mechanisms derived from Donald Knuth's LR(k) framework from 1965. Earley's dissertation laid the foundation for a general-purpose parser capable of handling arbitrary context-free grammars without requiring grammar transformations or other restrictions common in prior methods.

Earley published his algorithm in 1970 in the Communications of the ACM, where it was presented as "An Efficient Context-Free Parsing Algorithm," emphasizing its O(n³) time complexity for general cases and better performance for unambiguous grammars. The publication quickly gained attention in computer science for providing a unified approach to parsing that combined top-down prediction with bottom-up recognition, addressing limitations in earlier dynamic programming parsers like the Cocke-Kasami-Younger (CKY) algorithm.

Following its introduction, the Earley parser saw integration into parser generator tools as yacc-compatible variants, enabling faster execution for general grammars compared to traditional LR-based generators; for instance, directly executable Earley implementations have been reported to produce parsers 2.5 to 6.5 times faster than those generated by yacc or bison. In modern software, it powers libraries such as Python's Lark parsing toolkit, which employs the Earley algorithm for robust handling of ambiguous context-free grammars. The algorithm has earned lasting recognition in computer science as a general method for parsing unrestricted context-free grammars, serving as a foundation for syntax analysis for over half a century due to its ability to manage ambiguity and complex language constructs efficiently.

Core Algorithm

Earley Recognizer

The Earley recognizer is a dynamic programming algorithm for determining whether a string is accepted by a context-free grammar (CFG), employing a hybrid of top-down and bottom-up strategies to achieve efficient recognition. Developed by Jay Earley, it processes the input from left to right, avoiding the exponential blowup of naive recursive parsing by memoizing partial parses in a chart-like structure. This approach ensures quadratic time for unambiguous grammars and cubic time in the worst case for general CFGs, making it more efficient than many contemporary algorithms for broad classes of grammars.

The algorithm builds a sequence of state sets S_0, S_1, \dots, S_n, where n is the length of the input and each S_j collects partial parses that consume the input up to position j. Items within these sets take the form [A \to \alpha \bullet \beta, i], indicating that a production A \to \alpha \beta has matched \alpha from position i to j, with the dot marking the boundary before the yet-to-be-processed \beta. Recognition begins by seeding S_0 with the initial item for the augmented grammar's start symbol.

Processing each state set involves iteratively applying three core phases (prediction, scanning, and completion) until the set stabilizes with no new items added. Prediction expands nonterminals immediately after the dot by adding their productions to the current set, anticipating future derivations. Scanning advances the dot over a terminal symbol if it matches the next input token, shifting the item to the subsequent state set. Completion reduces a fully parsed nonterminal by moving the dot in earlier items that anticipated it, propagating the progress backward.

The grammar is typically augmented with a new start symbol S' and production S' \to S, where S is the original start symbol, to ensure a unique root. Recognition succeeds if S_n contains the completed item [S' \to S \bullet, 0], confirming the entire string derives from the start symbol.
Epsilon rules, or empty productions A \to \epsilon, are accommodated by allowing the dot to advance over null nonterminals during prediction and completion, enabling zero-length matches without consuming input positions. This handling prevents infinite loops by tracking origins and only adding items once per unique configuration, preserving termination even with nullable symbols.
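The nullable-symbol tracking described above is usually made concrete with a small precomputation. The sketch below is an illustration rather than part of Earley's original formulation; the grammar encoding (a dict mapping each nonterminal to a list of right-hand-side tuples, with the empty tuple standing for epsilon) is a choice of this example.

```python
def nullable_nonterminals(productions):
    """Fixed-point computation of the nonterminals that derive epsilon.

    `productions` maps each nonterminal to a list of right-hand sides,
    where a right-hand side is a tuple of symbols (empty tuple = epsilon).
    """
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in productions.items():
            if lhs in nullable:
                continue
            for rhs in rhss:
                # all() over an empty rhs is True, so A -> epsilon qualifies.
                if all(sym in nullable for sym in rhs):
                    nullable.add(lhs)
                    changed = True
                    break
    return nullable

# Example: A -> epsilon, B -> A A, S -> B 'x'  (so A and B are nullable)
grammar = {"S": [("B", "x")], "B": [("A", "A")], "A": [()]}
print(sorted(nullable_nonterminals(grammar)))  # prints ['A', 'B']
```

With the nullable set in hand, the predictor and completer can advance the dot over a nullable nonterminal without waiting for an explicit zero-length completion.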

Extension to Parser

To extend the Earley recognizer into a full parser capable of producing parse trees or forests, the core state items are augmented with additional metadata to track derivation paths. Specifically, each item of the form [A \to \alpha \cdot B \beta, j], which indicates that the prefix \alpha of production A \to \alpha B \beta has been recognized starting from position j in the input, is modified to include backpointers or origin links pointing to the states where the relevant subderivations began. These backpointers record the starting position f (where 0 \leq f \leq n, with n being the input length) for the instance of the production being parsed, enabling reconstruction of the derivation structure by tracing links backward from completed items.

The prediction and scanning rules remain largely unchanged from the recognizer, but the completer rule is enhanced to store these links during reductions. When a nonterminal B is completed in an item [B \to \gamma \cdot, k] (where \gamma spans from k to the current position), the completer advances any waiting items [A \to \alpha \cdot B \beta, i] to [A \to \alpha B \cdot \beta, i] while adding a backpointer from the new item to the completed B item, effectively linking the subparse for B to the parent production. This additional storage captures parse actions only during completion, distinguishing the parser from the pure recognizer, which merely checks membership without recording structure; the parser thus incurs extra space proportional to the number of such links, potentially up to O(n^3) in the worst case, while preserving the cubic time complexity. Ambiguity in the grammar is handled naturally through this mechanism, as multiple valid completions for the same nonterminal B (arising from different derivations over the same span) result in multiple backpointers from the advanced items, allowing all alternative derivations to be represented without duplication of shared substructures.
For instance, if a nonterminal admits multiple derivations over the same span, each completion adds a distinct link, enabling the parser to enumerate or pack all possibilities. In modern implementations, these links can be organized into a shared packed parse forest (SPPF), where nodes represent substrings and alternatives are packed to store the forest in at most cubic space. The output of the extended parser is a single parse tree for unambiguous inputs or a representation of all possible parse trees (such as an SPPF) for ambiguous ones, constructed by traversing the backpointers from the final completed item for the start symbol spanning the entire input (e.g., [S \to \gamma \cdot, 0] in S_n). This traversal yields the full syntactic structure, with each path corresponding to a valid derivation, providing not just acceptance but a concrete basis for semantic analysis or further processing.
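As a minimal sketch of this augmentation (the class and field names here are illustrative, not a complete parser), the completer below advances waiting items while attaching a (predecessor, child) backpointer pair. A production-quality implementation would additionally merge duplicate items and accumulate links on the merged item, which is exactly what packs ambiguous derivations.

```python
class PItem:
    """Earley item extended with backpointers for parse reconstruction."""
    def __init__(self, lhs, rhs, dot, origin):
        self.lhs, self.rhs, self.dot, self.origin = lhs, rhs, dot, origin
        self.links = []  # (predecessor item, completed child item) pairs

def complete(completed, chart, j):
    """Advance items in chart[completed.origin] waiting on completed.lhs,
    recording a backpointer to the completed subparse."""
    for waiting in list(chart[completed.origin]):
        if waiting.dot < len(waiting.rhs) and waiting.rhs[waiting.dot] == completed.lhs:
            advanced = PItem(waiting.lhs, waiting.rhs, waiting.dot + 1, waiting.origin)
            advanced.links.append((waiting, completed))
            chart[j].append(advanced)

# Tiny demonstration: completing E -> num over positions 0..1
# advances the waiting item S -> . E in chart[0] into chart[1].
chart = [[PItem("S", ("E",), 0, 0)], []]
done = PItem("E", ("num",), 1, 0)
complete(done, chart, 1)
```

Tracing `links` backward from the final item recovers the derivation; when two derivations complete the same item, both pairs end up in its `links` list.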

Formal Mechanics

State Sets and Items

In the Earley parsing algorithm, the fundamental data structure is the Earley item, formally defined as a tuple [A \to \alpha \bullet \beta, i], where A is a nonterminal, A \to \alpha \beta is a production in the grammar, the dot \bullet marks the current position in the right-hand side (with \alpha already parsed and \beta remaining to be parsed), and i is the position in the input string where the derivation of \alpha began. This notation captures partial progress toward recognizing a sentential form, allowing the algorithm to track multiple possible derivations simultaneously without backtracking. Items represent hypotheses about how portions of the input align with grammar rules, enabling efficient reuse of subcomputations through dynamic programming.

State sets, denoted S_j for 0 \leq j \leq n where n is the input length, are collections of all Earley items reachable after processing the first j symbols of the input. Each S_j forms an unordered set to avoid duplicates, aggregating active items (where the dot precedes an unprocessed symbol) and ensuring that only unique configurations are stored, which contributes to the algorithm's O(n^3) worst-case time complexity while keeping space to O(n^2). The initial set S_0 seeds the parsing process, and subsequent sets build incrementally.

To facilitate recognition of complete parses, the grammar is augmented by introducing a new start symbol S' and adding the production S' \to S (or S' \to S \# with an end marker \# for explicit termination). This isolates the original start symbol S and ensures the parser verifies that the entire input is consumed. Items in state sets exhibit key properties: they encode derivation origins for ambiguity resolution, leverage set structure to eliminate redundant computations, and distinguish completed items (where the dot is at the end, denoted [A \to \alpha \bullet, i]) from in-progress ones, with completed items signaling viable reductions.
These structures are populated through predictive and advancement operations on prior sets.
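The item and state-set representation lends itself to hashable value types, so duplicate elimination falls out of ordinary set membership. A brief Python sketch (the field names are this example's choice, not fixed terminology):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    """An Earley item [A -> alpha . beta, origin].

    `lhs` is the nonterminal A, `rhs` the full production body as a tuple,
    `dot` the index of the bullet within `rhs`, and `origin` the input
    position where recognition of this production instance began.
    """
    lhs: str
    rhs: tuple
    dot: int
    origin: int

    def next_symbol(self):
        # Symbol immediately after the dot, or None if the item is complete.
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None

    def is_complete(self):
        return self.dot == len(self.rhs)

# Frozen dataclasses are hashable, so state sets can be plain Python sets
# and duplicate items are eliminated automatically on insertion.
s0 = {Item("S'", ("S",), 0, 0)}
assert Item("S'", ("S",), 0, 0) in s0  # O(1) membership check
```

Because equal items hash equally, re-predicting the same configuration (as left recursion does) cannot grow the state set, which is the property the termination argument relies on.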

Prediction, Scanning, and Completion Rules

The three core rules of the Earley parser (prediction, scanning, and completion) form the operational mechanism for populating the state sets S_k, where each S_k contains items representing partial parses up to input position k. These rules are applied iteratively to existing items in a state set, adding new items to the appropriate sets (current or subsequent) only if they are not already present, thereby ensuring efficiency and preventing infinite loops from recursive productions. The process for each S_k repeats until no additional items can be derived, at which point the parser advances to S_{k+1} via scanning.

The prediction rule handles nonterminals immediately following the dot in an item. For an item [A \to \alpha \bullet B \beta, j] \in S_k, where B is a nonterminal and j denotes the origin position of the item, add all items [B \to \bullet \gamma, k] to S_k for every production B \to \gamma in the grammar. This rule expands the current state set by anticipating all possible ways to derive B starting at position k, enabling top-down exploration of the grammar without consuming input.

The scanning rule consumes matching terminals from the input string. For an item [A \to \alpha \bullet t \beta, j] \in S_k, where t is a terminal, if the input symbol at position k+1 equals t, then add [A \to \alpha t \bullet \beta, j] to S_{k+1}. This advances the dot past the terminal and shifts the item to the next state set, effectively matching and progressing along the input sequence. Scanning is the only rule that moves items forward in the chart, linking the parse to the actual input.

The completion rule resolves subparse completions and propagates them backward. For a completed item [B \to \gamma \bullet, m] \in S_k and any item [A \to \alpha \bullet B \beta, j] \in S_m (where m is the origin of the completed item), add [A \to \alpha B \bullet \beta, j] to S_k. This moves the dot past the nonterminal B in waiting items from the appropriate earlier state set, integrating the successful recognition of B bottom-up into enclosing productions. Completion thus connects parse fragments across state sets.

These rules operate in an interleaved manner within the processing loop for S_k: prediction and completion add items to the current set, while scanning queues items for the next. Duplicates are checked via set membership before insertion, guaranteeing termination even for left-recursive productions (e.g., A \to A \alpha) by not re-adding identical items. This disciplined application ensures the algorithm explores all valid paths without redundancy, achieving recognition for any context-free grammar.

Implementation Details

Pseudocode

The Earley parser algorithm processes a grammar G = (V, \Sigma, P, S) and an input string w = a_1 a_2 \dots a_n \in \Sigma^* to determine if w is derivable from the start symbol S, producing a set of states (items) across chart positions from 0 to n. The algorithm maintains a chart C consisting of sets S_j for j = 0 to n, where each S_j contains Earley items of the form [A \to \alpha \cdot \beta, i], indicating a production A \to \alpha \beta with the dot after \alpha and origin at position i. Epsilon rules are handled naturally through prediction and completion without special cases, while the empty string (n=0) is accepted if an item [S \to \gamma \cdot, 0] completes in S_0. The grammar need not be in Chomsky normal form, though some implementations assume it for simplicity; the general form supports arbitrary productions. Acceptance occurs if [S' \to S \cdot, 0] \in S_n, where S' is an augmented start symbol with production S' \to S. Optional parse links can be stored in items for forest construction.

The high-level outline initializes S_0 with the start item and iteratively builds each S_j by applying the predictor, scanner, and completer rules in a loop until no new items are added (fixed point), processing positions sequentially from 0 to n. This ensures all possible derivations are explored efficiently, with O(n^3) time in the worst case overall, but O(n^2) for unambiguous grammars. Below is a detailed outline in a functional style, using sets to avoid duplicates. Items are represented as pairs of a dotted production and an origin, and the grammar's productions are indexed by left-hand side for efficient lookup.
function EarleyParse(G, w):
    n = length(w)
    C = array of n+1 empty sets  // C[j] = S_j
    add_item(C[0], (S' → ·S, 0))  // Augmented start: S' → S
    
    for j from 0 to n:
        agenda = C[j].copy()  // Items to process
        while agenda is not empty:
            item = agenda.pop()
            if is_predictor_applicable(item):
                apply_predictor(item, C[j], G)
            if is_scanner_applicable(item, j, n):
                apply_scanner(item, C[j], w, C)
            if is_completer_applicable(item):
                apply_completer(item, C[j], C)
    
    if (S' → S·, 0) in C[n]:
        return "Accepted"  // Optionally return parse forest via links
    else:
        return "Rejected"
The predictor rule adds anticipated items for nonterminals immediately after the dot; an epsilon production B → ε added this way can complete immediately within the same state set. It is applied when the item has a nonterminal B after the dot, adding new items only if not already present to avoid duplicates.
function apply_predictor(item, S_j, G):  // j and agenda refer to the enclosing loop in EarleyParse
    (A → α · B β, i) = item
    if B is nonterminal:
        for each production B → γ in G.productions[B]:
            new_item = (B → ·γ, j)
            if new_item not in S_j:
                add_item(S_j, new_item)
                agenda.add(new_item)  // Queue for immediate processing (e.g., epsilons)
The scanner rule advances the dot over a terminal matching the next input symbol, adding the new item to the next state set S_{j+1}. It applies only when the symbol after the dot is a terminal and j < n.
function apply_scanner(item, S_j, w, C):  // j refers to the enclosing loop in EarleyParse
    (A → α · a β, i) = item
    if a is terminal and j < n and w[j+1] == a:  // 1-based indexing for w
        new_item = (A → α a · β, i)
        if new_item not in C[j+1]:
            add_item(C[j+1], new_item)
            // No agenda push here; processed in the next j iteration
The completer rule propagates completions backward: when an item [B \to \gamma \cdot, k] is reached, it advances all prior items in S_k that anticipated B. This handles reductions and is key for recognizing completed constituents.
function apply_completer(item, S_j, C):  // j and agenda refer to the enclosing loop in EarleyParse
    (B → γ ·, k) = item  // Dot at end
    for each prior_item in C[k]:
        if prior_item has the form (A → α · B β, i):  // dot immediately before B
            new_item = (A → α B · β, i)
            if new_item not in S_j:
                add_item(S_j, new_item)
                agenda.add(new_item)  // Reprocess within the current j
For the empty string case (n=0), the loop runs only for j=0, and the predictor and completer propagate through epsilon productions to potentially complete the start item in S_0. In practice, implementations use hash sets for S_j to ensure O(1) membership checks, and a global agenda or per-set worklist avoids redundant rule applications. This directly translates the original formulation, enabling straightforward implementation in languages like Python or Java.
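As a concrete rendering of the pseudocode above, the following Python sketch implements the recognizer with plain tuples and sets. The grammar encoding (a dict of right-hand-side tuples, where a symbol is a nonterminal iff it is a key) and the inlined nullable-set fix for same-set epsilon completions (in the style of Aycock and Horspool) are choices of this example rather than part of the original 1970 formulation.

```python
def earley_recognize(grammar, start, tokens):
    """Earley recognizer sketch. Items are tuples (lhs, rhs, dot, origin)."""
    n = len(tokens)
    AUG = "<start>"  # augmented start symbol, assumed absent from the grammar

    # Precompute nullable nonterminals (for epsilon handling below).
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            if lhs not in nullable and any(
                    all(s in nullable for s in rhs) for rhs in rhss):
                nullable.add(lhs)
                changed = True

    chart = [set() for _ in range(n + 1)]
    chart[0].add((AUG, (start,), 0, 0))

    for j in range(n + 1):
        agenda = list(chart[j])

        def add(item):
            if item not in chart[j]:
                chart[j].add(item)
                agenda.append(item)

        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predictor: expand the nonterminal after the dot.
                    for prod in grammar[sym]:
                        add((sym, prod, 0, j))
                    if sym in nullable:
                        # A nullable symbol may match nothing: advance now.
                        add((lhs, rhs, dot + 1, origin))
                elif j < n and tokens[j] == sym:
                    # Scanner: terminal matches the next input token.
                    chart[j + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance items in chart[origin] waiting on lhs.
                for plhs, prhs, pdot, porg in list(chart[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        add((plhs, prhs, pdot + 1, porg))

    return (AUG, (start,), 1, 0) in chart[n]
```

For example, with `grammar = {"E": [("E", "+", "E"), ("num",)]}`, the call `earley_recognize(grammar, "E", ["num", "+", "num"])` returns True, while the ill-formed input `["num", "+"]` is rejected; the left-recursive alternative terminates because duplicate items are never re-added.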

Worked Example

To illustrate the operation of the Earley algorithm, consider the simple ambiguous context-free grammar for arithmetic expressions, which generates left- and right-associative parses for chains of additions:
  • E → E + E
  • E → num
Here, num represents a numeric terminal, + is the addition operator terminal, and the grammar is augmented with a start production S → E for convenience. The input string is "1 + 2 + 3", tokenized as the sequence of five symbols: num (for 1), +, num (for 2), +, num (for 3). The algorithm constructs state sets S_k for k = 0 to 5, where S_k contains items of the form [A → α • β, j], indicating that the production A → α β has recognized α starting from position j up to the current position k, with β remaining to be recognized. The sets are built incrementally using the prediction, scanning, and completion rules. S_0 (before any input):
  • [S → • E, 0] (initial item)
    Prediction adds the E productions:
  • [E → • E + E, 0]
  • [E → • num, 0]
No scanning or completion applies yet. S_1 (after "1"):
Scanning from [E → • num, 0] in S_0 (since the next symbol is num):
  • [E → num •, 0]
    Completion from this item (recognizing E from 0 to 1): adds to items in S_0 expecting • E:
  • [S → E •, 0]
  • [E → E • + E, 0]
S_2 (after "+"):
Scanning from [E → E • + E, 0] in S_1 (next symbol is +):
  • [E → E + • E, 0]
    Prediction from this item (• E is nonterminal):
  • [E → • E + E, 2]
  • [E → • num, 2]
No further additions at this stage. S_3 (after "2"):
Scanning from [E → • num, 2] in S_2 (next symbol is num):
  • [E → num •, 2]
    Completion (recognizing E from 2 to 3): adds to items in S_2 expecting • E:
  • [E → E + E •, 0]
  • [E → E • + E, 2] (from the predicted [E → • E + E, 2])
    Further completion from [E → E + E •, 0] (recognizing E from 0 to 3): adds to items in S_0 expecting • E:
  • [S → E •, 0]
  • [E → E • + E, 0]
    Scanning from [E → E • + E, 0] and [E → E • + E, 2] (next symbol is +) will feed S_4.
S_4 (after second "+"):
Scanning the S_3 items with the dot before + yields:
  • [E → E + • E, 0]
  • [E → E + • E, 2]
    Prediction from these items (• E is nonterminal):
  • [E → • E + E, 4]
  • [E → • num, 4]
S_5 (after "3", end of input):
Scanning from [E → • num, 4] in S_4 (next symbol is num):
  • [E → num •, 4]
    Completion (recognizing E from 4 to 5): adds to items in S_4 expecting • E:
  • [E → E + E •, 0] (from [E → E + • E, 0])
  • [E → E + E •, 2] (from [E → E + • E, 2])
    Further completion from [E → E + E •, 0] (recognizing E from 0 to 5): adds to items in S_0 expecting • E:
  • [S → E •, 0]
  • [E → E • + E, 0]
    A second path emerges here: [E → E + E •, 2] recognizes "2 + 3" as an E from 2 to 5; completing it advances [E → E + • E, 0] from S_2, yielding another instance of [E → E + E •, 0] in S_5 that links back through the first "+" to recognize the full "1 + (2 + 3)".
The presence of [S → E •, 0] in S_5 confirms acceptance: the input is validly parsed as an E starting from position 0. The ambiguity of the grammar is demonstrated by the multiple completion paths leading to [E → E + E •, 0] in S_5, one corresponding to the left-associative parse ((1 + 2) + 3) and the other to the right-associative parse (1 + (2 + 3)). These paths are traceable via the origins (the j values) and would be used in the parser extension to build a parse forest containing both trees.

Advanced Features

Building the Parse Forest

The parse forest in the Earley parser is constructed as a directed acyclic graph (DAG) that compactly represents all possible parse trees for an input string, with nodes corresponding to nonterminals or terminals and edges representing productions from the grammar. This structure, often implemented as a shared packed parse forest (SPPF), ensures that common subparses are reused across multiple derivations, preventing exponential blowup in size even for highly ambiguous grammars.

To build the forest, the algorithm traverses the completed Earley items backward from the final state set, utilizing origins (start and end positions of substrings) and backpointers (links to predecessor items) stored during recognition. For a completed item like A \to \alpha \cdot, the backpointer indicates the originating item in an earlier state set that advanced the dot, recursively assembling subtrees by following these links; predicted items use null pointers, scanned terminals create leaf nodes, and completions combine subforests from completed sub-items. This recursive process, often formalized as a procedure like build_tree that yields branches or leaves based on pointer types (e.g., predecessor pointers for scans, reduction pointers for completions), constructs the DAG in cubic time matching the recognizer's complexity.

Shared structures address ambiguity by packing alternative derivations into nodes labeled with triples (nonterminal, start, end), where edges (e.g., left/right child or pivot splits) link to shared sub-DAGs, ensuring the overall size remains at most cubic in the input length. The forest can be represented using adjacency lists for nodes or recursive data structures such as a Branch holding a nonterminal and a list of child trees, facilitating traversal for tree extraction or semantic processing.
For a simple ambiguous grammar S \to a S \mid S a \mid a parsing the string "aa", the full tree expansion would duplicate the subtree for the second "a" across two parses (S \to a S with S \to a, or S \to S a with S \to a), yielding exponential nodes; the DAG instead shares a single terminal node for the second "a" via a packed node at position (S, 0, 2), with backpointers to origins at (a, 0, 1) and (S, 1, 2) for one alternative, and (S, 0, 1) and (a, 1, 2) for the other, reducing the structure to linear size while representing both derivations.
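The sharing in this "aa" example can be illustrated with a toy node store (a hypothetical flat representation, not the binarized SPPF used in the literature): nodes are keyed by (symbol, start, end) so any derivation covering the same span reuses one shared node, and each alternative derivation is packed as one entry in the node's family.

```python
# Node store keyed by (symbol, start, end); value = list of alternatives,
# each alternative being a sequence of child node keys.
forest = {}

def get_node(symbol, start, end):
    key = (symbol, start, end)
    if key not in forest:
        forest[key] = []  # no alternatives recorded yet
    return key

def pack(symbol, start, end, children):
    """Record one derivation of `symbol` over [start, end) as a packed alternative."""
    alts = forest[get_node(symbol, start, end)]
    if children not in alts:  # identical alternatives collapse automatically
        alts.append(children)

# The two parses of "aa" under S -> a S | S a | a from the text:
a0 = get_node("a", 0, 1)
a1 = get_node("a", 1, 2)
pack("S", 0, 1, [a0])
pack("S", 1, 2, [a1])
pack("S", 0, 2, [a0, ("S", 1, 2)])   # via S -> a S
pack("S", 0, 2, [("S", 0, 1), a1])   # via S -> S a
print(len(forest[("S", 0, 2)]))      # prints 2: two packed alternatives
```

Both alternatives of the root node point at the same terminal nodes for the two "a" tokens, so the store holds five nodes in total rather than two full trees.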

Optimizations and Variants

One key optimization for the Earley parser is Leo's technique, which addresses inefficiencies on right-recursive grammars by memoizing the topmost item of each deterministic reduction path, collapsing long chains of completions into single steps. This avoids redundant completions, enabling the parser to run in linear time and space for every LR(k) grammar. To handle epsilon productions efficiently, implementations precompute the set of nullable nonterminals, allowing prediction and completion to skip over empty derivations without adding duplicate states or risking infinite loops during the main parsing loop. Similarly, unit productions are optimized by precomputing chains of unit rules, which collapses long sequences of single-nonterminal steps into direct links, reducing the number of completion operations. These techniques, collectively addressing packing of equivalent states and unpacking for tree recovery, significantly lower the practical runtime for grammars with nullables and units by minimizing chart bloat.

The base Earley algorithm exhibits a worst-case time complexity of O(n^3) for general context-free grammars, where n is the input length; however, refinements yield O(n^2) for unambiguous grammars and O(n) for deterministic grammars when combined with optimizations like Leo's. Space complexity remains O(n^2) in the worst case but drops to O(n) for linear-time scenarios. Variants of the Earley parser extend its capabilities for specific applications, such as integration with generalized LR (GLR) techniques to manage high degrees of ambiguity through nondeterministic chart expansion, achieving equivalence in expressive power while leveraging Earley's dynamic programming framework. In probabilistic parsing, the algorithm is adapted for probabilistic context-free grammars (PCFGs) by incorporating probability computations during prediction, scanning, and completion, enabling efficient calculation of prefix probabilities and the most likely parse in O(n^3) time.
Modern implementations further avoid worst-case performance through grammar preprocessing, such as eliminating unreachable nonterminals, removing epsilon rules, and substituting unit productions, which prunes the state space before parsing begins.
