LR parser
An LR parser is a bottom-up parser that performs a left-to-right scan of the input while constructing a rightmost derivation in reverse order, enabling the efficient recognition of deterministic context-free grammars in linear time.[1] This parsing technique uses a stack to manage state and a lookahead symbol (typically one token) to decide between shifting input onto the stack or reducing previously shifted symbols according to the grammar's production rules.[1] LR parsers are particularly powerful because they can handle a broader class of grammars than top-down parsers like LL, including those with left recursion, without requiring backtracking.[2]
Developed by Donald E. Knuth in 1965, LR parsing originated as a generalization of simpler precedence parsing methods to address the limitations of early compiler technologies. In his seminal paper, Knuth introduced LR(k) grammars, where k denotes the number of lookahead symbols, proving that such parsers could deterministically recognize the largest class of context-free languages parsable in linear time by a deterministic pushdown automaton. This innovation laid the foundation for practical compiler construction, as LR parsers detect syntax errors early and support mechanical generation from grammar specifications.[1]
Common variants include canonical LR(1), which uses full lookahead sets for maximum precision but results in larger tables; simple LR (SLR), a simpler approximation with smaller tables at the cost of some grammar coverage; and LALR(1), which merges lookahead sets from canonical LR to balance efficiency and power, making it suitable for real-world applications.[1] Tools like Yacc and its open-source successor Bison implement LALR(1) parsing, widely adopted in compiler front-ends for languages such as C, Java, and Python due to their ability to handle complex, real programming language grammars.[3]
Introduction
Definition and Role in Parsing
An LR parser is a type of bottom-up shift-reduce parser that processes input strings from left to right while constructing a rightmost derivation in reverse, enabling the recognition of deterministic context-free languages.[1] It operates using a deterministic finite automaton (DFA) encoded in parsing tables, where decisions for shifting, reducing, or accepting are made based on the current stack state (representing parser position) and a bounded lookahead of input symbols, typically one for LR(1) variants.[1] This mechanism was formalized by Donald Knuth in his definition of LR(k) grammars, which characterize languages parsable efficiently with k lookahead symbols.[4]
In the context of compiling, LR parsers play a crucial role in the front-end phase by analyzing source code to generate parse trees or abstract syntax trees (ASTs), verifying syntactic validity, and facilitating early error detection during a single left-to-right pass over the input.[1] They are particularly suited for programming languages, where grammars often include complex structures like expressions with operator precedence and associativity, allowing the parser to build hierarchical representations that subsequent phases (e.g., semantic analysis) can use.[4] By recognizing valid sentence structures in linear time—specifically O(n) for an input string of length n—LR parsers ensure efficient processing without rescanning prior input.[1]
Compared to top-down parsers such as LL parsers, which predict derivations from the start symbol downward, LR parsers excel in handling left-recursive grammars directly without requiring grammar transformations or backtracking, as they build the parse bottom-up from terminals.[5] For any fixed k, LR(k) parsers deterministically recognize a strictly larger class of context-free grammars than LL(k) parsers, encompassing all LL(k) grammars as a proper subset while supporting left-recursive and other constructs common in real-world languages.[4] Overall, these advantages make LR parsers a cornerstone for practical compiler design, offering mechanical generation from grammars and robust error handling.[1]
Historical Development
The LR parser was first formalized by Donald E. Knuth in 1965 as part of his foundational work on syntax-directed translation for programming languages. In his seminal paper, Knuth introduced the LR(k) parsing family, which enables deterministic bottom-up parsing of a broad class of context-free grammars using left-to-right scans and rightmost derivations in reverse, with k symbols of lookahead. This approach generalized earlier precedence parsing techniques, providing a theoretical framework for efficient, table-driven parsers that recognize deterministic context-free languages in linear time.[4]
Building on Knuth's LR(k) foundation, subsequent developments focused on practical variants to reduce computational complexity. The simplest form, LR(0), emerged as a baseline without lookahead, suitable for unambiguous grammars but limited in scope. In 1971, Frank L. DeRemer proposed Simple LR (SLR(1)), which incorporates one symbol of lookahead using FOLLOW sets to resolve ambiguities more effectively while maintaining smaller parsing tables than full LR(1). Further refinement came with Look-Ahead LR (LALR(1)), positioned as a practical compromise between power and efficiency; an efficient algorithm for its construction was detailed by DeRemer and Thomas J. Pennello in 1982, merging LR(0) states with propagated lookaheads to achieve near-LR(1) capability with significantly reduced state space.[6]
LR parsing's influence extended rapidly to compiler construction tools, notably through its integration into Yet Another Compiler-Compiler (Yacc), developed by Stephen C. Johnson at Bell Laboratories in 1975. Yacc automated LALR(1) parser generation from grammar specifications, revolutionizing software development by enabling rapid prototyping of language processors. This tool's legacy evolved into open-source implementations, including GNU Bison in 1985, which extended Yacc's functionality for broader portability and enhancements like improved error reporting.[7]
Key milestones in LR parsing highlight its recognition as a powerful method for deterministically parsing a broad class of context-free grammars, contrasting with the limitations of earlier top-down parsers. Canonical LR(1) parsers, as originally defined by Knuth, offer the highest expressiveness among the variants but at the cost of larger tables and higher construction overhead, underscoring the trade-offs that drove the adoption of SLR(1) and LALR(1) in real-world applications.[4]
Bottom-Up Parsing Basics
Shift-Reduce Mechanism
The shift-reduce mechanism forms the core of bottom-up parsing, where the parser processes input symbols from left to right, incrementally constructing a parse tree by alternating between shifting symbols onto a stack and reducing matched sequences according to grammar productions. This approach simulates the reverse of a rightmost derivation, starting from the input string and working toward the grammar's start symbol.
In the shift action, the parser moves the next input symbol (a terminal) from the input buffer onto the top of the parse stack without applying any reduction, preserving the symbol for potential future matching with production rules. This action is typically performed when the current stack top does not yet form a complete syntactic construct.[8]
The reduce action reverses a production by identifying a sequence of symbols on the stack's top that matches the right-hand side of a grammar rule and replacing it with the non-terminal on the left-hand side, thereby recognizing a larger syntactic unit and advancing the parse tree construction upward. Reductions occur only when the matched sequence qualifies as a handle in the current right-sentential form.[9]
A handle is the substring of a right-sentential form that matches the right-hand side of some production and whose reduction yields the prior right-sentential form in the rightmost derivation of the input; for unambiguous context-free grammars, every right-sentential form possesses exactly one handle, guaranteeing a deterministic reduction path in viable parses.[9]
General shift-reduce parsing is inherently non-deterministic, as multiple actions may compete at a stack configuration: a shift might continue building a construct, while one or more reduces could apply to different handles, resulting in shift-reduce conflicts (ambiguity between shifting and reducing) or reduce-reduce conflicts (ambiguity among multiple applicable productions). LR parsers resolve this non-determinism through lookahead symbols and deterministic tables, enabling efficient handling of a broad class of grammars.
For a concrete illustration, consider parsing the input "id + id" against an expression grammar with productions E → E + T, E → T, T → id (assuming $ as end marker). The parser shifts "id" (now stack: id), reduces it to T via T → id (stack: T), then reduces to E via E → T (stack: E); it shifts "+" (stack: E +), shifts the second "id" (stack: E + id), reduces to T (stack: E + T), and finally reduces to E via E → E + T (stack: E), accepting the input as a valid expression. This sequence highlights how shifts accumulate terminals and non-terminals, while reduces prune handles to form higher-level structures.[8]
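The shift-reduce sequence above can be reproduced with a short script. This is only a sketch: it finds handles by greedily matching the longest production right-hand side against the top of the stack, which happens to work for this tiny grammar but not in general — resolving such choices deterministically is exactly what the LR tables described later provide.

```python
# Tiny shift-reduce trace for the grammar  E -> E + T | T,  T -> id.
# Handle finding here is a naive longest-match heuristic, adequate only
# for this toy grammar; real LR parsers consult precomputed tables.

RULES = [("E", ["E", "+", "T"]),   # tried longest-first
         ("E", ["T"]),
         ("T", ["id"])]

def parse(tokens):
    stack, trace = [], []
    tokens = tokens + ["$"]        # append end marker
    i = 0
    while True:
        for lhs, rhs in RULES:     # try to reduce a handle on the stack top
            if stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append(f"reduce {lhs} -> {' '.join(rhs):8} stack={stack}")
                break
        else:                      # no handle matched: shift the next token
            if tokens[i] == "$":
                break
            stack.append(tokens[i])
            trace.append(f"shift  {tokens[i]:13} stack={stack}")
            i += 1
    return stack == ["E"], trace

ok, trace = parse(["id", "+", "id"])
print("\n".join(trace), "\naccepted:", ok)
```

Running this prints three shifts and four reductions, ending with the stack holding only E, exactly as traced in the paragraph above.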
The parse stack maintains the current partial parse, enabling these actions to interact seamlessly in the overall algorithm.
Parse Stack and Actions
In bottom-up parsing, the parse stack serves as the central data structure for maintaining the state of the parsing process, building the parse tree incrementally from the input string. The stack is initialized as empty or with an initial state (often denoted as state 0), representing the starting point of the grammar's augmented form. It grows and contracts dynamically as the parser processes input symbols, encoding both the sequence of grammar symbols derived so far and the current parsing configuration to enable deterministic decisions.[4]
The stack's composition alternates between grammar symbols—terminals or non-terminals—and parser states, which are indices into a finite automaton derived from the grammar. States capture the possible partial parses at each point, while symbols represent the actual input or derived elements accumulated on the right-sentential form. For instance, after processing a prefix of the input, the stack might hold a sequence like [state 0, terminal 'a', state 3, non-terminal A, state 5], where the top state determines the next action based on the current input lookahead. This interleaved structure ensures that the parser can track the viable prefixes of the language efficiently.[4]
The primary actions on the stack build upon the foundational shift-reduce mechanism of bottom-up parsing. A shift action appends the next input terminal symbol to the stack followed by the corresponding new state, extending the partial parse to include the current symbol. A reduce action identifies a complete right-hand side (RHS) of a production on the top of the stack, pops twice the length of the RHS in elements (symbols and states), pushes the production's left-hand side (LHS) non-terminal, and then pushes the state transitioned to via the LHS. Beyond these, an accept action occurs when the stack holds the start symbol and the input is fully consumed, signaling a successful parse of the entire string. Conversely, an error action is triggered if no valid shift or reduce is possible, indicating an invalid input prefix that requires recovery or rejection. These actions ensure the stack always represents a valid prefix of some right-sentential form in the grammar.[4]
Stack growth occurs during shifts, as each appends two elements (symbol and state), simulating the expansion of the input derivation. Contraction happens via reduces, where popping 2 × |RHS| elements removes the matched substring and replaces it with the single LHS non-terminal plus its state, effectively collapsing the derivation tree bottom-up. This dynamic resizing allows the parser to handle nested structures without revisiting earlier decisions. The stack's role in ambiguity resolution is crucial: by preserving the full history of symbols and states, it disambiguates multiple possible parses for the same input prefix, guiding the parser toward the unique rightmost derivation required for deterministic context-free languages.[4]
To illustrate, consider a simple evolution during parsing:
Initial: [state 0]
After shift on 'a': [state 0, 'a', state 2]
After shift on 'b': [state 0, 'a', state 2, 'b', state 4]
After reduce (e.g., A → a b): [state 0, A, state 3] // Popped 4 elements (2 symbols + 2 states), pushed A and new state
Accept: Stack holds [start symbol S, final state] with end of input
This diagram shows the stack's initialization as empty (or with state 0) and its transformation, highlighting how shifts build depth while reduces prune and restructure.[4]
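The arithmetic of this diagram — each shift pushes two elements, a reduce pops 2 × |RHS| and pushes two — can be replayed in a few lines of code. The state numbers (2, 4, 3) and the transition goto(0, A) = 3 are the same illustrative assumptions used in the diagram, not values derived from a real grammar.

```python
# Replaying the stack-evolution diagram above. State numbers and the
# goto(0, A) = 3 transition are illustrative, matching the text.

stack = [0]                             # Initial: [state 0]

def shift(symbol, new_state):
    stack.extend([symbol, new_state])   # a shift pushes two elements

def reduce(lhs, rhs_len, goto_state):
    del stack[-2 * rhs_len:]            # pop 2 * |RHS| elements
    stack.extend([lhs, goto_state])     # push LHS and its goto state

shift("a", 2)
assert stack == [0, "a", 2]
shift("b", 4)
assert stack == [0, "a", 2, "b", 4]
reduce("A", 2, 3)                       # A -> a b: pop 4, push A and state 3
assert stack == [0, "A", 3]
print(stack)
```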
Core LR Parsing Algorithm
Parser Loop and Decisions
The LR parser executes an iterative algorithm that drives the bottom-up parsing process, maintaining a stack to track the partial parse and processing the input from left to right. This loop continues until the parser reaches an accept state or encounters an error, with decisions guided by two precomputed tables: the action table, which dictates shifts, reductions, accepts, or errors based on the current state and lookahead symbol, and the goto table, which handles state transitions for nonterminals after reductions. The algorithm builds on the shift-reduce mechanism, where the stack holds alternating states and grammar symbols to represent valid prefixes of right-sentential forms.[10]
In the decision process, the parser first consults the action table using the state at the top of the stack and the current lookahead terminal from the input buffer. If the entry specifies a shift to state i, the lookahead is pushed onto the stack followed by state i, advancing the input. A reduce entry for production j (with right-hand side of length r) prompts popping 2r stack elements (symbols and states), exposing the state below, then pushing the production's left-hand-side nonterminal A and the new state from the goto table entry for that exposed state and A. An accept entry signals successful parsing, while an empty entry triggers an error, potentially invoking recovery mechanisms or halting. These table-driven decisions ensure deterministic parsing for LR grammars without backtracking.[10]
Input handling occurs incrementally within the loop: the lookahead is the next unconsumed terminal, fetched initially before the loop and updated only on shifts. Reductions do not advance the input, preserving the lookahead for subsequent decisions, which allows the parser to resolve ambiguities using bounded lookahead.[10]
Termination happens upon encountering the accept action, which occurs when the stack holds the initial state followed by the augmented start symbol and the end-of-input marker serves as lookahead, confirming the input derives the start symbol. Otherwise, an error action terminates parsing unsuccessfully, reporting a syntax error.[10]
The following pseudocode illustrates the core loop structure:
input = scanner.getNextToken()   // Initial lookahead
stack = [0]                      // Initial state (augmented grammar)
while true:
    current_state = top(stack)
    action = ACTION[current_state, input]
    if action == shift i:
        push(stack, input)
        push(stack, i)
        input = scanner.getNextToken()
    else if action == reduce j:  // Production A → β, |β| = r
        pop(stack, 2 * r)        // Remove states and symbols
        exposed_state = top(stack)
        push(stack, A)
        push(stack, GOTO[exposed_state, A])
        // Do not advance input; recheck with same lookahead
    else if action == accept:
        return "Parse successful"
    else:                        // Error
        report_syntax_error()
        // Optional: recovery or halt
This loop can be visualized as a flowchart with a central iteration node branching to shift (advance input, update stack), reduce (pop/push via goto, loop back), accept (exit success), or error (exit failure), emphasizing the table-based determinism.[11][10]
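A runnable version of this loop is sketched below, with ACTION and GOTO tables written out by hand for the toy grammar E → E + n | n. The state numbering is an illustrative assumption, not derived in the text.

```python
# Table-driven LR parser loop for the toy grammar  E -> E + n | n.
# ACTION maps (state, terminal) to shift/reduce/accept; missing entries
# are errors. GOTO maps (state, nonterminal) to the next state.

ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "+"): ("reduce", "E", 1),   # E -> n
    (2, "$"): ("reduce", "E", 1),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", "E", 3),   # E -> E + n
    (4, "$"): ("reduce", "E", 3),
}
GOTO = {(0, "E"): 1}

def lr_parse(tokens):
    stream = iter(tokens + ["$"])
    lookahead = next(stream)
    stack = [0]                           # initial state
    while True:
        entry = ACTION.get((stack[-1], lookahead))
        if entry is None:                 # blank entry: syntax error
            return False
        if entry[0] == "shift":
            stack += [lookahead, entry[1]]
            lookahead = next(stream)      # only shifts consume input
        elif entry[0] == "reduce":
            _, lhs, rhs_len = entry
            del stack[-2 * rhs_len:]      # pop symbols and states
            stack += [lhs, GOTO[(stack[-1], lhs)]]
        else:                             # accept
            return True

print(lr_parse(["n", "+", "n"]))   # True
print(lr_parse(["n", "+"]))        # False
```

Note that reductions leave the lookahead untouched, mirroring the pseudocode: the same token drives the next table consultation until a shift finally consumes it.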
Handles and Reductions
In bottom-up parsing, particularly within the LR framework, a handle is formally defined as the right-hand side (RHS) of a production in a rightmost derivation of the input string, specifically the substring that corresponds to the production applied in the reverse step of the derivation process.[4] This identification ensures that reductions mimic the inverse of the rightmost derivation, preserving the structure of the parse tree while building it bottom-up.[12] The handle's position on the parser stack represents the viable prefix that can be reduced to advance toward the start symbol.
The reduction process in an LR parser involves recognizing a handle on the top of the stack, popping the stack symbols corresponding to the RHS β of the production A → β, pushing the left-hand side (LHS) A onto the stack, and transitioning to a new state via the goto function based on the exposed state and A.[4] For a handle with RHS β of length |β|, the parser pops 2 \times |β| elements from the stack, corresponding to the |β| symbols in β and the |β| states immediately following each symbol.[12] This operation effectively replaces the handle with A, updating the stack configuration to reflect the reduced sentential form.
In grammars with right recursion, such as list definitions like L → x L | x, multiple reductions may occur consecutively without intervening shifts once the end of the construct is reached, as each reduction exposes another handle on the stack.[4] This sequence handles the recursive structure by iteratively reducing inner productions before outer ones, ensuring the parser correctly associates right-recursive constructs without backtracking. Detailed step-by-step illustrations of handles and reductions for specific grammars are provided in later sections.
LR Parser Variants
LR(0) Parsers
LR(0) parsers represent the simplest form of LR parsers, relying exclusively on the current parser state to determine shift or reduce actions without examining any lookahead symbols. Introduced by Donald Knuth in his seminal work on bottom-up parsing, these parsers process input from left to right while constructing a rightmost derivation in reverse, using a stack to manage viable prefixes of the grammar.[4][1]
Central to LR(0) parsing are LR(0) items, which consist of a grammar production with a dot (·) marking the current position in the right-hand side, indicating how much of the production has been recognized. For example, given a production A \to \alpha \beta, possible items include A \to \cdot \alpha \beta (before parsing begins) and A \to \alpha \cdot \beta (after recognizing \alpha). These items form the basis of parser states, where each state is a set of such items representing all possible partial parses at that point. Decisions in an LR(0) parser depend solely on the item set in the current state, aligning with the general LR algorithm's shift-reduce mechanism but without lookahead refinement.[13][14][1]
The construction of an LR(0) parser begins with an augmented grammar, adding a new start production S' \to S to ensure a unique handle for the entire input. States are generated as the canonical collection of sets of LR(0) items, starting from the initial state containing S' \to \cdot S and its closure (all items derivable via epsilon transitions on nonterminals). Transitions between states are computed using the goto function: for a state I and symbol X (terminal or nonterminal), the goto state is the closure of items obtained by advancing the dot past X in relevant items from I. This process yields a deterministic finite automaton (DFA) whose states correspond to parser configurations and edges to shift actions on symbols. Reduce actions occur in states containing complete items A \to \gamma \cdot, and the accept action in the state with S' \to S \cdot. The resulting action and goto tables drive the parser loop.[4][13][14]
In terms of parsing power, LR(0) parsers recognize exactly the class of LR(0) grammars, a proper subset of the more general LR(1) grammars that allow one symbol of lookahead. This limited scope means LR(0) parsers handle only unambiguous context-free grammars where no shift-reduce or reduce-reduce conflicts arise during table construction, making them unsuitable for many practical grammars that require lookahead to resolve ambiguities, such as those involving operator precedence or the dangling else problem. A grammar is formally LR(0) if, in the DFA constructed from its LR(0) items, no state contains both a complete item (calling for a reduce) and an item with the dot before a terminal (calling for a shift), and no state contains multiple complete items (which would constitute a reduce-reduce conflict).[4][13][14]
LR(0) parsers offer advantages in simplicity and efficiency for the grammars they can handle: their tables are smaller and quicker to construct than those of full LR(1) parsers, as they avoid propagating lookahead sets, and they enable linear-time parsing with early error detection. However, these benefits come at the cost of reduced expressiveness; most real-world programming language grammars introduce conflicts in LR(0) tables, necessitating enhancements like SLR(1) or LALR(1) to resolve them without exponentially increasing table size. In practice, LR(0) parsers are rarely used standalone due to their frequent conflicts, though the underlying automaton serves as a foundation for more powerful variants.[1][13][14]
SLR(1) and LALR(1) Parsers
SLR(1) parsers extend LR(0) parsers by incorporating a single token of lookahead to resolve reduce actions more precisely, using FOLLOW sets derived from the grammar. In an SLR(1) parser, for a state containing an item A \to \alpha \bullet (indicating a completed production), a reduce action is placed in the action table for lookahead symbol a if a \in \text{FOLLOW}(A). This approach allows SLR(1) to handle a broader class of grammars than LR(0) by avoiding some shift-reduce conflicts where the lookahead distinguishes between shifting and reducing. The construction begins with the LR(0) automaton and augments the action entries based on FOLLOW sets, making it simpler to implement than full lookahead methods.[15]
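The FOLLOW sets that drive SLR(1) reduce entries can be computed by a simple fixpoint iteration. The sketch below does this for the pointer-style grammar S → L = R | R, L → * R | id, R → L (the example grammar discussed later in this section); it assumes no ε-productions, so FIRST of a symbol string is just FIRST of its first symbol.

```python
# Fixpoint computation of FOLLOW sets for the grammar
#   S -> L = R | R,  L -> * R | id,  R -> L
# Assumes no epsilon productions (FIRST of a string = FIRST of its head).

GRAMMAR = [("S", ["L", "=", "R"]), ("S", ["R"]),
           ("L", ["*", "R"]), ("L", ["id"]),
           ("R", ["L"])]
NONTERMINALS = {"S", "L", "R"}

def first(symbol, seen=frozenset()):
    if symbol not in NONTERMINALS:
        return {symbol}                           # terminal: FIRST is itself
    out = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol and rhs[0] not in seen:  # guard against cycles
            out |= first(rhs[0], seen | {symbol})
    return out

def follow_sets():
    follow = {n: set() for n in NONTERMINALS}
    follow["S"].add("$")                 # end marker for the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                # A -> alpha B beta: FIRST(beta) goes into FOLLOW(B);
                # A -> alpha B:      FOLLOW(A)  goes into FOLLOW(B).
                new = first(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow

print(follow_sets())   # '=' lands in FOLLOW(R): the root of the SLR conflict
```

The output shows = ∈ FOLLOW(R), which is precisely what forces the SLR(1) shift-reduce conflict analyzed in the example below.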
LALR(1) parsers achieve greater precision by building on LR(1) items but merging states that share the same core (the set of LR(0) items), then propagating and unioning the lookahead sets from the original LR(1) states. This merging reduces the number of states compared to canonical LR(1) while retaining much of its parsing power, resulting in compact tables suitable for practical use. The lookahead sets for reductions are computed more accurately than in SLR(1), as they reflect context-specific possibilities rather than global FOLLOW sets. LALR(1) forms the basis for tools like Yacc and Bison, which generate efficient parsers for context-free grammars in compiler construction.[16][17]
SLR(1) is simpler to construct due to its reliance on FOLLOW sets but often introduces more conflicts, as these sets may over-approximate the lookaheads valid in a particular state. In contrast, LALR(1) resolves many such conflicts by using propagated lookaheads, allowing it to parse a strict superset of the SLR(1) grammars with far fewer states than full LR(1). The power hierarchy is LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1); SLR(1) and LALR(1) parsers share the same states (those of the LR(0) automaton) and differ only in their reduce entries, so LALR(1) gains power at no cost in table size. LALR(1) is sufficient for most practical parsing problems, including the grammars of many programming languages.[18][19]
A representative example illustrates the difference in conflict resolution: consider the grammar
S' → S
S → L = R | R
L → * R | id
R → L
In the SLR(1) parser, a shift-reduce conflict arises in the state containing R \to L \bullet because = \in \text{FOLLOW}(R), prompting a reduce on lookahead =, which conflicts with the shift of = from S \to L \bullet = R. In the LALR(1) parser, however, the lookahead propagated for reducing R \to L in this state is only the end-of-input marker $ (not =), so the conflict disappears and the parser correctly shifts on =. This demonstrates how LALR(1)'s context-specific lookaheads rule out spurious reductions that SLR(1)'s global FOLLOW sets cannot.[20]
Parsing Table Construction
Augmented Grammar and Items
In LR parsing, the grammar is first augmented to facilitate the construction of a deterministic parser. This involves introducing a new nonterminal symbol S', distinct from the original start symbol S, and adding a single production S' \to S. The resulting augmented grammar G' ensures a unique initial configuration for the parser and a distinct accept state upon recognizing the complete input string followed by the end marker.[21] This augmentation prevents ambiguities in handling the start of the parse and clearly signals acceptance when the item [S' \to S \bullet] is reached, where \bullet denotes the position after the entire right-hand side has been processed.
Central to LR parsing are LR items, which represent partial derivations in the grammar and model the parser's state of knowledge during recognition. An LR item consists of a production rule with a dot \bullet inserted at some position in the right-hand side, indicating how much of the production has been matched so far. For a production A \to \alpha \beta, where \alpha and \beta are strings of grammar symbols, possible items include [A \to \bullet \alpha \beta] (no progress made), [A \to \alpha \bullet \beta] ( \alpha matched, \beta pending), and [A \to \alpha \beta \bullet] (complete match). The initial item for the augmented grammar is [S' \to \bullet S], signifying the parser's starting point. These items capture the parser's expectations: the symbols before the dot have been recognized on the input, while those after suggest future actions like shifting terminals or reducing nonterminals.[21]
LR items are categorized into kernels and closures to build the states of the LR automaton efficiently. The kernel of a set of items comprises those where the dot is not at the beginning of the right-hand side, plus the initial item [S' \to \bullet S], reflecting the core progress from prior transitions (e.g., items like [A \to \alpha \bullet \beta] with \alpha nonempty). In contrast, the closure expands this kernel by incorporating additional items derived from any nonterminal immediately following the dot in kernel items, using the grammar's productions to anticipate possible expansions (e.g., if a kernel item has [A \to \alpha \bullet B \beta], and B \to \gamma is a production, then [B \to \bullet \gamma] is added to the closure). This distinction allows compact representation of parser states: kernels identify unique configurations reached via symbol transitions, while closures complete the set by including all viable next derivations, enabling the parser to handle nondeterminism deterministically. The purpose of these elements is to formalize the positions in potential parses, providing a foundation for computing transitions and decisions without backtracking.[21]
For example, consider a simple augmented grammar with original start S \to A and A \to a. The initial set containing [S' \to \bullet S] has that item as its kernel (by convention, the initial item counts as a kernel item despite its dot position), and its closure adds [S \to \bullet A] and [A \to \bullet a], modeling the full set of possibilities at the outset. This structure underpins all LR variants by representing the parser's localized view of the grammar at each step.[21]
Closure and State Computation
In LR parsing, the closure operation computes the complete set of valid items reachable from an initial set of LR(0) items, ensuring the parser state captures all possible expansions of nonterminals immediately following the current prefix. Given a set of items I, the closure \operatorname{cl}(I) is formed by initializing with the items in I and then iteratively adding, for every item [A \to \alpha \cdot B \beta] in the current set (where B is a nonterminal), all items of the form [B \to \cdot \gamma] for each production B \to \gamma in the grammar. This process continues recursively until no new items are added, reaching a fixed point. The result represents all productions that could potentially be derived after the prefixes encoded in I, akin to epsilon-like expansions in nondeterministic finite automata.[22]
The goto function defines transitions between parser states based on grammar symbols. For a set of items I and a symbol X (terminal or nonterminal), \operatorname{goto}(I, X) is the closure of the set \{ [A \to \alpha X \cdot \beta] \mid [A \to \alpha \cdot X \beta] \in I \}. This moves the dot past X in all applicable items, forming new items, and then applies the closure operation to include any further expansions. The goto function thus simulates the shift or reduction transitions in the parser's deterministic finite automaton (DFA).[22]
To construct the full set of parser states, begin with the initial state I_0 = \operatorname{cl}(\{ [S' \to \cdot S] \}), where S' \to S is the augmented start production and S is the grammar's start symbol. Subsequent states are generated by applying \operatorname{goto}(I, X) for every existing state I and every grammar symbol X, collecting unique sets until no new states are produced. This enumeration can be performed using breadth-first search (BFS) or depth-first search (DFS) to explore the state graph: start a queue or stack with I_0, compute gotos from the current state, add unvisited results as new states, and record transitions until exhaustion. The resulting states, numbered sequentially from 0 to n, form the canonical collection of LR(0) item sets, which underlies the parser's DFA.[22]
As an illustrative example, consider a simple augmented grammar with productions S' \to S, S \to A, and A \to a B, B \to b. Starting with the initial item set I = \{ [S' \to \cdot S] \}, the closure adds [S \to \cdot A] since the dot precedes nonterminal S. Then, from [S \to \cdot A], add [A \to \cdot a B]. No further nonterminals follow the dot in the new item, so \operatorname{cl}(I) = \{ [S' \to \cdot S], [S \to \cdot A], [A \to \cdot a B] \}. Now, applying \operatorname{goto}(\operatorname{cl}(I), A) yields \{ [S \to A \cdot] \}, whose closure remains the same since no nonterminal follows the dot. This demonstrates how closure expands the state to include potential derivations, while goto advances the parse position.[22]
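The closure and goto computations for this example can be written directly in code. In the sketch below, an item is a (lhs, rhs, dot) triple and a state is a frozenset of items; the final loop enumerates the canonical collection of item sets from I_0.

```python
# Closure, goto, and canonical-collection enumeration for the example
# grammar  S' -> S,  S -> A,  A -> a B,  B -> b.
# Items are (lhs, rhs, dot) triples; states are frozensets of items.

GRAMMAR = [("S'", ("S",)), ("S", ("A",)), ("A", ("a", "B")), ("B", ("b",))]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """cl(I): add [B -> . gamma] for every nonterminal B after a dot."""
    items = set(items)
    while True:
        new = {(lhs, rhs, 0)
               for _, r, d in items if d < len(r) and r[d] in NONTERMINALS
               for lhs, rhs in GRAMMAR if lhs == r[d]}
        if new <= items:          # fixed point reached
            return frozenset(items)
        items |= new

def goto(items, symbol):
    """goto(I, X): advance the dot past X, then take the closure."""
    return closure({(l, r, d + 1)
                    for l, r, d in items if d < len(r) and r[d] == symbol})

I0 = closure({("S'", ("S",), 0)})
assert I0 == {("S'", ("S",), 0), ("S", ("A",), 0), ("A", ("a", "B"), 0)}
assert goto(I0, "A") == {("S", ("A",), 1)}

# Enumerate all states reachable from I0 (worklist search).
symbols = {s for _, rhs in GRAMMAR for s in rhs}
states, work = [I0], [I0]
while work:
    I = work.pop()
    for X in symbols:
        J = goto(I, X)
        if J and J not in states:
            states.append(J)
            work.append(J)
print(len(states), "states")
```

For this grammar the enumeration yields six states, matching what one obtains by hand: I_0 plus one state per dot advance over S, A, a, B, and b.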
Table Filling and Conflicts
Action and Goto Table Entries
The action and goto tables form the core decision mechanism in an LR parser, dictating shifts, reductions, accepts, or errors based on the current stack-top state and the next input symbol. These tables are populated after computing the set of parser states, each represented by a collection of LR items, using the goto function to determine transitions on grammar symbols. The rows of both tables correspond to the numbered states (typically from 0 to the total number of states), while the columns of the action table align with terminal symbols (including the end marker $), and the columns of the goto table align with non-terminal symbols.
In the action table, for a given state i and terminal a, the entry is set to "shift j" if the goto function applied to state i and symbol a yields a new state j; this instructs the parser to push a onto the stack and transition to state j. For reduction actions, if state i contains a completed item [A \to \beta \bullet] (where \beta is the right-hand side of production A \to \beta), the entry action[i, a] is set to "reduce A \to \beta" for each terminal a in the appropriate lookahead set; in LR(0) parsers, this applies to all terminals, whereas non-LR(0) variants like SLR(1), LALR(1), or LR(1) restrict reductions to lookaheads in FOLLOW(A) or propagated lookahead sets to ensure validity. The accept entry is specifically placed in the action table for the state containing the item [S' \to S \bullet] (where S' is the augmented start symbol and S the original start), setting action[i, $] to "accept" to indicate successful parsing upon encountering the end-of-input marker.
The goto table is populated exclusively for non-terminals: for state i and non-terminal A, the entry goto[i, A] is set to j if the goto function from state i on A yields state j; this transition occurs after a reduction, pushing A onto the stack and moving to j. Entries in either table that remain undefined after processing all states and symbols are marked as "error," triggering parse failure handling.
The table-filling process can be formalized in pseudocode as follows, iterating over the computed states and symbols:
for each state i in 0 to number_of_states - 1:
    // Handle shifts for terminals
    for each terminal a:
        j = goto(I_i, a)   // I_i is the item set for state i
        if j is defined:
            action[i][a] = "shift j"
        else:
            action[i][a] = "error"
    // Handle reductions and accept
    for each completed item [A → β •] in I_i where A ≠ S':
        for each a in lookahead_set(A → β):   // e.g., FOLLOW(A) in SLR(1), all terminals in LR(0)
            action[i][a] = "reduce A → β"
    if [S' → S •] in I_i:
        action[i][$] = "accept"
    // Handle gotos for non-terminals
    for each non-terminal A:
        j = goto(I_i, A)
        if j is defined:
            goto[i][A] = j
        // No else needed; undefined gotos are handled during parsing
This construction ensures deterministic behavior for LR(k) grammars, as originally formalized for general left-to-right parsing with bounded lookahead.
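The table-filling loop can be made runnable for a tiny augmented grammar (S' → S, S → A, A → a B, B → b), using SLR-style FOLLOW-restricted reductions. This is a self-contained sketch under illustrative conventions; all names are hypothetical, and the FOLLOW sets (trivially {$} for this grammar) are hardcoded rather than computed:

```python
# Grammar (illustrative):  S' -> S,  S -> A,  A -> a B,  B -> b
GRAMMAR = {"S'": [("S",)], "S": [("A",)], "A": [("a", "B")], "B": [("b",)]}
NONTERMS = set(GRAMMAR)
TERMS = {"a", "b", "$"}
FOLLOW = {"S": {"$"}, "A": {"$"}, "B": {"$"}}  # every FOLLOW set is {$} here

def closure(items):
    items, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMS:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)

def goto_set(items, sym):
    return closure({(l, r, d + 1) for (l, r, d) in items
                    if d < len(r) and r[d] == sym})

# Enumerate the canonical LR(0) collection breadth-first from I0.
I0 = closure({("S'", ("S",), 0)})
states, queue = [I0], [I0]
while queue:
    I = queue.pop(0)
    for sym in sorted(TERMS | NONTERMS):
        J = goto_set(I, sym)
        if J and J not in states:
            states.append(J)
            queue.append(J)

# Fill ACTION and GOTO, mirroring the pseudocode above.
action, goto_tbl = {}, {}
for i, I in enumerate(states):
    for a in TERMS:                      # shifts for terminals
        J = goto_set(I, a)
        if J:
            action[i, a] = ("shift", states.index(J))
    for (lhs, rhs, dot) in I:            # reductions and accept
        if dot == len(rhs):
            if lhs == "S'":
                action[i, "$"] = ("accept",)
            else:
                for a in FOLLOW[lhs]:
                    assert (i, a) not in action, "conflict"
                    action[i, a] = ("reduce", lhs, rhs)
    for A in NONTERMS:                   # gotos for non-terminals
        J = goto_set(I, A)
        if J:
            goto_tbl[i, A] = states.index(J)
```

The internal assertion plays the role of conflict detection: a grammar outside the SLR(1) class would trip it during table filling rather than at parse time.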
Shift-Reduce and Reduce-Reduce Conflicts
In LR parsing, a shift-reduce conflict arises in a parser state where the LR automaton contains an item of the form A \to \alpha \cdot a \beta, indicating a potential shift action on terminal a, alongside a completed item B \to \gamma \cdot whose lookahead set includes a, suggesting a reduce action instead. This ambiguity forces the parser to choose between shifting the next input symbol onto the stack or reducing the handle using the production for B. A classic illustration is the "dangling else" problem in grammars for conditional statements, such as:
stmt → if expr then stmt
| if expr then stmt else stmt
| other
In a state after parsing if expr then stmt, the parser encounters else and must decide whether to shift it (to associate the else with the inner if) or reduce the inner statement (associating the else with the outer if).[21]
A reduce-reduce conflict occurs when a single parser state includes two or more completed items, such as A \to \beta \cdot and C \to \delta \cdot, with overlapping lookahead sets that permit reduction by either production on the same input symbol. This situation indicates that the grammar is not LR(k) for the given lookahead, as the parser cannot deterministically select which reduction to perform without additional context. Such conflicts are less common than shift-reduce but signal inherent nondeterminism in the grammar's structure.[21]
Conflicts are detected during the construction and filling of the action and goto tables, where multiple possible entries—such as both a shift and a reduce, or two reduces—are generated for the same pair of state and input symbol. Parser generators like Yacc report these as errors unless resolved, ensuring the table remains deterministic.[17]
To resolve shift-reduce and reduce-reduce conflicts in LR(0) and SLR(1) parsers, lookahead restrictions using FOLLOW sets are applied: a reduce action for B \to \gamma is only entered for symbols in FOLLOW(B), preventing overlap with shift actions where possible.[21] In LALR(1) parsers, more precise lookahead propagation merges sets from contributing LR(1) states while avoiding spurious conflicts, as described in algorithms that compute propagated lookaheads via graph traversal on the LR(0) automaton.[23] These methods increase the class of parsable grammars without full LR(1) table sizes.
Grammar refactoring offers a fundamental approach to eliminate conflicts entirely, such as declaring operator precedences and associativities to guide shift-reduce resolutions or rewriting ambiguous productions to enforce a unique parse tree.[17] For the dangling else, associating else with the nearest if via precedence (e.g., else higher than then) or restructuring the grammar with a matched/unmatched statement hierarchy resolves the conflict while preserving the intended semantics.[21]
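For concreteness, the matched/unmatched restructuring commonly takes the following form (a standard textbook formulation, reproduced here for illustration), in which every else must attach to an if whose then-branch is fully matched, so the nearest-if association falls out of the grammar itself:

stmt → matched | unmatched
matched → if expr then matched else matched
| other
unmatched → if expr then stmt
| if expr then matched else unmatched

Because an unmatched statement can never appear between then and else, the table construction for this grammar produces no shift-reduce conflict on else.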
Practical Examples
Grammar for Arithmetic Expressions
A standard context-free grammar for simple arithmetic expressions, commonly used as a running example in LR parsing, defines expressions with addition and multiplication operators while respecting operator precedence and left associativity.[21]
The grammar rules are as follows:
- E \to E + T \mid T
- T \to T * F \mid F
- F \to (E) \mid \text{id}
Here, E represents an expression, T a term (capturing multiplication precedence), and F a factor (handling parentheses and atomic values). The terminals are the operators + and *, the parentheses ( and ), identifiers \text{id} (e.g., variables or constants), and the end marker $. The non-terminals are E (the start symbol), T, and F.[21]
This grammar avoids ambiguity by using left-recursive productions for both E and T, which enforce left associativity for the operators; for instance, the expression \text{id} + \text{id} + \text{id} is parsed as (\text{id} + \text{id}) + \text{id}, and multiplication takes precedence over addition due to the hierarchical structure of non-terminals.[21]
To illustrate, consider the input string \text{id} * \text{id} + \text{id}. A rightmost derivation from the start symbol E proceeds as follows, replacing the rightmost non-terminal at each step and ultimately yielding the string with the desired grouping (\text{id} * \text{id}) + \text{id}:
- E \Rightarrow E + T
- \Rightarrow E + F
- \Rightarrow E + \text{id}
- \Rightarrow T + \text{id}
- \Rightarrow T * F + \text{id}
- \Rightarrow T * \text{id} + \text{id}
- \Rightarrow F * \text{id} + \text{id}
- \Rightarrow \text{id} * \text{id} + \text{id}
This derivation respects the grammar's precedence and associativity rules.[21]
Step-by-Step Table Construction
To construct the parsing table for the arithmetic expression grammar using SLR(1), begin by augmenting the grammar with a new start production E' → E, where E is the original start symbol; this ensures a unique starting state and facilitates end-of-input handling.[24] The original grammar is E → E + T | T, T → T * F | F, F → (E) | id, with productions numbered as follows: (1) E → E + T, (2) E → T, (3) T → T * F, (4) T → F, (5) F → (E), (6) F → id, and the augmented production (0) E' → E.[24]
The initial state, State 0, starts with the item [E' → •E] and is expanded using the closure operation, which adds all items derivable by replacing the nonterminal after the dot with its productions and repeating until no new items are added; this yields State 0: [E' → •E], [E → •E + T], [E → •T], [T → •T * F], [T → •F], [F → •(E)], [F → •id].[24] From State 0, compute goto transitions on symbols: goto on E leads to State 1 with kernel [E' → E•] and closure adding [E → E• + T]; goto on T leads to State 2 with [E → T•], [T → T• * F]; goto on F leads to State 3 with [T → F•]; goto on ( leads to State 4, a copy of State 0 but shifted for the parenthesis; and goto on id leads to State 5 with [F → id•].[24]
Continue enumerating states by applying goto from existing states until no new states are generated, resulting in 12 states total: State 6 from State 1 or 8 on + with [E → E + •T] and closures for T and F; State 7 from State 2 or 9 on * with [T → T * •F] and closures for F; State 8 from State 4 on E with [F → (E •)] and [E → E• + T]; State 9 from State 6 on T with [E → E + T•], [T → T• * F]; State 10 from State 7 on F with [T → T * F•]; and State 11 from State 8 on ) with [F → (E)•].[24] These states form the canonical collection of LR(0) items, capturing all possible parser configurations for this grammar.[24]
To fill the ACTION table, for each state and terminal, place a shift action s_k if there is a goto transition on that terminal to state k; place a reduce action r_i on terminals in the FOLLOW set of the nonterminal in a completed item A → α• (production i) if the state contains such an item; and place accept in State 1 on $ (the end marker).[24] The FOLLOW sets are FOLLOW(E') = {$}, FOLLOW(E) = {$, ), +}, FOLLOW(T) = {$, ), +, *}, and FOLLOW(F) = {$, ), +, *}, computed via the standard FIRST and FOLLOW algorithms to resolve reduces in SLR(1).[24] For example, in State 0, shift ( to s4 and id to s5; in State 5 ([F → id•]), reduce r6 (F → id) on every symbol in FOLLOW(F) = {$, ), +, *}; in State 3 ([T → F•]), reduce r4 (T → F) on FOLLOW(T).[24] The GOTO table entries are placed for nonterminals as the state numbers from the transitions.[24]
The resulting SLR(1) tables are as follows, where the ACTION table uses terminals {+, *, (, ), id, $} and the GOTO table uses nonterminals {E, T, F}; SLR adjustments appear in the reduce actions, which are placed solely based on FOLLOW sets rather than more precise lookaheads, potentially introducing conflicts in less suitable grammars but none here.[24]
ACTION Table:
| State | + | * | ( | ) | id | $ |
|---|---|---|---|---|---|---|
| 0 | | | s4 | | s5 | |
| 1 | s6 | | | | | accept |
| 2 | r2 | s7 | | r2 | | r2 |
| 3 | r4 | r4 | | r4 | | r4 |
| 4 | | | s4 | | s5 | |
| 5 | r6 | r6 | | r6 | | r6 |
| 6 | | | s4 | | s5 | |
| 7 | | | s4 | | s5 | |
| 8 | s6 | | | s11 | | |
| 9 | r1 | s7 | | r1 | | r1 |
| 10 | r3 | r3 | | r3 | | r3 |
| 11 | r5 | r5 | | r5 | | r5 |
GOTO Table:
| State | E | T | F |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | | | |
| 2 | | | |
| 3 | | | |
| 4 | 8 | 2 | 3 |
| 5 | | | |
| 6 | | 9 | 3 |
| 7 | | | 10 |
| 8 | | | |
| 9 | | | |
| 10 | | | |
| 11 | | | |
This table enables deterministic parsing without conflicts, confirming the grammar's SLR(1) suitability.[24]
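To make the tables concrete, the following Python sketch transcribes them as dictionaries and runs the standard LR driver loop over them. The stack holds only state numbers, a common simplification since the grammar symbols are implicit in the states; all identifiers are illustrative:

```python
# Productions 1-6 as (lhs, rhs_length), numbered as in the text:
# 1: E -> E + T   2: E -> T   3: T -> T * F   4: T -> F   5: F -> (E)   6: F -> id
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1),
         5: ("F", 3), 6: ("F", 1)}

ACTION = {
    (0, "("): ("s", 4), (0, "id"): ("s", 5),
    (1, "+"): ("s", 6), (1, "$"): ("acc",),
    (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
    (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
    (4, "("): ("s", 4), (4, "id"): ("s", 5),
    (5, "+"): ("r", 6), (5, "*"): ("r", 6), (5, ")"): ("r", 6), (5, "$"): ("r", 6),
    (6, "("): ("s", 4), (6, "id"): ("s", 5),
    (7, "("): ("s", 4), (7, "id"): ("s", 5),
    (8, "+"): ("s", 6), (8, ")"): ("s", 11),
    (9, "+"): ("r", 1), (9, "*"): ("s", 7), (9, ")"): ("r", 1), (9, "$"): ("r", 1),
    (10, "+"): ("r", 3), (10, "*"): ("r", 3), (10, ")"): ("r", 3), (10, "$"): ("r", 3),
    (11, "+"): ("r", 5), (11, "*"): ("r", 5), (11, ")"): ("r", 5), (11, "$"): ("r", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
        (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def parse(tokens):
    """Return the sequence of production numbers used, or raise SyntaxError."""
    stack, reductions = [0], []
    tokens = tokens + ["$"]
    pos = 0
    while True:
        act = ACTION.get((stack[-1], tokens[pos]))
        if act is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} in state {stack[-1]}")
        if act[0] == "acc":
            return reductions
        if act[0] == "s":                      # shift: push state, consume token
            stack.append(act[1])
            pos += 1
        else:                                  # reduce by production act[1]
            lhs, rhs_len = PRODS[act[1]]
            del stack[len(stack) - rhs_len:]   # pop one state per RHS symbol
            stack.append(GOTO[stack[-1], lhs])
            reductions.append(act[1])
```

For the input id * id + id, parse returns the reduction sequence [6, 4, 6, 3, 2, 6, 4, 1], i.e. F → id, T → F, F → id, T → T * F, E → T, F → id, T → F, E → E + T, matching the trace in the next subsection.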
Parsing Process Illustration
To illustrate the operation of an LR parser during input processing, consider the sample input token sequence "id * id + id" appended with the end-of-input marker $, processed using the SLR(1) action and goto tables constructed from the augmented arithmetic expression grammar (as detailed in prior sections on table construction). This input represents a valid expression with multiplication taking precedence over addition. The simulation follows the standard LR parsing algorithm, which iteratively consults the tables based on the current stack top state and lookahead symbol to perform shift, reduce, or accept actions until the input is fully processed.
The process initializes with the stack containing solely the start state 0 (denoted [0]) and the input pointer at the first token "id", making "id" the initial lookahead symbol. In state 0, the action table entry action[0, id] = s5 specifies a shift: the parser pushes the symbol "id" and state 5 onto the stack, yielding [0 id 5], and advances the input pointer to "*". With the stack top at state 5 and lookahead "*", action[5, *] = r6 (reduce by F → id) is applied, since "*" ∈ FOLLOW(F); the parser pops two stack elements (id and state 5), exposes state 0, and applies goto[0, F] = 3 to push "F" and state 3, yielding [0 F 3]; the lookahead remains "*". Next, in state 3 with lookahead "*", action[3, *] = r4 (reduce by T → F) applies (using FOLLOW(T) in SLR); pop two (F and 3), goto[0, T] = 2, stack [0 T 2], lookahead "*".
In state 2 with lookahead "*", action[2, *] = s7 triggers a shift, appending "*" and state 7 to the stack as [0 T 2 * 7], advancing to "id". With lookahead "id", action[7, id] = s5 shifts, resulting in [0 T 2 * 7 id 5], advancing to "+". In state 5 with lookahead "+", action[5, +] = r6 (reduce F → id); pop two (id and 5), goto[7, F] = 10, stack [0 T 2 * 7 F 10], lookahead "+". Immediately, in state 10 with lookahead "+", action[10, +] = r3 (reduce T → T * F); pop six elements (F 10, * 7, T 2, for RHS length 3), exposing state 0; goto[0, T] = 2, stack [0 T 2], lookahead "+". This reduction recognizes "id * id" as a term T, enforcing multiplication precedence and left associativity.
With the stack top at state 2 and lookahead "+", action[2, +] = r2 (reduce E → T, since "+" ∈ FOLLOW(E)); pop two (T and 2), goto[0, E] = 1, stack [0 E 1], lookahead "+". In state 1 with lookahead "+", action[1, +] = s6 shifts, appending "+" and state 6: [0 E 1 + 6], advancing to "id". Action[6, id] = s5 shifts "id" to [0 E 1 + 6 id 5], lookahead "$". In state 5 with lookahead "$", action[5, $] = r6 reduces F → id; pop two, goto[6, F] = 3, stack [0 E 1 + 6 F 3], lookahead "$". Then in state 3, action[3, $] = r4 reduces T → F; pop two, goto[6, T] = 9, stack [0 E 1 + 6 T 9], lookahead "$". In state 9 with lookahead "$", action[9, $] = r1 (reduce E → E + T); pop six (T 9, + 6, E 1), exposing state 0; goto[0, E] = 1, stack [0 E 1], lookahead "$". Finally, action[1, $] = accept signals successful parsing.
This sequence demonstrates how table-driven decisions, guided by FOLLOW sets in SLR(1), handle precedence (reducing multiplication before addition) and associativity without backtracking.[24]
Stack Evolution and Reductions
In an LR parser, the stack serves as the core data structure for tracking the partial derivation during bottom-up parsing, alternating between parser states (which encode possible future actions) and grammar symbols (terminals or nonterminals). Shifts push a lookahead terminal and its corresponding state onto the stack, advancing the parse along potential productions, while reductions identify a handle—a rightmost substring matching a production's right-hand side—pop the associated symbols and states (2 times the length of the right-hand side), and push the left-hand side nonterminal with the goto state computed from the exposed state below the handle. This process ensures unambiguous recognition of viable prefixes, with the stack depth reflecting the nesting of the emerging parse structure.
The reductions drive the bottom-up assembly of the parse tree: each reduction applies a production, effectively creating a parent node for the popped subtree symbols and linking it to the context above the handle, progressively building from leaves (terminals) to the root. For arithmetic expressions, this manifests as recognizing factors, terms, and expressions in a precedence-driven manner, where operator associativity is handled by sequential reductions after shifts.
Consider an SLR(1) parser for the augmented grammar E' → E, E → E + T | T, T → T * F | F, F → (E) | id, processing the input string "id * id + id $". The stack begins with state 0, and evolves as follows, with snapshots after each major action emphasizing the configuration just before and after key reductions (state numbers per the canonical LR(0) construction for this grammar).[24]
| Configuration | Stack Snapshot | Remaining Input | Operation/Reduction Trace |
|---|---|---|---|
| Initial | [0] | id * id + id $ | - |
| After shift/reduce | [0 T 2] | * id + id $ | Shift "id", reduce F → id (pop 2), reduce T → F (pop 2); first term recognized. |
| After shift | [0 T 2 * 7] | id + id $ | Shift "*"; continue term (higher precedence defers E reduce). |
| After shift/reduce | [0 T 2 * 7 F 10] | + id $ | Shift "id", reduce F → id (pop 2); second factor. |
| After reduce | [0 T 2] | + id $ | Reduce T → T * F (pop 6); multiplication assembled as left-associative term. |
| After reduce/shift | [0 E 1 + 6] | id $ | Reduce E → T (pop 2, on "+"); shift "+"; now at expression level. |
| After shift/reduce | [0 E 1 + 6 T 9] | $ | Shift "id", reduce F → id (pop 2), reduce T → F (pop 2); second term. |
| After reduce | [0 E 1] | $ | Reduce E → E + T (pop 6); addition applied after multiplication complete. Accept on $. |
This sequence demonstrates multiple consecutive reductions for operator precedence and associativity: after the second "id", reductions assemble the multiplication (F → id, T → T * F) before reducing to E on "+", forming a parse tree with * nested under T within the overall E. The final accept confirms the complete tree rooted at E'. The use of lookahead in decisions (e.g., reduce E → T only on "+" or "$", shift "*" from T state) enforces grammar rules without backtracking.[24]
In cases of invalid input, such as "id +" followed by the end marker, the stack evolves to [0 E 1] after reducing the initial "id" through F → id, T → F, E → T, then shifts "+" to [0 E 1 + 6], but encounters "$" with no defined action[6, $] (no shift or reduce applies where a term is expected), signaling a syntax error and invoking recovery (e.g., popping states or inserting tokens).[24]
Advanced Topics
Lookahead Sets and Precision
In LR(1) parsing, lookahead sets play a crucial role by associating each production item with a specific terminal symbol that predicts the next input, enabling the parser to make unambiguous decisions during shift-reduce operations. An LR(1) item is represented as [A → α•β, a], where A → αβ is a production, the dot (•) indicates the current position in the right-hand side, α has been parsed, β remains to be parsed, and a is the lookahead terminal (or the end marker $) that must follow the completion of the production to trigger a reduction.[4] This lookahead context resolves parsing ambiguities that arise in LR(0) parsers, such as shift-reduce or reduce-reduce conflicts, by restricting reductions to cases where the next input symbol matches the predicted lookahead, ensuring deterministic behavior for a broader class of grammars.[25]
FOLLOW sets provide the foundation for propagating lookaheads in LR(1) items, defined as the set of terminals that can appear immediately after a nonterminal A in any valid derivation from the start symbol. For the start symbol S, FOLLOW(S) always includes the end marker $; for other nonterminals, FOLLOW(A) is computed using a fixed-point iteration that propagates terminals through the grammar's productions—for each production B → γAδ, add FIRST(δ) (excluding ε if δ is nullable) to FOLLOW(A), and if δ is nullable or empty, add FOLLOW(B) to FOLLOW(A).[25] This iterative process continues until no new terminals are added, ensuring FOLLOW sets capture all possible succeeding terminals.[22]
Complementing FOLLOW sets, FIRST sets determine the initial terminals derivable from a string of grammar symbols, essential for computing lookaheads during item closure in LR(1) state construction. For a string α, FIRST(α) includes all terminals that can begin strings derived from α; if α is ε-producible (nullable), ε is included, allowing propagation to subsequent symbols. Computation proceeds recursively: for a terminal t, FIRST(t) = {t}; for a nonterminal X with productions X → Y₁Y₂⋯Yₖ, initialize FIRST(X) with FIRST(Y₁) (excluding ε if nullable), then iteratively add FIRST(Yᵢ) for subsequent nullable prefixes; ε-productions are handled by marking nonterminals as nullable and including ε in their FIRST sets if applicable.[25] In the closure operation for an item [A → α•Bγ, a], new items [B → •δ, b] are added where b ∈ FIRST(γa), combining FIRST computation with lookahead propagation to refine precision.[22]
The use of FIRST and FOLLOW sets in LR(1) parsers achieves high precision by tailoring lookaheads to specific derivation contexts, allowing recognition of all deterministic context-free grammars without conflicts, though this comes at the cost of potentially exponential growth in the number of parser states due to distinct lookahead combinations.[4] In contrast, approximations like LALR(1) merge states from the canonical LR(1) automaton while approximating lookaheads via FOLLOW sets or propagation, reducing state count to linear in grammar size but risking introduced conflicts in some grammars.[22] The algorithms for FIRST and FOLLOW computation typically involve fixed-point iteration over the grammar's nonterminals and productions, starting with initial seeds (e.g., $ for FOLLOW(S)) and updating sets until convergence, with time complexity O(n³) in the worst case for n symbols but efficient in practice for typical grammars.[25]
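The fixed-point computations of FIRST and FOLLOW can be sketched directly for the arithmetic expression grammar. Since that grammar has no ε-productions, the nullability bookkeeping described above is omitted (a simplifying assumption of this sketch); all names are illustrative:

```python
# Expression grammar as (lhs, rhs) pairs; no epsilon productions,
# so nullability can be ignored in this sketch.
GRAMMAR = [
    ("E", ["E", "+", "T"]), ("E", ["T"]),
    ("T", ["T", "*", "F"]), ("T", ["F"]),
    ("F", ["(", "E", ")"]), ("F", ["id"]),
]
NONTERMS = {lhs for lhs, _ in GRAMMAR}
START = "E"

first = {A: set() for A in NONTERMS}
follow = {A: set() for A in NONTERMS}
follow[START].add("$")              # seed: $ follows the start symbol

changed = True
while changed:
    changed = False
    for lhs, rhs in GRAMMAR:
        # FIRST: first symbol of the RHS (no nullable symbols here)
        head = first[rhs[0]] if rhs[0] in NONTERMS else {rhs[0]}
        if not head <= first[lhs]:
            first[lhs] |= head
            changed = True
        # FOLLOW: for B -> gamma A delta, FIRST(delta) goes into FOLLOW(A);
        # if A ends the RHS, FOLLOW(B) goes into FOLLOW(A)
        for i, sym in enumerate(rhs):
            if sym not in NONTERMS:
                continue
            if i + 1 < len(rhs):
                nxt = rhs[i + 1]
                add = first[nxt] if nxt in NONTERMS else {nxt}
            else:
                add = follow[lhs]
            if not add <= follow[sym]:
                follow[sym] |= add
                changed = True
```

At the fixed point this yields FIRST(E) = {(, id} and the FOLLOW sets used in the SLR table construction earlier: FOLLOW(E) = {$, ), +} and FOLLOW(T) = FOLLOW(F) = {$, ), +, *}.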
Error Detection and Recovery
In LR parsers, syntax errors are detected during the parsing loop when the action table entry corresponding to the current state and lookahead symbol is undefined or marked as an error, indicating that no valid shift, reduce, or accept action is possible.[26] This detection mechanism relies on the deterministic nature of the parsing table, where the absence of an entry signals a deviation from the viable prefixes of the grammar.[27]
Common recovery techniques aim to resume parsing after an error with minimal disruption to the overall process. In panic mode recovery, the parser discards input tokens until it encounters a synchronizing token—typically a member of the follow set of a nonterminal—allowing it to pop the stack until a state where this token is valid, thus restoring a viable prefix.[26] Another approach involves popping stack symbols until a viable prefix is achieved, effectively backtracking to a point where parsing can continue without further insertions or deletions.[28] More sophisticated methods include local repairs such as inserting or deleting tokens near the error site, or augmenting the grammar with error productions (e.g., E \to \text{error}) to explicitly handle anticipated error patterns like missing semicolons.[27] These error productions integrate seamlessly into the LR table construction, enabling the parser to reduce erroneous input to a special error nonterminal and continue.[29]
To enhance recovery accuracy, techniques often employ multiple lookaheads to anticipate corrections, evaluating potential shifts or reductions based on subsequent tokens to select the most plausible repair.[27] For instance, in an arithmetic expression grammar, an input like "id + +" triggers an error on the second "+", where the parser might skip the extraneous "+" (a synchronizing token in the follow set of the expression nonterminal) and resume parsing, avoiding cascade failures.[26] However, LR recovery strategies assume errors are locally deterministic and syntax-focused, limiting their applicability to semantic analysis or deeply nested ambiguities, where additional context beyond the parsing table may be required.[28]
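Panic-mode recovery can be isolated into a small helper. The sketch below assumes a generic ACTION dictionary keyed by (state, token); the table, states, and token names in the demonstration are purely hypothetical, serving only to illustrate the skip-then-pop discipline described above:

```python
def panic_recover(stack, tokens, pos, action, sync):
    """Discard input until a synchronizing token, then pop states until the
    exposed state defines an action on that token. Returns the new input
    position; raises SyntaxError if recovery is impossible."""
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1                         # skip erroneous tokens
    if pos == len(tokens):
        raise SyntaxError("no synchronizing token before end of input")
    while len(stack) > 1 and (stack[-1], tokens[pos]) not in action:
        stack.pop()                      # pop states until the token is viable
    if (stack[-1], tokens[pos]) not in action:
        raise SyntaxError("cannot recover")
    return pos

# Toy demonstration: suppose state 6 defines no action on "+"; the parser
# skips to ";" (a synchronizing token) and pops back to state 0, where ";"
# can be shifted. ACTION entries here are invented for illustration.
ACTION = {(0, "id"): ("s", 5), (0, ";"): ("s", 2)}
stack = [0, 1, 6]
pos = panic_recover(stack, ["+", "+", ";"], 0, ACTION, {";", "$"})
# stack is now [0] and pos indexes the ";" token, so parsing can resume
```

The two loops correspond directly to the two phases of panic mode: discarding input up to a member of the synchronizing set, then restoring a viable prefix on the stack.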