Greibach normal form

Greibach normal form (GNF) is a canonical representation for context-free grammars (CFGs) in formal language theory, where each production rule has the form A \to a \alpha, with A a nonterminal symbol, a a terminal symbol, and \alpha a (possibly empty) string of nonterminal symbols. This structure ensures that every right-hand side begins with a terminal, preventing rules that start solely with nonterminals or consist only of the empty string (except possibly for the start symbol).^[1] Introduced by Sheila A. Greibach in her seminal 1965 paper, the form addresses limitations in other normalizations like Chomsky normal form by enabling direct correspondence between grammar derivations and automaton computations. The central theorem establishes that for any context-free language L not containing the empty word, there exists an equivalent CFG in GNF generating exactly L; if the empty word is included, it can be handled by an additional start-symbol production. Proofs of this theorem typically involve transforming a Chomsky normal form grammar through substitutions and eliminations to achieve the required production shapes, with the resulting grammar size bounded polynomially in the original.^[2] GNF holds significant importance in theoretical computer science, particularly for constructing pushdown automata (PDAs) from CFGs, as its left-to-right terminal initiation mirrors the input-reading process in PDA simulations of leftmost derivations.^[3] This facilitates proofs of language equivalence between CFGs and PDAs, analysis of parsing algorithms in compilers, and extensions to weighted or infinite-word contexts in advanced formal language studies.^[1]

Definition

Production Rule Structure

In Greibach normal form, every production rule adheres to the strict syntactic structure A \to a \alpha, where A is a single nonterminal symbol, a is a single terminal symbol, and \alpha is a string of zero or more nonterminal symbols (which may be empty).^[2] This form is defined for context-free grammars, ensuring that derivations begin with terminal symbols to facilitate leftmost parsing.^[2] A key restriction of this structure is the prohibition against left recursion: no production rule may begin with a nonterminal, meaning forms like A \to B \beta (where B is a nonterminal and \beta is any string) are not permitted.^[2] This eliminates cycles that could lead to infinite left derivations without progress in consuming input terminals.^[2] Regarding the empty string \epsilon, it is allowed exclusively as a production for a designated start symbol, such as S \to \epsilon, but only if the generated language includes \epsilon; no other nonterminal may have a direct empty production in this form, and the start symbol S must not appear on the right-hand side of any rule.^[4] Representative examples of valid production rules in Greibach normal form include:

S \to [aA](/page/Terminal), where a is a terminal and A is a nonterminal.
A \to b, where b is a terminal (here, \alpha is empty).
B \to [cBB](/page/String), where c is a terminal and BB is a string of nonterminals.^[4]

Formal Grammar Requirements

A context-free grammar G = (V, \Sigma, P, S) is in Greibach normal form if all productions in P adhere to the specified production rule structure, the start symbol S does not appear on the right-hand side of any production, and the grammar generates the same language as its original form prior to conversion. This ensures that derivations proceed without left recursion and that the start symbol initiates all valid sentential forms without cyclic dependencies involving itself.^[5]^[2] The nonterminals in V must satisfy additional constraints to maintain the form's validity: no ε-productions are allowed except possibly the single rule S \to \epsilon if \epsilon \in L(G), and all nonterminals are reachable from the start symbol S and productive (i.e., each derives at least one terminal string), eliminating useless symbols that contribute nothing to the language. These conditions promote a clean grammar structure, where every nonterminal is both reachable from S and productive in deriving terminal strings, avoiding superfluous elements that could complicate analysis or parsing.^[5] The language generated by a grammar in Greibach normal form remains equivalent to that of the original context-free grammar, preserving L(G) throughout the normalization process. For instance, the grammar with productions S \to aS \mid b is already in Greibach normal form, generating the language \{ a^n b \mid n \geq 0 \}; in contrast, a grammar such as S \to Sa \mid a requires conversion to eliminate left recursion and achieve the form.^[2]^[5]

Conversion Process

Eliminating ε-Productions

ε-productions in a context-free grammar are rules of the form A \to \epsilon, where A is a nonterminal and \epsilon is the empty string. These productions allow nonterminals to derive the empty string and must be eliminated (except possibly for the start symbol if the language includes \epsilon) to achieve Greibach normal form, as GNF productions cannot be empty except in specific cases. Eliminating ε-productions preserves the generated language while preparing the grammar for further normalization steps.^[6]^[7] The standard algorithm to remove ε-productions proceeds as follows: First, identify the set of nullable nonterminals—those that can derive \epsilon (including those with direct A \to \epsilon or through chains). This can be computed by starting with nonterminals having \to [\epsilon](/page/Epsilon) and iteratively adding those whose all productions consist of nullables. Then, for every production A \to X_1 X_2 \dots X_k (where X_i are grammar symbols), generate all possible productions by omitting subsets of the nullable X_i's, adding A \to Y_1 Y_2 \dots Y_m for each non-empty subset where nullables are skipped. Finally, remove all original ε-productions. If the original start symbol S is nullable and the language includes \epsilon, introduce a new start symbol S_0 \to S \mid \epsilon; otherwise, if S nullable but \epsilon not in language, ensure no S \to \epsilon. This process may introduce unit productions, which are handled in a subsequent step, and the resulting grammar generates the same language.^[7] For example, consider productions S \to A B \mid \epsilon, A \to a \mid \epsilon, B \to b \mid \epsilon. Nullable are S, A, B. For S \to A B, add S \to A, S \to B, S \to \epsilon (but already have), and keep S \to A B. Remove ε-prods, but since \epsilon in language, keep S \to \epsilon or use new start. The grammar becomes S \to A B \mid A \mid B \mid \epsilon, A \to a, B \to b.^[7] This step ensures the grammar is ε-free (except possibly start), setting the stage for eliminating unit productions and left recursion.

Substituting Unit Productions

Unit productions in a context-free grammar are rules of the form A \to B, where both A and B are nonterminal symbols. These productions do not contribute terminals to derivations and can form chains or cycles that complicate normalization. In the process of converting a grammar to Greibach normal form (GNF), where every production must begin with a terminal symbol, unit productions must be eliminated to ensure all right-hand sides start appropriately.^[6] The elimination of unit productions involves substituting them with non-unit productions from reachable nonterminals, preserving the generated language. The standard algorithm assumes the grammar is ε-free and proceeds as follows: Construct a directed graph where nodes are nonterminals and edges represent unit productions. For each nonterminal A, compute the set N_A of all nonterminals reachable from A via paths in this graph (including A itself). Then, for every B \in N_A and every non-unit production B \to \gamma (where \gamma is not a single nonterminal), add the production A \to \gamma to the new set of rules. Finally, remove all original unit productions. This process handles chains like A \to B \to C by effectively substituting C's non-unit rules directly into A, and it resolves cycles (such as A \to B \to \cdots \to A) by including all reachable non-unit productions without introducing loops. The resulting grammar generates the same language, as any derivation using unit chains can be shortened by replacing them with direct non-unit rules, and vice versa.^[6]^[7] If a nonterminal B reachable from A via units is nullable (i.e., B \Rightarrow^* \epsilon), the algorithm may introduce A \to \epsilon only if the original grammar allows it; however, since ε-productions are typically removed prior to this step in GNF conversion (except possibly for the start symbol if the language includes the empty string), such cases are avoided to maintain ε-freedom.^[6] Consider the grammar with productions A \to B, B \to a \mid C, C \to b, where a and b are terminals. The unit graph has edges A \to B and B \to C, so N_A = \{A, B, C\}. Non-unit productions are B \to a and C \to b. Thus, add A \to a and A \to b, then remove the units, yielding A \to a \mid b. This eliminates the chain while preserving derivations.^[7] This substitution prepares the grammar for subsequent steps, such as eliminating left recursion, by ensuring no unproductive unit chains remain.^[6]

Eliminating Left Recursion

Left recursion in a context-free grammar occurs when a non-terminal can derive a string beginning with itself, either directly (immediate left recursion, where a production is of the form A \to A \alpha) or indirectly (through a cycle involving other non-terminals). Such recursion prevents the construction of Greibach normal form (GNF), as GNF requires all productions to begin with a terminal symbol, ensuring no leftmost non-terminal derivations. Eliminating left recursion is thus a crucial step in the conversion process, preserving the language generated by the grammar while enabling subsequent substitutions to achieve GNF.^[2] For immediate left recursion, consider productions for a non-terminal A partitioned into left-recursive forms A \to A \alpha_1 | \cdots | A \alpha_r and non-left-recursive forms A \to \beta_1 | \cdots | \beta_s, where none of the \beta_i begin with A. The standard transformation introduces a new non-terminal A' and replaces the productions as follows:

\begin{align*} A &\to \beta_1 A' \mid \cdots \mid \beta_s A' \\ A' &\to \alpha_1 A' \mid \cdots \mid \alpha_r A' \mid \epsilon \end{align*}

This converts the left recursion into right recursion, allowing derivations to progress without infinite loops in top-down parsing contexts. As a simple example, consider the grammar fragment S \to S \alpha \mid a, where a is a terminal and \alpha is a string of grammar symbols. Applying the transformation yields S \to a S' and S' \to \alpha S' \mid \epsilon, eliminating the recursion while generating the same language.^[8] Indirect left recursion arises when a non-terminal A_i derives another non-terminal A_j (with j < i) that eventually leads back to A_i through a chain. To handle this, non-terminals are ordered arbitrarily as A_1, \dots, A_n, and substitutions are performed iteratively to convert indirect recursion into immediate recursion, which is then eliminated using the above method. The full procedure assumes the grammar has no cycles among non-terminals and no \epsilon-productions (with nullable sets computed beforehand to handle potential empties). Unit productions should be substituted first as a prerequisite to simplify the grammar.^[8] The complete algorithm for eliminating left recursion proceeds as follows:

Arrange the non-terminals in some order A_1, \dots, A_n.
For each i from 1 to n:
- For each j from 1 to i-1, and for each production A_i \to A_j \gamma, replace it by adding productions A_i \to \delta \gamma for each existing production A_j \to \delta, and remove the original A_i \to A_j \gamma.
- Eliminate immediate left recursion in the updated productions for A_i using the direct transformation.

This pseudocode ensures all indirect recursions are resolved by substitution before direct elimination:

for i = 1 to n do
    for j = 1 to i-1 do
        for each production A_i → A_j γ do
            remove A_i → A_j γ
            for each production A_j → δ do
                add A_i → δ γ
    eliminate immediate left recursion on A_i
for i = 1 to n do
    for j = 1 to i-1 do
        for each production A_i → A_j γ do
            remove A_i → A_j γ
            for each production A_j → δ do
                add A_i → δ γ
    eliminate immediate left recursion on A_i

The resulting grammar is free of left recursion and equivalent to the original. For instance, in a grammar with productions A \to B \gamma \mid \beta and B \to A \delta, substitution first replaces A \to B \gamma with A \to A \delta \gamma \mid \beta, yielding immediate recursion A \to A (\delta \gamma) \mid \beta, which is then transformed to A \to \beta A' and A' \to \delta \gamma A' \mid \epsilon.^[8]

Substitution to Achieve Greibach Form

After eliminating ε-productions, unit productions, and left recursion, the grammar is free of these issues but may still have productions starting with nonterminals. The final step to achieve Greibach normal form involves substituting the productions of the leftmost nonterminal in each rule to ensure every production starts with a terminal. This step preserves the language and results in the desired form A \to a \alpha.^[7] Assuming the grammar is ε-free, unit-free, and left-recursion-free, order the nonterminals A_1, A_2, \dots, A_n in a topological order such that if a production has leftmost nonterminal A_j in a rule for A_i, then j > i (possible due to no left recursion). Process the nonterminals from A_n to A_1: For each production A_i \to A_j \gamma (where j > i, \gamma is a string of nonterminals), replace it with productions A_i \to \beta \gamma for every production A_j \to \beta of A_j. Since A_j (with higher index) has already been processed, all its productions A_j \to \beta start with a terminal (i.e., in GNF). Productions already starting with a terminal remain unchanged. Remove any duplicates after substitution. If the resulting grammar introduces ε-productions due to nullable symbols, they should have been handled earlier, but the process ensures none appear.^[7] For example, suppose after prior steps, we have A \to B C \mid a, B \to b D \mid b, ordered with B after A. Processing B first (already starts with b). Then for A → B C, substitute: A → b D C | b C, plus A → a. Now all start with terminals: A → b D C | b C | a, B → b D | b. This achieves GNF while preserving derivations.^[7] If the language contains \epsilon, add a new start symbol S_0 \to S \mid \epsilon after the process, ensuring S_0 has no other productions. The resulting grammar is in Greibach normal form and generates the same language.^[2]

Properties

Uniqueness and Equivalence

Every context-free language possesses at least one equivalent grammar in Greibach normal form (GNF), as established by a constructive algorithm that transforms any context-free grammar into this form, handling the empty string via a start symbol production if necessary.^[2] This existence theorem relies on systematic elimination of certain production types while preserving the generated language.^[2] The equivalence of GNF grammars to their original counterparts is guaranteed by the conversion process, which involves substituting unit productions, removing left recursion, and ensuring each production begins with a terminal symbol followed by zero or more non-terminals; this yields a grammar generating precisely the same language L(G).^[2] Specifically, the resulting GNF grammar is strongly equivalent to the input grammar, meaning their languages are identical.^[2] Although GNF provides a canonical structure, it is not unique: multiple GNF grammars can generate the identical context-free language, varying in non-terminal selections or production sequencing due to the non-canonical nature of the transformation algorithms.^[9] In relation to Chomsky normal form (CNF), GNF holds equivalent expressive power, as both forms can describe any context-free language, but GNF emphasizes leftmost terminal derivations without the binary non-terminal branching characteristic of CNF.^[2]

Implications for Parsing

The structure of Greibach normal form (GNF), where every production rule begins with a terminal symbol, eliminates left recursion inherent in general context-free grammars, thereby facilitating top-down parsing techniques such as recursive descent without the risk of infinite loops during derivation.^[10] In a recursive descent parser, the terminal-first rule allows immediate lookahead to select the appropriate production, ensuring that the parser processes the input string in a single pass from left to right, with each recursive call consuming exactly one terminal symbol before expanding nonterminals.^[11] This property guarantees that, for a string of length n in the language generated by a GNF grammar, any top-down parser halts after at most depth n, directly mirroring the leftmost derivation process where terminals are generated sequentially.^[12] Furthermore, GNF supports predictive parsing in LL(1) frameworks by providing unique terminal prefixes for the alternatives of each nonterminal's productions, enabling unambiguous selection with a single token of lookahead. If the initial terminals for a nonterminal's rules are distinct, the grammar becomes LL(1) compatible, as the parser can deterministically choose the expansion without backtracking or conflict resolution, which is particularly advantageous for constructing efficient table-driven parsers.^[13] However, the conversion process to GNF can significantly impact parser construction due to potential increases in grammar size, which in worst-case scenarios grow exponentially with the number of nonterminals, complicating the generation of parsing tables or code.^[11] For example, a grammar with chained indirect left recursion may require substituting and expanding productions, leading to an explosion in the number of rules—up to $2^{n^2} for n nonterminals in pathological cases—thus raising the time and space complexity of building and executing the parser.^[14] Although improved algorithms achieve polynomial bounds, such as O(|G|^4) where |G| is the original grammar size, the practical overhead can still affect scalability for large grammars in compiler design.^[15]

Applications

In Formal Language Theory

Greibach normal form (GNF) was introduced by Sheila A. Greibach in 1965 as a normalization for context-free grammars, where every production begins with a terminal symbol, ensuring that derivations proceed in a left-to-right manner without left recursion, which aligns closely with leftmost derivations.^[2] In formal language theory, GNF is useful for analyzing context-free languages (CFLs), as its structure ensures a direct correspondence to leftmost derivations. Since every CFL has an equivalent grammar in GNF, properties of CFLs can be studied using this form without loss of generality.^[16] GNF also aids in proofs of non-context-freeness by assuming a grammar in this form and deriving contradictions through analysis of terminal prefixes in derivations; for instance, pumping may alter expected prefix lengths or structures in languages requiring strict balancing, like those involving multiple matched counts.^[17] This technique leverages the sequential production of terminals to expose mismatches that violate the language's constraints.

Relation to Pushdown Automata

Greibach normal form (GNF) provides a direct correspondence to pushdown automata (PDAs) by facilitating the construction of a PDA that accepts the same language as the grammar, without epsilon transitions on the input. In a GNF grammar, each production is of the form A \to a \alpha, where a is a terminal and \alpha is a (possibly empty) string of nonterminals; this structure allows the PDA to simulate the grammar's leftmost derivations by reading one terminal at a time while managing the stack for nonterminals. The PDA operates with two states: an initial state p and a working state q. From p, it reads the first terminal a and pushes the corresponding \alpha from the start symbol productions. In q, for each nonterminal A on top of the stack, upon reading a, it pops A and pushes \alpha if A \to a \alpha applies. Acceptance occurs by empty stack when the input is consumed, ensuring the PDA halts after exactly n steps for an input string of length n.^[3] For example, consider the production A \to a B in a GNF grammar. This maps to a PDA transition: on input a with A on top of the stack, pop A and push B, moving to state q. If multiple productions exist for A, such as A \to a B and A \to a C, the PDA becomes nondeterministic, choosing which string to push. This construction preserves the language generated by the grammar, excluding the empty string, and demonstrates how GNF eliminates the need for epsilon moves during input consumption.^[3]^[18] In the context of deterministic context-free languages (DCFLs), a restricted form of GNF—where all productions for a given nonterminal start with the same terminal—enables the construction of a deterministic PDA (DPDA) without nondeterminism. This "deterministic GNF" ensures unique choices for stack operations upon reading a terminal, aligning with the deterministic nature of the language. Such grammars arise from DCFLs and support realtime DPDA recognition, where the automaton processes one input symbol per step. The conversion follows the standard GNF-to-PDA algorithm but resolves choices deterministically due to the unique terminal prefix per nonterminal.^[3]^[18] The overall conversion algorithm from a GNF grammar to a PDA simulates leftmost derivations by treating the stack as the current sentential form's nonterminal suffix. Starting with the start symbol on the stack, each input terminal triggers a matching production, replacing the top nonterminal with the production's nonterminal string while consuming the terminal. This process mirrors the grammar's generative steps in reverse for recognition, proving equivalence between GNF grammars and PDAs for context-free languages.^[3]