In formal language theory, the Chomsky normal form (CNF) is a standardized representation of context-free grammars (CFGs) in which every production rule is restricted to one of two forms: A \to BC, where A, B, and C are nonterminal symbols (variables), or A \to a, where a is a terminal symbol, with the optional allowance for the start symbol S to have a rule S \to \epsilon if the empty string is in the language.[1] This form eliminates unit productions (A \to B), mixed terminal-nonterminal rules, and productions with more than two symbols on the right-hand side, ensuring a binary branching structure in the corresponding parse trees.[2]
Named after linguist Noam Chomsky, who first described this restricted form in his 1959 paper on the formal properties of grammars, CNF serves as a canonical normal form for type-2 grammars in the Chomsky hierarchy.[1] Every context-free grammar can be algorithmically converted into an equivalent CNF grammar that generates the same language, preserving the grammar's generative power while simplifying its structure for analysis.[3] This transformation is constructive and involves steps such as removing unnecessary symbols, eliminating \epsilon-productions (except possibly for the start symbol), removing unit productions, and binarizing longer rules.[4]
The importance of CNF lies in its role in theoretical computer science and natural language processing, where it facilitates proofs about context-free languages and enables efficient parsing algorithms. For instance, in CNF, the derivation of any string of length n requires exactly 2n - 1 production applications, which underpins dynamic programming approaches like the Cocke-Younger-Kasami (CYK) algorithm for recognizing membership in a language, achieving O(n^3) time complexity.[2] Additionally, CNF's binary tree structure aligns parse trees with hierarchical syntactic analysis, making it valuable for compiler design, ambiguity resolution, and computational linguistics.[5]
Introduction and background
Definition
A context-free grammar G = (V, \Sigma, P, S) consists of a finite set V of variables (nonterminals), a finite set \Sigma of terminals, a finite set P of production rules, and a distinguished start variable S \in V.[2][4]
The grammar G is in Chomsky normal form (CNF) if every production rule in P has one of two forms: A \to BC, where A, B, C \in V are variables and neither B nor C is the start symbol, or A \to a, where A \in V and a \in \Sigma is a terminal symbol.[6][7][8]
This form imposes strict restrictions on the productions: there are no \epsilon-productions except possibly S \to \epsilon if the empty string is in the language generated by G and S does not appear in the right-hand side of any other production; there are no unit productions of the form A \to B where A, B \in V; and there are no productions with right-hand sides longer than two symbols or mixing terminals and nonterminals except for the single-terminal case.[2][4][9]
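These restrictions can be checked mechanically. The following is a minimal sketch in Python, assuming a grammar represented as a dict mapping each nonterminal to a set of right-hand sides (tuples of symbol names, with the empty tuple encoding \epsilon); the representation and function name are illustrative, not from any standard library.

```python
def is_cnf(grammar, start, terminals):
    """Check the CNF restrictions given in the definition above.

    `grammar` maps each nonterminal to a set of right-hand sides,
    each a tuple of symbol names; the empty tuple () encodes epsilon.
    """
    nonterminals = set(grammar)
    has_start_epsilon = () in grammar.get(start, set())
    for head, bodies in grammar.items():
        for body in bodies:
            if body == ():
                if head != start:
                    return False  # epsilon is allowed only for the start symbol
            elif len(body) == 1:
                if body[0] not in terminals:
                    return False  # no unit productions A -> B
            elif len(body) == 2:
                if not all(s in nonterminals for s in body):
                    return False  # no mixed terminal/nonterminal pairs
                if has_start_epsilon and start in body:
                    return False  # with S -> epsilon, S may not occur on any RHS
            else:
                return False  # no right-hand sides longer than two symbols
    return True
```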
The Chomsky normal form is named after linguist Noam Chomsky, who first described this restricted form in his 1959 paper "On Certain Formal Properties of Grammars."[1]
Historical development
The Chomsky normal form emerged as part of Noam Chomsky's efforts to formalize and standardize the structure of context-free grammars within the broader framework of generative linguistics and computability theory. In his seminal 1956 paper, "Three Models for the Description of Language," Chomsky introduced the Chomsky hierarchy, categorizing grammars into types 0 through 3 based on production rule restrictions, with type 2 encompassing context-free grammars.[10] This classification motivated the creation of normalized forms to streamline theoretical analysis, particularly for addressing decision problems such as language membership testing and syntactic parsing, by reducing the complexity of derivations while preserving generative power.
Chomsky first described the normal form in his 1959 paper "On Certain Formal Properties of Grammars," presenting it as a "regular" form for type-2 grammars with productions limited to binary nonterminal rules or single terminals.[1] His initial formulation in the 1950s laid the groundwork for these normalizations as tools to prove fundamental properties of formal languages, emphasizing their role in distinguishing context-free languages from more restrictive regular languages (type 3) and more general unrestricted grammars (type 0). The hierarchy itself highlighted the need for such standardizations to explore generative capacity and computational tractability.[10]
Refinements appeared in Chomsky's 1963 work, "Formal Properties of Grammars," where he explicitly detailed the Chomsky normal form as a canonical representation for type-2 grammars, consisting solely of binary nonterminal productions and terminal substitutions.[11] This evolution facilitated rigorous proofs of equivalence between arbitrary context-free grammars and their normalized counterparts, enhancing the toolkit for undecidability results and language equivalence demonstrations.[11]
By the 1960s, the form had become integral to parsing theory, supporting developments in algorithmic recognition of context-free languages and bridging linguistics with early computer science. Chomsky normal form's tight connection to the type-2 category of the hierarchy proved essential for theoretical advancements, including analyses of self-embedding and inherent ambiguities in natural language structures.[11]
Properties and equivalence
Key properties
Grammars in Chomsky normal form exhibit several key structural properties that simplify analysis and processing. Every nonterminal derives a non-empty string of terminals, as the form prohibits ε-productions (except possibly for the start symbol in languages containing the empty string) and unit productions.[1] The corresponding parse trees are binary branching at all internal nodes, with each non-leaf node having exactly two children that are nonterminals, while leaves are single terminals.[12] In a minimal grammar—meaning one without redundant symbols—there are no useless nonterminals; every symbol is reachable from the start symbol and generates at least one terminal string.[13]
These structural constraints yield important computational properties. The binary form enables the Cocke-Younger-Kasami (CYK) dynamic programming algorithm to recognize membership in the generated language in O(n^3) time for input strings of length n, where the hidden constant depends on the number of nonterminals and productions.[14] Furthermore, the bounded tree structure facilitates ambiguity detection by allowing the enumeration or counting of all possible parse trees for a given string, as the number of such trees is finite and can be computed efficiently using extensions of CYK.[15]
Converting a general context-free grammar to Chomsky normal form can substantially expand its size. If the original grammar has size n (measured by the total number of symbols across all productions), the conversion may produce a grammar of size O(n^2) in the worst case, provided the steps are performed in an appropriate order (in particular, binarization before \epsilon-elimination); the growth stems primarily from the auxiliary symbols introduced for binarization and terminal separation.
Chomsky normal form is not unique for a given language; distinct grammars in this form can generate identical context-free languages, arising from different choices in the normalization process or equivalent but non-isomorphic representations.[13]

For any string of length n \geq 1 derived from a Chomsky normal form grammar, every leftmost (or rightmost) derivation requires exactly 2n - 1 steps: n steps to introduce terminals and n - 1 steps to branch nonterminals. To see this, note that each terminal-introducing production adds one leaf, requiring n such productions, while the binary branching necessitates n - 1 nonterminal-pairing productions to connect them into a full tree; the total steps follow by induction on n, with the base case n=1 using one step and the inductive step adding two steps (one branch and one terminal) per extension.
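Counting the two rule types separately makes the bound explicit:

```latex
% Derivation steps for a word w with |w| = n >= 1 in a CNF grammar:
% n applications of terminal rules A -> a (one per leaf), and
% n - 1 applications of binary rules A -> BC (internal nodes of a full binary tree).
\[
  \underbrace{n}_{A \to a} \;+\; \underbrace{n-1}_{A \to BC} \;=\; 2n - 1 .
\]
```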
Equivalence to general context-free grammars
A fundamental result in the theory of formal languages establishes that every context-free language can be generated by a context-free grammar in Chomsky normal form.[9] Specifically, for any context-free grammar G generating a language L, there exists an equivalent grammar G' in Chomsky normal form such that L(G') = L(G).[16]
The proof of this theorem is constructive and relies on a sequence of grammar transformations that systematically restrict the form of productions while preserving the generated language at each stage.[9] These transformations ensure that the resulting grammar adheres to the rules of Chomsky normal form without altering the language's generative power.[16]
A detailed proof sketch proceeds by induction on the types of production rules present in the original grammar. Base cases handle simple forms like terminal or binary nonterminal productions, which are already in normal form. For inductive steps, each elimination process—such as removing ε-productions or unit productions—is shown to yield an equivalent grammar by demonstrating that every derivation in the original can be mimicked in the new one and vice versa, thus preserving the language exactly.[9]
Special attention is given to edge cases involving the empty string ε. If ε belongs to the language, the conversion allows the start symbol to retain a production to ε, which is explicitly permitted in the definition of Chomsky normal form for the start variable. For languages excluding ε, the process eliminates all ε-productions entirely, ensuring the resulting grammar generates precisely the non-empty strings of the original language.[16]
This equivalence has significant implications: it confirms that the class of context-free languages admits a canonical representation via Chomsky normal form grammars, facilitating theoretical analyses and equivalence proofs in computability and automata theory.[9]
Conversion procedure
Preliminary preparations
Before proceeding with the specific transformations required for Chomsky normal form, it is essential to prepare the context-free grammar by removing useless symbols and ensuring its structural integrity. Useless symbols include non-generating nonterminals, which cannot derive any terminal string, and unreachable nonterminals, which cannot be derived from the start symbol in any sentential form. These elements do not contribute to the language generated by the grammar and can be safely eliminated to simplify subsequent steps.[17]
To identify non-generating nonterminals, apply a fixed-point algorithm starting from the terminals. Initially, mark all terminal symbols as generating. Then, iteratively mark a nonterminal A as generating if there exists a production A \to \alpha where every symbol in \alpha (a string over terminals and nonterminals) has already been marked generating. Continue until no additional nonterminals can be marked. All unmarked nonterminals are non-generating; remove them along with all productions in which they appear on the left- or right-hand side. This process ensures that every remaining nonterminal can derive at least one terminal string.[17][18]
Next, identify unreachable nonterminals by another fixed-point computation, but starting from the designated start symbol S. Initially, mark S as reachable. Then, iteratively mark a nonterminal B as reachable if it appears in the right-hand side of any production whose left-hand side is already marked reachable. Continue until no further markings are possible. Unmarked nonterminals are unreachable; eliminate them and all associated productions. Perform this reachability computation only after removing non-generating symbols, as the latter may otherwise propagate false reachability. The resulting grammar contains only symbols involved in derivations from S to terminal strings.[17][18]
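Both fixed-point computations translate directly into code. A minimal sketch in Python, under the same illustrative dict-of-sets grammar representation (nonterminal mapped to a set of right-hand-side tuples):

```python
def remove_useless(grammar, start, terminals):
    """Remove non-generating, then unreachable, nonterminals (two fixed points)."""
    # Pass 1: mark nonterminals that derive at least one terminal string.
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in generating and any(
                    all(s in terminals or s in generating for s in body)
                    for body in bodies):
                generating.add(head)
                changed = True
    grammar = {h: {b for b in bs
                   if all(s in terminals or s in generating for s in b)}
               for h, bs in grammar.items() if h in generating}
    # Pass 2: mark nonterminals reachable from the start symbol.
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in reachable:
                for body in bodies:
                    for s in body:
                        if s in grammar and s not in reachable:
                            reachable.add(s)
                            changed = True
    return {h: bs for h, bs in grammar.items() if h in reachable}
```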
If the original grammar has overlapping symbols between the terminal alphabet \Sigma and the nonterminal alphabet V_N, rename the terminals—such as by appending a distinguishing marker (e.g., replacing a shared symbol a with t_a)—to enforce disjointness, as required by the standard definition of context-free grammars. This prevents ambiguity in parsing and derivation steps.[12]
In cases where the grammar features multiple start-like rules or several nonterminals functioning as potential starts (e.g., disconnected components), consolidate them by introducing a new start symbol S' with productions S' \to S_1 \mid S_2 \mid \dots \mid S_k, where S_1, \dots, S_k are the original start-like nonterminals; set S' as the new start. This unifies the grammar under a single entry point without altering the generated language.[19]
These preliminary steps produce a minimal grammar in which every nonterminal is both generating and reachable, thereby reducing the size and complexity for later normalization phases while preserving the exact language generated by the original grammar.[19]
Elimination of ε-productions
Epsilon-productions, also known as null productions, are rules of the form A \to \epsilon in a context-free grammar (CFG), where A is a nonterminal and \epsilon denotes the empty string. These productions allow nonterminals to derive the empty string, which complicates certain analyses and parsing algorithms, necessitating their removal during conversion to Chomsky normal form while preserving the generated language.[20]
To eliminate ε-productions, first identify the set of nullable nonterminals, denoted N_\epsilon, which are those that can derive \epsilon. This set is computed iteratively as the smallest fixed point satisfying: initialize with all nonterminals A such that A \to \epsilon exists; then repeatedly add any nonterminal B for which there is a production B \to X_1 X_2 \dots X_k (where each X_i is a nonterminal) such that all X_i \in N_\epsilon. The process terminates since the set of nonterminals is finite, yielding N_\epsilon in time proportional to the square of the grammar's size.[20][21]
Once nullable nonterminals are identified, remove all ε-productions A \to \epsilon. For each remaining production A \to \alpha, where \alpha = X_1 X_2 \dots X_m consists of terminals and nonterminals, generate new productions by considering all possible ways to omit subsets of the nullable nonterminals in \alpha, provided the resulting right-hand side is non-empty. Specifically, for every non-empty subsequence obtained by deleting zero or more occurrences of symbols from N_\epsilon, add a corresponding production A \to \beta, where \beta is that subsequence. This substitution ensures that derivations using nullable nonterminals are simulated by the new rules without altering the language.[20][21]
For example, consider a production A \to B C D, where C \in N_\epsilon but B, D \notin N_\epsilon. The new productions include A \to B C D (omitting none), A \to B D (omitting C), but not A \to \epsilon or A \to B (as those would omit non-nullables or create invalid empties). If multiple nullables are present, all combinations are added, potentially increasing the number of productions exponentially in the length of \alpha, though practical grammars remain manageable.[20]
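The nullable fixed point and the subset expansion can be sketched as follows, again over the illustrative dict-of-sets representation; the exponential branching over nullable occurrences appears as a product of per-symbol choices. For brevity, this sketch re-adds \epsilon directly under the existing start symbol rather than introducing the fresh start symbol S' described next.

```python
from itertools import product

def eliminate_epsilon(grammar, start):
    """Eliminate epsilon-productions, expanding bodies over nullable symbols."""
    # Fixed point: N_epsilon, the nullable nonterminals.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in nullable and any(
                    all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    new_grammar = {}
    for head, bodies in grammar.items():
        new_bodies = set()
        for body in bodies:
            # Branch on keeping or dropping each nullable occurrence.
            choices = [(s, None) if s in nullable else (s,) for s in body]
            for pick in product(*choices):
                reduced = tuple(s for s in pick if s is not None)
                if reduced:  # never introduce a new epsilon body
                    new_bodies.add(reduced)
        new_grammar[head] = new_bodies
    if start in nullable:
        # Keep epsilon in the language, attached to the start symbol only.
        new_grammar[start].add(())
    return new_grammar
```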
A special case arises with the start symbol S. If S \in N_\epsilon (indicating \epsilon \in L(G)), introduce a new start symbol S' with productions S' \to S and S' \to \epsilon, and ensure S does not appear on any right-hand side. This preserves \epsilon in the language without allowing other ε-productions. If S \notin N_\epsilon, no ε-production is added for S'.[21][20]
The resulting grammar generates the same language as the original, as every derivation in the new grammar corresponds to one in the old by inserting ε-derivations for omitted nullables, and vice versa, with the possible exception of \epsilon handled explicitly for the start symbol. This step may introduce longer productions, addressed in subsequent normalization phases.[20][21]
Elimination of unit productions
Unit productions in a context-free grammar are rules of the form A \to B, where A and B are nonterminals.[22] These productions must be eliminated during the conversion to Chomsky normal form to ensure all remaining rules conform to the required structure, while preserving the grammar's generative power.[2] Assuming ε-productions have already been removed, unit productions do not introduce new ε-derivations, as no nonterminal derives the empty string.[23]
To identify unit pairs, compute the transitive closure of unit derivations, determining for each nonterminal A the set of all nonterminals B such that A \Rightarrow^* B using only unit productions.[22] This can be done efficiently by constructing a dependency graph with nodes for nonterminals and directed edges from C to D for each unit production C \to D, then finding reachable nodes from each starting nonterminal via standard graph traversal or transitive closure algorithms like Warshall's.[22] Self-loops (A \to A) are simply removed without substitution, as they contribute nothing to derivations.[22]
The elimination algorithm proceeds as follows: for each nonterminal A and each B reachable from A via unit derivations (with B \neq A), add to the grammar all non-unit productions of B; that is, if B \to \gamma where \gamma is a string of length at least two or containing terminals, include A \to \gamma.[23] Then, delete all original unit productions.[2] This process is non-iterative once the reachable sets are computed, though cycles in the dependency graph are inherently handled by the transitive closure, ensuring no infinite substitutions occur.[22]
The resulting grammar generates the same language as the original, as every derivation using units can be replaced by direct non-unit rules without altering the yielded terminal strings.[22] However, the number of productions may increase, potentially up to quadratic in the number of nonterminals in the worst case, due to copying non-unit rules across unit chains.[2]
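A sketch of this elimination in Python, computing the reachable set per nonterminal with a simple depth-first traversal over unit edges (an illustrative stand-in for a full transitive-closure algorithm such as Warshall's):

```python
def eliminate_units(grammar):
    """Remove unit productions by copying non-unit bodies along unit chains."""
    nts = set(grammar)

    def is_unit(body):
        return len(body) == 1 and body[0] in nts

    def unit_targets(a):
        # Nonterminals reachable from `a` via unit productions only (DFS).
        seen, stack = set(), [a]
        while stack:
            x = stack.pop()
            for body in grammar[x]:
                if is_unit(body) and body[0] not in seen:
                    seen.add(body[0])
                    stack.append(body[0])
        seen.discard(a)  # a self-loop A -> A contributes nothing
        return seen

    new_grammar = {}
    for a in grammar:
        bodies = {b for b in grammar[a] if not is_unit(b)}
        for b in unit_targets(a):
            bodies |= {body for body in grammar[b] if not is_unit(body)}
        new_grammar[a] = bodies
    return new_grammar
```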
Replacement of nonsolitary terminals
In the conversion procedure for Chomsky normal form, the replacement of nonsolitary terminals targets productions where a terminal symbol appears in the right-hand side (RHS) alongside one or more other symbols, such as A \to aB or A \to BaC, violating the form's requirement that terminals must appear only in solitary positions (i.e., A \to a).[2][24] These "mixed" productions are identified by scanning all rules after prior steps like eliminating ε-productions and unit productions, focusing solely on those with RHS length greater than 1 containing terminals.[25]
The algorithm proceeds as follows: For each distinct terminal a \in \Sigma that appears in a nonsolitary position across any production, introduce a fresh nonterminal X_a (not previously used in the grammar) and add the terminal production X_a \to a. Then, replace every occurrence of a in the RHS of mixed productions with X_a, while leaving solitary terminal productions A \to a unchanged.[2][24] This introduces exactly one new nonterminal and one new production per such terminal a, resulting in an addition of up to |\Sigma| new rules and nonterminals, depending on which terminals appear in mixed contexts.[25]
For example, consider the production A \to aBC, where a is a terminal and B, C are nonterminals. Introduce X_a \to a, then rewrite the original as A \to X_a BC. Similarly, for D \to BaC, it becomes D \to B X_a C after the same replacement.[2] If multiple terminals are involved, such as in S \to aXbY, introduce X_a \to a and X_b \to b, yielding S \to X_a X X_b Y.[24]
This transformation preserves the language generated by the grammar, as any derivation incorporating X_a can be expanded to include a without altering the yield, and conversely, any string derived originally can be obtained by substituting back from the new nonterminals.[2][24] Upon completion, all terminals in the grammar now appear only in solitary RHS positions, setting the stage for further binarization of nonterminal chains while maintaining equivalence to the original context-free grammar.[26]
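A sketch of this replacement over the illustrative dict-of-sets representation; the fresh names X_a are assumed not to collide with existing nonterminals.

```python
def replace_terminals(grammar, terminals):
    """Replace each nonsolitary terminal a with a fresh nonterminal X_a."""
    new_grammar = {h: set() for h in grammar}
    fresh = {}  # terminal -> its dedicated nonterminal
    for head, bodies in grammar.items():
        for body in bodies:
            if len(body) <= 1:
                new_grammar[head].add(body)  # solitary terminals stay as-is
                continue
            new_body = tuple(
                fresh.setdefault(s, f"X_{s}") if s in terminals else s
                for s in body)
            new_grammar[head].add(new_body)
    for t, x in fresh.items():
        new_grammar.setdefault(x, set()).add((t,))  # the rule X_a -> a
    return new_grammar
```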
Reduction to binary nonterminals
In the conversion to Chomsky normal form, after eliminating ε-productions, unit productions, and mixed terminals in right-hand sides, the grammar consists of rules where the right-hand side is either a single terminal or a sequence of two or more nonterminals.[27] The next step addresses productions with more than two nonterminals on the right-hand side, known as long rules of the form A \to B_1 B_2 \dots B_k where k > 2 and each B_i is a nonterminal.[28] These must be broken down into binary rules (exactly two nonterminals on the right-hand side) to satisfy the form's requirements, ensuring all such rules become X \to Y Z for nonterminals X, Y, Z.[29]
The algorithm proceeds recursively by introducing new nonterminals to split the sequence. For a long rule A \to B_1 B_2 \dots B_k, replace it with A \to B_1 C_1, where C_1 is a fresh nonterminal, and add the rule C_1 \to B_2 \dots B_k. Repeat the process on C_1 \to B_2 \dots B_k until the right-hand side has exactly two symbols, yielding a chain of binary rules ending in C_{k-3} \to B_{k-2} C_{k-2} and C_{k-2} \to B_{k-1} B_k.[27] This introduces up to k-2 new nonterminals per original rule, often named systematically like A_i or C_i to indicate their position in the original sequence (e.g., C_1 for the suffix starting after B_1).[28] The choice of splitting from the left (as described) is conventional but arbitrary; right-branching or balanced variants exist, though they preserve equivalence.[29]
This transformation preserves the language generated by the grammar, as every derivation using the original long rule corresponds to a unique sequence of binary derivations in the new grammar, and vice versa, maintaining the same yield of terminals.[27] The resulting parse trees may appear skewed (left- or right-branching) due to the chaining, but the associative structure of derivations ensures semantic equivalence in context-free generation.[28] For example, consider the rule S \to NP VP PP; it becomes S \to NP X_1, X_1 \to VP PP, introducing one new nonterminal X_1 since k=3.[29] After processing all long rules—typically following unit production elimination—the grammar has only binary nonterminal rules, alongside terminal rules like A \to a.[27]
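The left-leaning split can be sketched as follows, continuing the illustrative representation; the counter-based names C1, C2, … stand in for any scheme of fresh nonterminals.

```python
def binarize(grammar):
    """Split bodies B1 B2 ... Bk (k > 2) into a left-leaning chain of pairs."""
    new_grammar = {h: set() for h in grammar}
    counter = 0
    for head, bodies in grammar.items():
        for body in bodies:
            lhs, rest = head, body
            while len(rest) > 2:
                counter += 1
                c = f"C{counter}"  # fresh nonterminal for the remaining suffix
                new_grammar.setdefault(lhs, set()).add((rest[0], c))
                lhs, rest = c, rest[1:]
            new_grammar.setdefault(lhs, set()).add(rest)
    return new_grammar
```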
Handling the start symbol
In Chomsky normal form, the start symbol must not appear on the right-hand side of any production rule, as this ensures the form's strict structure and facilitates algorithms like the Cocke-Younger-Kasami parser.[9] If, after prior conversion steps, productions exist where the start symbol S appears on the right-hand side—such as A \to S \beta for some nonterminal A and string \beta of nonterminals and terminals—this violates the normal form requirement.[2]
To resolve this, introduce a new start symbol S' distinct from all existing nonterminals, rename the original start symbol S to S_0 throughout the grammar (on both sides of every production), and add the production S' \to S_0. For instance, a rule A \to S B becomes A \to S_0 B. The new grammar has S' as its start symbol, and S' does not appear on any right-hand side. This transformation preserves the generated language, as any derivation starting from S' simply begins with the step to S_0, after which it proceeds exactly as in the original grammar.[25][9]
This step is typically performed last in the conversion procedure to avoid propagating modifications to the start symbol during earlier eliminations, such as those for \epsilon-productions or unit productions, which could otherwise introduce unwanted cycles or require additional substitutions involving the new start.[2] By delaying it, the process ensures that the final start symbol remains isolated from right-hand sides without complicating intermediate steps.
A special case arises if the grammar includes the production S \to \epsilon (which is permissible in Chomsky normal form only when the empty string belongs to the generated language), as the language must still contain the empty string after the change. In this scenario, add the production S' \to \epsilon alongside S' \to S_0 to preserve \epsilon in the language. For languages containing \epsilon, the prior \epsilon-elimination step has already confined \epsilon-productions to the start symbol, so the new start symbol inherits this single exception without further adjustment.[9]
Recommended order of steps
The recommended order of steps for converting a context-free grammar to Chomsky normal form begins with preliminary preparations to remove useless symbols, followed by elimination of ε-productions, elimination of unit productions, replacement of nonsolitary terminals, reduction to binary nonterminals, and finally handling the start symbol.[30]
This sequence minimizes complications and ensures correctness by addressing issues that could arise from dependencies between steps. Eliminating ε-productions second—after preliminaries—avoids reintroducing nullables in later phases, as subsequent transformations like unit elimination could otherwise generate new ε-productions if nullables persist.[31] Unit productions are then removed to prevent chain explosions, since ε-elimination may create additional unit rules that would amplify complexity if handled earlier.[23] Replacing nonsolitary terminals precedes binarization to isolate terminals into dedicated unit productions, ensuring long right-hand sides contain only nonterminals and simplifying the introduction of binary rules without mixed symbols.[2]
While any order that preserves language equivalence is valid, alternatives can increase the size of intermediate grammars; for instance, eliminating units before ε-productions risks creating ε-units that require rework.[31] The full process yields a Chomsky normal form grammar with O(n^4) rules in the worst case, where n measures the original grammar's size, though practical implementations benefit from variable reuse in binarization to reduce growth.[2]
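In code, the recommended order amounts to composing the sketches from the preceding subsections (remove_useless, eliminate_epsilon, eliminate_units, replace_terminals, binarize, all illustrative rather than library functions):

```python
def to_cnf(grammar, start, terminals):
    """Conversion pipeline in the recommended order of steps.

    The start-symbol isolation step is omitted for brevity; this sketch
    instead allows an epsilon body directly under `start` when needed.
    """
    g = remove_useless(grammar, start, terminals)  # preliminary preparations
    g = eliminate_epsilon(g, start)                # epsilon-productions
    g = eliminate_units(g)                         # unit productions
    g = replace_terminals(g, terminals)            # nonsolitary terminals
    return binarize(g)                             # binary nonterminal rules
```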
Examples and illustrations
Step-by-step conversion example
Consider the context-free grammar G = (V, \Sigma, P, S) with variables V = \{S, A, B\}, terminals \Sigma = \{a, b\}, start symbol S, and productions P = \{ S \to ASA \mid aB, \, A \to B \mid S, \, B \to b \mid \epsilon \}.[32] The conversion process below follows the standard steps without altering the language equivalence.
Preliminary preparations
Add a new start symbol S_0 with the production S_0 \to S to isolate the original start symbol and prevent issues if it becomes nullable during later steps. The updated grammar is:
S₀ → S
S → ASA | aB
A → B | S
B → b | ε
No other preparations are needed, as there are no immediate irregularities beyond the ε-production.
Elimination of ε-productions
The nullable nonterminals are A and B (since B \to \epsilon and A \to B). Remove B \to \epsilon. For productions containing nullable symbols, add shortened versions by omitting combinations of nullables (excluding the empty string for non-ε original productions):
- From S \to aB, add S \to a (omit B).
- From S \to ASA, add S \to SA (omit first A), S \to AS (omit second A), and S \to S (omit both A's).
- From A \to B, no new production (omitting B would yield ε, which is not added).
The resulting grammar is:
S₀ → S
S → ASA | aB | a | SA | AS | S
A → B | S
B → b
Elimination of unit productions
The unit productions are S_0 \to S, A \to S, A \to B, and S \to S. Remove S \to S. Replace each unit by adding the non-unit productions of the target:
- For A \to S, add S's non-unit productions to A: A \to ASA \mid aB \mid a \mid SA \mid AS; remove A \to S.
- For A \to B, add B's productions to A: A \to b; remove A \to B.
- For S_0 \to S, add S's non-unit productions to S_0: S_0 \to ASA \mid aB \mid a \mid SA \mid AS; remove S_0 \to S.
The resulting grammar is:
S₀ → ASA | aB | a | SA | AS
S → ASA | aB | a | SA | AS
A → b | ASA | aB | a | SA | AS
B → b
Replacement of nonsolitary terminals
Nonsolitary terminals appear in aB (a mixed with nonterminal). Introduce a new variable C with C \to a, and replace all instances of aB with CB:
S₀ → ASA | CB | a | SA | AS
S → ASA | CB | a | SA | AS
A → b | ASA | CB | a | SA | AS
B → b
C → a
All terminals are now solitary.
Reduction to binary nonterminals
The only productions with more than two symbols on the right-hand side are those with ASA (three nonterminals). Introduce a new variable D with D \to SA, and replace all ASA with AD:
S₀ → AD | CB | a | SA | AS
S → AD | CB | a | SA | AS
A → b | AD | CB | a | SA | AS
B → b
C → a
D → SA
All right-hand sides now consist of either two nonterminals or a single terminal.
Handling the start symbol
The new start S_0 does not derive \epsilon (consistent with the original language excluding \epsilon), and no further adjustments are needed, as S_0's productions are already in CNF and the unit S_0 \to S was eliminated earlier.
The final grammar in Chomsky normal form has variables \{S_0, S, A, B, C, D\}, the same terminals, start S_0, and productions as above. This equivalent grammar generates the same language as the original.[32]
For a simpler illustration without units or mixed terminals, consider the grammar S \to aSb \mid \epsilon (generating balanced a's and b's, including empty). After adding S_0 \to S, eliminate \epsilon: add S \to ab (omit nullable S from aSb). Introduce E \to a and F \to b to handle the nonsolitary terminals in ab, yielding S \to E F; replace aSb with E S F. To binarize E S F, introduce G \to S F, replacing with E G. Thus, S \to E G \mid E F, G \to S F. Since the language includes \epsilon, the start handling yields S_0 \to \epsilon \mid E G \mid E F.[2]
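Running the to_cnf pipeline sketched earlier on this simpler grammar reproduces the same structure up to the choice of fresh names (X_a, X_b, C1 in place of E, F, G); since that sketch omits the start-symbol step, \epsilon stays attached to S directly.

```python
g = {"S": {("a", "S", "b"), ()}}  # S -> aSb | epsilon
cnf = to_cnf(g, "S", {"a", "b"})
# Result (up to naming): S -> X_a C1 | X_a X_b | epsilon,
#                        C1 -> S X_b,  X_a -> a,  X_b -> b
```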
Application in parsing
Chomsky normal form (CNF) facilitates efficient parsing of context-free languages through algorithms like the Cocke-Younger-Kasami (CYK) algorithm, which determines whether a given string belongs to the language generated by the grammar. The CYK algorithm relies on the binary structure of CNF productions to build a dynamic programming table that tracks derivable substrings, enabling a bottom-up recognition process.
Consider a simple CNF grammar with nonterminals S, A, B, C, D, E, F and productions:
- S \to A B
- A \to C D
- B \to E F
- C \to a
- D \to a
- E \to b
- F \to b
This grammar generates the string "aabb". To parse it using CYK, construct a table V in which the entry V[i][j] (with i the starting position, 1-based, and j the ending position) contains the nonterminals that derive the substring w_i \dots w_j. For w = "aabb" (n=4), initialize length-1 entries based on terminal productions:
- V[1][1] = \{C, D\} (derives "a")
- V[2][2] = \{C, D\} (derives "a")
- V[3][3] = \{E, F\} (derives "b")
- V[4][4] = \{E, F\} (derives "b")
For longer spans, fill bottom-up by considering all possible splits k between i and j, adding a nonterminal X to V[i][j] if there exists a production X \to Y Z where Y \in V[i][k] and Z \in V[k+1][j].
For length 2:
- V[1][2] = \{A\} (from A \to C D, using the split at k=1)
- V[2][3] = \emptyset (no matching productions)
- V[3][4] = \{B\} (from B \to E F, using the split at k=3)
For length 3:
- V[1][3] = \emptyset (no matching productions across splits)
- V[2][4] = \emptyset (no matching productions across splits)
For length 4:
- V[1][4] = \{S\} (from S \to A B, using the split at k=2)

Since S \in V[1][4], the string "aabb" is recognized by the grammar. The process examines O(n^3) substrings and splits, with each check scanning O(|G|) productions, yielding O(n^3 |G|) time complexity, where |G| is the number of productions. This cubic bound holds specifically because CNF restricts productions to binary nonterminals or single terminals, avoiding higher-degree branching that would inflate the time to O(n^{d+1}) for productions of degree d > 2.
The CYK table for this example can be visualized as follows (rows for starting index i, columns for ending index j; empty cells for invalid spans j < i are blank):
| i \setminus j | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | {C,D} | {A} | ∅ | {S} |
| 2 | | {C,D} | ∅ | ∅ |
| 3 | | | {E,F} | {B} |
| 4 | | | | {E,F} |
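The table construction corresponds line-for-line to the following sketch (using 0-based indices internally, and the same illustrative dict-of-sets grammar representation as earlier in the article):

```python
def cyk(word, grammar, start):
    """CYK membership test for a grammar in CNF."""
    n = len(word)
    if n == 0:
        return () in grammar.get(start, set())
    # V[i][j] holds the nonterminals deriving word[i..j] (0-based, inclusive).
    V = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        V[i][i] = {a for a, bodies in grammar.items() if (ch,) in bodies}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split between positions k and k+1
                for a, bodies in grammar.items():
                    for y, z in (b for b in bodies if len(b) == 2):
                        if y in V[i][k] and z in V[k + 1][j]:
                            V[i][j].add(a)
    return start in V[0][n - 1]

# The example grammar and string from this section:
g = {"S": {("A", "B")}, "A": {("C", "D")}, "B": {("E", "F")},
     "C": {("a",)}, "D": {("a",)}, "E": {("b",)}, "F": {("b",)}}
assert cyk("aabb", g, "S")
```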
Variants and extensions
The Chomsky reduced form is a variant of the Chomsky normal form (CNF) that imposes additional constraints to ensure a cleaner and more efficient representation of context-free grammars (CFGs) without the empty string. Specifically, a CFG G = (T, N, S, R) is in Chomsky reduced form if every production rule in R is either A \to BC or A \to a, where A, B, C \in N are nonterminal symbols and a \in T is a terminal symbol; the start symbol S does not appear on the right-hand side of any production; there are no \epsilon-productions; and there are no useless symbols, meaning every nonterminal in N is reachable from S and productive (i.e., derives some terminal string).[33][34]
This form differs from standard CNF primarily in its stricter exclusion of \epsilon-productions, including the optional S \to \epsilon rule allowed in CNF for languages containing the empty string, and its explicit requirement for productivity and reachability to eliminate unproductive (non-generating) or unreachable nonterminals.[33] Standard CNF focuses mainly on the binary production structure but may retain such inefficiencies unless explicitly cleaned. Only CFGs generating languages without \epsilon can be transformed into this reduced form.[33]
To convert a CFG to Chomsky reduced form, first apply the standard CNF conversion process, which includes removing \epsilon-productions, unit productions, and mixed terminals while binarizing rules; then perform additional cleanup by eliminating any remaining S \to \epsilon (adjusting the language accordingly if \epsilon was generated) and removing unproductive and unreachable nonterminals using fixed-point iterations over the productive and reachable sets.[33][34]
Grammars in Chomsky reduced form exhibit minimal size by excluding redundant symbols and rules, facilitating streamlined derivations that can always be represented as full binary trees, and they are employed in theoretical analyses requiring precise control over grammar structure, such as complexity proofs for parsing.[33] This form is equivalent in expressive power to standard CNF for \epsilon-free languages but offers greater efficiency through its reduced symbol set and absence of nullabilities.[33]
The Floyd normal form is a variant of Chomsky normal form for context-free grammars in which production rules are of the form A \to B, A \to B C, or A \to t, where A, B, and C are nonterminals and t is a terminal symbol. This form allows unit productions and binary productions with only nonterminals on the right-hand side, along with single-terminal rules, and is often expressed in Backus–Naur form (BNF) syntax with alternatives (e.g., A ::= B | C).[35]
Proposed by Robert W. Floyd in his 1961 note on mathematical induction for phrase structure grammars, this form was introduced to enable straightforward inductive proofs on derivation trees by providing a structure suitable for induction on the height or size of trees, while allowing the unit productions common in BNF notations.[35] Floyd's approach arose in the context of analyzing properties of grammars through induction on derivations, particularly useful for theoretical studies where retaining unit productions simplifies the analysis without the need to eliminate them as in CNF.[35]
Compared to standard Chomsky normal form, which eliminates unit productions and requires binary rules to involve only nonterminals (with terminals isolated), the Floyd normal form retains unit productions and does not introduce auxiliary nonterminals for terminals in the same way, but still binarizes longer rules using only nonterminals. This allows for more concise representations in BNF and supports studies of grammatical properties by maintaining a form amenable to inductive reasoning on tree structures.[35]
To convert a context-free grammar to Floyd normal form, follow steps similar to CNF conversion (eliminating epsilon productions and binarizing longer rules over nonterminals), but retain unit productions instead of removing them. Terminals are handled via single productions, and alternatives can be used to group unit rules. For instance, a rule like A \to a B c can be handled by introducing the terminal rules T_a \to a and T_c \to c and then binarizing over nonterminals only, yielding A \to T_a X and X \to B T_c, with any unit productions left in place.
This form preserves the context-free language while providing a structure that facilitates inductive proofs and aligns with BNF practices, ensuring derivations can be analyzed via induction on tree depth.[35]
Applications and significance
Role in parsing algorithms
Chomsky normal form (CNF) is essential for several key parsing algorithms that process context-free grammars (CFGs), as it standardizes production rules to facilitate efficient computation. The Cocke-Younger-Kasami (CYK) algorithm, developed in the mid-1960s, relies on CNF to enable bottom-up table-filling dynamic programming for recognizing whether a string belongs to the language generated by the grammar.[36] In CYK, the grammar's productions are restricted to binary nonterminal rules (A → BC) or unary terminal rules (A → a), which directly correspond to the algorithm's recurrence relation for filling a triangular table of spans.[37]
The binary structure of CNF rules matches the dynamic programming paradigm by allowing a nonterminal to derive a substring through combinations of exactly two substructures, ensuring that base cases—single terminals—are handled straightforwardly without mixed-length productions complicating the table entries.[37] This form reduces the time complexity of parsing from the exponential runtime of naive top-down recursive methods to O(n³), where n is the input string length, assuming a fixed grammar size; the space complexity is O(n²).[36] Without binarization, as provided by CNF, the algorithm's loop over split points would scale with production arity, potentially exceeding cubic time.[37]
By contrast, the Earley parser, introduced in 1970, operates on general CFGs using a chart-based approach and does not require CNF.[38] CNF remains a standard requirement for bottom-up dynamic programming parsers like CYK in compiler design and natural language processing tools, where general CFGs are first converted to this form to leverage polynomial-time recognition.[36]
However, CNF does not resolve inherent ambiguities in grammars; parsing ambiguous CFGs in this form still produces multiple parses, necessitating disambiguation strategies such as probabilistic models or heuristics for unique tree selection.[38] While every context-free language has an equivalent CNF grammar, the parsing algorithms are tailored specifically to this normalized form, requiring preprocessing for non-conforming inputs.[37]
In formal language theory, Chomsky normal form (CNF) plays a pivotal role in establishing key properties of context-free languages (CFLs) and grammars by providing a standardized structure that facilitates rigorous proofs. One fundamental application is in deciding the emptiness problem for context-free grammars (CFGs). The emptiness can be determined through a fixed-point iteration over the productive nonterminals: start with the set of terminals and iteratively add nonterminals that derive strings in the current set until no changes occur; if the start symbol is productive, the language is nonempty.[39] This approach converges in polynomial time relative to the grammar size.
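The fixed-point iteration described here mirrors the productivity pass used during conversion; a minimal sketch in Python, under the same illustrative grammar representation as in the conversion section:

```python
def is_empty(grammar, start, terminals):
    """Decide emptiness via the productive-nonterminal fixed point."""
    productive = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head not in productive and any(
                    all(s in terminals or s in productive for s in body)
                    for body in bodies):
                productive.add(head)
                changed = True
    return start not in productive  # nonempty iff the start symbol is productive
```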
CNF also underpins proofs of equivalence between grammars and languages. Any CFG can be transformed into an equivalent CNF grammar without altering the generated language, enabling direct comparisons and normalizations that preserve context-free properties.[40] This normalization is essential for adapting the pumping lemma to CFLs, where the binary structure of CNF derivations allows for precise decomposition of parse trees into repeatable substructures, yielding a pumping length bounded by the grammar's nonterminal count.[41]
In language classification, CNF aids in demonstrating closure properties, such as the closure of CFLs under intersection with regular languages. The proof constructs a new CFG in CNF by integrating the finite automaton's states as nonterminals into the original grammar's productions, ensuring the resulting rules remain binary or terminal-producing while capturing the intersection.[42][43] Furthermore, CNF is instrumental in proofs of undecidability, particularly for grammar ambiguity: reductions to the Post correspondence problem show that determining whether a CNF grammar is ambiguous is undecidable, as the normal form does not resolve inherent nondeterminism in derivations.[44]
The utility of CNF extends to standardizing grammars for theoretical comparisons, allowing researchers to analyze structural invariants across equivalent formulations without loss of generality.[45] It also supports variants of Ogden's lemma, where the distinguished positions in a marked string can be pumped using the binary branching of CNF parse trees, providing a stronger tool for non-CFL proofs than the standard pumping lemma.[46][47]
Theoretically, CNF bridges generative linguistics—rooted in hierarchical phrase structure—with automata theory by mapping syntactic derivations to computational models like pushdown automata, facilitating analyses of linguistic complexity in formal terms.[48] A key fact is that derivations in a CNF grammar correspond directly to full binary trees, where internal nodes represent binary productions and leaves denote terminals; this isomorphism simplifies the analysis of language growth rates, as the tree height bounds the derivation length and enables generating function techniques to compute asymptotic densities.[49][50]