
Chomsky normal form

In formal language theory, the Chomsky normal form (CNF) is a standardized representation of context-free grammars (CFGs) in which every production rule is restricted to one of two forms: A \to BC, where A, B, and C are nonterminal symbols (variables), or A \to a, where a is a terminal symbol, with the optional allowance of a rule S \to \epsilon for the start symbol S if the empty string is in the language. This form eliminates unit productions (A \to B), mixed terminal-nonterminal rules, and productions with more than two symbols on the right-hand side, ensuring a binary branching structure in the corresponding parse trees. Named after linguist Noam Chomsky, who first described this restricted form in his 1959 paper on the formal properties of grammars, CNF serves as a normal form for type-2 grammars in the Chomsky hierarchy. Every context-free grammar can be algorithmically converted into an equivalent CNF grammar that generates the same language, preserving the grammar's generative power while simplifying its structure for analysis. This transformation is constructive and involves steps such as removing useless symbols, eliminating \epsilon-productions (except possibly for the start symbol), removing unit productions, and binarizing longer rules. The importance of CNF lies in its role in theoretical computer science and natural language processing, where it facilitates proofs about context-free languages and enables efficient parsing algorithms. For instance, in CNF, the derivation of any string of length n requires exactly 2n - 1 production applications, which underpins dynamic programming approaches like the Cocke-Younger-Kasami (CYK) algorithm for recognizing membership in a language, achieving O(n^3) time complexity. Additionally, CNF's binary tree structure aligns parse trees with hierarchical syntactic analysis, making it valuable for compiler design, ambiguity resolution, and computational linguistics.

Introduction and background

Definition

A context-free grammar G = (V, \Sigma, P, S) consists of a finite set V of variables (nonterminals), a finite set \Sigma of terminals, a finite set P of production rules, and a distinguished start variable S \in V. The grammar G is in Chomsky normal form (CNF) if every production rule in P has one of two forms: A \to BC, where A, B, C \in V are variables and neither B nor C is the start symbol, or A \to a, where A \in V and a \in \Sigma is a terminal. This form imposes strict restrictions on the productions: there are no \epsilon-productions except possibly S \to \epsilon if the empty string is in the language generated by G and S does not appear in the right-hand side of any other production; there are no unit productions of the form A \to B where A, B \in V; and there are no productions with right-hand sides longer than two symbols or mixing terminals and nonterminals, except for the single-terminal case. The Chomsky normal form is named after linguist Noam Chomsky, who first described this restricted form in his 1959 paper "On Certain Formal Properties of Grammars."

Historical development

The Chomsky normal form emerged as part of Noam Chomsky's efforts to formalize and standardize the structure of context-free grammars within the broader framework of generative linguistics and formal language theory. In his seminal 1956 paper, "Three Models for the Description of Language," Chomsky introduced the hierarchy of grammars, categorizing them into types 0 through 3 based on production rule restrictions, with type 2 encompassing context-free grammars. This classification motivated the creation of normalized forms to streamline theoretical analysis, particularly for addressing decision problems such as language membership testing and syntactic ambiguity, by reducing the complexity of derivations while preserving generative power. Chomsky first described the normal form in his 1959 paper "On Certain Formal Properties of Grammars," presenting it as a restricted form for type-2 grammars with productions limited to binary nonterminal rules or single terminals. This initial formulation laid the groundwork for normal forms as tools to prove fundamental properties of formal languages, emphasizing their role in distinguishing context-free languages from the more restrictive regular languages (type 3) and the more general unrestricted grammars (type 0). The hierarchy itself highlighted the need for such standardizations to explore generative capacity and computational tractability. Refinements appeared in Chomsky's 1963 work, "Formal Properties of Grammars," which explicitly detailed the normal form as a representation for type-2 grammars consisting solely of binary nonterminal productions and terminal substitutions. This evolution facilitated rigorous proofs of equivalence between arbitrary context-free grammars and their normalized counterparts, enhancing the toolkit for undecidability results and equivalence demonstrations. By the 1960s, the form had become integral to parsing theory, supporting developments in algorithmic recognition of context-free languages and bridging linguistics with early computer science.
Chomsky normal form's tight connection to the type-2 category of the Chomsky hierarchy proved essential for theoretical advancements, including analyses of self-embedding and inherent ambiguity in context-free languages.

Properties and equivalence

Key properties

Grammars in Chomsky normal form exhibit several key structural properties that simplify analysis and processing. Every nonterminal derives a non-empty string of terminals, as the form prohibits ε-productions (except possibly for the start symbol in languages containing the empty string) and unit productions. The corresponding parse trees are binary branching at all internal nodes, with each non-leaf node having exactly two children labeled by nonterminals, while leaves are single terminals. In a minimal grammar—one without redundant symbols—there are no useless nonterminals; every symbol is reachable from the start symbol and generates at least one terminal string. These structural constraints yield important computational properties. The binary form enables the Cocke-Younger-Kasami (CYK) dynamic programming algorithm to recognize membership in the generated language in O(n^3) time for input strings of length n, where the hidden constant depends on the number of nonterminals and productions. Furthermore, the bounded rule length facilitates ambiguity detection by allowing the enumeration or counting of all possible parse trees for a given string, as the number of such trees is finite and can be computed efficiently using extensions of CYK. Converting a general context-free grammar to Chomsky normal form can substantially expand its size. If the original grammar has size n (measured by the total number of symbols across all productions), the process may introduce up to O(n^2) new nonterminals and production rules in the worst case under an appropriate ordering of steps (such as performing binarization before epsilon-elimination), primarily due to the auxiliary symbols introduced for binarization and terminal separation. Chomsky normal form is not unique for a given language; distinct grammars in this form can generate identical context-free languages, arising from different choices in the normalization process or equivalent but non-isomorphic representations.
For any string of length n \geq 1 derived from a grammar in Chomsky normal form, every leftmost (or rightmost) derivation requires exactly 2n - 1 steps: n steps to introduce terminals and n - 1 steps to branch nonterminals. To see this, note that each terminal-introducing production adds one leaf, requiring n such productions, while the branching necessitates n - 1 nonterminal-pairing productions to connect them into a full binary tree; the total follows by induction on n, with the base case n = 1 using one step and the inductive step adding two steps (one branch and one terminal) per additional leaf.
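The 2n - 1 count is easy to check mechanically. The sketch below is a minimal illustration (names and tree encoding are assumptions, not from the source): a CNF parse tree is a nested tuple, (nonterminal, left, right) for a binary rule and (nonterminal, terminal) for a lexical rule, and we count the productions applied while collecting the yield.

```python
def steps_and_yield(tree):
    """Return (number of productions applied, derived terminal string)."""
    if len(tree) == 2:                 # lexical rule A -> a
        return 1, tree[1]
    _, left, right = tree              # binary rule A -> B C
    ls, lw = steps_and_yield(left)
    rs, rw = steps_and_yield(right)
    return 1 + ls + rs, lw + rw

# Parse tree for "ab" under the rules S -> A B, A -> a, B -> b:
tree = ("S", ("A", "a"), ("B", "b"))
steps, word = steps_and_yield(tree)
assert steps == 2 * len(word) - 1      # 3 = 2*2 - 1
```

Because each internal node consumes one binary production and each leaf one terminal production, the assertion holds for any well-formed CNF tree, not just this example.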

Equivalence to general context-free grammars

A fundamental result in the theory of formal languages establishes that every context-free language can be generated by a context-free grammar in Chomsky normal form. Specifically, for any context-free grammar G generating a language L, there exists an equivalent grammar G' in Chomsky normal form such that L(G') = L(G). The proof of this theorem is constructive and relies on a sequence of grammar transformations that systematically restrict the form of productions while preserving the generated language at each stage. These transformations ensure that the resulting grammar adheres to the rules of Chomsky normal form without altering the language's generative power. A detailed proof establishes the correctness of each transformation in turn. Productions already in normal form, such as terminal or binary nonterminal rules, need no change. For each elimination process—removing ε-productions, unit productions, and so on—equivalence is shown by demonstrating that every derivation in the original grammar can be mimicked in the new one and vice versa, thus preserving the language exactly. Special attention is given to edge cases involving the empty string ε. If ε belongs to the language, the conversion allows the start symbol to retain a production to ε, which is explicitly permitted in the definition of Chomsky normal form for the start variable. For languages excluding ε, the process eliminates all ε-productions entirely, ensuring the resulting grammar generates precisely the non-empty strings of the original language. This equivalence has significant implications: it confirms that the class of context-free languages admits a representation via Chomsky normal form grammars, facilitating theoretical analyses and equivalence proofs in formal language theory and automata theory.

Conversion procedure

Preliminary preparations

Before proceeding with the specific transformations required for Chomsky normal form, it is essential to prepare the grammar by removing useless symbols and ensuring its structural integrity. Useless symbols include non-generating nonterminals, which cannot derive any terminal string, and unreachable nonterminals, which cannot appear in any sentential form derived from the start symbol. These elements do not contribute to the language generated by the grammar and can be safely eliminated to simplify subsequent steps. To identify non-generating nonterminals, apply a fixed-point computation starting from the terminals, which are trivially generating. Iteratively mark a nonterminal A as generating if there exists a production A \to \alpha where every symbol in \alpha (a string over terminals and nonterminals) has already been marked generating. Continue until no additional nonterminals can be marked. All unmarked nonterminals are non-generating; remove them along with all productions in which they appear on the left- or right-hand side. This process ensures that every remaining nonterminal can derive at least one terminal string. Next, identify unreachable nonterminals by another fixed-point computation, but starting from the designated start symbol S. Initially, mark S as reachable. Then, iteratively mark a nonterminal B as reachable if it appears in the right-hand side of any production whose left-hand side is already marked reachable. Continue until no further markings are possible. Unmarked nonterminals are unreachable; eliminate them and all associated productions. Perform this computation only after removing non-generating symbols, as the latter may otherwise propagate false reachability. The resulting grammar contains only symbols involved in derivations from S to terminal strings.
If the original grammar has overlapping symbols between the terminal alphabet \Sigma and the nonterminal alphabet V_N, rename the terminals—such as by appending a distinguishing marker (e.g., replacing a shared symbol a with t_a)—to enforce disjointness, as required by the standard definition of context-free grammars. This prevents ambiguity in parsing and derivation steps. In cases where the grammar features multiple start-like rules or several nonterminals functioning as potential starts (e.g., disconnected components), consolidate them by introducing a new start symbol S' with productions S' \to S_1 \mid S_2 \mid \dots \mid S_k, where S_1, \dots, S_k are the original start-like nonterminals; set S' as the new start. This unifies the grammar under a single entry point without altering the generated language. These preliminary steps produce a minimal grammar in which every nonterminal is both generating and reachable, thereby reducing the size and complexity for later normalization phases while preserving the exact language generated by the original grammar.
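The two fixed-point passes can be sketched in a few lines of Python. The representation is an assumption for illustration: a grammar is a dict from nonterminal names to lists of right-hand-side tuples, and a symbol counts as a nonterminal exactly when it is a key of the dict.

```python
def generating(grammar):
    """Fixed point: nonterminals that derive at least one terminal string."""
    gen = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            # A is generating if some RHS uses only generating symbols
            # (terminals, i.e. non-keys, are trivially generating).
            if a not in gen and any(
                    all(x in gen or x not in grammar for x in rhs)
                    for rhs in rhss):
                gen.add(a)
                changed = True
    return gen

def reachable(grammar, start):
    """Fixed point: nonterminals reachable from the start symbol."""
    seen, stack = {start}, [start]
    while stack:
        for rhs in grammar.get(stack.pop(), []):
            for x in rhs:
                if x in grammar and x not in seen:
                    seen.add(x)
                    stack.append(x)
    return seen

g = {
    "S": [("A", "B"), ("A", "a")],
    "A": [("a",)],
    "B": [("B", "b")],   # non-generating: B only rewrites to itself
    "C": [("c",)],       # generating but unreachable from S
}
gen = generating(g)      # {'S', 'A', 'C'}
# Drop non-generating symbols (and every rule mentioning them) first,
# then compute reachability on what remains.
g2 = {a: [r for r in rhss if all(x in gen or x not in g for x in r)]
      for a, rhss in g.items() if a in gen}
reach = reachable(g2, "S")   # {'S', 'A'}
```

The order matters here exactly as the text says: removing the non-generating B deletes the rule S \to A B, and only then does the reachability pass correctly exclude C.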

Elimination of ε-productions

Epsilon-productions, also known as null productions, are rules of the form A \to \epsilon in a context-free grammar (CFG), where A is a nonterminal and \epsilon denotes the empty string. These productions allow nonterminals to derive the empty string, which complicates certain analyses and parsing algorithms, necessitating their removal during conversion to Chomsky normal form while preserving the generated language. To eliminate ε-productions, first identify the set of nullable nonterminals, denoted N_\epsilon, which are those that can derive \epsilon. This set is computed iteratively as the smallest fixed point satisfying: initialize with all nonterminals A such that A \to \epsilon exists; then repeatedly add any nonterminal B for which there is a production B \to X_1 X_2 \dots X_k (where each X_i is a nonterminal) such that all X_i \in N_\epsilon. The process terminates since the set of nonterminals is finite, yielding N_\epsilon in time proportional to the square of the grammar's size. Once nullable nonterminals are identified, remove all ε-productions A \to \epsilon. For each remaining production A \to \alpha, where \alpha = X_1 X_2 \dots X_m consists of terminals and nonterminals, generate new productions by considering all possible ways to omit occurrences of nullable nonterminals in \alpha, provided the resulting right-hand side is non-empty. Specifically, for every non-empty subsequence obtained by deleting zero or more occurrences of symbols from N_\epsilon, add a corresponding production A \to \beta, where \beta is that subsequence. This substitution ensures that derivations using nullable nonterminals are simulated by the new rules without altering the language. For example, consider a production A \to B C D, where C \in N_\epsilon but B, D \notin N_\epsilon. The new productions are A \to B C D (omitting nothing) and A \to B D (omitting C); neither A \to \epsilon nor A \to B is added, as those would require omitting non-nullable symbols.
If multiple nullables are present, all combinations are added, potentially increasing the number of productions exponentially in the length of \alpha, though practical grammars remain manageable. A special case arises with the start symbol S. If S \in N_\epsilon (indicating \epsilon \in L(G)), introduce a new start symbol S' with productions S' \to S and S' \to \epsilon, and ensure S does not appear on any right-hand side. This preserves \epsilon in the language without allowing any other ε-productions. If S \notin N_\epsilon, no such ε-production is needed. The resulting grammar generates the same language as the original: every derivation in the new grammar corresponds to one in the old by inserting ε-derivations for the omitted nullables, and vice versa, with \epsilon itself handled explicitly through the start symbol. This step never lengthens productions, but it can multiply their number; the copied rules are normalized further in subsequent phases.
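Both phases fit in a compact sketch, under the same illustrative dict-of-tuples representation as before (keys are nonterminals; an ε-rule is an empty tuple; all names are assumptions, not a fixed API).

```python
from itertools import product

def nullable_set(grammar):
    """Smallest fixed point of the nullability conditions."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            if a not in nullable and any(
                    all(x in nullable for x in rhs) for rhs in rhss):
                nullable.add(a)      # an empty rhs (A -> ε) passes trivially
                changed = True
    return nullable

def eliminate_epsilon(grammar):
    nullable = nullable_set(grammar)
    new = {}
    for a, rhss in grammar.items():
        out = set()
        for rhs in rhss:
            if not rhs:
                continue             # drop A -> ε outright
            # Each nullable occurrence may be kept or omitted.
            choices = [[x, None] if x in nullable else [x] for x in rhs]
            for combo in product(*choices):
                beta = tuple(x for x in combo if x is not None)
                if beta:             # never add an empty right-hand side
                    out.add(beta)
        new[a] = sorted(out)
    return new

g = {"A": [("B", "C", "D")], "B": [("b",)], "C": [()], "D": [("d",)]}
result = eliminate_epsilon(g)
print(result["A"])   # [('B', 'C', 'D'), ('B', 'D')]
```

This reproduces the A \to B C D example above: only C is nullable, so exactly one shortened rule A \to B D is added, and C loses its sole (ε) production.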

Elimination of unit productions

Unit productions in a context-free grammar are rules of the form A \to B, where A and B are nonterminals. These productions must be eliminated during the conversion to Chomsky normal form to ensure all remaining rules conform to the required structure, while preserving the grammar's generative power. Assuming ε-productions have already been removed, unit productions do not introduce new ε-derivations, as no nonterminal derives the empty string. To identify unit pairs, compute the transitive closure of unit derivations, determining for each nonterminal A the set of all nonterminals B such that A \Rightarrow^* B using only unit productions. This can be done efficiently by constructing a directed graph with a node for each nonterminal and an edge from C to D for each unit production C \to D, then finding the nodes reachable from each starting nonterminal via standard graph traversal or a transitive closure algorithm such as Warshall's. Self-loops (A \to A) are simply removed without substitution, as they contribute nothing to derivations. The elimination algorithm proceeds as follows: for each nonterminal A and each B reachable from A via unit derivations (with B \neq A), add to the grammar all non-unit productions of B; that is, if B \to \gamma where \gamma has length at least two or is a single terminal, include A \to \gamma. Then, delete all original unit productions. This process is non-iterative once the reachable sets are computed; cycles in the unit graph are handled by the reachability computation itself, so no infinite substitution can occur. The resulting grammar generates the same language as the original, as every derivation using unit rules can be replaced by direct non-unit rules without altering the yielded terminal strings. However, the number of productions may increase, potentially quadratically in the number of nonterminals in the worst case, due to copying non-unit rules across unit chains.
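The unit-pair computation is plain graph reachability. The sketch below is illustrative, assuming the same dict-of-tuples representation and that ε-productions were removed beforehand.

```python
def eliminate_units(grammar):
    nts = set(grammar)
    is_unit = lambda rhs: len(rhs) == 1 and rhs[0] in nts
    # Unit graph: an edge A -> B for every unit production A -> B.
    edges = {a: {rhs[0] for rhs in rhss if is_unit(rhs)}
             for a, rhss in grammar.items()}

    def unit_reachable(a):
        """Nonterminals B != A with A =>* B via unit productions only."""
        seen, stack = set(), [a]
        while stack:
            for d in edges[stack.pop()] - seen:
                seen.add(d)
                stack.append(d)
        seen.discard(a)          # self-loops contribute nothing
        return seen

    new = {}
    for a, rhss in grammar.items():
        keep = {rhs for rhs in rhss if not is_unit(rhs)}
        for b in unit_reachable(a):
            keep |= {rhs for rhs in grammar[b] if not is_unit(rhs)}
        new[a] = sorted(keep)
    return new

g = {"S": [("A",), ("a", "S")], "A": [("B",)], "B": [("b",)]}
result = eliminate_units(g)
print(result["S"])   # [('a', 'S'), ('b',)]
```

Note how the chain S \to A \to B collapses: S receives B's terminal rule directly, and all three unit rules disappear.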

Replacement of nonsolitary terminals

In the conversion procedure for Chomsky normal form, the replacement of nonsolitary terminals targets productions where a terminal appears in the right-hand side (RHS) alongside one or more other symbols, such as A \to aB or A \to BaC, violating the form's requirement that terminals appear only in solitary positions (i.e., A \to a). These "mixed" productions are identified by scanning all rules after prior steps like eliminating ε-productions and unit productions, focusing solely on those with RHS length greater than 1 containing terminals. The algorithm proceeds as follows: for each distinct terminal a \in \Sigma that appears in a nonsolitary position in any production, introduce a fresh nonterminal X_a (not previously used in the grammar) and add the production X_a \to a. Then, replace every occurrence of a in the RHS of mixed productions with X_a, while leaving solitary terminal productions A \to a unchanged. This introduces exactly one new nonterminal and one new production per such terminal a, adding up to |\Sigma| new rules and nonterminals, depending on which terminals appear in mixed contexts. For example, consider the production A \to aBC, where a is a terminal and B, C are nonterminals. Introduce X_a \to a, then rewrite the original as A \to X_a BC. Similarly, D \to BaC becomes D \to B X_a C after the same replacement. If multiple terminals are involved, as in S \to aXbY, introduce X_a \to a and X_b \to b, yielding S \to X_a X X_b Y. This transformation preserves the language generated by the grammar, as any derivation incorporating X_a can be expanded to include a without altering the yield, and conversely, any string derived originally can be obtained by substituting back from the new nonterminals. Upon completion, all terminals in the grammar appear only in solitary RHS positions, setting the stage for binarization of the remaining nonterminal sequences while maintaining equivalence to the original context-free grammar.
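A sketch of the replacement step under the same assumed dict-of-tuples representation; the fresh names T_a, T_b, ... are illustrative stand-ins for the X_a above and are assumed not to clash with existing nonterminals.

```python
def replace_nonsolitary_terminals(grammar):
    nts = set(grammar)
    fresh = {}                        # terminal -> its fresh wrapper
    new = {a: [] for a in grammar}
    for a, rhss in grammar.items():
        for rhs in rhss:
            if len(rhs) == 1:         # solitary symbol: leave unchanged
                new[a].append(rhs)
                continue
            wrapped = []
            for x in rhs:
                if x in nts:
                    wrapped.append(x)
                else:                 # nonsolitary terminal: wrap it
                    wrapped.append(fresh.setdefault(x, f"T_{x}"))
            new[a].append(tuple(wrapped))
    for t, nt in fresh.items():
        new[nt] = [(t,)]              # the new rule T_x -> x
    return new

g = {"A": [("a", "B", "C")], "B": [("b",)], "C": [("c",)]}
result = replace_nonsolitary_terminals(g)
print(result["A"])      # [('T_a', 'B', 'C')]
print(result["T_a"])    # [('a',)]
```

One wrapper per terminal suffices no matter how often the terminal occurs, which is why the growth is bounded by |\Sigma|.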

Reduction to binary nonterminals

In the conversion to Chomsky normal form, after eliminating ε-productions, unit productions, and mixed terminals in right-hand sides, the grammar consists of rules whose right-hand side is either a single terminal or a sequence of two or more nonterminals. The next step addresses productions with more than two nonterminals on the right-hand side, known as long rules, of the form A \to B_1 B_2 \dots B_k where k > 2 and each B_i is a nonterminal. These must be broken down into binary rules (exactly two nonterminals on the right-hand side) so that every such rule becomes X \to Y Z for nonterminals X, Y, Z. The algorithm proceeds recursively by introducing new nonterminals to split the sequence. For a long rule A \to B_1 B_2 \dots B_k, replace it with A \to B_1 C_1, where C_1 is a fresh nonterminal, and add the rule C_1 \to B_2 \dots B_k. Repeat the process on C_1 \to B_2 \dots B_k until the right-hand side has exactly two symbols, yielding a chain of binary rules ending in C_{k-3} \to B_{k-2} C_{k-2} and C_{k-2} \to B_{k-1} B_k. This introduces exactly k - 2 new nonterminals per original rule, often named systematically like C_i to indicate the suffix of the original sequence they derive (e.g., C_1 derives B_2 \dots B_k). The choice of splitting from the left (as described) is conventional but arbitrary; right-branching or balanced variants exist and equally preserve equivalence. The transformation preserves the language generated by the grammar, as every derivation using an original long rule corresponds to a unique sequence of derivations in the new grammar, and vice versa, maintaining the same yield of terminals. The resulting parse trees may appear skewed (left- or right-branching) due to the chaining, but the associative structure of derivations ensures the generated language is unchanged. For example, consider the rule S \to NP\ VP\ PP; it becomes S \to NP\ X_1 and X_1 \to VP\ PP, introducing one new nonterminal X_1 since k = 3.
After processing all long rules—typically following unit-production elimination—the grammar contains only binary nonterminal rules alongside terminal rules like A \to a.
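Left-to-right binarization fits in a short loop. In this sketch (same assumed representation), the fresh names X1, X2, ... are assumed not to clash with existing nonterminals.

```python
def binarize(grammar):
    """Split right-hand sides longer than two into chains of binary rules."""
    new = {a: [] for a in grammar}
    counter = 0
    for a, rhss in grammar.items():
        for rhs in rhss:
            head = a
            while len(rhs) > 2:
                counter += 1
                fresh = f"X{counter}"          # fresh suffix nonterminal
                new[head].append((rhs[0], fresh))
                new[fresh] = []
                head, rhs = fresh, rhs[1:]
            new[head].append(rhs)
    return new

g = {"S": [("NP", "VP", "PP")],
     "NP": [("n",)], "VP": [("v",)], "PP": [("p",)]}
result = binarize(g)
print(result["S"])    # [('NP', 'X1')]
print(result["X1"])   # [('VP', 'PP')]
```

This reproduces the S \to NP\ VP\ PP example above. Caching fresh symbols by the remaining suffix tuple, so identical suffixes share one auxiliary nonterminal, is the variable-reuse optimization that keeps growth down in practice.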

Handling the start symbol

In Chomsky normal form, the start symbol must not appear on the right-hand side of any production rule, as this ensures the form's strict structure and facilitates algorithms like the Cocke-Younger-Kasami parser. If, after prior conversion steps, productions exist where the start symbol S appears on the right-hand side—such as A \to S \beta for some nonterminal A and string \beta of nonterminals and terminals—this violates the normal form requirement. To resolve this, introduce a new start symbol S_0 distinct from all existing nonterminals, add the production S_0 \to S, and designate S_0 as the start symbol of the grammar. Since S_0 appears on no right-hand side, the requirement is satisfied, while every occurrence of S on a right-hand side now refers to an ordinary nonterminal; a rule such as A \to S B remains valid as-is. This transformation preserves the generated language, as any derivation from S_0 simply begins with the step S_0 \to S and then proceeds exactly as a derivation from S in the original grammar. This step is often listed last in the conversion procedure, though many treatments (including the worked example below) perform it first, so that later eliminations—such as those for \varepsilon-productions or unit productions—never need to modify the start symbol or introduce additional substitutions involving it. A special case arises when \varepsilon belongs to the language: the production S_0 \to \varepsilon is then added alongside S_0 \to S, which is permissible in Chomsky normal form precisely because the new start symbol appears on no right-hand side.
For languages containing \varepsilon among other strings, prior \varepsilon-elimination steps will already have removed all other \varepsilon-productions, and adding the new start preserves the property without further adjustment. The recommended order of steps for converting a context-free grammar to Chomsky normal form begins with preliminary preparations to remove useless symbols, followed by elimination of ε-productions, elimination of unit productions, replacement of nonsolitary terminals, reduction to binary nonterminals, and finally handling the start symbol. This sequence minimizes complications and ensures correctness by respecting the dependencies between steps. Eliminating ε-productions early avoids reintroducing nullables in later phases, since transformations like unit elimination could otherwise generate new ε-productions if nullables persist. Unit productions are removed next, because ε-elimination may create additional unit rules that would have to be reprocessed if unit elimination came first. Replacing nonsolitary terminals precedes binarization so that long right-hand sides contain only nonterminals, simplifying the introduction of binary rules without mixed symbols. While any order that preserves language equivalence is valid, the order strongly affects the size of intermediate grammars: performing binarization before ε-elimination bounds the resulting grammar at O(n^2) in the original size n, whereas eliminating ε-productions on long right-hand sides first can, in the worst case, blow the grammar up exponentially. Practical implementations also benefit from reusing auxiliary variables during binarization to reduce growth.

Examples and illustrations

Step-by-step conversion example

Consider the grammar G = (V, \Sigma, P, S) with variables V = \{S, A, B\}, terminals \Sigma = \{a, b\}, start symbol S, and productions P = \{ S \to ASA \mid aB, \, A \to B \mid S, \, B \to b \mid \epsilon \}. The conversion below follows the standard steps, preserving the generated language at each stage.

Preliminary preparations

Add a new start symbol S_0 with the production S_0 \to S to isolate the original start symbol and prevent issues if it becomes nullable during later steps. The updated grammar is:
S₀ → S
S → ASA | aB
A → B | S
B → b | ε
No other preparations are needed, as there are no immediate irregularities beyond the ε-production.

Elimination of ε-productions

The nullable nonterminals are A and B (since B \to \epsilon and A \to B). Remove B \to \epsilon. For productions containing nullable symbols, add shortened versions by omitting combinations of nullables (excluding the empty string for non-ε original productions):
  • From S \to aB, add S \to a (omit B).
  • From S \to ASA, add S \to SA (omit first A), S \to AS (omit second A), and S \to S (omit both A's).
  • From A \to B, no new production (omitting B would yield ε, which is not added).
The resulting grammar is:
S₀ → S
S → ASA | aB | a | SA | AS | S
A → B | S
B → b

Elimination of unit productions

The unit productions are S_0 \to S, A \to S, A \to B, and S \to S. Remove S \to S. Replace each unit by adding the non-unit productions of the target:
  • For A \to S, add S's non-unit productions to A: A \to ASA \mid aB \mid a \mid SA \mid AS; remove A \to S.
  • For A \to B, add B's productions to A: A \to b; remove A \to B.
  • For S_0 \to S, add S's non-unit productions to S_0: S_0 \to ASA \mid aB \mid a \mid SA \mid AS; remove S_0 \to S.
The resulting grammar is:
S₀ → ASA | aB | a | SA | AS
S → ASA | aB | a | SA | AS
A → b | ASA | aB | a | SA | AS
B → b

Replacement of nonsolitary terminals

Nonsolitary terminals appear in aB (a mixed with nonterminal). Introduce a new variable C with C \to a, and replace all instances of aB with CB:
S₀ → ASA | CB | a | SA | AS
S → ASA | CB | a | SA | AS
A → b | ASA | CB | a | SA | AS
B → b
C → a
All terminals are now solitary.

Reduction to binary nonterminals

The only productions with more than two symbols on the right-hand side are those with ASA (three nonterminals). Introduce a new variable D with D \to SA, and replace all ASA with AD:
S₀ → AD | CB | a | SA | AS
S → AD | CB | a | SA | AS
A → b | AD | CB | a | SA | AS
B → b
C → a
D → SA
All right-hand sides now consist of either two nonterminals or a single terminal.

Handling the start symbol

The new start S_0 does not derive \epsilon (consistent with the original language excluding \epsilon), and no further adjustments are needed: S_0's productions are already in CNF, and the unit production S_0 \to S was eliminated earlier. The final grammar in Chomsky normal form has variables \{S_0, S, A, B, C, D\}, the same terminals, start S_0, and the productions listed above; it generates the same language as the original grammar. For a simpler example without unit or mixed-terminal complications, consider the grammar S \to aSb \mid \epsilon, generating \{a^n b^n : n \geq 0\}. After adding S_0 \to S, eliminate \epsilon: add S \to ab (omitting the nullable S from aSb). Introduce E \to a and F \to b to handle the nonsolitary terminals, turning ab into E F and aSb into E S F. To binarize E S F, introduce G \to S F, replacing it with E G. Thus S \to E G \mid E F and G \to S F. Since the language includes \epsilon, start handling yields S_0 \to \epsilon \mid E G \mid E F.

Application in parsing

Chomsky normal form (CNF) facilitates efficient parsing of context-free languages through algorithms like the Cocke-Younger-Kasami (CYK) algorithm, which determines whether a given string belongs to the language generated by the grammar. The algorithm relies on the binary structure of CNF productions to build a dynamic programming table that tracks derivable substrings, enabling a bottom-up recognition process. Consider a simple CNF grammar with nonterminals S, A, B, C, D, E, F and productions:
  • S \to A B
  • A \to C D
  • B \to E F
  • C \to a
  • D \to a
  • E \to b
  • F \to b
This grammar generates the string "aabb". To parse it using CYK, construct a table V[i][j], where i is the starting position (1-based) and j is the ending position, whose entries are the sets of nonterminals deriving the substring w_i \dots w_j. For w = "aabb" (n = 4), initialize the length-1 entries from the terminal productions:
  • V[1][1] = \{C, D\} (derives "a")
  • V[2][2] = \{C, D\} (derives "a")
  • V[3][3] = \{E, F\} (derives "b")
  • V[4][4] = \{E, F\} (derives "b")
For longer spans, fill bottom-up by considering all possible split points k with i \leq k < j, adding a nonterminal X to V[i][j] if there exists a production X \to Y Z with Y \in V[i][k] and Z \in V[k+1][j]. For length 2:
  • V[1][2] = \{A\} (from A \to C D, split at k=1)
  • V[2][3] = \emptyset (no matching productions)
  • V[3][4] = \{B\} (from B \to E F, split at k=3)
For length 3:
  • V[1][3] = \emptyset (no matching productions across splits)
  • V[2][4] = \emptyset (no matching productions across splits)
For length 4:
  • V[1][4] = \{S\} (from S \to A B, split at k=2)
Since S \in V{{grok:render&&&type=render_inline_citation&&&citation_id=1&&&citation_type=wikipedia}}{{grok:render&&&type=render_inline_citation&&&citation_id=4&&&citation_type=wikipedia}}, the string "aabb" is recognized by the grammar. The process examines O(n^3) substrings and splits, with each check scanning O(|G|) productions, yielding O(n^3 |G|) , where |G| is the number of productions. This cubic bound holds specifically because CNF restricts productions to nonterminals or single terminals, avoiding higher-degree branching that would inflate the time to O(n^{d+1}) for productions of degree d > 2. The CYK table for this example can be visualized as follows (rows for starting index i, columns for ending index j; empty cells for invalid spans j < i are blank):
  i \ j    1        2        3        4
  1        {C,D}    {A}      ∅        {S}
  2                 {C,D}    ∅        ∅
  3                          {E,F}    {B}
  4                                   {E,F}

Variants and extensions

Chomsky reduced form

The Chomsky reduced form is a variant of the Chomsky normal form (CNF) that imposes additional constraints to give a cleaner, more economical representation of context-free grammars (CFGs) generating languages without the empty string. Specifically, a CFG G = (T, N, S, R) is in Chomsky reduced form if every production rule in R is either A \to BC or A \to a, where A, B, C \in N are nonterminal symbols and a \in T is a terminal symbol; the start symbol S does not appear on the right-hand side of any production; there are no \epsilon-productions; and there are no useless symbols, meaning every nonterminal in N is reachable from S and productive (i.e., derives some terminal string). This form differs from standard CNF primarily in its stricter exclusion of \epsilon-productions, including the optional S \to \epsilon rule allowed in CNF for languages containing the empty string, and in its explicit requirement of productivity and reachability, which eliminates unproductive (non-generating) and unreachable nonterminals. Standard CNF focuses mainly on the production structure and may retain such inefficiencies unless explicitly cleaned. Only CFGs generating languages without \epsilon can be transformed into this form. To convert a CFG to Chomsky reduced form, first apply the standard CNF conversion process, which includes removing \epsilon-productions, unit productions, and mixed terminals while binarizing rules; then perform additional cleanup by eliminating any remaining S \to \epsilon rule (adjusting the language accordingly if \epsilon was generated) and removing unproductive and unreachable nonterminals using fixed-point iterations over the productive and reachable sets. Grammars in Chomsky reduced form exhibit minimal size by excluding redundant symbols and rules, facilitating streamlined derivations that can always be drawn as full binary trees, and they are employed in theoretical analyses requiring precise control over derivation structure.
This form is equivalent in expressive power to standard CNF for \epsilon-free languages but offers a leaner representation through its reduced symbol set and absence of nullable symbols.
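The fixed-point cleanup over the productive and reachable sets can be sketched as follows; `remove_useless` and the dictionary-based grammar encoding are illustrative, not a standard API:

```python
def remove_useless(rules, start):
    """Remove unproductive and unreachable nonterminals from a CFG.

    rules: dict mapping each nonterminal to a list of right-hand sides,
    each a tuple of symbols; symbols that are not keys of `rules` are
    treated as terminals.
    """
    # Fixed point 1: a nonterminal is productive if some RHS consists
    # only of terminals and already-productive nonterminals.
    productive = set()
    changed = True
    while changed:
        changed = False
        for head, rhss in rules.items():
            if head not in productive and any(
                all(s in productive or s not in rules for s in rhs)
                for rhs in rhss
            ):
                productive.add(head)
                changed = True
    # Drop rules that mention unproductive nonterminals.
    rules = {
        head: [rhs for rhs in rhss
               if all(s in productive or s not in rules for s in rhs)]
        for head, rhss in rules.items() if head in productive
    }
    # Fixed point 2: keep only nonterminals reachable from the start symbol.
    reachable, frontier = {start}, [start]
    while frontier:
        head = frontier.pop()
        for rhs in rules.get(head, []):
            for s in rhs:
                if s in rules and s not in reachable:
                    reachable.add(s)
                    frontier.append(s)
    return {h: r for h, r in rules.items() if h in reachable}

# S -> A b;  A -> a;  B -> B b (unproductive);  D -> a (unreachable)
g = {'S': [('A', 'b')], 'A': [('a',)],
     'B': [('B', 'b')], 'D': [('a',)]}
cleaned = remove_useless(g, 'S')
print(sorted(cleaned))   # ['A', 'S']
```

Removing unproductive symbols first matters: a nonterminal such as D above is only exposed as unreachable once the rules it hung from are gone or, as here, it was never reachable from S at all.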

Floyd normal form

The Floyd normal form is a variant of Chomsky normal form for context-free grammars in which production rules are of the form A \to B, A \to B C, or A \to t, where A, B, and C are nonterminals and t is a terminal symbol. This form allows unit productions and binary productions with only nonterminals on the right-hand side, along with single-terminal rules, and is often expressed in Backus-Naur form (BNF) syntax with alternatives (e.g., A ::= B | C). Proposed by Robert W. Floyd in his 1961 note on mathematical induction in phrase structure grammars, it was introduced to enable straightforward inductive proofs on parse trees by providing a suitable basis for induction on the height or size of derivation trees, while allowing the unit productions common in BNF notations. Floyd's approach arose in the context of analyzing properties of grammars through induction on derivations, particularly useful for theoretical studies where retaining unit productions simplifies the analysis without the need to eliminate them as in CNF. Compared to standard Chomsky normal form, which eliminates unit productions and requires binary rules to involve only nonterminals (with terminals isolated), the Floyd normal form retains unit productions; like CNF, it isolates terminals in single-terminal rules and binarizes longer rules using only nonterminals. This allows for more concise representations in BNF and supports studies of grammatical properties by maintaining a form amenable to induction on tree structures. To convert a CFG to Floyd normal form, follow steps similar to CNF conversion, eliminating \epsilon-productions and binarizing longer rules over nonterminals, but retain unit productions instead of removing them. Terminals embedded in longer rules are isolated via fresh single-terminal productions: for instance, a rule like A \to a B c can be rewritten as A \to T_a X, X \to B T_c, T_a \to a, T_c \to c, so that every rule is a unit production, a binary production over nonterminals, or a single-terminal production.
This form preserves the generated language while providing a structure that facilitates inductive proofs and aligns with BNF practices, ensuring derivations can be analyzed via induction on tree height.
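As a sketch of that conversion step, the illustrative helper below isolates terminals and binarizes a single rule into Floyd-form rules. The `N1`, `N2`, ... names are hypothetical fresh nonterminals, and uppercase symbols are assumed (by convention of this sketch) to be nonterminals:

```python
from itertools import count

_fresh = count(1)

def to_floyd_rules(head, rhs):
    """Rewrite one CFG rule into Floyd-normal-form rules:
    A -> B, A -> B C, or A -> t (with t a single terminal).

    rhs is a list of symbols; uppercase symbols are nonterminals,
    everything else is a terminal (an illustrative convention).
    """
    rules = []
    def fresh():
        return f"N{next(_fresh)}"   # hypothetical fresh-nonterminal naming
    # Isolate terminals occurring in a multi-symbol right-hand side.
    syms = []
    for s in rhs:
        if len(rhs) > 1 and not s.isupper():
            t = fresh()
            rules.append((t, [s]))          # e.g. T_a -> a
            syms.append(t)
        else:
            syms.append(s)
    # Binarize left-to-right: A -> X1 R1, R1 -> X2 R2, ...
    while len(syms) > 2:
        r = fresh()
        rules.append((head, [syms[0], r]))
        head, syms = r, syms[1:]
    rules.append((head, syms))
    return rules

# A -> a B c becomes N1 -> a, N2 -> c, A -> N1 N3, N3 -> B N2
for rule in to_floyd_rules('A', ['a', 'B', 'c']):
    print(rule)
```

Unit rules such as A \to B pass through untouched (the binarization loop never fires for a single symbol), which is precisely the difference from a full CNF conversion.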

Applications and significance

Role in algorithms

Chomsky normal form (CNF) is essential for several key algorithms that process context-free grammars (CFGs), as it standardizes rules to facilitate efficient parsing. The Cocke-Younger-Kasami (CYK) algorithm, developed in the mid-1960s, relies on CNF to enable bottom-up table-filling dynamic programming for recognizing whether a string belongs to the language generated by the grammar. In CYK, the grammar's productions are restricted to binary nonterminal rules (A → BC) or unary terminal rules (A → a), which directly correspond to the algorithm's recurrence for filling a triangular table of spans. The binary structure of CNF rules matches the dynamic programming decomposition by allowing a nonterminal to derive a span through combinations of exactly two substructures, ensuring that the base cases, single terminals, are handled straightforwardly without mixed-length productions complicating the entries. This form reduces the complexity of recognition from the exponential runtime of naive top-down recursive methods to O(n³), where n is the input length, assuming a fixed grammar size; the space complexity is O(n²). Without binarization, as provided by CNF, the algorithm's loop over split points would scale with production length, potentially exceeding cubic time. The Earley algorithm, introduced in 1970, by contrast operates on general CFGs using a chart-based approach and does not require CNF. CNF is a standard requirement for bottom-up dynamic programming parsers like CYK in compiler design and parsing tools, where general CFGs are first converted to this form to leverage polynomial-time recognition. However, CNF does not resolve inherent ambiguities in grammars; parsing ambiguous CFGs in this form still produces multiple parses, necessitating disambiguation strategies such as probabilistic models or heuristics for unique tree selection. While every context-free language has an equivalent CNF grammar, the parsing algorithms are tailored specifically to this normalized form, requiring preprocessing for non-conforming inputs.

Use in formal language theory

In formal language theory, Chomsky normal form (CNF) plays a pivotal role in establishing key properties of context-free languages (CFLs) and grammars by providing a standardized structure that facilitates rigorous proofs. One fundamental application is in deciding the emptiness problem for context-free grammars (CFGs). Emptiness can be determined through a fixed-point iteration over the productive nonterminals: start with the set of terminals and iteratively add nonterminals that derive strings over the current set until no changes occur; if the start symbol is productive, the language is nonempty. This approach converges in polynomial time relative to the grammar size. CNF also underpins equivalence arguments between grammars: any CFG can be transformed into an equivalent CNF grammar without altering the generated language, enabling direct comparisons and normalizations that preserve context-free properties. This normalization is essential for adapting the pumping lemma to CFLs, where the binary structure of CNF derivations allows for precise decomposition of parse trees into repeatable substructures, yielding a pumping length determined by the grammar's nonterminal count. CNF likewise aids in demonstrating closure properties, such as the closure of CFLs under intersection with regular languages. The proof constructs a new CFG in CNF by integrating the finite automaton's states as nonterminals into the original grammar's productions, ensuring the resulting rules remain binary or terminal-producing while capturing the intersection. Furthermore, CNF is instrumental in proofs of undecidability, particularly for ambiguity: reductions from the Post correspondence problem show that determining whether a CNF grammar is ambiguous is undecidable, as the normal form does not resolve inherent nondeterminism in derivations. The utility of CNF extends to standardizing grammars for theoretical comparisons, allowing researchers to analyze structural invariants across equivalent formulations without loss of generality.
It also supports variants of Ogden's lemma, where the distinguished positions in a marked string can be pumped using the binary branching of CNF parse trees, providing a stronger tool for non-CFL proofs than the standard pumping lemma. Theoretically, CNF bridges generative grammar, rooted in hierarchical phrase structure, with automata theory by mapping syntactic derivations to computational models like pushdown automata, facilitating analyses of linguistic complexity in formal terms. A key fact is that derivations in a CNF grammar correspond directly to full binary parse trees, where internal nodes represent binary productions and leaves denote terminals; this simplifies the analysis of derivation growth rates, as the tree height bounds the derivation length and enables generating function techniques to compute asymptotic densities.
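The emptiness test described in this section admits a short sketch. The function name is illustrative; symbols that never appear as a rule head are treated as terminals, matching the fixed-point formulation above:

```python
def is_language_nonempty(rules, start):
    """Decide CFG emptiness by a fixed-point iteration over productive
    nonterminals: L(G) is nonempty iff the start symbol is productive.

    rules: iterable of (head, rhs) pairs, with rhs a tuple of symbols.
    """
    heads = {h for h, _ in rules}
    productive = set()
    changed = True
    while changed:
        changed = False
        for head, rhs in rules:
            if head not in productive and all(
                s in productive or s not in heads for s in rhs
            ):
                productive.add(head)
                changed = True
    return start in productive

# S -> A B;  A -> a;  B -> B b: B never terminates, so L(G) is empty.
g1 = [('S', ('A', 'B')), ('A', ('a',)), ('B', ('B', 'b'))]
# Adding B -> b makes the language nonempty.
g2 = g1 + [('B', ('b',))]
print(is_language_nonempty(g1, 'S'))   # False
print(is_language_nonempty(g2, 'S'))   # True
```

Each pass over the rules adds at least one new productive nonterminal or terminates, so the loop runs at most |N| + 1 times, giving the polynomial bound noted above.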
