Context-free language
In formal language theory, a context-free language (CFL) is a type of formal language generated by a context-free grammar (CFG), where each production rule takes the form A \to \alpha, with A a single nonterminal symbol and \alpha a (possibly empty) string of terminals and nonterminals.[1] The language L(G) of such a grammar G = (V, \Sigma, R, S) consists of all terminal strings derivable from the start symbol S via repeated application of the rules, formally L(G) = \{ w \in \Sigma^* \mid S \Rightarrow^* w \}.[2]
Context-free languages occupy Type-2 in the Chomsky hierarchy of formal grammars, introduced by linguist Noam Chomsky in his 1956 paper "Three Models for the Description of Language," which classified grammars by their generative power and relevance to natural language structure.[3] They form a proper superset of the regular languages (Type-3, recognized by finite automata) but a proper subset of the context-sensitive languages (Type-1), enabling the description of more complex structures like nested parentheses or balanced expressions that exceed regular capabilities.[1]
Key properties of CFLs include closure under union, concatenation, and the Kleene star operation, though they are not closed under intersection or complement; membership in a CFL is decidable using a pushdown automaton (PDA), a nondeterministic finite automaton augmented with a stack for memory.[1] Some CFLs are inherently ambiguous, admitting multiple parse trees for the same string, which poses challenges for parsing algorithms.[1]
CFLs have foundational applications in computer science, particularly in defining the syntax of programming languages via notations like Backus-Naur Form (BNF), which Chomsky's work directly influenced for the ALGOL 60 specification, and in compiler design tools such as Yacc for generating parsers.[1] They also model aspects of natural language syntax in computational linguistics, though real human languages often require extensions beyond pure CFLs.[2]
Context-free grammars
A context-free grammar (CFG) is formally defined as a 4-tuple G = (V, \Sigma, P, S), where V is a finite set of variables (nonterminal symbols), \Sigma is a finite set of terminals (alphabet symbols), P is a finite set of productions of the form A \to \alpha with A \in V and \alpha \in (V \cup \Sigma)^*, and S \in V is the start symbol.[3] This structure allows the replacement of a nonterminal by the right-hand side of a production independently of the surrounding symbols, distinguishing CFGs from more restrictive formalisms.[3]
Derivations in a CFG begin with the start symbol and apply productions successively to generate strings. A leftmost derivation replaces the leftmost nonterminal at each step, while a rightmost derivation replaces the rightmost nonterminal; both yield the same language but may differ in sequence.[4] Parse trees, or derivation trees, represent these processes graphically: the root is the start symbol, internal nodes are nonterminals with children corresponding to the right-hand side of productions, and leaves form the yield string from left to right. For example, consider the CFG with productions S \to aSb \mid \epsilon, which generates the language \{a^n b^n \mid n \geq 0\}; a parse tree for aabb illustrates the recursive nesting.
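This generation process can be made concrete with a short program. The following minimal Python sketch (illustrative only; it encodes nonterminals as single uppercase letters, an assumption of the example) enumerates the terminal strings of S \to aSb \mid \epsilon by always rewriting the leftmost nonterminal, mirroring leftmost derivations:

from collections import deque

def generate(rules, start, max_len):
    # Breadth-first expansion of sentential forms, always rewriting the
    # leftmost nonterminal (uppercase), mirroring leftmost derivations.
    results, queue, seen = set(), deque([start]), {start}
    while queue:
        form = queue.popleft()
        i = next((k for k, c in enumerate(form) if c.isupper()), None)
        if i is None:                  # no nonterminal left: a terminal string
            results.add(form)
            continue
        for rhs in rules[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # Prune forms whose terminal content already exceeds the bound.
            if sum(1 for c in new if not c.isupper()) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results

rules = {"S": ["aSb", ""]}                       # S -> aSb | epsilon
print(sorted(generate(rules, "S", 6), key=len))  # ['', 'ab', 'aabb', 'aaabbb']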
The Chomsky normal form (CNF) restricts a CFG such that every production is either A \to BC (with B, C \in V) or A \to a (with a \in \Sigma), except possibly S \to \epsilon if \epsilon is in the language.[4] Any CFG is equivalent to one in CNF via a series of transformations: first, eliminate \epsilon-productions by adding alternatives; second, remove unit productions A \to B by substituting; third, replace terminals in mixed productions with new nonterminals; and fourth, binarize longer productions by introducing auxiliary nonterminals. These steps preserve the generated language, as proven constructively.[4]
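For example, applying these steps to the grammar S \to aSb \mid \epsilon: since \epsilon \in L(G), introduce a fresh start symbol S_0 with S_0 \to S \mid \epsilon and remove the \epsilon-production from S, giving S \to aSb \mid ab; eliminating the unit production S_0 \to S yields S_0 \to aSb \mid ab \mid \epsilon; replacing terminals in mixed bodies via A \to a and B \to b gives S_0 \to ASB \mid AB \mid \epsilon and S \to ASB \mid AB; and binarizing the length-three bodies with C \to SB produces the CNF grammar S_0 \to AC \mid AB \mid \epsilon, S \to AC \mid AB, C \to SB, A \to a, B \to b.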
The Greibach normal form (GNF) requires every production to be A \to a\alpha where a \in \Sigma and \alpha \in V^*, eliminating left recursion. Transformation to GNF involves first converting to CNF, then ordering the nonterminals such that no nonterminal appears before itself in dependencies, and substituting the right-hand sides recursively in reverse order to ensure each production starts with a terminal; this also preserves equivalence.[4]
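For the related language \{a^n b^n \mid n \geq 1\}, a grammar already in GNF is S \to aSB \mid aB with B \to b: every production body begins with a terminal followed only by nonterminals, and no left recursion remains.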
The language generated by a CFG G, denoted L(G), consists of all strings w \in \Sigma^* derivable from S via a finite sequence of productions, i.e., S \Rightarrow^* w, where the yield of any full derivation tree is such a string. Context-free grammars generate context-free languages, which are equivalently recognized by nondeterministic pushdown automata.[3]
Pushdown automata
A pushdown automaton (PDA) is a computational model that extends the finite automaton by incorporating an auxiliary stack memory, enabling it to recognize context-free languages.[4] Formally, a PDA is defined as a 7-tuple M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F), where Q is a finite set of states, \Sigma is the finite input alphabet, \Gamma is the finite stack alphabet, \delta is the transition function, q_0 \in Q is the start state, Z_0 \in \Gamma is the initial stack symbol, and F \subseteq Q is the set of accepting states.[4]
The transition function \delta: Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \to \mathcal{P}(Q \times \Gamma^*) specifies nondeterministic transitions, where \mathcal{P} denotes the power set; for a state q, input symbol a \in \Sigma \cup \{\epsilon\}, and top-of-stack symbol X \in \Gamma, \delta(q, a, X) yields a finite set of pairs (p, \gamma) with p \in Q and \gamma \in \Gamma^*, indicating a move to state p while replacing the top stack symbol X with the string \gamma (which may involve pushing, popping, or no change).[4] A deterministic PDA (DPDA) restricts \delta so that |\delta(q, a, X)| \leq 1 for every q, a (including a = \epsilon), and X, and so that whenever \delta(q, \epsilon, X) is nonempty, \delta(q, a, X) is empty for every a \in \Sigma; the languages accepted by DPDAs form a proper subset of the context-free languages, as not every context-free language admits a DPDA recognizer.[4]
A configuration of a PDA is a triple (q, w, \alpha), where q \in Q is the current state, w \in \Sigma^* is the remaining input, and \alpha \in \Gamma^* is the current stack content (written with the top of the stack as the leftmost symbol).[4] One configuration (q, w, \alpha) yields another (p, w', \beta) in one step if w = a w' for some a \in \Sigma \cup \{\epsilon\}, \alpha = X \gamma for some X \in \Gamma and \gamma \in \Gamma^*, and (p, \beta') \in \delta(q, a, X) with \beta = \beta' \gamma; the relation \vdash^* denotes the reflexive transitive closure of single steps, representing computations.[4]
PDAs accept strings in two equivalent modes: by final state or by empty stack.[4] The language recognized by M by final state is L(M) = \{ w \in \Sigma^* \mid (q_0, w, Z_0) \vdash^* (q, \epsilon, \alpha) \text{ for some } q \in F \text{ and } \alpha \in \Gamma^* \}, while acceptance by empty stack is N(M) = \{ w \in \Sigma^* \mid (q_0, w, Z_0) \vdash^* (q, \epsilon, \epsilon) \text{ for some } q \in Q \}; these yield the same class of languages.[4]
Nondeterministic PDAs recognize exactly the context-free languages, establishing equivalence with context-free grammars.[4] To convert a context-free grammar G to an equivalent PDA, construct M that nondeterministically simulates leftmost derivations of G by pushing nonterminals onto the stack according to productions and popping terminals as they match the input, accepting by empty stack when the derivation completes.[4] Conversely, from a PDA M, derive a grammar G whose variables encode pairs of states and stack symbols, with productions mirroring transitions to generate strings that lead from the initial configuration to an accepting one, ensuring L(G) = L(M).[4]
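The grammar-to-PDA direction is short enough to simulate directly. The following Python sketch (illustrative; the explicit depth bound is an added safeguard for termination, not part of the formal construction) keeps the unread part of a leftmost sentential form on the stack, expands nonterminals by \epsilon-moves, and matches terminals against the input, accepting by empty stack:

def cfg_pda_accepts(rules, start, w):
    # Stack is a tuple with its top at index 0; uppercase symbols are
    # nonterminals. Accept when the stack is empty and the input consumed.
    limit = 2 * len(w) + 2              # depth bound keeps the search finite
    seen = set()

    def search(pos, stack):
        if (pos, stack) in seen:        # already explored this configuration
            return False
        seen.add((pos, stack))
        if not stack:
            return pos == len(w)
        top, rest = stack[0], stack[1:]
        if top.isupper():               # epsilon-move: replace A by a body
            return any(len(rhs) + len(rest) <= limit and
                       search(pos, tuple(rhs) + rest)
                       for rhs in rules[top])
        return (pos < len(w) and w[pos] == top   # match and pop a terminal
                and search(pos + 1, rest))

    return search(0, (start,))

rules = {"S": ["aSb", ""]}
print(cfg_pda_accepts(rules, "S", "aabb"))   # True
print(cfg_pda_accepts(rules, "S", "aab"))    # False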
Examples
Dyck language
The Dyck language serves as a canonical example of a context-free language, capturing the essence of properly nested and matched structures, such as balanced parentheses. For the case of a single pair of brackets, the Dyck language D_1 over the alphabet \{ (, ) \} consists of all strings where every opening parenthesis has a corresponding closing parenthesis, with proper nesting and no unmatched symbols.[5] This language includes the empty string \epsilon, as well as strings like (), (()), and ()(()), but excludes ill-formed strings such as )( or (())(.[6]
The language D_1 is generated by the context-free grammar G = (V, \Sigma, R, S), where V = \{S\} is the set of variables, \Sigma = \{ (, ) \} is the terminal alphabet, S is the start symbol, and R consists of the productions:
S → (S) | SS | ε
This grammar recursively builds nested structures through the rule S \to (S), allows concatenation of independent substructures via S \to SS, and includes the empty production for the base case.[6][7]
More generally, the Dyck language of order k, denoted D_k, extends this to an alphabet with k distinct pairs of matching brackets, such as \{ (, ), [, ], \{, \} \} for k=3. It comprises all strings where brackets of each type are properly matched and nested, without interleaving mismatches. The grammar for D_k uses the productions S \to X_i S Y_i S for i = 1, \dots, k together with S \to \epsilon, where X_i and Y_i represent the i-th opening and closing brackets, ensuring type-specific matching.[8]
Recognition of the Dyck language by a pushdown automaton (PDA) leverages the stack to enforce matching. The PDA operates in a single state: upon reading an opening bracket, it pushes the corresponding symbol onto the stack; upon reading a closing bracket, it pops the top symbol if it matches the expected opening type, rejecting otherwise. To handle multiple types, the stack stores the bracket type. The automaton accepts by empty stack at the end of input, confirming all openings have been matched and the stack is cleared.[7]
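In miniature, this single-state automaton is a few lines of Python (an illustrative sketch of acceptance by empty stack):

def is_dyck(s, pairs={"(": ")", "[": "]", "{": "}"}):
    # Push the expected closer on an opener; pop and compare on a closer.
    stack = []
    for c in s:
        if c in pairs:
            stack.append(pairs[c])
        elif not stack or stack.pop() != c:
            return False              # mismatch, or a pop from an empty stack
    return not stack                  # accept by empty stack

print(is_dyck("()(())"))              # True
print(is_dyck("(())("))               # False
print(is_dyck("([{}])"))              # True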
The Dyck language is inherently non-regular, as demonstrated by applying the pumping lemma for regular languages to strings like (^n )^n, which cannot be pumped without violating balance. This non-regularity underscores the need for stack memory in pushdown automata to track unbounded nesting depth.[6]
The term "Dyck language" originates from the work of mathematician Walther von Dyck (1856–1934), whose foundational contributions to combinatorial group theory in the 1880s, particularly on presentations by generators and relations, inspired the modeling of balanced words as reduced forms in free groups. The modern formal usage in language theory was introduced by Noam Chomsky and Marcel-Paul Schützenberger in their 1963 paper on the algebraic structure of context-free languages.[9]
Expression grammars
Context-free grammars provide a natural way to specify the syntax of arithmetic expressions, capturing the structure of operations like addition and multiplication while incorporating parentheses for grouping and identifiers as operands. A typical language modeled this way is L, the set of all valid arithmetic expressions built from identifiers (such as variable names), the binary operators + and \times, and parentheses, excluding division and unary operations; this language is generated by a context-free grammar over the terminal alphabet \Sigma = \{ +, \times, (, ), \text{id} \}.[10]
A simple context-free grammar for L is given by the productions:
E → E + E | E × E | ( E ) | id
with E as the start symbol. This grammar generates all strings in L, but it is ambiguous because some expressions admit multiple distinct parse trees, leading to different interpretations of operator associativity or precedence.[11][12]
For example, consider the string \text{id} + \text{id} \times \text{id}. One leftmost derivation is E \Rightarrow E + E \Rightarrow \text{id} + E \Rightarrow \text{id} + E \times E \Rightarrow \text{id} + \text{id} \times E \Rightarrow \text{id} + \text{id} \times \text{id}, yielding a parse tree where addition is the root and multiplication is nested on the right. Another is E \Rightarrow E \times E \Rightarrow E + E \times E \Rightarrow \text{id} + E \times E \Rightarrow \text{id} + \text{id} \times E \Rightarrow \text{id} + \text{id} \times \text{id}, yielding a parse tree where multiplication is the root and addition is nested on the left. These trees imply different evaluation orders: the first as \text{id} + (\text{id} \times \text{id}), and the second as (\text{id} + \text{id}) \times \text{id}, highlighting the lack of specified precedence between + and \times.[11][12]
To resolve this ambiguity and enforce standard operator precedence—where multiplication binds tighter than addition—a disambiguated grammar introduces nonterminals for precedence levels:
E → E + T | T
T → T × F | F
F → ( E ) | id
Here, E handles addition, T handles multiplication, and F handles factors (parenthesized expressions or identifiers). For the string \text{id} + \text{id} \times \text{id}, the unique leftmost derivation is E \Rightarrow E + T \Rightarrow T + T \Rightarrow F + T \Rightarrow \text{id} + T \Rightarrow \text{id} + T \times F \Rightarrow \text{id} + F \times F \Rightarrow \text{id} + \text{id} \times F \Rightarrow \text{id} + \text{id} \times \text{id}, producing a parse tree that groups multiplication first: \text{id} + (\text{id} \times \text{id}). This structure ensures a single interpretation consistent with mathematical conventions.[13]
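The layered grammar translates directly into a parser. The sketch below (illustrative Python; the left-recursive rules E \to E + T and T \to T \times F are reformulated as iteration, E \to T\,(+\,T)^* and T \to F\,(\times\,F)^*, the standard transformation for top-down parsing) builds nested tuples as a stand-in for parse trees:

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():                        # E -> T ('+' T)*
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("+", node, term())
        return node
    def term():                        # T -> F ('*' F)*
        nonlocal pos
        node = factor()
        while peek() == "*":
            pos += 1
            node = ("*", node, factor())
        return node
    def factor():                      # F -> '(' E ')' | id
        nonlocal pos
        if peek() == "(":
            pos += 1
            node = expr()
            assert peek() == ")", "missing )"
            pos += 1
            return node
        assert peek() == "id", "expected id"
        pos += 1
        return "id"
    tree = expr()
    assert pos == len(tokens), "trailing input"
    return tree

# Multiplication groups first: id + (id * id)
print(parse(["id", "+", "id", "*", "id"]))   # ('+', 'id', ('*', 'id', 'id'))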
Such grammars are fundamental in compiler design, where they define the syntactic rules for expression evaluation in programming languages like C and Java, enabling parsers to construct abstract syntax trees that respect operator precedence and associativity during code compilation.[14]
Properties
Closure properties
Context-free languages are closed under several fundamental operations, allowing the construction of new context-free languages from existing ones using context-free grammars or pushdown automata. These properties facilitate the analysis and generation of complex languages while maintaining membership in the class. Specifically, closure holds for union, concatenation, Kleene star, homomorphism, and inverse homomorphism, with explicit constructions demonstrating how to derive equivalent grammars or automata.[15][16]
Union. If L_1 and L_2 are context-free languages generated by context-free grammars G_1 = (V_1, \Sigma, P_1, S_1) and G_2 = (V_2, \Sigma, P_2, S_2) with disjoint sets of nonterminals V_1 and V_2, then L_1 \cup L_2 is context-free. Construct a new grammar G = (V_1 \cup V_2 \cup \{S\}, \Sigma, P_1 \cup P_2 \cup \{S \to S_1 \mid S_2\}, S), where S is a fresh start symbol; this grammar generates strings from either L_1 or L_2.[15]
Concatenation. The concatenation L_1 \cdot L_2 = \{xy \mid x \in L_1, y \in L_2\} is also context-free. Using the same grammars G_1 and G_2 with disjoint nonterminals, construct a grammar containing all productions of both together with a fresh start symbol S and the single production S \to S_1 S_2; derivations then produce a string from L_1 followed by one from L_2.[15]
Kleene star. The Kleene star L^* = \bigcup_{k=0}^\infty L^k (with L^0 = \{\epsilon\}) of a context-free language L generated by G = (V, \Sigma, P, S) is context-free. Introduce a new start symbol S' with productions S' \to S S' \mid \epsilon, and use the original productions for S; this allows zero or more concatenations of strings from L.[15]
Homomorphism. For a homomorphism h: \Sigma^* \to \Delta^* and context-free language L \subseteq \Sigma^*, the image h(L) = \{h(w) \mid w \in L\} is context-free. Given a grammar for L, replace each occurrence of a terminal a \in \Sigma in the productions with a fresh nonterminal A_a and add the production A_a \to h(a); the resulting grammar generates exactly the images under h.[17]
Inverse homomorphism. The preimage h^{-1}(L) = \{w \in \Sigma^* \mid h(w) \in L\} under the same homomorphism h is also context-free. This follows from a pushdown automaton construction: on reading an input symbol a, the new automaton stores h(a) in a buffer kept in its finite control and simulates the PDA for L symbol by symbol on the buffered string, accepting if the simulated PDA accepts.[16]
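With grammars represented as dictionaries mapping nonterminals to lists of bodies (each body a list of symbols), the positive constructions above are nearly one-liners; the sketch below is illustrative and assumes the two grammars use disjoint nonterminal names and that "S0" is fresh:

def cf_union(p1, s1, p2, s2):
    p = {**p1, **p2}
    p["S0"] = [[s1], [s2]]             # S0 -> S1 | S2
    return p, "S0"

def cf_concat(p1, s1, p2, s2):
    p = {**p1, **p2}
    p["S0"] = [[s1, s2]]               # S0 -> S1 S2
    return p, "S0"

def cf_star(p1, s1):
    p = dict(p1)
    p["S0"] = [[s1, "S0"], []]         # S0 -> S1 S0 | epsilon
    return p, "S0"

anbn = {"S1": [["a", "S1", "b"], []]}  # generates {a^n b^n}
g, s = cf_star(anbn, "S1")             # grammar for ({a^n b^n})*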
Context-free languages are not closed under intersection, complement, or difference. For intersection, consider L_1 = \{a^n b^n c^m \mid n, m \geq 0\} and L_2 = \{a^m b^n c^n \mid m, n \geq 0\}, both context-free (generated by simple grammars matching pairs of exponents independently), but their intersection is \{a^n b^n c^n \mid n \geq 0\}, which is not context-free (provable via the pumping lemma for context-free languages).[18]
For complement, suppose context-free languages were closed under complementation; then, since they are closed under union, they would be closed under intersection via De Morgan's laws (\overline{L_1 \cup L_2} = \overline{L_1} \cap \overline{L_2}), contradicting the non-closure under intersection. Thus, there exist context-free languages whose complements are not context-free.[19]
For difference, L_1 \setminus L_2 = L_1 \cap \overline{L_2}; non-closure under either intersection or complement implies non-closure under difference, as both operations are required.[20]
Pumping lemma
The pumping lemma for context-free languages provides a necessary condition that all context-free languages must satisfy, serving as a key tool for proving that certain languages are not context-free by contradiction.
Formally, let L be a context-free language. Then there exists a positive integer p (the pumping length) such that for every string w \in L with |w| \geq p, w can be divided as w = uvxyz where:
- |vxy| \leq p,
- |vy| \geq 1,
- for all integers k \geq 0, the string uv^k x y^k z \in L.
This decomposition ensures that the portions v and y can be "pumped" (repeated any number of times, including zero) while keeping the resulting strings in L.
Proof sketch
The proof relies on the structure of parse trees generated by a context-free grammar in Chomsky normal form for L with m variables. Set the pumping length p = 2^m. A CNF parse tree in which no root-to-leaf path contains more than i variables yields a string of length at most 2^{i-1}, so any w \in L with |w| \geq p has a parse tree with a root-to-leaf path containing at least m + 1 variables; by the pigeonhole principle, some variable A occurs twice on that path. The subtrees rooted at the two occurrences of A determine the decomposition w = uvxyz: the lower subtree yields x, the upper yields vxy, and the rest of the tree supplies u and z. Choosing the repetition among the lowest variables on the path ensures |vxy| \leq p, and |vy| \geq 1 holds because the binary CNF productions force the upper subtree to properly contain the lower one. Pumping corresponds to replacing the lower A-subtree by a copy of the upper one (or the upper by the lower, for k = 0), preserving membership in L.[21]
Application example
A classic application demonstrates that the language L = \{a^n b^n c^n \mid n \geq 0\} is not context-free. Assume for contradiction that L is context-free, so it satisfies the pumping lemma with some pumping length p. Consider w = a^p b^p c^p \in L. Then w = uvxyz with |vxy| \leq p and |vy| \geq 1. Since |vxy| \leq p, the substring vxy lies entirely within the a's, or spans a's and b's, or spans b's and c's, but cannot touch all three blocks due to the length bound. Pumping with k = 2 therefore increases the counts of at most two of the three symbols while leaving the other(s) unchanged (and, if v or y straddles a block boundary, also produces symbols out of order), so the numbers of a's, b's, and c's in uv^2 x y^2 z are no longer all equal and uv^2 x y^2 z \notin L. This contradiction implies L is not context-free.[21]
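For a small pumping length the case analysis can be verified exhaustively. The Python sketch below (purely illustrative) tries every decomposition of a^p b^p c^p with |vxy| \leq p and |vy| \geq 1 for p = 4 and confirms that pumping with k = 2 always leaves the language:

def in_L(s):
    n = s.count("a")
    return s == "a" * n + "b" * n + "c" * n

p = 4
w = "a" * p + "b" * p + "c" * p
survives = False
for i in range(len(w) + 1):                          # u = w[:i]
    for l in range(i + 1, min(i + p, len(w)) + 1):   # vxy = w[i:l], |vxy| <= p
        for j in range(i, l + 1):                    # v = w[i:j]
            for m in range(j, l + 1):                # x = w[j:m], y = w[m:l]
                v, x, y = w[i:j], w[j:m], w[m:l]
                if v + y and in_L(w[:i] + v * 2 + x + y * 2 + w[l:]):
                    survives = True
print(survives)   # False: no decomposition stays in the language when pumped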
Variants
Ogden's lemma strengthens the standard pumping lemma by allowing the selection of a set of at least p "distinguished" positions in w, requiring that v and y together cover at least one distinguished position. This provides finer control over the pumpable substrings' locations, aiding proofs where the standard lemma fails due to ambiguous positioning.[22]
Limitations
The pumping lemma is a necessary but not sufficient condition for context-freeness: while every context-free language satisfies it, some non-context-free languages also satisfy the condition, so it cannot prove that a language is context-free. It is thus primarily useful for establishing non-membership in the class of context-free languages.
Decidability properties
Context-free languages exhibit a mix of decidable and undecidable decision problems, reflecting the computational boundaries of their generative models like context-free grammars and pushdown automata. Several core properties can be algorithmically determined, while others, involving global comparisons, cannot.
The membership problem for context-free languages is decidable: given a context-free grammar G and a string w, it is possible to determine whether w \in L(G). This follows from the existence of effective parsing procedures, such as the CYK algorithm applied after conversion to Chomsky normal form, which always halt with a yes or no answer.[23]
The emptiness problem is also decidable: determining whether L(G) = \emptyset for a context-free grammar G can be resolved by identifying productive nonterminals through a fixed-point computation. Specifically, a nonterminal is productive if it derives a terminal string, computed iteratively by marking nonterminals that appear in the right-hand sides of productions leading to terminals or other productive nonterminals; the language is empty if the start symbol is unproductive.[23]
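The fixed point is short enough to state as code. The following illustrative Python sketch (nonterminals encoded as uppercase letters) marks productive nonterminals until nothing changes:

def is_empty(rules, start):
    productive = set()
    changed = True
    while changed:
        changed = False
        for A, bodies in rules.items():
            if A in productive:
                continue
            for body in bodies:
                # A body is realizable if every symbol is a terminal or an
                # already-productive nonterminal.
                if all(not c.isupper() or c in productive for c in body):
                    productive.add(A)
                    changed = True
                    break
    return start not in productive

rules = {"S": ["AB"], "A": ["a"], "B": ["bB"]}   # B never derives a terminal string
print(is_empty(rules, "S"))                      # True: S is unproductive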
Similarly, the finiteness problem—whether L(G) is finite—is decidable. After removing useless symbols, one checks the grammar for cycles: L(G) is infinite exactly when some nonterminal is self-embedding, deriving a string that contains itself surrounded by at least one other symbol (A \Rightarrow^+ uAv with uv \neq \epsilon), since such a cycle permits unbounded pumping; if no such cycle exists, only finitely many derivations are possible and the language is finite.[23]
However, the universality problem is undecidable: it cannot be determined in general whether L(G) = \Sigma^* for a context-free grammar G over alphabet \Sigma. This result is obtained by reduction from the undecidable Post Correspondence Problem, encoding instances of PCP into context-free grammars such that the language equals \Sigma^* if and only if the PCP has a solution.[23]
The equivalence problem between two context-free grammars G_1 and G_2—determining if L(G_1) = L(G_2)—is likewise undecidable. This undecidability follows directly from that of universality, as equivalence to a trivial grammar generating \Sigma^* would solve the universality problem, which is impossible.[23]
In contrast to the general case, regularity is decidable for deterministic context-free languages: given a deterministic pushdown automaton, one can effectively determine whether the language it accepts is regular. For an arbitrary context-free grammar G, however, deciding whether L(G) is regular is undecidable.[23]
Parsing and recognition
Parsing algorithms
Top-down parsing algorithms, such as LL(1) parsers, construct a parse tree by starting from the start symbol and recursively expanding nonterminals based on the input string, using one token of lookahead to predict the next production. These parsers are predictive, relying on a parsing table constructed from FIRST and FOLLOW sets to determine the appropriate production for each nonterminal given the current lookahead symbol. The FIRST set of a string α is the set of terminals that can appear as the first symbol in some derivation of α, while the FOLLOW set of a nonterminal A is the set of terminals that can appear immediately after A in some sentential form. To build the predictive parsing table, for each production A → α, the entry in row A and column t is filled for every terminal t in FIRST(α); if ε ∈ FIRST(α), the entry is also filled for every t in FOLLOW(A). Conflicts arise if multiple productions are predicted for the same entry, indicating the grammar is not LL(1).
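The FIRST computation is itself a small fixed point. The illustrative Python sketch below computes FIRST sets for an LL(1) form of the expression grammar from the previous section (uppercase letters are nonterminals, the empty string stands for ε, and the single-letter names are abbreviations chosen for the example); FOLLOW sets are obtained by a similar iteration:

def first_sets(rules):
    FIRST = {A: set() for A in rules}
    def first_of(body):                  # FIRST of a string of symbols
        out = set()
        for c in body:
            if not c.isupper():          # a terminal begins everything after it
                out.add(c)
                return out
            out |= FIRST[c] - {""}
            if "" not in FIRST[c]:       # c cannot vanish, so stop here
                return out
        out.add("")                      # the whole body can derive epsilon
        return out
    changed = True
    while changed:
        changed = False
        for A, bodies in rules.items():
            for body in bodies:
                f = first_of(body)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

# E -> T X, X -> '+' T X | eps, T -> F Y, Y -> '*' F Y | eps, F -> '(' E ')' | 'i'
rules = {"E": ["TX"], "X": ["+TX", ""], "T": ["FY"], "Y": ["*FY", ""],
         "F": ["(E)", "i"]}
print(first_sets(rules))
# FIRST(E) = FIRST(T) = FIRST(F) = {'(', 'i'}; FIRST(X) = {'+', ''}; FIRST(Y) = {'*', ''}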
Bottom-up parsing algorithms, including LR(0), SLR(1), and LALR(1) parsers, build the parse tree from the input string upward by shifting symbols onto a stack and reducing them according to productions when a handle (right side of a production) is recognized. LR(0) parsers use items of the form A → α • β to represent positions in productions without lookahead, constructing a deterministic finite automaton from the grammar's canonical collection of sets of items to guide shifts and reduces. SLR(1) (Simple LR(1)) extends LR(0) by incorporating FOLLOW sets to resolve reduce actions based on one lookahead symbol, while LALR(1) (Look-Ahead LR(1)) merges states from the full LR(1) automaton to reduce table size, preserving determinism for many practical grammars. Shift-reduce conflicts occur when the parser cannot decide whether to shift the next input symbol or reduce the top stack symbols, often due to ambiguous grammars; they are typically resolved by grammar redesign or by precedence and associativity declarations that select a unique action.
The Cocke-Younger-Kasami (CYK) algorithm is a general bottom-up parsing method applicable to any context-free grammar in Chomsky normal form (CNF), where productions are either A → BC or A → a. It employs dynamic programming with a triangular table T, where A ∈ T[i][j] holds exactly when the substring of length j starting at position i can be derived from the nonterminal A. The algorithm initializes the table for single symbols (j = 1) and iteratively fills longer substrings by checking productions A → BC, where B derives the left part and C the adjacent right part, recognizing membership if the start symbol derives the full string.[24]
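A direct implementation of this table-filling is brief. The sketch below is illustrative Python; the grammar encoding (separate maps for binary and terminal productions) is an assumption of the example:

def cyk(rules_bin, rules_term, start, w):
    # rules_bin:  {A: [(B, C), ...]}  for productions A -> BC
    # rules_term: {A: [a, ...]}       for productions A -> a
    n = len(w)
    if n == 0:
        return False       # a possible S -> epsilon must be checked separately
    # T[i][j] = set of nonterminals deriving w[i : i + j]
    T = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, c in enumerate(w):
        for A, terms in rules_term.items():
            if c in terms:
                T[i][1].add(A)
    for j in range(2, n + 1):            # span length
        for i in range(n - j + 1):       # span start
            for k in range(1, j):        # split point
                for A, pairs in rules_bin.items():
                    for B, C in pairs:
                        if B in T[i][k] and C in T[i + k][j - k]:
                            T[i][j].add(A)
    return start in T[0][n]

# CNF for {a^n b^n | n >= 1}: S -> AC | AB, C -> SB, A -> a, B -> b
rules_bin = {"S": [("A", "C"), ("A", "B")], "C": [("S", "B")]}
rules_term = {"A": ["a"], "B": ["b"]}
print(cyk(rules_bin, rules_term, "S", "aabb"))   # True
print(cyk(rules_bin, rules_term, "S", "abab"))   # False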
The Earley algorithm provides a general chart-parsing approach that handles all context-free languages, including ambiguous and nondeterministic grammars, using dynamic programming without requiring grammar transformation. It maintains a chart of states for each position j in the input, where each state is a tuple (A → α • β, i, j) indicating that α derives the substring from i to j, with • marking the current position in the production. Parsing proceeds in three phases per position: prediction adds new states for non-terminals β at j; scanning advances • over matching terminals to j+1; and completion links completed productions (• at end of β) back to waiting states expecting that non-terminal. This process explores all possible derivations, achieving O(n^3) time complexity in the worst case.[25]
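The three phases fit in a compact recognizer. This illustrative Python sketch represents a state as (head, body, dot, origin) and runs prediction, scanning, and completion to a fixed point at each input position; it decides membership only and does not build parse trees:

def earley_accepts(grammar, start, words):
    # grammar: {A: [[sym, ...], ...]}; dict keys are nonterminals,
    # anything else is a terminal.
    GAMMA = "GAMMA"                      # fresh wrapper start symbol
    rules = dict(grammar)
    rules[GAMMA] = [[start]]
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add((GAMMA, (start,), 0, 0))
    for j in range(len(words) + 1):
        added = True
        while added:                     # iterate to a fixed point at position j
            added = False
            for head, body, dot, origin in list(chart[j]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in rules:     # predict
                        for rhs in rules[sym]:
                            s = (sym, tuple(rhs), 0, j)
                            if s not in chart[j]:
                                chart[j].add(s); added = True
                    elif j < len(words) and words[j] == sym:   # scan
                        chart[j + 1].add((head, body, dot + 1, origin))
                else:                    # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            s = (h2, b2, d2 + 1, o2)
                            if s not in chart[j]:
                                chart[j].add(s); added = True
    return (GAMMA, (start,), 1, 0) in chart[len(words)]

grammar = {"S": [["S", "+", "S"], ["a"]]}            # deliberately ambiguous
print(earley_accepts(grammar, "S", list("a+a+a")))   # True
print(earley_accepts(grammar, "S", list("a+")))      # False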
Parsing algorithms can detect ambiguity in context-free grammars by identifying multiple valid parse trees for the same input string; for instance, LL(1) and LR parsers fail with conflicts on ambiguous grammars, while general methods like Earley or CYK reveal ambiguity through multiple completing states or table entries for the start symbol.[25]
Membership testing
Membership testing for a context-free language (CFL) involves determining whether a given string of length n belongs to the language generated by a context-free grammar (CFG). This problem is decidable, and efficient algorithms exist that run in polynomial time, placing CFL membership in the complexity class P. The Cocke-Younger-Kasami (CYK) algorithm and the Earley parser are two general-purpose methods for testing membership in arbitrary CFLs, both exhibiting a worst-case time complexity of O(n^3).
The CYK algorithm relies on dynamic programming and requires the CFG to be in Chomsky normal form (CNF), where productions are limited to A \to BC (with B, C nonterminals) or A \to a (with a a terminal). Any CFG can be converted to CNF in polynomial time without altering the language it generates, enabling efficient application of CYK. This conversion ensures that the algorithm fills an n \times n table bottom-up, where each entry represents possible nonterminal derivations for substrings, leading to O(n^2) space complexity for the table.
For certain subclasses of CFLs, more efficient linear-time algorithms are available. Top-down LL(k) parsers and bottom-up LR(k) parsers achieve O(n) time complexity when the grammar is suitable—specifically, LL for left-to-right, leftmost derivations without left recursion, and LR for a broader class including left-recursive grammars. LR parsers handle left recursion naturally without infinite loops, unlike standard top-down approaches, which require grammar transformations to eliminate it.
In practice, membership testing via LR parsing is widely implemented in compiler construction tools. For instance, Yacc (Yet Another Compiler-Compiler) generates LR parsers from CFGs, facilitating efficient syntactic analysis in compilers for programming languages. These tools preprocess the grammar to build parsing tables, allowing linear-time recognition during compilation.[26]
An alternative theoretical approach simulates a nondeterministic pushdown automaton (NPDA) equivalent to the CFG, but this is inefficient for membership testing. Direct simulation explores all possible computation paths, potentially leading to exponential time complexity in the worst case due to branching nondeterminism and stack operations. In contrast, the parsing algorithms above avoid this by using deterministic table-driven methods.
Comparisons
Chomsky hierarchy
The Chomsky hierarchy classifies formal grammars into four types based on restrictions on their production rules, each corresponding to a class of languages of increasing generative power. Type 0 grammars are unrestricted, generating recursively enumerable languages; Type 1 grammars are context-sensitive, generating context-sensitive languages; Type 2 grammars are context-free, generating context-free languages; and Type 3 grammars are regular, generating regular languages.[27]
Noam Chomsky first introduced the hierarchy in his 1956 paper "Three Models for the Description of Language," where he explored finite-state, phrase-structure, and transformational models, and formalized it in 1959 in "On Certain Formal Properties of Grammars."[28][29]
Type 2 grammars in the hierarchy, known as context-free grammars, consist of production rules of the form A \to \alpha, where A is a single nonterminal symbol and \alpha is any string composed of terminal and nonterminal symbols, without any context restrictions on the application of rules. These grammars generate exactly the context-free languages.[29][27]
The language classes form strict inclusions: regular languages are properly contained in context-free languages (\text{REG} \subset \text{CFL}), context-free languages are properly contained in context-sensitive languages (\text{CFL} \subset \text{CSL}), and context-sensitive languages are properly contained in recursively enumerable languages (\text{CSL} \subset \text{RE}). For example, the language \{a^n b^n \mid n \geq 0\} is context-free but not regular, as shown by the pumping lemma for regular languages. Similarly, the language \{a^n b^n c^n \mid n \geq 0\} is context-sensitive but not context-free.[27]
Corresponding to these language classes, context-free languages are recognized by pushdown automata, which use a stack for memory, while context-sensitive languages are recognized by linear bounded automata, which operate within linear space bounds relative to the input length.[27]
Non-context-free languages
A canonical example of a non-context-free language is L = \{ a^n b^n c^n \mid n \geq 0 \}, which consists of strings with equal numbers of each of three distinct symbols.[30]
The pumping lemma for context-free languages can be used to prove that L is not context-free. Assume for contradiction that L is context-free, and let p be the pumping length given by the lemma. Select the string w = a^p b^p c^p \in L, which has length greater than p. By the lemma, w = uvxyz where |vxy| \leq p, |vy| > 0, and uv^i x y^i z \in L for all i \geq 0. The constraint |vxy| \leq p implies that the substring vxy lies entirely within one block of identical symbols or spans at most two adjacent blocks. If vxy is confined to one block (e.g., the a's), pumping with i = 2 increases the count of that symbol while leaving the others unchanged, resulting in unequal counts and a string not in L. If vxy spans two blocks (e.g., a's and b's), pumping with i = 2 either introduces symbols out of order or imbalances the counts across the three symbols, again yielding a string not in L. In all cases, a contradiction arises, so L is not context-free.[30]
Another classic non-context-free language is the copy language L = \{ ww \mid w \in \{a, b\}^* \}, comprising all strings that are the concatenation of some word with itself.[31]
To demonstrate that this language is not context-free using the pumping lemma, assume it is, with pumping length p, and choose w = a^p b^p a^p b^p \in L. Decompose w = uvxyz with the usual conditions. Since |vxy| \leq p, the window containing v and y either straddles the midpoint of w or lies within one half. If it straddles the midpoint, pumping down (i = 0) yields a string of the form a^p b^j a^l b^p with j, l \leq p and j + l < 2p, which cannot be split into two identical halves; if it lies within one half, pumping changes the length or contents of that half without the matching change in the other (or yields a string of odd length, which cannot have the form ww). In every case the pumped string falls outside L, a contradiction, so L is not context-free.[21]
The language L = \{ a^n b^m \mid n, m \geq 0, n \neq m \} over the alphabet \{a, b\} provides insight into closure properties, as it equals the union of the context-free languages \{ a^n b^m \mid n > m \} and \{ a^n b^m \mid n < m \}, both of which are context-free (generated by straightforward context-free grammars that produce excess a's or b's). However, examples like \{ a^n b^n c^n \mid n \geq 0 \} arise as intersections of context-free languages, such as \{ a^n b^n c^k \mid n, k \geq 0 \} \cap \{ a^n b^k c^k \mid n, k \geq 0 \}, illustrating that the intersection of two context-free languages need not be context-free.[30]
Ogden's lemma strengthens the pumping lemma by allowing the selection of distinguished positions in the pumped string, enabling more targeted proofs for non-context-freeness in languages where the standard lemma's decomposition is ambiguous, such as variants of the copy language or languages requiring precise symbol matching across distant parts of the string.[32]
Languages like \{ a^n b^n c^n \mid n \geq 0 \} and the copy language are context-sensitive, as they can be recognized by linear bounded automata or generated by context-sensitive grammars that enforce equality through context-dependent rules, such as progressively marking and copying symbol counts.[33]