Context-free language
In formal language theory, a context-free language (CFL) is a type of formal language generated by a context-free grammar (CFG), where each production rule takes the form A \to \alpha, with A a single nonterminal symbol and \alpha a (possibly empty) string of terminals and nonterminals.[1] The language L(G) of such a grammar G = (V, \Sigma, R, S) consists of all terminal strings derivable from the start symbol S via repeated application of the rules, formally L(G) = \{ w \in \Sigma^* \mid S \Rightarrow^* w \}.[2]
Context-free languages occupy Type-2 in the Chomsky hierarchy of formal grammars, introduced by linguist Noam Chomsky in his 1956 paper "Three Models for the Description of Language," which classified grammars by their generative power and relevance to natural language structure.[3] They form a proper superset of the regular languages (Type-3, recognized by finite automata) but a proper subset of the context-sensitive languages (Type-1), enabling the description of more complex structures like nested parentheses or balanced expressions that exceed regular capabilities.[1]
Key properties of CFLs include closure under union, concatenation, and the Kleene star operation, though they are not closed under intersection or complement; membership in a CFL is decidable using a pushdown automaton (PDA), a nondeterministic finite automaton augmented with a stack for memory.[1] Some CFLs are inherently ambiguous, admitting multiple parse trees for the same string, which poses challenges for parsing algorithms.[1]
CFLs have foundational applications in computer science, particularly in defining the syntax of programming languages via notations like Backus-Naur Form (BNF), which Chomsky's work directly influenced for the ALGOL 60 specification, and in compiler design tools such as Yacc for generating parsers.[1] They also model aspects of natural language syntax in computational linguistics, though real human languages often require extensions beyond pure CFLs.[2]
Context-free grammars
A context-free grammar (CFG) is formally defined as a 4-tuple G = (V, \Sigma, P, S), where V is a finite set of variables (nonterminal symbols), \Sigma is a finite set of terminals (alphabet symbols), P is a finite set of productions of the form A \to \alpha with A \in V and \alpha \in (V \cup \Sigma)^*, and S \in V is the start symbol.[3] This structure allows the replacement of a nonterminal by the right-hand side of a production independently of the surrounding symbols, distinguishing CFGs from more restrictive formalisms.[3]
Derivations in a CFG begin with the start symbol and apply productions successively to generate strings. A leftmost derivation replaces the leftmost nonterminal at each step, while a rightmost derivation replaces the rightmost nonterminal; both yield the same language but may differ in sequence.[4] Parse trees, or derivation trees, represent these processes graphically: the root is the start symbol, internal nodes are nonterminals with children corresponding to the right-hand side of productions, and leaves form the yield string from left to right. For example, consider the CFG with productions S \to aSb \mid \epsilon, which generates the language \{a^n b^n \mid n \geq 0\}; a parse tree for aabb illustrates the recursive nesting.
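This generation process can be made concrete with a short program. The following minimal Python sketch (illustrative only; it encodes nonterminals as single uppercase letters, an assumption of the example) enumerates the terminal strings of S \to aSb \mid \epsilon by always rewriting the leftmost nonterminal, mirroring leftmost derivations:

from collections import deque

def generate(rules, start, max_len):
    # Breadth-first expansion of sentential forms, always rewriting the
    # leftmost nonterminal (uppercase), mirroring leftmost derivations.
    results, queue, seen = set(), deque([start]), {start}
    while queue:
        form = queue.popleft()
        i = next((k for k, c in enumerate(form) if c.isupper()), None)
        if i is None:                  # no nonterminal left: a terminal string
            results.add(form)
            continue
        for rhs in rules[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # Prune forms whose terminal content already exceeds the bound.
            if sum(1 for c in new if not c.isupper()) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results

rules = {"S": ["aSb", ""]}                       # S -> aSb | epsilon
print(sorted(generate(rules, "S", 6), key=len))  # ['', 'ab', 'aabb', 'aaabbb']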
The Chomsky normal form (CNF) restricts a CFG such that every production is either A \to BC (with B, C \in V) or A \to a (with a \in \Sigma), except possibly S \to \epsilon if \epsilon is in the language.[4] Any CFG is equivalent to one in CNF via a series of transformations: first, eliminate \epsilon-productions by adding alternatives; second, remove unit productions A \to B by substituting; third, replace terminals in mixed productions with new nonterminals; and fourth, binarize longer productions by introducing auxiliary nonterminals. These steps preserve the generated language, as proven constructively.[4]
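For example, applying these steps to the grammar S \to aSb \mid \epsilon: since \epsilon \in L(G), introduce a fresh start symbol S_0 with S_0 \to S \mid \epsilon and remove the \epsilon-production from S, giving S \to aSb \mid ab; eliminating the unit production S_0 \to S yields S_0 \to aSb \mid ab \mid \epsilon; replacing terminals in mixed bodies via A \to a and B \to b gives S_0 \to ASB \mid AB \mid \epsilon and S \to ASB \mid AB; and binarizing the length-three bodies with C \to SB produces the CNF grammar S_0 \to AC \mid AB \mid \epsilon, S \to AC \mid AB, C \to SB, A \to a, B \to b.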
The Greibach normal form (GNF) requires every production to be A \to a\alpha where a \in \Sigma and \alpha \in V^*, eliminating left recursion. Transformation to GNF involves first converting to CNF, then ordering the nonterminals such that no nonterminal appears before itself in dependencies, and substituting the right-hand sides recursively in reverse order to ensure each production starts with a terminal; this also preserves equivalence.[4]
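For the related language \{a^n b^n \mid n \geq 1\}, a grammar already in GNF is S \to aSB \mid aB with B \to b: every production body begins with a terminal followed only by nonterminals, and no left recursion remains.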
The language generated by a CFG G, denoted L(G), consists of all strings w \in \Sigma^* derivable from S via a finite sequence of productions, i.e., S \Rightarrow^* w, where the yield of any full derivation tree is such a string. Context-free grammars generate context-free languages, which are equivalently recognized by nondeterministic pushdown automata.[3]
Pushdown automata
A pushdown automaton (PDA) is a computational model that extends the finite automaton by incorporating an auxiliary stack memory, enabling it to recognize context-free languages.[4] Formally, a PDA is defined as a 7-tuple M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F), where Q is a finite set of states, \Sigma is the finite input alphabet, \Gamma is the finite stack alphabet, \delta is the transition function, q_0 \in Q is the start state, Z_0 \in \Gamma is the initial stack symbol, and F \subseteq Q is the set of accepting states.[4]
The transition function \delta: Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \to \mathcal{P}(Q \times \Gamma^*) specifies nondeterministic transitions, where \mathcal{P} denotes the power set; for a state q, input symbol a \in \Sigma \cup \{\epsilon\}, and top-of-stack symbol X \in \Gamma, \delta(q, a, X) yields a finite set of pairs (p, \gamma) with p \in Q and \gamma \in \Gamma^*, indicating a move to state p while replacing the top stack symbol X with the string \gamma (which may involve pushing, popping, or no change).[4] A deterministic PDA (DPDA) restricts \delta so that |\delta(q, a, X)| \leq 1 for every q, a (including a = \epsilon), and X, and so that whenever \delta(q, \epsilon, X) is nonempty, \delta(q, a, X) is empty for every a \in \Sigma; the languages accepted by DPDAs form a proper subset of the context-free languages, as not every context-free language admits a DPDA recognizer.[4]
A configuration of a PDA is a triple (q, w, \alpha), where q \in Q is the current state, w \in \Sigma^* is the remaining input, and \alpha \in \Gamma^* is the current stack content (written with the top of the stack as the leftmost symbol).[4] One configuration (q, w, \alpha) yields another (p, w', \beta) in one step if w = a w' for some a \in \Sigma \cup \{\epsilon\}, \alpha = X \gamma for some X \in \Gamma and \gamma \in \Gamma^*, and (p, \beta') \in \delta(q, a, X) with \beta = \beta' \gamma; the relation \vdash^* denotes the reflexive transitive closure of single steps, representing computations.[4]
PDAs accept strings in two equivalent modes: by final state or by empty stack.[4] The language recognized by M by final state is L(M) = \{ w \in \Sigma^* \mid (q_0, w, Z_0) \vdash^* (q, \epsilon, \alpha) \text{ for some } q \in F \text{ and } \alpha \in \Gamma^* \}, while acceptance by empty stack is N(M) = \{ w \in \Sigma^* \mid (q_0, w, Z_0) \vdash^* (q, \epsilon, \epsilon) \text{ for some } q \in Q \}; these yield the same class of languages.[4]
Nondeterministic PDAs recognize exactly the context-free languages, establishing equivalence with context-free grammars.[4] To convert a context-free grammar G to an equivalent PDA, construct M that nondeterministically simulates leftmost derivations of G by pushing nonterminals onto the stack according to productions and popping terminals as they match the input, accepting by empty stack when the derivation completes.[4] Conversely, from a PDA M, derive a grammar G whose variables encode pairs of states and stack symbols, with productions mirroring transitions to generate strings that lead from the initial configuration to an accepting one, ensuring L(G) = L(M).[4]
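The grammar-to-PDA direction is short enough to simulate directly. The following Python sketch (illustrative; the explicit depth bound is an added safeguard for termination, not part of the formal construction) keeps the unread part of a leftmost sentential form on the stack, expands nonterminals by \epsilon-moves, and matches terminals against the input, accepting by empty stack:

def cfg_pda_accepts(rules, start, w):
    # Stack is a tuple with its top at index 0; uppercase symbols are
    # nonterminals. Accept when the stack is empty and the input consumed.
    limit = 2 * len(w) + 2              # depth bound keeps the search finite
    seen = set()

    def search(pos, stack):
        if (pos, stack) in seen:        # already explored this configuration
            return False
        seen.add((pos, stack))
        if not stack:
            return pos == len(w)
        top, rest = stack[0], stack[1:]
        if top.isupper():               # epsilon-move: replace A by a body
            return any(len(rhs) + len(rest) <= limit and
                       search(pos, tuple(rhs) + rest)
                       for rhs in rules[top])
        return (pos < len(w) and w[pos] == top   # match and pop a terminal
                and search(pos + 1, rest))

    return search(0, (start,))

rules = {"S": ["aSb", ""]}
print(cfg_pda_accepts(rules, "S", "aabb"))   # True
print(cfg_pda_accepts(rules, "S", "aab"))    # False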
Examples
Dyck language
The Dyck language serves as a canonical example of a context-free language, capturing the essence of properly nested and matched structures, such as balanced parentheses. For the case of a single pair of brackets, the Dyck language D_1 over the alphabet \{ (, ) \} consists of all strings where every opening parenthesis has a corresponding closing parenthesis, with proper nesting and no unmatched symbols.[5] This language includes the empty string \epsilon, as well as strings like (), (()), and ()(()), but excludes ill-formed strings such as )( or (())(.[6]
The language D_1 is generated by the context-free grammar G = (V, \Sigma, R, S), where V = \{S\} is the set of variables, \Sigma = \{ (, ) \} is the terminal alphabet, S is the start symbol, and R consists of the productions:
S → (S) | SS | ε
This grammar recursively builds nested structures through the rule S \to (S), allows concatenation of independent substructures via S \to SS, and includes the empty production for the base case.[6][7]
More generally, the Dyck language of order k, denoted D_k, extends this to an alphabet with k distinct pairs of matching brackets, such as \{ (, ), [, ], \{, \} \} for k=3. It comprises all strings where brackets of each type are properly matched and nested, without interleaving mismatches. The grammar for D_k uses the productions S \to X_i S Y_i S for i = 1, \dots, k together with S \to \epsilon, where X_i and Y_i represent the i-th opening and closing brackets, ensuring type-specific matching.[8]
Recognition of the Dyck language by a pushdown automaton (PDA) leverages the stack to enforce matching. The PDA operates in a single state: upon reading an opening bracket, it pushes the corresponding symbol onto the stack; upon reading a closing bracket, it pops the top symbol if it matches the expected opening type, rejecting otherwise. To handle multiple types, the stack stores the bracket type. The automaton accepts by empty stack at the end of input, confirming all openings have been matched and the stack is cleared.[7]
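In miniature, this single-state automaton is a few lines of Python (an illustrative sketch of acceptance by empty stack):

def is_dyck(s, pairs={"(": ")", "[": "]", "{": "}"}):
    # Push the expected closer on an opener; pop and compare on a closer.
    stack = []
    for c in s:
        if c in pairs:
            stack.append(pairs[c])
        elif not stack or stack.pop() != c:
            return False              # mismatch, or a pop from an empty stack
    return not stack                  # accept by empty stack

print(is_dyck("()(())"))              # True
print(is_dyck("(())("))               # False
print(is_dyck("([{}])"))              # True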
The Dyck language is inherently non-regular, as demonstrated by applying the pumping lemma for regular languages to strings like (^n )^n, which cannot be pumped without violating balance. This non-regularity underscores the need for stack memory in pushdown automata to track unbounded nesting depth.[6]
The term "Dyck language" originates from the work of mathematician Walther von Dyck (1856–1934), whose foundational contributions to combinatorial group theory in the 1880s, particularly on presentations by generators and relations, inspired the modeling of balanced words as reduced forms in free groups. The modern formal usage in language theory was introduced by Noam Chomsky and Marcel-Paul Schützenberger in their 1963 paper on the algebraic structure of context-free languages.[9]
Expression grammars
Context-free grammars provide a natural way to specify the syntax of arithmetic expressions, capturing the structure of operations like addition and multiplication while incorporating parentheses for grouping and identifiers as operands. A typical language modeled this way is L, the set of all valid arithmetic expressions built from identifiers (such as variable names), the binary operators + and \times, and parentheses, excluding division and unary operations; this language is generated by a context-free grammar over the terminal alphabet \Sigma = \{ +, \times, (, ), \text{id} \}.[10]
A simple context-free grammar for L is given by the productions:
E → E + E | E × E | ( E ) | id
with E as the start symbol. This grammar generates all strings in L, but it is ambiguous because some expressions admit multiple distinct parse trees, leading to different interpretations of operator associativity or precedence.[11][12]
For example, consider the string \text{id} + \text{id} \times \text{id}. One leftmost derivation is E \Rightarrow E + E \Rightarrow \text{id} + E \Rightarrow \text{id} + E \times E \Rightarrow \text{id} + \text{id} \times E \Rightarrow \text{id} + \text{id} \times \text{id}, yielding a parse tree where addition is the root and multiplication is nested on the right. Another is E \Rightarrow E \times E \Rightarrow E + E \times E \Rightarrow \text{id} + E \times E \Rightarrow \text{id} + \text{id} \times E \Rightarrow \text{id} + \text{id} \times \text{id}, yielding a parse tree where multiplication is the root and addition is nested on the left. These trees imply different evaluation orders: the first as \text{id} + (\text{id} \times \text{id}), and the second as (\text{id} + \text{id}) \times \text{id}, highlighting the lack of specified precedence between + and \times.[11][12]
To resolve this ambiguity and enforce standard operator precedence—where multiplication binds tighter than addition—a disambiguated grammar introduces nonterminals for precedence levels:
E → E + T | T
T → T × F | F
F → ( E ) | id
Here, E handles addition, T handles multiplication, and F handles factors (parenthesized expressions or identifiers). For the string \text{id} + \text{id} \times \text{id}, the unique leftmost derivation is E \Rightarrow E + T \Rightarrow T + T \Rightarrow F + T \Rightarrow \text{id} + T \Rightarrow \text{id} + T \times F \Rightarrow \text{id} + F \times F \Rightarrow \text{id} + \text{id} \times F \Rightarrow \text{id} + \text{id} \times \text{id}, producing a parse tree that groups multiplication first: \text{id} + (\text{id} \times \text{id}). This structure ensures a single interpretation consistent with mathematical conventions.[13]
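The layered grammar translates directly into a parser. The sketch below (illustrative Python; the left-recursive rules E \to E + T and T \to T \times F are reformulated as iteration, E \to T\,(+\,T)^* and T \to F\,(\times\,F)^*, the standard transformation for top-down parsing) builds nested tuples as a stand-in for parse trees:

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():                        # E -> T ('+' T)*
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("+", node, term())
        return node
    def term():                        # T -> F ('*' F)*
        nonlocal pos
        node = factor()
        while peek() == "*":
            pos += 1
            node = ("*", node, factor())
        return node
    def factor():                      # F -> '(' E ')' | id
        nonlocal pos
        if peek() == "(":
            pos += 1
            node = expr()
            assert peek() == ")", "missing )"
            pos += 1
            return node
        assert peek() == "id", "expected id"
        pos += 1
        return "id"
    tree = expr()
    assert pos == len(tokens), "trailing input"
    return tree

# Multiplication groups first: id + (id * id)
print(parse(["id", "+", "id", "*", "id"]))   # ('+', 'id', ('*', 'id', 'id'))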
Such grammars are fundamental in compiler design, where they define the syntactic rules for expression evaluation in programming languages like C and Java, enabling parsers to construct abstract syntax trees that respect operator precedence and associativity during code compilation.[14]
Properties
Closure properties
Context-free languages are closed under several fundamental operations, allowing the construction of new context-free languages from existing ones using context-free grammars or pushdown automata. These properties facilitate the analysis and generation of complex languages while maintaining membership in the class. Specifically, closure holds for union, concatenation, Kleene star, homomorphism, and inverse homomorphism, with explicit constructions demonstrating how to derive equivalent grammars or automata.[15][16]
Union. If L_1 and L_2 are context-free languages generated by context-free grammars G_1 = (V_1, \Sigma, P_1, S_1) and G_2 = (V_2, \Sigma, P_2, S_2) with disjoint sets of nonterminals V_1 and V_2, then L_1 \cup L_2 is context-free. Construct a new grammar G = (V_1 \cup V_2 \cup \{S\}, \Sigma, P_1 \cup P_2 \cup \{S \to S_1 \mid S_2\}, S), where S is a fresh start symbol; this grammar generates strings from either L_1 or L_2.[15]
Concatenation. The concatenation L_1 \cdot L_2 = \{xy \mid x \in L_1, y \in L_2\} is also context-free. Using the same grammars G_1 and G_2 with disjoint nonterminals, construct a grammar containing all productions of both together with a fresh start symbol S and the single production S \to S_1 S_2; derivations then produce a string from L_1 followed by one from L_2.[15]
Kleene star. The Kleene star L^* = \bigcup_{k=0}^\infty L^k (with L^0 = \{\epsilon\}) of a context-free language L generated by G = (V, \Sigma, P, S) is context-free. Introduce a new start symbol S' with productions S' \to S S' \mid \epsilon, and use the original productions for S; this allows zero or more concatenations of strings from L.[15]
Homomorphism. For a homomorphism h: \Sigma^* \to \Delta^* and context-free language L \subseteq \Sigma^*, the image h(L) = \{h(w) \mid w \in L\} is context-free. Given a grammar for L, replace each occurrence of a terminal a \in \Sigma in the productions with a fresh nonterminal A_a and add the production A_a \to h(a); the resulting grammar generates exactly the images under h.[17]
Inverse homomorphism. The preimage h^{-1}(L) = \{w \in \Sigma^* \mid h(w) \in L\} under the same homomorphism h is also context-free. This follows from a pushdown automaton construction: on reading an input symbol a, the new automaton stores h(a) in a buffer kept in its finite control and simulates the PDA for L symbol by symbol on the buffered string, accepting if the simulated PDA accepts.[16]
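With grammars represented as dictionaries mapping nonterminals to lists of bodies (each body a list of symbols), the positive constructions above are nearly one-liners; the sketch below is illustrative and assumes the two grammars use disjoint nonterminal names and that "S0" is fresh:

def cf_union(p1, s1, p2, s2):
    p = {**p1, **p2}
    p["S0"] = [[s1], [s2]]             # S0 -> S1 | S2
    return p, "S0"

def cf_concat(p1, s1, p2, s2):
    p = {**p1, **p2}
    p["S0"] = [[s1, s2]]               # S0 -> S1 S2
    return p, "S0"

def cf_star(p1, s1):
    p = dict(p1)
    p["S0"] = [[s1, "S0"], []]         # S0 -> S1 S0 | epsilon
    return p, "S0"

anbn = {"S1": [["a", "S1", "b"], []]}  # generates {a^n b^n}
g, s = cf_star(anbn, "S1")             # grammar for ({a^n b^n})*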
Context-free languages are not closed under intersection, complement, or difference. For intersection, consider L_1 = \{a^n b^n c^m \mid n, m \geq 0\} and L_2 = \{a^m b^n c^n \mid m, n \geq 0\}, both context-free (generated by simple grammars matching pairs of exponents independently), but their intersection is \{a^n b^n c^n \mid n \geq 0\}, which is not context-free (provable via the pumping lemma for context-free languages).[18]
For complement, suppose context-free languages were closed under complementation; then, since they are closed under union, they would be closed under intersection via De Morgan's laws (\overline{L_1 \cup L_2} = \overline{L_1} \cap \overline{L_2}), contradicting the non-closure under intersection. Thus, there exist context-free languages whose complements are not context-free.[19]
For difference, L_1 \setminus L_2 = L_1 \cap \overline{L_2}; non-closure under either intersection or complement implies non-closure under difference, as both operations are required.[20]
Pumping lemma
The pumping lemma for context-free languages provides a necessary condition that all context-free languages must satisfy, serving as a key tool for proving that certain languages are not context-free by contradiction.
Formally, let L be a context-free language. Then there exists a positive integer p (the pumping length) such that for every string w \in L with |w| \geq p, w can be divided as w = uvxyz where:
- |vxy| \leq p,
- |vy| \geq 1,
- for all integers k \geq 0, the string uv^k x y^k z \in L.
This decomposition ensures that the portions v and y can be "pumped" (repeated any number of times, including zero) while keeping the resulting strings in L.
Proof sketch
The proof relies on the structure of parse trees generated by a context-free grammar in Chomsky normal form for L with m variables. Set the pumping length p = 2^m. A CNF parse tree in which no root-to-leaf path contains more than i variables yields a string of length at most 2^{i-1}, so any w \in L with |w| \geq p has a parse tree with a root-to-leaf path containing at least m + 1 variables; by the pigeonhole principle, some variable A occurs twice on that path. The subtrees rooted at the two occurrences of A determine the decomposition w = uvxyz: the lower subtree yields x, the upper yields vxy, and the rest of the tree supplies u and z. Choosing the repetition among the lowest variables on the path ensures |vxy| \leq p, and |vy| \geq 1 holds because the binary CNF productions force the upper subtree to properly contain the lower one. Pumping corresponds to replacing the lower A-subtree by a copy of the upper one (or the upper by the lower, for k = 0), preserving membership in L.[21]
Application example
A classic application demonstrates that the language L = \{a^n b^n c^n \mid n \geq 0\} is not context-free. Assume for contradiction that L is context-free, so it satisfies the pumping lemma with some pumping length p. Consider w = a^p b^p c^p \in L. Then w = uvxyz with |vxy| \leq p and |vy| \geq 1. Since |vxy| \leq p, the substring vxy lies entirely within the a's, or spans a's and b's, or spans b's and c's, but cannot touch all three blocks due to the length bound. Pumping with k = 2 therefore increases the counts of at most two of the three symbols while leaving the other(s) unchanged (and, if v or y straddles a block boundary, also produces symbols out of order), so the numbers of a's, b's, and c's in uv^2 x y^2 z are no longer all equal and uv^2 x y^2 z \notin L. This contradiction implies L is not context-free.[21]
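For a small pumping length the case analysis can be verified exhaustively. The Python sketch below (purely illustrative) tries every decomposition of a^p b^p c^p with |vxy| \leq p and |vy| \geq 1 for p = 4 and confirms that pumping with k = 2 always leaves the language:

def in_L(s):
    n = s.count("a")
    return s == "a" * n + "b" * n + "c" * n

p = 4
w = "a" * p + "b" * p + "c" * p
survives = False
for i in range(len(w) + 1):                          # u = w[:i]
    for l in range(i + 1, min(i + p, len(w)) + 1):   # vxy = w[i:l], |vxy| <= p
        for j in range(i, l + 1):                    # v = w[i:j]
            for m in range(j, l + 1):                # x = w[j:m], y = w[m:l]
                v, x, y = w[i:j], w[j:m], w[m:l]
                if v + y and in_L(w[:i] + v * 2 + x + y * 2 + w[l:]):
                    survives = True
print(survives)   # False: no decomposition stays in the language when pumped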
Variants
Ogden's lemma strengthens the standard pumping lemma by allowing the selection of a set of at least p "distinguished" positions in w, requiring that v and y together cover at least one distinguished position. This provides finer control over the pumpable substrings' locations, aiding proofs where the standard lemma fails due to ambiguous positioning.[22]
Limitations
The pumping lemma is a necessary but not sufficient condition for context-freeness: while every context-free language satisfies it, some non-context-free languages also satisfy the condition, so it cannot prove that a language is context-free. It is thus primarily useful for establishing non-membership in the class of context-free languages.
Decidability properties
Context-free languages exhibit a mix of decidable and undecidable decision problems, reflecting the computational boundaries of their generative models like context-free grammars and pushdown automata. Several core properties can be algorithmically determined, while others, involving global comparisons, cannot.
The membership problem for context-free languages is decidable: given a context-free grammar G and a string w, it is possible to determine whether w \in L(G). This follows from the existence of effective parsing procedures, such as the CYK algorithm applied after conversion to Chomsky normal form, which always halt with a yes or no answer.[23]
The emptiness problem is also decidable: determining whether L(G) = \emptyset for a context-free grammar G can be resolved by identifying productive nonterminals through a fixed-point computation. Specifically, a nonterminal is productive if it derives a terminal string, computed iteratively by marking nonterminals that appear in the right-hand sides of productions leading to terminals or other productive nonterminals; the language is empty if the start symbol is unproductive.[23]
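The fixed point is short enough to state as code. The following illustrative Python sketch (nonterminals encoded as uppercase letters) marks productive nonterminals until nothing changes:

def is_empty(rules, start):
    productive = set()
    changed = True
    while changed:
        changed = False
        for A, bodies in rules.items():
            if A in productive:
                continue
            for body in bodies:
                # A body is realizable if every symbol is a terminal or an
                # already-productive nonterminal.
                if all(not c.isupper() or c in productive for c in body):
                    productive.add(A)
                    changed = True
                    break
    return start not in productive

rules = {"S": ["AB"], "A": ["a"], "B": ["bB"]}   # B never derives a terminal string
print(is_empty(rules, "S"))                      # True: S is unproductive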
Similarly, the finiteness problem—whether L(G) is finite—is decidable. After removing useless symbols, one checks the grammar for cycles: L(G) is infinite exactly when some nonterminal is self-embedding, deriving a string that contains itself surrounded by at least one other symbol (A \Rightarrow^+ uAv with uv \neq \epsilon), since such a cycle permits unbounded pumping; if no such cycle exists, only finitely many derivations are possible and the language is finite.[23]
However, the universality problem is undecidable: it cannot be determined in general whether L(G) = \Sigma^* for a context-free grammar G over alphabet \Sigma. This result is obtained by reduction from the undecidable Post Correspondence Problem, encoding instances of PCP into context-free grammars such that the language equals \Sigma^* if and only if the PCP has a solution.[23]
The equivalence problem between two context-free grammars G_1 and G_2—determining if L(G_1) = L(G_2)—is likewise undecidable. This undecidability follows directly from that of universality, as equivalence to a trivial grammar generating \Sigma^* would solve the universality problem, which is impossible.[23]
In contrast to the general case, regularity is decidable for deterministic context-free languages: given a deterministic pushdown automaton, one can effectively determine whether the language it accepts is regular. For an arbitrary context-free grammar G, however, deciding whether L(G) is regular is undecidable.[23]
Parsing and recognition
Parsing algorithms
Top-down parsing algorithms, such as LL(1) parsers, construct a parse tree by starting from the start symbol and recursively expanding nonterminals based on the input string, using one token of lookahead to predict the next production. These parsers are predictive, relying on a parsing table constructed from FIRST and FOLLOW sets to determine the appropriate production for each nonterminal given the current lookahead symbol. The FIRST set of a string α is the set of terminals that can appear as the first symbol in some derivation of α, while the FOLLOW set of a nonterminal A is the set of terminals that can appear immediately after A in some sentential form. To build the predictive parsing table, for each production A → α, the entry in row A and column t is filled for every terminal t in FIRST(α); if ε ∈ FIRST(α), the entry is also filled for every t in FOLLOW(A). Conflicts arise if multiple productions are predicted for the same entry, indicating the grammar is not LL(1).
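The FIRST computation is itself a small fixed point. The illustrative Python sketch below computes FIRST sets for an LL(1) form of the expression grammar from the previous section (uppercase letters are nonterminals, the empty string stands for ε, and the single-letter names are abbreviations chosen for the example); FOLLOW sets are obtained by a similar iteration:

def first_sets(rules):
    FIRST = {A: set() for A in rules}
    def first_of(body):                  # FIRST of a string of symbols
        out = set()
        for c in body:
            if not c.isupper():          # a terminal begins everything after it
                out.add(c)
                return out
            out |= FIRST[c] - {""}
            if "" not in FIRST[c]:       # c cannot vanish, so stop here
                return out
        out.add("")                      # the whole body can derive epsilon
        return out
    changed = True
    while changed:
        changed = False
        for A, bodies in rules.items():
            for body in bodies:
                f = first_of(body)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

# E -> T X, X -> '+' T X | eps, T -> F Y, Y -> '*' F Y | eps, F -> '(' E ')' | 'i'
rules = {"E": ["TX"], "X": ["+TX", ""], "T": ["FY"], "Y": ["*FY", ""],
         "F": ["(E)", "i"]}
print(first_sets(rules))
# FIRST(E) = FIRST(T) = FIRST(F) = {'(', 'i'}; FIRST(X) = {'+', ''}; FIRST(Y) = {'*', ''}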
Bottom-up parsing algorithms, including LR(0), SLR(1), and LALR(1) parsers, build the parse tree from the input string upward by shifting symbols onto a stack and reducing them according to productions when a handle (right side of a production) is recognized. LR(0) parsers use items of the form A → α • β to represent positions in productions without lookahead, constructing a deterministic finite automaton from the grammar's canonical collection of sets of items to guide shifts and reduces. SLR(1) (Simple LR(1)) extends LR(0) by incorporating FOLLOW sets to resolve reduce actions based on one lookahead symbol, while LALR(1) (Look-Ahead LR(1)) merges states from the full LR(1) automaton to reduce table size, preserving determinism for many practical grammars. Shift-reduce conflicts occur when the parser cannot decide whether to shift the next input symbol or reduce the top stack symbols, often due to ambiguous grammars; they are typically resolved by grammar redesign or by precedence and associativity declarations that select a unique action.
The Cocke-Younger-Kasami (CYK) algorithm is a general bottom-up parsing method applicable to any context-free grammar in Chomsky normal form (CNF), where productions are either A → BC or A → a. It employs dynamic programming with a triangular table T, where A ∈ T[i][j] holds exactly when the substring of length j starting at position i can be derived from the nonterminal A. The algorithm initializes the table for single symbols (j = 1) and iteratively fills longer substrings by checking productions A → BC, where B derives the left part and C the adjacent right part, recognizing membership if the start symbol derives the full string.[24]
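A direct implementation of this table-filling is brief. The sketch below is illustrative Python; the grammar encoding (separate maps for binary and terminal productions) is an assumption of the example:

def cyk(rules_bin, rules_term, start, w):
    # rules_bin:  {A: [(B, C), ...]}  for productions A -> BC
    # rules_term: {A: [a, ...]}       for productions A -> a
    n = len(w)
    if n == 0:
        return False       # a possible S -> epsilon must be checked separately
    # T[i][j] = set of nonterminals deriving w[i : i + j]
    T = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, c in enumerate(w):
        for A, terms in rules_term.items():
            if c in terms:
                T[i][1].add(A)
    for j in range(2, n + 1):            # span length
        for i in range(n - j + 1):       # span start
            for k in range(1, j):        # split point
                for A, pairs in rules_bin.items():
                    for B, C in pairs:
                        if B in T[i][k] and C in T[i + k][j - k]:
                            T[i][j].add(A)
    return start in T[0][n]

# CNF for {a^n b^n | n >= 1}: S -> AC | AB, C -> SB, A -> a, B -> b
rules_bin = {"S": [("A", "C"), ("A", "B")], "C": [("S", "B")]}
rules_term = {"A": ["a"], "B": ["b"]}
print(cyk(rules_bin, rules_term, "S", "aabb"))   # True
print(cyk(rules_bin, rules_term, "S", "abab"))   # False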
The Earley algorithm provides a general chart-parsing approach that handles all context-free languages, including ambiguous and nondeterministic grammars, using dynamic programming without requiring grammar transformation. It maintains a chart of states for each position j in the input, where each state is a tuple (A → α • β, i, j) indicating that α derives the substring from i to j, with • marking the current position in the production. Parsing proceeds in three phases per position: prediction adds new states for non-terminals β at j; scanning advances • over matching terminals to j+1; and completion links completed productions (• at end of β) back to waiting states expecting that non-terminal. This process explores all possible derivations, achieving O(n^3) time complexity in the worst case.[25]
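The three phases fit in a compact recognizer. This illustrative Python sketch represents a state as (head, body, dot, origin) and runs prediction, scanning, and completion to a fixed point at each input position; it decides membership only and does not build parse trees:

def earley_accepts(grammar, start, words):
    # grammar: {A: [[sym, ...], ...]}; dict keys are nonterminals,
    # anything else is a terminal.
    GAMMA = "GAMMA"                      # fresh wrapper start symbol
    rules = dict(grammar)
    rules[GAMMA] = [[start]]
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add((GAMMA, (start,), 0, 0))
    for j in range(len(words) + 1):
        added = True
        while added:                     # iterate to a fixed point at position j
            added = False
            for head, body, dot, origin in list(chart[j]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in rules:     # predict
                        for rhs in rules[sym]:
                            s = (sym, tuple(rhs), 0, j)
                            if s not in chart[j]:
                                chart[j].add(s); added = True
                    elif j < len(words) and words[j] == sym:   # scan
                        chart[j + 1].add((head, body, dot + 1, origin))
                else:                    # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            s = (h2, b2, d2 + 1, o2)
                            if s not in chart[j]:
                                chart[j].add(s); added = True
    return (GAMMA, (start,), 1, 0) in chart[len(words)]

grammar = {"S": [["S", "+", "S"], ["a"]]}            # deliberately ambiguous
print(earley_accepts(grammar, "S", list("a+a+a")))   # True
print(earley_accepts(grammar, "S", list("a+")))      # False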
Parsing algorithms can detect ambiguity in context-free grammars by identifying multiple valid parse trees for the same input string; for instance, LL(1) and LR parsers fail with conflicts on ambiguous grammars, while general methods like Earley or CYK reveal ambiguity through multiple completing states or table entries for the start symbol.[25]
Membership testing
Membership testing for a context-free language (CFL) involves determining whether a given string of length n belongs to the language generated by a context-free grammar (CFG). This problem is decidable, and efficient algorithms exist that run in polynomial time, placing CFL membership in the complexity class P. The Cocke-Younger-Kasami (CYK) algorithm and the Earley parser are two general-purpose methods for testing membership in arbitrary CFLs, both exhibiting a worst-case time complexity of O(n^3).
The CYK algorithm relies on dynamic programming and requires the CFG to be in Chomsky normal form (CNF), where productions are limited to A \to BC (with B, C nonterminals) or A \to a (with a a terminal). Any CFG can be converted to CNF in polynomial time without altering the language it generates, enabling efficient application of CYK. This conversion ensures that the algorithm fills an n \times n table bottom-up, where each entry represents possible nonterminal derivations for substrings, leading to O(n^2) space complexity for the table.
For certain subclasses of CFLs, more efficient linear-time algorithms are available. Top-down LL(k) parsers and bottom-up LR(k) parsers achieve O(n) time complexity when the grammar is suitable—specifically, LL for left-to-right, leftmost derivations without left recursion, and LR for a broader class including left-recursive grammars. LR parsers handle left recursion naturally without infinite loops, unlike standard top-down approaches, which require grammar transformations to eliminate it.
In practice, membership testing via LR parsing is widely implemented in compiler construction tools. For instance, Yacc (Yet Another Compiler-Compiler) generates LR parsers from CFGs, facilitating efficient syntactic analysis in compilers for programming languages. These tools preprocess the grammar to build parsing tables, allowing linear-time recognition during compilation.[26]
An alternative theoretical approach simulates a nondeterministic pushdown automaton (NPDA) equivalent to the CFG, but this is inefficient for membership testing. Direct simulation explores all possible computation paths, potentially leading to exponential time complexity in the worst case due to branching nondeterminism and stack operations. In contrast, the parsing algorithms above avoid this by using deterministic table-driven methods.
Comparisons
Chomsky hierarchy
The Chomsky hierarchy classifies formal grammars into four types based on restrictions on their production rules, each corresponding to a class of languages of increasing generative power. Type 0 grammars are unrestricted, generating recursively enumerable languages; Type 1 grammars are context-sensitive, generating context-sensitive languages; Type 2 grammars are context-free, generating context-free languages; and Type 3 grammars are regular, generating regular languages.[27]
Noam Chomsky first introduced the hierarchy in his 1956 paper "Three Models for the Description of Language," where he explored finite-state, phrase-structure, and transformational models, and formalized it in 1959 in "On Certain Formal Properties of Grammars."[28][29]
Type 2 grammars in the hierarchy, known as context-free grammars, consist of production rules of the form A \to \alpha, where A is a single nonterminal symbol and \alpha is any string composed of terminal and nonterminal symbols, without any context restrictions on the application of rules. These grammars generate exactly the context-free languages.[29][27]
The language classes form strict inclusions: regular languages are properly contained in context-free languages (\text{REG} \subset \text{CFL}), context-free languages are properly contained in context-sensitive languages (\text{CFL} \subset \text{CSL}), and context-sensitive languages are properly contained in recursively enumerable languages (\text{CSL} \subset \text{RE}). For example, the language \{a^n b^n \mid n \geq 0\} is context-free but not regular, as shown by the pumping lemma for regular languages. Similarly, the language \{a^n b^n c^n \mid n \geq 0\} is context-sensitive but not context-free.[27]
Corresponding to these language classes, context-free languages are recognized by pushdown automata, which use a stack for memory, while context-sensitive languages are recognized by linear bounded automata, which operate within linear space bounds relative to the input length.[27]
Non-context-free languages
A canonical example of a non-context-free language is L = \{ a^n b^n c^n \mid n \geq 0 \}, which consists of strings with equal numbers of each of three distinct symbols.[30]
The pumping lemma for context-free languages can be used to prove that L is not context-free. Assume for contradiction that L is context-free, and let p be the pumping length given by the lemma. Select the string w = a^p b^p c^p \in L, which has length greater than p. By the lemma, w = uvxyz where |vxy| \leq p, |vy| > 0, and uv^i x y^i z \in L for all i \geq 0. The constraint |vxy| \leq p implies that the substring vxy lies entirely within one block of identical symbols or spans at most two adjacent blocks. If vxy is confined to one block (e.g., the a's), pumping with i = 2 increases the count of that symbol while leaving the others unchanged, resulting in unequal counts and a string not in L. If vxy spans two blocks (e.g., a's and b's), pumping with i = 2 either introduces symbols out of order or imbalances the counts across the three symbols, again yielding a string not in L. In all cases, a contradiction arises, so L is not context-free.[30]
Another classic non-context-free language is the copy language L = \{ ww \mid w \in \{a, b\}^* \}, comprising all strings that are the concatenation of some word with itself.[31]
To demonstrate that this language is not context-free using the pumping lemma, assume it is, with pumping length p, and choose w = a^p b^p a^p b^p \in L. Decompose w = uvxyz with the usual conditions. Since |vxy| \leq p, the window containing v and y either straddles the midpoint of w or lies within one half. If it straddles the midpoint, pumping down (i = 0) yields a string of the form a^p b^j a^l b^p with j, l \leq p and j + l < 2p, which cannot be split into two identical halves; if it lies within one half, pumping changes the length or contents of that half without the matching change in the other (or yields a string of odd length, which cannot have the form ww). In every case the pumped string falls outside L, a contradiction, so L is not context-free.[21]
The language L = \{ a^n b^m \mid n, m \geq 0, n \neq m \} over the alphabet \{a, b\} provides insight into closure properties, as it equals the union of the context-free languages \{ a^n b^m \mid n > m \} and \{ a^n b^m \mid n < m \}, both of which are context-free (generated by straightforward context-free grammars that produce excess a's or b's). However, examples like \{ a^n b^n c^n \mid n \geq 0 \} arise as intersections of context-free languages, such as \{ a^n b^n c^k \mid n, k \geq 0 \} \cap \{ a^n b^k c^k \mid n, k \geq 0 \}, illustrating that the intersection of two context-free languages need not be context-free.[30]
Ogden's lemma strengthens the pumping lemma by allowing the selection of distinguished positions in the pumped string, enabling more targeted proofs for non-context-freeness in languages where the standard lemma's decomposition is ambiguous, such as variants of the copy language or languages requiring precise symbol matching across distant parts of the string.[32]
Languages like \{ a^n b^n c^n \mid n \geq 0 \} and the copy language are context-sensitive, as they can be recognized by linear bounded automata or generated by context-sensitive grammars that enforce equality through context-dependent rules, such as progressively marking and copying symbol counts.[33]