Formal grammar
A formal grammar is a precise mathematical system used in theoretical computer science and linguistics to define the syntax of a formal language through a finite set of production rules that generate valid strings from an alphabet of symbols. It is typically formalized as a four-tuple G = (\Sigma, V, S, P), where \Sigma is a finite set of terminal symbols (the basic alphabet elements), V is a finite set of nonterminal symbols (variables representing syntactic categories), S \in V is the start symbol, and P is a finite set of production rules of the form \alpha \to \beta (with \alpha containing at least one nonterminal).[1] The concept of formal grammar was pioneered by Noam Chomsky in the mid-1950s as a tool for rigorously describing language structure, beginning with his 1956 paper "Three Models for the Description of Language," where he outlined finite-state, phrase-structure, and transformational grammars as increasingly powerful mechanisms for generating grammatical sentences.[2] In his 1957 book Syntactic Structures, Chomsky expanded this framework into the Chomsky hierarchy, classifying grammars into four types based on rule restrictions: Type-0 (unrestricted, generating recursively enumerable languages), Type-1 (context-sensitive, where the left-hand side of each rule is no longer than the right-hand side), Type-2 (context-free, with a single nonterminal on the left, suitable for most programming language syntax), and Type-3 (regular, limited to right-linear or left-linear rules for simple patterns).[3] This hierarchy corresponds to classes of automata—such as Turing machines for Type-0 and finite automata for Type-3—highlighting the computational complexity of language recognition.[1] Formal grammars are foundational in compiler design, where context-free grammars (often expressed in Backus-Naur Form) specify the syntax of programming languages like Fortran and Java, enabling parsers to validate and analyze code structure during compilation.[3] They also underpin applications in natural language processing, molecular biology for modeling genetic sequences, and symbolic dynamics, with decidability of membership varying by type: undecidable for Type-0, PSPACE-complete for Type-1, polynomial time for Type-2, and linear time for Type-3.[1][4]

Introductory examples
Simple string generation
A formal grammar provides a precise mechanism for generating strings from a specified alphabet using a set of production rules applied iteratively from a start symbol. Consider the simple grammar G = (V, \Sigma, P, S), where V = \{S\} is the set of nonterminals, \Sigma = \{a\} is the set of terminals, S is the start symbol, and P = \{ S \to aS \mid \varepsilon \} is the set of production rules.[5] This grammar exemplifies the basic generative process without introducing complex structures. The empty string \varepsilon is derived directly by applying the rule S \to \varepsilon:

S \Rightarrow \varepsilon
The single-symbol string "a" is obtained by first applying S \to aS and then S \to \varepsilon:
S \Rightarrow aS \Rightarrow a\varepsilon = a [6]
For "aa", the rule S \to aS is applied twice before terminating with S \to \varepsilon:
S \Rightarrow aS \Rightarrow aaS \Rightarrow aa\varepsilon = aa [6]

Repeated applications of S \to aS any number of times, followed by S \to \varepsilon, generate all strings in the language L(G) = \{ a^n \mid n \geq 0 \}, consisting of zero or more repetitions of the symbol "a".[6] The generative process unfolds as a sequence of replacements, starting from the start symbol and substituting nonterminals according to the rules until only terminals remain. For instance, the derivation for "aaa" proceeds as:

S \Rightarrow aS \Rightarrow aaS \Rightarrow aaaS \Rightarrow aaa\varepsilon = aaa

This step-by-step replacement highlights how the grammar builds strings incrementally through recursion on the nonterminal S.[6] Such examples illustrate the foundational role of production rules in defining formal languages, with this particular grammar classified as type-3 in the Chomsky hierarchy.[5]
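The generative process above can be mirrored in a few lines of code. The following Python sketch (an illustrative helper written for this article, not part of any standard library) repeatedly applies S \to aS and finally S \to \varepsilon to produce members of L(G):

```python
def derive_a_n(n: int) -> str:
    """Derive the string a^n from the grammar S -> aS | epsilon.

    Each loop iteration applies S -> aS once; the final step applies
    S -> epsilon, leaving only terminal symbols.
    """
    sentential_form = "S"  # start symbol
    for _ in range(n):
        sentential_form = sentential_form.replace("S", "aS")  # apply S -> aS
    return sentential_form.replace("S", "")                   # apply S -> epsilon

# Enumerate the first few members of L(G) = { a^n | n >= 0 }
print([derive_a_n(n) for n in range(4)])   # ['', 'a', 'aa', 'aaa']
```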
Arithmetic expressions
A formal grammar for arithmetic expressions extends the basic string generation example by incorporating recursion and multiple nonterminals to capture operator precedence and nesting via parentheses. Consider the grammar with nonterminals V = \{E, T, F\}, where E represents expressions (handling addition), T represents terms (handling multiplication), and F represents factors (handling parentheses and numbers); terminals \Sigma = \{+, *, (, ), \text{number}\}; start symbol E; and productions P:

\begin{align*} E &\to E + T \mid T \\ T &\to T * F \mid F \\ F &\to (E) \mid \text{number} \end{align*}

This grammar enforces the higher precedence of multiplication over addition through the hierarchical nonterminals, and left associativity for operators of the same precedence via left recursion.[7][8] To illustrate derivation, consider the leftmost derivation of the expression (2 + 3) * 4:

\begin{align*} E &\Rightarrow T \\ &\Rightarrow T * F \\ &\Rightarrow F * F \\ &\Rightarrow (E) * F \\ &\Rightarrow (E + T) * F \\ &\Rightarrow (T + T) * F \\ &\Rightarrow (F + T) * F \\ &\Rightarrow (\text{number} + T) * F \\ &\Rightarrow (2 + T) * F \\ &\Rightarrow (2 + F) * F \\ &\Rightarrow (2 + \text{number}) * F \\ &\Rightarrow (2 + 3) * F \\ &\Rightarrow (2 + 3) * \text{number} \\ &\Rightarrow (2 + 3) * 4 \end{align*}

Each step replaces the leftmost nonterminal using a production, demonstrating how recursion in F \to (E) allows nested subexpressions while the structure ensures proper precedence.[7] Although this grammar is unambiguous and correctly models standard arithmetic precedence, a simpler flat grammar (e.g., with productions E \to E + E \mid E * E \mid (E) \mid \text{number}) would be ambiguous, allowing multiple parses for an expression like 1 + 2 * 3—one treating it as (1 + 2) * 3 and another as 1 + (2 * 3). The use of multiple nonterminals in the given grammar resolves this by structurally prioritizing multiplication.[9][7] Parse trees provide a graphical representation of the derivation, with the root as the start symbol E and leaves forming the input string. For (2 + 3) * 4, the tree has E at the root, which derives T, which branches into T * F; the left T derives F \to (E), whose inner E expands as E + T \to T + T \to F + F \to 2 + 3, and the right F derives 4. Internal nodes reflect production applications, illustrating the hierarchical structure and confirming the single valid parse.[8] This example fits within Type-2 (context-free) grammars in the Chomsky hierarchy.[7]
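The precedence structure encoded by the E, T, F nonterminals maps directly onto a recursive-descent parser, with one function per nonterminal. The Python sketch below is illustrative only (function and token names are chosen for this example, not taken from the article); it evaluates a tokenized expression such as (2 + 3) * 4 while respecting the grammar's precedence and associativity:

```python
def parse_expression(tokens: list[str], pos: int = 0):
    """E -> E + T | T, implemented as T followed by zero or more '+ T'."""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        value += rhs
    return value, pos

def parse_term(tokens, pos):
    """T -> T * F | F, implemented as F followed by zero or more '* F'."""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        rhs, pos = parse_factor(tokens, pos + 1)
        value *= rhs
    return value, pos

def parse_factor(tokens, pos):
    """F -> ( E ) | number."""
    if tokens[pos] == "(":
        value, pos = parse_expression(tokens, pos + 1)
        assert tokens[pos] == ")", "unbalanced parentheses"
        return value, pos + 1
    return int(tokens[pos]), pos + 1

# (2 + 3) * 4 evaluates to 20 because the grammar groups the parenthesized sum first.
print(parse_expression(["(", "2", "+", "3", ")", "*", "4"])[0])
```

Replacing the left recursion of E \to E + T and T \to T * F with loops is a standard transformation when hand-writing parsers; it preserves the left-to-right grouping of the original rules while avoiding infinite recursion.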
Formal definition
Components and syntax
A formal grammar is defined as a quadruple G = (\Sigma, V, P, S), where \Sigma is a finite set of terminal symbols representing the basic alphabet of the language, V is a finite set of nonterminal symbols (also called variables) disjoint from \Sigma, P is a finite set of production rules, and S \in V is the start symbol.[1] The terminals in \Sigma are the indivisible units that appear in the final strings of the language generated by the grammar, while the nonterminals in V serve as placeholders or categories that can be rewritten according to the rules.[1] The production rules in P are of the form \alpha \to \beta, where \alpha, \beta \in (V \cup \Sigma)^* and \alpha is a non-empty string containing at least one nonterminal from V.[1] In notation for certain types of grammars, such as context-free grammars, productions are often written as A \to \gamma, where A \in V is a single nonterminal on the left-hand side, and \gamma \in (V \cup \Sigma)^* is a string (possibly empty) on the right-hand side.[1] These rules specify how nonterminals can be replaced to derive strings, ensuring that the grammar remains finite in description despite potentially generating infinite languages through recursive applications of the rules.[10] For instance, a grammar with recursive productions can produce arbitrarily long strings from a finite set of rules, distinguishing it from finite languages that require enumerating all members explicitly.[1] The concept of production systems originated in the work of Emil Post in the 1920s, particularly through his development of string-rewriting mechanisms in his 1921 dissertation and later formalizations.[11] These ideas were formalized and applied to linguistics by Noam Chomsky in the 1950s, notably in his 1956 paper introducing generative grammars and their hierarchical classification.[12] This foundational structure underpins the Chomsky hierarchy, which categorizes grammars based on constraints imposed on the production rules defined here.[12]
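The quadruple structure translates naturally into a data type. The following Python sketch (an illustrative representation invented for this article, not a standard library class) models a grammar as the four components above and checks the basic well-formedness condition that every left-hand side contains at least one nonterminal:

```python
from dataclasses import dataclass

@dataclass
class Grammar:
    terminals: set[str]                 # Sigma
    nonterminals: set[str]              # V, disjoint from Sigma
    productions: list[tuple[str, str]]  # pairs (alpha, beta), symbols written as characters
    start: str                          # S, a member of V

    def is_well_formed(self) -> bool:
        """Every alpha must contain at least one nonterminal, and S must be in V."""
        if self.start not in self.nonterminals:
            return False
        return all(any(ch in self.nonterminals for ch in alpha)
                   for alpha, _ in self.productions)

# The grammar for { a^n | n >= 0 } from the introductory example.
g = Grammar(terminals={"a"}, nonterminals={"S"},
            productions=[("S", "aS"), ("S", "")], start="S")
print(g.is_well_formed())   # True
```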
Mathematical formalism
A formal grammar G is rigorously defined in set-theoretic terms as a 4-tuple G = (\Sigma, V, P, S), where \Sigma is a finite set of terminal symbols serving as the alphabet, V is a finite set of nonterminal (or variable) symbols with V \cap \Sigma = \emptyset, P is a finite set of production rules of the form \alpha \to \beta where \alpha \in (V \cup \Sigma)^+ contains at least one symbol from V and \beta \in (V \cup \Sigma)^*, and S \in V is the designated start symbol.[1] This structure provides the generative mechanism for the grammar by allowing replacement of substrings containing nonterminals according to the rules. The language generated by the grammar, denoted L(G), consists of all terminal strings derivable from the start symbol through repeated applications of the production rules: L(G) = \{ w \in \Sigma^* \mid S \Rightarrow^* w \}, where \Rightarrow denotes a direct derivation (replacing the left-hand side of a production with its right-hand side in a sentential form), and \Rightarrow^* indicates the reflexive transitive closure of \Rightarrow, allowing zero or more such steps.[1] This definition captures the set of all valid strings in the formal language, emphasizing the generative power rooted in iterative rewriting.

Formal languages support key operations that facilitate theoretical analysis. Concatenation of two languages L_1 and L_2 over \Sigma yields L_1 L_2 = \{ xy \mid x \in L_1, y \in L_2 \}, combining strings from each.[13] Union is defined as L_1 \cup L_2 = \{ w \mid w \in L_1 \lor w \in L_2 \}, while the Kleene star (or regular closure) operation on a language L produces L^* = \bigcup_{n=0}^\infty L^n, where L^0 = \{\epsilon\} (with \epsilon the empty string) and L^{n+1} = L^n L for n \geq 0.[13] These operations are foundational for constructing complex languages from simpler ones. The class of languages generated by unrestricted formal grammars—known as recursively enumerable languages—is closed under union and concatenation, meaning that if L_1 = L(G_1) and L_2 = L(G_2) for grammars G_1 and G_2, then there exist grammars G_3 and G_4 such that L(G_3) = L_1 \cup L_2 and L(G_4) = L_1 L_2.[13] It is also closed under Kleene star. However, this class is not closed under complement, as the complement of a recursively enumerable language may not be recursively enumerable.[13]
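For finite languages these operations can be computed directly on sets of strings. The short Python sketch below is illustrative only (unbounded languages would require lazy enumeration rather than finite sets); it shows concatenation, union, and a truncated Kleene star:

```python
def concat(l1: set[str], l2: set[str]) -> set[str]:
    """L1 L2 = { xy | x in L1, y in L2 }."""
    return {x + y for x in l1 for y in l2}

def kleene_star(l: set[str], max_power: int) -> set[str]:
    """Approximate L* by the union of L^0 .. L^max_power (L* itself is infinite)."""
    result, power = {""}, {""}          # L^0 = { epsilon }
    for _ in range(max_power):
        power = concat(power, l)        # L^(n+1) = L^n L
        result |= power
    return result

l1, l2 = {"a", "ab"}, {"b"}
print(concat(l1, l2))                   # {'ab', 'abb'}
print(l1 | l2)                          # union: {'a', 'ab', 'b'}
print(sorted(kleene_star({"ab"}, 3)))   # ['', 'ab', 'abab', 'ababab']
```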
Derivation example
To illustrate the derivation process in a formal grammar, consider the context-free grammar G = (\Sigma, V, P, S) for the language of even-length palindromes over the alphabet \Sigma = \{a, b\}, where the nonterminal set is V = \{S\}, the start symbol is S, and the production rules are P = \{ S \to a S a \mid b S b \mid \varepsilon \}.[14][15] A concrete example is the leftmost derivation of the string "abba" from the start symbol S:

S \Rightarrow a S a \quad (\text{using } S \to a S a) \Rightarrow a b S b a \quad (\text{using } S \to b S b) \Rightarrow a b \varepsilon b a \quad (\text{using } S \to \varepsilon) = abba

This sequence terminates when no nonterminals remain, yielding the terminal string "abba".[16] In this derivation, the intermediate strings—such as a S a and a b S b a—are sentential forms, which consist of a mix of terminals and nonterminals generated during the rewriting process. The final sentential form a b \varepsilon b a simplifies to the pure terminal string by removing the empty production's \varepsilon. These forms demonstrate how the grammar recursively builds symmetric structures around the center.[17] The yield of a derivation is the terminal string obtained upon termination, here "abba", which belongs to the language L(G) generated by the grammar. The yield function maps each valid derivation sequence to its corresponding terminal yield, ensuring that all strings in L(G) can be produced through finite applications of the productions. For simple context-free grammars like this one, membership in L(G) is decidable, as algorithms such as the Cocke-Younger-Kasami method can verify whether a given string derives from the start symbol in polynomial time.[14]
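Membership for this particular grammar can be decided directly by mirroring its three productions in a recursive function. The Python sketch below is a grammar-specific check written for this example (not the general Cocke-Younger-Kasami algorithm mentioned above); it accepts exactly the even-length palindromes over \{a, b\}:

```python
def in_language(w: str) -> bool:
    """Decide membership in L(G) for S -> aSa | bSb | epsilon."""
    if w == "":                                   # S -> epsilon
        return True
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "ab":
        return in_language(w[1:-1])               # S -> aSa or S -> bSb
    return False

print(in_language("abba"))   # True, via S => aSa => abSba => abba
print(in_language("abab"))   # False: not a palindrome
```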
Chomsky hierarchy
Type-3 grammars (regular)
Type-3 grammars, also known as regular grammars, represent the simplest class in the Chomsky hierarchy and are restricted to generating regular languages. These grammars consist of a finite set of production rules that enforce a linear structure, specifically right-linear or left-linear form, ensuring that derivations proceed in a manner analogous to state transitions in finite automata.[2] The production rules in a Type-3 grammar are limited to the forms A \to a B, A \to a, or A \to \epsilon (right-linear) or A \to B a, A \to a, or A \to \epsilon (left-linear), where A and B are nonterminal symbols, a is a terminal symbol from the alphabet \Sigma, and \epsilon denotes the empty string. This linear restriction means that each rule appends at most one nonterminal to the right (or left) of a single terminal (or none), preventing the embedding of nonterminals in complex contexts. Such rules ensure that all derivations are leftmost or rightmost, aligning with the sequential processing of regular languages.

The languages generated by Type-3 grammars are precisely the regular languages, which are recognized by finite automata and described by regular expressions. Regular languages exhibit closure under key operations: the union of two regular languages is regular, as are their intersection and complement relative to the universal language over the alphabet. These properties follow from constructions on finite automata, such as product automata for union and intersection, and inversion of accepting states for complement.[18]

Every Type-3 grammar corresponds directly to a nondeterministic finite automaton (NFA), establishing their equivalence to regular languages. The conversion algorithm constructs an NFA whose states are the nonterminals of the grammar plus a designated accepting state; the start state is the grammar's start symbol. For a right-linear grammar, for each rule A \to a B, add a transition from state A to state B labeled a; for A \to a, add a transition from A to the accepting state on a; and for A \to \epsilon, add an \epsilon-transition from A to the accepting state. The designated accepting state is the automaton's only final state. A similar construction applies to left-linear grammars by reversing the transitions and reading the string backward. This mapping preserves the language generated by the grammar.[19]

A representative example is the grammar generating the language (ab)^*, with nonterminal S (the start symbol) and rules S \to ab S and S \to \epsilon (the two-terminal rule abbreviates the strictly right-linear pair S \to a A, A \to b S). A derivation for the string "ab" proceeds as S \Rightarrow ab S \Rightarrow ab \epsilon = ab, while for "abab", it is S \Rightarrow ab S \Rightarrow ab ab S \Rightarrow ab ab \epsilon = abab. This grammar's right-linear structure mirrors the cyclic transitions in an equivalent NFA with two states.
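The grammar-to-NFA construction just described can be sketched in a few lines. The Python code below is an illustrative sketch (function names and the ACCEPT state label are invented for this example); it converts right-linear rules of the forms A \to aB, A \to a, and A \to \epsilon into a transition table and then simulates the resulting NFA:

```python
from collections import defaultdict

ACCEPT = "accept"   # designated accepting state added by the construction

def grammar_to_nfa(rules):
    """Build NFA transitions from right-linear rules given as (A, rhs) pairs,
    where rhs is '' (A -> epsilon), a terminal 'a' (A -> a),
    or a terminal followed by a nonterminal 'aB' (A -> aB)."""
    delta = defaultdict(set)                  # (state, symbol) -> successor states
    for lhs, rhs in rules:
        if rhs == "":
            delta[(lhs, "")].add(ACCEPT)      # epsilon-move to the accepting state
        elif len(rhs) == 1:
            delta[(lhs, rhs)].add(ACCEPT)     # consume a and accept
        else:
            delta[(lhs, rhs[0])].add(rhs[1])  # consume a and move to state B
    return delta

def accepts(delta, start, word):
    """Simulate the NFA. The only epsilon-moves produced by this construction
    lead to the dead-end accepting state, so taking them once at the end suffices."""
    states = {start}
    for ch in word:
        states = {t for s in states for t in delta.get((s, ch), set())}
    states |= {t for s in states for t in delta.get((s, ""), set())}
    return ACCEPT in states

# Strictly right-linear grammar for (ab)*: S -> aA, A -> bS, S -> epsilon
delta = grammar_to_nfa([("S", "aA"), ("A", "bS"), ("S", "")])
print(accepts(delta, "S", ""))      # True
print(accepts(delta, "S", "abab"))  # True
print(accepts(delta, "S", "aba"))   # False
```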
Type-2 grammars (context-free)
Type-2 grammars, known as context-free grammars (CFGs), form a central class in the Chomsky hierarchy, enabling the description of languages with recursive, hierarchical structures such as those found in programming languages and natural language syntax. A CFG is defined as a quadruple G = (V, \Sigma, P, S), where V is a finite set of nonterminal symbols (variables), \Sigma is a finite set of terminal symbols, P is a finite set of production rules, and S \in V is the start symbol. Each production rule in P has the form A \to \gamma, where A is a single nonterminal from V, and \gamma is any finite string (including the empty string) composed of symbols from V \cup \Sigma, i.e., \gamma \in (V \cup \Sigma)^*.[20] This context-independent rule structure allows derivations where the replacement of a nonterminal depends only on itself, not surrounding symbols, distinguishing CFGs from more restrictive regular grammars while being simpler than context-sensitive ones.[21] The languages generated by CFGs, called context-free languages (CFLs), are recognized by pushdown automata and include sets like balanced parentheses or arithmetic expressions with parentheses.[22] Regular grammars represent a proper subset of CFGs, as every regular language is context-free but not vice versa, due to the ability of CFGs to handle non-linear recursion for nested constructs.[21]

A useful canonical representation for CFGs is the Chomsky normal form (CNF), which restricts productions to A \to BC (where A, B, C \in V and B, C \neq S) or A \to a (where a \in \Sigma), with the possible exception of S \to \epsilon if the empty string is in the language.[22] Any CFG can be algorithmically converted to an equivalent CNF through a series of transformations: first, eliminate \epsilon-productions by substituting alternatives and adjusting the start symbol if needed; second, remove unit productions A \to B by replacing them with the productions of B; third, for rules with more than two symbols on the right-hand side, introduce new nonterminals to binarize them (e.g., A \to XYZ becomes A \to XY_1, Y_1 \to YZ); finally, replace terminals in rules of length greater than one with new nonterminals (e.g., A \to aB becomes A \to X B, X \to a). These steps preserve the generated language and facilitate theoretical analysis and efficient parsing.[22]

Ambiguity arises in CFGs when a grammar generates the same string via multiple distinct leftmost derivations or parse trees, leading to non-unique syntactic structures.[23] A grammar is inherently ambiguous if every possible CFG for its language is ambiguous, as opposed to accidental ambiguity, which can be resolved by rewriting the grammar. A well-known example is the dangling else ambiguity in grammars for conditional statements, such as:

stmt → if expr then stmt | if expr then stmt else stmt | other

For the string "if expr1 then if expr2 then stmt1 else stmt2", two parse trees are possible: one where the else attaches to the inner if (associating stmt2 with the second conditional), and another where it attaches to the outer if (making stmt2 the else-branch of the outer conditional). This ambiguity affects compiler design and is typically resolved by convention, attaching the else to the nearest unmatched if.[23]

The pumping lemma for CFLs offers a tool to prove that certain languages are not context-free by stating a property that every sufficiently long string of a CFL must satisfy. Formally, for every CFL L, there exists a constant p (the pumping length) such that for any w \in L with |w| \geq p, w can be partitioned as w = uvxyz satisfying |vxy| \leq p, |vy| \geq 1, and uv^k x y^k z \in L for all integers k \geq 0. This lemma, derived from the structure of parse trees in CNF (where long strings must have repeatable substructures along a path), ensures that CFLs exhibit "pumping" behavior in sufficiently long strings, useful for showing non-context-freeness (e.g., \{ a^n b^n c^n \mid n \geq 1 \}) by contradiction.[24]
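The binarization and terminal-replacement steps of the CNF conversion are mechanical and easy to sketch. The Python fragment below is an illustrative sketch of just those two steps, with invented helper names; it assumes \epsilon-productions and unit productions have already been removed:

```python
def to_cnf_binarize(productions):
    """Binarize rules and lift terminals out of long right-hand sides.

    `productions` is a list of (lhs, rhs) pairs where rhs is a list of symbols;
    uppercase strings are nonterminals, lowercase strings are terminals.
    """
    result, fresh = [], 0

    def new_nonterminal():
        nonlocal fresh
        fresh += 1
        return f"X{fresh}"

    for lhs, rhs in productions:
        rhs = list(rhs)
        if len(rhs) >= 2:
            # Replace terminals by fresh nonterminals (A -> aB becomes A -> X B, X -> a).
            for i, sym in enumerate(rhs):
                if sym.islower():
                    t = new_nonterminal()
                    result.append((t, [sym]))
                    rhs[i] = t
        # Binarize: A -> X Y Z becomes A -> X Y1, Y1 -> Y Z.
        while len(rhs) > 2:
            t = new_nonterminal()
            result.append((lhs, [rhs[0], t]))
            lhs, rhs = t, rhs[1:]
        result.append((lhs, rhs))
    return result

# Example: S -> aSb | ab (the language { a^n b^n | n >= 1 }) rewritten toward CNF.
for rule in to_cnf_binarize([("S", ["a", "S", "b"]), ("S", ["a", "b"])]):
    print(rule)
```

A production-quality converter would additionally reuse the fresh nonterminal introduced for each terminal rather than minting a new one per occurrence; the sketch favors brevity over minimality.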
Type-1 grammars (context-sensitive)
Type-1 grammars, also known as context-sensitive grammars (CSGs), are defined as a quadruple G = (V, \Sigma, P, S), where V is a finite set of nonterminal symbols, \Sigma is a finite set of terminal symbols, S \in V is the start symbol, and P is a finite set of production rules of the form \alpha A \beta \to \alpha \gamma \beta, with A \in V, \alpha, \beta \in (V \cup \Sigma)^*, and \gamma \in (V \cup \Sigma)^+ (non-empty), ensuring |\alpha A \beta| \leq |\alpha \gamma \beta|.[25] An exception allows the rule S \to \epsilon if S does not appear in the right-hand side of any other production.[26] The key non-contracting property (|\alpha| \leq |\beta| for each rule \alpha \to \beta) guarantees that sentential forms in a derivation never shorten, limiting the length of derivations to at most exponential in the input size and enabling the membership problem to be decided by exhaustive search over bounded configurations.[27]

A canonical example of a context-sensitive language not generable by context-free grammars is L = \{ a^n b^n c^n \mid n \geq 1 \}, which requires enforcing three-way equality beyond the capabilities of context-free pumping.[26] One such CSG for L has nonterminals \{S, B, C\}, terminals \{a, b, c\}, and productions:

\begin{align*} &S \to aSBC \mid aBC \\ &aB \to ab \\ &bB \to bb \\ &bC \to bc \\ &cC \to cc \\ &CB \to BC \end{align*}

A derivation sketch for a^2 b^2 c^2 begins with S \Rightarrow aSBC \Rightarrow aaBCBC (applying S \to aBC), then aaBCBC \Rightarrow aabCBC (via aB \to ab), aabCBC \Rightarrow aabBCC (via CB \to BC), aabBCC \Rightarrow aabbCC (via bB \to bb), aabbCC \Rightarrow aabbcC (via bC \to bc), and aabbcC \Rightarrow aabbcc (via cC \to cc).[26] The languages generated by Type-1 grammars, known as context-sensitive languages (CSLs), are exactly those accepted by nondeterministic linear bounded automata (LBAs), which are Turing machines restricted to a tape of length linear in the input size.[27]
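The derivation sketch above can be reproduced by a naive string-rewriting search. The Python sketch below is illustrative only (function and variable names are invented here); it performs a breadth-first search over sentential forms and is exponential in general, in line with the complexity remarks above, but the non-contracting property keeps the search finite:

```python
from collections import deque

RULES = [("S", "aSBC"), ("S", "aBC"), ("aB", "ab"), ("bB", "bb"),
         ("bC", "bc"), ("cC", "cc"), ("CB", "BC")]

def derives(target: str) -> bool:
    """Breadth-first search over sentential forms. Because every rule is
    non-contracting, any form longer than the target can be pruned."""
    seen, queue = {"S"}, deque(["S"])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in RULES:
            start = 0
            while (i := form.find(lhs, start)) != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = i + 1
    return False

print(derives("aabbcc"))   # True: matches the derivation sketched above
print(derives("aabbc"))    # False: unequal counts of a, b, and c
```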
Type-0 grammars (unrestricted)
Type-0 grammars, also known as unrestricted grammars, represent the most general class in the Chomsky hierarchy, imposing no constraints on the form of production rules beyond requiring that the left-hand side contains at least one nonterminal symbol.[28] Formally, a Type-0 grammar is a quadruple G = (V, \Sigma, P, S), where V is a finite set of nonterminal symbols, \Sigma is a finite set of terminal symbols, S \in V is the start symbol, and P is a finite set of production rules of the form \alpha \to \beta, with \alpha \in (V \cup \Sigma)^+ containing at least one nonterminal from V, and \beta \in (V \cup \Sigma)^*.[28] These rules allow arbitrary replacements in derivations, enabling the simulation of complex computational processes without length restrictions or contextual dependencies.[29]

The languages generated by Type-0 grammars are precisely the recursively enumerable (RE) languages, which consist of all sets of strings that can be enumerated by a Turing machine.[29] This equivalence arises because any Type-0 grammar can be converted to a Turing machine that simulates its derivations using a two-tape nondeterministic machine, where one tape holds the current sentential form and the other tracks the production application.[29] Conversely, for any Turing machine, an unrestricted grammar can be constructed to generate the same language by encoding the machine's configurations and transitions as backward derivations from accepting states to the initial configuration.[29] This full computational power positions Type-0 grammars as capable of describing any computable process, including those beyond the capabilities of more restricted grammar types.

Key decision problems for Type-0 grammars exhibit fundamental undecidability, reflecting the inherent limits of computation for RE languages. The membership problem—determining whether a given string w belongs to L(G) for an unrestricted grammar G—is undecidable, as it reduces to the halting problem for Turing machines.[30] Similarly, the emptiness problem (whether L(G) = \emptyset) and the equivalence problem (whether L(G_1) = L(G_2) for two grammars G_1 and G_2) are undecidable.[30] These results follow from Rice's theorem, which states that any non-trivial semantic property of the RE languages is undecidable, encompassing properties like emptiness and equivalence that depend on the language generated rather than the grammar's syntax.[30] Context-sensitive grammars form a proper subclass of this class; the languages they generate are all recursive (decidable), themselves a proper subset of the recursively enumerable languages.[28]

Alternative grammar formalisms
Analytic grammars
Analytic grammars are formal grammars engineered for efficient syntactic recognition, where the validity of a string in the language can be determined through a deterministic parsing procedure that reduces the input to the start symbol without ambiguity or backtracking.[31] Unlike generative grammars that specify how to produce strings, analytic grammars prioritize the analysis phase by incorporating structural constraints such as precedence rules or limited context dependencies to enable straightforward decision-making during parsing. They are often based on the form of context-free grammars but augmented with parsing-oriented mechanisms, such as lookahead or bounded contexts, to enable efficient deterministic parsing, which can sometimes support recognition of languages requiring limited context sensitivity.[32]

A prominent type of analytic grammar is the operator-precedence grammar, introduced by Robert Floyd in 1963, which defines parsing actions based on precedence relations between pairs of terminal symbols, typically operators in expression-like structures.[33] In such grammars, every production is of the form A \to \alpha_1 X_1 \alpha_2 X_2 \cdots \alpha_m X_m \alpha_{m+1}, where each X_i is a terminal (operator) and each \alpha_j is either a single nonterminal or empty, ensuring that no two nonterminals are adjacent in any right-hand side or sentential form.[34] The precedence relations—denoted as < (yields precedence, triggering a shift), = (equal precedence, continuing the handle), and > (takes precedence, triggering a reduce)—are specified in a table that dictates shift-reduce decisions deterministically.[35]

Another key type is the bounded-right-context grammar, which restricts the right context influencing a production's application to a fixed bounded length, allowing parsing decisions based on only a limited lookahead to avoid nondeterminism.[36] Formally, in an (m, n)-bounded right-context grammar, reductions are uniquely determined by examining at most m symbols to the left and n to the right of the potential handle, ensuring the parser remains well-defined and efficient.[37] This formalism supports languages that require some context sensitivity but confines it to enable linear-time recognition algorithms.[38]

To illustrate an operator-precedence grammar, consider a simple grammar for arithmetic expressions involving identifiers (id), addition (+), multiplication (*), parentheses, and the end marker ($). The productions are:
- E \to E + T
- E \to T
- T \to T * F
- T \to F
- F \to (E)
- F \to id
The corresponding operator-precedence table gives, for each ordered pair of terminals, the relation between the terminal nearest the top of the parse stack (row) and the next input terminal (column); blank entries indicate syntax errors:

|    | id | + | * | ( | ) | $ |
|----|----|---|---|---|---|---|
| id |    | > | > |   | > | > |
| +  | <  | > | < | < | > | > |
| *  | <  | > | > | < | > | > |
| (  | <  | < | < | < | = |   |
| )  |    | > | > |   | > | > |
| $  | <  | < | < | < |   |   |

During parsing, a < or = relation causes the next terminal to be shifted onto the stack, while a > relation causes the handle delimited by the most recent < to be reduced by the corresponding production.
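A shift-reduce recognizer driven by this table can be sketched compactly. The Python code below is an illustrative skeleton (the PREC dictionary encodes the table above; function names are invented for this example). For brevity it only checks the precedence relations and omits matching each popped handle against a production right-hand side, which a full operator-precedence parser would also do:

```python
PREC = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", ")"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "("): "<",
    ("+", ")"): ">", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "("): "<",
    ("*", ")"): ">", ("*", "$"): ">",
    ("(", "id"): "<", ("(", "+"): "<", ("(", "*"): "<", ("(", "("): "<",
    ("(", ")"): "=",
    (")", "+"): ">", (")", "*"): ">", (")", ")"): ">", (")", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<", ("$", "("): "<",
}

def recognize(tokens):
    """Operator-precedence skeleton: shift on < or =, reduce on >."""
    stack = ["$"]                       # only terminals are kept on the stack
    tokens = tokens + ["$"]
    i = 0
    while True:
        top, lookahead = stack[-1], tokens[i]
        if top == "$" and lookahead == "$":
            return True                 # input fully reduced
        rel = PREC.get((top, lookahead))
        if rel in ("<", "="):
            stack.append(lookahead)     # shift
            i += 1
        elif rel == ">":
            # reduce: pop the handle, stopping when the new top yields (<) to the popped symbol
            while True:
                popped = stack.pop()
                if PREC.get((stack[-1], popped)) == "<":
                    break
        else:
            return False                # blank entry: syntax error

print(recognize(["(", "id", "+", "id", ")", "*", "id"]))   # True
print(recognize(["id", "id"]))                             # False: no relation between adjacent ids
print(recognize(["(", "id", "+", "id"]))                   # False: unbalanced parenthesis
```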
Attribute grammars
Attribute grammars extend context-free grammars by associating sets of attributes with each grammar symbol and specifying rules for computing the values of these attributes during parsing or evaluation of a derivation tree. Introduced by Donald Knuth in 1968, this formalism allows the specification of semantic information, such as type checking or code generation, in a modular way tied to the syntactic structure.[39] Attributes are divided into two types: synthesized attributes, which are computed from the attributes of a symbol's descendants in the derivation tree, and inherited attributes, which are computed from the attributes of a symbol's ancestors or siblings. For a production X_0 \to X_1 X_2 \dots X_n, synthesized attributes of X_0 depend on synthesized attributes of the X_i and inherited attributes of the X_i, while inherited attributes of the X_i (for i \geq 2) may depend on synthesized attributes of preceding symbols and inherited attributes from X_0. These rules ensure that attribute computation can capture context-sensitive aspects of language semantics without altering the underlying context-free syntax.[39][40]

Attribute evaluation proceeds by constructing a dependency graph for the derivation tree, where nodes represent attribute instances and directed edges indicate dependencies between them (e.g., an edge from attribute a to b if b's value is used to compute a). If the graph is acyclic, attributes can be evaluated in a topological order, with synthesized attributes flowing bottom-up from leaves to root and inherited attributes flowing top-down or laterally among siblings. This graph-based approach detects circular dependencies, ensuring well-defined semantics; for instance, Knuth's original algorithm includes a check for cycles to validate the grammar's semantic specifications.[39][40]

A representative example is type checking for arithmetic expressions using the following context-free grammar for expressions E:

E → n (n is a number)
E → E + E
E → E * E
To incorporate type checking, an attribute typ (synthesized, representing the expression's type, e.g., int or double) is added, along with rules:
- For E \to n: E.typ = the type of n (e.g., int for integer literals, double for floating-point literals)
- For E \to E_1 + E_2: E.typ = double if E_1.typ = double or E_2.typ = double, else int if both are int
- For E \to E_1 * E_2: similar to addition, promoting to double if needed
For the expression 3 * 4.5, the synthesized typ for the first E (3, an integer literal) is int, for the second (4.5, a floating-point literal) is double, and for the root is double, with the dependency graph confirming acyclic flow from subexpressions upward. This extends the grammar to handle semantic rules like promotion without full context-sensitive rewriting.[40][41]
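The synthesized typ attribute can be computed by a single bottom-up pass over the parse tree. The Python sketch below is an illustrative evaluator with invented class and function names (not Knuth's original algorithm); it implements the three attribution rules above for the 3 * 4.5 example:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Expr:
    """A node of the parse tree for E -> n | E + E | E * E."""
    op: Optional[str] = None                  # '+', '*', or None for a literal
    left: Optional["Expr"] = None
    right: Optional["Expr"] = None
    value: Union[int, float, None] = None     # literal value when op is None

def typ(e: Expr) -> str:
    """Synthesized attribute: computed only from the attributes of the children."""
    if e.op is None:                                       # E -> n
        return "double" if isinstance(e.value, float) else "int"
    lt, rt = typ(e.left), typ(e.right)                     # E -> E + E  or  E -> E * E
    return "double" if "double" in (lt, rt) else "int"     # promote if either side is double

# Parse tree for 3 * 4.5: the root's typ is double because the right literal is double.
tree = Expr(op="*", left=Expr(value=3), right=Expr(value=4.5))
print(typ(tree))   # double
```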
Knuth's two-pass algorithm facilitates efficient evaluation for suitable attribute grammars: the first pass performs syntactic parsing to build the derivation tree, while the second pass computes attributes by traversing the tree—bottom-up for synthesized attributes and top-down for inherited ones—leveraging the dependency graph to order computations. The total number of attribute instances across the tree is bounded by O(n) for an input string of length n, as each symbol in the tree contributes a fixed number of attributes per production, enabling linear-time evaluation when dependencies permit. This method, applicable to L-attributed grammars where inherited attributes depend only on left siblings, underpins many compiler semantic analyzers.[39][40]