Regular grammar
In theoretical computer science and formal language theory, a regular grammar is a type of formal grammar that generates regular languages, the simplest class in the Chomsky hierarchy of formal grammars.[1] These are precisely the languages recognizable by finite automata; regular grammars are characterized by production rules that are either right-linear or left-linear, ensuring they generate only patterns describable by finite state machines.[1]
Formally, a regular grammar G is defined as a four-tuple G = (N, \Sigma, P, S), where N is a finite nonempty set of nonterminal symbols, \Sigma is a finite set of terminal symbols (the alphabet), P is a finite set of production rules of the form A \to aB, A \to a, or A \to \epsilon (right-linear) or A \to Ba, A \to a, or A \to \epsilon (left-linear), where A, B \in N, a \in \Sigma, and \epsilon is the empty string, and S \in N is the start symbol.[2][3] The language generated by G, denoted L(G), comprises all strings in \Sigma^* that can be derived from S using the rules in P.[2] This structure limits the grammar to linear dependencies, preventing the expression of more complex nested or recursive patterns found in higher levels of the Chomsky hierarchy, such as context-free languages.[1]
Regular grammars are equivalent to other formalisms for regular languages, including finite state automata and regular expressions, meaning any language describable by one can be precisely captured by the others through constructive conversions.[4] For instance, a deterministic finite automaton can be transformed into a regular grammar by creating nonterminals for each state and productions mirroring the transition function, with accepting states yielding terminal productions.[1] Conversely, regular grammars can be converted to nondeterministic finite automata, where nonterminals become states and rules define transitions.[2] This equivalence underscores their foundational role in computability theory, enabling efficient parsing and pattern matching in applications like lexical analysis in compilers.[4]
Fundamentals
Definition
A formal grammar, also known as a phrase structure grammar, is a mathematical system used to describe the syntax of formal languages. It consists of four components: a finite set V of nonterminal symbols (variables), a finite set \Sigma of terminal symbols (the alphabet of the language), a finite set P of production rules of the form A \to \alpha where A \in V and \alpha is a string over V \cup \Sigma, and a distinguished start symbol S \in V. These elements define how strings in the language can be generated by starting from the start symbol and repeatedly applying the production rules to replace nonterminals with other strings until only terminals remain.
A regular grammar is a restricted subclass of formal grammars, specifically classified as Type-3 grammars in the Chomsky hierarchy. Introduced by Noam Chomsky in his 1956 paper "Three Models for the Description of Language," regular grammars were proposed as a formal device capable of generating exactly the regular languages, which are the simplest class in the hierarchy.[5] In a regular grammar, every production rule must adhere to one of two linear forms: right-linear rules of the form A \to aB, A \to a, or A \to \epsilon (where A, B \in V, a \in \Sigma, and \epsilon denotes the empty string), or left-linear rules of the form A \to Ba, A \to a, or A \to \epsilon (where A, B \in V, a \in \Sigma).[6] This restriction ensures that nonterminals appear at most once in each right-hand side, immediately adjacent to a terminal or at the end.[6]
Regular grammars differ fundamentally from the broader class of context-free grammars (Type-2 in the Chomsky hierarchy), which permit productions A \to \alpha where \alpha can be any arbitrary string of terminals and nonterminals, allowing for nested or recursive structures beyond linear sequences.[7] By limiting rules to these linear patterns, regular grammars generate precisely the regular languages, which can be recognized by finite automata and lack the expressive power for more complex dependencies found in context-free languages.[8] This positions regular grammars as the most constrained yet computationally efficient type in the Chomsky hierarchy, foundational to automata theory.[5]
A regular grammar is formally defined as a four-tuple G = (V, \Sigma, R, S), where V is a finite set of nonterminal symbols, \Sigma is a finite set of terminal symbols, R is a finite set of production rules, and S \in V is the start symbol.[9] The sets V and \Sigma are disjoint.[10]
The production rules in R are constrained to ensure the grammar generates only regular languages. Regular grammars are divided into right-linear and left-linear forms, each restricting the right-hand side of a rule to at most one terminal and at most one nonterminal symbol.[10] In a right-linear grammar, every rule is of the form A \to aB, A \to a, or A \to \epsilon (the empty string), where A, B \in V and a \in \Sigma.[11] Symmetrically, in a left-linear grammar, every rule is of the form A \to Ba, A \to a, or A \to \epsilon.[12]
Empty productions A \to \epsilon are permitted in regular grammars to allow generation of the empty string as part of the language. In some definitions, such productions are restricted to the start symbol S only, ensuring the grammar can produce the empty language or include \epsilon without introducing unnecessary nullable nonterminals elsewhere.[13] This constraint maintains the grammar's regularity while simplifying equivalence proofs to finite automata.[2]
Types
Right-linear grammars
A right-linear grammar is a formal grammar in which every production rule is of the form A \to wB or A \to w, where A and B are nonterminal symbols, and w is a (possibly empty) string of terminal symbols.[14] In some definitions, w is restricted to a single terminal symbol for stricter regularity, ensuring a direct correspondence to finite automaton transitions, though the general form with arbitrary terminal strings still generates only regular languages.[15]
In a right-linear grammar, derivations generate strings by expanding nonterminals from left to right, with each step appending terminals to the current sentential form while moving the nonterminal to the right end.[16] This process mirrors the sequential path traversal in a deterministic finite automaton (DFA), where each production A \to aB (with a a terminal) corresponds to consuming a symbol and transitioning states, and A \to \epsilon or A \to a marks acceptance.[17] For example, the grammar with productions S \to aS \mid b generates the language consisting of zero or more 'a's followed by a 'b' over alphabet \{a, b\}, with derivations like S \Rightarrow aS \Rightarrow aaS \Rightarrow aab building the output incrementally from the left.
Any regular grammar can be converted to an equivalent right-linear grammar without altering the generated language, typically by transforming left-linear rules into right-linear ones via intermediate finite automaton constructions.[18] This normalization preserves the expressive power, as right-linear grammars generate exactly the regular languages.[16]
Right-linear grammars offer an advantage in conversion to nondeterministic finite automata (NFAs), as their right-recursive structure allows a straightforward mapping: each nonterminal becomes a state, productions A \to aB yield transitions on a from A to B, and terminating rules A \to w define paths to accepting states.[19] This direct correspondence simplifies automaton construction compared to more general forms.
Left-linear grammars
A left-linear grammar is a formal grammar in which every production rule has the form A \to B w or A \to w, where A and B are nonterminal symbols from the grammar's variable set, and w is a (possibly empty) string consisting solely of terminal symbols from the grammar's terminal alphabet.[20] This structure ensures that at most one nonterminal appears on the right-hand side of any production, and when present, it is positioned at the left end.[21] Such grammars belong to the type-3 category in the Chomsky hierarchy, restricting the generative power to regular languages.[22]
The generation process in a left-linear grammar involves derivations where the nonterminal is expanded to the right of existing terminal strings, effectively constructing the output string from right to left during the derivation sequence.[23] This right-to-left building order mirrors the behavior of a nondeterministic finite automaton processing input in reverse, allowing the grammar to simulate acceptance by reading symbols from the end of the string toward the beginning.[23] For example, starting from the start symbol and applying rules iteratively prepends terminal strings to the growing derivation, until only terminals remain.[20]
Left-linear grammars are equivalent to right-linear grammars in expressive power, both generating precisely the class of regular languages.[22] To demonstrate this, the productions of a left-linear grammar G can be reversed—replacing each rule A \to Bw with A \to w^R B, where w^R is the reverse of w—to yield a right-linear grammar G_{\text{rev}} such that L(G) = \{ w^R \mid w \in L(G_{\text{rev}}) \}.[23] The converse construction similarly interconverts right-linear to left-linear forms, preserving regularity.[23]
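A minimal Python sketch of this rule reversal, under an assumed representation in which each rule is a triple (head, nonterminal_or_None, terminal_string)—so the left-linear rule A \to Bw is (A, B, w) and A \to w is (A, None, w); all names here are illustrative:

# Each left-linear rule (head, tail, w) becomes the right-linear
# rule head -> reverse(w) tail, returned as (head, terminals, tail).
def reverse_left_to_right_linear(rules):
    """Reverse a left-linear grammar into a right-linear grammar
    that generates the reversal of the original language."""
    return [(head, w[::-1], tail) for head, tail, w in rules]

# Left-linear grammar for (ab)*:  S -> S ab | epsilon
left_rules = [("S", "S", "ab"), ("S", None, "")]
print(reverse_left_to_right_linear(left_rules))
# [('S', 'ba', 'S'), ('S', '', None)]  i.e.  S -> ba S | epsilon, generating (ba)*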
These grammars find application in scenarios requiring the description of reversed regular languages, such as when modeling dependencies that naturally align with backward scanning in automata, or in specifying nondeterministic branching at the onset of string formation.[23] For instance, they facilitate the construction of NFAs for reversed inputs by directly adapting the grammar's structure to reversed transitions.[23]
Strictly regular grammars
Strictly regular grammars represent a constrained variant of regular grammars, specifically designed for pedagogical and constructive simplicity in formal language theory. In a right-strictly regular grammar, all production rules take the form A \to a or A \to aB or A \to \epsilon, where A and B are nonterminal symbols, and a is a single terminal symbol from the alphabet \Sigma. Similarly, a left-strictly regular grammar uses rules of the form A \to a or A \to Ba or A \to \epsilon. These forms ensure that each rule appends or prepends exactly one terminal at a time, avoiding concatenations of multiple terminals in a single production.[3]
This restriction maintains equivalence in expressive power to more general right-linear and left-linear grammars, as both classes generate precisely the regular languages. However, strictly regular grammars facilitate a direct and straightforward conversion to nondeterministic finite automata (NFAs): each nonterminal corresponds to a state, each rule A \to aB defines a transition from state A to state B on input a, and terminating rules A \to a or A \to \epsilon indicate accepting states or transitions. This one-to-one mapping simplifies proofs and implementations compared to handling arbitrary terminal strings in general linear rules.[3]
A key limitation of strictly regular grammars is their prohibition on rules with multiple consecutive terminals, such as A \to ab, which are permitted in extended regular grammars. Nonetheless, any extended regular grammar can be systematically converted to an equivalent strictly regular grammar by introducing auxiliary nonterminals; for instance, the rule A \to abC is replaced by new rules A \to aD and D \to bC, where D is a fresh nonterminal. This transformation preserves the generated language while adhering to the single-terminal constraint, though it may increase the number of nonterminals.[24]
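The transformation is mechanical; the following Python sketch illustrates it, with the rule representation (head, terminal_string, nonterminal_or_None) and the fresh-name scheme being assumptions for illustration:

import itertools

def to_strictly_regular(rules):
    """Split multi-terminal bodies such as A -> abC into chains of
    single-terminal rules (A -> aD, D -> bC) using fresh nonterminals."""
    fresh = (f"N{i}" for i in itertools.count())
    out = []
    for head, w, tail in rules:
        if len(w) <= 1:
            out.append((head, w, tail))   # already strictly regular
            continue
        current = head
        for symbol in w[:-1]:             # one fresh nonterminal per extra terminal
            nxt = next(fresh)
            out.append((current, symbol, nxt))
            current = nxt
        out.append((current, w[-1], tail))
    return out

print(to_strictly_regular([("A", "ab", "C")]))
# [('A', 'a', 'N0'), ('N0', 'b', 'C')]   i.e.  A -> a N0, N0 -> b C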
Equivalences
Relation to finite automata
Regular grammars generate exactly the regular languages, which are precisely those accepted by finite automata: a language is generated by some regular grammar if and only if it is accepted by some finite automaton. This equivalence is constructive, allowing direct algorithmic translations between the two formalisms, as established in foundational works on automata theory.
The conversion from a regular grammar to a nondeterministic finite automaton (NFA) maps the grammar's nonterminals to the automaton's states and its production rules to transitions. For a right-linear regular grammar G = (V_N, V_T, P, S), where V_N is the set of nonterminals, V_T the terminals, P the productions, and S the start symbol, construct an NFA M = (Q, \Sigma, \delta, q_0, F) as follows: set Q = V_N \cup \{f\} with a new accepting state f \notin V_N, \Sigma = V_T, q_0 = S, and F = \{f\}. For each production A \to aB (with A, B \in V_N, a \in V_T), add B to \delta(A, a); for each A \to a, add f to \delta(A, a); for each A \to \epsilon, add f to \delta(A, \epsilon). A symmetric construction handles left-linear grammars: introduce a new start state q_0, make the start symbol S the sole accepting state, map each rule A \to Ba to a transition from B to A on a, each A \to a to a transition from q_0 to A on a, and each A \to \epsilon to an ε-transition from q_0 to A. In both cases L(G) = L(M), without requiring elimination of nondeterminism.
The inverse conversion from an NFA to a right-linear regular grammar uses the NFA's states as nonterminals. Given an NFA M = (Q, \Sigma, \delta, q_0, F), construct a grammar G = (V_N, V_T, P, S) with V_N = Q, V_T = \Sigma, and S = q_0. For each state q \in \delta(p, a) (with p, q \in Q, a \in \Sigma), add the production p \to a q; for each accepting state f \in F, add f \to \epsilon. A left-linear grammar is obtained symmetrically: add q \to p\,a for each transition q \in \delta(p, a), add the ε-production q_0 \to \epsilon, and take an accepting state as the start symbol (introducing a fresh start symbol if F contains several states). The ε-productions encode acceptance directly, so the mapping preserves language equivalence without empty-string transitions in the NFA, and no subset construction to a DFA is needed.
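A minimal Python sketch of the right-linear direction, under an assumed NFA representation with delta as a dictionary from (state, symbol) pairs to sets of successor states (all names illustrative):

def nfa_to_right_linear(delta, finals):
    """Turn NFA transitions into right-linear productions.
    A triple (p, a, q) encodes p -> a q, and (f, "", None)
    encodes the epsilon-production f -> epsilon."""
    productions = []
    for (p, a), successors in delta.items():
        for q in successors:
            productions.append((p, a, q))
    for f in finals:
        productions.append((f, "", None))
    return productions

# NFA over {0,1} accepting strings that end in 1 (start state "S")
delta = {("S", "0"): {"S"}, ("S", "1"): {"S", "F"}}
for production in nfa_to_right_linear(delta, {"F"}):
    print(production)  # S -> 0S, S -> 1S, S -> 1F, F -> epsilon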
The following pseudocode outlines the grammar-to-NFA conversion for a right-linear grammar:
Algorithm GrammarToNFA(G = (V_N, V_T, P, S)):
    Q = V_N ∪ {f}                     // f is a new final state
    Σ = V_T
    q0 = S
    F = {f}
    δ(q, x) = ∅ for all q ∈ Q, x ∈ Σ ∪ {ε}
    for each production rule in P:
        if rule is A → a B (A, B ∈ V_N, a ∈ V_T):
            δ(A, a) = δ(A, a) ∪ {B}
        else if rule is A → a (A ∈ V_N, a ∈ V_T):
            δ(A, a) = δ(A, a) ∪ {f}
        else if rule is A → ε (A ∈ V_N):
            δ(A, ε) = δ(A, ε) ∪ {f}
    return NFA M = (Q, Σ, δ, q0, F)
This algorithm runs in time linear in the number of productions and directly yields an NFA recognizing the grammar's language.
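As a concrete rendering of this pseudocode, the following Python sketch (the representation and names are illustrative, not standard) builds the transition table as a dictionary of sets and tests membership against the 0^+1^+ grammar used in the examples below:

def grammar_to_nfa(productions, start):
    """Right-linear grammar -> NFA, following the pseudocode above.
    A production is (A, a, B) for A -> aB, (A, a, None) for A -> a,
    and (A, "", None) for A -> epsilon."""
    final = "f_accept"                       # fresh accepting state (assumed unused)
    delta = {}
    for head, a, tail in productions:
        target = tail if tail is not None else final
        delta.setdefault((head, a), set()).add(target)   # union, never overwrite
    return delta, start, final

def nfa_accepts(nfa, word):
    delta, start, final = nfa
    current = {start}
    for symbol in word:
        current = set().union(*(delta.get((q, symbol), set()) for q in current))
    for q in list(current):                  # epsilon-moves only ever reach the final state
        current |= delta.get((q, ""), set())
    return final in current

# S -> 0A, A -> 0A | 1B, B -> 1B | 1   (the 0+1+ grammar from the examples below)
g = [("S", "0", "A"), ("A", "0", "A"), ("A", "1", "B"),
     ("B", "1", "B"), ("B", "1", None)]
print(nfa_accepts(grammar_to_nfa(g, "S"), "0011"))   # True
print(nfa_accepts(grammar_to_nfa(g, "S"), "0110"))   # False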
Relation to regular expressions
Regular grammars generate precisely the class of regular languages, which are exactly those described by regular expressions, as formalized by Kleene's theorem establishing the equivalence between regular grammars, finite automata, and regular expressions.[25]
A standard approach to convert a regular grammar to a regular expression proceeds in two steps: first, construct an NFA from the grammar by treating nonterminals as states and productions as transitions (with terminal productions leading to accepting states), then derive the regular expression from the NFA via state elimination—progressively removing states and updating transitions with concatenated expressions—or by applying Arden's lemma to the system of equations modeling paths between states.[26]
Direct conversion methods avoid explicit automaton construction by formulating the grammar as a set of language equations, one per nonterminal, where each equation captures the strings derivable from that nonterminal based on the productions; these equations are then solved iteratively using Arden's rule to yield the regular expression for the start symbol.[26]
Arden's rule provides a key tool for solving such equations: if P and Q are regular expressions with \epsilon \notin L(Q), then the equation X = P \cup X Q has the unique solution X = P Q^*; symmetrically, X = Q X \cup P has the unique solution X = Q^* P. For instance, given X = aX \cup b, the solution is X = a^* b.[27]
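As a worked illustration (the grammar is chosen here for exposition), consider the right-linear grammar with productions S \to aS \mid bA and A \to bA \mid \epsilon. Writing one language equation per nonterminal and solving with Arden's rule gives:

A = bA \cup \epsilon \;\Rightarrow\; A = b^*\epsilon = b^*
S = aS \cup bA = aS \cup b\,b^* \;\Rightarrow\; S = a^*\,b\,b^*

so L(G) = a^* b^+, the strings of zero or more a's followed by one or more b's.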
In practice, many regular expression libraries, such as those in programming languages, internally compile expressions to finite automata—equivalent to regular grammars—for efficient pattern matching and parsing.[28]
Examples
Right-linear grammar
In a right-linear grammar, terminals accumulate to the left of the single nonterminal, so strings grow from left to right during derivation. Consider the language of all strings consisting of one or more 0s followed by one or more 1s, denoted L = 0^+1^+.[1]
The following right-linear grammar G = (N, \Sigma, P, S) generates L, where N = \{S, A, B\}, \Sigma = \{0, 1\}, and P consists of:
- S \to 0A
- A \to 0A \mid 1B
- B \to 1B \mid 1
For example, the string "0011" is derived as: S \Rightarrow 0A \Rightarrow 00A \Rightarrow 001B \Rightarrow 0011.
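Such derivations can be checked mechanically. The following Python sketch (illustrative; rules are stored per nonterminal as alternatives (terminal, next nonterminal or None)) searches breadth-first over sentential forms of this grammar:

from collections import deque

def derives(rules, start, target):
    """Breadth-first search over sentential forms of a right-linear
    grammar; returns True iff target is derivable from start."""
    queue = deque([("", start)])
    seen = set()
    while queue:
        prefix, nt = queue.popleft()
        for w, nxt in rules[nt]:
            new_prefix = prefix + w
            if not target.startswith(new_prefix):
                continue                      # prune: can never extend to target
            if nxt is None:
                if new_prefix == target:
                    return True
            elif (new_prefix, nxt) not in seen:
                seen.add((new_prefix, nxt))
                queue.append((new_prefix, nxt))
    return False

rules = {"S": [("0", "A")],
         "A": [("0", "A"), ("1", "B")],
         "B": [("1", "B"), ("1", None)]}
print(derives(rules, "S", "0011"))  # True
print(derives(rules, "S", "0110"))  # False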
Left-linear grammar
In a left-linear grammar, the nonterminal remains at the left end of the sentential form and terminals accumulate to its right, so strings are effectively built from right to left. Consider the language of all strings over \{a, b\} consisting of zero or more a's followed by zero or more b's, denoted L = a^*b^*.[29]
The following left-linear grammar G = (N, \Sigma, P, S) generates L, where N = \{S, T\}, \Sigma = \{a, b\}, and P consists of:
- S \to Sb \mid T
- T \to Ta \mid \epsilon
For example, the string "aabb" is derived as: S \Rightarrow Sb \Rightarrow Sbb \Rightarrow Tbb \Rightarrow Tabb \Rightarrow Taabb \Rightarrow aabb, with the terminals of the final string produced in right-to-left order.
Further example: Binary strings ending with 01
For the language of all binary strings ending with "01", a right-linear grammar can be used: G = (N, \Sigma, P, S), where N = \{S, A\}, \Sigma = \{0, 1\}, and P consists of:
- S \to 0S \mid 1S \mid 0A
- A \to 1
This generates strings like "01", "101", "1001".[30]
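Equivalently, this language is described by the regular expression (0|1)*01, which a standard regex engine confirms (an illustrative check):

import re

# The language of binary strings ending with "01", as a regular expression
pattern = re.compile(r"[01]*01")
print(bool(pattern.fullmatch("1001")))  # True
print(bool(pattern.fullmatch("10")))    # False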
Properties
Expressive power
Regular grammars generate exactly the class of regular languages: the languages recognized by finite automata, which require only a fixed amount of memory independent of input length.[12][31] These languages capture patterns with bounded repetition or alternation, describable by finite-state machines without unbounded counting or nesting.[3]
A key characterization of regular languages is provided by the pumping lemma, which establishes a necessary condition for regularity. For any regular language L, there exists a pumping length p > 0 such that every string w \in L with |w| \geq p can be divided as w = xyz where |xy| \leq p, |y| \geq 1, and xy^iz \in L for all integers i \geq 0.
\text{If } L \text{ is regular, then } \exists p > 0 \text{ such that } \forall w \in L \ (|w| \geq p) \ \exists x,y,z \ (w=xyz, |xy| \leq p, |y| \geq 1, \ \forall i \geq 0 \ (xy^iz \in L)).
This lemma implies that sufficiently long strings in regular languages contain a repeatable substring of bounded length without altering membership.[32]
The pumping lemma also proves limitations by showing certain languages are non-regular. For example, the language \{a^nb^n \mid n \geq 0\} is not regular. Assume it is, with pumping length k; take w = a^kb^k \in L, so |w| = 2k \geq k and w = xyz satisfying the conditions. Since |xy| \leq k, both x and y consist only of a's, and |y| \geq 1. Pumping with i=2 yields xy^2z = a^{k+|y|}b^k, which has more a's than b's and thus is not in L, a contradiction.[33]
All standard decision problems for regular languages are decidable. The membership problem—determining whether a string w belongs to L—is solvable in linear time by simulating a deterministic finite automaton on w.[34] The emptiness problem (whether L contains any string) is decided by checking reachability of an accepting state from the start state in the automaton's graph. The finiteness problem (whether L is finite) is decided by detecting cycles on paths from the start state to accepting states, or equivalently by testing membership of every string whose length lies between n and 2n-1, where n is the number of states.[34]
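For example, membership for the earlier language of binary strings ending with "01" is decided by simulating its three-state DFA in one left-to-right pass (a sketch; the state names are arbitrary):

def dfa_accepts(word):
    """Simulate a 3-state DFA for binary strings ending in "01".
    States: "q0" (no useful suffix), "q1" (suffix ends in 0),
    "q2" (suffix ends in 01, accepting). Runs in O(len(word))."""
    delta = {("q0", "0"): "q1", ("q0", "1"): "q0",
             ("q1", "0"): "q1", ("q1", "1"): "q2",
             ("q2", "0"): "q1", ("q2", "1"): "q0"}
    state = "q0"
    for symbol in word:
        state = delta[(state, symbol)]
    return state == "q2"

print(dfa_accepts("101"), dfa_accepts("10"))  # True False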
In the Chomsky hierarchy, regular grammars correspond to Type-3 grammars, which generate precisely the regular languages; every Type-3 grammar produces a regular language, and every regular language has an equivalent Type-3 grammar.[8] This class is properly contained in Type-2 (context-free) languages, as there exist context-free languages, such as \{a^nb^n \mid n \geq 0\}, that are not regular.[25]
Closure properties
Regular languages, the class generated by regular grammars, exhibit closure under several fundamental operations, meaning that applying these operations to regular languages yields another regular language. This property underscores the robustness of the regular language class within the Chomsky hierarchy. Proofs of closure typically rely on constructions involving finite automata equivalent to the grammars, such as nondeterministic finite automata (NFAs) for union, concatenation, and Kleene star, or deterministic finite automata (DFAs) for complement and intersection.[35][36]
The union of two regular languages L_1 and L_2 is regular. This follows from constructing an NFA that simulates the NFAs for L_1 and L_2 via a disjoint union, with \epsilon-transitions from a new start state to their start states and their accepting states feeding into a new accepting state; the equivalent regular grammar can then be derived from this NFA.[35][37]
Concatenation preserves regularity: if L_1 and L_2 are regular, then L_1 L_2 = \{xy \mid x \in L_1, y \in L_2\} is regular. An NFA for the concatenation connects the accepting states of the NFA for L_1 to the start state of the NFA for L_2 via \epsilon-transitions, yielding an NFA for the result, from which a regular grammar is obtainable.[35][37]
Regular languages are closed under the Kleene star operation: if L is regular, then L^* = \bigcup_{i=0}^\infty L^i (with L^0 = \{\epsilon\}) is regular. This is shown by modifying the NFA for L to include \epsilon-loops from its accepting states back to its start state, plus a direct path to accept \epsilon; the corresponding grammar follows from the NFA construction.[35][37]
Complement closure holds for regular languages: if L \subseteq \Sigma^* is regular, then its complement \overline{L} = \Sigma^* \setminus L is regular. Starting from a DFA for L, swap accepting and non-accepting states to obtain a DFA for \overline{L}, which can be converted to a regular grammar.[36][38]
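In code, complementation is a one-line change to a complete DFA's accepting set; a sketch over an assumed five-tuple representation:

def complement_dfa(states, alphabet, delta, start, finals):
    """Complement a *complete* DFA: keep states, alphabet, transitions,
    and start state, and swap accepting and non-accepting states."""
    return states, alphabet, delta, start, states - finals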
The intersection of two regular languages L_1 and L_2 is regular. This is established by the product automaton construction: given DFAs for L_1 and L_2, form a DFA whose states are pairs of states from the originals, accepting pairs where both are accepting; the grammar is derived accordingly. De Morgan's law also implies intersection via union and complement, both of which preserve regularity.[38][39]
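The product construction itself is short; the following Python sketch (same assumed DFA representation as above) runs two complete DFAs in lockstep and accepts when both do:

from itertools import product

def intersect_dfas(d1, d2):
    """Product construction: states are pairs, transitions move both
    components at once, and a pair accepts iff both components do."""
    states1, alphabet, delta1, start1, finals1 = d1
    states2, _,        delta2, start2, finals2 = d2
    states = set(product(states1, states2))
    delta = {((p, q), a): (delta1[(p, a)], delta2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in finals1 and q in finals2}
    return states, alphabet, delta, (start1, start2), finals

# "even number of 0s" and "ends in 1", both over {0, 1}
d_even0 = ({"E", "O"}, {"0", "1"},
           {("E", "0"): "O", ("E", "1"): "E",
            ("O", "0"): "E", ("O", "1"): "O"}, "E", {"E"})
d_end1 = ({"n", "y"}, {"0", "1"},
          {("n", "0"): "n", ("n", "1"): "y",
           ("y", "0"): "n", ("y", "1"): "y"}, "n", {"y"})
states, alphabet, delta, state, finals = intersect_dfas(d_even0, d_end1)
for symbol in "0011":
    state = delta[(state, symbol)]
print(state in finals)  # True: "0011" has two 0s and ends in 1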
Regular languages are closed under reversal: if L is regular, then L^R = \{w^R \mid w \in L\} (where w^R reverses w) is regular. For grammars, the reversal of a right-linear grammar is left-linear (and vice versa) by reversing production rules, preserving regularity since both forms generate regular languages; equivalently, reverse all transitions in the corresponding NFA and swap start and accept states.[40][35]
Finally, regular languages are closed under homomorphisms and inverse homomorphisms. A homomorphism h: \Sigma^* \to \Gamma^* maps each letter to a word over \Gamma; if L \subseteq \Sigma^* is regular, then h(L) = \{h(w) \mid w \in L\} is regular, by substituting the image words for transitions in an NFA. For the inverse, if L' \subseteq \Gamma^* is regular, then h^{-1}(L') = \{w \in \Sigma^* \mid h(w) \in L'\} is regular, by tracking in an NFA the states reachable on the images of letters.[41][35]
Mixing linear rules
In the theory of formal languages, mixed linear grammars for regular languages incorporate production rules that are both left-linear (of the form A \to B \alpha or A \to \alpha, where B is a nonterminal and \alpha is a string of terminals) and right-linear (of the form A \to \alpha B or A \to \alpha) within the same grammar.[29][42] While unrestricted mixing can produce linear grammars that generate non-regular context-free languages (such as \{ a^n b^n \mid n \geq 0 \}), structured mixtures that preserve regularity are possible, notably through the framework of strongly regular grammars. In these, rules within each set of mutually recursive nonterminals—defined by a dependency relation where nonterminals can derive each other through intermediate strings—are uniformly left-linear or right-linear, allowing controlled mixing across disjoint components.[43][44]
The expressive power of such mixed regular grammars remains confined to the regular languages, equivalent to those generated by pure left- or right-linear forms.[43][44] However, for uniformity in processing or equivalence proofs, normalization to a pure right-linear (or left-linear) form is often applied, ensuring all rules follow one orientation without altering the language.[45]
Normalization algorithms convert mixed regular grammars to right-linear equivalents via an intermediate nondeterministic finite automaton (NFA). Nonterminals become NFA states, with right-linear rules A \to \alpha B inducing transitions from state A to B on \alpha, and terminal rules A \to \alpha marking accepting states; for left-linear rules A \to B \alpha, the construction reverses the transitions (effectively processing the reversed language) before converting back. This introduces intermediate states corresponding to the NFA's structure, yielding a pure right-linear grammar.[46][45][47]
Mixing linear rules can increase a grammar's ambiguity, as a single string may admit multiple leftmost derivations, complicating parsing despite the language remaining regular and unchanged.[48][49] For example, consider the following mixed grammar generating the regular language L = \{ ab, c \}:
- S \to A b (left-linear)
- A \to a (terminal production)
- S \to c (terminal production)
Deriving "ab" uses the left-linear rule followed by the terminal, yielding L(G) = \{ ab, c \}, which is equivalent to the pure right-linear grammar S \to ab \mid c. In this case, ambiguity is minimal, but more intricate mixtures can produce multiple derivations for the same string.[29][48]