Regular language
In theoretical computer science and formal language theory, a regular language is a formal language consisting of a set of strings over a finite alphabet that can be recognized by a finite automaton, equivalently expressed using regular expressions, or generated by a regular grammar.[1][2] Regular languages were introduced by mathematician Stephen Cole Kleene in his 1951 technical report, published in 1956 as "Representation of Events in Nerve Nets and Finite Automata," where he introduced the algebraic structure of regular events as a model for neural networks and sequential machines.[3] The class of regular languages occupies the lowest level (Type-3) in the Chomsky hierarchy of formal grammars, which classifies languages based on the restrictions on production rules; regular languages are strictly contained within context-free languages and higher classes.[4] They possess several closure properties, meaning that if two or more regular languages are subjected to operations like union, concatenation, Kleene star (repetition), complement, intersection, reversal, or homomorphism, the resulting language remains regular.[5] A fundamental characterization is provided by the pumping lemma for regular languages, which states that for any regular language L, there exists a pumping length p such that every string s in L with length at least p can be divided as s = xyz where |xy| \leq p, |y| > 0, and xy^iz is in L for all nonnegative integers i; this lemma is often used to prove that certain languages are non-regular.[6] Regular languages and their equivalents, such as deterministic finite automata (DFAs), nondeterministic finite automata (NFAs), and regular expressions, are foundational in automata theory, enabling efficient algorithms for recognition and manipulation since finite automata have finite memory and can process inputs in linear time.[7] In practice, regular expressions power applications in text processing, lexical analysis in compilers (e.g., tokenizing 
source code), pattern matching in search tools, and built-in string-manipulation features of programming languages such as Java, Python, and Perl.[8]
Fundamentals
Formal Definition
In formal language theory, an alphabet \Sigma is a finite nonempty set of symbols, such as \Sigma = \{0, 1\}.[9] A string over \Sigma is a finite sequence of symbols from \Sigma, including the empty string \epsilon, which has length zero and contains no symbols.[9] The set of all finite strings over \Sigma, denoted \Sigma^*, is known as the Kleene star of \Sigma and forms the universal set for languages over \Sigma.[9] A regular language over an alphabet \Sigma is a subset L \subseteq \Sigma^* consisting of all strings generated by a regular expression over \Sigma, or equivalently, all strings accepted by a finite automaton over \Sigma.[10] This definition captures the simplest class of formal languages in the Chomsky hierarchy.[11] The formal syntax of regular expressions over \Sigma is defined inductively as follows:
- The empty set \emptyset is a regular expression denoting the empty language.
- Any symbol a \in \Sigma is a regular expression denoting the singleton set \{a\}.
- The empty string \epsilon is a regular expression denoting \{\epsilon\}.
- If R_1 and R_2 are regular expressions denoting languages L_1 and L_2, then:
- (R_1 + R_2) (or (R_1 \mid R_2)) is a regular expression denoting L_1 \cup L_2 (union).
- (R_1 \cdot R_2) is a regular expression denoting L_1 L_2 = \{xy \mid x \in L_1, y \in L_2\} (concatenation).
- R_1^* is a regular expression denoting the Kleene star L_1^* = \bigcup_{n=0}^\infty L_1^n, where L_1^0 = \{\epsilon\} and L_1^{n+1} = L_1^n L_1 for n \geq 0.
Parentheses are used to indicate precedence, with star having highest precedence, followed by concatenation, then union.[10]
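The inductive semantics above can be executed directly. The following Python sketch (combinator names empty, epsilon, sym, union, concat, and star are illustrative, chosen to mirror the definition, not a standard API) enumerates the strings of L(r) up to a length bound:

```python
# Each combinator returns a function mapping a length bound to the set of
# strings of the denoted language having at most that length.

def empty(max_len):                 # L(emptyset) = {}
    return set()

def epsilon(max_len):               # L(epsilon) = {""}
    return {""}

def sym(a):                         # L(a) = {a}
    return lambda max_len: {a} if max_len >= 1 else set()

def union(r, s):                    # L(r + s) = L(r) | L(s)
    return lambda max_len: r(max_len) | s(max_len)

def concat(r, s):                   # L(rs) = {xy | x in L(r), y in L(s)}
    def lang(max_len):
        return {x + y for x in r(max_len) for y in s(max_len)
                if len(x + y) <= max_len}
    return lang

def star(r):                        # L(r*) = union over k of L(r)^k
    def lang(max_len):
        result, frontier = {""}, {""}
        while frontier:             # extend until no new bounded strings appear
            frontier = {x + y for x in frontier for y in r(max_len)
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    return lang

# a*b: zero or more a's followed by a single b
a_star_b = concat(star(sym("a")), sym("b"))
print(sorted(a_star_b(4), key=len))  # ['b', 'ab', 'aab', 'aaab']
```

Because every regular language is determined by its finite prefixes of each length, this bounded enumeration is a faithful, if inefficient, rendering of the recursive definition of L(r).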
Examples
Regular languages can be illustrated through simple sets of strings that can be recognized by finite automata, which maintain a finite amount of memory to track patterns. A classic example is the language L_1 = \{ w \in \{0,1\}^* \mid w ends with 1\} , consisting of all binary strings terminating in the symbol 1, such as 01, 101, and 111. This language is regular because a deterministic finite automaton (DFA) with two states suffices: a start state q_0 (non-accepting, representing the last symbol was 0 or start) and an accepting state q_1 (last symbol was 1), with transitions q_0 \xrightarrow{0} q_0, q_0 \xrightarrow{1} q_1, q_1 \xrightarrow{0} q_0, and q_1 \xrightarrow{1} q_1.[12] Another basic regular language is L_2, the set of all even-length binary strings over \{0,1\}, including the empty string \epsilon (length 0), 00, 01, 10, 11 (length 2), and so on. Recognition requires tracking parity with a DFA featuring two states: q_e (even length so far, accepting) as the start state and q_o (odd length, non-accepting), with transitions swapping states on each input symbol regardless of 0 or 1.[13] Trivial cases include the empty language \emptyset, which contains no strings and is recognized by a DFA with no accepting states, and the full language \Sigma^* (for alphabet \Sigma = \{0,1\}), encompassing all possible binary strings, accepted by a single-state DFA where the sole state is accepting and loops on all inputs.[14] In contrast, the language L_3 = \{ 0^n 1^n \mid n \geq 0 \}, such as \epsilon, 01, 0011, and 000111, is not regular because any purported DFA would require unbounded states to match the equal number of 0s and 1s, violating the finite memory constraint; this is shown via the pumping lemma for regular languages.[15] These examples can be described using regular expressions, such as (0 + 1)^*1 for L_1 and (00 + 01 + 10 + 11)^* for even-length strings.[16] In practice, regular languages underpin pattern matching in text processing, like validating 
email addresses with regex patterns that check formats such as local-part@domain (e.g., user@example.com), ensuring finite-state recognition of structural rules without unbounded memory.[17]
Equivalent Formalisms
Regular Expressions
Regular expressions provide a symbolic notation for specifying regular languages, originally introduced by Stephen Kleene in his 1956 work on the representation of events in finite automata.[18] They consist of atomic expressions combined using operations that correspond to union, concatenation, and Kleene star on languages. The syntax of regular expressions over an alphabet \Sigma is defined recursively as follows: the atomic expressions are the empty set \emptyset, the empty string \varepsilon, and single symbols a \in \Sigma; if r and s are regular expressions, then so are (r + s) (union), (r s) (concatenation), and r^* (Kleene star).[19] Parentheses are used for grouping, and precedence rules apply: Kleene star has highest precedence, followed by concatenation, then union, allowing omission of some parentheses (e.g., a b^* + c means (a (b^*)) + c).[20] The semantics of a regular expression r, denoted L(r), map it to the language it represents, defined recursively: L(\emptyset) = \{\}, L(\varepsilon) = \{\varepsilon\}, and L(a) = \{a\} for a \in \Sigma; L(r s) = L(r) L(s) (concatenation of languages); L(r + s) = L(r) \cup L(s); and L(r^*) = \bigcup_{k=0}^\infty L(r)^k, where L(r)^0 = \{\varepsilon\} and L(r)^{k+1} = L(r)^k L(r).[19] For example, the expression a^* b denotes the language of all strings consisting of zero or more a's followed by a single b. Every regular expression can be converted to an equivalent nondeterministic finite automaton (NFA) using Thompson's construction, an algorithm that builds the NFA recursively from subexpressions. The process starts with basic NFAs for the atoms and combines them for each operation, ensuring each partial NFA has a unique start state i and accept state f:
- For \emptyset: States i and f with no transitions between them.
- For \varepsilon: States i and f connected by an \varepsilon-transition.
- For literal a \in \Sigma: States i and f connected by a transition labeled a.
- For union (r + s): New states i and f, with \varepsilon-transitions from i to the start states of the NFAs for r and s, and from their accept states to f.
- For concatenation (r s): The NFA for r followed by the NFA for s, with an \varepsilon-transition from the accept state of r to the start state of s; i is the start state of r and f the accept state of s.
- For Kleene star r^*: New states i and f, with \varepsilon-transitions from i to f (zero repetitions), from i to the start state of the NFA for r, from the accept state of r to f, and from the accept state of r back to its start state (further repetitions).
Each rule adds a constant number of states and transitions, so the resulting NFA is linear in the size of the expression. Practical regular expression implementations extend this theoretical core with convenience operators such as + (one or more) and ? (zero or one), bounded repetition {m,n}, and character classes such as [abc] or [:digit:].[21] These go beyond theoretical regular expressions, as some extensions can describe non-regular languages (e.g., backreferences in some implementations), and they rely on implementation-specific behaviors like greedy matching.[21]
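Thompson's construction can be sketched compactly; the fragment representation (start, accept, transition dict) and the helper names below are choices of this sketch, not a standard API:

```python
# NFA fragments are (start, accept, delta), where delta maps (state, label)
# to a set of states and the label None stands for an epsilon-transition.

_count = 0
def _new():
    global _count
    _count += 1
    return _count

def _merge(*dicts):
    out = {}
    for d in dicts:
        for key, targets in d.items():
            out.setdefault(key, set()).update(targets)
    return out

def lit(a):                    # literal symbol: i --a--> f
    i, f = _new(), _new()
    return (i, f, {(i, a): {f}})

def cat(x, y):                 # concatenation: accept of x --eps--> start of y
    return (x[0], y[1], _merge(x[2], y[2], {(x[1], None): {y[0]}}))

def alt(x, y):                 # union: new i/f with eps-transitions around x and y
    i, f = _new(), _new()
    return (i, f, _merge(x[2], y[2], {(i, None): {x[0], y[0]},
                                      (x[1], None): {f}, (y[1], None): {f}}))

def star(x):                   # Kleene star: eps-transitions to skip and to loop
    i, f = _new(), _new()
    return (i, f, _merge(x[2], {(i, None): {x[0], f},
                                (x[1], None): {x[0], f}}))

def _closure(states, delta):   # epsilon-closure by depth-first search
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def matches(nfa, w):           # simulate all computation paths in parallel
    start, accept, delta = nfa
    current = _closure({start}, delta)
    for c in w:
        moved = set().union(*[delta.get((s, c), set()) for s in current])
        current = _closure(moved, delta)
    return accept in current

nfa = cat(star(lit("a")), lit("b"))              # a*b
print(matches(nfa, "aaab"), matches(nfa, "ba"))  # True False
```

Each constructor adds at most two states and a handful of transitions, which is the source of the linear-size guarantee mentioned above.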
Finite Automata
Finite automata serve as fundamental computational models for recognizing regular languages, processing input strings sequentially through a finite number of states, with transitions determined by the alphabet symbols. These machines retain no memory beyond their current state, making them suitable for pattern matching in linear time relative to the input length. The two primary variants are deterministic finite automata (DFAs), which have a unique transition for each input symbol from any state, and nondeterministic finite automata (NFAs), which allow multiple possible transitions or even spontaneous moves without consuming input.[22] A DFA is formally defined as a 5-tuple (Q, \Sigma, \delta, q_0, F), where Q is a finite set of states, \Sigma is a finite input alphabet, \delta: Q \times \Sigma \to Q is the transition function that specifies a unique next state for each state-symbol pair, q_0 \in Q is the initial state, and F \subseteq Q is the set of accepting (or final) states.[22] To accept a string w \in \Sigma^*, the DFA begins in q_0 and applies \delta successively for each symbol of w; the string is accepted if the resulting state is in F, and rejected otherwise.[22] This deterministic behavior ensures exactly one computation path per input, facilitating efficient simulation.
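The acceptance procedure just described is a simple fold over the input. A minimal sketch, encoding the two-state "ends with 1" DFA from the Examples section as a transition table (the dictionary encoding is a choice of this sketch):

```python
# Transition table for the "ends with 1" DFA: states q0, q1; q1 accepting.
delta = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q1",
}

def dfa_accepts(delta, start, accepting, w):
    state = start
    for symbol in w:              # one table lookup per symbol: O(|w|) overall
        state = delta[(state, symbol)]
    return state in accepting     # accept iff the final state is in F

print(dfa_accepts(delta, "q0", {"q1"}, "0101"))  # True: ends with 1
print(dfa_accepts(delta, "q0", {"q1"}, "110"))   # False
```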
An NFA extends the DFA model to a 5-tuple (Q, \Sigma, \delta, q_0, F), where the transition function is \delta: Q \times (\Sigma \cup \{\epsilon\}) \to 2^Q, allowing a set of possible next states (including the empty set) for each state and input symbol or the empty string \epsilon.[22] Acceptance occurs if there exists at least one computation path from q_0 to a state in F that consumes the entire input string, accounting for \epsilon-transitions that enable moves without reading symbols.[22] Despite nondeterminism, NFAs recognize exactly the same class of languages as DFAs, as any NFA can be converted to an equivalent DFA via the powerset construction, which treats subsets of Q as new states and simulates all possible NFA paths in parallel.[22] In this construction, the DFA's states are elements of 2^Q, the initial state is \{q_0\}, the accepting states are those subsets intersecting F, and transitions are defined by applying \delta to all states in a subset and taking the union.[22] To optimize DFAs by reducing the number of states while preserving the recognized language, minimization algorithms partition states into equivalence classes based on distinguishability by future inputs. Hopcroft's algorithm achieves this in O(|Q| \log |Q|) time by iteratively refining partitions, always splitting against the smaller half of each split block, improving upon earlier O(|Q|^2) methods.[23] This complexity makes it practical for large automata in applications like compiler design and lexical analysis. As an illustrative example, consider an NFA recognizing binary strings over \Sigma = \{0, 1\} with an even number of 1's. The NFA has states Q = \{q_e, q_o\}, where q_e (initial and accepting) tracks even parity and q_o tracks odd parity, with transitions \delta(q_e, 0) = \{q_e\}, \delta(q_e, 1) = \{q_o\}, \delta(q_o, 0) = \{q_o\}, \delta(q_o, 1) = \{q_e\}, and no \epsilon-transitions.
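The powerset construction described above can be sketched directly; this version generates only the subsets reachable from \{q_0\}, and the dictionary encoding and state names follow the parity example by assumption:

```python
from itertools import chain

# Even-parity NFA from the example (no epsilon-transitions).
nfa_delta = {
    ("qe", "0"): {"qe"}, ("qe", "1"): {"qo"},
    ("qo", "0"): {"qo"}, ("qo", "1"): {"qe"},
}

def powerset_construction(nfa_delta, start, accepting, alphabet):
    """Build a DFA whose states are frozensets of NFA states (reachable only)."""
    dfa_start = frozenset({start})
    dfa_delta, dfa_accepting = {}, set()
    worklist, seen = [dfa_start], {dfa_start}
    while worklist:
        subset = worklist.pop()
        if subset & accepting:               # accepting iff it intersects F
            dfa_accepting.add(subset)
        for a in alphabet:
            target = frozenset(chain.from_iterable(
                nfa_delta.get((q, a), set()) for q in subset))
            dfa_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return dfa_delta, dfa_start, dfa_accepting

dfa_delta, dfa_start, dfa_accepting = powerset_construction(
    nfa_delta, "qe", {"qe"}, {"0", "1"})
print(len({s for (s, _) in dfa_delta}))  # 2: the empty subset is never reached
```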
For input "101", the unique computation path is q_e \to q_o \to q_o \to q_e, ending in the accepting state q_e since "101" contains two 1's; by contrast, "10" ends in q_o and is rejected, as it contains a single 1. Applying the powerset construction, the candidate DFA states are \{q_e\}, \{q_o\}, and \emptyset, but \emptyset is unreachable; transitions mirror the NFA since each transition yields a singleton, resulting in an identical two-state DFA whose states correspond to even and odd parity. NFAs can also be constructed systematically from regular expressions using methods like Thompson's construction.[22]
Regular Grammars
A regular grammar, also known as a type-3 grammar in the Chomsky hierarchy, is a formal grammar that generates precisely the regular languages through restrictions on its production rules to ensure linearity in one direction. These grammars are classified as either right-linear or left-linear. In a right-linear grammar, all productions are of the form A \to aB, A \to a, or A \to \epsilon, where A and B are nonterminal symbols, a is a terminal symbol from the alphabet \Sigma, and there is a distinguished start symbol S. Left-linear grammars mirror this structure with productions of the form A \to Ba, A \to a, or A \to \epsilon. A grammar is regular if it is either right-linear or left-linear, but not a mixture of both forms.[24] The language L(G) generated by a regular grammar G consists of all finite strings of terminals that can be obtained from the start symbol S via a derivation, which is a sequence of substitutions applying the production rules. In each derivation step, a nonterminal is replaced by the right-hand side of a matching production, proceeding until only terminals remain; for regular grammars, derivations are inherently linear because at most one nonterminal appears on any right-hand side. This generative process enumerates the language's strings, though multiple derivations may yield the same string in nondeterministic cases.[24] Regular grammars are equivalent to nondeterministic finite automata (NFAs), with constructive algorithms establishing bidirectional conversions. To convert a right-linear regular grammar to an NFA, map each nonterminal to a state, designate S as the start state, add transitions A \xrightarrow{a} B for each production A \to aB, introduce transitions A \xrightarrow{a} F to a new accepting state F for each production A \to a, and mark a state A as accepting if there is a production A \to \epsilon.
The reverse conversion from an NFA to a regular grammar assigns nonterminals to states, with the start state as S, and generates productions A \to aB from transitions A \xrightarrow{a} B, together with A \to \epsilon for each accepting state A. These algorithms preserve the language exactly, confirming the equivalence.[24][25] As an illustrative example, consider the regular language L = \{ a^n b^m \mid n, m \geq 0 \}, which matches the regular expression a^* b^*. A corresponding right-linear grammar is G = (\{S, B\}, \{a, b\}, P, S), with productions S \to aS \mid B and B \to bB \mid \epsilon (using the common extension that permits unit productions A \to B, which preserves regularity). A derivation for the string a^2 b^3 (i.e., "aabbb") proceeds as follows: S \Rightarrow aS \Rightarrow aaS \Rightarrow aaB \Rightarrow aabB \Rightarrow aabbB \Rightarrow aabbbB \Rightarrow aabbb. The derivation tree forms a right-skewed spine: the root S branches to a and another S (twice for the a's), then to B, which branches to b and B (three times), terminating in \epsilon. This tree visually captures the sequential generation, with the leaves yielding the terminal string.[26]
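The grammar-to-NFA conversion described above can be sketched for right-linear grammars. The production encoding (strings such as "aS", with lowercase terminals and uppercase nonterminals) and the fresh state name _F are assumptions of this sketch:

```python
def grammar_to_nfa(productions, start="S"):
    # Bodies are "aB" (A -> aB), "a" (A -> a), "B" (unit production A -> B,
    # handled with an epsilon-move), or "" (A -> epsilon).
    delta, accepting, final = {}, set(), "_F"
    def add(src, label, dst):                # label None = epsilon
        delta.setdefault((src, label), set()).add(dst)
    for head, body in productions:
        if body == "":
            accepting.add(head)              # A -> epsilon: A is accepting
        elif len(body) == 1 and body.islower():
            add(head, body, final)           # A -> a: move to fresh final state
        elif len(body) == 1:
            add(head, None, body)            # A -> B: epsilon-transition
        else:
            add(head, body[0], body[1])      # A -> aB
    accepting.add(final)
    return delta, start, accepting

def closure(states, delta):                  # epsilon-closure
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, None), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(nfa, w):
    delta, start, accepting = nfa
    current = closure({start}, delta)
    for c in w:
        current = closure(
            set().union(*[delta.get((s, c), set()) for s in current]), delta)
    return bool(current & accepting)

# G: S -> aS | B,  B -> bB | epsilon, generating a^n b^m
nfa = grammar_to_nfa([("S", "aS"), ("S", "B"), ("B", "bB"), ("B", "")])
print(accepts(nfa, "aabbb"), accepts(nfa, "ba"))  # True False
```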
Properties
Closure Properties
Regular languages possess notable closure properties under several fundamental operations on formal languages. These properties underscore the robustness of the class, as the result of applying such an operation to one or more regular languages is always regular. Proofs of closure typically rely on constructive methods using finite automata, leveraging the equivalence between regular languages and the languages accepted by nondeterministic finite automata (NFAs) or deterministic finite automata (DFAs).[27] The class of regular languages is closed under union. Given regular languages L_1 and L_2 accepted by NFAs M_1 = (Q_1, \Sigma, \delta_1, s_1, F_1) and M_2 = (Q_2, \Sigma, \delta_2, s_2, F_2), construct an NFA M = (Q_1 \cup Q_2 \cup \{s\}, \Sigma, \delta, s, F_1 \cup F_2) with a new start state s and \varepsilon-transitions from s to s_1 and s_2; the transition function \delta extends \delta_1 and \delta_2. This M accepts exactly L_1 \cup L_2, as paths from s reach accepting states via either submachine.[27] Closure under intersection holds via the product automaton construction. For DFAs M_1 and M_2 as above, form the product DFA M = (Q_1 \times Q_2, \Sigma, \delta, (s_1, s_2), F_1 \times F_2), where \delta((p, q), a) = (\delta_1(p, a), \delta_2(q, a)) for states p \in Q_1, q \in Q_2, and symbol a \in \Sigma. A string is accepted by M if and only if it is accepted by both M_1 and M_2, so M accepts L_1 \cap L_2. Since the state space is finite (|Q_1| \times |Q_2|), M is a DFA.[28] Regular languages are closed under complementation. If L is accepted by a DFA M = (Q, \Sigma, \delta, s, F), then the complement \overline{L} = \Sigma^* \setminus L is accepted by the DFA M' = (Q, \Sigma, \delta, s, Q \setminus F), which swaps accepting and non-accepting states. Every string drives M' to a state in Q \setminus F precisely when M reaches a state not in F, ensuring acceptance of \overline{L}.[5] The class is closed under concatenation. 
For regular languages L_1 and L_2 accepted by NFAs M_1 and M_2, construct M by adding \varepsilon-transitions from each state in F_1 to s_2, using states Q_1 \cup Q_2, start s_1, and accept states F_2. A string w = uv with u \in L_1, v \in L_2 reaches an accepting state via the path for u in M_1 followed by \varepsilon to M_2 for v, and no other strings are accepted.[29] Closure under the Kleene star operation L^* = \bigcup_{i=0}^\infty L^i (with L^0 = \{\varepsilon\}) is established using NFAs. For L accepted by NFA M = (Q, \Sigma, \delta, s, F), create M' = (Q \cup \{s', f\}, \Sigma, \delta', s', \{f\}) with new states s' (start) and f (sole accept), \varepsilon-transitions from s' to s and from s' to f, and from each state in F via \varepsilon to both s and f. This allows zero or more repetitions of paths accepting strings in L.[29] Regular languages are closed under reversal, where L^R = \{w^R \mid w \in L\} and w^R reverses w. Given NFA M for L, construct NFA M^R by reversing all transitions (if \delta(p, a) includes q, add \delta(q, a) including p), setting start states to F (with \varepsilon-transitions if multiple to a new start if needed, but typically direct), and accept state to s. A path labeled w from s to some f \in F in M corresponds to a path labeled w^R from f to s in M^R, hence from start to accept in M^R.[30] The class is closed under homomorphism. A homomorphism h: \Sigma^* \to \Gamma^* maps each symbol to a string in \Gamma^*. If L \subseteq \Sigma^* is regular, accepted by DFA M = (Q, \Sigma, \delta, s, F), then h(L) is accepted by NFA M_h = (Q_h, \Gamma, \delta_h, s, F), where for each transition \delta(p, a) = q in M, add a chain of states and transitions spelling h(a) from p to q (introducing intermediate states as needed for the fixed-length h(a)); \varepsilon-transitions connect chains. 
Since |h(a)| is fixed for each a and |\Sigma| is finite, only finitely many states are added, and processing w = a_1 \cdots a_k simulates M on w while spelling out h(w), so M_h accepts exactly h(L).[31] Closure under inverse homomorphism also holds: for a homomorphism h: \Sigma^* \to \Gamma^* and a regular L \subseteq \Gamma^* accepted by DFA N = (P, \Gamma, \gamma, t, G), the inverse image h^{-1}(L) = \{w \in \Sigma^* \mid h(w) \in L\} is accepted by the DFA with the same state set P, start state t, accepting set G, and transition function \delta_i(p, a) = \hat{\gamma}(p, h(a)), where \hat{\gamma} is the extension of \gamma to strings; on reading a, the automaton simulates N on the entire fixed string h(a), so no additional states are required. Regular languages are also closed under the (right) quotient by a regular language: if L_1 and L_2 are regular, then L_1 / L_2 = \{u \in \Sigma^* \mid \exists v \in L_2, \, uv \in L_1\} is regular. The construction modifies the DFA for L_1 by redefining the accepting states to be those from which some string in L_2 leads to an original accepting state, which is well defined thanks to the finite state space.[32] However, the quotient of a non-regular dividend by a regular divisor need not be regular: for example, with K = \{a^n b^n c \mid n \geq 0\} (non-regular) and R = \{c\} (regular), K / R = \{a^n b^n \mid n \geq 0\}, which is non-regular.[32] The following table summarizes the closure operations and corresponding automaton constructions:
| Operation | Construction Method |
|---|---|
| Union | New NFA start state with \varepsilon-transitions to starts of input NFAs; union of accept sets |
| Intersection | Product DFA with synchronized transitions on symbols; product of start and accept sets |
| Complement | Same DFA, but swap accepting and non-accepting states |
| Concatenation | NFA combining input NFAs with \varepsilon-transitions from accept of first to start of second |
| Kleene star | Augmented NFA with new start/accept state and \varepsilon-loops from accepts to start |
| Reversal | Reverse transitions in NFA; original accepts become starts (via \varepsilon if needed), original start becomes accept |
| Homomorphism | NFA with \varepsilon-chains replacing transitions for strings h(a) |
| Inverse homomorphism | NFA simulating transitions on full strings h(a) via chained states per symbol |
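The product construction for intersection can be sketched as follows; the dictionary encoding of DFAs and the example state names are assumptions of this sketch:

```python
from itertools import product

def product_dfa(d1, s1, f1, d2, s2, f2, alphabet):
    """DFA for L1 ∩ L2: states are pairs; accepting iff both components accept."""
    states1 = {s for (s, _) in d1} | set(d1.values())
    states2 = {s for (s, _) in d2} | set(d2.values())
    delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
             for p, q in product(states1, states2) for a in alphabet}
    return delta, (s1, s2), {(p, q) for p in f1 for q in f2}

def run(delta, start, accepting, w):
    state = start
    for a in w:                      # both components advance in lockstep
        state = delta[(state, a)]
    return state in accepting

# L1: binary strings ending in 1; L2: binary strings of even length
d1 = {("q0", "0"): "q0", ("q0", "1"): "q1",
      ("q1", "0"): "q0", ("q1", "1"): "q1"}
d2 = {("e", "0"): "o", ("e", "1"): "o",
      ("o", "0"): "e", ("o", "1"): "e"}
delta, start, acc = product_dfa(d1, "q0", {"q1"}, d2, "e", {"e"}, {"0", "1"})
print(run(delta, start, acc, "01"), run(delta, start, acc, "1"))  # True False
```

Changing the accepting set to pairs where either component accepts yields the union instead, which is why the same product state space serves both operations.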
Decidability Properties
Regular languages enjoy strong decidability properties, allowing algorithms to resolve key questions about them effectively. Unlike higher classes in the Chomsky hierarchy, the standard decision problems for regular languages (membership, emptiness, finiteness, and equivalence) are all decidable, and when the language is presented as a deterministic finite automaton they are solvable in polynomial time in the size of the automaton. The membership problem, determining whether a given string w \in \Sigma^* belongs to a regular language L described by a deterministic finite automaton (DFA) M = (Q, \Sigma, \delta, q_0, F), is decidable by simulating \delta on w starting from q_0. If the final state is in F, then w \in L. This simulation examines one transition per symbol of w, each a constant-time table lookup, yielding a time complexity of O(|w|).[33] Emptiness, checking whether L(M) = \emptyset for a DFA M, is decidable via graph reachability: model the DFA as a directed graph with states as nodes and transitions as edges, then use breadth-first search (BFS) from q_0 to determine whether any state in F is reachable. Since the graph has |Q| nodes and at most |Q| \cdot |\Sigma| edges, BFS runs in O(|Q| \cdot |\Sigma|) time. For nondeterministic finite automata (NFAs), reachability can be tested directly on the NFA's transition graph, avoiding the exponential subset construction.[34] Finiteness, determining whether |L(M)| < \infty, is decidable by analyzing the DFA's structure for cycles that contribute to infinite languages. Specifically, compute the states reachable from q_0, then check for cycles within the subgraph of states from which an accepting state is reachable; the language is infinite if and only if such a cycle exists.
This cycle detection uses graph traversal such as depth-first search, again in O(|Q| \cdot |\Sigma|) time.[35] Equivalence between two regular languages given by DFAs M_1 and M_2 is decidable by minimizing both automata and verifying isomorphism of the results, or by constructing a DFA for the symmetric difference L(M_1) \triangle L(M_2) and testing it for emptiness. Minimization partitions states into equivalence classes using a table-filling or partition-refinement algorithm, with Hopcroft's algorithm achieving O(|Q| \log |Q| \cdot |\Sigma|) time, where |Q| is the number of states.[36] The Myhill-Nerode theorem underpins these results by characterizing regular languages via the right congruence x \equiv_L y if and only if xz \in L \iff yz \in L for all z \in \Sigma^*; L is regular if and only if the number of equivalence classes (the index) is finite, in which case the index equals the size of the minimal DFA. This equivalence relation directly informs DFA minimization and equivalence testing by identifying indistinguishable states.[37] In contrast, richer classes exhibit undecidability for such problems; for instance, while membership in \{a^n b^n c^n \mid n \geq 0\} is decidable, equivalence of context-sensitive languages is undecidable in general.[34]
Complexity Results
The conversion from a nondeterministic finite automaton (NFA) with |Q| states to an equivalent deterministic finite automaton (DFA) via the subset construction algorithm can result in an exponential blowup in the number of states, reaching up to 2^{|Q|} in the worst case.[38] This worst-case bound is tight, as there exist NFAs requiring a DFA with exactly 2^{|Q|} states for equivalence.[39] Minimization of a DFA with n states and alphabet size k can be performed in O(n k \log n) time using Hopcroft's algorithm, which refines an initial partition of states based on distinguishability. This bound is asymptotically optimal among known partition-refinement methods, as the partitioning process necessitates handling logarithmic factors in the worst case.[40] The construction of an NFA from a regular expression of length m using Thompson's method proceeds in linear time, O(m), by inductively building sub-NFAs for basic symbols and combining them via epsilon transitions for union, concatenation, and Kleene star operations. This efficiency stems from the epsilon-move rules, which add a constant number of states and transitions per operator without requiring determinization.[41] The emptiness problem for DFAs (determining whether the language is empty) is solvable in linear time, O(n), by checking reachability from the initial state to any accepting state via graph traversal.[42] Equivalence of two DFAs with m and n states can be decided in polynomial time, specifically O(mn), by constructing the product automaton and testing emptiness of the symmetric difference.[43] For NFAs, emptiness is NL-complete, reducible to graph reachability, while universality (whether the language equals \Sigma^*) is PSPACE-complete.[44] State complexity measures the minimal number of states required in a DFA recognizing the result of an operation on input DFAs.
For the union of languages recognized by an m-state DFA and an n-state DFA, up to mn states may be necessary and sufficient in the worst case, achieved via the product construction with accepting set (F_1 \times Q_2) \cup (Q_1 \times F_2).[39] Lower bounds demonstrate that this bound is tight for certain languages over binary alphabets.[45] Similarly, intersection requires up to mn states, using the standard product automaton. Reversal of an n-state DFA language has state complexity up to 2^n, as the reverse may require determinization of an equivalent NFA; tight examples exist over suitable alphabets.[38] The following table summarizes the time and state complexities for key operations on regular languages represented by DFAs or NFAs:
| Operation | Time Complexity | State Complexity (Worst Case) |
|---|---|---|
| NFA to DFA (subset construction) | O(2^{\vert Q \vert}) | 2^{\vert Q \vert} states |
| DFA minimization (Hopcroft) | O(n \log n \cdot \vert \Sigma \vert) | n states (already minimal) |
| RegExp to NFA (Thompson) | O(m) | O(m) states |
| DFA emptiness | O(n) | N/A |
| DFA equivalence | O(mn \cdot \vert \Sigma \vert) | N/A |
| NFA emptiness | NL-complete | N/A |
| Union (DFAs, m and n states) | O(mn \cdot \vert \Sigma \vert) | mn states |
| Intersection (DFAs) | O(mn \cdot \vert \Sigma \vert) | mn states |
| Reversal (DFA to minimal DFA) | O(2^n \cdot \vert \Sigma \vert) | 2^n states |
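The exponential blowup of the subset construction can be observed concretely with the classic family "the k-th symbol from the end is 1", whose NFA has k+1 states while determinization yields 2^k reachable DFA states; the encodings below are choices of this sketch:

```python
from itertools import chain

def last_kth_is_one_nfa(k):
    # States 0..k; state 0 loops on both symbols and nondeterministically
    # guesses that the current 1 is the k-th symbol from the end.
    delta = {(0, "0"): {0}, (0, "1"): {0, 1}}
    for i in range(1, k):
        delta[(i, "0")] = {i + 1}
        delta[(i, "1")] = {i + 1}
    return delta, 0, {k}

def reachable_subsets(delta, start, accepting, alphabet):
    # Explore only the subsets reachable from {start}, as in determinization.
    start_set = frozenset({start})
    seen, worklist = {start_set}, [start_set]
    while worklist:
        subset = worklist.pop()
        for a in alphabet:
            target = frozenset(chain.from_iterable(
                delta.get((q, a), set()) for q in subset))
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return seen

for k in (2, 3, 4):
    print(k, len(reachable_subsets(*last_kth_is_one_nfa(k), {"0", "1"})))
    # prints k alongside 2**k: the determinized automaton grows exponentially
```

Intuitively, the DFA must remember the last k symbols read, and no smaller amount of state suffices, which is exactly the lower-bound argument behind the tightness claims above.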