ReDoS
Regular expression denial of service (ReDoS) is a form of algorithmic complexity attack that exploits vulnerabilities in the implementation of regular expressions, causing software to consume excessive computational resources—often exponentially more than expected—and leading to denial-of-service conditions such as system hangs or crashes.[1][2] This vulnerability arises primarily from backtracking mechanisms in nondeterministic finite automaton (NFA) regex engines, which are common in languages like JavaScript, Python, and .NET, where certain input strings trigger an explosion of evaluation paths.[3][4] Attackers craft short, malicious inputs to force this behavior, making ReDoS a low-bandwidth threat that can disrupt web applications, servers, and even client-side processing without requiring large payloads.[1]
The broader concept of algorithmic denial-of-service attacks traces back to 2003, while ReDoS specifically was first formally outlined in 2009 at the OWASP Israel Conference, where Checkmarx researchers highlighted its practical risks in web applications.[2][3] In practice, ReDoS manifests through catastrophic backtracking, where nested quantifiers or overlapping alternatives in a regex pattern—such as (a+)+$—lead to repeated failures and retries for inputs like repeated 'a's followed by a non-matching character (e.g., "aaaaaaaaaaaaaaaa!").[1][3] For instance, a seemingly simple pattern like ^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*\.)+[a-zA-Z]{2,9})$ for email validation can take seconds to hours to process crafted strings because the engine explores a combinatorial number of evaluation paths.[3]
As of 2018, empirical studies revealed ReDoS as a widespread issue, with super-linear regex patterns appearing in thousands of software modules across ecosystems like JavaScript (e.g., npm) and Python (e.g., PyPI), affecting over 10,000 packages and potentially enabling attacks on diverse applications from web servers to input validators.[4] More recent analyses as of 2024 show continued widespread impact, with a 143% increase in reported ReDoS vulnerabilities in the npm ecosystem and approximately 400 related CVEs since 2014.[5][6] The impact includes resource exhaustion at multiple layers—client-side (e.g., browser hangs), server-side (e.g., CPU spikes in Node.js), and even in security tools like web application firewalls—often without detection until exploited.[1][4] Mitigation strategies emphasize avoiding vulnerable patterns, such as nested repetitions or ambiguous quantifiers, and adopting safer engines like deterministic finite automata (DFA) where possible, alongside techniques like input length limits, timeouts, and automated fuzzing tools to test for backtracking risks.[3][1] Despite advances in detection tools, developers often revise regexes manually rather than overhaul parsing logic, underscoring ongoing challenges in preventing ReDoS at scale.[4]
Fundamentals
Definition
Regular expressions, often abbreviated as regex, are formal patterns used in programming languages and tools to match, search, and manipulate strings of text through operations like concatenation, alternation, and repetition.[7] These patterns enable efficient text processing in applications such as input validation, data extraction, and search functionalities across various software systems.
Regular expression denial of service (ReDoS) is an algorithmic complexity attack that exploits vulnerabilities in the processing of regular expressions by causing excessive computational resource consumption, effectively denying service to legitimate users.[8] In ReDoS, attackers craft malicious inputs that trigger inefficient evaluation in regex engines, particularly those relying on backtracking, resulting in catastrophic slowdowns where processing time escalates exponentially with input size. This leads to worst-case time complexities as high as O(2^n), where n is the input length, in contrast to the typical linear or polynomial time for well-behaved patterns.
Unlike traditional denial-of-service attacks that overwhelm systems through high-volume traffic or resource flooding, ReDoS targets the inherent inefficiencies of regex implementations in specific application components, such as web servers or validators, allowing a single, small input to monopolize CPU cycles and halt operations.[1] This makes ReDoS particularly insidious in software relying on user-supplied inputs for pattern matching, as it exploits the gap between average-case efficiency and pathological worst-case behavior in backtracking-based engines.[8]
Backtracking Mechanism
Backtracking in regular expression engines simulates the execution of a non-deterministic finite automaton (NFA) through an input-directed depth-first search, allowing the engine to handle ambiguity in the pattern by exploring multiple matching paths. The process begins at the NFA's initial state, where the engine attempts to consume the input string symbol by symbol. For each state, if a direct match (e.g., a literal character) succeeds, it advances to the next state; however, when non-determinism arises—such as with alternatives (denoted by |) or quantifiers—the engine prioritizes one path (typically the left branch or maximum repetition) and proceeds recursively. If the chosen path reaches a dead end (i.e., no further match is possible for the remaining input), the engine backtracks to the most recent choice point, restores the previous state, and tries the next alternative, continuing until an accepting path is found or all possibilities are exhausted.[9][10]
This mechanism relies on maintaining a stack of states to enable backtracking, effectively turning the NFA simulation into a recursive traversal that retries failed branches. For instance, in handling unions, the engine first explores the initial alternative fully; upon failure, it unwinds the stack and shifts to the subsequent one. Similarly, for closures like Kleene stars (*), the engine recursively applies the body of the repetition, backtracking through iteration counts when the overall pattern cannot proceed. To avoid infinite loops from epsilon transitions (empty moves), many engines track visited states within a cycle before consuming input, ensuring progress.[9][11]
Greedy and lazy quantifiers play a key role in shaping the backtracking paths by dictating the order of repetition attempts. A greedy quantifier, such as *, instructs the engine to match the maximum possible repetitions of its preceding element before advancing, committing to longer matches first and only backtracking by reducing the count one by one if later parts of the pattern fail. In contrast, a lazy quantifier, like *?, starts with the minimum (often zero) repetitions and incrementally increases them only when required for the overall match to succeed, reversing the backtracking sequence to favor shorter initial commitments. This difference in prioritization can significantly alter the depth and breadth of the search tree explored, though both ultimately rely on the same retry mechanism.[9]
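The contrast is visible in a short Python example using the standard re module, where the same pattern body produces different matches depending on the quantifier type:

```python
import re

text = "<b>bold</b><i>italic</i>"

# Greedy: .* first consumes everything after '<', then backtracks
# character by character until the final '>' can still match.
print(re.findall(r"<.*>", text))   # ['<b>bold</b><i>italic</i>']

# Lazy: .*? starts from zero characters and grows only as needed,
# so each match stops at the first '>' it reaches.
print(re.findall(r"<.*?>", text))  # ['<b>', '</b>', '<i>', '</i>']
```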
Mathematically, backtracking manifests as a depth-first search tree over the NFA's state space, where each internal node represents a partial match state with pending choices, and edges denote transitions triggered by input consumption or pattern alternatives. The root is the starting state, with branches for each non-deterministic option (e.g., quantifier iterations from maximum downward for greedy cases), leading to leaves that are either accepting matches or failures. For patterns with overlapping ambiguities, such as nested repetitions or alternatives, the tree's node count can expand exponentially—proportional to the factorial of repetition bounds in worst cases—as the engine exhaustively enumerates path combinations before concluding. This growth underscores the non-linear complexity inherent in backtracking simulations of NFAs.[9][10]
The backtracking approach traces its origins to Henry Spencer's public-domain regular expression library developed in the 1980s, which introduced this NFA-based recursive method as an alternative to earlier deterministic finite automaton (DFA) implementations. This library was integrated into Perl's regex engine around 1991, marking a pivotal adoption that emphasized flexibility for extended regex features over strict linear performance. The mechanism has since proliferated, forming the core of engines in languages like JavaScript, which enumerates NFA paths in priority order per ECMAScript specifications, and Python, whose re module employs similar depth-first traversal for pattern matching.[12][11]
Vulnerabilities
Common Patterns
Common patterns vulnerable to ReDoS typically involve syntactic constructs in regular expressions that introduce ambiguity, allowing backtracking regex engines to explore an exponential number of matching paths on certain inputs.[13] These patterns exploit the nondeterministic nature of the expressions, where multiple ways to consume the input string lead to redundant computations during the matching process.[14]
Among the most prevalent vulnerable patterns are nested quantifiers, such as (a+)+, where inner repetitions can be grouped in various ways, creating overlapping possibilities for matching sequences of 'a' characters.[13] Another common construct is optional groups with alternations, exemplified by (a|aa)+?, which permits non-greedy matching that branches into multiple partial matches for strings with repeated substrings.[14] Overlapping subexpressions, such as quantified overlapping disjunctions like (\w|\d)+, further amplify this by allowing the engine to try numerous combinations of word characters and digits in ambiguous sequences.[14] These patterns are characterized by their star height greater than one or quantified overlaps, which inherently support multiple valid parses of the input.[15]
The ambiguity in these patterns causes path explosion because backtracking engines, which attempt all possible ways to advance through the expression, generate a combinatorial number of execution paths. For instance, in a pattern like (a|a)* matching a string of 'a's followed by a non-match, the engine may explore up to \Theta(2^n) paths for an input of length n, as each position can be attributed to either branch of the alternation.[15] This can be illustrated with a minimal backtracking matcher; the Python sketch below assumes the NFA is supplied as a transition table in which every transition consumes exactly one input symbol:
```python
def match(transitions, accepting, state, text, pos):
    """Depth-first NFA simulation with backtracking.

    transitions: dict mapping state -> list of (symbol, next_state) pairs.
    accepting:   set of accepting states.
    """
    # Success: the entire input is consumed in an accepting state.
    if pos == len(text) and state in accepting:
        return True
    # Try each outgoing transition in priority order.
    for symbol, next_state in transitions.get(state, []):
        if pos < len(text) and text[pos] == symbol:
            if match(transitions, accepting, next_state, text, pos + 1):
                return True
        # A failed branch falls through here: backtrack to the next alternative.
    return False
```
In this recursive process, ambiguous patterns cause the function to branch repeatedly, with each level potentially doubling the number of calls, leading to exponential time complexity in the worst case.[13]
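To make the explosion measurable, the sketch above can be wrapped with a call counter (recursive calls resolve through the module-global name, so every invocation is counted) and run on an NFA for (a|a)*, encoded here as a single accepting state with two identical self-loops:

```python
_inner = match  # keep a reference to the matcher defined above
calls = 0

def match(*args):  # counting wrapper; recursive calls now route through it
    global calls
    calls += 1
    return _inner(*args)

# NFA for (a|a)*: one accepting state with two identical 'a' self-loops.
nfa = {0: [("a", 0), ("a", 0)]}

for n in (5, 10, 15, 20):
    calls = 0
    match(nfa, {0}, 0, "a" * n + "!", 0)  # trailing '!' forces total failure
    print(f"n={n}: {calls} calls")        # call count roughly doubles per 'a'
```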
Studies indicate that such vulnerable patterns are widespread in open-source codebases. For example, an analysis of nearly 4,000 Python projects on GitHub found that while 42% of projects use regular expressions, a significant portion incorporate potentially vulnerable ones, with broader surveys estimating 10-20% of regexes across ecosystems like Node.js and Python as susceptible to super-linear behavior.[15] In the npm ecosystem, approximately 1% of unique regexes exhibit super-linear complexity, yet they appear in over 13,000 modules, highlighting their prevalence despite low per-regex incidence.[14] Similarly, examinations of Java programs revealed that about 20% of regexes are vulnerable.[13]
Differences in regex engine implementations exacerbate these risks. PCRE, a widely used backtracking engine compatible with Perl syntax, fully supports features like nested quantifiers and alternations, enabling the path explosion described above but at the cost of potential exponential runtime.[14] In contrast, RE2 employs a linear-time automata-based approach without backtracking, rejecting constructs like backreferences that require it, to guarantee [O(n)](/page/O(n)) matching time, though this limits its expressiveness compared to PCRE.[16]
Exploitation Examples
A classic illustration of a ReDoS vulnerability involves the regular expression ^(a+)+$, which is intended to match one or more repetitions of one or more 'a' characters spanning the whole string. When tested against an input like "aaaa!", the backtracking engine explores numerous failure paths due to the nested quantifiers (an inner + inside an outer +). For a smaller input such as "aa!", the trace begins with the engine greedily matching "aa" as a single a+ group within the outer (a+)+, then failing at the $ anchor upon encountering "!". It backtracks by reducing the inner a+ to "a" and attempting a second group with the remaining "a", but fails again at $. Further backtracking tries splitting into single "a" groups or other combinations, resulting in 5 total attempts; scaling to "aaaa!" expands this to over 15 attempts, demonstrating the exponential blowup in computation.[17]
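The blowup is straightforward to reproduce; this illustrative Python timing loop (input sizes chosen arbitrarily) runs the vulnerable pattern against inputs that almost match, and the measured runtime roughly doubles with each additional character:

```python
import re
import time

pattern = re.compile(r"^(a+)+$")  # the classic evil regex

for n in range(18, 27, 2):
    start = time.perf_counter()
    pattern.match("a" * n + "!")  # trailing '!' forces exhaustive backtracking
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```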
In practical code, such vulnerabilities often appear in input validation functions using nested or overlapping groups. For instance, a JavaScript email validator employing the pattern /("[^"]*"|[^@])*@[^@]*/ can hang on crafted inputs with repeated quotes, as the engine backtracks across the alternation between the quoted form "[^"]*" and the single-character form [^@], retrying the many possible groupings of the characters before the @. The following snippet exemplifies this:
```javascript
const emailRegex = /("[^"]*"|[^@])*@[^@]*/;
function isValidEmail(email) {
  return emailRegex.test(email);
}
// Vulnerable: isValidEmail('""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""@example.com');
// This input triggers extensive backtracking, potentially delaying execution by seconds or more.
```
This pattern, drawn from real-world usage, highlights how seemingly simple validation logic exposes applications to denial-of-service via user-supplied strings.[18]
Analysis of open-source repositories reveals widespread presence of such vulnerable patterns. A study of 865 popular Node.js modules on npm identified 161 (19%) containing super-linear regexes susceptible to ReDoS, often in parsing or validation routines. More recent systematic reviews confirm ReDoS as the fourth most common server-side vulnerability in npm packages from 2014 to 2023, with approximately 400 related CVEs disclosed across ecosystems, underscoring persistent risks in dependency chains despite growing awareness. Tools like safe-regex and ReDoSHunter have been applied in scans to flag these, revealing nested quantifiers as a frequent culprit in project codebases.[19][20]
In production environments, attackers deliver malicious inputs through common web channels to exploit these flaws. For example, crafted strings can be injected into HTTP headers like User-Agent or Content-Type, where modules such as ua-parser-js or charset process them with vulnerable regexes, causing server delays of up to minutes per request. Similarly, form submissions or API payloads targeting input sanitization—such as email fields in HTML forms or JSON bodies—allow probes like repeated characters to trigger backtracking; an analysis of 475 API-publishing domains found 6 vulnerable to such exploits, leading to patches in services from Microsoft and AWS. These vectors enable low-effort denial-of-service by overwhelming single-threaded runtimes like Node.js with minimal payload sizes, often as small as 38 characters.[19][21]
Impacts
Regular Expression Denial of Service (ReDoS) exploits the backtracking mechanism in many regex engines, leading to exponential time complexity in the worst case. In backtracking implementations, the engine explores multiple possible matches by trying different combinations of quantifiers and alternatives, which can result in an explosion of computation paths. For patterns with nested quantifiers, such as (a+)+b matched against a long string of 'a's without a trailing 'b', the number of backtracking steps grows exponentially with the input length n. This can be modeled as T(n) \approx c \cdot 2^d, where c is a constant representing the base cost per step, and d approximates the number of decision points (often proportional to n), yielding an overall O(2^n) complexity.[22][23]
Benchmarks illustrate this impact on modest hardware, such as a mid-2010s laptop. For instance, a 30-character invalid input to a vulnerable pattern like ^(.*?,){n}P can take 1.8 seconds to process, spiking CPU usage to 99%, while a valid input completes in 0.05 seconds. Extending the input by one character often doubles the runtime, escalating to several seconds for 31 characters and potentially hours for longer strings due to billions of steps—for a 20-character input, one engine requires over 3.6 million steps, and for 100 characters, the theoretical steps exceed 10^{18}. Memory consumption also rises from stack frames for each backtrack level, potentially leading to stack overflows in languages like pre-5.10 Perl.[17][22][23]
In single-threaded runtimes such as Node.js, a single ReDoS-triggering input blocks the event loop, causing widespread delays and service unavailability as other requests queue up. Performance graphs typically show runtime versus input size as an exponential curve: near-linear for small n (e.g., milliseconds), but near-vertical escalation beyond a threshold (e.g., seconds to minutes at n=20), contrasting sharply with safe patterns that remain flat.[17][24]
In comparison, linear-time alternatives like Thompson's NFA construction avoid backtracking altogether, simulating the NFA by tracking sets of states as the input is processed once, achieving O(m n) time complexity where m is the pattern length. This makes NFA-based engines, such as RE2, immune to ReDoS while handling the same inputs in microseconds even for large n, orders of magnitude faster than vulnerable backtracking in worst cases.[12]
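For contrast, a minimal sketch of this state-set simulation (reusing the transition-table encoding from the earlier backtracking example) processes the input in a single pass, so ambiguous branches merge into one set entry instead of forking the search:

```python
def match_linear(transitions, accepting, start, text):
    """Thompson-style NFA simulation: one pass, no backtracking."""
    states = {start}
    for ch in text:
        # Advance every live state on ch; duplicate successors collapse
        # into the set, which is what keeps the work linear.
        states = {nxt
                  for s in states
                  for symbol, nxt in transitions.get(s, [])
                  if symbol == ch}
        if not states:
            return False   # no state survives: reject without retrying
    return bool(states & accepting)

# The ambiguous (a|a)* NFA now rejects the attack string instantly.
nfa = {0: [("a", 0), ("a", 0)]}
print(match_linear(nfa, {0}, 0, "a" * 10_000 + "!"))  # False, in O(n)
```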
Real-World Incidents
One of the earliest prominent real-world ReDoS incidents occurred in July 2016, when Stack Overflow experienced a 34-minute outage affecting its entire platform. The disruption was triggered by a user-submitted post containing a malicious string that exploited a vulnerable regular expression designed to trim whitespace, specifically ^[\s\u200c]+|[\s\u200c]+$, leading to excessive backtracking and CPU exhaustion on the servers.[25][18] This event highlighted the risks of processing untrusted user input with complex regex patterns in high-traffic web applications, resulting in temporary unavailability for millions of users worldwide.[26]
In July 2019, Cloudflare encountered a major global outage lasting approximately 27 minutes, impacting a significant portion of the internet's traffic routed through its network. The incident stemmed from a newly deployed Web Application Firewall (WAF) rule incorporating a regular expression with catastrophic backtracking behavior, such as .*.*=, which caused full CPU utilization on edge servers handling HTTP/HTTPS requests.[27][25] This affected millions of websites and services, underscoring the potential for ReDoS to cascade into widespread downtime when introduced in critical infrastructure components like content delivery networks.[28]
Following these high-profile cases, awareness of ReDoS has grown, particularly since 2020, with automated security scanners increasingly identifying vulnerabilities in open-source supply chains. Security analyses indicate ReDoS ranks as the fourth most common server-side vulnerability class in JavaScript ecosystems such as npm.[29] Hundreds of CVEs have been assigned to ReDoS across various libraries and frameworks as of 2023, reflecting broader adoption of static analysis for regex auditing in software development pipelines.
In 2024, ReDoS vulnerabilities continued to emerge in AI-related tools, exemplified by CVE-2024-12720 in the Hugging Face Transformers library, versions up to 4.46.3. This flaw in the Nougat tokenizer's post_process_single() function in tokenization_nougat_fast.py allowed crafted inputs to trigger exponential backtracking, potentially causing denial-of-service in AI inference workflows.[30] The vulnerability affected deployments relying on the library for document understanding tasks, emphasizing ReDoS risks in emerging machine learning supply chains.[31] As of 2025, ReDoS remains prevalent, with ongoing reports of vulnerabilities in diverse software ecosystems.
Mitigation
Design Strategies
To prevent Regular Expression Denial of Service (ReDoS) vulnerabilities, developers should follow principles that minimize or eliminate backtracking in regex patterns, particularly by avoiding constructs that lead to exponential time complexity. Key strategies include using atomic groups, which prevent backtracking within a subpattern once it has matched, and possessive quantifiers, which apply greedy matching without allowing subsequent backtracking to release characters.[32] These features, supported in engines like PCRE and Java's regex, ensure that quantified elements do not revisit prior matches, thereby bounding the execution time to linear in the input length for supported patterns.[29] Additionally, nested unbounded repeats, such as (a+)+ or (a*)*, should be avoided as they introduce infinite ambiguity and catastrophic backtracking; these are identified as high-risk anti-patterns like Star-3 in theoretical analyses of regex ambiguity.[33]
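For example, Python 3.11 added both constructs to the standard re module; in this small sketch, either form rejects the classic attack input in linear time (the vulnerable call is left commented out deliberately):

```python
import re  # (?>...) and possessive quantifiers require Python 3.11+

attack = "a" * 30 + "!"

# Vulnerable baseline: nested quantifiers allow exponential backtracking.
# re.match(r"^(a+)+$", attack)  # may run for minutes; do not execute

# Atomic group: once (?>a+) has matched, it never gives characters back.
print(re.match(r"^(?>a+)+$", attack))  # None, returned almost instantly

# Possessive quantifier: the greedy-without-giveback equivalent.
print(re.match(r"^(a++)+$", attack))   # None, also fails in linear time
```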
Refactoring techniques transform ambiguous patterns into deterministic equivalents that execute in linear time without altering the intended matches. For instance, the nested repeat (a+)+ can be refactored to the simpler a+, eliminating the outer quantifier's backtracking potential while preserving behavior for strings of 'a' characters.[32] Similarly, patterns with overlapping concatenations, such as a*(aa)* (Concat-1 anti-pattern), can be rewritten to the equivalent a* to remove redundant ambiguity where multiple subpatterns generate overlapping languages.[32] These refactors rely on equivalence checks via finite automata, ensuring no loss of expressiveness while preventing super-linear runtime; empirical studies show such revisions resolve over 70% of identified vulnerabilities in practice.[34]
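Such rewrites can be spot-checked before deployment; the brute-force comparison below (an illustrative check over short strings, not a formal automaton-equivalence proof) confirms both refactorings preserve the matched language:

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=8):
    """All strings over alphabet, up to max_len, fully matched by pattern."""
    regex = re.compile(pattern)
    return {"".join(chars)
            for n in range(max_len + 1)
            for chars in product(alphabet, repeat=n)
            if regex.fullmatch("".join(chars))}

# Each vulnerable pattern accepts exactly the same strings as its rewrite.
assert language(r"(a+)+") == language(r"a+")
assert language(r"a*(aa)*") == language(r"a*")
print("refactoring equivalence spot-check passed")
```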
Selecting regex engines with guaranteed linear-time performance is a foundational design choice for ReDoS prevention. Google's RE2 library implements a finite-state machine approach that evaluates alternatives in parallel without backtracking, achieving O(m * n) time complexity where m is the regex size and n is the input length, and explicitly omits features like backreferences that enable exponential behavior.[35] Similarly, Rust's regex crate compiles patterns to a Thompson NFA with lazy DFA evaluation, providing worst-case O(m * n) guarantees for searches and supporting size limits on untrusted patterns to cap compilation overhead.[36] These engines prioritize safety for production use, contrasting with backtracking-based libraries like PCRE, and are recommended for applications handling untrusted input.[29]
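As an illustrative sketch, assuming the google-re2 Python binding (installed via pip install google-re2), whose re2 module mirrors much of the standard re API, the classic attack string returns immediately and backtracking-only features fail at compile time:

```python
import re2  # assumption: the google-re2 binding is installed

# Linear-time matching: the attack string that stalls backtracking
# engines is rejected immediately by RE2's automaton-based search.
pattern = re2.compile(r"(a+)+$")
print(pattern.search("a" * 10_000 + "!"))  # None, in milliseconds

# Backreferences require backtracking, so RE2 refuses to compile them.
try:
    re2.compile(r"(a)\1")
except Exception as exc:  # exact error type may vary across versions
    print("rejected:", exc)
```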
Testing practices should be integrated into the software development lifecycle (SDLC) to verify regex security proactively. Developers should include unit tests with large, adversarial inputs—such as strings exceeding 1,000 characters designed to trigger backtracking—to detect slowdowns exceeding linear scaling, using timeouts to fail any test that runs past a deadline.[34] Input length limits, enforced before regex evaluation (e.g., rejecting strings longer than a fixed threshold), serve as a simple first-line defense, proven effective in over 20% of historical vulnerability fixes by bounding resource usage.[34] Embedding these tests in continuous integration pipelines ensures ongoing validation, with benchmarks on diverse inputs confirming linear performance.[36]
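A minimal standard-library sketch of both practices (the length threshold, pattern, and one-second deadline are illustrative choices) rejects oversized input before matching and fails a test whose adversarial probe misses its deadline:

```python
import re
import multiprocessing

MAX_INPUT_LENGTH = 256  # illustrative first-line defense: bound input size

def is_valid_token(s: str) -> bool:
    if len(s) > MAX_INPUT_LENGTH:   # reject oversized input before matching
        return False
    return re.fullmatch(r"[0-9a-zA-Z_-]+", s) is not None

def _probe(s: str) -> None:         # module-level so it is picklable
    is_valid_token(s)

def test_adversarial_input_finishes_quickly() -> None:
    proc = multiprocessing.Process(target=_probe, args=("a" * 100_000,))
    proc.start()
    proc.join(timeout=1.0)          # deadline for the match to finish
    hung = proc.is_alive()
    if hung:
        proc.terminate()            # kill the runaway matcher
    assert not hung, "regex exceeded the 1s deadline on adversarial input"

if __name__ == "__main__":
    test_adversarial_input_finishes_quickly()
    print("adversarial regex test passed")
```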
Static analyzers identify potential ReDoS vulnerabilities by examining regular expression patterns without executing them, often modeling backtracking depth through analysis of nested quantifiers, ambiguity in alternatives, and polynomial-time complexity bounds.[37] Tools like Semgrep use pattern-matching rules to flag vulnerable regexes, such as those with repeated capturing groups containing optional elements, supporting multiple languages including Python and JavaScript.[38] RegexStaticAnalysis employs static techniques to detect catastrophic backtracking in expressions like (ht|f)tp(s?)://[^\s]*, evaluating structural risks in real-world codebases.[39] Similarly, RXXR2, an extension of earlier regex analyzers, uses pumping lemma-based checks to bound execution time and identify exponential blowups.[40]
Dynamic testing tools complement static methods by executing regexes against crafted inputs to measure runtime behavior and confirm vulnerabilities. Fuzzers such as regexploit generate adversarial strings heuristically to trigger excessive backtracking, providing proof-of-concept exploits for patterns like ^((mailto:)?[\w.%+-]+@[\w.-]+\.[\w]{2,20})?$.[41] ReScue applies genetic algorithms to evolve input strings that maximize processing time, effectively discovering ReDoS triggers in non-trivial cases by iteratively mutating candidates based on fitness scores derived from execution duration.[42]
Benchmarks evaluating these tools show varying accuracy, with advanced analyzers like RENGAR achieving near-complete detection of known vulnerabilities—detecting all those found by nine prior tools plus 3 to 197 times more in large datasets—while maintaining low false positives through principled modeling of backtracking states.[40] In comparative tests on benchmark sets of vulnerable expressions, static tools like RegexStaticAnalysis identified up to 46% of cases, though hybrid approaches often exceed 90% recall in recent studies.[39] Many tools, including Semgrep and CodeQL, integrate seamlessly with CI/CD pipelines via plugins for GitHub Actions or Jenkins, enabling automated scanning during code reviews to catch ReDoS risks in legacy codebases early.