Lexical analysis
In computer science, lexical analysis, also known as scanning or lexing, is the initial phase of a compiler that transforms a stream of input characters from source code into a sequence of meaningful units called tokens.[1] This process reads the source program from left to right, grouping characters into lexemes while ignoring whitespace and comments, to prepare the input for subsequent syntactic analysis. As the foundational step in the compiler frontend, it simplifies the overall compilation by reducing character-level details to higher-level symbolic representations, enabling efficient parsing and error detection.[2]

In lexical analysis, tokens represent the basic vocabulary elements of a programming language, such as keywords (e.g., "if" or "while"), identifiers (e.g., variable names), literals (e.g., numbers or strings), operators (e.g., "+" or "="), and punctuation (e.g., semicolons or parentheses).[1] A lexeme is the specific sequence of characters that matches a token's pattern; for instance, the string "result" serves as the lexeme for an identifier token.[1] Patterns defining valid lexemes are typically specified using regular expressions, which describe rules such as decimal integers matching [1-9][0-9]* | 0.[1] The lexical analyzer, or scanner, resolves ambiguities by selecting the longest possible match at each position and prioritizing higher-precedence patterns when ties occur.[3]
Lexical analyzers are commonly implemented using finite automata, where regular expressions are converted into nondeterministic finite automata (NFA) and then optimized into deterministic finite automata (DFA) for efficient scanning.[4] This approach ensures linear time complexity relative to the input length, making it suitable for large programs.[4] Tools like Lex, developed at Bell Labs in the 1970s, automate the generation of such analyzers by compiling user-specified regular expressions and associated actions into efficient C code, often paired with parser generators like Yacc.[4] Modern variants, such as Flex, extend this capability for contemporary systems and languages.[5]
Fundamentals
Definition and purpose
Lexical analysis, also known as scanning or lexing, is the initial phase in the front-end of a compiler or interpreter, where a stream of input characters from the source code is transformed into a sequence of meaningful tokens. This process involves reading the source program from left to right, grouping characters into logical units based on predefined patterns, and producing an intermediate representation that abstracts away low-level details of the input. As the first step in source code processing, it establishes a foundation for higher-level analyses by standardizing the input into discrete elements suitable for further examination.[6][7]

The primary purpose of lexical analysis is to simplify the subsequent phases of compilation, particularly syntax analysis, by isolating significant syntactic units such as keywords, identifiers, operators, and literals from the raw character stream. It achieves this by discarding irrelevant elements, including whitespace and comments, which do not affect the program's semantics but would complicate parsing if retained. This separation of concerns enhances efficiency, as the lexer handles character-level decisions independently of the parser's structure-level responsibilities, leading to a modular compiler design. Additionally, lexical analysis detects and reports basic errors, such as invalid characters or malformed literals, early in the compilation pipeline.[8][9][10]

The practice of lexical analysis traces its origins to the 1950s, coinciding with the emergence of high-level programming languages and the need for automated translation systems. It was integral to early compilers, notably the FORTRAN system developed by John Backus and his team at IBM, released in 1957 after extensive efforts to translate mathematical formulas into machine code.[11] This innovation marked a shift from manual assembly to automated processing, proving essential for handling the syntactic diversity of programming languages across various hardware platforms. Since then, lexical analysis has remained a cornerstone of compiler design, evolving to support increasingly complex languages while preserving its core role in input preprocessing.

By generating a token stream—comprising elements like tokens and their associated lexemes—lexical analysis provides a clean, abstracted input to the syntax analyzer, enabling the latter to focus on grammatical structure without interference from formatting artifacts.[7]

Tokens, lexemes, and patterns
In lexical analysis, tokens serve as the fundamental abstract units that represent categories of syntactic elements in the source code, facilitating subsequent phases like parsing. A token typically comprises a token name (or type), such as INTEGER_LITERAL or IDENTIFIER, paired with an optional attribute value that stores relevant details from the matched input, such as the numerical value of a literal. This structure allows the lexer to abstract away the specific character sequences while preserving essential information for the compiler.[12]
A lexeme refers to the concrete sequence of characters in the source program that corresponds to a specific token. For instance, the string "123" is a lexeme recognized as an INTEGER_LITERAL token, while "foo" serves as a lexeme for an IDENTIFIER token. Unlike tokens, lexemes are the raw substrings extracted directly from the input stream during scanning.[12]
Patterns define the rules governing the structure of valid lexemes for each token type, commonly specified using regular expressions to describe allowable character sequences. For example, the pattern [0-9]+ matches integer lexemes like "42" or "0", while the pattern [a-zA-Z_][a-zA-Z0-9_]* captures identifier lexemes starting with a letter or underscore followed by alphanumeric characters. Keywords, such as "if", often use exact literal patterns like the string itself to ensure precise matching.[12]
The interplay among tokens, lexemes, and patterns enables efficient categorization: a single token type can encompass numerous lexemes, as seen with INTEGER_LITERAL accommodating "1", "100", or "999", all validated against the same pattern. To resolve potential ambiguities in matching, lexical analyzers employ the maximal munch principle, selecting the longest possible lexeme that fits any pattern before proceeding to the next input segment. This ensures consistent tokenization, such as preferring "==" as a single equality operator token over two separate "=" assignment tokens.[12]
| Token | Example Lexeme | Pattern |
|---|---|---|
| IF | if | if |
| INTEGER_LITERAL | 123 | [0-9]+ |
| IDENTIFIER | foo | [a-zA-Z_][a-zA-Z0-9_]* |
| PLUS | + | \+ |
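The maximal munch rule and the tie-breaking priority among the patterns above can be illustrated with a brief sketch (an illustrative Python fragment, not a production scanner; the TOKEN_SPEC list and the longest_match helper are invented for this example). It tries every pattern at the current position, keeps the longest match, and breaks ties in favor of the earlier rule, so "==" is emitted as a single token and "ifx" becomes one identifier rather than the keyword "if" plus "x".

```python
import re

# Ordered token specification: earlier rules win ties (e.g., keywords before identifiers).
TOKEN_SPEC = [
    ("IF",              r"if"),
    ("EQ",              r"=="),
    ("ASSIGN",          r"="),
    ("INTEGER_LITERAL", r"[0-9]+"),
    ("IDENTIFIER",      r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("PLUS",            r"\+"),
]

def longest_match(text, pos):
    """Return (token_name, lexeme) for the longest match at pos; ties go to the earlier rule."""
    best = None
    for name, pattern in TOKEN_SPEC:
        m = re.compile(pattern).match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(longest_match("== 1", 0))     # ('EQ', '==') -- maximal munch beats ASSIGN
print(longest_match("ifx = 3", 0))  # ('IDENTIFIER', 'ifx') -- longer than the keyword 'if'
```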
Disambiguation of key terms
In linguistics, a lexeme is defined as an abstract unit of lexical meaning that underlies a set of related word forms sharing the same core semantics and syntactic category, such as the base form "run" encompassing inflected variants like "runs," "running," and "ran."[13] This contrasts with its usage in programming language compiler design, where a lexeme refers specifically to the concrete sequence of characters in the source code that matches a predefined pattern for a token, serving as the raw textual representation without implying morphological relations. The divergence arises because linguistic lexemes emphasize semantic unity across forms, while compiler lexemes focus on syntactic matching during input processing.[14]

The term "token" exhibits significant overlap and ambiguity across domains. In compiler design, a token is an abstract category or type (e.g., "identifier" or "keyword") paired with its corresponding lexeme, facilitating efficient parsing by abstracting away the specific character sequence. This distinguishes it from the lexeme, though the terms are occasionally used interchangeably in informal contexts, leading to imprecise descriptions of lexical output. In natural language processing (NLP), a token typically denotes a segmented unit such as a word, subword (e.g., via Byte-Pair Encoding), or punctuation mark derived from text, often without the explicit type-lexeme pairing seen in compilers; here, tokenization prioritizes handling variability in natural text like contractions or multilingual scripts.[15] Such cross-domain usage can confuse practitioners, as compiler tokens emphasize formal language structure, whereas NLP tokens support probabilistic modeling of unstructured data.[16]

"Lexer" and "scanner" are largely synonymous terms for the component performing lexical analysis, but subtle emphases exist: "lexer" highlights the production of structured tokens from input, while "scanner" underscores the linear traversal and character-by-character consumption of the source stream.[17] Both trace back to early compiler literature, where the process was described as scanning for patterns, but modern tools like ANTLR predominantly employ "lexer" to denote the generated module that tokenizes input according to grammar rules.[18]

Historical shifts in terminology were formalized in seminal compiler texts, such as Aho, Sethi, and Ullman's 1986 work, which standardized "lexeme" as the matched string and "token" as its categorical representation, influencing subsequent designs and reducing earlier ad-hoc variations in scanning descriptions.[12] This standardization persists in contemporary tools like ANTLR, where lexer rules directly implement these distinctions without reverting to pre-1986 synonyms like "word" or "symbol unit."[18]

A common pitfall in lexical analysis is conflating tokens with parsing constructs, where tokens produced by the lexer are mistakenly treated as full syntactic units rather than atomic building blocks passed to the parser; this can lead to errors in handling ambiguities like reserved keywords versus identifiers, as the lexer's role ends at token emission without deeper structural validation.[19]

Processes
Scanning and evaluation
The scanner, also known as the lexer, is the core component of lexical analysis responsible for sequentially reading the input character stream from the source code, typically from left to right, while maintaining a buffer to handle lookahead characters as necessary for pattern matching.[7] This buffering ensures that the scanner can retract or advance the input pointer efficiently without re-reading the entire stream.[20]

During the evaluation process, the scanner advances through the input position by position, attempting to match the current prefix against predefined lexical patterns, such as regular expressions for keywords, identifiers, or literals.[21] When multiple patterns match a prefix, the scanner applies disambiguation rules, prioritizing the longest possible match to ensure the most specific token is selected, or falling back to a predefined priority order if lengths are equal.[21] This mechanism prevents ambiguous parses, such as distinguishing "if" as a keyword rather than the start of an identifier.

The underlying model for this scanning is a finite state machine, specifically a deterministic finite automaton (DFA) constructed from the lexical rules, which enables efficient recognition by transitioning between states based on the next input character.[22] Each state represents partial progress toward matching a pattern, and accepting states correspond to complete tokens, allowing the scanner to process the input in a single pass without backtracking in the DFA implementation.[23]

Error handling in scanning involves detecting and reporting invalid lexemes that do not match any pattern, such as unrecognized characters or malformed constructs like unmatched quotes in string literals.[20] Upon encountering such errors, the scanner may skip offending characters, insert defaults, or halt with a diagnostic message including the position, ensuring the analysis can continue or fail gracefully.[20]

The performance of this scanning process achieves linear time complexity, O(n), where n is the length of the input, due to the constant-time transitions in the DFA and single-pass nature of the evaluation.[23] This efficiency is critical for large source files, making lexical analysis a lightweight preliminary step in compilation.[24]
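A minimal table-driven version of such a scanner can be sketched as follows (an illustrative Python sketch, assuming a hand-built DFA that recognizes only identifiers and unsigned integers; the state names and the scan_token helper are invented for the example). Each character causes one constant-time transition, and the last accepting state reached determines the longest matched lexeme; an input that reaches no accepting state is reported as a lexical error.

```python
# Hand-built DFA for two token types: identifiers and unsigned integers.
# States: "start", "ident" (accepting IDENTIFIER), "int" (accepting INTEGER_LITERAL).
def classify(ch):
    if ch.isdigit():
        return "digit"
    if ch.isalpha() or ch == "_":
        return "letter"
    return "other"

TRANSITIONS = {
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"):  "ident",
    ("start", "digit"):  "int",
    ("int",   "digit"):  "int",
}
ACCEPTING = {"ident": "IDENTIFIER", "int": "INTEGER_LITERAL"}

def scan_token(text, pos):
    """Run the DFA from pos, returning (token, lexeme, next_pos) for the longest match."""
    state, last_accept = "start", None
    i = pos
    while i < len(text):
        nxt = TRANSITIONS.get((state, classify(text[i])))
        if nxt is None:
            break                                   # no transition: stop and use last accept
        state = nxt
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], text[pos:i], i)
    if last_accept is None:
        raise SyntaxError(f"invalid character {text[pos]!r} at position {pos}")
    return last_accept

print(scan_token("count42+x", 0))  # ('IDENTIFIER', 'count42', 7)
```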
Tokenization workflow
The tokenization workflow in lexical analysis systematically processes the source code character stream to produce a sequence of meaningful tokens for subsequent parsing. This process begins with reading the input stream character by character from left to right, often buffering a portion of the input to facilitate efficient matching.[7] The lexer first skips ignorable elements such as whitespace (spaces, tabs, newlines) and comments, which do not contribute to the program's semantics but must be recognized to avoid treating them as part of tokens. For instance, single-line comments starting with "//" or multi-line comments delimited by "/*" and "*/" are discarded entirely after identification. Next, the lexer attempts to match the current position against predefined patterns, typically represented as regular expressions, to identify potential lexemes—sequences of characters that form valid units in the language.[25][24]

Once a potential lexeme is matched, it is classified into an appropriate token category, such as keyword, identifier, operator, or literal, based on the language's lexical rules. To resolve ambiguities where multiple patterns could apply, the maximal munch rule (also known as longest match) is applied, preferring the longest possible lexeme over shorter alternatives. For example, in the input "ifx", the lexer selects "ifx" as a single identifier token rather than the keyword "if" followed by the identifier "x", ensuring unambiguous partitioning.[26][27] If multiple patterns match lexemes of equal length, priority is resolved by the predefined order of rules, where higher-priority patterns (e.g., keywords) are checked before lower ones (e.g., general identifiers). This ordered evaluation prevents misclassification, such as treating "if" as an identifier instead of a keyword.

Lexers often employ finite state machines to manage different modes or states during processing, particularly for context-dependent constructs like string literals—where internal quotes or escape sequences are not treated as delimiters—or nested/multi-line comments that require tracking opening and closing boundaries without interrupting the main flow.[26][28]

The workflow culminates in outputting a stream of tokens, each typically including the token type, the original lexeme, and source position information (line and column numbers) to support error reporting and debugging in later compiler phases. This stream is fed to the parser on demand, and the lexer may provide limited lookahead—such as peeking at the next token without consuming it—to aid disambiguation during parsing.[7][29]
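This workflow can be sketched with a single combined regular expression built from named groups (an illustrative Python sketch, loosely following the tokenizer example in the Python re module documentation; the rule names and the SKIP/MISMATCH conventions are choices made for this example, not a standard API). Whitespace and comments are matched but discarded, the keyword rule is listed before the general identifier rule, and each emitted token carries its line and column.

```python
import re

TOKEN_RULES = [
    ("COMMENT",    r"//[^\n]*|/\*.*?\*/"),   # comments are matched, then discarded
    ("SKIP",       r"[ \t]+"),               # whitespace other than newlines is discarded
    ("NEWLINE",    r"\n"),
    ("IF",         r"if\b"),                 # keyword rule listed before the identifier rule
    ("NUMBER",     r"[0-9]+"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("OP",         r"==|[-+*/=();]"),
    ("MISMATCH",   r"."),                    # any other character is a lexical error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES),
                    re.DOTALL)

def tokenize(source):
    line, line_start = 1, 0
    for m in MASTER.finditer(source):
        kind, lexeme, column = m.lastgroup, m.group(), m.start() - line_start
        if kind == "NEWLINE":
            line, line_start = line + 1, m.end()
        elif kind in ("SKIP", "COMMENT"):
            line += lexeme.count("\n")        # keep line numbers right across /* ... */ comments
            if "\n" in lexeme:
                line_start = m.start() + lexeme.rfind("\n") + 1
        elif kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {lexeme!r} on line {line}")
        else:
            yield kind, lexeme, line, column

for token in tokenize("if (x == 42) // compare\n  y = x + 1;\n"):
    print(token)
```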
Grammars and rules
Lexical grammar
Lexical grammar constitutes a formal specification within the lexical analysis phase of compiler design, defining the structure of valid tokens in a programming language as a subset of a context-free grammar restricted to regular grammars.[30] These grammars focus exclusively on token formation, distinguishing them from broader syntactic grammars by limiting productions to patterns recognizable by finite automata.[31]

The key components of a lexical grammar include terminals, which are the basic indivisible symbols such as alphabetic characters, digits, or punctuation; non-terminals, which represent categories of tokens like identifiers or constants; and productions, which are rewrite rules deriving non-terminals from sequences of terminals and other non-terminals.[32] For instance, a typical production for an identifier token might be expressed as ID → letter (letter | digit)*, where "letter" and "digit" are themselves defined via terminals or simpler non-terminals.[33]

This grammar plays a critical role in isolating lexical rules from syntactic ones, enabling the scanner to partition the input character stream into unambiguous tokens without considering higher-level structure, thus facilitating efficient and modular compilation.[34] Formally, lexical grammars generate regular languages, which are equivalent to those accepted by nondeterministic finite automata (NFAs) and can be converted to deterministic finite automata (DFAs) for linear-time recognition during scanning.[35]

In the C programming language, the lexical grammar is detailed in Annex A.1 of the ISO/IEC 9899 standard, specifying productions for tokens such as keywords (e.g., keyword → "if" | "while" | ... ) and identifiers (e.g., identifier → nondigit (nondigit | digit)* ), where nondigit includes letters and underscore, ensuring precise token boundaries.[36]
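Because these productions describe regular languages, they can be rewritten directly as regular expressions (a small illustrative sketch; the fragment names mirror the nondigit/digit non-terminals above, but the composition itself is only an example, not part of the C standard). Each non-terminal becomes a regex fragment, and larger rules are built by substitution, just as the productions compose.

```python
import re

# Regex fragments standing in for the grammar's non-terminals.
DIGIT    = r"[0-9]"
NONDIGIT = r"[A-Za-z_]"

# identifier -> nondigit (nondigit | digit)*
IDENTIFIER = rf"{NONDIGIT}(?:{NONDIGIT}|{DIGIT})*"

# keyword -> "if" | "while" | "return"  (a tiny subset, for illustration only)
KEYWORD = r"if|while|return"

print(re.fullmatch(IDENTIFIER, "_count2") is not None)  # True
print(re.fullmatch(IDENTIFIER, "2count") is not None)   # False: must start with a nondigit
print(re.fullmatch(KEYWORD, "while") is not None)       # True
```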
Regular expressions in lexing
Regular expressions provide a formal notation for defining the patterns that identify tokens during lexical analysis. Basic elements include literals for specific characters or classes, such as digits denoted by \d to match any decimal digit. These are extended with operators like union (|), which specifies alternatives; concatenation, which sequences patterns; Kleene star (*), which allows zero or more repetitions; and plus (+), which requires one or more repetitions.[37][38]
In lexical analysis, each token category—such as keywords, identifiers, or literals—is specified by a distinct regular expression, with the overall lexer recognizing the longest prefix match among all patterns to disambiguate input. The union of these regular expressions collectively describes the valid vocabulary of the programming language's tokens.[33][38]
To implement efficient matching, regular expressions are first converted to a nondeterministic finite automaton (NFA) using Thompson's construction, which builds the NFA compositionally by introducing epsilon transitions for operators. This NFA is then transformed into a deterministic finite automaton (DFA) via the subset construction algorithm, enabling linear-time scanning by simulating sets of NFA states in parallel.[39][40]
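The subset construction can be sketched compactly (an illustrative Python sketch; the NFA here is written out by hand for the expression (a|b)*abb rather than produced by Thompson's construction, and since it happens to contain no epsilon transitions, the epsilon-closure step is omitted). Each DFA state is a frozen set of NFA states, and new states are discovered with a worklist.

```python
from collections import deque

# Tiny hand-built NFA for (a|b)*abb, given as explicit tables; a lexer generator
# would instead build the NFA from the regular expression via Thompson's construction.
NFA_TRANSITIONS = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
NFA_START, NFA_ACCEPT = 0, {3}

def move(states, symbol):
    """All NFA states reachable from `states` on `symbol` (this NFA has no epsilon edges)."""
    result = set()
    for s in states:
        result |= NFA_TRANSITIONS.get((s, symbol), set())
    return result

def subset_construction(alphabet=("a", "b")):
    start = frozenset({NFA_START})
    dfa_states, dfa_trans = {start}, {}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(move(current, symbol))
            if not target:
                continue                      # dead state omitted; missing entry means reject
            dfa_trans[(current, symbol)] = target
            if target not in dfa_states:
                dfa_states.add(target)
                queue.append(target)
    accepting = {s for s in dfa_states if s & NFA_ACCEPT}
    return start, dfa_trans, accepting

start, trans, accepting = subset_construction()

def dfa_accepts(text):
    state = start
    for ch in text:
        state = trans.get((state, ch))
        if state is None:
            return False
    return state in accepting

print(dfa_accepts("aabb"), dfa_accepts("abab"))  # True False
```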
Regular expressions are limited to recognizing regular languages and cannot handle context-sensitive constructs, such as nested comments that require tracking arbitrary depths of pairing, which demand context-free mechanisms. Lexer generators like Lex integrate regular expressions by automating this NFA-to-DFA conversion and code generation, producing optimized scanners from pattern specifications.[41][3]
Challenges
Common obstacles
One of the primary challenges in lexical analysis is resolving ambiguities that arise when multiple regular expression rules match the same input substring, potentially leading to inconsistent tokenization. To address this, lexer generators like JFlex employ the longest match rule, selecting the token corresponding to the longest prefix of the input, and for ties in length, prioritize the rule listed earlier in the specification.[42] For example, in a rule set distinguishing keywords from identifiers, the input "if8" would match as an identifier rather than the keyword "if" followed by "8", ensuring predictable behavior.[42]

Internationalization poses another obstacle, as early lexical analyzers were designed for ASCII, limiting support for global programming practices. Modern implementations, such as Python 3's lexer, overcome this by decoding source files as UTF-8 Unicode by default and permitting identifiers to include non-ASCII characters from Unicode categories like letters and combining marks, normalized via NFKC form per Unicode Standard Annex 31.[43][44] This enables code with international variable names but requires careful handling of encoding declarations and potential decoding errors to avoid syntax issues.[43]

Performance bottlenecks frequently occur in regex-based scanning due to backtracking, particularly with nondeterministic finite automata (NFAs), which can exhibit exponential time complexity on certain inputs. Deterministic finite automata (DFAs) mitigate this by enabling linear-time processing without backtracking, though they may demand more states; benchmarks on real-world pattern sets show DFAs achieving up to 1.8x throughput gains on hardware like FPGAs when patterns are processed in parallel, as seen in the YARA ruleset.[45] Thus, production lexers often convert NFAs to DFAs during generation for efficiency.[45]

Error recovery in the face of malformed input is essential to prevent complete analysis failure, yet simplistic halting can frustrate users. Common strategies involve local repairs, such as deleting extraneous characters and restarting the scan from the next valid position, or emitting special error tokens for issues like unterminated strings while skipping to the next line.[46] These approaches allow the subsequent parser to proceed with partial input, providing informative diagnostics without cascading failures.[46]

Adapting to evolving language standards demands ongoing lexer maintenance, as specification changes can alter token definitions. For instance, ECMAScript 2015 (ES6) introduced new reserved keywords like let, const, and yield, necessitating updates to distinguish them from identifiers and avoid treating legacy code as invalid.[47] Such revisions ensure compatibility but require testing against vast codebases to prevent regressions.[48]
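Python's identifier handling illustrates the Unicode requirements described above (a short demonstration using only the standard library; the specific identifiers are arbitrary examples). str.isidentifier() applies the Unicode-based identifier rules, and NFKC normalization maps compatibility characters such as the "ﬁ" ligature onto their canonical spelling, so differently encoded spellings name the same identifier.

```python
import unicodedata

# Identifiers may use non-ASCII letters; Python checks them against Unicode categories.
print("straße".isidentifier(), "π".isidentifier(), "2fast".isidentifier())  # True True False

# NFKC normalization folds compatibility characters, so the ligature "ﬁ" and the
# plain letters "fi" compare equal after normalization.
a, b = "ﬁle", "file"
print(a == b)                                                                 # False as raw strings
print(unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b))  # True
```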
Context-sensitive lexing
In lexical analysis, context-sensitive lexing arises when token recognition depends on broader program context, such as prior tokens, global state, or symbol table information, extending beyond the capabilities of regular languages defined by finite automata. This approach is essential for handling language features that require lookahead, lookbehind, or state-dependent rules, as pure regular expressions cannot capture such dependencies.[49]

Prominent examples include XML, where opening and closing tags must share identical names, necessitating the lexer to maintain state for tag matching during tokenization of markup structures. In Python, indentation-based scoping requires the lexer to track a stack of whitespace levels, emitting special INDENT and DEDENT tokens when indentation increases or decreases relative to prior lines, effectively switching modes based on accumulated context. Similarly, mode-switching occurs in languages like SQL with embedded strings, where quote delimiters inside strings (e.g., doubled single quotes for escaping) alter token boundaries depending on the current string mode. For nested structures, cases like non-regular balanced parentheses in certain domain-specific languages demand stateful tracking to correctly delineate tokens across levels, though this borders on parsing responsibilities.[49][50]

Solutions typically involve hand-written lexers using explicit state machines to manage modes and context, allowing transitions based on accumulated input history. Hybrid methods combine regular expressions for initial pattern matching with subsequent context checks, such as symbol table lookups, to disambiguate tokens. Tools like Flex support this through start conditions, enabling rule sets to activate conditionally for context-specific scanning without full backtracking. In C++, preprocessing directives like #ifdef require the lexer to enter conditional modes, skipping or including tokens based on macro expansions and prior directives.

These techniques introduce limitations by deviating from the linear-time efficiency of deterministic finite automata in standard regular lexing, often incurring overhead from state maintenance or potential nondeterminism that complicates error recovery and optimization in compilers. The added complexity can make lexer maintenance error-prone, particularly in large-scale language implementations where context rules proliferate.[49]
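A hand-written, mode-switching scanner for SQL-style string literals might look like the following (an illustrative Python sketch; the scan_string helper and its conventions are invented for this example). Inside string mode, a doubled single quote is an escaped quote rather than a terminator, so the decision depends on the current mode plus one character of lookahead.

```python
def scan_string(text, pos):
    """Scan a SQL-style string literal starting at an opening quote; a doubled '' escapes a quote."""
    assert text[pos] == "'"
    i, chars = pos + 1, []
    while i < len(text):
        if text[i] == "'":
            if i + 1 < len(text) and text[i + 1] == "'":   # doubled quote: stay in string mode
                chars.append("'")
                i += 2
                continue
            return ("STRING", "".join(chars), i + 1)        # closing quote: leave string mode
        chars.append(text[i])
        i += 1
    raise SyntaxError("unterminated string literal")

print(scan_string("'it''s fine' AND x", 0))  # ('STRING', "it's fine", 12)
```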
Advanced techniques
Phrase structure handling
In lexical analysis, whitespace and newlines typically serve to separate tokens, but in layout-sensitive languages, they play a more structural role by influencing token boundaries and generating virtual tokens that affect parsing. Beyond simple skipping, these elements can signal line continuations, trigger automatic punctuation insertion, or delineate code blocks through indentation, requiring the lexer to track positional state across lines. This handling ensures that source code layout aligns with syntactic intent without explicit delimiters like braces.[43]

Line continuation rules allow statements to span multiple lines, treating newlines as non-terminating in specific contexts to improve readability. In Python, explicit continuation uses a backslash (\) at the end of a line, which escapes the newline and joins it to the next line; a line ending in a backslash cannot carry a comment, and a backslash does not continue a token across lines except in string literals. Implicit continuation occurs within parentheses, brackets, or braces, where newlines are ignored without needing a backslash, and indentation of such lines is insignificant. Blank lines are permitted in these constructs, and no NEWLINE token is emitted between them. For example, the code total = (value1 + value2 followed by a newline and + value3) is tokenized as a single expression. Haskell employs similar implicit mechanisms in layout-sensitive contexts, where indentation governs continuation without explicit escapes.[43][51]
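Python's own tokenizer makes implicit continuation visible: inside parentheses a line break is reported as a non-logical NL token rather than a statement-ending NEWLINE (a short demonstration using the standard tokenize module; the snippet being tokenized is an arbitrary example).

```python
import io
import tokenize

source = "total = (value1 + value2\n         + value3)\n"

# The newline inside the parentheses appears as NL; the final one appears as NEWLINE.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```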
Automatic semicolon insertion (ASI) addresses ambiguities from optional statement terminators by having the lexer or parser insert semicolons at newlines when necessary to form valid syntax. In JavaScript, as defined in the ECMAScript specification, ASI applies in three cases: when a line terminator precedes a token that cannot follow without a semicolon (e.g., a return statement followed by a newline before an expression), at the end of a block or input stream if required, or after certain constructs like do-while loops. For instance, when return is followed by a line break and then value;, a semicolon is inserted after return so that value is not treated as its operand. This rule ensures newline-separated statements are properly delimited, though it can lead to unexpected behavior if layout misaligns with intent, as in a = b + c followed by a newline and (d + e).f();, where no semicolon is inserted and (d + e) is treated as an argument list for a call of c.[52]
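The restricted-production case for return can be illustrated with a toy sketch (an illustrative Python sketch only, not the ECMAScript algorithm; the token representation and the insert_semicolons helper are invented for this example). It scans a token stream annotated with line numbers and inserts a semicolon whenever a line break follows return.

```python
def insert_semicolons(tokens):
    """Toy sketch of one ASI case: a line break directly after 'return' ends the statement.
    Tokens are (lexeme, line) pairs; this is not the full ECMAScript algorithm."""
    out = []
    for i, (lexeme, line) in enumerate(tokens):
        out.append(lexeme)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if lexeme == "return" and nxt is not None and nxt[1] > line:
            out.append(";")          # restricted production: no line break may follow 'return'
    return out

# return            <- line break here
#     value;
print(insert_semicolons([("return", 1), ("value", 2), (";", 2)]))
# ['return', ';', 'value', ';']
```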
The off-side rule uses indentation to define block scopes, eliminating braces by comparing column positions relative to a reference point. In Python, the lexer maintains a stack of indentation levels; upon encountering a non-blank line, it compares the leading whitespace column to the top of the stack, emitting INDENT tokens for increased levels and DEDENT for decreases, with the initial stack at column 0. This generates virtual tokens like INDENT before the first statement in a block and DEDENTs to close them, as in a function body indented four spaces. Haskell's off-side rule, activated after keywords like where, let, do, or of without braces, similarly relies on indentation: statements aligned to or beyond the reference column continue the block, while lesser indentation closes it, often inserting virtual semicolons or braces during lexing. The lexer tracks the current column and previous reference to enforce this, treating tabs and spaces consistently in column counting.[43][51]
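The indentation stack described here can be sketched as follows (an illustrative Python sketch of the mechanism, not CPython's tokenizer; tab handling and consistency checks for mismatched dedents are omitted). Each line's leading-whitespace width is compared with the top of the stack to decide how many INDENT or DEDENT tokens to emit.

```python
def layout_tokens(lines):
    """Yield (kind, value) pairs: INDENT/DEDENT plus the stripped line content."""
    stack = [0]                                  # column 0 is always on the stack
    for line in lines:
        if not line.strip():                     # blank lines do not affect layout
            continue
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", stack[-1])
        yield ("LINE", line.strip())
    while len(stack) > 1:                        # close any blocks still open at end of input
        stack.pop()
        yield ("DEDENT", stack[-1])

src = [
    "def f(x):",
    "    if x:",
    "        return x",
    "    return 0",
]
for tok in layout_tokens(src):
    print(tok)
```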
Implementation of phrase structure handling requires the lexer to augment its state machine with layout awareness, such as a stack for indentation levels or a reference column for off-side checks, while processing input line-by-line. Upon reading a line, whitespace is measured to determine relative position, potentially emitting special tokens like INDENT or DEDENT before regular ones, and newlines are conditionally suppressed or converted based on context. This stateful approach integrates with standard tokenization, outputting a stream where layout influences the token sequence for subsequent parsing, as seen in Python's tokenizer which pushes/pops the indentation stack dynamically.[43]
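The same behavior can be observed directly in Python's standard tokenize module, which emits these virtual tokens as part of its output stream (a short demonstration; the function being tokenized is an arbitrary example).

```python
import io
import tokenize

source = "def f(x):\n    return x + 1\n"

# Print only the layout-related tokens with their (row, column) start positions.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
        print(tokenize.tok_name[tok.type], tok.start)
```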