Tokenization

Tokenization is the process of dividing a sequence of data—such as text, code, or sensitive information—into smaller, meaningful units known as tokens, enabling further analysis, processing, or secure handling across various domains in computing and beyond. In natural language processing (NLP), tokenization segments raw text into words, subwords, or characters to facilitate models' comprehension of human language, often involving the removal of punctuation or normalization of whitespace. Similarly, in compiler design and lexical analysis, it transforms source code into discrete tokens like identifiers, keywords, operators, and literals, forming the foundational step for syntax analysis and program execution. In data security, tokenization substitutes sensitive elements, such as credit card numbers or personal identifiers, with non-sensitive surrogate values (tokens) that preserve data utility without exposing originals, thereby enhancing compliance with privacy regulations like PCI DSS. In finance and blockchain technology, asset tokenization converts real-world assets—ranging from real estate to securities—into digital tokens on distributed ledgers, promoting fractional ownership, increased liquidity, and streamlined trading while reducing intermediaries.

The technique's versatility stems from its role as a preprocessing step that standardizes input for algorithms, though challenges vary by context: for instance, handling ambiguities in natural languages (e.g., contractions or compound words) requires sophisticated rules or models like Byte-Pair Encoding (BPE) for subword tokenization in modern large language models. In programming languages, tokenization must resolve longest-match (maximal munch) rules to avoid misinterpretation of ambiguous sequences, such as distinguishing operators from identifiers. Security implementations often employ vault-based systems where mappings between tokens and originals are securely stored, ensuring reversibility only under controlled access. Meanwhile, blockchain tokenization leverages standards like ERC-20 or ERC-721 for fungible and non-fungible tokens, respectively, to represent ownership and enable smart contract interactions.

Historically, tokenization traces back to early information retrieval and compiler theory in the mid-20th century, evolving with advances in machine learning and distributed systems; today, it underpins critical applications from chatbots and fraud detection to decentralized finance (DeFi), with ongoing research addressing efficiency in handling diverse languages and data types. Its importance has surged with the rise of generative AI, where token limits directly impact model performance and cost, highlighting the need for optimized tokenizers that balance vocabulary size and coverage.

In natural language processing

Definition and purpose

In natural language processing (NLP), tokenization is the process of dividing raw text into smaller, discrete units called tokens, which may include words, phrases, subwords, or other linguistically meaningful segments to facilitate computational analysis. This segmentation transforms unstructured text into a structured sequence that machines can process, addressing challenges like punctuation attachment and irregular spacing. For example, the English sentence "Hello, world!" is commonly tokenized as ["Hello", ",", "world", "!"], separating content words from punctuation to preserve semantic and syntactic boundaries.

The primary purpose of tokenization is to standardize input data for downstream tasks, ensuring consistency and enabling efficient handling of text variations across languages and domains. By breaking text into tokens, it supports processes such as part-of-speech tagging, where grammatical categories are identified; sentiment analysis, which evaluates emotional tone; named entity recognition, for extracting entities like persons or locations; and machine translation, which maps sequences between languages. This foundational step mitigates issues like out-of-vocabulary words and adapts to diverse text structures, such as contractions in English or character-based writing in languages like Chinese.

Tokenization traces its origins to the 1950s, when it emerged as a core component of early information retrieval systems designed for indexing and searching text documents in growing collections. Pioneering efforts, such as those by Calvin Mooers, who formalized the term "information retrieval" around 1950, relied on basic text segmentation to create searchable term lists, laying the groundwork for automated document processing. These systems, including demonstrations of auto-indexing at conferences, highlighted tokenization's role in enabling rapid access to relevant content amid the computational limitations of the era.
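
A minimal illustration of this kind of word-and-punctuation splitting can be written with a single regular expression; the pattern below is a toy example for the "Hello, world!" case above, not the rule set of any production tokenizer.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    # Keep runs of word characters as tokens and emit each punctuation
    # mark as its own token; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```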

Types of tokenization

Tokenization in NLP varies by granularity, with methods categorized primarily as sentence-level, word-level, subword-level, and character-level approaches. These types determine the unit size of text segments fed into models, balancing vocabulary efficiency, handling of linguistic diversity, and computational demands.

Sentence-level tokenization identifies sentence boundaries in text using rules based on markers such as periods, question marks, and exclamation points, often combined with contextual cues to avoid false splits on abbreviations or decimals. This approach is essential for tasks involving discourse-level structure, as it organizes text into coherent units for higher-level processing like coreference resolution or summarization.

Word-level tokenization splits text primarily on whitespace and punctuation, treating each resulting unit as a token corresponding to a full word. It performs adequately for languages with clear word boundaries, such as English, but encounters challenges in agglutinative languages like Turkish, where complex morphological affixes create long, variable forms that lead to vocabulary explosion and poor out-of-vocabulary handling.

Subword-level tokenization addresses limitations of word-level methods by decomposing words into smaller meaningful units, such as morphemes or frequent n-grams, to manage out-of-vocabulary words and reduce overall vocabulary size. Common techniques include Byte-Pair Encoding (BPE) and WordPiece, which iteratively merge character pairs or select subwords based on likelihood to form a compact set; this results in vocabularies of approximately 30,000–50,000 tokens, enabling efficient modeling while preserving morphological structure. Subword approaches are particularly effective for morphologically rich languages, though they may occasionally split affixes suboptimally.

Character-level tokenization treats individual characters (or bytes) as the basic units, providing maximal flexibility for low-resource languages without predefined word boundaries and eliminating out-of-vocabulary issues entirely. However, it significantly increases sequence lengths—often by a factor of 4–5 compared to word-level—raising computational costs for models with fixed context windows. This method suits scenarios prioritizing robustness over efficiency, such as early character-level neural machine translation systems.
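
The difference in granularity can be made concrete with a short sketch; the subword split shown is a plausible, hand-picked decomposition standing in for what a trained BPE or WordPiece model might produce, not actual model output.

```python
text = "unbelievably"

# Word-level: the whole word is a single token (the vocabulary must contain it).
word_tokens = [text]

# Subword-level: a plausible decomposition of the kind a trained BPE/WordPiece
# model might produce (hard-coded here purely for illustration).
subword_tokens = ["un", "believ", "ably"]

# Character-level: every character is a token, so sequences grow much longer.
char_tokens = list(text)

for name, tokens in [("word", word_tokens),
                     ("subword", subword_tokens),
                     ("character", char_tokens)]:
    print(f"{name:9s} -> {len(tokens):2d} tokens: {tokens}")
```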

Common algorithms and techniques

Tokenization in NLP employs a variety of algorithms and techniques to split text into meaningful units, ranging from simple rule-based methods to sophisticated subword approaches. Rule-based tokenization relies on predefined patterns, such as splitting on whitespace for languages like English, where words are typically separated by spaces, and using regular expressions to handle punctuation and special characters. This method is straightforward and computationally efficient but often requires language-specific adjustments, as it struggles with morphological variations or scripts without explicit delimiters.

Statistical methods address these limitations, particularly for languages lacking clear word boundaries, such as Chinese. Hidden Markov Models (HMMs) model word segmentation as a sequence labeling task, where characters are tagged with labels indicating the start, end, or continuation of words, using Viterbi decoding to find the most probable segmentation. For instance, hierarchical HMMs integrate segmentation with unknown word detection, improving accuracy on complex texts by capturing dependencies across levels. These approaches leverage probabilistic transitions trained on annotated corpora, offering robustness to ambiguity compared to pure rule-based systems.

Subword tokenization has become prevalent for handling rare words and out-of-vocabulary (OOV) issues in modern models. Byte-Pair Encoding (BPE), introduced to NLP by Sennrich et al. in 2016, starts with a base vocabulary of individual characters and iteratively merges the most frequent adjacent symbol pairs to build larger subwords. The merge selection is determined by (x^*, y^*) = \arg\max_{(x, y)} f(xy), where f(xy) is the frequency of the pair xy in the training corpus. This process continues for a predefined number of merge operations, resulting in a vocabulary that decomposes unseen words into known subword units, effectively reducing OOV rates to near zero. BPE's frequency-based pairing enables open-vocabulary handling without explicit morphological rules.

WordPiece, originally developed by Schuster and Nakajima in 2012 and popularized in BERT by Devlin et al. in 2018, operates similarly but selects merges to maximize the likelihood of the training data under a language model over subword probabilities. It prioritizes splits that minimize the overall loss in a likelihood-based objective, making it suitable for morphologically rich languages. The Unigram Language Model, part of the SentencePiece toolkit introduced by Kudo in 2018, takes a probabilistic approach by starting with a large initial vocabulary and iteratively pruning low-probability subwords to optimize a unigram model fit. This method excels in multilingual settings by treating text as a sequence of independent subword tokens, each assigned a probability P(w) = \frac{f(w)}{\sum_{w'} f(w')}.

Practical implementations are facilitated by libraries like NLTK, whose word_tokenize() function combines rule-based splitting with the Penn Treebank tokenizer for English-centric tasks. spaCy's tokenizer component uses a combination of rules and statistical models, customizable via language-specific exception rules for efficient pipeline integration. For large-scale applications in modern large language models, the Hugging Face Tokenizers library provides fast, Rust-based implementations of BPE, WordPiece, and Unigram, supporting rapid training and inference on massive corpora.
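
The frequency-based merge loop at the heart of BPE can be sketched in a few lines; the toy corpus and merge count below are illustrative, and production implementations such as the libraries mentioned above add byte-level handling, vocabulary serialization, and efficiency optimizations.

```python
import re
from collections import Counter

def pair_frequencies(vocab):
    # f(xy): frequency of each adjacent symbol pair across the corpus.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def apply_merge(pair, vocab):
    # Replace every occurrence of the chosen pair with the merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated symbols, with their corpus counts.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(8):                       # number of merge operations
    pairs = pair_frequencies(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)        # argmax over pair frequency f(xy)
    vocab = apply_merge(best, vocab)
    print(f"merge {step + 1}: {best}")
```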

Role in machine learning models

Tokenization serves as a critical preprocessing step in natural language processing (NLP) pipelines for machine learning models, particularly transformers, where raw text is segmented into tokens and mapped to numerical IDs from a fixed vocabulary. This conversion enables models to process sequences of integers rather than unstructured text, directly influencing input size, computational efficiency, and overall performance; for instance, vocabulary sizes typically range from 30,000 to 100,000 tokens, balancing coverage of common words and subwords while minimizing the dimensionality of embedding layers. In the original Transformer architecture, subword tokenization via byte-pair encoding (BPE) or word-piece models was employed to handle variable-length inputs, with shared vocabularies of approximately 32,000–37,000 tokens for machine translation tasks, ensuring robust representation without excessive fragmentation.

In large language models (LLMs) like those in the GPT series, tokenization dictates the handling of fixed vocabularies, often around 50,000 tokens for GPT-2 and GPT-3, which necessitates padding shorter sequences with special tokens and applying attention masks to ignore them during processing. This process affects the effective context window—the maximum number of tokens the model can process at once—limited to 1,024 tokens in early GPT models—and influences downstream tasks like generation and comprehension by constraining the amount of information models can attend to simultaneously. Padding and masking prevent information leakage from non-existent tokens but introduce overhead, as masked positions still require computational resources in self-attention mechanisms, thereby impacting training and inference efficiency.

Tokenization addresses key challenges in multilingual text processing by enabling models to represent diverse scripts and languages through subword units, which improve embedding quality by capturing morphological similarities across languages and enhance training efficiency by reducing out-of-vocabulary issues. For low-resource languages, however, inefficient tokenization can lead to longer sequences, increasing computational costs and degrading embedding coherence, as seen in analyses where multilingual tokenizers struggle with code-mixing without customization. A key metric in this domain is token efficiency, measuring how subword approaches like BPE compress text; for English, this averages about 1.3 tokens per word, lowering the effective sequence length and computational demands compared to character-level tokenization.

The 2017 introduction of the Transformer architecture spurred the development of custom tokenizers, shifting from generic methods to tailored subword strategies that better accommodate code-mixing in multilingual models, such as those handling English-Spanish mixtures by incorporating language-specific merges during training. This evolution has enabled LLMs to achieve higher performance on cross-lingual tasks, with custom vocabularies optimizing for underrepresented languages and reducing token bloat in mixed inputs.
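
A small sketch of the ID mapping, padding, and attention-mask bookkeeping described above; the toy vocabulary and token strings are invented for illustration and do not correspond to any real model's tokenizer.

```python
# Toy vocabulary mapping subword strings to integer IDs (invented for illustration).
vocab = {"<pad>": 0, "<unk>": 1, "token": 2, "ization": 3, "maps": 4,
         "text": 5, "to": 6, "ids": 7}

def encode(tokens, max_len):
    # Map tokens to IDs, falling back to <unk> for out-of-vocabulary items.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    attention_mask = [1] * len(ids)
    # Pad short sequences with the <pad> ID and mask those positions out,
    # so self-attention ignores them during training and inference.
    while len(ids) < max_len:
        ids.append(vocab["<pad>"])
        attention_mask.append(0)
    return ids[:max_len], attention_mask[:max_len]

ids, mask = encode(["token", "ization", "maps", "text", "to", "ids"], max_len=8)
print(ids)   # [2, 3, 4, 5, 6, 7, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```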

In lexical analysis

Definition and process

In lexical analysis, tokenization refers to the process of scanning an input stream of characters from a source program and grouping them into a sequence of tokens according to the lexical grammar of the programming language. This is the first phase of compilation, where the lexical analyzer, or scanner, converts the raw character stream into a structured token stream that can be processed by subsequent phases like syntax analysis. Tokens represent meaningful syntactic units, such as keywords, identifiers, operators, and literals, each classified by type and often accompanied by attributes like value or position.

The tokenization process involves reading the input character by character from left to right, applying pattern-matching rules to identify lexemes—sequences of characters that match token definitions—and producing tokens upon recognition. Token patterns are typically specified using regular expressions, which are converted into finite automata for efficient implementation: first to a non-deterministic finite automaton (NFA) via Thompson's construction, then to a deterministic finite automaton (DFA) using subset construction to enable linear-time scanning. The DFA guides state transitions based on input characters, recognizing the longest possible valid lexeme (maximal munch rule) and outputting it to the parser while ignoring whitespace and comments. This approach ensures an unambiguous breakdown, producing a clean token stream for parsing.

Tokenization in compilers originated in the 1950s with early systems like the FORTRAN I compiler developed at IBM, which processed source input into basic elements despite lacking modern formal structure—such as allowing variables to overlap with reserved words. The concepts were formalized in the 1960s and 1970s through foundational work on regular expressions and finite automata for scanners, as detailed in seminal texts like Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman (first edition 1986, building on their 1977 book). This formalization established lexical analysis as a rigorous, automata-driven phase distinct from earlier ad-hoc methods.

A representative example is the C-like code snippet int x = 5;, which the lexical analyzer tokenizes into the sequence: keyword "int", identifier "x", operator "=", number "5", and punctuation ";". Unlike tokenization in natural language processing, which handles flexible and ambiguous text, lexical analysis enforces strict, unambiguous rules based on the language's formal grammar.
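
The scanning step can be approximated with an ordered set of regular expressions; the sketch below tokenizes the int x = 5; example using Python's re module, with pattern ordering standing in for the longest-match behavior a real DFA-based scanner would guarantee.

```python
import re

# Ordered token patterns; putting "==" before "=" mimics longest-match behavior.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|if|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"==|="),
    ("PUNCT",      r";"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source):
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":      # whitespace is consumed but not emitted
            yield (match.lastgroup, match.group())

print(list(scan("int x = 5;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='),
#  ('NUMBER', '5'), ('PUNCT', ';')]
```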

Token categories

In lexical analysis for programming languages, tokens are classified into standard categories that represent the basic syntactic units of source code. These categories enable the parser to interpret the structure and semantics of the program by grouping sequences of characters (lexemes) into meaningful units. The primary token categories include keywords, identifiers, literals, operators, and separators, with additional handling for elements like comments and whitespace, which are typically ignored during tokenization.

Keywords are reserved words with predefined meanings in the language, forming a fixed set that cannot be used for other purposes such as variable names. Examples include "if" for conditional statements and "while" for loops, which direct the parser to specific syntactic constructs. In C++, there are 95 such keywords as of the C++20 standard, encompassing core control structures, type specifiers, and newer features like coroutines.

Identifiers represent user-defined names for entities like variables, functions, and classes, following language-specific rules for formation. They typically consist of alphanumeric characters and underscores, but must not start with a digit, to distinguish them from numbers. For instance, "myVariable" or "calculateSum" qualifies as an identifier, allowing programmers to label program elements uniquely.

Literals are fixed-value constants embedded directly in the code, serving as immediate data for the program. Common subtypes include string literals like "hello", numeric literals such as integers (42) or floating-point numbers (3.14), and boolean literals (true or false). These provide concrete values without requiring computation or reference.

Operators are symbols that denote operations on operands, classified as unary (acting on one operand, e.g., the negation operator -) or binary (acting on two, e.g., + or ==). Logical operators like && (AND) fall into the binary category, facilitating expressions for computation and control flow. Separators, also known as punctuators or delimiters, include structural symbols such as brackets {}, commas ,, and semicolons ;, which organize code into blocks, separate arguments, and terminate statements.

During lexical analysis, comments and whitespace are generally skipped and not treated as tokens, as they serve documentation or formatting purposes rather than syntactic roles. In languages like C and C++, preprocessor directives such as #include function as special tokens, processed before standard lexical analysis to handle inclusions, macros, and conditional compilation.
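
As a small illustration of these categories, Python's standard library can classify a few lexemes by keyword and identifier rules; the lexemes and category labels are chosen for illustration (note that in Python 3, True and False are themselves keywords, unlike the boolean literals of C++).

```python
import keyword

# Classify a few lexemes using Python's own lexical rules: the keyword module
# knows the reserved words, and str.isidentifier checks identifier formation.
lexemes = ["if", "while", "myVariable", "calculateSum", "42", "3.14", "True", "&&", ";"]

for lex in lexemes:
    if keyword.iskeyword(lex):
        category = "keyword"               # includes True/False/None in Python 3
    elif lex.isidentifier():
        category = "identifier"
    elif lex.replace(".", "", 1).isdigit():
        category = "numeric literal"
    else:
        category = "operator/separator"    # everything else in this toy set
    print(f"{lex!r:16} -> {category}")
```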

Implementation in compilers

In compiler design, tokenization during lexical analysis is typically implemented using finite automata to efficiently recognize patterns defined by regular expressions for token categories such as keywords, identifiers, and operators. Regular expressions for these patterns are first converted into nondeterministic finite automata (NFAs), which are then transformed into deterministic finite automata (DFAs) for faster execution, allowing the lexer to scan input streams in linear time relative to the input size. This approach ensures that the lexer can match the longest possible prefix of the input that corresponds to a valid token, resolving ambiguities through the maximal munch principle, where the automaton selects the longest matching lexeme rather than a shorter one, such as preferring "==" over a single "=" in languages like C.

Lexers can be either hand-written or generated automatically from specifications. Hand-written lexers, often implemented as state machines in code, provide fine-grained control over performance and error recovery, making them suitable for high-speed applications where custom optimizations are needed, such as in production compilers like GCC and Clang. In contrast, generated lexers use tools like Lex or its open-source successor Flex, which compile regular expression rules from a .l file into efficient C code that implements a DFA-based scanner, automating the construction while supporting actions for each matched token. Flex-generated scanners are widely adopted for their balance of ease of development and speed, often outperforming naive implementations by avoiding backtracking.

Error handling in lexer implementation involves detecting invalid sequences that do not match any token pattern and reporting them with contextual details, such as line numbers and positions, to aid debugging. Buffering techniques, such as two-buffer schemes with sentinels (e.g., marking buffer ends with a non-input character like EOF), enable lookahead operations without excessive system calls, allowing the lexer to peek ahead while maintaining efficiency; for instance, a forward pointer can hold upcoming characters for resolving token boundaries. Upon encountering an error, such as an unclosed string literal, the lexer may skip to the next valid token or insert a synthetic one for recovery, preventing total failure.

Practical examples illustrate these techniques in standard libraries. Java's StreamTokenizer class implements a simple hand-crafted lexer that parses input streams into tokens like numbers, identifiers, and quoted strings, using a character table for classification and supporting custom delimiters via flags. Similarly, Python's tokenize module provides a scanner for Python source code, generating tokens including comments and handling indentation-aware structures through a line-by-line generator that applies regular expressions for patterns like operators and keywords. These tools demonstrate deterministic matching aligned with token categories such as literals and symbols.

In modern just-in-time (JIT) compilers, tokenization is optimized for raw throughput, since scanning and parsing happen at runtime. For example, the V8 engine for JavaScript employs a hand-optimized scanner that processes input up to 2.1 times faster than prior versions by minimizing state transitions and integrating lookahead directly into its token generation pipeline, enabling rapid parsing during runtime compilation.
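
A short example of the Python tokenize module mentioned above, which exposes each token's type, lexeme, and position; the sample source line is arbitrary.

```python
import io
import tokenize
from token import tok_name

source = "total = price * 2  # compute doubled price\n"

# generate_tokens takes a readline callable over text and yields named tuples
# carrying the token type, the lexeme string, and (row, column) positions.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(f"{tok_name[tok.type]:10} {tok.string!r:32} start={tok.start}")
```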

In data security

Definition and mechanism

In data security, tokenization is the process of replacing sensitive data elements, such as credit card numbers or personal identification numbers, with non-sensitive surrogate values known as tokens. These tokens have no independent value and cannot be used to derive the original data without access to a secure mapping system. The primary goal is to protect sensitive information by ensuring that even if a token is compromised in a data breach, it reveals no useful information about the original value.

The core mechanism involves substituting the original sensitive value with a randomized surrogate token, with the mapping stored in a secure database or "token vault" that maintains the correspondence between the token and the original data. This vault is isolated from the systems handling tokenized data and is accessible only by authorized entities through strict access controls. Tokenization can be format-preserving, where the token retains the same length, character types, and format as the original (e.g., a 16-digit card number remains 16 digits), or non-format-preserving, which may alter the format entirely. The process is reversible via detokenization, but only by systems with vault access; unauthorized attempts yield no meaningful results because the token has no inherent mathematical relationship to the source data.

Tokenization emerged in the early 2000s as a response to increasing data breach risks and the need for compliance with industry standards, particularly following the introduction of the PCI Data Security Standard (PCI DSS) in 2004. Early implementations focused on payment card data to minimize the storage and transmission of sensitive cardholder data, reducing the scope of PCI DSS requirements. For PIN block handling, related standards like ANSI X9.24-1 (revised in 2004) provided foundational key-management practices that supported secure tokenization in financial environments.

A representative example illustrates the mechanism: a primary account number (PAN) like "3124 0059 1723 387" is replaced with a token such as "7aF1Zx11 8523mw4c wl5x2," which preserves the length and structure for seamless integration into existing systems. Detokenization reverses this by querying the vault to retrieve the original PAN. Unlike encryption, where the ciphertext has a mathematical relationship to the plaintext via a key (potentially allowing decryption if the key is compromised), tokens bear no such relation, rendering them useless outside the vault and significantly mitigating breach impacts.
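
A minimal sketch of the vault-based mechanism, assuming an in-memory mapping and a simple digit-for-digit surrogate to mimic format preservation; real deployments encrypt the vault, enforce access control, and handle collisions and scale.

```python
import secrets

class TokenVault:
    # Toy vault-based tokenizer: random surrogate tokens map back to the
    # original values only through this vault. Real deployments encrypt the
    # vault, enforce strict access control, and handle collisions at scale.

    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        # Format-preserving surrogate: swap each digit for a random digit,
        # leaving separators such as spaces intact.
        while True:
            token = "".join(secrets.choice("0123456789") if ch.isdigit() else ch
                            for ch in value)
            if token != value and token not in self._token_to_value:
                break
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only callers with vault access can reverse the mapping.
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("3124 0059 1723 387")
print(token)                    # e.g. '8841 2207 5530 194' (random each run)
print(vault.detokenize(token))  # '3124 0059 1723 387'
```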

Types and variants

Tokenization in data security encompasses several specialized variants adapted to particular data types and applications, each designed to replace sensitive information with non-sensitive tokens while maintaining data utility. These variants differ in their mechanisms, reversibility, and integration with other security measures, allowing organizations to address specific risks such as data breaches or unauthorized access in diverse environments.

Payment tokenization is a prominent variant focused on financial transactions, where the Primary Account Number (PAN) on payment cards is replaced with a unique, format-preserving token to mitigate risks in e-commerce and point-of-sale systems. This approach, standardized by EMVCo, ensures that the token can be used throughout the payment ecosystem—from merchant acceptance to settlement—without exposing the original PAN, thereby reducing the impact of compromised merchant systems. A key feature of payment tokens is domain restriction, which limits their validity to specific contexts, such as a particular merchant or device, to further constrain potential misuse if a token is intercepted. Apple's introduction of Apple Pay in 2014 exemplified this variant's adoption in mobile payments, generating device-specific account numbers stored securely on the device rather than on servers, which popularized tokenized transactions for contactless payments.

Biometric tokenization addresses the protection of sensitive physiological data, such as fingerprint scans or facial templates, by substituting the raw biometric data with tokens during authentication processes. This variant prevents the storage of actual biometric data, which cannot be changed if compromised, instead relying on derived tokens to verify identity while preserving privacy.

For general personally identifiable information (PII), such as Social Security Numbers (SSNs), tokenization variants include vault-based and stateless (vaultless) methods to secure non-financial sensitive data across databases and applications. Vault-based tokenization stores the mapping between original PII and tokens in a secure, centralized database, allowing reversible tokenization where authorized systems can retrieve the original values as needed. In contrast, stateless tokenization employs cryptographic algorithms, such as format-preserving encryption standards, to generate and detokenize values without maintaining a vault, offering greater scalability and reduced storage overhead for large-scale PII protection.

Hybrid approaches integrate tokenization with encryption to provide multi-layer protection, where data is first tokenized to remove sensitive values and then the tokens or surrounding data are encrypted for additional protection against unauthorized access. This combination pairs tokenization's data substitution with encryption's scrambling, enhancing resilience in environments handling mixed data types.
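
A conceptual sketch of domain restriction for payment tokens, in which detokenization succeeds only for the merchant the token was issued to; the class and field names are hypothetical and do not reflect EMVCo message formats or any token service provider's actual API.

```python
from dataclasses import dataclass

@dataclass
class TokenRecord:
    pan: str
    allowed_merchant: str      # domain restriction: token is valid here only

class DomainRestrictedVault:
    # Conceptual sketch only: issuance and detokenization checks are reduced
    # to a dictionary lookup plus a domain comparison.

    def __init__(self):
        self._records = {}

    def issue(self, pan: str, token: str, merchant_id: str) -> str:
        self._records[token] = TokenRecord(pan=pan, allowed_merchant=merchant_id)
        return token

    def detokenize(self, token: str, merchant_id: str) -> str:
        record = self._records[token]
        if merchant_id != record.allowed_merchant:
            raise PermissionError("token is not valid in this domain")
        return record.pan

vault = DomainRestrictedVault()
vault.issue("4111111111111111", "tok_2f9c81", merchant_id="merchant-A")
print(vault.detokenize("tok_2f9c81", merchant_id="merchant-A"))   # original PAN
# vault.detokenize("tok_2f9c81", merchant_id="merchant-B")        # PermissionError
```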

Standards and applications

Tokenization in data security is governed by several key standards and regulations that promote its use to protect sensitive information. The Payment Card Industry Data Security Standard (PCI DSS) version 4.0, released in 2022, recommends tokenization as a method to protect cardholder data by replacing primary account numbers (PANs) with non-sensitive tokens, thereby reducing the scope of compliance audits for merchants and service providers; many enhanced requirements became mandatory as of March 31, 2025. The General Data Protection Regulation (GDPR), effective in 2018, encourages pseudonymization techniques, including tokenization, as a means to minimize risks associated with processing personal data while allowing for reversible mappings in controlled environments. Additionally, the ISO 20022 standard for financial messaging facilitates the secure exchange of tokenized data, enabling interoperability in cross-border transactions and supporting the integration of tokenized assets into traditional payment systems.

In practical applications, tokenization is widely deployed across industries to comply with these standards and enhance data protection. In payment processing, it reduces the PCI DSS compliance scope by limiting the storage and transmission of actual card data, allowing merchants to outsource sensitive operations to token service providers. In healthcare, tokenization supports HIPAA compliance by replacing protected health information (PHI) with tokens, enabling secure data sharing for research and analytics without exposing identifiable details. For cloud deployments, services like AWS provide tokenization solutions that replace sensitive data elements with unique identifiers, ensuring compliance and security in distributed environments.

The adoption of tokenization yields measurable benefits, such as reducing the value of breached data—since tokens hold no intrinsic worth to attackers—and enabling secure analytics on datasets without direct exposure of originals. As of October 2025, Visa reported issuing over 16 billion tokens, with tokenized card-not-present (CNP) transactions showing significantly higher authorization rates and fraud reduction compared to non-tokenized ones. However, implementations must address challenges like securing the mappings held in token vaults to prevent unauthorized detokenization and ensuring scalability for processing high-volume transactions without performance degradation.

In asset management and blockchain

Definition and concept

Tokenization in asset management and blockchain refers to the process of converting rights to physical or intangible assets—such as real estate, equities, or commodities—into divisible digital tokens on a blockchain, enabling the representation of ownership in a secure, transparent manner. These tokens serve as digital certificates that encapsulate the underlying asset's value and ownership rights, allowing for seamless transfer and trading without traditional intermediaries.

At its core, the concept facilitates fractional ownership, where high-value assets can be divided into smaller, accessible units for multiple investors, contrasting with conventional securities that often require full-unit purchases and operate within limited trading hours. This enhances liquidity by enabling 24/7 global trading and instant settlement on blockchain networks, democratizing access to investments previously reserved for institutional players.

The practice traces its roots to the 2017 initial coin offering (ICO) boom, when projects began issuing tokens to fundraise and represent asset-like interests, though many early efforts faced regulatory scrutiny. It gained mainstream traction with the 2020 surge in decentralized finance (DeFi), as protocols integrated tokenization to unlock liquidity for real-world assets beyond cryptocurrencies.

A representative example involves tokenizing a $1 million property into 1,000 digital tokens, each denoting 0.1% ownership, which investors can trade individually to gain exposure proportional to rental income or appreciation without acquiring the entire asset. Within this framework, security tokens—designed to comply with financial regulations and represent investment-grade ownership—differ from utility tokens, which primarily grant access to platform services rather than ownership or profit-sharing rights.
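
The fractional-ownership arithmetic in the example above works out as follows; the rental income figure is a hypothetical assumption added only to show pro-rata distribution.

```python
# Worked example of the fractionalization described above.
property_value = 1_000_000      # USD valuation of the tokenized property
total_tokens = 1_000            # digital tokens issued against it

value_per_token = property_value / total_tokens           # $1,000 per token
ownership_per_token = 1 / total_tokens                     # 0.1% per token
print(f"Each token: ${value_per_token:,.0f} of value, {ownership_per_token:.1%} ownership")

investor_tokens = 25
annual_rent = 60_000            # hypothetical net rental income for the year
investor_share = annual_rent * investor_tokens / total_tokens
print(f"Pro-rata rent for {investor_tokens} tokens: ${investor_share:,.0f}")  # $1,500
```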

Tokenization process

The tokenization process for assets on blockchain platforms begins with asset valuation and legal structuring, where the underlying asset—such as real estate, securities, or commodities—is appraised by independent experts to determine its market value, and a legal framework is established to define ownership rights, transfer restrictions, and compliance requirements through special purpose vehicles (SPVs) or trusts. This step ensures that the tokenized representation accurately reflects the asset's economic and legal attributes, often involving collaboration between issuers, legal advisors, and valuation firms to create binding documentation like tokenization agreements.

Following valuation, smart contract creation occurs, where developers encode the token's rules and functionalities into blockchain-compatible code, commonly using standards like ERC-1400 on Ethereum for security tokens, which incorporates features such as transfer restrictions, document references, and compliance controls to mimic traditional securities. These contracts are then deployed and audited by third-party firms to verify security and functionality, preventing vulnerabilities like reentrancy attacks. The tokens are subsequently issued on a chosen blockchain—such as Ethereum for its robust ecosystem, or a lower-cost network for cheaper and faster transactions—through a minting process that generates digital representations proportional to the asset's value. Distribution typically happens via Security Token Offerings (STOs), where tokens are sold to accredited investors through regulated platforms, ensuring compliance and automated settlement on the blockchain.

Technical aspects include integrating oracles, such as those from Chainlink, to feed real-time off-chain data—like asset prices or performance metrics—into smart contracts for accurate valuation and updates. Compliance is further enforced through KYC/AML procedures, where investor identities are verified via automated checks linked to the smart contracts, restricting transfers to approved parties. Platforms like Polymath and Securitize facilitate this process by providing end-to-end tools for security token issuance, including template smart contracts, compliance modules, and investor onboarding. The entire process generally takes 3–6 months, encompassing legal setup, development, audits, and regulatory approvals, to mitigate risks and ensure robustness. For instance, tZERO is planning to launch its network with up to $1 billion in tokenized assets as of 2025, demonstrating scalable issuance of digital securities.

A key concept in token transfers is the atomic swap, which enables exchanges of tokenized assets across blockchains without intermediaries, using hashed time-lock contracts to guarantee either simultaneous completion or full reversal, thus ensuring settlement finality and reducing counterparty risk. This mechanism supports efficient, trustless trading of tokenized assets.
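
The hashed time-lock idea behind atomic swaps can be sketched off-chain in a few lines; in practice this logic lives in smart contract code on each chain, and the class below is only a conceptual illustration.

```python
import hashlib
import time

class HashedTimeLock:
    # Off-chain sketch of the hashed time-lock logic used in atomic swaps:
    # the counterparty can claim only by revealing the secret preimage before
    # the deadline; otherwise the original owner can reclaim the asset.

    def __init__(self, secret_hash: bytes, expires_at: float):
        self.secret_hash = secret_hash
        self.expires_at = expires_at
        self.claimed = False

    def claim(self, preimage: bytes) -> bool:
        if time.time() >= self.expires_at:
            return False                                  # lock expired
        if hashlib.sha256(preimage).digest() != self.secret_hash:
            return False                                  # wrong secret
        self.claimed = True
        return True

    def refund(self) -> bool:
        # The issuer recovers the asset only after expiry and only if unclaimed.
        return time.time() >= self.expires_at and not self.claimed

secret = b"swap-secret"
lock = HashedTimeLock(hashlib.sha256(secret).digest(), expires_at=time.time() + 3600)
print(lock.claim(b"wrong-secret"))  # False: hash does not match
print(lock.claim(secret))           # True: revealing the preimage completes the swap
```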

Benefits and challenges

Asset tokenization offers several key benefits that enhance the efficiency and accessibility of traditional asset markets. One primary advantage is increased liquidity through fractional trading, allowing investors to buy smaller portions of high-value assets like real estate or fine art, which were previously inaccessible to many. This fractionalization democratizes investment opportunities and broadens market participation by lowering entry barriers. Additionally, tokenization reduces costs by eliminating intermediaries such as brokers and custodians, streamlining transactions on blockchain networks. It also enables global accessibility, permitting 24/7 trading from anywhere with internet connectivity, without geographic restrictions typical of legacy systems. Furthermore, the inherent transparency of blockchain ensures immutable records of ownership and transactions, fostering trust and reducing fraud risks. These benefits have driven substantial market potential, with estimates projecting the tokenized asset economy to reach $16 trillion by 2030, representing about 10% of global GDP, according to a Boston Consulting Group report. Tokenization also accelerates settlement times from the traditional T+2 cycle (two business days) to near-instantaneous execution on blockchains, improving capital efficiency and reducing counterparty risks. Recent developments as of 2025 include tokenized U.S. Treasuries surpassing $8 billion in value and Securitize's agreement to go public via a $1.25 billion SPAC deal, underscoring institutional momentum in the sector.

Despite these advantages, asset tokenization faces notable challenges that could hinder widespread adoption. Volatility risks remain prominent, as tokenized assets are often exposed to cryptocurrency market fluctuations, amplifying price instability for underlying real-world assets. Scalability issues, particularly on networks like Ethereum, lead to high gas fees during peak usage, making frequent transactions costly and deterring smaller investors. Interoperability between different blockchain protocols poses another hurdle, as assets tokenized on one chain may not seamlessly transfer or trade on another, fragmenting liquidity across ecosystems.

A prominent example is real estate tokenization, where the process enables fractional ownership of properties, allowing investors to participate with minimal capital and potentially yielding diversified portfolios. This democratizes access to an asset class traditionally dominated by wealthy individuals or institutions, enhancing liquidity through secondary markets for property tokens. However, challenges arise if secondary markets remain underdeveloped, leading to persistent illiquidity despite tokenization, as trading volumes may not match traditional exchanges. Early growth in tokenized real-world assets (RWAs) underscores this potential, with platforms like Securitize—backed by BlackRock—reporting over $4 billion in tokenized assets as of October 2025, signaling institutional interest amid ongoing market maturation.

Regulatory considerations

In the United States, the Securities and Exchange Commission (SEC) frequently classifies tokenized assets as securities, applying the Howey Test—a 1946 precedent that identifies an investment contract as involving an investment of money in a common enterprise with an expectation of profits derived from the efforts of others. This approach subjects tokenized offerings to federal securities laws, requiring issuers to navigate disclosure and registration obligations to avoid enforcement actions. In the European Union, the Markets in Crypto-Assets Regulation (MiCA), adopted in 2023 and fully applicable since December 2024, establishes a harmonized regime for crypto-assets, including tokenized securities and other digital representations of value, to enhance investor protection, market transparency, and regulatory consistency across member states. MiCA categorizes assets into asset-referenced tokens (stablecoins), e-money tokens, and other crypto-assets, mandating licensing for service providers and prohibiting anonymous transactions above certain thresholds.

Compliance with these regulations typically involves registering Security Token Offerings (STOs) with oversight bodies like the SEC, verifying investor accreditation to limit participation to sophisticated or high-net-worth individuals, and implementing robust anti-money laundering (AML) and counter-terrorist financing (CTF) controls aligned with Financial Action Task Force (FATF) standards for virtual assets and service providers. The FATF guidelines, updated in 2021 and further refined in 2025, require virtual asset service providers to conduct customer due diligence, monitor transactions, and report suspicious activities, applying a risk-based approach to mitigate illicit finance risks in tokenization.

Globally, regulatory approaches vary significantly; the Monetary Authority of Singapore (MAS) has fostered a supportive environment since 2018, issuing guidelines that classify digital tokens as securities or payment instruments when applicable, while promoting innovation through initiatives like Project Guardian to enable tokenized asset markets. In contrast, China banned Initial Coin Offerings (ICOs) in 2017, deeming them illegal public financing activities and prohibiting exchanges, which has stifled domestic tokenization while influencing international caution around similar fundraising models.

Key challenges in asset tokenization include jurisdictional conflicts for cross-border transactions, where tokens can traverse multiple legal regimes instantaneously, leading to regulatory arbitrage, enforcement gaps, and compliance burdens for issuers. To address these, evolving regulatory sandboxes provide testing grounds; for instance, the UK's Financial Conduct Authority (FCA) and Bank of England launched the Digital Securities Sandbox in 2024, allowing firms to experiment with tokenized market infrastructure under tailored supervisory conditions until 2029. By 2025, over 90 jurisdictions have adopted or are advancing regulations for virtual assets, including tokenization frameworks, according to FATF assessments, reflecting a global push toward standardized oversight.

    Dec 23, 2024 · The DSS is due to run until 8 January 2029 but may be extended by HM Treasury through legislation if more time is necessary to transition to a ...