Tokenization
Tokenization is the process of dividing a sequence of data—such as text, code, or sensitive information—into smaller, meaningful units known as tokens, enabling further analysis, processing, or secure handling across various domains in computer science and beyond. In natural language processing (NLP), tokenization segments raw text into words, subwords, or characters to facilitate machine learning models' comprehension of human language, often involving the removal of punctuation or normalization of whitespace.[1] Similarly, in compiler design and lexical analysis, it transforms source code into discrete tokens like identifiers, keywords, operators, and literals, forming the foundational step for syntax parsing and program execution.[2] In data security, tokenization substitutes sensitive elements, such as credit card numbers or personal identifiers, with non-sensitive surrogate values (tokens) that preserve data utility without exposing originals, thereby enhancing compliance with privacy regulations like PCI DSS.[3] In finance and blockchain technology, asset tokenization converts real-world assets—ranging from real estate to securities—into digital tokens on distributed ledgers, promoting fractional ownership, increased liquidity, and streamlined trading while reducing intermediaries.[4]
The technique's versatility stems from its role as a preprocessing step that standardizes input for algorithms, though challenges vary by context: for instance, handling ambiguities in natural languages (e.g., contractions or compound words) requires sophisticated rules or models like Byte-Pair Encoding (BPE) for subword tokenization in modern large language models.[1] In programming languages, tokenization must resolve maximal munch rules to avoid misinterpretation of ambiguous sequences, such as distinguishing operators from identifiers.[5] Security implementations often employ vault-based systems where mappings between tokens and originals are securely stored, ensuring reversibility only under controlled access.[3] Meanwhile, blockchain tokenization leverages standards like ERC-20 or ERC-721 for fungible and non-fungible tokens, respectively, to represent ownership and enable smart contract interactions.[4]
Historically, tokenization traces back to early computational linguistics and compiler theory in the mid-20th century, evolving with advances in AI and distributed systems; today, it underpins critical applications from chatbots and fraud detection to decentralized finance (DeFi), with ongoing research addressing efficiency in handling diverse languages and data types.[6][7][8] Its importance has surged with the rise of generative AI, where token limits directly impact model performance and cost, highlighting the need for optimized tokenizers that balance vocabulary size and coverage.[9]
In natural language processing
Definition and purpose
In natural language processing (NLP), tokenization is the process of dividing raw text into smaller, discrete units called tokens, which may include words, phrases, subwords, or other linguistically meaningful segments to facilitate computational analysis.[10] This segmentation transforms unstructured text into a structured sequence that machines can process, addressing challenges like punctuation attachment and irregular spacing.[11] For example, the English sentence "Hello, world!" is commonly tokenized as ["Hello", ",", "world", "!"], separating content words from punctuation to preserve semantic and syntactic boundaries.[10]
The primary purpose of tokenization is to standardize input data for downstream NLP tasks, ensuring consistency and enabling efficient handling of text variations across languages and domains.[12] By breaking text into tokens, it supports processes such as parsing, where syntactic structures are identified; sentiment analysis, which evaluates emotional tone; named entity recognition, for extracting entities like persons or locations; and machine translation, which maps sequences between languages.[10] This foundational step mitigates issues like out-of-vocabulary words and adapts to diverse text structures, such as contractions in English or character-based writing in languages like Chinese.[13]
Tokenization traces its origins to the 1950s, when it emerged as a core component of early information retrieval systems designed for indexing and searching text documents in growing scientific literature collections.[14] Pioneering efforts, such as those by Calvin Mooers, who formalized the term "information retrieval" around 1950, relied on basic text segmentation to create searchable term lists, laying the groundwork for automated document processing.[15] These systems, including demonstrations of auto-indexing at 1958 conferences, highlighted tokenization's role in enabling rapid access to relevant content amid the computational limitations of the era.[14]
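A minimal sketch of this kind of segmentation, assuming the Python nltk package with its Punkt tokenizer data downloaded, reproduces the "Hello, world!" example above:

    from nltk.tokenize import word_tokenize  # assumes nltk is installed and its 'punkt' data is downloaded

    print(word_tokenize("Hello, world!"))
    # ['Hello', ',', 'world', '!']  (punctuation is split into separate tokens)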
Types of tokenization
Tokenization in natural language processing varies by granularity, with methods categorized primarily as sentence-level, word-level, subword-level, and character-level approaches. These types determine the unit size of text segments fed into models, balancing vocabulary efficiency, handling of linguistic diversity, and computational demands.[16]
Sentence-level tokenization identifies boundaries in text using heuristic rules based on punctuation markers such as periods, question marks, and exclamation points, often combined with contextual cues to avoid false splits on abbreviations or decimals. This approach is essential for tasks involving discourse analysis, as it structures text into coherent units for higher-level processing like coreference resolution or summarization.
Word-level tokenization splits text primarily on whitespace and punctuation, treating each resulting unit as a discrete token corresponding to a full word. It performs adequately for languages with clear word boundaries, such as English, but encounters challenges in agglutinative languages like Turkish or Finnish, where complex morphological affixes create long, variable forms that lead to vocabulary explosion and poor out-of-vocabulary handling.[17][16]
Subword-level tokenization addresses limitations of word-level methods by decomposing words into smaller meaningful units, such as morphemes or frequent n-grams, to manage out-of-vocabulary words and reduce overall vocabulary size. Common techniques include Byte-Pair Encoding (BPE) and WordPiece, which iteratively merge character pairs or select subwords based on likelihood to form a compact set; this results in vocabularies of approximately 30,000–50,000 tokens, enabling efficient modeling while preserving morphological structure. Subword approaches are particularly effective for morphologically rich languages, though they may occasionally split affixes suboptimally.[17]
Character-level tokenization treats individual characters (or bytes) as the basic units, providing maximal flexibility for low-resource languages without predefined word boundaries and eliminating out-of-vocabulary issues entirely. However, it significantly increases sequence lengths—often by a factor of 4–5 compared to word-level—raising computational costs for models with fixed context windows. This method suits scenarios prioritizing robustness over efficiency, such as early neural machine translation systems.[16][17]
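The trade-off between unit size and sequence length can be illustrated with a short plain-Python sketch; the word-level split below is an English-centric heuristic, and the subword segmentation is a hand-written stand-in for what a trained BPE or WordPiece model might produce, not the output of a real tokenizer:

    import re

    sentence = "Tokenizers decompose unpredictable words."

    # Word-level: split on whitespace, keeping punctuation as separate tokens.
    word_tokens = re.findall(r"\w+|[^\w\s]", sentence)

    # Character-level: every character, including spaces, becomes a token.
    char_tokens = list(sentence)

    # Subword-level: an illustrative segmentation of rarer words into frequent pieces.
    subword_tokens = ["Token", "izers", "decompose", "un", "predict", "able", "words", "."]

    print(len(word_tokens), len(subword_tokens), len(char_tokens))  # 5 8 41
    # Word-level yields the shortest sequence and character-level by far the longest,
    # which is the computational cost discussed above.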
Common algorithms and techniques
Tokenization in natural language processing employs a variety of algorithms and techniques to split text into meaningful units, ranging from simple rule-based methods to sophisticated subword approaches. Rule-based tokenization relies on predefined patterns, such as splitting on whitespace for languages like English, where words are typically separated by spaces, and using regular expressions to handle punctuation and special characters. This method is straightforward and computationally efficient but often requires language-specific adjustments, as it struggles with morphological variations or scripts without explicit delimiters.
Statistical methods address these limitations, particularly for languages lacking clear word boundaries, such as Chinese. Hidden Markov Models (HMMs) treat word segmentation as a sequence labeling task, in which each character is tagged as the start, continuation, or end of a word, and Viterbi decoding finds the most probable segmentation. For instance, hierarchical HMMs integrate part-of-speech tagging and unknown word detection, improving accuracy on complex texts by capturing dependencies across levels. These approaches leverage probabilistic transitions trained on annotated corpora, offering robustness to ambiguity compared to pure rule-based systems.
Subword tokenization has become prevalent for handling rare words and out-of-vocabulary (OOV) issues in modern models. Byte-Pair Encoding (BPE), introduced by Sennrich et al. in 2016, starts with a base vocabulary of individual characters and iteratively merges the most frequent adjacent symbol pairs to build larger subwords.[18] The merge selection is determined by
(x, y) = \arg\max_{x,y} f(xy)
where f(xy) is the frequency of the pair in the training corpus.[18] This process continues for a predefined number of operations, resulting in a vocabulary that decomposes unseen words into known subword units, effectively reducing OOV rates to near zero.[18] BPE's frequency-based pairing enables open-vocabulary handling without explicit morphological rules.[18]
WordPiece, originally developed by Schuster and Nakajima in 2012 and popularized in BERT by Devlin et al. in 2018, operates similarly but selects merges to maximize the likelihood of the training data using a greedy algorithm on subword probabilities.[19] It prioritizes splits that minimize the overall loss in a language model objective, making it suitable for morphologically rich languages.[19]
The Unigram Language Model, part of the SentencePiece toolkit introduced by Kudo in 2018, takes a probabilistic approach by starting with a large initial vocabulary and iteratively pruning low-probability subwords to optimize a unigram model fit.[20] This method excels in multilingual settings by treating text as a sequence of independent subword tokens, with each subword's probability estimated from its relative frequency:[20]
P(w) = \frac{f(w)}{\sum_{w'} f(w')}
Practical implementations are facilitated by libraries like NLTK, whose word_tokenize() function combines rule-based splitting with the Penn Treebank tokenizer for English-centric tasks.[21] spaCy's tokenizer component uses a combination of rules and statistical models, customizable via language-specific exception rules for efficient pipeline integration.[22] For large-scale applications in modern large language models, the Hugging Face Tokenizers library provides fast, Rust-based implementations of BPE, WordPiece, and Unigram, supporting rapid training and inference on massive corpora.[23]
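The core BPE merge loop can be condensed into a few lines of Python, in the spirit of the reference code published with Sennrich et al. (2016); the toy word counts and the choice of ten merge operations are illustrative only:

    import collections
    import re

    def get_stats(vocab):
        # Count frequencies f(xy) of adjacent symbol pairs across the corpus vocabulary.
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def merge_vocab(pair, vocab):
        # Replace every occurrence of the chosen pair with its concatenation.
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Toy word counts; "</w>" marks the end of a word, as in the original formulation.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):                    # the number of merge operations is a hyperparameter
        pairs = get_stats(vocab)
        if not pairs:                      # stop once every word has collapsed to a single symbol
            break
        best = max(pairs, key=pairs.get)   # (x, y) = argmax f(xy)
        vocab = merge_vocab(best, vocab)
    print(vocab)                           # frequent endings such as "est</w>" emerge as subwords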
Role in machine learning models
Tokenization serves as a critical preprocessing step in natural language processing (NLP) pipelines for machine learning models, particularly transformers, where raw text is segmented into tokens and mapped to numerical IDs from a fixed vocabulary. This conversion enables models to process sequences of integers rather than unstructured text, directly influencing input size, computational efficiency, and overall performance; for instance, vocabulary sizes typically range from 30,000 to 100,000 tokens, balancing coverage of common words and subwords while minimizing the dimensionality of embedding layers. In the original Transformer architecture, subword tokenization via byte-pair encoding (BPE) or word-piece models was employed to handle variable-length inputs, with shared vocabularies of approximately 32,000–37,000 tokens for machine translation tasks, ensuring robust representation without excessive fragmentation.[24]
In large language models (LLMs) like those in the GPT series, tokenization dictates the handling of fixed vocabularies, often around 50,000 tokens for GPT-2 and GPT-3, which necessitates padding shorter sequences with special tokens and applying attention masks to ignore them during computation. This process affects the effective context window—the maximum number of tokens the model can process at once—limiting input to 1,024 tokens in early GPT models and influencing downstream tasks like generation and comprehension by constraining the amount of information models can attend to simultaneously. Padding and masking prevent information leakage from non-existent tokens but introduce overhead, as masked positions still require computational resources in self-attention mechanisms, thereby impacting training and inference efficiency.[25][24]
Tokenization addresses key challenges in multilingual text processing by enabling models to represent diverse scripts and languages through subword units, which improve embedding quality by capturing morphological similarities across languages and enhance training efficiency by reducing out-of-vocabulary issues. For low-resource languages, however, inefficient tokenization can lead to longer sequences, increasing computational costs and degrading embedding coherence, as seen in analyses where multilingual tokenizers struggle with code-mixing without customization. A key metric in this domain is token efficiency, which measures how well subword approaches like BPE compress text; for English, this averages about 1.3 tokens per word, lowering the effective sequence length and computational demands compared to character-level tokenization.[26]
The 2017 introduction of the Transformer architecture spurred the development of custom tokenizers, shifting from generic methods to tailored subword strategies that better accommodate code-switching in multilingual models, such as those handling English-Spanish mixtures by learning language-specific merge rules during training. This evolution has enabled LLMs to achieve higher performance on cross-lingual tasks, with custom vocabularies optimizing for underrepresented languages and reducing token bloat in mixed inputs.[24]
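The effect of padding and attention masks can be seen in a small sketch that assumes the Hugging Face transformers package with PyTorch installed and loads the GPT-2 tokenizer; the sample sentences and the reuse of the end-of-sequence token for padding are illustrative choices:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")   # byte-level BPE with a vocabulary of ~50,000 tokens
    tok.pad_token = tok.eos_token                 # GPT-2 ships without a dedicated padding token

    batch = tok(
        ["Tokenization maps raw text to integer IDs.", "Short input."],
        padding=True,              # pad the shorter sequence to the longest in the batch
        return_tensors="pt",       # return PyTorch tensors
    )
    print(batch["input_ids"])       # token IDs, padded to a common length
    print(batch["attention_mask"])  # 1 for real tokens, 0 for padded positions the model ignores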
In lexical analysis
Definition and process
In lexical analysis, tokenization refers to the process of scanning an input stream of characters from a source program and grouping them into a sequence of tokens according to the grammar of the programming language.[2] This is the first phase of compilation, where the lexical analyzer, or scanner, converts the raw character stream into a structured token stream that can be processed by subsequent phases like syntax analysis.[2] Tokens represent meaningful syntactic units, such as keywords, identifiers, operators, and literals, each classified by type and often accompanied by attributes like value or position.[2]
The tokenization process involves reading the input character by character from left to right, applying pattern matching to identify lexemes—sequences of characters that match token definitions—and producing tokens upon recognition.[2] Token patterns are typically specified using regular expressions, which are converted into finite automata for efficient implementation: first to a non-deterministic finite automaton (NFA) via Thompson's construction, then to a deterministic finite automaton (DFA) using subset construction to enable linear-time scanning.[2] The DFA guides state transitions based on input characters, recognizing the longest possible valid token (the maximal munch rule) and outputting it to the parser while ignoring whitespace and comments.[2] This approach ensures an unambiguous breakdown of the input, yielding a clean token stream for parsing.[2]
Tokenization in lexical analysis originated in the 1950s with early compilers like the FORTRAN I system developed by IBM, which processed source input into basic elements despite lacking modern formal structure—such as allowing variables to overlap with reserved words.[2] The concepts were formalized in the 1970s and 1980s through foundational work on regular expressions and finite automata for scanners, as detailed in seminal texts like Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman (first edition 1986, building on their 1977 book).[27] This formalization established lexical analysis as a rigorous, automata-driven phase distinct from earlier ad-hoc methods.[27]
A representative example is the C-like code snippet int x = 5;, which the lexical analyzer tokenizes into the sequence: keyword "int", identifier "x", operator "=", number "5", and punctuation ";".[2] Unlike tokenization in natural language processing, which handles flexible and ambiguous text, lexical analysis enforces strict, unambiguous rules based on the language's grammar.[2]
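This scanning step can be sketched in Python with the re module; the token classes and regular expressions below are illustrative rather than part of any real compiler, and a production scanner would compile its patterns into a DFA instead of relying on alternation order:

    import re

    # Illustrative token classes for the "int x = 5;" example; the keyword precedes
    # the identifier pattern so that "int" is classified as a keyword.
    TOKEN_SPEC = [
        ("KEYWORD", r"\bint\b"),
        ("NUMBER", r"\d+"),
        ("IDENTIFIER", r"[A-Za-z_]\w*"),
        ("OPERATOR", r"="),
        ("PUNCT", r";"),
        ("SKIP", r"\s+"),        # whitespace is recognized but discarded
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(source):
        # Each greedy pattern consumes the longest lexeme it can (maximal munch for NUMBER
        # and IDENTIFIER); the \b anchors stop "int" from matching inside a longer name like "integer".
        for match in MASTER.finditer(source):
            if match.lastgroup != "SKIP":
                yield match.lastgroup, match.group()

    print(list(tokenize("int x = 5;")))
    # [('KEYWORD', 'int'), ('IDENTIFIER', 'x'), ('OPERATOR', '='), ('NUMBER', '5'), ('PUNCT', ';')]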