
Byte-pair encoding

Byte-pair encoding (BPE) is a data compression algorithm that iteratively identifies and replaces the most frequent adjacent pairs of bytes (or characters) in a sequence with a new, unused symbol from an extended vocabulary, building a compact representation while preserving the original sequence through a recorded table of merges. Introduced by Philip Gage in 1994, BPE was designed as a general-purpose compression technique that achieves performance nearly matching more sophisticated methods like Lempel-Ziv-Welch (LZW) but with simpler implementation and faster decompression.

In natural language processing, BPE was repurposed in 2016 by Rico Sennrich, Barry Haddow, and Alexandra Birch as a subword tokenization strategy for neural machine translation (NMT), enabling models to handle rare, compound, and out-of-vocabulary words by decomposing them into frequent subword units rather than relying on fixed word-level vocabularies. This adaptation allows for open-vocabulary translation, where unseen words are represented as sequences of known subwords, improving translation quality for morphologically rich languages and proper nouns through mechanisms such as character-level copying and transliteration, with reported gains of up to 1.3 BLEU points on standard benchmarks.

BPE's efficiency in managing vocabulary size (typically 30,000–60,000 tokens) has made it a cornerstone of modern large language models (LLMs), including OpenAI's GPT series, where it processes diverse text inputs at the byte level to support multilingual capabilities and reduce out-of-vocabulary issues. For instance, GPT-2 employs a BPE tokenizer with a vocabulary of 50,257 tokens derived from 50,000 merges on a large English corpus, balancing coverage of common words, subwords, and rare sequences to enhance model performance across unsupervised multitask learning. Variants like SentencePiece further refine BPE for direct operation on raw text without language-specific preprocessing, underscoring its versatility in contemporary AI systems.

Background and History

Origins in Data Compression

Byte-pair encoding (BPE) was developed by Philip Gage in 1994 as a lossless data compression technique designed to reduce the size of text files by replacing frequently occurring pairs of bytes with single new symbols, thereby minimizing redundancy while preserving all original information. This approach emerged in response to the need for simple, adaptive methods that could handle repetitive patterns in data without requiring complex dictionary management.

The core of BPE lies in its iterative process of identifying and merging the most common adjacent byte pairs across the input data, assigning each merged pair a unique symbol from the extended alphabet (typically unused byte values, such as 128 to 255 in ASCII text). This builds a hierarchical dictionary of increasingly longer symbols, allowing the algorithm to capture higher-order redundancies more effectively than single-pass methods. By focusing on byte-level operations, BPE ensures lossless reconstruction through a straightforward expansion routine that reverses the merges in the reverse order of their application.

In its initial implementation, BPE was coded in C and applied primarily to text files, demonstrating compression ratios that outperformed variants of the Lempel-Ziv-Welch (LZW) algorithm on certain datasets, such as reducing a 544,789-byte file to 276,955 bytes compared to LZW's 292,588 bytes with a 14-bit dictionary. The algorithm's compression phase required moderate memory (5-30 KB) and processing time, while decompression was notably fast (under 20 seconds on a 33 MHz 486 PC), making it suitable for resource-constrained environments.

BPE arose in the mid-1990s amid the evolution of adaptive compression algorithms, following the foundational Lempel-Ziv work of 1977 and LZW's 1984 refinement, and roughly contemporaneous with DEFLATE, introduced in 1993, which combined LZ77 with Huffman coding for broader utility. This timing positioned BPE as a lightweight alternative emphasizing speed and simplicity over maximal compression.

Adoption in Natural Language Processing

Byte-pair encoding (BPE) was first adapted for natural language processing (NLP) in the 2016 paper "Neural Machine Translation of Rare Words with Subword Units" by Rico Sennrich, Barry Haddow, and Alexandra Birch, where it served as a subword tokenization technique to enhance neural machine translation (NMT) systems. The method addressed the challenge of out-of-vocabulary (OOV) words by decomposing them into frequent subword units, allowing the model to translate rare or unseen terms through compositionality rather than relying on a fixed word vocabulary.

This adoption stemmed from the shortcomings of word-level tokenization, which often results in sparse representations for low-frequency words and struggles with morphologically complex languages, such as agglutinative ones where words are formed by extensive affixation. By iteratively merging the most frequent pairs of symbols into larger units, BPE strikes a balance between vocabulary size (typically limited to 30,000–60,000 tokens) and sequence length, mitigating the explosion of parameters needed for full word embeddings while preserving semantic coherence.

A key modification in this NLP context involved shifting from the original byte-level merging of the compression algorithm to operations on Unicode characters, making the process robust to various scripts and encodings while allowing vocabularies to be learned directly from domain-specific corpora. This adaptation, applied after initial word segmentation, enables BPE to capture linguistic regularities like morphemes without predefined rules.

The impact of BPE in NLP has been to enhance generalization in sequence-to-sequence models, including early recurrent neural networks (RNNs) for NMT and subsequent transformer architectures, by alleviating data sparsity and improving handling of morphological variation across languages. For instance, experiments in the original work demonstrated substantial gains in translation quality for rare words, with BLEU score improvements of up to 1.3 points on tasks such as English-to-German and English-to-Russian translation.

BPE's prominence in NLP grew rapidly after 2016, with integrations into open-source toolkits such as OpenNMT by 2017 for streamlined subword handling in training pipelines, and fairseq by 2019, where it became a standard for efficient vocabulary construction in large-scale NMT experiments. These implementations accelerated its use in research and production systems, solidifying BPE as a foundational technique for subword tokenization in modern NLP workflows.

Core Algorithm

Original Byte-Pair Encoding Procedure

The original byte-pair encoding (BPE) procedure is a lossless data compression algorithm that operates on a corpus represented as a sequence of bytes, typically drawn from a text file or binary data stream. It begins by treating the input as a stream of individual bytes, initializing a vocabulary consisting of all possible unique byte values (commonly 256 symbols for an 8-bit encoding). The corpus is then represented as a sequence of indices corresponding to these byte symbols, enabling efficient pair tracking and replacement. This initialization sets the stage for iterative merging without requiring predefined subword boundaries, focusing purely on frequency-based compression.

The core process involves repeated iterations that identify and merge the most frequent adjacent pairs of symbols. In each iteration, the algorithm scans the current representation of the corpus to count the occurrences of all adjacent symbol pairs, often using a hash table for efficient tallying. The pair with the highest frequency is selected, and a new symbol, typically an unused byte value, is introduced to the vocabulary to represent this merged pair. All occurrences of the selected pair throughout the corpus are then replaced by this new single symbol, effectively reducing the total length of the sequence while preserving the original information. This replacement is performed in place or via buffer swaps to minimize memory overhead, and the pair counts in the tracking structure are updated to reflect the changes: decrementing counts for pairs destroyed by the merge and incrementing counts for newly formed adjacent pairs. The process repeats for a predefined number of merges, until a desired vocabulary size is reached, or until no pair occurs more than once, at which point further compression yields no gain.

The output of the algorithm consists of the compressed corpus as a sequence of indices into the final vocabulary, alongside a pair table recording the merge history (i.e., which pairs map to which new symbols) to enable decoding. This pair table serves as the dictionary for reconstruction, ensuring lossless recovery of the original byte stream. Decoding reverses the merges applied during compression, typically using a stack-based approach for expansion: each symbol in the compressed sequence is either output as a literal byte or expanded by pushing the corresponding original pair onto the stack, processing nested expansions locally before proceeding. This single-pass decoding is notably faster than the multi-pass compression phase.

In terms of computational complexity, each iteration requires counting pair frequencies across the corpus, which can be achieved in linear time using hashing; with m total merges, the overall compression time is O(mn), where n is the length of the corpus representation. The original implementation leverages hashing for pair counting to approximate linear time per scan, but finding the global maximum frequency may introduce additional overhead in practice.

Pseudocode Outline

The following provides a high-level outline of the original BPE compression procedure, adapted for clarity from the foundational implementation:
# Input: corpus as a byte sequence (list of byte indices in 0-255)
# Output: compressed sequence of indices, merge_rules (ordered list of (pair, new_symbol))

from collections import defaultdict

def bpe_compress(input_bytes, max_merges):
    vocabulary = list(range(256))          # All possible byte values
    corpus = list(input_bytes)             # Represent the input as a list of indices
    merge_rules = []                       # Store merges in order, for decoding
    next_symbol = 256                      # First unused symbol value; grows with each merge

    for _ in range(max_merges):
        # Count frequencies of adjacent pairs
        pair_counts = defaultdict(int)
        for i in range(len(corpus) - 1):
            pair_counts[(corpus[i], corpus[i + 1])] += 1

        if not pair_counts:
            break

        # Find the most frequent pair; stop when no pair repeats
        most_frequent_pair = max(pair_counts, key=pair_counts.get)
        if pair_counts[most_frequent_pair] < 2:
            break

        new_symbol = next_symbol
        next_symbol += 1

        # Add to vocabulary and record the merge rule
        vocabulary.append(new_symbol)
        merge_rules.append((most_frequent_pair, new_symbol))

        # Replace all occurrences in the corpus, scanning left to right
        i = 0
        while i < len(corpus) - 1:
            if (corpus[i], corpus[i + 1]) == most_frequent_pair:
                corpus[i] = new_symbol
                del corpus[i + 1]
            else:
                i += 1

    return corpus, merge_rules
For decoding, a recursive or stack-based expansion applies the merge rules in reverse order to reconstruct the original corpus from the compressed indices and rules.
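A matching decoder can be sketched in the same Python style as the outline above. This is a minimal illustration of the stack-based expansion described earlier; the function name bpe_decode and the (pair, new_symbol) rule format are assumptions carried over from the compression sketch, not Gage's published code.
def bpe_decode(compressed, merge_rules):
    # Map each introduced symbol back to the pair it replaced.
    expansions = {new_symbol: pair for pair, new_symbol in merge_rules}
    output = []
    for symbol in compressed:
        stack = [symbol]
        while stack:
            current = stack.pop()
            if current in expansions:
                left, right = expansions[current]
                stack.append(right)     # Push right first so the left half is expanded first
                stack.append(left)
            else:
                output.append(current)  # A base byte value: emit as-is
    return bytes(output)
Because each introduced symbol expands only into previously defined symbols, the nested expansion terminates and reproduces the original byte stream exactly.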

Modified Byte-Pair Encoding for Subword Tokenization

The modified byte-pair encoding (BPE) algorithm adapts the original compression technique for subword tokenization in natural language processing, enabling the segmentation of words into smaller, reusable units to address vocabulary limitations and handle rare or out-of-vocabulary terms effectively. This adaptation shifts the focus from lossless text compression to creating a fixed-size vocabulary of subword units derived from character sequences, facilitating open-vocabulary processing in tasks like machine translation and language modeling.

The input for learning merge rules is a training corpus of words or sentences, preprocessed into character sequences with word boundaries explicitly marked, typically by appending a special end-of-word symbol such as '·' to each word, to prevent unintended merges across word edges during training. The base vocabulary initializes with all unique characters encountered in the corpus, often drawn from Unicode representations, treating the text as a sequence of these atomic units rather than raw bytes. A key modification involves iteratively merging the most frequent adjacent pairs of characters (and, in later iterations, subwords) within the corpus: frequency counts are computed for all pairs, the highest-frequency pair is replaced by a new symbol throughout the text, and this process repeats, building a sequence of merge rules ordered by decreasing frequency. The number of merge operations, denoted num_merges, serves as the primary hyperparameter, controlling the growth of the vocabulary from the initial character set to a larger set of compound subwords; for instance, 59,500 merges can yield a vocabulary of approximately 63,000 subword units, while 89,500 merges produce around 82,000 units. Optional frequency thresholds may filter pairs below a minimum count to prioritize more reliable merges, though this is not always applied in standard implementations. Unlike the original algorithm, which applies merges exhaustively to an entire corpus for compression, these rules are learned once on the training data and then applied to new text for tokenization, emphasizing segmentation over full replacement.

To encode a new word using the learned rules, the word is initially decomposed into its individual characters. The merge operations are then applied in the order they were learned, starting with the earliest (most frequent) rule, by repeatedly scanning the sequence and replacing each applicable pair until no further merges apply. The output is a sequence of subword units from the fixed vocabulary, each mapped to a unique integer ID for model input; for example, the word "lowest" might be tokenized as the subwords "low" followed by "est". Rare or unseen words are handled by partial matching: the algorithm applies as many merges as possible, falling back to individual characters for any remaining unmatched segments, ensuring no word is entirely unknown while preserving its orthographic integrity as a concatenation of valid units. For instance, an unseen word like "unseen" could be represented as "un" + "s" + "e" + "e" + "n" if no further merges apply beyond the known prefix.
In contrast to compression scenarios, where decoding requires reversing all merges in reverse order to recover the original byte sequence, subword tokenization here produces reversible encodings without a full decoding cycle—original text is reconstructed simply by concatenating the subword strings, as the process is applied per word and does not alter the overall corpus structure.
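The learning loop and the per-word encoding step can be summarized in a short Python sketch. This is an illustrative implementation in the spirit of the procedure described above, not the reference subword-nmt code; the function names learn_bpe, merge_pair, and apply_bpe, and the word-frequency input format, are assumptions for this example.
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every non-overlapping occurrence of `pair` with its concatenation.
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    # word_freqs: e.g. {"low·": 5, "lower·": 2, ...}, end-of-word marker already appended.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # ties broken by insertion order
        merges.append(best)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges

def apply_bpe(word, merges):
    # Segment a new word (marker included) by replaying the learned merges in order.
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
Tokens produced by apply_bpe are then looked up in the vocabulary to obtain integer IDs; any symbol that never participated in a merge simply remains a single character, which is how unseen words degrade gracefully to character-level pieces.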

Worked Examples

Compression Example with Text Data

To illustrate the compression process of the original byte-pair encoding (BPE) algorithm, consider a small sample text corpus consisting of the string "ABABCABCD", where each character represents a byte from a simple alphabet {A, B, C, D}. This initial representation is a sequence of 9 bytes: A B A B C A B C D.

In the first iteration, the algorithm identifies all adjacent byte pairs in the sequence and counts their frequencies: AB appears three times, BC twice, and BA, CA, CD each once. The most frequent pair, AB, is merged into a new symbol X (an unused byte value), and all occurrences are replaced by scanning the sequence from left to right: the first AB (positions 1-2) becomes X, the second AB (original positions 3-4) becomes X, and the final AB (original positions 6-7) becomes X. The updated sequence is now X X C X C D, reducing the length to 6 symbols.

In the second iteration, the adjacent pairs in the new sequence X X C X C D are counted: XC appears twice (positions 2-3 and 4-5), while XX, CX, and CD each appear once. The pair XC is merged into a new symbol Y, and replacements yield X (position 1), Y (from positions 2-3), Y (from positions 4-5), and D (position 6). The resulting sequence is X Y Y D, now 4 symbols long. Further iterations could continue until no pairs repeat or compression benefits diminish, but this example stops here, demonstrating a reduction from 9 to 4 symbols (a 55.6% length reduction in the sequence, excluding the overhead of the merge table).

The merges are recorded in a pair table for decoding: first AB → X, then XC → Y. To recover the original, reverse the process starting from the last merge: each Y is replaced by XC, yielding X X C X C D. Then, each X is replaced by AB, resulting in A B A B C A B C D. This reversible substitution ensures lossless compression, with the table stored alongside the encoded sequence to enable exact reconstruction.
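Assuming the bpe_compress and bpe_decode sketches from the pseudocode section above are available, this worked example can be reproduced directly; the symbol values 256 and 257 simply stand in for the new symbols X and Y.
# Reproducing the worked example: 'A'=65, 'B'=66, 'C'=67, 'D'=68 in ASCII.
data = b"ABABCABCD"
compressed, rules = bpe_compress(list(data), max_merges=2)
print(compressed)   # [256, 257, 257, 68]  i.e. X Y Y D with X=256, Y=257
print(rules)        # [((65, 66), 256), ((256, 67), 257)]  i.e. AB -> X, then XC -> Y
assert bpe_decode(compressed, rules) == data   # lossless round trip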

Tokenization Example with Vocabulary Building

To illustrate the modified byte-pair encoding (BPE) procedure for subword tokenization in natural language processing, consider a toy corpus consisting of the words "low", "lowest", "newer", "wider". These words are first pre-tokenized into individual characters, with an end-of-word boundary symbol (denoted as ·) appended to each to handle word segmentation distinctly from continuous text. The initial vocabulary comprises the base characters appearing in the corpus, {l, o, w, e, s, t, n, i, d, r}, plus the boundary ·. Adjacent character pairs are then counted across the corpus to identify the most frequent ones for merging.

The BPE process iteratively replaces the most frequent pair with a new subword token, updating the vocabulary and pair counts after each merge, for a specified number of iterations (here, limited to 4 merges for brevity). The merges learned are: first "r ·" (appearing at the end of "newer" and "wider"), creating "r·" and updating those words to n e w e r· and w i d e r·, respectively; second "l o" (in "low" and "lowest"), forming "lo"; third "lo w" (in the updated "low" and "lowest"), forming "low"; fourth "e r·" (in the updated "newer" and "wider"), forming "er·". The resulting vocabulary after these 4 merges includes the initial characters plus the new subwords: {l, o, w, e, s, t, n, i, d, r, ·, r·, lo, low, er·}. Each token in the vocabulary is assigned a unique integer ID, typically starting from 0 for the base characters (e.g., l=0, o=1, ..., ·=10, r·=11, lo=12, low=13, er·=14).

To encode a word from the corpus, such as "lowest", the character sequence l o w e s t · is segmented by replaying the learned merges in order: l and o merge to "lo", then "lo" and w to "low", leaving the base characters "e", "s", "t" and the boundary ·, giving tokens ["low", "e", "s", "t", "·"] with IDs [13, 3, 4, 5, 10]. For an unseen word like "lower" (l o w e r ·), the merges successively produce "r·", "lo", "low", and finally "er·", yielding ["low", "er·"] with IDs [13, 14]. This demonstrates BPE's ability to handle out-of-vocabulary words by decomposing them into known subwords, promoting open-vocabulary coverage. If the word introduced new characters outside the initial vocabulary, the encoding would fall back to individual characters where no merges apply.

The merge process is summarized in the following table, showing each selected pair, its frequency, and the updated word segmentations:
Merge Iteration | Pair Merged | Frequency | Updated Words (Examples) | Vocabulary Token Added
1 | r + · → r· | 2 | newer → n e w e r·; wider → w i d e r· | r·
2 | l + o → lo | 2 | low → lo w ·; lowest → lo w e s t · | lo
3 | lo + w → low | 2 | low → low ·; lowest → low e s t · | low
4 | e + r· → er· | 2 | newer → n e w er·; wider → w i d er· | er·
This table highlights how merges build increasingly longer subwords, reducing the token count for frequent patterns while maintaining flexibility for rare ones. For "lowest", the final encoding is [13, 3, 4, 5, 10] (low, e, s, t, ·); for "newer" (n e w er·), it is [6, 3, 2, 14] (n, e, w, er·).
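Using the hypothetical merge_pair and apply_bpe helpers sketched in the previous section, the four merges from this table can be replayed directly; the hard-coded merge list and ID assignment below simply mirror the example rather than being learned here.
merges = [("r", "·"), ("l", "o"), ("lo", "w"), ("e", "r·")]   # learned order from the table
vocab = ["l", "o", "w", "e", "s", "t", "n", "i", "d", "r", "·",
         "r·", "lo", "low", "er·"]                            # IDs 0..14 as in the example
token_id = {tok: i for i, tok in enumerate(vocab)}

for word in ["lowest·", "newer·", "lower·"]:
    pieces = apply_bpe(word, merges)
    print(word, pieces, [token_id[p] for p in pieces])
# lowest· ['low', 'e', 's', 't', '·'] [13, 3, 4, 5, 10]
# newer·  ['n', 'e', 'w', 'er·']      [6, 3, 2, 14]
# lower·  ['low', 'er·']              [13, 14]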

Applications and Variants

Use in Large Language Models

Byte-pair encoding (BPE) plays a central role in tokenizing inputs for the GPT series of large language models from OpenAI, starting with GPT-2 in 2019. These models employ a byte-level variant of BPE, often referred to as byte-level BPE (BBPE), which processes raw bytes to handle diverse text types including arbitrary Unicode text, code, and multilingual content without requiring explicit normalization. A primary advantage of this integration is the constrained vocabulary size, which caps the embedding matrix dimensions and reduces the overall parameter count, contributing to computational efficiency during training and inference. Subword units also effectively capture morphological patterns, for example by decomposing "running" into the root "run" and the suffix "ning", allowing the model to generalize across related forms without exploding the vocabulary for rare inflections.

The tokenizer is trained in advance on vast corpora to identify optimal merge rules; GPT-2 used the WebText dataset of approximately 8 million web pages, while GPT-3 scaled to about 400 billion byte-pair-encoded tokens from filtered Common Crawl, WebText, books, and Wikipedia sources. These learned merges are then applied during model inference to convert input sequences into token IDs for the transformer architecture. In GPT-2 and GPT-3, the resulting vocabulary comprises about 50,000 tokens (50,257 in GPT-2), including 256 byte-level base symbols and special tokens, which compresses infrequent words into compositions of common subwords to support context lengths of 1024 tokens in GPT-2 and 2048 in GPT-3. Later models in the series, such as GPT-4 (2023) with a vocabulary of approximately 100,000 tokens and GPT-5 (2025) with around 200,000, have expanded this approach for greater efficiency and broader capabilities. This design enhances efficiency for long-form generation while maintaining compatibility with the model's autoregressive prediction.

On performance, BPE tokenization lowers perplexity for rare and out-of-vocabulary words by leveraging shared subword representations, enabling better generalization to unseen words during inference. However, for morphologically rich languages like those in the Indic family, it often produces longer token sequences due to aggressive splitting of complex affixes, increasing computational overhead compared to analytic languages. Practical implementations of GPT-compatible BPE tokenizers are provided in the Hugging Face Tokenizers library, where BPE is the default model for many pre-trained transformers, supporting fast encoding and decoding for research and deployment.
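For a concrete sense of what this tokenizer produces, the sketch below uses OpenAI's tiktoken package, one widely used implementation that ships the GPT-2 merge tables; the package choice and the sample sentence are assumptions for illustration, not part of the sources above.
# Requires the third-party tiktoken package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # byte-level BPE with 50,257 tokens
ids = enc.encode("Byte-pair encoding handles rare words.")
print(ids)                                # integer token IDs fed to the model
print([enc.decode([i]) for i in ids])     # the subword string each ID maps back to
print(enc.n_vocab)                        # 50257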

Byte-Level and Other Variants

Byte-level byte-pair encoding (BBPE) applies the BPE algorithm directly to UTF-8 encoded bytes rather than pre-normalized characters, enabling seamless handling of any Unicode character or multibyte sequence without additional preprocessing. This approach, popularized in GPT-2, uses a base vocabulary of 256 bytes, which expands to approximately 50,000 tokens through merging, including fallback representations for rare symbols. By operating at the byte level, BBPE avoids common tokenization pitfalls, such as errors with emojis, non-Latin scripts, or irregular whitespace, ensuring that no unknown tokens occur. In contrast to character-based BPE variants, BBPE merges adjacent byte pairs based on frequency in the raw input, which can result in slightly longer sequences due to the finer granularity of bytes but enhances overall robustness for diverse languages. For multilingual applications, BBPE has been adopted in models like BLOOM, where a vocabulary of 250,680 tokens supports 46 natural languages and 13 programming languages by minimizing over-segmentation in low-resource settings.

Other variants of subword tokenization include the Unigram Language Model, implemented in the SentencePiece toolkit alongside BPE. Unigram treats tokenization as a probabilistic segmentation problem, starting with a large candidate set of subwords and iteratively pruning the least probable ones based on a unigram language model, rather than merging frequent pairs as in BPE. SentencePiece processes raw text without word boundary detection, using a meta-symbol for spaces to ensure language-agnostic, reversible encoding. The toolkit defaults to the Unigram model and supports subword regularization via sampling, offering performance comparable to BPE with greater flexibility for unsegmented languages such as Japanese. WordPiece, developed by Google, resembles BPE in its iterative merging but selects pairs that maximize the likelihood of the training data rather than raw frequency, targeting vocabularies around 200,000 units for tasks like voice search in Japanese and Korean.

These variants address complementary trade-offs in vocabulary size, multilingual coverage, and sequence length: BBPE provides byte-level universality, Unigram offers probabilistic segmentation, and WordPiece uses likelihood-driven merges, with BBPE gaining robustness at the cost of marginally increased token counts.
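To make the byte-level starting point concrete, the following minimal sketch in plain Python (no tokenizer library assumed) shows how arbitrary text, including accented characters and emoji, reduces to the 256-symbol base alphabet before any merges are learned:
# Every string maps to UTF-8 bytes in the range 0-255, so a byte-level BPE base
# vocabulary covers any input without needing an unknown-token fallback.
text = "naïve café 😊"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)                                  # the emoji alone contributes four bytes
print(all(0 <= b <= 255 for b in byte_ids))      # True: every symbol is in the base set
print(bytes(byte_ids).decode("utf-8") == text)   # True: the mapping is lossless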
