Byte-pair encoding
Byte-pair encoding (BPE) is a data compression algorithm that iteratively identifies and replaces the most frequent adjacent pairs of bytes (or symbols) in a text corpus with a new, unused symbol from an extended alphabet, building a compact representation while preserving the original sequence through a dictionary of merges.[1] Introduced by Philip Gage in 1994, BPE was designed as a general-purpose compression technique that achieves performance nearly matching more sophisticated methods like Lempel–Ziv–Welch (LZW) but with simpler implementation and faster decompression.[1]

In natural language processing, BPE was repurposed in 2016 by Rico Sennrich, Barry Haddow, and Alexandra Birch as a subword tokenization strategy for neural machine translation (NMT), enabling models to handle rare, compound, and out-of-vocabulary words by decomposing them into frequent subword units rather than relying on fixed word-level vocabularies.[2] This adaptation allows for open-vocabulary translation, where unseen words are represented as sequences of known subwords, improving translation quality for morphologically rich languages and proper nouns through mechanisms like character-level copying or transliteration, with reported gains of up to 1.3 BLEU points on standard benchmarks.[2]

BPE's efficiency in managing vocabulary size (typically limiting it to 30,000–50,000 tokens) has made it a cornerstone of modern large language models (LLMs), including OpenAI's GPT series, where it processes diverse text inputs at the byte level to support multilingual capabilities and reduce out-of-vocabulary issues.[3] For instance, GPT-2 employs a BPE tokenizer with a vocabulary of 50,257 tokens derived from 50,000 merges on a large English corpus, balancing coverage of common words, subwords, and rare sequences to enhance model performance across unsupervised multitask learning.[3][4] Variants like SentencePiece further refine BPE for direct operation on raw text without language-specific preprocessing, underscoring its versatility in contemporary AI systems.

Background and History
Origins in Data Compression
Byte-pair encoding (BPE) was developed by Philip Gage in 1994 as a lossless data compression technique designed to efficiently reduce the size of text files by replacing frequently occurring pairs of bytes with single new symbols, thereby minimizing redundancy while preserving all original information.[5] This approach emerged as a response to the need for simple, adaptive compression methods that could handle repetitive patterns in data without requiring complex dictionary management.

The core innovation of BPE lies in its iterative process of identifying and merging the most common adjacent byte pairs across the input data, assigning each merged pair a unique symbol from the extended alphabet (typically using unused byte values from 128 to 255 in ASCII).[5] This builds a hierarchical vocabulary of increasingly longer symbols, allowing the algorithm to capture higher-order redundancies more effectively than single-pass substitution methods.[6] By focusing on byte-level operations, BPE ensures lossless decompression through a straightforward expansion routine that reverses the merges in the reverse order of their application.[5]

In its initial implementation, BPE was coded in ANSI C and applied primarily to text files, demonstrating compression ratios that outperformed variants of the Lempel–Ziv–Welch (LZW) algorithm on certain datasets, such as reducing a 544,789-byte executable file to 276,955 bytes compared to LZW's 292,588 bytes with a 14-bit dictionary.[5] The algorithm's compression phase required moderate memory (5–30 KB) and processing time, while decompression was notably fast (under 20 seconds on a 33 MHz 486 PC), making it suitable for resource-constrained environments like embedded systems or real-time applications.[5]

BPE arose in the mid-1990s amid the evolution of adaptive compression algorithms, following the foundational Lempel–Ziv work of 1977, LZW's 1984 refinement, and the release of gzip in 1993, which combined LZ77 with Huffman coding for broader utility.[5] This timing positioned BPE as a lightweight alternative emphasizing speed and simplicity over maximal compression depth.

Adoption in Natural Language Processing
Byte-pair encoding (BPE) was first adapted for natural language processing (NLP) in the 2016 paper "Neural Machine Translation of Rare Words with Subword Units" by Rico Sennrich, Barry Haddow, and Alexandra Birch, where it served as a subword tokenization technique to enhance neural machine translation (NMT) systems.[7] The method addressed the challenge of out-of-vocabulary (OOV) words by decomposing them into frequent subword units, allowing the model to translate rare or unseen terms through compositionality rather than relying on a fixed word vocabulary.[7]

This adoption stemmed from the shortcomings of word-level tokenization, which often results in sparse representations for low-frequency words and struggles with morphologically complex languages, such as agglutinative ones where words are formed by extensive affixation.[7] By iteratively merging the most frequent pairs of symbols into larger units, BPE strikes a balance between vocabulary size (typically limited to 30,000–60,000 tokens) and sequence length, mitigating the explosion of parameters needed for full word embeddings while preserving semantic coherence.[7]

A key modification in this NLP context involved shifting from the original byte-level merging of the compression algorithm to operations on Unicode characters, making the process robust to various scripts and encodings while allowing vocabularies to be learned directly from domain-specific corpora.[7] This adaptation, applied after initial word segmentation, enables BPE to capture linguistic regularities like morphemes without predefined rules.

The impact of BPE in NLP has been to enhance generalization in sequence-to-sequence models, including early recurrent neural networks (RNNs) for NMT and subsequent transformer architectures, by alleviating data sparsity and improving handling of morphological variation across languages.[7] For instance, experiments in the original work demonstrated substantial gains in translation quality for rare words, with BLEU score improvements of up to 1.3 points on tasks such as English-to-German and English-to-Russian translation.[7] BPE's prominence in NLP grew rapidly after 2016, with integrations into open-source toolkits such as OpenNMT by 2017 for streamlined subword handling in training pipelines, and fairseq by 2019, where it became a standard for efficient vocabulary construction in large-scale NMT experiments. These implementations accelerated its use in research and production systems, solidifying BPE as a foundational subword tokenization technique in modern NLP workflows.

Core Algorithm
Original Byte-Pair Encoding Procedure
The original byte-pair encoding (BPE) procedure is a lossless data compression algorithm that operates on a corpus represented as a sequence of bytes, typically from a text file or binary data stream. It begins by treating the input as a stream of individual bytes, initializing a vocabulary consisting of all possible unique byte values (commonly the 256 values representable in an 8-bit byte). The corpus is then represented as a sequence of indices corresponding to these byte symbols, enabling efficient pair tracking and replacement. This initialization sets the stage for iterative merging without requiring predefined subword boundaries, focusing purely on frequency-based compression.[5]

The core process involves repeated iterations to identify and merge the most frequent adjacent pairs of symbols. In each iteration, the algorithm scans the current representation of the corpus to count the occurrences of all adjacent symbol pairs, often using a hash table for efficient tallying. The pair with the highest frequency is selected, and a new symbol, typically an unused byte value, is introduced to the vocabulary to represent this merged pair. All occurrences of the selected pair throughout the corpus are then replaced by this new single symbol, reducing the total length of the sequence while preserving the original information. This replacement is performed in place or via buffer swaps to minimize memory overhead, and the pair counts in the tracking structure are updated to reflect the changes: decrementing counts for pairs destroyed by the merge and incrementing counts for newly formed adjacent pairs. The process repeats for a predefined number of merges, until a desired vocabulary size is reached, or until no repeated pairs remain, at which point further compression yields no gain.[5]

The output of the procedure consists of the compressed corpus as a sequence of indices into the final vocabulary, alongside a pair table recording the merge history (i.e., which pairs map to which new symbols) to enable decoding. This pair table serves as the dictionary for reconstruction, ensuring lossless recovery of the original byte stream. Decoding undoes the merges in the reverse order of their application, typically using a stack-based approach for efficiency: each symbol in the compressed sequence is either output as a literal byte or expanded by pushing the corresponding original pair onto the stack, processing nested expansions locally before proceeding. This single-pass decoding is notably faster than the multi-pass compression phase.[5]

In terms of computational complexity, each iteration requires counting pair frequencies across the corpus, which can be achieved in linear time using hashing; with m total merges, the overall compression time is O(m n), where n is the length of the corpus representation. The original implementation leverages hashing for pair counting to approximate linear time per scan, but finding the global maximum frequency may introduce additional factors in practice.[8]

Pseudocode Outline
The following provides a high-level outline of the original BPE compression procedure, adapted for clarity from the foundational implementation and written here as runnable Python-style pseudocode:

```python
from collections import defaultdict

def bpe_compress(data, max_merges):
    # Input: data as a byte sequence.
    # Output: compressed list of symbol indices and merge_rules,
    # a list of (pair -> new_symbol) mappings used for decoding.
    corpus = list(data)      # represent the input as byte indices 0-255
    merge_rules = []         # records each merge for later decoding
    next_symbol = 256        # new symbols start above the byte range
                             # (Gage's original reuses unused byte values instead)

    for _ in range(max_merges):
        # Count frequencies of all adjacent symbol pairs
        pair_counts = defaultdict(int)
        for pair in zip(corpus, corpus[1:]):
            pair_counts[pair] += 1
        if not pair_counts:
            break

        # Find the most frequent pair; stop once nothing repeats
        most_frequent_pair = max(pair_counts, key=pair_counts.get)
        if pair_counts[most_frequent_pair] < 2:
            break
        merge_rules.append((most_frequent_pair, next_symbol))

        # Replace all occurrences of the pair with the new symbol
        new_corpus, i = [], 0
        while i < len(corpus):
            if i + 1 < len(corpus) and (corpus[i], corpus[i + 1]) == most_frequent_pair:
                new_corpus.append(next_symbol)
                i += 2
            else:
                new_corpus.append(corpus[i])
                i += 1
        corpus = new_corpus
        next_symbol += 1

    return corpus, merge_rules
```

For decoding, a recursive or stack-based expansion applies the merge rules in reverse order to reconstruct the original corpus from the compressed indices and rules.[5]
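The stack-based decoding routine described above can be sketched as follows. This is a minimal illustration that assumes the `bpe_compress` outline above; the function name and details are illustrative rather than part of Gage's published code:

```python
def bpe_decompress(compressed, merge_rules):
    """Recover the original byte sequence by expanding merged symbols with a stack."""
    # Map each merged symbol back to the pair it replaced.
    expansions = {symbol: pair for pair, symbol in merge_rules}
    output = bytearray()
    for symbol in compressed:
        stack = [symbol]
        while stack:
            current = stack.pop()
            if current in expansions:
                left, right = expansions[current]
                stack.append(right)   # push right first so the left half is emitted first
                stack.append(left)
            else:
                output.append(current)   # a literal byte (0-255), emitted as-is
    return bytes(output)
```

Because each compressed symbol is expanded locally in a single pass over the compressed sequence, this mirrors why decompression is substantially faster than the multi-pass compression phase.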
Modified Byte-Pair Encoding for Subword Tokenization
The modified byte-pair encoding (BPE) algorithm adapts the original compression technique for subword tokenization in natural language processing, enabling the segmentation of words into smaller, reusable units to address vocabulary limitations and handle rare or out-of-vocabulary terms effectively.[2] This adaptation shifts the focus from lossless text compression to creating a fixed-size vocabulary of subword units derived from character sequences, facilitating open-vocabulary processing in tasks like machine translation and language modeling.[2]

The input for learning merge rules is a training corpus of words or sentences, preprocessed into character sequences with word boundaries explicitly marked, typically by appending a special end-of-word symbol such as ‘·’ to each word, to prevent unintended merges across word edges during training.[2] The base vocabulary initializes with all unique characters encountered in the corpus, often drawn from Unicode representations, treating the text as a sequence of these atomic units rather than raw bytes.[2]

A key modification involves iteratively merging the most frequent adjacent pairs of characters (or, later, subwords) within the corpus: frequency counts are computed for all pairs, the highest-frequency pair is replaced by a new symbol throughout the text, and this process repeats, building a sequence of merge rules ordered by decreasing frequency.[2] The number of merge operations, denoted as num_merges, serves as the primary hyperparameter, controlling the growth of the vocabulary from the initial character set to a larger set of compound subwords; for instance, 59,500 merges can yield a vocabulary of approximately 63,000 subword units, while 89,500 merges produce around 82,000 units.[2] Optional frequency thresholds may filter pairs below a minimum count to prioritize more reliable merges, though this is not always applied in standard implementations.[9]

Unlike the original BPE, which applies merges exhaustively to an entire corpus for compression, these rules are learned once on the training data and then applied selectively to new text for tokenization, emphasizing segmentation over full replacement.[2] To encode a new word using the learned rules, the word is first decomposed into its individual characters.
The merge operations are then applied in the order they were learned: the sequence is repeatedly scanned for the earliest-learned (highest-frequency) applicable pair, which is merged, until no further merges apply, as illustrated in the sketch below.[2] The output is a sequence of subword units from the fixed vocabulary, each mapped to a unique integer ID for model input; for example, the word "lowest" might be tokenized as the subwords "low" followed by "est".[2]

Rare or unseen words are handled by partial matching: the algorithm applies as many merges as possible, falling back to individual characters for any remaining unmatched segments, ensuring no word is entirely unknown while preserving its orthographic integrity as a concatenation of valid units.[2] For instance, an unseen word like "unseen" could be represented as "un" + "s" + "e" + "e" + "n" if no further merges apply beyond the known prefix.[2]

In contrast to compression, where decoding must reverse all merges in reverse order to recover the original byte sequence, subword tokenization requires no decoding cycle: the original text is reconstructed simply by concatenating the subword strings (with end-of-word markers restored to word boundaries), since the process is applied per word and does not alter the overall corpus structure.[2]
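The learn-and-apply procedure can be illustrated with a minimal sketch. This is not the reference implementation from Sennrich et al.; the function names (`learn_bpe`, `merge_pair`, `encode_word`), the ‘·’ end-of-word marker, and the tie-breaking behavior are illustrative assumptions:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the concatenated token."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge rules from a {word: frequency} mapping."""
    # Each word becomes a list of characters plus an end-of-word marker '·'.
    corpus = [(list(word) + ["·"], freq) for word, freq in word_freqs.items()]
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)   # ties broken by first-counted pair
        merges.append(best)
        corpus = [(merge_pair(symbols, best), freq) for symbols, freq in corpus]
    return merges

def encode_word(word, merges):
    """Segment a new word by applying the learned merges in the order they were learned."""
    symbols = list(word) + ["·"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

Applying each learned merge in order, as `encode_word` does, mirrors the description above; widely used tokenizers implement the same idea more efficiently, for example by repeatedly merging the applicable pair with the lowest merge rank.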
Worked Examples
Compression Example with Text Data
To illustrate the compression process of the original byte-pair encoding (BPE) algorithm, consider a small sample text corpus consisting of the string "ABABCABCD", where each character represents a byte from a simple alphabet {A, B, C, D}.[5] This initial representation is a sequence of 9 bytes: A B A B C A B C D.[5]

In the first iteration, the algorithm identifies all adjacent byte pairs in the sequence and counts their frequencies: AB appears three times, BC twice, and BA, CA, CD each once.[5] The most frequent pair, AB, is merged into a new symbol X (an unused byte value), and all occurrences are replaced by scanning the sequence from left to right: the first AB (positions 1–2) becomes X, the next AB (original positions 3–4) becomes X, and the final AB (original positions 6–7) becomes X.[5] The updated sequence is now X X C X C D, reducing the length to 6 symbols.[5]

In the second iteration, the adjacent pairs in the new sequence X X C X C D are counted: XC appears twice (positions 2–3 and 4–5), while XX, CX, and CD each appear once.[5] The pair XC is merged into a new symbol Y, and replacements yield X (position 1), Y (from positions 2–3), Y (from positions 4–5), and D (position 6).[5] The resulting sequence is X Y Y D, now 4 symbols long.[5]

Further iterations could continue until no pairs repeat or compression benefits diminish, but this example stops here, demonstrating a reduction from 9 to 4 symbols (a 55.6% length reduction in the sequence, excluding the overhead of the merge table).[5] The merges are recorded in a pair table for decoding: first AB → X, then XC → Y.[5]

To recover the original, the process is reversed starting from the last merge: each Y is replaced by XC, yielding X X C X C D.[5] Then each X is replaced by AB, resulting in A B A B C A B C D.[5] This reversible substitution ensures lossless compression, with the table stored alongside the encoded sequence to enable exact reconstruction.[5]
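This worked example can be reproduced with the `bpe_compress` and `bpe_decompress` sketches from the pseudocode section (both hypothetical illustrations rather than Gage's original code), with X and Y corresponding to the new symbol indices 256 and 257:

```python
# Round trip of the "ABABCABCD" example using the sketches above.
compressed, rules = bpe_compress(b"ABABCABCD", max_merges=2)

print(compressed)   # [256, 257, 257, 68] -> X Y Y D, where D is byte 68
print(rules)        # [((65, 66), 256), ((256, 67), 257)] -> AB -> X, then XC -> Y
assert bpe_decompress(compressed, rules) == b"ABABCABCD"   # lossless recovery
```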
Tokenization Example with Vocabulary Building
To illustrate the modified byte-pair encoding (BPE) procedure for subword tokenization in natural language processing, consider a toy corpus consisting of the words "low", "lowest", "newer", and "wider".[10] These words are first pre-tokenized into individual characters, with an end-of-word boundary symbol (denoted as ·) appended to each word so that word boundaries are handled explicitly rather than as continuous text. The initial vocabulary comprises the base characters appearing in the corpus, {l, o, w, e, s, t, n, i, d, r}, plus the boundary symbol ·. Adjacent character pairs are then counted across the corpus to identify the most frequent ones for merging.
The BPE process iteratively replaces the most frequent pair with a new subword token, updating the vocabulary and pair counts after each merge, for a specified number of iterations (here, limited to 4 merges for brevity); ties between equally frequent pairs are broken arbitrarily, so the order shown is one possible outcome. The merges learned are: first "r" + "·" (appearing at the end of "newer" and "wider"), creating "r·" and updating those words to n e w e r· and w i d e r·, respectively; second "l" + "o" (in "low" and "lowest"), forming "lo"; third "lo" + "w" (in the updated "low" and "lowest"), forming "low"; fourth "e" + "r·" (in the updated "newer" and "wider"), forming "er·".[10]
The resulting vocabulary after these 4 merges includes the initial characters plus the new subwords: {l, o, w, e, s, t, n, i, d, r, ·, r·, lo, low, er·}. Each token in the vocabulary is assigned a unique integer ID, typically starting from 0 for the base characters (e.g., l=0, o=1, ..., ·=10, r·=11, lo=12, low=13, er·=14). To encode a word from the corpus, such as "lowest", the learned merges are applied to the character sequence l o w e s t · in the order they were learned: "lo" and then "low" apply (covering l o w), no later merge matches, and the remaining symbols stay as single characters, yielding tokens ["low", "e", "s", "t", "·"] with IDs [13, 3, 4, 5, 10].[10]
For an unseen word like "lower" (l o w e r ·), applying the merges in the same order yields "r·", then "lo", then "low", and finally "er·", producing ["low", "er·"] with IDs [13, 14]. This demonstrates BPE's ability to handle out-of-vocabulary words by decomposing them into known subwords, promoting open-vocabulary coverage. If the word introduced new characters outside the initial vocabulary, the encoding would fall back to individual characters wherever no merges apply.[10]
The merge process is summarized in the following table, showing the pair selected at each iteration, its frequency at the time of the merge, and the updated word segmentations:
| Merge Iteration | Pair Merged | Frequency at Merge | Updated Words (Examples) | Vocabulary Token Added |
|---|---|---|---|---|
| 1 | r + · → r· | 2 | newer → n e w e r·; wider → w i d e r· | r· |
| 2 | l + o → lo | 2 | low → lo w ·; lowest → lo w e s t · | lo |
| 3 | lo + w → low | 2 | low → low ·; lowest → low e s t · | low |
| 4 | e + r· → er· | 2 | newer → n e w er·; wider → w i d er· | er· |
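The walkthrough can likewise be reproduced with the hypothetical `learn_bpe` and `encode_word` sketches from the section on the modified algorithm. Because ties between equally frequent pairs are broken by implementation detail, the learned merges may come out in a different order than the table above:

```python
# Toy corpus from the example above; each word is given a frequency of 1.
word_freqs = {"low": 1, "lowest": 1, "newer": 1, "wider": 1}

merges = learn_bpe(word_freqs, num_merges=4)
print(merges)
# With this sketch's tie-breaking: [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('er', '·')]

print(encode_word("lowest", merges))   # ['low', 'e', 's', 't', '·']
print(encode_word("lower", merges))    # ['low', 'er·']
```

Even though the intermediate merges differ from the table, the resulting segmentations of "lowest" and "lower" match the ones shown above for this corpus.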