Byte-pair encoding
Byte-pair encoding (BPE) is a data compression algorithm that iteratively identifies and replaces the most frequent adjacent pairs of bytes (or symbols) in a text corpus with a new, unused symbol from an extended alphabet, building a compact representation while preserving the original sequence through a dictionary of merges.[1] Introduced by Philip Gage in 1994, BPE was designed as a general-purpose compression technique that achieves performance nearly matching more sophisticated methods like Lempel–Ziv–Welch (LZW) but with simpler implementation and faster decompression.[1]

In natural language processing, BPE was repurposed in 2016 by Rico Sennrich, Barry Haddow, and Alexandra Birch as a subword tokenization strategy for neural machine translation (NMT), enabling models to handle rare, compound, and out-of-vocabulary words by decomposing them into frequent subword units rather than relying on fixed word-level vocabularies.[2] This adaptation allows for open-vocabulary translation, where unseen words are represented as sequences of known subwords, improving translation quality for morphologically rich languages and proper nouns through mechanisms like character-level copying or transliteration, with reported gains of up to 1.3 BLEU points on standard benchmarks.[2]

BPE's efficiency in managing vocabulary size (typically limiting it to 30,000–50,000 tokens) has made it a cornerstone of modern large language models (LLMs), including OpenAI's GPT series, where it processes diverse text inputs at the byte level to support multilingual capabilities and reduce out-of-vocabulary issues.[3] For instance, GPT-2 employs a BPE tokenizer with a vocabulary of 50,257 tokens derived from 50,000 merges on a large English corpus, balancing coverage of common words, subwords, and rare sequences to enhance model performance across unsupervised multitask learning.[3][4] Variants like SentencePiece further refine BPE for direct operation on raw text without language-specific preprocessing, underscoring its versatility in contemporary AI systems.

Background and History
Origins in Data Compression
Byte-pair encoding (BPE) was developed by Philip Gage in 1994 as a lossless data compression technique designed to efficiently reduce the size of text files by replacing frequently occurring pairs of bytes with single new symbols, thereby minimizing redundancy while preserving all original information.[5] This approach emerged as a response to the need for simple, adaptive compression methods that could handle repetitive patterns in data without requiring complex dictionary management.

The core innovation of BPE lies in its iterative process of identifying and merging the most common adjacent byte pairs across the input data, assigning each merged pair a unique symbol from the extended alphabet (typically using unused byte values from 128 to 255 in ASCII).[5] This builds a hierarchical vocabulary of increasingly longer symbols, allowing the algorithm to capture higher-order redundancies more effectively than single-pass substitution methods.[6] By focusing on byte-level operations, BPE ensures lossless decompression through a straightforward expansion routine that reverses the merges in the reverse order of their application.[5]

In its initial implementation, BPE was coded in ANSI C and applied primarily to text files, demonstrating compression ratios that outperformed variants of the Lempel–Ziv–Welch (LZW) algorithm on certain datasets, such as reducing a 544,789-byte executable file to 276,955 bytes compared to LZW's 292,588 bytes with a 14-bit dictionary.[5] The algorithm's compression phase required moderate memory (5–30 KB) and processing time, while decompression was notably fast (under 20 seconds on a 33 MHz 486 PC), making it suitable for resource-constrained environments like embedded systems or real-time applications.[5]

BPE arose in the mid-1990s amid the evolution of adaptive compression algorithms, following the foundational Lempel–Ziv work of 1977, LZW's 1984 refinement, and the release of gzip in 1993, which combined LZ77 with Huffman coding for broader utility.[5] This timing positioned BPE as a lightweight alternative emphasizing speed and simplicity over maximal compression depth.

Adoption in Natural Language Processing
Byte-pair encoding (BPE) was first adapted for natural language processing (NLP) in the 2016 paper "Neural Machine Translation of Rare Words with Subword Units" by Rico Sennrich, Barry Haddow, and Alexandra Birch, where it served as a subword tokenization technique to enhance neural machine translation (NMT) systems.[7] The method addressed the challenge of out-of-vocabulary (OOV) words by decomposing them into frequent subword units, allowing the model to translate rare or unseen terms through compositionality rather than relying on a fixed word vocabulary.[7]

This adoption stemmed from the shortcomings of word-level tokenization, which often results in sparse representations for low-frequency words and struggles with morphologically complex languages, such as agglutinative ones where words are formed by extensive affixation.[7] By iteratively merging the most frequent pairs of symbols into larger units, BPE strikes a balance between vocabulary size (typically limited to 30,000–60,000 tokens) and sequence length, mitigating the explosion of parameters needed for full word embeddings while preserving semantic coherence.[7]

A key modification in this NLP context involved shifting from the original byte-level merging of the compression algorithm to operations on Unicode characters, making the process robust to various scripts and encodings while allowing vocabularies to be learned directly from domain-specific corpora.[7] This adaptation, applied after initial word segmentation, enables BPE to capture linguistic regularities like morphemes without predefined rules.

The impact of BPE in NLP has been to enhance generalization in sequence-to-sequence models, including early recurrent neural networks (RNNs) for NMT and subsequent transformer architectures, by alleviating data sparsity and improving handling of morphological variation across languages.[7] For instance, experiments in the original work demonstrated substantial gains in translation quality for rare words, with BLEU score improvements of up to 1.3 points on tasks such as English-to-German and English-to-Russian translation.[7] BPE's prominence in NLP grew rapidly after 2016, with integrations into open-source toolkits such as OpenNMT by 2017 for streamlined subword handling in training pipelines, and fairseq by 2019, where it became a standard for efficient vocabulary construction in large-scale NMT experiments. These implementations accelerated its use in research and production systems, solidifying BPE as a foundational subword tokenization technique in modern NLP workflows.

Core Algorithm
Original Byte-Pair Encoding Procedure
The original byte-pair encoding (BPE) procedure is a lossless data compression algorithm that operates on a corpus represented as a sequence of bytes, typically from a text file or binary data stream. It begins by treating the input as a stream of individual bytes, initializing a vocabulary consisting of all possible unique byte values (commonly the 256 values representable in an 8-bit byte). The corpus is then represented as a sequence of indices corresponding to these byte symbols, enabling efficient pair tracking and replacement. This initialization sets the stage for iterative merging without requiring predefined subword boundaries, focusing purely on frequency-based compression.[5]

The core process involves repeated iterations to identify and merge the most frequent adjacent pairs of symbols. In each iteration, the algorithm scans the current representation of the corpus to count the occurrences of all adjacent symbol pairs, often using a hash table for efficient tallying. The pair with the highest frequency is selected, and a new symbol, typically an unused byte value, is introduced to the vocabulary to represent this merged pair. All occurrences of the selected pair throughout the corpus are then replaced by this new single symbol, reducing the total length of the sequence while preserving the original information. This replacement is performed in place or via buffer swaps to minimize memory overhead, and the pair counts in the tracking structure are updated to reflect the changes: decrementing counts for pairs destroyed by the merge and incrementing counts for newly formed adjacent pairs. The process repeats for a predefined number of merges, until a desired vocabulary size is reached, or until no repeated pairs remain, at which point further compression yields no gain.[5]

The output of the procedure consists of the compressed corpus as a sequence of indices into the final vocabulary, alongside a pair table recording the merge history (i.e., which pairs map to which new symbols) to enable decoding. This pair table serves as the dictionary for reconstruction, ensuring lossless recovery of the original byte stream. Decoding undoes the merges in the reverse order of their application, typically using a stack-based approach for efficiency: each symbol in the compressed sequence is either output as a literal byte or expanded by pushing the corresponding original pair onto the stack, processing nested expansions locally before proceeding. This single-pass decoding is notably faster than the multi-pass compression phase.[5]

In terms of computational complexity, each iteration requires counting pair frequencies across the corpus, which can be achieved in linear time using hashing; with m total merges, the overall compression time is O(m n), where n is the length of the corpus representation. The original implementation leverages hashing for pair counting to approximate linear time per scan, but finding the global maximum frequency may introduce additional factors in practice.[8]

Pseudocode Outline
The following provides a high-level outline of the original BPE compression procedure, adapted for clarity from the foundational implementation and written here as runnable Python-style pseudocode:

```python
from collections import defaultdict

def bpe_compress(data, max_merges):
    # Input: data as a byte sequence.
    # Output: compressed list of symbol indices and merge_rules,
    # a list of (pair -> new_symbol) mappings used for decoding.
    corpus = list(data)      # represent the input as byte indices 0-255
    merge_rules = []         # records each merge for later decoding
    next_symbol = 256        # new symbols start above the byte range
                             # (Gage's original reuses unused byte values instead)

    for _ in range(max_merges):
        # Count frequencies of all adjacent symbol pairs
        pair_counts = defaultdict(int)
        for pair in zip(corpus, corpus[1:]):
            pair_counts[pair] += 1
        if not pair_counts:
            break

        # Find the most frequent pair; stop once nothing repeats
        most_frequent_pair = max(pair_counts, key=pair_counts.get)
        if pair_counts[most_frequent_pair] < 2:
            break
        merge_rules.append((most_frequent_pair, next_symbol))

        # Replace all occurrences of the pair with the new symbol
        new_corpus, i = [], 0
        while i < len(corpus):
            if i + 1 < len(corpus) and (corpus[i], corpus[i + 1]) == most_frequent_pair:
                new_corpus.append(next_symbol)
                i += 2
            else:
                new_corpus.append(corpus[i])
                i += 1
        corpus = new_corpus
        next_symbol += 1

    return corpus, merge_rules
```

For decoding, a recursive or stack-based expansion applies the merge rules in reverse order to reconstruct the original corpus from the compressed indices and rules.[5]
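The stack-based decoding routine described above can be sketched as follows. This is a minimal illustration that assumes the `bpe_compress` outline above; the function name and details are illustrative rather than part of Gage's published code:

```python
def bpe_decompress(compressed, merge_rules):
    """Recover the original byte sequence by expanding merged symbols with a stack."""
    # Map each merged symbol back to the pair it replaced.
    expansions = {symbol: pair for pair, symbol in merge_rules}
    output = bytearray()
    for symbol in compressed:
        stack = [symbol]
        while stack:
            current = stack.pop()
            if current in expansions:
                left, right = expansions[current]
                stack.append(right)   # push right first so the left half is emitted first
                stack.append(left)
            else:
                output.append(current)   # a literal byte (0-255), emitted as-is
    return bytes(output)
```

Because each compressed symbol is expanded locally in a single pass over the compressed sequence, this mirrors why decompression is substantially faster than the multi-pass compression phase.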
Modified Byte-Pair Encoding for Subword Tokenization
The modified byte-pair encoding (BPE) algorithm adapts the original compression technique for subword tokenization in natural language processing, enabling the segmentation of words into smaller, reusable units to address vocabulary limitations and handle rare or out-of-vocabulary terms effectively.[2] This adaptation shifts the focus from lossless text compression to creating a fixed-size vocabulary of subword units derived from character sequences, facilitating open-vocabulary processing in tasks like machine translation and language modeling.[2]

The input for learning merge rules is a training corpus of words or sentences, preprocessed into character sequences with word boundaries explicitly marked, typically by appending a special end-of-word symbol such as ‘·’ to each word, to prevent unintended merges across word edges during training.[2] The base vocabulary initializes with all unique characters encountered in the corpus, often drawn from Unicode representations, treating the text as a sequence of these atomic units rather than raw bytes.[2]

A key modification involves iteratively merging the most frequent adjacent pairs of characters (or, later, subwords) within the corpus: frequency counts are computed for all pairs, the highest-frequency pair is replaced by a new symbol throughout the text, and this process repeats, building a sequence of merge rules ordered by decreasing frequency.[2] The number of merge operations, denoted as num_merges, serves as the primary hyperparameter, controlling the growth of the vocabulary from the initial character set to a larger set of compound subwords; for instance, 59,500 merges can yield a vocabulary of approximately 63,000 subword units, while 89,500 merges produce around 82,000 units.[2] Optional frequency thresholds may filter pairs below a minimum count to prioritize more reliable merges, though this is not always applied in standard implementations.[9]

Unlike the original BPE, which applies merges exhaustively to an entire corpus for compression, these rules are learned once on the training data and then applied selectively to new text for tokenization, emphasizing segmentation over full replacement.[2] To encode a new word using the learned rules, the word is first decomposed into its individual characters.
The merge operations are then applied in the order they were learned: the sequence is repeatedly scanned for the earliest-learned (highest-frequency) applicable pair, which is merged, until no further merges apply, as illustrated in the sketch below.[2] The output is a sequence of subword units from the fixed vocabulary, each mapped to a unique integer ID for model input; for example, the word "lowest" might be tokenized as the subwords "low" followed by "est".[2]

Rare or unseen words are handled by partial matching: the algorithm applies as many merges as possible, falling back to individual characters for any remaining unmatched segments, ensuring no word is entirely unknown while preserving its orthographic integrity as a concatenation of valid units.[2] For instance, an unseen word like "unseen" could be represented as "un" + "s" + "e" + "e" + "n" if no further merges apply beyond the known prefix.[2]

In contrast to compression, where decoding must reverse all merges in reverse order to recover the original byte sequence, subword tokenization requires no decoding cycle: the original text is reconstructed simply by concatenating the subword strings (with end-of-word markers restored to word boundaries), since the process is applied per word and does not alter the overall corpus structure.[2]
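The learn-and-apply procedure can be illustrated with a minimal sketch. This is not the reference implementation from Sennrich et al.; the function names (`learn_bpe`, `merge_pair`, `encode_word`), the ‘·’ end-of-word marker, and the tie-breaking behavior are illustrative assumptions:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the concatenated token."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge rules from a {word: frequency} mapping."""
    # Each word becomes a list of characters plus an end-of-word marker '·'.
    corpus = [(list(word) + ["·"], freq) for word, freq in word_freqs.items()]
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)   # ties broken by first-counted pair
        merges.append(best)
        corpus = [(merge_pair(symbols, best), freq) for symbols, freq in corpus]
    return merges

def encode_word(word, merges):
    """Segment a new word by applying the learned merges in the order they were learned."""
    symbols = list(word) + ["·"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

Applying each learned merge in order, as `encode_word` does, mirrors the description above; widely used tokenizers implement the same idea more efficiently, for example by repeatedly merging the applicable pair with the lowest merge rank.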
Worked Examples
Compression Example with Text Data
To illustrate the compression process of the original byte-pair encoding (BPE) algorithm, consider a small sample text corpus consisting of the string "ABABCABCD", where each character represents a byte from a simple alphabet {A, B, C, D}.[5] This initial representation is a sequence of 9 bytes: A B A B C A B C D.[5]

In the first iteration, the algorithm identifies all adjacent byte pairs in the sequence and counts their frequencies: AB appears three times, BC twice, and BA, CA, CD each once.[5] The most frequent pair, AB, is merged into a new symbol X (an unused byte value), and all occurrences are replaced by scanning the sequence from left to right: the first AB (positions 1–2) becomes X, the next AB (original positions 3–4) becomes X, and the final AB (original positions 6–7) becomes X.[5] The updated sequence is now X X C X C D, reducing the length to 6 symbols.[5]

In the second iteration, the adjacent pairs in the new sequence X X C X C D are counted: XC appears twice (positions 2–3 and 4–5), while XX, CX, and CD each appear once.[5] The pair XC is merged into a new symbol Y, and replacements yield X (position 1), Y (from positions 2–3), Y (from positions 4–5), and D (position 6).[5] The resulting sequence is X Y Y D, now 4 symbols long.[5]

Further iterations could continue until no pairs repeat or compression benefits diminish, but this example stops here, demonstrating a reduction from 9 to 4 symbols (a 55.6% length reduction in the sequence, excluding the overhead of the merge table).[5] The merges are recorded in a pair table for decoding: first AB → X, then XC → Y.[5]

To recover the original, the process is reversed starting from the last merge: each Y is replaced by XC, yielding X X C X C D.[5] Then each X is replaced by AB, resulting in A B A B C A B C D.[5] This reversible substitution ensures lossless compression, with the table stored alongside the encoded sequence to enable exact reconstruction.[5]
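This worked example can be reproduced with the `bpe_compress` and `bpe_decompress` sketches from the pseudocode section (both hypothetical illustrations rather than Gage's original code), with X and Y corresponding to the new symbol indices 256 and 257:

```python
# Round trip of the "ABABCABCD" example using the sketches above.
compressed, rules = bpe_compress(b"ABABCABCD", max_merges=2)

print(compressed)   # [256, 257, 257, 68] -> X Y Y D, where D is byte 68
print(rules)        # [((65, 66), 256), ((256, 67), 257)] -> AB -> X, then XC -> Y
assert bpe_decompress(compressed, rules) == b"ABABCABCD"   # lossless recovery
```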
Tokenization Example with Vocabulary Building
To illustrate the modified byte-pair encoding (BPE) procedure for subword tokenization in natural language processing, consider a toy corpus consisting of the words "low", "lowest", "newer", and "wider".[10] These words are first pre-tokenized into individual characters, with an end-of-word boundary symbol (denoted as ·) appended to each word so that word boundaries are handled explicitly rather than as continuous text. The initial vocabulary comprises the base characters appearing in the corpus, {l, o, w, e, s, t, n, i, d, r}, plus the boundary symbol ·. Adjacent character pairs are then counted across the corpus to identify the most frequent ones for merging.
The BPE process iteratively replaces the most frequent pair with a new subword token, updating the vocabulary and pair counts after each merge, for a specified number of iterations (here, limited to 4 merges for brevity); ties between equally frequent pairs are broken arbitrarily, so the order shown is one possible outcome. The merges learned are: first "r" + "·" (appearing at the end of "newer" and "wider"), creating "r·" and updating those words to n e w e r· and w i d e r·, respectively; second "l" + "o" (in "low" and "lowest"), forming "lo"; third "lo" + "w" (in the updated "low" and "lowest"), forming "low"; fourth "e" + "r·" (in the updated "newer" and "wider"), forming "er·".[10]
The resulting vocabulary after these 4 merges includes the initial characters plus the new subwords: {l, o, w, e, s, t, n, i, d, r, ·, r·, lo, low, er·}. Each token in the vocabulary is assigned a unique integer ID, typically starting from 0 for the base characters (e.g., l=0, o=1, ..., ·=10, r·=11, lo=12, low=13, er·=14). To encode a word from the corpus, such as "lowest", the learned merges are applied to the character sequence l o w e s t · in the order they were learned: "lo" and then "low" apply (covering l o w), no later merge matches, and the remaining symbols stay as single characters, yielding tokens ["low", "e", "s", "t", "·"] with IDs [13, 3, 4, 5, 10].[10]
For an unseen word like "lower" (l o w e r ·), applying the merges in the same order yields "r·", then "lo", then "low", and finally "er·", producing ["low", "er·"] with IDs [13, 14]. This demonstrates BPE's ability to handle out-of-vocabulary words by decomposing them into known subwords, promoting open-vocabulary coverage. If the word introduced new characters outside the initial vocabulary, the encoding would fall back to individual characters wherever no merges apply.[10]
The merge process is summarized in the following table, showing the pair selected at each iteration, its frequency at the time of the merge, and the updated word segmentations:
| Merge Iteration | Pair Merged | Frequency at Merge | Updated Words (Examples) | Vocabulary Token Added |
|---|---|---|---|---|
| 1 | r + · → r· | 2 | newer → n e w e r·; wider → w i d e r· | r· |
| 2 | l + o → lo | 2 | low → lo w ·; lowest → lo w e s t · | lo |
| 3 | lo + w → low | 2 | low → low ·; lowest → low e s t · | low |
| 4 | e + r· → er· | 2 | newer → n e w er·; wider → w i d er· | er· |
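The walkthrough can likewise be reproduced with the hypothetical `learn_bpe` and `encode_word` sketches from the section on the modified algorithm. Because ties between equally frequent pairs are broken by implementation detail, the learned merges may come out in a different order than the table above:

```python
# Toy corpus from the example above; each word is given a frequency of 1.
word_freqs = {"low": 1, "lowest": 1, "newer": 1, "wider": 1}

merges = learn_bpe(word_freqs, num_merges=4)
print(merges)
# With this sketch's tie-breaking: [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('er', '·')]

print(encode_word("lowest", merges))   # ['low', 'e', 's', 't', '·']
print(encode_word("lower", merges))    # ['low', 'er·']
```

Even though the intermediate merges differ from the table, the resulting segmentations of "lowest" and "lower" match the ones shown above for this corpus.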