Inverted index
An inverted index is a fundamental data structure in information retrieval systems that maps terms (such as words) from a collection of documents to the specific documents containing them, along with their positions within those documents, enabling efficient full-text search and query processing.[1] This structure inverts the traditional forward index, which maps documents to their terms, by organizing information around the terms themselves to facilitate rapid lookups without scanning entire document sets.[2] It consists of two primary components: a dictionary (or vocabulary), which is a sorted list of unique terms along with metadata like document frequency, and postings lists (or inverted lists), which for each term record the document identifiers (docIDs) where the term appears, often sorted by docID for efficient merging during queries, and may include term frequencies or positions for advanced retrieval like phrase searches.[1][2]

Building an inverted index involves tokenizing document text, normalizing terms (e.g., stemming or case folding), sorting and grouping occurrences, and storing the results, typically with the dictionary in memory for quick access and postings on disk to handle large-scale collections.[1] Inverted indexes form the backbone of modern search engines and databases, supporting operations like Boolean queries (AND, OR, NOT), ranked retrieval based on term relevance (e.g., using tf-idf weighting), and proximity searches, while techniques such as compression and blocking optimize storage and query speed for massive datasets.[2] Originating in early information retrieval research in the 1960s, they have evolved to handle web-scale data, incorporating optimizations for dynamic updates and distributed environments.
Fundamentals
Definition and Purpose
An inverted index is a data structure that maps content elements, such as words or tokens, to the documents in a collection where they occur, along with their locations within those documents, thereby enabling rapid retrieval of relevant information.[3] This structure inverts the traditional document-centric organization of data, shifting the focus to term-centric access for efficient querying.[3]

The concept of the inverted index originated in the 1960s within information retrieval systems, with early implementations appearing in library catalogs and pioneering experimental setups like the SMART system developed by Gerard Salton at Harvard and later Cornell University.[4] It built on earlier manual indexing techniques but formalized automated full-text search in computational environments.[5]

Key components of an inverted index include the dictionary, which holds unique terms extracted from the documents, and the postings lists, which are associated with each term and contain document identifiers (docIDs) where the term appears, often supplemented by term frequencies or positional information for advanced queries.[3] The primary purpose is to support full-text search by transforming the document-to-content mapping into a content-to-document one, which reduces query processing time from a linear scan of the entire collection to near-constant or logarithmic complexity, depending on the implementation.[3]

To illustrate, consider a small collection of three documents, whose positional index is shown in the table that follows (a short code sketch of the construction appears after the table):
- Doc1: "The quick brown fox jumps."
- Doc2: "The fox runs quickly."
- Doc3: "Brown dogs jump over fences."
| Term | Postings List |
|---|---|
| brown | Doc1 (pos 3), Doc3 (pos 1) |
| fox | Doc1 (pos 4), Doc2 (pos 2) |
| jump | Doc1 (pos 5), Doc3 (pos 3) |
| quick | Doc1 (pos 2), Doc2 (pos 4) |
| the | Doc1 (pos 1), Doc2 (pos 1) |
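As a concrete illustration of this construction, the following minimal Python sketch builds the positional index in the table above from the three example documents. It assumes whitespace tokenization, lowercasing, punctuation stripping, and a toy suffix-stripping rule standing in for a real stemmer, and it keeps stop words such as "the" so that the table's entries (a subset of the full vocabulary) appear in the output.

```python
from collections import defaultdict

def toy_stem(token):
    # Toy stand-in for a real stemmer (e.g., Porter): strip a few common suffixes.
    for suffix in ("ly", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_positional_index(docs):
    """Map each normalized term to {doc_id: [positions]} with 1-based positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        tokens = text.lower().replace(".", "").split()
        for pos, token in enumerate(tokens, start=1):
            index[toy_stem(token)][doc_id].append(pos)
    return {term: dict(postings) for term, postings in sorted(index.items())}

docs = {
    "Doc1": "The quick brown fox jumps.",
    "Doc2": "The fox runs quickly.",
    "Doc3": "Brown dogs jump over fences.",
}
for term, postings in build_positional_index(docs).items():
    print(term, postings)
# brown {'Doc1': [3], 'Doc3': [1]}
# fox {'Doc1': [4], 'Doc2': [2]}
# ...
```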
Comparison to Forward Index
A forward index, also known as a document index, is a data structure that maps each document to the list of terms (such as words or tokens) it contains, often including positional information within the document.[6] This structure facilitates tasks centered on individual documents, such as summarization, where term frequencies help generate abstracts, or clustering, where document-term vectors enable grouping similar documents based on shared vocabulary.[7] In contrast to the inverted index, which adopts a term-centric approach by mapping each term to the documents containing it, the forward index is document-centric, organizing data around documents rather than terms. This fundamental difference affects how data is stored and accessed: an inverted index prioritizes quick term lookups across a corpus, while a forward index emphasizes easy retrieval of a document's full content. The following table illustrates these structural differences using a simple example corpus of three documents; a short code sketch showing how a forward index is converted into an inverted one follows the table.
| Aspect | Forward Index Example | Inverted Index Example |
|---|---|---|
| Mapping | Doc1: [apple, banana, cat] Doc2: [banana, dog] Doc3: [apple, cat, elephant] | apple: [Doc1, Doc3] banana: [Doc1, Doc2] cat: [Doc1, Doc3] dog: [Doc2] elephant: [Doc3] |
| Key | Document ID (e.g., Doc1) | Term (e.g., apple) |
| Value | List of terms in the document | List of document IDs containing the term (postings list) |
| Storage Focus | Per-document term lists, suitable for document-level operations | Per-term postings, optimized for cross-document queries |
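The structural relationship in the table can also be expressed directly in code: a forward index is inverted by redistributing each document's terms into per-term postings lists. The sketch below is a minimal illustration using the example corpus above; the function name `invert` is an arbitrary choice.

```python
def invert(forward_index):
    """Turn {doc_id: [terms]} (forward index) into {term: sorted doc_ids} (inverted index)."""
    inverted = {}
    for doc_id, terms in forward_index.items():
        for term in set(terms):                 # ignore duplicate terms within a document
            inverted.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in sorted(inverted.items())}

forward = {
    "Doc1": ["apple", "banana", "cat"],
    "Doc2": ["banana", "dog"],
    "Doc3": ["apple", "cat", "elephant"],
}
print(invert(forward))
# {'apple': ['Doc1', 'Doc3'], 'banana': ['Doc1', 'Doc2'], 'cat': ['Doc1', 'Doc3'],
#  'dog': ['Doc2'], 'elephant': ['Doc3']}
```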
Construction
Building Process
The building process of an inverted index begins with a preprocessing pipeline applied to the document collection to prepare the text for indexing. Tokenization splits the raw text into individual terms or tokens, typically by identifying word boundaries using spaces, punctuation, or other delimiters. Normalization follows, which includes converting terms to lowercase to ensure case-insensitive matching, stemming or lemmatization to reduce words to their base forms (e.g., "running" to "run"), and removal of stop words such as "the" or "and" to eliminate common but non-informative terms that would otherwise inflate the index size. These steps transform the unstructured text into a standardized sequence of tokens suitable for indexing, ensuring consistency across documents.

For small collections that fit in memory, the algorithmic steps involve parsing documents sequentially, extracting terms, and constructing the index directly. First, each document is assigned a unique identifier (docID), typically a sequential integer. As the document is parsed, normalized tokens are extracted and paired with the docID to form term-docID pairs. These pairs are collected for the entire collection, then sorted lexicographically by term followed by docID. Duplicates are merged by grouping identical terms and compiling sorted lists of unique docIDs (postings lists) for each term; the resulting dictionary of terms and their associated postings constitutes the complete inverted index. This merge-sort-based approach ensures efficient organization for subsequent queries.[1]

Handling large corpora that exceed available memory requires scalable techniques like blocked sort-based indexing (BSBI), which divides the collection into manageable blocks processed independently before merging. Documents are parsed sequentially into blocks of term-docID pairs until memory limits are reached (e.g., 10 million pairs per block). Each block's pairs are sorted in memory and inverted into temporary postings lists, which are written to disk as partial indexes. The final step merges these temporary indexes using a multi-way merge with a priority queue (min-heap) to combine postings lists for each term across blocks, producing the global inverted index. This approach manages memory by avoiding full-collection loading while maintaining efficiency.[11]

An alternative method for large-scale indexing is single-pass in-memory indexing (SPIMI), which processes the document collection in a single pass without generating sorted term-docID pairs for the entire collection. Instead, it uses a hash-based dictionary of terms (no termIDs) and maintains dynamic postings lists in memory. As tokens are encountered, they are added to the appropriate postings list. When memory is full, the current block is sorted by term, compressed, and written to disk as a partial index. The partial indexes are then merged similarly to BSBI.
SPIMI reduces I/O overhead, allows larger block sizes due to efficient data structures, and achieves near-linear time complexity in the number of tokens, making it more efficient than BSBI for many practical scenarios, including implementations in search engines like Apache Lucene.[12]

Pseudocode for the BSBI merge process illustrates the merging of temporary indexes:

```
MERGEBLOCKS(blockFiles):
    mergedPostings = empty dictionary of lists
    pq = priority queue initialized with first term-docID pair from each blockFile
    while pq is not empty:
        pair = pq.extract_min()
        if pair differs from previous term:
            write postings for previous term to mergedPostings
        add pair to current term's postings list
        if more pairs in pair's blockFile:
            insert next pair from that blockFile into pq
    write remaining postings to mergedPostings
    return mergedPostings as final index
```

This pseudocode uses a priority queue to efficiently select the next smallest term across blocks, ensuring sorted output.[11]

The time complexity of building an inverted index via BSBI is \Theta(T \log T), where T is the total number of term-docID pairs (roughly the number of tokens N in the collection), dominated by the sorting step within each block and the merging. Space complexity grows linearly with the collection size, requiring disk storage for temporary blocks proportional to the number of documents and term frequencies, plus in-memory space for one block at a time. For example, indexing the 1 GB Reuters-RCV1 corpus (800,000 documents, 100 million tokens) typically requires about 10 blocks of 10 million pairs each, with total build time on the order of hours depending on hardware.[11]

To illustrate, consider a small corpus of five documents after preprocessing (stop words removed, terms lowercased and stemmed), listed below along with their term-docID pairs and the merged postings lists; a short code sketch of the same sort-and-group construction follows the example:
- Doc1: friend roman countryman lend ear
- Doc2: friend roman lend ear
- Doc3: beware greek
- Doc4: friend roman lend ear
- Doc5: time time
- Doc1 pairs: (countryman,1), (ear,1), (friend,1), (lend,1), (roman,1)
- Doc2 pairs: (ear,2), (friend,2), (lend,2), (roman,2)
- Doc3 pairs: (beware,3), (greek,3)
- Doc4 pairs: (ear,4), (friend,4), (lend,4), (roman,4)
- Doc5 pairs: (time,5), (time,5)
After sorting these pairs by term and docID and merging duplicates, the resulting postings lists are:
- beware: [3]
- countryman: [1]
- ear: [1,2,4]
- friend: [1,2,4]
- greek: [3]
- lend: [1,2,4]
- roman: [1,2,4]
- time: [5]
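The sort-and-group construction that produces these postings lists can be reproduced with a short Python sketch. This is a minimal in-memory version of the sort-based approach described above, not BSBI's on-disk blocking, and it assumes the documents are already preprocessed as in the example.

```python
def build_index(docs):
    """Collect (term, docID) pairs, sort them, and group into postings lists."""
    pairs = []
    for doc_id, text in docs.items():
        for term in text.split():
            pairs.append((term, doc_id))
    pairs.sort()                                    # sort by term, then by docID
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # drop duplicate (term, docID) pairs
            postings.append(doc_id)
    return index

docs = {
    1: "friend roman countryman lend ear",
    2: "friend roman lend ear",
    3: "beware greek",
    4: "friend roman lend ear",
    5: "time time",
}
print(build_index(docs))
# {'beware': [3], 'countryman': [1], 'ear': [1, 2, 4], 'friend': [1, 2, 4],
#  'greek': [3], 'lend': [1, 2, 4], 'roman': [1, 2, 4], 'time': [5]}
```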
Storage Structures
An inverted index primarily consists of two core components: a dictionary and postings lists. The dictionary, also known as the term lexicon, serves as a mapping from unique terms to their corresponding postings lists, enabling efficient term lookup. It is typically implemented using a hash table for average O(1) access time or a B-tree for O(log T) access, where T represents the number of unique terms in the collection, balancing speed and ordered traversal capabilities. Postings lists store the locations of each term's occurrences across documents, usually as a sequence of document identifiers (docIDs), often augmented with term frequencies or positional information. These lists can be represented as linked lists for dynamic insertion, fixed-length arrays for in-memory efficiency, or skip lists to accelerate merging during queries by allowing jumps over irrelevant entries. For large collections, postings are sorted by docID to facilitate efficient set operations.

Advanced variants extend the basic structure to support more complex queries. Positional indexes include word offsets within each document in the postings, enabling phrase and proximity searches by verifying sequential positions. Zone indexes differentiate document sections (e.g., title, body, or author) by tagging postings with zone identifiers, allowing weighted or restricted searches such as terms appearing only in titles.[16]

Implementation considerations focus on balancing memory usage, access speed, and persistence. Inverted files, a common disk-based format, organize the index into blocks for sequential I/O, with the dictionary often held in memory while postings reside on disk as variable-length records addressed via offsets. In-memory storage suits smaller or frequently accessed indexes, whereas persistent storage handles larger corpora through techniques like memory-mapped files. Variable-length postings, due to differing document frequencies, are managed by prefix compression or byte-aligned encoding to minimize space and fragmentation.

The following table illustrates a sample dictionary entry and postings list for the term "inverted" in a small document collection; a brief code sketch of the same entry follows the table:
| Term | Postings List |
|---|---|
| inverted | 1: (freq=2, pos=[5, 12]); 3: (freq=1, pos=[17]); 5: (freq=3, pos=[3, 15, 22]) |
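To make the layout concrete, the sketch below models the table's entry for "inverted" as an in-memory dictionary entry with a positional postings list. The class names and structure are illustrative assumptions rather than a prescribed implementation; a disk-based index would instead store postings as compressed, offset-addressed records as described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Posting:
    doc_id: int
    positions: List[int] = field(default_factory=list)   # word offsets within the document

    @property
    def freq(self) -> int:            # term frequency is implied by the position list
        return len(self.positions)

@dataclass
class DictionaryEntry:
    term: str
    doc_freq: int                     # number of documents containing the term
    postings: List[Posting]           # kept sorted by doc_id so lists can be merged linearly

# The table's entry for "inverted", expressed in this layout:
entry = DictionaryEntry(
    term="inverted",
    doc_freq=3,
    postings=[Posting(1, [5, 12]), Posting(3, [17]), Posting(5, [3, 15, 22])],
)

# A full index maps each term to its entry via a hash table or B-tree.
lexicon: Dict[str, DictionaryEntry] = {entry.term: entry}
print(lexicon["inverted"].postings[0].freq)   # 2 occurrences in document 1
```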
Query Processing
Searching Mechanism
The searching mechanism in an inverted index begins with processing a user query by breaking it into individual terms and retrieving the corresponding postings lists from the dictionary. For a single-term query, the system directly accesses the dictionary to locate the term and fetches its postings list, which contains the document identifiers (docIDs) where the term appears, often sorted in ascending order for efficient processing. This lookup is typically constant-time with a hash table or logarithmic in the dictionary size with a tree structure, enabling rapid identification of candidate documents.

Multi-term queries, such as Boolean combinations using AND, OR, or NOT operators, require operations on multiple postings lists to compute the result set. For an AND query, the system performs an intersection of the sorted postings lists for each term, using a linear merge algorithm that advances pointers through the lists in tandem, matching common docIDs in O(m + n) time, where m and n are the lengths of the two lists. This process can be optimized with skip pointers, which are additional references embedded in the postings lists at regular intervals (e.g., every √L entries for a list of length L), allowing the algorithm to jump over blocks of non-matching docIDs and reduce comparisons, particularly effective when one list is much shorter than the other. OR queries involve a union operation, merging sorted lists while eliminating duplicates, also in linear time, while NOT queries complement the result by excluding docIDs from a specified list. Early termination optimizations halt traversal if a short list is exhausted, minimizing unnecessary scans in skewed distributions common in real corpora.[19]

Phrase and proximity searches extend beyond simple term presence by incorporating positional information stored in the postings, where each posting includes a list of token offsets for the term's occurrences within the document. For an exact phrase query like "inverted index", the system retrieves postings for both terms, then for each candidate docID in the intersection, verifies whether there exists a pair of positions p1 for "inverted" and p2 for "index" such that p2 = p1 + 1. Proximity searches generalize this to allow a maximum distance k, checking if |p2 - p1| ≤ k, which adds a linear scan over the position lists per candidate document but ensures precise matching of term order and closeness. These positional postings increase storage but enable handling of the roughly 10-20% of real-world queries that specify phrases or nearness.[20]

The efficiency of these mechanisms hinges on postings list traversal costs, which dominate query time; for instance, intersecting two lists of average length f (term frequency) across a corpus of N documents costs O(f) per pair, but skip pointers can reduce this by up to 50% in practice for high-frequency terms by skipping √f entries per jump. Optimizations like processing the shortest list first further cut costs, as the traversal bound becomes proportional to the smaller list's length plus mismatches in the longer one. Post-retrieval, the candidate set is passed to relevance ranking, but the searching mechanism focuses solely on efficient raw retrieval.[19]

Example: Consider a small corpus with documents D1: "The inverted index is useful", D2: "Inverted indexes speed up search", and D3: "Building an index takes time".
The inverted index dictionary maps "inverted" to postings [(1, [2]), (2, [1])] and "index" to [(1, [3]), (2, [2]), (3, [3])], where each entry is (docID, [positions]). For the query "inverted index" (AND with phrase proximity k=1), fetch both lists and intersect docIDs to get {1,2}. For D1, the positions are 2 ("inverted") and 3 ("index"), with |3-2| = 1 ≤ 1, so it matches; for D2, positions 1 and 2 match similarly. D3 lacks "inverted", so the result set is {1,2}. Without positional information, only term co-occurrence could be verified, so documents containing both terms far apart or in the wrong order would be false positives for the phrase query.
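The intersection and positional check from this example can be sketched in Python. This is a minimal illustration with hard-coded postings matching the example (no skip pointers or dictionary lookup machinery); the function names `intersect` and `follows_within` are illustrative.

```python
def intersect(list_a, list_b):
    """Linear merge of two docID-sorted postings lists (Boolean AND)."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

def follows_within(pos_first, pos_second, k=1):
    """True if some occurrence of the second term follows the first within k positions."""
    return any(0 < q - p <= k for p in pos_first for q in pos_second)

positions = {
    "inverted": {1: [2], 2: [1]},
    "index":    {1: [3], 2: [2], 3: [3]},
}
candidates = intersect(sorted(positions["inverted"]), sorted(positions["index"]))   # [1, 2]
phrase_hits = [d for d in candidates
               if follows_within(positions["inverted"][d], positions["index"][d])]
print(phrase_hits)   # [1, 2]: both D1 and D2 contain the phrase "inverted index"
```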
Relevance Ranking
Relevance ranking in information retrieval systems using inverted indexes involves assigning scores to candidate documents retrieved during query processing, ordering them by estimated relevance to the query. These scores are typically computed by aggregating term-level contributions from the query terms present in the document's postings lists, balancing factors like term occurrence frequency and rarity across the corpus. Common methods range from simple weighting schemes to more sophisticated probabilistic models, enabling efficient ranking even over large collections.

One foundational approach is the term frequency-inverse document frequency (TF-IDF) weighting scheme. Term frequency (TF) measures how often a query term t appears in a document d, often normalized by document length to avoid bias toward longer texts; a basic form is \text{TF}(t, d) = f_{t,d} / |d|, where f_{t,d} is the raw count and |d| is the document length.[22] Inverse document frequency (IDF) quantifies a term's specificity by penalizing common terms, defined as \text{IDF}(t) = \log \left( \frac{N}{\text{df}(t)} \right), where N is the total number of documents and \text{df}(t) is the number of documents containing t. This logarithmic form derives from a probabilistic interpretation where term specificity correlates with the ratio of documents not containing the term, assuming a uniform relevance prior; specifically, Jones derived it as an approximation to \log \left( \frac{P(\neg t | \text{relevant})}{P(\neg t | \text{non-relevant})} \right), simplifying under independence assumptions to emphasize terms rare in the collection but present in relevant documents.[23] The overall TF-IDF score for a document is then \text{score}(d, q) = \sum_{t \in q} \text{TF}(t, d) \cdot \text{IDF}(t), summing contributions across query terms.[22]

The vector space model (VSM) extends TF-IDF by representing queries and documents as vectors in a high-dimensional term space, where each dimension corresponds to a term weighted by TF-IDF. The relevance of document d to query q is measured by cosine similarity: \cos \theta = \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||} = \frac{\sum_{t \in q \cap d} \text{TF-IDF}(t, q) \cdot \text{TF-IDF}(t, d)}{\sqrt{\sum_{t \in q} [\text{TF-IDF}(t, q)]^2} \cdot \sqrt{\sum_{t \in d} [\text{TF-IDF}(t, d)]^2}}, which normalizes for vector magnitudes, emphasizing angular proximity over absolute lengths. This derivation follows from the geometric interpretation of term weights as coordinates, where the dot product captures overlapping weighted terms and normalization accounts for document/query sparsity. In practice with inverted indexes, the dot product is accumulated by traversing postings lists for query terms, collecting TF-IDF values from stored weights without full vector materialization; norms can be precomputed and stored per document for efficiency.[22]

For improved performance on real-world data, advanced models like BM25 address TF-IDF limitations such as linear term frequency scaling, which over-rewards repeated terms without saturation. BM25, a probabilistic ranking function, estimates relevance as the probability that a document is relevant given the query, building on binary independence assumptions.
The score for a query q and document d is: \text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{TF}(t, d) \cdot (k_1 + 1)}{\text{TF}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}, where IDF uses the same logarithmic form as TF-IDF, \text{TF}(t, d) is raw frequency, |d| is document length, avgdl is average document length, k_1 (typically 1.2–2.0) controls TF saturation to prevent over-emphasis on high frequencies, and b (typically 0.75) tunes length normalization. The TF component derives from a 2-Poisson model approximating term distribution in relevant documents, where the saturated form \frac{\text{TF} \cdot (k_1 + 1)}{\text{TF} + k_1 \cdot \text{len factor}} bounds the contribution as TF grows, reflecting diminishing returns; the length term adjusts for shorter documents being less likely to contain terms by chance. IDF multiplies each term's contribution, derived similarly to Jones but integrated into the probabilistic framework. Parameters k_1 and b are empirically tuned, with k_1 balancing frequency sensitivity and b modulating length effects.[24]

In inverted index systems, scores are integrated by traversing unioned postings lists for query terms, accumulating per-document contributions on the fly (with WAND or similar algorithms for top-k efficiency) or post-retrieval for exact ranking; precomputed IDF values and norms are stored in index metadata to avoid repeated calculations. For TF-IDF or VSM, weights in postings enable direct summation during traversal, deriving the final score by aggregating as lists are merged. BM25 similarly uses stored TFs and lengths, computing the saturated term during accumulation, ensuring probabilistic consistency across documents.

Consider a corpus with N = 4 documents, avgdl = 5, k_1 = 1.2, b = 0.75, and query q = \{\text{"apple"}, \text{"fruit"}\}. IDF("apple") = \log(4/2) = 0.693, IDF("fruit") = \log(4/3) = 0.288. Document d1 (|d1| = 4): TF("apple") = 2, TF("fruit") = 1; score = 0.693 \cdot (2 \cdot 2.2)/(2 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot 4/5)) + 0.288 \cdot (1 \cdot 2.2)/(1 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot 4/5)) ≈ 0.693 \cdot 1.46 + 0.288 \cdot 1.09 ≈ 1.01 + 0.31 ≈ 1.32. Document d2 (|d2| = 6): TF("apple") = 1, TF("fruit") = 0; score ≈ 0.693 \cdot (1 \cdot 2.2)/(1 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot 6/5)) + 0 ≈ 0.693 \cdot 0.92 ≈ 0.64. Thus, d1 ranks higher due to balanced term presence and length adjustment.
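The worked BM25 numbers above can be reproduced with a short Python sketch. It assumes natural-logarithm IDF and the same parameter values (k_1 = 1.2, b = 0.75); the function name `bm25_score` is illustrative.

```python
import math

def bm25_score(term_freqs, doc_len, idf, avgdl, k1=1.2, b=0.75):
    """Sum each query term's saturated-TF contribution, weighted by its IDF."""
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return sum(idf[t] * (tf * (k1 + 1)) / (tf + length_norm)
               for t, tf in term_freqs.items() if tf > 0)

N, avgdl = 4, 5
idf = {"apple": math.log(N / 2), "fruit": math.log(N / 3)}     # 0.693, 0.288

d1 = bm25_score({"apple": 2, "fruit": 1}, doc_len=4, idf=idf, avgdl=avgdl)
d2 = bm25_score({"apple": 1, "fruit": 0}, doc_len=6, idf=idf, avgdl=avgdl)
print(round(d1, 2), round(d2, 2))   # 1.32 0.64 -> d1 ranks above d2
```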
Applications
Search Engines
Inverted indexes form the core of web search engine architectures, enabling efficient retrieval of relevant documents from vast, unstructured collections. Early search engines, such as AltaVista launched in 1995, relied on inverted indexes to handle full-text search over millions of web pages, marking a shift from manual directories to automated indexing of natural language queries.[25] This approach allowed AltaVista to index approximately 25 million webpages by preprocessing documents into tokens and building term-document mappings, a foundational technique that influenced subsequent systems.[25] By the late 1990s, engines like Google integrated inverted indexes with link-based ranking, scaling to billions of pages through distributed processing.[26]

The crawling and indexing pipeline in search engines begins with web crawlers that systematically fetch pages, often at rates exceeding 100 pages per second, storing raw content in compressed repositories.[26] Upon retrieval, documents undergo HTML parsing to extract textual content, links, and metadata, followed by URL normalization (such as case-folding and removing trailing slashes) to ensure consistent identification and avoid duplicates. The parsed content is then tokenized, with linguistic preprocessing like stemming applied to normalize terms, before feeding into inverted index builders that create forward indexes and convert them into inverted structures via sorting.[25] This pipeline supports ongoing refreshes, with static content recrawled monthly and dynamic pages more frequently, maintaining index freshness amid web evolution.[25]

Real-world implementations highlight inverted indexes' adaptability. Google's architecture employs inverted indexes stored in "barrels" for hits (word occurrences with positions and attributes), integrated with PageRank scores derived from a link database to rank results by authority and relevance.[26] Similarly, Apache Lucene, the underlying library for many engines, constructs inverted indexes across immutable segments, each a self-contained set of index files including term dictionaries and postings lists, which are periodically merged to optimize storage and query speed by reducing fragmentation.[27] Modern systems like Elasticsearch extend this by building on Lucene, evolving from early single-node designs to distributed setups that handle petabyte-scale indexes.[28]

Scalability challenges arise from the web's dynamic nature, necessitating sharding, where indexes are partitioned across clusters for parallel indexing and querying, as pioneered by AltaVista's fixed-hash shards in the 1990s.[29] Handling updates from evolving content involves near-real-time indexing in Elasticsearch, where new segments are created for additions and merges manage deletions without full rebuilds, though this trades I/O for performance.[27]

Query processing enhances user experience through features like autocomplete, implemented via prefix trees (tries) on the term dictionary to suggest completions matching user prefixes in time proportional to the prefix length.[30] Spell correction leverages edit distance metrics, such as Damerau-Levenshtein, to identify and rank corrections within a small distance threshold (e.g., 1-2 edits) from query terms against the dictionary, improving recall for misspelled inputs.[31]
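As a toy illustration of prefix-based completion, the sketch below searches a sorted term dictionary with binary search rather than an explicit trie; it is a simplified stand-in for the trie-based approach described above, and the term list is invented for the example.

```python
from bisect import bisect_left

def complete(sorted_terms, prefix, limit=5):
    """Return up to `limit` dictionary terms that start with `prefix`.
    Binary search over a sorted term list gives the same prefix lookup a trie
    supports, at the cost of O(log T) to locate the first candidate."""
    suggestions = []
    i = bisect_left(sorted_terms, prefix)
    while (i < len(sorted_terms) and sorted_terms[i].startswith(prefix)
           and len(suggestions) < limit):
        suggestions.append(sorted_terms[i])
        i += 1
    return suggestions

terms = sorted(["index", "indexes", "indexing", "invert", "inverted", "retrieval"])
print(complete(terms, "ind"))   # ['index', 'indexes', 'indexing']
```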
Database Systems
Inverted indexes are integral to full-text search capabilities in relational database management systems (RDBMS), enabling efficient querying over textual data stored in columns. In PostgreSQL, the Generalized Inverted Index (GIN) serves as a specialized structure for full-text search, supporting operations like phrase matching and ranking by mapping terms to lists of row identifiers (TIDs) within tables. Similarly, MySQL employs full-text indexes on columns of types such as CHAR, VARCHAR, or TEXT, which internally use an inverted index to store word positions and facilitate relevance-based retrieval without scanning entire datasets. These integrations allow databases to handle unstructured or semi-structured text alongside traditional relational data, maintaining ACID compliance while accelerating search operations.

In NoSQL databases, inverted indexes support scalable, distributed text search tailored to document-oriented or key-value stores. Elasticsearch, built on Apache Lucene, utilizes inverted indexes across shards for distributed querying, enabling horizontal scaling and real-time indexing of large-scale text corpora in JSON documents. MongoDB's text indexes, also inverted in nature, incorporate stemming and language-specific tokenization to normalize terms (e.g., reducing "running" to "run"), allowing efficient searches over collections with multilingual content. Unlike RDBMS counterparts, these NoSQL implementations prioritize eventual consistency and partition tolerance for high-throughput environments.

Query processing in these systems leverages SQL extensions or domain-specific languages to intersect postings lists from inverted indexes. In MySQL, the MATCH...AGAINST syntax performs full-text searches with modes like BOOLEAN or NATURAL LANGUAGE, returning relevance scores based on term frequency-inverse document frequency (TF-IDF) calculations over the index. PostgreSQL uses operators such as @@ between tsvector (indexed document representation) and tsquery (parsed search terms) to execute efficient intersections, supporting ranked results via ts_rank. For NoSQL, MongoDB employs JSON-based $text operators in aggregation pipelines to intersect postings and apply filters, while Elasticsearch's Query DSL constructs complex bool queries that merge term postings with scoring functions like BM25.

Common use cases include e-commerce platforms, where inverted indexes power product catalog searches in MySQL or PostgreSQL, matching user queries against descriptions, titles, and attributes for faceted results. In log analysis, systems like Elasticsearch index server logs to enable rapid pattern matching and anomaly detection across distributed clusters, filtering terabytes of entries in seconds. To manage updates in dynamic environments, databases often use delta indexes (temporary structures that capture changes before being merged into the main inverted index) to minimize write amplification without blocking reads.

Performance benefits stem from the inverted index's ability to reduce query complexity: a full table scan operates in O(n) time for n rows, whereas index-based search achieves O(k log n + m), where k is the number of query terms, n the document count, and m the size of the retrieved postings lists, often yielding sub-second responses on million-row datasets. Storage overhead typically ranges from 10-20% additional space beyond the raw text, depending on term sparsity and compression, though this varies with data density and indexing options.
Compression
Techniques
Inverted index compression techniques primarily target two components: the dictionary of terms and the postings lists containing document identifiers (docIDs), frequencies, and positions. These methods aim to reduce the storage footprint by exploiting redundancies, such as sorted order in terms and gaps in docIDs, while enabling efficient decompression during queries.
Dictionary Compression
Dictionary compression focuses on the term lexicon, which stores unique terms as strings. One common approach is front-coding, where terms are sorted lexicographically and grouped into blocks of size k (typically 4–256). For the first term in a block, the full string is stored; for each subsequent term, only the suffix that differs from the block's first term is stored, along with a byte indicating the length of the shared prefix. This leverages the commonality of prefixes among nearby terms in sorted order. For example, the terms "automata", "automatic", and "automation" might be encoded as "automata" (full, 8 chars), followed by prefix length 7 + "ic" for "automatic", and prefix length 7 + "ion" for "automation", reducing redundancy. Front-coding with blocking can shrink a dictionary like that of the Reuters-RCV1 corpus from 11.2 MB (fixed-width) or 7.6 MB (string with pointers) to about 5.9 MB.[32]

Another dictionary compression method is blocked storage, which treats the dictionary as a single array of characters and divides it into fixed-size blocks. Pointers are stored only to the start of each block, with additional bytes encoding term lengths within the block to allow reconstruction. For a block size k = 4, this compresses the Reuters-RCV1 dictionary to 7.1 MB by eliminating per-term pointers and exploiting local similarities. These techniques assume the dictionary fits in main memory post-compression, prioritizing space savings over exact string matching speed.[32]
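A minimal sketch of this front-coding is shown below; it assumes each subsequent term in a block is coded against the block's first term, as in the "automata" example, and the function name is illustrative.

```python
def front_code_block(block_terms):
    """Front-code one dictionary block: first term stored in full, then
    (shared-prefix length with the first term, remaining suffix) per term."""
    first = block_terms[0]
    coded = [first]
    for term in block_terms[1:]:
        k = 0
        while k < min(len(first), len(term)) and first[k] == term[k]:
            k += 1
        coded.append((k, term[k:]))
    return coded

print(front_code_block(["automata", "automatic", "automation"]))
# ['automata', (7, 'ic'), (7, 'ion')]
```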
Postings Compression
Postings lists, which record docIDs (often as sorted gaps), frequencies, and positions, are compressed by encoding integer sequences efficiently. Variable-byte (VB) encoding aligns integers to byte boundaries, using 7 bits for the value and 1 continuation bit (0 for the last byte, 1 otherwise). Small gaps (<128) fit in one byte, while larger ones use multiple. For instance, the gap 5 is encoded as a single byte "00000101" (5 in decimal, with continuation bit 0). VB is simple and fast to decode via byte-wise operations, achieving about 8.81 bits per integer on the Gov2 collection.[33][32]

For denser compression, Elias gamma and delta codes are used on delta-encoded gaps (differences between consecutive docIDs). The gamma code encodes a gap g \geq 1 by a unary representation of its bit length \ell = \lfloor \log_2 g \rfloor + 1, followed by \ell - 1 binary bits for the offset g - 2^{\ell - 1}. The code length is 2 \lfloor \log_2 g \rfloor + 1 bits, roughly 2 \log_2 n bits for gaps of size up to n. For example, gap 13 (\ell = 4) is encoded as "1110" (unary) + "101" (binary offset), totaling 7 bits. The delta code refines this by gamma-encoding the length \ell itself before appending the binary offset, yielding shorter codes for large gaps (e.g., 11 bits for the gap 113 versus gamma's 13). These Elias codes (from 1975) excel for small, variable gaps in postings, compressing Gov2 to 3.74 bits per integer with delta.[33][32]
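The gamma code can be sketched in a few lines of Python (encoder only, with no bit packing); the sketch reproduces the 7-bit code for the gap 13 described above.

```python
def gamma_encode(gap):
    """Elias gamma code: unary code of the bit length, then the offset bits."""
    assert gap >= 1
    binary = bin(gap)[2:]                 # e.g. 13 -> '1101'
    length = len(binary)                  # floor(log2 gap) + 1
    unary = "1" * (length - 1) + "0"      # length written in unary
    return unary + binary[1:]             # offset = binary without its leading 1

print(gamma_encode(13))   # '1110' + '101' -> '1110101' (7 bits)
print(gamma_encode(5))    # '110'  + '01'  -> '11001'   (5 bits)
```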
Frequency and Position Compression
Frequencies and positions in postings often follow skewed (geometric) distributions, making parameterized codes like Golomb-Rice suitable. Introduced by Golomb (1966) and Rice (1971), this coding divides a number x \geq 1 by a parameter b = 2^k (for integer k): the quotient q = \lfloor (x-1)/b \rfloor is unary-coded (q ones followed by a zero), and the remainder r = (x-1) \bmod b uses k binary bits, giving a code length of q + 1 + k bits. The optimal parameter is b \approx -1 / \log_2 (1-p) for a geometric distribution P(x) = p (1-p)^{x-1}; in practice, k is tuned per block (typically 1–4). For example, with x = 6 and k = 2 (b = 4): q = 1 (unary "10") and r = 1 (binary "01"), yielding "1001" (4 bits). This yields 4.08 bits per integer on Gov2 for frequencies and positions, outperforming gamma for skewed data by modeling the distribution explicitly.[33]
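A minimal Golomb-Rice encoder matching the x = 6, k = 2 example above might look like the following sketch (encoder only; decoding and per-block parameter selection are omitted).

```python
def rice_encode(x, k):
    """Golomb-Rice code with b = 2**k for x >= 1: unary quotient, then k remainder bits."""
    assert x >= 1 and k >= 1
    b = 1 << k
    q, r = divmod(x - 1, b)
    return "1" * q + "0" + format(r, f"0{k}b")

print(rice_encode(6, 2))   # quotient 1 -> '10', remainder 1 -> '01', giving '1001' (4 bits)
print(rice_encode(1, 2))   # '0' + '00' -> '000' (3 bits = q + 1 + k)
```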
Hybrid Approaches
Hybrid methods integrate multiple techniques, often combining integer codes with structural aids like skipping for balanced space and query speed. For instance, partitioned Elias-Fano (PEF) partitions postings into blocks, applies Elias-Fano monotonic encoding (a hybrid of bit vectors and select structures), and adds bitmaps for skipping irrelevant blocks during intersections. Skip pointers (every k-th docID) may be stored uncompressed or lightly encoded for rapid jumps, while the bulk uses gamma/delta coding. This achieves 3.12 bits per integer on Gov2, enabling skips without full decompression. Such hybrids, like those in clustered indexes, group similar postings and apply tailored compression per cluster, reducing overall bits while supporting skipping. Recent advances as of 2025 include in-memory optimizations and new algorithms such as AC-SBS for faster decompression in large-scale systems.[33][34][35]
Example: Delta + Variable-Byte on Postings
Consider a sample postings list of docIDs [1, 3, 5, 10, 20]. First, compute the deltas (gaps): [1, 2, 2, 5, 10]. Uncompressed as 32-bit integers, this list requires 160 bits (5 × 32). Using delta + VB (a short sketch reproducing this encoding follows the list):
- 1: "00000001" (1 byte, 8 bits)
- 2: "00000010" (1 byte, 8 bits)
- 2: "00000010" (1 byte, 8 bits)
- 5: "00000101" (1 byte, 8 bits)
- 10: "00001010" (1 byte, 8 bits)