
Document retrieval

Document retrieval is the computerized process of producing a relevance-ranked list of documents from a large collection in response to a user's query, achieved by comparing the query to an automatically generated index of the documents' textual content. This forms a core subset of information retrieval (IR), which involves identifying unstructured materials—typically text documents—that satisfy a specific information need from vast repositories, often stored digitally.

The field emerged in the late 1940s and early 1950s amid the rapid growth of scientific literature, beginning with systems limited to citation-based searching before advancing to full-text capabilities as computing power and storage costs declined. Early developments focused on automating manual library processes, leading to the establishment of standardized test collections and evaluation frameworks like the Text REtrieval Conference (TREC) in the 1990s, which benchmarked system performance using metrics such as precision and recall.

At its foundation, document retrieval relies on three primary modules: document processing for indexing terms and structures, query analysis to refine user inputs, and matching functions to compute relevance scores. Classical models include the Boolean model for exact logical matching, the vector space model using term frequency-inverse document frequency (tf-idf) weighting and cosine similarity for ranking, and probabilistic models that estimate relevance likelihood. More advanced techniques incorporate natural language processing for query expansion and synonym handling.

In contemporary applications, document retrieval powers web search engines like Google, digital libraries, and enterprise knowledge bases, indexing hundreds of billions of web pages. As of 2023, advancements integrate deep learning and transformer-based models, such as BERT and dense retrievers, to capture semantic relationships and overcome limitations of term-based matching, enhancing performance in tasks like question answering and multimodal search. By 2025, further developments include retrieval-augmented generation (RAG), which pairs retrieval with large language models for improved contextual grounding in AI systems like chatbots.

Fundamentals

Definition and Core Concepts

Document retrieval is the computerized process of identifying and returning a ranked list of relevant documents from a large collection in response to a user query, with an emphasis on retrieving entire documents as atomic units rather than extracted snippets or passages. This task forms a core component of information retrieval systems, where the goal is to satisfy a user's information need by matching query terms to document content through automated indexing and similarity measures.

Central to document retrieval are several foundational concepts. Documents represent the basic, indivisible units of retrieval, typically consisting of unstructured text such as articles, web pages, reports, or book chapters stored in a corpus or collection. Queries express the user's information need, ranging from natural language phrases to structured expressions like Boolean combinations of keywords. Relevance serves as the key matching criterion, defined as the extent to which a document provides information that the user perceives as useful for their query, often assessed probabilistically or via similarity scoring.

Although closely related, document retrieval differs from broader information retrieval (IR), which includes techniques for extracting specific facts, entities, or answers; document retrieval prioritizes the return of complete texts to allow users to explore context holistically. The standard workflow involves query input and processing, matching against an indexed collection, ranking by relevance, and presenting results, typically in descending order of pertinence; a toy sketch of this workflow follows.
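
To make this workflow concrete, here is a minimal, self-contained sketch; the three-document corpus, the whitespace tokenizer, and the term-overlap score are illustrative stand-ins for real indexing and ranking components:

```python
# Toy sketch of the standard workflow: query processing, matching
# against a collection, and relevance ranking. Illustrative only.
from collections import Counter

corpus = {
    1: "the cat sat on the mat",
    2: "dogs and cats living together",
    3: "a treatise on information retrieval",
}

def process(text):
    # Lowercase and tokenize; real systems also remove stop words and stem.
    return text.lower().split()

def retrieve(query, corpus, k=2):
    # Score every document by term overlap with the query, rank descending.
    q_terms = Counter(process(query))
    scores = {}
    for doc_id, text in corpus.items():
        d_terms = Counter(process(text))
        overlap = sum(min(q_terms[t], d_terms[t]) for t in q_terms)
        if overlap:
            scores[doc_id] = overlap
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(retrieve("cat on the mat", corpus))  # [(1, 4)]
```

Note how document 2 is missed entirely because "cats" does not match "cat"; this is exactly the normalization problem that stemming, discussed under Indexing below, addresses.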

Historical Development

The roots of document retrieval lie in 19th-century library cataloging efforts to organize vast collections systematically. In 1841, Anthony Panizzi, keeper of printed books at the British Museum, developed 91 rules for standardizing the cataloging of printed books, which emphasized uniform entry points and descriptive consistency to facilitate user access and retrieval. These rules laid the groundwork for modern bibliographic control, influencing subsequent library practices worldwide.

The mid-20th century introduced mechanized aids, transitioning from manual catalogs to semi-automated systems. During the 1940s and 1950s, punched card technology—pioneered in the 1930s for data processing in the United States—enabled libraries and information centers to encode document attributes on cards for mechanical sorting and selective retrieval, marking an early step toward computational efficiency.

The 1960s brought fully automated systems, highlighted by Gerard Salton's development of the SMART system (System for the Mechanical Analysis and Retrieval of Text) at Cornell University, which automated indexing, weighting, and relevance feedback to improve search accuracy on textual collections. By the 1970s, Boolean retrieval models became widespread, permitting logical combinations of terms (AND, OR, NOT) in operational systems like Dialog, while Salton's 1975 vector space model represented documents and queries as vectors for similarity-based ranking, shifting focus from exact matches to semantic proximity. The 1980s saw paradigm shifts toward probabilistic approaches, with C.J. van Rijsbergen's work emphasizing relevance probability estimation to handle uncertainty in retrieval.

The 1990s transformed the field with the web's explosion, as AltaVista launched in 1995 with full-text indexing of web pages, followed by Google's 1998 debut using PageRank for link-based ranking at unprecedented scale. The Text REtrieval Conference (TREC), started in 1992 by NIST under Donna Harman, standardized benchmarking and spurred innovations in large-scale evaluation. Entering the 2010s, machine learning paradigms dominated, evolving into neural networks post-2017 that learned dense representations for queries and documents, enabling end-to-end models that outperform traditional methods in relevance matching.

Techniques and Models

Indexing and Query Processing

The indexing process in document retrieval involves preprocessing documents to extract and normalize terms for efficient storage and retrieval. Tokenization is the initial step, where raw text is segmented into individual tokens, typically by splitting on whitespace, punctuation, and other delimiters to handle variations like contractions or hyphenated words. Stop-word removal follows, filtering out high-frequency, low-information words such as "the," "and," or "of," which can constitute 30-50% of text and whose removal thus significantly reduces index size without impacting relevance. Stemming then reduces inflected or derived words to their base form—for instance, transforming "computers," "computing," and "computed" to "comput"—using rule-based algorithms like the Porter stemmer, which applies suffix-stripping rules to improve term matching while balancing precision and recall.

The resulting terms form the basis of the inverted index, a core data structure that maps each unique term to a postings list containing document identifiers (docIDs) where the term occurs, often augmented with positions or frequencies for advanced features. This structure inverts the traditional document-term matrix, enabling rapid lookup of all documents containing a query term by traversing the postings list rather than scanning entire documents. For example, the term "retrieval" might point to a list like [7, 23, 45, 112], indicating its presence in those documents.

Storage considerations are critical for practicality, particularly with compression techniques to manage the voluminous postings lists. Delta encoding compresses these lists by storing differences (gaps) between sorted docIDs instead of absolute values—for instance, docIDs 283154, 283159, and 283202 become gaps 283154, 5, and 43—yielding smaller numbers that require fewer bits via methods like variable-byte or gamma encoding, often reducing index size by 50-70% on corpora like Reuters-RCV1. Scalability for massive collections, such as billions of documents in web-scale search, relies on distributed approaches like the Single-Pass In-Memory Indexing (SPIMI) algorithm or MapReduce-based partitioning, which divide the corpus into blocks for parallel processing and merging, ensuring construction remains feasible on commodity hardware.

Query processing transforms user input into a form compatible with the index, starting with parsing to interpret Boolean operators such as AND (intersection of postings), OR (union), and NOT (exclusion). For natural language queries, preprocessing mirrors document indexing, applying tokenization, stop-word removal, and stemming to normalize terms. Query expansion enhances recall by augmenting the original query with related terms, such as synonyms from a thesaurus or pseudo-relevance feedback where top-retrieved documents suggest additional keywords; the seminal Rocchio method weights these expansions based on relevance judgments to iteratively refine the query vector.

Efficiency in indexing and query processing hinges on algorithmic and data-structure choices. Index construction achieves linear time O(T), where T is the number of tokens across all documents, using incremental methods like SPIMI that avoid full in-memory sorting by building and merging compressed blocks on the fly.
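
A minimal sketch of these steps, assuming a tiny in-memory corpus, appears below; the crude suffix stripper stands in for the full Porter stemmer, and the variable-byte routine shows gap compression over sorted docIDs:

```python
# Sketch of the indexing pipeline: tokenization, stop-word removal,
# toy stemming, inverted-index construction, and gap + variable-byte
# compression of postings lists. Corpus and helpers are illustrative.
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "of", "a", "in", "to"}

def stem(token):
    # Toy suffix stripping; the real Porter stemmer applies staged rules.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenize
    return [stem(t) for t in tokens if t not in STOP_WORDS]  # filter + stem

def build_index(docs):
    # Inverted index: term -> sorted postings list of docIDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def vbyte_encode(postings):
    # Store gaps between consecutive docIDs instead of absolute values,
    # packing each gap into 7-bit chunks; the high bit flags the last byte.
    out, prev = bytearray(), 0
    for doc_id in postings:
        gap, chunks = doc_id - prev, []
        prev = doc_id
        while True:
            chunks.insert(0, gap % 128)
            if gap < 128:
                break
            gap //= 128
        chunks[-1] += 128
        out.extend(chunks)
    return bytes(out)

index = build_index({7: "retrieval of documents", 23: "document retrieval"})
print(index["retrieval"], vbyte_encode([283154, 283159, 283202]).hex())
```
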
Traditional sparse representations, based on bag-of-words and inverted indexes, excel in exact-match scenarios with low overhead but handle semantics poorly, whereas dense representations convert documents to fixed-dimensional embeddings (e.g., via BERT) for similarity search, necessitating approximate nearest-neighbor indexes like HNSW, though at higher computational cost during index construction and updates.
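
For contrast, a dense-retrieval sketch follows; the random vectors stand in for learned BERT embeddings, and the exhaustive dot-product scan stands in for an approximate HNSW index, which would return comparable neighbors in sub-linear time:

```python
# Dense retrieval sketch: documents as fixed-dimensional vectors,
# retrieval as nearest-neighbor search. Embeddings are random
# placeholders for a trained encoder.
import numpy as np

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 128)).astype(np.float32)   # stand-in embeddings
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit-normalize

def dense_search(query_vec, k=5):
    # With unit vectors, cosine similarity reduces to a dot product;
    # an HNSW index would approximate this exhaustive scan.
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ q
    top = np.argsort(-scores)[:k]
    return list(zip(top.tolist(), scores[top].tolist()))

print(dense_search(rng.normal(size=128).astype(np.float32)))
```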

Retrieval Algorithms

Retrieval algorithms in document retrieval form the core mechanisms for matching queries to candidate documents from a collection, typically operating on pre-indexed representations to efficiently identify relevant items. These algorithms range from exact-match approaches to probabilistic and vector-based methods that account for term importance and partial relevance, enabling scalable matching in large corpora. Classical algorithms emphasize set-theoretic operations or geometric interpretations, while probabilistic ones incorporate statistical models of relevance.

The Boolean retrieval model represents one of the earliest and simplest approaches, treating documents and queries as sets of terms and using logical operators to determine matches. In this model, a document is retrieved if it satisfies the query's Boolean expression, such as AND for intersection of term sets, OR for union, and NOT for exclusion. For instance, a query like "cat AND dog NOT bird" retrieves documents containing both "cat" and "dog" but excluding "bird". This exact-match paradigm relies on binary relevance—documents either fully match or are discarded—making it precise for professional search systems but limited in handling partial relevance or ranking nuances.

To address the Boolean model's rigidity, the vector space model (VSM) represents documents and queries as vectors in a high-dimensional space, where each dimension corresponds to a term in the vocabulary, weighted by its importance. Term frequency-inverse document frequency (tf-idf) weighting is commonly used, defined as \text{tf-idf}(t, d) = \text{tf}(t, d) \times \log(N / \text{df}(t)), where \text{tf}(t, d) is the frequency of term t in document d, N is the total number of documents, and \text{df}(t) is the document frequency of t. Similarity between a query vector q and a document vector d is then computed using cosine similarity: \cos(\theta) = \frac{q \cdot d}{\|q\| \|d\|} This measures the angle between vectors, prioritizing documents with aligned term distributions while normalizing for document length. The VSM, introduced in seminal work on automatic indexing, facilitates ranked retrieval by scoring all documents on a continuum of similarity rather than binary decisions.

Probabilistic retrieval models extend the VSM by estimating the probability of document relevance based on term occurrences, often using the binary independence model as a foundation. A widely adopted variant is the Okapi BM25 ranking function, which refines term weighting to account for saturation effects and document length normalization. The score for a document d given query Q with terms t is: \text{score}(d, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{\text{tf}(t, d) \cdot (k_1 + 1)}{\text{tf}(t, d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})} Here, \text{IDF}(t) = \log \frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5} measures term rarity, \text{tf}(t, d) is term frequency in d, |d| is document length, avgdl is average document length, and hyperparameters k_1 (typically 1.2–2.0) and b (usually 0.75) tune saturation and length normalization. BM25 balances precision and recall effectively in practice, as demonstrated in early large-scale evaluations.

Implementation of these algorithms relies on efficient structures like inverted indexes, which map terms to lists of documents (postings) for traversal during query evaluation. For Boolean and VSM/BM25 matching, the engine merges postings via set operations or accumulates scores while traversing the relevant postings lists, often using a heap to prioritize high-scoring candidates and truncate traversal to the top-k results for efficiency.
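
The BM25 formula above translates directly into code; the sketch below assumes precomputed document-frequency statistics and a toy corpus, and uses the document's IDF form (which can go negative for terms appearing in more than half the collection, a quirk that systems such as Lucene avoid by adding 1 inside the logarithm):

```python
# Minimal BM25 scorer following the formula above; corpus statistics
# and parameter values are illustrative.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.2, b=0.75):
    # df: term -> number of documents containing the term.
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
    return score

docs = {1: "cat sat mat".split(), 2: "dog ran home".split(), 3: "bird flew away".split()}
N, avgdl = len(docs), sum(len(d) for d in docs.values()) / len(docs)
df = Counter(t for d in docs.values() for t in set(d))
print(sorted(docs, key=lambda i: -bm25_score(["cat", "mat"], docs[i], df, N, avgdl)))
# [1, 2, 3] -- only document 1 matches, so it ranks first
```
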
Phrase queries, such as exact sequences like "machine learning", require positional indexes that augment postings with term offsets within documents; retrieval involves scanning positions in merged postings lists to verify adjacency (e.g., position differences of one for bigrams). This positional traversal adds overhead but enables precise proximity matching without exhaustive scanning.
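
A sketch of positional phrase matching follows, with an illustrative two-document corpus; the index maps each term to per-document position lists, and adjacency is verified by checking successive offsets:

```python
# Phrase matching over a positional index: term -> {docID: [positions]}.
# Structure and corpus are illustrative.
from collections import defaultdict

def build_positional_index(docs):
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(phrase, index):
    terms = phrase.lower().split()
    # Candidate documents must contain every term (postings intersection).
    candidates = set(index.get(terms[0], {}))
    for t in terms[1:]:
        candidates &= set(index.get(t, {}))
    hits = []
    for doc_id in candidates:
        # Verify adjacency: each successive term must occur at position +1.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms[1:], 1)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {1: "machine learning for retrieval", 2: "learning about machine tools"}
index = build_positional_index(docs)
print(phrase_search("machine learning", index))
# [1] -- document 2 has both terms, but not adjacent
```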

Ranking Mechanisms

Ranking mechanisms in document retrieval involve assigning scores to candidate documents after initial retrieval and ordering them to present the most relevant results to users. These mechanisms aim to refine the output from matching algorithms by considering factors beyond simple term overlap, such as document authority, query intent, and result diversity. Traditional approaches rely on graph-based or probabilistic models, while modern methods leverage machine learning to optimize rankings based on training data labeled for relevance.

Traditional ranking methods, particularly for web documents, emphasize query-independent factors like overall document authority derived from link structures. A seminal example is PageRank, which models the web as a directed graph where pages are nodes and hyperlinks are edges, computing a score that approximates the likelihood of a random surfer visiting a page. The PageRank score for page A is given by: \text{PR}(A) = (1 - d) + d \sum_{T_i \in B_p(A)} \frac{\text{PR}(T_i)}{C(T_i)} where d is the damping factor (typically 0.85) accounting for the probability of following a link versus jumping to a random page, B_p(A) is the set of pages linking to A, and C(T_i) is the out-degree of page T_i. This approach prioritizes pages with high-quality incoming links from authoritative sources, enhancing retrieval for web-scale search without direct query dependence.

Learning to rank (LTR) represents a shift to data-driven methods that train models on labeled datasets to predict relevance scores or preferences for documents given a query. LTR frameworks are categorized into pointwise, pairwise, and listwise approaches based on how they formulate the ranking problem. Pointwise methods treat ranking as a regression or classification task, assigning an absolute relevance score to each document independently using features like term frequency or proximity; for instance, models regress directly on numerical relevance labels from 0 to 4. Pairwise approaches optimize relative order by minimizing losses for misranked pairs, such as in RankNet, which uses a cross-entropy loss to learn a neural network that outputs probabilities for one document being more relevant than another. Listwise methods consider the entire ranked list, directly optimizing metrics like NDCG by adjusting gradients for permutations; LambdaRank extends pairwise learning by weighting updates based on the target metric's change, enabling efficient training for complex evaluation measures. These methods integrate with earlier retrieval stages by incorporating initial similarity scores as input features.

Feature engineering is crucial in LTR, involving the creation of hand-crafted or derived signals that capture query-document interactions and contextual signals to improve model performance. Common features include query-document similarity scores (e.g., BM25 or cosine similarity on TF-IDF vectors), positional information from the initial retrieval, and user context signals like click history or session data; these are combined with retrieval scores from vector space or probabilistic models to form a high-dimensional input vector for the ranking model. Seminal work highlights that effective features significantly improve ranking performance on benchmarks like the LETOR datasets, emphasizing sparse, interpretable signals over dense embeddings in early LTR systems.
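
The pairwise idea behind RankNet can be illustrated with a toy gradient-descent sketch; the linear scorer and the two synthetic features (standing in for signals like a BM25 score and a PageRank score) are simplifications of the original neural formulation:

```python
# Toy RankNet-style pairwise update: learn a scoring function so that
# sigmoid(s_i - s_j) predicts which of two documents is more relevant,
# minimizing cross-entropy. Linear scorer and data are illustrative.
import numpy as np

def pairwise_update(w, x_i, x_j, lr=0.1):
    # Document i is labeled more relevant than j; the loss is
    # -log sigmoid(s_i - s_j), whose gradient drives s_i above s_j.
    s_diff = w @ (x_i - x_j)
    p = 1.0 / (1.0 + np.exp(-s_diff))   # P(i ranked above j)
    grad = (p - 1.0) * (x_i - x_j)      # d(-log p)/dw
    return w - lr * grad

rng = np.random.default_rng(1)
w = np.zeros(2)
# Synthetic pairs: the first document of each pair is the more relevant
# one and tends to have higher feature values.
pairs = [(rng.normal(1, 0.5, 2), rng.normal(0, 0.5, 2)) for _ in range(200)]
for x_i, x_j in pairs * 10:
    w = pairwise_update(w, x_i, x_j)
print(w)  # both weights end positive: higher feature values rank higher
```
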
Modern enhancements to ranking mechanisms address limitations like redundancy in top results by promoting diversity, ensuring a balanced mix of relevant yet non-overlapping documents. The Maximal Marginal Relevance (MMR) algorithm exemplifies this by re-ranking candidates to balance relevance and novelty, formulated as: \text{MMR} = \lambda \cdot \text{Sim}_1(D, Q) - (1 - \lambda) \cdot \max_{D_j \in S} \text{Sim}_2(D, D_j) where \text{Sim}_1(D, Q) measures similarity between document D and query Q, \text{Sim}_2(D, D_j) measures redundancy with the already-selected documents S, and \lambda (often 0.5-0.7) trades off the two. This greedy selection reduces overlap while maintaining query focus, improving user satisfaction in scenarios with clustered results.
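
A direct sketch of this greedy re-ranking loop is shown below, using plain cosine similarity for both Sim_1 and Sim_2 and illustrative random vectors in place of real document representations:

```python
# Greedy MMR re-ranking following the formula above; vectors illustrative.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query_vec, cand_vecs, k=3, lam=0.6):
    # Repeatedly pick the candidate with the best trade-off between
    # similarity to the query and redundancy with already-picked results.
    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(cand_vecs[i], query_vec)
            red = max((cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
print(mmr(rng.normal(size=8), rng.normal(size=(10, 8)), k=3))
```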

Variations and Applications

Structured vs. Unstructured Retrieval

Structured retrieval involves querying data organized according to predefined schemas, such as relational databases where information is stored in tables with rows and columns representing entities and attributes. This approach typically employs SQL-like queries to perform exact matching and relational joins, allowing users to filter documents on specific fields such as author, date, or document type through form-based interfaces. For instance, a query might retrieve all documents whose author field equals "Smith" and whose publication date falls after a given year, leveraging primary and foreign keys to ensure consistent and precise results.

In contrast, unstructured retrieval targets free-form content, such as plain text documents, emails, or web pages, where data lacks a fixed schema and similarity is computed based on textual overlap. Information retrieval models, like the vector space model, address challenges such as synonyms and noise by representing documents and queries as vectors in a high-dimensional space, enabling ranking via cosine similarity or term frequency-inverse document frequency (TF-IDF) weighting. Full-text search in these scenarios often relies on inverted indexes for scalability, but it can encounter issues with large corpora due to the computational demands of processing vast amounts of unorganized text.

Hybrid approaches integrate structured and unstructured methods to leverage metadata filters alongside content-based search, enhancing user control and relevance. For example, faceted search in e-commerce platforms allows users to narrow results by attributes like price range (structured) while searching product descriptions (unstructured), often using techniques that combine keyword matching with structured predicates in a unified ranking framework. Tools like Elasticsearch support this duality by indexing both relational fields for exact filtering and textual content for semantic similarity, facilitating queries that blend Boolean conditions with full-text relevance scoring.

A key trade-off between these paradigms is precision versus recall: structured retrieval excels in precision by minimizing false positives through exact schema-based matches, yielding fewer irrelevant results but potentially lower recall if queries overlook nuanced content. Unstructured retrieval, however, promotes broader recall by capturing semantic variations in free text, though it risks lower precision due to ambiguities like polysemy or incomplete indexing. Hybrid systems mitigate these by balancing the two, as seen in applications where structured filters refine unstructured search outputs to improve overall precision without sacrificing coverage.
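
The hybrid, two-phase pattern described above can be sketched as follows; the product records are illustrative, and the term-overlap score is a stand-in for a TF-IDF or BM25 ranker:

```python
# Hybrid search sketch: a structured metadata predicate narrows the
# candidate set, then an unstructured text score ranks what remains.
products = [
    {"id": 1, "price": 25.0, "desc": "wireless ergonomic mouse"},
    {"id": 2, "price": 90.0, "desc": "wireless mechanical keyboard"},
    {"id": 3, "price": 30.0, "desc": "wired optical mouse"},
]

def hybrid_search(text_query, max_price):
    q_terms = set(text_query.lower().split())
    # Structured phase: exact filtering on a schema field.
    candidates = [p for p in products if p["price"] <= max_price]
    # Unstructured phase: rank survivors by term overlap with descriptions.
    scored = [(len(q_terms & set(p["desc"].split())), p["id"])
              for p in candidates]
    return [pid for score, pid in sorted(scored, reverse=True) if score > 0]

print(hybrid_search("wireless mouse", max_price=50.0))
# [1, 3] -- item 2 fails the price filter; item 1 outranks item 3 on text
```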

Domain-Specific Implementations

In the biomedical domain, document retrieval systems are tailored to handle vast repositories of scientific literature, emphasizing precise indexing and cross-referencing for researchers. MEDLINE, maintained by the National Library of Medicine (NLM), utilizes Medical Subject Headings (MeSH)—a controlled vocabulary of tens of thousands of terms organized hierarchically—to index articles for structured queries that capture synonyms, subheadings, and related concepts. This enables users to retrieve relevant biomedical documents by exploding terms to include narrower or broader descriptors, improving recall without sacrificing precision in specialized searches. The Entrez system, launched by the National Center for Biotechnology Information (NCBI) in 1991, serves as the primary retrieval platform, allowing access to abstracts, full-text articles, and linked genomic data across the more than 39 million citations in PubMed as of 2025. Entrez integrates seamless navigation between literature and molecular biology databases, such as linking a PubMed abstract to its corresponding protein sequence, which facilitates interdisciplinary queries in fields like genomics and pharmacology.

Web search engines adapt document retrieval to the scale and dynamism of the internet, prioritizing efficient crawling and real-time processing of diverse content types. Google's core pipeline involves crawling the web using automated bots to discover URLs, indexing the fetched content into a massive inverted index for fast lookup, and serving search results through relevance-ranked retrieval. To manage dynamic content generated by JavaScript, Google employs a multi-phase process: it first fetches the initial HTML, then renders the page in a headless browser environment to execute scripts and capture the fully loaded state, before incorporating the rendered content into the index. This adaptation ensures that single-page applications and interactive elements, common in modern websites, are retrievable despite their reliance on client-side rendering, though it introduces challenges like increased server load during rendering queues.

Enterprise search implementations focus on internal document ecosystems, integrating retrieval with organizational security and workflow needs. Microsoft SharePoint, a widely used intranet platform, employs a search architecture that crawls and indexes content from document libraries, sites, and integrated applications, delivering unified results tailored to user roles. A key feature is its tight integration with access controls, where search results are filtered in real time based on Microsoft Entra ID permissions, ensuring sensitive documents—such as proprietary reports or HR files—are only surfaced to authorized users via mechanisms like Restricted SharePoint Search, which limits indexing to approved sites. This approach supports compliance in regulated industries by enforcing granular permissions at the query level, preventing unauthorized exposure while enabling features like metadata-driven faceting for efficient navigation of enterprise knowledge bases.

In legal and patent retrieval, systems emphasize citation networks and chronological currency to assess document authority and applicability. Westlaw, developed by Thomson Reuters, provides specialized tools for retrieving case law, statutes, and regulations through its comprehensive database, with KeyCite serving as the primary citator to map citation networks—revealing how documents reference and build upon each other across jurisdictions. KeyCite visualizes these networks via graphical histories, highlighting direct and indirect citations, negative treatments like overrulings, and the depth of treatment in citing documents, which aids lawyers in validating precedents.
For temporal relevance, the system incorporates update frequencies and historical tracking, adding new citations as soon as they appear in Westlaw's database and flagging changes over time, such as legislative amendments or evolving case interpretations, to ensure retrieval reflects current legal validity. In patent contexts, Westlaw extends this to prior art searches, leveraging citation links between patents and non-patent literature to evaluate novelty and infringement risks chronologically.

Evaluation and Challenges

Performance Metrics

Performance metrics in document retrieval systems quantify the effectiveness of retrieving relevant documents from large collections in response to user queries. These metrics are essential for comparing algorithms, optimizing models, and benchmarking progress in information retrieval (IR). They generally fall into two categories: effectiveness measures, which assess relevance and ranking quality, and efficiency measures, which evaluate computational speed and resource usage. Evaluation relies on test collections with predefined queries and relevance judgments to produce reproducible results.

Precision and recall are foundational binary relevance metrics for assessing retrieval quality. Precision is defined as the ratio of relevant documents retrieved to the total number of documents retrieved, calculated as \text{Precision} = \frac{|\text{relevant retrieved}|}{|\text{retrieved}|} This measures the proportion of retrieved items that are useful, emphasizing the avoidance of irrelevant results. Recall, conversely, is the ratio of relevant documents retrieved to the total number of relevant documents in the collection, given by \text{Recall} = \frac{|\text{relevant retrieved}|}{|\text{relevant}|} It focuses on the system's ability to find all pertinent information, penalizing missed relevant items. These metrics trade off against each other, as improving one often reduces the other, leading to the use of the F1-score, the harmonic mean of precision and recall: \text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} The F1-score provides a balanced single value for systems where precision and recall are equally important.

For ranked retrieval, where documents are ordered by estimated relevance, mean average precision (MAP) aggregates precision across recall levels. MAP computes the average precision (AP) for each query—precision averaged over the positions of all relevant documents—and then takes the mean of these values across all queries. This rewards systems that return relevant documents early in the ranking while accounting for overall recall. MAP is particularly useful for ad hoc retrieval tasks, as it summarizes performance in a single score that correlates well with user satisfaction in exhaustive evaluations.

When relevance is graded (e.g., on a scale from 0 to 4), Normalized Discounted Cumulative Gain (NDCG) evaluates ranking quality by considering both relevance levels and position discounts. Discounted Cumulative Gain (DCG) at position p is \text{DCG}_p = \sum_{i=1}^{p} \frac{\text{rel}(i)}{\log_2(i+1)} where \text{rel}(i) is the graded relevance of the document at rank i. NDCG normalizes this by dividing by the ideal DCG for the query, yielding a value between 0 and 1. This metric prioritizes highly relevant documents at top positions, making it suitable for web search where users rarely examine deep results.

Beyond effectiveness, user engagement metrics like click-through rate (CTR) gauge practical utility in interactive settings. CTR is the ratio of documents clicked by users to the number of impressions (times shown), reflecting perceived relevance in real-world deployments such as search engines. Efficiency metrics include query latency, the time to process and return results for a query, and throughput, measured in queries per second, which assess scalability on large corpora. These ensure systems remain responsive under load, with benchmarks often targeting sub-second latency for user-facing applications.

Standardized benchmarking uses test collections like those from the Text REtrieval Conference (TREC), which provide document corpora, queries, and relevance assessments for consistent evaluation. TREC datasets, spanning topics from news to web pages, enable direct comparison of systems via metrics like MAP and NDCG.
Ground truth relevance is established through human judgments by trained assessors, who score documents on binary or graded scales, forming the basis for all metric calculations despite challenges in subjectivity and pool depth.
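
These effectiveness metrics reduce to short computations over a ranked list and a set of judgments, as in the following sketch (the ranking and relevance judgments are illustrative):

```python
# Minimal implementations of the metrics defined above, operating on a
# ranked list of docIDs and ground-truth judgments.
import math

def precision_recall_f1(retrieved, relevant):
    hits = len(set(retrieved) & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(ranking, relevant):
    # Precision evaluated at the rank of each relevant document,
    # averaged over all relevant documents (the per-query part of MAP).
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg(ranking, graded_rel, p=10):
    # DCG discounts graded relevance by log2(rank + 1), then normalizes
    # by the DCG of the ideal ordering.
    dcg = sum(graded_rel.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranking[:p], 1))
    ideal = sorted(graded_rel.values(), reverse=True)[:p]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, 1))
    return dcg / idcg if idcg else 0.0

ranking = [3, 1, 7, 5]
print(precision_recall_f1(ranking, relevant={1, 2, 3}))  # (0.5, 0.667, 0.571)
print(average_precision(ranking, relevant={1, 2, 3}))    # (1/1 + 2/2) / 3
print(ndcg(ranking, graded_rel={3: 3, 1: 2, 2: 3}))
```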

Current Limitations and Future Directions

Document retrieval systems face significant limitations stemming from biases in training data, which can result in unfair rankings that disadvantage certain demographics or topics. These biases often arise from skewed datasets that reflect historical inequalities, leading to amplified disparities in retrieval outcomes. Scalability issues are particularly pronounced for multimodal documents combining text and images, where processing large volumes requires substantial computational resources and efficient indexing strategies to maintain performance. Privacy concerns also persist in query logging practices, as retained user search histories can expose sensitive personal information, necessitating advanced anonymization techniques like differential privacy to mitigate risks.

Handling ambiguity remains a challenge, especially with short queries that lack sufficient context, resulting in imprecise matching and reduced retrieval accuracy. Context loss further complicates this by fragmenting relevant information across documents, making it difficult for systems to infer user intent holistically. In multilingual settings, performance gaps are evident for low-resource languages, where limited training data leads to poorer semantic understanding and lower retrieval effectiveness compared to high-resource languages.

Looking ahead, neural information retrieval models, such as BERT-based dense retrieval approaches exemplified by Dense Passage Retrieval (DPR) introduced in 2020, continue to advance semantic matching by embedding queries and documents into dense vector spaces for improved relevance. Integration with large language models (LLMs) promises enhanced semantic understanding, enabling retrieval systems to better capture nuanced query intents through generative augmentation. Federated search across distributed corpora represents another promising direction, allowing secure aggregation of results from multiple sources without centralizing sensitive data. Ethical considerations are increasingly central, with fairness metrics like demographic parity being employed to measure and mitigate disparities in ranking outcomes across protected groups. Sustainability challenges arise from the energy costs of large-scale indexing, as training and maintaining vast neural models consume significant resources, prompting research into more efficient algorithms to reduce environmental impact.
