Learning to rank

Learning to rank (LTR) is a supervised machine learning paradigm applied primarily to information retrieval tasks, where the objective is to construct a ranking model or function that orders a set of documents, items, or candidates by their relevance to a user query, using training data consisting of queries paired with relevance labels for documents. The approach emerged in the late 1990s and early 2000s as a response to the shortcomings of traditional information retrieval models, such as Boolean, vector space, and probabilistic models, which often suffered from low precision, relied on manually tuned parameters, and failed to adapt effectively to the scale and complexity of web-scale data. LTR methods are broadly categorized into three approaches based on how they formulate the ranking problem during training: pointwise methods, which treat ranking as an independent prediction task for individual items by assigning absolute relevance scores (e.g., via regression or classification models such as maximum entropy); pairwise methods, which learn the relative order between pairs of items to optimize pairwise comparisons (e.g., using neural networks like RankNet or boosting algorithms like RankBoost); and listwise methods, which directly optimize the quality of entire ranked lists by minimizing list-level loss functions (e.g., through permutation-based models like ListNet or boosting approaches driven by ranking metrics).

Each approach balances computational cost against the ability to capture dependencies among items, with listwise methods generally achieving superior performance on benchmarks by aligning more closely with end-to-end ranking objectives, as demonstrated in evaluations on datasets like the TREC Web Track (where ListNet improved mean average precision by up to 10% over pairwise baselines). The framework's importance stems from its ability to enhance search engine relevance, personalize recommendations, and optimize ad placements, with core applications in web search engines, e-commerce product ranking, and content prioritization for news feeds. Performance in LTR is typically evaluated using metrics that emphasize position-aware relevance, such as Normalized Discounted Cumulative Gain (NDCG), which rewards higher rankings for more relevant items while normalizing scores for fair comparison across queries, and Mean Average Precision (MAP), which averages precision across relevant documents per query.

Originally rooted in information retrieval, LTR has expanded into natural language processing (NLP) tasks, including query-focused applications like reranking and candidate selection, as well as queryless scenarios such as essay scoring and ordering sentences for text summarization. Recent advancements incorporate deep neural networks, large language models for zero-shot ranking, and multilingual adaptations, with pairwise methods remaining prevalent in NLP due to their simplicity, though challenges such as evaluation robustness (e.g., only 15% of studies use paired t-tests for significance) persist.

Fundamentals

Definition and Motivation

Learning to rank (LTR) is a supervised paradigm that aims to automatically construct a ranking model from training data to sort objects, such as documents or items, by their relevance, preference, or importance with respect to a given query or context. Unlike traditional heuristic-based ranking methods, which rely on manually engineered features and rules like TF-IDF or BM25 scoring, LTR directly learns a ranking function by optimizing objectives tailored to ordering tasks. The motivation for LTR arises from the limitations of rule-based systems in handling the scale and complexity of modern information retrieval environments, where personalized and context-aware rankings are essential to enhance user satisfaction and relevance. In large-scale applications like web search, manual heuristics fail to incorporate diverse signals, such as user behavior logs or multifaceted relevance factors, leading to suboptimal ordering of results; LTR addresses this by leveraging data-driven models to improve precision in top-ranked outputs and overall user engagement. For instance, in search engines, LTR enables better alignment of results with implicit user preferences derived from click-through data. At its core, LTR relies on training data composed of queries, associated documents or items, and relevance labels that indicate the degree of match, often using graded scales such as 0 (irrelevant) to 4 (highly relevant) to capture nuanced preferences. These labels, typically obtained from human annotators or implicit feedback like clicks, form the basis for feature representations that the model learns to score and order.

Problem Formulation

In learning to rank (LTR), the problem is formally defined as follows: given a query q, a set of candidate documents D_q = \{d_j\}_{j=1}^m retrieved for that query, and relevance labels r_d for each document d \in D_q, the goal is to learn a ranking function f(q, d) (or equivalently f(x), where x = \Phi(q, d) is a feature vector extracted from the query-document pair) that outputs real-valued scores for sorting the documents in descending order of relevance. The ranked list is then obtained by applying a sorting operation to these scores, such that higher scores correspond to more relevant positions. The optimization objective involves minimizing a loss function L over a dataset of queries, documents, and labels, where L quantifies the discrepancy between the predicted ranking and the ground-truth ordering, often focusing on position-based errors to reflect the importance of top-ranked results. This is typically achieved through empirical risk minimization, adapting supervised learning paradigms to the ranking task. Relevance labels r_d can take various forms depending on the data collection method and application needs: binary labels distinguish relevant (1) from irrelevant (0) documents; ordinal labels provide graded levels, such as 0 (irrelevant), 1 (marginally relevant), 2 (relevant), or 3 (highly relevant); and pairwise preferences indicate relative ordering between document pairs, often encoded as +1 if one document is preferred over another and -1 otherwise. The training pipeline for LTR generally proceeds in three stages: first, feature extraction to represent each query-document pair as a vector of informative attributes (e.g., textual, structural, or query-dependent features); second, model training to optimize the ranking function f using the extracted features and labels via gradient-based or other optimization methods; and third, inference during deployment, where new documents are scored and sorted to produce the final ranking.
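The formulation above can be made concrete with a small sketch: a toy feature map Φ(q, d), a linear scoring function f, and a sort by descending score. The feature map, weights, and example texts below are illustrative placeholders rather than components of any standard LTR library.

```python
# Minimal sketch of the LTR formulation: hypothetical feature map phi and a
# linear scoring function f; all names and values here are illustrative only.
import numpy as np

def phi(query: str, doc: str) -> np.ndarray:
    """Toy feature map Phi(q, d): query-term overlap and document length."""
    q_terms, d_terms = set(query.split()), doc.split()
    overlap = len(q_terms & set(d_terms))
    return np.array([overlap, len(d_terms)], dtype=float)

def f(x: np.ndarray, w: np.ndarray) -> float:
    """Scoring function f(x) = w . x producing a real-valued relevance score."""
    return float(w @ x)

# Rank candidate documents for one query by sorting on predicted scores.
query = "learning to rank"
docs = ["rank documents by learned models", "unrelated text about cooking"]
w = np.array([1.0, 0.01])            # weights would normally be learned from labels
scores = [f(phi(query, d), w) for d in docs]
ranking = [docs[i] for i in np.argsort(scores)[::-1]]  # descending by score
print(ranking)
```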

Applications

In Information Retrieval

Learning to rank (LTR) plays a central role in information retrieval (IR) by employing machine learning models to assign predicted relevance scores to documents, thereby ordering search results to better match user queries. These models integrate diverse signals, including traditional query-document similarity measures and dynamic user behavior data such as click-through rates and dwell times, to produce more nuanced rankings than heuristic-based approaches. In web search engines, LTR is commonly applied to re-rank the top-k candidates retrieved via efficient lexical methods, enhancing the overall relevance of presented results. Key use cases include web search ranking, where LTR refines initial retrievals from probabilistic models like BM25, and sponsored search auctions, where it determines ad placements by balancing relevance, bid values, and expected click-through rates to optimize user experience and revenue. For instance, in web search, LTR re-orders a shortlist of documents to prioritize those with higher predicted utility, while in ad auctions, it ranks advertisements in real-time based on auction-specific features like advertiser quality scores. These applications leverage LTR's ability to learn from labeled relevance judgments, enabling adaptive ranking that evolves with changing user preferences and content landscapes. The benefits of LTR in IR are evident in empirical performance gains, such as improvements in precision@K—measuring the proportion of relevant items in the top-K results—and mean average precision (MAP), which aggregates precision across recall levels. Studies demonstrate that LTR models can achieve up to 16% higher MAP compared to BM25 baselines in two-stage pipelines, underscoring their superiority over rule-based systems in handling complex relevance signals. These enhancements translate to more effective search experiences, with higher user satisfaction and engagement metrics in production systems. A typical workflow in IR begins with initial retrieval using inverted indexes to efficiently scan and score documents against a query via BM25, yielding a broad candidate set of hundreds to thousands of items. This is followed by LTR re-ranking, where models process extracted features from the candidates to generate final relevance scores and reorder the list for presentation. This hybrid approach balances computational efficiency with ranking accuracy, making it scalable for large-scale web and ad search environments.

In Recommendation Systems

In recommendation systems, learning to rank (LTR) adapts techniques from information retrieval by treating user profiles or session histories as analogous to queries and candidate items (such as products, videos, or media) as documents to be ranked based on predicted user engagement, often measured through implicit feedback like clicks or views. This enables personalized ranking of large candidate sets, optimizing for top-N positions where user interactions are most likely to occur, and has become a cornerstone for scaling recommendations in consumer platforms. A prominent application is in e-commerce, such as Amazon's product ranking, where LTR models process user search queries or browsing history alongside item features like price and reviews to prioritize listings that maximize purchase probability. For instance, Amazon's RankFormer, a Transformer-based listwise LTR approach, incorporates listwide labels to capture overall session quality from implicit signals, achieving a 13.7% improvement in revenue through improved ranking of top results. Similarly, in video platforms like YouTube, LTR ranks next-video suggestions by predicting watch time, using deep neural networks to weigh user history, demographics, and content freshness, with improvements on holdout data of up to 14% over baselines in candidate generation, as reported in early deep learning models. Unique to recommendation contexts, LTR often integrates session context, such as recent user actions or temporal factors, to refine rankings dynamically, as seen in YouTube's models that condition predictions on watch sequences to promote relevant yet novel content. Additionally, to mitigate filter bubbles that reinforce user echo chambers, diversity-aware LTR methods incorporate regularization terms based on item similarity matrices, boosting intra-list diversity (e.g., a broader spread of movie recommendations) by up to 19% on movie recommendation datasets, though at a modest cost to accuracy metrics like NDCG. These adaptations yield performance gains, including elevated click-through rates (CTR) and NDCG in top positions; for example, listwise LTR variants in benchmarks have improved NDCG@10 to 0.723 from baseline levels around 0.65.

In Other Domains

Learning to rank (LTR) has found applications in bioinformatics for prioritizing candidate genes based on their relevance to specific diseases, leveraging feature vectors derived from genomic data such as expression levels, sequence motifs, and pathway interactions. In gene prioritization tasks, LTR models integrate heterogeneous data sources to rank genes by their likelihood of association with a disease, outperforming traditional scoring methods. For instance, one framework employs a distance-score aggregation approach to fuse multiple data types, achieving top performance on benchmark resources such as Online Mendelian Inheritance in Man (OMIM) for disease gene identification.

In protein structure prediction, LTR techniques sort decoy models generated by simulation methods to identify native-like conformations, using features like energy scores, stereochemical constraints, and residue contacts. Machine-learning-to-rank approaches, such as those combining pairwise and listwise losses, have demonstrated superior ranking accuracy compared to physics-based filters, with improvements in weighted mean Pearson's correlation coefficient (wmPMCC) of up to 20% on benchmark datasets. This enables more efficient identification of biologically relevant structures for downstream analyses like drug targeting.

In finance, LTR is applied to credit scoring by ranking applicants or loans according to default risk, incorporating features from financial histories and transaction patterns to produce ordinal risk assessments. Pairwise LTR methods, like RankNet variants, enhance model interpretability and fairness in regulatory-compliant scoring systems, reducing bias with respect to protected attributes while maintaining area under the curve (AUC) values above 0.85 on public datasets. For fraud detection, LTR ranks transactions or users by susceptibility to financial fraud, using temporal and graph-based features to prioritize high-risk cases; the FRAUDability framework, for example, employs adversarial learning to estimate users' vulnerability, improving detection success by 58% in simulated scenarios.

Beyond these, LTR supports virtual screening in drug discovery by ranking molecular compounds for binding affinity to target proteins, integrating multi-assay data from screening pipelines. Gradient-boosted decision trees with ranking losses, such as LambdaRank, outperform regression baselines in ligand-based virtual screening, achieving improved NDCG@10 scores on benchmark datasets by better handling ordinal relationships across diverse assay environments. In social media, LTR prioritizes feed content by relevance and engagement potential, using user interaction histories and content embeddings to mitigate biases in recommendation streams; unbiased LTR variants address position bias in feeds, boosting utility metrics like expected rank by 15% on platform-scale data.

Cross-domain challenges in LTR arise when transferring models between fields such as bioinformatics and finance, owing to shifts in feature distributions and relevance judgments. Techniques that unify LTR with domain adaptation align source and target domains, as in cross-task document scoring frameworks that adapt query-specific rankings, improving mean average precision by 10-15% on heterogeneous corpora. These methods emphasize regularization to preserve ranking structures while mitigating negative transfer in sparse-data scenarios.

Data and Features

Query-Document Representations

In learning to rank (LTR), queries and documents are encoded into feature vectors that facilitate the assessment of relevance, typically formulated for pairs (q, d) where q represents the query and d the document. These representations transform raw text and metadata into numerical formats suitable for learning algorithms, emphasizing aspects like lexical overlap, semantic similarity, and structural properties. Early LTR systems relied on sparse, high-dimensional vectors derived from traditional retrieval techniques, while modern approaches incorporate dense embeddings that capture contextual nuances.

Basic textual representations often employ the bag-of-words (BoW) model, which converts queries and documents into vectors where each dimension corresponds to a term in the vocabulary, with values indicating term occurrences, disregarding word order and grammar. This approach enables straightforward computation of similarities, such as cosine similarity between query and document vectors, but suffers from high dimensionality in large vocabularies. To address the limitations of raw counts, term frequency-inverse document frequency (TF-IDF) weighting is commonly applied, scaling each term's frequency in a document by its inverse frequency across the corpus, thereby emphasizing distinctive terms while downweighting common ones like stop words. TF-IDF vectors remain sparse but provide a more informative basis for ranking by highlighting query-relevant content.

For enhanced semantic understanding, dense embeddings have become prevalent, representing queries and documents as low-dimensional continuous vectors that encode contextual relationships. Word2Vec, a shallow neural model, generates word-level embeddings by predicting surrounding words (skip-gram) or target words from context (CBOW), allowing aggregation into query or document representations via averaging or weighting; these embeddings capture analogies and similarities, improving over BoW for paraphrased queries. More advanced models like BERT produce contextualized embeddings through bidirectional training on masked language modeling, yielding query-aware representations that account for token interactions within sequences, often pooled to form fixed-size vectors for LTR input. BERT-derived features have demonstrated superior performance in capturing nuanced relevance compared to static embeddings.

Features in LTR are categorized as query-dependent or query-independent to reflect their reliance on the query. Query-dependent features, such as query term frequency in the document or BM25 scores measuring term overlap, directly incorporate both query and document content to evaluate match quality. In contrast, query-independent features, like document length, authority scores, or URL depth, assess intrinsic document quality without query involvement, providing stable signals across diverse queries. Combining both types enriches representations, as query-dependent features handle specificity while query-independent ones mitigate biases in sparse matches.

LTR datasets typically consist of labeled triples (q, d, r), where r denotes relevance grades (e.g., on 0-4 scales), enabling supervised training on real-world search interactions. The Microsoft Learning to Rank (MSLR) dataset, derived from Bing search logs, includes over 30,000 queries with millions of document impressions and 136 extracted features per pair, supporting diverse LTR experiments. Similarly, the Yahoo Learning to Rank datasets, released through an ICML challenge, consist of two sets for web search: Set 1, based on U.S. data, with 29,921 queries (19,944 for training) and 519 features; and Set 2, based on data from an Asian country, with 6,330 queries (1,266 for training) and 596 features; both are graded on 0-4 relevance by human assessors. These formats standardize representations across sparse to dense features, facilitating model benchmarking.

High dimensionality and sparsity pose challenges in LTR representations, as vocabularies can exceed millions of terms, resulting in vectors with mostly zero entries for non-matching documents in large corpora. To handle this, techniques such as feature hashing map high-dimensional spaces to fixed lower dimensions via hash functions, preserving approximate similarities while reducing storage; methods like principal component analysis (PCA) or latent semantic analysis (LSA) project sparse vectors onto dense subspaces, though they risk information loss in ranking contexts. Sparse-aware models, including support vector machines with L1 regularization, promote sparsity during learning, yielding compact representations that maintain ranking accuracy on datasets like MSLR. These strategies ensure scalability without sacrificing relevance signals.
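As a concrete illustration of the sparse representations described above, the following sketch builds TF-IDF vectors with scikit-learn and derives a single query-dependent feature, the cosine similarity between the query and each document; the toy corpus and query are invented for the example.

```python
# Hedged sketch of a sparse query-document representation: TF-IDF vectors and
# one cosine-similarity feature per document. Corpus and query are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "learning to rank orders documents by relevance",
    "a recipe for sourdough bread",
]
query = ["rank documents by relevance"]

vectorizer = TfidfVectorizer()               # sparse bag-of-words with TF-IDF weights
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform(query)

# One query-dependent feature per document: cosine similarity to the query.
sims = cosine_similarity(query_vector, doc_vectors).ravel()
for doc, s in zip(corpus, sims):
    print(f"{s:.3f}  {doc}")
```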

Feature Engineering Techniques

Feature engineering in learning to rank (LTR) involves crafting and selecting features from raw query-document or user-item data to capture relevance signals that improve ranking model accuracy. These features transform basic representations, such as bag-of-words or embeddings, into higher-level indicators of relevance, enabling models to learn nuanced patterns. Effective feature engineering is crucial because LTR models rely on these inputs to approximate human judgments or user preferences, often handling sparse and high-dimensional data from search engines or recommendation systems.

Common types of features in LTR include linguistic, structural, and behavioral categories. Linguistic features focus on textual content, such as term frequency (TF), inverse document frequency (IDF), and n-grams that measure query term occurrences and their variations within documents. For instance, in information retrieval datasets like LETOR, TF-IDF variants and phrase-based features quantify semantic overlap between queries and documents. Structural features capture document organization and metadata, including URL depth, which indicates navigational hierarchy in web pages, and anchor text relevance from hyperlinks. Behavioral features incorporate user interactions, such as click-through rates or dwell time from historical logs, providing implicit relevance signals in real-world systems.

Feature engineering processes emphasize selection and dimensionality reduction to manage complexity. Feature selection often employs mutual information (MI), which measures the dependency between a feature and the relevance label, ranking features by their information gain to prioritize those with high predictive power while discarding redundant ones. In LTR, MI-based selection has been applied to filter thousands of term-based features, improving model efficiency without significant performance loss. Dimensionality reduction techniques like principal component analysis (PCA) project high-dimensional feature spaces into lower dimensions by retaining principal components that explain maximum variance, addressing issues in datasets with correlated features such as multiple proximity metrics. For example, linear feature extraction methods incorporating PCA have been used to compress ranking features while preserving ranking quality.

Domain-specific techniques tailor features to application contexts. In information retrieval, proximity features like term co-occurrence distance quantify how closely query terms appear in a document, enhancing models by favoring documents with clustered relevant terms over scattered ones. Such features, including the minimum distance between query terms, have been integrated into LTR pipelines to improve retrieval effectiveness. In recommendation systems, user-item graphs serve as sources of features, where nodes represent users and items and edges encode interactions like ratings or views; graph-based features, such as node degrees or shortest path lengths, capture collaborative patterns for personalized ranking.

Best practices in LTR feature engineering prioritize avoiding data leakage and ensuring scalability. Leakage occurs when features inadvertently include future information, such as post-query clicks used in training, leading to overoptimistic models; to mitigate this, features must be constructed using only data available at prediction time, with techniques like unbiased click collection separating labels from features. For scalability with millions of features, sparse representations and embedding layers are employed, as in frameworks supporting large-scale training on distributed systems without full materialization of dense vectors. These practices ensure robust, deployable LTR systems that handle production queries efficiently.
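A minimal sketch of the MI-based feature selection described above, assuming a precomputed feature matrix of query-document pairs and binary relevance labels; the synthetic data and the choice of keeping five features are illustrative only.

```python
# Illustrative sketch of mutual-information feature selection for LTR features.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))             # 500 query-document pairs, 20 candidate features
y = (X[:, 3] + 0.5 * X[:, 7] + rng.normal(scale=0.5, size=500) > 0).astype(int)  # relevance

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)  # keep the 5 most informative features
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```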

Evaluation

Performance Metrics

Performance metrics in learning to rank (LTR) evaluate the quality of produced rankings by measuring how well relevant items are positioned at the top of lists, often using ground-truth relevance labels for queries. These metrics are essential for comparing ranking models, as they quantify aspects like accuracy in retrieving relevant documents and the preservation of ranking order. In LTR, evaluation typically involves offline assessment on benchmark datasets, where higher scores indicate better performance in simulating user satisfaction.

Precision at K (P@K) measures the proportion of relevant items among the top K ranked results for a query, emphasizing the accuracy of the highest positions. It is computed as P@K = (number of relevant items in top K) / K. This metric is particularly useful in LTR for web search scenarios where users focus on initial results, prioritizing low false positives in small lists. Recall at K (R@K) complements this by assessing coverage, defined as R@K = (number of relevant items in top K) / (total number of relevant items). R@K highlights retrieval completeness within the top K, aiding evaluation when exhaustive recall matters alongside precision. The F1 score at K balances these via the harmonic mean: F1@K = 2 × (P@K × R@K) / (P@K + R@K), providing a single value that penalizes extremes in either metric. These top-K variants are standard in LTR benchmarks like LETOR, where K is often set to 10 for practical relevance.

Mean Average Precision (MAP) extends precision to account for ranking quality across all relevant items, averaging precision values at each relevant document's position. For a single query q with M relevant documents, Average Precision (AP) is AP(q) = (1/M) × Σ P@k over all positions k at which a relevant item appears. MAP is then the mean over Q queries: MAP = (1/Q) × Σ AP(q). This metric rewards systems that rank relevant items early and penalizes interleaving with non-relevant ones, making it robust for LTR in information retrieval tasks with varying relevance depths. MAP has been a core measure in TREC evaluations since the 1990s, influencing LTR model optimization.

Normalized Discounted Cumulative Gain (NDCG) addresses graded relevance in rankings, weighting higher positions more heavily while discounting lower ones logarithmically. The Discounted Cumulative Gain at position p is DCG@p = Σ_{i=1 to p} (rel_i / log_2(1 + i)), where rel_i is the graded relevance score of the item at rank i (e.g., 0 for irrelevant, up to 4 or 5 for highly relevant). NDCG@p normalizes this by the ideal DCG (IDCG@p) of a perfect ranking: NDCG@p = DCG@p / IDCG@p, yielding values between 0 and 1. This formulation, introduced for multi-level relevance, is well suited to LTR applications like recommendation systems where partial relevance exists. NDCG is widely adopted in LTR due to its sensitivity to position and its normalization across queries with different relevance distributions.

For assessing overall ranking agreement with ground-truth orderings, correlation-based metrics like Kendall's Tau and Spearman's Rho are employed. Kendall's Tau (τ) measures the balance of concordant and discordant pairs between predicted and true rankings: τ = (number of concordant pairs - number of discordant pairs) / [n(n-1)/2], where n is the number of items, ranging from -1 (inverse order) to 1 (perfect agreement). It is suitable for LTR when evaluating pairwise order preservation, especially with ties in rankings. Spearman's Rho (ρ) computes the Pearson correlation on ranks: ρ = 1 - [6 × Σ d_i^2 / (n(n^2 - 1))], where d_i is the rank difference for item i. Rho emphasizes monotonic relationships, making it useful for LTR in scenarios with ordinal scores. Both are applied in LTR evaluation to validate ranking stability, with Tau preferred for small datasets due to lower variance.
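The position-based metrics above can be implemented directly from their definitions; the following sketch mirrors the P@K and NDCG@K formulas on a toy list of graded labels, with all inputs invented for illustration.

```python
# Minimal reference implementations of P@K and NDCG@K matching the formulas above.
import numpy as np

def precision_at_k(rels, k):
    """rels: relevance labels in ranked order (nonzero = relevant)."""
    top = np.asarray(rels)[:k]
    return np.count_nonzero(top) / k

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))   # log2(1 + i) for i = 1..k
    return float(np.sum(rels / discounts))

def ndcg_at_k(rels, k):
    ideal = sorted(rels, reverse=True)                  # perfect ordering of the same labels
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

ranked_relevances = [3, 2, 0, 1, 0]   # graded labels in predicted rank order
print(precision_at_k(ranked_relevances, 3), ndcg_at_k(ranked_relevances, 5))
```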

Evaluation Protocols

Offline evaluation of learning to rank (LTR) models typically involves held-out test datasets partitioned for cross-validation, allowing models to be trained on subsets of data and evaluated on unseen portions to assess generalization. Standard benchmarks, such as the Yahoo! Learning to Rank Challenge dataset, provide partitioned training, validation, and test sets, and many LTR collections use five-fold cross-validation splits, enabling robust performance estimation across multiple iterations. Metrics like normalized discounted cumulative gain (NDCG) are applied to these held-out sets to measure ranking quality by comparing predicted rankings against ground-truth relevance labels, often derived from click logs or expert annotations adjusted for biases using techniques like inverse propensity scoring.

Online evaluation shifts focus to live deployment environments, where LTR models are tested through user interactions to capture real-world effectiveness beyond static labels. This approach measures user engagement via proxies such as click-through rate (CTR), which tracks the proportion of users clicking on ranked items, and dwell time, which indicates how long users spend on selected content as a signal of satisfaction. Experiments on large-scale search systems demonstrate that online metrics like CTR and dwell time often correlate only loosely with offline scores, highlighting the need for hybrid assessments to bridge the gap between offline evaluation and production behavior. A/B testing serves as a primary framework for online comparison, randomly assigning users to control (existing model) or treatment (new model) groups and analyzing differences in engagement metrics for statistical significance, typically using p-values from t-tests or proportion tests to determine whether improvements exceed noise. Interleaving methods enhance efficiency by merging rankings from two models into a single list per user, attributing clicks to the originating model and requiring fewer interactions than traditional A/B tests, often converging faster under relevance-aware user models, while still enabling p-value-based hypothesis testing for preference detection. These protocols, validated on platforms like Airbnb search, reduce evaluation variance but demand careful bias correction to avoid systematic errors in credit assignment.

Key challenges in these protocols include label noise and bias in crowdsourced relevance judgments, where annotator inconsistencies or subjective preferences skew ground-truth data, leading to models that misalign with true relevance and requiring debiasing via expectation-maximization or weighted aggregation. Temporal drift further complicates evaluations, as evolving user behaviors, content freshness, or query distributions over time degrade model performance on static datasets, necessitating periodic retraining or drift-detection mechanisms to maintain ranking quality in production systems.
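As a sketch of the significance testing mentioned above, the following applies a two-proportion z-test to CTR counts from a hypothetical A/B experiment; the click and impression numbers are invented, and production systems typically add variance-reduction and multiple-testing corrections on top of such a check.

```python
# Hedged sketch of an A/B significance check on CTR with a two-proportion z-test.
from math import sqrt
from scipy.stats import norm

clicks_a, impressions_a = 4_120, 100_000   # control ranker (made-up counts)
clicks_b, impressions_b = 4_310, 100_000   # treatment ranker (made-up counts)

p_a, p_b = clicks_a / impressions_a, clicks_b / impressions_b
p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))        # two-sided test
print(f"CTR lift: {p_b - p_a:.4%}, z = {z:.2f}, p = {p_value:.4f}")
```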

Learning Approaches

Pointwise Methods

Pointwise methods in learning to rank treat the ranking task as an independent problem for each document given a query, framing it as either a regression or classification problem to estimate the absolute relevance score of individual documents without considering interactions between them. This approach transforms the overall ranking objective into optimizing a scoring function f(q, d) that directly predicts the relevance label r_d for each query-document pair (q, d), allowing the use of standard regression and classification techniques designed for scalar outputs. Common loss functions for pointwise methods focus on the discrepancy between predicted and true relevance scores. For regression tasks with graded relevance, the mean squared error (MSE) is frequently employed: L = \frac{1}{N} \sum_{i=1}^{N} \left( f(q, d_i) - r_{d_i} \right)^2, where N is the number of documents, serving as a surrogate that bounds ranking measures like NDCG. In binary classification settings, where relevance is treated as relevant or non-relevant, logistic loss is used to model the probability of relevance, minimizing the cross-entropy between predicted probabilities and binary labels. Representative models include linear regression for straightforward relevance prediction and more advanced ensemble methods like gradient boosting trees adapted for pointwise objectives. A seminal example is McRank, which combines multiple binary classifiers via gradient boosting to estimate expected relevance scores from class probabilities, demonstrating effective performance on web search ranking tasks. Another early approach, subset ranking, uses regression to learn scores that approximate ideal positions in the ranked list. Pointwise methods offer simplicity and computational efficiency, as they leverage off-the-shelf regression or classification algorithms without requiring group-aware optimizations, making them suitable for large-scale applications. However, they disregard the relative ordering among documents for the same query, potentially leading to suboptimal performance on ranking-specific metrics like NDCG, since absolute score predictions do not guarantee correct pairwise preferences.
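A minimal pointwise sketch, assuming synthetic feature vectors and graded labels: relevance is fit as a regression target under squared error, and candidates are then ranked by predicted score. The model choice and data are illustrative rather than a reference implementation of any particular pointwise method.

```python
# Pointwise LTR sketch: graded relevance treated as a regression target.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))                      # query-document feature vectors
y = np.clip(np.round(X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)), 0, 4)  # labels 0-4

model = GradientBoostingRegressor()                  # default squared-error (MSE) objective
model.fit(X, y)

# At inference, documents for a query are sorted by their predicted scores.
candidates = rng.normal(size=(5, 10))
order = np.argsort(model.predict(candidates))[::-1]
print("ranked candidate indices:", order)
```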

Pairwise Methods

Pairwise methods in learning to rank address the ranking task by focusing on the relative ordering of document pairs rather than absolute scores. For a given query q, these approaches consider pairs of documents (d_i, d_j) where the relevance labels satisfy r_i > r_j, and optimize a model to ensure the predicted score f(q, d_i) > f(q, d_j). This formulation treats each pair as a binary classification problem, aiming to correctly distinguish the more relevant document from the less relevant one across all such pairs. The optimization typically involves minimizing a pairwise loss function that penalizes inversions in the predicted order. Common losses include the hinge loss, which enforces a margin between pair scores in the style of support vector machines, and the logistic loss derived from probabilistic models of preferences. A foundational probabilistic framework is the Bradley-Terry model, which defines the probability of document i being ranked above j as P(i > j \mid q) = \frac{1}{1 + \exp\left( f(q, d_j) - f(q, d_i) \right)}, where f(q, d) is the scoring function. This model underpins loss functions that encourage the predicted scores to align with observed preferences by minimizing the cross-entropy between true and predicted pairwise probabilities. Prominent algorithms exemplify these principles. RankSVM incorporates pairwise constraints into a maximum-margin framework, solving a constrained optimization problem that maximizes the separation between correctly ordered pairs while regularizing the model. RankBoost adapts boosting to ranking by sequentially training weak learners on pairwise errors, aggregating them to minimize the overall pairwise loss through iterative reweighting of preferences. RankNet uses a neural network to directly optimize the Bradley-Terry-based logistic loss via gradient descent, enabling pairwise learning with flexible feature representations. These methods offer advantages over pointwise approaches, which predict independent scores, by directly optimizing relative orders that align more closely with ranking objectives and yielding superior performance on preference-based tasks. However, the need to evaluate and optimize over all document pairs introduces quadratic complexity in the number of candidates per query, limiting scalability for large retrieval sets without approximations or sampling.
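The Bradley-Terry-based objective can be written compactly; the sketch below computes the pairwise logistic loss and its gradient for a single preference pair, using plain NumPy rather than any specific RankNet implementation.

```python
# NumPy sketch of the pairwise logistic (RankNet-style) loss for one document
# pair, following the Bradley-Terry probability defined above.
import numpy as np

def pairwise_logistic_loss(score_i, score_j):
    """Loss for a pair where document i is labeled more relevant than j."""
    # P(i > j) = 1 / (1 + exp(score_j - score_i)); minimize -log P(i > j).
    return np.log1p(np.exp(score_j - score_i))

def pairwise_loss_gradient(score_i, score_j):
    """Gradient of the loss w.r.t. score_i (the gradient w.r.t. score_j is its negative)."""
    return -1.0 / (1.0 + np.exp(score_i - score_j))

# A correctly ordered pair incurs low loss; an inverted pair is penalized.
print(pairwise_logistic_loss(2.0, 0.5))   # small loss
print(pairwise_logistic_loss(0.5, 2.0))   # large loss
```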

Listwise Methods

Listwise methods in learning to rank (LTR) treat the entire ranked list of documents for a query as the fundamental unit of learning, directly optimizing the quality of the whole list rather than individual items or pairs. This approach contrasts with pairwise methods, which approximate global quality through local pairwise comparisons. By modeling the ranking task as predicting a distribution over all possible permutations of the list, listwise methods aim to minimize losses that reflect the desired ordering more holistically. Seminal works introduced probabilistic frameworks to handle the combinatorial number of permutations, approximating the optimal ordering through tractable objectives. Key listwise losses include ListMLE, which maximizes the likelihood of the ground-truth permutation under a predicted scoring function, and SoftRank, which smooths non-differentiable ranking metrics like NDCG by optimizing expected ranks under a probabilistic model. A foundational example is the ListNet loss, formulated as the cross-entropy (equivalently, a KL-divergence-based objective) between the predicted distribution over permutations, derived from a neural network's output scores via the Plackett-Luce model, and the ground-truth distribution: L(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{\pi \in \Pi} P(\pi | \mathbf{y}) \log P(\pi | \hat{\mathbf{y}}) where \Pi denotes the set of all permutations, P(\pi | \mathbf{y}) is the ground-truth probability (uniform over ideal permutations or based on relevance labels), and P(\pi | \hat{\mathbf{y}}) is the predicted probability, computed as: P(\pi | \hat{\mathbf{y}}) = \prod_{i=1}^{n} \frac{\exp(\hat{y}_{\pi(i)})}{\sum_{j=i}^{n} \exp(\hat{y}_{\pi(j)})} for a list of size n. This loss encourages the model to produce rankings that closely match the ideal order in a distributionally aware manner. Prominent models employing listwise optimization include LambdaRank, which adapts pairwise gradient training to directly optimize ranking metrics like NDCG by computing metric-weighted, position-based gradients (lambdas) for each document, and LambdaMART, a multiple additive regression trees (MART) extension that combines LambdaRank's metric-driven gradients with gradient-boosted trees for enhanced flexibility. More recent neural variants, such as listwise neural rankers, integrate deep architectures like transformers to capture list-level interactions, often building on ListNet-style losses for end-to-end training. Listwise methods offer the benefit of direct alignment with evaluation metrics such as NDCG, leading to improved ranking quality in practice, as demonstrated in benchmarks where they outperform pairwise approaches on datasets like MSLR-WEB30K. However, their computational intensity arises from the need to evaluate list-level losses, which scale poorly with list length and require approximations for large-scale applications.
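In practice, ListNet is usually trained with a top-one approximation rather than the full permutation sum; the sketch below shows that variant, comparing the cross-entropy loss for a well-ordered and a poorly-ordered score vector on a toy list with invented labels.

```python
# Sketch of the ListNet objective with the common top-one approximation:
# cross-entropy between softmax distributions of labels and predicted scores.
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def listnet_top_one_loss(true_relevance, predicted_scores):
    p_true = softmax(np.asarray(true_relevance, dtype=float))
    p_pred = softmax(np.asarray(predicted_scores, dtype=float))
    return float(-np.sum(p_true * np.log(p_pred + 1e-12)))

labels = [3, 1, 0, 2]                  # graded relevance for one query's list
good_scores = [2.9, 1.1, 0.2, 2.0]     # roughly matches the label ordering
bad_scores = [0.1, 2.5, 3.0, 0.4]      # inverts the label ordering
print(listnet_top_one_loss(labels, good_scores))  # lower loss
print(listnet_top_one_loss(labels, bad_scores))   # higher loss
```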

Notable Algorithms and Models

One of the earliest and most influential algorithms in learning to rank (LTR) is RankNet, a pairwise neural model introduced in 2005 that optimizes a probabilistic cross-entropy loss to predict pairwise relevance probabilities between documents. RankNet uses gradient descent to train a multi-layer neural network on feature differences, enabling effective ranking for search tasks. Another classic approach is Multiple Additive Regression Trees (MART), a gradient-boosted ensemble of regression trees adapted for LTR, which constructs relevance scores by sequentially adding trees to minimize ranking errors. MART, based on Friedman's gradient boosting framework from 2001, excels in handling non-linear feature interactions and was a strong performer in early LTR evaluations. Among advanced methods, LambdaMART, proposed in 2010, integrates listwise optimization from LambdaRank with MART as the base learner, directly approximating gradients for metrics like normalized discounted cumulative gain (NDCG) to improve ranking quality. This combination allows LambdaMART to outperform pairwise models like RankNet on datasets derived from TREC. Coordinate Ascent, introduced in 2007, is an iterative optimization algorithm for linear ranking models that maximizes non-smooth metrics like NDCG by alternately updating individual feature weights while fixing the others. It is particularly efficient for sparse, high-dimensional features in retrieval and has been widely used in systems requiring interpretable weights. In modern neural LTR for recommendation systems, DeepCTR frameworks incorporate models like the Deep Interest Network (DIN), which uses attention mechanisms over user behavior sequences to predict click-through rates and rank items dynamically. DIN, from 2018, captures temporal interests more effectively than static embeddings, leading to relative improvements of up to 12% on industrial datasets. Transformer-based rankers represent a post-2015 shift toward deep semantic understanding; for instance, monoBERT fine-tunes BERT as a pointwise classifier on query-document pairs, leveraging contextual embeddings for scoring. On TREC Complex Answer Retrieval benchmarks, monoBERT achieves a mean average precision (MAP) score of 0.348. Empirical comparisons on TREC-derived benchmarks, such as those in LETOR, show tree-based methods like LambdaMART maintaining strong performance in sparse feature settings, while neural models like monoBERT and DIN dominate in dense, semantic-heavy tasks post-2015 due to pre-trained representations.
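For a sense of how such models are used in practice, the sketch below trains a LambdaMART-style ranker with LightGBM's LGBMRanker on synthetic data; the hyperparameters, group sizes, and labels are illustrative placeholders, not tuned settings from any benchmark.

```python
# Hedged sketch: LambdaMART-style training with LightGBM's ranking objective.
import numpy as np
from lightgbm import LGBMRanker

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))                  # features for 300 query-document pairs
y = rng.integers(0, 5, size=300)                # graded relevance labels 0-4
group_sizes = [10] * 30                         # 30 queries with 10 candidates each

ranker = LGBMRanker(objective="lambdarank", n_estimators=100, learning_rate=0.1)
ranker.fit(X, y, group=group_sizes)             # group sizes delimit each query's list

# Score and sort candidates for a new query.
candidates = rng.normal(size=(10, 15))
order = np.argsort(ranker.predict(candidates))[::-1]
print("ranked candidate indices:", order)
```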

History and Developments

Early Foundations

The foundations of learning to rank (LTR) emerged from 1990s advancements in information retrieval (IR), particularly probabilistic ranking models that assigned scores to documents based on query relevance. A key precursor was the Okapi BM25 algorithm, developed by Stephen Robertson, Steve Walker, and colleagues in the mid-1990s as part of the Okapi IR system at City University London. BM25 extended earlier probabilistic frameworks by incorporating term frequency saturation and inverse document frequency, enabling effective ranking without machine learning but highlighting the need for tunable scoring functions. Parallel developments in early machine learning for IR included relevance feedback techniques, such as the Rocchio algorithm, originally proposed in 1971 but widely adapted in 1990s systems to iteratively refine query representations using vector space models and user-provided relevance labels. These methods demonstrated how labeled relevance data could improve ranking, setting the stage for supervised LTR approaches.

The Text REtrieval Conference (TREC), initiated in 1992 by the National Institute of Standards and Technology (NIST), significantly influenced LTR by establishing standardized large-scale test collections with human-generated relevance judgments. TREC's ad hoc retrieval tracks provided millions of documents paired with thousands of queries and binary or graded relevance labels, creating the labeled datasets essential for training and evaluating machine learning models in IR. This infrastructure addressed prior limitations in small-scale experiments, fostering empirical research that underscored the value of labeled data for developing data-driven ranking systems. By the early 2000s, TREC data became a cornerstone for LTR experimentation, enabling comparisons between heuristic and learned rankers.

Between 2000 and 2005, seminal papers introduced pointwise LTR methods, framing ranking as a regression or classification problem that predicts absolute relevance scores for individual documents. A notable contribution was the work by Ramesh Nallapati (2004), which applied maximum entropy models to ad hoc retrieval tasks using TREC data, demonstrating how such discriminative models could outperform traditional IR baselines by learning from feature-based relevance labels. These efforts built on earlier ML integrations, shifting focus from rule-based scoring to supervised prediction of document scores. The first dedicated LTR workshop at SIGIR 2007 further solidified academic momentum, convening researchers to discuss pointwise, pairwise, and emerging listwise approaches using TREC-derived benchmarks.

Theoretically, LTR drew from ordinal regression, where ranks are treated as ordered categories rather than continuous values, as formalized in early models like the proportional odds model of McCullagh in 1980 and extended to ranking boundaries in Chu and Keerthi's 2005 large-margin formulation. Similarly, connections to preference learning emphasized pairwise comparisons for inducing total orders, with foundational work by Fürnkranz and Hüllermeier in 2003 proposing algorithms to learn rankings from binary preference data, bridging statistical learning and IR evaluation metrics. These underpinnings provided LTR with rigorous loss functions and optimization principles, ensuring learned models preserved ordinal structure and preference consistency.

Adoption in Industry

Learning to rank (LTR) techniques saw early adoption in major search engines during the late 2000s, marking a shift from heuristic-based ranking to machine learning-driven approaches. Yahoo integrated LambdaMART-style gradient-boosted tree ensembles combining pairwise and listwise optimization into its core web search ranking pipeline by 2009, as evidenced by the production-derived datasets released for the 2010 Yahoo! Learning to Rank Challenge, which drew from Yahoo's internal systems. Similarly, Microsoft's Bing adopted LTR methods starting with RankNet in 2004 for initial relevance improvements over linear models, evolving to ensembles of LambdaMART by 2010, which directly optimized metrics like normalized discounted cumulative gain (NDCG) for better top-k results. Google's adoption of LTR principles came later through RankBrain, launched in 2015 as a deep learning system that implicitly learned ranking signals from query-document interactions, handling the roughly 15% of searches involving novel queries by embedding words into vectors for semantic matching. Although not explicitly termed LTR, RankBrain represented a neural extension of ranking optimization, enhancing relevance for long-tail queries beyond traditional keyword matching. Beyond search engines, LTR methods influenced recommendation and social platforms. Facebook employed LTR in its News Feed ranking starting around 2010, evolving from the EdgeRank heuristic to machine-learned models that included pairwise approaches to predict relative engagement between content pairs, personalizing feeds for over 2 billion users by scoring posts on relevance and engagement signals. In e-commerce, Alibaba integrated deep LTR models into its Taobao and Tmall platforms by the early 2010s, using neural networks for multi-stage ranking in product search and recommendations, incorporating user behavior and contextual features to optimize click-through and conversion rates. These adoptions yielded measurable impacts, with LTR models like LambdaMART delivering significant relevance gains, often 5-10% improvements in NDCG and user engagement metrics over baselines in production environments, while enabling scalable training on large datasets. Open-sourcing efforts further accelerated industry uptake, exemplified by Microsoft's LightGBM framework in 2017, which extended gradient boosting to ranking tasks and became widely used for its efficiency in handling sparse, high-dimensional features.

Recent Advances

Since the mid-2010s, the integration of deep learning has transformed learning to rank (LTR) by enabling more expressive models that capture complex semantic relationships in queries and documents. Transformer-based architectures, in particular, have become prominent for their ability to handle long-range dependencies through self-attention mechanisms. A seminal advancement is the ColBERT model, introduced in 2020, which employs a late interaction paradigm: it encodes queries and documents separately using BERT to produce token-level embeddings, then computes relevance via efficient maximum-similarity matching between query and document tokens, achieving state-of-the-art performance on passage retrieval tasks while reducing computational overhead compared to cross-encoder models. This approach has influenced subsequent works, such as ColBERTv2 in 2022, which refines late interaction with lightweight fine-tuning to further enhance retrieval effectiveness and speed.

End-to-end neural information retrieval (IR) systems have further advanced LTR by jointly optimizing ranking with other IR components, such as query encoding and document representation. Transformer-based models like those explored in the TREC Deep Learning tracks since 2019 have demonstrated superior performance by incorporating contextual embeddings directly into ranking functions, often outperforming traditional sparse retrieval methods on benchmarks like MS MARCO. For instance, conformer-enhanced transformer-kernel models, benchmarked in the TREC Deep Learning track, improved ranking accuracy under blind evaluation settings by integrating convolutional layers for local feature capture alongside global self-attention. These developments have shifted LTR toward dense vector representations, where relevance is modeled as similarity in high-dimensional embedding spaces rather than term overlap.

Efficiency improvements have addressed the scalability challenges of deep LTR models, particularly for large-scale deployment. Approximate nearest neighbor (ANN) search techniques, such as hierarchical navigable small world (HNSW) graphs, enable fast retrieval of similar embeddings for candidate generation, reducing query latency from milliseconds to microseconds on billion-scale corpora without significant accuracy loss. Knowledge distillation has complemented this by transferring knowledge from large teacher models (e.g., BERT-based rankers) to compact student models, achieving up to 3x speedup in scoring time while retaining over 95% of the original performance on metrics like NDCG. These methods have made neural LTR viable for real-time applications, as evidenced in hybrid systems blending sparse and dense retrieval.

Multimodal LTR has emerged to handle diverse data types beyond text, incorporating visual and auditory features for richer ranking in video and image search. In short-video platforms, recommendation systems leverage multimodal fusion to rank micro-videos by combining textual captions, visual frames, and audio signals, improving user engagement through graph contrastive learning that aligns cross-modal representations. This approach dynamically weighs modalities based on content characteristics, enhancing relevance in short-video feeds.

Emerging trends focus on privacy and fairness in LTR. Federated LTR enables collaborative model training across distributed devices without sharing raw user data, preserving privacy while adapting to local preferences; for example, federated pairwise methods from 2021 achieve accuracy comparable to centralized training on online ranking tasks while offering privacy guarantees. Counterfactual evaluation addresses biases in logged interaction data, allowing unbiased LTR by estimating examination propensities and correcting for position bias; recent two-stage frameworks in 2025 extend this to sequential ranking policies, improving robustness on real-world datasets.
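A minimal sketch of the propensity correction behind counterfactual LTR, assuming a simple position-based examination model (propensity 1/position) and made-up logged clicks and per-document losses; real systems estimate propensities from data rather than assuming them.

```python
# Illustrative inverse-propensity-scored (IPS) weighting for counterfactual LTR:
# clicks logged at low-visibility positions are upweighted by 1 / propensity.
import numpy as np

positions = np.array([1, 2, 3, 4, 5])            # logged display positions
clicks = np.array([1, 0, 1, 0, 0])               # logged click feedback
propensity = 1.0 / positions                     # assumed examination probability per position
losses = np.array([0.2, 0.8, 0.4, 0.6, 0.9])     # per-document surrogate losses (made up)

# IPS estimate of the full-information loss versus the naive click-weighted loss.
ips_loss = np.sum(clicks * losses / propensity) / len(positions)
naive_loss = np.sum(clicks * losses) / len(positions)
print(f"naive: {naive_loss:.3f}, IPS-corrected: {ips_loss:.3f}")
```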

Challenges

Vulnerabilities and Robustness

Learning to rank (LTR) models are susceptible to adversarial vulnerabilities, where malicious actors can manipulate inputs to alter ranking outcomes. Query perturbations involve subtle modifications to search queries, such as synonym substitutions or token additions, to mislead the model into promoting irrelevant or harmful items higher in the list. Feature poisoning, another form of attack, targets document features like text embeddings or metadata by injecting deceptive elements that exploit the model's reliance on these signals for scoring. Early studies demonstrated the effectiveness of black-box attacks, where attackers query the model without internal access, achieving significant rank shifts with minimal perturbations on datasets like MSLR-WEB30K.

To enhance robustness, techniques such as adversarial training incorporate min-max optimization during model training, where the objective minimizes the model's loss on worst-case perturbations generated by an inner maximization step. This approach trains the ranker to withstand query or document attacks by augmenting the training data with adversarially crafted examples, often using surrogate losses to balance effectiveness and resilience. Input sanitization methods complement this by preprocessing queries and features to detect and filter anomalies, such as unusual synonym patterns or feature outliers, thereby reducing the attack surface without retraining the model.

Bias issues in LTR exacerbate vulnerabilities by amplifying inherent skews in training data, leading to unfair rankings that disproportionately favor certain demographics or popular items. In recommender systems, this manifests as demographic skew, where models trained on historical interaction data reinforce underrepresentation of minority groups, such as gender or ethnic biases in job or product recommendations. Popularity bias further compounds this, as LTR algorithms iteratively promote high-exposure items, creating a feedback loop that marginalizes diverse content and reduces overall system equity.

Real-world case studies highlight these exploits, particularly in search engine manipulation through SEO spam. Attackers optimize content with keyword stuffing or link farms to poison features, tricking LTR models into elevating low-quality sites. In production systems, this has led to persistent issues in e-commerce and news search, where biased or adversarially altered rankings propagate misinformation or commercial spam, underscoring the need for ongoing robustness evaluations. Recent studies as of 2024 have investigated the robustness of counterfactual LTR models through reproducibility experiments, revealing sensitivities to simulation assumptions that affect practical deployment.

Scalability and Practical Issues

Learning to rank (LTR) models often require training on massive datasets comprising billions of query-document pairs to achieve high performance in real-world search and recommendation systems. For instance, production systems at large technology companies train deep ranking models on billions of user interaction examples, highlighting the immense computational resources needed for gradient-based optimization in pairwise or listwise approaches. Similarly, Google's large-scale LTR efforts emphasize the burden of optimizing ranking objectives over vast corpora, where pairwise methods like RankSVM incur high costs due to the quadratic growth of pairs with list size. Inference latency poses another challenge, as ranking demands sub-millisecond predictions for thousands of candidates per query in high-throughput environments like web search.

To address these demands, distributed training frameworks have become essential. TensorFlow Ranking (TF-Ranking), an open-source library, enables scalable LTR by leveraging TensorFlow's distributed strategies, such as between-graph replication and asynchronous updates across hundreds of workers, achieving near-linear speedup on datasets with hundreds of millions of examples while maintaining ranking metrics like MRR. Candidate generation techniques further enhance efficiency by first retrieving a small subset (e.g., hundreds) of potential items using approximate methods like inverted indexes or two-tower embeddings, limiting full LTR re-ranking to this reduced scope and reducing overall latency by orders of magnitude in systems like Instagram's Explore recommendations. Recent work has introduced scale-invariant LTR methods to handle feature-scale discrepancies between training and production, improving consistency in large-scale deployments.

Data acquisition remains a significant hurdle, particularly the high costs of obtaining relevance labels through human annotation for training LTR models. Annotating large-scale collections for web search can be prohibitively expensive, often requiring active learning strategies to select informative query-document pairs and minimize labeling effort while maximizing model gains. The cold-start problem exacerbates this, where new queries or items lack interaction history, leading to poor initial rankings; approaches like dataset transfer with inverse propensity weighting have been proposed to adapt models from related domains in LTR settings. In practice, hybrid systems integrate LTR models with rule-based heuristics to balance accuracy and efficiency, such as using heuristics for initial candidate filtering in ranking pipelines to handle noisy or sparse data. Continuous monitoring for model drift is also critical, tracking shifts in input distributions or ranking performance metrics to detect degradation and trigger retraining, as implemented in production monitoring tools for ranking models.
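As an illustration of drift monitoring, the sketch below compares a baseline and a recent score distribution with the population stability index (PSI); the data is synthetic, and the 0.2 alert threshold is a common rule of thumb rather than a universal standard.

```python
# Minimal drift-monitoring sketch: population stability index (PSI) between a
# training-time baseline distribution and a recent production distribution.
import numpy as np

def psi(baseline, recent, bins=10):
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0] -= 1e-9                                      # include the minimum value
    b_counts, _ = np.histogram(baseline, edges)
    r_counts, _ = np.histogram(np.clip(recent, edges[0], edges[-1]), edges)
    b_frac = b_counts / len(baseline) + 1e-6              # smooth empty bins
    r_frac = r_counts / len(recent) + 1e-6
    return float(np.sum((r_frac - b_frac) * np.log(r_frac / b_frac)))

rng = np.random.default_rng(1)
baseline_scores = rng.normal(0.0, 1.0, 50_000)            # scores at training time
recent_scores = rng.normal(0.3, 1.1, 50_000)              # shifted production scores
value = psi(baseline_scores, recent_scores)
print(f"PSI = {value:.3f}", "-> investigate/retrain" if value > 0.2 else "-> ok")
```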

References

  1. [1]
    [PDF] Learning to Rank: From Pairwise Approach to Listwise Approach
    The paper is concerned with learning to rank, which is to construct a model or a function for ranking objects. Learning to rank is useful for.Missing: survey | Show results with:survey
  2. [2]
    An Overview of Learning to Rank for Information Retrieval
    May 23, 2025 · This paper presents an overview of learning to rank. It includes three parts: related concepts including the definitions of ranking and ...
  3. [3]
    [PDF] Methods, Applications, and Directions of Learning-to-Rank in NLP ...
    Jun 21, 2024 · Learning-to-rank (LTR) is the process of applying machine learning methods to the task of ranking, i.e., to learn how to order elements in a.
  4. [4]
    now publishers - Learning to Rank for Information Retrieval
    ### Summary of Learning to Rank for Information Retrieval (INR-016)
  5. [5]
    [PDF] A Short Introduction to Learning to Rank - Northeastern University
    SUMMARY. Learning to rank refers to machine learning techniques for training the model in a ranking task. Learning to rank is useful for many applications ...
  6. [6]
    [PDF] A Literature Review on Methods for Learning to Rank - SciTePress
    This paper aims to provide a systematic review of the literature that addresses Learning to Rank. Our review would help future information systems re- searchers ...
  7. [7]
    Learning to Rank for Information Retrieval - SpringerLink
    Tie-Yan Liu is a lead researcher at Microsoft Research Asia. He leads a team working on learning to rank for information retrieval, and graph-based machine ...
  8. [8]
    From structured search to learning-to-rank-and-retrieve
    Using reinforcement learning improves candidate selection and ranking for search, ad platforms, and recommender systems.
  9. [9]
    Optimal online learning in bidding for sponsored search auctions
    ... ranking, filtering, placement, and pricing of ads. In this paper, we introduce a click-through rate prediction algorithm based on the learning-to-rank approach.
  10. [10]
    Microsoft Learning to Rank Datasets
    Jun 10, 2010 · We released two large scale datasets for research on learning to rank: MSLR-WEB30k with more than 30,000 queries and a random sampling of it ...Dataset Descriptions · Dataset Partition · Feature List
  11. [11]
    [PDF] Two-Stage Learning to Rank for Information Retrieval
    MSE is the most effective Stage. A model: it achieves an average gain of 16% in MAP over the CA[BM25] baseline. Page 10. 432. V. Dang, M. Bendersky, and W.B. ...
  12. [12]
    Learning to rank for recommender systems - ACM Digital Library
    This tutorial will provide an in depth picture of the progress of ranking models in the field, summarizing the strengths and weaknesses of existing methods.
  13. [13]
    None
    ### Summary of E-Commerce Learning to Rank (LTR) Survey (arXiv:2412.03581)
  14. [14]
    [PDF] RankFormer: Listwise Learning-to-Rank Using Listwide Labels
    RankFormer is a Transformer-based architecture that jointly optimizes listwise and listwide learning-to-rank objectives, modeling overall list quality.<|separator|>
  15. [15]
    [PDF] Deep Neural Networks for YouTube Recommendations
    Sep 15, 2016 · In this paper we will focus on the immense im- pact deep learning has recently had on the YouTube video recommendations system. Figure 1 ...
  16. [16]
    [PDF] Incorporating Diversity in a Learning to Rank Recommender System
    In this paper, we explore the use of regularisation to enhance the diversity of the recommendations produced by these methods. Given a matrix of pairwise item ...
  17. [17]
    Sorting protein decoys by machine-learning-to-rank | Scientific Reports
    Aug 17, 2016 · The learning-to-rank methods combine information retrieval techniques with machine learning theory, and their goal is to obtain a ranking ...
  18. [18]
    FRAUDability: Estimating Users' Susceptibility to Financial Fraud ...
    Dec 2, 2023 · In this paper, we examine the application of adversarial learning based ranking techniques in the fraud detection domain and propose FRAUDability.
  19. [19]
    Compound virtual screening by learning-to-rank with gradient ... - arXiv
    May 4, 2022 · Ranking prediction models learn based on ordinal relationships, making them suitable for integrating assay data from various environments.
  20. [20]
    Unbiased Learning to Rank in Feeds Recommendation
    Mar 8, 2021 · We propose an Unbiased Learning to Rank with Combinational Propensity (ULTR-CP) framework to remove the inherent biases jointly caused by ...
  21. [21]
    Unifying learning to rank and domain adaptation - ACM Digital Library
    In this paper, we propose to study the cross-task document scoring problem, where a task refers to a query to rank or a domain to adapt to, as the first attempt ...
  22. [22]
    On the usefulness of query features for learning to rank
    Learning to rank studies have mostly focused on query-dependent and query-independent document features, which enable the learning of ranking models of ...
  23. [23]
    [PDF] Learning to Rank with (a Lot of) Word Features - Ronan Collobert
    Ranking is then performed by sorting the documents based on their similarity score with the query. For example, a classical vector space model, see e.g. [1], ...
  24. [24]
    [PDF] Scoring, term - Introduction to Information Retrieval
    Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query. 2. Next, in Section 6.2 we develop the idea of weighting ...
  25. [25]
  26. [26]
    Yahoo! Learning to Rank Challenge Overview
    This paper provides an overview and an analysis of this challenge, along with a detailed description of the released datasets.
  27. [27]
    An exponentiated gradient algorithm for sparse learning-to-rank
    This paper focuses on the problem of sparse learning-to-rank, where the learned ranking models usually have very few non-zero coefficients.
  28. [28]
    [PDF] Learning Sparse SVM for Feature Selection on Very High ...
    It iteratively generates a pool of violated sparse feature subsets and then combines them via efficient Multiple Kernel Learning (MKL) algorithm. FGM shows ...
  29. [29]
    Feature selection for ranking - ACM Digital Library
    We propose a new feature selection method in this paper. Specifically, for each feature we use its value to rank the training instances, and define the ranking ...
  30. [30]
    [PDF] Linear feature extraction for ranking - IRLab
    May 2, 2018 · In this paper, we have addressed the feature extraction problem for learning to rank, and have proposed LifeRank, a linear feature ...
  31. [31]
    An exploration of proximity measures in information retrieval
    This paper explores how the proximity of query terms in a document can be used to promote scores, proposing five different proximity measures.
  32. [32]
    [PDF] Graph Learning based Recommender Systems: A Review - IJCAI
    GLRS employ advanced graph learning approaches to model users' preferences and intentions as well as items' characteristics for ...
  33. [33]
    Can clicks be both labels and features? Unbiased behavior feature ...
    In this paper, we explore the possibility of incorporating user clicks as both training labels and ranking features for learning to rank. We formally ...
  34. [34]
    TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank - arXiv
    Nov 30, 2018 · We propose TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework.
  35. [35]
    [PDF] Evaluation in information retrieval - Stanford NLP Group
    This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification ...
  36. [36]
  37. [37]
    A new rank correlation coefficient for information retrieval
    The Kendall's Τ statistic, however, does not make such distinctions and equally penalizes errors both at high and low rankings. In this paper, we propose a new ...
  38. [38]
    [PDF] Yahoo! Learning to Rank Challenge Overview
    Learning to rank is a relatively new field in which machine learning algorithms are used to learn this ranking function. It is of particular importance for web ...
  39. [39]
  40. [40]
  41. [41]
  42. [42]
    A Modification of LambdaMART to Handle Noisy Crowdsourced ...
    We consider noisy crowdsourced assessments and their impact on learning-to-rank algorithms. Starting with EM-weighted assessments, we modify LambdaMART in ...
  43. [43]
    [PDF] A Short Introduction to Learning to Rank
    This short paper gives an introduction to learning to rank, and it specifically explains the fundamental problems, existing approaches, and future work of ...
  44. [44]
    Learning to Rank for Information Retrieval - ACM Digital Library
    ... pointwise, pairwise, and listwise approaches. The advantages and disadvantages with each approach are analyzed, and the relationships between the loss ...
  45. [45]
    [PDF] Learning to Rank using Gradient Descent
    We investigate using gradient descent meth- ods for learning ranking functions; we pro- pose a simple probabilistic cost function, and we introduce RankNet, an ...
  46. [46]
    [PDF] Optimizing Search Engines using Clickthrough Data
    The following experiments verify whether the inferences drawn from the clickthrough data are justified, and whether the Ranking SVM can successfully use such ...
  47. [47]
    [PDF] An Efficient Boosting Algorithm for Combining Preferences
    In this paper, we introduce and study an efficient learning algorithm called RankBoost for combining multiple rankings or preferences (we use these terms ...
  48. [48]
    [PDF] Large Scale Learning to Rank - Google Research
    Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing an objective defined over.
  49. [49]
    [PDF] Learning to Rank using Gradient Descent
    We investigate using gradient descent meth- ods for learning ranking functions; we pro- pose a simple probabilistic cost function, and we introduce RankNet, an ...
  50. [50]
    [PDF] From RankNet to LambdaRank to LambdaMART: An Overview
    RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of ...
  51. [51]
    [PDF] LETOR: A Benchmark Collection for Research on Learning to Rank ...
    LETOR is a benchmark collection for learning to rank research in information retrieval, released by Microsoft Research Asia, that eases algorithm development.
  52. [52]
  53. [53]
    [PDF] arXiv:1910.14424v1 [cs.IR] 31 Oct 2019
    Oct 31, 2019 · We propose two variants, called monoBERT and duoBERT, that formulate the ranking problem as pointwise and pairwise classification, ...
  54. [54]
    [PDF] A Deep Look into Neural Ranking Models for Information Retrieval
    Later, between 2014 and 2015, work on neural ranking models began to grow, such as new variants of DSSM [13], ARC I and ARC II [17], MatchPyramid [18], and ...
  55. [55]
    Relevance feedback in information retrieval - Semantic Scholar
    Semantic Scholar extracted view of "Relevance feedback in information retrieval" by J. Rocchio.
  56. [56]
    [PDF] The first text REtrieval conference (TREC-1)
    a long history of experimentation in information retrieval. Research started with experiments in indexing languages, such as the Cranfield I tests ...
  57. [57]
    [PDF] The Text REtrieval Conference (TREC): History and Plans for TREC-9
    The first conference took place in September, 1992 with 25 participating groups including most of the leading text retrieval research groups. Although scaling ...
  58. [58]
    Pairwise Preference Learning and Ranking - SpringerLink
    The main objective of this work is to investigate the trade-off between the quality of the induced ranking function and the computational complexity of the ...
  59. [59]
    RankNet: A ranking retrospective - Microsoft Research
    Jul 7, 2015 · RankNet is a feedforward neural network model. Before it can be used its parameters must be learned using a large amount of labeled data, called the training ...
  60. [60]
    How AI powers great search results - The Keyword
    a smarter ranking system. When we launched RankBrain in 2015, it was the first deep learning system deployed in Search. At the ...
  61. [61]
    A Complete Guide to the Google RankBrain Algorithm
    Sep 2, 2020 · RankBrain is a system by which Google can better understand the likely user intent of a search query. It was rolled out in the spring of 2015, ...
  62. [62]
    News Feed ranking, powered by machine learning
    Jan 26, 2021 · We use ML to predict which content will matter most to each person to support a more engaging and positive experience.
  63. [63]
    Reinforcement Learning to Rank in E-Commerce Search Engine
    For better utilizing the correlation between different ranking steps, in this paper, we propose to use reinforcement learning (RL) to learn an optimal ranking ...
  64. [64]
    microsoft/LightGBM - GitHub
    A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, ...
  65. [65]
    ColBERT: Efficient and Effective Passage Search via Contextualized ...
    Apr 27, 2020 · ColBERT introduces a late interaction architecture that independently encodes the query and the document using BERT and then employs a cheap yet ...
  66. [66]
    [PDF] Improving Transformer-Kernel Ranking Model Using Conformer and ...
    We benchmark our models under the strictly blind evaluation setting of the TREC 2020 Deep Learning track and find that our proposed architecture changes lead to ...
  67. [67]
    [PDF] Knowledge Distillation for High Dimensional Search Index
    Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine ...
  68. [68]
    Distilled Neural Networks for Efficient Learning to Rank - arXiv
    Feb 22, 2022 · This paper proposes using distillation, pruning, and fast matrix multiplication to speed up neural scoring time in learning to rank, achieving ...
  69. [69]
    Multi-modal Graph Contrastive Learning for Micro-video ...
    We propose a novel learning method named Multi-Modal Graph Contrastive Learning (MMGCL), which aims to explicitly enhance multi-modal representation learning.
  70. [70]
    Effective and Privacy-preserving Federated Online Learning to Rank
    In this paper, we propose a Federated OLTR method, called FPDGD, which leverages the state-of-the-art Pairwise Differentiable Gradient Descent (PDGD) and adapts ...
  71. [71]
    Towards Two-Stage Counterfactual Learning to Rank
    Jul 18, 2025 · Abstract. Counterfactual learning to rank (CLTR) aims to learn a ranking policy from user interactions while correcting for the inherent biases ...
  72. [72]
    [2106.03614] Adversarial Attack and Defense in Deep Ranking - arXiv
    Jun 7, 2021 · In this paper, we propose two attacks against deep ranking systems, i.e., Candidate Attack and Query Attack, that can raise or lower the rank ...
  73. [73]
    [PDF] Adversarial Ranking Attack and Defense
    Abstract. Deep Neural Network (DNN) classifiers are vulnerable to adversarial attack, where an imperceptible perturbation could result in misclassification.
  74. [74]
    An in-depth study on adversarial learning-to-rank
    Feb 28, 2023 · To cope with these problems, firstly, we show how to perform adversarial learning-to-rank in a listwise manner by following the GAN framework.
  75. [75]
    [PDF] Perturbation-Invariant Adversarial Training for Neural Ranking Models
    Given a ranking model, WSRA aims to promote a target document in rankings by replacing important words in its text with synonyms in a semantics-preserving way.
  76. [76]
    [PDF] Practical Relative Order Attack in Deep Ranking - CVF Open Access
    In this paper, we formulate a new adversarial attack against deep ranking systems, i.e., the Order Attack, which covertly alters the relative order among a ...
  77. [77]
    Fairness in Ranking, Part II: Learning-to-Rank and Recommender ...
    In Part II of the survey we discuss technical work on fairness in supervised learning-to-rank (LtR) and highlight representative examples of recent fairness ...
  78. [78]
    [PDF] Ranking with Popularity Bias: User Welfare under Self-Amplification ...
    Nov 1, 2023 · Such popularity-driven rankers gradually amplify the positions of higher-ranked items. Particularly, though users' selections are positively ...