String metric
A string metric, also known as a string similarity metric or string distance function, is a mathematical measure that quantifies the degree of similarity or dissimilarity between two strings, typically by computing a numerical value; for distance functions, lower scores indicate greater resemblance despite variations in spelling, order, or length, while similarity functions assign higher scores to closer matches.[1] These metrics are fundamental in computer science and information theory for handling approximate string matching, where exact equality is insufficient due to errors, abbreviations, or inconsistencies in textual data.[2]

String metrics can be broadly categorized into character-based, token-based, and hybrid approaches, each suited to different aspects of string comparison. Character-based metrics, such as the Levenshtein (edit) distance, which counts the minimum number of single-character insertions, deletions, or substitutions needed to transform one string into another, focus on sequential alignments and are ideal for detecting typographical errors.[3] Token-based metrics, like Jaccard similarity, treat strings as sets of words or n-grams and measure overlap, making them effective for handling reordered or abbreviated terms in longer texts.[4] Hybrid methods, such as SoftTFIDF, combine token frequencies with edit distances to balance precision and recall in noisy datasets.[3]

These metrics find wide application in data processing tasks, including duplicate detection, entity resolution, spell checking, and information retrieval systems, where they enable efficient similarity searches over large-scale textual corpora.[2] For instance, in record linkage for census or database integration, metrics like Jaro-Winkler have demonstrated high accuracy by emphasizing prefix matches common in names.[3] Ongoing research addresses scalability challenges through filter-and-verify strategies and specialized indexing, ensuring their utility in modern big data environments.[4]

Fundamentals
Definition
A string metric is a function that quantifies the degree of similarity or dissimilarity between two strings, where strings are typically represented as finite sequences of characters, tokens, or symbols from a finite alphabet.[4] These metrics enable approximate string matching by providing a numerical score that reflects how closely two strings resemble each other, accommodating variations that arise in real-world data processing.[5] String metrics can be broadly categorized into distance measures and similarity measures. Distance metrics, such as edit distance, compute a non-negative value indicating the minimum effort required to transform one string into another, where lower values denote greater similarity.[4] In contrast, similarity metrics yield scores where higher values indicate closer resemblance, often derived by inverting or normalizing distance functions.[5] This distinction allows flexibility in applications, as the choice depends on whether the focus is on divergence or affinity.

Exact string matching often fails in practical scenarios due to inconsistencies in data, such as typographical errors, phonetic variations, abbreviations, or alternative spellings of names and terms.[2] String metrics address these challenges by providing a robust way to identify near-matches, which is particularly valuable in domains like information retrieval where precise equality is impractical.[4] For instance, search engines leverage such metrics to return relevant results despite minor discrepancies in user queries.
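To make the distance/similarity duality concrete, the following minimal Python sketch uses the standard library's difflib.SequenceMatcher, whose ratio is a matching-blocks similarity rather than one of the metrics defined in this article; it is used here only to illustrate how a similarity score in [0, 1] and its complementary dissimilarity mirror each other.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; higher values mean the strings are more alike."""
    return SequenceMatcher(None, a, b).ratio()

def dissimilarity(a: str, b: str) -> float:
    """Complementary dissimilarity in [0, 1]; lower values mean more alike."""
    return 1.0 - similarity(a, b)

print(similarity("color", "colour"))     # close to 1.0
print(dissimilarity("color", "colour"))  # close to 0.0
```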
Properties

String metrics can be broadly classified as distance measures or similarity measures, each adhering to specific mathematical properties that may align with or deviate from the axioms of a general metric space. A distance metric d on a set of strings X satisfies four fundamental axioms: non-negativity, where d(x, y) \geq 0 for all x, y \in X; identity of indiscernibles, where d(x, y) = 0 if and only if x = y; symmetry, where d(x, y) = d(y, x) for all x, y \in X; and the triangle inequality, where d(x, z) \leq d(x, y) + d(y, z) for all x, y, z \in X.[6] These properties ensure that the distance function behaves intuitively, forming the basis for metric spaces in mathematics. Many string distance metrics, such as the Levenshtein edit distance, satisfy all four axioms, making them true metrics.[7] However, not all string metrics strictly adhere to these requirements; for instance, certain q-gram-based distances may yield zero for distinct strings like anagrams, violating the identity of indiscernibles.[8]

In contrast, string similarity metrics typically produce values in the range [0, 1], where 1 indicates identical strings and 0 represents maximum dissimilarity, providing a normalized measure of resemblance rather than separation. Common deviations from metric axioms in string metrics include asymmetry and failure of the triangle inequality in normalized edit distances, where scaling disrupts the additive property.[8][9] To enhance comparability across different metrics or string lengths, normalization techniques are often applied, scaling raw distance outputs to the interval [0, 1] by dividing by a factor such as the maximum possible distance or string length.[8] This process, while useful for practical applications, may introduce deviations from strict metric properties, such as the triangle inequality in some cases.[9]
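As a spot check rather than a proof, the sketch below implements a unit-cost Levenshtein distance, normalizes it by the longer string's length so the result lands in [0, 1], and verifies symmetry and the triangle inequality on a few sample strings; the helper names and samples are illustrative.

```python
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance via memoized recursion (adequate for short strings)."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,         # deletion
                   d(i, j - 1) + 1,         # insertion
                   d(i - 1, j - 1) + cost)  # substitution or match
    return d(len(a), len(b))

def normalized(a: str, b: str) -> float:
    """Scale the raw distance into [0, 1] by dividing by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

samples = ["kitten", "sitting", "kitchen"]
for x in samples:
    for y in samples:
        assert levenshtein(x, y) == levenshtein(y, x)  # symmetry
        for z in samples:
            # triangle inequality for the raw (unnormalized) distance
            assert levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z)

print(normalized("kitten", "sitting"))  # 3 / 7, roughly 0.43
```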
History and Development

Origins
The origins of string metrics trace back to early 20th-century efforts in linguistics and information processing, particularly through phonetic algorithms designed to handle variations in spoken and written names. The Soundex algorithm, developed for indexing names by their phonetic sound rather than exact spelling, was patented in 1918 by Robert C. Russell as a method to group similar-sounding surnames for efficient searching. This system assigned a four-character code (one letter followed by three digits) based on consonant sounds, aiming to match homophones like "Smith" and "Smyth" despite orthographic differences. Although initially manual, Soundex influenced computational approaches by providing a foundational model for phonetic matching in data systems.[10]

Early motivations for string metrics arose from practical challenges in census and library systems, where inconsistencies in name spellings and codes hindered accurate record matching and retrieval. In the United States, Soundex was adopted by the Census Bureau starting in the 1930s for indexing historical census data from 1880 to 1920, addressing errors from transcription, immigration, and dialectal variations. Similarly, libraries in the mid-20th century faced bibliographic duplication issues, prompting the need for similarity measures to merge variant catalog entries without losing references. These applications underscored the value of metrics that could quantify textual proximity beyond identical matches.[11]

In the 1950s and 1960s, string metrics gained formal footing in computing through information retrieval (IR) systems and error-correcting codes, as researchers sought to automate handling of noisy or erroneous text data. IR pioneers, building on electromechanical searching devices from the 1940s, integrated string similarity to tolerate misspellings in keyword queries, with early retrieval experiments beginning in the late 1950s and continuing into the 1960s at institutions such as MIT's Project MAC. Concurrently, error-correcting codes in communication theory inspired metrics for deletions, insertions, and substitutions in discrete sequences, treating strings as error-prone signals. A key contribution was Frederick J. Damerau's 1964 work on detecting and correcting spelling errors via adjacent transpositions and other operations, applied to bibliographic and text processing tasks. This paved the way for Vladimir Levenshtein's 1965 formulation of edit distance as a metric for binary codes resilient to such errors, marking a pivotal early advancement in computational string comparison.[12]

Key Milestones
In 1965, Soviet mathematician Vladimir Levenshtein introduced the concept of edit distance, a foundational string metric quantifying the minimum number of single-character edits required to transform one string into another, originally developed in the context of error-correcting codes for binary sequences.[12]

During the late 1980s and early 1990s, phonetic algorithms advanced significantly with the publication of Metaphone in 1990 by Lawrence Philips, which improved upon earlier methods like Soundex by providing a more accurate encoding of English pronunciations into phonetic keys for fuzzy string matching in name indexing and search applications.[13] In the same year, statistician William E. Winkler developed the Jaro-Winkler metric at the U.S. Census Bureau, extending the Jaro similarity measure to emphasize prefix matches, thereby enhancing accuracy in record linkage tasks for handling typographical errors in large-scale census datasets.[14]

From the 2000s onward, string metrics became integral to machine learning pipelines for tasks like entity resolution and clustering, with adaptations enabling scalable computation in big data environments; for instance, MapReduce frameworks in Hadoop facilitated parallel string similarity joins, allowing efficient processing of massive datasets that traditional algorithms could not handle.[15][16] This period also marked the evolution toward hybrid string metrics, which combine multiple approaches—such as edit distances with phonetic or semantic embeddings—to achieve higher precision in diverse applications, including post-2000 uses in genomics for sequence alignment.[17][18]

In the 2020s, string metrics have increasingly integrated deep learning techniques, such as transformer-based models like BERT, to incorporate semantic and contextual similarity beyond traditional character or token-based methods. These advancements, prominent as of 2025, enhance applications in natural language processing, AI-driven entity resolution, and plagiarism detection by capturing nuanced textual meanings.[19][20]

Applications
Information Retrieval and Search
String metrics play a crucial role in spell-checking and autocomplete features within search systems by enabling fuzzy matching to rank query suggestions despite user typos or variations. For instance, these metrics compute the similarity between a user's input and candidate terms in a dictionary or query log, prioritizing those with minimal differences to provide relevant corrections in real time.[21] The Levenshtein distance, a foundational edit-based metric, serves as a baseline for such error correction by quantifying the minimum operations needed to transform one string into another.[22]

In search indexes, string metrics facilitate deduplication by identifying near-duplicate content during web crawling, ensuring efficient storage and retrieval without redundant data. This process involves comparing document fingerprints or substrings to detect similarities above a threshold, such as using shingling techniques that break texts into overlapping n-grams for scalable comparison.[23] Google's web crawler, for example, employs such near-duplicate detection to filter out boilerplate or mirrored pages, improving index quality since the early 2000s.[23]

A prominent application is Google's "Did you mean?" feature, which has utilized variants of edit distance for query correction since its introduction in 2001 as one of the earliest machine learning applications in search.[24] This system generates suggestions by measuring syntactic similarities via generalized edit distances, enhanced with probabilistic term rewrites derived from query logs, outperforming basic models in reformulation accuracy.[25]

However, applying exact string metrics at scale poses significant challenges due to high computational costs, particularly for large corpora where pairwise comparisons become prohibitive. To address this, approximations like n-gram-based metrics are employed, which decompose strings into fixed-length substrings for faster similarity estimation via intersection or frequency comparisons, balancing accuracy and efficiency in production search engines.[26]
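A minimal sketch of the n-gram approximation idea: candidate terms are ranked by the Jaccard overlap of their character-bigram sets, avoiding a full edit-distance computation per candidate. The tiny dictionary and the '#' padding convention are illustrative assumptions, not a description of any production system.

```python
def char_ngrams(s: str, n: int = 2) -> set:
    """Set of overlapping character n-grams, padded so very short strings still produce grams."""
    padded = f"#{s}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of character n-gram sets: cheap to index, no character alignment needed."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

# Hypothetical mini-dictionary of valid terms; a real system would consult query logs or an index.
dictionary = ["metric", "matrix", "matrices", "meteor", "mention"]
query = "metrc"  # user typo

suggestions = sorted(dictionary, key=lambda w: ngram_jaccard(query, w), reverse=True)
print(suggestions[0])  # "metric" ranks highest under the bigram measure
```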
Bioinformatics and Genomics

In bioinformatics and genomics, string metrics play a crucial role in measuring sequence similarity by aligning DNA, RNA, or protein strings to identify mutations, insertions, deletions, and regions of homology that indicate evolutionary relationships or functional conservation.[27] The Levenshtein distance, a foundational edit-based metric, quantifies the minimum number of single-character edits required to transform one biological sequence into another, serving as a proxy for evolutionary divergence in tasks like detecting point mutations or homologous segments.[27] This approach enables the comparison of long genomic strings, where small differences can reveal significant biological insights, such as single nucleotide polymorphisms (SNPs) associated with disease.[27]

Tools like BLAST employ variants of edit-distance metrics for efficient approximate matching in large biological databases, using heuristic seeding with short exact matches followed by local alignment extensions to handle gaps and mismatches without exhaustive computation.[27] These adaptations reduce the time complexity from quadratic to near-linear for database searches, allowing rapid identification of similar sequences across genomes while accounting for biological variability.[27]

In phylogenetics, string dissimilarity measures derived from edit distances facilitate the construction of evolutionary trees by estimating the mutational distance between species' sequences, with tools applying Levenshtein distance to raw reads and contigs for reference-free tree inference.[28] By computing pairwise dissimilarities and aggregating them via methods like neighbor-joining, these metrics reconstruct branching patterns that reflect divergence times, as demonstrated in analyses of microbial and eukaryotic genomes.[28]

Adaptations of edit distances incorporate biological context through weighting schemes, where substitution costs reflect nucleotide transition/transversion biases or amino acid physicochemical properties, as implemented in the Needleman-Wunsch algorithm for global alignments.[29] For instance, purine-to-purine or pyrimidine-to-pyrimidine substitutions may incur lower costs than transversions, enhancing accuracy in modeling evolutionary substitutions over uniform edit penalties.[29] Token-based metrics, such as k-mer overlap, complement these by aiding gene annotation through rapid comparison of short sequence motifs.[27]
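The sketch below illustrates a weighted edit distance in this spirit: substitutions between nucleotides of the same class (transitions) cost less than transversions, and gaps carry a fixed penalty. The 0.5/1.0 costs and the gap penalty are illustrative assumptions, not values from a published scoring matrix.

```python
# Illustrative cost scheme (an assumption, not a standard scoring matrix):
# transitions (A<->G, C<->T) are cheaper than transversions, and gaps cost 1.
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def sub_cost(x: str, y: str) -> float:
    if x == y:
        return 0.0
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return 0.5 if same_class else 1.0  # transition vs. transversion

def weighted_edit_distance(a: str, b: str, gap: float = 1.0) -> float:
    """Global alignment cost in the spirit of Needleman-Wunsch, with weighted substitutions."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + gap,  # deletion (gap in b)
                           dp[i][j - 1] + gap,  # insertion (gap in a)
                           dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return dp[m][n]

print(weighted_edit_distance("GATTACA", "GACTATA"))  # 1.0: two T<->C transitions, no gaps
```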
Categories

Edit Distance-Based Metrics
Edit distance-based metrics measure the dissimilarity between two strings by determining the minimum number of single-character edit operations—typically insertions, deletions, and substitutions—needed to transform one string into the other, where each operation has a unit cost. This approach models string differences as a sequence of atomic changes at the character level, providing a quantifiable distance that reflects potential errors in data entry or transmission. The concept was originally developed in the context of error-correcting codes for binary sequences but has since been generalized to arbitrary strings.[12]

A key variant is the standard edit distance, also known as Levenshtein distance, which considers only insertions, deletions, and substitutions. An extension, the Damerau-Levenshtein distance, incorporates transpositions of adjacent characters as an additional operation, making it particularly suitable for capturing common typographical errors like swapped letters. This variant was motivated by the need to handle single-error corrections in spell-checking systems. The Levenshtein distance serves as the canonical example within this category.[30]

These metrics are typically computed using dynamic programming, which constructs a matrix to track the optimal edit path between the strings and achieves a time complexity of O(nm), where n and m are the lengths of the input strings; space complexity can similarly be O(nm) but is optimizable to O(min(n, m)) in some implementations. This algorithmic foundation ensures exact computation but scales quadratically with string length.[5]

The primary advantage of edit distance-based metrics lies in their intuitive alignment with real-world error models, such as keyboard mistypes or optical character recognition inaccuracies, enabling effective applications in spell correction and approximate matching. However, the O(nm) complexity renders them computationally expensive for long strings, often necessitating approximations or indexing techniques in large-scale scenarios.[30][5]
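A sketch of the restricted ("optimal string alignment") form of the Damerau-Levenshtein distance, which extends the standard dynamic program with a check for adjacent transpositions; the variable names and test strings are illustrative.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (the restricted, 'optimal string
    alignment' form of the Damerau-Levenshtein distance)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # adjacent transposition
    return dp[m][n]

print(osa_distance("form", "from"))       # 1: a single transposition of 'o' and 'r'
print(osa_distance("kitten", "sitting"))  # 3: same as plain Levenshtein here
```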
Token-Based Metrics

Token-based string metrics operate by first decomposing input strings into discrete units known as tokens, typically words, phrases, or subsequences like n-grams, and then representing each string as a bag (multiset) or set of these tokens. Similarity is subsequently assessed through set-theoretic operations, such as intersection and union, which quantify the overlap between the token collections without regard to their sequential arrangement in the original strings. This approach shifts focus from character-level manipulations to higher-level content comparison, making it particularly suitable for processing variable-length texts where exact positional matching is less critical.[5][31]

A foundational example is the Jaccard index, which measures similarity as the size of the intersection of two token sets divided by the size of their union. Formally, for token sets A and B, J(A, B) = \frac{|A \cap B|}{|A \cup B|} with values ranging from 0 (no overlap) to 1 (identical sets). Introduced by Paul Jaccard in 1901 to compare floral distributions in ecological samples, it has been adapted for string comparison by tokenizing strings via delimiters like spaces or punctuation. Another representative method involves cosine similarity applied to frequency vectors of n-grams, where strings are broken into overlapping subsequences of length n (e.g., bigrams for n=2), and similarity is computed as the cosine of the angle between these vectors, emphasizing weighted token overlaps in longer documents.[32][5][33]

These metrics excel in applications involving extended texts, such as plagiarism detection and document similarity assessment, where they efficiently capture shared content across large corpora by leveraging token overlap rather than exhaustive pairwise alignments. For instance, in plagiarism detection systems, tokenization followed by Jaccard or cosine measures on n-grams has demonstrated effectiveness on diverse datasets including natural language and source code files, as token sets naturally handle paraphrasing or minor reorderings.[31][33]

However, token-based metrics have notable limitations, including their disregard for token order, which can equate rearranged phrases with originals despite semantic differences, and their dependence on tokenization choices, such as handling punctuation or stemming, which may introduce variability or bias in results.[31][5]
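The following sketch computes both measures on short texts: Jaccard similarity over lowercased word tokens and cosine similarity over character-bigram frequency vectors. The tokenization rule (splitting on non-alphanumeric characters) is one of many possible design choices and is itself an assumption.

```python
import math
import re
from collections import Counter

def word_tokens(text: str) -> set:
    """Lowercased word tokens; the tokenization rule is a design choice."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = word_tokens(a), word_tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def bigram_counts(text: str) -> Counter:
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine_bigrams(a: str, b: str) -> float:
    ca, cb = bigram_counts(a), bigram_counts(b)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

print(jaccard("the quick brown fox", "the brown fox jumps"))  # 3 shared / 5 total = 0.6
print(cosine_bigrams("night", "nacht"))                       # 0.25: only the 'ht' bigram is shared
```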
Selected Examples

Levenshtein Distance
The Levenshtein distance, also known as the edit distance, is defined as the minimum number of single-character operations—insertions, deletions, or substitutions—required to transform one string into another.[12] This metric quantifies the similarity between two sequences by counting the smallest set of edits needed to make them identical, making it a foundational tool in string comparison.[34]

The distance is typically computed using a dynamic programming approach, which builds a matrix to track the optimal edit paths. Let s_1 and s_2 be the two strings of lengths m and n, respectively. A matrix dp of size (m+1) \times (n+1) is constructed, where dp[i][0] = i (cost of deleting i characters from s_1) and dp[0][j] = j (cost of inserting j characters into s_1). For each i = 1 to m and j = 1 to n, the recurrence relation is:

dp[i][j] = \min \begin{cases} dp[i-1][j] + 1 & \text{(delete)} \\ dp[i][j-1] + 1 & \text{(insert)} \\ dp[i-1][j-1] + \text{(1 if } s_1[i] \neq s_2[j] \text{ else 0)} & \text{(substitute or match)} \end{cases}

The value dp[m][n] gives the Levenshtein distance. This algorithm runs in O(mn) time and space, efficiently handling the alignment of characters across the strings.

To illustrate, consider transforming "kitten" into "sitting". The dynamic programming matrix is filled as follows, with each cell reflecting the minimum edits up to that prefix; a short code sketch after the table reproduces this computation:

|   | '' | s | i | t | t | i | n | g |
|---|---|---|---|---|---|---|---|---|
| '' | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| k | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| i | 2 | 2 | 1 | 2 | 3 | 4 | 5 | 6 |
| t | 3 | 3 | 2 | 1 | 2 | 3 | 4 | 5 |
| t | 4 | 4 | 3 | 2 | 1 | 2 | 3 | 4 |
| e | 5 | 5 | 4 | 3 | 2 | 2 | 3 | 4 |
| n | 6 | 6 | 5 | 4 | 3 | 3 | 2 | 3 |
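The bottom-right cell gives the distance, 3, corresponding to substituting k with s, substituting e with i, and inserting g. The following minimal sketch builds this matrix (assuming unit costs, as above) and prints it row by row; the row labels are for display only.

```python
def levenshtein_matrix(a: str, b: str):
    """Full (len(a)+1) x (len(b)+1) dynamic-programming matrix; the bottom-right
    entry is the Levenshtein distance."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete i characters of a
    for j in range(n + 1):
        dp[0][j] = j  # insert j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp

matrix = levenshtein_matrix("kitten", "sitting")
for row_label, row in zip(" kitten", matrix):  # ' ' labels the empty-prefix row
    print(row_label, *row)
print("distance:", matrix[-1][-1])  # 3
```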