
String metric

A string metric, also known as a string similarity metric or string distance function, is a mathematical measure that quantifies the degree of similarity or dissimilarity between two strings, typically by computing a numerical value where lower scores indicate greater resemblance despite variations in spelling, order, or length. These metrics are fundamental in computer science and information retrieval for handling approximate string matching, where exact equality is insufficient due to errors, abbreviations, or inconsistencies in textual data.

String metrics can be broadly categorized into character-based, token-based, and hybrid approaches, each suited to different aspects of string comparison. Character-based metrics, such as the Levenshtein distance, which counts the minimum number of single-character insertions, deletions, or substitutions needed to transform one string into another, focus on sequential alignments and are ideal for detecting typographical errors. Token-based metrics, like Jaccard similarity, treat strings as sets of words or n-grams and measure overlap, making them effective for handling reordered or abbreviated terms in longer texts. Hybrid methods, such as SoftTFIDF, combine token frequencies with edit distances to balance the two approaches in noisy datasets.

These metrics find wide application in data-cleaning and retrieval tasks, including duplicate detection, entity resolution, spell checking, and search systems, where they enable efficient similarity searches over large-scale textual corpora. For instance, in record linkage for deduplication or database integration, metrics like Jaro-Winkler have demonstrated high accuracy by emphasizing prefix matches common in names. Ongoing research addresses scalability challenges through filter-and-verify strategies and specialized indexing, ensuring their utility in modern large-scale data environments.

Fundamentals

Definition

A string metric is a function that quantifies the degree of similarity or dissimilarity between two strings, where strings are typically represented as finite sequences of characters, tokens, or symbols from a finite alphabet. These metrics enable approximate string matching by providing a numerical score that reflects how closely two strings resemble each other, accommodating variations that arise in real-world data.

String metrics can be broadly categorized into distance measures and similarity measures. Distance metrics, such as edit distance, compute a non-negative value indicating the minimum effort required to transform one string into another, where lower values denote greater similarity. In contrast, similarity metrics yield scores where higher values indicate closer resemblance, often derived by inverting or normalizing distance functions. This distinction allows flexibility in applications, as the choice depends on whether the focus is on measuring difference or measuring resemblance.

Exact string matching often fails in practical scenarios due to inconsistencies in data, such as typographical errors, phonetic variations, abbreviations, or alternative spellings of names and terms. String metrics address these challenges by providing a robust way to identify near-matches, which is particularly valuable in domains like record linkage and information retrieval, where precise equality is impractical. For instance, search engines leverage such metrics to return relevant results despite minor discrepancies in user queries.
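The relationship between a distance and a normalized similarity can be made concrete with a short sketch. The snippet below is illustrative only: it pairs a plain Levenshtein implementation with the common normalization 1 - d / max(|s_1|, |s_2|); the function names are arbitrary, not part of any library.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the distance so 1.0 means identical and 0.0 maximally different."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(edit_distance("colour", "color"))         # 1 (one deletion)
print(round(similarity("colour", "color"), 3))  # 0.833
```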

Properties

String metrics can be broadly classified as distance measures or similarity measures, each adhering to specific mathematical properties that may align with or deviate from the axioms of a general metric. A distance metric d on a set of strings X satisfies four axioms: non-negativity, where d(x, y) \geq 0 for all x, y \in X; identity of indiscernibles, where d(x, y) = 0 if and only if x = y; symmetry, where d(x, y) = d(y, x) for all x, y \in X; and the triangle inequality, where d(x, z) \leq d(x, y) + d(y, z) for all x, y, z \in X. These properties ensure that the distance function behaves intuitively, forming the basis for metric spaces in mathematics.

Many string distance metrics, such as the Levenshtein edit distance, satisfy all four axioms, making them true metrics. However, not all string metrics strictly adhere to these requirements; for instance, certain q-gram-based distances may yield zero for distinct strings like anagrams, violating the identity of indiscernibles. In contrast, string similarity metrics typically produce values in the range [0, 1], where 1 indicates identical strings and 0 represents maximum dissimilarity, providing a normalized measure of resemblance rather than separation.

Common deviations from metric axioms in string metrics include asymmetry and failure of the triangle inequality in normalized edit distances, where scaling disrupts the additive property. To enhance comparability across different metrics or string lengths, normalization techniques are often applied, scaling raw distance outputs to the interval [0, 1] by dividing by a factor such as the maximum possible distance or string length. This process, while useful for practical applications, may introduce deviations from strict metric properties, such as the triangle inequality in some cases.
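The axioms can be spot-checked empirically on a handful of strings. The sketch below does exactly that for a plain Levenshtein distance and a length-normalized variant; such a check can only falsify an axiom on the sampled triples, it does not prove the axioms in general, and the sample strings are arbitrary.

```python
from itertools import permutations

def lev(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def norm_lev(a: str, b: str) -> float:
    # Common normalization by the longer length; may lose strict metric properties.
    return lev(a, b) / max(len(a), len(b), 1)

samples = ["kitten", "sitting", "kitchen", "smith", "smyth", "smithe"]
for d in (lev, norm_lev):
    ok = all(d(x, x) == 0 for x in samples)                # d(x, x) = 0
    for x, y, z in permutations(samples, 3):
        ok &= d(x, y) > 0                                  # distinct strings give positive distance
        ok &= d(x, y) == d(y, x)                           # symmetry
        ok &= d(x, z) <= d(x, y) + d(y, z) + 1e-12         # triangle inequality
    print(d.__name__, "passes the spot-check on these samples:", ok)
```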

History and Development

Origins

The origins of string metrics trace back to early 20th-century efforts in name indexing and information processing, particularly through phonetic algorithms designed to handle variations in spoken and written names. The Soundex algorithm, developed for indexing names by their phonetic sound rather than exact spelling, was patented in 1918 by Robert C. Russell as a method to group similar-sounding surnames for efficient searching. This system assigned a four-character code (one letter followed by three digits) based on consonant sounds, aiming to match homophones like "Smith" and "Smyth" despite orthographic differences. Although initially manual, Soundex influenced computational approaches by providing a foundational model for phonetic matching in data systems.

Early motivations for string metrics arose from practical challenges in census and library systems, where inconsistencies in name spellings and codes hindered accurate record matching and retrieval. In the United States, Soundex was adopted by the Census Bureau starting in the 1930s for indexing historical data from 1880 to 1920, addressing errors arising from transcription and dialectal variations. Similarly, libraries in the mid-20th century faced bibliographic duplication issues, prompting the need for similarity measures to merge variant catalog entries without losing references. These applications underscored the value of metrics that could quantify textual proximity beyond identical matches.

In the 1950s and 1960s, string metrics gained formal footing in computing through information retrieval (IR) systems and error-correcting codes, as researchers sought to automate the handling of noisy or erroneous text data. IR pioneers, building on electromechanical searching devices from the 1940s, integrated string similarity to tolerate misspellings in keyword queries, with early experiments in the late 1950s evaluating systems such as those developed at MIT. Concurrently, error-correcting codes in information theory inspired metrics for deletions, insertions, and substitutions in discrete sequences, treating strings as error-prone signals. A key contribution was Frederick J. Damerau's 1964 work on detecting and correcting spelling errors via adjacent transpositions and other operations, applied to bibliographic and text processing tasks. This paved the way for Vladimir Levenshtein's 1965 formulation of edit distance as a metric for binary codes resilient to such errors, marking a pivotal early advancement in computational string comparison.

Key Milestones

In 1965, Soviet mathematician Vladimir Levenshtein introduced the concept of edit distance, a foundational string metric quantifying the minimum number of single-character edits required to transform one string into another, originally developed in the context of error-correcting codes for binary sequences. During the late 1980s and early 1990s, phonetic algorithms advanced significantly with the publication of Metaphone in 1990 by Lawrence Philips, which improved upon earlier methods like Soundex by providing a more accurate encoding of English pronunciations into phonetic keys for fuzzy string matching in name indexing and search applications. In the same year, statistician William E. Winkler developed the Jaro-Winkler metric at the U.S. Census Bureau, extending the Jaro similarity measure to emphasize prefix matches, thereby enhancing accuracy in record linkage tasks for handling typographical errors in large-scale census datasets.

From the 2000s onward, string metrics became integral to data-processing pipelines for tasks like entity resolution and clustering, with adaptations enabling scalable computation in distributed environments; for instance, MapReduce frameworks such as Hadoop facilitated parallel string similarity joins, allowing efficient processing of massive datasets that traditional algorithms could not handle. This period also marked the evolution toward hybrid string metrics, which combine multiple approaches, such as edit distances with phonetic or semantic embeddings, to achieve higher precision in diverse applications, including post-2000 uses in toponym matching.

In the 2020s, string metrics have increasingly integrated machine learning techniques, such as transformer-based models like BERT, to incorporate semantic and contextual similarity beyond traditional character- or token-based methods. These advancements, prominent as of 2025, enhance applications in AI-driven entity resolution and plagiarism detection by capturing nuanced textual meanings.

Applications

String metrics play a crucial role in spell-checking and query-suggestion features within search systems by enabling fuzzy matching to rank suggestions despite user typos or variations. For instance, these metrics compute the similarity between a user's input and candidate terms in a dictionary or query log, prioritizing those with minimal differences to provide relevant corrections in real time. The Levenshtein distance, a foundational edit-based metric, serves as a basis for such correction by quantifying the minimum number of operations needed to transform one string into another.

In search indexes, string metrics facilitate deduplication by identifying near-duplicate content during web crawling, ensuring efficient storage and retrieval without redundant data. This process involves comparing document fingerprints or substrings to detect similarities above a threshold, such as using shingling techniques that break texts into overlapping n-grams for scalable comparison. Google's crawling infrastructure, for example, employs such near-duplicate detection to filter out boilerplate or mirrored pages, improving index quality since the early 2000s.

A prominent application is Google's "Did you mean?" feature, which has utilized variants of edit distance for query correction since its introduction in 2001 as one of the earliest applications of machine learning in Google Search. This system generates suggestions by measuring syntactic similarities via generalized edit distances, enhanced with probabilistic term rewrites derived from query logs, outperforming basic models in reformulation accuracy.

However, applying exact string metrics at scale poses significant challenges due to high computational costs, particularly for large corpora where pairwise comparisons become prohibitive. To address this, approximations like n-gram-based metrics are employed, which decompose strings into fixed-length substrings for faster similarity estimation via intersection or frequency comparisons, balancing accuracy and efficiency in production search engines.
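The n-gram approximation mentioned above can be sketched in a few lines. The example below is a hypothetical illustration, not a production system: it pads strings, extracts character bigrams, and ranks a tiny made-up word list by Dice overlap so that only the best candidates would need an exact edit-distance check.

```python
def char_ngrams(s: str, n: int = 2) -> set:
    s = f"#{s}#"                         # pad so word boundaries form n-grams too
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

dictionary = ["received", "receipt", "recipe", "recent", "research"]  # toy word list
query = "recieved"                       # common transposition typo

grams_q = char_ngrams(query)
ranked = sorted(dictionary, key=lambda w: dice(grams_q, char_ngrams(w)), reverse=True)
print(ranked[:3])                        # ['received', 'recipe', 'recent']
```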

Bioinformatics and Genomics

In bioinformatics and genomics, string metrics play a crucial role in measuring sequence similarity by aligning DNA, RNA, or protein strings to identify mutations, insertions, deletions, and regions of homology that indicate evolutionary relationships or functional conservation. The Levenshtein distance, a foundational edit-based metric, quantifies the minimum number of single-character edits required to transform one biological sequence into another, serving as a proxy for evolutionary divergence in tasks like detecting point mutations or homologous segments. This approach enables the comparison of long genomic strings, where small differences can reveal significant biological insights, such as single nucleotide polymorphisms (SNPs) associated with disease.

Tools like BLAST employ variants of edit-distance metrics for efficient approximate matching in large biological databases, using seeding with short exact matches followed by local alignment extensions to handle gaps and mismatches without exhaustive computation. These adaptations reduce the computational cost from quadratic to near-linear for database searches, allowing rapid identification of similar sequences across genomes while accounting for biological variability.

In phylogenetics, string dissimilarity measures derived from edit distances facilitate the construction of evolutionary trees by estimating the mutational distance between species' sequences, with some tools applying edit-distance estimates directly to raw reads and contigs for reference-free tree inference. By computing pairwise dissimilarities and aggregating them via methods like neighbor-joining, these metrics reconstruct branching patterns that reflect divergence times, as demonstrated in analyses of microbial and eukaryotic genomes.

Adaptations of edit distances incorporate biological context through weighting schemes, where substitution costs reflect nucleotide transition/transversion biases or amino acid physicochemical properties, as implemented in the Needleman-Wunsch algorithm for global alignments. For instance, purine-to-purine or pyrimidine-to-pyrimidine substitutions may incur lower costs than transversions, enhancing accuracy in modeling evolutionary substitutions over uniform edit penalties. Token-based metrics, such as k-mer overlap, complement these by aiding gene annotation through rapid comparison of short sequence motifs.
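A weighted edit distance along these lines can be sketched as follows. The costs below (0.5 for a transition, 1.0 for a transversion or indel) are illustrative assumptions chosen for the example, not the scoring parameters of Needleman-Wunsch implementations or of any published tool.

```python
PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

def sub_cost(x: str, y: str) -> float:
    """0 for a match, 0.5 for a transition, 1.0 for a transversion (illustrative costs)."""
    if x == y:
        return 0.0
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return 0.5 if same_class else 1.0

def weighted_edit(a: str, b: str, indel: float = 1.0) -> float:
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + indel,                  # deletion
                            curr[j - 1] + indel,              # insertion
                            prev[j - 1] + sub_cost(ca, cb)))  # substitution or match
        prev = curr
    return prev[-1]

print(weighted_edit("GATTACA", "GACTATA"))  # 1.0: two pyrimidine transitions at 0.5 each
```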

Categories

Edit Distance-Based Metrics

Edit distance-based metrics measure the dissimilarity between two strings by determining the minimum number of single-character edit operations (typically insertions, deletions, and substitutions) needed to transform one string into the other, where each operation has a unit cost. This approach models string differences as a sequence of atomic changes at the character level, providing a quantifiable distance that reflects potential errors in data entry or transmission. The concept was originally developed in the context of error-correcting codes for binary sequences but has since been generalized to arbitrary strings.

A key variant is the standard edit distance, also known as the Levenshtein distance, which considers only insertions, deletions, and substitutions. An extension, the Damerau-Levenshtein distance, incorporates transpositions of adjacent characters as an additional operation, making it particularly suitable for capturing common typographical errors like swapped letters. This variant was motivated by the need to handle single-error corrections in spell-checking systems. The Levenshtein distance serves as the canonical example within this category.

These metrics are typically computed using dynamic programming, which constructs a matrix to track the optimal alignment between the strings and achieves a time complexity of O(nm), where n and m are the lengths of the input strings; space complexity can similarly be O(nm) but is optimizable to O(min(n, m)) in some implementations. This algorithmic foundation ensures exact computation but scales quadratically with string length.

The primary advantage of edit distance-based metrics lies in their intuitive alignment with real-world error models, such as keyboard mistypes or transcription inaccuracies, enabling effective applications in spell correction and approximate matching. However, the O(nm) complexity renders them computationally expensive for long strings, often necessitating approximations or indexing techniques in large-scale scenarios.
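A sketch of the transposition-aware variant is shown below. This is the restricted ("optimal string alignment") form of the Damerau-Levenshtein distance, which treats a swap of two adjacent characters as a single edit; the function name and test strings are chosen purely for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[m][n]

print(osa_distance("form", "from"))       # 1 (single transposition of 'o' and 'r')
print(osa_distance("kitten", "sitting"))  # 3 (same as plain Levenshtein here)
```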

Token-Based Metrics

Token-based string metrics operate by first decomposing input strings into discrete units known as tokens, typically words, phrases, or subsequences like n-grams, and then representing each string as a bag (multiset) or set of these tokens. Similarity is subsequently assessed through set-theoretic operations, such as intersection and union, which quantify the overlap between the token collections without regard to their sequential arrangement in the original strings. This approach shifts focus from character-level manipulations to higher-level content comparison, making it particularly suitable for processing variable-length texts where exact positional matching is less critical.

A foundational example is the Jaccard index, which measures similarity as the size of the intersection of two token sets divided by the size of their union. Formally, for token sets A and B, J(A, B) = \frac{|A \cap B|}{|A \cup B|}, with values ranging from 0 (no overlap) to 1 (identical sets). Introduced by Paul Jaccard in 1901 to compare floral distributions in ecological samples, it has been adapted for string comparison by tokenizing strings via delimiters like spaces or punctuation. Another representative method involves cosine similarity applied to frequency vectors of n-grams, where strings are broken into overlapping subsequences of length n (e.g., bigrams for n=2), and similarity is computed as the cosine of the angle between these vectors, emphasizing weighted token overlaps in longer documents.

These metrics excel in applications involving extended texts, such as plagiarism detection and document similarity assessment, where they efficiently capture shared content across large corpora by leveraging token overlap rather than exhaustive pairwise alignments. For instance, in plagiarism detection systems, tokenization followed by Jaccard or cosine measures on n-grams has demonstrated effectiveness on diverse document collections, as token sets naturally handle paraphrasing or minor reorderings. However, token-based metrics have notable limitations, including their disregard for token order, which can equate rearranged phrases with originals despite semantic differences, and their dependence on tokenization choices, such as the choice of delimiters or n-gram length, which may introduce variability or bias in results.
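Both measures described above fit in a few lines of code. The sketch below is illustrative rather than a reference implementation: it tokenizes by whitespace for the Jaccard index and uses character bigrams as the n-grams for the cosine measure.

```python
from collections import Counter
from math import sqrt

def jaccard(a: str, b: str) -> float:
    """Jaccard index over whitespace-delimited word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def cosine_bigrams(a: str, b: str) -> float:
    """Cosine similarity over character-bigram frequency vectors."""
    va, vb = Counter(zip(a, a[1:])), Counter(zip(b, b[1:]))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

x, y = "the quick brown fox", "the brown quick fox"  # same words, different order
print(jaccard(x, y))                                  # 1.0 -- token order is ignored
print(round(cosine_bigrams("night", "nacht"), 2))     # 0.25 -- only the "ht" bigram is shared
```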

Selected Examples

The Levenshtein distance, also known as the edit distance, is defined as the minimum number of single-character operations (insertions, deletions, or substitutions) required to transform one string into another. This quantifies the similarity between two sequences by counting the smallest set of edits needed to make them identical, making it a foundational tool in string comparison.

The distance is typically computed using a dynamic programming approach, which builds a matrix to track the optimal edit paths. Let s_1 and s_2 be the two strings of lengths m and n, respectively. A matrix dp of size (m+1) \times (n+1) is constructed, where dp[i][0] = i (the cost of deleting i characters from s_1) and dp[0][j] = j (the cost of inserting j characters into s_1). For each i = 1 to m and j = 1 to n, the recurrence relation is:

dp[i][j] = \min \begin{cases} dp[i-1][j] + 1 & \text{(delete)} \\ dp[i][j-1] + 1 & \text{(insert)} \\ dp[i-1][j-1] + (1 \text{ if } s_1[i] \neq s_2[j] \text{ else } 0) & \text{(substitute or match)} \end{cases}

The value dp[m][n] gives the Levenshtein distance. This algorithm runs in O(mn) time and space, efficiently handling the alignment of characters across the strings. To illustrate, consider transforming "kitten" into "sitting". The dynamic programming matrix is filled as follows, with each cell reflecting the minimum edits up to that prefix:
         ''  s  i  t  t  i  n  g
    ''    0  1  2  3  4  5  6  7
    k     1  1  2  3  4  5  6  7
    i     2  2  1  2  3  4  5  6
    t     3  3  2  1  2  3  4  5
    t     4  4  3  2  1  2  3  4
    e     5  5  4  3  2  2  3  4
    n     6  6  5  4  3  3  2  3
The bottom-right cell value of 3 indicates the minimum edits: substitute 'k' with 's', substitute 'e' with 'i', and insert 'g'. Variants of the Levenshtein distance extend the unit cost model (where each operation costs 1) to weighted versions, assigning different costs to insertions, deletions, and substitutions based on context or domain-specific needs, such as in error-correcting codes. For instance, in weighted edit distance, the recurrence incorporates cost matrices for the operations, allowing penalties to vary by character or position to better model real-world transformations.
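The matrix above can be reproduced with a direct implementation of the recurrence. The sketch below is illustrative; its ins, dele, and sub parameters hint at the weighted variant (leaving them at 1 recovers the unit-cost distance).

```python
def edit_matrix(s1: str, s2: str, ins: int = 1, dele: int = 1, sub: int = 1):
    """Full dynamic-programming matrix; unit costs reproduce the table above."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i * dele
    for j in range(n + 1):
        dp[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + dele,   # delete s1[i-1]
                           dp[i][j - 1] + ins,    # insert s2[j-1]
                           dp[i - 1][j - 1] + (sub if s1[i - 1] != s2[j - 1] else 0))
    return dp

matrix = edit_matrix("kitten", "sitting")
for row in matrix:
    print(row)
print("distance:", matrix[-1][-1])   # distance: 3
```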

Jaro-Winkler Distance

The Jaro-Winkler distance is a string metric designed for measuring similarity between two strings, particularly effective in record linkage tasks where typographical errors and variations in personal names are common. It builds upon the Jaro similarity by incorporating a penalty for transpositions and an additional boost for matching prefixes, making it suitable for applications like name matching in databases. Developed by William E. Winkler as an enhancement to the Jaro metric, it assigns higher similarity scores to strings that agree on initial characters, reflecting the observation that prefixes are more discriminative in names and addresses.

The Jaro component of the metric first computes the base similarity s as follows: s = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) where m is the number of matching characters between strings s_1 and s_2 (with |s_1| \leq |s_2|), and t is the number of transpositions, computed as half the number of matching characters that appear in a different order in the two strings. Matching characters are those within a certain distance window, typically \lfloor |s_2|/2 \rfloor - 1, to account for minor shifts. If m = 0, then s = 0. The Winkler adjustment then scales this similarity by favoring common prefixes up to length 4: d = s + l \cdot p \cdot (1 - s) where p = 0.1 is the prefix scaling factor and l is the length of the common prefix (capped at 4). The distance is then 1 - d, with higher similarity d indicating lower distance. This formulation was introduced to improve agreement weights in probabilistic record linkage models.

To illustrate the computation, consider the strings "jones" and "johnson". The matching characters are j, o, n, s (m = 4), with no transpositions (t = 0). The lengths are |s_1| = 5 and |s_2| = 7. The base Jaro similarity is s = \frac{1}{3} \left( \frac{4}{5} + \frac{4}{7} + \frac{4 - 0}{4} \right) = \frac{1}{3} (0.8 + 0.5714 + 1) \approx 0.7905. The common prefix length is l = 2 ("jo"), so the adjusted similarity is d = 0.7905 + (0.1 \cdot 2 \cdot (1 - 0.7905)) \approx 0.7905 + 0.0419 = 0.8324, yielding a distance of approximately 0.1676. This example demonstrates how the prefix boost elevates the score for name-like strings sharing initial segments.

Compared to simpler edit distances like Levenshtein, the Jaro-Winkler metric offers advantages in handling transpositions (e.g., "martha" vs. "marhta" has a Jaro similarity of 0.9444, rising to 0.9611 with the prefix adjustment) and prefix matches, which are prevalent in noisy name data from censuses or administrative records. It achieved notable improvements in matching accuracy during the 1990 U.S. Census Post-Enumeration Survey, where standard edit distances underperformed on partial agreements and typographical variations. The metric's sensitivity to order and position makes it particularly effective for short strings like surnames, outperforming unweighted insertion/deletion penalties in empirical linkage tasks.
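A compact implementation is sketched below under the definitions above; it reproduces the "jones"/"johnson" worked example and the "martha"/"marhta" comparison. The function names are illustrative, and edge cases (very short or empty strings) are handled only minimally.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):                      # find matching characters within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                     # count out-of-order matches, then halve
    for i, matched in enumerate(m1):
        if matched:
            while not m2[k]:
                k += 1
            t += s1[i] != s2[k]
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    s = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):                # common prefix length, capped at 4
        if a != b:
            break
        l += 1
    return s + l * p * (1 - s)

print(round(jaro("jones", "johnson"), 4))           # 0.7905
print(round(jaro_winkler("jones", "johnson"), 4))   # 0.8324
print(round(jaro_winkler("martha", "marhta"), 4))   # 0.9611
```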

Comparison and Evaluation

Performance Metrics

Performance metrics for string metrics encompass both accuracy in similarity assessment tasks and computational efficiency, enabling systematic evaluation across diverse applications such as record linkage and ontology alignment. Accuracy is typically gauged using standard measures adapted to pairwise matching or clustering scenarios, including precision (the proportion of predicted matches that are correct), recall (the proportion of actual matches identified), and the F1-score (the harmonic mean of precision and recall). These metrics are computed against gold-standard reference alignments, where higher F1-scores indicate balanced performance in detecting true similarities while minimizing false positives.

Efficiency evaluations focus on time and space complexity, which are critical for scalability in large datasets. Edit distance-based metrics, such as Levenshtein distance, exhibit quadratic time complexity O(mn) and space complexity O(mn) due to dynamic programming, where m and n are string lengths, making them computationally intensive for long strings. In contrast, some token-based or approximation methods achieve linear time complexity O(n), such as certain n-gram overlaps or hashed approximations, reducing the burden in high-volume comparisons. Empirical optimizations, like bit-parallel algorithms for the longest common subsequence (a component in some metrics), can lower practical runtime while preserving accuracy.

Standard datasets for benchmarking string metrics include those from the Ontology Alignment Evaluation Initiative (OAEI), such as the benchmark tracks (#204, #301–#304) and the conference dataset comprising 16 real-world ontologies with reference alignments for entity name matching. Additionally, record linkage datasets, like those used in U.S. Census Bureau evaluations, provide labeled pairs of personal names for testing name-matching robustness against variations in spelling and formatting. These datasets facilitate reproducible comparisons by simulating real-world noise, such as abbreviations and transliterations.

Empirical studies highlight performance variations across metrics; for instance, in name-matching tasks on datasets such as census records, Jaro-Winkler distance achieves higher F1-scores (e.g., up to 10% improvement in maximum F1) compared to Levenshtein distance, particularly for short strings with prefix similarities, while being approximately 10 times faster due to its focus on transpositions and common characters. In OAEI benchmarks, Jaro-Winkler and hybrid variants like SoftTFIDF yield superior F-measures (around 0.7–0.8) over pure edit distances in ontology alignment, emphasizing the trade-off between computational cost and accuracy in domain-specific contexts.
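The accuracy measures above reduce to simple set arithmetic over predicted and gold-standard match pairs. The toy example below is purely illustrative; the record pairs are made up.

```python
def precision_recall_f1(predicted: set, gold: set):
    tp = len(predicted & gold)                       # true positives: correctly predicted pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("Smith, J.", "John Smith"),
        ("Smyth, A.", "Anne Smyth"),
        ("Jones, B.", "Bob Jones")}
predicted = {("Smith, J.", "John Smith"),
             ("Smyth, A.", "Anne Smyth"),
             ("Jones, B.", "Bill Jones")}            # one false positive, one missed match

p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))        # 0.667 0.667 0.667
```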

Choice Criteria

Selecting an appropriate string metric is crucial for achieving effective similarity assessment in various applications, as the choice directly impacts the relevance and efficiency of results. Factors such as the nature of the data, prevalent error types, and available computational resources guide this selection. For instance, short strings like names or codes benefit from character-level metrics, while longer texts such as documents require set-oriented approaches.

Key factors influencing the choice include the length and structure of the strings, error patterns, and computational budget. For short strings, edit-based metrics excel due to their focus on sequential character differences, whereas token-based metrics are preferable for long strings where word or phrase overlap matters more. Error types also play a role: edit distances effectively capture typos through operations like insertions or substitutions, but token-based methods better handle abbreviations or reordered elements by treating strings as unordered sets. Computational budget is another consideration, as edit-based computations often scale quadratically with string length, making them resource-intensive for large datasets, while token-based alternatives operate more linearly.

Guidelines for metric selection emphasize matching the approach to the underlying string structure. Edit-based metrics are ideal for sequential data where order and minimal changes are key, such as in spell-checking or genome alignment. Token-based metrics suit set-like comparisons, ignoring order to assess content overlap in tasks like document retrieval. Phonetic metrics, which encode strings by pronunciation, are recommended for scenarios involving sound-alike errors, such as name matching across dialects.

Trade-offs between accuracy and speed must be balanced, particularly in resource-constrained environments. Exact edit distances provide high accuracy but at a high computational cost, often necessitating approximations or filtering techniques for applications like search engines. In contrast, simpler token-based metrics offer faster processing with potentially reduced sensitivity to fine-grained errors, making them suitable when speed is prioritized over exhaustive detail.

Hybrid approaches address limitations of individual metrics by combining them, often through weighted sums or learned ensembles, to leverage complementary strengths. For example, integrating edit-based and token-based methods via soft matching improves robustness to both character errors and structural variations, as demonstrated in name-matching tasks. These combinations allow tailored accuracy-speed profiles, though they require careful weighting to avoid overcomplicating computation.
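As a concrete illustration of a weighted-sum hybrid, the sketch below mixes a character-level similarity (Python's difflib ratio, standing in here for a normalized edit metric) with a word-level Jaccard similarity. The 0.6/0.4 weights and the example strings are arbitrary assumptions; in practice such weights would be tuned or learned on labeled data.

```python
from difflib import SequenceMatcher

def jaccard_words(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def hybrid_similarity(a: str, b: str, w_char: float = 0.6, w_token: float = 0.4) -> float:
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()  # character-level signal
    return w_char * char_sim + w_token * jaccard_words(a, b)        # plus token-level signal

print(round(hybrid_similarity("J. Smith & Sons Ltd", "Smith and Sons Limited"), 3))
print(round(hybrid_similarity("Acme Corp", "ACME Corporation"), 3))
```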

References

  1. [1]
    String metrics – Knowledge and References - Taylor & Francis
    A string metric is a measurement of dissimilarity between two sequences or words, commonly used in information theory and computer science.
  2. [2]
    [PDF] A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
    A metric that is used for measuring the distance between the text strings is called String metric and used for string matching and comparison. This survey ...
  3. [3]
    [PDF] A Comparison of String Distance Metrics for Name-Matching Tasks
    These methods have been used to match individuals and/or families between samples and cen- suses, e.g., in evaluation of the coverage of the U.S. decen- nial ...
  4. [4]
    [PDF] String Similarity Search and Join : A Survey - Database Group
    These similarity functions measure the similarity of two strings based on the common tokens shared by their corresponding token sets. Consider two strings r ...
  5. [5]
    [PDF] A Comparison of String Distance Metrics for Name-Matching Tasks
    Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names.
  6. [6]
    [PDF] Metric Spaces - UC Davis Math
    A metric space is a set with a distance function between every pair of points, satisfying non-negative, zero-if-equal, symmetric, and triangle inequality ...
  7. [7]
    The String-to-String Correction Problem | Journal of the ACM
    The string-to-string correction problem is to determine the distance between two strings as measured by the minimum cost sequence of “edit operations”
  8. [8]
    [PDF] stringdist: Approximate String Matching, Fuzzy Text Search, and ...
    Jan 10, 2025 · From a mathematical point of view, string metrics often do not obey the demands that are usually required from a distance function. For example, ...
  9. [9]
    [PDF] The Normalized Edit Distance with Uniform Operation Costs is a Metric
    Apr 23, 2022 · In this paper we prove that ned, the original edit normalization approach proposed by Vidal and Marzal [8] does satisfy the triangle inequality ...
  10. [10]
    US1261167A - Index. - Google Patents
    US1261167A. United States Patent. Inventor: Robert C. Russell.
  11. [11]
    Soundex System | National Archives
    Jan 9, 2024 · The Soundex system is a coded surname index based on sound, not spelling. Codes use the first letter and numbers based on sound, like W-252.
  12. [12]
    [PDF] Binary codes capable of correcting deletions, insertions, and reversals
    845-848, August, 1965. Original article submitted January 2, 1965. Investigations of transmission of binary infor- mation usually consider a channel model in ...
  13. [13]
    metaphone
    Definition: An algorithm to code English words phonetically by reducing them to 16 consonant sounds. ... Lawrence Philips. Hanging on the Metaphone. Computer ...
  14. [14]
    [PDF] DOCUMENT RESUME ED 325 505 TM 015 738 AUTHOR Winkler ...
    AUTHOR. Winkler, William E. TITLE. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. PUB DATE.
  15. [15]
    [PDF] Efficient Parallel Set-Similarity Joins Using MapReduce
    In this paper we study how to efficiently perform set-simi- larity joins in parallel using the popular MapReduce frame- work. We propose a 3-stage approach ...
  16. [16]
    [PDF] A Comparison of String Metrics for Matching Names and Records
    These methods have been used to match individuals and/or fami- lies between samples and censuses, e.g., in evaluation of the coverage of the U.S. decennial ...
  17. [17]
    Full article: Learning to combine multiple string similarity metrics for ...
    In this article, we present the results of a wide-ranging evaluation on the performance of different string similarity metrics over the toponym matching task.
  18. [18]
    Combining string and phonetic similarity matching to identify ...
    Nov 12, 2019 · In this paper, we presented a hybrid similarity approach that efficiently performs a joint string and language-dependent phonetic similarity ...
  19. [19]
    [PDF] Online Spelling Correction for Query Completion - Microsoft
    R@N and P@N are metrics for measuring offline spelling correction. We use these metrics to evaluate our system in the exact match mode. Next, we introduce ...
  20. [20]
    [PDF] Learning a Spelling Error Model from Search Query Logs
    This paper investigates using the Expec- tation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a ...
  21. [21]
    [PDF] Detecting Near-Duplicates for Web Crawling - Google Research
    So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the ...
  22. [22]
    Google Search: A timeline of the 25 biggest moments - The Keyword
    Sep 6, 2023 · 2001: “Did you mean?” "Did you mean," with suggested spelling corrections, was one of our first applications of machine learning. Previously, if ...
  23. [23]
    Summary of Google's Use of Edit Distance for Query Correction
  24. [24]
    [PDF] N-Gram Similarity and Distance - Department of Computing Science
    We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their ...
  25. [25]
    Levenshtein Distance, Sequence Comparison and Biological ...
    Levenshtein edit distance has played a central role—both past and present—in sequence alignment in particular and biological database similarity search in ...
  26. [26]
    Reference-free phylogeny from sequencing data - PMC - NIH
    The tool provides an estimation of the Levenshtein distance between the sequences, which in turn estimates the number of mutations between the organisms.
  27. [27]
    A general method applicable to the search for similarities in the ...
    A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed.
  28. [28]
    A technique for computer detection and correction of spelling errors
    A technique for computer detection and correction of spelling errors. Author: Fred J. Damerau. Published: 01 March 1964.
  29. [29]
    [PDF] Combining Approximate String Matching Algorithms and Term ...
    Token-based measures use white space, line breaks, or punctuation characters to break a string into words and symbols (called tokens), and calculate the ...
  30. [30]
    [PDF] Further Generalizations of the Jaccard Index - HAL
    Oct 10, 2021 · We start by providing a brief historic perspective of the Jaccard index, which was introduced in 1901 by Paul Jac- card (1868–1944) under the ...
  31. [31]
    The N-Grams Based Text Similarity Detection Approach Using Self ...
    In this paper, four specific measures have been used to evaluate texts similarity: cosine, dice, extended Jaccard's, and overlap coefficient. There is ...
  32. [32]
    Levenshtein Distance - an overview | ScienceDirect Topics
    Levenshtein distance is defined as the minimum number of insertions, deletions, or substitutions required to transform one string into another. It serves as a ...
  33. [33]
    [PDF] Evaluating String Comparator Performance for Record Linkage
    Jun 9, 2005 · Abstract. We compare variations of string comparators based on the Jaro-Winkler comparator and edit distance comparator.
  34. [34]
    [PDF] A Comparative Evaluation of String Similarity Metrics for Ontology ...
    Feb 10, 2015 · In this paper we first analyze naming variations in competing ontologies, then we evaluate a wide range of string similarity metrics, from the ...
  35. [35]
    [PDF] Employing Trainable String Similarity Metrics for Information ...
    The most widely used similarity metric for short strings is Levenshtein distance, defined as the minimum number of in- sertions, deletions or substitutions ...
  36. [36]
    Measurement of Text Similarity: A Survey - MDPI
    Aug 31, 2020 · This paper systematically combs the research status of similarity measurement, analyzes the advantages and disadvantages of current methods,
  37. [37]
    String similarity search and join: a survey | Frontiers of Computer ...
    Nov 24, 2015 · In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce ...