
Lexical similarity

Lexical similarity is a quantitative measure in linguistics that assesses the degree to which the vocabularies of two languages, dialects, or texts overlap, typically expressed as a percentage of shared words that are similar in both form and meaning. This metric is foundational to lexicostatistics, a subfield of historical and comparative linguistics, where it helps gauge genetic relatedness and divergence time between languages without relying on written records. The method originated with Morris Swadesh in the mid-20th century as part of glottochronology, which posits that core vocabulary (basic terms for body parts, numbers, and natural phenomena) changes at a relatively constant rate across languages, akin to radioactive decay.

To compute lexical similarity, researchers compare standardized wordlists, often the Swadesh 100- or 200-item lists, which include culturally neutral and universal concepts like "hand," "water," or "eat." Words are deemed similar if they exhibit phonological or morphological resemblance indicating cognacy (shared ancestry), with scores ranging from 0% (no overlap) to 100% (identical vocabularies); for example, French and Italian show about 89% similarity on such lists. Ethnologue, a comprehensive catalog of world languages, employs this approach using regionally adapted 200-word lists to classify over 7,000 languages and dialects as of the 2025 edition.

Beyond classification, lexical similarity informs other areas of study, including language acquisition; for instance, high similarity (above 85%) often indicates dialects with potential for partial comprehension between speakers. Its limitations include sensitivity to borrowing, difficulties in applying it to sign languages, and the exclusion of semantic and syntactic factors, which have prompted refinements such as automated similarity algorithms in modern computational linguistics. Despite these limitations, it remains a key tool for mapping language families and supporting endangered language documentation.

Fundamentals

Definition

Lexical similarity in linguistics refers to the degree to which the vocabularies of two or more languages overlap, typically measured by the proportion of shared words that are similar in both form and meaning. This overlap arises primarily from cognates (words inherited from a common ancestral language) or from loanwords introduced through language contact. Cognates reflect genetic relatedness, stemming from shared historical origins, whereas loanwords indicate contact-induced similarity due to cultural or trade interactions between speakers.

To minimize the influence of borrowing and cultural diffusion, assessments of lexical similarity emphasize basic vocabulary, which includes terms for universal concepts such as body parts (e.g., hand, eye), numbers (e.g., one, two), and natural phenomena (e.g., water, sun). These items are selected for their stability across languages and low susceptibility to replacement, allowing more reliable comparisons of underlying genetic ties. A prominent example is the Swadesh 100-word list, developed by Morris Swadesh in the mid-20th century, which compiles such core items to support standardized evaluations of vocabulary retention and divergence. The list focuses on culturally neutral terms that evolve slowly, providing a benchmark for distinguishing inherited lexicon from borrowed elements.

Lexical similarity differs from other forms of linguistic resemblance, such as phonological similarity (which examines parallels in sound inventories and pronunciation patterns) and syntactic similarity (which assesses commonalities in grammatical structures and sentence formation). While these aspects may correlate in closely related languages, lexical similarity isolates vocabulary as the primary indicator of shared ancestry or contact influence.

Historical Development

The study of lexical similarity traces its roots to 19th-century comparative linguistics, where scholars used vocabulary comparisons to establish genetic relationships among languages. August Schleicher, a prominent philologist, pioneered the use of such comparisons in constructing family tree models for the Indo-European languages, notably in his 1853 work "Die ersten Spaltungen des indogermanischen Urvolkes," which visualized the divergence of Indo-European languages through shared lexical elements. This approach marked a foundational shift toward systematic reconstruction of proto-languages based on vocabulary, influencing subsequent historical linguistics.

In the mid-20th century, lexical similarity studies advanced through the development of lexicostatistics, a quantitative method introduced by Morris Swadesh in the 1950s. Swadesh proposed using standardized lists of basic vocabulary, later known as Swadesh lists, to measure lexical retention rates and estimate divergence times between languages, assuming a relatively constant rate of vocabulary replacement. His seminal paper outlined this framework for dating prehistoric contacts, building on earlier comparative traditions while introducing empirical metrics to historical linguistics. This innovation enabled broader applications in linguistic classification, though it sparked immediate controversy.

The 1950s and 1960s saw intense debates surrounding glottochronology, an extension of lexicostatistics that applied lexical similarity to date language splits and was often criticized for oversimplifying historical processes such as borrowing and irregular change. Critics, including Henry Hoenigswald, highlighted methodological flaws, yet the era solidified lexical comparison as a core tool in historical linguistics.

By the 2000s, the field had moved toward computational approaches, shifting from manual cognate identification to automated detection enabled by digital databases. The Indo-European Lexical Cognacy Database (IELex), developed by Michael Dunn and colleagues in 2011, exemplifies this shift, providing a structured repository of over 200 semantic concepts across a large sample of Indo-European languages to support phylogenetic analyses and machine learning-based clustering. This marked a pivotal integration of computational methods into lexical similarity research, improving scalability and precision in reconstructing language histories. Subsequent projects, such as Lexibank (released in 2022 and updated as of 2025), have advanced this further by standardizing global lexical datasets for cross-linguistic computational analysis.

Methodologies

Cognate Identification

Cognates are words in different languages that descend from the same ancestral form in a proto-language, serving as key evidence for establishing genetic relationships between languages. They show regular sound correspondences across descendant languages because they derive from a shared proto-form. Loanwords, which result from borrowing when words are adopted from one language into another, are distinct from cognates and do not indicate shared ancestry through a proto-language. Identification relies on specific criteria, including phonological matching, where sounds correspond systematically across languages, and semantic stability, where the words retain core meanings despite minor variations.

The primary manual approach to cognate identification is the comparative method, which systematically compares words across languages to detect potential matches, establishes recurrent sound correspondences, and reconstructs hypothetical proto-forms through etymological analysis. This process involves applying established sound laws to verify relationships; for instance, Grimm's law accounts for systematic consonant shifts in the Germanic languages relative to other Indo-European branches, such as the change from Proto-Indo-European *p to Germanic *f, as seen in the cognate set linking Latin pater to English father. Experts iteratively refine these reconstructions by cross-referencing multiple languages to confirm regularity and exclude chance resemblances.

Challenges in cognate identification include distinguishing true cognates from false cognates, which are superficially similar words lacking a shared etymological origin, such as English bad (meaning poor quality) and German Bad (meaning bath); such pairs can lead to erroneous assumptions of relatedness. Semantic shift further complicates the process, as meanings can diverge over time even among genuine cognates, requiring careful evaluation of historical context to avoid misclassification. Cognate identification often prioritizes basic vocabulary, such as terms for body parts or numerals, to reduce the influence of borrowing.

Tools and resources for accurate identification emphasize expert judgment, particularly through specialized etymological dictionaries that compile reconstructed roots and cognate sets based on rigorous comparative analysis. Julius Pokorny's Indogermanisches etymologisches Wörterbuch (1959) exemplifies this, providing detailed entries on Proto-Indo-European roots and their reflexes across descendant languages, enabling linguists to trace and validate cognates systematically. Such resources support the integration of phonological, morphological, and semantic evidence, though they require ongoing updates to incorporate new archaeological and linguistic findings.
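The idea of recurrent sound correspondences can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions: it compares word-initial letters in ordinary spelling rather than aligned phonemes, the word pairs are hand-picked textbook examples, and the function names (`initial_correspondences`, `recurrent`) are hypothetical. Real comparative work aligns whole words and weighs semantic and historical evidence.

```python
from collections import Counter

# Toy Latin/English pairs in plain orthography (an assumption of this sketch).
# The first five are standard textbook cognates; habere/have is a well-known
# look-alike pair that is NOT cognate.
PAIRS = [
    ("pater", "father"),
    ("pes", "foot"),
    ("piscis", "fish"),
    ("duo", "two"),
    ("decem", "ten"),
    ("habere", "have"),
]

def initial_correspondences(pairs):
    """Count how often each pair of word-initial letters co-occurs."""
    return Counter((a[0], b[0]) for a, b in pairs)

def recurrent(counts, min_count=2):
    """Keep only correspondences attested at least `min_count` times."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

counts = initial_correspondences(PAIRS)
regular = recurrent(counts)
print(regular)
```

In this toy data the p : f and d : t correspondences recur, while the single h : h match from the look-alike pair is filtered out, which mirrors why one-off resemblances carry little weight in the comparative method.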

Quantitative Measures

Quantitative measures of lexical similarity typically involve calculating the proportion of shared cognates between languages using standardized word lists, providing a numerical index of relatedness. The most straightforward approach is the percentage of shared cognates, defined as similarity = (number of cognates / total words in the list) × 100, where cognates are identified for corresponding meanings across languages. This metric aggregates cognate judgments into a single value between 0 and 100, with higher percentages indicating greater lexical overlap due to common ancestry.

Lexicostatistics formalizes this process using Swadesh lists, which consist of 100 or 200 basic vocabulary items selected for their supposed stability across languages, such as body parts and natural phenomena. In this framework, cognate density (CD) is computed as CD = C / N, where C is the number of shared cognates and N is the total size of the list; it is often expressed as a percentage to assess retention rates over time. Swadesh lists enable consistent comparisons by focusing on core vocabulary less prone to borrowing, though the choice of list size affects precision, with larger lists (e.g., 200 words) reducing variance in density estimates.

Advanced techniques extend these basics by incorporating automated string similarity and probabilistic inference. For instance, normalized Levenshtein distance measures orthographic or phonetic divergence between word forms, scaled by word length to range from 0 (identical) to 1 (completely dissimilar), aiding automated cognate detection in large datasets. Bayesian models, in contrast, treat cognate presence as a binary character evolving under phylogenetic substitution rates, estimating the probability of relatedness by integrating lexical data with tree priors to account for evolutionary divergence and potential borrowing.

Thresholds in these measures provide interpretive benchmarks for relatedness, adjusted for factors like list size and borrowing rates. Common guidelines suggest more than 85% shared cognates for dialects or very close varieties and 60-80% for closely related languages or branches, while less than 30% indicates distant or unrelated languages. Smaller lists require a higher minimum percentage for a result to be significant, and some models subtract estimated borrowed items before computing density.
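The two basic measures described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the cognate-class labels and the toy word lists are hypothetical, and dividing the edit distance by the length of the longer word is one common normalization among several.

```python
def percent_shared_cognates(classes_a, classes_b):
    """similarity = (number of cognates / meanings compared) * 100.

    Each argument maps a meaning to a cognate-class label assigned by an
    analyst; only meanings present in both lists are compared.
    """
    shared = [m for m in classes_a if m in classes_b]
    if not shared:
        raise ValueError("the two lists have no meanings in common")
    cognate = sum(1 for m in shared if classes_a[m] == classes_b[m])
    return 100.0 * cognate / len(shared)

def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(s, t):
    """Edit distance divided by the longer length: 0 identical, 1 disjoint."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0

# Hypothetical cognate-class codings for four meanings in two varieties.
variety_a = {"hand": "c1", "water": "c2", "eat": "c3", "sun": "c4"}
variety_b = {"hand": "c1", "water": "c2", "eat": "c9", "sun": "c4"}

pct = percent_shared_cognates(variety_a, variety_b)
dist = normalized_levenshtein("ruka", "ręka")
print(pct, dist)
```

Here three of four meanings share a cognate class, and the two forms for "hand" differ in one of four letters. The string measure works on spelling only, so it can flag candidates for cognacy but cannot by itself separate inheritance from borrowing or chance.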

Language Family Examples

Indo-European Languages

The Indo-European language family exemplifies lexical similarity through its diverse branches, all descending from a reconstructed Proto-Indo-European spoken around 6,000–8,000 years ago. High degrees of shared vocabulary persist within sub-branches, particularly in the core or basic lexicon, where cognates (words inherited from the common ancestor) predominate. For instance, the Romance languages derived from Latin show substantial overlap; French and Italian exhibit approximately 89% lexical similarity in standardized word lists, reflecting conserved Latin roots in everyday terms like frère (French) and fratello (Italian) for "brother." This similarity underscores the family's genetic unity, with quantitative measures revealing gradients of divergence over millennia.

Specific comparisons within branches highlight varying cognate retention rates in basic vocabulary, often assessed via Swadesh-style lists of 100–200 universal concepts. In the Germanic branch, English and German share about 60% cognates, as seen in pairs like hand and Hand, though the divergence is accentuated by later sound shifts and external influences. Similarly, the Indo-Iranian branch displays notable retention; Sanskrit and Persian (Farsi) align at around 35% in core terms, evident in cognates such as mātṛ (Sanskrit) and mādar (Persian) for "mother," which preserve Indo-Iranian phonological patterns. Within the Slavic branch, Russian and Polish show around 77–80% shared cognates in 158-item basic lists, including forms like ruka (Russian) and ręka (Polish) for "hand."

Borrowing complicates these patterns, introducing non-cognate elements that mask genetic ties. English, for example, draws over 50% of its vocabulary from Latin and French sources, much of it after the Norman Conquest, so that native Germanic mother stands alongside borrowed maternal. This reduces its apparent lexical proximity to other Germanic languages such as German compared with pre-borrowing estimates. Such areal influence shows how contact can overlay inherited similarity, so methods are needed to distinguish loans from cognates.

The Indo-European Lexical Cognacy (IELex) database supports systematic comparison by providing cognate codings for over 200 languages across more than 200 semantic meanings, enabling phylogenetic modeling of similarity distributions. Updated resources such as the Indo-European Cognate Relationships dataset (IE-CoR) extend this work, encoding inherited relationships while accounting for borrowings in core vocabulary.

East Asian Languages

In the Sino-Tibetan language family, lexical similarity between Sinitic and Tibeto-Burman languages is relatively low, and comparisons are often complicated by tonal variation and historical divergence. A phylogenetic analysis of 131 Sino-Tibetan languages using 110 core vocabulary items identified 1,726 binary cognate sets resistant to borrowing, confirming a common origin around 8,000 years before present but highlighting low cognate retention due to phonetic shifts, including tones that distinguish meanings in both branches. Basic terms show partial overlap, but reconstruction efforts reveal that many apparent similarities stem from proto-forms altered by tonogenesis in Sinitic versus consonant clusters in Tibeto-Burman.

Relations between Chinese and Japanese illustrate high levels of borrowing contrasted with low genetic similarity. Approximately 36.7% of Japanese vocabulary consists of Sino-Japanese words borrowed from Chinese, primarily through compounds introduced during historical contact periods, yet native Japonic matches with Chinese core terms remain minimal, around 10% or less, as the languages belong to distinct families. This borrowing dominates the formal and technical lexicon, while everyday native Japanese words diverge significantly, underscoring the need to distinguish contact-induced similarity from inheritance in East Asian contexts.

The Altaic hypothesis proposes lexical connections between Turkic, Mongolic, and Japonic languages, with suggested cognate matches ranging from 15-25% in reconstructed basic vocabulary, but these are widely rejected as resulting from areal diffusion and loans rather than genetic relatedness. Statistical tests on lexical reconstructions show some significant p-values for Turkic-Japonic pairs (e.g., 1.8 × 10^{-4}), yet the overall consensus attributes the similarities to long-term contact across Eurasia, with geographical barriers making direct inheritance unlikely for Japonic. Core Altaic (Turkic-Mongolic-Tungusic) exhibits stronger evidence of shared features, but its extension to Japonic remains unsupported by rigorous comparative methods.

Measuring lexical similarity in East Asian languages requires adjustments for tonal systems, as standard lists like Swadesh's overlook pitch contours that alter meanings. Tonally sensitive approaches, such as those incorporating phonetic reconstructions, reveal divergences in shared concepts; for example, the word for "mother" appears as tonally marked mā in Chinese (high tone) but as non-tonal haha in Japanese, reflecting genetic separation despite superficial script-based overlaps in borrowed terms. These adaptations ensure that comparisons account for isolating morphology and the influence of the Chinese script, prioritizing inherited over borrowed elements.

Other Language Families

In the Austronesian language family, the Malayo-Polynesian branch shows high lexical similarity, particularly in basic vocabulary. Languages within the branch share numerous cognates, reflecting their shared Proto-Malayo-Polynesian roots and the family's rapid dispersal across the Pacific. This level of overlap is typical for closely related branches, though sound changes and geographic separation reduce mutual intelligibility.

Within the Niger-Congo family, the Bantu subgroup exhibits moderate to high lexical similarity, alongside shared grammatical features such as noun classes. Pairs of Bantu languages share a moderate number of cognates in core vocabulary, as determined by lexicostatistical analyses of Swadesh lists, reflecting their common expansion from West-Central Africa around 3,000 years ago. This similarity supports subgrouping within Bantu but diminishes with greater geographic distance.

The Uto-Aztecan family illustrates moderate lexical similarity across its branches, influenced by geographic factors. Nahuatl and Hopi, representing the southern and northern branches respectively, show approximately 20% overlap in basic lists according to lexicostatistical studies, and divergence due to migration and contact has produced distinct phonological systems. This level of overlap reflects the family's split into Numic, Takic, and Southern Uto-Aztecan groups.

Language isolates and small families often display low lexical similarity with their neighbors. Basque, an isolate in Europe, shares less than 5% inherited cognates with surrounding Romance languages such as Spanish and French, despite heavy borrowing (over 50% of the modern Basque lexicon comes from Romance sources due to prolonged contact). Similarly, the contested Amerind hypothesis proposes 10-20% cognates across Native American families excluding Na-Dene and Eskimo-Aleut, based on mass comparison of basic vocabulary, but these links are widely rejected for lacking rigorous sound correspondences and relying on chance resemblances.

Applications and Limitations

Linguistic Classification

Lexical similarity plays a central role in constructing phylogenetic trees for language families, where higher degrees of similarity between languages suggest closer genetic relationships, and these similarities are often transformed into distance matrices for algorithmic reconstruction. In this approach, methods such as the neighbor-joining algorithm are applied to pairwise lexical distances derived from standardized word lists, producing branching tree structures that visualize language family hierarchies and subgroupings. These trees provide a quantitative framework for hypothesizing descent, with branch lengths calibrated to reflect divergence times based on lexical retention rates.

To improve reliability, lexical similarity data is frequently integrated with evidence from morphology and phonology, creating more robust classifications that draw on multiple lines of evidence. For instance, in the Austronesian family, lexical phylogenies have been combined with morphological reconstructions and phonological correspondences to confirm established subgroups, such as the division between Western and Central-Eastern Malayo-Polynesian branches. This integration mitigates the limitations of relying solely on vocabulary, as shared lexical items can sometimes result from borrowing rather than common ancestry, while morphological and phonological patterns offer complementary indicators of genetic relatedness.

Databases like the Automated Similarity Judgment Program (ASJP) enable global-scale comparisons by compiling 40-item core vocabulary lists from over 6,000 languages, allowing the automated computation of lexical similarities and the generation of comprehensive phylogenetic trees. ASJP data has been instrumental in testing macro-family hypotheses, such as Nostratic, which posits a distant common ancestor for Indo-European, Uralic, and other Eurasian families; weighted sequence alignment techniques applied to ASJP word lists have been reported to give statistical support for some such groupings by identifying non-random similarities beyond chance or borrowing.

In modern linguistic work, lexical similarity analysis extends to the documentation of endangered languages, helping to identify potential relatives and prioritize preservation. By comparing limited vocabulary data from under-documented languages against established databases, researchers can hypothesize affiliations that guide fieldwork and revitalization, as seen in prioritization schemes that quantify a language's distinctiveness based on lexical distances to better-resourced relatives. This application shows the utility of lexical methods for rapidly assessing the relationships of languages at risk of extinction and for informing targeted documentation projects.
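The step from similarity percentages to a tree can be illustrated with a small sketch. Note the substitution: instead of neighbor-joining, this uses plain average-linkage clustering (UPGMA-style), which is shorter to write but assumes roughly equal rates of change on all branches. The four varieties A to D and their similarity values are invented for illustration, and the function names are hypothetical.

```python
from itertools import combinations

# Invented pairwise lexical similarities (percent) for four varieties.
SIMILARITY = {
    ("A", "B"): 89, ("C", "D"): 77,
    ("A", "C"): 30, ("A", "D"): 28,
    ("B", "C"): 31, ("B", "D"): 29,
}

def to_distances(similarity):
    """Turn percent similarity into a distance in [0, 1]."""
    return {frozenset(pair): 1.0 - pct / 100.0
            for pair, pct in similarity.items()}

def cluster_distance(members_a, members_b, dist):
    """Mean distance over all leaf pairs drawn from the two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))

def average_linkage(names, dist):
    """Merge the two closest clusters until one nested-tuple tree remains."""
    clusters = [(name, (name,)) for name in names]  # (subtree, leaf names)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

tree = average_linkage(["A", "B", "C", "D"], to_distances(SIMILARITY))
print(tree)
```

With these invented values the closest pair (A, B) is grouped first, then (C, D), and the two groups are joined last. Neighbor-joining would relax the equal-rate assumption, and neither method detects borrowing on its own.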

Methodological Criticisms

One major methodological criticism of lexical similarity measures concerns borrowing bias, where extensive language contact leads to the adoption of loanwords that artificially inflate similarity scores between unrelated language families. In regions of prolonged interaction, such as trade or colonial zones, borrowed vocabulary can enter basic word lists and mislead genetic classifications by suggesting closer relatedness than exists. For example, Swahili, a Bantu language, incorporates approximately 30% Arabic loanwords due to historical Arab-Swahili trade and Islamic influence beginning around the 10th century, which can lead to overestimates of lexical similarity between Semitic and Bantu languages. Similarly, lexicostatistical tree reconstructions are vulnerable to distortion if borrowed items are not rigorously excluded, as contact-induced loans can spread across dialects and confound phylogenetic signals.

Another key issue is list dependency, where the selection of vocabulary items introduces variability and cultural bias into similarity calculations. Standard tools like the Swadesh 100- or 200-word lists, designed to capture stable "basic" vocabulary, rely on intuitive choices that may not yield equivalents across all languages, leading to inconsistent cognate identifications. For instance, concepts such as "to swim" lack direct parallels in some languages, reducing comparability and introducing measurement error. Moreover, these lists prioritize supposedly universal notions but overlook culturally specific or modern terms (e.g., technology-related words), embedding ethnocentric biases that favor Indo-European perspectives and undervalue semantic domains with high cross-linguistic variability. This dependency on fixed lists amplifies inconsistencies, as alternative compilations (e.g., expanded 300-item lists for Austronesian languages) yield different similarity profiles depending on cultural relevance.

Statistical challenges further undermine lexical similarity methods, particularly the core assumption in glottochronology of constant retention rates for basic vocabulary, which has been disproven. Retention varies widely with factors such as the intensity of language contact; for example, Icelandic retains over 95% of basic vocabulary compared to about 81% in Norwegian over similar time depths, invalidating uniform decay models. In some families, retention rates range from 5% to 50% over 4,000 years, showing how factors such as cultural isolation disrupt constancy. Additionally, these methods are highly sensitive to small sample sizes, where chance replacements or limited data can skew results dramatically, as demonstrated by unstable estimates from short word lists. Such issues have led to widespread rejection of glottochronology by historical linguists.

To address these limitations, alternatives like multidimensional scaling (MDS) and Bayesian phylogenetics have gained traction for more robust analyses. MDS visualizes lexical distances as points in a low-dimensional space, capturing non-linear relationships and reducing distortions from borrowing or list biases by representing languages as configurations that correlate with geographic or historical factors. Bayesian phylogenetics, meanwhile, employs probabilistic models to infer trees from lexical data while accommodating variable rates, borrowing events, and the integration of non-lexical evidence such as phonology or morphology, offering calibrated divergence estimates without assuming uniformity. These approaches improve reliability by incorporating uncertainty and multiple kinds of data, providing a pathway beyond traditional lexicostatistics.
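The retention-rate criticism can be made concrete with the classic glottochronological formula, t = ln(c) / (2 ln(r)), where c is the proportion of shared cognates, r is the assumed retention rate per millennium, and t is the estimated time since divergence in millennia. The formula is not stated in the text above and is supplied here as background; the value r = 0.86 is the figure commonly cited for the 100-item list, and the function name is hypothetical. The sketch holds c fixed and varies r to show how strongly the estimate depends on the assumption.

```python
import math

def divergence_millennia(shared, retention):
    """Classic estimate t = ln(c) / (2 * ln(r)), in millennia.

    shared:    proportion of shared cognates c, with 0 < c <= 1
    retention: assumed retention rate r per millennium, with 0 < r < 1
    """
    if not 0 < shared <= 1:
        raise ValueError("shared must be in (0, 1]")
    if not 0 < retention < 1:
        raise ValueError("retention must be in (0, 1)")
    return math.log(shared) / (2 * math.log(retention))

# Same observed cognate share, three different assumed retention rates.
SHARED = 0.60
estimates = {r: divergence_millennia(SHARED, r) for r in (0.81, 0.86, 0.95)}
for r, t in estimates.items():
    print(f"r = {r:.2f}: about {t:.1f} millennia")
```

For a 60% cognate share, moving the assumed rate from 0.81 to 0.95 shifts the estimate from roughly 1.2 to roughly 5 millennia, so a method that fixes a single universal rate can be off by a factor of several.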
