
Language identification

Language identification (LI), also known as language detection, is the computational task of determining the natural language in which a given text or speech segment is expressed, serving as a foundational step in natural language processing (NLP) and speech processing systems. For textual data, LI involves analyzing features such as character n-grams, word distributions, and linguistic patterns to classify documents or segments among thousands of languages, with high accuracy achievable for well-resourced languages when long texts are available. In spoken language identification, the process relies on acoustic cues including phonotactics, prosody, and intonation to distinguish languages or dialects from audio inputs, often as a precursor to selecting appropriate speech recognizers. The importance of LI stems from its role in enabling multilingual applications, such as machine translation and cross-lingual search, where incorrect language detection can propagate errors through downstream pipelines. It also supports spoken applications like automated call routing, and it facilitates processing of diverse data sources such as web pages and social media, particularly for low-resource languages that lack extensive training data. In the decades since the first automated systems of the 1960s, LI research has evolved from rule-based and statistical approaches to advanced machine learning techniques, including support vector machines and neural networks, achieving accuracies exceeding 99% in controlled settings for major languages. Despite these advances, challenges persist in handling short texts, code-switching, dialects, and under-resourced languages, where performance drops significantly due to data scarcity and linguistic similarities. Ongoing efforts focus on robust, off-the-shelf systems and shared tasks like the NIST Language Recognition Evaluations, integrating models such as convolutional and recurrent neural networks to improve generalization across real-world scenarios.

Introduction

Definition and Scope

Language identification (LID) is the computational task of automatically determining the natural language of an input, such as written text or spoken audio, through algorithmic analysis. In its core form, LID processes textual documents by examining linguistic patterns to assign a language label, or analyzes speech signals to estimate the spoken language based on phonetic and prosodic cues. The task has roots in formal learning theory of the 1960s, with early formulations framing it as a learnability problem for identifying languages from positive examples, and it is often applied as a preprocessing step for downstream systems. The scope of LID encompasses several modalities and configurations. For written text, it typically involves statistical models like character n-gram frequency analysis to detect languages in short or long documents. Spoken LID, in contrast, relies on acoustic features such as mel-frequency cepstral coefficients (MFCCs), phonotactics, and intonation patterns extracted from audio signals to distinguish languages. Hybrid systems combine these approaches, integrating textual transcripts with acoustic data for robust identification in mixed environments. LID is distinct from related tasks like script detection, which identifies writing systems (e.g., Latin vs. Cyrillic) without specifying the language, and named entity recognition, which extracts specific entities (e.g., persons or locations) within an assumed language. Key concepts in LID include distinctions based on input complexity and knowledge assumptions. Monolingual LID assumes a single language per input, simplifying classification for uniform documents, while multilingual LID detects and segments multiple languages within the same input, such as in code-switched text. Closed-set LID operates on a predefined list of known languages with available training data, outputting the most likely match from that set, whereas open-set LID handles unseen or unknown languages by rejecting or flagging inputs outside the trained classes. These variations enable LID's integration into broader pipelines, including as a precursor to downstream NLP tools.

Importance and Applications

Language identification plays a crucial role in facilitating global communication by automating the routing and processing of multilingual content across digital platforms. In social platforms and search engines, it enables content filtering, personalized recommendations, and improved search relevance for diverse user bases, thereby bridging linguistic barriers and promoting inclusive online interactions. Similarly, in customer service systems, automatic language detection streamlines interactions by directing queries to appropriate agents or services, enhancing response times and user satisfaction in multinational environments. Key applications of language identification span several domains, including web content classification, where tools like Google Translate employ auto-detection to identify source languages for real-time translation of web pages and documents, supporting 244 languages as of October 2024 following the addition of 110 new languages using the PaLM 2 AI model. In forensic linguistics, native language identification enhances authorship attribution by analyzing linguistic traces in multilingual texts, improving accuracy by up to 9% in investigations involving non-native speakers. For speech-to-text systems in call centers, language detection in automatic speech recognition (ASR) handles multilingual calls by identifying spoken languages in real time, enabling accurate transcription for over 99 languages despite challenges like accents and code-switching. Additionally, in accessibility tools, language identification ensures proper pronunciation by screen readers for multilingual users, preventing misinterpretation of non-English text through language tags. The economic value of language identification lies in its ability to reduce manual labor in localization industries, where automated detection accelerates content adaptation for global markets. The language services industry, heavily reliant on such technologies, reached USD 71.7 billion in 2024, with projections of USD 75.7 billion in 2025, driven by AI-enhanced localization that cuts costs and boosts efficiency in sectors like healthcare. This is particularly vital amid the multilingual web's growth: approximately 7,000 languages exist worldwide, yet only about ten have substantial online presence, underscoring the need for tools to expand digital inclusion beyond dominant languages like English, which accounts for roughly 60% of web content. Emerging applications include its role in AI ethics for bias detection, where language identification reveals disparities in model performance; in one study, AI detectors unanimously flagged 19% of essays by non-native English writers as machine-generated. In security contexts, it aids in analyzing encrypted communications, such as VoIP traffic, by inferring languages from packet lengths with up to 86.6% accuracy in binary classifications, highlighting privacy risks and informing countermeasures.

Historical Development

Early Approaches

Early approaches to language identification relied on manual techniques employed by linguistic experts, particularly in cryptanalysis, where identifying the language of an encrypted text was sometimes a prerequisite for decryption. These methods involved analyzing phonological patterns, such as sound distributions; morphological features, like word-formation rules; and syntactic structures, including sentence organization. A key example was character frequency analysis, which exploited the predictable distribution of letters in specific languages to aid code-breaking, a practice refined during 19th-century military and diplomatic efforts. The transition to computational methods began in the mid-20th century with the advent of digital text processing. In 1965, Seppo Mustonen developed one of the first automated systems using multiple discriminant analysis on character-based features, including vowel-consonant ratios and word length distributions, achieving approximately 76% accuracy in distinguishing English, Swedish, and Finnish texts from 300-word samples. This statistical approach marked the shift from purely manual analysis to machine-assisted identification, though it still required predefined linguistic models rather than automated training. Early computational efforts, such as those run on mainframes in the late 1960s and early 1970s, focused on processing machine-readable texts by calculating simple statistical profiles, laying the groundwork for broader application. By the 1970s, rule-based systems emerged as key milestones, emphasizing fixed heuristics tailored to specific language families. For instance, Y. Nakamura's 1971 system identified 25 Latin-alphabet languages through rules governing character and word occurrence rates. Morton D. Rau's 1974 work advanced this by incorporating character-sequence probabilities and vowel-consonant ratios, attaining 89% accuracy on an IBM Model 67 mainframe, though these methods struggled to scale to non-Latin scripts due to their reliance on alphabetic assumptions. Similarly, the 1977 probabilistic approach of A. S. House and E. P. Neuburg provided a framework using Markovian probabilities on phonetic features for spoken language identification, limited to European languages. Influential late-1980s contributions built on these foundations with more sophisticated probabilistic models. Kenneth R. Beesley's 1988 program utilized character n-gram probabilities to automatically identify languages in online text, demonstrating effective differentiation between English and French by comparing sequence frequencies against cryptanalysis-inspired models, with confidence levels exceeding 60% after just 10-12 words. These early systems highlighted the potential of n-grams for rapid identification but underscored persistent challenges in handling diverse scripts and short texts.

Evolution in the Digital Age

The 1990s heralded a pivotal shift in language identification toward corpus-based statistical methods, enabled by the growing availability of digitized text collections that allowed empirical modeling of linguistic patterns across languages. Early efforts focused on character n-grams as discriminative features, with Cavnar and Trenkle's 1994 approach achieving 99.8% accuracy on a corpus spanning eight languages, establishing n-grams as a foundational technique for scalable LID. This era also saw dictionary-based methods, such as Giguet's 1995 use of alphabets and function words, which complemented statistical models by incorporating lexical resources. By the 2000s, advancements accelerated through web crawling, which provided vast, diverse training data and expanded LID to dozens of languages previously underrepresented in curated corpora. Baldwin and Lui's 2010 work utilized n-gram models trained on web-crawled texts covering 67 languages, demonstrating how internet-scale data improved robustness and coverage in digital environments. Support vector machines emerged as a key classifier during this period; for instance, Kruengkrai et al. applied SVMs to character n-grams in 2005, enhancing classification precision for multilingual texts. Parallel corpora like Europarl, introduced by Koehn in 2005 and later used for source language identification in translations, further supported LID for European languages by offering aligned multilingual data. The late 2000s and 2010s witnessed the influence of shared tasks that standardized evaluation and spurred innovation amid surging computational resources. The Discriminating between Similar Languages (DSL) shared task, launched in 2014, utilized corpora like the DSL Corpus Collection to benchmark methods for closely related languages. These events, alongside the proliferation of cloud computing and GPU acceleration, enabled LID systems to process larger volumes of internet-sourced text with greater efficiency. In recent years, LID has integrated deeply with large language models, leveraging pre-trained architectures for superior performance. The release of multilingual BERT in 2018 allowed fine-tuning on LID tasks, yielding models that detect languages across 100+ variants with high accuracy using minimal domain-specific data, as evidenced by community implementations on platforms such as Hugging Face. Subsequent advances include XLM-RoBERTa (2019), which improved multilingual representations for better generalization in low-resource LID, and ongoing shared tasks like VarDial (through 2023) focusing on dialectal and similar-language discrimination using transformer-based models. As of 2025, integration with large generative LLMs enables zero-shot LID with accuracies exceeding 95% for many languages on short texts. The ubiquity of smartphones has amplified this progress by embedding LID in mobile ecosystems, powering features like automatic language detection in translation apps that process user input in real time across over 100 languages to facilitate seamless communication.

Methods and Techniques

Statistical and Rule-Based Methods

Statistical methods for language identification rely on probabilistic models that analyze frequency distributions of textual elements, such as characters or words, to determine the most likely language. These approaches, including n-gram-based techniques, compute the probability P(\text{language} \mid \text{text}) as the product of conditional probabilities of n-grams given the language, often approximated using maximum likelihood estimation from pre-built language profiles. For character-level identification, a common scoring mechanism sums the log-probabilities: \text{score} = \sum_n \log P(\text{char}_n \mid \text{language}), where \text{char}_n represents the n-th character n-gram. This formulation enables efficient comparison against multiple language models without requiring extensive training data, making it suitable for resource-constrained environments. A seminal example is the Cavnar-Trenkle rank-order method, which uses overlapping n-grams of lengths 1 to 5 to generate ranked frequency profiles for each language. The method scores a document by calculating an "out-of-place" distance metric based on rank differences between the document's n-gram ranks and those in each language profile, selecting the language with the minimal distance. Evaluated on 3,478 documents across eight Western European languages, it achieved 99.8% accuracy for texts longer than 300 bytes, demonstrating robustness to noise like OCR errors. Naive Bayes classifiers extend these statistical foundations by treating language identification as a generative classification task, assuming feature independence to compute posterior probabilities via Bayes' theorem: P(\text{language} \mid \text{features}) \propto P(\text{language}) \prod_i P(\text{feature}_i \mid \text{language}). Applied to word unigrams or character n-grams, this approach excels in low-compute settings, offering explainability through interpretable probability estimates and achieving near-perfect accuracy on sentence-length texts with smoothed 5-grams.
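The Cavnar-Trenkle out-of-place distance described above can be sketched in a few lines of plain Python. This is a toy illustration, not the original implementation: the two "language profiles" are built from single sentences rather than real corpora, and the profile size is truncated for brevity.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Build a ranked character n-gram profile (Cavnar-Trenkle style).

    Overlapping n-grams of lengths 1..n_max are counted and ranked by
    frequency; the rank, not the raw count, is what the distance uses.
    """
    text = " " + text.lower() + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language profile
    receive a maximum penalty equal to the profile length."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile.get(gram, max_penalty))
        for gram, rank in doc_profile.items()
    )

def identify(text, lang_profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy "training" texts stand in for real corpora.
profiles = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "german": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}
print(identify("the dog and the fox", profiles))      # english
print(identify("der hund und der fuchs", profiles))   # german
```

With realistic training corpora and the full profile length of the original method, the same rank-comparison logic scales to dozens of languages.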
Rule-based methods complement statistical techniques with hand-crafted heuristics, such as script detection via Unicode ranges to identify languages associated with specific writing systems—for instance, Cyrillic characters indicating Russian or Devanagari for Hindi. These deterministic rules, often combined with keyword matching for low-resource languages, provide rapid initial filtering but are limited to scenarios with distinct orthographic features. Overall, statistical and rule-based methods prioritize simplicity and efficiency, thriving in explainable, low-resource applications despite challenges with closely related language pairs.
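Rule-based script filtering can be sketched with Unicode character names from Python's standard library. The script-to-language hints below are illustrative assumptions, since a script only narrows the candidate set rather than deciding the language.

```python
import unicodedata

# Rough mapping from Unicode script-name prefixes to candidate languages.
# Illustrative only: Cyrillic could be Russian, Ukrainian, Serbian, etc.
SCRIPT_HINTS = {
    "CYRILLIC": ["Russian", "Ukrainian", "Serbian"],
    "DEVANAGARI": ["Hindi", "Marathi", "Nepali"],
    "GREEK": ["Greek"],
    "HANGUL": ["Korean"],
    "ARABIC": ["Arabic", "Persian", "Urdu"],
}

def detect_script(text):
    """Count alphabetic characters per script using Unicode character names
    (e.g. 'CYRILLIC CAPITAL LETTER PE') and return the majority script."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        script = name.split()[0] if name else "UNKNOWN"
        counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"

def candidate_languages(text):
    """Narrow the language candidates by script; Latin-script text still
    needs a statistical model for the final decision."""
    return SCRIPT_HINTS.get(detect_script(text), ["(needs statistical model)"])

print(detect_script("Привет мир"))       # CYRILLIC
print(candidate_languages("नमस्ते"))      # ['Hindi', 'Marathi', 'Nepali']
```

This kind of cheap filter is typically run first, handing ambiguous cases (e.g. all Latin-script European languages) to a statistical classifier.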

Machine Learning and Neural Approaches

Machine learning approaches to language identification (LID) represent a shift from hand-crafted rules to data-driven models that learn patterns from large corpora. Classical methods typically involve feature extraction techniques such as term frequency-inverse document frequency (TF-IDF) applied to words or character n-grams, which are then fed into classifiers like support vector machines (SVMs) or random forests. For instance, TF-IDF weighting on lexical features has been shown to enhance LID performance by emphasizing distinctive terms across languages. SVMs, particularly using libraries like LIBLINEAR for efficient linear classification on high-dimensional n-gram features, have achieved strong results in tasks involving diverse language sets, such as distinguishing South African languages at the word level. Random forests have also been effective as ensemble classifiers, often outperforming single models in shared tasks like the German Dialect Identification (GDI) evaluation. Neural approaches, emerging prominently after 2015, leverage architectures suited to sequential data to capture contextual dependencies in text. Recurrent neural networks (RNNs) with long short-term memory (LSTM) units process character or word sequences, handling morphological variation better than classical methods. For example, LSTM-based models with character n-gram embeddings secured top performance in shared tasks on code-switched LID. These methods marked an advance in modeling long-range dependencies, though they require substantial training data. Transformer-based models, such as multilingual BERT (mBERT) introduced in 2018, utilize cross-lingual embeddings pre-trained on vast multilingual corpora to support zero-shot LID, where models identify unseen languages via cross-lingual transfer. mBERT's contextual representations allow LID without language-specific training, demonstrating utility in probing multilingual knowledge.
Building on this, XLM-R (2020) extends pre-training to 100 languages with improved masked language modeling, enabling robust transfer for LID in low-resource settings. Advanced neural techniques include hierarchical models that first map scripts to language families before fine-grained classification, addressing ambiguities in shared writing systems. For instance, hierarchical classifiers in some large-scale systems process over 350 languages by cascading script detection with language-specific heads, improving accuracy on short texts. A typical neural LID pipeline computes class probabilities as \text{output} = \text{softmax}(W \cdot \text{embedding}(\text{text})), where the embedding layer captures contextual features from the input text and W is a learned weight matrix. Deep learning methods, particularly via fine-tuning of pre-trained transformers, have yielded accuracies exceeding 95% on datasets spanning over 100 languages, as demonstrated by XLM-R adaptations in multilingual LID benchmarks. Recent developments as of 2025 include the application of generative large language models, including fine-tuned variants, for zero-shot LID in low-resource and code-switched scenarios, further enhancing generalization.
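The softmax formulation above can be illustrated with a self-contained sketch. Here the "embedding" is a toy letter-frequency vector and the weight matrix W is hand-set for demonstration; in a real neural LID system both would be learned from data by gradient descent.

```python
import math

LANGS = ["english", "german", "french"]
FEATURE_CHARS = "eainr"  # toy features: frequencies of five common letters

def embedding(text):
    """Toy stand-in for a learned embedding: a letter-frequency vector."""
    text = text.lower()
    total = max(1, sum(text.count(c) for c in FEATURE_CHARS))
    return [text.count(c) / total for c in FEATURE_CHARS]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative weight matrix W, one row per language; hand-tuned here,
# learned in practice.
W = [
    [2.0, 0.5, 0.5, 1.0, 0.5],   # english
    [1.0, 0.5, 1.5, 1.5, 1.5],   # german
    [1.5, 1.5, 1.0, 0.5, 1.0],   # french
]

def classify(text):
    """output = softmax(W . embedding(text))"""
    x = embedding(text)
    scores = [sum(w * f for w, f in zip(row, x)) for row in W]
    return dict(zip(LANGS, softmax(scores)))

probs = classify("the queen has seen the green trees")
print(probs)
```

The output is always a valid distribution over the language set, which is what allows downstream systems to apply confidence thresholds or combine the scores with priors.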

Challenges and Limitations

Distinguishing Similar Languages

Distinguishing closely related languages or dialects poses significant challenges in language identification (LID) due to high degrees of lexical, phonological, and orthographic overlap. For instance, Spanish and Portuguese exhibit approximately 89% lexical similarity, meaning a substantial portion of their vocabularies consists of cognates, which complicates automated differentiation based on word-level features alone. Similarly, Arabic dialects share the same script and much of the core lexicon derived from Standard Arabic, obscuring phonological distinctions in written text and leading to frequent misclassifications, as over 56% of dialectal sentences can be valid across multiple varieties. To address these issues, fine-grained models leverage phonological features to capture subtle acoustic or orthographic differences that broader classifiers overlook. Perceptual phonetic similarity spaces, constructed from features such as phoneme inventories and vowel qualities, enable clustering of related languages by quantifying overall phonetic overlap, aiding the separation of closely related pairs. Additionally, information-gain scores serve as a feature-selection criterion in LID systems, ranking n-grams by how strongly they discriminate between similar languages, though they may prioritize domain-specific rather than linguistic cues. For specific cases like Serbian and Croatian, discriminative character n-grams combined with support vector machines achieve high in-domain accuracy (up to 99.5% F1) by focusing on orthographic markers, but cross-domain performance drops due to stylistic variations. Case studies from shared tasks highlight persistent difficulties. In the Discriminating between Similar Languages (DSL) shared tasks from 2014 to 2017, systems struggled with closely related Indo-Aryan varieties such as Bhojpuri and its neighbors, where linguistic proximity and short texts resulted in weighted F1 scores below 80% for confused instances, despite overall task accuracies reaching 88-90%. Orthographic flexibility further exacerbates these challenges, as seen in Norwegian, where flexible spelling rules (e.g., variable endings like "-e" or "-a") allow multiple valid forms for the same word, reducing transcription consistency in LID models and contributing to error rates in automatic systems. Mitigation strategies often involve ensemble methods that integrate LID outputs with external signals, such as geographic metadata, to refine predictions for similar varieties. Region-specific LID models, which condition candidate languages on geographic priors (e.g., limiting candidates to locally used dialects), improve F-scores by up to 10 points for short texts by resolving ambiguities between closely related options.
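Combining LID scores with a geographic prior, as described above, can be sketched as a simple product-of-experts reweighting. The numbers below are invented for illustration: the text model is torn between two close varieties, and a hypothetical regional prior breaks the tie.

```python
def apply_geo_prior(lid_scores, geo_prior):
    """Reweight LID posteriors by a regional prior:
    P(lang | text, region) is taken proportional to
    P(lang | text) * P(lang | region), then renormalized.
    A heuristic product-of-experts combination, not a full Bayesian model."""
    combined = {
        lang: lid_scores[lang] * geo_prior.get(lang, 1e-6)
        for lang in lid_scores
    }
    z = sum(combined.values())
    return {lang: v / z for lang, v in combined.items()}

# Illustrative numbers: Croatian vs Serbian vs Slovenian on a short text.
lid_scores = {"hr": 0.48, "sr": 0.47, "sl": 0.05}   # nearly tied text model
geo_prior = {"hr": 0.70, "sr": 0.20, "sl": 0.10}    # hypothetical user region

posterior = apply_geo_prior(lid_scores, geo_prior)
print(max(posterior, key=posterior.get))  # hr
```

The same mechanism accepts any external signal expressible as a per-language prior, such as the user's interface locale or the domain of the source website.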

Handling Multilingual and Dialectal Variations

Language identification systems encounter significant challenges when processing code-switched texts, where speakers alternate between two or more languages within a single utterance, as seen in bilingual communities using Hinglish, a mix of Hindi and English. This phenomenon is prevalent on social media and in informal communication, complicating token-level detection due to overlapping vocabulary and grammatical structures. Conditional Random Fields (CRFs) have been employed effectively for word-level language tagging in such scenarios, leveraging features like lexical dictionaries, character n-grams, and contextual cues to label tokens as belonging to one language, the other, or mixed. For instance, on English-Hindi code-mixed data, CRF models achieve test accuracies around 85-95%, outperforming baselines by incorporating probability-based classifiers for ambiguous tokens. Dialectal variations further exacerbate identification difficulties, particularly for under-resourced languages where data scarcity limits model training. In African languages like Swahili, which exhibits substantial dialectal diversity across regions such as Zanzibar and the mainland, variations in phonology, vocabulary, and syntax hinder accurate detection, especially in speech or in contexts influenced by contact languages. These challenges are amplified by the under-resourced nature of many such languages; with over 7,000 languages worldwide, only around 5% (approximately 350 languages based on digital presence indicators) have substantial digital text available, leaving African dialects with minimal annotated resources. This data paucity means that models trained on standard variants perform poorly on dialectal inputs, often requiring specialized corpora to capture tonal and morphological nuances. Noisy inputs, including OCR errors and transliteration (where text from non-Latin scripts is romanized, such as Hindi words written in Latin characters), introduce additional variability that standard models struggle to handle. OCR errors, like character substitutions or deletions, can alter word forms and lead to misidentification in multilingual documents, while transliteration creates ambiguous representations that mimic multiple languages. Mitigation strategies include robust preprocessing, such as noise simulation via rule-based edits or encoder-decoder models that generate synthetic noisy samples, and few-shot approaches that adapt models to limited examples despite label noise. These methods enhance robustness by mining hard examples and stabilizing representations between clean and noisy texts, improving performance on tasks like language detection in transliterated content. In real-world applications, failure to account for multilingual and dialectal variations can propagate errors to downstream tasks, such as sentiment analysis on dialectal Arabic, where misidentification of dialects leads to accuracy drops of 10-20% compared to processing standard varieties. For example, on multi-dialect datasets, models like SVMs achieve only 51-61% accuracy on UAE and related dialects due to morphological complexity and orthographic variation, underscoring the need for dialect-aware identification to maintain reliable downstream performance.
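As a simplified stand-in for the CRF taggers described above (dropping the contextual transition features a real CRF would model), token-level tagging can be sketched by scoring each token against per-language character bigram models. The tiny "corpora" below are illustrative assumptions, not real training data.

```python
import math
from collections import Counter

def build_bigram_model(corpus):
    """Character bigram counts for one language (space-padded toy corpus)."""
    text = " " + corpus.lower() + " "
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    vocab = len(counts) + 1          # +1 reserves mass for unseen bigrams
    return counts, total, vocab

def token_logprob(token, model):
    """Add-one-smoothed log-probability of a token under a bigram model."""
    counts, total, vocab = model
    text = " " + token.lower() + " "
    return sum(
        math.log((counts[text[i:i + 2]] + 1) / (total + vocab))
        for i in range(len(text) - 1)
    )

def tag_tokens(sentence, models):
    """Label each token independently; a CRF would also score label context."""
    return [
        (tok, max(models, key=lambda lang: token_logprob(tok, models[lang])))
        for tok in sentence.split()
    ]

# Illustrative mini-corpora; real systems train on annotated code-mixed data.
models = {
    "en": build_bigram_model("this is what we want to do with the thing you know"),
    "hi": build_bigram_model("kya aap jaante hain ki yeh kaam kaise hota hai accha"),
}
print(tag_tokens("yeh accha hai but we want more", models))
```

Because each token is scored in isolation, ambiguous short tokens remain error-prone; this is exactly the gap that CRF transition features and lexical dictionaries close in the published systems.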

Evaluation and Tools

Performance Metrics and Datasets

Performance in language identification (LID) is primarily evaluated using accuracy, which measures the proportion of correctly identified language instances out of the total, though it can be misleading on imbalanced datasets where high-resource languages dominate. To address class imbalance, the F1-score is widely adopted, particularly the macro-averaged F1-score, calculated as the unweighted average of per-language F1-scores (where each language's F1 is the harmonic mean of its precision and recall), which penalizes poor performance on minority languages and provides a balanced view of model effectiveness across diverse linguistic distributions. Additionally, the confusion matrix serves as a diagnostic tool for error analysis, visualizing misclassifications between languages to reveal patterns such as frequent confusions between closely related varieties, enabling targeted improvements in model robustness. Key datasets for training and testing LID systems include the SETimes corpus (developed circa 2005-2007), which provides parallel texts for discriminating between similar South Slavic languages like Bosnian, Croatian, and Serbian, including short excerpts to simulate real-world identification challenges in regional contexts. For broader coverage, the open OpenLID dataset introduced in 2023 aggregates web-sourced text across 201 languages, balancing samples to about 600,000 lines per language for a total of 121 million lines, facilitating evaluation of multilingual models in low-resource scenarios. An update, OpenLID-v2, released in October 2025, extends coverage to 200 language varieties with improved handling of dialects. In spoken LID, the VoxLingua107 corpus from 2020 provides 6,628 hours of audio segments extracted from online videos across 107 languages, averaging 62 hours per language, to support acoustic feature learning in diverse speech environments. Standard evaluation protocols involve k-fold cross-validation on held-out test sets to ensure generalizability, with splits maintaining class balance to avoid overfitting to training distributions. Benchmarks from the Discriminating between Similar Languages (DSL) shared tasks highlight progress; for instance, the 2015 edition achieved top accuracies of 95.54% on text-based identification across 10 language groups using ensemble classifiers. Later iterations, such as the VarDial evaluations through 2024, demonstrate incremental gains from neural methods, underscoring the shift toward handling dialectal nuances in multi-label settings. A notable limitation of LID datasets is bias toward high-resource languages: English and similar Indo-European tongues are overrepresented relative to low-resource ones, leading to inflated metrics that fail to generalize to underrepresented languages and perpetuating inequities in global applications.
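The gap between accuracy and macro-averaged F1 on imbalanced data can be made concrete with a small sketch; the confusion matrix below is invented for illustration, with a dominant high-resource class and a poorly served minority class.

```python
def macro_f1(confusion):
    """Macro-averaged F1 from a confusion matrix.

    confusion[true_lang][pred_lang] = count. Each language's F1 is the
    harmonic mean of its precision and recall; macro-F1 averages them with
    equal weight, so minority languages count as much as majority ones.
    """
    langs = list(confusion)
    f1s = []
    for lang in langs:
        tp = confusion[lang][lang]
        fn = sum(confusion[lang][p] for p in langs if p != lang)
        fp = sum(confusion[t][lang] for t in langs if t != lang)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Invented imbalanced evaluation: 'en' dominates, 'sw' is a minority class.
confusion = {
    "en": {"en": 95, "de": 3, "sw": 2},
    "de": {"en": 5, "de": 40, "sw": 5},
    "sw": {"en": 6, "de": 2, "sw": 2},
}
total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[l][l] for l in confusion) / total
print(f"accuracy = {accuracy:.3f}")            # high, dominated by 'en'
print(f"macro-F1 = {macro_f1(confusion):.3f}")  # much lower: 'sw' drags it down
```

Here accuracy is about 0.86 while macro-F1 is about 0.66, because the minority language's poor F1 is averaged in with full weight, which is precisely why macro-F1 is preferred for multilingual benchmarks.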

Software Implementations

Several open-source libraries provide accessible implementations of language identification (LID), enabling developers to integrate LID into applications without proprietary dependencies. Langdetect, available in both Java and Python implementations, relies on an n-gram-based approach and supports over 55 languages, making it suitable for straightforward text processing tasks. Google's Compact Language Detector 3 (CLD3), an open-source neural model, offers compact inference code and supports detection across more than 100 languages, prioritizing efficiency for resource-constrained environments. Frameworks like spaCy and Hugging Face Transformers facilitate the integration of LID through extensible pipelines and pre-trained models, allowing customization for specific use cases. In spaCy, extensions such as spacy-language-detection enable seamless addition of LID components to workflows, leveraging underlying libraries for multi-language support. The Hugging Face Hub hosts numerous fine-tuned models for LID, including those based on architectures like XLM-RoBERTa, which can be deployed to detect dozens of languages with high accuracy. A notable example is FastText's LID model from 2017, which uses linear classifiers on subword n-grams to identify 176 languages and remains widely adopted for its balance of speed and coverage. Commercial offerings provide scalable, managed LID services with additional features like confidence scoring and enterprise integration. Microsoft's Azure Text Analytics API delivers real-time LID for unstructured text, returning language identifiers along with confidence scores for over 100 languages, optimized for cloud-based applications. IBM Watson Natural Language Understanding includes LID in its API suite, supporting detection in multiple languages for large-scale text analysis in business contexts. When deploying LID software, considerations such as latency and licensing play critical roles. For instance, lightweight models like FastText achieve latencies under 50 milliseconds for short texts on standard hardware, ensuring responsiveness in interactive systems. Open-source licensing often permits free use under permissive terms, but extending coverage to low-resource languages may require custom training data, potentially involving additional compliance with data usage policies in commercial frameworks.

References

  1. [1]
    Spoken language identification: An overview of past and present ...
    This paper reviews modern methods of automatic language identification. It examines what information in speech helps to distinguish among languages.
  2. [2]
    [PDF] Language Identification: The Long and the Short of the Matter
    Language identification is the task of identify- ing the language a given document is written in. This paper describes a detailed examina-.Missing: definition | Show results with:definition
  3. [3]
    Automatic Language Identification in Texts | Computational Linguistics
    Mar 15, 2025 · Language identification (LI) for text data, in the ideal scenario, determines the human languages used at every location in a corpus.Missing: definition | Show results with:definition
  4. [4]
    Language identification in the limit - ScienceDirect.com
    A class of possible languages is specified, together with a method of presenting information to the learner about an unknown language, which is to be chosen ...
  5. [5]
    Spoken language identification: : An overview of past and present ...
    Feb 1, 2025 · This paper reviews modern methods of automatic language identification. It examines what information in speech helps to distinguish among languages.
  6. [6]
    [PDF] Visual Script and Language Identification - arXiv
    Jan 8, 2016 · Abstract—In this paper we introduce a script identification method based on hand-crafted texture features and an artificial neural network.
  7. [7]
    What is language detection in Azure AI Language? - Microsoft Learn
    Aug 20, 2025 · Script detection: To distinguish between multiple scripts used to write certain languages, such as Kazakh, language detection returns a script ...
  8. [8]
    LanideNN: Multilingual Language Identification on Character Window
    Monolingual language identification assumes that the given document is written in one language. In multilingual language identification, the document is usually ...
  9. [9]
    [PDF] Automatic Detection & Language ID of Multilingual Documents
    Language identification techniques commonly assume that ev- ery document is written in one of a closed set of known languages for which there is training data,.
  10. [10]
    [1707.04817] Open-Set Language Identification - arXiv
    Jul 16, 2017 · Abstract:We present the first open-set language identification experiments using one-class classification.Missing: multilingual | Show results with:multilingual
  11. [11]
    [PDF] Societal Impacts of Language Technology: How to Work with Known ...
    Apr 4, 2024 · • Language ID systems don't identify my dialect. • … Social-media based disease warning systems fail to work in my community (Jurgens et al ...
  12. [12]
    How does Automatic Speech Recognition Navigate ... - Gladia
    Sep 24, 2024 · Language detection leverages deep learning models trained on vast amounts of multilingual audio data, analyzes the incoming speech to identify ...
  13. [13]
    Google Translate
    Google's service, offered free of charge, instantly translates words, phrases, and web pages between English and over 100 other languages.
  14. [14]
    [PDF] Native Language Identification Improves Authorship Attribution
    This study investigates the integration of native language identification into authorship attribution, a previously unexplored aspect ...
  15. [15]
    Identify Languages - Digital Accessibility at Princeton
    Use the WAVE tool to scan the page. It will mark any text assigned a language tag with a globe icon. Click the icon for further information.
  16. [16]
    The 2025 Nimdzi 100
    We estimate that the language services industry, with a 5.6% growth, reached USD 71.7 billion in 2024 and project it to grow to USD 75.7 billion in 2025.
  17. [17]
    Workshop 297 Report: Digital Inclusion Through a Multilingual Internet
    Jun 7, 2024 · Out of 7,000 languages, only about ten languages have “any substantial online presence.”7 This report summarizes the discussions during ...
  18. [18]
    AI-Detectors Biased Against Non-Native English Writers | Stanford HAI
    May 15, 2023 · According to the study, all seven AI detectors unanimously identified 18 of the 91 TOEFL student essays (19%) as AI-generated and a remarkable ...
  19. [19]
  20. [20]
    [PDF] Basic-and-Historical-Cryptography.pdf
    • Cryptography - study of encryption principles/methods. • Cryptanalysis (codebreaking) - the study of principles/methods of deciphering ciphertext without ...
  21. [21]
    [PDF] Statistical Techniques for Language Recognition
    Feb 25, 1993 · We explain how to apply statistical techniques to solve several language-recognition problems that arise in cryptanalysis and other domains.
  22. [22]
  23. [23]
    [PDF] Language Identification by Statistical Analysis - DTIC
    An analysis was conducted of English and Spanish text. The statistical analysis determined the independent probability of letters and the joint probability of ...
  24. [24]
    Language Identifier: A Computer Program for Automatic Natural ...
    Beesley Address from 1988: Automated Language Processing Systems (a.l.p. ... probabilities for 3-grams, 4-grams, etc. In doing so, the traditional ...
  25. [25]
    Europarl: A Parallel Corpus for Statistical Machine Translation
    We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web.
  26. [26]
    [PDF] Source Language Markers in EUROPARL Translations
    This paper shows that it is very often possible to identify the source language of medium-length speeches in the EUROPARL corpus on the basis of fre...
  27. [27]
    [PDF] Overview for the First Shared Task on Language Identification in ...
    Oct 25, 2014 · The main goal of this language identification shared task is to increase awareness of the outstanding challenges in the automated processing of ...
  28. [28]
  29. [29]
    [PDF] N-Gram-Based Text Categorization
    Another approach to language classification involves the use of N-gram analysis. The basic idea is to identify N-grams whose occurrence in a document gives ...
  30. [30]
    [PDF] Automatic Language Identification in Texts: A Survey
    Abstract. Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in.
  31. [31]
    [PDF] Using Character Ngrams for Word-Level Language Identification in ...
    It is also based on a LIBLinear L2-regularized logistic regression model (dual, -s 7) for classification, but takes as input not the character grams, but ...
  32. [32]
  33. [33]
  34. [34]
    Unsupervised Cross-lingual Representation Learning at Scale - arXiv
    Nov 5, 2019 · Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average ...
  35. [35]
    From N-grams to Pre-trained Multilingual Models For Language ...
    In this paper, we investigate the use of N-gram models and Large Pre-trained Multilingual models for Language Identification (LID) across 11 South African ...
  36. [36]
    Lexical simplification benchmarks for English, Portuguese, and ...
    According to Ethnologue lexical similarity between Spanish and Portuguese is about 89%. On the other hand, although the procedures to collect the ...
  37. [37]
    [PDF] Revisiting Common Assumptions about Arabic Dialects in NLP
    Jul 27, 2025 · Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about ...
  38. [38]
    A Perceptual Phonetic Similarity Space for Languages
    The goal of the present study was to devise a means of representing languages in a perceptual similarity space based on their overall phonetic similarity.
  39. [39]
    [PDF] Experiments in Sentence Language Identification with Groups of ...
    In this paper we consider the task of classifying short segments of text in closely-related languages for the Discriminating Similar Languages shared task, ...
  40. [40]
    [PDF] A Benchmark for Discriminating between Bosnian, Croatian ...
    In this paper, we introduce the BENCHic-lang benchmark for discriminating between four very similar languages: Bosnian, Croatian, Montenegrin and Serbian.
  41. [41]
    [PDF] Discriminating between Indo-Aryan Languages Using SVM ...
    Aug 20, 2018 · In the four editions of the DSL shared task a variety of computation methods have been tested. This includes Maximum Entropy (Porta and Sancho, ...
  42. [42]
    [PDF] Whispering in Norwegian: Navigating Orthographic and Dialectic ...
    Feb 2, 2024 · This article introduces NB-Whisper, an adaptation of OpenAI's Whisper, specifically fine-tuned for Norwegian language.
  43. [43]
    [PDF] Geographically-Informed Language Identification - ACL Anthology
    May 20, 2024 · This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic ...
  44. [44]
    [PDF] Word-level Language Identification using CRF: Code-switching ...
    Oct 25, 2014 · We describe a CRF based system for word-level language identification of code-mixed text. Our method uses lexical ...
  45. [45]
    [PDF] DIALECTAL VARIATION IN SWAHILI – BASED ON THE DATA ...
    This study examines some lexical and morphosyntactic variation found among the Swahili varieties in Zanzibar, Tanzania. Swahili is spoken on the Eastern African ...
  46. [46]
    [PDF] Automatic Speech Recognition for African Low-Resource Languages
    Jul 31, 2025 · African languages are complex, described by rich morphology, tonal variation, and substantial dialectal diversity. These features, combined with ...
  47. [47]
    Ethnologue | Languages of the world
    More than 7,000 languages are spoken today. We explore exactly how many there are, their geographic distribution, and compare endangered languages with the ...
  48. [48]
    Indicators for the Presence of Languages in the Internet - OBDILCI
    Roughly 20% of Web content is in English and 19% is in Chinese · About 7.7% is in Spanish · Hindi, Russian, Arabic, French and Portuguese each make up around 3.5% ...
  49. [49]
    Robust Learning for Text Classification with Multi-source Noise ...
    Jul 15, 2021 · We propose a novel robust training framework which 1) employs simple but effective methods to directly simulate natural OCR noises from clean texts and 2) ...
  50. [50]
    [2401.04619] Language Detection for Transliterated Content - arXiv
    Jan 9, 2024 · This paper addresses this challenge through a dataset of phone text messages in Hindi and Russian transliterated into English utilizing BERT for language ...
  51. [51]
    (PDF) Comparative Evaluation of Sentiment Analysis Methods ...
    Nov 3, 2017 · PDF | Sentiment analysis in Arabic is challenging due to the complex morphology of the language. The task becomes more challenging when ...
  52. [52]
    [PDF] Overview of the DSL Shared Task 2015 - ACL Anthology
    (2014) used TED talks and reported 97% accuracy for discriminating between 25 languages. Yet, this is not a solved problem, and there are a number of scenarios ...
  53. [53]
    An Open Dataset and Model for Language Identification
    We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages, outperforming previous work.
  54. [54]
    VoxLingua107: a Dataset for Spoken Language Recognition - arXiv
    This paper investigates the use of automatically collected web audio data for the task of spoken language recognition.
  55. [55]
    Overview of the DSL Shared Task 2015 - ResearchGate
    This paper describes the submission made by the MMS team to the Discriminating between Similar Languages (DSL) shared task 2015. We participated in the ...
  56. [56]
    [PDF] Findings of the VarDial Evaluation Campaign 2022 - ACL Anthology
    Oct 16, 2022 · This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2022. The campaign is part ...
  57. [57]
    [PDF] The AI Language Gap | Cohere
    The language gap in AI means that speakers of low-resource languages face a growing divide in the availability of high-quality language models and the resources ...
  58. [58]
    langdetect - PyPI
    This library is a direct port of Google's language-detection library from Java to Python. All the classes and methods are unchanged.
  59. [59]
    google/cld3 - GitHub
    Jun 15, 2024 · CLD3 is a neural network model for language identification. This package contains the inference code and a trained model.
  60. [60]
    spacy-language-detection - PyPI
    Sep 8, 2021 · Spacy_language_detection is a fully customizable language detection for spaCy pipeline forked from spacy-langdetect in order to fix the seed problem.
  61. [61]
    papluca/xlm-roberta-base-language-detection - Hugging Face
    Jul 15, 2022 · The model was fine-tuned on the Language Identification dataset, which consists of text sequences in 20 languages. The training set contains 70k ...
  62. [62]
    Language identification - fastText
    Oct 2, 2017 · A fast and accurate tool for text-based language identification. It can recognize more than 170 languages, takes less than 1MB of memory and can classify ...
  63. [63]
    IBM Watson Natural Language Understanding
    Watson Natural Language Understanding is an API that uses machine learning to extract meaning and metadata from unstructured text data.
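
The rank-profile character n-gram method named in entry [29] (N-Gram-Based Text Categorization) can be sketched in a few lines of plain Python. This is a minimal illustration, not a faithful reimplementation: the training strings below are tiny invented samples, and real systems use large per-language corpora and longer profiles.

```python
# Sketch of the "out-of-place" rank-profile n-gram idea from
# N-Gram-Based Text Categorization (entry [29]).
# Training texts here are toy illustrative samples, not real corpora.
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the text's most frequent character n-grams, ranked."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language
    profile incur a maximum penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank.get(g, penalty))
               for i, g in enumerate(doc_profile))

def identify(text, profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "es": ngram_profile("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}
print(identify("the dog and the fox", profiles))  # "en" on this toy data
```

Short English trigrams such as " th", "the", and "he " dominate the English profile and never appear in the Spanish one, which is why even this toy version separates the two; the off-the-shelf tools listed above (langdetect, CLD3, fastText) apply far more refined versions of the same principle.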