
Lemmatization

Lemmatization is a fundamental process in natural language processing (NLP) that reduces inflected or irregularly derived words to their base or dictionary form, known as the lemma, which represents the canonical or citation form of the word. Unlike stemming, which applies heuristic rules to truncate words to a common root regardless of linguistic validity, lemmatization relies on morphological analysis, part-of-speech tagging, and contextual information to ensure the output is a genuine dictionary entry. For instance, the inflected forms "running," "runs," and "ran" are all mapped to the lemma "run," while "better" lemmatizes to "good" based on its comparative derivation. The technique is an essential normalization step in NLP pipelines, where it standardizes vocabulary to improve the efficiency and accuracy of downstream tasks such as information retrieval, text classification, and machine translation. By grouping morphological variants under a single lemma, lemmatization reduces data sparsity and dimensionality in feature representations, enabling more robust models, particularly in morphologically rich languages that exhibit extensive inflectional paradigms. The process typically employs lexicon-based methods, such as those leveraging resources like WordNet, or rule-based systems that incorporate syntactic context to resolve ambiguities, as outlined in foundational frameworks. In practice, lemmatization algorithms, including those implemented in libraries like NLTK's WordNetLemmatizer, first perform part-of-speech tagging to disambiguate forms, for example treating "saw" as a verb (lemma "see") rather than a noun ("saw"), before applying transformation rules derived from linguistic knowledge. Its applications extend to search engines, where it enhances query matching by normalizing user inputs, and to large-scale text mining, where it supports consistent querying across datasets. Despite advances in neural contextual embeddings that can implicitly handle morphological variation, explicit lemmatization remains valuable for interpretable preprocessing and for performance in low-resource scenarios.

Fundamentals

Definition

Lemmatization is the process of reducing the inflected or derived forms of words to their base or dictionary form, known as the lemma, typically by considering the part-of-speech (POS) context of the word. This normalization technique groups variant word forms, such as plurals, tenses, or comparative adjectives, into a single base representation, facilitating consistent analysis in NLP tasks. For instance, it transforms "running" into "run" or "better" into "good," ensuring that semantically related variants are treated uniformly without altering the word's core meaning. In lexicography, the lemma is defined as the base or citation form of a word: it appears as the headword in a dictionary and represents all of its inflected variants. Lemmatization draws on morphology, the study of word structure, to handle inflectional morphology, such as adding "-s" for plurals or "-ed" for past tenses, and, in some cases, derivational changes. This process ensures that words are mapped to valid dictionary entries rather than arbitrary truncations, promoting accuracy in linguistic computations. Lemmatization evolved from foundational work in morphological processing during early machine translation projects, where normalizing word forms was essential for handling linguistic variation across languages. A basic workflow for lemmatization involves inputting a word along with its POS tag to determine the appropriate lemma; for example, "saw" lemmatizes to "see" when tagged as a verb (past tense of "to see") but remains "saw" as a noun (referring to a cutting tool), as illustrated in the sketch below. This POS dependency distinguishes lemmatization from simpler methods like stemming, which may not account for contextual meaning.
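This workflow can be illustrated with NLTK's WordNetLemmatizer; a minimal sketch, assuming NLTK and its WordNet data are installed (run nltk.download('wordnet') once beforehand):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The same surface form yields different lemmas depending on the POS tag supplied.
print(lemmatizer.lemmatize("saw", pos="v"))      # see (past tense of "to see")
print(lemmatizer.lemmatize("saw", pos="n"))      # saw (the cutting tool)
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good (irregular comparative)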

Comparison to Stemming

Stemming is a process in NLP that reduces words to their root or base form by truncating suffixes, aiming to normalize related word variants for tasks like information retrieval. A seminal example is the Porter stemmer, introduced in 1980, which applies a series of rules to remove common English suffixes such as "-ing", "-ed", and "-s" to produce a common stem, often without regard for linguistic validity. In contrast to lemmatization, which relies on vocabulary analysis and morphological rules to map words to their canonical dictionary form (lemma) while considering context such as part of speech, stemming operates through simpler, rule-based truncation that can produce non-words. For instance, stemming might reduce "university" to "univers", an invalid form, whereas lemmatization preserves it as "university" since it is already the base noun. This makes stemming faster and less computationally intensive but prone to over-stemming (conflating unrelated words, e.g., "university" and "universe") and under-stemming (failing to group related words). Lemmatization, often requiring part-of-speech tagging for accuracy, produces valid words but demands more resources. Lemmatization typically achieves higher accuracy in normalization tasks because of its morphological analysis, often yielding modest improvements over stemming in English benchmarks. Stemming, while approximate, suffices for reducing vocabulary size but introduces errors in semantically sensitive applications. Stemming is therefore preferred for rapid indexing in large-scale search engines, where speed and reduced index size are critical, while lemmatization suits semantic tasks that require exact, meaningful forms, as illustrated by the table and code sketch below.
Input Word     Stemming Output (Porter)     Lemmatization Output
studies        studi                        study
studying       studi                        study
feet           feet                         foot
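These outputs can be reproduced with NLTK, which ships both a Porter stemmer and a WordNet-based lemmatizer; a minimal sketch, assuming the WordNet data is installed, with the verb POS passed explicitly so that "studying" maps to "study":

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, pos in [("studies", "v"), ("studying", "v"), ("feet", "n")]:
    # Stemming truncates suffixes; lemmatization returns a valid dictionary form.
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos=pos))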

Methods

Dictionary-Based Approaches

Dictionary-based approaches to lemmatization rely on pre-built electronic dictionaries or lexical databases that map inflected word forms to their canonical base forms, known as lemmas. These methods are particularly effective for resource-rich languages like English, where comprehensive dictionaries exist. A seminal example is WordNet, a lexical database developed at Princeton University starting in the late 1980s, which organizes English words into synsets (sets of cognitive synonyms) and explicitly links morphological variants, such as "dogs" to "dog" or "better" to "good," to facilitate lemmatization. The process begins with tokenizing the input text into individual words. Each token is then looked up in the dictionary using exact matching or, in more advanced implementations, fuzzy matching to handle minor variations. To disambiguate lemmas, especially for words with multiple senses across parts of speech (POS), the approach incorporates POS information; for instance, "saw" as a verb lemmatizes to "see," while as a noun it remains "saw." Princeton WordNet, with its approximately 117,000 synsets covering nouns, verbs, adjectives, and adverbs, supports this by providing POS-specific morphological derivations. Early implementations in the late 1970s and early 1980s involved converting printed dictionaries into machine-readable formats, such as the Longman Dictionary of Contemporary English (1978), which was made available in machine-readable form around 1983 and provided structured lexical data for early lexical-processing tasks. These approaches offer high accuracy for words present in the dictionary because of the explicit mapping of forms to lemmas. They are also straightforward to implement, as seen in libraries like NLTK's WordNetLemmatizer, which leverages WordNet for efficient lookups without requiring custom rule development. However, limitations include ineffective handling of out-of-vocabulary words, such as proper nouns (e.g., "Einstein" remains unchanged) or rare inflections not covered in the dictionary, which simply fall back to the original form. Additionally, these methods demand storage for the lexical resource, with the WordNet database occupying around 12 MB in compressed form. A simple representation of the lookup process is as follows:
# Toy form-to-lemma dictionary keyed by surface form, then POS (illustrative only).
dictionary = {"dogs": {"n": "dog"}, "saw": {"v": "see", "n": "saw"}, "better": {"a": "good"}}

def lemmatize(word, pos="n"):
    # Return the stored lemma only when both the form and its POS are covered.
    if word in dictionary and pos in dictionary[word]:
        return dictionary[word][pos]
    return word  # fallback to the original form for out-of-vocabulary words
This illustrates the reliance on dictionary presence and POS matching, with morphological analysis sometimes serving as a complementary step for unresolved cases.
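A brief usage sketch of the lookup function above, using the illustrative toy dictionary rather than a real lexicon, shows both a covered form and the out-of-vocabulary fallback:

print(lemmatize("dogs", "n"))      # dog (covered form)
print(lemmatize("saw", "v"))       # see (POS-disambiguated entry)
print(lemmatize("Einstein", "n"))  # Einstein (OOV proper noun falls back unchanged)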

Rule-Based and Morphological Methods

Rule-based and morphological methods for lemmatization rely on linguistic knowledge encoded as explicit rules and analyzers to decompose words into their base forms, and are particularly suited to handling inflection in morphologically rich languages. Morphological analysis involves breaking words down into morphemes, such as roots, prefixes, and suffixes, using finite-state transducers (FSTs), which model the mapping between surface forms and underlying lexical representations. These transducers operate via transition functions that process input symbols to produce output, formally defined as \delta(q, a) = (q', b), where q is the current state, a the input symbol, q' the next state, and b the output symbol. Xerox developed practical FST tools such as lexc, twolc, and xfst in the 1990s for building such analyzers, enabling efficient morphological parsing and generation. The foundations of these methods trace back to 1980s computational morphology, notably Kimmo Koskenniemi's two-level morphology model introduced in 1983, which uses parallel rules to constrain morpheme combinations at the lexical and surface levels for both recognition and production. In rule-based lemmatization, hand-crafted rules target inflectional patterns, such as stripping the "-s" suffix from plural nouns (e.g., "cats" to "cat") or handling irregular verbs like "went" to "go" through dedicated transformations. These rules often integrate with part-of-speech (POS) taggers to disambiguate forms, ensuring context-aware lemma selection. Dictionary lookups serve as a baseline for validating rule outputs, confirming canonical forms against lexical entries. The lemmatization process typically parses the word's structure to identify morphemes, applies reversal rules to strip affixes and restore the base, and selects the canonical lemma, with exception lists managing irregularities not captured by general patterns; a small sketch of this strategy follows below. Open-source tools like Hunspell, an affix-rule-based morphological analyzer and spell checker developed in the early 2000s, exemplify this approach by supporting lemmatization alongside spell-checking for multiple languages. Such methods excel in languages with complex morphology, such as Finnish and German, achieving accuracies of 85-95% on standard benchmarks due to their explicit handling of inflectional paradigms.
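The following is a minimal Python sketch of this strategy, combining an exception list for irregular forms with ordered suffix-stripping rules and a dictionary check; the word list and rules are illustrative only, not a real analyzer:

# Illustrative resources; a production system would use a full lexicon and rule set.
LEXICON = {"cat", "go", "run", "study", "walk"}
EXCEPTIONS = {"went": "go", "ran": "run", "feet": "foot"}
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]

def rule_based_lemmatize(word):
    # Irregular forms are resolved through an explicit exception list.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # Apply ordered affix-reversal rules and validate candidates against the lexicon.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word  # no rule produced a known lemma; return the surface form

print(rule_based_lemmatize("cats"))    # cat
print(rule_based_lemmatize("walked"))  # walk
print(rule_based_lemmatize("went"))    # go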

Machine Learning Techniques

Machine learning techniques for lemmatization primarily rely on data-driven models trained on annotated corpora to predict the base form of words, often integrating part-of-speech (POS) information to resolve ambiguities. Early statistical approaches, emerging in the 1990s, utilized Hidden Markov Models (HMMs) trained on resources like the Penn Treebank, a corpus of over 4.5 million words of American English with POS and syntactic annotations, to perform sequence labeling for morphological analysis, including lemma prediction through probabilistic transitions between word forms and tags. These models treated lemmatization as a hidden-state inference problem, estimating emission probabilities for inflected forms and transition probabilities across possible lemmas, enabling scalable processing of unseen words via Viterbi decoding. Post-2010 advancements shifted toward neural architectures, with recurrent neural networks (RNNs) and long short-term memory (LSTM) units enabling context-aware lemmatization by capturing sequential dependencies in sentences. For instance, bidirectional LSTMs (biLSTMs) encode input sequences of inflected words and their tags, followed by a decoder that generates lemmas autoregressively. BERT-based lemmatizers, which appeared soon after BERT's 2018 release, leverage transformer encoders pretrained on multilingual corpora to produce contextual embeddings, fine-tuned for joint tagging and lemmatization, achieving superior handling of ambiguity and long-range dependencies compared to earlier RNNs. The training process for these models is typically supervised, using datasets of triples consisting of inflected forms, corresponding lemmas, and POS tags, often from Universal Dependencies (UD) treebanks, to optimize a cross-entropy loss over predictions for each token's lemma. A prominent example is the lemmatizer in spaCy, which employs convolutional neural networks (CNNs) within a multi-task framework to extract morphological features from character and word embeddings, yielding accuracies exceeding 97% on UD benchmarks for languages like English and over 95% for many others in multilingual settings; a usage sketch follows below. Multilingual extensions such as UDPipe, introduced in 2016, train joint models for tokenization, tagging, lemmatization, and parsing on UD treebanks covering more than 60 languages, incorporating transfer from high-resource languages to address low-resource challenges through shared representations and pretraining on related scripts. By 2023, integration with large language models (LLMs) such as those in the GPT family enabled zero-shot lemmatization, where prompts guide the model to infer lemmas without task-specific training data, demonstrating effectiveness on classical and low-resource languages via in-context learning.
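As a usage sketch, spaCy exposes its pipeline's lemmas through the token-level lemma_ attribute; this assumes spaCy and the small English model en_core_web_sm are installed:

import spacy

# Small English pipeline with a tagger and lemmatizer.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The mice were running faster than the cats")

# Each token carries a lemma predicted using its context and POS.
for token in doc:
    print(token.text, token.lemma_)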

Applications

General Natural Language Processing

Lemmatization serves as a critical step in natural language processing (NLP) pipelines, where it normalizes inflected word forms to their base or dictionary form, facilitating more effective tokenization and downstream analysis of text. By reducing morphological variation, such as converting "running," "runs," and "ran" to "run," lemmatization streamlines preprocessing and enables consistent representation of words across documents. This normalization is particularly valuable in handling the variability of natural language, where a single lemma can encompass dozens of surface forms, thereby enhancing the efficiency of subsequent tasks like feature extraction and model training. In core NLP tasks, lemmatization contributes to improved performance in information retrieval and text classification. For instance, in search engines like Elasticsearch, which has incorporated dictionary-based normalization akin to lemmatization since its initial release in 2010, it enhances recall by matching query terms to relevant document variants, allowing users to retrieve results for related inflections without exact matches. Similarly, in text classification, lemmatization reduces noise in feature vectors, often improving model accuracy by several percentage points through better generalization across word forms. Lemmatization also plays a supportive role in downstream applications, including named entity recognition (NER) and machine translation, by standardizing entity mentions and word alignments. In NER, it aids in resolving variations of entity names or terms, such as linking "companies" and "company" to a unified form, which improves entity consistency and extraction precision in pipelines. For machine translation systems, lemmatization during preprocessing ensures consistent word alignments across languages, contributing to higher translation quality by mitigating inflectional mismatches in parallel corpora. Evaluation of lemmatization typically involves metrics such as accuracy, measured on datasets like those from the CoNLL shared tasks, where state-of-the-art models achieve accuracies exceeding 95% by correctly mapping inflected tokens to lemmas. Additionally, its impact extends to language modeling, where lemmatized inputs can reduce perplexity, indicating better predictive power, by shrinking the vocabulary and concentrating probability mass over fewer word forms; in n-gram and neural models this acts as an additional smoothing mechanism. Lemmatization is a standard component in widely used NLP libraries, reflecting its ubiquity in general text processing workflows. The Natural Language Toolkit (NLTK), initiated in 2001, integrates WordNet-based lemmatization as a core feature for vocabulary normalization in research and educational applications. Likewise, spaCy, released in 2015, embeds rule-based and lookup lemmatization directly into its efficient processing pipeline, supporting rapid deployment in production environments. A small example of the vocabulary-reduction effect appears below.
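The sketch uses NLTK's WordNet lemmatizer on a toy token list with hand-assigned POS tags (a real pipeline would obtain tags from a tagger):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokens = [("runs", "v"), ("running", "v"), ("ran", "v"), ("cats", "n"), ("cat", "n")]

surface_vocab = {word for word, _ in tokens}
lemma_vocab = {lemmatizer.lemmatize(word, pos=pos) for word, pos in tokens}

print(len(surface_vocab), sorted(surface_vocab))  # 5 distinct surface forms
print(len(lemma_vocab), sorted(lemma_vocab))      # 2 lemmas: ['cat', 'run']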

Specialized Domains like Biomedicine

In biomedical natural language processing, lemmatization addresses the unique challenges of specialized vocabularies, such as morphological variants of anatomical and pathological terms, while preserving acronyms and identifiers that lack inflectional forms. For instance, tools like BioLemmatizer normalize "tumors" to "tumor" to standardize references in clinical narratives, drawing on resources like the Unified Medical Language System (UMLS), developed by the National Library of Medicine starting in 1986 to integrate biomedical terminologies for consistent processing. The UMLS SPECIALIST Lexicon supports this by providing lexical variants and normalization, enabling accurate handling of domain-specific inflections without altering non-inflected elements like the acronym "CT" for computed tomography. Adaptations for biomedicine often involve custom dictionaries derived from medical ontologies, such as integrating SNOMED CT to expand coverage of clinical concepts in lemmatization pipelines. These ontologies facilitate the creation of specialized lexicons that account for hierarchical relationships in biomedical terms, improving morphological analysis in rule-based systems. Additionally, rule-based tweaks target the etymological structure of pharmacological nomenclature, which frequently derives from Latin and Greek roots; for example, BioLemmatizer employs inflectional rules to process variants like "phosphorylations" to "phosphorylation," adapting general English rules to biomedical derivations without extensive derivational decomposition. Key applications include clinical text mining for tasks like de-identifying electronic health records (EHRs), where lemmatization normalizes terms to reduce noise and enhance entity recognition while protecting patient privacy. In EHR processing, it supports the extraction of structured insights from unstructured notes, as seen in pipelines that preprocess clinical narratives for secondary analysis. For search enhancement, domain-specific lemmatization improves retrieval precision by standardizing query terms against biomedical corpora, with studies demonstrating substantial gains in concept matching through morphological normalization. Domain-specific challenges arise from neologisms and abbreviations, such as variants of "COVID-19" (e.g., "COVID" or "coronavirus"), which evolve rapidly and evade standard dictionaries, complicating lemmatization of real-time clinical data. Hybrid methods combining statistical models with domain lexicons address these by leveraging UMLS or SNOMED CT for context-aware normalization, boosting accuracy for ambiguous or novel terms in biomedical texts. Tools like MetaMap, developed by the National Library of Medicine in the 2000s, exemplify this through UMLS-based mapping that incorporates normalization steps akin to lemmatization, though evaluations on datasets like MIMIC-III highlight variability in performance for concept extraction (F-scores around 0.20-0.28). BioLemmatizer, in contrast, achieves approximately 97.5% accuracy on biomedical corpora like CRAFT, underscoring the efficacy of lexicon-integrated approaches. Post-2020 advancements extend these techniques to AI-driven drug discovery, where lemmatization preprocesses literature and patents to identify pharmacological candidates and repurpose existing drugs via text mining. In precision medicine pipelines, it enables the integration of unstructured biomedical data into predictive models, enhancing interoperability across ontologies like UMLS. A minimal sketch of a domain-adapted lookup lemmatizer appears below.
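In this sketch, a custom lexicon handles biomedical inflections while an acronym check leaves identifiers such as "CT" untouched; the lexicon here is a toy stand-in for resources like the UMLS SPECIALIST Lexicon:

# Toy domain lexicon mapping inflected biomedical forms to lemmas (illustrative only).
BIO_LEXICON = {"tumors": "tumor", "phosphorylations": "phosphorylation", "biopsies": "biopsy"}

def biomedical_lemmatize(token):
    # Preserve acronyms and identifiers, which have no inflectional forms.
    if token.isupper():
        return token
    # Prefer the domain lexicon; otherwise fall back to the surface form.
    return BIO_LEXICON.get(token.lower(), token)

for token in ["tumors", "CT", "phosphorylations", "EHR"]:
    print(token, "->", biomedical_lemmatize(token))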

Challenges and Advances

Limitations

One significant limitation of lemmatization arises from challenges in resolving lexical ambiguity, particularly when part-of-speech (POS) tagging errors occur, leading to incorrect lemma assignments. For instance, words like "lead" can function as a noun or a verb, and misidentification of the POS without sufficient contextual information can result in errors estimated at 10-20% in highly ambiguous cases across neural models. This issue is exacerbated in morphologically rich languages, where extensive inflectional variation demands precise disambiguation, yet standard pipelines achieve only 77-92% lemmatization accuracy due to unresolved homonymy; the sketch below shows how a wrong POS tag propagates into the lemma. Lemmatization also exhibits strong language dependency, performing poorly in low-resource languages owing to insufficient training data and morphological resources. In Swahili, for example, state-of-the-art syllable-based models achieve only about 81% accuracy in lemma discovery, while earlier approaches drop to 68%, highlighting the scarcity of annotated corpora that limits model generalization. Error analyses on benchmarks like Universal Dependencies (version 2.10, released in 2022) further reveal substantial variance across languages, with some showing error rates of only 2-4% for nouns and verbs while others exceed 11% segmentation and lemmatization error owing to dialectal variation. Computational costs represent another key constraint, especially for machine learning-based methods that rely on architectures like BERT for context-aware lemmatization. Inference with such models can be computationally intensive on standard CPU hardware, often requiring GPU resources for efficient processing and rendering them impractical for real-time applications on large-scale datasets. Dictionary-based approaches, while less resource-intensive per word, scale poorly for massive text collections in the big data era, as exhaustive lookups become bottlenecks in distributed systems handling billions of tokens, often requiring custom optimizations such as parallelization to maintain throughput. Edge cases further undermine lemmatization reliability, including the handling of compound words, slang, and code-switching, where standard algorithms struggle to preserve semantic integrity. In German, compound nouns like "Eiscreme" (ice cream) often yield incomplete or erroneous lemmas during tagging and decomposition, producing fragments rather than proper base forms. Similarly, slang terms or code-switched phrases mixing English with another language evade dictionary coverage, while over-normalization risks conflating distinct entities, for example lemmatizing a proper name as a common noun when context is overlooked. These issues persist across methods, with morphological approaches offering only partial mitigation through rule extensions but failing in highly creative or non-standard usage.
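The POS-ambiguity failure mode can be demonstrated with NLTK's WordNet lemmatizer, where a missing or wrong POS tag silently yields a different lemma (assuming the WordNet data is installed):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without an explicit POS tag the lemmatizer defaults to the noun reading.
print(lemmatizer.lemmatize("meeting"))           # meeting (noun reading)
print(lemmatizer.lemmatize("meeting", pos="v"))  # meet (verb reading)

# A mistagged token propagates the error into the lemma.
print(lemmatizer.lemmatize("saw", pos="n"))      # saw (misses the verb lemma "see")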

Recent Developments

Post-2020 innovations in lemmatization have increasingly leveraged transformer architectures to enhance multilingual capabilities and handle morphologically complex languages. Fine-tuned models such as ByT5, a byte-level variant of the T5 transformer, have demonstrated superior performance in lemmatizing ancient cuneiform languages, achieving 80.55% accuracy on raw lemmas compared to 77.38% for the multilingual mT5 model, which supports over 100 languages through pre-training on diverse corpora. This advancement stems from ByT5's ability to bypass language-specific tokenization, effectively managing orthographic variation and diacritics without extensive preprocessing. Similarly, the GliLem hybrid system for Estonian integrates a transformer-based disambiguation model with a rule-based morphological analyzer, attaining 97.7% lemmatization accuracy on benchmark datasets and surpassing traditional HMM-based methods by approximately 10%. Hybrid approaches combining rule-based systems with large language models (LLMs) have emerged as a solution for zero-shot lemmatization in low-resource and historical settings. Adaptations of GPT variants, such as GPT-4o, applied in few-shot scenarios to historical languages yield accuracies ranging from 65.6% to 94.2%, outperforming earlier RNN baselines in morphologically rich contexts by exploiting contextual understanding without annotated data. These methods enable lemma generation for under-resourced dialects by prompting LLMs to infer morphological rules, addressing gaps left by traditional dictionary-based tools. Efficiency gains have been realized through knowledge distillation applied to transformers, reducing computational demands for practical deployment. Distilled variants like DistilBERT, which is roughly 40% smaller than BERT while retaining about 97% of its performance, enable faster inference in NLP pipelines, including lemmatization, running about 60% faster on resource-constrained devices. Recent distillation frameworks for transformers preserve up to 98% of accuracy in morphological tasks while minimizing memory footprint, making them suitable for integration in production environments. Emerging trends include unsupervised lemmatization via self-training on unannotated corpora, often powered by LLMs to generate lemmas without domain-specific resources. For instance, zero-resource approaches using generative LLMs achieve competitive results by inferring morphological transformations from web-scale text, enabling adaptation to new languages or dialects with minimal supervision. This self-training paradigm, combined with contextual embeddings, supports deployment in mobile applications via lightweight models optimized for on-device processing. A 2024 study on Estonian lemmatization showed transformer-enhanced systems outperforming rule-based methods by approximately 10% on benchmark datasets, particularly in contextual disambiguation. These shifts underscore the transition from rigid morphological rules to flexible, data-driven models. Recent machine learning integrations in lemmatization also raise ethical concerns, notably bias in normalizing diverse dialects, where models trained on standard corpora underperform on non-dominant variants, perpetuating linguistic inequities. Efforts to mitigate this include diverse dataset curation and fairness audits to ensure equitable lemma representation across dialects.
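As a rough illustration of the fine-tuned sequence-to-sequence approach, the sketch below loads a byte-level model through the Hugging Face transformers library and generates a lemma for an inflected form; the checkpoint name and the "lemmatize:" prompt format are placeholders, since an actual system would first be fine-tuned on (inflected form, lemma) pairs:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical fine-tuned checkpoint; plain ByT5 cannot lemmatize out of the box.
MODEL_NAME = "your-org/byt5-lemmatizer"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Byte-level tokenization avoids language-specific vocabularies and handles diacritics.
inputs = tokenizer("lemmatize: better", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))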
