Linguistic categories

Linguistic categories are the classes into which linguistic units, such as words, morphemes, and phrases, are grouped based on shared properties, enabling systematic description and cross-linguistic comparison of languages. These categories form the foundational framework for analyzing language structure across its phonological, morphological, syntactic, and semantic components. At the core of linguistic categories are lexical categories, also known as parts of speech, which include nouns (denoting entities like "cat"), verbs (expressing actions or states like "run"), adjectives (describing properties like "happy"), and adverbs (modifying other elements like "quickly"). These categories are distinguished by their syntactic distribution, morphological behavior, and semantic roles, rather than rigid boundaries, and they contrast with functional categories such as determiners or prepositions that serve grammatical rather than content-bearing functions. Grammatical categories, meanwhile, overlay lexical ones to encode relations like tense (past, present, future), aspect (completed or ongoing action), number (singular, plural), gender (masculine, feminine, neuter), and case (nominative, accusative). Such categories often intersect, as in noun systems where definiteness and number combine to yield forms like "the cats" in English, highlighting their role in inflectional morphology. The delineation of linguistic categories draws on multiple criteria, including morphological (inflectional patterns), syntactic (phrase-building rules), and semantic (meaning contributions), with ongoing debates in linguistic typology about their universality versus language-specific variation. For instance, while major lexical categories like nouns and verbs appear in nearly all languages, their exact definitions and additional categories (e.g., classifiers in some Asian languages) differ typologically.
This framework underpins key subfields: phonological categories organize sounds (e.g., vowels, consonants), syntactic categories structure sentences (e.g., subject, object), and semantic categories handle meaning relations (e.g., thematic roles like agent or patient). Understanding these categories is crucial for language acquisition research, typological studies, and computational modeling, as they reveal how humans encode and process complex communication systems.

Fundamentals

Definition and scope

Linguistic categories refer to abstract classes that group linguistic units, such as words, morphemes, or phrases, based on shared properties including syntactic distribution, morphological patterns, and semantic roles. These categories facilitate the systematic analysis of language structure by identifying patterns of behavior that recur across units within a class, such as nouns typically serving as subjects or objects in syntactic constructions and inflecting for number in morphological paradigms. For instance, parts of speech like nouns and verbs represent foundational lexical categories distinguished by these intertwined properties. The scope of linguistic categories encompasses multiple dimensions of analysis, including morphological (e.g., inflectional paradigms for number or case), syntactic (e.g., argument structure and phrase-building rules), semantic (e.g., aspectual or thematic interpretations), and phonological (e.g., prosodic or tonal features associated with specific classes). A prominent example is tense, a verbal category that encodes temporal relations in the anchoring layer of clausal structure, influencing how events are situated relative to the moment of speech. These categories operate across levels of linguistic organization, from individual morphemes to entire utterances, providing a framework for understanding how languages encode meaning and form. Linguistic categories can be distinguished as universal or language-specific, with the former representing abstract comparative concepts applicable across languages for typological analysis, and the latter comprising descriptive categories tailored to the particular grammatical systems of individual languages. This distinction underscores the tension between innate universals in human language capacity and the relativity of categorization shaped by cultural and structural factors, as highlighted in Benjamin Lee Whorf's work on how obligatory grammatical features in a language foster distinct habitual thought patterns.
Whorf's ideas, part of the broader Sapir-Whorf hypothesis, emphasize that language-specific categories may influence cognition by delimiting conceptual boundaries, such as varying systems of spatial reference or evidentiality. By establishing these classificatory frameworks, linguistic categories enable cross-linguistic comparison through standardized comparative concepts, allowing researchers to identify similarities and divergences in how languages organize phenomena like possession or causation without presupposing identical inventories. This approach supports generalizations about language universals while respecting the diversity of descriptive categories, fostering advancements in linguistic typology and language description.

Historical development

The concept of linguistic categories originated in ancient Greek and Roman grammars, where scholars sought to classify elements of language for pedagogical and analytical purposes. In the late 2nd century BCE, Dionysius Thrax, an Alexandrian grammarian, outlined the foundational system of eight parts of speech in his treatise Tékhnē grammatikḗ (The Art of Grammar), including noun, verb, participle, article, pronoun, preposition, adverb, and conjunction; this framework, influenced by earlier Stoic and Aristotelian ideas, emphasized morphological and syntactic distinctions to aid in the interpretation of Homeric texts. Roman grammarians like Priscian in the 6th century CE adapted and expanded this model in works such as Institutiones Grammaticae, preserving it through Latin scholarship and establishing parts of speech as an enduring classical framework for categorizing linguistic units. During the medieval and Renaissance periods, linguistic categorization evolved amid scholastic traditions and renewed interest in classical texts, shifting toward more philosophical interpretations. In the 12th century, scholars like Peter Helias integrated Aristotelian logic into grammatical theory, refining categories to reflect semantic roles alongside form. The early modern period brought a rationalist turn, exemplified by the Grammaire générale et raisonnée (Port-Royal Grammar) of 1660 by Antoine Arnauld and Claude Lancelot, which posited that grammatical categories derive from universal mental structures inherent to human reason, reducing traditional parts of speech to three primary classes—noun, verb, and particle—based on their expression of thought modes like substance, mode, and modification. This approach influenced later linguistics by prioritizing innate cognitive principles over empirical variation in languages. The 19th and early 20th centuries marked a transition to structuralism and formal theorizing, emphasizing systematic relations over historical change.
Ferdinand de Saussure's Cours de linguistique générale (1916), compiled posthumously from his lectures, revolutionized categorization by distinguishing langue (the abstract system) from parole (individual use) and introducing binary oppositions like synchronic/diachronic, treating linguistic categories as relational signs within a self-contained structure rather than isolated entities. Building on this, Noam Chomsky's Syntactic Structures (1957) advanced generative grammar, proposing feature-based categories (e.g., ±noun, ±verb) within phrase structure rules and transformations to capture universal syntactic patterns, shifting focus from surface forms to underlying structure. Post-1960s developments highlighted typological and empirical approaches, countering universalist biases with cross-linguistic data. Joseph Greenberg's 1963 paper "Some Universals of Grammar with Particular Reference to the Order of Meaningful Elements," presented at a 1961 conference and later published, identified 45 implicational universals based on 30 diverse languages, emphasizing probabilistic patterns in word order and morphology to foster comparative typology over rigid hierarchies. By the late 1980s, the rise of corpus linguistics necessitated standardized inventories for machine processing; projects like the Penn Treebank (initiated in 1989) adapted classical categories into tagsets for part-of-speech annotation, driven by needs in natural language processing and parser development. This era bridged linguistic theory with practical tools, promoting reusable frameworks for multilingual analysis.

Types of Linguistic Categories

Grammatical categories

Grammatical categories refer to sets of syntactic features within a language's grammar that express meanings from the same conceptual domain and occur in paradigmatic contrast to one another, often manifesting as obligatory inflections on words. These categories encode abstract properties such as gender, number, and case, which are typically marked morphologically on nouns, pronouns, and verbs to indicate their syntactic roles and relationships in a sentence. For instance, in many Indo-European languages, nouns inflect for gender (masculine, feminine, neuter) and number (singular, plural), while verbs agree in person (first, second, third) with their subjects. These features are not merely semantic but serve structural functions, ensuring grammatical agreement and coherence across phrases. Key examples of grammatical categories include those related to verbal inflection, such as tense, aspect, and mood. Tense distinguishes the time of an event relative to the moment of speaking, commonly divided into past, present, and future; for example, regular verbs like "walk" become "walked" in the past tense to signal completion before the present. Aspect, in contrast, conveys the internal temporal structure of the event, with categories like perfective (viewing the action as bounded or completed, e.g., Spanish "hablé" for "I spoke") and imperfective (emphasizing ongoing or habitual action, e.g., "hablaba" for "I was speaking"). Mood indicates the speaker's attitude toward the proposition, such as indicative for factual statements (e.g., "She runs") or subjunctive for hypothetical or non-real scenarios (e.g., Latin "currat" for "she may run"). These categories intersect to form complex verbal paradigms, as seen in highly inflected languages, where a single root can yield dozens of forms combining tense, aspect, and mood. Cross-linguistic variation is prominent in grammatical categories, particularly in case systems that mark the grammatical function of noun phrases.
In accusative-alignment languages, such as English or Latin, the subject of both transitive and intransitive verbs receives nominative case, while the object of transitives takes accusative; this patterns the single argument of intransitives (S) with the transitive subject (A), treating them as "agents" or "topics." Conversely, ergative-alignment languages like Basque or Dyirbal mark the transitive subject (A) with ergative case and pattern it differently from the intransitive subject (S) and transitive object (P), which share absolutive case; here, S and P are aligned as "patients" or "absolutives." Such variations reflect diverse strategies for signaling syntactic roles, with split-ergative systems (e.g., in Hindi) combining both patterns based on tense or aspect, highlighting how grammatical categories adapt to a language's overall morphosyntactic architecture. Theoretical frameworks for understanding grammatical categories emphasize their oppositional structure. Roman Jakobson's 1932 markedness theory posits that categories often form binary oppositions where one member is unmarked (simpler, more frequent, default) and the other marked (more specified, complex); for example, in case systems, the nominative or absolutive may be unmarked relative to cases like genitive or ergative. This approach, initially applied to the Russian verb, explains asymmetries in category realization, such as the tendency for unmarked forms to appear in neutral contexts across languages. Jakobson's ideas have influenced typological studies, underscoring how markedness captures universal tendencies in category encoding while accommodating variation.

Lexical categories

Lexical categories, also known as parts of speech, refer to the primary classes of words that serve as the syntactic building blocks of sentences, distinguished primarily by their distributional and morphological behaviors rather than semantic content alone. The core lexical categories in many languages include nouns, verbs, adjectives, and adverbs, each identified through syntactic tests such as their ability to occupy specific positions in phrases or combine with certain affixes. For instance, nouns typically head noun phrases, can co-occur with determiners like "the" or "a," and often inflect for number (e.g., "book" to "books"), while verbs head verb phrases, take subjects and objects, and mark tense or aspect (e.g., "walk" to "walked"). Adjectives modify nouns within noun phrases and may form comparatives (e.g., "big" to "bigger"), whereas adverbs modify verbs, adjectives, or other adverbs, often ending in suffixes like "-ly" in English (e.g., "quick" to "quickly"). These criteria rely on distributional tests, which examine how words behave in syntactic environments to assign category membership, as opposed to purely semantic definitions. Lexical categories are broadly divided into open and closed classes based on their productivity and size. Open classes, comprising content words such as nouns, verbs, adjectives, and adverbs, are expandable through processes like borrowing or derivation, allowing new members like "google" (verb) to enter the lexicon readily. In contrast, closed classes include function words like prepositions (e.g., "in," "on"), conjunctions, and pronouns, which perform grammatical roles but form finite sets with limited growth, as their meanings are highly abstract and tied to syntactic structure. This distinction highlights how open classes carry substantive semantic load, while closed classes support syntactic relations.
While the core categories exhibit cross-linguistic consistency, their realization varies across languages, particularly in classifier systems where traditional distinctions may blur. In classifier languages like Mandarin Chinese, nouns require classifiers (e.g., "yī běn shū" for "one book," with "běn" as the classifier) to quantify or specify, and the status of adjectives is often debated, with many property-denoting words analyzed as stative verbs rather than a distinct class (e.g., "hóng" meaning "red" functioning predicatively without a copula). This variation challenges universal assumptions but underscores how lexical categories adapt to typological features like obligatory classifiers. An influential framework addressing such variability is Mark Baker's incorporation theory, which posits universal syntactic primitives for lexical categories to explain their presence across languages. Baker argues that verbs inherently license argument specifiers (e.g., subjects), nouns bear referential indices enabling anaphora and quantification, and adjectives lack both, allowing them to modify without projecting full phrases; these properties derive from parametric incorporation rules in syntax, accounting for noun incorporation in polysynthetic languages like Mohawk while maintaining categorial distinctions in analytic ones like English. Grammatical features, such as tense on verbs or number on nouns, further inflect items within these categories to encode syntactic relations.

Semantic and pragmatic categories

Semantic categories in linguistics pertain to the organization of meaning at the level of words and sentences, focusing on how conceptual roles and relations structure interpretation. Thematic roles, also known as semantic roles, represent one key framework for classifying the participants in events described by predicates. In case grammar, proposed by Charles Fillmore, deep structural cases such as agent (the initiator of an action), patient (the entity affected or moved), instrument (the means used), and others like goal, location, and experiencer capture the semantic relations between verbs and their arguments, independent of surface syntactic structure. This approach highlights how meaning is encoded through these roles, as in the sentence "John broke the window," where John is the agent and the window the patient. Lexical semantic categories further organize vocabulary into relational structures, such as semantic fields and hyponymy. Semantic fields group words sharing a common conceptual domain, like color terms (red, blue, green) or kinship terms (mother, father, cousin), where meanings are defined relative to each other within the field. Hyponymy establishes hierarchical inclusion, as in "dog" being a hyponym of "animal," with the superordinate term (animal) denoting a broader category encompassing the more specific one (dog). These categories facilitate understanding of lexical meaning through networks of inclusion and opposition, as detailed in structural semantics. Pragmatic categories address how context and speaker intent influence utterance interpretation beyond literal semantics. Central to this is speech act theory, developed by J. L. Austin, which distinguishes three levels of acts: the locutionary act (the literal utterance), the illocutionary act (the intended force, such as asserting, questioning, or promising), and the perlocutionary act (the effect on the listener, like persuading or alarming). For example, saying "It's cold in here" can perform an illocutionary act of requesting someone to close the window, depending on context. This framework underscores that utterances are actions with conventional and contextual meanings.
At the interface of semantics and syntax lie aspectual categories, which classify verbs based on the temporal structure of events they denote. Zeno Vendler proposed a four-way classification: states (e.g., know, unchanging over time), activities (e.g., run, durative without endpoint), accomplishments (e.g., paint a picture, durative with inherent endpoint), and achievements (e.g., recognize, punctual with change). These distinctions relate to telicity, where atelic verbs (activities and states) lack a natural boundary, contrasting with telic ones (accomplishments and achievements) that imply completion. Such categories often overlap with grammatical aspect marking, as in how progressive aspects highlight ongoing processes. From a cognitive linguistics perspective, semantic and pragmatic categories are not rigid but exhibit prototype effects, where membership is graded rather than binary. George Lakoff's prototype theory argues that categories like "bird" center on prototypical exemplars (e.g., robin) with fuzzy boundaries, incorporating encyclopedic knowledge and experiential factors rather than strict definitions. This view, applied to both lexical items and pragmatic inferences, emphasizes flexibility in meaning construction, challenging classical Aristotelian categorization.
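Vendler's four classes can be decomposed into the standard binary features of dynamicity, durativity, and telicity. The sketch below encodes that decomposition directly; the feature assignments follow the examples in the text, but the verb table itself is an illustrative sample, not a lexical resource:

```python
# Sketch of Vendler's aspectual classes as a lookup over three standard
# binary features: dynamic (eventive), durative (extended in time), telic
# (has an inherent endpoint).
def vendler_class(dynamic, durative, telic):
    """Classify an event type from its aspectual feature bundle."""
    if not dynamic:
        return "state"           # e.g., know: non-dynamic, unchanging
    if not telic:
        return "activity"        # e.g., run: dynamic, no inherent endpoint
    if durative:
        return "accomplishment"  # e.g., paint a picture: durative + endpoint
    return "achievement"         # e.g., recognize: punctual change

# Illustrative verb table: (dynamic, durative, telic)
VERBS = {
    "know": (False, True, False),
    "run": (True, True, False),
    "paint a picture": (True, True, True),
    "recognize": (True, False, True),
}

for verb, features in VERBS.items():
    print(verb, "->", vendler_class(*features))
```

The telic/atelic split mentioned above falls out of the third feature: activities and states return atelic classes, accomplishments and achievements telic ones.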

Standardization Efforts

Part-of-speech tagsets

Part-of-speech (POS) tagsets provide standardized labels for annotating words in corpora based on their lexical categories, enabling consistent analysis of grammatical structure across texts. The Brown Corpus, one of the first large-scale annotated corpora, assembled in the early 1960s at Brown University, employed an initial tagset of 87 tags to distinguish detailed morphological and syntactic properties in samples totaling about one million words. This tagset included categories for verb tenses, pronoun types, and modifiers, reflecting the era's emphasis on comprehensive grammatical coverage. The Penn Treebank tagset, developed as part of the Penn Treebank project from 1989 to 1992 and detailed in 1993, streamlined the approach by reducing it to 45 tags—36 for core categories and 9 for punctuation and symbols—to minimize redundancy while supporting parsing models. This design eliminated recoverable distinctions, such as certain verb inflections, and prioritized tags aligned with syntactic positions in parse trees, making it a de facto standard for English NLP tasks. POS tagset design balances granularity with usability, often contrasting flat structures, which list discrete categories without embedded features, against hierarchical ones that layer attributes like number, gender, or tense for morphologically complex languages. The Penn Treebank tagset exemplifies a flat structure, assigning single tags like "JJ" for adjectives regardless of further properties, whereas hierarchical tagsets decompose labels into primary categories and modifiers to capture intricate inflections. Ambiguity resolution is a core principle, particularly for multifunctional words; for example, words like "back" may tag as an adverb in phrasal verbs (e.g., "back away") but as a noun when standalone, with guidelines permitting multiple tags or context-based selection to avoid over-specification.
Language-specific adaptations tailor tagsets to unique grammatical traits, as seen in the CLAWS (Constituent Likelihood Automatic Word-tagging System) tagger for English, initiated in the early 1980s at Lancaster University and refined through versions like C5, which uses about 60 tags for probabilistic annotation of corpora such as the British National Corpus. In contrast, the Universal Dependencies (UD) project employs a universal POS tagset of 17 coarse tags—covering open classes like NOUN and VERB, closed classes like DET and ADP, and others like PUNCT—for cross-linguistic alignment, with fine-grained details handled via separate features rather than expanded tags. UD's design supports multilingual schemes by standardizing core categories while accommodating variations, such as distinguishing proper nouns (PROPN) from common nouns (NOUN) across languages. Evaluation of POS tagsets in supervised learning focuses on accuracy, the ratio of correctly tagged tokens to total tokens against gold-standard annotations, serving as the primary metric for tagset effectiveness and tagger performance. On the Wall Street Journal section of the Penn Treebank, baseline most-frequent-tag approaches yield about 92% accuracy, while advanced supervised models reach 97%, highlighting the tagset's utility in capturing contextual disambiguation without excessive complexity.
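The most-frequent-tag baseline cited above is simple enough to sketch directly: record each word's tag counts from labeled data, then always emit the majority tag. The toy training sentences and Penn-style tags below are invented for illustration:

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of (word, tag) pairs, Penn-style tags.
TRAINING = [
    [("the", "DT"), ("back", "NN"), ("hurts", "VBZ")],
    [("they", "PRP"), ("back", "VBP"), ("away", "RB")],
    [("stand", "VB"), ("back", "RB")],
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
]

def train_baseline(sentences):
    """Map each word to its single most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default="NN"):
    """Tag each word with its majority tag; unseen words fall back to NN."""
    return [(w, model.get(w, default)) for w in words]

model = train_baseline(TRAINING)
print(tag(model, ["the", "back", "door"]))
```

Because the model ignores context, "back" always receives its overall majority tag here, whatever its actual function in the sentence; that blindness is precisely the gap between the ~92% baseline and the ~97% contextual taggers mentioned above.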

Multilingual annotation schemes

Multilingual annotation schemes provide standardized frameworks for labeling syntactic and morphological structures across diverse languages, enabling cross-linguistic comparisons and the development of multilingual parsers. These schemes address the variability in grammatical organization by defining universal categories while allowing language-specific adaptations. A foundational aspect involves extending part-of-speech tagsets to include dependency relations, facilitating consistent treebank construction. The Universal Dependencies (UD) project, launched in 2014, exemplifies such a framework, offering cross-linguistically consistent annotation for parts of speech, morphological features, and syntactic dependencies in 179 languages (as of 2025). UD employs 17 universal POS tags, such as NOUN, VERB, and ADJ, alongside dependency relations like nsubj (nominal subject) and obj (direct object), which capture head-dependent relationships in sentence trees. This design promotes interoperability, as seen in shared tasks like the CoNLL conferences, where UD treebanks support multilingual models. The project's guidelines evolve through community input, ensuring applicability to typologically diverse languages from Indo-European to Austronesian families. The Prague Dependency Treebank (PDT), originally developed for Czech in the 1990s and consolidated in versions like PDT 3.0, has significantly influenced multilingual extensions by providing a multi-layer model that integrates morphological, syntactic, and semantic levels. PDT's tectogrammatical approach, which abstracts away from surface syntax to underlying dependencies, served as a basis for converting resources into UD format and inspired similar treebanks for languages like Arabic through projects such as the Prague Arabic Dependency Treebank. These extensions facilitate multilingual parsing by harmonizing annotation practices, allowing parsers trained on one language to transfer knowledge to others via shared dependency schemas.
A key challenge in these schemes arises from typological differences, particularly in handling head-final versus head-initial languages, where the direction of dependencies (e.g., verb-final in Japanese versus verb-initial in Welsh) impacts head selection and arc projections. In UD, for instance, guidelines adjust for such variations by prioritizing the first conjunct as head in coordinate structures, but inconsistencies persist in head-final languages like Japanese, requiring language-specific enhancements to maintain universality without losing linguistic fidelity. Efforts to harmonize European language resources in the 1990s, such as the EAGLES project, further advanced multilingual annotation by developing inventories of standards for syntactic and semantic tagging, promoting interoperability among corpora produced by the initiative's successors. Although specific projects like INTERSECT focused on particular annotation layers for English texts, broader EU-funded work emphasized consistent schemes for annotation across Romance and Germanic languages.
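UD treebanks are distributed in the tab-separated CoNLL-U format, where each token line carries its form, universal POS tag, head index, and dependency relation among ten columns. The sketch below reads a minimal fragment with the standard library; the two-token example sentence and its analysis are simplified for illustration, not drawn from any released treebank:

```python
# Minimal CoNLL-U fragment: ID, FORM, LEMMA, UPOS, XPOS, FEATS,
# HEAD, DEPREL, DEPS, MISC (tab-separated; "_" marks empty fields).
CONLLU = (
    "1\tshe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\truns\trun\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def parse_conllu(text):
    """Return (form, upos, head_index, deprel) for each token line,
    skipping comments and blank lines; head 0 marks the root."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tokens.append((cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens

for form, upos, head, rel in parse_conllu(CONLLU):
    print(f"{form}/{upos} --{rel}--> head {head}")
```

Reading off head indices and relation labels in this way is all a downstream tool needs to reconstruct the head-dependent tree discussed above.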

Interlinear glossing conventions

Interlinear glossing conventions provide a standardized format for representing the morphological structure of languages in linguistic descriptions, particularly useful for under-documented or morphologically complex languages. These conventions involve aligning the original text with a morpheme-by-morpheme gloss and a free translation, enabling precise identification of grammatical categories such as tense, aspect, and case. The basic structure of an interlinear gloss consists of three lines: the first presents the original word or phrase, the second breaks it down into morphemes with corresponding glosses in uppercase abbreviations, and the third offers a free translation. Morphemes are typically separated by hyphens in both the original and gloss lines, while clitics are marked with equals signs; for instance, in a hypothetical example from a polysynthetic language, the form ŋa-ŋu-m=lu might be glossed as 1SG.SBJ-see-3PL.OBJ=CL with the translation 'I saw them'. This alignment facilitates morpheme-by-morpheme correspondence and highlights grammatical features like tense (e.g., PST). The Leipzig Glossing Rules, developed by the Department of Linguistics at the Max Planck Institute for Evolutionary Anthropology and the University of Leipzig, establish these standards, first published in 2006 and last updated in 2015. They specify conventions such as left-aligned word-by-word glossing, uppercase abbreviations for grammatical categories (e.g., 1SG for first person singular, PST for past tense), and handling of non-one-to-one mappings with periods or other symbols. The rules include an appendix of recommended abbreviations to promote consistency across publications. Complex morphology is addressed through specific notations: portmanteaus, which fuse multiple categories into a single form, are glossed using the ">" symbol to indicate hierarchical relations, such as 2DU>3SG for a second-person dual acting on a third-person singular. Zero morphemes, representing covert grammatical elements, are marked with "Ø" or square brackets, as in puer-Ø glossed boy-NOM to denote an unexpressed nominative marker.
Inherent categories not overtly marked may appear in parentheses, like (PL) for an unmarked plural. These conventions evolved from early 20th-century practices, notably in George A. Grierson's Linguistic Survey of India (1894–1928), which introduced systematic interlinear word-for-word and sub-word glossing for over 700 linguistic varieties, emphasizing literal translations aligned with segmented transcriptions. This approach laid groundwork for modern standards, refined by Christian Lehmann's guidelines in 1982 and further standardized in the Leipzig Glossing Rules, which reflect common usage with minimal innovations. SIL International's current publication guidelines adopt the Leipzig conventions, using uppercase abbreviations for grammatical glosses and hyphens for morpheme breaks to ensure readability in fieldwork and descriptive linguistics.
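The column alignment that makes the three-line format readable can be mechanized: pad each word and its gloss to a shared width. A minimal sketch, using the Latin zero-morpheme example from above (the gloss of "curr-it" as run-3SG.PRS is constructed for illustration):

```python
def interlinear(words, glosses, translation):
    """Render the three-line interlinear format: each word is column-aligned
    over its morpheme-by-morpheme gloss, followed by the free translation."""
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    line1 = "  ".join(w.ljust(n) for w, n in zip(words, widths))
    line2 = "  ".join(g.ljust(n) for g, n in zip(glosses, widths))
    return f"{line1}\n{line2}\n'{translation}'"

print(interlinear(["puer-Ø", "curr-it"],
                  ["boy-NOM", "run-3SG.PRS"],
                  "the boy runs"))
```

Each gloss element sits directly under its source word, so the hyphenated morpheme breaks in line one correspond one-to-one with the uppercase category labels in line two, as the Leipzig rules require.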

Linguistic ontologies and registries

Linguistic ontologies and registries provide formal frameworks for defining, standardizing, and interconnecting categories used in linguistic descriptions, facilitating interoperability across datasets, tools, and research domains. These systems encode linguistic concepts as structured knowledge representations, often using ontology languages like OWL, to enable semantic querying, reuse, and integration in both traditional linguistics and computational applications. By formalizing relations between categories such as tense, part of speech, or semantic roles, they address challenges in data sharing and annotation consistency. The General Ontology for Linguistic Description (GOLD), initiated in 2001 and continuously updated, serves as a comprehensive ontology for descriptive linguistics, formalizing basic categories and relations in human language to capture linguists' expert knowledge. It employs a profile-based approach, where linguistic terms are defined as OWL classes and properties, allowing for modular extensions tailored to specific research needs, such as phonological or syntactic descriptions. For instance, GOLD includes classes like 'Tense', which represents temporal relations in verb inflection, enabling precise modeling of grammatical features across languages. Developed by Scott Farrar and D. Terence Langendoen, the ontology originated from efforts to create a semantic web-compatible structure for linguistic metadata, with its foundational formalization outlined in their 2003 publication. The ISOcat Data Category Registry, aligned with the ISO 12620:2009 standard for terminology and language resources, functioned as a collaborative platform from the late 2000s until its discontinuation in the mid-2010s, hosting standardized data categories for linguistic resources and annotations. It emphasized terminological precision, defining categories through attributes like labels, definitions, and domains, to support consistent data modeling in language resources without imposing a full ontological structure.
Categories could be submitted for standardization within thematic groups, ensuring broad applicability in areas like lexicography and corpus building. Following ISOcat's shutdown due to unmet goals and funding issues, related efforts shifted toward enhanced relation-handling mechanisms. Complementing ISOcat, the RELcat Relation Registry, prototyped in 2012, extends it by enabling the registration and typing of arbitrary relationships between ISOcat entries or external resources, using an RDF quad store for flexible ontological linkages. This addresses ISOcat's limitations in representing hierarchies or equivalences, allowing users to define personalized views, such as subclass relations or mappings to other registries. Post-2015, RELcat's framework influenced subsequent semantic registries, including integrations in CLARIN's Concept Registry, which incorporates relational typing for improved interoperability in linguistic metadata. The Ontologies of Linguistic Annotation (OLiA), developed from the early 2010s, offer a modular suite of OWL/DL ontologies focused on linguistic annotation, linking annotation-model-specific ontologies for over 100 languages to a shared reference model for compatibility. This architecture supports mappings between annotation schemes, such as part-of-speech tagsets, by formalizing categories like morphological features or syntactic dependencies as interconnected concepts. OLiA's design prioritizes tool and corpus interoperability, enabling tools to resolve ambiguities in multilingual annotations through explicit alignments. As detailed in Chiarcos and Sukhareva's 2015 overview, it covers phenomena including inflectional morphology and syntactic structures, with extensions for discourse and semantics. In terms of coverage, GOLD provides a broad, foundational ontology suited for general linguistic description, encompassing morphology and syntax with approximately 500 classes for diverse phenomena. Conversely, ISOcat and RELcat emphasize registry-based standardization, with ISOcat cataloging around 2,000 categories focused on terminological descriptors, enhanced post-2015 by RELcat's relational capabilities for dynamic linkages.
OLiA, while narrower in scope, being tied to annotation practices, excels in practical interoperability, linking to both GOLD and ISOcat concepts to bridge descriptive and computational uses, though it avoids exhaustive coverage. These differences highlight GOLD's role in conceptual breadth versus the registry-oriented precision of ISOcat/RELcat and OLiA's annotation-centric design. OLiA models, for example, integrate with POS tagsets by mapping tag labels to reference categories.
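The practical payoff of a formal subclass hierarchy is that category queries generalize: anything asserted of a superclass applies to its subclasses. The sketch below hard-codes a tiny hierarchy to show the idea; the class names merely mimic GOLD's naming style and are not taken from the actual ontology:

```python
# Illustrative subclass links (child -> parent); names are GOLD-style
# inventions for this sketch, not actual GOLD classes.
SUBCLASS_OF = {
    "PastTense": "Tense",
    "FutureTense": "Tense",
    "Tense": "MorphosemanticProperty",
    "MorphosemanticProperty": "LinguisticProperty",
}

def ancestors(cls):
    """Walk subclass links upward to collect all superclasses of cls."""
    out = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        out.append(cls)
    return out

def is_a(cls, candidate):
    """True if cls equals candidate or is (transitively) a subclass of it."""
    return cls == candidate or candidate in ancestors(cls)

print(ancestors("PastTense"))
print(is_a("PastTense", "MorphosemanticProperty"))
```

An OWL reasoner performs the same transitive-subsumption inference over `rdfs:subClassOf` axioms; the dictionary here stands in for that machinery.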

Applications

In linguistic annotation and typology

Linguistic categories play a crucial role in linguistic annotation by enabling the systematic tagging and comparison of structural features across languages, particularly in building parallel corpora for typological databases. In the World Atlas of Language Structures (WALS), categories such as word order types (e.g., SOV, SVO) and morphological alignments are used to annotate data from over 2,600 languages, facilitating cross-linguistic parallels that reveal patterns in grammatical variation. This approach, as outlined by Dryer and Haspelmath, relies on standardized categories to ensure consistency in documenting features like case marking and tense systems, allowing researchers to construct searchable corpora for identifying universals and rarities in structures. In typological research, these categories support feature-based comparisons that uncover underlying principles of language diversity and universals. For instance, Comrie's typology of tense, aspect, and mood (TAM) systems employs categories like "absolute tense" versus "relative tense" to classify how languages encode temporal relations, drawing on data from hundreds of languages to propose implicational hierarchies. This method has influenced subsequent typologies by providing a framework for quantifying feature distributions, such as the prevalence of aspect marking, which aids in testing hypotheses about implicational universals in linguistic structure. Case studies in annotating endangered languages highlight the practical application of categories in preservation efforts. The DoBeS (Dokumentation Bedrohter Sprachen) project, active in the 2000s, utilized categories for morphological and syntactic annotation in corpora of languages like Tsakhur and Udi, enabling detailed glossing that captures unique features such as polypersonal agreement before they are lost.
Similarly, the PARADISEC archive applies category-based standards to document Australian and Pacific languages, ensuring annotations include semantic roles and pragmatic markers that support typological comparisons while adhering to ethical guidelines for indigenous communities. These initiatives demonstrate how shared categories enhance the comparability of language documentation, allowing typologists to integrate field data into broader analyses. Interlinear glossing conventions reference these categories to provide morphological detail in annotations, aligning with typological needs for precise feature representation.
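Interlinear glosses of the kind these projects produce pair each morpheme with a category label, with hyphens separating morphemes in both the object line and the gloss line (the Leipzig Glossing Rules convention). A minimal alignment routine might look like this; the Turkish example is a standard textbook-style illustration.

```python
# Minimal sketch: align a morpheme-segmented line with its gloss line,
# following the Leipzig Glossing Rules convention that hyphens separate
# morphemes and that both lines contain the same number of segments.
def align_gloss(morphemes: str, glosses: str):
    pairs = []
    for word, gloss in zip(morphemes.split(), glosses.split()):
        m_parts, g_parts = word.split("-"), gloss.split("-")
        if len(m_parts) != len(g_parts):
            raise ValueError(f"segment mismatch in {word!r} / {gloss!r}")
        pairs.append(list(zip(m_parts, g_parts)))
    return pairs

# Turkish 'ev-ler-imiz-e' ("to our houses"), glossed morpheme by morpheme:
print(align_gloss("ev-ler-imiz-e", "house-PL-1PL.POSS-DAT"))
```

The mismatch check matters in practice: a gloss line with a different segment count from its object line is one of the most common errors flagged when validating documentation corpora.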

In computational linguistics and NLP

In computational linguistics and natural language processing (NLP), linguistic categories form the backbone of many models for tasks such as part-of-speech (POS) tagging, dependency parsing, and semantic role labeling (SRL). These categories enable the annotation of text data, which is then used to train models that predict grammatical and semantic structures. Early approaches relied on statistical methods that leveraged predefined category inventories to achieve high accuracy in processing unrestricted text.

POS tagging, a core component of NLP pipelines, assigns lexical categories to words based on context, often using hidden Markov models (HMMs). Seminal work by Kenneth Church in 1988 introduced a stochastic tagger trained on the Brown Corpus, achieving around 96% accuracy by estimating transition and emission probabilities from tagged data. This approach influenced subsequent systems, including the rule-based Brill tagger from 1992, which automatically learns transformation rules from category-annotated corpora like the Penn Treebank, matching the accuracy of stochastic taggers (95-97% on English text) while requiring fewer parameters. HMMs and rule-based taggers using such inventories remain foundational for preprocessing in NLP pipelines, enabling downstream tasks like parsing.

Dependency parsing employs syntactic categories from schemes like Universal Dependencies (UD) to model head-dependent relations in sentences, facilitating applications in machine translation. UD provides a consistent set of 17 universal POS tags and dependency labels across languages, used in multilingual treebanks for training parsers. Parsers built on UD support cross-lingual transfer, where shared dependency structures help align source and target sentences. The CoNLL shared tasks since 2017 have advanced UD-based dependency parsing, achieving unlabeled attachment scores (UAS) above 90% on average for high-resource languages, directly aiding translation quality.
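The HMM tagging approach described above chooses the tag sequence that maximizes the product of transition probabilities P(tag | previous tag) and emission probabilities P(word | tag), decoded with the Viterbi algorithm. The toy two-tag model below shows the mechanics; all probabilities are invented for illustration, not estimated from any corpus.

```python
import math

# Toy HMM POS tagger: find the tag sequence maximizing transition and
# emission probabilities via Viterbi decoding (in log space to avoid
# underflow). Probabilities are hand-set for illustration only.
TAGS = ["NOUN", "VERB"]
start = {"NOUN": 0.7, "VERB": 0.3}
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit = {"NOUN": {"dogs": 0.4, "bark": 0.1, "fish": 0.3},
        "VERB": {"dogs": 0.05, "bark": 0.6, "fish": 0.2}}

def viterbi(words):
    # score[t] = best log-probability of any tag path ending in tag t
    score = {t: math.log(start[t] * emit[t].get(words[0], 1e-6)) for t in TAGS}
    back = []  # one back-pointer table per position after the first
    for w in words[1:]:
        prev = score
        score, ptr = {}, {}
        for t in TAGS:
            best = max(prev, key=lambda p: prev[p] + math.log(trans[p][t]))
            score[t] = prev[best] + math.log(trans[best][t] * emit[t].get(w, 1e-6))
            ptr[t] = best
        back.append(ptr)
    # recover the best path by following back-pointers from the best final tag
    path = [max(score, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']
```

Real taggers estimate these tables from a tagged corpus and use inventories of 17-45 tags plus smoothing for unseen words, but the decoding step is exactly this dynamic program.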
Semantic role labeling (SRL) utilizes thematic categories to identify predicate-argument structures, enhancing understanding of event semantics in text. PropBank, introduced in the mid-2000s, annotates verbs with numbered argument roles (e.g., Arg0 for agent-like, Arg1 for patient-like arguments) atop the Penn Treebank, enabling supervised SRL models to predict roles with F1 scores around 80-85%. These categories support tasks like information extraction by clarifying who does what to whom.

In the 2020s, transformer-based models have advanced category-aware processing by fine-tuning on annotated datasets in CoNLL formats. Toolkits of this generation use transformers pre-trained on multilingual corpora and fine-tuned on UD treebanks for POS tagging and dependency parsing, attaining 97%+ POS accuracy on English data and enabling zero-shot transfer to low-resource languages. Trankit (2021), for example, employs lightweight transformers for end-to-end pipelines over CoNLL-U files, achieving average POS accuracies around 96% on UD benchmarks for high-resource languages while providing pretrained pipelines for more than 50 languages. These models integrate categories across joint tasks, outperforming HMM-based taggers by capturing long-range dependencies. Standardized schemes like UD aid such fine-tuning by providing consistent category mappings across datasets.
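The CoNLL-U format these pipelines consume encodes one token per line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A minimal reader might look like this; the three-token sentence is hand-written for illustration, not drawn from a real treebank.

```python
# Minimal CoNLL-U reader: each token line has 10 tab-separated fields;
# here we keep ID, FORM, UPOS, HEAD, and DEPREL (fields 1, 2, 4, 7, 8).
# The sentence below is a hand-written example, not from a real treebank.
ROWS = [
    ("1", "the",   "the",   "DET",  "_", "_", "2", "det",   "_", "_"),
    ("2", "cats",  "cat",   "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("3", "sleep", "sleep", "VERB", "_", "_", "0", "root",  "_", "_"),
]
SAMPLE = "\n".join("\t".join(r) for r in ROWS)

def read_conllu(text):
    """Parse CoNLL-U token lines, skipping blanks and '#' comment lines."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        tokens.append({"id": int(f[0]), "form": f[1], "upos": f[3],
                       "head": int(f[6]), "deprel": f[7]})
    return tokens

for tok in read_conllu(SAMPLE):
    print(tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

A HEAD value of 0 marks the sentence root; every other token points at the ID of its syntactic head, which is how the head-dependent relations described above are serialized.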

Challenges and future directions

One major challenge in linguistic category systems is the pervasive Eurocentric bias embedded in many part-of-speech (POS) tagsets, which often underrepresent the morphological complexity of non-Indo-European languages, particularly polysynthetic ones. Standard tagsets like those of Universal Dependencies (UD) were developed primarily on the basis of Indo-European structures, leading to difficulties in accurately categorizing agglutinative or polysynthetic forms, where words incorporate multiple morphemes that blur traditional POS boundaries, as in languages like Adyghe. This bias leaves low-resource polysynthetic languages with inadequate annotation support, with high morpheme ambiguity and limited datasets hindering effective morphological analysis.

Another persistent issue is the ambiguity inherent in fuzzy linguistic categories, where boundaries between semantic, syntactic, or pragmatic labels are not discrete but gradient, complicating consistent annotation and evaluation. For instance, prosodic or morphosyntactic features may exhibit overlapping interpretations across contexts, making it difficult to draw clear delimiters between notions such as vagueness versus generality in lexical items. This fuzziness exacerbates errors in cross-linguistic comparisons and automated processing, as human annotators and models alike struggle with the malleable nature of these boundaries.

The deprecation of ISOcat in 2014 has further intensified interoperability problems among linguistic metadata schemas, leaving a fragmented landscape of vocabularies without centralized relational mapping, which impedes reuse across projects. While a static copy of ISOcat persists, its lack of dynamic relations contributed to redundant categories and stalled harmonization efforts. RELcat, introduced as an RDF-based relation registry in 2012, offers a partial solution by enabling flexible crosswalks between ISOcat data categories and external standards like SKOS, though it has not fully resolved the post-deprecation silos.
Looking ahead, future directions emphasize AI-driven dynamic categorization that adapts tagsets in real time, leveraging techniques like unsupervised clustering to uncover latent patterns in corpora without rigid preconceptions, thereby addressing the limitations of static inventories. Integration with multimodal data, particularly for sign languages, promises enhanced inclusivity; models like SignAlignLM demonstrate how text-based and video inputs can be fused to process glosses and gestures, extending categories beyond spoken modalities. Inclusivity efforts are also advancing through initiatives like the Masakhane project, which since the 2020s has expanded resources for low-resource African languages, including creoles, via community-driven datasets, fostering broader representation of underrepresented linguistic diversity.
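The clustering idea behind such data-driven category induction can be illustrated in a few lines: words that occur in the same contexts tend to belong to the same category. The toy sketch below groups words by the set of words that immediately precede them in a tiny invented corpus; real systems cluster high-dimensional context vectors, but the distributional principle is the same.

```python
from collections import defaultdict

# Toy sketch of unsupervised category induction: group words by the set
# of words that immediately precede them in a small corpus. Real systems
# cluster context vectors; this shows the distributional idea in miniature.
corpus = "the cat sleeps the dog sleeps a cat runs a dog runs".split()

signatures = defaultdict(set)          # word -> set of preceding words
for prev, word in zip(corpus, corpus[1:]):
    signatures[word].add(prev)

clusters = defaultdict(list)           # identical signature -> same cluster
for word, sig in signatures.items():
    clusters[frozenset(sig)].append(word)

for sig, words in clusters.items():
    print(sorted(words), "<- preceded by", sorted(sig))
```

Even on this tiny corpus, 'cat' and 'dog' fall into one cluster (both follow determiners) and 'sleeps' and 'runs' into another (both follow nouns), recovering a noun-like and a verb-like class without any predefined tagset.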
