
Interlinear gloss

An interlinear gloss, also known as interlinear glossed text (IGT), is a standardized format in linguistics for presenting the morphological and grammatical structure of sentences in a source language (often under-documented or non-Indo-European) through a multi-line, aligned word-for-word or morpheme-for-morpheme translation. It typically consists of three aligned lines: the first presents the original text in the source language; the second provides a gloss line that translates lexical morphemes with equivalent words in a target language (usually English) and labels grammatical morphemes with uppercase abbreviations (e.g., "PL" for plural); and the third offers a free, idiomatic translation into the target language. This method enables precise analysis of linguistic forms without requiring fluency in the source language, making it essential for comparative studies and related research.

The primary purpose of interlinear glossing is to bridge the gap between the surface form of a text and its underlying grammatical structure, allowing readers to map elements of the free translation back to specific morphemes in the original. By breaking words down into morphemes (using hyphens for segmentation and conventions like equals signs for clitics or periods for multi-morpheme glosses), it reveals semantic, phonological, and syntactic details that idiomatic translations obscure. Developed as a formal practice in the mid-20th century, with precursors in 19th-century descriptive linguistics, interlinear glossing has become a cornerstone of linguistic fieldwork, publication, and language preservation, particularly for low-resource languages where full grammars may not exist.

Key conventions, such as those outlined in the Leipzig Glossing Rules (a collaborative standard from institutions like the Max Planck Institute), ensure consistency across publications: grammatical categories use standardized abbreviations (e.g., "1SG" for first-person singular), non-overt elements are marked with "Ø" or brackets, and infixes or reduplications receive special notations like angle brackets or tildes. These rules promote biunique mappings between source morphemes and glosses, prioritizing precision while allowing flexibility for language-specific phenomena. In computational linguistics, interlinear glosses also serve as training data for tasks such as automatic glossing of under-resourced languages.

Fundamentals

Definition and Purpose

An interlinear gloss is a textual representation in linguistics in which a source text is aligned line by line with morpheme-by-morpheme translations and grammatical annotations placed directly below or beside each word or morpheme, providing a detailed breakdown of linguistic structure. This format typically includes three or more parallel lines: the original text, a gloss line offering word-for-word or morpheme-for-morpheme equivalents (often in the analyst's language, such as English), and a free translation of the entire sentence. The glosses themselves are brief summaries of the meaning or grammatical properties of morphemes, designed for clarity in interlinear displays without implying a full morphological parse.

The primary purpose of interlinear glosses is to facilitate precise grammatical analysis, particularly for morphologically complex or under-documented languages, by revealing the internal composition of words and the syntactic relationships that free translations obscure. They serve as a tool for presenting linguistic examples in research publications, offering structural insights beyond idiomatic renderings to support theoretical and descriptive work. In this way, interlinear glosses enable linguists to encode morphosyntactic, functional, and part-of-speech information systematically, aiding in the documentation and preservation of endangered languages.

Applications of interlinear glosses span field linguistics, where they assist in eliciting and analyzing structures directly from native speakers; typological studies, for cross-linguistic comparisons of grammatical patterns; and translation work, where literal, morpheme-aligned renderings are essential for fidelity to the source material. They are particularly valuable in documenting lesser-known languages, such as those in databases like TypeCraft, which annotate phrases across diverse languages to support multilingual research.

Key benefits include the revelation of morphological and syntactic patterns that are invisible in standard translations, and greater consistency in linguistic descriptions, which improves data reusability and comparability across studies. By aligning annotations vertically, interlinear glosses speed up comprehension of linguistic components, supporting quicker analysis and broader accessibility for teaching and comparative purposes. The format thus supports sustainable documentation efforts, especially for low-resource languages, by enabling efficient data handling and sharing in scholarly contexts.

Basic Components

An interlinear gloss typically consists of three core lines that facilitate the analysis of linguistic structure. The first line presents the source text in the original language, with morpheme segmentation using hyphens for bound morphemes and equals signs for clitics (often in orthographic or transcribed form). The second line offers a word-for-word or morpheme-by-morpheme gloss, translating each segment into the target language while indicating grammatical categories. The third line delivers a free, idiomatic translation of the entire sentence, capturing its natural meaning. An unsegmented version of the source text may optionally precede the segmented source line.

Morpheme glossing isolates the smallest meaningful units (roots, affixes, and clitics) for precise annotation. Roots are glossed with their lexical equivalents in the target language, while affixes and other grammatical elements receive abbreviated labels that denote features like tense, number, or case. For instance, a form like "abur-u-n" might be segmented and glossed as "they-OBL-GEN," where "abur" translates to "they," "u" indicates the oblique case, and "n" marks the genitive. Portmanteau morphemes (single forms expressing multiple categories) are glossed with multiple labels separated by periods, such as "GEN.PL" for a genitive plural ending.

The alignment principle governs the vertical correspondence across lines, maintaining a one-to-one or one-to-many mapping between elements in the source text, segmentation, and gloss. Each morpheme or word in the upper lines is aligned with its counterpart below, typically left-justified to preserve sequential order and readability. This structure, as in the following example from Indonesian, allows linguists to trace morphological and syntactic relationships directly:
Mereka  di  Jakarta  sekarang.
they    in  Jakarta  now
'They are in Jakarta now.'
Such alignment supports comparative analysis without disrupting the flow of the original text.

Abbreviations play a crucial role in compactly encoding grammatical information in the gloss line, using standardized uppercase tags for categories like parts of speech (e.g., N for noun, V for verb), morphological features (e.g., PL for plural, FUT for future), and relational markers (e.g., COM for comitative, INS for instrumental). These tags draw from a conventional list to promote consistency across publications, though authors may adapt rare or language-specific terms. For example, in a Russian sentence, "poexa-l-i" is glossed as "go-PST-PL," where PST denotes past tense and PL plural. This system balances brevity with informativeness, enabling quick identification of structural patterns.
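The alignment principle described above can be checked mechanically. The following is a minimal Python sketch, not a standard tool: the function names `split_morphemes` and `align` are illustrative, and it assumes that words are separated by spaces and that only hyphens and equals signs mark morpheme boundaries in both lines.

```python
import re

def split_morphemes(word):
    """Split a word on affix (-) and clitic (=) boundaries."""
    return re.split(r"[-=]", word)

def align(source_line, gloss_line):
    """Pair every source morpheme with its gloss, word by word.

    Raises ValueError when the two lines do not have the same number
    of words, or a word pair does not have the same number of morphemes.
    """
    src_words = source_line.split()
    gls_words = gloss_line.split()
    if len(src_words) != len(gls_words):
        raise ValueError("source and gloss lines differ in word count")
    aligned = []
    for src, gls in zip(src_words, gls_words):
        src_m, gls_m = split_morphemes(src), split_morphemes(gls)
        if len(src_m) != len(gls_m):
            raise ValueError(f"morpheme count mismatch in {src!r} / {gls!r}")
        aligned.append(list(zip(src_m, gls_m)))
    return aligned

# The two segmented forms quoted in the text above.
print(align("abur-u-n", "they-OBL-GEN"))
print(align("poexa-l-i", "go-PST-PL"))
```

Periods inside a gloss such as "GEN.PL" are deliberately left untouched, since they label a single unsegmented form rather than a boundary in the source line.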

Historical Development

Origins in Philology and Biblical Studies

The practice of interlinear glossing originated in ancient Greco-Roman philology, where scholars added marginal and interlinear notes to papyrus rolls and codices of classical texts to elucidate difficult words and syntax. These glosses, often in the same language as the primary text (intralingual), facilitated the reading and interpretation of works like Homer's Iliad and Odyssey, preserving archaic vocabulary and poetic forms through word-aligned explanations. In the medieval period, this tradition continued in Carolingian manuscripts of Virgil's Aeneid, where interlinear glosses provided grammatical aids and historical context, blending ancient scholia with contemporary commentary to maintain fidelity to the original Latin structure.

In biblical studies, interlinear glossing gained prominence in the early 16th century through printed polyglots designed for theological and literal interpretation of scripture. The Complutensian Polyglot (1514–1517), commissioned by Cardinal Francisco Jiménez de Cisneros and printed at Alcalá de Henares, featured the Old Testament with Hebrew and Aramaic texts alongside Greek and Latin versions, including an interlinear Latin translation above the Greek to enable word-for-word comparison. This layout supported scholars in resolving textual variants and uncovering original meanings, emphasizing precise alignment to avoid interpretive distortions in doctrinal debates.

A pivotal contribution came from Desiderius Erasmus of Rotterdam, whose Novum Instrumentum omne (1516), published by Johann Froben in Basel, presented the New Testament in parallel Greek and revised Latin columns with extensive annotations promoting philological accuracy. Although not strictly interlinear, Erasmus's edition advanced word-for-word fidelity by prioritizing the Greek original over the Vulgate, influencing subsequent translations and exegetical methods. These early efforts reflected a methodological commitment to preserving the syntax and semantics of source texts, laying foundations for scholarly analysis that extended beyond religious contexts into broader linguistic applications.

Evolution in Modern Linguistics

The practice of interlinear glossing emerged as a key tool in descriptive linguistics during the 1950s and 1960s, particularly among structuralists associated with missionary linguistics efforts. Organizations like the Summer Institute of Linguistics (SIL), founded in 1934 by William Cameron Townsend with Kenneth Pike as an early leader and later director, promoted its use for detailed phonological and morphological analysis of under-documented languages, facilitating fieldwork in regions such as Mexico and New Guinea. This approach allowed linguists to break down complex structures morpheme by morpheme, aiding in the documentation of tonal systems and of the tagmemic units central to Pike's theoretical framework. By the late 1960s, full adoption of interlinear morpheme glossing (IMG) had taken hold, addressing the need for precise representations in typological studies of unfamiliar languages.

The rise of generative grammar in the 1970s, spearheaded by Noam Chomsky's emphasis on abstract rule systems, initially shifted focus away from surface-level morphological breakdowns toward underlying structure. However, interlinear glosses endured as an essential method for presenting empirical data in academic publications, including influential journals, where they provided verifiable examples from diverse languages to support theoretical claims. This persistence underscored the role of glossing in bridging formal theory and descriptive evidence: even generative analyses incorporated glossed texts to illustrate phenomena such as case marking.

Standardization efforts accelerated in the late 20th century, driven by SIL International's development of tools for fieldworkers, such as the Shoebox interlinearizer in 1988, which systematized gloss production for non-Indo-European languages. These initiatives were complemented by broader guidelines, culminating in the Leipzig Glossing Rules introduced in 2004 by the Max Planck Institute for Evolutionary Anthropology, which established conventions for abbreviations, alignment, and the handling of irregularities to promote consistency across publications. Updated periodically, with the latest revision in 2015, these rules reflect evolving practices in cross-linguistic comparison.

In contemporary linguistics, interlinear glossing is ubiquitous, particularly in grammatical descriptions of lesser-known languages. Databases like the Online Database of Interlinear Text (ODIN) illustrate its prevalence, containing over 150,000 glossed examples from nearly 1,500 languages as of 2016, extracted from scholarly sources to support global linguistic analysis. This widespread adoption, exceeding that of earlier decades, provides reusable data for computational projects.

Formatting and Conventions

Layout and Alignment

Interlinear glosses are structured so that each word in the source text is aligned with its morpheme-by-morpheme gloss, ensuring a one-to-one correspondence that facilitates precise morphological analysis. This alignment is achieved by segmenting words into morphemes using hyphens in both the source and gloss lines, with spaces separating words so that word boundaries match exactly. For instance, in an example from Indonesian, "Mereka di Jakarta sekarang" aligns directly with "they in Jakarta now," establishing word-level correspondence before any morpheme-level detail is added.

Vertically, interlinear glosses are typically presented in a trilinear format: the first line contains the source text (often in orthographic or transcribed form), the second provides the gloss with linguistic labels, and the third offers a free translation of the entire sentence. An optional fourth line may include a transcription if the orthographic form differs significantly, enhancing readability for non-specialists. Words are left-aligned to avoid visual misalignment in print or digital rendering.

Irregularities in morpheme segmentation are handled through standardized conventions that maintain alignment without distortion. When a single source element corresponds to several gloss elements, either because the target language lacks a one-word equivalent or because the form is a portmanteau encoding multiple grammatical features, the gloss elements are separated by a period (or sometimes a colon), as in Turkish "çık-mak" rendered as "come.out-INF." Zero morphemes, representing covert elements like unmarked cases, are indicated by "Ø" in the source line, as in Latin "puer-Ø" for "boy-NOM.SG," or by enclosing the gloss in square brackets, as in "puer" glossed "boy[NOM.SG]." These markers keep the number of elements in the two lines balanced.

Typography in interlinear glosses emphasizes the distinction between lines for clarity, with the source text often in italics, the gloss in a smaller font using small capitals for labels (e.g., NOM for nominative), and the free translation in standard type. In print formats, hyphens and alignment are managed manually or via justified spacing, while digital presentations frequently employ tables with invisible borders to automate precise positioning, reducing errors in complex examples.
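For plain-text output, the left-aligned column layout described above can be produced by padding each word to the width of the wider member of its source/gloss pair. The sketch below is illustrative only: `render_igt` and its `gap` parameter are hypothetical names, and it measures width with `len`, which counts code points and so may misalign text containing combining diacritics or wide characters.

```python
def render_igt(source, gloss, translation, gap=2):
    """Return a three-line interlinear display with left-aligned columns."""
    src_words = source.split()
    gls_words = gloss.split()
    if len(src_words) != len(gls_words):
        raise ValueError("source and gloss lines must have the same number of words")
    # Each column is as wide as the longer of the two aligned words.
    widths = [max(len(s), len(g)) for s, g in zip(src_words, gls_words)]
    sep = " " * gap

    def pad(words):
        # Left-justify every word in its column; drop trailing spaces.
        return sep.join(w.ljust(n) for w, n in zip(words, widths)).rstrip()

    return "\n".join([pad(src_words), pad(gls_words), f"'{translation}'"])

print(render_igt("Mereka di Jakarta sekarang.",
                 "they in Jakarta now",
                 "They are in Jakarta now."))
```

A table-based renderer for HTML or LaTeX would follow the same logic, with one cell per word instead of space padding.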

Punctuation, Symbols, and Abbreviations

In interlinear glosses, standardized abbreviations are employed to denote grammatical categories and morphological features, ensuring clarity and comparability across linguistic analyses. These abbreviations typically appear in uppercase letters and are aligned morpheme by morpheme with the source language form. Common examples include part-of-speech tags such as V for verb and N for noun, person-number markers like 1SG for first person singular and 3PL for third person plural, and morphological markers such as CAUS for causative and RED for reduplication. A more extensive inventory of recommended abbreviations is provided in the appendix to the Leipzig Glossing Rules, which includes labels like ACC for accusative, DAT for dative, FUT for future, NEG for negation, and PASS for passive, among over 80 others designed to reflect widespread usage in typological and descriptive linguistics.

Punctuation in glosses follows specific conventions to indicate structural relationships between elements. Hyphens (-) mark boundaries between segmentable morphemes in both the object form and the corresponding gloss, as in ba-la glossed as 3SG-go. Equals signs (=) denote clitic boundaries, distinguishing clitics from tightly bound affixes, as in ba=la, where =la is a clitic. Periods (.) separate multiple gloss elements that correspond to a single unsegmentable form in the source language, allowing for one-to-many mappings like bala glossed as 3SG.go. Additional optional punctuation includes underscores (_) for a single source element rendered by several words in the gloss, semicolons (;) for formally unsegmentable elements with multiple properties, colons (:) for forms that are segmentable but left unsegmented in the example, and backslashes (\) for morphophonological alternations.

Symbols enhance the precision of glosses by highlighting particular morphological phenomena. Grammatical labels are conventionally rendered in uppercase to distinguish them from the lowercase lexical glosses. Angled brackets (< >) enclose infixes and their glosses, as in Tagalog b<um>ili glossed as <ACTFOC>buy 'buy', where the infix is positioned relative to its host. Square brackets [ ] mark non-overt elements, round parentheses ( ) indicate inherent categories, and tildes (~) connect reduplicated portions to the base, such as ba~la glossed as go~RED. For non-standard abbreviations or terms outside the recommended list, footnotes are used to provide explanations, maintaining accessibility for readers unfamiliar with specialized notations.

While the Leipzig Glossing Rules serve as the dominant standard, variations exist across institutions. SIL International (formerly the Summer Institute of Linguistics), focused on field linguistics, recommends adherence to these rules for grammatical abbreviations in its publications, promoting uniformity with global practices. Recent proposals, such as the 2023 Generalized Glossing Guidelines, build on the Leipzig conventions by introducing more explicit notations for non-concatenative morphology (for example, curly braces { } for processes such as infixation or tonal changes) to improve machine readability and representational accuracy, though adoption is still at an early stage.
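Because each delimiter has a fixed meaning, a gloss word can be broken into labelled elements with a few lines of code. The following Python sketch is a simplification under stated assumptions: it covers only the hyphen, equals sign, period, and tilde, it treats any all-uppercase or digit-plus-uppercase label (such as 1SG) as grammatical, and the names `BOUNDARIES` and `parse_gloss_word` are illustrative rather than part of any published tool.

```python
import re

# Meaning of each delimiter, following the conventions described above.
BOUNDARIES = {
    "-": "affix",
    "=": "clitic",
    ".": "same-form",       # several gloss elements for one source form
    "~": "reduplication",
}

def parse_gloss_word(word):
    """Return (label, kind, boundary_before) for each element of a gloss word."""
    # The capturing group keeps the delimiters in the result list, so
    # parts alternates: label, delimiter, label, delimiter, ...
    parts = re.split(r"([-=.~])", word)
    elements = []
    boundary = None
    for i, part in enumerate(parts):
        if i % 2 == 1:
            boundary = BOUNDARIES[part]
        elif part:
            is_gram = re.fullmatch(r"[0-9A-Z]+", part) is not None
            kind = "grammatical" if is_gram else "lexical"
            elements.append((part, kind, boundary))
    return elements

print(parse_gloss_word("3SG-go"))
print(parse_gloss_word("island.GEN.PL"))
```

A fuller parser would also need to handle the bracket notations and the optional underscore, semicolon, colon, and backslash.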

Examples

Introductory Example

To illustrate the basic structure of an interlinear gloss, consider an English sentence, chosen for its familiarity to demonstrate that glossing applies across languages, even those with minimal inflection like English. The sentence "The cats sleep" can be glossed as follows, following the convention that words in analytic languages are typically treated as units unless bound morphemes are present:
The     cats     sleep
DET     cat.N.PL sleep.V.PRS
'The cats sleep.'
Here, the first line presents the original sentence, with no hyphens added, so that each word is treated as a single unit. The second line provides the gloss: "DET" indicates the definite article as a determiner; "cat.N.PL" glosses "cats" with the stem "cat," the noun category "N," and plural "PL"; "sleep.V.PRS" glosses the verb with its stem, the verb category "V," and present tense "PRS." These abbreviations follow widely adopted standards for grammatical categories to ensure clarity and consistency. The third line offers a free translation, capturing the natural meaning without literal constraints.

This step-by-step breakdown highlights key choices: each source word aligns directly with one gloss word, and glosses prioritize semantic and grammatical clarity over exhaustive detail. In analytic languages like English, glossing avoids over-segmentation (such as artificially breaking function words into subparts) to prevent misrepresentation of the language's structure, focusing instead on core categories like number and tense where relevant.

Advanced Example in Typologically Diverse Languages

To illustrate the application of interlinear glossing in typologically diverse languages, consider examples from polysynthetic and isolating structures, which highlight the varying degrees of morphological complexity across language families. These cases demonstrate how glosses adapt to reveal intricate structures that differ markedly from more familiar Indo-European patterns.

In polysynthetic languages like Inuktitut, an Eskimo-Aleut language spoken in Canada, a single verb form can encode an entire sentence through extensive affixation, incorporating multiple arguments, modalities, and aspectual nuances. A representative example is the word tusaatsiarunnanngittualuujunga, which translates to "I don't hear very well." The interlinear gloss breaks it down as follows:
tusaatsiarunnanngittualuujunga
tusaa-  tsiaq-  junnaq-   nngit-  tualuu-  junga
hear-   well-   be.able-  NEG-    much-    1SG.PRES
'I don't hear very well.'
This segmentation reveals six morphemes in a single word, showcasing the language's agglutinative tendencies, in which affixes stack sequentially to build meaning: the root for "hear," an adverbial modifier "well," a modal "be able," negation, an intensifier "much," and subject agreement in the first-person singular present. Such glossing underscores Inuktitut's polysynthetic profile, in which verbs frequently incorporate nominal elements or adverbials to compactly express predicate-argument structures that would require full clauses in analytic languages.

In contrast, isolating languages like Vietnamese, an Austroasiatic language of Southeast Asia, exhibit minimal morphological segmentation, relying instead on word order, particles, and classifiers to convey grammatical relations. Vietnamese words are predominantly monosyllabic and uninflected, with tones marking lexical distinctions and classifiers specifying noun categories in numeral phrases. An example is Thì thấy một ông già và hai đứa bé nên nghĩ là ba cha con, glossed as:
thì  thấy  một  ông  già  và   hai  đứa     bé     nên  nghĩ   là  ba     cha     con
PTL  see   one  man  old  and  two  CLF.CH  child  so   think  be  three  father  son
This yields the translation: "I saw an old man and two boys, so I thought they were father and sons." Here, glossing involves little to no affixation breakdown, instead annotating particles like thì (topic linker) and classifiers such as ông (for males) and đứa (for children). Tonal diacritics (e.g., the rising tone on thấy) are often noted in the transcription but not separately glossed unless phonologically contrastive.

Interlinear glosses in these examples illuminate typological contrasts, such as agglutinative patterns, where morphemes attach discretely with one-to-one form-function mappings, versus fusional patterns in languages like Latin, where affixes blend multiple categories (e.g., legunt glossed as read.3PL.PRES to capture tense, person, and number in a single unsegmented form). Portmanteau morphemes, common in fusional systems, are glossed with periods to separate fused features (e.g., insularum as island.GEN.PL), while stem alternations and suppletive forms (such as English go/went) are annotated by their grammatical function without implying segmental unity; a backslash can mark a stem change, as in German Väter-n glossed as father\PL-DAT.PL. These conventions ensure that glosses capture language-specific irregularities without over-segmentation.

Through such glossing, typological insights emerge: Inuktitut's structure reveals polysynthetic grammar, in which objects or adverbials fuse into the verb stem to denote holistic events (e.g., incorporating an adverbial element to specify manner without separate words), in contrast to Vietnamese's isolating reliance on syntactic positioning and classifiers for semantic nuance. This approach not only aids cross-linguistic comparison but also highlights how morphology encodes meaning, such as compact event integration in polysynthetic languages versus explicit relational marking in isolating ones.

Resources and Tools

Databases and Corpora

Databases and corpora of interlinear glossed texts (IGT) serve as essential repositories for linguistic research, compiling annotated examples from scholarly publications, field documentation, and archival projects to support cross-linguistic analysis. These resources typically include searchable interfaces that allow queries by gloss tags, such as morphological categories (e.g., "NOM" for nominative), language family, or geographic region, enabling researchers to explore typological patterns without manual data extraction. Coverage often emphasizes endangered languages, with many entries featuring multimedia alignments such as audio or video linked to glosses.

One prominent database is the Online Database of Interlinear Text (ODIN), launched in the late 2000s. ODIN aggregates 158,007 IGT examples (as of 2016) extracted automatically from PDF versions of linguistic papers, covering 1,496 languages primarily from published sources. Users can access it via a web interface for searching glosses, alignments, and translations, making it valuable for rapid prototyping in computational applications.

The Language Archive (TLA) at the Max Planck Institute for Psycholinguistics, which incorporates the DoBeS (Documentation of Endangered Languages) program, hosts over 350 collections spanning more than 250 languages worldwide. These include richly annotated IGT with interlinear glosses, free translations, and metadata on grammatical features, often derived from fieldwork on endangered varieties; access is open through the TLA portal, with tools for querying by linguistic tags or depositing new data.

For field-based corpora, the Endangered Languages Archive (ELAR) preserves multimedia documentation for over 770 endangered languages across more than 90 countries as of 2024. ELAR's collections frequently feature FLEx-generated IGT linked to SIL International's lexical tools, focusing on under-documented languages with searchable glosses for morphological and syntactic analysis. Similarly, the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) maintains IGT-integrated texts from 794 collections (as of April 2025), emphasizing Pacific and Australian Indigenous languages with alignments to audio recordings.

A recent addition is the GlossLM corpus, released in 2024, which compiles over 450,000 IGT examples from diverse documentation projects across 1,800 languages. Normalized for consistent gloss labeling, it supports quantitative studies of typological features like case marking or verb agreement.

These repositories collectively enable quantitative typology; for instance, analyses of gloss frequencies in such corpora have revealed patterns in ergative-absolutive alignment across samples of more than 500 languages, informing publications on morphological universals.

Software for Creation and Analysis

Software tools for creating and analyzing interlinear glosses facilitate the transition from manual linguistic documentation to digital workflows, enabling precise segmentation, alignment, and export for further research. These applications support linguists in handling complex morphological data across diverse languages, often incorporating standards like the Leipzig Glossing Rules for consistency.

FieldWorks Language Explorer (FLEx), developed by SIL International, serves as a primary tool for morpheme segmentation and interlinear gloss creation. It allows users to parse texts into interlinear formats by linking lexical entries to morphological analyses, supporting lexicon building alongside glossing. FLEx handles multiple writing systems via Unicode and exports glossed data in XML and TEI formats compatible with archival standards.

For typesetting interlinear glosses in publications, the LaTeX package Covington provides customizable macros to align word-by-word translations and grammatical labels. It enables the production of multi-line glosses with automatic alignment, accommodating diacritics and abbreviations while adhering to typographic conventions in linguistic papers. Covington can be combined with other LaTeX linguistics packages and also offers formatting helpers, such as support for multiple accents on a single letter.

ELAN (EUDICO Linguistic Annotator), from the Max Planck Institute for Psycholinguistics, excels in analyzing interlinear glosses through its annotation alignment features, particularly for time-based media like audio recordings. In interlinearization mode, it parses and glosses texts across tiers and supports right-to-left scripts. ELAN also facilitates querying of glossed annotations and stores them in an XML-based format that can be integrated into other linguistic workflows.

Recent open-source developments include tools like Leipzig.js, a JavaScript utility for embedding interlinear glosses in web-based documents, which makes glossed examples easier to share. Such tools extend FLEx and ELAN workflows by allowing browser-based presentation of glosses, often with support for data imported from corpora like those in The Language Archive.

Computational Methods

Automatic Glossing Techniques

Automatic glossing techniques aim to computationally generate interlinear glosses from raw linguistic data, such as transcribed text or audio transcriptions, to accelerate language documentation for under-resourced languages. These methods integrate linguistic rules and statistical models to parse morphology and produce aligned annotations, reducing manual effort in fieldwork. Early approaches focused on rule-based systems, while recent advances use machine learning for greater flexibility in handling diverse typologies.

Rule-based approaches employ finite-state transducers (FSTs) to model morphology and are particularly effective for agglutinative languages with complex suffixation. Tools like XFST (Xerox Finite State Tool) compile morphological rules and lexicons into transducers that analyze word forms and generate glosses by mapping surface realizations to underlying morphemes. For instance, in Plains Cree, an agglutinative Algonquian language, Snoek et al. (2014) developed an FST-based model for noun morphology, automating gloss assignment for stems and affixes without requiring extensive training data. This technique excels in languages with predictable inflectional patterns but struggles with irregularities.

Machine learning methods have advanced automatic glossing, especially for low-resource settings. Statistical models like Conditional Random Fields (CRFs) combine source text features (e.g., morphemes, POS tags) with translation alignments to predict glosses, achieving morpheme accuracies of 71-85% on languages like Abui, Chintang, and Matsigenka using around 1,000 interlinear glossed text (IGT) examples for training. Neural approaches, such as fine-tuned transformer models (e.g., ByT5 pretrained on multilingual IGT corpora spanning 1,800 languages), further improve performance through cross-lingual transfer, reaching up to 82% morpheme accuracy on unsegmented data for languages like Arapaho. Implementations on platforms like Hugging Face in the 2020s enable accessible fine-tuning for glossing tasks. The SIGMORPHON 2023 shared task demonstrated transformer-based systems outperforming baselines by 17-24 percentage points in word-level accuracy across six low-resource languages, using inputs like transcriptions and translations.

The typical workflow begins with transcribed text as input, optionally paired with English translations, followed by preprocessing (e.g., tokenization or segmentation in open setups). Models then predict morpheme boundaries and labels, outputting aligned interlinear formats; post-processing heuristics resolve conflicts, for example by prioritizing translation-informed predictions. Challenges include ambiguity in segmentation, out-of-vocabulary items, and homophonous forms that confound disambiguation in polysynthetic languages.

Applications of these techniques support large-scale annotation in AI-linguistics projects, including projects working with over 100 hours of recordings. The SIGMORPHON 2023 task and subsequent work (2023–2025) have integrated neural glossing into revitalization efforts for endangered languages, enabling rapid annotation of corpora while allowing manual refinement in existing software suites. As of 2025, developments include benchmarks like LingGym for evaluating large language models on IGT tasks, as well as practical tools for glossing specific low-resource languages.
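To make the workflow and the morpheme-accuracy metric concrete, the sketch below implements a deliberately simple lookup baseline in Python: it memorizes the most frequent gloss seen for each morpheme. This is not the CRF or transformer approach described above, only a stand-in that shows the input and output shapes; the function names and the two-example training set (built from forms quoted earlier in this article) are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train(examples):
    """Learn the most frequent gloss for each morpheme.

    examples: iterable of (morphemes, glosses) pairs of equal-length lists.
    """
    counts = defaultdict(Counter)
    for morphemes, glosses in examples:
        for morpheme, gloss in zip(morphemes, glosses):
            counts[morpheme][gloss] += 1
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}

def gloss(model, morphemes, unknown="?"):
    """Gloss pre-segmented morphemes; unseen morphemes get a placeholder."""
    return [model.get(m, unknown) for m in morphemes]

def morpheme_accuracy(predicted, gold):
    """Fraction of morphemes whose predicted gloss matches the gold gloss."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Toy training data, for illustration only.
training = [
    (["poexa", "l", "i"], ["go", "PST", "PL"]),
    (["abur", "u", "n"], ["they", "OBL", "GEN"]),
]
model = train(training)
predicted = gloss(model, ["poexa", "l", "x"])   # "x" is out of vocabulary
print(predicted, morpheme_accuracy(predicted, ["go", "PST", "PL"]))
```

The out-of-vocabulary morpheme and the inability to choose between homophonous forms by context are exactly the weaknesses that sequence models such as CRFs and transformers are meant to address.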

Extraction of Morphological Structures

Extraction of morphological structures from interlinear glosses involves computational techniques to derive underlying grammatical rules and patterns, facilitating of linguistic theories and enabling automated inference of morphological paradigms from annotated data. These methods treat gloss lines as structured input, where boundaries and tags provide explicit cues for and clustering, contrasting with raw text by leveraging pre-annotated features to uncover inflectional classes, allomorphy, and dependency relations. Parsing algorithms adapted for gloss lines build morpheme-level dependency trees by treating gloss tags as nodes in a graph, similar to how Universal Dependencies (UD) frameworks handle syntactic relations but extended to subword units. For instance, tools like UDPipe, a trainable pipeline for tokenization, tagging, , and dependency parsing, can be customized to process glossed morphemes, generating trees that represent intra-word dependencies such as affix-stem relations. Recent adaptations integrate morphological segmentation from glosses into UD trees, using statistical measures like ∆P scores to assign features to morphs, achieving consistent cross-lingual alignment for languages like and . This approach enables the extraction of hierarchical structures, where gloss tags inform edge labels (e.g., :morph for morphological dependencies), supporting theory testing by visualizing paradigm-internal relations. Discovery methods employ unsupervised clustering on gloss tags to infer inflectional paradigms, grouping forms by shared features like stem glosses or affix patterns without prior grammatical knowledge. 
In the IGT2P framework, words are clustered by lemma glosses and feature sets extracted from interlinear texts, and transformer-based reinflection models are used to complete partial paradigms; this identifies inflectional classes with accuracies ranging from 21% to 64% across low-resource languages such as Tsez, improving by up to 21 points with data cleaning. Other algorithms, such as non-parametric models for paradigm discovery, cluster inflected forms into classes by modeling probability distributions over affix sequences, though adapted here to gloss tags for enhanced semantic alignment. These techniques reveal latent patterns, such as gender-linked inflection in agglutinative systems, by iteratively refining clusters based on tag co-occurrence.
Case studies applying these methods to large corpora, including a database of over 158,000 interlinear glossed texts from hundreds of languages, demonstrate their utility in uncovering allomorphy rules, such as context-sensitive alternations in verbal prefixes. Such applications confirm theoretical predictions, like the CARP (causative-applicative-reciprocal-passive) template in Proto-Bantu, by mapping gloss patterns to phonological rules across dialects.
Recent advances include transformer-based models for morphological tasks, including tagging, in low-resource settings, such as multi-task autoencoding achieving up to 73.66% accuracy across diverse languages. Adaptations of these architectures to UD frameworks support morpheme-level parsing and feature assignment, with evaluations emphasizing precision in relation recovery and robustness to annotation noise.
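
Feature assignment to morphs, described earlier in this section in terms of ∆P scores, can be illustrated with a small sketch. It uses the standard contingency-based definition, ∆P = P(feature | morph present) − P(feature | morph absent); whether the adaptations mentioned above compute the score in exactly this way is an assumption, and the function names and data are invented for the example.

```python
def delta_p(observations, morph, feature):
    """Return P(feature | morph present) - P(feature | morph absent).

    `observations` is a list of (set of morphs, set of features) pairs,
    one pair per glossed word.
    """
    a = b = c = d = 0  # cells of the 2x2 contingency table
    for morphs, feats in observations:
        has_morph = morph in morphs
        has_feature = feature in feats
        if has_morph and has_feature:
            a += 1
        elif has_morph:
            b += 1
        elif has_feature:
            c += 1
        else:
            d += 1
    p_with_morph = a / (a + b) if a + b else 0.0
    p_without_morph = c / (c + d) if c + d else 0.0
    return p_with_morph - p_without_morph


def best_morph(observations, feature):
    """Pick the morph most strongly associated with `feature` by delta-P."""
    morphs = sorted({m for ms, _ in observations for m in ms})
    return max(morphs, key=lambda m: delta_p(observations, m, feature))


# Invented toy words: segmented morphs paired with the word's feature tags.
observations = [
    ({"kata", "lar"}, {"PL"}),
    ({"mira", "lar"}, {"PL"}),
    ({"kata", "da"}, {"LOC"}),
    ({"kata"}, set()),
]
print(delta_p(observations, "lar", "PL"))   # affix that always co-occurs with PL
print(delta_p(observations, "kata", "PL"))  # stem that mostly occurs without PL
print(best_morph(observations, "PL"), best_morph(observations, "LOC"))
```

A score near 1 means the feature appears almost only when the morph is present, while a negative score means the morph tends to occur in words lacking the feature. That contrast is what allows an affix to be separated from the stems it attaches to.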
