
Text corpus

A text corpus, in linguistics and natural language processing, is a large, principled collection of naturally occurring language examples—typically written texts or transcriptions of speech—stored electronically in a machine-readable format for systematic analysis. These collections are designed to represent authentic language use across various genres, registers, and contexts, serving as empirical evidence for studying linguistic patterns rather than relying on intuition or contrived examples. Key characteristics include their size (often millions of words), representativeness through principled sampling, and reliance on computational tools for quantitative and qualitative examination.

The development of text corpora traces back to 19th-century lexicographic efforts to gather language samples, but the field of corpus linguistics formalized in the 1960s with the advent of computers, marked by the creation of the Brown Corpus in 1961—a 1-million-word sample of mid-20th-century American English. Today, corpora range from synchronic (capturing language at a single point in time, like the Corpus of Contemporary American English with over 1 billion words) to diachronic (tracking changes over time, such as the Corpus of Historical American English). Common types also encompass general corpora for broad language representation (e.g., the British National Corpus), specialized corpora focused on domains like academic speech (e.g., the Michigan Corpus of Academic Spoken English), parallel corpora aligning texts in multiple languages for translation research, learner corpora documenting non-native usage, and multimodal corpora integrating text with audio or video.

Text corpora underpin diverse applications, from identifying word frequencies, collocations, and grammatical structures in linguistic research to training natural language processing models in tasks like machine translation and speech recognition. In education, they inform syllabus design and materials development, such as the Academic Word List derived from corpus data to highlight high-frequency vocabulary in scholarly texts. By enabling replicable, data-driven insights into language variation across dialects, time periods, and social contexts, corpora have transformed empirical approaches to language study and computational applications.

Fundamentals

Definition

A text corpus is a large, structured collection of machine-readable texts assembled for linguistic or computational analysis, often designed to represent a specific language, variety, genre, or historical period. These collections enable empirical investigation into linguistic patterns, usage frequencies, and structural features by providing a finite, digitized sample of authentic language data. Key attributes of a text corpus include its substantial size, typically encompassing millions to billions of words to ensure statistical reliability; representativeness, achieved through principled sampling that mirrors real-world language use across varied contexts; and machine-readability, facilitated by digital formats that allow automated processing and querying. This structured nature distinguishes a corpus from mere accumulations of texts, emphasizing purposeful curation for analytical objectives rather than arbitrary gathering. The term "corpus" derives from the Latin corpus, meaning "body," and entered linguistic usage in the 20th century to denote a cohesive body of texts suitable for systematic study. In this context, it underscores the corpus as an organized whole, akin to a physical entity, rather than disparate fragments.

Types

Text corpora are categorized based on their design principles, scale, and intended purpose, reflecting diverse approaches to capturing linguistic phenomena. Major classifications include balanced corpora, monitor corpora, and parallel corpora, each tailored to specific research needs in corpus linguistics.

Balanced corpora aim for even representation across genres, registers, and subcorpora to provide a snapshot of language use at a given time, ensuring proportional coverage of text types such as fiction, newspapers, and academic prose. This design promotes representativeness and comparability, allowing researchers to draw inferences about general language patterns without bias toward dominant genres; however, their static nature limits their ability to capture diachronic changes in language evolution. Monitor corpora, in contrast, are dynamically updated with new texts to track ongoing shifts in usage, prioritizing breadth and recency over strict balance. They facilitate the study of contemporary trends, such as neologisms or syntactic innovations, but require continuous maintenance and may introduce inconsistencies due to varying addition rates. Parallel corpora consist of aligned texts in multiple languages, where source and target versions correspond sentence-by-sentence, enabling cross-linguistic comparisons and translation analysis. Their strength lies in supporting machine translation and translation studies, though alignment errors and translationese effects can skew natural representation.

Other notable types include reference corpora, which are large, general-purpose collections serving as benchmarks for dictionary compilation, grammatical description, and normative studies. These corpora emphasize comprehensiveness and reliability, often drawing from diverse sources, but their size can complicate targeted analyses without sub-sampling. Specialized or domain-specific corpora focus on particular fields, such as medical or legal texts, to investigate terminology, phraseology, and usage patterns within constrained contexts. This targeted approach yields high relevance for experts but reduces generalizability to broader language use. Comparable corpora feature similar texts across languages without direct translation alignment, built under equivalent sampling criteria to enable indirect cross-linguistic insights. They offer flexibility for studying cultural variation in language use but demand meticulous design to ensure true comparability, avoiding unintended biases in text selection.

Emerging types expand traditional boundaries, such as web-as-corpus approaches that harvest web texts for vast, real-time data pools, treating the web as a dynamic linguistic resource. This method captures informal and evolving language at scale but grapples with noise, copyright issues, and representativeness challenges from uneven web coverage. Multimodal corpora integrate textual elements with non-textual data like audio or video transcripts, focusing on synchronized textual components to analyze language in context-rich environments. While enhancing understanding of communicative interplay, they necessitate advanced tools for textual alignment and synchronization, increasing preparation complexity.

Development

Construction Methods

Constructing a text corpus begins with sourcing raw textual materials from diverse origins to ensure a foundation suitable for linguistic analysis. Common methods include manual collection through keyboarding, where texts from books or journals are typed directly into digital formats, particularly for handwritten or degraded sources; digitization via scanning and optical character recognition (OCR) software to convert physical documents into machine-readable text; and automated crawling of digital archives, such as Project Gutenberg, which provides over 75,000 e-books (as of November 2025) for free download and integration into corpora.

Once sourced, materials undergo sampling to achieve representativeness and balance, reflecting the target language variety without bias. Random sampling selects texts probabilistically from a larger population to capture natural variability, though it risks underrepresenting rare linguistic features; stratified sampling divides the source into subgroups (strata) based on criteria like genre, demographics, or time period—such as the 15 genre categories in the Brown Corpus—before proportionally selecting from each to ensure comprehensive coverage. Corpus size is determined by research objectives, with specialized corpora often targeting 1 million words for focused studies and general-purpose ones aiming for 10-100 million words to enable robust statistical analysis, as seen in the British National Corpus (BNC) at 100 million words.

Ethical and legal considerations are integral to construction, prioritizing compliance with copyright and data protection regulations. Copyright clearance requires permissions from rights holders for proprietary texts, favoring public-domain or licensed materials to avoid infringement on reproduction and distribution rights; for instance, the CLARIN guidelines recommend license agreements for copyrighted works while permitting use of expired copyrights or orphan works after diligent searches. Personal data in texts must be anonymized—through techniques like pseudonymization or removal of identifiers—to adhere to regulations such as the EU's General Data Protection Regulation (GDPR), which mandates protection of personal data processed for research purposes.

Standardized tools and formats facilitate interoperability and efficient assembly. Texts are often encoded in XML using the Text Encoding Initiative (TEI) guidelines, which provide a structured schema for markup of linguistic features, metadata, and document hierarchies to ensure consistency across corpora. Software like Sketch Engine supports initial assembly by allowing uploads in vertical or XML formats, enabling the creation of corpora up to billions of words with built-in annotation tools; similarly, AntConc aids in compiling and preliminary processing of text files into searchable collections, though it is primarily geared toward analysis. These practices ensure corpora are machine-readable, scalable, and reusable in linguistic research.
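The stratified sampling step described above can be illustrated with a short Python sketch. The genre labels, file names, and target size below are invented for the example; the proportional-quota logic is a simplified version of the sampling principle, not the procedure used for any particular corpus.

    import random

    def stratified_sample(strata, target_size):
        """Draw a proportional random sample of texts from each stratum (genre)."""
        total = sum(len(texts) for texts in strata.values())
        selected = []
        for name, texts in strata.items():
            # Each stratum contributes in proportion to its share of the source pool.
            quota = max(1, round(target_size * len(texts) / total))
            selected.extend(random.sample(texts, min(quota, len(texts))))
        return selected

    # Hypothetical source pool grouped by genre (illustrative file names only).
    pool = {
        "press":   [f"press_{i}.txt" for i in range(300)],
        "fiction": [f"fiction_{i}.txt" for i in range(200)],
        "learned": [f"learned_{i}.txt" for i in range(100)],
    }
    sample = stratified_sample(pool, target_size=60)
    print(len(sample), "texts drawn from", len(pool), "strata")

In this toy pool the press, fiction, and learned strata contribute 30, 20, and 10 texts respectively, mirroring their relative sizes in the source material.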

Annotation and Preparation

After initial construction, text corpora undergo preprocessing to clean and standardize the , ensuring it is suitable for linguistic and computational . Tokenization involves segmenting the text into smaller units such as words, , or subwords, which is essential for subsequent tasks like and tagging; common methods include rule-based splitting on whitespace and , though challenges arise with ambiguities in contractions or hyphenated terms. follows to reduce variability, encompassing techniques like converting text to lowercase, removing diacritics, and applying to map inflected forms to their base or dictionary form, thereby facilitating consistent across the corpus. Noise removal addresses extraneous elements such as tags, special characters, or irrelevant , often using regular expressions or filters to strip formatting while preserving semantic content. Annotation enhances the corpus by adding interpretive layers, enabling deeper analysis of linguistic structures. Part-of-speech (POS) tagging assigns grammatical categories (e.g., noun, verb) to each token, typically using tagsets like the 36-tag scheme in the Penn Treebank, which supports automated training of models while allowing manual refinement for accuracy. Syntactic parsing structures sentences into hierarchical trees representing phrase and dependency relations, as exemplified by the Penn Treebank's bracketing scheme that encodes constituency and functional labels for over 4.5 million words of English text. Semantic labeling goes further by marking elements like named entities, coreference, or predicate-argument structures, often building on syntactic annotations to capture meaning; standards such as those in the Penn Treebank facilitate interoperability across tools. Annotation can be manual, involving human experts for high precision in complex cases, or automatic, leveraging machine learning models trained on existing corpora for scalability, with hybrid approaches combining both to balance cost and quality. Quality control is integral to maintain reliability, involving systematic error detection through automated validation scripts that flag inconsistencies like mismatched tags or incomplete parses. Inter-annotator agreement metrics, such as , quantify reliability by measuring agreement between multiple annotators beyond chance, with values above 0.8 indicating strong consistency in tasks like POS tagging; this statistic is particularly useful in for validating annotation schemes. Versioning systems track iterative changes, allowing reversion to prior states and documentation of modifications to ensure reproducibility. Annotating diverse corpora presents challenges, particularly with multilingual texts where varying scripts, morphologies, and standards across languages complicate uniform markup, often requiring language-specific guidelines or parallel alignment strategies. Dialectal variations introduce inconsistencies in and , necessitating region-aware tagsets to avoid bias toward standard forms. Historical corpora exacerbate issues with archaic spellings and orthographic shifts, which can degrade automatic tagging accuracy unless addressed through normalization tools like VARD that standardize variants probabilistically based on context.
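As a minimal sketch of the preprocessing steps described above—noise removal, tokenization, and normalization—the following Python example uses only the standard library; the regular expressions and the toy sentence are illustrative assumptions rather than a prescribed pipeline.

    import re
    import unicodedata

    def strip_markup(text):
        """Noise removal: drop HTML-like tags and collapse whitespace."""
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def tokenize(text):
        """Naive rule-based tokenization on word characters and apostrophes."""
        return re.findall(r"[\w']+", text)

    def normalize(token):
        """Lowercase and strip diacritics so spelling variants collapse."""
        decomposed = unicodedata.normalize("NFD", token.lower())
        return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

    raw = "<p>The caf\u00e9's menu changed \u2014 see <b>page 2</b>.</p>"
    tokens = [normalize(t) for t in tokenize(strip_markup(raw))]
    print(tokens)  # ['the', "cafe's", 'menu', 'changed', 'see', 'page', '2']

Likewise, the inter-annotator agreement statistic mentioned above can be computed directly from two annotators' label sequences; the helper below is a small illustration of Cohen's kappa on invented POS labels.

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa: chance-corrected agreement between two annotators."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        categories = set(labels_a) | set(labels_b)
        expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                       for c in categories)
        return (observed - expected) / (1 - expected)

    # Two annotators' POS labels for the same ten tokens (toy data).
    a = ["NOUN", "VERB", "NOUN", "DET", "NOUN", "VERB", "ADJ", "NOUN", "DET", "NOUN"]
    b = ["NOUN", "VERB", "NOUN", "DET", "ADJ",  "VERB", "ADJ", "NOUN", "DET", "NOUN"]
    print(round(cohen_kappa(a, b), 2))  # 0.86, i.e. strong agreement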

Applications

Linguistics

In descriptive linguistics, text corpora provide empirical evidence for analyzing word frequency distributions, allowing researchers to quantify how often specific lexical items appear across contexts and thereby uncover patterns of usage that inform phonological, morphological, and syntactic descriptions. This approach shifts focus from introspective judgments to data-driven description, as corpora reveal variations in word occurrences that might otherwise go unnoticed in smaller samples. Collocation studies, a cornerstone of this analysis, use corpora to identify recurrent word pairings, such as "strong tea" over "powerful tea," by examining co-occurrence frequencies within defined spans. Register comparisons further leverage corpora to contrast linguistic features in academic versus conversational texts, highlighting register-specific distributions.

Corpus-based approaches extend to grammatical description by supplying authentic examples that test and refine grammatical rules. In corpus-based grammar, researchers draw on attested instances from corpora to validate or challenge prescriptive rules, such as the variability in dative alternation (e.g., "give the book to her" versus "give her the book"), providing quantitative support for probabilistic rather than absolute constraints. Sociolinguistic investigations utilize corpora to examine variation influenced by social factors, including regional dialects (e.g., differences between British and American English) and gender-based patterns, such as higher frequencies of hedges like "you know" in female speech across sampled dialogues. Historical linguistics employs diachronic corpora, which compile texts spanning centuries, to trace language evolution; for example, the Helsinki Corpus documents shifts in English syntax from synthetic to analytic structures between 730 AD and 1700 AD.

Key methodologies in corpus linguistics for these applications include concordancing, which retrieves all instances of a keyword in context (e.g., KWIC lines showing surrounding words within a 50-word span) to facilitate qualitative examination of usage patterns. Keyword extraction identifies terms unusually frequent in a target corpus compared to a reference corpus, aiding in the identification of salient themes without manual sifting. Statistical measures like pointwise mutual information quantify collocation strength by calculating the logarithmic ratio of observed to expected co-occurrence probabilities, defined as \mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}, where high values (e.g., above 3) signal significant associations such as "strong tea."

Text corpora have profoundly impacted lexicography by furnishing evidence for dictionary entries, including authentic collocations and usage examples that reflect real-world frequency. For sense disambiguation, corpora help distinguish polysemous meanings through contextual distributions, as in resolving "bank" as a financial institution versus a river edge based on surrounding terms. Neologism detection relies on monitoring novel forms in corpora, such as tracking the rising frequency of newly coined words in contemporary texts, enabling timely inclusion in lexical resources.
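A minimal sketch of the pointwise mutual information measure defined above, assuming a toy tokenized corpus and a simple co-occurrence window; the window size, the corpus, and the probability estimates are illustrative simplifications rather than the method of any particular tool.

    import math
    from collections import Counter

    def pmi(tokens, word_x, word_y, window=4):
        """Pointwise mutual information of two words within a co-occurrence window."""
        n = len(tokens)
        unigrams = Counter(tokens)
        pair_count = 0
        for i, tok in enumerate(tokens):
            if tok == word_x:
                # Count occurrences of word_y within `window` tokens on either side.
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                pair_count += context.count(word_y)
        if pair_count == 0:
            return float("-inf")
        p_xy = pair_count / n          # simplified joint-probability estimate
        p_x = unigrams[word_x] / n
        p_y = unigrams[word_y] / n
        return math.log2(p_xy / (p_x * p_y))

    corpus = ("i drink strong tea every morning and strong coffee at noon "
              "weak tea is fine but strong tea is better").split()
    print(round(pmi(corpus, "strong", "tea"), 2))  # about 3.15 on this toy corpus

Even on this tiny sample, the "strong"/"tea" pair scores above the conventional threshold of 3, illustrating how the measure flags habitual collocations.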

Computational Linguistics and NLP

Text corpora play a pivotal role in computational linguistics and natural language processing (NLP) as primary datasets for training models that underpin modern language technologies. These corpora supply the extensive textual data required for developing representations that capture linguistic patterns, semantics, and syntax through supervised, unsupervised, and self-supervised learning. A landmark example is the BERT model, which was pre-trained on the BookCorpus—a collection of 800 million words from unpublished books—and English Wikipedia, encompassing 2.5 billion words, to learn bidirectional contextual embeddings that revolutionized downstream NLP tasks.

In targeted NLP applications, specialized corpora enable task-specific model training and fine-tuning. Annotated corpora are fundamental for named entity recognition (NER), where datasets like CoNLL-2003 provide sentence-level annotations for entities such as persons, locations, organizations, and miscellaneous items in Reuters newswire texts, allowing models to achieve entity extraction accuracies exceeding 90% F1-score on benchmarks. Parallel corpora support machine translation by offering aligned sentence pairs across languages; for instance, Europarl supplies over 2 million sentence pairs from European Parliament proceedings in 21 languages, facilitating statistical and neural translation models that learn cross-lingual mappings. Sentiment analysis leverages labeled corpora like the Stanford Sentiment Treebank (SST), which includes 11,855 sentences from movie reviews with fine-grained polarity annotations from very negative to very positive, enabling models to classify emotional tones with nuanced granularity. Speech recognition benefits from multimodal corpora aligning text transcripts with audio, such as LibriSpeech, comprising 1,000 hours of 16 kHz English audiobook readings from public domain sources, which supports end-to-end training of acoustic models to transcribe speech with word error rates below 5% on clean test sets.

Corpus-based evaluation in NLP employs standardized metrics to quantify model performance against annotated gold standards, ensuring reliable comparisons across systems. Precision measures the proportion of correct positive predictions among all positive predictions, recall assesses the fraction of actual positives correctly identified, and the F1-score harmonizes them as their harmonic mean, particularly vital for imbalanced datasets common in NLP tasks like NER. These metrics are applied in benchmarks such as the GLUE suite, where corpus splits from diverse sources test general language understanding, with top models achieving aggregate scores around 90% through ensemble techniques. Cross-validation methods, including stratified k-fold partitioning of the corpus, further validate model robustness by simulating varied data distributions and mitigating overfitting.

Recent advancements emphasize massive-scale corpora for pretraining, exemplified by Common Crawl—a web archive exceeding 3 petabytes of data annually—whose filtered subsets train large language models like GPT-3 on hundreds of billions of tokens, enhancing zero-shot and few-shot capabilities in generative tasks. Yet, this scale amplifies biases inherent in web-sourced data, as models trained on such corpora reproduce and intensify societal prejudices, such as gender stereotypes, more than those trained on curated datasets. Addressing these issues involves preprocessing for bias detection and data augmentation strategies to promote equitable representations.
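The evaluation metrics described above can be made concrete with a short sketch that scores a hypothetical set of predicted entity spans against gold-standard annotations; the spans and labels are invented for the example.

    def precision_recall_f1(predicted, gold):
        """Precision, recall, and F1 (harmonic mean) over sets of predicted vs. gold items."""
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        return precision, recall, 2 * precision * recall / (precision + recall)

    # Hypothetical NER output for one sentence: (start, end, label) spans.
    gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
    predicted = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # 0.67 for each here

Because one of the three predicted spans carries the wrong label, precision, recall, and F1 all come out at two thirds, showing how exact-match span scoring penalizes labeling errors.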

Examples

General-Purpose Corpora

General-purpose corpora are large-scale collections of text designed to represent a broad spectrum of language use across various genres, registers, and contexts, serving as versatile resources for linguistic research, lexicography, and language technology. These corpora emphasize balance and diversity to capture the overall structure and variation of a language variety, often including both written and spoken components, and are typically annotated for part-of-speech or syntactic features to facilitate automated analysis. Unlike domain-specific collections, they aim for comprehensive coverage to support cross-disciplinary studies, such as grammatical description or frequency-based modeling.

The Brown Corpus, compiled in 1961 by W. Nelson Francis and Henry Kučera at Brown University, was the first million-word balanced corpus of American English, marking a foundational milestone in corpus linguistics. It consists of 500 samples totaling one million words, drawn from 15 genres including press, fiction, and learned texts, with sampling based on materials from the Brown University Library and Providence Athenaeum published in 1961. This systematic selection ensured proportional representation of different text categories, approximately 52% informative prose and 48% imaginative prose, enabling early computational analyses of word frequencies and collocations. The corpus's influence is evident in its role as a model for subsequent balanced corpora, inspiring projects like the Lancaster-Oslo/Bergen Corpus and establishing standards for representativeness in empirical language studies.

The British National Corpus (BNC), developed in the 1990s by a consortium led by Oxford University Press, comprises 100 million words of modern British English, with 90% from written sources and 10% from spoken transcripts, captured primarily from the late 1980s to 1993. It includes over 4,000 texts from diverse domains such as books, newspapers, and conversations, selected through stratified sampling to reflect sociolinguistic variation including region, age, and gender. The BNC features XML markup for structural and linguistic annotation, including part-of-speech tagging, which supports advanced querying and analysis. As a publicly available resource under license from the BNC Consortium, it has facilitated extensive research in areas like lexicography and grammatical description, serving as a benchmark for corpus-based studies.

The Corpus of Contemporary American English (COCA), created by linguist Mark Davies and hosted at English-Corpora.org, is a monitor corpus exceeding 1 billion words of American English from 1990 onward, updated annually until 2019 to track language change over time. It maintains genre balance across five categories—spoken, fiction, magazines, newspapers, and academic journals—with equal proportions to avoid skew toward any single genre, drawing from television transcripts, books, and periodicals. Fully searchable online via a web interface, COCA allows word and collocate searches, frequency lists, and comparisons with other corpora like the BNC, making it a key tool for real-time linguistic monitoring and applications in language teaching. Its scale and accessibility have made it one of the most widely used resources, with millions of queries processed annually.

The International Corpus of English (ICE) is a collaborative family of 1-million-word corpora documenting varieties of English worldwide, initiated in 1990 by Sidney Greenbaum at University College London to study English in countries where it holds official status. Each national component, such as ICE-GB for Great Britain or ICE-India, contains roughly 600,000 words of spoken and 400,000 words of written data from the 1990s, sampled from sources like broadcasts, fiction, and academic writing to ensure comparability across varieties. ICE corpora follow a common annotation scheme, with components such as ICE-GB fully parsed syntactically, enabling cross-varietal analyses of grammatical structures and pragmatic features.
Coordinated internationally with contributions from over 20 teams, the project has produced more than 20 complete corpora that are publicly available for research as of 2025, advancing comparative studies of English diversification.

Specialized Corpora

Specialized corpora are designed for targeted research in specific domains, languages, or applications, often featuring domain-specific annotations to support in-depth analysis such as entity recognition, alignment, or syntactic parsing. These resources contrast with general-purpose corpora by emphasizing thematic depth and expert curation, enabling precise investigations into specialized linguistic phenomena.

In the medical domain, corpora derived from clinical literature facilitate tasks like terminology extraction and entity annotation for clinical decision support. For instance, a corpus of 263 randomized controlled trial (RCT) abstracts from the British Medical Journal (BMJ) has been annotated for PICO elements (Population, Intervention, Comparison, Outcome), aiding in the identification of key clinical concepts and supporting schema-based information extraction. Another prominent example is a corpus of 5,000 abstracts from medical articles on clinical RCTs, richly annotated for patients, interventions, and outcomes, which enables advanced natural language processing for evidence-based medicine research. These annotations typically include named entities such as medical terms and symptoms, with inter-annotator agreement exceeding 80% in entity-level tasks, highlighting their utility for training models in clinical terminology extraction.

Legal corpora focus on structured texts like statutes and judgments, often with parallel alignments to address multilingual terminology and legal translation. The MultiJur corpus, comprising international conventions and treaties in multiple languages, is aligned at the paragraph level to support comparative legal terminology research and translation studies in legal contexts. Similarly, the JRC-Acquis contains over 1 billion words across 22 official languages, drawn primarily from legal documents, with sentence-level alignments that enable cross-lingual studies of legal terminology and phraseology. These resources emphasize parallel structure to capture nuances in legal system-bound terms, such as court names, facilitating objective analysis of translation strategies in law.

Multilingual specialized corpora extend this precision to cross-lingual syntax and morphology. The Europarl corpus, extracted from European Parliament proceedings, includes approximately 60 million words per language across 21 languages, with sentence alignments that support machine translation research and multilingual policy studies. Complementing this, Universal Dependencies (UD) treebanks provide consistent syntactic annotations for 319 treebanks in 179 languages, as of May 2025, focusing on dependency relations to enable cross-lingual parsing and linguistic complexity research (a short illustrative fragment appears below). UD's standardized tags and morphological features achieve high consistency across languages, with parsing accuracies often above 90% in monolingual settings and transferable to low-resource languages via cross-lingual methods.

Emerging specialized corpora from social media, such as Twitter datasets, target sentiment and opinion mining while addressing ethical challenges in sourcing. The Moral Foundations Twitter Corpus consists of 35,108 tweets annotated for moral sentiment, curated from seven socially and politically relevant discourse domains, supporting analyses of moral language on ethical topics. Ethical considerations in these corpora include anonymization to protect user privacy, compliance with platform terms of service, and avoidance of real-time scraping without consent, as emphasized in guidelines for using social media data. Such practices ensure responsible use, mitigating risks like re-identification while enabling sentiment models with F1-scores exceeding 70% on annotated subsets.
For recent developments, the Dataset (2020–2023) provides over 200 million tweets annotated for sentiments, aiding pandemic response analysis.
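To make the Universal Dependencies annotation style mentioned above concrete, the following Python snippet embeds a hand-constructed CoNLL-U fragment for a short English sentence and prints a few of its columns; the analysis follows standard UD conventions, but the sentence is an illustrative example rather than an excerpt from any released treebank.

    # Columns in CoNLL-U: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    rows = [
        "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
        "2\tcourt\tcourt\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_",
        "3\tdismissed\tdismiss\tVERB\tVBD\tTense=Past\t0\troot\t_\t_",
        "4\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t5\tdet\t_\t_",
        "5\tappeal\tappeal\tNOUN\tNN\tNumber=Sing\t3\tobj\t_\t_",
        "6\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_",
    ]

    # Print a readable dependency summary: token, part of speech, head index, relation.
    for row in rows:
        cols = row.split("\t")
        print(f"{cols[1]:>10}  {cols[3]:<5}  head={cols[6]:>2}  {cols[7]}")

Each token points to the index of its syntactic head (the root verb points to 0), which is the property that makes such treebanks directly usable for training and evaluating dependency parsers across languages.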
