
Text corpus

A text corpus, in linguistics and natural language processing, is a large, principled collection of naturally occurring language examples—typically written texts or transcriptions of speech—stored electronically in a machine-readable format for systematic analysis. These collections are designed to represent authentic language use across various genres, registers, and contexts, serving as empirical evidence for studying linguistic patterns rather than relying on intuition or contrived examples. Key characteristics include their size (often millions of words), representativeness through principled sampling, and reliance on computational tools for quantitative and qualitative examination.

The development of text corpora traces back to 19th-century lexicographic efforts to gather language samples, but the field of corpus linguistics formalized in the 1960s with the advent of computers, marked by the creation of the Brown Corpus in 1961—a 1-million-word sample of mid-20th-century American English. Today, corpora range from synchronic (capturing language at a single point in time, like the Corpus of Contemporary American English with over 1 billion words) to diachronic (tracking changes over time, such as the Corpus of Historical American English). Common types also encompass general corpora for broad language representation (e.g., the British National Corpus), specialized corpora focused on domains like academic speech (e.g., the Michigan Corpus of Academic Spoken English), parallel corpora aligning texts in multiple languages for translation research, learner corpora documenting non-native usage, and multimodal corpora integrating text with audio or video.

Text corpora underpin diverse applications, from identifying word frequencies, collocations, and grammatical structures in linguistic research to training natural language processing models in tasks like machine translation and speech recognition. In education, they inform syllabus design and materials development, such as the Academic Word List derived from corpus data to highlight high-frequency vocabulary in scholarly texts. By enabling replicable, data-driven insights into language variation across dialects, time periods, and social contexts, corpora have transformed empirical approaches to language study and computational applications.

Fundamentals

Definition

A text corpus is a large, structured collection of machine-readable texts assembled for linguistic or computational analysis, often designed to represent a specific language, variety, genre, or historical period. These collections enable empirical investigation into linguistic patterns, usage frequencies, and structural features by providing a finite, digitized sample of authentic language data. Key attributes of a text corpus include its substantial size, typically encompassing millions to billions of words to ensure statistical reliability; representativeness, achieved through principled sampling that mirrors real-world language use across varied contexts; and machine-readability, facilitated by digital formats that allow automated processing and querying. This structured nature distinguishes a corpus from mere accumulations of texts, emphasizing purposeful curation for analytical objectives rather than arbitrary gathering. The term "corpus" derives from the Latin corpus, meaning "body," and entered linguistic usage in the 20th century to denote a cohesive body of texts suitable for systematic study. In this context, it underscores the corpus as an organized whole, akin to a physical entity, rather than disparate fragments.

Types

Text corpora are categorized based on their design principles, scale, and intended purpose, reflecting diverse approaches to capturing linguistic phenomena. Major classifications include balanced corpora, monitor corpora, and parallel corpora, each tailored to specific research needs in corpus linguistics.

Balanced corpora aim for even representation across genres, registers, and subcorpora to provide a snapshot of language use at a given time, ensuring proportional coverage of text types such as fiction, newspapers, and academic prose. This design promotes representativeness and comparability, allowing researchers to draw inferences about general language patterns without bias toward dominant genres; however, their static nature limits their ability to capture diachronic changes in language evolution. Monitor corpora, in contrast, are dynamically updated with new texts to track ongoing shifts in usage, prioritizing breadth and recency over strict balance. They facilitate the study of contemporary trends, such as neologisms or syntactic innovations, but require continuous maintenance and may introduce inconsistencies due to varying addition rates. Parallel corpora consist of aligned texts in multiple languages, where source and target versions correspond sentence-by-sentence, enabling cross-linguistic comparisons and translation analysis. Their strength lies in supporting machine translation and translation studies, though alignment errors and translationese effects can skew natural representation.

Other notable types include reference corpora, which are large, general-purpose collections serving as benchmarks for dictionary compilation, grammatical description, and normative studies. These corpora emphasize comprehensiveness and reliability, often drawing from diverse sources, but their size can complicate targeted analyses without sub-sampling. Specialized or domain-specific corpora focus on particular fields, such as medical or legal texts, to investigate terminology, phraseology, and usage patterns within constrained contexts. This targeted approach yields high relevance for experts but reduces generalizability to broader language use. Comparable corpora feature similar texts across languages without direct translation alignment, built under equivalent sampling criteria to enable indirect cross-linguistic insights. They offer flexibility for studying cultural variation in language use but demand meticulous design to ensure true comparability, avoiding unintended biases in text selection.

Emerging types expand traditional boundaries, such as web-as-corpus approaches that harvest web texts for vast, real-time data pools, treating the web as a dynamic linguistic resource. This method captures informal and evolving language at scale but grapples with noise, copyright issues, and representativeness challenges from uneven web coverage. Multimodal corpora integrate textual elements with non-textual data like audio or video transcripts, focusing on synchronized textual components to analyze language in context-rich environments. While enhancing understanding of communicative interplay, they necessitate advanced tools for textual alignment and synchronization, increasing preparation complexity.

Development

Construction Methods

Constructing a text corpus begins with sourcing raw textual materials from diverse origins to ensure a foundation suitable for linguistic analysis. Common methods include manual collection through keyboarding, where texts from books or journals are typed directly into digital formats, particularly for handwritten or degraded sources; digitization via scanning and optical character recognition (OCR) software to convert physical documents into machine-readable text; and automated crawling of digital archives, such as Project Gutenberg, which provides over 75,000 e-books (as of November 2025) for free download and integration into corpora.

Once sourced, materials undergo sampling to achieve representativeness and balance, reflecting the target language variety without bias. Random sampling selects texts probabilistically from a larger population to capture natural variability, though it risks underrepresenting rare linguistic features; stratified sampling divides the source into subgroups (strata) based on criteria like genre, demographics, or time period—such as the 15 genre categories in the Brown Corpus—before proportionally selecting from each to ensure comprehensive coverage. Corpus size is determined by research objectives, with specialized corpora often targeting 1 million words for focused studies and general-purpose ones aiming for 10-100 million words to enable robust statistical analysis, as seen in the British National Corpus (BNC) at 100 million words.

Ethical and legal considerations are integral to construction, prioritizing compliance with copyright and data protection regulations. Copyright clearance requires permissions from rights holders for proprietary texts, favoring public-domain or licensed materials to avoid infringement on reproduction and distribution rights; for instance, the CLARIN guidelines recommend license agreements for copyrighted works while permitting use of expired copyrights or orphan works after diligent searches. Personal data in texts must be anonymized—through techniques like pseudonymization or removal of identifiers—to adhere to regulations such as the EU's General Data Protection Regulation (GDPR), which mandates protection of personal data processed for research purposes.

Standardized tools and formats facilitate interoperability and efficient assembly. Texts are often encoded in XML using the Text Encoding Initiative (TEI) guidelines, which provide a structured schema for markup of linguistic features, metadata, and document hierarchies to ensure consistency across corpora. Software like Sketch Engine supports initial assembly by allowing uploads in vertical or XML formats, enabling the creation of corpora up to billions of words with built-in annotation tools; similarly, AntConc aids in compiling and preliminary processing of text files into searchable collections, though it is primarily geared toward analysis. These practices ensure corpora are machine-readable, scalable, and reusable in linguistic research.
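The stratified sampling step described above can be illustrated with a short Python sketch. The genre labels, file names, and target size below are invented for the example; the proportional-quota logic is a simplified version of the sampling principle, not the procedure used for any particular corpus.

    import random

    def stratified_sample(strata, target_size):
        """Draw a proportional random sample of texts from each stratum (genre)."""
        total = sum(len(texts) for texts in strata.values())
        selected = []
        for name, texts in strata.items():
            # Each stratum contributes in proportion to its share of the source pool.
            quota = max(1, round(target_size * len(texts) / total))
            selected.extend(random.sample(texts, min(quota, len(texts))))
        return selected

    # Hypothetical source pool grouped by genre (illustrative file names only).
    pool = {
        "press":   [f"press_{i}.txt" for i in range(300)],
        "fiction": [f"fiction_{i}.txt" for i in range(200)],
        "learned": [f"learned_{i}.txt" for i in range(100)],
    }
    sample = stratified_sample(pool, target_size=60)
    print(len(sample), "texts drawn from", len(pool), "strata")

In this toy pool the press, fiction, and learned strata contribute 30, 20, and 10 texts respectively, mirroring their relative sizes in the source material.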

Annotation and Preparation

After initial construction, text corpora undergo preprocessing to clean and standardize the , ensuring it is suitable for linguistic and computational . Tokenization involves segmenting the text into smaller units such as words, , or subwords, which is essential for subsequent tasks like and tagging; common methods include rule-based splitting on whitespace and , though challenges arise with ambiguities in contractions or hyphenated terms. follows to reduce variability, encompassing techniques like converting text to lowercase, removing diacritics, and applying to map inflected forms to their base or dictionary form, thereby facilitating consistent across the corpus. Noise removal addresses extraneous elements such as tags, special characters, or irrelevant , often using regular expressions or filters to strip formatting while preserving semantic content. Annotation enhances the corpus by adding interpretive layers, enabling deeper analysis of linguistic structures. Part-of-speech (POS) tagging assigns grammatical categories (e.g., noun, verb) to each token, typically using tagsets like the 36-tag scheme in the Penn Treebank, which supports automated training of models while allowing manual refinement for accuracy. Syntactic parsing structures sentences into hierarchical trees representing phrase and dependency relations, as exemplified by the Penn Treebank's bracketing scheme that encodes constituency and functional labels for over 4.5 million words of English text. Semantic labeling goes further by marking elements like named entities, coreference, or predicate-argument structures, often building on syntactic annotations to capture meaning; standards such as those in the Penn Treebank facilitate interoperability across tools. Annotation can be manual, involving human experts for high precision in complex cases, or automatic, leveraging machine learning models trained on existing corpora for scalability, with hybrid approaches combining both to balance cost and quality. Quality control is integral to maintain reliability, involving systematic error detection through automated validation scripts that flag inconsistencies like mismatched tags or incomplete parses. Inter-annotator agreement metrics, such as , quantify reliability by measuring agreement between multiple annotators beyond chance, with values above 0.8 indicating strong consistency in tasks like POS tagging; this statistic is particularly useful in for validating annotation schemes. Versioning systems track iterative changes, allowing reversion to prior states and documentation of modifications to ensure reproducibility. Annotating diverse corpora presents challenges, particularly with multilingual texts where varying scripts, morphologies, and standards across languages complicate uniform markup, often requiring language-specific guidelines or parallel alignment strategies. Dialectal variations introduce inconsistencies in and , necessitating region-aware tagsets to avoid bias toward standard forms. Historical corpora exacerbate issues with archaic spellings and orthographic shifts, which can degrade automatic tagging accuracy unless addressed through normalization tools like VARD that standardize variants probabilistically based on context.
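As a minimal sketch of the preprocessing steps described above—noise removal, tokenization, and normalization—the following Python example uses only the standard library; the regular expressions and the toy sentence are illustrative assumptions rather than a prescribed pipeline.

    import re
    import unicodedata

    def strip_markup(text):
        """Noise removal: drop HTML-like tags and collapse whitespace."""
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def tokenize(text):
        """Naive rule-based tokenization on word characters and apostrophes."""
        return re.findall(r"[\w']+", text)

    def normalize(token):
        """Lowercase and strip diacritics so spelling variants collapse."""
        decomposed = unicodedata.normalize("NFD", token.lower())
        return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

    raw = "<p>The caf\u00e9's menu changed \u2014 see <b>page 2</b>.</p>"
    tokens = [normalize(t) for t in tokenize(strip_markup(raw))]
    print(tokens)  # ['the', "cafe's", 'menu', 'changed', 'see', 'page', '2']

Likewise, the inter-annotator agreement statistic mentioned above can be computed directly from two annotators' label sequences; the helper below is a small illustration of Cohen's kappa on invented POS labels.

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa: chance-corrected agreement between two annotators."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        categories = set(labels_a) | set(labels_b)
        expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                       for c in categories)
        return (observed - expected) / (1 - expected)

    # Two annotators' POS labels for the same ten tokens (toy data).
    a = ["NOUN", "VERB", "NOUN", "DET", "NOUN", "VERB", "ADJ", "NOUN", "DET", "NOUN"]
    b = ["NOUN", "VERB", "NOUN", "DET", "ADJ",  "VERB", "ADJ", "NOUN", "DET", "NOUN"]
    print(round(cohen_kappa(a, b), 2))  # 0.86, i.e. strong agreement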

Applications

Linguistics

In descriptive linguistics, text corpora provide empirical evidence for analyzing word frequency distributions, allowing researchers to quantify how often specific lexical items appear across contexts and thereby uncover patterns of usage that inform phonological, morphological, and syntactic descriptions. This approach shifts focus from introspective judgments to data-driven description, as corpora reveal variations in word occurrences that might otherwise go unnoticed in smaller samples. Collocation studies, a cornerstone of this analysis, use corpora to identify recurrent word pairings, such as "strong tea" over "powerful tea," by examining co-occurrence frequencies within defined spans. Register comparisons further leverage corpora to contrast linguistic features in academic versus conversational texts, highlighting register-specific distributions.

Corpus-based approaches extend to grammatical description by supplying authentic examples that test and refine grammatical rules. In corpus-based grammar, researchers draw on attested instances from corpora to validate or challenge prescriptive rules, such as the variability in dative alternation (e.g., "give the book to her" versus "give her the book"), providing quantitative support for probabilistic rather than absolute constraints. Sociolinguistic investigations utilize corpora to examine variation influenced by social factors, including regional dialects (e.g., differences between British and American English) and gender-based patterns, such as higher frequencies of hedges like "you know" in female speech across sampled dialogues. Historical linguistics employs diachronic corpora, which compile texts spanning centuries, to trace language evolution; for example, the Helsinki Corpus documents shifts in English syntax from synthetic to analytic structures between 730 AD and 1700 AD.

Key methodologies in corpus linguistics for these applications include concordancing, which retrieves all instances of a keyword in context (e.g., KWIC lines showing surrounding words within a 50-word span) to facilitate qualitative examination of usage patterns. Keyword extraction identifies terms unusually frequent in a target corpus compared to a reference corpus, aiding in the identification of salient themes without manual sifting. Statistical measures like pointwise mutual information quantify collocation strength by calculating the logarithmic ratio of observed to expected co-occurrence probabilities, defined as \mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}, where high values (e.g., above 3) signal significant associations such as "strong tea."

Text corpora have profoundly impacted lexicography by furnishing evidence for dictionary entries, including authentic collocations and usage examples that reflect real-world frequency. For sense disambiguation, corpora help distinguish polysemous meanings through contextual distributions, as in resolving "bank" as a financial institution versus a river edge based on surrounding terms. Neologism detection relies on monitoring novel forms in corpora, such as tracking the rising frequency of newly coined words in contemporary texts, enabling timely inclusion in lexical resources.
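A minimal sketch of the pointwise mutual information measure defined above, assuming a toy tokenized corpus and a simple co-occurrence window; the window size, the corpus, and the probability estimates are illustrative simplifications rather than the method of any particular tool.

    import math
    from collections import Counter

    def pmi(tokens, word_x, word_y, window=4):
        """Pointwise mutual information of two words within a co-occurrence window."""
        n = len(tokens)
        unigrams = Counter(tokens)
        pair_count = 0
        for i, tok in enumerate(tokens):
            if tok == word_x:
                # Count occurrences of word_y within `window` tokens on either side.
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                pair_count += context.count(word_y)
        if pair_count == 0:
            return float("-inf")
        p_xy = pair_count / n          # simplified joint-probability estimate
        p_x = unigrams[word_x] / n
        p_y = unigrams[word_y] / n
        return math.log2(p_xy / (p_x * p_y))

    corpus = ("i drink strong tea every morning and strong coffee at noon "
              "weak tea is fine but strong tea is better").split()
    print(round(pmi(corpus, "strong", "tea"), 2))  # about 3.15 on this toy corpus

Even on this tiny sample, the "strong"/"tea" pair scores above the conventional threshold of 3, illustrating how the measure flags habitual collocations.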

Computational Linguistics and NLP

Text corpora play a pivotal role in computational linguistics and natural language processing (NLP) as primary datasets for training models that underpin modern language technologies. These corpora supply the extensive textual data required for developing representations that capture linguistic patterns, semantics, and syntax through supervised, unsupervised, and self-supervised learning. A landmark example is the BERT model, which was pre-trained on the BookCorpus—a collection of 800 million words from unpublished books—and English Wikipedia, encompassing 2.5 billion words, to learn bidirectional contextual embeddings that revolutionized downstream NLP tasks.

In targeted NLP applications, specialized corpora enable task-specific model training and fine-tuning. Annotated corpora are fundamental for named entity recognition (NER), where datasets like CoNLL-2003 provide sentence-level annotations for entities such as persons, locations, organizations, and miscellaneous items in Reuters newswire texts, allowing models to achieve entity extraction accuracies exceeding 90% F1-score on benchmarks. Parallel corpora support machine translation by offering aligned sentence pairs across languages; for instance, Europarl supplies over 2 million sentence pairs from European Parliament proceedings in 21 languages, facilitating statistical and neural translation models that learn cross-lingual mappings. Sentiment analysis leverages labeled corpora like the Stanford Sentiment Treebank (SST), which includes 11,855 sentences from movie reviews with fine-grained polarity annotations from very negative to very positive, enabling models to classify emotional tones with nuanced granularity. Speech recognition benefits from multimodal corpora aligning text transcripts with audio, such as LibriSpeech, comprising 1,000 hours of 16 kHz English audiobook readings from public domain sources, which supports end-to-end training of acoustic models to transcribe speech with word error rates below 5% on clean test sets.

Corpus-based evaluation in NLP employs standardized metrics to quantify model performance against annotated gold standards, ensuring reliable comparisons across systems. Precision measures the proportion of correct positive predictions among all positive predictions, recall assesses the fraction of actual positives correctly identified, and the F1-score harmonizes them as their harmonic mean, particularly vital for imbalanced datasets common in NLP tasks like NER. These metrics are applied in benchmarks such as the GLUE suite, where corpus splits from diverse sources test general language understanding, with top models achieving aggregate scores around 90% through ensemble techniques. Cross-validation methods, including stratified k-fold partitioning of the corpus, further validate model robustness by simulating varied data distributions and mitigating overfitting.

Recent advancements emphasize massive-scale corpora for pretraining, exemplified by Common Crawl—a web archive exceeding 3 petabytes of data annually—whose filtered subsets train large language models like GPT-3 on hundreds of billions of tokens, enhancing zero-shot and few-shot capabilities in generative tasks. Yet, this scale amplifies biases inherent in web-sourced data, as models trained on such corpora reproduce and intensify societal prejudices, such as gender stereotypes, more than those trained on curated datasets. Addressing these issues involves preprocessing for bias detection and data augmentation strategies to promote equitable representations.
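The evaluation metrics described above can be made concrete with a short sketch that scores a hypothetical set of predicted entity spans against gold-standard annotations; the spans and labels are invented for the example.

    def precision_recall_f1(predicted, gold):
        """Precision, recall, and F1 (harmonic mean) over sets of predicted vs. gold items."""
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        return precision, recall, 2 * precision * recall / (precision + recall)

    # Hypothetical NER output for one sentence: (start, end, label) spans.
    gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
    predicted = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # 0.67 for each here

Because one of the three predicted spans carries the wrong label, precision, recall, and F1 all come out at two thirds, showing how exact-match span scoring penalizes labeling errors.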

Examples

General-Purpose Corpora

General-purpose corpora are large-scale collections of text designed to represent a broad spectrum of language use across various genres, registers, and contexts, serving as versatile resources for linguistic research, lexicography, and language technology. These corpora emphasize balance and diversity to capture the overall structure and variation of a language variety, often including both written and spoken components, and are typically annotated for part-of-speech or syntactic features to facilitate automated analysis. Unlike domain-specific collections, they aim for comprehensive coverage to support cross-disciplinary studies, such as grammatical description or frequency-based modeling.

The Brown Corpus, compiled in 1961 by W. Nelson Francis and Henry Kučera at Brown University, was the first million-word balanced corpus of American English, marking a foundational milestone in corpus linguistics. It consists of 500 samples totaling one million words, drawn from 15 genres including press, fiction, and learned texts, with sampling based on materials from the Brown University Library and Providence Athenaeum published in 1961. This systematic selection ensured proportional representation of different text categories, approximately 52% informative prose and 48% imaginative prose, enabling early computational analyses of word frequencies and collocations. The corpus's influence is evident in its role as a model for subsequent balanced corpora, inspiring projects like the Lancaster-Oslo/Bergen Corpus and establishing standards for representativeness in empirical language studies.

The British National Corpus (BNC), developed in the 1990s by a consortium led by Oxford University Press, comprises 100 million words of modern British English, with 90% from written sources and 10% from spoken transcripts, captured primarily from the late 1980s to 1993. It includes over 4,000 texts from diverse domains such as books, newspapers, and conversations, selected through stratified sampling to reflect sociolinguistic variation including region, age, and gender. The BNC features XML markup for structural and linguistic annotation, including part-of-speech tagging, which supports advanced querying and analysis. As a publicly available resource under license from the BNC Consortium, it has facilitated extensive research in areas like lexicography and grammatical description, serving as a benchmark for corpus-based studies.

The Corpus of Contemporary American English (COCA), created by linguist Mark Davies and hosted at English-Corpora.org, is a monitor corpus exceeding 1 billion words of American English from 1990 onward, updated annually until 2019 to track language change over time. It maintains genre balance across five categories—spoken, fiction, magazines, newspapers, and academic journals—with equal proportions to avoid skew toward any single genre, drawing from television transcripts, books, and periodicals. Fully searchable online via a web interface, COCA allows word and collocate searches, frequency lists, and comparisons with other corpora like the BNC, making it a key tool for real-time linguistic monitoring and applications in language teaching. Its scale and accessibility have made it one of the most widely used resources, with millions of queries processed annually.

The International Corpus of English (ICE) is a collaborative family of 1-million-word corpora documenting varieties of English worldwide, initiated in 1990 by Sidney Greenbaum at University College London to study English in countries where it holds official status. Each national component, such as ICE-GB for Great Britain or ICE-India, contains roughly 600,000 words of spoken and 400,000 words of written data from the 1990s, sampled from sources like broadcasts, fiction, and academic writing to ensure comparability across varieties. ICE corpora follow a common annotation scheme, with components such as ICE-GB fully parsed syntactically, enabling cross-varietal analyses of grammatical structures and pragmatic features.
Coordinated internationally with contributions from over 20 teams, the project has produced more than 20 complete corpora that are publicly available for research as of 2025, advancing comparative studies of English diversification.

Specialized Corpora

Specialized corpora are designed for targeted research in specific domains, languages, or applications, often featuring domain-specific annotations to support in-depth analysis such as entity recognition, alignment, or syntactic parsing. These resources contrast with general-purpose corpora by emphasizing thematic depth and expert curation, enabling precise investigations into specialized linguistic phenomena.

In the medical domain, corpora derived from clinical literature facilitate tasks like terminology extraction and entity annotation for clinical decision support. For instance, a corpus of 263 randomized controlled trial (RCT) abstracts from the British Medical Journal (BMJ) has been annotated for PICO elements (Population, Intervention, Comparison, Outcome), aiding in the identification of key clinical concepts and supporting schema-based information extraction. Another prominent example is a corpus of 5,000 abstracts from medical articles on clinical RCTs, richly annotated for patients, interventions, and outcomes, which enables advanced natural language processing for evidence-based medicine research. These annotations typically include named entities such as medical terms and symptoms, with inter-annotator agreement exceeding 80% in entity-level tasks, highlighting their utility for training models in clinical terminology extraction.

Legal corpora focus on structured texts like statutes and judgments, often with parallel alignments to address multilingual terminology and legal translation. The MultiJur corpus, comprising international conventions and treaties in multiple languages, is aligned at the paragraph level to support comparative legal terminology research and translation studies in legal contexts. Similarly, the JRC-Acquis contains over 1 billion words across 22 official languages, drawn primarily from legal documents, with sentence-level alignments that enable cross-lingual studies of legal terminology and phraseology. These resources emphasize parallel structure to capture nuances in legal system-bound terms, such as court names, facilitating objective analysis of translation strategies in law.

Multilingual specialized corpora extend this precision to cross-lingual syntax and morphology. The Europarl corpus, extracted from European Parliament proceedings, includes approximately 60 million words per language across 21 languages, with sentence alignments that support machine translation research and multilingual policy studies. Complementing this, Universal Dependencies (UD) treebanks provide consistent syntactic annotations for 319 treebanks in 179 languages, as of May 2025, focusing on dependency relations to enable cross-lingual parsing and linguistic complexity research (a short illustrative fragment appears below). UD's standardized tags and morphological features achieve high consistency across languages, with parsing accuracies often above 90% in monolingual settings and transferable to low-resource languages via cross-lingual methods.

Emerging specialized corpora from social media, such as Twitter datasets, target sentiment and opinion mining while addressing ethical challenges in sourcing. The Moral Foundations Twitter Corpus consists of 35,108 tweets annotated for moral sentiment, curated from seven socially and politically relevant discourse domains, supporting analyses of moral language on ethical topics. Ethical considerations in these corpora include anonymization to protect user privacy, compliance with platform terms of service, and avoidance of real-time scraping without consent, as emphasized in guidelines for using social media data. Such practices ensure responsible use, mitigating risks like re-identification while enabling sentiment models with F1-scores exceeding 70% on annotated subsets.
For recent developments, the Dataset (2020–2023) provides over 200 million tweets annotated for sentiments, aiding pandemic response analysis.
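To make the Universal Dependencies annotation style mentioned above concrete, the following Python snippet embeds a hand-constructed CoNLL-U fragment for a short English sentence and prints a few of its columns; the analysis follows standard UD conventions, but the sentence is an illustrative example rather than an excerpt from any released treebank.

    # Columns in CoNLL-U: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    rows = [
        "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
        "2\tcourt\tcourt\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_",
        "3\tdismissed\tdismiss\tVERB\tVBD\tTense=Past\t0\troot\t_\t_",
        "4\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t5\tdet\t_\t_",
        "5\tappeal\tappeal\tNOUN\tNN\tNumber=Sing\t3\tobj\t_\t_",
        "6\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_",
    ]

    # Print a readable dependency summary: token, part of speech, head index, relation.
    for row in rows:
        cols = row.split("\t")
        print(f"{cols[1]:>10}  {cols[3]:<5}  head={cols[6]:>2}  {cols[7]}")

Each token points to the index of its syntactic head (the root verb points to 0), which is the property that makes such treebanks directly usable for training and evaluating dependency parsers across languages.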
