References
-
[1]
[PDF] AN INTRODUCTION TO CORPUS LINGUISTICS. Corpus linguistics studies language in use through corpora, which are large, principled collections of naturally occurring language examples stored ...
-
[2]
Week 1 What are Corpus Linguistics and Text Analysis? A corpus is a machine readable and electronically stored collection of natural language texts representing writing or speech chosen to be characteristic of a ...
-
[3]
Corpora and Text/Data Mining For Digital Humanities Projects. Jan 6, 2025 · In text mining, a “corpus” (plural: corpora) refers to a large and structured set of texts that are used for linguistic analysis, to study ...
-
[4]
Basic workflow for text analysis | Computing for Information Science. A text corpus is a large and structured set of texts. It typically stores the text as a raw character string with meta data and details stored with the text.
-
[5]
Corpus types: monolingual, parallel, multilingual… | Sketch Engine. A text corpus is a very large collection of text (often many billion words) produced by real users of the language and used to analyse how words, phrases and ...
-
[6]
Corpus Linguistics | Research Starters - EBSCO. Corpus linguistics uses corpora, or empirical collections of written and/or spoken text, to discern naturally occurring patterns and features of language use.
-
[7]
[PDF] Unit 2 Representativeness, balance and sampling. A corpus is considered representative if what we find on the basis of the corpus also holds for the language or language variety it is supposed to represent.
-
[8]
Corpus Representativeness (Chapter 3) - Designing and Evaluating ... We define corpus representativeness as the extent to which a corpus includes the full range of text types and linguistic distributions in a domain. In line ...
-
[9]
Definition of a corpus. Any collection of more than one text can be called a corpus, (corpus being Latin for "body", hence a corpus is any body of text).
-
[10]
Definition and Examples of Corpus Linguistics - ThoughtCo. May 12, 2025 · Although the methods used in corpus linguistics were first adopted in the early 1960s, the term itself didn't appear until the 1980s.
-
[11]
[PDF] 1 What is corpus linguistics? - Assets - Cambridge University Press. Corpus-based versus corpus-driven linguistics; Data collection regime; ... In contrast to monitor corpora, balanced corpora, also known as sample ...
-
[12]
[PDF] Corpus linguistics: some key terms. Balance. A property of a corpus (or, more precisely, a sampling frame). A corpus is said to be balanced if the relative sizes of each of its subsections have ...
-
[13]
[PDF] Development of Comparable Specialized Corpora of National ... By our definition, corpora containing components of varieties of the same language (e.g. the International Corpus of English […]) are not comparable corpora ...
-
[14]
[PDF] Introduction to the Special Issue on the Web as Corpus. The web is a large, free source of language data, used for research, and is a place to obtain a corpus meeting specifications.
-
[15]
[PDF] Web as Corpus - Adam Kilgarriff. The web, with its vast amount of language data, is a linguists' playground and a source for language research, used as a corpus.
-
[16]
[PDF] The future of multimodal corpora - SciELO. Design: Multimodal corpora tend to include synchronised video, audio and textual records designed and constructed primarily to meet a specific research need ...
-
[17]
[PDF] Building a written corpus What are the basics? Tribble's use of 'exemplar texts' to exemplify genres, while keeping the overall size of the corpus down to manageable levels, continued this trend and he noted ...
-
[18]
[PDF] Building and Cleaning Corpora for Linguistic Analysis: A Practical ... We present accessible instructions for corpus building, text cleaning, and linguistic analysis based on our coursework and research experience. The guide ...
-
[19]
[1812.08092] A standardized Project Gutenberg corpus for statistical ... Dec 19, 2018 · Abstract: The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 ...
-
[20]
[PDF] GUIDELINES FOR BUILDING LANGUAGE CORPORA UNDER ... The rules of copyright law that are most relevant for written corpora affect the right of reproduction (§ 16), the right of distribution (§ 17), the right of ...
-
[21]
Corpus building with XML and TEI: Introduction to TEI. Jul 24, 2008 · The Text Encoding Initiative (TEI) is a standard for the representation of textual material in digital form through the means of text encoding.
-
[22]
Sketch Engine: Create and search a text corpus. Sketch Engine is the ultimate corpus tool to create and search text corpora in 100+ languages. Try a 30-day free trial.
-
[23]
[PDF] Words and Tokens - Stanford University. May 23, 2025 · Tokenization, the first stage of natural language processing, is the process of segmenting the running input text into tokens.
-
[24]
[PDF] Building a Large Annotated Corpus of English: The Penn Treebank. It reflects not only true mistakes in PARTS performance, but also the many and important differences in the usage of Penn Treebank POS tags and the usage of ...
-
[25]
(PDF) The Penn Treebank: An overview - ResearchGate. This paper describes the design of the three annotation schemes used by the Treebank: POS tagging, syntactic bracketing, and disfluency annotation.
-
[26]
Inter-Coder Agreement for Computational Linguistics. This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement ...
-
[27]
[PDF] VARD 2: A tool for dealing with spelling variation in historical corpora. Corpus annotation techniques such as part of speech tagging, trained for modern languages, are particularly vulnerable to inaccuracy due to vocabulary and ...
-
[28]
(PDF) Corpus Linguistics: Analyzing Language through Large-Scale ... Aug 10, 2025 · It discusses how corpus linguistics enables the examination of language variation across genres, registers, time periods, and social contexts.
-
[29]
The Essential Contributions of Corpora in Language Research. This paper probes the nature of corpus linguistics as a methodology to language study by elaborating on core tenets of corpus-based approach.
-
[30]
[PDF] Corpora and collocations - Lexically.net. Collocations are words that tend to occur near each other, like 'cow and milk', and are characteristic, frequently recurrent word combinations.
-
[31]
[PDF] 43. Corpora and grammar - Stefan Th. Gries. The most recent attempt at a comprehensive framework for the corpus-based investigation of lexico-grammar is collostructional analysis, a set of methods for ...
-
[32]
(PDF) Corpora from a sociolinguistic perspective - ResearchGate. Aug 8, 2025 · In this paper, I consider the use of corpora in sociolinguistic research and, more broadly, the relationships between corpus linguistics and ...
-
[33]
[PDF] DIACHRONIC CORPORA AND LANGUAGE EVOLUTION OVER TIME. Linguists use diachronic corpora to trace language change and model how languages evolve under various influences. By analyzing language change in historical ...
-
[34]
Concordancing tools - Corpus Linguistics: Method, theory and practice. ... perhaps a word, part of a ...
-
[35]
Chapter 6 Keyword Analysis | Corpus Linguistics - GitHub Pages. Keywords in corpus linguistics are defined statistically using different measures of keyness. Keyness can be computed for words occurring in a target corpus.
-
[36]
[PDF] Normalized (Pointwise) Mutual Information in Collocation Extraction. Mutual information can be used to perform collocation extraction by considering the MI of the indicator variables of the two parts of the potential collocation.
-
[37]
[PDF] 13 The Impact of Corpora on Dictionaries - FutureLearn. Mar 12, 2009 · This chapter discusses how corpus-linguistic techniques have revolutionized dictionary creation since the 1980s. While arguing that corpora ...
- [38]
- [39]
-
[40]
[PDF] Introduction to the CoNLL-2003 Shared Task - ACL Anthology. The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, ...
-
[41]
Europarl: A Parallel Corpus for Statistical Machine Translation. We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web.
-
[42]
[PDF] LibriSpeech: An ASR Corpus Based on Public Domain Audio Books. This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The Lib-.
-
[43]
The Brown Corpus (BROWN) - CoRD. The Brown Corpus was the first computer-readable general corpus of texts prepared for linguistic research on modern English.
-
[44]
[bnc] About the British National Corpus. The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources.
-
[45]
English-Corpora: COCA. [Davies] 1.1 billion word corpus of American English, 1990-2010. Compare to the BNC and ANC. Large, balanced, up-to-date, and freely-available online.
-
[46]
The International Corpus of English - University College London. The project soon became known as the International Corpus of English (ICE), and was coordinated by Greenbaum until 1996. From 1996 to 2001, ICE was coordinated ...
-
[47]
An annotated corpus of clinical trial publications supporting schema ... May 23, 2022 · The corpus created by Summerscales et al. [10] contains 263 RCT abstracts from the British Medical Journal (BMJ). The PICO elements are ...
-
[48]
[PDF] A Corpus with Multi-Level Annotations of Patients, Interventions and ... We present a corpus of 5,000 richly annotated abstracts of medical articles describing clinical randomized controlled trials.
-
[49]
Natural language processing to extract symptoms of severe mental ... From the corpus of discharge summaries, it was possible to extract symptomatology in 87% of patients with SMI and 60% of patients with non-SMI diagnosis.
-
[50]
MultiJur: Multilingual Parallel Corpus of Legal Texts - META-SHARE. Aug 8, 2012 · The corpus contains international conventions and treaties arranged as a parallel corpus aligned on paragraph level.
-
[51]
The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ ... Abstract. We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature.
-
[52]
(PDF) Using Parallel Corpora to Study the Translation of Legal ... Hence, parallel corpora allow researchers to systematically and objectively study the solutions given to pre-identified translation problems, like legal system- ...
-
[53]
[PDF] Europarl: a parallel corpus for statistical machine translation. Altogether, the corpus comprises of about 30 million words for each of the 11 official languages of the European Union: Danish (da), German (de), Greek (el) ...
-
[54]
Universal Dependencies. Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across ...
-
[55]
Universal Dependencies | Computational Linguistics | MIT Press. Jul 13, 2021 · Universal dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 ...
-
[56]
A Collection of 35k Tweets Annotated for Moral Sentiment. Feb 19, 2020 · To address this issue, we introduce the Moral Foundations Twitter Corpus, a collection of 35,108 tweets that have been curated from seven ...
-
[57]
Ethical and Methodological Considerations of Twitter Data for Public ... Nov 29, 2022 · This review describes the current state of public health research using Twitter data in terms of methods and research questions, geographic focus, and ethical ...
-
[58]
Towards an Ethical Framework for Publishing Twitter Data in Social ... This article presents an analysis of Twitter users' perceptions of research conducted in three settings (university, government and commercial)