Fact-checked by Grok 2 weeks ago

Word list

A word list is a curated collection of lexical items from a language, typically organized alphabetically, by frequency of occurrence, or thematically, and compiled for specific analytical, educational, or practical purposes such as vocabulary instruction, linguistic comparison, or software applications.^[1] In linguistics, word lists have long been instrumental for tasks like historical-comparative analysis and fieldwork; for instance, the Swadesh list, developed by Morris Swadesh in the mid-20th century, comprises 100 to 207 basic vocabulary items intended to remain stable across languages for estimating divergence times through glottochronology.^[2] Similarly, in language education, word lists prioritize high-utility terms to optimize learning efficiency, as seen in the General Service List (GSL), a 1953 compilation by Michael West of approximately 2,000 English word families representing the most frequent general vocabulary needed for everyday comprehension.^[3] Complementing this, Averil Coxhead's Academic Word List (AWL), published in 2000, identifies 570 word families prevalent in university-level texts across disciplines, excluding those in the GSL, to support advanced academic reading and writing.^[4] Beyond education and linguistics, word lists play a key role in computational contexts, where they form the basis for tools like spell-check dictionaries and natural language processing algorithms; standard word list files, such as those bundled with Unix-like operating systems (e.g., /usr/share/dict/words), contain thousands of entries in multiple languages to enable functions ranging from text validation to machine translation.^[5] These applications underscore the versatility of word lists, which enhance targeted language mastery and technological efficiency by focusing on essential or contextually relevant terms rather than exhaustive dictionaries.^[6]

Overview

Definition and Scope

A word list is a curated collection of lexical items from a language, often derived from linguistic data and ranked by frequency of occurrence in a corpus, serving as a tool for analyzing vocabulary distribution and usage patterns in corpus linguistics. These lists typically present words in descending order of frequency, highlighting the most common lexical items first, and are essential for identifying core vocabulary that accounts for the majority of text in natural language. For instance, the most frequent words often include function words such as articles, prepositions, and pronouns, which dominate everyday discourse.^[7] Word lists vary in their unit of counting, with key distinctions between headword lists, lemma-based lists, and word family lists. A headword represents the base form of a word, such as "run," without grouping variants. Lemma-based lists expand this to include inflected forms sharing the same base, like "run," "runs," "running," and "ran," treating them as a single entry to reflect morphological relationships. In contrast, word family lists encompass not only inflections but also derived forms, such as "runner," "running," and "unrunnable," capturing broader semantic and derivational connections within the vocabulary.^[8]^[9] The scope of word lists is generally limited to common nouns, verbs, adjectives, and other content words in natural language, excluding proper nouns—such as names of people, places, or brands—unless they hold contextual relevance in specialized corpora. This focus ensures the lists prioritize generalizable vocabulary over unique identifiers. Basic word lists, often comprising the top 1,000 most frequent items, cover essential everyday terms sufficient for rudimentary communication, while comprehensive lists extending to 10,000 words incorporate advanced vocabulary for broader proficiency, such as in academic or professional settings. Systematic frequency-based word lists emerged in the early 20th century with large-scale manual counts.^[10]^[11]

Historical Evolution

The development of word lists began in the early 20th century with manual efforts to identify high-frequency vocabulary for educational purposes. In 1921, Edward Thorndike published The Teacher's Word Book, a list of 10,000 words derived from analyses of children's reading materials, including school texts and juvenile literature, to aid in curriculum design and literacy instruction.^[12] This was expanded in 1932 with A Teacher's Word Book of the Twenty Thousand Words Found Most Frequently and Widely in General Reading for Children and Young People, which incorporated additional sources to rank words by frequency in youth-oriented content.^[13] By 1944, Thorndike collaborated with Irving Lorge on The Teacher's Word Book of 30,000 Words, updating the earlier lists by integrating data from over 4.5 million words across diverse adult materials such as newspapers, magazines, and literature, thereby broadening applicability beyond child-focused education.^[14] Post-World War II advancements emphasized practical lists for language teaching, particularly in English as a foreign language (EFL) and other tongues. Michael West's General Service List (GSL), released in 1953, compiled 2,000 word families selected for their utility in EFL contexts, drawing from graded readers and general texts to prioritize coverage of everyday communication.^[15] Concurrently, in France during the 1950s, the Français Fondamental project produced basic vocabulary lists ranging from 1,500 to 3,200 words, organized around 16 centers of interest like family and work, to standardize teaching for immigrants and non-native speakers through corpus-based frequency analysis of spoken and written French.^[16] The digital era marked a shift toward corpus linguistics in the late 20th century, enabling larger-scale and more precise frequency counts. The Brown Corpus—a 1-million-word collection of 1961 American English texts—was created in 1961 and made digitally available in 1964, facilitating the rise of computational methods for word list construction and influencing subsequent projects with balanced, genre-diverse data. This culminated in the 2013 New General Service List (NGSL) by Charles Browne, Brent Culligan, and Joseph Phillips, which updated West's GSL using a 273-million-word corpus of contemporary English, refining the core vocabulary to 2,801 lemmas for better EFL relevance.^[17] A notable innovation occurred in 2009 with the introduction of SUBTLEX by Marc Brysbaert and Boris New, a frequency measure derived from 51 million words in American English movie and TV subtitles, offering superior representation of spoken language patterns over traditional written corpora.^[18] This subtitle-based approach has since expanded, exemplified by the 2024 adaptation of SUBTLEX-CY for Welsh, which analyzes a 32-million-word corpus of television subtitles to provide psycholinguistically validated frequencies for this low-resource Celtic language, underscoring the method's versatility in supporting underrepresented tongues.^[19]

Methodology

Key Factors in Construction

The construction of word lists hinges on ensuring representativeness, which requires balancing a diverse range of genres such as fiction, news, and academic texts to prevent skews toward specific linguistic features or registers. This diversity mirrors the target language's natural variation, allowing the list to capture a broad spectrum of usage patterns without overemphasizing one sub-domain. Corpus size plays a critical role in reliability, with a minimum of 1 million words often deemed sufficient for stable frequency estimates of high-frequency vocabulary, though larger corpora (16-30 million words) enhance precision for norms. Smaller corpora risk instability in rankings, particularly for mid- and low-frequency items. Decisions on word family inclusion address morphological relatedness, treating derivatives like "run," "running," and "ran" as a single unit based on affixation levels that account for productivity and transparency. Bauer and Nation's framework outlines seven progressive levels, starting from the headword and extending to complex derivations, enabling compact lists that reflect learner needs while avoiding over-inflation of unique forms. This approach prioritizes semantic and derivational connections, but requires careful calibration to exclude transparent compounds that may dilute family coherence. Normalization techniques mitigate sublanguage biases, where specialized texts like technical documents disproportionately elevate jargon frequencies.^[20] Stratified sampling and weighting adjust for these imbalances by proportionally representing genres, ensuring the list approximates general language use rather than niche varieties.^[20] Such methods preserve overall frequency integrity while countering distortions from uneven source distributions.^[20] Key challenges include handling polysemy, where a single form's multiple senses complicate frequency attribution, often requiring sense-disambiguated corpora to allocate counts accurately. Idioms pose similar issues, as their multi-word nature and non-compositional meanings evade standard tokenization, potentially underrepresenting phrasal units in lemma-based lists.^[21] Neologisms, such as "COVID-19," further challenge static lists built from pre-2020 corpora, necessitating periodic updates to incorporate emergent terms without retrospective bias.^[22] Dispersion metrics like Juilland's D quantify evenness of word distribution across texts, with values approaching 1 indicating broad coverage and thus greater reliability for generalizability. This measure, normalized by corpus structure, helps filter words concentrated in few documents, enhancing the list's robustness beyond raw frequency.

Corpus Sources

Traditional written corpora have formed the foundation for early word list construction, providing balanced samples of edited prose across various genres. The Brown Corpus, compiled in 1961, consists of approximately 1 million words drawn from 500 samples of American English texts published that year, including fiction, news, and scientific writing, making it the first major computer-readable corpus for linguistic research.^[23] Similarly, the British National Corpus (BNC), developed in the 1990s, encompasses 100 million words of contemporary British English, with 90% from written sources like books and newspapers and 10% from spoken transcripts, offering a synchronic snapshot of language use.^[24] These corpora, while pioneering in representing formal written language, have notable limitations, such as the absence of internet slang, social media expressions, and evolving colloquialisms that emerged after their compilation periods.^[25] To address gaps in capturing everyday spoken language, subtitle and spoken corpora have gained prominence since 2009, prioritizing natural dialogue over polished text. The SUBTLEX family, for instance, derives frequencies from film and television subtitles; SUBTLEX-US, based on American English, includes 51 million words from over 8,000 movies and series, providing measures like words per million and contextual diversity (percentage of films featuring a word).^[26] This approach offers advantages in reflecting colloquial frequency, as subtitle-based norms better predict lexical decision times and reading behaviors compared to traditional written corpora like the Brown or BNC, which underrepresent informal speech patterns.^[27] Modern digital corpora have expanded scale and diversity by incorporating web-based and historical data, enabling broader frequency analyses. The Corpus of Contemporary American English (COCA), spanning 1990 to 2019, contains over 1 billion words across genres such as spoken transcripts, fiction, magazines, newspapers, academic texts, and web content including blogs, thereby capturing evolving usage in digital contexts.^[28] Complementing this, the Google Books Ngram corpus draws from trillions of words in scanned books across languages, covering the period from 1800 to 2019 (with extensions to 2022 in recent updates), allowing diachronic tracking of word frequencies while excluding low-quality scans for reliability.^[29] Post-2010, there has been a notable shift toward multimodal corpora that integrate text with audio transcripts, video, and other modalities to enhance relevance for second language (L2) learners by simulating real-world input.^[30] These resources, such as those combining spoken audio with aligned textual representations, better support vocabulary acquisition in naturalistic settings compared to text-only sources.^[31] Dedicated corpora for AI-generated text remain in early development.^[32]

Lexical Unit Definitions

In the construction of word lists, a fundamental distinction exists between lemmas and word forms as lexical units. A lemma represents the base or citation form of a word, encompassing its inflected variants that share the same core meaning, such as "be" including "am," "is," "are," and "been." This approach groups related forms to reflect semantic unity and is commonly used in frequency-based vocabulary lists to avoid inflating counts with morphological variations. In contrast, word forms refer to the surface-level realizations of words as they appear in texts, treating each inflection or spelling variant separately for precise token analysis, such as counting "runs" and "running" independently. This differentiation affects how vocabulary size is estimated and prioritized in lists, with lemmas promoting efficiency in pedagogical applications while word forms provide granular data on actual usage patterns.^[33] Word families extend the lemma concept by incorporating hierarchically related derivatives and compounds, allowing for a more comprehensive representation of vocabulary knowledge. According to Bauer and Nation's framework, which outlines seven progressive levels, inclusion begins at Level 1, treating each inflected form as separate, and progresses through Level 2 (inflections with the same base), Levels 3-6 (various derivational affixes based on frequency, regularity, and productivity), to Level 7 (classical roots and affixes). This scale balances inclusivity with learnability, though practical word lists often limit to Level 6 to focus on more transparent forms, integrating less predictable derivatives only if they occur frequently in corpora. For instance, the word family for "decide" at higher levels might include "decision," "indecisive," and "undecided," reflecting shared morphological and semantic roots. Such hierarchical structuring is widely adopted in corpus-derived lists to estimate coverage and guide instruction.^[34] Multi-word units, such as collocations and lexical bundles, are treated as single lexical entries in pedagogical word lists to account for their formulaic nature and frequent co-occurrence beyond chance. Phrases like "point of view" or "in order to" are included holistically rather than as isolated words, recognizing their role as conventionalized units that learners acquire as wholes for fluency. These units are identified through corpus analysis focusing on mutual information and range, with lists like the Academic Collocation List compiling thousands of such sequences tailored to specific registers. By delineating multi-word units distinctly, word lists enhance coverage of idiomatic expressions, which constitute a significant portion of natural language use.^[35] The token-type distinction underpins the delineation of lexical units by differentiating occurrences from unique forms, essential for assessing diversity in word lists. Tokens represent every instance of a word in a corpus, including repetitions, while types denote distinct forms, such as unique lemmas or word families. This leads to the type-token ratio (TTR), a measure of lexical variation calculated as
TTR = \frac{types}{tokens}
where higher values indicate greater diversity. In word list construction, TTR helps evaluate corpus representativeness, guiding decisions on unit granularity to ensure lists reflect both frequency and richness without redundancy.^[36] Challenges in defining lexical units arise with proper nouns and inflections, particularly in diverse language structures. Proper nouns like "London" are often excluded from core frequency lists or segregated into separate categories to focus on general vocabulary, unless analyses specifically track capitalized forms for domain-specific coverage, as seen in the BNC/COCA lists where they comprise nearly half of unlisted types. In agglutinative languages such as Turkish or Finnish, extensive inflectional suffixes create long, context-dependent forms, complicating lemmatization and risking fragmentation of units; for example, a single root might yield dozens of surface variants, necessitating advanced morphological parsing to group them accurately without under- or over-counting types. These issues highlight the need for language-specific rules in unit delineation to maintain list utility.^[37]^[38]

Frequency Calculation Methods

Frequency calculation in word lists begins with raw frequency, which simply counts the occurrences of a lexical unit within a corpus. For instance, if a word appears N times in a corpus of total size S, its raw frequency is N. This measure provides an unnormalized tally but is sensitive to corpus size variations, limiting direct comparisons across datasets.^[39] To address this, relative frequency normalizes counts against the total number of words, often expressed per million words for comparability. The formula is f = \frac{\text{count}}{\text{total words}} \times 1,000,000, yielding a standardized rate that facilitates analysis across diverse corpora. This approach is standard in corpus linguistics for scaling frequencies proportionally.^[40] Zipf's law offers a predictive model for word frequency distributions, stating that the frequency f of a word is approximately inversely proportional to its rank r in the frequency list, given by f \approx \frac{c}{r}, where c is a constant. Validation typically involves plotting frequency against rank on a log-log scale, where a linear relationship confirms adherence to the law, as observed in many natural language corpora. This principle, first formalized in 1935, underpins much of modern frequency analysis by highlighting the skewed nature of word usage.^[41] Advanced metrics extend beyond basic counts to account for contextual and distributional properties. Mutual information quantifies the association strength in collocations by measuring how much the co-occurrence probability of two words deviates from their independent probabilities, favoring rare but strongly linked pairs over high-frequency but weakly associated ones. For dispersion, adjustments mitigate biases from uneven distribution across texts; a common transformation is the logarithmic adjustment \log(f + 1), which compresses the frequency skew and stabilizes variance for low-frequency items. The SUBTLEX database exemplifies this by employing \log_{10}(f + 1) to derive Zipf-scaled frequencies, better handling the long-tail distribution in subtitle corpora.^[42]^[43] Computational tools streamline these calculations. AntConc generates raw and relative frequency lists from loaded corpora, supporting keyword extraction and basic statistical overviews. Sketch Engine provides advanced querying for frequency lists, including part-of-speech filtering and collocation metrics like mutual information. Post-2020 developments integrate neural embeddings, such as BERT models, to compute semantic frequencies that weight word occurrences by contextual similarity, enhancing traditional counts with distributional semantics for more nuanced rankings in word lists.^[44]^[45]^[46]

Applications and Effects

Pedagogical Integration

Word lists play a central role in curriculum prioritization for language instruction, enabling educators to focus on high-frequency vocabulary that maximizes text coverage with minimal effort. According to Nation's principle, knowledge of the top 2,000 to 3,000 word families typically provides 80-90% coverage of everyday written and spoken texts, allowing learners to achieve functional comprehension early in their studies.^[47] This approach ensures that instructional time is allocated efficiently, prioritizing words that appear most often in authentic materials rather than rare or specialized terms. Vocabulary acquisition is often structured in tiers using word lists tailored to learner proficiency. For beginners, high-frequency lists emphasize the most common 1,000-2,000 words to build a foundational lexicon essential for basic communication.^[48] At advanced levels, specialized lists such as the Academic Word List (AWL), developed by Coxhead in 2000, target 570 word families prevalent in scholarly texts, enhancing learners' ability to engage with academic discourse.^[4] Teaching methods incorporating word lists frequently employ spaced repetition systems to reinforce retention, scheduling reviews at increasing intervals based on learner performance to optimize long-term memory formation.^[49] These lists are also integrated with frameworks like the Common European Framework of Reference for Languages (CEFR), where approximately 500 words align with A1-level basic user proficiency, guiding syllabus design and assessment.^[50] In digital applications, word lists inform frequency-based progression, as seen in platforms like Duolingo, which sequences lessons to introduce high-utility vocabulary first for rapid skill-building.^[51] Effectiveness is evaluated through coverage tests, which measure how well a learner's vocabulary spans sample texts, confirming alignment with instructional goals.^[47] However, post-2020 developments in adaptive learning AI, such as personalized systems that dynamically adjust word exposure using updated frequency data, remain underexplored in pedagogical literature despite their potential to enhance customization.^[52]

Psycholinguistic Impacts

High-frequency words are recognized more rapidly during lexical access compared to low-frequency words, as demonstrated in priming studies where repeated exposure to frequent items accelerates subsequent identification. This effect arises from the organization of the mental lexicon, where frequent words occupy more accessible positions, reducing search time in models of lexical retrieval. Eye-tracking research further supports this, showing that gaze durations on high-frequency words are shorter by approximately 20-50 milliseconds during natural reading, reflecting faster orthographic and phonological processing.^[53] Word frequency also influences memory retrieval, with high-frequency words exhibiting fewer tip-of-the-tongue (TOT) states, where a known word temporarily evades recall. Studies indicate that TOT incidents are significantly rarer for words in the top 1,000 most frequent, as their stronger semantic-phonological connections facilitate easier access from long-term memory.^[54] In contrast, low-frequency words, comprising the bulk of the lexicon beyond basic vocabulary lists, are more prone to TOT due to weaker representational strength, impacting fluent language production in everyday discourse.^[54] The Zipfian distribution underlying word frequency lists promotes incremental language acquisition by prioritizing high-frequency items that learners encounter repeatedly in input. This skewed pattern allows initial mastery of a small set of common words, enabling contextual scaffolding for rarer terms. Low-frequency words demand more exposures to achieve comparable retention, as their sparse occurrence hinders consolidation in working memory.^[55] Such dynamics underscore how frequency-based lists align with natural learning trajectories, reducing cognitive load during early stages. An interaction between word frequency and phonological neighborhood density modulates processing efficiency, where high-frequency words in dense neighborhoods (surrounded by many phonologically similar competitors) exhibit slower recognition times. This inhibitory effect, observed in spoken word production tasks, arises from increased competition among activated lexical candidates, delaying selection despite the word's inherent accessibility.^[56] Recent neuroimaging evidence from fMRI studies confirms this at the neural level, revealing reduced activation in Broca's area (left inferior frontal gyrus) for high-frequency words during reading, indicative of more efficient articulatory planning and semantic integration with fewer neural resources.^[57]

Language-Specific Examples

English-Language Lists

One of the earliest and most influential English-language word lists is the General Service List (GSL), compiled by Michael West in 1953, which includes 2,000 headwords selected for their high frequency in everyday English texts and covers approximately 80% of words in general written materials.^[58] This list was derived from a corpus of about 2.5 million words, primarily from British and American sources, emphasizing words useful for non-native learners.^[59] However, the GSL has faced criticisms for relying on a dated corpus that predates significant linguistic shifts, such as technological advancements, leading to underrepresentation of modern vocabulary.^[59]^[60] To address these limitations, the New General Service List (NGSL), developed by Charles Browne, Brent Culligan, and Joseph Phillips in 2013, updates the GSL with 2,801 lemmas drawn from the 273-million-word Cambridge English Corpus (CEC), excluding overlap with the Academic Word List to focus on general high-frequency vocabulary.^[59] The NGSL achieves over 90% coverage of common texts, providing a more current foundation for language learning by incorporating internet-era terms like "email."^[61] For specialized contexts, the Academic Word List (AWL), created by Averil Coxhead in 2000, identifies 570 word families prevalent in university-level texts across disciplines, excluding general high-frequency words to target academic-specific vocabulary essential for higher education. This list was built from a 3.5-million-word corpus of written academic English, highlighting terms like "analyze" and "concept" that appear frequently in scholarly discourse but rarely in everyday language.^[4] In the domain of readability for early education, Edward Fry's 1967 list compiles the top 300 high-frequency words suitable for grades 1-9, aiding in the assessment of text accessibility for young readers.^[62] The NGSL serves as a modern counterpart, extending coverage to contemporary terms absent from earlier lists like Fry's. Recent global events, such as the COVID-19 pandemic, underscore the need for post-2020 refreshes to English word lists, as terms like "vaccine" saw dramatic frequency increases in discourse, potentially altering coverage priorities in learner corpora.

European-Language Lists

Frequency-based word lists for European languages other than English have been developed to address the unique morphological and syntactic features of these tongues, such as rich inflectional systems and grammatical gender, which complicate lexical frequency estimation compared to analytic languages like English. These lists often draw from diverse corpora, including written texts, spoken dialogues, and subtitles, to capture both formal and colloquial usage. Traditional corpora, such as national reference collections, provide the foundational data for many of these efforts. In French, the Français Fondamental project, initiated in 1948 and completed by 1964 under the direction of Paul Rivenc, produced a core vocabulary list ranging from 800 to 3,000 words, along with basic grammatical structures, aimed at teaching French as a foreign language to illiterate adults in colonial contexts and beyond. This list emphasized high-utility terms derived from everyday speech and simple texts, influencing subsequent pedagogical materials. A more comprehensive modern resource is Lexique3, a lexical database covering approximately 140,000 unique word forms (lemmas) with frequency measures updated in 2016 using a corpus of film and television subtitles totaling over 50 million words, enabling precise psycholinguistic analyses of word recognition and processing.^[63]^[64] For Spanish, Mark Davies's A Frequency Dictionary of Spanish (2006) compiles the 5,000 most frequent words from a 20-million-word corpus spanning contemporary written and spoken sources, including newspapers, literature, and conversations, providing part-of-speech information and example sentences to support language learners. Complementing this, the SUBTLEX-ESP database (2011) offers word frequencies derived from subtitles of 1,627 Spanish films and television programs, encompassing 41.5 million words, which better approximates informal, spoken language exposure than traditional written corpora.^[65]^[66] German frequency lists, such as those integrated into the Duden dictionary series, rely on the DeReKo (German Reference Corpus), a vast collection of over 61.4 billion words from texts dating from the 1990s onward, including newspapers, books, and web content, as of January 2025, to rank lemmas and word forms by occurrence. Analysis of this corpus shows that the top 4,000 words account for about 95% of tokens in typical German texts, highlighting the efficiency of focusing on high-frequency items for comprehension and instruction.^[67] Cross-linguistic extensions of subtitle-based frequency measures, like the SUBTLEX family, have been adapted for European languages; for instance, SUBTLEX-FR (2012) provides French word frequencies from film subtitles, facilitating comparative studies across Romance and Germanic tongues. However, compiling these lists in gendered languages presents challenges, particularly with nouns, where separate masculine and feminine forms (e.g., in French le chat vs. la chatte) must be aggregated at the lemma level or ranked individually, potentially skewing rankings due to morphological variation and agreement rules that inflate frequencies of inflected variants.^[26]^[68] A notable recent advancement is the 2022 EU-funded Romance-Croatian Parallel Corpus, which includes aligned texts in five Romance languages (French, Italian, Portuguese, Romanian, and Spanish) alongside Croatian, totaling millions of words, to update frequency profiles and support machine translation while addressing gaps in outdated monolingual lists for these high-resource languages.^[69]

Asian-Language Lists

Word lists for Asian languages often address unique linguistic features such as logographic scripts in Chinese and Japanese, tonal systems in Mandarin and Korean, and syllabic structures in Hangul. These lists prioritize frequency data from large corpora to account for compound words, character combinations, and context-dependent usage, differing from alphabetic languages by emphasizing character-level statistics alongside word forms. For instance, in logographic systems, frequency calculations may distinguish between individual characters and multi-character words to better reflect reading and comprehension patterns. In Chinese, the Hanyu Shuiping Kaoshi (HSK) syllabus, developed in the 1980s by the National Hanban/Confucius Institute Headquarters, structures vocabulary across six levels with a total of 8,840 words and characters, enabling learners to progress from basic greetings to advanced discourse. This list draws from contemporary written and spoken corpora, incorporating both simplified characters and common compounds, with coverage increasing cumulatively: level 1 requires 150 items, while level 6 encompasses all prior levels plus 2,500 additional entries for professional and academic contexts. Japanese word lists grapple with the complexity of kanji compounds, where frequency is derived from mixed-script corpora including hiragana, katakana, and kanji. This approach highlights how compound formation affects word boundaries, with top entries covering approximately 90% of typical texts through prioritized kanji-kanji pairings. For Korean, the National Institute of the Korean Language's frequency list, released in the 2000s, identifies the top 5,000 words from the Sejong Corpus—a approximately 11-million-word collection of balanced written and spoken data—handling Hangul's syllabic nature by treating morphemes and particles as integral to word units. This list facilitates learner progression by including honorifics and agglutinative forms, with the initial 1,000 items alone accounting for over 70% of common occurrences, adapted for tonal variations in pronunciation. A distinctive aspect of Chinese word lists is the distinction between character and word frequency, as logographic writing allows characters to function independently or in compounds. According to the Ministry of Education's 1986 guidelines, the 3,500 most common characters cover 99% of usage in modern texts, enabling efficient literacy without exhaustive memorization of all 50,000+ characters in existence.^[70] This character-centric metric contrasts with word-based lists in alphabetic languages, influencing pedagogical tools to prioritize high-coverage hanzi like 的 (de, possessive particle) and 是 (shì, to be). Despite these advancements, post-2020 developments in Asian word lists remain limited, with few incorporating social media corpora like Weibo for Mandarin to capture evolving slang and neologisms such as 打工人 (dǎ gōng rén, "wage slave"). Similarly, integration of emojis into vocabulary frameworks is nascent, overlooking their role as visual lexemes in digital communication across tonal and logographic contexts, such as emoji-modified compounds on platforms like Weibo or KakaoTalk.

Emerging and Low-Resource Languages

Word lists for emerging and low-resource languages address critical gaps in linguistic documentation, particularly for indigenous, African, and endangered varieties where traditional corpora are scarce. These efforts often rely on targeted collections from oral traditions, limited texts, and modern digital sources to prioritize high-frequency vocabulary essential for revitalization and basic communication. Such lists not only support language preservation but also enable computational applications in understudied tongues. In indigenous languages, organizations like SIL International have developed extensive corpora and word lists to document and analyze vocabulary from diverse communities worldwide. For instance, SIL's resources include elicitation-based word lists and texts for languages spoken in regions like Australia and the Americas, facilitating frequency analysis where full corpora are unavailable.^[71] A notable example is work on Navajo (Diné), where benchmark references from the 1980s, such as Young and Morgan's grammatical analyses, underpin vocabulary compilations of around 1,000 core terms derived from spoken and educational materials.^[72] African languages have seen advancements through corpus-driven frequency lists, filling voids in data for Bantu and other families. The Helsinki Corpus of Swahili 2.0, compiled in the 2000s and expanded to 25 million words, yields top-1,000 word lists based on annotated texts from newspapers, books, and interviews, highlighting everyday usage patterns.^[73] For Zulu (isiZulu), computational extraction methods applied to parallel corpora in studies from the early 2020s enable semi-automatic term and frequency identification, drawing from web-mined and aligned texts to rank common lexical items.^[74] Recent expansions target Celtic and North Germanic low-resource languages using subtitle and gigaword corpora for more naturalistic frequency data. The SUBTLEX-CY database for Welsh, released in 2023 from a 32-million-word corpus of television subtitles, provides detailed word frequencies that outperform earlier written-based lists in predicting lexical processing.^[75] Similarly, the Icelandic Gigaword Corpus, developed in the 2010s with versions reaching 1.3 billion words by 2022, supports customizable frequency lists from parliamentary speeches, news, and literature, aiding in the analysis of a language with limited external influences.^[76] Crowdsourced platforms have emerged as vital tools for endangered languages, enabling community-driven vocabulary building. Apps like Memrise host user-generated courses for Hawaiian ('Ōlelo Hawaiʻi), including lists of 2,000–3,000 high-frequency terms compiled from preserved documents and revitalization projects around 2022, which emphasize practical words for daily use and cultural preservation.^[77] Despite these initiatives, significant gaps persist in AI-assisted word lists for over 7,000 low-resource languages, where training data shortages limit model development and exacerbate digital divides.^[78] Post-2020 calls from initiatives like the Lacuna Fund urge the creation of open-access global corpora to democratize resources, emphasizing collaborative data curation for equitable NLP advancements in underrepresented tongues; as of 2025, the fund continues to support new dataset releases for African and indigenous languages.^[79]

References

[1]
https://dictionary.cambridge.org/us/dictionary/english/word-list
[2]
Selecting vocabulary: General service list of English words - UEfAP
Michael West (West, 1953) published his well-known General Service List of English words. This was a list of the 2000 most useful word families of English.<|control11|><|separator|>
[3]
The Academic Word List | Te Kura Tātari Reo / School of Linguistics ...
The Academic Word List is a useful English resource for lecturers and students. Averil Coxhead from the School of Linguistics and Applied Language StudiesAWL Most Frequent Words in... · Academic Word List headwords · Useful links
[4]
WORDLIST File Extension - FileInfo.com
Jun 4, 2015 · Word list files can be used for storing collections of words in different languages. Several standard word list files are included with the ...
[5]
Using word lists - EAP Foundation
Nov 28, 2022 · Word lists enable students and teachers to decide which words in a text deserve particular attention, such as academic words, as well as providing them with a ...
[6]
Corpus Linguistics - Frequency Lists and Keywords: Making Wordlists
Wordlist (F): frequency list, where words are listed with the most frequent coming first, descending to the least frequent.
[7]
[PDF] Lemma v. family as grouping unit for pedagogical word lists
Aug 24, 2022 · A lemma is a base word plus inflections only, while a family includes inflected and derived forms of a base word.
[8]
[PDF] General Word Lists: Overview and Evaluation - ERIC
A lemma with the headword develop includes the inflectional verb forms develops, developed and developing, whereas a word family with the same headword would ...
[9]
How Many Words Do We Know? Practical Estimates of Vocabulary ...
Jul 28, 2016 · In general, lemmas exclude proper nouns (names of people, places, …). Lemmatization also involves correcting spelling errors and standardizing ...
[10]
[PDF] How Large a Vocabulary is Needed For Reading and Listening?
Nov 17, 2017 · There are no words in the novel from the ninth 1,000 onwards and none outside the lists (excluding proper nouns). There are only 20 words.
[11]
A Seventeenth Century Frequency Word List - jstor
' The Great Didactic of John Amos Comenius. Translated into English and edited with biographical, historical and critical introductions by M. W. Keat- inge.Missing: earliest 17th
[12]
The teacher's word book - Internet Archive
Dec 14, 2006 · The teacher's word book. by: Thorndike, Edward L. (Edward Lee), 1874-1949. Publication date: 1921. Topics: English language -- Glossaries ...Missing: list 1932
[13]
A Teacher's Word Book of the Twenty Thousand Words Found Most ...
A teacher's word book of the twenty thousand words found most frequently and widely in general reading for children and young people.
[14]
[PDF] The teacher's word book of 30,000 words - Internet Archive
The original Thorndike count failed to record abbreviations adequately, and it is likely the recorders in the other counts skipped many. This book is not final ...
[15]
A General Service List of English Words - Google Books
Compiled by, Michael West ; Edition, 14, reprint, revised ; Publisher, Longman, 1953 ; Original from, the University of California ; Digitized, Mar 25, 2009.
[16]
Word Lists by Frequency | Encyclopedia MDPI
Nov 29, 2022 · A review has been made by New & Pallier. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with ...
[17]
Brown Corpus - Wikipedia
Although the Brown Corpus pioneered the field of corpus linguistics, by now ... The corpus originally (1961) contained 1,014,312 words sampled from 15 text ...History · Sample distribution · Part-of-speech tags used
[18]
New General Service List Project
The New General Service List Project is a collection of high-frequency English vocabulary resources, based on corpus linguistics research.New General Service List · New Academic Word List · Tools · NGSL-Spoken
[19]
Moving beyond Kučera and Francis: A critical evaluation of current ...
A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English.Missing: SUBTLEX | Show results with:SUBTLEX
[20]
SUBTLEX-CY: A new word frequency database for Welsh
We present SUBTLEX-CY, a new word frequency database created from a 32-million-word corpus of Welsh television subtitles.Missing: adaptation | Show results with:adaptation
[21]
(PDF) Sublanguage Corpus Analysis Toolkit: A tool for assessing ...
Nov 15, 2014 · PDF | Sublanguages are varieties of language that form "subsets" of the general language, typically exhibiting particular types of lexical, ...
[22]
https://ceur-ws.org/Vol-3972/paper7.pdf
[23]
[PDF] Benchmarking Automatic Tools for Neologisms Extraction - CEUR-WS
Jun 18, 2025 · One of the primary difficulties lies in defining the criteria for what constitutes a neologism across different domains. As new terms may emerge ...Missing: idioms lists
[24]
BROWN Corpus search online | Sketch Engine
The original corpus was published in 1963–1964 by W. Nelson Francis and Henry Kučera (Department of Linguistics, Brown University Providence, Rhode Island, USA) ...
[25]
The British National Corpus
The British National Corpus is a collection of 100 million words of contemporary British English text held in computer-readable form.
[26]
From opportunistic to systematic use of the Web as corpus: Do ...
Hardly any relevant examples can be found in traditional small corpora. There are no attestations in corpora of edited written American English such as Brown ...<|separator|>
[27]
SUBTLEXus — Department of Experimental Psychology
Brysbaert & New compiled a new frequency measure on the basis of American subtitles (51 million words in total).
[28]
Subtitle-Based Word Frequencies as the Best Estimate of Reading ...
Previous evidence has shown that word frequencies calculated from corpora based on film and television subtitles can readily account for reading performance, ...Missing: colloquial | Show results with:colloquial
[29]
English-Corpora: COCA
[Davies] 1.1 billion word corpus of American English, 1990-2010. Compare to the BNC and ANC. Large, balanced, up-to-date, and freely-available online.
[30]
Google Ngram Viewer
When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., "British ...University of · Book _INF a hotel · Fitzgerald,Dupont · Tackle _VERB , tackle _NOUN
[31]
(PDF) The future of multimodal corpora - ResearchGate
Aug 8, 2025 · This paper takes stock of the current state-of-the-art in multimodal corpus linguistics, and proposes some projections of future developments in this field.Missing: post- | Show results with:post-
[32]
Dual Coding or Cognitive Load? Exploring the Effect of Multimodal ...
Mar 10, 2022 · In the era of eLearning 4.0, many researchers have suggested that multimodal input helps to enhance second language (L2) vocabulary learning.<|separator|>
[33]
A question of alignment – AI, GenAI and applied linguistics
Jul 24, 2025 · This problem is exacerbated by AI generated text starting to populate the web pages from which data for training models are drawn, meaning ...<|control11|><|separator|>
[34]
[PDF] Exploring the Future of Corpus Linguistics: Innovations in AI and ...
Oct 3, 2025 · For corpus linguistics, AI-generated text can be valuable in creating synthetic corpora for specific domains or languages where data is scarce.
[35]
THE LEMMA DILEMMA | Studies in Second Language Acquisition
Dec 17, 2021 · The most common of the classifications related to word form are the word type, lemma, flemma, and word family. Word types consist of each unique ...<|separator|>
[36]
[PDF] Word Families - Compleat Lexical Tutor
The levels only deal with affixation. After Level 2 which contains inflectional suffixes, it would be possible to add a level dealing with transparent compound.Missing: et 2013
[37]
[PDF] Multiword Sequences and Language Learning Pedagogy - ERIC
Dec 26, 2018 · Lists of Multiword Units. One way to maximize the learning of various types of multiword units is to make use of the growing number of lists ...
[38]
The Type-Token Ratio and Vocabulary Performance - Sage Journals
The Type-Token Ratio (TTR), a measure of lexical diversity, was correlated with four measures of vocabulary performance: the Peabody Picture Vocabulary ...
[39]
[PDF] The BNC/COCA word family lists
Sep 17, 2012 · Note in Table 5 that almost half of the different words are proper nouns. Four percent are foreign words, and 6% are low frequency members of ...
[40]
Preparing Non-English Texts for Computational Analysis
Aug 28, 2020 · Lemmatization isn't enough for agglutinative languages such as Turkish, where very long words can be constructed by stringing together ...
[41]
[PDF] Useful statistics for corpus linguistics - Stefan Th. Gries
Gries. Second, relative frequencies can be used to compare different corpora with each other just by computing the relative frequency ratio, the quotient of the.
[42]
Normalizing frequencies - ENGLISH LINGUISTICS
To determine the number of occurrences of awesome per million words, we need to divide the raw frequency by the total number of words in the corpus section and ...
[43]
Understanding Zipf's law of word frequencies through sample-space ...
Jul 6, 2015 · We study a simple history-dependent model of text generation assuming that the sample-space of word usage reduces along sentence formation, on average.
[44]
SUBTLEX-UK: A new and improved word frequency database for ...
The Zipf scale is a logarithmic scale, like the decibel scale of sound intensity, and roughly goes from 1 (very-low-frequency words) to 6 (very-high-frequency ...
[45]
[PDF] Dispersions and adjusted frequencies in corpora - Stefan Th. Gries
The most frequent statistics in corpus linguistics are frequencies of occurrence and frequencies of co-occurrence of two or more linguistic variables. However,.
[46]
AntConc - Laurence Anthony's Website
Word Frequency Lists. These 'wordlist' corpora can be loaded into AntConc (4.0 and later) via the Corpus Manager and used as reference corpora to create ...Of /software/antconc/releases · Software · Resume · Publications and PresentationsMissing: Sketch Engine
[47]
Wordlists - word frequency lists - Sketch Engine
The word list tool uses a text corpus to generate frequency lists of words, lemmas, nouns, verbs and other parts of speech. Regular expressions can be used ...
[48]
Calculating Semantic Frequency of GSL Words Using a BERT ...
Apr 26, 2025 · There has always been a pressing need to provide semantic information for words in high-frequency word lists, but technical limitations have ...
[49]
[PDF] How Large a Vocabulary Is Needed For Reading and Listening?
Tokens, types, and families at each of the 14 BNC word-family levels in the LOB corpus. Word list (1,000). Token (%). Types (%). Families. 1. 78,944 (77.86).Missing: scale | Show results with:scale
[50]
How Large a Vocabulary Is Needed for Reading and Listening?
Aug 6, 2025 · Nation (2006) estimates this as an 8,000 to 9,000 word family vocabulary for comprehension of written text and a 6,000 to 7,000 word family ...
[51]
Spaced repetition and the classroom: part 1 | Adaptive Learning in ELT
Oct 13, 2014 · Multiple exposure to a vocabulary item through spaced repetition is likely to help the process of that item ending up in the long-term memory.
[52]
How Many Words Do You Need to Know? - LENGO
Vocabulary: 500-1,000 words · At this level, learners can understand and use familiar everyday expressions and basic phrases that satisfy concrete needs.<|separator|>
[53]
[PDF] The Duolingo Method for App-based Teaching and Learning
Jan 11, 2023 · We illustrate the details of the method by highlighting examples from the Duolingo Language app,. Literacy app, and Math app throughout this ...
[54]
https://www.sciencedirect.com/science/article/pii/0749596X9190026G
[55]
[PDF] Length, frequency, and predictability effects of words on eye ...
According to Rayner et al. (2001) these eye-position effects are dissociated from lexical processing, that is they are not influenced by word frequency or.
[56]
On the tip of the tongue: What causes word finding failures in young ...
TOTs occur when the connections between lexical and phonological nodes become weakened due to infrequent use, nonrecent use, and aging.
[57]
Zipfian frequency distributions facilitate word segmentation in context
Word frequencies in natural language follow a highly skewed Zipfian distribution, but the consequences of this distribution for language acquisition are ...
[58]
[PDF] The spread of the phonological neighborhood influences spoken ...
Several studies in English have demonstrated that neighborhood density influences various aspects of spoken language processing, including lexical acquisi- tion ...
[59]
Word frequency and reading demands modulate brain activation in ...
Oct 11, 2023 · The results indicate that word frequency influenced the activation of the pars orbitalis and pars triangularis of the inferior frontal gyrus, ...
[60]
GSL (General Service List) - EAP Foundation
Feb 11, 2022 · The General Service List (GSL) is a list of 2284 headwords, chosen by frequency for being of greatest general use to learners of English.Frequency · NGSL · GSL alphabetical · GSL highlighter
[61]
[PDF] A New General Service List: The Better Mousetrap We've Been ...
NGSL was then published in several journals including the July issue of the. Language Teacher (Browne, 2013). Following many of the same steps that West and.
[62]
[PDF] An Examination of the New General Service List - ERIC
The New General Service List (NGSL; Browne, Culligan, & Phillips, 2013b) was published in 2013 as a modern replacement for West's (1953) original General.<|control11|><|separator|>
[63]
New General Service List 1.01 Glossary - LinguaEruditio
Sentencedict.com a letter, package, or email. 453, main ... Sentencedict.com something that you do because it is your duty or because you feel you have to.
[64]
Fry Word List - 1,000 High Frequency Words - K12 Reader
Fry Word List - 1,000 High Frequency Words. The Fry word list or "instant words" are widely accepted to contain the most used words in reading and writing.Fry Word Complete List · The 1st Hundred · The 3rd Hundred · The 2nd HundredMissing: readability 1967
[65]
[PDF] ED196274.pdf - ERIC
The girls were each given a list of 100 French words and asked to write down beside each one the first French word that it made them think of, The words were a ...Missing: details | Show results with:details
[66]
Documentation – Lexique
Lexique Manual. Lexique 3 Manual. How to cite Lexique. New, B., Pallier, C., Brysbaert, M., Ferrand, L. (2004) Lexique 2: A New French Lexical Database.Missing: details | Show results with:details
[67]
A Frequency Dictionary of Spanish | Core Vocabulary for Learners
May 17, 2006 · A Frequency Dictionary of Spanish ; Edition 1st Edition ; First Published 2005 ; eBook Published 17 May 2006 ; Pub. Location London ; Imprint ...Missing: details | Show results with:details
[68]
SUBTLEX-ESP: spanish word frequencies based on film subtitles
Aug 7, 2025 · In this study, we present a subtitle-based word frequency list for Spanish, one of the most widely spoken languages.
[69]
DeReWo – Corpus-Based Lemma and Word Form Lists
In this subproject we are developing methods to create frequency-based ranking lists of lemmata and word forms on the basis of random virtual corpora.Missing: vocabulary | Show results with:vocabulary
[70]
http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO
[71]
Romance-Croatian Parallel Corpus - RomCro - ELRC-SHARE
Dec 8, 2022 · Multi-linguality type: Parallel (Parallel Corpus of Romance Languages and Croatian Language unites five Romance languages (French, Italian, ...
[72]
https://www.sas.rochester.edu/lin/joycemarymcdonough/htouym-june2015.pdf
[73]
Word lists and texts | SIL Global
Language elicitation and texts from Maningrida through Milingimbi to Bamyili and Galiwinku. Tapes deposited with the Australian Institute for Aboriginal and ...Missing: corpora | Show results with:corpora
[74]
[PDF] How to Use Young and Morgan's 1987 The Navajo Language
1.1 The Young and Morgan grammars Young & Morgan 1980 and 1987 are the final two volumes and represent the benchmark reference grammars for Navajo and indeed ...Missing: frequency 1000
[75]
Helsinki Corpus of Swahili 2.0 (HCS 2.0) - Kielipankki
Oct 2, 2025 · The corpus contains about 25 million words of written text, and it is available in two formats. The annotated version contains morphological and ...
[76]
[PDF] MasakhaNER: Named Entity Recognition for African Languages
Submission batch: 5/2021; Revision batch: 7/2021; Published 10/2021. c 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license ...
[77]
SUBTLEX-CY: A new word frequency database for Welsh - PMC - NIH
We present SUBTLEX-CY, a new word frequency database created from a 32-million-word corpus of Welsh television subtitles.
[78]
The Icelandic Gigaword Corpus
Word frequency information for version 2022 is available on a special website. It allows creating frequency lists based on various criteria. N-gram frequency ...Missing: 2010s | Show results with:2010s
[79]
Developing Data-Driven Hawaiian Language Vocabulary Lists ...
Developing Data-Driven Hawaiian Language Vocabulary Lists using Preserved Documents ... Research on vocabulary size shows that a 2000 to 3000 word vocabulary ...
[80]
How language gaps constrain generative AI development | Brookings
Oct 24, 2023 · Generative AI tools trained on internet data may widen the gap between those who speak a few data-rich languages and those who do not.
[81]
Language Resources - Lacuna Fund
“Corpus Building for Low Resource Languages in the DARPA LORELEI Program.” In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource ...