Stop word
A stop word is a high-frequency word in natural language, such as articles, prepositions, and conjunctions like "the," "and," or "to," that carries minimal semantic value and is typically removed during text preprocessing in natural language processing (NLP) tasks to reduce noise and improve computational efficiency.[1][2] These words, often referred to as function words, connect other elements in sentences but contribute little to the overall meaning or predictive power in applications like information retrieval, text classification, and sentiment analysis.[1][3] Stop words are identified through statistical methods, including term frequency-inverse document frequency (TF-IDF) and entropy measures, which highlight their ubiquity across documents while underscoring their low informational content.[1][2]

The concept of stop words dates back to early NLP and information retrieval systems in the mid-20th century, with foundational lists derived from general corpora like the Brown Corpus and later adapted for domain-specific use, such as in patent analysis or software engineering documentation.[1] Standard stop word lists, such as those provided by the Natural Language Toolkit (NLTK) library (containing around 179 English words) or the United States Patent and Trademark Office (USPTO) (99 words), serve as starting points but often require customization to account for technical or specialized contexts where certain terms may function differently.[1]

Removal techniques range from fixed dictionary-based filtering to dynamic approaches using probabilistic models such as Poisson distributions, with the aim of eliminating only non-informative elements without losing contextual nuances.[2][3] In practice, excluding stop words enhances model performance by focusing on content-bearing terms, though over-removal in domain-specific texts can inadvertently discard meaningful jargon.[2]

Fundamentals
Definition
Stop words are the most common words in a language, such as articles, prepositions, and pronouns, that carry little semantic value and are typically removed during text preprocessing to reduce noise and improve the focus on meaningful content.[4] The term "stop word" was coined by Hans Peter Luhn in 1960 in the context of indexing technical literature for search engines.[5]

In text processing, the primary purpose of removing stop words is to enhance efficiency and accuracy by concentrating on content words—primarily nouns, verbs, and adjectives—that convey the core meaning and context of the text, thereby increasing the signal-to-noise ratio in tasks like indexing and analysis.[4] This elimination reduces computational overhead and minimizes the dilution of semantic signals from high-frequency, low-information terms.

Basic examples of stop words include "the," a definite article providing no unique informational content; "is," a form of the verb "to be" that primarily indicates existence or state without adding substantive meaning; "at," a preposition denoting location or time but offering little discriminative value; "which," a relative pronoun introducing clauses with minimal semantic weight; and "on," another preposition indicating position or relation that frequently appears without contributing to topical specificity.[4]

Stop words are distinguished from content words based on several key criteria, as outlined in the following comparison (a minimal filtering sketch follows the table):

| Criterion | Stop Words | Content Words |
|---|---|---|
| Frequency | High (approximately 40-60% of text tokens in English) | Low to moderate (less frequent but more variable) |
| Informativeness | Low (little to no contribution to semantic meaning or topic discrimination) | High (carry specific meaning, context, or entities essential for understanding) |
| Part of Speech | Primarily function words (articles, prepositions, pronouns, conjunctions, auxiliary verbs) | Primarily content words (nouns, main verbs, adjectives, adverbs) |
| Role in Analysis | Filtered out to reduce noise and improve efficiency | Retained to capture core semantics and enable effective retrieval or classification |
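To make the distinction concrete, the following minimal sketch filters a sentence against a small hand-picked stop set; the set and the helper function are illustrative assumptions rather than any standard list or library API.

```python
# Minimal stop word filtering with a hand-picked, illustrative stop set.
STOP_WORDS = {"the", "a", "an", "is", "at", "which", "on", "and", "of", "to"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]

print(remove_stop_words("The cat sat on the mat"))
# ['cat', 'sat', 'mat']
```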
Characteristics
Stop words are primarily identified through their high frequency of occurrence in natural language corpora, where they dominate the distribution of word usage. In English texts, the function words comprising stop words often account for approximately 40-60% of all tokens in a typical document, as evidenced by analyses of composite corpora that highlight the prevalence of structural elements.[6] This phenomenon is explained by Zipf's law, which posits that the frequency f of a word is inversely proportional to its rank r in the frequency list, approximately f \approx \frac{1}{r}, leading to a small set of high-frequency words carrying low semantic information but high structural importance.[7]

From a linguistic perspective, stop words consist mainly of function words, including articles (e.g., "the", "a"), auxiliary verbs (e.g., "is", "have"), conjunctions (e.g., "and", "but"), and prepositions (e.g., "in", "of"), in contrast to open-class lexical words like nouns, verbs, adjectives, and adverbs that convey core content. These function words belong to closed-class categories, which maintain a finite and stable inventory with limited morphological variation and minimal impact on sentence meaning when omitted, preserving the overall informational value of the text.[8]

The characteristics of stop words exhibit considerable variability depending on linguistic and contextual factors. Stop word sets are inherently language-dependent, as grammatical structures differ across languages; for instance, particles essential in agglutinative languages like Japanese may function as stop words, while equivalents in English do not. In certain applications, such as information retrieval, stop word definitions may broaden to encompass non-alphabetic elements like numbers (e.g., digits or numerals) or punctuation marks, which provide little topical relevance and are filtered to enhance processing efficiency.[9][10]

Quantitatively, the identification of stop words frequently leverages the term frequency-inverse document frequency (TF-IDF) metric, which diminishes the weight of ubiquitous terms. The TF-IDF value for a term t in document d is calculated as:

\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log \left( \frac{N}{\text{DF}(t)} \right)

Here, \text{TF}(t, d) represents the frequency of t within d, N is the total number of documents in the collection, and \text{DF}(t) is the number of documents containing t. Stop words yield low TF-IDF scores owing to their high document frequency, rendering their IDF component near zero and underscoring their lack of discriminatory power across texts.
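The near-zero weighting of ubiquitous terms can be verified with a small sketch; the toy three-document corpus and the raw-count TF below are illustrative assumptions, not a standard dataset or library implementation.

```python
import math
from collections import Counter

# Toy corpus: "the" occurs in every document, so its IDF is log(3/3) = 0
# and its TF-IDF is zero no matter how often it appears.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang in the tree",
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc.split())[term]                   # raw term frequency in the document
    df = sum(1 for d in corpus if term in d.split())  # document frequency across the corpus
    return tf * math.log(len(corpus) / df) if df else 0.0

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, docs[0], docs), 3))
# the 0.0, cat 0.405, mat 1.099
```

The stop word "the" scores zero despite being the most frequent token, while the rarer content word "mat" receives the highest weight.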
Historical Development
Early Concepts
Linguistic studies in the 19th and early 20th centuries distinguished between function words—such as articles, prepositions, and conjunctions that serve grammatical structure—and content words like nouns and verbs that convey core semantic information. This differentiation provided a foundation for later ideas in language processing.

Information theory provided a quantitative basis for undervaluing high-frequency words, building on these linguistic insights. In his seminal 1948 paper "A Mathematical Theory of Communication," Claude Shannon defined entropy as a measure of uncertainty or information content, revealing that predictable, high-frequency elements in language—like common function words—carry low informational value because their occurrence reduces surprise in message decoding. Shannon's models of English text entropy highlighted redundancy in natural language, where such words comprise a large portion of text but add minimal distinguishing power, implying their potential omission in efficient representation without significant loss of meaning.[11]

Pre-digital practices in library science and documentation exemplified the filtering of common words. In the 1950s, early automatic indexing for abstracting services excluded prevalent but low-value terms to improve efficiency in bibliographic control, aligning with emerging standards for scientific literature.[12]

A key milestone in formalizing these ideas occurred in 1957 with Hans Peter Luhn's paper "A Statistical Approach to Mechanized Encoding and Searching of Literary Information," which introduced selective removal of insignificant words from abstracts to generate concise keyword sets for information dissemination. Luhn's method statistically weighted word significance based on frequency and context, advocating exclusion of prevalent but low-value terms to improve encoding for search systems, marking an early bridge from manual to automated techniques.[13]

Evolution in Computing
The integration of stop words into computing emerged in the 1960s through the SMART information retrieval system, developed by Gerard Salton at Cornell University. In SMART, hardcoded stop lists—predefined collections of high-frequency function words such as articles, prepositions, and conjunctions—were systematically removed during the indexing process to eliminate non-informative terms and enhance retrieval efficiency. This preprocessing step addressed limitations in early automatic indexing by focusing computational resources on content-bearing words, as demonstrated in Salton's experiments with document collections that showed improved performance metrics when stop words were excluded. The system used a stop list of common English words to reduce vocabulary size significantly.[14][15]

In the 1970s and 1980s, stop word removal solidified as a standard preprocessing technique in information retrieval, particularly within vector space models that represented documents and queries as weighted term vectors. Salton's 1975 vector space model explicitly incorporated stop word deletion, excluding common English function words from the term vocabulary to reduce dimensionality and mitigate noise in similarity computations, as applied to collections like 425 Time magazine articles yielding 7,569 unique terms post-removal. Stop word removal was typically followed by stemming to further normalize terms, preserving system efficiency in IR pipelines. These advancements were validated through extensive experiments, establishing stop word handling as a foundational element in IR system design.[16]

The 1990s marked the transition to web-scale applications, where stop words were incorporated into early search engines like AltaVista, launched in 1995, to tackle scalability challenges posed by exploding corpora of web pages. By excluding stop words during indexing, AltaVista and similar systems achieved substantial reductions in index sizes—typically around 30-40% in IR systems—enabling faster query processing and lower storage demands for tens of millions of documents. This practice, drawn from traditional IR methods, proved critical for handling the web's unstructured growth while maintaining retrieval speed.[17]

From the 2000s to the present, stop word strategies evolved toward dynamic handling in big data contexts, influenced by Google's algorithms following its 1998 launch, which employed minimal fixed stop lists to balance efficiency and relevance. As corpora scaled to billions of pages, adaptations emerged for context-aware removal, where terms traditionally treated as stop words were retained or weighted based on query semantics, such as in phrase detection or natural language queries, to avoid losing intent signals. This shift, exemplified in patented methods for query-specific stopword detection, enhanced precision in distributed systems without rigid exclusions. As of 2025, modern NLP models such as transformers continue this trend by often retaining stop words for contextual understanding.[17][18]

Applications
Information Retrieval
In information retrieval systems, stop words are removed during indexing to optimize storage and query performance, particularly in the construction of inverted indexes. An inverted index maps terms to the documents containing them, and excluding stop words prevents these high-frequency, low-value terms from generating extensive posting lists. This process skips common words like "the," "and," or "of" while parsing documents, resulting in a more compact structure that facilitates faster lookups. According to Manning et al., removing a standard list of 150 stop words reduces the number of nonpositional postings by 25-30% relative to case-folded text, though the impact on compressed index size is limited due to efficient compression of frequent terms.[19] Such optimizations can decrease storage requirements and accelerate query resolution, especially for large-scale corpora where stop words might otherwise account for 25-30% of all tokens.

Stop word removal enhances retrieval relevance by mitigating the dominance of non-informative terms in ranking algorithms, thereby improving precision. In term-weighting schemes like TF-IDF or BM25, stop words have high document frequency but low inverse document frequency (IDF), which can dilute scores for meaningful terms if not filtered. For instance, processing the query "the cat sat on the mat" as "cat sat mat" shifts focus to content words, yielding more precise matches without substantial recall loss, as relevant documents typically contain similar stop words proportionally. This approach prevents noise in vector space models, where unfiltered stop words could inflate similarity scores erroneously.

Two primary techniques govern stop word handling: static and dynamic stopping. Static stopping employs a predefined, fixed list of words excluded universally during indexing and querying, offering simplicity and consistency across applications. Dynamic stopping, conversely, evaluates words based on corpus or query-specific frequency thresholds, removing those deemed non-discriminative (e.g., appearing in over 90% of documents) to adapt to domain variations (see the sketch at the end of this section).[20] The latter, often informed by IDF scores, allows flexibility but increases computational overhead.

Assessments of stop word removal rely on standard metrics such as precision (relevant retrieved documents over total retrieved) and recall (relevant retrieved over total relevant). Evaluations in TREC ad-hoc tasks have shown that stop word filtering provides efficiency gains with minimal impact on these retrieval effectiveness metrics.[21][22] Over time, stop word handling has evolved with search engine advancements, from basic static removal in early web search systems to more sophisticated integration with semantic and machine learning techniques in modern engines.
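The dynamic stopping strategy described above can be sketched as a document-frequency cutoff applied before indexing; the toy corpus, the 90% threshold, and the variable names below are illustrative assumptions rather than any production IR implementation.

```python
from collections import defaultdict

# Toy document collection keyed by document ID.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "the bird sang in the tree",
}
DF_CUTOFF = 0.9  # drop terms that occur in more than 90% of documents

# Compute document frequency for every term.
df = defaultdict(int)
for text in docs.values():
    for term in set(text.split()):
        df[term] += 1

# Dynamically derived stop list: terms exceeding the document-frequency cutoff.
dynamic_stop_words = {t for t, n in df.items() if n / len(docs) > DF_CUTOFF}

# Build an inverted index, skipping the dynamically identified stop words.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        if term not in dynamic_stop_words:
            index[term].add(doc_id)

print(sorted(dynamic_stop_words))   # ['the'] for this corpus
print(sorted(index["cat"]))         # [1, 2]
```

A static scheme would simply replace dynamic_stop_words with a fixed, predefined list applied unchanged to every collection.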
Natural Language Processing

In natural language processing (NLP), stop word removal serves as an initial step in the text preprocessing pipeline, particularly during tokenization for tasks such as sentiment analysis. This process filters out high-frequency function words like "the," "is," and "and," which carry minimal semantic value, thereby reducing the feature space in representations like bag-of-words models. By concentrating on content words, it enhances the focus on meaningful terms that drive task-specific insights, such as polarity detection in reviews.[23][24]

Stop word removal significantly lowers the dimensionality of vectorized text data, which is crucial for machine learning integration in NLP. For instance, in libraries like scikit-learn, excluding stop words during TF-IDF or CountVectorizer transformations decreases the vocabulary size, mitigating the curse of dimensionality and improving model training efficiency by reducing computational overhead (a brief sketch at the end of this section illustrates the effect). This streamlining is especially beneficial for large corpora, where it minimizes noise and accelerates convergence in algorithms like support vector machines or naive Bayes classifiers.[23][25]

In topic modeling with Latent Dirichlet Allocation (LDA), stop word removal shifts emphasis toward thematic content words, enhancing topic coherence by eliminating pervasive noise from function words. Empirical analysis shows that applying a standard stop word list, such as the 524-word MALLET set, improves normalized pointwise mutual information (NPMI) scores and downstream classification accuracy, with post-removal models achieving comparable log-likelihoods to baselines while yielding clearer topic distributions (e.g., 0.0931 NPMI for 10 topics on New York Times data). However, the timing of removal—pre- or post-tokenization—has negligible impact on these metrics.[26]

For machine translation, stop word removal requires context-aware strategies to preserve function words essential for syntactic structure and fluency, as indiscriminate elimination can degrade output quality. In statistical machine translation systems, relaxing frequent target words reduces perplexity (e.g., by 15% for the top 20 words) but does not boost BLEU scores, underscoring the need for predictive reinsertion models that account for contextual dependencies. Recent context-aware approaches filter stop words selectively during document-level processing to maintain coherence without sacrificing grammatical integrity.[27][28]

Popular NLP libraries provide built-in support for stop word removal, facilitating integration into pipelines. In NLTK, the process involves loading the English stop words corpus and filtering tokens:

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch the stop word lists once

text = "The cat sat on the mat"  # example input
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in text.split() if word.lower() not in stop_words]
```

This approach is efficient for basic preprocessing in sentiment analysis or classification tasks. Similarly, spaCy offers linguistic-aware removal via its token attributes:
```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp(text)
filtered_tokens = [token.text for token in doc if not token.is_stop]
```

spaCy's implementation leverages pre-trained models for more nuanced filtering, suitable for complex tasks like topic modeling. Empirical studies from the 2010s demonstrate that stop word removal can yield accuracy gains of 5-15% in text classification tasks, depending on the domain, model, and baseline. These gains are attributed to noise reduction, though the benefits of explicit removal diminish in transformer-based models such as BERT, whose contextual embeddings inherently downweight function words.[29][30][31]
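The vocabulary-size reduction attributed to scikit-learn vectorizers earlier in this section can be observed directly; the sketch below is a minimal illustration (assuming scikit-learn is installed), and the two example sentences are invented for demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "The plot of the movie was engaging and the acting was superb.",
    "The movie was too long, and the plot was hard to follow.",
]

# Baseline vocabulary with no filtering.
full_vocab = CountVectorizer().fit(reviews).vocabulary_

# Vocabulary after applying scikit-learn's built-in English stop list.
reduced_vocab = CountVectorizer(stop_words="english").fit(reviews).vocabulary_

print(len(full_vocab), "->", len(reduced_vocab))
print(sorted(reduced_vocab))  # content words such as 'acting', 'movie', 'plot' remain
```

The same stop_words parameter is accepted by TfidfVectorizer, so the reduction carries over to TF-IDF feature matrices as well.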
Stop Word Lists
English Stop Words
English stop words comprise the most frequently occurring function words in English, including articles, prepositions, pronouns, conjunctions, and auxiliary verbs, which carry minimal semantic content and are routinely filtered out in text analysis to emphasize meaningful terms. These lists are compiled from empirical frequency data in large corpora, such as the Brown Corpus—a 1-million-word collection of mid-20th-century American English prose developed by Kucera and Francis—which identifies the top approximately 150 words as accounting for over 50% of all tokens while representing just 0.5% of the unique vocabulary, excluding proper nouns and content-specific terms.[32] The inclusion criteria prioritize words with high token frequencies that do not discriminate document topics effectively, drawing from established resources like the Natural Language Toolkit (NLTK) corpus (179 words), the Snowball stemmer (174 words), and the SMART information retrieval system (571 words).[33]

While general stop word lists focus on everyday function words, academic variants adapt to formal and scholarly contexts by incorporating transitional or abbreviative terms that appear more often in research texts, such as "etc.", "however", and "therefore", to better handle discourse markers without altering core semantics.[34] Similar principles guide stop word curation in multilingual settings, though English lists remain the most extensively documented due to the availability of corpora like Brown.[35]

The following table presents a curated selection of approximately 120 common English stop words, drawn from the intersection of the NLTK, Snowball, and SMART lists, grouped by grammatical category. Frequencies are approximated per million words from the Brown Corpus (Kucera and Francis, 1967), using ranks for the top occurrences to illustrate prevalence; lower ranks indicate higher frequency (e.g., rank 1: ~70,000 occurrences).[36]

| Category | Examples (Selected Words) | Brown Corpus Frequency Rank (Representative) |
|---|---|---|
| Articles | a, an, the | the (1), a (4), an (not in top 150) |
| Prepositions | in, on, at, to, for, of, with, by, from, up, about, into, over, after, under, through, during | of (2), in (6), to (5), for (12), with (17), on (14), by (32), from (25) |
| Pronouns | I, you, he, she, it, we, they, me, him, her, us, them, my, your, his, its, our, their, this, that | I (20), you (8), he (11), it (10), we (44), they (19), me (not in top 150), him (63), her (48), us (not in top 150), this (23), that (9) |
| Auxiliary/Modals | be, is, are, was, were, been, have, has, had, do, does, did, will, would, shall, should, may, might, can, could | be (22), is (7), are (15), was (13), were (57), have (24), has (68), had (31), do (52), did (105), will (46), would (60), can (54), could (79), may (117) |
| Conjunctions | and, but, or, yet, so, for, nor, because, if, while, although | and (3), but (27), or (29), so (53), for (12), if (not in top 150), while (not in top 150) |
| Adverbs | not, no, very, just, only, now, then, too, also, even, as, well | not (35), no (81), very (127), only (129), now (77), well (102), as (16), too (not in top 150) |
| Determiners | all, both, each, every, some, any, few, many, much, several, these, those | all (56), some (not in top 150), any (not in top 150), many (142), these (86), those (not in top 150) |
| Other Function Words | at, on, what, which, who, when, where, why, how, there, here, again, against, below, between | at (50), what (59), which (not in top 150), when (not in top 150), where (not in top 150), there (130), here (125), again (not in top 150), between (not in top 150) |
Multilingual Stop Words
Stop words vary significantly across languages due to differences in grammatical structure, with common function words like articles, prepositions, and conjunctions serving as primary candidates for removal in natural language processing tasks.[37] In French, typical stop words include definite and indefinite articles such as le, la, les, un, une, as well as prepositions like de, à, en, sur, par, dans, and conjunctions like et, ou, mais, car, donc.[37] Spanish stop words similarly feature articles (el, la, los, las, un, una, unos, unas), prepositions (de, a, en, por, con, para, sobre), and conjunctions (y, o, pero, sino).[37] For Chinese, a language without articles or inflections, stop words often consist of particles and auxiliary verbs such as 的 (de, possessive), 是 (shì, to be), 在 (zài, at/in), 不 (bù, not), 和 (hé, and), 有 (yǒu, have), 我 (wǒ, I), 这 (zhè, this), 你 (nǐ, you), 他 (tā, he), 了 (le, particle), 为 (wéi, for), 以 (yǐ, with), 对 (duì, to), 上 (shàng, on), 下 (xià, under), 从 (cóng, from), 到 (dào, to), 由 (yóu, by), 都 (dōu, all), 也 (yě, also), 很 (hěn, very), 就 (jiù, just), 只 (zhǐ, only), and 还 (hái, still).[37] Turkish, an agglutinative language, includes conjunctions like ve (and), veya (or), ile (with), ama (but), fakat (however), postpositions and clitics such as de/da (also), için (for), gibi (like), and pronouns like bu (this), şu (that), o (he/she/it), bir (a/one), ben (I), sen (you), biz (we), siz (you plural), onlar (they), along with the question particles mi and mu.[37]

Multilingual stop word lists are often created using frequency analysis on large parallel corpora, such as Europarl, which contains proceedings from the European Parliament in 21 languages and enables cross-lingual identification of high-frequency function words through term-document frequency (TDF) metrics and geometric cutoff strategies.[38] This approach identifies stop words by their prevalence in documents, achieving high accuracy even with small corpus subsets, and highlights how agglutinative languages like Turkish require expanded lists due to the proliferation of bound morphemes that function similarly to independent words in analytic languages.[38]

Developing stop word lists for morphologically rich languages presents challenges, as inflected forms must be integrated with stemming or lemmatization to capture variants of function words, such as null copulas or derivational suffixes in Turkish that blur the line between content and function elements. Non-Latin scripts, as in Chinese, further complicate processing, necessitating script-specific tokenization to avoid misidentifying particles or affixes as meaningful terms.

Key resources for multilingual stop words include the stopwords-iso collection, which aggregates lists for over 40 languages following ISO 639-1 standards, and the tidystopwords R package, which generates customizable lists from Universal Dependencies treebanks by filtering high-frequency lemmas tagged as adpositions, auxiliaries, pronouns, and other function categories across languages.[37] These tools provide 20-30 or more words per language, emphasizing universal POS tags for consistency, such as determiners (le, el, 的), conjunctions (et, y, 和), and prepositions (de, en, 在). A short example of loading language-specific lists programmatically follows the table below.

| Category | French Examples | Spanish Examples | Chinese Examples | Turkish Examples | German Examples |
|---|---|---|---|---|---|
| Determiners | le, la, les, un, une, des | el, la, los, las, un, una | 的, 这, 那 | bu, şu, o, bir | der, die, das, ein, eine |
| Conjunctions | et, ou, mais, car, donc | y, o, pero, sino, ni | 和, 或, 但 | ve, veya, ama, fakat | und, oder, aber, denn, sondern |
| Prepositions | de, à, en, sur, par, dans | de, a, en, por, con, para | 在, 上, 下, 从, 到 | de, da, için, gibi, ile | in, auf, von, zu, bei, mit |
| Pronouns/Auxiliaries | je, tu, il, être, avoir | yo, tú, él, ser, estar | 我, 你, 他, 是, 有 | ben, sen, o, olmak, etmek | ich, du, er, sein, haben |
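As referenced above, the following sketch loads several of the language-specific lists bundled with NLTK's stopwords corpus; the exact set of available languages depends on the installed NLTK data, so the languages shown are assumptions about a typical installation.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # fetch the bundled multilingual stop word lists

# Print the size and first few entries of several language-specific lists.
for lang in ("english", "french", "spanish", "german", "turkish"):
    words = stopwords.words(lang)
    print(lang, len(words), words[:5])
```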