
Sketch Engine

Sketch Engine is a web-based corpus analysis tool designed for exploring language patterns through large-scale text corpora, enabling users to query and visualize authentic language usage across multiple languages.^[1] Developed collaboratively by linguists Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell starting in the early 2000s, it builds on innovations like word sketches, automatic summaries of a word's grammatical and collocational behavior that were first used in compiling the Macmillan English Dictionary, published in 2002.^[2] The tool was launched in 2004 as an extension of the Manatee corpus query system, initially aimed at supporting lexicography by automating corpus-based insights for dictionary compilation.^[2]

Over the subsequent two decades, Sketch Engine has evolved into a comprehensive platform supporting over 100 languages and more than 800 pre-built corpora totaling around 1 trillion words, with the largest individual corpora exceeding 80 billion words.^[1] Key features include word sketches, concordances, distributional thesauruses, term extraction, and diachronic trend analysis, allowing users to identify typical collocations, rare usages, neologisms, and multilingual parallels.^[3] It accommodates diverse writing systems such as Latin, Cyrillic, and Chinese, and facilitates custom corpus building from web sources or uploaded files.^[1]

Widely adopted in academia, publishing, and language policy, Sketch Engine serves linguists, lexicographers, translators, educators, and national institutes for tasks ranging from dictionary development to language teaching and historical text analysis.^[1] Major users include Oxford University Press, Cambridge University Press, and institutions like the Czech and Dutch language academies. Ongoing updates include enhancements to diachronic analysis tools in late 2023 and 2025 additions such as the ParlaTalk collection of parliamentary corpora from 22 EU states.^[4]^[5]

Introduction and History

Overview

Sketch Engine is a web-based corpus manager and text analysis software developed by Lexical Computing for querying and analyzing large collections of authentic texts across over 100 languages and more than 30 writing systems.^[1]^[6] It serves as a comprehensive platform for linguistic exploration, enabling users to uncover patterns in language use through data-driven methods.^[1] The primary purposes of Sketch Engine include facilitating complex queries into text corpora for professionals such as lexicographers, translators, linguists, researchers, teachers, and language learners, allowing them to study real-world language patterns, collocations, and contextual usages.^[1] It supports applications in fields like lexicography, translation, education, and computational linguistics by providing empirical evidence from vast datasets.^[6] Originating from the Manatee and Bonito corpus tools, Sketch Engine has become integral to dictionary creation and language resource development.^[7] Key to its utility are over 800 pre-built corpora encompassing a total of 1 trillion words, offering scalable resources from small specialized sets to massive general collections.^[1] Available as a commercial subscription service with robust support, it also includes a free open-source version called NoSketch Engine, which allows self-hosting but requires users to provide their own corpora.^[8]^[9] In basic operation, users access or upload corpora to the platform, execute searches such as concordances to retrieve contextual examples, and produce visualizations that highlight grammatical, collocational, and distributional patterns in language.^[1] This workflow empowers evidence-based analysis without necessitating advanced programming skills for most tasks.^[10]

Development History

Sketch Engine was developed in 2003 and launched in 2004 by Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell through their company, Lexical Computing, as a commercial corpus analysis tool primarily aimed at lexicographers and linguists.^[11]^[12] The platform built upon earlier open-source components, including Manatee, a C++-based corpus indexer created by Rychlý during his time at Masaryk University, and Bonito, a web-based interface for corpus querying.^[13] An open-source variant, NoSketch Engine, was released alongside the commercial version to support academic and research use, providing core functionality without proprietary corpora or advanced features.^[8]

Key early milestones included the integration of word sketches (automatic, corpus-derived summaries of a word's grammatical and collocational behavior) in 2004, which became a hallmark feature for efficient lexical analysis across languages.^[2] In 2014, Sketch Engine expanded accessibility with the launch of SKELL, a simplified web interface derived from the main platform, initially supporting English for language learners and later extending to other languages: Russian, Czech, German, Italian, and Estonian.^[14] In 2020, the company discontinued support for the legacy Bonito-based interface to streamline development toward a modern, unified user experience.^[15]

Developments from 2016 onward focused on performance and scalability. A partial rewrite of the Manatee indexer in the Go programming language began in 2016 to handle larger corpora more efficiently, culminating in significant speed improvements by the late 2010s.^[16] Following Kilgarriff's passing in 2015, the team emphasized multilingual capabilities, adding enhancements for bilingual lexicography and integrating with European Union projects, such as the EUR-Lex parallel corpus covering all official EU languages for legal and translational analysis.^[17]

By 2024, new features like the Timeline tool enabled diachronic analysis of word usage trends over time, while ongoing expansions added dozens of corpora annually, reaching over 800 preloaded corpora as of 2025 and incorporating AI-assisted functionalities for automated term extraction and word sense disambiguation. In 2025, updates included new corpora such as the ParlaTalk parliamentary collections from 22 EU states and enhancements to concordance visualization.^[18]^[19]^[20]^[21]^[22]

Core Features

Search and Analysis Tools

Sketch Engine provides a suite of search and analysis tools designed to enable linguists, lexicographers, and researchers to explore linguistic patterns within large text corpora efficiently. At its core is the concordance search, which retrieves instances of words, phrases, or patterns in their surrounding contexts, typically displayed in keyword-in-context (KWIC) or full-sentence views. This tool supports extensive customization, including sorting results by corpus order, random selection, or relevance metrics such as Good Dictionary Examples, which prioritize illustrative usages based on linguistic criteria. Users can group concordances by frequency, attributes like part-of-speech tags, or metadata, and apply filters to retain or exclude lines matching specific conditions, facilitating targeted analysis of up to 1,000 lines for download in preloaded corpora.^[23] For deeper distributional analysis, Sketch Engine offers collocation tools that identify co-occurring words and phrases, revealing syntactic and semantic relationships through statistical measures. These include lists of frequent collocations within defined spans (e.g., left or right of the node word), sortable by metrics like t-score or logDice for reliability. Advanced querying is powered by the Corpus Query Language (CQL), a flexible syntax for specifying complex patterns, such as grammatical structures, optional elements, or alignments with tags like lemmas and part-of-speech. For instance, CQL allows searches like [lemma="run" & tag="V.*"] to capture verb forms in context, enabling precise extraction of multi-word units or rare phenomena across corpora.^[24]^[25] The platform's thesaurus and similarity functions leverage distributional semantics to automatically generate relations between words based on their co-occurrence patterns in the corpus. 
The distributional thesaurus computes similarity scores based on word sketch data to cluster synonyms, hyponyms, or contextually related terms, providing an automated alternative to manual thesauri. This tool supports exploratory queries, such as finding words similar to "bank" in financial versus river contexts, and is available for every word in supported corpora, drawing on principles established in early implementations like those from 2007.^[26]^[27] Diachronic analysis tools in Sketch Engine track frequency changes over time in timestamped corpora (available in 18 languages as of September 2025), aiding the study of language evolution.^[28] The Trends feature generates graphs of word usage across periods, highlighting neologisms or shifts in meaning. Introduced in 2024, the Timeline function enhances this by producing interactive visualizations for any search result, displaying normalized frequencies with options to compare multiple terms or filter by subcorpora, thus revealing granular trends like the rise of "AI" in recent decades.^[18]^[29] For multilingual research, Sketch Engine supports parallel corpus facilities, where aligned texts in multiple languages allow querying in one language to retrieve corresponding segments in others. The parallel concordance displays results side-by-side, supporting translation equivalence studies through alignment at sentence or paragraph levels, often built from bilingual or multilingual datasets using tools like Excel imports for 1:1 or M:N mappings. This enables cross-linguistic pattern analysis, such as identifying idiomatic translations, without requiring manual alignment for basic setups.^[30]^[31]
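
The two collocation measures named above, t-score and logDice, can be illustrated with a short calculation. The sketch below uses the commonly published definitions of both scores; the function names and the example counts are illustrative assumptions, not Sketch Engine's own code or data.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy is the co-occurrence count of node and collocate,
    f_x and f_y are their individual frequencies.
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def t_score(f_xy: int, f_x: int, f_y: int, corpus_size: int) -> float:
    """t-score = (observed - expected) / sqrt(observed).

    The expected co-occurrence count assumes the two words are independent.
    """
    if f_xy <= 0:
        raise ValueError("co-occurrence count must be positive")
    expected = f_x * f_y / corpus_size
    return (f_xy - expected) / math.sqrt(f_xy)


if __name__ == "__main__":
    # Hypothetical counts chosen only to show the calculation.
    print(log_dice(f_xy=100, f_x=1_000, f_y=1_000))
    print(t_score(f_xy=100, f_x=1_000, f_y=1_000, corpus_size=1_000_000))
```

logDice depends only on the three frequencies and not on corpus size, so its values stay comparable between corpora of different sizes, whereas t-score grows with raw frequency and tends to favor very common collocates.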

Word Sketches and Extraction

Word sketches in Sketch Engine are algorithm-generated, one-page summaries that capture a word's grammatical and collocational behavior by organizing typical collocations into predefined categories based on syntactic relations.^[2] These summaries highlight patterns such as verbs with direct objects (e.g., for the verb "give," collocations like "advice," "information," or "money" as objects), nouns with modifiers (e.g., for "university," adjectives like "leading," "top," or "prestigious"), or subjects of verbs, providing a concise linguistic profile derived from corpus analysis.^[25] The generation process relies on a sketch grammar, a set of rules written in the Corpus Query Language (CQL), that scans the corpus for patterns around the target word, scoring collocations by frequency and significance to filter the most relevant examples.^[32]

Keywords and terminology extraction tools in Sketch Engine identify significant single-word keywords and multi-word terms characteristic of a specific corpus or domain by comparing their frequencies against a reference corpus.^[33] Each item is scored with Sketch Engine's "simple maths" keyness measure, a smoothed ratio of its normalized (per-million) frequency in the focus corpus to its normalized frequency in the reference corpus, which highlights terms that are over-represented in the target text (e.g., extracting domain-specific vocabulary like "machine learning" or "neural network" from AI-related documents).^[33] Specialized features like the Keywords & Terms tool and OneClick Terms automate this for user-uploaded texts, producing ranked lists of terms suitable for terminology management in specialized fields.^[34]

Customization of these features is achieved through adjustable sketch grammars, which can be tailored for different languages by adapting CQL rules to specific part-of-speech tagsets and syntactic structures, or for domains by modifying relation definitions to capture relevant patterns (e.g., adding industry-specific collocation categories).^[32] Sketch Engine provides pre-built grammars for word sketches in 34 languages, including English, German, Czech, and Chinese, with extensions available for additional languages through user-defined rules.^[35] This approach draws from Adam Kilgarriff's emphasis on distributional properties, where a word's meaning and usage are inferred from its co-occurrences in varied grammatical contexts across large corpora.^[2]
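
The comparison of a focus corpus against a reference corpus can be sketched with the "simple maths" keyness ratio that Sketch Engine documents for keyword extraction: (per-million frequency in focus + N) / (per-million frequency in reference + N). The function names, the default smoothing value N = 1, and the word counts below are assumptions made for illustration only.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count: int, focus_size: int,
            ref_count: int, ref_size: int, n: float = 1.0) -> float:
    """Smoothed ratio of normalized frequencies (focus vs. reference).

    The smoothing parameter n avoids division by zero and controls whether
    rarer or more common words rise to the top of the list.
    """
    return (per_million(focus_count, focus_size) + n) / (
        per_million(ref_count, ref_size) + n
    )


def rank_keywords(focus_counts: dict, focus_size: int,
                  ref_counts: dict, ref_size: int, n: float = 1.0) -> list:
    """Return (word, score) pairs sorted from most to least characteristic."""
    scores = {
        word: keyness(count, focus_size, ref_counts.get(word, 0), ref_size, n)
        for word, count in focus_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts for a small domain corpus and a general reference.
    focus = {"neural": 50, "the": 60_000}
    reference = {"neural": 1, "the": 60_000}
    for word, score in rank_keywords(focus, 1_000_000, reference, 1_000_000):
        print(word, round(score, 2))
```

With these counts, "neural" scores far above "the", because function words occur at similar rates in both corpora and their ratio stays close to 1.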

SKELL Service

The SKELL (Sketch Engine for Language Learning) service was launched in 2014 as a free, public web-based tool providing simplified access to corpus data for non-experts, particularly language learners and educators, without requiring user login or registration.^[14] Developed by Lexical Computing, it offers a user-friendly interface to explore authentic language usage through example sentences and basic analytical views, drawing from subsets of larger corpora maintained by Sketch Engine.^[36]

Key features include simplified concordances, which display up to 40 contextual example sentences for a queried word or phrase; word sketches, which highlight common collocations and grammatical patterns in a tabular format; and the Similar words view, an algorithm-driven thesaurus showing synonyms and related terms.^[36] Unlike the full Sketch Engine, SKELL omits advanced query languages like CQL and support for custom corpora, focusing instead on straightforward searches to promote intuitive language discovery.^[14]

As of 2025, SKELL supports six languages: English, Russian (via ruSKELL), Czech, German, Italian, and Estonian, with each interface tailored to provide relevant corpus examples in the target language.^[36] The service uses sampled subsets of multi-billion-word corpora to ensure quick response times, though results include watermarks indicating the SKELL version for attribution.^[36]

Designed primarily for teachers and students, SKELL aims to bridge corpus linguistics with practical language learning by offering real-world usage examples that enhance vocabulary acquisition, collocation awareness, and writing skills, as evidenced in educational studies from the 2020s.^[37] Limitations include restricted result volumes to prevent overload and the absence of export options or detailed metadata, encouraging users to upgrade to the commercial Sketch Engine for deeper analysis.^[36] In the 2020s, improvements to mobile responsiveness have made it more accessible on handheld devices, supporting integrations in classroom activities and online EFL programs.^[38]

Corpora and Data Management

Available Text Corpora

Sketch Engine provides access to over 800 preloaded text corpora spanning more than 100 languages, with sizes ranging from approximately 1,000 words to 86.8 billion words, enabling diverse linguistic analyses from small specialized datasets to massive general-purpose collections.^[1]^[20] The corpora draw from varied sources, including web-crawled content, legal documents, translated subtitles, and domain-specific texts such as environmental or academic materials, offering comprehensive coverage for research in lexicography, translation, and language teaching.^[20]^[39]

Central to this collection is the TenTen family of corpora, which comprises web-derived texts for over 50 languages, built with a target size of around 10 billion words per language and processed with advanced cleaning, deduplication, part-of-speech tagging, and lemmatization to ensure high-quality linguistic data.^[40] Notable examples beyond the TenTen family include the British National Corpus (BNC), a 100-million-word balanced sample of late 20th-century British English encompassing both written and spoken varieties, and the EUR-Lex parallel corpus, a multilingual repository of EU legal and public documents in 24 official languages, totaling several billion words across all languages (the English version alone has 630 million words) and segmented by paragraph for alignment studies.^[41]^[17] Additional domain-specific corpora feature the OpenSubtitles collection, which aggregates translated movie subtitles across 58 languages into 60 parallel sub-corpora for multimodal translation analysis, and the EcoLexicon English Corpus, a 23.1-million-word set of contemporary environmental texts supporting terminology work in sustainability topics.^[42]^[43]

Multilingual capabilities extend to over 100 languages, including low-resource ones like Yiddish or indigenous Australian languages, often bolstered by targeted web crawls to fill representation gaps in under-documented varieties.^[35]^[20] Access is structured in tiers: open corpora, such as subsets of the BNC or EcoLexicon, are freely searchable without an account via the NoSketch Engine interface; trial users and subscribers gain expanded access to full datasets, with ongoing updates ensuring relevance.^[8]^[44]

Post-2020 expansions have addressed coverage gaps through additions like ukTenTen22 (7.6 billion words of Ukrainian web texts), arTenTen24 (6.6 billion words of Arabic), idTenTen24 (7.1 billion words of Indonesian), and fiTenTen24 (4.4 billion words of Finnish). As of July 2025, the ParlaTalk corpora of parliamentary debates have been expanded to 2.8 billion words in 20 languages.^[20]^[45]^[46]

Corpus Building and Customization

Corpus Architect serves as the core tool within Sketch Engine for enabling users to construct and tailor personalized text corpora without requiring specialized technical expertise.^[47] This web-based interface facilitates corpus creation either by uploading user-provided documents or by automatically crawling and harvesting content from the web using seed keywords or specified URLs via the integrated WebBootCaT technology.^[48] It supports a range of input formats, including plain text (.txt), HTML (.htm, .html), TEI XML (.tei, .xml), Microsoft Word (.doc, .docx), PDF (.pdf, with OCR for scanned documents), and zipped archives for batch processing.^[49]

The corpus building process begins with users naming the corpus, selecting the primary language, and optionally adding a description before proceeding to input data.^[49] Uploaded texts undergo preprocessing to clean and structure the content, such as removing boilerplate or non-linguistic elements from web pages and converting complex formats to a vertical text representation suitable for indexing.^[50] Deduplication is applied to eliminate exact or near-duplicates, ensuring the corpus maintains high-quality, non-redundant data.^[51] Following preprocessing, the tool automatically performs part-of-speech tagging and lemmatization for more than 30 languages, assigning positional attributes like lemmas and tags to each token to support subsequent linguistic queries.^[48]

Once prepared, the corpus is compiled and indexed, generating searchable structures including word sketches and thesauri where applicable.^[52] This indexing step creates a fully functional corpus that integrates directly with Sketch Engine's query interface, allowing users to analyze it using the same tools as pre-built collections.^[52] Small-scale corpora are available at no additional cost within standard subscriptions, while larger builds scale with institutional licensing for handling extensive datasets.^[1]

Customization enhances user control over the corpus structure and utility. Users can define subcorpora to isolate specific subsets, such as by genre or time period, through configuration files that specify structural tags like documents (<doc>), paragraphs (<p>), or sentences (<s>).^[53] Metadata attributes, including details like author, publication date, or domain, can be added to enrich structural elements and enable filtered searches.^[53] For multilingual applications, parallel alignment is supported via formats like TMX or XLIFF, allowing sentence-level correspondences to be established for translation studies.^[49]

In the 2020s, updates to corpus building have introduced streamlined handling of large-scale datasets through optimized processing pipelines and built-in automated quality assessments, such as integrity verification during compilation, to facilitate reliable use in research projects.^[52]
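
The vertical representation and structural metadata described above can be illustrated with a small reader. This is a minimal sketch that assumes the conventional structure names doc, p, and s, one token per line, and tab-separated columns in the order word, tag, lemma; the column order, the sample text, and the `parse_vertical` helper are illustrative assumptions rather than Sketch Engine's own tooling.

```python
import re
from typing import Dict, Iterator, Sequence

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
DOC_OPEN_RE = re.compile(r"<doc\b([^>]*)>")
SENT_OPEN_RE = re.compile(r"<s[\s>]")

# Hypothetical two-document sample in vertical format.
SAMPLE = "\n".join([
    '<doc author="A" genre="news">',
    "<s>",
    "Dogs\tNNS\tdog",
    "bark\tVBP\tbark",
    "</s>",
    "</doc>",
    '<doc author="B" genre="fiction">',
    "<s>",
    "Cats\tNNS\tcat",
    "</s>",
    "</doc>",
])


def parse_vertical(text: str,
                   columns: Sequence[str] = ("word", "tag", "lemma"),
                   ) -> Iterator[Dict[str, object]]:
    """Yield one dict per token, carrying document metadata and sentence number.

    Simplification: any line starting with '<' is treated as markup.
    """
    doc_meta: Dict[str, str] = {}
    sentence = -1
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("<"):
            doc_match = DOC_OPEN_RE.match(line)
            if doc_match:
                doc_meta = dict(ATTR_RE.findall(doc_match.group(1)))
            elif line.startswith("</doc"):
                doc_meta = {}
            elif SENT_OPEN_RE.match(line):
                sentence += 1
            continue
        token: Dict[str, object] = dict(zip(columns, line.split("\t")))
        token["doc"] = dict(doc_meta)
        token["sentence"] = sentence
        yield token


if __name__ == "__main__":
    tokens = list(parse_vertical(SAMPLE))
    # A subcorpus defined by a metadata attribute, here genre="news".
    news = [t["lemma"] for t in tokens if t["doc"].get("genre") == "news"]
    print(news)
```

Filtering tokens by a document attribute, as in the last lines, mirrors how a subcorpus restricted to one genre or time period could be carved out of a larger corpus.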

Technical Architecture

Manatee

Manatee serves as the core backend database and indexing system for Sketch Engine, managing the storage and efficient retrieval of large-scale text corpora. Originally developed in C++ by Pavel Rychlý, it was designed specifically for corpus linguistics applications, enabling the handling of corpora containing billions of words through optimized data structures such as inverted indexes for rapid query processing.^[54]^[55] Some components, including the corpus indexing tool mklcm, were rewritten in the Go programming language starting in 2016 to enhance performance and maintainability.^[16]

Key functions of Manatee include processing tokenized text into a vertical format where each token is annotated with attributes such as part-of-speech (POS) tags and lemmas, facilitating advanced linguistic analysis. During indexing, it builds positional inverted indexes that map attribute values to their occurrences in the corpus, supporting fast searches via the Corpus Query Language (CQL). Lemmatization and POS tagging are integrated as corpus attributes, allowing queries to target base forms or grammatical categories without reprocessing raw text.^[56]^[55]

In terms of performance, Manatee is engineered to manage terabyte-scale corpora, with features like asynchronous query evaluation that display initial results before full computation, making it suitable for interactive use. Indexing supports parallel processing to accelerate the building of large corpora, as introduced in version 2.152, reducing preparation time for multi-billion token datasets.^[57]^[16]

The core of Manatee is available open-source as part of NoSketch Engine, an initiative that combines it with the Bonito interface for free corpus management, allowing customization for specific languages through extensible attribute handling and query optimizations. This open-source variant supports deployment in diverse environments while maintaining compatibility with Sketch Engine's proprietary extensions. Manatee interacts with the Bonito frontend to deliver query results, but its primary role remains backend data handling.^[58]^[59]
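
The idea of a positional inverted index over token attributes can be shown in a few lines. The sketch below is a toy in-memory illustration, not Manatee's actual data structures or API; the `TinyIndex` class and its method names are hypothetical. It answers a single-token query in the spirit of the CQL example [lemma="run" & tag="V.*"] by intersecting position sets.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class TinyIndex:
    """Maps attribute -> value -> sorted list of corpus positions."""

    def __init__(self, tokens: List[Dict[str, str]]) -> None:
        self.size = len(tokens)
        self.postings: Dict[str, Dict[str, List[int]]] = defaultdict(
            lambda: defaultdict(list)
        )
        for position, token in enumerate(tokens):
            for attribute, value in token.items():
                self.postings[attribute][value].append(position)

    def positions(self, attribute: str, pattern: str) -> Set[int]:
        """Positions whose attribute value matches the regex as a whole."""
        regex = re.compile(pattern)
        hits: Set[int] = set()
        for value, plist in self.postings[attribute].items():
            if regex.fullmatch(value):
                hits.update(plist)
        return hits

    def match_all(self, **conditions: str) -> List[int]:
        """Positions satisfying every attribute condition (logical AND)."""
        sets = [self.positions(attr, pat) for attr, pat in conditions.items()]
        return sorted(set.intersection(*sets)) if sets else []


if __name__ == "__main__":
    rows = [("The", "DT", "the"), ("runner", "NN", "runner"),
            ("runs", "VBZ", "run"), ("a", "DT", "a"),
            ("run", "NN", "run"), ("and", "CC", "and"),
            ("ran", "VBD", "run")]
    corpus = [dict(zip(("word", "tag", "lemma"), row)) for row in rows]
    index = TinyIndex(corpus)
    print(index.match_all(lemma="run", tag="V.*"))  # prints [2, 6]
```

Because the lookup touches only the posting lists of matching attribute values, the query never rescans the token stream, which is the property that makes this kind of index attractive for very large corpora.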

Bonito

Bonito serves as the web-based graphical user interface (GUI) for Sketch Engine, enabling users to input queries and interact with corpus data through an intuitive platform. Developed as the client component in a client-server architecture, it facilitates the display of search results such as keyword-in-context (KWIC) concordances, collocation graphs, frequency distributions, and word sketches, all rendered dynamically via web technologies.^[56]^[60]

Implemented in Python since version 2, Bonito leverages an object-oriented structure for maintainability and extensibility, utilizing tools like the Cheetah Templating Engine to generate responsive HTML outputs. Key features include support for multilingual user interfaces, with localization added for languages such as Polish, Slovak, Spanish, French, and Arabic in updates from 2021 to 2023, allowing seamless language selection based on browser settings or user profiles. Additionally, it provides API access for programmatic interactions, enabling developers to retrieve results in JSON or XML formats, with enhancements like keyword extraction and customizable views introduced in versions 3.42 and 3.92. Post-2020 updates emphasized responsive design, incorporating mobile and touch compatibility, particularly for related services like SKELL, ensuring accessibility across devices by 2025.^[61]^[60]

Bonito integrates closely with the Manatee corpus management system by communicating queries to the server for processing and retrieving data for visualization, while handling frontend tasks independently. It manages user sessions through standard web protocols, supporting features like subcorpus saving and query history to maintain continuity during interactions. Security is enforced via role-based access controls, configurable for user groups and shared corpora, including HTTPS for secured connections and permission checks to prevent unauthorized data access.^[56]^[60]

The interface evolved significantly with the release of Bonito 2 in 2004, transitioning from an earlier Tcl/Tk-based standalone application to a fully web-based CGI-driven system, which replaced the legacy interface entirely by January 2020 to streamline maintenance and user experience. Subsequent versions, such as 3.70 in 2021 introducing trends visualization and 3.101 in 2023 enabling multiword sketches for queries of three or more terms, have continued to refine its capabilities for advanced linguistic analysis.^[15]^[54]^[60]

Corpus Architect

Corpus Architect is a Python-based utility integrated into Sketch Engine, designed to facilitate the creation and maintenance of custom corpora from raw text files or web sources without requiring advanced technical expertise.^[47] It serves as a dedicated tool for corpus preparation, enabling users to process diverse data inputs into structured, queryable formats compatible with Sketch Engine's ecosystem.^[62] By incorporating web crawling capabilities via the WebBootCaT module, it allows automated collection of domain-specific texts using seed keywords and search engines, streamlining the assembly of corpora for linguistic analysis.^[63]

The tool handles essential processes such as text cleaning to remove noise and inconsistencies, followed by annotation for linguistic features including part-of-speech tagging, lemmatization, named entity recognition (NER), and sentiment analysis.^[63] Deduplication is a core step, employing algorithms to eliminate exact or near-duplicate content, ensuring corpus quality and reducing redundancy during compilation.^[62] Once processed, Corpus Architect generates indexes in the Manatee format, which supports efficient storage and retrieval for subsequent querying.^[62] It also automates metadata extraction and the compilation of derived structures like word sketches and thesauri, enhancing the corpus's utility for lexicographic and research purposes.^[63]

Advanced features include batch processing for handling large-scale data volumes and scripting interfaces for custom automation, allowing users to tailor workflows via Python scripts.^[62] The utility supports vertical file formats, where each token appears on a separate line with associated attributes, facilitating precise alignment and analysis in multilingual or parallel corpora.^[63] What distinguishes Corpus Architect within Sketch Engine is its seamless integration with the Bonito interface, enabling immediate querying and visualization of newly built corpora without additional setup.^[62]
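
The deduplication step can be illustrated with a generic near-duplicate filter. The text does not specify which algorithm Corpus Architect uses, so the sketch below swaps in a common textbook technique, word n-gram shingles compared by Jaccard similarity; the function names, the trigram size, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Return the set of lowercase word n-grams in a paragraph."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(paragraphs: List[str], n: int = 3,
                threshold: float = 0.5) -> List[str]:
    """Keep a paragraph only if it is not too similar to one already kept.

    This pairwise comparison is quadratic and only suitable for small inputs.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for paragraph in paragraphs:
        current = shingles(paragraph, n)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue
        kept.append(paragraph)
        kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    texts = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",   # exact duplicate
        "the quick brown fox jumps over the lazy cat",   # near duplicate
        "corpus tools index large text collections",
    ]
    for line in deduplicate(texts):
        print(line)
```

In this example the second and third paragraphs are dropped: the exact copy has similarity 1.0, and the near copy shares six of eight distinct trigrams with the first paragraph (similarity 0.75).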

Applications

In Lexicography and Publishing

Sketch Engine has been widely adopted by major publishers in lexicography since the early 2000s, enabling evidence-based dictionary production through corpus analysis. Oxford University Press (OUP), Macmillan, Cambridge University Press, and Collins, four of the UK's five largest dictionary publishers, have integrated it into their workflows for creating and updating monolingual and bilingual dictionaries.^[64] Macmillan was the first to use word sketches in 1999, while OUP adopted the full system shortly thereafter for thesaurus development and beyond.^[65]

In dictionary compilation, Sketch Engine's word sketches provide concise summaries of a word's collocations, grammatical patterns, and usage, serving as draft entries for definitions and example selection.^[65] Lexicographers at these publishers employ term extraction tools to identify neologisms and multi-word units from large corpora, facilitating the detection of emerging language trends for inclusion in resources like learner's dictionaries.^[19] For instance, Macmillan's online dictionaries leverage these features to label core vocabulary (e.g., 7,500 "red words" for high-frequency terms) and generate corpus-attested examples, shifting from print to digital formats by 2012.^[64]

The tool's impact lies in promoting data-driven lexicography, replacing intuition-based methods with statistical evidence from billions of words, which has streamlined production and improved accuracy.^[64] Reports from the 2014 Research Excellence Framework highlight efficiency gains, such as generating detailed word profiles in seconds, allowing lexicographers to focus on curation rather than manual data gathering; this has supported the explosive growth of online dictionaries since 2009.^[64]

A key case study involves OUP's use of the Oxford English Corpus (nearly 2.1 billion words analyzed via Sketch Engine) for updating the Oxford English Dictionary (OED), including revisions to entries based on real-world usage across English variants.^[66] Similarly, multilingual projects, such as bilingual dictionaries, benefit from Sketch Engine's alignment tools for cross-language collocations.^[65]

In the 2020s, Sketch Engine has evolved to incorporate hybrid AI-human workflows, enhancing lexicographic processes with automated features like word sense induction using language models to group collocations by meaning.^[19] This integration allows publishers to combine machine-generated insights with expert verification, as seen in recent updates to term extraction for more languages, supporting faster detection of specialized vocabulary in global resources.^[19]

In Research and Education

Sketch Engine has been extensively applied in academic linguistic research, particularly for diachronic analysis in sociolinguistics, where its Trends and Timeline tools enable researchers to track changes in word usage and frequency over time. For instance, the Timeline feature generates visualizations of language evolution, allowing studies on neologisms, semantic shifts, and sociolinguistic variations in large-scale corpora spanning decades or centuries.^[67]^[29] In translation studies, parallel corpora such as the OPUS collection facilitate comparative analysis across languages, helping scholars identify translation equivalents, idiomatic expressions, and alignment patterns in aligned sentence pairs.^[68]^[69] Additionally, researchers in domain-specific fields build custom corpora to analyze specialized texts; historians, for example, upload historical documents to create tailored corpora for examining linguistic features in archival materials like Early English Books Online.^[70]^[71]

In education, Sketch Engine supports language teaching through its SKELL interface, a simplified version designed for classrooms that provides authentic examples of word usage without requiring advanced technical knowledge. Teachers in English as a Second Language (ESL) programs integrate SKELL to illustrate collocations, grammar patterns, and contextual examples, fostering corpus-based pedagogy that emphasizes real-language exposure over rote memorization.^[72]^[73] The platform also aids in analyzing learner corpora, where educators upload student writing to identify common errors, vocabulary gaps, and progress in language acquisition.^[74]

The tool's community includes numerous universities and research institutions worldwide, such as Lancaster University and the University of Groningen, which provide institutional access for linguistic analysis and text mining.^[75]^[76] Examples of its impact include sociolinguistic studies using Timeline to monitor sentiment shifts in economic terminology during crises, revealing patterns in public discourse.^[77] In ESL education, corpus-based approaches with Sketch Engine have been adopted in programs to enhance vocabulary teaching, as demonstrated in classroom activities exploring word sketches for nuanced usage.^[78]

Recent expansions from 2024 to 2025 have extended its utility to interdisciplinary areas like computational social science, integrating corpus tools with AI for analyzing social media trends and multilingual data in applied linguistics research. As of November 2025, updates include the English Trends corpus exceeding 86 billion words for enhanced diachronic studies and timestamped corpora in 18 languages for time-specific multilingual analysis.^[79]^[80]^[81]^[28]

References

[1]
Sketch Engine: Create and search a text corpus
Sketch Engine is the ultimate tool to explore how language works. Its algorithms analyze authentic texts of billions of words (text corpora)
[2]
[PDF] The Sketch Engine
Now, we have developed the Sketch Engine, a corpus tool which takes as input a corpus of any language and a corresponding grammar patterns and which generates ...
[3]
How to use a corpus to get information about words - Sketch Engine
Sketch Engine is an online text analysis tool that works with large samples of language, called text corpora, to identify what is typical and frequent in a ...
[4]
Sketch Engine news: enhanced tools, new corpora, and Lexicom ...
Sketch Engine news: enhanced tools, new corpora, and Lexicom 2024 in Spain ... These regular updates enable you to use Trends, the #diachronic analysis ...
[5]
New 15-billion-word English corpus | Sketch Engine
Check our new 15-billion-word English corpus (enTenTen) comprised of texts from the Web until the end of 2015. We used our newest advanced cleaning method ...
[6]
About us - Lexical Computing
Lexical Computing is a supplier of word databases, lexicons, n-gram databases and other language data and a developer of the Sketch Engine corpus software.
[7]
[PDF] Ten Years On 1. Introduction - The Sketch Engine
in Pakistan (as an official language but not the mother tongue of many people) is a ... dedicated corpus linguistic tools, Google may be the best tool to use. For ...
[8]
NoSketch Engine and Sketch Engine
NoSketch Engine is an open source version of Sketch Engine with certain functionality limitations. NoSketch Engine does not contain any corpora.
[9]
Open-source Natural Language Processing tools - Lexical Computing
NoSketch Engine is an open-source corpus query system based on Sketch Engine. NoSketch Engine does not feature any of the automated corpus building tools ...
[10]
Choose the right corpus | Sketch Engine
Sketch Engine provides you hundreds of corpora in various sizes from tiny (less than million words) to really huge (10+ billion words).
[11]
Sketch Engine team
Adam Kilgarriff founded Lexical Computing, the company behind Sketch Engine, in 2003 and remained a central figure of Sketch Engine until November 2014.
[12]
The Sketch Engine: ten years on | Lexicography
Jul 10, 2014 · The Sketch Engine is a leading corpus tool ... corpus websites and corpus tools as available for lexicography and corpus linguistics.
[13]
(PDF) The Sketch Engine - ResearchGate
Now, we have developed the Sketch Engine, a corpus tool which takes as input a corpus of any language and a corresponding grammar patterns and which generates ...
[14]
[PDF] SkELL: Web Interface for English Language Learning - Sketch Engine
SkELL is a web interface for English language learning, derived from Sketch Engine, offering concordance, word sketches, and a thesaurus.
[15]
Old interface closes down - Sketch Engine
Sketch Engine decided not to maintain two interfaces. For this reason, the old interface closes down and will not be available any more after 20 January 2020.
[16]
Sketch Engine changelog - Manatee
This software consists of three main components, which enable searching and building text corpora. Bonito – a graphical user interface to corpora maintained, ...
[17]
EUR-Lex parallel Corpus | Sketch Engine
The EUR-Lex parallel corpus is a collection of multilingual corpora in all the official languages of the European Union.
[18]
Discover the new Timeline and other features. - Sketch Engine
Jul 1, 2024 · Track how word usage and frequency change over time with Sketch Engine's Timeline Function! Discover trends, uncover new words, and delve into detailed changes.
[19]
Automated word sense identification, multi-word term extraction for ...
Sketch Engine supports monolingual and bilingual term extraction. Read more about linguistic tools for term extraction on our blog. Sketch Engine free trial.
[20]
List of corpora | Sketch Engine
This is a list of corpora preloaded in Sketch Engine and available to Sketch Engine users. In addition to these corpora, Sketch Engine holds other corpora.
[21]
Concordance - most powerful corpus search | Sketch Engine
The concordance is the most powerful tool with a variety of search options. It can find words, phrases, tags, documents, text types or corpus structures.
[22]
CQL – Corpus Query Language - Sketch Engine
The Corpus Query Language is a special code or query language used in Sketch Engine to search for complex grammatical or lexical patterns.
[23]
Word sketch - collocations and word combinations - Sketch Engine
The word sketch shows the most typical collocations and word combinations of each word in the language identified in a text corpus.
[24]
distributional thesaurus - Sketch Engine
Nov 13, 2024 · It draws on the theory of distributional semantics. The automatically produced thesaurus is available for each word in the corpus.
[25]
[PDF] An efficient algorithm for building a distributional thesaurus (and ...
The Sketch Engine now allows the user to prepare keyword lists for any subcorpus, either in relation to the full corpus or in relation to another subcorpus, ...
[26]
Timeline – language use over time | Sketch Engine
The timeline function displays the changing frequency of a word or phrase over time. It provides a detailed graph with information about word frequency ...
[27]
Parallel concordance - searching translations - Sketch Engine
The parallel concordance searches for words, phrases, tags, documents, text types or corpus structures in one language and displays the results together.
[28]
Build parallel and multilingual corpora - Sketch Engine
This method only supports 2 languages. If your parallel corpus has more languages, an external tool or a manual procedure should be used for the alignment.
[29]
Writing a Sketch Grammar
Word Sketch Grammar is a series of rules written in the CQL query language that search for collocations in a text corpus and categorize them according to their ...
[30]
Keywords and term extraction - Sketch Engine
Keyword and term extraction identifies typical words, single and multi-word units, and results in keywords (single words) and terms (multi-words).
[31]
The best term extraction - Sketch Engine
Term extraction or terminology extraction is an automatic method of analysing text in order to identify phrases which fulfil the criteria for terms.
[32]
Supported languages - Sketch Engine
This page lists all supported languages for which there are publicly available corpora. Languages with user corpora only are not included.
[33]
SKELL – corpus tool for language learners - Sketch Engine
A simple tool for students and teachers of English to easily check whether or how a particular phrase or a word is used by real speakers of English.
[34]
A Critical Review of SkELL (Sketch Engine for Language Learning
May 26, 2025 · Sketch Engine's simplified language learning interface offers learners authentic usage of words and phrases by tapping into its mother corpus ...
[35]
(PDF) Technology Integration and SkELL: A Novelty in English ...
Aug 10, 2025 · On purpose, the present study introduces the implementation of SkELL (Sketch Engine for Language Learning) in English Foreign Language (EFL) ...
[36]
Parallel corpora | Sketch Engine
EUR-Lex ... This is an enormous corpus of various documents. The documents cover various topics. Although it is formal language on the legal side, it covers ...
[37]
TenTen Corpora | Sketch Engine
TenTen corpora are web-based text corpora, with 10+ billion words per language, built using specialized technology for linguistic content.
[38]
British National Corpus (BNC) search | Sketch Engine
The British National Corpus (BNC) is a 100-million-word collection of samples of the written and spoken language of British English from the latter part of the ...
[39]
OpenSubtitles parallel corpora - Sketch Engine
The OpenSubtitles parallel corpora are a collection of 60 corpora in 58 languages made up of translated movie subtitles in the OpenSubtitles database.
[40]
EcoLexicon corpus search - Sketch Engine
Search the EcoLexicon corpus, an English corpus of contemporary environmental texts prepared by the LexiCon Research Group at the University of Granada.
[41]
https://www.sketchengine.eu/british-national-corpus-bnc/
[42]
Happy New Year with a bunch of new corpora! - Sketch Engine
In December 2024 we introduced new corpora for Lithuanian, Finnish and Swedish language.
[43]
corpus architect - Sketch Engine
Nov 12, 2024 · an intuitive tool inside Sketch Engine for creating corpora from documents or the Web which does not require any expert knowledge.
[44]
Create a corpus from the web - Sketch Engine
Create a multi-million-word corpus from the web within minutes. Fully automatic corpus building, lemmatization and tagging in 30+ languages.
[45]
Create a corpus by uploading files - Sketch Engine
To create a corpus by uploading files, name it, select language, drag/drop files, and select 'I have my own texts'. Multiple files can be uploaded as a zip.
[46]
Preparing a Text Corpus for Sketch Engine: Overview
Steps to prepare a text corpus for Sketch Engine · Prepare the corpus configuration file · Compile (index) the corpus · Verify corpus consistency, integrity and ...
[47]
Build a corpus from the web | Sketch Engine
May 13, 2019 · Sketch Engine uses a deduplication procedure which is able to detect perfect duplicates as well as texts which were slightly adapted, shortened or extended.
[48]
Compiling corpus on local installation - Sketch Engine
You are ready to compile the corpus in Sketch Engine. This can either be done from the Corpus Architect interface, or from the command line.
[49]
Fine-tune your corpus | Sketch Engine
[50]
[PDF] Manatee/Bonito -- A Modular Corpus Manager
1. Rychlý, P.: Corpus managers and their effective implementation. Ph.D. thesis, Faculty of Informatics, Masaryk University (2000).
[51]
[PDF] Optimization of Regular Expression Evaluation within the Manatee ...
Manatee is a state-of-the-art corpus management system providing facilities for efficient indexing (compiling) and searching billion-word-sized corpora. [6].
[52]
[PDF] Manatee, Bonito and Word Sketches for Czech
Manatee serves as a base for the Sketch Engine [4]. As it was defined in [3], Word Sketches is a short corpus-based summary of a word's grammatical and ...
[53]
[PDF] Accelerating Corpus Search Using Multiple Cores - Sketch Engine
The Manatee system (Rychlý, 2000) is a corpus manager, designed to be able to deal with extremely large corpora, optimized for fast query evaluation. It ...
[54]
NoSketchEngine | langui.ch /'læŋgwɪtʃ/ /'læŋgwɪdʒ/
Welcome to NoSketch Engine, an open-source project combining Manatee and Bonito into a powerful and free corpus management system.
[55]
acdh-oeaw/noske-ubi9: Building NoSkE for and with UBI9 - GitHub
NoSketch Engine is an open-source project combining Manatee and Bonito and Crystal into a powerful and free corpus management and search system.
[56]
The Sketch Engine Changelog - Bonito
Manatee – a corpus management tool including corpus building and indexing, fast querying and providing basic statistical measures, see the changelog of Manatee ...
[57]
[PDF] Corpus Query System Bonito – Recent Development - Sketch Engine
At Masaryk University in Brno, a corpus manager Manatee/Bonito [1] is being developed, that is able to perform wide variety of tasks including e.g. fast.
[58]
[PDF] Proceedings of the 3rd Workshop on Building and Using ... - LREC
May 22, 2010 · Corpus Architect. A recent release made available through the Sketch Engine website, Corpus Architect is a system incorporating BootCaT, a ...
[59]
(PDF) Sketch Engine for Bilingual Lexicography - ResearchGate
Sketch Engine is a leading corpus query and corpus management tool that has been used for many large dictionary projects. The paper summarizes its features ...
[60]
Using Computational Lexicography for Dictionary Production with ...
The Sketch Engine has been adopted by four of the UK's five major dictionary publishers, national language institutes in nine European countries and over 100 ...
[61]
(PDF) The Sketch Engine: Ten Years On - ResearchGate
The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software.
[62]
Oxford English Corpus search - Sketch Engine
The last version of this corpus contains nearly 2.1 billion words (almost 2.5 billion tokens). For more information visit Oxford Dictionaries's website. The ...
[63]
Trends – diachronic analysis | Sketch Engine
Timelines are available via the Concordance or Wordlist tools. They are computed the same as the graphs in Trends, however, they can be generated for any word ...
[64]
OPUS parallel corpora | Sketch Engine
Search the OPUS parallel corpora, the multilingual corpora in 40 languages. Make concordance or generate n-gram, word lists, collocations and more...
[65]
7 - Leveraging Large Corpora for Translation Using Sketch Engine
Jun 10, 2019 · ... Sketch Engine or by building a new parallel corpus from their TM using Corpus Architect, the corpus-building component of Sketch Engine ...
[66]
Historians | Sketch Engine
... to create a corpus from files or use our tool for building corpora from the web, e.g. downloading specific websites containing historical texts or books.
[67]
(PDF) The Sketch Engine as infrastructure for historical corpora
Abstract A part of the case for corpus building is always that the corpus will have many users and uses. For that, it must be easy to use.
[68]
Teachers - Sketch Engine
SKELL is a simple user-friendly interface to Sketch Engine for students and teachers of English. No need to worry about settings, just type a word and see how ...
[69]
[PDF] A Critical Review of SkELL (Sketch Engine for Language Learning)
Embracing. Topal's (2022) framework, this media review critically evaluates SkELL by addressing its strengths and weaknesses as a language learning resource.
[70]
[PDF] Corpora and Language Learning with the Sketch Engine and SKELL
The Sketch Engine (Kilgarriff et al 2004) is a leading corpus tool which has been in use for lexicography and language research since 2004. It has two ...
[71]
Sketch Engine and other tools for language analysis – CASS
All Lancaster University staff and students have now access to Sketch Engine, an online tool for the analysis of linguistic data.
[72]
New: Sketch Engine, tool for language research | Library
Mar 11, 2025 · Sketch Engine is a tool for language research, which can also be used for text analysis or text mining.
[73]
Tracking diachronic sentiment change of economic terms in times of ...
Our analysis shows that there were three clearly defined epochs during the timeline of the study: pre-crisis in 2007, the outburst of the crisis of 2008–2012, ...
[74]
Using the Sketch Engine Corpus Query Tool for Language Teaching
Editor's Note: The web has brought a myriad of tools to our students' fingertips, and Keith Barrs has shared an engaging tool for investigating how language is ...
[75]
Bibliography of Sketch Engine
To cite Sketch Engine in academic publications, use the following papers. If you refer to Sketch Engine in general, choose from the papers in General ...
[76]
Integrating critical corpus and AI literacies in applied linguistics
These workshops focused on the use of the corpus analysis software, Sketch Engine and the Generative AI tool, ChatGPT for vocabulary and grammar learning.