ChEMBL
ChEMBL is a manually curated, open-access database of bioactive small molecules with drug-like properties, aggregating chemical structures, bioactivity measurements, and associated genomic and proteomic target data to facilitate drug discovery and chemical biology research.[1][2] Developed and maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), it originated from the StARlite system of Inpharmatica Ltd. and was publicly launched in 2009 with funding from the Wellcome Trust.[2][3]
The database's core strength lies in its high-quality curation process, which involves manual extraction of bioactivity data, such as binding affinities (e.g., IC50, Ki), functional potencies, and ADMET (absorption, distribution, metabolism, excretion, toxicity) properties, from peer-reviewed medicinal chemistry literature, patents via SureChEMBL, and direct submissions from research consortia like EUbOPEN and BindingDB.[2][3] Data are sourced from approximately 230 scientific journals and public repositories, ensuring compliance with FAIR (Findable, Accessible, Interoperable, Reusable) principles, and are standardized using ontologies like the Experimental Factor Ontology (EFO) for diseases and phenotypes.[2][3] As of the ChEMBL 36 release in October 2025, it encompasses approximately 2.8 million distinct compounds, 17,803 targets (primarily proteins but including cell lines and organisms), and millions of bioactivity data points across more than 830,000 functional and 520,000 binding assays.[4][2]
ChEMBL plays a pivotal role in cheminformatics and computational drug discovery by enabling applications such as quantitative structure-activity relationship (QSAR) modeling, virtual screening, machine learning-based target prediction, and toxicity assessment.[2][3] It integrates with other resources like PubChem, UniProt, and the Open Targets platform, and offers user-friendly access through a web interface, RESTful APIs, downloadable SQL dumps, and RDF formats for semantic querying.[1][2] Recent enhancements in ChEMBL 36 include expanded drug and clinical candidate data from FDA and EMA approvals (e.g., incorporating biotherapeutics and vaccines), tripled patent-derived assays from BindingDB, and new classifications for pesticides and natural products, reflecting its ongoing evolution to support AI-driven research and neglected tropical disease initiatives.[4][5] Over the past 15 years, ChEMBL has been cited in nearly 1,000 PubMed articles, underscoring its influence in advancing therapeutic development.[2]
Overview
Definition and Purpose
ChEMBL is a manually curated, open-access database of bioactive molecules with drug-like properties. It integrates chemical structures, bioactivity data (such as binding affinities and functional outcomes), and genomic information associated with molecular targets.[1][6] The primary purpose of ChEMBL is to facilitate the translation of genomic data into effective new medicines by supporting chemical biology and drug discovery efforts. It aids in target validation, compound prioritization, and the elucidation of molecular interactions between small molecules and biological targets.[6][7] As of the ChEMBL 36 release in 2025, the database encompasses over 2.8 million distinct compounds, more than 17,800 targets, and millions of bioactivity measurements, underscoring its scale as a key chemogenomic resource that bridges chemistry and biology.[8][1]
History and Development
ChEMBL originated as the StARlite database, developed by the biotechnology company Inpharmatica Ltd. in the early 2000s to capture structure-activity relationship data from medicinal chemistry literature.[2] Inpharmatica was acquired by Galapagos NV in 2006, which continued development of the resource as a proprietary chemogenomics platform.[2] In July 2008, Galapagos transferred the database to the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) under a £4.7 million Strategic Award from the Wellcome Trust, enabling its transition to a publicly accessible resource.[9]
The database was rebranded as ChEMBL and launched publicly by EMBL-EBI in October 2009, initially comprising over 500,000 compounds with a focus on curated bioactivity data extracted from peer-reviewed literature.[9][2] This marked a pivotal shift from a commercial tool to an open-access model, broadening its utility for academic and industrial drug discovery efforts.[2] EMBL-EBI has since assumed ongoing maintenance, supported by core funding from EMBL member states, the Wellcome Trust, and European Union projects such as the Innovative Medicines Initiative (IMI) and Framework 7 programs.[10][11]
Key expansions followed the launch, including the integration of absorption, distribution, metabolism, excretion, and toxicity (ADMET) data in 2011 to enhance its applicability in early-stage drug profiling.[12] In 2023, updates to ChEMBL incorporated broader data types, such as detailed profiles for clinical candidate drugs, reflecting its evolution into a multifaceted platform for drug discovery.[7] The resource marked its 15th anniversary in October 2024, underscoring its growth from a literature-focused repository to a comprehensive, FAIR-compliant database aiding global cheminformatics research.[2]
ChEMBL's development has proceeded through regular releases, with version 17 in September 2013 containing over 12 million bioactivity measurements from more than 1 million assays.[13] By version 35, released in December 2024, the database encompassed 17,500 approved drugs alongside extensive clinical candidate information, demonstrating sustained expansion in scale and scope.[14] The latest release, ChEMBL 36 in October 2025, further expanded drug and clinical candidate data from FDA and EMA approvals, tripled patent-derived assays from BindingDB, and introduced new classifications for pesticides and natural products.[8]
Data Content and Curation
Sources of Data
ChEMBL primarily obtains its data through manual extraction from peer-reviewed medicinal chemistry literature, focusing on seven core journals: Journal of Medicinal Chemistry, Bioorganic & Medicinal Chemistry Letters, European Journal of Medicinal Chemistry, Bioorganic & Medicinal Chemistry, Journal of Natural Products, ACS Medicinal Chemistry Letters, and MedChemComm.[7] Additional sources include deposited datasets from high-throughput screening efforts, such as those from PubChem BioAssay and BindingDB, as well as public repositories like the GSK, Novartis, and St. Jude malaria screening datasets, the Sanger Institute's Genomics of Drug Sensitivity in Cancer, and the MMV Malaria Box.[15] Patent data, including contributions from BindingDB patents and SureChEMBL, further supplement these sources, alongside clinical candidate information from regulatory sources like the FDA Orange Book and EMA approvals.[1][15] As of the ChEMBL 36 release in October 2025, enhancements include tripled patent-derived assays from BindingDB (to 13,847 assays), expanded data on biotherapeutics and vaccines from FDA and EMA approvals (up to November 2024), and new classifications for pesticides and natural products.[4]
The database encompasses a range of data types centered on bioactive molecules, including chemical structures of small molecules and peptides, alongside approved drugs, clinical candidates, and experimental compounds.[1] Bioactivity measurements form a core component, covering binding affinities (e.g., IC50, Ki, Kd), functional assays, and ADMET (absorption, distribution, metabolism, excretion, toxicity) endpoints.[12] Target annotations link these to proteins and genes, sourced from databases like UniProt and Ensembl, while metadata includes assay descriptions, organism contexts (e.g., human, rodent models), and phenotypic screening results.[1][16] ChEMBL's content spans diverse therapeutic areas, with particular emphasis on neglected diseases through dedicated datasets like those for malaria and cancer sensitivity.[15]
Since its inception in 2009, the database has grown in data diversity, initially prioritizing binding data but expanding after 2011 to incorporate broader functional, ADMET, and phenotypic screening information from literature and deposited sets, reflecting evolving drug discovery needs.[7][12]
Curation Process
The curation process in ChEMBL begins with manual extraction of scientific facts from peer-reviewed journal articles, where curators identify and record key bioactivity data such as structure-activity relationships (SAR), assay results (e.g., IC50 values), target mappings, and associated metadata including experimental conditions and organism details.[17][18] Recent enhancements, as in ChEMBL 36 (October 2025), incorporate natural language processing (NLP) tools like LeadMine and spaCy for semi-automated extraction of bioactivity and phenotype data, while preserving core manual oversight.[4] This workflow involves drawing chemical structures as molfiles or SMILES notations and annotating protein targets using UniProt accession numbers to ensure traceability.[18] Automated steps follow to standardize activity data, converting diverse units (e.g., from 133 different concentration formats) to a common scale like nanomolar (nM) and calculating derived values such as pChEMBL, the negative base-10 logarithm of a molar potency or affinity value from a dose-response measurement (e.g., IC50, EC50, Ki, Kd), so that an IC50 of 1 µM corresponds to a pChEMBL of 6.[18] These processes aim to maximize data comparability while preserving original reported values in dedicated fields for transparency.[18]
Chemical structure standardization is a core component, employing an open-source pipeline integrated with the RDKit cheminformatics toolkit to process incoming structures systematically.[19] The pipeline consists of three modules: a Checker that validates structures against rules (assigning penalty scores from 2 to 7 for issues like invalid valences or stereo mismatches, with scores of 7 preventing loading), a Standardizer that applies FDA and IUPAC guidelines (e.g., normalizing charges, removing explicit hydrogens except in specific cases, and excluding organometallics), and a GetParent module that strips salts and solvents using predefined lists of 162 salts and 9 common solvents to generate canonical parent compounds.[19] Structures are converted to canonical SMILES, handling isomers by aggregating data under parent forms and flagging duplicates or errors for manual review; this has standardized over 2 million compounds across releases, with ongoing additions from literature and deposited datasets.[19][7]
Target-assay relationships are assigned confidence scores on a 0-9 scale during curation, reflecting the evidence level and specificity of the mapping (e.g., score 9 for a directly identified single protein target via binding assays, score 4 for multiple homologous proteins in a family, and score 0 for uncurated entries).[16] Scores are determined manually based on assay descriptions, prioritizing direct interactions (e.g., binding affinity) over inferred ones (e.g., phenotypic screens), with ambiguities labeled as "protein family" or "complex" to avoid over-assignment.[16][18]
Quality control integrates automated flagging of inconsistencies (e.g., out-of-range values or transcription errors like 1000-fold discrepancies in Ki measurements) with manual validation, ensuring less than 0.1% missing data and annotating potential issues in fields like DATA_VALIDITY_COMMENT.[18] External integrations, such as filtered PubChem BioAssay data introduced since 2011, undergo similar standardization and are cross-validated against ChEMBL's ontology mappings (e.g., using BioAssay Ontology for assay types and QUDT for units) to resolve literature ambiguities like unclear targets or duplicate reports.[18] Periodic releases incorporate these updates, with recent enhancements including semi-automated checks for pharmacokinetic/pharmacodynamic data and chemical probe annotations.[7]
The process addresses challenges from inconsistent literature reporting (such as varying assay formats or incomplete structural depictions) through rigorous validation rules and community deposition guidelines that promote FAIR principles (Findable, Accessible, Interoperable, Reusable).[7][20] For instance, new datasets like EUbOPEN chemical probes or SARS-CoV-2 screening results are curated to maintain interoperability, with documentation and training resources aiding reproducibility.[7] This dual manual-automated approach ensures ChEMBL's reliability for downstream applications in drug discovery.[17]
Access and Interfaces
Web Interface and APIs
The ChEMBL web interface provides an interactive platform for querying and exploring its database of bioactive molecules and bioactivity data. Users can perform searches by compound name, structure (including substructure and similarity searches), target (such as protein families or specific genes), assay type, documents, cell lines, or tissues, with flexible text matching, over HTTPS.[21] Browsing options include dedicated sections for approved drugs and clinical candidates, allowing users to filter by development phase, molecule type, or first approval year.[22]
Visualization tools enhance data exploration, featuring interactive bubble charts that summarize entity quantities (e.g., approximately 2.8 million compounds and 17,803 targets), hierarchical trees for target classifications like kinases or proteases, and bar charts for drug distributions by indication or phase.[22][4] These tools support clicking to drill down into related activities, structures, and plots of potency data (e.g., pChEMBL values), facilitating quick assessment of compound-target interactions without downloading data.[23]
ChEMBL offers a RESTful API for programmatic access, enabling real-time data retrieval without authentication, under a Creative Commons license. Key endpoints include /molecule/search for keyword or structure-based compound queries, /target for protein or gene targets (e.g., filtering by name containing "kinase"), and /activity for bioactivity records (millions of entries).[24] The API supports pagination via limit and offset parameters (default limit: 20), and filtering options such as pchembl_value__gte=5 for pChEMBL values of 5 or higher (that is, potencies of 10 µM or stronger).[24] Results are returned in JSON format, wrapped in metadata envelopes for total counts and navigation.[25]
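The pagination and filtering scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than an official client: the base URL, the ".json" suffix, and the envelope keys page_meta and total_count are assumptions about the public service, and the helper names build_url, iter_records, and http_fetch_json are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed base URL of the public ChEMBL data web services.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def build_url(resource, limit=20, offset=0, **filters):
    """Return the URL for one page of a resource such as 'activity'."""
    params = {"limit": limit, "offset": offset, **filters}
    return f"{BASE_URL}/{resource}.json?{urlencode(params)}"


def http_fetch_json(url, timeout=10):
    """Fetch a URL over HTTP and decode the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def iter_records(fetch_json, resource, key, limit=20, **filters):
    """Yield records page by page until the reported total is reached.

    fetch_json is any callable mapping a URL to a decoded JSON dict;
    key is the envelope field holding the records (assumed to be
    'activities' for the activity resource).
    """
    offset = 0
    while True:
        page = fetch_json(build_url(resource, limit, offset, **filters))
        records = page[key]
        yield from records
        offset += limit
        # Stop on an empty page or once every reported record was seen.
        if not records or offset >= page["page_meta"]["total_count"]:
            break
```

A call such as iter_records(http_fetch_json, "activity", "activities", target_chembl_id="CHEMBL2111439", pchembl_value__gte=5) would then walk every matching record. Passing the fetch function as an argument keeps the paging logic testable without network access.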
Web services are extended by ChEMBL Beaker, a suite of cheminformatic utilities for advanced queries. It enables similarity searches generating SVG maps from SMILES or SDF inputs, substructure matching via SMARTS patterns to highlight fragments or compute maximum common substructures, and calculations of physicochemical properties like molecular weight, logP, and hydrogen bond donors using RDKit.[26]
In practice, the API can retrieve all bioactivity data recorded against a given target. For example, querying /activity?target_chembl_id=CHEMBL2111439 (a specific kinase) returns the matching activity records with their standard relations, types, and pChEMBL values.[24] Integration in workflows is streamlined via the official Python client chembl_webresource_client, which supports Django-like filtering (e.g., new_client.activity.filter(target_chembl_id='CHEMBL2111439', pchembl_value__gte=6.0)) and local caching for efficiency.[27]
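Thresholds such as pchembl_value__gte=6.0 are easier to interpret when converted back to concentrations. The sketch below assumes the usual definition of pChEMBL as the negative base-10 logarithm of a molar potency value (IC50, EC50, Ki, Kd, and similar); the function names are hypothetical and not part of any ChEMBL library.

```python
import math


def pchembl_from_nm(value_nm):
    """Convert a potency in nanomolar to -log10 of the molar value."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)


def nm_from_pchembl(pchembl):
    """Convert a pChEMBL value back to a concentration in nanomolar."""
    return 10 ** (9.0 - pchembl)
```

Under this definition a cutoff of 6.0 keeps measurements of 1 µM (1,000 nM) or more potent, and each additional unit corresponds to a tenfold gain in potency.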
Although no strict rate limits are enforced, best practices include using pagination for large datasets, enabling client-side caching to minimize repeated requests, requesting only the fields that are needed to reduce payload size, and setting request timeouts (default: 10 seconds) to handle high-volume queries effectively.[27]
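Two of these practices, client-side caching and explicit timeouts, can be illustrated with the standard library alone. The class below is a hypothetical sketch and not how chembl_webresource_client implements its cache; it simply memoises decoded JSON responses by URL and passes a timeout to every request.

```python
import json
import urllib.request


class CachedFetcher:
    """Memoise JSON responses by URL so a repeated query is sent only once."""

    def __init__(self, opener=None, timeout=10):
        # opener must accept (url, timeout=...) and return a context
        # manager with a read() method, like urllib.request.urlopen.
        self._open = opener or urllib.request.urlopen
        self._timeout = timeout
        self._cache = {}

    def get(self, url):
        """Return the decoded JSON for url, fetching it on first use only."""
        if url not in self._cache:
            with self._open(url, timeout=self._timeout) as response:
                self._cache[url] = json.load(response)
        return self._cache[url]
```

An in-memory dictionary is the simplest possible cache and lasts only for the lifetime of the process; long-running pipelines would more plausibly persist responses to disk and expire them when a new ChEMBL release appears.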