Chemical database

A chemical database is an organized, typically electronic, collection of information about chemical substances, enabling efficient storage, retrieval, and analysis of data such as molecular structures, physical and chemical properties, biological activities, and safety profiles. These databases are essential resources in fields like cheminformatics, where they facilitate hypothesis testing and the critical evaluation of chemical information.
Chemical databases are broadly categorized into primary and secondary types. Primary databases archive raw, experimentally derived data, such as deposition records from researchers; secondary databases offer curated, value-added compilations from multiple primary sources, often including standardized annotations and cross-references. They encompass both factographic databases, which store structured records like chemical identifiers (e.g., CAS Registry Numbers) and property tables, and bibliographic databases, which index literature for chemical references. Common contents include hazard classifications, emergency response guidelines, structural similarity data, and crystal structures.
The importance of chemical databases lies in their role in accelerating scientific discovery, particularly in drug discovery, toxicity assessment, and regulatory compliance, by providing accessible, searchable repositories that complement incomplete individual sources and support advanced structure-based queries. For instance, they enable researchers to retrieve bioactive molecules via structural search algorithms, analyze molecular diversity, and integrate data across disciplines to inform risk assessments and innovation. Free databases dominate modern usage due to their open-access nature, while commercial ones offer enhanced curation for specialized needs.
Notable examples include PubChem, a comprehensive public repository from the National Center for Biotechnology Information containing over 322 million deposited substances and 119 million unique chemical structures (as of September 2024), sourced from scientific literature, patents, and experimental depositions. ChEMBL focuses on bioactivity data curated from peer-reviewed publications, aiding medicinal chemistry research with details on compound-target interactions. Other key resources are ChemSpider, which aggregates over 130 million structures (as of 2025) from crowdsourced and publisher data for broad chemical searches, and the Crystallography Open Database (COD), offering more than 529,000 open-access crystal structures (as of November 2025) for structural chemistry. Commercial options like SciFinder provide extensive chemical literature and substance records, exceeding 59 million references (as of 2025), to support industrial R&D.

Overview

Definition and Scope

A chemical database is an organized collection of data encompassing chemical structures, properties, reactions, spectra, and related information, designed for efficient storage, retrieval, and analysis. These databases primarily cover small-molecule compounds, polymers, and biomolecules, setting them apart from general scientific databases through their emphasis on chemical-specific attributes such as atomic connectivity, stereochemistry, and molecular topology.
The primary purposes of chemical databases include facilitating drug discovery through virtual screening and lead optimization, enabling materials design by providing property predictions for novel compounds, ensuring regulatory compliance via standardized reporting on hazardous substances, and supporting predictive modeling for toxicity and reactivity assessments. For instance, in pharmaceuticals, these databases allow researchers to screen millions of virtual compounds to identify potential therapeutic candidates.
Key concepts in chemical databases distinguish between centralized systems, where data is stored and managed in a single location for unified access and control, and distributed architectures, which spread information across multiple nodes to improve scalability and resilience in large-scale environments. Storage approaches often involve relational databases for structured chemical data like tabular properties and identifiers, contrasted with non-relational formats for handling complex, unstructured elements such as spectral images or reaction pathways.
Chemical databases emerged with early punched-card systems for indexing compounds and have evolved to modern scales exemplified by PubChem, which as of 2025 contains over 119 million unique compounds and 322 million substances.

Historical Development

The development of chemical databases began with manual systems, where chemists relied on index cards and printed handbooks to organize compound information. Index card systems, pioneered in the 18th century by figures like Carl Linnaeus for biological classification, were adapted for chemistry, with Leopold Gmelin's 1817 handbook using cards to catalog inorganic compounds. Friedrich Beilstein's Handbuch der Organischen Chemie, first published in 1881, served as a major precursor by systematically compiling verified data on organic compounds from the literature, spanning millions of entries over subsequent editions.
The transition to computerized systems occurred in the 1960s, driven by advances in computing power and the need to handle a growing chemical literature. The Chemical Abstracts Service (CAS) launched the CAS Registry System in 1965, the first electronic chemical registry to assign unique identifiers to substances; by the 2010s it had enabled automated indexing of over 100 million compounds. Concurrently, the Cambridge Structural Database (CSD) was established in 1965 to curate small-molecule crystal structures from diffraction experiments, initially with a few hundred entries and expanding significantly in the 1980s as crystallographic techniques improved resolution and throughput.
In the 1970s and 1980s, structure-searchable databases emerged, facilitated by innovations in software and hardware. Molecular Design Limited (MDL) introduced the MACCS system in the late 1970s, an early software package for storing and searching chemical structures using connection tables, which became widely adopted in pharmaceutical research for proprietary compound management. This period also saw the rise of spectral databases, spurred by advances in NMR spectroscopy that generated vast datasets requiring digital storage. Regulatory pressures, such as the U.S. Toxic Substances Control Act of 1976, further drove database development for compliance tracking.
The 1990s and 2000s ushered in the internet era, making databases web-accessible and integrating bioinformatics.
The International Union of Pure and Applied Chemistry (IUPAC) supported standards like JCAMP-DX for exchanging spectral data, promoting interoperability. PubChem, launched by the U.S. National Institutes of Health in 2004, provided free access to millions of compounds and bioactivities, catalyzing open-data initiatives. The European Union's REACH regulation in 2007 mandated extensive chemical data submission, boosting public databases for safety assessments.
From the 2010s to 2025, chemical databases continued to expand. ChEMBL grew to nearly 2 million unique compounds by 2020 and to over 2.8 million distinct compounds as of 2025 through curation of bioactivity data. The COVID-19 pandemic accelerated the development of antiviral compound databases, with an open dataset of potential inhibitors released in 2020 to aid global research efforts. After 2015, a shift to cloud-based platforms enabled handling of massive datasets, driven by regulatory needs and technologies like high-throughput NMR.

Types of Chemical Databases

Chemical Structure Databases

Chemical structure databases primarily store and organize representations of molecular topologies, capturing elements such as atomic connectivity, bond types, stereochemistry, and tautomeric forms to enable unique identification of chemical entities. These databases represent molecules as graphs where atoms serve as nodes and bonds as edges, facilitating the systematic cataloging of both simple and complex structures like polymers or organometallics.
The foundational effort in this domain traces back to the Chemical Abstracts Service (CAS), which began manual indexing of chemical literature in 1907 and introduced the first computerized structure registry in 1965 to handle the growing volume of disclosed substances. This shift to digital formats in the 1960s marked the transition from paper-based abstracts to machine-readable structure databases, enabling efficient storage and retrieval.
Prominent examples include the CAS REGISTRY, which as of 2025 contains over 290 million unique substances derived from scientific literature, patents, and other sources, assigning each a distinct CAS Registry Number for unambiguous identification. PubChem, maintained by the National Center for Biotechnology Information, holds approximately 119 million compounds and 322 million substances, aggregating data from over 1,000 sources including government depositions and academic contributions. ChemSpider, operated by the Royal Society of Chemistry, provides access to more than 130 million structures sourced from hundreds of suppliers and publications, emphasizing free public access. The standardization of notations like SMILES in the late 1980s played a pivotal role in intellectual property management by allowing consistent structure representation across databases and patent filings, reducing ambiguity in chemical claims.
Curation in these databases involves a combination of manual and automated processes to ensure accuracy and consistency. Automated validation employs rules such as valence checks to verify bond orders and configurations against chemical principles, flagging anomalies like invalid hybridization.
Manual review addresses nuanced cases, including the normalization of salts, isotopes, and mixtures into standardized parent structures with associated components. For instance, isotopic variants are often stored separately but linked to core structures, while mixtures are decomposed where possible to avoid redundancy.
Unique features of chemical structure databases include support for both 2D depictions, which emphasize connectivity, and 3D conformers, which model spatial arrangements for applications like docking simulations; PubChem, for example, provides computed 3D structures for millions of entries. Integration with external resources enhances utility: CAS REGISTRY incorporates patent data to track novelty, while other resources link structures to vendor catalogs for commercial sourcing. Scale continues to expand rapidly, with the largest public repositories adding millions of compounds annually through ongoing depositions from diverse contributors.
A key challenge in maintaining these databases is duplicate detection, addressed through canonicalization algorithms that generate a unique string representation, such as canonical SMILES, for each structure regardless of input format or depiction order. These algorithms normalize molecular graphs by selecting a standard traversal path and atom ordering, enabling efficient comparison and merging of redundant entries across large-scale integrations. Without robust canonicalization, compound counts become inflated and retrieval errors follow, which is why it is a core step in curation pipelines.
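
The idea behind duplicate detection can be illustrated with a deliberately naive sketch. The function name `canonical_key` and the graph encoding (element symbols plus index-based bonds, hydrogens left implicit) are assumptions made for this example and are not taken from any particular database. It tries every atom ordering and keeps the smallest description, which is only feasible for tiny molecules; production systems rely on much faster rank-refinement schemes.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Return an ordering-independent key for a small molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: iterable of (i, j, order) tuples indexing into `atoms`
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index assigned to that atom.
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for i, j, order in bonds
        )
        candidate = (tuple(labels), tuple(edges))
        # Keep the lexicographically smallest description as the canonical one.
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol entered with two different atom orderings, plus its isomer dimethyl ether.
ethanol_a = canonical_key(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = canonical_key(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
dimethyl_ether = canonical_key(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(ethanol_a == ethanol_b)       # same molecule, same key
print(ethanol_a == dimethyl_ether)  # different connectivity, different key
```

Because the key depends only on the graph and not on the input order, it can serve as a dictionary key or a unique index column when merging depositions from several sources.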

Property and Spectral Databases

Property and spectral databases focus on compiling experimental and computed data for the physical, chemical, and spectral characteristics of chemical compounds, enabling researchers to access quantitative information beyond structural representations. These databases typically include thermophysical properties such as boiling points, melting points, and solubility, as well as safety-related data like toxicity profiles and flammability ratings. Spectral data encompasses infrared (IR), ultraviolet-visible (UV-Vis), nuclear magnetic resonance (NMR), and mass spectrometry records, which are crucial for compound identification and analysis. A seminal example is the Dortmund Data Bank (DDB), initiated in 1973 at the University of Dortmund to store vapor-liquid equilibrium and other thermophysical data from literature sources, now encompassing over 100,000 pure components and mixtures with associated properties.
Curation in these databases involves rigorously linking property values to chemical structures using standardized identifiers like SMILES or InChI. Quality control measures include documenting uncertainty ranges, experimental conditions (e.g., temperature, pressure, or solvent), and source references to mitigate errors from heterogeneous data origins. For instance, the NIST Chemistry WebBook, launched in 1996, provides critically evaluated thermochemical, thermophysical, and spectroscopic data for over 7,000 organic and inorganic compounds, distinguishing between experimental measurements and computational estimates. Reaxys, an expert-curated resource combining Beilstein, Gmelin, and patent literature, offers property data for millions of substances, with values tied to original experimental reports and units standardized for consistency.
Unique to these databases is the emphasis on quantitative precision: properties are stored with explicit units (e.g., °C for boiling point, mg/L for solubility) and contextual metadata to support predictive modeling and validation.
Since 2010, there has been significant growth in incorporating quantum-derived properties via density functional theory (DFT) calculations, addressing gaps in experimental data for novel or unstable compounds; for example, the Materials Project's MPcules extension (2023) integrates DFT-computed molecular properties like energies and geometries for over 170,000 molecules. By 2025, some databases have incorporated AI-predicted properties alongside experimental ones, using models trained on large datasets to estimate attributes such as bioactivity for understudied molecules.
Despite these advances, challenges persist, including data sparsity for rare or proprietary compounds, which limits comprehensive coverage and model training. Standardization of property ontologies remains an ongoing issue, as varying units and protocols across sources can introduce inconsistencies, necessitating harmonized frameworks for interoperability. Efforts like those in the BIGCHEM project highlight the need for scalable curation that handles large data volumes while preserving accuracy in sparse regimes.

Reaction and Synthesis Databases

Reaction and synthesis databases specialize in storing and retrieving information on chemical transformations, encompassing reactants, products, reaction conditions, yields, and catalysts. These databases enable chemists to explore synthetic pathways by providing structured reaction schemas that map atomic changes and conditions. Prominent examples include Reaxys, which integrates data from Beilstein, Gmelin, and patent sources to offer millions of experimentally validated reactions with associated yields and stereochemical outcomes; SciFinder, powered by the CAS Reactions database containing over 150 million reactions and synthetic preparations; and extracts from USPTO patents, which provide reaction data from chemical inventions, often including novel catalysts and conditions.
The curation of these databases involves extracting reaction data from scientific literature and patents using natural language processing (NLP) techniques to identify and parse reaction descriptions. For instance, large language models have been applied to extract high-quality reaction data from patent documents, automating the identification of reactants, products, and conditions that would otherwise require manual annotation. Standardization follows extraction, focusing on reaction centers (the atoms directly involved in bond changes) and atom mapping, which assigns consistent identifiers to atoms across reactants and products to track transformations accurately. This process ensures interoperability and enables precise querying, as seen in protocols that curate structures, transformations, and conditions in four steps for database integration.
Unique features of these databases include tools for retrosynthesis planning, where algorithms predict precursor molecules by working backward from the target product, and multi-step route optimization, which evaluates sequences of reactions for efficiency and feasibility.
Integration with quantum chemical calculations enhances prediction reliability by filling gaps in experimental datasets and assessing reaction energetics. The origins of such databases trace to the 1980s with CASREACT, which indexes reactions from journals (1840 onward, comprehensive after 1975) and patents (from 1982). By 2025, advancements feature AI-driven prediction, exemplified by IBM RXN for Chemistry (launched in 2018), which uses transformer models for synthesis planning and has evolved to incorporate generative models for broader reaction mapping.
Challenges in these databases include handling incomplete data from patents, where reaction details such as exact yields may be omitted or ambiguously described, leading to noise in training datasets for predictive models. Scalability issues arise with combinatorial libraries, which generate vast numbers of potential reactions, straining storage and query performance without advanced indexing. These hurdles underscore the need for robust curation to improve data completeness and efficiency.

Biological and Literature Databases

Biological and literature databases in the context of chemical informatics integrate molecular structures with experimental bioactivity data, biological targets, and annotations from scientific publications, facilitating drug discovery research. These resources typically include quantitative measures such as IC50 values for inhibitory concentrations and binding affinities like Ki or Kd, which quantify interactions between small molecules and biomolecules. Targets are often proteins, enzymes, or signaling pathways, with data linked to genomic identifiers for contextualization. Literature citations provide traceability to original studies, enabling validation and further exploration.
Prominent examples include ChEMBL, a manually curated open-access database that aggregates bioactive molecules with drug-like properties, encompassing chemical, bioactivity, and genomic data extracted primarily from the literature. As of the ChEMBL 36 release in 2025, it contains 24,267,312 bioactivity measurements across 2,878,135 distinct compounds and 17,803 targets, including updates from screening campaigns and patent sources. BindingDB complements this by focusing on measured binding affinities, reporting 3.2 million data points for 1.4 million compounds against 11,400 targets as of late 2025, with emphasis on protein-ligand interactions from journals and patents. PubChem, linked to PubMed for literature access, extends coverage to broader chemical abstracts and bioassays, holding 295 million bioactivities for 119 million compounds in its 2025 update and integrating data from diverse sources like NIH screenings.
Data curation in these databases involves manual and semi-automated annotation of results from high-throughput screening (HTS) experiments, where large compound libraries are tested against biological targets to identify hits. Standardization employs ontologies such as ChEBI (Chemical Entities of Biological Interest) to ensure consistent entity representation, linking chemical structures to biological roles and avoiding nomenclature ambiguities.
For instance, ChEMBL aligns targets with UniProt identifiers and uses ChEBI for compound ontology, enhancing interoperability across resources. This process draws from peer-reviewed journals, patents, and public depositories, with quality controls to filter unreliable assays.
Unique features distinguish these databases, such as structure-activity relationship (SAR) tables in ChEMBL, which organize bioactivity data by molecular series to reveal trends in potency and selectivity. Cross-referencing with genomic data, including pathway mappings via resources such as Reactome, supports systems-level analyses. Open-access models, exemplified by ChEMBL's adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) principles, promote data sharing and reuse in academia and industry.
The growth of these databases accelerated following the Human Genome Project's completion in 2003, which provided a reference sequence enabling target validation and spurred integration of chemical and genomic datasets. Notably, ChEMBL incorporated extensive COVID-19-related datasets between 2020 and 2025, including 37,209 activities from SARS-CoV-2 screening assays and 9,646 from IMI-CARE antiviral studies, aiding rapid therapeutic development.
Despite advancements, challenges persist, including a bias toward drug-like molecules due to curation priorities in the literature, which underrepresents non-drug scaffolds and limits applicability to broader chemical spaces. Additionally, confidentiality concerns arise with proprietary data, where selective public release can obscure full datasets and complicate comprehensive analyses.

Data Representation

Structure Encoding Formats

Chemical structures in databases are digitally represented using standardized encoding formats that capture atomic connectivity, stereochemistry, and optionally spatial coordinates to ensure accurate storage, retrieval, and exchange of molecular information. These formats enable interoperability across software tools and databases by providing compact, machine-readable descriptions of molecules. Common formats include line notations for connectivity and file-based representations for geometric data, each balancing simplicity, uniqueness, and completeness in different ways.
One widely adopted format is the Simplified Molecular Input Line Entry System (SMILES), a string-based notation that encodes molecular structures using ASCII characters to represent atoms and bonds. For example, acetic acid is denoted as CC(=O)O, where 'C' represents carbon atoms, '=' a double bond, and parentheses branches. SMILES was invented by David Weininger in the 1980s as a lightweight method for chemical information processing, allowing linear descriptions of complex topologies without requiring graphical input. While versatile for small molecules, standard SMILES can generate multiple strings for the same structure due to different traversal paths, necessitating canonicalization to produce a unique representation for duplicate avoidance in databases. Canonical SMILES algorithms rank the atoms using graph invariants and then traverse the structure in a fixed order, so that every valid input for a molecule yields the same string within a given toolkit.
The International Chemical Identifier (InChI), developed by the International Union of Pure and Applied Chemistry (IUPAC) starting in 2000 in collaboration with the National Institute of Standards and Technology (NIST), addresses limitations in earlier notations by providing a layered, hierarchical string that ensures uniqueness and completeness. InChI separates information into layers for chemical formula, atom connectivity, isotopes, stereochemistry, and other features, prefixed with "InChI=" and optionally including a fixed-hydrogen "/f" layer for tautomers.
This design makes InChI lossless for most organic structures, capturing structural details without ambiguity, and it has been extended in recent updates to handle polymers and organometallics. The related InChIKey is a hashed, fixed-length 27-character identifier derived from the full InChI, facilitating efficient database indexing.
For representations including spatial information, the MDL Molfile (MOL) format stores a single molecule's 2D or 3D coordinates in a text-based connection table, specifying atom types, bond orders, and positions via fixed-width columns. Developed by MDL Information Systems (now part of BIOVIA), MOL files include sections for atom counts, coordinates, and bonds, enabling visualization and geometric analysis. The Structure-Data File (SDF) extends this by concatenating multiple MOL records, separated by "$$$$" delimiters, to store batches of structures with optional property data fields, making it well suited to large database exchanges. For biomacromolecules like proteins, the Protein Data Bank (PDB) format is standard, encoding atomic coordinates from experimental determinations such as X-ray crystallography, with records for chains, residues, and conformational details to represent folded structures. Handling conformations in these formats involves specifying Cartesian coordinates, but databases often store multiple conformers or use energy-minimized models to account for flexibility.
At a fundamental level, chemical structures can be modeled as undirected graphs, where atoms are vertices and bonds are edges, facilitating computational analysis through matrix representations. The adjacency matrix A of a molecular graph is a square matrix where each entry A_{ij} is 1 if atoms i and j are connected by a bond, and 0 otherwise (with the diagonal typically zero for simple graphs). This binary matrix records which atoms are bonded, though not bond orders, and serves as a basis for deriving molecular descriptors, such as topological indices.
For benzene (C₆H₆), a cyclic structure with alternating double bonds, the adjacency matrix for the six carbon atoms (ignoring hydrogens for the core graph) is:
A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}
This reflects the ring topology, where each carbon connects to two neighbors. Graph-based encodings like this are particularly useful in database algorithms for substructure searching, though they require extensions for stereochemistry and charges.
Despite their advantages, structure encoding formats face challenges in balancing completeness and practicality. Lossy encodings, such as basic SMILES without stereo specification, may omit conformational or isotopic details, leading to incomplete representations, while lossless formats like InChI preserve all information but can produce longer strings that are harder to parse manually. Variations in software support further complicate interoperability; for instance, different toolkits may interpret ambiguous SMILES differently, requiring validation against standards to prevent errors in database registration. Ongoing efforts, including IUPAC updates, aim to standardize handling of complex cases like polymers to mitigate these issues.
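
As a small illustration of the graph view, the six-carbon ring matrix given in this section can be rebuilt programmatically. This is a minimal sketch; the helper name `ring_adjacency` is hypothetical and the function only handles simple unbranched rings with hydrogens omitted.

```python
def ring_adjacency(n):
    """Build the adjacency matrix of an n-membered ring (heavy atoms only)."""
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n  # each atom is bonded to the next one, wrapping around
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected graph, so the matrix is symmetric
    return matrix


benzene = ring_adjacency(6)
degrees = [sum(row) for row in benzene]  # row sums give each atom's degree

for row in benzene:
    print(row)
print(degrees)
```

Each row sum equals 2, matching the statement that every ring carbon connects to two neighbors. Bond orders are not stored in this binary matrix, so aromaticity would need a separate annotation.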

Molecular Descriptors and Identifiers

Molecular descriptors are numerical or categorical features derived from a molecule's structure, enabling efficient indexing, searching, and analysis within chemical databases. These descriptors transform complex structural information into quantifiable attributes that facilitate quantitative structure-activity relationship (QSAR) modeling and database operations.
Topological descriptors capture the connectivity of atoms in a molecule's graph representation, ignoring spatial arrangements. A prominent example is the Wiener index, which measures molecular branching and size by summing the shortest path distances between all pairs of atoms. The Wiener index W is calculated as W = \frac{1}{2} \sum_{i \neq j} d_{ij}, where d_{ij} is the shortest path distance between atoms i and j in the hydrogen-suppressed molecular graph. For the linear alkane n-pentane (C5H12), the Wiener index is 20, while for its branched isomer 2,2-dimethylpropane it decreases to 16 due to increased compactness.
Geometrical descriptors account for the three-dimensional arrangement of atoms, providing insights into molecular shape and volume. The van der Waals volume (V_{vdw}) quantifies the space occupied by a molecule within its van der Waals surface, approximating its steric bulk in intermolecular interactions. Electronic descriptors, derived from quantum mechanical calculations, describe charge distribution and reactivity; for instance, the HOMO-LUMO gap is the energy difference between the highest occupied and lowest unoccupied molecular orbitals.
Identifiers serve as unique labels or compact representations for molecules in databases, supporting rapid retrieval and deduplication. CAS Registry Numbers have provided unique identifiers for chemical substances since 1965, assigning a sequential numeric code to each distinct compound regardless of nomenclature variations.
The InChIKey, a 27-character hashed version of the IUPAC International Chemical Identifier (InChI), enables quick database lookups by generating a fixed-length string from the full InChI using SHA-256 hashing, optimized for web-based searches.
Molecular fingerprints act as binary identifiers, encoding substructural features into bit vectors for substructure detection. Extended Connectivity Fingerprints (ECFP) generate circular topological bit vectors that iteratively expand atom neighborhoods, capturing extended connectivity up to a specified radius (e.g., ECFP4 for radius 2), which makes them well suited to identifying substructures in large databases. Daylight fingerprints pioneered path-based topological substructure encoding, forming the basis for many modern fingerprint methods.
Basic descriptors like molecular weight (MW) and logP are computed from atomic properties to assess size and hydrophobicity. MW is the sum of atomic masses: MW = \sum m_a, where m_a is the mass of each atom a, providing a fundamental measure of molecular scale. logP, which estimates lipophilicity, is commonly calculated with atomic contribution methods that sum hydrophobicity increments for each atom type plus correction factors for bonds or groups. The RDKit cheminformatics toolkit, developed from 2000 to 2006 at Rational Discovery and open-sourced in 2006, provides widely used implementations of these descriptors, including topological indices and fingerprints. In 2025, AI-enhanced descriptors leverage models like MolAI to generate predictive features from raw structures, improving property prediction beyond traditional calculations.
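
The Wiener index values quoted in this section can be checked with a short breadth-first search over the hydrogen-suppressed graph. This is a minimal sketch: the function name `wiener_index` and the adjacency-dictionary encoding are assumptions for illustration, not a reference to any specific toolkit's API.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered atom pairs.

    adjacency: dict mapping each atom index to a list of bonded atom indices
    (hydrogen-suppressed, connected graph assumed).
    """
    total = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    return total // 2  # every pair was counted once from each end


# Carbon skeletons only: a five-atom chain and a central atom with four neighbours.
n_pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
neopentane = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

print(wiener_index(n_pentane))   # 20
print(wiener_index(neopentane))  # 16
```

The halving at the end corresponds to the factor of 1/2 in the formula, since the double sum over i ≠ j visits each pair twice.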

Database Operations

Search and Query Techniques

Search and query techniques in chemical databases enable the retrieval of specific compounds or patterns from vast collections of molecular data. Exact matching, often performed using unique identifiers such as the CAS Registry Number, allows for precise lookups of individual substances in databases like CAS REGISTRY, which contains over 290 million curated chemical entries. This method ensures unambiguous identification, as demonstrated by tools like the NIST Chemistry WebBook, where entering a CAS number retrieves exact structural and property data for the corresponding compound.
Substructure search extends this capability by identifying molecules that contain a specified fragment embedded within their structure, a capability essential for exploring chemical families or analogs. Pioneered in the 1970s and 1980s through systems like MACCS (Molecular ACCess System), which was evaluated for performance alongside other early implementations such as DARC and S4, substructure searching revolutionized database querying by enabling pattern-based retrieval rather than full-structure matches. The core algorithm for exact substructure matching is a variant of the Ullmann algorithm, introduced in 1976, which uses backtracking and a compatibility matrix to map query atoms to target atoms, pruning the search space by discarding mappings whose neighbors cannot also be matched. For instance, a query for a benzene ring fragment retrieves the aromatic compounds containing that motif, which supports synthetic planning.
Queries often incorporate flexible elements, such as variable bond orders, to account for unspecified bond types (e.g., single, double, or aromatic) in the target molecule, broadening the search without requiring exact bond specification. This is achieved by defining bond variables in query languages like SMARTS, where a generic bond symbol matches any type, facilitating searches for motifs like reactive groups across diverse structures.
Implementations such as Oracle's Chemical Data Cartridge integrated these techniques into relational databases, allowing SQL-based substructure queries on enterprise-scale chemical repositories. To handle the computational demands of large datasets, indexing strategies like inverted files accelerate searches by precomputing mappings from structural fragments to lists of compounds, enabling rapid filtering before full atom-by-atom matching. In chemical contexts, bitmap-based inverted indexes on molecular fingerprints speed up substructure queries by quickly eliminating non-matches based on bit patterns. Query optimization further enhances efficiency in distributed environments, where techniques such as data fragmentation partition the database across nodes to parallelize searches and reduce latency for billion-scale collections.
Despite these advances, substructure search remains computationally hard in the worst case because it is an instance of the subgraph isomorphism problem, which asks whether a query graph can be embedded in a larger target graph and is NP-complete in general; practical systems therefore rely on screening and heuristics. To address imperfect matches, fuzzy searching techniques allow retrieval of near-matches by tolerating minor variations, such as in bond orders or atom substitutions. Similarity methods extend this further by ranking compounds by degree of resemblance rather than requiring an exact pattern match.
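
The screen-then-verify pattern described in this section can be sketched as follows. This is a simplified backtracking matcher in the spirit of Ullmann-style search, not the published algorithm itself: it omits the compatibility-matrix refinement, ignores bond orders, and the names `has_substructure` and `_adjacency` are hypothetical.

```python
from collections import Counter


def _adjacency(atom_count, bonds):
    """Turn a collection of (i, j) index pairs into per-atom neighbour sets."""
    neighbours = [set() for _ in range(atom_count)]
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    return neighbours


def has_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Return True if the query graph can be embedded in the target graph."""
    # Cheap screen: the target must contain enough atoms of each element.
    needed, available = Counter(q_atoms), Counter(t_atoms)
    if any(available[element] < count for element, count in needed.items()):
        return False

    q_adj = _adjacency(len(q_atoms), q_bonds)
    t_adj = _adjacency(len(t_atoms), t_bonds)

    def extend(mapping):
        # mapping[k] is the target atom assigned to query atom k.
        k = len(mapping)
        if k == len(q_atoms):
            return True
        for candidate in range(len(t_atoms)):
            if candidate in mapping or t_atoms[candidate] != q_atoms[k]:
                continue
            if len(t_adj[candidate]) < len(q_adj[k]):
                continue  # too few neighbours to host this query atom
            # Every already-mapped query neighbour must be bonded in the target.
            if all(mapping[nb] in t_adj[candidate] for nb in q_adj[k] if nb < k):
                if extend(mapping + [candidate]):
                    return True
        return False

    return extend([])


ethanol = (["C", "C", "O"], {(0, 1), (1, 2)})
propane = (["C", "C", "C"], {(0, 1), (1, 2)})
cyclopropane_query = (["C", "C", "C"], {(0, 1), (1, 2), (0, 2)})

print(has_substructure(["C", "O"], {(0, 1)}, *ethanol))   # C-O fragment present
print(has_substructure(["O", "O"], {(0, 1)}, *ethanol))   # rejected by the screen
print(has_substructure(*cyclopropane_query, *propane))    # ring cannot fit a chain
```

In a real system the element-count screen would be replaced by a fingerprint comparison against an inverted index, so that the expensive backtracking step runs only on the few candidates that survive.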

Similarity and Matching Methods

Similarity and matching methods in chemical databases enable the identification of compounds with analogous structures or properties, facilitating tasks such as lead optimization and analog searching. These approaches typically rely on quantitative metrics to compare molecular representations, ranging from two-dimensional fingerprints to three-dimensional overlays, allowing researchers to quantify degrees of resemblance beyond exact matches. Fingerprint-based methods, in particular, dominate due to their computational efficiency and ability to handle large datasets.

One of the most prevalent techniques involves molecular fingerprints, which encode structural features into bit vectors, followed by similarity scoring using coefficients like the Tanimoto index. The Tanimoto coefficient measures the overlap between two bit sets $A$ and $B$ as $T_c = \frac{|A \cap B|}{|A \cup B|}$, with values ranging from 0 (no shared bits) to 1 (identical fingerprints). This metric gained widespread adoption in cheminformatics with the rise of structural databases and is often applied to extended-connectivity fingerprints (ECFPs), which capture substructural patterns up to a specified diameter. For instance, ECFP4 fingerprints, which encode atom environments up to a diameter of four bonds, are commonly paired with $T_c$ thresholds of around 0.85 to define "similar" compounds. An alternative, the Dice coefficient, addresses cases where bit densities vary; it is defined as $D_c = \frac{2|A \cap B|}{|A| + |B|}$ and performs comparably to Tanimoto for sparse fingerprints in chemical datasets. Molecular descriptors, such as topological indices, can also serve as inputs to similarity computations.

For more nuanced structural analogies, graph edit distance (GED) quantifies the minimum number of operations (e.g., node insertions, deletions, or substitutions) needed to transform one molecular graph into another, capturing edits like bond changes or atom replacements.
GED is particularly useful in ligand-based virtual screening, where it identifies bioisosteric replacements by modeling molecular graphs with attributed nodes and edges, though its NP-hard nature limits scalability without approximations. In practice, GED variants with learned edit costs have shown efficacy in predicting bioactivity similarities across diverse scaffolds.

Three-dimensional similarity extends two-dimensional methods by aligning conformations based on shape and feature overlays, which is crucial for bioactivity prediction. The ROCS (Rapid Overlay of Chemical Structures) software exemplifies this, using Gaussian functions to compute volumetric overlap scores between query and database molecules, often incorporating pharmacophoric "color" terms for features such as hydrogen bonding. ROCS enables rapid screening of millions of compounds, with shape Tanimoto scores emphasizing steric fit over exact atom mapping. Pharmacophore matching complements this by focusing on abstract feature patterns, such as donor-acceptor distances, that are essential for bioactivity, allowing database searches for compounds sharing key interaction motifs without full structural identity.

Recent advancements leverage machine learning, particularly graph neural networks (GNNs), to generate embeddings that capture both local and global molecular features for similarity assessment. As of 2025, GNN models such as Kolmogorov–Arnold variants produce low-dimensional representations from molecular graphs, so that similarity can be scored with cosine or other vector metrics; such embeddings have been reported to outperform traditional fingerprints in property prediction tasks. These embeddings facilitate scalable comparisons in large databases and can integrate quantum-informed features for enhanced accuracy.

In drug discovery, these methods power virtual screening by ranking database compounds against known actives, often enriching hits by 10- to 100-fold over random selection using Tanimoto or shape-based filters.
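Ranking a library against a query by embedding similarity can be sketched in a few lines. The three-dimensional vectors and compound names below are made-up placeholders standing in for learned embeddings; only the cosine scoring and the sort are the point of the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query: Sequence[float],
                       library: Dict[str, Sequence[float]]) -> List[str]:
    """Return library IDs ordered from most to least similar to the query."""
    return sorted(library, key=lambda mol_id: cosine(query, library[mol_id]),
                  reverse=True)


# Hypothetical embeddings (real ones would come from a trained model).
query = [1.0, 0.0, 1.0]
library = {
    "mol_c": [0.0, 1.0, 0.0],
    "mol_a": [2.0, 0.0, 2.0],
    "mol_b": [1.0, 1.0, 0.0],
}
ranked = rank_by_similarity(query, library)
print(ranked)
```

The same ranking loop applies to fingerprint-based scores; only the scoring function changes.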
For diversity analysis, clustering algorithms apply Tanimoto distances ($1 - T_c$) to partition libraries into medoid-centered groups, ensuring representative sampling while minimizing redundancy, as demonstrated on million-compound sets.
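The coefficients and the clustering step discussed in this section can be sketched together. Fingerprints are modeled as Python sets of "on" bit positions, and the clustering shown is a simple greedy leader scheme chosen for brevity, not a specific published method; the bit sets, IDs, and the 0.5 distance cutoff are illustrative assumptions.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """|A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """2|A ∩ B| / (|A| + |B|); defined here as 0.0 when both sets are empty."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    return 1.0 - tanimoto(a, b)


def leader_cluster(fps: Dict[str, FrozenSet[int]], cutoff: float) -> List[List[str]]:
    """Assign each compound to the first leader within `cutoff`, else start a new cluster."""
    clusters: List[List[str]] = []  # the first member of each list is its leader
    for mol_id, fp in fps.items():
        for members in clusters:
            if tanimoto_distance(fp, fps[members[0]]) <= cutoff:
                members.append(mol_id)
                break
        else:
            clusters.append([mol_id])
    return clusters


def medoid(members: List[str], fps: Dict[str, FrozenSet[int]]) -> str:
    """Member with the smallest summed distance to all members of its cluster."""
    return min(members,
               key=lambda m: sum(tanimoto_distance(fps[m], fps[o]) for o in members))


A = frozenset({1, 2, 3, 4})
B = frozenset({3, 4, 5, 6})
print(tanimoto(A, B), dice(A, B))

fps = {
    "a": frozenset({1, 2, 3, 4}),
    "b": frozenset({1, 2, 3}),
    "c": frozenset({1, 2, 3, 5}),
    "x": frozenset({10, 11}),
    "y": frozenset({10, 11, 12}),
}
clusters = leader_cluster(fps, cutoff=0.5)
medoids = [medoid(members, fps) for members in clusters]
print(clusters, medoids)
```

For any pair of fingerprints the two coefficients are related by $D_c = 2T_c / (1 + T_c)$, so they always produce the same ranking even though the numeric values differ.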

Registration and

Registration processes in chemical databases ensure the accurate ingestion, validation, and maintenance of chemical structures and associated information, preventing redundancy and preserving data integrity. These procedures are essential for handling the vast and diverse nature of chemical data, from molecular structures to experimental results, in both public repositories and proprietary systems.

The foundational systems for chemical registration emerged in the 1960s with the development of the Chemical Abstracts Service (CAS) Registry System, which began assigning unique identifiers to chemical substances to catalog them and avoid duplicates. Early computerization efforts at CAS facilitated the electronic indexing and registration of chemical entities, laying the groundwork for modern database management.

Key processes in registration include structure normalization to standardize representations, particularly for tautomers and salts, which can be drawn in multiple forms but represent the same compound. For instance, normalization algorithms adjust protonation states and tautomeric forms to generate a single preferred representation, as implemented in systems like PubChem's standardization pipeline. Duplicate resolution relies on canonical identifiers, such as canonical SMILES or InChI, which provide a single string representation for each distinct structure, enabling efficient detection and merging of identical entries across databases. Metadata addition accompanies these steps, capturing details like the data source, registration date, and contributor information to maintain traceability and context.

In pharmaceutical workflows, registration systems integrate with electronic lab notebooks (ELNs) to streamline compound submission from experiments, automating validation and assignment of internal identifiers while enforcing business rules, for example on salt forms.
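Duplicate resolution with a canonical key, together with basic metadata capture, can be sketched as below. The `Registry` class, the `REG-` identifier format, and the source labels are hypothetical; the sketch assumes a canonical identifier (here an InChIKey) has already been computed by a toolkit, since canonicalization itself is outside its scope.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Record:
    reg_id: str
    canonical_key: str
    registered_on: date
    sources: List[str] = field(default_factory=list)


class Registry:
    """Toy registration system keyed on a precomputed canonical identifier."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def register(self, canonical_key: str, source: str,
                 today: date) -> Tuple[Record, bool]:
        """Return (record, is_new). Duplicates are merged into the existing record."""
        existing = self.records.get(canonical_key)
        if existing is not None:
            if source not in existing.sources:
                existing.sources.append(source)  # keep provenance of every submission
            return existing, False
        record = Record(
            reg_id=f"REG-{len(self.records) + 1:06d}",  # hypothetical ID format
            canonical_key=canonical_key,
            registered_on=today,
            sources=[source],
        )
        self.records[canonical_key] = record
        return record, True


# "CCO" and "OCC" are two SMILES strings for ethanol; a canonicalizer would map
# both to the same key, shown here as ethanol's standard InChIKey.
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

registry = Registry()
rec1, first_new = registry.register(ETHANOL_KEY, "ELN", date(2025, 1, 15))
rec2, second_new = registry.register(ETHANOL_KEY, "vendor catalog", date(2025, 2, 1))
print(rec1.reg_id, first_new, second_new, rec1.sources)
```

Keying on the canonical identifier rather than on the submitted string is what lets differently drawn inputs collapse to one entry.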
Versioning mechanisms track updates to registered entries, preserving historical records of modifications such as property revisions or structural corrections, as seen in PubChem's approach to maintaining multiple substance versions. Modern standards, including the IUPAC Blue Book's 2013 recommendations for preferred names, guide naming conventions during registration to ensure consistency in database entries.

Challenges in these processes include managing proprietary data, where much information remains locked in private databases, limiting access and increasing curation burdens. Automated registration can also introduce errors when the same structure is represented inconsistently across sources.

Audit trails form a critical component, logging all registration actions for reproducibility and compliance with standards like Good Laboratory Practice (GLP), which mandates verifiable records in nonclinical studies. These trails support regulatory audits by providing immutable histories of data changes, ensuring accountability in regulated environments such as pharmaceutical testing facilities.
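One common way to make an audit trail tamper-evident is to chain entries by hash, so that altering any past entry invalidates everything after it. The sketch below is one possible technique, not a requirement of GLP or the method of any particular registration system; the action names and payload fields are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(action: str, payload: Dict[str, str], prev: str) -> str:
    body = {"action": action, "payload": payload, "prev": prev}
    # sort_keys makes the serialization, and therefore the hash, deterministic.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: List[dict], action: str, payload: Dict[str, str]) -> None:
    """Append an entry whose hash covers its content and the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"action": action, "payload": payload, "prev": prev,
                "hash": _digest(action, payload, prev)})


def verify(log: List[dict]) -> bool:
    """Recompute every hash and check that the chain links are unbroken."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if _digest(entry["action"], entry["payload"], entry["prev"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log: List[dict] = []
append_entry(log, "register", {"reg_id": "REG-000001", "user": "alice"})
append_entry(log, "update_property", {"reg_id": "REG-000001", "field": "melting_point"})

# Simulate an after-the-fact edit to the first entry.
tampered = [dict(entry) for entry in log]
tampered[0]["payload"] = {"reg_id": "REG-000001", "user": "mallory"}

print(verify(log), verify(tampered))
```

A real deployment would also need access control and durable storage; the hash chain only makes silent modification detectable.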

Technologies and Implementations

Chemical Toolkits and Database Cartridges

Chemical toolkits are embeddable software libraries designed to handle chemical structures and enable cheminformatics operations within larger applications or databases. These libraries provide core functionality such as molecule parsing, manipulation, and property computation, facilitating the integration of chemical functionality into custom systems.

Prominent open-source toolkits include RDKit, a cheminformatics library initially developed at Rational Discovery and released as open source in 2006, with a C++ core and Python bindings for structure handling, including substructure searching and molecular descriptor calculation. RDKit offers APIs for generating fingerprints and descriptors like molecular weight and logP, as well as tools for substructure matching and fingerprint-based screening. Another key open-source option is the Chemistry Development Kit (CDK), a Java library originating in the early 2000s that supports 2D and 3D rendering of chemical structures, input/output routines for formats like SMILES, and substructure searching. The CDK targets tasks in molecular informatics and has been foundational for numerous research projects.

Commercial toolkits, such as ChemAxon's JChem suite, deliver robust structure representation and processing capabilities, including canonicalization, tautomer handling, and integration with database systems for chemical searches. These tools prioritize enterprise-scale performance for descriptor computation and structure standardization.

Database cartridges are specialized extensions or plugins that augment relational databases with chemical-specific query capabilities, allowing native SQL-based operations on molecular data. The Oracle Chemical Cartridge, developed in the 2000s, integrates chemical handling into Oracle databases, enabling SQL queries for substructure and similarity searches directly on stored structures.
For PostgreSQL, the RDKit cartridge provides an extension for storing molecules in a dedicated mol column type, indexing them for rapid retrieval, and executing substructure searches using operators like @>. It also supports descriptor computations within queries, such as calculating topological polar surface area on the fly. Early precursors to these modern toolkits include the Daylight Toolkit from the 1990s, a C-based library that pioneered chemical information processing, including SMILES parsing and substructure pattern searching, influencing subsequent developments in the field.

Key features across these toolkits and cartridges include efficient indexing for substructure searches, often using inverted indexes or fingerprint-based methods, and APIs for on-demand descriptor computation. In recent years, toolkits like RDKit have been integrated with machine learning frameworks, enabling chemical features to be incorporated into predictive models for properties like bioactivity as of 2025.

A practical example involves using the RDKit cartridge in PostgreSQL to build custom structure indexes: developers can create a table with a mol column for molecule storage, populate it with the mol_from_smiles function, build a GiST index on that column for substructure queries, and store Morgan fingerprints (for example, a column conventionally named mfp2 generated with morganbv_fp) with their own index for accelerated similarity queries. This approach allows scalable handling of large chemical datasets without external processing.

Web-Based and Integrated Systems

Web-based chemical databases have proliferated since the early 2000s, driven by open data initiatives that facilitate public access to vast repositories of molecular information. PubChem, launched by the National Institutes of Health (NIH) in September 2004, exemplifies this trend as a comprehensive online resource providing web interfaces and programmatic access to 119 million compounds (as of 2025), including biological activities and patents. Its API, initially developed as part of the Power User Gateway (PUG) around the same period, enables programmatic querying of chemical structures and properties. ChemSpider, introduced in March 2007 and acquired by the Royal Society of Chemistry in 2009, operates as a crowd-sourced platform aggregating data from hundreds of sources, allowing users to contribute and validate over 130 million unique structures through community deposition and annotation.

Federated systems enhance accessibility by integrating multiple databases without centralized data storage. The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) hosts services like ChEMBL and UniChem, which provide web-based access to curated bioactivity data and cross-references across 41 chemical databases, respectively, supporting unified queries via federated architectures. ChEMBL has offered web interfaces for searching drug-like molecules and their targets since its inception in 2009.

Key features of these systems include RESTful APIs for efficient querying, such as PubChem's PUG REST, which supports structure searches, property retrieval, and batch operations in formats like JSON and XML. ChemSpider's API similarly allows text and substructure searches, returning results in formats such as SMILES for seamless data export. Visualization tools are integral, with PubChem providing 2D depictions via its Sketcher and interactive 3D rendering through the PubChem3D Viewer, enabling conformational analysis and overlay of molecular models.
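The shape of a PUG REST property request can be illustrated by building the URL without sending it. The helper name `property_url` is an assumption for this sketch; the path layout (input domain and namespace, then the operation, then the output format) follows PubChem's documented pattern, and the property names shown are standard ones.

```python
from typing import Iterable
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name: str, properties: Iterable[str], fmt: str = "JSON") -> str:
    """Build a PUG REST URL that looks up a compound by name and returns properties."""
    # Percent-encode the name so spaces and slashes cannot break the path.
    encoded_name = quote(name, safe="")
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/{fmt}"


aspirin_url = property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
acid_url = property_url("acetic acid", ["InChIKey"], fmt="XML")
print(aspirin_url)
print(acid_url)
```

The resulting URL can be fetched with any HTTP client; the sketch stops short of the network call so that it runs offline.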
EMBL-EBI services integrate similar capabilities, exporting data in standard formats such as InChI to support downstream analysis.

Integration with other scientific tools extends the utility of these databases. For instance, some databases offer nodes for open-source workflow platforms, allowing users to embed database queries within pipelines for cheminformatics tasks like similarity searching. These systems also connect to electronic lab notebooks (ELNs) via APIs, facilitating direct import of experimental data into cloud-hosted environments like Amazon Web Services (AWS) for scalable processing and storage. Cloud deployments of this kind run chemical workloads on elastic compute resources, handling large-scale queries without local infrastructure.

The rise of these web-based systems after 2000 aligns with broader open data policies, such as those from the NIH and EMBL-EBI, which prioritize free access and reuse to accelerate research. By 2025, emerging trends incorporate federated learning techniques across databases, as seen in tools like kMoL, which enable privacy-preserving model training on distributed chemical datasets without sharing raw data, addressing silos in proprietary and public repositories.

Despite these advances, challenges persist, including API rate limits that restrict high-volume access: PubChem enforces a limit of no more than 5 requests per second to prevent server overload. Data licensing issues also complicate usage, as varying terms across sources (e.g., non-commercial restrictions in some public datasets) can limit commercial applications and require careful attribution to avoid legal conflicts.
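A client can stay under a requests-per-second limit by spacing its calls. The `Throttle` class below is a minimal sketch using fixed spacing (a token bucket would allow short bursts instead); the injectable `clock` and `sleep` parameters are assumptions added so the demonstration can run instantly with a fake clock.

```python
import time
from typing import Callable, List


class Throttle:
    """Space calls so that at most `max_per_second` happen per second."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the following slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Demonstration with a fake clock so no real time passes.
fake_now = [0.0]
naps: List[float] = []


def fake_sleep(seconds: float) -> None:
    naps.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(5, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # a real client would send one request after each wait()

print(naps, fake_now[0])
```

With the default arguments, `Throttle(5)` uses the real clock and sleeps for real, spacing requests 0.2 seconds apart.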

Software Tools and Standards

Software tools for chemical database management encompass standalone applications that enable structure drawing, data conversion, workflow automation, and data exchange with databases. MarvinSketch, developed by Chemaxon, is a widely used chemical drawing tool that supports structure creation, depiction, and direct export to database formats for querying in cheminformatics systems. Similarly, ChemDraw, first released in 1985 by what is now Revvity Signals Software, provides integrated support for drawing molecules and exporting them to chemical databases, facilitating publication-ready visualizations and data registration. For workflow orchestration, Pipeline Pilot from BIOVIA allows users to build visual pipelines for processing chemical data, including batch registration into databases and integration with molecular property calculations. Open-source options like Open Babel serve as a versatile toolbox for converting between 146 chemical file formats (as of 2025), enabling seamless data exchange and manipulation in local database environments.

These tools often incorporate features such as batch processing for efficient registration of multiple compounds into databases and automated validation against structural standards to ensure consistency. For instance, MarvinSketch and Pipeline Pilot support bulk operations for importing structures and checking compliance with encoding conventions, reducing manual errors in database curation.

Standards play a crucial role in ensuring compatibility and interoperability among chemical databases. The PubChem Compound Identifier (CID), maintained by the National Center for Biotechnology Information (NCBI), provides a unique, stable accession number for standardized chemical structures, facilitating cross-database referencing and searchability. The International Union of Pure and Applied Chemistry (IUPAC) endorses protocols for chemical data exchange, including the use of identifiers like InChI for unambiguous representation.
A seminal standard is the Chemical Markup Language (CML), an XML-based schema introduced in 1998, which structures chemical data such as molecules, reactions, and spectra for machine-readable exchange. Since 2016, the FAIR principles (findability, accessibility, interoperability, and reusability) have been adapted to chemistry, guiding the design of databases to support automated data sharing and reuse in computational workflows.

Recent advancements in semantic web technologies further enhance these standards. In 2025, updates to Resource Description Framework (RDF) implementations, such as those in PubChemRDF, enable linked-data representations of chemical entities, allowing for richer interconnections between databases and improved querying via SPARQL.

Despite these developments, challenges persist in achieving full interoperability across vendor-specific tools and databases, particularly with the influx of AI-generated chemical data that may not conform to established formats. Evolving standards are needed to accommodate AI-generated outputs, ensuring validation mechanisms can handle novel structures without compromising data quality.
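To give a sense of what a CML record looks like, the sketch below serializes a small molecule with Python's standard XML library. The helper `to_cml` and the atom-numbering scheme are assumptions for illustration; the element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `order`) follow common CML usage, and the output is not validated against the CML schema here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

CML_NS = "http://www.xml-cml.org/schema"


def to_cml(mol_id: str, atoms: List[str], bonds: List[Tuple[int, int, str]]) -> str:
    """Serialize heavy atoms and bonds as a minimal CML molecule element."""
    ET.register_namespace("", CML_NS)  # emit CML as the default namespace
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})

    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for number, element in enumerate(atoms, start=1):
        ET.SubElement(atom_array, f"{{{CML_NS}}}atom",
                      {"id": f"a{number}", "elementType": element})

    bond_array = ET.SubElement(molecule, f"{{{CML_NS}}}bondArray")
    for begin, end, order in bonds:
        # atomRefs2 lists the ids of the two atoms joined by the bond.
        ET.SubElement(bond_array, f"{{{CML_NS}}}bond",
                      {"atomRefs2": f"a{begin} a{end}", "order": order})

    return ET.tostring(molecule, encoding="unicode")


# Ethanol heavy-atom skeleton: C-C-O with single bonds (hydrogens omitted).
ethanol_cml = to_cml("ethanol", ["C", "C", "O"], [(1, 2, "1"), (2, 3, "1")])
print(ethanol_cml)

parsed = ET.fromstring(ethanol_cml)
atom_elements = parsed.findall(f"{{{CML_NS}}}atomArray/{{{CML_NS}}}atom")
bond_elements = parsed.findall(f"{{{CML_NS}}}bondArray/{{{CML_NS}}}bond")
```

Because the result is ordinary XML, it can be parsed back with any XML library, which is what makes such formats convenient for exchange between tools.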

References

  1. [1]
    4: Understanding Public Chemical Databases - Chemistry LibreTexts
    May 7, 2022 · (SOME) DATABASE BASIC. 1.1. What is a database? A database is an “organized collection of information.” The information in a database can be ...
  2. [2]
    A comprehensive review of database resources in chemistry - Redalyc
    Jul 1, 2020 · This paper provides an overview of the most frequently used free chemistry databases such as PubChem, Crystallography Open Database, PubMed, ZINC, ChemSpider, ...
  3. [3]
    Chemical Database - an overview | ScienceDirect Topics
    A chemical database is defined as a resource that provides emergency response and chemical handling information for various chemical substances, including data ...
  4. [4]
    Chemical Database - an overview | ScienceDirect Topics
    Chemical databases are defined as organized collections of extensive information on chemical structures, properties, biological activity, and spectra, ...
  5. [5]
    ChEMBL - EMBL-EBI
    ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data
  6. [6]
    Polymer Database(PoLyInfo) - DICE :: National Institute for Materials ...
    Polymer Database "PoLyInfo" systematically provides various data required for polymeric material design. The main data source is academic literature on polymers ...Missing: biomolecules | Show results with:biomolecules
  7. [7]
    (PDF) ChemDB: A Public Database of Small Molecules and Related ...
    They can be used as combinatorial building blocks for chemical synthesis, as molecular probes in chemical genomics and systems biology, and for the screening ...
  8. [8]
  9. [9]
    Cheminformatics in Drug Discovery and Material Design - MDPI
    Cheminformatics tools can predict the properties of a new molecule based on its chemical structure, reducing the need for time-consuming and expensive ...
  10. [10]
    What are the benefits of a centralized chemical inventory database?
    Apr 20, 2025 · In summary, a centralized chemical inventory database offers substantial benefits through improved visibility and control over chemical stocks.Improved Compliance And... · Enhanced Inventory... · Streamlined Data Management...Missing: distributed | Show results with:distributed<|separator|>
  11. [11]
    10 Most-used Cheminformatics Databases for the Biopharma ...
    Mar 21, 2025 · These databases play a crucial role in drug discovery, molecular modeling, and toxicity prediction. They help researchers identify potential ...
  12. [12]
    Closing the gap between centralized and decentralized compound ...
    The developed backend system and centralized data management facilitates the operation and integration of the stores into an existing store environment. MeSH ...
  13. [13]
    A comparison of approaches to accessing existing biological ... - NIH
    Jun 20, 2023 · Many existing biological and chemical databases are stored in the form of a relational database (RDB). Converting a relational database into ...
  14. [14]
    More CAS History - C&EN - American Chemical Society
    Dec 21, 2015 · ... punched cards to input the search parameters. But there were two companies, in the mid-1960s, who had developed generalized database ...
  15. [15]
    PubChem 2025 update - PubMed - NIH
    Jan 6, 2025 · With additions from over 130 new sources, PubChem contains >1000 data sources, 119 million compounds, 322 million substances and 295 million ...
  16. [16]
    200 years of Gmelin's handbook | Feature - Chemistry World
    May 17, 2017 · The card index was not a novel concept – it had been pioneered by the Swedish taxonomist Linneaus in the 1700s – but Gmelin made particularly ...
  17. [17]
    Beilstein's "Handbuch der Organischen Chemie" is Published in ...
    In 1881 Friedrich Konrad Beilstein's Offsite Link issued the first edition of his Handbuch der organischen Chemie from Hamburg, Germany, in 1881.Missing: Handbook precursor
  18. [18]
    Flashback: 1976 – computerised Chemical Registry System | Opinion
    Jan 10, 2016 · The CAS Registry was started in 1965. The Chemical Registry System was developed by the Chemical Abstracts Service (CAS) from work begun in ...
  19. [19]
    CAS History
    CAS was founded to share research, started as Chemical Abstracts in 1907, became CAS in 1909, and introduced the Chemical Registry System in 1956.
  20. [20]
    The Largest Curated Crystal Structure Database - CCDC
    Established in 1965 with historical structures dating back to the 1920s, the Cambridge Structural Database (CSD) is the world's largest curated repository ...WebCSD · CSD-CrossMiner · CSD Portfolio software suites...Missing: milestones | Show results with:milestones
  21. [21]
    Twenty Five Years of Progress in Cheminformatics - Wendy Warr
    Willett's team designed a screening system for searching large databases of chemical structures (Jakes, S.E.; Willett, P. Pharmacophoric pattern matching in ...
  22. [22]
    [PDF] JCAMP-CS: A Standard Exchange Format for Chemical Structure ...
    The JCAMP-DX format provides a standard for the exchange of data on IR spectra. Extensions of this format to other spectral data are being developed.
  23. [23]
    Ten Years of Service - PubChem - NIH
    Sep 16, 2014 · September 16, 2004 is a special day in the history of PubChem. It marks the beginning of PubChem as an on-line resource.
  24. [24]
    Understanding REACH - ECHA - European Union
    Therefore, the regulation has an impact on most companies across the EU. REACH places the burden of proof on companies. To comply with the regulation ...
  25. [25]
    An open source chemical structure curation pipeline using RDKit
    Sep 1, 2020 · The ChEMBL database is a freely available bioactivity database containing close to 2.5 million compound records on nearly 2 million unique ...
  26. [26]
    CAS COVID-19 Antiviral Candidate Compounds Dataset
    Mar 31, 2020 · CAS has released an open access dataset of chemical compounds with known or potential antiviral activity to support COVID-19 research and dataMissing: acceleration | Show results with:acceleration
  27. [27]
    Computational approaches streamlining drug discovery - Nature
    Apr 26, 2023 · Here we review recent advances in ligand discovery technologies, their potential for reshaping the whole process of drug discovery and development.
  28. [28]
    Using SMILES strings for the description of chemical connectivity in ...
    May 18, 2018 · Computer descriptions of chemical molecular connectivity are necessary for searching chemical databases and for predicting chemical ...
  29. [29]
    CAS Surveys Its First 100 Years - American Chemical Society
    Jun 11, 2007 · TO GRASP HOW LONG ago Chemical Abstracts began, consider all that had not yet happened in 1907. Roald Amundsen had not reached the South ...
  30. [30]
    CAS REGISTRY | CAS
    CAS REGISTRY is the database of disclosed chemical substances curated from scientific literature and other sources. It contains over 290 million substances.CAS STNext · CAS References · Contact · Portuguese
  31. [31]
    PubChem 2025 update | Nucleic Acids Research - Oxford Academic
    Nov 18, 2024 · Abstract. PubChem (https://pubchem.ncbi.nlm.nih.gov) is a large and highly-integrated public chemical database resource at NIH.
  32. [32]
    ChemSpider: Search and Share Chemistry - Homepage
    A free chemical structure database providing fast text and structure search access to over 130 million structures from hundreds of data sources.Simple search · Structure Search · Advanced Search · Data sources
  33. [33]
    Chemical Databases: Curation or Integration by User-Defined ... - NIH
    Mar 11, 2015 · This paper will outline some of the valuable resources available to drug discovery researchers, highlight some of the issues around curation and standardisation
  34. [34]
    PubChem
    PubChem is the world's largest collection of freely accessible chemical information. Search chemicals by name, molecular formula, structure, and other ...About · Water · Caffeine · Compounds
  35. [35]
    PubChem in 2021: new data content and improved web interfaces
    Nov 5, 2020 · ... of August 2020). This corresponds to an increase in substances, compounds and bioactivities by 19%, 14% and 14%, respectively, compared to ...
  36. [36]
    A new semi-automated workflow for chemical data retrieval and ...
    Dec 10, 2018 · At this stage duplicates in the list are detected and removed with a check on the first three layers of InChI generated from SMILES. Output.
  37. [37]
    A review of molecular representation in the age of machine learning
    Feb 18, 2022 · Despite potentially being computationally expensive, various algorithms have been developed to canonicalize SMILES strings including Universal ...
  38. [38]
    Current Challenges in Development of a Database of ... - Frontiers
    May 25, 2015 · For duplicate detection, one string should mean only one structure. Canonical SMILES, Isomeric SMILES, and Unique SMILES should be all ...
  39. [39]
    Dortmund Data Bank - DDBST GmbH
    Dortmund Data Bank,Thermophysical Properties. ... 1973 a computerized data bank for phase equilibrium data was started by J. Gmehling and U. Onken at the ...
  40. [40]
    The Dortmund Data Bank: A computerized system for retrieval ...
    The Dortmund Data Bank (DDB) was started in 1973 with the intention to employ the vast store of vapor-liquid equilibrium (VLE) data from the literature.
  41. [41]
    The NIST Chemistry WebBook: A Chemical Data Resource on the ...
    The site was established in 1996 and has grown to encompass a wide variety of thermochemical, ion energetics, solubility, and spectroscopic data.
  42. [42]
    Reaxys | An expert-curated chemistry database - Elsevier
    Reaxys is an innovative chemistry database that optimizes small molecule discovery. Discover, innovate and develop with confidence.Reaxys resources · Reaxys for drug discovery · Higher Education
  43. [43]
    What property information is included in Reaxys? - Elsevier Support
    Oct 16, 2025 · There are two options for finding the property information: Search for a chemical substance by structure. Reaxys will present all abstracted ...
  44. [44]
    A database of molecular properties integrated in the Materials Project
    Dec 4, 2023 · We present a FAIR expansion of the Materials Project database (“MPcules”) that adds more than 170 000 molecules studied using density functional theory (DFT) ...
  45. [45]
  46. [46]
    BIGCHEM: Challenges and Opportunities for Big Data Analysis in ...
    We briefly discuss some challenges and opportunities of this fast growing area of research with a focus on those to be addressed within the BIGCHEM project.
  47. [47]
    Identifying uncertainty in physical–chemical property estimation with ...
    May 30, 2024 · One challenge in this process is that there is an inherent discrepancy in the three solubility approach with regards to how the data are ...
  48. [48]
    Reaxys
    Reaxys includes three chemistry information databases: Beilstein, Gmelin, and Patent Chemistry Database (see Table 1). Table 1. Content Information of Reaxys ...Missing: CASREACT | Show results with:CASREACT
  49. [49]
    CAS Databases
    CAS provides accurate and authoritative chemistry content, curated and quality-controlled by hundreds of Ph.D. scientists from around the world.
  50. [50]
    Curating Reagents in Chemical Reaction Data with an Interactive ...
    Sep 20, 2024 · We rely on the atom-atom mapping (AAM) provided in USPTO to extract reagents from reactions.Missing: NLP | Show results with:NLP
  51. [51]
    Suitability of large language models for extraction of high-quality ...
    Nov 26, 2024 · In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents.
  52. [52]
    ChEMU 2020: Natural Language Processing Methods Are Effective ...
    The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents.Missing: curation standardization
  53. [53]
    Reaction Data Curation I: Chemical Structures and Transformations ...
    Here, we suggest a 4 steps protocol that includes the curation of individual structures (reactants and products), chemical transformations, reaction conditions ...
  54. [54]
    Retrosynthetic crosstalk between single-step reaction and multi-step ...
    Aug 28, 2025 · Retrosynthesis—the process of deconstructing complex molecules into simpler, more accessible precursors—is a cornerstone of drug discovery ...
  55. [55]
    AI-Driven Synthetic Route Design Incorporated with Retrosynthesis ...
    Mar 8, 2022 · In this study, we developed a data-driven CASP application integrated with various portions of retrosynthesis knowledge called “ReTReK”
  56. [56]
    Quantum chemical data generation as fill-in for reliability ... - NIH
    We demonstrate and discuss the feasibility of autonomous first-principles mechanistic explorations for providing quantum chemical data.
  57. [57]
    Finding Information on Chemical Reactions and Reagents - Guides
    Jul 16, 2018 · CASREACT searches graphical reactions published from 1840 on, although the coverage is most comprehensive after 1975.
  58. [58]
    IBM RXN: New AI model boosts mapping of chemical reactions
    Jan 28, 2021 · Our recent paper by investigates deep learning models to classify chemical reactions and visualizes the chemical reaction space.
  59. [59]
    IBM RXN for Chemistry
    RXN for Chemistry. Predict reactions, find retrosynthesis pathways, and derive experimental procedures with RXN for Chemistry. Sign up. Log in.
  60. [60]
    AI-Powered Reaction Prediction and Retrosynthesis - ResearchGate
    Sep 12, 2025 · Challenges and Limitations: Challenges include incomplete reaction data and noise in training datasets, affecting model accuracy. Generalization ...
  61. [61]
    Binding Database Home
    As of October 30, 2025, BindingDB's patent dataset comprises: Patents: 8,288; Binding measurements: 1,270,194; Compounds: 633,329; Target proteins: 3,025 ...BindingDB · Binding Database Home · Pathway in Binding Database · About
  62. [62]
    None
    ### Summary of ChEMBL 36 Release Notes
  63. [63]
    The ChEMBL Database in 2023: a drug discovery platform spanning ...
    Nov 2, 2023 · ChEMBL contains ∼2.4 million unique chemical structures which, as part of the ChEMBL curation process, must be standardised. In collaboration ...
  64. [64]
    The ChEMBL database as linked open data
    May 8, 2013 · We have illustrated the advantages of using open standards and ontologies to link the ChEMBL database to other databases. Using those links ...
  65. [65]
    ChEMBL Database in 2023: Drug Discovery Platform
    Nov 2, 2023 · The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods
  66. [66]
    Fifteen years of ChEMBL and its role in cheminformatics and drug ...
    Mar 10, 2025 · In ChEMBL 19 (July 2014) the content of ChEMBL was expanded to include more than 40 K compound records and 245 K bioactivity data points ...
  67. [67]
    Human Genome Map Turns 10 - C&EN - American Chemical Society
    May 17, 2013 · The Human Genome Project officially ended in 2003. The $3.8 billion ... databases has also increased dramatically as a result of the Human Genome ...
  68. [68]
    Combatting over-specialization bias in growing chemical databases
    May 19, 2023 · In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral.
  69. [69]
    Hidden Challenges of Privacy and Ethics in Biological Big Data - NIH
    In addition, the emergence of electronic health records (EHR) with the rise of personalized medicine makes patients vulnerable to breaching privacy.
  70. [70]
    InChI, the IUPAC International Chemical Identifier
    May 30, 2015 · To address the lack of a non-proprietary, strictly-unique standard chemical identifier, the InChI project was initiated in 2000 by two ...
  71. [71]
    SMILES, a chemical language and information system. 1 ...
    SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules.
  72. [72]
    PubChem chemical structure standardization
    Aug 10, 2018 · PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay.
  73. [73]
    A Smarter, More Collaborative Future for InChI - ChemistryViews
    Mar 17, 2025 · Originally designed to identify small molecules, InChI can now also handle organometallics, polymers, and nanomaterials, making it more useful ...
  74. [74]
    Organization of 3D Structures in the Protein Data Bank - RCSB PDB
    Oct 25, 2023 · An ENTRY is all data pertaining to a particular structure deposited in the PDB and is designated with a 4-character alphanumeric identifier ...
  75. [75]
    Connectivity stepwise derivation (CSD) method: a generic chemical ...
    Aug 8, 2024 · Thus, the adjacency matrix of the benzene molecule is populated with 1 in row 1, columns 2 and 6 because atom 1 is connected to atoms 2 and 6 ...
  76. [76]
    A Survey of Quantitative Descriptions of Molecular Structure - PMC
    An example of a topological descriptor is the Wiener index [7, 8]. It is simply the sum of the edge counts in the shortest paths between all pairs of non ...
  77. [77]
    [PDF] How to compute the Wiener index of a graph
    The Wiener index of a graph G is equal to the sum of distances between all pairs of vertices of G. It is known that the Wiener index of a molecular graph ...
  78. [78]
    [PDF] The-Wiener-Index-Development-and-Applications.pdf - ResearchGate
    In the above equation N is the number of carbon atoms in an alkane. The sta ... Computation of the Wiener index for the tree T from Figure 4 using the method.
  79. [79]
    Molecular Descriptor - an overview | ScienceDirect Topics
    The van der Waals volume (Vvdw) is the volume of the space inside the van der Waals molecular surface. The van der Waals volume is closely connected to the ...
  80. [80]
    About the InChI Standard - InChI Trust
    InChI is a structure-based chemical identifier, developed by IUPAC and the InChI Trust. It is a standard identifier for chemical databases.
  81. [81]
    Extended-Connectivity Fingerprints - ACS Publications
    Another commonly used class of fingerprints is available through Daylight Chemical Information Systems. (20) It uses features based upon the presence of ...
  82. [82]
    Daylight Theory: Fingerprints
    Fingerprints are a very abstract representation of certain structural features of a molecule; before we describe them, we'll discuss the problems that inspired ...
  83. [83]
    rdkit.Chem.Crippen module — The RDKit 2025.09.2 documentation
    Atom-based calculation of LogP and MR using Crippen's approach. Reference: Wildman and G. M. Crippen JCICS 39 868-873 (1999).
  84. [84]
    MolAI: A Deep Learning Framework for Data-Driven Molecular ...
    Sep 15, 2025 · This study introduces MolAI, a robust deep learning model designed for data-driven molecular descriptor generation.
  85. [85]
    CAS Number Search - the NIST WebBook
    Search for Species Data by CAS Registry Number. Please follow the steps below to conduct your search: Enter a registry number (e.g., 74-82-8):
  86. [86]
  87. [87]
    Systematic benchmark of substructure search in molecular graphs
    Jul 31, 2012 · The Ullmann algorithm [7] is a backtracking procedure that employs a relaxation-based refinement step to reduce the search space. It operates on ...
  88. [88]
    The PubChem Compound Help
    Superstructure search allows one to identify chemical structures that comprise or make up (i.e., is a substructure of) the provided chemical structure query.
  89. [89]
    Query Features - Chemaxon Docs
    Position variation bond (or variable point of attachment) is used to express that a bond may be attached to multiple positions (atoms), most often used for ...
  90. [90]
    Weininger's Realization - Dalke Scientific Software
    Dec 2, 2016 · I was therefore surprised to discover that Daylight introduced the term "fingerprint" to the field, around 1990 or so. The concept existed ...
  91. [91]
    Sachem: a chemical cartridge for high-performance substructure ...
    May 23, 2018 · We present Sachem, a new open-source chemical cartridge that aims to run substructure search queries on the largest publicly available datasets.
  92. [92]
    Fragmentation in Distributed DBMS - GeeksforGeeks
    Jul 31, 2025 · Fragmentation is the process of dividing a database table into smaller parts called fragments, which are stored on different sites in a ...
  93. [93]
    Subgraph isomorphism problem - Wikipedia
    Subgraph isomorphism is a generalization of both the maximum clique problem and the problem of testing whether a graph contains a Hamiltonian cycle.
  94. [94]
    ChemDB update—full-text search and virtual chemical space
    With the fuzzy search option turned on, such a query returns several results, including the intended match for the structure of aspirin, with a correct name ...
  95. [95]
    Chemical Similarity Searching - ACS Publications
    This paper reviews the use of similarity searching in chemical databases. It begins by introducing the concept of similarity searching.
  96. [96]
    Information Retrieval for Chemists: CAS Registry Numbers
    CAS Registry Numbers (CAS RN) are used to locate information about a chemical compound quickly and accurately. · Every chemical compound has its own CAS Registry ...
  97. [97]
    A History of Chemical Abstracts Service, 1907-1998
    This paper is a history of Chemical Abstracts Service from its beginnings back in 1907 when the chemical abstracts were handwritten by volunteers.
  98. [98]
    A standard method to generate canonical SMILES based on the InChI
    Sep 18, 2012 · The InChI is often used to identify and remove duplicates in chemical databases. As shown by the results of the duplicate test, the InChI ...
  99. [99]
    canSAR chemistry registration and standardization pipeline
    May 28, 2022 · The pipeline consists of five steps to register the compounds and create the compounds' hierarchy: 1. Structure checker, 2. Standardization, 3.
  100. [100]
    Compound Registration - Chemical Registration Software - Chemaxon
    Compound Registration is an end-to-end chemical registration system out of the box. Use its flexible configuration options to apply your own business logic.
  101. [101]
    Blue Book | International Union of Pure and Applied Chemistry
    Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013, IUPAC Blue book, prepared for publication by Henri A Favre and Warren H ...
  102. [102]
    Perspective Transition to sustainable chemistry through digitalization
    Nov 11, 2021 · To date, the majority of chemical reactions data are locked in few proprietary databases, which, in reality, offer only limited capability ...
  103. [103]
    Experimental Errors in QSAR Modeling Sets: What We Can Do and ...
    Jun 19, 2017 · The major issues existing in the public data sources include (1) the incorrect representation of chemical structures (i.e., structural errors) ...
  104. [104]
    Implementing Data Integrity Compliance in a GLP Test Facility
    Aug 2, 2021 · Alarms and events related to critical data; Reports; Audit trails. How can these critical records be made compliant with regulatory requirements ...
  105. [105]
    Audit Trail Requirements for a Digitalized Regulated Laboratory
    Jul 3, 2025 · This article discusses options for reviews, the regulatory requirements and guidance for audit trails, audit trail design, procedures for audit trail review.
  106. [106]
    Getting Started with the RDKit in Python
    This document is intended to provide an overview of how one can use the RDKit functionality from Python. It's not comprehensive and it's not a manual.
  107. [107]
    The Chemistry Development Kit (CDK): An Open-Source Java ...
    The CDK provides methods for many common tasks in molecular informatics, including 2D and 3D rendering of chemical structures, I/O routines, SMILES parsing and ...
  108. [108]
    Chemistry Development Kit
    The Chemistry Development Kit (CDK) is a collection of modular Java libraries for processing chemical information (Cheminformatics).
  109. [109]
    Chemical Structure Representation Toolkit - Chemaxon
    Standardizer - canonicalizing chemical structures. Standardizer's main purpose is to transform chemical structures into representations that obey certain ...
  110. [110]
    JChem Cartridge for Oracle - Chemaxon Docs
    You can search data by structure, substructure and similarity through extensions to Oracle's native SQL language. Chemical data can be easily inserted and ...
  111. [111]
    The RDKit database cartridge
    This document is a tutorial and reference guide for the RDKit PostgreSQL cartridge. If you find mistakes, or have suggestions for improvements, please either ...
  112. [112]
    RDKit - Nile Documentation
    RDKit is a powerful open-source cheminformatics and machine learning toolkit that provides PostgreSQL with the ability to handle and analyze chemical ...
  113. [113]
    Daylight Toolkit
    The Daylight Toolkit is a programming library that provides all functions needed for chemical information processing and substructure pattern searching along ...
  114. [114]
    Enhancing Opioid Bioactivity Predictions through Integration of ... - NIH
    In this study, we investigate the effectiveness of transfer learning in building robust deep learning models to enhance ligand bioactivity prediction.
  115. [115]
    The evolution of open science in cheminformatics - NIH
    Apr 3, 2025 · The adoption of Semantic Web technologies and Linked Open Data (LOD) principles in cheminformatics began gaining momentum in the early 2000s, ...
  116. [116]
    Power User Gateway (PUG) - PubChem - NIH
    The PubChem Power User Gateway (PUG) provides access to PubChem services via a programmatic interface, using a single CGI and XML communication.
  117. [117]
    ChemSpider: An Online Chemical Information Resource
    Aug 30, 2010 · ChemSpider is a free, online chemical database offering access to physical and chemical properties, molecular structure, spectral data, synthetic methods, ...
  118. [118]
    EMBL's European Bioinformatics Institute (EMBL-EBI) in 2022
    Dec 7, 2022 · All EMBL-EBI data resources and many software systems can be downloaded and installed locally, and are made available on an open and free basis ...
  119. [119]
    Chemical Biology Services - EMBL-EBI
    ChEMBL is a database of bioactivity and ADMET information on drugs and drug-like molecules. ChEBI. ChEBI is a database and ontology of Chemical Entities of ...
  120. [120]
    PUG REST - PubChem - NIH
    Dates. Returns dates associated with PubChem identifiers; note that not all date types are relevant to all identifier types – see the table below. Multiple ...
  121. [121]
    Introduction — ChemSpiPy 2.0.0 documentation - Read the Docs
    ChemSpiPy is a Python wrapper that allows simple access to the web APIs offered by ChemSpider. The aim is to provide an interface for users to access and query ...
  122. [122]
    3D Structure Viewer - PubChem - NIH
    PubChem 3D Viewer provides a user friendly interface for rendering multiple 3-dimensional structures of PubChem compound records and for visualization of ...
  123. [123]
    EMBL-EBI data resources and tools
    EMBL's European Bioinformatics Institute maintains the world's most comprehensive range of freely available and up-to-date molecular data resources.
  124. [124]
    Process, Search, and Analyze Electronic Lab Notebooks Data on AWS
    Dec 12, 2022 · Retrieving, processing, and analyzing electronic lab notebook (ELN) data is slow, costly, and labor intensive. AWS storage, databases, and artificial ...
  125. [125]
    [PDF] KNIME Amazon Web Services Integration User Guide
    Jul 5, 2024 · KNIME Analytics Platform includes a set of nodes to interact with Amazon Web Services (AWS™). They allow you to create connections to Amazon ...
  126. [126]
    kMoL: an open-source machine and federated learning library for ...
    Feb 25, 2025 · kMoL is an open-source machine learning library with integrated federated learning capabilities developed to address such challenges.
  127. [127]
    Three pillars for ensuring public access and integrity of chemical ...
    Mar 28, 2025 · Many scientists downloading data from public databases are likely unaware of potential licensing limitations or the importance of ...
  128. [128]
    Why Open Drug Discovery Needs Four Simple Rules for Licensing ...
    Sep 27, 2012 · We have formulated four rules for licensing data for open drug discovery, which we propose as a starting point for consideration by databases ...
  129. [129]
    Marvin - Chemical Drawing Software - Chemaxon
    Marvin comes with a universal search bar to make features, templates, tools, functional groups and conversion from name to structure easily accessible.
  130. [130]
    ChemDraw | Revvity Signals Software
    Since 1985 ChemDraw has provided powerful capabilities and integrations to help you quickly turn ideas & drawings into publications you can be proud of.
  131. [131]
    BIOVIA Pipeline Pilot | Dassault Systèmes
    BIOVIA Pipeline Pilot accelerates Innovation in science and engineering with AI and Machine Learning.
  132. [132]
    Open Babel - the chemistry toolbox — Open Babel openbabel-3-1-1 ...
    Open Babel is a chemical toolbox designed to speak the many languages of chemical data. It's an open, collaborative project allowing anyone to search, convert, ...
  133. [133]
    Open Babel: An open chemical toolbox - PubMed
    Oct 7, 2011 · Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats.
  134. [134]
    Five basic things you need to know about MarvinSketch - Chemaxon
    Sep 3, 2013 · From creating structures and reactions for manuscripts and presentations, to front-end to query databases or other cheminformatics tools, ...
  135. [135]
    [PDF] PIPELINE PILOT OVERVIEW - Dassault Systemes
    BIOVIA Pipeline Pilot optimizes the research innovation cycle by providing capabilities for scientific analysis (in dark blue) and allowing for the automation ...
  136. [136]
    PubChem Compounds - NIH
    PubChem Compound records are derived summaries that give users access to a rich set of related content.
  137. [137]
    Chemical Data Exchange Protocols — IUPAC FAIR Chemistry ...
    IUPAC is hosting a community project through the WorldFAIR Initiative to define a common protocol for programmatic exchange of chemical representation.
  138. [138]
    Chemical Markup Language | CML
    CML provides support for most chemistry, especially molecules, compounds, reactions, spectra, crystals and computational chemistry (compchem).
  139. [139]
    The FAIR Guiding Principles for scientific data management ... - Nature
    Mar 15, 2016 · This article describes four foundational principles—Findability, Accessibility, Interoperability, and Reusability—that serve to guide data ...
  140. [140]
    From discovery to delivery: Governance of AI in the pharmaceutical ...
    Data integration from various sources, such as experimental results, clinical trials, and chemical databases, poses significant challenges due to differences in ...
  141. [141]
    Top 10 Challenges in Artificial Intelligence for Materials and ...
    Explore the challenges in AI for materials and chemicals, focusing on data diversity and the need for tailored machine learning solutions.