
Simplified Molecular Input Line Entry System

The Simplified Molecular Input Line Entry System (SMILES) is a line notation system for representing the structure of chemical molecules and reactions using compact ASCII strings composed of atomic symbols, numbers for ring closures, and symbols for bonds, branches, and stereochemistry.^[1] The OpenSMILES specification defines an open standard for the language.^[2] Developed to facilitate computer processing of chemical information, SMILES encodes molecular topology in a human-readable yet machine-parsable format, allowing unambiguous description of connectivity, aromaticity, and isomerism without requiring graphical input.^[3]

SMILES was initiated by chemist David Weininger in 1986 while working at the U.S. Environmental Protection Agency's Mid-Continent Ecology Division Laboratory in Duluth, Minnesota, as part of efforts to create an efficient, chemist-friendly language for chemical databases and modeling software.^[1] The methodology and encoding rules were first detailed in a seminal 1988 paper published in the Journal of Chemical Information and Computer Sciences, where it was described as a system "designed for modern chemical information processing" based on principles of atomic valence and graph theory.^[3] Weininger, who passed away in 2016, envisioned SMILES as an open, extensible standard; subsequent refinements, including canonicalization for unique string generation and extensions for reactions (SMIRKS) and substructure searching (SMARTS), were advanced by Daylight Chemical Information Systems in the early 1990s.^[4]

Key features of SMILES include its simplicity (basic organic molecules like ethanol are denoted as "CCO") and its flexibility for complex structures, such as rings (e.g., cyclohexane as "C1CCCCC1") and stereocenters (using @ or / symbols).^[5]^[2] Unlike graphical formats, SMILES prioritizes linear representation, making it ideal for text-based storage, search, and exchange in cheminformatics applications, including drug discovery databases like PubChem and ChEMBL.^[6]^[7]^[8] Its adoption has grown due to interoperability with software tools for molecular generation, property prediction, and virtual screening, though limitations exist in fully capturing three-dimensional conformations without extensions.

History and Development

Origins and Creation

The Simplified Molecular Input Line Entry System (SMILES) was developed by David Weininger in 1986 while working at the U.S. Environmental Protection Agency's Mid-Continent Ecology Division Laboratory in Duluth, Minnesota, where it emerged as a compact and human-readable alternative to the cumbersome connection tables traditionally used to represent molecular structures in computational chemistry.^[1]^[3] The design was completed while Weininger was at Pomona College, and subsequent implementation and refinements were carried out at Daylight Chemical Information Systems, Inc., starting in the late 1980s. This notation addressed the need for a streamlined method to encode chemical information, enabling easier manipulation and storage of molecular data in early cheminformatics applications.^[1] The primary motivations behind SMILES were to facilitate the exchange of molecular structures across diverse software systems in cheminformatics, reducing the complexity associated with graphical or tabular formats. 
Weininger drew inspiration from predecessor line notations, notably the Wiswesser Line Notation (WLN) introduced in the 1940s, but sought to create a more intuitive and versatile system that avoided WLN's rigid, mnemonic-based rules while supporting broader structural representations.^[3] By prioritizing simplicity and portability, SMILES aimed to bridge the gap between human interpretation and machine processing of chemical data.^[3] SMILES was first detailed in a seminal publication by Weininger in 1988, appearing in the Journal of Chemical Information and Computer Sciences, which outlined its encoding rules and methodology.^[3] Central to its design is the use of linear character strings to depict two-dimensional molecular graphs, eschewing explicit coordinate information to focus solely on atomic connectivity and basic stereochemistry.^[3] This approach ensured that SMILES strings could be generated, parsed, and interconverted efficiently without reliance on visual or positional data.^[3]

Evolution and Standardization

Following the initial introduction of SMILES in 1988, David Weininger and collaborators at Daylight Chemical Information Systems expanded the notation throughout the 1990s to address ambiguities and enhance its applicability. These refinements included detailed rules for representing aromaticity, where lowercase letters denote aromatic atoms and the bonds between them are treated as aromatic instead of being written as alternating single and double bonds, and stereochemistry, using symbols like "@" for chiral centers and "/" and "\" for double-bond configurations. A key advancement in canonicalization was introduced in a 1989 publication, where algorithms leveraging graph isomorphism produced unique SMILES strings, reducing duplicates in large-scale molecular inventories.^[9] These expansions were documented in Daylight's comprehensive tutorials and theory manuals, which became de facto references for implementers seeking consistent parsing and generation of SMILES strings.^[10]^[1]

A pivotal step in standardization occurred in 2007 when the Blue Obelisk open-source cheminformatics community initiated the OpenSMILES project to create a non-proprietary, interoperable specification for SMILES. Culminating in the 2012 OpenSMILES specification, this effort clarified encoding rules, resolved variations in aromaticity perception, and promoted canonical SMILES generation to ensure unique string representations across software tools, fostering widespread adoption in databases and cheminformatics pipelines. The specification emphasized backward compatibility while standardizing features like ring closures and branching to improve data exchange in computational chemistry.^[2]^[11]

In the 2020s, SMILES saw further alignment with international standards through IUPAC's ongoing project to formalize an updated specification, known as SMILES+, aimed at enhancing notation consistency without altering the core syntax. 
As of January 2025, the project reported major accomplishments from 2019-2024, including cheminformatics toolkit comparisons, with ongoing plans for further standardization. This integration builds on OpenSMILES by incorporating IUPAC guidelines for stereochemistry and tautomer handling, ensuring SMILES remains compatible with emerging standards like InChI.^[12] Recent milestones include 2023 cheminformatics community discussions on adapting SMILES for AI-driven molecular modeling, highlighting its role in training generative models while noting the unchanged core syntax's robustness for machine learning tasks like de novo design. These conversations, featured in conferences and reviews, underscore SMILES' enduring utility without necessitating major revisions.^[13]^[14]

Formal Foundations

Graph-Based Structural Definition

In SMILES, molecular structures are represented as undirected graphs in which atoms correspond to vertices and chemical bonds to edges. Single bonds are implied by default or written as explicit hyphens, double bonds are written as equals signs, and triple bonds as hash marks; aromatic bonds are implied between adjacent lowercase (aromatic) atoms or written explicitly with a colon. In its basic form, this graph-theoretic foundation captures the connectivity and topology of a molecule without embedding spatial coordinates, focusing on the abstract relational structure.^[3]

Valence rules let each atom reach its standard valence through explicit bonds plus implicit hydrogens, which the parser infers to fill the remaining valence. For instance, carbon has a valence of four, so an isolated "C" denotes methane (CH₄), with four hydrogens that are not written as separate vertices. Atoms in the organic subset (such as C, N, and O) follow predefined valences. Atoms written in brackets can carry explicit charges or isotopes, and their hydrogen count is stated inside the bracket and is not inferred. Omitting hydrogen vertices and edges in most cases keeps both the graph and the string compact.^[3]

The SMILES string encodes connectivity through a linear traversal of the graph, following a main chain with branches enclosed in parentheses and ring closures marked by matching digits. Writing a string amounts to a depth-first search (DFS) that starts from an arbitrary atom, visits neighbors in turn, and backtracks at the end of each branch. Formally, the string serializes a spanning tree of the graph in DFS order: each atom symbol records a vertex, each bond symbol an edge type, and each pair of matching digits an additional edge that closes a cycle, so the full topology can be reconstructed on decoding. Rings are thus handled by breaking one bond per cycle and labeling its two end atoms, which keeps the written form tree-like while preserving every bond of the cyclic graph. By design, this basic encoding excludes stereochemistry and three-dimensional coordinates, emphasizing topological identity over geometric detail.^[3]
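The depth-first serialization described above can be sketched in a few lines of Python. This is a minimal illustration and not the algorithm of any particular toolkit: the function name write_smiles and its inputs (a list of atom symbols plus a list of index pairs) are assumptions made for the example, all bonds are treated as single, and no canonical ordering is attempted.

```python
from itertools import count


def write_smiles(symbols, bonds, start=0):
    """Serialise a connected, single-bonded graph by depth-first traversal.

    `symbols` is a list of atom symbols; `bonds` is a list of (i, j) index
    pairs. Both names are hypothetical inputs chosen for this sketch.
    """
    neighbours = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    children = {i: [] for i in neighbours}   # spanning-tree edges
    closures = {i: [] for i in neighbours}   # ring-closure labels per atom
    ring_edges, visited, labels = set(), set(), count(1)

    def explore(atom, parent):
        visited.add(atom)
        for nbr in neighbours[atom]:
            if nbr == parent:
                continue
            if nbr not in visited:
                children[atom].append(nbr)
                explore(nbr, atom)
            elif frozenset((atom, nbr)) not in ring_edges:
                # A non-tree edge closes a cycle: label both end atoms.
                ring_edges.add(frozenset((atom, nbr)))
                label = next(labels)
                closures[atom].append(label)
                closures[nbr].append(label)

    def emit(atom):
        text = symbols[atom] + "".join(
            str(n) if n < 10 else f"%{n}" for n in closures[atom])
        kids = children[atom]
        for kid in kids[:-1]:            # all but the last child are branches
            text += "(" + emit(kid) + ")"
        if kids:                         # the last child continues the chain
            text += emit(kids[-1])
        return text

    explore(start, None)
    return emit(start)


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))                                 # C1CCCCC1
print(write_smiles(["C", "C", "O", "C"], [(0, 1), (1, 2), (1, 3)]))  # CC(O)C
```

Starting from a different atom, or listing the bonds in a different order, yields a different but equally valid string for the same graph, which is the non-uniqueness that canonicalization addresses.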

Context-Free Language Specification

The syntax of SMILES can be specified as a context-free language, so linear strings can be parsed into molecular structures with a defined grammar. This formalization means that SMILES strings can be systematically generated and validated, mapping directly to the connectivity and hierarchy of chemical graphs. The OpenSMILES standard gives the grammar in a Backus-Naur Form (BNF) style; nested elements such as branches are what distinguish the language from a regular language.^[2]

The core grammar rules are structured around non-terminals such as atoms, bonds, chains, branches, and ring closures, with productions that recursively build molecular expressions. For instance, the start symbol (often denoted molecule or S) derives an initial atom, which can be extended by optional bonds to subsequent atoms, by branches enclosed in parentheses, or by ring-closure digits; representative productions include S → atom | S bond atom | S branch | S ringclosure, with branch → ( chain ) handling side chains. These rules permit linear traversal of the molecule while supporting hierarchical nesting. The grammar itself constrains only syntax; checks such as valence and the pairing of ring-closure labels are applied as semantic rules after parsing. The full set of productions, detailed in the OpenSMILES document, covers over 20 non-terminals to encompass the organic subset, aromaticity, and extensions like isotopes.^[2]

Parsing commonly uses recursive descent: atoms are processed sequentially, and the parser recurses into a branch on each opening parenthesis. Branches nest strictly, so a stack suffices for them. Ring closures are instead tracked in a table of open labels, because two rings may interleave (as in C1CC2CCC1C2) and each label is matched to its earlier occurrence, not to the most recently opened ring. The OpenSMILES specification recommends that parsers support at least 100 levels of nesting for branches and 1000 ring closures. The result is a parse tree whose nodes represent atoms or bonds, with branches forming subtrees and ring closures adding back-edges between nodes. Recognition of the branch structure is equivalent to recognition by a pushdown automaton, whose stack provides the memory needed for nested structures without requiring full graph context during initial parsing.^[2]^[15]

Although the grammar admits multiple valid strings for the same molecule, owing to alternative starting atoms and orders of branch or ring traversal, each string parses to a single graph. Canonical SMILES variants address this non-uniqueness by enforcing a standardized traversal, such as depth-first with specific tie-breaking rules for symmetry, yielding one representative string per structure. The non-uniqueness does not compromise parsing reliability, as all variants derive from the same grammar and map to isomorphic graphs.^[2]

The context-free formalism is needed because the language is not regular: nested branches introduce arbitrary-depth recursion, akin to balanced parentheses, which a finite automaton cannot track without a stack. This level of expressiveness suffices in practice, since nesting depths in real molecules rarely approach implementation limits. Established cheminformatics libraries confirm the grammar's adequacy for efficient parsing of real-world chemical datasets.^[2]
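As a rough illustration of the parsing ideas in this section, the following Python sketch tokenizes a SMILES string and checks two structural properties: that parentheses balance and that every ring-closure label is used in a pair. The token pattern is a simplified assumption and not the OpenSMILES grammar, and the names tokenize and check_structure are hypothetical. A depth counter plays the role of the branch stack; ring labels are kept in a set in this sketch, since labels from different rings can interleave.

```python
import re

# Simplified token pattern (an assumption for this sketch): bracket atoms,
# organic-subset atoms, aromatic atoms, wildcard, ring labels, bonds,
# and branch/dot punctuation.
TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*|%\d{2}|\d|[-=#$:/\\]|[().]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def check_structure(smiles):
    """Return True if branches balance and all ring labels are paired."""
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring labels seen once and not yet closed
    for tok in tokenize(smiles):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ')'")
        elif tok[0].isdigit() or tok[0] == "%":
            # Toggle membership: first sighting opens, second closes.
            open_rings ^= {int(tok.lstrip("%"))}
    if depth:
        raise ValueError("unclosed branch")
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return True


print(tokenize("CC(=O)O"))             # ['C', 'C', '(', '=', 'O', ')', 'O']
print(check_structure("C1CC2CCC1C2"))  # True
```

The sketch validates surface syntax only; it does not check valence, bond placement, or whether a ring label directly follows an atom.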

Basic Notation Elements

Atomic Representations

In SMILES, atoms are primarily represented using their standard atomic symbols from the periodic table, with specific conventions to ensure compactness and readability. The system distinguishes between an "organic subset" of common elements and more general atomic specifications, allowing for efficient notation in chemical structures.^[3]

The organic subset includes the elements boron (B), carbon (C), nitrogen (N), oxygen (O), phosphorus (P), sulfur (S), fluorine (F), chlorine (Cl), bromine (Br), and iodine (I). These atoms can be written as bare symbols without enclosing brackets when they carry no formal charge and have implicit hydrogens determined by standard valence rules. For example, the symbol 'C' represents a carbon atom with four bonds, denoting methane (CH₄) in isolation, while 'N' denotes nitrogen with three bonds, as in ammonia (NH₃). This bracket-free notation applies only to uncharged atoms in the organic subset with normal valences, promoting brevity for organic molecules.^[1]^[2]

For atoms outside the organic subset or those requiring additional properties, SMILES uses square brackets to enclose the full specification. Inside brackets, an optional isotope number precedes the atomic symbol, which is followed by optional descriptors for chirality, hydrogen count, and charge. Examples include [Na+] for the sodium ion, [H] for an explicit hydrogen, and [Fe] for iron. Charges are indicated by a sign (+ or -) optionally followed by a number, such as [O-2] for the oxide ion or [NH4+] for ammonium. SMILES has no dedicated radical symbol; a radical is expressed through a reduced hydrogen count, as in [CH3] for the methyl radical. Isotopes precede the symbol, as in [14C]. The bracketed format is mandatory for metals, transition elements, and any organic subset atoms with unusual properties.^[1]^[3]

Hydrogen atoms are handled primarily through implicit rules to minimize string length, especially in organic contexts. For atoms in the organic subset, the number of implicit hydrogens is the atom's standard valence minus the sum of the bond orders of its explicit bonds, assuming a neutral charge. Standard valences are carbon (4), nitrogen (3 or 5 in some cases), oxygen (2), phosphorus (3 or 5), sulfur (2, 4, or 6), halogens (1), and boron (3). For instance, ethanol is written as CCO, where the first C has three implicit H (CH₃-), the second C has two implicit H (-CH₂-), and O has one implicit H (-OH). Hydrogens can be made explicit by writing [H] as a separate atom, as in C[H], or by specifying a count after the symbol inside brackets, such as [CH4] for methane. Bracket atoms, including all metals, receive no implicit hydrogens, so any hydrogens attached to them must be stated explicitly.^[1]^[2]

Special cases include pseudoatoms, such as the wildcard '*', which stands for an unspecified atom and is treated as part of the organic subset for notation purposes. It is used in pattern matching or queries, for example in substructure searches, and can be written without brackets like other organic atoms. These conventions keep the notation compact while accommodating diverse chemical entities.^[2]^[1]
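The bracket-atom layout described above (isotope, symbol, chirality, hydrogen count, charge) lends itself to a small regular-expression sketch. The pattern below is a simplified assumption covering common cases only; it omits atom classes and extended chirality tags, and the name parse_bracket_atom is hypothetical.

```python
import re

# Simplified bracket-atom pattern (an assumption for this sketch).
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>[+-]\d*)?\]"
)


def _charge(text):
    """Convert '+', '-', '+2', '-2' (or None) to a signed integer."""
    if not text:
        return 0
    return int(text) if len(text) > 1 else int(text + "1")


def parse_bracket_atom(text):
    """Break one bracket atom such as '[NH4+]' into its fields."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised bracket atom: {text!r}")
    hcount = match["hcount"]
    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "chirality": match["chiral"],
        # 'H' alone means one hydrogen; 'H4' means four; absent means none.
        "hydrogens": int(hcount[1:] or 1) if hcount else 0,
        "charge": _charge(match["charge"]),
    }


print(parse_bracket_atom("[NH4+]"))
print(parse_bracket_atom("[13CH4]"))
print(parse_bracket_atom("[O-2]"))
```

Note that a bracket atom with no H field, such as [CH3] compared with [C], differs only in the recorded hydrogen count, which is how radicals and hydrogen-free atoms are distinguished.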

Bond Types and Connectivity

In SMILES notation, the default bond between adjacent atoms is a single covalent bond, which is implied without any symbol; for instance, the string "CC" represents ethane (C₂H₆), where the two carbon atoms are connected by a single bond.^[16]^[17] This convention simplifies the representation of linear hydrocarbon chains, assuming standard valence rules for organic atoms unless otherwise specified.^[3]

Explicit bond symbols denote higher bond orders or specific bond types and are written between the two atoms they join. Single bonds can be explicitly indicated with a hyphen "-", double bonds with "=", triple bonds with "#", and aromatic bonds with ":". For example, "C=O" denotes formaldehyde (H₂C=O), where the "=" symbol specifies the double bond between carbon and oxygen.^[10]^[16] Aromatic bonds require lowercase atomic symbols for the connected atoms, such as "c:c" for an aromatic bond between two carbon atoms, as in benzene fragments, consistent with valence models that treat an aromatic bond as a delocalized bond of order 1.5.^[17] Bond orders up to quadruple are supported in extended specifications, denoted by "$", though such bonds are rare in common organic molecules.^[17]

Connectivity in SMILES is established through linear juxtaposition of atomic symbols, forming chains where each pair of consecutive atoms is linked by the specified or default bond. This graph-based approach relies on valence handling by parsers, which infer implicit hydrogens to satisfy standard atomic valences (e.g., carbon's valence of 4) and may reject structures that exceed them, such as over-bonded atoms.^[3]^[1] Conjugated chains are written simply by alternating bond symbols, as in "C=CC=C" for 1,3-butadiene, with no additional notation required.^[16]
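To make the interplay between bond symbols and implicit hydrogens concrete, here is a minimal Python sketch that computes implicit hydrogen counts for an unbranched, acyclic chain of neutral organic-subset atoms. The function name implicit_hydrogens and the valence table layout are assumptions for this example; the rule used (pick the lowest standard valence that is at least the sum of explicit bond orders) is a common convention, not a statement about any specific toolkit.

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3, "$": 4}

# Standard valences for neutral organic-subset atoms, lowest first.
VALENCES = {"B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
            "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,)}


def implicit_hydrogens(smiles):
    """Return [(symbol, implicit H count), ...] for a simple chain."""
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]|[-=#$]", smiles)
    if "".join(tokens) != smiles:
        raise ValueError("only unbranched organic-subset chains are supported")

    atoms, used = [], []   # symbols and summed explicit bond orders
    pending = 1            # order of the next bond; single by default
    for tok in tokens:
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
            continue
        if atoms:
            used[-1] += pending   # bond to the previous atom...
            used.append(pending)  # ...counts for the new atom too
        else:
            used.append(0)
        atoms.append(tok)
        pending = 1

    result = []
    for symbol, total in zip(atoms, used):
        # Lowest standard valence that accommodates the explicit bonds.
        target = next((v for v in VALENCES[symbol] if v >= total), total)
        result.append((symbol, target - total))
    return result


print(implicit_hydrogens("CCO"))   # [('C', 3), ('C', 2), ('O', 1)]
print(implicit_hydrogens("C=O"))   # [('C', 2), ('O', 0)]
print(implicit_hydrogens("C#N"))   # [('C', 1), ('N', 0)]
```

Branches, rings, aromatic atoms, and bracket atoms are deliberately out of scope here; a full implementation would sum bond orders over all neighbors found by a complete parser.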

Structural Features

Ring Systems

In SMILES, cyclic structures are denoted by ring-closure labels, which connect atoms that are not adjacent in the linear notation. A ring is specified by placing a digit immediately after the atomic symbol of the atom where the ring bond opens and repeating the same digit after the atomic symbol where the ring closes, thereby linking those two atoms with a bond. Digits 1 to 9 are used conventionally, and OpenSMILES also accepts 0. This approach effectively "breaks" the cycle at one bond and labels its endpoints for reconnection. For instance, the SMILES for cyclopropane is C1CC1: the first carbon is followed by 1 to open the ring, the second and third carbons continue the chain, and the 1 after the third carbon closes the ring back to the first carbon, by default with a single bond.^[1]

The ring-closure digit applies to the immediately preceding atom. A bond type can be specified by inserting the bond symbol (such as = for double or # for triple) before the digit. Without a symbol, the closing bond is single between aliphatic atoms and aromatic between two lowercase atoms. Several ring closures can start or end at the same atom by appending several digits after it, which enables the representation of fused, bridged, and spiro systems. For example, norbornane, a bridged bicyclic hydrocarbon, is written C1CC2CCC1C2: label 1 joins the first and sixth carbons to form the six-membered ring, and label 2 joins the third and seventh carbons to add the one-carbon bridge, leaving the third and sixth carbons as bridgeheads.^[1]

Structures with several rings use a distinct digit for each ring that is open at the same time. Once a ring is closed, its digit becomes available for a new ring elsewhere in the string, so single digits limit only the number of simultaneously open rings, not the total number of rings. For molecules that need more concurrent open rings than single digits allow, such as highly polycyclic frameworks, the notation uses a percent sign followed by two digits, from %10 to %99. A closure written %10 matches only an opening written %10. Because each label is resolved against its earlier occurrence as the string is read, ring closures do not need to nest.^[2]

Fused ring systems, in which rings share two adjacent atoms (and thus a bond), are written by opening a second ring at an atom of the first and closing it with a distinct digit. For decalin (decahydronaphthalene), the SMILES is C1CCC2CCCCC2C1: ring 1 opens at the first carbon, ring 2 opens at the fourth carbon and closes at the ninth, and ring 1 closes at the tenth, so the two six-membered rings share the bond between the fourth and ninth carbons. Ring-fusion and bridgehead atoms need no additional notation, though stereodescriptors (covered elsewhere) may be added for chirality. This digit-based approach encodes ring topologies efficiently while keeping the string readable and compact.^[1]
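The ring-closure bookkeeping described in this section can be sketched as a table of open labels. The Python below is an illustrative assumption, not a complete parser: the name ring_bonds is hypothetical, atoms are numbered from 0 in the order written, bond symbols are ignored, and labels are assumed to follow their atom directly (before any branch), as in the examples above.

```python
import re

# Atoms (bracketed or bare) and ring labels; other characters are skipped.
ATOM_OR_LABEL = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z*]|%\d{2}|\d")


def ring_bonds(smiles):
    """Return (opening atom index, closing atom index) for each ring bond."""
    open_labels = {}   # label -> index of the atom that opened it
    pairs = []
    atom_index = -1
    for tok in ATOM_OR_LABEL.findall(smiles):
        if tok[0] == "%" or tok.isdigit():
            label = int(tok.lstrip("%"))
            if label in open_labels:
                # Second occurrence closes the ring and frees the label.
                pairs.append((open_labels.pop(label), atom_index))
            else:
                open_labels[label] = atom_index
        else:
            atom_index += 1
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return pairs


print(ring_bonds("C1CC1"))          # [(0, 2)]       cyclopropane
print(ring_bonds("C1CC2CCC1C2"))    # [(0, 5), (2, 6)]  norbornane
print(ring_bonds("C1CCC2CCCCC2C1")) # [(3, 8), (0, 9)]  decalin
print(ring_bonds("C1CC1C1CC1"))     # [(0, 2), (3, 5)]  label 1 reused
```

The norbornane case shows why a table is used instead of a stack: label 1 closes while label 2 is still open.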

Branching Patterns

In SMILES notation, branches representing side chains or substituents are encoded by enclosing them in parentheses immediately following the atom from which they emanate. This allows for the depiction of tree-like molecular structures without cycles. For instance, the SMILES string CC(O)C represents 2-propanol (isopropanol), where the (O) denotes a hydroxyl group branching from the central carbon atom.^[1]^[2] Multiple branches can be specified sequentially from the same atom by placing additional parenthetical expressions in succession. An example is CC(O)(Cl)C, which describes 2-chloro-2-propanol, with both a hydroxyl and a chlorine atom branching from the second carbon. This sequential notation ensures that each branch attaches to the preceding atom before the main chain resumes.^[1]^[2] Nested branches enable the representation of more complex substituents within a branch, achieved by embedding additional parenthetical expressions inside an outer pair. The depth of nesting is theoretically unlimited, though practical implementations impose limits based on computational resources, typically supporting several levels for most molecular structures. For example, deeper nesting might describe a branched alkyl chain attached to another branch, illustrating hierarchical complexity in molecular architectures.^[2]^[1] After a branch closes with a closing parenthesis, the notation resumes the main chain from the atom that initiated the branch. This is evident in CC(=O)O, which encodes acetic acid, where (=O) specifies a double-bonded oxygen branch from the second carbon, followed by the continuation to the terminal hydroxyl group. 
Branches can integrate with ring notations, but the parentheses primarily handle acyclic deviations.^[1]^[2] The core syntactic rule for branches stipulates that each opening parenthesis ( initiates a branch from the immediately preceding atom, while each closing parenthesis ) terminates the most recent open branch, pairing strictly in a last-in-first-out manner. This stack-based parsing ensures unambiguous structure reconstruction.^[2]^[1] For disconnected structures, such as salts, complexes, or mixtures, SMILES employs a period . as a separator between independent molecular components. The order of these components is arbitrary and does not imply connectivity. A canonical example is [Na+].[Cl-], representing sodium chloride in ionic form.^[1]^[2]
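The last-in-first-out pairing of parentheses and the role of the period can be illustrated with a short stack-based sketch in Python. It is a simplified assumption for acyclic strings with balanced parentheses: the function name chain_bonds is hypothetical, bond symbols are ignored, and ring closures are not handled.

```python
import re

# Atoms (bracketed or bare), parentheses, and the component separator.
ATOM_OR_PUNCT = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z*]|[().]")


def chain_bonds(smiles):
    """Return (atom count, list of bonded index pairs) for an acyclic SMILES."""
    bonds, stack = [], []
    previous = None   # index of the atom the next atom will bond to
    count = 0
    for tok in ATOM_OR_PUNCT.findall(smiles):
        if tok == "(":
            stack.append(previous)   # remember the branch point
        elif tok == ")":
            previous = stack.pop()   # resume from the branch point
        elif tok == ".":
            previous = None          # next atom starts a new component
        else:
            if previous is not None:
                bonds.append((previous, count))
            previous = count
            count += 1
    return count, bonds


print(chain_bonds("CC(O)(Cl)C"))   # (5, [(0, 1), (1, 2), (1, 3), (1, 4)])
print(chain_bonds("CC(=O)O"))      # (4, [(0, 1), (1, 2), (1, 3)])
print(chain_bonds("[Na+].[Cl-]"))  # (2, [])
```

In the first example all three substituents attach to atom 1, because each closing parenthesis restores the branch point before the next atom is read.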

Aromaticity Conventions

In SMILES, aromaticity for delocalized electron systems in rings is primarily indicated through the use of lowercase letters for atoms, distinguishing them from aliphatic counterparts. This convention applies to sp²-hybridized atoms such as carbon (c), nitrogen (n), oxygen (o), phosphorus (p), and sulfur (s) that participate in aromatic rings, so the alternating pattern of single and double bonds does not have to be written out.^[3] The lowercase notation simplifies representation by eliding bond details, relying on the parser to infer the delocalized nature from ring structure and atom types.^[17]

Aromatic bonds between these lowercase atoms can be written explicitly with a colon (:), but the symbol is usually omitted: a bond with no symbol between two aromatic atoms is taken to be aromatic. An explicit hyphen, by contrast, denotes a true single bond between aromatic atoms, as in biphenyl, c1ccccc1-c1ccccc1. Benzene is compactly written as c1ccccc1, where the ring-closure digit 1 denotes the cycle and the sequence of c atoms with implicit bonds conveys the aromatic sextet. The Kekulé form is an alternative representation of the same molecule that uses uppercase atoms and explicit alternating double bonds (=), such as C1=CC=CC=C1 for benzene; it makes bond localization explicit but yields longer strings and more than one possible encoding of the same structure.^[3] The aromatic lowercase form is generally preferred in canonical SMILES generation for its uniqueness and brevity, as multiple Kekulé variants can describe the same molecule.^[17]

SMILES aromaticity conventions impose structural rules on what counts as a delocalized system; the most common examples, such as benzene, are six-membered rings that satisfy Hückel's 4n+2 pi-electron rule (six pi electrons, n = 1). Heteroaromatic compounds follow analogous patterns, with pyridine represented as c1ccccn1, where the nitrogen (n) integrates into the aromatic cycle without disrupting the delocalized bonding.^[18] These rules extend to larger or fused systems, provided the atoms meet hybridization and electron count criteria defined in the specification.^[17]

A key aspect of these conventions is the role of software in handling aromaticity during SMILES parsing and generation. Tools apply algorithmic detection, often using an extended version of Hückel's rule, to identify aromatic patterns, validate lowercase notations, and convert Kekulé inputs to the aromatic form where appropriate.^[3] This process supports interoperability across cheminformatics systems, allowing both aromatic and Kekulé representations as input while standardizing output to the lowercase aromatic notation for consistency.
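The 4n+2 electron count behind these conventions can be shown with a toy Python check. This is a deliberately crude sketch: the per-atom pi-electron contributions in PI_ELECTRONS are simplifying assumptions for a single isolated ring, the function names are hypothetical, and real toolkits use more elaborate aromaticity models that differ from one another.

```python
# Assumed pi-electron contributions for atoms in one aromatic ring:
# atoms in a ring double bond give 1; lone-pair donors give 2.
PI_ELECTRONS = {"c": 1, "n": 1, "[nH]": 2, "o": 2, "s": 2}


def ring_pi_electrons(ring_atoms):
    """Sum the assumed pi-electron contributions around a ring."""
    return sum(PI_ELECTRONS[atom] for atom in ring_atoms)


def is_huckel(pi_electrons):
    """True when the count equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


benzene = ["c"] * 6                 # c1ccccc1
pyridine = ["c"] * 5 + ["n"]        # c1ccccn1
pyrrole = ["c"] * 4 + ["[nH]"]      # c1cc[nH]c1
cyclobutadiene = ["c"] * 4          # four pi electrons, not 4n + 2

for name, ring in [("benzene", benzene), ("pyridine", pyridine),
                   ("pyrrole", pyrrole), ("cyclobutadiene", cyclobutadiene)]:
    electrons = ring_pi_electrons(ring)
    print(name, electrons, is_huckel(electrons))
```

Benzene, pyridine, and pyrrole each total six pi electrons under these assumptions, while the four-atom ring totals four and fails the test.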

Advanced Specifications

Stereochemical Descriptors

SMILES incorporates stereochemical information to distinguish between isomers, particularly through isomeric SMILES strings that specify configurations around double bonds and chiral centers, enabling the representation of 2D and limited 3D stereochemistry without full coordinate data.^[1] This is achieved using directional symbols for bond orientations and chiral indicators for atomic configurations, with parsing relying on the traversal direction in the string to determine relative positions.^[2] Such descriptors are essential for accurately encoding molecules like alkenes and amino acids, where spatial arrangement affects chemical properties.^[19]

Double-bond stereochemistry, representing cis/trans or E/Z configurations, is denoted by the symbols / and \ as directional single bonds adjacent to the double bond (=). These indicate the relative orientation of substituents on the atoms connected by the double bond, where matching directions (both / or both \) signify trans geometry and opposing directions signify cis. For instance, the string F/C=C/F represents trans-1,2-difluoroethene, with the fluorine atoms on opposite sides, while F/C=C\F denotes the cis isomer.^[19] The specification requires explicit directionality on the bonds immediately preceding and following the double bond, and the parser interprets the geometry based on whether the directional bonds point in the same or opposite directions during string traversal.^[2]

Tetrahedral stereochemistry at chiral centers, typically carbon atoms with four different substituents, uses @ for an anticlockwise arrangement and @@ for a clockwise one, written after the atomic symbol inside the brackets. The configuration is defined relative to the order of neighbors in the SMILES string: looking from the first neighbor (the atom that precedes the center in the string) toward the chiral atom, the remaining three neighbors, taken in the order they are written, appear anticlockwise for @ and clockwise for @@. A hydrogen written inside the bracket counts as the neighbor immediately following the incoming atom. An example is N[C@@H](C)C(=O)O for L-alanine: viewed from the amino nitrogen, the hydrogen, the methyl group (C), and the carboxyl group (C(=O)O) appear clockwise.^[19] A chiral atom must be written in brackets with any hydrogen stated explicitly, and the notation applies to any tetrahedral atom, not just carbon.^[2]

Extensions for axial chirality in allenes and for square planar complexes build on these symbols. For allenes, which feature cumulative double bonds as in propadiene derivatives, stereochemistry at the central sp-hybridized carbon is indicated by @ or @@ following the atom, specifying the twist of the perpendicular planes formed by the substituents. For example, NC(Br)=[C@]=C(O)C denotes a specific enantiomer of an allene, where the substituents are oriented according to the chiral specification.^[2] Square planar geometry, common in coordination compounds, uses @SP1, @SP2, or @SP3, which describe the shape traced when the four ligands are visited in the order written: a U, a 4, or a Z, respectively. An example is [Pt@SP1](Cl)(Br)(I)N, in which the listed ligands trace a U around the platinum center. These notations ensure consistent parsing by maintaining directionality from the string's linear traversal.^[2]
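The same-direction versus opposite-direction rule for / and \ can be sketched for the simplest case. The Python below only handles strings of the exact form X/C=C/Y with organic-subset atoms on either side; the function name double_bond_geometry and the restricted pattern are assumptions for illustration, and general SMILES (branches, ring closures, directional bonds written in other positions) needs a full parser.

```python
import re

ATOM = r"(?:Br|Cl|[BCNOPSFI])"
# Exactly: substituent, directional bond, atom = atom, directional bond, substituent.
PATTERN = re.compile(rf"{ATOM}([/\\]){ATOM}={ATOM}([/\\]){ATOM}")


def double_bond_geometry(smiles):
    """Classify a simple X/C=C/Y style string as 'cis' or 'trans'."""
    match = PATTERN.fullmatch(smiles)
    if match is None:
        raise ValueError("only simple X/C=C/Y patterns are supported")
    first, second = match.groups()
    # Matching symbols place the substituents on opposite sides (trans).
    return "trans" if first == second else "cis"


print(double_bond_geometry("F/C=C/F"))    # trans
print(double_bond_geometry(r"F/C=C\F"))   # cis
print(double_bond_geometry(r"F\C=C\F"))   # trans
print(double_bond_geometry(r"F\C=C/F"))   # cis
```

Raw string literals are used for inputs containing a backslash so that Python does not treat it as an escape character.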

Isotopic Labels

In SMILES, isotopic labels are specified using a numeric prefix indicating the mass number, placed immediately before the atomic symbol within square brackets for the affected atom. This notation allows precise representation of specific isotopes without altering the core structural description. For instance, [13C] represents the carbon-13 isotope, while the absence of a prefix, as in plain C, leaves the isotope unspecified, which is conventionally read as the element at natural isotopic abundance.^[2]^[1]

The isotope prefix is a plain integer and may include leading zeros, which are ignored; thus [2H], [02H], and [002H] all denote deuterium (hydrogen-2). This rule applies to any element in the periodic table, enabling isotopic specification for organic and inorganic atoms alike. Common applications include hydrogen ([2H] for deuterium or [3H] for tritium), carbon ([13C]), nitrogen ([15N]), and oxygen ([17O] or [18O]), which are frequently used in labeled compounds. Examples from the specification include [2H]O[2H] for heavy water (deuterium oxide) and [235U] for uranium-235.^[2]^[1]

Isotopic designations in SMILES do not influence the atom's valence, hybridization, or bonding connectivity, as these properties are governed solely by the elemental symbol and its standard chemical behavior; the isotope serves only to distinguish mass variants for identification purposes. This design ensures compatibility with standard valence rules while supporting representations of isotopically substituted molecules, which are crucial in applications like nuclear magnetic resonance (NMR) spectroscopy and isotopic labeling studies for tracing metabolic pathways or reaction mechanisms. For example, [13CH4] specifies carbon-13 methane, useful in mass spectrometry or NMR experiments to probe molecular dynamics.^[2]^[1]

Isotope labels also interact with stereochemical descriptors when an isotope creates asymmetry by differentiating otherwise identical substituents. In such cases, the isotope contributes to the atom's identity for chirality determination, allowing specification of configurations that depend on mass differences. A representative example is [2H][C@H](F)Cl, which depicts a chiral chlorofluoromethane in which one hydrogen is replaced by deuterium, with the @ symbol denoting the tetrahedral stereochemistry at the carbon center. This capability is particularly relevant for studying isotopically induced chirality in biochemical or synthetic contexts.^[2]^[19]
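A short Python sketch shows how isotope prefixes, including ones with leading zeros, might be pulled out of a SMILES string. The function name isotope_labels and the simplified pattern are assumptions for this example; the pattern looks only at the start of each bracket atom and does not validate the remaining bracket contents.

```python
import re


def isotope_labels(smiles):
    """List (mass number, symbol) for every isotopically labelled atom."""
    found = re.findall(r"\[(\d+)([A-Za-z][a-z]?)", smiles)
    # int() drops leading zeros, so '002' and '2' give the same label.
    return [(int(mass), symbol) for mass, symbol in found]


print(isotope_labels("[2H]O[2H]"))   # [(2, 'H'), (2, 'H')]
print(isotope_labels("[002H]"))      # [(2, 'H')]
print(isotope_labels("[13CH4]"))     # [(13, 'C')]
print(isotope_labels("CCO"))         # []
```

Because the prefix is reduced to an integer, the three spellings of deuterium given above compare equal after parsing.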

Extensions for Complex Molecules

Extensions to SMILES are optional notations that go beyond the core specification to cover chemical reactions, polymeric structures, and substructure queries. They are not standardized in OpenSMILES, and their implementation differs across software tools.^[2] These extensions allow the representation of transformations and large assemblies that basic SMILES strings handle poorly, which supports applications in cheminformatics and materials science.^[1]

Reaction SMILES uses the '>' symbol to separate reactants, optional agents, and products, in that order. For instance, the hydration of ethylene is written C=C.O>>CCO: the left side lists the alkene and water as reactants, the empty middle field means no agents are given, and the right side shows ethanol as the product. Another example is the substitution reaction C=CCBr>>C=CCI, the conversion of allyl bromide to allyl iodide without specified agents.^[1] This notation supports multi-component reactions and is widely adopted in reaction databases and simulation software.^[1]

For polymers, extensions employ asterisks (*) to mark the endpoints of a repeating unit, allowing a concise description of chain-like macromolecules. Polyethylene, for example, is represented by the repeating unit [*]CC[*], where the asterisks indicate the sites of inter-unit connection.^[20] This approach, seen in tools like PSMILES, builds on Daylight SMILES syntax while accommodating the connectivity of long chains.^[21]

SMARTS (SMILES Arbitrary Target Specification) is a query language that extends SMILES with pattern-matching capabilities for substructure searches. It adds atom and bond primitives that act as wildcards, such as '*' to match any atom, 'A' and 'a' to match any aliphatic or aromatic atom, and '~' to match any bond. For example, the pattern C~O matches an aliphatic carbon joined to an aliphatic oxygen by a bond of any order, which is useful for querying oxygen-containing functional groups across molecular datasets.^[22] SMARTS also supports logical operators for more complex queries, making it essential for virtual screening and database filtering.^[22]

In the 2020s, ongoing proposals aim to expand SMILES support for biomolecules, such as representing peptides as extended chains with repeating amino acid units and modifications. The IUPAC SMILES+ initiative, launched in 2019, seeks to formalize these extensions into a comprehensive standard. As of 2025, the project is in final review stages, with a recommendation expected for publication in Pure and Applied Chemistry later in the year.^[12] Similarly, BigSMILES provides a structured notation for polymers that can extend to biopolymers like polypeptides, using descriptors for repeating units to capture sequence variability.^[23] These developments address limitations in core SMILES for handling the scale and diversity of biological macromolecules.^[23]
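The three-field layout of reaction SMILES described in this section can be made concrete with a short sketch. The function name `split_reaction` is hypothetical, and the code only splits the string on '>' and '.'; it does not validate the individual molecules.

```python
def split_reaction(reaction_smiles: str):
    """Split 'reactants>agents>products' into three lists of component SMILES."""
    fields = reaction_smiles.split(">")
    if len(fields) != 3:
        raise ValueError("a reaction SMILES needs exactly two '>' separators")

    def components(field: str):
        # '.' separates disconnected molecules; an empty field yields an empty list.
        return [part for part in field.split(".") if part]

    reactants, agents, products = (components(field) for field in fields)
    return reactants, agents, products


print(split_reaction("C=C.O>>CCO"))
print(split_reaction("C=CCBr>>C=CCI"))
```

An empty agent list corresponds to the `>>` form used in both examples from the text.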

Practical Examples

Simple Molecular Strings

Simple molecular strings in SMILES notation encode small, acyclic molecules by sequencing atomic symbols, with default single bonds between adjacent atoms and implicit hydrogen atoms added to satisfy standard valences (carbon: 4, nitrogen: 3, oxygen: 2).^[3] This approach relies on the organic subset of elements, where uppercase letters denote atoms without explicit charges or isotopes, and no rings or stereochemistry are indicated.^[17]

The simplest example is methane, represented as C. This single carbon atom is parsed as the only vertex in the molecular graph, with four implicit hydrogens attached to fulfill the tetravalent carbon, forming CH₄; no bonds are specified since there are no adjacent atoms.^[24]

For water, the SMILES string O denotes a single oxygen atom, interpreted as a graph vertex with two implicit hydrogens bonded to it, yielding H₂O; the parser applies oxygen's default valence of two and adds hydrogens accordingly. Ammonia is encoded as N, where the nitrogen atom is the graph's sole vertex, completed by three implicit hydrogens to match trivalent nitrogen, giving NH₃.

Ethanol's SMILES CCO maps to a linear chain graph: the first C is a carbon with three implicit hydrogens (CH₃-), connected by a default single bond to the second C (with two implicit hydrogens, -CH₂-), which bonds to O (with one implicit hydrogen, -OH); this sequential parsing builds the acyclic structure CH₃CH₂OH.

Ethene uses C=C, parsed as two carbon atoms connected by an explicit double bond: each C receives two implicit hydrogens (H₂C=CH₂), with the = symbol overriding the default single bond.

Acetone is represented by CC(=O)C, where parsing proceeds left to right: the first C (CH₃-) bonds singly to the second C (the carbonyl carbon, with no implicit hydrogens), from which a branch (=O) attaches oxygen via a double bond (no hydrogens on O), and the second C then connects to a final C (CH₃-); this constructs the graph CH₃C(=O)CH₃, using parentheses to denote the off-chain double bond.^[25]
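The implicit-hydrogen bookkeeping walked through above can be sketched in a few lines. This is a deliberately narrow illustration, not a general parser: the function name `implicit_hydrogens` is hypothetical, and it only accepts acyclic strings built from C, N, O, the bond symbols -, =, #, and parentheses, using the default valences quoted in the text.

```python
VALENCE = {"C": 4, "N": 3, "O": 2}     # default valences quoted in the text
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def implicit_hydrogens(smiles: str):
    """Return [(symbol, implicit_H), ...] for an acyclic C/N/O SMILES string."""
    symbols = []      # element symbol of each atom, in order of appearance
    used = []         # total bond order already consumed by each atom
    branch_stack = [] # atoms to return to when a branch closes
    previous = None   # index of the atom the next atom will bond to
    pending = 1       # order of the next bond; single unless a symbol says otherwise
    for ch in smiles:
        if ch in VALENCE:
            symbols.append(ch)
            used.append(0)
            current = len(symbols) - 1
            if previous is not None:
                used[previous] += pending
                used[current] += pending
            previous, pending = current, 1
        elif ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    # Hydrogens fill whatever valence the explicit bonds leave unused.
    return [(s, VALENCE[s] - u) for s, u in zip(symbols, used)]


for text in ("C", "O", "N", "CCO", "C=C", "CC(=O)C"):
    print(text, implicit_hydrogens(text))
```

The sketch does not reject over-valent input, so a string such as C=C=C=O=O would simply produce a negative count rather than an error.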

Elaborate Structure Illustrations

To illustrate the application of SMILES notation to more complex molecular structures, consider benzene, a fundamental aromatic hydrocarbon. The SMILES string for benzene is c1ccccc1, where lowercase letters denote aromatic atoms (carbon in this case) and the two occurrences of the digit 1 close a six-membered ring by bonding the first and last atoms. This compact notation captures the delocalized π-electron system without explicit double bonds: the aromatic atoms stand for a ring that could equally be written with alternating single and double bonds.

A more elaborate example is aspirin (acetylsalicylic acid), which combines an aromatic ring, branches, and functional groups. Its SMILES is CC(=O)Oc1ccccc1C(=O)O. Parsing begins with the acetyl group CC(=O)O, where the first C is a methyl carbon bonded to the carbonyl C, and (=O) is a branch carrying the double-bonded oxygen; the following O is the ester oxygen, which attaches to the aromatic ring c1ccccc1. The ring closes at the second 1, and the final C(=O)O is attached to that ring-closing carbon, which is adjacent to the ester-bearing carbon, giving the carboxylic acid group in the ortho position. The lowercase c atoms imply sp² hybridization within an aromatic ring, while uppercase C and O denote aliphatic atoms. The string thus combines branching with parentheses and ring closure to depict the ortho-substituted benzoic acid derivative.

For stereochemistry in complex structures, ibuprofen (2-(4-(2-methylpropyl)phenyl)propanoic acid) serves as a chiral example, with the biologically active S-enantiomer written as CC(C)CC1=CC=C(C=C1)[C@H](C)C(=O)O. Parsing starts with the isobutyl chain CC(C)C: a methyl carbon bonds to a methine carbon that carries a methyl branch (C) and continues to a methylene carbon, which connects to the para-substituted ring C1=CC=C(C=C1). Uppercase C in the ring indicates the Kekulé form with explicit double bonds (=); the aromatic form c1ccc(cc1) is equivalent. The chiral center is marked by [C@H]: with the neighbors listed in the order aryl, implicit H, methyl, carboxyl, the @ descriptor corresponds to the S configuration, and it is followed by the methyl branch (C) and the carboxylic acid C(=O)O. Because the descriptor is defined relative to the order in which neighbors are written, reordering the substituents in the string can change @ to @@ for the same enantiomer.

Caffeine, a purine alkaloid with fused rings and multiple branches, shows how these elements combine: CN1C=NC2=C1C(=O)N(C(=O)N2C)C. The string opens with the N-methyl group CN1, where the digit 1 on nitrogen opens the five-membered imidazole ring; it continues through C=N to C2, which opens the six-membered ring, and then to C1, which closes the imidazole ring. These two adjacent carbons are shared by both rings, which is how the fusion is expressed. The six-membered ring continues with a carbonyl C(=O), an N-methylated nitrogen whose branch holds the second carbonyl C(=O) and the nitrogen N2 that closes the ring and carries its own methyl group, and finally the methyl on the first of these nitrogens. The string is written in Kekulé form with explicit double bonds rather than lowercase aromatic atoms. It represents the xanthine core with three methyl substituents at positions 1, 3, and 7, and demonstrates how SMILES handles polycyclic systems by combining ring-closure digits with parentheses.
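The way ring-closure digits pair up atoms in these strings can be traced with a short sketch. The helper `ring_closures` and its token pattern are hypothetical and cover only bracket atoms, organic-subset atoms, and ring-closure labels; bonds and parentheses are skipped because they do not affect which atoms a label joins.

```python
import re

# Bracket atoms first, so digits inside [...] are never read as ring labels.
_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d{2}|\d")


def ring_closures(smiles: str):
    """Return (opening_atom_index, closing_atom_index) for each ring-closure bond."""
    open_rings = {}   # ring label -> index of the atom that opened it
    pairs = []
    atom_index = -1   # index of the most recently read atom
    for token in _TOKEN.findall(smiles):
        if token[0] == "%" or token[0].isdigit():
            label = int(token.lstrip("%"))
            if label in open_rings:
                pairs.append((open_rings.pop(label), atom_index))
            else:
                open_rings[label] = atom_index
        else:
            atom_index += 1
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return pairs


print(ring_closures("c1ccccc1"))                      # benzene
print(ring_closures("CC(=O)Oc1ccccc1C(=O)O"))         # aspirin
print(ring_closures("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))  # caffeine
```

Atoms are numbered from zero in order of appearance, so for caffeine the two closure bonds join atoms 1 and 5 (imidazole ring) and atoms 4 and 11 (six-membered ring).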

Applications and Implementations

Software Tools and Libraries

The Daylight SMILES Toolkit, developed by Daylight Chemical Information Systems, is the foundational proprietary library for handling SMILES notation, enabling the generation, parsing, and canonicalization of molecular strings through utility objects like streams and substructures.^[26] Originally created in the 1980s, it remains a core reference for SMILES implementation, and some of its components have influenced open-source alternatives.^[27]

RDKit, an open-source cheminformatics toolkit available in Python and C++, provides extensive support for SMILES parsing, writing, and canonicalization, including handling of stereochemistry and aromaticity as per the OpenSMILES specification.^[28] It supports molecule manipulation and fingerprint generation from SMILES inputs, making it widely used in computational chemistry workflows.^[28]

Open Babel, a free and open-source software suite, is a multi-format converter that processes SMILES strings for input and output, supporting extensions for radicals and implicit hydrogens while adhering to the OpenSMILES standard.^[29] It interconverts SMILES and over 100 other chemical file formats, aiding data exchange for molecular modeling.^[30]

For web-based applications, SmilesDrawer is a lightweight, dependency-free JavaScript library designed specifically for parsing SMILES strings and rendering 2D molecular structures client-side with high performance and low memory usage.^[31] Released under the MIT license, it supports customizable visualizations and is suitable for interactive browser environments.

Online validation and translation of SMILES are available through the NIH's CACTUS Chemical Identifier Resolver, a web service that converts SMILES inputs to structural depictions, identifiers, and other formats, thereby verifying syntactic and semantic correctness.^[32] This tool, hosted by the National Cancer Institute, processes queries without requiring software installation and supports batch operations for large datasets.^[33]

Integration in Cheminformatics and Machine Learning

In cheminformatics, SMILES strings serve as a foundational input for generating molecular fingerprints, which are binary vectors encoding structural features to enable efficient similarity searches across large chemical databases. These fingerprints, such as the Extended Connectivity Fingerprints (ECFP), are derived by parsing the SMILES representation into a molecular graph and identifying substructural patterns, allowing quantitative comparison via metrics like Tanimoto similarity. This approach supports virtual screening in drug discovery by identifying structurally analogous compounds, with studies demonstrating that SMILES-derived fingerprints achieve high recall rates in retrieving known actives from databases like PubChem, which as of 2025 stores over 119 million unique compounds primarily represented in SMILES format.^[34]^[35]

In machine learning applications, SMILES has become a preferred textual representation for training transformer-based models on molecular data, enabling self-supervised pretraining for downstream tasks like property prediction. For instance, ChemBERTa is a RoBERTa-based model family whose later version, ChemBERTa-2, was pretrained on up to 77 million PubChem SMILES strings; it uses masked language modeling to learn contextual embeddings that are reported to be competitive with established baselines on tasks such as toxicity classification. Recent work incorporates contrastive learning on SMILES datasets to make representations more robust; the CONSMI framework (2024) uses SMILES enumeration to generate positive pairs for contrastive objectives, yielding embeddings that improve molecular similarity tasks by 10-15% over non-contrastive baselines in evaluations on ZINC and PubChem subsets. Similarly, 2023-2025 studies on contrastive methods, such as SimSon, integrate multi-view learning across SMILES variants to capture structural invariances, improving property prediction by addressing canonicalization ambiguities.^[36]

SMILES integration in drug design prominently features generative models that output novel SMILES strings conditioned on desired properties, streamlining de novo molecule creation. Variational autoencoders and GANs, trained on SMILES corpora like ChEMBL, produce candidates with validity rates exceeding 90% after parsing and sanitization, as validated in reinforcement learning frameworks where rewards incorporate chemical feasibility. Post-generation validity checking is critical and typically uses RDKit or Open Babel parsers to detect syntactic errors or invalid valences.^[37]
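The Tanimoto comparison of fingerprints mentioned at the start of this section has a simple closed form: for two fingerprints with on-bit sets A and B, the score is |A ∩ B| / |A ∪ B|. The sketch below assumes each fingerprint is given as a Python set of on-bit indices; the function name and the bit values are made up for illustration, and real fingerprints would come from a toolkit.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not bits_a and not bits_b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    shared = len(bits_a & bits_b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return shared / (len(bits_a) + len(bits_b) - shared)


# Illustrative bit sets, not derived from real molecules.
query = {1, 2, 3}
candidate = {2, 3, 4}
print(tanimoto(query, candidate))
```

The score ranges from 0 (no shared bits) to 1 (identical bit sets), which is what makes it usable as a ranking criterion in similarity searches.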

Limitations and Comparisons

Inherent Constraints

One inherent limitation of SMILES is that the same molecular structure can be written as many different strings. The notation is not inherently canonical, which can lead to ambiguities in database searches and comparisons.^[1] For instance, 1-propanol can be encoded as CCCO or OCCC, among other variants, so an additional canonicalization step is required for uniqueness.^[1] A related issue concerns tautomers: different tautomeric forms, such as keto-enol pairs, are represented by distinct SMILES strings, hindering the unified depiction of equilibrium structures.^[38]

SMILES primarily captures two-dimensional connectivity and topology, providing only partial support for stereochemistry without incorporating full three-dimensional spatial data.^[39] Stereochemical features are denoted using symbols like @ for counterclockwise and @@ for clockwise chirality at tetrahedral centers, but these do not account for conformational dynamics or precise 3D coordinates.^[40]

As molecular size increases, SMILES strings grow lengthy and complex, often resulting in parsing challenges and errors when processing invalid or malformed inputs.^[23] Common parsing failures, as observed in tools like RDKit, include syntax violations, unclosed rings, mismatched parentheses, and valence inconsistencies, with invalid SMILES comprising up to 89% of outputs from certain generative models.^[41]

SMILES is fundamentally designed for static molecular graphs and lacks native support for quantum states, electronic configurations, or dynamic processes like bond vibrations.^[3] For macromolecules such as polymers or proteins, standard SMILES becomes inadequate, necessitating extensions like BigSMILES to handle repeating units and stochastic elements.^[23] Recent analyses, particularly in artificial intelligence contexts, critique SMILES for its structural ambiguities and limited expressiveness in machine learning tasks, proposing algebraic data types as a more robust alternative for encoding molecular hierarchies and properties.^[42]
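Two of the parsing failures listed above, unclosed rings and mismatched parentheses, are purely structural and can be detected without any chemistry. The following sketch uses a hypothetical function `syntax_problems`; it does not check valences or element symbols, for which a toolkit parser would still be needed.

```python
def syntax_problems(smiles: str) -> list:
    """List structural problems in a SMILES string; an empty list means none found."""
    problems = []
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels seen an odd number of times
    in_bracket = False   # True while inside a [...] bracket atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        two = smiles[i + 1:i + 3]
        if in_bracket:
            in_bracket = ch != "]"        # digits inside brackets are not ring labels
        elif ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                problems.append("unmatched ')'")
            else:
                depth -= 1
        elif ch == "%" and len(two) == 2 and two.isdigit():
            open_rings ^= {int(two)}      # toggle a two-digit ring label
            i += 2
        elif ch.isdigit():
            open_rings ^= {int(ch)}       # toggle a one-digit ring label
        i += 1
    if in_bracket:
        problems.append("unclosed '['")
    if depth > 0:
        problems.append("unclosed '('")
    problems.extend(f"unclosed ring {label}" for label in sorted(open_rings))
    return problems


for text in ("c1ccccc1", "C1CCCC", "CC(=O", "CC)C"):
    print(text, syntax_problems(text))
```

A check like this is cheap enough to run as a pre-filter on generated strings before they are handed to a full parser.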

Alternatives to SMILES Notation

The International Chemical Identifier (InChI) is an IUPAC-endorsed standard for encoding chemical structures in a unique, canonical string format that ensures a single representation per molecule, addressing the fact that SMILES permits multiple equivalent notations. InChI organizes information into distinct layers covering main connectivity, tetrahedral stereochemistry, and isotopic specifications, enabling precise differentiation of isomers and variants. While more verbose and less intuitive for manual input than SMILES, this layered approach supports robust database storage and retrieval without normalization steps.^[43]^[44]

In contrast, the MOL and SDF file formats provide connection table representations that explicitly include atomic coordinates for 2D or 3D conformations, along with bond details and optional properties. Originating from MDL Information Systems, MOL files describe single molecules, while SDF extends this to multiple entries with metadata. This makes them well suited to structure visualization and molecular dynamics, at the cost of larger file sizes than the compact linear strings of SMILES. These coordinate-based formats preserve spatial arrangements critical for applications like docking simulations, though they demand more storage and parsing overhead.^[45]^[46]

SMILES stands out for its human-readable syntax, which simplifies manual creation and editing by chemists, and for its straightforward generation from graphical depictions. A key drawback is the absence of built-in canonicalization, necessitating additional processing to achieve uniqueness akin to InChI. In machine learning workflows, the linear, string-based form of SMILES supports rapid tokenization and input feeding into models, outperforming graph-oriented formats like RDF triples used in semantic chemical databases, where query resolution involves heavier relational traversals.^[44]^[47]^[48] As of 2024, reviews of generative models in de novo drug design continue to favor SMILES for its seamless integration with transformer architectures and broad software ecosystem, even as alternatives like SELFIES, which is designed to produce only valid molecular strings, offer improved syntactic reliability during optimization tasks.^[13]^[49]
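The tokenization step that makes SMILES convenient for sequence models can be sketched with a single regular expression. The pattern below is an illustrative assumption, not the tokenizer of any particular model: it keeps bracket atoms, the two-letter halogens Cl and Br, and two-digit ring labels as single tokens, and treats every other character as its own token.

```python
import re

# Order matters: multi-character tokens are tried before the single-character fallback.
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens suitable for a sequence model vocabulary."""
    tokens = _SMILES_TOKEN.findall(smiles)
    # The tokens must reassemble into the input; anything else means lost characters.
    if "".join(tokens) != smiles:
        raise ValueError("tokenization dropped characters")
    return tokens


print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))
print(tokenize("ClCC[13CH3]"))
```

Keeping bracket atoms whole means that isotope, chirality, and charge information stays attached to its atom instead of being scattered across several tokens.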

References

[1]
Daylight Theory: SMILES
SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical method using printable characters) for entering and representing ...
[2]
SMILES, a chemical language and information system. 1 ...
SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules ... Open PDF. Journal of Chemical Information and ...
[3]
a tribute to David Weininger, 1952–2016
Feb 3, 2018 · Previously, while at the EPA Duluth station (1978–1982), he had invented SMILES, the first line notation for chemical structures easily readable ...
[4]
[PDF] Appendix F SMILES Notation Tutorial
What is SMILES? SMILES is the “Simplified Molecular Input Line Entry System,” which is used to translate a chemical's three-dimensional structure into a string ...
[5]
SMILES | DrugBank Help Center
SMILES is a line notation system used for describing the structure of chemical species using short ASCII strings.
[6]
SMILES Tutorial - Daylight
SMILES Tutorial. Table of Contents. 1. Introduction 2. Atoms 3. Properties of Atoms 4. Bonds 5. Branching 6. Rings 7. Aromaticity 8. Stereo Isomerism 9 ...
[7]
OpenSMILES specification
May 15, 2016 · It is hosted under the banner of the Blue Obelisk project, with the intent to solicit contributions and comments from the entire computational ...
[8]
[PDF] IUPAC SMILES+ - InChI Trust
OpenSMILES, a Blue Obelisk community driven effort created a non-proprietary open specification of SMILES (2007) [2]. ○ OpenSMILES clarified some ...
[9]
IUPAC SMILES+ Specification
This project seeks to establish a formalized recommended up-to-date specification of the SMILES format.
[10]
Transformer-based models for chemical SMILES representation
Oct 30, 2024 · Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules ...
[11]
Conference feedback: AI in Chemistry 2023 | Oxford Protein ...
Oct 10, 2023 · To kick off the series of shorter talks at the conference, Daniel Probst spoke about his work on the explainable prediction of catalyzing ...
[12]
[PDF] Parsing and Conversion of SMILES-Strings to Molecular Graphs
Oct 29, 2010 · To define a context–free grammar (CFG) a mathematical structure is needed. A. CFG is defined by a tuple G = (V, Σ, S, P). Where V is a set of ...
[13]
[PDF] arXiv:2009.13946v1 [cs.LG] 29 Sep 2020
Sep 29, 2020 · Grammar rules are obtained from the OpenSMILES specification [James et al., 2016], which denotes how the. SMILES representation was formed based ...
[14]
SMILES Tutorial: Bonds - Daylight
Single, double, triple, and aromatic bonds are represented by the symbols `-', =', `#', and `:', respectively.
[15]
[PDF] OpenSMILES specification
May 15, 2016 · SMILES was originally developed as a proprietary specification by Daylight Chemical Information Systems. Since the intro- ...
[16]
SMILES Tutorial: Conventions - Daylight
There is no single rigorous definition of aromaticity in chemistry. To a synthetic chemist, aromaticity implies something about reactivity; to a ...
[17]
SMILES Tutorial: Isomerism - Daylight
These symbols indicate relative directionality between the connected atoms and have meaning only when they occur on both atoms which are double bonded.
[18]
Guide - Polymer Genome
... )=CC=C1 . A SMILES string used for Polymer Genome represents the repeating unit of a polymer, which has 2 dangling bonds for linking with the next repeating ...
[19]
PSMILES
PSMILES strings are very useful for data-driven polymer discovery, design or prediction task. A PSMILES string follows the daylight SMILES syntax defined at ...
[20]
4. SMARTS - A Language for Describing Molecular Patterns - Daylight
SMARTS is a language that allows you to specify substructures using rules that are straightforward extensions of SMILES.
[21]
BigSMILES: A Structurally-Based Line Notation for Describing ...
Sep 12, 2019 · In BigSMILES, polymeric fragments are represented by a list of repeating units enclosed by curly brackets. The chemical structures of the ...
[22]
Methane | CH4 | CID 297 - PubChem
Methane | CH4 | CID 297 - structure, chemical names, physical and chemical properties, classification, patents, literature, biological activities, ...
[23]
Acetone | CH3-CO-CH3 | CID 180 - PubChem - NIH
2.1.4 SMILES. CC(=O)C. Computed by OEChem 2.3.0 (PubChem release 2025.04.14). PubChem. 2.2 Molecular Formula. C3H6O. Computed by PubChem 2.2 (PubChem release ...
[24]
SMILES TM Toolkit - Daylight>Products
The SMILESTM Toolkit is a chemical information programming library that supports a number of utility objects (streams, sequences, paths, substructs). It used ...
[25]
Daylight Chemical Information Systems
The Daylight Toolkit enables companies to build applications to add a broad range of cheminformatics capabilities to environments of any scale.
[26]
The RDKit Book — The RDKit 2025.09.1 documentation
The ATTCHORD attribute must have a specification for each bond that comes from the macro atom. The specification is contained between parentheses, and the ...
[27]
rdkit - PyPI
pip install rdkit. Released: Oct 6, 2025. A collection of chemoinformatics and machine-learning software written in ...
[28]
SMILES format (smi, smiles) - Open Babel
Open Babel implements the OpenSMILES specification. It also implements an extension to this specification for radicals. Note that the l <atomno> option, used ...
[29]
User Guide — Open Babel openbabel-3-1-1 documentation
SMILES extensions for radicals · Other Supported Extensions · Contributing to Open Babel · Overview · Developing Open Babel · Documentation · Adding a new test ...
[30]
reymond-group/smilesDrawer - GitHub
A small, highly performant JavaScript component for parsing and drawing SMILES strings. Released under the MIT license. - reymond-group/smilesDrawer.
[31]
CACTUS Online SMILES Translator - NCI/CADD - NIH
[32]
NCI/CADD Chemical Identifier Resolver - NIH
This service works as a resolver for different chemical structure identifiers and allows one to convert a given structure identifier into another ...
[33]
PubChem 2025 update - Oxford Academic
Nov 18, 2024 · With additions from over 130 new sources, PubChem contains >1000 data sources, 119 million compounds, 322 million substances and 295 million ...
[34]
Getting Started with the RDKit in Python
This document is intended to provide an overview of how one can use the RDKit functionality from Python. It's not comprehensive and it's not a manual.
[35]
CONSMI: Contrastive Learning in the Simplified Molecular Input ...
Jan 19, 2024 · Here, we describe a contrastive learning framework using SMILES enumeration to learn more comprehensive potential representations of SMILES.
[36]
Deep reinforcement learning for de novo drug design - Science
Generative models are trained with a stack-augmented memory network to produce chemically feasible SMILES strings, and predictive models are derived to forecast ...
[37]
SELFIES and the future of molecular string representations - PMC
This allows for tautomers of the same molecule to be represented by the same InChI string, while with the Smiles framework, each tautomer is represented by a ...
[38]
Stereochemistry and Atom Parity in SMILES | Depth-First
May 4, 2020 · This article explains the SMILES stereochemical notation system in detail. Atom Parity. SMILES expresses stereochemical configuration through ...
[39]
Jmol SMILES and Jmol SMARTS: specifications and applications
Sep 26, 2016 · This article focuses on the development of SMILES and SMARTS dialects that can be used specifically in the context of a 3D molecular ...
[40]
UnCorrupt SMILES: a novel approach to de novo design
Feb 14, 2023 · To better understand the invalid SMILES the parsing errors captured by the RDKit were classified into six different error types (Fig. 1).
[41]
[2501.13633] Representation of Molecules via Algebraic Data Types
Jan 23, 2025 · The paper introduces a novel molecular representation using Algebraic Data Types (ADTs), which are composite data structures formed through the ...
[42]
InChI, the IUPAC International Chemical Identifier
May 30, 2015 · Moreover, for SMILES, the canonicalization algorithm was published ... published when InChIKey was introduced in 2007, and the statement ...
[43]
A standard method to generate canonical SMILES based on the InChI
Sep 18, 2012 · I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations.
[44]
MOL file format (MT06966) - IUPAC Gold Book
The MOL file format encodes chemical structures, substructures and conformations as text-based connection tables. It is used by MDL Information Systems Inc.
[45]
2.5: Structural Data Files - Chemistry LibreTexts
Aug 21, 2022 · There are a variety of file formats and the most common are based on the MDL Molfile, of which V2000 is the most common, although V3000 is also ...
[46]
From SMILES to Graphs: The Next Frontier in ML-Driven ... - Quantori
Mar 5, 2025 · Graph-based formats offer a richer, more flexible approach, opening the door to advanced generative modeling techniques like diffusion models.
[47]
Linking the Resource Description Framework to cheminformatics ...
Mar 7, 2011 · Converting RDF expressed molecular data, such as SMILES strings, into chemical graphs was done using the Chemistry Development Kit (CDK) [21, 22] ...
[48]
Invalid SMILES are beneficial rather than detrimental to chemical ...
Mar 29, 2024 · Here I provide causal evidence that the ability to produce invalid outputs is not harmful but is instead beneficial to chemical language models.