Simplified Molecular Input Line Entry System
The Simplified Molecular Input Line Entry System (SMILES) is a line notation for representing the structure of chemical molecules and reactions as compact ASCII strings composed of atomic symbols, digits for ring closures, and symbols for bonds, branches, and stereochemistry.[1] The OpenSMILES specification defines an open standard for the language.[2] Developed to facilitate computer processing of chemical information, SMILES encodes molecular topology in a human-readable yet machine-parsable format, allowing unambiguous description of connectivity, aromaticity, and isomerism without requiring graphical input.[3]
SMILES was initiated by chemist David Weininger in 1986 while working at the U.S. Environmental Protection Agency's Mid-Continent Ecology Division Laboratory in Duluth, Minnesota, as part of efforts to create an efficient, chemist-friendly language for chemical databases and modeling software.[1] The methodology and encoding rules were first detailed in a seminal 1988 paper published in the Journal of Chemical Information and Computer Sciences, where it was described as a system "designed for modern chemical information processing" based on principles of atomic valence and graph theory.[3] Weininger, who passed away in 2016, envisioned SMILES as an open, extensible standard; subsequent refinements, including canonicalization for unique string generation and extensions for reactions (SMIRKS) and substructure searching (SMARTS), were advanced by Daylight Chemical Information Systems in the early 1990s.[4]
Key features of SMILES include its simplicity (basic organic molecules like ethanol are denoted as "CCO") and its flexibility for complex structures, such as rings (e.g., cyclohexane as "C1CCCCC1") and stereocenters (using @ or / symbols).[5][2] Unlike graphical formats, SMILES prioritizes linear representation, making it well suited to text-based storage, search, and exchange in cheminformatics applications, including drug discovery databases like PubChem and ChEMBL.[6][7][8] Its adoption has grown due to interoperability with software tools for molecular generation, property prediction, and virtual screening, though it cannot fully capture three-dimensional conformations without extensions.
History and Development
Origins and Creation
The Simplified Molecular Input Line Entry System (SMILES) was developed by David Weininger in 1986 while working at the U.S. Environmental Protection Agency's Mid-Continent Ecology Division Laboratory in Duluth, Minnesota, where it emerged as a compact and human-readable alternative to the cumbersome connection tables traditionally used to represent molecular structures in computational chemistry.[1][3] The design was completed while Weininger was at Pomona College, and subsequent implementation and refinements were carried out at Daylight Chemical Information Systems, Inc., starting in the late 1980s. This notation addressed the need for a streamlined method to encode chemical information, enabling easier manipulation and storage of molecular data in early cheminformatics applications.[1]
The primary motivation behind SMILES was to facilitate the exchange of molecular structures across diverse software systems in cheminformatics, reducing the complexity associated with graphical or tabular formats. Weininger drew inspiration from predecessor line notations, notably the Wiswesser Line Notation (WLN) introduced in the 1940s, but sought to create a more intuitive and versatile system that avoided WLN's rigid, mnemonic-based rules while supporting broader structural representations.[3] By prioritizing simplicity and portability, SMILES aimed to bridge the gap between human interpretation and machine processing of chemical data.[3]
SMILES was first detailed in a seminal publication by Weininger in 1988, appearing in the Journal of Chemical Information and Computer Sciences, which outlined its encoding rules and methodology.[3] Central to its design is the use of linear character strings to depict two-dimensional molecular graphs, omitting explicit coordinate information and recording only atomic connectivity and basic stereochemistry.[3] This approach ensured that SMILES strings could be generated, parsed, and interconverted efficiently without reliance on visual or positional data.[3]
Evolution and Standardization
Following the initial introduction of SMILES in 1988, David Weininger and collaborators at Daylight Chemical Information Systems expanded the notation throughout the 1990s to address ambiguities and enhance its applicability. These refinements included detailed rules for representing aromaticity, where lowercase letters denote aromatic atoms and bonds are implied as alternating single and double, and stereochemistry, using symbols like "@" for chiral centers and "/" and "\" for double-bond configurations. A key advancement in canonicalization was introduced in a 1989 publication, where algorithms leveraging graph isomorphism produced unique SMILES strings, reducing duplicates in large-scale molecular inventories.[9] These expansions were documented in Daylight's comprehensive tutorials and theory manuals, which became de facto references for implementers seeking consistent parsing and generation of SMILES strings.[10][1]
A pivotal step in standardization occurred in 2007, when the Blue Obelisk open-source cheminformatics community initiated the OpenSMILES project to create a non-proprietary, interoperable specification for SMILES. The effort culminated in the 2012 OpenSMILES specification, which clarified encoding rules, resolved variations in aromaticity perception, and promoted canonical SMILES generation to ensure unique string representations across software tools, fostering widespread adoption in databases and cheminformatics pipelines. The specification emphasized backward compatibility while standardizing features like ring closures and branching to improve data exchange in computational chemistry.[2][11]
In the 2020s, SMILES saw further alignment with international standards through IUPAC's ongoing project to formalize an updated specification, known as SMILES+, aimed at enhancing notation consistency without altering the core syntax. As of January 2025, the project reported major accomplishments from 2019-2024, including cheminformatics toolkit comparisons, with ongoing plans for further standardization. This integration builds on OpenSMILES by incorporating IUPAC guidelines for stereochemistry and tautomer handling, ensuring SMILES remains compatible with emerging standards like InChI.[12]
Recent milestones include 2023 cheminformatics community discussions on adapting SMILES for AI-driven molecular modeling, highlighting its role in training generative models while noting the robustness of the unchanged core syntax for machine learning tasks like de novo design. These conversations, featured in conferences and reviews, underscore the enduring utility of SMILES without necessitating major revisions.[13][14]
Formal Foundations
Graph-Based Structural Definition
In the Simplified Molecular Input Line Entry System (SMILES), molecular structures are represented as undirected graphs in which atoms are vertices and chemical bonds are edges. Single bonds are implied by default or written with an explicit hyphen, double bonds are written with an equals sign, and triple bonds with a hash mark; aromatic bonds are usually implied by lowercase atom symbols, or written out in Kekulé form as alternating single and double bonds. This graph-theoretic foundation lets the basic notation capture the connectivity and topology of a molecule without spatial coordinates, focusing on the abstract relational structure.[3]
Valence rules in SMILES ensure that each atom achieves its standard valence through explicit bonds and implicit hydrogens, which the parser infers to fill remaining valences without adding vertices to the written graph. For instance, carbon has a valence of four, so an isolated "C" denotes methane (CH₄) with four implicit hydrogens. Organic subset atoms (such as C, N, and O) follow predefined valences. Atoms written in brackets can carry explicit charges, isotopes, and hydrogen counts, and no hydrogens are inferred for them: any hydrogens must be stated inside the brackets. Omitting hydrogen vertices and edges in most cases keeps the graph simple and makes parsing and storage efficient.[3]
The SMILES string encodes molecular connectivity through a linear traversal of the graph, typically following a main chain with branches denoted by parentheses and ring closures indicated by matching digits. The traversal corresponds to a depth-first search (DFS) that starts from an arbitrary atom, explores neighbors in sequence, and backtracks to represent branches or cycles. Formally, a SMILES string serializes a spanning tree of the graph in DFS order: each atom symbol records a vertex, each written or implied bond between consecutive atoms records a tree edge, and each pair of ring-closure digits records one of the remaining edges, so the full topology can be reconstructed on decoding.
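The depth-first serialization described above can be sketched for the simplest case, an acyclic molecule with only single bonds. This is a minimal illustration, not a canonical SMILES writer: the function name `tree_to_smiles` and the adjacency-list input format are assumptions made for the example, and ring closures, bond orders, and bracket atoms are deliberately left out.

```python
def tree_to_smiles(symbols, neighbors, start=0):
    """Write a SMILES string for an acyclic, single-bonded molecular graph.

    symbols   -- list of atom symbols, indexed by atom number
    neighbors -- dict mapping each atom number to a list of bonded atom numbers
    start     -- atom at which the depth-first traversal begins (arbitrary)
    """
    visited = set()

    def visit(atom):
        visited.add(atom)
        children = [n for n in neighbors[atom] if n not in visited]
        text = symbols[atom]
        # Every child except the last becomes a parenthesised branch;
        # the last child continues the main chain.
        for child in children[:-1]:
            text += "(" + visit(child) + ")"
        if children:
            text += visit(children[-1])
        return text

    return visit(start)


# 2-propanol: carbon 1 is bonded to carbon 0, oxygen 2, and carbon 3.
symbols = ["C", "C", "O", "C"]
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(tree_to_smiles(symbols, neighbors))           # CC(O)C
print(tree_to_smiles(symbols, neighbors, start=2))  # OC(C)C
```

Starting the traversal from a different atom yields a different but equally valid string for the same graph, which is why canonicalization is needed when unique strings are required.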
Rings are handled by breaking one bond in each cycle and labeling its two end atoms with the same digit, so the written string is a tree while the decoded graph contains every cycle. By design, this basic encoding excludes three-dimensional coordinates and, unless the optional isomeric descriptors are used, stereochemistry, emphasizing topological isomorphism over geometric detail.[3]
Context-Free Language Specification
The syntax of the Simplified Molecular Input Line Entry System (SMILES) is formally specified by a context-free grammar, which allows linear strings to be parsed into molecular structures in a systematic way. This formalization ensures that SMILES strings can be generated and validated mechanically, mapping directly to the connectivity and hierarchy of chemical graphs. The specification, outlined in the OpenSMILES standard, employs a Backus-Naur Form (BNF)-style grammar to define valid constructs; the language is not regular, because it accommodates nested elements such as branches.[2]
The core grammar rules are structured around key non-terminals such as atoms, bonds, chains, branches, and ring closures, with productions that recursively build molecular expressions. For instance, the start symbol (often denoted as molecule or S) derives from an initial atom, extended by optional bonds to subsequent atoms, branches enclosed in parentheses, or ring closures via digit pairs; representative productions include S → atom | S bond atom | S branch | ringclosure, where branches follow rules like branch → (chain) to handle side chains. These rules permit linear traversal of the molecule while supporting hierarchical nesting. Valence limits and the pairing of ring-closure labels are not expressed by the productions themselves; they are checked as semantic constraints once the string has been parsed. The full set of productions, detailed in the OpenSMILES document, covers over 20 non-terminals to encompass organic subsets, aromaticity, and extensions like isotopes, without introducing context-sensitive dependencies into the syntax.[2]
Parsing SMILES strings leverages the context-free nature of the grammar, commonly via recursive descent algorithms that process atoms sequentially and recurse into branches upon encountering parentheses. Ring closures are tracked separately in a lookup table keyed by label: the first occurrence of a label records the opening atom, and the second occurrence closes the ring and frees the label. A table is used rather than a stack because ring labels need not close in last-in, first-out order, as in C1CC2CCC1C2, where ring 1 closes before ring 2.
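Before any grammar-driven parse, the terminal symbols have to be separated. The sketch below is one way to do that with a regular expression; it is an illustrative tokenizer, not the OpenSMILES reference grammar. The token names and the `tokenize` function are assumptions made for the example, and bracket atoms are kept as opaque tokens rather than being validated.

```python
import re

# (token name, pattern) pairs; order matters, e.g. "Cl" must be tried before "C".
TOKEN_SPEC = [
    ("bracket_atom", r"\[[^\]]*\]"),        # e.g. [Na+], [13CH4]
    ("organic_atom", r"Br|Cl|[BCNOPSFI]"),  # bracket-free organic subset
    ("aromatic_atom", r"[bcnops]"),
    ("wildcard", r"\*"),
    ("bond", r"[-=#$:/\\]"),
    ("ring_closure", r"%\d\d|\d"),
    ("branch_open", r"\("),
    ("branch_close", r"\)"),
    ("dot", r"\."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(smiles):
    """Split a SMILES string into (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {smiles[pos]!r} at position {pos}"
            )
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("CC(=O)O"))
```

A recursive descent parser would then consume this token list, recursing on `branch_open` and returning on `branch_close`.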
The OpenSMILES specification recommends that parsers support at least 100 levels of nesting for branches and 1000 ring closures. Parsing yields a tree in which nodes represent atoms or bonds, branches form subtrees, and rings are linked by back-edges, so the derivation contains no cycles apart from the explicit ring-closure notations. The equivalence to recognition by a pushdown automaton underscores the grammar's power, as the stack provides the memory needed for nested branches without requiring full graph context during initial parsing.[2][15]
Although the grammar generates multiple valid strings for the same molecule, owing to alternative orders of branch or ring traversal, the language remains well defined and each string parses to a unique graph. Canonical SMILES variants address this non-uniqueness by enforcing a standardized traversal, such as depth-first with specific tie-breaking rules for symmetry, yielding a single representative string per structure. The existence of several strings per molecule does not compromise parsing reliability, as all variants derive from the same grammar and map isomorphically to the molecule.[2]
The context-free formalism is essential, as the language is not regular: nested branches introduce arbitrary-depth recursion, akin to balanced parentheses, which a finite automaton cannot track without a stack. This level of expressiveness is nevertheless sufficient for molecular complexity, where nesting depths rarely exceed practical limits, avoiding the need for more powerful formalisms such as context-sensitive grammars. Widely used implementations in cheminformatics libraries confirm the grammar's adequacy for efficient parsing of real-world chemical datasets.[2]
Basic Notation Elements
Atomic Representations
In SMILES, atoms are primarily represented using their standard atomic symbols from the periodic table, with specific conventions to ensure compactness and readability. The system distinguishes between an "organic subset" of common elements and more general atomic specifications, allowing for efficient notation in chemical structures.[3]
The organic subset includes the elements boron (B), carbon (C), nitrogen (N), oxygen (O), phosphorus (P), sulfur (S), fluorine (F), chlorine (Cl), bromine (Br), and iodine (I). These atoms can be written without enclosing brackets when they carry no formal charge and their hydrogen count follows from standard valence rules. For example, the symbol 'C' represents a carbon atom with four bonds, denoting methane (CH₄) in isolation, while 'N' denotes nitrogen with three bonds, as in ammonia (NH₃). This bracket-free notation applies only to uncharged atoms in the organic subset with normal valences, and it is the main way hydrogen atoms are handled: they are left implicit to keep strings short, especially in organic contexts.[1][2]
For atoms outside the organic subset or those requiring additional properties, SMILES uses square brackets to enclose the full specification. Inside the brackets, an optional isotope mass number precedes the atomic symbol, which is followed by optional descriptors for stereochemistry, hydrogen count, and charge. Because SMILES strings are plain ASCII, charges and counts are written on the line rather than as superscripts or subscripts. Examples include [Na+] for the sodium ion, [H] for an explicit hydrogen atom, and [Fe] for iron. Charges are indicated by a sign (+ or -) optionally followed by a number, such as [O-2] for the oxide ion or [NH4+] for ammonium. There is no dedicated radical symbol; a radical is expressed through a bracket atom with a reduced hydrogen count, as in [CH3] for the methyl radical. Isotopes precede the symbol, as in [14C]. This bracketed format is mandatory for metals and for any organic subset atom with unusual properties.[1][3]
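The field order inside a bracket atom (isotope, symbol, stereo mark, hydrogen count, charge) lends itself to a regular expression. The following is a simplified sketch under stated assumptions: the pattern and the `parse_bracket_atom` helper are hypothetical names for this example, element symbols are not checked against the periodic table, and less common features such as atom classes, doubled charge signs like `++`, and extended chirality tags like `@SP1` are not handled.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"                    # optional mass number
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)"  # element, aromatic symbol, or wildcard
    r"(?P<chiral>@@?)?"                       # optional @ or @@
    r"(?P<hcount>H\d?)?"                      # optional H, H2, H3, ...
    r"(?P<charge>[+-]\d*)?\]"                 # optional +, -, +2, -2, ...
)


def parse_bracket_atom(text):
    """Split a bracket atom such as '[13CH4]' into its fields."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")

    hcount = match["hcount"]
    if hcount is None:
        hydrogens = 0            # bracket atoms get no implicit hydrogens
    elif hcount == "H":
        hydrogens = 1
    else:
        hydrogens = int(hcount[1:])

    charge = match["charge"]
    if charge is None:
        charge_value = 0
    elif len(charge) == 1:       # bare '+' or '-'
        charge_value = 1 if charge == "+" else -1
    else:
        charge_value = int(charge)

    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "chirality": match["chiral"],
        "hydrogens": hydrogens,
        "charge": charge_value,
    }


print(parse_bracket_atom("[NH4+]"))
print(parse_bracket_atom("[13CH4]"))
```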
For atoms in the organic subset, the number of implicit hydrogens is the atom's standard valence minus the sum of the bond orders written for it in the SMILES string, assuming a neutral charge. Standard valences are boron (3), carbon (4), nitrogen (3 or 5), oxygen (2), phosphorus (3 or 5), sulfur (2, 4, or 6), and the halogens (1); where several valences are listed, the lowest one that accommodates the written bonds is used. For instance, ethanol is written as CCO, where the first C has three implicit H (CH₃-), the second C has two implicit H (-CH₂-), and O has one implicit H (-OH). Hydrogens can be made explicit by writing [H] as a separate atom, as in C[H], or by giving a count after the symbol inside brackets, such as [CH4] for methane. Bracket atoms receive no implicit hydrogens, so for metals and other bracketed atoms any hydrogens must be written explicitly.[1][2]
Special cases include the wildcard '*', which stands for an unspecified atom and, like organic subset atoms, can be written without brackets. It is used mainly in pattern matching and queries, such as substructure searches. These conventions accommodate diverse chemical entities while keeping the notation context-free.[2][1]
Bond Types and Connectivity
In SMILES notation, the default bond between adjacent atoms is a single covalent bond, which is implied without any symbol; for instance, the string "CC" represents ethane (C₂H₆), where the two carbon atoms are connected by a single bond.[16][17] This convention simplifies the representation of linear hydrocarbon chains, assuming standard valence rules for organic atoms unless otherwise specified.[3]
Explicit bond symbols denote higher bond orders or specific bond types and are written between the two atoms they join. Single bonds can be explicitly indicated with a hyphen "-", double bonds with "=", triple bonds with "#", and aromatic bonds with ":". For example, "C=O" denotes formaldehyde (H₂C=O), where the "=" symbol specifies the double bond between carbon and oxygen.[10][16] Aromatic bonds require lowercase atomic symbols for the connected atoms, such as "c:c" for an aromatic bond between two carbon atoms, as in benzene fragments, ensuring compatibility with valence models that treat aromaticity as a delocalized bond order of 1.5.[17] Quadruple bonds, denoted by "$", are supported in extended specifications, though such bonds are rare in common organic molecules.[17]
Connectivity in SMILES is established through linear juxtaposition of atomic symbols, forming chains where each pair of consecutive atoms is linked by the specified or default bond. Parsers infer implicit hydrogens to satisfy standard atomic valences (e.g., carbon's valence of 4), and many reject or flag atoms whose written bonds exceed the usual limits.[3][1] Unsaturated chains need no symbols beyond the bond characters themselves: propene is "C=CC", and a conjugated diene such as 1,3-butadiene is "C=CC=C", with the alternation of double and single bonds written out directly.[16]
Structural Features
Ring Systems
In SMILES, cyclic structures are denoted by ring closure labels, which connect non-adjacent atoms in the linear notation to form rings. A ring is specified by placing a single digit (conventionally 1 to 9; OpenSMILES also accepts 0) immediately after the atomic symbol of the atom where the ring bond opens, and repeating the same digit after the atomic symbol where the ring closes, thereby linking those two atoms with a bond. This approach effectively "breaks" the cycle at one point and labels the endpoints numerically for reconnection. For instance, the SMILES for cyclopropane is C1CC1, where the first carbon is followed by 1 to open the ring, the second and third carbons continue the chain, and the final 1 after the third carbon closes the ring to the first carbon via a single bond by default.[1]
The ring closure digit applies to the immediately preceding atom, and a bond type can be specified by inserting the bond symbol (such as = for double or # for triple) before the digit, altering the nature of the closing bond. Without a symbol, the bond defaults to single for aliphatic atoms or aromatic for lowercase symbols in applicable contexts. Multiple ring closures can originate from or terminate at the same atom by appending multiple digits sequentially after it, enabling the representation of fused, bridged, or spiro systems. For example, norbornane, a bridged bicyclic hydrocarbon, is expressed as C1CC2CCC1C2, where digit 1 closes the six-membered ring and digit 2 closes the one-carbon bridge, with the shared atoms defining the bridgehead connections implicitly through the traversal order.[1]
Structures with multiple independent or fused rings use a distinct digit for each ring that is open at the same time, so single digits limit the number of simultaneously open rings to the number of available digits. Once a ring is closed by reusing a digit, that number becomes available for a new ring elsewhere in the string.
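The open/close bookkeeping for ring labels, including reuse of a label after it has been closed, can be sketched with a dictionary keyed by label. This is a minimal illustration under assumptions: `ring_bonds` is a hypothetical helper, atoms are numbered in order of appearance, bond symbols in front of labels are ignored, and labels are compared as integers so that the two-digit `%nn` form allowed by OpenSMILES is handled by the same table.

```python
import re

# Either an atom token or a ring-closure label; everything else
# (bond symbols, parentheses, dots) is skipped by finditer.
ATOM_OR_LABEL = re.compile(
    r"(?P<atom>\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*)"
    r"|(?P<label>%\d\d|\d)"
)


def ring_bonds(smiles):
    """Return (opening_atom, closing_atom) index pairs for each ring closure."""
    open_rings = {}   # label -> index of the atom that opened it
    bonds = []
    atom_index = -1
    for match in ATOM_OR_LABEL.finditer(smiles):
        if match["atom"]:
            atom_index += 1
            continue
        if atom_index < 0:
            raise ValueError("ring-closure label before any atom")
        label = int(match["label"].lstrip("%"))
        if label in open_rings:
            # Second occurrence: close the ring and free the label for reuse.
            bonds.append((open_rings.pop(label), atom_index))
        else:
            open_rings[label] = atom_index
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return bonds


print(ring_bonds("C1CC1"))        # cyclopropane
print(ring_bonds("C1CC2CCC1C2"))  # norbornane: ring 1 closes before ring 2
print(ring_bonds("C1CC1C1CC1"))   # label 1 reused for a second ring
```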
For molecules requiring more concurrent open rings than single digits allow, such as highly polycyclic frameworks, the notation employs a percent sign followed by two digits, ranging from %10 to %99, to extend the labeling capacity without ambiguity. An example is the use of %10 in a complex polycycle where the single digits are exhausted; the closure must repeat %10 to match the opening. This mechanism avoids nesting complications in the linear string, as closures are resolved sequentially during parsing.[2]
Fused ring systems, where rings share two adjacent atoms (and thus a bond), are constructed by initiating a new ring from an atom within an existing ring and closing it with a distinct digit, leveraging the shared sequence to imply the fusion. For decalin (decahydronaphthalene), a fused bicyclic system, the SMILES is C1CCC2CCCCC2C1, where the first ring opens with 1, the second ring opens at the fourth carbon and closes with 2, and the final 1 closes the first ring, so that the two rings share the bond between the fusion carbons. Ring-fusion and bridgehead atoms in such systems are handled through the atom sequence and multiple closures without additional notation, though stereodescriptors (covered elsewhere) may be added for chirality. This digit-based approach ensures efficient encoding of ring topologies while maintaining the string's readability and compactness.[1]
Branching Patterns
In SMILES notation, branches representing side chains or substituents are encoded by enclosing them in parentheses immediately following the atom from which they emanate. This allows for the depiction of tree-like molecular structures without cycles. For instance, the SMILES string CC(O)C represents 2-propanol (isopropanol), where the (O) denotes a hydroxyl group branching from the central carbon atom.[1][2]
Multiple branches can be specified sequentially from the same atom by placing additional parenthetical expressions in succession. An example is CC(O)(Cl)C, which describes 2-chloro-2-propanol, with both a hydroxyl and a chlorine atom branching from the second carbon. This sequential notation ensures that each branch attaches to the preceding atom before the main chain resumes.[1][2]
Nested branches enable the representation of more complex substituents within a branch, achieved by embedding additional parenthetical expressions inside an outer pair. The depth of nesting is theoretically unlimited, though practical implementations impose limits based on computational resources, typically supporting several levels for most molecular structures. For example, deeper nesting might describe a branched alkyl chain attached to another branch, illustrating hierarchical complexity in molecular architectures.[2][1]
After a branch closes with a closing parenthesis, the notation resumes the main chain from the atom that initiated the branch. This is evident in CC(=O)O, which encodes acetic acid, where (=O) specifies a double-bonded oxygen branch from the second carbon, followed by the continuation to the terminal hydroxyl group. Branches can integrate with ring notations, but the parentheses primarily handle acyclic deviations.[1][2]
The core syntactic rule for branches stipulates that each opening parenthesis ( initiates a branch from the immediately preceding atom, while each closing parenthesis ) terminates the most recent open branch, pairing strictly in a last-in-first-out manner. This stack-based parsing ensures unambiguous structure reconstruction.[2][1]
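The last-in, first-out pairing of parentheses can be sketched with an explicit stack that remembers which atom to return to when a branch closes. This is a deliberately reduced illustration: `chain_bonds` is a hypothetical helper, an uppercase letter is taken to start a new atom and a following lowercase letter to complete a two-letter symbol such as Cl, so aromatic atoms, bracket atoms, and ring closures are outside its scope.

```python
def chain_bonds(smiles):
    """Return (atom, neighbor) index pairs implied by chains and branches."""
    bonds = []
    stack = []        # atoms to return to when a branch closes
    previous = None   # atom the next atom will be bonded to
    count = 0         # number of atoms seen so far
    for char in smiles:
        if char == "(":
            if previous is None:
                raise ValueError("branch opened before any atom")
            stack.append(previous)
        elif char == ")":
            if not stack:
                raise ValueError("unmatched closing parenthesis")
            previous = stack.pop()
        elif char.isupper():
            if previous is not None:
                bonds.append((previous, count))
            previous = count
            count += 1
        # Bond symbols and the second letter of Cl/Br are skipped here.
    if stack:
        raise ValueError("unclosed branch")
    return bonds


print(chain_bonds("CC(=O)O"))     # acetic acid
print(chain_bonds("CC(O)(Cl)C"))  # two branches on the same carbon
```

After each closing parenthesis the next atom attaches to the atom that opened the branch, which is how the main chain resumes in CC(=O)O.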
For disconnected structures, such as salts, complexes, or mixtures, SMILES employs a period . as a separator between independent molecular components. The order of these components is arbitrary and does not imply connectivity. A canonical example is [Na+].[Cl-], representing sodium chloride in ionic form.[1][2]
Aromaticity Conventions
In the Simplified Molecular Input Line Entry System (SMILES), aromaticity for delocalized electron systems in rings is primarily indicated through the use of lowercase letters for atoms, distinguishing them from aliphatic counterparts. This convention applies to sp²-hybridized atoms such as carbon (c), nitrogen (n), oxygen (o), phosphorus (p), and sulfur (s) that participate in aromatic rings, implying an alternating pattern of single and double bonds without explicit specification in many cases.[3] The lowercase notation simplifies representation by eliding bond details, relying on the parser to infer the delocalized nature based on ring structure and atom types.[17]
Aromatic bonds between these lowercase atoms can be written explicitly with a colon (:), representing a delocalized bond; however, this symbol is almost always omitted, because an unmarked bond between two aromatic atoms is taken to be aromatic. An explicit hyphen (-) between two aromatic atoms instead denotes a true single bond, as in biphenyl, c1ccccc1-c2ccccc2, where the hyphen joins the two rings. For instance, benzene is compactly written as c1ccccc1, where the ring closure digit 1 denotes the cycle, and the sequence of c atoms with implicit bonds conveys the aromatic sextet. In contrast, the Kekulé form provides an alternative non-aromatic representation using uppercase atoms and explicit alternating double bonds (=), such as C1=CC=CC=C1 for benzene, which avoids ambiguity in bond localization but results in longer strings and potential multiplicity in encoding the same structure.[3] The aromatic lowercase form is generally preferred in canonical SMILES generation for its uniqueness and brevity, as multiple Kekulé variants can describe the same molecule.[17]
SMILES aromaticity conventions enforce specific structural rules to ensure valid delocalized systems, though common examples like benzene involve six-membered rings satisfying Hückel’s 4n+2 pi-electron rule. Heteroaromatic compounds follow analogous patterns, with pyridine represented as c1ccccn1, where the nitrogen (n) integrates into the aromatic cycle without disrupting the delocalized bonding.[18] These rules extend to larger or fused systems, provided the atoms meet hybridization and electron count criteria defined in the specification.[17]
A key aspect of these conventions is the role of software in handling aromaticity during SMILES parsing and generation. Tools apply algorithmic detection—often using an extended version of Hückel’s rule—to identify aromatic patterns, validating lowercase notations and converting invalid or Kekulé inputs to the canonical aromatic form where appropriate.[3] This process ensures interoperability across cheminformatics systems, allowing both aromatic and Kekulé representations as input while standardizing output to the lowercase aromatic notation for consistency.
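The electron-counting step behind a Hückel-style check can be sketched as follows. This is a simplified illustration for a single ring, not the aromaticity model of any particular toolkit: the per-atom π-electron contributions in `PI_ELECTRONS` and the function name `satisfies_huckel` are assumptions made for the example, and real implementations also consider charges, exocyclic bonds, and fused systems.

```python
# Assumed pi-electron contributions for common aromatic ring atoms:
# a ring carbon or pyridine-type nitrogen contributes one electron,
# while pyrrole-type [nH], o, and s contribute a lone pair (two).
PI_ELECTRONS = {"c": 1, "n": 1, "[nH]": 2, "o": 2, "s": 2}


def satisfies_huckel(ring_atoms):
    """Check whether the ring's pi-electron count has the form 4n + 2."""
    total = sum(PI_ELECTRONS[atom] for atom in ring_atoms)
    return total >= 2 and (total - 2) % 4 == 0


print(satisfies_huckel(["c"] * 6))                      # benzene, 6 electrons
print(satisfies_huckel(["c"] * 5 + ["n"]))              # pyridine, 6 electrons
print(satisfies_huckel(["c", "c", "c", "c", "[nH]"]))   # pyrrole, 6 electrons
print(satisfies_huckel(["c"] * 4))                      # 4 electrons, fails 4n + 2
```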
Advanced Specifications
Stereochemical Descriptors
SMILES incorporates stereochemical information to distinguish between isomers, particularly through isomeric SMILES strings that specify configurations around double bonds and chiral centers, enabling the representation of 2D and limited 3D stereochemistry without full coordinate data.[1] This is achieved using directional symbols for bond orientations and chiral indicators for atomic configurations, with parsing relying on the traversal direction in the string to determine relative positions.[2] Such descriptors are essential for accurately encoding molecules like alkenes and amino acids, where spatial arrangement affects chemical properties.[19]
Double-bond stereochemistry, representing cis/trans or E/Z configurations, is denoted by the symbols / and \ as directional single bonds adjacent to the double bond (=). These indicate the relative orientation of substituents on the atoms connected by the double bond, where matching directions (both / or both \) signify trans geometry and opposing directions signify cis. For instance, the string F/C=C/F represents trans-1,2-difluoroethene, with the fluorine atoms on opposite sides, while F/C=C\F denotes the cis isomer.[19] The specification requires explicit directionality on the bonds immediately preceding and following the double bond, and the parser interprets the geometry based on whether the directional bonds point in the same or opposite directions during string traversal.[2]
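For the simple linear form shown above, in which one marked substituent is written before the double bond and the other after it, the same-versus-opposite rule can be sketched in a few lines. The helper `double_bond_geometry` is a hypothetical name, and the sketch makes a strong assumption: it only recognizes the unbranched pattern A/C=C/B. Forms in which a directional bond sits inside a branch, such as C(/F)=C/F, need the direction reinterpreted relative to the branch and are intentionally rejected here.

```python
import re

# A directional bond, an unbranched atom run, '=', another unbranched
# atom run, and a second directional bond.
LINEAR_STEREO = re.compile(r"([/\\])[^/\\=()]+=[^/\\=()]+([/\\])")


def double_bond_geometry(smiles):
    """Return 'trans', 'cis', or None for the first marked double bond."""
    match = LINEAR_STEREO.search(smiles)
    if match is None:
        return None   # no stereo marks, or a form this sketch does not cover
    same_direction = match.group(1) == match.group(2)
    return "trans" if same_direction else "cis"


print(double_bond_geometry("F/C=C/F"))    # trans
print(double_bond_geometry("F/C=C\\F"))   # cis
print(double_bond_geometry("FC=CF"))      # None
```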
Tetrahedral stereochemistry at chiral centers, typically carbon atoms with four different substituents, uses the @ symbol for an anticlockwise arrangement and @@ for a clockwise one, placed after the atomic symbol inside square brackets. The configuration is defined relative to the order in which the neighbors appear in the SMILES string: looking from the first-listed neighbor toward the chiral atom, @ means the remaining three neighbors appear anticlockwise in the order written, and @@ means they appear clockwise. An implicit hydrogen stated inside the bracket (as in [C@@H]) counts as the first of those three remaining neighbors. An example is N[C@@H](C)C(=O)O for L-alanine, where, viewed from the amino nitrogen, the hydrogen, the methyl carbon, and the carboxyl carbon appear clockwise.[19] A hydrogen on a stereocenter must be stated inside the bracket or written as an explicit atom, and the notation applies to any tetrahedral atom, not just carbon.[2]
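Because @ and @@ are defined relative to the written neighbor order, the same stereocenter can carry either mark depending on how the string is traversed: swapping any two neighbors in the order flips the mark, so an odd permutation toggles @ and @@ while an even one preserves it. The sketch below applies that parity rule; `reorder_chirality` and the neighbor labels are hypothetical names used only for this example.

```python
def reorder_chirality(tag, old_order, new_order):
    """Return the chirality mark ('@' or '@@') after reordering the neighbors."""
    if tag not in ("@", "@@"):
        raise ValueError("tag must be '@' or '@@'")
    if len(set(old_order)) != len(old_order) or sorted(old_order) != sorted(new_order):
        raise ValueError("both orders must list the same distinct neighbors")

    # Position of each neighbor of the new order within the old order.
    positions = [old_order.index(neighbor) for neighbor in new_order]
    # The parity of the permutation equals the parity of its inversion count.
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    if inversions % 2 == 0:
        return tag
    return "@" if tag == "@@" else "@@"


# L-alanine written as N[C@@H](C)C(=O)O lists the neighbors N, H, CH3, COOH.
old = ["N", "H", "CH3", "COOH"]
# Starting from the methyl group instead gives the order CH3, H, N, COOH.
print(reorder_chirality("@@", old, ["CH3", "H", "N", "COOH"]))  # @
```

The result corresponds to the equivalent string C[C@H](N)C(=O)O for the same enantiomer.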
Extensions for axial chirality in allenes and for square planar complexes build on these symbols. For allenes, featuring cumulative double bonds like in propadiene derivatives, stereochemistry at the central sp-hybridized carbon is indicated by @ or @@ following the atom, specifying the twist of the perpendicular planes formed by the substituents. For example, NC(Br)=[C@]=C(O)C denotes a specific enantiomer of an allene, where the substituents are oriented according to the chiral specification.[2] Square planar geometry, common in coordination compounds, uses @SP1, @SP2, and @SP3, which indicate that the neighbors, taken in the order written, trace a U, 4, or Z shape respectively around the central atom; an example is [Pt@SP1](Cl)(Br)(I)N, in which the four ligands follow a U-shaped path. These descriptors distinguish the geometric arrangement of the ligands rather than handedness. As with the tetrahedral marks, consistent parsing depends on the neighbor order given by the string's linear traversal.[2]
Isotopic Labels
In SMILES, isotopic labels are specified using a numeric prefix indicating the mass number, placed immediately before the atomic symbol within square brackets for the affected atom. This notation allows precise representation of specific isotopes without altering the core structural description. For instance, [13C] represents the carbon-13 isotope, while an atom written without a prefix, such as plain C, has no isotope specified and is taken to have natural isotopic abundance.[2][1]
The mass number is a plain integer and may be written with leading zeros, which are ignored; thus [2H], [02H], and [002H] all denote deuterium (hydrogen-2). This rule applies to any element in the periodic table, enabling isotopic specification for organic and inorganic atoms alike. Common applications include hydrogen ([2H] for deuterium or [3H] for tritium), carbon ([13C]), nitrogen ([15N]), and oxygen ([17O] or [18O]), which are frequently used in labeled compounds. Examples include [2H]O[2H] for heavy water (deuterium oxide) and [235U] for uranium-235.[2][1]
Importantly, isotopic designations in SMILES do not influence the atom's valence, hybridization, or bonding connectivity, as these properties are governed solely by the elemental symbol and its standard chemical behavior; the isotope serves only to distinguish mass variants for identification purposes. This design ensures compatibility with standard valence rules while supporting representations of isotopically substituted molecules, which are crucial in applications like nuclear magnetic resonance (NMR) spectroscopy and isotopic labeling studies for tracing metabolic pathways or reaction mechanisms.
For example, [13CH4] specifies carbon-13 methane, useful in mass spectrometry or NMR experiments to probe molecular dynamics.[2][1]
An extension of this notation integrates with stereochemical descriptors when isotopes create asymmetry by differentiating otherwise identical substituents. In such cases, the isotope contributes to the atom's identity for chirality determination, allowing specification of configurations that arise from mass differences alone. A representative example is [2H][C@H](F)Cl, which depicts a chiral chlorofluoromethane in which one hydrogen is replaced by deuterium, and the @ symbol denotes the tetrahedral stereochemistry at the carbon center. This capability is particularly relevant for studying isotopically induced chirality in biochemical or synthetic contexts.[2][19]
Extensions for Complex Molecules
Extensions for complex molecules in SMILES are optional notations that go beyond the core specification to address chemical reactions, polymeric structures, and substructure queries; these features are not standardized in OpenSMILES, and their implementation can differ across software tools.[2] They enable representation of chemical transformations and large assemblies that are difficult to express with basic SMILES strings, supporting applications in cheminformatics and materials science.[1]
Reaction SMILES uses the '>' symbol to separate reactants, optional agents, and products, in that order. For instance, the hydration of ethylene is denoted as C=C.O>>CCO, where the left side lists the alkene and water as reactants (separated by '.'), the empty middle field indicates that no agents are given, and the right side shows ethanol as the product. Another example is the substitution reaction C=CCBr>>C=CCI, representing the conversion of allyl bromide to allyl iodide without specified agents.[1] This notation supports multi-component reactions and is widely adopted in reaction databases and simulation software.[1]
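The three-field layout of a reaction string can be sketched in a few lines of Python. The function name split_reaction is hypothetical, and the sketch assumes that '>' appears only as the field separator and '.' only as the molecule separator; it does not validate the individual SMILES or handle atom maps.

```python
def split_reaction(reaction: str):
    """Split 'reactants>agents>products' into three lists of SMILES strings."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")

    def molecules(field: str):
        # '.' separates disconnected molecules; an empty field gives [].
        return [m for m in field.split(".") if m]

    reactants, agents, products = (molecules(p) for p in parts)
    return reactants, agents, products


print(split_reaction("C=C.O>>CCO"))
print(split_reaction("C=CCBr>>C=CCI"))
```

For C=C.O>>CCO this returns two reactants, an empty agent list, and one product.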
For polymers, extensions employ asterisks (*) to mark the endpoints of repeating units, allowing concise description of chain-like macromolecules. Polyethylene, for example, is represented by the repeating unit [*]CC[*], where the asterisks indicate sites for inter-unit connections.[20] This approach, seen in tools like PSMILES, builds on Daylight SMILES syntax while accommodating the connectivity of long chains.[21]
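As a rough illustration of how the asterisk endpoints define connectivity, the following Python sketch expands a repeating unit into a short linear oligomer by removing the endpoint markers and concatenating the core. The function expand_repeat_unit is an assumption made for this example and is not taken from PSMILES or any other tool; it only handles units written with [*] as the first and last tokens, and the chain ends are left to be capped by implicit hydrogens.

```python
def expand_repeat_unit(unit: str, n: int) -> str:
    """Concatenate n copies of a repeat unit written as [*]...[*]."""
    star = "[*]"
    if n < 1:
        raise ValueError("n must be at least 1")
    if not (unit.startswith(star) and unit.endswith(star)):
        raise ValueError("unit must start and end with [*]")
    core = unit[len(star):-len(star)]
    if not core or "*" in core:
        raise ValueError("expected exactly two endpoints around a non-empty core")
    # Adjacent copies are joined by the default single bond of SMILES.
    return core * n


print(expand_repeat_unit("[*]CC[*]", 3))  # CCCCCC
```

With three ethylene units the result is CCCCCC, the SMILES string for hexane.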
SMARTS (SMILES Arbitrary Target Specification) is a query-language extension that adds pattern-matching capabilities to SMILES for substructure searches. It introduces wildcards such as '*' to match any atom, 'A' and 'a' to match any aliphatic or aromatic atom, and '~' to match a bond of any type. For example, the pattern C~O matches an aliphatic carbon joined to an aliphatic oxygen by any bond, which is useful for querying carbon-oxygen functional groups across molecular datasets.[22] SMARTS also supports logical operators for more complex queries, making it a standard tool for virtual screening and database filtering.[22]
In the 2020s, ongoing proposals aim to expand SMILES support for biomolecules, such as representing peptides as extended chains with repeating amino acid units and modifications. The IUPAC SMILES+ initiative, launched in 2019, seeks to formalize these extensions into a comprehensive standard. As of 2025, the project is in final review stages, with a recommendation expected for publication in Pure and Applied Chemistry later in the year.[12] Similarly, BigSMILES provides a structured notation for polymers that can extend to biopolymers like polypeptides, using descriptors for repeating units to capture sequence variability.[23] These developments address limitations in core SMILES for handling the scale and diversity of biological macromolecules.[23]
Practical Examples
Simple Molecular Strings
Simple molecular strings in SMILES notation encode small acyclic molecules by sequencing atomic symbols, with default single bonds between adjacent atoms and implicit hydrogen atoms added to satisfy standard valences (carbon: 4, nitrogen: 3, oxygen: 2).[3] This approach uses the organic subset of elements, where uppercase letters denote atoms without explicit charges or isotopes, and no rings or stereochemistry are indicated.[17] The simplest example is methane, represented as C. This single carbon atom is parsed as the only vertex in the molecular graph, with four implicit hydrogens attached to fulfill the tetravalent carbon, forming CH₄; no bonds are specified since there are no adjacent atoms.[24]
For water, the SMILES string O denotes a single oxygen atom, interpreted as a graph vertex with two implicit hydrogens bonded to it, yielding H₂O; the parser recognizes organic oxygen's divalent nature and adds hydrogens accordingly.
Ammonia is encoded as N, where the nitrogen atom serves as the graph's sole vertex, augmented by three implicit hydrogens to match trivalent nitrogen, resulting in NH₃.
Ethanol's SMILES CCO maps to a linear chain graph: the first C is a carbon with three implicit hydrogens (CH₃-), connected by a default single bond to the second C (with two implicit hydrogens, -CH₂-), which bonds to O (with one implicit hydrogen, -OH); this sequential parsing builds the acyclic structure CH₃CH₂OH.
Ethene uses C=C, parsed as two carbon atoms connected by an explicit double bond: each C receives two implicit hydrogens (H₂C=CH₂), with the = symbol overriding the default single bond to define the unsaturated graph edge.
Acetone is represented by CC(=O)C, where parsing proceeds left to right: the first C (CH₃-) bonds singly to the second C (the carbonyl carbon, with no implicit hydrogens), from which a branch (=O) attaches oxygen via a double bond (no hydrogens on O), and the second C then connects to a final C (CH₃-); this constructs the graph CH₃C(=O)CH₃, using parentheses to denote the off-chain double bond.[25]
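The implicit-hydrogen bookkeeping in these examples can be reproduced with a minimal Python sketch. The function implicit_hydrogens and its tables are assumptions made for this illustration: it accepts only C, N, and O with the default valences quoted above, the bond symbols -, =, and #, and parentheses for branches, and it does not handle rings, bracket atoms, or aromatic atoms.

```python
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def implicit_hydrogens(smiles: str):
    """Return [(symbol, implicit_H_count), ...] in the order atoms are written."""
    symbols = []   # element symbol of each atom
    used = []      # sum of bond orders already used by each atom
    stack = []     # atoms at which currently open branches started
    prev = None    # index of the atom the next atom will bond to
    order = 1      # order of the pending bond (single by default)
    for ch in smiles:
        if ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch in DEFAULT_VALENCE:
            symbols.append(ch)
            used.append(0)
            idx = len(symbols) - 1
            if prev is not None:
                used[prev] += order
                used[idx] += order
            prev, order = idx, 1
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return [(s, DEFAULT_VALENCE[s] - u) for s, u in zip(symbols, used)]


for text in ("C", "O", "N", "CCO", "C=C", "CC(=O)C"):
    print(text, implicit_hydrogens(text))
```

For CC(=O)C the sketch reports three hydrogens on each terminal carbon and none on the carbonyl carbon or the oxygen, matching the parse described for acetone.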
Elaborate Structure Illustrations
To illustrate the application of SMILES notation to more complex molecular structures, consider the representation of benzene, a fundamental aromatic hydrocarbon. The SMILES string for benzene is c1ccccc1, where lowercase letters denote aromatic atoms (carbon in this case), and the two occurrences of the digit 1 mark a ring-closure bond that joins the first and last atoms into a six-membered ring. This compact notation captures the delocalized π-electron system without explicit double bonds, following the aromaticity convention in which alternating single and double bonds are implied but not written.
A more elaborate example is aspirin (acetylsalicylic acid), which combines an aromatic ring, branches, and functional groups. Its SMILES is CC(=O)Oc1ccccc1C(=O)O. Parsing begins with the acetyl group CC(=O)O, where C is a methyl carbon bonded to the carbonyl C, with the branch (=O) indicating a double bond to oxygen; the ester oxygen O then attaches to the aromatic ring c1ccccc1. The ring closes at the second digit 1, and the final C(=O)O, the carboxylic acid group, is bonded to that ring-closing carbon, which is adjacent to the carbon bearing the ester oxygen. The lowercase c atoms are aromatic and imply sp² hybridization, while uppercase C and O denote aliphatic atoms. This string combines parenthesized branches with a ring closure to depict the ortho-substituted benzoic acid derivative.
For stereochemistry in complex structures, ibuprofen (2-(4-(2-methylpropyl)phenyl)propanoic acid) serves as a chiral example, with the biologically active S-enantiomer specified in SMILES as CC(C)CC1=CC=C(C=C1)[C@H](C)C(=O)O. Parsing starts with the isobutyl chain CC(C)C: a methyl carbon bonds to a methine carbon that carries a second methyl as a branch (C) and continues to a methylene carbon, which connects to the para-substituted ring C1=CC=C(C=C1). The uppercase C atoms in the ring indicate Kekulé form with explicit double bonds (=); the aromatic notation c1ccc(cc1) is equivalent. The chiral center is marked by [C@H], whose @ descriptor means that, viewed from the preceding ring carbon, the remaining neighbors (the implicit hydrogen, the methyl branch (C), and the carboxylic acid C(=O)O) appear in counterclockwise order; for this neighbor order that corresponds to the S configuration. This notation shows how SMILES embeds stereochemical information at asymmetric carbons through bracket-atom descriptors interpreted relative to the order in which neighbors are written.
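Because a tetrahedral descriptor is defined relative to the order in which the neighbors of the stereocenter are written, rewriting the same center with its neighbors in a different order may require exchanging @ and @@. The Python sketch below illustrates the underlying parity rule: an odd permutation of the neighbor order flips the descriptor, and an even permutation keeps it. The function reorder_descriptor and the neighbor labels are hypothetical names chosen for this example, and the sketch does not assign R or S labels.

```python
def reorder_descriptor(descriptor: str, old_order, new_order) -> str:
    """Return the descriptor that preserves chirality under a new neighbor order."""
    if descriptor not in ("@", "@@"):
        raise ValueError("descriptor must be '@' or '@@'")
    if len(set(old_order)) != len(old_order) or sorted(old_order) != sorted(new_order):
        raise ValueError("orders must contain the same unique neighbor labels")
    # Express the new order as positions in the old order, then count inversions.
    positions = [old_order.index(label) for label in new_order]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    if inversions % 2 == 0:
        return descriptor            # even permutation: same descriptor
    return "@@" if descriptor == "@" else "@"   # odd permutation: flipped


old = ["aryl", "H", "methyl", "carboxyl"]
new = ["methyl", "H", "aryl", "carboxyl"]
print(reorder_descriptor("@", old, new))  # @@
```

Swapping the aryl and methyl neighbors is an odd permutation, so a center written with @ in the first order must be written with @@ in the second to describe the same enantiomer.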
Caffeine, a purine alkaloid with fused rings and multiple branches, combines these features in one string: CN1C=NC2=C1C(=O)N(C(=O)N2C)C. The string opens with an N-methyl group, CN1, where the digit 1 opens a ring bond at the nitrogen; C=NC2= continues around the imidazole ring, with the digit 2 opening a second ring bond at a fusion carbon. The next atom, C1, closes ring 1 and is the second fusion carbon shared by both rings. The six-membered ring then proceeds through the carbonyl C(=O) to an N-methylated nitrogen, whose parenthesized branch contains the second carbonyl C(=O) and the nitrogen N2 that closes ring 2 and bears its own methyl group; the final C is the methyl on the first of these nitrogens. The string is written in Kekulé form with explicit double bonds rather than lowercase aromatic atoms, and it relies on ring-closure digits, branches, and heteroatoms to represent the xanthine core with three methyl substituents at positions 1, 3, and 7. The example shows how SMILES handles polycyclic systems by combining several ring-closure labels with parentheses.
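The pairing of ring-closure digits can be traced with a small Python sketch that reports which atoms each pair of matching labels connects. The function ring_closures and its tokenizing pattern are assumptions made for this illustration; the tokenizer recognizes bracket atoms, Cl and Br, single-letter organic-subset atoms, single-digit labels, and two-digit %nn labels, and it does not validate the rest of the string.

```python
import re

# Hypothetical tokenizer: bracket atoms, two-letter halogens, organic-subset
# atoms (aliphatic and aromatic), ring labels, then any other single character.
_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d{2}|\d|.")


def ring_closures(smiles: str):
    """Return (atom_index, atom_index) pairs joined by ring-closure labels."""
    pairs = []
    open_labels = {}   # ring label -> index of the atom that opened it
    atom_index = -1    # index of the most recently read atom
    for token in _TOKEN.findall(smiles):
        if token[0] == "[" or token[0].isalpha():
            atom_index += 1
        elif token[0].isdigit() or token[0] == "%":
            label = int(token.lstrip("%"))
            if label in open_labels:
                # Second occurrence: close the ring and free the label for reuse.
                pairs.append((open_labels.pop(label), atom_index))
            else:
                open_labels[label] = atom_index
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return pairs


print(ring_closures("c1ccccc1"))                       # [(0, 5)]
print(ring_closures("CC(=O)Oc1ccccc1C(=O)O"))          # [(4, 9)]
print(ring_closures("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))   # [(1, 5), (4, 11)]
```

Atoms are numbered from zero in the order they are written. For caffeine, the pairs (1, 5) and (4, 11) correspond to the five-membered and six-membered rings, with atoms 4 and 5 being the two fusion carbons.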