
Simplified Molecular Input Line Entry System

The Simplified Molecular Input Line Entry System (SMILES) is a line notation for representing the structure of chemical molecules and reactions as compact ASCII strings composed of atomic symbols, digits for ring closures, and symbols for bonds, branches, and stereochemistry. The OpenSMILES specification provides an openly published definition of the notation. Developed to facilitate computer processing of chemical information, SMILES encodes molecular topology in a human-readable yet machine-parsable format, describing connectivity, bond types, and isomerism without requiring graphical input.

SMILES was initiated by chemist David Weininger in 1986 while working at the U.S. Environmental Protection Agency's Mid-Continent Ecology Division Laboratory in Duluth, Minnesota, as part of efforts to create an efficient, chemist-friendly language for chemical databases and modeling software. The methodology and encoding rules were first detailed in a seminal 1988 paper published in the Journal of Chemical Information and Computer Sciences, where it was described as a system "designed for modern chemical information processing" based on principles of atomic valence and molecular graph theory. Weininger, who passed away in 2016, envisioned SMILES as an open, extensible standard; subsequent refinements, including canonicalization algorithms for unique string generation and extensions for reaction transforms (SMIRKS) and substructure searching (SMARTS), were advanced by Daylight Chemical Information Systems in the early 1990s.

Key features of SMILES include its simplicity (a basic organic molecule such as ethanol is written as "CCO") and its flexibility for complex structures, such as rings (e.g., cyclohexane as "C1CCCCC1") and stereochemistry (using @ for chiral centers and / or \ for double bonds). Unlike graphical formats, SMILES prioritizes linear representation, making it well suited to text-based storage, search, and exchange in cheminformatics applications, including large chemical databases. Its adoption has grown due to interoperability with software tools for tasks such as molecular generation and property prediction, though it cannot fully capture three-dimensional conformations without extensions.

History and Development

Origins and Creation

The Simplified Molecular Input Line Entry System (SMILES) was developed by David Weininger in 1986 while working at the U.S. Environmental Protection Agency's Mid-Continent Ecology Division Laboratory in Duluth, Minnesota, where it emerged as a compact, human-readable alternative to the cumbersome connection tables traditionally used to represent molecular structures in computational chemistry. The design was completed while Weininger was at Pomona College, and subsequent implementation and refinement were carried out at Daylight Chemical Information Systems, Inc., starting in the late 1980s. The notation addressed the need for a streamlined method of encoding chemical information, enabling easier manipulation and storage of molecular data in early cheminformatics applications.

The primary motivation behind SMILES was to facilitate the exchange of molecular structures across diverse software systems, reducing the complexity associated with graphical or tabular formats. Weininger drew inspiration from predecessor line notations, notably the older Wiswesser Line Notation (WLN), but sought a more intuitive and versatile system that avoided WLN's rigid, mnemonic-based rules while supporting broader structural representations. By prioritizing simplicity and portability, SMILES aimed to bridge the gap between human interpretation and machine processing of chemical data.

SMILES was first detailed in a seminal publication by Weininger in 1988, appearing in the Journal of Chemical Information and Computer Sciences, which outlined its encoding rules and methodology. Central to its design is the use of linear character strings to depict two-dimensional molecular graphs, omitting explicit coordinates and recording only atomic connectivity and basic bonding information. As a result, SMILES strings can be generated, parsed, and interconverted efficiently without reliance on visual or positional data.

Evolution and Standardization

Following the initial introduction of SMILES in 1988, David Weininger and collaborators at Daylight Chemical Information Systems expanded the notation throughout the 1990s to address ambiguities and broaden its applicability. These refinements included detailed rules for representing aromaticity, where lowercase letters denote aromatic atoms and the alternating single and double bonds are implied, and stereochemistry, using "@" for chiral centers and "/" and "\" for double-bond configurations. A key advance in canonicalization came in a 1989 follow-up paper, which presented an algorithm for generating a unique SMILES string per structure, reducing duplicates in large-scale molecular inventories. These expansions were documented in Daylight's tutorials and theory manuals, which became de facto references for implementers seeking consistent parsing and generation of SMILES strings.

A pivotal step in standardization occurred in 2007, when the Blue Obelisk open-source cheminformatics community initiated the OpenSMILES project to create a non-proprietary, interoperable specification for SMILES. Culminating in the 2012 OpenSMILES specification, this effort clarified encoding rules, addressed variations in aromaticity perception, and promoted canonical SMILES generation so that software tools could emit unique string representations, fostering widespread adoption in databases and cheminformatics pipelines. The specification emphasized backward compatibility while standardizing features such as ring closures and branching to improve data exchange in computational chemistry.

In the 2020s, SMILES saw further alignment with international standards through IUPAC's ongoing project to formalize an updated specification, known as SMILES+, aimed at improving notation consistency without altering the core syntax. As of January 2025, the project reported major accomplishments from 2019 to 2024, including cheminformatics toolkit comparisons, with plans for further standardization. This work builds on OpenSMILES by incorporating IUPAC guidelines, helping SMILES remain compatible with other standards such as InChI.

Recent milestones include 2023 cheminformatics community discussions on adapting SMILES for AI-driven molecular modeling, highlighting its role in training generative models and noting that the unchanged core syntax remains robust for tasks such as molecular design. These conversations, featured in conferences and reviews, underscore the enduring utility of SMILES without the need for major revisions.

Formal Foundations

Graph-Based Structural Definition

In the Simplified Molecular Input Line Entry System (SMILES), molecular structures are represented as undirected graphs in which atoms correspond to vertices and chemical bonds to edges. Single bonds are the default (or written with an explicit hyphen), double bonds use an equals sign, and triple bonds a hash mark; aromatic bonds are written with a colon or, more commonly, implied between lowercase (aromatic) atom symbols, with alternating single and double bonds as the Kekulé alternative. In its basic form, this graph-theoretic foundation lets SMILES capture the connectivity and topology of a molecule without embedding spatial or stereochemical information, focusing solely on the abstract relational structure.

Valence rules ensure that each atom reaches a standard valence through explicit bonds plus implicit hydrogens, which the reader of the string infers to fill the remaining valence without adding vertices to the graph. For instance, carbon has a valence of four, so an isolated "C" means methane (CH₄) with four implicit hydrogens. Organic subset atoms (such as C, N, O) follow predefined valences. Atoms written in brackets can carry charges or isotopes, but they receive no implicit hydrogens: their hydrogen count must be stated inside the brackets. Omitting hydrogen vertices and edges in most cases keeps the graph small and parsing and storage efficient.

The SMILES string encodes molecular connectivity through a linear traversal of the molecular graph, typically following a main chain, with branches denoted by parentheses and ring closures indicated by matching digits. The traversal mimics a depth-first search (DFS) starting from an arbitrary atom, exploring neighbors in sequence and backtracking to represent branches or cycles. Formally, SMILES serializes a spanning tree of the graph in DFS order: each symbol or digit records an atom, a bond type, or a ring closure, and decoding reconstructs the full graph. Rings are handled by breaking one bond per ring and labeling its two end atoms with the same digit, so the string itself is written as a tree while the decoded graph recovers every cycle. By design, this basic encoding excludes stereochemistry and three-dimensional coordinates, emphasizing topological over geometric detail.
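The traversal described above can be illustrated with a small decoder. The following is a minimal sketch, not a full SMILES parser: the function name `smiles_to_graph` is hypothetical, and it assumes input limited to uppercase organic-subset atoms, the bond symbols `-`, `=`, `#`, parentheses, and single-digit ring closures.

```python
ORGANIC = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def smiles_to_graph(smiles):
    """Decode a restricted SMILES string into (atoms, bonds).

    atoms is a list of element symbols; bonds is a list of
    (atom index, atom index, bond order) tuples.
    """
    atoms = []          # vertices, in order of appearance
    bonds = []          # edges with their bond order
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (opening atom, bond order or None)
    prev = None         # atom that the next atom attaches to
    pending = None      # explicit bond order waiting for its second atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        # Two-letter symbols are listed first so "Cl" wins over "C".
        symbol = next((s for s in ORGANIC if smiles.startswith(s, i)), None)
        if symbol:
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending or 1))
            prev, pending = len(atoms) - 1, None
            i += len(symbol)
            continue
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:            # second occurrence closes the ring
                start, order = open_rings.pop(ch)
                bonds.append((start, prev, pending or order or 1))
            else:                           # first occurrence opens it
                open_rings[ch] = (prev, pending)
            pending = None
        else:
            raise ValueError(f"unsupported character {ch!r}")
        i += 1
    if branch_stack or open_rings:
        raise ValueError("unclosed branch or ring")
    return atoms, bonds


atoms, bonds = smiles_to_graph("CC(=O)C")   # acetone
print(atoms)   # ['C', 'C', 'O', 'C']
print(bonds)   # [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
```

Hydrogens never appear in the output, which matches the convention that they are implied rather than stored as vertices. Aromatic atoms, bracket atoms, and stereochemistry are deliberately left out of this sketch.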

Context-Free Language Specification

The syntax of the Simplified Molecular Input Line Entry System (SMILES) can be specified as a context-free language, so that linear strings are parsed into molecular structures by a defined grammar. This formalization allows SMILES strings to be generated and validated systematically, mapping directly to the connectivity and hierarchy of chemical graphs. The specification, outlined in the OpenSMILES standard, employs a Backus-Naur Form (BNF)-style grammar to define valid constructs; nested elements such as branches put the language beyond the reach of regular languages.

The core grammar rules are structured around non-terminals such as atoms, bonds, chains, branches, and ring closures, with productions that recursively build molecular expressions. For instance, the start symbol (often denoted molecule or S) derives from an initial atom, extended by optional bonds to subsequent atoms, branches enclosed in parentheses, or ring closures via digit pairs; representative productions include S → atom | S bond atom | S branch | ringclosure, where branches follow rules like branch → (chain) to handle side chains. These rules permit linear traversal of the molecule while supporting hierarchical nesting. The grammar governs syntax only: valence limits and the pairing of ring-closure labels are checked in a separate step after the string has been tokenized. The full set of productions, detailed in the OpenSMILES document, covers over 20 non-terminals, encompassing the organic subset, bracket atoms, and atom properties such as isotopes.

Parsers commonly exploit the context-free structure with recursive descent, processing atoms sequentially and recursing into branches on encountering parentheses. Ring closures are usually tracked with a lookup table keyed by label rather than a stack, because ring labels may interleave instead of nesting: the table records the atom that opened each label and the entry is removed when the matching label appears. The OpenSMILES specification recommends that parsers support at least 100 levels of nesting for branches and 1000 ring closures. Parsing yields a tree whose nodes represent atoms and bonds, with branches forming subtrees and ring closures adding back-edges, so cycles arise only from explicit ring notation. That a pushdown automaton can recognize the branch structure reflects the grammar's expressive power: its stack supplies the memory needed for nested structures without requiring the full graph during initial parsing.

Although the grammar generates many valid strings for the same molecule, because branches and traversal starting points can be chosen in different orders, the language remains well defined and each string decodes to a single graph. Canonical SMILES variants address this non-uniqueness by enforcing a standardized traversal, such as depth-first order with specific tie-breaking rules for atom ranking, yielding one representative string per molecule. This does not compromise parsing reliability, as all variants follow the same grammar and decode to isomorphic molecular graphs.

The context-free formalism is necessary because the language is not regular: nested branches introduce arbitrary-depth nesting, akin to balanced parentheses, which a finite automaton cannot track without a stack. This level of expressiveness suffices for molecular complexity, where nesting depths rarely exceed practical limits, so more powerful formalisms such as context-sensitive grammars are not needed for the syntax. Implementations in cheminformatics libraries confirm the grammar's adequacy for efficient parsing of real-world chemical datasets.
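The difference between nested branches and freely paired ring labels can be shown with a short checker. This is an illustrative sketch only: `check_nesting` is a hypothetical helper that validates structure, not chemistry, and it uses a depth counter in place of an explicit stack because parentheses are the only bracket type being matched.

```python
def check_nesting(smiles):
    """Return True if branches are balanced and every ring label is paired.

    Bracket atoms such as [13CH4] are skipped so that digits inside them
    are not mistaken for ring-closure labels.
    """
    depth = 0             # current branch nesting depth (the stack height)
    open_labels = set()   # ring labels opened but not yet closed
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            end = smiles.find("]", i)
            if end == -1:
                return False          # unterminated bracket atom
            i = end
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False          # closing a branch that was never opened
        elif ch == "%":
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_labels ^= {int(label)}   # toggle: open on first use, close on second
            i += 2
        elif ch.isdigit():
            open_labels ^= {int(ch)}
        i += 1
    return depth == 0 and not open_labels


print(check_nesting("CC(C(C)C)C"))    # True: nested branches balance
print(check_nesting("C1CC2CCC1C2"))   # True: labels 1 and 2 interleave
print(check_nesting("CC(C"))          # False: branch left open
```

The set toggle accepts interleaved labels such as `1 … 2 … 1 … 2`, which a strict last-in-first-out discipline would reject; that is why ring labels are tracked separately from branch depth here.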

Basic Notation Elements

Atomic Representations

In SMILES, atoms are represented by their standard symbols from the periodic table, with conventions that keep the notation compact and readable. The system distinguishes between an "organic subset" of common elements and more general atomic specifications.

The organic subset comprises boron (B), carbon (C), nitrogen (N), oxygen (O), phosphorus (P), sulfur (S), fluorine (F), chlorine (Cl), bromine (Br), and iodine (I). These atoms can be written as bare symbols, without enclosing brackets, when they carry no charge and their hydrogens follow the standard valence rules. For example, 'C' represents a carbon atom with four bonds, meaning methane (CH₄) in isolation, while 'N' denotes nitrogen with three bonds, as in ammonia (NH₃). This bracket-free notation applies only to uncharged organic subset atoms with normal valences, which keeps organic molecules brief.

For atoms outside the organic subset, or atoms that need additional properties, SMILES encloses the full specification in square brackets. Inside the brackets, an optional isotope number comes first, followed by the element symbol and then optional descriptors for chirality, hydrogen count, and charge. Examples include [Na+] for the sodium ion, [H] for an explicit hydrogen, and [Fe] for iron. Charges are indicated by a sign (+ or -) optionally followed by a number, such as [O-2] for the oxide ion or [NH4+] for ammonium. Isotopes precede the symbol, as in [14C]. SMILES has no dedicated radical symbol; a radical is expressed by giving a bracket atom fewer hydrogens than its normal valence requires, as in [CH3] for the methyl radical. The bracketed format is mandatory for metals and for any organic subset atom with a charge, isotope, or non-standard hydrogen count.

Hydrogen atoms are handled mostly through implicit rules to minimize string length, especially in organic contexts. For bare organic subset atoms, the number of implicit hydrogens is the atom's standard valence minus the sum of the bond orders written in the string; when an element has several standard valences, the lowest one that accommodates the written bonds is used. Standard valences are carbon (4), nitrogen (3 or 5), oxygen (2), phosphorus (3 or 5), sulfur (2, 4, or 6), the halogens (1), and boron (3). For instance, ethanol is written as CCO, where the first C has three implicit H (CH₃-), the second C has two (-CH₂-), and O has one (-OH). Hydrogens can be made explicit as separate [H] atoms or by giving a count inside brackets, such as [CH4] for methane or C[H] for an explicit attachment. Bracket atoms, including the metals in complexes, never receive implicit hydrogens, so any hydrogens on them must be written out.

Special cases include the wildcard '*', which stands for an unspecified atom and may be written without brackets like an organic subset atom. It is used in generic structures and in queries, for example in substructure searches. These conventions keep SMILES compact while accommodating diverse chemical entities.
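The implicit-hydrogen rule for bare organic subset atoms can be written down directly. The sketch below assumes the "lowest standard valence that fits" convention and uses the valences listed above; the function name `implicit_hydrogens` is made up for illustration, and aromatic atoms are not covered.

```python
# Standard valences for bare (unbracketed) organic subset atoms.
STANDARD_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum):
    """Hydrogens implied for an organic subset atom.

    bond_order_sum is the total order of the bonds written in the string
    (single = 1, double = 2, triple = 3).
    """
    for valence in STANDARD_VALENCES[symbol]:
        if valence >= bond_order_sum:
            # Use the lowest standard valence that fits the written bonds.
            return valence - bond_order_sum
    return 0  # written bonds already exceed every standard valence


# Ethanol, CCO: CH3, then CH2, then OH.
print(implicit_hydrogens("C", 1))  # 3
print(implicit_hydrogens("C", 2))  # 2
print(implicit_hydrogens("O", 1))  # 1
```

A nitrogen written with four bond orders, for example, falls through to the valence of 5 and receives one hydrogen, which is one reason charged species such as ammonium are better written explicitly as [NH4+].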

Bond Types and Connectivity

In SMILES notation, the default bond between adjacent aliphatic atoms is a single bond, which is implied without any symbol; for instance, the string "CC" represents ethane (C₂H₆), where the two carbon atoms are connected by a single bond. This convention simplifies the representation of linear chains, assuming standard valence rules for atoms unless otherwise specified.

Explicit bond symbols denote higher bond orders or specific bond types. A bond symbol is written between the two atoms it joins, or immediately before a ring-closure digit. Single bonds can be written explicitly as "-", double bonds as "=", triple bonds as "#", and aromatic bonds as ":". For example, "C=O" denotes formaldehyde (H₂C=O), where "=" specifies the double bond between carbon and oxygen. Aromatic bonds require lowercase symbols for the connected atoms, such as "c:c" for an aromatic bond between two carbons, as in benzene fragments, consistent with models that treat an aromatic bond as delocalized with a bond order of about 1.5. Quadruple bonds are supported in extended specifications, denoted by "$", though such bonds are rare in common organic molecules.

Connectivity is established through linear juxtaposition of atomic symbols, forming chains in which each pair of consecutive atoms is linked by the specified or default bond. Parsers infer implicit hydrogens to satisfy standard valences (e.g., carbon's valence of 4), and many toolkits additionally flag or reject atoms whose written bonds exceed those valences. Conjugated chains need no special notation: each double bond is simply written where it occurs, as in "C=CC" for propene or "C=CC=C" for 1,3-butadiene.

Structural Features

Ring Systems

In SMILES, cyclic structures are denoted by ring-closure labels, which connect two atoms that are not adjacent in the linear notation. A ring bond is specified by placing a digit (conventionally 1 to 9; OpenSMILES also accepts 0) immediately after the atom where the ring bond opens, and repeating the same digit after the atom where it closes, thereby linking those two atoms with a bond. This approach effectively "breaks" the cycle at one point and labels the endpoints numerically for reconnection. For instance, the SMILES for cyclopropane is C1CC1: the first carbon is followed by 1 to open the ring, the second carbon continues the chain, and the 1 after the third carbon closes the ring back to the first carbon with a default single bond.

The ring-closure digit applies to the immediately preceding atom. A bond type can be specified by inserting the bond symbol (such as = for double or # for triple) before the digit, which sets the order of the closing bond. Without a symbol, the bond defaults to single for aliphatic atoms or aromatic between lowercase atoms. Several ring closures can start or end at the same atom by writing multiple digits in sequence after it, enabling the representation of fused, bridged, and spiro systems. For example, norbornane, a bridged bicyclic hydrocarbon, is expressed as C1CC2CCC1C2, where digit 1 closes the first ring and digit 2 closes the second, with the shared bridgehead atoms defined implicitly by the traversal order.

Structures with several rings use a distinct digit for each ring that is open at the same time; with the customary digits 1 through 9, up to nine rings can be open simultaneously. Once a ring is closed by repeating its digit, that digit becomes available for a new ring elsewhere in the string. For molecules requiring more concurrent open rings, such as highly polycyclic frameworks, the notation uses a percent sign followed by two digits, conventionally %10 to %99, to extend the labeling capacity without ambiguity. For example, %10 can open a ring in a complex polycycle where the single digits are exhausted, and a later %10 closes it. Because closures are resolved by label as they are encountered during parsing, no nesting of rings is required in the linear string.

Fused ring systems, where two rings share a pair of adjacent atoms (and thus a bond), are written by opening a new ring from an atom within an existing ring and closing it with a distinct digit, so that the shared sequence of atoms implies the fusion. For decalin (decahydronaphthalene), a fused bicyclic system, the SMILES is C1CCC2CCCCC2C1: the first ring opens with 1, the second ring opens with 2 at the fourth carbon and closes with 2 at the ninth, and the final 1 closes the first ring, fusing the rings at the two shared carbons. Bridgehead atoms in such systems are handled through the atom sequence and multiple closures without additional notation, though stereodescriptors (covered elsewhere) may be added for stereochemistry. This digit-based approach encodes ring topologies efficiently while keeping the string linear and compact.
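The pairing of ring-closure labels, including the two-digit `%nn` form and digit reuse, can be sketched with a small scanner. This is an assumption-laden illustration: `ring_bonds` and the `TOKEN` pattern are hypothetical, atoms are simply counted in order of appearance, and ring labels are assumed to follow their atom directly, as in conventionally written SMILES.

```python
import re

# Bracket atoms, two-letter halogens, one-letter atoms, %nn labels,
# single-digit labels, then any other character (bonds, parentheses, dots).
TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[A-Za-z*]|%\d\d|\d|.")


def ring_bonds(smiles):
    """Return ring-closure bonds as (opening atom index, closing atom index)."""
    open_labels = {}   # ring label -> index of the atom that opened it
    pairs = []
    atom = -1          # index of the most recently read atom
    for token in TOKEN.findall(smiles):
        if token[0] == "[" or token.isalpha() or token == "*":
            atom += 1
        elif (token[0] == "%" and len(token) == 3) or token.isdigit():
            label = int(token.lstrip("%"))
            if label in open_labels:
                # Second occurrence: close the ring and free the label.
                pairs.append((open_labels.pop(label), atom))
            else:
                open_labels[label] = atom
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return pairs


print(ring_bonds("C1CC2CCC1C2"))     # norbornane: [(0, 5), (2, 6)]
print(ring_bonds("C1CCC2CCCCC2C1"))  # decalin: [(3, 8), (0, 9)]
print(ring_bonds("C%10CCCCC%10"))    # [(0, 5)]
```

Because a label is removed from the table as soon as it closes, the same digit can be reused later in the string, which mirrors the reuse rule described above.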

Branching Patterns

In SMILES notation, branches representing side chains or substituents are encoded by enclosing them in parentheses immediately after the atom from which they emanate. This allows tree-like molecular structures to be written without ring closures. For instance, the SMILES string CC(O)C represents 2-propanol (isopropanol), where (O) denotes a hydroxyl group branching from the central carbon atom.

Multiple branches from the same atom are written as successive parenthetical expressions. An example is CC(O)(Cl)C, which describes 2-chloro-2-propanol, with both a hydroxyl group and a chlorine atom branching from the second carbon. Each branch attaches to the atom preceding the first parenthesis, and the main chain then resumes from that same atom.

Nested branches represent more complex substituents within a branch, written as additional parenthetical expressions inside an outer pair. The depth of nesting is unlimited in principle, though practical implementations impose limits based on computational resources. For example, CC(C(C)C)C places an isopropyl branch, which itself contains a methyl branch, on the second carbon of the main chain.

After a branch ends with a closing parenthesis, the notation resumes the main chain from the atom that started the branch. This is evident in CC(=O)O, which encodes acetic acid: (=O) specifies a double-bonded oxygen branching from the second carbon, followed by the continuation to the terminal hydroxyl group. Branches can be combined with ring notation, but the parentheses themselves describe only acyclic side paths.

The core syntactic rule is that each opening parenthesis starts a branch from the immediately preceding atom, while each closing parenthesis ends the most recently opened branch, so the pairs match strictly in last-in-first-out order. This stack-based pairing ensures unambiguous reconstruction of the structure.

For disconnected structures, such as salts, complexes, or mixtures, SMILES uses a period (.) as a separator between independent components. The order of the components is arbitrary and does not imply any bonding between them. A simple example is [Na+].[Cl-], representing sodium chloride in ionic form.

Aromaticity Conventions

In the Simplified Molecular Input Line Entry System (SMILES), aromaticity in ring systems is indicated primarily by writing atoms in lowercase, distinguishing them from their aliphatic counterparts. This convention applies to sp²-hybridized atoms such as carbon (c), nitrogen (n), oxygen (o), phosphorus (p), and sulfur (s) that participate in aromatic rings, and it implies the alternating pattern of single and double bonds without spelling it out. The lowercase notation simplifies the representation by omitting bond details and relying on the parser to infer bond orders from the ring structure and atom types.

An aromatic bond between lowercase atoms can be written explicitly with a colon (:), but the symbol is almost always omitted: when two aromatic atoms are adjacent with no bond symbol, the bond is taken to be aromatic. An explicit hyphen between aromatic atoms, by contrast, denotes a true single bond, as in the link between the two rings of biphenyl, c1ccccc1-c2ccccc2. Benzene is compactly written as c1ccccc1, where the ring-closure digit 1 defines the cycle and the sequence of c atoms with implicit bonds conveys the aromatic character.

The Kekulé form is an alternative representation using uppercase atoms and explicit alternating double bonds (=), such as C1=CC=CC=C1 for benzene. It makes bond localization explicit but produces longer strings, and several Kekulé variants can describe the same molecule. The lowercase aromatic form is therefore generally preferred when generating SMILES, for its brevity and because it avoids choosing among equivalent Kekulé structures.

Aromatic notation is valid only for ring systems that satisfy specific structural rules; the common examples, such as benzene, are six-membered rings obeying Hückel's 4n+2 π-electron rule. Heteroaromatic compounds follow the same pattern, with pyridine represented as c1ccccn1, where the nitrogen (n) takes part in the aromatic cycle without disrupting the delocalized bonding. The rules extend to larger or fused systems, provided the atoms meet the hybridization and electron-count criteria defined in the specification.

Software plays a central role in handling aromaticity during SMILES parsing and generation. Tools apply algorithmic detection, often based on an extended version of Hückel's rule, to identify aromatic patterns, validate lowercase input, and convert Kekulé input to the aromatic form where appropriate. This allows both aromatic and Kekulé representations as input while standardizing output on the lowercase notation, although toolkits do not all use the same aromaticity model.
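As a rough illustration of the electron-counting idea, the following sketch applies the 4n+2 test to a single monocyclic ring written in lowercase. It is a deliberately simplified model and not the aromaticity algorithm of any particular toolkit: the function name `satisfies_huckel` and the per-atom electron contributions in `PI_ELECTRONS` are assumptions chosen for this example, and fused rings, charges, and exocyclic bonds are ignored.

```python
import re

# Pi electrons contributed by each aromatic atom type in this toy model:
# c and pyridine-type n give one electron; [nH], o, and s give a lone pair.
PI_ELECTRONS = {"c": 1, "n": 1, "[nH]": 2, "o": 2, "s": 2}


def satisfies_huckel(ring_smiles):
    """Check the 4n+2 rule for one aromatic ring written in lowercase."""
    tokens = re.findall(r"\[nH\]|[cnos]", ring_smiles)
    electrons = sum(PI_ELECTRONS[t] for t in tokens)
    # 4n+2 with n = 0, 1, 2, ... means 2, 6, 10, ...
    return electrons >= 2 and (electrons - 2) % 4 == 0


print(satisfies_huckel("c1ccccc1"))     # benzene, 6 electrons: True
print(satisfies_huckel("c1ccccn1"))     # pyridine, 6 electrons: True
print(satisfies_huckel("c1cc[nH]c1"))   # pyrrole, 6 electrons: True
print(satisfies_huckel("c1ccc1"))       # 4 electrons: False
```

Real perception code works on the parsed graph, ring by ring, and must also decide which atoms are eligible at all; the sketch only shows why a four-membered ring of aromatic carbons is rejected while benzene, pyridine, and pyrrole pass.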

Advanced Specifications

Stereochemical Descriptors

SMILES incorporates stereochemical information to distinguish between isomers through isomeric SMILES strings, which specify configurations around double bonds and chiral centers and so represent a limited amount of three-dimensional information without coordinate data. Directional symbols describe bond orientations and chiral indicators describe atomic configurations, and both are interpreted relative to the order in which atoms appear in the string. Such descriptors are essential for molecules like substituted alkenes and chiral compounds, where spatial arrangement affects chemical properties.

Double-bond stereochemistry, representing cis/trans or E/Z configurations, is denoted by the symbols / and \, which act as directional single bonds adjacent to the double bond (=). They indicate the relative orientation of substituents on the two atoms of the double bond: when the directional bond before the double bond and the one after it are written in the chain with the same symbol (both / or both \), the substituents are trans, and when the symbols differ they are cis. For instance, F/C=C/F represents trans-1,2-difluoroethene, with the fluorine atoms on opposite sides, while F/C=C\F denotes the cis isomer. The specification requires a directional bond on each side of the double bond, and the parser determines the configuration from whether those bonds point the same way or opposite ways along the traversal.

Tetrahedral stereochemistry at chiral centers, typically carbon atoms with four different substituents, uses @ for an anticlockwise configuration and @@ for a clockwise one, written after the element symbol inside the atom's square brackets. The configuration is defined relative to the order of neighbors in the SMILES string: looking from the first neighbor (the preceding atom) toward the chiral center, @ means the remaining three neighbors appear anticlockwise in the order they are written, and @@ means they appear clockwise. An implicit hydrogen on the center counts as the first of those remaining neighbors. An example is N[C@@H](C)C(=O)O for L-alanine: viewed from the amino nitrogen, the hydrogen, the methyl carbon, and the carboxyl carbon run clockwise. The notation applies to any tetrahedral atom, not just carbon.

Extensions for allene-type axial chirality and for square planar complexes build on these symbols. For allenes, which contain cumulated double bonds as in propadiene derivatives, the stereodescriptor @ or @@ is placed on the central sp-hybridized carbon and specifies the relative twist of the two perpendicular planes formed by the end substituents. For example, NC(Br)=[C@]=C(O)C denotes one specific enantiomer of an allene. Square planar geometry, common in coordination compounds, uses @SP1, @SP2, or @SP3, which state whether the neighbors, taken in the order written, trace a U, a 4, or a Z shape around the central atom; an example is [Pt@SP1](Cl)(Br)(I)N for a platinum complex in the U-shaped arrangement. In every case the meaning depends on the neighbor order given by the string's linear traversal.
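For the simplest way of writing a stereo double bond, the cis/trans reading of the directional symbols can be sketched in a few lines. The helper `double_bond_geometry` is hypothetical and handles only the in-chain form X/A=B/Y, where both directional bonds sit in the main chain directly beside the double-bond atoms; branch-first forms such as C(/F)=C/F follow a different reading and are returned as `None` rather than guessed.

```python
import re

# One atom: a bracket atom, a two-letter halogen, or a one-letter organic atom.
ATOM = r"(?:\[[^\]]+\]|Cl|Br|[BCNOPSFI])"
# directional bond, atom, double bond, atom, directional bond
PATTERN = re.compile(rf"([/\\]){ATOM}={ATOM}([/\\])")


def double_bond_geometry(smiles):
    """Return 'cis', 'trans', or None for the first in-chain stereo double bond."""
    match = PATTERN.search(smiles)
    if match is None:
        return None          # no stereo information in the supported form
    first, second = match.groups()
    # Same symbol on both sides: substituents on opposite sides (trans).
    return "trans" if first == second else "cis"


print(double_bond_geometry("F/C=C/F"))    # trans
print(double_bond_geometry("F/C=C\\F"))   # cis
print(double_bond_geometry("FC=CF"))      # None
```

A full implementation would resolve the symbols on the parsed graph, since the same configuration can be written with the directional bonds in branches or on ring closures.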

Isotopic Labels

In SMILES, isotopic labels are specified with a numeric prefix giving the mass number, placed immediately before the atomic symbol inside the square brackets of the affected atom. This notation allows specific isotopes to be represented without altering the rest of the structural description. For instance, [13C] represents carbon-13, while an atom written without a prefix, such as plain C, has no specified isotope and is read as the element at its natural isotopic abundance.

The isotope prefix is a plain integer and may carry leading zeros, though they are optional; thus [2H], [02H], and [002H] all denote deuterium (hydrogen-2). The rule applies to any element in the periodic table, organic or inorganic. Common cases include hydrogen ([2H] for deuterium or [3H] for tritium), carbon ([13C]), nitrogen ([15N]), and oxygen ([17O] or [18O]), which are frequently used in labeled compounds. Further examples are [2H]O[2H] for heavy water (deuterium oxide) and [235U] for uranium-235.

Isotopic designations do not influence the atom's valence, hybridization, or bonding, as these properties are governed solely by the element; the isotope serves only to distinguish mass variants. This keeps the notation compatible with the standard valence rules while supporting isotopically substituted molecules, which are important in applications such as nuclear magnetic resonance (NMR) spectroscopy and labeling studies that trace metabolic pathways or reaction mechanisms. For example, [13CH4] specifies carbon-13-labeled methane, which can serve as a labeled compound in NMR experiments.

The notation also combines with stereochemical descriptors when an isotope creates asymmetry by differentiating otherwise identical substituents. In such cases the isotope is part of the atom's identity when the configuration is determined, so configurations that exist only because of a mass difference can be specified. A representative example is [2H][C@H](F)Cl, a chiral chlorofluoromethane in which one hydrogen is replaced by deuterium and the @ symbol gives the tetrahedral configuration at the carbon center. This capability is relevant for studying isotopically induced chirality in biochemical or synthetic contexts.
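The fields of a bracket atom, including the isotope prefix with optional leading zeros, can be pulled apart with a regular expression. This is a sketch for common cases only: `parse_bracket_atom` and the `BRACKET_ATOM` pattern are hypothetical, charges must be written as a sign with an optional number (`+`, `-2`), and extended chirality classes and atom-class labels are not handled.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"                     # optional mass number
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)"   # element, aromatic symbol, or wildcard
    r"(?P<chiral>@{1,2})?"                     # optional @ or @@
    r"(?P<hcount>H\d?)?"                       # optional hydrogen count
    r"(?P<charge>[+-]\d*)?\]"                  # optional charge
)


def _charge(text):
    """Convert '+', '-', '+2', '-2' (or None) to an integer charge."""
    if not text:
        return 0
    return int(text) if len(text) > 1 else int(text + "1")


def parse_bracket_atom(text):
    """Split a bracket atom such as '[13CH4]' into its fields."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    hcount = match["hcount"]
    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "chirality": match["chiral"],
        # 'H' alone means one hydrogen; no H field means zero.
        "hydrogens": int(hcount[1:] or 1) if hcount else 0,
        "charge": _charge(match["charge"]),
    }


print(parse_bracket_atom("[13CH4]"))
# {'isotope': 13, 'symbol': 'C', 'chirality': None, 'hydrogens': 4, 'charge': 0}
print(parse_bracket_atom("[2H]")["isotope"] == parse_bracket_atom("[002H]")["isotope"])
# True
```

Converting the prefix with `int` is what makes [2H], [02H], and [002H] compare equal, and a missing prefix is kept as `None` rather than replaced with a default mass.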

Extensions for Complex Molecules

Extensions for complex molecules in the Simplified Molecular Input Line Entry System (SMILES) encompass optional notations that go beyond the core specification to address chemical reactions, polymeric structures, and substructure queries, though these features are not standardized in OpenSMILES and their implementation can differ across software tools. These extensions enable representation of dynamic processes and large-scale assemblies that are challenging with basic SMILES strings, facilitating applications in and . Reaction SMILES uses the '>' to delineate , optional agents, and products in a chemical transformation. For instance, the of is denoted as C=C.O>>CCO, where the left side lists the and as reactants, and the right side shows as the product. Another example is the C=CCBr>>C=CCI, representing the conversion of to allyl iodide without specified agents. This notation supports the depiction of multi-component reactions and is widely adopted in reaction databases and simulation software. For polymers, extensions employ asterisks (*) to mark the endpoints of repeating units, allowing concise description of chain-like macromolecules. Polyethylene, for example, is represented by the repeating unit [*]CC[*], where the asterisks indicate sites for inter-unit connections. This approach, seen in tools like PSMILES, builds on Daylight SMILES syntax while accommodating the connectivity of long chains. SMARTS (SMILES Arbitrary Target Specification) serves as a query language extension, enhancing SMILES with pattern-matching capabilities for substructure searches. It introduces wildcards such as '?' to match any organic subset atom (e.g., carbon, nitrogen, oxygen, phosphorus, or sulfur) and '*' to match any non-hydrogen atom. For example, the pattern C?O identifies carbon-oxygen single bonds where the oxygen is attached to any organic atom, useful for querying functional groups across molecular datasets. 
SMARTS also supports logical operators ('!' for not, '&' and ';' for and, ',' for or) for more complex queries, making it essential for virtual screening and database filtering.

In the 2020s, ongoing proposals aim to expand SMILES support for biomolecules, such as representing peptides as extended chains with repeating units and modifications. The IUPAC SMILES+ initiative, launched in 2019, seeks to formalize these extensions into a comprehensive standard. As of 2025, the project is in final review stages, with a recommendation expected for publication in Pure and Applied Chemistry later in the year. Similarly, BigSMILES provides a structured notation for polymers that can extend to biopolymers like polypeptides, using descriptors for repeating units to capture sequence variability. These developments address limitations in core SMILES for handling the scale and diversity of biological macromolecules.

Practical Examples

Simple Molecular Strings

Simple molecular strings in SMILES notation provide a straightforward way to encode small, acyclic molecules by sequencing atomic symbols, with default single bonds between adjacent atoms and implicit hydrogen atoms added to satisfy standard valences (carbon: 4, nitrogen: 3, oxygen: 2). This approach relies on the organic subset of elements, where uppercase letters denote aliphatic atoms without explicit charges or isotopes, and no rings or stereocenters are indicated.

The simplest example is methane, represented as C. This single carbon atom is parsed as the only node in the molecular graph, with four implicit hydrogens attached to satisfy tetravalent carbon, forming CH₄; no bonds are specified since there are no adjacent atoms. For water, the SMILES string O denotes a single oxygen atom, interpreted as a node with two implicit hydrogens bonded to it, yielding H₂O; the parser recognizes the divalent nature of organic-subset oxygen and adds hydrogens accordingly. Ammonia is encoded as N, where the nitrogen atom serves as the graph's sole vertex, augmented by three implicit hydrogens to match trivalent nitrogen, resulting in NH₃.

Ethanol's SMILES CCO maps to a linear chain: the first C is a carbon with three implicit hydrogens (CH₃-), connected by a default single bond to the second C (with two implicit hydrogens, -CH₂-), which bonds to O (with one implicit hydrogen, -OH); this sequential parsing builds the acyclic structure CH₃CH₂OH. Ethene uses C=C, parsed as two carbon atoms connected by an explicit double bond: each C receives two implicit hydrogens (H₂C=CH₂), with the = symbol overriding the default single bond to define the unsaturated bond.

Acetone is represented by CC(=O)C, where parsing proceeds left to right: the first C (CH₃-) bonds singly to the second C (the carbonyl carbon, with no implicit hydrogens), from which a branch (=O) attaches oxygen via a double bond (no hydrogens on O), and the second C then connects to a final C (CH₃-); this constructs the graph CH₃C(=O)CH₃, using parentheses to denote the off-chain carbonyl oxygen.
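The implicit-hydrogen bookkeeping walked through above can be sketched as a small parser for acyclic, organic-subset strings. This is an illustrative sketch only: the names `NORMAL_VALENCES`, `BOND_ORDERS`, and `implicit_hydrogens` are hypothetical, the valence table follows the usual organic-subset convention, and rings, aromatic atoms, bracket atoms, and disconnected fragments are deliberately not handled.

```python
# Normal valences for organic-subset atoms (assumed table for this sketch).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}
BOND_ORDERS = {"-": 1, "=": 2, "#": 3}


def implicit_hydrogens(smiles):
    """Return [(symbol, implicit_H_count), ...] for an acyclic SMILES string."""
    atoms = []          # element symbol of each atom, in string order
    bond_sum = []       # sum of explicit bond orders at each atom
    branch_stack = []   # atoms to return to when a ')' closes a branch
    prev = None         # index of the atom the next atom bonds to
    pending = 1         # order of the next bond (single unless stated)
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDERS:
            pending = BOND_ORDERS[ch]
            i += 1
        elif ch == "(":
            branch_stack.append(prev)
            i += 1
        elif ch == ")":
            prev = branch_stack.pop()
            i += 1
        else:
            two = smiles[i:i + 2]
            symbol = two if two in NORMAL_VALENCES else ch
            if symbol not in NORMAL_VALENCES:
                raise ValueError(f"unsupported character {ch!r} at position {i}")
            atoms.append(symbol)
            bond_sum.append(0)
            index = len(atoms) - 1
            if prev is not None:
                bond_sum[prev] += pending
                bond_sum[index] += pending
            prev = index
            pending = 1
            i += len(symbol)

    result = []
    for symbol, used in zip(atoms, bond_sum):
        # Fill up to the smallest normal valence that accommodates the bonds.
        target = next((v for v in NORMAL_VALENCES[symbol] if v >= used), used)
        result.append((symbol, target - used))
    return result


if __name__ == "__main__":
    for smi in ("C", "O", "N", "CCO", "C=C", "CC(=O)C"):
        print(smi, implicit_hydrogens(smi))
```

For acetone, the sketch assigns three hydrogens to each methyl carbon and none to the carbonyl carbon or the oxygen, matching the manual parse.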

Elaborate Structure Illustrations

To illustrate the application of SMILES notation to more complex molecular structures, consider the representation of benzene, a fundamental aromatic hydrocarbon. The SMILES string for benzene is c1ccccc1, where lowercase letters denote aromatic atoms (carbon in this case), and the two occurrences of the digit 1 indicate the closure of a six-membered ring by connecting the first and last atoms. This compact notation captures the delocalized π-electron system without explicit double bonds, following the aromaticity convention in which alternating single and double bonds are implied but not written out.

A more elaborate example is aspirin (acetylsalicylic acid), which combines an aromatic ring, branches, and functional groups. Its SMILES is CC(=O)Oc1ccccc1C(=O)O. Parsing begins with the acetyl group CC(=O)O, where C is a methyl carbon bonded to another C (the carbonyl carbon), with =O indicating a double bond to oxygen; this attaches via the ester oxygen O to the aromatic ring c1ccccc1. The ring closes at the second digit 1, and the final C(=O)O is attached to that ring-closing carbon, which is adjacent to the ester-bearing carbon, and represents the carboxylic acid group. The lowercase c atoms imply sp² hybridization and delocalized bonding, while uppercase C and O denote aliphatic atoms. This string combines branching with parentheses and ring closure to depict the ortho-substituted benzoic acid derivative.

For stereochemistry in complex structures, ibuprofen (2-(4-(2-methylpropyl)phenyl)propanoic acid) serves as a chiral example, with the biologically active S-enantiomer specified in SMILES as CC(C)CC1=CC=C(C=C1)[C@H](C)C(=O)O. The parsing starts with the isobutyl chain CC(C)C, where the first C bonds to a carbon carrying a methyl branch (C) and then to a methylene C, connecting to the para-substituted aromatic ring C1=CC=C(C=C1). Uppercase C in the ring indicates the Kekulé form with explicit double bonds (=), though the aromatic notation c1ccc(cc1) is equivalent. The chiral center is marked by [C@H], followed by the methyl branch (C) and C(=O)O. With the neighbors written in this order (ring carbon, implicit hydrogen, methyl, carboxyl), the @ descriptor corresponds to the S configuration; writing @@ with the same atom order would encode the R enantiomer instead. This notation shows how SMILES embeds stereochemical information at asymmetric carbons using bracket-atom specifications and neighbor-ordering rules.

Caffeine, a purine alkaloid with fused rings and multiple branches, exemplifies multi-component integration in SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C. The string opens with the N-methyl group CN1, where N1 is a nitrogen carrying ring-closure label 1; it connects to C=NC2=, forming part of the five-membered imidazole ring, with C2 opening ring-closure label 2. The next atom, C1, closes the five-membered ring, so the C2=C1 bond is shared by both rings and marks the fusion. The string continues with a carbonyl C(=O), an N-methylated nitrogen, a second carbonyl C(=O), and N2C, which closes the six-membered ring and adds another methyl group. Unsaturation is written explicitly with =, and the structure relies on ring closures, branches, and heteroatoms to represent the purine core with three methyl substituents at positions 1, 3, and 7. This parsing demonstrates how SMILES handles polycyclic systems by layering ring-closure digits and parentheses for connectivity.
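The way ring-closure digits and parentheses translate into a molecular graph can be sketched as follows. This is a simplified illustration, not a full OpenSMILES parser: the names `TOKEN`, `smiles_to_graph`, and `ring_count` are hypothetical, bond orders and stereo marks are read but discarded, and dot-separated fragments are rejected.

```python
import re

# Tokens: bracket atoms, organic-subset atoms, ring labels, bonds, branches.
TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[-=#:/\\()]"
)
BOND_SYMBOLS = set("-=#:/\\")


def smiles_to_graph(smiles):
    """Return (atoms, bonds) where bonds are pairs of atom indices."""
    atoms, bonds = [], []
    open_rings = {}     # ring label -> index of the atom that opened it
    branch_stack = []   # atoms to return to after a closing parenthesis
    prev = None
    pos = 0
    for match in TOKEN.finditer(smiles):
        if match.start() != pos:
            raise ValueError(f"unsupported character at position {pos}")
        pos = match.end()
        tok = match.group()
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in BOND_SYMBOLS:
            continue    # bond order is ignored in this connectivity sketch
        elif tok[0] == "%" or tok.isdigit():
            label = tok.lstrip("%")
            if label in open_rings:
                # Second occurrence of a label: add the ring-closing bond.
                bonds.append((open_rings.pop(label), prev))
            else:
                open_rings[label] = prev
        else:
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
    if pos != len(smiles) or open_rings or branch_stack:
        raise ValueError("incomplete or malformed SMILES")
    return atoms, bonds


def ring_count(atoms, bonds):
    """Number of independent rings in a connected graph: bonds - atoms + 1."""
    return len(bonds) - len(atoms) + 1


if __name__ == "__main__":
    for smi in ("c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O",
                "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"):
        atoms, bonds = smiles_to_graph(smi)
        print(smi, len(atoms), len(bonds), ring_count(atoms, bonds))
```

For caffeine, the two ring-closure labels contribute the two bonds beyond a spanning tree, so the sketch reports 14 heavy atoms, 15 bonds, and 2 rings.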

Applications and Implementations

Software Tools and Libraries

The Daylight SMILES Toolkit, developed by Daylight Chemical Information Systems, serves as the foundational proprietary library for handling SMILES notation, enabling the generation and parsing of molecular strings through utility objects like streams and substructures. Originally created in the 1980s, it remains a core reference for SMILES implementation, and some of its components have influenced open-source alternatives.

RDKit, an open-source cheminformatics toolkit available in Python and C++, provides extensive support for SMILES parsing, writing, and canonicalization, including handling of stereochemistry and aromaticity as per the OpenSMILES specification. It facilitates molecule manipulation and fingerprint generation from SMILES inputs, making it widely used in computational chemistry workflows.

Open Babel, a free and open-source software suite, acts as a multi-format converter that processes SMILES strings for input and output, supporting extensions for radicals and implicit hydrogens while adhering to the OpenSMILES specification. It enables interconversion between SMILES and over 100 other chemical file formats, aiding data exchange for molecular modeling.

For web-based applications, SmilesDrawer is a small, dependency-free JavaScript component designed specifically for parsing SMILES strings and rendering 2D molecular structures client-side with high performance and low memory usage. Released under the MIT license, it supports customizable visualizations and is suitable for interactive environments.

Online validation and translation of SMILES are facilitated by the NIH's Chemical Identifier Resolver, a web service that converts SMILES inputs to structural depictions, identifiers, and other formats, thereby verifying syntactic and semantic correctness. This tool, hosted by the NCI/CADD group, processes queries without requiring software installation and supports batch operations for large datasets.

Integration in Cheminformatics and Machine Learning

In cheminformatics, SMILES strings serve as a foundational input for generating molecular fingerprints, which are binary vectors encoding structural features to enable efficient similarity searches across large chemical databases. These fingerprints, such as the Extended Connectivity Fingerprints (ECFP), are derived by parsing the SMILES representation into a molecular graph and identifying substructural patterns, allowing quantitative comparison via metrics like Tanimoto similarity. This approach facilitates virtual screening in drug discovery by identifying structurally analogous compounds, with studies demonstrating that SMILES-derived fingerprints achieve high recall rates in retrieving known actives from databases like PubChem, which as of 2025 stores over 119 million unique compounds, each available as a SMILES string.

In machine learning applications, SMILES has become a preferred textual representation for training transformer-based models on molecular data, enabling self-supervised pretraining for downstream tasks like property prediction. For instance, ChemBERTa-2, a RoBERTa-based model pretrained on up to 77 million SMILES strings, uses masked language modeling to learn contextual embeddings that reportedly outperform traditional descriptors on some downstream prediction tasks. Recent advancements incorporate contrastive learning on SMILES datasets to enhance robustness; the CONSMI framework (2024) uses SMILES enumeration to generate positive pairs for contrastive objectives, yielding embeddings reported to improve molecular similarity tasks by 10-15% over non-contrastive baselines on benchmark subsets. Similarly, 2023-2025 studies on contrastive methods, such as SimSon, integrate multi-view learning across SMILES variants to capture structural invariances, boosting performance in property prediction by addressing representational ambiguities.

SMILES integration in de novo molecular design prominently features generative models that output novel SMILES strings conditioned on desired properties, streamlining molecule creation.
Variational autoencoders and GANs trained on large SMILES corpora produce candidates with validity rates exceeding 90% after sanitization, as reported for reinforcement learning frameworks whose rewards incorporate chemical feasibility. Post-generation validity checking is critical and typically involves RDKit or Open Babel parsers to detect syntactic errors or invalid valences.
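The Tanimoto comparison of fingerprints mentioned earlier in this section reduces to a ratio of shared to total on-bits, T(A, B) = |A ∩ B| / |A ∪ B|. The sketch below assumes each fingerprint is given as a collection of on-bit indices; the function name `tanimoto` and the example bit sets are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here, not a universal rule.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    # Assumption: two empty fingerprints are treated as dissimilar (0.0).
    return len(a & b) / union if union else 0.0


if __name__ == "__main__":
    # Hypothetical on-bits for two molecules, not real ECFP output.
    query = {1, 5, 9, 12}
    candidate = {1, 5, 7, 12, 20}
    print(tanimoto(query, candidate))
```

In practice the bit sets would come from a fingerprinting routine in a toolkit rather than being written by hand; only the comparison step is shown.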

Limitations and Comparisons

Inherent Constraints

One inherent limitation of SMILES is that it allows multiple string representations of the same molecular structure, because the notation is not inherently canonical, which can lead to ambiguities in database searches and comparisons. For instance, 1-propanol can be encoded as CCCO or OCCC, among other variants, requiring additional canonicalization steps for uniqueness. This non-uniqueness extends to tautomers, where different tautomeric forms, such as keto-enol pairs, are represented by distinct SMILES strings, hindering a unified depiction of such structures.

SMILES primarily captures two-dimensional connectivity and bonding, providing only partial support for stereochemistry without incorporating full three-dimensional spatial data. Stereochemical features are denoted using symbols like @ for counterclockwise and @@ for clockwise arrangements at tetrahedral centers, but these do not account for conformational dynamics or precise 3D coordinates.

As molecular size increases, SMILES strings grow lengthy and complex, often resulting in parsing challenges and errors when processing long or malformed inputs. Common failures, as observed in tools like RDKit, include syntax violations, unclosed rings, mismatched parentheses, and valence inconsistencies, with invalid SMILES comprising up to 89% of outputs from certain generative models.

SMILES is fundamentally designed for static molecular graphs and lacks native support for quantum states, electronic configurations, or dynamic processes like bond vibrations. For macromolecules such as polymers or proteins, standard SMILES becomes inadequate, necessitating extensions like BigSMILES to handle repeating units and stochastic elements. Recent analyses, particularly in machine learning contexts, critique SMILES for its structural ambiguities and limited expressiveness, proposing algebraic data types as a more robust alternative for encoding molecular hierarchies and properties.
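Two of the failure modes listed above, unclosed rings and mismatched parentheses, can be caught by a lightweight pre-check before a string is handed to a full parser. The sketch below is an assumption-laden illustration: the function name `syntax_problems` and its messages are hypothetical, it does not check valences or element symbols, and it is not a substitute for a toolkit parser.

```python
def syntax_problems(smiles):
    """Return a list of simple structural syntax problems (empty if none)."""
    problems = []
    depth = 0             # current parenthesis nesting depth
    open_rings = set()    # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside brackets are isotopes, H counts or charges,
            # not ring closures, so the whole bracket atom is skipped.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("')' without matching '('")
                depth = 0
        elif ch == "%" or ch.isdigit():
            label = smiles[i + 1:i + 3] if ch == "%" else ch
            open_rings ^= {label}   # toggle: open on first use, close on second
            if ch == "%":
                i += 2
        i += 1
    if depth > 0:
        problems.append("unclosed '('")
    if open_rings:
        problems.append(
            "unclosed ring closure(s): " + ", ".join(sorted(open_rings))
        )
    return problems


if __name__ == "__main__":
    for smi in ("c1ccccc1", "C1CCC", "CC(C", "CC)C", "[13CH4]"):
        print(smi, syntax_problems(smi))
```

A string that passes this check can still be chemically invalid, for example through an impossible valence, so a toolkit-level sanitization step remains necessary.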

Alternatives to SMILES Notation

The International Chemical Identifier (InChI) is an IUPAC-endorsed standard for encoding chemical structures in a unique, canonical string format that ensures a single representation per compound, addressing SMILES' potential for multiple equivalent notations. InChI organizes information into distinct layers covering main-layer connectivity, tetrahedral stereochemistry, and isotopic specifications, enabling precise differentiation of isomers and variants. While more verbose and less intuitive for manual input than SMILES, this layered approach facilitates robust database storage and retrieval without separate normalization steps.

In contrast, the MOL and SDF file formats provide connection table representations that explicitly include atomic coordinates for 2D or 3D conformations, along with bond details and optional properties. Originating from MDL Information Systems, MOL files describe single molecules, while SDF extends this to multiple entries with associated data, making them well suited to structure visualization but resulting in larger file sizes compared to the compact linear strings of SMILES. These coordinate-based formats preserve spatial arrangements critical for applications such as molecular simulations, though they demand more storage and parsing overhead.

SMILES stands out for its human-readable syntax, resembling traditional structural formulas, which simplifies manual creation and editing by chemists, and for its straightforward generation from graphical depictions. A key drawback is the absence of built-in canonicalization, necessitating additional processing to achieve uniqueness akin to InChI. In machine learning workflows, SMILES' string-based linearity supports rapid tokenization and input feeding into models, which is simpler than working with graph-oriented formats like RDF triples used in semantic chemical databases, where query resolution involves heavier relational traversals.
As of 2024, reviews of generative models in drug design continue to favor SMILES for its seamless integration with language-model architectures and broad software ecosystem, even as alternatives like SELFIES, designed to produce only valid molecular strings, offer improved syntactic reliability during optimization tasks.
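The tokenization step that makes SMILES convenient for sequence models can be sketched with a regular expression that keeps bracket atoms and two-letter elements as single tokens, followed by a simple vocabulary lookup. The names `SMILES_TOKEN`, `tokenize`, `build_vocab`, and `encode`, and the `<pad>` and `<unk>` entries, are hypothetical choices for this illustration; published models use their own token patterns and vocabularies.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atom, kept as one token
    r"|Br|Cl"              # two-letter organic-subset atoms
    r"|[BCNOPSFI]"         # one-letter aliphatic atoms
    r"|[bcnops]"           # aromatic atoms
    r"|%\d{2}|\d"          # ring-closure labels
    r"|[-=#$:/\\.()>*]"    # bonds, dot, branches, reaction arrow, wildcard
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def build_vocab(corpus):
    """Assign an integer id to every token seen in a list of SMILES strings."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for smiles in corpus:
        for tok in tokenize(smiles):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(smiles, vocab):
    """Map a SMILES string to token ids, using <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(smiles)]


if __name__ == "__main__":
    vocab = build_vocab(["CCO", "C=C"])
    print(tokenize("[C@@H](Cl)Br"))
    print(encode("OC=C", vocab))
```

Keeping [C@@H] or Cl as one token, rather than splitting character by character, preserves the chemical meaning of each symbol in the model's input sequence.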
