SMILES arbitrary target specification
SMiles ARbitrary Target Specification (SMARTS) is a cheminformatics language designed for expressing and matching substructural patterns in molecules, extending the Simplified Molecular Input Line Entry System (SMILES) to support pattern recognition rather than just exact molecular representations.[1] Developed by Daylight Chemical Information Systems, SMARTS incorporates logical operators, recursive definitions, and specialized symbols for atoms and bonds, allowing users to define complex queries for substructure searching in chemical databases and in applications such as drug discovery and molecular modeling.[1] Unlike SMILES, which primarily encodes complete molecular structures as linear strings, SMARTS treats every atom and bond as a pattern matcher, enabling the specification of features like aromaticity, chirality, and connectivity with greater flexibility.[1] Key features include atomic primitives (e.g., [C] for aliphatic carbon or [#6] for carbon by atomic number), bond descriptors (e.g., ~ for any bond type), and logical constructs such as ! (negation), & (conjunction), and , (disjunction) to build intricate substructure rules.[1] Recursive SMARTS further enhance expressiveness by letting a nested pattern describe an atom's environment, such as identifying atoms connected to a methyl group via [$(*C)].[1] SMARTS has been integral to computational chemistry tools since its introduction, supporting substructure-based screening and reaction prediction, and it is implemented in software such as Jmol and the Cambridge Structural Database (CSD) Python API.[2][3] Its adoption stems from the need for precise, machine-readable queries in large-scale chemical data analysis, with extensions over time including chirality support in version 4.1 and component-level grouping for disconnected structures.[1]
Introduction to SMARTS
Definition and Purpose
SMARTS, or SMILES Arbitrary Target Specification, is a line notation language that extends the Simplified Molecular Input Line Entry System (SMILES) to define arbitrary substructural patterns in molecules, facilitating the identification of specific features such as functional groups or pharmacophores within larger chemical structures.[1] Developed as a query mechanism, SMARTS allows users to symbolically represent atoms, bonds, and connectivity rules that can match against molecular graphs, treating molecules as abstract graphs where nodes (atoms) and edges (bonds) are queried for patterns rather than exact structures.[1] This extension enables flexible substructure searching, where unspecified elements are handled through wildcards and variables, permitting the specification of partial or generalized targets without requiring complete molecular descriptions.[1]
In cheminformatics, the primary purpose of SMARTS is to support efficient database querying for substructure retrieval, where patterns are matched against vast collections of compounds to identify molecules containing desired motifs, such as in drug discovery pipelines.[1] It also underpins similarity analysis by quantifying structural overlaps between query patterns and database entries, aiding in the prioritization of lead compounds based on shared substructural features.[4] Furthermore, SMARTS contributes to reaction prediction by defining reactive sites or transformation templates, allowing computational models to anticipate outcomes in synthetic pathways through pattern-based matching of reactants and products.[5] These applications leverage SMARTS's role as a standardized, expressive query language that bridges linear notations with graph-based molecular representations, enhancing automation in chemical data processing.[1]
Historical Development and Relation to SMILES
SMARTS was developed in the late 1980s by David Weininger at Daylight Chemical Information Systems, Inc., as an extension of the SMILES notation specifically designed to facilitate substructure searching in molecular databases.[6] Weininger passed away in 2016.[6] This innovation built directly on SMILES, which Weininger had earlier created for compact representation of complete chemical structures, but SMARTS introduced capabilities for defining abstract patterns rather than fixed molecules.[1] Daylight, founded in 1987 by Weininger and Josef Taitz, commercialized these tools to address growing needs in computational chemistry for efficient querying of chemical information systems.[7]
While SMILES focuses on generating a unique, canonical string for an entire molecule to enable isomorphism checks and database storage, SMARTS diverges by supporting variable atoms and bonds, logical operators, and recursive patterns to match substructures flexibly within larger molecules.[1] For instance, SMILES might encode benzene as c1ccccc1 for precise depiction, whereas SMARTS could use a pattern such as a1aaaaa1 to query any six-membered aromatic ring, allowing wildcards and conditions that the rigid syntax of SMILES cannot express.[1] This distinction arose from the recognition that substructure searches required expressive power beyond SMILES's canonicalization, enabling applications like pharmacophore identification and reaction prediction in cheminformatics workflows.[6]
A pivotal advancement came in the 1990s with refinements to both SMILES and SMARTS, including the addition of primitives for chirality, which expanded pattern-matching precision alongside evolving SMILES versions to handle complex queries previously unsupported.[1] By the early 2000s, SMARTS was fully integrated into Daylight's C++ Toolkit, a comprehensive library for chemical computation that provided APIs for parsing, searching, and manipulating patterns, solidifying its role in industrial and academic software.[7] This integration spurred widespread adoption, with cheminformatics communities contributing to its de facto standardization through shared implementations and extensions, despite no formal governing body.[4]
Core Syntax Elements
Atomic Properties and Symbols
In SMARTS, atoms are primarily specified using elemental symbols or enclosed in square brackets [ ] to define explicit properties, allowing for precise matching in substructure searches. The organic subset of atoms, such as C, N, O, P, S, B, and halogens (F, Cl, Br, I), can be written without brackets for aliphatic forms or in lowercase (e.g., c for aromatic carbon) to denote aromaticity.[1] General atoms beyond the organic subset require bracketed notation with their symbols, such as [Na] for sodium.[1]
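The bracket rule above can be sketched as a small helper. This is only an illustration: the function name write_atom and both sets are assumptions of the sketch, and the list of elements allowed as bare lowercase aromatic symbols follows the OpenSMILES convention rather than anything stated in this article.

```python
# Hypothetical helper illustrating when an atom symbol needs square brackets.
ORGANIC_SUBSET = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}
# Assumption: elements that may appear as bare lowercase aromatic symbols.
AROMATIC_CAPABLE = {"B", "C", "N", "O", "P", "S"}


def write_atom(symbol: str, aromatic: bool = False) -> str:
    """Return the SMARTS text for a plain element symbol."""
    if aromatic:
        if symbol in AROMATIC_CAPABLE:
            return symbol.lower()          # e.g. "C" -> "c"
        return f"[{symbol.lower()}]"       # e.g. "Se" -> "[se]"
    if symbol in ORGANIC_SUBSET:
        return symbol                      # bare aliphatic symbol
    return f"[{symbol}]"                   # everything else is bracketed


print(write_atom("C"), write_atom("C", aromatic=True), write_atom("Na"))
```

Called on carbon, aromatic carbon, and sodium, the helper returns the bare symbol, the lowercase symbol, and the bracketed form respectively.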
Key atomic properties include charge, specified as a sign optionally followed by an integer within the brackets, such as [+1] for a single positive charge or [-] for a single negative charge; multiple charges can also be denoted by repetition, like [++] for +2.[1] Hydrogen count is indicated by H followed by the number of attached hydrogens, as in [CH3] for a carbon bearing three hydrogens.
Bond Types and Specifications
In SMARTS, bonds are specified using symbols that denote their type, order, and properties, allowing for precise substructure queries in molecular patterns. The basic bond symbols include - for a single (aliphatic) bond, = for a double bond, # for a triple bond, and : for an aromatic bond.[1] These symbols extend the SMILES notation to support querying specific connectivity between atoms, such as in C=C for an ethene-like double bond or c1ccccc1 where implicit : bonds define benzene's aromatic ring.[1]
For flexible matching, the tilde ~ serves as a wildcard for any bond type, enabling broad queries like [#6]~[#6] to match any connection between carbon atoms regardless of order.[1] Absent a bond symbol, SMARTS defaults to "single or aromatic," as seen in cc matching adjacent aromatic carbons or CC matching aliphatic single-bonded carbons.[1] Additionally, @ denotes any ring bond (version 4.6+), useful for identifying cyclic connections without specifying order.[1]
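The bond primitives described so far can be summarized as a small predicate. This is a minimal sketch, not a toolkit API: the Bond record and the bond_matches function are hypothetical names, and the target bond is assumed to carry its order, aromaticity, and ring membership already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    """Hypothetical description of a bond in the target molecule."""
    order: int              # 1, 2 or 3; ignored when aromatic is True
    aromatic: bool = False
    in_ring: bool = False


def bond_matches(symbol: str, bond: Bond) -> bool:
    """Test one SMARTS bond primitive ("" means no symbol was written)."""
    if symbol == "":                      # default: single OR aromatic
        return bond.aromatic or bond.order == 1
    if symbol == "~":                     # wildcard: any bond
        return True
    if symbol == "@":                     # any ring bond
        return bond.in_ring
    if symbol == ":":                     # aromatic bond
        return bond.aromatic
    orders = {"-": 1, "=": 2, "#": 3}
    if symbol in orders:                  # explicit non-aromatic order
        return not bond.aromatic and bond.order == orders[symbol]
    raise ValueError(f"unsupported bond symbol: {symbol!r}")


benzene_bond = Bond(order=1, aromatic=True, in_ring=True)
print(bond_matches("", benzene_bond), bond_matches("-", benzene_bond))
```

The final line shows the difference between an omitted bond symbol and an explicit -: the aromatic ring bond satisfies the default but not the explicit single bond.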
Directional bonds use / and \ to specify the relative arrangement of substituents around a double bond (cis/trans stereochemistry) in queries (version 4.1+).[1] Each can be combined with ? to include the unspecified case, such as /? for "up or unspecified," allowing partial stereo matching: F/?C=C\Cl matches structures in which F and Cl are cis across the double bond, as well as those in which the direction of the bond to fluorine is unspecified.[1] These symbols do not encode wedge or dash bonds; tetrahedral stereochemistry is expressed on the atom itself with the chirality primitives @ and @@.[1]
Bond primitives can also be combined with logical operators to widen or narrow a query. The comma (OR) lists alternatives, so C-,=C matches two aliphatic carbons joined by either a single or a double bond; ! negates a primitive, as in *!@* for two atoms joined by a non-ring bond; and ; (low-precedence AND) combines conditions, as in C=;@C for a double bond that is also in a ring. The aromatic symbol : requires an aromatic bond, which distinguishes the ring bonds of benzene from the single bond c-c that links the two rings of biphenyl.[1]
A bond symbol can also precede a ring-closure digit, so that -1 in a pattern such as C-1CCCCC1 specifies that the ring-closing bond is single; the digit labels the closure and is not a reusable bond variable.[1] Within recursive atom environments, bonds are written and matched with the same primitives and logical operators as in the outer pattern.[1]
Structural Patterns
Connectivity and Branching
In SMARTS notation, connectivity between atoms is primarily expressed through sequential arrangement, where atoms are listed in order to represent direct bonds, defaulting to single or aromatic bonds unless otherwise specified. This mirrors the linear chain syntax of SMILES but is adapted for substructure querying. For example, the pattern CCO matches a linear chain of two carbon atoms connected to an oxygen atom, as found in the ethanol backbone.[1]
Branching structures are denoted using parentheses to indicate side chains or substituents attached to the preceding atom. The notation CC(O)C specifies a branched chain where the second carbon connects to a hydroxyl group in addition to the adjacent carbons, corresponding to the structure of isopropanol. Multiple branches from a single atom are represented by successive parenthetical groups, such as CC(O)(Cl)C, which matches a central carbon bearing both a hydroxy and a chloro substituent alongside the chain. Nested parentheses enable the description of more intricate branching, allowing hierarchical attachments without altering the main sequence.[1]
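The way parentheses attach side chains to the preceding atom can be illustrated with a stack-based reader. This is a minimal sketch for a restricted subset (plain or bracketed atoms, bond symbols, and parentheses only); the function name parse_branches is hypothetical, and ring closures, dots, and recursive expressions are deliberately out of scope.

```python
import re

# Bracketed atoms, two-letter halogens, then one-letter organic-subset symbols.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|\*")


def parse_branches(pattern: str):
    """Return (atoms, edges) where edges are index pairs of bonded atoms."""
    atoms, edges = [], []
    stack = []       # atoms to return to when a branch closes
    prev = None      # index of the atom the next atom will bond to
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "(":                 # branch starts at the current atom
            stack.append(prev)
            i += 1
        elif ch == ")":               # branch ends: go back to the branch point
            prev = stack.pop()
            i += 1
        elif ch in "-=#:~":           # bond symbols are skipped in this sketch
            i += 1
        else:
            m = ATOM.match(pattern, i)
            if m is None:
                raise ValueError(f"unsupported character at position {i}")
            atoms.append(m.group())
            idx = len(atoms) - 1
            if prev is not None:
                edges.append((prev, idx))
            prev = idx
            i = m.end()
    return atoms, edges


print(parse_branches("CC(O)(Cl)C"))
```

For CC(O)(Cl)C every substituent is bonded to atom 1, the central carbon, which is exactly what the successive parenthetical groups express.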
For queries involving disconnected components, the dot (.) separates fragments that need not be bonded to one another. The pattern CC.O, for instance, matches two bonded aliphatic carbons together with an aliphatic oxygen elsewhere in the target. Connectivity can be further constrained with atomic primitives inside square brackets: X<n> gives the total number of connections, including those to implicit hydrogens, while D<n> counts only explicit connections (the degree). For example, [C;X4] targets a carbon atom with exactly four connections, typical of tetrahedral geometry.[1]
Cyclicity and Ring Closures
In SMARTS, cyclicity is represented through ring closure digits, which connect non-adjacent atoms to form cycles in substructure patterns. Digits 1 through 9 are appended to atoms to indicate the start and end of a ring bond, mirroring the SMILES convention but adapted for querying. For example, the pattern C1CC1 specifies a three-membered aliphatic ring, matching cyclopropane or similar structures.[1]
To handle patterns that need more than nine ring closures open at once, the % symbol precedes a two-digit number, giving closure labels from 10 to 99, such as %10. Once a ring has been closed, its digit can be reused; for instance, c1ccccc1c1ccccc1 matches biphenyl, with two separate aromatic six-membered rings linked by a bond. In fused systems, an atom may carry more than one closure digit, as in c12ccccc1cccc2 for naphthalene, where the first atom opens closures 1 and 2 and therefore belongs to both rings.[1][9]
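The pairing of ring-closure labels can be sketched as follows. The function name ring_closures and the token pattern are assumptions of this illustration, which only recognizes simple atoms, single digits, and %nn labels, and reports which atom indices each closure joins.

```python
import re

TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|\*)"   # atoms
    r"|(?P<ring>%\d\d|\d)"                                  # closure labels
    r"|(?P<other>[-=#:~@/\\().])"                           # ignored here
)


def ring_closures(pattern: str):
    """Return a list of (opening_atom, closing_atom) index pairs."""
    open_rings = {}      # label -> index of the atom that opened it
    closures = []
    n_atoms = 0
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported character at position {pos}")
        if m.group("atom"):
            n_atoms += 1
        elif m.group("ring"):
            label = int(m.group("ring").lstrip("%"))
            current = n_atoms - 1            # label belongs to the latest atom
            if label in open_rings:          # second occurrence closes the ring
                closures.append((open_rings.pop(label), current))
            else:                            # first occurrence opens it
                open_rings[label] = current
        pos = m.end()
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return closures


print(ring_closures("c1ccccc1c1ccccc1"))   # biphenyl: digit 1 is reused
print(ring_closures("c12ccccc1cccc2"))     # naphthalene: atom 0 opens two rings
```

Because a label is removed from the table as soon as it closes, the same digit can serve a second, independent ring, while the naphthalene pattern shows a single atom holding two open labels.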
Ring primitives enhance query precision by qualifying atoms according to their cyclic environment. The uppercase R<n> specifies the number of rings in the smallest set of smallest rings (SSSR) that contain the atom, so [R3] matches atoms in exactly three SSSR rings. The lowercase r<n> specifies the size of the smallest ring containing the atom, such as [r3] for atoms in a three-membered ring or [r6] for six-membered rings like that of benzene. Aromatic cyclicity combines lowercase atom symbols with closures, exemplified by c1ccccc1, which matches only six-membered rings of aromatic carbons.[1]
Flexible ring queries rely on wildcards and ring primitives. The * wildcard matches any atom, while [R] matches any atom that belongs to at least one ring, regardless of element or ring size. Atom map labels, written as in [C:1], tag specific atoms so that they can be tracked across patterns for refined substructure searches.[1]
Advanced Features
Logical Operators and Grouping
In SMARTS, logical operators enable the combination of atomic properties, bond specifications, and substructural conditions to form more complex queries. The primary operators include & for high-precedence logical AND, , for logical OR, ! for logical NOT, and ; for low-precedence logical AND. These operators apply within atomic expressions enclosed in square brackets [ ] and between bond primitives, allowing precise specification of conditions such as [C&H0], which matches an aliphatic carbon atom with exactly zero hydrogen atoms.[1] Within bracketed atomic expressions, adjacent primitives are implicitly combined using high-precedence AND (&), so [CH3] is equivalent to [C&H3], denoting an aliphatic carbon with three hydrogens; the NOT operator negates a single primitive, as in [!c] for any atom that is not an aromatic carbon.[1]
The OR operator (,) binds less tightly than & but more tightly than ;, so the full precedence order is ! > & > , > ;. A query like [N,O] matches either an aliphatic nitrogen or an aliphatic oxygen. Atomic expressions do not support parentheses; the two AND operators provide the grouping instead. For instance, [c,n&H1] matches any aromatic carbon, or an aromatic nitrogen with exactly one hydrogen, because & is evaluated before the comma. Writing [c,n;H1] evaluates the OR first, so the hydrogen condition applies to both alternatives: an aromatic carbon or nitrogen that also has exactly one hydrogen. Outside square brackets, parentheses denote branches rather than logical grouping, such as (C=O) for a carbonyl-containing branch attached to the preceding atom.[1]
The low-precedence AND (;) is particularly useful for adding a condition after a list of alternatives, since it is evaluated after all higher-precedence operations; in simple cases such as [#6;X3], which matches a carbon atom with exactly three connections, it behaves the same as &. For an expression of the form a&b,c, default parsing yields (a AND b) OR c. To obtain a AND (b OR c) instead, the alternatives are written first and the shared condition is attached with the low-precedence operator, as in b,c;a. These features, rooted in extensions of SMILES syntax, enhance the expressiveness of SMARTS for substructure searching while maintaining compactness.[1]
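The operator precedence inside an atomic expression can be made concrete with a small evaluator. This is a sketch under stated assumptions, not a complete SMARTS implementation: the Atom record, the function atom_matches, and the short list of supported primitives (element symbols C, N, O, S in both cases, *, a, A, #n, Hn, Xn, and signed charges) are all choices made for illustration, and recursive $() expressions are not handled.

```python
import re
from dataclasses import dataclass


@dataclass
class Atom:
    """Hypothetical description of one atom in the target molecule."""
    symbol: str            # element symbol, e.g. "C"
    aromatic: bool = False
    hcount: int = 0        # attached hydrogens
    connectivity: int = 0  # total connections (the X primitive)
    charge: int = 0


ATOMIC_NUM = {"C": 6, "N": 7, "O": 8, "S": 16}
PRIMITIVE = re.compile(r"(!*)(#\d+|H\d*|X\d+|[+-]\d*|[CNOS]|[cnos]|\*|a|A)")


def _primitive(tok: str, atom: Atom) -> bool:
    if tok == "*":
        return True
    if tok == "a":
        return atom.aromatic
    if tok == "A":
        return not atom.aromatic
    if tok[0] == "#":
        return ATOMIC_NUM.get(atom.symbol) == int(tok[1:])
    if tok[0] == "H":
        return atom.hcount == int(tok[1:] or 1)
    if tok[0] == "X":
        return atom.connectivity == int(tok[1:])
    if tok[0] in "+-":
        magnitude = int(tok[1:] or 1)
        return atom.charge == (magnitude if tok[0] == "+" else -magnitude)
    if tok.islower():                       # aromatic element symbol
        return atom.aromatic and atom.symbol.lower() == tok
    return not atom.aromatic and atom.symbol == tok


def _and_high(expr: str, atom: Atom) -> bool:
    """Evaluate primitives joined by & or by simple adjacency (tightest AND)."""
    pos, result = 0, True
    while pos < len(expr):
        if expr[pos] == "&":
            pos += 1
            continue
        m = PRIMITIVE.match(expr, pos)
        if m is None:
            raise ValueError(f"unsupported primitive in {expr!r}")
        value = _primitive(m.group(2), atom)
        if len(m.group(1)) % 2:             # odd number of ! negates
            value = not value
        result = result and value
        pos = m.end()
    return result


def atom_matches(expr: str, atom: Atom) -> bool:
    """Split on ; (loosest), then on , (OR), then evaluate the & groups."""
    return all(
        any(_and_high(term, atom) for term in group.split(","))
        for group in expr.split(";")
    )


bare_aromatic_c = Atom("C", aromatic=True, hcount=0, connectivity=3)
print(atom_matches("c,n&H1", bare_aromatic_c))  # H1 binds to n only
print(atom_matches("c,n;H1", bare_aromatic_c))  # H1 applies to both
```

The two calls differ only in the AND operator used: with & the hydrogen condition is attached to the nitrogen alternative alone, whereas with ; it constrains whichever alternative matched.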
Recursive Definitions and Layers
Recursive SMARTS enable the definition of complex atomic environments by embedding subpatterns within a parent pattern, using the syntax $(subpattern), where subpattern is a valid SMARTS expression starting with the atom of interest.[1] The construct appears inside a bracketed atomic expression and acts as a property of that atom: only the first atom of the subpattern is matched, and the remaining atoms describe its environment without becoming part of the primary match.[1] For instance, the pattern C[$(aaO)] matches an aliphatic carbon bonded to an aromatic atom that is ortho to an oxygen substituent, where aaO defines the aromatic environment.[1]
Nesting of recursive SMARTS is supported through multiple layers of $() enclosures, permitting up to 10 levels of hierarchy to describe increasingly intricate structures such as side chains attached to ring systems.[1] This layered approach facilitates hierarchical matching, where each recursive layer refines the context of the atoms in the outer pattern; for example, C[$(aa[$(O)])] embeds an oxygen specification within an aromatic context for the carbon.[1] Beyond 10 layers, parsing is not supported to maintain computational feasibility.[1] Recursive SMARTS cannot include reaction expressions due to semantic ambiguities in handling bonds and mappings.[1]
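The layering of $() expressions can be inspected by counting how deeply they nest. The sketch below uses a hypothetical function, recursive_depth, that distinguishes the parenthesis opened by $( from an ordinary branch parenthesis; it measures nesting only and does not validate the pattern or enforce any depth limit.

```python
def recursive_depth(pattern: str) -> int:
    """Return the deepest level of $() nesting found in a SMARTS string."""
    depth = max_depth = 0
    stack = []          # True for a "$(" opener, False for a branch "("
    i = 0
    while i < len(pattern):
        if pattern.startswith("$(", i):
            stack.append(True)
            depth += 1
            max_depth = max(max_depth, depth)
            i += 2
        elif pattern[i] == "(":
            stack.append(False)
            i += 1
        elif pattern[i] == ")":
            if not stack:
                raise ValueError("unbalanced ')' in pattern")
            if stack.pop():             # this ')' closes a recursive layer
                depth -= 1
            i += 1
        else:
            i += 1
    if stack:
        raise ValueError("unbalanced '(' in pattern")
    return max_depth


for smarts in ("CC(O)C", "C[$(aaO)]", "C[$(aa[$(O)])]"):
    print(smarts, recursive_depth(smarts))
```

A branch such as the one in CC(O)C contributes no recursive layer, while each additional $() enclosure in the other two patterns adds one.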
Ring-closure digits (1 through 9) used inside a recursive subpattern are scoped to that subpattern, so the same digit can be reused in separate $() expressions without conflict.[1] This keeps hierarchical queries compact, for example when the same ring environment is required on several atoms of a repeating unit.[1] In a pattern like [$(C1CC1)][$(C1CC1)], each digit 1 closes a ring within its own subpattern, and the pattern matches two bonded aliphatic carbons that each belong to a three-membered carbon ring. These digits are ring closures rather than atom maps, which are written with the :n suffix.[1]
In practical queries, recursive SMARTS help match extended motifs such as polypeptide backbones built from repeated amide linkages.[1] A representative pattern for a dipeptide unit is [NX3H2][CX4H]([*])[CX3](=[OX1])[NX3][CX4H]([*])[CX3](=[OX1])[OX2H], where * is a wildcard standing in for the first atom of each side chain; the pattern itself is not recursive, but a backbone fragment of this kind can be wrapped in $() so that it acts as an environment condition on a single atom in a longer query.[9] Such applications are particularly valuable in cheminformatics for identifying biomolecular scaffolds with hierarchical organization.[1]
Practical Usage
Illustrative Examples
To illustrate the application of SMARTS syntax, consider basic patterns that target common molecular features. These examples demonstrate how atomic specifications, bonds, branching, rings, logical operators, and recursion can be combined to define substructures precisely.[10] A simple pattern for a hydroxyl group is [OH], which matches an aliphatic oxygen atom bearing exactly one hydrogen, as found in alcohols or phenols.[10] Similarly, C=O targets a carbonyl group, specifying a carbon double-bonded to an oxygen, characteristic of ketones, aldehydes, or carboxylic acids.[10]
For branched structures, the pattern CC(=O)O represents acetic acid, where the first C denotes a methyl group connected to a central carbon that forms a double bond with oxygen (=O) and a single bond to a hydroxyl (O). This uses branching notation with parentheses to specify the attachments around the carbonyl carbon.[10]
Cyclic patterns leverage ring closure digits; for instance, c1ccccc1 matches benzene as an aromatic six-membered ring, with lowercase letters indicating aromatic atoms and the digit 1 closing the ring. This aromatic query distinguishes it from aliphatic cycles.[10]
Logical operators refine atomic properties, such as [N&+0] for a neutral nitrogen atom, where & combines the aliphatic nitrogen primitive (N) with the zero-charge condition (+0).[10]
Recursive definitions allow an atom to be qualified by its surroundings; a basic example is [$(CC)], which matches an aliphatic carbon bonded to another aliphatic carbon. Only the first carbon is returned as the match, while the second serves as the required environment, which is useful for flagging individual atoms within carbon chains.[10]