Virtual screening

Virtual screening (VS) is a computational technique employed in drug discovery to identify promising compounds by evaluating the potential binding affinity of large libraries of small molecules against a specific biological target, such as a protein receptor. This method serves as a cost-effective and efficient alternative to traditional high-throughput experimental screening, enabling the rapid prioritization of candidates for further validation from vast chemical spaces that can exceed billions of compounds.
The primary approaches in virtual screening are ligand-based virtual screening (LBVS), which identifies novel compounds by assessing their structural or feature similarity to known active ligands, and structure-based virtual screening (SBVS), which predicts interactions using the three-dimensional atomic structure of the target protein, typically obtained from X-ray crystallography or NMR spectroscopy. A related variant, fragment-based virtual screening (FBVS), focuses on low-molecular-weight fragments (typically under 300 Da) to build more drug-like molecules through linking or growing strategies. These methods often integrate quantitative structure-activity relationship (QSAR) modeling in LBVS for predictive accuracy and molecular docking simulations in SBVS to estimate binding poses and affinities.
Key techniques in virtual screening encompass similarity searching via metrics like the Tanimoto coefficient, machine learning algorithms such as support vector machines (SVM) for classification, and scoring functions (empirical, force-field-based, or knowledge-based) to rank compounds by predicted potency. Recent advances have incorporated artificial intelligence (AI) and machine learning to enhance hit identification, with AI-accelerated platforms enabling the screening of ultra-large libraries (e.g., 5.5 billion compounds) in days while yielding experimentally validated micromolar-affinity hits. These innovations address challenges like false positives and computational demands, with reported accuracies above 99% in some deep neural network-based systems.
In drug discovery, virtual screening facilitates lead optimization and the identification of inhibitors for targets in diseases such as cancer, infectious diseases, and neurological disorders, significantly reducing the time and expense of early-stage research compared to wet-lab methods. Its importance has grown with the expansion of accessible compound databases and structural initiatives, positioning it as a cornerstone of modern pharmaceutical pipelines for accelerating the transition from target validation to clinical candidates.

Fundamentals

Definition and Principles

Virtual screening (VS) is a computational technique employed in drug discovery to identify potential bioactive compounds by evaluating large libraries of small molecules, or ligands, against biological targets such as proteins, predicting their ability to form favorable interactions. These libraries can encompass millions to billions of compounds, enabling the rapid assessment of chemical space far beyond what is feasible experimentally. The foundational principles of VS revolve around predicting binding affinity, the strength of non-covalent interactions between a ligand and its target, in order to identify hits (compounds with a high likelihood of binding effectively) and to facilitate subsequent lead optimization, where promising hits are refined into more potent candidates. Central to this process are molecular interactions such as hydrogen bonding, in which a hydrogen atom is shared between electronegative atoms, and hydrophobic effects, where non-polar regions cluster to minimize exposure to water, stabilizing the ligand-target complex. Unlike high-throughput screening (HTS), which relies on physical assays to test compounds experimentally, VS is purely computational, offering significant reductions in time, cost, and resource demands while prioritizing targets with available structural data or known ligands.
A typical VS workflow begins with library preparation, where compound databases are curated for drug-likeness and converted into suitable formats for computation. This is followed by screening via predictive models to generate scores reflecting binding potential, ranking the compounds by these scores to prioritize top candidates, and final hit selection through post-processing to ensure chemical diversity and synthetic feasibility before experimental validation. As in molecular docking, which simulates ligand placement in a target's binding site, these steps provide a high-level framework for hit identification without requiring physical synthesis.
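The four workflow steps described above (library preparation, scoring, ranking, hit selection) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Compound` record, the drug-likeness cutoffs (molecular weight at most 500 Da, logP at most 5), the scaffold labels, and the `toy_score` function are all hypothetical stand-ins, and a real pipeline would replace `toy_score` with a docking or similarity model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    mol_weight: float  # molecular weight in Da
    logp: float        # octanol-water partition coefficient
    scaffold: str      # coarse chemotype label (hypothetical)


def prepare_library(library):
    """Step 1: keep compounds that pass simple drug-likeness cutoffs (assumed values)."""
    return [c for c in library if c.mol_weight <= 500 and c.logp <= 5]


def toy_score(c):
    """Step 2: placeholder score (higher is better); NOT a real binding model."""
    return -abs(c.mol_weight - 350) / 100 - abs(c.logp - 2.5)


def select_hits(library, top_n=2):
    """Steps 3 and 4: rank by score, then keep the best compound per scaffold."""
    ranked = sorted(prepare_library(library), key=toy_score, reverse=True)
    hits, seen = [], set()
    for c in ranked:
        if c.scaffold in seen:
            continue  # enforce chemical diversity: one hit per scaffold
        hits.append(c)
        seen.add(c.scaffold)
        if len(hits) == top_n:
            break
    return hits


library = [
    Compound("cpd-1", 340.0, 2.4, "indole"),
    Compound("cpd-2", 380.0, 3.0, "indole"),
    Compound("cpd-3", 420.0, 3.5, "quinazoline"),
    Compound("cpd-4", 650.0, 6.1, "macrocycle"),  # removed by the drug-likeness filter
]
hit_names = [c.name for c in select_hits(library)]
print(hit_names)
```

In this toy run the second indole is skipped in favor of a different scaffold, which mirrors the diversity-driven post-processing step rather than a pure top-N cut.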

Historical Development

The roots of virtual screening trace back to the mid-20th century, with quantitative structure-activity relationship (QSAR) models serving as an early precursor to ligand-based approaches. In 1964, Corwin Hansch and Toshio Fujita introduced the first systematic QSAR framework, correlating biological activity with physicochemical properties through linear free-energy relationships, which laid the groundwork for predicting potency without direct experimental testing. This evolved through the 1970s and 1980s amid advances in molecular modeling and database management, enabling initial computational searches of small compound libraries for potential drug candidates. By the late 1980s, these efforts had matured into rudimentary ligand-based screening techniques, focusing on similarity searches and basic pharmacophore mapping to identify compounds with desired structural features. The term "virtual screening" emerged in the late 1990s to describe these approaches as computational analogs of experimental high-throughput screening.
A pivotal milestone occurred in the 1980s with the advent of structure-based methods, exemplified by the development of the DOCK program in 1982 by Irwin D. Kuntz and colleagues at the University of California, San Francisco. This algorithm pioneered automated docking by geometrically matching ligand atoms to receptor binding sites, allowing the virtual evaluation of thousands of molecules against protein structures derived from X-ray crystallography. The 1990s saw the rise of ligand-based virtual screening, driven by pharmacophore modeling software that identified common spatial arrangements of molecular features essential for activity, such as hydrogen bond donors and hydrophobic regions. Tools introduced around 1990 facilitated 3D database searches, complementing emerging high-throughput experimental screening and accelerating hit identification in pharmaceutical research. Post-2000, virtual screening became integrated into industrial pipelines, bolstered by growing computing power that enabled screening of millions of compounds in days rather than years.
The completion of the Human Genome Project in 2003 dramatically expanded the pool of viable drug targets, from fewer than 500 known proteins to thousands, fueling demand for efficient virtual tools to prioritize candidates. Open-source contributions further democratized access, including AutoDock (first released in 1990 by Arthur Olson's group at The Scripps Research Institute), whose later versions introduced genetic algorithm-based docking for flexible ligand posing, and RDKit (open-sourced in 2006 after development in the early 2000s), a cheminformatics toolkit supporting fingerprint-based similarity searches and descriptor generation for large-scale ligand-based screening. Around the 2010s, virtual screening underwent a shift from primarily rule-based and physics-driven methods to data-driven approaches, leveraging machine learning to refine predictions from vast datasets of binding affinities and structural information. This transition enhanced accuracy in handling diverse chemical spaces and reduced false positives, solidifying virtual screening as a standard, cost-effective complement to wet-lab experiments in pharma workflows.

Methods

Ligand-Based Methods

Ligand-based methods in virtual screening leverage information from known active compounds to identify potential hits from large chemical databases through assessments of structural similarity, pharmacophoric features, or predicted physicochemical properties, without requiring the target's three-dimensional structure. These approaches are particularly valuable when structural data for the target is unavailable or unreliable, enabling the prioritization of compounds likely to behave similarly, on the assumption that structurally or functionally analogous ligands share common interaction profiles. Early implementations focused on simple similarity searching using molecular fingerprints, but later evolved to incorporate three-dimensional aspects for more accurate predictions of bioactivity.
Pharmacophore models form a cornerstone of ligand-based screening. A pharmacophore is defined as the three-dimensional arrangement of molecular features (such as hydrogen bond donors and acceptors, hydrophobic centers, aromatic rings, and positively or negatively ionizable groups) that are essential for ligand-target recognition and activity. These models are typically constructed by superimposing a set of known active ligands using techniques like least-squares fitting or clique detection algorithms to identify shared features, followed by validation against inactive compounds to refine specificity. A seminal example is the HipHop algorithm, introduced in the mid-1990s within the Catalyst software suite, which employs a hypothesis-driven approach to generate common-feature pharmacophores from multiple flexible ligand conformations, facilitating database querying for novel scaffolds that match the geometric and chemical constraints.
Shape-based virtual screening emphasizes the geometric complementarity of molecular volumes, comparing query and database compounds via overlap metrics that approximate shapes with Gaussian functions or polyhedral representations to account for van der Waals surfaces. This method handles flexible ligands by generating conformational ensembles and optimizing alignments through combinatorial search algorithms, often outperforming 2D methods in scaffold-hopping scenarios where functional groups vary but overall shape is conserved. The ROCS (Rapid Overlay of Chemical Structures) software exemplifies this paradigm, using Gaussian-based volumetric similarity scoring to rapidly screen millions of compounds, and has demonstrated significant enrichment in prospective studies against diverse targets.
Field-based virtual screening extends shape considerations by incorporating molecular interaction fields, aligning compounds based on similarities in electrostatic potentials, steric hindrance, and hydrophobic distributions, often represented as graphs or bitstring fingerprints for efficient matching. Field-graph matching techniques discretize these fields into nodes and edges to capture qualitative patterns, enabling the detection of bioisosteric replacements. Similarity between aligned fields is quantified using the Tanimoto coefficient on binary fingerprints, given by
T(A,B) = \frac{|A \cap B|}{|A \cup B|}
where A and B denote the bitsets of query and candidate fields, respectively; values approaching 1 indicate high congruence. Tools like FieldScreen apply this to prioritize diverse chemotypes with analogous profiles.
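As a small worked example of the Tanimoto formula above, the following Python sketch treats each fingerprint as the set of its "on" bit positions. The bit positions are made up for illustration, and returning 0.0 for two empty fingerprints is a convention assumed here, not something the formula itself specifies.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of 'on' bit positions."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical fingerprints, given as the indices of bits that are set.
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}

similarity = tanimoto(query, candidate)
print(similarity)  # 3 shared bits out of 6 distinct bits
```

Here the two fingerprints share bits 1, 4, and 9 out of six distinct bits overall, so the coefficient is 3/6.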
Quantitative structure-activity relationship (QSAR) models support ligand-based screening by predicting binding affinities or activities from molecular descriptors, serving as filters to rank pharmacophore or shape matches. Two-dimensional QSAR employs topological indices, while three-dimensional variants like Comparative Molecular Field Analysis (CoMFA) probe steric and electrostatic fields at lattice points around aligned ligands, relating them to experimental potencies via partial least squares regression. A prototypical CoMFA equation might take the form
\log\left(\frac{1}{IC_{50}}\right) = a \cdot DES + b \cdot ELEC + c
where DES and ELEC are steric and electrostatic descriptors, and a, b, c are fitted coefficients; this approach has been instrumental in optimizing leads for potency, as validated in numerous kinase inhibitor series.
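To make the linear form above concrete, the sketch below evaluates it for one compound and converts the result back to an IC50. Every number is hypothetical: the coefficients a, b, c and the descriptor values are chosen only so the arithmetic is easy to follow, and the conversion assumes a base-10 logarithm with IC50 expressed in mol/L.

```python
import math


def predict_log_inverse_ic50(des, elec, a, b, c):
    """Evaluate log(1/IC50) = a*DES + b*ELEC + c for one compound."""
    return a * des + b * elec + c


def to_ic50_molar(log_inverse_ic50):
    """Invert log10(1/IC50); assumes IC50 is expressed in mol/L."""
    return 10 ** (-log_inverse_ic50)


# Hypothetical fitted coefficients and descriptor values, for illustration only.
pic50 = predict_log_inverse_ic50(des=2.0, elec=4.0, a=0.5, b=-0.25, c=6.0)
ic50 = to_ic50_molar(pic50)
print(pic50, ic50)  # 0.5*2.0 - 0.25*4.0 + 6.0 = 6.0, i.e. about 1 micromolar
```

A fitted model of this kind is only as reliable as the alignment and training data behind it, so predictions far outside the training set's descriptor range should be treated with caution.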

Structure-Based Methods

Structure-based methods in virtual screening leverage the three-dimensional atomic coordinates of the target macromolecule, typically a protein, to predict and evaluate potential binding interactions. These coordinates are obtained from experimental techniques such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, or from computational approaches like homology modeling, which constructs models based on sequence similarity to known structures. By incorporating the target's geometry and physicochemical properties, these methods enable the simulation of ligand placement within binding pockets, accounting for intermolecular forces like van der Waals, electrostatic, and hydrogen bonding interactions. This contrasts with ligand-based approaches by explicitly modeling target-ligand complementarity rather than relying solely on ligand properties.
Protein-ligand docking forms the cornerstone of structure-based virtual screening, involving the prediction of ligand orientations (poses) and binding affinities within the target's binding site. In rigid docking, both the protein and ligand are treated as inflexible, which is computationally efficient but less accurate for dynamic systems; flexible docking, however, allows conformational adjustments in the ligand (and sometimes in protein side chains) to better mimic physiological conditions. Scoring functions assess the quality of docked poses by estimating binding free energy, and are categorized as force-field-based (physics-derived, e.g., using AMBER or CHARMM parameters), empirical (fitted to experimental data), or knowledge-based (derived from statistical potentials). For instance, an empirical scoring function may approximate the total binding energy as E = E_{\text{vdw}} + E_{\text{elec}} + E_{\text{Hbond}} + E_{\text{desolv}}, where the terms represent van der Waals, electrostatic, hydrogen bonding, and desolvation contributions, respectively, enabling rapid evaluation of thousands of compounds.
Key algorithms in docking employ stochastic search techniques to explore the vast conformational space efficiently. Genetic algorithms (GAs), inspired by evolutionary processes, iteratively evolve populations of poses through selection, crossover, and mutation to optimize scoring; Monte Carlo simulations, conversely, use random sampling with Metropolis criteria to escape local minima. Prominent software implementations include Glide, which uses a hierarchical filtering approach with an OPLS force field, achieving reported success rates above 70% in pose prediction for diverse targets, and GOLD, which applies GAs with multiple scoring functions like GoldScore (force-field-based) or ChemScore (empirical) to handle flexibility. Binding site identification precedes docking, often via geometric algorithms that detect cavities or pockets using tools like fpocket or CASTp, prioritizing sites with scores based on enclosure and hydrophobicity.
Post-docking analysis refines initial results to improve hit identification. Consensus scoring combines ranks or scores from multiple functions (e.g., averaging the outputs of Glide and another program) to reduce false positives, enhancing enrichment factors by roughly 2- to 5-fold in benchmarks against single scorers. Rescoring with more rigorous methods, such as molecular mechanics Poisson-Boltzmann surface area (MM-PBSA), further evaluates top poses for energetic accuracy. Finally, hits are filtered for absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties using predictive models, ensuring viable leads for experimental validation.
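Consensus scoring, as described above, can be illustrated with a simple rank-averaging scheme. The sketch below is one possible variant, not the procedure of any particular docking package: the ligand names and scores are invented, both hypothetical scorers are assumed to use "more negative is better" energies, and ranks rather than raw scores are averaged so that the two scales do not need to be comparable.

```python
def ranks(scores, lower_is_better=True):
    """Map each ligand name to its 1-based rank under one scoring function."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


def consensus_order(score_tables):
    """Order ligands by their mean rank across several scoring functions."""
    per_table = [ranks(table) for table in score_tables]
    names = list(per_table[0])
    mean_rank = {n: sum(r[n] for r in per_table) / len(per_table) for n in names}
    return sorted(names, key=lambda n: mean_rank[n]), mean_rank


# Hypothetical docking scores from two scoring functions (more negative = better).
scorer_a = {"lig1": -9.1, "lig2": -7.4, "lig3": -8.2, "lig4": -5.0}
scorer_b = {"lig1": -7.0, "lig2": -8.0, "lig3": -8.8, "lig4": -4.1}

order, mean_rank = consensus_order([scorer_a, scorer_b])
print(order)
print(mean_rank)
```

In this toy data set `lig1` is the top pick of the first scorer but only third for the second, so the consensus promotes `lig3`, which both scorers rate highly. That is the intended effect: compounds favored by a single function's idiosyncrasies are demoted.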

Hybrid Methods

Hybrid methods in virtual screening integrate ligand-based and structure-based techniques to leverage their complementary strengths, thereby enhancing prediction robustness and minimizing false positives. A typical workflow begins with ligand-based filtering, such as pharmacophore matching or shape similarity searches, to rapidly filter large compound libraries, followed by structure-based refinement via molecular docking to assess binding poses and affinities more precisely. This sequential synergy allows for efficient enrichment of potential hits while compensating for the limitations of individual paradigms, such as the lack of structural context in ligand-based methods alone.
Pharmacophore-constrained docking exemplifies a specific hybrid approach, where pharmacophore models, derived from known ligands or receptor sites, guide pose generation and scoring during docking to enforce critical interactions like hydrogen bonds and hydrophobic contacts. In this method, docking programs generate multiple poses per compound without initial scoring, which are then filtered using receptor-based pharmacophores, achieving up to 95% reduction in decoys while retaining approximately 80% of actives in benchmarks on targets like neuraminidase and CDK2. The PharmDock program implements this by optimizing protein-derived pharmacophores for both sampling and ranking, demonstrating improved bioactive pose identification in virtual screening applications. Similarly, multi-objective scoring functions combine ligand-based similarity metrics with structure-based estimates to provide a holistic evaluation, as seen in workflows that yield high enrichment factors on diverse targets.
Receptor-based pharmacophore modeling further illustrates hybrid integration by extracting features directly from the protein binding pocket, capturing key interaction sites for subsequent virtual screening. Workflows like Apo2ph4 generate these models from apo or holo protein structures, enabling the rapid identification of pocket-compatible compounds that can then be refined through docking. Ensemble docking hybrids address target flexibility by docking ligands against multiple protein conformations, often incorporating ligand-based biases; for example, the LigBEnD method uses atomic property fields from known ligands to weight docking scores, achieving over 80% accuracy in pose prediction within 2 Å RMSD for the targets studied. These strategies offer enhanced coverage for targets with incomplete ligand or structural data, facilitating more reliable hit identification across challenging systems. In one inhibitor discovery campaign, a multistage pipeline combining pharmacophore modeling, shape similarity, and docking screened 260,000 compounds from the NCI database, yielding two novel micromolar inhibitors (with reported values of 62 μM and 162 μM) and an enrichment factor exceeding 465.
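Pose-prediction accuracy of the kind quoted above is usually judged by the root-mean-square deviation (RMSD) between predicted and reference ligand coordinates, with 2 Å as the common success cutoff. The sketch below is a minimal version under stated assumptions: both poses list the same atoms in the same order, the coordinates are in Å within the same receptor frame (so no superposition is performed), and the coordinates themselves are invented.

```python
import math


def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


# Hypothetical three-atom ligand: a reference pose and two predicted poses.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
near_pose = [(x + 1.0, y, z) for x, y, z in reference]  # shifted 1 Å along x
far_pose = [(x, y + 3.0, z) for x, y, z in reference]   # shifted 3 Å along y

near_rmsd = rmsd(reference, near_pose)
far_rmsd = rmsd(reference, far_pose)
print(near_rmsd <= 2.0, far_rmsd <= 2.0)  # which poses pass the 2 Å cutoff
```

Production tools typically also handle molecular symmetry (equivalent atoms that can be swapped), which this sketch deliberately ignores.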

Computational Infrastructure

Ligand-Based Approaches

Ligand-based virtual screening relies on computational resources optimized for rapid processing of molecular descriptors and similarity computations, rather than intensive simulations. Hardware requirements emphasize multi-core CPUs for fingerprint generation and similarity searches, with GPUs accelerating large-scale comparisons. For instance, tools like PyRMD operate efficiently on modern workstations with at least 4 GB RAM for basic tasks, but screening extensive libraries requires more memory to hold descriptors without frequent disk I/O. When processing PubChem-scale databases exceeding 100 million compounds, memory demands typically reach tens of GB of RAM, depending on fingerprint dimensionality and database indexing strategies, to enable in-memory similarity matching and avoid bottlenecks.
Software infrastructure for ligand-based approaches centers on cheminformatics libraries that facilitate descriptor computation and database querying. Open-source tools such as RDKit provide robust capabilities for generating molecular fingerprints and performing Tanimoto similarity searches, forming the backbone of many screening pipelines. OpenBabel complements these by handling diverse file formats and preprocessing structures for input into similarity algorithms. Commercial platforms, including those from Schrödinger, offer integrated environments for ligand scouting with advanced pharmacophore and shape-based filtering, enabling seamless workflow automation.
Scalability in ligand-based virtual screening is achieved through parallelization techniques tailored to distributed environments. The Message Passing Interface (MPI) enables high-level parallelization of similarity matching across clusters, distributing database subsets to multiple nodes for concurrent querying and achieving near-linear speedups on thousands of cores. Cloud computing platforms like AWS support screening of millions of compounds, leveraging elastic resources for cost-effective exploration of ultra-large libraries.
Optimization strategies focus on reducing computational overhead while preserving chemical information. Extended-connectivity fingerprints (ECFP), such as ECFP4 with 2,048 bits, balance descriptor richness and efficiency by encoding circular topological features, allowing rapid similarity calculations via bitwise operations. Techniques such as hashing further accelerate searches by minimizing vector comparisons, particularly for diverse libraries, where chemical space must be represented without exhaustive enumeration.
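The remark above about bitwise operations can be made concrete by packing each fingerprint into a single integer, so that intersection and union become `&` and `|` followed by a bit count. The sketch below uses 8-bit toy fingerprints with invented values; Python integers are arbitrary precision, so the same code would accept 2,048-bit fingerprints. The database names and the `top_k` helper are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto_bits(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient for two fingerprints packed into integers."""
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0


def top_k(query: int, database: dict, k: int = 2):
    """Return the k most similar database entries as (name, similarity) pairs."""
    scored = [(name, tanimoto_bits(query, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 8-bit fingerprints (hypothetical values).
query_fp = 0b10110110
database = {
    "mol_a": 0b10110100,
    "mol_b": 0b00000110,
    "mol_c": 0b11111111,
}

best = top_k(query_fp, database)
print(best)
```

Because each comparison reduces to a handful of integer operations, a scan like this parallelizes naturally by splitting the database into chunks, which is the pattern that MPI-style distribution exploits.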

Structure-Based Approaches

Structure-based virtual screening imposes significantly higher computational demands than ligand-based approaches due to its reliance on physics-based simulations, such as molecular docking and dynamics, which require detailed modeling of protein-ligand interactions. High-end graphics processing units (GPUs) are important for accelerating these calculations, particularly through CUDA-enabled frameworks that parallelize the exhaustive search of conformational spaces during docking. For instance, GPU-optimized docking can reduce computation times for large libraries by up to 10-fold compared to CPU-only systems, enabling the processing of millions of compounds in feasible timeframes. Additionally, substantial storage resources are necessary, often at the terabyte scale for ultra-large libraries, to handle receptor models, ligand databases, and output trajectories from ensemble-based runs that account for protein flexibility.
Key software tools for structure-based virtual screening include docking suites like AutoDock Vina and DOCK, which employ scoring functions to predict binding affinities and poses. AutoDock Vina, for example, leverages multithreading and empirical scoring to achieve up to 60-fold speed improvements over the earlier AutoDock 4, making it suitable for high-throughput applications. DOCK facilitates flexible docking within receptor binding sites, supporting anchor-and-grow strategies for efficient exploration of chemical space. These docking tools are often integrated with molecular dynamics software for post-docking refinement, where simulations stabilize predicted complexes and assess binding stability over time.
To achieve scalability, structure-based virtual screening commonly employs high-performance computing (HPC) clusters, distributing tasks across multiple nodes for parallel execution. For exhaustive searches, such as docking one million compounds against a target, computations may require several days on a cluster of 100 cores, highlighting the need for optimized scheduling in shared HPC environments. Platforms like EXSCALATE demonstrate extreme-scale capabilities by scaling to full supercomputers, processing billions of compounds through distributed workflows.
Optimization strategies mitigate the inherent complexity of these simulations, including incremental docking approaches that build ligand poses stepwise to reduce search space dimensionality. Virtual screening cascades further enhance efficiency by applying sequential filters, such as initial pharmacophore matching followed by refined docking, prioritizing promising candidates and minimizing full computations on low-affinity molecules. These techniques collectively manage the trade-off between accuracy and throughput in resource-intensive structure-based pipelines.
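A back-of-the-envelope estimate shows where a figure like "several days on 100 cores" can come from, and how much a screening cascade can save. The per-compound cost of 30 CPU-seconds and the 10% pass rate of the prefilter are assumptions chosen purely for illustration; real values vary widely with the docking program, exhaustiveness settings, and ligand flexibility, and the estimate ignores scheduling overhead and the cost of the prefilter itself.

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(n_compounds, cpu_seconds_per_compound, n_cores):
    """Idealized wall-clock time assuming perfect parallel scaling."""
    return n_compounds * cpu_seconds_per_compound / n_cores / SECONDS_PER_DAY


# Assumed inputs (hypothetical): 30 CPU-seconds per docked compound, 100 cores.
full_screen = wall_clock_days(1_000_000, 30.0, 100)

# Cascade: a cheap prefilter is assumed to pass 10% of the library to docking.
cascade_screen = wall_clock_days(100_000, 30.0, 100)

print(round(full_screen, 2), round(cascade_screen, 2))
```

Under these assumptions the full screen takes roughly three and a half days, while docking only the prefiltered tenth takes about a third of a day, which is why cheap early filters are attractive even when they discard a few true binders.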

Accuracy and Validation

Evaluation Metrics

The performance of virtual screening methods is assessed using quantitative metrics that evaluate their ability to prioritize active compounds over inactives, with a particular emphasis on early recognition given the vast scale of screened libraries. These metrics provide standardized tools for validating computational outputs prior to experimental follow-up, enabling fair comparisons across methods.
A primary metric is the enrichment factor (EF), which quantifies the degree to which actives are concentrated in the top-ranked fraction of results compared to random selection. For a cutoff of the k top-ranked compounds (e.g., k corresponding to the top 1% or 5% of the library), the EF is
EF_k = \frac{\frac{\text{Hits in top } k}{k}}{\frac{\text{Total hits}}{\text{Total compounds}}}
where values greater than 1 indicate successful enrichment. Another key measure is the area under the receiver operating characteristic curve (ROC-AUC), which plots the true positive rate against the false positive rate across all thresholds and yields a value between 0 and 1, with 0.5 representing random performance and higher values indicating better overall discrimination. To address limitations in ROC-AUC for prioritizing early hits, the Boltzmann-enhanced discrimination of ROC (BEDROC) applies exponential weighting to emphasize rankings at the beginning of the list, producing a score bounded between 0 and 1 that balances statistical rigor with early recognition sensitivity.
Additional classification-based measures include sensitivity (the proportion of true actives correctly identified), specificity (the proportion of true inactives correctly excluded), and the Matthews correlation coefficient (MCC), which provides a balanced score from -1 to 1 accounting for true and false positives and negatives, with 0 indicating random classification. Hit rates (fraction of actives recovered) and false positive rates are commonly reported in benchmarks like the Directory of Useful Decoys, Enhanced (DUD-E), where they highlight method efficacy against challenging inactives.
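The enrichment factor and ROC-AUC defined above are straightforward to compute from a ranked list. In the sketch below the input is a list of labels ordered from best-scored to worst-scored compound (1 for active, 0 for inactive); the ten-compound ranking is invented, and the AUC is computed through its pairwise interpretation (the fraction of active-inactive pairs in which the active is ranked higher), which assumes a strict ranking without ties.

```python
def enrichment_factor(ranked_labels, top_n):
    """EF for the top_n ranked compounds; labels are 1 (active) or 0 (inactive)."""
    total_hits = sum(ranked_labels)
    if total_hits == 0 or top_n <= 0:
        raise ValueError("need at least one active and a positive cutoff")
    hits_in_top = sum(ranked_labels[:top_n])
    return (hits_in_top / top_n) / (total_hits / len(ranked_labels))


def roc_auc(ranked_labels):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly."""
    n_actives = sum(ranked_labels)
    n_inactives = len(ranked_labels) - n_actives
    if n_actives == 0 or n_inactives == 0:
        raise ValueError("need at least one active and one inactive")
    actives_seen = 0
    correct_pairs = 0
    for label in ranked_labels:
        if label == 1:
            actives_seen += 1
        else:
            correct_pairs += actives_seen  # actives ranked above this inactive
    return correct_pairs / (n_actives * n_inactives)


# Hypothetical ranking of 10 compounds, best score first; 4 are true actives.
ranking = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

ef_top2 = enrichment_factor(ranking, top_n=2)
auc = roc_auc(ranking)
print(ef_top2, round(auc, 3))
```

In this example both of the top two compounds are active while the library as a whole is 40% active, so the top-2 enrichment factor is 1.0 / 0.4 = 2.5; the late-ranked active at position 8 lowers the AUC without affecting that early enrichment, which is the distinction the BEDROC metric is designed to weight.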
Validation protocols rely on decoy sets to simulate real screening scenarios, such as DUD-E's collection of 102 targets with 22,886 actives and over 1.4 million property-matched decoys, chosen to ensure physicochemical similarity but topological dissimilarity to the actives (using ECFP4 fingerprints). In ligand-based approaches like quantitative structure-activity relationship (QSAR) modeling, k-fold cross-validation divides data into training and test subsets iteratively to assess generalizability and prevent overfitting. For benchmarking, standardized datasets such as DUD-E and DEKOIS 2.0 enable comparative evaluation of workflows, with DEKOIS 2.0 providing 81 benchmark sets for 80 protein targets, comprising 18,197 actives and 1,121,074 decoys optimized for benchmarking through property matching and filters. These resources facilitate the application of metrics like EF and BEDROC to quantify performance across diverse protein families.
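The k-fold cross-validation procedure mentioned above amounts to partitioning the sample indices into k disjoint test folds and training on the remainder each time. The sketch below shows only the index bookkeeping and makes two simplifying assumptions: the data are taken in their given order (in practice they are usually shuffled, and often stratified by activity class), and model fitting is left out entirely.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each compound appears in exactly one test fold, so every prediction used for the final performance estimate comes from a model that never saw that compound during training.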

Challenges and Limitations

Virtual screening (VS) encounters significant technical challenges, particularly in structure-based methods where conformational sampling errors during docking can lead to inaccurate predictions of ligand poses. These errors arise from the limited exploration of ligand and protein conformational space, often resulting in suboptimal binding modes that deviate from experimental structures by more than 2 Å RMSD. Target flexibility further complicates docking, as proteins can undergo induced-fit adaptations upon ligand binding, requiring advanced ensemble docking or molecular dynamics simulations to account for multiple receptor states, yet these approaches remain computationally demanding and imperfect. Additionally, the effects of water molecules in the binding site are frequently underrepresented, leading to overestimated affinities since explicit solvent models are rarely feasible at scale.
In ligand-based methods, descriptor inaccuracies pose a core limitation, as molecular descriptors used in QSAR models often fail to capture subtle electronic or steric features critical for activity prediction, with standard deviations in binding affinity estimates reaching 1-2 kcal/mol. These inaccuracies stem from the empirical nature of many descriptors, which may not generalize across diverse chemical spaces.
Data-related issues undermine the reliability of VS models, including biases in training sets where certain chemotypes, such as benzodiazepines or kinase inhibitors, are overrepresented, skewing predictions toward familiar scaffolds and reducing novelty in hit identification. Activity cliffs exacerbate this, occurring when structurally similar compounds exhibit large potency differences (e.g., >100-fold), challenging QSAR models to interpolate accurately and contributing to high prediction errors in cliff-rich regions of chemical space.
Practical limitations include the generation of false positives due to approximations in scoring functions, which prioritize speed over precision and often rank non-binders highly, necessitating extensive experimental follow-up that can consume 20-50% of screening budgets. Trade-offs between scalability and accuracy are inherent, as high-throughput docking of million-compound libraries requires simplified models that sacrifice detailed physics-based simulation. Regulatory hurdles in pharmaceutical validation also persist, complicating the acceptance of hits without orthogonal experimental validation. Furthermore, post-2020 developments in covalent inhibitors expose outdated aspects of traditional VS pipelines, which struggle with reactivity modeling and warhead positioning, as covalent docking tools lag behind the rising prominence of irreversible binders such as those targeting proteases. As of 2025, ongoing advancements include the integration of machine learning for improved validation metrics, such as AI-driven enrichment assessments in ultra-large library screenings, enhancing overall accuracy across diverse targets.

Applications

In Drug Discovery

Virtual screening plays a pivotal role in the early stages of drug discovery pipelines by enabling the rapid identification of potential hit compounds from vast chemical libraries, typically comprising millions to billions of molecules. In hit identification, computational methods such as docking or pharmacophore modeling are applied to screen libraries of 10^6 to 10^8 compounds, prioritizing those with favorable binding predictions for subsequent experimental validation, often yielding 50-200 hits for wet-lab testing. This process significantly narrows the search space compared to traditional high-throughput screening, allowing researchers to focus resources on promising candidates. During lead optimization, iterative virtual screening refines these hits by incorporating structure-activity relationship data and simulations, guiding the design of analogs with improved potency and selectivity.
Notable case studies illustrate the practical impact of virtual screening in identifying therapeutic leads. In 2020, structure-based virtual screening targeted the SARS-CoV-2 main protease, screening a library of 235 million compounds to identify three initial inhibitors with micromolar IC₅₀ values, which were further optimized to nanomolar potency and demonstrated broad-spectrum activity against coronaviruses including SARS-CoV-2, SARS-CoV, and MERS-CoV. Similarly, a ligand-based virtual screening effort in 2010 led to the discovery of novel glycogen synthase kinase-3β (GSK-3β) inhibitors, such as 2-anilino-5-phenyl-1,3,4-oxadiazole derivatives, exhibiting nanomolar affinity, selectivity over CDK2, and efficacy in increasing liver glycogen accumulation.
The economic advantages of virtual screening stem from its ability to reduce the time and cost of drug discovery by minimizing reliance on resource-intensive wet-lab assays; for instance, it can significantly decrease the number of compounds requiring physical synthesis and testing, accelerating the path from target validation to clinical candidate.
In drug repurposing, virtual screening has proven invaluable, as seen in the 2021 identification of repurposed inhibitors of SARS-CoV-2 targets, including the main protease, from a library of 6,218 approved drugs. That effort yielded seven cell-active compounds including omipalisib, which was reported to be 200-fold more potent than a reference antiviral in human lung cells and showed synergistic effects in combinations. Post-2020 applications have expanded to AI-assisted virtual screening for rare diseases, where machine learning models have been reported to raise hit prediction accuracy to 80-90%.

In Other Scientific Fields

Virtual screening has been adapted to agrochemical research, where it facilitates the discovery of novel pesticides and herbicides by targeting specific enzymes in target organisms. For instance, structure-based virtual screening has been employed to discover inhibitors of acetolactate synthase (ALS), a key enzyme in branched-chain amino acid biosynthesis in plants, leading to the development of novel non-sulfonylurea herbicides that effectively control weeds while minimizing off-target effects. Similarly, machine learning-enhanced virtual screening platforms have been developed to predict herbicide-likeness and screen large chemical libraries for compounds inhibiting herbicide targets, resulting in candidates with improved potency and reduced environmental persistence compared to traditional methods. These applications demonstrate how virtual screening accelerates the discovery of mode-of-action-specific agrochemicals, addressing challenges like resistance.
In materials science, virtual screening supports the rational design of ligands for catalysts and sensors by evaluating binding affinities and properties across vast chemical spaces. High-throughput computational screening has been used to identify optimal organic linkers for metal-organic frameworks (MOFs), enabling the discovery of structures with enhanced performance for gas storage and separation. For sensors, computational approaches predict interactions between MOF pores and target analytes, facilitating the development of selective gas sensors. Molecular simulations further refine these designs by assessing ligand-framework stability, as seen in screenings that prioritize ligands for robust, tunable MOF-based catalysts.
Environmental applications leverage virtual screening to identify compounds or enzymes that degrade pollutants, promoting bioremediation strategies. In silico docking and pharmacophore modeling have been applied to screen potential substrates for laccase enzymes, which oxidize phenolic pollutants like dyes and pesticides, predicting degradation pathways and binding energies to guide enzyme engineering for wastewater treatment. Structure-based virtual screening has also identified variants of cytochrome P450 enzymes (e.g., CYP120A1) with enhanced thermostability and activity against sulfonamide antibiotics, enabling more efficient microbial bioremediation of contaminated soils. These approaches reduce experimental trial-and-error, focusing on inhibitors or activators that accelerate pollutant breakdown into non-toxic byproducts.
Emerging uses of virtual screening extend to toxicity prediction and peptide therapeutics. In toxicity prediction, ensemble-based virtual screening models predict compound toxicity from molecular descriptors, filtering out hazardous candidates early in chemical design with improved predictive performance. Computational screening has also been applied to identify potential therapeutic peptides.

Advances and Future Directions

Machine Learning Integration

Machine learning has been integrated into virtual screening to enhance the prediction of molecular activities by learning complex patterns from chemical datasets, surpassing traditional rule-based methods in handling high-dimensional data. Supervised learning approaches, such as random forests and neural networks applied to molecular graphs, enable accurate prediction of binding affinities and bioactivities. For instance, random forests aggregate multiple decision trees to predict compound efficacy, achieving enrichment factors of up to 20-fold in hit identification compared to random selection. Unsupervised methods, like clustering on descriptor spaces, aid in exploring chemical space for novel leads.

Substructural analysis leverages fragment-based machine learning to pinpoint bioactive motifs within molecules, facilitating the identification of key pharmacophores. Techniques such as support vector machines trained on fragment descriptors have successfully isolated motifs responsible for target inhibition, as demonstrated in inhibitor discovery for calcium and integrin-binding protein 1 (CIB1), where ML-driven fragment screening yielded novel ligands with confirmed binding affinities in the micromolar range. Scaffold hopping, which replaces core structures while preserving activity, is advanced by graph neural networks (GNNs) that encode molecular topologies as graphs and propagate features across atoms to generate analogous scaffolds.

Recursive partitioning, a foundational technique in quantitative structure-activity relationship (QSAR) modeling, builds decision trees on molecular descriptors to classify compounds iteratively. Random forests extend this by averaging predictions from numerous trees, reducing overfitting and enhancing robustness in virtual screening.
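
The enrichment factor quoted above compares the hit rate among the top-ranked compounds with the hit rate of the whole library. The following is a minimal sketch of that calculation, assuming a list of model scores (higher means more likely active) and binary activity labels; the function name `enrichment_factor` and the toy data are illustrative and not taken from any cited study.

```python
from typing import Sequence


def enrichment_factor(scores: Sequence[float],
                      labels: Sequence[int],
                      fraction: float = 0.01) -> float:
    """Hit rate in the top `fraction` of the ranking divided by the overall hit rate."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Number of compounds inspected; always look at one compound or more.
    n_top = max(1, int(round(n * fraction)))
    # Rank compounds from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / n)


# Toy library (hypothetical): 100 compounds, 5 actives, 4 of them ranked in the top 5.
toy_scores = [100.0 - i for i in range(100)]
toy_labels = [1 if i in (0, 1, 2, 3, 50) else 0 for i in range(100)]
ef_5_percent = enrichment_factor(toy_scores, toy_labels, fraction=0.05)
print(f"EF at 5%: {ef_5_percent:.1f}")
```

An enrichment factor of 1 corresponds to random selection, so a value of 20 means the top of the ranked list contains actives at twenty times the library-wide rate.
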
Node splitting in decision trees minimizes impurity measures, such as the Gini index, defined as $G(p) = 1 - \sum_{i=1}^{c} p_i^2$, where $p_i$ represents the proportion of instances in class $i$ among $c$ classes; the optimal split selects the descriptor threshold that maximizes the reduction in weighted Gini impurity across the child nodes.

Deep learning advances have transformed virtual screening through convolutional neural networks (CNNs) that process molecular fields as image-like representations, capturing spatial interactions for scoring functions. Models like Gnina employ CNNs for pose prediction and affinity estimation, outperforming traditional docking in success rates by 10-20% on diverse targets. Transformer-based models, such as ChemBERTa, pretrained on over 77 million SMILES strings via masked language modeling, excel in property prediction tasks relevant to screening, achieving ROC-AUC scores of 0.78-0.84 on MoleculeNet datasets such as Tox21, with performance scaling logarithmically with pretraining data size. To address the imbalanced datasets common in virtual screening, where actives are rare, techniques such as focal loss have been integrated, improving hit enrichment by up to 30%.

Post-2020 developments include generative models for de novo design, which synthesize novel molecules conditioned on desired properties, expanding the screened chemical space beyond existing libraries. Variational autoencoders, generative adversarial networks (GANs), and reinforcement-learning-driven generators such as REINVENT have produced drug-like candidates with optimized properties, including leads that are more synthesizable than those from random enumeration while maintaining target affinity. These models integrate into virtual screening pipelines, prioritizing generated compounds for further evaluation and reducing experimental costs. As of 2025, diffusion models have further advanced this area, enabling high-fidelity 3D molecular generation conditioned on protein targets and improving lead optimization efficiency.
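
The Gini-based split criterion defined earlier in this section can be made concrete with a short example. This is a minimal sketch for a single numeric descriptor, assuming candidate thresholds are taken at midpoints between consecutive distinct values; the helper names `gini` and `best_threshold` and the toy descriptor values are hypothetical, and production libraries use more efficient implementations.

```python
from typing import Dict, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity G = 1 - sum_i p_i^2 over the class proportions p_i."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts: Dict[int, int] = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values: Sequence[float],
                   labels: Sequence[int]) -> Tuple[Optional[float], float]:
    """Return (threshold, impurity reduction) for the best split `value <= threshold`."""
    n = len(values)
    parent = gini(labels)
    best_t: Optional[float] = None
    best_gain = 0.0
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct descriptor values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Toy data (hypothetical): one descriptor per compound, 1 = active, 0 = inactive.
descriptor = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5]
activity = [0, 0, 0, 1, 1, 1]
threshold, reduction = best_threshold(descriptor, activity)
print(threshold, reduction)
```

In this toy case the classes separate cleanly, so the best split removes all of the parent node's impurity. A random forest repeats this search over bootstrap samples and random descriptor subsets, then averages the resulting trees.
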
Quantum computing is emerging as a potentially transformative technology for virtual screening, particularly for improving the accuracy of binding energy calculations during molecular docking. Algorithms such as the variational quantum eigensolver (VQE) aim to compute molecular energies more precisely by using quantum hardware to model complex molecular interactions that classical computers struggle with because of exponential scaling. This approach could benefit structure-based virtual screening by providing quantum-accurate simulations of protein-ligand interactions, potentially accelerating hit identification in drug discovery pipelines. Early applications have demonstrated VQE's feasibility for small-molecule systems, with ongoing research focusing on scaling to larger biomolecular complexes.

Advancements in artificial intelligence are further propelling virtual screening through specialized generative models and privacy-preserving frameworks. Generative adversarial networks (GANs) facilitate library design by generating diverse, drug-like molecules optimized for desired properties, such as binding affinity, while exploring vast chemical spaces more efficiently than traditional methods. For instance, GAN-based architectures have been tuned to produce chemically valid structures while mitigating training problems such as mode collapse, enabling targeted lead optimization. Complementing this, federated learning allows institutions to train shared models on proprietary datasets without centralizing sensitive information, fostering collaborative virtual screening while maintaining data privacy through decentralized model updates. Initiatives like the MELLODDY consortium exemplify this, integrating ADME-Tox predictions from multiple pharmaceutical partners to enhance screening accuracy.

Key trends in virtual screening include deeper integration with experimental structural biology and efforts toward sustainable computing practices.
The 2017 Nobel Prize in Chemistry for cryo-electron microscopy (cryo-EM) has catalyzed the technique's synergy with computational methods, providing high-resolution structures of challenging targets like membrane proteins to inform more reliable docking and screening campaigns. This post-Nobel expansion has improved structure quality for virtual screening, enabling better prediction of ligand poses in dynamic complexes. Blockchain technology supports secure collaborations by enabling tamper-proof sharing of screening results and intellectual property in distributed networks, reducing risks in multi-party drug discovery efforts. Additionally, sustainability initiatives in high-performance computing (HPC) address the environmental footprint of large-scale virtual screening, with green HPC strategies improving energy efficiency through workload-aware scheduling and renewable-powered data centers to minimize carbon emissions from intensive simulations.

Looking toward the 2030s, virtual screening is poised for real-time applications in personalized medicine, where AI-driven platforms could dynamically tailor compound libraries to individual genomic profiles for rapid hit selection. Post-2023 innovations, such as diffusion models for molecular generation, are moving toward this goal by generating drug-like molecules conditioned on target structures, enhancing virtual screening's ability to explore novel chemical spaces with high fidelity. These models, including target-aware variants, have shown promise in generating pharmacophore-aligned ligands, potentially streamlining lead optimization and supporting on-demand screening in clinical settings by the decade's end.

Overall, these trajectories emphasize hybrid quantum-AI systems and ethical data practices as cornerstones for scalable, impactful virtual screening.
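
The decentralized model updates behind federated learning, discussed earlier in this section, can be illustrated with federated averaging (FedAvg), a widely used aggregation rule. The sketch below is a generic illustration under simplifying assumptions: each partner's model is a flat list of weights and the server weights each contribution by that partner's local dataset size. It is not a description of the MELLODDY protocol, and the function name and numbers are hypothetical.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client model weights, weighted by each client's local dataset size."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client model")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all client models must have the same shape")
        for j, w in enumerate(weights):
            # Only model parameters leave each site; raw compound data never does.
            averaged[j] += w * size / total
    return averaged


# Two hypothetical partners with locally trained two-parameter models.
global_model = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_model)
```

In a full training loop the server would send the averaged model back to each partner for another round of local training, repeating until the shared model converges.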

    The future of medicines is likely determined by an array of scientific, socioeconomic, policy, medical need, and geopolitical factors, with many uncertainties ...Missing: virtual | Show results with:virtual