Neural architecture search
Neural architecture search (NAS) is a subfield of automated machine learning (AutoML) that automates the design of artificial neural network architectures by systematically exploring a defined search space to identify models that optimize performance metrics such as accuracy, efficiency, or latency, thereby reducing the reliance on expert manual engineering.[1][2] The origins of NAS trace back to early applications of evolutionary algorithms for neural network design in the late 1980s, but the field surged in prominence with the advent of deep learning.[1] The foundational modern approach was introduced by Zoph and Le in 2016, who employed reinforcement learning to generate and evaluate architectures on image classification tasks such as CIFAR-10, demonstrating that automated search could rival hand-crafted designs, albeit at significant computational expense of thousands of GPU hours per search.[3] Subsequent work, such as NASNet by Zoph et al. in 2017, refined this approach by searching for reusable "cells" on smaller datasets and transferring them to larger ones such as ImageNet, achieving state-of-the-art results in image classification and object detection.[4]

At its core, NAS comprises three interconnected components: the search space, which specifies the universe of possible architectures (e.g., cell-based structures with operations such as convolutions or skip connections); the search strategy, which navigates this space using algorithms such as reinforcement learning, evolutionary methods, or Bayesian optimization; and the performance estimation strategy, which assesses candidate architectures through full training, weight sharing, or low-fidelity proxies to mitigate costs.[1] Early strategies were computationally intensive, but innovations such as parameter sharing in ENAS (Pham et al., 2018) reduced search times from days to hours by reusing weights across architectures.[5]

Key advancements have diversified search strategies, including evolutionary algorithms in AmoebaNet (Real et al., 2018), which used regularized evolution to evolve high-performing convolutional cells competitive with NASNet on ImageNet while emphasizing model size constraints.[6] A major breakthrough came with gradient-based methods such as DARTS (Liu et al., 2018), which reformulates architecture search as a differentiable optimization problem, enabling end-to-end training via bilevel optimization and cutting search costs from thousands of GPU-days to a few GPU-days.[7] These approaches have extended NAS beyond vision to domains such as natural language processing, reinforcement learning, and efficient mobile deployment. In the years since, NAS has further advanced with hardware-aware methods, zero-shot estimation techniques, and integrations with transformer-based large language models.[1][8]

NAS has transformed deep learning by consistently discovering architectures that surpass manually designed ones on benchmarks, for example achieving lower error rates on CIFAR-10 and ImageNet while balancing trade-offs in parameters and inference speed.[4][6] Its importance lies in democratizing AI model development, accelerating innovation, and enabling tailored solutions for resource-constrained environments, though challenges remain in scalability, generalization across tasks, and benchmark reliability.[2][1]

Introduction
Definition and Motivation
Neural Architecture Search (NAS) is an automated methodology for discovering optimal neural network architectures tailored to specific machine learning tasks, such as classification or segmentation. It operates through three core components: a search space that delineates the universe of possible architectures (e.g., layer types, connections, and hyperparameters); a search strategy that navigates this space to sample candidate architectures; and a performance estimation strategy that assesses the efficacy of these candidates, often via training or proxy metrics. This framework shifts the burden of architecture engineering from human experts to algorithmic exploration, enabling the identification of high-performing models without exhaustive manual iteration.[1]

The motivation for NAS arises from the inherent limitations of traditional manual architecture design, which demands substantial domain expertise, iterative experimentation, and significant time investment, and often leads to suboptimal or biased outcomes because it relies on human intuition. By automating this process, NAS not only enhances model accuracy and efficiency—frequently surpassing hand-crafted designs like ResNet or VGG—but also democratizes advanced neural network development, integrating it into broader AutoML ecosystems to streamline end-to-end machine learning pipelines. This approach addresses the growing complexity of deep learning models, where manual tuning becomes increasingly infeasible as architectures scale in depth and width.[9]

NAS can be viewed as a natural extension of hyperparameter optimization, evolving to target structural elements such as topology and operations rather than just tuning parameters. Early applications concentrated on image classification benchmarks, including CIFAR-10 for smaller-scale validation and ImageNet for large-scale transferability; for example, reinforcement learning-based searches on CIFAR-10 yielded architectures with error rates around 3.65%, while subsequent adaptations like NASNet achieved state-of-the-art top-1 accuracy of 82.7% on ImageNet.[3][4] A central trade-off in NAS is the balance between computational expense and performance improvements, as exhaustive searches can demand thousands of GPU hours—early reinforcement learning methods, for instance, required up to 1,800 GPU-days on CIFAR-10—prompting ongoing research into efficient approximations that make NAS viable for resource-constrained environments.[10]
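The interaction of the three components can be made concrete with a small sketch. The following Python snippet is purely illustrative: the search space, the random-sampling strategy, and the stand-in scoring function are hypothetical, but it shows how a search strategy draws candidates from a search space and ranks them with a performance estimator.

```python
import random

# Hypothetical toy illustration of the three NAS components: a discrete
# search space, a random-sampling search strategy, and a proxy-based
# performance estimator. Names and the scoring rule are illustrative only.

SEARCH_SPACE = {
    "num_layers": [4, 8, 12],
    "width": [32, 64, 128],
    "op": ["conv3x3", "conv5x5", "sep_conv", "skip"],
}

def sample_architecture(space):
    """Search strategy: here, plain random sampling from the space."""
    return {key: random.choice(options) for key, options in space.items()}

def estimate_performance(arch):
    """Performance estimation: stand-in proxy score instead of full training."""
    # In practice this would train the candidate (or query a proxy/supernet).
    return random.random()

def nas_loop(budget=20):
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = sample_architecture(SEARCH_SPACE)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

if __name__ == "__main__":
    print(nas_loop())
```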
Historical Development

Neural architecture search (NAS) originated in 2016 with the pioneering work of Barret Zoph and Quoc V. Le, who introduced a reinforcement learning-based approach using a recurrent neural network controller to generate architectures, achieving state-of-the-art performance of 3.65% test error on CIFAR-10 and surpassing prior hand-designed convolutional networks.[3] This method marked the first automated discovery of competitive neural architectures, setting the foundation for subsequent NAS research by demonstrating that machine learning could optimize network design directly from data.[3]

In 2017, advancements scaled NAS to larger datasets, exemplified by NASNet, which searched for transferable "cells" on CIFAR-10 and applied them to ImageNet, yielding a model with 82.7% top-1 accuracy while requiring 28% less computation (roughly 9 billion fewer FLOPs) than the best prior human-designed architectures such as Inception-v4.[4] This transferability highlighted NAS's potential for practical deployment across tasks. Around the same time, evolutionary methods began emerging as compute-efficient alternatives to reinforcement learning, though they gained prominence later.

The high computational costs of early RL-based NAS—often requiring thousands of GPU-days—prompted a shift toward efficiency in 2018, with the introduction of one-shot methods like Efficient Neural Architecture Search (ENAS) and gradient-based approaches such as DARTS, which relaxed the discrete search space into a continuous one for end-to-end optimization and reduced search times to hours or a few days on a single GPU.[7] Hardware-aware NAS also debuted that year with FBNet, incorporating latency predictions into the search objective to optimize for edge devices.[11]

Post-2020, NAS evolved toward hardware optimization and broader architectures, with expansions in hardware-aware methods for edge devices through 2022–2025, including multi-objective searches balancing accuracy, latency, and energy on resource-constrained platforms.[12] Integration with transformers surged, enabling automated design of efficient variants for vision and language tasks, as surveyed in comprehensive reviews.[13] By 2025, the literature highlighted generative NAS paradigms that leverage diffusion models and LLMs to sample architectures from learned distributions, further enhancing scalability.[14] Key milestones included the establishment of AutoML conferences and ICML workshops fostering collaboration, alongside benchmarks like NAS-Bench-101 in 2019, which tabulated 423,000 architectures on CIFAR-10 to standardize reproducible evaluations.[15]

Core Concepts
Search Space Definition
In neural architecture search (NAS), the search space encompasses all possible neural network architectures that can be constructed and evaluated during the optimization process. It serves as the foundational domain from which NAS methods sample and select candidate architectures, directly influencing the diversity, expressiveness, and computational feasibility of the search. Broadly, search spaces are categorized into macro and micro types. Macro search spaces define the entire network topology, including the sequence of layers, connections, and global structure, allowing comprehensive exploration of full architectures but often at high computational cost. In contrast, micro search spaces focus on smaller, repeatable building blocks or modules, such as cells, which are then stacked to form the complete network, enabling more manageable optimization while promoting modularity and transferability across models.[8]

Search spaces are parameterized either discretely or continuously to represent architectural choices. Discrete parameterization involves selecting from a predefined set of operations (e.g., convolutions, pooling, skip connections) for each position in the architecture, resulting in a combinatorial space where each choice is categorical. For instance, in chain-structured spaces, architectures are represented as sequential compositions of layers, where each layer's operation and hyperparameters (e.g., kernel size, number of filters) are chosen independently, leading to exponential growth in possibilities as the number of layers increases. Hierarchical search spaces extend this by organizing architectures across multiple levels, such as optimizing individual operations within cells and then the arrangement of those cells into blocks, as seen in methods like PNAS, which progressively refines structures from low to high levels. Continuous parameterization relaxes the discrete choices into a differentiable form, typically using softmax distributions over candidate operations to create a supernet in which architectures are weighted mixtures, facilitating gradient-based optimization. A prominent example is DARTS, where each edge in a directed acyclic graph (DAG) representing a cell is parameterized by architecture variables that blend operations continuously.[1][7]

Defining effective search spaces faces significant challenges, primarily the curse of dimensionality arising from their vast size. For example, the NASNet micro search space for a single cell with five blocks and five operation choices per block yields approximately 10^18 possible topologies, rendering exhaustive enumeration impractical and necessitating efficient sampling or approximation strategies, such as pruning irrelevant operations, to reduce complexity. These spaces often exceed 10^10 configurations even in constrained settings, amplifying the computational burden of evaluating candidates. To mitigate this, techniques such as restricting operation sets or imposing structural priors (e.g., DAG constraints) are employed, though they risk biasing the search toward human-designed priors. Performance estimation on these spaces, which assesses architecture quality without full training, is crucial but deferred to separate methodologies.[1]

Recent developments as of 2025 emphasize domain-specific search spaces tailored to emerging architectures, particularly vision transformers (ViTs) and multimodal models, to better capture task-specific priors and improve efficiency. For ViTs, search spaces now incorporate transformer-specific primitives such as attention heads, positional encodings, and patch embeddings, as in AutoFormer, which optimizes embedding dimensions and layer configurations within a ViT block to enhance out-of-domain generalization. In multimodal contexts, spaces integrate cross-modal fusion operations (e.g., attention across text and image branches), enabling NAS to discover hybrid architectures for tasks like visual question answering, though challenges persist in balancing modality-specific constraints with overall scalability. These trends shift from generic convolutional spaces toward hybrid designs, prioritizing adaptability to large-scale pretraining and fine-tuning paradigms.[16]
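As a rough illustration of how quickly a cell-based (micro) search space grows, the following sketch encodes a cell as a DAG whose edges each select one operation from a small candidate set. The operation list and node count are assumptions chosen for illustration rather than taken from any particular paper.

```python
import itertools

# Toy illustration (not from any cited paper) of a micro/cell search space:
# a cell is a small DAG in which every edge selects one operation from a
# fixed candidate set. Enumerating the choices shows how quickly the space
# grows even for tiny cells.

OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip", "zero"]

def cell_edges(num_nodes):
    """Edges of a DARTS-style cell: node j sees the two cell inputs
    plus all earlier intermediate nodes."""
    return [(i, j) for j in range(num_nodes) for i in range(j + 2)]

def enumerate_cells(num_nodes, ops=OPS):
    """Yield every assignment of one op per edge (feasible only for tiny cells)."""
    edges = cell_edges(num_nodes)
    for choice in itertools.product(ops, repeat=len(edges)):
        yield dict(zip(edges, choice))

if __name__ == "__main__":
    print(sum(1 for _ in enumerate_cells(1)), "cells with one intermediate node")
    edges = cell_edges(4)  # 4 intermediate nodes, as in a DARTS-style cell
    print(f"{len(edges)} edges, {float(len(OPS)) ** len(edges):.2e} op assignments")
```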
Performance Estimation

Performance estimation in neural architecture search (NAS) involves evaluating the quality of candidate architectures to guide the search process toward high-performing models. The most accurate approach is full training, where each candidate architecture is trained from scratch to convergence on the target dataset and evaluated on a held-out validation set. This method provides precise performance metrics but is computationally prohibitive; for instance, the seminal NASNet search required approximately 2,000 GPU-days on CIFAR-10 using reinforcement learning to explore thousands of architectures. Such costs highlight the need for efficient alternatives, as full training scales poorly with the size of the search space.

To mitigate these expenses, proxy tasks employ low-fidelity approximations that trade some accuracy for speed. These include training candidates for fewer epochs, using smaller proxy datasets (e.g., subsets of CIFAR-10 or MNIST instead of ImageNet), or reducing model width and depth while assuming rank consistency with full training. Surveys indicate that early stopping after a fixed number of iterations often preserves performance rankings, with Spearman's rank correlation coefficients above 0.7 on benchmarks like NAS-Bench-201, enabling searches to complete in hours rather than days.[17] However, the reliability of these proxies varies across search spaces, necessitating validation on specific tasks.[1]

More recent advancements focus on zero-cost proxies, which estimate performance without any training by analyzing architecture properties in a single forward or backward pass on random data. Introduced in 2021, these proxies draw from pruning techniques and include metrics like synaptic saliency (measuring parameter importance via gradients) and Jacobian covariance (assessing feature sensitivity).[18] Weight-sharing methods such as ENAS reduce evaluation cost through shared computation, whereas pure zero-cost approaches achieve rank correlations of up to 0.8 with true accuracy on NAS-Bench-101 without any training. By 2025, innovations such as parametric zero-cost proxies (ParZC) enhance adaptability through learnable parameters, improving correlation on diverse benchmarks like NDS, while evolving composite proxies combine multiple metrics nonlinearly for better generalization across tasks.[19][20]

Weight-sharing strategies further amortize costs by training a supernet that encompasses all candidate architectures, allowing subsets to inherit pre-trained weights for rapid evaluation. This forms the basis for one-shot NAS, where architectures are sampled from the supernet and assessed via proxy metrics, reducing search times from GPU-days to hours, as demonstrated by ENAS on CIFAR-10. Performance is typically measured by primary objectives like top-1 accuracy and inference latency, alongside secondary factors such as robustness to adversarial attacks, often quantified via expected calibration error or defense success rates. To validate proxy effectiveness, Spearman's rank correlation \rho between proxy scores and true performance is commonly used: \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, where d_i are rank differences and n is the number of architectures; values exceeding 0.6 indicate reliable proxies for guiding NAS.[18][17]
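A minimal sketch of how such a proxy might be validated, using the rank-correlation formula above (and assuming no tied ranks); the proxy scores and accuracies are made-up values for illustration.

```python
# Minimal sketch (my own illustration) of validating a performance proxy by
# Spearman rank correlation between proxy scores and true accuracies, using
# the formula quoted above (assumes no tied ranks).

def ranks(values):
    """Rank positions (1-based) of each value; the highest value gets rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(proxy_scores, true_accuracies):
    n = len(proxy_scores)
    d = [a - b for a, b in zip(ranks(proxy_scores), ranks(true_accuracies))]
    return 1 - 6 * sum(di * di for di in d) / (n * (n * n - 1))

# Hypothetical proxy scores and measured accuracies for five architectures.
proxy = [0.91, 0.40, 0.77, 0.65, 0.12]
true_acc = [94.1, 90.2, 93.5, 92.8, 88.7]
print(spearman_rho(proxy, true_acc))  # 1.0 here: the proxy preserves the ranking
```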
Search Strategies

Reinforcement Learning Approaches
Reinforcement learning approaches to neural architecture search frame the problem as a sequential decision-making process, in which a controller—typically a recurrent neural network (RNN)—samples architectures from a defined search space by generating sequences of architectural choices, such as layer types, filter sizes, and connections.[3] The controller is trained as a policy in an RL setting, with the reward signal derived from the validation accuracy of the sampled architectures after training them on a dataset like CIFAR-10.[3] To optimize the policy, the REINFORCE algorithm is employed, which updates the controller's parameters \theta using the policy gradient: \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi(a|s; \theta) (R - b), where \pi(a|s; \theta) is the policy probability of action a given state s, R is the reward, b is a baseline (e.g., an exponential moving average of past rewards) to reduce variance, and \alpha is the learning rate.[3] This baseline helps stabilize training by subtracting the expected reward from the actual reward, mitigating high variance in gradient estimates.[3]

The seminal work by Zoph and Le introduced this RL-based NAS framework, demonstrating its efficacy by searching for convolutional architectures on CIFAR-10 and achieving a test error rate of 3.65%, which outperformed prior hand-crafted models like DenseNet at the time.[3] Building on this, Zoph et al. developed NASNet, which uses an RNN controller to search for reusable "cells"—motifs of operations that can be stacked to form full networks—optimized initially on CIFAR-10 and transferred to ImageNet, yielding 82.7% top-1 accuracy while requiring fewer floating-point operations than human-designed alternatives.[4] In NASNet, the controller employs Proximal Policy Optimization for more stable updates compared to vanilla REINFORCE, focusing the search on normal and reduction cells with operations like convolutions, pooling, and skips.[4]

Variants of this approach, such as recurrent NAS methods, emphasize sequential decision-making for generating architectures, often incorporating bidirectional LSTMs or network morphisms to preserve performance during exploration.[21] For instance, Cai et al. integrated network transformations with REINFORCE to efficiently evolve architectures without full retraining from scratch.[22] These methods maintain the core RNN controller for sampling but enhance efficiency through techniques like weight caching, where intermediate model weights are reused across similar architectures to reduce redundant computation.[22]

Despite their pioneering role as the first fully automated NAS methods, RL approaches suffer from significant sample inefficiency, often requiring the training of thousands of child models—e.g., 12,800 for CIFAR-10 searches—and demanding substantial resources such as 800 GPUs over several weeks.[3] This high computational cost, coupled with the risk of premature convergence to suboptimal architectures due to exploration biases, has led to their decline in favor of more efficient strategies in subsequent research.[21] Early RL NAS established the viability of automated architecture design but highlighted the need for variance reduction and faster evaluation to scale effectively.[21]
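The controller update can be sketched compactly. The snippet below is a toy REINFORCE update for a categorical architecture-sampling policy with a moving-average baseline, not the original controller; the operation set, reward function, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Toy sketch (not the Zoph & Le controller) of a REINFORCE update for an
# architecture-sampling policy: a categorical distribution over candidate
# operations per decision step, updated with a moving-average baseline.
# `reward_fn` stands in for "train the sampled child model and return
# validation accuracy".

rng = np.random.default_rng(0)
OPS = ["conv3x3", "conv5x5", "max_pool", "skip"]
NUM_DECISIONS = 6                              # e.g., one op choice per layer
logits = np.zeros((NUM_DECISIONS, len(OPS)))   # policy parameters theta
baseline, lr, beta = 0.0, 0.1, 0.9

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def reward_fn(actions):
    # Placeholder reward; a real controller trains the child network instead.
    return float(np.mean(actions == 0))        # pretend conv3x3 is always best

for step in range(200):
    probs = softmax(logits)
    actions = np.array([rng.choice(len(OPS), p=p) for p in probs])
    R = reward_fn(actions)
    baseline = beta * baseline + (1 - beta) * R
    # REINFORCE: grad log pi(a) * (R - b) for each categorical decision.
    for i, a in enumerate(actions):
        grad_logp = -probs[i]
        grad_logp[a] += 1.0
        logits[i] += lr * grad_logp * (R - baseline)

print([OPS[i] for i in softmax(logits).argmax(axis=-1)])
```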
Evolutionary Algorithms

Evolutionary algorithms for neural architecture search (NAS) maintain a population of candidate architectures, each represented as a genotype encoding the network's topology and operations. These architectures are evaluated for fitness based on their performance, typically measured by validation accuracy after training on a proxy task. The process iterates over generations, applying genetic operators to evolve the population toward higher-performing designs. Key operations include crossover, which combines topologies from two parent architectures by exchanging substructures such as cells or layers; mutation, which alters operations or connections, such as replacing a convolution with a different kernel size; and selection, often using tournament selection where architectures compete in pairs or groups, or rank-based selection prioritizing top performers. These derivative-free methods explore discrete search spaces effectively by mimicking natural selection, balancing exploration through diversity and exploitation via fitness-driven choices.

Seminal work includes the large-scale evolution study of Real et al. (LargeEvo), which evolved image classifiers from simple initial models using tournament selection and mutation, achieving 94.6% accuracy on CIFAR-10, comparable to contemporaneous reinforcement learning methods but with reduced hyperparameter tuning.[23] AmoebaNet builds on this with regularized (aging) evolution, which attaches an age attribute to genotypes and removes the oldest individuals from the population, biasing selection toward recent mutations and preventing premature convergence; it yields hierarchical cell-based architectures with 83.9% top-1 accuracy on ImageNet, surpassing hand-designed models.[6][24] Recent advances incorporate population-based training (PBT) principles, such as population-guided evolution, to dynamically steer mutations using distribution statistics from the current population, enabling rediscovery of expert designs like ResNet variants with minimal human bias. This 2025 update enhances efficiency by adapting exploration without fixed hyperparameters, reducing search time by up to 66% on benchmarks like NAS-Bench-101 compared to standard regularized evolution.[25]

Evolutionary NAS offers advantages in parallelizability, as fitness evaluations across the population can run concurrently on distributed systems, and it excels in discrete spaces where gradient-based methods falter. Mutation probability is commonly set as p_m = \frac{1}{L}, where L is the architecture's length in operations or nodes, ensuring on average one change per individual to maintain diversity. For scalability, evolutionary methods have produced hierarchical architectures transferable to large-scale tasks like ImageNet classification, often matching reinforcement learning performance with greater architectural diversity that supports robustness across datasets.[6][24]
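A compact sketch of aging (regularized) evolution under these conventions is shown below; the genome encoding, the mutation rate p_m = 1/L, and the stand-in fitness function are illustrative assumptions rather than the published implementations.

```python
import collections
import random

# Minimal sketch, loosely following the aging-evolution idea: tournament
# selection over a sliding population, with the oldest individual removed
# each step. `evaluate` is a stand-in for training a candidate and measuring
# validation accuracy.

OPS = ["conv3x3", "conv5x5", "sep_conv", "max_pool", "skip"]
GENOME_LEN = 8                                  # one op choice per position

def random_genome():
    return [random.choice(OPS) for _ in range(GENOME_LEN)]

def mutate(genome, p_m=1.0 / GENOME_LEN):
    return [random.choice(OPS) if random.random() < p_m else op for op in genome]

def evaluate(genome):
    # Placeholder fitness; in practice, train and validate the network.
    return genome.count("sep_conv") + random.gauss(0, 0.1)

def aging_evolution(cycles=200, pop_size=20, sample_size=5):
    population = collections.deque()
    history = []
    for _ in range(pop_size):                   # initialize the population
        g = random_genome()
        population.append((g, evaluate(g)))
    history.extend(population)
    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent = max(tournament, key=lambda x: x[1])[0]
        child = mutate(parent)
        population.append((child, evaluate(child)))
        population.popleft()                    # age-based removal of the oldest
        history.append(population[-1])
    return max(history, key=lambda x: x[1])

print(aging_evolution())
```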
Bayesian Optimization Methods

Bayesian optimization (BO) methods in neural architecture search (NAS) model the performance landscape of architectures as a black-box function, using a probabilistic surrogate to guide efficient exploration of the search space. Typically, a Gaussian process (GP) serves as the surrogate model, providing a posterior distribution over the objective function based on observed evaluations, which captures both mean predictions and uncertainty. This allows BO to balance exploration and exploitation through acquisition functions such as expected improvement (EI), defined as EI(x) = \mathbb{E}[\max(f(x) - f(x^+), 0)], where f(x) is the predicted performance at architecture x and f(x^+) is the current best observed value.[26] In NAS, this framework minimizes the number of costly full trainings by prioritizing promising architectures for evaluation.[27]

Early applications of BO in NAS extended hyperparameter optimization techniques to architecture search. SMAC, a sequential model-based algorithm using random forest surrogates, and BOHB, which combines Bayesian modeling with Hyperband for multi-fidelity optimization, were adapted from hyperparameter tuning to NAS benchmarks like NAS-Bench-101, where they robustly handle invalid architectures and outperform random search, achieving equivalent performance approximately five times faster after around 50 evaluations.[28] A notable advancement is BANANAS (2019), which replaces traditional GP surrogates with neural network predictors—such as ensembles of feedforward networks or graph convolutional networks—paired with path-based encoding of architectures to improve scalability and accuracy in high-dimensional spaces. BANANAS demonstrates state-of-the-art results on NAS-Bench-101 (test error of 5.923% after 150 evaluations) and NAS-Bench-201, showing high correlation (e.g., Spearman rank correlations exceeding 0.8) between surrogate predictions and true performance.[29] For handling the mixed discrete-continuous nature of NAS spaces, tree-structured Parzen estimators (TPE) model the distribution of top-performing architectures versus others using density estimates over tree-structured conditionals, enabling effective sampling in frameworks like NNI for tasks such as chain-structured NAS.[30]

The efficiency of BO methods stems from their ability to reduce the number of architecture evaluations by 10-100 times compared to random search, particularly in benchmark spaces where surrogates achieve strong predictive correlations. For instance, on NAS-Bench-101, BO variants like SMAC require fewer than 25 full evaluations to match random search's median performance, leveraging uncertainty to avoid redundant sampling.[28][27]
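The acquisition step can be illustrated with the closed-form expected improvement under a Gaussian posterior; the surrogate means and uncertainties below are made-up values standing in for a trained surrogate's predictions.

```python
import numpy as np
from scipy.stats import norm

# Small sketch (my own illustration) of the expected-improvement acquisition
# used to pick the next architecture to train. `mu` and `sigma` would come
# from a surrogate model (e.g., a GP or neural predictor) evaluated on a pool
# of candidate encodings; here they are made-up posterior estimates.

def expected_improvement(mu, sigma, best_so_far):
    """EI for maximization: E[max(f(x) - f(x+), 0)] under N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

# Hypothetical surrogate predictions for four candidate architectures.
mu = np.array([0.920, 0.915, 0.930, 0.905])      # predicted accuracy
sigma = np.array([0.004, 0.020, 0.002, 0.030])   # predictive uncertainty
best_so_far = 0.925                              # best accuracy observed so far

ei = expected_improvement(mu, sigma, best_so_far)
print("next candidate to evaluate:", int(np.argmax(ei)))
```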
Local Search Techniques
Local search techniques in neural architecture search (NAS) rely on iterative optimization: starting from an initial architecture, they explore nearby candidates in the search space and move to a better-performing neighbor if one is found. This approach, often implemented as hill-climbing, begins with a randomly initialized or hand-crafted neural network architecture and repeatedly evaluates modifications to its components, such as convolutional layers or connections, until no further improvement is possible within the defined neighborhood. Its simplicity makes it particularly suitable for discrete search spaces where full evaluation of each candidate is computationally feasible, contrasting with more global strategies by focusing on local improvements without requiring probabilistic models or population maintenance.[31]

A seminal example of hill-climbing in NAS is the Neural Architecture Search by Hill-climbing (NASH) method, which applies network morphisms—transformations that preserve the functionality of the parent network while expanding its capacity—to generate child architectures. Starting from a small initial network, NASH iteratively selects the best child based on validation performance after brief training, achieving a test error below 6% on CIFAR-10 in approximately 12 hours on a single GPU. Neighborhoods in such methods are typically defined by small, targeted changes, including single operation swaps (e.g., replacing a convolution with a different kernel size), layer addition or removal, insertion of skip connections, or widening or deepening existing layers, with greedy selection favoring the modification that maximizes performance gain. Regularized evolution can be viewed as a closely related population-based variant, whose age-based regularization encourages exploration beyond strict local optima while maintaining a focus on incremental improvements.[31][6]

The core decision rule in hill-climbing evaluates neighbors by computing the performance difference \Delta = \text{perf}(A') - \text{perf}(A), where A is the current architecture, A' is a neighbor, and \text{perf}(\cdot) denotes the measured accuracy or loss after partial training. A move is accepted if \Delta > 0, ensuring monotonic progress toward a local optimum; ties or small gains may incorporate epsilon-greedy perturbations for stability. This process incurs low overhead, as it can operate in a single-threaded manner without surrogate models, enabling rapid iteration on modest hardware.[31]

Despite these advantages, hill-climbing is inherently prone to local optima, especially in rugged NAS search spaces where noise from finite training exacerbates trapping in suboptimal architectures. Recent analyses highlight that reducing evaluation noise through techniques like longer warm-up training enhances its effectiveness, making it competitive with more complex methods on benchmarks like NAS-Bench-101. To mitigate local optima, hybrids incorporating random restarts—reinitializing the search from new starting points upon convergence—have been explored, allowing multiple local optima to be sampled efficiently within a fixed computational budget.[32]
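A bare-bones hill-climbing loop over a discrete operation encoding might look as follows; the encoding, neighborhood definition, and noisy stand-in performance function are illustrative assumptions, not the NASH procedure itself.

```python
import random

# Toy hill-climbing sketch over a discrete architecture encoding: propose
# single-op swaps and keep a neighbor only if its estimated performance
# improves on the current architecture (Delta > 0).

OPS = ["conv3x3", "conv5x5", "sep_conv", "max_pool", "skip"]

def perf(arch):
    # Placeholder for "briefly train and measure validation accuracy";
    # the Gaussian noise mimics evaluation noise from partial training.
    return sum(1.0 if op == "sep_conv" else 0.2 for op in arch) + random.gauss(0, 0.05)

def neighbors(arch):
    """All architectures reachable by swapping the op at one position."""
    for i in range(len(arch)):
        for op in OPS:
            if op != arch[i]:
                yield arch[:i] + [op] + arch[i + 1:]

def hill_climb(length=6, max_steps=100):
    current = [random.choice(OPS) for _ in range(length)]
    current_perf = perf(current)
    for _ in range(max_steps):
        scored = [(perf(n), n) for n in neighbors(current)]
        best_perf, best_neighbor = max(scored, key=lambda s: s[0])
        if best_perf - current_perf <= 0:       # Delta <= 0: local optimum
            break
        current, current_perf = best_neighbor, best_perf
    return current, current_perf

print(hill_climb())
```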
Efficient and Advanced Methods

One-Shot NAS
One-shot neural architecture search (NAS) is a paradigm that trains a single supernet—a large neural network that encompasses all possible sub-architectures within a defined search space—to enable efficient sampling and evaluation of candidate architectures. By employing weight-sharing mechanisms, the supernet allows multiple architectures to reuse the same parameters during training, thereby avoiding the redundant computation required for independent training of each candidate, which can otherwise demand thousands of GPU hours. This approach builds on performance estimation strategies that leverage shared weights to approximate the quality of sub-architectures quickly.[33]

Pioneering methods in one-shot NAS include the Efficient Neural Architecture Search (ENAS) framework, introduced in 2018, which uses a reinforcement learning-based controller to sample subgraphs from the supernet while sharing weights across operations to guide the search toward high-performing architectures. Another key method is Single Path One-Shot (SPOS) from 2019, which employs uniform sampling from the supernet during training to derive diverse architectures, emphasizing simplicity and broad exploration of the search space. ProxylessNAS, also from 2018, extends this paradigm to target specific hardware constraints, such as mobile devices, by directly optimizing architectures on the deployment platform using shared weights and latency-aware sampling, achieving competitive accuracy with reduced inference latency.[33][34][35]

One-shot NAS offers substantial advantages, including up to a 1,000-fold speedup in search time compared to training architectures independently, as demonstrated on benchmarks like CIFAR-10 where ENAS completed searches in under one GPU-day versus thousands for prior methods. However, challenges arise from correlations in shared weights: frequent reuse can lead to biased performance estimates, since sub-architectures do not train in isolation, potentially inflating rankings for over-represented paths. To mitigate this, techniques like path dropout have been proposed to randomly mask paths during supernet training, promoting diversity and more reliable evaluations. Methods like SPOS improve architecture ranking correlation with standalone training through uniform single-path sampling, achieving Kendall tau correlations of 0.42–0.64 on benchmarks.[33][34]

As of 2025, advancements in one-shot NAS have introduced training-free variants that integrate zero-cost proxies—metrics derived from the architecture's topology or initial forward passes without any parameter updates—to enable instantaneous performance prediction and further accelerate searches. For instance, methods like TG-NAS use operator embeddings and graph learning to generalize these proxies across diverse search spaces, achieving strong correlations with final accuracies on ImageNet while requiring zero training epochs for evaluation. These developments extend the efficiency of one-shot paradigms to resource-constrained settings, such as edge devices, without compromising architectural quality.
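A minimal weight-sharing supernet with single-path uniform sampling can be sketched in a few lines of PyTorch; the layer structure and candidate operations below are illustrative assumptions rather than the ENAS or SPOS implementations.

```python
import random
import torch
import torch.nn as nn

# Minimal single-path weight-sharing sketch: each layer holds all candidate
# operations, and every forward pass samples one op per layer, so sampled
# sub-architectures share the supernet's weights.

class MixedLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),                      # skip connection
        ])

    def forward(self, x, choice):
        return self.ops[choice](x)

class SuperNet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(MixedLayer(channels) for _ in range(depth))

    def forward(self, x, path):
        for layer, choice in zip(self.layers, path):
            x = layer(x, choice)
        return x

supernet = SuperNet()
x = torch.randn(2, 16, 32, 32)
path = [random.randrange(4) for _ in supernet.layers]   # uniform single path
out = supernet(x, path)            # candidates reuse the same shared weights
print(path, out.shape)
```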
Gradient-Based NAS

Gradient-based neural architecture search (NAS) methods enable the optimization of discrete architectural choices through continuous relaxations, allowing the use of gradient descent for efficient end-to-end training.[7] These approaches represent the search space as a continuous distribution over possible architectures, typically by assigning learnable parameters to candidate operations and relaxing selection mechanisms to differentiable forms.[7] This formulation addresses the limitations of discrete search strategies by enabling backpropagation through the architecture parameters, significantly reducing computational costs compared to enumerative or sampling-based methods.[7]

A core technique in gradient-based NAS is the relaxation of categorical operation choices using a softmax function over architecture parameters \alpha_i, weighted by a temperature \tau: \bar{\alpha}_i = \frac{\exp(\alpha_i / \tau)}{\sum_j \exp(\alpha_j / \tau)}. This produces a weighted combination of operations during search, with the final architecture derived by selecting the operation with the highest \alpha_i via argmax after optimization.[7] The search process is often framed as a bilevel optimization problem, where architecture parameters \alpha are optimized in the outer loop using the validation loss, while network weights w are optimized in the inner loop using the training loss: \min_{\alpha} L_{\text{val}}(w^*(\alpha), \alpha), \quad \text{where} \quad w^*(\alpha) = \arg\min_w L_{\text{train}}(w, \alpha). Approximations, such as alternating single-step updates, make this tractable without full inner-loop convergence.[7]

The landmark method, Differentiable Architecture Search (DARTS), introduced this bilevel framework for searching repeatable cell structures in convolutional networks.[7] DARTS searches on CIFAR-10 in four GPU-days, yielding a normal cell and a reduction cell that, when stacked into a macro architecture, achieve a 2.76% test error with 3.3 million parameters—outperforming prior manual designs while using fewer resources.[7]

Subsequent variants addressed limitations in DARTS, such as sensitivity to the softmax relaxation and high memory demands. Gradually Differentiable Architecture Search (GDAS) replaces the softmax with Gumbel-softmax sampling to better approximate discrete choices during search, enabling robust architecture discovery in four GPU-hours and a 2.82% CIFAR-10 test error with 2.5 million parameters.[36] Partially Connected DARTS (PC-DARTS) mitigates memory overhead by partially connecting channels in the supernet—sampling subsets for computation while normalizing edges—allowing searches with larger batch sizes and achieving a 2.57% CIFAR-10 test error in 0.1 GPU-days.[37] By 2025, advances have scaled gradient-based NAS to large models like transformers, with methods like Smooth Activation DARTS (SA-DARTS) introducing regularization to counter skip-connection dominance and discretization gaps between search and evaluation phases, improving stability and performance on complex spaces.[38] Similarly, DASViT extends differentiable search to vision transformers, optimizing token mixing and projection operations in a continuous space to yield architectures outperforming ViT-B/16 baselines on ImageNet and other datasets while addressing scalability challenges in high-dimensional transformer designs.[39]
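The continuous relaxation and alternating (first-order) bilevel updates can be condensed into a toy example; the single mixed edge, synthetic regression data, and optimizer settings below are simplifying assumptions, not the DARTS codebase.

```python
import torch
import torch.nn.functional as F

# Sketch of a DARTS-style continuous relaxation with alternating first-order
# bilevel updates, reduced to a single mixed edge on synthetic data.

torch.manual_seed(0)
num_ops = 4
alpha = torch.zeros(num_ops, requires_grad=True)            # architecture params
weights = [torch.randn(8, 8, requires_grad=True) for _ in range(num_ops)]  # op weights

def mixed_op(x):
    """Weighted sum of candidate ops; coefficients are softmax(alpha)."""
    coeffs = F.softmax(alpha, dim=0)
    return sum(c * (x @ w) for c, w in zip(coeffs, weights))

def loss(x, y):
    return F.mse_loss(mixed_op(x), y)

x_train, y_train = torch.randn(32, 8), torch.randn(32, 8)
x_val, y_val = torch.randn(32, 8), torch.randn(32, 8)

w_opt = torch.optim.SGD(weights, lr=0.05)
a_opt = torch.optim.Adam([alpha], lr=0.01)

for step in range(100):
    # Inner step: update operation weights w on the training loss.
    w_opt.zero_grad()
    loss(x_train, y_train).backward()
    w_opt.step()
    # Outer (first-order) step: update architecture params on the validation loss.
    a_opt.zero_grad()
    loss(x_val, y_val).backward()
    a_opt.step()

# Discretize by argmax over the learned architecture parameters.
print("selected op:", int(F.softmax(alpha, dim=0).argmax()))
```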
Multi-Objective NAS

Multi-objective neural architecture search (NAS) extends traditional NAS by optimizing architectures with respect to multiple conflicting criteria simultaneously, such as prediction accuracy, computational complexity measured in floating-point operations (FLOPs), inference latency, and resource constraints like power consumption.[40] This approach is formalized as a multi-objective optimization problem: \min_{A} (f_1(A), f_2(A), \dots, f_k(A)), where A represents a neural architecture and each f_i(A) denotes an objective function, such as error rate or latency.[41] Solutions are selected via non-dominated sorting, identifying Pareto-optimal architectures where no objective can improve without degrading another.[42]

Pareto optimization in multi-objective NAS often adapts evolutionary algorithms like NSGA-II, which employs non-dominated sorting and crowding distance to approximate the Pareto front—a set of non-dominated architectures balancing trade-offs.[41] In NSGA-Net, for instance, NSGA-II explores a cell-based search space to minimize classification error and FLOPs, using crowding distance to promote diversity and prevent premature convergence to suboptimal solutions.[43] This evolutionary framework maintains a population of architectures, applying crossover and mutation operators informed by prior knowledge, yielding a diverse Pareto front in a single search run.[42]

Early methods like MONAS (2018) apply reinforcement learning to multi-objective NAS for mobile devices, defining a composite reward function that balances accuracy and hardware constraints such as peak power consumption.[44] MONAS searches for convolutional neural network architectures on datasets like CIFAR-10, achieving accuracies comparable to single-objective baselines while satisfying power budgets under 1 watt.[44] Similarly, hardware-aware approaches like FBNet (2018) use differentiable NAS to jointly optimize accuracy and device-specific latency, incorporating quantization-aware search (Q-NAS) to evaluate architectures under low-precision constraints like INT8, enabling deployment on mobile hardware with latencies as low as 2.9 ms on a Samsung Galaxy S8.[11]

Recent advancements address sustainability in multi-objective NAS, such as generative methods that integrate evolutionary algorithms with generative models to explore architecture distributions. For example, a 2024 framework combines multi-objective evolutionary search with generative architecture generation to optimize for accuracy and efficiency.[45] CE-NAS (2024) further exemplifies this by employing reinforcement learning and multi-objective optimization to dynamically allocate GPU resources based on carbon intensity, achieving up to a 7.22× reduction in CO₂ emissions compared to standard NAS while maintaining high accuracy (e.g., 80.6% top-1 on ImageNet).[46]
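Non-dominated selection itself is straightforward to sketch; the (error, latency) pairs below are hypothetical measurements used only to illustrate how a Pareto front is extracted.

```python
# Small illustration of selecting the Pareto front when each architecture is
# scored on two objectives to be minimized: error rate and latency.

def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the non-dominated subset of (error, latency_ms) tuples."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# Hypothetical (error %, latency ms) measurements for six architectures.
archs = [(5.2, 12.0), (4.8, 25.0), (6.1, 8.0), (4.9, 22.0), (5.0, 30.0), (7.5, 7.5)]
print(pareto_front(archs))   # everything except (5.0, 30.0), which (4.9, 22.0) dominates
```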
Evaluation and Applications

Benchmarks and Datasets
Standardized benchmarks have become essential in neural architecture search (NAS) to enable fair, reproducible comparisons across methods by providing pre-evaluated architectures and performance metrics. These benchmarks typically consist of tabular datasets mapping architectures to their accuracies, training curves, and other properties, allowing rapid evaluation without full retraining. Early benchmarks focused on vision tasks, but recent ones incorporate multi-domain and hardware-aware aspects to better reflect real-world deployment.[15][47]

NAS-Bench-101, introduced in 2019, is a foundational tabular benchmark containing 423,624 unique architectures from a cell-based search space, each fully trained and evaluated on CIFAR-10 for 108 epochs. It provides precomputed metrics such as validation accuracy, test accuracy, and training time, facilitating fast prototyping of NAS algorithms by querying the table instead of training from scratch. The benchmark's top-performing architecture achieves a test accuracy of 94.07% on CIFAR-10, while methods like DARTS typically recover architectures around 91-92% accuracy when evaluated on this space.[15]

Building on this, NAS-Bench-201, released in 2020, extends reproducibility to multiple datasets with 15,625 architectures in a different search space, evaluated on CIFAR-10, CIFAR-100, and a downsampled ImageNet variant (ImageNet-16-120). This allows assessment of architecture transferability across datasets, revealing that strong performers on CIFAR-10 often generalize well but degrade on more complex tasks like ImageNet-16-120. For instance, the best architecture yields 91.38% accuracy on CIFAR-10, 73.85% on CIFAR-100, and 51.03% on ImageNet-16-120.[47]

For larger search spaces, NAS-Bench-301 (2020) introduces a surrogate modeling approach to handle the DARTS search space of approximately 10^18 architectures, using learned surrogate models to estimate the performance of unseen architectures without exhaustive evaluation. The surrogates provide fast performance predictions, with reported rank correlations of up to 0.85 against ground-truth evaluations at a fraction of the computational cost. By 2025, it has become a standard for prototyping in expansive spaces, integrating with performance estimation strategies.[48]

Hardware-aware benchmarks address deployment constraints beyond accuracy, such as latency and energy. HW-NAS-Bench (2021) augments NAS-Bench-201 by providing measured hardware metrics (e.g., latency, FLOPs) for all architectures across six devices, including CPU, GPU, and edge platforms, enabling multi-objective NAS that balances accuracy with efficiency. For example, it shows that top-accuracy architectures often incur 2-5x higher latency on mobile GPUs compared to optimized ones.[49]

Common datasets in NAS benchmarks are CIFAR-10 for initial validation and ImageNet for large-scale testing, due to their established role in CNN evaluation. Emerging benchmarks incorporate VTAB (Visual Task Adaptation Benchmark, 2019), a suite of 19 diverse vision tasks (e.g., object classification, counting) for assessing transfer learning, with NAS methods evaluated on few-shot adaptation to measure generalization beyond single-dataset accuracy.[50] Key metrics in these benchmarks include search cost (measured in GPU hours) and anytime accuracy (best validation accuracy at any training epoch), emphasizing efficiency alongside performance.
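The way such tabular benchmarks are consumed can be sketched as a lookup table that charges each query its recorded training cost; the entries and the simulated random search below are illustrative, not an actual benchmark API.

```python
import random

# Illustrative sketch of how a tabular NAS benchmark is used: precomputed
# (accuracy, training-time) entries let a search strategy be "run" by table
# lookup instead of actual training. The entries below are made up; real
# benchmarks such as NAS-Bench-101/201 ship query APIs over hundreds of
# thousands of evaluated architectures.

BENCHMARK = {
    ("conv3x3", "conv3x3", "skip"):      {"val_acc": 93.1, "train_hours": 1.2},
    ("conv3x3", "sep_conv", "skip"):     {"val_acc": 94.0, "train_hours": 1.5},
    ("conv5x5", "max_pool", "conv3x3"):  {"val_acc": 91.7, "train_hours": 1.1},
    ("sep_conv", "sep_conv", "conv3x3"): {"val_acc": 93.6, "train_hours": 1.8},
}

def random_search(benchmark, budget_gpu_hours=4.0):
    """Simulated random search: charge each query its tabulated training cost."""
    spent, best = 0.0, None
    archs = list(benchmark)
    while spent < budget_gpu_hours:
        arch = random.choice(archs)
        entry = benchmark[arch]
        spent += entry["train_hours"]
        if best is None or entry["val_acc"] > best[1]:
            best = (arch, entry["val_acc"])
    return best, spent

print(random_search(BENCHMARK))
```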
The following table summarizes representative results for top architectures or methods on select benchmarks:

| Benchmark | Dataset | Top Accuracy (%) | Example Method (Accuracy %) | Search Cost (GPU Hours) | Source |
|---|---|---|---|---|---|
| NAS-Bench-101 | CIFAR-10 | 94.07 | DARTS (~91.8) | N/A (tabular) | Ying et al., 2019 |
| NAS-Bench-201 | CIFAR-10 | 91.38 | Random Search (90.83) | ~0.5 | Dong & Yang, 2020 |
| NAS-Bench-201 | ImageNet-16-120 | 51.03 | ENAS (50.59) | ~1.5 | Dong & Yang, 2020 |
| NAS-Bench-301 | ImageNet | ~75.5 (surrogate est.) | Zero-shot Proxy (corr. 0.85) | <0.1 | Zela et al., 2020 |
| HW-NAS-Bench | CIFAR-10 (GPU) | 91.25 | Hardware-Optimized (latency 1.2ms) | N/A | Cai et al., 2021 |