Neural architecture search
Neural architecture search (NAS) is a subfield of automated machine learning (AutoML) that automates the design of artificial neural network architectures by systematically exploring a defined search space to identify models that optimize performance metrics such as accuracy, efficiency, or latency, thereby reducing the reliance on expert manual engineering.[1][2] The origins of NAS trace back to early applications of evolutionary algorithms for neural network design in the late 1980s, but the field surged in prominence with the advent of deep learning.[1] The foundational modern approach was introduced by Zoph and Le in 2016, who employed reinforcement learning to generate and evaluate architectures on image classification tasks such as CIFAR-10, demonstrating that automated search could rival hand-crafted designs, albeit at significant computational expense of thousands of GPU hours per search.[3] Subsequent work, such as NASNet by Zoph et al. in 2017, refined this approach by searching for reusable "cells" on smaller datasets and transferring them to larger ones such as ImageNet, achieving state-of-the-art results in image classification and object detection.[4]

At its core, NAS comprises three interconnected components: the search space, which specifies the universe of possible architectures (e.g., cell-based structures with operations such as convolutions or skip connections); the search strategy, which navigates this space using algorithms such as reinforcement learning, evolutionary methods, or Bayesian optimization; and the performance estimation strategy, which assesses candidate architectures through full training, weight sharing, or low-fidelity proxies to mitigate costs.[1] Early strategies were computationally intensive, but innovations such as parameter sharing in ENAS (Pham et al., 2018) reduced search times from days to hours by reusing weights across architectures.[5]

Key advancements have diversified search strategies, including evolutionary algorithms in AmoebaNet (Real et al., 2018), which used regularized evolution to evolve high-performing convolutional cells competitive with NASNet on ImageNet while emphasizing model size constraints.[6] A major breakthrough came with gradient-based methods such as DARTS (Liu et al., 2018), which reformulates architecture search as a differentiable optimization problem, enabling end-to-end training via bilevel optimization and cutting search costs from thousands of GPU-days to a few GPU-days.[7] These approaches have extended NAS beyond vision to domains such as natural language processing, reinforcement learning, and efficient mobile deployment. In the years since, NAS has further advanced with hardware-aware methods, zero-shot estimation techniques, and integrations with transformer-based large language models.[1][8]

NAS has transformed deep learning by consistently discovering architectures that surpass manually designed ones on benchmarks, for example achieving lower error rates on CIFAR-10 and ImageNet while balancing trade-offs in parameters and inference speed.[4][6] Its importance lies in democratizing AI model development, accelerating innovation, and enabling tailored solutions for resource-constrained environments, though challenges remain in scalability, generalization across tasks, and benchmark reliability.[2][1]

Introduction
Definition and Motivation
Neural Architecture Search (NAS) is an automated methodology for discovering optimal neural network architectures tailored to specific machine learning tasks, such as classification or segmentation. It operates through three core components: a search space that delineates the universe of possible architectures (e.g., layer types, connections, and hyperparameters); a search strategy that navigates this space to sample candidate architectures; and a performance estimation strategy that assesses the efficacy of these candidates, often via training or proxy metrics. This framework shifts the burden of architecture engineering from human experts to algorithmic exploration, enabling the identification of high-performing models without exhaustive manual iteration.[1]

The motivation for NAS arises from the inherent limitations of traditional manual architecture design, which demands substantial domain expertise, iterative experimentation, and significant time investment, and often leads to suboptimal or biased outcomes because it relies on human intuition. By automating this process, NAS not only enhances model accuracy and efficiency—frequently surpassing hand-crafted designs like ResNet or VGG—but also democratizes advanced neural network development, integrating it into broader AutoML ecosystems to streamline end-to-end machine learning pipelines. This approach addresses the growing complexity of deep learning models, where manual tuning becomes increasingly infeasible as architectures scale in depth and width.[9]

NAS can be viewed as a natural extension of hyperparameter optimization, evolving to target structural elements such as topology and operations rather than just tuning parameters. Early applications concentrated on image classification benchmarks, including CIFAR-10 for smaller-scale validation and ImageNet for large-scale transferability; for example, reinforcement learning-based searches on CIFAR-10 yielded architectures with error rates around 3.65%, while subsequent adaptations like NASNet achieved state-of-the-art top-1 accuracy of 82.7% on ImageNet.[3][4] A central trade-off in NAS is the balance between computational expense and performance improvements, as exhaustive searches can demand thousands of GPU hours—early reinforcement learning methods, for instance, required up to 1,800 GPU-days on CIFAR-10—prompting ongoing research into efficient approximations that make NAS viable for resource-constrained environments.[10]
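The interaction of the three components can be made concrete with a small sketch. The following Python snippet is purely illustrative: the search space, the random-sampling strategy, and the stand-in scoring function are hypothetical, but it shows how a search strategy draws candidates from a search space and ranks them with a performance estimator.

```python
import random

# Hypothetical toy illustration of the three NAS components: a discrete
# search space, a random-sampling search strategy, and a proxy-based
# performance estimator. Names and the scoring rule are illustrative only.

SEARCH_SPACE = {
    "num_layers": [4, 8, 12],
    "width": [32, 64, 128],
    "op": ["conv3x3", "conv5x5", "sep_conv", "skip"],
}

def sample_architecture(space):
    """Search strategy: here, plain random sampling from the space."""
    return {key: random.choice(options) for key, options in space.items()}

def estimate_performance(arch):
    """Performance estimation: stand-in proxy score instead of full training."""
    # In practice this would train the candidate (or query a proxy/supernet).
    return random.random()

def nas_loop(budget=20):
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = sample_architecture(SEARCH_SPACE)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

if __name__ == "__main__":
    print(nas_loop())
```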
Historical Development

Neural architecture search (NAS) originated in 2016 with the pioneering work of Barret Zoph and Quoc V. Le, who introduced a reinforcement learning-based approach using a recurrent neural network controller to generate architectures, achieving state-of-the-art performance of 3.65% test error on CIFAR-10 and surpassing prior hand-designed convolutional networks.[3] This method marked the first automated discovery of competitive neural architectures, setting the foundation for subsequent NAS research by demonstrating that machine learning could optimize network design directly from data.[3]

In 2017, advancements scaled NAS to larger datasets, exemplified by NASNet, which searched for transferable "cells" on CIFAR-10 and applied them to ImageNet, yielding a model with 82.7% top-1 accuracy while requiring 28% less computation (roughly 9 billion fewer FLOPs) than the best prior human-designed architectures such as Inception-v4.[4] This transferability highlighted NAS's potential for practical deployment across tasks. Around the same time, evolutionary methods began emerging as compute-efficient alternatives to reinforcement learning, though they gained prominence later.

The high computational costs of early RL-based NAS—often requiring thousands of GPU-days—prompted a shift toward efficiency in 2018, with the introduction of one-shot methods like Efficient Neural Architecture Search (ENAS) and gradient-based approaches such as DARTS, which relaxed the discrete search space into a continuous one for end-to-end optimization and reduced search times to hours or a few days on a single GPU.[7] Hardware-aware NAS also debuted that year with FBNet, incorporating latency predictions into the search objective to optimize for edge devices.[11]

Post-2020, NAS evolved toward hardware optimization and broader architectures, with expansions in hardware-aware methods for edge devices through 2022–2025, including multi-objective searches balancing accuracy, latency, and energy on resource-constrained platforms.[12] Integration with transformers surged, enabling automated design of efficient variants for vision and language tasks, as surveyed in comprehensive reviews.[13] By 2025, the literature highlighted generative NAS paradigms that leverage diffusion models and LLMs to sample architectures from learned distributions, further enhancing scalability.[14] Key milestones included the establishment of AutoML conferences and ICML workshops fostering collaboration, alongside benchmarks like NAS-Bench-101 in 2019, which tabulated 423,000 architectures on CIFAR-10 to standardize reproducible evaluations.[15]

Core Concepts
Search Space Definition
In neural architecture search (NAS), the search space encompasses all possible neural network architectures that can be constructed and evaluated during the optimization process. It serves as the foundational domain from which NAS methods sample and select candidate architectures, directly influencing the diversity, expressiveness, and computational feasibility of the search. Broadly, search spaces are categorized into macro and micro types. Macro search spaces define the entire network topology, including the sequence of layers, connections, and global structure, allowing comprehensive exploration of full architectures but often at high computational cost. In contrast, micro search spaces focus on smaller, repeatable building blocks or modules, such as cells, which are then stacked to form the complete network, enabling more manageable optimization while promoting modularity and transferability across models.[8]

Search spaces are parameterized either discretely or continuously to represent architectural choices. Discrete parameterization involves selecting from a predefined set of operations (e.g., convolutions, pooling, skip connections) for each position in the architecture, resulting in a combinatorial space where each choice is categorical. For instance, in chain-structured spaces, architectures are represented as sequential compositions of layers, where each layer's operation and hyperparameters (e.g., kernel size, number of filters) are chosen independently, leading to exponential growth in possibilities as the number of layers increases. Hierarchical search spaces extend this by organizing architectures across multiple levels, such as optimizing individual operations within cells and then the arrangement of those cells into blocks, as seen in methods like PNAS, which progressively refines structures from low to high levels. Continuous parameterization relaxes the discrete choices into a differentiable form, typically using softmax distributions over candidate operations to create a supernet in which architectures are weighted mixtures, facilitating gradient-based optimization. A prominent example is DARTS, where each edge in a directed acyclic graph (DAG) representing a cell is parameterized by architecture variables that blend operations continuously.[1][7]

Defining effective search spaces faces significant challenges, primarily the curse of dimensionality arising from their vast size. For example, the NASNet micro search space for a single cell with five blocks and five operation choices per block yields approximately 10^18 possible topologies, rendering exhaustive enumeration impractical and necessitating efficient sampling or approximation strategies, such as pruning irrelevant operations, to reduce complexity. These spaces often exceed 10^10 configurations even in constrained settings, amplifying the computational burden of evaluating candidates. To mitigate this, techniques such as restricting operation sets or imposing structural priors (e.g., DAG constraints) are employed, though they risk biasing the search toward human-designed priors. Performance estimation on these spaces, which assesses architecture quality without full training, is crucial but deferred to separate methodologies.[1]

Recent developments as of 2025 emphasize domain-specific search spaces tailored to emerging architectures, particularly vision transformers (ViTs) and multimodal models, to better capture task-specific priors and improve efficiency. For ViTs, search spaces now incorporate transformer-specific primitives such as attention heads, positional encodings, and patch embeddings, as in AutoFormer, which optimizes embedding dimensions and layer configurations within a ViT block to enhance out-of-domain generalization. In multimodal contexts, spaces integrate cross-modal fusion operations (e.g., attention across text and image branches), enabling NAS to discover hybrid architectures for tasks like visual question answering, though challenges persist in balancing modality-specific constraints with overall scalability. These trends shift from generic convolutional spaces toward hybrid designs, prioritizing adaptability to large-scale pretraining and fine-tuning paradigms.[16]
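As a rough illustration of how quickly a cell-based (micro) search space grows, the following sketch encodes a cell as a DAG whose edges each select one operation from a small candidate set. The operation list and node count are assumptions chosen for illustration rather than taken from any particular paper.

```python
import itertools

# Toy illustration (not from any cited paper) of a micro/cell search space:
# a cell is a small DAG in which every edge selects one operation from a
# fixed candidate set. Enumerating the choices shows how quickly the space
# grows even for tiny cells.

OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip", "zero"]

def cell_edges(num_nodes):
    """Edges of a DARTS-style cell: node j sees the two cell inputs
    plus all earlier intermediate nodes."""
    return [(i, j) for j in range(num_nodes) for i in range(j + 2)]

def enumerate_cells(num_nodes, ops=OPS):
    """Yield every assignment of one op per edge (feasible only for tiny cells)."""
    edges = cell_edges(num_nodes)
    for choice in itertools.product(ops, repeat=len(edges)):
        yield dict(zip(edges, choice))

if __name__ == "__main__":
    print(sum(1 for _ in enumerate_cells(1)), "cells with one intermediate node")
    edges = cell_edges(4)  # 4 intermediate nodes, as in a DARTS-style cell
    print(f"{len(edges)} edges, {float(len(OPS)) ** len(edges):.2e} op assignments")
```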
Performance Estimation

Performance estimation in neural architecture search (NAS) involves evaluating the quality of candidate architectures to guide the search process toward high-performing models. The most accurate approach is full training, where each candidate architecture is trained from scratch to convergence on the target dataset and evaluated on a held-out validation set. This method provides precise performance metrics but is computationally prohibitive; for instance, the seminal NASNet search required approximately 2,000 GPU-days on CIFAR-10 using reinforcement learning to explore thousands of architectures. Such costs highlight the need for efficient alternatives, as full training scales poorly with the size of the search space.

To mitigate these expenses, proxy tasks employ low-fidelity approximations that trade some accuracy for speed. These include training candidates for fewer epochs, using smaller proxy datasets (e.g., subsets of CIFAR-10 or MNIST instead of ImageNet), or reducing model width and depth while assuming rank consistency with full training. Surveys indicate that early stopping after a fixed number of iterations often preserves performance rankings, with Spearman's rank correlation coefficients above 0.7 on benchmarks like NAS-Bench-201, enabling searches to complete in hours rather than days.[17] However, the reliability of these proxies varies across search spaces, necessitating validation on specific tasks.[1]

More recent advancements focus on zero-cost proxies, which estimate performance without any training by analyzing architecture properties in a single forward or backward pass on random data. Introduced in 2021, these proxies draw from pruning techniques and include metrics like synaptic saliency (measuring parameter importance via gradients) and Jacobian covariance (assessing feature sensitivity).[18] Weight-sharing methods such as ENAS reduce evaluation cost through shared computation, whereas pure zero-cost approaches achieve rank correlations of up to 0.8 with true accuracy on NAS-Bench-101 without any training. By 2025, innovations such as parametric zero-cost proxies (ParZC) enhance adaptability through learnable parameters, improving correlation on diverse benchmarks like NDS, while evolving composite proxies combine multiple metrics nonlinearly for better generalization across tasks.[19][20]

Weight-sharing strategies further amortize costs by training a supernet that encompasses all candidate architectures, allowing subsets to inherit pre-trained weights for rapid evaluation. This forms the basis for one-shot NAS, where architectures are sampled from the supernet and assessed via proxy metrics, reducing search times from GPU-days to hours, as demonstrated by ENAS on CIFAR-10. Performance is typically measured by primary objectives like top-1 accuracy and inference latency, alongside secondary factors such as robustness to adversarial attacks, often quantified via expected calibration error or defense success rates. To validate proxy effectiveness, Spearman's rank correlation \rho between proxy scores and true performance is commonly used: \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, where d_i are rank differences and n is the number of architectures; values exceeding 0.6 indicate reliable proxies for guiding NAS.[18][17]
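A minimal sketch of how such a proxy might be validated, using the rank-correlation formula above (and assuming no tied ranks); the proxy scores and accuracies are made-up values for illustration.

```python
# Minimal sketch (my own illustration) of validating a performance proxy by
# Spearman rank correlation between proxy scores and true accuracies, using
# the formula quoted above (assumes no tied ranks).

def ranks(values):
    """Rank positions (1-based) of each value; the highest value gets rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(proxy_scores, true_accuracies):
    n = len(proxy_scores)
    d = [a - b for a, b in zip(ranks(proxy_scores), ranks(true_accuracies))]
    return 1 - 6 * sum(di * di for di in d) / (n * (n * n - 1))

# Hypothetical proxy scores and measured accuracies for five architectures.
proxy = [0.91, 0.40, 0.77, 0.65, 0.12]
true_acc = [94.1, 90.2, 93.5, 92.8, 88.7]
print(spearman_rho(proxy, true_acc))  # 1.0 here: the proxy preserves the ranking
```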
Search Strategies

Reinforcement Learning Approaches
Reinforcement learning approaches to neural architecture search frame the problem as a sequential decision-making process, in which a controller—typically a recurrent neural network (RNN)—samples architectures from a defined search space by generating sequences of architectural choices, such as layer types, filter sizes, and connections.[3] The controller is trained as a policy in an RL setting, with the reward signal derived from the validation accuracy of the sampled architectures after training them on a dataset like CIFAR-10.[3] To optimize the policy, the REINFORCE algorithm is employed, which updates the controller's parameters \theta using the policy gradient: \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi(a|s; \theta) (R - b), where \pi(a|s; \theta) is the policy probability of action a given state s, R is the reward, b is a baseline (e.g., an exponential moving average of past rewards) to reduce variance, and \alpha is the learning rate.[3] This baseline helps stabilize training by subtracting the expected reward from the actual reward, mitigating high variance in gradient estimates.[3]

The seminal work by Zoph and Le introduced this RL-based NAS framework, demonstrating its efficacy by searching for convolutional architectures on CIFAR-10 and achieving a test error rate of 3.65%, which outperformed prior hand-crafted models like DenseNet at the time.[3] Building on this, Zoph et al. developed NASNet, which uses an RNN controller to search for reusable "cells"—motifs of operations that can be stacked to form full networks—optimized initially on CIFAR-10 and transferred to ImageNet, yielding 82.7% top-1 accuracy while requiring fewer floating-point operations than human-designed alternatives.[4] In NASNet, the controller employs Proximal Policy Optimization for more stable updates compared to vanilla REINFORCE, focusing the search on normal and reduction cells with operations like convolutions, pooling, and skips.[4]

Variants of this approach, such as recurrent NAS methods, emphasize sequential decision-making for generating architectures, often incorporating bidirectional LSTMs or network morphisms to preserve performance during exploration.[21] For instance, Cai et al. integrated network transformations with REINFORCE to efficiently evolve architectures without full retraining from scratch.[22] These methods maintain the core RNN controller for sampling but enhance efficiency through techniques like weight caching, where intermediate model weights are reused across similar architectures to reduce redundant computation.[22]

Despite their pioneering role as the first fully automated NAS methods, RL approaches suffer from significant sample inefficiency, often requiring the training of thousands of child models—e.g., 12,800 for CIFAR-10 searches—and demanding substantial resources such as 800 GPUs over several weeks.[3] This high computational cost, coupled with the risk of premature convergence to suboptimal architectures due to exploration biases, has led to their decline in favor of more efficient strategies in subsequent research.[21] Early RL NAS established the viability of automated architecture design but highlighted the need for variance reduction and faster evaluation to scale effectively.[21]
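The controller update can be sketched compactly. The snippet below is a toy REINFORCE update for a categorical architecture-sampling policy with a moving-average baseline, not the original controller; the operation set, reward function, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Toy sketch (not the Zoph & Le controller) of a REINFORCE update for an
# architecture-sampling policy: a categorical distribution over candidate
# operations per decision step, updated with a moving-average baseline.
# `reward_fn` stands in for "train the sampled child model and return
# validation accuracy".

rng = np.random.default_rng(0)
OPS = ["conv3x3", "conv5x5", "max_pool", "skip"]
NUM_DECISIONS = 6                              # e.g., one op choice per layer
logits = np.zeros((NUM_DECISIONS, len(OPS)))   # policy parameters theta
baseline, lr, beta = 0.0, 0.1, 0.9

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def reward_fn(actions):
    # Placeholder reward; a real controller trains the child network instead.
    return float(np.mean(actions == 0))        # pretend conv3x3 is always best

for step in range(200):
    probs = softmax(logits)
    actions = np.array([rng.choice(len(OPS), p=p) for p in probs])
    R = reward_fn(actions)
    baseline = beta * baseline + (1 - beta) * R
    # REINFORCE: grad log pi(a) * (R - b) for each categorical decision.
    for i, a in enumerate(actions):
        grad_logp = -probs[i]
        grad_logp[a] += 1.0
        logits[i] += lr * grad_logp * (R - baseline)

print([OPS[i] for i in softmax(logits).argmax(axis=-1)])
```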
Evolutionary Algorithms

Evolutionary algorithms for neural architecture search (NAS) maintain a population of candidate architectures, each represented as a genotype encoding the network's topology and operations. These architectures are evaluated for fitness based on their performance, typically measured by validation accuracy after training on a proxy task. The process iterates over generations, applying genetic operators to evolve the population toward higher-performing designs. Key operations include crossover, which combines topologies from two parent architectures by exchanging substructures such as cells or layers; mutation, which alters operations or connections, such as replacing a convolution with a different kernel size; and selection, often using tournament selection where architectures compete in pairs or groups, or rank-based selection prioritizing top performers. These derivative-free methods explore discrete search spaces effectively by mimicking natural selection, balancing exploration through diversity and exploitation via fitness-driven choices.

Seminal work includes the large-scale evolution study of Real et al. (LargeEvo), which evolved image classifiers from simple initial models using tournament selection and mutation, achieving 94.6% accuracy on CIFAR-10, comparable to contemporaneous reinforcement learning methods but with reduced hyperparameter tuning.[23] AmoebaNet builds on this with regularized (aging) evolution, which attaches an age attribute to genotypes and removes the oldest individuals from the population, biasing selection toward recent mutations and preventing premature convergence; it yields hierarchical cell-based architectures with 83.9% top-1 accuracy on ImageNet, surpassing hand-designed models.[6][24] Recent advances incorporate population-based training (PBT) principles, such as population-guided evolution, to dynamically steer mutations using distribution statistics from the current population, enabling rediscovery of expert designs like ResNet variants with minimal human bias. This 2025 update enhances efficiency by adapting exploration without fixed hyperparameters, reducing search time by up to 66% on benchmarks like NAS-Bench-101 compared to standard regularized evolution.[25]

Evolutionary NAS offers advantages in parallelizability, as fitness evaluations across the population can run concurrently on distributed systems, and it excels in discrete spaces where gradient-based methods falter. Mutation probability is commonly set as p_m = \frac{1}{L}, where L is the architecture's length in operations or nodes, ensuring on average one change per individual to maintain diversity. For scalability, evolutionary methods have produced hierarchical architectures transferable to large-scale tasks like ImageNet classification, often matching reinforcement learning performance with greater architectural diversity that supports robustness across datasets.[6][24]
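A compact sketch of aging (regularized) evolution under these conventions is shown below; the genome encoding, the mutation rate p_m = 1/L, and the stand-in fitness function are illustrative assumptions rather than the published implementations.

```python
import collections
import random

# Minimal sketch, loosely following the aging-evolution idea: tournament
# selection over a sliding population, with the oldest individual removed
# each step. `evaluate` is a stand-in for training a candidate and measuring
# validation accuracy.

OPS = ["conv3x3", "conv5x5", "sep_conv", "max_pool", "skip"]
GENOME_LEN = 8                                  # one op choice per position

def random_genome():
    return [random.choice(OPS) for _ in range(GENOME_LEN)]

def mutate(genome, p_m=1.0 / GENOME_LEN):
    return [random.choice(OPS) if random.random() < p_m else op for op in genome]

def evaluate(genome):
    # Placeholder fitness; in practice, train and validate the network.
    return genome.count("sep_conv") + random.gauss(0, 0.1)

def aging_evolution(cycles=200, pop_size=20, sample_size=5):
    population = collections.deque()
    history = []
    for _ in range(pop_size):                   # initialize the population
        g = random_genome()
        population.append((g, evaluate(g)))
    history.extend(population)
    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent = max(tournament, key=lambda x: x[1])[0]
        child = mutate(parent)
        population.append((child, evaluate(child)))
        population.popleft()                    # age-based removal of the oldest
        history.append(population[-1])
    return max(history, key=lambda x: x[1])

print(aging_evolution())
```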
Bayesian Optimization Methods

Bayesian optimization (BO) methods in neural architecture search (NAS) model the performance landscape of architectures as a black-box function, using a probabilistic surrogate to guide efficient exploration of the search space. Typically, a Gaussian process (GP) serves as the surrogate model, providing a posterior distribution over the objective function based on observed evaluations, which captures both mean predictions and uncertainty. This allows BO to balance exploration and exploitation through acquisition functions such as expected improvement (EI), defined as EI(x) = \mathbb{E}[\max(f(x) - f(x^+), 0)], where f(x) is the predicted performance at architecture x and f(x^+) is the current best observed value.[26] In NAS, this framework minimizes the number of costly full trainings by prioritizing promising architectures for evaluation.[27]

Early applications of BO in NAS extended hyperparameter optimization techniques to architecture search. SMAC, a sequential model-based algorithm using random forest surrogates, and BOHB, which combines Bayesian modeling with Hyperband for multi-fidelity optimization, were adapted from hyperparameter tuning to NAS benchmarks like NAS-Bench-101, where they robustly handle invalid architectures and outperform random search, achieving equivalent performance approximately five times faster after around 50 evaluations.[28] A notable advancement is BANANAS (2019), which replaces traditional GP surrogates with neural network predictors—such as ensembles of feedforward networks or graph convolutional networks—paired with path-based encoding of architectures to improve scalability and accuracy in high-dimensional spaces. BANANAS demonstrates state-of-the-art results on NAS-Bench-101 (test error of 5.923% after 150 evaluations) and NAS-Bench-201, showing high correlation (e.g., Spearman rank correlations exceeding 0.8) between surrogate predictions and true performance.[29] For handling the mixed discrete-continuous nature of NAS spaces, tree-structured Parzen estimators (TPE) model the distribution of top-performing architectures versus others using density estimates over tree-structured conditionals, enabling effective sampling in frameworks like NNI for tasks such as chain-structured NAS.[30]

The efficiency of BO methods stems from their ability to reduce the number of architecture evaluations by 10-100 times compared to random search, particularly in benchmark spaces where surrogates achieve strong predictive correlations. For instance, on NAS-Bench-101, BO variants like SMAC require fewer than 25 full evaluations to match random search's median performance, leveraging uncertainty to avoid redundant sampling.[28][27]
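The acquisition step can be illustrated with the closed-form expected improvement under a Gaussian posterior; the surrogate means and uncertainties below are made-up values standing in for a trained surrogate's predictions.

```python
import numpy as np
from scipy.stats import norm

# Small sketch (my own illustration) of the expected-improvement acquisition
# used to pick the next architecture to train. `mu` and `sigma` would come
# from a surrogate model (e.g., a GP or neural predictor) evaluated on a pool
# of candidate encodings; here they are made-up posterior estimates.

def expected_improvement(mu, sigma, best_so_far):
    """EI for maximization: E[max(f(x) - f(x+), 0)] under N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

# Hypothetical surrogate predictions for four candidate architectures.
mu = np.array([0.920, 0.915, 0.930, 0.905])      # predicted accuracy
sigma = np.array([0.004, 0.020, 0.002, 0.030])   # predictive uncertainty
best_so_far = 0.925                              # best accuracy observed so far

ei = expected_improvement(mu, sigma, best_so_far)
print("next candidate to evaluate:", int(np.argmax(ei)))
```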
Local Search Techniques
Local search techniques in neural architecture search (NAS) rely on iterative optimization: starting from an initial architecture, they explore nearby candidates in the search space and move to a better-performing neighbor if one is found. This approach, often implemented as hill-climbing, begins with a randomly initialized or hand-crafted neural network architecture and repeatedly evaluates modifications to its components, such as convolutional layers or connections, until no further improvement is possible within the defined neighborhood. Its simplicity makes it particularly suitable for discrete search spaces where full evaluation of each candidate is computationally feasible, contrasting with more global strategies by focusing on local improvements without requiring probabilistic models or population maintenance.[31]

A seminal example of hill-climbing in NAS is the Neural Architecture Search by Hill-climbing (NASH) method, which applies network morphisms—transformations that preserve the functionality of the parent network while expanding its capacity—to generate child architectures. Starting from a small initial network, NASH iteratively selects the best child based on validation performance after brief training, achieving a test error below 6% on CIFAR-10 in approximately 12 hours on a single GPU. Neighborhoods in such methods are typically defined by small, targeted changes, including single operation swaps (e.g., replacing a convolution with a different kernel size), layer addition or removal, insertion of skip connections, or widening or deepening existing layers, with greedy selection favoring the modification that maximizes performance gain. Regularized evolution can be viewed as a closely related population-based variant, whose age-based regularization encourages exploration beyond strict local optima while maintaining a focus on incremental improvements.[31][6]

The core decision rule in hill-climbing evaluates neighbors by computing the performance difference \Delta = \text{perf}(A') - \text{perf}(A), where A is the current architecture, A' is a neighbor, and \text{perf}(\cdot) denotes the measured accuracy or loss after partial training. A move is accepted if \Delta > 0, ensuring monotonic progress toward a local optimum; ties or small gains may incorporate epsilon-greedy perturbations for stability. This process incurs low overhead, as it can operate in a single-threaded manner without surrogate models, enabling rapid iteration on modest hardware.[31]

Despite these advantages, hill-climbing is inherently prone to local optima, especially in rugged NAS search spaces where noise from finite training exacerbates trapping in suboptimal architectures. Recent analyses highlight that reducing evaluation noise through techniques like longer warm-up training enhances its effectiveness, making it competitive with more complex methods on benchmarks like NAS-Bench-101. To mitigate local optima, hybrids incorporating random restarts—reinitializing the search from new starting points upon convergence—have been explored, allowing multiple local optima to be sampled efficiently within a fixed computational budget.[32]
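A bare-bones hill-climbing loop over a discrete operation encoding might look as follows; the encoding, neighborhood definition, and noisy stand-in performance function are illustrative assumptions, not the NASH procedure itself.

```python
import random

# Toy hill-climbing sketch over a discrete architecture encoding: propose
# single-op swaps and keep a neighbor only if its estimated performance
# improves on the current architecture (Delta > 0).

OPS = ["conv3x3", "conv5x5", "sep_conv", "max_pool", "skip"]

def perf(arch):
    # Placeholder for "briefly train and measure validation accuracy";
    # the Gaussian noise mimics evaluation noise from partial training.
    return sum(1.0 if op == "sep_conv" else 0.2 for op in arch) + random.gauss(0, 0.05)

def neighbors(arch):
    """All architectures reachable by swapping the op at one position."""
    for i in range(len(arch)):
        for op in OPS:
            if op != arch[i]:
                yield arch[:i] + [op] + arch[i + 1:]

def hill_climb(length=6, max_steps=100):
    current = [random.choice(OPS) for _ in range(length)]
    current_perf = perf(current)
    for _ in range(max_steps):
        scored = [(perf(n), n) for n in neighbors(current)]
        best_perf, best_neighbor = max(scored, key=lambda s: s[0])
        if best_perf - current_perf <= 0:       # Delta <= 0: local optimum
            break
        current, current_perf = best_neighbor, best_perf
    return current, current_perf

print(hill_climb())
```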
Efficient and Advanced Methods

One-Shot NAS
One-shot neural architecture search (NAS) is a paradigm that trains a single supernet—a large neural network that encompasses all possible sub-architectures within a defined search space—to enable efficient sampling and evaluation of candidate architectures. By employing weight-sharing mechanisms, the supernet allows multiple architectures to reuse the same parameters during training, thereby avoiding the redundant computation required for independent training of each candidate, which can otherwise demand thousands of GPU hours. This approach builds on performance estimation strategies that leverage shared weights to approximate the quality of sub-architectures quickly.[33]

Pioneering methods in one-shot NAS include the Efficient Neural Architecture Search (ENAS) framework, introduced in 2018, which uses a reinforcement learning-based controller to sample subgraphs from the supernet while sharing weights across operations to guide the search toward high-performing architectures. Another key method is Single Path One-Shot (SPOS) from 2019, which employs uniform sampling from the supernet during training to derive diverse architectures, emphasizing simplicity and broad exploration of the search space. ProxylessNAS, also from 2018, extends this paradigm to target specific hardware constraints, such as mobile devices, by directly optimizing architectures on the deployment platform using shared weights and latency-aware sampling, achieving competitive accuracy with reduced inference latency.[33][34][35]

One-shot NAS offers substantial advantages, including up to a 1,000-fold speedup in search time compared to training architectures independently, as demonstrated on benchmarks like CIFAR-10 where ENAS completed searches in under one GPU-day versus thousands for prior methods. However, challenges arise from correlations in shared weights: frequent reuse can lead to biased performance estimates, since sub-architectures do not train in isolation, potentially inflating rankings for over-represented paths. To mitigate this, techniques like path dropout have been proposed to randomly mask paths during supernet training, promoting diversity and more reliable evaluations. Methods like SPOS improve architecture ranking correlation with standalone training through uniform single-path sampling, achieving Kendall tau correlations of 0.42–0.64 on benchmarks.[33][34]

As of 2025, advancements in one-shot NAS have introduced training-free variants that integrate zero-cost proxies—metrics derived from the architecture's topology or initial forward passes without any parameter updates—to enable instantaneous performance prediction and further accelerate searches. For instance, methods like TG-NAS use operator embeddings and graph learning to generalize these proxies across diverse search spaces, achieving strong correlations with final accuracies on ImageNet while requiring zero training epochs for evaluation. These developments extend the efficiency of one-shot paradigms to resource-constrained settings, such as edge devices, without compromising architectural quality.
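A minimal weight-sharing supernet with single-path uniform sampling can be sketched in a few lines of PyTorch; the layer structure and candidate operations below are illustrative assumptions rather than the ENAS or SPOS implementations.

```python
import random
import torch
import torch.nn as nn

# Minimal single-path weight-sharing sketch: each layer holds all candidate
# operations, and every forward pass samples one op per layer, so sampled
# sub-architectures share the supernet's weights.

class MixedLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),                      # skip connection
        ])

    def forward(self, x, choice):
        return self.ops[choice](x)

class SuperNet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(MixedLayer(channels) for _ in range(depth))

    def forward(self, x, path):
        for layer, choice in zip(self.layers, path):
            x = layer(x, choice)
        return x

supernet = SuperNet()
x = torch.randn(2, 16, 32, 32)
path = [random.randrange(4) for _ in supernet.layers]   # uniform single path
out = supernet(x, path)            # candidates reuse the same shared weights
print(path, out.shape)
```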
Gradient-Based NAS

Gradient-based neural architecture search (NAS) methods enable the optimization of discrete architectural choices through continuous relaxations, allowing the use of gradient descent for efficient end-to-end training.[7] These approaches represent the search space as a continuous distribution over possible architectures, typically by assigning learnable parameters to candidate operations and relaxing selection mechanisms to differentiable forms.[7] This formulation addresses the limitations of discrete search strategies by enabling backpropagation through the architecture parameters, significantly reducing computational costs compared to enumerative or sampling-based methods.[7]

A core technique in gradient-based NAS is the relaxation of categorical operation choices using a softmax function over architecture parameters \alpha_i, weighted by a temperature \tau: \bar{\alpha}_i = \frac{\exp(\alpha_i / \tau)}{\sum_j \exp(\alpha_j / \tau)}. This produces a weighted combination of operations during search, with the final architecture derived by selecting the operation with the highest \alpha_i via argmax after optimization.[7] The search process is often framed as a bilevel optimization problem, where architecture parameters \alpha are optimized in the outer loop using the validation loss, while network weights w are optimized in the inner loop using the training loss: \min_{\alpha} L_{\text{val}}(w^*(\alpha), \alpha), \quad \text{where} \quad w^*(\alpha) = \arg\min_w L_{\text{train}}(w, \alpha). Approximations, such as alternating single-step updates, make this tractable without full inner-loop convergence.[7]

The landmark method, Differentiable Architecture Search (DARTS), introduced this bilevel framework for searching repeatable cell structures in convolutional networks.[7] DARTS searches on CIFAR-10 in four GPU-days, yielding a normal cell and a reduction cell that, when stacked into a macro architecture, achieve a 2.76% test error with 3.3 million parameters—outperforming prior manual designs while using fewer resources.[7]

Subsequent variants addressed limitations in DARTS, such as sensitivity to the softmax relaxation and high memory demands. Gradually Differentiable Architecture Search (GDAS) replaces the softmax with Gumbel-softmax sampling to better approximate discrete choices during search, enabling robust architecture discovery in four GPU-hours and a 2.82% CIFAR-10 test error with 2.5 million parameters.[36] Partially Connected DARTS (PC-DARTS) mitigates memory overhead by partially connecting channels in the supernet—sampling subsets for computation while normalizing edges—allowing searches with larger batch sizes and achieving a 2.57% CIFAR-10 test error in 0.1 GPU-days.[37] By 2025, advances have scaled gradient-based NAS to large models like transformers, with methods like Smooth Activation DARTS (SA-DARTS) introducing regularization to counter skip-connection dominance and discretization gaps between search and evaluation phases, improving stability and performance on complex spaces.[38] Similarly, DASViT extends differentiable search to vision transformers, optimizing token mixing and projection operations in a continuous space to yield architectures outperforming ViT-B/16 baselines on ImageNet and other datasets while addressing scalability challenges in high-dimensional transformer designs.[39]
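The continuous relaxation and alternating (first-order) bilevel updates can be condensed into a toy example; the single mixed edge, synthetic regression data, and optimizer settings below are simplifying assumptions, not the DARTS codebase.

```python
import torch
import torch.nn.functional as F

# Sketch of a DARTS-style continuous relaxation with alternating first-order
# bilevel updates, reduced to a single mixed edge on synthetic data.

torch.manual_seed(0)
num_ops = 4
alpha = torch.zeros(num_ops, requires_grad=True)            # architecture params
weights = [torch.randn(8, 8, requires_grad=True) for _ in range(num_ops)]  # op weights

def mixed_op(x):
    """Weighted sum of candidate ops; coefficients are softmax(alpha)."""
    coeffs = F.softmax(alpha, dim=0)
    return sum(c * (x @ w) for c, w in zip(coeffs, weights))

def loss(x, y):
    return F.mse_loss(mixed_op(x), y)

x_train, y_train = torch.randn(32, 8), torch.randn(32, 8)
x_val, y_val = torch.randn(32, 8), torch.randn(32, 8)

w_opt = torch.optim.SGD(weights, lr=0.05)
a_opt = torch.optim.Adam([alpha], lr=0.01)

for step in range(100):
    # Inner step: update operation weights w on the training loss.
    w_opt.zero_grad()
    loss(x_train, y_train).backward()
    w_opt.step()
    # Outer (first-order) step: update architecture params on the validation loss.
    a_opt.zero_grad()
    loss(x_val, y_val).backward()
    a_opt.step()

# Discretize by argmax over the learned architecture parameters.
print("selected op:", int(F.softmax(alpha, dim=0).argmax()))
```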
Multi-Objective NAS

Multi-objective neural architecture search (NAS) extends traditional NAS by optimizing architectures with respect to multiple conflicting criteria simultaneously, such as prediction accuracy, computational complexity measured in floating-point operations (FLOPs), inference latency, and resource constraints like power consumption.[40] This approach is formalized as a multi-objective optimization problem: \min_{A} (f_1(A), f_2(A), \dots, f_k(A)), where A represents a neural architecture and each f_i(A) denotes an objective function, such as error rate or latency.[41] Solutions are selected via non-dominated sorting, identifying Pareto-optimal architectures where no objective can improve without degrading another.[42]

Pareto optimization in multi-objective NAS often adapts evolutionary algorithms like NSGA-II, which employs non-dominated sorting and crowding distance to approximate the Pareto front—a set of non-dominated architectures balancing trade-offs.[41] In NSGA-Net, for instance, NSGA-II explores a cell-based search space to minimize classification error and FLOPs, using crowding distance to promote diversity and prevent premature convergence to suboptimal solutions.[43] This evolutionary framework maintains a population of architectures, applying crossover and mutation operators informed by prior knowledge, yielding a diverse Pareto front in a single search run.[42]

Early methods like MONAS (2018) apply reinforcement learning to multi-objective NAS for mobile devices, defining a composite reward function that balances accuracy and hardware constraints such as peak power consumption.[44] MONAS searches for convolutional neural network architectures on datasets like CIFAR-10, achieving accuracies comparable to single-objective baselines while satisfying power budgets under 1 watt.[44] Similarly, hardware-aware approaches like FBNet (2018) use differentiable NAS to jointly optimize accuracy and device-specific latency, incorporating quantization-aware search (Q-NAS) to evaluate architectures under low-precision constraints like INT8, enabling deployment on mobile hardware with latencies as low as 2.9 ms on a Samsung Galaxy S8.[11]

Recent advancements address sustainability in multi-objective NAS, such as generative methods that integrate evolutionary algorithms with generative models to explore architecture distributions. For example, a 2024 framework combines multi-objective evolutionary search with generative architecture generation to optimize for accuracy and efficiency.[45] CE-NAS (2024) further exemplifies this by employing reinforcement learning and multi-objective optimization to dynamically allocate GPU resources based on carbon intensity, achieving up to a 7.22× reduction in CO₂ emissions compared to standard NAS while maintaining high accuracy (e.g., 80.6% top-1 on ImageNet).[46]
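Non-dominated selection itself is straightforward to sketch; the (error, latency) pairs below are hypothetical measurements used only to illustrate how a Pareto front is extracted.

```python
# Small illustration of selecting the Pareto front when each architecture is
# scored on two objectives to be minimized: error rate and latency.

def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the non-dominated subset of (error, latency_ms) tuples."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# Hypothetical (error %, latency ms) measurements for six architectures.
archs = [(5.2, 12.0), (4.8, 25.0), (6.1, 8.0), (4.9, 22.0), (5.0, 30.0), (7.5, 7.5)]
print(pareto_front(archs))   # everything except (5.0, 30.0), which (4.9, 22.0) dominates
```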
Evaluation and Applications

Benchmarks and Datasets
Standardized benchmarks have become essential in neural architecture search (NAS) to enable fair, reproducible comparisons across methods by providing pre-evaluated architectures and performance metrics. These benchmarks typically consist of tabular datasets mapping architectures to their accuracies, training curves, and other properties, allowing rapid evaluation without full retraining. Early benchmarks focused on vision tasks, but recent ones incorporate multi-domain and hardware-aware aspects to better reflect real-world deployment.[15][47]

NAS-Bench-101, introduced in 2019, is a foundational tabular benchmark containing 423,624 unique architectures from a cell-based search space, each fully trained and evaluated on CIFAR-10 for 108 epochs. It provides precomputed metrics such as validation accuracy, test accuracy, and training time, facilitating fast prototyping of NAS algorithms by querying the table instead of training from scratch. The benchmark's top-performing architecture achieves a test accuracy of 94.07% on CIFAR-10, while methods like DARTS typically recover architectures around 91-92% accuracy when evaluated on this space.[15]

Building on this, NAS-Bench-201, released in 2020, extends reproducibility to multiple datasets with 15,625 architectures in a different search space, evaluated on CIFAR-10, CIFAR-100, and a downsampled ImageNet variant (ImageNet-16-120). This allows assessment of architecture transferability across datasets, revealing that strong performers on CIFAR-10 often generalize well but degrade on more complex tasks like ImageNet-16-120. For instance, the best architecture yields 91.38% accuracy on CIFAR-10, 73.85% on CIFAR-100, and 51.03% on ImageNet-16-120.[47]

For larger search spaces, NAS-Bench-301 (2020) introduces a surrogate modeling approach to handle the DARTS search space of approximately 10^18 architectures, using learned surrogate models to estimate the performance of unseen architectures without exhaustive evaluation. The surrogates provide fast performance predictions, with reported rank correlations of up to 0.85 against ground-truth evaluations at a fraction of the computational cost. By 2025, it has become a standard for prototyping in expansive spaces, integrating with performance estimation strategies.[48]

Hardware-aware benchmarks address deployment constraints beyond accuracy, such as latency and energy. HW-NAS-Bench (2021) augments NAS-Bench-201 by providing measured hardware metrics (e.g., latency, FLOPs) for all architectures across six devices, including CPU, GPU, and edge platforms, enabling multi-objective NAS that balances accuracy with efficiency. For example, it shows that top-accuracy architectures often incur 2-5x higher latency on mobile GPUs compared to optimized ones.[49]

Common datasets in NAS benchmarks are CIFAR-10 for initial validation and ImageNet for large-scale testing, due to their established role in CNN evaluation. Emerging benchmarks incorporate VTAB (Visual Task Adaptation Benchmark, 2019), a suite of 19 diverse vision tasks (e.g., object classification, counting) for assessing transfer learning, with NAS methods evaluated on few-shot adaptation to measure generalization beyond single-dataset accuracy.[50] Key metrics in these benchmarks include search cost (measured in GPU hours) and anytime accuracy (best validation accuracy at any training epoch), emphasizing efficiency alongside performance.
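The way such tabular benchmarks are consumed can be sketched as a lookup table that charges each query its recorded training cost; the entries and the simulated random search below are illustrative, not an actual benchmark API.

```python
import random

# Illustrative sketch of how a tabular NAS benchmark is used: precomputed
# (accuracy, training-time) entries let a search strategy be "run" by table
# lookup instead of actual training. The entries below are made up; real
# benchmarks such as NAS-Bench-101/201 ship query APIs over hundreds of
# thousands of evaluated architectures.

BENCHMARK = {
    ("conv3x3", "conv3x3", "skip"):      {"val_acc": 93.1, "train_hours": 1.2},
    ("conv3x3", "sep_conv", "skip"):     {"val_acc": 94.0, "train_hours": 1.5},
    ("conv5x5", "max_pool", "conv3x3"):  {"val_acc": 91.7, "train_hours": 1.1},
    ("sep_conv", "sep_conv", "conv3x3"): {"val_acc": 93.6, "train_hours": 1.8},
}

def random_search(benchmark, budget_gpu_hours=4.0):
    """Simulated random search: charge each query its tabulated training cost."""
    spent, best = 0.0, None
    archs = list(benchmark)
    while spent < budget_gpu_hours:
        arch = random.choice(archs)
        entry = benchmark[arch]
        spent += entry["train_hours"]
        if best is None or entry["val_acc"] > best[1]:
            best = (arch, entry["val_acc"])
    return best, spent

print(random_search(BENCHMARK))
```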
The following table summarizes representative results for top architectures or methods on select benchmarks:

| Benchmark | Dataset | Top Accuracy (%) | Example Method (Accuracy %) | Search Cost (GPU Hours) | Source |
|---|---|---|---|---|---|
| NAS-Bench-101 | CIFAR-10 | 94.07 | DARTS (~91.8) | N/A (tabular) | Ying et al., 2019 |
| NAS-Bench-201 | CIFAR-10 | 91.38 | Random Search (90.83) | ~0.5 | Dong & Yang, 2020 |
| NAS-Bench-201 | ImageNet-16-120 | 51.03 | ENAS (50.59) | ~1.5 | Dong & Yang, 2020 |
| NAS-Bench-301 | ImageNet | ~75.5 (surrogate est.) | Zero-shot Proxy (corr. 0.85) | <0.1 | Zela et al., 2020 |
| HW-NAS-Bench | CIFAR-10 (GPU) | 91.25 | Hardware-Optimized (latency 1.2ms) | N/A | Cai et al., 2021 |