
Adversarial machine learning

Adversarial machine learning is a subfield of machine learning that examines the vulnerabilities of models to deliberate manipulations, such as subtle perturbations to inputs or poisoned training data, which induce incorrect predictions despite the models performing well on standard benchmarks. These vulnerabilities, often termed adversarial examples, arise because many models, particularly deep neural networks, rely on non-robust decision boundaries that can be crossed with minimal changes imperceptible to humans, as first empirically demonstrated in image classifiers where adding noise flipped labels from one class to another. The field addresses both offensive techniques, including evasion attacks at inference time and poisoning during training, and defensive strategies like adversarial training, which augments datasets with perturbed examples to foster robustness. Key characteristics include the transferability of adversarial examples across models, enabling black-box attacks without access to internal parameters, and the empirical observation that robustness gains often come at the cost of standard accuracy, highlighting a tension in model optimization. Pioneering work by Goodfellow et al. introduced the fast gradient sign method (FGSM), a simple yet effective way to generate such examples by maximizing loss in the input gradient direction, which underscored the linear susceptibility of models to small bounded perturbations. Despite advances in certified defenses using interval bound propagation or randomized smoothing, many proposed mitigations have been circumvented by stronger attacks, revealing that achieving provable robustness remains computationally intensive and theoretically challenging under worst-case assumptions. The field's significance lies in its implications for real-world deployments, where adversarial failures could undermine applications in autonomous vehicles, malware detection, or medical diagnostics, prompting calls for causal robustness over mere correlational fitting to align models with underlying data-generating processes. Controversies persist regarding the practical prevalence of attacks in deployed systems versus laboratory settings, with evidence suggesting that while contrived examples abound, physical-world exploits like sticker perturbations on stop signs have been realized, though defenses lag in deployment. Ongoing research emphasizes empirical evaluation on diverse datasets and threat models, prioritizing defenses that withstand adaptive adversaries over those vulnerable to simple countermeasures.

Historical Development

Origins and Early Examples

The concept of adversarial vulnerabilities in machine learning emerged from concerns over the fragility of classifiers to deliberate input manipulations, particularly in security-sensitive applications during the early 2000s. Initial practical demonstrations focused on evasion attacks against shallow models, such as those used in spam detection, where adversaries could alter features like word spellings or insertions to bypass filters while maintaining semantic intent. Dalvi et al. (2004) formalized this as an adversarial classification problem, modeling the interaction between a cost-aware attacker and a classifier trained on data; they showed that spammers could achieve evasion by solving a constrained optimization problem to shift inputs across decision boundaries with minimal feature changes, reducing detection rates significantly under realistic cost constraints. Theoretical underpinnings for such brittleness in simpler models, including linear classifiers and perceptrons, were explored through analyses of perturbation sensitivity near decision hyperplanes. Building on margin-based robustness ideas from support vector machines, early work in the 2000s, such as Lowd and Meek (2005), examined query-efficient evasion strategies against linear models in spam filtering tasks; they demonstrated that repeated black-box queries could reconstruct sufficient model information to craft effective adversarial inputs, exploiting the low-dimensional separability of features in domains like text classification. These efforts highlighted from first principles how even optimally trained linear separators remain vulnerable to bounded perturbations that flip predictions without altering the underlying content substantially. The transition to deeper architectures amplified these issues, with Szegedy et al. (2013) providing the first empirical evidence of adversarial examples in convolutional neural networks trained on ImageNet. By minimizing L_p-norm perturbations via box-constrained L-BFGS optimization on inputs, they generated nearly imperceptible noise—often undetectable to humans—that caused models like AlexNet to misclassify images with over 99% confidence, revealing the non-intuitive lack of robustness in high-dimensional feature spaces despite high accuracy on clean data. This demonstration, initially viewed as an "intriguing property" rather than a threat, underscored the gap between empirical performance and causal stability to small input shifts, prompting broader scrutiny beyond shallow models.

Key Milestones in Attack Discovery

In 2014, Ian Goodfellow and colleagues introduced the Fast Gradient Sign Method (FGSM), a foundational single-step attack that computes perturbations as the sign of the gradient of the loss function with respect to the input, scaled by a small epsilon. This approach generated adversarial examples capable of fooling deep neural networks on datasets like ImageNet and MNIST with perturbations often imperceptible to humans, revealing that models' reliance on linear behavior near data points creates exploitable vulnerabilities tied directly to gradient information. FGSM's simplicity and efficiency highlighted the causal role of optimization landscapes in enabling such attacks, influencing subsequent research by providing a baseline for white-box attacks under l_p constraints. Building on FGSM, Nicolas Papernot et al. in 2016 demonstrated the transferability of adversarial examples across different models, showing that perturbations crafted via white-box access to a locally trained substitute model could achieve misclassification rates exceeding 80% on unseen black-box targets without direct gradient knowledge. This discovery enabled practical black-box attacks through limited query APIs, simulating real-world scenarios where adversaries lack full model access, such as cloud-based services, and underscored the non-uniqueness of adversarial perturbations across architectures. Concurrently, Nicholas Carlini and David Wagner developed a suite of optimization-based attacks in 2016, optimizing directly for minimal distortion under l_0, l_2, and l_infty norms using techniques like change-of-variables and targeted loss formulations, which reliably succeeded against defenses like defensive distillation with distortion norms orders of magnitude smaller than prior methods. By 2017, Aleksander Madry et al. advanced iterative attacks with Projected Gradient Descent (PGD), an l_infty-bounded multi-step method that approximates solutions to inner maximization problems by projecting updates onto feasible perturbation balls, outperforming FGSM in evasion success on MNIST and CIFAR-10 benchmarks under epsilon=0.3 (for normalized MNIST inputs). PGD's stronger adversarial generation—achieving near-perfect attack rates on undefended models—established it as a rigorous lower bound for robustness evaluation, spurring benchmarks that quantified vulnerabilities in state-of-the-art networks like ResNet, where even small bounded perturbations (e.g., 8/255 in pixel space) induced error rates over 90%. These developments from 2016 to 2018 collectively expanded attack scopes, emphasizing gradient-based causality and constraint-aware optimization as core to adversarial discovery.

Recent Advances and Standardization Efforts

In 2023 and 2024, adversarial attacks expanded significantly to large language models (LLMs) and generative systems, with prompt injection techniques enabling attackers to induce hallucinations or bypass safety constraints. Studies evaluated over 1,400 adversarial prompts across models including GPT-4, Claude 2, and Mistral 7B, revealing high success rates in eliciting unintended outputs despite alignment efforts. A 2024 analysis of nine jailbreak attack variants and seven defenses demonstrated that many methods, such as template-based prompts or iterative refinement, achieved over 80% success in generating harmful content, underscoring limitations in current safeguards. These findings highlighted the extension of evasion attacks to text-based generative models, where subtle input manipulations exploit probabilistic decoding. Concurrent research addressed backdoor triggers in federated learning, where malicious clients inject persistent vulnerabilities during distributed training without central data access. A study introduced BadFU, combining backdoor samples with camouflage samples to evade detection, achieving up to 95% success on global models while maintaining benign accuracy. Similarly, defenses like FedBAP used benign adversarial perturbations to dilute triggers, reducing backdoor efficacy by 70-90% in experiments on datasets such as CIFAR-10. These advances revealed causal dependencies in model aggregation, where even a few compromised participants could propagate exploits across federated setups. Standardization efforts intensified with NIST's AI 100-2 report, first issued in January 2024 to define AML terminology and taxonomies for attacks across life cycles, then updated in March 2025 to incorporate generative AI threats, prompt injection vectors, and federated backdoors. The 2025 edition refined categories for predictive and generative systems, emphasizing empirical attack surfaces like prompt-based evasions and chained exploits in vision-language models. Complementary initiatives, such as the AdvML-Frontiers workshop at NeurIPS in December 2024, fostered benchmarks for adversarial robustness, including universal attacks on aligned LLMs via optimized images that override instructions with 70-90% transferability. These developments prioritized verifiable evaluation over untested mitigations, aiding industry adoption of standardized assessments.

Core Concepts and Taxonomy

Definitions and Threat Models

Adversarial machine learning is the subfield of machine learning security that examines attacks where adversaries exploit a model's sensitivity to carefully designed inputs or manipulations, causing failures such as incorrect classifications while the model performs adequately on unmodified, benign examples. These vulnerabilities stem from the non-robust optimization inherent in standard training objectives, which minimize average loss over the training distribution but fail to ensure worst-case guarantees against perturbations. The core phenomenon involves adversarial examples—inputs x' derived from legitimate x via small changes \delta (e.g., \| \delta \|_p \leq \epsilon) that flip the model's output, often imperceptibly to humans under metrics like \ell_\infty norms. Threat models formalize the adversary's assumptions to evaluate attack feasibility and model robustness. Adversary knowledge is categorized as white-box, granting complete access to model parameters, gradients, and architecture for direct optimization of perturbations (e.g., via gradient descent), or black-box, restricting the adversary to queries for model outputs, relying on transferable examples or query-efficient approximations. Goals distinguish targeted attacks, forcing a specific erroneous output (e.g., classifying a panda as a gibbon), from untargeted ones inducing any misclassification, with targeted attacks typically requiring larger perturbations. Perturbation constraints enforce realism, commonly using \ell_p norms where p=\infty caps maximum pixel changes (e.g., \epsilon = 8/255 for standardized image benchmarks on datasets like CIFAR-10, balancing attack success and stealth). \ell_2 norms limit Euclidean distance for smoother distortions, while \ell_0 counts sparse changes, though \ell_\infty prevails for its perceptual uniformity in bounded domains. A key distinction in threat modeling is between abstract digital perturbations and causally deployable ones; laboratory constructs succeeding in pixel space often fail in physical settings due to factors like lighting, viewpoint, or printing artifacts, necessitating threat models incorporating expectation over transformations for realizability (e.g., via iterative printing and recapture). This gap highlights that empirical robustness under contrived constraints does not guarantee operational security against resource-bounded adversaries in unconstrained environments.
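
The norm constraints above can be made concrete with a short utility that projects a candidate perturbation onto an \ell_\infty ball and checks the budget; the function and variable names here are illustrative assumptions for a sketch, not drawn from any cited implementation.

```python
import numpy as np

def project_linf(x_adv, x_clean, epsilon=8/255):
    """Project an adversarial candidate back into the l_inf ball of radius
    epsilon around the clean input, then into the valid pixel range [0, 1]."""
    delta = np.clip(x_adv - x_clean, -epsilon, epsilon)
    return np.clip(x_clean + delta, 0.0, 1.0)

def respects_threat_model(x_adv, x_clean, epsilon=8/255, p=np.inf):
    """Check whether the perturbation satisfies ||delta||_p <= epsilon."""
    delta = (x_adv - x_clean).ravel()
    return np.linalg.norm(delta, ord=p) <= epsilon + 1e-8

# Example: an arbitrary candidate perturbation on a 32x32x3 image in [0, 1]
x = np.random.rand(32, 32, 3)
x_adv = project_linf(x + np.random.uniform(-0.1, 0.1, size=x.shape), x)
assert respects_threat_model(x_adv, x)
```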

Classification of Adversarial Threats

Adversarial threats are categorized by the timing of adversary intervention, with causative attacks targeting the training phase through data poisoning or model parameter manipulation to embed vulnerabilities in the learned representation, and explorative attacks operating at inference time to probe or evade the model's decision boundaries without modifying the underlying parameters. This distinction arises from the causal pathway: causative interventions alter the model's generative process, yielding persistent effects across inputs, whereas explorative ones exploit fixed decision surfaces for targeted misclassifications. The NIST 2025 taxonomy further delineates threats by impact on system properties, classifying them as availability attacks that disrupt model deployment through resource exhaustion or overload, integrity attacks that induce erroneous outputs via evasion or poisoning, and confidentiality attacks that extract sensitive training data or model internals through inversion or membership inference. Empirical evaluation of these threats employs metrics including attack success rate (ASR), computed as \text{ASR} = \frac{\text{number of successful adversarial instances}}{\text{total adversarial instances}} \times 100, which quantifies evasion efficacy, alongside model degradation indicators such as accuracy drop from 95% to below 10% under data poisoning with 5% tainted samples in controlled benchmarks. Supply-chain compromises of pre-trained models constitute a distinct, underexplored category within causative threats, where adversaries inject backdoors into publicly available foundational models, enabling downstream propagation of hidden triggers; for instance, tampering with models hosted on public repositories has demonstrated ASR exceeding 90% in inherited classifiers without direct access to end-user training. Academic literature's emphasis on explorative evasion, comprising over 70% of surveyed attack studies, risks underprioritizing these systemic vulnerabilities, as poisoning yields broader, harder-to-detect degradation in real-world deployments reliant on third-party components.
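
As a minimal illustration of the ASR metric defined above (the stand-in model and data are placeholders, not from the cited benchmarks), the rate can be computed by counting adversarial inputs whose predictions deviate from the true labels:

```python
import numpy as np

def attack_success_rate(model_predict, x_adv, y_true):
    """ASR = (# adversarial inputs misclassified) / (total adversarial inputs) * 100.

    model_predict: callable mapping a batch of inputs to predicted labels.
    x_adv: adversarial examples, one per original test input.
    y_true: ground-truth labels for the corresponding clean inputs.
    """
    preds = model_predict(x_adv)
    return 100.0 * np.sum(preds != y_true) / len(y_true)

# Toy usage with a stand-in "model" that thresholds the mean pixel value
toy_model = lambda batch: (batch.mean(axis=(1, 2, 3)) > 0.5).astype(int)
x_adv = np.random.rand(8, 3, 32, 32)
y_true = np.random.randint(0, 2, size=8)
print(f"ASR: {attack_success_rate(toy_model, x_adv, y_true):.1f}%")
```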

Attack Strategies

Training-Phase Attacks

Training-phase attacks target the model development process by corrupting training data or parameters, resulting in systematically flawed learned representations. These attacks exploit the reliance on training data integrity, where even minor alterations can propagate to induce undesired behaviors such as reduced overall accuracy or targeted misclassifications. Unlike inference-time manipulations, training-phase interventions establish persistent vulnerabilities embedded in the model's weights. Data poisoning constitutes a primary vector, involving the injection of adversarial samples—either through feature perturbations or label flips—into the training corpus. Empirical evaluations on datasets like CIFAR-10 demonstrate that poisoning 1-5% of samples via label flipping can substantially degrade classifier performance, often flipping decisions on clean test inputs by altering decision boundaries. For instance, indiscriminate poisoning strategies have been shown to compromise neural networks trained on image data, with attackers optimizing perturbations to maximize global error rates while maintaining stealth. In targeted scenarios, outliers designed to influence specific classes can shift model parameters, as quantified in studies where small fractions of malicious inputs suffice to mislead optimization. Backdoor attacks embed conditional triggers within the training data, causing models to exhibit normal performance on clean inputs but misbehave—typically misclassifying to an attacker-chosen target label—upon encountering the trigger. These are particularly insidious in federated learning, where participants upload local updates; a single malicious client can inject backdoors with minimal overhead, as demonstrated in frameworks where model replacement or gradient manipulation achieves high success rates without significantly elevating communication costs. Triggers often comprise subtle patterns, such as pixel patches in images, enabling attackers to retain control post-deployment. Research indicates that such attacks persist even under aggregation schemes like FedAvg, with attack success rates exceeding 90% in simulations on datasets like MNIST and CIFAR-10 using fewer than 5% malicious clients. In distributed learning, Byzantine attacks arise from rogue agents submitting arbitrary updates to skew global model aggregates, undermining convergence in parameter-server or synchronization-based architectures. Simulations reveal that standard defenses, such as median-based aggregation, fail when malicious participants exceed 20-30% of the network, leading to convergence to suboptimal or adversarial equilibria. Provably robust methods tolerate up to a quarter of Byzantine workers under strong convexity assumptions, but empirical tests on non-convex losses like those in deep networks show higher vulnerability, with failure rates climbing in high-dimensional settings. These attacks highlight causal chains from corrupted local gradients to global parameter drift, necessitating resilient optimization protocols.
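
A minimal sketch of the label-flipping poisoning described above (the poisoning fraction, synthetic dataset, and linear classifier are illustrative assumptions, not drawn from the cited studies) flips a small percentage of training labels and compares the resulting test accuracy against a clean-trained baseline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real training corpus
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def flip_labels(labels, fraction, rng):
    """Flip the labels of a randomly chosen fraction of training samples."""
    poisoned = labels.copy()
    idx = rng.choice(len(labels), size=int(fraction * len(labels)), replace=False)
    poisoned[idx] = 1 - poisoned[idx]   # binary labels
    return poisoned

rng = np.random.default_rng(0)
clean_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
poisoned_acc = LogisticRegression(max_iter=1000).fit(
    X_tr, flip_labels(y_tr, fraction=0.05, rng=rng)).score(X_te, y_te)
print(f"clean-train accuracy {clean_acc:.3f}, 5%-flipped accuracy {poisoned_acc:.3f}")
```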

Inference-Phase Attacks

Inference-phase attacks, also known as evasion attacks, target models during deployment by crafting imperceptible perturbations to input data, causing misclassification without altering the model's parameters or training process. These attacks exploit the sensitivity of learned decision boundaries to small changes in feature space, where a correctly classified input x is modified to x' = x + \delta such that the model outputs an incorrect label, often under constraints like \|\delta\|_\infty \leq \epsilon to ensure perturbations remain bounded and visually imperceptible. Empirical evidence demonstrates high efficacy in digital settings; for instance, the Fast Gradient Sign Method (FGSM), introduced in 2014, computes \delta = \epsilon \cdot \operatorname{sign}(\nabla_x L(f(x), y)) and achieves attack success rates (ASR) exceeding 90% on non-robust classifiers with \epsilon = 8/255. Similarly, Projected Gradient Descent (PGD), a stronger iterative variant from 2017, refines perturbations over multiple steps within the same \epsilon-ball, yielding ASR near 100% under white-box access on datasets like CIFAR-10 and ImageNet and transferable success across models. Transferability underpins the practicality of these attacks, as perturbations optimized for one model often fool others without query access, rooted in shared linear vulnerabilities near decision boundaries rather than model-specific idiosyncrasies. In black-box scenarios, where only query responses are available, attackers rely on surrogate models or gradient estimation, with empirical studies capping successful attacks at around 10^4 queries for high-dimensional tasks like ImageNet classification, balancing efficiency against detection risks from excessive probing. Unlike training-phase attacks, inference-phase methods require no dataset access, focusing instead on runtime input manipulation, which amplifies threats in deployed systems such as autonomous vehicles or malware detectors. Real-world simulations underscore causal fragility: in 2017 experiments, perturbations akin to small stickers on stop signs deceived classifiers in autonomous driving setups, reducing detection rates by over 80% under simulated lighting and angles, though physical deployment reveals limitations from viewpoint changes and sensor noise, where digital ASR drops below 50% in uncontrolled conditions. These constraints highlight that while mathematically minimal perturbations suffice in controlled inference, real-world efficacy demands accounting for environmental variability and camera processing, often failing beyond contrived stickers due to non-linear transformations in imaging pipelines.
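
The FGSM update described above can be sketched in a few lines of PyTorch; the model, inputs, and epsilon are placeholders chosen to illustrate the single-step gradient-sign computation rather than to reproduce any reported benchmark.

```python
import torch
import torch.nn as nn

def fgsm_attack(model, x, y, epsilon=8/255):
    """Single-step FGSM: x' = x + epsilon * sign(grad_x L(f(x), y)), clipped to [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

# Toy usage with an untrained stand-in classifier on random "images"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())  # perturbation magnitude is at most epsilon
```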

Model-Centric Attacks

Model-centric attacks in adversarial machine learning focus on exploiting access to a model's outputs to reconstruct its internal parameters, architecture, or underlying training data, primarily driven by incentives to circumvent development costs or evade intellectual-property protections. These differ from input-perturbation strategies by targeting the model's proprietary structure, enabling adversaries to replicate functionality without original resources. Empirical demonstrations have shown success in black-box settings via query APIs, where attackers train substitute models that approximate the target's decision boundaries with measurable fidelity, such as equivalence in prediction accuracy or recovery rates. Model extraction, also termed model stealing, involves querying a target model—often deployed as a machine learning-as-a-service (MLaaS) endpoint—to infer and replicate its logic. In a seminal study, Tramèr et al. demonstrated extraction of neural network classifiers from APIs like Amazon ML, achieving substitute models with up to 90% of the target's test accuracy using approximately 20,000 targeted queries per class, by solving for functional equivalence through decision tree or neural network approximations. Success relies on the model's overparameterization and query efficiency, with fidelity verified via metrics like prediction agreement on held-out data; for instance, extracted models matched oracle predictions on 84-99% of test instances across datasets like MNIST and CIFAR-10. Economic motivations are evident, as extracted models reduce computational overhead—training a comparable substitute can cost thousands in API fees but avoids millions in proprietary training—though attackers must optimize query strategies to evade rate limits. Model inversion attacks complement extraction by reversing model outputs to reconstruct private training instances, posing privacy risks in domains with sensitive data like healthcare or biometrics. Attackers optimize inputs that maximize posterior probabilities for target classes, effectively inverting the model's learned mappings; Fredrikson et al. (2015) recovered recognizable facial images from face recognition models with sufficient detail for identification, while later works extended this to deep networks, yielding partial reconstructions from confidence scores. However, empirical limits persist in high-dimensional spaces: reconstructions degrade due to the curse of dimensionality and lossy mappings in overparameterized models, often producing blurred or low-fidelity outputs rather than exact training samples, as quantified by reconstruction errors exceeding practical thresholds in experiments on ImageNet-scale datasets. Privacy concerns are thus context-dependent, with stronger protections in complex, high-variance models where inversion yields non-actionable reconstructions. In practice, model-centric attacks face barriers that limit real-world prevalence despite theoretical feasibility. Query volumes required—often tens of thousands per model—incur substantial costs (e.g., $100-1000 for cloud APIs) and risk detection via anomaly monitoring, deterring non-state actors. Industry assessments reviewing reported incidents find only isolated model stealing cases amid thousands of ML deployments, attributing rarity to these economic and operational hurdles rather than inherent robustness; surveys of MLaaS providers confirm attacks remain lab constructs, with no verified large-scale extractions of production models like GPT-series due to access controls and inefficiencies.
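
A heavily simplified sketch of query-based extraction follows (the victim model, query budget, and substitute architecture are illustrative assumptions, not the cited attack): labels obtained from a black-box "victim" are used to train a local substitute, and prediction agreement approximates extraction fidelity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in "victim" model exposed only through a predict() query interface
X, y = make_classification(n_samples=3000, n_features=10, random_state=1)
victim = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=1).fit(X, y)
query_api = lambda inputs: victim.predict(inputs)   # black-box labels only

# Attacker: sample synthetic queries, label them via the API, train a substitute
rng = np.random.default_rng(1)
queries = rng.normal(size=(2000, 10))               # assumed query budget
stolen_labels = query_api(queries)
substitute = DecisionTreeClassifier(max_depth=10).fit(queries, stolen_labels)

# Fidelity: agreement between substitute and victim on held-out probe inputs
probes = rng.normal(size=(1000, 10))
agreement = np.mean(substitute.predict(probes) == query_api(probes))
print(f"prediction agreement with victim: {agreement:.2%}")
```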

Domain-Specific Vulnerabilities

Classical Supervised Learning

Classical supervised learning paradigms, such as image and text classification, rely on models trained to minimize empirical risk on labeled datasets, mapping inputs to discrete outputs via functions like softmax over logits. In image classification, convolutional neural networks (CNNs) dominate, achieving clean accuracies exceeding 95% on CIFAR-10 and 80% on ImageNet subsets. However, these models generalize poorly to inputs with imperceptible adversarial perturbations, as measured by l_p-norm bounded attacks like projected gradient descent (PGD). Empirical benchmarks reveal that even adversarially trained CNNs, optimized for robustness, suffer accuracy collapses; for instance, on CIFAR-10 under l_∞ perturbations (ε=8/255), top models yield robust accuracies of 60-70%, a 25-35% decline from clean baselines. On ImageNet, the disparity widens, with robust accuracies often 20-50% below clean, underscoring failures in out-of-distribution generalization despite extensive training data. Linear models, including logistic regression and support vector machines, exhibit comparatively higher robustness owing to explicit margin maximization, which widens decision boundaries and limits perturbation impacts within certified radii. Studies on toy datasets and simplified image tasks show linear classifiers retaining 80-90% accuracy under equivalent attacks where CNNs drop below 10%, attributable to their convex loss landscapes avoiding the local minima that amplify sensitivity in deep architectures. Yet, this resilience is tempered by critiques: linear models' simplicity yields suboptimal clean performance on high-dimensional data like natural images (often <70% accuracy), fostering over-optimism about scalability; non-convex deep nets, while capturing hierarchical features, inherently trade robustness for expressivity, as perturbations exploit gradient-based optimization pathologies absent in linear regimes. Transfer attacks across models highlight shared generalization deficits rather than architecture-specific brittleness. Adversarial examples generated on one model often fool others trained on identical data, with success rates up to 70-90% in black-box settings, driven by collective reliance on spurious, non-robust correlations (e.g., background textures over object shapes). This transferability persists because models independently latch onto dataset biases, not due to universal non-linearity; mitigating via diverse training reduces it, affirming empirical risk minimization's causal role in vulnerabilities. In text classification, analogous issues arise with recurrent or transformer-based classifiers, where synonym swaps or negations evade safeguards, though benchmarks under word-level perturbation show milder drops (10-30%) than vision tasks, reflecting sparser input spaces.
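
Transferability as discussed above can be probed with a small experiment (the architectures, attack, and epsilon here are illustrative placeholders rather than a reported benchmark): craft FGSM examples against a source model and measure how often they also fool an independently initialized target model.

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, eps):
    """One-step gradient-sign attack used only to generate transfer candidates."""
    x = x.clone().detach().requires_grad_(True)
    nn.CrossEntropyLoss()(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def make_model(seed):
    torch.manual_seed(seed)
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64),
                         nn.ReLU(), nn.Linear(64, 10))

source, target = make_model(0), make_model(1)   # independently initialized stand-ins
x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))

x_adv = fgsm(source, x, y, eps=8/255)           # crafted against the source only
with torch.no_grad():
    transfer_fool_rate = (target(x_adv).argmax(1) != y).float().mean()
print(f"fraction of transferred examples misclassified by target: {transfer_fool_rate:.2f}")
```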

Reinforcement Learning Environments

In reinforcement learning (RL) environments, adversarial attacks exploit the sequential and interactive nature of agent-environment dynamics, enabling policy manipulation through subtle state or observation perturbations that propagate over trajectories, unlike the static inputs of supervised learning. These perturbations can induce unsafe or suboptimal actions, derailing long-term reward accumulation by steering agents away from optimal policies. For instance, in Atari benchmarks, injecting adversarial noise into pixel-level observations—often bounded by small norms such as \epsilon = 0.05 in L_\infty distance—has been demonstrated to cause deep Q-network (DQN) agents to select actions leading to episode failures or score collapses exceeding 90% in games like Pong and Seaquest. Similarly, strategically timed state alterations during critical decision points can lure agents into repetitive low-reward loops, as shown in analyses of deep RL agents where attacks timed to exploit policy evaluation phases amplify damage beyond random noise. Empirical studies highlight how such vulnerabilities manifest in dynamic settings, where adversaries manipulate environmental feedback to tamper with reward signals or exploration paths, fostering cascading errors in value estimation. In continuous control tasks like MuJoCo simulations, observation perturbations as low as 1-5% relative magnitude have triggered unstable behaviors, such as robotic agents falling or deviating from goals, underscoring the sensitivity of policy gradients to input distortions. This contrasts with supervised domains, as RL's credit assignment over extended horizons allows initial perturbations to compound, enabling reward tampering that reduces cumulative returns by orders of magnitude in partially observable Markov decision processes (POMDPs). Recent advances in adversarial RL frameworks, such as diffusion-based perturbation generation for observation attacks, have quantified these issues in 2024-2025 benchmarks, revealing that undefended agents in multi-agent environments succumb to coordinated attacks, dropping win rates from 80% to near zero. However, defensive strategies like adversarial training—incorporating perturbed observations into policy optimization—enhance short-term robustness but impose substantial costs, with experiments showing 2-5x increases in sample inefficiency due to noisier gradients and expanded compute demands during training. The inherent exploration-exploitation trade-off in RL exacerbates this, as agents' probabilistic action selection during learning phases provides adversaries opportunities to bias trajectories toward exploitable states, a causal mechanism absent in non-sequential paradigms where inputs lack temporal dependency.
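
The observation-perturbation threat can be illustrated with a minimal control loop (the environment, hand-written policy, and noise bound are placeholders, and the gymnasium package is assumed to be available); bounded noise is added to each observation before the policy sees it, which is where an optimized attack would act.

```python
import numpy as np
import gymnasium as gym

def perturb_observation(obs, epsilon, rng):
    """Add bounded noise to the agent's observation. A real attack would
    optimize this perturbation against the policy; uniform L_inf noise
    merely illustrates the attack surface."""
    return obs + rng.uniform(-epsilon, epsilon, size=obs.shape)

def greedy_policy(obs):
    """Stand-in policy: push the cart toward the pole's lean direction (obs[2] is pole angle)."""
    return int(obs[2] > 0)

env = gym.make("CartPole-v1")
rng = np.random.default_rng(0)
obs, _ = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    attacked_obs = perturb_observation(obs, epsilon=0.05, rng=rng)
    action = greedy_policy(attacked_obs)          # policy acts on the attacked input
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"episode return under perturbed observations: {total_reward}")
```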

Natural Language and Generative Models

Adversarial attacks on natural language models, particularly large language models (LLMs), often exploit prompt-level vulnerabilities to elicit undesired outputs, such as harmful or policy-violating responses. These attacks, known as jailbreaks, manipulate input prompts through optimization techniques to bypass safeguards. A prominent example is the Greedy Coordinate Gradient (GCG) method introduced in 2023, which iteratively optimizes discrete token sequences to form universal adversarial suffixes appended to benign prompts, achieving attack success rates of over 90% on models including GPT-3.5 and Vicuna for tasks like generating instructions for disallowed activities. This white-box approach leverages gradient-based updates in token embedding space, discretizing via greedy selection to maximize the attack objective while maintaining semantic coherence. Token perturbations extend these vulnerabilities by introducing subtle modifications in the input sequence, such as synonym substitutions or embedding-space perturbations, to deceive classifiers or generators. In generative contexts, such perturbations can induce models to produce biased or unsafe text continuations, with transferability across models observed in controlled evaluations up to 2024. For instance, automated suffix generation via GCG variants has demonstrated robustness to paraphrasing and formatting variations, succeeding on aligned LLMs by exploiting autoregressive decoding flaws. These methods highlight the fragility of token-level robustness in LLMs, where even small, imperceptible changes—measured in token edit distance or embedding norms—can shift outputs toward adversarial behaviors. In generative models, backdoor attacks target training or fine-tuning phases to embed triggers that activate latent manipulations, leading to controlled biases in synthesis. For diffusion-based text-to-image generators like Stable Diffusion, 2024 research demonstrated injection of arbitrary biases via natural textual triggers, enabling latent space inversion to produce targeted, undesired image distributions upon trigger activation. These backdoors persist post-deployment, inverting clean latents to synthesize biased outputs, such as stereotypical depictions, with high reliability under specific prompts. Detection challenges arise from the models' probabilistic nature, where trigger inversion techniques recover backdoors but require access to generation traces. Despite laboratory demonstrations of high success rates, empirical exploits against closed-source APIs remain rare as of 2025, primarily due to rate limiting that curtails the iterative query optimization essential for attacks like GCG, alongside integrated moderation layers that reject suspicious inputs. This contrasts with unconstrained white-box settings, underscoring a gap between theoretical vulnerabilities and practical deployment resilience, where black-box constraints degrade transferability. Peer-reviewed evaluations confirm that while open-weight models succumb readily, proprietary endpoints sustain lower attack success under real-world query budgets.
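
As a schematic of suffix-style attacks (this is a toy random-search stand-in for GCG, not the published algorithm; the scoring function, token pool, and loss are invented purely for illustration), candidate suffixes are appended to a prompt and kept when they lower the loss of an attacker-chosen target continuation.

```python
import random

def score_suffix(model_loss, prompt, suffix, target):
    """Lower loss on the target continuation = more effective suffix.
    model_loss is assumed to return the language-model loss of `target`
    given `prompt + suffix` (a stand-in for white-box loss access)."""
    return model_loss(prompt + " " + suffix, target)

def random_search_suffix(model_loss, prompt, target, token_pool,
                         n_tokens=8, iters=200, seed=0):
    """Greedy random search over suffix tokens, a crude analogue of
    coordinate-wise optimization over discrete tokens."""
    rng = random.Random(seed)
    suffix = [rng.choice(token_pool) for _ in range(n_tokens)]
    best = score_suffix(model_loss, prompt, " ".join(suffix), target)
    for _ in range(iters):
        pos = rng.randrange(n_tokens)
        candidate = suffix.copy()
        candidate[pos] = rng.choice(token_pool)
        cand_score = score_suffix(model_loss, prompt, " ".join(candidate), target)
        if cand_score < best:                 # keep substitutions that lower target loss
            suffix, best = candidate, cand_score
    return " ".join(suffix), best

# Toy loss: an arbitrary stand-in that runs end-to-end without a real model
toy_loss = lambda text, target: abs(len(text) % 17 - len(target))
suffix, loss = random_search_suffix(toy_loss, "benign prompt", "target text",
                                    token_pool=["alpha", "beta", "gamma", "delta"])
print(suffix, loss)
```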

Adversarial Examples and Techniques

Generation and Optimization Methods

The Fast Gradient Sign Method (FGSM), introduced by Goodfellow et al. in 2014, generates adversarial perturbations in a single step by computing the sign of the gradient of the loss function L(\theta, x, y) with respect to the input x, yielding x' = x + \epsilon \cdot \operatorname{sign}(\nabla_x L(\theta, x, y)), where \epsilon bounds the perturbation magnitude to maximize loss increase under an l_\infty norm constraint. This linear approximation assumes small perturbations suffice to cross decision boundaries, enabling efficient computation but often yielding suboptimal attacks due to non-convexity. Projected Gradient Descent (PGD), proposed by Madry et al. in 2017, extends this via iterative optimization: starting from x_0 = x, each step updates x_{t+1} = \Pi_{x + S}\left( x_t + \alpha \cdot \operatorname{sign}(\nabla_x L(\theta, x_t, y)) \right), where \Pi projects onto the constrained perturbation set S = \{ \delta : \|\delta\|_p \leq \epsilon \} for l_p norms (p = 1, 2, \infty), and \alpha is a small step size over T iterations (typically 10-40). This approximates the inner maximization \max_{\delta \in S} L(\theta, x + \delta, y) in min-max robust training formulations, with empirical success rates exceeding 90% on undefended CIFAR-10 classifiers under l_\infty balls of \epsilon = 8/255. Under smoothness and constraint qualifications, PGD converges to a first-order stationary point of the constrained non-convex problem, though global optimality lacks guarantees due to the non-convex loss landscape. Gradient-free methods, such as natural evolution strategies (NES), bypass derivative access by estimating gradients through zeroth-order queries: sampling perturbations \delta_i \sim \mathcal{N}(0, \sigma^2 I), evaluating model outputs f(x + \delta_i), and approximating \nabla_x \approx \frac{1}{m \sigma^2} \sum_{i=1}^m f(x + \delta_i) \delta_i via Monte Carlo averaging, then applying estimated gradients in PGD-like iterations. Evolutionary strategies directly evolve populations of candidate perturbations, selecting fitter ones (higher loss) via mutation and crossover, achieving query efficiencies of 10^3 to 10^5 queries in black-box benchmarks on ImageNet subsets, though computationally slower than gradient methods by factors of 10-100. These techniques have empirically exposed models' reliance on non-robust features—spurious, high-variance patterns in the data that predict labels accurately but flip under minimal perturbation—as demonstrated by Ilyas et al. in 2019, where curating datasets to isolate such features yielded classifiers with 88% accuracy on clean validation but near-100% adversarial vulnerability, attributing robustness gaps to data distribution artifacts over intrinsic architectural limits.
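
A compact PGD sketch following the update rule above (the model, data, and hyperparameters are placeholders; the step count and epsilon mirror commonly reported settings rather than any specific benchmark):

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, epsilon=8/255, alpha=2/255, steps=10):
    """Iterative l_inf PGD: repeated gradient-sign steps followed by projection
    onto the epsilon-ball around the clean input and the valid pixel range."""
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1).detach()
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)  # project to the l_inf ball
            x_adv = x_adv.clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv

# Toy usage with an untrained stand-in classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
print((x_adv - x).abs().max())   # perturbation stays within epsilon
```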

Access Model Variations

In white-box settings, adversaries possess full knowledge of the target model's architecture, parameters, and gradients, allowing for the computation of tailored perturbations that exploit internal model behavior. The Carlini-Wagner (C&W) attack exemplifies this approach by solving an optimization problem to minimize perturbation norms while ensuring misclassification, achieving attack success rates (ASR) approaching 100% on undefended convolutional neural networks evaluated on datasets such as MNIST, CIFAR-10, and ImageNet subsets. However, such methods are computationally intensive, often requiring thousands of iterations per example due to gradient-based optimization. Black-box settings limit adversaries to external interactions, such as querying the model for predictions or decisions without internal access, necessitating reliance on transferable adversarial examples generated via surrogate models or query-efficient gradient-estimation techniques like natural evolution strategies. Empirical evaluations around 2020 on image classification tasks demonstrated that perturbations crafted in white-box manner on substitute models transfer to black-box targets with success rates of 70-80% across architectures like ResNet and DenseNet, highlighting the phenomenon's robustness despite architectural differences. Query-based black-box variants, while adaptable, incur practical constraints like budget limits on API calls, reducing efficacy compared to white-box precision. Critics argue that white-box assumptions overestimate threat realism, as deployed models in production—such as those in cloud-based services—rarely expose internals, with proprietary protections and access controls rendering extraction infeasible without compromise. This disconnect underemphasizes black-box hurdles, including escalating costs and detection risks from high-volume queries (e.g., thousands per example in methods like SPSA), which empirical real-world analyses link to the scarcity of observed adversarial exploits despite theoretical vulnerabilities.
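
The query-based black-box setting can be sketched with a zeroth-order gradient estimate in the spirit of NES (the query function, sample count, and sigma are illustrative assumptions); only model outputs are used, never gradients.

```python
import numpy as np

def nes_gradient_estimate(query_loss, x, sigma=0.01, n_samples=50, seed=0):
    """Estimate grad_x of a scalar loss using only black-box queries:
    g ~ (1 / (n * sigma)) * sum_i loss(x + sigma * u_i) * u_i, with u_i ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(size=x.shape)
        grad += query_loss(x + sigma * u) * u
    return grad / (n_samples * sigma)

def black_box_step(query_loss, x, epsilon=8/255, alpha=2/255):
    """One PGD-like step using the estimated gradient sign, kept inside the l_inf ball."""
    g = nes_gradient_estimate(query_loss, x)
    x_adv = np.clip(x + alpha * np.sign(g), x - epsilon, x + epsilon)
    return np.clip(x_adv, 0.0, 1.0)

# Toy "remote model": loss grows as pixels drift from 0.5 (placeholder objective)
toy_query_loss = lambda inp: float(np.mean((inp - 0.5) ** 2))
x = np.random.rand(3, 32, 32)
x_adv = black_box_step(toy_query_loss, x)
print(np.abs(x_adv - x).max())
```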

Defensive Approaches

Robustness-Enhancing Training

Adversarial training enhances model robustness by incorporating adversarially perturbed examples into the training objective, typically formulated as a min-max optimization problem: the inner maximization generates perturbations to maximize loss for given parameters, while the outer minimization updates the model to minimize the expected robust loss. This paradigm, pioneered by Madry et al. in 2017, uses projected gradient descent (PGD) for the inner loop to approximate worst-case perturbations under constraints like \ell_\infty-norm bounded by \epsilon = 0.3. On the MNIST dataset, PGD-adversarial training reduced the robust test error from approximately 10% under 20-step PGD attacks in standard models to about 5%, effectively halving the vulnerability while maintaining clean accuracy above 98%. Despite these gains, adversarial training introduces quantifiable trade-offs between clean and robust performance, as standardized benchmarks reveal. RobustBench evaluations of state-of-the-art robust models on CIFAR-10 show clean accuracies typically dropping 10-30% compared to standard training's 95%+ baseline, with robust accuracies under Auto-PGD attacks reaching only 50-60% for top models like those using WideResNet architectures. This degradation persists across datasets, suggesting that overparameterized networks prioritize memorizing dataset-specific patterns over learning invariant features, leading to brittle robustness that amplifies under stronger attacks or distribution shifts. Variants of adversarial training mitigate these trade-offs by regularizing the objective to balance natural and robust errors. TRADES, proposed by Zhang et al. in 2019, decomposes the loss into a standard cross-entropy term on clean inputs plus a Kullback-Leibler divergence penalty between predictions on clean and perturbed inputs, achieving up to 55% robust accuracy on CIFAR-10 under \ell_\infty attacks with \epsilon=8/255—a 5-10% improvement over vanilla PGD training—at a reduced clean accuracy penalty of around 5-10%. Empirical studies confirm TRADES' superior Pareto frontier, though it increases computational demands by 2-3x due to additional forward passes.
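
A minimal adversarial training loop following the min-max formulation above (the network, synthetic data, and schedule are placeholders; the inner maximization reuses a PGD-style attack like the one sketched earlier):

```python
import torch
import torch.nn as nn

def pgd(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Inner maximization: find a perturbation inside the l_inf ball that increases the loss."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(nn.functional.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
    return x_adv.detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Outer minimization: update the model on adversarial examples (robust loss)
for step in range(100):                                  # stand-in for real epochs/batches
    x = torch.rand(32, 3, 32, 32)                        # placeholder batch
    y = torch.randint(0, 10, (32,))
    x_adv = pgd(model, x, y)                             # worst-case inputs for current weights
    loss = nn.functional.cross_entropy(model(x_adv), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```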

Detection and Mitigation Mechanisms

Statistical detection methods identify adversarial inputs as outliers relative to the training data distribution using techniques like kernel density estimation or local intrinsic dimensionality (LID) scores. In the LID approach, adversarial examples are flagged by computing the intrinsic dimensionality of local neighborhoods around inputs, where perturbations inflate dimensionality compared to benign samples; Ma et al. (2018) demonstrated detection rates exceeding 90% for certain attacks on image classifiers, though empirical benchmarks reveal false positive rates on clean data ranging from 5-20% depending on thresholds and datasets, limiting reliability in high-stakes settings. Input preprocessing defenses, such as JPEG compression, neutralize perturbations by introducing lossy transformations that degrade fine-grained adversarial noise while preserving semantic content in vision tasks. This method has been shown to reduce attack success rates by up to 50% on untargeted perturbations in benchmark evaluations, with minimal degradation (around 1-2%) to model accuracy on clean inputs. However, adaptive adversaries, which optimize perturbations to withstand such preprocessing, can evade these defenses, restoring high evasion rates as compression becomes predictable and differentiable during attack generation. Bayesian frameworks enable post-hoc rejection of uncertain predictions by quantifying epistemic uncertainty, often via variational inference or Monte Carlo dropout to estimate predictive distributions. Recent analyses (Corbin et al., 2023) highlight how modeling uncertainty about adversarial objectives allows detection through divergence from expected posteriors, achieving true positive rates above 80% in controlled settings; yet, elevated uncertainty on adversarial inputs does not inherently confer reliable detection, as it may stem from model limitations rather than causal identification of threats, and false negatives persist against sophisticated attacks calibrated to mimic in-distribution variance.
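
A minimal sketch of the JPEG-compression preprocessing defense (the quality setting and wrapper are illustrative; Pillow is assumed available): inputs are re-encoded before reaching the classifier, discarding some high-frequency perturbation energy.

```python
import io
import numpy as np
from PIL import Image

def jpeg_preprocess(x, quality=75):
    """Re-encode an image array (H, W, 3) in [0, 1] as JPEG and decode it back.
    The lossy round-trip removes some fine-grained adversarial noise."""
    img = Image.fromarray((np.clip(x, 0, 1) * 255).astype(np.uint8))
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return np.asarray(Image.open(buffer), dtype=np.float32) / 255.0

def defended_predict(model_predict, x, quality=75):
    """Wrap a classifier so every input is JPEG-compressed before inference."""
    return model_predict(jpeg_preprocess(x, quality=quality))

# Toy usage: the "model" just reports the mean pixel value of its input
toy_model = lambda img: float(img.mean())
x_adv = np.random.rand(32, 32, 3)
print(defended_predict(toy_model, x_adv))
```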

Real-World Impacts and Case Studies

Demonstrated Vulnerabilities in Deployments

In autonomous vehicle deployments, physical adversarial perturbations have been demonstrated to compromise perception systems under real-world conditions, though no confirmed safety incidents such as crashes have been publicly attributed to them. A 2018 study showed that printed stickers applied to traffic signs could fool sign classifiers used in vehicle perception with success rates up to 100% across various lighting and distances when tested via physical photographs, simulating operational camera inputs similar to those in systems like Tesla's Autopilot. Similarly, billboard-based attacks in simulated driving environments, extended to physical feasibility in follow-up analyses by 2019, altered model predictions in systems trained on datasets like KITTI, causing misclassification of vehicles or lanes, but practical constraints like precise placement and visibility limited scalability in uncontrolled deployments. These demonstrations highlight sensor vulnerabilities but underscore the absence of exploited wild incidents, partly due to multi-sensor fusion and human oversight in current level 2-3 autonomy systems. In cybersecurity applications, ML-based intrusion detection systems (IDS) deployed in networks have exhibited evasion vulnerabilities to adversarial modifications of malicious payloads. A 2023 evaluation of automatic evasion techniques against seven commercial and open-source ML-based NIDS, including configurations mimicking operational setups, achieved success rates over 90% by morphing network traffic features while preserving attack functionality, bypassing detectors reliant on anomaly scoring. Reports from 2024 further documented morphed malware samples evading up to 80% of ML classifiers in endpoint protection platforms during controlled red-team exercises on production-like environments, exploiting gradient-based perturbations to shift decision boundaries without alerting signature-based complements. Such exploits have enabled stealthy intrusions in real network defenses, though layered signature rules and behavioral heuristics have contained broader impacts. For medical imaging deployments, adversarial patches applied to X-ray scans have misclassified pathologies in AI-assisted diagnostic tools, raising concerns for clinical workflows despite no reported widespread erroneous diagnoses in patient care. A 2022 study demonstrated that localized perturbations on chest X-rays fooled convolutional neural networks trained for pneumonia detection—models akin to those integrated in radiology PACS systems—with targeted error rates exceeding 90% under white-box access, tested on datasets like CheXpert simulating hospital inputs. These vulnerabilities persist in operational settings where AI outputs inform but do not override radiologist review, limiting incident escalation; however, the ease of generating such patches via optimization methods like PGD illustrates risks in high-stakes, semi-autonomous diagnostics. Empirical gaps remain, as human-in-the-loop validation has precluded confirmed adversarial harms in live deployments.

Sector-Specific Risks and Consequences

In the financial sector, model extraction attacks represent a primary adversarial risk, enabling competitors or state actors to replicate proprietary models for fraud detection, credit scoring, or algorithmic trading through repeated queries. These attacks can result in intellectual property theft, potentially costing firms competitive advantages valued in billions annually across the industry, as proprietary models underpin algorithmic edges in markets where milliseconds determine outcomes. While initial extraction methods required few queries and negligible costs—such as under $0.50 for simple models in demonstrations—contemporary large-scale models demand substantially higher volumes, with defenses like calibrated proof-of-work escalating expenses to levels that deter non-state actors but remain viable for resourced adversaries. In defense and military contexts, adversarial machine learning vulnerabilities facilitate reverse-engineering of classifiers deployed in surveillance, target identification, and autonomous weapons systems, where manipulated inputs could mislead detections of threats like drones or missiles. Reinforcement learning agents, common in wargame simulations and tactical planning, are susceptible to policy poisoning, in which adversaries alter training environments or rewards to enforce target policies, derailing optimal strategies and inducing exploitable behaviors. For instance, vulnerability-aware poisoning mechanisms can exploit online updates, amplifying risks in dynamic scenarios where even subtle reward manipulations propagate to degrade performance over iterations. Across sectors, including healthcare and finance, confirmed real-world adversarial attacks on deployed models from 2021 to 2025 have proven scarce, with most documented ML disruptions stemming from prosaic issues like dataset shifts or engineering errors rather than deliberate adversarial inputs. Surveys of incidents reveal that while demonstrations abound, fielded exploits remain limited, suggesting that adversarial risks, though theoretically severe, are often overhyped relative to empirical occurrence, prioritizing investments in baseline robustness over specialized countermeasures. This pattern underscores a causal emphasis on verifiable threats, where economic or safety costs from rare attacks must be weighed against more frequent non-adversarial failures.

Challenges, Criticisms, and Open Problems

Practical Feasibility and Empirical Gaps

Despite demonstrations of adversarial perturbations in controlled settings, their translation to physical environments reveals significant limitations due to real-world constraints such as varying lighting, viewing angles, and sensor noise. For example, small perturbations, such as those bounded to 8/255 in pixel intensity, effective against image classifiers in static digital tests, often degrade or fail when displayed on screens under dynamic conditions like video playback or ambient light changes, as evidenced in physical evaluations spanning 2020 to 2024. These factors introduce causal variabilities that disrupt the precise alignment required for perturbation efficacy, undermining assumptions of seamless transferability from digital to operational contexts. Adversary incentives further constrain practical deployment, as crafting robust attacks demands extensive model knowledge, computational resources, and iterative optimization, frequently yielding only marginal success rates against defended systems in non-idealized scenarios. Critics argue this elevates much adversarial machine learning research to a largely academic exercise, where theoretical vulnerabilities overshadow deployable exploits, particularly given the escalating difficulty of defining, solving, and evaluating such problems in increasingly complex models. Empirical gaps persist in production environments, where systematic red-teaming against composed attacks—such as chained perturbations or multi-stage evasions—remains rare, allowing defenses to appear robust in isolation but falter under realistic adversarial compositions that exploit untested interactions. Limited real-world testing exacerbates this, with attackers facing logistical barriers like access restrictions and environmental unpredictability, highlighting a disconnect between lab-centric threat models and operational resilience.

Evaluation and Reproducibility Issues

Evaluation in adversarial machine learning frequently encounters inconsistencies in key metrics, such as Attack Success Rate (ASR), which measures the proportion of successful adversarial perturbations, and robust accuracy, defined as the accuracy under attack (equivalent to 1 minus ASR under white-box conditions). These metrics, while related, are not always reported uniformly, leading to misinterpretations when clean accuracy is conflated with robustness or when thresholds for "success" vary. Furthermore, studies employing different perturbation norms—such as \ell_\infty for bounded maximum changes versus \ell_2 for Euclidean distortions—hinder cross-study comparability, as robustness claims under one norm do not generalize to others without explicit adaptation. A 2025 analysis of gradient-based attack evaluations underscores how such discrepancies distort progress assessments and undermine trust in reported benchmarks. Reproducibility challenges exacerbate these issues, with variations in random seeding, optimization hyperparameters, hardware configurations (e.g., GPU floating-point nondeterminism), and even minor implementation differences causing substantial result divergence. In adversarial robustness research, attempts to replicate landmark claims have revealed a recurring pattern where reported defenses fail under controlled re-evaluations, often due to unaccounted stochasticity in training and adversarial example generation. Surveys from 2023 and 2024 highlight how these factors contribute to non-reproducible outcomes, with independent validations showing inconsistencies that question the reliability of peer-reviewed findings in the absence of standardized environments like containerized setups. Criticisms of the field extend to systemic flaws in research incentives, where the emphasis on devising novel attacks garners publications more readily than efforts to falsify robustness assertions through exhaustive re-evaluation or long-term validation. Position papers argue that escalating complexity in problem formulations—coupled with lax review standards prioritizing incremental novelty over empirical rigor—has slowed verifiable advances, fostering skepticism about the field's maturity. This dynamic, observed in high-volume conference submissions, prioritizes theoretical perturbations over practical, data-driven scrutiny, potentially inflating perceived threats while underemphasizing defense generalizability.

References

  1. [1]
    [PDF] Adversarial Machine Learning - NIST Technical Series Publications
    Mar 20, 2025 · This NIST Trustworthy and Responsible AI report provides a taxonomy of concepts and defines terminology in the field of adversarial machine ...
  2. [2]
    [1312.6199] Intriguing properties of neural networks - arXiv
    Dec 21, 2013 · In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of ...
  3. [3]
    [1412.6572] Explaining and Harnessing Adversarial Examples - arXiv
    Dec 20, 2014 · Explaining and Harnessing Adversarial Examples. Authors:Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy.
  4. [4]
    Attacks in Adversarial Machine Learning: A Systematic Survey from ...
    Feb 19, 2023 · Abstract:Adversarial machine learning (AML) studies the adversarial phenomenon of machine learning, which may make inconsistent or unexpected ...
  5. [5]
    [PDF] A Comprehensive Review of Adversarial Attacks on Machine Learning
    Dec 11, 2023 · These attacks involve crafting malicious inputs that can deceive a model into making incorrect predictions.
  6. [6]
    Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples - arXiv
    May 24, 2016 · Abstract page for arXiv paper 1605.07277: Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples.
  7. [7]
    [1608.04644] Towards Evaluating the Robustness of Neural Networks
    Aug 16, 2016 · Title:Towards Evaluating the Robustness of Neural Networks. Authors:Nicholas Carlini, David Wagner ... attacks' ability to find adversarial ...
  8. [8]
    Towards Deep Learning Models Resistant to Adversarial Attacks
    Jun 19, 2017 · We study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view.
  9. [9]
    A Systematic Evaluation of Prompt Injection and Jailbreak ... - arXiv
    May 7, 2025 · Our experiments evaluated over 1,400 adversarial prompts across four LLMs: GPT-4, Claude 2, Mistral 7B, and Vicuna. We analyze results along ...
  10. [10]
    [PDF] A Comprehensive Study of Jailbreak Attack versus Defense for ...
    Aug 11, 2024 · Jailbreak attacks use prompts to bypass safety measures in LLMs, producing harmful content. This study analyzes 9 attack and 7 defense ...
  11. [11]
    [PDF] arXiv:2402.06363v2 [cs.CR] 25 Sep 2024
    Sep 25, 2024 · Prompt injection attacks pose a major challenge for developing secure LLM-integrated applications, as they typically need to process much data ...
  12. [12]
    BadFU: Backdoor Federated Learning through Adversarial Machine ...
    Aug 21, 2025 · Specifically, we propose BadFU, an attack strategy where a malicious client uses both backdoor and camouflage samples to train the global model ...
  13. [13]
    [PDF] FedBAP: Backdoor Defense via Benign Adversarial Perturbation in ...
    Jul 26, 2025 · Federated Learning (FL) enables collaborative model training while preserving data privacy, but it is highly vulnerable to backdoor attacks.
  14. [14]
    Defending backdoor attacks in federated learning via adversarial ...
    This paper proposes ADFL, a novel adversarial distillation-based backdoor defense scheme for federated learning.
  15. [15]
    AI 100-2 E2023, Adversarial Machine Learning: A Taxonomy and ...
    Jan 4, 2024 · This NIST Trustworthy and Responsible AI report develops a taxonomy of concepts and defines terminology in the field of adversarial machine learning (AML).
  16. [16]
    AI 100-2 E2025, Adversarial Machine Learning: A Taxonomy and ...
    This NIST Trustworthy and Responsible AI report provides a taxonomy of concepts and defines terminology in the field of adversarial machine learning (AML).
  17. [17]
    AdvML 2024 - New Frontiers in Adversarial Machine Learning
    Dec 14, 2024 · Join us at AdvML-Frontiers'24 for a comprehensive exploration of adversarial learning at the intersection with cutting-edge multimodal technologies.
  18. [18]
    Universal Adversarial Attack on Multimodal Aligned LLMs - arXiv
    Jun 4, 2025 · We propose a universal adversarial attack on multimodal Large Language Models (LLMs) that leverages a single optimized image to override ...
  19. [19]
    NIST AI 100-2 | Adversarial Machine Learning Taxonomy | CSRC
    NIST AI 100-2 targets this issue and offers voluntary guidance relative to identifying, addressing, and managing the risks associated with adversarial machine ...
  20. [20]
    [PDF] Adversarial Machine Learning: A Taxonomy and Terminology of ...
    Jan 2, 2024 · This NIST Trustworthy and Responsible AI report develops a taxonomy of concepts and defines terminology in the field of adversarial machine ...
  21. [21]
    [PDF] Simple Black-box Adversarial Attacks
    Under the white-box threat model, the classifier h is provided to the adversary. In this scenario, a powerful attack strategy is to perform gradient descent on ...
  23. [23]
    [PDF] Adversarial Machine Learning∗ - People @EECS
    Oct 21, 2011 · Cost models of the adversary also led to a theory for query-based near-optimal evasion of classifiers first presented by Lowd and Meek, in which ...
  24. [24]
    [PDF] A Survey of Game Theoretic Approaches for Adversarial Machine ...
    The influence dimension specifies two types of adversarial attacks, causative and exploratory (also known as probing), illustrated in figure 2. In causative ...
  25. [25]
    A taxonomy and survey of attacks against machine learning
    This paper presents a taxonomy and survey of attacks against systems that use machine learning. It organizes the body of knowledge in adversarial machine ...
  26. [26]
    Machine Learning Models Have a Supply Chain Problem - arXiv
    May 28, 2025 · In this paper, we argue that the current ecosystem for open ML models contains significant supply-chain risks, some of which have been exploited ...
  28. [28]
    [PDF] Exploring the Limits of Model-Targeted Indiscriminate Data ...
    Figure 10: We visualize some poisoned images generated by the GC attack on the CIFAR-10 dataset. The first row shows the clean samples, the second row shows ...
  29. [29]
    [PDF] Detecting and Preventing Data Poisoning Attacks on AI Models - arXiv
    Experimental results indicate that data poisoning significantly degrades model performance, reducing classification accuracy by up to 27% in image recognition ...
  30. [30]
    [PDF] How To Backdoor Federated Learning
    We show that this makes federated learning vulnerable to a model-poisoning attack that is signifi- cantly more powerful than poisoning attacks that target only ...
  31. [31]
    Exploring Backdoor Attacks against Personalized Federated Learning
    Jan 22, 2025 · Data heterogeneity and backdoor attacks rank among the most significant challenges facing federated learning (FL). For data heterogeneity, ...
  32. [32]
    [PDF] Byzantine-Robust Distributed Learning: Towards Optimal Statistical ...
    In this paper, we develop distributed optimization algorithms that are provably robust against Byzantine failures, i.e., arbitrary and potentially adversarial ...
  33. [33]
    [PDF] Byzantine Fault Tolerance in Distributed Machine Learning - arXiv
    Dec 4, 2022 · Byzantine network failures for distributed convex and non-convex learning tasks. ... attacks, single point of failure, data privacy, and existing ...
  34. [34]
    [PDF] Black-box Adversarial Attacks with Limited Queries and Information
    This paper defines query-limited, partial-information, and label-only settings, where attackers have limited queries, partial information, or only the top ...
  35. [35]
    [PDF] Towards More Practical Threat Models in Artificial Intelligence Security
    In model stealing, the attacker has black-box access to an ML model and copies its functionality without consent of the model's owner [71] and thus harms ...
  36. [36]
    [PDF] From AI Vulnerabilities to AI Security Incident Reporting
    Reports 27 found incidents spanning model stealing (2), poisoning (2), privacy (4), and cybersecurity (10); many were preventable by following security best practices, and 2 involved malicious intent.
  37. [37]
    RobustBench: a standardized adversarial robustness benchmark
    Oct 19, 2020 · Our goal is to establish a standardized benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models.
  38. [38]
    RobustBench: Adversarial robustness benchmark
    A standardized benchmark for adversarial robustness. The goal of RobustBench is to systematically track the real progress in adversarial robustness.
  39. [39]
    Chapter 2 - linear models
    Understanding the linear case provides important insights into the theory and practice of adversarial robustness, and also provides connections to more ...
  40. [40]
    [PDF] Adversarial Robustness of Deep Neural Networks - arXiv
    Furthermore, neural networks themselves are often vulnerable to adversarial attacks. For those reasons, there is a high demand for trustworthy and rigorous ...
  41. [41]
    [PDF] Adversarial Examples are not Bugs, they are Features
    One of the most intriguing properties of adversarial examples is that they transfer across models with different architectures and independently sampled ...
  42. [42]
    [PDF] Why Do Adversarial Attacks Transfer? Explaining ... - USENIX
    Aug 14, 2019 · We give a formal definition of transferability of evasion and poisoning attacks, and an upper bound on a transfer attack's success.
  43. [43]
    [PDF] A Survey on Transferability of Adversarial Examples Across Deep ...
    May 2, 2024 · Adversarial examples are specially crafted inputs that lead machine learning models to make incorrect predictions. These inputs are imperceptible ...
  44. [44]
    Robust Deep Reinforcement Learning with Adversarial Attacks - arXiv
    This paper proposes adversarial attacks for Reinforcement Learning (RL) and then improves the robustness of Deep Reinforcement Learning algorithms (DRL) to ...
  45. [45]
    [PDF] Robust Deep Reinforcement Learning against Adversarial ...
    A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
  46. [46]
    A study of natural robustness of deep reinforcement learning ...
    Analyzing the robustness of DRL algorithms to adversarial attacks is an important prerequisite to enabling the widespread adoption of DRL algorithms. Common ...
  47. [47]
    [PDF] Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
    Attack success rate numbers are shown in Table 4, with each model's MT-Bench scores shown in the brackets. Clearly, almost all safe prefixes lead to better ...
  48. [48]
    Adversarial Examples Are Not Bugs, They Are Features - arXiv
    May 6, 2019 · We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data ...
  49. [49]
  50. [50]
    [PDF] Adversarial Examples Are Not Easily Detected - Nicholas Carlini
    We use the L2 attack algorithm of Carlini and Wagner [8] to generate targeted adversarial examples, as it is superior to other published attacks. At a high ...
  51. [51]
    White-box and Black-box Attacks for Transfer Learning - ADS
    Empirical results show that the adversarial examples are more transferable when fine-tuning is used than they are when the two networks are trained ...
  52. [52]
    [PDF] Enhancing the Transferability of Adversarial Examples with Random ...
    Compared to the state-of-the-art transferable attacks, our attacks improve the black-box attack success rate by 2.9% against normally trained models, 4.7% ...
  53. [53]
  54. [54]
    [PDF] Bridging the Gap Between Adversarial ML Research and Practice
    Our analysis clearly indicates that real adversaries do attempt to evade anti-phishing ML systems that use image classification, and do so with some degree of ...
  55. [55]
  56. [56]
    Theoretically Principled Trade-off between Robustness and Accuracy
    Jan 24, 2019 · We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples.
  57. [57]
    DNN-Oriented JPEG Compression Against Adversarial Examples
    Mar 14, 2018 · We propose a JPEG-based defensive compression framework, namely "feature distillation", to effectively rectify adversarial examples without impacting ...
  58. [58]
    Full article: Adversarial Machine Learning: Bayesian Perspectives
    We demonstrate how the Bayesian approach allows us to explicitly model our uncertainty about the opponent's beliefs and interests, relaxing unrealistic ...
  59. [59]
    Automatic Evasion of Machine Learning-Based Network Intrusion ...
    Final results show that the proposed strategy effectively evades seven typical ML-based IDSs and one SOTA DL-based IDS with an average success rate of over ...
  60. [60]
    Adversarial attack vulnerability of medical image analysis systems
    In this paper, we study previously unexplored factors affecting adversarial attack vulnerability of deep learning MedIA systems in three medical domains.
  61. [61]
    Adversarial AI threatens our financial services. We need a response.
    Jan 22, 2025 · Model theft. AI models are high value intellectual property: crown jewels to protect from theft. However, a technique known as 'model ...
  62. [62]
  63. [63]
    [PDF] Stealing Machine Learning Models via Prediction APIs - USENIX
    Aug 10, 2016 · On Google's platform for example, an extraction attack would cost less than $0. ... model extraction attacks that could subvert model monetization ...
  64. [64]
    How to Keep a Model Stealing Adversary Busy? - CleverHans Lab
    Apr 21, 2022 · A proactive defense using proof-of-work (PoW) puzzles, with difficulty calibrated to query leakage, increases the cost of model extraction by ...
  65. [65]
    Adversarial Machine Learning - Joint Air Power Competence Centre
    Nowadays, the lack of robustness of these systems can no longer be ignored; many of them have proven to be highly vulnerable to intentional adversarial attacks ...
  66. [66]
    [PDF] Policy Teaching in Reinforcement Learning via Environment ...
    We study a security threat to reinforcement learning where an attacker poisons the learning environment to force the agent into executing a target policy ...
  67. [67]
    Vulnerability-Aware Poisoning Mechanism for Online RL with ... - arXiv
    Sep 2, 2020 · We propose a strategic poisoning algorithm called Vulnerability-Aware Adversarial Critic Poison (VA2C-P), which works for most policy-based deep RL agents.
  68. [68]
    Adversarial Machine Learning in Industry: A Systematic Literature ...
    This literature study reviews studies in the area of AML in the context of industry, measuring and analyzing each study's rigor and relevance scores.
  69. [69]
    A Systematic Survey of Model Extraction Attacks and Defenses - arXiv
    Aug 20, 2025 · Zhou et al. (2024) propose an inversion-guided defense that detects potential model stealing attacks by analyzing the invertibility of the ...
  70. [70]
    Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
    The field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate.
  71. [71]
    Feasibility of adversarial attacks against machine learning models
    Dec 11, 2024 · This provides a clearer understanding of how to make models more resilient in real-world situations, where attackers face more limitations.
  72. [72]
    Evaluating the Evaluators: Trust in Adversarial Robustness Tests
    Jul 4, 2025 · We present AttackBench, a benchmark framework developed to assess the effectiveness of gradient-based attacks under standardized and reproducible conditions.
  73. [73]
    [PDF] Evaluating the Evaluators: Trust in Adversarial Robustness Tests
    Jun 24, 2025 · Together, these inconsistencies introduce variance that can severely distort robustness assessments, hinder reproducibility, and create a false ...
  74. [74]
    Leakage and the reproducibility crisis in machine-learning-based ...
    Sep 8, 2023 · We surveyed a variety of research that uses ML and found that data leakage affects at least 294 studies across 17 fields, leading to overoptimistic findings.
  75. [75]
    Adversarial AI: Coming of age or overhyped?
    Sep 1, 2023 · This article explores developments in adversarial artificial intelligence (AAI) and machine learning, examining recent research, practical realities ...