
Data augmentation

Data augmentation is a set of techniques in machine learning that generate high-quality synthetic samples by applying transformations to existing samples, effectively expanding the training dataset to improve model generalization, robustness, and performance without requiring additional real-world data collection. This approach addresses key challenges such as data scarcity, class imbalance, and overfitting, particularly in domains like computer vision, natural language processing, and beyond. The origins of data augmentation trace back to early research on handwritten digit recognition, with techniques like elastic distortions introduced in 2003 by Simard et al. to augment training data for convolutional neural networks on the MNIST dataset. Its widespread adoption occurred with the rise of deep learning, exemplified by the 2012 AlexNet architecture, which employed on-the-fly augmentations such as random cropping, horizontal flipping, and color perturbations to boost top-5 accuracy on the ImageNet dataset from 73.8% to 84.7%. These methods simulate real-world variations, enabling models to learn invariant features like object orientation or lighting conditions. Contemporary data augmentation encompasses a broad range of techniques, including single-instance manipulations (e.g., geometric transformations for images), multi-instance mixing (e.g., mixup), and generative methods (e.g., sample synthesis using GANs). Adaptations for non-visual modalities include synonym replacement in text, structural perturbations in graphs, and noise addition in time series. Advanced approaches leverage generative AI, such as diffusion models, for diverse augmentations. Beyond supervised learning, data augmentation supports semi-supervised, few-shot, and self-supervised learning, with empirical evidence showing accuracy gains on image benchmarks like ImageNet and improvements in language-understanding tasks on GLUE. Ongoing research includes automated strategies like AutoAugment and RandAugment, which optimize augmentation policies via reinforcement learning or simplified random search. Data augmentation remains essential for scalable AI, enhancing reliability amid growing dataset complexity.

Fundamentals

Definition and Purpose

Data augmentation is the process of creating modified versions of existing samples or generating new synthetic data to increase the size and diversity of a dataset in machine learning, while preserving the semantic meaning and labels of the original samples. This technique applies label-preserving transformations to input data, ensuring that augmented samples remain representative of the original distribution and are semantically equivalent to human observers. By artificially expanding limited datasets, data augmentation addresses challenges such as data scarcity, particularly in domains where collecting large amounts of labeled data is costly or impractical. The primary purposes of data augmentation include mitigating overfitting by exposing models to varied representations of the data, handling class imbalance through targeted oversampling of minority classes, enhancing performance on underrepresented data points, and simulating real-world variations to improve robustness. For instance, in scenarios with imbalanced datasets, techniques like synthetic minority oversampling can generate additional examples for rare classes to balance the training distribution, thereby improving model fairness and accuracy. These purposes are especially critical in deep learning, where models trained on augmented data generalize better to unseen test cases, as demonstrated in early applications that reduced error rates by introducing viewpoint variations. Key benefits of data augmentation encompass improved model generalization, reduced reliance on extensive real-world data collection, and enhanced handling of small datasets, leading to more efficient training and higher predictive performance. For example, augmenting images through rotations simulates different viewpoints, allowing models to learn invariant features without additional labeling efforts, which has been shown to decrease top-5 error rates in image classification tasks from 25.2% to 15.3% on the ImageNet benchmark dataset. Overall, this approach lowers computational costs associated with data collection and enables simpler architectures to achieve state-of-the-art results by increasing training data diversity. Mathematically, data augmentation can be conceptualized as applying a transformation function T to an original dataset D = \{(x_i, y_i)\}, yielding an augmented dataset D' = \{T(x_i, y_i) \mid (x_i, y_i) \in D\}, where T preserves the label y_i and maintains the sample's membership in the same semantic space. This formulation ensures that the augmented data contributes to better optimization of the model's loss function without introducing label noise, thereby supporting empirical risk minimization in supervised learning paradigms.
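The formulation above can be made concrete with a minimal sketch, assuming a toy image array and a hypothetical label-preserving transform (here, a horizontal flip); the function names and shapes are illustrative, not drawn from any specific library.

```python
# Minimal sketch of label-preserving augmentation: a hypothetical transform T
# (horizontal flip) is applied to each (x, y) pair, leaving the label untouched.
import numpy as np

def augment_dataset(images, labels, transform):
    """Return the original dataset plus one transformed copy per sample."""
    aug_images = np.stack([transform(x) for x in images])
    # Labels are preserved: T acts only on the inputs, not on y.
    return np.concatenate([images, aug_images]), np.concatenate([labels, labels])

horizontal_flip = lambda x: x[:, ::-1]          # flip the width axis of an HxW image

rng = np.random.default_rng(0)
X = rng.random((8, 28, 28))                     # toy image batch
y = rng.integers(0, 10, size=8)                 # toy labels
X_aug, y_aug = augment_dataset(X, y, horizontal_flip)
print(X_aug.shape, y_aug.shape)                 # (16, 28, 28) (16,)
```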

Historical Development

The roots of data augmentation can be traced to the late 1980s and 1990s in pattern recognition and early neural network research, where limited datasets posed significant challenges for training reliable models. Pioneering work in statistical learning theory by Vladimir Vapnik and Alexey Chervonenkis, particularly their development of the Vapnik-Chervonenkis (VC) dimension, highlighted the risks of overfitting in high-capacity models and emphasized the necessity of large, diverse datasets to achieve good generalization bounds. This theoretical foundation motivated early augmentation strategies to artificially expand training data, addressing data scarcity without collecting new samples. One of the first practical implementations appeared in the LeNet-5 architecture for handwritten digit recognition, where Yann LeCun and colleagues applied random distortions and elastic deformations to images, reducing test error by improving model robustness to variations. In the 2000s, data augmentation gained traction for handling imbalanced datasets, with the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) by Nitesh Chawla et al. in 2002, which generated synthetic examples by interpolating between minority class instances to balance classes and enhance classifier performance. A pivotal moment occurred in 2012 during the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where Alex Krizhevsky's AlexNet employed basic geometric transformations such as random cropping, horizontal flipping, and PCA-based color jittering, effectively expanding the training set by a factor of over 2,000 and contributing to a top-5 error rate of 15.3%—a 10.9 percentage point improvement over the second-place entry's 26.2%. This success popularized augmentation as a standard practice in deep learning, demonstrating its ability to mitigate overfitting and enable training of larger networks on limited hardware. The 2010s marked a surge in augmentation's evolution alongside deep learning's rise, with generative models unlocking synthetic data creation. Ian Goodfellow et al.'s 2014 introduction of Generative Adversarial Networks (GANs) revolutionized the field by enabling the generation of realistic synthetic images through adversarial training, which improved model accuracy in data-scarce domains such as medical imaging by up to 10-20% in subsequent applications. Building on this, Ekin D. Cubuk et al.'s AutoAugment in 2018 automated the search for optimal augmentation policies using reinforcement learning, yielding consistent gains of 1-3% on benchmarks like CIFAR-10 and ImageNet without manual tuning. A comprehensive survey by Connor Shorten and Taghi M. Khoshgoftaar in 2019 further synthesized these advances, categorizing techniques and underscoring their role in enhancing generalization across vision tasks. By the early 2020s, data augmentation integrated with emerging paradigms for greater scalability and privacy. Diffusion models, exemplified by adaptations of Stable Diffusion released in 2022, facilitated high-fidelity image synthesis conditioned on text prompts, boosting downstream task performance in low-data regimes by generating diverse, semantically consistent augmentations. Concurrently, privacy-preserving variants emerged in federated learning settings, where techniques like XOR Mixup enabled secure data mixing across distributed clients without sharing raw data, improving model utility while complying with regulations like GDPR. Recent surveys up to 2024, such as those by Zaitian Wang et al., highlight ongoing refinements, including multimodal augmentations via large language models. As of 2025, surveys such as the multi-perspective review by Li et al. continue to emphasize applications in diverse domains, solidifying augmentation's foundational role in modern AI.

Techniques in Traditional Machine Learning

Oversampling Strategies

Oversampling strategies in traditional machine learning involve generating synthetic samples for the minority class to address class imbalance in classification tasks, thereby improving model performance on underrepresented classes without discarding majority class data. These methods are particularly useful for tabular datasets where class distributions are skewed, such as in fraud detection or medical diagnosis, by creating new instances that enhance the minority class representation. The seminal Synthetic Minority Over-sampling Technique (SMOTE), introduced by Chawla et al. in 2002, generates synthetic minority class samples by interpolating between a minority instance and its k-nearest neighbors. For a minority sample \mathbf{x} and its nearest neighbor \mathbf{x}_{nn}, a synthetic sample \mathbf{x}_{syn} is created as: \mathbf{x}_{syn} = \mathbf{x} + \lambda \cdot (\mathbf{x}_{nn} - \mathbf{x}), where \lambda \in [0, 1] is a random value, ensuring the new sample lies on the line segment connecting \mathbf{x} and \mathbf{x}_{nn}. This approach avoids simple duplication, which can lead to overfitting, and has been shown to improve classification accuracy on imbalanced datasets. Variants of SMOTE address limitations in the original algorithm by focusing on specific aspects of the data distribution. Borderline-SMOTE, proposed by Han et al. in 2005, prioritizes generating synthetic samples near the decision boundary between classes, identifying borderline minority instances through their proximity to majority class neighbors. This variant enhances focus on informative regions, reducing noise from safe minority samples far from the boundary. ADASYN, developed by He et al. in 2008, adaptively synthesizes more samples for minority instances that are harder to learn, based on the density of majority class neighbors; it assigns higher synthesis weights to regions with greater learning difficulty. These adaptations make the methods more robust to varying degrees of imbalance. In applications such as credit scoring with tabular financial data, techniques like SMOTE balance datasets where defaults (minority class) are rare, leading to better model generalization. Evaluation often employs metrics like the G-mean, the geometric mean of sensitivity and specificity, which balances performance across classes and highlights improvements from oversampling in imbalanced scenarios. While these strategies increase minority class diversity and mitigate bias toward the majority class, they can introduce artificial patterns that risk overfitting, particularly in high-dimensional spaces, and are most effective when combined with undersampling techniques.
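The interpolation rule above lends itself to a short sketch. The following is an illustrative SMOTE-style routine rather than the reference implementation; the neighbor count, seed, and toy data are assumptions for the example.

```python
# Minimal SMOTE-style interpolation sketch: each synthetic point lies on the
# segment between a minority sample and one of its k nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_synthetic, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                       # random minority sample
        j = rng.choice(idx[i][1:])                         # one of its k nearest neighbors
        lam = rng.random()                                 # lambda drawn from [0, 1]
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.random.default_rng(1).normal(size=(20, 4))  # toy minority-class features
print(smote_like(X_minority, n_synthetic=10).shape)          # (10, 4)
```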

Feature Engineering Augmentation

Feature engineering augmentation involves the creation or modification of input features to enhance the representational power of a dataset in traditional machine learning contexts, thereby improving model robustness and generalization without generating entirely new samples. This approach leverages domain knowledge to derive polynomial features, interaction terms, or perturbations that capture underlying patterns more effectively, distinguishing it from sample-level techniques by focusing on the feature space. Such methods are particularly useful in scenarios with limited data variety, where enriching features can simulate additional diversity akin to regularization effects. Key techniques include principal component analysis (PCA)-based jittering, which perturbs features along principal directions to introduce controlled variability while preserving data structure. In this method, data is projected onto eigenvectors derived from the covariance matrix, noise is added to the coefficients, and an inverse projection reconstructs augmented vectors for training. Kernel methods enable non-linear feature expansions by mapping data into higher-dimensional spaces via kernel functions, such as the radial basis function (RBF) kernel, allowing linear models to capture complex interactions implicitly. These expansions augment the feature set by embedding transformations that enhance separability, often integrated into support vector machines or kernel ridge regression. Representative examples illustrate practical application: in regression tasks, Gaussian noise is added to continuous features to mitigate overfitting, formulated as x' = x + \epsilon, where \epsilon \sim \mathcal{N}(0, \sigma^2), effectively acting as Tikhonov regularization with parameter \lambda = n \sigma^2 (n being the sample size). For time-series forecasting, lag features derived from prior observations, such as one-step or multi-step lags, enrich the input by incorporating temporal dependencies, enabling models like gradient-boosted regressors to predict future values more accurately. Evaluation of these augmentations typically employs k-fold cross-validation to measure improvements in performance metrics, such as area under the receiver operating characteristic curve (AUC-ROC) for classification or root mean squared error for regression, ensuring the added features reduce variance without excessive bias. For instance, PCA-based jittering has demonstrated accuracy gains of up to 7% on benchmark datasets when augmenting with 10-20% distilled vectors. Historically, feature augmentation gained prominence in the early 2000s through ensemble methods like Random Forests, where random subset selection of features during tree construction introduces diversity equivalent to perturbation-based augmentation, enhancing out-of-bag estimates and overall stability.
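A compact sketch of the PCA-based jittering procedure described above is shown below; the noise scale, seed, and toy data are illustrative assumptions, and scikit-learn's PCA stands in for any eigendecomposition of the covariance matrix.

```python
# Sketch of PCA-based feature jittering: project the data onto its principal
# directions, perturb the coefficients with Gaussian noise, and reconstruct
# augmented feature vectors by inverse projection.
import numpy as np
from sklearn.decomposition import PCA

def pca_jitter(X, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pca = PCA(n_components=min(X.shape)).fit(X)
    coeffs = pca.transform(X)                                    # project onto eigenvectors
    coeffs += rng.normal(scale=noise_scale, size=coeffs.shape)   # perturb the coefficients
    return pca.inverse_transform(coeffs)                         # reconstruct augmented vectors

X = np.random.default_rng(2).normal(size=(100, 8))  # toy tabular features
X_aug = pca_jitter(X)
print(X_aug.shape)  # (100, 8)
```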

Methods in Computer Vision

Geometric Transformations

Geometric transformations constitute a fundamental category of data augmentation techniques in computer vision, involving spatial manipulations of images to simulate variations in viewpoint, orientation, and position encountered in real-world scenarios. These methods apply rigid or affine changes to the pixel coordinates without altering the underlying photometric properties, thereby preserving the semantic content of objects while expanding the diversity of the training set. Common core techniques include rotation, which pivots the image around its center by an angle θ (typically ranging from 1° to 45° to avoid label ambiguity in tasks like digit recognition); scaling, which resizes the image by a factor s (often between 0.8 and 1.2) using interpolation to maintain quality; translation, which shifts the image along the x and y axes (e.g., by -4 to +4 pixels) with padding to preserve dimensions; and flipping, either horizontally or vertically, to mirror the image and double the effective dataset size for symmetric objects. More advanced geometric operations encompass shearing, which slants the image along the x or y axis (e.g., by -20° to +20°) to mimic distortions from camera tilt, and perspective transforms, which simulate viewpoint changes using projective mappings. These are often implemented via affine matrices for linear operations, where a 2x3 matrix defines the warp; for instance, the rotation-plus-translation matrix is given by \begin{bmatrix} \cos \theta & -\sin \theta & t_x \\ \sin \theta & \cos \theta & t_y \end{bmatrix}, with t_x and t_y as translation components, applied through functions like warpAffine in image-processing libraries. Shearing extends this by adding off-diagonal elements to the matrix, while perspective warping requires a 3x3 homography matrix that does not preserve parallel lines. Such transforms maintain object integrity better than photometric alterations, making them suitable for label-preserving tasks. In applications like object detection and semantic segmentation, geometric augmentations enhance model invariance to pose variations; for example, they improve robustness in detection and segmentation frameworks by simulating real-world occlusions and viewing angles without changing object identities. Implementation typically involves random application during training epochs using libraries like OpenCV, which provides core functions for rotation, scaling, translation, flipping, shearing, and perspective warps via warpAffine and warpPerspective, or Albumentations, a specialized augmentation toolkit supporting efficient pipelines for these transforms in classification, detection, and segmentation workflows. Empirical studies demonstrate significant performance gains from these techniques in convolutional neural networks (CNNs), with broader analyses reporting 5-10% relative error reductions across vision tasks, underscoring their role in mitigating overfitting and enhancing generalization.
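The affine formulation above can be sketched with OpenCV as follows; the angle, scale, and shift ranges are illustrative choices, and the random-affine helper is a hypothetical convenience function rather than a library API.

```python
# Sketch of a random affine augmentation with OpenCV: rotation about the image
# center, scaling, and a small translation combined in one 2x3 matrix.
import cv2
import numpy as np

def random_affine(img, max_angle=15, scale_range=(0.8, 1.2), max_shift=4):
    h, w = img.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    scale = np.random.uniform(*scale_range)
    tx, ty = np.random.uniform(-max_shift, max_shift, size=2)
    # 2x3 affine matrix for rotation about the center with scaling; the extra
    # translation is added to the last column (the t_x, t_y terms above).
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)
    return cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REFLECT)

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)  # toy image
print(random_affine(img).shape)  # (64, 64, 3)
```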

Color and Texture Modifications

Color and texture modifications encompass photometric augmentations that alter the visual appearance of images to simulate variations in lighting, material properties, and surface characteristics, thereby enhancing model robustness without changing semantic content. These techniques primarily operate on pixel intensities and color channels, preserving the overall geometry of objects while introducing diversity in illumination and texture. Common methods include adjustments to brightness and contrast through linear scaling of pixel values, where brightness is modified by multiplying all intensities by a scalar factor greater than 0 (e.g., values >1 brighten the image, <1 darken it), and contrast is enhanced by scaling the difference between maximum and minimum intensities, often via histogram equalization to redistribute values. Such linear adjustments help models generalize to real-world lighting inconsistencies, as demonstrated in benchmarks where they contribute to improved classification accuracy on datasets like ImageNet. Further refinements involve hue and saturation shifts, typically performed in the HSV color space for intuitive manipulation of color properties: hue rotates the color wheel to alter dominant shades, while saturation scales the purity of colors (factors >1 intensify, 0-1 desaturate toward grayscale). Gamma correction provides a nonlinear mapping, defined as I' = I^{\gamma}, where I is the input intensity normalized to [0,1], and \gamma (often sampled from 0.5 to 2.0) adjusts the mid-tone brightness—values <1 brighten shadows, >1 darken them—to mimic non-uniform lighting effects like those in medical scans. These operations are particularly effective in image classification tasks, as they introduce controlled photometric noise that boosts performance without risking label corruption in supervised scenarios. Texture modifications extend these principles by overlaying patterns or applying local warps to simulate material variations, such as fabric weaves or skin textures. One approach overlays synthetic patterns (e.g., noise textures or artistic styles) onto the image using blending functions like alpha blending, which preserves underlying semantics while perturbing surface details; this is akin to style transfer, where texture from a reference image is transferred to the target, improving robustness in downstream recognition tasks. Elastic deformations achieve similar effects through local warps, such as thin-plate splines (TPS), which model smooth, non-rigid transformations by interpolating control points on a grid—minimizing bending energy to create realistic distortions like tissue stretching in volumetric data. In medical imaging, TPS-based augmentation has been shown to enhance segmentation accuracy by simulating anatomical deformations. Color space transformations facilitate more perceptually grounded modifications, such as converting from RGB to CIELAB space, where L encodes lightness, and A/B represent opponent colors (green-red, blue-yellow) in a perceptually uniform space—ensuring equal numerical changes correspond to equal visual differences. This uniformity aids in balanced adjustments, like shifting the chromatic channels to simulate illumination variations in images. Random channel swaps, another simple transform, permute the RGB channel order (e.g., RGB to BGR) or isolate channels by zeroing others, which tests model invariance to color encoding and has been used to mitigate biases in lighting-heavy datasets, though it can slightly reduce accuracy if overapplied (e.g., ~3% drop on some subsets).
In applications like medical imaging, these modifications are crucial for robustness to illumination variations, such as inconsistent lighting in retinal fundus or dermoscopy scans, where color shifts and gamma adjustments diversify training data to improve detection of pathologies such as diabetic retinopathy—achieving 88% accuracy for proliferative diabetic retinopathy detection compared to an 82% baseline. Recent works integrate diffusion models for generating realistic color and texture variations, enhancing performance in diverse scenarios. Unlike spatial alterations, these techniques avoid altering object labels, making them ideal for supervised classification where semantic integrity is paramount. A comparative study on photometric augmentations, including color jittering, reported consistent gains in Top-1 accuracy (e.g., 1.44% from hue-saturation adjustments) across datasets like Caltech-101, underscoring their role in reducing overfitting. For texture-specific enhancements, edge-based methods yielded modest but reliable improvements in texture classification tasks.
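A small sketch of the photometric operations described above is given below, assuming an RGB uint8 image; the brightness, contrast, gamma, and saturation values are arbitrary illustrative settings rather than recommended defaults.

```python
# Sketch of photometric augmentation: linear brightness/contrast scaling, gamma
# correction on a [0, 1] image, and a saturation scaling performed in HSV space.
import cv2
import numpy as np

def photometric_augment(img, brightness=1.2, contrast=1.1, gamma=0.8, sat_scale=1.3):
    x = img.astype(np.float32) / 255.0
    x = np.clip(x * brightness, 0, 1)                        # brightness: scale all intensities
    x = np.clip((x - x.mean()) * contrast + x.mean(), 0, 1)  # contrast: scale around the mean
    x = x ** gamma                                           # gamma correction I' = I^gamma
    hsv = cv2.cvtColor((x * 255).astype(np.uint8), cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * sat_scale, 0, 255)   # saturation scaling
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)  # toy RGB image
print(photometric_augment(img).shape)  # (64, 64, 3)
```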

Noise Injection Techniques

Noise injection techniques involve introducing controlled perturbations to input images during training to simulate real-world distortions, thereby improving the robustness of computer vision models to variations such as sensor noise or adversarial manipulations. These methods enhance generalization by exposing models to noisy variants, reducing overfitting and increasing resilience against unseen perturbations. Common approaches include adding random noise distributions or applying blurring operations, which mimic environmental interferences without altering the core semantic content of the images. One prevalent type is Gaussian noise addition, where each pixel value x is modified as x' = x + \mathcal{N}(0, \sigma^2), with \sigma controlling the noise variance to balance augmentation strength and image fidelity. This technique simulates additive sensor noise, commonly encountered in imaging devices, and has been shown to improve accuracy on datasets like CIFAR-10. Salt-and-pepper noise, another impulse-based method, randomly sets a fraction of pixels to maximum (salt) or minimum (pepper) intensity values, typically affecting 1-5% of pixels to emulate transmission errors or dead sensors. This form of noise injection fosters invariance to sparse corruptions, enhancing model performance in noisy environments. Gaussian blurring, achieved via convolution with a Gaussian kernel of size k \times k and standard deviation \sigma, softens image details to replicate out-of-focus captures or motion effects, often applied with \sigma ranging from 0.5 to 2.0 for effective regularization. Adversarial perturbations represent a targeted injection to counter deliberate attacks, with the Fast Gradient Sign Method (FGSM) being a foundational approach. In FGSM, the perturbed input is generated as x' = x + \epsilon \cdot \operatorname{sign}(\nabla_x J(\theta, x, y)), where \nabla_x J is the gradient of the loss with respect to the input, and \epsilon is a small scalar (e.g., 0.007-0.031 for \ell_\infty-norm bounded perturbations) ensuring imperceptibility while maximizing misclassification risk. Introduced in adversarial training frameworks, this method augments datasets with such examples, significantly boosting robust accuracy—for instance, reducing adversarial error on MNIST from 89.4% to 17.9% under ε=0.25 attacks. Cutout and Mixup extend noise injection through masking and interpolation, promoting resilience via partial occlusions or blended samples. Cutout randomly masks square regions (e.g., 16x16 pixels) of the input image to black, simulating occlusions and improving localization robustness, as demonstrated by error reductions of ~0.5% on CIFAR-100 for ResNet-18 with standard augmentations. Mixup creates hybrid examples by linearly interpolating pairs of images and labels: x' = \lambda x_i + (1 - \lambda) x_j, \tilde{y} = \lambda y_i + (1 - \lambda) y_j, where \lambda \sim \mathrm{Beta}(\alpha, \alpha) with \alpha = 0.2-1.0, effectively injecting soft noise that smooths decision boundaries and yields ~0.2-0.5% top-1 error reductions on ImageNet. These techniques differ from color modifications by focusing on structural distortions rather than photometric changes. In applications like autonomous driving, noise injection defends against adversarial attacks on perception systems, such as object detection in adverse weather, by augmenting datasets with perturbations that mimic sensor degradations or malicious inputs. For example, adversarial data augmentation has improved detection robustness in LiDAR-camera models, increasing mean average precision by 5-10% under simulated conditions.
Noise levels are controlled via hyperparameter tuning, such as adjusting \sigma in Gaussian noise through grid search or validation on held-out perturbed sets, ensuring perturbations remain realistic without degrading clean performance. Evaluation often relies on robust accuracy metrics, measuring the proportion of correctly classified samples under \epsilon-bounded perturbations (e.g., \ell_\infty-norm \epsilon = 8/255), which quantifies defense efficacy—for instance, achieving 50-60% robust accuracy on CIFAR-10 against PGD attacks after FGSM training.
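Two of the formulas above translate directly into code. The sketch below, with illustrative parameter values, shows additive Gaussian noise and mixup interpolation of image/label pairs; it does not reproduce any particular library's API.

```python
# Sketch of Gaussian noise injection and mixup interpolation for image batches.
import numpy as np

def gaussian_noise(x, sigma=0.05, rng=np.random.default_rng(0)):
    # x' = x + N(0, sigma^2), clipped back to the valid [0, 1] intensity range.
    return np.clip(x + rng.normal(scale=sigma, size=x.shape), 0.0, 1.0)

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    lam = rng.beta(alpha, alpha)                 # lambda ~ Beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2                # blend the inputs
    y = lam * y1 + (1 - lam) * y2                # blend one-hot labels the same way
    return x, y

x1, x2 = np.random.default_rng(1).random((2, 32, 32, 3))  # two toy images
y1, y2 = np.eye(10)[3], np.eye(10)[7]                     # one-hot labels
x_mix, y_mix = mixup(x1, y1, x2, y2)
print(gaussian_noise(x1).shape, x_mix.shape, y_mix.sum())  # (32, 32, 3) (32, 32, 3) 1.0
```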

Approaches in Natural Language Processing

Lexical Substitution and Paraphrasing

Lexical substitution involves replacing words in a sentence with synonyms or contextually similar terms to generate varied text while aiming to preserve the original meaning. This technique commonly utilizes lexical resources like WordNet, a large lexical database of English nouns, verbs, adjectives, and adverbs grouped into synsets, to identify and substitute synonyms. For instance, in the sentence "The cat sat on the mat," "sat" could be replaced with "rested" based on synsets. More advanced approaches leverage contextual embeddings from models like BERT, where candidate substitutes are selected by computing cosine similarity between word embeddings, often using a threshold such as 0.7 to ensure semantic closeness. These methods enhance model robustness by introducing lexical diversity without altering core semantics. Paraphrasing extends lexical substitution to sentence-level modifications, producing semantically equivalent rephrasings to expand datasets. Rule-based techniques include syntactic alterations, such as converting active voice to passive, as in transforming "The chef cooked the meal" to "The meal was cooked by the chef," which maintains meaning through predefined grammatical rules. Neural approaches, such as the PEGASUS model, enable controlled generation of paraphrases by pre-training on gap-sentence extraction tasks, allowing fine-tuned variants to produce diverse yet faithful rephrasings. Back-translation serves as another effective paraphrasing method, where text is translated to a pivot language (e.g., English to French) and then back to the original language, introducing natural variations like "The quick brown fox jumps over the lazy dog" becoming "The swift brown fox leaps over the indolent dog." This technique leverages monolingual data via neural machine translation models to augment parallel corpora. These methods find applications in text classification, where lexical substitutions and paraphrases increase training sample diversity, improving classifier accuracy on datasets like product reviews by up to 1.5% in F1 score. In low-resource languages, such as certain African dialects, they address data scarcity by generating synthetic examples, enhancing sentiment model performance through transfer from high-resource languages. Semantic preservation is evaluated using metrics like the BLEU score, which measures n-gram overlap between original and augmented texts, ensuring high scores (e.g., above 0.8) indicate minimal deviation. Challenges in lexical substitution and paraphrasing include avoiding semantic drift, where substitutions inadvertently alter intended meaning, such as replacing "bank" (financial) with "riverbank" in a monetary context. The Easy Data Augmentation (EDA) framework addresses this by incorporating four operations—synonym replacement (using WordNet), random deletion, insertion, and swap—applied with controlled probabilities to boost text classification performance while mitigating drift through simplicity and empirical validation on tasks like sentiment classification.
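The synonym-replacement operation can be sketched with NLTK's WordNet interface, as below; it assumes the WordNet corpus has been downloaded (nltk.download("wordnet")), and the replacement probability and example sentence are illustrative choices in the spirit of EDA rather than its published implementation.

```python
# Sketch of WordNet-based synonym replacement for text augmentation.
import random
from nltk.corpus import wordnet

def synonym_replace(sentence, p=0.2, seed=0):
    random.seed(seed)
    out = []
    for w in sentence.split():
        synsets = wordnet.synsets(w)
        # Collect all lemma names across synsets, excluding the word itself.
        lemmas = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()} - {w}
        if lemmas and random.random() < p:
            out.append(random.choice(sorted(lemmas)))   # substitute a synonym
        else:
            out.append(w)
    return " ".join(out)

print(synonym_replace("The cat sat on the mat"))
```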

Syntactic and Semantic Augmentations

Syntactic augmentations in natural language processing involve modifying the grammatical structure of text while preserving its core meaning to enhance model robustness against syntactic variations. One prominent approach uses dependency tree morphing, where operations such as cropping (removing subtrees) and rotating (repositioning fragments) are applied to the parsed dependency tree of a sentence to generate diverse syntactic forms. For instance, swapping subjects and objects in the tree—while maintaining grammaticality through constraints like preserving head dependencies—can produce valid rephrasings that expose models to reordered structures without altering semantics. Round-trip parsing, which involves parsing text into a syntactic tree and regenerating it through rule-based or model-driven reconstruction, further introduces subtle structural variations, such as alternative phrase orderings. These methods are particularly useful in low-resource settings, where they have been shown to improve accuracy by up to 22 percentage points on benchmarks like the Universal Dependencies dataset. Semantic augmentations focus on alterations that maintain or subtly shift meaning to capture contextual nuances, aiding tasks sensitive to inference and implication. Embedding-based perturbations add controlled noise to sentence or word embeddings, followed by decoding to new text that retains semantic proximity; for example, Gaussian noise applied to BERT embeddings can yield paraphrases with cosine similarity above 0.9 to the original. Counterfactual generation creates "what-if" scenarios by minimal edits, such as inserting "not" to negate verbs or adjectives, flipping label implications while keeping surface structure intact—this has been applied to question answering datasets to boost out-of-distribution performance by up to 7 percentage points in exact match score. These techniques emphasize meaning preservation, distinguishing them from surface-level changes by targeting deeper representational shifts. Key techniques include entailment-based augmentation, where premise-hypothesis pairs from datasets like SNLI are leveraged to generate hypotheses that logically follow from premises, expanding training data for inference tasks without introducing contradictions. Conditional generation with models like T5 enables targeted augmentations by prompting the model to produce text conditioned on specific attributes, such as rephrasing while enforcing entailment relations, achieving gains of 1-4% in downstream classification accuracy. In applications to question answering and natural language understanding, these augmentations improve generalization; for example, counterfactual data has enhanced QA models' handling of negation, with reported improvements of 1-2 percentage points in exact match on challenge sets. Evaluation typically measures impact via downstream task metrics, such as exact match in question answering or entailment accuracy in NLU, ensuring augmented data contributes to robust performance without degrading fidelity. Advanced developments, such as SimCSE's approach to semantic augmentation through contrastive learning, treat dropout-induced variants of the same sentence as positive pairs to train embeddings that capture invariance to perturbations, outperforming prior methods by 2-5% on semantic textual similarity tasks like STS-B.
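A minimal sketch of the embedding-space perturbation idea is shown below; the random vector is a stand-in for an actual encoder output (e.g., a 768-dimensional BERT-style sentence embedding), and the noise scale and similarity threshold are assumed values.

```python
# Sketch of semantic perturbation in embedding space: add small Gaussian noise
# and keep only perturbations that stay within a cosine-similarity budget.
import numpy as np

def perturb_embedding(emb, sigma=0.01, min_cos=0.9, rng=np.random.default_rng(0)):
    noisy = emb + rng.normal(scale=sigma, size=emb.shape)
    cos = noisy @ emb / (np.linalg.norm(noisy) * np.linalg.norm(emb))
    return noisy if cos >= min_cos else emb     # reject perturbations that drift too far

emb = np.random.default_rng(3).normal(size=768)  # stand-in for a sentence embedding
print(perturb_embedding(emb).shape)              # (768,)
```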

Augmentation for Time-Series and Signals

Temporal Manipulations

Temporal manipulations in data augmentation involve altering the time axis of sequential data, such as audio signals or sensor readings, to introduce variations that mimic real-world dynamic changes while preserving the inherent order and dependencies in the data. These techniques are particularly useful for time-series data where temporal structure is critical, contrasting with methods like random permutation that disrupt sequential relationships. Key techniques include time warping, which stretches or compresses segments of the series using dynamic time warping (DTW) to align and distort temporal alignments, thereby simulating variations in event timing. Window slicing and sliding extract or shift sub-sequences from the original signal to generate new samples, effectively creating diverse temporal excerpts without altering the underlying pattern. Magnitude warping applies smooth amplitude scaling over time via spline interpolation, modulating signal intensity across temporal regions to reflect natural fluctuations. For audio data, speed and pitch shifting are prominent temporal alterations; speed adjustment resamples the signal at a modified rate r (e.g., r = 0.9 to slow it down), while pitch shifting scales the frequency as f' = f \times r, preserving perceptual qualities in speech recognition tasks. In electrocardiogram (ECG) signals, adding time shifts delays signal onsets to emulate timing variability, and reversals flip the sequence to model atypical rhythms, enhancing model robustness to phase differences. These methods find applications in speech recognition, where speed and pitch perturbations improve acoustic model generalization, and in anomaly detection for sensor data, where they help identify irregular patterns by simulating temporal anomalies. Unlike non-sequential augmentations, temporal manipulations maintain dependencies, leading to better performance in sequential models. Evidence from wearable sensor data augmentation demonstrates that combining time warping with other temporal methods reduces classification error by approximately 9% in Parkinson's disease monitoring tasks.
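Two of these manipulations are easy to sketch; the slice ratio, knot count, and noise scale below are illustrative assumptions, with a smooth cubic-spline envelope standing in for the magnitude-warping curve.

```python
# Sketch of window slicing (extract a random sub-sequence and resample it to the
# original length) and magnitude warping (multiply by a smooth random envelope).
import numpy as np
from scipy.interpolate import CubicSpline

def window_slice(x, ratio=0.8, rng=np.random.default_rng(0)):
    n = len(x)
    w = int(n * ratio)
    start = rng.integers(0, n - w + 1)
    sliced = x[start:start + w]
    # Resample the slice back to length n so downstream models see a fixed size.
    return np.interp(np.linspace(0, w - 1, n), np.arange(w), sliced)

def magnitude_warp(x, sigma=0.2, knots=4, rng=np.random.default_rng(0)):
    n = len(x)
    anchor_t = np.linspace(0, n - 1, knots + 2)
    anchor_v = rng.normal(loc=1.0, scale=sigma, size=knots + 2)
    envelope = CubicSpline(anchor_t, anchor_v)(np.arange(n))  # smooth scaling curve
    return x * envelope

signal = np.sin(np.linspace(0, 6 * np.pi, 200))               # toy periodic signal
print(window_slice(signal).shape, magnitude_warp(signal).shape)  # (200,) (200,)
```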

Domain-Specific Signal Enhancements

In domain-specific signal enhancements, data augmentation techniques are tailored to the unique characteristics of biological and mechanical signals, such as their physiological constraints and environmental sensitivities, to improve model robustness in specialized applications. For biological signals like electroencephalography (EEG) and electrocardiography (ECG), augmentation often incorporates realistic artifacts to mimic real-world recording conditions, while for mechanical signals like vibrations from rotating machinery, methods focus on frequency-domain manipulations to simulate faults and operational variations. These approaches build on general temporal methods by emphasizing domain constraints, such as physiological plausibility or mechanical physics, to generate diverse yet credible samples. Biological signal augmentation commonly involves adding physiological noise to EEG and ECG data to enhance model generalization against real-world variability. For instance, motion artifacts are simulated by overlaying sinusoidal waves with low frequencies (e.g., 0.05–0.5 Hz) to replicate baseline wandering caused by patient movement, which helps in denoising and classification tasks. Oversampling heartbeats in ECG datasets addresses class imbalance for rare arrhythmias by generating synthetic cycles using generative models like variational autoencoders (VAEs) or generative adversarial networks (GANs), achieving up to 37% improvement in arrhythmia detection accuracy on the MIT-BIH dataset. Similarly, for EEG, controlled noise addition via surrogate methods preserves signal statistics while introducing variability. Mechanical signal augmentation targets vibration data from components like bearings, employing time-frequency representations and synthetic fault generation to create diverse fault scenarios under limited real data. Techniques such as short-time Fourier transform (STFT)-based augmentation apply FFT shifts to alter frequency components, simulating speed variations or load changes in rotating equipment. Fault simulation in bearings involves generating 2D time-frequency images from raw vibration signals and augmenting them to represent inner/outer race defects, improving precision in predictive models. Advanced techniques like signal mixing and domain adaptation further refine these enhancements. Mixing two ECG traces via alpha-blending (e.g., with α=0.7) combines scalogram and binary representations to extract robust features, yielding 99.62% accuracy in arrhythmia classification using DenseNet on PhysioNet data. Domain adaptation through style transfer aligns distributions across subjects or devices; for EEG-based emotion recognition, sparse representation classifiers transfer features while preserving physiological styles, reducing cross-domain error by 15–20%. In mechanical contexts, fault frequency band segmentation adapts vibrations from lab to industrial settings, simulating domain shifts for bearing prognostics. These methods find applications in wearables for health monitoring, where augmented ECG/EEG data enables real-time arrhythmia or seizure detection, and in predictive maintenance for mechanical systems, using vibration augmentation to forecast bearing failures and minimize downtime. For example, augmentation of inertial measurement unit (IMU) data from wearables has improved accuracy in health monitoring by 5–10% through physics-based simulations of motion variations. Evaluation often relies on signal-to-noise ratio (SNR) metrics post-augmentation, where targeted noise addition maintains SNR above 15–20 dB to ensure augmented signals retain diagnostic fidelity without excessive distortion.
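The baseline-wander simulation described above is a one-line overlay in practice; the sketch below assumes an illustrative 250 Hz sampling rate and amplitude, with a random toy trace standing in for a real ECG recording.

```python
# Sketch of physiological-noise injection: overlay a low-frequency sinusoid
# (0.05-0.5 Hz) on a 1-D biosignal to mimic baseline wander from patient movement.
import numpy as np

def add_baseline_wander(signal, fs=250.0, amplitude=0.1, rng=np.random.default_rng(0)):
    t = np.arange(len(signal)) / fs
    freq = rng.uniform(0.05, 0.5)                 # wander frequency in Hz
    phase = rng.uniform(0, 2 * np.pi)
    return signal + amplitude * np.sin(2 * np.pi * freq * t + phase)

ecg = np.random.default_rng(4).normal(size=2500)  # stand-in for 10 s of ECG at 250 Hz
print(add_baseline_wander(ecg).shape)             # (2500,)
```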

Advanced and Generative Techniques

Model-Based Generation

Model-based generation in data augmentation leverages deep generative models to synthesize novel data samples that mimic the underlying distribution of the original dataset, thereby expanding training corpora without relying on manual transformations. These approaches, particularly generative adversarial networks (GANs) and variational autoencoders (VAEs), enable the creation of realistic synthetic instances across modalities such as images, text, and signals, enhancing model robustness in scenarios with limited data. GANs operate through an adversarial training process involving a generator G that produces synthetic samples from random noise z, and a discriminator D that distinguishes real data x from generated samples G(z). The training objective is formulated as a minimax game: \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] This setup, introduced in 2014, encourages the generator to produce data indistinguishable from real samples, fostering high-fidelity augmentation. In contrast, VAEs employ an encoder-decoder architecture where the encoder maps input data to a latent distribution, and the decoder reconstructs samples from latent variables. The loss function combines reconstruction error with a Kullback-Leibler (KL) divergence term to regularize the latent space, ensuring it approximates a prior distribution like a standard Gaussian. Variants such as β-VAE scale the KL term by a hyperparameter β > 1 to promote disentangled representations, facilitating controlled generation for augmentation tasks. Applications of these models span multiple domains. In image synthesis, deep convolutional GANs (DCGANs) integrate convolutional layers to generate augmented images, achieving stable training and improved visual quality on datasets like CIFAR-10. For text, frameworks like TextGAN adapt adversarial training to sequence generation using LSTM-based generators and discriminators, producing diverse paraphrases or sentences to augment datasets. In signal processing, GANs and VAEs synthesize time-series data, such as audio waveforms, to bolster training in resource-constrained environments. Conditional variants, like conditional GANs (cGANs), incorporate labels or attributes into the input to generate class-specific augmentations, enabling targeted data expansion for underrepresented classes. Advancements have refined these models for superior augmentation. StyleGAN, introduced in 2018, employs a style-based generator that injects adaptive instance normalization at multiple scales, yielding high-fidelity images with fine-grained control over attributes like facial features, ideal for augmenting visual datasets. As an alternative to GANs, denoising diffusion probabilistic models (DDPMs), proposed in 2020, iteratively denoise Gaussian noise to generate samples, offering stable training and state-of-the-art realism in image augmentation without adversarial components. Subsequent developments, such as latent diffusion models introduced in 2022, further enhance efficiency by operating in a compressed latent space, enabling scalable generation of diverse synthetic data for augmentation tasks as of 2025. Despite their efficacy, model-based methods face challenges, including mode collapse in GANs, where the generator produces limited varieties of samples, failing to capture the full data diversity. This risk is mitigated through techniques like feature matching but underscores the need for careful hyperparameter tuning. Evaluation often relies on the Fréchet Inception Distance (FID) score, which measures distributional similarity between real and generated samples using Inception network features, with lower values indicating more realistic augmentations.
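A minimal PyTorch sketch of the adversarial objective above is given below for vector-valued data; the tiny architectures, batch size, and learning rates are arbitrary illustrative choices, and the generator loss uses the common non-saturating surrogate rather than the literal minimax form.

```python
# Sketch of GAN training on toy vector data: D learns to separate real from fake,
# G learns to fool D; trained samples G(z) can then serve as synthetic augmentations.
import torch
import torch.nn as nn

dim_z, dim_x = 16, 32
G = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_x))
D = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(128, dim_x)                    # stand-in for real training samples
for step in range(3):                             # a few illustrative steps
    z = torch.randn(128, dim_z)
    fake = G(z)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_loss = bce(D(real), torch.ones(128, 1)) + bce(D(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (non-saturating surrogate): push D(G(z)) toward 1.
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(5, dim_z)).shape)             # synthetic samples for augmentation
```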

Policy Search and Optimization Methods

Policy search and optimization methods in data augmentation involve automated techniques that leverage search algorithms or reinforcement learning to discover effective augmentation policies, which are typically defined as sequences of operations applied with specific probabilities or magnitudes to input data. These methods aim to maximize a performance metric, such as validation accuracy on a target dataset, by exploring a predefined search space of augmentation operations like rotations, color adjustments, or cutouts. Unlike manual policy design, this approach systematically identifies combinations that enhance model generalization without extensive human intervention. A seminal method in this domain is AutoAugment, introduced in 2018, which employs reinforcement learning to search for optimal augmentation policies. In AutoAugment, a controller samples sub-policies—each consisting of two consecutive augmentation operations with associated probabilities and magnitudes—from a discrete search space of 16 operations, applied to mini-batches of data. The controller is trained using the REINFORCE algorithm, where the reward is the validation accuracy of a child model trained on the augmented data, with a proxy task on a smaller dataset such as reduced CIFAR-10 to lower computational cost before transferring the policy to larger datasets. The resulting learned policy, when applied to models like ResNet or NASNet on ImageNet, yields top-1 accuracy gains of up to 1.48% compared to standard augmentations. Building on AutoAugment, RandAugment (2020) simplifies the procedure to two interpretable hyperparameters, eliminating the need for a separate search phase and proxy tasks and thereby lowering computational requirements by orders of magnitude. Instead of learning per-operation probabilities and magnitudes, RandAugment samples a fixed number N of operations uniformly at random from the same operation space and applies each with a shared global magnitude M, tuned directly on the target task via a small grid search. This approach achieves comparable or superior performance to AutoAugment; for instance, it improves top-1 accuracy by 1.3-2.0% on ImageNet for various architectures, while requiring only a tiny fraction of AutoAugment's search compute. Beyond RL-based methods, alternative optimization techniques have been developed for policy search. Bayesian optimization approaches model the objective function (e.g., model accuracy) as a Gaussian process to efficiently explore continuous or discrete augmentation spaces, selecting promising policies via an acquisition function that balances exploration and exploitation. This strategy automates policy discovery for tasks like image classification, matching or outperforming exhaustive search in far fewer evaluations. Genetic algorithms evolve augmentation policies through population-based optimization, where candidate policies (chromosomes) are mutated and crossed over, with fitness evaluated by downstream model performance; for example, tournament-selection genetic algorithms have been used to adapt AutoAugment-like searches for specialized domains like sim-to-real transfer. Additionally, proximal policy optimization (PPO), an on-policy algorithm, has been applied to learn augmentation policies in reinforcement learning settings, where it optimizes stochastic policies to maximize generalization rewards, as demonstrated in environments requiring robust data perturbations for policy stability.
These methods prove particularly efficient for training large-scale models like Vision Transformers (ViTs) on datasets such as ImageNet, where strong augmentation policies can yield 2-3% top-1 accuracy improvements over baseline training, enhancing robustness to distribution shifts without additional data. For ViTs, which lack convolutional inductive biases, optimized policies like those from RandAugment or AutoAugment variants are crucial for achieving competitive performance from scratch. However, policy search methods face limitations, including high computational costs for exhaustive searches—AutoAugment requires thousands of GPU hours—and challenges in transferability, as policies learned on one dataset (e.g., CIFAR-10) may underperform on others due to domain-specific optima.
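The RandAugment-style sampling scheme described above reduces to a few lines; the sketch below uses a small illustrative pool of PIL operations and a made-up magnitude mapping, not the published operation set or scaling.

```python
# Sketch of a RandAugment-style policy: draw N operations uniformly from a pool
# and apply each at a shared magnitude M (mapped here to per-op parameters).
import random
from PIL import Image, ImageEnhance, ImageOps

def rand_augment(img, n_ops=2, magnitude=9, max_magnitude=10):
    level = magnitude / max_magnitude
    ops = [
        lambda im: im.rotate(30 * level),                               # rotate up to 30 degrees
        lambda im: ImageEnhance.Color(im).enhance(1 + level),           # saturation boost
        lambda im: ImageEnhance.Contrast(im).enhance(1 + level),        # contrast boost
        lambda im: ImageOps.posterize(im, max(1, int(8 - 4 * level))),  # reduce bit depth
        lambda im: ImageOps.solarize(im, int(256 - 128 * level)),       # invert bright pixels
    ]
    for op in random.sample(ops, n_ops):
        img = op(img)
    return img

img = Image.new("RGB", (64, 64), color=(120, 180, 90))  # toy image
print(rand_augment(img).size)  # (64, 64)
```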

Applications and Challenges

Cross-Domain Applications

In healthcare, data augmentation plays a pivotal role in addressing the scarcity of training data for diagnostic models, particularly through the generation of synthetic MRI and CT scans that mimic real pathological features. Techniques such as generative adversarial networks (GANs) have been employed to create augmented datasets, enabling models to improve diagnostic accuracy for conditions like rare hematological disorders where imaging is crucial. For instance, DCGAN-based augmentation has demonstrated effectiveness in synthesizing MRI images for classification tasks, enhancing model robustness without compromising patient privacy. Additionally, federated learning integrated with data augmentation has emerged as a key trend since 2023, allowing collaborative model training across institutions while preserving data privacy through differential privacy mechanisms, as seen in frameworks like FMDADP-MA that augment medical datasets for edge-based diagnostic assistance. In autonomous systems, data augmentation via simulated data is essential for self-driving vehicles to handle edge cases that are rare or unsafe to capture in real-world scenarios. Methods like SurfelGAN synthesize realistic camera and sensor data by generating novel trajectories and environmental variations, bridging the gap between simulated and real inputs to improve perception models. Similarly, augmented autonomous driving simulation (AADS) combines real-world imagery with data-driven generation, enabling scalable training data for obstacle detection and path planning in diverse conditions. These approaches have been validated on large-scale datasets, showing significant gains in model generalization for safety-critical tasks. The financial sector leverages augmentation to bolster fraud detection systems, creating diverse synthetic transaction datasets that comply with stringent regulations like GDPR. Augmentation techniques generate realistic fraudulent patterns by perturbing real transaction features while ensuring statistical fidelity, which helps mitigate class imbalance in fraud datasets. A 2024 report by the Financial Conduct Authority highlights how synthetic augmentation enhances model performance in detecting anomalous transactions, with applications in anti-money laundering that avoid direct use of sensitive customer data. Systematic reviews confirm that such methods, often using GANs or variational autoencoders, improve detection accuracy by up to 15-20% in controlled benchmarks while adhering to privacy standards. Beyond these domains, data augmentation facilitates sim-to-real transfer in robotics by augmenting simulation environments with GAN-generated variations to better approximate real-world dynamics. For example, instance-level augmentation pipelines have been shown to enhance vision-based policies, reducing the domain gap in tasks like robotic manipulation. In climate modeling, synthetic weather pattern synthesis through data augmentation supports predictive models; frameworks like GANterpolate interpolate and generate augmented datasets for storm intensity estimation, improving forecast reliability amid data sparsity. A systematic approach using techniques such as random erasing and noise addition has demonstrated efficacy in augmenting reanalysis data for storm tracking. Case studies from 2024 underscore the impact of data augmentation in training large language models for code generation, where synthetic data enhances training efficiency. Techniques like comment augmentation generate explanatory annotations for code snippets, filtering and enriching datasets to boost LLM performance on programming tasks; evaluations on benchmarks such as HumanEval show improvements in pass@1 rates by 5-10%.
Surveys of LLM-based augmentation further illustrate its role in creating diverse code corpora, addressing data scarcity in specialized programming domains.

Limitations and Future Directions

Despite its benefits, data augmentation faces significant limitations, particularly in addressing domain shift, where generated samples may fail to capture real-world data distributions, leading to degraded model performance during deployment. Computational overhead remains a key challenge, as policy optimization methods like AutoAugment require extensive resources for search and validation, often demanding thousands of GPU hours. Additionally, augmentation can amplify biases present in the original dataset, propagating and exacerbating unfair representations in downstream models. Ethical concerns further complicate data augmentation practices, especially with generative approaches. Privacy risks arise from memorization in models like GANs, where training data can be reconstructed from generated outputs, enabling membership inference attacks on sensitive information. Fairness issues are pronounced in imbalanced societal datasets, where augmentation may reinforce disparities across demographic groups unless explicitly mitigated. Looking ahead, future directions emphasize multimodal augmentation techniques that integrate text and image data to enhance cross-modal generalization. Integration with self- and semi-supervised learning promises to leverage unlabeled data more effectively, as seen in methods like VIME for tabular augmentation. Sustainable computing is gaining traction through efficient policies developed post-2023, such as adaptive search strategies that minimize resource use while maintaining performance. Research gaps persist in standardizing evaluation protocols, with current metrics lacking uniformity across modalities. Augmentation for graph and spatiotemporal data remains underdeveloped, hindering applications in spatial and temporal modeling. Projections for the coming years highlight AI-driven auto-discovery systems, where agents autonomously generate and optimize augmentation strategies to accelerate innovation. Key metrics for assessing augmentation quality include diversity scores, such as label shift estimation, which quantify distributional coverage and sample variety.

References

  1. [1]
    [PDF] A Comprehensive Survey on Data Augmentation - arXiv
    Oct 15, 2025 · Abstract—Data augmentation is a series of techniques that generate high-quality artificial data by manipulating existing data samples.
  2. [2]
    Data augmentation: A comprehensive survey of modern approaches
    This paper presents an extensive and thorough review of data augmentation methods applicable in computer vision domains.
  3. [3]
    A survey on Image Data Augmentation for Deep Learning
    Jul 6, 2019 · This survey will present existing methods for Data Augmentation, promising developments, and meta-level decisions for implementing Data Augmentation.
  4. [4]
    SMOTE: Synthetic Minority Over-sampling Technique
    Jun 1, 2002 · SMOTE is a method for imbalanced datasets that over-samples the minority class by creating synthetic examples, and under-samples the majority ...
  5. [5]
    Data Augmentation Techniques in AlexNet and Their Impact on Performance in the ImageNet Challenge
  6. [6]
  7. [7]
    XOR Mixup: Privacy-Preserving Data Augmentation for One-Shot ...
    Jun 9, 2020 · We develop a privacy-preserving XOR based mixup data augmentation technique, coined XorMixup, and thereby propose a novel one-shot FL framework, termed ...
  8. [8]
    Data oversampling and imbalanced datasets: an investigation of ...
    Jun 17, 2024 · In this study, we compare different oversampling techniques like synthetic minority oversampling technique (SMOTE), support vector machine SMOTE (SVM-SMOTE), ...
  9. [9]
    A New Over-Sampling Method in Imbalanced Data Sets Learning
    This paper presents two new minority over-sampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline ...
  10. [10]
    Adaptive Synthetic Sampling Approach for Imbalanced Learning
    PDF | This paper presents a novel adaptive synthetic (ADASYN) sampling approach for learning from imbalanced data sets. The essential idea of ADASYN is.
  11. [11]
    SMOTE algorithm optimization and application in corporate credit ...
    Jul 2, 2025 · This study focuses on optimizing the Synthetic Minority Over-Sampling Technique (SMOTE) algorithm for corporate credit risk prediction.
  12. [12]
    An oversampling method for imbalanced data based on spatial ...
    Oct 7, 2022 · ... G-mean, and the AUC. In the binary classification problem, we ... To combat multi-class imbalanced problems by means of over-sampling and boosting ...
  13. [13]
    What is a feature engineering? | IBM
    Feature engineering is the process of transforming raw data into relevant information for use by machine learning models.
  14. [14]
    [PDF] The Role of Feature Engineering in Machine Learning - IRE Journals
    Traditional feature engineering techniques include domain-specific feature selection, polynomial transformations, encoding categorical variables, and feature ...
  15. [15]
    [PDF] The good, the bad and the ugly sides of data augmentation
    adding random Gaussian noise to data points is equivalent to Tikhonov regularization (Bishop,. 1995) and vicinal risk minimization (Zhang et al., 2017 ...
  16. [16]
    Training Data Augmentation with Data Distilled by Principal ... - MDPI
    Jan 8, 2024 · This work develops a new method for vector data augmentation. The proposed method applies principal component analysis (PCA), determines the eigenvectors of a ...
  17. [17]
    [PDF] Kernel methods in machine learning - arXiv
    Kernel methods for unsupervised learning. This section discusses various methods of data analysis by modeling the distribution of data in feature space.
  18. [18]
    Lagged features for time series forecasting - Scikit-learn
    This example demonstrates how Polars-engineered lagged features can be used for time series forecasting with HistGradientBoostingRegressor on the Bike ...
  19. [19]
    OpenCV: Geometric Transformations of Images
  20. [20]
    Albumentations Documentation: Geometric Transformations
  21. [21]
    Improving Deep Learning using Generic Data Augmentation - arXiv
    Aug 20, 2017 · This study benchmarks various popular data augmentation schemes to allow researchers to make informed decisions as to which training methods are most ...
  22. [22]
  23. [23]
  24. [24]
    Data Augmentation in Training CNNs: Injecting Noise to Images - arXiv
    Jul 12, 2023 · This study analyzes the effects of adding or applying different noise models of varying magnitudes to Convolutional Neural Network (CNN) architectures.
  25. [25]
    [1412.6572] Explaining and Harnessing Adversarial Examples - arXiv
    Dec 20, 2014 · Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
  26. [26]
    Improved Regularization of Convolutional Neural Networks with ...
    Aug 15, 2017 · Cutout is a regularization technique that randomly masks out square regions of input during training to improve the robustness of CNNs.
  27. [27]
    [1710.09412] mixup: Beyond Empirical Risk Minimization - arXiv
    Oct 25, 2017 · We propose mixup, a simple learning principle to alleviate these issues. In essence, mixup trains a neural network on convex combinations of pairs of examples ...
  28. [28]
    [PDF] Adversarial Differentiable Data Augmentation for Autonomous ...
    The work presented in this paper builds on existing needs to enhance the robustness of vision systems, and draws in- spiration from the data augmentation and ...
  29. [29]
    Data Augmentation via Dependency Tree Morphing for Low ...
    Data augmentation uses 'crop' (removing dependency links) and 'rotate' (moving tree fragments) techniques to augment training sets for low-resource languages.
  30. [30]
    Data augmentation approaches in natural language processing
    Masked language models (MLMs) such as BERT and RoBERTa can predict masked words in text based on context, which can be used for text data augmentation (as shown ...
  31. [31]
    [1909.12434] Learning the Difference that Makes a Difference with ...
    Sep 26, 2019 · Learning the Difference that Makes a Difference with Counterfactually-Augmented Data. Authors: Divyansh Kaushik, Eduard Hovy, Zachary C. Lipton.
  32. [32]
    The Stanford Natural Language Inference (SNLI) Corpus
    The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced ...
  33. [33]
    [PDF] Exploring the Limits of Transfer Learning with a Unified Text-to-Text ...
    We refer to our model and framework as the “Text-to-Text Transfer Transformer”. (T5). 2.1. Model. Early results on transfer learning for NLP leveraged recurrent ...
  34. [34]
    SimCSE: Simple Contrastive Learning of Sentence Embeddings
    This paper presents SimCSE, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings.
  35. [35]
    Time Series Data Augmentation for Deep Learning: A Survey - arXiv
    Feb 27, 2020 · In this paper, we systematically review different data augmentation methods for time series. We propose a taxonomy for the reviewed methods.
  36. [36]
    [PDF] Time Series Data Augmentation for Deep Learning: A Survey - IJCAI
    The experiments in [Um et al., 2017] show that the combination of three basic time- domain methods (permutation, rotation, and time warping) is better than that ...
  37. [37]
    A Novel Data Augmentation Technique for Time Series Classification
    Mar 1, 2021 · This paper proposes a novel data augmentation method for time series based on Dynamic Time Warping.
  38. [38]
    Data Augmentation techniques in time series domain: a survey and ...
    Mar 24, 2023 · This work systematically reviews the current state of the art in the area to provide an overview of all available algorithms and proposes a taxonomy of the ...
  39. [39]
    [2004.08780] Time Series Data Augmentation for Neural Networks ...
    Apr 19, 2020 · We propose a novel time series data augmentation called guided warping. While many data augmentation methods are based on random transformations.
  40. [40]
    A Systematic Survey of Data Augmentation of ECG Signals for AI ...
    Time domain transformations change the ECG along the time axis, i.e., the data points on the ECG are moved to different time steps than the original sequence.3. Ecg Applications And... · 3.1. Typical Ecg... · 5.2. Learning Based-Models
  41. [41]
    Data Augmentation of Wearable Sensor Data for Parkinson's ... - arXiv
    Jun 2, 2017 · In this paper, various data augmentation methods for wearable sensor data are proposed. The proposed methods and CNNs are applied to the classification of the ...
  42. [42]
    A Systematic Survey of Data Augmentation of ECG Signals for AI ...
    May 31, 2023 · This study provided a better understanding of the potential of ECG augmentation in enhancing the performance of AI-based ECG applications.
  43. [43]
    a time-frequency domain data augmentation for enhancing fault ...
    Oct 1, 2025 · STFT–DA: a time-frequency domain data augmentation for enhancing fault diagnosis in rotating equipment with limited data.
  44. [44]
    Rolling bearing fault diagnosis based on 2D time-frequency images ...
    Jan 6, 2023 · A fault diagnosis method based on two-dimensional time-frequency images and data augmentation is proposed.
  45. [45]
    BlendNet: a blending-based convolutional neural network ... - Frontiers
    Aug 21, 2025 · Methods: This work proposes “BlendNet,” a DL architecture that effectively extracts the features of an ECG signal using a blending approach ...
  46. [46]
    A Domain Adaptation Sparse Representation Classifier for Cross ...
    In this study, a new domain adaptation sparse representation classifier (DASRC) is proposed to address the cross-domain EEG-based emotion classification.Missing: mixing blending
  47. [47]
    Fault frequency band segmentation and domain adaptation with ...
    This paper proposes a novel method combining fault frequency band segmentation domain adaptation (FBSDA) with fault-added and uncertainty-aware signal ...
  48. [48]
    Leveraging Machine Learning for Personalized Wearable ... - NIH
    Feb 13, 2024 · This review investigates the convergence of artificial intelligence (AI) and personalized health monitoring through wearable devices.
  49. [49]
    Data augmentation in predictive maintenance applicable to ...
    Dec 3, 2024 · A literature review is conducted to identify a solution approach for a suitable data augmentation strategy that can be applied to our specific use case of ...
  50. [50]
    Physically Plausible Data Augmentations for Wearable IMU-based ...
    Aug 18, 2025 · In this study, we introduce and systematically characterize Physically Plausible Data Augmentation (PPDA) enabled by physics simulation. PPDA ...
  51. [51]
    [1406.2661] Generative Adversarial Networks - arXiv
    Jun 10, 2014 · We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models.
  52. [52]
    [1312.6114] Auto-Encoding Variational Bayes - arXiv
    We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even ...
  53. [53]
    Unsupervised Representation Learning with Deep Convolutional ...
    Nov 19, 2015 · This paper introduces DCGANs, a type of CNN, for unsupervised learning, aiming to bridge the gap with supervised learning. It learns a ...
  54. [54]
    [1706.03850] Adversarial Feature Matching for Text Generation - arXiv
    Jun 12, 2017 · We propose a framework for generating realistic text via adversarial training. We employ a long short-term memory network as generator, and a convolutional ...
  55. [55]
    [1411.1784] Conditional Generative Adversarial Nets - arXiv
    In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data.
  56. [56]
    [1812.04948] A Style-Based Generator Architecture for ... - arXiv
    Dec 12, 2018 · Abstract:We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature.
  57. [57]
    [2006.11239] Denoising Diffusion Probabilistic Models - arXiv
    This paper presents high quality image synthesis using diffusion probabilistic models, trained with a novel connection to denoising score matching. It achieves ...
  58. [58]
    GANs Trained by a Two Time-Scale Update Rule Converge ... - arXiv
    Jun 26, 2017 · For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID) which captures the ...
  59. [59]
    AutoAugment: Learning Augmentation Policies from Data - arXiv
    May 24, 2018 · In this paper, we describe a simple procedure called AutoAugment to automatically search for improved data augmentation policies.
  60. [60]
    Practical automated data augmentation with a reduced search space
    Sep 30, 2019 · RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task.
  61. [61]
    Learning Optimal Data Augmentation Policies via Bayesian ... - arXiv
    May 6, 2019 · We propose a method named BO-Aug for automating the process by finding the optimal DA policies using the Bayesian optimization approach. Our ...
  62. [62]
    How to train your ViT? Data, Augmentation, and Regularization in ...
    Jun 18, 2021 · ViT training relies on data augmentation and regularization ("AugReg") due to weaker inductive bias, especially with smaller datasets. ...
  63. [63]
    Data Augmentation and Synthetic Data Generation in Rare Disease ...
    This explains why data augmentation is prevalent in diseases where the use of images for diagnosis and prognosis is essential, such as hematological diseases ...
  64. [64]
    Data Augmentation for Rare Diseases using DCGAN in MRI
    May 9, 2025 · This study presents a comprehensive investigation of deep learning techniques for synthesizing medical images across different modalities, with ...
  65. [65]
    Enhancing Medical Assistance Through Secure Federated Edge ...
    Oct 21, 2025 · We introduce Federated Medical Data Augmentation with Differential Privacy for Medical Assistance (FMDADP-MA), addressing the challenge of ...
  66. [66]
    [PDF] SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous ...
    a) The goal of this work is the generation of camera images for autonomous driving simulation. When provided with a novel trajectory of the self-driving vehicle ...<|separator|>
  67. [67]
    AADS: Augmented autonomous driving simulation using data-driven ...
    We present our augmented autonomous driving simulation (AADS). Our formulation augmented real-world pictures with a simulated traffic flow to create ...
  68. [68]
    [PDF] Report: Using Synthetic Data in Financial Services
    Mar 22, 2024 · Augmenting the real-world fraud data with synthetic fraudulent transactions may be used to improve detection performance. 4.5. Robust model ...
  69. [69]
    A Systematic Review of Synthetic Data Generation for Finance - arXiv
    Oct 30, 2025 · Synthetic data generation has emerged as a promising approach to address the challenges of using sensitive financial data in machine learning ...
  70. [70]
    GAN-Based Instance-Level Data Augmentation for Sim-to-Real ...
    Jan 3, 2025 · Our method provides a scalable and flexible data augmentation tool for leveraging large synthetic datasets to enhance vision-based robotic navigation tasks.
  71. [71]
    Improving Climate Modeling through Synthetic Data Generation
    Aug 8, 2025 · This study proposes GANterpolate, a novel hybrid approach that integrates Generative Adversarial Networks (GANs) with interpolation techniques ...<|separator|>
  72. [72]
    A Systematic Framework for Data Augmentation for Tropical Cyclone ...
    Sep 25, 2024 · Our findings suggest that all augmentation techniques are effective for the estimation of tropical cyclones intensity, including random erasing.
  73. [73]
    Enhancing Code LLMs with Comment Augmentation - ACL Anthology
    We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data ...
  74. [74]
  75. [75]
    Understanding and Mitigating the Bias Inheritance in LLM-based ...
    Feb 6, 2025 · Bias inheritance is when LLMs propagate biases from training data when generating synthetic data, impacting model fairness and robustness.
  76. [76]
    Improving Recommendation Fairness via Data Augmentation - arXiv
    Feb 13, 2023 · We augment imbalanced training data towards balanced data distribution to improve fairness.