Video super-resolution
Video super-resolution (VSR) is a computational technique in computer vision that reconstructs high-resolution (HR) video frames from low-resolution (LR) counterparts by exploiting temporal correlations and alignment across multiple frames, thereby achieving superior detail recovery compared to single-image super-resolution methods. Unlike static image upscaling, VSR addresses challenges inherent to dynamic content, such as motion-induced blurring and inter-frame inconsistencies, through explicit or implicit modeling of temporal dependencies.[1] Early approaches relied on classical signal processing techniques such as multi-frame interpolation and optical flow estimation, but these often suffered from alignment errors and limited fidelity in complex scenes.[2] The field has seen transformative advances since the mid-2010s with the advent of deep learning, particularly convolutional neural networks (CNNs) and recurrent architectures that propagate information across frames to mitigate artifacts such as flickering and to enhance perceptual quality.[1] Notable methods include EDVR (enhanced deformable video restoration) for robust alignment and feature fusion, and recurrent models such as BasicVSR, which leverage long-term dependencies for efficient 4x upscaling with state-of-the-art peak signal-to-noise ratio (PSNR) gains on benchmarks such as Vid4.[1] Recent innovations incorporate transformer-based attention for global context capture and diffusion models to generate realistic textures, addressing over-smoothing in earlier GAN-based techniques and enabling real-world applications such as 4K enhancement despite high computational demands.[3] Persistent challenges include handling diverse degradations (e.g., compression artifacts, noise) without paired HR-LR training data and achieving real-time inference on resource-constrained devices, motivating degradation-adaptive priors over purely synthetic assumptions.[4] Applications span video conferencing, surveillance enhancement, and medical imaging, where empirical evaluations show that multi-frame fusion can recover detail beyond the sampling limits of any single frame.[5]

History
Origins in Image Super-Resolution
Video super-resolution emerged from foundational image super-resolution techniques, which addressed the challenge of reconstructing high-resolution images from low-resolution inputs degraded by downsampling, blur, and noise. These methods were rooted in sampling theory, where aliasing arises when the sampling rate falls below the Nyquist frequency, leading to loss of high-frequency details. Early image super-resolution sought to mitigate this by exploiting redundancy across multiple low-resolution observations, assuming sub-pixel shifts and shift-invariance to recover aliased components through fusion.[6][7]

Initial adaptations for video borrowed baseline interpolation techniques from single-image super-resolution, such as bicubic interpolation, which estimates missing pixels via cubic polynomial fitting over a 4x4 neighborhood, and Lanczos resampling, a sinc-based method that preserves sharper edges by convolving with a truncated sinc kernel. These served as simple upsampling baselines but often introduced smoothing artifacts, failing to recover true high-frequency content due to their reliance on local smoothness assumptions rather than global scene structure. Multi-frame image super-resolution extended this by iteratively projecting observations onto a higher-resolution grid, with the Papoulis-Gerchberg algorithm (introduced in 1977 for bandlimited signal extrapolation) adapted for images to enforce consistency with low-resolution constraints while extrapolating frequencies.[8][9]

A pivotal advancement came in 1991 with the iterative back-projection method of Irani and Peleg, which modeled low-resolution images as warped, blurred, and decimated versions of a high-resolution latent image, iteratively refining estimates by back-projecting errors to enforce fidelity across frames with sub-pixel misalignments. This approach assumed translational motion and no occlusions, enabling super-resolution factors of 2-4x in controlled settings but revealing limitations when applied to video sequences, where rigid shift-invariance overlooked complex object motion and temporal correlations. Empirical tests on dynamic scenes showed persistent blurring and ghosting artifacts, as the methods underutilized inter-frame redundancy beyond static fusion, prompting later video-specific extensions.[10][11]
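The core back-projection loop can be made concrete in a few lines of NumPy/SciPy. The sketch below assumes known sub-pixel translational shifts, a Gaussian blur, and integer decimation; the function and parameter names are illustrative rather than drawn from Irani and Peleg's original implementation.

```python
# Minimal sketch of iterative back-projection for translationally shifted
# low-resolution frames (assumes HR dimensions divisible by `scale`).
import numpy as np
from scipy.ndimage import gaussian_filter, zoom, shift as subpixel_shift

def degrade(hr, dx, dy, sigma, scale):
    """Assumed forward model for one frame: warp -> blur -> decimate."""
    warped = subpixel_shift(hr, (dy, dx), order=3, mode="reflect")
    blurred = gaussian_filter(warped, sigma)
    return blurred[::scale, ::scale]

def back_project(hr_est, lr_frames, shifts, sigma=1.0, scale=2,
                 n_iters=20, step=0.5):
    """Refine an HR estimate by back-projecting each frame's residual."""
    for _ in range(n_iters):
        correction = np.zeros_like(hr_est)
        for lr, (dx, dy) in zip(lr_frames, shifts):
            simulated = degrade(hr_est, dx, dy, sigma, scale)
            residual = lr - simulated
            # Upsample the residual to the HR grid and undo the warp.
            up = zoom(residual, scale, order=3)
            up = gaussian_filter(up, sigma)        # back-projection kernel
            correction += subpixel_shift(up, (-dy, -dx), order=3, mode="reflect")
        hr_est = hr_est + step * correction / len(lr_frames)
    return hr_est
```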
Early Video-Specific Techniques (Pre-2010)

Early video super-resolution techniques, developed primarily in the 1990s and 2000s, leveraged the temporal redundancy across multiple low-resolution frames by incorporating explicit motion estimation to align frames before fusion, distinguishing them from single-image methods that ignored inter-frame information.[12] These approaches typically employed block-matching algorithms for efficient sub-pixel motion vector estimation, which divided frames into blocks and searched for correspondences to compute displacements, enabling frame warping for alignment.[13] Alternatively, dense optical flow methods, solving for pixel-wise motion fields via brightness constancy assumptions and smoothness constraints, provided more precise alignments but at higher computational cost, often integrated into iterative refinement processes.[12]

Fusion after compensation commonly used weighted averaging, where aligned frames contributed to the high-resolution estimate proportional to their estimated reliability, such as inverse variance of alignment errors or pixel distances from motion discontinuities.[14] More sophisticated formulations applied maximum a posteriori (MAP) estimation, modeling the high-resolution frame as a latent variable optimized under likelihood terms from observed low-resolution inputs (accounting for blur, decimation, and noise) and priors like Huber-Markov random fields to penalize discontinuities robustly while preserving edges.[15] Frequency-domain strategies, such as discrete Fourier transform (DFT)-based phase correlation for sub-pixel shift correction, addressed aliasing in aligned frames by estimating global translations before spatial fusion.[13]

Empirical evaluations on controlled degradations, such as bicubic downsampling by factors of 2-4 with added Gaussian noise on standard sequences like Foreman or Akiyo, demonstrated peak signal-to-noise ratio (PSNR) gains of 1-2 dB over single-frame interpolation baselines, attributed to sub-pixel information aggregation from motion-exploited redundancy.[15] However, these methods proved sensitive to motion estimation inaccuracies, with block-matching yielding artifacts in textured regions due to aperture problems and optical flow failing on large displacements or occlusions, often resulting in blurring or ghosting in dynamic scenes.[12] Compute efficiency allowed real-time processing on 2000s hardware for modest upscaling, but scalability was limited by error propagation in complex motions, prompting later refinements in prior selection like adaptive Huber thresholds.[16]
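A minimal block-matching motion estimator of the kind used in these pipelines can be sketched as follows; the exhaustive search, block size, and search radius are illustrative choices rather than a specific published configuration.

```python
# Minimal exhaustive block-matching between two frames using the
# sum of absolute differences (SAD) criterion.
import numpy as np

def block_matching(ref, cur, block=8, search=4):
    """Return per-block (dy, dx) motion vectors from `cur` toward `ref`."""
    ref = ref.astype(np.float32)
    cur = cur.astype(np.float32)
    H, W = ref.shape
    vectors = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(0, H - block + 1, block):
        for bx in range(0, W - block + 1, block):
            patch = cur[by:by + block, bx:bx + block]
            best, best_dv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y0, x0 = by + dy, bx + dx
                    if y0 < 0 or x0 < 0 or y0 + block > H or x0 + block > W:
                        continue  # candidate block falls outside the frame
                    cand = ref[y0:y0 + block, x0:x0 + block]
                    sad = np.abs(patch - cand).sum()
                    if sad < best:
                        best, best_dv = sad, (dy, dx)
            vectors[by // block, bx // block] = best_dv
    return vectors
```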
Deep Learning Era (2010s Onward)

The advent of deep learning in video super-resolution (VSR) marked a paradigm shift from handcrafted priors to data-driven models, leveraging large-scale datasets and convolutional neural networks (CNNs) trained end-to-end. This era began in earnest around 2016, building on successes in single-image super-resolution such as SRCNN, with initial adaptations exploiting temporal information across frames to surpass traditional methods in reconstruction quality and efficiency. A pivotal early work was the Efficient Sub-Pixel CNN (ESPCN), which achieved real-time upscaling of 1080p videos on consumer GPUs by learning sub-pixel convolutions and incorporating adjacent frames for temporal gradient exploitation, outperforming bicubic interpolation and sparse-coding approaches on standard sequences.[17]

Subsequent advancements from 2017 to 2020 emphasized architectures capturing spatio-temporal correlations, including recurrent neural networks (RNNs) for modeling long-range dependencies across sequences and 3D CNNs with kernels extending over time to aggregate motion-consistent features. These gains were causally linked to expanded training datasets like Vimeo-90K, a 2017 collection of 89,800 high-quality clips that facilitated supervised learning on diverse motions and degradations, enabling models to generalize beyond isolated frames. However, empirical evaluations revealed limitations, as many early networks overfit to synthetic bicubic downsampling prevalent in such datasets, yielding artifacts and reduced fidelity on real-world videos with unknown degradations like camera shake or compression noise.[18][19][20]

By the early 2020s, VSR incorporated multimodal inputs and generative paradigms, with 2024 introducing event-enhanced methods like EvTexture, which fuses asynchronous event camera data—rich in high-frequency texture edges—with RGB frames to mitigate blurring in dynamic scenes, demonstrating superior detail recovery on benchmarks. Diffusion models emerged concurrently for handling severe blurry or low-quality inputs, probabilistically denoising latent representations across frames to produce temporally coherent outputs, though at higher computational cost than deterministic CNNs. These developments, enabled by scalable hardware like modern GPUs, transitioned VSR toward hybrid generative frameworks prioritizing perceptual realism over pixel-wise metrics alone.[21]

Fundamentals
Mathematical Formulation
The degradation process in single-image super-resolution is commonly modeled as

y = D(Hx) + n,

where y \in \mathbb{R}^{M} is the observed low-resolution image, x \in \mathbb{R}^{N} (with N > M) is the unknown high-resolution image, H is the blurring operator (e.g., convolution with a kernel such as a Gaussian), D is the downsampling operator (e.g., bicubic or average pooling by factor s), and n is additive noise.[22] This ill-posed inverse problem lacks a unique closed-form solution due to the loss of high-frequency information during downsampling, necessitating regularization. The estimation of x is thus formulated as the maximum a posteriori (MAP) solution

\hat{x} = \arg\min_{x} \| y - D H x \|_{2}^{2} + \lambda R(x),

where R(x) encodes image priors (e.g., gradient sparsity R(x) = \| \nabla x \|_{1} for total variation) and \lambda > 0 balances data fidelity and regularization.[22] Pre-deep-learning solutions rely on iterative optimization methods, such as proximal gradient descent, to approximate this minimum, as direct matrix inversion is computationally infeasible for large N.[14]

Video super-resolution extends this to a temporal sequence of low-resolution frames \{ y_{t} \}_{t=1}^{T}, aiming to recover high-resolution frames \{ x_{t} \} while ensuring temporal consistency. The forward degradation model incorporates inter-frame motion:

y_{t} = D(H(W_{t \to \mathrm{ref}} x_{\mathrm{ref}})) + n_{t},

where W_{t \to \mathrm{ref}} is a warping operator (e.g., an affine transformation or an optical-flow-based warp) aligning frame t to a reference frame via estimated motion parameters, x_{\mathrm{ref}} is the reference high-resolution frame, and the other terms follow the single-image case.[22][14] In a Bayesian framework, this yields a joint MAP estimation over the video volume, motion fields \{ w_{t} \}, blur kernel K, and noise levels \{ \theta_{t} \}:

\hat{x}, \{ \hat{w}_{t} \}, \hat{K}, \{ \hat{\theta}_{t} \} = \arg\max p(x, \{ w_{t} \}, K, \{ \theta_{t} \} \mid \{ y_{t} \}),

with likelihood p(y_{t} \mid x, K, w_{t}, \theta_{t}) \propto \exp\{ -\theta_{t} \| y_{t} - D K F_{w_{t}} x \|_{2}^{2} \} (where F_{w_{t}} is the warping matrix) and priors on the smoothness of x, w_{t}, and K.[22] Equivalently, in optimization form:

\hat{x} = \arg\min_{x} \sum_{t=1}^{T} \| y_{t} - D H W_{t} x \|_{2}^{2} + \lambda_{1} R_{\mathrm{spatial}}(x) + \lambda_{2} R_{\mathrm{temporal}}(\{ x_{t} \}),

where R_{\mathrm{temporal}} enforces consistency (e.g., via optical flow residuals) and motion compensation enters through W_{t}, estimated separately (e.g., using phase correlation to resolve aliasing in the frequency domain via cross-power spectrum peaks).[14]

Closed-form solutions remain unavailable due to the high dimensionality and coupling of spatial-temporal variables, leading to alternating optimization: estimate the motion w_{t} (e.g., via block matching with sum of absolute differences), warp and fuse the low-resolution frames, then iteratively back-project to refine x as in

x^{(n+1)} = x^{(n)} + \alpha \sum_{t} \uparrow (y_{t} - D H W_{t} x^{(n)}) * b,

where \uparrow denotes upsampling, b is a back-projection kernel, and \alpha a step size.[14] This motion-compensated iterative back-projection converges empirically but requires careful initialization and regularization to avoid artifacts from motion estimation errors.[14] Frequency-domain analysis aids motion estimation by computing the normalized cross-power spectrum

\frac{\mathcal{F}(y_{t}) \odot \mathcal{F}(y_{\mathrm{ref}})^{*}}{ | \mathcal{F}(y_{t}) \odot \mathcal{F}(y_{\mathrm{ref}})^{*} | },

where \mathcal{F} is the Fourier transform and \odot the element-wise product; its inverse transform yields delta peaks at the shift vectors, mitigating aliasing-induced ambiguities.[14]
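The phase-correlation step can be sketched directly from the formula above using NumPy's FFT routines; the sketch assumes a pure global translation and recovers only integer shifts (sub-pixel refinement would interpolate around the peak).

```python
# Minimal phase correlation: the peak of the inverse FFT of the normalized
# cross-power spectrum gives the translation between two frames.
import numpy as np

def phase_correlation(y_t, y_ref):
    """Estimate the (dy, dx) translation aligning y_t to y_ref."""
    F_t = np.fft.fft2(y_t)
    F_ref = np.fft.fft2(y_ref)
    cross_power = F_t * np.conj(F_ref)
    cross_power /= np.abs(cross_power) + 1e-12      # keep phase, drop magnitude
    correlation = np.fft.ifft2(cross_power).real    # delta-like peak at the shift
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Indices beyond the midpoint correspond to negative (wrapped) shifts.
    shifts = [p if p <= s // 2 else p - s for p, s in zip(peak, correlation.shape)]
    return tuple(shifts)
```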
Distinctions from Single-Image Super-Resolution

Video super-resolution (VSR) fundamentally differs from single-image super-resolution (SISR) by incorporating temporal information from multiple consecutive frames, enabling the exploitation of inter-frame redundancy to enhance reconstruction quality.[23] While SISR processes each low-resolution image in isolation, relying exclusively on intra-frame spatial correlations, VSR aggregates complementary details from adjacent frames (typically spanning times t-1 to t+1), thereby alleviating the underdetermined nature of the super-resolution inverse problem through sub-pixel shifts induced by motion.[24] This multi-frame strategy permits mutual disambiguation of occlusions and noise, yielding empirically superior fidelity metrics compared to frame-independent approaches.[25]

Quantitative evaluations on standard benchmarks underscore these advantages: VSR-adapted models achieve PSNR gains of at least 1.26 dB over baseline SISR methods on the Vid4 dataset for 4x upscaling, reflecting the benefit of temporal fusion in reducing reconstruction ambiguity.[25] Similar improvements, exceeding 1 dB in PSNR, are observed across other sequences where motion provides diverse viewpoints absent in single-frame scenarios.[23] However, this reliance on sequential data introduces alignment sensitivities, as sub-frame motions necessitate precise registration to avoid artifacts like ghosting, a concern irrelevant to SISR's static processing.[26]

Degradation modeling further delineates the paradigms: SISR typically presumes independent and identically distributed (i.i.d.) noise and blur across pixels within a frame, simplifying the restoration pipeline.[23] In VSR, however, degradations exhibit spatio-temporal non-stationarity, for example frame-specific motion blur or compression inconsistencies arising from varying object velocities and camera dynamics, demanding explicit handling of these correlations to prevent temporal flickering.[26] Thus, while VSR's temporal leverage amplifies the potential for detail recovery, it imposes a dual burden of motion exploitation and mitigation not encountered in SISR.[24]
Core Challenges: Motion, Degradation, and Temporal Consistency

Video super-resolution (VSR) encounters significant difficulties from inter-frame motion, particularly when displacements are large or non-rigid, as misalignments during multi-frame fusion produce ghosting artifacts: overlapping or blurred replicas of moving objects that degrade output quality.[27][28] Such effects arise because optical flow estimation fails under rapid or complex deformations, like those in natural scenes with deformable objects, leading to erroneous pixel aggregation across frames.[29]

Real-world degradations further complicate VSR, encompassing spatially variant blur, noise, and compression artifacts from codecs such as H.265/HEVC, which introduce blocking, ringing, and aliasing that interact adversely with motion to amplify high-frequency loss.[30][31] These degradations are often unknown and non-uniform across frames, rendering kernel estimation unreliable and exacerbating aliasing in upscaled outputs, as low-resolution inputs inherently discard details beyond the sensor's Nyquist frequency.[32]

Temporal consistency poses a further barrier: frame-independent super-resolution yields flickering and warping discontinuities, while inadequate alignment in multi-frame approaches propagates inconsistencies over time, manifesting as jitter in static regions or unnatural oscillations. Benchmarks reveal that unaligned or per-frame methods increase temporal artifacts by 10-20% in metrics like warping error compared to motion-compensated baselines on datasets with dynamic content.[33]

Fundamentally, these challenges stem from information-theoretic constraints: low-resolution video sequences provide insufficient bandwidth to recover lost high-frequency components, with motion introducing occlusions and non-stationarities that render the inverse mapping underdetermined, limiting reconstruction fidelity irrespective of algorithmic priors.[34][35]
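Warping error of the kind cited above compares the current output against the previous output warped along the estimated motion. The following is a minimal sketch that assumes the optical flow is supplied by an external estimator; the function name and flow layout are illustrative.

```python
# Minimal warping-error style temporal-consistency measure for two
# consecutive output frames of a VSR method.
import numpy as np
from scipy.ndimage import map_coordinates

def warping_error(prev_frame, cur_frame, flow):
    """Mean absolute difference after flow-warping prev_frame onto cur_frame.

    flow has shape (2, H, W): flow[0] is the vertical and flow[1] the
    horizontal displacement mapping pixels of cur_frame back into prev_frame.
    """
    H, W = cur_frame.shape
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([yy + flow[0], xx + flow[1]])   # sampling positions
    warped = map_coordinates(prev_frame, coords, order=1, mode="nearest")
    return np.abs(cur_frame - warped).mean()
```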
Methods

Traditional Methods
Traditional methods for video super-resolution, developed primarily in the 1980s through early 2000s, exploit temporal correlations across multiple low-resolution frames using classical signal processing without reliance on learned models. These approaches typically involve motion estimation to align frames, followed by fusion and interpolation to reconstruct higher-resolution output, emphasizing explicit modeling of degradation processes like blurring, decimation, and inter-frame shifts.[36][37]

Frequency-domain techniques model frame shifts via phase differences in the discrete Fourier transform (DFT), where sub-pixel translations correspond to linear phase ramps, enabling precise alignment without spatial interpolation errors. The Papoulis-Gerchberg algorithm extends this by iteratively projecting bandlimited signals onto constraints from observed low-frequency components and extrapolated high frequencies, leveraging the frequency-domain representation to recover aliased details from shifted frames. These methods achieve low computational complexity, often O(N log N) per frame via fast Fourier transforms, making them viable for real-time applications on limited hardware, but they presuppose periodic motion and stationary statistics, performing poorly under non-rigid deformations or aperture effects.[36][38]

Spatial-domain approaches begin with motion estimation, commonly using block-matching to compute displacement vectors between frames, followed by warping to a reference frame and fusion via techniques like weighted averaging or Kalman filtering for temporal smoothing. Interpolation, such as bilinear or edge-adaptive variants, then upsamples the fused estimate, preserving basic structures in mildly degraded sequences (e.g., 2× downsampling). Empirical results demonstrate viability for small upscaling factors, with fusion reducing variance in aligned regions, yet efficacy diminishes beyond 4× due to motion estimation inaccuracies amplifying artifacts like blurring or ghosting in occluded areas. Limitations include sensitivity to noise, which propagates through alignment, and reliance on hand-tuned parameters for matching thresholds and filter gains, though their deterministic nature allows causal processing without training data.[37][39]
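The Papoulis-Gerchberg iteration can be illustrated in one dimension: the estimate alternates between re-imposing the observed samples in the spatial domain and truncating the spectrum to the assumed band limit. The parameterization below is an illustrative assumption, not a specific published variant.

```python
# Minimal 1-D Papoulis-Gerchberg iteration for a band-limited signal known
# only at a subset of sample positions.
import numpy as np

def papoulis_gerchberg(samples, known_mask, band, n_iters=200):
    """samples: signal with zeros at unknown positions; known_mask: bool array;
    band: number of retained low-frequency bins on each side of the spectrum."""
    x = samples.astype(float).copy()
    n = len(x)
    for _ in range(n_iters):
        X = np.fft.fft(x)
        X[band:n - band] = 0.0                 # enforce the band limit
        x = np.fft.ifft(X).real
        x[known_mask] = samples[known_mask]    # re-impose the observed samples
    return x
```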
Deep Learning Methods

Deep learning methods for video super-resolution represent a paradigm shift from handcrafted priors to data-driven architectures that jointly model spatial details and temporal correlations across frames. These approaches typically employ end-to-end trainable neural networks, such as convolutional neural networks (CNNs) extended with recurrent or attention mechanisms, to upsample low-resolution (LR) video sequences to higher resolutions while mitigating artifacts from motion and degradation. Training relies on paired LR-HR video datasets, where LR inputs are often synthetically generated via bicubic downsampling of HR videos like those from Vimeo-90K or REDS, enabling supervised learning but introducing domain gaps with real-world inputs.[40]

Loss functions in these models blend pixel-level reconstruction errors, such as L1 or mean squared error (MSE), with perceptual components extracted from intermediate features of pre-trained classifiers like VGG-19, prioritizing visual plausibility over strict fidelity metrics. This combination has driven empirical gains on benchmarks, with architectures evolving from early frame-recurrent CNNs to multi-frame fusion networks that align and aggregate information across temporal windows of 3-7 frames. For instance, methods incorporating explicit motion estimation via optical flow or implicit learning through deformable convolutions have demonstrated superior handling of inter-frame inconsistencies, though computational demands scale with video length and resolution factors (e.g., 4x upsampling). Recent variants emphasize efficiency, such as lightweight models optimized via neural architecture search for real-time deployment, achieving runtimes under 50 ms per frame on GPUs while maintaining competitive reconstruction on compressed streams.[40][41][42]

Despite these advances, deep learning methods exhibit limitations in generalizability, often overfitting to synthetic degradations like bicubic blurring and noise, which fail to capture complex real-world processes including compression artifacts from codecs like H.264 or sensor-specific blur. Benchmarks on datasets with authentic LR videos, such as RealVSR, reveal performance drops of 1-3 dB in PSNR compared to synthetic counterparts, underscoring the need for degradation-adaptive training or unsupervised paradigms to bridge the simulation-to-reality gap. Surveys of over 30 architectures highlight that while holistic models integrating alignment and restoration outperform modular pipelines on controlled data, they underperform on unseen degradations without domain-specific fine-tuning, reflecting benchmark-driven development rather than robust modeling of real-world video formation.[40]
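The loss combination described above can be sketched in PyTorch as a weighted sum of an L1 term and a VGG-19 feature distance; the layer cutoff and weighting are illustrative assumptions, and ImageNet input normalization is omitted for brevity.

```python
# Minimal combined pixel + perceptual loss using a frozen VGG-19 feature extractor.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class L1PerceptualLoss(torch.nn.Module):
    def __init__(self, feature_layer=35, perceptual_weight=0.1):
        super().__init__()
        # Truncate VGG-19 at an intermediate layer (illustrative choice).
        self.vgg = vgg19(weights=VGG19_Weights.DEFAULT).features[:feature_layer].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)          # frozen feature extractor
        self.w = perceptual_weight

    def forward(self, sr, hr):
        # sr, hr: (N, 3, H, W) tensors in [0, 1]; ImageNet normalization omitted.
        pixel = F.l1_loss(sr, hr)
        perceptual = F.l1_loss(self.vgg(sr), self.vgg(hr))
        return pixel + self.w * perceptual
```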
Motion Compensation-Based Alignment

Motion compensation-based alignment in deep learning video super-resolution explicitly estimates optical flow fields to warp neighboring low-resolution frames onto a reference frame, thereby compensating for inter-frame displacements before feature fusion and upsampling. This approach leverages the temporal redundancy across frames by aligning them spatially, which is crucial for exploiting sub-pixel shifts that single-image methods cannot capture. Early integrations, such as in the Detail-Revealing Deep Video Super-resolution framework, demonstrated that accurate motion compensation via learned flow estimation significantly improves reconstruction quality over naive averaging, with ablation studies showing gains of up to 1.5 dB in PSNR on benchmark sequences.[43][44]

Subsequent methods refined this by end-to-end training of flow estimation networks inspired by optical flow pioneers like FlowNet, which was adapted for VSR to predict dense motion vectors directly from low-resolution inputs. For instance, the End-to-End Learning of Video Super-Resolution with Motion Compensation model (2017) uses a CNN-based flow estimator to generate warped frames, enabling joint optimization of alignment and super-resolution losses, which mitigates misalignment artifacts in moderate motion scenarios. Task-oriented variants, such as TOFlow (2017), further enhance this by learning self-supervised, application-specific flows tailored to super-resolution, outperforming general-purpose FlowNet by adapting motion representations to reconstruction objectives and achieving higher fidelity in dynamic scenes.[45][18]

These techniques excel in handling predictable, moderate motions where flow accuracy directly enhances alignment precision, as evidenced by reduced endpoint errors (EPE) correlating with PSNR improvements in controlled evaluations. However, limitations arise from flow estimation failures in occlusions, rapid changes, or low-texture regions, where erroneous warps propagate blurring or ghosting, empirically causing 2-3 dB PSNR degradation relative to oracle flow baselines in ablation tests on synthetic datasets. Later advancements, like high-resolution optical flow estimation (2020), address some inaccuracies by predicting flows at target scales post-alignment, though they retain explicit warping steps vulnerable to such propagation. Overall, while effective for structured motion, these methods underscore the bottleneck of flow reliability, prompting shifts toward implicit alignment in subsequent paradigms.[46][47]
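The warping step used by these methods amounts to backward resampling of a neighboring frame along the estimated flow. A minimal PyTorch sketch, assuming the flow comes from an external estimator:

```python
# Minimal backward warping of a frame with a dense optical-flow field,
# as used for explicit alignment before fusion.
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """frame: (N, C, H, W); flow: (N, 2, H, W) with (dx, dy) in pixels."""
    n, _, h, w = frame.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = grid.to(frame.device) + flow                      # sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=3)       # (N, H, W, 2)
    return F.grid_sample(frame, grid_norm, mode="bilinear",
                         padding_mode="border", align_corners=True)
```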
Deformable and Spatial Alignment Techniques

Deformable alignment techniques in video super-resolution utilize deformable convolutional networks to enable learnable offsets that facilitate adaptive warping of neighboring frames, addressing limitations of rigid motion estimation in handling non-rigid deformations and occlusions. These methods predict spatial offsets for each pixel or feature, allowing convolution kernels to sample from irregular locations rather than fixed grids, thereby capturing complex temporal correspondences. Introduced prominently in the Enhanced Deformable Video Restoration (EDVR) framework in 2019, this approach integrates pyramid, cascading, and deformable alignment modules to fuse multi-frame information effectively across various restoration tasks, including super-resolution.[48] EDVR demonstrated superior performance on benchmarks like the NTIRE 2019 video super-resolution challenge, achieving higher PSNR than prior optical flow-based alignments on sequences such as Vid4 at ×4 scale by better preserving details in dynamic scenes.[48]

The efficacy of deformable convolutions stems from their decomposition into explicit spatial warping followed by standard convolution, which enhances alignment flexibility without relying on explicit motion vectors, proving particularly adaptive to geometric variations in video sequences. Subsequent analyses confirmed that these offsets implicitly model both alignment and feature modulation, outperforming fixed-grid convolutions on datasets exhibiting irregular motion, such as SPMCS or Vimeo-90K.[49][50] However, the technique introduces challenges, including training instability from offset overflow and over-parameterization due to dedicated offset prediction networks, which can increase model parameters by 20-30% and inference compute by factors of 1.5-2x relative to rigid alignment baselines, necessitating lightweight variants like deformable convolution alignment networks (DCAN) for practical deployment.[49][51]

Spatial alignment techniques complement deformable methods by incorporating learnable spatial transformations, often via modules akin to spatial transformer networks, to enforce global or local geometric corrections prior to feature fusion. These enable explicit parameterization of affine or thin-plate-spline warps, providing robustness to scale and rotation variations in video frames, though they are less prevalent in pure VSR pipelines than deformable convolutions due to the greater rigidity of their transformation assumptions. In hybrid setups, such as flow-guided deformable alignment, spatial components refine offsets for sub-pixel accuracy, yielding state-of-the-art results on the REDS dataset (e.g., 32.24 dB PSNR at ×4), but at the cost of added preprocessing overhead.[52] Empirical evaluations highlight their adaptability to non-rigid motion over traditional estimators, though they remain computationally intensive for real-time applications without optimization.[52]
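A minimal sketch of offset-based deformable alignment in the spirit of these modules, using torchvision.ops.DeformConv2d; the offset-prediction network, channel sizes, and number of deformable groups are illustrative assumptions rather than the EDVR configuration.

```python
# Minimal deformable alignment: predict offsets from the concatenated
# reference and neighbor features, then sample the neighbor at those offsets.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlign(nn.Module):
    def __init__(self, channels=64, kernel_size=3, deform_groups=8):
        super().__init__()
        offset_channels = 2 * deform_groups * kernel_size * kernel_size
        self.offset_pred = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, offset_channels, 3, padding=1),
        )
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)

    def forward(self, neighbor_feat, ref_feat):
        # Offsets depend on both features so the neighbor is pulled
        # toward the reference during sampling.
        offsets = self.offset_pred(torch.cat([neighbor_feat, ref_feat], dim=1))
        return self.deform_conv(neighbor_feat, offsets)
```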
3D and Recurrent Architectures

3D convolutional neural networks (3D CNNs) extend traditional 2D convolutions by incorporating a temporal dimension, applying 3D kernels to stacked low-resolution frames forming spatio-temporal volumes. This enables implicit modeling of motion through shared weights across space and time, capturing inter-frame correlations without explicit alignment. For instance, the 3DSRnet architecture, introduced in 2018, processes video volumes directly via 3D convolutions, bypassing motion compensation preprocessing and achieving competitive performance on benchmark datasets by leveraging temporal redundancy.[53] However, 3D CNNs demand substantial memory for larger temporal kernels or volumes, as the parameter count scales cubically with kernel size, often limiting practical window sizes to 3-5 frames and trading scalability for fixed receptive fields in time.[53]

Recurrent neural networks (RNNs), including variants like long short-term memory (LSTM) units, model video sequences by propagating hidden states frame by frame, inherently handling variable-length inputs and long-range temporal dependencies through sequential processing. In video super-resolution, RNNs encode evolving scene dynamics in recurrent states, with LSTMs using input, forget, and output gates to mitigate vanishing-gradient issues during backpropagation through time. The Recurrent Residual Network (RRN), proposed in 2020, integrates residual connections within RNN blocks to stabilize training and enhance temporal consistency, demonstrating efficiency on datasets like Vid4 by reducing artifacts in dynamic scenes.[54] RNN-based methods excel in memory efficiency for streaming inference compared to volume-based 3D CNNs, as they avoid storing full volumes, but incur linear time complexity per frame and risk error accumulation or instabilities over extended sequences, particularly in low-motion scenarios.[55]

Despite these strengths, both paradigms face inherent trade-offs: 3D CNNs provide parallelizable computation but escalate GPU memory with temporal depth, constraining deployment on resource-limited devices, while RNNs offer sequential adaptability yet suffer from slower training due to unrolled dependencies and suboptimal parallelization. Empirical evaluations, such as those in RRN, report PSNR gains of 0.2-0.5 dB over baselines on upscaled videos, underscoring their utility for motion-rich content, though performance degrades on sequences exceeding 10-20 frames without advanced stabilization techniques.[54] Overall, these architectures prioritize direct spatio-temporal fusion over alignment-heavy alternatives, emphasizing causal temporal modeling at the expense of scalability for ultra-long videos.[54]
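A minimal recurrent VSR cell in the spirit of these architectures can be sketched as follows; the layer sizes, residual state update, and pixel-shuffle upsampling are illustrative assumptions rather than the RRN design.

```python
# Minimal recurrent VSR cell: fuse the current LR frame with the hidden state,
# update the state residually, and emit an upsampled residual over a bilinear base.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentVSRCell(nn.Module):
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        self.fuse = nn.Conv2d(3 + channels, channels, 3, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_rgb = nn.Conv2d(channels, 3 * scale * scale, 3, padding=1)

    def forward(self, lr_frame, hidden):
        # lr_frame: (N, 3, h, w); hidden: (N, C, h, w) state from the previous step.
        feat = F.relu(self.fuse(torch.cat([lr_frame, hidden], dim=1)))
        hidden = self.body(feat) + feat                  # residual state update
        residual = F.pixel_shuffle(self.to_rgb(hidden), self.scale)
        base = F.interpolate(lr_frame, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)
        return base + residual, hidden                   # SR frame and new state
```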
Emerging Paradigms: Diffusion and Generative Models

Diffusion models represent a probabilistic shift in video super-resolution (VSR), departing from deterministic convolutional architectures by iteratively denoising Gaussian noise conditioned on low-resolution inputs to generate high-fidelity frames while enforcing temporal consistency.[56] Introduced for VSR around 2023, these models leverage learned generative priors to hallucinate plausible details in regions lacking high-frequency information, outperforming prior methods on complex degradations such as motion blur and compression artifacts in benchmarks like REDS and Vimeo-90K.[57] For instance, adaptations of image diffusion models to video sequences incorporate spatial modulation and temporal alignment modules, enabling pixel-wise guidance from low-resolution frames to preserve inter-frame coherence without explicit motion estimation.[56]

Key advancements include frame-sequential diffusion frameworks that minimize retraining by repurposing pre-trained image diffusion models, achieving up to 2 dB PSNR gains on real-world videos with unknown degradations compared to GAN-based baselines.[58] In blurry VSR scenarios, event-enhanced diffusion variants fuse asynchronous event data from neuromorphic sensors with RGB frames to disambiguate motion-induced blur, yielding sharper textures in dynamic scenes as demonstrated on synthetic datasets with ×4 upscaling.[59] These approaches excel in handling non-ideal degradations by modeling the reverse diffusion process with video-specific noise schedules, though they require careful tuning to avoid over-smoothing in static regions.[57]

Generative priors in diffusion enable extrapolation beyond training distributions, such as synthesizing fine-grained details in occluded or low-texture areas, but at the cost of computational inefficiency; inference typically involves 50-100 denoising steps per frame, rendering it 100 times slower than recurrent CNN methods on standard GPUs.[60] To mitigate this, one-step latent diffusion variants accelerate processing by distilling multi-step models into single-pass generators, improving real-time viability while maintaining perceptual quality on datasets like SPyNet-eval.[61]

By 2025, vision-language model (VLM)-guided diffusion incorporated degradation priors learned from textual descriptions of real-world corruptions, enhancing robustness to unseen blur and noise in video SR without paired training data, as validated on custom real-world benchmarks with LPIPS scores improved by 15-20% over unguided diffusion.[4] This paradigm prioritizes perceptual realism over pixel-wise accuracy, though empirical evaluations reveal persistent challenges in maintaining long-range temporal consistency across extended sequences exceeding 100 frames.[4]
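Conceptually, the conditional sampling loop follows the standard DDPM formulation: starting from Gaussian noise, a learned denoiser conditioned on the low-resolution frame is applied over many reverse steps. The sketch below is a generic illustration with an assumed denoiser interface and noise schedule, not the procedure of any specific cited method.

```python
# Generic conditional DDPM sampling loop: the denoiser predicts the noise in x
# given the LR conditioning and the current timestep.
import torch

@torch.no_grad()
def ddpm_sample(denoiser, lr_cond, shape, T=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, T, device=device)   # linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                  # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, lr_cond, t)                       # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise             # one reverse step
    return x
```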
Evaluation

Performance Metrics
Peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) serve as primary full-reference fidelity metrics in video super-resolution (VSR), measuring pixel-wise reconstruction accuracy and structural preservation by averaging frame-level scores across sequences.[62] These metrics quantify error relative to ground-truth high-resolution videos, with higher PSNR values indicating lower mean squared error and SSIM emphasizing luminance, contrast, and structure correlations.[63] However, PSNR and SSIM often exhibit poor alignment with human visual perception, as demonstrated in super-resolution tasks where algorithms yielding higher scores produce artifacts visually inferior to lower-scoring alternatives, with empirical Spearman correlations to mean opinion scores (MOS) typically below 0.7.[64]

Perceptual metrics like Video Multimethod Assessment Fusion (VMAF) address these shortcomings by integrating multiple features to better predict subjective quality, showing stronger MOS correlations in video contexts.[65] No-reference metrics, such as the Natural Image Quality Evaluator (NIQE), enable blind assessment without ground truth, relying on deviations from natural scene statistics, which proves essential for real-world VSR deployments lacking pristine references.[32] Video-specific extensions, including Spatial-Temporal Reduced Reference Entropic Differencing (ST-RRED), target temporal inconsistencies like flicker by analyzing entropic differences across frames, offering reduced-reference evaluation of dynamic quality degradation.[66]

Recent evaluations increasingly incorporate efficiency alongside quality, reporting floating-point operations (FLOPs), model parameters, and runtime to assess practical viability, as mandated in the 2025 ICME VSR challenge despite its primary focus on reconstruction.[67] This shift underscores that fidelity metrics alone inadequately capture factors central to human perception, such as temporal coherence, or practical constraints such as computational feasibility, favoring hybrid approaches that prioritize MOS-aligned, no-reference perceptual scores for robust real-world benchmarking.[68] The table below summarizes the main metric families; a minimal computation sketch follows the table.

| Metric Type | Examples | Reference Requirement | Key Limitation |
|---|---|---|---|
| Fidelity | PSNR, SSIM | Full | Weak MOS correlation (r < 0.7)[64] |
| Perceptual | VMAF, NIQE | Full/No | Better human alignment but computationally intensive[65] |
| Temporal/Video-Specific | ST-RRED | Reduced | Focuses on flicker; less emphasis on spatial detail[66] |
| Efficiency | FLOPs, Runtime | None | Complements quality; hardware-dependent[67] |
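A minimal sketch of frame-averaged PSNR and SSIM computation with scikit-image, assuming RGB frames supplied as floating-point arrays in [0, 1]:

```python
# Frame-averaged PSNR and SSIM for a video, using scikit-image metrics.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_psnr_ssim(sr_frames, hr_frames):
    """sr_frames, hr_frames: iterables of (H, W, 3) float arrays in [0, 1]."""
    psnrs, ssims = [], []
    for sr, hr in zip(sr_frames, hr_frames):
        psnrs.append(peak_signal_noise_ratio(hr, sr, data_range=1.0))
        ssims.append(structural_similarity(hr, sr, data_range=1.0,
                                           channel_axis=-1))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```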