
Motion estimation

Motion estimation is the process of estimating the motion that occurs between a reference frame and the current frame in a video sequence, typically by determining motion vectors that describe displacements or transformations from one to the other. This exploits temporal correlations in sequences to model apparent motion caused by object movement or camera motion, serving as a fundamental component in fields such as video compression and computer vision. In video compression, it enables efficient encoding by reducing temporal redundancy, while in computer vision, it facilitates tasks like object tracking and scene understanding by measuring displacements from image sequences with high density and low cost compared to physical sensors. Key methods for motion estimation include block-matching algorithms, which divide frames into fixed-size blocks (e.g., 16×16 pixels) and search for the best-matching block in a reference frame to compute motion vectors, offering a practical balance of accuracy and computational efficiency in video coding standards. These are prominently featured in standards from H.261 to H.264/AVC and HEVC, where they support motion-compensated prediction to achieve significant compression gains by predicting frame content from prior frames. In contrast, optical flow techniques estimate dense or sparse motion fields by analyzing brightness constancy or gradient changes across pixels, with notable approaches like the Lucas-Kanade method for sparse flows and Horn-Schunck for dense regularization-constrained estimation. Other variants, such as parametric models for rigid motions or feature-based matching using descriptors like SIFT, address complex scenarios including nonrigid deformations. Despite its utility, motion estimation faces challenges as an ill-posed problem due to the projection from 3D scenes to 2D images, often requiring assumptions like smoothness or regularization to resolve ambiguities. Applications extend beyond video coding to biomedical imaging for tracking cellular motion, robotics for navigation, and structural monitoring for vibration analysis, where accuracies down to 0.01 pixels can be achieved with advanced correlation- or kernel-based methods. Ongoing advancements, including nature-inspired algorithms and deep learning integrations, continue to enhance robustness and real-time performance across these domains.

Fundamentals

Definition and Overview

Motion estimation is a fundamental task in computer vision and video processing that involves computing motion vectors or displacement fields to describe how pixels or features in an image sequence move from one frame to the next. These vectors represent the apparent motion of objects projected onto the two-dimensional image plane, capturing the transformation between consecutive frames in a video. The process takes as input a sequence of images and produces as output a motion field, which quantifies displacements at the pixel or block level, thereby enabling the analysis of dynamic scenes. At its core, motion estimation relies on two key principles: brightness constancy, which assumes that the intensity of a pixel remains unchanged as it moves across frames, and spatial coherence, which posits that nearby pixels belonging to the same surface exhibit similar motion patterns. These assumptions facilitate the estimation of motion by solving the pixel correspondence problem, where the goal is to match points between frames based on their visual properties. In practice, this reduces temporal redundancy in video data, allowing for efficient representation by predicting subsequent frames from previous ones rather than encoding each frame independently. A basic example of motion estimation involves estimating pure translation in a simple two-dimensional rigid motion scenario, where an entire object shifts uniformly across the image without rotation or scaling, yielding a constant motion vector for all pixels. Optical flow provides a dense representation of this motion field, assigning a velocity vector to every pixel. Motion estimation is crucial for temporal analysis in dynamic environments and serves as a foundational technique for numerous applications, including video compression and object tracking.

Historical Development

The origins of motion estimation trace back to the mid-20th century in photogrammetry and early image registration research, where aligning successive images was essential for reconstructing three-dimensional scenes from two-dimensional projections. In the 1950s and 1960s, early efforts focused on registration techniques to handle geometric transformations between views, with applications in aerial surveying and mapping. Early computational models, such as Hassenstein and Reichardt's 1956 correlation-based detector for motion in insect vision, provided foundational ideas for later algorithms. The 1980s marked a pivotal era with the formalization of differential methods for dense motion analysis, driven by the need to model continuous image changes over time. In 1981, Berthold K. P. Horn and Brian G. Schunck published "Determining Optical Flow," proposing a variational framework that minimized an energy functional to compute smooth velocity fields across entire images, influencing subsequent approaches in computer vision. Concurrently, Bruce D. Lucas and Takeo Kanade introduced a local iterative technique in their 1981 paper "An Iterative Image Registration Technique with an Application to Stereo Vision," which estimated motion by solving least-squares problems over small windows, laying the groundwork for efficient feature-based tracking. These works established the brightness constancy assumption as a core principle, assuming pixel intensities remain constant under motion, which spurred broad adoption in computer vision and video processing. By the 1990s, motion estimation transitioned into practical domains, particularly video compression, where discrete methods enabled real-time encoding. Block-matching algorithms, which divide frames into blocks and search for best-matching displacements, were integrated into international standards like MPEG-1 (1993) and MPEG-2 (1995), achieving efficiencies by exploiting temporal redundancies in broadcast television and digital storage. This period also saw refinements in tracking, such as the Kanade-Lucas-Tomasi (KLT) feature tracker introduced in Carlo Tomasi and Takeo Kanade's 1991 technical report "Detection and Tracking of Point Features," which extended local methods to affine motion models for robust object following in sequences. The 2000s brought initial fusions of motion estimation with machine learning, enhancing adaptability to complex scenes beyond rigid assumptions. Kernel-based trackers and probabilistic models, building on KLT, incorporated learning to predict feature trajectories, as seen in extensions like online boosting classifiers for visual tracking in the mid-2000s. These laid preparatory groundwork for data-driven paradigms. In the 2010s and 2020s, deep learning revolutionized motion estimation, shifting from handcrafted features to end-to-end neural architectures trained on large datasets. Philipp Fischer et al.'s 2015 paper "FlowNet: Learning Optical Flow with Convolutional Networks" introduced the first CNN-based estimator, achieving near-real-time performance on benchmarks like Sintel by directly regressing motion fields from image pairs. This sparked a surge in supervised methods, with refinements like PWC-Net (2018) improving accuracy via pyramid warping. By the early 2020s, transformer architectures addressed long-range dependencies, as in the 2021 GMA (Global Motion Aggregation) module integrated into RAFT, enabling state-of-the-art flow estimation on standard benchmarks; as of 2025, emerging diffusion frameworks and biologically inspired models continue to enhance accuracy in complex scenes.
Influential standards like H.264/AVC (2003) and its successors incorporated advanced block-matching techniques, such as variable block sizes, boosting compression efficiency and video quality in streaming, a trend continued by HEVC (2013).

Mathematical Foundations

Motion Models

Motion models in motion estimation provide parametric approximations of the underlying scene dynamics, enabling efficient computation by reducing the number of parameters compared to dense pixel-wise estimates. These models assume that motion can be captured by transformations with a small number of parameters, suitable for rigid or semi-rigid objects in video sequences or image pairs. Seminal works in multiple-view geometry have formalized these representations to balance representational power with estimation tractability. The simplest motion model is the translational model, which assumes a uniform global shift across the image and is parameterized by a 2D vector (t_x, t_y) representing displacement in the x and y directions. This model is effective for scenarios with pure camera panning or object translation without rotation or scaling, as seen in early block-matching techniques. The rotational model, in contrast, describes rotation about a center, parameterized by an angle \theta (or multiple angles in three dimensions), preserving distances but altering orientations; it is often combined with translation for basic rigid transformations. These fundamental models form the basis for more complex approximations in structure-from-motion pipelines. For scenes involving scaling, shear, or skew effects, the affine model extends translation and rotation with additional parameters, using a 6-degree-of-freedom transformation: \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, where the matrix elements a, b, c, d capture scaling (via a and d) and rotation and shear (via b and c), while t_x and t_y capture translation. This model preserves parallelism of lines and is widely used for local motion estimation in non-planar scenes, as introduced in extensions of the Lucas-Kanade framework for tracking. Perspective distortions in planar scenes or under pure camera rotation are handled by projective models, specifically the 8-parameter homography H, a 3×3 matrix defined up to scale, that maps points via: \begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} = H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, with normalized coordinates (x'/w', y'/w'). Homographies model arbitrary plane-induced projective transformations, including translation, rotation, scaling, and skew, and are central to two-view geometry for reconstructing planar structures. Non-rigid motions, such as deformations in elastic objects or articulated bodies, require deformable models like thin-plate splines (TPS), which interpolate local displacements using a radial basis function to minimize bending energy while fitting control points. TPS extends affine models locally, enabling smooth non-linear warps without assuming global rigidity, and has been applied in direct non-rigid registration methods. The choice of motion model involves a trade-off between complexity (number of parameters) and fitting accuracy: translational or rotational models suffice for rigid, distant objects and avoid overfitting noise, while affine or projective models improve accuracy for closer or planar scenes at higher computational cost, and deformable models like TPS are selected for elastic deformations despite increased estimation challenges. Unlike non-parametric optical flow, which computes dense fields without such assumptions, parametric models prioritize low-dimensional fits for robustness in sparse data regimes.
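
The following sketch in Python with NumPy (an illustrative assumption, since the sources give no reference implementation) shows how the parametric models above map 2D points: a pure translation, a rigid rotation-plus-translation, and an 8-parameter homography applied in homogeneous coordinates. All numeric parameter values are arbitrary placeholders chosen only to demonstrate the conventions.

    import numpy as np

    def apply_affine(points, A, t):
        # Map Nx2 points through x' = A x + t (the 6-parameter affine model).
        return points @ A.T + t

    def apply_homography(points, H):
        # Map Nx2 points through the projective model and normalize by w'.
        pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
        mapped = pts_h @ H.T
        return mapped[:, :2] / mapped[:, 2:3]

    pts = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])

    # Pure translation: A is the identity, t the displacement (t_x, t_y).
    print(apply_affine(pts, np.eye(2), np.array([2.0, -1.0])))

    # Rotation by theta combined with translation (rigid motion).
    theta = np.deg2rad(15)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    print(apply_affine(pts, R, np.array([2.0, -1.0])))

    # Homography defined up to scale; the last row introduces perspective effects.
    H = np.array([[1.0, 0.1, 2.0],
                  [0.0, 1.0, -1.0],
                  [1e-3, 0.0, 1.0]])
    print(apply_homography(pts, H))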

Optical Flow Constraint

The optical flow constraint arises from the fundamental assumption of brightness constancy, which posits that the intensity of a point in an image sequence remains unchanged as it moves between frames. This assumption, central to differential methods for motion estimation, states that for an image intensity function I(x, y, t), the value at a displaced position satisfies I(x + dx, y + dy, t + dt) = I(x, y, t), where dx = u \, dt and dy = v \, dt, with (u, v) representing the components of the optical flow velocity. To derive the constraint equation, a first-order Taylor expansion is applied around the point (x, y, t), assuming small motions such that higher-order terms are negligible: \begin{aligned} I(x + u \, dt, y + v \, dt, t + dt) &\approx I(x, y, t) + \frac{\partial I}{\partial x} u \, dt + \frac{\partial I}{\partial y} v \, dt + \frac{\partial I}{\partial t} dt \\ &= I(x, y, t). \end{aligned} Dividing through by dt and rearranging yields the optical flow constraint equation: \frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} = 0, where \frac{\partial I}{\partial x} and \frac{\partial I}{\partial y} are the spatial gradients, and \frac{\partial I}{\partial t} is the temporal gradient. This equation links the observed changes in image intensity to the underlying motion, forming the basis for many intensity-based algorithms. A key challenge with this constraint is the aperture problem, which stems from its underconstrained nature: for each pixel, there is only one equation but two unknowns (u and v). This results in an infinite number of possible flow vectors that satisfy the constraint, lying along a line perpendicular to the local image gradient. For instance, along a straight edge with uniform intensity parallel to the edge, only the normal component of the flow (perpendicular to the edge) can be reliably estimated from local intensity changes, while the tangential component remains ambiguous without additional contextual information from neighboring regions. The derivation relies on the small motion assumption inherent in the linear approximation, which holds well for sub-pixel displacements but breaks down for larger movements. Spatial gradients are typically computed using derivative kernels such as Sobel operators to approximate \frac{\partial I}{\partial x} and \frac{\partial I}{\partial y}, while the temporal gradient \frac{\partial I}{\partial t} is often obtained via finite differences between consecutive frames. These approximations introduce sensitivity to noise, necessitating preprocessing like Gaussian smoothing. To address the underconstrained nature of the single-frame constraint, extensions incorporate multi-frame information, such as temporal coherence models that integrate data across several frames to provide additional equations and improve solvability. For example, sequential estimation using Kalman filtering can propagate flow estimates over time, reducing ambiguities by leveraging the diversity of temporal gradients across frames. Despite its foundational role, the constraint has limitations, failing under large displacements where the small motion assumption does not hold, changes in illumination that violate brightness constancy, or occlusions that introduce discontinuities in the flow field. These issues highlight the need for robust extensions in practical applications, though the core equation remains a cornerstone for instantaneous motion estimation.
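
A minimal NumPy sketch (an assumption for illustration; the gradients here use simple finite differences rather than Sobel kernels) demonstrates the constraint and the aperture problem: from the gradients of a synthetic intensity ramp shifted by one pixel, only the "normal flow" component along the gradient direction can be recovered.

    import numpy as np

    def gradients(frame0, frame1):
        Ix = np.gradient(frame0, axis=1)   # spatial derivative in x
        Iy = np.gradient(frame0, axis=0)   # spatial derivative in y
        It = frame1 - frame0               # temporal derivative (finite difference)
        return Ix, Iy, It

    def normal_flow(Ix, Iy, It, eps=1e-6):
        # Component of flow along the local gradient; the tangential part stays ambiguous.
        mag2 = Ix**2 + Iy**2 + eps
        return -It * Ix / mag2, -It * Iy / mag2

    # Synthetic example: an intensity ramp shifted right by one pixel between frames.
    x = np.arange(64, dtype=float)
    frame0 = np.tile(x, (64, 1))           # intensity varies along x only
    frame1 = np.tile(x - 1.0, (64, 1))     # same pattern displaced by +1 pixel in x
    Ix, Iy, It = gradients(frame0, frame1)
    un, vn = normal_flow(Ix, Iy, It)
    print(un[32, 32], vn[32, 32])          # approximately 1.0 and 0.0 (only the normal component)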

Algorithms

Intensity-Based Methods

Intensity-based methods estimate dense motion fields by directly utilizing pixel intensities from consecutive image frames, relying on the assumption of brightness constancy to solve the optical flow constraint equation. These approaches compute motion vectors for every pixel, producing a complete flow field suitable for applications requiring detailed motion information, unlike sparse methods that focus on select features. The aperture problem, inherent to local intensity changes, is addressed through either global smoothness assumptions or local averaging within windows. Global methods, such as the Horn-Schunck algorithm, formulate motion estimation as a variational problem that minimizes a global energy functional combining data fidelity and smoothness terms. The energy is defined as E = \int \left( (I_x u + I_y v + I_t)^2 + \alpha (|\nabla u|^2 + |\nabla v|^2) \right) \, dx \, dy, where I_x, I_y, I_t are the spatial and temporal intensity derivatives, u and v are the horizontal and vertical flow components, and \alpha controls the smoothness penalty. This functional is solved using the Euler-Lagrange equations, yielding a system of partial differential equations that enforce neighboring flow consistency to resolve ambiguities from the aperture problem. The resulting dense flow is smooth but can oversmooth discontinuities at motion boundaries. Local methods, exemplified by the Lucas-Kanade algorithm, estimate motion within small spatial windows by assuming constant flow across the region and solving a least-squares problem. For a window around pixel (x, y), the flow (u, v) is computed as \begin{pmatrix} u \\ v \end{pmatrix} = (A^T A)^{-1} A^T b, where A is the matrix of spatial gradients I_x and I_y for pixels in the window, and b contains the negative temporal derivatives -I_t. This approach provides robustness to noise through averaging but requires careful window size selection: smaller windows capture fine details and handle occlusions better, while larger ones improve stability for low-texture areas at the cost of blurring motion edges. The method assumes small displacements, limiting its applicability to sub-pixel motions without extensions. Variational frameworks extend these ideas by incorporating advanced regularizers, such as total variation (TV) terms, to enhance robustness against noise and outliers while preserving flow discontinuities. The TV-L1 model replaces the quadratic data term with a robust L1 norm and uses TV regularization on the flow gradients, formulated as minimizing \int |I_x u + I_y v + I_t| + \lambda |\nabla \mathbf{w}| \, dx \, dy, where \mathbf{w} = (u, v) and \lambda balances data fidelity and regularity. This duality-based optimization yields piecewise-smooth flows resilient to illumination variations and sparse errors, outperforming quadratic penalties in textured scenes. Computationally, global methods like Horn-Schunck employ iterative solvers such as Gauss-Seidel relaxation to approximate the Euler-Lagrange solutions, converging in tens of iterations for typical image sizes but scaling poorly with resolution. Local methods like Lucas-Kanade are faster, requiring only small matrix inversions per window, and often use multi-resolution pyramids to handle larger motions by estimating coarse flows first and refining at finer levels. These techniques enable near-real-time processing on modern hardware for typical video resolutions. Intensity-based methods produce dense flow fields valuable for applications like synthetic aperture radar (SAR) imaging, where they estimate glacier surface motion from SAR intensity images with sub-pixel accuracy under varying speckle noise.
However, they remain sensitive to illumination changes, which violate the brightness constancy assumption and introduce errors in the data term, necessitating robust variants for outdoor scenes. In contrast to feature-based approaches, their use of all pixels yields comprehensive coverage but at higher computational cost.
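
The local least-squares solve described above can be sketched in a few lines of Python with NumPy (an illustrative assumption; the window size, smooth synthetic pattern, and single-window evaluation are simplifications rather than a full Lucas-Kanade implementation).

    import numpy as np

    def lucas_kanade_at(frame0, frame1, y, x, win=7):
        # Estimate (u, v) for the window centred at (y, x): least-squares solution of the
        # stacked constraint equations, equivalent to (A^T A)^{-1} A^T b.
        r = win // 2
        Ix = np.gradient(frame0, axis=1)[y - r:y + r + 1, x - r:x + r + 1].ravel()
        Iy = np.gradient(frame0, axis=0)[y - r:y + r + 1, x - r:x + r + 1].ravel()
        It = (frame1 - frame0)[y - r:y + r + 1, x - r:x + r + 1].ravel()
        A = np.stack([Ix, Iy], axis=1)                 # spatial gradients per window pixel
        b = -It                                        # negative temporal derivatives
        flow, *_ = np.linalg.lstsq(A, b, rcond=None)
        return flow                                    # [u, v]

    # Smooth synthetic pattern shifted one pixel to the right between frames.
    yy, xx = np.mgrid[0:64, 0:64].astype(float)
    frame0 = np.sin(xx / 6.0) + np.cos(yy / 7.0)
    frame1 = np.sin((xx - 1.0) / 6.0) + np.cos(yy / 7.0)
    print(lucas_kanade_at(frame0, frame1, 32, 32))     # approximately [1.0, 0.0]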

Feature-Based Methods

Feature-based methods in motion estimation focus on identifying and tracking distinct keypoints or sparse regions within frames to estimate motion vectors, prioritizing computational efficiency and robustness to occlusions over the dense pixel-wise analysis of intensity-based approaches. These techniques typically divide the image into blocks or select features based on local structural properties, then match them across frames using similarity metrics. By operating on a limited set of features—often numbering in the hundreds rather than thousands of pixels—these methods achieve lower complexity while effectively capturing dominant motions in scenes with structured elements like edges or corners. One foundational approach is block matching, where the image is partitioned into fixed-size blocks, typically 16×16 pixels, and each block in the current frame is compared to candidate blocks in a search window of the reference frame to find the best match. The similarity is commonly measured by the sum of absolute differences (SAD), defined as \sum |I(x) - I'(x + d)| over block pixels, where I and I' are the reference and current frame intensities, and d is the displacement vector minimizing SAD. Exhaustive search evaluates all positions in the search window for the minimum SAD, providing high accuracy but at a computational cost of O(B \cdot W^2) per frame, where B is the number of blocks and W the search window size; this full-search block matching forms the basis of motion compensation in standards like MPEG-1. To accelerate this, fast search patterns reduce evaluations: the three-step search starts with a coarse grid of nine points spaced by half the window size, refines to a smaller step, and ends with a fine search around the minimum, typically requiring about 25 checks versus 225 for exhaustive search on a ±7 pixel window. Similarly, the diamond search applies a large diamond pattern iteratively until the minimum falls at its center, followed by a small diamond refinement, achieving up to 22% fewer computations than three-step search while maintaining comparable accuracy in MPEG video coding. Feature tracking methods, such as the Kanade-Lucas-Tomasi (KLT) tracker, select and follow individual keypoints across frames using local optimization. Feature selection relies on the structure tensor, a 2×2 matrix G = \sum_w \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}, where I_x, I_y are spatial gradients weighted over a window w; good features are those with both eigenvalues of G above a threshold, ensuring corner-like stability under small translations. The tracker then estimates the motion of each feature by solving the least-squares system derived from the Lucas-Kanade constraint I_x u + I_y v + I_t = 0, where (u, v) are flow components and I_t the temporal derivative, iteratively warping the template to minimize residual errors. For subpixel accuracy in sparse tracking, the inverse compositional variant optimizes the warp parameters by composing updates inversely on the template, avoiding gradient recomputation per iteration and enabling efficient affine or translational models with convergence in 5-10 steps. Descriptor-based methods enhance matching robustness by extracting invariant feature descriptors at keypoints, followed by correspondence estimation. The scale-invariant feature transform (SIFT) detects keypoints at scale-space extrema and describes them with 128-dimensional histograms of oriented gradients, invariant to scale and rotation; matches are found via nearest-neighbor search in descriptor space, often using Lowe's ratio test to discard ambiguous pairs (distance to closest over second-closest < 0.8).
For faster alternatives, Oriented FAST and Rotated BRIEF (ORB) combines FAST corner detection with binary BRIEF descriptors steered to the keypoint orientation, enabling Hamming-distance matching roughly two orders of magnitude faster than SIFT with similar accuracy on rotated images. To fit a motion model from these correspondences amid outliers (e.g., up to 50% from repetitive textures), RANSAC randomly samples minimal sets (e.g., a single correspondence for pure translation, three for an affine model), fits the model, counts inliers within a threshold, and selects the largest consensus set, iterating over hundreds to thousands of trials for robust estimation. These methods excel in handling large displacements by focusing on distinctive features less prone to aperture problems, unlike dense intensity-based techniques that assume global smoothness. Their complexity scales linearly as O(N) with the number of features N, making them suitable for applications like video stabilization, where N \approx 200-500 features suffice for accurate ego-motion recovery.
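
A short Python/NumPy sketch of full-search block matching with the SAD criterion described above follows; the block size, search range, and synthetic shifted frames are assumptions chosen only to make the example self-contained.

    import numpy as np

    def sad(block_a, block_b):
        return np.abs(block_a - block_b).sum()

    def block_match(ref, cur, y, x, block=16, search=7):
        # Exhaustive search for the displacement d minimizing SAD between the current
        # block at (y, x) and candidate blocks in the reference frame.
        cur_block = cur[y:y + block, x:x + block]
        best, best_d = np.inf, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = y + dy, x + dx
                if 0 <= ry <= ref.shape[0] - block and 0 <= rx <= ref.shape[1] - block:
                    cost = sad(cur_block, ref[ry:ry + block, rx:rx + block])
                    if cost < best:
                        best, best_d = cost, (dy, dx)
        return best_d, best

    rng = np.random.default_rng(1)
    ref = rng.random((64, 64))
    cur = np.roll(ref, shift=(2, 3), axis=(0, 1))   # frame contents shifted down 2, right 3
    print(block_match(ref, cur, 24, 24))            # expect ((-2, -3), 0.0): the best-matching reference block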

Deep Learning Methods

Deep learning methods have revolutionized motion estimation by enabling end-to-end learning of dense optical flow from image pairs, surpassing traditional hand-crafted approaches in generalization and accuracy on diverse datasets. These methods typically leverage convolutional neural networks (CNNs) or transformers to predict pixel-wise motion vectors, often trained with losses inspired by the brightness constancy constraint to minimize photometric errors between frames. A seminal CNN-based approach is FlowNet, introduced in 2015, which employs a two-stream architecture to extract features from consecutive frames and computes a correlation volume across the feature maps for end-to-end flow prediction. This correlation layer enables the network to match corresponding pixels efficiently without explicit search, achieving near-real-time performance on GPUs. FlowNet is trained in a supervised manner on synthetic datasets such as Flying Chairs, which simulate realistic motion with rendered chair sequences to generate ground-truth flow labels. Subsequent improvements, like FlowNet 2.0, stacked multiple networks for refined estimates, reducing end-point error (EPE) on benchmarks like Sintel by integrating coarse-to-fine processing. RAFT (Recurrent All-Pairs Field Transforms), introduced in 2020, exemplifies iterative refinement through a gated recurrent unit (GRU) that updates flow estimates over multiple iterations, while unsupervised variants address the need for labeled data by relying on photometric and smoothness losses. RAFT constructs all-pairs correlation volumes at multiple pyramid levels to handle large displacements, and occlusion handling is commonly added via forward-backward consistency checks, where inconsistent flows between forward and backward predictions are masked. This approach achieves state-of-the-art EPE on benchmarks such as KITTI and Sintel (under 2 pixels on the clean pass), demonstrating robust generalization across datasets. Transformer-based models extend these capabilities by capturing long-range dependencies, as seen in GMA (Global Motion Aggregation) from 2021, which integrates a transformer-based attention module to aggregate global motion cues from the first frame into local flow predictions. GMA's attention mechanism reasons over occluded regions by attending to visible pixels with similar motions, improving EPE in challenging areas like the Sintel final pass by up to 20% over baselines. This enables better handling of complex scenes with occlusions and non-rigid motions. Hybrid approaches combine learned components with classical elements, such as RAFT's multi-scale correlation processing, where multi-resolution correlation volumes mimic traditional coarse-to-fine strategies to capture both small and large motions efficiently. By 2025, trends emphasize lightweight models for edge devices, like EdgeFlowNet, a network tailored for tiny mobile robots that delivers dense optical flow at roughly 100 FPS and about 1 W of power on hardware like the Coral Edge TPU, with EPE competitive on Sintel (6.53 pixels). Integration with diffusion models for generative motion estimation has also emerged, as in GENMO, which combines regression with diffusion processes to produce diverse yet accurate human motion estimates from sparse inputs, advancing applications in animation and virtual reality. These developments support real-time motion estimation in resource-constrained environments.
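
The photometric and smoothness losses that unsupervised variants rely on can be illustrated with a small NumPy sketch (an assumption for clarity: the network is omitted entirely, the "predicted" flow is a placeholder array, and the bilinear warping is a minimal implementation rather than any particular framework's operator).

    import numpy as np

    def backward_warp(img, flow):
        # Sample img at (x + u, y + v) with bilinear interpolation.
        h, w = img.shape
        yy, xx = np.mgrid[0:h, 0:w].astype(float)
        xs = np.clip(xx + flow[..., 0], 0, w - 1)
        ys = np.clip(yy + flow[..., 1], 0, h - 1)
        x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
        x1, y1 = np.clip(x0 + 1, 0, w - 1), np.clip(y0 + 1, 0, h - 1)
        wx, wy = xs - x0, ys - y0
        top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
        bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
        return top * (1 - wy) + bot * wy

    def unsupervised_flow_loss(frame1, frame2, flow, smooth_weight=0.1):
        photometric = np.abs(frame1 - backward_warp(frame2, flow)).mean()   # brightness constancy term
        smoothness = (np.abs(np.diff(flow, axis=0)).mean()
                      + np.abs(np.diff(flow, axis=1)).mean())               # first-order smoothness term
        return photometric + smooth_weight * smoothness

    rng = np.random.default_rng(2)
    frame2 = rng.random((32, 32))
    true_flow = np.full((32, 32, 2), [1.5, 0.0])                # ground-truth shift of 1.5 px in x
    frame1 = backward_warp(frame2, true_flow)
    print(unsupervised_flow_loss(frame1, frame2, true_flow))    # low loss for the correct flow
    print(unsupervised_flow_loss(frame1, frame2, np.zeros((32, 32, 2))))  # higher loss for zero flow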

Advanced Techniques

Affine Motion Estimation

Affine motion estimation involves computing the six parameters of an affine transformation—two for translation and four for the linear part capturing rotation, scaling, and shear—from sparse point correspondences between images, enabling the modeling of more complex deformations than pure translation. This approach assumes a linear relationship between corresponding points (x_i, y_i) and (x_i', y_i'), formulated as: \begin{align*} x_i' &= a x_i + b y_i + t_x, \\ y_i' &= c x_i + d y_i + t_y, \end{align*} where a, b, c, d capture rotation, scaling, and shear, while t_x, t_y represent translation. Estimation typically employs least-squares optimization to minimize the sum of squared residuals over N correspondences: \min_{a,b,c,d,t_x,t_y} \sum_{i=1}^N \left[ (x_i' - (a x_i + b y_i + t_x))^2 + (y_i' - (c x_i + d y_i + t_y))^2 \right]. This problem is linear in the parameters and can be solved directly using the normal equations or the pseudoinverse of the design matrix. To improve numerical conditioning, the points can be centered by subtracting their centroids, which decouples the translation and allows solving for the linear part first. For multiple correspondences forming an overdetermined system A \mathbf{p} = \mathbf{b}, where \mathbf{p} = [a, b, c, d, t_x, t_y]^T, the solution is the least-squares minimizer \mathbf{p} = (A^T A)^{-1} A^T \mathbf{b}. Robust variants address outliers in correspondences using RANSAC, which iteratively samples minimal sets (three non-collinear points for an affine transformation) to hypothesize transformations, evaluating consensus via inlier counts before refining with least-squares on the largest set. The direct linear transformation (DLT) provides an efficient linear solver for the affine system, constructing a constraint matrix from correspondences and applying singular value decomposition to solve the homogeneous form, though it requires normalization to avoid numerical instability. Degenerate configurations, such as collinear points, lead to rank-deficient systems where the affine matrix cannot be uniquely determined, as multiple transformations map the line identically; detection involves checking the condition number via the SVD or requiring at least three non-collinear points. In tracking applications, affine-invariant features like ASIFT enhance robustness to viewpoint changes by simulating possible affine distortions during feature extraction, allowing reliable matching under rotation, scaling, and tilt for sustained object tracking across frames. Evaluation often quantifies accuracy via parameter deviation metrics, such as the distance between estimated and ground-truth parameter vectors, or endpoint error on held-out points, with applications in image registration demonstrating sub-pixel precision on synthetic datasets.
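
The linear least-squares fit described above reduces to stacking two equations per correspondence and solving the resulting system; the following Python/NumPy sketch (with synthetic correspondences and arbitrary ground-truth parameters, both assumptions for illustration) shows the construction.

    import numpy as np

    def fit_affine(src, dst):
        # Solve A p = b for p = [a, b, c, d, t_x, t_y] from Nx2 correspondences
        # (requires N >= 3 non-collinear points to avoid a rank-deficient system).
        n = len(src)
        A = np.zeros((2 * n, 6))
        b = np.zeros(2 * n)
        A[0::2, 0], A[0::2, 1], A[0::2, 4] = src[:, 0], src[:, 1], 1.0   # x' = a x + b y + t_x
        A[1::2, 2], A[1::2, 3], A[1::2, 5] = src[:, 0], src[:, 1], 1.0   # y' = c x + d y + t_y
        b[0::2], b[1::2] = dst[:, 0], dst[:, 1]
        p, *_ = np.linalg.lstsq(A, b, rcond=None)
        return p

    rng = np.random.default_rng(3)
    src = rng.random((20, 2)) * 100
    true = np.array([0.9, -0.2, 0.1, 1.1, 5.0, -3.0])            # a, b, c, d, t_x, t_y
    dst = np.stack([true[0] * src[:, 0] + true[1] * src[:, 1] + true[4],
                    true[2] * src[:, 0] + true[3] * src[:, 1] + true[5]], axis=1)
    dst += rng.normal(scale=0.05, size=dst.shape)                # small correspondence noise
    print(fit_affine(src, dst))                                  # close to the true parameter vector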

Multi-Resolution Approaches

Multi-resolution approaches in motion estimation employ hierarchical image representations to address challenges posed by large displacements between frames, enabling more robust and efficient computation. These methods decompose the input images into multiple scales, typically using Gaussian or Laplacian pyramids, where coarser levels capture global motion patterns while finer levels refine local details. The Gaussian pyramid is constructed by successively low-pass filtering and subsampling the image, creating a series of reduced-resolution versions that approximate the original at decreasing spatial frequencies. In contrast, the Laplacian pyramid encodes band-pass filtered details at each scale, facilitating the propagation of high-frequency information during refinement. This coarse-to-fine strategy, popularized in hierarchical model-based motion estimation, warps intermediate pyramid levels to align images and propagate flow estimates upward, significantly improving convergence for scenarios with substantial motion. In implementation, the process begins at the coarsest pyramid level, where motion is estimated at reduced resolution to capture broad displacements, often assuming affine or translational models for global consistency. This initial estimate is then upsampled to the next finer level and used to initialize a local refinement step, such as iterative optimization within small windows. The upsampling typically scales the coarser flow vector \mathbf{u}_{l+1} by a factor of 2 (matching the pyramid's subsampling rate) and adds a correction term \Delta \mathbf{u}_l computed at the current level l, formalized as: \mathbf{u}_l = 2 \mathbf{u}_{l+1} + \Delta \mathbf{u}_l. This propagation stabilizes the search by constraining the refinement to small residuals, integrating seamlessly with intensity-based methods like Lucas-Kanade optical flow or block matching algorithms. The primary benefits of multi-resolution approaches include a drastic reduction in the search space at coarser scales, which mitigates the computational burden of exhaustive matching and enhances efficiency through subsampling—often achieving speedups of 4-8 times per level compared to single-scale methods. Additionally, by estimating large motions globally first, these techniques avoid entrapment in local minima, leading to more accurate flows in complex scenes with occlusions or rapid changes. For instance, in video sequences with fast camera panning, the hierarchical refinement preserves structural coherence that single-resolution estimators frequently lose. Variants extend the classic pyramid framework, such as overcomplete representations that maintain overlapping scales for smoother transitions, or learned multi-scale hierarchies in models like PWC-Net, which incorporate pyramidal feature processing with warping and cost volumes to achieve state-of-the-art accuracy on benchmarks like MPI Sintel, with end-point errors around 10% lower than some prior methods. These adaptations preserve the core coarse-to-fine paradigm while adapting to modern neural architectures. Despite these advantages, multi-resolution approaches can introduce artifacts from quantization errors at coarse scales, where subsampling blurs fine details and may propagate inaccuracies, particularly in handling fast camera motion exceeding the pyramid's displacement limits—resulting in aliased flows or failure to capture sub-pixel motion in high-speed scenarios. Such limitations underscore the need for careful pyramid depth selection, typically 3-5 levels, to balance accuracy and robustness.
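
The coarse-to-fine propagation rule u_l = 2 u_{l+1} + Δu_l can be sketched for a single global translation in Python with NumPy (simplifying assumptions: plain 2×2-average downsampling instead of Gaussian filtering, an integer ±2-pixel SAD search as the per-level refinement, and synthetic frames related by a known shift).

    import numpy as np

    def downsample(img):
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    def shift(img, dy, dx):
        return np.roll(img, shift=(dy, dx), axis=(0, 1))

    def refine(ref, cur, dy, dx, radius=2):
        # Small correction search (SAD criterion) around the propagated estimate.
        candidates = ((np.abs(cur - shift(ref, dy + ddy, dx + ddx)).sum(), ddy, ddx)
                      for ddy in range(-radius, radius + 1)
                      for ddx in range(-radius, radius + 1))
        _, best_dy, best_dx = min(candidates)
        return dy + best_dy, dx + best_dx

    def pyramid_translation(ref, cur, levels=4):
        pyr = [(ref, cur)]
        for _ in range(levels - 1):
            pyr.append((downsample(pyr[-1][0]), downsample(pyr[-1][1])))
        dy = dx = 0
        for r, c in reversed(pyr):                  # coarsest level first
            dy, dx = refine(r, c, 2 * dy, 2 * dx)   # u_l = 2 u_{l+1} + delta u_l
        return dy, dx

    rng = np.random.default_rng(4)
    ref = rng.random((128, 128))
    cur = shift(ref, 9, -6)                          # a large displacement handled via the pyramid
    print(pyramid_translation(ref, cur))             # expect (9, -6)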

Applications

Video Coding

Motion estimation plays a pivotal role in video coding by exploiting temporal redundancy between consecutive frames, enabling efficient compression through predictive modeling of pixel displacements. In hybrid video codecs, it forms the basis for inter-frame prediction, where motion vectors describe block movements to reconstruct current frames from reference frames, significantly reducing the bitrate required for storage or transmission. This process is central to standards developed by the Joint Video Experts Team (JVET) and its predecessors, achieving substantial gains in coding efficiency. Block-based prediction is the cornerstone of motion estimation in modern video coding standards such as H.264/AVC and HEVC (H.265). Frames are partitioned into macroblocks or coding units—typically 16×16 pixels in H.264/AVC and variable sizes up to 64×64 in HEVC—and a motion vector is estimated for each to minimize the residual error when predicting from a reference frame. To enhance accuracy, these standards support sub-pixel precision, particularly quarter-pixel accuracy achieved via bilinear or Wiener interpolation filters on reference samples. This fractional resolution allows finer motion compensation, improving prediction quality especially for smooth or non-integer movements. Search strategies for motion vectors balance computational cost and accuracy, with full search exhaustively evaluating all candidate positions within a search window but at high cost, while fast algorithms like TZSearch in HEVC's reference software reduce evaluations through zonal patterns and early termination. These methods integrate rate-distortion optimization (RDO) to select the best vector and mode, minimizing the Lagrangian cost function J = D + \lambda R, where D represents distortion (e.g., sum of squared differences), R is the bitrate for encoding the vector and residual, and \lambda is a Lagrange multiplier trading off distortion against rate. RDO ensures decisions align with overall compression goals, often yielding 10-20% bitrate savings in inter prediction. Prediction modes further refine motion estimation: unidirectional modes use forward or backward prediction from one reference frame (P-frames in H.264/AVC), while bidirectional modes in B-frames combine two references for better prediction in scenes with occlusions or reversals. Both standards support multiple reference frames—up to 16 in HEVC—allowing selection from a list to capture longer-term dependencies, which can improve coding efficiency by 5-15% in low-motion sequences. The evolution of motion estimation traces from MPEG-2 (H.262) in the 1990s, which introduced block-based compensation with half-pixel accuracy, to H.264/AVC (2003), which enhanced it with quarter-pixel precision and variable block shapes for about 50% bitrate reduction over MPEG-2. HEVC (2013) extended this with larger blocks and TZSearch, roughly doubling efficiency over H.264/AVC. VVC (H.266, 2020) incorporates affine motion models for intra-block variations, treating rotation and scaling alongside translation via 4- or 6-parameter transforms on 4×4 sub-blocks, yielding significant further bitrate savings for complex motions like camera pans and zooms. These advancements have enabled 4K/8K streaming at viable bitrates. Despite these gains, challenges persist, notably the bitrate overhead from transmitting motion vectors, which can consume 10-20% of the total bitrate in high-motion or fine-grained scenarios, necessitating advanced entropy coding like CABAC to mitigate it. In streaming services, this overhead impacts adaptive bitrate delivery, prompting optimizations like motion vector merging in VVC to reduce signaling for similar neighboring blocks.
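
The Lagrangian selection J = D + λR can be illustrated with a short Python/NumPy sketch (all of the following are assumptions for the demo rather than any standard's normative procedure: SSD as the distortion measure, a rough exponential-Golomb-style bit estimate for the motion-vector difference, and an arbitrary λ).

    import numpy as np

    def ssd(a, b):
        return float(((a - b) ** 2).sum())

    def mv_bits(mv, pred):
        # Rough bit cost of signalling the motion-vector difference from its predictor.
        return sum(2 * int(np.floor(np.log2(abs(d) + 1))) + 1
                   for d in (mv[0] - pred[0], mv[1] - pred[1]))

    def best_vector(ref, cur, y, x, pred, lam=10.0, block=16, search=8):
        # Pick the candidate minimizing the Lagrangian cost J = D + lambda * R.
        cur_block = cur[y:y + block, x:x + block]
        best = (np.inf, (0, 0))
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = y + dy, x + dx
                if 0 <= ry <= ref.shape[0] - block and 0 <= rx <= ref.shape[1] - block:
                    J = ssd(cur_block, ref[ry:ry + block, rx:rx + block]) + lam * mv_bits((dy, dx), pred)
                    if J < best[0]:
                        best = (J, (dy, dx))
        return best

    rng = np.random.default_rng(5)
    ref = rng.random((64, 64)) * 255
    cur = np.roll(ref, shift=(0, 4), axis=(0, 1)) + rng.normal(scale=1.0, size=ref.shape)
    print(best_vector(ref, cur, 16, 16, pred=(0, 0)))   # a vector near (0, -4) balancing distortion and rate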

3D Reconstruction

Motion estimation plays a crucial role in 3D reconstruction by enabling the recovery of camera poses and scene structure from sequences of 2D images, primarily through the structure-from-motion (SfM) pipeline. This process begins with feature matching across multiple views to establish correspondences between images, leveraging techniques from feature-based methods to identify and track keypoints such as corners or blobs. From these correspondences, the fundamental matrix F is estimated, which encodes the epipolar geometry relating points in two images; assuming known camera intrinsics K, the essential matrix E is then derived via E = K^T F K, capturing the relative rotation and translation up to scale between views. Triangulation follows, projecting the matched features back into 3D space using the camera poses decomposed from E to initialize a sparse point cloud representing the scene structure. To refine the initial estimates, bundle adjustment performs a joint optimization over all camera poses P and 3D points X_i, minimizing the reprojection error defined as \sum_i \| x_i - \pi(P, X_i) \|^2, where \pi is the projection function and x_i are the observed image points. This non-linear least-squares problem, often solved using Levenberg-Marquardt, ensures consistency across the entire reconstruction, significantly improving accuracy in the presence of noise or outliers. For dense reconstruction beyond the sparse SfM output, multi-view stereo extends the model by computing depth for most pixels, employing methods like patch matching to evaluate photo-consistency across views and expansion to propagate correspondences; a seminal approach uses adaptive patch-based evaluation to generate quasi-dense surface models visible in the input images. In scenarios with planar scenes, such as facades or documents, affine or homography approximations simplify motion estimation by directly decomposing the plane-induced homography matrix into rotation and translation components, avoiding full structure recovery when depth variation is minimal. This is particularly useful for initial pose estimation in restricted environments. Practical implementations, like the COLMAP software, integrate these SfM steps into an end-to-end pipeline for robust reconstruction from unordered image collections, supporting both sparse and dense outputs. In cultural heritage documentation, drone-based SfM has been applied to archaeological and historical sites in the 2020s, generating detailed models of monuments from aerial imagery to aid preservation and virtual tourism.
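
Two quantities from the pipeline above—the essential matrix E = K^T F K and the reprojection error minimized by bundle adjustment—are shown in the following Python/NumPy sketch; the intrinsics, camera pose, 3D points, and half-pixel perturbation are made-up values for illustration only.

    import numpy as np

    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])            # assumed pinhole intrinsics

    def essential_from_fundamental(F, K):
        return K.T @ F @ K

    def project(P, X):
        # Project homogeneous 3D points X (Nx4) with a 3x4 camera matrix P.
        x = X @ P.T
        return x[:, :2] / x[:, 2:3]

    def reprojection_error(P, X, observed):
        # Sum of squared distances between observed 2D points and projections pi(P, X).
        return float(((observed - project(P, X)) ** 2).sum())

    # A camera at the origin looking down +Z, and a few 3D points in front of it.
    P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    X = np.array([[0.0, 0.0, 5.0, 1.0],
                  [1.0, -0.5, 6.0, 1.0],
                  [-2.0, 1.0, 8.0, 1.0]])
    observed = project(P, X) + 0.5               # observations perturbed by half a pixel
    print(reprojection_error(P, X, observed))    # the quantity bundle adjustment drives down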

Robotics and Surveillance

In robotics, motion estimation plays a critical role in visual odometry (VO), which estimates the ego-motion of a robot using sequential camera images to enable navigation in dynamic environments. Direct methods, such as those aligning pixel intensities between frames, and geometric approaches like iterative closest point (ICP) for point-cloud registration, are commonly employed to compute relative poses with high accuracy in real-time settings. A prominent example is ORB-SLAM3, an open-source visual-inertial system that integrates IMU data for robust ego-motion estimation, achieving low drift through tightly coupled optimization of visual and inertial measurements. Loop closure detection further enhances VO by identifying revisited locations, allowing global pose corrections via pose-graph optimization to mitigate cumulative errors in long trajectories. In surveillance applications, motion estimation facilitates multi-object tracking by estimating trajectories from video feeds, often fusing motion vectors—derived from optical flow or block matching—with predictive models to maintain track continuity. The Kalman filter is widely used for this fusion, predicting object states based on constant-velocity assumptions and updating with detected motion vectors to handle multiple targets in cluttered scenes. Occlusions, a common challenge in surveillance, are addressed through re-identification techniques that match object appearances across frames using deep features, enabling track recovery post-obstruction without relying solely on motion continuity. Real-time constraints in these domains demand low-latency motion estimation to support immediate decision-making, achieved through GPU-accelerated models optimized for embedded hardware. Lightweight convolutional networks, such as quantized variants of FlowNet or MobileNet-based trackers, process optical flow or feature correspondences at high frame rates on embedded platforms, enabling robust estimation under resource limitations. Such designs enhance robustness to lighting variations and partial occlusions, leveraging multi-resolution processing for large-scale scenes when necessary. Applications in autonomous vehicles utilize visual odometry for precise localization, with benchmarks on the KITTI dataset evaluating translational and rotational errors, where state-of-the-art systems achieve average drifts below 1% over sequences of several kilometers. In security cameras, abnormal motion patterns signal anomalies like intrusions, with detection models reconstructing expected flows to flag deviations in real time. Key metrics include tracking success rates exceeding 80% in multi-object scenarios and drift limited to 0.5-2% in extended runs, as demonstrated in DARPA Urban Challenge (2007) and Subterranean (SubT) Challenge (2019-2021) evaluations of robotic autonomy under unstructured conditions.
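
The constant-velocity Kalman filter fusion mentioned above can be sketched briefly in Python with NumPy; the state layout, noise covariances, and simulated detections are illustrative assumptions rather than parameters of any deployed tracker.

    import numpy as np

    dt = 1.0
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)     # state: [x, y, vx, vy], constant-velocity model
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)     # only position is measured
    Q = np.eye(4) * 0.01                          # process noise covariance
    R = np.eye(2) * 1.0                           # measurement noise covariance

    def kf_step(x, P, z):
        # Predict with the constant-velocity model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with the measured position z (e.g., derived from motion vectors).
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x, P

    rng = np.random.default_rng(6)
    x, P = np.zeros(4), np.eye(4) * 10.0
    for t in range(1, 21):
        z = np.array([2.0 * t, 1.0 * t]) + rng.normal(scale=1.0, size=2)   # noisy detections
        x, P = kf_step(x, P, z)
    print(x)   # estimated position and velocity, close to (40, 20) and (2, 1)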

Challenges

Limitations in Real-World Scenarios

Motion estimation algorithms encounter significant challenges in real-world scenarios due to occlusions and disocclusions, where parts of objects are temporarily hidden or newly revealed during motion. In such cases, motion vectors become unreliable at object boundaries because matching correspondences fail, leading to erroneous flow estimates in affected regions. A common detection method involves forward-backward error analysis, which identifies occlusions by comparing the forward-warped positions from one frame to the next with backward-warped positions, flagging inconsistencies as occluded areas. Disocclusions, often arising from dynamic object movements, exacerbate this issue by introducing newly visible regions without prior matching points, further degrading estimation accuracy in video sequences. Illumination variations pose another critical limitation by violating the fundamental brightness constancy assumption underlying many optical flow methods, which posits that pixel intensity remains unchanged under motion. Changes in lighting, such as shadows or global illumination shifts, introduce mismatches in intensity-based similarity measures, resulting in inaccurate displacement estimates. To mitigate this, normalized cross-correlation is employed as a robust similarity metric, which normalizes patch intensities to reduce sensitivity to linear illumination changes, though it does not fully resolve non-linear variations. Large or non-rigid motions amplify the aperture problem, where local gradient-based methods can only reliably estimate motion components perpendicular to edges, leading to ambiguous directions parallel to contours and higher failure rates overall. In non-rigid scenarios, such as deformable objects, this ambiguity propagates, causing widespread estimation errors. For instance, on the Middlebury optical flow dataset, algorithms exhibit endpoint error rates exceeding 20% in sequences featuring large displacements and outdoor scenes with complex, non-rigid elements like waving flags or moving crowds. Noise and aliasing further impair motion estimation by corrupting gradient computations essential to differential techniques, introducing spurious local minima in the optimization process and biasing flow vectors. Image noise amplifies uncertainties in spatial and temporal derivatives, while aliasing from undersampling high-frequency motions distorts gradient accuracy, particularly in low-texture areas. In robotics applications, these effects are quantified using absolute trajectory error (ATE), where noisy estimates can increase ATE by factors of 2-5 compared to ideal conditions, as seen in visual odometry benchmarks on datasets like KITTI. Computational bottlenecks limit the practicality of motion estimation in resource-constrained environments, such as mobile devices, where dense algorithms demand high processing power for real-time performance. Without hardware acceleration like GPUs, many methods are capped at around 30 frames per second (fps) for standard resolutions, falling short for high-frame-rate or interactive applications, due to the growth of the search space and iterative optimizations.
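
The forward-backward consistency check used to flag occluded regions can be sketched in Python with NumPy (the constant synthetic flows, nearest-neighbour lookup, and 0.5-pixel threshold are assumptions for the demonstration, not output of an actual estimator).

    import numpy as np

    def warp_flow(flow, displacement):
        # Look up the backward flow at the position each pixel maps to (nearest neighbour).
        h, w, _ = flow.shape
        yy, xx = np.mgrid[0:h, 0:w]
        ys = np.clip(np.round(yy + displacement[..., 1]).astype(int), 0, h - 1)
        xs = np.clip(np.round(xx + displacement[..., 0]).astype(int), 0, w - 1)
        return flow[ys, xs]

    def occlusion_mask(forward, backward, threshold=0.5):
        # A pixel is flagged when the forward flow plus the backward flow at its target disagree.
        residual = forward + warp_flow(backward, forward)
        return np.linalg.norm(residual, axis=-1) > threshold

    h, w = 32, 32
    forward = np.full((h, w, 2), [3.0, 0.0])      # everything moves 3 px to the right
    backward = np.full((h, w, 2), [-3.0, 0.0])    # and moves back again
    backward[:, :8] = [5.0, 0.0]                  # an inconsistent strip, e.g. a disocclusion
    mask = occlusion_mask(forward, backward)
    print(mask.sum(), "pixels flagged")           # the strip whose round trip does not return (160 pixels)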

Emerging Solutions

Emerging solutions in motion estimation are addressing longstanding challenges in accuracy, robustness, and efficiency by integrating multi-sensor data, advanced learning paradigms, and uncertainty-aware frameworks, particularly as of 2025. These innovations build on classical and deep learning methods to enhance performance in dynamic environments, such as robotics and augmented reality applications. For instance, recent 2025 benchmarks such as HuPerFlow, alongside hybrid approaches that mitigate sensor-specific limitations like visual occlusions or inertial drift, demonstrate continued progress. Hybrid sensor fusion techniques combine visual data with complementary modalities like LiDAR and inertial measurement units (IMUs) to achieve more reliable motion estimation. In visual-inertial odometry (VIO) systems for AR glasses, such as those used in mobile augmented reality, camera feeds are fused with IMU accelerations and angular velocities to track head movements with high precision even under rapid rotations or low-light conditions. Graph-based optimization methods, which model sensor measurements as nodes and constraints in a factor graph, further refine these estimates by jointly optimizing poses across point clouds, visual features, and IMU biases, reducing drift errors by up to 40% in urban navigation scenarios. Similarly, Kalman filter variants, including extended and unscented forms, enable real-time fusion of GNSS, LiDAR, vision, and IMU data for robust vehicle pose estimation, maintaining accuracy within 0.5 meters in GPS-denied environments. These approaches, exemplified in systems like Super Odometry, demonstrate how multi-sensor integration enhances motion estimation for applications requiring seamless AR overlays. Self-supervised learning has gained traction for motion estimation by leveraging unlabeled data through photometric and contrastive losses, reducing reliance on costly annotations. In this paradigm, models learn representations by contrasting positive motion pairs (e.g., temporally adjacent frames) against negative ones, enabling end-to-end training for flow prediction with minimal supervision. Advances in 2025 particularly highlight event-based cameras, which use neuromorphic sensors to capture asynchronous brightness changes for high-speed motion, achieving latencies under 1 ms in dynamic scenes like drone flight. For example, EV-FlowNet employs self-supervised photometric consistency losses on event streams to estimate dense optical flow, competitive with supervised frame-based methods in high-dynamic-range conditions without labeled data. Recent extensions, such as unsupervised joint learning frameworks for event cameras and methods like ESMD for simultaneous motion and depth estimation, further integrate image reconstruction with flow estimation using contrastive objectives, facilitating deployment on resource-constrained neuromorphic hardware for real-time tracking. Uncertainty estimation in deep learning models for motion estimation is advancing through Bayesian techniques, providing probabilistic outputs crucial for safe operation. Bayesian normalizing flows model the posterior distribution over motion parameters, allowing efficient sampling of possible trajectories to quantify aleatoric and epistemic uncertainties in predictions. Monte Carlo dropout, a practical approximation to Bayesian inference, applies dropout at inference time to generate multiple predictions whose variance yields uncertainty measures; this has been applied to networks like FlowNet to flag unreliable estimates in occluded regions, enhancing reliability in applications such as robot navigation in cluttered environments.
In robotics contexts, frameworks combining evidential deep learning with Bayesian methods estimate motion uncertainties for tasks like obstacle avoidance, enabling adaptive planning that avoids high-variance paths. These methods, as surveyed in recent works, emphasize scalable Bayesian approximations to ensure trustworthy motion predictions without excessive computational overhead. Scalable architectures for edge computing are optimizing motion estimation models through quantization, enabling deployment on low-power devices like mobile robots. Quantization reduces model precision from 32-bit floats to 8-bit or lower integers, compressing parameters while preserving accuracy for real-time inference. For instance, quantization techniques applied to optical flow models, such as variants of FlowNet, can achieve substantial parameter reductions while maintaining accuracy on benchmarks like KITTI, facilitating 30+ fps operation on edge hardware with low power draw. Fusion-FlowNet exemplifies energy-aware design by fusing event-based and frame-based inputs, significantly reducing energy consumption for optical flow estimation in applications like autonomous driving while preserving accuracy. Such optimizations, including quality-scalable multipliers, help motion estimation remain viable for battery-constrained applications without retraining from scratch. Research frontiers in motion estimation explore quantum-inspired optimization for global methods and ethical considerations in surveillance applications. Quantum annealing-inspired algorithms solve combinatorial optimization problems in motion segmentation, partitioning video frames into coherent motion clusters more efficiently than classical solvers, with speedups of 10-100x on NP-hard instances using adiabatic quantum computing frameworks. These techniques extend to motion estimation by minimizing energy functions over large search spaces, promising breakthroughs in multi-object tracking. On the ethical front, bias in motion tracking for surveillance—such as racial or gender disparities in detection—raises concerns about perpetuating inequalities, with studies as of 2025 recommending diverse datasets and fairness audits to mitigate discriminatory outcomes in public monitoring systems. Frameworks for ethical surveillance emphasize transparency and accountability in tracking algorithms to balance security benefits with privacy rights, addressing biases that amplify societal inequities.
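
The Monte Carlo dropout idea referenced above can be illustrated with a toy NumPy example (everything here is an assumption for demonstration: the "flow head" is a fixed random linear layer rather than a trained network, and the inputs are arbitrary): running the stochastic forward pass many times and taking the spread of the outputs as an uncertainty estimate.

    import numpy as np

    rng = np.random.default_rng(7)
    W1 = rng.normal(size=(64, 128))      # hypothetical hidden-layer weights
    W2 = rng.normal(size=(128, 2))       # hypothetical head predicting (u, v)
    features = rng.normal(size=64)       # stand-in for features of one pixel or patch

    def stochastic_forward(x, drop_p=0.3):
        h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
        mask = rng.random(h.shape) > drop_p          # dropout kept active at inference time
        h = h * mask / (1.0 - drop_p)
        return h @ W2

    samples = np.stack([stochastic_forward(features) for _ in range(100)])
    mean_flow = samples.mean(axis=0)
    uncertainty = samples.std(axis=0)    # large values would flag unreliable estimates
    print(mean_flow, uncertainty)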

  53. [53]
    Crowdsource Drone Imagery – A Powerful Source for the 3D ...
    Dec 28, 2020 · In this paper, we propose the idea of using crowdsource drone images and videos which are captured by amateurs for the documentation of heritage sites.
  54. [54]
    Direct Iterative Closest Point for real-time visual odometry
    Abstract: In RGB-D sensor based visual odometry the goal is to estimate a sequence of camera movements using image and/or range measurements.
  55. [55]
    Understanding Iterative Closest Point (ICP) Algorithm with Code
    Apr 30, 2025 · Iterative Closest Point (ICP) is a widely used classical computer vision algorithm for 2D or 3D point cloud registration.
  56. [56]
    ORB-SLAM3: An Accurate Open-Source Library for Visual ... - arXiv
    Jul 23, 2020 · This paper presents ORB-SLAM3, the first system able to perform visual, visual-inertial and multi-map SLAM with monocular, stereo and RGB-D cameras.
  57. [57]
    Loop Closure Detection for Monocular Visual Odometry - IEEE Xplore
    In order to decrease monocular visual odometry drift by detecting loop closure, this paper presents a comparison between state of the art, 2-channel and ...
  58. [58]
    Multi-Object Tracking Using Kalman Filter and Historical Trajectory ...
    We propose a multi-object tracking using Kalman filter and historical trajectory correction for surveillance videos.
  59. [59]
    FusionSORT: Fusion Methods for Online Multi-object Visual Tracking
    May 10, 2025 · In our tracker, we use Kalman filter (KF) [8] with a constant-velocity model for motion estimation of object tracklets in the image plane, ...
  60. [60]
    Visual multi-object tracking with re-identification and occlusion ...
    This paper proposes an online visual multi-object tracking (MOT) algorithm that resolves object appearance–reappearance and occlusion.
  61. [61]
    Deep Learning Based Real-Time Object Detection on Jetson Nano ...
    Aug 6, 2025 · Deep Learning Based Real-Time Object Detection on Jetson Nano Embedded GPU. June 2023; Lecture Notes in Electrical Engineering. DOI:10.1007/978 ...
  62. [62]
    Vision-Based Embedded System for Noncontact Monitoring ... - arXiv
    Sep 2, 2025 · We introduce an embedded monitoring system that utilizes a quantized MobileNet model deployed on a Raspberry Pi for real-time behavioral state ...
  63. [63]
    Kitti Odometry Dataset - Andreas Geiger
    For this benchmark you may provide results using monocular or stereo visual odometry, laser-based SLAM or algorithms that combine visual and LIDAR information.
  64. [64]
    An unsupervised video anomaly detection method via Optical Flow ...
    Our proposed method, OFST, combines optical flow reconstruction and video frame prediction to improve video anomaly detection. OFST is composed of two modules, ...
  65. [65]
    Statistical Modeling of Long-Range Drift in Visual Odometry
    Aug 7, 2025 · This paper models the drift as a combination of wide-band noise and a first-order Gauss-Markov process, and analyzes it using Allan variance.
  66. [66]
    [PDF] Past Research, State of Automation Technology, and ... - NHTSA
    Advanced Research Projects Agency (DARPA) challenges (e.g., Montemerlo et al., 2008; Umson et ... 13.08 s and 15.60 s to complete primarily visual and combined ...
  67. [67]
    RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented ...
    We also compared the ability to eliminate outliers in visual observations using IMU pre-integration predicted poses. ... SLAM system ORB-SLAM3 and recent DynaVINS ...
  68. [68]
  69. [69]
    [PDF] Super Odometry: A Robust LiDAR-Visual-Inertial Estimator for ...
    We propose Super Odometry, a high-precision multi-modal sensor fusion framework, providing a simple but effective way to fuse multiple sensors.Missing: 2020s | Show results with:2020s
  70. [70]
    Camera, LiDAR, and IMU Based Multi-Sensor Fusion SLAM: A Survey
    Sep 22, 2023 · This paper can be considered as a brief guide to newcomers and a comprehensive reference for experienced researchers and engineers to explore ...
  71. [71]
    CVPR Poster On-Device Self-Supervised Learning of Low-Latency ...
    Online, on-device learning allows robots to “train in their test environment”. We improve the time and memory efficiency of the self-supervised contrast ...Missing: contrastive | Show results with:contrastive
  72. [72]
    Self-Supervised Optical Flow Estimation for Event-based Cameras
    Feb 19, 2018 · We present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event based cameras.Missing: contrastive motion
  73. [73]
    [PDF] Unsupervised Joint Learning of Optical Flow and Intensity with Event ...
    Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are ...
  74. [74]
    Using Bayesian deep learning approaches for uncertainty-aware ...
    Bayesian methods can quantify that uncertainty, and deep learning models exist that follow the Bayesian paradigm. These models, namely Bayesian neural ...Missing: flows motion
  75. [75]
    Representing Model Uncertainty in Deep Learning - arXiv
    Jun 6, 2015 · In this paper we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian ...Missing: flows motion robotics
  76. [76]
    [PDF] A General Framework for Uncertainty Estimation in Deep Learning
    In this paper, we propose a novel framework for uncertainty estimation of deep neural network predictions. By combin- ing Bayesian belief networks [5], [6], [7] ...
  77. [77]
    A survey of uncertainty in deep neural networks
    Jul 29, 2023 · This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges,
  78. [78]
    Quality Scalable Quantization Methodology for Deep Learning on ...
    Jul 15, 2024 · The methodology uses 3-bit parameter compression and quality scalable multipliers to reduce energy and size of CNNs for edge computing, with on ...Missing: motion FlowNet
  79. [79]
    Energy-Efficient Optical Flow Estimation using Sensor Fusion and ...
    In addition to accurately recovering the motion parameters of the problem, our framework produces motion-corrected edge-like images with high dynamic range ...
  80. [80]
    100FPS@1W Dense Optical Flow For Tiny Mobile Robots - arXiv
    Nov 21, 2024 · In this paper, we propose EdgeFlowNet, a high-speed, low-latency dense optical flow approach for tiny autonomous mobile robots by harnessing the ...Missing: FlowNet | Show results with:FlowNet
  81. [81]
    [PDF] Quantum Motion Segmentation
    This paper introduces the first algorithm for motion segmentation that relies on adiabatic quantum optimization of the objective function. The proposed method ...
  82. [82]
    (PDF) Ethical Considerations in AI-Powered Surveillance Systems
    Oct 29, 2024 · This paper examines the moral implications of AI-driven surveillance, highlighting tensions between national security, public safety, and individual privacy.
  83. [83]
    Legal and ethical implications of AI-based crowd analysis - NIH
    While AI offers promise in analysing crowd dynamics and predicting escalations, its deployment raises significant ethical concerns, regarding privacy, bias, ...