
Motion estimation

Motion estimation is the process of estimating the motion that occurs between a reference frame and the current frame in a video sequence, typically by determining motion vectors that describe displacements or transformations from one to the other. This exploits temporal correlations in sequences to model apparent motion caused by object movement or camera motion, serving as a fundamental component in fields such as video compression and computer vision. In video compression, it enables efficient encoding by reducing temporal redundancy, while in computer vision, it facilitates tasks like object tracking and scene understanding by measuring displacements from image sequences with high density and low cost compared to physical sensors. Key methods for motion estimation include block-matching algorithms, which divide frames into fixed-size blocks (e.g., 16×16 pixels) and search for the best-matching block in a reference frame to compute motion vectors, offering a practical balance of accuracy and computational efficiency in video coding standards. These are prominently featured in standards from H.261 to H.264/AVC and HEVC, where they support motion-compensated prediction to achieve significant compression gains by predicting frame content from prior frames. In contrast, optical flow techniques estimate dense or sparse motion fields by analyzing brightness constancy or gradient changes across pixels, with notable approaches like the Lucas-Kanade method for sparse flows and Horn-Schunck for dense regularization-constrained estimation. Other variants, such as parametric models for rigid motions or feature-based matching using descriptors like SIFT, address complex scenarios including nonrigid deformations. Despite its utility, motion estimation faces challenges as an ill-posed problem due to the projection from 3D scenes to 2D images, often requiring assumptions like smoothness or regularization to resolve ambiguities. Applications extend beyond video coding to biomedical imaging for tracking cellular motion, robotics for navigation, and structural monitoring for vibration analysis, where accuracies down to 0.01 pixels can be achieved with advanced correlation- or kernel-based methods. Ongoing advancements, including nature-inspired algorithms and deep learning integrations, continue to enhance robustness and real-time performance across these domains.

Fundamentals

Definition and Overview

Motion estimation is a fundamental task in computer vision and video processing that involves computing motion vectors or displacement fields to describe how pixels or features in an image sequence move from one frame to the next. These vectors represent the apparent motion of objects projected onto the two-dimensional image plane, capturing the transformation between consecutive frames in a video. The process takes as input a sequence of images and produces as output a motion field, which quantifies displacements at the pixel or block level, thereby enabling the analysis of dynamic scenes. At its core, motion estimation relies on two key principles: brightness constancy, which assumes that the intensity of a pixel remains unchanged as it moves across frames, and spatial coherence, which posits that nearby pixels belonging to the same surface exhibit similar motion patterns. These assumptions facilitate the estimation of motion by solving the pixel correspondence problem, where the goal is to match points between frames based on their visual properties. In practice, this reduces temporal redundancy in video data, allowing for efficient representation by predicting subsequent frames from previous ones rather than encoding each frame independently. A basic example of motion estimation involves estimating pure translation in a simple two-dimensional rigid motion scenario, where an entire object shifts uniformly across the image without rotation or scaling, yielding a constant motion vector for all pixels. Optical flow provides a dense representation of this motion field, assigning a velocity vector to every pixel. Motion estimation is crucial for temporal analysis in dynamic environments and serves as a foundational technique for numerous applications, including video compression and object tracking.

Historical Development

The origins of motion estimation trace back to the mid-20th century in photogrammetry and early image registration research, where aligning successive images was essential for reconstructing three-dimensional scenes from two-dimensional projections. In the 1950s and 1960s, early efforts focused on registration techniques to handle geometric transformations between views, with applications in aerial surveying and mapping. Early computational models, such as Hassenstein and Reichardt's 1956 correlation-based detector for motion in insect vision, provided foundational ideas for later algorithms. The 1980s marked a pivotal era with the formalization of differential methods for dense motion analysis, driven by the need to model continuous image changes over time. In 1981, Berthold K. P. Horn and Brian G. Schunck published "Determining Optical Flow," proposing a variational framework that minimized an energy functional to compute smooth velocity fields across entire images, influencing subsequent approaches in computer vision. Concurrently, Bruce D. Lucas and Takeo Kanade introduced a local iterative technique in their 1981 paper "An Iterative Image Registration Technique with an Application to Stereo Vision," which estimated motion by solving least-squares problems over small windows, laying the groundwork for efficient feature-based tracking. These works established the brightness constancy assumption as a core principle, assuming pixel intensities remain constant under motion, which spurred broad adoption in computer vision and video processing. By the 1990s, motion estimation transitioned into practical domains, particularly video compression, where discrete methods enabled real-time encoding. Block-matching algorithms, which divide frames into blocks and search for best-matching displacements, were integrated into international standards like MPEG-1 (1993) and MPEG-2 (1995), achieving efficiencies by exploiting temporal redundancies in broadcast television and digital storage. This period also saw refinements in tracking, such as the Kanade-Lucas-Tomasi (KLT) feature tracker introduced in Carlo Tomasi and Takeo Kanade's 1991 technical report "Detection and Tracking of Point Features," which extended local methods to affine motion models for robust object following in sequences. The 2000s brought initial fusions of motion estimation with machine learning, enhancing adaptability to complex scenes beyond rigid assumptions. Kernel-based trackers and probabilistic models, building on KLT, incorporated learning to predict feature trajectories, as seen in extensions like online boosting classifiers for visual tracking in the mid-2000s. These laid preparatory groundwork for data-driven paradigms. In the 2010s and 2020s, deep learning revolutionized motion estimation, shifting from handcrafted features to end-to-end neural architectures trained on large datasets. Philipp Fischer et al.'s 2015 paper "FlowNet: Learning Optical Flow with Convolutional Networks" introduced the first CNN-based estimator, achieving near-real-time performance on benchmarks like Sintel by directly regressing motion fields from image pairs. This sparked a surge in supervised methods, with refinements like PWC-Net (2018) improving accuracy via pyramid warping. By the early 2020s, transformer architectures addressed long-range dependencies, as in the 2021 GMA (Global Motion Aggregation) module integrated into RAFT, enabling state-of-the-art flow estimation on standard benchmarks; as of 2025, emerging diffusion frameworks and biologically inspired models continue to enhance accuracy in complex scenes.
Influential standards like H.264/AVC (2003) and its successors incorporated advanced block-matching techniques, such as variable block sizes, boosting compression efficiency and video quality in streaming, a trend continued by HEVC (2013).

Mathematical Foundations

Motion Models

Motion models in motion estimation provide parametric approximations of the underlying scene dynamics, enabling efficient computation by reducing the number of parameters compared to dense pixel-wise estimates. These models assume that motion can be captured by transformations with a small number of parameters, suitable for rigid or semi-rigid objects in video sequences or image pairs. Seminal works in multiple-view geometry have formalized these representations to balance representational power with estimation tractability. The simplest motion model is the translational model, which assumes a uniform global shift across the image and is parameterized by a 2D vector (t_x, t_y) representing displacement in the x and y directions. This model is effective for scenarios with pure camera panning or object translation without rotation or scaling, as seen in early block-matching techniques. The rotational model, in contrast, describes rotation about a center, parameterized by an angle \theta (or multiple angles in three dimensions), preserving distances but altering orientations; it is often combined with translation for basic rigid transformations. These fundamental models form the basis for more complex approximations in structure-from-motion pipelines. For scenes involving scaling, shear, or skew effects, the affine model extends translation and rotation with additional parameters, using a 6-degree-of-freedom transformation: \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, where the matrix elements a, b, c, d capture scaling (via a and d) and rotation and shear (via b and c), while t_x and t_y capture translation. This model preserves parallelism of lines and is widely used for local motion estimation in non-planar scenes, as introduced in extensions of the Lucas-Kanade framework for tracking. Perspective distortions in planar scenes or under pure camera rotation are handled by projective models, specifically the 8-parameter homography H, a 3×3 matrix defined up to scale, that maps points via: \begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} = H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}, with normalized coordinates (x'/w', y'/w'). Homographies model arbitrary plane-induced projective transformations, including translation, rotation, scaling, and skew, and are central to two-view geometry for reconstructing planar structures. Non-rigid motions, such as deformations in elastic objects or articulated bodies, require deformable models like thin-plate splines (TPS), which interpolate local displacements using a radial basis function to minimize bending energy while fitting control points. TPS extends affine models locally, enabling smooth non-linear warps without assuming global rigidity, and has been applied in direct non-rigid registration methods. The choice of motion model involves a trade-off between complexity (number of parameters) and fitting accuracy: translational or rotational models suffice for rigid, distant objects and avoid overfitting noise, while affine or projective models improve accuracy for closer or planar scenes at higher computational cost, and deformable models like TPS are selected for elastic deformations despite increased estimation challenges. Unlike non-parametric optical flow, which computes dense fields without such assumptions, parametric models prioritize low-dimensional fits for robustness in sparse data regimes.
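
The following sketch in Python with NumPy (an illustrative assumption, since the sources give no reference implementation) shows how the parametric models above map 2D points: a pure translation, a rigid rotation-plus-translation, and an 8-parameter homography applied in homogeneous coordinates. All numeric parameter values are arbitrary placeholders chosen only to demonstrate the conventions.

    import numpy as np

    def apply_affine(points, A, t):
        # Map Nx2 points through x' = A x + t (the 6-parameter affine model).
        return points @ A.T + t

    def apply_homography(points, H):
        # Map Nx2 points through the projective model and normalize by w'.
        pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
        mapped = pts_h @ H.T
        return mapped[:, :2] / mapped[:, 2:3]

    pts = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])

    # Pure translation: A is the identity, t the displacement (t_x, t_y).
    print(apply_affine(pts, np.eye(2), np.array([2.0, -1.0])))

    # Rotation by theta combined with translation (rigid motion).
    theta = np.deg2rad(15)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    print(apply_affine(pts, R, np.array([2.0, -1.0])))

    # Homography defined up to scale; the last row introduces perspective effects.
    H = np.array([[1.0, 0.1, 2.0],
                  [0.0, 1.0, -1.0],
                  [1e-3, 0.0, 1.0]])
    print(apply_homography(pts, H))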

Optical Flow Constraint

The optical flow constraint arises from the fundamental assumption of brightness constancy, which posits that the intensity of a point in an image sequence remains unchanged as it moves between frames. This assumption, central to differential methods for motion estimation, states that for an image intensity function I(x, y, t), the value at a displaced position satisfies I(x + dx, y + dy, t + dt) = I(x, y, t), where dx = u \, dt and dy = v \, dt, with (u, v) representing the components of the optical flow velocity. To derive the constraint equation, a first-order Taylor expansion is applied around the point (x, y, t), assuming small motions such that higher-order terms are negligible: \begin{aligned} I(x + u \, dt, y + v \, dt, t + dt) &\approx I(x, y, t) + \frac{\partial I}{\partial x} u \, dt + \frac{\partial I}{\partial y} v \, dt + \frac{\partial I}{\partial t} dt \\ &= I(x, y, t). \end{aligned} Dividing through by dt and rearranging yields the optical flow constraint equation: \frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} = 0, where \frac{\partial I}{\partial x} and \frac{\partial I}{\partial y} are the spatial gradients, and \frac{\partial I}{\partial t} is the temporal gradient. This equation links the observed changes in image intensity to the underlying motion, forming the basis for many intensity-based algorithms. A key challenge with this constraint is the aperture problem, which stems from its underconstrained nature: for each pixel, there is only one equation but two unknowns (u and v). This results in an infinite number of possible flow vectors that satisfy the constraint, lying along a line perpendicular to the local image gradient. For instance, along a straight edge with uniform intensity parallel to the edge, only the normal component of the flow (perpendicular to the edge) can be reliably estimated from local intensity changes, while the tangential component remains ambiguous without additional contextual information from neighboring regions. The derivation relies on the small motion assumption inherent in the linear approximation, which holds well for sub-pixel displacements but breaks down for larger movements. Spatial gradients are typically computed using derivative kernels such as Sobel operators to approximate \frac{\partial I}{\partial x} and \frac{\partial I}{\partial y}, while the temporal gradient \frac{\partial I}{\partial t} is often obtained via finite differences between consecutive frames. These approximations introduce sensitivity to noise, necessitating preprocessing like Gaussian smoothing. To address the underconstrained nature of the single-frame constraint, extensions incorporate multi-frame information, such as temporal coherence models that integrate data across several frames to provide additional equations and improve solvability. For example, sequential estimation using Kalman filtering can propagate flow estimates over time, reducing ambiguities by leveraging the diversity of temporal gradients across frames. Despite its foundational role, the constraint has limitations, failing under large displacements where the small motion assumption does not hold, changes in illumination that violate brightness constancy, or occlusions that introduce discontinuities in the flow field. These issues highlight the need for robust extensions in practical applications, though the core equation remains a cornerstone for instantaneous motion estimation.
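
A minimal NumPy sketch (an assumption for illustration; the gradients here use simple finite differences rather than Sobel kernels) demonstrates the constraint and the aperture problem: from the gradients of a synthetic intensity ramp shifted by one pixel, only the "normal flow" component along the gradient direction can be recovered.

    import numpy as np

    def gradients(frame0, frame1):
        Ix = np.gradient(frame0, axis=1)   # spatial derivative in x
        Iy = np.gradient(frame0, axis=0)   # spatial derivative in y
        It = frame1 - frame0               # temporal derivative (finite difference)
        return Ix, Iy, It

    def normal_flow(Ix, Iy, It, eps=1e-6):
        # Component of flow along the local gradient; the tangential part stays ambiguous.
        mag2 = Ix**2 + Iy**2 + eps
        return -It * Ix / mag2, -It * Iy / mag2

    # Synthetic example: an intensity ramp shifted right by one pixel between frames.
    x = np.arange(64, dtype=float)
    frame0 = np.tile(x, (64, 1))           # intensity varies along x only
    frame1 = np.tile(x - 1.0, (64, 1))     # same pattern displaced by +1 pixel in x
    Ix, Iy, It = gradients(frame0, frame1)
    un, vn = normal_flow(Ix, Iy, It)
    print(un[32, 32], vn[32, 32])          # approximately 1.0 and 0.0 (only the normal component)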

Algorithms

Intensity-Based Methods

Intensity-based methods estimate dense motion fields by directly utilizing pixel intensities from consecutive image frames, relying on the assumption of brightness constancy to solve the optical flow constraint equation. These approaches compute motion vectors for every pixel, producing a complete flow field suitable for applications requiring detailed motion information, unlike sparse methods that focus on select features. The aperture problem, inherent to local intensity changes, is addressed through either global smoothness assumptions or local averaging within windows. Global methods, such as the Horn-Schunck algorithm, formulate motion estimation as a variational problem that minimizes a global energy functional combining data fidelity and smoothness terms. The energy is defined as E = \int \left( (I_x u + I_y v + I_t)^2 + \alpha (|\nabla u|^2 + |\nabla v|^2) \right) \, dx \, dy, where I_x, I_y, I_t are the spatial and temporal intensity derivatives, u and v are the horizontal and vertical flow components, and \alpha controls the smoothness penalty. This functional is solved using the Euler-Lagrange equations, yielding a system of partial differential equations that enforce neighboring flow consistency to resolve ambiguities from the aperture problem. The resulting dense flow is smooth but can oversmooth discontinuities at motion boundaries. Local methods, exemplified by the Lucas-Kanade algorithm, estimate motion within small spatial windows by assuming constant flow across the region and solving a least-squares problem. For a window around pixel (x, y), the flow (u, v) is computed as \begin{pmatrix} u \\ v \end{pmatrix} = (A^T A)^{-1} A^T b, where A is the matrix of spatial gradients I_x and I_y for pixels in the window, and b contains the negative temporal derivatives -I_t. This approach provides robustness to noise through averaging but requires careful window size selection: smaller windows capture fine details and handle occlusions better, while larger ones improve stability for low-texture areas at the cost of blurring motion edges. The method assumes small displacements, limiting its applicability to sub-pixel motions without extensions. Variational frameworks extend these ideas by incorporating advanced regularizers, such as total variation (TV) terms, to enhance robustness against noise and outliers while preserving flow discontinuities. The TV-L1 model replaces the quadratic data term with a robust L1 norm and uses TV regularization on the flow gradients, formulated as minimizing \int |I_x u + I_y v + I_t| + \lambda |\nabla \mathbf{w}| \, dx \, dy, where \mathbf{w} = (u, v) and \lambda balances data fidelity and regularity. This duality-based optimization yields piecewise-smooth flows resilient to illumination variations and sparse errors, outperforming quadratic penalties in textured scenes. Computationally, global methods like Horn-Schunck employ iterative solvers such as Gauss-Seidel relaxation to approximate the Euler-Lagrange solutions, converging in tens of iterations for typical image sizes but scaling poorly with resolution. Local methods like Lucas-Kanade are faster, requiring only small matrix inversions per window, and often use multi-resolution pyramids to handle larger motions by estimating coarse flows first and refining at finer levels. These techniques enable near-real-time processing on modern hardware for typical video resolutions. Intensity-based methods produce dense flow fields valuable for applications like synthetic aperture radar (SAR) imaging, where they estimate glacier surface motion from SAR intensity images with sub-pixel accuracy under varying speckle noise.
However, they remain sensitive to illumination changes, which violate the brightness constancy assumption and introduce errors in the data term, necessitating robust variants for outdoor scenes. In contrast to feature-based approaches, their use of all pixels yields comprehensive coverage but at higher computational cost.
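
The local least-squares solve described above can be sketched in a few lines of Python with NumPy (an illustrative assumption; the window size, smooth synthetic pattern, and single-window evaluation are simplifications rather than a full Lucas-Kanade implementation).

    import numpy as np

    def lucas_kanade_at(frame0, frame1, y, x, win=7):
        # Estimate (u, v) for the window centred at (y, x): least-squares solution of the
        # stacked constraint equations, equivalent to (A^T A)^{-1} A^T b.
        r = win // 2
        Ix = np.gradient(frame0, axis=1)[y - r:y + r + 1, x - r:x + r + 1].ravel()
        Iy = np.gradient(frame0, axis=0)[y - r:y + r + 1, x - r:x + r + 1].ravel()
        It = (frame1 - frame0)[y - r:y + r + 1, x - r:x + r + 1].ravel()
        A = np.stack([Ix, Iy], axis=1)                 # spatial gradients per window pixel
        b = -It                                        # negative temporal derivatives
        flow, *_ = np.linalg.lstsq(A, b, rcond=None)
        return flow                                    # [u, v]

    # Smooth synthetic pattern shifted one pixel to the right between frames.
    yy, xx = np.mgrid[0:64, 0:64].astype(float)
    frame0 = np.sin(xx / 6.0) + np.cos(yy / 7.0)
    frame1 = np.sin((xx - 1.0) / 6.0) + np.cos(yy / 7.0)
    print(lucas_kanade_at(frame0, frame1, 32, 32))     # approximately [1.0, 0.0]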

Feature-Based Methods

Feature-based methods in motion estimation focus on identifying and tracking distinct keypoints or sparse regions within frames to estimate motion vectors, prioritizing computational efficiency and robustness to occlusions over the dense pixel-wise analysis of intensity-based approaches. These techniques typically divide the image into blocks or select features based on local structural properties, then match them across frames using similarity metrics. By operating on a limited set of features—often numbering in the hundreds rather than thousands of pixels—these methods achieve lower complexity while effectively capturing dominant motions in scenes with structured elements like edges or corners. One foundational approach is block matching, where the image is partitioned into fixed-size blocks, typically 16×16 pixels, and each block in the current frame is compared to candidate blocks in a search window of the reference frame to find the best match. The similarity is commonly measured by the sum of absolute differences (SAD), defined as \sum |I(x) - I'(x + d)| over block pixels, where I and I' are the reference and current frame intensities, and d is the displacement vector minimizing SAD. Exhaustive search evaluates all positions in the search window for the minimum SAD, providing high accuracy but at a computational cost of O(B \cdot W^2) per frame, where B is the number of blocks and W the search window size; this full-search block matching forms the basis of motion compensation in standards like MPEG-1. To accelerate this, fast search patterns reduce evaluations: the three-step search starts with a coarse grid of nine points spaced by half the window size, refines to a smaller step, and ends with a fine search around the minimum, typically requiring about 25 checks versus 225 for exhaustive search on a ±7 pixel window. Similarly, the diamond search applies a large diamond pattern iteratively until the minimum falls at its center, followed by a small diamond refinement, achieving up to 22% fewer computations than three-step search while maintaining comparable accuracy in MPEG video coding. Feature tracking methods, such as the Kanade-Lucas-Tomasi (KLT) tracker, select and follow individual keypoints across frames using local optimization. Feature selection relies on the structure tensor, a 2×2 matrix G = \sum_w \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}, where I_x, I_y are spatial gradients weighted over a window w; good features are those with both eigenvalues of G above a threshold, ensuring corner-like stability under small translations. The tracker then estimates the motion of each feature by solving the least-squares system derived from the Lucas-Kanade constraint I_x u + I_y v + I_t = 0, where (u, v) are flow components and I_t the temporal derivative, iteratively warping the template to minimize residual errors. For subpixel accuracy in sparse tracking, the inverse compositional variant optimizes the warp parameters by composing updates inversely on the template, avoiding gradient recomputation per iteration and enabling efficient affine or translational models with convergence in 5-10 steps. Descriptor-based methods enhance matching robustness by extracting invariant feature descriptors at keypoints, followed by correspondence estimation. The scale-invariant feature transform (SIFT) detects keypoints at scale-space extrema and describes them with 128-dimensional histograms of oriented gradients, invariant to scale and rotation; matches are found via nearest-neighbor search in descriptor space, often using Lowe's ratio test to discard ambiguous pairs (distance to closest over second-closest < 0.8).
For faster alternatives, Oriented FAST and Rotated BRIEF (ORB) combines FAST corner detection with binary BRIEF descriptors steered to the keypoint orientation, enabling Hamming-distance matching roughly two orders of magnitude faster than SIFT with similar accuracy on rotated images. To fit a motion model from these correspondences amid outliers (e.g., up to 50% from repetitive textures), RANSAC randomly samples minimal sets (e.g., a single correspondence for pure translation, three for an affine model), fits the model, counts inliers within a threshold, and selects the largest consensus set, iterating over hundreds to thousands of trials for robust estimation. These methods excel in handling large displacements by focusing on distinctive features less prone to aperture problems, unlike dense intensity-based techniques that assume global smoothness. Their complexity scales linearly as O(N) with the number of features N, making them suitable for applications like video stabilization, where N \approx 200-500 features suffice for accurate ego-motion recovery.
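
A short Python/NumPy sketch of full-search block matching with the SAD criterion described above follows; the block size, search range, and synthetic shifted frames are assumptions chosen only to make the example self-contained.

    import numpy as np

    def sad(block_a, block_b):
        return np.abs(block_a - block_b).sum()

    def block_match(ref, cur, y, x, block=16, search=7):
        # Exhaustive search for the displacement d minimizing SAD between the current
        # block at (y, x) and candidate blocks in the reference frame.
        cur_block = cur[y:y + block, x:x + block]
        best, best_d = np.inf, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = y + dy, x + dx
                if 0 <= ry <= ref.shape[0] - block and 0 <= rx <= ref.shape[1] - block:
                    cost = sad(cur_block, ref[ry:ry + block, rx:rx + block])
                    if cost < best:
                        best, best_d = cost, (dy, dx)
        return best_d, best

    rng = np.random.default_rng(1)
    ref = rng.random((64, 64))
    cur = np.roll(ref, shift=(2, 3), axis=(0, 1))   # frame contents shifted down 2, right 3
    print(block_match(ref, cur, 24, 24))            # expect ((-2, -3), 0.0): the best-matching reference block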

Deep Learning Methods

Deep learning methods have revolutionized motion estimation by enabling end-to-end learning of dense optical flow from image pairs, surpassing traditional hand-crafted approaches in generalization and accuracy on diverse datasets. These methods typically leverage convolutional neural networks (CNNs) or transformers to predict pixel-wise motion vectors, often trained with losses inspired by the brightness constancy constraint to minimize photometric errors between frames. A seminal CNN-based approach is FlowNet, introduced in 2015, which employs a two-stream architecture to extract features from consecutive frames and computes a correlation volume across the feature maps for end-to-end flow prediction. This correlation layer enables the network to match corresponding pixels efficiently without explicit search, achieving near-real-time performance on GPUs. FlowNet is trained in a supervised manner on synthetic datasets such as Flying Chairs, which simulate realistic motion with rendered chair sequences to generate ground-truth flow labels. Subsequent improvements, like FlowNet 2.0, stacked multiple networks for refined estimates, reducing end-point error (EPE) on benchmarks like Sintel by integrating coarse-to-fine processing. RAFT (Recurrent All-Pairs Field Transforms), introduced in 2020, exemplifies iterative refinement through a gated recurrent unit (GRU) that updates flow estimates over multiple iterations, while unsupervised variants address the need for labeled data by relying on photometric and smoothness losses. RAFT constructs all-pairs correlation volumes at multiple pyramid levels to handle large displacements, and occlusion handling is commonly added via forward-backward consistency checks, where inconsistent flows between forward and backward predictions are masked. This approach achieves state-of-the-art EPE on benchmarks such as KITTI and Sintel (under 2 pixels on the clean pass), demonstrating robust generalization across datasets. Transformer-based models extend these capabilities by capturing long-range dependencies, as seen in GMA (Global Motion Aggregation) from 2021, which integrates a transformer-based attention module to aggregate global motion cues from the first frame into local flow predictions. GMA's attention mechanism reasons over occluded regions by attending to visible pixels with similar motions, improving EPE in challenging areas like the Sintel final pass by up to 20% over baselines. This enables better handling of complex scenes with occlusions and non-rigid motions. Hybrid approaches combine learned components with classical elements, such as RAFT's multi-scale correlation processing, where multi-resolution correlation volumes mimic traditional coarse-to-fine strategies to capture both small and large motions efficiently. By 2025, trends emphasize lightweight models for edge devices, like EdgeFlowNet, a network tailored for tiny mobile robots that delivers dense optical flow at roughly 100 FPS and about 1 W of power on hardware like the Coral Edge TPU, with EPE competitive on Sintel (6.53 pixels). Integration with diffusion models for generative motion estimation has also emerged, as in GENMO, which combines regression with diffusion processes to produce diverse yet accurate human motion estimates from sparse inputs, advancing applications in animation and virtual reality. These developments support real-time motion estimation in resource-constrained environments.
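
The photometric and smoothness losses that unsupervised variants rely on can be illustrated with a small NumPy sketch (an assumption for clarity: the network is omitted entirely, the "predicted" flow is a placeholder array, and the bilinear warping is a minimal implementation rather than any particular framework's operator).

    import numpy as np

    def backward_warp(img, flow):
        # Sample img at (x + u, y + v) with bilinear interpolation.
        h, w = img.shape
        yy, xx = np.mgrid[0:h, 0:w].astype(float)
        xs = np.clip(xx + flow[..., 0], 0, w - 1)
        ys = np.clip(yy + flow[..., 1], 0, h - 1)
        x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
        x1, y1 = np.clip(x0 + 1, 0, w - 1), np.clip(y0 + 1, 0, h - 1)
        wx, wy = xs - x0, ys - y0
        top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
        bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
        return top * (1 - wy) + bot * wy

    def unsupervised_flow_loss(frame1, frame2, flow, smooth_weight=0.1):
        photometric = np.abs(frame1 - backward_warp(frame2, flow)).mean()   # brightness constancy term
        smoothness = (np.abs(np.diff(flow, axis=0)).mean()
                      + np.abs(np.diff(flow, axis=1)).mean())               # first-order smoothness term
        return photometric + smooth_weight * smoothness

    rng = np.random.default_rng(2)
    frame2 = rng.random((32, 32))
    true_flow = np.full((32, 32, 2), [1.5, 0.0])                # ground-truth shift of 1.5 px in x
    frame1 = backward_warp(frame2, true_flow)
    print(unsupervised_flow_loss(frame1, frame2, true_flow))    # low loss for the correct flow
    print(unsupervised_flow_loss(frame1, frame2, np.zeros((32, 32, 2))))  # higher loss for zero flow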

Advanced Techniques

Affine Motion Estimation

Affine motion estimation involves computing the six parameters of an affine transformation—two for translation and four for the linear part capturing rotation, scaling, and shear—from sparse point correspondences between images, enabling the modeling of more complex deformations than pure translation. This approach assumes a linear relationship between corresponding points (x_i, y_i) and (x_i', y_i'), formulated as: \begin{align*} x_i' &= a x_i + b y_i + t_x, \\ y_i' &= c x_i + d y_i + t_y, \end{align*} where a, b, c, d capture rotation, scaling, and shear, while t_x, t_y represent translation. Estimation typically employs least-squares optimization to minimize the sum of squared residuals over N correspondences: \min_{a,b,c,d,t_x,t_y} \sum_{i=1}^N \left[ (x_i' - (a x_i + b y_i + t_x))^2 + (y_i' - (c x_i + d y_i + t_y))^2 \right]. This problem is linear in the parameters and can be solved directly using the normal equations or the pseudoinverse of the design matrix. To improve numerical conditioning, the points can be centered by subtracting their centroids, which decouples the translation and allows solving for the linear part first. For multiple correspondences forming an overdetermined system A \mathbf{p} = \mathbf{b}, where \mathbf{p} = [a, b, c, d, t_x, t_y]^T, the solution is the least-squares minimizer \mathbf{p} = (A^T A)^{-1} A^T \mathbf{b}. Robust variants address outliers in correspondences using RANSAC, which iteratively samples minimal sets (three non-collinear points for an affine transformation) to hypothesize transformations, evaluating consensus via inlier counts before refining with least-squares on the largest set. The direct linear transformation (DLT) provides an efficient linear solver for the affine system, constructing a constraint matrix from correspondences and applying singular value decomposition to solve the homogeneous form, though it requires normalization to avoid numerical instability. Degenerate configurations, such as collinear points, lead to rank-deficient systems where the affine matrix cannot be uniquely determined, as multiple transformations map the line identically; detection involves checking the condition number via the SVD or requiring at least three non-collinear points. In tracking applications, affine-invariant features like ASIFT enhance robustness to viewpoint changes by simulating possible affine distortions during feature extraction, allowing reliable matching under rotation, scaling, and tilt for sustained object tracking across frames. Evaluation often quantifies accuracy via parameter deviation metrics, such as the distance between estimated and ground-truth parameter vectors, or endpoint error on held-out points, with applications in image registration demonstrating sub-pixel precision on synthetic datasets.
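
The linear least-squares fit described above reduces to stacking two equations per correspondence and solving the resulting system; the following Python/NumPy sketch (with synthetic correspondences and arbitrary ground-truth parameters, both assumptions for illustration) shows the construction.

    import numpy as np

    def fit_affine(src, dst):
        # Solve A p = b for p = [a, b, c, d, t_x, t_y] from Nx2 correspondences
        # (requires N >= 3 non-collinear points to avoid a rank-deficient system).
        n = len(src)
        A = np.zeros((2 * n, 6))
        b = np.zeros(2 * n)
        A[0::2, 0], A[0::2, 1], A[0::2, 4] = src[:, 0], src[:, 1], 1.0   # x' = a x + b y + t_x
        A[1::2, 2], A[1::2, 3], A[1::2, 5] = src[:, 0], src[:, 1], 1.0   # y' = c x + d y + t_y
        b[0::2], b[1::2] = dst[:, 0], dst[:, 1]
        p, *_ = np.linalg.lstsq(A, b, rcond=None)
        return p

    rng = np.random.default_rng(3)
    src = rng.random((20, 2)) * 100
    true = np.array([0.9, -0.2, 0.1, 1.1, 5.0, -3.0])            # a, b, c, d, t_x, t_y
    dst = np.stack([true[0] * src[:, 0] + true[1] * src[:, 1] + true[4],
                    true[2] * src[:, 0] + true[3] * src[:, 1] + true[5]], axis=1)
    dst += rng.normal(scale=0.05, size=dst.shape)                # small correspondence noise
    print(fit_affine(src, dst))                                  # close to the true parameter vector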

Multi-Resolution Approaches

Multi-resolution approaches in motion estimation employ hierarchical image representations to address challenges posed by large displacements between frames, enabling more robust and efficient computation. These methods decompose the input images into multiple scales, typically using Gaussian or Laplacian pyramids, where coarser levels capture global motion patterns while finer levels refine local details. The Gaussian pyramid is constructed by successively low-pass filtering and subsampling the image, creating a series of reduced-resolution versions that approximate the original at decreasing spatial frequencies. In contrast, the Laplacian pyramid encodes band-pass filtered details at each scale, facilitating the propagation of high-frequency information during refinement. This coarse-to-fine strategy, popularized in hierarchical model-based motion estimation, warps intermediate pyramid levels to align images and propagate flow estimates upward, significantly improving convergence for scenarios with substantial motion. In implementation, the process begins at the coarsest pyramid level, where motion is estimated at reduced resolution to capture broad displacements, often assuming affine or translational models for global consistency. This initial estimate is then upsampled to the next finer level and used to initialize a local refinement step, such as iterative optimization within small windows. The upsampling typically scales the coarser flow vector \mathbf{u}_{l+1} by a factor of 2 (matching the pyramid's subsampling rate) and adds a correction term \Delta \mathbf{u}_l computed at the current level l, formalized as: \mathbf{u}_l = 2 \mathbf{u}_{l+1} + \Delta \mathbf{u}_l. This propagation stabilizes the search by constraining the refinement to small residuals, integrating seamlessly with intensity-based methods like Lucas-Kanade optical flow or block matching algorithms. The primary benefits of multi-resolution approaches include a drastic reduction in the search space at coarser scales, which mitigates the computational burden of exhaustive matching and enhances efficiency through subsampling—often achieving speedups of 4-8 times per level compared to single-scale methods. Additionally, by estimating large motions globally first, these techniques avoid entrapment in local minima, leading to more accurate flows in complex scenes with occlusions or rapid changes. For instance, in video sequences with fast camera panning, the hierarchical refinement preserves structural coherence that single-resolution estimators frequently lose. Variants extend the classic pyramid framework, such as overcomplete representations that maintain overlapping scales for smoother transitions, or learned multi-scale hierarchies in models like PWC-Net, which incorporate pyramidal feature processing with warping and cost volumes to achieve state-of-the-art accuracy on benchmarks like MPI Sintel, with end-point errors around 10% lower than some prior methods. These adaptations preserve the core coarse-to-fine paradigm while adapting to modern neural architectures. Despite these advantages, multi-resolution approaches can introduce artifacts from quantization errors at coarse scales, where subsampling blurs fine details and may propagate inaccuracies, particularly in handling fast camera motion exceeding the pyramid's displacement limits—resulting in aliased flows or failure to capture sub-pixel motion in high-speed scenarios. Such limitations underscore the need for careful pyramid depth selection, typically 3-5 levels, to balance accuracy and robustness.
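
The coarse-to-fine propagation rule u_l = 2 u_{l+1} + Δu_l can be sketched for a single global translation in Python with NumPy (simplifying assumptions: plain 2×2-average downsampling instead of Gaussian filtering, an integer ±2-pixel SAD search as the per-level refinement, and synthetic frames related by a known shift).

    import numpy as np

    def downsample(img):
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    def shift(img, dy, dx):
        return np.roll(img, shift=(dy, dx), axis=(0, 1))

    def refine(ref, cur, dy, dx, radius=2):
        # Small correction search (SAD criterion) around the propagated estimate.
        candidates = ((np.abs(cur - shift(ref, dy + ddy, dx + ddx)).sum(), ddy, ddx)
                      for ddy in range(-radius, radius + 1)
                      for ddx in range(-radius, radius + 1))
        _, best_dy, best_dx = min(candidates)
        return dy + best_dy, dx + best_dx

    def pyramid_translation(ref, cur, levels=4):
        pyr = [(ref, cur)]
        for _ in range(levels - 1):
            pyr.append((downsample(pyr[-1][0]), downsample(pyr[-1][1])))
        dy = dx = 0
        for r, c in reversed(pyr):                  # coarsest level first
            dy, dx = refine(r, c, 2 * dy, 2 * dx)   # u_l = 2 u_{l+1} + delta u_l
        return dy, dx

    rng = np.random.default_rng(4)
    ref = rng.random((128, 128))
    cur = shift(ref, 9, -6)                          # a large displacement handled via the pyramid
    print(pyramid_translation(ref, cur))             # expect (9, -6)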

Applications

Video Coding

Motion estimation plays a pivotal role in video coding by exploiting temporal redundancy between consecutive frames, enabling efficient compression through predictive modeling of pixel displacements. In hybrid video codecs, it forms the basis for inter-frame prediction, where motion vectors describe block movements to reconstruct current frames from reference frames, significantly reducing the bitrate required for storage or transmission. This process is central to standards developed by the Joint Video Experts Team (JVET) and its predecessors, achieving substantial gains in coding efficiency. Block-based prediction is the cornerstone of motion estimation in modern video coding standards such as H.264/AVC and HEVC (H.265). Frames are partitioned into macroblocks or coding units—typically 16×16 pixels in H.264/AVC and variable sizes up to 64×64 in HEVC—and a motion vector is estimated for each to minimize the residual error when predicting from a reference frame. To enhance accuracy, these standards support sub-pixel precision, particularly quarter-pixel accuracy achieved via bilinear or Wiener interpolation filters on reference samples. This fractional resolution allows finer motion compensation, improving prediction quality especially for smooth or non-integer movements. Search strategies for motion vectors balance computational cost and accuracy, with full search exhaustively evaluating all candidate positions within a search window but at high cost, while fast algorithms like TZSearch in HEVC's reference software reduce evaluations through zonal patterns and early termination. These methods integrate rate-distortion optimization (RDO) to select the best vector and mode, minimizing the Lagrangian cost function J = D + \lambda R, where D represents distortion (e.g., sum of squared differences), R is the bitrate for encoding the vector and residual, and \lambda is a Lagrange multiplier trading off distortion against rate. RDO ensures decisions align with overall compression goals, often yielding 10-20% bitrate savings in inter prediction. Prediction modes further refine motion estimation: unidirectional modes use forward or backward prediction from one reference frame (P-frames in H.264/AVC), while bidirectional modes in B-frames combine two references for better prediction in scenes with occlusions or reversals. Both standards support multiple reference frames—up to 16 in HEVC—allowing selection from a list to capture longer-term dependencies, which can improve coding efficiency by 5-15% in low-motion sequences. The evolution of motion estimation traces from MPEG-2 (H.262) in the 1990s, which introduced block-based compensation with half-pixel accuracy, to H.264/AVC (2003), which enhanced it with quarter-pixel precision and variable block shapes for about 50% bitrate reduction over MPEG-2. HEVC (2013) extended this with larger blocks and TZSearch, roughly doubling efficiency over H.264/AVC. VVC (H.266, 2020) incorporates affine motion models for intra-block variations, treating rotation and scaling alongside translation via 4- or 6-parameter transforms on 4×4 sub-blocks, yielding significant further bitrate savings for complex motions like camera pans and zooms. These advancements have enabled 4K/8K streaming at viable bitrates. Despite these gains, challenges persist, notably the bitrate overhead from transmitting motion vectors, which can consume 10-20% of the total bitrate in high-motion or fine-grained scenarios, necessitating advanced entropy coding like CABAC to mitigate it. In streaming services, this overhead impacts adaptive bitrate delivery, prompting optimizations like motion vector merging in VVC to reduce signaling for similar neighboring blocks.
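
The Lagrangian selection J = D + λR can be illustrated with a short Python/NumPy sketch (all of the following are assumptions for the demo rather than any standard's normative procedure: SSD as the distortion measure, a rough exponential-Golomb-style bit estimate for the motion-vector difference, and an arbitrary λ).

    import numpy as np

    def ssd(a, b):
        return float(((a - b) ** 2).sum())

    def mv_bits(mv, pred):
        # Rough bit cost of signalling the motion-vector difference from its predictor.
        return sum(2 * int(np.floor(np.log2(abs(d) + 1))) + 1
                   for d in (mv[0] - pred[0], mv[1] - pred[1]))

    def best_vector(ref, cur, y, x, pred, lam=10.0, block=16, search=8):
        # Pick the candidate minimizing the Lagrangian cost J = D + lambda * R.
        cur_block = cur[y:y + block, x:x + block]
        best = (np.inf, (0, 0))
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = y + dy, x + dx
                if 0 <= ry <= ref.shape[0] - block and 0 <= rx <= ref.shape[1] - block:
                    J = ssd(cur_block, ref[ry:ry + block, rx:rx + block]) + lam * mv_bits((dy, dx), pred)
                    if J < best[0]:
                        best = (J, (dy, dx))
        return best

    rng = np.random.default_rng(5)
    ref = rng.random((64, 64)) * 255
    cur = np.roll(ref, shift=(0, 4), axis=(0, 1)) + rng.normal(scale=1.0, size=ref.shape)
    print(best_vector(ref, cur, 16, 16, pred=(0, 0)))   # a vector near (0, -4) balancing distortion and rate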

3D Reconstruction

Motion estimation plays a crucial role in 3D reconstruction by enabling the recovery of camera poses and scene structure from sequences of 2D images, primarily through the structure-from-motion (SfM) pipeline. This process begins with feature matching across multiple views to establish correspondences between images, leveraging techniques from feature-based methods to identify and track keypoints such as corners or blobs. From these correspondences, the fundamental matrix F is estimated, which encodes the epipolar geometry relating points in two images; assuming known camera intrinsics K, the essential matrix E is then derived via E = K^T F K, capturing the relative rotation and translation up to scale between views. Triangulation follows, projecting the matched features back into 3D space using the camera poses decomposed from E to initialize a sparse point cloud representing the scene structure. To refine the initial estimates, bundle adjustment performs a joint optimization over all camera poses P and 3D points X_i, minimizing the reprojection error defined as \sum_i \| x_i - \pi(P, X_i) \|^2, where \pi is the projection function and x_i are the observed image points. This non-linear least-squares problem, often solved using Levenberg-Marquardt, ensures consistency across the entire reconstruction, significantly improving accuracy in the presence of noise or outliers. For dense reconstruction beyond the sparse SfM output, multi-view stereo extends the model by computing depth for most pixels, employing methods like patch matching to evaluate photo-consistency across views and expansion to propagate correspondences; a seminal approach uses adaptive patch-based evaluation to generate quasi-dense surface models visible in the input images. In scenarios with planar scenes, such as facades or documents, affine or homography approximations simplify motion estimation by directly decomposing the plane-induced homography matrix into rotation and translation components, avoiding full structure recovery when depth variation is minimal. This is particularly useful for initial pose estimation in restricted environments. Practical implementations, like the COLMAP software, integrate these SfM steps into an end-to-end pipeline for robust reconstruction from unordered image collections, supporting both sparse and dense outputs. In cultural heritage documentation, drone-based SfM has been applied to archaeological and historical sites in the 2020s, generating detailed models of monuments from aerial imagery to aid preservation and virtual tourism.
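
Two quantities from the pipeline above—the essential matrix E = K^T F K and the reprojection error minimized by bundle adjustment—are shown in the following Python/NumPy sketch; the intrinsics, camera pose, 3D points, and half-pixel perturbation are made-up values for illustration only.

    import numpy as np

    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])            # assumed pinhole intrinsics

    def essential_from_fundamental(F, K):
        return K.T @ F @ K

    def project(P, X):
        # Project homogeneous 3D points X (Nx4) with a 3x4 camera matrix P.
        x = X @ P.T
        return x[:, :2] / x[:, 2:3]

    def reprojection_error(P, X, observed):
        # Sum of squared distances between observed 2D points and projections pi(P, X).
        return float(((observed - project(P, X)) ** 2).sum())

    # A camera at the origin looking down +Z, and a few 3D points in front of it.
    P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    X = np.array([[0.0, 0.0, 5.0, 1.0],
                  [1.0, -0.5, 6.0, 1.0],
                  [-2.0, 1.0, 8.0, 1.0]])
    observed = project(P, X) + 0.5               # observations perturbed by half a pixel
    print(reprojection_error(P, X, observed))    # the quantity bundle adjustment drives down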

Robotics and Surveillance

In robotics, motion estimation plays a critical role in visual odometry (VO), which estimates the ego-motion of a robot using sequential camera images to enable navigation in dynamic environments. Direct methods, such as those aligning pixel intensities between frames, and geometric approaches like iterative closest point (ICP) for point-cloud registration, are commonly employed to compute relative poses with high accuracy in real-time settings. A prominent example is ORB-SLAM3, an open-source visual-inertial system that integrates IMU data for robust ego-motion estimation, achieving low drift through tightly coupled optimization of visual and inertial measurements. Loop closure detection further enhances VO by identifying revisited locations, allowing global pose corrections via pose-graph optimization to mitigate cumulative errors in long trajectories. In surveillance applications, motion estimation facilitates multi-object tracking by estimating trajectories from video feeds, often fusing motion vectors—derived from optical flow or block matching—with predictive models to maintain track continuity. The Kalman filter is widely used for this fusion, predicting object states based on constant-velocity assumptions and updating with detected motion vectors to handle multiple targets in cluttered scenes. Occlusions, a common challenge in surveillance, are addressed through re-identification techniques that match object appearances across frames using deep features, enabling track recovery post-obstruction without relying solely on motion continuity. Real-time constraints in these domains demand low-latency motion estimation to support immediate decision-making, achieved through GPU-accelerated models optimized for embedded hardware. Lightweight convolutional networks, such as quantized variants of FlowNet or MobileNet-based trackers, process optical flow or feature correspondences at high frame rates on embedded platforms, enabling robust estimation under resource limitations. Such designs enhance robustness to lighting variations and partial occlusions, leveraging multi-resolution processing for large-scale scenes when necessary. Applications in autonomous vehicles utilize visual odometry for precise localization, with benchmarks on the KITTI dataset evaluating translational and rotational errors, where state-of-the-art systems achieve average drifts below 1% over sequences of several kilometers. In security cameras, abnormal motion patterns signal anomalies like intrusions, with detection models reconstructing expected flows to flag deviations in real time. Key metrics include tracking success rates exceeding 80% in multi-object scenarios and drift limited to 0.5-2% in extended runs, as demonstrated in DARPA Urban Challenge (2007) and Subterranean (SubT) Challenge (2019-2021) evaluations of robotic autonomy under unstructured conditions.
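
The constant-velocity Kalman filter fusion mentioned above can be sketched briefly in Python with NumPy; the state layout, noise covariances, and simulated detections are illustrative assumptions rather than parameters of any deployed tracker.

    import numpy as np

    dt = 1.0
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)     # state: [x, y, vx, vy], constant-velocity model
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)     # only position is measured
    Q = np.eye(4) * 0.01                          # process noise covariance
    R = np.eye(2) * 1.0                           # measurement noise covariance

    def kf_step(x, P, z):
        # Predict with the constant-velocity model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with the measured position z (e.g., derived from motion vectors).
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x, P

    rng = np.random.default_rng(6)
    x, P = np.zeros(4), np.eye(4) * 10.0
    for t in range(1, 21):
        z = np.array([2.0 * t, 1.0 * t]) + rng.normal(scale=1.0, size=2)   # noisy detections
        x, P = kf_step(x, P, z)
    print(x)   # estimated position and velocity, close to (40, 20) and (2, 1)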

Challenges

Limitations in Real-World Scenarios

Motion estimation algorithms encounter significant challenges in real-world scenarios due to occlusions and disocclusions, where parts of objects are temporarily hidden or newly revealed during motion. In such cases, motion vectors become unreliable at object boundaries because matching correspondences fail, leading to erroneous flow estimates in affected regions. A common detection method involves forward-backward error analysis, which identifies occlusions by comparing the forward-warped positions from one frame to the next with backward-warped positions, flagging inconsistencies as occluded areas. Disocclusions, often arising from dynamic object movements, exacerbate this issue by introducing newly visible regions without prior matching points, further degrading estimation accuracy in video sequences. Illumination variations pose another critical limitation by violating the fundamental brightness constancy assumption underlying many optical flow methods, which posits that pixel intensity remains unchanged under motion. Changes in lighting, such as shadows or global illumination shifts, introduce mismatches in intensity-based similarity measures, resulting in inaccurate displacement estimates. To mitigate this, normalized cross-correlation is employed as a robust similarity metric, which normalizes patch intensities to reduce sensitivity to linear illumination changes, though it does not fully resolve non-linear variations. Large or non-rigid motions amplify the aperture problem, where local gradient-based methods can only reliably estimate motion components perpendicular to edges, leading to ambiguous directions parallel to contours and higher failure rates overall. In non-rigid scenarios, such as deformable objects, this ambiguity propagates, causing widespread estimation errors. For instance, on the Middlebury optical flow dataset, algorithms exhibit endpoint error rates exceeding 20% in sequences featuring large displacements and outdoor scenes with complex, non-rigid elements like waving flags or moving crowds. Noise and aliasing further impair motion estimation by corrupting gradient computations essential to differential techniques, introducing spurious local minima in the optimization process and biasing flow vectors. Image noise amplifies uncertainties in spatial and temporal derivatives, while aliasing from undersampling high-frequency motions distorts gradient accuracy, particularly in low-texture areas. In robotics applications, these effects are quantified using absolute trajectory error (ATE), where noisy estimates can increase ATE by factors of 2-5 compared to ideal conditions, as seen in visual odometry benchmarks on datasets like KITTI. Computational bottlenecks limit the practicality of motion estimation in resource-constrained environments, such as mobile devices, where dense algorithms demand high processing power for real-time performance. Without hardware acceleration like GPUs, many methods are capped at around 30 frames per second (fps) for standard resolutions, falling short for high-frame-rate or interactive applications, due to the growth of the search space and iterative optimizations.
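
The forward-backward consistency check used to flag occluded regions can be sketched in Python with NumPy (the constant synthetic flows, nearest-neighbour lookup, and 0.5-pixel threshold are assumptions for the demonstration, not output of an actual estimator).

    import numpy as np

    def warp_flow(flow, displacement):
        # Look up the backward flow at the position each pixel maps to (nearest neighbour).
        h, w, _ = flow.shape
        yy, xx = np.mgrid[0:h, 0:w]
        ys = np.clip(np.round(yy + displacement[..., 1]).astype(int), 0, h - 1)
        xs = np.clip(np.round(xx + displacement[..., 0]).astype(int), 0, w - 1)
        return flow[ys, xs]

    def occlusion_mask(forward, backward, threshold=0.5):
        # A pixel is flagged when the forward flow plus the backward flow at its target disagree.
        residual = forward + warp_flow(backward, forward)
        return np.linalg.norm(residual, axis=-1) > threshold

    h, w = 32, 32
    forward = np.full((h, w, 2), [3.0, 0.0])      # everything moves 3 px to the right
    backward = np.full((h, w, 2), [-3.0, 0.0])    # and moves back again
    backward[:, :8] = [5.0, 0.0]                  # an inconsistent strip, e.g. a disocclusion
    mask = occlusion_mask(forward, backward)
    print(mask.sum(), "pixels flagged")           # the strip whose round trip does not return (160 pixels)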

Emerging Solutions

Emerging solutions in motion estimation are addressing longstanding challenges in accuracy, robustness, and efficiency by integrating multi-sensor data, advanced learning paradigms, and uncertainty-aware frameworks, particularly as of 2025. These innovations build on classical and deep learning methods to enhance performance in dynamic environments, such as robotics and augmented reality applications. For instance, recent 2025 benchmarks such as HuPerFlow, alongside hybrid approaches that mitigate sensor-specific limitations like visual occlusions or inertial drift, demonstrate continued progress. Hybrid sensor fusion techniques combine visual data with complementary modalities like LiDAR and inertial measurement units (IMUs) to achieve more reliable motion estimation. In visual-inertial odometry (VIO) systems for AR glasses, such as those used in mobile augmented reality, camera feeds are fused with IMU accelerations and angular velocities to track head movements with high precision even under rapid rotations or low-light conditions. Graph-based optimization methods, which model sensor measurements as nodes and constraints in a factor graph, further refine these estimates by jointly optimizing poses across point clouds, visual features, and IMU biases, reducing drift errors by up to 40% in urban navigation scenarios. Similarly, Kalman filter variants, including extended and unscented forms, enable real-time fusion of GNSS, LiDAR, vision, and IMU data for robust vehicle pose estimation, maintaining accuracy within 0.5 meters in GPS-denied environments. These approaches, exemplified in systems like Super Odometry, demonstrate how multi-sensor integration enhances motion estimation for applications requiring seamless AR overlays. Self-supervised learning has gained traction for motion estimation by leveraging unlabeled data through photometric and contrastive losses, reducing reliance on costly annotations. In this paradigm, models learn representations by contrasting positive motion pairs (e.g., temporally adjacent frames) against negative ones, enabling end-to-end training for flow prediction with minimal supervision. Advances in 2025 particularly highlight event-based cameras, which use neuromorphic sensors to capture asynchronous brightness changes for high-speed motion, achieving latencies under 1 ms in dynamic scenes like drone flight. For example, EV-FlowNet employs self-supervised photometric consistency losses on event streams to estimate dense optical flow, competitive with supervised frame-based methods in high-dynamic-range conditions without labeled data. Recent extensions, such as unsupervised joint learning frameworks for event cameras and methods like ESMD for simultaneous motion and depth estimation, further integrate image reconstruction with flow estimation using contrastive objectives, facilitating deployment on resource-constrained neuromorphic hardware for real-time tracking. Uncertainty estimation in deep learning models for motion estimation is advancing through Bayesian techniques, providing probabilistic outputs crucial for safe operation. Bayesian normalizing flows model the posterior distribution over motion parameters, allowing efficient sampling of possible trajectories to quantify aleatoric and epistemic uncertainties in predictions. Monte Carlo dropout, a practical approximation to Bayesian inference, applies dropout at inference time to generate multiple predictions whose variance yields uncertainty measures; this has been applied to networks like FlowNet to flag unreliable estimates in occluded regions, enhancing reliability in applications such as robot navigation in cluttered environments.
In robotics contexts, frameworks combining evidential deep learning with Bayesian methods estimate motion uncertainties for tasks like obstacle avoidance, enabling adaptive planning that avoids high-variance paths. These methods, as surveyed in recent works, emphasize scalable Bayesian approximations to ensure trustworthy motion predictions without excessive computational overhead. Scalable architectures for edge computing are optimizing motion estimation models through quantization, enabling deployment on low-power devices like mobile robots. Quantization reduces model precision from 32-bit floats to 8-bit or lower integers, compressing parameters while preserving accuracy for real-time inference. For instance, quantization techniques applied to optical flow models, such as variants of FlowNet, can achieve substantial parameter reductions while maintaining accuracy on benchmarks like KITTI, facilitating 30+ fps operation on edge hardware with low power draw. Fusion-FlowNet exemplifies energy-aware design by fusing event-based and frame-based inputs, significantly reducing energy consumption for optical flow estimation in applications like autonomous driving while preserving accuracy. Such optimizations, including quality-scalable multipliers, help motion estimation remain viable for battery-constrained applications without retraining from scratch. Research frontiers in motion estimation explore quantum-inspired optimization for global methods and ethical considerations in surveillance applications. Quantum annealing-inspired algorithms solve combinatorial optimization problems in motion segmentation, partitioning video frames into coherent motion clusters more efficiently than classical solvers, with speedups of 10-100x on NP-hard instances using adiabatic quantum computing frameworks. These techniques extend to motion estimation by minimizing energy functions over large search spaces, promising breakthroughs in multi-object tracking. On the ethical front, bias in motion tracking for surveillance—such as racial or gender disparities in detection—raises concerns about perpetuating inequalities, with studies as of 2025 recommending diverse datasets and fairness audits to mitigate discriminatory outcomes in public monitoring systems. Frameworks for ethical surveillance emphasize transparency and accountability in tracking algorithms to balance security benefits with privacy rights, addressing biases that amplify societal inequities.
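
The Monte Carlo dropout idea referenced above can be illustrated with a toy NumPy example (everything here is an assumption for demonstration: the "flow head" is a fixed random linear layer rather than a trained network, and the inputs are arbitrary): running the stochastic forward pass many times and taking the spread of the outputs as an uncertainty estimate.

    import numpy as np

    rng = np.random.default_rng(7)
    W1 = rng.normal(size=(64, 128))      # hypothetical hidden-layer weights
    W2 = rng.normal(size=(128, 2))       # hypothetical head predicting (u, v)
    features = rng.normal(size=64)       # stand-in for features of one pixel or patch

    def stochastic_forward(x, drop_p=0.3):
        h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
        mask = rng.random(h.shape) > drop_p          # dropout kept active at inference time
        h = h * mask / (1.0 - drop_p)
        return h @ W2

    samples = np.stack([stochastic_forward(features) for _ in range(100)])
    mean_flow = samples.mean(axis=0)
    uncertainty = samples.std(axis=0)    # large values would flag unreliable estimates
    print(mean_flow, uncertainty)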

  53. [53]
    Crowdsource Drone Imagery – A Powerful Source for the 3D ...
    Dec 28, 2020 · In this paper, we propose the idea of using crowdsource drone images and videos which are captured by amateurs for the documentation of heritage sites.
  54. [54]
    Direct Iterative Closest Point for real-time visual odometry
    Abstract: In RGB-D sensor based visual odometry the goal is to estimate a sequence of camera movements using image and/or range measurements.
  55. [55]
    Understanding Iterative Closest Point (ICP) Algorithm with Code
    Apr 30, 2025 · Iterative Closest Point (ICP) is a widely used classical computer vision algorithm for 2D or 3D point cloud registration.
  56. [56]
    ORB-SLAM3: An Accurate Open-Source Library for Visual ... - arXiv
    Jul 23, 2020 · This paper presents ORB-SLAM3, the first system able to perform visual, visual-inertial and multi-map SLAM with monocular, stereo and RGB-D cameras.
  57. [57]
    Loop Closure Detection for Monocular Visual Odometry - IEEE Xplore
    In order to decrease monocular visual odometry drift by detecting loop closure, this paper presents a comparison between state of the art, 2-channel and ...
  58. [58]
    Multi-Object Tracking Using Kalman Filter and Historical Trajectory ...
    We propose a multi-object tracking using Kalman filter and historical trajectory correction for surveillance videos.
  59. [59]
    FusionSORT: Fusion Methods for Online Multi-object Visual Tracking
    May 10, 2025 · In our tracker, we use Kalman filter (KF) [8] with a constant-velocity model for motion estimation of object tracklets in the image plane, ...
  60. [60]
    Visual multi-object tracking with re-identification and occlusion ...
    This paper proposes an online visual multi-object tracking (MOT) algorithm that resolves object appearance–reappearance and occlusion.
  61. [61]
    Deep Learning Based Real-Time Object Detection on Jetson Nano ...
    Aug 6, 2025 · Deep Learning Based Real-Time Object Detection on Jetson Nano Embedded GPU. June 2023; Lecture Notes in Electrical Engineering. DOI:10.1007/978 ...
  62. [62]
    Vision-Based Embedded System for Noncontact Monitoring ... - arXiv
    Sep 2, 2025 · We introduce an embedded monitoring system that utilizes a quantized MobileNet model deployed on a Raspberry Pi for real-time behavioral state ...
  63. [63]
    Kitti Odometry Dataset - Andreas Geiger
    For this benchmark you may provide results using monocular or stereo visual odometry, laser-based SLAM or algorithms that combine visual and LIDAR information.
  64. [64]
    An unsupervised video anomaly detection method via Optical Flow ...
    Our proposed method, OFST, combines optical flow reconstruction and video frame prediction to improve video anomaly detection. OFST is composed of two modules, ...
  65. [65]
    Statistical Modeling of Long-Range Drift in Visual Odometry
    Aug 7, 2025 · This paper models the drift as a combination of wide-band noise and a first-order Gauss-Markov process, and analyzes it using Allan variance.
  66. [66]
    [PDF] Past Research, State of Automation Technology, and ... - NHTSA
    Advanced Research Projects Agency (DARPA) challenges (e.g., Montemerlo et al., 2008; Umson et ... 13.08 s and 15.60 s to complete primarily visual and combined ...
  67. [67]
    RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented ...
    We also compared the ability to eliminate outliers in visual observations using IMU pre-integration predicted poses. ... SLAM system ORB-SLAM3 and recent DynaVINS ...
  68. [68]
  69. [69]
    [PDF] Super Odometry: A Robust LiDAR-Visual-Inertial Estimator for ...
    We propose Super Odometry, a high-precision multi-modal sensor fusion framework, providing a simple but effective way to fuse multiple sensors.Missing: 2020s | Show results with:2020s
  70. [70]
    Camera, LiDAR, and IMU Based Multi-Sensor Fusion SLAM: A Survey
    Sep 22, 2023 · This paper can be considered as a brief guide to newcomers and a comprehensive reference for experienced researchers and engineers to explore ...
  71. [71]
    CVPR Poster On-Device Self-Supervised Learning of Low-Latency ...
    Online, on-device learning allows robots to “train in their test environment”. We improve the time and memory efficiency of the self-supervised contrast ...Missing: contrastive | Show results with:contrastive
  72. [72]
    Self-Supervised Optical Flow Estimation for Event-based Cameras
    Feb 19, 2018 · We present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event based cameras.Missing: contrastive motion
  73. [73]
    [PDF] Unsupervised Joint Learning of Optical Flow and Intensity with Event ...
    Event cameras rely on motion to obtain information about scene appearance. This means that appearance and motion are inherently linked: either both are ...
  74. [74]
    Using Bayesian deep learning approaches for uncertainty-aware ...
    Bayesian methods can quantify that uncertainty, and deep learning models exist that follow the Bayesian paradigm. These models, namely Bayesian neural ...Missing: flows motion
  75. [75]
    Representing Model Uncertainty in Deep Learning - arXiv
    Jun 6, 2015 · In this paper we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian ...Missing: flows motion robotics
  76. [76]
    [PDF] A General Framework for Uncertainty Estimation in Deep Learning
    In this paper, we propose a novel framework for uncertainty estimation of deep neural network predictions. By combin- ing Bayesian belief networks [5], [6], [7] ...
  77. [77]
    A survey of uncertainty in deep neural networks
    Jul 29, 2023 · This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges,
  78. [78]
    Quality Scalable Quantization Methodology for Deep Learning on ...
    Jul 15, 2024 · The methodology uses 3-bit parameter compression and quality scalable multipliers to reduce energy and size of CNNs for edge computing, with on ...Missing: motion FlowNet
  79. [79]
    Energy-Efficient Optical Flow Estimation using Sensor Fusion and ...
    In addition to accurately recovering the motion parameters of the problem, our framework produces motion-corrected edge-like images with high dynamic range ...
  80. [80]
    100FPS@1W Dense Optical Flow For Tiny Mobile Robots - arXiv
    Nov 21, 2024 · In this paper, we propose EdgeFlowNet, a high-speed, low-latency dense optical flow approach for tiny autonomous mobile robots by harnessing the ...Missing: FlowNet | Show results with:FlowNet
  81. [81]
    [PDF] Quantum Motion Segmentation
    This paper introduces the first algorithm for motion segmentation that relies on adiabatic quantum optimization of the objective function. The proposed method ...
  82. [82]
    (PDF) Ethical Considerations in AI-Powered Surveillance Systems
    Oct 29, 2024 · This paper examines the moral implications of AI-driven surveillance, highlighting tensions between national security, public safety, and individual privacy.
  83. [83]
    Legal and ethical implications of AI-based crowd analysis - NIH
    While AI offers promise in analysing crowd dynamics and predicting escalations, its deployment raises significant ethical concerns, regarding privacy, bias, ...