
Visual odometry

Visual odometry (VO) is a technique used to estimate the egomotion—position and orientation—of a camera or robot relative to its previous positions by analyzing sequential images captured from one or more cameras. This method computes incremental motion estimates over short distances, providing a relative trajectory without requiring external references like GPS, and serves as an alternative to traditional sensors such as wheel encoders, which can fail in challenging terrains like slippery surfaces or uneven ground. VO operates in real time on resource-constrained platforms and is fundamental for tasks requiring precise localization, though it accumulates errors over time that necessitate integration with other systems, such as simultaneous localization and mapping (SLAM), for long-term accuracy.

The origins of visual odometry trace back to the early 1980s, when Hans Moravec developed pioneering stereo vision-based motion estimation for planetary rovers as part of NASA's exploration programs, demonstrating the feasibility of using camera images for obstacle avoidance and navigation on rough terrain. The term "visual odometry" was formally coined in 2004 by Nistér and colleagues, who introduced a robust system for ground vehicles using feature tracking and multi-frame motion estimation to achieve real-time performance with a single or stereo camera. This work revived academic interest, building on earlier applications, and VO was deployed operationally on the Mars Exploration Rovers Spirit and Opportunity, where it served as a primary safety mechanism used in approximately 80% of drives to enable safer traversal of extraterrestrial landscapes during the missions, which extended beyond two years for Spirit and over a decade for Opportunity. VO has continued to be integral in subsequent missions, including the Curiosity (landed 2012) and Perseverance (landed 2021) rovers, with enhancements such as real-time visual odometry processing during drives as of 2025.

VO systems are categorized by sensor configuration and algorithmic approach, with monocular VO relying on a single camera for scale-ambiguous estimates, stereo VO using dual cameras for depth and metric scale, and RGB-D variants incorporating depth sensors for enhanced robustness in low-texture environments. Algorithmically, feature-based methods, such as those employing corner detection and descriptor matching for sparse point tracking, dominate for their efficiency and accuracy in textured scenes, while direct methods optimize over pixel intensities for dense alignment, excelling in uniform areas but demanding more computation. Recent advances integrate deep learning for end-to-end pose regression and feature learning, improving resilience to challenges like rapid motion, varying illumination, and occlusions, though traditional geometric pipelines remain prevalent for their interpretability and low latency.

Applications of VO span autonomous driving, where it fuses with inertial sensors for robust vehicle localization in urban settings; aerial and underwater robotics, enabling drones and submersibles to navigate GPS-denied environments; and augmented reality, supporting head-mounted displays for stable virtual overlays. Despite its successes, VO faces limitations from drift accumulation, sensitivity to dynamic objects, and computational demands, driving ongoing research toward hybrid and learning-based enhancements for broader deployment in safety-critical systems.

Introduction

Definition and Core Principles

Visual odometry (VO) is the process of estimating the egomotion—changes in position and orientation—of an agent, such as a robot or vehicle, using sequential images captured by one or more cameras attached to it. The term was coined in 2004 to describe this vision-based approach to motion estimation, analogous to wheel odometry but relying solely on visual cues rather than mechanical sensors. VO operates by analyzing the apparent motion of image features or pixel intensities between consecutive frames to infer the camera's trajectory in three-dimensional space.

At its core, VO relies on geometric principles to interpret image motion, such as optical flow—the pattern of apparent motion of objects in a visual scene caused by relative motion between the observer and the scene—and epipolar geometry, which constrains possible correspondences between points in stereo or sequential images. These principles enable the recovery of relative camera poses and, in some configurations, sparse 3D structure of the environment, without requiring external references like GPS. Unlike simultaneous localization and mapping (SLAM), which builds and maintains a global map with loop closure for long-term consistency, VO emphasizes short-term, incremental motion estimation focused on local trajectory accuracy, trading global optimization for computational efficiency and real-time performance.

The basic workflow of VO typically involves four main stages: acquiring synchronized image sequences from the camera(s); extracting and tracking salient features (e.g., corners or edges) or analyzing pixel intensities across frames; generating motion hypotheses by solving geometric constraints such as the essential matrix for rotation and translation; and refining the estimated trajectory through techniques such as pose graph optimization or bundle adjustment to minimize accumulated errors. This process assumes a textured, static environment with sufficient lighting and inter-frame overlap to ensure reliable correspondences.

Key benefits of VO include its low cost and passive nature, as it uses widely available cameras without emitting signals, making it suitable for resource-constrained platforms. It excels in GPS-denied environments, such as indoors, tunnels, or planetary surfaces, where traditional navigation fails, achieving relative position errors of 0.1% to 2% over traveled distances in favorable conditions.
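
As a concrete illustration of the feature-based variant of this workflow, the sketch below uses OpenCV to estimate frame-to-frame motion from ORB features and the essential matrix. It is a minimal sketch under stated assumptions: the intrinsic matrix K holds placeholder values, image loading is left to the caller, and the recovered translation is only defined up to scale in the monocular case.

```python
import cv2
import numpy as np

# Hypothetical calibrated pinhole intrinsics (placeholder values).
K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])

orb = cv2.ORB_create(2000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def relative_pose(img_prev, img_curr):
    """Estimate the relative rotation R and unit-scale translation t between two frames."""
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Essential matrix with RANSAC outlier rejection (epipolar constraint).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Cheirality (positive-depth) check disambiguates the four (R, t) decompositions.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t is only known up to scale in the monocular case

# The trajectory is accumulated by chaining the relative transforms frame by frame.
```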

Historical Development

The foundations of visual odometry (VO) trace back to the late 1970s and early 1980s, when researchers began exploring vision-based navigation for planetary rovers. Hans Moravec's work in the early 1980s, motivated by NASA's interest in autonomous exploration, introduced early concepts of using camera images to estimate robot motion on extraterrestrial surfaces, laying groundwork for what would later be formalized as VO. Concurrently, advancements in optical flow estimation, such as the Lucas-Kanade method developed in 1981, provided essential tools for tracking image features to infer egomotion, influencing subsequent VO pipelines. By the 1990s, structure from motion (SfM) techniques further evolved these ideas, enabling 3D reconstruction from sequential images, though real-time applications remained limited by computational constraints.

The term "visual odometry" was formally coined in 2004 by Nistér and colleagues, who presented the first real-time system capable of estimating camera motion from a single or stereo camera over previously unseen distances, marking a pivotal milestone in the field. That same year, NASA's Mars Exploration Rovers Spirit and Opportunity deployed VO in extraterrestrial environments, using feature tracking in stereo image pairs to correct wheel odometry errors on slippery Martian terrain and enabling drives of up to several hundred meters with sub-meter accuracy. The 2000s also saw VO gain traction in terrestrial robotics, accelerated by challenges like the DARPA Grand Challenge (2004–2005) and Urban Challenge (2007), which spurred research in autonomous vehicles and robust navigation in unstructured environments. A comprehensive survey by Davide Scaramuzza and Friedrich Fraundorfer in 2011 synthesized three decades of progress, highlighting feature-based pipelines and their evolution from offline SfM to online, real-time systems.

The 2010s brought refinements and expansions, with ORB-SLAM in 2015 introducing a versatile feature-based system that integrated loop closure for improved long-term accuracy in diverse environments. In 2017, Direct Sparse Odometry (DSO) advanced direct methods by optimizing photometric errors over sparse image points, achieving high precision without explicit feature extraction and running in real time on standard hardware. Open-source frameworks like OpenVINS, emerging around 2018 and formalized in 2020, democratized visual-inertial odometry research by providing modular, filter-based estimators for monocular and stereo setups.

Advancements in the 2020s have further integrated deep learning and neuromorphic sensing for enhanced robustness. Building on early works like DeepVO (2017), recent methods as of 2025, such as LEAP-VO (2024), employ attention-based refiners for long-term effective point tracking in VO, improving accuracy in challenging scenes. Similarly, RWKV-VIO (2025) introduces efficient visual-inertial odometry using recurrent weighted key-value networks for low-drift pose estimation with reduced computational demands. Post-2020 developments in event-based VO continue to leverage dynamic vision sensors for high-speed and low-light applications, as in pipelines from the University of Zurich's Robotics and Perception Group.

Sensor Configurations

Monocular Visual Odometry

Monocular visual odometry employs a single camera, typically a standard perspective or omnidirectional camera, to capture sequential images and estimate the camera's egomotion by analyzing relative displacements of scene features across frames. This setup relies on the fundamental principles of multiple-view geometry, where the camera's pose is inferred from correspondences between observed image points and their projected 3D positions in the environment. Unlike multi-camera systems, it processes sequences without requiring baseline separation, making it computationally lightweight for real-time operation.

The primary advantages of monocular visual odometry stem from its simplicity and minimal hardware requirements, utilizing a low-cost, off-the-shelf camera that occupies little space and power. This configuration is particularly well-suited for resource-constrained platforms such as drones, where weight and power consumption are critical, and wearable devices for augmented reality applications. Its ease of deployment enables broad accessibility in mobile robotics without the need for complex calibration of multiple sensors.

A core challenge in monocular visual odometry is scale ambiguity, arising because a single viewpoint provides no direct metric information about absolute distances; the estimated trajectory is only recoverable up to an unknown scale factor, preventing accurate reconstruction of the environment's true size without additional cues like prior knowledge or motion models. Initialization poses another hurdle, often requiring an initial translation with sufficient parallax or otherwise known motion to establish a baseline for triangulation, as pure rotation produces degenerate configurations in which depth cannot be resolved. Over extended sequences, errors accumulate due to the absence of explicit depth measurements, resulting in drift; in favorable conditions with textured environments and controlled lighting, typical systems achieve 1-2% relative pose error per 100 meters of travel, though this degrades rapidly in low-texture or dynamic scenes.

Prominent example systems include Parallel Tracking and Mapping (PTAM), introduced in 2007 for augmented reality, which separates tracking and mapping into parallel threads to enable real-time monocular pose estimation in small workspaces using feature-based methods. Early implementations for mobile robots, such as those adapting SLAM techniques, demonstrated feasibility in indoor navigation but highlighted the need for loop closure to mitigate drift. These systems underscore monocular odometry's role in pioneering lightweight, camera-only localization.
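
To make the scale ambiguity concrete, the following sketch shows how a monocular pipeline's unit-norm translation might be rescaled using an assumed external cue (here a hypothetical speed measurement); the function and variable names are illustrative only, not part of any specific system.

```python
import numpy as np

def apply_metric_scale(t_unit, speed_mps, dt):
    """Rescale a unit-norm monocular translation using an assumed speed cue.

    Monocular VO recovers only the direction of translation (||t|| is
    conventionally normalized to 1), so an external measurement such as wheel
    speed, a known camera height, or an IMU is needed for metric displacement.
    """
    direction = t_unit / np.linalg.norm(t_unit)
    return direction * (speed_mps * dt)

# Example: a 0.1 s frame interval at 2 m/s yields a 0.2 m step along t's direction.
step = apply_metric_scale(np.array([0.0, 0.0, 1.0]), speed_mps=2.0, dt=0.1)
```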

Stereo and RGB-D Visual Odometry

Stereo visual odometry employs a pair of synchronized cameras separated by a known baseline to capture two offset views of the scene, enabling depth estimation through disparity computation between corresponding image points. This setup leverages epipolar geometry to match features across the stereo pair, yielding 3D points via triangulation based on the disparity and camera intrinsics. In contrast, RGB-D sensors integrate an RGB camera with a depth-sensing mechanism, such as structured light or time-of-flight, exemplified by the Microsoft Kinect, which projects infrared patterns to directly measure per-pixel depths alongside color information.

A primary advantage of stereo and RGB-D configurations is the provision of direct metric-scale depth measurements, eliminating the scale ambiguity inherent in monocular systems and enabling metric pose estimation without additional sensors. The fixed baseline in stereo cameras or explicit depth values in RGB-D setups facilitate robust handling of pure translational motions, where monocular methods often fail due to insufficient geometric constraints. Furthermore, these systems perform better in low-texture environments, as depth data supports dense or semi-dense tracking even when sparse features are scarce.

The typical processing pipeline begins with stereo disparity estimation, often using block-matching or semi-global matching algorithms to generate a disparity map, which is then converted to 3D points through triangulation using the camera baseline and intrinsics. These points are tracked across consecutive frames via feature matching or direct alignment, with initial pose hypotheses derived from essential matrix estimation or 3D point registration, providing robustness against scale drift by anchoring estimates in metric space. For RGB-D, the pipeline similarly projects depth-augmented pixels into 3D points and aligns them frame-to-frame, often incorporating color information for refinement.

Key challenges include precise calibration of intrinsic parameters for both cameras and extrinsic alignment of the stereo baseline or RGB-D components, as inaccuracies propagate errors in depth computation. Real-time disparity or depth processing demands significant computational resources, limiting deployment on resource-constrained platforms without optimized hardware. RGB-D systems, while benefiting from infrared depth, remain sensitive to lighting variations that affect pattern projection or RGB feature detection, particularly in outdoor or high-dynamic-range scenes.

Prominent implementations include NASA's stereo visual odometry system deployed on the Mars Exploration Rovers in 2004, which processed camera pairs to estimate rover motion across challenging Martian terrain, achieving sub-meter accuracy over hundreds of meters. For RGB-D, the KinectFusion framework introduced in 2011 demonstrated real-time dense surface mapping and odometry using depth data, enabling interactive 3D reconstruction in indoor environments with millimeter-level precision. These systems can also be fused with inertial measurements for enhanced robustness in dynamic conditions, though such integration is detailed in visual-inertial odometry approaches.
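
The disparity-to-depth step of this pipeline can be sketched with OpenCV's semi-global block matcher. This is a minimal sketch under stated assumptions: the calibration values (fx, cx, cy, baseline) are placeholders, a rectified grayscale image pair is assumed, and fy is taken equal to fx.

```python
import cv2
import numpy as np

# Placeholder calibration: focal length fx (pixels), principal point, baseline B (meters).
fx, cx, cy, baseline = 718.856, 607.19, 185.22, 0.54

stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=9)

def depth_map(left_gray, right_gray):
    """Semi-global matching disparity converted to metric depth via Z = fx * B / d."""
    disp = stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0
    return np.where(disp > 0, fx * baseline / disp, 0.0)

def backproject(u, v, Z):
    """Pinhole back-projection of pixel (u, v) with depth Z to a 3D camera-frame point."""
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fx   # assuming fy == fx for this sketch
    return np.array([X, Y, Z])
```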

Visual-Inertial and Event-Based Odometry

Visual-inertial odometry (VIO) integrates data from cameras and inertial measurement units (IMUs), which typically include accelerometers and gyroscopes, to estimate the pose and velocity of a moving agent. The IMU provides high-frequency measurements of linear acceleration and angular velocity, enabling short-term motion prediction and stabilization of the estimate during periods of visual degradation, such as rapid camera motion or feature scarcity. This fusion leverages the complementary strengths of visual observations, which offer rich environmental information for long-term accuracy, and inertial data, which ensure continuity in challenging conditions.

VIO systems exhibit several key advantages over pure visual odometry, including improved robustness to fast motions where frame-based cameras may fail to capture sufficient features, as the IMU maintains tracking through preintegration of measurements between visual keyframes. They also handle occlusions and textureless areas more effectively by relying on inertial predictions to bridge gaps in visual input, reducing drift during temporary sensor outages. Additionally, VIO provides metric scale estimation without external references, achieved through observation of the gravity vector in accelerometer data, enabling absolute scale recovery even in monocular setups.

Event-based odometry employs dynamic vision sensors (DVS), also known as event cameras, which asynchronously record per-pixel brightness changes as discrete events rather than full frames, achieving temporal resolutions on the order of microseconds. This paradigm suits high-speed scenarios, such as agile drone flight or vehicular navigation, where traditional cameras suffer from motion blur or low frame rates, allowing event streams to capture fine-grained motion details for precise egomotion estimation. When combined with inertial data in visual-inertial variants, event cameras enhance robustness by providing dense, low-latency inputs that complement the IMU's proprioceptive measurements.

Despite these benefits, both VIO and event-based odometry face significant challenges in calibration and processing. Synchronization between visual or event data and IMU timestamps is critical, as misalignment from varying camera-IMU time offsets can introduce estimation errors, necessitating online calibration techniques. For event-based systems, noise filtering in the asynchronous event stream—arising from sensor hotspots or transient illumination changes—is essential to avoid spurious features, often requiring adaptive thresholding or clustering methods. Moreover, the high volume of events demands substantial computational resources for processing, prompting optimizations like selective accumulation or voxel-based representations to manage load without sacrificing accuracy.

Prominent example systems include VINS-Mono, a monocular VIO framework that uses tightly coupled optimization to fuse visual features and IMU preintegration, demonstrating robustness in real-world aerial and handheld applications with low drift rates on public benchmarks. For event-based odometry, EVO employs a geometric approach to track 6-DOF motion from event streams via parallel tracking and mapping, excelling in high-dynamic-range environments and during rapid rotations. These methods have found practical use in drone navigation; for instance, VIO integration into the PX4 autopilot stack since 2018 has enabled GPS-denied flight in dynamic indoor settings, achieving reliable state estimation for autonomous control.
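
The role of the IMU between camera frames can be illustrated with a simple dead-reckoning propagation step. This is only a sketch: real VIO systems use on-manifold preintegration with online bias estimation, and the gravity constant, sample timing, and absence of bias handling here are simplifying assumptions.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # assumed world-frame gravity (m/s^2)

def so3_exp(w):
    """Rodrigues formula: rotation matrix from a rotation vector w (axis * angle)."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def propagate(R, p, v, gyro, accel, dt):
    """Integrate one bias-corrected IMU sample to predict orientation, position, velocity."""
    R_new = R @ so3_exp(gyro * dt)          # body-frame angular rate
    a_world = R @ accel + GRAVITY           # rotate specific force, add gravity
    v_new = v + a_world * dt
    p_new = p + v * dt + 0.5 * a_world * dt**2
    return R_new, p_new, v_new
```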

Methods and Approaches

Feature-Based Methods

Feature-based methods in visual odometry rely on detecting and tracking discrete keypoints, or features, across consecutive image frames to estimate camera motion through geometric correspondences. These approaches extract salient points such as corners or blobs using detectors like the scale-invariant feature transform (SIFT), introduced in 1999, which identifies scale- and rotation-invariant keypoints by detecting extrema in a difference-of-Gaussians scale space. Later, the Oriented FAST and Rotated BRIEF (ORB) detector, proposed in 2011, offered a faster alternative by combining the FAST corner detector with a binary descriptor for rotation invariance, making it suitable for real-time applications. Tracking occurs either by descriptor matching, where features are compared using similarity metrics like Hamming distance for binary descriptors, or via optical flow methods to predict feature locations in subsequent frames. Pose estimation then derives from 2D-2D or 2D-3D correspondences, reconstructing the camera's egomotion via epipolar geometry or Perspective-n-Point (PnP) solutions.

The typical pipeline begins with feature extraction in each frame, selecting a sparse set of robust keypoints to reduce computational load. Matching identifies correspondences between frames, often employing the random sample consensus (RANSAC) algorithm to reject outliers by iteratively estimating a model from random subsets and selecting the one with the most inliers. For uncalibrated cameras, the fundamental matrix is computed from these matches to enforce epipolar constraints, while calibrated systems use the essential matrix to recover relative rotation and translation up to scale. Triangulation then projects matched 2D points into 3D landmarks, enabling bundle adjustment for refined pose and map optimization over multiple frames. This sparse representation contrasts with dense pixel-based methods by focusing on geometric reliability rather than photometric consistency.

A key advantage of feature-based methods is their invariance to moderate lighting variations, achieved through normalized descriptors like SIFT's gradient histograms or ORB's binary tests, which maintain distinctiveness across illuminations. Their sparse nature also enables efficient processing, with low-dimensional representations allowing real-time operation on resource-constrained hardware, unlike denser alternatives that demand intensive optimization. However, these methods struggle in low-texture environments, such as uniform walls or skies, where insufficient keypoints lead to tracking failures and drift accumulation. They are also sensitive to motion blur in high-speed scenarios, as blurred images degrade feature detection and matching accuracy, potentially causing outliers to dominate RANSAC iterations. Repetitive structures, like grids or periodic patterns, further complicate unique correspondence establishment.

Seminal implementations include Parallel Tracking and Mapping (PTAM), developed in 2007, which pioneered real-time feature-based tracking by separating tracking and mapping into parallel threads using corner features and bundle adjustment for small augmented-reality workspaces. The ORB-SLAM series, starting with the 2015 version, extended this to a versatile feature-based system supporting monocular, stereo, and RGB-D inputs, achieving loop closure and relocalization through a bag-of-words model on ORB descriptors. Subsequent iterations, like ORB-SLAM2 in 2017 and ORB-SLAM3 in 2021, enhanced multi-map management and visual-inertial fusion while maintaining real-time performance, such as 30 frames per second on embedded platforms like the NVIDIA Jetson TX2 through optimized CPU-GPU data flows.
These systems demonstrate robustness in textured indoor and outdoor scenes, with ORB-SLAM reporting low absolute trajectory errors (e.g., RMSE of 0.01-0.05 m on many TUM RGB-D sequences).
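
For the 2D-3D stage of this pipeline, a minimal sketch using OpenCV's RANSAC-based Perspective-n-Point solver is shown below; the landmark and keypoint arrays are assumed to be pre-matched, and the threshold and iteration values are illustrative rather than taken from any particular system.

```python
import cv2
import numpy as np

def estimate_pose_pnp(landmarks_3d, keypoints_2d, K):
    """Robust camera pose from matched 3D landmarks (Nx3) and 2D keypoints (Nx2)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        landmarks_3d.astype(np.float32),
        keypoints_2d.astype(np.float32),
        K, distCoeffs=None,
        reprojectionError=2.0, iterationsCount=100, confidence=0.999)
    if not ok:
        raise RuntimeError("PnP failed: too few inliers")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec, inliers
```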

Direct and Semi-Direct Methods

Direct methods in visual odometry estimate camera motion by directly minimizing the photometric error, which measures differences in pixel intensities between consecutive frames, under the brightness constancy assumption that pixel intensity remains unchanged across small motions. This approach enables dense or semi-dense alignment, utilizing either all pixels (dense) or high-gradient pixels (semi-dense) for pose estimation, making it particularly effective in environments lacking distinct features. Semi-direct methods bridge the gap between direct and feature-based techniques by combining sparse keypoints with direct optimization on pixel intensities, for example tracking features through photometric alignment rather than traditional descriptor matching. Such methods first perform sparse image alignment to obtain an initial pose estimate and then refine it using photometric error minimization on selected pixels around keypoints, enhancing efficiency while retaining intensity-based accuracy.

The typical pipeline for both direct and semi-direct methods involves frame-to-frame image alignment through iterative optimization, often employing Gauss-Newton methods to solve for pose parameters by linearizing the photometric error around the current estimate. To ensure robustness to large motions and varying scales, multi-resolution processing via image pyramids is commonly used, starting at coarser levels and refining at finer ones; this is complemented by techniques like Levenberg-Marquardt damping in semi-direct variants when needed.

These methods offer advantages in low-texture or untextured areas where feature-based approaches falter, as they leverage broader image information for higher-density point clouds and improved accuracy in pose estimation. However, they are sensitive to illumination variations, which violate the brightness constancy assumption and can introduce significant drift, and they demand higher computational resources due to the intensity-based optimization over larger pixel sets.

Prominent implementations include LSD-SLAM, a semi-dense direct system from 2014 that reconstructs large-scale maps using probabilistic depth filters and pose graph optimization, achieving real-time performance on standard hardware. DSO, introduced in 2017, advances sparse direct odometry with joint optimization of poses, depths, and affine brightness parameters in a sliding window, demonstrating superior accuracy over prior methods on benchmark datasets like TUM RGB-D. Similarly, SVO from 2014 employs a semi-direct approach for fast motion estimation, processing at over 50 frames per second on embedded systems by interleaving tracking with sparse mapping updates.
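
The photometric error at the heart of these methods can be sketched as follows. This toy version assumes known per-pixel depths for the reference frame, nearest-neighbor intensity lookup, and grayscale images; real systems interpolate sub-pixel intensities, weight residuals robustly, and iterate coarse-to-fine over an image pyramid.

```python
import numpy as np

def photometric_residuals(ref_img, cur_img, ref_pixels, ref_depths, R, t, K):
    """Intensity differences for reference pixels warped into the current frame by (R, t)."""
    K_inv = np.linalg.inv(K)
    residuals = []
    for (u, v), Z in zip(ref_pixels, ref_depths):
        # Back-project, transform by the candidate pose, and re-project (pinhole model).
        p_ref = Z * (K_inv @ np.array([u, v, 1.0]))
        p_cur = R @ p_ref + t
        uvw = K @ p_cur
        u2, v2 = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
        if 0 <= u2 < cur_img.shape[1] and 0 <= v2 < cur_img.shape[0]:
            # Brightness constancy: the intensity difference is the residual.
            residuals.append(float(cur_img[v2, u2]) - float(ref_img[v, u]))
    return np.array(residuals)

# A Gauss-Newton loop would linearize these residuals with respect to the 6-DoF pose
# and iterate from coarse to fine pyramid levels.
```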

Learning-Based and Hybrid Methods

Learning-based methods in visual odometry leverage deep neural networks to directly estimate camera motion from image sequences, often bypassing traditional geometric pipelines. Early seminal works include FlowNet, which introduced convolutional neural networks for end-to-end optical flow estimation, enabling robust motion computation even in low-texture environments. SuperPoint advanced feature detection and description through self-supervised learning, producing repeatable keypoints and descriptors that outperform handcrafted alternatives like SIFT in challenging conditions. DeepVO further pioneered pose regression using recurrent convolutional neural networks, achieving end-to-end monocular VO with reduced drift on datasets like KITTI.

Hybrid methods integrate learning components with classical geometric techniques to enhance reliability and adaptability. For instance, Bayesian filters can fuse deep learning-based pose estimates with probabilistic models for uncertainty estimation and outlier detection, improving long-term accuracy in dynamic scenes. Reinforcement learning approaches treat VO as a sequential decision process, dynamically optimizing hyperparameters like keyframe selection in direct sparse odometry, yielding up to 19% lower absolute trajectory error on EuRoC benchmarks compared to baselines.

These methods offer key advantages, such as superior handling of dynamic objects and illumination variations through learned representations, and better generalization to novel environments via large-scale training data. However, challenges persist, including the need for extensive annotated datasets, limited interpretability of black-box models, and difficulties in deployment on resource-constrained devices due to high computational demands.

Recent developments up to 2025 emphasize transformer architectures for capturing long-range dependencies in video sequences. TSformer-VO employs spatio-temporal attention for pose estimation, outperforming DeepVO with 16.72% average translation error on KITTI. ViTVO uses vision transformers with supervised attention maps to focus on static regions, reducing errors in dynamic settings. The Visual Odometry Transformer (VoT) achieves real-time performance at a reported 54.58 frames per second with a 0.51 m absolute trajectory error on ARKitScenes, demonstrating scalability with pre-trained encoders. Emerging integrations with foundation models enable zero-shot adaptation, leveraging vision-language models for robust matching in unseen scenarios. Recent work from November 2025, such as approaches incorporating deep structural priors for visual-inertial odometry, further enhances robustness in challenging conditions.
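
As a minimal, illustrative example of end-to-end pose regression (not a reproduction of DeepVO or the transformer models above), the sketch below stacks two consecutive RGB frames and regresses a 6-DoF relative pose with a small convolutional network in PyTorch; the architecture and layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class TinyPoseNet(nn.Module):
    """Toy pose-regression network: two stacked RGB frames -> 6-DoF relative pose."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(128, 6)  # 3 translation + 3 rotation parameters

    def forward(self, frame_pair):          # frame_pair: (B, 6, H, W)
        features = self.encoder(frame_pair).flatten(1)
        return self.head(features)          # (B, 6) relative pose

# Training would minimize a weighted translation + rotation loss against ground-truth poses.
```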

Mathematical and Technical Foundations

Pose Estimation and Egomotion

Egomotion in visual odometry refers to the estimation of a camera's six-degrees-of-freedom (6-DoF) pose, comprising three translational and three rotational components, between consecutive frames to determine the camera's motion relative to its environment. This process is fundamental to visual odometry, as it enables the incremental reconstruction of the camera's trajectory by computing relative transformations from visual cues such as feature correspondences or direct pixel intensities.

The geometric foundation for pose estimation relies on the pinhole camera model, which projects three-dimensional world points onto a two-dimensional image plane through a focal point, assuming ideal perspective projection without lens distortions. Under this model, the relative pose between two views is captured by the essential matrix \mathbf{E}, a 3×3 matrix that encodes the epipolar geometry for calibrated cameras and provides a 5-DoF representation of motion (up to scale for translation). The essential matrix relates corresponding points \mathbf{x}_1 and \mathbf{x}_2 in normalized image coordinates as \mathbf{x}_2^T \mathbf{E} \mathbf{x}_1 = 0, where \mathbf{E} encapsulates the rotation \mathbf{R} and translation \mathbf{t} via its decomposition \mathbf{E} = [\mathbf{t}]_\times \mathbf{R}, with [\mathbf{t}]_\times denoting the skew-symmetric matrix of the translation vector. Recovery of the relative pose from the essential matrix involves singular value decomposition (SVD): \mathbf{E} = \mathbf{U} \Sigma \mathbf{V}^T, followed by constructing \mathbf{R} and \mathbf{t} from the singular vectors, yielding up to four possible solutions that are disambiguated by geometric constraints such as positive depth. For scenarios involving planar motion, such as ground vehicles on flat surfaces, the homography matrix \mathbf{H} simplifies pose estimation by mapping points between views under a dominant plane assumption, relating points as \mathbf{x}_2 = \mathbf{H} \mathbf{x}_1 and decomposing into rotation and translation components.

Kinematically, the camera's trajectory is represented as a discrete-time sequence of poses \mathbf{T}_i = (\mathbf{R}_i, \mathbf{t}_i), where each \mathbf{T}_i transforms world points to the camera frame at time i, and relative egomotion \mathbf{T}_{i,i-1} = \mathbf{T}_i \mathbf{T}_{i-1}^{-1} accumulates to form the global path. Instantaneous velocity can be approximated from finite differences between consecutive poses, such as linear velocity \mathbf{v}_i \approx \frac{\mathbf{t}_i - \mathbf{t}_{i-1}}{\Delta t} and angular velocity from rotation increments, though scale ambiguity persists in monocular setups without additional cues.

Initialization of pose estimation is critical for robust egomotion recovery; in monocular visual odometry, the five-point algorithm computes the essential matrix from minimal correspondences, solving a tenth-degree polynomial for efficient real-time performance. For stereo configurations, direct depth measurements from disparity allow immediate triangulation and absolute scale recovery, bypassing the need for epipolar decomposition in the initialization step.
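
The SVD-based recovery described above can be written compactly as follows. This sketch returns all four candidate (R, t) decompositions of \mathbf{E} = [\mathbf{t}]_\times \mathbf{R} and leaves the positive-depth (cheirality) test, which requires triangulating at least one point, to the caller.

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) decompositions of an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Ensure the candidates are proper rotations (determinant +1).
    if np.linalg.det(U @ Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]   # translation direction, defined only up to sign and scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```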

Optimization and Error Correction

Local optimization in visual odometry typically involves frame-to-frame bundle adjustment (BA), which refines camera poses and 3D points by minimizing the reprojection error across consecutive frames. This process solves the nonlinear least-squares problem \min \sum \| \pi(K [R | t] X) - x \|^2, where \pi denotes the projection function, K is the camera intrinsic matrix, [R | t] represents the camera pose, X are the 3D points, and x are the observed 2D image points. Such local BA reduces short-term drift by jointly optimizing a small set of variables, as implemented in feature-based systems like ORB-SLAM, where robust kernels handle outliers during minimization.

Global techniques extend this refinement over larger sets of frames to achieve consistency and mitigate accumulated errors. Keyframe-based BA optimizes selected keyframes and associated map points, fixing less relevant poses to maintain computational feasibility, while sliding window optimization in visual-inertial odometry (VIO) maintains a fixed-size window of recent states—including poses, velocities, and biases—fusing IMU preintegration with visual residuals for tighter coupling. Loop closure detection, often using bag-of-words models like DBoW2 on binary descriptors, identifies revisited locations and corrects global drift by estimating a similarity transformation and performing pose graph optimization.

Error correction mechanisms are integral to robust VO pipelines. Outlier rejection employs RANSAC to filter mismatched features during pose estimation, iteratively sampling minimal point sets to hypothesize poses and scoring based on inlier counts, ensuring reliable input to optimization. In monocular setups, scale drift—arising from projective ambiguity—is recovered using auxiliary data, such as IMU measurements for metric alignment during initialization or ground plane assumptions leveraging known camera height to rescale translations via point triangulation. Covariance propagation quantifies uncertainty, particularly in VIO filters, by evolving the state through IMU dynamics to inform measurement weighting and detect inconsistencies.

Advanced methods address scalability and real-time constraints in optimization. Marginalization in extended Kalman filters (EKFs) for VIO, as in the multi-state constraint Kalman filter (MSCKF), selectively removes old states while preserving their information as priors, preventing covariance inflation without full history retention. For efficiency, the Schur complement decomposes the BA Hessian into pose and landmark blocks, exploiting sparsity to accelerate solves in high-dimensional problems, enabling lightweight VIO with reduced CPU overhead.

Performance of these techniques is evaluated using metrics like Absolute Trajectory Error (ATE), which computes the root-mean-square deviation of the estimated trajectory from ground truth after alignment, and Relative Pose Error (RPE), which assesses local drift over fixed distances. Benchmarks on the KITTI dataset (2012) demonstrate that optimized VO systems achieve relative translational errors below 1% on urban sequences, with sliding window VIO further reducing RPE to under 0.01°/m.
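
The reprojection-error objective above can be prototyped directly with SciPy's nonlinear least-squares solver. This sketch refines a single camera pose against fixed 3D points under simplifying assumptions (no distortion, rotation parameterized as a rotation vector), whereas full bundle adjustment jointly optimizes many poses and landmarks and exploits the Jacobian's sparsity, for example via the Schur complement. The names X, x_obs, and K in the usage comment are placeholders.

```python
import numpy as np
from scipy.optimize import least_squares

def rotation_from_rvec(rvec):
    """Rodrigues formula: rotation matrix from a rotation vector."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-9:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def project(points_3d, rvec, t, K_mat):
    """Pinhole projection of Nx3 world points with pose (rvec, t)."""
    cam = (rotation_from_rvec(rvec) @ points_3d.T).T + t
    uvw = (K_mat @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def residuals(pose, points_3d, observations, K_mat):
    """Flattened reprojection errors pi(K [R|t] X) - x for one camera."""
    rvec, t = pose[:3], pose[3:]
    return (project(points_3d, rvec, t, K_mat) - observations).ravel()

# Example usage with placeholder data:
# result = least_squares(residuals, x0=np.zeros(6), args=(X, x_obs, K))
```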

Applications and Challenges

Key Applications

Visual odometry (VO) plays a pivotal role in robotics, enabling mobile robots and drones to achieve precise localization in both indoor and outdoor environments where GPS is unavailable or unreliable. In the DARPA Subterranean Challenge (2019-2021), teams integrated VO with visual-inertial odometry (VIO) and lidar-based systems to traverse complex underground tunnels, caves, and urban settings, allowing robots to map and localize autonomously during search-and-rescue simulations. Furthermore, VO is seamlessly incorporated into the Robot Operating System (ROS) through packages like ORB-SLAM and VINS-Mono, facilitating real-time ego-motion estimation and mapping for ground and aerial robots in dynamic scenarios.

In autonomous vehicles, VO enhances localization and perception by fusing camera data with lidar, radar, and inertial sensors for robust ego-motion estimation in urban and highway driving. Waymo's self-driving systems leverage multi-sensor fusion, including cameras for visual features, to estimate vehicle pose and trajectory, contributing to safe navigation in diverse conditions as demonstrated on their open dataset. By 2024, Tesla's Autopilot and Full Self-Driving enhancements rely on vision-only approaches, where vision-based ego-motion estimation derived from multi-camera inputs provides pose estimates without lidar, supporting features like lane changes and obstacle avoidance. Additionally, VO applications have expanded to collaborative multi-robot operation, where multiple agents use shared VO data for coordinated localization in GPS-denied environments, as explored in recent multi-robot systems as of 2025.

For augmented and virtual reality (AR/VR), VO enables six-degrees-of-freedom (6-DoF) tracking in head-mounted displays, allowing users to interact naturally without external sensors. The Oculus Quest (released 2019) employs inside-out tracking via SLAM and VO algorithms on its wide-angle cameras, generating real-time 3D maps of the environment for immersive, untethered experiences.

In space exploration, VO supports rover navigation on extraterrestrial surfaces by estimating motion from stereo imagery in low-gravity, feature-sparse terrains. The NASA Perseverance rover, which landed in 2021, utilizes VIO with its engineering cameras to compute visual odometry during autonomous drives, enabling hazard avoidance and precise path planning across Jezero Crater. This approach has been foundational for planetary rovers, including earlier Mars missions, where it supplements wheel odometry to mitigate slippage.

Beyond these domains, VO finds applications in underwater vehicles and medical robotics. Autonomous underwater vehicles (AUVs) employ monocular or stereo VO to localize in turbid, GPS-denied waters, such as during infrastructure inspections, by tracking visual features despite light attenuation and backscatter. In medical robotics, VO aids endoscopic navigation by estimating the endoscope's pose from monocular or stereo images, improving localization in deformable tissues for procedures like capsule endoscopy and minimally invasive surgery.

Limitations and Mitigation Strategies

Visual odometry systems are prone to accumulating drift, with relative position errors typically ranging from 0.1% to 2% of traveled distance, due to incremental error propagation in pose estimation. These systems also exhibit sensitivity to environmental factors such as lighting variations, which can cause non-uniform illumination and disrupt feature detection, leading to inaccurate pixel displacement estimates. Motion blur from rapid camera movements further degrades image quality, resulting in false matches and heightened drift. The presence of dynamic objects, like moving pedestrians or vehicles, introduces outliers by disturbing scene consistency and complicating ego-motion recovery. Additionally, the computational demands of feature extraction, matching, and optimization often challenge real-time performance on resource-constrained platforms.

Error sources in visual odometry primarily stem from feature mismatches, where incorrect correspondences arise from repetitive patterns or occlusions, amplifying pose inaccuracies. Incorrect scale estimation, particularly in monocular setups, leads to scale drift since depth is not directly observable from images alone. Sensor noise, including camera distortions and low-texture environments, further degrades reliability by reducing the quality of input data. Degenerate cases, such as pure forward motion in monocular visual odometry, result in system failures due to insufficient parallax for reliable depth estimation.

To mitigate these issues, multi-sensor fusion approaches like visual-inertial odometry (VIO) integrate IMU data to provide metric scale and robustness against visual-only failures, reducing drift in challenging conditions. Incorporating loop closure mechanisms from SLAM hybrids detects revisited locations to correct accumulated errors through global optimization. Robust cost functions, such as the Huber loss, downweight outliers from mismatches or dynamic objects during optimization, enhancing estimation stability.

Emerging solutions from 2024-2025 research include AI-driven anomaly detection, where models identify and recover from failures like sudden lighting changes by flagging inconsistent poses in real time. Neuromorphic event-based sensors offer resistance to motion blur by asynchronously capturing brightness changes, enabling high-speed visual odometry in dynamic environments. For learning-based methods, data augmentation techniques, such as synthetic transformations and ORB feature enhancements, improve model robustness and recovery from error-prone scenarios during training.

Evaluation of these limitations and strategies commonly uses standard datasets like TUM RGB-D for indoor RGB-D sequences and EuRoC for aerial VIO benchmarks, where failure rates are quantified by tracking lost poses or excessive drift in dynamic or low-texture trials.
