
3D pose estimation

3D human pose estimation (3D HPE), a subtask of pose estimation focused on humans, is a fundamental task in computer vision that involves predicting the three-dimensional spatial coordinates of key body joints from input data such as single or multiple 2D images, videos, or depth sensors. This process reconstructs the articulated structure of the human body, typically represented as a set of keypoints connected by bones, enabling the analysis of human movement and posture in a realistic spatial context. Unlike 2D pose estimation, which projects joints onto a planar image, 3D HPE recovers depth information to resolve ambiguities in pose reconstruction, making it essential for capturing full-body dynamics.

The field has evolved significantly since early statistical and model-based approaches in the 2000s, which relied on predefined body models and limited data, to modern deep learning paradigms introduced around 2014 with seminal works like DeepPose, which adapted convolutional neural networks (CNNs) for pose regression. Subsequent advancements incorporated graph convolutional networks (GCNs) to model skeletal connectivity as graphs, transformers for capturing long-range dependencies, and hybrid methods that lift 2D detections to 3D space or predict poses directly in 3D. Key benchmarks, such as the Human3.6M dataset with over 3.6 million 3D poses captured via motion capture, have driven progress, alongside datasets like MPI-INF-3DHP for multi-view scenarios. Despite these gains, challenges persist, including depth ambiguity in monocular settings, occlusions in crowded scenes, and the scarcity of annotated 3D data compared to 2D. Contemporary methods span single-stage end-to-end predictors and two-stage pipelines, supporting monocular, multi-view, single-person, and multi-person estimation, with innovations like diffusion models and semi-supervised learning enhancing robustness to real-world variations as of 2024; ongoing work as of 2025 continues to target improved in-the-wild performance.

Applications of 3D pose estimation are diverse and impactful, spanning human-computer interaction (e.g., in virtual and augmented reality), action recognition and surveillance in security systems, sports analytics for performance tracking, motion capture in film and gaming, healthcare for rehabilitation monitoring, and autonomous robotics for safe human-robot collaboration. These uses underscore its role in bridging visual perception with actionable insights, with ongoing research focusing on efficiency and generalization across diverse environments.

Introduction

Definition and Scope

3D pose estimation is a fundamental task in computer vision and robotics that involves determining the three-dimensional position and orientation of objects, body parts, or articulated structures from sensor data, typically images, videos, or depth measurements. For rigid objects, this entails computing a 6D pose, defined as a 3D translation vector and a 3D rotation relative to a reference coordinate system, such as the camera frame. In the case of articulated entities like human bodies, it focuses on recovering the 3D coordinates of skeletal keypoints or joint angles to represent the overall configuration. This process enables machines to understand spatial relationships and movements, bridging the gap between 2D observations and 3D reality. The scope of 3D pose estimation encompasses applications to rigid bodies (e.g., tools or vehicles), human skeletons (e.g., for motion capture), and limited scene elements, but it deliberately excludes comprehensive reconstruction of object surfaces or volumetric models, which falls under fields like 3D reconstruction or mesh recovery. It differs from 2D pose estimation by lifting projections into full 3D space, avoiding ambiguities inherent in inferring depth from single views. Key concepts include the distinction between intrinsic camera parameters, which describe internal properties such as focal length, and extrinsic parameters, which capture the camera's 3D pose in the world. Setups range from monocular configurations relying on a single sensor to multi-view systems using synchronized cameras for improved accuracy.

Historically, 3D pose estimation traces its origins to photogrammetry, a discipline developed in the mid-19th century for extracting measurements from photographs, with significant advancements in the 1960s through analytical methods powered by early computers. These techniques facilitated precise 3D computations from images, including applications in space exploration for mapping and object positioning. For instance, the integration of computational tools in the 1960s enabled automated image measurement and matching, laying groundwork for modern pose recovery in diverse environments.

Historical Development

The roots of 3D pose estimation trace back to 19th-century photogrammetry, where the invention of photography in the 1830s enabled the use of images for measurement and mapping, laying foundational principles for recovering object positions from multiple viewpoints. Early efforts focused on geometric techniques to determine camera positions relative to scenes, evolving through the 20th century into computational methods as computer vision emerged as a field in the late 1960s and early 1970s. A key formalization occurred in the 1980s with the perspective-n-point (PnP) problem, introduced by Fischler and Bolles in 1981 as part of their random sample consensus (RANSAC) paradigm for robust model fitting in the presence of outliers, enabling reliable estimation of camera pose from 3D-2D correspondences. This problem became central to calibrated pose estimation, influencing subsequent algorithms for camera localization and object pose recovery. In the 2000s, advancements emphasized real-time systems, with precursors to modern multi-person frameworks like OpenPose emerging through methods such as pictorial structures for efficient 2D pose detection, which extended to 3D via geometric lifting. The 2010s marked a surge in deep learning integration, driven by GPU advancements that facilitated training on large datasets and enabled end-to-end pose prediction. Milestones included the EPnP algorithm in 2009, offering an efficient O(n) non-iterative solution to the PnP problem for arbitrary point counts, improving scalability in real-world applications. Seminal works like DeepPose in 2014 adapted convolutional neural networks (CNNs) for direct pose regression from images, laying groundwork for deep human pose estimation methods. The release of the Human3.6M dataset in 2014 provided a large-scale benchmark with 3.6 million poses captured by motion capture, spurring data-driven research in human pose estimation. This era's shift from purely geometric to hybrid learning-based approaches culminated in works like Convolutional Pose Machines in 2016, which leveraged convolutional networks for sequential refinement of pose predictions.

Mathematical Foundations

Pose Representation

In 3D pose estimation, the pose of a rigid body is parameterized by six degrees of freedom, consisting of a translation vector \mathbf{t} \in \mathbb{R}^3 that specifies the position of a reference point and a rotation that describes the orientation relative to a reference frame. The rotation lies in the special orthogonal group SO(3), which preserves distances and orientations. Rotations are commonly represented using Euler angles, which apply three sequential rotations around the coordinate axes (e.g., yaw, pitch, roll), providing an intuitive and parameter-efficient description with three scalars. However, Euler angles suffer from gimbal lock, where certain configurations lead to loss of a degree of freedom due to aligned rotation axes. Rotation matrices \mathbf{R} \in SO(3) offer a direct 3×3 representation of the linear transformation, satisfying \mathbf{R}^T \mathbf{R} = \mathbf{I} and \det(\mathbf{R}) = 1, though they require nine parameters subject to six constraints. Unit quaternions \mathbf{q} = (w, \mathbf{v}) \in \mathbb{R}^4 with the normalization constraint \|\mathbf{q}\| = 1 provide a compact, singularity-free alternative to Euler angles, avoiding gimbal lock and enabling efficient composition and interpolation of rotations via the quaternion product. Compared to the axis-angle representation, which parameterizes rotation by an angle \theta \in [0, \pi] around a unit axis \mathbf{n} \in \mathbb{R}^3 (with \mathbf{q} = (\cos(\theta/2), \sin(\theta/2) \mathbf{n}) as an equivalent form), quaternions are numerically more stable for sequential rotations and avoid discontinuities at \theta = 0 or multiples of 2\pi, though both representations are related by the half-angle mapping above. Axis-angle is more geometrically intuitive for single rotations but less suitable for interpolation without conversion to quaternions. For articulated objects, such as human or robotic figures, the pose is represented by a set of joint angles \boldsymbol{\theta} \in \mathbb{R}^K for K joints arranged in a kinematic tree or chain, capturing the relative rotations between connected body parts (bones) of fixed lengths. The global 3D joint positions \mathbf{p}_j are derived via forward kinematics, defined as \mathbf{p}_j = f(\boldsymbol{\theta}, \mathbf{l}), where \mathbf{l} denotes the vector of bone lengths and f propagates transformations from a root joint outward through matrix multiplications of per-joint rotations and translations. This hierarchical structure ensures anatomical consistency, with constraints often imposed on \boldsymbol{\theta} to respect joint limits and avoid unnatural configurations. In 3D human pose estimation, the pose is commonly represented directly as the 3D coordinates of key body joints, forming a skeleton with a predefined topology, such as 17 joints in the Human3.6M dataset. This is denoted as a matrix \mathbf{P} \in \mathbb{R}^{J \times 3}, where J is the number of joints, allowing for direct regression from input data in modern methods and evaluation using metrics like mean per-joint position error (MPJPE).
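To make the forward-kinematics definition \mathbf{p}_j = f(\boldsymbol{\theta}, \mathbf{l}) concrete, the following is a minimal sketch, assuming a simple planar kinematic chain in which each joint rotates about the z-axis; real skeletons use a tree of full 3D rotations, but the propagation of transforms from the root outward is the same idea.

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis; a stand-in for per-joint rotations."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def forward_kinematics(joint_angles, bone_lengths):
    """Propagate transforms along a simple kinematic chain.

    joint_angles: relative rotation of each bone about z (radians)
    bone_lengths: length of each bone
    Returns the 3D position of every joint, with the root at the origin.
    """
    positions = [np.zeros(3)]
    R = np.eye(3)            # accumulated orientation
    p = np.zeros(3)          # accumulated position
    for theta, length in zip(joint_angles, bone_lengths):
        R = R @ rot_z(theta)                        # compose relative rotation
        p = p + R @ np.array([length, 0.0, 0.0])    # offset along the bone
        positions.append(p.copy())
    return np.stack(positions)

# Example: a 3-joint chain (e.g., a shoulder-elbow-wrist analogue)
print(forward_kinematics([0.3, -0.5, 0.2], [0.3, 0.25, 0.2]))
```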

Camera Models and Calibration

The pinhole camera model serves as the foundational geometric representation for projecting world points onto an image plane, assuming an ideal perspective projection without lens distortions. In this model, a point \mathbf{X} = (X, Y, Z)^T in world coordinates is mapped to an image point \mathbf{u} = (u, v, 1)^T via \lambda \mathbf{u} = \mathbf{K} [\mathbf{R} \mid \mathbf{t}] \mathbf{X}, where \lambda is a scale factor, \mathbf{K} is the 3×3 intrinsic parameter matrix, \mathbf{R} is the 3×3 rotation matrix, and \mathbf{t} is the 3×1 translation vector. The intrinsic matrix \mathbf{K} encapsulates camera-specific properties and is typically upper triangular, defined as \mathbf{K} = \begin{pmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}, where f_x and f_y are the focal lengths in pixel units along the x- and y-axes, (c_x, c_y) is the principal point (image center), and s is the skew parameter (often assumed zero for rectified sensors). The extrinsic parameters [\mathbf{R} \mid \mathbf{t}] describe the rigid transformation from world coordinates to camera coordinates, with \mathbf{R} satisfying orthogonality and \det(\mathbf{R}) = 1 to represent a valid rotation. Real lenses introduce distortions that deviate from the pinhole model, primarily radial and tangential types, which must be modeled for accurate calibration and pose estimation. Radial distortion arises from lens curvature and is symmetric around the principal point, modeled as a polynomial correction to the normalized image coordinates (x_d, y_d): x_u = x_d (1 + k_1 r^2 + k_2 r^4 + k_3 r^6), y_u = y_d (1 + k_1 r^2 + k_2 r^4 + k_3 r^6), where r^2 = x_d^2 + y_d^2 and k_1, k_2, k_3 are radial coefficients (positive for pincushion distortion, negative for barrel distortion). Tangential distortion results from sensor-lens misalignment and is asymmetric, given by x_u = x_d + [2 p_1 x_d y_d + p_2 (r^2 + 2 x_d^2)], y_u = y_d + [p_1 (r^2 + 2 y_d^2) + 2 p_2 x_d y_d], with p_1, p_2 as tangential coefficients. These models, originating from the Brown-Conrady framework, are applied after initial projection and before final pixel mapping to correct distorted points. Camera calibration estimates the intrinsic and extrinsic parameters by establishing correspondences between known 3D points and their 2D image projections, enabling accurate pose recovery. A widely adopted method is Zhang's flexible technique (1999), which uses images of a planar calibration pattern, such as a checkerboard, observed from multiple orientations (at least two) without requiring precise 3D measurements of the pattern. The approach solves for intrinsics via a closed-form solution from homography matrices between the world plane and image, followed by nonlinear refinement minimizing the reprojection error, achieving sub-pixel accuracy in practice. For scenarios without calibration targets, self-calibration methods recover intrinsics from image correspondences across multiple uncalibrated views, leveraging constraints like the absolute quadric or Kruppa equations derived from the fundamental matrix. Seminal work by Maybank and Faugeras (1992) established the theoretical foundation, showing that under the assumption of a constant but unknown focal length and square pixels, self-calibration is possible from pure rotational motion or general motion with sufficient views, though it requires at least three images for solvability and is sensitive to noise.
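The projection and distortion equations above can be combined into a single forward projection routine. The following is a minimal NumPy sketch under the assumption that the Brown-Conrady coefficients are applied in the forward (projection) direction on normalized coordinates; the function name and inputs are illustrative rather than any library's API.

```python
import numpy as np

def project_points(X_world, K, R, t, dist=(0.0, 0.0, 0.0, 0.0, 0.0)):
    """Project Nx3 world points with a pinhole model plus Brown-Conrady distortion.

    K: 3x3 intrinsic matrix, R: 3x3 rotation, t: length-3 translation
    dist: (k1, k2, p1, p2, k3) radial and tangential coefficients
    """
    k1, k2, p1, p2, k3 = dist
    Xc = (R @ X_world.T).T + t                          # world -> camera coordinates
    x, y = Xc[:, 0] / Xc[:, 2], Xc[:, 1] / Xc[:, 2]     # normalized image coordinates
    r2 = x**2 + y**2
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3      # radial polynomial
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x**2)   # tangential terms
    y_d = y * radial + p1 * (r2 + 2 * y**2) + 2 * p2 * x * y
    u = K[0, 0] * x_d + K[0, 1] * y_d + K[0, 2]         # apply intrinsics (with skew)
    v = K[1, 1] * y_d + K[1, 2]
    return np.stack([u, v], axis=1)
```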

Image-Based Methods

Uncalibrated Single-Image Estimation

Uncalibrated single-image estimation addresses the task of recovering the 3D pose of human bodies from a single image without knowledge of camera intrinsics, such as focal length or principal point. This approach faces significant challenges due to the perspective projection model, which induces scale ambiguity—where the absolute size of the structure cannot be determined without additional cues—and depth loss, as the projection collapses the depth dimension into a plane, resulting in multiple possible 3D configurations consistent with the observed image. To mitigate these issues, methods typically rely on strong assumptions, including the use of parametric body models representing the articulated skeleton and exploitation of cues like silhouettes or shading. One foundational technique is shape-from-shading, which infers pose by analyzing shading patterns to recover surface normals and subsequently integrate them into a coherent 3D structure consistent with a body model. This method assumes a reflectance model (typically Lambertian) and known or estimated light source direction, allowing pose estimation from the orientation of recovered normals relative to the viewing direction. Approaches like those integrating shading cues with statistical body models (e.g., SCAPE) jointly estimate human pose and shape from single images by optimizing over pose parameters that align projected shading with observed intensities, demonstrating robustness to shadows and complex lighting. Such methods achieve qualitative accuracy in pose recovery but remain sensitive to non-Lambertian surfaces (e.g., specular materials) and require initialization from low-level cues like edges. Single-view metrology provides another method, leveraging vanishing points—intersections of projected parallel lines in the image—to compute affine measurements and relative pose without calibration. By identifying the vanishing line (formed by at least two vanishing points corresponding to orthogonal directions), the technique rectifies the image to an affine view, enabling direct metric computations such as heights or distances. In human pose contexts, this can estimate camera orientation relative to a reference plane (e.g., the ground) using body-aligned features, with applications in scenarios where environmental parallels aid scale recovery; the seminal algorithm by Criminisi et al. reports measurement errors below 5% for planar scenes. Template matching with affine invariants can facilitate uncalibrated pose estimation by comparing image regions to precomputed templates of body parts that are robust to affine distortions from unknown camera parameters. Local image patches are described using affine-covariant detectors, which normalize for shear and scale, allowing matching across views to recover part correspondences and orientations for articulated models. Extensions build models from multi-view affine-invariant features of human bodies and enforce kinematic consistency during matching, enabling pose hypothesis generation via geometric verification. A key algorithmic framework is the adaptation of the direct linear transform (DLT) for pose estimation without intrinsics, which solves a homogeneous linear system to estimate the full 3D-to-2D projection matrix from correspondences between model-generated 3D points and their image projections. Originally formulated for camera calibration, the DLT is adapted by treating unknown intrinsics as part of the projective ambiguity, requiring at least six points to solve the homogeneous system up to scale, followed by decomposition into rotation, translation, and projective corrections using body constraints like joint limits. This resolves projective ambiguities into a metric pose under affine approximations, with reported angular errors under 2 degrees for synthetic benchmarks, though it assumes a known body model and minimal occlusion.
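The core DLT step described above reduces to a homogeneous linear system solved by SVD. The following is a minimal sketch of that step only (estimating the 3×4 projection matrix from at least six 3D-2D correspondences); the subsequent decomposition into rotation, translation, and intrinsics, and the enforcement of body constraints, are omitted, and the function name is illustrative.

```python
import numpy as np

def dlt_projection_matrix(X3d, x2d):
    """Estimate the 3x4 projection matrix P from >= 6 3D-2D correspondences.

    Solves the homogeneous system A p = 0 via SVD; the result is defined
    only up to scale and absorbs unknown intrinsics into the projective ambiguity.
    """
    assert len(X3d) >= 6, "DLT needs at least six correspondences"
    A = []
    for (X, Y, Z), (u, v) in zip(X3d, x2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)                 # null-space vector = projection matrix up to scale
    return P / np.linalg.norm(P[2, :3])      # normalize so the last row's rotation part has unit norm
```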

Calibrated Single-Image Estimation

Calibrated single-image estimation addresses the challenge of recovering the 3D pose of a human body from a single image when the camera's intrinsic parameters, represented by the calibration matrix K, are known. For articulated humans, this relies on geometric solvers adapted to parametric body models (e.g., kinematic skeletons or mesh models like SMPL), which generate candidate 3D joint points \mathbf{X}_i based on pose parameters. These are optimized to fit observed 2D keypoints \mathbf{u}_i, determining the rotation matrix R and translation vector \mathbf{t}, enabling a metric solution unlike uncalibrated methods, which suffer from projective ambiguities. The core formulation adapts the Perspective-n-Point (PnP) problem, solving for pose parameters given n correspondences satisfying \mathbf{u}_i = \proj(K [R \mid \mathbf{t}] \mathbf{X}_i(\theta)), where \mathbf{X}_i(\theta) depends on pose \theta, and \proj denotes the perspective projection function. This often involves iterative optimization over \theta to minimize fitting errors. A foundational solver for the minimal PnP case is the Perspective-Three-Point (P3P) algorithm, which provides an exact solution for n = 3 non-collinear points, yielding up to four possible pose configurations that must be disambiguated using additional body constraints. Introduced in robust model fitting contexts, P3P reduces the problem to solving a quartic polynomial derived from distances between points and their projections. For human models, it initializes optimization over larger sets of joints. For cases with more points (n > 3), efficient non-iterative solvers like EPnP extend P3P by expressing the n world points as a linear combination of four virtual control points, transforming the problem into a linear system solved via eigenvalue decomposition of a 12×12 matrix, followed by quadratic equations for depths. In human pose, this initializes fitting of body models, achieving O(n) complexity for real-time use, with mean reprojection errors below 0.5 pixels on benchmarks for n up to 100 after refinement. EPnP handles non-planar configurations without degeneracy. To obtain the final pose, closed-form solutions are refined iteratively by minimizing the reprojection error through non-linear optimization, such as Gauss-Newton, which updates pose parameters by solving a least-squares system based on the Jacobian of the projection function. For humans, this jointly optimizes body pose \theta and camera extrinsics [R \mid \mathbf{t}], often starting from EPnP and converging in fewer than 10 steps for sub-pixel accuracy on keypoint detections. In practice, real-world 2D keypoints from detectors include outliers, necessitating robust estimation. The RANSAC algorithm integrates with adapted solvers by sampling minimal subsets (e.g., three joints for P3P), computing candidate poses, and selecting the one with the largest inlier consensus based on a reprojection threshold, achieving robustness with up to 50% outliers. This ensures reliable pose estimation for humans in cluttered scenes. The following pseudocode outlines a typical robust adapted PnP pipeline for human pose:
Input: 2D keypoints {u_i}, body model, calibration K, max iterations M, threshold τ
Output: Body pose θ, camera [R, t]

for iter = 1 to M:
    Sample 3 random keypoints
    Generate candidate 3D points X_cand from model with initial θ
    Compute candidate pose (R_cand, t_cand) using P3P or EPnP on X_cand, u_i
    Jointly refine θ, (R_cand, t_cand) with Gauss-Newton minimization of ∑ ||u_i - proj(K [R_cand | t_cand] X_i(θ))||²
    Count inliers: keypoints where reprojection error < τ
    Update best pose if current inliers > best inliers
Refine final best θ, [R, t] on all inliers
Return θ, (R, t)
This framework balances speed and accuracy, with implementations processing dozens of joints in milliseconds on standard hardware.
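For the rigid part of this pipeline (candidate pose from 3D-2D correspondences with RANSAC and EPnP), OpenCV's solvePnPRansac can serve as a drop-in building block. The sketch below generates synthetic "joints", corrupts one detection, and recovers the pose; the joint count, intrinsics, and threshold values are illustrative.

```python
import cv2
import numpy as np

# Synthetic example: project known 3D "joints" with a ground-truth pose, then
# recover that pose with RANSAC-wrapped EPnP.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
joints_3d = np.random.uniform(-0.5, 0.5, (17, 3)).astype(np.float32)
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.0, 0.0, 3.0])
joints_2d, _ = cv2.projectPoints(joints_3d, rvec_gt, tvec_gt, K, None)
joints_2d = joints_2d.reshape(-1, 2).astype(np.float32)
joints_2d[0] += 50  # corrupt one keypoint to simulate an outlier detection

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    joints_3d, joints_2d, K, None,
    reprojectionError=8.0,       # inlier threshold tau in pixels
    iterationsCount=100,         # max RANSAC iterations M
    flags=cv2.SOLVEPNP_EPNP)     # EPnP for each candidate solve
R, _ = cv2.Rodrigues(rvec)       # axis-angle -> rotation matrix
print(ok, inliers.ravel())       # the corrupted joint should be excluded
```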

Multi-View and Temporal Methods

Structure from Motion

Structure from motion (SfM) is a photogrammetric technique that recovers the three-dimensional structure of a scene and the relative poses of cameras from a set of two-dimensional images taken from unknown viewpoints, without requiring prior calibration. This process leverages multi-view geometry to establish correspondences between images, enabling the estimation of camera motion and scene points, which is fundamental for 3D pose estimation in multi-view settings. For human pose, this is extended to non-rigid structure from motion (NRSfM) to model body articulations and deformations. SfM operates on unordered or sequential image collections, distinguishing it from calibrated single-image methods by exploiting redundancy across multiple views to disambiguate depth and pose. The SfM pipeline typically begins with feature detection and matching to identify corresponding points across images. Scale-invariant feature transform (SIFT) descriptors are widely used for this purpose, as they detect and describe local image features robust to scale, rotation, and illumination changes, facilitating reliable matches even in challenging conditions. Once correspondences are established, the fundamental matrix F is estimated using the eight-point algorithm, which solves a linear system from at least eight point pairs to enforce the epipolar constraint \mathbf{x}'^T F \mathbf{x} = 0, where \mathbf{x} and \mathbf{x}' are homogeneous coordinates in the two images. With F in hand, three-dimensional points are triangulated by intersecting rays from corresponding image points, yielding an initial sparse reconstruction of the scene. For pose recovery, if camera intrinsics K are known or estimated, the essential matrix E is computed as E = K^T F K, which encodes the relative rotation R and translation t up to scale via the decomposition E = [\mathbf{t}]_\times R, where [\mathbf{t}]_\times is the skew-symmetric matrix of t. This decomposition yields four possible solutions for (R, t), from which the correct one is selected based on positive depth (chirality) constraints during triangulation. Incremental SfM frameworks, such as COLMAP, extend this to large image sets by sequentially registering new images to the growing model through feature matching and pose estimation, followed by sparse bundle adjustment for refinement. A key challenge in SfM is scale ambiguity, as the reconstruction is only determined up to a global scale factor due to the projective nature of image formation; this is resolved by incorporating ground control points with known real-world distances or leveraging scene priors such as object sizes. In the context of human pose estimation, SfM enables markerless motion capture from uncalibrated video sequences, reconstructing articulated human skeletons for applications in biomechanics, animation, and human-computer interaction, as demonstrated in methods combining SfM with learned 2D pose detectors up to 2024.
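The two-view relative pose step (essential matrix estimation, decomposition, and chirality check) is available directly in OpenCV. The following is a minimal sketch, assuming pts1 and pts2 are matched keypoints (N×2 float arrays) from a feature matcher such as SIFT and K is the intrinsic matrix; it is illustrative rather than a full SfM pipeline.

```python
import cv2
import numpy as np

def relative_pose(pts1, pts2, K):
    """Estimate relative camera pose between two calibrated views."""
    # Essential matrix with RANSAC to reject bad feature matches
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose E and resolve the four-fold ambiguity via the chirality check
    _, R, t, mask_pose = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t, mask_pose   # t is only recovered up to scale
```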

Bundle Adjustment and Optimization

Bundle adjustment is a fundamental optimization technique in 3D pose estimation that refines estimates of camera poses and 3D scene points across multiple views by minimizing the geometric inconsistency between observed 2D image features and their projected 3D counterparts. In multi-view setups, it jointly optimizes the extrinsic parameters (rotations R_j and translations t_j) for each camera view j and the 3D coordinates X_k of landmark points k, starting from initial estimates often derived from structure-from-motion pipelines. In 3D human pose estimation, it refines joint positions across views, often incorporating kinematic priors for anatomical consistency, and is essential for multi-view markerless motion capture. This process enhances the accuracy of reconstruction, particularly in scenarios involving noisy feature matches or calibration uncertainties, and supports applications like video-based temporal refinement. The core formulation of bundle adjustment poses the problem as a non-linear least-squares minimization of the reprojection error, defined as: \min_{\{R_j, t_j\}, \{X_k\}} \sum_{i} \left\| u_i - \pi \left( K [R_j | t_j] X_k \right) \right\|^2 where u_i are the observed feature points in the images, \pi denotes the projection function (typically perspective projection), K is the camera intrinsic matrix, and the sum runs over all observations i linking features to views and points. This objective captures the bundle of light rays from 3D points to their projections, hence the name, and is solved iteratively due to the non-linearity of the rotation parameters and projection model. Common techniques for solving this optimization include the Levenberg-Marquardt (LM) algorithm, which interpolates between gradient descent and Gauss-Newton methods by damping the approximate Hessian for robust convergence in the presence of local minima. For efficiency in large-scale problems, sparse variants of bundle adjustment exploit the sparsity of the observation structure, reducing complexity from O(n^3) to near-linear time via techniques like the Schur complement decomposition. In temporal sequences like video, bundle adjustment extends to incorporate motion constraints, modeling inter-frame point correspondences to enforce smoothness and reduce drift in pose estimates over time. This spatiotemporal formulation adds terms to the objective penalizing deviations from predicted flows, allowing joint optimization of dynamic trajectories and static structure in non-rigid scenes. A key variant is local bundle adjustment, which optimizes only a sliding window of recent poses and points rather than the entire map, facilitating pose tracking in sequential estimation tasks like visual odometry or SLAM. By limiting the scope to co-visible features within a few frames, it achieves sub-millisecond convergence per iteration on modern hardware, balancing accuracy and computational cost for online applications.
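The reprojection-error objective above maps directly onto a generic non-linear least-squares solver. The sketch below, assuming an axis-angle parameterization per camera and a dense residual loop (a production system would use a sparse, vectorized Jacobian), shows only the residual function; it is not a full bundle adjuster.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, K, cam_idx, pt_idx, obs_2d):
    """Residuals u_i - proj(K [R|t] X_k) for every observation.

    params packs, per camera, an axis-angle rotation and translation (6 values),
    followed by the flattened 3D points.
    """
    cam = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for c, k, uv in zip(cam_idx, pt_idx, obs_2d):
        rvec, t = cam[c, :3], cam[c, 3:]
        theta = np.linalg.norm(rvec) + 1e-12          # Rodrigues rotation of point X
        axis = rvec / theta
        X = pts[k]
        Xr = (X * np.cos(theta) + np.cross(axis, X) * np.sin(theta)
              + axis * np.dot(axis, X) * (1 - np.cos(theta))) + t
        proj = K @ Xr                                  # project into the image
        res.append(uv - proj[:2] / proj[2])
    return np.concatenate(res)

# Joint refinement of all poses and points (x0 packs initial estimates);
# a sparse Jacobian structure can be supplied via jac_sparsity for large problems:
# result = least_squares(reprojection_residuals, x0,
#                        args=(n_cams, n_pts, K, cam_idx, pt_idx, obs_2d))
```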

Learning-Based Approaches

Traditional Feature-Based Learning

Traditional feature-based learning methods in 3D pose estimation represent an early integration of machine learning with handcrafted visual descriptors, primarily aimed at predicting body joint positions or classifying poses from images or depth data. These approaches typically involve extracting robust features such as histograms of oriented gradients (HOG) or depth-based pixel classifications, followed by classifiers like support vector machines (SVMs) or regression trees to infer 3D configurations. Unlike purely geometric methods, they incorporate statistical learning to handle variability in appearance and shape, enabling real-time applications in constrained settings like depth sensor inputs. A prominent example is the use of random forests for per-pixel body part classification and keypoint regression, as demonstrated in the Microsoft Kinect system. In this method, synthetic depth data is used to train randomized decision trees that classify each pixel into one of 31 body parts, with offsets predicted to locate joint centers in 3D space; this achieved mean errors of around 100 mm for major joints on real depth images, enabling robust pose estimation at interactive frame rates without relying on RGB cues. Similarly, SVMs trained on HOG features have been applied for pose classification by detecting oriented edge distributions in local image patches, which capture body contours and limb orientations to distinguish between discrete pose categories, often achieving accuracies above 80% on benchmark datasets like INRIA Person. These techniques bridge classical geometric vision and statistical learning by learning mappings from features to poses, reducing sensitivity to exact pixel-level alignments. For human-specific 3D pose estimation, the Pictorial Structures (PS) model incorporates kinematic constraints to lift 2D part detections into 3D. Introduced as a tree-structured graphical model, PS represents the body as parts connected by spring-like potentials that enforce anatomical limb lengths and angles; unary potentials from appearance models (e.g., HOG-based detectors) score part locations, while pairwise terms ensure global consistency, allowing efficient dynamic programming inference for pose recovery with sub-pixel accuracy on standard benchmarks. This framework has been extended to handle occlusions by marginalizing over hidden parts, improving joint error rates by 20-30% compared to independent part detectors in multi-view setups. Despite their efficiency, traditional feature-based methods suffer from sensitivity to viewpoint variations and environmental factors, as handcrafted features like HOG degrade under significant pose rotations or clutter, leading to error rates exceeding 150 mm in joint positions for unseen views. Their heavy reliance on annotated datasets for training also limited scalability in the pre-deep learning era, where annotation was labor-intensive and often dataset-specific. These limitations paved the way for hybrid systems that combined feature-based detectors with emerging neural networks for end-to-end refinement, enhancing robustness before fully data-driven paradigms dominated.
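The HOG-plus-SVM pipeline described above can be illustrated with off-the-shelf components. The following is a toy sketch with random stand-in patches and hypothetical pose-category labels; a real system would use cropped person images and a properly annotated training set.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def extract_hog(patch):
    """HOG descriptor capturing oriented edge distributions in a grayscale patch."""
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

patches = np.random.rand(40, 64, 64)          # stand-in for cropped person images
labels = np.random.randint(0, 3, 40)          # hypothetical categories, e.g. standing/sitting/walking
features = np.stack([extract_hog(p) for p in patches])

clf = SVC(kernel="linear").fit(features, labels)   # linear SVM over HOG features
print(clf.predict(features[:5]))
```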

Deep Learning Techniques

Deep learning techniques have become the dominant paradigm for 3D pose estimation since the mid-2010s, enabling end-to-end learning of spatial and temporal features directly from images or videos, often surpassing traditional methods that relied on handcrafted descriptors. These approaches leverage convolutional neural networks (CNNs), recurrent networks, and more recently transformers to regress joint positions, addressing challenges like depth ambiguity in monocular settings through data-driven priors. A key architectural strategy is 2D-to-3D lifting, where 2D keypoints detected from images are mapped to 3D coordinates, using temporal convolutions for video inputs. For instance, VideoPose3D employs dilated temporal convolutions on sequences of 2D poses to estimate 3D trajectories, achieving a state-of-the-art mean per joint position error (MPJPE) of 46.8 mm on the Human3.6M dataset through semi-supervised training on unlabeled videos. Other variants integrate depth estimation with pose regression in self-supervised frameworks that exploit geometric consistency across frames or views. In human pose estimation, early deep methods focused on heatmap regression for 2D detection followed by 3D lifting via triangulation or optimization. The stacked hourglass network, introduced in 2016, processes multi-scale features through repeated bottom-up and top-down pathways to generate precise 2D heatmaps, serving as a foundation for subsequent 3D extensions and achieving a PCKh@0.5 of 88.0% on the MPII dataset. More recent transformer-based models like METRO directly regress 3D keypoints and mesh vertices from images using masked vertex modeling, capturing long-range dependencies across body joints and achieving state-of-the-art results on Human3.6M without multi-view supervision. Training these models typically involves supervision on annotated datasets such as MPI-INF-3DHP, which provides 3D joint annotations for over 1.3 million frames captured in varied indoor and outdoor scenes with occlusions. Self-supervised alternatives employ consistency losses, such as enforcing temporal smoothness or agreement across views, to leverage unlabeled data and mitigate annotation scarcity, often reducing the need for ground truth by up to 90% in semi-supervised setups. Advances in efficiency and robustness include models optimized for mobile and edge devices; MediaPipe Pose, based on the lightweight BlazePose topology, detects 33 3D keypoints at over 30 frames per second on mobile hardware. To handle occlusions, graph convolutional networks (GCNs) model the skeleton as a graph, propagating features across joints to infer hidden parts; for example, pose-aware GCNs incorporate a central global node to aggregate contextual information, reducing MPJPE by approximately 7% on Human3.6M. Recent developments as of 2025 include hybrid models combining state-space models such as Mamba with GCNs for efficient spatiotemporal modeling, such as Pose Magic, which achieves improved MPJPE on benchmarks like Human3.6M through selective scanning mechanisms. Diffusion models have also emerged for generating diverse 3D poses, enhancing robustness in low-data regimes.
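To make the 2D-to-3D lifting idea concrete, the following is a toy PyTorch sketch of a fully connected lifting network in the spirit of simple per-frame baselines; the layer sizes, dropout rate, and 17-joint layout are illustrative and do not reproduce any published architecture.

```python
import torch
import torch.nn as nn

class LiftingMLP(nn.Module):
    """Toy 2D-to-3D lifting network: 17 detected 2D joints -> 17 3D joints."""
    def __init__(self, n_joints=17, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints * 2, hidden), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(hidden, n_joints * 3))

    def forward(self, joints_2d):                       # (batch, 17, 2)
        out = self.net(joints_2d.flatten(1))
        return out.view(-1, joints_2d.shape[1], 3)      # (batch, 17, 3)

model = LiftingMLP()
pose_3d = model(torch.randn(8, 17, 2))                  # predicted 3D joints
# An MPJPE-style supervised loss against (here random) 3D ground truth:
loss = nn.functional.mse_loss(pose_3d, torch.randn(8, 17, 3))
```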

Sensor Fusion and Alternative Inputs

RGB-D and Depth Sensor Integration

RGB-D sensors combine color (RGB) images with per-pixel depth measurements, enabling more robust human pose estimation by providing direct geometric cues that mitigate depth ambiguities inherent in RGB-only methods. These sensors capture depth data D(u,v) aligned with RGB pixels (u,v), allowing straightforward back-projection of 2D detections into 3D space while preserving metric scale without additional assumptions. Early approaches, such as the Kinect-based system introduced by Shotton et al., demonstrated pose estimation from depth alone using randomized decision forests to label body parts and mean-shift clustering to localize joints, achieving high accuracy on indoor datasets. Data fusion in RGB-D pose estimation typically involves registering the depth map to the RGB image via intrinsic and extrinsic calibration, followed by techniques like the iterative closest point (ICP) algorithm for alignment. ICP iteratively minimizes the objective \min_{T} \sum \| \mathbf{p}_i - T(\mathbf{q}_i) \|^2, where \mathbf{p}_i and \mathbf{q}_i are corresponding points in the source and target clouds, and T is the rigid transformation, enabling accurate tracking of human poses against reconstructed models. Seminal work like KinectFusion adapted ICP for real-time dense surface mapping and camera tracking using RGB-D inputs, laying the groundwork for human pose applications by fusing successive frames into a global volumetric model for joint optimization. Modern methods leverage joint RGB-D processing, such as end-to-end networks that extract 2D poses from RGB and refine them using depth-derived point clouds, or convolutional neural networks (CNNs) applied to voxelized depth data for volumetric pose regression. For instance, V2V-PoseNet employs a voxel-to-voxel network on discretized depth volumes to predict heatmaps of joint locations, improving robustness to occlusions in monocular setups extendable to RGB-D fusion. RGB-D integration offers key advantages, including inherent metric scale from depth measurements—avoiding scale drift in monocular approaches—and enhanced performance in controlled indoor environments, where systems achieve sub-centimeter accuracy for applications in human-robot interaction. Common RGB-D hardware includes structured light sensors, like the original Microsoft Kinect, which project an infrared pattern onto the scene and compute depth via triangulation from pattern deformation, excelling in high-resolution close-range capture (up to 4 m) but sensitive to ambient light. In contrast, time-of-flight (ToF) sensors, such as those in the Intel RealSense L515 or Azure Kinect, measure round-trip light travel time using modulated infrared light, providing wider range (up to 9 m) and faster acquisition with lower computational overhead, though with potential noise on reflective surfaces. This hardware diversity supports versatile deployment in pose estimation pipelines, balancing accuracy, speed, and environmental robustness.
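The back-projection of 2D detections into metric 3D space mentioned above is a direct inversion of the pinhole model using the aligned depth value. A minimal sketch, assuming the depth map is already registered to the RGB frame and expressed in meters:

```python
import numpy as np

def backproject_keypoints(keypoints_2d, depth_map, K):
    """Lift 2D joint detections to 3D camera coordinates using aligned depth.

    keypoints_2d: (J, 2) pixel coordinates, depth_map: H x W metric depth,
    K: 3x3 intrinsics of the depth-registered RGB camera.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    joints_3d = []
    for u, v in keypoints_2d:
        z = depth_map[int(round(v)), int(round(u))]   # metric depth at the keypoint
        x = (u - cx) * z / fx                          # invert the pinhole projection
        y = (v - cy) * z / fy
        joints_3d.append((x, y, z))
    return np.array(joints_3d)
```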

IMU and Multi-Modal Fusion

Inertial measurement units (IMUs) provide high-frequency data on linear acceleration and angular velocity, enabling robust 3D pose estimation when fused with visual inputs in environments where camera data may degrade, such as low-light or fast-motion scenarios. This fusion enhances continuity and accuracy by leveraging the complementary strengths of IMUs for short-term motion prediction and cameras for long-term drift correction. Fusion frameworks often employ Kalman filters to integrate RGB and IMU data, with extended Kalman filters (EKF) commonly used for nonlinear state propagation and updates. A seminal example is the method fusing wearable IMUs with multi-view images, which estimates 3D poses by first detecting 2D poses from images and then optimizing a kinematic body model incorporating IMU constraints on limb accelerations and rotations. This approach refines estimates through bundle adjustment-like optimization, achieving mean per-joint position errors (MPJPE) below 50 mm on datasets like Human3.6M, even with partial occlusions. Recent advancements as of 2025, such as MobilePoser, extend this to real-time full-body estimation using sparse IMUs from consumer devices like smartphones, supporting global translation without specialized hardware. Multi-modal fusion extends beyond RGB-IMU by incorporating other sensors for specialized applications. LiDAR-IMU integration, as in LiDAR-aid inertial posers, captures challenging 3D human motions by fusing point clouds with IMU data, estimating consecutive local poses and global trajectories through feature matching and optimization, with drift rates below 0.1% over large-scale indoor tests. For human interaction scenarios, audio-visual fusion leverages acoustic signals to refine visual pose estimates, such as using microphone arrays to detect subtle movements via acoustic reflections, improving localization in occluded or noisy visual conditions. Key benefits include drift correction during temporary vision loss, enabling reliable navigation in GPS-denied areas like indoors or urban canyons, where IMU data maintains pose continuity over seconds to minutes. Such systems support applications in autonomous drones and mobile robots, reducing localization errors by up to 50% compared to vision-only methods in degraded conditions. Challenges in these fusions arise from sensor synchronization, requiring precise timestamp alignment to avoid pose inconsistencies, often addressed via hardware triggers or software timestamp interpolation. Additionally, modeling sensor noise—particularly IMU biases that accumulate over time—demands robust calibration and online bias estimation to prevent error propagation in the fused estimate.
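The complementary roles of the two modalities (high-rate IMU integration for short-term prediction, lower-rate visual estimates for drift correction) can be illustrated with a single-angle complementary filter; this is a deliberately simplified stand-in for EKF-based visual-inertial fusion, and all rates and values below are hypothetical.

```python
import numpy as np

def fuse_orientation(theta_prev, gyro_rate, dt, theta_vision, alpha=0.98):
    """Complementary filter for one orientation angle.

    gyro_rate: IMU angular velocity (rad/s), integrated for short-term accuracy.
    theta_vision: absolute angle from the camera pipeline, or None when vision
    drops out; when available it corrects long-term gyro drift.
    """
    theta_imu = theta_prev + gyro_rate * dt            # high-frequency IMU prediction
    if theta_vision is None:
        return theta_imu                               # vision lost: rely on IMU alone
    return alpha * theta_imu + (1 - alpha) * theta_vision   # drift correction

# Example: 100 Hz gyro updates with a sparse visual correction every 10th step
theta = 0.0
for k in range(100):
    vision = 0.05 * k if k % 10 == 0 else None         # hypothetical vision estimate
    theta = fuse_orientation(theta, gyro_rate=0.5, dt=0.01, theta_vision=vision)
```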

Evaluation and Challenges

Performance Metrics

In 3D pose estimation, performance is quantified using error metrics that assess the accuracy of predicted joint positions or object orientations relative to ground truth, often reported in millimeters for translations and degrees or radians for rotations. These metrics enable standardized comparisons across methods and datasets, emphasizing both absolute positioning errors and alignment-invariant measures to account for rigid transformations. For human pose estimation, the primary focus is on joint-level positional accuracy. The Mean Per Joint Position Error (MPJPE) is a cornerstone metric for 3D human pose estimation, computing the average Euclidean distance between predicted and ground-truth 3D joint coordinates after root joint alignment to remove global translation offsets. It is defined as: \text{MPJPE} = \frac{1}{F \times J} \sum_{f=1}^{F} \sum_{j=1}^{J} \| \hat{x}_{f,j} - x_{f,j} \| where F is the number of frames, J is the number of joints, \hat{x}_{f,j} denotes the predicted joint position, and x_{f,j} is the ground truth, with values reported in millimeters (mm). Lower MPJPE values indicate higher accuracy, and it is widely applied on the Human3.6M dataset, a large-scale benchmark comprising 3.6 million 3D poses from 11 subjects across 15 activities captured by four calibrated cameras. To mitigate sensitivity to rigid body transformations like scale, rotation, and translation, the Procrustes-Aligned MPJPE (PA-MPJPE) applies a similarity transformation alignment before computing the error, providing a more robust measure of shape fidelity. PA-MPJPE is likewise reported in mm and often yields 10-20 mm improvements over raw MPJPE in benchmarks. Evaluation protocols standardize comparisons, with Human3.6M commonly employing Protocol 1 (P1): training on five subjects (S1, S5, S6, S7, S8) and testing on two unseen subjects (S9, S11) using all camera views at 10 frames per second, averaging errors over 14-17 joints. Protocol 2 (P2) extends this by training on six subjects and testing on S11 with Procrustes alignment for PA-MPJPE. As of 2025, state-of-the-art methods achieve MPJPE below 40 mm on Human3.6M under Protocol 1. A notable challenge in these protocols is the performance gap between real and synthetic data, where models trained on synthetic poses (e.g., via domain randomization) often exhibit higher MPJPE on real-world benchmarks due to texture, lighting, and motion artifacts not fully captured in simulations.
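Both metrics are short computations on (frames × joints × 3) arrays. A minimal NumPy sketch, assuming the inputs are already root-aligned for MPJPE and using an orthogonal-Procrustes similarity alignment per frame for PA-MPJPE:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs (e.g., mm).
    pred, gt: (F, J, 3) arrays, assumed already root-aligned."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: similarity-align each frame before the error."""
    errs = []
    for P, G in zip(pred, gt):
        muP, muG = P.mean(0), G.mean(0)
        Pc, Gc = P - muP, G - muG
        # Optimal rotation and scale via SVD of the cross-covariance
        U, S, Vt = np.linalg.svd(Pc.T @ Gc)
        D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
        R = Vt.T @ D @ U.T
        s = np.trace(np.diag(S) @ D) / (Pc ** 2).sum()
        aligned = s * Pc @ R.T + muG
        errs.append(np.linalg.norm(aligned - G, axis=-1).mean())
    return float(np.mean(errs))
```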

Limitations and Open Problems

One of the primary challenges in 3D pose estimation is handling occlusions, where body parts are obscured by objects, other individuals, or self-occlusion, leading to significant accuracy degradation, particularly in dense or crowded environments. For instance, in multi-person scenarios, occlusions can cause a substantial portion of joints to be hidden, complicating joint association and depth estimation across views. Viewpoint variations further exacerbate this issue, as monocular setups suffer from inherent depth ambiguity, while multi-view systems are sensitive to arbitrary camera perspectives and poor calibration, resulting in inconsistent reconstructions and reduced generalization to unseen angles. Generalization across domains remains a persistent limitation, with models trained on controlled laboratory datasets like Human3.6M exhibiting substantial performance drops in in-the-wild settings due to variations in lighting, clothing, and body types, highlighting the domain gap between synthetic and real-world data. Computational demands also pose a barrier to real-time deployment, as deep learning-based methods, especially those involving volumetric representations or multi-person processing, require high resources that limit their use on edge devices or in latency-critical applications. Open issues include the reliance on supervised learning, which demands expensive 3D annotations; unsupervised approaches leveraging kinematic constraints or self-supervision show promise but struggle with robustness to noise and lack of diverse unlabeled data. Ethical concerns arise in surveillance applications, where pose tracking enables detailed behavioral monitoring without explicit consent, raising privacy risks and necessitating privacy-preserving techniques like visual obfuscation or non-visual sensors to balance utility and individual rights. Scaling to crowds is another unresolved challenge, as current multi-person methods falter under heavy inter-person occlusions and varying scales, with bottom-up approaches offering better efficiency but lower precision for distant or small figures. Looking ahead, integrating Neural Radiance Fields (NeRFs) holds potential for dynamic scenes by enabling articulated 3D human reconstruction from sparse views, improving handling of motion and occlusions through implicit scene representations, as demonstrated in methods like A-NeRF that combine pose priors with radiance fields. Additionally, advancements in quantum-inspired sensors, such as single-photon sensitive depth imagers, could enhance pose estimation in low-light or privacy-sensitive environments by providing high-resolution, non-visual depth data beyond classical RGB-D limitations. These directions, alongside efforts to quantify limitations using metrics like mean per joint position error under occlusion benchmarks, underscore the need for hybrid, efficient systems to advance practical adoption.

Implementations

Open-Source Libraries

Several prominent open-source libraries facilitate the implementation of 3D pose estimation algorithms, providing developers with accessible tools for both classical and learning-based approaches. These libraries offer code in languages like Python and C++, pre-trained models, and integrations that support practical deployment in research and applications. Key examples include OpenCV for geometric solvers, MediaPipe for real-time human pose tracking, COLMAP for structure-from-motion pipelines, and deep learning toolboxes like MMPose and AlphaPose. OpenCV, a widely-used computer vision library, includes Perspective-n-Point (PnP) solvers such as solvePnP, which estimate the 3D pose of an object relative to a camera from 2D-3D point correspondences and camera intrinsics. This function supports variants like RANSAC for robust outlier rejection and is implemented in both C++ and Python, enabling efficient integration into larger systems. With over 84,000 GitHub stars, OpenCV remains actively maintained, with regular updates enhancing its calibration and reconstruction modules. MediaPipe, developed by Google and released in 2019, provides a cross-platform framework for real-time multimodal ML pipelines, including 3D human pose estimation via the Pose Landmarker task, which detects 33 body landmarks in 3D space using models like BlazePose GHUM. It supports Python, C++, and Java APIs, running efficiently on mobile and desktop devices for applications like fitness tracking. The library has garnered significant community adoption, evidenced by its integration with ROS through wrappers like ros_mediapipe, which publish pose data as ROS messages for robotics use. COLMAP offers a robust pipeline for Structure-from-Motion (SfM) and Multi-View Stereo, incorporating bundle adjustment to refine camera poses and 3D points from image collections, essential for scene-level pose estimation. Implemented primarily in C++ with command-line and graphical interfaces, it excels in reconstructing sparse models from unordered photo sets. The project, with ongoing updates as of 2025, supports extensions for robotics via ROS compatibility in derived tools. MMPose, part of the OpenMMLab ecosystem and initially released in 2020, is a PyTorch-based toolbox built on MMDetection, specializing in 2D and 3D human pose estimation with support for top-down and bottom-up paradigms. It includes pre-trained models for 3D tasks evaluated on datasets like Human3.6M, achieving state-of-the-art mean per joint position error (MPJPE) metrics through architectures like VideoPose3D. Featuring Python APIs and over 6,900 GitHub stars, MMPose emphasizes modularity for custom training and inference, with community contributions driving updates like RTMW3D for real-time whole-body estimation. AlphaPose provides real-time multi-person pose estimation, extending to 3D via integrations like HybrIK for lifting 2D keypoints to 3D meshes, with updates including whole-body support in its 2022 release and subsequent refinements. Available in Python with PyTorch, it achieves high accuracy on benchmarks like COCO (75 mAP) and supports tracking across frames. Boasting around 7,500 stars, the repository reflects strong community engagement, though development has been less active since 2022 and 3D features are often augmented through external modules.
Library    | Primary Language  | Key 3D Feature            | GitHub Stars (approx.) | ROS Integration
OpenCV     | C++/Python        | PnP solvers               | 84,800                 | Native via cv_bridge
MediaPipe  | C++/Python/Java   | 3D landmark detection     | 25,000+                | Via ros_mediapipe wrapper
COLMAP     | C++               | Bundle adjustment in SfM  | 5,900+                 | Compatible through pipelines
MMPose     | Python (PyTorch)  | 3D models on Human3.6M    | 6,900                  | Community extensions
AlphaPose  | Python (PyTorch)  | 2D-to-3D lifting          | 7,500                  | Limited, via custom nodes
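As a usage illustration, the snippet below obtains 3D pose landmarks from a single image with MediaPipe's Python "solutions" API; the image path is a placeholder, and the world landmarks are metric coordinates expressed relative to the body's hip center.

```python
import cv2
import mediapipe as mp

image = cv2.imread("person.jpg")                     # placeholder input image
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.pose_world_landmarks:
        # 33 BlazePose GHUM landmarks in meters, relative to the hips
        for lm in results.pose_world_landmarks.landmark:
            print(lm.x, lm.y, lm.z, lm.visibility)
```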

Real-World Applications

3D pose estimation enables immersive augmented reality experiences on mobile devices through Apple's ARKit framework, introduced in 2017, which tracks the six-degree-of-freedom (6DOF) pose of devices and supports subsequent advancements in human body pose detection for interactive applications. This system has facilitated widespread deployment in consumer AR, powering features like virtual object placement and motion tracking with a single camera on compatible iPhones and iPads. Medical applications utilize Vicon's optical motion capture systems for precise 3D pose tracking in clinical settings, including surgery planning and rehabilitation in the 2020s, where captured gait and motion data inform personalized treatment strategies and evaluate surgical outcomes. These systems provide sub-millimeter accuracy, addressing challenges like marker-based setup in sterile environments to optimize patient progress monitoring. In the gaming industry, 3D pose estimation is employed for real-time character control and interactive simulations, integrating webcam-based pose data to drive character movements in applications like physiotherapy games and multiplayer experiences. Deployment success is evident in scale, with ARKit contributing to over 1 billion mobile users worldwide by 2024, enabling billions of annual sessions across apps despite constraints like computational limits on edge devices. These implementations highlight how hurdles in accuracy and robustness have been overcome, fostering adoption in high-stakes environments from entertainment to healthcare.

    The precise data collected through motion capture allowed doctors to evaluate the effectiveness of surgical interventions and tailor rehabilitation programs to ...Innovation In Animal Science... · Mocap Applications In Sports... · Mocap Software For Life...