
Video tracking

Video tracking is a fundamental technique in computer vision that involves detecting, identifying, and continuously monitoring the positions and trajectories of objects or entities across consecutive frames in a video sequence, enabling the analysis of motion in dynamic scenes. This process typically begins with object detection in the initial frame, followed by association and prediction mechanisms to maintain identity and location in subsequent frames, distinguishing it from static image analysis by incorporating temporal information. The significance of video tracking lies in its wide-ranging applications across industries, including autonomous driving, where it facilitates real-time obstacle avoidance and path planning; surveillance systems for monitoring human activity and detecting anomalies; robotics for navigation and interaction with environments; and sports analytics for player movement evaluation. Key methods in video tracking have evolved from classical approaches like Kalman filtering for motion prediction and mean-shift algorithms for feature-based tracking to modern deep learning-based techniques such as convolutional neural networks (CNNs) for appearance modeling and recurrent neural networks (RNNs) for sequence prediction, with advanced algorithms like SORT (Simple Online and Realtime Tracking) and DeepSORT integrating detection and re-identification for multi-object scenarios. Despite these advancements, video tracking faces persistent challenges, including handling occlusions where objects are temporarily hidden, variations in illumination and camera viewpoint that alter object appearance, scale changes due to distance, and the computational demands of processing high-resolution videos. Recent developments, such as transformer-based models and end-to-end learning frameworks, aim to address these issues by improving robustness and accuracy, particularly in multi-object tracking (MOT) applications.

Overview

Definition

Video tracking is the process of locating a moving object—or multiple objects—over time using a camera, by estimating their trajectories in the image plane across successive video frames and assigning consistent labels to them. This task provides object-centric information, such as position, orientation, area, and shape, while following the objects' motion in the video sequence. Unlike object detection, which identifies and localizes objects within individual frames without considering temporal continuity, video tracking establishes correspondences between detections across frames to maintain identity over time. Similarly, it differs from image segmentation, which partitions a single frame into meaningful regions or delineates object boundaries, whereas video tracking focuses on temporal persistence and motion rather than static partitioning. The process involves three key components: initialization, where the target object is selected and detected in the initial frame (often via manual specification or automated detection); tracking, which predicts and updates the object's position in subsequent frames; and termination, which occurs when the object exits the scene, becomes occluded beyond recovery, or fails other predefined criteria. The basic workflow proceeds frame by frame, encompassing detection to identify potential object regions, association to match these regions across frames based on appearance or spatial proximity, and prediction to estimate future positions using underlying motion assumptions. Motion models underpin the prediction step by modeling possible object displacements, such as constant velocity or acceleration.
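To make the workflow concrete, the following minimal sketch implements the detect–associate–predict loop in Python with synthetic detections, a constant-velocity prediction, and a greedy nearest-neighbor association; all values, names, and the distance threshold are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def associate(predictions, detections, max_dist=50.0):
    """Greedy nearest-neighbor association between predicted track
    positions and current-frame detections (both arrays of (x, y))."""
    matches, used = [], set()
    for t_idx, pred in enumerate(predictions):
        if len(detections) == 0:
            break
        dists = np.linalg.norm(detections - pred, axis=1)
        d_idx = int(np.argmin(dists))
        if dists[d_idx] < max_dist and d_idx not in used:
            matches.append((t_idx, d_idx))
            used.add(d_idx)
    return matches

# Illustrative loop: one track with a constant-velocity prediction.
position = np.array([100.0, 100.0])   # initialization (first frame)
velocity = np.array([5.0, 0.0])
for frame_idx in range(1, 5):
    predicted = position + velocity                            # prediction
    detections = np.array([[100.0 + 5.1 * frame_idx, 100.4]])  # synthetic
    matches = associate(predicted[None, :], detections)
    if matches:                       # update track from the matched detection
        _, d_idx = matches[0]
        velocity = detections[d_idx] - position
        position = detections[d_idx]
    else:
        position = predicted          # coast through missed detections
    print(frame_idx, position)
```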

Objectives and Scope

Video tracking primarily aims to achieve accurate localization of target objects across consecutive frames in a video sequence, estimating their trajectories while providing object-centric information such as position, orientation, and shape. This involves robust handling of object motion, accounting for challenges like abrupt motion changes, noise, and environmental variations to maintain consistent tracking. By enabling such precise monitoring, video tracking facilitates downstream tasks, including behavior analysis in controlled or natural settings. The scope of video tracking is generally confined to 2D representations derived from monocular video feeds, where objects are projected onto a planar image space, limiting direct depth estimation without additional sensors. Extensions beyond this core focus include 3D tracking using multi-camera setups or depth-aware systems for spatial reconstruction, as well as multi-object scenarios that simultaneously monitor multiple entities while resolving associations and occlusions. These expansions address limitations in single-view setups but introduce complexities in calibration and synchronization. Performance objectives emphasize high accuracy in localization, low computational latency for real-time applications, and adaptability to dynamic conditions such as varying object speeds, viewpoints, or illumination. Metrics like Intersection over Union (IoU) and center location error are commonly used to evaluate these goals, prioritizing robustness over exhaustive detail in diverse environments. In broader computer vision and artificial intelligence contexts, video tracking integrates with learning-based frameworks to support automated, non-intrusive systems for real-world monitoring, enhancing scalability in fields such as surveillance and behavioral research without manual intervention. This integration underscores its role in enabling scalable AI-driven analysis while respecting operational boundaries in feature extraction and motion modeling.

Historical Development

Early Foundations (Pre-2000)

Video tracking originated in the 1970s and 1980s within the burgeoning field of computer vision, where researchers sought to quantify motion from sequential images to understand scene dynamics. Initial efforts focused on optical flow, a concept formalizing the displacement of image intensities between frames to infer object or camera movement. This period marked the transition from static image analysis to dynamic scene analysis, driven by advances in hardware and algorithms that modeled motion under assumptions of brightness constancy and spatial coherence. Pioneering work by Berthold K. P. Horn and Brian G. Schunck in 1981 introduced a global optimization framework for dense optical flow estimation, treating the problem as an energy minimization that balances data fidelity from the optical flow constraint equation with a smoothness penalty to resolve the inherently ill-posed problem. Their approach assumed smooth flow variations across the image, enabling the computation of velocity fields for entire scenes but requiring iterative solutions that highlighted the era's emphasis on mathematical rigor over real-time performance. A key milestone in sparse feature tracking came concurrently with the Lucas-Kanade method, also proposed in 1981, which addressed limitations in global methods by estimating flow locally within small windows around distinctive image points. Developed by Bruce D. Lucas and Takeo Kanade, this differential technique solves an overdetermined system of equations derived from the brightness constancy assumption, assuming constant motion within the window to yield robust point correspondences suitable for tracking salient features like corners. Widely adopted for its computational efficiency compared to dense alternatives, the method facilitated applications in stereo vision and early motion analysis, though it struggled with large displacements or weakly textured regions lacking sufficient gradients. These techniques laid the groundwork for deterministic tracking paradigms, where correspondence relied on explicit geometric constraints rather than probabilistic inference. The 1990s saw the integration of predictive models into video tracking, particularly through Kalman filtering, to handle occlusions and noise in dynamic environments. Originally formulated in the 1960s for state estimation, the Kalman filter was adapted for video applications to recursively predict and update object trajectories based on linear motion models and Gaussian noise assumptions. In early systems, such as the RAPID video-rate tracker developed around 1990, Kalman filtering refined pose estimates from model-based recognition, enabling real-time monitoring of moving objects in controlled settings like industrial inspection. This approach improved tracking robustness by fusing sequential measurements, with representative systems achieving update rates of 25-50 Hz on era-specific hardware for simple rigid-body motions. Despite these advances, pre-2000 video tracking was constrained by reliance on simplistic geometric models, such as constant velocity or affine transformations, which presupposed predictable rigid motion and uniform illumination and could not accommodate non-rigid deformations. Computational limitations of 1980s-1990s processors—often limited to single-core operations at MHz speeds—restricted algorithms to low-resolution footage (e.g., 320x240 pixels) and offline processing, precluding real-time use in complex, unstructured scenes. Research thus prioritized controlled environments, like laboratory setups or fixed-camera surveillance, where variability in lighting, occlusions, or multi-object interactions could be minimized to ensure algorithmic stability.
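As a concrete illustration, the sketch below tracks corner features with OpenCV's pyramidal Lucas-Kanade implementation, which descends from the 1981 method; the video path and parameter values are illustrative assumptions.

```python
import cv2

cap = cv2.VideoCapture("input.mp4")  # hypothetical video file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Corner selection, then pyramidal Lucas-Kanade point tracking.
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                             qualityLevel=0.01, minDistance=7)
while True:
    ok, frame = cap.read()
    if not ok or p0 is None or len(p0) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Local window-based flow under the brightness constancy assumption.
    p1, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, p0, None, winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1           # keep successfully tracked points
    p0 = p1[good].reshape(-1, 1, 2)
    prev_gray = gray
cap.release()
```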

Advances in the 2000s and Deep Learning Era (2000-Present)

The 2000s marked a pivotal era for video tracking, transitioning from rudimentary optical flow methods to more sophisticated probabilistic frameworks that addressed real-world complexities like clutter and non-linear motion. Particle filters, exemplified by the CONDENSATION algorithm developed by Isard and Blake, saw widespread adoption post-2000 despite its 1998 origins, enabling robust state estimation through sequential Monte Carlo sampling for conditional density propagation in visual tracking scenarios. Complementing this, mean-shift tracking, introduced by Comaniciu et al. in 2003, utilized kernel-based density estimation to represent and localize non-rigid objects, adaptively updating appearance models to handle variations in scale and illumination while maintaining computational efficiency. These advancements laid the groundwork for scalable tracking systems, improving reliability in dynamic environments such as surveillance footage. The 2010s ushered in the deep learning revolution, shifting focus from hand-crafted features to data-driven representations that integrated feature learning directly into tracking pipelines. A landmark contribution was the Fully-Convolutional Siamese Network (SiamFC) by Bertinetto et al. in 2016, which employed end-to-end cross-correlation between exemplar and search frames to achieve real-time performance, outperforming prior methods on challenging sequences involving deformations and occlusions. This era emphasized correlation-based trackers, fostering the development of correlation filter variants that balanced speed and accuracy, and set the stage for hybrid approaches combining probabilistic foundations with neural networks. In the 2020s, transformer-based architectures and generative models have dominated, enhancing global context modeling and resilience to severe challenges like long-term occlusions. The Transformer Tracking (TransT) approach by Chen et al. in 2021 leveraged attention mechanisms to fuse template and search-region features, achieving state-of-the-art results on benchmarks by capturing spatiotemporal dependencies without relying on online updates. Concurrently, diffusion models have emerged for robust trajectory prediction, as in DiffusionTrack by Luo et al. in 2024, which treats multi-object association as a denoising process to generate consistent tracks under uncertainty, demonstrating superior handling of crowded scenes. These innovations, supported by edge AI hardware optimizations, have enabled real-time deployment on resource-constrained devices, with trackers like TransT variants reaching over 30 frames per second on standard GPUs. By 2024-2025, further progress included state-space models in SAMBA-MOTR (2024) for complex motion handling, achieving high HOTA scores on benchmarks like DanceTrack, and transformer-based association in CAMELTrack (2025), improving performance in crowded scenarios. Progress throughout this period has been accelerated by influential benchmarks, including the Object Tracking Benchmark (OTB) by Wu et al. in 2015, which standardized evaluation across 100 sequences with attributes like illumination variation and occlusion, revealing gaps in existing trackers and driving iterative improvements. Similarly, the Visual Object Tracking (VOT) challenge, initiated in 2013 by Kristan et al., has annually evaluated dozens of trackers on diverse datasets, emphasizing metrics such as expected average overlap and robustness, with top Q-scores exceeding 70% in the VOTS2024 challenge as of 2024.

Fundamental Principles

Motion Models and Estimation

Motion models in video tracking provide mathematical frameworks to predict and estimate the displacement of objects across consecutive frames, enabling robust prediction of future positions based on observed trajectories. These models assume that object motion follows certain patterns, which are parameterized to capture translational, rotational, or deformative changes. Common linear models include constant velocity and constant acceleration assumptions, which simplify prediction by treating motion as uniform or uniformly varying over short intervals. For instance, the constant velocity model posits that an object's position updates linearly from its prior velocity, incorporating process noise to account for minor deviations, as derived in state-space formulations where the state vector includes position, velocity, and bounding box dimensions. Similarly, constant acceleration extends this by including a second-order term, constraining motion to parabolic paths under prior knowledge of smooth changes in speed. These models are particularly effective for rigid objects in predictable environments, reducing computational overhead in tracking pipelines. For more complex rigid body motions in 2D or 3D spaces, affine transformations serve as a foundational parameterization with six parameters: translation (two parameters), rotation, scaling, and shear. This model warps image regions via a linear transformation of coordinates plus a translation, preserving parallelism and length ratios along parallel lines, making it suitable for moderate geometric distortions in video sequences. Affine models are widely adopted in tracking algorithms, where they parameterize motion between frames to align object templates. In 3D extensions, they incorporate depth projections for volumetric estimation, though limited to non-perspective effects. Optical flow estimation complements these models by computing pixel-wise or point-wise motion vectors, directly informing predictions. Under the brightness constancy assumption, the intensity of corresponding points remains invariant over time, formalized as I(x, y, t) = I(x + dx, y + dy, t + dt), where I denotes image intensity, and (dx, dy) are the displacement components over the interval dt. Dense optical flow computes vectors for all pixels, enforcing global smoothness to resolve ambiguities, as in the seminal global approach that minimizes deviations from this constraint alongside neighboring consistency. In contrast, sparse optical flow focuses on keypoints or features, assuming local constancy within windows to estimate motion efficiently, ideal for tracking sparse trajectories in cluttered scenes. These methods integrate with motion models by providing empirical flow fields to refine parametric estimates. Predictive estimation enhances reliability by propagating motion models forward in tracking loops, where forward predictions initialize searches in subsequent frames. To assess accuracy, the forward-backward error evaluates consistency: a point is tracked forward from frame t to t+k, then backward to t, with the error defined as the Euclidean distance between the original and reconstructed positions. Low errors indicate reliable paths, filtering occlusions or drifts, and this measure integrates seamlessly into iterative tracking, such as median flow algorithms, boosting precision to over 95% in benchmark sequences. While often paired with filters like Kalman for state updates, the focus here remains on error-based validation within model predictions. Extensions to non-linear models address deformable objects, where rigid assumptions fail, using parametric warps like thin-plate splines (TPS) to interpolate local deformations.
TPS decomposes motion into a global affine component plus non-rigid deviations via control points, minimizing bending energy for smooth warps, with parameters estimated directly from appearance without explicit correspondences. In video tracking, this enables recovery of complex deformations, such as facial expressions or fabric motion, by iteratively refining warps in a stiff-to-flexible regularization scheme, achieving sub-pixel accuracy on real sequences. These models expand applicability to biological or articulated tracking while maintaining computational tractability.
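A minimal sketch of the forward-backward validation described above, assuming OpenCV's pyramidal Lucas-Kanade tracker, grayscale frames img0 and img1, and points given as a float32 array of shape (N, 1, 2); the 1-pixel error threshold is an illustrative choice.

```python
import cv2
import numpy as np

def forward_backward_filter(img0, img1, points, max_fb_error=1.0):
    """Track points img0 -> img1, then back img1 -> img0, and keep only
    points with a small forward-backward error (Median-Flow style)."""
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(img0, img1, points, None)
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(img1, img0, fwd, None)
    # Euclidean distance between original and reconstructed positions.
    fb_error = np.linalg.norm(points - bwd, axis=2).ravel()
    ok = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_error < max_fb_error)
    return fwd[ok], ok
```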

Feature Extraction and Representation

Feature extraction in video tracking involves identifying and quantifying distinctive characteristics of objects within video frames to enable robust matching and localization across sequences. These features capture aspects such as color, texture, and shape, providing the foundational descriptors that distinguish tracked objects from the background and other elements. By focusing on invariant or semi-invariant properties, extraction techniques enhance tracking reliability under varying conditions like partial occlusions or illumination changes. Common feature types include color histograms, which represent the distribution of pixel intensities in predefined bins to model object appearance; edge contours, which outline object boundaries based on intensity gradients; and texture descriptors such as Scale-Invariant Feature Transform (SIFT) for detecting keypoints robust to scale and rotation changes, or Histograms of Oriented Gradients (HOG) for capturing gradient orientations in local image patches. Extraction processes often begin with corner detection to identify salient points, as introduced by the Harris detector, which computes a corner response function based on the second-moment matrix of image intensities to locate high-variance regions suitable for tracking. For robustness to scale variations, scale-invariant features like SIFT employ difference-of-Gaussian filters to detect stable keypoints across multiple octaves of the image pyramid. Once extracted, features are represented in forms that facilitate matching and localization. Bounding boxes provide a rectangular enclosure for object localization, defining the spatial extent via coordinates of top-left and bottom-right corners. Kernels model probabilistic density by weighting feature contributions spatially, often using Gaussian or Epanechnikov profiles to emphasize central regions of the object. Silhouettes offer shape-based representations by binarizing object contours, enabling contour matching for non-rigid objects like human figures. A key example of histogram-based representation is the color histogram for a target's pixel set, defined as p(y) = \frac{1}{N} \sum_{i=1}^{N} \delta(y - q_i), where y is the quantized feature value, q_i are the quantized values for the N pixels within the region, and \delta is the Kronecker delta approximating the histogram bins. This formulation allows efficient comparison via metrics like the Bhattacharyya coefficient for localization.
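The sketch below illustrates this histogram representation and a Bhattacharyya-coefficient comparison in plain NumPy on synthetic single-channel patches; the bin count and patch contents are illustrative assumptions.

```python
import numpy as np

def histogram(patch, bins=16):
    """Quantize pixel values of a single-channel patch into `bins` bins
    and return a normalized histogram p with sum(p) = 1."""
    q = np.clip((patch.astype(np.float32) / 256.0 * bins).astype(int),
                0, bins - 1)
    hist = np.bincount(q.ravel(), minlength=bins).astype(np.float64)
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))

# Compare a template patch against a perturbed candidate (synthetic data).
rng = np.random.default_rng(0)
template = rng.integers(0, 256, (32, 32))
candidate = np.clip(template + rng.integers(-10, 10, (32, 32)), 0, 255)
print(bhattacharyya(histogram(template), histogram(candidate)))
```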

Tracking Algorithms

Deterministic Approaches

Deterministic approaches in video tracking rely on direct optimization of predefined criteria, such as similarity measures or error functions, to estimate object positions without modeling uncertainty through probability distributions. These methods typically assume consistent object appearance and motion, making them effective for straightforward scenarios like controlled indoor environments or short video sequences. Mean-shift tracking is a widely used algorithm that iteratively seeks the mode of a kernel density estimate in the feature space to localize the object across frames. The target model and candidate regions are represented by histograms of features, such as color distributions, and their similarity is quantified using the Bhattacharyya coefficient, given by \rho[p, q] = \sum_{i=1}^{m} \sqrt{p_i q_i}, where p and q are the probability distributions of the target and candidate, respectively, and m is the number of bins. By deriving the mean-shift vector from the gradient of this similarity function, the algorithm updates the object center towards higher-density regions until convergence, enabling robust tracking of non-rigid objects in real time. This approach was formalized by Comaniciu, Ramesh, and Meer in their seminal work on kernel-based tracking. Template matching constitutes a basic yet effective deterministic strategy for tracking rigid objects by performing an exhaustive search to align a reference template with regions in the current frame. Common metrics include the sum of squared differences (SSD), which minimizes pixel-wise intensity variances, or normalized cross-correlation (NCC), which accounts for illumination changes by normalizing the correlation. Efficient implementations, such as the fast NCC computation via precomputed running sums (integral images), reduce the search complexity from O(N^2 M^2) to O(N^2 + M^2) for an N \times N image and M \times M template, facilitating practical use in video applications. Lewis introduced this accelerated NCC method for template matching tasks. Optical flow-based tracking estimates object motion at the pixel level by computing dense or sparse flow fields between consecutive frames, assuming brightness constancy and spatial smoothness. In the Lucas-Kanade method, flow is solved within local windows using a least-squares minimization of the constraint equation, I_x u + I_y v + I_t = 0, where I_x, I_y, and I_t are spatial and temporal intensity derivatives, and (u, v) is the flow vector; this system is overdetermined and solved via the normal equations for sub-pixel accuracy. Block matching extends this discretely by selecting the displacement that minimizes SSD between blocks, suitable for hardware-efficient implementations in video coding and tracking. The foundational iterative technique was developed by Lucas and Kanade for image registration and stereo vision. The primary strengths of deterministic approaches lie in their simplicity, low computational overhead, and predictability, allowing for fast execution on resource-constrained devices and reliable performance in stable conditions with limited variability in object appearance or background.
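As an illustration of the mean-shift procedure, the following sketch uses OpenCV's hue-histogram back-projection and cv2.meanShift to iterate toward the density mode; the video path and initial window are illustrative assumptions.

```python
import cv2

cap = cv2.VideoCapture("input.mp4")       # hypothetical video file
ok, frame = cap.read()
x, y, w, h = 200, 150, 60, 80             # hypothetical initial box

# Target model: hue histogram of the initial region.
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
window = (x, y, w, h)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Candidate likelihood map from the target histogram (back-projection),
    # then mean-shift iterations toward the density mode.
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    _, window = cv2.meanShift(backproj, window, term)
cap.release()
```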

Probabilistic Filtering Methods

Probabilistic filtering methods in video tracking model the uncertainty inherent in object motion and observations using Bayesian frameworks, providing recursive estimates of object states that account for noise, occlusions, and incomplete data. These techniques propagate probability distributions over possible object states, balancing predictions from prior dynamics with updates from current frame measurements to achieve robust tracking over time. Unlike deterministic methods, they explicitly handle variations, making them suitable for real-world scenarios with sensor noise and environmental clutter. The Kalman filter serves as the cornerstone of probabilistic tracking, assuming linear Gaussian models for both state transitions and observations to deliver the minimum mean-square-error estimate. It operates in two phases: prediction, which advances the state estimate and enlarges its covariance to reflect growing uncertainty, and correction, which incorporates new measurements to refine the estimate and reduce uncertainty. The prediction step follows the state equation \mathbf{x}_k = F \mathbf{x}_{k-1} + \mathbf{w}_{k-1}, where \mathbf{x}_k denotes the state vector (e.g., position and velocity), F is the linear state transition matrix, and \mathbf{w}_{k-1} \sim \mathcal{N}(0, Q) is process noise. The measurement update uses \mathbf{z}_k = H \mathbf{x}_k + \mathbf{v}_k, with \mathbf{z}_k the observed data (e.g., bounding box coordinates), H the observation matrix, and \mathbf{v}_k \sim \mathcal{N}(0, R) measurement noise; the filter then computes the Kalman gain to blend prediction and measurement optimally. In video applications, this enables efficient single-object tracking under Gaussian assumptions, such as constant velocity motion. For nonlinear dynamics, such as those arising from camera motion or object rotations in video sequences, the Extended Kalman Filter (EKF) adapts the Kalman framework by linearizing nonlinear state and measurement functions around the current estimate using Jacobian matrices. The prediction propagates the state mean through the true nonlinear function f(\cdot), approximated as F_k = \frac{\partial f}{\partial \mathbf{x}} \big|_{\hat{\mathbf{x}}_{k|k-1}}, and similarly for the measurement Jacobian H_k; covariance updates then apply the standard Kalman equations to this local linearization. This Jacobian-based approach maintains computational efficiency while extending applicability to scenarios like perspective-distorted tracking, though it can suffer from instability if linearization errors accumulate. The Unscented Kalman Filter (UKF) improves upon EKF for strongly nonlinear systems by avoiding explicit linearization; instead, it selects a minimal set of sigma points sampled from the state distribution's mean and covariance, propagates them through the exact nonlinear functions, and reconstructs the transformed mean and covariance via weighted statistics. This sigma-point approach captures higher-order moments up to the third order for Gaussian inputs, yielding more accurate estimates without Jacobian derivatives, and has demonstrated superior performance in visual tracking tasks involving rapid maneuvers or non-Gaussian noise. Particle filters overcome the Gaussian and unimodal limitations of Kalman variants by representing the full posterior distribution as a weighted collection of random state samples, or particles, suitable for multimodal distributions in complex video environments. The algorithm sequentially samples particles from a motion proposal (often the prior dynamics), evaluates their likelihood given observations via importance weights, and resamples particles with replacement proportional to weights to mitigate degeneracy, where particle diversity diminishes over time.
This enables handling of arbitrary noise models and has been pivotal in tracking deformable or occluded objects. A seminal application, the CONDENSATION algorithm, uses factored sampling for efficient tracking in cluttered scenes, achieving near real-time performance on early hardware. In multi-object video tracking, probabilistic filters integrate with data association to resolve ambiguities from multiple detections per frame. The Hungarian algorithm performs this by formulating association as a linear assignment problem, computing the minimum-cost bipartite matching between predicted track states and detections based on a cost matrix (e.g., derived from Mahalanobis distances under Gaussian assumptions). This optimal one-to-one assignment ensures unique track-to-detection links, enhancing filter stability in crowded scenes, and underpins efficient baselines like SORT, which achieves tracking updates at over 200 FPS on pedestrian datasets.
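The sketch below implements the constant-velocity Kalman prediction and update equations given above in NumPy and pairs them with Hungarian association via SciPy's linear_sum_assignment; the noise covariances and example coordinates are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

dt = 1.0  # one frame interval
F = np.array([[1, 0, dt, 0],   # constant-velocity transition: x_k = F x_{k-1}
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # position-only observation: z_k = H x_k
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2           # process noise covariance (assumed)
R = np.eye(2) * 1.0            # measurement noise covariance (assumed)

def predict(x, P):
    """Prediction phase: advance the state and grow its covariance."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correction phase: blend prediction and measurement via the gain."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Associating predicted tracks with detections: build a distance cost
# matrix and solve the assignment with the Hungarian method.
preds = np.array([[10.0, 10.0], [50.0, 50.0]])
dets = np.array([[51.0, 49.0], [11.0, 9.0]])
cost = np.linalg.norm(preds[:, None, :] - dets[None, :, :], axis=2)
rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
```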

Learning-Based Techniques

Learning-based techniques in video tracking leverage neural networks and machine learning algorithms to automatically extract features and make tracking decisions from large datasets, shifting from hand-crafted representations to data-driven models that adapt to complex visual patterns. These methods, prominent since the mid-2010s, excel in handling variations in object appearance and motion by learning similarity metrics or policies directly from annotated video sequences, often outperforming traditional approaches on benchmarks like OTB and VOT. Siamese networks represent a foundational class of learning-based trackers, employing twin convolutional neural networks (CNNs) to learn similarity between a target template and search regions in subsequent frames. In the SiamFC architecture, the network processes both the exemplar image (initial target) and candidate image through identical branches, followed by a cross-correlation layer to compute a response map indicating matching scores, enabling efficient end-to-end training on datasets like ILSVRC15 for real-time performance. This fully convolutional design avoids explicit bounding box regression, focusing instead on dense similarity computation, which has influenced subsequent variants like SiamRPN that incorporate region proposal networks for improved localization. Transformer-based trackers build on attention mechanisms to capture long-range dependencies in video sequences, surpassing CNN limitations in modeling global context. The TransT method, for instance, uses a Siamese-like backbone for feature extraction and integrates query-key-value attention modules to fuse template and search features, allowing dynamic weighting of relevant spatial and temporal cues for robust tracking under occlusions. By treating tracking as a set prediction task, these models achieve state-of-the-art results on datasets such as LaSOT, with attention layers enabling better handling of scale changes and deformations compared to purely convolutional approaches. Reinforcement learning (RL) variants introduce adaptive decision-making in tracking by framing the problem as a Markov decision process, where an agent learns to select actions like template updates or search strategies to maximize long-term tracking rewards. A seminal end-to-end RL approach uses deep Q-networks to predict bounding box adjustments frame-by-frame, trained on video sequences to balance exploration of motion hypotheses against exploitation of observed trajectories. These methods are particularly effective for multi-object scenarios, where policy optimization handles uncertainties like target switches, as demonstrated in formulations that propagate tracklets via learned value functions. Recent advancements include OC-SORT, which enhances SORT with better occlusion handling via observation-centric updates, and ByteTrack, a simple yet effective multi-object tracking method that associates every detection box rather than only high-score ones, achieving high performance on benchmarks like MOT20 as of 2025. In the 2020s, hybrid models incorporating diffusion processes have emerged to address generative challenges such as occlusion handling, blending probabilistic denoising with tracking objectives to reconstruct plausible object trajectories from noisy observations. DiffusionTrack, for example, employs a noise-to-tracking paradigm where diffusion models iteratively refine multi-object associations, improving consistency in crowded scenes by generating intermediate states that mitigate drift.
Such integrations leverage the generative power of diffusion for uncertainty modeling, yielding superior performance on benchmarks like MOT17 by simulating diverse recoveries without explicit probabilistic filters.
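To show the core SiamFC-style operation in isolation, the following PyTorch sketch cross-correlates an exemplar feature map against a search feature map, with the exemplar acting as the convolution kernel; the random tensors stand in for CNN embeddings, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def response_map(exemplar_feat, search_feat):
    """Cross-correlate an exemplar feature map with a search feature map,
    SiamFC-style: the exemplar serves as the convolution kernel."""
    # exemplar_feat: (C, h, w); search_feat: (C, H, W) with H >= h, W >= w
    return F.conv2d(search_feat.unsqueeze(0),     # (1, C, H, W)
                    exemplar_feat.unsqueeze(0))   # (1, C, h, w) kernel

# Toy features standing in for learned embeddings of target and search region.
exemplar = torch.randn(64, 6, 6)
search = torch.randn(64, 22, 22)
score = response_map(exemplar, search)            # (1, 1, 17, 17) score map
peak = torch.nonzero(score[0, 0] == score.max())  # predicted target offset
```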

Applications

Surveillance and Security

Video tracking is integral to modern surveillance and security systems, enabling automated analysis of video feeds from closed-circuit television (CCTV) cameras to detect and respond to potential threats in real time. By continuously monitoring moving objects such as people or vehicles, these systems enhance situational awareness in environments ranging from urban public spaces to critical infrastructure, reducing reliance on human operators and improving response times to incidents. In CCTV applications, video tracking supports anomaly detection by modeling normal scene behaviors and flagging deviations, such as loitering, sudden falls, or unattended objects, which could indicate risks. Techniques involve segmenting video into foreground objects and analyzing their trajectories against learned baselines, often integrated into urban monitoring setups to prioritize alerts for suspicious activities. For instance, methods extract crowd motion features to identify outliers in feeds, aiding in the prevention of dangerous incidents in high-traffic areas. Crowd behavior analysis leverages video tracking to assess density, flow direction, and velocity in public gatherings, crucial for managing events and detecting emerging threats like stampedes or coordinated disruptions. Algorithms track multiple individuals simultaneously, estimating metrics such as crowd density and inter-person separation to predict and alert on abnormal patterns, thereby supporting proactive measures in stadiums, transportation hubs, and city centers. Perimeter intrusion alerts employ video tracking to delineate secure boundaries and monitor crossings, using motion analysis to distinguish authorized from unauthorized entries. Systems detect and follow intruders across camera views, triggering notifications upon breach detection, which is particularly effective for protecting facilities like airports or warehouses where rapid response is essential. Tailored techniques such as multi-object tracking with re-identification enable robust person tracking by maintaining target identities across occlusions or camera handoffs, essential for tracing lost subjects in complex scenes. Learning-based methods, including deep neural networks for appearance modeling, enhance re-identification accuracy in MOT pipelines. In smart cities, video tracking integrates into networked infrastructures to provide comprehensive coverage, as demonstrated in large-scale datasets that facilitate behavior analysis for public safety. Real-time alerts are further enabled by edge computing, where processing occurs locally on devices near cameras to minimize latency and bandwidth use, allowing immediate notifications for threats without cloud dependency. As of 2025, AI-powered analytics have further improved proactive security by forecasting potential threats based on trajectory patterns. These applications benefit from benchmarks like MOTChallenge, which provide standardized datasets for training trackers on surveillance scenarios, leading to reduced false positives through pattern learning.

Autonomous Systems and Robotics

Video tracking plays a pivotal role in autonomous systems and robotics by enabling perception of dynamic environments, allowing robots and vehicles to navigate, interact, and adapt to surroundings. In these applications, video tracking processes sequences of images from onboard cameras to detect, localize, and predict the motion of objects relative to the moving platform, facilitating safe and efficient operation in unstructured settings. This capability is essential for tasks requiring spatial awareness, such as avoiding collisions or coordinating with humans, and integrates with other sensors to enhance robustness. One key application is obstacle avoidance in drones and unmanned aerial vehicles (UAVs), where video tracking identifies and monitors potential hazards in the flight path to enable evasive maneuvers. For instance, vision-based systems use object detection and tracking to estimate distances and trajectories of obstacles like trees or other aircraft, allowing UAVs to maintain safe clearance without additional depth sensors in resource-constrained setups. Another application involves integration with simultaneous localization and mapping (SLAM), where video tracking segments moving objects from static scenes to refine map updates and robot pose estimation in dynamic environments. This helps robots build accurate environmental models while ignoring ego-motion-induced artifacts, improving localization accuracy in cluttered spaces. Human-robot collaboration also relies on video tracking to monitor operator movements, enabling collaborative tasks like assembly or search-and-rescue by predicting intent and adjusting robot actions accordingly. Techniques for tracking in these systems often fuse video from cameras with LiDAR data to estimate depth and handle occlusions, providing a more complete representation of the scene for multi-object tracking. Stereo vision extracts disparity maps from paired camera feeds to compute 3D positions, while LiDAR fusion adds sparse but precise point clouds, enabling robust tracking of objects in varying lighting or weather. Transformer-based architectures, for example, align multi-modal features from cameras and LiDAR in a shared representation, achieving high tracking precision in autonomous driving scenarios by associating detections across frames. Milestones in this domain include the DARPA Grand Challenge series in the 2000s, which advanced real-time video-based perception for off-road autonomous vehicles, with winning systems like Stanley employing machine learning-aided vision for terrain and obstacle detection to complete desert courses at speeds up to 20 mph. In the 2010s and 2020s, self-driving cars have adopted end-to-end approaches that incorporate video tracking directly into the perception pipeline, mapping raw camera inputs to vehicle controls while handling multi-object dynamics for urban navigation. As of 2025, edge AI advancements have enabled more efficient real-time processing for UAV obstacle avoidance in complex environments. A primary challenge addressed is ego-motion compensation in moving platforms, where the robot's own translation and rotation distort observed object trajectories in video sequences. Algorithms compensate by estimating platform motion—often via visual odometry or inertial data—and subtracting it from tracked trajectories, preserving accurate relative motion cues for downstream planning.
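A minimal sketch of one common ego-motion compensation strategy, assuming float32 background point correspondences between consecutive frames are already available (e.g., from sparse optical flow); the RANSAC threshold is an illustrative choice.

```python
import cv2

def compensate_ego_motion(points_prev, points_curr):
    """points_prev / points_curr: float32 arrays of shape (N, 2) holding
    background correspondences between consecutive frames."""
    # Robustly estimate the platform-induced image motion as a homography.
    H, _ = cv2.findHomography(points_prev, points_curr, cv2.RANSAC, 3.0)
    warped = cv2.perspectiveTransform(points_prev.reshape(-1, 1, 2), H)
    # Residual motion after removing ego-motion: what remains approximates
    # the objects' own displacement in the scene.
    residual = points_curr.reshape(-1, 1, 2) - warped
    return H, residual.reshape(-1, 2)
```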

Medical and Biological Imaging

Video tracking plays a crucial role in medical and biological imaging by enabling the analysis of dynamic processes such as cellular movements and physiological motions, facilitating non-invasive diagnostics and research. In microscopy applications, it supports the study of cell migration, where automated tracking quantifies cell trajectories in time-lapse videos to assess biological behaviors like wound healing or cancer metastasis. For instance, markerless tracking techniques, which rely on image-based detection without physical markers, allow for non-invasive monitoring of live cells, improving reliability in high-density environments. Deformable models, such as active contours or mesh-based representations, are particularly effective for tracking soft tissues that undergo non-rigid deformations during imaging, adapting to shape changes in real-time sequences. These methods often integrate probabilistic frameworks to handle occlusions and variations in tissue appearance, enhancing robustness in clinical settings. In gait analysis for rehabilitation, video tracking provides quantitative insights into human locomotion, aiding in the assessment of neurological disorders or post-injury recovery. Markerless approaches using deep learning, such as pose estimation networks, capture full-body kinematics from standard video feeds, offering a cost-effective alternative to traditional marker-based systems with comparable accuracy in spatiotemporal parameters like step length and cadence. Systematic reviews highlight their validity in clinical neurorehabilitation, where they enable remote monitoring and personalized therapy adjustments. For surgical tool monitoring, especially in minimally invasive surgery, video tracking ensures precise navigation by detecting and following instruments in real-time, reducing procedural errors and improving workflow efficiency. Specific examples illustrate these applications' impact. In fertility research, video tracking evaluates sperm motility by analyzing trajectories in microscopic videos, providing metrics such as curvilinear velocity and linearity to predict fertilization potential with high consistency using machine learning models. Datasets like VISEM-Tracking support advanced training for such assessments, enabling automated analysis of motility patterns. In the 2020s, AI-driven video tracking has advanced real-time surgical navigation, with systems overlaying tracked tool positions on live feeds to guide procedures like polyp detection, as demonstrated in narrative reviews of intelligent modules that boost detection rates without increasing clinician workload. As of 2025, generative AI integrations have enhanced analysis by simulating trajectories for better predictive modeling in drug studies. The benefits of these techniques include deriving quantitative metrics like trajectory speed and path efficiency, which offer objective measures for diagnostic accuracy and treatment efficacy. For cell migration, tracking yields parameters such as directional persistence to quantify migration modes, informing drug efficacy studies. In gait analysis, path efficiency metrics correlate with balance improvements in rehabilitation, while in surgical contexts, speed tracking optimizes tool handling times, ultimately enhancing patient outcomes through data-driven insights.

Challenges and Limitations

Environmental and Appearance Variations

Video tracking systems frequently encounter reliability issues due to environmental factors that alter object appearance or visibility, such as changes in illumination, viewpoint, or scene dynamics. These variations can degrade feature matching and localization, leading to tracking drift or failure unless mitigated by robust preprocessing or adaptive models. Illumination changes, including sudden shifts from artificial to natural light or moving shadows, distort color-based features and challenge intensity-invariant representations. Adaptive histograms address this by dynamically updating the target's color histogram to account for gradual lighting variations, maintaining consistency in appearance models during tracking. Color normalization techniques, such as histogram equalization, further enhance invariance by remapping intensities to a standard range, reducing the impact of global or local illumination gradients on matching. Scale and viewpoint variations arise when objects move closer or farther from the camera or rotate, causing size discrepancies and projective distortions that misalign templates with search regions. Pyramid representations construct multi-resolution feature maps, enabling trackers to search across scales efficiently and select the optimal match without exhaustive computation. Homography estimation compensates for viewpoint changes by computing a projective transformation between frames, particularly effective for planar objects, allowing accurate warping of the target model to fit altered projections. Occlusion occurs when the target is partially or fully obscured by foreground elements, interrupting direct observation and risking identity switches in multi-object scenarios. Handling temporary loss via motion extrapolation uses prior estimates from linear or non-linear dynamic models to forecast the occluded object's position, bridging gaps until reappearance without relying on visible cues. Clutter and background noise, such as dynamic textures or similarly colored distractors, confuse foreground segmentation and elevate false positives in candidate selection. Saliency maps differentiate the target via attention-weighted maps that highlight distinctive regions based on appearance and motion cues, suppressing irrelevant background elements to refine bounding box predictions. Learning-based techniques can enhance robustness to these variations through end-to-end training on diverse datasets, though they require careful integration with classical methods for real-time applicability.
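As one concrete preprocessing step, the sketch below applies contrast-limited adaptive histogram equalization (CLAHE) to the luminance channel with OpenCV to reduce sensitivity to illumination changes before feature matching; the clip limit and tile size are illustrative assumptions.

```python
import cv2

def normalize_illumination(frame_bgr):
    """Equalize the luminance (Y) channel so that global illumination
    shifts perturb intensity-based features less before matching."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])  # contrast-limited equalization
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```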

Computational and Real-Time Constraints

Video tracking systems, especially those employing deep learning techniques, place substantial demands on computational resources due to the intensive computations required for feature extraction and data association across frames. Deep neural networks, such as those used in appearance-based re-identification modules, benefit significantly from GPU acceleration, which enables parallelization of the matrix operations essential for convolutional layers and similarity computations, achieving speeds up to several times faster than CPU-only implementations. In contrast, lightweight deterministic filters, like Kalman-based trackers, can operate effectively on CPUs with minimal overhead, making them suitable for scenarios where GPU availability is limited. Real-time video tracking mandates low-latency processing to synchronize with incoming video streams, typically requiring sub-33 ms per frame for standard 30 fps rates to avoid perceptible delays in applications like autonomous driving or surveillance. This constraint often forces trade-offs: more accurate models sacrifice speed, often running below real-time thresholds on standard hardware while optimized versions can exceed 30 fps on high-end GPUs, whereas simpler probabilistic methods, such as those using particle filters, can meet real-time thresholds but at reduced precision. For instance, benchmarks on multi-object trackers show that achieving over 30 fps necessitates optimized pipelines, with latency directly impacting tracking continuity in dynamic scenes. To address these constraints, optimization strategies like model pruning and quantization have gained prominence in AI-driven trackers during the 2020s, reducing parameter counts and numerical precision to lower memory usage and inference time without substantial accuracy loss. Pruning eliminates redundant weights, potentially cutting computation by up to 35% in video analysis tasks, while quantization converts floating-point operations to integers, enabling deployment on edge devices with speedups of up to 4x on specialized hardware. These techniques are particularly vital for maintaining performance in resource-constrained settings. Scalability challenges arise prominently in multi-object tracking compared to single-target approaches, as the former involves associating detections across numerous trajectories, exponentially increasing computational cost in crowded or high-frame-rate videos. On mobile devices with limited CPU/GPU capabilities, multi-object systems often drop below 15 fps when handling dozens of targets, whereas single-target trackers remain viable at over 50 fps, highlighting the need for lightweight or approximated methods in portable applications. Probabilistic filtering methods can mitigate this through efficient state estimation, though detailed efficiencies are explored elsewhere.
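The following PyTorch sketch applies post-training dynamic quantization to a toy embedding head standing in for a re-identification model; the layer sizes are illustrative assumptions, and a real tracker would quantize its actual backbone.

```python
import torch
import torch.nn as nn

# A toy appearance-embedding head standing in for a re-identification model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

# Post-training dynamic quantization: weights stored as int8, activations
# quantized on the fly, trading a little accuracy for speed and memory.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    emb = quantized(x)   # same interface, smaller and faster on CPU
```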

Evaluation Methods

Performance Metrics

Performance in video tracking is evaluated through quantitative metrics that measure the accuracy of bounding box predictions, localization precision, and overall reliability across video sequences. These metrics enable systematic comparisons of tracking algorithms, focusing on how well predicted object trajectories align with ground-truth annotations. For single-object tracking, core metrics include precision, recall, and success rate, which are commonly derived using the Intersection over Union (IoU). The IoU quantifies the overlap between the predicted and ground-truth bounding boxes and is defined as: \text{IoU} = \frac{\text{area}(\text{overlap})}{\text{area}(\text{union})}. Precision represents the fraction of tracked frames where the prediction is correct based on a location error threshold, while recall assesses the proportion of ground-truth frames successfully tracked without misses. The success rate measures the percentage of frames maintaining an IoU above a specified threshold (typically 0.5), often summarized as the area under the success plot curve for comprehensive evaluation. The Center Location Error (CLE) provides a direct measure of localization accuracy by computing the Euclidean distance between the predicted object center and the ground-truth center in pixels; smaller values indicate better performance, with thresholds like 20 pixels used to assess precision in benchmarks. Robustness is analyzed through attribute-specific plots, such as those for occlusion, which visualize how success rate or precision varies under challenging conditions like partial or full object obstruction in benchmark evaluations. In multi-object scenarios, the Multiple Object Tracking Accuracy (MOTA) serves as a primary metric, combining errors from missed detections (false negatives), erroneous detections (false positives), and trajectory inconsistencies (ID switches). It is formulated as: \text{MOTA} = 1 - \frac{\sum_t (\text{FN}_t + \text{FP}_t + \text{IDSW}_t)}{\sum_t \text{GT}_t} where the sums are over all frames t, \text{FN}_t and \text{FP}_t are the false negatives and positives at frame t, \text{IDSW}_t counts identity switches, and \text{GT}_t is the number of ground-truth objects; MOTA values range from -∞ to 1, with higher scores reflecting superior overall accuracy.
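The sketch below computes the IoU and MOTA formulas above in plain Python; boxes use (x1, y1, x2, y2) corners, and the per-frame counts are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mota(fn, fp, idsw, gt):
    """MOTA from per-frame counts of false negatives, false positives,
    identity switches, and ground-truth objects (one entry per frame)."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
print(mota([1, 0], [0, 1], [0, 0], [5, 5]))  # 1 - 2/10 = 0.8
```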

Benchmark Datasets and Standards

Benchmark datasets play a crucial role in evaluating and comparing video tracking algorithms by providing standardized sequences with ground-truth annotations. For single-object tracking, the Object Tracking Benchmark (OTB-100), introduced in 2015, consists of 100 video sequences totaling over 59,000 frames, annotated with bounding boxes to assess performance across 11 specific challenges, including illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane and out-of-plane rotation, out-of-view scenarios, background clutter, and low resolution. These challenges simulate real-world conditions to test tracker robustness. Complementing OTB-100, the Visual Object Tracking (VOT) challenge, held annually since 2013, focuses on short-term single-object tracking with datasets featuring dozens of sequences per edition, emphasizing no re-detection after failure and evaluating on diverse scenarios like rigid and non-rigid objects. As of 2025, the VOT challenge continues with the VOTS2025 edition, incorporating segmentation-integrated tracking. For multi-object tracking, particularly in pedestrian scenarios, the Multiple Object Tracking (MOT) benchmark includes MOT17, released in 2017, which comprises 14 sequences (7 for training and 7 for testing) captured from mobile platforms with varying densities and occlusions, providing detection annotations to facilitate end-to-end evaluation. An extension, MOT20 from 2020, addresses crowded scenes with 8 sequences (4 training and 4 testing) totaling over 13,000 frames, featuring high occlusion rates and up to 246 pedestrians per frame to challenge scalability in dense environments. Long-term tracking benchmarks address extended sequences where objects may disappear and reappear. The Large-scale Single Object Tracking (LaSOT) dataset, introduced in 2019, includes 1,400 videos across 70 object classes, averaging over 2,500 frames per sequence (more than 3.5 million frames total), designed to test sustained performance over long durations with category-specific challenges. Evaluation standards ensure fair comparisons through defined protocols. In the OTB framework, the One-Pass Evaluation (OPE) protocol initializes the tracker with the ground-truth bounding box from the first frame and measures performance across the entire sequence without resets, while the Temporal Robustness Evaluation (TRE) assesses robustness by initializing at multiple starting frames to evaluate handling of temporal variations. VOT employs a no-reset protocol with accuracy-robustness rankings, and MOT benchmarks use detection-based pipelines with sequence-specific splits. In the 2020s, benchmarks like MOT20 and ongoing VOT editions have supported deep learning methods by providing large-scale, annotated data suitable for training neural network-based trackers, including segmentation and real-time variants to reflect advancements in AI-driven methods.
| Dataset | Type | Key Features | Year | Source |
| --- | --- | --- | --- | --- |
| OTB-100 | Single-object | 100 sequences, 11 challenges, ~59K frames | 2015 | IEEE TPAMI |
| VOT (annual) | Single-object, short-term | Dozens of sequences per challenge, no re-detection | 2013–present | VOTChallenge.net |
| MOT17 | Multi-object, pedestrians | 14 sequences, detection annotations | 2017 | MOTChallenge.net |
| MOT20 | Multi-object, crowded | 8 sequences, high density/occlusion | 2020 | arXiv:2003.09003 |
| LaSOT | Single-object, long-term | 1,400 sequences, 70 classes, >3.5M frames | 2019 | arXiv:1809.07845 |
