Activity recognition
Activity recognition, commonly referred to as human activity recognition (HAR), is the automatic detection, identification, and classification of human activities—such as walking, sitting, or more complex actions like playing sports—using data from sensors, cameras, or other sources to interpret sequential behaviors in indoor or outdoor environments.[1] This interdisciplinary field draws from computer science, machine learning, and signal processing to enable systems that understand human actions in real time or from recorded data, often distinguishing between static postures (e.g., standing) and dynamic movements (e.g., running).[2]
HAR systems typically involve data acquisition from wearable devices like accelerometers and gyroscopes in smartphones or smartwatches, body sensors, or vision-based inputs such as CCTV footage and depth cameras like Kinect.[1] Key methods in activity recognition rely on machine learning and deep learning techniques to process and analyze this data, with supervised classification being predominant; common algorithms include convolutional neural networks (CNNs) for spatial feature extraction from images or signals, recurrent neural networks (RNNs) and long short-term memory (LSTM) models for capturing temporal sequences, and support vector machines (SVMs) for simpler pattern recognition.[3] Recent advancements emphasize multi-modal data fusion, combining sensor and visual inputs to improve accuracy, alongside transformers for handling long-range dependencies in activity sequences.[2] Feature extraction steps often precede model training, involving preprocessing to handle noise, segmentation of activity windows, and selection of relevant attributes like signal magnitude or frequency-domain characteristics.[1]
Applications of activity recognition span healthcare for elderly monitoring and fall detection, surveillance for identifying suspicious behaviors, smart homes for energy-efficient automation, sports analytics for performance tracking, and emerging areas like human-robot interaction and virtual reality.[3] Despite these benefits, challenges persist, including data scarcity and variability due to diverse environments, high computational costs for real-time processing, privacy concerns with visual data, and difficulties in recognizing overlapping or complex group activities.[2] Ongoing research focuses on addressing these issues through transfer learning, unsupervised methods, and more robust datasets like UCI-HAR or WISDM to enhance generalizability across users and contexts.[1]
Fundamentals
Definition and Scope
Activity recognition, also known as human activity recognition (HAR), refers to the automatic identification and classification of human physical activities, such as walking, running, or sitting, from sensor data or observational inputs, often performed in real time.[4] This process involves analyzing signals from various sources to detect patterns corresponding to specific movements or behaviors.[5] The scope of activity recognition encompasses human-centric activities across diverse contexts, including daily living, sports performance, and industrial tasks, with applications in areas like health monitoring using sensors such as accelerometers or cameras. It differs from gesture recognition, which targets short-duration motions like hand signals, and from video-based action recognition, which primarily focuses on sequential patterns in visual data rather than broader behavioral inference.[6]
Key concepts in activity recognition include varying levels of granularity, ranging from low-level atomic actions, such as chopping vegetables, to high-level composite activities like cooking, which may involve multiple concurrent or overlapping actions.[4] The field is inherently interdisciplinary, drawing from artificial intelligence for pattern classification, signal processing for data handling, and human-computer interaction to enable intuitive system responses. This technology is important for advancing context-aware computing, enhancing human-machine interfaces through adaptive responses, and providing data-driven insights for behavioral analysis in fields like healthcare and assistive technologies.[5]
Historical Development
The field of activity recognition traces its roots to the 1990s, emerging from advancements in pattern recognition, artificial intelligence, and early wearable computing initiatives. Initial efforts focused on gait analysis using rudimentary sensors, such as accelerometers and gyroscopes, to detect basic locomotion patterns in controlled environments. These pioneering works emphasized rule-based methods to interpret sensor signals, marking the transition from theoretical AI concepts to practical sensor-driven applications.[7]
In the 2000s, activity recognition expanded significantly with the adoption of machine learning techniques, driven by improved sensor affordability and the proliferation of mobile devices. Seminal research by Bao and Intille in 2004 demonstrated the feasibility of recognizing 20 physical activities using multiple body-worn accelerometers, achieving 84% accuracy with decision tree classifiers and highlighting the importance of feature extraction from time-series data.[8] This period also saw the influence of smartphone accelerometers, with studies from 2005 to 2010, such as Kwapisz et al.'s 2010 work, leveraging built-in sensors in cell phones for real-world activity monitoring, including walking, jogging, and sitting, with accuracies around 90% using machine learning classifiers such as multilayer perceptrons.[9] These developments shifted focus from specialized wearables to ubiquitous computing, enabling broader applications in health monitoring.
The 2010s brought breakthroughs through the integration of deep learning, particularly convolutional neural networks (CNNs), which revolutionized video-based activity recognition. The 2012 ImageNet challenge victory by AlexNet demonstrated CNNs' prowess in image classification, inspiring adaptations for temporal data in videos, leading to models like two-stream CNNs for action recognition with accuracies exceeding 88% on benchmarks such as UCF101. In sensor-based contexts, deep learning surpassed traditional methods; for example, Ordóñez and Roggen's 2016 LSTM-based approach achieved 92% accuracy on wearable IMU data for daily activities. This era marked a pivot from handcrafted features to end-to-end learning, accelerating adoption in vision systems and hybrid setups.[10]
Advancements in the 2020s have emphasized multimodal fusion, edge computing for real-time processing, and generalization across users and environments, addressing limitations in prior single-modality approaches. Post-2020 surveys on IMU-based human activity recognition underscore deep learning's dominance, with hybrid models incorporating transfer learning for robustness. Recent works highlight privacy-preserving techniques like federated learning, enabling collaborative model training without sharing raw data, as seen in 2025 frameworks achieving 95% accuracy in distributed wearable systems.[3][11] These shifts reflect a move toward scalable, ethical systems integrating wearables, vision, and ambient sensors for applications demanding low latency and data security.
Classification
Single-User Activity Recognition
Single-user activity recognition focuses on isolating and classifying the activities performed by an individual, typically leveraging body-worn or proximal sensors to capture personal motion and physiological data. This approach aims to detect and interpret a person's actions in isolation, without considering interactions with others, making it suitable for personal health monitoring and daily routine analysis.[5]
Key challenges in single-user activity recognition include intra-user variability, where the same activity—such as walking—may exhibit differences in execution due to factors like fatigue, mood, or physical condition, leading to inconsistent sensor signals across sessions for the same individual. Additionally, sensor placement significantly impacts accuracy; for instance, accelerometers positioned on the wrist versus the hip can yield varying data quality and recognition rates, often requiring user-specific calibration to mitigate errors.[5][12]
Common setups for single-user activity recognition utilize smartphones or smartwatches equipped with built-in inertial sensors, such as accelerometers and gyroscopes, to monitor daily activities like distinguishing sitting from standing based on posture and movement patterns. These devices enable unobtrusive, real-time detection in everyday environments, often processing data on-device to ensure privacy and low latency. Such detection frequently relies on accelerometers, whose acceleration profiles differentiate static from dynamic states.[5][12]
Activity recognition operates across varying granularity levels, from atomic actions—such as hand gestures or stepping—to composite activities—like brushing teeth—which combine multiple atomic elements over time. Hierarchical models address this progression by first identifying low-level atomic actions and then aggregating them into higher-level composite activities, improving overall recognition accuracy through layered inference.[5][13]
Practical examples include fitness tracking applications on wearables that recognize exercise types, such as running versus cycling, by analyzing motion intensity and duration to estimate caloric expenditure and provide personalized feedback. In rehabilitation, single-user systems monitor patient movements, like gait during recovery from injury, using wrist-worn sensors to track progress and alert therapists to deviations in activity patterns. This contrasts with multi-user scenarios involving social interactions, where activity inference must account for group dynamics.[5][12]
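The static-versus-dynamic distinction and the atomic-to-composite aggregation described above can be illustrated with a minimal Python sketch; the variance threshold, atomic action names, and aggregation rule are hypothetical choices for illustration, not values taken from the cited systems.

```python
import numpy as np

def is_dynamic(window, std_threshold=0.5):
    """Label a window of tri-axial accelerometer samples (shape [n, 3], in g)
    as dynamic (e.g., walking) or static (e.g., sitting/standing) from the
    variability of the acceleration magnitude. Threshold is illustrative."""
    magnitude = np.linalg.norm(window, axis=1)   # per-sample magnitude
    return magnitude.std() > std_threshold       # high variability suggests movement

def composite_activity(atomic_labels):
    """Toy hierarchical aggregation: map a sequence of atomic actions to a
    composite activity. The rule set is a hypothetical example."""
    if {"grasp_brush", "raise_arm", "oscillate_wrist"} <= set(atomic_labels):
        return "brushing_teeth"
    return "unknown"

# Example: a 2 s window at 50 Hz of near-constant gravity readings is static
static_window = np.tile([0.0, 0.0, 1.0], (100, 1)) + np.random.normal(0, 0.02, (100, 3))
print(is_dynamic(static_window))                 # expected: False
print(composite_activity(["grasp_brush", "raise_arm", "oscillate_wrist"]))
```

In practice, such thresholds are tuned per device and placement, or replaced by learned classifiers as discussed in the methods sections below.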
Multi-User Activity Recognition
Multi-user activity recognition involves identifying joint activities performed by two or more individuals, such as handshakes or dancing duos, through the analysis of synchronized data streams from sensors like wearables, cameras, or ambient devices. This process captures collaborative or parallel actions where participants' behaviors are interdependent, distinguishing it from isolated individual monitoring.[14][15]
A core focus is modeling inter-user dependencies, including spatial relations (e.g., proximity and relative positions) and temporal synchronization (e.g., coordinated movement onset). Techniques often employ pairwise modeling, such as graph-based representations that treat users as nodes and interactions as edges to encode relational dynamics. For example, skeleton data from depth cameras can be clustered into postures and classified using support vector machines to recognize two-person interactions. Scalability to small groups (2–5 users) leverages deep learning models like bidirectional gated recurrent units to process multi-stream inputs, achieving accuracies above 85% in controlled settings. Challenges include occlusions in vision-based sensing, where one user's pose obscures another's, and data association issues in noisy environments that complicate linking events to specific individuals.[14][16]
In practice, multi-user recognition enables applications like social interaction detection in elder care, where ambient sensors in smart homes differentiate collaborative tasks (e.g., assisting with daily routines) from potential conflicts (e.g., arguing), improving monitoring without invasive tracking. Another example is collaborative sports, where WiFi-based systems enable multi-user activity recognition with localization errors below 0.5 meters and accuracies over 90% for up to three participants. Unlike single-user recognition, which isolates actions via individual sensor fusion, multi-user methods emphasize relational modeling to infer joint intent from collective signals. This approach often relies on camera feeds for spatial detail but extends to wireless modalities for non-line-of-sight robustness.[15][17][18]
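As a minimal sketch of the pairwise spatial and temporal cues mentioned above, the following Python snippet computes the mean distance between two users' position tracks and the correlation of their acceleration magnitudes as candidate interaction features; the feature set, data shapes, and example values are illustrative assumptions rather than a method from the cited works.

```python
import numpy as np

def pairwise_interaction_features(pos_a, pos_b, acc_a, acc_b):
    """Simple pairwise features for two users: mean spatial proximity from
    position tracks (shape [n, 2], metres) and temporal synchronization as the
    Pearson correlation of their acceleration magnitudes (shape [n, 3])."""
    proximity = np.linalg.norm(pos_a - pos_b, axis=1).mean()
    mag_a = np.linalg.norm(acc_a, axis=1)
    mag_b = np.linalg.norm(acc_b, axis=1)
    synchrony = np.corrcoef(mag_a, mag_b)[0, 1]
    return np.array([proximity, synchrony])

# Two users walking side by side: close together, highly correlated motion
t = np.linspace(0, 4, 200)
pos_a = np.column_stack([t, np.zeros_like(t)])
pos_b = pos_a + [0.0, 0.8]                       # about 0.8 m apart
acc = np.column_stack([np.sin(2 * np.pi * 2 * t)] * 3)
print(pairwise_interaction_features(pos_a, pos_b, acc, acc + 0.01))
```

Such pairwise features could feed a conventional classifier or define edge weights in a graph-based model of the kind described above.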
Group Activity Recognition
Group activity recognition classifies the collective behaviors of multiple individuals, such as those in team sports or public protests, by integrating individual actions into overarching patterns that reflect group-level dynamics and interactions.[19] This process emphasizes hierarchical structures, where spatiotemporal features from video or sensor data reveal emergent group states, distinguishing it from individual or pairwise analyses by prioritizing holistic outcomes over personal identities.[19]
Major challenges in group activity recognition include scalability to handle dense crowds with occlusions and varying group sizes, mitigation of noise from extraneous movements or environmental factors like motion blur, and accurate contextual inference to differentiate subtle variations, such as a cheering crowd from a dancing assembly.[20] These issues arise because group behaviors often emerge from complex interdependencies, requiring robust modeling of both spatial arrangements and temporal evolutions without over-relying on precise individual tracking.[20]
Approaches to group activity recognition generally fall into top-down and bottom-up paradigms, with additional techniques for role assignment to enhance semantic understanding. Top-down methods perform global scene analysis by treating the group as a unified entity, such as modeling configurations of interacting objects as deforming shapes to capture overall dynamics—a foundational technique introduced by Vaswani et al. in 2005.[19] In contrast, bottom-up approaches aggregate features from detected individuals to infer collective activities, exemplified by Amer et al.'s 2012 hierarchical random field model that reasons across scales from personal actions to group contexts.[19] Role assignment further refines these by identifying functional positions within the group, like attackers or defenders in sports, as proposed in Shu et al.'s 2017 framework for joint inference of roles and events in multi-person scenes.[19]
Practical applications include surveillance of public gatherings to identify abnormal collective behaviors, aiding in real-time security monitoring of crowds.[21] Group activity recognition also supports team coordination in domains like manufacturing assembly lines or emergency response operations, where recognizing synchronized group actions improves oversight of collaborative workflows.[19]
Evaluation metrics focus on group-level accuracy to measure the correct classification of collective activities, particularly emphasizing the capture of emergent behaviors that cannot be reduced to sums of individual contributions, such as synchronized team formations.[19] These metrics highlight the importance of handling inter-person dependencies, with successful methods demonstrating substantial improvements in recognizing complex, interaction-driven patterns over baseline individual-focused evaluations.[19]
Sensing Modalities
Inertial and Wearable Sensors
Inertial and wearable sensors play a central role in activity recognition by directly capturing human motion through body-attached devices, enabling the detection of physical activities such as walking, running, or gesturing. These sensors, often integrated into inertial measurement units (IMUs), provide high-fidelity data on body dynamics without relying on external infrastructure.[22]
IMUs typically comprise three primary components: accelerometers, which measure linear acceleration along three axes (x, y, z) to detect changes in velocity and orientation relative to gravity; gyroscopes, which quantify angular velocity to track rotational movements; and magnetometers, which sense the Earth's magnetic field to determine absolute orientation and compensate for drift.[23] This combination allows for comprehensive motion profiling, with accelerometers being the most fundamental for basic activity detection due to their sensitivity to both static (e.g., posture) and dynamic (e.g., locomotion) accelerations.[8]
The data generated by these sensors consists of multivariate time-series signals, typically sampled at rates of 20–100 Hz, producing 3D vectors for each sensor type (e.g., tri-axial acceleration as [a_x, a_y, a_z]).[24] Preprocessing is essential to handle noise from environmental vibrations or sensor imperfections, often involving low-pass or median filtering to remove high-frequency artifacts while preserving signal integrity.[22] Segmentation follows, dividing continuous streams into fixed-length windows (e.g., 2–5 seconds) or event-based segments using thresholds on signal magnitude to isolate activity bouts, facilitating subsequent analysis.[25] These steps ensure robust feature extraction, such as signal magnitude area or frequency-domain metrics, though the raw time-series nature supports direct input to recognition models.[26]
Wearable IMUs are commonly placed on key body parts to optimize capture of relevant motions: wrists or arms for upper-body gestures and daily activities, ankles or thighs for gait analysis, and waists or chests for whole-body locomotion.[27] This strategic placement enhances detection accuracy, as proximal sites like the waist provide stable signals for ambulation, while distal ones like wrists suit gesture-rich tasks.[22]
Key advantages include high portability due to compact, low-power designs (often under 1 gram, with battery lives of 8–24 hours), enabling unobtrusive long-term monitoring, and superior privacy preservation compared to camera-based systems, as they capture only wearer-specific motion without visual exposure.[28] These attributes make them ideal for personal health applications, contrasting with non-contact methods that require fixed installations.[24]
Modern wearables also integrate physiological sensors, such as photoplethysmography (PPG) for heart rate monitoring and electrocardiogram (ECG) sensors, to enrich activity recognition with biometric data.
These physiological channels enable detection of activity intensity or stress levels, for instance by combining acceleration with heart rate variability to distinguish moderate from vigorous exercise, achieving accuracies up to 97% in datasets like MHEALTH as of 2025.[29]
Despite their strengths, inertial sensors face limitations such as gyroscope drift, where cumulative errors in angular measurements lead to orientation inaccuracies over extended periods (e.g., minutes to hours), necessitating periodic recalibration.[30] Battery constraints further restrict continuous use, particularly in multi-sensor setups, while occlusion or loose attachment can degrade signal quality.[31] To mitigate these, sensor fusion techniques integrate IMU outputs with complementary data, such as magnetometer readings for drift correction or barometric pressure for altitude, improving overall accuracy by 10–20% in complex scenarios.[32]
Practical examples include smartphones using built-in IMUs to detect jogging via periodic acceleration peaks exceeding 2g, enabling real-time fitness feedback.[33] Similarly, fitness bands like those employing ADXL-series accelerometers track steps by thresholding vertical oscillations, achieving counts within 5% error for steady walking.[34]
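A minimal Python sketch of the windowing and feature-extraction steps described above follows: it splits a tri-axial accelerometer stream into fixed-length windows and computes per-axis statistics, the signal magnitude area, and a dominant frequency. The window length, sampling rate, and exact feature set are illustrative assumptions rather than values from any cited study.

```python
import numpy as np

def sliding_windows(signal, window_size, step):
    """Split a tri-axial signal (shape [n, 3]) into fixed-length windows."""
    return np.array([signal[i:i + window_size]
                     for i in range(0, len(signal) - window_size + 1, step)])

def extract_features(window, fs=50.0):
    """Handcrafted features for one window: per-axis mean and standard
    deviation, signal magnitude area (SMA), and the dominant frequency of the
    magnitude signal. The exact feature set is an illustrative choice."""
    mean = window.mean(axis=0)
    std = window.std(axis=0)
    sma = np.abs(window).sum() / len(window)              # signal magnitude area
    magnitude = np.linalg.norm(window, axis=1)
    spectrum = np.abs(np.fft.rfft(magnitude - magnitude.mean()))
    freqs = np.fft.rfftfreq(len(magnitude), d=1.0 / fs)
    dominant_freq = freqs[spectrum.argmax()]
    return np.concatenate([mean, std, [sma, dominant_freq]])

# 2.56 s windows at 50 Hz (128 samples) with 50% overlap, a UCI HAR-style setup
acc = np.random.normal(0, 1, (1000, 3))                   # placeholder recording
windows = sliding_windows(acc, window_size=128, step=64)
features = np.array([extract_features(w) for w in windows])
print(features.shape)                                     # (n_windows, 8)
```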
Vision-Based Sensing
Vision-based sensing utilizes cameras to capture and analyze visual cues from human movements, enabling non-intrusive activity recognition across single or multiple subjects without requiring physical contact. This modality leverages RGB cameras to extract color, texture, and appearance features, providing foundational data for motion analysis in unconstrained environments. Depth sensors, such as the Microsoft Kinect introduced in 2010, complement RGB data by generating 3D depth maps through structured light or time-of-flight technology, which mitigate issues like viewpoint variations and enhance spatial understanding of activities.[35] These technologies support a range of applications by processing video feeds to detect poses and trajectories, often achieving accuracies exceeding 90% on benchmark datasets like NTU RGB+D for daily activities.[36]
Emerging event-based vision sensors, or neuromorphic cameras, capture asynchronous changes in pixel intensity rather than full frames, offering low-latency and low-power alternatives for real-time HAR. These sensors excel in dynamic environments by reducing data redundancy, enabling efficient recognition of fast actions like gestures, with applications in robotics and wearables as of 2025.[29]
Key feature extraction techniques in vision-based systems include optical flow, which quantifies pixel motion across frames to represent dynamic patterns, and pose estimation, which identifies human body keypoints for skeletal representations. Optical flow methods, such as those based on the Lucas-Kanade algorithm, capture temporal changes essential for distinguishing actions like walking from running.[37] Pose estimation frameworks like OpenPose, utilizing part affinity fields, enable real-time 2D multi-person skeleton detection from RGB images, processing up to 25 frames per second on standard hardware.[38] These features allow for granular analysis, from fine-grained actions—such as localizing "pouring" within a longer video sequence using interest point descriptors—to coarser gait recognition, where walking styles are identified from silhouette contours without explicit joint tracking.[36]
The standard processing pipeline for vision-based activity recognition initiates with background subtraction to segment foreground subjects from static scenes, employing models like Gaussian mixture models to handle gradual illumination shifts. Subsequent steps involve tracking, using techniques such as Kalman filters for predicting object trajectories, and action localization, which employs sliding windows or region proposals to isolate activity segments within videos.[37] This sequence ensures efficient handling of temporal data, though it demands computational resources for real-time deployment.
Significant challenges in vision-based sensing arise from environmental factors, including lighting variations that introduce shadows or overexposure, degrading feature reliability, and occlusions where body parts are hidden by objects or other individuals, leading to incomplete motion cues. These issues can reduce recognition accuracy by up to 20–30% in uncontrolled settings, as observed in datasets like Hollywood2 with dynamic backgrounds.[36] Recent advances emphasize refined 2D and 3D pose models, such as graph convolutional networks on skeletons for robust joint estimation, improving invariance to camera angles.
In gait analysis, stride length extracted from video silhouettes serves as a biometric identifier, with seminal work demonstrating person identification at distances up to 50 meters using optical flow-based periodicity. These developments, often enhanced by convolutional neural networks, elevate performance on complex scenarios.
Practical examples include home security cameras employing depth-enabled fall detection, where sudden posture drops trigger alerts with over 95% sensitivity in indoor trials. In sports analytics, vision systems track player movements via pose trajectories to evaluate tactics, such as sprint patterns in soccer, supporting data-driven coaching decisions.[39][40]
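The optical-flow features discussed earlier in this section can be sketched with OpenCV's dense Farnebäck flow, summarized here as a magnitude-weighted histogram of flow directions; the descriptor design, parameter values, and video path are illustrative assumptions rather than a method from the cited works.

```python
import cv2
import numpy as np

def flow_motion_descriptor(prev_frame, next_frame, bins=8):
    """Dense optical flow between two frames (Farnebäck method), summarized as
    a histogram of flow directions weighted by flow magnitude, a simple motion
    descriptor in the spirit of optical-flow features for action recognition."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(angle, bins=bins, range=(0, 2 * np.pi),
                           weights=magnitude)
    return hist / (hist.sum() + 1e-8)            # normalized direction histogram

# Usage sketch: read two consecutive frames from a video file (path is hypothetical)
cap = cv2.VideoCapture("activity_clip.mp4")
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
if ok1 and ok2:
    print(flow_motion_descriptor(frame1, frame2))
cap.release()
```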
Ambient and Wireless Sensing
Ambient and wireless sensing leverages environment-embedded technologies to detect human activities passively, without requiring wearable devices or direct visual input. This approach utilizes signals from existing infrastructure, such as Wi-Fi networks, radar systems, and GPS, to capture perturbations caused by human motion, enabling non-intrusive monitoring in indoor and outdoor settings.[41]
Acoustic sensing, employing ambient microphones, represents another key ambient modality by analyzing sound patterns generated by activities, such as footsteps or object interactions, in a privacy-preserving manner without capturing identifiable audio. Processing involves feature extraction from spectrograms or mel-frequency cepstral coefficients to classify activities, achieving up to 95% accuracy in everyday scenarios as of 2025.[42]
Key sensor types include Wi-Fi Channel State Information (CSI), which measures signal perturbations due to human-induced changes in the wireless channel. CSI provides fine-grained data on amplitude and phase variations as radio frequency signals interact with the body.[43] Millimeter-wave (mmWave) radar sensors detect micro-motions through reflected electromagnetic waves, capturing subtle movements like gestures or vital signs with high precision.[44] GPS, integrated for location-contextual activities, tracks positional changes to infer mobility patterns, such as transitions between environments.[45]
Data processing in these systems focuses on analyzing signal reflections and modulations. For radar, Doppler shifts in the reflected signals reveal velocity and motion patterns, allowing differentiation of activities like walking or sitting.[44] In Wi-Fi CSI, analysis examines amplitude and phase changes to model body movements, often using principal component analysis or filtering to extract activity signatures from multipath effects.[43] GPS processing involves trajectory segmentation and speed estimation to contextualize activities relative to locations.[45]
These methods offer significant advantages, including privacy preservation by avoiding image capture and wall-penetrating capabilities that function through obstacles, making them ideal for smart home deployments.[41] Compared to vision-based sensing, they provide a contactless alternative that maintains user anonymity.[46] However, limitations arise from environmental sensitivity, such as multipath interference in Wi-Fi signals that can distort readings in cluttered spaces, and inherently lower spatial resolution than optical systems for fine-grained pose estimation.[41]
Practical examples demonstrate their utility: commodity Wi-Fi routers have been used to detect room occupancy by monitoring CSI fluctuations from multiple users and to identify falls through sudden amplitude drops indicating posture changes.[47] GPS tracking supports recognition of outdoor activities like hiking by correlating location trajectories with elevation and speed profiles.[45] Additionally, mmWave radar excels in multi-user scenarios, such as monitoring group interactions in shared spaces without individual identification.[44]
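As a rough sketch of the CSI amplitude analysis described above, the snippet below converts a complex CSI matrix to subcarrier amplitudes, smooths them with a moving average, and derives a simple motion indicator from their temporal variance; the data layout, smoothing window, and indicator are illustrative assumptions, not a published algorithm.

```python
import numpy as np

def csi_activity_signature(csi, window=10):
    """Given a complex CSI matrix (shape [n_packets, n_subcarriers]) from a
    WiFi receiver, compute per-subcarrier amplitudes, smooth them over time
    with a moving average to suppress noise, and return a motion indicator
    based on amplitude variance over time."""
    amplitude = np.abs(csi)                        # amplitude of each subcarrier
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="same"), 0, amplitude)
    # Human motion perturbs the channel, raising temporal variance across subcarriers
    return smoothed.std(axis=0).mean()

# Placeholder CSI: 500 packets x 30 subcarriers of complex samples
rng = np.random.default_rng(0)
csi = rng.normal(size=(500, 30)) + 1j * rng.normal(size=(500, 30))
print(csi_activity_signature(csi))
```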
Methods and Algorithms
Rule-Based and Logical Methods
Rule-based and logical methods in activity recognition use deterministic approaches that infer activities from sensor data through predefined rules and logical inference, without probabilistic modeling or learning from data. These methods typically employ if-then rules based on thresholds applied to sensor signals, such as acceleration exceeding 2g to indicate running or a sudden drop below a posture threshold to detect falls.[48] Ontology-based reasoning extends this by representing activities in hierarchical structures, where sensor observations are mapped to concepts like "walking" as a subclass of "locomotion," enabling inference of higher-level activities through semantic relationships.
Key algorithms in this category include finite state machines (FSMs), which model activity sequences as transitions between discrete states triggered by sensor conditions, such as shifting from "standing" to "sitting" upon detecting a decrease in vertical acceleration. Logic programming paradigms, such as Prolog, facilitate relational inference by encoding rules as logical predicates; for instance, a rule might define "preparing meal" if "opening fridge" and "handling utensils" are observed in sequence. These techniques are particularly suited for domain-specific scenarios where activities follow predictable patterns.
A primary advantage of rule-based and logical methods is their interpretability, as the decision logic is explicitly defined and traceable, allowing domain experts to verify and modify rules without needing computational expertise.[48] They also require no training data, enabling rapid deployment in resource-constrained environments like wearable devices. However, these methods are brittle to variations in sensor noise, user physiology, or environmental factors, often failing when conditions deviate from rule assumptions, and they struggle to scale to complex, multifaceted activities involving multiple users or ambiguous contexts.[48]
Representative examples include threshold-based fall detection systems using accelerometers, where a peak acceleration greater than 3g combined with a low vertical velocity post-impact triggers an alert, achieving high specificity in controlled tests. In smart home applications, rule engines process door sensors and motion detectors with logic like "if motion in kitchen and fridge opened, then cooking activity," automating triggers for energy management or assistance.
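A minimal Python sketch of such a threshold-based fall rule is shown below; it substitutes a post-impact stillness check for the low-vertical-velocity condition, a common variant, and all thresholds are illustrative assumptions rather than values from the cited systems.

```python
import numpy as np

# Thresholds are illustrative; real systems tune them per device and placement.
IMPACT_G = 3.0          # peak acceleration magnitude indicating a hard impact
INACTIVITY_G_STD = 0.1  # low post-impact variability suggesting the person is down
POST_WINDOW = 50        # samples (~1 s at 50 Hz) inspected after the impact

def detect_fall(acc):
    """Rule-based fall detector over a tri-axial accelerometer stream (in g):
    fire an alert if a sample exceeds the impact threshold and the following
    second shows near-stillness."""
    magnitude = np.linalg.norm(acc, axis=1)
    for i, value in enumerate(magnitude[:-POST_WINDOW]):
        if value > IMPACT_G and magnitude[i + 1:i + 1 + POST_WINDOW].std() < INACTIVITY_G_STD:
            return True
    return False

# Simulated trace: quiet standing, a 4 g spike, then lying still
acc = np.tile([0.0, 0.0, 1.0], (200, 1))
acc[100] = [0.0, 0.0, 4.0]
print(detect_fall(acc))   # expected: True
```

Real deployments tune these thresholds per device and sensor placement, which reflects the brittleness to variation noted above.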
Probabilistic and Statistical Methods
Probabilistic and statistical methods in activity recognition model the inherent uncertainty in sensor data by representing activities as stochastic processes, enabling the estimation of activity states from noisy or incomplete observations. These approaches draw on probability theory to capture dependencies between observations and hidden states, often outperforming deterministic methods in real-world scenarios where data variability is high.[49]
A foundational model is the Hidden Markov Model (HMM), which treats activities as sequences of hidden states with transition probabilities defining shifts between them, such as from "walking" to "running." Observations from sensors, like accelerometer readings, are modeled as emissions from these states, allowing HMMs to infer the most likely activity sequence via the Viterbi algorithm—a dynamic programming method that maximizes the joint probability of the observation sequence and state path. For instance, in sensor-based human activity recognition, HMMs use transition matrices to encode temporal patterns, with parameters estimated using the Baum-Welch algorithm for unsupervised learning from data.[50][51]
Bayesian networks extend this framework by modeling causal relationships among multiple variables, representing activities as directed acyclic graphs where nodes denote sensor observations or activity states, and edges capture conditional dependencies. Inference in Bayesian networks relies on Bayes' theorem to compute posterior probabilities:
P(\text{Activity} \mid \text{Observations}) = \frac{P(\text{Observations} \mid \text{Activity}) \cdot P(\text{Activity})}{P(\text{Observations})}
This enables the integration of prior knowledge about activity likelihoods with likelihoods from sensor evidence, facilitating multi-sensor fusion by propagating probabilities across the network. For example, dynamic Bayesian networks have been used to fuse accelerometer and gyroscope data for recognizing complex events like "preparing a meal," where conditional probabilities account for interactions between posture and motion.[49][52]
These methods excel in handling noisy sensor data through probabilistic marginalization, providing quantifiable confidence scores for activity predictions, and supporting fusion from heterogeneous sources via joint probability distributions. In multi-sensor setups, such as combining inertial and environmental sensors, Bayesian inference weighs contributions based on conditional independencies, improving robustness to sensor failures or outliers. However, limitations include the Markov assumption in HMMs, which presumes state independence given the previous state and thus struggles with long-range dependencies or concurrent activities, alongside high computational costs for large state spaces requiring exact inference approximations.[50][52][51]
An illustrative application is GPS-based trajectory modeling for travel mode detection, where probabilistic models like HMMs classify modes (e.g., car versus bike) using speed distributions as emission probabilities—cars exhibit higher speeds (up to 50 m/s) than bikes (up to 10 m/s)—with transition probabilities reflecting realistic mode switches. This approach leverages statistical inference to disambiguate ambiguous trajectories, achieving improved accuracy over rule-based thresholds in datasets like GeoLife.[53]
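The Viterbi decoding step described above can be illustrated with a toy two-state HMM for travel-mode inference; the states, transition and emission probabilities, and discretized speed observations below are invented for illustration, not parameters fitted on GeoLife or any cited dataset.

```python
import numpy as np

# Toy 2-state HMM for travel-mode inference; probabilities are assumptions.
states = ["bike", "car"]
start_p = np.array([0.5, 0.5])
trans_p = np.array([[0.8, 0.2],        # mode switches are assumed rare
                    [0.2, 0.8]])
# Emissions: observed speed discretized into "slow" (index 0) and "fast" (index 1)
emit_p = np.array([[0.9, 0.1],         # bike mostly slow
                   [0.2, 0.8]])        # car mostly fast

def viterbi(obs):
    """Most likely state sequence for a list of observation indices,
    using log probabilities for numerical stability."""
    n_states, T = len(states), len(obs)
    log_delta = np.log(start_p) + np.log(emit_p[:, obs[0]])
    backptr = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(trans_p)      # [from_state, to_state]
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([0, 0, 1, 1, 1]))   # expected: ['bike', 'bike', 'car', 'car', 'car']
```

With these assumed parameters, the decoder prefers staying in a mode and only switches when the observed speeds change persistently, which is the behavior the transition matrix encodes.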
Machine Learning and Data Mining Approaches
Machine learning and data mining approaches have become central to activity recognition, enabling the extraction of patterns from sensor data through supervised classification, unsupervised clustering, and pattern discovery techniques. These methods typically rely on handcrafted features derived from raw signals, such as accelerometer or gyroscope readings, to model activities like walking, sitting, or running. Supervised learning uses labeled data to train models that predict activity classes, while unsupervised methods identify inherent structures without labels, and data mining uncovers recurring sequences or associations in large datasets.[54]
In supervised techniques, classifiers such as support vector machines (SVM) and decision trees are widely applied for feature-based recognition. SVM excels in high-dimensional spaces by finding hyperplanes that separate activity classes, achieving accuracies up to 92% on wearable sensor data when using radial basis function kernels. Decision trees, including variants like C4.5, build hierarchical structures based on feature splits, offering interpretability and handling non-linear relationships in activities, with reported F1-scores around 85–90% for multi-class problems. Feature extraction often involves time-domain statistics, such as mean, variance, and skewness of signal segments, which capture amplitude variations indicative of motion intensity.[55][56][57]
Frequency-domain features complement these by applying the Fast Fourier Transform (FFT) to reveal periodic components, like dominant frequencies in gait cycles, enhancing discrimination between cyclic activities such as walking and jogging. The typical pipeline includes segmenting signals into windows (e.g., 2–5 seconds), engineering features, selecting relevant ones via methods like Principal Component Analysis (PCA) for dimensionality reduction—which can retain 95% variance while reducing features by 70%—and tuning models with k-fold cross-validation to mitigate overfitting and ensure generalization across users. PCA projects data onto principal axes, preserving key variances for robust classification.[58][59][60]
Unsupervised approaches, such as k-means clustering, facilitate activity discovery by partitioning unlabeled data into clusters based on feature similarity, often revealing novel patterns like transitions between daily routines. K-means iteratively assigns data points to centroids, minimizing intra-cluster variance, and has been used to group accelerometer trajectories into activity modes with silhouette scores above 0.6. Data mining techniques, including frequent pattern mining with the Apriori algorithm, identify sequential activities by discovering itemsets exceeding a support threshold, such as recurring patterns like "entering room followed by sitting" in smart home logs. These methods are often combined with probabilistic models like hidden Markov models for temporal smoothing.[61][62][63]
The advantages of these approaches include their ability to handle complex, non-linear patterns in heterogeneous data and adaptability to new instances via retraining, making them suitable for real-world deployment. However, they require substantial labeled data for supervision, which can be costly to annotate, and are prone to overfitting without proper regularization, particularly in inter-subject variability scenarios.
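A compact sketch of the supervised pipeline just described, using scikit-learn with synthetic placeholder features, appears below; the feature matrix, class count, and hyperparameters are assumptions for illustration, and on real data subject-wise splits would replace the plain k-fold shown here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix: one row of handcrafted window features per segment
# (e.g., means, variances, FFT energies) and one activity label per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 60))            # 600 windows x 60 engineered features
y = rng.integers(0, 6, size=600)          # 6 activity classes (labels are synthetic)

# Scale features, keep enough principal components for 95% of the variance,
# then classify with an RBF-kernel SVM, as outlined in the text.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC(kernel="rbf"))

# 5-fold cross-validation to estimate generalization.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```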
Examples include mining wearable sensor data for anomaly detection in elderly routines, where clustering identifies deviations from normal walking patterns with precision over 80%, and analyzing GPS logs to uncover urban mobility patterns, such as frequent stop-go sequences in traffic, using sequential mining to support transportation planning. These techniques serve as precursors to deep learning methods by emphasizing engineered representations.[54][64][65]
Deep Learning Approaches
Deep learning approaches have revolutionized human activity recognition (HAR) by enabling end-to-end learning from raw sensor data, surpassing traditional feature-engineered methods through automated extraction of hierarchical representations.[66] These methods leverage neural networks to model complex spatiotemporal patterns in data from wearables, cameras, and ambient sensors, achieving state-of-the-art performance in diverse scenarios.[3] Unlike earlier machine learning techniques that rely on handcrafted features, deep learning automates this process, allowing models to adapt to varied input modalities without extensive preprocessing.[10]
Key architectures in deep learning for HAR include convolutional neural networks (CNNs), recurrent neural networks (RNNs) such as long short-term memory (LSTM) units, and transformer models. CNNs excel at capturing spatial features, particularly in vision-based HAR where they process image or video frames to detect local patterns like body poses.[66] For instance, 1D-CNNs are applied to sequential signals like Wi-Fi channel state information (CSI) to extract temporal-spatial features directly from amplitude and phase variations.[67] RNNs and LSTMs address temporal dependencies in time-series data from inertial measurement units (IMUs), modeling sequential dynamics in activities like walking or gesturing.[68] Transformers, introduced in HAR contexts around 2020, use attention mechanisms for long-range dependency modeling and multimodal fusion, as seen in the Human Activity Recognition Transformer (HART), which processes heterogeneous sensor streams efficiently.[69]
Recent advances emphasize multimodal integration and self-supervised paradigms to handle diverse data sources and labeling scarcity. Multimodal deep learning fuses IMU signals with vision data through late fusion strategies, where separate encoders process each modality before combining representations, improving robustness in occluded environments by up to 5% in accuracy.[70] Self-supervised learning, particularly contrastive methods post-2022, pretrains models on unlabeled data by learning invariant representations across augmented views of sensor signals, reducing reliance on annotations while boosting downstream fine-tuning performance on benchmarks.[71]
Training in these approaches typically involves backpropagation to minimize loss functions like cross-entropy for classification tasks, enabling gradient-based optimization of network parameters. LSTMs, a cornerstone for sequential HAR, incorporate gating mechanisms to regulate information flow; the forget gate, for example, is computed as
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
where \sigma is the sigmoid function, W_f and b_f are learnable weights and biases, h_{t-1} is the previous hidden state, and x_t is the current input.[68] This structure mitigates vanishing gradients in long sequences, facilitating effective learning from IMU time series.
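The following PyTorch sketch shows the kind of CNN-LSTM hybrid described above, trained with cross-entropy and backpropagation on a batch of raw IMU windows; the layer sizes, six input channels, and six output classes are illustrative assumptions rather than the architecture of any cited model.

```python
import torch
import torch.nn as nn

class ConvLSTMHAR(nn.Module):
    """Small CNN-LSTM for raw IMU windows: 1D convolutions extract local motion
    patterns, an LSTM models their temporal order, and a linear layer scores
    activity classes. Layer sizes are illustrative."""
    def __init__(self, n_channels=6, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                  # x: [batch, n_channels, time]
        feats = self.conv(x)               # [batch, 64, time]
        seq = feats.transpose(1, 2)        # [batch, time, 64] for the LSTM
        _, (h_n, _) = self.lstm(seq)       # final hidden state summarizes the window
        return self.head(h_n[-1])          # class logits

# One training step with cross-entropy loss and backpropagation
model = ConvLSTMHAR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 6, 128)                 # batch of 8 windows, 6 IMU axes, 128 samples
y = torch.randint(0, 6, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```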
Advantages of deep learning in HAR include superior accuracy, often exceeding 95% on public benchmarks like UCI-HAR, due to its ability to process raw data without manual feature engineering.[66] Graph convolutional networks (GCNs), for skeleton-based recognition, exemplify this by modeling joint interdependencies as graphs, as in the Spatial-Temporal Graph Convolutional Network (ST-GCN), which achieves high precision in pose estimation from video.[72]
However, deep learning models suffer from high data requirements, often needing thousands of labeled samples per class, and limited interpretability, complicating trust in real-world deployments.[3] As of 2025, trends focus on lightweight architectures for edge devices, such as TinierHAR, which uses depthwise separable convolutions to reduce parameters by over 90% while maintaining near-state-of-the-art accuracy on mobile IMUs.[73]
Data and Evaluation
Public Datasets
Public datasets play a crucial role in advancing activity recognition research by providing standardized benchmarks for developing and evaluating algorithms across diverse sensing modalities. These datasets facilitate reproducibility, enable comparisons of methods, and address challenges such as data scarcity and variability in real-world scenarios. Key collections emphasize diversity in activities, participant demographics, environmental conditions, and annotation quality to support robust model training and generalization.
Inertial sensor datasets, primarily derived from accelerometers and gyroscopes in wearables or smartphones, focus on basic daily activities and locomotion. The UCI Human Activity Recognition (UCI HAR) dataset, released in 2012, comprises recordings from 30 subjects performing six activities of daily living—walking, walking upstairs, walking downstairs, sitting, standing, and laying—using smartphone inertial measurement units (IMUs) mounted on the waist. It includes 7,352 training instances and 2,947 test instances, with time-series signals segmented into 2.56-second windows and labeled for activity type, making it a foundational resource for supervised learning in wearable-based recognition.[74]
The WISDM dataset, introduced in 2010, captures accelerometer and gyroscope data from 36 subjects engaged in six daily actions—walking, jogging, sitting, standing, going upstairs, and going downstairs—sampled at 20 Hz over three-minute trials, yielding 1,098,207 instances in its lab version and emphasizing real-world variability through uncontrolled phone placements in pockets. This dataset highlights challenges like class imbalance, with walking and jogging comprising the majority of samples, and has been widely used to benchmark feature extraction techniques for mobile sensing.[75]
Vision-based datasets leverage video footage to recognize complex human actions, often sourced from diverse real-world clips to capture variations in viewpoint, speed, and context. The HMDB-51 dataset, published in 2011, contains 6,766 video clips across 51 action categories—such as brushing hair, clapping, and sword fighting—extracted from movies, public databases, and web videos, with each class including at least 101 clips divided into three train/validation/test splits. Annotations focus on trimmed segments highlighting the primary motion, supporting evaluations of spatiotemporal models while addressing issues like occlusions and background clutter inherent in unconstrained videos.[76]
The Kinetics-400 dataset, released in 2017 and later expanded to Kinetics-700 in 2020, features approximately 300,000 ten-second YouTube video clips for 400 human action classes in the original version (scaling to 650,000 clips across 700 classes), with balanced sampling of at least 400 videos per class and splits of 240,000 training, 20,000 validation, and 40,000 test instances. It prioritizes semantic diversity, including sports, daily activities, and interactions, and includes frame-level annotations to enable fine-grained temporal analysis, serving as a large-scale benchmark for deep learning in action recognition.[77]
Multimodal and ambient sensing datasets integrate multiple data streams, such as wearables, environmental sensors, and wireless signals, to model interactions in instrumented settings.
The OPPORTUNITY dataset, made available in 2013, records data from four subjects performing daily activities—like opening/closing doors and preparing coffee—in a sensor-rich apartment using body-worn IMUs, object-embedded sensors, and ambient wireless nodes, resulting in over 13 million instances across 11 basic and 4 high-level gesture labels with hierarchical annotations. Its design emphasizes ecological validity through scripted and free-living scenarios, tackling challenges like sensor synchronization and null-class imbalance (e.g., idle periods).
For wireless-based approaches, the Widar 3.0 dataset, released in 2019, collects Channel State Information (CSI) from commodity WiFi devices for 6 hand gestures—such as push & pull, sweep, and drawing circles—performed by 16 subjects in indoor environments, with 12,000 instances in the main set including subcarrier-level amplitude and phase data across multiple positions and orientations. This dataset supports non-contact recognition, highlighting privacy-preserving annotations without video and addressing multipath effects in signal propagation.[78]
Recent datasets from 2023–2025 extend multimodal paradigms for improved generalization, incorporating consumer devices and diverse settings. The MM-HAR dataset, introduced in 2023, fuses data from earbuds (accelerometers, gyroscopes) and smartwatches for 44 subjects across 12 activities—including clapping, walking, and eating—in both lab and home environments, yielding over 100 hours of synchronized recordings with subject-independent splits to evaluate cross-domain transfer. It addresses annotation challenges like privacy in audio-inclusive modalities and class imbalance in fine-grained actions, serving as a benchmark for fusion models in real-life health monitoring. Similarly, the CAPTURE-24 dataset, released in 2024, provides a large-scale collection of wrist-worn sensor data from over 100 participants for activity intensity levels and activities of daily living in real-world settings, emphasizing scalability for machine learning models.[79]
Selection of these datasets often prioritizes factors such as activity diversity (e.g., from locomotion to gestures), subject variability (age, gender), environmental realism, and annotation robustness (e.g., inter-annotator agreement), while mitigating issues like data imbalance through oversampling or synthetic augmentation in downstream research.
| Dataset | Modality | Year | Key Characteristics | Primary Use |
|---|---|---|---|---|
| UCI HAR | Inertial (smartphone IMUs) | 2012 | 30 subjects, 6 activities, 10,299 instances (7,352 train, 2,947 test), time-series windows | Wearable HAR benchmarking |
| WISDM | Inertial (accelerometer/gyro) | 2010 | 36 subjects, 6 activities, 1,098,207 instances, 20 Hz sampling | Mobile activity classification |
| HMDB-51 | Vision (videos) | 2011 | 51 classes, 6,766 clips, 3 splits | Spatiotemporal action recognition |
| Kinetics-400 | Vision (videos) | 2017 | 400 classes, ~300k clips, 10s duration | Large-scale deep learning pretraining |
| OPPORTUNITY | Multimodal (wearables, ambient) | 2013 | 4 subjects, 15 gestures, >13M instances, hierarchical labels | Sensor fusion in smart environments |
| Widar 3.0 | Ambient (WiFi CSI) | 2019 | 16 subjects, 6 gestures, 12,000 instances (main set), subcarrier data | Contactless gesture detection |
| MM-HAR | Multimodal (earbuds/watch) | 2023 | 44 subjects, 12 activities, >100 hours, cross-domain splits | Generalizable consumer HAR |