Affective computing
Affective computing is computing that relates to, arises from, or deliberately influences emotions or other affective phenomena.[1] The term was coined by Rosalind Picard in her seminal 1995 paper, and the field focuses on enabling machines to recognize, interpret, and respond to human emotional states to facilitate more natural human-computer interaction.[1] It draws on interdisciplinary foundations in computer science, psychology, neuroscience, and engineering to bridge the gap between human emotional experience and technological systems.[2] Picard expanded on these ideas in her 1997 book Affective Computing, published by the MIT Press, which provided the intellectual framework for the discipline and emphasized the role of emotions in rational decision-making and perception, as supported by neurological research such as Antonio Damasio's work on emotion and reason.[2] The field originated at the MIT Media Lab, where Picard's Affective Computing Research Group continues to pioneer advances in Emotion AI, a core subset concerned with detecting and simulating emotions through computational models.[3]

Over the past three decades, affective computing has evolved from theoretical models of emotion recognition to practical implementations, incorporating machine learning techniques for analyzing multimodal data such as facial expressions, vocal tone, physiological signals (e.g., heart rate variability), and body gestures; recent advances include integration with large language models for enhanced multimodal emotion understanding.[4] Key components of affective computing include emotion recognition, which uses sensors and algorithms to detect affective states; affective expression, which enables systems to convey emotions via synthetic speech, avatars, or adaptive interfaces; and affective influence, in which technology modulates user emotions for beneficial outcomes.[1] These elements are integrated into wearable devices and software to advance emotion theory and cognition research by collecting real-world data on affective responses.[3]

Notable applications span multiple domains, including mental health monitoring to forecast and prevent conditions such as depression through passive emotion tracking; educational tools that adapt content based on learner frustration or engagement; human-robot interaction for empathetic companionship; and workplace systems for stress detection and productivity enhancement. In healthcare, affective technologies support autism interventions by modeling emotional development and aid communication for individuals with expressive challenges.[3] Recent developments emphasize ethical considerations, such as data privacy in physiological sensing and explainable AI for transparent emotion inference, to ensure responsible deployment amid growing integration with ubiquitous computing.

Introduction and History
Definition and Scope
Affective computing is a branch of artificial intelligence that enables machines to recognize, interpret, process, and simulate human emotions, a concept first coined by Rosalind Picard in her 1995 technical report Affective Computing and expanded upon in her 1997 book of the same name. In this foundational work, Picard describes affective computing as systems designed to relate to, arise from, or deliberately influence emotions, emphasizing the need for computers to interact more naturally with humans by incorporating emotional awareness. The field emerged from the recognition that emotions play a critical role in human cognition, decision-making, and social interaction, extending beyond traditional rational models of intelligence.[5][6]

The scope of affective computing is inherently multidisciplinary, integrating principles from psychology, neuroscience, computer science, and engineering to build emotionally intelligent systems. It focuses on bridging the gap between human affective experiences and machine capabilities, allowing for more empathetic and adaptive technologies in areas such as human-computer interaction. Central to this scope are the core components of the field: affect detection, which identifies emotional states through sensory inputs; affect interpretation, which assigns contextual meaning to detected emotions; and affect synthesis, which enables machines to generate and express appropriate emotional responses. These elements form the backbone of systems that can perceive and respond to human affect in real time.[7][5]

A key aspect of affective computing involves modeling human emotions, often drawing on established psychological theories to inform computational approaches. Categorical models, such as Paul Ekman's framework of six basic emotions (happiness, sadness, fear, anger, surprise, and disgust), treat emotions as discrete universal categories identifiable across cultures. In contrast, dimensional models such as James Russell's 1980 circumplex model represent emotions on a two-dimensional plane of valence (pleasantness-unpleasantness) and arousal (activation level), providing a continuous spectrum for more nuanced representation. A solid grasp of these theories of human emotion is a prerequisite for work in affective computing, as they underpin the design of reliable detection and simulation techniques; without them, technical implementations lack psychological validity.[8][9]
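As a minimal illustration of the dimensional view, the following Python sketch maps a (valence, arousal) coordinate to a coarse circumplex quadrant; the thresholds and example labels are illustrative assumptions rather than part of Russell's model.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point in [-1, 1] x [-1, 1] to a coarse
    quadrant label. The 0.0 thresholds and labels are illustrative only."""
    if valence >= 0.0 and arousal >= 0.0:
        return "high-arousal positive (e.g., excitement)"
    if valence >= 0.0:
        return "low-arousal positive (e.g., contentment)"
    if arousal >= 0.0:
        return "high-arousal negative (e.g., anger)"
    return "low-arousal negative (e.g., sadness)"

print(circumplex_quadrant(0.7, 0.6))    # high-arousal positive (e.g., excitement)
print(circumplex_quadrant(-0.4, -0.5))  # low-arousal negative (e.g., sadness)
```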
Historical Development and Key Figures
The roots of affective computing trace back to artificial intelligence research in the 1980s, when scholars began exploring the role of emotions in cognition and intelligent systems. Marvin Minsky's 1986 book The Society of Mind argued that emotions are essential components of intelligent behavior, emerging from interactions among simpler cognitive processes, and influenced subsequent work on integrating affect into AI. This perspective laid theoretical groundwork by challenging the prevailing view of intelligence as purely rational, emphasizing how emotions guide attention, learning, and decision-making in complex environments.

The field was formally established in the mid-1990s through Rosalind Picard's pioneering contributions at the MIT Media Lab. In her seminal 1995 technical report "Affective Computing," Picard defined the discipline as computing that relates to, arises from, or deliberately influences emotions, proposing models for machines to recognize and express affect using physiological and behavioral cues. She expanded these ideas in her 1997 book Affective Computing, which advocated wearable sensors for monitoring emotions in real time, and founded the MIT Affective Computing Research Group in 1997 to advance interdisciplinary research.[5] Key figures emerged alongside these developments: Cynthia Breazeal, who in the late 1990s developed Kismet, an expressive robot head at MIT that demonstrated affective interaction through facial expressions and social cues, pioneering emotion in social robotics;[10] and Björn Schuller, who advanced speech-based emotion recognition from the early 2000s, contributing foundational methods for acoustic feature analysis and linguistic integration in hybrid models.[11]

Major milestones marked the field's evolution through the 2000s and 2010s. In the 2000s, integration of the Facial Action Coding System (FACS) enabled automated facial expression analysis for emotion detection, as seen in early real-time systems that coded action units for robust recognition. The 2010s saw the rise of multimodal fusion techniques, combining modalities such as speech, face, and physiology to improve accuracy, with reviews highlighting feature-level and decision-level approaches for hybrid emotion inference.[12] Institutional efforts supported standardization, including the founding of the HUMAINE Association in 2005 (now the Association for the Advancement of Affective Computing), which organized the first International Conference on Affective Computing and Intelligent Interaction to foster global collaboration.[13] Post-2020 advancements integrated deep learning and large language models, enabling real-time emotion AI via transformer architectures for multimodal recognition, as evidenced in challenges such as MER 2025 exploring emotion forecasting with pre-trained models.[14] These developments, building on Picard's vision, have scaled affective systems for diverse applications while addressing ethical considerations in emotion-aware computing.[7]

Core Concepts
Emotion Detection and Recognition
Emotion detection and recognition in affective computing involves a systematic process beginning with the sensing of raw multimodal data from human users, such as audio signals, visual cues, or physiological measurements, to capture indicators of emotional states.[15] This raw data undergoes feature extraction, where relevant attributes are identified and quantified (for instance, prosodic elements like pitch variation in speech or action units in facial expressions) to represent emotional content in a computationally tractable form.[15] The extracted features are then fed into classification algorithms, which map them to either discrete emotion categories (e.g., joy, fear, anger) based on categorical models or continuous dimensions such as valence (positive-negative) and arousal (high-low intensity) using dimensional frameworks.[15]

Theoretical foundations for emotion detection draw heavily from psychological models that emphasize cognitive evaluation of stimuli. Appraisal theory posits that emotions arise from an individual's subjective assessment of events in relation to personal goals and well-being, where primary appraisal evaluates the relevance and valence of a stimulus, and secondary appraisal assesses coping potential. This framework informs computational models by guiding feature selection toward indicators of evaluative processes, such as changes in physiological arousal signaling threat relevance.[16] Complementing this, Scherer's component process model (1984) conceptualizes emotions as dynamic, emergent episodes resulting from synchronized changes across multiple subsystems: cognitive appraisal of novelty and goal conduciveness, autonomic physiological responses, motivational action tendencies, motor expressions, and subjective feelings. In affective computing, this model supports recognition by modeling the temporal sequencing of these components, enabling systems to infer emotions from patterns of synchronization rather than isolated signals.[17]

Unimodal detection, relying on a single input modality such as speech, typically achieves accuracies of 60-70% in controlled settings due to limitations in capturing the full spectrum of emotional cues, such as contextual ambiguities in prosody alone.[18] In contrast, multimodal approaches integrate data from multiple sources (e.g., speech, facial expressions, and gestures), yielding substantial improvements through feature fusion, with meta-analyses reporting an average 8.12% gain over the best unimodal method and accuracies reaching up to 90% in laboratory environments where data synchronization is optimized.[19][20] These benefits arise from complementary information across modalities, reducing errors from noise or modality-specific variability, though real-world deployment faces challenges like asynchronous inputs.[19]

Performance in emotion recognition is evaluated using standard machine learning metrics to assess reliability across diverse emotional classes.
Accuracy measures the overall proportion of correct predictions, while precision quantifies the fraction of positive identifications that are truly positive, and recall captures the fraction of actual positives correctly identified; the F1-score, as their harmonic mean, balances these for the imbalanced datasets common in emotion tasks.[21] Confusion matrices further visualize misclassifications, highlighting pairwise errors such as conflating surprise with fear due to overlapping arousal patterns.[21] These metrics are essential for benchmarking, as high accuracy alone may mask poor recall for subtle emotions like disgust.[21]

Context plays a critical role in emotion detection, as situational and cultural factors modulate expression through display rules, the social norms dictating when and how emotions are shown or suppressed.[22] For example, cultures emphasizing collectivism, such as Japan, often enforce stronger rules for masking negative emotions in public to maintain harmony, leading to subdued expressions that unimodal systems trained on Western data may misinterpret as neutral.[22][23] Incorporating contextual priors, such as cultural display norms, enhances recognition robustness by adjusting classification thresholds for variability in emotional intensity and valence across groups.[23]
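To illustrate the evaluation metrics discussed above, the following minimal Python sketch uses scikit-learn to compute per-class precision, recall, F1, and a confusion matrix for a four-class emotion task; the label set, gold labels, and predictions are made-up placeholders.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical gold labels and classifier outputs for a 4-class emotion task.
labels = ["anger", "happiness", "sadness", "neutral"]
y_true = ["anger", "happiness", "sadness", "neutral", "sadness", "anger", "neutral", "happiness"]
y_pred = ["anger", "happiness", "neutral", "neutral", "sadness", "happiness", "neutral", "happiness"]

# Per-class precision, recall, and F1, plus macro and weighted averages.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Rows are true classes, columns are predictions; off-diagonal cells reveal
# systematic confusions (e.g., sadness predicted as neutral).
print(confusion_matrix(y_true, y_pred, labels=labels))
```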
Emotion Simulation and Expression in Machines
Emotion simulation in machines involves computational methods to generate internal emotional states and express them through various output channels, enabling more natural human-machine interaction. Early approaches relied on rule-based systems, in which predefined if-then rules map situational inputs to emotional responses, such as triggering an empathetic response when detecting user frustration in a conversational agent. These systems, inspired by psychological theories, provide deterministic and interpretable simulations but lack flexibility for complex, context-dependent emotions (a toy sketch of this rule-based approach appears at the end of this section). In contrast, modern generative models create dynamic emotional expressions by learning from data to produce realistic outputs, such as animating facial features to convey subtle joy or anger from neutral inputs; recent advances as of 2025 incorporate large language models (LLMs) for generating emotionally aligned responses in dialogue systems, shifting toward generative paradigms beyond traditional categorical frameworks.[24][25]

Expression modalities in affective computing encompass multiple channels to convey simulated emotions effectively. Virtual agents often use dynamic facial animations, where machine-generated expressions mimic human micro-expressions to build rapport, as seen in embodied conversational agents that adjust eyebrow raises or smiles based on appraised emotional intensity. Tonal speech synthesis integrates prosodic features such as pitch variation and tempo to infuse spoken responses with affective tone, allowing systems to sound compassionate during user distress. Haptic feedback serves as a tactile modality, using vibrations or pressure patterns on wearable devices to transmit emotions, such as rhythmic pulses simulating warmth for affection or irregular jolts for anxiety, enhancing immersion in virtual environments.[26]

Computational models underpin these simulations by formalizing how machines appraise and generate emotions. The OCC model, originally a psychological framework, has been adapted for computational appraisal, where agents evaluate events relative to goals, standards, and tastes to derive emotions such as pride or reproach, enabling rule-based or probabilistic simulations in virtual characters. The EMA (Emotion and Adaptation) architecture extends this by modeling dynamic appraisal processes over time, incorporating coping strategies to produce believable emotional expressions in agents, such as shifting from anger to acceptance in response to unfolding scenarios. These models prioritize cognitive structures to ensure simulated emotions align with human-like reasoning, facilitating applications in interactive systems.[16]

Evaluation of emotion simulation focuses on perceived authenticity and impact rather than internal accuracy. Variants of the Turing Test assess emotional believability by having users distinguish machine-generated responses from human ones in empathetic dialogues, revealing high indistinguishability in advanced systems. User studies measure perceived empathy through scales such as the Perceived Empathy of Technology Scale (PETS), which evaluates factors such as emotional responsiveness and trust, showing that expressive virtual agents increase user satisfaction in interaction tasks. These metrics emphasize subjective human judgments to refine simulation techniques.[27][28]

An ethical consideration in emotion simulation is the risk of anthropomorphism, where users over-attribute genuine feelings to machines, potentially leading to emotional dependency or manipulation.
Studies indicate that intentionally harming emotion-expressing robots heightens perceptions of their pain and moral status, blurring boundaries and raising concerns about psychological harm or misguided trust in non-sentient systems. Designers must balance expressiveness with transparency to mitigate deception while preserving interaction benefits.[29]
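Returning to the rule-based, appraisal-style simulation described earlier in this section, the sketch below maps a few appraisal variables (goal desirability and the responsible agent) to OCC-inspired emotion labels; the variables, thresholds, and labels are simplified assumptions rather than a faithful OCC or EMA implementation.

```python
from dataclasses import dataclass

@dataclass
class Appraisal:
    desirability: float   # -1.0 (very undesirable) to +1.0 (very desirable) for the agent's goals
    caused_by_self: bool  # did the agent itself cause the event?

def appraise(event: Appraisal) -> str:
    """Map a simplified appraisal of an event to an OCC-inspired emotion label.
    Thresholds and label choices are illustrative assumptions."""
    if event.desirability > 0.3:
        return "pride" if event.caused_by_self else "joy"
    if event.desirability < -0.3:
        return "shame" if event.caused_by_self else "distress"
    return "neutral"

# A conversational agent might use the resulting label to pick an expressive response.
print(appraise(Appraisal(desirability=0.8, caused_by_self=False)))   # joy
print(appraise(Appraisal(desirability=-0.7, caused_by_self=True)))   # shame
```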
Sensing Technologies
Speech-Based Emotion Recognition
Speech-based emotion recognition involves analyzing audio signals to detect emotional states conveyed through vocal expressions, focusing on paralinguistic elements that transcend linguistic content. This approach leverages the acoustic properties of speech, such as variations in tone and rhythm, to infer emotions like anger, happiness, or sadness. Key to this process is the extraction of acoustic features that capture the nuances of emotional vocalization. Prosodic features, including pitch (fundamental frequency), tempo (speech rate), and volume (energy or intensity), provide temporal and dynamic indicators of emotion; for instance, elevated pitch and faster tempo often signal excitement or anger.[30]

Spectral descriptors further enhance recognition by modeling the frequency content of speech. Among these, Mel-Frequency Cepstral Coefficients (MFCCs) are widely used, representing the short-term power spectrum of sound on a nonlinear mel scale that approximates human auditory perception. The MFCCs are computed as c_n = \sum_{k=1}^{K} \log(S_k) \cos\left(\frac{\pi n (k - 0.5)}{K}\right), where S_k are the outputs of the mel-scale filters applied to the signal's power spectrum, K is the number of filters, and n indexes the coefficients. These features effectively distinguish emotional categories by highlighting formant structures and harmonic variations in voiced segments (see the feature-extraction sketch later in this section).[31][32]

Algorithms for speech-based emotion recognition have evolved from traditional statistical models to deep learning architectures tailored for audio processing. Hidden Markov Models (HMMs) were seminal in early systems, modeling the sequential nature of speech prosody to classify emotions through state transitions based on feature sequences such as MFCCs and pitch contours. Convolutional Neural Networks (CNNs) advanced this by treating spectrograms (time-frequency representations of audio) as images, applying filters to detect local patterns indicative of emotional arousal or valence. In the 2020s, transformer-based models such as wav2vec 2.0 enabled end-to-end emotion classification by learning contextual representations from raw waveforms via self-supervised pretraining on large speech corpora, followed by fine-tuning on emotion-labeled data. These transformers capture long-range dependencies in audio, outperforming prior methods on dynamic emotional expressions. Recent advances as of 2025 include adaptations of models such as HuBERT and Whisper for SER, supporting naturalistic speech in challenges such as Interspeech 2025, with improved robustness to noise and dialects.[30][33][34][35][36]

Despite progress, speech-based emotion recognition faces significant challenges, including speaker variability, where individual differences in voice quality and accent degrade model generalization across users. Noise robustness remains a critical issue, as environmental interference can mask subtle prosodic cues, necessitating robust feature selection or denoising preprocessing. Cultural differences in vocal emotion expression further complicate deployment; for example, anger may manifest with higher pitch in Western speakers but lower pitch in some Asian cultural contexts, leading to cross-cultural misclassifications.
These factors underscore the need for diverse, inclusive training data to mitigate biases.[37][38][30] Performance benchmarks on datasets like IEMOCAP, which includes dyadic interactions with acted and improvised emotions, typically yield 65-75% accuracy for four-class recognition (e.g., angry, happy, sad, neutral), with weighted accuracies around 72-74% accounting for class imbalance. These results highlight the modality's potential but also its limitations compared to human-level perception, particularly for subtle or mixed emotions.[39][40][41]

To enhance accuracy, paralinguistic analysis often integrates speech features with textual sentiment from transcribed words, combining prosodic indicators with lexical cues such as positive or negative phrasing. This fusion, typically via multimodal classifiers, improves overall emotion detection by resolving ambiguities where vocal tone contradicts verbal content, such as sarcastic remarks.[42][43]
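A minimal feature-extraction sketch in Python, assuming the librosa library and a local audio file (the file path and frame settings are placeholders), showing how the prosodic and MFCC features described above might be computed before classification:

```python
import numpy as np
import librosa

# Load a mono speech clip; the file path is a placeholder.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per frame (DCT of log mel-filterbank energies).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Prosodic cues: fundamental frequency (pitch) via pYIN and frame-level energy (RMS).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
rms = librosa.feature.rms(y=y)[0]

# A simple utterance-level feature vector: MFCC means and standard deviations,
# pitch statistics over voiced frames, and mean energy. A classifier (SVM, CNN,
# etc.) would consume this vector.
features = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [np.nanmean(f0), np.nanstd(f0), rms.mean()],
])
print(features.shape)  # (29,)
```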
Facial Expression Analysis
Facial expression analysis is a cornerstone of affective computing, focusing on the automated interpretation of human emotions through visual cues from facial movements and configurations. This involves detecting and classifying expressions to infer emotional states, enabling machines to respond empathetically in human-computer interactions. Techniques in this domain process both static images and dynamic video sequences, emphasizing the subtlety of facial dynamics to distinguish between basic emotions such as happiness, sadness, anger, fear, surprise, and disgust.[44]

Feature extraction forms the initial step in facial expression analysis, where key facial components are identified to capture emotional indicators. Landmark detection, for instance, locates specific points on the face, with the widely adopted 68-point model delineating contours around the eyes, eyebrows, nose, mouth, and jawline to quantify deformations associated with expressions. This model, implemented in libraries such as dlib, facilitates precise tracking of facial geometry for emotion inference (illustrated in the sketch at the end of this section). Additionally, optical flow methods analyze pixel motion between consecutive frames to detect micro-expressions (brief, involuntary facial movements lasting less than 1/25 of a second that reveal concealed emotions); these are particularly useful for applications in deception detection and mental health monitoring.[45][46]

The Facial Action Coding System (FACS), developed by Paul Ekman and Wallace V. Friesen in 1978, provides a foundational framework for dissecting facial expressions into atomic components known as Action Units (AUs). FACS anatomically maps 44 AUs to specific muscle actions, such as AU12 (lip corner puller), which activates the zygomaticus major muscle to produce a smile indicative of happiness. Complex emotions arise from AU combinations; for example, surprise is characterized by AU1 (inner brow raiser) + AU2 (outer brow raiser) + AU5 (upper lid raiser), resulting in widened eyes and raised eyebrows. This system enables systematic annotation and has been applied in thousands of studies, influencing both manual coding and automated tools in affective computing.[47][48]

Classification methods in facial expression analysis leverage machine learning to map extracted features to emotional categories, contrasting AU-based approaches, which decompose expressions into independent muscle activations for granular analysis, with holistic methods that treat the face as a unified gestalt for direct emotion prediction. Early techniques employed Support Vector Machines (SVMs) on handcrafted features such as Gabor wavelets, achieving up to 88% accuracy on posed expressions in controlled settings. Deep learning has advanced this field, with convolutional neural networks (CNNs) such as ResNet applied to datasets like FER2013 (a benchmark comprising 35,887 grayscale images of varied expressions), yielding approximately 70-75% accuracy on test sets, though performance varies by emotion due to class imbalance. AU-based classifiers often outperform holistic ones in spontaneous scenarios by isolating subtle cues, as demonstrated in comparative studies where component-wise analysis improved recognition by 10-15% over whole-face processing.[49][50][51]

Despite progress, facial expression analysis faces significant challenges, including occlusions from masks or hands, head pose variations that distort feature alignment, and the disparity between posed (deliberate, exaggerated) and spontaneous (natural, subtle) expressions.
Spontaneous expressions, which better reflect authentic emotions, exhibit different muscle activation patterns and dynamics compared to posed ones, leading to substantially lower recognition accuracies, often around 50-60% in real-world settings versus 80-90% for posed data in labs. These issues are exacerbated in unconstrained environments, where lighting, ethnicity, and cultural differences further degrade performance.[52][53]

Recent advances, particularly from 2024-2025, emphasize real-time facial expression analysis via edge computing for mobile devices, enabling low-latency, privacy-preserving emotion detection without cloud dependency. Lightweight architectures such as MobileNet and EfficientNet, optimized for resource-constrained hardware, achieve over 70% accuracy in on-device inference while processing video at 30 FPS, supporting applications in wearable tech and telehealth. These developments integrate FACS-inspired AU detection with transformer-based models for robust handling of variations, marking a shift toward deployable affective interfaces.[54][55]
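A minimal sketch of landmark-based feature extraction with OpenCV and dlib, assuming the publicly distributed shape_predictor_68_face_landmarks.dat model file is available locally and that "face.jpg" is a placeholder image; the mouth-opening ratio computed here is an illustrative geometric feature, not a standard AU detector.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to dlib's pre-trained 68-point landmark model (downloaded separately).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("face.jpg")                     # placeholder image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray):
    shape = predictor(gray, rect)
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])

    # Illustrative geometric cue: vertical mouth opening relative to mouth width
    # (points 48-67 cover the mouth region in the 68-point scheme).
    mouth_open = np.linalg.norm(pts[66] - pts[62])   # inner-lip vertical gap
    mouth_width = np.linalg.norm(pts[54] - pts[48])  # lip-corner distance
    ratio = mouth_open / (mouth_width + 1e-6)

    # A downstream classifier would consume such geometric features,
    # or a FACS-based AU detector would replace this heuristic.
    print(f"mouth opening ratio: {ratio:.2f}")
```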
Physiological Signal Monitoring
Physiological signal monitoring in affective computing involves capturing involuntary bodily responses to infer emotional states, providing a covert and objective measure of internal arousal and valence that complements visible cues. These signals, primarily from the autonomic nervous system, reflect subconscious reactions such as increased sweating or heart rate fluctuations during emotional experiences. Unlike explicit expressions, physiological monitoring enables real-time, non-invasive detection in naturalistic settings, with applications in mental health and human-machine interfaces.[56]

Electrodermal activity (EDA), also known as galvanic skin response (GSR), measures changes in skin conductance due to sweat gland activity, serving as a key indicator of emotional arousal. EDA signals increase with sympathetic nervous system activation, correlating strongly with high-arousal states like excitement or stress, while showing weaker links to valence. Skin conductance G is calculated as G = \frac{I}{V}, where I is the current and V is the applied voltage, allowing quantification of tonic (baseline) and phasic (event-related) components. Meta-analyses confirm EDA's superior performance in arousal prediction over valence, with accuracies often exceeding 70% in dimensional models.[57][58]

Heart rate variability (HRV), derived from electrocardiogram (ECG) signals, assesses fluctuations in inter-beat intervals to detect emotional stress and arousal. HRV decreases under stress due to parasympathetic withdrawal and sympathetic dominance, with spectral analysis revealing key metrics such as the ratio of low-frequency (LF) to high-frequency (HF) power (LF/HF), where elevated ratios indicate heightened stress (see the sketch at the end of this section). For instance, meta-analyses of 37 studies show consistent reductions in HF power and increases in LF during acute stress, linking HRV to prefrontal cortex activity for emotion appraisal. This makes HRV a reliable biomarker for distinguishing calm from anxious states in affective systems.[59]

Facial electromyography (EMG) captures subtle muscle activations to gauge emotional valence, focusing on non-visible contractions. The zygomaticus major muscle, involved in smiling, activates during positive emotions like happiness, reflecting approach-oriented affect. Conversely, the corrugator supercilii muscle, associated with frowning, engages during negative states such as anger or sadness, indicating withdrawal tendencies. Studies demonstrate reliable differentiation, with zygomaticus activity rising to happy stimuli and corrugator to aversive ones, enabling valence classification accuracies around 80% in controlled settings.[60]

Blood volume pulse (BVP), measured via photoplethysmography (PPG), tracks peripheral blood flow changes to identify emotional arousal, particularly fear. PPG sensors detect pulse wave variations, with acceleration features (e.g., the second derivative of the pulse wave) showing abrupt shifts during fear responses due to vasoconstriction. In datasets eliciting emotions like "scary," BVP features achieve up to 71.88% recognition accuracy using machine learning, highlighting time-frequency domain metrics such as skewness for distinguishing high-arousal negative states. This approach suits wearable integration for real-time fear detection in safety-critical applications.[61]

Facial color analysis uses RGB imaging to detect chromaticity shifts signaling emotions, bypassing overt expressions.
Blushing, indicated by increased redness from vasodilation, conveys arousal in states such as anger or embarrassment, while pallor (paleness) from vasoconstriction signals fear or shock. Remote PPG (rPPG) enhances this by extracting the pulse from color variations, with experiments showing 70% accuracy in decoding 18 emotions from color alone and 85% for valence. These patterns, driven by blood flow, provide an efficient, universal channel for emotional transmission.[62]

Recent developments as of 2025 emphasize multimodal fusion of physiological signals, such as EEG and ECG, using ensemble learning to boost recognition accuracy in real-world scenarios, achieving up to 95% in controlled emotion elicitation via virtual reality. Wearable devices such as smartwatches integrate GSR and HRV sensors for continuous physiological monitoring, enabling ambulatory affective computing; for example, wrist-based EDA and ECG capture arousal in daily life, with preprocessing mitigating signal-quality issues. However, motion artifacts from user movement distort signals, particularly in PPG and EDA, requiring adaptive filtering for reliability. Privacy concerns also arise from persistent data collection, necessitating secure protocols to protect sensitive emotional insights.[63][64][56][65]
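A minimal Python sketch, using NumPy and SciPy, of the two quantities discussed above: skin conductance from a measured current under a fixed applied voltage, and an LF/HF ratio estimated from an evenly resampled RR-interval series. The band limits and 4 Hz resampling follow common conventions, but the signals here are synthetic placeholders.

```python
import numpy as np
from scipy.signal import welch

# --- Skin conductance: G = I / V ---
applied_voltage = 0.5                                       # volts (constant-voltage EDA circuit)
measured_current = np.array([2.1, 2.3, 2.8, 3.5]) * 1e-6    # amps (synthetic samples)
conductance_uS = measured_current / applied_voltage * 1e6   # microsiemens
print("EDA (uS):", conductance_uS)

# --- HRV LF/HF ratio from RR intervals (seconds) ---
rr = 0.8 + 0.05 * np.sin(2 * np.pi * 0.1 * np.arange(300))  # synthetic RR series with a slow (LF-band) oscillation
t = np.cumsum(rr)                                           # beat times in seconds
fs = 4.0                                                    # resample the unevenly spaced series at 4 Hz
t_even = np.arange(t[0], t[-1], 1.0 / fs)
rr_even = np.interp(t_even, t, rr)

freqs, psd = welch(rr_even - rr_even.mean(), fs=fs, nperseg=256)
df = freqs[1] - freqs[0]
lf = psd[(freqs >= 0.04) & (freqs < 0.15)].sum() * df       # low-frequency power (0.04-0.15 Hz)
hf = psd[(freqs >= 0.15) & (freqs < 0.40)].sum() * df       # high-frequency power (0.15-0.40 Hz)
print("LF/HF ratio:", lf / hf)
```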
Gesture and Body Language Recognition
Gesture and body language recognition in affective computing involves the analysis of non-verbal cues such as postures, movements, and dynamic gestures to infer emotional states, providing a complementary modality to facial or vocal signals. This approach leverages computer vision techniques to detect and interpret body dynamics, enabling machines to understand human affect through skeletal structures and motion patterns.[7] Key features extracted include pose estimation keypoints, which represent joint positions across the body, and kinematic attributes such as limb velocity and acceleration to capture agitation or fluidity in movements.[66] For instance, OpenPose, a widely adopted real-time pose estimation library, detects 25 keypoints for the human body (including head, torso, limbs, and feet), facilitating the modeling of full-body configurations for emotion inference.[67]

Emotions are mapped to specific gesture patterns based on psychological and computational models; for example, open arm postures often signal welcoming or happiness, while crossed arms indicate defensiveness associated with anger or discomfort.[68] These mappings draw from established nonverbal communication research, where expansive gestures correlate with positive affect and contractive ones with negative states.[69] However, cultural variations significantly influence interpretation; the thumbs-up gesture conveys positivity in Western cultures but is offensive in parts of the Middle East and Asia, highlighting the need for context-aware models in global applications.[70]

Techniques for recognition typically involve skeleton-based processing, where 2D video analysis extracts keypoints from RGB footage, in contrast with 3D motion capture systems that use depth sensors for precise spatial reconstruction.[71] Temporal sequences of these skeletons are modeled using recurrent neural networks (RNNs) or long short-term memory (LSTM) units to capture sequential dependencies in gestures, such as the progression from neutral to agitated motion, as sketched below.[66] Seminal work has demonstrated that LSTM-enhanced RNNs achieve robust classification of emotions like anger and sadness by processing joint orientations and velocities over time.[72] As of 2025, advances include datasets such as BER2024 for training body language recognition systems that classify expressions into categories such as negative, neutral, pain, and positive, alongside hybrid models integrating gestures with other modalities for improved accuracy in affective computing. In human-computer interaction (HCI), these methods enable real-time tracking for adaptive interfaces, such as virtual agents responding to user frustration via detected slumped postures.[73][74]

Laboratory evaluations report accuracies of 60-80% for recognizing basic emotions (e.g., happiness, anger) using skeleton data, with higher rates (up to 92%) for distinct categories such as anger in controlled settings.[71] Challenges include gesture ambiguity, where the same posture (e.g., crossed arms) may reflect comfort or hostility depending on context, and occlusion in multi-person scenarios, which obscures keypoints and reduces model reliability.[75] Multimodal fusion with other cues can mitigate these issues, though gesture analysis remains essential for silent or occluded environments.[76]
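A minimal PyTorch sketch of the sequence-modeling idea described above: an LSTM consumes a sequence of flattened 2D keypoints (25 joints with (x, y) coordinates per frame, matching OpenPose's body model) and outputs scores over a small set of emotion classes. The architecture sizes and class list are illustrative assumptions, not a published model.

```python
import torch
import torch.nn as nn

class SkeletonEmotionLSTM(nn.Module):
    """Classify an emotion from a sequence of 2D pose keypoints."""
    def __init__(self, num_joints: int = 25, hidden_size: int = 128,
                 num_classes: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2,   # (x, y) per joint
                            hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, num_joints * 2)
        _, (h_n, _) = self.lstm(x)       # final hidden state summarizes the sequence
        return self.head(h_n[-1])        # (batch, num_classes) emotion logits

# Example: a batch of 8 clips, 60 frames each, 25 joints with (x, y) coordinates.
model = SkeletonEmotionLSTM()
clips = torch.randn(8, 60, 25 * 2)
logits = model(clips)
print(logits.shape)  # torch.Size([8, 4]) -> e.g., happiness, anger, sadness, neutral
```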
Data and Modeling
Datasets and Databases
Datasets and databases play a crucial role in affective computing by providing annotated resources for training and evaluating emotion recognition systems. These resources vary by modality, capturing speech, facial expressions, physiological signals, or multimodal data, and are essential for developing models that generalize across diverse emotional contexts. Key datasets often include acted or spontaneous expressions, with annotations for categorical emotions (e.g., anger, happiness) or dimensional models (e.g., valence-arousal).[77]

Speech Datasets
Speech-based datasets focus on prosodic features such as pitch, tempo, and timbre to infer emotions from audio recordings. The Berlin Emotional Speech Database (Emo-DB) is a seminal acted dataset featuring ten speakers expressing seven German emotions, such as anger, boredom, and joy, through 535 utterances recorded in a controlled environment. It has been widely used for benchmarking speech emotion recognition due to its high-quality recordings and balanced emotional categories; a 2025 update, EmoDB 2.0, extends the original with additional naturalistic recordings for improved real-world applicability.[78] Another influential resource is the Interactive Emotional Dyadic Motion Capture database (IEMOCAP), which includes improvised and scripted dialogues from ten speakers in dyadic interactions, annotated for nine emotions including neutral, happy, and sad, with over 12 hours of multimodal data emphasizing spontaneous expressions.[79]

Facial Datasets
Facial expression datasets provide image or video sequences labeled for emotions, often derived from action units or direct categorical annotations. The Extended Cohn-Kanade dataset (CK+) contains 327 posed image sequences from 123 subjects, depicting seven basic emotions (anger, contempt, disgust, fear, happiness, sadness, surprise) with peak frames annotated for action units, making it suitable for controlled expression analysis.[80] For in-the-wild scenarios, AffectNet offers over one million facial images crowdsourced from the internet, labeled for eight discrete emotions plus continuous valence and arousal dimensions, with approximately 450,000 manually annotated images to support robust training on naturalistic variations in lighting, pose, and occlusion.[81]

Physiological Datasets
Physiological signal datasets capture bio-signals such as EEG, ECG, and GSR to detect emotion-induced arousal and valence. The Database for Emotion Analysis using Physiological signals (DEAP) records EEG, peripheral physiological measures (e.g., GSR, respiration), and facial videos from 32 participants viewing 40 one-minute music videos, annotated on a 9-point valence-arousal scale, totaling 32 experimental sessions for music-elicited emotions.[82] The Wearable Stress and Affect Detection dataset (WESAD) features multimodal data from 15 subjects using wrist- and chest-worn sensors during baseline, stress (Trier Social Stress Test), amusement, and meditation conditions, with ECG, EDA, and accelerometer signals labeled for three affective states to enable wearable-based stress detection.[83]

Multimodal Datasets
Multimodal datasets integrate multiple channels for comprehensive emotion modeling. The SEMAINE database comprises recordings of dyadic interactions between users and limited-capability "Sensitive Artificial Listener" agents, totaling approximately 80 hours, annotated for continuous dimensions (arousal, valence, power, expectation) during emotionally colored conversations with 150 participants.[84] The BAUM-1 dataset includes 1,184 spontaneous audio-visual video clips from 31 subjects reacting to affective video stimuli, labeled for five emotions (amusement, boredom, disgust, fear, neutral) and mental states, capturing upper-body gestures and facial expressions in naturalistic settings.[85]

| Dataset | Modality | Key Features | Size/Details | Primary Source |
|---|---|---|---|---|
| Emo-DB | Speech | Acted German emotions (7 categories) | 535 utterances, 10 speakers | Burkhardt et al. (2005) |
| IEMOCAP | Speech/Multimodal | Improvised dyadic dialogues (9 emotions) | 12+ hours, 10 speakers | Busso et al. (2008) |
| CK+ | Facial | Posed sequences (7 emotions, action units) | 327 sequences, 123 subjects | Lucey et al. (2010) |
| AffectNet | Facial | In-the-wild images (8 emotions + V-A) | 1M+ images, ~450K annotated | Mollahosseini et al. (2017) |
| DEAP | Physiological | EEG/GSR/video for music-induced V-A | 32 participants, 40 trials | Koelstra et al. (2011) |
| WESAD | Physiological | Wearable sensors for stress/affect | 15 subjects, 4 conditions | Schmidt et al. (2018) |
| SEMAINE | Multimodal | Audio-visual interactions (continuous V-A) | 959 recordings, 150 participants | McKeown et al. (2012) |
| BAUM-1 | Multimodal | Spontaneous AV for emotions/mental states | 1,184 clips, 31 subjects | Zhalehpour et al. (2017) |
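As a usage illustration for a physiological dataset such as DEAP, the sketch below assumes DEAP's "preprocessed Python" release, which is commonly documented as one pickled file per participant containing 'data' and 'labels' arrays; the file path, array shapes, and the midpoint binarization are assumptions drawn from that documentation rather than guaranteed by this article.

```python
import pickle
import numpy as np

# Assumes DEAP's preprocessed Python release: one pickled dict per participant
# with 'data' (trials x channels x samples) and 'labels'
# (trials x [valence, arousal, dominance, liking] self-reports on a 1-9 scale).
with open("s01.dat", "rb") as f:                 # placeholder path
    session = pickle.load(f, encoding="latin1")

data = session["data"]      # e.g., shape (40, 40, 8064) for 40 trials
labels = session["labels"]  # shape (40, 4)

# Common practice: binarize continuous self-reports at the scale midpoint (5)
# to obtain high/low valence and arousal classes for classification experiments.
valence_high = (labels[:, 0] > 5).astype(int)
arousal_high = (labels[:, 1] > 5).astype(int)
print("high-valence trials:", valence_high.sum(), "of", len(valence_high))
```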