Multimodal learning
Multimodal learning, also referred to as multimodal machine learning, is a subfield of artificial intelligence focused on developing computational models that process, integrate, and relate information from multiple heterogeneous data modalities—such as text, images, audio, video, and sensory signals—to emulate human-like perception and enhance understanding of complex real-world phenomena.[1] This approach leverages the complementary strengths of different modalities, where, for instance, visual data provides spatial context while linguistic data offers semantic meaning, leading to improved performance over unimodal systems in tasks requiring holistic reasoning.[2] Originating from early applications in audio-visual speech recognition in the 1980s, the field has evolved significantly with advancements in deep learning, particularly since the introduction of the Transformer architecture in 2017, which enabled scalable processing of sequential multimodal data.[2]
At its core, multimodal learning addresses five key technical challenges: representation, which involves encoding diverse data types into compatible formats; translation, mapping information between modalities (e.g., generating text from images); alignment, establishing correspondences across modalities to handle temporal or spatial discrepancies; fusion, combining features at data, feature, or decision levels to produce unified outputs; and co-learning, transferring knowledge between modalities to mitigate data scarcity in underrepresented ones.[1] Fusion strategies, a cornerstone of the field, are categorized as early (input-level integration), late (output-level combination), or hybrid (intermediate feature merging), with recent hybrid methods using attention mechanisms showing superior results in noisy or incomplete data scenarios.[3] Alignment techniques, such as contrastive learning in models like CLIP (2021), have been pivotal in creating shared embedding spaces, facilitating zero-shot generalization across modalities.[2]
The field has seen explosive growth with the advent of Multimodal Large Language Models (MLLMs), which extend large language models like GPT to incorporate visual and auditory inputs, enabling emergent capabilities in areas such as visual question answering and cross-modal generation; this growth has continued into 2025 with advanced models including Llama 4 and Grok-4 Multimodal.[4][5] Notable applications span computer vision (e.g., image captioning), natural language processing (e.g., sentiment analysis with audio cues), healthcare (e.g., multimodal diagnostics from scans and patient records), and robotics (e.g., sensor fusion for navigation).[3] Despite these advances, challenges persist, including the modality gap due to data heterogeneity, handling missing or noisy inputs, and ensuring ethical alignment in biased multimodal datasets.[3] Looking ahead, ongoing research emphasizes self-supervised pretraining on vast multimodal corpora and integration with knowledge graphs to boost reasoning, positioning multimodal learning as a pathway toward artificial general intelligence.[4]
Fundamentals
Definition and Core Concepts
Multimodal learning is a subfield of machine learning that focuses on integrating and processing information from multiple heterogeneous data sources, or modalities, such as text, images, audio, and video, to perform tasks that surpass the capabilities of unimodal approaches. This integration allows models to leverage complementary information across modalities, enabling more comprehensive understanding and decision-making in complex scenarios. Unlike unimodal learning, which relies on a single data type and may suffer from limitations in representation depth, multimodal learning exploits the synergies between modalities to capture richer contextual cues.
The core objectives of multimodal learning include enhancing model robustness against noise or incomplete data, deriving richer feature representations that encode inter-modal relationships, and fostering cross-modal understanding through mechanisms like modality alignment and joint embedding spaces. Modality alignment ensures that corresponding elements across different data types—such as synchronizing visual and textual descriptions—are mapped into a shared representational framework, while joint embedding spaces project diverse modalities into a common latent space to facilitate interaction and transfer of knowledge. These objectives aim to mitigate the silos of unimodal processing, allowing models to generalize better across varied inputs and tasks.
Fundamental principles of multimodal learning address the inherent heterogeneity of data types, where each modality exhibits unique statistical properties, dimensions, and noise characteristics, necessitating tailored preprocessing and representation strategies. A key principle is the need for alignment, either temporal (e.g., synchronizing audio and video streams) or spatial (e.g., linking textual labels to image regions), to establish meaningful correspondences between modalities. Additionally, handling missing modalities is crucial, as real-world data often involves incomplete inputs; techniques must robustly infer or compensate for absent data streams without significant performance degradation. For instance, in object recognition, combining visual image data with descriptive text can improve accuracy by providing contextual disambiguation—such as distinguishing between similar objects like a "red apple" versus a "red ball"—outperforming image-only models, as demonstrated in theoretical analyses showing multimodal superiority even on unimodal evaluation tasks.
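To make the notion of a joint embedding space concrete, the following minimal PyTorch sketch projects pre-extracted visual and textual feature vectors into a shared latent space in which cosine similarity measures cross-modal alignment. The class name, feature dimensions, and projection heads are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Projects pre-extracted image and text features into a shared latent space."""
    def __init__(self, img_dim: int, txt_dim: int, latent_dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, latent_dim)   # image-specific projection head
        self.txt_proj = nn.Linear(txt_dim, latent_dim)   # text-specific projection head

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # L2-normalise so that a dot product equals cosine similarity across modalities
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

# Toy usage: 4 paired samples with hypothetical feature sizes (2048-d visual, 768-d textual)
model = JointEmbedder(img_dim=2048, txt_dim=768)
z_img, z_txt = model(torch.randn(4, 2048), torch.randn(4, 768))
similarity = z_img @ z_txt.T          # pairwise image-text cosine similarities
print(similarity.shape)               # torch.Size([4, 4])
```

In practice, projection heads of this kind are usually trained with a contrastive objective that pulls matching image-text pairs together in the shared space and pushes mismatched pairs apart.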
Types of Modalities
In multimodal learning, the primary data modalities encompass visual, auditory, textual, and sensory types, each contributing distinct forms of information to enhance model performance through complementary representations.[6] Visual modalities, including static images and dynamic videos, form a cornerstone due to their rich spatial and temporal details, while auditory modalities capture acoustic signals like speech and environmental sounds.[1] Textual modalities handle natural language sequences, and sensory modalities incorporate physical or biological signals such as haptic feedback or physiological measurements.[6] These modalities differ fundamentally in structure, enabling multimodal systems to address limitations of unimodal approaches by integrating diverse data streams.[1]
Key characteristics of these modalities include variations in dimensionality, susceptibility to noise, and synchronization requirements. Visual data exhibits high dimensionality from pixel grids (e.g., RGB channels in images), making it computationally intensive, and is prone to noise from factors like occlusion or lighting variations; synchronization challenges arise in videos, where temporal alignment with audio is essential to maintain coherence.[6] Auditory signals are inherently sequential and time-dependent, with moderate dimensionality but high noise levels from background interference, necessitating precise alignment in multimodal contexts like audiovisual processing.[1] Textual data, in contrast, is lower-dimensional and discrete, yet introduces noise through linguistic ambiguities or context dependencies, with synchronization often involving temporal or semantic matching to other modalities.[6] Sensory modalities, such as physiological signals (e.g., electrocardiograms) or haptic inputs, vary in dimensionality based on sensor resolution but are typically noisy due to environmental artifacts or biological variability, requiring real-time synchronization for applications involving human interaction.[1]
Preprocessing is crucial to standardize these heterogeneous inputs for integration. For visual modalities, convolutional neural networks (CNNs) are commonly employed to extract hierarchical features, reducing raw pixel data into compact representations that preserve spatial hierarchies.[6] Auditory data is transformed into spectrograms—visual representations of frequency over time—or Mel-frequency cepstral coefficients (MFCCs), which mimic human auditory perception by emphasizing perceptually relevant frequencies.[1] Textual inputs undergo tokenization to segment words or subwords, followed by embedding techniques (e.g., via transformer models) to map discrete tokens into continuous vector spaces that capture semantic relationships.[6] Sensory signals often require modality-specific filtering, such as bandpass filters for physiological data, to mitigate noise and extract relevant features like signal amplitude or frequency components.[1]
Hybrid modalities extend these primaries by combining them to leverage synergies, as seen in multimodal sentiment analysis, where textual transcripts are paired with visual facial expressions to detect nuanced emotions more accurately than single-modality methods.[6] This approach addresses individual modality weaknesses—such as textual sarcasm undetected in isolation—by cross-validating cues across data types, though it amplifies preprocessing demands for alignment.[1]
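The per-modality preprocessing steps described above can be sketched in a few lines of PyTorch. The toy CNN, STFT parameters, and miniature vocabulary below are illustrative assumptions chosen only to show the typical shape transformations for each modality, not a production pipeline.

```python
import torch
import torch.nn as nn

# --- Visual: a small CNN reduces a raw RGB image to a compact feature vector ---
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
image = torch.randn(1, 3, 224, 224)           # dummy RGB image
visual_feat = cnn(image)                      # shape: (1, 32)

# --- Auditory: a magnitude spectrogram computed from a raw waveform via the STFT ---
waveform = torch.randn(1, 16000)              # 1 second of dummy audio at 16 kHz
spectrogram = torch.stft(
    waveform, n_fft=400, hop_length=160,
    window=torch.hann_window(400), return_complex=True,
).abs()                                       # shape: (1, 201, frames)

# --- Textual: toy whitespace tokenisation followed by a learned embedding table ---
vocab = {"<unk>": 0, "a": 1, "red": 2, "apple": 3}
tokens = torch.tensor([[vocab.get(w, 0) for w in "a red apple".split()]])
text_feat = nn.Embedding(len(vocab), 64)(tokens)   # shape: (1, 3, 64)
```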
Historical Development
Early Foundations
The foundations of multimodal learning trace back to cognitive science, where researchers drew inspiration from human perceptual processes that integrate multiple sensory inputs. A seminal demonstration of this integration is the McGurk effect, discovered in 1976, which illustrates how conflicting auditory and visual cues in speech perception can lead observers to perceive an illusory phoneme that aligns neither with the heard sound nor the seen lip movements. This effect underscored the brain's natural multimodal processing, particularly in audiovisual speech, influencing early computational efforts to mimic such human-like integration for robust perception in noisy environments.[7]
Early computational approaches to multimodal learning emerged in the 1980s with initial work on audio-visual speech recognition (AVSR), such as Petajan's automatic lipreading system to enhance speech recognition accuracy.[8] In the pre-deep learning era, these efforts evolved primarily in the domain of AVSR during the 1990s, leveraging statistical models to combine auditory and visual features. Hidden Markov models (HMMs) became a cornerstone for these systems, extending traditional audio-only speech recognition by incorporating visual lip motion data to enhance accuracy under adverse acoustic conditions. For instance, early HMM-based AVSR frameworks fused audio and visual streams through multi-stream processing, where visual features like optical flow or shape parameters from lip regions were concatenated with acoustic features to model temporal dependencies in speech sequences. Work in the late 1990s demonstrated how such multi-stream HMMs could jointly model audiovisual observations, achieving notable improvements in word error rates for isolated digits and words.
Retrospective surveys and foundational papers highlight how this period progressed toward simple fusion techniques, laying the groundwork for later advancements. Baltrušaitis et al. (2018) provide an overview of multimodal machine learning, emphasizing early methods like feature-level concatenation as a straightforward way to integrate modalities without complex architectures, often applied in AVSR to boost recognition robustness in noisy settings.[9] These pre-2010s milestones, including extensions of HMMs for audiovisual integration in the late 1990s and early 2000s, focused on probabilistic modeling of temporal alignments between modalities, prioritizing interpretability and computational efficiency over scale.
Key Advancements in Deep Learning Era
The advent of deep learning marked a significant shift in multimodal learning, enabling more effective feature extraction from diverse modalities. Around 2012–2015, convolutional neural networks (CNNs) began to be widely adopted for processing visual data, while recurrent neural networks (RNNs) handled sequential modalities like text or audio, allowing for joint representations in multimodal tasks. For instance, early works integrated CNNs with sentence embeddings for image-sentence matching, demonstrating improved alignment through end-to-end training. Similarly, multimodal RNNs were introduced to generate image captions by modeling probabilistic distributions over sequences conditioned on visual features, paving the way for unified architectures.[10][11]
Key milestones in the deep learning era further advanced cross-modal understanding. In 2021, CLIP introduced contrastive pretraining on large-scale image-text pairs, enabling zero-shot transfer to diverse vision-language tasks without task-specific fine-tuning. That same year, Perceiver IO emerged as a unified architecture capable of processing arbitrary multimodal inputs and outputs, scaling efficiently beyond traditional Transformers for tasks like image classification and audio generation. By 2022, Flamingo advanced few-shot learning by combining frozen vision and language models into a visual language model that could handle interleaved image-text inputs, achieving state-of-the-art performance on open-ended multimodal benchmarks with minimal examples. Concurrently, VideoMAE extended masked autoencoder pretraining to videos, improving self-supervised learning for spatiotemporal data by achieving high masking ratios while maintaining data efficiency.[12][13][14][15]
Recent trends up to 2025 have emphasized large-scale pretraining and integration across modalities. Models like BLIP-2 (2023) bootstrapped vision-language capabilities using frozen image encoders and large language models, enhancing generative tasks through efficient parameter utilization. LLaVA (2023) further pushed visual instruction tuning, aligning vision encoders with LLMs to enable general-purpose multimodal reasoning. In 2024, OpenAI's GPT-4o integrated real-time processing of audio, vision, and text for more natural interactions, while Meta's Llama 3.2 introduced vision capabilities to the open-source LLM ecosystem. In video-audio domains, extensions of masked modeling have supported broader applications, while unified frameworks continue to proliferate. The impact of scaling—through vast datasets and increased compute—has led to emergent multimodal capabilities, such as robust zero-shot generalization and in-context learning across modalities, mirroring patterns observed in large language models but extended to vision and audio integration.[16][17][18][19][20]
Architectures and Techniques
Fusion Methods
Fusion methods in multimodal learning refer to the techniques used to integrate information from multiple data modalities, such as visual, textual, and audio inputs, to produce a unified representation that captures inter-modal relationships. These methods are essential for leveraging complementary strengths across modalities, enabling improved performance in tasks like sentiment analysis and visual question answering.
Fusion can occur at different levels depending on the stage of processing where integration happens. Early fusion combines low-level features from each modality immediately after extraction, often through simple concatenation to form a joint embedding, such as \mathbf{z} = [\mathbf{x}_v; \mathbf{x}_t], where \mathbf{x}_v represents the visual embedding and \mathbf{x}_t the textual embedding; this approach allows joint training but can suffer from high dimensionality and noise propagation. Late fusion, in contrast, integrates high-level decisions or predictions from individual unimodal models, typically via averaging, voting, or weighted summation, which preserves modality-specific processing but may overlook subtle cross-modal interactions.[21] Intermediate fusion strikes a balance by merging mid-level representations, such as through tensor operations or attention mechanisms, to capture both intra- and inter-modal dependencies more effectively than the other levels.
Strategies for fusion broadly divide into joint and sequential processing paradigms. Joint training processes all modalities simultaneously in a shared network, projecting them into a common latent space to learn unified features, which is particularly effective for capturing correlations but requires aligned data across modalities.[21] Sequential processing, however, handles modalities in a pipeline, where outputs from one modality inform the next, often using recurrent structures; this is useful for asynchronous data but can introduce error accumulation. To address modality imbalance, where some inputs may dominate due to varying informativeness or noise, gating mechanisms dynamically weight contributions from each modality, such as by computing a gate vector \mathbf{g} = \sigma(\mathbf{W} [\mathbf{x}_v; \mathbf{x}_t]) to modulate the fused output, enhancing robustness in unbalanced scenarios.[21]
Key techniques for effective fusion include cross-modal attention and bilinear pooling. Cross-modal attention mechanisms align and weigh relevant features across modalities, for instance, by attending to textual cues when processing visual inputs in tasks like image captioning, improving alignment and interpretability. Bilinear pooling captures second-order interactions between modalities through outer products, as in multi-modal factorized bilinear pooling, which reduces computational complexity while modeling fine-grained dependencies, achieving state-of-the-art results in visual question answering with approximately 1% accuracy improvement over prior bilinear methods like MCB.[22]
Common challenges in fusion include modality dropout, where one or more inputs are missing or unreliable, and misalignment between heterogeneous data.
Modality dropout can be mitigated by training with random omission of modalities, which encourages the network to remain robust when an input stream is missing at inference time.[21] For alignment, canonical correlation analysis (CCA) projects modalities into a shared space by maximizing their correlation, formulated as finding projection matrices \mathbf{W}_v and \mathbf{W}_t that optimize \rho = \frac{\mathbf{W}_v^T \mathbf{C}_{vt} \mathbf{W}_t}{\sqrt{\mathbf{W}_v^T \mathbf{C}_{vv} \mathbf{W}_v \mathbf{W}_t^T \mathbf{C}_{tt} \mathbf{W}_t}}, where \mathbf{C} denotes the corresponding covariance and cross-covariance matrices; this unsupervised method has been foundational for tasks like audiovisual speech recognition.
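As a concrete illustration of the concatenation and gating formulas above, the following PyTorch sketch implements gated early fusion; the layer sizes and module name are illustrative assumptions rather than an implementation from the cited works.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Early fusion by concatenation, with a learned gate that re-weights the joint embedding."""
    def __init__(self, v_dim: int, t_dim: int, out_dim: int):
        super().__init__()
        self.gate = nn.Linear(v_dim + t_dim, v_dim + t_dim)  # produces g = sigmoid(W [x_v; x_t])
        self.proj = nn.Linear(v_dim + t_dim, out_dim)

    def forward(self, x_v: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        z = torch.cat([x_v, x_t], dim=-1)        # early fusion: z = [x_v; x_t]
        g = torch.sigmoid(self.gate(z))          # per-dimension modality weights in (0, 1)
        return self.proj(g * z)                  # gated, fused representation

# Toy usage with assumed feature sizes: 512-d visual and 300-d textual embeddings
fusion = GatedFusion(v_dim=512, t_dim=300, out_dim=128)
fused = fusion(torch.randn(8, 512), torch.randn(8, 300))   # batch of 8 paired samples
```

The sigmoid gate lets the network down-weight a noisy or uninformative modality on a per-sample basis, which is one simple way to address the modality imbalance discussed above.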
Transformer-Based Architectures
Transformer-based architectures have revolutionized multimodal learning by leveraging the self-attention mechanism to capture interactions across different modalities, such as vision and language, through shared or cross-attention layers. Originally introduced for sequence transduction tasks, the transformer model enables parallelizable processing and the modeling of long-range dependencies, which are particularly advantageous for aligning heterogeneous data like images and text. In multimodal settings, adaptations involve modality-specific encoders—such as convolutional networks for visual features or BERT-like models for text—that feed into a fusion transformer, allowing cross-modal attention to model relationships between elements from different inputs.
Early seminal works in 2019 demonstrated these adaptations effectively. ViLBERT, for instance, employs a two-stream architecture with co-attention layers that enable separate processing of visual and linguistic inputs before fusing them via shared transformer layers, achieving state-of-the-art results on vision-and-language tasks like visual question answering. Similarly, VisualBERT integrates visual features directly into a single BERT-like transformer stack, treating image regions as additional tokens to learn joint representations in a unified manner. LXMERT extends this by using three specialized encoders—an object relationship encoder for vision, a language encoder, and a cross-modality encoder—connected through transformer-based cross-attention, which pretrains on large-scale datasets to capture bidirectional alignments.[23][24][25]
Subsequent evolution led to unified transformer models that handle multiple modalities end-to-end without separate streams. The Generative Image-to-Text Transformer (GIT), introduced in 2022, unifies vision-and-language tasks like captioning and question answering in a single autoregressive transformer, where visual tokens are prefixed to text sequences for joint generation. Building on this, Kosmos-1 (2023) presents a multimodal large language model that aligns perception with language through interleaved image and text tokens in a transformer decoder, enabling emergent capabilities such as zero-shot image captioning and multilingual understanding.[26][27]
At the core of these architectures is the cross-attention mechanism, which allows one modality to attend to another. Typically, after modality-specific encoding, a fusion transformer applies cross-attention where queries from one modality (e.g., visual) interact with keys and values from another (e.g., text). This is formalized as:
\text{Attention}(Q_v, K_t, V_t) = \text{softmax}\left(\frac{Q_v K_t^T}{\sqrt{d_k}}\right) V_t
Here, Q_v represents visual queries, K_t and V_t are text keys and values, and d_k is the key dimension, enabling the model to weigh textual context relevant to visual elements. This mechanism, rooted in the original transformer design, facilitates scalable fusion in multimodal transformers.
By 2024, these principles scaled to production models like GPT-4o, which integrates real-time processing of audio, vision, and text within a unified transformer framework, supporting applications from speech-to-text translation to visual reasoning with low latency. This progression highlights transformers' dominance in multimodal learning, emphasizing efficient cross-modal interactions for robust representation learning.
As of 2025, continued advancements in scaling and efficiency have further enhanced cross-modal capabilities in newer generations of multimodal large language models.[18]
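The cross-attention formula above translates directly into code. The following minimal sketch (with assumed toy dimensions for visual patch tokens and text tokens) shows visual queries attending over textual keys and values; production models wrap this operation in multi-head attention with learned projection matrices.

```python
import math
import torch

def cross_attention(q_v: torch.Tensor, k_t: torch.Tensor, v_t: torch.Tensor) -> torch.Tensor:
    """Computes softmax(Q_v K_t^T / sqrt(d_k)) V_t: visual queries attend over text keys/values."""
    d_k = k_t.size(-1)
    scores = q_v @ k_t.transpose(-2, -1) / math.sqrt(d_k)    # (batch, n_visual, n_text)
    return torch.softmax(scores, dim=-1) @ v_t               # (batch, n_visual, d_k)

# Toy example: 49 visual patch tokens attend over 12 text tokens, all 64-dimensional
q_v = torch.randn(2, 49, 64)
k_t = torch.randn(2, 12, 64)
v_t = torch.randn(2, 12, 64)
attended = cross_attention(q_v, k_t, v_t)    # shape: (2, 49, 64)
```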
Generative and Probabilistic Models
Generative and probabilistic models in multimodal learning provide a framework for capturing the joint distribution of multiple data modalities through latent variable representations, enabling the synthesis of coherent multimodal outputs. These approaches model the underlying probabilistic structure of data, often using undirected graphical models or variational inference to approximate complex joint probabilities. A foundational example is the multimodal deep Boltzmann machine (MDBM), which extends deep Boltzmann machines to handle diverse modalities like images and text by stacking energy-based layers that learn shared hidden representations. Introduced by Srivastava and Salakhutdinov, MDBMs define an energy function over visible and hidden units across modalities, allowing for generative sampling via Gibbs or annealed importance sampling, and demonstrating improved performance in tasks like cross-modal retrieval on datasets such as NYU2.[28]
Building on these foundations, multimodal variational autoencoders (MVAEs) advance probabilistic modeling by incorporating variational inference to learn disentangled latent representations that factorize modality-specific and shared information. Proposed by Wu and Goodman, MVAEs optimize a lower bound on the joint likelihood by training separate encoders and decoders for each modality while sharing a common latent space, enabling efficient inference and generation of missing modalities. The core formulation expresses the joint likelihood as p(\mathbf{x}_v, \mathbf{x}_t) = \int p(\mathbf{z}) p(\mathbf{x}_v|\mathbf{z}) p(\mathbf{x}_t|\mathbf{z}) \, d\mathbf{z}, where \mathbf{x}_v and \mathbf{x}_t represent visual and textual modalities, \mathbf{z} is the latent variable, and the integral is approximated via the evidence lower bound (ELBO). This structure facilitates weakly supervised learning, as shown in experiments on MNIST variants and CMU-MOSEI, where MVAEs outperform unimodal VAEs in reconstruction and imputation quality.
Recent generative advancements have integrated probabilistic models with diffusion processes and adversarial training to enhance multimodal synthesis capabilities. Diffusion models, such as the MM-Diffusion framework, extend denoising diffusion probabilistic models (DDPMs) to joint audio-video generation by defining coupled denoising networks that progressively refine noise-added multimodal inputs, achieving high-fidelity outputs on datasets like CREPE and VGGSound. Developed by Ruan et al., this approach models the reverse diffusion process in a shared latent space, surpassing prior GAN-based methods in sample diversity and temporal consistency.[29] Complementing these, hybrid GAN-VAE architectures combine the stable latent modeling of VAEs with the sharp output generation of GANs for audio-visual tasks; for instance, Sound2Sight employs a VAE-GAN encoder-decoder to generate video frames from audio cues and contextual priors, leveraging a stochastic prior conditioned on multimodal inputs to produce dynamic visual sequences on the AVE dataset. By 2025, such hybrids have evolved to incorporate conditional mechanisms, improving controllability in synthesis while maintaining probabilistic guarantees.[30]
Unlike discriminative models that focus on boundary separation for classification or regression, generative and probabilistic approaches in multimodal learning emphasize modeling the full data distribution to enable sampling of novel, unseen multimodal instances.
This generative paradigm supports creative applications like data augmentation and scenario simulation, where the ability to draw from p(\mathbf{x}_1, \dots, \mathbf{x}_M) across M modalities contrasts with the conditional predictions of discriminative methods, often leading to more robust handling of incomplete or noisy inputs.[28]
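The joint-likelihood factorization described above can be illustrated with a deliberately simplified sketch: a single joint encoder feeding a shared latent variable, one decoder per modality, and an ELBO combining reconstruction and KL terms. Note that the original MVAE uses a product-of-experts over modality-specific encoders; the single-encoder variant, layer sizes, and Gaussian (MSE) reconstruction losses here are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultimodalVAE(nn.Module):
    """Shared latent z with one decoder per modality, trained with a joint ELBO."""
    def __init__(self, v_dim: int = 784, t_dim: int = 100, z_dim: int = 20):
        super().__init__()
        self.enc = nn.Linear(v_dim + t_dim, 2 * z_dim)   # joint encoder -> (mu, logvar)
        self.dec_v = nn.Linear(z_dim, v_dim)             # models p(x_v | z)
        self.dec_t = nn.Linear(z_dim, t_dim)             # models p(x_t | z)

    def forward(self, x_v, x_t):
        mu, logvar = self.enc(torch.cat([x_v, x_t], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        return self.dec_v(z), self.dec_t(z), mu, logvar

def elbo_loss(x_v, x_t, recon_v, recon_t, mu, logvar):
    # Reconstruction terms for both modalities plus the KL divergence to the prior N(0, I)
    recon = F.mse_loss(recon_v, x_v, reduction="sum") + F.mse_loss(recon_t, x_t, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Toy usage on random "visual" and "textual" feature vectors
model = TinyMultimodalVAE()
x_v, x_t = torch.randn(16, 784), torch.randn(16, 100)
loss = elbo_loss(x_v, x_t, *model(x_v, x_t))
```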
Applications
Vision and Language Integration
Vision and language integration represents a pivotal application of multimodal learning, where visual data from images or videos is combined with textual information to enable tasks that require understanding across both domains. Core tasks in this area include image and video captioning, which generates descriptive textual summaries of visual content; visual question answering (VQA), where systems respond to natural language queries about images; and cross-modal retrieval, which allows searching for images using text queries or vice versa. These tasks leverage the complementary strengths of vision and language, such as visual details for object recognition and textual context for semantic interpretation, to achieve more robust comprehension than unimodal approaches.
In practical examples, vision-language models have been integrated into search engines to enhance user experiences. For instance, Google Lens by 2023 incorporated multimodal capabilities, allowing users to upload images and receive textual explanations or related searches, improving accessibility and information retrieval efficiency. A specific case is the VQA v2 dataset introduced in 2017, where multimodal models demonstrated accuracy gains of 10-20% over unimodal baselines by jointly processing visual and linguistic inputs, highlighting the value of integration in handling complex queries. Additionally, contrastive learning techniques, as exemplified by the CLIP model, enable zero-shot tasks like classifying images based on textual descriptions without task-specific training, by aligning visual and language embeddings in a shared space.
The real-world impact of these integrations is evident in accessibility tools for the visually impaired, such as apps that provide audio descriptions of surroundings via image captioning and VQA, empowering users with real-time environmental understanding. By 2025, deployments in e-commerce have advanced multimodal product search, where customers can query items using a combination of text and images—such as describing a "red dress like this photo"—leading to more precise recommendations and higher conversion rates in platforms like Amazon and Alibaba. These applications underscore the scalability of vision-language models in everyday digital interactions.
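Zero-shot image classification with CLIP, as described above, can be sketched in a few lines using a publicly released checkpoint through the Hugging Face transformers library. The candidate labels and placeholder image below are illustrative, and running the snippet downloads the pretrained weights on first use.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint and its paired preprocessor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A stand-in image; in practice this would be a real photo loaded from disk
image = Image.new("RGB", (224, 224), color="red")
labels = ["a photo of a red apple", "a photo of a red ball", "a photo of a dog"]

# Encode the image and the candidate captions, then compare them in the shared space
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)   # image-to-text similarity scores
print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is supplied only at inference time as natural-language prompts, no task-specific training is needed, which is what makes the classification "zero-shot".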
Multimodal AI in Healthcare and Robotics
In healthcare, multimodal AI systems integrate diverse data modalities, such as medical imaging, electronic health records (EHR), and genomic information, to enable more precise diagnostics and personalized treatment planning. For example, frameworks leveraging MRI scans, structured EHR data, and genomic single nucleotide polymorphisms (SNPs) have advanced early detection of Alzheimer's disease by fusing these inputs through deep learning models, achieving superior predictive performance compared to single-modality approaches. Applications utilizing the MIMIC-IV dataset, which provides de-identified EHR from intensive care units spanning 2008–2019, have incorporated imaging and clinical text in the 2020s to predict patient outcomes, demonstrating improved risk stratification for conditions like sepsis and mortality. These integrations often yield accuracy gains, with multimodal models for Alzheimer's detection reporting up to a 15% improvement in classification accuracy over unimodal baselines, highlighting the value of complementary data sources in capturing disease heterogeneity.[31][32][33]
An example of multimodal AI in diagnostics involves integrating chest X-rays with clinical parameters and EHR using transformer-based architectures to enhance pathology identification and severity assessment, with applicability to scenarios like the COVID-19 pandemic for triage and resource allocation in healthcare settings. These models fuse radiographic features with textual symptom descriptions, outperforming unimodal imaging approaches by improving accuracy in detecting disease progression and reducing false positives through contextualization of visual findings with narrative data, as demonstrated on large-scale ICU datasets.[34]
In robotics, multimodal AI supports robust perception and decision-making through sensor fusion, particularly for navigation and interaction tasks. Autonomous vehicles, for instance, rely on integrating LiDAR for depth mapping, cameras for semantic understanding, and inertial measurement units (IMUs) for motion tracking, enabling real-time environmental modeling in dynamic scenarios like urban driving. Multimodal reinforcement learning further advances manipulation tasks, such as grasping and assembly, by incorporating tactile, visual, and proprioceptive signals to improve sample efficiency and adaptability in contact-rich environments. By 2025, robotic assistants in eldercare have incorporated speech recognition and gesture interpretation to facilitate natural human-robot interactions, aiding activities like medication reminders and mobility support while enhancing user engagement.[35][36][37]
Ethical considerations in these applications are paramount, particularly regarding bias in multimodal health data, which can arise from underrepresented demographics in training sets and lead to inequitable outcomes. For example, imbalances in EHR and genomic datasets may exacerbate disparities in disease prediction for minority groups, necessitating mitigation strategies like fairness-aware algorithms and diverse data curation to ensure equitable deployment. Regulatory bodies, including the World Health Organization, emphasize transparency and bias auditing in large multimodal models to uphold patient trust and clinical validity.[38][39]
Evaluation and Challenges
Datasets and Metrics
Multimodal learning relies on diverse datasets that integrate multiple data modalities, such as images with text or audio with visuals, to train and evaluate models effectively. One foundational dataset is Microsoft COCO (Common Objects in Context), introduced in 2014, which comprises over 330,000 images (more than 200,000 of them labeled with object segmentations and keypoints) along with 616,745 captions covering 123,349 images, primarily supporting image-captioning tasks in vision-language integration. Building on this, the Visual Genome dataset, released in 2016, extends dense annotations to over 108,000 images sourced from COCO, including 3.8 million object instances, 2.8 million attributes, and 2.3 million relationships between entities, enabling more comprehensive scene understanding and visual question answering.[40] For audio-visual scenarios, AudioSet, launched in 2017 by Google, provides 2.1 million 10-second YouTube video clips labeled with 632 audio event classes, facilitating tasks like sound event detection and cross-modal alignment. Scaling to massive sizes, LAION-5B, developed in 2022, offers 5.85 billion CLIP-filtered image-text pairs scraped from the web, with extensions like LAION-Aesthetics (2023) incorporating quality filters for aesthetic relevance, supporting large-scale pretraining of multimodal models up to 2025.[41] More recently, the CASTLE dataset (2024) provides multimodal video data from ego- and exo-centric perspectives for human activity understanding, comprising thousands of recordings.[42]
Standard metrics for evaluating multimodal performance emphasize alignment, retrieval accuracy, and robustness across modalities. In image captioning, CIDEr (Consensus-based Image Description Evaluation), proposed in 2015, computes TF-IDF-weighted n-gram overlap to capture human consensus, outperforming earlier metrics like BLEU on benchmarks such as COCO. Complementing this, SPICE (Semantic Propositional Image Caption Evaluation), introduced in 2016, parses captions into semantic graphs to measure object, relation, and attribute recall, providing deeper semantic assessment beyond surface-level similarity. For cross-modal retrieval tasks, Recall@K (R@K) quantifies the proportion of queries for which a relevant item appears among the top K retrieved results, commonly used in vision-language models like CLIP, where R@1, R@5, and R@10 establish retrieval efficacy on datasets like Flickr30K. Multimodal-specific metrics, such as the modality gap in robustness tests, evaluate performance discrepancies between modalities under perturbations like noise or occlusion, highlighting vulnerabilities in fused representations as seen in audiovisual benchmarks.
Despite their utility, multimodal datasets face significant challenges, including class imbalance where underrepresented modalities or categories skew training, and high annotation costs due to the labor-intensive alignment of diverse data types like synchronized audio-visual clips.[43] To address these, synthetic data generation techniques, such as those using diffusion models or GANs, create augmented multimodal samples to balance distributions and reduce reliance on costly human labeling, as demonstrated in recent works improving dataset diversity for vision-language tasks.
Evaluation protocols in multimodal learning distinguish between zero-shot and fine-tuned settings to assess generalization.
Zero-shot evaluation tests models on unseen tasks using only pretraining alignments, as in CLIP's image-text retrieval without task-specific data, measuring broad transferability. In contrast, fine-tuned benchmarks adapt pretrained models to downstream tasks with labeled data, yielding higher accuracy but requiring computational resources, with protocols often comparing both on standardized splits of datasets like Visual Genome for visual reasoning. The table below summarizes representative multimodal datasets.
| Dataset | Year | Modalities | Size | Primary Use | Source |
|---|---|---|---|---|---|
| COCO | 2014 | Image, Text | 330K images, 617K captions | Image captioning, object detection | arXiv:1405.0312 |
| Visual Genome | 2016 | Image, Text (dense) | 108K images, 3.8M objects | Scene graphs, VQA | arXiv:1602.07332 |
| AudioSet | 2017 | Audio, Video | 2.1M clips, 632 classes | Audio event detection | arXiv:1702.08721 |
| LAION-5B | 2022 | Image, Text | 5.85B pairs | Large-scale pretraining | arXiv:2210.08402 |
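As a concrete illustration of the Recall@K metric discussed above, the following sketch assumes a square similarity matrix for paired queries and candidates, with the ground-truth match for query i located at index i (a common convention for paired image-text test sets); the function name and toy scores are illustrative.

```python
import torch

def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Fraction of queries whose ground-truth match (assumed on the diagonal) is in the top k."""
    topk = similarity.topk(k, dim=-1).indices                    # (n_queries, k) candidate indices
    targets = torch.arange(similarity.size(0)).unsqueeze(-1)     # (n_queries, 1) true indices
    hits = (topk == targets).any(dim=-1).float()                 # 1 if the match was retrieved
    return hits.mean().item()

# Toy text-to-image retrieval: 5 queries scored against 5 candidates
sims = torch.randn(5, 5)
print(recall_at_k(sims, k=1), recall_at_k(sims, k=5))
```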