
Multimodal learning

Multimodal learning, also referred to as multimodal machine learning, is a subfield of machine learning focused on developing computational models that process, integrate, and relate information from multiple heterogeneous data modalities—such as text, images, audio, video, and sensory signals—to emulate human-like perception and enhance understanding of complex real-world phenomena. This approach leverages the complementary strengths of different modalities, where, for instance, visual data provides spatial context while linguistic data offers semantic meaning, leading to improved performance over unimodal systems in tasks requiring holistic reasoning. Originating from early applications in audio-visual speech recognition in the 1980s, the field has evolved significantly with advancements in deep learning, particularly since the introduction of the Transformer architecture in 2017, which enabled scalable processing of sequential multimodal data. At its core, multimodal learning addresses five key technical challenges: representation, which involves encoding diverse data types into compatible formats; translation, mapping information between modalities (e.g., generating text from images); alignment, establishing correspondences across modalities to handle temporal or spatial discrepancies; fusion, combining representations at data, feature, or decision levels to produce unified outputs; and co-learning, transferring knowledge between modalities to mitigate data scarcity in underrepresented ones. Fusion strategies, a cornerstone of the field, are categorized as early (input-level fusion), late (output-level fusion), or hybrid (intermediate merging), with recent methods using attention mechanisms showing superior results in noisy or incomplete data scenarios. Representation learning techniques, such as contrastive learning in models like CLIP (2021), have been pivotal in creating shared embedding spaces, facilitating zero-shot transfer across modalities. The field has seen explosive growth with the advent of Multimodal Large Language Models (MLLMs), which extend large language models such as GPT-4 to incorporate visual and auditory inputs, enabling emergent capabilities in areas such as visual question answering and cross-modal generation; this growth has continued into 2025 with advanced models including Llama 4 and Grok-4 Multimodal. Notable applications span computer vision (e.g., image captioning), video understanding (e.g., with audio cues), healthcare (e.g., multimodal diagnostics from scans and patient records), and robotics (e.g., sensor fusion for navigation). Despite these advances, challenges persist, including the modality gap due to data heterogeneity, handling missing or noisy inputs, and ensuring ethical alignment in biased multimodal datasets. Looking ahead, ongoing research emphasizes self-supervised pretraining on vast multimodal corpora and integration with knowledge graphs to boost reasoning, positioning multimodal learning as a pathway toward artificial general intelligence.

Fundamentals

Definition and Core Concepts

Multimodal learning is a subfield of machine learning that focuses on integrating and processing information from multiple heterogeneous data sources, or modalities, such as text, images, audio, and video, to perform tasks that surpass the capabilities of unimodal approaches. This integration allows models to leverage complementary information across modalities, enabling more comprehensive understanding and decision-making in complex scenarios. Unlike unimodal learning, which relies on a single data type and may suffer from limitations in representation depth, multimodal learning exploits the synergies between modalities to capture richer contextual cues. The core objectives of multimodal learning include enhancing model robustness against noisy or incomplete inputs, deriving richer representations that encode inter-modal relationships, and fostering cross-modal understanding through mechanisms like modality alignment and joint embedding spaces. Modality alignment ensures that corresponding elements across different data types—such as synchronizing visual and textual descriptions—are mapped into a shared representational framework, while joint embedding spaces project diverse modalities into a common latent space to facilitate interaction and transfer of knowledge. These objectives aim to mitigate the silos of unimodal processing, allowing models to generalize better across varied inputs and tasks. Fundamental principles of multimodal learning address the inherent heterogeneity of data types, where each modality exhibits unique statistical properties, dimensions, and noise characteristics, necessitating tailored preprocessing and representation strategies. A key principle is the need for alignment, either temporal (e.g., synchronizing audio and video streams) or spatial (e.g., linking textual labels to image regions), to establish meaningful correspondences between modalities. Additionally, handling missing modalities is crucial, as real-world deployment often involves incomplete inputs; techniques must robustly infer or compensate for absent data streams without significant performance degradation. For instance, in image classification, combining visual image data with descriptive text can improve accuracy by providing contextual disambiguation—such as distinguishing between similar objects like a "red apple" versus a "red ball"—outperforming image-only models, as demonstrated in theoretical analyses showing superiority even on unimodal evaluation tasks.
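
To make the idea of a joint embedding space concrete, the following minimal PyTorch sketch projects precomputed visual and textual features into a common latent space where cross-modal similarity can be measured directly; all dimensions and layer choices are illustrative assumptions rather than the design of any specific system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project two heterogeneous modalities into a shared embedding space.

    Hypothetical dimensions: image features (e.g., from a CNN) and text
    features (e.g., pooled token embeddings) are mapped to a common space
    so that paired inputs can be compared under cosine similarity.
    """
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, img_feat, txt_feat):
        # L2-normalize so distances reflect angular similarity only.
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

model = JointEmbedding()
img = torch.randn(4, 2048)       # dummy visual features
txt = torch.randn(4, 768)        # dummy textual features
z_img, z_txt = model(img, txt)
similarity = z_img @ z_txt.t()   # pairwise cross-modal similarity matrix
print(similarity.shape)          # torch.Size([4, 4])
```

In practice, the projection heads would be trained with an alignment objective, such as a contrastive loss, so that paired image-text inputs score higher than mismatched ones.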

Types of Modalities

In multimodal learning, the primary modalities encompass visual, auditory, textual, and sensory types, each contributing distinct forms of information to enhance model performance through complementary representations. Visual modalities, including static images and dynamic videos, form a cornerstone of the field due to their rich spatial and temporal details, while auditory modalities capture acoustic signals like speech and environmental sounds. Textual modalities handle natural language sequences, and sensory modalities incorporate physical or biological signals such as haptic feedback or physiological measurements. These modalities differ fundamentally in structure, enabling multimodal systems to address limitations of unimodal approaches by integrating diverse streams. Key characteristics of these modalities include variations in dimensionality, susceptibility to noise, and synchronization requirements. Visual data exhibits high dimensionality from pixel grids (e.g., RGB channels in images), making it computationally intensive, and is prone to degradation from factors like occlusion or lighting variations; synchronization challenges arise in videos, where temporal alignment with audio is essential to maintain coherence. Auditory signals are inherently sequential and time-dependent, with moderate dimensionality but high noise levels from background sounds, necessitating precise synchronization in multimodal contexts like speech processing. Textual data, in contrast, is lower-dimensional and discrete, yet introduces uncertainty through linguistic ambiguities or context dependencies, with synchronization often involving temporal or semantic matching to other modalities. Sensory modalities, such as physiological signals (e.g., electrocardiograms) or haptic inputs, vary in dimensionality based on sampling resolution but are typically noisy due to environmental artifacts or biological variability, requiring careful filtering for applications involving human monitoring or interaction. Preprocessing is crucial to standardize these heterogeneous inputs for integration. For visual modalities, convolutional neural networks (CNNs) are commonly employed to extract hierarchical features, reducing raw pixel data into compact representations that preserve spatial hierarchies. Auditory data is transformed into spectrograms—visual representations of frequency over time—or Mel-frequency cepstral coefficients (MFCCs), which mimic human auditory perception by emphasizing perceptually relevant frequencies. Textual inputs undergo tokenization to segment words or subwords, followed by embedding techniques (e.g., via transformer models) to map discrete tokens into continuous vector spaces that capture semantic relationships. Sensory signals often require modality-specific filtering, such as bandpass filters for physiological data, to mitigate noise and extract relevant features like signal amplitude or frequency components. Hybrid modalities extend these primaries by combining them to leverage synergies, as seen in multimodal sentiment analysis, where textual transcripts are paired with visual facial expressions to detect nuanced emotions more accurately than single-modality methods. This approach addresses individual modality weaknesses—such as textual sarcasm undetected in isolation—by cross-validating cues across data types, though it amplifies preprocessing demands for alignment.
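
The preprocessing steps described above can be sketched in a few lines of PyTorch; the network sizes, vocabulary, and signal parameters below are placeholder assumptions chosen only to show how each modality is converted into tensors of comparable form.

```python
import torch
import torch.nn as nn

# --- Visual: a small CNN reduces a raw RGB image to a compact feature vector ---
image = torch.randn(1, 3, 224, 224)                      # dummy RGB image
cnn = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
visual_feat = cnn(image)                                  # shape: (1, 16)

# --- Auditory: a magnitude spectrogram (frequency x time) from a raw waveform ---
waveform = torch.randn(16000)                             # 1 s of 16 kHz audio
spec = torch.stft(waveform, n_fft=400, hop_length=160,
                  return_complex=True).abs()              # (freq_bins, frames)

# --- Textual: whitespace tokenization followed by a learned embedding lookup ---
vocab = {"<unk>": 0, "a": 1, "red": 2, "apple": 3}        # toy vocabulary
tokens = [vocab.get(w, 0) for w in "a red apple".split()]
embed = nn.Embedding(len(vocab), 32)
text_feat = embed(torch.tensor(tokens))                   # (3, 32)

print(visual_feat.shape, spec.shape, text_feat.shape)
```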

Historical Development

Early Foundations

The foundations of multimodal learning trace back to studies of human perception, where researchers drew inspiration from perceptual processes that integrate multiple sensory inputs. A seminal demonstration of this integration is the McGurk effect, discovered in 1976, which illustrates how conflicting auditory and visual cues in speech can lead observers to perceive an illusory phoneme that aligns neither with the heard sound nor the seen lip movements. This effect underscored the brain's natural multimodal processing, particularly in audiovisual speech, influencing early computational efforts to mimic such human-like integration for robust perception in noisy environments. Early computational approaches to multimodal learning emerged in the 1980s with initial work on audio-visual speech recognition (AVSR), such as Petajan's automatic lipreading system designed to enhance recognition accuracy. In the pre-deep learning era, these efforts evolved primarily in the domain of AVSR during the 1990s, leveraging statistical models to combine auditory and visual features. Hidden Markov models (HMMs) became a cornerstone for these systems, extending traditional audio-only recognizers by incorporating visual lip motion data to enhance accuracy under adverse acoustic conditions. For instance, early HMM-based AVSR frameworks fused audio and visual streams through multi-stream processing, where visual features like lip contours or shape parameters from lip regions were concatenated with acoustic features to model temporal dependencies in speech sequences. Work in the late 1990s demonstrated how such multi-stream HMMs could jointly model audiovisual observations, achieving notable improvements in word error rates for isolated digits and words. Key surveys and foundational papers from this period highlight the progression toward simple fusion techniques, laying the groundwork for later advancements. Baltrušaitis et al. (2018) provide an overview of multimodal machine learning, emphasizing early methods like feature-level concatenation as a straightforward way to integrate modalities without complex architectures, often applied in AVSR to boost recognition robustness in noisy settings. These pre-2010s milestones, including extensions of HMMs for audiovisual integration in the late 1990s and early 2000s, focused on probabilistic modeling of temporal alignments between modalities, prioritizing interpretability and computational efficiency over scale.

Key Advancements in Deep Learning Era

The advent of deep learning marked a significant shift in multimodal learning, enabling more effective feature extraction from diverse modalities. Around 2012–2015, convolutional neural networks (CNNs) began to be widely adopted for processing visual data, while recurrent neural networks (RNNs) handled sequential modalities like text or audio, allowing for joint representations in multimodal tasks. For instance, early works integrated CNNs with sentence embeddings for image-sentence matching, demonstrating improved alignment through end-to-end training. Similarly, multimodal RNNs were introduced to generate image captions by modeling probabilistic distributions over sequences conditioned on visual features, paving the way for unified architectures. Key milestones in the deep learning era further advanced cross-modal understanding. In 2021, CLIP introduced contrastive pretraining on large-scale image-text pairs, enabling zero-shot transfer to diverse vision-language tasks without task-specific fine-tuning. That same year, Perceiver IO emerged as a unified architecture capable of processing arbitrary multimodal inputs and outputs, scaling efficiently beyond traditional Transformers for tasks like image classification and audio generation. By 2022, Flamingo advanced few-shot learning by combining frozen vision and language models into a visual language model that could handle interleaved image-text inputs, achieving state-of-the-art performance on open-ended multimodal benchmarks with minimal examples. Concurrently, VideoMAE extended masked autoencoder pretraining to videos, improving self-supervised learning for spatiotemporal data by achieving high masking ratios while maintaining data efficiency. Recent trends up to 2025 have emphasized large-scale pretraining and integration across modalities. Models like BLIP-2 (2023) bootstrapped vision-language capabilities using frozen image encoders and large language models, enhancing generative tasks through efficient parameter utilization. LLaVA (2023) further pushed visual instruction tuning, aligning vision encoders with LLMs to enable general-purpose multimodal reasoning. In 2024, OpenAI's GPT-4o integrated real-time processing of audio, vision, and text for more natural interactions, while Meta's Llama 3.2 introduced vision capabilities to the open-source LLM ecosystem. In video-audio domains, extensions of masked modeling have supported broader applications, while unified frameworks continue to proliferate. The impact of scaling—through vast datasets and increased compute—has led to emergent multimodal capabilities, such as robust zero-shot and in-context learning across modalities, mirroring patterns observed in large language models but extended to vision and audio integration.

Architectures and Techniques

Fusion Methods

Fusion methods in multimodal learning refer to the techniques used to integrate information from multiple data modalities, such as visual, textual, and audio inputs, to produce a unified representation that captures inter-modal relationships. These methods are essential for leveraging complementary strengths across modalities, enabling improved performance in tasks like sentiment analysis and visual question answering. Fusion can occur at different levels depending on the stage of processing where integration happens. Early fusion combines low-level features from each modality immediately after extraction, often through simple concatenation to form a joint embedding, such as \mathbf{z} = [\mathbf{x}_v; \mathbf{x}_t], where \mathbf{x}_v represents the visual embedding and \mathbf{x}_t the textual embedding; this approach allows joint training but can suffer from high dimensionality and noise propagation. Late fusion, in contrast, integrates high-level decisions or predictions from individual unimodal models, typically via averaging, voting, or weighted summation, which preserves modality-specific processing but may overlook subtle cross-modal interactions. Intermediate fusion strikes a balance by merging mid-level representations, such as through tensor operations or attention mechanisms, to capture both intra- and inter-modal dependencies more effectively than the other levels. Strategies for fusion broadly divide into joint and sequential processing paradigms. Joint training processes all modalities simultaneously in a shared network, projecting them into a common embedding space to learn unified features, which is particularly effective for capturing correlations but requires aligned data across modalities. Sequential processing, however, handles modalities in a pipeline, where outputs from one modality inform the next, often using recurrent structures; this is useful for asynchronous data but can introduce error accumulation. To address modality imbalance, where some inputs may dominate due to varying informativeness or noise, gating mechanisms dynamically weight contributions from each modality, such as by computing a gate vector \mathbf{g} = \sigma(\mathbf{W} [\mathbf{x}_v; \mathbf{x}_t]) to modulate the fused output, enhancing robustness in unbalanced scenarios. Key techniques for effective fusion include cross-modal attention and bilinear pooling. Cross-modal attention mechanisms align and weigh relevant features across modalities, for instance, by attending to textual cues when processing visual inputs in tasks like image captioning, improving alignment and interpretability. Bilinear pooling captures second-order interactions between modalities through outer products, as in multi-modal factorized bilinear pooling, which reduces computational complexity while modeling fine-grained dependencies, achieving state-of-the-art results in visual question answering with approximately 1% accuracy improvement over prior bilinear methods like MCB. Common challenges in fusion include modality dropout, where one or more inputs are missing or unreliable, and misalignment between heterogeneous data. Modality dropout is mitigated by training with random omission of modalities, promoting robustness in fusion networks.
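
As a rough illustration of the fusion levels and the gating idea described above, the following PyTorch sketch contrasts early concatenation, late averaging of unimodal predictions, and a sigmoid-gated combination mirroring \mathbf{g} = \sigma(\mathbf{W} [\mathbf{x}_v; \mathbf{x}_t]); the dimensions and class counts are arbitrary assumptions for demonstration only.

```python
import torch
import torch.nn as nn

x_v = torch.randn(8, 128)   # visual embeddings (batch of 8)
x_t = torch.randn(8, 64)    # textual embeddings

# Early fusion: concatenate low-level features into a joint embedding z.
z_early = torch.cat([x_v, x_t], dim=-1)                 # shape (8, 192)

# Late fusion: average class predictions from independent unimodal heads.
head_v, head_t = nn.Linear(128, 5), nn.Linear(64, 5)
z_late = 0.5 * (head_v(x_v).softmax(-1) + head_t(x_t).softmax(-1))

# Gated fusion: a sigmoid gate g weights each modality's contribution.
gate = nn.Sequential(nn.Linear(128 + 64, 128), nn.Sigmoid())
proj_t = nn.Linear(64, 128)                              # match dimensions
g = gate(torch.cat([x_v, x_t], dim=-1))
z_gated = g * x_v + (1 - g) * proj_t(x_t)

print(z_early.shape, z_late.shape, z_gated.shape)
```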
For alignment, canonical correlation analysis (CCA) projects modalities into a shared space by maximizing their correlation, formulated as finding projection matrices \mathbf{W}_v and \mathbf{W}_t to optimize \rho = \frac{\mathbf{W}_v^T \mathbf{C}_{vt} \mathbf{W}_t}{\sqrt{\mathbf{W}_v^T \mathbf{C}_{vv} \mathbf{W}_v \, \mathbf{W}_t^T \mathbf{C}_{tt} \mathbf{W}_t}}, where \mathbf{C} denotes the covariance and cross-covariance matrices; this unsupervised method has been foundational for alignment tasks such as cross-modal retrieval and audiovisual analysis.
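
A small, self-contained example of CCA-style alignment can be built with scikit-learn's CCA on synthetic two-view data generated from a shared latent source; the data, dimensions, and component count are purely illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 4))   # latent source shared by both views
# "Visual" and "textual" views: linear mixes of the shared source plus noise.
X_v = shared @ rng.normal(size=(4, 50)) + 0.1 * rng.normal(size=(200, 50))
X_t = shared @ rng.normal(size=(4, 30)) + 0.1 * rng.normal(size=(200, 30))

cca = CCA(n_components=4)
Z_v, Z_t = cca.fit_transform(X_v, X_t)   # projected views W_v^T x, W_t^T x

# Correlation of the first pair of canonical variates (close to 1 here,
# since both views are driven by the same latent source).
print(np.corrcoef(Z_v[:, 0], Z_t[:, 0])[0, 1])
```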

Transformer-Based Architectures

Transformer-based architectures have revolutionized multimodal learning by leveraging the self-attention mechanism to capture interactions across different modalities, such as vision and language, through shared or cross-attention layers. Originally introduced for sequence transduction tasks, the transformer model enables parallelizable processing and modeling of long-range dependencies, which are particularly advantageous for aligning heterogeneous data like images and text. In multimodal settings, adaptations involve modality-specific encoders—such as convolutional networks for visual features or BERT-like models for text—that feed into a transformer, allowing cross-modal attention to model relationships between elements from different inputs. Early seminal works in vision-language pretraining demonstrated these adaptations effectively. ViLBERT, for instance, employs a two-stream architecture with co-attention layers that enable separate processing of visual and linguistic inputs before fusing them via shared layers, achieving state-of-the-art results on vision-and-language tasks like visual question answering. Similarly, VisualBERT integrates visual features directly into a single BERT-like stack, treating image regions as additional tokens to learn joint representations in a unified manner. LXMERT extends this by using three specialized encoders—an object relationship encoder for vision, a language encoder, and a cross-modality encoder—connected through transformer-based cross-attention, which pretrains on large-scale datasets to capture bidirectional alignments. Subsequent evolution led to unified transformer models that handle multiple modalities end-to-end without separate streams. The Generative Image-to-Text Transformer (GIT), introduced in 2022, unifies vision-and-language tasks like captioning and question answering in a single autoregressive transformer, where visual tokens are prefixed to text sequences for joint generation. Building on this, Kosmos-1 (2023) presents a multimodal large language model that aligns perception with language through interleaved image and text tokens in a transformer decoder, enabling emergent capabilities such as zero-shot image captioning and multilingual understanding. At the core of these architectures is the cross-attention mechanism, which allows one modality to attend to another. Typically, after modality-specific encoding, a fusion transformer applies cross-attention where queries from one modality (e.g., visual) interact with keys and values from another (e.g., text). This is formalized as: \text{Attention}(Q_v, K_t, V_t) = \mathrm{softmax}\left(\frac{Q_v K_t^T}{\sqrt{d_k}}\right) V_t Here, Q_v represents visual queries, K_t and V_t are text keys and values, and d_k is the key dimension, enabling the model to weigh textual context relevant to visual elements. This mechanism, rooted in the original transformer design, facilitates scalable fusion in multimodal transformers. By 2024, these principles scaled to production models like GPT-4o, which integrates real-time processing of audio, vision, and text within a unified framework, supporting applications from speech-to-text translation to visual reasoning with low latency. This progression highlights transformers' dominance in multimodal learning, emphasizing efficient cross-modal interactions for robust representation learning. As of 2025, continued advancements in scaling and efficiency have further enhanced cross-modal tasks in models like updated MLLMs.
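
The cross-attention step above maps directly onto standard attention modules; the sketch below uses PyTorch's nn.MultiheadAttention with visual tokens as queries and text tokens as keys and values, where the token counts and model width are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
visual_tokens = torch.randn(1, 49, d_model)   # e.g., 7x7 grid of patch features
text_tokens = torch.randn(1, 12, d_model)     # e.g., encoded caption tokens

# Cross-attention: visual queries attend to textual keys/values,
# instantiating Attention(Q_v, K_t, V_t) from the equation above.
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
fused, weights = cross_attn(query=visual_tokens,
                            key=text_tokens,
                            value=text_tokens)

print(fused.shape)     # (1, 49, 256): visual tokens enriched with text context
print(weights.shape)   # (1, 49, 12): how much each visual token attends to each word
```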

Generative and Probabilistic Models

Generative and probabilistic models in multimodal learning provide a framework for capturing the joint distribution of multiple data modalities through latent variable representations, enabling the synthesis of coherent multimodal outputs. These approaches model the underlying probabilistic structure of data, often using undirected graphical models or variational inference to approximate complex joint probabilities. A foundational example is the multimodal deep Boltzmann machine (MDBM), which extends deep Boltzmann machines to handle diverse modalities like images and text by stacking energy-based layers that learn shared hidden representations. Introduced by Srivastava and Salakhutdinov, MDBMs define an energy function over visible and hidden units across modalities, allowing for generative sampling via Gibbs sampling and demonstrating improved performance in tasks like cross-modal retrieval on bimodal image-text benchmarks. Building on these foundations, multimodal variational autoencoders (MVAEs) advance probabilistic modeling by incorporating variational inference to learn disentangled latent representations that factorize modality-specific and shared information. Proposed by Wu and Goodman, MVAEs optimize a lower bound on the joint likelihood by training separate encoders and decoders for each modality while sharing a common latent space, enabling efficient inference and generation of missing modalities. The core formulation expresses the joint likelihood as p(\mathbf{x}_v, \mathbf{x}_t) = \int p(\mathbf{z}) p(\mathbf{x}_v|\mathbf{z}) p(\mathbf{x}_t|\mathbf{z}) \, d\mathbf{z}, where \mathbf{x}_v and \mathbf{x}_t represent visual and textual modalities, \mathbf{z} is the latent variable, and the integral is approximated via the evidence lower bound (ELBO). This structure facilitates weakly supervised learning, as shown in experiments on MNIST variants and other multimodal benchmarks, where MVAEs outperform unimodal VAEs in reconstruction and imputation quality. Recent generative advancements have integrated probabilistic models with diffusion processes and adversarial training to enhance multimodal synthesis capabilities. Diffusion models, such as the MM-Diffusion framework, extend denoising diffusion probabilistic models (DDPMs) to joint audio-video generation by defining coupled denoising networks that progressively refine noise-added multimodal inputs, achieving high-fidelity outputs on joint audio-video benchmarks. Developed by Ruan et al., this approach models the reverse diffusion process in a shared latent space, surpassing prior GAN-based methods in sample diversity and temporal consistency. Complementing these, hybrid GAN-VAE architectures combine the stable latent modeling of VAEs with the sharp output generation of GANs for audio-visual tasks; for instance, Sound2Sight employs a VAE-GAN encoder-decoder to generate video frames from audio cues and contextual priors, leveraging a stochastic prior conditioned on multimodal inputs to produce dynamic visual sequences on the AVE dataset. By 2025, such hybrids have evolved to incorporate conditional mechanisms, improving controllability in synthesis while maintaining probabilistic guarantees. Unlike discriminative models that focus on decision-boundary separation for classification or regression, generative and probabilistic approaches in multimodal learning emphasize modeling the full data distribution to enable sampling of novel, unseen multimodal instances.
This generative paradigm supports creative applications like cross-modal synthesis and scenario simulation, where the ability to draw samples from p(\mathbf{x}_1, \dots, \mathbf{x}_M) across M modalities contrasts with the conditional predictions of discriminative methods, often leading to more robust handling of incomplete or noisy inputs.
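
The ELBO-based formulation can be illustrated with a deliberately simplified two-modality VAE; unlike the MVAE proper, which combines modality-specific encoders (e.g., via a product of experts), this sketch uses a single joint encoder and Gaussian assumptions purely to show how reconstruction and KL terms combine under a shared latent variable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultimodalVAE(nn.Module):
    """Minimal two-modality VAE: one shared latent z, one decoder per modality."""
    def __init__(self, dim_v=64, dim_t=32, dim_z=8):
        super().__init__()
        self.enc = nn.Linear(dim_v + dim_t, 2 * dim_z)   # joint encoder -> (mu, logvar)
        self.dec_v = nn.Linear(dim_z, dim_v)              # p(x_v | z)
        self.dec_t = nn.Linear(dim_z, dim_t)              # p(x_t | z)

    def forward(self, x_v, x_t):
        mu, logvar = self.enc(torch.cat([x_v, x_t], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return self.dec_v(z), self.dec_t(z), mu, logvar

def elbo(x_v, x_t, model):
    rec_v, rec_t, mu, logvar = model(x_v, x_t)
    # Reconstruction terms approximate E_q[log p(x_v|z) + log p(x_t|z)];
    # the KL term pulls q(z|x_v, x_t) toward the prior p(z) = N(0, I).
    rec = F.mse_loss(rec_v, x_v, reduction="sum") + F.mse_loss(rec_t, x_t, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return -(rec + kl)   # maximizing the ELBO == minimizing (rec + kl)

model = TinyMultimodalVAE()
x_v, x_t = torch.randn(16, 64), torch.randn(16, 32)
print(elbo(x_v, x_t, model).item())
```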

Applications

Vision and Language Integration

Vision and language integration represents a pivotal application of multimodal learning, where visual data from images or videos is combined with textual information to enable tasks that require understanding across both domains. Core tasks in this area include image and video captioning, which generates descriptive textual summaries of visual content; visual question answering (VQA), where systems respond to queries about images; and cross-modal retrieval, which allows searching for images using text queries or vice versa. These tasks leverage the complementary strengths of vision and language, such as visual details for grounding and textual context for semantic interpretation, to achieve more robust comprehension than unimodal approaches. In practical examples, vision-language models have been integrated into search engines to enhance user experiences. For instance, major search engines had by 2023 incorporated multimodal capabilities, allowing users to upload images and receive textual explanations or related searches, improving accessibility and efficiency. A specific case is the VQA v2 dataset introduced in 2017, where multimodal models demonstrated accuracy gains of 10-20% over unimodal baselines by jointly processing visual and linguistic inputs, highlighting the value of integration in handling complex queries. Additionally, contrastive learning techniques, as exemplified by the CLIP model, enable zero-shot tasks like classifying images based on textual descriptions without task-specific training, by aligning visual and language embeddings in a shared space. The real-world impact of these integrations is evident in accessibility tools for the visually impaired, such as apps that provide audio descriptions of surroundings via image captioning and VQA, empowering users with real-time environmental understanding. By 2025, deployments in e-commerce have advanced multimodal product search, where customers can query items using a combination of text and images—such as describing a "red dress like this photo"—leading to more precise recommendations and higher conversion rates on platforms such as Alibaba. These applications underscore the scalability of vision-language models in everyday digital interactions.
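
A zero-shot classification example in the spirit of CLIP can be sketched with the Hugging Face transformers library; the checkpoint name, image path, and candidate labels below are illustrative assumptions, and the general pattern (encode the image and the candidate texts, then compare similarities) is the point of the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                          # any local image file
labels = ["a red apple", "a red ball", "a fire truck"]   # candidate text classes

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into zero-shot class probabilities without any task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```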

Multimodal AI in Healthcare and Robotics

In healthcare, multimodal AI systems integrate diverse data modalities, such as medical imaging, electronic health records (EHR), and genomic information, to enable more precise diagnostics and personalized treatment planning. For example, frameworks leveraging MRI scans, structured EHR data, and genomic single nucleotide polymorphisms (SNPs) have advanced early detection of Alzheimer's disease by fusing these inputs through deep learning models, achieving superior predictive performance compared to single-modality approaches. Applications utilizing the MIMIC-IV database, which provides de-identified EHR from intensive care units spanning 2008–2019, have incorporated imaging and clinical text in the 2020s to predict patient outcomes, demonstrating improved risk stratification for outcomes such as in-hospital mortality. These integrations often yield accuracy gains, with multimodal models for Alzheimer's detection reporting up to a 15% improvement in classification accuracy over unimodal baselines, highlighting the value of complementary data sources in capturing disease heterogeneity. An example of multimodal AI in diagnostics involves integrating chest X-rays with clinical parameters and EHR using transformer-based architectures to enhance pathology identification and severity assessment, with applicability to scenarios such as pandemic-era screening and triage in healthcare settings. These models fuse radiographic features with textual symptom descriptions, outperforming unimodal approaches by improving accuracy in detecting disease progression and reducing false positives through contextualization of visual findings with narrative data, as demonstrated on large-scale ICU datasets. In robotics, multimodal AI supports robust perception and decision-making through sensor fusion, particularly for navigation and interaction tasks. Autonomous vehicles, for instance, rely on integrating LiDAR for depth mapping, cameras for semantic understanding, and inertial measurement units (IMUs) for motion tracking, enabling real-time environmental modeling in dynamic scenarios like urban driving. Multimodal reinforcement learning further advances manipulation tasks, such as grasping and assembly, by incorporating tactile, visual, and proprioceptive signals to improve sample efficiency and adaptability in contact-rich environments. By 2025, robotic assistants in eldercare have incorporated speech and gesture interpretation to facilitate natural human-robot interactions, aiding activities like medication reminders and mobility support while enhancing user engagement. Ethical considerations in these applications are paramount, particularly regarding bias in multimodal health data, which can arise from underrepresented demographics in training sets and lead to inequitable outcomes. For example, imbalances in EHR and genomic datasets may exacerbate disparities in predictive accuracy for minority groups, necessitating strategies like fairness-aware algorithms and diverse data curation to ensure equitable deployment. Regulatory bodies, including the World Health Organization, emphasize transparency and bias auditing in large multimodal models to uphold patient trust and clinical validity.
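
A simplified version of such imaging-plus-EHR fusion might look like the following PyTorch sketch, where a small convolutional encoder for a scan and a feed-forward encoder for tabular clinical features are concatenated before a risk classifier; every architectural detail here is a placeholder assumption rather than a clinically validated design.

```python
import torch
import torch.nn as nn

# Dummy inputs: a single-channel scan and a vector of structured EHR features
# (e.g., age, lab values); both are random placeholders for illustration.
scan = torch.randn(4, 1, 64, 64)
ehr = torch.randn(4, 10)

image_encoder = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                              nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> (4, 8)
ehr_encoder = nn.Sequential(nn.Linear(10, 8), nn.ReLU())               # -> (4, 8)

# Intermediate fusion: concatenate modality representations, then classify.
classifier = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
logits = classifier(torch.cat([image_encoder(scan), ehr_encoder(ehr)], dim=-1))
risk = logits.softmax(dim=-1)[:, 1]   # predicted probability of the positive class
print(risk)
```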

Evaluation and Challenges

Datasets and Metrics

Multimodal learning relies on diverse datasets that integrate multiple data modalities, such as images with text or audio with visuals, to train and evaluate models effectively. One foundational dataset is COCO (Common Objects in Context), introduced in 2014, which comprises over 330,000 images, more than 200,000 of which are labeled with object segmentations and keypoints, along with 616,745 captions for 123,349 images, primarily supporting image-captioning tasks in vision-language integration. Building on this, the Visual Genome dataset, released in 2016, extends dense annotations to over 108,000 images sourced from COCO, including 3.8 million object instances, 2.8 million attributes, and 2.3 million relationships between entities, enabling more comprehensive scene understanding and visual question answering. For audio-visual scenarios, AudioSet, launched in 2017 by Google, provides 2.1 million 10-second video clips labeled with 632 audio event classes, facilitating tasks like sound event detection and cross-modal alignment. Scaling to massive sizes, LAION-5B, developed in 2022, offers 5.85 billion CLIP-filtered image-text pairs scraped from the web, with extensions like LAION-Aesthetics incorporating quality filters for aesthetic relevance, supporting large-scale pretraining of multimodal models up to 2025. More recently, the Ego-Exo4D dataset (2024) provides multimodal video data from ego- and exo-centric perspectives for human activity understanding, comprising thousands of recordings. Standard metrics for evaluating multimodal performance emphasize alignment, retrieval accuracy, and robustness across modalities. In image captioning, CIDEr (Consensus-based Image Description Evaluation), proposed in 2015, computes n-gram overlap weighted by term frequency to capture human consensus, outperforming earlier metrics like BLEU on benchmarks such as COCO. Complementing this, SPICE (Semantic Propositional Image Caption Evaluation), introduced in 2016, parses captions into semantic graphs to measure object, relation, and attribute recall, providing deeper semantic assessment beyond surface-level similarity. For cross-modal retrieval tasks, Recall@K (R@K) quantifies the proportion of relevant items retrieved in the top K results, commonly used in vision-language models like CLIP, where R@1, R@5, and R@10 establish retrieval efficacy on datasets like Flickr30K. Multimodal-specific metrics, such as the modality gap in robustness tests, evaluate performance discrepancies between modalities under perturbations like noise or missing inputs, highlighting vulnerabilities in fused representations as seen in robustness benchmarks. Despite their utility, multimodal datasets face significant challenges, including class imbalance, where underrepresented modalities or categories skew training, and high annotation costs due to the labor-intensive labeling of diverse data types like synchronized audio-visual clips. To address these, synthetic data generation techniques, such as those using diffusion models or GANs, create augmented multimodal samples to balance distributions and reduce reliance on costly labeling, as demonstrated in recent works improving diversity for vision-language tasks. Evaluation protocols in multimodal learning distinguish between zero-shot and fine-tuned settings to assess generalization. Zero-shot evaluation tests models on unseen tasks using only pretraining alignments, as in CLIP's image-text retrieval without task-specific data, measuring broad transferability.
In contrast, fine-tuned benchmarks adapt pretrained models to downstream tasks with labeled data, yielding higher accuracy but requiring additional computational resources, with protocols often comparing both settings on standardized splits of datasets like Visual Genome for visual reasoning.
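
Recall@K for cross-modal retrieval is straightforward to compute from a query-item similarity matrix; the NumPy sketch below assumes the ground-truth match for query i is item i, the usual convention for paired image-text test sets, and uses synthetic similarities for illustration.

```python
import numpy as np

def recall_at_k(similarity, k):
    """similarity[i, j]: score between query i (e.g., a caption) and item j (e.g., an image).
    Ground truth is assumed to be the diagonal pairing (query i matches item i)."""
    ranked = np.argsort(-similarity, axis=1)          # best-scoring items first
    hits = [i in ranked[i, :k] for i in range(similarity.shape[0])]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
sim = rng.normal(size=(100, 100))
sim[np.arange(100), np.arange(100)] += 3.0            # make true pairs score higher
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(sim, k):.2f}")
```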
Dataset | Year | Modalities | Size | Primary Use | Source
COCO | 2014 | Image, Text | 330K images, 617K captions | Image captioning, object detection | arXiv:1405.0312
Visual Genome | 2016 | Image, Text (dense) | 108K images, 3.8M objects | Scene graphs, VQA | arXiv:1602.07332
AudioSet | 2017 | Audio, Video | 2.1M clips, 632 classes | Audio event detection | arXiv:1702.08721
LAION-5B | 2022 | Image, Text | 5.85B pairs | Large-scale pretraining | arXiv:2210.08402

Limitations and Future Directions

Despite significant progress, multimodal learning faces notable limitations in scalability, particularly when handling high-dimensional data from diverse modalities such as images, text, and audio, where fusion processes demand substantial computational resources that hinder deployment on standard hardware. Bias amplification across modalities remains a critical issue, as errors or imbalances in one modality, like skewed visual datasets, can propagate and exacerbate unfair outcomes in integrated representations during reasoning tasks. Furthermore, the lack of interpretability in fused representations poses challenges, rendering complex models as opaque "black boxes" that obscure decision-making processes, especially in high-stakes domains. Specific challenges include difficulties in handling rare modalities, such as olfactory data, where the absence of standardized benchmarks and diverse datasets complicates objective evaluation and integration with visual or textual inputs. Privacy concerns in multimodal health applications are pronounced, as combining sensitive data like medical images and patient audio risks unauthorized exposure during model training and inference, necessitating robust safeguards. Computational costs also represent a barrier, exemplified by training CLIP-scale models on massive datasets like 400 million image-text pairs, which required hundreds of GPUs (e.g., 592 V100s for 18 days for the largest ResNet variant, or 256 V100s for 12 days for the largest Vision Transformer) to achieve viable performance. Looking ahead, neurosymbolic multimodal AI emerges as a promising direction, integrating neural learning with symbolic reasoning to enhance interpretability and handle complex spatial or logical tasks across modalities. Lifelong learning frameworks for dynamic modalities are gaining traction, enabling models to adapt incrementally to evolving data streams without catastrophic forgetting, thus supporting continuous updates in real-world scenarios. Integration with edge computing is anticipated to facilitate multimodal applications by 2030, driven by projected market growth to $10.81 billion and advancements in localized inference to reduce latency. As of 2025, trends emphasize efficient processing through sparse attention mechanisms, which mitigate computational overhead in long sequences by focusing on relevant tokens, and the development of ethical frameworks to promote multimodal fairness via standardized auditing and inclusivity evaluations.

References

  1. Multimodal Machine Learning: A Survey and Taxonomy.
  2. Multimodal Learning with Transformers: A Survey. arXiv.
  4. A Survey on Multimodal Large Language Models. Oxford Academic.
  5. A Review on Methods and Applications in Multimodal Deep Learning. February 18, 2022.
  6. McGurk, H.; MacDonald, J. Hearing lips and seeing voices. Nature 264 (5588): 746–748, December 1976. doi:10.1038/264746a0. PMID 1012311.
  7. Multimodal Machine Learning: A Survey and Taxonomy. arXiv, August 1, 2017.
  8. Multimodal Convolutional Neural Networks for Matching Image and Sentence.
  9. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN).
  10. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, February 26, 2021.
  11. Perceiver IO: A General Architecture for Structured Inputs & Outputs. arXiv, July 30, 2021.
  12. Flamingo: a Visual Language Model for Few-Shot Learning. arXiv, April 29, 2022.
  13. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. arXiv, March 23, 2022.
  14. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, January 30, 2023.
  15. Visual Instruction Tuning (LLaVA). arXiv:2304.08485, April 17, 2023.
  16. Emergent Abilities of Large Language Models. arXiv:2206.07682, June 15, 2022.
  17. A Survey on Deep Learning for Multimodal Data Fusion. May 1, 2020.
  18. Multi-Modal Factorized Bilinear Pooling With Co-Attention Learning for Visual Question Answering.
  19. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv:1908.02265, August 6, 2019.
  20. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv, August 9, 2019.
  21. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. August 20, 2019.
  22. GIT: A Generative Image-to-text Transformer for Vision and Language. May 27, 2022.
  23. Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1). arXiv:2302.14045, February 27, 2023.
  24. Hello GPT-4o. OpenAI, May 13, 2024.
  25. Multimodal Learning with Deep Boltzmann Machines.
  26. MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation. December 19, 2022.
  27. Sound2Sight: Generating Visual Dynamics from Sound and Context. July 23, 2020.
  28. Multimodal deep learning models for early detection of Alzheimer's disease stage. February 5, 2021.
  29. Integrated multimodal artificial intelligence framework for healthcare applications. September 20, 2022.
  30. Attention-driven hybrid deep learning and SVM model for early detection. July 1, 2025.
  31. Multimodal Deep Learning for Integrating Chest Radiographs and Clinical Parameters. October 3, 2023.
  32. A Review of Multi-Sensor Fusion in Autonomous Driving. MDPI.
  33. M2CURL: Sample-Efficient Multimodal Reinforcement Learning. January 30, 2024.
  35. Bias recognition and mitigation strategies in artificial intelligence healthcare applications. March 11, 2025.
  36. WHO releases AI ethics and governance guidance for large multi-modal models. World Health Organization, January 18, 2024.
  37. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv, February 23, 2016.
  38. LAION-5B: An open large-scale dataset for training next generation image-text models. October 16, 2022.
  40. Multimodal Alignment and Fusion: A Survey. arXiv:2411.17040.
  41. MUFASA: Multimodal Fusion And Sparse Attention-based Alignment Model for Long Sequential Recommendation. August 13, 2025.
  42. A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks. April 14, 2025.