
Saliency map

A saliency map is a topographic two-dimensional map that represents the relative saliency or conspicuity of different locations across the visual field, serving as a master representation that prioritizes stimuli for further processing in models of visual attention. This concept was first proposed by Christof Koch and Shimon Ullman in 1985 as part of a neural circuitry model for shifts in selective visual attention, in which the map integrates simple feature contrasts like color, orientation, and motion to highlight salient regions without prior knowledge of the scene. The idea gained prominence through the 1998 computational implementation by Laurent Itti, Christof Koch, and Ernst Niebur, which constructs the saliency map by generating separate feature maps for intensity, color, and orientation across multiple spatial scales, followed by normalization, winner-take-all selection, and iterative inhibition-of-return to select attentional foci in a bottom-up manner. This model, inspired by early primate visual pathways, enables rapid scene analysis by simulating attentional selection, where the saliency map's peak values indicate the locations most likely to attract overt or covert attention shifts. Empirical validations of such models have shown strong correlations with human eye-tracking data in natural scenes, underscoring their biological plausibility.

In contemporary applications, saliency maps extend beyond biological modeling to interpretable machine learning, particularly in deep convolutional neural networks (CNNs), where gradient-based methods compute pixel-wise importance scores to visualize which input regions most influence a model's decision. Pioneered by Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman in 2013, these visualizations reveal the "focus" of CNNs on discriminative features, aiding in debugging, bias detection, and trust-building in AI systems. Despite their utility, saliency maps in this context have faced critiques for sensitivity to input perturbations and potential artifacts, prompting ongoing refinements like integrated gradients to enhance robustness.

Fundamentals

Definition and Core Concepts

A saliency map is a two-dimensional topographic representation of an image in which the value at each location encodes the saliency or conspicuity of the corresponding part of the scene, thereby highlighting regions likely to attract visual attention. This map simulates bottom-up attentional mechanisms by integrating low-level visual features to identify salient areas without relying on task-specific knowledge.

Core concepts of saliency maps distinguish between bottom-up and top-down processes in visual attention. Bottom-up saliency is stimulus-driven, emerging automatically from intrinsic stimulus properties such as contrasts in color, intensity, or orientation, whereas top-down saliency is goal-directed, modulated by cognitive factors like expectations or search objectives. Saliency maps typically serve as probabilistic or activation-based encodings, where higher values indicate a greater likelihood of attentional fixation, often normalized across the map to represent relative importance. These maps draw from feature integration theory, which posits that preattentive vision processes basic features—such as color, orientation, and motion—in parallel across the visual field before focused attention binds them into coherent percepts.

Mathematically, a saliency map S at pixel coordinates (x, y) can be formulated as S(x, y) = f(I(x, y)), where I denotes the input image and f is a function that aggregates conspicuity from multiple feature channels. A foundational operation for this is the center-surround difference, which detects local anomalies by subtracting coarser-scale representations from finer-scale ones within feature maps, mimicking receptive-field properties in early visual processing: for instance, conspicuity arises from differences like |I_c − I_s|, where I_c and I_s are the center and surround scales, respectively. Saliency maps differ from edge maps, which isolate boundaries via intensity changes, and from segmentation masks, which delineate object regions; instead, saliency emphasizes holistic attentional priority over structural delineation.
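The center-surround idea can be illustrated with a minimal sketch: the snippet below computes a toy single-channel saliency map as the absolute difference between a fine-scale and a coarse-scale Gaussian smoothing of the intensity channel. The function name, the two sigma values, and the synthetic test image are illustrative choices, not part of any published model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_saliency(image, center_sigma=1.0, surround_sigma=8.0):
    """Toy instance of S(x, y) = f(I(x, y)): absolute difference between a
    fine-scale (center) and a coarse-scale (surround) blur of the intensity."""
    intensity = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    center = gaussian_filter(intensity, center_sigma)      # fine scale I_c
    surround = gaussian_filter(intensity, surround_sigma)  # coarse scale I_s
    saliency = np.abs(center - surround)                   # |I_c - I_s|
    value_range = saliency.max() - saliency.min()
    return (saliency - saliency.min()) / (value_range + 1e-8)  # normalize to [0, 1]

# A bright square on a dark background "pops out" as the most salient region.
img = np.zeros((128, 128))
img[48:80, 48:80] = 1.0
smap = center_surround_saliency(img)
print(smap.shape, smap.max())
```

In full models this operation is repeated for several feature channels and scale pairs, and the resulting maps are normalized and combined into a single saliency map.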

Historical Development

The concept of saliency maps in computational vision drew early inspiration from psychological studies on visual attention during the 1970s and 1980s, particularly Anne Treisman's feature integration theory, which posited that simple visual features like color and orientation are processed in parallel across multiple specialized maps before focused attention binds them into coherent objects via a master map mechanism. This framework highlighted the preattentive stage of vision in which salient elements pop out effortlessly, influencing later computational models by suggesting a unified representation of feature conspicuity.

A pivotal advancement occurred in 1985 when Christof Koch and Shimon Ullman proposed the saliency map as a central topographic structure in the visual system, integrating outputs from early feature maps—such as those for color, orientation, and motion—through a winner-take-all network to select the most conspicuous location for attention shifts. This model formalized saliency computation as a bottom-up process independent of specific tasks, positing its neural locus in areas like the lateral geniculate nucleus or superior colliculus, and laid the groundwork for simulating selective visual attention in machines. In the early 1990s, John Tsotsos contributed to formalizing saliency through selective tuning mechanisms, emphasizing hierarchical and localized computations to solve feature binding problems in visual search.

The late 1990s marked a landmark in practical implementation with Laurent Itti, Christof Koch, and Ernst Niebur's 1998 model, which built on Koch and Ullman's ideas by creating a biologically inspired saliency map from center-surround differences in intensity, color, and orientation features, followed by iterative normalization and cross-scale combination to guide rapid scene analysis. This approach, tested on natural images, demonstrated high correlation with human fixations and became a benchmark for bottom-up saliency detection.

In the 2000s, saliency models evolved toward more sophisticated integrations, including graph-based methods that treated images as graphs to compute saliency via random walks or Markov chains, enhancing global context beyond the local contrasts of earlier center-surround models. These developments, exemplified by Harel et al.'s 2006 graph-based visual saliency (GBVS) model, improved accuracy on diverse scenes by modeling feature relationships across locations and scales. Comprehensive surveys, such as Borji et al.'s 2013 analysis of over 30 models, underscored the dominance of hand-crafted features in pre-2010 approaches while highlighting persistent challenges in matching human gaze patterns.

A significant shift occurred after 2012 following the success of deep convolutional networks such as AlexNet, which catalyzed data-driven saliency models trained end-to-end on large eye-tracking datasets, moving away from rule-based heuristics toward learned representations that better captured contextual and semantic saliency. This transition, as reviewed in subsequent benchmarks, marked the integration of deep learning into saliency computation, completing a trajectory from psychological theory to computational maturity.

Biological Foundations

Visual Attention Mechanisms

Visual attention mechanisms in humans are often divided into bottom-up and top-down processes, with saliency maps primarily modeling the former to predict where gaze is directed based on inherent stimulus properties. Bottom-up attention refers to pre-attentive processing that involuntarily captures gaze toward salient features, such as abrupt changes in luminance, color, or orientation that stand out against the background. This mechanism enables rapid detection without conscious effort, as demonstrated in visual search tasks where targets "pop out" from distractors due to unique feature differences, resulting in search times independent of the number of items present. For instance, a single red circle among green ones elicits immediate fixation because of the color contrast, illustrating how bottom-up saliency guides attention efficiently in feature-based singleton searches.

Eye-tracking studies provide empirical evidence for these processes by recording gaze patterns, or scanpaths, during free viewing or task-oriented observation of complex scenes. Pioneering experiments by Alfred Yarbus in the 1950s and 1960s revealed that eye movements form distinct trajectories influenced by scene content, with fixations clustering on high-contrast regions and informative elements, such as faces or objects of interest in paintings. These scanpaths are quantified through fixation maps, which aggregate dwell times across multiple observers to highlight attended areas and show strong correlations with model-predicted salient locations in natural images. Yarbus's work demonstrated that even without explicit instructions, gaze is drawn to edges and structural discontinuities, supporting the role of stimulus-driven cues in initial attention allocation.

While top-down influences, such as task goals or expectations, modulate attention by prioritizing relevant features, saliency maps focus predominantly on bottom-up aspects to capture the stimulus-driven component of gaze selection. This emphasis allows models to predict fixations in unconstrained viewing scenarios, where involuntary capture by salient elements occurs before volitional control takes over. Neurological correlates, such as activity in early visual areas, further underpin these behavioral observations but are explored in greater detail elsewhere.

Empirical validation of saliency-based models involves comparing generated maps to fixation data from eye-tracking datasets, often using metrics like the area under the curve (AUC) to assess predictive accuracy. Studies show that effective models achieve AUC scores around 0.7–0.8 for fixation prediction, indicating moderate to strong alignment with human gaze by ranking salient pixels higher than chance. For example, in free-viewing tasks, saliency maps predict initial fixations on pop-out elements better than uniform baselines, with shuffled variants accounting for center bias to ensure robust evaluation. These comparisons highlight how saliency modeling captures core aspects of bottom-up attention, though performance varies with scene complexity and observer variability.

Neurological Basis

The neurological basis of visual saliency lies in a network of brain regions that prioritize conspicuous stimuli through competitive neural processes, inspiring computational saliency maps. Key areas include the superior colliculus (SC), which generates orienting responses by encoding a topographic saliency map in its superficial layers via center-surround inhibition, directing gaze toward salient features like abrupt motion or contrast changes. The lateral intraparietal area (LIP) in the posterior parietal cortex facilitates attention shifts by integrating bottom-up saliency signals with top-down goals, functioning as a priority map that modulates neural activity to select relevant locations. The pulvinar nucleus of the thalamus acts as a relay for saliency, forwarding signals from the superior colliculus to cortical areas and suppressing irrelevant distractors through inhibitory mechanisms, thereby enhancing the representation of behaviorally important stimuli.

Neural pathways underlying saliency detection involve both cortical and subcortical routes, with the dorsal stream playing a prominent role in rapid processing. The dorsal ("where/how") pathway, extending from primary visual cortex (V1) through areas like V5/MT to the posterior parietal cortex, handles spatial and motion-based saliency via the magnocellular pathway, which excels at detecting low-contrast, high-speed changes such as moving predators or prey. In contrast, the ventral ("what") stream focuses on object identification and contributes less to initial saliency. Subcortical pathways, including retina-to-SC connections, enable fast, reflexive orienting independent of cortical involvement, bypassing slower ventral processing for survival-critical detection.

Electrophysiological evidence from single-unit recordings and fMRI supports these mechanisms, revealing winner-take-all dynamics in the selection of salient stimuli. Single-unit studies show that neurons in the superficial visual layers of the superior colliculus encode saliency robustly, with responses peaking for the most conspicuous stimuli earlier than in primary visual areas. fMRI data demonstrate enhanced BOLD signals in LIP and the pulvinar during saliency-guided attention, correlating with suppressed activity for competing distractors. The biased competition model, supported by these findings, posits that multiple stimuli vie for limited neural resources in visual cortex, with saliency biasing the outcome via mutual inhibition, akin to winner-take-all selection.

From an evolutionary perspective, visual saliency represents an adaptive mechanism for survival, honed in primates to detect threats or opportunities in complex environments. Primate vision research highlights how saliency processing, rooted in ancient tectal structures like the SC (homologous to the frog's optic tectum for "bug detection"), enables rapid prioritization of motion or contrast anomalies, aiding predator evasion and foraging efficiency. This conservation across species underscores saliency's role in enhancing reproductive fitness through efficient resource allocation in visually rich habitats.

Computational Approaches

Classical Algorithms

Classical algorithms for saliency map generation rely on hand-crafted features and rule-based computations, predating the widespread adoption of deep learning. These methods typically process images through multi-scale feature extraction and normalization to highlight visually conspicuous regions, drawing inspiration from biological visual processing such as center-surround mechanisms in the early visual system.

A seminal feature-based model is the Itti-Koch framework, introduced in 1998, which computes saliency by simulating early visual processing pathways. The process begins with the creation of feature maps for color, intensity, and orientation using center-surround operations applied to Gaussian pyramids spanning nine spatial scales (σ = 0 to 8). For intensity, differences between fine and coarse scales yield six center-surround maps, I(c, s) = |I(c) ⊖ I(s)|, where the center scale is c ∈ {2, 3, 4}, the surround scale is s = c + δ with δ ∈ {3, 4}, and ⊖ denotes across-scale subtraction after interpolating the coarser map to the finer scale. Similar operations produce maps for color channels (red-green and blue-yellow opponent colors) and orientation (using Gabor filters at 0°, 45°, 90°, and 135°). Conspicuity maps are then formed for each feature type (intensity, color, orientation) by across-scale addition of the feature maps at a common scale, after each map has been passed through a normalization operator that promotes maps containing a few strong peaks and suppresses those with many comparable peaks. The final saliency map combines the three normalized conspicuity maps by averaging, and a winner-take-all network coupled with inhibition-of-return selects and then transiently suppresses the most salient locations, generating a sequence of attentional shifts without top-down modulation. This model has been widely adopted for its biological plausibility and efficiency in rapid scene analysis.

Frequency-tuned methods exploit the image's frequency content to capture global structure for saliency detection. Achanta et al.'s 2009 approach emphasizes global contrast by retaining a broad band of spatial frequencies, computing pixel-wise saliency as the Euclidean distance between the image's average feature vector and a slightly blurred version of the image. Specifically, the saliency value S(\mathbf{x}) for a pixel \mathbf{x} is given by S(\mathbf{x}) = \|\mathbf{I}_\mu - \mathbf{I}_{\omega hc}(\mathbf{x})\|, where \mathbf{I}_\mu is the mean Lab color feature vector of the image and \mathbf{I}_{\omega hc}(\mathbf{x}) is the Lab feature vector at \mathbf{x} after Gaussian blurring; together these act as a difference-of-Gaussians band-pass filter that suppresses very low and very high frequencies while retaining perceptually relevant content. This results in full-resolution saliency maps with well-defined boundaries, outperforming local contrast methods on benchmark datasets by better capturing uniform regions and object silhouettes.

Information-theoretic approaches model saliency as the potential for information gain, quantifying surprise or rarity in the visual input. Bruce and Tsotsos's 2009 model computes saliency from the self-information of local image patches, measuring how unexpected each patch is given the rest of the image. The image is first decomposed into overlapping patches at multiple scales, each represented by a feature vector (e.g., color and edge responses).
The self-information of a patch p is calculated as −log P(p), where P(p) is the probability of observing p's features under a distribution estimated from the entire image, typically via a non-parametric density estimate over the features of all patches. Saliency at each location is then given by the summed self-information of the patches covering that location, effectively measuring how atypical or informative each region is relative to the global context. This information-maximization view highlights regions that are rare with respect to their surroundings, with empirical validation showing strong correlation with human fixations in free viewing and visual search tasks. The approach remains computationally demanding because of the density estimation over many patches.

Graph-based methods treat the image as a graph to model region importance through connectivity and transition probabilities. Harel et al.'s 2006 graph-based visual saliency (GBVS) first generates activation maps using feature channels similar to the Itti-Koch model (intensity, color, orientation) at multiple scales, then constructs a fully connected graph whose nodes represent image locations and whose edge weights encode the dissimilarity between feature values, attenuated by spatial distance. Saliency is computed by treating this graph as a Markov chain, with edge weights defining transition probabilities: the chain's equilibrium distribution concentrates mass at nodes that differ strongly from their surroundings, and these stationary probabilities serve as activation values; a second Markov-chain pass then normalizes each map by concentrating mass around its strongest activations before the maps are combined across features and scales. To keep computation tractable, the maps are processed at reduced resolution and the graph can be sparsified by connecting only nearby locations. This method improves on purely local center-surround models by capturing contextual relationships, achieving superior performance in predicting human fixations compared to earlier models on diverse image sets.
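As a concrete illustration of the frequency-tuned approach, the sketch below follows the S(\mathbf{x}) = \|\mathbf{I}_\mu - \mathbf{I}_{\omega hc}(\mathbf{x})\| formulation using OpenCV, comparing the mean Lab vector against a Gaussian-blurred Lab image. The blur kernel size and the file name are illustrative assumptions rather than values prescribed by the original paper.

```python
import cv2
import numpy as np

def frequency_tuned_saliency(bgr_image):
    """Sketch of frequency-tuned saliency: per-pixel Euclidean distance in Lab
    space between the image's mean colour vector and a blurred copy of the image."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float64)
    mean_lab = lab.reshape(-1, 3).mean(axis=0)      # global average feature vector I_mu
    blurred = cv2.GaussianBlur(lab, (5, 5), 0)      # I_omega_hc: small Gaussian blur
    saliency = np.linalg.norm(blurred - mean_lab, axis=2)
    return cv2.normalize(saliency, None, 0.0, 1.0, cv2.NORM_MINMAX)

img = cv2.imread("scene.jpg")                       # placeholder test image
if img is not None:
    smap = frequency_tuned_saliency(img)
    cv2.imwrite("saliency.png", (smap * 255).astype(np.uint8))
```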

Modern Deep Learning Methods

Modern deep learning methods for saliency map computation primarily rely on convolutional neural networks (CNNs) to extract hierarchical features and predict pixel-wise saliency scores in an end-to-end manner. A foundational approach leveraged pre-trained architectures such as VGG for multi-scale feature extraction to model visual saliency. For example, Li et al. (2015) used VGG-16 to capture low-, mid-, and high-level features from images via nested contextual windows, followed by fully connected layers to generate saliency maps trained with a binary cross-entropy loss, enabling data-driven prediction that outperformed hand-crafted features on benchmark datasets. This end-to-end training paradigm allows models to learn complex patterns directly from labeled saliency data, with loss functions like binary cross-entropy optimizing the similarity between predicted and ground-truth saliency maps.

Attention mechanisms have further advanced saliency computation by incorporating global contextual dependencies, particularly through transformer architectures introduced after 2017. These models adapt self-attention to weigh relevant image regions dynamically, enhancing focus on salient areas. A representative integration appears in deeply supervised frameworks that employ attention-like short connections for multi-level feature refinement. The attention-weighted representation can be formulated as:
\mathbf{S} = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right) \mathbf{V}
where \mathbf{Q}, \mathbf{K}, and \mathbf{V} are query, key, and value projections of visual features and d is the feature dimensionality, adapted for dense prediction tasks to produce spatially aware saliency maps. Such mechanisms improve boundary delineation and suppress background noise compared to purely convolutional approaches.

State-of-the-art models as of 2020, such as U-Net variants, have achieved superior performance through encoder-decoder architectures that fuse multi-resolution features for precise pixel-wise saliency prediction. For instance, U²-Net (2020) employs nested U-structures to capture intricate details and boundaries, yielding mean absolute errors below 0.04 and maximum F-measures over 0.91 on datasets like DUT-OMRON and ECSSD. These deep methods consistently surpass classical algorithms, with area under the curve (AUC) scores exceeding 90% on standard benchmarks, demonstrating scalability and robustness in diverse scenes. Since 2020, transformer-based architectures, such as Vision Transformer (ViT) adaptations for saliency detection, have advanced the field further, reaching scores above 95% on benchmarks like ECSSD as of 2024.
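The scaled dot-product attention above can be made concrete with a small PyTorch sketch that treats CNN feature-map locations as tokens and reads the average attention received by each location as a rough spatial saliency signal. This is an illustrative construction with untrained identity projections, not a reproduction of any specific published architecture.

```python
import torch

def attention_saliency(features):
    """Illustrative spatial self-attention over CNN features:
    S = softmax(Q K^T / sqrt(d)) V, with mean attention per location as saliency.
    features: tensor of shape (B, C, H, W), e.g. the output of a backbone."""
    b, c, h, w = features.shape
    tokens = features.flatten(2).transpose(1, 2)                     # (B, H*W, C)
    q = k = v = tokens                                               # identity projections (untrained sketch)
    attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)   # (B, HW, HW)
    context = attn @ v                                               # attended features, (B, HW, C)
    saliency = attn.mean(dim=1).reshape(b, h, w)                     # attention received by each location
    return context, saliency

feats = torch.randn(1, 64, 28, 28)                                   # stand-in for backbone features
_, smap = attention_saliency(feats)
print(smap.shape)                                                    # torch.Size([1, 28, 28])
```

In trained models, the query, key, and value projections are learned, and the attended features feed a decoder that predicts the final saliency map.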

Applications

In Human Visual Perception Modeling

Saliency maps serve as computational tools to predict human gaze patterns by generating heatmaps that highlight regions likely to attract visual attention in static images, closely aligning with empirical eye-tracking data collected from observers viewing natural scenes. These models simulate bottom-up attention by emphasizing low-level features such as intensity, color, and orientation, enabling predictions of initial fixations that match human behavior with high accuracy on benchmark datasets. In user interface (UI) design, saliency maps guide layout decisions by identifying focal points for key elements, such as placing buttons in high-saliency areas to enhance discoverability and reduce visual search time, as demonstrated in studies optimizing UI compositions for viewing efficiency.

Psychophysical validation of saliency models involves controlled experiments in which predicted heatmaps are compared to human fixation maps from eye-tracking studies, often using tasks that isolate attentional capture without top-down influences. For instance, pop-out detection paradigms present stimuli in which observers report or fixate on the most prominent feature regardless of its location, allowing assessment of how well models capture perceptual pop-out effects. Model accuracy is quantified through metrics like Kullback-Leibler (KL) divergence, which measures the information loss between the probability distributions of predicted saliency and actual human fixations; lower divergence values indicate better alignment, with state-of-the-art models achieving low KL scores across diverse image sets. These validations confirm that saliency maps effectively replicate human perceptual priorities in free-viewing scenarios.

Extensions to dynamic saliency incorporate temporal dynamics for video sequences, modeling how attention shifts over time by integrating motion cues and frame-to-frame changes to predict gaze trajectories in dynamic environments. These models extend static frameworks by adding spatiotemporal filters that capture motion saliency, achieving high correlation coefficients with human gaze data on video datasets like Hollywood-2, thus simulating overt attention shifts in real-world viewing such as watching films or navigating an environment. Such approaches build on biological visual mechanisms, where neural circuits in motion-sensitive areas prioritize moving stimuli.

Open-source libraries facilitate the integration of saliency models into perception experiments, with PyGaze providing Python-based tools for designing eye-tracking protocols that incorporate saliency predictions to test hypotheses about attentional guidance. PyGaze supports stimulus presentation and data collection, enabling researchers to validate models against live human responses in controlled settings.
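A minimal sketch of the KL-divergence comparison mentioned above, assuming the predicted saliency map and the empirical fixation-density map share the same shape and are non-negative; the epsilon value and the toy arrays are arbitrary illustration choices.

```python
import numpy as np

def kl_divergence(saliency_map, fixation_map, eps=1e-8):
    """KL divergence D(fixations || saliency) after normalising both maps to
    probability distributions over pixels; lower values mean better agreement."""
    p = fixation_map.astype(np.float64)
    q = saliency_map.astype(np.float64)
    p /= p.sum() + eps
    q /= q.sum() + eps
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Sanity check: a map compared against itself has (near) zero divergence.
pred = np.random.rand(64, 64)
print(kl_divergence(pred, pred))
```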

In Explainable Artificial Intelligence

In explainable artificial intelligence (XAI), saliency maps serve as post-hoc explanations for black-box models, particularly in image classification tasks, by visualizing the input regions that most strongly influence a model's predictions. These maps highlight discriminative pixels or features, such as edges or textures in the input image, enabling users to understand why a model assigns an image to a specific category, for example identifying a pedestrian in autonomous driving scenarios. Unlike inherently interpretable models, saliency maps provide localized insights into complex deep neural networks without requiring architectural changes, making them valuable for debugging and trust-building in high-stakes applications.

Gradient-based methods form the foundation of many saliency techniques in XAI. The vanilla gradient approach, introduced by Simonyan et al. (2013), computes the saliency map as the partial derivative of the model's class score with respect to the input image pixels, directly indicating sensitivity to changes in each feature. To address limitations of plain gradients, such as saturation in deep networks, Integrated Gradients (IG) was proposed by Sundararajan et al. (2017), which attributes the prediction to inputs by integrating gradients along a path from a baseline input x' (often a black image) to the actual input x:
\text{IG}(x) = (x - x') \times \int_{0}^{1} \nabla F\big(x' + \alpha (x - x')\big) \, d\alpha
where F is the model function and \nabla F is its gradient; this method satisfies axioms such as completeness (attributions sum to the difference between the prediction at x and at the baseline) and implementation invariance across functionally equivalent networks. For noise reduction in these gradient maps, SmoothGrad, developed by Smilkov et al. (2017), adds Gaussian noise to multiple copies of the input, computes gradients for each, and averages them to produce sharper, more reliable visualizations without altering the underlying model.

Despite their utility, saliency maps face fidelity challenges, where the explanations may not accurately reflect the model's true decision process, often due to sensitivity to model parameters or input variations. For instance, gradient-based maps can produce noisy or largely unchanged highlights when model weights are randomized, failing sanity checks of explanation faithfulness. Compared to perturbation-based alternatives like LIME (which approximates local behavior via sampled perturbations) or SHAP (which uses game-theoretic Shapley values for feature contributions), saliency maps are faster but less robust to architectural differences, potentially leading to misleading interpretations in complex models.

A prominent case study in medical imaging involves using saliency maps for brain tumor classification on MRI scans from the BR35H dataset, where a CNN achieved high validation accuracy, with maps highlighting tumor regions and boundaries as key influences on correct predictions. In misclassified cases, the maps revealed focus on non-tumor areas or irregular shapes, guiding improvements such as enhanced preprocessing for bone removal. However, evaluations on datasets like the RSNA Pneumonia Detection challenge underscore trustworthiness issues, as saliency methods underperformed dedicated localization networks (AUPRC of 0.160–0.519 vs. 0.596), emphasizing the need for hybrid approaches in clinical interpretability.
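The sketch below approximates the Integrated Gradients formula with a simple Riemann sum in PyTorch. It assumes a classifier that maps a batched image tensor to class logits; the step count, the black-image baseline, and the reduction of attributions to a 2D map are illustrative choices, not the only valid ones.

```python
import torch

def integrated_gradients(model, x, target_class, baseline=None, steps=50):
    """Sketch of Integrated Gradients: average gradients of the target-class
    score along a straight path from a baseline x' to the input x,
    then scale by (x - x')."""
    model.eval()
    if baseline is None:
        baseline = torch.zeros_like(x)              # black-image baseline x'
    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)   # x' + alpha * (x - x')
        point.requires_grad_(True)
        score = model(point)[0, target_class]       # class score F(.) for the first batch item
        grad, = torch.autograd.grad(score, point)
        grads.append(grad)
    avg_grad = torch.stack(grads).mean(dim=0)       # Riemann approximation of the path integral
    attribution = (x - baseline) * avg_grad
    return attribution.abs().sum(dim=1).squeeze(0)  # collapse channels into a 2D saliency map

# Usage with a hypothetical pretrained classifier and a (1, 3, 224, 224) image tensor:
# smap = integrated_gradients(model, image, target_class=243)
```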

In Image Segmentation and Processing

Saliency maps serve as effective priors in image segmentation by highlighting regions likely to represent foreground objects, thereby guiding foreground-background separation in algorithms like GrabCut. In the SaliencyCut method, saliency values are used to automatically initialize the foreground and background models of an iterative GrabCut framework, improving segmentation accuracy without manual user intervention. This integration leverages classical saliency detection techniques, such as contrast-based methods, to provide robust initial seeds for graph-cut optimization.

In practical applications, saliency maps enable selective image compression by allocating higher bit rates to salient regions while aggressively compressing non-salient backgrounds, preserving perceptual quality. For instance, saliency-driven perceptual compression models incorporate saliency maps to modulate quantization and encoding, achieving better visual fidelity at lower bitrates than uniform schemes. Similarly, in image editing software, saliency maps facilitate adaptive cropping by identifying compositionally important areas, allowing automatic reframing that retains key elements like subjects or focal points. For video, temporal saliency extends this to summarization, where spatiotemporal maps detect dynamic salient events to select representative keyframes, reducing video length while maintaining narrative coherence.

Real-time saliency computation is crucial for interactive applications, with models optimized for mobile devices enabling on-the-fly processing in augmented reality (AR) filters. These optimizations allow AR systems to highlight user-focused areas, such as overlaying effects on detected salient objects in live camera feeds, at frame rates suitable for interactive experiences. Hybrid approaches further enhance scene understanding by fusing saliency maps with edge detection outputs; for example, saliency-guided edge refinement strengthens boundary delineation in cluttered scenes, improving overall object isolation and contextual parsing.
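The following OpenCV sketch shows the general idea of using a saliency map as a segmentation prior, in the spirit of SaliencyCut but not a reproduction of it: high-saliency pixels seed the probable foreground, low-saliency pixels seed the background, and GrabCut refines the mask. The thresholds, iteration count, and function name are illustrative assumptions.

```python
import cv2
import numpy as np

def saliency_guided_grabcut(bgr_image, saliency_map, fg_thresh=0.7, bg_thresh=0.3, iters=5):
    """Use a [0, 1] saliency map to seed GrabCut: probable foreground where saliency
    is high, confident background where it is low, 'probably background' elsewhere."""
    mask = np.full(saliency_map.shape, cv2.GC_PR_BGD, dtype=np.uint8)
    mask[saliency_map >= fg_thresh] = cv2.GC_PR_FGD   # probable foreground seeds
    mask[saliency_map <= bg_thresh] = cv2.GC_BGD      # confident background seeds
    bgd_model = np.zeros((1, 65), np.float64)         # internal GMM buffers required by OpenCV
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(bgr_image, mask, None, bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_MASK)
    # Collapse the four GrabCut labels into a binary foreground mask.
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
```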

Evaluation and Resources

Performance Metrics

Evaluating the quality of saliency maps requires quantitative metrics that compare predicted saliency distributions against ground truth, typically derived from human fixation data. These metrics assess aspects such as agreement with fixation locations, distributional similarity, and predictive utility, enabling systematic comparison of computational models. Common approaches treat saliency maps either as continuous probability distributions or as binary classifiers, with each perspective highlighting different strengths and limitations of the maps.

Similarity metrics directly measure the correspondence between a model's saliency map and an empirical fixation map. The Normalized Scanpath Saliency (NSS) evaluates fixation prediction by normalizing the saliency map to zero mean and unit standard deviation, then computing the mean saliency value at human fixation locations; higher values indicate better prediction, and random maps yield a score near zero. Introduced for bottom-up saliency evaluation, NSS emphasizes location-specific predictions but assumes Gaussian-like saliency distributions. The Similarity metric (SIM) quantifies overlap between saliency and fixation maps by treating both as normalized histograms and summing the minimum value at each bin; it ranges from 0 (no overlap) to 1 (identical distributions) and is robust to overall scaling. For distribution matching that accounts for spatial structure, the Earth Mover's Distance (EMD) measures the minimal "work" required to transform the saliency distribution into the fixation distribution, using spatial distances as transport costs; lower EMD values signify better spatial agreement, making it suitable for maps where proximity matters.

From a binary classification viewpoint, saliency maps can be thresholded to predict fixation probability, allowing standard performance measures. The Area Under the Receiver Operating Characteristic curve (AUC-ROC) treats saliency values as classifier scores, plotting the true positive rate against the false positive rate across thresholds; an AUC near 1 indicates strong discrimination of fixation points from non-fixated locations, while values around 0.5 match chance performance. Variants like shuffled AUC (sAUC) account for the center bias of human viewing. For salient object detection tasks, the F-measure assesses binarized maps against ground-truth masks using precision and recall, often with an adaptive threshold (e.g., twice the map's mean value) to optimize overlap; it balances false positives and negatives, reaching 1 for perfect segmentation.

Information-theoretic metrics evaluate the predictive power of saliency maps by quantifying uncertainty reduction. These approaches often employ entropy to measure the information content in fixation distributions, or mutual information between saliency predictions and actual fixations, where higher mutual information reflects greater explanatory value. The Kullback-Leibler (KL) divergence, a related asymmetric measure, computes the extra bits needed to encode fixations when the saliency distribution is used as the predictive model; low divergence indicates the map closely approximates empirical attention patterns. Such approaches unify evaluation under probabilistic frameworks, emphasizing how well maps capture attentional uncertainty.

Despite their utility, saliency metrics face challenges, including variability in ground truth from inter-observer differences in eye-tracking data, which can inflate noise and bias rankings toward models that exploit viewing tendencies like center bias. Metrics also differ in sensitivity: location-based ones like NSS penalize spatial errors harshly, while distribution-based ones like EMD tolerate small offsets, leading to inconsistent model rankings across measures. To address this, experts recommend multi-metric evaluation, combining location-based, distribution-based, and information-theoretic scores for a robust assessment, as no single metric captures all facets of saliency prediction.
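A minimal sketch of two of these metrics, NSS and a simplified (unshuffled) AUC, assuming the saliency map is a 2D array and fixations are given as (row, column) pixel coordinates; the random negatives, sample count, and toy data are illustrative.

```python
import numpy as np

def nss(saliency_map, fixation_points):
    """Normalized Scanpath Saliency: z-score the map, then average its values
    at the fixated pixel coordinates."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(np.mean([s[r, c] for r, c in fixation_points]))

def auc_fixations(saliency_map, fixation_points, n_negatives=1000, seed=0):
    """AUC-ROC treating saliency values as scores for separating fixated pixels
    from randomly sampled non-fixated pixels (a simplified, unshuffled variant)."""
    rng = np.random.default_rng(seed)
    pos = np.array([saliency_map[r, c] for r, c in fixation_points])
    neg = saliency_map[rng.integers(0, saliency_map.shape[0], n_negatives),
                       rng.integers(0, saliency_map.shape[1], n_negatives)]
    # Probability that a fixated pixel outranks a non-fixated one
    # (the Mann-Whitney U formulation of AUC).
    return float(np.mean(pos[:, None] > neg[None, :]) +
                 0.5 * np.mean(pos[:, None] == neg[None, :]))

smap = np.random.rand(60, 80)
fixations = [(10, 20), (30, 40), (50, 70)]
print(nss(smap, fixations), auc_fixations(smap, fixations))  # random map: NSS ~ 0, AUC ~ 0.5
```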

Benchmark Datasets

Benchmark datasets play a crucial role in training, testing, and comparing saliency map models, providing standardized annotations derived from human visual data. These datasets vary in scale, complexity, and annotation type, enabling evaluation of both bottom-up saliency prediction and salient object detection. Static image datasets dominate early benchmarks, focusing on natural scenes with pixel-wise or fixation-based labels, while video datasets incorporate temporal dynamics to assess spatiotemporal saliency.

Key static image datasets include MSRA-B, which comprises 5,000 images primarily featuring a single dominant salient object, with pixel-wise binary annotations for all 5,000 images created by multiple human annotators to ensure reliability. ECSSD (Extended Complex Scene Saliency Dataset) extends this with 1,000 images of semantically rich but structurally complex scenes, including cluttered backgrounds and multiple interacting objects, annotated with pixel-wise masks by expert labelers to capture intricate saliency patterns. Similarly, DUT-OMRON offers 5,168 high-quality outdoor natural images, selected manually from over 140,000 candidates, with pixel-accurate binary masks derived from consensus among multiple annotators, emphasizing robust saliency in unconstrained environments.

For dynamic scenes, video datasets introduce motion and temporal coherence. The UVSD (Unconstrained Videos Saliency Dataset) includes 18 challenging videos with complex motions and scenes, providing frame-by-frame pixel-wise binary annotations obtained through manual labeling by several observers to delineate salient objects over time. The DIEM (Dynamic Images and Eye Movements) dataset features 85 diverse video clips, such as movie trailers and documentaries, with fixation maps recorded via eye-tracking from over 250 participants, aggregating data into probabilistic attention distributions under naturalistic viewing conditions.

Ground truth in these datasets typically falls into two categories: binary masks, which delineate object-level saliency for tasks like segmentation (as in MSRA-B, ECSSD, DUT-OMRON, and UVSD), and fixation maps, which provide probabilistic heatmaps of eye-fixation density for attention prediction (as in DIEM). Annotation protocols prioritize reliability, often involving multiple human annotators, such as 5–9 per image in MSRA-B or consensus labeling in DUT-OMRON, to mitigate inter-observer variability and enhance annotation quality.

Recent post-2020 additions address limitations of prior benchmarks, such as bias toward simple scenes. For instance, the COCO-Freeview dataset (2022) provides free-viewing fixation data on 100 natural images from MS COCO, recorded from multiple observers and added to the MIT/Tuebingen Saliency Benchmark in 2024 to support evaluation of saliency models in free-viewing scenarios. Similarly, Saliency-Bench (2023) is a collection of eight curated datasets for evaluating saliency methods in image classification tasks, covering diverse domains such as medical imaging and scene understanding. The SOC (Salient Objects in Clutter) dataset, introduced in 2018 and extended for zero-shot evaluation in subsequent work, contains 6,000 images with multiple salient objects amid heavy clutter, featuring pixel-wise annotations and subitizing labels (counts of salient objects) to test generalization without task-specific training. Accessibility has improved through platforms like SALICON, which provides a large-scale dataset of 10,000 MS COCO images with mouse-tracking-derived saliency maps as fixation proxies, collected via crowdsourcing to simulate free-viewing behavior and facilitate model training and benchmarking.
These resources, often used alongside metrics like AUC or NSS for validation, support comprehensive assessment of saliency models across static and dynamic domains.

References

  1. [1]
    [PDF] Shifts in selective visual attention: towards the underlying neural ...
    Since the saliency map is still a part of the early visual system, it most likely encodes the conspicuity of objects in terms of simple properties such as color ...
  2. [2]
    A model of saliency-based visual attention for rapid scene analysis
    Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented.
  3. [3]
    A saliency-based search mechanism for overt and covert shifts of ...
    Most models of visual search, whether involving overt eye movements or covert shifts of attention, are based on the concept of a saliency map.
  4. [4]
    Visualising Image Classification Models and Saliency Maps - arXiv
    Dec 20, 2013 · This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets).
  5. [5]
  6. [6]
    A feature-integration theory of attention - ScienceDirect.com
    The feature-integration theory of attention suggests that attention must be directed serially to each stimulus in a display whenever conjunctions of more than ...
  7. [7]
    Saliency Based on Information Maximization - NIPS papers
Authors. Neil Bruce, John Tsotsos. Abstract. A model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from ...
  8. [8]
    [PDF] A Model of Saliency-Based Visual Attention for Rapid Scene Analysis
    All feature maps feed, in a purely bottom-up manner, into a master “saliency map,” which topographically codes for local conspicuity over the entire visual.
  9. [9]
    [PDF] Graph-Based Visual Saliency - NIPS papers
    A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed. It consists of two steps: first forming activation maps on certain ...
  10. [10]
    [PDF] Quantitative Analysis of Human-Model Agreement in Visual Saliency ...
    This paper compares 35 saliency models using different datasets and evaluation scores, finding that some models consistently perform better.
  11. [11]
    [PDF] Shallow and Deep Convolutional Networks for Saliency Prediction
In our case we have adopted a completely data-driven approach, using a large amount of annotated data for saliency prediction. Figure 1 provides an example ...
  12. [12]
    Models of Bottom-up Attention and Saliency
The most obvious and famous example of the bottom-up saliency of a stimulus is the pop-out effect (Treisman, 1986; Treisman & Gelade, 1980; see Figure 14). ...
  13. [13]
    Five Factors that Guide Attention in Visual Search - PMC - NIH
    There are two fundamental rules of bottom-up salience. Salience of a target increases with difference from the distractors (target-distractor – TD- ...
  14. [14]
    Yarbus, eye movements, and vision - PMC - PubMed Central - NIH
    The impact of Yarbus's research on eye movements was enormous following the translation of his book Eye Movements and Vision into English in 1967.
  15. [15]
    Defending Yarbus: Eye movements reveal observers' task | JOV
    Buswell (1935) and Yarbus (1967), who were the first to investigate the relationship between eye-movement patterns and high-level cognitive factors. Yarbus ...
  16. [16]
    Saliency map - Scholarpedia
Aug 28, 2007 · The original definition of the saliency map by Koch and Ullman (1985) is in terms of neural processes and transformations, rather than in terms ...
  17. [17]
    Interaction between bottom-up saliency and top-down control
    Jan 16, 2012 · We found evidence for a hierarchy of saliency maps in human early visual cortex (V1 to hV4) and identified where bottom-up saliency interacts with top-down ...
  18. [18]
    Objects predict fixations better than early saliency - Journal of Vision
    Weighted with recall frequency, these objects predict fixations in individual images better than early saliency, irrespective of task.
  19. [19]
    Information-theoretic model comparison unifies saliency metrics - PMC
    Here we bring saliency evaluation into the domain of information by framing fixation prediction models probabilistically and calculating information gain.
  20. [20]
    How Well Can Saliency Models Predict Fixation Selection in Scenes ...
    Here, we adopt this approach to evaluate how well a given saliency map model predicts where human observers fixate in naturalistic images, above and beyond what ...
  21. [21]
  22. [22]
    Magnocellular Bias in Exogenous Attention to Biologically Salient ...
Oct 1, 2017 · Results support a magnocellular bias in exogenous attention toward distractors of any nature during initial processing, a bias that remains in later stages.
  23. [23]
    Superior colliculus encodes visual saliency before the ... - PNAS
    Aug 14, 2017 · Our results show that neurons in the superficial visual layers of the superior colliculus (SCs) encoded saliency earlier and more robustly than V1 neurons.
  24. [24]
    [PDF] Frequency-tuned Salient Region Detection
    In this paper, we introduce a method for salient region detection that outputs full reso- lution saliency maps with well-defined boundaries of salient objects.
  25. [25]
    [PDF] Saliency, attention, and visual search: An information theoretic ...
    Mar 13, 2009 · This paper proposes a saliency computation model to maximize information sampled, where saliency is the output of combining features into a ...
  26. [26]
    [PDF] Visual Saliency Based on Multiscale Deep Features
Visual saliency attempts to determine the amount of attention steered towards various regions in an image by the human visual and cognitive systems [6]. It is ...
  27. [27]
    Grad-CAM: Visual Explanations from Deep Networks via Gradient ...
    Oct 7, 2016 · We propose a technique for producing visual explanations for decisions from a large class of CNN-based models, making them more transparent.
  28. [28]
    Net: going deeper with nested U-structure for salient object detection
In this paper, we design a simple yet powerful deep network architecture, U2-Net, for salient object detection (SOD). The architecture of our U2-Net is a ...
  29. [29]
    Salient Object Detection in the Deep Learning Era: An In-Depth Survey
    Apr 19, 2019 · To facilitate the in-depth understanding of deep SODs, in this paper we provide a comprehensive survey covering various aspects ranging from ...
  30. [30]
    Saliency Map for Human Gaze Prediction in Still Images
    Aug 9, 2025 · 'Perceptual Image Difference Metrics – Saliency Maps & Eye Tracking'. His research interests include image processing and Digital Signal.
  31. [31]
    Predicting human gaze beyond pixels | JOV - Journal of Vision
    We propose a new saliency architecture that incorporates information at three layers: pixel-level image attributes, object-level attributes, and semantic-level ...
  32. [32]
    UEyes: Understanding Visual Saliency across User Interface Types
    Apr 19, 2023 · Given a UI as input, a saliency model can predict saliency maps or scanpaths, simulating how users perceive that UI. These models assist UI ...
  33. [33]
    How Well Can Saliency Models Predict Fixation Selection in Scenes ...
    Regarding model evaluation, the best solution to the issue of center bias is to design suitable evaluation metrics (Borji et al., 2013a), an approach we adopt ...
  34. [34]
    (PDF) Saliency and Human Fixations: State-of-the-Art and Study of ...
    Kullback-Leibler divergence is a measure used to evaluate the similarity between the probability distribution of the computational saliency map and the ...
  35. [35]
    [PDF] Quantitative Analysis of Human-Model Agreement in Visual Saliency ...
In addition to the above scores, Kullback-Leibler (KL) (the divergence between the saliency distributions at human fixations and at randomly shuffled fixations ...
  36. [36]
    SalFoM: Dynamic Saliency Prediction with Video Foundation Models
    Apr 3, 2024 · Deep learning-based video saliency prediction, as explored in [26] , has recently become a prominent method for modeling human gaze in dynamic ...
  37. [37]
    [PDF] Gaze Prediction in Dynamic 360deg Immersive Videos
    Gaze prediction in 360 videos determines where a user will look, using history scan path and VR content, and is based on a deep learning model.
  38. [38]
    [PDF] Modelling Spatio-Temporal Saliency to Predict Gaze Direction ...
    A spatio-temporal saliency model that predicts eye movement during video free viewing inspired by the biology of the first steps of the human visual system ...
  39. [39]
  40. [40]
    PyGaze: An open-source, cross-platform toolbox for minimal-effort ...
    Nov 21, 2013 · The PyGaze toolbox is an open-source software package for Python, a high-level programming language. It is designed for creating eyetracking experiments in ...
  41. [41]
    [1703.01365] Axiomatic Attribution for Deep Networks - arXiv
    Mar 4, 2017 · View a PDF of the paper titled Axiomatic Attribution for Deep Networks, by Mukund Sundararajan and 2 other authors ... Integrated Gradients. Our ...
  42. [42]
    [1706.03825] SmoothGrad: removing noise by adding noise - arXiv
    Jun 12, 2017 · This paper makes two contributions: it introduces SmoothGrad, a simple method that can help visually sharpen gradient-based sensitivity maps.
  43. [43]
  44. [44]
    Survey on Explainable AI: From Approaches, Limitations and ...
    Aug 10, 2023 · This article aims to present a comprehensive overview of recent research on XAI approaches from three well-defined taxonomies.
  45. [45]
    Saliency Maps as an Explainable AI Method in Medical Imaging
    Saliency maps are used to highlight important regions in an image and have been found a user-friendly explanation method for deep learning-based imaging tasks.
  46. [46]
    [PDF] Saliency Driven Perceptual Image Compression - CVF Open Access
    This paper proposes a new end-to-end trainable model for lossy image compression, which includes several novel components. The method incorporates 1) an ...
  47. [47]
    (PDF) Saliency Based Image Cropping - ResearchGate
    Aug 7, 2025 · Image cropping is a technique that is used to select the most relevant areas of an image, discarding the useless ones.
  48. [48]
    A Framework for Video Summarization using Visual Attention ...
    This paper proposes Histogram based Weighted Fusion (HWF) algorithm that uses spatial and temporal saliency maps to act as guidance in creating the summary of ...
  49. [49]
    Real-time adjustment of contrast saliency for improved information ...
    Aug 7, 2025 · In this work, we present a technique based on image saliency analysis to improve the conspicuity of the foreground augmentation to the ...
  50. [50]
    Automatic video scene text detection based on saliency edge map
The saliency map is conducive to detecting the text with cluttered backgrounds whereas the edge map is suitable for detecting the scene text with low resolution ...
  51. [51]
    Evaluation - MIT/Tuebingen Saliency Benchmark
    More precisely, each metric is evaluated with the saliency map which the model itself predicts to have highest metric performance. This will result in models ...
  52. [52]
    Information-theoretic model comparison unifies saliency metrics
Dec 10, 2015 · A major problem known to the field is that existing model comparison metrics give inconsistent results, causing confusion. We argue that the ...
  53. [53]
    The DIEM Project | Dynamic Images and Eye Movements
The DIEM project is an investigation of how people look and see. DIEM has so far collected data from over 250 participants watching 85 different videos.
  54. [54]
    [PDF] SALICON: Saliency in Context - CVF Open Access
    Saliency in Context (SALICON) is an ongoing effort that aims at understanding and predicting visual attention. This paper presents a new method to collect ...