
False sharing

False sharing is a performance issue in systems with cache coherence protocols, occurring when multiple processors or threads access different data elements that happen to reside within the same cache line or coherence block, thereby triggering unnecessary invalidations and coherence operations despite no actual data sharing between the accesses. This phenomenon arises primarily in shared-memory multiprocessors where hardware maintains consistency across cores by treating the entire cache line (typically 64 bytes) as the unit of coherence, causing writes to one data item to evict or invalidate the line for other processors accessing unrelated data in the same line. As a result, false sharing can significantly degrade application performance and throughput, leading to increased cache misses, higher latency from main memory accesses, and serialized execution patterns that undermine the benefits of multicore architectures.

In multithreaded programs, false sharing is particularly prevalent when global or shared data structures are accessed concurrently without proper alignment, such as counters or locks placed adjacently in memory without padding to separate them across cache lines. For instance, in kernel code like the Linux mmap_lock within mm_struct, modifications to a frequently updated reference count can inadvertently affect reads of adjacent read-only fields, forcing repeated cache line reloads across CPUs. Experimental analyses have shown that eliminating false sharing can yield performance improvements of an order of magnitude in affected workloads, highlighting its critical impact on shared-memory systems.

Mitigation strategies focus on data layout optimization and protocol enhancements to minimize unnecessary coherence traffic. Common techniques include manual padding of data structures with unused bytes so that frequently accessed variables occupy separate cache lines, as demonstrated in multithreaded applications where adding buffers between objects improved performance by up to sixfold on dual-core systems. Compiler-assisted approaches, such as automatic data reorganization, and relaxed consistency models (e.g., in distributed shared-memory systems like Munin) can reduce the delays associated with false sharing. Advanced detection tools, including hardware performance counters for cache misses and sampling-based profilers, enable identification of false sharing hotspots, while runtime adaptations like dynamic coherence block sizing further address it in production environments.

CPU Cache Basics

Cache Lines

A cache line represents the smallest unit of data transfer between main memory and the CPU cache, consisting of a fixed-size block that is loaded or evicted as a whole during cache operations. This design keeps data access efficient by minimizing the number of transfers required for small, localized memory requests. In modern processors, cache lines are typically 64 bytes long, allowing the cache to hold multiple words of data that are likely to be accessed together. When the CPU requests data not present in the cache, a cache miss triggers the loading of an entire line from main memory into the cache, even if only a single byte or word is needed. This mechanism exploits spatial locality, the principle that programs tend to access data elements close in memory to recently used ones, thereby prefetching adjacent data that may be required soon and reducing future misses. Eviction occurs similarly in fixed-size blocks when the cache is full, often using replacement policies like least recently used (LRU) to select which line to remove, further emphasizing the block-based nature of cache management.

Cache lines are organized within the cache using associativity schemes, such as set-associative mapping, which divides the cache into sets of lines (or ways) to balance speed and flexibility in data placement. In a k-way set-associative cache, each memory block maps to one specific set but can reside in any of the k lines within that set, determined by the address's index field, while the tag field verifies the exact match. This organization mitigates the conflicts that arise in simpler direct-mapped caches, where each block maps to exactly one line, improving overall hit rates without the full search overhead of fully associative designs.

Hardware implementations vary, but Intel x86 processors use 64-byte cache lines across the L1, L2, and L3 levels to optimize for common workloads. Similarly, ARM architectures, such as those in the Cortex-A series, standardize on 64-byte lines for efficient data handling in embedded and high-performance systems. In contrast, older processors like certain Alpha and SPARC64 models employed 32-byte cache lines, reflecting earlier trade-offs in bandwidth and locality assumptions before the shift to larger blocks in contemporary designs.
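
As a minimal illustration of these ideas, the following C++ sketch queries the L1 data cache line size via the Linux/glibc sysconf extension and decomposes an address into the offset, set index, and tag fields described above; the geometry of 64 sets is an assumption chosen only for the example:
cpp
#include <unistd.h>   // sysconf(_SC_LEVEL1_DCACHE_LINESIZE), a glibc extension
#include <cstdint>
#include <cstdio>

int main() {
    // Query the L1 data cache line size at runtime (Linux/glibc).
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line_size <= 0) line_size = 64;   // fall back to the common 64-byte value

    // Hypothetical geometry used only for illustration: 64 sets per cache.
    const std::uintptr_t sets = 64;

    int data = 0;
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(&data);

    // Address decomposition for a set-associative cache:
    // offset selects the byte within the line, index selects the set,
    // and the remaining high bits form the tag compared on lookup.
    std::uintptr_t offset = addr % static_cast<std::uintptr_t>(line_size);
    std::uintptr_t index  = (addr / static_cast<std::uintptr_t>(line_size)) % sets;
    std::uintptr_t tag    = addr / (static_cast<std::uintptr_t>(line_size) * sets);

    std::printf("line size: %ld bytes, offset: %llu, set index: %llu, tag: %llu\n",
                line_size,
                (unsigned long long)offset,
                (unsigned long long)index,
                (unsigned long long)tag);
    return 0;
}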

Coherency Protocols

In symmetric multiprocessing (SMP) systems, where multiple processors share a common memory address space, the coherency problem arises because each processor maintains a local cache to reduce memory latency, potentially leading to multiple copies of the same block across caches. Without coordination, a processor writing to its cached copy may not update other caches, causing inconsistencies when another processor reads stale data. Cache coherency protocols address this by ensuring that all caches observe a consistent view of memory, typically through mechanisms that track and propagate updates or invalidations for shared blocks, usually at the granularity of cache lines.

One of the most common protocols is MESI (Modified, Exclusive, Shared, Invalid), an invalidate-based scheme used in many bus-based SMP systems to manage cache line states. In the Modified state, a cache line has been updated locally and is the only valid copy, differing from main memory; Exclusive indicates a clean, unique copy matching main memory; Shared means the line is present in multiple caches without modifications; and Invalid marks the line as unusable, requiring a fetch on access. State transitions occur via bus snooping, where each cache controller monitors (snoops) bus transactions: for example, a processor's read miss triggers a Bus Read (BusRd) request, transitioning the line to Exclusive if no other cache responds or to Shared if another provides the data; a write to a Shared line issues a Bus Upgrade (BusUpgr) to invalidate other copies and moves to Modified, while a write to an Exclusive line transitions silently to Modified. These transitions ensure serialization of accesses, preventing conflicts while minimizing bus traffic through write-back policies.

An extension, MOESI, adds an Owned state to MESI, allowing a modified line to be supplied to other caches without immediately writing back to memory, reducing latency in certain sharing patterns. This state is particularly useful in AMD processors, where the Opteron (Hammer) and later architectures implement MOESI to optimize data transfers between caches, as the owning cache can respond to read requests directly while retaining responsibility for the eventual write-back. Transitions in MOESI mirror MESI but include paths like Modified to Owned on snooped reads, enabling efficient sharing without full memory intervention.

For large-scale systems beyond bus-based snooping, such as non-uniform memory access (NUMA) architectures with many processors, directory-based coherency replaces snooping to improve scalability. A centralized or distributed directory tracks the location and state (often using MESI-like tags) of each cache line across nodes, responding to requests with point-to-point messages rather than broadcasts; for instance, a write invalidates the sharers listed in the directory, avoiding the bandwidth explosion of snooping across hundreds of caches. This approach suits NUMA by localizing traffic to nearby nodes, though it introduces directory storage overhead.

Implementing these protocols incurs overhead, primarily from snoop traffic in bus-based systems, where every memory transaction is broadcast, leading to increased bus contention and energy use as processor counts grow. Invalidation messages further contribute, as writes trigger broadcasts to flush or update remote copies, potentially accounting for 50% or more of misses in shared workloads due to coherence actions alone. Directory protocols mitigate this by limiting messages to the caches involved, but add complexity in directory management and potential latency from indirection.
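
As a rough illustration of the MESI transitions described above, the following self-contained C++ sketch models the state of a single cache line from one cache controller's point of view; the event names and the simplified transition function are expository assumptions, not a description of any specific processor's implementation:
cpp
#include <cstdio>

// Simplified MESI model for one cache line, as seen by one cache controller.
enum class State { Modified, Exclusive, Shared, Invalid };

enum class Event {
    LocalRead,    // this core reads the line
    LocalWrite,   // this core writes the line
    SnoopBusRd,   // another cache issues a read (BusRd) for the line
    SnoopBusRdX   // another cache issues a read-for-ownership or upgrade
};

// Returns the next state; `others_have_copy` matters only on a local read miss.
State next_state(State s, Event e, bool others_have_copy) {
    switch (e) {
    case Event::LocalRead:
        if (s == State::Invalid)             // read miss: issue BusRd on the bus
            return others_have_copy ? State::Shared : State::Exclusive;
        return s;                            // read hits do not change the state
    case Event::LocalWrite:
        return State::Modified;              // Shared/Invalid would issue BusUpgr/BusRdX first
    case Event::SnoopBusRd:
        if (s == State::Modified || s == State::Exclusive)
            return State::Shared;            // supply or write back data, drop to Shared
        return s;
    case Event::SnoopBusRdX:
        return State::Invalid;               // another writer takes exclusive ownership
    }
    return s;
}

int main() {
    State s = State::Invalid;
    s = next_state(s, Event::LocalRead,  /*others_have_copy=*/false);  // -> Exclusive
    s = next_state(s, Event::LocalWrite, false);                       // -> Modified
    s = next_state(s, Event::SnoopBusRd, false);                       // -> Shared
    std::printf("final state: %d\n", static_cast<int>(s));
    return 0;
}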

Core Concepts of False Sharing

Definition and Causes

False sharing is a source of performance degradation in multi-threaded programs running on shared-memory multiprocessor systems, where multiple threads access distinct data elements that reside within the same cache line, leading to unnecessary cache line invalidations and transfers despite no actual data dependency between the threads. This phenomenon arises in environments employing shared memory models, such as those implemented via POSIX threads (pthreads) or OpenMP, where threads execute concurrently on different processor cores with private caches but share a common address space.

The root causes of false sharing stem from the interplay between thread scheduling, data layout, and hardware cache coherency mechanisms. Threads are typically affinity-bound to specific cores to minimize context-switching overhead, ensuring that each operates on a dedicated core with its own local cache. Write operations by one core to its data invalidate the entire cache line in other cores' caches to maintain coherency, as enforced by protocols such as MESI, which treat the line as the atomic unit of sharing rather than individual data words. This enforcement, while essential for correctness, induces overhead when threads interleave writes to non-overlapping offsets within the same line, amplifying inter-core communication and cache misses.

The mechanism unfolds as follows: consider two threads, A and B, accessing variables at offsets 0 and 32 (assuming a 64-byte cache line) in the same line. Thread A reads and writes its variable, placing the line in its cache in the exclusive (and, after the write, modified) state. When Thread B subsequently writes to its variable, the coherency protocol invalidates Thread A's copy, forcing Thread A to reload the line from memory or another cache on its next access. This results in "ping-ponging," where the cache line repeatedly migrates between cores, incurring high latency from invalidation requests and data fetches.

False sharing first gained prominence in the 1990s with the rise of shared-memory multiprocessors, as evidenced in early performance analyses of benchmarks such as those in the SPEC suite, where it contributed to elevated miss rates in parallel workloads. Seminal studies from that era, such as those examining cache block sizes and spatial locality, quantified its impact, showing that false sharing misses grew with larger cache lines and interleaved access patterns in applications like graph algorithms.

True vs. False Sharing

True sharing occurs when multiple threads or processors legitimately access the same data item, such as a shared counter, with at least one access involving a write, necessitating synchronization mechanisms like locks or barriers to ensure correct semantics and prevent issues like race conditions. In contrast, false sharing arises when threads access distinct data elements that happen to reside within the same cache line, triggering unintended coherence traffic without any actual data dependency or semantic conflict. The key distinction lies in their implications: true sharing raises fundamental correctness concerns, such as atomicity violations where concurrent writes to the same variable can corrupt data or lead to inconsistent reads, whereas false sharing is purely a performance artifact that induces cache line invalidations and thrashing without risking data races or logical errors.

A classic example of true sharing pitfalls involves two threads incrementing a shared counter without proper synchronization, resulting in lost updates due to non-atomic operations and violating the program's intended logic. False sharing, however, manifests as cache line thrashing when one thread writes to its private variable, invalidating the entire line for other threads accessing unrelated variables in the same line, leading to repeated misses and coherence protocol overhead. This difference underscores that true sharing requires algorithmic fixes for correctness, while false sharing demands awareness of memory layout to avoid incidental performance degradation.

False sharing can sometimes mimic true sharing in scenarios involving read-only accesses, where multiple threads read from distinct elements in the same cache line; although far less severe than write-induced cases since no invalidations occur, it can still waste bandwidth through redundant cache line transfers across cores. In such read-only cases, the lack of writes means no ping-ponging of ownership, but the spatial proximity still amplifies traffic unnecessarily. Detecting these phenomena poses challenges, as both true and false sharing often manifest as cache contention or elevated coherence misses in standard profilers, requiring deeper analysis of memory access patterns and data layouts to differentiate them, such as using tools that track per-cache-line sharing types or simulate different block granularities. For instance, while true sharing may show overlapping addresses in contention reports, false sharing demands examination of offsets within cache lines to confirm independent accesses. This necessitates specialized profiling, like cache-line bounce tracking, to avoid misattributing performance issues.
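
To make the correctness side of the distinction concrete, the following short sketch (using std::thread rather than the pthreads API used later in this article) deliberately races two threads on a single shared counter and contrasts it with a std::atomic counter; both counters are truly shared, and only the atomic one yields the correct total. The false-sharing counterpart, where each thread owns a distinct counter in the same cache line, appears in the Code Illustration section below:
cpp
#include <atomic>
#include <iostream>
#include <thread>

int unsafe_counter = 0;               // true sharing with an intentional data race
std::atomic<int> safe_counter{0};     // true sharing made correct with atomics

void work() {
    for (int i = 0; i < 1000000; ++i) {
        unsafe_counter++;                                       // lost updates are possible
        safe_counter.fetch_add(1, std::memory_order_relaxed);   // always counted exactly
    }
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // The unsafe total is typically less than 2,000,000; the atomic total is exact.
    std::cout << "unsafe: " << unsafe_counter
              << " atomic: " << safe_counter.load() << std::endl;
    return 0;
}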

Practical Examples

Code Illustration

A classic example of false sharing can be demonstrated using a simple multithreaded C++ program where two threads independently increment separate integer counters within a shared struct. On modern x86 architectures, where cache lines are typically 64 bytes, the two 4-byte integers fit within the same line, leading to unnecessary coherence traffic as each thread's write invalidates the line for the other. The following code snippet uses POSIX threads (pthreads) to create two threads, each performing one million increments on its respective counter (a or b) in a shared Counters struct:
cpp
#include <pthread.h>
#include <iostream>
#include <atomic>  // For thread-safe increments, though not preventing false sharing

struct Counters {
    int a = 0;
    int b = 0;
};

void* increment_a(void* arg) {
    Counters* counters = static_cast<Counters*>(arg);
    for (int i = 0; i < 1000000; ++i) {
        counters->a++;
    }
    return nullptr;
}

void* increment_b(void* arg) {
    Counters* counters = static_cast<Counters*>(arg);
    for (int i = 0; i < 1000000; ++i) {
        counters->b++;
    }
    return nullptr;
}

int main() {
    Counters counters;
    pthread_t thread1, thread2;
    pthread_create(&thread1, nullptr, increment_a, &counters);
    pthread_create(&thread2, nullptr, increment_b, &counters);
    pthread_join(thread1, nullptr);
    pthread_join(thread2, nullptr);
    std::cout << "a: " << counters.a << ", b: " << counters.b << std::endl;
    return 0;
}
This program compiles with g++ -O2 -pthread example.cpp -o example and runs on a multi-core x86 system. The increments are not atomic here for simplicity, but the issue persists even with atomics; note that at -O2 the optimizer may collapse each loop into a single addition, so in practice the counters are often declared std::atomic or volatile to keep one store per iteration and make the effect observable. With per-iteration stores, false sharing causes each increment to trigger cache line invalidations, serializing the threads' progress: the line holding both counters ping-pongs between the two cores, and the combined throughput can drop by roughly 10x compared to an ideal non-contended case in which each counter resides in its own cache line. To demonstrate the impact, consider a variant where padding is added to separate the counters onto different cache lines:
cpp
struct alignas(64) PaddedCounters {  // align the struct itself to a 64-byte boundary
    int a = 0;
    char padding1[60];  // pad so that b starts at offset 64, on the next cache line
    int b = 0;
    char padding2[60];  // keep adjacent objects off b's cache line
};
Replacing Counters with PaddedCounters in the above places a and b on separate cache lines, eliminating false sharing. This modification yields a speedup of up to 10x, restoring per-thread throughput to near the uncontended level. This example assumes a multi-core x86 system (e.g., an Intel Core i7 with 4 cores), compilation at optimization level -O2, and execution with the two threads pinned to different cores using taskset for clarity. To observe the cache behavior empirically, tools like Linux perf (e.g., perf stat -e cache-misses ./example) or Intel VTune can profile the execution, revealing elevated cache line invalidations in the false-sharing version compared with the padded one.

Performance Effects

False sharing induces significant performance overhead by elevating cache misses, as concurrent writes to distinct variables within the same cache line trigger unnecessary coherence traffic across processors, thereby increasing interconnect bandwidth consumption. This results in amplified latency on memory accesses, exacerbating non-parallelizable portions of workloads and diminishing overall scalability in line with Amdahl's law. In multi-threaded applications, such effects can waste substantial CPU cycles on coherence maintenance, with benchmarks indicating up to an order-of-magnitude degradation in execution time due to repeated cache line migrations.

Quantitative metrics highlight the severity: in synthetic tight loops with frequent updates, false sharing can provoke latency spikes exceeding 100x compared to aligned accesses, as threads stall awaiting cache line transfers. In benchmark studies, affected workloads such as linear_regression exhibit up to 9x slowdowns from coherence overhead, while PARSEC's streamcluster sees a 5.4% performance loss, underscoring how even moderate contention can consume 20-50% of cycles in invalidation handling for data-intensive tasks. These impacts scale poorly in NUMA systems, where cross-node coherence misses degrade throughput as core counts increase, often doubling remote access latencies. Similarly, workloads in graphics pipelines, such as PARSEC's bodytrack, experience elevated last-level cache (LLC) misses from shared image buffers, amplifying slowdowns by 15-25% on multi-socket systems. These effects can be profiled using hardware performance counters, with tools like perf c2c quantifying LLC misses and HITM (hit to modified) events to isolate false sharing contributions.

Mitigation Approaches

Padding Techniques

Data padding involves inserting unused bytes, often referred to as dummy padding, between variables that are frequently accessed by different threads to ensure each resides in its own cache line, thereby preventing false sharing. This low-level software approach aligns data structures to cache line boundaries, typically 64 bytes on modern x86 architectures, to isolate modifications and reduce unnecessary coherence traffic. Early research identified false sharing as a significant source of cache misses in shared-memory systems and proposed padding records with dummy words to separate scalars or array elements across cache blocks, achieving average reductions of about 10% in shared misses across benchmarks.

In practice, padding can be implemented manually by adding padding fields, such as arrays of bytes sized to fill the remainder of a cache line (e.g., char pad[64 - sizeof(int)];), or through compiler directives for automatic alignment. For instance, the GNU Compiler Collection (GCC) and Clang support the __attribute__((aligned(64))) attribute on structures or variables to enforce cache line alignment, ensuring that each instance starts at a multiple of 64 bytes and avoids overlap with adjacent data. This method is straightforward to apply locally in code, particularly for thread-private variables in parallel loops or data structures like per-thread counters.

The advantages of data padding include its simplicity and effectiveness in eliminating false sharing with minimal code changes, often yielding substantial performance improvements. However, it increases the memory footprint by allocating unused space, which can lead to excessive memory consumption, particularly in large arrays or numerous objects, potentially bloating data structures by tens of percent depending on the application. Cache line coloring complements padding by strategically assigning data to specific equivalence classes (or "colors") in set-associative caches, ensuring that frequently accessed items from different threads map to distinct sets and lines to minimize both conflict misses and false sharing. This technique, which involves offset-based placement during memory allocation, has been used in operating system kernel object caches to isolate hot data and reduce overhead without always requiring explicit padding. Such padding and coloring methods trace their roots to foundational studies on multiprocessor cache behavior in the 1990s but gained prominence in high-performance computing during the 2000s, where they were routinely applied to optimize benchmarks like the NAS Parallel Benchmarks for scalable parallel performance on shared-memory clusters.
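
As a concrete sketch of the techniques named above, the following C++ example pads per-thread counters three ways: with a manual byte array, with GCC/Clang's __attribute__((aligned(64))), and with the C++17 constant std::hardware_destructive_interference_size from <new> (available only on compilers that implement it, e.g., recent GCC and MSVC). The 64-byte figure is an assumption appropriate for common x86 and ARM parts; compile with -std=c++17:
cpp
#include <new>       // std::hardware_destructive_interference_size (C++17, support varies)
#include <cstddef>

// Assumed cache line size; correct for most current x86 and ARM cores.
constexpr std::size_t kLineSize = 64;

// 1. Manual padding: fill the rest of the cache line with unused bytes.
struct PaddedManually {
    long count = 0;
    char pad[kLineSize - sizeof(long)];
};

// 2. GCC/Clang attribute: force each instance onto its own 64-byte boundary.
struct __attribute__((aligned(64))) PaddedWithAttribute {
    long count = 0;
};

// 3. Portable C++17: alignas with the library-provided interference size.
struct alignas(std::hardware_destructive_interference_size) PaddedPortably {
    long count = 0;
};

// Per-thread counters: each element occupies its own cache line,
// so two threads updating counters[0] and counters[1] no longer false-share.
PaddedPortably counters[2];

int main() {
    static_assert(sizeof(PaddedManually) >= kLineSize, "manual padding fills a line");
    static_assert(alignof(PaddedWithAttribute) == 64, "attribute alignment applied");
    counters[0].count++;  // would be done by thread 0
    counters[1].count++;  // would be done by thread 1
    return 0;
}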

Design and Compiler Strategies

To mitigate false sharing, programmers can redesign algorithms to distribute data access patterns across threads in ways that minimize contention on shared cache lines. One effective approach is using thread-local storage, where each thread maintains its own private copy of data structures, such as counters or accumulators, avoiding global variables that multiple threads might update simultaneously. For instance, in parallel loops over arrays, implementing per-thread arrays allows each thread to accumulate results locally before a final reduction phase, reducing unnecessary cache invalidations. This technique has been shown to improve scalability in multithreaded applications, as demonstrated in benchmarks from parallel computing frameworks like OpenMP.

Another strategy involves sharding data across threads, partitioning large datasets so that each thread operates on a disjoint subset, thereby eliminating overlapping cache line accesses. This is particularly useful in data-parallel workloads, such as array operations or simulations, where static or dynamic scheduling ensures balanced load while preventing false sharing on index-based structures such as per-thread counters. Seminal work on parallel algorithms emphasizes sharding for efficiency.

Compiler optimizations play a crucial role in automating false sharing prevention without manual intervention. Modern compilers like LLVM/Clang support alignment attributes similar to GCC's for structures. Additionally, profile-guided optimization (PGO) can analyze runtime access patterns to improve code and data layout, though specific reductions in false sharing vary by application.

At runtime, techniques like affinity pinning bind threads to specific cores or sockets, minimizing cross-NUMA-node traffic that exacerbates false sharing in multi-socket systems. Tools such as numactl or pthread_setaffinity_np allow explicit control, ensuring threads access local memory domains and reducing remote access overhead. Evaluations in NUMA-aware environments show this can improve locality for shared data accesses compared to default scheduling.

Advanced runtime methods include software-controlled cache partitioning, such as Intel's Cache Allocation Technology (CAT), which divides the last-level cache into non-overlapping ways for different threads or processes, isolating their working sets to prevent false sharing-induced evictions. This hardware-assisted feature, available on recent Intel Xeon processors, enables fine-grained control via Linux's resctrl interface, with studies indicating throughput improvements in contended workloads like databases.

These strategies involve trade-offs: algorithmic redesigns such as per-thread privatization increase memory usage and require careful synchronization for reductions, trading simplicity for performance, while compiler automations may reduce portability across toolchains or introduce compilation overhead. In cloud environments such as AWS EC2 instances with multi-socket designs, runtime pinning and partitioning enhance effectiveness but demand environment-specific tuning, as default thread and memory placement can amplify false sharing without it. Overall, the impact of these approaches varies by workload, with greater benefits in compute-intensive applications than in I/O-bound ones.
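
The following sketch illustrates the per-thread accumulation pattern described above, combined with optional core pinning via pthread_setaffinity_np (a GNU/Linux extension): each worker sums into a cache-line-aligned private slot, and the main thread reduces the partial sums at the end, so no two workers ever write to the same cache line. The thread count, loop body, and pinning policy are illustrative assumptions, not a prescription:
cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // for pthread_setaffinity_np / cpu_set_t on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

constexpr int kThreads = 4;
constexpr int kItemsPerThread = 1000000;

// One cache-line-aligned partial sum per thread: no false sharing between slots.
struct alignas(64) Partial { long sum = 0; };
Partial partials[kThreads];

struct Arg { int id; };

void* worker(void* p) {
    int id = static_cast<Arg*>(p)->id;

    // Optionally pin this thread to core `id` to keep its slot in a local cache.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // best effort

    long local = 0;                        // accumulate in a thread-private local
    for (int i = 0; i < kItemsPerThread; ++i)
        local += 1;                        // stand-in for real per-item work
    partials[id].sum = local;              // one write per thread, to its own line
    return nullptr;
}

int main() {
    pthread_t threads[kThreads];
    Arg args[kThreads];
    for (int i = 0; i < kThreads; ++i) {
        args[i].id = i;
        pthread_create(&threads[i], nullptr, worker, &args[i]);
    }
    long total = 0;
    for (int i = 0; i < kThreads; ++i) {
        pthread_join(threads[i], nullptr);
        total += partials[i].sum;          // final reduction on the main thread
    }
    std::printf("total: %ld\n", total);
    return 0;
}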
