Scratchpad memory
Scratchpad memory (SPM), also known as scratchpad RAM or local store, is a high-speed on-chip static random-access memory (SRAM) that is explicitly managed by software, allowing programmers or compilers to allocate and access data directly without hardware intervention.[1] Unlike caches, which rely on automatic hardware mechanisms for data eviction and coherence, SPM occupies a distinct address space with fixed access latencies, enabling predictable performance in time-critical applications.[2]
SPM gained prominence as a viable alternative to caches in embedded systems during the early 2000s because it avoids cache overheads, eliminating the complex tag-comparison and miss-detection circuitry.[1] It provides significant efficiency gains, including an average energy reduction of 40% and area savings of 34% compared to equivalent cache configurations, making it well suited to power-constrained devices such as mobile phones, digital signal processors, and wireless communication systems.[1]
In modern architectures, SPM remains prominent in multicore processors and specialized accelerators, particularly for deep neural networks, where explicit data management enables reuse buffers and minimizes off-chip memory accesses, yielding performance improvements of up to three orders of magnitude over traditional CPU-based processing.[3] Its software-controlled nature supports compiler optimizations and dynamic allocation techniques, broadening its applicability in real-time and domain-specific computing environments, including recent advances in neural processing units and GPU architectures as of 2025.[2][4]
Fundamentals
Definition and Characteristics
Scratchpad memory is a high-speed, software-managed on-chip static random-access memory (SRAM) that serves as temporary storage for data and instructions directly accessible by the processor core.[5] Unlike caches, it lacks automatic hardware mechanisms for data placement and eviction, requiring explicit programmer or compiler control to load and unload content.[6] This design positions scratchpad memory in the memory hierarchy between processor registers and main memory, providing low-latency access to critical program elements in embedded and resource-constrained systems.[5]
Key characteristics of scratchpad memory include its fixed capacity, typically ranging from 1 KB to 512 KB, which supports direct addressing without the tag arrays or associativity logic found in caches.[5] Access times are highly predictable, as there are no miss penalties or coherence overheads; every valid address within the scratchpad yields a deterministic hit latency, often comparable to or better than L1 cache access due to simplified circuitry.[7] This predictability stems from the absence of hardware-managed replacement policies, making it particularly suitable for real-time applications where timing guarantees are essential.[6]
Unlike general-purpose memory structures, scratchpad memory focuses on minimizing latency for frequently accessed data in power- and area-limited environments, such as embedded processors, by serving as a software-controlled buffer.[5] Its basic operational principle involves explicit data movement via software instructions or direct memory access (DMA), ensuring that only selected program segments reside on-chip at any time and enabling fully deterministic execution without the variability introduced by cache misses.[6]
Historical Development
The concept of scratchpad memory originated in the late 1950s and early 1960s as a form of fast, modifiable on-chip storage to support control functions in early computing systems. Honeywell pioneered its use with the H-800 system, announced in 1958 and first installed in 1960, which incorporated a 256-word core-based scratchpad for multiprogram control, enabling efficient task switching without relying solely on slower main memory.[8] By 1965, Honeywell's Series 200 minicomputers integrated scratchpad memories of varying sizes (up to 64 locations) as control storage, offering access speeds 2 to 6 times faster than main memory to enhance throughput in business applications.[8] A significant milestone came in 1966 with the Honeywell Model 4200 minicomputer, which utilized the TMC3162, a 16-bit bipolar TTL scratchpad memory developed by Transitron and second-sourced by multiple manufacturers including Fairchild, Sylvania, and Texas Instruments; this marked one of the first commercial semiconductor implementations of scratchpad for high-speed needs.[9]
The 1980s saw widespread proliferation of scratchpad memory in digital signal processors (DSPs) for real-time applications, driven by the need for deterministic performance in embedded systems. Texas Instruments' TMS320 series, launched in 1983, incorporated on-chip scratchpad RAM as auxiliary storage for temporary data, complementing program and data memories to enable high-speed filtering and processing without external memory delays.[10] This design choice in the TMS32010 and subsequent models facilitated efficient algorithmic implementations in telecommunications and audio processing, establishing scratchpad as a staple in DSP architectures.
During the 1990s and 2000s, scratchpad memory expanded into embedded and multicore systems, particularly with the rise of power-constrained devices. A key example is the IBM Cell Broadband Engine, designed starting in 2001 through the STI alliance (IBM, Sony, Toshiba), which featured 256 KB of local store per Synergistic Processing Unit (SPU) as explicitly managed scratchpad memory to support parallel workloads in gaming and scientific computing.[11] This architecture, first shipped in Sony's PlayStation 3 in 2006, demonstrated scratchpad's efficacy in reducing memory latency for vector operations across multiple cores.
Post-2010 developments have integrated scratchpad into graphics processing units (GPUs) and explored hybrid designs for improved energy efficiency. NVIDIA's GPU architectures, such as those in the Kepler series from 2012 onward, treat shared memory as a configurable scratchpad, allowing programmers to allocate on-chip SRAM explicitly for thread-block data sharing, enhancing performance in parallel compute tasks.[12] Concurrent research has focused on hybrid cache-scratchpad systems, where portions of cache are dynamically repurposed as software-managed scratchpad to minimize energy consumption; for instance, adaptive schemes remap high-demand blocks to scratchpad, achieving up to 25% energy savings in embedded processors while maintaining hit rates.[13]
Design and Operation
Software Management Techniques
Software management techniques for scratchpad memory (SPM) primarily involve explicit, compiler-directed, and dynamic strategies for allocating data and code, ensuring efficient use of this software-controlled on-chip storage. Explicit allocation requires programmers or compilers to specify placements using language directives, such as pragmas in C (e.g., #pragma scratchpad), or runtime application programming interfaces (APIs) that map variables or functions to SPM regions. This approach allows precise control over data placement based on access patterns; the placement problem is often formulated as an optimization problem solved via integer linear programming (ILP), minimizing access times by assigning global and stack variables to SPM while respecting capacity constraints. For instance, the ILP model uses binary variables to decide allocations, incorporating profile-guided access frequencies, and achieves up to 44% runtime reduction through distributed stack management in embedded systems.[14]
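A minimal sketch of such a knapsack-style formulation, using assumed notation rather than that of the cited work: let x_i = 1 if data object i is placed in SPM, f_i be its profiled access count, s_i its size, c_main and c_spm the per-access costs of main memory and SPM, and C the SPM capacity.

```latex
\max_{x}\; \sum_{i} f_i \,\bigl(c_{\mathrm{main}} - c_{\mathrm{spm}}\bigr)\, x_i
\quad \text{subject to} \quad
\sum_{i} s_i\, x_i \le C, \qquad x_i \in \{0,1\}
```

Maximizing the saved access cost under the capacity constraint captures the core trade-off; published formulations may add further terms, such as copy-in/copy-out transfer costs for dynamic placements or separate constraints for stack regions.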
Compiler-based techniques leverage static analysis to automate SPM allocation, analyzing variable lifetimes, access frequencies, and interferences to map frequently accessed ("hot") data to SPM for performance gains. These methods profile program execution to identify liveness intervals and prioritize placements that reduce energy consumption, such as assigning basic blocks or functions to SPM banks, yielding up to 22% energy savings in embedded applications. Graph coloring extends this by modeling allocation as an interference graph where nodes represent data objects and edges denote overlapping lifetimes; colors correspond to SPM "registers" of fixed sizes, resolved via standard coloring algorithms adapted from register allocation to handle conflicts and ensure non-overlapping assignments. This technique partitions SPM into alignment-based units, splits live ranges at loop boundaries for better fit, and improves runtime by optimizing for smaller SPM sizes, as demonstrated in benchmarks like "untoast" where it enhances utilization without manual intervention.[2][15]
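The sketch below illustrates the interference-graph idea in a deliberately simplified form: a greedy coloring over a hand-written interference matrix, where each color stands for one fixed-size SPM slot and an object that cannot be colored stays in main memory. The object count, slot count, and matrix are illustrative assumptions; real allocators also weigh access frequencies and split live ranges as described above.

```c
/* Greedy coloring of an interference graph: each "color" is one fixed-size
 * SPM slot; objects with overlapping lifetimes must not share a slot. */
#include <stdio.h>

#define N_OBJECTS 5   /* candidate data objects (illustrative)           */
#define N_SLOTS   3   /* SPM partitioned into equal-size slots (assumed) */

/* interfere[i][j] == 1 when objects i and j have overlapping lifetimes */
static const int interfere[N_OBJECTS][N_OBJECTS] = {
    {0, 1, 0, 0, 1},
    {1, 0, 1, 0, 0},
    {0, 1, 0, 1, 0},
    {0, 0, 1, 0, 1},
    {1, 0, 0, 1, 0},
};

int main(void) {
    int slot[N_OBJECTS];                      /* assigned slot, -1 = spilled */
    for (int i = 0; i < N_OBJECTS; i++) {
        int used[N_SLOTS] = {0};
        for (int j = 0; j < i; j++)           /* slots held by neighbours    */
            if (interfere[i][j] && slot[j] >= 0)
                used[slot[j]] = 1;
        slot[i] = -1;
        for (int s = 0; s < N_SLOTS; s++)     /* take the lowest free slot   */
            if (!used[s]) { slot[i] = s; break; }
        if (slot[i] >= 0)
            printf("object %d -> SPM slot %d\n", i, slot[i]);
        else
            printf("object %d -> spilled to main memory\n", i);
    }
    return 0;
}
```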
Dynamic allocation methods enable runtime adaptation, particularly in multitasking environments, using compiler-inserted code or operating system (OS) support to load and evict data based on heuristics like access costs and future usage predictions. These approaches construct a data-program relationship graph to timestamp memory objects and greedily select transfers from off-chip memory to SPM at program points, avoiding runtime overheads like cache tags while maintaining predictability. In pointer-based applications, runtime SPM management can reduce execution time by 11-38% (average 31%) and DRAM accesses by 61% compared to static methods, with optimizations for dead data exclusion further lowering energy by up to 31%. OS-level support may involve adaptive loading via system calls, ensuring portability across varying workloads.[16]
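As a simplified illustration of the kind of code a compiler might insert at a program point, the sketch below processes a hot array one tile at a time, copying each tile into scratchpad before the loop body and writing it back afterwards. The scratchpad base address is an assumed platform constant, and memcpy stands in for the DMA transfer a real system would issue.

```c
/* Compiler-inserted style staging of a hot array tile through scratchpad. */
#include <stdint.h>
#include <string.h>

#define SPM_BASE ((int32_t *)0x20000000u)  /* assumed scratchpad base address */
#define TILE     256                        /* elements staged per transfer    */

void scale_in_spm(int32_t *data, size_t n, int32_t k) {
    int32_t *spm = SPM_BASE;
    for (size_t off = 0; off < n; off += TILE) {
        size_t len = (n - off < TILE) ? (n - off) : TILE;
        memcpy(spm, data + off, len * sizeof *spm);   /* load tile into SPM   */
        for (size_t i = 0; i < len; i++)              /* compute at SPM speed */
            spm[i] *= k;
        memcpy(data + off, spm, len * sizeof *spm);   /* write tile back      */
    }
}
```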
Tools and frameworks facilitate these techniques through integrated compiler passes and simulators. Compiler frameworks like LLVM incorporate SPM allocation passes that perform static analysis and graph-based optimizations during code generation, enabling seamless integration with build systems for hybrid memory management. For energy profiling, simulation tools such as CACTI model SPM access energies and leakage, providing estimates for design space exploration; it computes capacitances and power based on technology parameters, supporting evaluations that confirm SPM's 20-30% lower energy than caches for equivalent sizes. Additionally, methods handling compile-time-unknown SPM sizes use binary search or OS queries within compiler flows to generate portable binaries, maintaining near-optimal allocations across hardware variants.[17][18][19]
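As a sketch of how a single binary can adapt to an SPM size unknown at compile time, the example below queries the available capacity at start-up and picks the largest precomputed allocation plan that fits. Both spm_query_size() and the plan table are hypothetical illustrations, not an API from the cited tools.

```c
/* Start-up selection of an SPM allocation plan for a size-portable binary. */
#include <stddef.h>
#include <stdio.h>

struct spm_plan {
    size_t spm_bytes;          /* capacity this plan was generated for      */
    const char *hot_objects;   /* summary of what the plan places in SPM    */
};

static const struct spm_plan plans[] = {   /* sorted by required capacity   */
    {  4096, "stack frames of hottest functions"      },
    { 16384, "stack frames + global lookup table"     },
    { 65536, "stack frames + table + hot loop bodies" },
};

/* Stub standing in for a hypothetical OS query of scratchpad capacity. */
static size_t spm_query_size(void) { return 16384; }

static const struct spm_plan *select_plan(void) {
    size_t avail = spm_query_size();
    const struct spm_plan *best = NULL;
    for (size_t i = 0; i < sizeof plans / sizeof plans[0]; i++)
        if (plans[i].spm_bytes <= avail)   /* largest plan that still fits   */
            best = &plans[i];
    return best;
}

int main(void) {
    const struct spm_plan *p = select_plan();
    if (p)
        printf("using %zu-byte plan: %s\n", p->spm_bytes, p->hot_objects);
    else
        printf("scratchpad too small; keeping all objects in main memory\n");
    return 0;
}
```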