
String interning

String interning is a technique for optimizing memory usage in programming languages by maintaining a single shared instance of each unique immutable string value. Identical strings are then represented by the same object reference, which enables faster equality checks via pointer comparison rather than content evaluation. In languages that pool literals, string literals, which are compile-time constants, are automatically pooled and reused across the program, preventing duplicate allocations for repeated values. In Java, for instance, the runtime maintains a private string pool where literals like "example" in different parts of the code, or even across classes, refer to the identical String instance, and the intern() method explicitly adds runtime-created strings to this pool for canonical representation. Similarly, in CPython, identifiers and certain short strings are automatically interned by the interpreter, while the sys.intern() function allows manual interning of other strings to speed up dictionary key lookups by substituting expensive content comparisons with efficient reference equality.

The benefits include substantial memory savings in applications handling large volumes of repeated strings, such as configuration files or log entries, and performance gains in operations like hashing or equality testing, though excessive manual interning can introduce overhead from pool management. Originating from symbol interning in early Lisp dialects, where unique symbols are cataloged in packages for efficient lookup, the concept has been adopted in modern platforms including .NET (via String.Intern()) and other language implementations, underscoring its role in balancing resource efficiency with computational speed.

Core Concepts

Definition and Purpose

String interning is a technique for optimizing the storage and usage of strings by maintaining a single canonical instance for each unique string value in a shared pool, known as the string pool, to which all references to that value are directed. This ensures that identical strings do not occupy multiple memory locations unnecessarily.

The approach relies on strings being immutable, meaning their contents cannot be altered after creation. Immutability prevents unintended modifications when multiple parts of a program share the same instance and allows safe reuse across threads or modules. Without it, a change made through one reference would be visible through every other reference to the same pooled string, leading to unintended side effects or concurrency issues, which makes immutability a foundational requirement for the technique's reliability.

The main purposes of string interning are to reduce the overall memory footprint by avoiding redundant copies of identical strings and to accelerate equality checks by using reference equality (comparing object pointers) rather than slower content-based comparisons that scan characters sequentially. For instance, in programs with numerous repeated strings, like literals embedded in source code or datasets with common identifiers, interning guarantees that only one instance of each value exists in the pool, conserving resources and improving performance in operations involving frequent string matching.

String Pool Mechanism

The string pool functions as a centralized data structure that maintains a collection of unique string instances, typically structured as a hash table where the keys are derived from the string contents and the values are references to the string objects. This design allows for efficient storage and retrieval by associating each distinct string value with a single canonical instance in memory, ensuring that all references to equivalent strings point to the same location.

The interning process begins with computing a hash value for the input string's content, which is then used to index into the hash table. A lookup is performed at the corresponding bucket; if a matching string is found, verified through content comparison for equivalence, the reference to the existing pooled instance is returned immediately. If no match exists, a new string object is allocated, inserted into the table under its hash-derived key, and its reference is returned to the caller. This canonicalization mechanism guarantees that strings deemed equal by value share a single reference, avoiding redundant allocations.

Hash collisions, which occur when distinct strings produce the same hash value and map to the same index, are resolved using standard techniques such as open addressing with linear probing, where subsequent buckets are probed sequentially until an empty slot or the target is found. This approach preserves the integrity of the pool by ensuring each unique string occupies a distinct entry, while maintaining average-case O(1) access time despite occasional clustering. Deleted or obsolete entries may be marked with tombstones to avoid disrupting probe chains for other keys.

Unlike regular string creation, which generates a fresh object allocation for every instance even when contents are identical, the pool enforces reuse of pre-existing objects for matches, thereby establishing value-based uniqueness through reference identity.
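The lookup-or-insert steps described above can be sketched in Python. This is a toy illustration under stated assumptions, not how any particular runtime implements its pool: the class name InternTable and its methods are hypothetical, the table grows at a load factor of one half, and deletion (and therefore tombstones) is omitted.

```python
class InternTable:
    """Toy string pool using open addressing with linear probing."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity   # each slot holds a canonical str or None
        self._count = 0

    def intern(self, s):
        # Grow before the table gets too full so probing always finds a free slot.
        if (self._count + 1) * 2 > len(self._slots):
            self._resize(len(self._slots) * 2)
        index = self._find_slot(self._slots, s)
        existing = self._slots[index]
        if existing is not None:
            return existing               # match found: reuse the pooled instance
        self._slots[index] = s            # no match: s becomes the canonical instance
        self._count += 1
        return s

    @staticmethod
    def _find_slot(slots, s):
        index = hash(s) % len(slots)
        # Probe sequentially until we reach an empty slot or an equal string.
        while slots[index] is not None and slots[index] != s:
            index = (index + 1) % len(slots)
        return index

    def _resize(self, new_capacity):
        # Re-insert every pooled string into a larger slot array.
        new_slots = [None] * new_capacity
        for item in self._slots:
            if item is not None:
                new_slots[self._find_slot(new_slots, item)] = item
        self._slots = new_slots


table = InternTable()
a = table.intern("".join(["po", "ol"]))   # built at runtime
b = table.intern("".join(["po", "ol"]))   # equal content, separate construction
print(a is b)
```

Both calls return the object stored by the first call, so the identity check succeeds even though the two inputs were constructed separately.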

Historical Development

Origins in Early Languages

String interning originated in early Lisp dialects during the late 1950s and 1960s, primarily as a mechanism for managing symbols in symbol tables to enable efficient lookups within interpreters. In these systems, symbols (essentially immutable strings representing identifiers) were stored uniquely in a central registry known as the oblist (object list), ensuring that identical symbols shared the same memory location. This approach facilitated rapid identification and reuse of symbols during symbolic processing, a core requirement for Lisp's list-processing paradigm. The concept emerged from the need to handle symbolic expressions in artificial intelligence applications, where repeated symbol comparisons were frequent.

John McCarthy's foundational 1960 paper on Lisp introduced the essential primitives for this interning-like behavior, including the atom predicate to identify atomic symbols and the eq function to test equivalence between them by comparing their memory locations. Although not explicitly termed "interning," the design ensured that each distinct atomic symbol maintained a unique association list in memory, implicitly supporting deduplication and pointer-based equality for atoms like identifiers or constants. This was implemented in the initial Lisp 1.5 system, where symbols were entered into the oblist upon reading, preventing duplication and optimizing storage in an era of constrained resources. Early motivations stemmed from the demands of symbolic computation in artificial intelligence research, coupled with the limited hardware of the time, with machines offering a maximum of 32K words of memory, which necessitated compact representations to avoid exhaustion during list manipulation and evaluation.

A key milestone in formalizing interning occurred in the 1970s with the development of Lisp machines, particularly through systems like Interlisp. Interlisp's implementation used a hash table-based oblist to store interned symbols, where functions such as pack and mkatom would retrieve or create unique atoms based on their print names, ensuring pointer equality for identical symbols. This structure was integral to efficient garbage collection, as the centralized oblist allowed collectors to trace and reclaim unreferenced symbols alongside other objects, and it enhanced equality testing via direct pointer comparisons with eq. These advancements addressed the growing complexity of programs on dedicated Lisp machines, building on earlier dialects to support larger symbol pools while maintaining performance under evolving hardware constraints.

Evolution in Modern Systems

During the 1980s and 1990s, string interning transitioned from experimental concepts in earlier languages to a standard optimization in object-oriented environments, particularly in Smalltalk and Java. In Smalltalk systems, such as those developed at Xerox PARC, symbols served as interned representations of strings, enabling efficient storage and comparison by maintaining a unique instance for each distinct string value in a global symbol table. This approach influenced subsequent languages by demonstrating interning's role in reducing memory overhead for frequently used string literals in dynamic, object-oriented contexts. Java formalized string interning upon its initial release with JDK 1.0 in 1996, where the String.intern() method created a canonical representation for strings in the Java Virtual Machine (JVM), automatically interning string literals to promote reuse across the application.

Key developments in the late 1990s and early 2000s extended interning to other major languages. Python introduced explicit string interning through the built-in intern() function in its early implementations, allowing developers to manually add strings to an intern pool for performance gains in string-heavy operations, a feature available since at least Python 2.0 in 2000; this was later moved to sys.intern() in Python 3.0 (2008). In C#, string interning became available with the .NET Framework 1.0 in 2002 via the String.Intern() method, which integrates with the Common Language Runtime (CLR) to manage a shared pool of unique string instances, optimizing equality checks and memory usage in managed code.

Post-2000, string interning evolved to accommodate just-in-time (JIT) compilation and expanded character sets, particularly with Unicode adoption. Languages like Python and Java integrated support for Unicode text, including UTF-16 encodings, in their intern pools, ensuring that interned strings could handle international text without fragmentation; for instance, Python's Unicode strings from Python 2.0 (2000) became eligible for interning, while Java's UTF-16-based strings supported supplementary characters via updates in JDK 1.5 (2004). This alignment with JIT environments, such as the JVM, allowed interned strings to benefit from runtime optimizations like inline caching for faster string comparisons.

In Java specifically, prior to JDK 7, interned strings resided in the permanent generation, a fixed-size area outside the main heap, leading to potential OutOfMemoryErrors in long-running applications. This was resolved in JDK 7 (2011) by relocating the string pool to the Java heap, enabling dynamic collection and improved scalability.

Motivations and Benefits

Memory Optimization

String interning achieves memory optimization primarily through deduplication, where multiple references to identical strings share a single canonical allocation in the string pool, thereby eliminating redundant storage for duplicate content. This is particularly beneficial in applications handling large datasets with repetitive text, such as configuration files, log entries, or parsed inputs, where the space savings scale directly with the frequency of duplicates. For instance, interning 10,000 instances of the same string can reduce memory usage by approximately 235 KB by consolidating them into one object.

A representative quantitative example illustrates the potential impact. For 1 million identical strings of 10 characters each, non-interned storage would require roughly 20 MB, under the simplifying assumptions of one byte per character and around 10 bytes of object overhead per string instance beyond the character data; real runtimes typically carry larger per-object headers. Interning confines the usage to a single allocation with negligible extra overhead beyond the initial pool entry. In practice, such savings are observed in Java applications where strings can constitute up to 25% of the heap, making interning a targeted strategy for high-duplication scenarios.

However, the string pool introduces overhead from its underlying hash table structure, which provides O(1) average-case lookup time but incurs a cost per unique string for storage and indexing. In HotSpot JVMs the pool is managed as a native-memory hashtable, with a default size of 60,013 buckets in some versions. This must be balanced, as excessive unique strings can bloat the table without proportional benefits.

Best practices for memory optimization emphasize interning only frequent or short strings to avoid pool bloat; for example, constants or common tokens in parsers benefit most, while unique or long dynamic strings should be excluded to prevent unnecessary overhead. Monitoring tools such as JVM flags for table statistics can help tune the table size via parameters such as -XX:StringTableSize.
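The arithmetic behind such estimates can be sketched in Python. Both helpers below are hypothetical and illustrative: modeled_savings applies the simplified per-instance model from the example above (the 10-byte overhead and one byte per character are assumptions, not measurements), while estimated_savings uses sys.getsizeof to approximate what a list of separately allocated duplicate strings would save in CPython.

```python
import sys
from collections import Counter


def modeled_savings(copies, chars, overhead_bytes=10, bytes_per_char=1):
    """Bytes saved under a simple model when `copies` duplicates collapse to one."""
    per_instance = chars * bytes_per_char + overhead_bytes
    return (copies - 1) * per_instance


def estimated_savings(strings):
    """Estimate bytes saved if every duplicate value shared a single object.

    Assumes each element was allocated separately; sizes come from
    sys.getsizeof, so the result is specific to the running interpreter.
    """
    counts = Counter(strings)
    # One copy of each value must remain; the other (n - 1) copies are redundant.
    return sum(sys.getsizeof(value) * (n - 1) for value, n in counts.items())


print(modeled_savings(1_000_000, 10))     # 999,999 redundant copies x 20 bytes
tokens = ["status=ok"] * 10_000 + ["status=error"] * 500
print(estimated_savings(tokens))
```

The first figure, 19,999,980 bytes, matches the roughly 20 MB of the worked example; the second depends on the interpreter's actual string object sizes.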

Comparison Efficiency

String interning enhances the efficiency of string operations by enabling reference-based comparisons rather than content-based ones. When strings are interned, identical values share the same memory address, allowing equality checks to use a simple pointer comparison, which operates in constant time, O(1). In contrast, comparing non-interned strings requires scanning characters sequentially, resulting in linear time, O(n) in the worst case, where n is the string length. For example, in Java, the == operator on interned strings performs this fast check, while the equals() method is needed otherwise for content verification. Similarly, in CPython, the == operator can short-circuit on an identity check when both operands are the same interned object.

This optimization proves valuable in scenarios with high-frequency string comparisons, such as implementing hash tables (where equality resolves collisions after hash matching), sorting algorithms that repeatedly compare keys, and parsing structured data like XML or JSON documents involving numerous exact-match tests. By reducing the computational overhead of these operations, interning contributes to overall program performance in string-intensive applications. However, these benefits apply solely to exact equality checks; interning does not accelerate fuzzy matching, substring searches, or case-insensitive comparisons, which still demand full content evaluation regardless of shared references.
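A brief Python illustration of the difference between content and identity comparison, using the standard sys.intern() function. The helper build is a hypothetical name used only to construct equal strings at runtime.

```python
import sys


def build(parts):
    # Joining at runtime constructs the string dynamically instead of using a literal.
    return "".join(parts)


a = build(["user", "_", "id"])
b = build(["user", "_", "id"])

print(a == b)          # content comparison: the values are equal

ia = sys.intern(a)     # returns the canonical pooled object for this value
ib = sys.intern(b)
print(ia is ib)        # identity comparison: both names refer to the pooled object
```

Both lines print True. In application code, identity comparison with is is only a reliable substitute for == when both operands are known to have passed through the same intern pool.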

Implementation Strategies

Language-Specific Approaches

In Java, string interning is automatically applied to string literals at compile time and class loading, ensuring that identical literals share the same object in the string pool, which has resided in heap memory since Java 7. Developers can manually intern dynamically created strings using the String.intern() method, which returns a reference to the canonical string from the pool if it exists, or adds the string to the pool otherwise; this pool is maintained privately by the String class.

Python provides manual string interning through the sys.intern() function in the sys module, which adds a string to the interpreter's intern pool, a dictionary-based structure, and returns a reference to it, promoting reuse for performance-critical cases like dictionary keys. In CPython, the reference implementation, identifiers (such as variable names and keywords) and string literals that are valid Python identifiers are automatically interned during compilation and execution to speed up name lookups and comparisons. These rules are implementation details; strings produced by compile-time constant folding, for example, are subject to a length limit of about 4096 characters since Python 3.7.

In C#, the String.Intern() method enables manual interning by retrieving or adding a string to the Common Language Runtime's (CLR) global intern pool. Unlike pools built on weak references, entries in the CLR pool are generally not released until the runtime terminates, so interning many unique strings there can increase long-term memory usage. String literals are automatically interned by the compiler, ensuring identical literals share the same instance across the application.

Other languages lack built-in interning APIs but support custom implementations. JavaScript engines like V8 and SpiderMonkey automatically intern string literals and identifiers for internal efficiency, but there is no standard API for manual interning; developers can approximate it using a Map from string values to canonical instances. A WeakMap cannot be used directly for this purpose because its keys must be objects rather than primitive strings, so a Map-based pool needs its own cleanup strategy to avoid memory leaks. In Go, string interning is not built-in, requiring custom solutions such as maps or the unique package (introduced in Go 1.23), which deduplicates comparable values including strings using weak pointers for garbage collection integration.

| Language | Automatic Interning | Manual Interning Method | Pool Management |
| --- | --- | --- | --- |
| Java | String literals | String.intern() | Heap-based pool (since Java 7); unreferenced entries can be collected |
| Python | Identifiers and identifier-like string literals (CPython implementation detail) | sys.intern() | Dictionary-based intern pool |
| C# | String literals | String.Intern() | Global CLR pool; entries generally persist until the runtime terminates |
| JavaScript | Literals and identifiers (engine-dependent) | Custom via Map | No standard pool; user-managed |
| Go | None | Custom maps or unique package | User-defined; the unique package uses weak pointers |

Custom Pool Management

In languages lacking built-in string interning, developers can implement custom pools using hash-based data structures to map input strings to unique interned instances, ensuring that identical strings share the same memory location. For instance, in C++, std::unordered_map can serve as the underlying container, with std::string as the key and a pointer to a dynamically allocated interned string as the value; this approach allows lookup via hashing before insertion to avoid duplicates. Such DIY implementations provide flexibility for application-specific needs, such as customizing the storage format or integration with existing memory management.

Libraries extend these manual approaches by offering reusable, optimized components. In C++, the Boost.Flyweight library facilitates interning through its flyweight class template, which wraps immutable types like std::string in shared handles, automatically managing references to a single instance per unique value and reducing memory overhead in scenarios with repetitive strings. Similarly, in Rust, third-party interning crates provide safe, multithreaded interning by associating strings with unique keys via a concurrent hash map, relying on shared ownership to ensure memory safety without manual memory management. These libraries handle allocation and deallocation internally, making them suitable for performance-critical code where built-in support is absent.

Key design considerations for custom pools include ensuring thread safety, managing pool size, and seamless integration with the application's codebase. Thread safety can be achieved by wrapping hash map operations in locks, such as std::mutex in C++ or RwLock in Rust, to prevent concurrent insertions from creating duplicates or corrupting the pool. For large pools that risk memory exhaustion, eviction policies like least recently used (LRU) can be incorporated to remove infrequently accessed strings, often implemented via a combination of hash maps and doubly linked lists to track usage order. Integration involves defining clear interfaces for interning and related functions, allowing existing code to adopt pointers or handles to interned strings with minimal refactoring, while avoiding cycles in reference graphs that could complicate lifetime management. A basic intern function in C++ illustrates the core logic using a hash map:
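#include <string>
#include <unordered_map>

// Pool mapping each string value to its single canonical instance.
static std::unordered_map<std::string, const std::string*> pool;

const std::string* intern(const std::string& input) {
    auto it = pool.find(input);
    if (it != pool.end()) {
        return it->second;  // Return existing interned instance
    }
    // Allocate a new canonical instance; it lives for the rest of the program.
    const std::string* new_instance = new std::string(input);
    pool.emplace(input, new_instance);  // Insert into pool
    return new_instance;
}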
This pattern checks for existence before allocation, promoting reuse; in practice, input hashing and collision resolution follow the container's semantics.
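The LRU eviction policy mentioned above can be sketched in Python with collections.OrderedDict, which stands in for the hash map plus linked list combination. The class name BoundedInternPool and its max_entries parameter are hypothetical, chosen for illustration only.

```python
from collections import OrderedDict


class BoundedInternPool:
    """Intern pool that evicts the least recently used entry when full."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()     # value -> canonical instance, oldest first

    def intern(self, s):
        canonical = self._entries.get(s)
        if canonical is not None:
            self._entries.move_to_end(s)  # mark as most recently used
            return canonical
        self._entries[s] = s              # s becomes the canonical instance
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)   # drop the least recently used entry
        return s

    def __len__(self):
        return len(self._entries)


pool = BoundedInternPool(max_entries=2)
first = pool.intern("".join(["al", "pha"]))
again = pool.intern("".join(["al", "pha"]))
print(first is again)                     # same pooled object while it stays resident
pool.intern("beta")
pool.intern("gamma")                      # exceeds the limit, evicting "alpha"
print(len(pool))
```

One design consequence is worth noting: once an entry is evicted, a later call with an equal value installs a different canonical object, so identity comparisons are only meaningful between references obtained while the entry remained in the pool.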

Challenges and Solutions

Multithreading Concerns

In multithreaded environments, string interning requires careful synchronization to avoid race conditions arising from concurrent access to the shared string pool. When multiple threads attempt to intern the same string simultaneously, unsynchronized operations can lead to duplicate insertions into the pool or lost updates, where one thread's insertion overwrites another's without detection. This compromises the uniqueness guarantee of interning and can increase memory usage unexpectedly.

To mitigate these race conditions, implementations typically employ locks around pool access, such as mutexes, to ensure atomic insertions and lookups. These synchronization mechanisms ensure consistency but introduce overhead, as acquiring and releasing locks serializes access in contended situations.

Language-specific approaches vary in handling these concerns. In Java, the String.intern() method is inherently thread-safe across Java 6, 7, and 8 implementations. This design prevents race conditions during interning, though the shared pool can still lead to contention in high-throughput applications. In Python, the global interpreter lock (GIL) in CPython provides mutual exclusion for interpreter operations, including string interning via sys.intern(), mitigating race conditions by preventing simultaneous execution of Python bytecode across threads. However, the GIL does not fully eliminate potential issues in extensions or when using alternative interpreters without it, where explicit locking may be needed.

The performance impact of these synchronizations can be significant in high-throughput multithreaded scenarios, where locking overhead, such as context switches and wait times, may offset the memory and comparison benefits of interning. In extreme cases, frequent intern calls under contention can degrade throughput by up to 75% compared to non-interned access in some benchmarks, prompting developers to use per-thread pools or avoid interning in performance-critical paths.
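A minimal Python sketch of the lock-based approach, assuming a custom pool rather than the interpreter's own. LockedInternPool and worker are hypothetical names; the lock makes the lookup-then-insert step atomic so that concurrent callers always agree on one canonical instance.

```python
import threading


class LockedInternPool:
    """Custom intern pool whose lookup-or-insert step is guarded by a mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def intern(self, s):
        with self._lock:
            # setdefault returns the existing canonical instance, or stores s.
            return self._entries.setdefault(s, s)


def worker(pool, results, slot):
    # Each thread builds its own copy of the string before interning it.
    results[slot] = pool.intern("".join(["req", "uest"]))


pool = LockedInternPool()
results = [None] * 8
threads = [threading.Thread(target=worker, args=(pool, results, i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all(r is results[0] for r in results))
```

The check prints True: every thread receives the same object. A single lock serializes all callers, which is the contention cost described above; sharding the pool by hash or using per-thread pools are common ways to reduce it.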

Garbage Collection Integration

String interning integrates with garbage collection primarily through the use of weak references to ensure that unused interned strings can be reclaimed without causing memory leaks. In languages like Java, the string pool maintains weak references to interned strings, allowing the garbage collector to remove them when no strong references exist elsewhere in the application. This approach, akin to an implementation built on structures like WeakHashMap, prevents the pool from indefinitely holding onto strings that are no longer in use, thereby supporting automatic memory management.

Reclamation strategies for interned strings vary but often rely on the garbage collector's behavior with weak references. When the collector identifies an object reachable only via weak references, it reclaims the memory, effectively removing the string from the pool if no other references persist. Custom interning pools may implement periodic cleanup by scanning for weakly referenced entries that have been cleared, or eviction based on usage patterns to maintain pool size. However, challenges arise with "immortal" strings (those held by strong references from literals or constants), which remain in the pool indefinitely, potentially leading to unbounded growth if not managed.

Prior to Java 7, interned strings were stored in the Permanent Generation (PermGen) space, which had a fixed maximum size, resulting in OutOfMemoryError exceptions when the pool filled with unique strings faster than they could be reclaimed. This issue was resolved in Java 7 by relocating the string pool to the main heap, where standard garbage collection applies, allowing weakly referenced interned strings to be eligible for reclamation.

The use of weak references introduces trade-offs, including performance overhead from additional indirection and the need to handle potential inconsistencies, as the garbage collector may remove entries at unpredictable times, simulating asynchronous removals. In non-garbage-collected languages like C++, where automatic memory management is absent, manual interning requires explicit uninterning, such as through reference counting or custom deallocation, to reclaim memory, adding complexity and risk of leaks if not handled carefully.
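A weak-reference pool can be sketched in Python with weakref.WeakValueDictionary. Because plain str objects cannot be weakly referenced in CPython, this sketch stores a small wrapper instead; Symbol and WeakInternPool are hypothetical names, and the reclamation timing shown relies on CPython's reference counting.

```python
import gc
import weakref


class Symbol:
    """Wrapper so pool entries can be weakly referenced (plain str cannot)."""
    __slots__ = ("value", "__weakref__")

    def __init__(self, value):
        self.value = value


class WeakInternPool:
    """Pool that holds its canonical instances only through weak references."""

    def __init__(self):
        self._entries = weakref.WeakValueDictionary()

    def intern(self, value):
        symbol = self._entries.get(value)
        if symbol is None:
            symbol = Symbol(value)         # first occurrence becomes canonical
            self._entries[value] = symbol
        return symbol

    def __len__(self):
        return len(self._entries)


pool = WeakInternPool()
first = pool.intern("session")
second = pool.intern("session")
print(first is second)    # both names refer to the same canonical Symbol

del first, second         # drop the only strong references
gc.collect()
print(len(pool))          # the entry disappears once the Symbol is reclaimed
```

As long as any caller holds a strong reference, the canonical instance stays in the pool; once the last one is dropped, the entry is removed automatically, which avoids the unbounded growth described above at the cost of an extra level of indirection.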
