Null-terminated string
In computer programming, particularly in the C programming language, a null-terminated string is a contiguous sequence of characters stored in an array and terminated by a null character with the value zero (often represented as \0 or the ASCII NUL control character).[1] This convention allows string-handling functions to determine the end of the string by scanning sequentially until the terminator is encountered, without needing to store an explicit length.[1]
The use of null-terminated strings traces its origins to earlier languages in the lineage leading to C. In BCPL (Basic Combined Programming Language), developed in 1967, strings were represented as vectors where the first word contained the length followed by packed characters.[2] This length-prefixed approach was simplified in the B language (circa 1969), a precursor to C, where the length was omitted for non-empty strings, and instead a special terminator character (*e) marked the end to avoid fixed limits and improve convenience on early machines.[3] C, developed by Dennis Ritchie starting in 1972, adopted and refined this by standardizing the null character as the terminator, aligning with the byte-oriented architecture of the PDP-11 and enabling seamless integration with assembly-level operations like the ASCIZ directive for null-terminated constants.[3]
Null-terminated strings provide notable advantages in simplicity and efficiency: they require only one extra byte for the terminator, support variable lengths without separate metadata, and allow straightforward pointer-based operations for common tasks like copying or searching.[4] However, they impose limitations, such as the inability to include the null character within the string itself (restricting binary data handling) and the need to linearly scan the entire string to compute its length, which can be inefficient for long strings.[5] Additionally, manual memory management in C exacerbates risks like buffer overflows or forgotten terminators, leading to undefined behavior or security vulnerabilities—a concern highlighted in secure coding guidelines.[4]
Despite these drawbacks, null-terminated strings remain foundational in C and C++ (where they are known as null-terminated byte strings or NTBS), influencing POSIX APIs, system calls, and legacy codebases across operating systems like Unix.[6] Modern alternatives in languages like Rust or Go often use length-prefixed or bounded strings to mitigate issues, but null-termination persists for interoperability with C libraries.[7]
Fundamentals
Definition
A null-terminated string is a sequence of characters stored in contiguous memory locations as an array, delimited at the end by a special null character (NUL), which has an ASCII value of 0 and is typically represented as \0. This sentinel character marks the boundary of the string, allowing functions to determine its length by scanning until the NUL is encountered, without requiring an explicit length field. Unlike fixed-length strings, which allocate a predetermined amount of space and may include padding, or length-prefixed strings, which store an explicit length before the characters, null-terminated strings support variable lengths without separate metadata, determined dynamically by scanning to the terminator.[8]
The primary purpose of this convention is to enable efficient handling of variable-length character data in resource-constrained environments, such as early computer systems, by avoiding the need to store or maintain a separate length indicator alongside the string. In the development of the C language, this approach evolved from earlier languages like B, where strings were similarly terminated by a special end marker rather than a prefixed count, to circumvent hardware limitations like 8- or 9-bit fields that restricted string sizes and to simplify operations based on practical experience. For instance, the string "hello" would be represented in memory as the sequence ['h', 'e', 'l', 'l', 'o', '\0'], where the final null character ensures that string-processing routines stop at the correct point without overrunning into subsequent data.[8]
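The layout can be made concrete with a minimal C sketch (a hypothetical example, assuming an ASCII execution character set) that prints every byte of the array, including the terminator:
#include <stdio.h>

int main(void) {
    char s[] = "hello";                      /* stored as 'h','e','l','l','o','\0' */
    for (int i = 0; i < (int)sizeof s; i++)
        printf("%d ", s[i]);                 /* prints 104 101 108 108 111 0 */
    printf("\n");
    return 0;
}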
This structure inherently distinguishes null-terminated strings from arbitrary binary data, as the presence of an embedded NUL character (value 0) would prematurely terminate the string interpretation, preventing such strings from reliably containing binary content that includes null bytes internally. Thus, null-terminated strings are suited primarily for text data where null characters are not expected within the content itself.
Representation
A null-terminated string is represented in memory as a contiguous array of bytes, where each byte holds a character from the string, followed immediately by a single byte with the value zero, serving as the null terminator (NUL character). This layout ensures that the string data occupies sequential memory locations without gaps, making it suitable for direct pointer access in low-level programming environments.[9]
The total size required for storage is the number of characters in the string plus one additional byte for the terminator, regardless of the string's content. For instance, the string "cat" in ASCII encoding would be stored as the bytes {0x63, 0x61, 0x74, 0x00}, where 0x63 is 'c', 0x61 is 'a', 0x74 is 't', and 0x00 marks the end. There is no separate metadata field for the string's length; the terminator alone signals the boundary, allowing variable-length strings to share the same structure.[9][10]
Accessing the string involves starting at the initial memory address and traversing byte by byte until the null terminator is found, which enables operations like reading or copying without prior knowledge of the length. Determining the string's length requires this full traversal, performing a linear search that examines each character sequentially and incurs O(n) time complexity, where n is the string length. This implicit length detection relies entirely on the terminator's presence and position.[9][11]
The following pseudocode illustrates a basic length calculation by iterating until the null byte:
length = 0
while memory[pointer + length] != 0:
    length = length + 1
return length
This approach counts only the characters before the terminator, excluding it from the final length.[9]
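The same scan can be expressed as a small C function; this is a sketch of the idea rather than the library strlen, which real implementations typically optimize to examine several bytes per iteration (the name my_strlen is hypothetical):
#include <stddef.h>

/* Count the characters before the null terminator, excluding the terminator itself. */
size_t my_strlen(const char *s) {
    size_t length = 0;
    while (s[length] != '\0')
        length++;
    return length;
}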
Historical Development
Origins
The concept of null-terminated strings originated in the assembly languages of the 1960s, particularly in systems developed by Digital Equipment Corporation (DEC) for their PDP series computers. In the MACRO-10 assembler for the PDP-10, introduced in 1966, the ASCIZ directive was used to define strings by storing ASCII characters followed by a trailing null byte (zero), creating a convenient end marker for variable-length text data.[12] Similarly, the PDP-11, released in 1970, employed the .ASCIZ directive in its MACRO-11 assembler to generate null-terminated strings, appending a zero byte after the characters to delimit the end.[13] These directives emerged in an era of severe hardware constraints, where memory was limited, making the null byte a natural and efficient sentinel without requiring additional hardware support.[14]
The rationale for this approach centered on simplicity and resource efficiency in pre-high-level-language environments. By using the null byte as a terminator, programmers avoided the overhead of storing explicit length prefixes for each string, which would consume extra bytes in memory-scarce systems like the PDP-10's 36-bit architecture or the PDP-11's 16-bit design. This method facilitated straightforward parsing and scanning routines in assembly code, reducing the complexity of string handling in low-level programming.[15] The null terminator also leveraged the existing ASCII standard's NUL character (code 00), proposed in 1965 for padding and termination purposes, providing a standardized way to handle variable-length data without custom delimiters.[16]
This assembly-level convention influenced higher-level languages developed as precursors to C. BCPL, created by Martin Richards in the mid-1960s, initially used length-prefixed strings, but its successor B, implemented by Ken Thompson around 1969-1970 for the PDP-7, shifted to termination with a special character (*e) for easier parsing and to overcome length limitations.[17] This shift to a special terminator in B was further refined in a 1971 revision by Steve Johnson for the PDP-11, replacing the *e with the null character, paving the way for C.[3] The approach was formalized during the early 1970s development of Unix at Bell Labs, where Dennis Ritchie refined it in the emerging C language to support efficient string operations in the operating system's codebase.[17]
Adoption in Programming Languages
Dennis Ritchie selected null-terminated strings for the C programming language during its development between 1971 and 1973, primarily to enable support for arbitrary-length strings without the fixed size constraints present in predecessor languages like BCPL, which prefixed strings with a length byte limiting their maximum size, or Pascal, which relied on fixed-sized arrays often capped at 255 characters.[18] This design choice, rooted in the PDP-11 assembly language's ASCIZ directive for embedding strings, allowed C to handle variable-length text efficiently in resource-constrained environments.[18]
The adoption of null-terminated strings spread rapidly through Unix, where C was developed alongside the operating system at Bell Labs in 1972. They became embedded in the standard C libraries, such as libc, which provided foundational functions for string manipulation and were integral to Unix utilities and system calls. This integration influenced the POSIX standards, which explicitly define a "character string" as a contiguous sequence of characters terminated by a null byte, ensuring portability across Unix-like systems and solidifying null-terminated strings as a de facto standard in systems programming.[19]
The widespread use of null-terminated strings in C and Unix extended to hardware optimizations, as processor architectures evolved to accelerate common operations on such representations. For instance, the IBM z13 mainframe, introduced in 2015, incorporated a SIMD vector facility with dedicated instructions for string processing, including vector string copy operations that exploit the null terminator to efficiently handle variable-length data transfers, improving performance for workloads involving text manipulation.[20]
While later extensions in languages like variants of Fortran and COBOL added support for null-terminated strings primarily for interoperability with C interfaces—such as appending a null character to blank-padded Fortran strings or using hexadecimal literals in COBOL—it was C's pervasive influence that rendered the convention ubiquitous across modern programming ecosystems.[21][22]
Implementation Details
In C and C++
In C, null-terminated strings are represented as arrays of char elements, where the sequence of characters is followed by a null character (\0 or NUL) to mark the end. For instance, the declaration char str[] = "hello"; creates an array of six char values—{'h', 'e', 'l', 'l', 'o', '\0'}—with the compiler automatically appending the null terminator to string literals.[23] This representation allows functions to iterate until encountering the null character without needing explicit length information.[23]
The C standard library provides essential functions for manipulating these strings, declared in the <string.h> header. The strlen function computes the length by counting bytes from the start until the null terminator, excluding the terminator itself; for example, strlen("hello") returns 5.[23] For copying, strcpy copies the source string including its null terminator to a destination buffer, as in strcpy(dest, src);, while strncpy limits the copy to a specified number of characters but may not always append a null terminator if the limit is reached.[23] Comparisons use strcmp, which returns zero if two strings are equal, positive if the first is greater, or negative otherwise, based on lexicographical order.[23]
Common usage patterns include automatic null termination for string literals, which can be assigned to char* pointers. However, when manually allocating memory—such as with malloc—developers must reserve space for the string plus the null terminator; strcpy copies the terminator itself, but if the bytes are copied by other means (for example with memcpy), the terminator must be written explicitly, e.g., char *str = malloc(6); memcpy(str, "hello", 5); str[5] = '\0';.[23] A typical example demonstrating strcpy is:
#include <stdio.h>
#include <string.h>

int main() {
    char src[] = "hello";
    char dest[10]; // Buffer must be large enough for source + null terminator
    strcpy(dest, src); // Copies "hello\0" to dest
    printf("%s\n", dest); // Outputs: hello
    return 0;
}
This works correctly if the destination buffer size accommodates the source length plus the null terminator; otherwise, it risks overwriting adjacent memory, a potential pitfall requiring careful size checks.[23]
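One common defensive pattern (a sketch, not the only option) is to copy with snprintf, which never writes past the stated buffer size and always null-terminates the result, and then check its return value for truncation:
#include <stdio.h>

int main(void) {
    const char *src = "a string that may be longer than the buffer";
    char dest[16];

    /* snprintf writes at most sizeof dest - 1 characters plus the '\0'. */
    int needed = snprintf(dest, sizeof dest, "%s", src);
    if (needed < 0 || (size_t)needed >= sizeof dest)
        fprintf(stderr, "warning: input was truncated\n");

    printf("%s\n", dest);   /* always a valid null-terminated string */
    return 0;
}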
In C++, null-terminated strings retain the same char* and const char* representation and compatibility with C functions, but the language recommends using std::string from <string> for safer handling, as it manages memory automatically and avoids manual null termination concerns.[24] The std::string::c_str() member function provides a const char* to a null-terminated version of the string for interfacing with C APIs, e.g., std::string s("hello"); const char* cstr = s.c_str(); (the pointer remains valid only while s is alive and unmodified).[25]
In Low-Level Languages
In low-level languages such as assembly, null-terminated strings are typically processed by scanning memory sequentially until a null byte (0x00) is encountered, often using loops that load and compare bytes from a register or memory address. For instance, in x86 assembly, a common approach involves initializing a source index register (ESI) with the string's starting address, zeroing a counter register (ECX), and entering a loop where a byte is loaded into the accumulator (AL) via MOV, compared to zero with CMP, and the pointer incremented if non-zero.[26] This method ensures the string's end is detected without prior length knowledge, relying on byte-addressable memory to traverse the sequence. A representative example in x86-64 NASM syntax for computing string length is:
    mov rsi, string_start
    xor rcx, rcx
scan_loop:
    mov al, [rsi]
    test al, al
    jz done
    inc rsi
    inc rcx
    jmp scan_loop
done:
    ; rcx holds the string length
Such loops are fundamental in assembly routines for tasks like printing or copying, as they directly interface with memory without higher-level abstractions.[26]
Hardware architectures provide dedicated instructions to optimize this scanning, reducing the need for explicit loops. On x86 processors, the SCASB (Scan String Byte) instruction compares the byte at ES:[EDI] with AL and advances EDI, with the REPNE prefix repeating the operation until ECX reaches zero or equality (ZF=1) is found, ideal for locating a null terminator when AL=0.[27] For example, to find the length of a null-terminated string, EDI is set to the string address, ECX to a large value (e.g., -1 for unbounded scan), AL to 0, direction flag cleared (CLD), and REPNE SCASB executed; the length is then derived from the decremented ECX.[26] This string instruction, introduced in the 8086, persists in modern x86-64 CPUs from Intel and AMD, enhancing efficiency for repetitive memory scans in kernels or drivers. Earlier systems like the PDP-11, influential in Unix development, supported null-terminated strings via the ASCIZ assembler directive, which appended a null byte to literals, with scanning typically implemented through conditional branches testing bytes against zero in registers like R0.[15] The PDP-11's byte-manipulating instructions, such as MOVB and CMPB, facilitated similar loop-based traversal in 16-bit memory-addressable environments.[28]
Null-terminated strings ensure binary compatibility in system interfaces, particularly for low-level calls expecting fixed formats. In Unix-like systems, the execve() syscall requires the path argument as a null-terminated byte string pointing to the executable, and argv as an array of pointers to null-terminated argument strings, terminated itself by a null pointer.[29] This convention, defined in POSIX standards, allows assembly code or binaries to invoke processes without length prefixes, maintaining interoperability across kernels like Linux and BSD since the 1970s PDP-11 era.[30] Assembly implementations must thus prepare memory buffers with explicit null terminators to avoid truncation or faults during kernel parsing.
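A minimal sketch of this convention (assuming a Unix-like system providing unistd.h) shows the required layout: the path and each argument are null-terminated strings, and the argv and envp arrays are themselves terminated by a null pointer:
#include <unistd.h>
#include <stdio.h>

int main(void) {
    char *argv[] = { "/bin/echo", "hello", NULL };  /* array ends with a null pointer */
    char *envp[] = { NULL };                        /* empty environment */

    execve("/bin/echo", argv, envp);   /* every string above is null-terminated */
    perror("execve");                  /* reached only if execve fails */
    return 1;
}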
In resource-constrained environments like embedded systems and OS kernels, handling null-terminated strings involves careful management of byte-addressable memory to prevent overflows or invalid accesses. On ARM-based embedded platforms, assemblers use the .asciz directive to define strings with automatic null termination, scanned via loops loading bytes into registers (e.g., LDRB and CMP on R0) until zero, common in microcontroller firmware for display or UART output.[31] Similarly, AVR assembly for devices like ATmega employs .db for strings followed by a manual zero byte, with scanning loops using LPM and CP instructions to step through program memory, essential where RAM and stack space are severely constrained. In OS kernels, such as Linux, null-terminated strings are processed in byte-addressable virtual memory, but edge cases arise in interrupt handlers or device drivers where unbounded scans risk page faults if terminators are missing; mitigations include bounded variants like strscpy() that enforce null termination within fixed buffers.[32] These contexts highlight the reliance on explicit null bytes for safe, predictable termination in environments without dynamic allocation.
Limitations
Security Risks
Null-terminated strings are particularly susceptible to buffer overflow vulnerabilities due to the lack of explicit length information, relying instead on the terminating null character to signal the end. Functions such as strcpy() copy characters from a source string to a destination buffer until encountering the null terminator, without verifying if the destination has sufficient space, potentially overwriting adjacent memory regions including return addresses or critical data structures. This can enable attackers to inject and execute arbitrary code, leading to remote code execution or system compromise. A seminal example is the Morris Worm of 1988, which exploited a stack buffer overflow in the fingerd daemon on Unix systems by overflowing a fixed-size buffer during input processing, allowing the worm to propagate across networks and infect thousands of machines.[33]
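A deliberately unsafe sketch (hypothetical function, not to be used) illustrates the pattern: strcpy keeps writing until it finds the source's terminator, regardless of how small the destination is:
#include <string.h>

void vulnerable(const char *user_input) {
    char buffer[16];
    /* If user_input holds more than 15 characters plus '\0', strcpy
       writes past the end of buffer and corrupts adjacent stack memory. */
    strcpy(buffer, user_input);
}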
Format string attacks further compound risks when user-controlled input is passed directly as the format argument to functions like printf(), which interpret null-terminated sequences containing specifiers (e.g., %s or %n) to read from or write to the stack, potentially leaking memory contents or overwriting variables. For instance, input such as %x%x%x to printf(user_input) can dump stack values, enabling information disclosure or control-flow hijacking.[34][35]
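The standard remedy (sketched below with a hypothetical logging helper) is to pass user data as an argument to a fixed format string rather than as the format string itself:
#include <stdio.h>

void log_message(const char *user_input) {
    /* Dangerous: any conversion specifiers in user_input would be interpreted. */
    /* printf(user_input); */

    /* Safe: user_input is printed verbatim as a string argument. */
    printf("%s\n", user_input);
}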
Null byte (NUL) injection poses another threat by embedding the null character (%00 or \0) in user input, prematurely truncating null-terminated strings and bypassing validation or access controls. In languages like C or PHP interfacing with C libraries, this can trick functions into treating a shortened string as valid, allowing path traversal (e.g., ../../etc/passwd%00) to access unauthorized files or enable arbitrary code execution via buffer overflows in components expecting complete strings.[36] Such attacks exploit the fundamental reliance on the null terminator for string delineation, often evading filters that process only the visible portion of input.[37]
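A small sketch with hypothetical filenames shows how an embedded null byte hides the remainder of the input from C string functions once the raw bytes are handed to them:
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Attacker-controlled input containing an embedded null byte. */
    char input[] = "safe.txt\0../../etc/passwd";

    /* String functions stop at the first null byte, so validation based
       on them sees only "safe.txt" and misses the trailing path. */
    printf("strlen sees %zu bytes: \"%s\"\n", strlen(input), input);
    printf("actual array size: %zu bytes\n", sizeof input);
    return 0;
}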
Basic mitigations involve using bounded string functions to limit copies, such as strncpy(), which caps the number of characters copied to the destination buffer size, preventing overflows from excessively long sources. However, strncpy() introduces its own risks: if the source string exceeds the specified length, it copies exactly that many characters without appending a null terminator, resulting in a non-null-terminated buffer that subsequent operations may treat as longer than intended, potentially causing further overflows or undefined behavior. Additionally, strncpy() pads the destination with nulls if the source is shorter, which is inefficient but can mask truncation issues. Developers must manually ensure null termination after such calls, underscoring the need for careful implementation to avoid compounding vulnerabilities.[38]
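The usual defensive idiom (a sketch) is to cap the copy and then force termination, since strncpy alone leaves the buffer unterminated when the source fills it:
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *src = "a source string longer than the destination";
    char dest[8];

    strncpy(dest, src, sizeof dest - 1);   /* copies at most 7 characters */
    dest[sizeof dest - 1] = '\0';          /* guarantee a terminator */

    printf("%s\n", dest);                  /* prints at most 7 characters */
    return 0;
}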
Efficiency Drawbacks
Null-terminated strings impose several efficiency drawbacks due to their design, which relies on scanning for a terminating null character rather than storing explicit length information.
One primary inefficiency arises in computing the string length, as functions like strlen in C must iterate through each character until encountering the null terminator, resulting in O(n) time complexity where n is the string length. This scanning becomes particularly costly for long strings or when length queries are frequent, such as in loops or repeated operations, leading to unnecessary CPU cycles compared to length-prefixed alternatives that allow constant-time length retrieval.
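A common consequence, sketched below with hypothetical helpers, is an accidentally quadratic loop: calling strlen in the loop condition rescans the string on every iteration, while hoisting the call out of the loop restores linear behavior:
#include <stddef.h>
#include <string.h>

/* O(n^2): strlen(s) walks the whole string on every iteration. */
size_t count_spaces_slow(const char *s) {
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ')
            count++;
    return count;
}

/* O(n): the length is computed once, before the loop. */
size_t count_spaces_fast(const char *s) {
    size_t count = 0, len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            count++;
    return count;
}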
Memory usage is another concern, as every null-terminated string requires an additional byte for the null terminator, introducing a fixed overhead regardless of string length. This not only wastes storage but also prevents the representation from handling binary data containing embedded null bytes, limiting its applicability to text-only scenarios and requiring workarounds like separate length tracking for more general use cases.
Modifying null-terminated strings, such as inserting characters, exacerbates these issues since insertions in the middle demand shifting the entire trailing portion of the string, an O(n) operation that can be prohibitive for large n. Without an explicit length, even determining the insertion point or validating bounds often necessitates prior scanning, compounding the time cost and making dynamic updates less efficient than in representations with direct length access.
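An in-place insertion can be sketched as follows (insert_char is a hypothetical helper, and the caller must guarantee the buffer has room for one more byte); every byte after the insertion point, including the terminator, has to be shifted:
#include <stdio.h>
#include <string.h>

/* Insert character c at position pos of the null-terminated string s. */
void insert_char(char *s, size_t pos, char c) {
    size_t len = strlen(s);                         /* O(n) just to find the end */
    memmove(s + pos + 1, s + pos, len - pos + 1);   /* shift the tail plus '\0' */
    s[pos] = c;
}

int main(void) {
    char buf[16] = "helo";
    insert_char(buf, 3, 'l');
    printf("%s\n", buf);   /* prints "hello" */
    return 0;
}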
String comparisons via functions like strcmp similarly suffer from the need to scan sequentially until a mismatch or the null terminator is found, yielding O(n) performance in the worst case for equal prefixes. This is slower than comparing length-prefixed strings of known equal length, where early length checks can short-circuit the process without full traversal.
Character Encoding Considerations
Single-Byte Encodings
Null-terminated strings integrate seamlessly with the ASCII character set, a 7-bit encoding standard that defines 128 characters from 0x00 (NUL) to 0x7F. The NUL byte (0x00) functions as the terminator without overlapping with the 95 printable ASCII characters (0x20–0x7E) or the remaining control characters (0x01–0x1F and 0x7F), ensuring that text data remains intact until the terminator is encountered. This design leaves the upper 128 values (0x80–0xFF) available for system-specific extensions in 8-bit environments, preserving compatibility while allowing for broader use.
Extended ASCII encodings, such as those in the ISO/IEC 8859 family (e.g., ISO-8859-1 for Western European languages), maintain viability for null-terminated strings by reserving the NUL byte exclusively as the terminator, avoiding its use in character mappings. These 8-bit extensions build directly on the 7-bit ASCII subset, assigning semantic meanings to the 0x80–0xFF range for accented letters and symbols, and were commonly employed in early C programming as ASCIIZ strings—null-terminated sequences compatible with both ASCII and extended sets. This approach enabled straightforward handling of localized text in systems transitioning from 7-bit to 8-bit storage without altering the termination mechanism.[39][40]
A key limitation of null-terminated strings in single-byte encodings is their inability to represent arbitrary binary data or text containing embedded null bytes, as any occurrence of 0x00 is interpreted as the end of the string, resulting in truncation of subsequent content. For instance, attempting to store binary values including 0x00 through standard string functions would prematurely halt processing, rendering the format unsuitable for non-textual data like images or encrypted payloads. This restriction stems from the fundamental reliance on the absence of internal nulls to delineate string boundaries.[41][7]
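A short sketch illustrates the truncation: as soon as a zero byte appears inside otherwise valid data, every null-terminated routine treats it as the end of the string:
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Six bytes of "binary" data that happen to contain a zero byte. */
    char data[] = { 0x41, 0x42, 0x00, 0x43, 0x44, 0x45 };

    /* String functions stop at the embedded zero: only "AB" is visible. */
    printf("strlen reports %zu bytes\n", strlen(data));   /* 2 */
    printf("as a string: \"%s\"\n", data);                /* AB */
    return 0;
}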
Historically, null-terminated strings dominated Unix and C programming from the late 1960s through the 1990s, originating from the PDP-7's ASCIZ (ASCII with zero) type and becoming standard in early Unix implementations. System libraries and functions were engineered to be 8-bit clean—transmitting all byte values unaltered across pipes and files—but often operated under the assumption of 7-bit safe text to mitigate risks from potential embedded nulls in extended encodings. This balance supported efficient text processing in resource-constrained environments until the rise of internationalized systems in the late 1990s prompted broader encoding considerations.[42][43]
Multi-Byte Encodings
In multi-byte encodings such as UTF-8 and UTF-16, null-terminated strings face significant challenges due to the variable width of character representations, where the null terminator (0x00 for UTF-8 or 0x0000 for UTF-16) can interfere with proper string processing. In UTF-8, the null byte may appear as part of an invalid multi-byte sequence if it occurs mid-character, rendering such sequences ill-formed according to the encoding standard, while an embedded U+0000 character (encoded as a single 0x00 byte) prematurely terminates the string when scanned by functions expecting null termination. This restriction prevents UTF-8 null-terminated strings from legitimately containing embedded null characters without breaking the termination mechanism, a limitation that arises because systems like C treat 0x00 as the end marker regardless of encoding intent.
To address these issues in UTF-8, adaptations like Modified UTF-8 have been developed, particularly in Java's JNI and class file formats, where the null character U+0000 is encoded as the two-byte sequence 0xC0 0x80 instead of 0x00, ensuring no embedded null bytes appear in the string while preserving compatibility with null-terminated C-style strings. This modification allows any Unicode character to be represented without introducing actual 0x00 bytes, though it deviates from standard UTF-8 by altering the encoding of surrogates and the null character. In contrast, UTF-16 null-terminated strings use a 16-bit null terminator (0x0000), which aligns with the encoding's fixed-width code units and avoids single-byte null issues, as seen in Windows API functions like wcslen that scan until encountering this terminator.[44][45][46]
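A sketch of the workaround in C: because U+0000 is encoded as the two bytes 0xC0 0x80 rather than a single zero byte, a string containing a "null character" still terminates correctly:
#include <stdio.h>
#include <string.h>

int main(void) {
    /* 'A', U+0000 in Modified UTF-8 (0xC0 0x80), 'B', then the real terminator. */
    const char *modified = "A\xC0\x80" "B";

    /* No embedded zero byte, so C string functions see all four bytes. */
    printf("strlen reports %zu bytes\n", strlen(modified));   /* 4 */
    return 0;
}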
Programming languages handle these adaptations variably when interfacing with null-terminated multi-byte strings. In Python 3, bytes objects can represent null-terminated UTF-8 data, but APIs like PyBytes_FromStringAndSize raise a ValueError if an embedded null byte is detected when length is unspecified, effectively warning against strings that could terminate prematurely. Similarly, Objective-C's NSString class internally stores text as UTF-16 code units without relying on null termination, but when bridging to C APIs, it can provide null-terminated UTF-16 representations via methods like UTF16String, ensuring compatibility while avoiding byte-count mismatches.[47][48][49]
A key compatibility problem in multi-byte encodings is that standard scanning functions like strlen() count bytes rather than characters, leading to incorrect length calculations for variable-width strings; for instance, a UTF-8 string with accented characters spanning multiple bytes will report a byte length longer than the actual character count. This byte-oriented behavior, defined in the POSIX standard, exacerbates issues in UTF-8 where non-ASCII characters require 2–4 bytes each, potentially causing buffer overruns or truncation if character counts are assumed. In UTF-16, equivalent functions like wcslen() count 16-bit units correctly but still require awareness of surrogate pairs for full Unicode scalar values.
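A sketch (assuming the compiler and terminal treat the literal as UTF-8) makes the mismatch visible: the accented character occupies two bytes, so the byte count exceeds the number of user-perceived characters:
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "héllo": the 'é' is the two-byte UTF-8 sequence 0xC3 0xA9. */
    const char *s = "h\xC3\xA9llo";

    printf("strlen (bytes): %zu\n", strlen(s));   /* 6 bytes for 5 characters */
    return 0;
}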
Alternatives and Improvements
Length-Prefixed Strings
Length-prefixed strings, also known as prefixed strings, represent a string format where an explicit length field precedes the sequence of characters, eliminating the need for a terminating null character. This length field, typically an integer indicating the number of characters that follow, allows the string's boundaries to be determined directly from the prefix without scanning the content.[50]
One primary advantage of this approach is constant-time (O(1)) access to the string length, as it requires only reading the prefix rather than iterating until a terminator is found, which contrasts with the O(n) time complexity of null-terminated strings in the worst case. Additionally, length-prefixed strings inherently support embedded null characters or arbitrary binary data within the string body, since the length field defines the extent regardless of content, facilitating safer handling of binary-safe data. This format also simplifies bounds checking during operations like concatenation or substring extraction, reducing the risk of buffer overruns compared to relying on implicit termination.[50][51]
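A minimal C sketch of the idea (a hypothetical layout, not any particular system's format): the length is stored explicitly in front of the bytes, so reading it costs O(1) and the data may legally contain zero bytes:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-prefixed string: a 32-bit length followed by the bytes. */
typedef struct {
    uint32_t length;   /* number of bytes in data, available in O(1) */
    char     data[];   /* may contain embedded zero bytes; no terminator needed */
} PrefixedString;

PrefixedString *make_prefixed(const char *bytes, uint32_t n) {
    PrefixedString *s = malloc(sizeof *s + n);
    if (s != NULL) {
        s->length = n;
        memcpy(s->data, bytes, n);
    }
    return s;
}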
Classic examples include Pascal's short strings, where the first byte serves as an 8-bit length prefix limiting the string to 255 characters, followed directly by the character data; this design originated in early Pascal implementations like Turbo Pascal for efficient variable-length strings up to that size. In modern systems, the Windows COM BSTR type uses a 32-bit length prefix (a DWORD stored immediately before the character data that the BSTR pointer addresses) followed by wide characters (UTF-16) and a terminating null for compatibility, enabling embedded nulls while providing explicit length via the SysStringLen function. Similarly, the .NET Framework's String class internally maintains a length field in its object header alongside a contiguous char array (without a null terminator), allowing immutable strings with O(1) length retrieval through the Length property and support for any Unicode characters, including nulls.[52][53]
Despite these benefits, length-prefixed strings introduce fixed overhead from the length field—such as one byte in Pascal short strings or four bytes in BSTR—which adds memory cost per string instance, particularly for short or numerous strings. Corruption of the length field can lead to mismatches between the declared size and actual content, potentially causing security vulnerabilities like buffer overflows if not validated, though this is mitigated by explicit checks in robust implementations.[50][53]
Advanced String Types
Ropes represent an advanced string data structure designed for efficient manipulation of large texts, particularly through concatenation and splitting operations that achieve O(log n) time complexity, where n is the string length. Introduced as an alternative to traditional contiguous string representations, ropes organize strings as binary trees where leaf nodes hold substrings and internal nodes store weights indicating the size of subtrees, enabling balanced operations without frequent memory reallocations. This structure is particularly beneficial for applications like text editors handling dynamic content, as it avoids the quadratic time costs associated with repeated concatenations in null-terminated strings. The GNU C++ Standard Library implements ropes via the <ext/rope> header, providing a practical extension for C++ developers seeking scalable string handling.[54]
Immutable strings in modern languages address limitations of mutable, null-terminated designs by enforcing read-only semantics, reducing errors from unintended modifications and enabling optimizations like hash caching. In Java, the String class stores characters in UTF-16 encoding with an explicit length field and caches the hash code to accelerate equality checks and hash-based collections, improving performance in scenarios involving frequent string comparisons. Similarly, Python's str type is Unicode-aware, internally using a length-prefixed representation with compact storage for common code point ranges (e.g., Latin-1 or UCS-2), which supports efficient slicing and concatenation without null terminators. These designs prioritize safety and performance for general-purpose string use, contrasting with the vulnerability-prone scanning in null-terminated formats.[55]
To mitigate security risks in C while retaining compatibility with null-terminated conventions, extensions like strlcpy() and strlcat() from BSD systems provide bounded string copying and concatenation, returning the total length of the string they attempted to create (excluding the terminator) so that callers can detect truncation. These functions, originally developed for OpenBSD, ensure safer operations by truncating sources if necessary and always null-terminating results, influencing their adoption in glibc 2.38 for broader portability. Complementing these, C11's Annex K introduces bounds-checked functions such as strncpy_s(), strncat_s(), and memcpy_s(), which require explicit size parameters and report errors on truncation, offering a standardized path to safer string handling in legacy C environments.[56]
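A sketch of the strlcpy idiom (assuming a platform that provides it, such as the BSDs or glibc 2.38 and later): the return value is the length of the source, so truncation is detected by comparing it against the destination size:
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *src = "a possibly very long source string";
    char dest[16];

    /* strlcpy copies at most sizeof dest - 1 bytes and always null-terminates. */
    size_t needed = strlcpy(dest, src, sizeof dest);
    if (needed >= sizeof dest)
        fprintf(stderr, "warning: source was truncated\n");

    printf("%s\n", dest);
    return 0;
}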
Contemporary systems languages like Rust and Go integrate advanced string types that emphasize memory safety and efficiency, eschewing null terminators entirely. Rust's &str is a fat pointer comprising a byte pointer and length, validated as UTF-8, enabling zero-copy views into strings with compile-time bounds checking to eliminate buffer errors common in C-style handling. Go's string type mirrors this with an immutable pointer to a byte slice plus length, facilitating efficient passing by value in a garbage-collected environment while supporting UTF-8 natively. These implementations enhance security through immutability and explicit lengths, and boost performance for large-scale text processing, as seen in ropes for editor buffers or Unicode operations in web applications.[57][58]