Whitespace character
In computing and typography, a whitespace character is a glyph or control code that represents empty or blank space within text, primarily used to separate words, tokens, lines, or other structural elements without contributing visible content.[1] These characters are essential for text processing, rendering, and layout in digital systems, where they facilitate readability and parsing while often being normalized or collapsed in display contexts.[1] In ASCII-based systems, the whitespace characters are conventionally the space (U+0020 or 0x20), horizontal tab (U+0009 or 0x09), line feed (U+000A or 0x0A), vertical tab (U+000B or 0x0B), form feed (U+000C or 0x0C), and carriage return (U+000D or 0x0D); these are the characters recognized by functions like isspace() in the C standard library for delimiting tokens in programming languages.[2] This set forms the foundation for whitespace handling in many legacy and modern systems, ensuring compatibility in environments like command-line interfaces and file formats.[2]
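The six-character ASCII set can be made concrete with a short Python sketch. The table `ASCII_WHITESPACE` and the helper `is_ascii_space` are hypothetical names introduced here for illustration; the only library fact relied on is that Python's `string.whitespace` constant lists the same six characters.

```python
import string

# The six ASCII whitespace characters named above, keyed by code point.
# (Illustrative table; the names are descriptive labels, not an API.)
ASCII_WHITESPACE = {
    0x09: "horizontal tab",
    0x0A: "line feed",
    0x0B: "vertical tab",
    0x0C: "form feed",
    0x0D: "carriage return",
    0x20: "space",
}


def is_ascii_space(ch: str) -> bool:
    """Return True if ch is exactly one of the six ASCII whitespace characters."""
    return len(ch) == 1 and ord(ch) in ASCII_WHITESPACE


# Python's string.whitespace constant contains the same six characters.
same_set = set(string.whitespace) == {chr(cp) for cp in ASCII_WHITESPACE}
```

Under this definition a no-break space (U+00A0) is not whitespace, which is the main practical difference from the broader Unicode definition.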
The Unicode Standard broadens this definition via the White_Space property, which—as of Unicode 16.0—marks 25 characters as whitespace for use in text processing and internationalization; these include all ASCII whitespaces plus space separators (general category Zs) like the no-break space (U+00A0) and ideographic space (U+3000), line separators (Zl), paragraph separators (Zp), and certain control characters.[3] This property supports diverse scripts by incorporating culture-specific spacing, such as the Ogham space mark (U+1680), while maintaining stability across Unicode versions to avoid breaking existing software. Note that the narrower Pattern_White_Space property from UAX #31, used specifically for pattern matching and identifiers, includes only 11 characters.[4]
Beyond basic separation, whitespace characters influence text layout and behavior in applications like web browsers and document editors; for instance, multiple consecutive spaces may collapse to one in HTML rendering, while non-breaking variants prevent unwanted line breaks in typography.[1] In professional design, specialized forms such as the en quad (U+2000, 1/2 em width) and em quad (U+2001, 1 em width)—with widths tied to font metrics—enable precise kerning and justification; thinner spaces typically range from 1/5 em to 1/6 em.[5]
Definition and Fundamentals
Core Concept
A whitespace character is a type of character in text encoding systems that represents invisible blank space or produces no visible output when rendered, serving primarily to facilitate separation between words, alignment of text, and overall formatting in documents and code.[6] These characters encompass both control characters, which manage device or display behavior without printing glyphs, and certain graphic characters that explicitly denote space.[7] Unlike visible characters such as letters or symbols, whitespace characters contribute to the structural layout of text rather than its semantic content.[8]
Primary examples of whitespace characters include the basic space (U+0020), which inserts a fixed horizontal gap; the horizontal tabulation (U+0009), which advances the cursor to the next tab stop; the vertical tabulation (U+000B), which advances the cursor to the next vertical tab stop; the line feed (U+000A), which moves to the next line; the carriage return (U+000D), which returns the cursor to the line start; and the form feed (U+000C), which advances to the next page or form.[6] In the Unicode Standard, these fall into categories such as space separators and control characters, with further details covered in the encoding specifications.[6]
In text processing and computing applications, whitespace characters function as delimiters to separate tokens during parsing and lexical analysis, as structural elements to define paragraphs and sections in documents, and as layout controls to influence rendering in displays or printers.[8] For instance, in programming languages, they often delineate keywords, operators, and variables without affecting the program's logic.[7]
Some whitespace characters are control characters whose function is to create a spatial gap or positional advance, while others are graphic characters that denote space. Both differ from control characters that handle non-spatial functions, such as sounding a bell (U+0007) or initiating escape sequences for formatting shifts.[9] This distinction ensures that whitespace is treated uniformly in operations like string trimming or tokenization, where only spatial elements are collapsed or preserved for readability.[7]
Historical Development
The concept of whitespace characters traces its roots to 19th-century advancements in telegraphy and data processing. In 1874, Émile Baudot patented a printing telegraph system that used a five-unit binary code, incorporating "letters shift" and "figures shift" controls to toggle between alphabetic and numeric modes, enabling efficient transmission of formatted messages over telegraphic lines.[10] These early spacing mechanisms addressed the need for separating elements in printed output from asynchronous telegraph receivers. In the 1890s, Herman Hollerith developed punch-card technology for the U.S. Census, where the absence of punches in specific columns represented spaces, allowing mechanical tabulators to align and process tabular data accurately.[11][10]
By the early 20th century, refinements in telegraphy further solidified whitespace's role. Donald Murray's 1905 enhancements to Baudot-style codes introduced explicit "line control" characters for carriage return (CR) and line feed (LF), which moved the print head and advanced paper in teletypes, ensuring proper formatting in remote printing applications.[10] Punch-card systems evolved similarly; by 1915, Robert Neil Williams devised a full alphabetic code that included space representations via punch configurations, optimizing for data sorting and stability in business machines.[10] These developments directly influenced computing, as seen in the 1950s with FORTRAN, IBM's pioneering high-level language, which relied on fixed columnar formats where spaces aligned code elements on 80-column punch cards for reliable parsing by early compilers.[12]
Standardization accelerated in the mid-20th century amid growing computer adoption. The 1963 American Standard Code for Information Interchange (ASCII, ASA X3.4) formalized whitespace with dedicated codes for space (code 32), horizontal tab (HT, code 9), line feed (LF, code 10), carriage return (CR, code 13), and form feed (FF, code 12), shaped by teletype compatibility to support uniform data interchange across devices.[13] IBM's Extended Binary Coded Decimal Interchange Code (EBCDIC), devised in 1963–1964 for System/360 mainframes, extended this framework with 8-bit encodings (space at hex 40, tab at hex 05, LF at hex 25, CR at hex 0D, and FF at hex 0C) tailored for punched-card and tape inputs in enterprise computing.[14] A pivotal international milestone arrived in 1972 with ISO 646, which adopted ASCII's core structure, including identical whitespace controls, to harmonize 7-bit encodings for global telecommunications while allowing national variants in graphic characters.[15]
The 1980s and 1990s marked a transition toward accommodating diverse writing systems. As standards like ISO 8859 (introduced 1987) expanded to 8-bit encodings for Western European languages, whitespace retained its foundational roles but required adjustments for script-specific rendering, such as in combined Latin and non-Latin texts.[15] The push for bidirectional text handling, evident in early 1990s drafts for the Unicode Bidirectional Algorithm, introduced rules classifying whitespace as neutral elements that adapt to surrounding directional contexts in mixed left-to-right and right-to-left layouts, addressing needs in Arabic and Hebrew processing.[16] These evolutions built on prior controls to support increasingly international digital text flows without altering core whitespace definitions.
Encoding and Standards
Early Character Sets
In the American Standard Code for Information Interchange (ASCII), formalized as ANSI X3.4 in 1968, whitespace characters were assigned specific 7-bit code points to facilitate text formatting and control on early computing systems. The space (SP) occupied 0x20, serving as the primary word separator; the horizontal tab (HT) was at 0x09 for columnar alignment; the line feed (LF) at 0x0A for advancing to the next line; the form feed (FF) at 0x0C for page ejection or clearing; the vertical tab (VT) at 0x0B for vertical spacing; and the carriage return (CR) at 0x0D for returning the cursor to the line start.[17] These assignments prioritized compatibility with teleprinters and punch-card systems, though the limited 128-character set constrained handling of non-English text.
IBM's Extended Binary Coded Decimal Interchange Code (EBCDIC), introduced in the 1960s for mainframe environments, diverged significantly from ASCII to align with IBM's hardware architecture. Whitespace encoding included the space at 0x40, horizontal tab at 0x05, line feed at 0x25, form feed at 0x0C, vertical tab at 0x0B, and carriage return at 0x0D.[18][19] These positions reflected EBCDIC's non-contiguous layout, which grouped controls separately from graphics for punch-card efficiency but introduced conversion complexities when interfacing with ASCII-based peripherals.[19] The 8-bit nature allowed more characters overall, yet whitespace remained sparse, emphasizing mainframe-specific operations over universal portability.
The International Organization for Standardization's ISO 646, ratified in 1972 as a 7-bit international variant of ASCII, preserved the core whitespace code points (SP at 0x20, HT at 0x09, LF at 0x0A, FF at 0x0C, VT at 0x0B, and CR at 0x0D) to promote global data interchange while permitting national substitutions for invariant positions.[20] Subsequent 8-bit extensions, such as ISO 8859-1 (Latin-1) standardized in 1987, retained these whitespace assignments in the lower 128 positions but focused on adding Western European accented letters in the upper range, revealing limitations in supporting non-Latin scripts and bidirectional text without further extensions.
A key interoperability challenge arose from varying line-ending conventions rooted in these early sets: Unix systems, emerging in the early 1970s, employed LF alone for efficiency on minimal hardware, while DOS platforms adopted the CR+LF sequence inherited from CP/M and teleprinter standards, causing file corruption or display errors when transferring text across environments.[21] These mismatches persisted into networked computing, underscoring the need for standardized conversions in later protocols.
Unicode Specification
In the Unicode Standard, whitespace characters are formally defined through the White_Space (WSpace) binary property, which identifies code points that separate words, lines, or paragraphs in text processing.[22] This property encompasses characters from various general categories, primarily the Space Separator (Zs) category for horizontal spacing, Line Separator (Zl) for line breaks, Paragraph Separator (Zp) for paragraph breaks, and certain Other Control (Cc) characters for formatting controls like tabs and newlines. For instance, the basic space is U+0020 SPACE (Zs), the non-breaking space is U+00A0 NO-BREAK SPACE (Zs), the tab is U+0009 CHARACTER TABULATION (Cc), the line separator is U+2028 LINE SEPARATOR (Zl), and the paragraph separator is U+2029 PARAGRAPH SEPARATOR (Zp).[23][22]
As of Unicode 17.0 (released in 2025), exactly 25 code points are assigned the WSpace=Yes property, maintaining the count from previous versions with no additions or removals since Unicode 6.3.[23] These include legacy ASCII controls such as U+000A LINE FEED and U+000D CARRIAGE RETURN, along with specialized separators like U+1680 OGHAM SPACE MARK (Zs) for Ogham script and U+3000 IDEOGRAPHIC SPACE (Zs) for East Asian typography. Additional properties relevant to whitespace include Bidi_Class, where most space characters carry the value WS and are treated as neutral by the bidirectional algorithm, and Joining_Type, where whitespace is Non_Joining and therefore interrupts cursive joining in scripts like Arabic rather than taking part in it.[22]
The specification of whitespace originates in Unicode 1.0 (1991), which initially included basic ASCII-derived spaces and controls, with subsequent versions expanding to support international scripts by adding separators like those in the General Punctuation block (e.g., the U+2000–U+200A range for spaces of various widths). Standardization details are outlined in Unicode Standard Annex #44 (Unicode Character Database), which documents the WSpace property and general categories, and Annex #9 (Unicode Bidirectional Algorithm), which defines their role in text directionality.[22] Further behavioral rules appear in Unicode Standard Annex #29 (Unicode Text Segmentation), which specifies how whitespace influences word and sentence boundaries, and Annex #14 (Unicode Line Breaking Algorithm), which governs line breaks. Under the normalization forms of Unicode Standard Annex #15, the canonical forms (NFC and NFD) leave almost all whitespace unchanged, the exceptions being U+2000 EN QUAD and U+2001 EM QUAD, which canonically decompose to U+2002 and U+2003. The compatibility forms (NFKC and NFKD) additionally fold most space variants, such as U+00A0 and U+3000, to U+0020. Whitespace characters are not default ignorable, unlike U+200B ZERO WIDTH SPACE, which is not WSpace. These properties ensure consistent handling across applications, building on early ASCII whitespace like space and tab while extending to global linguistic needs.[23]
| Category | Description | Examples |
|---|---|---|
| Zs (Space Separator) | Horizontal spacing characters that separate tokens without breaking lines. | U+0020 SPACE, U+00A0 NO-BREAK SPACE, U+2000 EN QUAD, U+3000 IDEOGRAPHIC SPACE (17 total in Unicode 17.0). |
| Zl (Line Separator) | Explicit line break without paragraph implications. | U+2028 LINE SEPARATOR. |
| Zp (Paragraph Separator) | Explicit paragraph break, often implying a new block. | U+2029 PARAGRAPH SEPARATOR. |
| Cc (Control) | Formatting controls acting as whitespace. | U+0009 TAB, U+000A LINE FEED, U+000B VERTICAL TAB, U+000C FORM FEED, U+000D CARRIAGE RETURN, U+0085 NEXT LINE (6 total). |
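The category breakdown in the table can be checked with Python's standard `unicodedata` module. This is a minimal sketch: the `WHITE_SPACE` list is written out by hand from the 25 code points described in this section (it is an assumption of the example, not something the module provides), and `category_counts` is a hypothetical helper. The last three lines illustrate how normalization treats two of these characters.

```python
import unicodedata
from collections import Counter

# The 25 White_Space code points, listed by hand for this sketch.
WHITE_SPACE = (
    list(range(0x0009, 0x000E))          # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]   # SPACE, NEXT LINE, NBSP, OGHAM SPACE MARK
    + list(range(0x2000, 0x200B))        # EN QUAD through HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def category_counts(code_points):
    """Count code points by Unicode general category (e.g. 'Zs', 'Cc')."""
    return Counter(unicodedata.category(chr(cp)) for cp in code_points)


counts = category_counts(WHITE_SPACE)

# Canonical normalization replaces EN QUAD with EN SPACE ...
nfc_en_quad = unicodedata.normalize("NFC", "\u2000")
# ... but leaves NO-BREAK SPACE alone; only compatibility forms fold it.
nfc_nbsp = unicodedata.normalize("NFC", "\u00a0")
nfkc_nbsp = unicodedata.normalize("NFKC", "\u00a0")
```

Note that Python's `str.isspace()` is not an exact test for the White_Space property, since it is defined in terms of general category and bidirectional class, so a hand-written table like the one above is the safer basis for a property check.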
Variant and Non-Space Blanks
In Unicode, variant blank characters include a series of fixed-width spaces in the General Punctuation block (U+2000 to U+200A): the en quad (U+2000), em quad (U+2001), en space (U+2002), em space (U+2003), three-per-em space (U+2004), four-per-em space (U+2005), six-per-em space (U+2006), figure space (U+2007), punctuation space (U+2008), thin space (U+2009), and hair space (U+200A). These characters provide precise spacing for typographic purposes, with widths typically measured as fractions of an em (e.g., the en space at half an em, and the hair space narrower than the thin space), and were introduced in Unicode 1.1 in 1993 to support legacy typesetting conventions.[24] Although classified under the Zs (Space Separator) general category like the basic space (U+0020), they function as specialized variants for layout control rather than general word separation.[24]
Beyond these, other specialized blank variants include the ideographic space (U+3000), a full-width equivalent primarily used in CJK (Chinese, Japanese, Korean) text to separate ideographs, matching the width of a typical CJK character. The Mongolian vowel separator (U+180E), added in Unicode 3.0, acts as a thin, word-internal whitespace specifically before final vowels in Mongolian script (e.g., before U+1820 or U+1821), creating a small gap without affecting line breaking. Similarly, the zero-width space (U+200B), introduced in Unicode 1.1, provides invisible separation for word boundaries or line breaks in languages written without spaces, such as Thai and other Southeast Asian scripts, while having no visible width.[24] The Mongolian vowel separator and the zero-width space, unlike the Zs spaces, belong to the Cf (Format) category and do not carry the White_Space property.
Non-space blanks often serve as visual proxies rather than functional whitespace. The open box (U+2423), from the Control Pictures block and added in Unicode 1.1, is a graphic symbol (So category) traditionally used in proofreading and documentation to represent an actual space character, depicting it as an outlined square for clarity in printed or digital text analysis. The middle dot (U+00B7), an Other Punctuation (Po) character from the Latin-1 Supplement block also added in Unicode 1.1, occasionally functions as a visual stand-in for spaces in debugging or editing contexts, such as highlighting positions in code or text where whitespace occurs, though its primary role is as a midpoint or interpunct in typography. These substitute images do not contribute to spacing or separation in rendering but aid in human-readable representations of invisible elements.
Typography and Rendering
Display Considerations
In digital displays, whitespace characters such as spaces and tabs are often collapsed during rendering to improve readability and layout efficiency. For instance, web browsers following HTML and CSS standards treat multiple consecutive spaces, tabs, or line breaks as a single space by default, unless overridden by properties like white-space: pre.[25] This collapsing occurs after HTML parsing normalizes carriage returns and line feeds to line feeds, preserving the content semantically while simplifying visual output.[25] In terminal emulators, tabs advance the cursor to the next predefined tab stop, commonly set at intervals of 8 characters in fixed-width contexts, with the exact width determined by the font's metrics for consistent alignment.[26]
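Both behaviors can be approximated in a few lines of Python. This is a simplified sketch rather than a browser or terminal implementation: `collapse_whitespace` and `expand_tabs` are hypothetical helper names, the collapsing rule covers only spaces, tabs, and line breaks, and the tab width of 8 is the common default mentioned above.

```python
import re

# Simplified collapsible set: space, tab, carriage return, line feed.
# U+00A0 NO-BREAK SPACE is deliberately not included.
_COLLAPSIBLE = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of collapsible whitespace with a single space."""
    return _COLLAPSIBLE.sub(" ", text)


def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each tab with spaces up to the next multiple of tab_width."""
    return line.expandtabs(tab_width)
```

For example, `expand_tabs("ab\tc")` pads `ab` with six spaces so that `c` starts at column 8, while `collapse_whitespace("a  \t\n b")` yields `"a b"`.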
Rendering of whitespace varies significantly between print and digital media, as well as between font types. Monospace fonts, such as Courier, assign a uniform fixed width to all characters including spaces, ensuring predictable alignment in both printed documents and on-screen displays like code editors.[27] In contrast, proportional fonts treat spaces as variable-width elements, allowing dynamic adjustments during text justification to evenly distribute lines, which can lead to challenges in line wrapping where excessive stretching or hyphenation may occur to avoid awkward breaks.[28] These differences highlight how digital rendering prioritizes flexibility over the rigid spacing typical in print typography.
Platform-specific handling introduces further display variations, particularly with line endings. Files created on Windows using CRLF (carriage return followed by line feed) may appear distorted in Unix-like terminals, where the carriage return causes the cursor to return to the line start before the line feed advances it, often resulting in overlaid text or visible "^M" artifacts unless normalized.[29] Accessibility tools, such as screen readers, interpret whitespace by introducing pauses or collapsing multiples, but inserting spaces within words for visual formatting can cause them to vocalize letters separately (e.g., "W e l c o m e" as individual letters), disrupting comprehension and violating WCAG guidelines for meaningful sequence.[30]
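A common remedy for the line-ending mismatch is to normalize text before display or comparison. The following Python sketch uses two hypothetical helpers: `normalize_newlines` converts CRLF and lone CR to LF, and `show_carriage_returns` imitates the `^M` notation by making stray carriage returns visible.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so the pair is not turned into two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def show_carriage_returns(text: str) -> str:
    """Render each carriage return as the two visible characters ^M."""
    return text.replace("\r", "^M")
```

Normalizing on input and converting back only on output, if at all, keeps the rest of a program free of platform-specific cases.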
Modern displays present additional challenges influenced by evolving font technologies and hardware. Variable fonts based on the OpenType specification, introduced in the 2010s, enable dynamic adjustments to glyph widths and kerning pairs, which can subtly alter space rendering around elements like emoji that have non-standard metrics, potentially leading to inconsistent inter-character spacing.[31] On high-DPI screens, subpixel rendering techniques like ClearType enhance legibility of text but can make thin whitespace characters appear narrower or blurrier due to scaling and antialiasing, affecting perceived uniformity across resolutions.[32] Unicode properties, such as White_Space, further guide these rendering decisions by classifying characters that trigger collapsing or line breaks in compliant systems.[33]
General-Purpose Spaces
General-purpose spaces in typography are whitespace characters of differing widths designed for flexible layout and separation in text composition, with widths defined relative to the em unit, the nominal width of the capital 'M' in a given font, equivalent to the font's point size. These spaces originated in 19th-century hot-metal typesetting, where metal slugs of varying thicknesses were cast to fill lines and create indents, with the em space serving as the foundational unit from which others were proportioned.[34][5] The term "em" derives from the width of the letter 'M', while "en" refers to half that width, based on the letter 'N'.[34]
The primary general-purpose spaces standardized in Unicode's General Punctuation block include the em space (U+2003), en space (U+2002), and thin space (U+2009). The em space has an advance width equal to one em, making it suitable for paragraph indents or block-like separations in traditional typesetting.[5][24] The en space, at half the em width, is commonly used for medium separations, such as in parenthetical inserts or to approximate the width of an en dash.[5] The thin space measures approximately 1/5 em, providing subtle spacing for elements like thousands separators in numerals (e.g., 1 000) or between initials and surnames.[24] These widths are not fixed pixels but scale with the font size, ensuring proportional rendering across different typefaces and sizes.[5]
| Character | Unicode Code | Relative Width | Common Usage |
|---|---|---|---|
| Em space | U+2003 | 1 em | Paragraph indents, block spacing |
| En space | U+2002 | 1/2 em | Medium separations, punctuation spacing |
| Thin space | U+2009 | 1/5 em | Number grouping, subtle word division |
In digital typesetting, comparable adjustments are available through the CSS word-spacing property, introduced in CSS Level 1 in 1996, which accepts lengths in em units and so can mimic these typographic effects without inserting characters directly. Their rendering remains font-dependent, influencing line justification and kerning in digital displays.[5]
For contexts requiring line-break prevention, such as numeric ranges or French typography, non-breaking variants like the narrow no-break space (U+202F) provide a thin-space equivalent that inhibits word wrapping.[35] This character, typically the width of a thin space, is recommended for separating elements in numbers or before punctuation where breaks could disrupt readability.[35]
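As an illustration of the number-grouping use, the sketch below formats an integer with a narrow no-break space between groups of three digits. The function name `group_thousands` is hypothetical; it relies only on Python's built-in `,` format specifier, whose commas are then swapped for the chosen space character.

```python
NARROW_NO_BREAK_SPACE = "\u202f"
THIN_SPACE = "\u2009"


def group_thousands(n: int, sep: str = NARROW_NO_BREAK_SPACE) -> str:
    """Format n with sep between each group of three digits."""
    # f"{n:,}" produces e.g. "1,234,567"; swap the commas for the separator.
    return f"{n:,}".replace(",", sep)
```

Using U+202F rather than U+2009 keeps a number such as 1 234 567 from being split across lines; passing `THIN_SPACE` as `sep` gives the breakable variant.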
Specialized Spacing in Text
In typography, hair spaces (U+200A) serve as ultra-thin separators, typically measuring about 1/6 em in width, and are employed for precise adjustments around punctuation such as dashes and quotation marks.[24] This usage stems from 18th-century French printing traditions, where fine spacing enhanced readability and aesthetic balance in complex layouts involving em dashes or guillemets.[36] For instance, in French typography, a hair space may flank an em dash to create subtle separation without disrupting flow, as recommended in classic typographic references for maintaining visual harmony.[37]
Punctuation integration often requires specialized whitespace to adhere to linguistic conventions, particularly in French, where a narrow non-breaking space (U+202F), known as the 'espace fine insécable', precedes colons, semicolons, question marks, and exclamation points to prevent line breaks while providing a thin inter-element gap.[38] This practice ensures typographic elegance and has been standard in French typesetting since the adoption of fixed-width spaces in the 19th century.[39] In contrast, CJK (Chinese, Japanese, Korean) scripts utilize the ideographic space (U+3000), a full-width character equivalent to the width of an ideograph, to delineate phrases without the variable spacing common in Latin-based systems.[40]
International variations further diversify whitespace applications. In Arabic script, the tatweel (U+0640, also called kashida) extends horizontal lines between letters for justification, allowing even distribution across lines while preserving cursive connections.[41] Similarly, in Devanagari, the zero-width joiner (U+200D) influences spacing by forcing glyph joining between consonants or matras, effectively reducing visible gaps in compound forms like conjuncts, which is crucial for orthographic accuracy in Hindi and related languages.[42]
These conventions are codified in authoritative standards, such as the Chicago Manual of Style (first published 1906, with ongoing updates), which advises minimal or no spacing around dashes in English but accommodates thin spaces for international punctuation like French colons to respect script-specific norms.[43] Unicode Standard Annex #14 (Unicode Line Breaking Algorithm) further details script-specific line-breaking rules, prohibiting breaks around non-breaking spaces in French typography and allowing flexible justification via tatweel in Arabic, while treating CJK ideographic spaces as non-breaking separators in East Asian layouts.[44]
Computing and Software Uses
Programming Language Handling
In many programming languages, whitespace characters such as spaces and tabs are treated as insignificant during lexical analysis, serving primarily as separators between tokens while having no impact on the program's semantics beyond that role. In the C programming language, defined by the ISO/IEC 9899 standard, whitespace separates preprocessing tokens like identifiers, keywords, and literals, but adjacent tokens without whitespace are still parsed correctly unless ambiguity arises, and indentation levels do not affect code structure. Similarly, in C++, as specified in ISO/IEC 14882, whitespace is discarded after tokenization in translation phase 3, except where it is required to disambiguate tokens, rendering indentation irrelevant to syntax.
Other languages employ significant whitespace, where the presence, absence, or extent of spaces and newlines directly influences code structure and meaning. Python, introduced in 1991, mandates consistent indentation to delineate code blocks, with the lexer tracking indentation levels to enforce block boundaries; mixing tabs and spaces inconsistently raises TabError, a subclass of IndentationError, when it makes the indentation level ambiguous.[45] Haskell, released in 1990, uses layout rules for off-side indentation in constructs like where clauses and let expressions, where the relative column position of statements determines their scope, though explicit braces and semicolons can override this.[46]
Specific handling varies across languages, often enabling optimizations or pattern matching that accounts for whitespace.
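As one concrete case of language-specific handling, Python's own standard library exposes both ideas. The sketch below uses `tokenize` to show that indentation becomes explicit INDENT and DEDENT tokens, and `re` to show that the `\s` class matches Unicode whitespace for text patterns unless the `re.ASCII` flag is given. The helper names `indent_events` and `matches_space` are hypothetical, and the regex part describes Python's `re` module only, not any other engine.

```python
import io
import re
import tokenize


def indent_events(source: str) -> list:
    """Return the names of the INDENT/DEDENT tokens the tokenizer emits."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        tokenize.tok_name[tok.type]
        for tok in tokens
        if tok.type in (tokenize.INDENT, tokenize.DEDENT)
    ]


def matches_space(ch: str, ascii_only: bool = False) -> bool:
    """Report whether a single character matches the regex class \\s."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\s", ch, flags) is not None
```

Feeding `indent_events` the three-line program `"if x:\n    y = 1\nz = 2\n"` reports one INDENT followed by one DEDENT, which is how the block structure reaches the parser without braces.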
In JavaScript, governed by the ECMAScript specification, whitespace is insignificant post-tokenization, allowing minification tools to collapse multiple spaces, tabs, and newlines into single spaces or remove them entirely without altering execution, as long as token separation is preserved.[47] For regular expressions, Perl's \s metacharacter matches ASCII whitespace by default but, in UTF-8 mode, includes Unicode whitespace characters like U+00A0 (non-breaking space) per the Unicode standard. The PCRE library, widely used for Perl-compatible regex in tools like PHP and grep, extends \s similarly to include Unicode whitespace when compiled with Unicode support, matching characters in the Unicode Separator categories.[48]
Programming languages commonly provide escape sequences to represent whitespace in string literals and character constants. In C and C++, \t denotes a horizontal tab (U+0009) and \n a line feed (U+000A), allowing literal inclusion without relying on physical input; in ISO C these are converted to members of the execution character set during translation phase 5. Trailing whitespace in source code can complicate version control. Git, since its 2005 inception, highlights such changes in diffs by default via the core.whitespace configuration, often flagging them as errors to maintain clean patches unless they are ignored with options like --ignore-space-at-eol.
Command-Line and Shell Processing
In POSIX-compliant shells, such as the Bourne shell originally developed by Stephen Bourne in 1977, command-line arguments are parsed by splitting input on whitespace characters (specifically spaces and tabs) unless the input is quoted.[49][50] This field splitting process occurs after expansions like parameter substitution and uses the Internal Field Separator (IFS) variable to define delimiters, with the default IFS consisting of space, tab, and newline, treating consecutive whitespace as a single separator while ignoring leading and trailing sequences.[51][52]
Quoting mechanisms allow preservation of whitespace within arguments. Single quotes (' ') treat the enclosed content literally, preventing any expansion or splitting, so that embedded spaces are retained as part of a single field.[53] Double quotes (" ") similarly preserve whitespace and treat the content as a single field, but permit parameter expansion (e.g., $var), command substitution (e.g., $(cmd)), and backslash escapes for a small set of special characters.[54][55] Backslashes alone can escape individual whitespace characters outside quotes, such as inserting a literal space in an unquoted argument (e.g., \ ).[56]
In shell scripting, behaviors extend this parsing to script structure and execution. For instance, in Bash, newlines serve as default statement separators, delineating individual commands unless continued with a backslash or semicolon. Paths containing spaces pose challenges in both Unix-like shells and Windows CMD; in Unix shells, such paths must be quoted to avoid splitting (e.g., cd "/path with spaces"), while Windows CMD similarly requires double quotes around paths with spaces to treat them as single arguments, though it lacks native support for single quotes.[50][57]
Command-line tools like grep and awk further interpret whitespace as delimiters during processing.
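The splitting and quoting rules above can be approximated in Python as a rough model. `shlex.split` from the standard library follows POSIX-like quoting and backslash rules (it does not perform parameter expansion or command substitution), and `str.split()` with no arguments treats runs of whitespace as one delimiter and trims the edges, much like default field splitting. The wrapper names `split_command` and `split_fields` are hypothetical.

```python
import shlex


def split_command(line: str) -> list:
    """Split a command line into words, honouring quotes and backslashes."""
    return shlex.split(line)


def split_fields(line: str) -> list:
    """Split on runs of whitespace, ignoring leading and trailing whitespace."""
    return line.split()
```

For instance, `split_command("cp 'my file.txt' \"dest dir\"")` yields three words, with the embedded spaces kept inside the second and third, whereas `split_fields` on the same string would break at every space.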
Grep matches patterns line by line, and its -w option restricts matches to whole words, meaning the match must be bounded by non-word characters (such as whitespace) or by the start or end of the line. Awk, by default, employs whitespace (spaces, tabs, newlines) as the field separator (FS), splitting input records into fields while collapsing multiple consecutive whitespace into one delimiter and trimming leading and trailing spaces.[58]
Extensions in shells like Zsh, released in 1990, enhance expansions involving whitespace through advanced globbing. Zsh's filename generation (globbing) treats unquoted whitespace as separators during pattern expansion but supports qualifiers and modifiers to handle spaces in paths more flexibly, such as recursive globbing with **/ that preserves embedded spaces when quoted.[59][60]
Markup Languages and Formatting
In markup languages, whitespace characters are often processed to ensure consistent rendering, with default rules that collapse multiple consecutive spaces, tabs, and line breaks into a single space, while providing mechanisms to override this behavior for preservation when needed.[61][62]
In HTML, whitespace in element content is preserved during parsing as text nodes in the DOM, but rendering collapses sequences of whitespace (including spaces, tabs, and newlines) to a single space, and ignores leading or trailing whitespace in most inline and block contexts.[61][25] To preserve original whitespace formatting, the <pre> element can be used, which maintains spaces, tabs, and line breaks as authored, or the CSS white-space: pre property, introduced in CSS Level 1 in 1996, achieves the same effect on any element.[63] For non-collapsible spacing, the non-breaking space entity &nbsp; (U+00A0) inserts a space that prevents line breaks and is not subject to collapsing.[64]
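The difference between collapsed content, preformatted content, and non-breaking spaces can be sketched with Python's standard `html.parser`. This is a deliberately small model, not a rendering engine: the class `TextExtractor` and the function `render_text` are hypothetical names, only the pre element is treated as whitespace-preserving, and leading or trailing whitespace is not trimmed.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, collapsing whitespace everywhere except inside <pre>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pre_depth = 0   # how many <pre> elements are currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth:
            self.parts.append(data)  # keep whitespace as authored
        else:
            # Collapse runs of space, tab, CR, LF; U+00A0 is left untouched.
            self.parts.append(re.sub(r"[ \t\r\n]+", " ", data))


def render_text(html: str) -> str:
    """Return the text of an HTML fragment with whitespace rules applied."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

In this model the three spaces in `<p>a   b</p>` become one, the same spaces inside a pre element survive, and each `&nbsp;` is decoded to U+00A0 and passes through uncollapsed.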
XML follows similar principles but distinguishes between element content and attribute values in its handling of whitespace. In element content, all whitespace is preserved by the processor and passed to the application as character data, though applications may apply further normalization unless directed otherwise.[65] Attribute values, however, are normalized: each literal whitespace character (space, carriage return, line feed, or tab) is replaced with a space, and for attribute types other than CDATA, runs of spaces are additionally collapsed to one and leading and trailing spaces are removed.[66] CDATA sections preserve all characters, including whitespace, as literal text without interpreting markup, allowing unescaped whitespace and other content to be included directly.[67] The xml:space attribute with value "preserve" signals that applications should retain all whitespace in the element's content without normalization, overriding default processing modes.[68]
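The contrast between element content and attribute values can be observed with Python's standard `xml.etree.ElementTree` parser. The document string below is a made-up example; without a DTD its attribute is treated as CDATA type, so whitespace characters are replaced by spaces but not collapsed or trimmed, while a tab written as a character reference is kept as a tab.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a literal tab in one attribute, a tab written
# as the character reference &#9; in another, and padded element content.
DOC = '<note kind="a\tb" ref="a&#9;b">  keep   me  </note>'

root = ET.fromstring(DOC)

element_text = root.text        # element content: whitespace passed through
literal_tab = root.get("kind")  # literal tab normalized to a space
charref_tab = root.get("ref")   # character reference survives as a tab
```

This is why whitespace that must survive in an attribute value is usually written as a character reference rather than typed literally.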
In LaTeX, which builds on the TeX typesetting system developed by Donald Knuth in 1978, spaces are handled through category codes (catcodes) that classify characters during input processing. The space character is assigned catcode 10 (space) by default, so that in horizontal mode it is converted to stretchable interword glue, with consecutive spaces collapsing into one. To insert a non-breaking space, the tilde (~) produces a tie, an interword space at which the line cannot break, commonly used after abbreviations or before numbers.[69] Custom horizontal spacing is achieved with \hspace{length}, which adds a specified amount of space (e.g., \hspace{1em}) that can be flexible or rigid, while active characters (catcode 13) allow spaces to be redefined for specialized behaviors like verbatim modes.[69]
Other formats exhibit varied whitespace rules tailored to their purposes. In Markdown, as defined by the CommonMark specification, multiple consecutive spaces within a paragraph are collapsed to a single space during HTML output conversion, though four or more leading spaces denote a code block where whitespace is preserved. In JSON, per RFC 8259, whitespace between tokens (such as objects, arrays, or values) is ignored for parsing, but within string values, all whitespace characters are preserved as part of the literal content.[70]
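The JSON rule is easy to confirm with Python's standard `json` module. The two document strings below are invented for illustration: they differ only in the whitespace between tokens, and both contain the same string value with two internal spaces.

```python
import json

# Same data, with and without insignificant whitespace between tokens.
SPACED = '{ "msg" :\t"two  spaces" ,\n "n" : [ 1 , 2 ] }'
COMPACT = '{"msg":"two  spaces","n":[1,2]}'

parsed_spaced = json.loads(SPACED)
parsed_compact = json.loads(COMPACT)

# Whitespace between tokens does not affect the parsed value ...
same_value = parsed_spaced == parsed_compact
# ... while whitespace inside a string is part of the data.
message = parsed_spaced["msg"]
```

Minifiers and pretty-printers rely on exactly this property: they may add or remove whitespace between tokens freely, but must never alter the contents of a string.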