Translation Memory eXchange
Translation Memory eXchange (TMX) is an XML-based open standard designed for the exchange of translation memory data between computer-aided translation (CAT) tools and localization systems, enabling the efficient sharing of previously translated text segments with minimal data loss.[1] Developed by the OSCAR Standardization Special Interest Group under the Localization Industry Standards Association (LISA), TMX facilitates interoperability across diverse translation environments by providing a vendor-neutral format for storing source and target language pairs, along with associated metadata such as dates, tools used, and segmentation information. Following LISA's dissolution in 2011, the standard was adopted and is maintained by the European Telecommunications Standards Institute (ETSI) through its Industry Specification Group on Localization Standards (ISG-LIS).[2]
The standard's core structure consists of a root <tmx> element containing a <header> for administrative details and a <body> section with <tu> (translation unit) elements, each comprising <tuv> (language variant) tags that hold <seg> (segment) content.[1] TMX supports two conformance levels: Level 1 for plain text segments and Level 2 for preserving inline formatting and codes through elements like <bpt>, <ept>, <it>, and <ph>, ensuring compatibility with complex content such as HTML or formatted documents.[2] It adheres to international standards, including Unicode encoding (UTF-8, UTF-16, or US-ASCII), ISO language codes, and ISO date formats. Version 1.4b, released in April 2005, remains the current iteration as of 2025 and was adopted by ETSI as GS LIS 002 V1.4.2 in 2013.[1][2] Widely used in the global localization industry, TMX promotes consistency in translations, reduces redundancy, and supports integration with related standards like Segmentation Rules eXchange (SRX) for handling text breaks.[2]
Overview
Definition and Purpose
Translation Memory eXchange (TMX) is an open XML-based specification developed by the OSCAR special interest group of the Localization Industry Standards Association (LISA) for the exchange of translation memory (TM) data between computer-aided translation (CAT) tools and localization software.[1] It structures TM data as aligned segments of source and target text in multiple languages, enabling the transfer of reusable translation units without dependency on specific vendor implementations.[2]
The primary purpose of TMX is to offer a vendor-neutral format that minimizes data loss during transfers between tools, facilitating the accurate recreation of source and target documents. This standardization emerged in the late 1990s in response to the proliferation of proprietary TM formats, which hindered collaboration among translators and interoperability across diverse CAT systems.[3] By providing a common exchange mechanism, TMX promotes efficiency in translation workflows and protects investments in accumulated translation assets.[4]
Key benefits of TMX include its standardized structure for TM data, which supports multiple languages and granular segment handling, thereby enhancing reusability and consistency in localization projects. Additionally, TMX integrates seamlessly into broader frameworks like the Open Architecture for XML Authoring and Localization (OAXAL), aiding in the management of multilingual content across authoring, translation, and review processes.[5]
Relation to Translation Memory
Translation memory (TM) consists of databases that store previously translated text segments in a source language paired with their corresponding translations in one or more target languages, enabling reuse in computer-aided translation (CAT) tools to promote consistency, reduce redundancy, and improve efficiency in localization workflows.[6][7] These segments, typically sentence-like units, form a bilingual corpus that supports exact and partial matches during new translation projects, minimizing manual effort while maintaining terminological and stylistic uniformity. By segmenting content into manageable units, TM systems facilitate rapid retrieval and adaptation of prior work, which is essential in professional translation environments where large volumes of related texts are processed repeatedly.[6]
The Translation Memory eXchange (TMX) functions primarily as a standardized export and import format for TM databases, transforming tool-specific, proprietary TM data into a vendor-neutral XML structure that ensures portability and interoperability across diverse CAT and localization software.[2] This role allows translation professionals and organizations to share TM resources seamlessly between systems from different vendors, avoiding vendor lock-in and enabling collaborative workflows without significant data reformatting.[8] As an open standard, TMX bridges the gap between isolated TM implementations, supporting the exchange of aligned bilingual segments while preserving essential metadata for practical reuse.[2]
TMX enhances the utility of TM systems by enabling the exchange of segments that tools can then use for fuzzy matching, where similar but non-identical segments are identified and leveraged based on similarity thresholds, thus extending the reuse potential of stored data during translation.[8] It also retains key segment attributes, such as creation and modification dates or project identifiers, which provide context for quality control and version tracking without requiring alterations to core TM processes.[2] Furthermore, TMX accommodates multilingual support within individual translation units, allowing multiple language variants to coexist and facilitating bidirectional or multi-target exchanges in global localization efforts.[2]
In contrast to native TM formats, which serve as internal storage mechanisms optimized for querying and management within specific software ecosystems, TMX is designed exclusively for exchange purposes, emphasizing lossless transfer of TM content rather than comprehensive database operations like indexing or querying.[2] This exchange-only focus makes TMX lightweight and universally compatible, relying on XML's extensibility to handle diverse TM data without the overhead of full-fledged database structures.[8] As a result, TMX complements rather than replaces native formats, prioritizing data migration and integration in heterogeneous translation environments.[2]
History
Origins and Development
The Translation Memory eXchange (TMX) format was developed by the OSCAR (Open Standards for Container/Content Allowing Re-use) special interest group, established under the Localization Industry Standards Association (LISA) in the mid-1990s to address the challenges of exchanging translation memory data in the burgeoning software localization industry.[9] LISA, founded in 1990 as a non-profit organization dedicated to standardizing globalization, internationalization, localization, and translation (GILT) practices, coordinated input from vendors, translators, and industry stakeholders worldwide to create TMX as part of its broader suite of standards, including the Term Base eXchange (TBX) for terminology management.[10][11]
TMX's initial release occurred in 1997, directly responding to the fragmentation of proprietary translation memory formats that hindered interoperability among tools in the rapidly expanding localization sector.[4] The OSCAR group focused on defining a vendor-neutral XML-based structure to enable seamless import and export of translation units, thereby facilitating reuse of linguistic assets across diverse systems.[12]
In March 2011, LISA was declared insolvent and ceased operations, prompting the transfer of its standards portfolio, including TMX, to community maintenance under a Creative Commons Attribution 3.0 license.[2] The Globalization and Localization Association (GALA) assumed responsibility for hosting and preserving TMX specifications, ensuring ongoing accessibility for the industry.[13] A key milestone in this period was TMX's integration into the Open Architecture for XML Authoring and Localization (OAXAL) reference model in 2009, which incorporated it as a core component for exchanging translation memories within broader XML-based localization workflows.[5]
Versions and Evolution
The Translation Memory eXchange (TMX) format was first introduced in version 1.0 in 1997 by the Localization Industry Standards Association (LISA) through its OSCAR special interest group, establishing a basic XML-based structure for exchanging translation memory data between tools and vendors with minimal loss of information.[14] Subsequent updates refined the format based on user feedback. Version 1.1, released in August 1998, added support for segment notes to allow translators to include annotations directly within translation units.[15] Version 1.2 in 1999 improved attribute handling, enhancing the flexibility of metadata such as creation dates and language codes. Version 1.3, released in August 2001, enhanced inline markup to better preserve formatting and structural elements during exchange.[16] Work toward version 1.4b began with a minor revision, 1.4a, in July 2002, which adjusted attribute requirements for better compatibility, followed by drafts in April and October 2004. The final release, on April 26, 2005, further refined properties for translation units and deprecated unused elements such as <ut> to streamline the schema and reduce redundancy.[1]
In March 2007, LISA released a working draft of TMX 2.0 for public comment, proposing extensions for richer metadata support and integration with web services to address evolving needs in distributed translation workflows; however, development was abandoned following LISA's closure in 2011 due to financial insolvency.[17]
As of 2025, version 1.4b remains the de facto standard, with formal adoption by the European Telecommunications Standards Institute (ETSI) in 2013 through Group Specification GS LIS 002 V1.4.2, ensuring continued maintenance and interoperability.[2] Post-LISA, the Globalization and Localization Association (GALA) has hosted the legacy standards and focused community efforts on compatibility testing rather than new versions, emphasizing backward compatibility across tools.
The evolution of TMX was primarily driven by feedback from tool vendors highlighting data loss issues, such as inconsistencies in segmentation and inline codes; no major updates have occurred since 2005 due to the format's sufficient maturity for core exchange needs and a broader industry shift toward complementary standards like XLIFF 2.0 for handling complex inline content.[18]
Technical Specification
Document Structure
The Translation Memory eXchange (TMX) format is structured as a well-formed XML document to ensure interoperability in exchanging translation memory data.[2] At its core, a TMX file begins with the mandatory root element <tmx>, whose required version attribute specifies the TMX standard version, such as "1.4".[2] The administrative language used for metadata within the file is declared separately, through the adminlang attribute of the header.[2] The <tmx> element encapsulates the entire document hierarchy, containing one <header> element followed by one <body> element.[2]
The <header> element provides essential metadata about the TMX file and is required for complete documentation of the translation memory's origin and properties.[2] It includes several required attributes: creationtool (the tool that generated the file), creationtoolversion (the version of that tool), segtype (the type of segments, such as "sentence"), o-tmf (the originating translation memory format), adminlang (the language for administrative elements), srclang (the source language code), and datatype (the data type, e.g., "plaintext" or "xml").[2] Optional attributes in the <header> encompass o-encoding (the original encoding of the source data), creationdate (the file creation timestamp in YYYYMMDDTHHMMSSZ format), creationid (an identifier for the creator), changedate (last modification timestamp), and changeid (identifier for the last modifier).[2] This metadata facilitates tracking and compatibility during data exchange without altering the core translation units.[2]
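The header rules above can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the function names and the sample header (which deliberately omits datatype) are illustrative and not part of the TMX specification.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Required <header> attributes, as listed in the paragraph above.
REQUIRED_HEADER_ATTRS = (
    "creationtool", "creationtoolversion", "segtype",
    "o-tmf", "adminlang", "srclang", "datatype",
)

def missing_header_attributes(header):
    """Return the required attribute names that a <header> element lacks."""
    return [name for name in REQUIRED_HEADER_ATTRS if name not in header.attrib]

def parse_tmx_date(value):
    """Parse a YYYYMMDDTHHMMSSZ timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(value, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)

# Hypothetical header used only for illustration; datatype is left out on purpose.
sample = (
    '<header creationtool="ExampleTool" creationtoolversion="1.0" '
    'segtype="sentence" o-tmf="TMX" adminlang="en-us" srclang="en-us" '
    'creationdate="20231114T100000Z"/>'
)
header = ET.fromstring(sample)
print(missing_header_attributes(header))
print(parse_tmx_date(header.get("creationdate")).isoformat())
```

A check of this kind only covers attribute presence and date syntax; it does not replace validation against the DTD.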
Following the header, the <body> element houses the primary content of the TMX file, consisting of zero or more <tu> (translation unit) elements that represent the actual translation memory segments.[2] There is no prescribed order for the <tu> elements within the body, allowing flexibility in how translation data is organized.[2]
TMX files must adhere to XML well-formedness rules to ensure parseability across tools, and they support UTF-8 encoding by default, with compatibility for UTF-16 and ISO-646 (US-ASCII).[2] Files are conventionally saved with the .tmx extension and recognized under the MIME type application/x-tmx, which underscores their XML-based nature while specifying the TMX format for precise handling in software applications.[19]
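As an illustration of the overall hierarchy (root, header, body, translation units) and of UTF-8 serialization, the sketch below builds a small TMX tree with Python's standard library. The build_tmx helper, the tool name "ExampleTool", and the language codes are assumptions made for the example; the DOCTYPE declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree addresses xml:lang through the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, srclang="en-us", tgtlang="fr"):
    """Build a minimal TMX tree from (source, target) text pairs."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # The header comes first and carries the seven required attributes.
    ET.SubElement(tmx, "header", {
        "creationtool": "ExampleTool",      # hypothetical tool name
        "creationtoolversion": "1.0",
        "segtype": "sentence",
        "o-tmf": "TMX",
        "adminlang": "en-us",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, source), (tgtlang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx

tree = build_tmx([("Hello", "Bonjour")])
# Serialize as UTF-8 bytes with an XML declaration (Python 3.8+).
data = ET.tostring(tree, encoding="UTF-8", xml_declaration=True)
print(data.decode("utf-8"))
```

Writing the resulting bytes to a file with the .tmx extension would give a file that follows the structure described in this section.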
Core Elements and Attributes
The core elements of the Translation Memory eXchange (TMX) format within the document body primarily revolve around the <tu> element, which serves as the fundamental container for bilingual or multilingual translation data, encapsulating aligned source and target segments along with associated metadata. This structure enables the exchange of translation memories between tools while preserving the integrity of the paired linguistic content. The TMX specification, maintained under ETSI standards, defines these elements to ensure interoperability in localization workflows.[2]
The <tu> (translation unit) element acts as the primary container for a source-target pair, representing a single entry in the translation memory. It may include optional attributes such as tuid for a unique identifier (e.g., tuid="tu001" to facilitate tracking and referencing across systems), srclang to specify the source language using RFC 3066 codes (e.g., srclang="en-us"), datatype to indicate the content type (e.g., datatype="plaintext" or datatype="html"), and segtype to denote the segmentation level (e.g., segtype="sentence" for sentence-based units). Additional attributes like creationdate (in the compact ISO 8601 form YYYYMMDDTHHMMSSZ, e.g., creationdate="20250101T000000Z") record when the unit was created, supporting provenance tracking. Each <tu> is placed within the <body> section of the TMX file, building on the overall document hierarchy. A <tu> can contain multiple <tuv> elements for multilingual extensions, as well as <prop> and <note> for metadata.[2][1]
Nested within each <tu>, the <tuv> (translation unit variant) element represents the language-specific portion of the translation unit, holding the text for a particular language variant. It requires the xml:lang attribute (e.g., xml:lang="fr-fr") to identify the language per RFC 3066 standards, ensuring clear delineation of source or target content. Optional attributes include datatype (mirroring the parent <tu> for consistency) and temporal ones like creationdate or changedate to log modifications. The <tuv> contains exactly one <seg> element for the textual content, preceded by any optional <prop> and <note> elements. A bilingual pair is formed by one <tuv> for the source and another for the target, and a <tu> may hold further <tuv> elements to cover three or more languages, facilitating comprehensive multilingual data exchange.[2][1]
The <seg> (segment) element is the core carrier of the actual translatable text within a <tuv>, directly enclosing the source or target strings while supporting inline markup for formatting preservation (though detailed markup is addressed separately). It has no attributes. This element ensures that the bilingual data remains intact and aligned, with the text content forming the basis for translation memory reuse. For instance, a <seg> might hold "Hello world" in English paired with "Bonjour le monde" in French within adjacent <tuv> elements.[2][1]
To accommodate custom metadata, the <prop> (property) element provides a flexible mechanism for attaching additional information to the <header>, a <tu>, or a <tuv>, such as project-specific details or quality scores. It requires a type attribute (e.g., type="x-Project" for custom project identifiers or type="match-quality" for fuzzy match indicators), and an optional xml:lang attribute allows language-specific properties. The element's content holds the property value as text, enabling tool interoperability without mandating universal standards for all metadata. This is particularly useful for representing fuzzy matches in translation memories: a <prop> with type="match-quality" and the content "80" records an 80% similarity score, marking a partial match that can still be leveraged.[2][1]
Finally, the <note> element allows for annotations, comments, or provenance details, enhancing the interpretability of bilingual data without altering the core segments. It can be placed inside <header>, <tu>, or <tuv> and supports an optional xml:lang attribute for multilingual notes (e.g., xml:lang="en" for an English-language comment). The content is plain text, providing context such as translator remarks or revision history, which aids in collaborative translation workflows. For example, a <note> might explain a cultural adaptation in a target segment.[2][1]
Through these elements, TMX ensures robust representation of bilingual data, with attributes like tuid and xml:lang enabling precise tracking and language handling across multiple <tuv> variants per <tu>, as standardized in version 1.4b and subsequent ETSI updates.[2][1]
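The relationship between <tu>, <tuv>, <seg>, <prop>, and <note> can be illustrated by reading a unit into a plain data structure. This is a minimal sketch using Python's standard xml.etree.ElementTree; the read_units function, the dictionary layout, and the sample unit are assumptions for illustration rather than anything prescribed by the standard.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_units(root):
    """Yield one dict per <tu>: its tuid, text per language, props and notes."""
    for tu in root.iter("tu"):
        yield {
            "tuid": tu.get("tuid"),
            # findtext returns the plain text of <seg>; this sketch assumes
            # Level 1 content, i.e. segments without inline elements.
            "variants": {
                tuv.get(XML_LANG): tuv.findtext("seg")
                for tuv in tu.findall("tuv")
            },
            # Only unit-level metadata is collected here, not <tuv>-level.
            "props": {p.get("type"): p.text for p in tu.findall("prop")},
            "notes": [n.text for n in tu.findall("note")],
        }

# Hypothetical unit reusing the examples from this section.
SAMPLE = """<body>
  <tu tuid="tu001">
    <prop type="x-Project">Demo</prop>
    <note>Greeting shown on the start screen</note>
    <tuv xml:lang="en-us"><seg>Hello world</seg></tuv>
    <tuv xml:lang="fr-fr"><seg>Bonjour le monde</seg></tuv>
  </tu>
</body>"""

units = list(read_units(ET.fromstring(SAMPLE)))
print(units[0]["variants"])
```

Keying the variants by language code reflects the fact that a unit is not limited to two languages: additional <tuv> elements simply add entries.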
Inline Markup and Conformance Levels
Inline markup in TMX enables the preservation of document formatting and structural elements within translation units, ensuring that inline codes or tags from the source document are not modified during translation and can be accurately reproduced in the target document. These elements are embedded within the <seg> element of a translation unit, protecting native formatting such as bold text, hyperlinks, or placeholders from the translation process. The primary purpose of inline markup is to maintain the integrity of non-translatable content, facilitating seamless integration back into the original file formats like HTML or XML.[1]
TMX defines several inline elements to handle different types of formatting:
- <bpt> (begin paired tag): Marks the start of a paired inline code, such as an opening HTML tag. It requires an i attribute that pairs it with the matching end tag in the same segment, and may include an x attribute for matching the code across the <tuv> elements of a unit as well as a type attribute, e.g., <bpt i="1" x="1">&lt;b&gt;</bpt>. It can contain <sub> for sub-flow content like footnotes.[1]
- <ept> (end paired tag): Marks the end of a paired inline code, corresponding to the <bpt>. It requires an i attribute with the same value as its <bpt>, e.g., <ept i="1">&lt;/b&gt;</ept>. Like <bpt>, it supports <sub>.[1]
- <it> (isolated tag): Represents a paired code whose partner lies outside the segment, such as an opening tag whose closing tag falls in a later sentence. It requires a pos attribute stating whether the code is a beginning or an ending code ("begin" or "end") and may include x, e.g., <it pos="begin" x="2">&lt;i&gt;</it>. It can nest <sub>.[1]
- <ph> (placeholder): Denotes a standalone inline code, like an image reference or non-translatable placeholder. It is empty or holds the native code, may contain <sub>, and supports an optional assoc attribute indicating whether the placeholder is associated with the text before or after it, e.g., <ph x="3"/>.[1]
- <hi> (highlight): Delimits spans of text with special meaning, such as terms or proper names, for processing purposes. It may include optional type (e.g., type="term") and x attributes, and can contain text and other inline elements. Example: <hi type="term">important text</hi>.[1]
- <sub> (sub-flow): Encapsulates nested content within other inline elements, such as alternative text in a code. It delimits sub-flows like tooltips and may contain further inline markup.[1]
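How a consuming tool might separate translatable text from protected codes can be sketched as follows. This is an assumption-laden illustration in Python, not a procedure defined by TMX: the plain_text helper drops the content of <bpt>, <ept>, <it>, and <ph> (including any <sub> nested in them) and keeps the text inside <hi>.

```python
import xml.etree.ElementTree as ET

# Inline elements whose content is native code rather than translatable text.
CODE_ELEMENTS = {"bpt", "ept", "it", "ph"}

def plain_text(elem):
    """Return the text of a <seg> with inline codes removed.

    Content of code elements is skipped, sub-flows included; text wrapped
    in other inline elements such as <hi> is kept.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in CODE_ELEMENTS:
            parts.append(plain_text(child))
        # Text following an inline element always belongs to the segment.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical Level 2 segment: bold "Warning:" plus a highlighted term.
seg = ET.fromstring(
    '<seg><bpt i="1" x="1">&lt;b&gt;</bpt>Warning:<ept i="1">&lt;/b&gt;</ept>'
    ' Do not operate without <hi type="term">safety gear</hi>.</seg>'
)
print(plain_text(seg))
```

Discarding sub-flows is a simplification; a tool that translates footnotes or tooltips would need to extract <sub> content separately.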
TMX defines two conformance levels. Level 1 covers plain-text segments only, relying on basic constructs such as xml:lang and <seg>, making it suitable for simple content without formatting, such as software strings.[1] Level 2 offers full support for inline elements (<bpt>, <ept>, <it>, <ph>), properties, notes, and extended attributes, enabling handling of complex documents like those with HTML or XML tags; however, <hi> and <sub> are optional even at this level to ensure core compliance.[1]
Validation of TMX documents relies on the associated DTD (tmx14b.dtd), which defines the structure and allowable elements; files must be well-formed XML, and full validity requires parsing against the DTD using a validating XML parser. The <ut> (unknown tag) element from earlier versions is deprecated in the 1.4b specification and retained only for compatibility with older files.[1]
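A lightweight pre-check can catch the most common problems before a file reaches a validating parser. The sketch below uses Python's standard xml.etree.ElementTree, which checks well-formedness but does not validate against a DTD; the check_tmx function and its messages are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def check_tmx(text):
    """Return a list of problems found by a shallow, non-validating check."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        # Not well-formed XML: nothing else can be inspected.
        return [f"not well-formed: {exc}"]
    problems = []
    if root.tag != "tmx":
        problems.append(f"root element is <{root.tag}>, expected <tmx>")
    if root.find("header") is None:
        problems.append("missing <header>")
    if root.find("body") is None:
        problems.append("missing <body>")
    # Count occurrences of the deprecated <ut> element anywhere in the file.
    ut_count = sum(1 for _ in root.iter("ut"))
    if ut_count:
        problems.append(f"{ut_count} deprecated <ut> element(s)")
    return problems

good = '<tmx version="1.4"><header/><body/></tmx>'
legacy = ('<tmx version="1.4"><header/><body><tu><tuv xml:lang="en">'
          '<seg><ut>{b}</ut>Bold</seg></tuv></tu></body></tmx>')
print(check_tmx(good))
print(check_tmx(legacy))
print(check_tmx("<tmx><body></tmx>"))
```

Full DTD validation would require a validating parser, which is outside the scope of this sketch.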
Despite these features, TMX has notable limitations in inline markup handling: it preserves but does not interpret native codes, focusing solely on text and markup without support for binary objects like images or embedded media. Unlike formats such as XLIFF, TMX lacks mechanisms for versioning history or workflow metadata, restricting it to static translation exchanges.[1]
Implementation and Usage
Supported Tools and Software
Several industry-standard computer-assisted translation (CAT) tools provide native support for the Translation Memory eXchange (TMX) format, enabling seamless import and export of translation memories across different platforms. SDL Trados Studio, a leading CAT tool, has offered full Level 2 import and export capabilities for TMX since its 2007 version, allowing preservation of inline markup and attributes during exchanges.[20] MemoQ designates TMX as its primary exchange format, facilitating direct integration for translation memory sharing and workflow automation.[21] OmegaT, an open-source CAT tool, reads and writes TMX version 1.4b, supporting basic interoperability for professional translators on multiple operating systems.[22]
Other notable tools also incorporate TMX handling to enhance flexibility in localization projects. Déjà Vu provides native TMX support for importing and exporting translation units, ensuring compatibility with diverse memory databases. Wordfast allows TMX imports to populate its translation memory, streamlining the migration of existing assets into its environment. Cloud-based platforms like Smartcat and Transifex support TMX uploads for seeding translation memories, enabling collaborative teams to leverage prior translations without proprietary format barriers.[23]
In enterprise settings, systems such as SDL WorldServer and RWS Translation Management System (TMS) utilize TMX for inter-tool transfers, promoting data portability in large-scale localization pipelines. The Okapi Framework offers libraries for TMX processing, allowing developers to build custom applications that handle TMX files for advanced filtering and conversion tasks.
Compatibility across tools generally adheres to TMX Level 1 for core structure and segments, while Level 2 support varies; for instance, SDL Trados preserves inline markup effectively, whereas simpler tools like OmegaT may lose complex formatting during processing.[16] As of 2025, TMX integration is increasingly common in AI-assisted workflows, with tools combining it alongside APIs from DeepL and Google Translate to support hybrid human-AI translation processes.[24]
Export, Import, and Best Practices
The export process for TMX in computer-assisted translation (CAT) tools typically begins by selecting the relevant translation memory database within the tool's interface, followed by choosing the export option to TMX format. Users can specify the desired conformance level, either Level 1 for plain text or Level 2 for content with inline markup, and apply filters such as specific language pairs or date ranges to include only pertinent segments. This generates a .tmx file containing a mandatory <header> followed by a <body> holding the exported translation units.
Examples
Basic Translation Unit
The basic translation unit in TMX represents the smallest bilingual segment for exchange, consisting of a source text paired with its translation in one or more target languages, encapsulated within a simple XML structure. This unit adheres to the core elements of the TMX format, enabling straightforward interoperability between translation memory tools without requiring advanced features.[25] A minimal example of a TMX file containing a single basic translation unit is as follows, demonstrating the exchange of an English sentence to French:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx PUBLIC "-//LISA OSCAR:1998//DTD for Translation Memory eXchange//EN" "tmx14b.dtd">
<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0" segtype="sentence" o-tmf="TMX" adminlang="en-us" srclang="en-us" datatype="plaintext" />
  <body>
    <tu tuid="tu001">
      <tuv xml:lang="en-us">
        <seg>Hello</seg>
      </tuv>
      <tuv xml:lang="fr">
        <seg>Bonjour</seg>
      </tuv>
    </tu>
  </body>
</tmx>
```
This snippet illustrates Level 1 conformance of the TMX specification, which supports only plain text segments without inline markup or protected content, ensuring compatibility for basic data transfer.[25] The <tu> element includes a tuid attribute for unique identification of the unit, while the <header> specifies the source language (srclang="en-us") and other metadata to define the file's context.[25]
In practice, such a basic translation unit is used for exchanging a single sentence pair between translation tools where no formatting or structural preservation is needed, facilitating quick imports into memory databases for simple localization workflows.[25]
This example conforms to the TMX 1.4b DTD (tmx14b.dtd), resulting in a file size under 1KB due to its minimal content.[25]
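Importing such a file into a memory and reusing it for exact matches can be sketched in a few lines of Python. The load_memory function, the dictionary-based memory, and the case-insensitive comparison of language codes are assumptions made for this illustration; the DOCTYPE line of the example is left out for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The example file from this section, without the XML and DOCTYPE declarations.
TMX = """<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0" segtype="sentence"
          o-tmf="TMX" adminlang="en-us" srclang="en-us" datatype="plaintext"/>
  <body>
    <tu tuid="tu001">
      <tuv xml:lang="en-us"><seg>Hello</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour</seg></tuv>
    </tu>
  </body>
</tmx>"""

def load_memory(root, target_lang):
    """Map each source segment to its translation in target_lang."""
    # The header's srclang tells which <tuv> holds the source text.
    srclang = root.find("header").get("srclang").lower()
    target_lang = target_lang.lower()
    memory = {}
    for tu in root.iter("tu"):
        texts = {
            tuv.get(XML_LANG, "").lower(): tuv.findtext("seg")
            for tuv in tu.findall("tuv")
        }
        if srclang in texts and target_lang in texts:
            memory[texts[srclang]] = texts[target_lang]
    return memory

memory = load_memory(ET.fromstring(TMX), "fr")
print(memory.get("Hello"))   # exact-match lookup
```

A real CAT tool would add fuzzy matching and indexing on top of such a lookup; the dictionary here only demonstrates the exact-match case.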
Advanced Features in Practice
In practice, advanced TMX files often incorporate metadata within the <header> element to track origin and provenance, such as the creationtool attribute specifying the originating software and creationdate indicating when the file was generated.[2] For instance, a sample header might appear as:
```xml
<header creationtool="MemoQ" creationtoolversion="9.5" creationdate="20231114T100000Z"
        segtype="sentence" o-tmf="MemoQ" adminlang="en-us" srclang="en" datatype="plaintext">
</header>
```
This setup ensures interoperability while preserving workflow details during exchange.[2] A complex translation unit (<tu>) can use tuid for unique identification and a <prop> element to record a similarity score, such as "95" for a high-confidence match, alongside nested <tuv> elements for bilingual content.[2] Within the <seg> of a <tuv>, inline markup protects formatting, as seen in this excerpt:
```xml
<tu tuid="tu001">
  <!-- Metadata (<prop>, <note>) precedes the <tuv> and <seg> content. -->
  <prop type="match-quality">95</prop>
  <tuv xml:lang="en">
    <prop type="x-JobID">123</prop>
    <note>Approved translation for technical manual section 4.2</note>
    <!-- Native codes are escaped inside <bpt>/<ept>; the i values pair them. -->
    <seg><bpt i="1" x="1" type="bold">&lt;b&gt;</bpt>Warning:<ept i="1">&lt;/b&gt;</ept> Do not operate without safety gear.</seg>
  </tuv>
  <tuv xml:lang="fr">
    <seg><bpt i="1" x="1" type="bold">&lt;b&gt;</bpt>Avertissement :<ept i="1">&lt;/b&gt;</ept> Ne pas utiliser sans équipement de sécurité.</seg>
  </tuv>
</tu>
```
This structure demonstrates Level 2 conformance by supporting protected inline codes without altering the original segmentation or tags.[2] The <prop> element with a custom x- prefixed type adds proprietary metadata, such as job identifiers, for internal tracking, while the <note> provides contextual comments to aid translators or reviewers.[2]
Such advanced features are particularly useful in exchanging formatted segments from technical manuals, where preserving bold text for warnings ensures visual consistency across tools and maintains metadata like approval status to streamline collaborative workflows.[2] This approach protects inline elements during import/export, reducing errors in rich-content localization projects.[2]
Validation of these features involves verifying the balance of paired inline elements, such as ensuring every <bpt> has a corresponding <ept> with matching i values, to prevent parsing failures in compliant tools.[2] Common issues, like unclosed tags, can lead to data loss or rejection during processing, underscoring the need for tools to enforce structural integrity as per the TMX conformance guidelines.[2]
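The pairing check described above can be sketched as follows, assuming Python's standard xml.etree.ElementTree. The unpaired_codes helper is hypothetical; it compares the i values of <bpt> and <ept> elements within one segment and does not verify their order.

```python
import xml.etree.ElementTree as ET

def unpaired_codes(seg):
    """Return the i values of <bpt>/<ept> elements that lack a partner."""
    opened = {e.get("i", "") for e in seg.iter("bpt")}
    closed = {e.get("i", "") for e in seg.iter("ept")}
    # The symmetric difference keeps values present on one side only.
    return sorted(opened ^ closed)

# Hypothetical segments: one balanced, one with a missing <ept>.
balanced = ET.fromstring(
    '<seg><bpt i="1" x="1">&lt;b&gt;</bpt>Warning:<ept i="1">&lt;/b&gt;</ept>'
    ' Do not operate without safety gear.</seg>'
)
broken = ET.fromstring(
    '<seg><bpt i="1" x="1">&lt;b&gt;</bpt>Warning: Do not operate.</seg>'
)
print(unpaired_codes(balanced))
print(unpaired_codes(broken))
```

An importer could run such a check per <seg> and report or quarantine the units with unpaired codes instead of rejecting the whole file.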