International Components for Unicode

International Components for Unicode (ICU) is an open-source project consisting of mature C/C++ and Java libraries that provide comprehensive Unicode and globalization support for software applications, enabling robust handling of text in multiple languages and locales.^[1] Developed to address the complexities of internationalizing software, ICU offers tools for character encoding conversion, text collation, date and number formatting, and locale-specific data management, drawing on the Unicode Consortium's Common Locale Data Repository (CLDR).^[2] It is designed to be portable across platforms, allowing developers to create applications that support global users without region-specific variants.^[3]

ICU originated in the 1990s at Taligent, a joint venture between Apple and IBM. It evolved from early internationalization efforts, with its Java components incorporated into JDK 1.1 and later ported to C/C++ by IBM's Unicode team.^[3] By 1999, a Project Management Committee was established under IBM stewardship, and in 2016 the project formally joined the Unicode Consortium as a technical committee, ensuring ongoing alignment with evolving Unicode standards.^[4] Released under a permissive license, ICU's source code is hosted on GitHub, with regular updates tracking Unicode versions and incorporating enhancements like improved collation algorithms and support for over 700 locales.^[5]

ICU is widely adopted by major technology companies and software projects, including Adobe, Amazon (for Kindle), Apple, Apache, Google, IBM, Microsoft (integrated into Windows), and Oracle, among others, powering e-commerce, operating systems, and web applications for global scalability.^[1] Its reliability and extensibility have made it a de facto standard for software internationalization, reducing development costs and enhancing cross-cultural functionality in diverse environments.^[3]

Overview

Purpose and Scope

The International Components for Unicode (ICU) is an open-source project comprising mature C/C++ and Java libraries that deliver robust Unicode support alongside tools for software internationalization (i18n) and globalization (g11n). These libraries enable developers to build applications capable of handling multilingual text and cultural adaptations across diverse environments.^[1]

At its core, ICU focuses on essential Unicode text processing tasks, including collation for sorting strings according to locale-specific rules based on the Unicode Collation Algorithm, normalization to standardize text representations, and case folding for consistent comparisons. It also offers locale-aware formatting for dates, times, numbers, and currencies (for example, rendering "$1,234.56" in en_US or "1 234,56 €" in fr_FR), along with message construction via MessageFormat to generate dynamic, plural-sensitive strings like "You have {count, plural, one {# message} other {# messages}}."^[1]^[6]^[7]^[8]^[9]

As a widely adopted solution, ICU powers globalization in numerous software applications, operating systems including Windows, and databases such as IBM Informix, drawing on the Unicode Common Locale Data Repository (CLDR) to support over 700 locales for comprehensive cultural and linguistic coverage.^[2]^[10]^[11]^[12]

Licensing and Platforms

The International Components for Unicode (ICU) is distributed under the Unicode License Agreement, a permissive open-source license that grants users the right to freely use, copy, modify, merge, publish, distribute, and/or sell copies of the Unicode Data Files and Software, including the ICU libraries, without royalties or other charges, provided that the copyright notice and permission notice are included in all copies or substantial portions of the Software. This license, similar in permissiveness to the BSD and MIT licenses, explicitly disclaims all warranties, including implied warranties of merchantability, fitness for a particular purpose, and non-infringement, and excludes liability of the copyright holders for claims or damages arising from use of the software.^[13]^[14]

ICU source code is hosted on GitHub in the unicode-org/icu repository, enabling developers to access, fork, and contribute to the project while adhering to the Contributor License Agreement for submissions. Binary distributions are available for download from the official ICU website, including pre-built libraries for various platforms to simplify integration without compilation. ICU is also available through popular package managers, such as vcpkg for C/C++ on Windows, Linux, and macOS (vcpkg install icu), Homebrew for macOS (brew install icu4c), and Maven or Gradle for the Java variant (e.g., <dependency><groupId>com.ibm.icu</groupId><artifactId>icu4j</artifactId><version>78.1</version></dependency> in Maven).^[5]^[15]^[16]

ICU4C, the C/C++ implementation, supports a wide range of platforms including Windows (version 7 and later), Linux distributions, macOS, Android, iOS (via cross-compilation), and others like z/OS and IBM i, with routine testing on recent versions of Linux, macOS, and Windows. ICU4J, the Java library, runs on any platform with a supported Java Runtime Environment (JRE), including desktop, server, and mobile environments. Starting with ICU version 75 (released in 2024), the C++ components require a compiler supporting C++17, while the C code requires C11 compliance.^[17]^[18]^[19]^[20]

Building ICU4C typically involves platform-specific tools: on UNIX-like systems (Linux, macOS), the runConfigureICU script (a wrapper around the Autotools configure script) followed by GNU Make (version 3.80 or later); on Windows, Microsoft Visual Studio (2017 or later) using solution files and MSBuild. Cross-compilation is supported for targets like Android and iOS using appropriate toolchains. ICU4J is built with Java build tools: Maven (version 3+ with JDK 11+) for versions 78 and later, or Apache Ant for earlier releases, producing JAR files that can be used within Java projects.^[17]^[19]

History

Origins

The International Components for Unicode (ICU) originated in the early 1990s at Taligent, a joint venture between Apple Computer and IBM established in 1992 to develop advanced cross-platform object-oriented operating systems and applications.^[21] Taligent's efforts focused on creating a robust Locale System to enable internationalization (i18n) and Unicode support in software, addressing the need for multilingual text processing in a unified framework.^[22] This foundational work laid the groundwork for ICU's emphasis on portable, standards-compliant globalization tools.^[21]

Following IBM's acquisition of full ownership of Taligent in early 1996, the company's Text and International group collaborated with Sun Microsystems to integrate key internationalization components into the Java Development Kit (JDK).^[21] These contributions formed the basis of the java.text package (including classes like Format, Collator, and BreakIterator) and elements of the java.util package (such as ResourceBundle, Calendar, and TimeZone), which were incorporated into JDK 1.1 and released in early 1997.^[21] The initial implementation prioritized compliance with Unicode 2.0, ensuring support for the era's character encoding standards and supplementary characters.^[22]

In 1999, IBM open-sourced these Java-based components under the name IBM Classes for Unicode, making the code publicly available through CVS and the Jitterbug bug-tracking system.^[23] To extend functionality beyond Java environments, a C/C++ port known as ICU4C was developed shortly thereafter, with the project officially renamed International Components for Unicode in 2001 to reflect its broader scope and Unicode-centric mission.^[22]^[24]

Key early contributors included the Taligent development team, IBM's Globalization Center of Competency, and figures like Dr. Mark Davis, who led the Unicode integration efforts.^[21] This evolution positioned ICU for ongoing stewardship by the Unicode Consortium.^[1]

Development and Releases

The International Components for Unicode (ICU) project was initially released as open-source software by IBM in 1999, providing C/C++ and Java libraries for Unicode and globalization support.^[2] In May 2016, IBM transferred stewardship of the project to the Unicode Consortium to enable formal governance, broader community involvement, and alignment with Unicode standards.^[25]

ICU publishes major releases roughly twice a year, typically aligning with updates to the Unicode Standard and Common Locale Data Repository (CLDR).^[15] Since ICU 49, each stable release increments the major version number; earlier versions (up to 4.8) used even minor numbers for stable reference releases and odd numbers for development snapshots leading to the next stable version.^[26] Examples include the initial release in 1999 and the most recent major release, ICU 78.1, on October 30, 2025.^[27]^[28]

Key milestones in ICU's development include ICU 4.0 in 2008, which provided full support for Unicode 5.1 along with enhanced APIs for internationalization.^[29] In 2016, ICU 58 removed the previously deprecated complex text layout engine, shifting focus to more modern rendering solutions while adding full support for Unicode 9.0.^[30] ICU 73.2, released in 2023, introduced compliance with the updated GB18030-2022 encoding standard for Chinese character support.^[31] Subsequent releases built on this: ICU 74 in 2023 added support for Unicode 15.1; ICU 75 in 2024 mandated C++17 for C++ code and C11 for C code to improve robustness and modernize the codebase; ICU 76 in 2024 added support for Unicode 16.0; ICU 77 in 2025 focused on bug fixes and CLDR 47 updates; and ICU 78 in 2025 introduced support for Unicode 17.0.^[32]^[33]^[27]

As of 2025, ICU remains under active development on GitHub, marking over 25 years of continuous evolution since its inception.^[5] The project encourages community contributions through pull requests, with ongoing enhancements to Unicode conformance, locale data, and performance.^[5]

Core Architecture

ICU4C and ICU4J Libraries

The International Components for Unicode (ICU) project provides two primary library implementations: ICU4C for C/C++ environments and ICU4J for Java. These libraries form the foundational building blocks for Unicode and internationalization support in software applications, offering low-level operations for text processing and globalization.^[34]^[19]

ICU4C is the core C/C++ library designed for efficient, low-level Unicode operations in native applications. It includes headers such as <unicode/utypes.h> for defining basic types and constants, with sources organized into directories like source/common/ for core utilities (e.g., the UnicodeString class) and source/i18n/ for internationalization features. The library supports UTF-8, UTF-16, and UTF-32 encodings to handle Unicode text across various platforms. It is commonly used in performance-sensitive native applications and databases, such as MySQL, where it enables features like regular expression support with Unicode awareness.^[34]^[35]

ICU4J serves as the Java counterpart, mirroring the APIs of ICU4C to provide consistent functionality in JVM-based environments. Organized into packages like com.ibm.icu.text for text processing and com.ibm.icu.util for utilities, it extends Java SE's built-in internationalization (i18n) capabilities through service provider interfaces in java.text.spi and java.util.spi. This integration allows for advanced features beyond standard Java libraries, such as enhanced collation and formatting. ICU4J is widely adopted in Android applications (requiring API level 21 or later with library desugaring) and enterprise Java systems for robust multilingual support.^[19]

Both libraries share a common architecture to ensure portability and consistency, including the use of binary data files like .dat for locale-specific information, which are generated at build time from source data in source/data/. This packaging allows for customizable inclusion of locales and resources, and API change reports track compatibility across versions. The design avoids platform-specific dependencies by isolating them in dedicated files, such as platform.h.in in ICU4C, enabling compilation on diverse systems.^[34]^[19]^[36]

Key differences between ICU4C and ICU4J reflect their target environments: ICU4C prioritizes high-performance execution in native code for resource-constrained or speed-critical scenarios, while ICU4J leverages the JVM for integration in Java ecosystems, including optional ties to JDK time zones but operating independently since version 2.1. Despite these distinctions, both libraries draw from the same data sources, such as the Common Locale Data Repository (CLDR) files processed during the build, to maintain synchronized globalization capabilities.^[34]^[19]

Data Sources and Dependencies

The International Components for Unicode (ICU) relies primarily on the Common Locale Data Repository (CLDR), maintained by the Unicode Consortium, as its core data source for locale-specific information. CLDR supplies structured data for over 700 locales, encompassing details such as date and time formats, calendars, collation sequences, number and currency patterns, and measurement units across hundreds of languages and regions.^[11]^[37] This integration ensures that ICU can deliver culturally appropriate and linguistically accurate globalization features without requiring developers to maintain custom datasets.

ICU further integrates the Unicode Character Database (UCD), a comprehensive repository of character properties, encoding mappings, and algorithmic data maintained by the Unicode Consortium. The UCD enables ICU to handle full Unicode text processing, including normalization, case mapping, and bidirectional text support. ICU synchronizes with the latest Unicode releases; for instance, version 78 incorporates Unicode 17.0, adding support for new characters, scripts, emoji, and updated collation rules.

For external dependencies, ICU's core functionality operates independently without mandatory third-party libraries, promoting portability across platforms. However, advanced text shaping for complex scripts, such as Arabic, Devanagari, and other Indic scripts, is left to the HarfBuzz open-source shaping engine, following the deprecation of ICU's internal layout engine in version 54 and its eventual removal in later releases.^[30]

Data in ICU is managed through flexible loading mechanisms and build processes to accommodate varying deployment needs. At runtime, locale resources (.res files) and conversion tables (.cnv files) are loaded on demand from directories specified via the u_setDataDirectory() API or the ICU_DATA environment variable, with caching for performance. Build-time incorporation uses ICU4C's bundled command-line tools, such as makeconv for generating .cnv files from source mappings and pkgdata for packaging data into compact .dat archives or static libraries. Updates to CLDR data are incorporated via periodic releases, ensuring ICU remains aligned with evolving locale standards without manual reconfiguration.^[37]^[38]

Key Features

Unicode Text Processing

ICU's Unicode text processing capabilities form the foundation for handling multilingual text in applications, enabling operations such as normalization, collation, regular expression matching, character set conversion, text boundary analysis, transliteration, and string searching while adhering to Unicode standards. These mechanisms ensure consistent and correct manipulation of Unicode strings across diverse scripts and languages.^[39]^[40]^[41]^[42]^[43]^[44]^[45]

Normalization in ICU implements the four standard Unicode normalization forms, NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition), as defined in Unicode Standard Annex #15 and Unicode Standard Chapter 5. These forms decompose, reorder, and (for NFC and NFKC) recompose characters so that equivalent sequences become identical, with NFKC and NFKD additionally applying compatibility decompositions for legacy characters, such as mapping full-width forms to their ASCII equivalents. The Normalizer2 API, introduced in ICU 4.4, provides efficient operations including quick checks for normalization status, fast copy for already-normalized text, and support for custom normalization data like NFKC_Casefold for case folding in normalization. For example, the API can normalize "é" (U+00E9) to its decomposed form (U+0065 U+0301) in NFD. Additionally, ICU supports the FCD ("Fast C or D") and FCC ("Fast C Contiguous") forms for partial normalization, useful in collation and searching to avoid full normalization overhead.^[39]

Collation services in ICU enable locale-sensitive sorting and comparison of Unicode strings through the Collator class (UCollator in the C API), which implements the Unicode Collation Algorithm (UCA) as specified in Unicode Technical Standard #10. It supports tailored sorting for specific locales by integrating collation data from the Common Locale Data Repository (CLDR), including a root collation based on the Default Unicode Collation Element Table (DUCET) and language-specific tailorings, ensuring culturally appropriate ordering such as phonebook order in German (where "ä" sorts as "ae"). Key features include search capabilities via CollationElementIterator for language-sensitive matching, case-insensitive comparisons adjustable through attributes like case level, and generation of sort keys for efficient binary comparisons. For instance, ucol_strcoll or Collator::compare can order strings like "apple" and "äpple" according to locale rules, while ucol_getSortKey produces binary keys for database indexing. Custom rules allow further tailoring, such as "&9 < a, A < b, B" to define non-standard orders.^[40]^[46]

ICU's regular expression engine, accessed via URegularExpression in C (or RegexPattern/RegexMatcher in C++), provides Unicode-aware pattern matching that conforms to Unicode Technical Standard #18 at Level 1 and adds selected Level 2 features, supporting operations like searching, replacing, and splitting on Unicode strings. It handles grapheme clusters through the \X metacharacter, which matches entire user-perceived characters including combining marks as defined in UAX #29, preventing splits within clusters like "é". Unicode properties are supported, allowing patterns such as \p{Script=Latn} to match Latin script characters or \p{Letter} for any letter across scripts, with case-insensitive matching that accounts for Unicode's variable-length case mappings, such as "fußball" matching "FUSSBALL". The engine includes Perl-like syntax with quantifiers (*, +, ?), possessive quantifiers (such as *+ and ++), and word boundaries (\b) adapted for Unicode. For example, the pattern "abc+" can find "abccc" within a larger string, while \p{Script=Latn} selects only Latin text.^[41]

Character set conversion in ICU facilitates transformation between Unicode and legacy encodings using converter APIs, supporting over 200 charsets including UTF-8, UTF-16, ISO-8859-1, and Shift-JIS, with conversion in both directions and handling of fallbacks for unmapped characters. Converters like those for UTF-8 to ISO-2022-JP process streaming data efficiently, using callbacks for invalid sequences and ensuring platform consistency. Charset detection, via the CharsetDetector class, analyzes byte sequences to identify the most likely encoding, such as distinguishing EUC-JP from Shift-JIS based on byte patterns, aiding in legacy data import. These tools are essential for interoperability with non-Unicode systems.^[42]

Text boundary analysis in ICU uses the BreakIterator class to identify logical boundaries in Unicode text, implementing Unicode Standard Annex #29 for grapheme clusters, words, and sentences, and Unicode Standard Annex #14 for line breaks. This enables proper text wrapping, cursor movement, and highlighting in editors and UIs, with locale-specific rules from CLDR and dictionary-based word segmentation for languages like Thai or Japanese. For example, BreakIterator reports a word boundary after "café" while treating "é" as a single grapheme cluster, and a line-break iterator reports the positions where text may wrap. APIs like ubrk_setText allow an iterator to be reused across texts.^[43]

Transliteration services allow conversion of text between different scripts or systems via the Transliterator class, supporting predefined transforms derived from CLDR (e.g., "Any-Latin" for converting text in other scripts, such as Cyrillic, to Latin) and custom rule syntax like "a > b; ä > ae". Useful for romanization in search engines or input methods, it handles two-way transforms and filters, such as converting "Привет" to "Privet" or fullwidth "ＡＢＣ" to halfwidth "ABC". The engine chains multiple rules for complex mappings, ensuring reversible transformations where possible.^[44]

String searching extends collation with the StringSearch class for finding substrings using locale-sensitive matching, ignoring case, accents, or punctuation as configured. It integrates with collators for rules like treating "Straße" as equivalent to "Strasse" in German searches, supporting incremental iteration over matches in large documents. It differs from regular expressions in that it locates collation-equivalent substrings rather than matching patterns.^[45]

For bidirectional text, ICU implements the Unicode Bidirectional Algorithm from Unicode Standard Annex #9, reordering logical strings containing mixed left-to-right (LTR) and right-to-left (RTL) scripts, such as Arabic embedded in English, into visual display order. The UBiDi API in ubidi.h (the Bidi class in ICU4J) processes paragraphs to compute embedding levels and apply character mirroring, supporting RTL languages like Arabic, Hebrew, and Persian spoken by over 600 million people. It provides functions for writing reordered strings and an "inverse" mode for visual-to-logical conversion, though the latter is approximate. Combined with the shaping APIs in ushape.h, this supports proper rendering in user interfaces.^[47]
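The normalization example above (U+00E9 decomposing to U+0065 U+0301) can be reproduced with a few lines of Java. This is a minimal sketch that uses the JDK's java.text.Normalizer as a stand-in so that it runs without ICU4J on the classpath; with ICU4J available, com.ibm.icu.text.Normalizer2.getNFDInstance().normalize(...) would be the analogous call. The class name NormalizationForms and the helper codeUnits are illustrative and not part of either library.

```java
import java.text.Normalizer;

public class NormalizationForms {
    /** Renders each UTF-16 code unit as U+XXXX so combining marks stay visible. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00E9"; // "é" as a single precomposed code point

        // Canonical decomposition, then recomposition back to the original.
        String nfd = Normalizer.normalize(composed, Normalizer.Form.NFD);
        String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);

        // Compatibility composition maps fullwidth letters to ASCII.
        String nfkc = Normalizer.normalize("\uFF21\uFF22\uFF23", Normalizer.Form.NFKC);

        System.out.println(codeUnits(nfd)); // U+0065 U+0301
        System.out.println(codeUnits(nfc)); // U+00E9
        System.out.println(nfkc);           // ABC

        // Quick check: is the text already in the requested form?
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC)); // true
    }
}
```

Comparing strings only after bringing both sides to the same form avoids treating the precomposed and decomposed spellings of "é" as different text.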

Internationalization and Formatting Tools

ICU provides a suite of tools for locale-sensitive formatting of dates, times, numbers, and currencies, enabling applications to display output appropriately for users' cultural and regional preferences. These tools build on Unicode text processing by applying locale-specific rules to generate human-readable representations, such as adjusting decimal separators or date orders based on the target locale. Central to this are classes like Calendar and TimeZone, which handle temporal data across diverse systems, and formatting classes that produce strings based on data from the Common Locale Data Repository (CLDR).^[48]

The Calendar class serves as an abstract base for multiple calendar systems, including the Gregorian, Buddhist, and Japanese calendars, allowing developers to select the appropriate type via locale keywords (e.g., @calendar=buddhist for the Buddhist calendar in a French locale). The GregorianCalendar implements a hybrid of the Julian and Gregorian systems: by default, October 4, 1582 (Julian) is followed by October 15, 1582 (Gregorian), and the changeover can be adjusted using setGregorianChange(). The BuddhistCalendar offsets the Gregorian year by 543, displaying eras like "BE" (Buddhist Era), while the JapaneseCalendar tracks historical eras such as Heisei or Reiwa, ensuring accurate representation in locales like ja_JP@calendar=japanese. Time zone handling integrates the IANA tzdata database, providing offsets from GMT and daylight saving rules through the TimeZone class, which supports IDs like "America/Los_Angeles" and methods for offset calculation and display names (e.g., "PDT").^[49]^[50]

Number and currency formatting is managed primarily by the NumberFormat class and its subclass DecimalFormat, which use pattern strings to control output, such as "#,##0.00" to produce "1,234.56" in en_US locales with grouping separators and two decimal places. This supports various notations, including percentages (e.g., multiplying by 100 and appending "%"), scientific notation (e.g., "1.23E4"), and compact forms like "1.2K" for large numbers. Currency formatting leverages CLDR data for symbols and placement, so NumberFormat.getCurrencyInstance() might yield "$1,234.56" in the US or "1 234,56 €" in France, adapting to locale conventions for decimal points, thousands separators, and symbol positioning.^[51]

Date and time formatting utilizes SimpleDateFormat, which interprets pattern strings like "yyyy-MM-dd" to output "2025-11-13", and DateTimePatternGenerator, which turns skeletons into locale-appropriate patterns (e.g., the skeleton "yMMMd" generates "Nov 13, 2025" in en_US). Relative date formatting is supported through styles like RELATIVE_SHORT, producing words such as "yesterday" for nearby dates and falling back to absolute formats for distant ones, while phrases such as "in 2 hours" are produced by RelativeDateTimeFormatter. These formatters integrate with Calendar instances to respect the chosen calendar and time zone, ensuring outputs like "13/11/2025" in British locales or era-specific dates in Japanese contexts.^[52]

Resource bundles facilitate the storage and retrieval of locale-specific strings and data, loaded via APIs like ResourceBundle in Java or ures_open() in C, allowing access to keys such as error messages or UI labels tailored to locales like "en_US". They employ a fallback mechanism to resolve missing resources, chaining from specific locales (e.g., en_US) to parent ones (en) and ultimately the root bundle, so that a missing translation degrades gracefully instead of causing a failure; warnings like U_USING_FALLBACK_WARNING signal when fallbacks occur. This system supports internationalization by embedding locale data in binary formats derived from CLDR, enabling efficient loading of strings, arrays, and nested resources without hardcoding.^[53]
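The resource-bundle fallback described above (en_US, then en, then root) can be illustrated with a small sketch. It is a deliberate simplification, assuming that the parent of a locale is found by dropping the last underscore-separated subtag; ICU's real lookup also consults explicit parent-locale data and handles keywords such as @calendar, which this code ignores. The class and method names are hypothetical and do not correspond to an ICU API.

```java
import java.util.ArrayList;
import java.util.List;

public class LocaleFallback {
    /**
     * Returns the lookup order for a locale ID: the ID itself, then each
     * shorter prefix obtained by dropping the last subtag, then "root".
     * Simplified illustration only; not ICU's actual algorithm.
     */
    static List<String> fallbackChain(String localeId) {
        List<String> chain = new ArrayList<>();
        String current = localeId;
        while (!current.isEmpty() && !current.equals("root")) {
            chain.add(current);
            int cut = current.lastIndexOf('_');
            current = (cut < 0) ? "" : current.substring(0, cut);
        }
        chain.add("root"); // the root bundle is always the last resort
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("en_US")); // [en_US, en, root]
        System.out.println(fallbackChain("fr"));    // [fr, root]
    }
}
```

A lookup would try each bundle in this order and stop at the first one that defines the requested key, which is the situation that U_USING_FALLBACK_WARNING reports in the C API.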

MessageFormat System

Syntax and Functionality

The ICU MessageFormat employs a pattern-based syntax for constructing dynamic messages, utilizing placeholders enclosed in curly braces {} to insert arguments. These placeholders can be numbered (e.g., {0}) or named (e.g., {userName}), allowing for flexible substitution of values such as strings, numbers, or dates. The core structure supports basic argument replacement, where the pattern string defines the message template, and arguments are provided at runtime for formatting.^[54]

To handle linguistic variations, MessageFormat includes select and plural formatters. The select syntax enables conditional selection based on non-numeric values, such as gender or case, using keywords within the placeholder: {gender, select, male{He is} female{She is} other{They are}}. This selects the sub-message whose keyword matches the argument's value, falling back to other. Similarly, the plural syntax supports locale-specific plural rules for numeric arguments, with categories like one, few, many, and other: {count, plural, one{# item} other{# items}}. This ensures messages adapt to languages with complex plural forms, such as Arabic or Russian.^[54]

Nesting allows for complex compositions, where one formatter can embed another, such as a plural inside a select, to build hierarchical logic without external concatenation. Offsets provide fine-tuned control in plural handling by subtracting a fixed value from the number before the keyword rules are applied and before # is substituted, while explicit selectors such as =0 are still compared against the original value. For instance, with {showCount, plural, offset:1 =0{no new} one{1 new} other{# new}} notifications, a showCount of 5 selects the other branch and renders "4 new notifications". This feature is particularly useful for scenarios involving relative counts, such as updates or comparisons.^[54]

ICU's MessageFormat aligns closely with Java SE's java.text.MessageFormat but extends it with advanced features like named arguments, improved plural and select support, and selectordinal for ordinal numbers (e.g., {rank, selectordinal, one{#st place} two{#nd place} few{#rd place} other{#th place}}). These enhancements address limitations in the standard Java API, such as its reliance on a separate ChoiceFormat for simple conditions, by integrating plural and select directly into the pattern parser.^[55]

Implementation occurs through the MessageFormat class in both ICU4J and ICU4C libraries. In ICU4J, com.ibm.icu.text.MessageFormat provides parsing APIs like parse(String, ParsePosition) to extract arguments from formatted strings and evaluation methods such as format(Object[], StringBuffer, FieldPosition) to generate localized output from patterns and argument arrays or maps. In ICU4C, icu::MessageFormat offers analogous C++ APIs, including format(const Formattable* source, int32_t count, UnicodeString& appendTo, FieldPosition& ignore, UErrorCode& status) for formatting and parse(const UnicodeString& source, int32_t& count, UErrorCode& status) for parsing, with underlying C API support via umsg.h functions like umsg_format. These classes handle pattern validation, locale-aware rule application, and error reporting (through UErrorCode in ICU4C).^[55]^[56]

As of November 2025, a successor specification, MessageFormat 2.0, has been stabilized in Unicode CLDR 47 (released March 2025) and is available in technology preview implementations within ICU: for Java in ICU 77 and later, and for C++ in ICU 78 and later. This new version introduces enhancements such as improved syntax for functions, literals, and better support for complex formatting, aiming to replace the original MessageFormat in future releases.^[57]^[58]
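The offset behaviour of the plural pattern {showCount, plural, offset:1 =0{no new} one{1 new} other{# new}} notifications is easy to misread, so the following sketch spells out the selection steps by hand for English. It does not call ICU; the class PluralOffsetSketch and its render method are hypothetical, the English rule ("one" only for the integer 1) is hard-coded, and locale-specific number formatting of # is omitted.

```java
public class PluralOffsetSketch {
    static final int OFFSET = 1; // corresponds to "offset:1" in the pattern

    /** Hand-written equivalent of the pattern for English integer counts. */
    static String render(int showCount) {
        // Explicit-value selectors such as =0 are compared with the ORIGINAL number.
        if (showCount == 0) {
            return "no new notifications";
        }
        // Keyword selection and '#' both use the offset-adjusted number.
        int adjusted = showCount - OFFSET;
        // English integers: category "one" applies only to 1; everything else is "other".
        if (adjusted == 1) {
            return "1 new notifications";
        }
        return adjusted + " new notifications";
    }

    public static void main(String[] args) {
        System.out.println(render(0)); // no new notifications
        System.out.println(render(2)); // 1 new notifications
        System.out.println(render(5)); // 4 new notifications
    }
}
```

The point of the sketch is the order of operations: the explicit =0 test sees the raw count, and only afterwards is the offset subtracted for both the keyword choice and the displayed number.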

Usage in Applications

The MessageFormat system in ICU provides a straightforward API for integrating dynamic, locale-sensitive text generation into applications, primarily through the MessageFormat class in both Java (ICU4J) and C++ (ICU4C) libraries. In Java, basic usage involves constructing a MessageFormat object with a pattern string and then calling its format method with an array of arguments, as shown in the following example:

java
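import com.ibm.icu.text.MessageFormat;
import java.util.Locale;

public class MessageCount {
    public static void main(String[] args) {
        // The pattern is parsed once, when the MessageFormat is constructed.
        MessageFormat mf = new MessageFormat("You have {0,number,integer} messages.", Locale.ENGLISH);
        System.out.println(mf.format(new Object[]{5}));  // Output: "You have 5 messages."
    }
}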

This compiles the pattern once upon instantiation, allowing repeated formatting calls. In C++, the equivalent involves creating a MessageFormat object and invoking format with an array of Formattable values:

cpp
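#include <unicode/msgfmt.h>
#include <unicode/fmtable.h>   // declares icu::Formattable
#include <iostream>
#include <string>
using namespace icu;

int main() {
    UErrorCode status = U_ZERO_ERROR;  // must be initialized before the first ICU call
    UnicodeString pattern(u"You have {0,number,integer} messages.");
    LocalPointer<MessageFormat> mf(new MessageFormat(pattern, Locale::getEnglish(), status));
    if (U_FAILURE(status)) return 1;   // construction failed, e.g. invalid pattern
    Formattable args[] = {5};
    UnicodeString result;
    FieldPosition ignore;
    mf->format(args, 1, result, ignore, status);  // Output: "You have 5 messages."
    std::string utf8;
    std::cout << result.toUTF8String(utf8) << '\n';
}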

Error handling for invalid syntax, such as malformed placeholders, typically involves catching exceptions like IllegalArgumentException in Java or checking the UErrorCode status in C++ for failures like U_ILLEGAL_ARGUMENT_ERROR. For instance, an invalid pattern like {0,invalid} would trigger an exception during construction, enabling developers to log or recover gracefully.^[59]

Advanced patterns extend this by incorporating locale-specific behaviors, such as plural selection, which adapts output based on language grammar rules derived from Unicode's Common Locale Data Repository (CLDR). For English, which uses only the "one" and "other" categories, a pattern like {quantity, plural, one{# item} other{# items}} produces "1 item" for quantity 1 and "2 items" for quantity 2. In contrast, Arabic requires more categories (zero, one, two, few, many, other), so the same logical pattern might expand to:

{quantity, plural, zero{لا عناصر} one{عنصر واحد} two{عنصران} few{عدد قليل من العناصر} many{عدد كبير من العناصر} other{عناصر}}

when resolved for locale ar, handling cases like 0 (zero) or 3-10 (few) appropriately. To apply locales, specify them during MessageFormat instantiation, e.g., new MessageFormat(pattern, new ULocale("ar")) in Java or MessageFormat(pattern, Locale::getArabic(), status) in C++, ensuring the plural rules are loaded from ICU's data. Escaping literals, such as curly braces or apostrophes in text, uses single quotes: {0} isn't '{1}' renders as "5 isn't 'done'" without interpreting the inner braces, preventing syntax errors.^[54]^[60] Best practices emphasize performance and reliability: pre-compile patterns by reusing MessageFormat instances rather than recreating them per call, as compilation parses the syntax once and caches formatters for arguments like numbers or dates, reducing overhead in high-volume applications. For cross-locale testing, leverage ICU's built-in test suites in the intltest module, which validate plural rules and formatting across hundreds of locales via scripts like runtest.pl or Java's TestFmwk, helping identify issues like incomplete plural coverage before deployment.^[54] In e-commerce applications, MessageFormat enables dynamic product descriptions, such as {price, number, currency} for {quantity, plural, one{item} other{items}}, which formats to "$19.99 for 1 item" in English or adapts currency and plurals for locales like Japanese (¥2,000 for 1商品). Mobile apps benefit from it for UI text, like notifications: {numNotifications, plural, one{You have a new message} other{You have {numNotifications} new messages}}, ensuring concise, localized strings that update in real-time without hardcoding variants.^[59]
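The category selection described above can be sketched without ICU on the classpath. The class below is a minimal, hand-written illustration and not ICU's implementation: the names PluralSketch, englishCategory, arabicCategory, and select are hypothetical, the rules cover only non-negative integers, and the # substitution uses plain digits rather than locale-aware number formatting. In real code, ICU's PluralRules and MessageFormat classes load these rules from CLDR data.

```java
import java.util.Map;

/** Hand-written sketch of plural-category selection; hypothetical, not ICU code. */
final class PluralSketch {

    /** Cardinal plural category for a non-negative integer in English. */
    static String englishCategory(long n) {
        return n == 1 ? "one" : "other";
    }

    /** Cardinal plural category for a non-negative integer in Arabic. */
    static String arabicCategory(long n) {
        long mod100 = n % 100;
        if (n == 0) return "zero";
        if (n == 1) return "one";
        if (n == 2) return "two";
        if (mod100 >= 3 && mod100 <= 10) return "few";
        if (mod100 >= 11) return "many";   // remainders 11..99
        return "other";                    // e.g. 100, 101, 102
    }

    /**
     * Picks the sub-message for a category, falling back to "other" when the
     * category has no entry, and replaces '#' with the plain decimal number.
     */
    static String select(Map<String, String> messages, String category, long n) {
        String template = messages.getOrDefault(category, messages.get("other"));
        return template.replace("#", Long.toString(n));
    }

    public static void main(String[] args) {
        Map<String, String> english = Map.of("one", "# item", "other", "# items");
        for (long n : new long[] {0, 1, 2, 5, 11, 100}) {
            String en = englishCategory(n);
            System.out.println(n + " -> en:" + en + " (" + select(english, en, n) + ")"
                    + ", ar:" + arabicCategory(n));
        }
    }
}
```

The fallback to "other" in select mirrors the CLDR convention that a missing category message resolves to the "other" branch, which is why every plural pattern must define it.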

Adoption and Alternatives

Integration in Software and Systems

The International Components for Unicode (ICU) library is integrated into major operating systems to provide robust Unicode and globalization support. Since Windows 10 version 1703 (Creators Update), Microsoft has bundled core ICU components as system DLLs, enabling native access for applications without requiring separate installations.^[2] Android, the world's most widely used mobile operating system, leverages ICU for Unicode text processing and internationalization features across its platform.^[20] On macOS and iOS, ICU support is partial, with developers often building static libraries for app-specific use, as the OS primarily relies on Core Foundation for locale handling.^[61] In Linux distributions, ICU is commonly packaged and included, such as in Ubuntu, Arch Linux, and Alpine Linux, facilitating Unicode compliance in open-source environments.^[62]^[63]

ICU powers internationalization in key software ecosystems, including web browsers, databases, and application servers. Google Chrome, built on the Chromium engine, depends on ICU for text rendering, collation, and locale-sensitive operations. PostgreSQL has supported ICU collations since version 10, with full database-level integration available from version 15, enhancing sorting and Unicode handling for global data.^[64] MySQL incorporates ICU for advanced regular expression functionality starting with version 8.0, improving Unicode-aware pattern matching.^[65] For Java-based servers like Apache Tomcat, ICU4J can be integrated into applications to manage locale formatting and message handling, though it is not a core Tomcat component.^[19]

Adoption metrics underscore ICU's broad reach and community involvement. The project reaches over 1 billion devices through its inclusion in Android and Windows ecosystems.^[20] Its GitHub repository has garnered more than 2,000 stars, reflecting developer interest.^[5] Contributions primarily originate from IBM, with significant input from Google and Apple, ensuring ongoing enhancements for cross-platform compatibility.^[66]

Case studies highlight ICU's role in achieving Unicode compliance for global applications. In enterprise software, ICU enables consistent handling of multilingual data, reducing errors in text processing across diverse locales. A prominent example is Salesforce's 2025 migration to ICU locale formats during the Spring '25 release, standardizing date, number, and currency formatting for improved accuracy and integration with international partners.^[67] This update, enforced across all orgs, addresses legacy JDK limitations and supports Unicode best practices in cloud-based CRM systems serving millions of users worldwide.^[68]

Comparable Libraries

Boost.Locale serves as a C++ library that primarily acts as a wrapper around the International Components for Unicode (ICU), providing a more idiomatic modern C++ interface while adding features such as resource acquisition is initialization (RAII) management and iterator support for enhanced usability in contemporary C++ development.^[69] Although it depends on ICU for core Unicode and localization functionality, Boost.Locale also offers limited non-ICU backends using operating system APIs or standard C++ library components, making it suitable for scenarios where full ICU integration is undesirable but basic localization is required.^[70] This approach allows developers to leverage ICU's robustness through a streamlined API, though it inherits ICU's dependencies and may not fully eliminate them in practice.^[71]

The GNU libintl library, part of the GNU gettext system, focuses on basic message translation and catalog management for internationalization, enabling software to support multiple languages through locale-specific string substitutions.^[72] It excels in simple translation workflows, such as handling plural forms and word order variations in messages, and is widely integrated into Linux environments via the GNU C Library (glibc) for lightweight i18n needs. However, libintl lacks comprehensive Unicode support for advanced tasks like collation, normalization, or complex text processing, limiting its scope to translation without full globalization capabilities.^[73]

In Java environments, the built-in java.text package from Java SE provides foundational internationalization tools, including classes for date, number, and message formatting tailored to specific locales. These native utilities support over 100 locales and handle basic cultural adaptations, such as currency symbols and date patterns, making them adequate for straightforward applications without external dependencies.^[74] Compared to ICU4J, however, java.text offers less depth in handling emerging Unicode standards, complex locale variants, or exhaustive collation rules, often requiring supplementation for globally diverse or high-precision needs.

Other notable alternatives include the .NET System.Globalization namespace, which delivers cross-platform support for culture-specific formatting, calendars, and sorting in .NET applications, though certain advanced features like invariant culture handling remain optimized for Windows ecosystems.^[75] For web development, Mozilla's implementation of the JavaScript Intl API provides a subset of ICU-derived functionality, enabling browser-based number, date, and string formatting with locale sensitivity, but it omits deeper ICU features like full text boundary analysis or transliteration.^[76] These options cater to platform-specific or lightweight use cases, contrasting ICU's broader, standalone Unicode ecosystem.^[77]
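As a rough illustration of the java.text baseline mentioned above, the sketch below uses only JDK classes. The class name JdkMessageFormatSketch and its helper methods are hypothetical. It reuses one compiled pattern, shows apostrophe quoting, and checks that the JDK parser rejects both an unknown format type and the plural argument type that ICU4J's MessageFormat accepts.

```java
import java.text.MessageFormat;
import java.util.Locale;

/** Hypothetical helper class showing java.text.MessageFormat from the JDK, not ICU4J. */
final class JdkMessageFormatSketch {

    // Parsed once and reused. java.text.MessageFormat is not synchronized,
    // so access to the shared instance is guarded by the synchronized method below.
    private static final MessageFormat INBOX =
            new MessageFormat("You have {0,number,integer} messages.", Locale.ENGLISH);

    /** Formats the inbox message with the shared, pre-compiled pattern. */
    static synchronized String inbox(int count) {
        return INBOX.format(new Object[] {count});
    }

    /** A doubled apostrophe yields one literal apostrophe; '{1}' is emitted verbatim. */
    static String quoted(Object value) {
        MessageFormat mf = new MessageFormat("{0} isn''t '{1}'", Locale.ENGLISH);
        return mf.format(new Object[] {value});
    }

    /** Returns false when the JDK parser rejects the pattern. */
    static boolean compiles(String pattern) {
        try {
            new MessageFormat(pattern, Locale.ENGLISH);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inbox(5));                 // You have 5 messages.
        System.out.println(quoted(5));                // 5 isn't {1}
        System.out.println(compiles("{0,invalid}"));  // false
        // The plural argument type is an ICU extension that java.text does not recognize.
        System.out.println(compiles("{0,plural,one{# item} other{# items}}"));  // false
    }
}
```

Applications that need plural or select arguments would therefore switch the import to ICU4J's com.ibm.icu.text.MessageFormat, which offers the same constructor and format call used here.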

References

[1]
ICU - International Components for Unicode
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications.
[2]
International Components for Unicode (ICU) - Win32 apps
Jun 1, 2021 · ICU is a set of open-source globalization APIs using Unicode's CLDR, providing code conversion, collation, formatting, and time calculations.
[3]
ICU Documentation
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. The ICU User Guide ...
[4]
ICU - Former Project Management Committee
Until 2016-May-18, ICU was a project under IBM stewardship. The Project Management Committee (PMC) was formed in October 1999 and was responsible for the ...
[5]
unicode-org/icu: The home of the ICU project source code. - GitHub
This is the repository for the International Components for Unicode. The ICU project is under the stewardship of The Unicode Consortium.
[6]
UTS #10: Unicode Collation Algorithm
This report is the specification of the Unicode Collation Algorithm (UCA), which details how to compare two Unicode strings while remaining conformant to the ...
[7]
Unicode Locale Data Markup Language (LDML) Part 4: Dates
Part 3: Numbers (number & currency formatting); Part 4: Dates (date, time, time zone formatting); Part 5: Collation (sorting, searching, grouping); Part 6 ...
[8]
Unicode Locale Data Markup Language (LDML) Part 3: Numbers
Part 3: Numbers (number & currency formatting); Part 4: Dates (date, time, time zone formatting) ... The syntax is carried over from the ICU based RBNF rules.
[9]
https://unicode.org/reports/tr35/tr35-messageFormat.html
[10]
International Components for Unicode (ICU) - IBM
The International Components for Unicode (ICU) is a set of C/C++ and Java libraries for Unicode support and software internationalization.
[11]
Unicode CLDR Project
News. 2025-10-29 CLDR 48 released; 2025-03-13 CLDR 47 released. What ... If your locale is not already available in the Survey Tool, see Adding new locales.
[12]
International Components for Unicode (ICU) Data - LocalePlanet
ULocale List ; af_NA, Afrikaans (Namibia), Afrikaans (Namibië) ; af_ZA, Afrikaans (South Africa), Afrikaans (Suid-Afrika) ; agq, Aghem, Aghem ; agq_CM, Aghem ( ...
[13]
ICU joins the Unicode Consortium
May 18, 2016 · The ICU (International Components for Unicode) project has long provided software that implements the Unicode data and algorithms. ICU is a ...
[14]
https://unicode-org.github.io/icu-docs/legal/copyright.html
[15]
ICU Copyrights - The Unicode Consortium
Home of ICU, Internationalization, International component for Unicode. ... These are files that originally come from the Unicode Consortium, and as of Unicode ...
[16]
Downloading ICU | ICU Documentation
2024-04-17: ICU 75 updates to CLDR 45 (beta blog) locale data with new locales and various additions and corrections. C++ code now requires C++17 and is being ...
[17]
icu - vcpkg package
Jun 24, 2025 · Mature and widely used Unicode and localization library. Dependencies; Features; Versions; Port Content. Dependencies. icu.
[18]
Building ICU4C | ICU Documentation
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. The ICU User Guide ...
[19]
ICU 75 - ICU - International Components for Unicode
ICU4C requires C++17 and has been tested with up to C++20. We routinely test on recent versions of Linux, macOS, and Windows. We accept patches for other ...
[20]
ICU4J
Summary of ICU4J, Integration with Maven/Gradle, and Build Process
[21]
Unicode and internationalization support | App architecture
Android leverages the ICU library and CLDR project to provide Unicode and other internationalization support.
[22]
A brief history of IBM and Sun's internationalization efforts
Thus, a partnership was born: IBM arranged for Taligent's Text and International group to contribute international classes to Sun's Java Development Kit ...
[23]
[PDF] ICU User Guide - IBM
Jul 10, 1996 · Page 1. ICU User Guide. International Components For Unicode. Version 3.4. 1. ICU ... ICU was originally developed by the Taligent company. The ...
[24]
Source Code Access - ICU - International Components for Unicode
You can view ICU source code online: https://github.com/unicode-org/icu. Make sure you have git lfs installed. See the following section.
[25]
UTF-8 and Unicode FAQ for Unix/Linux
Jun 4, 1999 · The International Components for Unicode (ICU) (formerly IBM Classes for Unicode) have become what is probably the most powerful cross ...
[26]
ICU Architectural Design
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. The ICU User Guide ...
[27]
Releases · unicode-org/icu - GitHub
We are pleased to announce the release of Unicode® ICU 78. It updates to Unicode 17 (blog), including new characters and scripts, emoji, collation & IDNA ...
[28]
ICU 78 Released - The Unicode Blog
Oct 30, 2025 · Thursday, October 30, 2025 ICU 78 updates to Unicode 17 (blog), including new characters and scripts, emoji, collation & IDNA changes, and ...
[29]
International Components for Unicode - ICU 4.0 Archive
ICU4C Download. Release Date. 2009-01-15 (version 4.0.1). Source Code Download.
[30]
Layout Engine | ICU Documentation
The ICU Line LayoutEngine has been removed in ICU 58. It had not had active development for some time, had many open bugs, and had been deprecated in ICU 54 ...
[31]
ICU 73.2 & CLDR 43.1 released: GB18030 compliance updates ...
Jun 15, 2023 · There are significant changes for GB18030-2022 compliance support: CLDR extends the support for “short” Chinese sort orders to cover some ...
[32]
International Components for Unicode - ICU 74
ICU 74 is a major release updating to Unicode 15.1 and CLDR 44, including new characters, emoji, and locale data. The initial release is 74.1.
[33]
ICU4C | ICU Documentation
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. The ICU User Guide ...
[34]
ICU 78.1: common/unicode/utypes.h File Reference
This file defines basic types, constants, and enumerations directly or indirectly by including other header files, especially utf.h for the basic character and ...
[35]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utypes_8h.html
[36]
ICU Data | ICU Documentation
ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. The ICU User Guide ...
[37]
https://unicode-org.github.io/icu/userguide/icu_data/
[38]
Normalization | ICU Documentation
The ICU normalization APIs support the standard normalization forms which are described in great detail in Unicode Technical Report #15 (Unicode Normalization ...
[39]
Collation | ICU Documentation
In other words, ICU implements the CLDR Collation Algorithm which is an extension of the Unicode Collation Algorithm (UCA) which is an extension of ISO 14651.
[40]
Regular Expressions | ICU Documentation
ICU's Regular Expressions package provides applications with the ability to apply regular expression matching to Unicode string data.
[41]
Conversion | ICU Documentation
A converter is used to convert from one character encoding to another. In the case of ICU, the conversion is always between Unicode and another encoding, or ...
[42]
API Details | ICU Documentation
To use the Collation Service, you must instantiate a Collator. The Collator defines the properties and behavior of the sort ordering.
[43]
BiDi Algorithm | ICU Documentation
ICU provides an implementation of the Unicode BiDi algorithm, as well as simple functions to write a reordered version of the string using the generated meta- ...
[44]
Formatting | ICU Documentation
By invoking the methods provided by the NumberFormat class, you can format numbers, currencies, and percentages according to the specified or default locale.
[45]
Calendar Services | ICU Documentation
ICU has two main calendar classes used for parsing and formatting Calendar information correctly: Calendar An abstract base class that defines the calendar API.
[46]
TimeZone Classes | ICU Documentation
ICU supports time zones through two classes: Timezone classes are related to UDate, the Calendar classes, and the DateFormat classes.
[47]
Formatting Numbers | ICU Documentation
NumberFormatter supports the formatting of: Decimal Formatting; Currencies; Measurement Units; Percentages; Scientific Notation; Compact Notation. For number ...
[48]
Formatting Dates and Times | ICU Documentation
The DateFormat interface in ICU enables you to format a Date in milliseconds into a string representation of the date. It also parses the string back to the ...
[49]
Resources
Summary of ICU Resource Bundles: Loading Locale-Specific Strings and Fallback Chains
[50]
Formatting Messages | ICU Documentation
The ICU MessageFormat class uses message "pattern" strings with variable-element placeholders (called “arguments” in the API docs) enclosed in {curly braces}.
[51]
MessageFormat (ICU4J 78)
Summary of MessageFormat Class in ICU4J
[52]
ICU 78.1: icu::MessageFormat Class Reference
MessageFormat prepares strings for display to users, with optional arguments (variables/placeholders). The arguments can occur in any order.
[53]
Message Formatting Examples | ICU Documentation
MessageFormat Class. ICU's MessageFormat class can be used to format messages in a locale-independent manner to localize the user interface (UI) strings.
[54]
Plural Rules - Unicode CLDR Project
The way plurals are defined in CLDR, when a message (eg for 'two') is missing, it always falls back to 'other'. So the translation is no worse than before.
[55]
apotocki/icu4c-iosx: This project builds ICU static libraries ... - GitHub
This repo provides a universal script for building static ICU libraries for use in iOS, watchOS, tvOS, visionOS, and macOS applications.
[56]
icu-78.1 - Linux From Scratch!
The International Components for Unicode (ICU) package is a mature, widely used set of C/C++ libraries providing Unicode and Globalization support for software ...
[57]
icu 78.1-1 (x86_64) - Arch Linux
Architecture: x86_64. Repository: Core. Description: International Components for Unicode library. Upstream URL: https://icu.unicode.org.
[58]
Documentation: 18: 23.2. Collation Support - PostgreSQL
Collations provided by ICU are created in the SQL environment with names in BCP 47 language tag format, with a “private use” extension -x-icu appended, to ...
[59]
New Regular Expression Functions in MySQL 8.0
Apr 9, 2018 · In MySQL 8.0 we introduce the ICU library to handle our regular expression support. This library is maintained by the Unicode Consortium and ...
[60]
ICU Code Contributions
ICU Code Contributions. The overwhelming majority (≅99.7%) of all code has been contributed by IBM employees, or by people under contract to IBM.
[61]
JDK Locale Format Retirement and the Enable ICU Locale Formats ...
JDK Locale Format Retirement and the Enable ICU Locale Formats Salesforce Release Update. Publish Date: Oct 7, 2025. Description. Updated September 12, 2025.
[62]
Salesforce Locale Update from JDK to ICU - Marketing Nation
Feb 8, 2025 · Salesforce is deprecating the JDK Locale Formats and forcing a migration to the ICU Locale Formats with the Spring '25 Salesforce update.
[63]
Boost.Locale: Design Rationale
Thus Boost.Locale wraps ICU with a modern C++ interface, allowing future reimplementation of parts with better alternatives, but bringing localization support ...
[64]
Boost.Locale
Boost.Locale provides non-ICU based localization support as well. It is based on the operating system native API or on the standard C++ library support.
[65]
Boost.Locale: Using Localization Backends
By default, Boost.Locale uses ICU for all localization and text manipulation tasks. This is the most powerful library available, but sometimes we don't need ...
[66]
Internationalization (GNU Coding Standards) - GNU.org
5.8 Internationalization ¶. GNU has a library called GNU gettext that makes it easy to translate the messages in a program into various languages.
[67]
Introduction to Internationalization Programming | Linux Journal
Nov 1, 2002 · GNU gettext can manage translating problems like word order, plural forms and ambiguities, but you have to use extra functions that hold ...Missing: features limitations
[68]
Java Internationalization - Oracle
In the Java SE Platform, internationalization support is fully integrated into the classes and packages that provide language- or culture-dependent ...Missing: capabilities | Show results with:capabilities
[69]
System.Globalization Namespace | Microsoft Learn
The information includes the names for the culture, the writing system, the calendar used, the sort order of strings, and formatting for dates and numbers.
[70]
Intl - JavaScript - MDN Web Docs - Mozilla
Sep 24, 2025 · The Intl namespace object contains several constructors as well as functionality common to the internationalization constructors and other ...
[71]
Introducing the JavaScript Internationalization API - Mozilla Hacks
Dec 11, 2014 · Under the hood, Firefox's implementation depends upon the International Components for Unicode library ( ICU ), which in turn depends upon the ...