
Hutter Prize

The Hutter Prize is an ongoing competition funded by researcher Marcus Hutter, offering a total prize pool of €500,000 to encourage the development of advanced compression algorithms capable of reducing a 1 GB excerpt of English Wikipedia XML text, known as enwik9, to the smallest possible size under specified computational constraints. Established in 2006, the prize motivates progress toward artificial general intelligence (AGI) by leveraging the principle that superior data compression correlates with higher levels of machine intelligence, as effective compression requires understanding and predicting patterns in human knowledge. Key rules include compressing enwik9 losslessly into a compressed archive, with limits of under 100 hours of runtime on a single CPU core, no more than 10 GB of RAM, and no use of GPUs; submissions must include the compressor executable, decompressor, and full source code, with the total size of the compressed file plus the decompressor determining eligibility. Prizes are awarded proportionally to verified improvements over the prior record, requiring at least a 1% gain to claim a minimum of €5,000, with €29,945 distributed to date across eight winners as of November 2025. Notable advancements have come from context-mixing algorithms like PAQ derivatives and more recent neural network-based approaches, such as the current record of 110,793,128 bytes achieved by Kaido Orav and Byron Knoll in September 2024 using fx2-cmix, representing a compressed size of approximately 11.1% of the original file.

Background and Motivation

Origins and Establishment

The Hutter Prize for Lossless Compression of Human Knowledge was established in 2006 by Marcus Hutter, a prominent researcher in artificial general intelligence (AGI) at the Swiss AI Lab IDSIA and the Australian National University. Hutter personally funded the initial prize pool with €50,000 to incentivize advancements in data compression as a proxy for measuring progress toward AGI. The prize aimed to reward improvements in compressing a specific text corpus representing human knowledge, drawing on Hutter's theoretical work linking compression efficiency to intelligent information processing. The prize was publicly announced in July 2006, shortly after the AGI-06 conference, with the initial challenge focused on the 100 MB enwik8 dataset, a snapshot of English Wikipedia articles. The baseline performance was set at 18,324,887 bytes using Matt Mahoney's paq8f compressor, dated March 24, 2006, providing a verifiable starting point for submissions. An official website, prize.hutter1.net, was launched concurrently to host rules, submissions, and verification processes, ensuring transparency and reproducibility. Administration of the prize has evolved to maintain relevance and fairness, with periodic updates to submission guidelines. On February 21, 2020, Hutter increased the total prize pool to €500,000 and shifted the dataset to the larger 1 GB enwik9 file to scale the challenge, reflecting sustained commitment and growing interest in compression-based benchmarks. The first payout occurred on September 25, 2006, awarded to Alexander Rhatushnyak for his paq8hp5 compressor, which achieved a 6.88% improvement over the baseline (compressing enwik8 to 17,073,018 bytes) and earned €3,416.

Theoretical Foundations

The theoretical foundations of the Hutter Prize are rooted in algorithmic information theory, particularly the concept of Kolmogorov complexity, which provides a measure of the intrinsic complexity of an object by identifying the shortest possible program that can generate it. Formally, the Kolmogorov complexity K(x) of a string x is defined as the length of the shortest program p that outputs x on a universal Turing machine U: K(x) = \min \{ |p| : U(p) = x \}, where |p| denotes the length of program p. This uncomputable measure serves as an ideal benchmark for data compression, as practical lossless compression algorithms approximate K(x) by finding concise representations that capture the data's regularities and redundancies. The prize draws motivation from the idea that intelligence can be formalized as the ability to minimize the description length of observed data, aligning with principles of Occam's razor in universal induction. In this framework, an intelligent agent excels at identifying underlying patterns in complex environments, effectively compressing information to predict and act optimally. Marcus Hutter's universal AI model, AIXI, embodies this by positing a reinforcement learning agent that maximizes expected rewards through a Bayesian mixture over all possible computable environments, weighted by their Kolmogorov complexity. AIXI's decision-making process involves selecting actions that minimize the total expected description length of observations and rewards, providing a theoretical unification of machine learning, prediction, and sequential decision theory. This connection to universal induction underpins the Hutter Prize's use of lossless compression on a Wikipedia text corpus as a proxy for general intelligence. By challenging participants to compress a large, diverse corpus like enwik9, a 1 GB snapshot of human knowledge, the prize tests algorithms' ability to discover semantic and structural patterns across natural language, markup, and encyclopedic content, mirroring the broad world modeling required for artificial general intelligence (AGI). Advances in such compression directly contribute to scalable approximations of AIXI-like systems, as shorter programs imply deeper understanding of the data's generative processes.
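Because K(x) is uncomputable, any practical lossless compressor provides only an upper bound on it. Below is a minimal Python sketch of this idea, using the standard-library lzma module purely as a stand-in for a stronger compressor: the length of the compressed output bounds the input's complexity from above, and highly regular data yields a far smaller bound than patternless data.

```python
import lzma
import os

def complexity_upper_bound(data: bytes) -> int:
    """Length in bytes of an LZMA-compressed representation of `data`.

    Kolmogorov complexity K(x) is uncomputable, but any lossless
    compressor yields a computable upper bound on it: the size of the
    compressed data (plus the constant size of the decompressor).
    Better compressors give tighter bounds, which is the sense in which
    the Hutter Prize treats compression as a proxy for modeling ability.
    """
    return len(lzma.compress(data, preset=9))

if __name__ == "__main__":
    structured = b"abab" * 250_000       # 1 MB of highly regular text
    random_ish = os.urandom(1_000_000)   # 1 MB with no exploitable pattern

    print(complexity_upper_bound(structured))   # small: the pattern is discovered
    print(complexity_upper_bound(random_ish))   # close to 1,000,000: nearly incompressible
```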

Prize Mechanics

Goals and Objectives

The Hutter Prize aims primarily to incentivize the development of advanced compression algorithms that demonstrate a deep understanding of human knowledge, serving as a practical stepping stone toward artificial general intelligence (AGI). By challenging participants to achieve superior lossless compression of representative human-generated text, the prize posits that effective compression requires not mere data reduction but an intelligent grasp of patterns, semantics, and predictive structures inherent in knowledge, thereby advancing AI capabilities that mimic human cognition. This objective draws from the theoretical insight that optimal compression approximates Kolmogorov complexity, where the shortest program describing data reflects true understanding. Secondary objectives include establishing an objective, quantifiable benchmark for measuring progress in AI-driven text modeling and prediction, while promoting open-source contributions to foster collaborative research in data compression. The prize provides a standardized metric, relative improvement in compressed size, allowing direct comparison of algorithmic advancements without subjective evaluation, thus enabling the community to track incremental gains in machine understanding over time. Additionally, by requiring winners to release their source code, it encourages reusable tools and methodologies that extend beyond the contest to broader applications in AI research. The total prize pool stands at 500,000€, with awards allocated proportionally to the degree of improvement achieved over the current record; for instance, a 1% relative gain typically yields approximately 5,000€ from the remaining pool, ensuring that even modest breakthroughs are rewarded while reserving larger sums for transformative progress. This structure motivates sustained effort across varying levels of achievement, with the minimum payout threshold set to make participation viable for independent researchers and teams alike. To emphasize its long-term impact, the Hutter Prize operates indefinitely, remaining open until the full pool is distributed or compression approaches theoretical limits dictated by the entropy of the data, thereby committing to ongoing stimulation of AGI-related research without arbitrary deadlines. This enduring framework underscores the belief that persistent, incremental advancements in compression will cumulatively contribute to solving fundamental challenges in artificial intelligence.

Rules and Submission Process

The Hutter Prize is open to individuals or teams worldwide, with no entry fees required for participation. Submissions must be sent via email to the prize organizers. Participants are required to provide a numbered list detailing the submission, including direct download links to the compressed file, source code, and decompressor executables, all of which must be publicly released under an OSI-approved open license to ensure transparency and reproducibility. Valid submissions focus on lossless compression of the enwik9 dataset, a 1,000,000,000-byte file derived from English Wikipedia. Each entry must include the source code for the compressor, the compressed archive (e.g., archive9.bhm), and a decompressor program (e.g., decomp9.exe) capable of exactly reconstructing the original file. Additional details in the submission encompass execution instructions (limited to a single line for simplicity), program versions and options used, file sizes, approximate compression and decompression times, maximum memory usage (both RAM and disk), and hardware specifications of the test machine, such as processor type, operating system, and benchmark scores like Geekbench5. Documentation explaining the algorithmic ideas, compilation instructions, and any optional further information must also be provided or linked. Following submission, a mandatory waiting period of at least 30 days allows for public comments and independent verification by the organizers. During this phase, the decompressor is tested on the organizers' machines to confirm exact reconstruction of the original enwik9 file, with practical constraints applied for feasibility: decompression must complete in under 50 hours on a 2.7 GHz i7 processor and use no more than 10 GB of RAM. While there are no formal limits on the computational resources used for the compression phase itself, the verification process imposes these practical bounds to ensure the submission can be reliably evaluated. Awards are granted only for verified submissions achieving a minimum 1% relative improvement over the current record size L, defined as \frac{L - S}{L} > 0.01, where S is the new compressed size in bytes. The prize amount scales with the improvement, calculated as 500{,}000€ \times (1 - S/L), with the smallest claimable portion being 5,000€ for exactly 1% improvement; after an award, the record size L is updated, and the remaining prize pool is adjusted accordingly.
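As a concrete illustration of the award formula stated above, the short Python sketch below applies it to hypothetical round numbers; the function name and example sizes are invented for illustration and are not part of the official rules.

```python
def hutter_award(prior_record_bytes: int, new_size_bytes: int,
                 pool_euro: int = 500_000) -> float:
    """Award implied by the published formula 500,000 euros * (1 - S/L),
    payable only when the relative improvement (L - S)/L exceeds 1%."""
    L, S = prior_record_bytes, new_size_bytes
    improvement = (L - S) / L
    if improvement <= 0.01:
        return 0.0                       # below the minimum claimable threshold
    return pool_euro * (1 - S / L)       # equivalent to pool_euro * improvement

# Hypothetical round numbers: a 1.2% improvement over a 114,000,000-byte
# record would pay roughly 6,000 euros under this formula.
print(hutter_award(114_000_000, 112_632_000))   # prints approximately 6000.0
```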

Technical Specifications

The Dataset

The enwik9 dataset serves as the fixed compression target for the Hutter Prize, comprising exactly 1,000,000,000 bytes of text data derived from an early snapshot of the English Wikipedia. It represents a benchmark corpus designed to test algorithms' ability to capture and encode human knowledge through lossless compression. Sourced from the official Wikipedia dump file enwiki-20060303-pages-articles.xml.bz2, released on March 3, 2006, enwik9 captures the first 1 GB of the decompressed XML content from this approximately 1.1 GB bz2-compressed archive (4.8 GB when decompressed). This dump includes only textual elements of Wikipedia articles, excluding binary media like images, which are stored separately. The dataset is publicly available for download in compressed form as enwik9.zip (approximately 300 MB) from Matt Mahoney's data compression resources site. Compositionally, enwik9 consists primarily of English-language article content, encompassing titles, main body text, abstracts, references, and navigational elements such as internal hyperlinks and tables. It contains 243,426 article entries in total, including 85,560 redirect pages prefixed with #REDIRECT to handle linking to other articles. The material embodies a diverse cross-section of human knowledge as documented on Wikipedia at the time, spanning topics from science and history to culture and biographies, with high editorial quality and few grammatical or spelling errors. Roughly 75% of the bytes form clean natural language text, while the remaining 25% includes structured "artificial" data like wiki markup (e.g., [[links]] and {{templates}}), XML tags for metadata (e.g., revision timestamps and authors), and formatted elements such as lists and infoboxes. Some non-English text, numbers, and symbols appear naturally within articles. Preparation of enwik9 focused on creating a consistent, accessible benchmark file without altering its representational fidelity to Wikipedia's structure. The original bz2-compressed dump was decompressed, and the initial 1 GB was extracted verbatim, with the text standardized to UTF-8 encoding to handle characters from U+0000 to U+10FFFF (1-4 bytes per character). Byte-level cleaning removed invalid sequences, including bytes 0xC0, 0xC1, and 0xF5-0xFF, while restricting control characters to tab (0x09) and linefeed (0x0A). No markup was stripped, no redirects were eliminated, and no content was filtered beyond these encoding fixes, ensuring the dataset retains the full complexity of real-world encyclopedic data. This approach preserves the challenges of compressing mixed natural and artificial text, aligning with the prize's goal of advancing intelligence-like compression.
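A rough Python sketch of the byte-level cleanup described above is given below. It assumes a simple per-byte filter and is a reconstruction for illustration only, not the script actually used to prepare enwik9.

```python
def clean_bytes(raw: bytes) -> bytes:
    """Illustrative per-byte filter mirroring the cleanup described above:
    drop byte values that can never appear in valid UTF-8 (0xC0, 0xC1,
    0xF5-0xFF) and drop control characters other than tab (0x09) and
    linefeed (0x0A). Not the script actually used to build enwik9."""
    out = bytearray()
    for b in raw:                          # iterating over bytes yields ints
        if b in (0xC0, 0xC1) or b >= 0xF5:
            continue                       # invalid UTF-8 lead bytes
        if b < 0x20 and b not in (0x09, 0x0A):
            continue                       # disallowed control characters
        out.append(b)
    return bytes(out)

# Example: stray control bytes and invalid lead bytes are removed.
print(clean_bytes(b"Hello\x00world\xc0\t!\n"))   # b'Helloworld\t!\n'
```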

Compression Requirements and Metrics

The Hutter Prize requires all submissions to employ lossless compression, ensuring that the original 1 GB enwik9 file can be perfectly reconstructed from the compressed output without any loss of information. This stipulation preserves the integrity of the human knowledge encoded in the corpus and aligns with the prize's emphasis on intelligent, precise modeling of data. The primary evaluation metric is the total size of the compressed output in bytes, where the goal is to minimize this value while adhering to the lossless constraint; smaller sizes directly correspond to superior compression performance. A secondary metric, often used for benchmarking and comparison, is bits per character (bpc), defined as: \text{bpc} = \frac{\text{compressed size in bits}}{1{,}000{,}000{,}000}. Since the compressed size is measured in bytes, this simplifies to \text{bpc} = (\text{compressed bytes} \times 8) / 1{,}000{,}000{,}000. For context, the initial baseline achieved by paq8f was 115,788,014 bytes, corresponding to approximately 0.93 bpc. Prize awards are determined by relative improvements over the current record size, calculated using the formula: \text{Improvement} = 1 - \frac{\text{new size}}{\text{old size}}. Submissions demonstrating at least a 1% improvement (i.e., improvement \geq 0.01) qualify for a proportional share of the prize fund, with 5,000 euros allocated per percentage point gained. This mechanism incentivizes incremental advances in compression efficiency. Theoretically, the ultimate limit of compression approaches the entropy rate of English text, estimated by Shannon at 0.6 to 1.3 bpc, implying a potential minimum size of around 75 MB for enwik9; however, practical constraints such as computational resources and time limits prevent reaching this bound, with current achievements hovering just under 0.9 bpc due to feasible algorithmic complexity.
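As a worked check of these formulas, the Python sketch below computes bpc for the current record quoted in this article and the compressed sizes implied by Shannon's estimated entropy range; the helper names are illustrative, not part of any official tooling.

```python
ENWIK9_CHARS = 1_000_000_000   # enwik9 is exactly 10^9 bytes of text

def bits_per_character(compressed_bytes: int) -> float:
    """bpc = (compressed size in bits) / (characters in enwik9)."""
    return compressed_bytes * 8 / ENWIK9_CHARS

def size_at_bpc(bpc: float) -> int:
    """Compressed size in bytes implied by a given entropy rate."""
    return int(bpc * ENWIK9_CHARS / 8)

print(bits_per_character(110_793_128))      # ~0.886 bpc for the current record
print(size_at_bpc(0.6), size_at_bpc(1.3))   # Shannon's range: 75,000,000 to 162,500,000 bytes
```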

Achievements and Impact

List of Winners

The Hutter Prize has recognized multiple individuals and teams for setting successive records in lossless compression of Wikipedia snapshots, with prizes awarded based on verifiable improvements over prior benchmarks. Alexander Rhatushnyak stands out as the most frequent winner, claiming four awards on the enwik8 dataset from 2006 to 2017 using PAQ-derived compressors (paq8hp variants, decomp8, and phda9), for a combined €8,847. Following the 2020 expansion to a €500,000 total pool and shift to the enwik9 dataset, newer winners have employed advanced context-mixing techniques to achieve incremental gains. The following table lists all winners, ordered chronologically by record-setting submission, including the dataset, method, compressed file size in bytes, percentage improvement over the immediate prior record (N/A for initial records on each dataset), and prize amount where applicable.
Dataset | Winner(s) | Date | Method | Compressed Size (bytes) | Improvement (%) | Prize (€)
--- | --- | --- | --- | --- | --- | ---
enwik8 | Alexander Rhatushnyak | 25 Sep 2006 | paq8hp5 | 17,073,018 | N/A | 3,416
enwik8 | Alexander Rhatushnyak | 14 May 2007 | paq8hp12 | 16,481,655 | 3.43 | 1,732
enwik8 | Alexander Rhatushnyak | 23 May 2009 | decomp8 | 15,949,688 | 3.23 | 1,614
enwik8 | Alexander Rhatushnyak | 4 Nov 2017 | phda9 | 15,284,944 | 4.18 | 2,085
enwik9 | Alexander Rhatushnyak | 4 Jul 2019 | phda9v1.8 | 116,673,681 | N/A | N/A
enwik9 | Artemiy Margaritov | 31 May 2021 | starlit | 115,352,938 | 1.13 | 9,000
enwik9 | Saurabh Kumar | 16 Jul 2023 | fast cmix | 114,156,155 | 1.04 | 5,187
enwik9 | Kaido Orav | 2 Feb 2024 | fx-cmix | 112,578,322 | 1.40 | 6,911
enwik9 | Kaido Orav & Byron Knoll | 3 Sep 2024 | fx2-cmix | 110,793,128 | 1.59 | 7,950
As of November 2025, the total prizes paid out from the expanded €500,000 pool for enwik9 amount to €29,945, leaving approximately €470,055 remaining. The enwik8 phase had separate prizes totaling €8,847.

Progress and Milestones

The Hutter Prize records demonstrate incremental advancements in lossless compression of text, reflecting broader developments in compression algorithms and statistical language modeling. The contest originated in 2006 with the 100 MB enwik8 dataset, establishing a baseline compressed size of 18,324,887 bytes using the paq8f compressor. The inaugural award went to Alexander Rhatushnyak in September 2006 for paq8hp5, which reached 17,073,018 bytes; in May 2007 he improved on this by roughly 3.4%, achieving 16,481,655 bytes with paq8hp12. Significant progress followed in the late 2010s through refinements to PAQ-family algorithms, including context-mixing and preprocessing enhancements. Between 2009 and 2017, entries like phda9 by Alexander Rhatushnyak in November 2017 marked a 4.17% gain, reducing the size to 15,284,944 bytes on enwik8, while decomp8 in 2009 had earlier contributed a 3.2% advance to 15,949,688 bytes. These PAQ variants exemplified early successes in adaptive modeling for text redundancy. In 2020, the prize shifted to the 1 GB enwik9 dataset, with an initial baseline of 116,673,681 bytes set by phda9v1.8 (total compressed file plus decompressor size). By mid-2023, an additional 1.04% improvement over the prior record was achieved by Saurabh Kumar's fast cmix, which compressed enwik9 to 114,156,155 bytes. Cumulative gains since the enwik9 baseline of roughly 116.7 MB total approximately 5.0%, culminating in the current record of 110.8 MB (110,793,128 bytes by Kaido Orav and Byron Knoll using fx2-cmix in September 2024), with recent progress accelerating via hybrid context mixers incorporating neural-inspired preprocessing. These milestones underscore the prize's role in benchmarking AI, where compression efficiency correlates with language model sophistication, as superior compressors capture semantic patterns akin to those in transformer-based systems. Despite this, diminishing returns pose ongoing challenges, with each percentage-point gain requiring progressively more computational resources and innovation.
