Archive Team
Archive Team is a volunteer-driven collective of archivists, programmers, and enthusiasts dedicated to preserving digital heritage, particularly websites and online data threatened by service shutdowns or content purges.[1] Founded in 2009 by Jason Scott, the group employs crowdsourced methods to capture and store vast amounts of internet content before it becomes inaccessible.[2] The organization operates without formal hierarchy, coordinating through IRC channels and wikis to identify at-risk platforms via a "Deathwatch" list and launch rapid-response archiving campaigns.[1] Key tools include the ArchiveTeam Warrior, a virtual machine that distributes downloading tasks across participants' computers, enabling efficient, parallel data grabs from targets like defunct forums or image hosts.[3] Much of the salvaged material is donated to repositories such as the Internet Archive, ensuring long-term accessibility.[4] Notable efforts have preserved petabytes of data from services including GeoCities, Yahoo Groups, and Imgur, countering corporate decisions to erase user-generated content.[5] While praised for democratizing preservation, Archive Team's guerrilla tactics have occasionally drawn criticism from institutional archivists for prioritizing volume over curatorial standards.[6]
History and Founding
Origins and Jason Scott's Role
Archive Team emerged in 2009 as a volunteer effort to salvage digital content from websites at risk of permanent deletion, spearheaded by Jason Scott, a self-taught archivist and technology historian who had long advocated for preserving online ephemera through his site textfiles.com. The group's formation was catalyzed by Yahoo's April 2009 announcement that it would shut down GeoCities, a free web hosting service that had hosted over 38 million user pages since 1994; the service was terminated on October 26, 2009, with most content slated for erasure. Scott mobilized a distributed network of crawlers starting that April, coordinating volunteers to download hundreds of gigabytes of data while navigating bandwidth limits imposed by Yahoo to prevent server overload. This initial project captured an estimated 650 gigabytes of GeoCities material, representing millions of personal homepages that documented early internet culture, hobbies, and user creativity.[7][8]
Scott's role was central as founder and de facto leader, though he often characterized himself as the "mascot" and "in-house loudmouth" to emphasize the collective's decentralized, irreverent ethos over hierarchical structure. Drawing on his background documenting BBS culture and critiquing corporate data purges, he framed Archive Team's mission as a rogue intervention against the "erasure of digital history," prioritizing rapid, technically adept preservation over formal permissions. By leveraging IRC channels for coordination and custom scripts for scalable downloading, Scott enabled the group to respond nimbly to shutdown notices, establishing a model of activist archiving that bypassed traditional institutional delays. His efforts gained traction through public appeals and media coverage, underscoring the fragility of user-generated web content in the face of platform decisions.[9][10]
The origins reflected broader late-2000s concerns about web impermanence, as services like GeoCities exemplified the shift from user-controlled hosting to centralized platforms prone to abrupt terminations. Scott's initiative transformed ad hoc rescues into a sustained operation, with early successes like GeoCities laying the groundwork for future projects by demonstrating the feasibility of crowdsourced, high-volume archiving. This approach relied on Scott's technical foresight and rhetorical drive to rally participants, positioning Archive Team as a counterforce to data loss without affiliation to established archives at the outset.[11]
Early Initiatives and Expansion (2009–2012)
Archive Team's inaugural project focused on preserving GeoCities, a web hosting service that Yahoo closed on October 26, 2009, after announcing the shutdown that April.[12] Volunteers, coordinated through IRC channels, deployed scraping scripts to capture user pages, HTML files, images, and other content from the platform, which had enabled millions of amateur websites since its 1994 launch. This effort yielded approximately 641 GB of archived material, distributed via torrents and contributed to the Internet Archive's collections.[13][14]
The GeoCities initiative established Archive Team's model of rapid-response, decentralized preservation, attracting a broader base of programmers and archivists. Between 2010 and 2012, the group scaled up to address multiple shutdowns, including Yahoo Video's user-upload service, which ceased operations around mid-2010, and Google Video, whose decommissioning was revealed in April 2011. For Google Video, participants downloaded over 2.24 terabytes of hosted files before access terminated.[15]
Expansion during this period involved refined techniques for bulk data extraction and URL enumeration, enabling larger hauls such as the 14-terabyte Friendster archive completed in April 2012, which encompassed profiles from 20 million accounts on the pioneering social network. These undertakings demonstrated rapid growth in data volume, from gigabytes in 2009 to terabytes by 2012, and solidified IRC as the hub for real-time volunteer synchronization and progress tracking.[16]
Organizational Model
Volunteer-Driven Collective
Archive Team functions as a decentralized, volunteer-driven collective of individuals who self-identify as rogue archivists, programmers, writers, and others committed to digital preservation, without formal hierarchy, membership requirements, or paid personnel.[1] This structure emphasizes open participation: contributors donate personal time, computing resources, coding expertise, and bandwidth to execute archiving initiatives on an ad hoc basis.[17] The absence of centralized control allows rapid mobilization in response to imminent data losses, such as site shutdowns, but relies on intrinsic motivation rather than institutional incentives, resulting in a fluid roster of participants that fluctuates with project demands.[1]
Coordination occurs predominantly via public Internet Relay Chat (IRC) channels, which serve as hubs for real-time strategy discussions, technical troubleshooting, and recruitment of additional volunteers.[18] These channels enable both synchronous and asynchronous collaboration, with volunteers sharing scripts, progress updates, and calls to action, though response times vary with participants' independent schedules and non-professional commitments.[18] Entry-level involvement is facilitated by user-friendly tools like the ArchiveTeam Warrior, a virtual machine that automates data grabbing and upload to repositories such as the Internet Archive, allowing even those without advanced programming skills to contribute by providing hardware resources.[1]
The volunteer model has proven scalable for large efforts, as demonstrated by projects archiving millions of items from platforms like Yahoo Groups, where distributed downloading mitigated bandwidth limits imposed by hosts.[19] However, this informality can lead to challenges, including inconsistent documentation and reliance on a core group of repeat contributors for sustained momentum, underscoring the dependence on community goodwill over structured governance.[20] Despite these dynamics, the approach has preserved vast troves of at-risk digital content that might otherwise have been lost to proprietary deletions or neglect.[17]
Key Contributors and Decentralized Operations
Jason Scott founded Archive Team in 2009 to preserve digital content threatened by platform shutdowns and deletions, drawing on his experience as a digital historian and archivist.[21] As the group's most prominent figure, Scott has coordinated high-profile archiving efforts and developed tools like the ArchiveTeam Warrior virtual machine, which enables distributed downloading by volunteers.[22] His leadership emphasizes rapid response to preservation crises, often leveraging his position at the Internet Archive to facilitate data handoffs.[23]
Archive Team operates as a decentralized collective without formal membership or hierarchy, relying on self-motivated volunteers including programmers, sysadmins, and enthusiasts worldwide.[1] Coordination occurs primarily through IRC channels on the hackint network, such as #archiveteam, where project announcements, technical discussions, and task assignments happen in real time.[18] This model allows agile scaling: volunteers download and run provided software, like the Warrior appliance, to contribute compute power and bandwidth to "preservation of service attacks" against at-risk sites, uploading results to distributed storage.[24]
The absence of centralized authority fosters innovation but introduces challenges, such as variable data quality and reliance on community norms for deduplication and verification before transfer to repositories like the Internet Archive.[25] Volunteers operate independently, often anonymously, with contributions tracked via IRC logs and project-specific channels rather than formal credits.[26] This structure has enabled Archive Team to archive petabytes of data since its inception, prioritizing speed over institutional protocols.[27]
Technical Infrastructure
Warrior/Tracker System
The ArchiveTeam Warrior is a virtual machine appliance designed to facilitate distributed web archiving by letting volunteers contribute idle computing resources. Participants download and run the appliance, typically via virtualization software like VirtualBox or VMware, which then executes project-specific scripts to crawl targeted websites, capture data in WARC format, and upload it to a central repository.[3][28] This design minimizes setup complexity, enabling rapid scaling during time-sensitive preservation efforts such as site shutdowns.[29]
Central to the system's coordination is the Tracker software, which acts as a task distributor and progress monitor for multiple Warrior instances. The Tracker assigns discrete items, such as URLs or pages, to connected Warriors, tracks completion status to prevent redundant downloads, and provides real-time dashboards and leaderboards displaying aggregate statistics like bytes archived and active nodes.[30] Accessible at tracker.archiveteam.org, it employs a proprietary protocol for job allocation, with APIs available for integration and oversight.[30]
Warriors communicate with the Tracker over the internet, often registering via IRC channels for project-specific instructions, and handle retries for failed grabs while respecting rate limits to avoid overwhelming source servers.[3] The architecture supports modular grabbers, commonly built on wget, with outputs compressed and transmitted periodically; completed WARC files are then processed for integration into larger archives, such as those at the Internet Archive.[31] This distributed model has enabled Archive Team to archive petabytes of data across projects, leveraging thousands of volunteer machines without centralized hardware dependency.[30]
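In outline, each Warrior repeats a simple claim-fetch-report cycle against the Tracker. The sketch below is a minimal approximation of that loop, not Archive Team's actual client: the tracker endpoints and JSON fields are invented placeholders (the real job-allocation protocol is proprietary, as noted above), while the WARC writing uses the open-source warcio library.

```python
# Minimal sketch of a Warrior-style work loop: claim an item from a tracker,
# fetch it, record the response in WARC format, and report completion.
# Tracker URL, endpoints, and JSON fields are hypothetical placeholders;
# only the WARC handling (via the warcio library) reflects real tooling.
import time
from io import BytesIO

import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

TRACKER = "https://tracker.example.org"  # placeholder, not the real tracker API

def claim_item():
    """Ask the tracker for one unit of work (e.g., a URL to grab)."""
    resp = requests.post(f"{TRACKER}/request", json={"downloader": "demo-warrior"})
    return resp.json() if resp.ok else None

def grab_to_warc(url, warc_path):
    """Fetch a URL and append the HTTP response to a gzipped WARC file."""
    with open(warc_path, "ab") as fh:
        writer = WARCWriter(fh, gzip=True)
        resp = requests.get(url, timeout=60, stream=True)
        headers = StatusAndHeaders(
            f"{resp.status_code} {resp.reason}",
            list(resp.raw.headers.items()),
            protocol="HTTP/1.1",
        )
        record = writer.create_warc_record(
            url, "response", payload=BytesIO(resp.content), http_headers=headers
        )
        writer.write_record(record)

while True:
    item = claim_item()
    if not item:
        time.sleep(30)   # back off politely when no work is queued
        continue
    grab_to_warc(item["url"], "output.warc.gz")
    requests.post(f"{TRACKER}/done", json={"item": item["id"]})
    time.sleep(1)        # crude rate limit to avoid hammering the source server
```

In a real deployment the completed WARC files would be uploaded to a staging target rather than kept locally, and the tracker would also handle retries and deduplication, as the section above describes.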
ArchiveBot and IRC Integration
ArchiveBot is an IRC-based automation tool developed by Archive Team for archiving smaller websites, typically those comprising up to a few hundred thousand URLs, by queuing and distributing crawl jobs to volunteer-operated nodes. Users submit starting URLs via IRC commands, triggering the bot to initiate web scraping, capture content, and upload WARC files to the Internet Archive's Wayback Machine for preservation.[32][33] The system's IRC integration centers on the #archivebot channel on the hackint IRC network, where the control node sits as a persistent bot, processing directives such as !archive <URL> from authorized participants and broadcasting real-time status updates, including job queuing, progress percentages, and completion notifications, directly in the channel. This enables collaborative decision-making among distributed volunteers, who monitor and intervene as needed to refine crawls, exclude problematic paths via ignore patterns, or prioritize urgent sites facing shutdowns. The interface enforces rate limits and permissions to mitigate spam or overload, ensuring efficient resource allocation across the network.[32][34]
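The command-and-feedback loop can be illustrated with a toy IRC listener. This is not ArchiveBot's real code: the server address, channel, and nick below are placeholders, the actual bot runs on hackint with TLS and permission checks, and the enqueue step is reduced to a comment.

```python
# Toy IRC listener illustrating ArchiveBot's command pattern: join a channel,
# watch for "!archive <URL>" messages, and acknowledge the queued job.
# Server, channel, and nick are placeholders; this is not ArchiveBot's code.
import socket

HOST, PORT = "irc.example.net", 6667        # placeholder; real bot uses hackint over TLS
CHANNEL, NICK = "#archivebot-demo", "demobot"

sock = socket.create_connection((HOST, PORT))

def send(line):
    sock.sendall((line + "\r\n").encode("utf-8"))

send(f"NICK {NICK}")
send(f"USER {NICK} 0 * :demo archive bot")
send(f"JOIN {CHANNEL}")

buf = ""
while True:
    buf += sock.recv(4096).decode("utf-8", errors="replace")
    while "\r\n" in buf:
        line, buf = buf.split("\r\n", 1)
        if line.startswith("PING"):          # answer keepalives or get dropped
            send("PONG" + line[4:])
            continue
        # PRIVMSG format: ":nick!user@host PRIVMSG #chan :message"
        if " PRIVMSG " in line and " :!archive " in line:
            url = line.split(" :!archive ", 1)[1].strip()
            # A real control node would validate the URL, check the sender's
            # permissions, and enqueue a crawl job here (e.g., into Redis).
            send(f"PRIVMSG {CHANNEL} :Queued archive job for {url}")
```

The same channel carries the bot's status broadcasts, which is why volunteers can follow job progress without any tooling beyond an IRC client.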
Architecturally, ArchiveBot separates concerns into a central control node, which manages IRC interactions, Redis-backed job bookkeeping for persistent state tracking, and task dispatch, and peripheral crawler pipelines run by volunteers on dedicated hardware with ample storage and bandwidth. Crawlers employ scripts based on wget-lua for recursive downloading, with custom hooks to extract URLs embedded in JavaScript, handle media, and avoid infinite loops or external redirects, before compressing and transmitting data upstream for integration into the Internet Archive. A public dashboard at archivebot.com provides WebSocket-driven monitoring of active jobs, including URL counts, bytes archived, and error logs, complementing IRC feedback without requiring direct channel access.[33][35]
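The Redis-backed bookkeeping can be sketched briefly. The key names, job fields, and channel name below are invented for illustration; ArchiveBot's actual schema is internal to the project.

```python
# Sketch of Redis-backed job bookkeeping of the kind a control node might do.
# Key names and fields are invented for illustration, not ArchiveBot's schema.
import json
import uuid

import redis

r = redis.Redis(decode_responses=True)

def enqueue_job(url):
    """Record a new crawl job and place it on the pending queue."""
    job_id = uuid.uuid4().hex[:12]
    r.hset(f"job:{job_id}", mapping={
        "url": url, "status": "pending", "urls_done": 0, "bytes": 0,
    })
    r.lpush("jobs:pending", job_id)
    return job_id

def claim_job():
    """Atomically move one job from pending to active (crawler side)."""
    return r.rpoplpush("jobs:pending", "jobs:active")

def report_progress(job_id, urls_done, nbytes):
    """Update counters; a dashboard could relay these over WebSockets."""
    r.hset(f"job:{job_id}", mapping={
        "status": "running", "urls_done": urls_done, "bytes": nbytes,
    })
    r.publish("updates", json.dumps({"job": job_id, "urls_done": urls_done}))

job_id = enqueue_job("https://example.com/")
claimed = claim_job()
report_progress(claimed, urls_done=100, nbytes=5_000_000)
```

Keeping job state in Redis rather than in the bot process is what lets the control node survive restarts and lets a separate dashboard read the same state, matching the separation of concerns described above.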
Volunteer involvement is essential: operators deploy pipeline instances via provided Docker images or scripts, contributing CPU, disk (often terabytes per job), and connectivity to process queued items, with the control node balancing load across available nodes. Limitations include unsuitability for massive sites better handled by dedicated projects, potential incompleteness against paywalls or heavy client-side rendering, and dependence on manual oversight for complex domains, underscoring ArchiveBot's role as a responsive, community-orchestrated supplement to broader archiving efforts rather than a fully autonomous system.[32]
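Much of that manual oversight takes the form of the ignore patterns mentioned earlier: regular expressions that exclude problematic paths from a crawl. The sketch below shows the basic mechanism; the patterns themselves are illustrative examples, not entries from ArchiveBot's real ignore sets.

```python
# Minimal sketch of crawl-time URL filtering with regex ignore patterns,
# the mechanism used to skip calendar loops, session cruft, and endless
# re-sorted listings. These example patterns are illustrative only.
import re

IGNORE_PATTERNS = [
    r"[?&]replytocom=",          # comment-reply permutations on blogs
    r"/calendar/\d{4}/\d{2}",    # effectively infinite calendar pagination
    r"[?&]sort=|[?&]order=",     # re-sorted views of the same listing
]
COMPILED = [re.compile(p) for p in IGNORE_PATTERNS]

def should_ignore(url):
    """Return True if any ignore pattern matches the candidate URL."""
    return any(rx.search(url) for rx in COMPILED)

queue = [
    "https://example.com/post/42",
    "https://example.com/post/42?replytocom=99",
    "https://example.com/calendar/2031/07",
]
for url in queue:
    print(("SKIP" if should_ignore(url) else "FETCH"), url)
```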