Backup
A backup in computing refers to the process of creating and maintaining duplicate copies of data, applications, or entire systems on a secondary storage device or location, enabling recovery and restoration in the event of data loss, corruption, hardware failure, or other disruptions.[1][2] This practice is fundamental to data protection and disaster recovery, as it mitigates risks from human errors, cyberattacks, power outages, and natural disasters, ensuring business continuity and minimizing downtime that can cost organizations millions per minute for mission-critical operations.[3][1] Regular backups are recommended for all users, from individuals to enterprises, to safeguard critical information against irreversible loss.[4][5]
Backups employ diverse strategies tailored to needs like recovery time objectives (RTO) and recovery point objectives (RPO), including full backups that copy the entire dataset; incremental backups that capture only changes since the last backup; differential backups that record all changes since the last full backup; continuous data protection (CDP) for real-time replication; and bare-metal backups for complete system restoration.[1]
Storage media have evolved from tape drives, known for low cost and high capacity but slower access, to hard disk drives (HDDs), solid-state drives (SSDs), dedicated backup servers, and scalable cloud storage, which offers remote accessibility and flexibility.[1] Best practices, such as the 3-2-1 rule (three copies of data on two different types of media, with one stored offsite), enhance resilience against localized failures.[4]
Fundamentals
Definition and Purpose
Backup refers to the process of creating copies of computer data stored in a separate location from the originals, enabling restoration in the event of data loss, corruption, or disaster.[2][6] This practice ensures that critical information remains accessible and recoverable, forming a foundational element of data protection strategies. Key concepts include redundancy, which involves maintaining multiple identical copies of data to mitigate single points of failure, and point-in-time recovery, allowing restoration to a specific moment before an incident occurred.[7][8] Backups integrate into the broader data lifecycle (creation, usage, archival, and deletion) by preserving data integrity and availability throughout these phases.[9]
The primary purposes of backups are to support disaster recovery, ensuring systems and data can be restored after events like hardware failures or natural disasters; to facilitate business continuity by minimizing operational downtime; and to meet regulatory compliance requirements for data retention and auditability.[10][11][12] They also protect against human errors, such as accidental deletions, and cyber threats including ransomware and cyberattacks, which can encrypt or destroy data.[13][14]
Historically, data backups emerged in the 1950s with the advent of mainframe computers, initially relying on punch cards for data storage and processing before transitioning to magnetic tape systems like the IBM 726, introduced in 1952, which offered higher capacity and reliability.[15][16] In 2025, amid explosive data growth driven by artificial intelligence, Internet of Things devices, and cloud computing, global data volume is estimated at 181 zettabytes, heightening the need for robust backup mechanisms to manage this scale and prevent irrecoverable losses.[17]
Historical Development
The earliest forms of data backup in computing emerged in the 1940s and 1950s alongside vacuum tube-based systems, where punch cards and paper tape served as primary storage and archival media.[18] By the 1930s, IBM was already processing up to 10 million punch cards daily for data handling, a practice that persisted into the 1960s and 1970s for batch processing and rudimentary backups in mainframe environments.[19] Magnetic tape, patented in 1928 and widely adopted by IBM in the 1950s, revolutionized backup by enabling faster sequential data access and greater capacity than paper-based methods, drawing on techniques adapted from audio recording.[20] Tapes became the standard archival medium of the 1960s and 1970s, supporting the growing needs of early enterprise computing.
In the 1970s and 1980s, backup practices advanced with the proliferation of minicomputers and the introduction of cartridge-based magnetic tape systems, such as IBM's 3480 format launched in 1984, which offered compact, high-density storage for mainframes and improved reliability over reel-to-reel tapes.[16] The rise of personal computers and Unix systems in the late 1970s spurred software innovations: the Unix 'dump' utility appeared in Version 6 Unix around 1975 for filesystem-level backups, while 'tar' (tape archive) was introduced in Seventh Edition Unix in 1979 to bundle files for tape storage.[21] By the 1980s and 1990s, hard disk drives became affordable for backups, shifting workflows away from tape-only approaches, and RAID (Redundant Array of Independent Disks), conceptualized in 1987 by researchers at the University of California, Berkeley, provided fault-tolerant disk arrays that enhanced data protection through redundancy.[22] Incremental backups, which capture only changes since the prior backup to reduce storage and time, gained traction during this era, with early implementations in Unix tools and a key patent for optimized incremental techniques filed in 1989.[23]
The 2000s marked a transition to disk-to-disk backups, driven by falling hard drive costs and the need for faster recovery; by the early part of the decade, disk had replaced tape as the preferred primary backup medium for many enterprises, enabling near-line storage for quicker access.[24] Virtualization further transformed backups: VMware's ESX Server, released in 2001, introduced bare-metal hypervisors that supported VM snapshots for point-in-time recovery without full system shutdowns.[25] Cloud storage emerged as a milestone with Amazon S3's launch in 2006, offering scalable, offsite object storage that began integrating with backup workflows for remote replication.[26] Data deduplication, which eliminates redundant data blocks to optimize storage, saw significant adoption starting around 2005, with Permabit Technology Corporation pioneering inline deduplication solutions for virtual tape libraries to address exploding data volumes.[27]
From the 2010s onward, backups evolved to handle big data and hybrid cloud environments, incorporating features like automated orchestration across on-premises and cloud tiers for resilience against outages.[15] The 2017 WannaCry ransomware attack, which encrypted data on over 200,000 systems worldwide, underscored vulnerabilities in traditional backups and prompted a surge in cyber-resilient strategies such as air-gapped and immutable storage to prevent tampering.[28] Ransomware incidents escalated further in the 2020s: disclosed attacks rose 34% from 2020 to 2022, 59% of organizations reported being affected in 2024, and the trend has continued into 2025.[29][30] This has driven adoption of immutable backups that lock data versions against modification for a defined period. Current trends emphasize AI-optimized backups for predictive anomaly detection and zero-trust models integrated into storage, as highlighted in Gartner's 2025 Hype Cycle for Storage Technologies, which positions cyberstorage and AI-driven data management as maturing innovations for enhanced security and efficiency.[31][32]
Backup Strategies and Rules
The 3-2-1 Backup Rule
The 3-2-1 backup rule serves as a foundational best practice for data redundancy and recoverability, recommending the maintenance of three total copies of critical data: the original production copy plus two backups. These copies must reside on two distinct types of storage media to guard against media-specific failures, such as disk crashes or tape degradation, while ensuring at least one copy is stored offsite or disconnected from the primary network to mitigate risks from physical disasters, theft, or localized cyberattacks.[33][34][35]
In light of escalating cyber threats, particularly ransomware that targets mutable backups, the rule has evolved by 2025 into the 3-2-1-1-0 framework. This extension incorporates an additional immutable or air-gapped copy (isolated via physical disconnection or unalterable storage policies) to prevent encryption or deletion by malware, alongside a mandate for zero recovery errors achieved through routine verification testing. Air-gapped solutions, such as offline tapes or cloud-based isolated repositories, enhance resilience by breaking the attack chain, ensuring clean restores even in sophisticated breach scenarios.[33][36][37]
This strategy offers a balanced approach to data protection, optimizing costs through minimal redundancy while preserving accessibility for rapid recovery and providing robust safeguards against diverse failure modes. For instance, a typical implementation might involve the original data on a local server disk, a backup on external hard drives or NAS, and an offsite copy in cloud storage, thereby distributing risk across hardware types and locations without requiring excessive resources.[38][39]
Implementing the 3-2-1 rule begins with evaluating data criticality to focus efforts on high-value assets, such as business records or application databases, using tools like risk assessments to classify information. Next, choose media diversity based on factors like capacity, speed, and compatibility, ensuring no single failure mode affects all copies, while automating backups via software that supports multiple destinations. Finally, establish offsite storage through geographic separation, such as remote data centers or compliant cloud providers, to confirm isolation from primary site vulnerabilities.[37][39][35]
According to the 2025 State of Backup and Recovery Report, variants of the 3-2-1 rule are increasingly adopted amid rising threats, with only 50% of organizations currently aligning actual recovery times with their RTO targets, underscoring the rule's role in enhancing overall resilience.[40]
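A minimal command-line sketch of the rule might mirror the production data to a second copy on different local media and a third copy offsite; the paths, hostname, and tool choice below are illustrative placeholders rather than values prescribed by the rule.

```bash
#!/usr/bin/env bash
# 3-2-1 sketch: the live data set (copy 1) is mirrored to a second media type
# on site (copy 2) and to an offsite host (copy 3). All names are placeholders.
set -euo pipefail

SRC="/srv/data/"                                          # copy 1: production disk
LOCAL_COPY="/mnt/external-drive/backup/data/"             # copy 2: external drive or NAS
OFFSITE_COPY="backup@offsite.example.com:/backups/data/"  # copy 3: offsite over SSH

# On-site copy on a different medium, kept for fast restores.
rsync -a "$SRC" "$LOCAL_COPY"

# Offsite copy, protecting against site-wide disasters and localized attacks.
rsync -a "$SRC" "$OFFSITE_COPY"
```

In practice the offsite leg is often replaced by an object-storage or immutable cloud target, and each run would be scheduled and its results verified to satisfy the zero-errors requirement of the 3-2-1-1-0 variant.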
Rotation and Retention Policies
Rotation schemes define the systematic cycling of backup media or storage to ensure regular data protection while minimizing resource use. One widely adopted approach is the Grandfather-Father-Son (GFS) model, which organizes backups into hierarchical cycles: daily incremental backups (sons) capture changes from the previous day, weekly full backups (fathers) provide a comprehensive snapshot at the end of each week, and monthly full backups (grandfathers) serve as long-term anchors retained for extended periods, such as 12 months.[41][42] This scheme balances short-term recovery needs with archival efficiency by rotating media sets, typically using separate tapes or disks for each level to avoid overwrites.[43]
Another rotation strategy is the Tower of Hanoi scheme, inspired by the mathematical puzzle, which optimizes incremental chaining for extended retention with limited media. In this method, backups occur on a recursive schedule: every other day on the first media set, every fourth day on the second, every eighth on the third, and so on, allowing up to 2^n - 1 days of coverage with n media sets while ensuring each backup depends only on the prior full or relevant incremental for restoration.[44][45] This approach reduces media wear on frequently used sets and supports efficient space utilization in environments with high daily change rates.[46]
Retention policies govern how long backups are kept before deletion or archiving, primarily driven by regulatory compliance to prevent data loss and support audits. For instance, under the General Data Protection Regulation (GDPR) in the European Union, organizations must retain personal data only as long as necessary for the specified purpose, with retention periods determined by the data's purpose and applicable sector-specific or national laws (e.g., 5-10 years for certain financial records under related regulations).[47][48] Similarly, the Health Insurance Portability and Accountability Act (HIPAA) in the United States mandates retention of protected health information documentation for at least six years from creation or the last effective date.[49] To enforce immutability during these periods, Write Once Read Many (WORM) storage is employed, where data can be written once but not altered or deleted until the retention term expires, safeguarding against ransomware or accidental overwrites.[50][51]
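As an illustration of the Tower of Hanoi schedule described above, the following sketch picks which media set to use for each run; the counter file location and the number of sets are assumptions for the example, not values from any particular product.

```bash
#!/usr/bin/env bash
# Tower of Hanoi media-set selection: set A is used every other run, set B every
# fourth run, set C every eighth, and so on (the ruler sequence). The counter
# file path and the number of sets below are illustrative assumptions.
set -euo pipefail

SETS=5                                       # media sets A, B, C, D, E
COUNTER_FILE="/var/lib/backup/hanoi.counter"

run=$(( $(cat "$COUNTER_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$run" > "$COUNTER_FILE"

# The set index equals the number of trailing zero bits in the run counter,
# capped at the last set so the oldest copies stay in rotation the longest.
index=0
value=$run
while (( value % 2 == 0 && index < SETS - 1 )); do
  value=$(( value / 2 ))
  index=$(( index + 1 ))
done

sets=(A B C D E)
echo "Run $run: back up to media set ${sets[$index]}"
```

The first set absorbs the most frequent writes while the higher sets age more slowly, which is how the scheme stretches retention across a small pool of media.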
Several factors influence the design of rotation and retention policies, including the assessed value of the data, potential legal holds that extend retention beyond standard periods, and the ongoing costs of storage infrastructure. High-value data, such as intellectual property, may warrant longer retention to mitigate recovery risks, while legal holds, triggered by litigation or investigations, can pause deletions indefinitely.[52] Storage costs further constrain policies, as prolonged retention increases expenses for cloud or on-premises media, prompting tiered approaches like moving older backups to cheaper archival tiers.[53] In 2025, emerging trends leverage AI-driven dynamic retention, where machine learning algorithms automatically adjust policies based on real-time threat detection and data usage patterns to optimize protection without excessive storage bloat.[54][55]
A common rotation implementation is a weekly full backup combined with daily incrementals: full backups occur every Friday to reset the chain, incrementals run Monday through Thursday, and the prior week's full is retained for quick point-in-time recovery.[56] To estimate storage needs under such a policy, organizations use formulas like Total space = (Full backup size × Number of full backups retained) + (Average incremental size × Number of days retained), accounting for deduplication ratios that can reduce effective usage by 50-90% depending on data redundancy.[57][58]
Challenges in these policies arise from balancing extended retention with deduplication technologies, as long-term archives often cannot share metadata across active and retention tiers, potentially doubling storage demands and complicating space reclamation when deleting expired backups.[59] This tension requires careful configuration to avoid compliance failures or unexpected cost overruns, especially in deduplicated environments where inter-backup dependencies limit aggressive pruning.[60]
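The weekly-full-plus-daily-incremental example above can be scripted with GNU tar's snapshot-based incrementals; the directories, Friday schedule, and sizes used in the sketch and in the worked estimate that follows are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Weekly full / daily incremental sketch using GNU tar's --listed-incremental
# snapshot file. Paths and the Friday reset are placeholders for illustration.
set -euo pipefail

SRC="/home/user/projects"
DEST="/mnt/backup"
SNAPSHOT="$DEST/projects.snar"   # records file state between runs
STAMP=$(date +%F)

if [ "$(date +%u)" -eq 5 ]; then
  # Friday: delete the snapshot so the next archive is a full backup,
  # resetting the incremental chain for the new week.
  rm -f "$SNAPSHOT"
  tar --listed-incremental="$SNAPSHOT" -czf "$DEST/full-$STAMP.tar.gz" "$SRC"
else
  # Monday through Thursday: archive only files changed since the previous run.
  tar --listed-incremental="$SNAPSHOT" -czf "$DEST/incr-$STAMP.tar.gz" "$SRC"
fi
```

Plugging hypothetical figures into the formula above, retaining four 500 GB weekly fulls and twenty 25 GB daily incrementals would require roughly 4 × 500 GB + 20 × 25 GB = 2,500 GB before deduplication.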
Data Selection and Extraction
Targeting Files and Applications
Selecting files and applications for backup involves evaluating their criticality to business operations or personal use, such as user-generated documents, configuration files, and databases that cannot be easily recreated, while excluding transient data like temporary files to optimize storage and performance.[61] Critical items are prioritized based on potential impact from loss, with user files in home directories often targeted first due to their unique value, whereas system and application binaries are typically omitted as they can be reinstalled from original sources.[61] Exclusion patterns, such as *.tmp or *.log, are applied to skip junk or ephemeral files, reducing backup size without compromising recoverability.[62]
At the file level, backups offer granularity by targeting individual files, specific directories, or patterns, allowing for efficient synchronization of only changed or selected items. Tools like rsync enable this selective approach through options such as --include for specific paths (e.g., --include='docs/*.pdf') and --exclude for unwanted elements (e.g., --exclude='temp/'), facilitating incremental transfers to local or remote destinations while preserving permissions and timestamps.[62] This method supports directories as units for broader coverage, such as syncing an entire /home/user/projects/ folder, but allows fine-tuning to avoid unnecessary data.[63]
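A brief sketch of this selective approach, with placeholder paths and patterns, might exclude transient files while mirroring a project directory, or restrict the transfer to a single file type:

```bash
# Mirror a project directory while skipping transient files (paths are placeholders).
rsync -a --exclude='*.tmp' --exclude='*.log' --exclude='temp/' \
  /home/user/projects/ /mnt/backup/projects/

# Copy only PDFs under docs/, pruning directories that end up empty.
rsync -a -m --include='*/' --include='docs/*.pdf' --exclude='*' \
  /home/user/projects/ /mnt/backup/projects-pdfs/
```

Because rsync evaluates filter rules in order and the first match wins, the include rules must appear before the blanket --exclude='*' for the second form to work.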
For applications, backups are tailored to their architecture: databases like MySQL are often handled via logical dumps using mysqldump, which generates SQL scripts to recreate tables, views, and data (e.g., mysqldump --all-databases > backup.sql), ensuring consistency without halting operations when combined with transaction options like --single-transaction.[64] Email servers employing IMAP protocols can be backed up by exporting mailbox contents to standard formats like MBOX or EML using tools that connect via IMAP, preserving folder structures and attachments for archival.[65] Virtual machines (VMs) are commonly treated as single image files, capturing the entire disk state (e.g., VMDK or VHD) through host-level snapshots to enable quick restoration of the full environment.[66]
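For the database case, a consistent logical dump can be taken from a running MySQL server; the host, credentials, and output path below are placeholders, and the sketch assumes InnoDB tables so that --single-transaction yields a consistent snapshot without blocking writes.

```bash
# Logical MySQL backup: a consistent, compressed SQL dump of all databases.
# --single-transaction avoids locking InnoDB tables; --quick streams large tables.
mysqldump --single-transaction --quick --routines --all-databases \
  -h db.example.com -u backup_user -p \
  | gzip > "/mnt/backup/mysql-$(date +%F).sql.gz"
```

Automated jobs would normally read credentials from a protected option file rather than prompting with -p, and restoration consists of piping the decompressed dump back into the mysql client.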
Challenges arise with large files exceeding 1TB, such as high-definition videos, where bandwidth constraints and incompressible data types prolong initial uploads and recovery times, often necessitating hybrid strategies like disk-to-disk seeding before cloud transfer.[67] In distributed systems, data sprawl across hybrid environments complicates visibility and consistency, as exponential growth in volume—projected to reach 181 zettabytes globally by 2025—strains backup processes and increases the risk of incomplete captures.[17] By 2025, backing up SaaS applications like Office 365 requires API-based connectors for automated extraction of Exchange, OneDrive, and Teams data, with tools configuring OAuth authentication to pull items without on-premises agents.[68]
Best practices emphasize prioritization by Recovery Point Objective (RPO), the maximum tolerable interval of data loss, with critical applications such as databases and email typically targeting an RPO under one hour and meeting it through frequent incremental or continuous backups to minimize business disruption.[69] This approach integrates with broader filesystem backups for comprehensive coverage, ensuring that the selected files and applications align with overall data protection goals.[61]
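A scheduling sketch for such an RPO target might pair a nightly full with hourly incrementals; the crontab entries below assume a hypothetical backup.sh wrapper script and are illustrative only.

```bash
# Hypothetical crontab: a nightly full plus hourly incrementals keeps the
# worst-case data loss (RPO) for this dataset under one hour.
# min hour dom mon dow  command
0 1 * * * /usr/local/bin/backup.sh full
15 * * * * /usr/local/bin/backup.sh incremental
```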