
Self-Monitoring, Analysis and Reporting Technology

Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is a built-in system for hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports indicators of potential reliability issues, enabling early prediction of drive failures to facilitate data backup and replacement. Developed collaboratively by major hard disk drive manufacturers, S.M.A.R.T. originated from IBM's 1992 introduction of Predictive Failure Analysis (PFA) technology in its 9337 disk arrays using SCSI-2 drives, which monitored key parameters to forecast failures. The technology evolved into a standardized feature with the publication of the ATA-3 specification in 1997, which incorporated the Small Form Factor (SFF) Committee's proposal into the AT Attachment (ATA) interface for broader adoption across IDE and later Serial ATA (SATA) drives. By the early 2000s, S.M.A.R.T. became a default capability in most consumer and enterprise storage devices, with adaptations for SSDs focusing on metrics like wear leveling and program/erase cycles rather than mechanical attributes. S.M.A.R.T. operates by continuously tracking a set of vendor-defined attributes (such as read error rates, spin-up time, temperature, and reallocated sectors) using internal sensors and counters within the drive's firmware. These raw measurements are processed by firmware algorithms into normalized values that are compared against predefined thresholds; if a value falls at or below its threshold, the drive signals a potential failure, often providing at least 24 hours of continued operation for data backup. The system supports a command set (the ATA SMART command, opcode B0h, with feature-register subcommands) for enabling/disabling monitoring, querying status, and executing self-tests, including short tests (2-10 minutes) that verify basic functionality and long tests (up to several hours) that perform comprehensive media scans. While highly effective for proactive maintenance, S.M.A.R.T. has limitations, as it only monitors detectable parameters and cannot predict all failure modes, such as sudden electronic failures or manufacturing defects. Its data is accessible via BIOS/UEFI utilities, operating system tools, or third-party software, making it essential for IT administrators and users in data centers, personal computing, and archival storage to minimize downtime and data loss.

Introduction

Definition

Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) is a firmware-embedded monitoring system integrated into hard disk drives (HDDs) and solid-state drives (SSDs), designed to track and assess the operational health of storage devices in real time. Developed as an industry standard in the mid-1990s by the Small Form Factor (SFF) Committee, S.M.A.R.T. enables drives to self-diagnose potential issues autonomously, without relying on host software for basic detection. At its core, S.M.A.R.T. continuously collects data on key drive health indicators, such as temperature levels, error rates during read/write operations, and mechanical metrics like spin retry counts in HDDs. The drive's firmware performs ongoing evaluation of these attributes against predefined thresholds, flagging deviations that could signal impending failure. This facilitates predictive failure detection by identifying trends toward faults early, allowing systems to alert users or administrators to impending issues before complete drive failure occurs.

Purpose and Benefits

Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) serves as a built-in diagnostic system in hard disk drives (HDDs) and solid-state drives (SSDs), designed to proactively monitor key operational attributes and predict potential failures before they occur. Its primary purpose is to detect indicators of drive degradation, such as increasing error rates, and generate alerts that enable users or administrators to perform timely data backups or drive replacements, thereby preventing unexpected data loss. This monitoring occurs continuously during normal drive operations, with data collected periodically to assess reliability without interrupting performance. The technology offers several practical benefits in storage management, including reduced system downtime through early failure warnings that allow for scheduled maintenance rather than reactive repairs. By identifying issues proactively, S.M.A.R.T. helps extend the effective lifespan of drives by enabling interventions that mitigate further wear, such as reducing workload on degrading components. In enterprise environments, it supports data integrity by facilitating the preservation of critical information across large-scale storage arrays, minimizing the risk of widespread outages or corruption from sudden failures. Specific failure prediction scenarios illustrate these advantages; for instance, a rising Spin Retry Count attribute signals repeated failures in spinning up the drive's platters, often indicating motor or electronics issues that could lead to complete inoperability if unaddressed. Similarly, an increasing Reallocated Sector Count tracks the number of bad sectors remapped to spare areas, serving as an early indicator of media degradation and imminent read/write failures. These examples allow administrators to anticipate problems days or weeks in advance, averting data loss. S.M.A.R.T. evolved into an industry standard feature with its inclusion in the ATA-3 specification in 1997, and it has since become a ubiquitous component in virtually all modern HDDs and SSDs produced by major manufacturers.
This standardization ensures consistent health reporting across devices, enhancing compatibility and reliability in diverse computing ecosystems.

Historical Development

Origins

Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) was developed through collaborative efforts among major drive vendors to enable drives to self-assess health metrics like error rates and temperature, allowing preventive action before failure. Development of S.M.A.R.T. spanned the early 1990s to 1995, led by Compaq and IBM in partnership with Seagate, Quantum, and Conner through the Small Form Factor (SFF) Committee. IBM contributed foundational monitoring concepts from its Predictive Failure Analysis (PFA) system in SCSI drives, while Compaq drove the ATA-focused protocol in 1995 under the initial name IntelliSafe, which evolved into the flexible S.M.A.R.T. standard emphasizing vendor-defined attributes and thresholds. Early implementations appeared in ATA drives following the 1995 standardization, integrating firmware to report key parameters via host commands. A pivotal step occurred in 1995 when Compaq submitted the specification (SFF-8035) to the SFF Committee, resulting in its adoption and incorporation into the ATA-3 standard, which formalized S.M.A.R.T. commands for IDE/ATA interfaces. This enabled broader compatibility, with subsequent expansion to SCSI interfaces adapting the concept for server environments through equivalent log pages and commands. Early adoption faced challenges in interoperability, as the S.M.A.R.T. framework permitted vendor-specific attribute selections and threshold interpretations, leading to inconsistencies in reporting across drives from different manufacturers. Despite these variations, the technology gained traction by mid-1995, with initial deployments highlighting its potential to reduce unplanned downtime in desktop and server systems through proactive alerts.

Predecessors

In the 1980s, hard disk drives incorporated fundamental mechanisms to maintain data integrity, laying the groundwork for later self-monitoring technologies. Seagate's ST-506, the first 5.25-inch hard drive introduced in 1980, used cyclic redundancy check (CRC) codes for detecting errors during read operations, allowing for basic tracking of read errors without external intervention. This CRC implementation represented an early form of internal error handling, where the drive handled detection autonomously but did not provide predictive insights or host-accessible reports on error trends. The emergence of the Small Computer System Interface (SCSI) in 1986 further advanced diagnostic capabilities in storage devices. SCSI drives supported commands such as REQUEST SENSE, which enabled hosts to query the drive for detailed error status, including hardware faults and operation failures, facilitating rudimentary self-diagnostics. These features allowed system administrators to retrieve error logs manually, marking a step toward drive-initiated diagnostics, though limited to reactive responses rather than proactive analysis. Despite these innovations, 1980s technologies like CRC tracking and SCSI diagnostics required manual interpretation by users or host software, lacking built-in automated analysis of error patterns or threshold-based failure predictions. This gap highlighted the need for more intelligent systems that could self-assess health attributes and alert on impending failures, driving advancements in the late 1980s and early 1990s toward automated monitoring. A key predecessor was Compaq's IntelliSafe, developed in collaboration with Seagate, Quantum, and Conner in the early 1990s as an early predictive failure tool. IntelliSafe expanded on error logging by monitoring drive attributes such as seek errors and spin-up retries, comparing them against vendor-defined thresholds to generate alerts for potential failures, though it still relied on host software for detailed reporting.
This approach bridged manual diagnostics and full automation, influencing the evolution of self-monitoring in storage devices.

Core Mechanisms

Monitored Attributes

Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) monitors a range of parameters indicative of storage device health, reliability, and operational status to enable early detection of potential failures. These attributes encompass error rates, environmental factors, and performance metrics, which are collected and maintained by the device's firmware to provide insights into degradation or faults. Key categories of monitored attributes include error rates, which track incidents such as read and write errors, uncorrectable sector errors, and seek errors to assess data-loss risks; environmental factors, covering aspects like temperature (current, minimum, and maximum values), power-on hours, power cycle counts, and power-on resets to evaluate wear from usage and operating conditions; and performance metrics, such as seek times, data throughput, and transfer efficiency to gauge operational speed and efficiency over time. For instance, temperature monitoring helps identify overheating that could accelerate component degradation, while power cycle counts reflect mechanical stress from frequent startups. These categories allow for a holistic view of drive health without delving into protocol-specific implementations. Each attribute consists of a normalized value and a raw value. The normalized value represents a health score on a scale typically from 0 to 255, where higher values indicate better condition (often 1-253, with 1 signaling worst-case and 253 near-ideal), while 0 and values like 254-255 are reserved for special or invalid states; this scale is vendor-defined but standardized in structure to facilitate host interpretation. The raw value, in contrast, provides unprocessed, vendor-specific data such as event counts, elapsed hours, or temperature readings, offering detailed context for the normalized assessment without a fixed scale. Thresholds are set by manufacturers to trigger alerts when normalized values fall below acceptable levels.
In addition to standardized attributes, vendors may implement proprietary ones to capture device-specific metrics, such as custom error logging or advanced sensor data, which are stored in designated log areas and complement the core set for enhanced diagnostics. These vendor-specific attributes maintain compatibility with the overall S.M.A.R.T. framework while allowing innovation in monitoring. Attributes are updated in real time by the drive firmware during normal operations, background scans, or self-initiated data collection periods, with periodic autosaves to non-volatile storage ensuring persistence across power cycles or resets; this mechanism minimizes impact on device performance while enabling timely health tracking. The collected data can then be reported to host systems for analysis.
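As an illustration of how a host-side monitor might interpret these fields, the Python sketch below (with hypothetical example values, not from any vendor's documentation) classifies an attribute by comparing its current and worst-case normalized values against the manufacturer threshold:

```python
from dataclasses import dataclass

@dataclass
class SmartAttribute:
    attr_id: int    # standardized or vendor-specific ID (1-255)
    normalized: int # current normalized value (higher = healthier)
    worst: int      # worst normalized value ever recorded
    threshold: int  # manufacturer-defined failure threshold
    raw: int        # unprocessed vendor-specific counter or reading

def health_state(attr: SmartAttribute) -> str:
    """Classify an attribute the way a host-side monitor typically would."""
    if attr.normalized <= attr.threshold:
        return "FAILING_NOW"          # at/below threshold: predicted failure
    if attr.worst <= attr.threshold:
        return "FAILED_IN_THE_PAST"   # worst-ever value once crossed the threshold
    return "OK"

# Example: a Reallocated Sector Count (ID 5) that is still healthy.
realloc = SmartAttribute(attr_id=5, normalized=100, worst=100, threshold=36, raw=0)
print(health_state(realloc))  # OK
```

The raw value is deliberately left out of the verdict: as the text notes, it has no fixed scale, so only the normalized value is comparable against the threshold.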

Reporting and Analysis

In S.M.A.R.T., the analysis process is performed by the device's firmware, which continuously monitors and evaluates key attributes such as error rates and operational parameters by comparing their normalized values against predefined thresholds to identify potential anomalies or degradation. These thresholds represent minimum acceptable levels for drive reliability, with normalized attribute values typically scaled from 1 to 100 or 1 to 253, where higher values indicate healthier conditions; when a value falls below its threshold, it signals a degrading state. For example, attributes like reallocated sector count or spin-up retry count are assessed in this manner to detect early signs of mechanical or media issues. Reporting of analyzed data occurs through status flags and detailed logs accessible via SMART commands, providing an overall health verdict and granular insights into attribute trends. The Device SMART Data structure, a 512-byte sector, stores current and worst-case values for each attribute, along with off-line data collection status flags that indicate whether collection is active, completed, or interrupted. Logs such as the Comprehensive SMART Error Log and Extended SMART Self-Test Log record historical data, including error timestamps and test outcomes, enabling host systems to retrieve and interpret drive health without impacting performance. A critical alert mechanism in S.M.A.R.T. is the Threshold Exceeds Condition (TEC), which is triggered when any attribute's normalized value drops below its threshold, indicating a high likelihood of imminent failure and prompting warranty replacement in many implementations. The TEC status is reported via the SMART RETURN STATUS command, returning specific register values (F4h/2Ch) to signal the condition to the host, distinct from general pass/fail flags. Basic algorithms for failure prediction rely on trend analysis of attribute values over time, using logged historical data to extrapolate degradation patterns and estimate remaining operational life.
The firmware tracks changes in raw and normalized values through periodic updates during background or off-line data collection, comparing them against baselines to forecast when thresholds might be approached, though vendor-specific variations exist in the exact predictive logic. This approach prioritizes proactive alerts based on observed wear rather than reactive error detection.
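Since the standards leave the predictive logic vendor-defined, the following Python sketch shows one simple possibility rather than any specified algorithm: a least-squares line fitted to logged (power-on hours, normalized value) samples, extrapolated to estimate when the attribute will cross its threshold. The function name and sampling scheme are illustrative assumptions:

```python
def hours_until_threshold(samples, threshold):
    """
    samples: list of (power_on_hours, normalized_value) observations.
    Fits a least-squares line and extrapolates the hour at which the
    normalized value reaches the threshold. Returns None when the trend
    is flat or improving (no degradation to extrapolate).
    """
    n = len(samples)
    sx = sum(h for h, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(h * h for h, _ in samples)
    sxy = sum(h * v for h, v in samples)
    denom = n * sxx - sx * sx
    if denom == 0:
        return None                      # degenerate sample set
    slope = (n * sxy - sx * sy) / denom
    if slope >= 0:
        return None                      # attribute is stable or recovering
    intercept = (sy - slope * sx) / n
    crossing = (threshold - intercept) / slope  # hour at which value hits threshold
    last_hour = samples[-1][0]
    return max(0.0, crossing - last_hour)

# A value decaying from 100 by 2 points per 100 hours, threshold 36:
remaining = hours_until_threshold([(0, 100), (100, 98), (200, 96)], 36)
print(remaining)  # 3000.0 hours of predicted remaining life
```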

Standardization

Standard Specifications

The formal standards governing Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) across interfaces were established to ensure consistent capabilities for monitoring drive health and reliability. The ATA/ATAPI-5 specification, released in 2000 by the T13 technical committee under the International Committee for Information Technology Standards (INCITS), defines the basic S.M.A.R.T. feature set for ATA devices, including core commands for data retrieval and status reporting. Similarly, the SCSI Primary Commands-2 (SPC-2) standard (INCITS 351-2001) for SCSI interfaces incorporates S.M.A.R.T.-equivalent functionality through mechanisms like informational exception operations, enabling predictive failure reporting in SCSI-based systems. For modern non-volatile memory express (NVMe) interfaces, S.M.A.R.T. is integrated via the NVMe Base Specification version 1.3 (ratified in 2017) and subsequent revisions, which mandate a SMART/Health Information log page for controllers supporting the NVM command set. S.M.A.R.T.'s evolution within standards began with its introduction as an optional feature in the ATA-3 specification (1997), where it was proposed for inclusion but not required for device compliance. By ATA/ATAPI-5 and later revisions, S.M.A.R.T. became a standardized component of the command set, with defined behavior required if implemented by the device, promoting widespread adoption while allowing vendors flexibility in support. The NVMe Base Specification, first published in 2011, incorporated S.M.A.R.T. elements from its inception (version 1.0), with enhancements in version 1.3 adding detailed health attributes and asynchronous event notifications to align with high-performance SSD requirements. Common elements across these standards include standardized attribute IDs (ranging from 1 to 255) to track metrics like error rates, temperature, and usage; manufacturer-defined threshold settings that establish failure criteria for attributes, with related host controls such as the ENABLE/DISABLE ATTRIBUTE AUTOSAVE subcommand in ATA or Set Features in NVMe; and enable/disable controls to toggle S.M.A.R.T. functionality (e.g., subcommands D8h for enable and D9h for disable in ATA). These features facilitate uniform implementation by specifying data formats and command opcodes. The T13 committee plays a central role in ATA S.M.A.R.T. standardization, developing revisions to the command set for consistent host-device interactions and interoperability. Likewise, the T10 committee under INCITS oversees SCSI standards, defining logging and exception reporting to support S.M.A.R.T.-like monitoring in enterprise environments.

Implementation Variations

Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) implementations exhibit significant variations across drive vendors and protocols, stemming from the non-standardized nature of attribute definitions and thresholds. Although core S.M.A.R.T. functionality is outlined in standards like ATA/ATAPI, the specific interpretation of attributes remains largely vendor-specific, leading to inconsistencies in how health data is collected, normalized, and reported. For instance, Seagate drives often report high raw values for Attribute 01 (Read Error Rate) on new units, which is normal and does not indicate failure, whereas other vendors treat similar raw values as potential issues. Similarly, Attribute 194 (Temperature Celsius) on Seagate disks stores temperature data in both raw and normalized fields, while some vendors' drives may display uninitialized values (such as 253) for Attributes 10 (Spin Retry Count) and 11 (Calibration Retry Count) until the drive accumulates sufficient operating hours, potentially misleading monitoring tools. These custom thresholds and meanings, such as interpreting the normalized value of Attribute 190 (Airflow Temperature) as 100 minus the temperature in degrees Celsius, complicate cross-vendor comparisons and require vendor documentation for accurate analysis. Visibility of S.M.A.R.T. attributes to the host operating system is another area of inconsistency, as not all attributes are exposed through standard interfaces like ATA passthrough. Many vendor-specific attributes remain hidden or require proprietary tools for access, limiting general-purpose software's ability to retrieve comprehensive data. For example, in RAID configurations or USB enclosures, support for S.M.A.R.T. queries varies by controller type, often necessitating specialized drivers or protocols like SAT (SCSI-ATA Translation) to pass through the data, and even then, only a subset of attributes may be available without vendor-specific utilities. This opacity can hinder proactive health monitoring, as users may overlook critical metrics unless using manufacturer-provided software, such as Seagate's SeaTools or Western Digital's drive utilities. Compatibility challenges arise from firmware variations across drive models and vendors, frequently resulting in false positives or negatives in health assessments. Firmware bugs, such as those in early SSDs (e.g., the Intel 330 series with firmware 300i), can cause attributes like Power-On Hours to report erroneously high values (e.g., around 890,000 hours), triggering unnecessary failure alarms despite the drive functioning normally. Differences in how attributes are updated or persisted, such as Maxtor's use of Attribute 9 (Power-On Hours) counted in minutes that wrap around every 65,536 minutes versus hours on other vendors, can lead to inconsistent reporting during firmware updates or across drive generations, exacerbating interoperability issues in mixed environments. Standards have continued to address some uniformity issues, particularly in NVMe implementations. The NVMe 2.0 Base Specification, ratified in 2021, enhanced the SMART/Health Information log by clarifying metrics like Data Units Written to exclude the impact of Write Zeroes commands, reducing potential discrepancies in wear tracking and improving cross-vendor consistency for solid-state drives. The specification has further evolved, with Revision 2.3 (ratified in 2025) introducing enhancements such as improved sustainability metrics that complement health reporting. These updates promote more reliable health reporting in modern NVMe ecosystems, though legacy ATA/SATA variations persist.
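The minute-counting example above can be handled in host software by normalizing vendor-specific raw units before comparison. This small Python helper is an illustrative sketch (the vendor tag and wrap-count bookkeeping are assumptions, not part of any specification); it converts a 16-bit wrapped minute counter into hours:

```python
def power_on_hours(raw: int, vendor: str, wrap_count: int = 0) -> float:
    """
    Normalize a Power-On Hours (Attribute 9) raw value to hours.
    'maxtor_minutes' models a vendor counting minutes in a 16-bit field,
    so each observed wraparound adds 65,536 minutes; other vendors are
    assumed to report hours directly.
    """
    if vendor == "maxtor_minutes":
        total_minutes = raw + wrap_count * 65_536
        return total_minutes / 60.0
    return float(raw)  # hours reported directly

# 30,000 raw minutes after one observed wraparound:
print(power_on_hours(30_000, "maxtor_minutes", wrap_count=1))  # ~1592.27 hours
```

Tracking `wrap_count` requires the monitoring tool to persist its own history, since the drive itself reports only the wrapped raw value.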

Interface Implementations

ATA and SATA

Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) is implemented in the ATA (Advanced Technology Attachment) interface through a set of specific commands that allow hosts to enable, disable, and query device health data. The primary ATA commands for S.M.A.R.T. include SMART ENABLE OPERATIONS (opcode B0h, feature D8h), which activates the feature set; SMART DISABLE OPERATIONS (opcode B0h, feature D9h), which deactivates it and prevents further S.M.A.R.T. operations except for re-enabling; SMART READ DATA (opcode B0h, feature D0h), a PIO data-in command that retrieves the 512-byte device S.M.A.R.T. data structure containing attribute values, offline status, and self-test status; and SMART READ THRESHOLDS (opcode B0h, feature D1h), which fetches the 512-byte threshold structure for comparing attribute values against failure thresholds. In the S.M.A.R.T. data structure, attributes are represented by unique IDs with both normalized values (typically on a 1-100 or 1-253 scale, where higher is better) and raw values, enabling predictive failure analysis. Notable attributes include ID 01h (Read Error Rate), which tracks the rate of read errors normalized against a vendor-specific baseline; ID 05h (Reallocated Sectors Count), counting sectors remapped due to errors; and ID C5h (Current Pending Sector Count), indicating unstable sectors pending reallocation or remapping after a write attempt. These attributes contribute to the Threshold Exceeds Condition, where an attribute's normalized value falls below its threshold, signaling potential device failure; this condition is queried via the SMART RETURN STATUS command (opcode B0h, feature DAh), which sets the LBA Mid/High output registers to F4h/2Ch if a threshold has been exceeded, otherwise 4Fh/C2h. The offline status, part of the S.M.A.R.T. data structure at byte 362, reports values such as 00h (offline idle), 02h (completed), or 03h (in progress), indicating the device's background activity. The transition to Serial ATA (SATA) maintained full compatibility with the S.M.A.R.T. command set while introducing serial transport enhancements. Released in 2003, the SATA 1.0 specification supports S.M.A.R.T. operations identically to parallel ATA, inheriting the feature set from ATA/ATAPI-5, with commands executed via the non-data command protocol in the device state machine. SATA's Native Command Queuing (NCQ), an optional feature introduced in later SATA revisions, enables efficient queuing of up to 32 commands, reducing latency in multi-command environments without altering the core S.M.A.R.T. protocol; S.M.A.R.T. commands themselves are issued non-queued alongside queued I/O.
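To show how a host tool might decode the sector returned by SMART READ DATA, here is a Python sketch; the 12-byte attribute-entry layout follows the commonly documented ATA format (ID, flags, value, worst, six raw bytes, reserved), though vendors may interpret the raw bytes differently:

```python
import struct

def parse_smart_attributes(data: bytes):
    """
    Parse the attribute table from a 512-byte SMART READ DATA sector.
    Bytes 0-1 hold the structure revision; 30 twelve-byte attribute
    entries follow, starting at byte 2. Each entry:
      byte 0     attribute ID (0 = unused slot)
      bytes 1-2  status flags (little-endian)
      byte 3     current normalized value
      byte 4     worst normalized value
      bytes 5-10 six-byte raw value (little-endian)
      byte 11    reserved
    """
    assert len(data) == 512
    attrs = {}
    for i in range(30):
        entry = data[2 + i * 12 : 2 + (i + 1) * 12]
        attr_id = entry[0]
        if attr_id == 0:
            continue  # empty slot
        flags = struct.unpack_from("<H", entry, 1)[0]
        raw = int.from_bytes(entry[5:11], "little")
        attrs[attr_id] = {"flags": flags, "value": entry[3],
                          "worst": entry[4], "raw": raw}
    return attrs
```

A tool would obtain the 512-byte buffer via an OS-specific passthrough ioctl; the parser above only covers the host-visible layout, not the transport.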

NVMe

In the NVMe protocol, Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is adapted for high-performance solid-state drives (SSDs) connected via PCIe, emphasizing endurance and thermal management over mechanical components. The primary mechanism is the SMART/Health Information log page (Log Identifier 02h), which provides controller-level health data mandatory for all NVMe controllers, with optional namespace-specific support. This log aggregates key attributes to enable proactive failure prediction, accessible through standardized admin commands. The SMART/Health Information log page includes critical attributes such as Composite Temperature, reported at bytes 1-2 in Kelvin (subtract 273 for Celsius), along with additional temperature sensor readings and configurable thresholds for warnings. Percentage Used (byte 5) estimates the proportion of the drive's endurance life consumed (0-100%, saturating at 255), serving as a core indicator of wear. Available Spare (byte 3, percentage of remaining spare blocks) and its threshold (byte 4) provide additional endurance metrics, with a critical warning raised if spare falls below the threshold (bit 0 in byte 0). Media errors are tracked via the Media and Data Integrity Errors field (bytes 160-175, a count of unrecovered data integrity errors during read/write operations), summarizing issues from the Error Information log (ID 01h). These attributes focus on SSD reliability metrics rather than HDD-specific mechanical indicators. To retrieve S.M.A.R.T. data, the Get Log Page admin command (opcode 02h) is used, allowing hosts to poll the SMART/Health log at the controller or namespace level. Critical warning bits (byte 0 of the log) provide immediate alerts for conditions like available spare capacity below threshold (bit 0), temperature exceeding limits (bit 1), reliability degradation due to media errors (bit 2), or transition to read-only mode (bit 3); these bits can trigger Asynchronous Event Requests for real-time notification. SSD-specific attributes in NVMe S.M.A.R.T. prioritize flash wear: wear-leveling health is inferred from the Percentage Used and Available Spare fields, which reflect block-level balancing intended to prevent localized over-erasure. Program/erase cycles are indirectly captured through metrics like Data Units Written (bytes 48-63) and the Endurance Group Information log, quantifying stress from repeated write operations. These differ fundamentally from HDD metrics by focusing on electrical and flash-media wear. The NVMe 2.0 specification (released 2021) enhances S.M.A.R.T. with improved telemetry through features like the Persistent Event Log (ID 0Dh) and the Telemetry logs (support indicated by LPA bit 3), enabling historical SMART/Health snapshots for failure forecasting. Namespace-specific monitoring is advanced via NVM Sets and the Endurance Group Information log (ID 09h), allowing granular tracking per logical partition for multi-tenant environments.
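Because the log page has fixed offsets, a host can decode it directly. The Python sketch below reads a subset of fields whose positions follow the NVMe Base Specification; the 512-byte buffer would come from a Get Log Page transfer (obtaining it is OS-specific and omitted here):

```python
def parse_nvme_smart(log: bytes) -> dict:
    """
    Decode selected fields of the 512-byte NVMe SMART/Health Information
    log page (Log Identifier 02h). Multi-byte integers are little-endian.
    """
    assert len(log) >= 512
    temp_kelvin = int.from_bytes(log[1:3], "little")
    return {
        "critical_warning": log[0],            # bit flags (spare, temp, ...)
        "temperature_c": temp_kelvin - 273,    # Composite Temperature
        "available_spare_pct": log[3],
        "spare_threshold_pct": log[4],
        "percentage_used": log[5],             # endurance consumed, may saturate at 255
        "data_units_written": int.from_bytes(log[48:64], "little"),
        "power_on_hours": int.from_bytes(log[128:144], "little"),
        "media_errors": int.from_bytes(log[160:176], "little"),
    }
```

A monitoring daemon would typically poll this page periodically and raise an alarm whenever `critical_warning` is nonzero or `available_spare_pct` approaches `spare_threshold_pct`.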

SCSI and SAS

In SCSI and SAS (Serial Attached SCSI) interfaces, health monitoring functionality similar to S.M.A.R.T. is provided via native commands to enable robust health monitoring of hard disk drives and solid-state drives in server and enterprise configurations. For ATA drives attached behind these interfaces, S.M.A.R.T. data can be accessed through translated commands. The primary mechanism for accessing health data involves the LOG SENSE command, which retrieves specific log pages containing diagnostic metrics. Key pages include Page Code 2Fh for Informational Exceptions, which reports overall health threshold status (e.g., whether predefined limits for drive health have been exceeded), and Page Code 10h for Self-Test Results, detailing outcomes of diagnostic tests. Additionally, Page Code 18h (Protocol-Specific Log Page) provides phy-level error counters relevant to SAS implementations. SAS, as a serial evolution of parallel SCSI, has supported health monitoring since its initial specification (SAS 1.0, ratified in 2003), incorporating dual-port architecture to facilitate redundant monitoring paths. Each port operates independently with unique SAS addresses, allowing hosts to query health data via either path for redundancy and failover in multi-initiator setups, thereby minimizing single points of failure in monitoring operations. This dual-port capability enhances reliability in storage arrays by enabling continuous checks even if one path is disrupted. Enterprise-focused health monitoring metrics in SCSI and SAS emphasize parameters critical for large-scale storage, such as I/O error rates (e.g., invalid DWORD counts and loss of DWORD synchronization in SAS phys), write error rates (tracked via protocol-specific logs), and background media scan results that detect surface defects during idle periods. These metrics support proactive management in enterprise environments by logging cumulative errors and scan outcomes for trend analysis. Compared to ATA implementations, SCSI and SAS monitoring is more robust for enterprise deployments due to the Informational Exceptions Control mode page (1Ch), which configures informational exception reporting and asynchronous notification tailored to multi-device topologies.
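All of these log pages share the SPC log-parameter format: a 4-byte page header followed by a sequence of parameters, each with its own 4-byte header. A generic walker can therefore split any page into its parameters; the Python sketch below is a simplified illustration (it ignores subpage handling and the parameter control bits, and the sample bytes are hypothetical):

```python
def parse_log_page(page: bytes):
    """
    Walk the log parameters of a SCSI log page returned by LOG SENSE.
    SPC layout: byte 0 = page code (low 6 bits), byte 1 = subpage code,
    bytes 2-3 = page length (big-endian), then parameters, each with a
    4-byte header: parameter code (2 bytes, big-endian), a control byte,
    and the parameter data length.
    """
    page_code = page[0] & 0x3F
    length = int.from_bytes(page[2:4], "big")
    params = {}
    offset, end = 4, 4 + length
    while offset + 4 <= end:
        code = int.from_bytes(page[offset:offset + 2], "big")
        plen = page[offset + 3]
        params[code] = page[offset + 4 : offset + 4 + plen]
        offset += 4 + plen
    return page_code, params

# Hypothetical Informational Exceptions page (2Fh) with one parameter:
page = bytes([0x2F, 0x00, 0x00, 0x08,     # header: page 2Fh, 8 bytes of params
              0x00, 0x00, 0x00, 0x04,     # parameter 0000h, 4 data bytes
              0x5D, 0x10, 0x20, 0x00])    # example payload bytes
code, params = parse_log_page(page)
```

For page 2Fh, parameter 0000h conventionally carries the informational-exception additional sense code and qualifier; interpreting the payload is page-specific and left to the caller.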

Operational Features

Self-Tests

Self-tests in Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) provide active diagnostic capabilities that enable hosts to initiate on-demand assessments of storage device health, distinct from continuous attribute monitoring. These tests verify the integrity of device components, such as read/write circuitry and media surfaces, by performing targeted scans and checks. S.M.A.R.T. self-tests support predictive maintenance by allowing administrators to detect emerging issues early, schedule periodic diagnostics, and respond to potential failures before data loss occurs. Three primary test types are defined: the short self-test, which performs basic electrical and logical checks on key components in 2 to 10 minutes; the extended self-test, which conducts a thorough scan of all data sectors and may take several hours depending on device capacity; and the conveyance self-test, a brief routine lasting a few minutes that checks for damage incurred during shipping or handling. These tests focus on attributes like error rates and seek performance but do not exhaustively exercise every monitored parameter. Short and extended tests can operate in offline mode for background execution without blocking I/O or in captive mode for immediate, foreground completion. The conveyance test is typically captive and vendor-specific in its assessments. Execution varies by interface protocol. In ATA and SATA environments, self-tests are initiated using the SMART EXECUTE OFF-LINE IMMEDIATE command (opcode B0h, feature D4h) with the test type encoded in the LBA Low (sector number) register (offline: 01h for short, 02h for extended, 03h for conveyance; captive: 81h for short, 82h for extended, 83h for conveyance). For NVMe devices, the Device Self-test admin command (opcode 14h) specifies the test via Command Dword 10 (e.g., 01h for short, 02h for extended), targeting the controller, a specific namespace, or all namespaces through the command's NSID field. In SCSI and SAS implementations, the SEND DIAGNOSTIC command (opcode 1Dh) triggers tests by setting the SELFTEST bit for a default self-test or using SELF-TEST CODE values (e.g., 001b for background short, 010b for background extended), which map to equivalent ATA S.M.A.R.T. operations in translated environments. Results are logged with pass/fail status, detailed error codes indicating failure points (e.g., segment number or failing logical block address), and timestamps based on power-on hours. In ATA/SATA, outcomes populate the Self-Test Log (log address 06h, supporting up to 21 entries in a circular buffer) and optionally the Extended Self-Test Log (07h) for comprehensive failure data. NVMe records results in the Device Self-test Log Page (identifier 06h, covering the last 20 tests) and integrates with the SMART/Health Information Log (02h) for status and completion metrics. SCSI logs capture similar diagnostics in the Self-Test Results log page, with translation layers mapping ATA log formats where applicable. Logs include abort indicators if a test is interrupted, ensuring traceability. When needed, self-tests can be aborted mid-execution, via a dedicated abort code (e.g., Fh in the NVMe Self-test Code field or 7Fh in the ATA LBA Low register), to prioritize urgent operations, with the outcome recorded as "aborted" in the log. Offline modes facilitate non-disruptive scheduling, such as nightly extended tests, while recommended polling intervals (stored in device attributes) guide host queries for completion without constant polling. This enables proactive interventions, like preemptive drive replacement, based on failure indications in logged results.
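The ATA self-test encodings can be captured in a small helper. Per the ATA command set, feature D4h selects EXECUTE OFF-LINE IMMEDIATE and the subcommand value goes in the LBA Low register, with captive variants formed by setting the high bit; this Python sketch encodes that mapping:

```python
# Subcommand values written to the LBA Low register for
# SMART EXECUTE OFF-LINE IMMEDIATE (opcode B0h, feature D4h).
OFFLINE_TESTS = {"short": 0x01, "extended": 0x02, "conveyance": 0x03}
CAPTIVE_FLAG = 0x80   # captive variants: 0x81, 0x82, 0x83
ABORT = 0x7F          # aborts a self-test running in off-line mode

def self_test_subcommand(test: str, captive: bool = False) -> int:
    """Return the LBA Low register value encoding the requested self-test."""
    code = OFFLINE_TESTS[test]
    return code | CAPTIVE_FLAG if captive else code

print(hex(self_test_subcommand("extended", captive=True)))  # 0x82
```

Issuing the command itself requires an OS passthrough interface (e.g., an ioctl wrapping the B0h opcode); the helper only prepares the register value.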

Logging and Commands

In Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), structures vary by to store health attributes, self-test results, and diagnostic in , ensuring persistence across power cycles unless explicitly reset by operations like or controller reset. For and devices, the primary S.M.A.R.T. structures include the Device SMART Data structure, which holds up to 30 attributes, and dedicated pages accessible via the General Purpose (GPL) feature set. These encompass attribute values, thresholds, and vendor-specific , with the Extended SMART Self-Test (log address 07h) supporting a of up to 3,449 pages containing detailed test descriptors, including logical block addresses (LBAs), execution status, timestamps, and failing sectors. The standard SMART Self-Test (log address 06h) is a fixed 512-byte structure that maintains up to 21 read-only entries in a , each entry comprising 24 bytes with fields for self-test number, status, code, timestamp, and failure details. The SMART Command Transport (SCT) protocol extends logging and error recovery capabilities for ATA devices by encapsulating advanced commands within S.M.A.R.T. log pages, specifically using log addresses E0h (SCT Command/Status) and E1h (SCT Transfer) to enable host-device communication for tasks like temperature logging, error recovery controls, and feature reconfiguration without disrupting standard ATA operations. SCT commands are transported via the existing S.M.A.R.T. infrastructure, allowing up to 512 bytes per transfer, and support enterprise-grade features such as write-scratchpad operations for temporary buffering during diagnostics. These logs persist across power cycles, with attribute autosave enabled via the SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE command to ensure critical like power-on hours and sector error counts is retained in non-volatile storage. Access to ATA S.M.A.R.T. 
logs is facilitated by the READ LOG EXT command (opcode 2Fh), which retrieves data using the PIO Data-In protocol and supports 48-bit addressing for larger logs. The command specifies the log address (e.g., 06h for self-test results), page count (minimum 1, up to the log's capacity), and starting page number, returning the requested sectors upon completion or aborting with an error if the log is unsupported or the parameters are invalid. For example, invoking READ LOG EXT with log address 04h accesses the Device Statistics log, providing cumulative metrics such as logical sectors read and power-on resets. In NVMe implementations, S.M.A.R.T.-equivalent logging uses composite log pages, notably the SMART/Health Information log (log page identifier 02h), a 512-byte structure aggregating controller-wide health data including critical warnings (e.g., available spare below threshold), composite temperature in kelvins, percentage used (0-255, saturating at 255), data units read/written in units of 1,000 512-byte blocks, power cycles, unsafe shutdowns, and media errors—all retained across power cycles and updated while the device is powered on. The NVMe Get Log Page admin command (opcode 02h) retrieves these log pages, with parameters specifying the log page identifier (e.g., 02h for SMART/Health), the number of dwords to transfer, the namespace identifier (NSID FFFFFFFFh for controller scope), and an optional offset for partial reads via scatter-gather lists or physical region pages. This command is mandatory for all NVMe controllers (per-namespace SMART/Health reporting is indicated in the Log Page Attributes field of the Identify Controller data structure) and aborts with an error if the specified log exceeds the available size or the NSID is invalid.
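As a sketch of the SMART/Health layout just described, the following decodes a few fields from a 512-byte log page buffer. Offsets follow the NVMe specification's published layout (byte 0 critical warning, bytes 1-2 composite temperature in kelvins, byte 5 percentage used, 16-byte little-endian counters from offset 32 onward); the sample buffer is fabricated for illustration:

```python
import struct

# Minimal sketch: decode selected fields of the 512-byte NVMe SMART/Health
# Information log page (identifier 02h).
def parse_nvme_smart(log: bytes) -> dict:
    assert len(log) == 512
    critical_warning = log[0]
    composite_temp_k = struct.unpack_from("<H", log, 1)[0]   # kelvins
    percentage_used = log[5]                                 # saturates at 255
    # 16-byte little-endian counters; data units are 1,000 * 512-byte blocks
    data_units_read = int.from_bytes(log[32:48], "little")
    power_cycles = int.from_bytes(log[112:128], "little")
    unsafe_shutdowns = int.from_bytes(log[144:160], "little")
    return {
        "spare_below_threshold": bool(critical_warning & 0x01),
        "temp_celsius": composite_temp_k - 273,
        "percentage_used": percentage_used,
        "bytes_read": data_units_read * 1000 * 512,
        "power_cycles": power_cycles,
        "unsafe_shutdowns": unsafe_shutdowns,
    }

# Fabricated example: spare-capacity warning set, 40 °C, 12% used,
# one million data units read.
page = bytearray(512)
page[0] = 0x01
struct.pack_into("<H", page, 1, 313)       # 313 K = 40 °C
page[5] = 12
page[32:40] = (1_000_000).to_bytes(8, "little")
info = parse_nvme_smart(bytes(page))
```

Note the scaling: a "data unit" counts 1,000 blocks of 512 bytes, so the byte total is the counter multiplied by 512,000.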
In SCSI and SAS environments, logging occurs through log pages accessed via the LOG SENSE command (opcode 4Dh), with the Self-Test Results log (page code 10h) serving as the primary S.M.A.R.T.-like structure, consisting of a 4-byte header followed by up to 20 parameter entries (each carrying 16 bytes of parameter data), ordered by recency and including self-test number, status, self-test code, timestamp, first-failure address, and sense keys. SCSI self-test logs support retention across power cycles when the device saves parameters to non-volatile media (controlled by the DS bit in log parameters and the SP bit in LOG SELECT), though this is implementation-dependent and may reset on hard resets or I_T nexus loss if not configured for persistence. The LOG SENSE command retrieves these pages by specifying the page code (e.g., 10h), subpage code (00h), and allocation length, translating ATA-equivalent data where applicable via SCSI/ATA Translation (SAT) mappings, such as converting self-test status to SCSI sense codes. Overall, S.M.A.R.T. log retention ensures up to 21 self-test entries in ATA, 20 in SCSI, and comprehensive health metrics in NVMe persist through power events, enabling reliable post-cycle analysis without data loss.
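The self-test results parameters retrieved via LOG SENSE can be decoded along these lines. Fields are big-endian per SCSI convention; the 20-byte parameter below (4-byte header plus 16 bytes of data) is a fabricated example, and offsets are hedged as illustrative of the documented layout:

```python
import struct

# Sketch: decode one parameter of the SCSI Self-Test Results log page (10h).
def decode_self_test_param(param: bytes) -> dict:
    code, _ctrl, _length = struct.unpack_from(">HBB", param, 0)
    byte4 = param[4]
    (power_on_hours,) = struct.unpack_from(">H", param, 6)
    first_failure_lba = int.from_bytes(param[8:16], "big")
    sense_key, asc, ascq = param[16] & 0x0F, param[17], param[18]
    return {
        "parameter_code": code,          # 0001h = most recent test
        "self_test_code": byte4 >> 5,    # e.g., 1 = background short
        "result": byte4 & 0x0F,          # 0h = completed without error
        "power_on_hours": power_on_hours,
        "first_failure_lba": first_failure_lba,
        "sense": (sense_key, asc, ascq),
    }

# Fabricated example: a background short test that completed without error
# at 500 power-on hours (all-ones LBA means no failing address).
sample = (
    struct.pack(">HBB", 0x0001, 0x03, 0x10)      # header: code, control, length
    + bytes([0x20, 0x01])                        # self-test code/result, number
    + struct.pack(">H", 500)                     # accumulated power-on hours
    + ((1 << 64) - 1).to_bytes(8, "big")         # address of first failure
    + bytes([0x00, 0x00, 0x00, 0x00])            # sense key, ASC, ASCQ, vendor
)
result = decode_self_test_param(sample)
```

Parameter code 0001h holds the most recent test, so walking codes 0001h-0014h reproduces the recency ordering noted above.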

Limitations

Accuracy and Reliability

The accuracy of Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) in predicting disk failures has been evaluated through large-scale empirical studies, revealing moderate effectiveness with notable limitations. A seminal analysis by Google of over 100,000 drives found that while certain S.M.A.R.T. attributes, such as scan errors and reallocation counts, correlate with increased failure risk—drives with one or more scan errors being 39 times more likely to fail within 60 days—more than 56% of failed drives exhibited no abnormalities in the four most predictive attributes. Even considering all available S.M.A.R.T. parameters, over 36% of failures showed zero counts across them, indicating limited prognostic value for individual drive predictions. Backblaze's examinations of hundreds of thousands of drives in its data centers provided further insight into S.M.A.R.T.'s predictive capabilities, focusing on attributes like reported uncorrectable errors (SMART 187) and command timeouts (SMART 188). In a study of operational and failed drives, 76.7% of failed units displayed one or more of five key attributes exceeding zero, suggesting a detection rate in the 70-80% range for these metrics, though this varied by drive model and workload. However, the analysis emphasized that such thresholds alone yield inconsistent results, with cumulative error trends over time offering better but still imperfect foresight into impending failures. S.M.A.R.T. systems are prone to false positives and negatives, which undermine their reliability in proactive maintenance. Overly sensitive thresholds, such as non-zero values in uncorrectable sector counts, trigger alerts in about 4.2% of healthy operational drives, leading to unnecessary replacements and operational overhead. Conversely, false negatives occur in 23.3% of failed drives where no warning attributes activate, and more advanced models using only S.M.A.R.T.
data can exhibit false negative rates as high as 69%, particularly for short prediction windows, as undetected degradations or sudden failures in SSDs evade monitoring. In SSDs, attributes tracking NAND wear levels may fail to flag latent failures until wear exceeds critical points, contributing to unanticipated breakdowns. Several factors compromise S.M.A.R.T.'s overall reliability, including variations across vendors in attribute reporting and interpretation. Raw values for attributes like read error rate (attribute 1) differ significantly by manufacturer, with some normalizing them uniformly to 100 regardless of actual health, hindering cross-vendor comparisons and consistent failure forecasting. Additionally, incomplete attribute exposure—where drives support only a subset of the 70+ possible S.M.A.R.T. parameters, varying by model—limits diagnostic depth, as critical metrics like high-fly writes may not be accessible. The absence of standardized prediction models exacerbates these issues, as thresholds and algorithms remain proprietary, reducing interoperability and universal applicability. Recent analyses (as of 2025) highlight incremental improvements in S.M.A.R.T. for SSDs under NVMe protocols, where enhanced attributes for spare capacity and media wearout enable better endurance tracking and failure anticipation compared to earlier generations. However, persistent gaps remain for HDDs, where mechanical failure modes like head crashes continue to evade reliable prediction due to delayed attribute changes.
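The practical impact of these error rates can be illustrated with a short Bayes-style calculation. The 76.7% detection rate and 4.2% false-positive rate come from the figures above; the 2% annual base failure rate is an assumption for illustration, not a value from the cited studies:

```python
# Illustrative arithmetic (assumed 2% base failure rate): even a decent
# detector produces mostly false alarms when failures are rare.
def positive_predictive_value(sensitivity: float,
                              false_positive_rate: float,
                              base_rate: float) -> float:
    true_pos = sensitivity * base_rate                 # failing drives flagged
    false_pos = false_positive_rate * (1 - base_rate)  # healthy drives flagged
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(0.767, 0.042, 0.02)
# roughly 0.27: only about 1 in 4 flagged drives would actually fail
```

This is why fleet operators weight cumulative error trends over single-threshold alerts: raising precision matters as much as raising the detection rate.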

Integration Challenges

Accessing S.M.A.R.T. data in host operating systems often requires specialized tools due to limited built-in support. In Linux distributions, utilities like smartctl from the smartmontools package must be installed and configured to query drive attributes, as native kernel interfaces do not provide direct visibility without additional setup. Similarly, Windows lacks comprehensive native S.M.A.R.T. reporting, relying on third-party tools or vendor-specific applications for data retrieval. RAID configurations exacerbate these visibility issues, as controllers typically present logical volumes to the operating system, obscuring individual physical drive details and preventing standard S.M.A.R.T. queries. To overcome this, administrators must use controller-specific tools—such as MegaCLI for LSI/Avago controllers or storcli for Broadcom setups—or employ pass-through commands like smartctl -d megaraid,N /dev/sdX to access the underlying drives. This added complexity can delay failure detection in enterprise environments where hardware RAID is common. Vendor-specific implementations further complicate integration, as S.M.A.R.T. attribute thresholds and interpretations are proprietary and not standardized across manufacturers. For instance, Seagate drives require the SeaTools diagnostic software to accurately assess overall health via a pass/fail status, without exposing detailed threshold values to third-party tools, which may yield inconsistent results. Likewise, Crucial and Micron SSDs demand their Storage Executive software for correct attribute labeling and threshold evaluation, as generic utilities often misinterpret vendor-defined metrics like retired NAND blocks or lifetime remaining. This reliance on vendor tooling creates lock-in, hindering unified monitoring across heterogeneous storage fleets. Power management features in modern drives can conflict with S.M.A.R.T. operations, particularly offline data collection, which is often suspended during low-power states such as sleep or standby to minimize power use.
According to ATA specifications, drives support capabilities like "Suspend Offline collection upon new command," allowing background scans to pause when entering power-saving modes, but this interrupts periodic monitoring and may leave attribute values stale upon resumption. In systems with aggressive power-management policies, such as laptops or energy-efficient servers, this suspension reduces the timeliness of failure warnings. As of 2025, integrating S.M.A.R.T. into cloud and edge environments introduces additional hurdles in aggregating telemetry across distributed fleets. Large-scale systems, like those in hyperscale data centers, face challenges in collecting and centralizing S.M.A.R.T. logs from thousands of drives due to varying controller access methods, collection latencies, and incomplete visibility from virtualized or disaggregated storage. Research on proactive failure prediction highlights that incomplete aggregation of these logs can impair reliability modeling, as not all drives report data uniformly in dynamic setups.
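In practice, tools like smartctl expose attribute-versus-threshold data that monitoring scripts can evaluate uniformly across a fleet. A minimal sketch, assuming smartctl's JSON output schema (`smartctl --json /dev/sdX`); the abbreviated report below is fabricated for illustration:

```python
import json

# Fabricated, abbreviated report in the shape of smartctl's JSON output.
sample = json.loads("""
{
  "ata_smart_attributes": {
    "table": [
      {"id": 5,   "name": "Reallocated_Sector_Ct", "value": 100, "thresh": 36},
      {"id": 197, "name": "Current_Pending_Sector", "value": 30,  "thresh": 45}
    ]
  }
}
""")

def failing_attributes(report: dict) -> list[str]:
    # A normalized value at or below its threshold signals predicted failure.
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return [attr["name"] for attr in table if attr["value"] <= attr["thresh"]]

alerts = failing_attributes(sample)
# alerts == ["Current_Pending_Sector"]
```

Centralizing such per-drive checks into one normalized comparison is one way fleets sidestep vendor-specific raw-value interpretation, at the cost of the coarser pass/fail semantics discussed above.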

    Nov 12, 2024 · As of the end of Q3 2024, Backblaze was monitoring 292,647 hard disk drives (HDDs) and solid state drives (SSDs) in our cloud storage ...