SMART Tests

Document Scope

SMART tests can be intricate and overwhelming if you're not sure where to focus. This article explains what the tests do, when to schedule them, and which attributes to monitor.

SMART Summary - Self-Monitoring, Analysis, and Reporting Technology

This monitoring system is built into many disk drives, including traditional HDDs, SSDs, and eMMC. Its primary goal is to monitor drives proactively and identify failing drives before they actually fail. It does this by reporting on a wide array of indicators and attributes, which can be overwhelming. Unfortunately, these indicators and attributes are not standardized across the industry, and even seemingly similar metrics from different vendors are often interpreted differently.

Our objective is to provide clarity on this subject, making this monitoring system more valuable for our users and helping prevent data loss.

Manually Running Tests

Ensure that smartmontools is installed. If it is not, install it with the package manager for your operating system, either yum or apt-get.

Installation using YUM (for Red Hat, CentOS, Fedora, etc.):

sudo yum install smartmontools

Installation using APT (for Debian, Ubuntu, etc.):

sudo apt-get install smartmontools

Confirm SMART is Supported:

sudo smartctl -i /dev/<device>
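The check above can be scripted. The following is a minimal sketch, assuming the report contains the "SMART support is:" lines that smartctl typically prints for ATA drives; other drive types (NVMe, for example) may word this differently, in which case the helper reports "unknown". The function name smart_status is hypothetical and not part of smartmontools.

```shell
# Sketch: classify SMART support from the text printed by `smartctl -i`.
# smart_status is a hypothetical helper; it reads the report on stdin.
smart_status() {
    awk '
        /SMART support is:/ && /Unavailable/ { unavailable = 1 }
        /SMART support is:/ && /Enabled/     { enabled = 1 }
        /SMART support is:/ && /Disabled/    { disabled = 1 }
        END {
            if (unavailable)   print "unsupported"
            else if (enabled)  print "enabled"
            else if (disabled) print "disabled"
            else               print "unknown"   # wording not recognized
        }'
}

# Typical use on a real system (requires root and smartmontools):
#   sudo smartctl -i /dev/<device> | smart_status
```

If the result is "disabled", SMART can usually be turned on with sudo smartctl -s on /dev/<device>.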

Executing Tests:

sudo smartctl -t short /dev/<device>
sudo smartctl -t long /dev/<device>
sudo smartctl -t conveyance /dev/<device>

Gathering Estimated Test Durations (the capabilities report lists the recommended polling time for each test):

sudo smartctl -c /dev/<device>
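To pull just the estimated durations out of the capabilities report, something like the sketch below may help. It assumes the two-line layout that smartctl commonly prints for ATA drives ("Short self-test routine" followed by "recommended polling time: ( 2) minutes."); other drive types may format this differently. The function name test_durations is hypothetical.

```shell
# Sketch: extract the estimated self-test durations (in minutes) from the
# text printed by `smartctl -c`. Reads the report on stdin.
test_durations() {
    awk '
        # Remember which test the next polling-time line belongs to.
        /^(Short|Extended|Conveyance) self-test routine/ { kind = $1; next }
        /recommended polling time:/ && kind != "" {
            minutes = $0
            gsub(/[^0-9]/, "", minutes)   # keep only the digits, e.g. "( 255)" -> "255"
            print kind, minutes
            kind = ""
        }'
}

# Typical use on a real system (requires root and smartmontools):
#   sudo smartctl -c /dev/<device> | test_durations
```

Each output line has the form "Short 2" or "Extended 255", which makes it easy to decide how long to wait before checking the results.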

Testing Schedule

The following tests are a solid foundation for proactive drive monitoring. I recommend keeping the frequency of these tests while adjusting the timing to suit the system and environment. It is advisable to schedule them during off-peak hours. As an example, I typically schedule short tests at midnight on Fridays and long tests at 8 PM on Sundays.

Test	Description	Frequency
Short	A brief test lasting ≤ 2 minutes to identify a defective drive. It consists of three separate tests: an electrical test, a mechanical test, and a read/verify of a portion of the disk. Together these can reliably confirm a faulty disk in a short amount of time. The contents and location of the area that is read and verified may vary between manufacturers, but the test is still a valuable regular check of disk health.	Weekly to daily, depending on server role and criticality of data.
Long	Similar to the Short test, with two differences: there is no time limit, and the entire disk is read. The test duration is therefore directly related to the size of the disk, and no area of usable disk space is overlooked.	Monthly to weekly, depending on server role and criticality of data.
Conveyance	Exclusive to ATA drives, this test takes only a few minutes. It is used after disks have been transported, to identify any damage incurred during transit before they are put into use.	Recommended only if disks have been moved to a new location/system.

Scheduling Tests

To streamline and automate testing, scheduling can be implemented using the smartd.conf file. Below is an example:

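/dev/<device> -a -o on -m user@domain.com -s (S/../../1/05|L/../../6/01)

In this example, the specified device is monitored for all SMART features, and automatic offline data collection is enabled. A warning email is sent to user@domain.com when smartd detects a failure or a new error. The -s directive takes a regular expression of the form T/MM/DD/d/HH (test type, month, day of month, day of week, hour), where the day of week runs from 1 (Monday) to 7 (Sunday) and the hour is written with two digits. The Short test is therefore scheduled for Mondays at 5 AM and the Long test for Saturdays at 1 AM.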
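Because the hour field must be two digits, it is easy to get the schedule expression wrong by hand. The sketch below builds the expression from a test type, a day of week, and an hour, and prints a smartd.conf line for the Friday-midnight and Sunday-8-PM schedule mentioned under Testing Schedule. The function name schedule_regex is hypothetical; pass the hour without a leading zero.

```shell
# Sketch: build a smartd -s schedule expression of the form T/MM/DD/d/HH.
# schedule_regex TYPE DAY_OF_WEEK HOUR
#   TYPE:        S (short), L (long), or C (conveyance)
#   DAY_OF_WEEK: 1 (Monday) through 7 (Sunday)
#   HOUR:        0 through 23, without a leading zero
schedule_regex() {
    # ".." matches any month and any day of month; the hour is zero-padded.
    printf '%s/../../%d/%02d\n' "$1" "$2" "$3"
}

short=$(schedule_regex S 5 0)    # Fridays at midnight
long=$(schedule_regex L 7 20)    # Sundays at 8 PM

# Print a candidate smartd.conf line; replace the device placeholder before use.
printf '/dev/<device> -a -m user@domain.com -s (%s|%s)\n' "$short" "$long"
```

Run as written, this prints /dev/<device> -a -m user@domain.com -s (S/../../5/00|L/../../7/20).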

Further options and detailed explanations are provided below.

Flag	Definition	Notes
-a	Monitors all SMART features
-d	Specifies the device type explicitly	Allows scheduling a different test for "ata" vs. "scsi" drives
-m	Specifies the email address for notifications
-M	Specifies the type of email delivery	Default is "once"; other options are "daily", "diminishing", or "test"
-M exec	Specifies the path to a script that smartd runs instead of the default mail command when it needs to send a notification
-n	Prevents spin-up due to smartd polling	"never", "sleep", "standby", or "idle"
-o	Enables or disables automatic offline data collection ("on" or "off")	Would recommend this be used in almost every case
-s	Schedules self-tests using a regular expression	test type/month/day/day-of-week/hour
-S	Enables or disables autosave of device vendor-specific attributes ("on" or "off")

What Metrics Matter

SMART #	Name	Definition
5	Reallocated_Sector_Count	Count of reallocated sectors. The raw value is the number of bad sectors that have been found and remapped, so the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a measure of the drive's life expectancy; a drive that has had any reallocations at all is significantly more likely to fail in the following months.
10	Spin_Retry_Count	Count of spin-start retries. This attribute stores the total number of attempts to spin up to full operational speed after the first attempt was unsuccessful. An increase in this value is a sign of problems in the drive's mechanical subsystem.
187	Reported_Uncorrectable_Errors	Count of errors that could not be recovered using hardware ECC.
188	Command_Timeout	Count of operations aborted due to a drive timeout. Normally this value should be zero.
194	Temperature	The device temperature, if the appropriate sensor is fitted.
196	Reallocation_Event_Count	Count of remap operations. The raw value shows the total number of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.
197	Current_Pending_Sector_Count	Count of "unstable" sectors that are waiting to be remapped because of unrecoverable read errors. If an unstable sector is later read successfully, the sector is remapped and this value decreases. A read error does not remap the sector immediately, because the correct data cannot be read (so the value to remap is not known) and the sector might become readable later. Instead, the drive firmware remembers that the sector needs to be remapped and remaps it the next time it is written. Some drives, however, do not remap such sectors immediately on write: the drive first attempts to write to the problem sector, and if the write succeeds, the sector is marked good (in this case, the "Reallocation Event Count" (0xC4) is not increased). This is a serious shortcoming: if such a drive contains marginal sectors that consistently fail only some time after a successful write, the drive will never remap them.
198	Offline_Uncorrectable	Total count of uncorrectable errors when reading or writing a sector. A rise in this value indicates defects of the disk surface and/or problems in the mechanical subsystem.
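As a rough way to keep an eye on the attributes above, the following sketch scans the attribute table printed by smartctl -A and reports any watched attribute whose raw value is above zero. It assumes the ten-column ATA layout (ID in the first column, name in the second, raw value in the tenth); vendors encode some raw values differently, so treat a hit as a prompt to look closer rather than as a verdict. Temperature (194) is left out because a non-zero reading is normal. The function name flag_attributes is hypothetical.

```shell
# Sketch: flag watched SMART attributes with a non-zero raw value.
# Reads the output of `smartctl -A` on stdin.
flag_attributes() {
    awk '
        BEGIN {
            # Attribute IDs from the table above, excluding 194 (temperature).
            n = split("5 10 187 188 196 197 198", ids, " ")
            for (i = 1; i <= n; i++) watch[ids[i]] = 1
        }
        # Attribute rows start with a numeric ID and have at least 10 fields.
        $1 ~ /^[0-9]+$/ && NF >= 10 && ($1 in watch) && ($10 + 0) > 0 {
            printf "WARN %s %s raw=%s\n", $1, $2, $10
        }'
}

# Typical use on a real system (requires root and smartmontools):
#   sudo smartctl -A /dev/<device> | flag_attributes
```

No output means none of the watched attributes reported a raw value above zero.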

When a Disk is Suspected Bad

Confirm the device in question. Run a long test to thoroughly scan and test the entire disk. Note that this process may take several hours but provides a comprehensive overview of the disk's health.
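Once the long test has finished, its result appears as the most recent entry in the self-test log. The sketch below reads the output of smartctl -l selftest and summarizes that entry. It assumes the ATA log layout, in which the newest entry is the line beginning with "# 1"; the function name latest_selftest and the result labels are hypothetical.

```shell
# Sketch: summarize the most recent entry of the SMART self-test log.
# Reads the output of `smartctl -l selftest` on stdin.
latest_selftest() {
    awk '
        /^# *1 / {                       # newest entry is numbered 1
            found = 1
            if ($0 ~ /Completed without error/) print "passed"
            else if ($0 ~ /in progress/)        print "running"
            else                                 print "attention"  # failed, aborted, or interrupted
        }
        END { if (!found) print "no-entry" }'
}

# Typical use on a real system (requires root and smartmontools):
#   sudo smartctl -l selftest /dev/<device> | latest_selftest
```

A result of "attention" covers read failures as well as aborted or interrupted tests, so check the full log line before drawing conclusions.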

Attach the SMART log using the command:

sudo smartctl -x /dev/<device>

This information can be submitted with an Exxact ticket for RMA replacement validation within the 3-year warranty provided by Exxact. If the system is older than that, contact the drive manufacturer directly, as most offer a 5-year warranty.
