Document Scope
SMART tests can be intricate and overwhelming if you're not sure where to focus. This article explains what the tests do, when to schedule them, and which attributes to monitor.
SMART Summary - Self-Monitoring, Analysis, and Reporting Technology
This monitoring system is integrated with various disk drives, including traditional HDDs, SSDs, and eMMC. Its primary goal is to proactively monitor drives, aiming to identify potential failing drives before they actually fail. This is achieved through reporting on a wide array of indicators and attributes, which can be overwhelming. Unfortunately, these indicators and attributes lack standardization across the industry, and even seemingly similar metrics from different vendors are often interpreted differently.
Our objective is to provide clarity on this subject, making this monitoring system more valuable for our users and helping prevent data loss.
Manually Running Tests
Ensure that smartmontools is installed. If not, please install it based on your operating system using either yum or apt-get.
Installation using YUM (for Red Hat, CentOS, Fedora, etc.):
sudo yum install smartmontools
Installation using APT (for Debian, Ubuntu, etc.):
sudo apt-get install smartmontools
Confirm SMART is Supported:
sudo smartctl -i /dev/<device>
Executing Tests:
sudo smartctl -t short /dev/<device>
sudo smartctl -t long /dev/<device>
sudo smartctl -t conveyance /dev/<device>
Gathering Specific Time Frames (the drive's capabilities and the estimated duration of each test):
sudo smartctl -c /dev/<device>
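The test commands above return immediately while the drive runs the test in the background, so the result has to be read from the self-test log afterwards. The following is a minimal sketch of that sequence, assuming bash; the function name run_short_test and the 120-second default wait are illustrative assumptions, and the wait should be adjusted to the polling time that smartctl -c reports for your drive.

```shell
#!/usr/bin/env bash
# Sketch: start a short self-test, wait, then print the self-test log.
# run_short_test and the 120-second default are illustrative assumptions.
run_short_test() {
    local device="$1"
    local wait_seconds="${2:-120}"

    # Start the test; smartctl returns right away while the drive tests itself.
    sudo smartctl -t short "$device"

    # Give the drive time to finish before reading the results.
    sleep "$wait_seconds"

    # Print the self-test log, which lists the most recent test first.
    sudo smartctl -l selftest "$device"
}

# Example (replace /dev/sdX with your device):
#   run_short_test /dev/sdX
```

The same pattern works for long and conveyance tests by changing the -t argument and allowing a longer wait.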
Testing Schedule
The following tests serve as a solid foundation for proactive drive monitoring. I recommend keeping the frequency of these tests while adjusting the timing to suit the system and environment. It is advisable to schedule them during off-peak hours. As an example, I typically schedule short tests at midnight on Fridays and long tests at 8 pm on Sundays.
Test | Description | Frequency
Short | A brief test lasting ≤ 2 minutes to identify a defective drive. Three separate tests, including an electrical test, a mechanical test, and a Read/Verify of a portion of the disk, can reliably confirm a faulty disk in a short amount of time. The contents and location of the area read and verified may vary between manufacturers, but these are still valuable regular tests for monitoring disk health. | Weekly to daily, depending on server role and criticality of data.
Long | Similar to the Short test, but with two distinct differences. Firstly, there is no time limit, and secondly, the entire disk is read. This results in a longer test duration directly related to the size of the disk, ensuring no area of usable disk space is overlooked. | Monthly to weekly, depending on server role and criticality of data.
Conveyance | Exclusive to ATA drives, this test takes only a few minutes. It is used after disks have been transported, to identify any damage incurred during transit before use. | Recommended only if disks have been moved to a new location/system.
Scheduling Tests
To streamline and automate testing, scheduling can be implemented using the smartd.conf file. Below is an example:
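/dev/<device> -a -o on -m user@domain.com -s (S/../../1/05|L/../../6/01)
In this example, the specified device is monitored for all SMART features, automatic offline data collection is enabled, and a warning email is sent to user@domain.com when smartd detects a problem. The -s pattern has the form test type/month/day/day-of-week/hour, with a two-digit hour and days of the week numbered 1 (Monday) through 7 (Sunday). The Short test is therefore scheduled for the 1st day of the week (Monday) at 5 AM, and the Long test for the 6th day of the week (Saturday) at 1 AM.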
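Schedule patterns are easy to get subtly wrong, so it can help to generate and check them before editing smartd.conf. The following is a minimal sketch, assuming bash and assuming smartd compares the pattern against the whole time string in its documented T/MM/DD/d/HH form (two-digit hour, day of week 1 for Monday through 7 for Sunday). The helper names smartd_schedule and schedule_matches are hypothetical and are not part of smartmontools.

```shell
#!/usr/bin/env bash
# Sketch: build and check smartd "-s" schedule patterns (T/MM/DD/d/HH).
# smartd_schedule and schedule_matches are hypothetical helper names.

# smartd_schedule TYPE DAY_OF_WEEK HOUR
#   TYPE: S (short), L (long) or C (conveyance)
#   DAY_OF_WEEK: 1 (Monday) through 7 (Sunday)
#   HOUR: 0 through 23, written without a leading zero
smartd_schedule() {
    local type="$1" dow="$2" hour="$3"
    # ".." leaves month and day of month unconstrained; the hour is zero-padded.
    printf '%s/../../%d/%02d\n' "$type" "$dow" "$hour"
}

# schedule_matches REGEX TIME_STRING
#   Succeeds only if the whole TIME_STRING (e.g. S/03/14/5/00) matches REGEX.
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Short test on Fridays at midnight, long test on Sundays at 8 pm.
schedule="($(smartd_schedule S 5 0)|$(smartd_schedule L 7 20))"
echo "$schedule"
```

The resulting string can be placed after -s in smartd.conf, and schedule_matches can be used to confirm that a given time string would trigger a test.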
Further options and detailed explanations are provided below.
Flag | Definition | Notes
-a | monitors all SMART features |
-d | specifies the device type (interface) explicitly | Useful when scheduling a different test for "ATA" vs "SCSI" drives
-m | specifies the email address for notifications |
-M | specifies the type of email | Default is "once"; other options are "daily", "diminishing", or "test"
-M exec | specifies the path to a script that smartd runs in place of the default mail command when it sends a notification |
-n | prevents spin-up due to smartd polling | "never", "sleep", "standby", or "idle"
-o | automatic offline data collection | Takes "on" or "off"; recommended in almost every case
-s | schedules self-tests | Regular expression of the form test type/month/day/day-of-week/hour
-S | autosave of device vendor-specific attributes | Takes "on" or "off"
What Metrics Matter
SMART # | Name | Definition
5 | Reallocated_Sector_Count | Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive that has had any reallocations at all is significantly more likely to fail in the following months.
10 | Spin Retry Count | Count of spin start retries. This attribute stores the total count of spin start attempts made to reach fully operational speed after the first attempt was unsuccessful. An increase in this value is a sign of problems in the hard disk's mechanical subsystem.
187 | Reported_Uncorrectable_Errors | The count of errors that could not be recovered using hardware ECC.
188 | Command_Timeout | The count of operations aborted due to HDD timeout. Normally this value should be zero.
194 | Temperature | Indicates the device temperature, if the appropriate sensor is fitted.
196 | Reallocation Event Count | Count of remap operations. The raw value shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.
197 | Current_Pending_Sector_Count | Count of "unstable" sectors (waiting to be remapped because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. A read error does not remap the sector immediately, because the correct data cannot be read (so the value to remap is not known) and the sector might become readable later. Instead, the drive firmware remembers that the sector needs to be remapped and remaps it the next time it is written. However, some drives do not immediately remap such sectors when written; they first attempt to write to the problem sector, and if the write succeeds, the sector is marked good (in this case, the "Reallocation Event Count" (0xC4) is not increased). This is a serious shortcoming: if such a drive contains marginal sectors that consistently fail only some time after a successful write, the drive will never remap them.
198 | Offline_Uncorrectable | The total count of uncorrectable errors when reading/writing a sector. A rise in this value indicates defects of the disk surface and/or problems in the mechanical subsystem.
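Checking these attributes by eye gets tedious across many drives. The following is a minimal sketch, assuming bash and awk, that reads smartctl -A output on standard input and warns about non-zero raw values for the error counters listed above. The function name check_smart_attributes is hypothetical, and the sketch assumes the usual ATA attribute table layout (attribute ID in column 1, name in column 2, raw value in column 10). Because vendors encode raw values differently, treat a warning as a prompt to look closer rather than a verdict.

```shell
#!/usr/bin/env bash
# Sketch: flag non-zero raw values for the attributes discussed above.
# check_smart_attributes is a hypothetical name. It reads `smartctl -A`
# output on stdin and assumes column 1 is the attribute ID, column 2 the
# attribute name and column 10 the raw value.
check_smart_attributes() {
    awk '
        # Temperature (194) is left out because any reading is non-zero.
        $1 ~ /^(5|10|187|188|196|197|198)$/ && $10 + 0 > 0 {
            printf "WARNING: attribute %s (%s) raw value is %s\n", $1, $2, $10
            found = 1
        }
        # Exit status 1 signals that at least one attribute was flagged.
        END { exit found ? 1 : 0 }
    '
}

# Example (replace /dev/sdX with your device):
#   sudo smartctl -A /dev/sdX | check_smart_attributes || echo "drive needs attention"
```

A non-zero exit status makes the function easy to call from a cron job or monitoring script.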
When a Disk is Suspected Bad
Confirm the device in question. Run a long test to thoroughly scan and test the entire disk. Note that this process may take several hours but provides a comprehensive overview of the disk's health.
Attach the SMART log using the command:
smartctl -x /dev/<device>
This information can be submitted with an Exxact ticket to validate an RMA replacement within the 3-year warranty provided by Exxact. If the system is older than that, contact the drive manufacturer directly, as most offer a 5-year warranty.
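To attach the log to a ticket, it is convenient to write it to a file. The following is a minimal sketch, assuming bash; the function name save_smart_log and the file-name pattern are illustrative assumptions rather than a required format.

```shell
#!/usr/bin/env bash
# Sketch: save the full SMART report to a file that can be attached to a ticket.
# save_smart_log and the file-name pattern are illustrative assumptions.
save_smart_log() {
    local device="$1"
    local out_dir="${2:-.}"
    local out_file
    # Name the file after the device and today's date (YYYY-MM-DD).
    out_file="${out_dir}/smart-$(basename "$device")-$(date +%F).log"

    # -x prints all SMART and non-SMART information about the device.
    # smartctl's exit status is a bit mask that can be non-zero for a failing
    # drive even when the report was written, so it is not treated as fatal.
    sudo smartctl -x "$device" > "$out_file"

    # Print the path so the caller knows which file to attach.
    echo "$out_file"
}

# Example (replace /dev/sdX with your device):
#   save_smart_log /dev/sdX /tmp
```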