SMART disk

Submitted by Peter on Sat, 2010-01-09 13:54

SMART, when applied to disk drives, is Self-Monitoring, Analysis and Reporting Technology. SMART provides a way to detect disk drives that are deteriorating so you can replace them before total failure. Here is a description of what can be reported. You will have to check your operating system to find out how it reports errors.

This article uses an example from Ubuntu 9.10. There are equivalents in modern versions of Windows and Linux. If your operating system does not display and report SMART information, your operating system is not modern.
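If you prefer a command line to the graphical disk utility, the same information is usually available on Linux through the smartctl program from the smartmontools package. The following is only a sketch under assumptions: the device path /dev/sda is an example, the helper names are made up for illustration, and smartctl normally needs administrator rights. The -H option prints the overall health result and -A prints the attribute table.

```python
import shutil
import subprocess


def smartctl_command(device, report="health"):
    """Build the smartctl command line for one device.

    "health" maps to -H (overall health) and "attributes" to -A
    (the attribute table). The device path is supplied by the caller.
    """
    flags = {"health": "-H", "attributes": "-A"}
    return ["smartctl", flags[report], device]


def run_smartctl(device, report="health"):
    """Run smartctl and return its text output, or None if it is not installed."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smartctl_command(device, report),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag disk problems
    )
    return result.stdout


if __name__ == "__main__":
    # /dev/sda is an assumed example path; substitute your own disk.
    print(run_smartctl("/dev/sda", "health"))
```

The output layout differs between smartmontools versions and disk models, so read the text rather than relying on fixed positions.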

Disk utility

Select System, Administration, Disk utility. The following image shows the disk utility. Select disk drives and partitions in the left side column. Notice the line SMART status: Disk is healthy - More information. That More information link takes you to the screens showing the details used to decide the health of the disk.

Ubuntu disk utility display showing a list of disk drives on the left and details for the first selected disk on the right

Disk status

Select the disk you want to check then select More information.

More information part 1 the summary showing the hardware identification, temperature, and repeating the overall status

Temperature

The manufacturer specifies an operating temperature. Our example disk operates between 0 and 60 degrees centigrade. Most disks show temporary errors when approaching the maximum temperature. Try to keep the temperature below 50 degrees.

Some disks will fail the instant they hit the maximum temperature while others will survive a while at temperatures over the limit but their life will be reduced. When your disk passes the maximum of 60 degrees, the lubricant starts to evaporate. If half the lubricant evaporates during a few minutes above the maximum temperature then the number of start/stop cycles is reduced by half and the design life is also halved.

Your disk with the 5 year design life and the 3 year guarantee might last 10 years if kept cool but fail at the end of 3 years if it occasionally runs a little bit too hot. Check the computer cooling fans regularly and clean the dust out of the air vents.

When all your disks are too hot, there are usually other components cooking to death. If just one disk is too hot, that disk is cooking because the bearings are wearing out. The disk might last long enough to run a backup.

Attributes

A list of important attributes following the status display with the good attributes displaying a healthy green light

The list continues with less important attributes

The list continues with unimportant attributes

Read Error Rate

You should have zero read errors. If one occurs, you want immediate notification. A lot of disk failures start with read errors that are recovered but quickly deteriorate to complete failure.

Spin up Time

Spin up time depends on the model of the disk. Measure spin up time when the disk is new then note any significant slow down. A slow spin up could be a low voltage because the computer power supply is overloaded or worn bearings. Replace the disk when the bearings start to wear out.

An overloaded power supply means it is time to upgrade. A bigger power supply will help. New power supplies are more efficient and may use less electricity to deliver the same amount of power to the computer.
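The advice to record spin up time when the disk is new and watch for a slow down can be sketched as a simple comparison. The 20 percent threshold and the function names are assumptions for illustration. No manufacturer figure is implied, so pick a threshold that suits your model of disk.

```python
def spin_up_slowdown(baseline_ms, current_ms):
    """Return the slow down as a fraction of the baseline spin up time."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (current_ms - baseline_ms) / baseline_ms


def is_significant_slowdown(baseline_ms, current_ms, threshold=0.20):
    """True when spin up is slower than the baseline by more than threshold.

    threshold=0.20 (20 percent) is an assumed value, not a specification.
    """
    return spin_up_slowdown(baseline_ms, current_ms) > threshold
```

A flagged slow down only tells you to investigate. Check the power supply first, then suspect the bearings.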

Start/Stop Count

Your disk is rated as capable of starting up a certain number of times, perhaps 50,000 times. If you start your workstation once per day, your disk will last about 137 years (50,000 divided by 365) based purely on start/stop cycles. A notebook disk going in and out of power saving might start 50 times per day and last only 2.7 years.

Note that disks have both a Start/Stop Cycles limit and a Component Design Life. The example disk is designed for 50,000 Start/Stop Cycles or 5 years, whichever comes first.
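The arithmetic for the two limits can be written out as a short calculation. This is a rough sketch: the function names are invented for illustration, and it assumes a steady number of starts per day and a 365 day year.

```python
def years_until_cycle_limit(rated_cycles, starts_per_day):
    """Years until the rated start/stop cycles are used up."""
    if starts_per_day <= 0:
        raise ValueError("starts_per_day must be positive")
    return rated_cycles / (starts_per_day * 365)


def expected_life_years(rated_cycles, starts_per_day, design_life_years):
    """Whichever limit comes first: start/stop cycles or component design life."""
    cycle_years = years_until_cycle_limit(rated_cycles, starts_per_day)
    return min(cycle_years, design_life_years)


if __name__ == "__main__":
    # The example disk: 50,000 cycles or 5 years, whichever comes first.
    for starts in (1, 50):
        print(starts, "starts per day:",
              round(expected_life_years(50_000, starts, 5), 1), "years")
```

For the workstation the design life is the limit that matters. For the notebook the start/stop cycles run out first.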

Reallocation Sector Count

You want zero reallocated sectors when you buy the disk. Some cheap computers use disks that showed errors during manufacturing and were sold off without a brand label.

An occasional reallocation means the disk contains dirt or is wearing out. A sudden burst of reallocations indicates dirt that is moving across the disk and will quickly fill up the reserve area, leading to permanent failure. Your disk manufacturer may not specify the size of the reserve, leaving you with no idea when the disk will fail. You probably have time for one backup. Then replace the disk.

Seek Error Rate

Seek errors indicate the disk is too hot or wearing out. Check the temperature in the status area and replace the disk if the temperature is too high.

The example disk can operate from 0 to 60 degrees centigrade. A temperature near 60 means the disk is running at the edge of the design specification and will occasionally fail. The lubricant will evaporate from the bearings faster than normal. The slightest change in disk activity or air flow or power supply voltage can push the disk over the 60 degree limit and destroy your disk.

Switch off your computer if the disk approaches 50 degrees and investigate why your disk is so hot. Your computer might contain a failed cooling fan or have a vent blocked by dust or be overloaded with components. You might need air conditioning.

One of my friends had his computer sitting on a desk near a window. In the afternoon the sun hit the computer. In summer the computer failed several times before he showed me the location and I recommended moving the computer out of the sun.

Seek Timer Performance

The disk manufacturer supplies a specification for average seek time, in this case 8.9 milliseconds, and might specify full stroke seek time, in this case 18 milliseconds. If the seek time is greater than the specified full stroke seek time then the disk seek mechanism is failing or the disk is dirty. Replace the disk.

I have never found an error of this type. If the seeks fail, disks usually go into a recalibration loop producing a continual clicking sound and your computer slows down or stops.

Power-On Hours

Good brands of disks exceed their design life by up to double and you end up replacing the disks because they are too small or too slow. Servers clock up the largest number of powered on hours because the disks are on 24 hours per day every day but server disks fail less frequently than desktop disks.

Based on observing failures, some brands are less reliable. If a manufacturer has two brands then the second brand usually uses older designs and older manufacturing techniques. Seagate purchased Maxtor at a time when I had a lot of Seagate disks running error free and a few Maxtor disks failing almost as soon as their one year guarantee ran out. I do not know what Maxtor is like now but there is little difference in price. Why would I take the risk to save a few dollars when the cost of replacing a failed disk is hundreds of dollars?

Some models of disks, even from the best brands, are bad news. Many years ago one batch of Seagate disks failed unusually fast. There was a big court case against Fujitsu over some failed disks.

Heat and vibration damage some brands more than others. Good disks have a large safety factor built in while cheap disks may not. Good disks are made in new high precision factories using the best of everything. Cheap disks might be made in older lower specification factories or using older less reliable designs.

If there is no power on hours specified or displayed, replace your disk when it hits the design life.

Spin up Retry Count

This should be zero. If there is a spin up retry, it is usually because your computer power supply cannot provide the brief heavy power load required when disks spin up. For temporary relief, you can set spin up delays on the disks so they are not all starting up at the same time. Start the disks after the fans and everything else go through their start up power munching phase. Long term relief is a better power supply.

Calibration Retry Count

Continual recalibration produces a continual clicking sound and your computer slows down or stops. You might not hear the clicking on modern quiet disks when they are inside a well designed computer. You should be able to hear the clicking with the computer case open.

Replace the disk. Check the temperature first because the disk might be too hot. Replace the disk if recalibration occurs at normal temperatures.

Power Cycle Count

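This is close to a duplicate of Start/Stop Count. It counts full power on/off cycles rather than every start of the disk motor, so it can be lower than Start/Stop Count on a disk that uses power saving.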

Soft read error rate

This is similar to Read Error Rate.

Reported Uncorrectable Errors

This should be zero. It might jump due to wear or heat as described in previous items.

Airflow Temperature

This is the temperature measured on the outside of the disk case and should not be over 30 degrees because your CPU and other components will start to cook.

Temperature

This is the temperature inside the disk case and should not be over 50 degrees centigrade (122 degrees fahrenheit).
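The temperature limits used in this article can be turned into a small check. Treat it as a sketch: the 50 degree warning level and the 60 degree maximum are the figures for the example disk, the labels and function names are made up, and your own disk may be specified differently.

```python
def celsius_to_fahrenheit(celsius):
    """Convert degrees centigrade to degrees fahrenheit."""
    return celsius * 9 / 5 + 32


def classify_temperature(celsius, warn_above=50, max_rated=60):
    """Classify a disk temperature against the example disk's limits.

    warn_above and max_rated are the example figures from this article,
    not values read from the disk.
    """
    if celsius > max_rated:
        return "over limit"
    if celsius > warn_above:
        return "too hot"
    return "ok"
```

Anything other than "ok" is a reason to check the fans and vents before blaming the disk.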

Hardware ECC Recovered

This should be low but there is no absolute definition of low and no relevant specification from manufacturers. You would have to compare several disks of the same model after the same amount of disk activity. Compare this count across disks in a RAID array and consider replacing a disk if it has a far higher count than the other disks in the array.
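One way to compare the count across a RAID array is to measure each disk against the median of the other disks. This is only an illustration. The factor of 10 is an assumption because there is no manufacturer specification to work from, and the disk names and counts in the usage lines are made-up examples.

```python
from statistics import median


def ecc_outliers(counts, factor=10):
    """Return the disks whose count is far above the rest of the array.

    counts maps a disk name to its Hardware ECC Recovered count.
    A disk is flagged when its count is more than `factor` times the
    median of the other disks. factor=10 is an assumed threshold.
    """
    flagged = []
    for disk, count in counts.items():
        others = [c for d, c in counts.items() if d != disk]
        if not others:
            continue  # nothing to compare a single disk against
        baseline = max(median(others), 1)  # avoid a zero baseline
        if count > factor * baseline:
            flagged.append(disk)
    return flagged


if __name__ == "__main__":
    # Hypothetical counts for a four disk array.
    print(ecc_outliers({"sda": 120, "sdb": 150, "sdc": 9000, "sdd": 130}))
```

The comparison only makes sense for disks of the same model that have done a similar amount of work.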

Reallocation Count

This should be zero in a new disk and very low in an old disk. See Reallocation Sector Count.

Current Pending Sector Count

This should be zero.

Uncorrectable Sector Count

This should be zero.

UDMA CRC Error Rate

If this is not zero, your disk could be failing. You might have a damaged data cable from the disk to the motherboard, or the cable might be loose. Your motherboard might contain rubbishy controller chips or your hardware drivers might be using the wrong DMA settings for the chip/disk combination.

Write Error Rate

If this is greater than zero, panic. There are probably other errors listed in earlier items.

Soft Read Error Rate

You will get errors here if there is a temperature error or wear as indicated in earlier items. Cool the disk then replace the disk if the problem persists.
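Several of the attributes above are described as "should be zero". If you have already copied the counts out of your disk utility, a check like the following can list the ones that need attention. It is a sketch under assumptions: the attribute names follow the headings in this article, the function name is invented, and the counts are supplied by you. Raw values are encoded differently by different manufacturers, so a non-zero raw number on some disks does not mean a real error.

```python
# Attributes this article says should be zero, named as in the headings above.
SHOULD_BE_ZERO = (
    "Read Error Rate",
    "Spin up Retry Count",
    "Reported Uncorrectable Errors",
    "Current Pending Sector Count",
    "Uncorrectable Sector Count",
    "UDMA CRC Error Rate",
    "Write Error Rate",
)


def nonzero_attributes(counts):
    """Return the should-be-zero attributes that have a non-zero count.

    counts maps an attribute name to its count. Attributes that are
    missing from counts are treated as zero.
    """
    return [name for name in SHOULD_BE_ZERO if counts.get(name, 0) != 0]


if __name__ == "__main__":
    # Hypothetical readings for illustration only.
    readings = {"Read Error Rate": 0, "Current Pending Sector Count": 3}
    print(nonzero_attributes(readings))
```

An empty list means none of the zero-expected counters has moved. It does not replace checking the temperature and the reallocation counts.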