18

I have a NAS appliance that is just over a month old. It is configured to email me alerts generated from the hard drives' SMART data. After one day, one of the hard drives reported that a sector had gone bad and been reallocated. Over the first week, that number climbed to six total sectors for the hard drive in question. After a month, the number stands at nine reallocated sectors. The rate definitely seems to be decelerating.

The NAS has six 1.5 TB drives in a RAID-5 configuration. With such high-capacity drives, I would expect a sector to fail from time to time, so I was not concerned when the first few sectors were reallocated. It does bother me, though, that none of the other disks are reporting any problems.

At what rate of reallocations, or what total number of reallocations, should I start to worry about the drive's health? Might this vary based on the capacity of the drive?

Jeremy
  • nice one, jeremy. one of the best on serverfault as many others here will find it useful and it's not easy to find an answer to. definitely deserves more than +2. you might want to rephrase the question so that it's not specific to NetGear, but storage in general though – username May 19 '09 at 17:47
  • Thanks for the feedback, I made the changes you suggested and updated the situation. – Jeremy May 20 '09 at 13:05
  • 1
    I replace drives at _one_ reallocated sector. You should expect zero over the warranty timespan of the drive. The manufacturers have always honored the warranty on these drives. – Michael Hampton Jun 13 '14 at 05:34

7 Answers

22

Re-reading Google's paper on the subject, "Failure Trends in a Large Disk Drive Population", I think I can safely say that Adam's answer is incorrect. In their analysis of a very large population of drives, roughly 9% had non-zero reallocation counts. The telling quote is this:

After their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts, making the critical threshold for this parameter also one.

It's even more interesting when dealing with "offline reallocations", which are reallocations discovered during background scrubbing of the drive, not during actual requested IO ops. Their conclusion:

After the first offline reallocation, drives have over 21 times higher chances of failure within 60 days than drives without offline reallocations; an effect that is again more drastic than total reallocations.

My policy from now on will be that drives with non-zero reallocation counts are to be scheduled for replacement.
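
If you want to automate that policy, something along these lines should do it. This is only a rough Python sketch: it assumes smartmontools is installed, that /dev/sda is the right device for your system, and that attribute 5 reports a plain raw count.

import subprocess
import sys

DEVICE = "/dev/sda"  # placeholder device path; adjust for your system

def reallocated_sectors(device):
    """Return the raw value of SMART attribute 5 (Reallocated_Sector_Ct)."""
    # Note: smartctl's exit status encodes drive-health bits, so a non-zero
    # return code is not necessarily an execution error -- parse the output instead.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows start with the attribute ID; the raw value is the last column.
        if fields and fields[0] == "5" and "Reallocated_Sector_Ct" in line:
            return int(fields[-1])
    raise RuntimeError("Attribute 5 not found in smartctl output")

if __name__ == "__main__":
    count = reallocated_sectors(DEVICE)
    if count > 0:
        print(f"{DEVICE}: {count} reallocated sector(s) -- schedule replacement")
        sys.exit(1)
    print(f"{DEVICE}: no reallocated sectors")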

Insyte
  • That is interesting, I had heard of that paper but I may need to re-read it. FWIW, 4 out of the 6 drives in my NAS have reallocated sectors. Thanks for the answer. – Jeremy Nov 17 '09 at 14:02
13

Drives, like most components, have a bathtub curve failure rate. They fail a lot in the beginning, have a relatively low failure rate in the middle, and then fail a lot as they reach the end of their life.

Just as the whole drive follows this curve, particular areas of the disk will also follow it. You'll see a number of sector reallocations early in the drive's life, but this should taper off. When the drive starts to fail at the end of its life, it will start losing more and more sectors.

Six reallocations isn't necessarily cause for worry (it depends on the drive; consult the manufacturer), but you do need to watch how frequently new reallocations appear. If the deterioration accelerates or holds steady, worry. If it tapers off, the drive should be fine after the initial break-in period.
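
To make "watch the frequency" concrete, here is a rough sketch of the kind of check I mean. The history-file location and the sampling approach are just placeholders; feed it the count however your NAS exposes SMART data.

import json
import time
from pathlib import Path

HISTORY = Path("/var/tmp/realloc_history.json")  # placeholder location

def record_sample(count):
    """Append the current reallocated-sector count to the history file."""
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    history.append((time.time(), count))
    HISTORY.write_text(json.dumps(history))
    return history

def reallocation_accelerating(history):
    """True if the gap between the last two increases is no longer than the one before."""
    # Keep only the timestamps of samples where the count actually went up.
    increases = [t for i, (t, c) in enumerate(history)
                 if i > 0 and c > history[i - 1][1]]
    if len(increases) < 3:
        return False  # not enough data points to judge a trend yet
    last_gap = increases[-1] - increases[-2]
    prev_gap = increases[-2] - increases[-3]
    return last_gap <= prev_gap

# Usage, e.g. from a daily cron job:
# history = record_sample(current_count)
# if reallocation_accelerating(history):
#     print("Reallocation rate is not slowing down -- keep a close eye on this drive")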

-Adam

Adam Davis
  • A small point: Drives will fail LONG before their MTBF. I think you mean they fail a lot as they approach their expected lifetime. – Eddie May 04 '09 at 16:41
  • 5
    Didn't Google pretty thoroughly debunk the "bathtub curve" theory? – Insyte Nov 17 '09 at 02:18
3

Different drives probably have different parameters. The last drive I checked, a 1 TB enterprise-series disk from one vendor, had 2048 sectors reserved for reallocation.

You can estimate the number of reserved sectors by looking at the S.M.A.R.T. report of a drive that has a non-zero reallocated sector count. Consider the report from a failed drive below.

...
ID# ATTRIBUTE_NAME          VALUE WORST THRESH TYPE      WHEN_FAILED  RAW
...          
  5 Reallocated_Sector_Ct   005   005   036    Pre-fail  FAILING_NOW  1955

Here 95% of the reserved capacity has been used, which amounts to 1955 sectors, so the initial reserve was about 2057 sectors. In fact it is 2048; the difference is due to rounding.

S.M.A.R.T. puts the drive into a failing state when the number of reallocated sectors reaches a certain threshold. For the drive in question this threshold is set at 64% of the reserved capacity, i.e. roughly 1310 remapped sectors.

However, the reserved sectors do not lie in one contiguous span. They are split into several groups, each used for remapping sectors from a specific part of the disk, which keeps the remapped data local to that area of the platter.

The downside of this locality is that the disk may still have plenty of reserved sectors overall while one area has already run out of its local reserve. In that case the behavior depends on the firmware: one drive we observed went into a FAILED state and blocked as soon as an error occurred in a part of the disk that was no longer protected.
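
Spelling out the arithmetic from the report above (the 2048 figure is specific to that vendor's firmware):

# Values from the report: normalized VALUE has dropped to 5 (from 100),
# THRESH is 36, and the raw count is 1955 remapped sectors.
value, thresh, raw = 5, 36, 1955

used_fraction = (100 - value) / 100               # 0.95 -> 95% of the reserve consumed
estimated_reserve = int(raw / used_fraction)      # 1955 / 0.95 ~= 2057 (the real pool is 2048)

# The drive is flagged as failing when VALUE drops to THRESH, i.e. when
# (100 - THRESH)% of the reserve has been consumed.
failure_count = int((100 - thresh) / 100 * 2048)  # 0.64 * 2048 ~= 1310 remapped sectors

print(estimated_reserve, failure_count)           # 2057 1310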

Dmitri Chubarov
  • How did you determine that "there were 2048 reserved sectors for reallocation"? – AJ. Feb 27 '13 at 01:50
  • Perhaps 2047 is the max amount of re-allocable sectors. One of my drives had exactly 2047 when bought off eBay for "new", which is 0x7FF, also b11,111,111,111. Going to 2048 would waste an extra bit. – davide Jun 11 '15 at 13:20
2

You might want to run a S.M.A.R.T. long self-test, if the drive supports it. This may give you more information about the status of the drive. If your NAS cannot do this, and if you can pull the drive out or power down the NAS for a few hours, then you can do the long self-test with the hard disk plugged into another machine.
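
If you do end up with shell access to whatever machine the drive lands in, the test is a single smartctl call. A rough Python sketch, assuming smartmontools is installed and /dev/sdb is the right device:

import subprocess

DEVICE = "/dev/sdb"  # placeholder: the suspect drive once it is in the other machine

# Kick off the extended ("long") self-test. It runs on the drive itself in the
# background and typically takes a few hours on a 1.5 TB disk.
subprocess.run(["smartctl", "-t", "long", DEVICE], check=True)

# Hours later, read the self-test log for the result (e.g. "Completed without error"
# or a read failure plus the first failing LBA). smartctl's exit status encodes
# health bits, so inspect the output rather than the return code.
log = subprocess.run(["smartctl", "-l", "selftest", DEVICE],
                     capture_output=True, text=True)
print(log.stdout)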

Eddie
1

When a drive this new behaves like this, it's not to be trusted at all!

Send it back as soon as possible, and get a replacement drive.

1

Different manufacturers have different "acceptable loss" numbers (same idea as with monitors and bad pixels). Check with the drive manufacturer to find out what their standard is.

It does look like a bad trend though...

Brian Knoblauch
0

Western Digital is particularly proud of a technology called TLER (Time-Limited Error Recovery, http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery), which limits how long a drive spends trying to recover a bad sector so that a drive in a RAID array does not appear to freeze. The limit is typically 5 to 7 seconds.

From what I have found on the web, there are WD drives that ship with this option disabled, but some people have enabled the feature on cheap WD Green drives and then placed them into a RAID array.

The WDTLER utility has been removed from the WD support site, but it can easily be found via Google.
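
For what it's worth, drives that implement the standard SCT Error Recovery Control feature (the ATA mechanism behind TLER) let you read and set the same timeout with smartctl, so the WD-specific utility is not always needed. A rough sketch, assuming the drive actually supports SCT ERC and that /dev/sda is the right device:

import subprocess

DEVICE = "/dev/sda"  # placeholder device path

# Read the current SCT Error Recovery Control timeouts, if the drive supports them.
subprocess.run(["smartctl", "-l", "scterc", DEVICE])

# Set both the read and write timeouts to 7.0 seconds (the unit is tenths of a
# second), matching the typical TLER value mentioned above. On many drives the
# setting does not survive a power cycle, so RAID setups reapply it at boot.
subprocess.run(["smartctl", "-l", "scterc,70,70", DEVICE])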

P.S. I only use this utility for reading the status, and I am not using RAID at the moment :)