45

Google did a very thorough study on hard drive failures which found that a significant portion of hard drives fail within the first 3 months of heavy usage.

My coworkers and I are thinking we could implement a burn-in process for all our new hard drives that could potentially save us some heartache from losing time on new, untested drives. But before we implement a burn-in process, we would like to get some insight from others who are more experienced:

  • How important is it to burn in a hard drive before you start using it?
  • How do you implement a burn-in process?
    • How long do you burn in a hard drive?
    • What software do you use to burn in drives?
  • How much stress is too much for a burn-in process?

EDIT: Due to the nature of the business, RAIDs are impossible to use most of the time. We have to rely on single drives that get mailed across the nation quite frequently. We back up drives as soon as we can, but we still encounter failure here and there before we get an opportunity to back up data.

UPDATE

My company has implemented a burn-in process for a while now, and it has proven to be extremely useful. We immediately burn in all new drives that we get in stock, allowing us to find many errors before the warranty expires and before installing them into new computer systems. It has also proven useful to verify that a drive has gone bad. When one of our computers starts encountering errors and a hard drive is the main suspect, we'll rerun the burn-in process on that drive and look at any errors to make sure the drive actually was the problem before starting the RMA process or throwing it in the trash.

Our burn-in process is simple. We have a designated Ubuntu system with lots of SATA ports, and we run badblocks in read/write mode with 4 passes on each drive. To simplify things, we wrote a script that prints a "DATA WILL BE DELETED FROM ALL YOUR DRIVES" warning and then runs badblocks on every drive except the system drive.
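
(For those asking about the script: below is a minimal sketch of what such a wrapper can look like. It is an illustration rather than the exact script in use; in particular, finding the system drive via findmnt/lsblk is an assumption that holds when the root filesystem sits directly on a partition.)

    #!/bin/bash
    # Sketch: run a destructive badblocks pass (4 write/read patterns) on every
    # disk except the one holding the root filesystem. Illustrative only.
    set -euo pipefail

    echo "WARNING: DATA WILL BE DELETED FROM ALL YOUR DRIVES (except the system drive)."
    read -rp "Type YES to continue: " answer
    [ "$answer" = "YES" ] || exit 1

    # Work out which disk the root filesystem lives on, so we can skip it.
    rootdisk=$(lsblk -no PKNAME "$(findmnt -no SOURCE /)")

    for disk in $(lsblk -dno NAME,TYPE | awk '$2 == "disk" {print $1}'); do
        [ "$disk" = "$rootdisk" ] && continue
        echo "Starting badblocks on /dev/$disk"
        badblocks -wv -o "badblocks-$disk.txt" "/dev/$disk" 2> "badblocks-$disk.log" &
    done
    wait
    echo "Done. Check the badblocks-*.txt files and SMART data for each drive."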

Phil
  • Out of curiosity, why do you mail the drives across the country? It sounds like you're more likely to run into drive failure as the result of shock and general mishandling by the postal service than for reasons a burn-in would detect. – Paperjam Sep 08 '11 at 01:06
  • Do these drives end up in servers or desktops? – sjbotha Sep 08 '11 at 03:46
  • Why not a pair of very high speed Internet connections and a data duplication tool (e.g. rsync)? If you need security, SSH will do. And why not a set of 128 GB quad-layer Blu-ray discs or SSDs? I'm just thinking that you may be using a non-optimal process in the first place if you really need to mail hard disks frequently; I can't think of a valid use case where mailing hard disks is the optimal solution. And Google's study would be misleading in your use case, since it did not involve cases where drives are mailed around frequently. – Lie Ryan Sep 08 '11 at 11:01
  • 7
    @Lie Ryan: Mailing storage hardware is still the fastest way to transfer data. The internet is very slow, and if you need to pass through institutional networks and firewalls, it gets even worse. – Jonas Sep 08 '11 at 12:45
  • 4
    @Lie Ryan: If you're dealing with government classified secret data, *SSH will not do*. In general, a computer containing that data may not be connected to the public Internet. If you've got gigabytes of classified data to transfer, mailing an encrypted drive is the most efficient option. – benzado Sep 08 '11 at 15:05
  • @Jonas: Using a pair of 22Mbit/s business connections ([2*$100/month](http://business.comcast.com/Internet/plans.aspx)), you can transfer a total of 7TB/month. Sending the same amount of data through UPS would cost 7*$60=$420 (one-day delivery, $60 for a 0.5kg package, 1TB hard disks). And you have practically zero latency between data generation and data received, and zero worry about damaged hard disks. – Lie Ryan Sep 08 '11 at 20:49
  • 3
    @Lie Ryan: Exactly. Even with a really fast internet connection, it takes you a month to deliver 7TB (and that's if you have the luxury of having control over your and the recipient's throughput). With UPS, the 7TB are there the next day. – Jonas Sep 09 '11 at 02:11
  • @Jonas: with a fast Internet connection, your data starts arriving instantly, and if you can use rsync, it can often deflate those 7TB of transfer to a few gigabytes. The only time you cannot use rsync is if you're generating 7TB of **new and unique** data daily, and you would probably only be capable of generating that amount of new and unique data if you work on something like the Large Hadron Collider (although people at the LHC have a freaking fast internet connection as well; scientists around the world are just too impatient to wait for next-day delivery). – Lie Ryan Sep 09 '11 at 03:42
  • @Jonas: moreover, if you **really** were generating 7TB of data daily, then you could use the $360/month 100 MB/s connection (bandwidth ~7TB/day), and the price difference wouldn't really matter, since the computer and other equipment that can record and process all that data would probably cost you much more than your Internet connection fee. – Lie Ryan Sep 09 '11 at 04:44
  • 5
    That's nice and everything, except when you have mobile teams out on the road collecting massive amounts of data, meaning you have to rely on hotel or mobile internet connections and you are never in one spot for more than 8 hours. – Phil Sep 09 '11 at 16:10
  • I'd like to add that [3Ware does recommend](http://www.3ware.com/products/pdf/Drive_compatibility_list_9650SE_900-0019-02RevS.pdf) doing a 48-hour burn-in (see top of PDF). Also, the OP could mail both component drives of a software RAID around; those are easily reassembled into an array in another computer. – Halfgaar Oct 13 '13 at 08:23
  • @Phil - Could you please share that script and more specifics to run it. I need to do the same. – Alex S Sep 17 '15 at 09:52

10 Answers

49

IMNSHO, you shouldn't be relying on a burn-in process to weed out bad drives and "protect" your data. Developing and implementing this procedure will take up time that could be better used elsewhere, and even if a drive passes burn-in, it may still fail months later.

You should be using RAID and backups to protect your data. Once that is in place, let it worry about the drives. Good RAID controllers and storage subsystems will have 'scrubbing' processes that go over the data every so often and ensure everything is good.

Once all that is taken care of, there's no need to do your own disk scrubbing or burn-in, though as others have mentioned it doesn't hurt to do a system load test to ensure that everything is working as you expect. I wouldn't worry about individual disks at all.


As has been mentioned in the comments, it doesn't make a lot of sense to use hard drives for your particular use case. Shipping them around is far more likely to cause data errors that weren't there when you did the burn-in.

Tape media is designed to be shipped around. You can get 250 MB/s (or up to 650 MB/s compressed) with a single IBM TS1140 drive, which should be faster than your hard drive. And bigger as well - a single cartridge can give you up to 4TB (uncompressed).

If you don't want to use tape, use SSDs. They can be treated far rougher than HDDs and satisfy all the requirements you've given so far.


After all that, here are my answers to your questions:

  • How important is it to burn in a hard drive before you start using it?
    Not at all.
  • How do you implement a burn-in process?
    • How long do you burn in a hard drive?
      One or two runs.
    • What software do you use to burn in drives?
      A simple run of, say, shred and badblocks will do; check the SMART data afterwards (see the example commands after this list).
  • How much stress is too much for a burn-in process?
    No stress is too much. You should be able to throw anything at a disk without it blowing up.
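
As a rough sketch of that last suggestion (the device name is an example; both commands are destructive):

    # One pass of pseudo-random data over the whole disk
    shred -v -n 1 /dev/sdX

    # Destructive write-mode badblocks pass, then check SMART health and attributes
    badblocks -wsv /dev/sdX
    smartctl -H /dev/sdX    # overall health verdict
    smartctl -A /dev/sdX    # reallocated/pending sector counts
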
MikeyB
  • 1
    My only regret about this post is that I can only upvote it once. @Phil, you're re-inventing the wheel there. The way to ensure you don't lose data to random drive (or other related hardware) failures is backups and RAID arrays. – Rob Moir Sep 07 '11 at 20:30
  • 9
    I agree you shouldn't rely on it, but running a scan against a system before it is put into production has identified a potential issue for me a couple of times. If you aren't in a huge rush, letting a computer scan itself for a day or two doesn't usually hurt anything. – Zoredache Sep 07 '11 at 20:33
  • 8
    This answer has the highest votes, yet has failed to answer the question the most. The OP has stated that RAIDs are impossible. If "single drives are getting mailed across the nation" then a RAID cannot be built. A backup process is in place, but it seems that the OP wants to find anything and everything to help save time in case a drive goes bad. (Note: I work at the same company as the OP, so I know the situation; the data gets copied to a RAID once it is delivered.) If you HAD to burn in drives, hypothetically, how would you? – jsmith Sep 07 '11 at 21:52
  • 3
    In that case, the question should provide this context. From what you're saying, the question should be "We need to send hard drives via snail mail/freight. How should we do a burn-in test before shipping them to minimize failures?" Funnily enough, the answer doesn't change. Send a pair of drives! Or better, use tapes. Tapes are *designed* to be used in this manner, HDDs are not. More information leads to a different answer. – MikeyB Sep 07 '11 at 22:50
  • I can only confirm Google's statement; I've never had a 100+ disk array that did not have at least one or two drives die during the first months. A proper RAID design with global hot spares for every type of disk used in the array deals with that. Good enterprise arrays have call-home support so that dead drives get out of your array ASAP. – pfo Sep 07 '11 at 22:55
  • 1
    @MikeyB I agree, I'm just trying to provide that information so that you can provide a better answer. Tapes cannot write data as fast as needed, so they are not an option. The drives need to be at least 10k rpm to have the correct functionality. I agree in a perfect world he could send two drives, one to do a nightly backup. And in a perfect world, this decision would be in the OP's hands. It's just not what is needed. "How should I burn in hard drives?" is the question. :) – jsmith Sep 07 '11 at 22:55
  • 2
    Tapes can not write data as fast as hard drives? 250MBps (uncompressed) isn't good enough for you? I would say that a burn-in process isn't actually going to help protect against failures with your use case. – MikeyB Sep 07 '11 at 23:17
  • 1
    I agree with the point @MikeyB raises. If a drive is being shipped around, then any testing you did prior to shipping it is probably void. If you need to test post-shipping, I suspect I would generate an md5sums/sha256sums file for all the files on the drive and send that separately from the drives. When you receive the drive with the files, verify the hashes (a sketch of this follows below). – Zoredache Sep 08 '11 at 00:15
  • Also, use RAID6, not RAID5. Losing all redundancy because of one failing drive and subsequently putting all the remaining drives under the extra stress of a rebuild is a bad thing. – Simon Richter Sep 08 '11 at 00:46
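
As a rough sketch of the checksum-manifest idea from the comments above (paths are examples):

    # Sending side: build a checksum manifest for everything on the drive,
    # and send the manifest separately (e.g. by e-mail), not on the drive itself.
    cd /mnt/ship-drive
    find . -type f -exec sha256sum {} + > /tmp/manifest.sha256

    # Receiving side, after mounting the drive: report any files that fail
    cd /mnt/ship-drive
    sha256sum -c --quiet /path/to/manifest.sha256
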
38

How important is it to burn in a hard drive before you start using it?

If you have good backups and good high-availability systems, then not very important, since restoring from a failure should be pretty easy.

How do you implement a burn-in process? What software do you use to burn in drives? How much stress is too much for a burn-in process?

I will typically run badblocks against a drive or new system when I get it, and whenever I resurrect a computer from the spares pile. A command like this (badblocks -c 2048 -sw /dev/sde) will write to every block four times, each time with a different pattern (0xaa, 0x55, 0xff, 0x00). This test does not exercise lots of random reads/writes, but it should prove that every block can be written to and read back.

You could also run bonnie++ or iometer, which are benchmarking tools. These should stress your drives a bit, and drives shouldn't fail even if you try to max them out, so you might as well see what they can do. I don't do this myself, though. That said, getting an I/O benchmark of your storage system at install/setup time may be very useful later when you are looking at performance issues.
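
(For reference, a typical bonnie++ invocation can look something like the sketch below; the mount point is an example, the file size should be at least twice your RAM so the page cache can't mask the disk, and the target directory must be writable by the user given with -u.)

    # Benchmark the filesystem on the new drive; skip the small-file
    # creation tests (-n 0) and run as an unprivileged user.
    bonnie++ -d /mnt/newdrive -s 32g -n 0 -u nobody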

How long do you burn in a hard drive?

A single run of badblocks is enough in my opinion, but I believe I have a very strong backup system, and my HA needs are not that high. I can afford some downtime to restore service on most of the systems I support. If you are so worried that you think a multi-pass setup may be required, then you probably should have RAID, good backups, and a good HA setup anyway.

If I am in a rush, I may skip a burn-in. My backups and RAID should be fine.

Zoredache
8

Given your clarification, it doesn't sound like any burn-in process would be of any use to you. Drives fail primarily because of mechanical factors, usually heat and vibration; not because of any sort of hidden time-bomb. A "burn-in" process tests the installation environment as much as anything else. Once you move the thing, you're back to where you started.

But here are a few pointers that might help you:

Laptop drives are usually designed to withstand more jostling and vibration than desktop drives. My friends who work in data-recovery shops always ship data to clients on laptop drives for that reason. I've never tested this myself, but it seems to be "common knowledge" in select industries.

Flash drives (e.g. USB thumb drives) are about the most shock-resistant medium you'll find. You should be even less likely to lose data in transit if you use flash media.

If you ship a Winchester drive, do a surface scan before putting it into use. Or better yet, just don't put it into use: instead, you may want to designate certain drives as "shipping" drives, which see all the abuse but which you don't rely on for data integrity. (I.e., copy data onto the drive for shipping, copy it off after shipping, verify checksums on both sides, that kind of thing.)

tylerl
  • "Once you move the thing, you're back to where you started." - not true. this is how an MTTF graph for a disk looks like: http://www.cs.cmu.edu/~bianca/fast/img13.png http://www.cs.ucla.edu/classes/spring09/cs111/scribe/16/CS111ScribeNotes_images/ddmmg49s_14hfjjcbcx_b.bmp testing in the early period can filter out a lot of problematic drives, manufacturers actually do these stress tests to make more reliable product but of course it's not economical to do the test till the graph flattens out. – Karoly Horvath Sep 08 '11 at 10:41
  • 3
    @yi_H: I think tylerl is implying that probably the cause of early failure is due to the shipping process itself (which isn't implausible, I'd like to see a graph plotting drive failure vs mailing frequency), therefore if you ship harddrive across the country, then the MTTF will reset; so if you mail a drive every three months, you're always in the early failure zone. – Lie Ryan Sep 08 '11 at 11:20
5

Your process is wrong. You should use RAID arrays. Where I work we have made ruggedized RAID arrays that are designed to be transported around. It's not rocket science.

Shock-mounting the drives in oversize enclosures with big rubber vibration isolators will improve reliability hugely. (Seagate Constellation ES drives, as an example, are rated for 300 G shock but only 2 G vibration, non-operating, so the shipping case needs to vibration-isolate the drive. http://www.novibes.com/Products&productID=62 or http://www.novibes.com/Products&productId=49 [part #50178])


However, if you really want to burn-in-test hard drives, here is how.

I've worked on systems like hard drives, and burn-in did find some problems, but...

For accelerated lifecycle testing of PCBs to bring out faults, nothing beats some hot/cold cycles. (Operating hot/cold cycles work even better, but they're harder for you to do, especially with banks of HDDs.)

Get yourself an environmental chamber big enough for the number of drives you acquire at a time. (These are pretty expensive; it'd be cheaper to ship RAID arrays around.) You can't skimp on the test chamber: you will need humidity control and programmable ramps.

Program in two repeating temperature ramps, down to the minimum storage temperature and up to the maximum storage temperature, and make the ramps steep enough to upset the application engineer from your hard drive manufacturer. Three cold-hot cycles in 12 hours should see the weak drives failing pretty quickly. Run the drives for at least 12 hours like this; if any still work afterwards I'll be surprised.

I didn't think this up: at one place I worked, a production engineer did this to get more products shipped with the same test equipment. There was a huge surge in faults during testing, but the dead-on-arrival rate dropped to practically zero.

slm
Tim Williscroft
5

I disagree with all the answers that basically say "Don't bother with burn-in, have good backups".

While you should always have backups, I spent 9 hours yesterday (on top of my usual 10-hour shift) restoring from backups because the system was running with drives that hadn't been burned in.

There were 6 drives in a RAIDZ2 config (ZFS equivalent to RAID-6) and we had 3 drives die over the course of 18 hours on a box that had been running for approximately 45 days.

The best solution I've found is to purchase drives from one particular manufacturer (don't mix-and-match), then run their provided tool for exercising the drives.

In our case we buy Western Digital and use their DOS-based drive diagnostics from a bootable ISO. We fire it up, run the option to write random garbage to the entire disk, then run the short SMART test followed by the long SMART test. That's usually enough to weed out all the bad sectors, read/write reallocations, etc...

I'm still trying to find a decent way to 'batch' it so I can run it against 8 drives at a time. Might just use 'dd if=/dev/urandom of=/dev/whatever' in Linux or 'badblocks'.
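
For what it's worth, one way to batch this under Linux could look like the sketch below (device names and log paths are examples, not a tested production script):

    #!/bin/bash
    # Sketch: destructive badblocks pass on several drives in parallel,
    # followed by a long SMART self-test on each.
    drives="sdb sdc sdd sde sdf sdg sdh sdi"

    for d in $drives; do
        badblocks -wv -o "/root/badblocks-$d.txt" "/dev/$d" 2> "/root/badblocks-$d.log" &
    done
    wait

    for d in $drives; do
        smartctl -t long "/dev/$d"    # the self-test runs inside the drive
    done
    # Later: smartctl -l selftest /dev/sdb (etc.) to read the results.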

EDIT: I found a nicer way to 'batch' it. I finally got around to setting up a PXE boot server on our network to address a particular need, and noticed that the Ultimate Boot CD can be PXE booted. We now have a handful of junk machines sitting around that can be PXE booted to run drive diagnostics.

Aaron C. de Bruyn
  • 3
    What do you know? An answer that answers the question and doesn't preach at the OP. +1 – datu-puti Jul 29 '16 at 23:43
  • 3
    If you want random data to write to the drive, don't read from /dev/urandom; it isn't as slow as its blocking cousin /dev/random but it's still slow, and it doesn't really gain you anything. Instead, set up a plain dm-crypt mapping with a random key (which you can get from /dev/urandom or /dev/random) and then just dd /dev/zero into the mapped device. It'll probably be anywhere between twice as fast and an order of magnitude faster. https://gitlab.com/cryptsetup/cryptsetup/wikis/FrequentlyAskedQuestions "How can I wipe a device with crypto-grade randomness?" has an example of how to do that (a sketch also follows below). – user Aug 16 '17 at 20:25
  • Excellent suggestion @MichaelKjörling – Aaron C. de Bruyn Aug 17 '17 at 18:50
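
A sketch of the dm-crypt approach described in the comment above (the device name is an example, and this destroys all data on the target):

    # Map the raw disk through dm-crypt with a throwaway random key...
    cryptsetup open --type plain -d /dev/urandom /dev/sdX wipe_me

    # ...then writing zeros through the mapping leaves effectively random
    # ciphertext on the disk, much faster than reading /dev/urandom directly.
    dd if=/dev/zero of=/dev/mapper/wipe_me bs=1M status=progress

    cryptsetup close wipe_me
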
3

How important is it to burn in a hard drive before you start using it?

It Depends.
If you're using it in a RAID that provides redundancy (1, 5, 6, 10)? Not very.
If you're using it standalone? A little bit, but you're better off just running smartd or something to monitor it instead, at least in my opinion.
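
(A minimal smartd.conf sketch, assuming smartmontools is installed; the schedule and e-mail address are examples:)

    # /etc/smartd.conf
    # Monitor all drives, enable automatic offline testing and attribute autosave,
    # run a short self-test daily at 02:00 and a long one on Saturdays at 03:00,
    # and send mail when something looks wrong.
    DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com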

This naturally leads to my answer to "How do you implement a burn-in process?" -- I don't.
Rather than trying to "burn in" disks I run them in redundant pairs and use predictive monitoring (like SMART) to tell me when a drive is getting wonky. I've found that the extra time required to do a full burn-in (really exercising the whole disk) is substantially more expensive than dealing with a disk failure and swap-out.
Combining RAID and good backups, your data should be very safe, even when dealing with infant mortality (or the other end of the bathtub curve, when you start having drives die of old age).

voretaq7
  • 1
    What if the drive cannot be monitored because it is not in one constant place? :) – jsmith Sep 07 '11 at 22:24
  • 2
    @jsmith - then you have it send you notifications rather than polling it from a monitoring host -- there are almost no situations where you genuinely *can't* monitor something, there are just ones that require a little creative thinking :) – voretaq7 Sep 08 '11 at 04:45
2

SpinRite (grc.com) will read and write back all the data on the drive. It's a good thing to do for a new drive even if you're not trying to get it to fail. It takes a long time to run at level 4, usually a couple of days for current-size drives. I should also add that it is non-destructive; in fact, if it finds data in bad spots it will move and recover it. Of course, you would never run it on an SSD.

Charlie
1

I'm sure once-a-week benchmarking and error checking will suffice to "burn in" hard drives, though until your post I had never heard of such a thing.

Quoted from "6_6_6" on Stroagereview.com

  1. Connect the drive to a running system. Read SMART values.
  2. Do a SMART short self test. Do a SMART long self-test.
  3. Zero fill / Wipe the drive with the manufacturer's utility. Entire drive.
  4. Run an HDTach full read/write pass. Everest, Sandra, etc. all have stress tests; run the hard drive portion continuously for hours.
  5. Run Victoria for Windows Read/Write test and make sure no slow sectors.
  6. Drop to DOS. Run MHDD, run an LBA test, and check for slow sectors. Run the Read/Write/Verify test. Run the drive's internal ATA Secure Erase command.
  7. Do a full format.
  8. Compare SMART values. If no anomalies, all good to go. Install your OS and continue.

In all, I personally think it's a bad idea.

Source: http://forums.storagereview.com/index.php/topic/27398-new-hdd-burn-in-routines/

slm
U4iK_HaZe
0

First, I agree with other posters that your use case suggests that tape drives will be the better option.

If that is not possible and you have to fly drives across the nation, a true RAID doesn't seem to be an option, as you'd have to transport many more drives, increasing the risk of failure. However, what about a simple mirroring scheme, sending one drive and keeping the other at the source site?

Then, if the drive fails at arrival, a new copy can be made and sent. If the drive is good at arrival, the spare can then be reused - either for sending or for backing up the original data.

0

You haven't really said why the drives are being shipped - is this just a way of sending data, do they have complete applications/OS images ready to be booted in a PC, or something else?

I agree with the other answers that RAID or backups are better than scanning, due to the risks of shipping a drive causing mechanical issues.

A more general way of putting this would be "rely on redundant data to catch and correct errors" - either ship 2 drives for each set of data, or ship redundant data on a single drive. Something like Parchive lets you add a defined level of redundancy to data, enabling recovery even if a large part of the data is corrupted. Since disks are quite cheap these days, just buying a larger disk than strictly required will often be cheaper than scanning the drive, shipping a replacement drive, or shipping 2 drives.
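
As an illustration of the Parchive idea, using the par2cmdline tool (the file name and the 20% redundancy level are arbitrary examples):

    # Before shipping: create recovery data worth ~20% of the payload
    cd /mnt/ship-drive/data
    par2 create -r20 shipment.par2 bigdataset.tar

    # After shipping: verify, and repair if anything was corrupted
    par2 verify shipment.par2
    par2 repair shipment.par2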

This would protect against non-catastrophic failures of the drive - however it's still best to not re-use a shipped drive except for shipping, as suggested earlier, i.e. view it like a tape that must be extracted to a 'real' drive that's permanently installed and not shipped anywhere.

This should let you ship a large amount of data (or even application/OS images) and reduce the impact of disk errors to whatever level is economic.

RichVel