In general I would like to know how often a RAID array needs to be scrubbed. What contributes to the need to scrub more often (reading data?, writing data?, unexpected shutdowns?, drive age?, drive size?, number of users?, etc.)?

I've been reading the Arch wiki here and all it really says is that scrubbing should be done regularly. I was just wondering how regularly is enough. Obviously it would depend, but what is a reasonable range? Yearly? Monthly? Weekly? Daily? Thanks a lot for any information.

mrfred
  • How big is your array? How long does a scrub of your array take? Larger arrays can take hours or days for a full scrub. Obviously you can't scrub more frequently than it takes for a single scrub to complete. The Debian package includes a cron script that runs the first Sunday of every month. – Zoredache Apr 10 '14 at 21:22
  • It depends on how dirty it is :) – metacom Apr 10 '14 at 22:59
  • @Zoredache: That's exactly the kind of stuff I was looking for. Since larger arrays take longer, would that decrease the frequency at which you should scrub? Would increasing the number of users increase that frequency? Do you need to scrub if you aren't really writing much new data? – mrfred Apr 10 '14 at 23:42
  • Software RAID? Hardware RAID? *ZFS?* – ewwhite Apr 11 '14 at 12:00

1 Answer

How often you should scan depends on a lot of things.

  • Age of the disks. The older they are, the more likely they are to contain evil.
  • The original quality of the disks in question. Stuff sold as 'enterprise' is more likely to last error-free, and the 1+TB size disks of 2014 are a lot more reliable than their 2009 equivalents were when they shipped.
  • How sensitive your production I/O is to the scrubbing I/O.
  • How much of your dataset you consider to be your working set.

Hardware RAID vendors often include a background scrub process for this very reason. Some even let you tune the I/O priority of the scrubbing process, which lets you avoid (or greatly reduce) the production I/O penalty of a scrub. Of course, if that priority is low and your prod I/O runs the disks mostly flat out, the scrub will probably never complete, and you won't notice until you get a failure.

Unfortunately, I don't know if the Linux kernel deprioritizes scrubbing I/O or not. Either way, it's a good idea to test it with your prod loads to be sure any hits to performance are acceptable. If it is acceptable, good! If it isn't, you get to make a choice on whether or not to add spindles to allow scrub+prod I/O or just accept the risk of possible array failures down the road.
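
For Linux software RAID (md) specifically, the md(4) man page describes two system-wide knobs, /proc/sys/dev/raid/speed_limit_min and speed_limit_max, that bound resync and check speed in KiB/s per device, plus per-array sync_action and mismatch_cnt files under /sys/block/<array>/md/. The read-only sketch below just reports those values so you can see what you are working with before testing. The function names, and the default array name md0, are my own assumptions for illustration, and none of this applies to hardware RAID or ZFS.

```python
from pathlib import Path


def read_md_speed_limits(raid_dir="/proc/sys/dev/raid"):
    """Return the md resync/check speed limits (KiB/s per device) as a dict."""
    limits = {}
    for name in ("speed_limit_min", "speed_limit_max"):
        limits[name] = int((Path(raid_dir) / name).read_text().strip())
    return limits


def read_md_scrub_state(md_name="md0", sys_block="/sys/block"):
    """Return the current sync action and mismatch count for one md array.

    md_name is a hypothetical default; substitute your own array name.
    """
    md_dir = Path(sys_block) / md_name / "md"
    return {
        "sync_action": (md_dir / "sync_action").read_text().strip(),
        "mismatch_cnt": int((md_dir / "mismatch_cnt").read_text().strip()),
    }


if __name__ == "__main__":
    # Only meaningful on a Linux host that actually has an md array.
    print(read_md_speed_limits())
    print(read_md_scrub_state("md0"))
```

The directory arguments exist only so the helpers can be pointed at a copy of those files; on a real system the defaults are the paths to read.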

Another thing that impacts scrubbing frequency is I/O usage pattern. If the production loads only hit a minority of the disks, the only I/O that would normally find a bad block in the idle portion would be your scrub; in that case you want to scrub more often. If your production loads routinely read the whole disk-set (like daily full backups), then production I/O is going to stumble across problems sooner and you can scrub less often.

A good plan of action would be:

  1. Run some tests to see if scrubbing will get in the way of production.
    1. Figure out how long a full scrub takes while you're at it.
  2. Figure out what percentage of your disk-set will get multiple accesses in a given week (include backup I/O, if any, in this calculation).
  3. Based on 1 and 2 decide if you're in the less-often or more-often camp.
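
Before you have real measurements, steps 1 and 2 can be roughed out with back-of-the-envelope arithmetic. This is only a sketch: it assumes the scrub reads the given capacity once at a steady rate, and both function names and the example figures are made up for illustration. A timed scrub under production load is the number to trust.

```python
def estimate_scrub_hours(capacity_tb, throughput_mb_s):
    """Hours to read capacity_tb (decimal TB) once at throughput_mb_s (MB/s).

    Assumes a constant rate, which a busy array will not deliver.
    """
    total_bytes = capacity_tb * 1e12
    seconds = total_bytes / (throughput_mb_s * 1e6)
    return seconds / 3600


def weekly_touched_fraction(touched_tb, total_tb):
    """Fraction of the disk-set read at least once a week, capped at 1.0.

    touched_tb should include backup I/O, as in step 2 of the plan.
    """
    return min(1.0, touched_tb / total_tb)


if __name__ == "__main__":
    # Hypothetical example: 4 TB to read at a sustained 100 MB/s.
    print(round(estimate_scrub_hours(4, 100), 1))  # prints 11.1
    # Hypothetical example: 2 TB of an 8 TB set gets read every week.
    print(weekly_touched_fraction(2, 8))  # prints 0.25
```

A low touched fraction puts you in the more-often camp, since the scrub is then the only thing likely to find bad blocks in the idle portion.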

Once you have that data...

  • If a full scan takes under a day and doesn't impact production noticeably, you can go as often as once a week.
  • If a full scan takes under a day and does impact production, figure out what part of your week/month is least affected and try to run it then.
  • If a full scan takes over a day but under a week and doesn't impact production, run it anywhere from every other week to once every other month.
  • If a full scan takes over a day but under a week and does impact production, consider adding resources to allow it to be run, restricting scans to arranged maintenance windows, or taking advantage of the idle/check ability of scrubbing to run it continually in fits and starts.
  • If a full scan takes over a week, once a month is often enough. But if it impacts production, you will need to add resources to allow it to complete.
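
The rules of thumb above can be written down as a small lookup, which is handy if you want to apply them across several arrays. This is just a restatement of the list as a sketch; the function name is made up, and treating a scan of exactly one day or exactly one week as belonging to the middle bucket is my own assumption, since the list doesn't say.

```python
DAY_HOURS = 24
WEEK_HOURS = 7 * 24


def recommend_scrub_schedule(scrub_hours, impacts_production):
    """Map scrub duration and production impact to a rule-of-thumb schedule."""
    if scrub_hours < DAY_HOURS:
        if impacts_production:
            return "run it in the least-affected part of the week or month"
        return "as often as once a week"
    if scrub_hours <= WEEK_HOURS:
        if impacts_production:
            return ("add resources, use maintenance windows, "
                    "or scrub in fits and starts")
        return "every other week to once every other month"
    # Longer than a week.
    if impacts_production:
        return "once a month, after adding resources so it can complete"
    return "once a month"


if __name__ == "__main__":
    print(recommend_scrub_schedule(11, False))  # prints: as often as once a week
    print(recommend_scrub_schedule(200, False))  # prints: once a month
```
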
sysadmin1138