I have a number of Debian servers in a datacenter, and from time to time I notice that the software RAID 1 array on one of them has degraded. The re-sync starts automatically and I don't lose any data, but it is annoying because it slows the server down, sometimes for days, while the HDDs re-sync.

I was wondering what exactly causes the HDDs to de-sync in the first place and if there are any configuration options to prevent this from happening.

Any thoughts/suggestions on this matter would be greatly appreciated.

Alex Flo
  • It takes **days** for a mirror to resync and this happens *regularly*? – MDMarra Aug 14 '12 at 13:09
  • This is not normal. You need to troubleshoot. What do your logs say? – David Schwartz Aug 14 '12 at 13:12
  • @MDMarra: the servers have tens of millions of files on their HDDs and are heavily used, so it can indeed take days sometimes. As for regularly: it happens once a month or every other month. – Alex Flo Aug 14 '12 at 13:19
  • I have a lot of servers in RAID 1 and never have a problem. Maybe you can check with SMART whether your drives are becoming faulty? After that, check your logs; you will see whether the chipset sent an error to the kernel. – Dom Aug 14 '12 at 13:08
  • A resync should only sync changes, not all tens of millions of files. This isn't right. – MDMarra Aug 14 '12 at 14:37
  • fwiw, MD, an mdadm resync does indeed resync every bit on the drive, blank or data, used or not. – MadHatter Aug 14 '12 at 15:14
  • Interesting. I don't use mdadm, so I didn't realize that. That's some poor behavior. :( – MDMarra Aug 14 '12 at 17:13

1 Answer

You may also want to check for the existence of a cron job which regularly runs a RAID check on the mirrors. This can look a lot like a resync while it's happening.

On CentOS-type systems, it's done by /etc/cron.weekly/99-raid-check; I don't know what that'd be on a Debian system, though.

Edit: That's a weekly cron job that runs a RAID check, which makes the discs do something very like a RAID resync. This isn't the same thing as just checking whether the RAID has failed; the substantive line is echo "check" > /sys/block/$dev/md/sync_action. If you're saying that you constantly find your RAID arrays resyncing, this may be what's biting you. If you're saying that they constantly report as degraded and don't recover, this isn't it.
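To tell a scheduled check from a real rebuild, you can read each array's sync_action: "check" points at a job like the one above, while "resync" or "recover" means the array really is rebuilding. A minimal sketch, assuming the usual md sysfs layout; the function name md_sync_actions and its optional root argument are made up for illustration:

```shell
#!/bin/sh
# Print the current sync_action of every md array found under a sysfs root.
# md_sync_actions is a hypothetical helper; the root defaults to /sys/block.
md_sync_actions() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/sync_action; do
        # Skip the unexpanded glob when no md arrays exist.
        [ -r "$f" ] || continue
        dev="${f#"$root"/}"   # strip the root prefix: md0/md/sync_action
        dev="${dev%%/*}"      # keep only the device name: md0
        printf '%s: %s\n' "$dev" "$(cat "$f")"
    done
}

md_sync_actions
```

Running it while the servers are slow should show quickly which of the two cases you are in.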

If you think this might be it, you'll have to look at wherever Debian keeps its weekly/monthly cron jobs.

Edit 2: this file in /sys isn't a real file, it's a kernel artefact. You have to find out which cron job is writing check to that file, and stop it. I'm sorry, but I've little experience with Debian, and don't know where it keeps its system cron files. But if you poke around, you should be able to find the local equivalent of my /etc/cron.weekly/99-raid-check, and either edit it (or a resource file it depends on) so it no longer does that, or just delete it.
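One way to do the poking around is to grep the cron directories for anything that mentions the check. This is only a sketch: find_raid_check_jobs is a name I made up, the directory list is the usual set and may not match your system, and the patterns are guesses (sync_action and raid-check come from the CentOS job; checkarray is, as far as I know, the name of the script Debian's mdadm package uses, so treat that as an assumption).

```shell
#!/bin/sh
# List files under the given directories that mention a RAID check trigger.
# find_raid_check_jobs is a hypothetical helper; the patterns are guesses
# at what a distribution's job might contain, not an exhaustive list.
find_raid_check_jobs() {
    grep -rlE 'sync_action|raid-check|checkarray' "$@" 2>/dev/null
}

find_raid_check_jobs /etc/cron.d /etc/cron.daily /etc/cron.weekly /etc/cron.monthly \
    || echo "no matching job found"
```

Any file it prints is a candidate; read it before changing anything, since the match may be in a comment.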

Edit 3: you might try

echo idle > /sys/block/md0/md/sync_action

to stop an in-progress sync check. But it's been a while since I had to disable one mid-check, so I can't swear to that.
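If you want to be a bit more careful, a small wrapper can write idle only when the array is actually running a check, and then show the state afterwards. Again a sketch under assumptions: stop_md_check is a hypothetical helper, and I haven't verified how quickly the kernel reports the new state on every version.

```shell
#!/bin/sh
# Ask md to stop a running check on one array, then print the new state.
# stop_md_check is a hypothetical helper; it takes the sync_action path so
# the array being touched (md0, md2, ...) is always explicit.
stop_md_check() {
    action_file="$1"
    if [ "$(cat "$action_file")" != "check" ]; then
        # Anything other than a check (e.g. a real resync) is left alone.
        echo "not running a check, leaving it alone"
        return 0
    fi
    echo idle > "$action_file"
    cat "$action_file"
}

# Needs root on a real system:
# stop_md_check /sys/block/md0/md/sync_action
```

The guard is there so that a genuine resync or recovery is never interrupted by accident.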

MadHatter
  • I don't have such a file; what's inside it? – Alex Flo Aug 14 '12 at 13:20
  • RAID health is checked daily and I receive an email should a HDD be broken (that is, it won't sync anymore). – Alex Flo Aug 14 '12 at 13:21
  • Thanks for the suggestion, I haven't found that file in /sys/block... but I'll have a look at the files there and see if I can find anything like "check". This could be the reason as the HDDs are successfully resynced each time. – Alex Flo Aug 14 '12 at 13:40
  • What do you see in /sys/block ? – MadHatter Aug 14 '12 at 13:41
  • I found the file you're talking about at /sys/block/md2/md/sync_action and it contains "check". What should I put there instead to stop this kind of check? – Alex Flo Aug 14 '12 at 13:45
  • Yes, that seems to be source of the problem, I found the script. Thanks for your help! 2 more questions: should there be any problems if I just deactivate this script from doing its monthly routine? Is there a way to stop a re-syncing once started? – Alex Flo Aug 14 '12 at 14:03
  • I have found no problems from deactivating the script. As for stopping one mid-sync, see above. And if you're OK with this answer, you may wish to consider accepting it by clicking the "tick" outline by it, which drives the SF reputation system for both of us; my apologies if you already know that. – MadHatter Aug 14 '12 at 14:12
  • Indeed it was a problem related to the way Debian triggers a RAID1 check each month (first Sunday of each month to be precise). I added an "exit" at the start of /etc/cron.d/mdadm file and this should prevent further occurrences. Thanks @MadHatter! – Alex Flo Aug 14 '12 at 14:14