15

This morning a drive failed on our database server. The drive array (3 disks) is set up in a RAID 5 configuration.

While we wait for a replacement drive, we are preparing a recovery strategy. Users are continuing to work on the system, albeit very slowly (we don't know why).

How does one install the new drive? Will the data for this drive automatically be rebuilt from the parity, or is there another process we should follow?

Edit: This is a hardware RAID controller. (Thanks for the answers so far, appreciated)

Teddy
Philip Fourie
  • 4
    By the way, the time to decide what to do if a drive fails on a critical server is *before* a drive fails on a critical server. – David Schwartz Aug 28 '11 at 10:53

6 Answers

14

The system is running very slowly because it has to reconstruct the missing data on the fly, which costs additional CPU and I/O.
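To make "reconstruct the missing data" concrete, here is a minimal sketch of the XOR parity idea behind RAID 5. It is a toy model of a single stripe on a 3-disk array, not how any particular controller is implemented; the block contents and the `xor_blocks` helper are made up for illustration.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 3-disk array: two data blocks plus one parity block.
data_a = b"ACCOUNTS"
data_b = b"INVOICES"
parity = xor_blocks(data_a, data_b)

# The disk holding data_b has failed. Its contents are not stored anywhere else,
# so every read of that block is recomputed from the surviving data and parity.
rebuilt_b = xor_blocks(data_a, parity)
print(rebuilt_b == data_b)
```

The same calculation, run over every stripe, is what fills the replacement drive during a rebuild. It also shows why a second failure is fatal: with two of the three blocks gone there is nothing left to XOR against.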

If you have a missing disk in a RAID-5 configuration you have no recovery strategy. If another disk goes down you will lose your data. Run, don't walk, to the nearest vendor from which you can get a compatible part, covered by the manufacturer's warranty, shipped by same-day urgent courier. If the vendor you bought the array from is already in the process of getting the part, get both parts and stash the second one away as a spare.

If you have a RAID-5 being used for a production system you should consider leaving a spare disk in the array as a hot spare.

Added - If your logs are not on a separate volume (physically separate disks), move them to a separate set of disks, even just a single mirrored pair. This will also be a performance win if your database has any significant load, as contention on log volumes has a disproportionately bad effect on performance.

If this is possible you can also make your database more robust by doing the following:

  1. Shut down the database.
  2. Back up the database.
  3. Move the logs to a physically separate set of disks (make sure you reconfigure the database so it knows where the logs have been moved to).
  4. Restart the database and application.

If you have the logs on a separate volume you can restore and roll forward from the backup if and only if a disk failure does not compromise the logs. Database logs should be on a separate disk volume for (amongst others) the following reasons:

  • Log usage patterns are predominantly sequential, appending log entries onto the end of the file (the file is in effect a ring buffer). This means that a large number of log entries can be written out quickly as there is little disk head seek activity.

  • If they are sharing physical disks with a heavily random access workload (e.g. transactional tables and indexes) they will be slowed down disproportionately as the head seek activity disrupts the sequential writes.

  • Having the logs on a separate volume is almost always a performance win and only needs a single mirrored pair for logs to support quite a heavy workload. This means that the hardware to do it is quite cheap, so there is a small cost for a big performance and reliability win.

  • If your data array goes down the logs are not lost. If you have a proper backup strategy you can restore from the backup and roll forward from the logs. This means that a whole array can go down on the server without being a single point of failure. Both the log and data arrays have to fail simultaneously to cause data loss.
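As a rough illustration of "restore from the backup and roll forward from the logs", here is a toy in-memory model. The dict standing in for the database, the list standing in for the log, and the names `commit` and `restore_and_roll_forward` are all hypothetical; a real database does this with its own backup and recovery tools.

```python
# Toy model: the "data array" is a dict, the "log volume" is an append-only list.
data = {}
log = []

def commit(key, value):
    """Append the write to the log, then apply it to the data."""
    log.append((key, value))
    data[key] = value

def restore_and_roll_forward(backup, log_entries, backup_position):
    """Start from the backup and replay every log entry written after it."""
    restored = dict(backup)
    for key, value in log_entries[backup_position:]:
        restored[key] = value
    return restored

commit("order:1", "paid")
commit("order:2", "open")

# Take a backup and remember how far into the log it reaches.
backup = dict(data)
backup_position = len(log)

# Work continues after the backup.
commit("order:2", "shipped")
commit("order:3", "open")

# The data array is lost, but the log lives on separate disks and survives.
expected = dict(data)
data = {}
recovered = restore_and_roll_forward(backup, log, backup_position)
print(recovered == expected)
```

If the log had been on the failed array as well, only the state captured in `backup` would be recoverable, which is the point of keeping the two on physically separate disks.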

  • Thanks for the answer especially explaining why the system is running slowly. – Philip Fourie Sep 25 '08 at 08:33
  • Spot on. I would even suggest shutting it down until you get that replacement drive in place. Like Nigel says, you have no recovery strategy. Lose another drive, lose it all. – Stu Thompson Sep 25 '08 at 09:04
  • Hi Nigel, thanks for taking the time and sharing your expertise. It is indeed great advice. I'll report back later on the outcome of the recovery. – Philip Fourie Sep 25 '08 at 09:32
5

1) Backup.

Right now no data has been lost. If your backups are not up to date, back up now.

2) Read the manual, call the vendor etc.

Different RAID systems have different steps for replacing a disk, and done wrong you risk destroying the whole array. Without knowing what sort of RAID hardware/software you have we can only guess at the steps needed.

Also, the slow performance is because RAID 5 in a degraded state (i.e. one disk dead) has horrible read performance. How horrible depends on how the parity is stored and which disk died, but the "good" news is that slow performance with one disk gone is a known issue and not cause for panic.
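A back-of-the-envelope sketch of where the extra reads come from, assuming a simple model in which a block on a healthy disk costs one read and a block on the dead disk has to be recomputed from every surviving disk in the stripe. The function name and the model are illustrative only; real controllers add caching and other overheads.

```python
from typing import Optional

def disk_reads_for_block(num_disks: int, block_disk: int, failed_disk: Optional[int]) -> int:
    """Physical reads needed to return one data block from a RAID 5 stripe."""
    if failed_disk is None or block_disk != failed_disk:
        return 1              # block is on a healthy disk: read it directly
    return num_disks - 1      # block is on the dead disk: read all survivors and XOR them

healthy = disk_reads_for_block(3, block_disk=1, failed_disk=None)
degraded = disk_reads_for_block(3, block_disk=1, failed_disk=1)
print(healthy, degraded)
```

On the 3-disk array in the question that is two reads instead of one for any block that lived on the failed disk, and the surviving disks are serving that extra load on top of normal traffic.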

DrStalker
  • 6,676
  • 24
  • 76
  • 106
4

First I would read the manual for the hardware/software that you're using - the section for failure recovery :)

Should be a simple matter of replacing the disk and rebuilding the array though.

The most important point in such cases is that the disk should be replaced as soon as possible, since if another disk fails you will probably lose data. You should also address the cause of the failure: was it because the disk was getting old? Should you replace the other ones too? Or was it because of a power surge, heat or vibration?

  • 1
    probably lose data? Most definitely lose all data on the array! Go to Jail, do not pass Go. (backups aside, of course.) – Stu Thompson Sep 25 '08 at 09:06
1

As far as I understand RAID5, when you replace the failed drive, it is automatically rebuilt from the information stored on the other two. Whether you can 'hot-swap' the new drive into place does depend on your system - you may have to power down first. Either way, considering the relatively low cost of drives and the importance of your data (reflected by your decision to use RAID5 in the first place), you really ought to have a spare drive sat in a drawer, ready for such an eventuality.

I've recently built a new development PC for myself, and set up the main data drives under RAID5. I ordered one more drive than necessary, so that I've got the spare ready for that emergency moment (that I'm hoping won't happen).

Now you've asked the question, I suppose I'd better read up on the subject some more.

  • For small data volumes a mirrored pair is better as it typically has better sequential access speed than a small RAID-5. If you want hot-swap, look at some of the hot-swap bay systems on somewhere like scsi4me.com – ConcernedOfTunbridgeWells Sep 26 '08 at 08:21
0

Totally system-dependent. What do the manuals say? Does your hardware completely support hotplugging new drives from the controller to the drive bay? Do you have recent backups?

0

NXC's post sums it up nicely. If you don't manage to replace the faulty drive before a second one fails, there is still a good chance of having almost everything (sometimes everything) recovered by a specialized recovery service. The data is still there on the disks, and a failed disk can usually be brought back to life in a specialized lab with the proper equipment. However, the price for this service is quite high. Having a spare disk and proper backups (as per NXC's suggestion) is definitely the way to go in future.