
We have recently been having strange issues with a RAID storage server, and I have no idea what is going on with the most recent one.

The config is RAID 5, 17 + 1 (17-disk volume + 1 dedicated hot spare).

A disk flagged itself as 'removed'. As we're on an extreme budget at the moment, we try reseating a drive before replacing it, provided only one drive has flagged at any point in time (the idea being we can afford that plus one other disk to fail due to the 17+1 config). The servers are barely in use in terms of actual data protection needs; the space is used as a kind of temporary processing sketchpad rather than for archiving important stuff. So it's not the end of the world, but we'd still like to have the RAID 5 buffer, plus the extra buffer of the dedicated spare.

I reseated the disk, and instead of the server returning to the 17+1 configuration, it bizarrely showed up as an 18-disk RAID 5 volume. In the past, reseating has returned the server to 17+1 as expected. Sometimes the +1 comes back as foreign or is not automatically assigned as a dedicated spare, but it always comes back separate from the 17 disks in use. In those cases, either the hot spare was the one that got removed, or one of the 17 got 'removed' and the hot spare automatically took its place in the RAID 5 volume, so that the reseated disk is surplus to the new set of 17.

What do I do? Presumably I can't shrink the volume down to 17 then re-assign the disk as a dedicated hot spare, as the raid volume is now 18 disks large. But if that's so, we no longer have a configuration offering us the ability to recover from 2 lost drives, as there's no 19th slot to install a dedicated hot spare.

2 Answers


An 18-disk RAID5 is a train wreck waiting to happen... I hope you have a good backup.

Seriously, you need to use RAID6 for anything beyond five disks or with disks larger than 1 TB.

Since you don't have the option to shrink the array, you'd need to add disks so you can migrate to RAID6, with or without hot spare. With 18 disks I'd seriously suggest using a RAID60 with nine-disk subarrays (thx @Nikita).
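To put numbers on the layouts mentioned so far, here is a minimal sketch that counts data disks and the number of simultaneous failures each layout is guaranteed to survive. The function name `layout_summary` and its parameters are made up for illustration; it assumes equal-sized disks and evenly sized parity groups, and it ignores controller-specific behaviour.

```python
def layout_summary(name, total_disks, groups, parity_per_group, spares=0):
    """Return (name, data_disks, guaranteed_failures_tolerated).

    Hypothetical helper: assumes equal-sized disks split evenly into
    `groups` parity groups, each with `parity_per_group` parity disks.
    """
    in_array = total_disks - spares
    if in_array % groups != 0:
        raise ValueError("disks must divide evenly into groups")
    data_disks = in_array - groups * parity_per_group
    # Worst case: all failures hit the same group, so the guarantee
    # is the per-group parity count, not the total parity count.
    return name, data_disks, parity_per_group


layouts = [
    layout_summary("RAID5 17 + 1 spare", 18, groups=1, parity_per_group=1, spares=1),
    layout_summary("RAID5 18", 18, groups=1, parity_per_group=1),
    layout_summary("RAID6 18", 18, groups=1, parity_per_group=2),
    layout_summary("RAID60 2 x 9", 18, groups=2, parity_per_group=2),
]

for name, data, tolerated in layouts:
    print(f"{name:20s} data disks: {data:2d}  failures survived (worst case): {tolerated}")
```

RAID60 gives up two more disks of capacity than a single RAID6, but each rebuild then only has to read the eight surviving members of one group instead of all seventeen.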

"we no longer have a configuration offering us the ability to recover from 2 lost drives"

You never had. RAID5 with a hot spare can recover from one lost drive and, once the rebuild has finished, may recover from another lost drive. If anything happens during the rebuild - which isn't uncommon - the array is lost.
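The difference between "two failures in a row" and "two failures at once" can be sketched as a tiny timeline check. This is only an illustration under stated assumptions: `survives` is a hypothetical function, the rebuild time is a made-up input, there is a single parity group, and read errors during the rebuild are not modelled.

```python
def survives(failure_hours, rebuild_hours, spares=1, parity=1):
    """Walk through drive failures in time order for one parity group.

    failure_hours: hours at which a member drive dies.
    rebuild_hours: assumed time for a spare to finish rebuilding.
    Returns True if the array never has more missing members than parity.
    """
    rebuild_done = []  # completion times of rebuilds still in progress
    missing = 0        # members currently absent or still rebuilding
    for t in sorted(failure_hours):
        # Rebuilds that finished before this failure restore redundancy.
        missing -= sum(1 for done in rebuild_done if done <= t)
        rebuild_done = [done for done in rebuild_done if done > t]
        missing += 1
        if missing > parity:
            return False  # more members gone than parity can cover
        if spares > 0:
            spares -= 1
            rebuild_done.append(t + rebuild_hours)
    return True


# Second drive dies after the spare has finished rebuilding: degraded but alive.
print(survives([1, 4], rebuild_hours=2))
# Second drive dies while the spare is still rebuilding: array lost.
print(survives([1, 3], rebuild_hours=4))
# Same overlap with double parity (RAID6-style): still alive.
print(survives([1, 3], rebuild_hours=4, parity=2))
```

The hot spare only shortens the window of exposure; it adds no redundancy while the rebuild is running.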

"there's no 19th slot to install a dedicated hot spare."

If you can't add drives, you're pretty much out of space anyway. Either test your backup-and-recovery scheme, then delete the array and recreate it as RAID6 or - better - RAID60 this time, or consider migrating to a new server.

If there's no budget and no maintenance window for recreating the array, you're pretty much out of options. Make sure there's a reliable and well-tested backup (two backup instances, on different media, with testing that includes bare-metal recovery), run regular scrubbing (significantly reducing the chance of hitting stale data errors while rebuilding), stop the reseating practice (which might have gotten you into this pickle in the first place), and cross your fingers. You're running on fumes.

By the way, have you estimated the cost and scenario of the array failing altogether?

Zac67
  • A single RAID6 out of 18 devices is almost as bad as RAID5. I'd suggest RAID60, two groups of 9 disks. – Nikita Kipriyanov May 18 '22 at 12:32
  • @NikitaKipriyanov Fair point - a single large RAID5 is far worse than a single large RAID6 though. – Zac67 May 18 '22 at 13:38
  • Thanks! This is my first large modern storage system. I'm trying to work out the best way to manage it. It was designed by a comp. scientist who is 'oldschool'. Smart, but not business/cost/performance focused. The cost of loss isn't zero, but it is not particularly high. We take disks and try to extract useful info from them, either actual file data or metadata/relationships. The actual data generated is many times the source drive size, but probably 99.9% is discarded once reported on/collated. The main thing I found weird is why the RAID 5 volume changed from 17 disks to 18 automatically. – gavinpeters86 May 19 '22 at 02:38
  • WRT the comment on failure tolerance: I think I understand what you mean. There are 2 possibilities. A) 0000h = 17 used | 0 rebuilding | 1 spare | 0 dead; 0100h = 16 used | 1 rebuilding | 0 spare | 1 dead; 0300h = 15 used | 1 rebuilding | 0 spare | 2 dead; 0400h = second dies before spare rebuilds, 15/17 can't recover. B) 0000h = 17 used | 0 rebuilding | 1 spare | 0 dead; 0100h = 16 used | 1 rebuilding | 0 spare | 1 dead; 0300h = 17 used | 0 rebuilding | 0 spare | 1 dead; now a 2nd death, but RAID rebuilt already; 0400h = 16 used | 0 rebuilding | 0 spare | 2 dead. – gavinpeters86 May 19 '22 at 05:43

RAID6 is much better than R5 + hot spare, as twice as many disks are allowed to fail at once. Actually it's a horror with so many drives anyway, but less of a horror than R5 would be ;)

Now you have R5 on 18 disks, so if anything fails you're relying on correctly reading all sectors from the 17 remaining disks (due to how parity works, the controller needs to read every sector, empty or used). Moreover, some of these disks are unstable and may have failed already.
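As a rough illustration of why reading 17 full disks without a single error is a lot to ask, here is a back-of-the-envelope sketch. The disk size and the unrecoverable read error (URE) rate below are assumptions, not figures from this system: 1 error per 1e14 bits is a commonly quoted datasheet value for consumer drives, and real drives often do better or worse. The model also treats errors as independent, which is a simplification.

```python
import math


def rebuild_read_failure_probability(surviving_disks, disk_bytes, ure_per_bit):
    """Probability of at least one unrecoverable read error while reading
    every bit of every surviving disk, assuming independent errors."""
    bits_read = surviving_disks * disk_bytes * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Hypothetical inputs: 17 surviving disks, 4 TB each, URE rate 1e-14 per bit.
p = rebuild_read_failure_probability(17, 4 * 10**12, 1e-14)
print(f"chance of hitting a URE during rebuild: {p:.1%}")
```

With these assumed inputs the result comes out above 99%. The exact number matters less than the trend: the probability climbs quickly with disk count and disk size, which is the argument for double parity and smaller groups.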

Usually a drive gets kicked out of a RAID because it takes longer than usual to read data. That is usually a sign of the drive being on the brink of failure, and it may or may not show up in SMART. These are probably the "strange issues": failing drives which can later be re-attached to the array.

Reseating a drive and continuing to use it may be acceptable on RAID 1/RAID 10, but not on a setup where you have no margin. In a case like yours I'd assume that the array is dead already; if not, it probably will be very soon.

So IMO the solution would be: use R10 with these shitty, failing drives and limit resource usage somehow, OR do R6 with a spare so it gets rebuilt instantly after one drive is lost. If you're on a budget, it's better to keep some retention and delete historical data than to lose everything.

Probably you need to start fixing it ASAP. Speak with the boss and communicate that this raid layout is inadequate and there are 3 options:

  • Continuing in R5 and losing everything in the not-so-distant future
  • Rebuilding in R10 and limiting data stored
  • R6 + spare, which is probably a very bad idea, but maybe you could do R6 without a spare for this temporary processing and R1 for all important stuff (this way you won't be sacrificing too much storage)

Actually you're very lucky that this is still working...

Slawek
  • Thanks. Yes, we're probably lucky to be running on this data set. Will do some reading about other designs etc. When I did my formal IT education, storage tech was not as advanced as it is these days (not stone age, but a long time ago now), so I definitely have a lot more to get my head around. The designer of the system (my boss) is also more a science/research boffin than a data storage expert. He is perfectly open to new ideas, but budget constraints are intense and there's not a great fear of data loss. It's more about what we learn from the processing than about building a legacy/archive. – gavinpeters86 May 19 '22 at 05:09