
I do not know whether this requires immediate action. The server runs Windows Server 2012 R2 with two 500GB SSDs plus one identical Global Hot Spare. The RAID controller is a MegaRAID. Screenshots from the RAID utility are attached below. This is a business in a small town where we have no easy access to hardware professionals. I am a software developer so, well, I'm all we have.

The configuration has a separate 1TB drive that is used for "scratch" storage and does not require mirroring or backup. Then, three Samsung 500GB SSDs, two mirrored and one configured as a Global Hot Spare.

We've started seeing what appear to be disk errors in the log shown below. My objective would be to determine which drive is failing and swap it out with the Hot Spare that was originally installed for this type of situation.

These drives have been running for about 3.5 years 24/7 without incident.

So, my questions are:

  1. Given this is the first evidence of any drive problem and the fact the software indicates the status as "optimal", do I need to replace immediately? This being an SSD, do I expect it to fail as a spinning drive would, i.e., getting quickly worse? Or as an SSD is the eventual failure a ways off?

  2. Given I should replace now, I have no idea how to approach this with this software in the most straightforward way. Intuitively, I should be able to determine which drive is failing, but the message seems to convey no information of that sort. Then, add the Hot Spare to the array, and remove the failing drive.

3(a). How do I determine which of the existing drives is the problem drive?

3(b). How do I remove the failing drive from the array and replace it with the Hot Spare to rebuild?

3(c). Can this all be done from the Windows utility, or must it be done from the bootup RAID settings screen? This utility SEEMS to support these operations.

I will deeply appreciate any input on this problem. I'm trying to deal with it before we start losing data or having downtime, but I find that getting the array up originally a few years ago was a simpler problem than swapping out a potentially failing drive.

Thanks, in advance.

[Screenshots: Megaraid Screen 1, Megaraid Screen 2, Megaraid Screen 3]

  • Did you look at `Medium Error Count` of each disk on the `Physical` tab? I guess the problem is rather related to your 1TB disk as it is reported `PD 0:7`, controller 0, slot 7. – Thomas Jun 02 '19 at 07:06

1 Answer


What do the SMART details say about damaged sectors and read errors? What about the remaining lifetime in %? In general, if the drives still show as Optimal, you can safely keep using them, although this varies from drive to drive. (I still use one on a daily basis that was reported as "BAD condition" two years ago without a problem, though I would not recommend that for important data, and certainly not in a business environment.)
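If the RAID utility can export a per-drive text listing, comparing the error counters across drives can be scripted. The following is only a minimal sketch: it assumes the listing contains lines such as `Slot Number: N` and `Media Error Count: N`, which is the general shape of a MegaCli physical-drive listing, but the exact field names on your controller and firmware are an assumption, and the sample text and its values are invented for illustration.

```python
# Hypothetical sample in the style of a MegaCli physical-drive listing.
# The values are invented; they are not taken from the question's screenshots.
SAMPLE = """\
Enclosure Device ID: 252
Slot Number: 0
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up

Enclosure Device ID: 252
Slot Number: 7
Media Error Count: 12
Other Error Count: 3
Predictive Failure Count: 0
Firmware state: Online, Spun Up
"""

# Counter fields assumed to be present in the listing.
COUNTERS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")


def parse_drives(text):
    """Return one dict per drive, keyed by field name, with integer counters."""
    drives = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            # A slot line starts a new drive record.
            current = {"Slot Number": int(value)}
            drives.append(current)
        elif current is not None and key in COUNTERS:
            current[key] = int(value)
    return drives


def suspect_slots(drives):
    """Return slot numbers of drives with any non-zero error counter."""
    return [
        d["Slot Number"]
        for d in drives
        if any(d.get(counter, 0) > 0 for counter in COUNTERS)
    ]


if __name__ == "__main__":
    for slot in suspect_slots(parse_drives(SAMPLE)):
        print(f"Slot {slot} reports errors; check it first")
```

In practice you would replace `SAMPLE` with the contents of a file saved from your own controller's tool. A non-zero counter points at the drive to investigate; it does not by itself prove that the drive must be replaced.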

Since you have a Hot Spare drive (that is, if it's really set up as a Hot Spare!), it will replace a failing drive automatically as soon as it's needed; that's why it's called a hot spare. Normally, you don't need to intervene manually here. Make sure your backups are okay though: recent, with a long enough history, and verified against corruption.

Some other thinking:

How much budget do you have available? If you have some, buy an extra drive just in case. Be sure that it's the same model for best reliability. If you do, check the warranty on the drive that's showing errors and have it replaced. If it's a decent SSD brand and model, you probably have at least 5 years of warranty (unless the maximum TBW is exceeded). If there's plenty of budget, buy more than one.

How important is uptime? If downtime is totally unacceptable, you should invest in high availability of your storage, meaning a spare storage system in case your current goes down. A cloud backup storage is one option, but you'll need a good internet connection for that. Another option is an extra NAS. If budget is tight, a second hand system is also a decent option to have as a backup plan to reduce downtime.

About hardware tech support, there are remote options too. Don't go trying things too quickly by yourself because there's a chance you'll f* things up and cause downtime to the company.

I'm not familiar with Megaraid, but the software of your RAID controller should be enough to replace a failing drive or modify your RAID setup.

aardbol