
I have two WD HDDs in a software RAID1 (Linux md, no dedicated RAID hardware):

$ lsscsi --verbose
[0:0:0:0]    disk    ATA      WDC WD10EFRX-68F 0A82  /dev/sda 
  dir: /sys/bus/scsi/devices/0:0:0:0  [/sys/devices/pci0000:00/0000:00:17.0/ata1/host0/target0:0:0/0:0:0:0]
[1:0:0:0]    disk    ATA      WDC WD10EFRX-68F 0A82  /dev/sdb 
  dir: /sys/bus/scsi/devices/1:0:0:0  [/sys/devices/pci0000:00/0000:00:17.0/ata2/host1/target1:0:0/1:0:0:0]

One of them (/dev/sdb) has started to produce a peculiar noise. I ran the SMART overall-health self-assessment, which returned PASSED; here's the output:

$ sudo smartctl -a /dev/sdb

... 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   131   129   021    Pre-fail  Always       -       4450
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       201
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       18349
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       201
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       135
193 Load_Cycle_Count        0x0032   191   191   000    Old_age   Always       -       29322
194 Temperature_Celsius     0x0022   112   101   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         0         -

Looks okay-ish, right? Note that this RAID is inside a server with long uptime; I hadn't checked on it for something like 70 days and it was working just fine, so when I finally got to it in person and heard the unfamiliar sound, I turned it off.

QUESTIONS:

  1. Here's the sound of my HDD (it does not click; it only produces a sawing-like noise, and the clicks in the recording are me moving). What could this be?

  2. Is it "safe" to run sudo smartctl -t short /dev/sdb? One answer states it is "as safe as continuing using a failed drive", but my question is not about that obvious risk; it is about the RAID. Can Linux do something undesired while a single drive of a mounted RAID is being tested? Of an unmounted one? Does the test power the drive off for a short period? Can the test affect both drives?

miqem
2 Answers


I can't answer your first question with certainty; I've never heard a hard drive make that noise before. When I have heard that noise, it has come from something decidedly lower-RPM than a disk drive: usually a fan, usually a failing bearing. If the drive mount is still solid and nothing is touching the drive itself (most RAID boxes isolate the drive a bit with rubber mounts), then the only thing I can think of is bearing failure in the drive motor. I'm assuming the noise doesn't change as the drive is being accessed.

I'm hoping that you have this thing in a mirrored setup (RAID1) rather than as a simple stripe set. If so, you can lose a drive without it affecting your data. I believe the WD10EFRX is a 1 TB disk, no? Those are pretty cheap, so maybe the thing to do, if you're sure it's the disk, is to use mdadm to "fail" it, physically replace it, and then add the new disk to the array. It will take about an hour to re-sync, and in my experience that is well worth it for the peace of mind. You can then test the failing drive on separate hardware, so the test process can't harm your data. And yes, I know that doesn't answer your second question either, but it is a way to avoid any possibility of disruption from either the short or long SMART test.
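In mdadm terms that sequence looks roughly like this (a sketch; the array name /dev/md0 and the member name /dev/sdb1 are assumptions, so check /proc/mdstat for yours first):

```shell
# Sketch of replacing a failing RAID1 member with mdadm.
# /dev/md0 and /dev/sdb1 are assumptions; check /proc/mdstat
# for your actual array and member names.
ARRAY=/dev/md0
DEV=/dev/sdb1

if [ -b "$ARRAY" ]; then
    sudo mdadm --manage "$ARRAY" --fail "$DEV"     # mark the noisy disk failed
    sudo mdadm --manage "$ARRAY" --remove "$DEV"   # drop it from the array
    # ...power down, swap the disk, partition it to match the survivor...
    sudo mdadm --manage "$ARRAY" --add "$DEV"      # start the re-sync
    cat /proc/mdstat                               # watch re-sync progress
fi
```

The `--add` of the fresh disk kicks off the re-sync automatically; the array stays usable (degraded, then syncing) the whole time.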

tsc_chazz

The point of having a RAID is that you don't have to worry too much about failing hard disks.

You do need to run regular checks, though; every week or two is good. These read all the data from all disks, verify that the copies are consistent, and rewrite any sectors that fail to read (so the disk can reallocate them).
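On Linux md such a check (a "scrub") is triggered through sysfs; a sketch, assuming the array is md0:

```shell
# Trigger a consistency check ("scrub") of an md array.
# "md0" is an assumption; see /proc/mdstat for your array name.
MD=md0

if [ -e "/sys/block/$MD/md/sync_action" ]; then
    echo check | sudo tee "/sys/block/$MD/md/sync_action"
    cat /proc/mdstat                        # progress appears here
    cat "/sys/block/$MD/md/mismatch_cnt"    # inconsistencies found so far
fi
```

Debian-family distributions typically ship a cron job (checkarray) that does this monthly already, so check whether one is in place before scripting your own.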

The SMART attributes look absolutely fine, and only two attributes are updated by running "offline" tests; the rest are a running commentary on normal operation. You can (and should) run the long offline test periodically while there is little going on, since drive activity interrupts the test and makes it return to the last checkpoint (otherwise it is fine to run the tests at any time).
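If you want a quick machine check of the attributes that actually predict failure (reallocated, pending, and uncorrectable sectors), a small filter over `smartctl -A` output works; a sketch:

```shell
# Flag the SMART attributes that most reliably predict failure:
#   5 Reallocated_Sector_Ct, 196 Reallocated_Event_Count,
#   197 Current_Pending_Sector, 198 Offline_Uncorrectable.
# A nonzero raw value on any of them is worth acting on.
smart_warn() {
    awk '$1 ~ /^(5|196|197|198)$/ && $NF + 0 > 0 { print "WARN:", $2, "raw =", $NF }'
}

# Usage: sudo smartctl -A /dev/sdb | smart_warn
# With the values from the question it prints nothing (all raw values are 0):
printf '%s\n' \
  '  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0' \
  '197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0' \
  | smart_warn
```

In your output all four are zero, which supports the "attributes look fine" reading.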

So if that disk is going to fail, it is going to fail suddenly, not gradually.

What you can do now is add a third disk, increase the number of mirrors to three, and just leave the old disk running. You will get better read performance out of that, and the setup will still be redundant if one disk fails.
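Growing a two-way md RAID1 into a three-way mirror is two mdadm calls (a sketch; /dev/md0 and the new member /dev/sdc1 are assumptions):

```shell
# Grow a 2-way md RAID1 into a 3-way mirror.
# /dev/md0 and the new member /dev/sdc1 are assumptions.
ARRAY=/dev/md0
NEW=/dev/sdc1

if [ -b "$ARRAY" ] && [ -b "$NEW" ]; then
    sudo mdadm --manage "$ARRAY" --add "$NEW"    # new disk joins as a spare
    sudo mdadm --grow "$ARRAY" --raid-devices=3  # promote it: 3 active mirrors
    cat /proc/mdstat                             # watch the new disk sync
fi
```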

Since this is software RAID, you will also need to investigate whether your bootloader is properly installed everywhere.
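For a BIOS/GRUB setup (an assumption; an EFI system needs an ESP on each disk instead), that means installing the bootloader onto every member disk, so the machine still boots if either one dies:

```shell
# Install GRUB onto both RAID1 member disks (BIOS boot with GRUB
# assumed; adjust for your bootloader and disk names).
sudo grub-install /dev/sda
sudo grub-install /dev/sdb
```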

Simon Richter