
Earlier today I received an automated email from mdadm monitor with the following:

This is an automatically generated mail message from mdadm
running on server

A Fail event had been detected on md device /dev/md127.

It could be related to component device /dev/sdd1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid10 sda1[0] sdc1[2] sdd1[3](F) sdb1[1]
  5860267008 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]
  [==========>..........]  check = 50.9% (2983082496/5860267008) finish=1025.1min speed=46774K/sec

unused devices: <none>

I just logged into the server and ran `cat /proc/mdstat`; this is the result:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid10 sda1[0] sdb1[1] sdc1[2]
      5860267008 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]

unused devices: <none>

Did I understand this right? Has a drive failed?

BrokenCode

2 Answers

Yes, sdd1 has failed.

From the original email

md127 : active raid10 sda1[0] sdc1[2] sdd1[3](F) sdb1[1]
      5860267008 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]

and from your observation

md127 : active raid10 sda1[0] sdb1[1] sdc1[2]
      5860267008 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]

Your array should have 4 active devices, but it only has 3: the [4/3] [UUU_] status shows one slot missing, and the email marks sdd1 with (F) for failed.
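
To confirm and recover, here is a minimal sketch (assuming the replacement disk also shows up as /dev/sdd and is partitioned to match the other members; adjust device names for your system):

    # Inspect the array; look for "State : clean, degraded" and the faulty slot
    mdadm --detail /dev/md127

    # Remove the failed member, then add the replacement after partitioning it
    mdadm --manage /dev/md127 --remove /dev/sdd1
    mdadm --manage /dev/md127 --add /dev/sdd1

    # Watch the rebuild progress
    cat /proc/mdstat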

user9517

Your sdd is a 512e disk (4K physical sectors with 512-byte emulation). That means you can run into obscure problems with an old BIOS, an old SATA controller, or old OS drivers. The SMART data for this disk shows good health, so the sdd disk itself is fine!

Why was it marked as failed? I think there are several possible causes:

  1. Insufficient power. Simply install a new, higher-wattage power supply unit; this disk can draw up to 20 W under high load.

  2. Disable server power management, and do the same for the disks, or set everything to maximum performance. Also disable OS power services such as powerd, cpuspeed, etc. (see the hdparm sketch after this list).

  3. An old SATA controller can misbehave under high load. Try updating the BIOS. If that does not help, install a new SATA controller (one that properly supports 4K and 512e disks) or replace the motherboard with a new one.
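
For point 2, a minimal sketch of turning off drive-level power management with hdparm (assuming hdparm is installed; some drives only accept 254 instead of 255):

    # Disable Advanced Power Management on each member disk (255 = APM off)
    hdparm -B 255 /dev/sdd

    # Disable the standby (spin-down) timer
    hdparm -S 0 /dev/sdd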

Another way is to use only the older 512n disk models; they are available in capacities up to 4 TB.
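
If you want to check for yourself whether a disk is 512n or 512e, a quick sketch (a 512e drive reports 512-byte logical and 4096-byte physical sectors):

    # smartmontools prints the sector sizes in the identify section
    smartctl -i /dev/sdd

    # Or ask the kernel directly
    blockdev --getss /dev/sdd     # logical sector size
    blockdev --getpbsz /dev/sdd   # physical sector size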

Mikhail Khirgiy
  • Something strange is happening; today I turned on the server and received this message: `This is an automatically generated mail message from mdadm running on server A DegradedArray event had been detected on md device /dev/md127. Faithfully yours, etc. P.S. The /proc/mdstat file currently contains the following: Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md127 : active raid10 sdb1[0] sdc1[1] sdd1[2] 5860267008 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_] unused devices: <none>` sdd is back, sde is gone. – BrokenCode Sep 06 '17 at 19:02
  • Why would these problems start appearing suddenly? The server and RAID array were working fine for years. – BrokenCode Sep 06 '17 at 19:04
  • Maybe it's the power. Try changing the power supply unit. – Mikhail Khirgiy Sep 06 '17 at 19:20
  • Thanks, I will look into that. For now I re-added the disk to the array and it is resyncing now. – BrokenCode Sep 06 '17 at 19:53
  • Also try increasing the delay in seconds between each disk power-on in the motherboard or RAID BIOS. – Mikhail Khirgiy Sep 07 '17 at 03:17
  • The sync failed. I received a message with a FailSpare event for drive sde. – BrokenCode Sep 07 '17 at 21:20
  • Why `sde`? You had 4 disks: `sda`, `sdb`, `sdc`, `sdd`. What changed? – Mikhail Khirgiy Sep 08 '17 at 04:05