I've been running a RAID5 array of HDDs (8x6TB) on an HP P840 for about 2 years now, and it has always had unusually many drive failures. Everything was fine for the first half year, but now drives are failing in a strange way. For example, 2 new drives failed a few days after they were added to the RAID. I have already replaced the RAID controller and am running up-to-date firmware on both the mainboard and the RAID controller.

I have also tried different drives. Initially the RAID used HGST DeskStar 6TB drives; since then I have been swapping in HGST UltraStar 6TB drives whenever one failed. The behaviour is the same.

It also seems that (most of) the drives have not really failed: once I replaced the RAID controller, one failed drive was recognized as OK again and the rebuild started.

My hoster's support says the problem is that I'm using RAID5 and that I should switch to RAID10 instead. I find that hard to believe, as I've been using RAID5 without problems on other systems (no drive failures in years).

Can anyone give me a hint as to what else the culprit could be? Is something wrong with the way the RAID controller is configured?

Thank you!

EDIT:
The server is an HP DL180 G9.
The reason for drive failure is always "Write retries failed".

UPDATE: Our hoster offered to completely replace the hardware and switch to RAID6. We did that, and it's been running smoothly for a while now. Although this was not really investigated, shodanshok's explanation about punctured arrays seems reasonable to me, so I will accept that answer. Thanks everybody!

  Smart Array P840 in Slot 1                (sn: PDNNF0ARH321GD)


     Port Name: 1I

     Port Name: 2I

     Internal Drive Cage at Port 1I, Box 2, OK

     Internal Drive Cage at Port 1I, Box 2, OK

     Internal Drive Cage at Port 2I, Box 1, OK
     array A (Solid State SATA, Unused Space: 0  MB)


  logicaldrive 1 (447.1 GB, RAID 1+0, OK)

  physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
  physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)

     array B (SATA, Unused Space: 0  MB)


  logicaldrive 2 (38.2 TB, RAID 5, Interim Recovery Mode)

  physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:5 (port 1I:box 2:bay 5, SATA, 6001.1 GB, Failed)
  physicaldrive 1I:2:6 (port 1I:box 2:bay 6, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:7 (port 1I:box 2:bay 7, SATA, 6001.1 GB, OK)
  physicaldrive 1I:2:8 (port 1I:box 2:bay 8, SATA, 6001.1 GB, OK)

Detailed Info:

     Smart Array P840 in Slot 1
        Bus Interface: PCI
        Slot: 1
        Serial Number: PDNNF0ARH321GD
        Cache Serial Number: PEYFP0BRH323YZ
        RAID 6 (ADG) Status: Enabled
        Controller Status: OK
        Hardware Revision: B
        Firmware Version: 6.60
        Rebuild Priority: High
        Expand Priority: Medium
        Surface Scan Delay: 3 secs
        Surface Scan Mode: Idle
        Parallel Surface Scan Supported: Yes
        Current Parallel Surface Scan Count: 1
        Max Parallel Surface Scan Count: 16
        Queue Depth: Automatic
        Monitor and Performance Delay: 60  min
        Elevator Sort: Enabled
        Degraded Performance Optimization: Disabled
        Inconsistency Repair Policy: Disabled
        Wait for Cache Room: Disabled
        Surface Analysis Inconsistency Notification: Disabled
        Post Prompt Timeout: 15 secs
        Cache Board Present: True
     Cache Status: OK
     Cache Ratio: 10% Read / 90% Write
     Drive Write Cache: Enabled
     Total Cache Size: 4.0 GB
     Total Cache Memory Available: 3.2 GB
     No-Battery Write Cache: Enabled
     SSD Caching RAID5 WriteBack Enabled: True
     SSD Caching Version: 2
     Cache Backup Power Source: Batteries
     Battery/Capacitor Count: 1
     Battery/Capacitor Status: OK
     SATA NCQ Supported: True
     Spare Activation Mode: Activate on physical drive failure (default)
     Controller Temperature (C): 51
     Cache Module Temperature (C): 38
     Number of Ports: 2 Internal only
     Encryption: Disabled
     Express Local Encryption: False
     Driver Name: hpsa
     Driver Version: 3.4.16
     Driver Supports HP SSD Smart Path: True
     PCI Address (Domain:Bus:Device.Function): 0000:06:00.0
     Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
     Controller Mode: RAID
     Controller Mode Reboot: Not Required
     Latency Scheduler Setting: Disabled
     Current Power Mode: MaxPerformance
     Host Serial Number: CZ270500GM
     Sanitize Erase Supported: False
     Primary Boot Volume: logicaldrive 1 (600508B1001CE0F9FACF3A1358647115)
     Secondary Boot Volume: logicaldrive 1 (600508B1001CE0F9FACF3A1358647115)


     Port Name: 1I
           Port ID: 0
           Port Connection Number: 0
           SAS Address: 5001438038AD05A0
           Port Location: Internal
           Managed Cable Connected: False

     Port Name: 2I
           Port ID: 1
           Port Connection Number: 1
           SAS Address: 5001438038AD05A8
           Port Location: Internal
           Managed Cable Connected: False

     Internal Drive Cage at Port 1I, Box 2, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 1I
        Box: 2
        Location: Internal

     Physical Drives
        physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
        None attached


     Internal Drive Cage at Port 1I, Box 2, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 1I
        Box: 2
        Location: Internal

     Physical Drives
        physicaldrive 1I:2:1 (port 1I:box 2:bay 1, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:2 (port 1I:box 2:bay 2, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:3 (port 1I:box 2:bay 3, SATA, 6001.1 GB, OK)
        physicaldrive 1I:2:4 (port 1I:box 2:bay 4, SATA, 6001.1 GB, OK)
        None attached


     Internal Drive Cage at Port 2I, Box 1, OK
        Power Supply Status: Not Redundant
        Drive Bays: 4
        Port: 2I
        Box: 1
        Location: Internal

     Physical Drives
        physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
        None attached

     Array: A
        Interface Type: Solid State SATA
        Unused Space: 0  MB (0.0%)
        Used Space: 894.2 GB (100.0%)
        Status: OK
        MultiDomain Status: OK
        Array Type: Data
        HP SSD Smart Path: disable



  Logical Drive: 1
     Size: 447.1 GB
     Fault Tolerance: 1+0
     Heads: 255
     Sectors Per Track: 32
     Cylinders: 65535
     Strip Size: 256 KB
     Full Stripe Size: 512 KB
     Status: OK
     MultiDomain Status: OK
     Caching:  Enabled
     Unique Identifier: 600508B1001CE0F9FACF3A1358647115
     Disk Name: /dev/sda
     Mount Points: / 18.6 GB Partition Number 2
     OS Status: LOCKED
     Logical Drive Label: 0216D6F9PDNNF0ARH502MC7DFA
     Mirror Group 1:
        physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 240.0 GB, OK)
     Mirror Group 2:
        physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 240.0 GB, OK)
        physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 240.0 GB, OK)
     Drive Type: Data
     LD Acceleration Method: Controller Cache

  physicaldrive 2I:1:1
     Port: 2I
     Box: 1
     Bay: 1
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712004AG240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 31
     Maximum Temperature (C): 39
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:2
     Port: 2I
     Box: 1
     Bay: 2
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV706303CH240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 29
     Maximum Temperature (C): 36
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:3
     Port: 2I
     Box: 1
     Bay: 3
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712003V8240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 29
     Maximum Temperature (C): 35
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 2I:1:4
     Port: 2I
     Box: 1
     Bay: 4
     Status: OK
     Drive Type: Data Drive
     Interface Type: Solid State SATA
     Size: 240.0 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Firmware Revision: N2010101
     Serial Number: PHDV712004GA240AGN
     Model: ATA     INTEL SSDSC2BB24
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 31
     Maximum Temperature (C): 37
     SSD Smart Trip Wearout: Not Supported
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False


     Array: B
        Interface Type: SATA
        Unused Space: 0  MB (0.0%)
        Used Space: 43.7 TB (100.0%)
        Status: Failed Physical Drive
        MultiDomain Status: OK
        Array Type: Data
        HP SSD Smart Path: disable

        Warning: One of the drives on this array have failed or has been removed.




  Logical Drive: 2
     Size: 38.2 TB
     Fault Tolerance: 5
     Heads: 255
     Sectors Per Track: 32
     Cylinders: 65535
     Strip Size: 256 KB
     Full Stripe Size: 1792 KB
     Status: Interim Recovery Mode
     MultiDomain Status: OK
     Caching:  Enabled
     Parity Initialization Status: Initialization Failed
     Unique Identifier: 600508B1001CF94F84873C91FD89B549
     Disk Name: /dev/sdb
     Mount Points: None
     Logical Drive Label: 04DA1DD6PDNNF0ARH502MC546F
     Drive Type: Data
     LD Acceleration Method: Controller Cache

  physicaldrive 1I:2:1
     Port: 1I
     Box: 2
     Bay: 1
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: NAHN3UZY
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 37
     Maximum Temperature (C): 43
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:2
     Port: 1I
     Box: 2
     Bay: 2
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNT517
     Serial Number: NAHLKP0X
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 37
     Maximum Temperature (C): 56
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:3
     Port: 1I
     Box: 2
     Bay: 3
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: NCH8E81Z
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 33
     Maximum Temperature (C): 41
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:4
     Port: 1I
     Box: 2
     Bay: 4
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: NAHYMAUY
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 34
     Maximum Temperature (C): 41
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:5
     Port: 1I
     Box: 2
     Bay: 5
     Status: Failed
     Last Failure Reason: Write retries failed
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: K1H942MD
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Maximum Temperature (C): 43
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Applicable
     Sanitize Erase Supported: False

  physicaldrive 1I:2:6
     Port: 1I
     Box: 2
     Bay: 6
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: TDR2
     Serial Number: K8JM5TKN
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 33
     Maximum Temperature (C): 38
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:7
     Port: 1I
     Box: 2
     Bay: 7
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: APGNW7JH
     Serial Number: K8H9BW2N
     Model: ATA     HGST HDN726060AL
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 34
     Maximum Temperature (C): 39
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False

  physicaldrive 1I:2:8
     Port: 1I
     Box: 2
     Bay: 8
     Status: OK
     Drive Type: Data Drive
     Interface Type: SATA
     Size: 6001.1 GB
     Drive exposed to OS: False
     Native Block Size: 4096
     Rotational Speed: 7200
     Firmware Revision: T7MH
     Serial Number: K1H623JD
     Model: ATA     HUS726060ALE610
     SATA NCQ Capable: True
     SATA NCQ Enabled: True
     Current Temperature (C): 35
     Maximum Temperature (C): 40
     PHY Count: 1
     PHY Transfer Rate: 6.0Gbps
     Drive Authentication Status: Not Authenticated. Smart Array will not control drive LEDs.
     Sanitize Erase Supported: False
Laord
  • One of the drives reports 56C; maybe that's a thermal problem? All your drives seem to be 4K, so I guess the stripes would be correctly aligned. – eckes Mar 17 '19 at 16:16
  • Please don't use R5 in 2015, not with big slow disks anyway, it's considered super dangerous to your data and has been for over a decade now - in fact vendors shipping disk controllers capable of R5 is borderline negligent these days. – Chopper3 Mar 29 '19 at 17:04

2 Answers

You probably have a heavily punctured array, which causes an early "planned death" of the replacement disk due to failed stripe reconstruction. You can read more information here and here.

The solution is to back up, destroy the array, recreate it, and restore from backup.

Next time avoid using a RAID5 array with such big drives. I strongly suggest using RAID6 or, even better, RAID10.
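To put the trade-off in numbers, here is a minimal sketch that compares usable capacity and guaranteed fault tolerance for the 8x6TB layout from the question. The function names are made up for illustration, the formulas are the textbook ones (controller metadata is ignored), and it assumes the tool prints drive sizes in decimal GB but logical-drive sizes in binary units.

```python
# Sketch only: textbook capacity formulas, not anything read from the controller.
TIB = 2 ** 40                      # assumed unit behind the "TB" in the logical drive size
DRIVE_BYTES = 6_001_100_000_000    # "6001.1 GB" per drive, taken as decimal GB (assumption)


def usable_bytes(level, drives, drive_bytes):
    """Usable capacity for a few common RAID levels."""
    if level == "raid5":
        data_drives = drives - 1       # one drive's worth of parity
    elif level == "raid6":
        data_drives = drives - 2       # two drives' worth of parity
    elif level == "raid10":
        data_drives = drives // 2      # every drive is mirrored once
    else:
        raise ValueError(f"unsupported level: {level}")
    return data_drives * drive_bytes


def guaranteed_failures(level):
    """Number of drive failures survived in the worst case."""
    # RAID10 can survive more if the failures hit different mirror pairs,
    # but only one failure is guaranteed.
    return {"raid5": 1, "raid6": 2, "raid10": 1}[level]


if __name__ == "__main__":
    for level in ("raid5", "raid6", "raid10"):
        tib = usable_bytes(level, 8, DRIVE_BYTES) / TIB
        print(f"{level:7s} {tib:5.1f} TiB usable, "
              f"survives {guaranteed_failures(level)} failure(s) in the worst case")
```

Under these assumptions the RAID5 figure comes out at 38.2, which matches the logical drive size shown in the question's output.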

shodanshok

You should be using RAID6 with the size and types of disks in the system. However, there's nothing inherently wrong with running RAID5 on HP Smart Array RAID controllers. I think your issue is a result of using consumer disks in a setup not certified for the server hardware.

Some details about the server may be helpful, though.

Is this an HPE server, or are you just using an HPE controller?

These don't appear to be HPE drives or HPE drive carriers. That's a bad sign.

The hpssacli output you've provided would also show the reason for the disk failure. If you're not on an HPE server and there's a backplane issue or SATA timeouts (I noticed you're on SATA disks), there's a chance that you're getting false positives.

Example (see the Last Failure Reason line):

  physicaldrive 2I:2:8
     Port: 2I
     Box: 2
     Bay: 8
     Status: Failed
     Last Failure Reason: Aborted Command
     Drive Type: Data Drive
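
If you want to pull that line out of a long report automatically, a rough sketch like the following might do. It assumes the per-drive detail layout shown in this thread (a bare "physicaldrive X" header, indented "Key: Value" lines, a blank line between drives); the function names and the embedded sample are illustrative only, and other tool versions may format things differently.

```python
import re

# A per-drive detail header such as "physicaldrive 1I:2:5" with nothing after the ID.
# The one-line summaries ("physicaldrive 1I:2:5 (port 1I:box 2:...)") do not match.
DRIVE_HEADER = re.compile(r"^\s*physicaldrive\s+(\S+)\s*$")
FIELD = re.compile(r"^\s*([^:]+):\s*(.*)$")


def parse_drives(report):
    """Return {drive_id: {field: value}} for every per-drive detail block."""
    drives = {}
    current = None
    for line in report.splitlines():
        header = DRIVE_HEADER.match(line)
        if header:
            current = drives.setdefault(header.group(1), {})
            continue
        if not line.strip():
            current = None            # a blank line ends the current block
            continue
        field = FIELD.match(line)
        if current is not None and field:
            current[field.group(1).strip()] = field.group(2).strip()
    return drives


def failed_drives(report):
    """List (drive_id, last_failure_reason) for drives whose Status is Failed."""
    return [
        (drive_id, fields.get("Last Failure Reason", "unknown"))
        for drive_id, fields in parse_drives(report).items()
        if fields.get("Status") == "Failed"
    ]


# Shortened excerpt in the same layout as the report in the question.
SAMPLE = """\
  physicaldrive 1I:2:4
     Port: 1I
     Box: 2
     Bay: 4
     Status: OK

  physicaldrive 1I:2:5
     Port: 1I
     Box: 2
     Bay: 5
     Status: Failed
     Last Failure Reason: Write retries failed
"""

if __name__ == "__main__":
    for drive_id, reason in failed_drives(SAMPLE):
        print(f"{drive_id}: {reason}")
```

To use it on a real system, save the controller's detailed report to a file and pass the file's contents to failed_drives instead of SAMPLE.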
ewwhite
  • Thanks for your answer. The server is a HP DL180 G9; Reason for drive failure is always "Write retries failed". I have added this info to my question. – Laord Mar 17 '19 at 14:37
  • Why aren't you using HP disks on this system? Are these genuine HP drive carriers? – ewwhite Mar 17 '19 at 14:47
  • This is rented hardware. I'm using these drives because they are the ones the hoster is offering. Actually, I cannot find any info on whether these drives are in any way certified for HP. – Laord Mar 17 '19 at 14:59
  • They are not certified. Your hosting provider is using cheap consumer disks. – ewwhite Mar 17 '19 at 15:03