
I have fairly modern RAID hardware for this:

  1. Controller: Intel RS3SC008
  2. SAS Expander: Intel RES3FV288
  3. HDDs: Seagate ST8000AS0002-1NA17Z

I don't have a BBU at the moment; the matching one would be the Intel AXXRMFBU4.

The SAS expander is connected to the controller through port G, as the manual specifies.

All system parts are adequately cooled and ventilated (for example, the temperature at the controller ROC is around 43 °C, which is comfortably within the acceptable range).

The controller and the expander are flashed to the latest firmware.

The HDDs are also on the latest firmware.

My problem is that whatever RAID level I configure (I tried 0 and 6) and whatever cache configuration I choose, I get errors under real load:

  1. In some configurations the VD goes offline, reporting that some HDDs went offline.
  2. Assuming that these HDDs might be faulty, I created another test array without them, and it still failed.
  3. In the logs I see warnings about temperature sensors that I don't have, and some PHY reset warnings. There are no real errors until the VD goes offline because one of the HDDs misbehaved and dropped out. I tried excluding these faulty HDDs in subsequent tests. That seemed to help slightly, but in the end I am back where I started.

Having 4 faulty HDDs in a batch of 20 new ones seems unlikely to me.

What would you suggest in this situation?

What could be the problem?

HDD incompatibility?

Is there a way to recover from this situation?

Bart

2 Answers


Use HD Tune on each drive to see if they have SMART problems (reallocated or bad sectors are the priority).
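
If HD Tune is not an option (for example on a Linux host), a similar check can be scripted around the text output of smartmontools' `smartctl -A`, which is a different tool from the one named above. The following is only a minimal sketch: it assumes the usual ten-column attribute table, the function names are hypothetical, and drives behind a hardware RAID controller typically need an extra `-d` device-type option that is not covered here.

```python
import re

# SMART attribute IDs that most directly point at failing media.
WATCHED_IDS = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} parsed from `smartctl -A` text output.

    Assumes each attribute row starts with a numeric ID and carries the
    raw value in the tenth column; other lines are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            # Raw values may carry trailing annotations; keep the leading integer.
            match = re.match(r"\d+", fields[9])
            if match:
                attrs[int(fields[0])] = int(match.group())
    return attrs


def suspicious_attributes(text):
    """Return the watched attributes whose raw value is above zero."""
    attrs = parse_smart_attributes(text)
    return {
        name: attrs[attr_id]
        for attr_id, name in WATCHED_IDS.items()
        if attrs.get(attr_id, 0) > 0
    }
```

Pass the captured output of `smartctl -A <device>` to `suspicious_attributes`; an empty result only means these three counters are zero, not that the drive is healthy.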

In a more practical test-like approach:

Test in sets of 4 drives: build several 4-disk RAID 0 arrays.

Then copy data from one set to the others.

This way you can relatively quickly identify which ones have a problem.

Note: RAID 0'ing that many Seagates is suicide waiting to happen.

Put the 4-disk arrays that test good back into a single array if needed (or wait until the end of testing so you can use all the good drives).

For the sets that do not work well, swap some drives between them or split them into 2-disk arrays to narrow things down further. Check whether bad cables are at fault by swapping cables between a good 2-disk set and a bad one.
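
The narrowing-down procedure can be sketched as a small planning helper. This is only an illustration under stated assumptions: `group_passes` is a hypothetical callback standing in for "build an array from these drives and run the copy test", failures are assumed to be repeatable and tied to a drive or its slot, and the cable-swapping step is not modeled.

```python
def chunk(drives, size):
    """Split a list of drives into consecutive groups of at most `size`."""
    return [drives[i:i + size] for i in range(0, len(drives), size)]


def isolate_suspects(drives, group_passes, size=4):
    """Test drives in groups and halve every failing group.

    `group_passes(group)` must return True when the group survives the
    load test. Groups that pass are dropped; groups that fail are split
    in two and retested until single drives remain.
    """
    suspects = []
    pending = chunk(drives, size)
    while pending:
        group = pending.pop()
        if group_passes(group):
            continue
        if len(group) == 1:
            suspects.extend(group)
        else:
            mid = len(group) // 2
            pending.extend([group[:mid], group[mid:]])
    return suspects
```

With 20 drives and groups of 4, the first round is 5 tests, and each failing group costs a few more rounds, which is far fewer runs than testing every drive individually under load.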

Also, note that the error identifies the port at fault, so you could start by eliminating the ports flagged in the errors.

A "Command timeout" error may indicate an inaccessible HDD.

Overmind

  • Thank you for the tips. Of course, I'm not going to use RAID 0 on so many drives; it was only for testing purposes. Initially, I wanted to test a full load with all drives. I will run the next tests tomorrow. For now, even the faulty HDDs excluded from the VD do not show any problems in SMART. Can I assume that 4 HDDs working properly in RAID (whatever level) confirms that these HDDs are compatible with this controller? – Bart Jun 29 '21 at 19:13

Final conclusion, unfortunately not a solution.

After concluding several series of tests, I can confirm that these drives:

  1. HDDs: Seagate ST8000AS0002-1NA17Z
  2. SSDs: Crucial CT1000BX500SSD1

are completely incompatible with RAID configurations and perform very poorly.

As a side note, I find it strange that both showed the same performance drop after a few seconds of heavy load. I suppose this is due to similarly basic, slow low-level components.

I lost a lot of time on this issue, so maybe this post will help someone.

Bart