I have fairly modern RAID hardware for this:
- Controller: Intel RS3SC008
- SAS Expander: Intel RES3FV288
- HDDs: Seagate ST8000AS0002-1NA17Z
For the moment I don't have a BBU; the matching unit would be the Intel AXXRMFBU4.
The SAS expander is connected to the controller through port G, as the manual specifies.
All system parts are adequately cooled and ventilated (for example, the temperature at the controller ROC is around 43°C, which is well within the normal range).
The controller and the expander are both flashed to the latest firmware.
The HDDs are running the latest firmware as well.
My problem is that whatever RAID level I configure (I tried 0 and 6) and whatever cache settings I choose, I get errors under real load:
- In some configurations the VD goes offline, reporting that some HDDs went offline.
- Assuming those HDDs might be faulty, I ran another test without them, and it still failed.
- In the logs I see warnings about temperature sensors that I don't have, plus some phy device reset warnings. There are no real errors until the VD goes offline because one of the HDDs misbehaved and dropped out. I tried excluding the apparently faulty HDDs in subsequent tests. That seemed to help slightly, but in the end I am back where I started.
Having 4 faulty HDDs in a batch of 20 new ones seems very unlikely to me.
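To put a rough number on that suspicion, here is a minimal Python sketch. It assumes the drives fail independently and uses a simple binomial model; the per-drive early-failure rates (1%, 2%, 5%) are hypothetical values I picked for illustration, not figures from Seagate or Intel.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_DRIVES = 20  # drives in the batch (from the post)
N_FAILED = 4   # drives that dropped out (from the post)

# Hypothetical per-drive early-failure rates, assumed for illustration only.
for rate in (0.01, 0.02, 0.05):
    prob = prob_at_least(N_FAILED, N_DRIVES, rate)
    print(f"rate={rate:.2f}: P(at least {N_FAILED} of {N_DRIVES} bad) = {prob:.6f}")
```

Even with the pessimistic 5% assumption the result is only about 1.6%, which is why I doubt that these are four independently defective drives and suspect a common cause instead.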
What would you suggest in this situation?
What could be the problem?
HDD incompatibility?
Is there a way to recover from this situation?