
I have a server equipped with an LSI 9271-8i RAID controller, with 4 x 4TB disks organized as RAID-5 and 1 x 8TB disk as JBOD (which is called RAID-0 in the controller).

When I copy larger amounts of data (~1 TB), I observe the following: for the first few gigabytes the transfer speed is fine and limited by the disk or network speed, usually ~100MB/s. But after a while, the transfer pauses completely for approx. 20-30 seconds, and then continues for roughly the next 1 GB. I am copying many files, each between 10MB and 500MB; during the pause robocopy stays on one file and moves on to the next once the pause is over. As a result, the overall transfer rate drops to ~20MB/s.
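To put rough numbers on that pattern, here is a minimal sketch. It assumes 1 GB is 1000 MB, that each burst runs at about 100MB/s, and that every burst is followed by exactly one stall. The function name and the model are illustrative only, not a measurement.

```python
def effective_rate_mb_s(burst_mb: float, burst_rate_mb_s: float, stall_s: float) -> float:
    """Average throughput when every burst of data is followed by one stall."""
    burst_time_s = burst_mb / burst_rate_mb_s  # time spent actually copying
    return burst_mb / (burst_time_s + stall_s)


# Roughly 1 GB at 100 MB/s, then a 20 s or 30 s pause (figures from the post above).
for stall_s in (20, 30):
    rate = effective_rate_mb_s(1000, 100, stall_s)
    print(f"{stall_s}s stall: {rate:.1f} MB/s")
# prints:
# 20s stall: 33.3 MB/s
# 30s stall: 25.0 MB/s
```

Both figures are in the same ballpark as the observed ~20MB/s, which suggests the stalls alone can account for most of the drop.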

During the pause, browsing the drives' files is not possible, and in one case I received a controller reset error message ("Controller encountered a fatal error and was reset"). Accessing controller data with the CLI tool is also not possible during the pause (the result is only displayed once the pause is over).

I have observed this behaviour when copying:

  • gigabit network to RAID-5 volume
  • gigabit network to JBOD volume
  • JBOD to RAID-5
  • RAID-5 to JBOD

Nothing looks suspicious to me: temperatures (disks, BBU) are within the valid range; the controller temperature seems a bit high, but is also within spec. No checks are running on the RAID and no rebuild is in progress.

Any guesses?

Before I replace the controller, I want to try improving the thermal situation. Does this behaviour sound like a possible thermal issue?

I find it strange that the first 20-30 GB work fine and the pauses do not occur before that. If I leave the server alone for a while and retry, a few GB are again copied fine. The only naive explanation I have is that the controller gets too hot. Why the controller and not the disks? The RAID-5 disks are 7200rpm and stacked very closely, while the single JBOD disk is 5400rpm and has a lot of air around it. It would be strange if both showed the same overheating symptoms.

  • Sounds really like a temperature issue. You should monitor controller temperature or try using a fan on the heatsink. Have you tried reading and writing large amounts of data separately? Writing will cause more heat and should lead to these stalls faster. – Zac67 Dec 12 '17 at 20:57
  • Thanks, Zac. I will try improving the air flow and attach a fan to the passively cooled controller. Checking the difference between reading and writing sounds interesting, will try that. I just could not find a tool yet to monitor the temperature properly on that controller. – Markus Erlacher Dec 14 '17 at 09:56
  • Well, maybe just stick a remote thermometer onto the heatsink and watch it. ;-) – Zac67 Dec 14 '17 at 17:22
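One way to run the write-only half of the read/write test suggested in the comments is to time each chunk as it is forced to disk and note the chunks that take unusually long. This is a rough sketch, not a benchmark tool: the function name `timed_write`, the chunk size, and the 5-second threshold are all assumptions chosen for illustration.

```python
import os
import time


def timed_write(path, total_mb=64, chunk_mb=8, stall_threshold_s=5.0):
    """Write total_mb of random data in chunks.

    Returns a list of (chunk_index, seconds) for every chunk whose
    write-and-sync took longer than stall_threshold_s.
    """
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    stalls = []
    with open(path, "wb") as f:
        for i in range(total_mb // chunk_mb):
            start = time.monotonic()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
            elapsed = time.monotonic() - start
            if elapsed > stall_threshold_s:
                stalls.append((i, elapsed))
    return stalls
```

Pointing it at a file on the affected volume with a large `total_mb` (for example 50000) and printing the returned list should show whether the stalls appear with local writes alone, without the network or robocopy involved. A read-only counterpart needs more care, because a file that was just written may be served from the operating system's cache instead of the disks.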

2 Answers

1

I had a similar issue with a 9260-16i. It was not temperature, as I have dual 92mm fans blowing right on the LSI. I have a second server set up the same way and it was fine. What I discovered was that the server with the issues was set to a 64K stripe size, while the working server had a 256K stripe size. I backed up the problem server, rebuilt the drive group with a 256K stripe size, and then formatted the OS drive with 64K clusters (since I have multi-GB files). I have been moving data back since then with no hesitations, basically running at full gigabit NIC speed on writes and moving over 350GB per hour non-stop, with no pauses.

0

The issue is probably related to the controller flushing out its own DRAM cache. Anyone having such an issue should try setting the controller cache to write-through rather than write-back.
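As a hedged sketch of how that change could be scripted: the snippet below builds a StorCLI-style command of the form `storcli64 /c0/v0 set wrcache=wt`. The binary name, the controller and volume indices, and the helper names are assumptions for illustration; the exact syntax should be checked against the StorCLI reference for the controller in use before running it.

```python
import subprocess


def wrcache_command(controller=0, volume=0, mode="wt", storcli="storcli64"):
    """Build the argument list for changing a virtual drive's write cache policy.

    mode: "wt" (write-through), "wb" (write-back) or "awb" (always write-back).
    The binary name and the /cX/vY indices are assumptions; adjust to the system.
    """
    if mode not in ("wt", "wb", "awb"):
        raise ValueError(f"unsupported write cache mode: {mode}")
    return [storcli, f"/c{controller}/v{volume}", "set", f"wrcache={mode}"]


def set_write_cache(controller=0, volume=0, mode="wt", storcli="storcli64"):
    """Run the command (needs administrator rights) and return the tool's output."""
    cmd = wrcache_command(controller, volume, mode, storcli)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Switching back with `mode="wb"` afterwards makes it easy to compare a large copy under both policies.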

shodanshok