
Today one of my servers started sending me alerts about non-optimal RAID state. These were triggered by a very simple script run by cron -- if it detects that not all of the disks report 'Optimal' state, it sends an alert.
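For illustration, here is a minimal sketch of a check along those lines. It is not the original script: the function name classify_ld_state is hypothetical, and it assumes the megacli output contains one "State : Optimal" style line per logical drive. Treating empty output as a separate "no-data" case keeps a failed megacli call from being reported as a degraded array.

```shell
#!/bin/sh
# Sketch only: classify megacli -LDInfo output read from stdin.
# Assumption: each logical drive is reported on a line such as
#   State               : Optimal
# Prints one of: optimal, degraded, no-data
classify_ld_state() {
    awk -F':' '
        /^State[ \t]*:/ {
            total++
            gsub(/[ \t\r]/, "", $2)      # strip padding around the value
            if ($2 == "Optimal") ok++
        }
        END {
            if (total == 0)       print "no-data"   # megacli returned nothing usable
            else if (ok == total) print "optimal"
            else                  print "degraded"
        }
    '
}

# Hypothetical cron usage (mail recipient and subject are placeholders):
#   status=$(megacli -LDInfo -Lall -aALL | classify_ld_state)
#   [ "$status" = "optimal" ] || echo "RAID state: $status" | mail -s "RAID alert" root
```

With a check like this, an alert reading "no-data" would point at the tool failing rather than at the array itself.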

Now, the issue is that the RAID seems to be fine, but the megacli -LDInfo -Lall -aALL command invoked by the script fails repeatedly, leaving a cryptic error message in syslog: megacli: Failed to alloc kernel SGL buffer for IOCTL. Curiously, the command does work sometimes and returns proper output, but most of the time it just prints two blank lines and the exit code:

# megacli -LDInfo -Lall -aALL


Exit Code: 0x00

The same goes for megacli with other parameters, such as megacli -AdpAllInfo -aAll. Every time the command fails, the same error appears in syslog.

This has never happened before, as far as I can remember. No changes were made to the server recently. The adapter is a PERC 6/i Integrated and the server runs Debian Wheezy.

What could possibly be the issue and where do I start resolving this?

EDIT:

# megacli -v


      MegaCLI SAS RAID Management Tool  Ver 5.00.12 May 08, 2009

    (c)Copyright 2009, LSI Corporation, All Rights Reserved.

Exit Code: 0x00

At least this command works every time without triggering the error ;) I've just realised this is an old release of megacli. Still, it shouldn't matter, since the very same setup had been working for a couple dozen months with no problem and has now suddenly decided to go wild.

Erathiel
  • It probably _does_ matter. Servers are not static despite our best efforts. I'd update that first. – Michael Hampton Mar 10 '15 at 18:14
  • After reviewing the case with a colleague, we deduced it must have been some memory corruption and decided to reboot the server. It's been working fine since then, a week already. Yeah, I know a reboot is not really a solution, but hey, it helped as a last resort. – Erathiel Mar 17 '15 at 11:50
  • This has happened to me again, on a different server, after some 700 days of uptime (yes, shame on me for that). Same symptoms, same errors in syslog, same solution. – Erathiel Apr 21 '16 at 09:20

1 Answer


There is a Debian bug report for this issue:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=703356

There doesn't seem to be a fix as of now, and it also happens with recent kernels. My prime suspect is Xen. Do you run Xen on this host?
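If you are not sure, a rough way to check is sketched below. It relies on the assumption that the kernel exposes /sys/hypervisor/type and that the file reads "xen" on Xen hosts and guests; the function name is_xen is just a placeholder, and the optional path argument exists only so the check can be tried against a test file.

```shell
#!/bin/sh
# Sketch only: succeed if the hypervisor type file reports Xen.
# Assumption: /sys/hypervisor/type is present and contains "xen" under Xen.
is_xen() {
    type_file=${1:-/sys/hypervisor/type}
    [ -r "$type_file" ] && [ "$(cat "$type_file")" = "xen" ]
}

# Example usage:
#   if is_xen; then echo "running under Xen"; else echo "no Xen detected"; fi
```

A missing file is treated as "not Xen", which can also simply mean the kernel does not expose that sysfs entry.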

  • Well, yes, your suspicion is correct: both of my hosts (the one from the OP and the one from my comment on it) are running Xen. – Erathiel Mar 10 '17 at 14:42