15

Summary

I have been getting these cryptic messages in syslog since I installed some new hardware and I can't figure out what the problem is, if it's serious, or what to do about it.

They're from the new SATA HBA and they follow a pattern. I get several of the first message, followed 5-30 seconds later by several of the second message. They come in bursts that are all logged in the same second, and the exact number of each varies between about 2 and 35. It can be minutes or hours between appearances of the entries.

Example of the two messages:

Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)

It is always 0x31120303 followed by 0x31110d01.
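To see how many messages land in each one-second burst, the log can be tallied with a short script. This is a minimal sketch in Python, assuming the syslog line format of the two example messages above; the name count_bursts and the regular expression are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines shaped like the two examples above (assumed format).
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+\[\s*[\d.]+\]\s+"
    r"(?P<ctrl>mpt2sas\d+):\s+log_info\((?P<info>0x[0-9a-fA-F]+)\)"
)

def count_bursts(lines):
    """Count messages per (timestamp, controller, log_info) triple."""
    bursts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            key = (m.group("stamp"), m.group("ctrl"), int(m.group("info"), 16))
            bursts[key] += 1
    return bursts

# The two example lines from the question.
SAMPLE = [
    "Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: "
    "log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)",
    "Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: "
    "log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)",
]

for (stamp, ctrl, info), n in count_bursts(SAMPLE).items():
    print(f"{stamp} {ctrl} {info:#010x} x{n}")
```

Feeding it the lines of a saved syslog file instead of SAMPLE would give one row per burst, which makes it easier to check whether the two codes always arrive in the same order.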

mpt2sas is the driver for the SATA host bus adapter I'm using, but the error content is overly cryptic. It doesn't tell me what the problem is, which disk or port is affected, or how severe it is.

Hardware

Supermicro X9SCL with a Xeon E3-1220 and 8GB of RAM.

LSI SAS2008 based Supermicro AOC-USAS2-L8I SAS/SATA HBA connected to a Supermicro CSE-M35T-1B disk tray set. It has three Western Digital WD30EZRX and two Seagate ST3000DM001 plugged into it. All 3TB drives (exact same number of sectors actually). No port expanders in use.

The HBA, disk trays and 4 of the drives are new. One of the WD30EZRXes has been in for months, had no problems with it. Had it connected to the integrated Intel SATA controller previously, moved it into the drive bays with this new setup.

Had problems with the HBA needing to reset frequently and getting really awful performance. Updated the firmware/bios to "Phase 12", the latest release available from Supermicro and changed the type to IT (i.e. passthrough, from IR for integrated raid since I was going to use all software raid): 2008IT12.FW. That update cleared up all the early issues and I didn't start getting the above messages until later (see below).

The first four disks I added are all on the first SFF-8087 port (split to 4 SATA cables). The latest disk I added is on the other port, if that matters.

The only other disk on the system contains the OS, and is an older Intel 80GB SSD plugged into the integrated SATA controller.

Software

Ubuntu 11.10 (oneiric). Linux 3.0.0-14-server x86_64. Using the mpt2sas driver that comes with the OS.

Trying to build a RAID6 array using Linux md with those five disks. Started with a degraded array of 3 disks: the two Seagates and one of the new WD drives. This was fast and went very well, no messages in the logs after I did the firmware update. Meanwhile, I am still using the old WD disk on port 0 of the same controller.

Added the other new WD disk to the array. Rebuild started and I am now getting those messages in syslog periodically. I'm not sure how long it's supposed to take to add a disk to the array but the estimated time (cat /proc/mdstat) ranges from thousands to tens of thousands of minutes, much longer than it took the first 3 disks. I do understand that the WD disks are much slower; I got different models to cut down on the chances of multiple disk failure, and those were the two cheapest 3TB models.
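The estimate reported by cat /proc/mdstat can be tracked over time to see whether it stalls when the messages appear. A minimal sketch in Python, assuming the usual progress line of the form "recovery = N% (...) finish=Nmin speed=NK/sec"; the function name parse_progress is hypothetical and the exact field layout is an assumption, not taken from the question.

```python
import re

# Assumed shape of the md progress line, e.g.
#   [=>....]  recovery =  8.5% (a/b) finish=1234.5min speed=36171K/sec
PROGRESS_RE = re.compile(
    r"(?P<op>recovery|resync|reshape)\s*=\s*(?P<pct>[\d.]+)%.*?"
    r"finish=(?P<finish>[\d.]+)min\s+speed=(?P<speed>\d+)K/sec"
)

def parse_progress(mdstat_text):
    """Return one dict per in-progress operation found in the text."""
    results = []
    for m in PROGRESS_RE.finditer(mdstat_text):
        finish_min = float(m.group("finish"))
        results.append({
            "op": m.group("op"),
            "percent": float(m.group("pct")),
            "finish_min": finish_min,
            "finish_hours": finish_min / 60.0,
            "speed_kib_s": int(m.group("speed")),
        })
    return results

# Usage on a live system (not run here):
#   with open("/proc/mdstat") as f:
#       print(parse_progress(f.read()))
```

Sampling this every minute or so and logging the speed next to the kernel timestamps would show whether throughput collapses during each burst of mpt2sas messages.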

Notes

SMART does not report any problems on any disks. There are no logged errors on any disks and none of the failure stats are anywhere near threshold.

The logged messages only started appearing after I added the last disk, which suggests that disk may be having a problem, but I have nothing else pointing to that.

I did find a header file that seems to correspond to the logging messages from this driver. The first message seems to be an abort (code 0x12) with a "sub code" 0x0303 that isn't listed. The second message is a reset (code 0x11) for a reason that also isn't clear. If I could determine what 0x0303 and 0x0d01 mean, that would be really helpful.
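The fields printed in each message line up with the nibbles of the 32-bit log_info value, so the split can be reproduced by hand. A minimal sketch in Python, assuming the layout implied by the logged lines (top nibble bus type, next nibble originator, next byte code, low 16 bits sub code); the originator names other than PL are an assumption about the driver's conventions, and the sketch does not explain what a given sub code means.

```python
# Originator names: PL appears in the logged lines; IOP and IR are assumed.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}

def decode_log_info(value):
    """Split a 32-bit log_info value into the fields the driver prints."""
    return {
        "bus_type": (value >> 28) & 0xF,      # 0x3 in both logged values
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }

for raw in (0x31120303, 0x31110D01):
    d = decode_log_info(raw)
    print(f"{raw:#010x}: originator({d['originator']}), "
          f"code({d['code']:#04x}), sub_code({d['sub_code']:#06x})")
```

Both values from the question decode to the same originator, code and sub code that the kernel already printed, which at least confirms the split before looking the codes up in the header file.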

I know that 4 disks in a 5 disk RAID6 is an incomplete array. I'm planning to copy the contents of the old disk to the array once it finishes integrating the 4th disk and then add the old disk to the array as well.

Michael Hampton
Chris Smith

3 Answers

6

Most likely this is a hardware problem somewhere between your disks and your SAS RAID controller, the controller itself included. I recommend trying:

  1. Run any diagnostic tools from the vendor(s), if they are available.
  2. Check, re-seat or replace the cables.
  3. Strip out and swap hardware components in the chain that connects the disks to your RAID controller, including the controller itself (i.e., for you, try something other than the motherboard's integrated RAID).

I had one out of two identical Dell PowerEdge R515 giving very similar messages (logs periodically filling up with mpt2sas0 messages, though I do not have the exact numeric codes). Dell's own bootable diagnostic picked these up as "hardware errors" and replacing the RAID sas backplane solved the issue.

When I was investigating, I could not find a comprehensive resource on what the various mpt2sas0 error codes mean. I suspect they may even be hardware-vendor-specific (someone who knows more about SAS needs to confirm or deny this). So your error codes could mean something wildly different, but if SMART is clean it is hard to imagine other good reasons for mpt2sas0 to report error codes.

These errors can be very serious. My R515 worked seemingly OK with these messages for a week with a 12 disk Ubuntu Linux software raid 6, but then suddenly ejected all 12 disks out of the array as broken (!)

Also, in my case the SMART data for all disks was completely clean. A good check is a SMART long self-test: smartctl -t long /dev/sdX, and then check the results about a day later with smartctl -l selftest /dev/sdX. If all is OK the test should say Completed and the LBA_first_err column should be empty.
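The self-test check above can be scripted across several disks. A minimal sketch in Python, assuming the usual ATA self-test table printed by smartctl -l selftest, where entry lines start with "#", columns are separated by two or more spaces, and an empty first-error column is shown as "-"; the function names are hypothetical and the column layout is an assumption that may not hold for every drive or smartctl version.

```python
import re

def parse_selftest_log(text):
    """Parse self-test entries from `smartctl -l selftest` output (assumed layout)."""
    entries = []
    for line in text.splitlines():
        if not line.startswith("#"):
            continue  # skip headers and everything that is not a table row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 5:
            continue  # not a row in the layout this sketch expects
        entries.append({
            "num": fields[0],
            "description": fields[1],
            "status": fields[2],
            "remaining": fields[3],
            "lifetime_hours": fields[4],
            "lba_first_err": fields[5] if len(fields) > 5 else "",
        })
    return entries

def looks_clean(entry):
    """True if the test completed and no first-error LBA was recorded."""
    return (entry["status"].startswith("Completed")
            and entry["lba_first_err"] in ("", "-"))

# Usage on a live system (not run here; needs root and smartmontools):
#   import subprocess
#   out = subprocess.run(["smartctl", "-l", "selftest", "/dev/sdX"],
#                        capture_output=True, text=True).stdout
#   print([looks_clean(e) for e in parse_selftest_log(out)])
```

A clean result here only says the disk passed its own test; as the comments on this answer show, it does not rule out a cabling fault between the disk and the HBA.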

Thomas
  • Note: the RAID controller (HBA really) is already a separate card. The onboard SATA controller works fine. I do have a replacement SFF-8087 cable on order, should be here by tomorrow. That's my top suspect at this point. – Chris Smith Jul 25 '12 at 14:41
  • The bad cable was the problem! I replaced both of them (two SFF ports) with some higher quality cables and haven't had any problems since! I'm accepting your answer since it's the longest and does suggest a bad cable. P.S. I definitely did the long SMART tests; no problems on any of the disks. – Chris Smith Jul 27 '12 at 13:40
  • Great to hear that you found the problem. Thanks for the accept. – Rickard Armiento Jul 27 '12 at 20:21
  • It is really strange to me that I have also run into this problem before, and also on a Dell PowerEdge platform. Same result: the issue was with the cables... – Mazeryt Jun 11 '16 at 15:25
4

Wow, a tough one.

This seems to indicate that 0x31120303 is a bus reset due to one of your devices being under heavy load. It also says you don't need to worry about it. (Haha, yeah right.)

This indicates that these log messages are happening because one of your devices is taking too long to respond to commands. This says the same thing, and also indicates it occurs under heavy load.

While this isn't a complete answer, it hopefully will point you in a useful direction.

Michael Hampton
  • I saw some of those postings but was never able to find the exact message I was getting. Turned out to be a bad SFF-8087->SATA cable. Thanks for the help! – Chris Smith Jul 27 '12 at 13:41
0

This means that you have some error on the disk. It is a SATA disk on a SAS controller from LSI, and because of the error all outstanding requests were aborted.

In most cases the trigger for this error is a medium error on the disk. The message by itself doesn't prove a medium error, though, so you'll need to check the logs for other hints to find the source of the original disk failure.

Slightly more elaborated version at: http://blog.disksurvey.org/blog/2014/03/27/sata-handling-of-medium-errors-log-info-0x0x31080000/

Baruch Even
  • Interesting post, thanks for sharing! SATA is a crappy protocol but the disks are cheap and do what I need. The message has not reappeared since I replaced the faulty cable. – Chris Smith Mar 28 '14 at 15:43
  • More decoding of LSI Loginfo can be found through a utility I created to decipher it: http://blog.disksurvey.org/blog/2014/08/10/decoding-lsi-loginfo-codes/ – Baruch Even Aug 13 '14 at 19:47
  • @BaruchEven , Looks like the blog URL is not available. Do you have the content available somewhere? – dlmeetei Aug 12 '22 at 11:58
  • @dlmeetei the software itself is available at https://github.com/baruch/lsi_decode_loginfo the blog is gone and I do not have an archive of it unfortunately – Baruch Even Aug 16 '22 at 07:43