
After noticing a high load on our virtual root server (2 x 1 TB RAID 1 Subset), I found these messages in /var/log/messages (CentOS):

kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: ata3.00: failed command: WRITE DMA
kernel: ata3.00: cmd ca/00:10:e0:1b:01/00:00:00:00:00/e1 tag 18 dma 8192 out
kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
kernel: ata3.00: status: { DRDY }
kernel: ata3: hard resetting link
kernel: Clocksource tsc unstable (delta = -25761696872 ns)
kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
kernel: ata3.00: configured for UDMA/100
kernel: ata3.00: device reported invalid CHS sector 0
kernel: ata3: EH complete

Could someone please shed some light on this? Is it a serious HDD problem or something else? How can I check the health of the virtual HDD (without SMART capability)?

hellcode

1 Answer


The disk did not respond in time and was reset by the OS. This can mean many things, but the two most common causes are:

  1. Media error -- some location(s) on the disk cannot be read from or written to
  2. Link errors -- Bad cable

This specific error, with no earlier errors and no increased latency, may indicate a media error. You can use smartctl to verify this, though: CRC errors in the SMART counters would point to a link problem instead.
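As a rough illustration of the CRC check, the sketch below pulls the raw value of SMART attribute 199 (usually reported as UDMA_CRC_Error_Count) out of the attribute table that `smartctl -A` prints. The function name `crc_error_count` and the sample row are made up for this example, and it assumes the usual ten-column table layout and a disk that exposes SMART at all, which a virtual disk may not.

```python
def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Illustrative sample row, not taken from the server in the question.
sample = (
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000"
    "    Old_age   Always       -       0"
)
print(crc_error_count(sample))
```

A count that stays at zero makes a cable or link problem less likely; a count that keeps growing points toward one.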

If it is a media error, then the disk is in trouble, since the command that failed is a write. Normally writes don't fail with a media error; it is the later reads that fail. It could also be that a previous read took a bit too long and the write fell victim to the timeout. I've seen that happen as well.

You should also note that the link was renegotiated at 1.5 Gbps. If this is the first failure, you have a link problem. If it is the third or later failure of its kind, then it points to a behavior I've seen in Linux, which tries to alleviate the resets by reducing the link speed even when the cause is a media error rather than a link issue.
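To see how the negotiated speed changed over time, one option is to extract every "SATA link up" message from the kernel log. This is only a sketch: the helper name `link_speeds` is hypothetical, and the regular expression assumes the message format shown in the question.

```python
import re

# Matches lines such as "ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)".
LINK_UP = re.compile(r"(ata\d+): SATA link up ([\d.]+) Gbps")


def link_speeds(log_lines):
    """Return (port, speed_gbps) for each link-up message, in log order."""
    speeds = []
    for line in log_lines:
        match = LINK_UP.search(line)
        if match:
            speeds.append((match.group(1), float(match.group(2))))
    return speeds


# Two lines copied from the log excerpt in the question.
log = [
    "kernel: ata3: hard resetting link",
    "kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)",
]
print(link_speeds(log))  # [('ata3', 1.5)]
```

A port that reported a higher speed at boot and a lower one after a reset has been downgraded by the error handler.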

Action items:

  • Check SMART for CRC errors
  • Check how many errors you had in the past
  • If you want to recover the 3 Gbps speed, reboot
  • Check whether the message "NCQ disabled due to excessive errors" appears in your logs. It may explain a disk slowdown, but not the disk problem itself
  • Make sure you have a backup, since it may very well be that your disk is failing
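The log-related checks above can be roughly automated. The sketch below counts failed commands, link resets, and the NCQ message per ATA port. The name `summarize` and the event labels are invented for this example, and the matching relies on plain substring tests against the message texts quoted in this thread.

```python
import re
from collections import Counter

# Matches the port prefix in "ata3:" as well as "ata3.00:".
PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")


def summarize(log_lines):
    """Count failed commands, link resets, and NCQ-disabled messages per port."""
    counts = Counter()
    for line in log_lines:
        match = PORT.search(line)
        if not match:
            continue
        port = match.group(1)
        if "failed command" in line:
            counts[(port, "failed_command")] += 1
        elif "hard resetting link" in line:
            counts[(port, "link_reset")] += 1
        elif "NCQ disabled due to excessive errors" in line:
            counts[(port, "ncq_disabled")] += 1
    return counts


# Lines copied from the log excerpt in the question.
lines = [
    "kernel: ata3.00: failed command: WRITE DMA",
    "kernel: ata3: hard resetting link",
    "kernel: ata3: EH complete",
]
for (port, event), count in sorted(summarize(lines).items()):
    print(port, event, count)
```

To scan the real log, pass an open file instead of the sample list, for example `summarize(open("/var/log/messages"))`, and include rotated logs if you want the history.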
Baruch Even
  • Thank you so much for your reply. I opened a ticket after the last comments above and the hosting provider immediately rebooted the server, but I still have no response from them. I did see that they renamed the ticket to "performance". It was the first time I saw this error (and there have been no further occurrences in the logs for one month). I ran "badblocks" and it reported no errors. – hellcode Aug 27 '14 at 19:32
  • It could very well be a one-off media problem. These do happen and are not a real concern. The result you got matches that: the Linux kernel hit a timeout, reset the link, and reduced the speed. Since it is a one-off, or at least pretty rare, you are not seeing it any longer, and the drive is likely to continue working just fine for a long time. These things happen on a fairly regular basis, so I wouldn't worry about this too much. – Baruch Even Aug 28 '14 at 05:31
  • They told me to update the Baremetal-Tools (apparently some kind of software needed on the virtual host to support virtualisation functions). So I updated Parallels Tools from 7.0.13253.694417 to 7.0.19496.1024109 (via VNC). But I am not sure whether this should have been done by the hosting provider during planned maintenance, or whether it could really be the cause of the problem. I also wonder whether all other clients on the same physical host were affected by this... – hellcode Aug 28 '14 at 12:35
  • I now have a statement from my hosting provider: the physical HDDs are dedicated to one customer only. There was no physical problem with the HDDs, but the communication between the virtual HDD and the OS was not optimal because of the old baremetal version. – hellcode Aug 28 '14 at 15:23
  • I don't have enough experience with Parallels to know about that side of things. I personally still think that the HDD had a one-off issue that caused a timeout on some IOs, and that was the issue. Things should be fine, but you should keep an eye out for more occurrences of such log messages and timeouts, just in case it happens again. Hardware is unpredictable: a problem may stay very rare or it may increase in frequency, and it is impossible to know at this stage. – Baruch Even Aug 28 '14 at 17:01