30

Recent versions of RHEL/CentOS (EL6) brought some interesting changes to the XFS filesystem I've depended on heavily for over a decade. I spent part of last summer chasing down an XFS sparse file situation resulting from a poorly-documented kernel backport. Others have had unfortunate performance issues or inconsistent behavior since moving to EL6.

XFS was my default filesystem for data and growth-partitions, as it offered stability, scalability and a good performance boost over the default ext3 filesystems.

There's an issue with XFS on EL6 systems that surfaced in November 2012. I noticed that my servers were showing abnormaly-high system loads, even when idle. In one case, an unloaded system would show a constant load average of 3+. In others, there was a 1+ bump in load. The number of mounted XFS filesystems seemed to influence the severity of the load increase.

System has two active XFS filesystems. Load is +2 following upgrade to the affected kernel. enter image description here

Digging deeper, I found a few threads on the XFS mailing list that pointed to an increased frequency of the xfsaild process sitting in the STAT D state. The corresponding CentOS Bug Tracker and Red Hat Bugzilla entries outline the specifics of the issue and conclude that this is not a performance problem; only an error in the reporting of system load in kernels newer than 2.6.32-279.14.1.el6.

WTF?!?

In a one-off situation, I understand that the load reporting may not be a big deal. Try managing that with your NMS and hundreds or thousands of servers! This was identified in November 2012 at kernel 2.6.32-279.14.1.el6 under EL6.3. Kernels 2.6.32-279.19.1.el6 and 2.6.32-279.22.1.el6 were released in subsequent months (December 2012 and February 2013) with no change to this behavior. There's even been a new minor release of the operating system since this issue was identified. EL6.4 was released and is now on kernel 2.6.32-358.2.1.el6, which exhibits the same behavior.

I've had a new system build queue and have had to work around the issue, either locking kernel versions at the pre-November 2012 release for EL6.3 or just not using XFS, opting for ext4 or ZFS, at a severe performance penalty for the specific custom application running atop. The application in question relies heavily on some of the XFS filesystem attributes to account for deficiencies in the application design.

Going behind Red Hat's paywalled knowledgebase site, an entry appears stating:

High load average is observed after installing kernel 2.6.32-279.14.1.el6. The high load average is caused by xfsaild going into D state for each XFS formatted device.

There is currently no resolution for this issue. It is currently being tracked via Bugzilla #883905. Workaround Downgrade the installed kernel package to a version lower then 2.6.32-279.14.1.

(except downgrading kernels not an option on RHEL 6.4...)

So we're 4+ months into this problem with no real fix planned for the EL6.3 or EL6.4 OS releases. There's a proposed fix for EL6.5 and a kernel source patch available... But my question is:

At what point does it make sense to depart from the OS-provided kernels and packages when the upstream maintainer has broken an important feature?

Red Hat introduced this bug. They should incorporate a fix into an errata kernel. One of the advantages of using enterprise operating systems is that they provide a consistent and predictable platform target. This bug disrupted systems already in production during a patch cycle and reduced confidence in deploying new systems. While I could apply one of the proposed patches to the source code, how scalable is that? It would require some vigilance to keep updated as the OS changes.

What's the right move here?

  • We know this could possibly be fixed, but not when.
  • Supporting your own kernel in a Red Hat ecosystem has its own set of caveats.
  • What's the impact on support eligibility?
  • Should I just overlay a working EL6.3 kernel on top of newly-build EL6.4 servers to gain the proper XFS functionality?
  • Should I just wait until this is officially fixed?
  • What does this say about the lack of control we have over enterprise Linux release cycles?
  • Was relying on an XFS filesystem for so long a planning/design mistake?

Edit:

This patch was incorporated into the most recent CentOSPlus kernel release (kernel-2.6.32-358.2.1.el6.centos.plus). I'm testing this on my CentOS systems, but this doesn't help much for the Red Hat-based servers.

ewwhite
  • 194,921
  • 91
  • 434
  • 799
  • 4
    I was always under the belief that if you're using EL6, and paying RHEL support, then it's their onus to fix it for you? – Tom O'Connor Apr 07 '13 at 18:24
  • 7
    Yes... Red Hat will fix it... ***On their own timetable!!*** - This issue surfaced at the end of 2012. It's still not fixed. It's not slated for repair until the release of RHEL 6.5, so technically, they *are* taking care of it... – ewwhite Apr 07 '13 at 18:26
  • Well, with the attitude Red Hat is showing (ref the bug tracker) I honestly do not believe they are bothering with XFS anymore. A custom kernel makes sense here, but what is the point of paying for support? Maybe CentOS is your path.. – pauska Apr 07 '13 at 21:26
  • @pauska I have a combination of RHEL *and* CentOS systems. I'm testing a possible workaround for my CentOS systems, but for the systems under real RHEL support, I would like to keep things as stock/supportable as possible. – ewwhite Apr 08 '13 at 05:50
  • 5
    I understand your frustration, I was responsible for a mixed RHEL/CentOS environment before and RH makes it really hard for you to keep things stock sometimes, seeing how they continuously "ignore" to fix crucial bugs, sometimes which they introduce themselves. Then they do schedule a fix for the next major release, but since they fail to support upgrading to the next major version this is little helpful. At some point I chose to ditch their official kernels on some RHEL5 boxes simply because I had to due to lack of a specific feature. – Adrian Frühwirth Apr 08 '13 at 16:58
  • Does OEL include this patch? Could you live with Oracle? – ptman Apr 10 '13 at 10:16
  • @ptman I'm not sure if OEL includes the patch. I'm pretty sure that OEL's kernel is newer and would not have had this problem in the first place. Red Hat introduced the bug in an effort to fix *another* bug. – ewwhite Apr 10 '13 at 10:20
  • 1
    Is a move to SLES an option? They have supported XFS for much longer than RH. – Martin Schröder Apr 10 '13 at 22:28
  • 2
    @MartinSchröder SLES isn't particularly popular in the US, but it could be an option. XFS itself isn't broken, but Red Hat's handling of it is. It's worth considering. – ewwhite Apr 11 '13 at 15:11

3 Answers3

15

This was fixed (quietly) by Red Hat April 23, 2013 in RHEL kernel-2.6.32-358.6.1.el6 as part of the 6.4 errata updates...

ewwhite
  • 194,921
  • 91
  • 434
  • 799
14

At what point does it make sense to depart from the OS-provided kernels and packages when the upstream maintainer has broken an important feature?

"At the point where the vendor's kernel or packages are so horribly broken that they impact your business" is my general answer (coincidentally this is also about the point where I say it makes sense to start looking at ways to depart from the vendor relationship).

Basically as you and others have said, RedHat seems to not want to patch this in their distributed kernel (for whatever reason). That pretty much leaves you in the situation of having to roll your own kernel (keeping it up to date on patches yourself, maintaining your own package and installing it on your systems with Puppet or similar, or running a package server that Yum or whatever they use today can reference), or taking your marbles and going home.


Yes I know taking your marbles and going home is often an expensive proposition -- switching OS vendors is a huge pain, especially in the Linux world where the flavors are radically different from an administrative standpoint.
Other options like going totally CentOS are also unappealing (because you lose support, and you're still getting essentially RedHat's code built by someone else so you'd still have this bug).

Unfortunately unless enough people (i.e. "huge companies) take their marbles and go home the vendor won't care so much about screwing people over by shipping bad code and not fixing it.

voretaq7
  • 79,345
  • 17
  • 128
  • 213
3

If you do need to patch your RHEL kernel, you can do it yourself and be officially supported on that kernel, you'll just need for them to certify it.

There are provisions in the RHEL support agreement for doing so - ISTR you're restricted to 1 or 2 per quarter or year but can't remember for sure.

MikeyB
  • 38,725
  • 10
  • 102
  • 186
  • Very good to know! – ewwhite May 01 '13 at 18:00
  • This is not correct. You can request an accelerated fix from Red Hat, but there are criteria the issue has to meet for this to be delivered, and several different ways to deliver a supported accelerated fix. If you go recompiling your own kernel, that kernel is not supported by Red Hat. – suprjami Jun 28 '13 at 12:53
  • I have a customer that does exactly this. I don't think they do it for everyone but they do do it. – MikeyB Jun 28 '13 at 13:49