On DL380p gen8 servers using XFS on top of LVM on top of raid 1+0 with 6 disks, an identical workload results in a ten-fold increase in disk writes on RHEL 6 compared to RHEL 5, making applications unusable.
Note that I'm not looking at optimizing the co6 system as much as possible, but at understanding why co6 behaves so wildly different, and solving that.
vmstat/iostat
We have a MySQL replication setup, using mysql 5.5. Mysql slaves on gen8 servers using RHEL 6 as OS perform badly, inspection with vmstat and iostat shows that these servers do ten times the page out activity and ten times the amount of writes to the disk subsystem. blktrace show that these writes are not initiated by mysql, but by the kernel.
Centos 5:
[dkaarsemaker@co5 ~]$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 12 252668 102684 10816864 0 0 8 124 0 0 9 1 90 0 0
1 0 12 251580 102692 10817116 0 0 48 2495 3619 5268 6 1 93 0 0
3 0 12 252168 102692 10817848 0 0 32 2103 4323 5956 6 1 94 0 0
3 0 12 252260 102700 10818672 0 0 128 5212 5365 8142 10 1 89 0 0
[dkaarsemaker@co5 ~]$ iostat 1
Linux 2.6.18-308.el5 (bc290bprdb-01.lhr4.prod.booking.com) 02/28/2013
avg-cpu: %user %nice %system %iowait %steal %idle
8.74 0.00 0.81 0.25 0.00 90.21
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
cciss/c0d0 277.76 399.60 5952.53 2890574849 43058478233
cciss/c0d0p1 0.01 0.25 0.01 1802147 61862
cciss/c0d0p2 0.00 0.01 0.00 101334 32552
cciss/c0d0p3 277.75 399.34 5952.52 2888669185 43058383819
dm-0 32.50 15.00 256.41 108511602 1854809120
dm-1 270.24 322.97 5693.34 2336270565 41183532042
avg-cpu: %user %nice %system %iowait %steal %idle
7.49 0.00 0.79 0.08 0.00 91.64
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
cciss/c0d0 300.00 32.00 4026.00 32 4026
cciss/c0d0p1 0.00 0.00 0.00 0 0
cciss/c0d0p2 0.00 0.00 0.00 0 0
cciss/c0d0p3 300.00 32.00 4026.00 32 4026
dm-0 0.00 0.00 0.00 0 0
dm-1 300.00 32.00 4026.00 32 4026
avg-cpu: %user %nice %system %iowait %steal %idle
4.25 0.00 0.46 0.21 0.00 95.09
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
cciss/c0d0 507.00 160.00 10370.00 160 10370
cciss/c0d0p1 0.00 0.00 0.00 0 0
cciss/c0d0p2 0.00 0.00 0.00 0 0
cciss/c0d0p3 507.00 160.00 10370.00 160 10370
dm-0 0.00 0.00 0.00 0 0
dm-1 507.00 160.00 10370.00 160 10370
avg-cpu: %user %nice %system %iowait %steal %idle
5.33 0.00 0.50 0.08 0.00 94.09
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
cciss/c0d0 318.00 64.00 4559.00 64 4559
cciss/c0d0p1 0.00 0.00 0.00 0 0
cciss/c0d0p2 0.00 0.00 0.00 0 0
cciss/c0d0p3 319.00 64.00 4561.00 64 4561
dm-0 0.00 0.00 0.00 0 0
dm-1 319.00 64.00 4561.00 64 4561
And on Centos 6 a ten-fold increase in paged out and disk writes:
[root@co6 ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 361044 52340 81965728 0 0 19 1804 36 110 1 1 98 0 0
0 0 0 358996 52340 81965808 0 0 272 57584 1211 3619 0 0 99 0 0
2 0 0 356176 52348 81966800 0 0 240 34128 2121 14017 1 0 98 0 0
0 1 0 351844 52364 81968848 0 0 1616 29128 3648 3985 1 1 97 1 0
0 0 0 353000 52364 81969296 0 0 480 44872 1441 3480 1 0 99 0 0
[root@co6 ~]# iostat 1
Linux 2.6.32-279.22.1.el6.x86_64 (bc291bprdb-01.lhr4.prod.booking.com) 02/28/2013 _x86_64_ (32 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.08 0.00 0.67 0.27 0.00 97.98
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 373.48 1203.02 115203.05 11343270 1086250748
dm-0 63.63 74.92 493.63 706418 4654464
dm-1 356.48 1126.72 114709.47 10623848 1081596740
avg-cpu: %user %nice %system %iowait %steal %idle
0.25 0.00 0.19 0.06 0.00 99.50
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 330.00 80.00 77976.00 80 77976
dm-0 0.00 0.00 0.00 0 0
dm-1 328.00 64.00 77456.00 64 77456
avg-cpu: %user %nice %system %iowait %steal %idle
0.38 0.00 0.19 0.63 0.00 98.81
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 570.00 1664.00 128120.00 1664 128120
dm-0 0.00 0.00 0.00 0 0
dm-1 570.00 1664.00 128120.00 1664 128120
avg-cpu: %user %nice %system %iowait %steal %idle
0.66 0.00 0.47 0.03 0.00 98.84
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 317.00 448.00 73048.00 448 73048
dm-0 34.00 0.00 272.00 0 272
dm-1 309.00 448.00 72776.00 448 72776
Narrowing down
Gen 8 servers using RHEL 5, and gen 7 servers using RHEL 5 or 6 do not show this problem. Furthermore, RHEL 6 with ext3 as filesystem instead of our default xfs does not show the problem. The problem really seems to be somewhere between XFS, gen8 hardware and centos 6. RHEL 6 also shows the problem.
Edit 29/04: we added qlogic HBA's t the G8 machine. Using XFS on fibre channel storage does not show the problem. So it's definitely somewhere in the interaction between xfs/hpsa/p420i.
XFS
The newer xfs in rhel 8 seems to be able to detect underlying stripe width, but only on p420i controllers using the hpsa driver, not p410i controllers using cciss.
xfs_info output:
[root@co6 ~]# xfs_info /mysql/bp/
meta-data=/dev/mapper/sysvm-mysqlVol isize=256 agcount=16, agsize=4915136 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=78642176, imaxpct=25
= sunit=64 swidth=192 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=38400, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
sunit/swidth are both 0 in all the setup marked as OK above. We seem to be unable to change this, either in mkfs or with the noalign mount option. We also don't know if this is the cause.
Hugepages
Other people having XFS problems on rhel 6, say that disabling hugepages, and especially transparent hugepages can be beneficial. We disabled both, the problem did not go away.
We tried and observed many things already, none of the following have helped:
- Using numactl to influence memory allocations. We noticed that g7 and g8 have a different numa layout, no effect was seen
- Newer kernels (as new as 3.6) did not seem to solve this. Neither did using fedora 17.
- iostat does not report a ten-fold increase in write transactions, only in number of bytes written
- Using different I/O schedulers has no effect.
- Mounting the relevant filesystem noatime/nobarrier/nopdiratime did not help
- Changing /proc/sys/vm/dirty_ratio had no effect
- This happens both on systems based on 2640 and 2670 CPU's
- hpsa-3.2.0 doesn't fix the problem