Any insight from someone with a bit of experience in the linux IO system would be helpful. Here is my story:
Recently brought up a cluster of six Dell PowerEdge rx720xds to serve files via Ceph. These machines have 24 cores over two sockets with two numa zones and 70 odd gigabytes of memory. The disks are formatted as raids of one disk each (we could not see a way to expose them directly otherwise). Networking is provided by mellanox infiniband IP over IB (IP packets are turned into IB in kernel land, not hardware).
We have each of our SAS drives mounted like so:
# cat /proc/mounts | grep osd
/dev/sdm1 /var/lib/ceph/osd/ceph-90 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdj1 /var/lib/ceph/osd/ceph-87 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdu1 /var/lib/ceph/osd/ceph-99 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdd1 /var/lib/ceph/osd/ceph-82 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdk1 /var/lib/ceph/osd/ceph-88 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdl1 /var/lib/ceph/osd/ceph-89 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdh1 /var/lib/ceph/osd/ceph-86 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdo1 /var/lib/ceph/osd/ceph-97 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdc1 /var/lib/ceph/osd/ceph-81 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdb1 /var/lib/ceph/osd/ceph-80 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sds1 /var/lib/ceph/osd/ceph-98 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdn1 /var/lib/ceph/osd/ceph-91 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sde1 /var/lib/ceph/osd/ceph-83 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdq1 /var/lib/ceph/osd/ceph-93 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdg1 /var/lib/ceph/osd/ceph-85 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdt1 /var/lib/ceph/osd/ceph-95 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdf1 /var/lib/ceph/osd/ceph-84 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdr1 /var/lib/ceph/osd/ceph-94 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdi1 /var/lib/ceph/osd/ceph-96 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdp1 /var/lib/ceph/osd/ceph-92 xfs rw,noatime,attr2,inode64,noquota 0 0
The IO going through these machines bursts at a few hundred MB/s, but most of the time is pretty idle with a lot of little 'pokes':
# iostat -x -m
Linux 3.10.0-123.el7.x86_64 (xxx) 07/11/14 _x86_64_ (24 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.82 0.00 1.05 0.11 0.00 97.02
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.11 0.25 0.23 0.00 0.00 27.00 0.00 2.07 3.84 0.12 0.61 0.03
sdb 0.02 0.57 3.49 2.28 0.08 0.14 77.18 0.01 2.27 2.99 1.18 1.75 1.01
sdd 0.03 0.65 3.93 3.39 0.10 0.16 70.39 0.01 1.97 2.99 0.79 1.57 1.15
sdc 0.03 0.60 3.76 2.86 0.09 0.13 65.57 0.01 2.10 3.02 0.88 1.68 1.11
sdf 0.03 0.63 4.19 2.96 0.10 0.15 73.51 0.02 2.16 3.03 0.94 1.73 1.24
sdg 0.03 0.62 3.93 3.01 0.09 0.15 70.44 0.01 2.06 3.01 0.81 1.66 1.15
sde 0.03 0.56 4.35 2.61 0.10 0.14 69.53 0.02 2.26 3.00 1.02 1.82 1.26
sdj 0.02 0.73 3.67 4.74 0.10 0.37 116.06 0.02 1.84 3.01 0.93 1.31 1.10
sdh 0.03 0.62 4.31 3.04 0.10 0.15 67.83 0.02 2.15 3.04 0.89 1.75 1.29
sdi 0.02 0.59 3.82 2.47 0.09 0.13 74.35 0.01 2.20 2.96 1.03 1.76 1.10
sdl 0.03 0.59 4.75 2.46 0.11 0.14 70.19 0.02 2.33 3.02 1.00 1.93 1.39
sdk 0.02 0.57 3.66 2.41 0.09 0.13 73.57 0.01 2.20 3.00 0.97 1.76 1.07
sdm 0.03 0.66 4.03 3.17 0.09 0.14 66.13 0.01 2.02 3.00 0.78 1.64 1.18
sdn 0.03 0.62 4.70 3.00 0.11 0.16 71.63 0.02 2.25 3.01 1.05 1.79 1.38
sdo 0.02 0.62 3.75 2.48 0.10 0.13 76.01 0.01 2.16 2.94 0.99 1.70 1.06
sdp 0.03 0.62 5.03 2.50 0.11 0.15 68.65 0.02 2.39 3.08 0.99 1.99 1.50
sdq 0.03 0.53 4.46 2.08 0.09 0.12 67.74 0.02 2.42 3.04 1.09 2.01 1.32
sdr 0.03 0.57 4.21 2.31 0.09 0.14 72.05 0.02 2.35 3.00 1.16 1.89 1.23
sdt 0.03 0.66 4.78 5.13 0.10 0.20 61.78 0.02 1.90 3.10 0.79 1.49 1.47
sdu 0.03 0.55 3.93 2.42 0.09 0.13 70.77 0.01 2.17 2.97 0.85 1.79 1.14
sds 0.03 0.60 4.11 2.70 0.10 0.15 74.77 0.02 2.25 3.01 1.10 1.76 1.20
sdw 1.53 0.00 0.23 38.90 0.00 1.66 87.01 0.01 0.22 0.11 0.22 0.05 0.20
sdv 0.88 0.00 0.16 28.75 0.00 1.19 84.55 0.01 0.24 0.10 0.24 0.05 0.14
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 1.84 1.84 0.00 1.15 0.00
dm-1 0.00 0.00 0.23 0.29 0.00 0.00 23.78 0.00 1.87 4.06 0.12 0.55 0.03
dm-2 0.00 0.00 0.01 0.00 0.00 0.00 8.00 0.00 0.47 0.47 0.00 0.45 0.00
The problem:
After around 48 hours later, contiguous pages are so fragmented that magniutde four (16 pages, 65536 byte) allocations begin to fail and we start dropping packets (due to kalloc failing when a SLAB is grown).
This is what a relatively "healthy" server looks like:
# cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone DMA 0.000 0.000 0.000 0.001 0.003 0.007 0.015 0.031 0.031 0.096 0.225
Node 0, zone DMA32 0.000 0.009 0.015 0.296 0.733 0.996 0.997 0.998 0.998 1.000 1.000
Node 0, zone Normal 0.000 0.000 0.019 0.212 0.454 0.667 0.804 0.903 0.986 1.000 1.000
Node 1, zone Normal 0.000 0.027 0.040 0.044 0.071 0.270 0.506 0.772 1.000 1.000 1.000
When fragmentation gets considerably worse, the system seems to start spinning in kernel space and everything just falls apart. One anomaly during this failure is that xfsaild seems to use a lot of CPU and gets stuck in the uninterruptible sleep state. I don't want to jump to any conclusions with weirdness during total system failure, though.
Workaround thus far.
In order to ensure that these allocations not fail, even under fragmentation, I set:
vm.min_free_kbytes = 16777216
After seeing millions of blkdev_requests in SLAB caches, I tried to reduce dirty pages via:
vm.dirty_ratio = 1
vm.dirty_background_ratio = 1
vm.min_slab_ratio = 1
vm.zone_reclaim_mode = 3
Possibly changing too many variables at once, but just in case the inodes and dentries were causing fragmentation I decided to keep them to a minimum:
vm.vfs_cache_pressure = 10000
And this seemed to help. Fragmentation is still high though, and the reduced inode and dentry issues meant that I noticed something weird which leads me to...
My question:
Why is it that I have so many blkdev_requests (that are active no less), that just disappear when I drop caches?
Here is what I mean:
# slabtop -o -s c | head -20
Active / Total Objects (% used) : 19362505 / 19431176 (99.6%)
Active / Total Slabs (% used) : 452161 / 452161 (100.0%)
Active / Total Caches (% used) : 72 / 100 (72.0%)
Active / Total Size (% used) : 5897855.81K / 5925572.61K (99.5%)
Minimum / Average / Maximum Object : 0.01K / 0.30K / 15.69K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
2565024 2565017 99% 1.00K 80157 32 2565024K xfs_inode
3295194 3295194 100% 0.38K 78457 42 1255312K blkdev_requests
3428838 3399527 99% 0.19K 81639 42 653112K dentry
5681088 5680492 99% 0.06K 88767 64 355068K kmalloc-64
2901366 2897861 99% 0.10K 74394 39 297576K buffer_head
34148 34111 99% 8.00K 8537 4 273184K kmalloc-8192
334768 334711 99% 0.57K 11956 28 191296K radix_tree_node
614959 614959 100% 0.15K 11603 53 92824K xfs_ili
21263 19538 91% 2.84K 1933 11 61856K task_struct
18720 18636 99% 2.00K 1170 16 37440K kmalloc-2048
32032 25326 79% 1.00K 1001 32 32032K kmalloc-1024
10234 9202 89% 1.88K 602 17 19264K TCP
22152 19765 89% 0.81K 568 39 18176K task_xstate
# echo 2 > /proc/sys/vm/drop_caches :(
# slabtop -o -s c | head -20
Active / Total Objects (% used) : 965742 / 2593182 (37.2%)
Active / Total Slabs (% used) : 69451 / 69451 (100.0%)
Active / Total Caches (% used) : 72 / 100 (72.0%)
Active / Total Size (% used) : 551271.96K / 855029.41K (64.5%)
Minimum / Average / Maximum Object : 0.01K / 0.33K / 15.69K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
34140 34115 99% 8.00K 8535 4 273120K kmalloc-8192
143444 20166 14% 0.57K 5123 28 81968K radix_tree_node
768729 224574 29% 0.10K 19711 39 78844K buffer_head
73280 8287 11% 1.00K 2290 32 73280K xfs_inode
21263 19529 91% 2.84K 1933 11 61856K task_struct
686848 97798 14% 0.06K 10732 64 42928K kmalloc-64
223902 41010 18% 0.19K 5331 42 42648K dentry
32032 23282 72% 1.00K 1001 32 32032K kmalloc-1024
10234 9211 90% 1.88K 602 17 19264K TCP
22152 19924 89% 0.81K 568 39 18176K task_xstate
69216 59714 86% 0.25K 2163 32 17304K kmalloc-256
98421 23541 23% 0.15K 1857 53 14856K xfs_ili
5600 2915 52% 2.00K 350 16 11200K kmalloc-2048
This says to me that the blkdev_request buildup is not in fact related to dirty pages, and furthermore that the active objects aren't really active? How can these objects be freed if they aren't in fact in use? What is going on here?
For some background, here is what the drop_caches is doing:
http://lxr.free-electrons.com/source/fs/drop_caches.c
Update:
Worked out that they might not be blkdev_requests, but may be xfs_buf entries showing up under that 'heading'? Not sure how this works:
/sys/kernel/slab # ls -l blkdev_requests(
lrwxrwxrwx 1 root root 0 Nov 7 23:18 blkdev_requests -> :t-0000384/
/sys/kernel/slab # ls -l | grep 384
lrwxrwxrwx 1 root root 0 Nov 7 23:18 blkdev_requests -> :t-0000384/
lrwxrwxrwx 1 root root 0 Nov 7 23:19 ip6_dst_cache -> :t-0000384/
drwxr-xr-x 2 root root 0 Nov 7 23:18 :t-0000384/
lrwxrwxrwx 1 root root 0 Nov 7 23:19 xfs_buf -> :t-0000384/
I still don't know why these are cleared by the 'drop_slabs', or how to work out what is causing this fragmentation.
Bonus question: What's a better way to get to the source of this fragmentation?
If you read this far, thank you for your attention!
Extra requested information:
Memory and xfs info: https://gist.github.com/christian-marie/f417cc3134544544a8d1
Page allocation failure: https://gist.github.com/christian-marie/7bc845d2da7847534104
Follow up: perf info and compaction related things
http://ponies.io/raw/compaction.png
Compaction code seems a little inefficient huh? I've cobbled some code together to attempt to replicate the failed compactions: https://gist.github.com/christian-marie/cde7e80c5edb889da541
This seems to reproduce the issue.
I'll note also that an event trace tells me that there are lots of failed reclaims, over and over and over:
<...>-322 [023] .... 19509.445609: mm_vmscan_direct_reclaim_end: nr_reclaimed=1
Vmstat output is concerning also. Whilst the system is in this high load state the compactions are going through the roof (and mostly failing):
pgmigrate_success 38760827
pgmigrate_fail 350700119
compact_migrate_scanned 301784730
compact_free_scanned 204838172846
compact_isolated 18711615
compact_stall 270115
compact_fail 244488
compact_success 25212
There is indeed something awry with reclaim/compaction.
At the moment I'm looking towards reducing the high order allocations by adding SG support to our ipoib setup. The real issue appears likely vmscan related.
This is interesting, and references this question: http://marc.info/?l=linux-mm&m=141607142529562&w=2