Cause of page fragmentation on "large" server with xfs, 20 disks and Ceph

Question

Any insight from someone with a bit of experience in the linux IO system would be helpful. Here is my story:

Recently brought up a cluster of six Dell PowerEdge rx720xds to serve files via Ceph. These machines have 24 cores over two sockets with two numa zones and 70 odd gigabytes of memory. The disks are formatted as raids of one disk each (we could not see a way to expose them directly otherwise). Networking is provided by mellanox infiniband IP over IB (IP packets are turned into IB in kernel land, not hardware).

We have each of our SAS drives mounted like so:

# cat /proc/mounts | grep osd
/dev/sdm1 /var/lib/ceph/osd/ceph-90 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdj1 /var/lib/ceph/osd/ceph-87 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdu1 /var/lib/ceph/osd/ceph-99 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdd1 /var/lib/ceph/osd/ceph-82 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdk1 /var/lib/ceph/osd/ceph-88 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdl1 /var/lib/ceph/osd/ceph-89 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdh1 /var/lib/ceph/osd/ceph-86 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdo1 /var/lib/ceph/osd/ceph-97 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdc1 /var/lib/ceph/osd/ceph-81 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdb1 /var/lib/ceph/osd/ceph-80 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sds1 /var/lib/ceph/osd/ceph-98 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdn1 /var/lib/ceph/osd/ceph-91 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sde1 /var/lib/ceph/osd/ceph-83 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdq1 /var/lib/ceph/osd/ceph-93 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdg1 /var/lib/ceph/osd/ceph-85 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdt1 /var/lib/ceph/osd/ceph-95 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdf1 /var/lib/ceph/osd/ceph-84 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdr1 /var/lib/ceph/osd/ceph-94 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdi1 /var/lib/ceph/osd/ceph-96 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdp1 /var/lib/ceph/osd/ceph-92 xfs rw,noatime,attr2,inode64,noquota 0 0

The IO going through these machines bursts at a few hundred MB/s, but most of the time is pretty idle with a lot of little 'pokes':

# iostat -x -m
Linux 3.10.0-123.el7.x86_64 (xxx)   07/11/14    _x86_64_    (24 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
       1.82    0.00    1.05    0.11    0.00   97.02
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.11    0.25    0.23     0.00     0.00    27.00     0.00    2.07    3.84    0.12   0.61   0.03
sdb               0.02     0.57    3.49    2.28     0.08     0.14    77.18     0.01    2.27    2.99    1.18   1.75   1.01
sdd               0.03     0.65    3.93    3.39     0.10     0.16    70.39     0.01    1.97    2.99    0.79   1.57   1.15
sdc               0.03     0.60    3.76    2.86     0.09     0.13    65.57     0.01    2.10    3.02    0.88   1.68   1.11
sdf               0.03     0.63    4.19    2.96     0.10     0.15    73.51     0.02    2.16    3.03    0.94   1.73   1.24
sdg               0.03     0.62    3.93    3.01     0.09     0.15    70.44     0.01    2.06    3.01    0.81   1.66   1.15
sde               0.03     0.56    4.35    2.61     0.10     0.14    69.53     0.02    2.26    3.00    1.02   1.82   1.26
sdj               0.02     0.73    3.67    4.74     0.10     0.37   116.06     0.02    1.84    3.01    0.93   1.31   1.10
sdh               0.03     0.62    4.31    3.04     0.10     0.15    67.83     0.02    2.15    3.04    0.89   1.75   1.29
sdi               0.02     0.59    3.82    2.47     0.09     0.13    74.35     0.01    2.20    2.96    1.03   1.76   1.10
sdl               0.03     0.59    4.75    2.46     0.11     0.14    70.19     0.02    2.33    3.02    1.00   1.93   1.39
sdk               0.02     0.57    3.66    2.41     0.09     0.13    73.57     0.01    2.20    3.00    0.97   1.76   1.07
sdm               0.03     0.66    4.03    3.17     0.09     0.14    66.13     0.01    2.02    3.00    0.78   1.64   1.18
sdn               0.03     0.62    4.70    3.00     0.11     0.16    71.63     0.02    2.25    3.01    1.05   1.79   1.38
sdo               0.02     0.62    3.75    2.48     0.10     0.13    76.01     0.01    2.16    2.94    0.99   1.70   1.06
sdp               0.03     0.62    5.03    2.50     0.11     0.15    68.65     0.02    2.39    3.08    0.99   1.99   1.50
sdq               0.03     0.53    4.46    2.08     0.09     0.12    67.74     0.02    2.42    3.04    1.09   2.01   1.32
sdr               0.03     0.57    4.21    2.31     0.09     0.14    72.05     0.02    2.35    3.00    1.16   1.89   1.23
sdt               0.03     0.66    4.78    5.13     0.10     0.20    61.78     0.02    1.90    3.10    0.79   1.49   1.47
sdu               0.03     0.55    3.93    2.42     0.09     0.13    70.77     0.01    2.17    2.97    0.85   1.79   1.14
sds               0.03     0.60    4.11    2.70     0.10     0.15    74.77     0.02    2.25    3.01    1.10   1.76   1.20
sdw               1.53     0.00    0.23   38.90     0.00     1.66    87.01     0.01    0.22    0.11    0.22   0.05   0.20
sdv               0.88     0.00    0.16   28.75     0.00     1.19    84.55     0.01    0.24    0.10    0.24   0.05   0.14
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00    1.84    1.84    0.00   1.15   0.00
dm-1              0.00     0.00    0.23    0.29     0.00     0.00    23.78     0.00    1.87    4.06    0.12   0.55   0.03
dm-2              0.00     0.00    0.01    0.00     0.00     0.00     8.00     0.00    0.47    0.47    0.00   0.45   0.00

The problem:

After around 48 hours later, contiguous pages are so fragmented that magniutde four (16 pages, 65536 byte) allocations begin to fail and we start dropping packets (due to kalloc failing when a SLAB is grown).

This is what a relatively "healthy" server looks like:

# cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.003 0.007 0.015 0.031 0.031 0.096 0.225 
Node 0, zone    DMA32 0.000 0.009 0.015 0.296 0.733 0.996 0.997 0.998 0.998 1.000 1.000 
Node 0, zone   Normal 0.000 0.000 0.019 0.212 0.454 0.667 0.804 0.903 0.986 1.000 1.000 
Node 1, zone   Normal 0.000 0.027 0.040 0.044 0.071 0.270 0.506 0.772 1.000 1.000 1.000

When fragmentation gets considerably worse, the system seems to start spinning in kernel space and everything just falls apart. One anomaly during this failure is that xfsaild seems to use a lot of CPU and gets stuck in the uninterruptible sleep state. I don't want to jump to any conclusions with weirdness during total system failure, though.

Workaround thus far.

In order to ensure that these allocations not fail, even under fragmentation, I set:

vm.min_free_kbytes = 16777216

After seeing millions of blkdev_requests in SLAB caches, I tried to reduce dirty pages via:

vm.dirty_ratio = 1
vm.dirty_background_ratio = 1
vm.min_slab_ratio = 1
vm.zone_reclaim_mode = 3

Possibly changing too many variables at once, but just in case the inodes and dentries were causing fragmentation I decided to keep them to a minimum:

vm.vfs_cache_pressure = 10000

And this seemed to help. Fragmentation is still high though, and the reduced inode and dentry issues meant that I noticed something weird which leads me to...

My question:

Why is it that I have so many blkdev_requests (that are active no less), that just disappear when I drop caches?

Here is what I mean:

# slabtop -o -s c | head -20
 Active / Total Objects (% used)    : 19362505 / 19431176 (99.6%)
 Active / Total Slabs (% used)      : 452161 / 452161 (100.0%)
 Active / Total Caches (% used)     : 72 / 100 (72.0%)
 Active / Total Size (% used)       : 5897855.81K / 5925572.61K (99.5%)
 Minimum / Average / Maximum Object : 0.01K / 0.30K / 15.69K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
2565024 2565017  99%    1.00K  80157       32   2565024K xfs_inode              
3295194 3295194 100%    0.38K  78457       42   1255312K blkdev_requests        
3428838 3399527  99%    0.19K  81639       42    653112K dentry                 
5681088 5680492  99%    0.06K  88767       64    355068K kmalloc-64             
2901366 2897861  99%    0.10K  74394       39    297576K buffer_head            
 34148  34111  99%    8.00K   8537        4    273184K kmalloc-8192           
334768 334711  99%    0.57K  11956       28    191296K radix_tree_node        
614959 614959 100%    0.15K  11603       53     92824K xfs_ili                
 21263  19538  91%    2.84K   1933       11     61856K task_struct            
 18720  18636  99%    2.00K   1170       16     37440K kmalloc-2048           
 32032  25326  79%    1.00K   1001       32     32032K kmalloc-1024           
 10234   9202  89%    1.88K    602       17     19264K TCP                    
 22152  19765  89%    0.81K    568       39     18176K task_xstate

# echo 2 > /proc/sys/vm/drop_caches                                                                                                                                                   :(
# slabtop -o -s c | head -20       
 Active / Total Objects (% used)    : 965742 / 2593182 (37.2%)
 Active / Total Slabs (% used)      : 69451 / 69451 (100.0%)
 Active / Total Caches (% used)     : 72 / 100 (72.0%)
 Active / Total Size (% used)       : 551271.96K / 855029.41K (64.5%)
 Minimum / Average / Maximum Object : 0.01K / 0.33K / 15.69K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 34140  34115  99%    8.00K   8535        4    273120K kmalloc-8192           
143444  20166  14%    0.57K   5123       28     81968K radix_tree_node        
768729 224574  29%    0.10K  19711       39     78844K buffer_head            
 73280   8287  11%    1.00K   2290       32     73280K xfs_inode              
 21263  19529  91%    2.84K   1933       11     61856K task_struct            
686848  97798  14%    0.06K  10732       64     42928K kmalloc-64             
223902  41010  18%    0.19K   5331       42     42648K dentry                 
 32032  23282  72%    1.00K   1001       32     32032K kmalloc-1024           
 10234   9211  90%    1.88K    602       17     19264K TCP                    
 22152  19924  89%    0.81K    568       39     18176K task_xstate            
 69216  59714  86%    0.25K   2163       32     17304K kmalloc-256            
 98421  23541  23%    0.15K   1857       53     14856K xfs_ili                
  5600   2915  52%    2.00K    350       16     11200K kmalloc-2048

This says to me that the blkdev_request buildup is not in fact related to dirty pages, and furthermore that the active objects aren't really active? How can these objects be freed if they aren't in fact in use? What is going on here?

For some background, here is what the drop_caches is doing:

http://lxr.free-electrons.com/source/fs/drop_caches.c

Update:

Worked out that they might not be blkdev_requests, but may be xfs_buf entries showing up under that 'heading'? Not sure how this works:

/sys/kernel/slab # ls -l blkdev_requests(
lrwxrwxrwx 1 root root 0 Nov  7 23:18 blkdev_requests -> :t-0000384/

/sys/kernel/slab # ls -l | grep 384
lrwxrwxrwx 1 root root 0 Nov  7 23:18 blkdev_requests -> :t-0000384/
lrwxrwxrwx 1 root root 0 Nov  7 23:19 ip6_dst_cache -> :t-0000384/
drwxr-xr-x 2 root root 0 Nov  7 23:18 :t-0000384/
lrwxrwxrwx 1 root root 0 Nov  7 23:19 xfs_buf -> :t-0000384/

I still don't know why these are cleared by the 'drop_slabs', or how to work out what is causing this fragmentation.

Bonus question: What's a better way to get to the source of this fragmentation?

If you read this far, thank you for your attention!

Extra requested information:

Memory and xfs info: https://gist.github.com/christian-marie/f417cc3134544544a8d1

Page allocation failure: https://gist.github.com/christian-marie/7bc845d2da7847534104

Follow up: perf info and compaction related things

http://ponies.io/raw/compaction.png

Compaction code seems a little inefficient huh? I've cobbled some code together to attempt to replicate the failed compactions: https://gist.github.com/christian-marie/cde7e80c5edb889da541

This seems to reproduce the issue.

I'll note also that an event trace tells me that there are lots of failed reclaims, over and over and over:

<...>-322 [023] .... 19509.445609: mm_vmscan_direct_reclaim_end: nr_reclaimed=1

Vmstat output is concerning also. Whilst the system is in this high load state the compactions are going through the roof (and mostly failing):

pgmigrate_success 38760827 pgmigrate_fail 350700119 compact_migrate_scanned 301784730 compact_free_scanned 204838172846 compact_isolated 18711615 compact_stall 270115 compact_fail 244488 compact_success 25212

There is indeed something awry with reclaim/compaction.

At the moment I'm looking towards reducing the high order allocations by adding SG support to our ipoib setup. The real issue appears likely vmscan related.

This is interesting, and references this question: http://marc.info/?l=linux-mm&m=141607142529562&w=2

Heck yeah!! We don't get many of these good questions. I'll see what we can do, though. — ewwhite, Nov 07 '14 at 23:21
Please can you provide the output of `/proc/buddyinfo` and the results of `free -m`? The blockdev requests are *probably* represented as buffers in `free`. Oh, and the distro you are using would be good too. Additionally, do you have any `page allocation failure` messages appearing in `dmesg`? If so please provide the output plus any relevent context. — Matthew Ife, Nov 07 '14 at 23:35
Also, did you use a specific `mkfs.xfs` command line? Hugepages enabled? — ewwhite, Nov 07 '14 at 23:36
Tried disabling transparent hugepages by themselves (setting them to never), failure still occured. Did not try this in conjunction with other 'fixes'. — pingu, Nov 07 '14 at 23:44
XFS and memory info: https://gist.github.com/christian-marie/f417cc3134544544a8d1 — pingu, Nov 07 '14 at 23:47
Here is an allocation failure: https://gist.github.com/christian-marie/7bc845d2da7847534104 — pingu, Nov 07 '14 at 23:50
@pingu I'd like to see more XFS allocation groups for disks of that size, but hmm, otherwise seems normal. — ewwhite, Nov 07 '14 at 23:52
@ewwhite, I'll look into allocation groups when I sort this out then, thanks. I'm not 100% sure if any xfs options were used on creation. I did not personally format the disks. — pingu, Nov 07 '14 at 23:57
`/proc/buddyinfo` gives you the fragmentation of the page orders. If you manually defragment the zones using `echo 1>/proc/sys/vm/compact_memory` does that not significantly alter the zone ordering? — Matthew Ife, Nov 08 '14 at 01:31
@MatthewIfe How would I manually defragment the zones? Is this what you mean: https://gist.github.com/christian-marie/4744c94a537b6a3bfb45 — pingu, Nov 08 '14 at 01:34
Sure, but dropping the caches makes the zone defrag redundant. — Matthew Ife, Nov 08 '14 at 01:36
@MatthewIfe Like this? https://gist.github.com/christian-marie/5a6553a5b5fbd11f7cfd — pingu, Nov 08 '14 at 01:38
Thats right, but currently theres hardly anything to worry about with regards to fragmentation. I'm writing an 'answer' to try to collate some details together. — Matthew Ife, Nov 08 '14 at 01:49
Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/18492/discussion-between-pingu-and-matthew-ife). — pingu, Nov 08 '14 at 02:42

score 4 · Answer 1 · answered Nov 08 '14 at 02:01

I thought I'd put an answer with my observations because there are a lot of comments.

Based off of your output at https://gist.github.com/christian-marie/7bc845d2da7847534104

We can determine the following:

The GFP_MASK for the memory allocation tried is allowed to do the following.
- Can access emergency pools (I think this means access data below the high watermark for a zone)
- Dont use emergency reserves (I think this means dont allow access to memroy below the min watermark)
- Allocate from one of the normal zones.
- Can swap in order to make room.
- Can drop caches in order to make room.

Zone fragmentation is location here:

[3443189.780792] Node 0 Normal: 3300*4kB (UEM) 8396*8kB (UEM) 4218*16kB (UEM) 76*32kB (UEM) 12*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 151056kB
[3443189.780801] Node 1 Normal: 26667*4kB (UEM) 6084*8kB (UEM) 2040*16kB (UEM) 96*32kB (UEM) 22*64kB (UEM) 4*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 192972kB

And memory utilization at the time is here:

[3443189.780759] Node 0 Normal free:149520kB min:40952kB low:51188kB high:61428kB active_anon:9694208kB inactive_anon:1054236kB active_file:7065912kB inactive_file:7172412kB unevictable:0kB isolated(anon):5452kB isolated(file):3616kB present:30408704kB managed:29881160kB mlocked:0kB dirty:0kB writeback:0kB mapped:25440kB shmem:743788kB slab_reclaimable:1362240kB slab_unreclaimable:783096kB kernel_stack:29488kB pagetables:43748kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[3443189.780766] Node 1 Normal free:191444kB min:45264kB low:56580kB high:67896kB active_anon:11371988kB inactive_anon:1172444kB active_file:8084140kB inactive_file:8556980kB unevictable:0kB isolated(anon):4388kB isolated(file):4676kB present:33554432kB managed:33026648kB mlocked:0kB dirty:0kB writeback:0kB mapped:45400kB shmem:2263296kB slab_reclaimable:1606604kB slab_unreclaimable:438220kB kernel_stack:55936kB pagetables:44944kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

The fragmentation of each zone is bad in the page allocation failure output. There are lot of free order 0 pages with much fewer to none higher order pages. A 'good' result will be plentiful free pages along each order, gradually getting lower in size the higher the order you go. Having 0 high order pages 5 and above indicates fragmentation and starvation for high order allocations.

I am currently not seeing a convincing degree of evidence to suggest that the fragmentation during this period is anything to do with slab caches. In the resulting memory stats, we can see the following

Node 0 = active_anon:9694208kB inactive_anon:1054236kB
Node 1 = active anon:11371988kB inactive_anon:1172444kB

There are no huge pages assigned from userspace, and userspace will thus always claim order 0 memory. Thus in both zones altogether there is over 22GiB of defragmentable memory.

Behaviours I cannot Explain

When high order allocations fail, it is my understanding that memory compaction is always attempted in order to allow for regions of high-order memory allocation to take place and succeed. Why does this not happen? If it does happen, why can it not find any memory to defragment when there is 22GiB of it ripe for reordering?

Behaviours I think I can explain

This needs more research to understand properly, but I believe the ability for the allocation to automatically swap/drop some pagecache to succeed probably does not apply here because there is a lot of free memory still available, so no reclaims occur. Just not enough in the higher orders.

Whilst theres lots of free memory and a few order 4 requests left in each zone, the "total all free memory for each order and deduct from the real free memory" issue results in a 'free memory' below the 'min' watermark which is what leads to the actual allocation failure.

It does seem odd that a relatively (to total free memory) small SLAB cache would fragment all of the memories. I would have expected that with all of those free evictable pages it would simply evict some and be done with it. I suspect NUMA might have something to do with this oddness. — pingu, Nov 08 '14 at 02:17
@pingu If this behavior is reproducible, try enabling the `numad` service and take note of the actions in `/var/log/numad.log`. You may also need libcgroup installed. — ewwhite, Nov 08 '14 at 04:44
@ewwhite Okay, I have one running now. I'm not sure what I'm expecting it to do or what information we might get from it. What are you hoping might happen? — pingu, Nov 08 '14 at 05:48
@pingu Oh, [numad](http://linux.die.net/man/8/numad) automates process placement to optimize NUMA distances/locality and actively moves processes into cgroups to accomplish this. I enable this by default on systems of this size. I thought it could potentially address the fragmentation issue. — ewwhite, Nov 08 '14 at 05:59
Working on a patch to add SG support to ipoib as a way forward at the moment to reduce high order allocation load. I will provide a better follow up when I confirm some more things. — pingu, Nov 18 '14 at 23:20
@xdel yes, your problem looks very similar to mine: http://ponies.io/raw/compaction.png — pingu, Nov 18 '14 at 23:21

score 0 · Answer 2 · answered Sep 06 '20 at 11:34

We had same TX packet drop issue with Ceph on IP over IB. In our case, the problem was due to large MTU size(64K). Someone selected large MTU size(64K) to increase throughput. However, when we ran Ceph long time and heavy loads, the throughput and osd latency become worse due to many tx packet drops. And when we change the MTU size to 9K and throughput and latency become stable. We are also considering to reduce MTU size to 8K according to following article. https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaag/wehs/l0wehs00_otherconfigurationconsiderationoptimalmtusize.htm

Some devices cannot handle MTU above (approximately) 8000 or 9000. — Michael Hampton, Sep 06 '20 at 14:59