
Problem

A CentOS machine with kernel 2.6.32 and 128 GB physical RAM ran into trouble a few days ago. The responsible system administrator tells me that the PHP-FPM application was not responding to requests in a timely manner anymore due to swapping, and having seen in free that almost no memory was left, he chose to reboot the machine.

I know that free memory can be a confusing concept on Linux and a reboot perhaps was the wrong thing to do. However, the mentioned administrator blames the PHP application (which I am responsible for) and refuses to investigate further.

What I could find out on my own is this:

  • Before the restart, the free memory (incl. buffers and cache) was only a couple of hundred MB.
  • Before the restart, /proc/meminfo reported a Slab memory usage of around 90 GB (yes, GB).
  • After the restart, the free memory was 119 GB, going down to around 100 GB within an hour, as the PHP-FPM workers (about 600 of them) were coming back to life, each of them showing between 30 and 40 MB in the RES column in top (which has been this way for months and is perfectly reasonable given the nature of the PHP application). There is nothing else in the process list that consumes an unusual or noteworthy amount of RAM.
  • After the restart, Slab memory was around 300 MB

I have been monitoring the system ever since, and most notably the Slab memory is increasing in a straight line at a rate of about 5 GB per day. Free memory as reported by free and /proc/meminfo decreases at the same rate. Slab is currently at 46 GB. According to slabtop, most of it is used for dentry entries:

Free memory:

free -m
             total       used       free     shared    buffers     cached
Mem:        129048      76435      52612          0        144       7675
-/+ buffers/cache:      68615      60432
Swap:         8191          0       8191

Meminfo:

cat /proc/meminfo
MemTotal:       132145324 kB
MemFree:        53620068 kB
Buffers:          147760 kB
Cached:          8239072 kB
SwapCached:            0 kB
Active:         20300940 kB
Inactive:        6512716 kB
Active(anon):   18408460 kB
Inactive(anon):    24736 kB
Active(file):    1892480 kB
Inactive(file):  6487980 kB
Unevictable:        8608 kB
Mlocked:            8608 kB
SwapTotal:       8388600 kB
SwapFree:        8388600 kB
Dirty:             11416 kB
Writeback:             0 kB
AnonPages:      18436224 kB
Mapped:            94536 kB
Shmem:              6364 kB
Slab:           46240380 kB
SReclaimable:   44561644 kB
SUnreclaim:      1678736 kB
KernelStack:        9336 kB
PageTables:       457516 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    72364108 kB
Committed_AS:   22305444 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      480164 kB
VmallocChunk:   34290830848 kB
HardwareCorrupted:     0 kB
AnonHugePages:  12216320 kB
HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        5604 kB
DirectMap2M:     2078720 kB
DirectMap1G:    132120576 kB
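
Since /proc/meminfo is readable by a regular user, the slab figures above can be summarised without root. The following is a minimal sketch; the helper names `parse_meminfo` and `slab_summary` are made up for illustration and are not part of any library:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most values are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        info[name.strip()] = int(rest.split()[0])
    return info


def slab_summary(info):
    """Return slab usage in GiB and its share of MemTotal (inputs assumed to be kB)."""
    kb_per_gib = 1024 * 1024
    return {
        "slab_gib": info["Slab"] / kb_per_gib,
        "reclaimable_gib": info["SReclaimable"] / kb_per_gib,
        "slab_share": info["Slab"] / info["MemTotal"],
    }
```

Calling `slab_summary(parse_meminfo(open("/proc/meminfo").read()))` on the host gives the current values. With the figures pasted above it works out to roughly 44.1 GiB of slab (about 35% of RAM), of which about 42.5 GiB is marked reclaimable.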

Slabtop:

slabtop --once
Active / Total Objects (% used)    : 225920064 / 226193412 (99.9%)
 Active / Total Slabs (% used)      : 11556364 / 11556415 (100.0%)
 Active / Total Caches (% used)     : 110 / 194 (56.7%)
 Active / Total Size (% used)       : 43278793.73K / 43315465.42K (99.9%)
 Minimum / Average / Maximum Object : 0.02K / 0.19K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
221416340 221416039   3%    0.19K 11070817       20  44283268K dentry                 
1123443 1122739  99%    0.41K 124827        9    499308K fuse_request           
1122320 1122180  99%    0.75K 224464        5    897856K fuse_inode             
761539 754272  99%    0.20K  40081       19    160324K vm_area_struct         
437858 223259  50%    0.10K  11834       37     47336K buffer_head            
353353 347519  98%    0.05K   4589       77     18356K anon_vma_chain         
325090 324190  99%    0.06K   5510       59     22040K size-64                
146272 145422  99%    0.03K   1306      112      5224K size-32                
137625 137614  99%    1.02K  45875        3    183500K nfs_inode_cache        
128800 118407  91%    0.04K   1400       92      5600K anon_vma               
 59101  46853  79%    0.55K   8443        7     33772K radix_tree_node        
 52620  52009  98%    0.12K   1754       30      7016K size-128               
 19359  19253  99%    0.14K    717       27      2868K sysfs_dir_cache        
 10240   7746  75%    0.19K    512       20      2048K filp  
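
To put the dentry count in relation to the cached inodes, the slabtop rows can be parsed and compared. This is only a sketch: the function names are hypothetical, the parser assumes the 8-column layout shown above, and the choice of `fuse_inode` and `nfs_inode_cache` as "the inode caches" is an assumption based on the rows visible in this excerpt (any local ext4 inode cache is not listed here).

```python
def parse_slabtop_rows(text):
    """Parse the table rows of `slabtop --once` output into dicts.

    Assumes 8 whitespace-separated fields per data row:
    OBJS ACTIVE USE OBJ-SIZE SLABS OBJ/SLAB CACHE-SIZE NAME
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Summary and header lines do not match this shape and are skipped.
        if len(parts) != 8 or not parts[0].isdigit():
            continue
        rows.append({
            "name": parts[7],
            "objects": int(parts[0]),
            "active": int(parts[1]),
            "cache_kb": int(parts[6].rstrip("K")),
        })
    return rows


def dentries_per_inode(rows, inode_caches=("fuse_inode", "nfs_inode_cache")):
    """Ratio of dentry objects to the objects in the given inode caches."""
    by_name = {row["name"]: row["objects"] for row in rows}
    inodes = sum(by_name.get(name, 0) for name in inode_caches)
    return by_name["dentry"] / inodes if inodes else float("inf")
```

For the output above this comes to well over a hundred dentries per cached FUSE/NFS inode, which is the mismatch the question about inodes versus dentries refers to.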

VFS cache pressure:

cat /proc/sys/vm/vfs_cache_pressure
125

Swappiness:

cat /proc/sys/vm/swappiness
0

I know that unused memory is wasted memory, so this should not necessarily be a bad thing (especially given that 44 GB are shown as SReclaimable). However, apparently the machine experienced problems nonetheless, and I'm afraid the same will happen again in a few days when Slab surpasses 90 GB.
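
As a rough worked estimate of when that might happen, assuming the growth stays linear at the observed rate (the function name is just illustrative):

```python
def days_until(threshold_gb, current_gb, growth_gb_per_day):
    """Days until a linearly growing value reaches the threshold (0 if already there)."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return max(0.0, (threshold_gb - current_gb) / growth_gb_per_day)


# Figures from this question: Slab at 46 GB, growing ~5 GB/day, trouble near 90 GB.
print(days_until(90, 46, 5))
```

With these numbers the estimate is a little under nine days, provided nothing reclaims the slab memory in the meantime.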

Questions

I have these questions:

  • Am I correct in thinking that the Slab memory is always physical RAM, and the number is already subtracted from the MemFree value?
  • Is such a high number of dentry entries normal? The PHP application has access to around 1.5 M files, however most of them are archives and not being accessed at all for regular web traffic.
  • What could be an explanation for the fact that the number of cached inodes is much lower than the number of cached dentries, should they not be related somehow?
  • If the system runs into memory trouble, should the kernel not free some of the dentries automatically? What could be a reason that this does not happen?
  • Is there any way to "look into" the dentry cache to see what all this memory is (i.e. what are the paths that are being cached)? Perhaps this points to some kind of memory leak, symlink loop, or indeed to something the PHP application is doing wrong.
  • The PHP application code as well as all asset files are mounted via GlusterFS network file system, could that have something to do with it?

Please keep in mind that I cannot investigate as root, only as a regular user, and that the administrator refuses to help. He won't even run the typical echo 2 > /proc/sys/vm/drop_caches test to see if the Slab memory is indeed reclaimable.
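
One thing that can usually be read without root is /proc/sys/fs/dentry-state, which exposes dentry counters (though not the cached paths). The sketch below assumes that file is readable on this host and that its first four fields are nr_dentry, nr_unused, age_limit and want_pages, as in the kernel's sysctl documentation; the helper names are hypothetical.

```python
from collections import namedtuple

DentryState = namedtuple("DentryState", "nr_dentry nr_unused age_limit want_pages")


def parse_dentry_state(text):
    """Parse the first four integer fields of /proc/sys/fs/dentry-state."""
    fields = [int(tok) for tok in text.split()]
    if len(fields) < 4:
        raise ValueError("unexpected dentry-state format: %r" % text)
    return DentryState(*fields[:4])


def growth_per_second(earlier, later, seconds):
    """Average change in allocated dentries per second between two samples."""
    return (later.nr_dentry - earlier.nr_dentry) / seconds
```

Sampling the file twice a known number of seconds apart and passing both samples to `growth_per_second` gives a dentry allocation rate that can be compared against web traffic or cron activity.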

Any insights into what could be going on and how I can investigate any further would be greatly appreciated.

Updates

Some further diagnostic information:

Mounts:

cat /proc/self/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=66063000k,nr_inodes=16515750,mode=755 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/mapper/sysvg-lv_root / ext4 rw,relatime,barrier=1,data=ordered 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/sda1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0
tmpfs /phptmp tmpfs rw,noatime,size=1048576k,nr_inodes=15728640,mode=777 0 0
tmpfs /wsdltmp tmpfs rw,noatime,size=1048576k,nr_inodes=15728640,mode=777 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0
/etc/glusterfs/glusterfs-www.vol /var/www fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0
/etc/glusterfs/glusterfs-upload.vol /var/upload fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
172.17.39.78:/www /data/www nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=38467,timeo=600,retrans=2,sec=sys,mountaddr=172.17.39.78,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=172.17.39.78 0 0

Mount info:

cat /proc/self/mountinfo
16 21 0:3 / /proc rw,relatime - proc proc rw
17 21 0:0 / /sys rw,relatime - sysfs sysfs rw
18 21 0:5 / /dev rw,relatime - devtmpfs devtmpfs rw,size=66063000k,nr_inodes=16515750,mode=755
19 18 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
20 18 0:16 / /dev/shm rw,relatime - tmpfs tmpfs rw
21 1 253:1 / / rw,relatime - ext4 /dev/mapper/sysvg-lv_root rw,barrier=1,data=ordered
22 16 0:15 / /proc/bus/usb rw,relatime - usbfs /proc/bus/usb rw
23 21 8:1 / /boot rw,relatime - ext4 /dev/sda1 rw,barrier=1,data=ordered
24 21 0:17 / /phptmp rw,noatime - tmpfs tmpfs rw,size=1048576k,nr_inodes=15728640,mode=777
25 21 0:18 / /wsdltmp rw,noatime - tmpfs tmpfs rw,size=1048576k,nr_inodes=15728640,mode=777
26 16 0:19 / /proc/sys/fs/binfmt_misc rw,relatime - binfmt_misc none rw
27 21 0:20 / /cgroup/cpuset rw,relatime - cgroup cgroup rw,cpuset
28 21 0:21 / /cgroup/cpu rw,relatime - cgroup cgroup rw,cpu
29 21 0:22 / /cgroup/cpuacct rw,relatime - cgroup cgroup rw,cpuacct
30 21 0:23 / /cgroup/memory rw,relatime - cgroup cgroup rw,memory
31 21 0:24 / /cgroup/devices rw,relatime - cgroup cgroup rw,devices
32 21 0:25 / /cgroup/freezer rw,relatime - cgroup cgroup rw,freezer
33 21 0:26 / /cgroup/net_cls rw,relatime - cgroup cgroup rw,net_cls
34 21 0:27 / /cgroup/blkio rw,relatime - cgroup cgroup rw,blkio
35 21 0:28 / /var/www rw,relatime - fuse.glusterfs /etc/glusterfs/glusterfs-www.vol rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072
36 21 0:29 / /var/upload rw,relatime - fuse.glusterfs /etc/glusterfs/glusterfs-upload.vol rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072
37 21 0:30 / /var/lib/nfs/rpc_pipefs rw,relatime - rpc_pipefs sunrpc rw
39 21 0:31 / /data/www rw,relatime - nfs 172.17.39.78:/www rw,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=38467,timeo=600,retrans=2,sec=sys,mountaddr=172.17.39.78,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=172.17.39.78

GlusterFS config:

cat /etc/glusterfs/glusterfs-www.vol
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.71
   option ping-timeout 10
   option transport.socket.nodelay on # undocumented option for speed
    # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.72
   option ping-timeout 10
   option transport.socket.nodelay on # undocumented option for speed
        # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.73
   option ping-timeout 10
   option transport.socket.nodelay on # undocumented option for speed
        # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume remote4
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.74
   option ping-timeout 10
   option transport.socket.nodelay on # undocumented option for speed
        # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume replicate1
  type cluster/replicate
   option lookup-unhashed off    # off will reduce cpu usage, and network
   option local-volume-name 'hostname'
  subvolumes remote1 remote2
end-volume

volume replicate2
  type cluster/replicate
   option lookup-unhashed off    # off will reduce cpu usage, and network
   option local-volume-name 'hostname'
  subvolumes remote3 remote4
end-volume

volume distribute
  type cluster/distribute
  subvolumes replicate1 replicate2
end-volume

volume iocache
  type performance/io-cache
   option cache-size 8192MB        # default is 32MB
   subvolumes distribute
end-volume

volume writeback
  type performance/write-behind
  option cache-size 1024MB
  option window-size 1MB
  subvolumes iocache
end-volume

### Add io-threads for parallel requisitions
volume iothreads
  type performance/io-threads
  option thread-count 64 # default is 16
  subvolumes writeback
end-volume

volume ra
  type performance/read-ahead
  option page-size 2MB
  option page-count 16
  option force-atime-update no
  subvolumes iothreads
end-volume
Wolfgang Stengel
  • Please provide the output of `cat /proc/self/mounts` and (maybe quite long) `cat /proc/self/mountinfo`. – Matthew Ife Dec 14 '13 at 16:03
  • @MIfe I've updated the question, both outputs are appended. – Wolfgang Stengel Dec 14 '13 at 16:12
  • My feeling here is it's probably to do with NFS dentry caching. Out of interest, can you run `cat /etc/nfsmount.conf`? Also, do you have any directories that contain many files directly in them? – Matthew Ife Dec 14 '13 at 16:29
  • Well, since vfs_cache_pressure > 100, the kernel should prefer to reclaim dentry cache memory. This could easily be a bug; 2.6.32 is a rather old kernel, even with Red Hat backport patches. BTW, what is the exact kernel version? – poige Dec 14 '13 at 16:30
  • Oh -- and you also have a gluster volume. What is that used for and how are the bricks set up? `cat /etc/glusterfs/glusterfs-www.vol` – Matthew Ife Dec 14 '13 at 16:33
  • @MIfe All lines in `/etc/nfsmount.conf` are commented out, it's all default values I suppose. There are some dirs that have many files in them, but never more than 1000 per dir. – Wolfgang Stengel Dec 14 '13 at 16:35
  • @poige It's 2.6.32-431.el6.x86_64. I've been googling around for this problem though, some people claim this is a bug, but it apparently always turned out not to be. – Wolfgang Stengel Dec 14 '13 at 16:36
  • @MIfe GlusterFS config added. I don't know anything about the setup though, I just know it exists and that the files show up in the right place for the PHP application. – Wolfgang Stengel Dec 14 '13 at 16:42
  • The top three slab lines having to do with fuse might indicate it's to do with gluster. Do you know if this host is used in the gluster system as a storage node at all? http://www.gluster.org/community/documentation/index.php/Linux_Kernel_Tuning#vm.vfs_cache_pressure This link suggests servers can become incredibly dentry-heavy, especially when handling large numbers of small files. Check `ps -Ao args | grep glust` for any gluster storage/client instances and their arguments. – Matthew Ife Dec 14 '13 at 17:00
  • (Your sysadmin sounds *terrible*. It gives us a bad name) – ewwhite Dec 14 '13 at 17:12
  • For anyone that is still interested, the problem has been solved. Check out my answer below. Thanks everyone for taking the time to help me. – Wolfgang Stengel Jan 24 '14 at 19:00

5 Answers


Confirmed Solution

To anyone who might run into the same problem: the data center guys finally figured it out today. The culprit was an NSS (Network Security Services) library bundled with libcurl. An upgrade to the newest version solved the problem.

A bug report that describes the details is here:

https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1044666

Apparently, in order to determine whether some path is local or on a network drive, NSS was looking up a nonexistent file and measuring the time it took for the file system to report back! If you have a large enough number of curl requests and enough memory, these lookups are all cached and stack up.
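
The access pattern can be imitated on a test machine to see its effect on the dentry count. This is not NSS's code, only a sketch of "many lookups of unique nonexistent names"; the function names and the `no-such-file-` prefix are invented for illustration, and it assumes /proc/sys/fs/dentry-state is readable.

```python
import os
import tempfile
import uuid


def probe_missing_files(directory, count):
    """Call access() on `count` unique names that do not exist in `directory`.

    Returns how many probes reported the file as missing.
    """
    missing = 0
    for _ in range(count):
        path = os.path.join(directory, "no-such-file-" + uuid.uuid4().hex)
        if not os.access(path, os.F_OK):
            missing += 1
    return missing


def read_nr_dentry(path="/proc/sys/fs/dentry-state"):
    """First field of dentry-state: number of allocated dentries (Linux only)."""
    with open(path) as fh:
        return int(fh.read().split()[0])
```

Comparing `read_nr_dentry()` before and after something like `probe_missing_files(tempfile.gettempdir(), 100000)` should show whether failed lookups are being cached on a given kernel; how large the jump is will depend on the kernel and on other activity on the machine.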

Wolfgang Stengel
  • Note that `libnss` was fixed to not issue lots of `access()` calls with nonexistent filenames. However, I think any other program issuing lots of `access()` calls with nonexistent filenames may still cause the same problem if one uses affected kernel versions. – Mikko Rantalainen Dec 13 '20 at 13:44

Am I correct in thinking that the Slab memory is always physical RAM, and the number is already subtracted from the MemFree value?

Yes.

Is such a high number of dentry entries normal? The PHP application has access to around 1.5 M files, however most of them are archives and not being accessed at all for regular web traffic.

Yes, if the system isn't under memory pressure. It has to use the memory for something, and it's possible that in your particular pattern of usage, this is the best way to use that memory.

What could be an explanation for the fact that the number of cached inodes is much lower than the number of cached dentries, should they not be related somehow?

Lots of directory operations would be the most likely explanation.

If the system runs into memory trouble, should the kernel not free some of the dentries automatically? What could be a reason that this does not happen?

It should, and I can't think of any reason it wouldn't. I'm not convinced that this is what actually went wrong. I'd strongly suggest upgrading your kernel or increasing vfs_cache_pressure further.

Is there any way to "look into" the dentry cache to see what all this memory is (i.e. what are the paths that are being cached)? Perhaps this points to some kind of memory leak, symlink loop, or indeed to something the PHP application is doing wrong.

I don't believe there is. I'd look for any directories with absurdly large numbers of entries or very deep directory structures that are searched or traversed.

The PHP application code as well as all asset files are mounted via GlusterFS network file system, could that have something to do with it?

Definitely it could be a filesystem issue. A filesystem bug causing dentries not to be released, for example, is a possibility.

David Schwartz
  • Thank you for answering my questions individually. The cache pressure was finally increased further and the dentry cache increase stopped. – Wolfgang Stengel Jan 13 '14 at 12:40
  • I could not track down the responsible program yet. If I find out more, I'll report back for anyone else having this problem. – Wolfgang Stengel Jan 13 '14 at 12:44
  • Thanks! A big directory (0.25 million files) was totally the cause of the problem in my case; anytime something interacted with it, 2 GB of RAM would disappear into the cache. – Some Linux Nerd May 19 '15 at 21:03

I ran into this exact issue, and while Wolfgang is correct about the cause, there's some important detail missing.

  • This issue impacts SSL requests made with curl or libcurl, or any other software that happens to use Mozilla NSS for secure connections. Non-secure requests do not trigger the issue.

  • The problem does not require concurrent curl requests. Dentries will accumulate as long as curl calls are frequent enough to outpace the OS's efforts to reclaim RAM.

  • The newer version of NSS, 3.16.0, does include a fix for this. However, you don't get the fix for free by upgrading NSS, and you don't have to upgrade all of NSS. At a minimum, you only have to upgrade nss-softokn (which has a required dependency on nss-utils). To get the benefit, you also need to set the environment variable NSS_SDB_USE_CACHE for the process that is using libcurl. The presence of that environment variable is what allows the costly non-existent file checks to be skipped.
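
As an illustration of the last point, here is a minimal sketch of launching a child process with the variable set. The wrapper name `run_with_nss_cache` is hypothetical, and the value `YES` is the one quoted in the comments on this answer; as described above, it is the presence of the variable that matters.

```python
import os
import subprocess
import sys


def run_with_nss_cache(cmd, **kwargs):
    """Run `cmd` with NSS_SDB_USE_CACHE=YES added to a copy of the current environment."""
    env = dict(os.environ)
    env["NSS_SDB_USE_CACHE"] = "YES"
    return subprocess.run(cmd, env=env, **kwargs)


if __name__ == "__main__":
    # Show that the child process really sees the variable.
    child = [sys.executable, "-c", "import os; print(os.environ['NSS_SDB_USE_CACHE'])"]
    result = run_with_nss_cache(child, stdout=subprocess.PIPE, universal_newlines=True)
    print(result.stdout.strip())
```

For a long-running service such as PHP-FPM, the variable has to be present in the environment of the worker processes themselves; how that is arranged depends on how the service is started and configured.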

FWIW, I wrote a blog entry with a little more background/details, in case anyone needs it.

J. Paulding
  • Thanks for a nice blog post, but I would like to mention that nss-softokn has still not been updated to version 3.16 on CentOS/RHEL. It will probably be fixed in version 6.6. – Strahinja Kustudic Oct 04 '14 at 21:45
  • Thanks for the note. Perhaps Amazon got out ahead of this one (maybe even at our request?) for their managed repos. On older versions (3.14-3.15), you still get half the benefit by setting the appropriate environment variables. If you have the know-how, you might be able to build v3.16 directly. Otherwise, increasing the cache pressure and taking the associated CPU hit might be your best bet for reliable performance. – J. Paulding Oct 06 '14 at 15:56
  • This is fixed in CentOS 6.6 with nss-softokn-3.14.3-17 – Strahinja Kustudic Nov 06 '14 at 23:14
  • Just to be clear for people looking for a quick fix: you have to both update the `nss-softokn` RPM *AND* set the `NSS_SDB_USE_CACHE=YES` env var to have curl https calls stop flooding your dentry cache. – Steve Kehlet Jan 12 '15 at 17:48
  • Do you have the blog entry available somewhere? The URL linked in the answer doesn't seem to work and Wayback machine doesn't have a copy. – Mikko Rantalainen Dec 13 '20 at 13:46
  • Sorry, Mikko. That was on my old company's web site. The company was purchased by Knetik several years ago, and at one point, the blog was still available at https://tools.knetik.io/blog/2014-05-16-optimizing-aws-nss-softoken, but even that seems gone now ... However, Steve's note above is valid if you are on an older version of nss-softokn, and it is also asserted above that 3.17 resolves the issue, although I don't personally know if you would need to set the env variable with that version; you may have to check the release notes. – J. Paulding Dec 13 '20 at 18:06

See https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7/2.6.7-mm1/broken-out/vfs-shrinkage-tuning.patch

There are numbers there showing that you can expect some noticeable dentry memory reclaim when vfs_cache_pressure is set way higher than 100. So 125 may be too low for that to happen in your case.

poige
  • From all I have read, increasing `vfs_cache_pressure` above `100` only makes sense if you do not have enough RAM for your workload. In that case, having value way above 100 (e.g. 10000) will free some RAM. That will result in worse IO overall, though. – Mikko Rantalainen Jan 17 '18 at 11:15

Not really an answer to your question, but as a user of this system, the information you provided:

cat /proc/meminfo
MemTotal:       132145324 kB
...
SReclaimable:   44561644 kB
SUnreclaim:      1678736 kB

is enough to tell me that this is not your problem and that it's the responsibility of the sysadmin to provide an adequate explanation.

I don't want to sound too rude here, but:

  • You lack specific information on the role of this host.
  • How the host is supposed to prioritize resources is out of your scope.
  • You are not familiar with, and had no part in, the design and deployment of the storage on this host.
  • You are unable to offer certain system output as you are not root.

It is your sysadmin's responsibility to justify or resolve the slab allocation anomaly. Either you haven't given us a complete picture of the whole saga that led you up to this (which frankly I am not interested in) or your sysadmin is behaving irresponsibly and/or incompetently in the way he considers handling this problem.

Feel free to tell him some random stranger on the internet thinks he isn't taking his responsibilities seriously.

Matthew Ife