
After searching around and finding only posts from people who misinterpret the "cached" figure, I decided to ask this question.

I have some servers at hand that are acting strangely: their RAM usage is very high for no apparent reason. It looks as if an invisible process holds lots of "used" RAM (and I do mean "used", not cache).

Here's some info:

  • all servers run SLES 11
  • kernel is 3.0.76
  • all servers run as guests under a VMWare ESX infrastructure
  • I have not set up the servers and had no say in OS choice nor do I have access to the virtualization infrastructure
  • all servers are set up similarly and they do run the same set of software (it's a cluster and yeah, I know, virtualized cluster, yada yada, as said: I had and have no say in that)

And some shell output:

root@good-server:# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      14780       1173          0        737       8982
-/+ buffers/cache:       5059      10894
Swap:        31731          0      31731

root@good-server:# python ps_mem.py
[... all processes neatly listed ...]
---------------------------------
                          4.7 GiB
=================================

root@bad-server:# free -m
             total       used       free     shared    buffers     cached
Mem:         15953      15830        123          0        124       1335
-/+ buffers/cache:      14370       1583
Swap:        31731         15      31716

root@bad-server:# python ps_mem.py
[... all processes neatly listed ...]
---------------------------------
                          4.0 GiB
=================================

Contents of /proc/meminfo of the bad server

MemTotal:       16336860 kB
MemFree:          112356 kB
Buffers:          138384 kB
Cached:          1145208 kB
SwapCached:         1244 kB
Active:          4344336 kB
Inactive:        1028744 kB
Active(anon):    3706796 kB
Inactive(anon):   382724 kB
Active(file):     637540 kB
Inactive(file):   646020 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      32493560 kB
SwapFree:       32477728 kB
Dirty:              1248 kB
Writeback:             0 kB
AnonPages:       4087776 kB
Mapped:            60132 kB
Shmem:               156 kB
Slab:             274968 kB
SReclaimable:     225864 kB
SUnreclaim:        49104 kB
KernelStack:        4352 kB
PageTables:        16400 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    40661988 kB
Committed_AS:    6576912 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      311400 kB
VmallocChunk:   34359418748 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       73728 kB
DirectMap2M:    16703488 kB

Contents of /proc/meminfo of the good server

MemTotal:       16336860 kB
MemFree:         1182320 kB
Buffers:          756244 kB
Cached:          8695688 kB
SwapCached:            0 kB
Active:         13499680 kB
Inactive:         843208 kB
Active(anon):    4853460 kB
Inactive(anon):    37372 kB
Active(file):    8646220 kB
Inactive(file):   805836 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      32493560 kB
SwapFree:       32493560 kB
Dirty:              1268 kB
Writeback:             0 kB
AnonPages:       4890180 kB
Mapped:            84672 kB
Shmem:               252 kB
Slab:             586084 kB
SReclaimable:     503716 kB
SUnreclaim:        82368 kB
KernelStack:        5176 kB
PageTables:        19684 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    40661988 kB
Committed_AS:    6794180 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      311400 kB
VmallocChunk:   34359419468 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      112640 kB
DirectMap2M:    16664576 kB

TL;DR - if you compare these side by side, here are the main differences (BADserver - GOODserver, in MB):

MemFree       -1070 MB
Cached        -7550 MB
Active        -9155 MB
Active(anon)  -1147 MB
Active(file)  -8009 MB
AnonPages      -802 MB

The other differences are rather small and within the limits one might expect (but you can check for yourself above).
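
For reference, I built that table with a quick-and-dirty diff of the two /proc/meminfo dumps, roughly like this (a rough sketch; the two file names are just placeholders for saved copies of the output above):

# meminfo_diff.py - rough sketch of how the table above was built.
# 'meminfo.good' and 'meminfo.bad' are placeholders for saved copies of /proc/meminfo.

def read_meminfo(path):
    """Parse a saved /proc/meminfo dump into a {field: kB} dict."""
    values = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(':', 1)
            parts = rest.split()
            if parts and parts[-1] == 'kB':   # skip the bare hugepage counters
                values[key] = int(parts[0])
    return values

good = read_meminfo('meminfo.good')
bad = read_meminfo('meminfo.bad')

for key in sorted(set(good) & set(bad)):
    diff_mb = (bad[key] - good[key]) / 1000.0   # kB -> MB, rounded as in the table
    if abs(diff_mb) > 500:                      # only show the big movers
        print('%-14s %+6.0f MB' % (key, diff_mb))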

As you can see, on the good server the total RES and SHR memory of all processes is pretty much in line with the "used -/+ buffers/cache" value from free -m - which is what you'd expect, right?

Now look at the bad server: its "used -/+ buffers/cache" value from free -m is about three times as high as the sum of everything ps can show you.

This also matches what /proc/meminfo tells me.
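
For completeness, here is the little sanity check I run on each box - a minimal sketch that recomputes free's "used -/+ buffers/cache" figure from /proc/meminfo (on this SLES 11 box free seems to count reclaimable slab as cache as well) and compares it to the total that ps_mem.py prints; the ps_mem total is a placeholder you paste in by hand:

# used_check.py - recompute free's "used -/+ buffers/cache" from /proc/meminfo
# and compare it with the process total reported by ps_mem.py.

# Placeholder: paste the total that ps_mem.py prints at the bottom, in MiB.
PS_MEM_TOTAL_MB = 4.0 * 1024

meminfo = {}
with open('/proc/meminfo') as f:
    for line in f:
        key, rest = line.split(':', 1)
        meminfo[key] = int(rest.split()[0])  # values in kB (hugepage counters are bare numbers)

# Roughly how free(1) on this box seems to derive "used -/+ buffers/cache":
# everything that is neither free nor page cache / buffers / reclaimable slab.
used_kb = (meminfo['MemTotal'] - meminfo['MemFree'] - meminfo['Buffers']
           - meminfo['Cached'] - meminfo.get('SReclaimable', 0))

used_mb = used_kb / 1024.0
print('used -/+ buffers/cache   : %7.0f MiB' % used_mb)
print('sum of processes (ps_mem): %7.0f MiB' % PS_MEM_TOTAL_MB)
print('unaccounted for          : %7.0f MiB' % (used_mb - PS_MEM_TOTAL_MB))

On the good server this leaves almost nothing unaccounted for; on the bad server roughly 10 GB of "used" memory belongs to no process.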

So far I have no idea how that is even possible. What might be going on here?

– luxifer

1 Answer


I think you may have a VMware memory ballooning issue. There's a chance that memory overcommitment across the vSphere infrastructure is too high. You won't be able to remediate this without access to vSphere vCenter, but you should be able to detect it from within your virtual machines, assuming VMware Tools is installed:

Can you please post the output of vmware-toolbox-cmd stat balloon?
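
If you want to keep an eye on it over time, something along these lines works from inside the guest (a minimal sketch; it just shells out to vmware-toolbox-cmd, which ships with VMware Tools and is assumed to be on the PATH):

# balloon_check.py - print the amount of guest memory claimed by the balloon driver.
# Assumes VMware Tools is installed and vmware-toolbox-cmd is on the PATH.
import subprocess

proc = subprocess.Popen(['vmware-toolbox-cmd', 'stat', 'balloon'],
                        stdout=subprocess.PIPE)
out, _ = proc.communicate()
print('Ballooned memory: %s' % out.decode().strip())

A healthy guest should report "0 MB"; anything large means the hypervisor has reclaimed that much of your "used" memory.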

Also, you've been allocated 16 GB of RAM. Please ask whoever is in control of the infrastructure whether any manual RAM limits have been placed on the VMs in question.

– ewwhite
  • Having read how ballooning works on VMware Linux VMs, I think this is the cause. I'm pretty unimpressed that they don't offer a way from the VM side to account for the "used" pages, though. – Matthew Ife Feb 18 '15 at 11:43
  • This is indeed correct, I think... the good server shows "0 MB"; the bad server shows "10092 MB", which is pretty much in line with what we're seeing! – luxifer Feb 18 '15 at 11:46
  • @luxifer So now you guys have to [fix it](http://serverfault.com/a/528301/13325). Which either means removing an artificial RAM limit on the VM or vMotioning to another ESXi host. Ask your VMware infrastructure team to see if this is a [more widespread problem](http://serverfault.com/a/536899/13325). – ewwhite Feb 18 '15 at 11:49
  • @ewwhite I'll notify them for sure. However, it's the infrastructure of one of our customers and normally they should have identified this. Unfortunately, that's not how big, world-wide IT service providers seem to work ;) – luxifer Feb 18 '15 at 12:08
  • @luxifer Seriously, I find that this can happen in [all sorts of organizations](http://serverfault.com/questions/528254/vsphere-education-what-are-the-downsides-of-configuring-vms-with-too-much-ra), and the people tasked with managing the vSphere infrastructure don't seem to realize it. – ewwhite Feb 18 '15 at 12:13
  • @ewwhite Sure, but their monitoring has been sending alert mails with memory warnings for these machines for about 7 hours now, so they are aware of the issue... in theory. Anyway, thank you for helping me pin this down so I can prove the ball is in their court. Beyond that, there isn't much I can do, really. – luxifer Feb 18 '15 at 12:23