17

Here's my free and smem output:

danslimmon@bad-server:~$ free -m
             total       used       free     shared    buffers     cached
Mem:         30147      29928        218          6          4       3086
-/+ buffers/cache:      26837       3309
Swap:            0          0          0

danslimmon@bad-server:~$ smem -tw
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory      12857576    2887440    9970136
userspace memory           17661400    1272468   16388932
free memory                  351592     351592          0
----------------------------------------------------------
                           30870568    4511500   26359068

And here's the head of my top output, sorted by RSS:

top - 15:51:13 up 248 days, 14:20,  1 user,  load average: 14.43, 11.00, 8.95
Tasks: 510 total,   2 running, 508 sleeping,   0 stopped,   0 zombie
%Cpu(s): 30.6 us,  3.8 sy,  0.9 ni, 63.8 id,  0.1 wa,  0.0 hi,  0.5 si,  0.2 st
KiB Mem:  30870568 total, 30469188 used,   401380 free,     4364 buffers
KiB Swap:        0 total,        0 used,        0 free.  2994052 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 41801 cassand+  20   0 17.501g 7.845g 757184 S 428.9 26.6 181039:29 java
 73105 redacte+  20   0 8401132 6.181g   3684 S 251.1 21.0  11336:12 beam.smp
105293 nobody    20   0 3115584 2.060g   2284 S  97.2  7.0  28:38.51 statsd
  1743 opscent+  20   0 3347672 172816   1460 S   0.0  0.6 383:20.64 java
 73294 redacte+  30  10   70056  58952    988 S   0.3  0.2  13:39.12 consumer:00237
 73279 redacte+  30  10   68052  56916   1008 S   1.0  0.2  47:04.69 consumer:00226
 73281 redacte+  30  10   67552  56464   1012 S   1.7  0.2  61:14.90 consumer:00230
 73304 redacte+  30  10   65512  54404    984 S   0.7  0.2  37:46.67 consumer:00210
 73305 redacte+  30  10   64640  53576    988 S   1.7  0.2  73:32.57 consumer:00228
 73278 redacte+  30  10   64540  53504   1024 S   1.3  0.2  32:16.44 consumer:00212
 73308 redacte+  30  10   64452  53392   1056 S   0.7  0.2  34:27.21 consumer:00220
 73287 redacte+  30  10   64128  53016   1004 S   1.3  0.2  70:54.29 consumer:00218
 73300 redacte+  30  10   64024  52828    984 S   0.7  0.2  33:05.53 consumer:00207
 73299 redacte+  30  10   63744  52680    984 S   1.3  0.2  40:28.91 consumer:00209
 73302 redacte+  30  10   62840  51812   1028 S   1.3  0.2  45:07.17 consumer:00210
 73288 redacte+  30  10   62268  51240   1068 S   1.0  0.2  46:46.53 consumer:00209
 73297 redacte+  30  10   62988  50924    976 S   0.7  0.2  34:03.82 consumer:00203
 73296 redacte+  30  10   62024  50912    984 S   0.7  0.2  41:08.47 consumer:00205
 73280 redacte+  30  10   61748  50588    956 S   0.7  0.2  35:50.30 consumer:00203
 73303 redacte+  30  10   60632  49564    976 S   1.3  0.2  56:31.81 consumer:00182
 73290 redacte+  30  10   60512  49440    980 S   0.7  0.2  41:46.82 consumer:00195
 73283 redacte+  30  10   60444  49356    992 S   1.3  0.2  52:56.75 consumer:00196
 73289 redacte+  30  10   60328  49196    944 S   0.7  0.2  20:24.01 consumer:00189
 73291 redacte+  30  10   60164  49004   1000 S   1.3  0.2  62:30.71 consumer:00202
 73282 redacte+  30  10   59960  48876    980 S   0.7  0.2  34:53.59 consumer:00191
 73293 redacte+  30  10   59684  48512    972 S   0.7  0.2  33:04.45 consumer:00204
 73277 redacte+  30  10   58736  47628   1000 S   1.3  0.2  40:08.13 consumer:00183
 73285 redacte+  30  10   58552  47388   1012 S   0.7  0.2  35:10.61 consumer:00190
 73292 redacte+  30  10   57676  46476    980 S   0.3  0.2  22:59.14 consumer:00185
 73306 redacte+  30  10   55792  44716    988 S   1.0  0.1  21:42.18 consumer:00200
 73301 redacte+  30  10   55744  44696   1012 S   0.0  0.1  11:05.04 consumer:00194
 73298 redacte+  30  10   55128  43972    988 S   0.3  0.1  38:29.62 consumer:00187
 73286 redacte+  30  10   55024  43904    980 S   1.0  0.1  31:53.58 consumer:00170
 73295 redacte+  30  10   53276  42156   1008 S   0.3  0.1  18:50.26 consumer:00172
 73307 redacte+  30  10   52960  41884   1004 S   0.7  0.1  25:10.44 consumer:00169
 73284 redacte+  30  10   52492  41464   1024 S   0.3  0.1  25:27.32 consumer:00167
 98875 root      20   0 1034604  18088   1248 S   0.3  0.1  80:35.73 log-courier
 25696 root      20   0  779288  12232   1144 S   0.0  0.0   1304:55 collectd
  2073 root      20   0   60840  12092   1504 S   0.0  0.0  61:52.72 supervisord
  1255 root      20   0   51436   9844   1032 S   0.0  0.0  16:55.76 munin-node
 87724 root       0 -20   20936   8664   3508 S   0.0  0.0   0:05.79 atop
  2149 nobody    20   0   45352   7424   1624 S   0.0  0.0  13:49.30 consumer_probe
 16973 www-data  20   0  139148   5896   1536 S   0.0  0.0 874:07.04 nginx
 16974 www-data  20   0  139104   5880   1544 S   0.7  0.0 869:08.26 nginx
 16975 www-data  20   0  139148   5880   1532 R   8.6  0.0 880:08.30 nginx
 16972 www-data  20   0  139152   5756   1532 S   0.0  0.0 869:41.00 nginx
  1561 ds-agent  20   0   22336   5628   1004 S   0.0  0.0  87:27.22 datastax_agent_
 90639 syslog    20   0  354552   4364    676 S   0.0  0.0  24:13.76 rsyslogd
  9887 root      20   0  135816   4300   1296 S   0.0  0.0   0:00.01 nginx
101932 danslim+  20   0   21332   3800   1752 S   0.0  0.0   0:00.08 bash
101802 root      20   0  105632   3568   2580 S   0.0  0.0   0:00.01 sshd
  2065 snmp      20   0   45580   3540    764 S   0.0  0.0 171:34.48 snmpd
130366 cassand+  20   0   21120   3140   1304 S   0.0  0.0   0:00.03 bash
130349 danslim+  20   0   21224   3020   1140 S   0.0  0.0   0:00.02 bash
  1087 root      20   0   10224   2884    600 S   0.0  0.0   0:08.15 dhclient
     1 root      20   0   33648   2216    676 S   0.0  0.0   0:24.25 init
111772 danslim+  20   0   24072   2080   1156 R   0.3  0.0   0:00.11 top
101931 danslim+  20   0  105780   1756    752 S   0.3  0.0   0:00.16 sshd
 79834 postfix   20   0   40468   1716    852 S   0.0  0.0   0:32.20 tlsmgr
  3778 ntp       20   0   31444   1644   1044 S   0.0  0.0  14:21.01 ntpd
 85449 root      20   0   59640   1568   1112 S   0.0  0.0   0:00.00 cron
 85592 root      20   0   59640   1568   1112 S   0.0  0.0   0:00.00 cron
 85740 root      20   0   59640   1568   1112 S   0.0  0.0   0:00.00 cron
 85888 root      20   0   59640   1568   1112 S   0.0  0.0   0:00.00 cron
 86041 root      20   0   59640   1568   1112 S   0.0  0.0   0:00.00 cron
 87484 root      20   0   59640   1568   1112 S   0.0  0.0   0:00.00 cron
  1481 root      20   0   61364   1556    876 S   0.0  0.0   1:11.70 sshd
130365 root      20   0   65988   1468    840 S   0.0  0.0   0:00.01 sudo
   869 root      20   0   49724   1320    616 S   0.0  0.0   0:00.04 systemd-udevd
 79827 postfix   20   0   27624   1264    772 S   0.0  0.0   0:22.47 qmgr

As you can see, there's a ~10GB gap between the sum of the values in the RSS column (~18 GB) and the Used-minus-Cached value according to free. Cached memory is consequently super constrained. It's worth noting that the java process there is Cassandra and it's super heavily threaded.
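
For reference, the sum-of-RSS figure can be reproduced without top. The following is a minimal Python sketch, assuming procfs is mounted at /proc; the function names `vmrss_kb` and `total_rss_kb` are illustrative, not part of any tool mentioned here. Note that RSS counts shared pages once per process, so the sum can overstate real usage.

```python
import os
import re


def vmrss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status.

    Kernel threads have no VmRSS line, so None is returned for them.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def total_rss_kb(proc_root="/proc"):
    """Sum VmRSS over every numeric directory (one per PID) under proc_root."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo" or "self"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                rss = vmrss_kb(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if rss is not None:
            total += rss
    return total
```

Calling `total_rss_kb()` on the affected host and comparing the result with the "-/+ buffers/cache" used figure from free gives the gap described above.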

I've read this answer, which says that the most common reason for free to report more used memory than top is that top doesn't include Shared memory in its RSS column. That makes sense, and perhaps it could explain this difference in reports. But I'm left wondering what tools will help me figure out which process is eating up all the memory and fix that problem.

It's definitely a problem, because I have another server that's supposed to be doing roughly identical work, and on that server there is a much smaller gap (~ 400MB compared to ~10GB) between sum-of-RSS-values and Used-minus-Cached, and Cached memory on that server is correspondingly much less constricted.

How can I figure out what's eating up all the memory?

System details

uname:

danslimmon@bad-server:~$ uname -a
Linux bad-server 3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

meminfo:

danslimmon@bad-server:~$ cat /proc/meminfo
MemTotal:       30870568 kB
MemFree:          231428 kB
Buffers:            6048 kB
Cached:          3151268 kB
SwapCached:            0 kB
Active:         19606476 kB
Inactive:        1509568 kB
Active(anon):   17969604 kB
Inactive(anon):      576 kB
Active(file):    1636872 kB
Inactive(file):  1508992 kB
Unevictable:        8656 kB
Mlocked:            8656 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              3512 kB
Writeback:             0 kB
AnonPages:      17968288 kB
Mapped:           838648 kB
Shmem:              7116 kB
Slab:             195856 kB
SReclaimable:     113120 kB
SUnreclaim:        82736 kB
KernelStack:        9112 kB
PageTables:        55016 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    15435284 kB
Committed_AS:   17220572 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       71496 kB
VmallocChunk:   34359632104 kB
HardwareCorrupted:     0 kB
AnonHugePages:  10680320 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       59392 kB
DirectMap2M:    31528960 kB

lsmod:

danslimmon@bad-server:~$ lsmod
Module                  Size  Used by
tcp_diag               12591  0 
inet_diag              18543  1 tcp_diag
dm_crypt               23177  0 
syscopyarea            12529  0 
sysfillrect            12701  0 
sysimgblt              12640  0 
fb_sys_fops            12703  0 
serio_raw              13462  0 
isofs                  39837  0 
raid10                 48128  0 
raid456                86484  0 
async_memcpy           12762  1 raid456
async_raid6_recov      12984  1 raid456
async_pq               13365  1 raid456
async_xor              13160  2 async_pq,raid456
async_tx               13509  5 async_pq,raid456,async_xor,async_memcpy,async_raid6_recov
xor                    21411  1 async_xor
raid6_pq               97812  2 async_pq,async_raid6_recov
raid0                  17842  0 
multipath              13145  0 
linear                 12894  0 
raid1                  35530  0 
crct10dif_pclmul       14289  0 
crc32_pclmul           13113  0 
ghash_clmulni_intel    13216  0 
aesni_intel            55624  0 
aes_x86_64             17131  1 aesni_intel
lrw                    13286  1 aesni_intel
gf128mul               14951  1 lrw
glue_helper            13990  1 aesni_intel
ablk_helper            13597  1 aesni_intel
cryptd                 20359  3 ghash_clmulni_intel,aesni_intel,ablk_helper
psmouse               106714  0 
floppy                 69418  0 
ixgbevf                50771  0

slabinfo:

danslimmon@bad-server:~$ sudo cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
isofs_inode_cache      0      0    632   51    8 : tunables    0    0    0 : slabdata      0      0      0
UDPLITEv6              0      0   1088   30    8 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                480    480   1088   30    8 : tunables    0    0    0 : slabdata     16     16      0
tw_sock_TCPv6       1024   1024    256   64    4 : tunables    0    0    0 : slabdata     16     16      0
TCPv6                256    256   1984   16    8 : tunables    0    0    0 : slabdata     16     16      0
kcopyd_job             0      0   3312    9    8 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
dm_rq_target_io        0      0    424   38    4 : tunables    0    0    0 : slabdata      0      0      0
cfq_queue              0      0    232   70    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                0      0    312   52    4 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
fuse_request           0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode             0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_key_record_cache      0      0    576   56    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_inode_cache      0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    712   46    8 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache    864    864    600   54    8 : tunables    0    0    0 : slabdata     16     16      0
jbd2_journal_handle   1360   1360     48   85    1 : tunables    0    0    0 : slabdata     16     16      0
jbd2_journal_head   6480   6480    112   36    1 : tunables    0    0    0 : slabdata    180    180      0
jbd2_revoke_table_s    512    512     16  256    1 : tunables    0    0    0 : slabdata      2      2      0
jbd2_revoke_record_s  16128  16768     32  128    1 : tunables    0    0    0 : slabdata    131    131      0
ext4_inode_cache    7115  18546    984   33    8 : tunables    0    0    0 : slabdata    562    562      0
ext4_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data      9408   9408     64   64    1 : tunables    0    0    0 : slabdata    147    147      0
ext4_allocation_context   6780   6780    136   60    2 : tunables    0    0    0 : slabdata    113    113      0
ext4_io_end         8848   8848     72   56    1 : tunables    0    0    0 : slabdata    158    158      0
ext4_extent_status  15380  32232     40  102    1 : tunables    0    0    0 : slabdata    316    316      0
dquot               1024   1024    256   64    4 : tunables    0    0    0 : slabdata     16     16      0
pid_namespace          0      0   2192   14    8 : tunables    0    0    0 : slabdata      0      0      0
user_namespace         0      0    264   62    4 : tunables    0    0    0 : slabdata      0      0      0
posix_timers_cache      0      0    248   66    4 : tunables    0    0    0 : slabdata      0      0      0
UDP-Lite               0      0    896   36    8 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache         0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
ip_fib_trie          146    146     56   73    1 : tunables    0    0    0 : slabdata      2      2      0
UDP                  576    576    896   36    8 : tunables    0    0    0 : slabdata     16     16      0
tw_sock_TCP        32835  47680    256   64    4 : tunables    0    0    0 : slabdata    745    745      0
TCP                 1134   1134   1792   18    8 : tunables    0    0    0 : slabdata     63     63      0
blkdev_queue          84     84   2264   14    8 : tunables    0    0    0 : slabdata      6      6      0
blkdev_requests     2007   2142    384   42    4 : tunables    0    0    0 : slabdata     51     51      0
blkdev_ioc           780    780    104   39    1 : tunables    0    0    0 : slabdata     20     20      0
fsnotify_event      1156   1156    120   68    2 : tunables    0    0    0 : slabdata     17     17      0
sock_inode_cache    4991   5661    640   51    8 : tunables    0    0    0 : slabdata    111    111      0
shmem_inode_cache   1392   1392    672   48    8 : tunables    0    0    0 : slabdata     29     29      0
Acpi-ParseExt      15546  15624     72   56    1 : tunables    0    0    0 : slabdata    279    279      0
Acpi-State           306    306     80   51    1 : tunables    0    0    0 : slabdata      6      6      0
Acpi-Namespace      4182   4182     40  102    1 : tunables    0    0    0 : slabdata     41     41      0
taskstats            784    784    328   49    4 : tunables    0    0    0 : slabdata     16     16      0
proc_inode_cache    8535  10250    648   50    8 : tunables    0    0    0 : slabdata    205    205      0
sigqueue             816    816    160   51    2 : tunables    0    0    0 : slabdata     16     16      0
bdev_cache           468    468    832   39    8 : tunables    0    0    0 : slabdata     12     12      0
sysfs_dir_cache    27878  28584    112   36    1 : tunables    0    0    0 : slabdata    794    794      0
mnt_cache            357    357    320   51    4 : tunables    0    0    0 : slabdata      7      7      0
inode_cache        12096  12096    584   56    8 : tunables    0    0    0 : slabdata    216    216      0
dentry             30931  45864    192   42    2 : tunables    0    0    0 : slabdata   1092   1092      0
iint_cache             0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
buffer_head       519830 530088    104   39    1 : tunables    0    0    0 : slabdata  13592  13592      0
mm_struct           3153   3276    896   36    8 : tunables    0    0    0 : slabdata     91     91      0
files_cache          969    969    640   51    8 : tunables    0    0    0 : slabdata     19     19      0
signal_cache        2340   2340   1088   30    8 : tunables    0    0    0 : slabdata     78     78      0
sighand_cache       1230   1230   2112   15    8 : tunables    0    0    0 : slabdata     82     82      0
task_xstate         4212   4212    832   39    8 : tunables    0    0    0 : slabdata    108    108      0
task_struct         1227   1270   6144    5    8 : tunables    0    0    0 : slabdata    254    254      0
anon_vma            9665  12544     64   64    1 : tunables    0    0    0 : slabdata    196    196      0
shared_policy_node  16716  20995     48   85    1 : tunables    0    0    0 : slabdata    247    247      0
numa_policy          170    170     24  170    1 : tunables    0    0    0 : slabdata      1      1      0
radix_tree_node    43100  64581    568   57    8 : tunables    0    0    0 : slabdata   1133   1133      0
idr_layer_cache      390    390   2112   15    8 : tunables    0    0    0 : slabdata     26     26      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   64    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   64    4 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   64    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192         372    372   8192    4    8 : tunables    0    0    0 : slabdata     93     93      0
kmalloc-4096         295    328   4096    8    8 : tunables    0    0    0 : slabdata     41     41      0
kmalloc-2048         553    592   2048   16    8 : tunables    0    0    0 : slabdata     37     37      0
kmalloc-1024        2445   2496   1024   32    8 : tunables    0    0    0 : slabdata     78     78      0
kmalloc-512        25536  25984    512   64    8 : tunables    0    0    0 : slabdata    406    406      0
kmalloc-256        11307  12864    256   64    4 : tunables    0    0    0 : slabdata    201    201      0
kmalloc-192        15569  20202    192   42    2 : tunables    0    0    0 : slabdata    481    481      0
kmalloc-128        15904  25216    128   64    2 : tunables    0    0    0 : slabdata    394    394      0
kmalloc-96          9618   9618     96   42    1 : tunables    0    0    0 : slabdata    229    229      0
kmalloc-64         22131  49536     64   64    1 : tunables    0    0    0 : slabdata    774    774      0
kmalloc-32         24966  29824     32  128    1 : tunables    0    0    0 : slabdata    233    233      0
kmalloc-16         54975  57344     16  256    1 : tunables    0    0    0 : slabdata    224    224      0
kmalloc-8          10752  10752      8  512    1 : tunables    0    0    0 : slabdata     21     21      0
kmem_cache_node      256    256     64   64    1 : tunables    0    0    0 : slabdata      4      4      0
kmem_cache           256    256    256   64    4 : tunables    0    0    0 : slabdata      4      4      0

slabtop:

danslimmon@bad-server:~$ sudo slabtop -o
 Active / Total Objects (% used)    : 939965 / 1147485 (81.9%)
 Active / Total Slabs (% used)      : 23919 / 23919 (100.0%)
 Active / Total Caches (% used)     : 63 / 96 (65.6%)
 Active / Total Size (% used)       : 174670.15K / 219106.31K (79.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.19K / 8.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
507702 423501  83%    0.10K  13018       39     52072K buffer_head
 64581  46172  71%    0.55K   1133       57     36256K radix_tree_node
 58624  58624 100%    0.02K    229      256       916K kmalloc-16
 49152  22498  45%    0.06K    768       64      3072K kmalloc-64
 46368  32154  69%    0.19K   1104       42      8832K dentry
 44736  38006  84%    0.25K    699       64     11184K tw_sock_TCP
 30090  18932  62%    0.04K    295      102      1180K ext4_extent_status
 29824  22268  74%    0.03K    233      128       932K kmalloc-32
 29448  28805  97%    0.11K    818       36      3272K sysfs_dir_cache
 25664  25664 100%    0.50K    401       64     12832K kmalloc-512
 25216  15466  61%    0.12K    394       64      3152K kmalloc-128
 20995  18058  86%    0.05K    247       85       988K shared_policy_node
 20160  15866  78%    0.19K    480       42      3840K kmalloc-192
 20031   6286  31%    0.96K    607       33     19424K ext4_inode_cache
 16768  16256  96%    0.03K    131      128       524K jbd2_revoke_record_s
 15736  15736 100%    0.07K    281       56      1124K Acpi-ParseExt
 12864  10179  79%    0.25K    201       64      3216K kmalloc-256
 12224  11706  95%    0.06K    191       64       764K anon_vma
 12096  12096 100%    0.57K    216       56      6912K inode_cache
 10752  10752 100%    0.01K     21      512        84K kmalloc-8

dmesg: Gist

Here are free output and /proc/meminfo on a server with the same job but without this missing-allocation symptom:

danslimmon@good-server:~$ free -m
             total       used       free     shared    buffers     cached
Mem:         30148      26946       3201          2        156       9907
-/+ buffers/cache:      16882      13265
Swap:            0          0          0
danslimmon@good-server:~$ cat /proc/meminfo
MemTotal:       30871560 kB
MemFree:         3239620 kB
Buffers:          160000 kB
Cached:         10157876 kB
SwapCached:            0 kB
Active:         16048988 kB
Inactive:        4190252 kB
Active(anon):    9944028 kB
Inactive(anon):      512 kB
Active(file):    6104960 kB
Inactive(file):  4189740 kB
Unevictable:     6737636 kB
Mlocked:         6737636 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:             13548 kB
Writeback:             0 kB
AnonPages:      16660296 kB
Mapped:          1840700 kB
Shmem:              2752 kB
Slab:             380296 kB
SReclaimable:     295224 kB
SUnreclaim:        85072 kB
KernelStack:       12232 kB
PageTables:        48640 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    15435780 kB
Committed_AS:   16161132 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       71464 kB
VmallocChunk:   34359649788 kB
HardwareCorrupted:     0 kB
AnonHugePages:  14536704 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       49152 kB
DirectMap2M:    31539200 kB

danslimmon@good-server:~$ smem -tw
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory       8708520    8351840     356680
userspace memory           18329120    1743636   16585484
free memory                 3833920    3833920          0
----------------------------------------------------------
                           30871560   13929396   16942164
danslimmon
  • 5
    We can see from meminfo that something has allocated ~ 10GB of anonymous huge pages. – Michael Hampton Sep 06 '16 at 20:43
  • Thanks, but I don't think that's it. I've added to my post an example of `/proc/meminfo` on a server that's not exhibiting this behavior of missing memory consumption, and it has even _more_ memory allocated to anonymous huge pages. – danslimmon Sep 07 '16 at 15:07
  • Can you show the output of `smem -tw` ? – shodanshok Sep 09 '16 at 08:44
  • @shodanshok I've run it and added the output to the post body. Thanks. – danslimmon Sep 09 '16 at 20:54
  • Is this a VMware virtual machine? – ewwhite Sep 09 '16 at 21:07
  • 1
    @ewwhite Amazon EC2, so Xen – danslimmon Sep 09 '16 at 21:15
  • Can you also add the output of `slabtop` and `dmesg` (maybe for the latter it is better to use an external site as paste.bin) – shodanshok Sep 09 '16 at 21:38
  • 1
    You may also want to check (if it exists) `/proc/xen/balloon` for your unaccounted for memory. – Matthew Ife Sep 10 '16 at 22:59
  • @shodanshok I have added them to the body – danslimmon Sep 11 '16 at 17:11
  • 1
    @MatthewIfe That file doesn't exist on my system. – danslimmon Sep 11 '16 at 17:12
  • From `dmesg`, it appears you have ballooning enabled. Can you show the files (and their content) under `/sys/devices/system/xen_memory/xen_memory0/selfballoon`? If you do not have that dir, can you try searching for it issuing `find /sys -iname "*xen*"`? – shodanshok Sep 11 '16 at 18:20
  • Hygiene question: have you compared kernel version and modules between the system exhibiting this problem and the one that isn't? – Jonah Benton Sep 12 '16 at 01:02
  • Another hygiene question: what does top report for the cassandra jvm on the non-impacted system? – Jonah Benton Sep 12 '16 at 01:54
  • `up 248 days` out of curiosity when was the last time you rebooted, and is this memory still unaccounted for after a reboot? Here is an answer about some common memory leak sources http://serverfault.com/questions/257759/something-eats-all-memory-i-suspect-memory-leak-on-some-app-how-to-detect-wha – Matt Sep 12 '16 at 21:15
  • Note that `top` from `procps` had a bug which was hiding some processes, see https://bugzilla.opensuse.org/show_bug.cgi?id=938207 – rudimeier Sep 16 '16 at 07:15

4 Answers

13

Your smem -tw output shows that your kernel is consuming over 9 GB of dynamic memory:

danslimmon@bad-server:~$ smem -tw
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory      12857576    2887440    9970136
userspace memory           17661400    1272468   16388932
free memory                  351592     351592          0
----------------------------------------------------------
                           30870568    4511500   26359068

So some kernel module is consuming a lot of memory. Prime candidates are closed-source blobs, such as the NVIDIA kernel driver.
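
The same conclusion can be cross-checked from /proc/meminfo. Below is a rough Python sketch that subtracts the fields which usually explain most RAM usage from MemTotal and reports what is left. The choice of fields is an assumption (a simplification, not an exact kernel accounting), and the helper names are made up for illustration; the numbers are copied from the two meminfo dumps in the question.

```python
# Values copied from the question's /proc/meminfo output (kB).
BAD_SERVER = """\
MemTotal:       30870568 kB
MemFree:          231428 kB
Buffers:            6048 kB
Cached:          3151268 kB
AnonPages:      17968288 kB
Slab:             195856 kB
KernelStack:        9112 kB
PageTables:        55016 kB
"""

GOOD_SERVER = """\
MemTotal:       30871560 kB
MemFree:         3239620 kB
Buffers:          160000 kB
Cached:         10157876 kB
AnonPages:      16660296 kB
Slab:             380296 kB
KernelStack:       12232 kB
PageTables:        48640 kB
"""

# Assumption: these fields cover most memory that meminfo attributes to a use.
ACCOUNTED_FIELDS = (
    "MemFree", "Buffers", "Cached", "AnonPages",
    "Slab", "KernelStack", "PageTables",
)


def parse_meminfo(text):
    """Parse meminfo-style text into a {field name: integer value} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def unaccounted_kb(text):
    """MemTotal minus the accounted fields, in kB."""
    info = parse_meminfo(text)
    return info["MemTotal"] - sum(info[name] for name in ACCOUNTED_FIELDS)


for label, text in (("bad-server", BAD_SERVER), ("good-server", GOOD_SERVER)):
    print(f"{label}: {unaccounted_kb(text)} kB not covered by the listed fields")
```

With the posted numbers this leaves 9253552 kB (about 8.8 GiB) unexplained on bad-server versus 212600 kB on good-server, which is consistent with memory being held somewhere that neither the process list nor the slab counters show.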

Can you post the output of lsmod and cat /proc/slabinfo ?
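
When reading /proc/slabinfo, the memory held by each cache can be estimated as num_slabs × pagesperslab × page size. A small Python sketch of that calculation follows; it assumes 4 kB pages (typical for x86_64) and the slabinfo 2.1 column layout, and the function name is illustrative. The sample rows are taken from the slabinfo output in the question.

```python
SAMPLE_SLABINFO = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
buffer_head       519830 530088    104   39    1 : tunables    0    0    0 : slabdata  13592  13592      0
radix_tree_node    43100  64581    568   57    8 : tunables    0    0    0 : slabdata   1133   1133      0
dentry             30931  45864    192   42    2 : tunables    0    0    0 : slabdata   1092   1092      0
"""


def slab_footprint_kb(slabinfo_text, page_kb=4):
    """Return {cache name: kB of pages held}, assuming page_kb-sized pages."""
    footprint = {}
    for line in slabinfo_text.splitlines():
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue  # skip the version line and the column header
        stats, _tunables, slabdata = line.split(":")
        name, _active, _total, _objsize, _objperslab, pages_per_slab = stats.split()
        # slabdata columns: "slabdata" <active_slabs> <num_slabs> <sharedavail>
        num_slabs = int(slabdata.split()[2])
        footprint[name] = num_slabs * int(pages_per_slab) * page_kb
    return footprint


sizes = slab_footprint_kb(SAMPLE_SLABINFO)
for name, kb in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {kb:>8d} kB")
```

If the per-cache totals add up to only a few hundred MB, as the Slab line in the posted meminfo (195856 kB) suggests, then the missing memory is not sitting in slab caches.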

shodanshok
1

You're apparently aware of the shared memory issue when interpreting RSS figures, so I won't expand on that.

Linux systems vary widely (think of embedded systems), but I think the list of processes you are getting from top looks incomplete, with few system processes. You'll notice you have an init process with PID=1, and nothing else up to PID=869. I think you'll find that many of those PIDs are associated with live kernel processes. It's possible that something in there is using a lot of RAM. @shodanshok's answer certainly points that way. You'd see those process IDs in /proc if you have a procfs mounted.

I'm wondering if you have an in-memory file system with substantial content (e.g. /tmp). What does df -h look like?

Also, /proc/$PID/maps lists the regions mapped into a process's virtual address space (not physical memory), though interpreting it is far from trivial.
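
As a starting point for reading that file, here is a minimal Python sketch that splits one line of /proc/$PID/maps into its fields and computes the size of the region. The function name and the two sample lines are made up for illustration. The sizes are virtual address ranges, not resident memory; /proc/$PID/smaps carries per-region Rss figures.

```python
def parse_maps_line(line):
    """Parse one /proc/<pid>/maps line: 'start-end perms offset dev inode [path]'."""
    parts = line.split(None, 5)
    addr_range, perms = parts[0], parts[1]
    # The pathname column is empty for anonymous mappings.
    path = parts[5].strip() if len(parts) > 5 else ""
    start, end = (int(value, 16) for value in addr_range.split("-"))
    return {
        "start": start,
        "end": end,
        "size_kb": (end - start) // 1024,
        "perms": perms,
        "path": path,
    }


# Hypothetical sample lines in the documented maps format.
file_backed = parse_maps_line("00400000-0040b000 r-xp 00000000 08:01 1048602 /bin/cat")
anonymous = parse_maps_line("7f2c4a000000-7f2c4a021000 rw-p 00000000 00:00 0")

print(file_backed["size_kb"], file_backed["path"])
print(anonymous["size_kb"], repr(anonymous["path"]))
```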

mc0e
0

If you are using Apache Cassandra, could you specify the version being used?

Datadog provides a pretty good monitoring tool for Cassandra. (Source: https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/ )

Some other checklists for Cassandra are:

John Greene
-1

Oracle recommends disabling transparent huge pages. Once you disable them, you will see that AnonHugePages is 0. There is a good answer at: disable transparent hugepages
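
Before changing anything, the current setting can be checked. A minimal Python sketch, assuming the mainline sysfs path /sys/kernel/mm/transparent_hugepage/enabled (some distribution kernels use a different path) and illustrative function names; the file lists all modes and marks the active one with square brackets.

```python
import re

# Assumption: mainline kernel sysfs location for the THP switch.
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(enabled_text):
    """Return the bracketed mode from text like 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", enabled_text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED_PATH):
    """Read the active THP mode, or None if the file is missing or unreadable."""
    try:
        with open(path) as fh:
            return active_thp_mode(fh.read())
    except OSError:
        return None  # THP not available, or a different sysfs layout


print(read_thp_mode())
```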

Mikhail Khirgiy
  • 1
    Thanks, but I don't think that's it. I've added to my post an example of `/proc/meminfo` on a server that's not exhibiting this behavior of missing memory consumption, and it has even _more_ memory allocated to anonymous huge pages. – danslimmon Sep 07 '16 at 15:07
  • Hmm.. Then try to calculate memory via command `ps aux` – Mikhail Khirgiy Sep 07 '16 at 16:22
  • 2
    Lots of ppl tend to do dumb things. If you want to get rid of unintended use of THPs, you'd better use `madvise` setting instead of `disable`. Of course people tend to blindly copy examples w/o getting a single thought about it. – poige Sep 11 '16 at 17:42