
Currently I'm running an 8-server Ceph setup consisting of 3 Ceph monitors and 5 Ceph nodes. Performance-wise the cluster runs great, but over time the nodes start swapping the ceph-osd processes to disk. When this happens I experience very poor performance, and the node that is swapping is sometimes even seen as down by the cluster. Running swapoff -a followed by swapon -a temporarily fixes the issue, but in time it returns.

As I understand it, it is normal for Ceph to run high on memory due to caching and such, but that memory is expected to be released rather than pushing the node into swap.

We tried the following:

  • Doubled the memory; it just takes longer for the problem to appear
  • Updated the kernel; no result
  • Looked at various settings within Ceph; didn't find a solution there
  • Set swappiness to 1; no result, it just takes longer for the problem to appear (see the sketch after this list)
  • Searched for bugs; all bugs found were for old versions of Ceph
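
For reference, the swappiness change was applied roughly like this; a minimal sketch assuming the usual sysctl mechanism (the exact file name under /etc/sysctl.d/ is just an example):

# apply immediately
sysctl -w vm.swappiness=1
# persist across reboots
echo "vm.swappiness = 1" > /etc/sysctl.d/99-swappiness.conf
sysctl --system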

Does anyone have an idea why this occurs and how to mitigate it?

As our configuration stands, each server has the following specification:

Operating System: CentOS 7
Memory: 32GB
OSDs: 6x 900 GB
Ceph version: 13.2.5 Mimic
Swappiness set to 1

Current memory when swapping occurs:

# free -m
              total        used        free      shared  buff/cache   available
Mem:          31960       19270         747         574       11943       11634
Swap:          2931        1500        1431

Swap dump:

PID=9 - Swap used: 0 - (rcu_bh )
PID=11077 - Swap used: 4 - (snmpd )
PID=9518 - Swap used: 4 - (master )
PID=7429 - Swap used: 8 - (systemd-logind )
PID=7431 - Swap used: 8 - (irqbalance )
PID=7465 - Swap used: 16 - (chronyd )
PID=7702 - Swap used: 20 - (NetworkManager )
PID=7469 - Swap used: 24 - (crond )
PID=7421 - Swap used: 132 - (dbus-daemon )
PID=1 - Swap used: 140 - (systemd )
PID=3616 - Swap used: 216 - (systemd-udevd )
PID=251189 - Swap used: 252 - (ceph-mds )
PID=7412 - Swap used: 376 - (polkitd )
PID=7485 - Swap used: 412 - (firewalld )
PID=9035 - Swap used: 524 - (tuned )
PID=3604 - Swap used: 1608 - (lvmetad )
PID=251277 - Swap used: 18404 - (ceph-osd )
PID=3580 - Swap used: 31904 - (systemd-journal )
PID=9042 - Swap used: 91528 - (rsyslogd )
PID=251282 - Swap used: 170788 - (ceph-osd )
PID=251279 - Swap used: 188400 - (ceph-osd )
PID=251270 - Swap used: 273096 - (ceph-osd )
PID=251275 - Swap used: 284572 - (ceph-osd )
PID=251273 - Swap used: 333288 - (ceph-osd )
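
(For reference, per-process swap usage like the listing above can be collected by reading the VmSwap field from /proc/<pid>/status; a minimal sketch, not necessarily the exact script used:)

#!/bin/sh
# List per-process swap usage (kB), sorted ascending, from /proc/<pid>/status.
# Kernel threads have no VmSwap entry and are reported as 0.
for status in /proc/[0-9]*/status; do
    pid=$(awk '/^Pid:/ {print $2}' "$status" 2>/dev/null)
    name=$(awk '/^Name:/ {print $2}' "$status" 2>/dev/null)
    swap=$(awk '/^VmSwap:/ {print $2}' "$status" 2>/dev/null)
    [ -n "$pid" ] && echo "PID=$pid - Swap used: ${swap:-0} - ($name )"
done | sort -t: -k2 -n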

/proc/meminfo:

MemTotal:       32694980 kB
MemFree:         2646652 kB
MemAvailable:    9663396 kB
Buffers:         7138928 kB
Cached:           545828 kB
SwapCached:        23492 kB
Active:         24029440 kB
Inactive:        5137820 kB
Active(anon):   19307904 kB
Inactive(anon):  2687172 kB
Active(file):    4721536 kB
Inactive(file):  2450648 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3002364 kB
SwapFree:        2220284 kB
Dirty:                 8 kB
Writeback:             0 kB
AnonPages:      21459096 kB
Mapped:            31508 kB
Shmem:            512572 kB
Slab:             338332 kB
SReclaimable:     271984 kB
SUnreclaim:        66348 kB
KernelStack:       11200 kB
PageTables:        55932 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    19349852 kB
Committed_AS:   29550388 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      378764 kB
VmallocChunk:   34342174716 kB
HardwareCorrupted:     0 kB
AnonHugePages:     90112 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      248704 kB
DirectMap2M:     5963776 kB
DirectMap1G:    27262976 kB
bk207
  • Can you provide `/proc/meminfo` when a node has the problem? That free output shows caches that should be easy to reclaim, but not a lot of detail. For example, `Committed_AS` is an estimate of how much memory would be required to not page out. – John Mahowald Aug 14 '19 at 00:21
  • @JohnMahowald I have added the requested information. The `Committed_AS` is lower than `MemTotal`; does this mean there is enough memory? – bk207 Aug 17 '19 at 16:50
  • Please add to your question your OSD configuration, especially cache sizing https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/ – John Mahowald Aug 18 '19 at 16:03

1 Answer


Either add RAM, or tune OSDs to not use as much.

Your /proc/meminfo on a 32 GB system shows 26 GB of memory that the kernel is tracking with 1 GB pages (DirectMap1G), 18 GB of which is active anonymous pages. After reading up a bit on how Ceph BlueStore bypasses the kernel file system, it makes sense that it would need big chunks of anonymous memory, as opposed to going through the file system and letting the kernel maintain large file caches.

OSD configuration wasn't provided, but I can guess. ~26 GB of memory divided by 6 OSDs is a bit more than 4 GB per OSD, which is approximately the default for osd_memory_target (4 GB). That directive's documentation notes that in practice the (Linux) kernel may exceed this depending on how aggressively it is reclaiming pages. This hints at a difficulty in the virtual memory system: the kernel tries to be cleverly lazy about what it reclaims, so memory doesn't get released as cleanly as people think.
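
If you want to confirm what a running OSD is actually using as its target, the admin socket can report it; a sketch, where osd.0 is just a placeholder for one of your OSD IDs and the command is run on that OSD's host:

# ask a running OSD daemon for its effective memory target
ceph daemon osd.0 config get osd_memory_target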

24 GB and change for Ceph's anonymous pages alone is 75+% utilization of a 32 GB system. That's fairly high. Add in other allocations like file caches and the kernel, and it isn't too surprising that paging out is observed.

The surprising part to me is that you doubled RAM and still see the problem. Committed_AS at about 28 GB makes this look like a 30-something GB workload to me. It should not page out at 60 GB, unless Ceph's automatic cache sizing is doing something clever as MemTotal increases (I don't know).

A simple thing to try is reducing osd_memory_target, maybe from 4 to 3 GB. That frees up a handful of GB, and possibly utilization will be low enough to avoid death by slow page-outs.
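
A sketch of what that could look like; the 3 GiB value in bytes is just the example figure above, adjust to taste:

# lower the per-OSD memory target from the 4 GiB default to 3 GiB (value in bytes)
ceph config set osd osd_memory_target 3221225472
# or set it per node in /etc/ceph/ceph.conf under [osd]:
#   osd_memory_target = 3221225472
# then restart the OSDs on the node so the new target takes effect
systemctl restart ceph-osd.target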

(Other Ceph cache tuning parameters are documented, but I don't understand them or your system enough to suggest what to try.)

John Mahowald
  • I'm going to double the RAM on two machines and see if it helps, thanks for your replies and information. Will update the question with my findings after upgrading the RAM. – bk207 Aug 19 '19 at 07:52
  • Added more RAM and the system has now stabilised. – bk207 Oct 16 '19 at 11:15