I have a cluster of servers, each with 128 GB of RAM and 6 x 2 TB spinning disks dedicated to BlueStore OSDs. The servers also act as KVM hosts, so they are not dedicated to Ceph. In the past, when using FileStore, we noticed that if a server was low on available memory (e.g. 10-20 GB), the OSDs on that host started doing a lot more I/O than the others, generally slowing down the whole cluster. Now with BlueStore, I can see that each OSD daemon reserves around 3-4 GB of memory for cache. To be safe, I have reserved 5 GB per OSD on each server that won't be spent on VMs.

My question is: with BlueStore, does the amount of free memory on a host still matter for performance, and do I still need to pack most VMs onto hosts without OSDs like before? Or can I stop thinking about it as long as I don't run into an OOM situation?

I am using Ceph 14.2.0.

Jacket

1 Answer

Linux evicts file system cache much sooner than anonymous or shared memory pages. So under memory pressure a FileStore host loses its cache and issues a lot more I/O to the drives. BlueStore accesses the raw device rather than files, so it does not rely on the kernel's file system cache.
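To see how a host's memory splits between file system cache and anonymous pages, you can read /proc/meminfo. The following is a minimal sketch, not a monitoring tool: the function names are mine, the sample values are made up rather than measured, and `Cached` is only a rough proxy for file system cache because it also counts shared memory.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Return (page cache, anonymous, available) in GiB, rounded to 0.1."""
    kib_per_gib = 1024 * 1024
    return tuple(
        round(fields.get(key, 0) / kib_per_gib, 1)
        for key in ("Cached", "AnonPages", "MemAvailable")
    )


# Made-up sample values for illustration, not taken from a real host.
SAMPLE = """\
MemTotal:       134217728 kB
MemAvailable:    20971520 kB
Cached:          10485760 kB
AnonPages:       94371840 kB
"""

cached, anon, available = cache_summary(parse_meminfo(SAMPLE))
print("page cache: {} GiB, anonymous: {} GiB, available: {} GiB".format(
    cached, anon, available))
```

On a real host you would pass the contents of /proc/meminfo instead of `SAMPLE`. BlueStore's cache would show up under anonymous memory, not under `Cached`.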

https://ceph.com/community/new-luminous-bluestore/

Memory usage

One nice thing about FileStore was that it used a normal Linux file system, which meant the kernel was responsible for managing memory for caching data and metadata. In particular, the kernel can use all available RAM as a cache and then release it as soon as the memory is needed for something else. Because BlueStore is implemented in userspace as part of the OSD, we manage our own cache, and we have fewer memory management tools at our disposal.

The bottom line is that with BlueStore there is a bluestore_cache_size configuration option that controls how much memory each OSD will use for the BlueStore cache. By default this is 1 GB for HDD-backed OSDs and 3 GB for SSD-backed OSDs, but you can set it to whatever is appropriate for your environment. (See the BlueStore configuration guide for more information.)

(In contrast, plenty of databases use files plus their own caching. But bypassing that is a valid choice too.)
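As a rough illustration of the bluestore_cache_size option quoted above, here is a small sketch that renders a ceph.conf fragment. It assumes the value is given in bytes and that the option goes in the `[osd]` section. The helper name is hypothetical, and whether this is the right knob for your release is something to confirm in the BlueStore configuration guide.

```python
GIB = 1024 ** 3  # bytes per GiB


def bluestore_cache_conf(cache_gib):
    """Render an [osd] section that sets bluestore_cache_size (assumed to be bytes)."""
    if cache_gib <= 0:
        raise ValueError("cache size must be positive")
    size_bytes = int(cache_gib * GIB)
    return "[osd]\nbluestore_cache_size = {}\n".format(size_bytes)


# 1 GiB is the HDD default mentioned in the quoted blog post.
print(bluestore_cache_conf(1))
```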

Previously, you needed to leave dozens of GB free for file system cache. Now most of that has moved into a fixed amount of anonymous memory inside each OSD process. You still need cache to keep the I/O from your VM workload down. If you have enough hosts to make it practical, it may be simpler to keep storage and compute hosts separate.
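The per-host arithmetic can be sketched as below, using the numbers from the question (128 GB of RAM, 6 OSDs, 5 GB reserved per OSD). The 8 GB host reserve for the OS and a little file system cache is purely an assumed placeholder, and the function name is hypothetical. Pick a reserve that matches what you observe on your own hosts.

```python
def host_memory_budget(total_gb, osd_count, per_osd_gb, host_reserve_gb=0):
    """Split host RAM into OSD reservation, host reserve, and what is left for VMs."""
    osd_gb = osd_count * per_osd_gb
    vm_gb = total_gb - osd_gb - host_reserve_gb
    if vm_gb < 0:
        raise ValueError("reservations exceed total host memory")
    return {"osd_gb": osd_gb, "host_reserve_gb": host_reserve_gb, "vm_gb": vm_gb}


# 128 GB host, 6 OSDs at 5 GB each; the 8 GB host reserve is an assumption.
budget = host_memory_budget(128, 6, 5, host_reserve_gb=8)
print(budget)
```

With these inputs, 30 GB goes to the OSDs and 90 GB is left for VMs. Anything the VMs do not use remains available to the kernel as file system cache.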

John Mahowald
  • So the simple answer should be no then? If FS cache doesn't matter at all for BlueStore, are there any other factors that improve performance when having more available memory? – Jacket Apr 30 '19 at 13:41
  • There is not a simple answer. You still have to find an acceptable bluestore_cache_size, size your VMs, have a little file system cache, and have enough spare to avoid paging out. – John Mahowald Apr 30 '19 at 14:39