
I have a Kubernetes (k8s) cluster on 4 VMs: 1 master and 3 workers. On each of the workers, I use Rook to deploy a Ceph OSD. The OSDs use the same disk as the VM operating system.

The VM disks are remote (the underlying infrastructure is itself a Ceph cluster).

This is the VM disk performance (similar for all 3 of them):

$ dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 4.82804 s, 222 MB/s

And the latency (await) while idle is around 8ms.
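
For reference, the await figure here comes from the extended iostat statistics (sysstat package), watched with something like:

$ iostat -x 1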

If I mount an RBD volume inside a Kubernetes pod, the performance is very poor:

$ dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 128.619 s, 8.3 MB/s 

During high load (100% utilization of the RBD volume), the await latency of the RBD volume is greater than 30 seconds.

I know that my setup is not what Ceph recommends and that dd is not the best tool to profile disk performance, but the penalty from running Ceph on top of VM disks is still huge.
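
For a more representative test than dd, a short fio run against the same file system can show random-I/O behaviour; the job parameters below are only an illustration:

$ fio --name=randwrite --filename=testfile --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --size=1G --runtime=60 --time_based --group_reporting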

The VM operating system is:

CentOS 7.7.1908.
Kernel 3.10.0-1062.12.1.el7.x86_64

Network bandwidth between worker nodes:

[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  2.35 GBytes  2.02 Gbits/sec

Network latency is less than 1 ms.
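
Figures like these can be reproduced with iperf3 and ping between two worker nodes (the address is a placeholder):

$ iperf3 -c <other-worker-ip> -t 10
$ ping -c 10 <other-worker-ip>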

I'm looking for hints on how to troubleshoot this further and improve performance.

1 Answer


There is not enough information about your Ceph cluster, but a few things will improve performance:

  • Put the journal on a separate SSD (NVMe is even better), even if the OSDs are already on SSDs.
  • Use a 10GbE network and separate the cluster (replication) network from the public network. This improves network latency.
  • Don't use 3-replica pools. Replication is a nice feature, but it makes your cluster slower.
  • By default, scrubbing can run at any time. Change that and schedule scrubbing for the night (see the example commands after this list).
  • Use BlueStore as the OSD backend.
  • Tune the servers for maximum performance; for example, the CPU governor should be set to performance.
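
As a rough illustration of the scrubbing and CPU governor points (assuming Ceph Nautilus or newer, where the centralized config database is available; the hour values are only examples):

# Restrict scrubbing to night hours, e.g. 23:00-06:00
$ ceph config set osd osd_scrub_begin_hour 23
$ ceph config set osd osd_scrub_end_hour 6

# On each OSD node, switch the CPU frequency governor to performance
$ cpupower frequency-set -g performance
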
MaksaSila
  • OK. The thing is that while the RBD disk "utilization" is 100%, the VM disks are at < 10%. I don't know where to look for the bottleneck. Could you help me fill in the missing information about my Ceph cluster? – Laurentiu Soica Feb 28 '20 at 22:17
  • Which hardware do you use? For example, if you run the cluster inside virtual machines with one hard drive, it works, but you should not expect any performance from such a setup. Ceph should run on physical hardware with OSD servers that have a lot of hard drives (at least 3 per server). Regarding identifying bottlenecks, atop and top should show you this; for example, a big iowait means that your hard drives are too slow. – MaksaSila Feb 29 '20 at 19:47
  • I only have high IO wait on the RBD devices. The VM disks are mostly idle. I understand I am not in line with all the recommended Ceph configurations, but with the underlying infrastructure mostly idle, should I simply expect such a performance penalty without a good reason? What bothers me the most is that I cannot pinpoint the bottleneck. – Laurentiu Soica Feb 29 '20 at 21:13
  • Please share the output of the `ceph -s` command. – MaksaSila Mar 02 '20 at 08:23
  • I managed to improve performance by switching to BlueStore, to hostNetwork instead of the SDN, and to replication 2 instead of 3. Your remaining recommendations concern infrastructure I cannot change. – Laurentiu Soica Mar 02 '20 at 16:39
  • A pool size of 2 (replication) is not encouraged, at least not if you value your data. See http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013237.html – itsafire Mar 09 '20 at 16:05
  • It is a trade-off. If you have the money/resources for an additional copy, you can have it; it is risk management. In most cases, 2 copies are good enough and still allow maintenance and upgrades. I agree that 3 copies decrease the risk. – MaksaSila Mar 10 '20 at 19:28
  • @LaurentiuSoica However, changing the network from SDN to host networking is not supported in a running Rook cluster. Host networking should be configured when the cluster is first created. – gemfield May 18 '20 at 06:24
  • @LaurentiuSoica By the way, how much did performance improve with those changes? – gemfield May 18 '20 at 06:26
  • From a rados bench run (see the sample commands after this thread), the average IOPS doubled on write and reads are four times faster, both sequential and random. It's still far from local disk performance, but that's all I've got. – Laurentiu Soica May 19 '20 at 07:33
  • To get local-disk performance you need more hosts and hard drives. The performance of Ceph grows with the number of hard drives (OSDs). – MaksaSila May 19 '20 at 08:25
  • @MaksaSila When the total number of OSDs is the same, is there a performance difference between 2 OSD disks per node and 4 OSD disks per node? – gemfield May 25 '20 at 07:55
  • @gemfield Yes. It will distribute the data between 4 disks, so instead of writing to 2 disks in parallel, it will write to 4 disks in parallel. – MaksaSila May 26 '20 at 08:27
  • @MaksaSila If the pool is built from the fastest NVMe disks, would putting the journal on a separate NVMe still increase performance? Is the increase significant enough to be worth it? – laimison Jun 28 '21 at 12:42
  • @laimison For NVMe you don't need a separate journal device. – MaksaSila Jun 29 '21 at 13:07
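
For reference, a rados bench run like the one mentioned in the comments could look roughly like this (the pool name is a placeholder):

$ rados bench -p <pool> 60 write --no-cleanup
$ rados bench -p <pool> 60 seq
$ rados bench -p <pool> 60 rand
$ rados -p <pool> cleanup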