
The setup is 3 clustered Proxmox nodes for compute and 3 clustered Ceph storage nodes:

ceph01: 8 × 150 GB SSDs (1 used for OS, 7 for storage)
ceph02: 8 × 150 GB SSDs (1 used for OS, 7 for storage)
ceph03: 8 × 250 GB SSDs (1 used for OS, 7 for storage)

When I create a VM on a Proxmox node using Ceph storage, I get the speeds below (network bandwidth is NOT the bottleneck).

Writing inside a VM whose disk is on Ceph

[root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 46.7814 s, 23.0 MB/s

[root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 15.5484 s, 69.1 MB/s

Writing inside a VM whose disk is on local Proxmox storage

For comparison, below is the same test on a VM backed by the Proxmox node's local SSD (same model):

[root@localhost ~]# dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.301 s, 104 MB/s

[root@localhost ~]# dd if=/dev/zero of=./here bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 7.22211 s, 149 MB/s
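
For what it's worth, these dd runs only measure large sequential writes; a small-block random-write test stresses Ceph much harder. A sketch with fio (the file name and parameters are illustrative):

fio --name=randwrite --filename=./fio-test --rw=randwrite --bs=4k --size=1G \
    --direct=1 --ioengine=libaio --iodepth=32 --group_reporting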

The Ceph pool is configured as follows:

size/min = 3/2
pg_num = 2048
ruleset = 0
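
For reference, those settings correspond roughly to the following ceph commands (the pool name "rbd" is an assumption):

ceph osd pool create rbd 2048 2048
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
ceph osd dump | grep pool    # verify size/min_size/pg_num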

Three monitors run on the same hosts, and each journal is stored on its own OSD. Running the latest Proxmox with Ceph Hammer.

Any suggestions on where we should look for improvements? Is it the Ceph pool? Is it the journals? Does it matter whether the journal is on the same drive as the OS (/dev/sda) or the OSD (/dev/sdX)?
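
On the journal question, a quick way to see where each FileStore journal actually lives (assuming the default paths):

ls -l /var/lib/ceph/osd/ceph-*/journal
# a plain file, or a symlink back to a partition on the same SSD, means every
# write hits that disk twice: once for the journal, once for the filestore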

fcukinyahoo
  • Large linear writes are not the best test case for Ceph, but there should be some improvement possible. What SSDs are these (SATA?), and what does the network look like? You might want to try 1/1 for size, as with 3 hosts and triple redundancy there is no load distribution. – eckes Oct 13 '17 at 02:26
  • Is Proxmox using a client library to access RBD, or is this a kernel RBD device on the host, or is it using a file image on CephFS? – eckes Oct 13 '17 at 10:18
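
One way to check which access path Proxmox is using, from the host (the VM id 100 is a placeholder):

qm config 100 | grep -i disk    # shows which storage each virtual disk lives on
cat /etc/pve/storage.cfg        # an "rbd" entry with the krbd option set uses the kernel client
rbd showmapped                  # lists kernel-mapped RBD devices, if any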

2 Answers


You can increase disk throughput (MB/s) by setting the MTU to 9000 (jumbo frames) and changing the I/O scheduler to noop.
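
A sketch of both changes (eth1, vda, and <other-node> are placeholders, and the switch ports must also accept jumbo frames):

# on every node, raise the MTU of the storage NIC; persist it with an
# "mtu 9000" line in /etc/network/interfaces
ip link set dev eth1 mtu 9000
ping -M do -s 8972 <other-node>    # verify jumbo frames end to end (9000 minus 28 bytes of headers)

# inside the VM, switch the virtual disk's scheduler to noop
echo noop > /sys/block/vda/queue/scheduler
cat /sys/block/vda/queue/scheduler    # the active scheduler is shown in brackets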

mrmainnet
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://serverfault.com/help/whats-reputation) you will be able to [comment on any post](https://serverfault.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/low-quality-posts/343665) – Mr. Raspberry Oct 13 '17 at 10:02
  • @Mr.Raspberry It looks like an answer to me: jumbo frames can increase throughput (although with modern NICs it is unlikely to have much impact), and using the noop scheduler can also improve access to virtualized block devices. – eckes Oct 13 '17 at 10:20
  • @eckes For more information about Ceph performance, see https://accelazh.github.io/ceph/Ceph-Performance-Tuning-Checklist and http://pve-devel.pve.proxmox.narkive.com/Uj4xVMui/krdb. I'm testing Proxmox 5.0 with Ceph 12.2.1 + BlueStore + dual 10 Gbps NICs with MTU 9000. It gives me 1.2 GB/s when tested with dd: dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct → 1073741824 bytes (1.1 GB) copied, 0.905493 s, 1.2 GB/s. – mrmainnet Oct 17 '17 at 03:04
  • The bit about scheduling can have a huge impact on highly transactional systems. Ceph is a highly transactional system. – Spooler Feb 27 '18 at 20:58

I am running a cluster with Ceph Hammer too. If you run OSDs in FileStore format, you have to use NVMe for the journal, even if you are using SSDs as OSDs.
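
A hedged sketch of moving an existing FileStore journal onto a faster device (osd.3 and /dev/nvme0n1p1 are placeholders; the OSD must be stopped first):

ceph osd set noout                 # keep the cluster from rebalancing while the OSD is down
service ceph stop osd.3            # Hammer-era sysvinit; on systemd: systemctl stop ceph-osd@3
ceph-osd -i 3 --flush-journal      # flush pending journal entries to the filestore
ln -sf /dev/nvme0n1p1 /var/lib/ceph/osd/ceph-3/journal
ceph-osd -i 3 --mkjournal          # create the new journal on the NVMe partition
service ceph start osd.3
ceph osd unset noout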

MaksaSila