
We have just installed a cluster of 6 Proxmox servers, using 3 nodes as Ceph storage, and 3 nodes as compute nodes.

We are experiencing strange and critical issues with the performance and stability of our cluster.

VMs and the Proxmox web interface tend to hang for no obvious reason, from a few seconds to a few minutes, whether the VMs are accessed directly via SSH, RDP or the VNC console. Even the Proxmox hosts themselves seem to become unreachable, as can be seen in our monitoring capture. This also causes Proxmox cluster issues, with some servers falling out of sync.

For instance, when pinging between host nodes, it works fine for a few pings, hangs, carries on (without any increase in round-trip time, still <1 ms), hangs again, and so on.

We had some performance issues initially, but those have been fixed by adjusting the NICs' MTU to 9000 (+1300% read/write improvement). Now we need to get all this stable, because right now it's not production ready.
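
As a side note on making the MTU change persistent: a minimal sketch for a Proxmox/Debian node using /etc/network/interfaces, where the interface name and addresses are placeholders, not our actual values. Jumbo frames of course also have to be allowed on every switch port along the path.

# /etc/network/interfaces (excerpt) - interface name and IPs are examples
auto ens2f0
iface ens2f0 inet static
        address 192.168.0.101/30
        mtu 9000

# apply (ifreload ships with ifupdown2, which Proxmox 7 uses by default) and verify:
ifreload -a
ip link show ens2f0 | grep mtu
# end-to-end jumbo frame check: 8972 = 9000 - 28 bytes of IP/ICMP headers
ping -M do -s 8972 -c 3 192.168.0.102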

Hardware configuration

We have a network architecture similar to the one described in Ceph's official documentation, with a 1 Gbps public network and a 10 Gbps cluster network, connected to two separate physical NICs on each of the 6 servers.
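
Not part of the original setup notes, but as a quick sanity check of that layout, each node's per-NIC addressing and negotiated link speed can be inspected roughly like this (interface names are examples):

# which address lives on which NIC
ip -br addr show eno1      # expected: 1 Gbps public network
ip -br addr show ens2f0    # expected: 10 Gbps cluster network
# confirm the negotiated link speed
ethtool eno1   | grep -i speed
ethtool ens2f0 | grep -i speed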

Storage server nodes:

  • CPU: Xeon E-2136 (6 cores, 12 threads), 3.3 GHz, Turbo 4.5 GHz
  • RAM: 16 GB
  • Storage:
    • 2x RAID 1 256 GB NVMe, LVM
      • system root logical volume: 15 GB (~55% free)
      • swap: 7.4 GB
      • WAL for OSD2: 80 GB
    • 4 TB SATA SSD (OSD1)
    • 12 TB SATA HDD (OSD2)
  • Network Interface Controller:
    • Intel Corporation I350 Gigabit: connected to public 1 Gbps network
    • Intel Corporation 82599 10 Gigabit: connected to 10 Gbps cluster (internal) network

Compute server nodes:

  • CPU: Xeon E-2136 (6 cores, 12 threads), 3.3 GHz, Turbo 4.5 GHz
  • RAM: 64 GB
  • Storage:
    • 2x RAID 1 256 GB SATA SSD
      • system root logical volume: 15 GB (~65% free)

Software: (on all 6 nodes)

  • Proxmox 7.0-13, installed on top of Debian 11
  • Ceph v16.2.6, installed with Proxmox GUI
  • Ceph Monitor on each storage node
  • Ceph Manager on storage nodes 1 and 3 (versions and daemon placement can be cross-checked as sketched below)
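
A hedged way to cross-check the versions and daemon placement listed above, using standard Proxmox/Ceph commands:

pveversion -v | head -n 3   # Proxmox / kernel package versions
ceph versions               # versions reported by the running Ceph daemons
ceph mon stat               # which nodes run monitors and the current quorum
ceph mgr stat               # active manager (and standbys, if any)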

Ceph configuration

ceph.conf of the cluster:

[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 192.168.0.100/30
     fsid = 97637047-5283-4ae7-96f2-7009a4cfbcb1
     mon_allow_pool_delete = true
     mon_host = 1.2.3.100 1.2.3.101 1.2.3.102
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 1.2.3.100/30

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.asrv-pxdn-402]
     host = asrv-pxdn-402
     mds standby for name = pve

[mds.asrv-pxdn-403]
     host = asrv-pxdn-403
     mds_standby_for_name = pve

[mon.asrv-pxdn-401]
     public_addr = 1.2.3.100

[mon.asrv-pxdn-402]
     public_addr = 1.2.3.101

[mon.asrv-pxdn-403]
     public_addr = 1.2.3.102
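
To verify that the daemons really bind to the intended networks (a sketch, using the same redacted addresses as above):

ceph mon dump                            # monitor addresses should fall inside public_network
ceph osd dump | grep '^osd\.'            # each OSD line lists its public and cluster addresses
ceph config show osd.0 cluster_network   # what a running OSD believes (requires the OSD to be up)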

Questions:

  1. Is our architecture correct?
  2. Should the Ceph Monitors and Managers be accessed through the public network? (This is what Proxmox's default configuration gave us.)
  3. Does anyone know where these disturbances/instabilities come from and how to fix them?

[edit]

  4. Is it correct to use a default pool size of 3 when you have only 3 storage nodes? I was initially tempted to use 2, but couldn't find similar examples and decided to keep the default configuration. (A way to check the current pool sizing is sketched below.)
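
Not part of the original question, just a sketch of how the current sizing could be inspected and, if needed, changed; <pool> is a placeholder for one of our pool names:

ceph osd pool ls detail          # shows size / min_size (and crush rule) per pool
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size
# size=3, min_size=2 is the common recommendation; with only 3 hosts and a
# host failure domain, data from a failed host cannot be re-replicated elsewhere.
# ceph osd pool set <pool> size 3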

Noticed issues

  • We noticed that arping somehow returns replies from two different MAC addresses (the public NIC and the private NIC), which doesn't make any sense since these are separate NICs connected to separate switches. This may be part of the network issue (see the ARP check sketched after this list).
  • During a backup task on one of the VMs (a backup to a physically remote Proxmox Backup Server), something seems to affect the cluster: the VM gets stuck in backup/locked mode, even though the backup itself appears to have finished properly (it is visible and accessible on the backup server).
  • Since that first backup issue, Ceph has been trying to rebuild itself but hasn't managed to. It is in a degraded state, indicating that it lacks an MDS daemon; however, I double-checked and there are working MDS daemons on storage nodes 2 and 3. It was recovering until it got stuck in this state.
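
To illustrate the first point: a hedged diagnostic sketch (interface names and the target IP are placeholders). On Linux, several NICs answering ARP for the same address is the classic "ARP flux" behaviour, which should only be possible if both NICs end up on the same broadcast domain:

# which MAC answers ARP for a given IP (iputils-arping syntax)
arping -I eno1 -c 4 1.2.3.101
ip neigh show 1.2.3.101              # the MAC the kernel actually learned

# possible host-side mitigation while the switching is investigated:
# only answer ARP on the interface that owns the address
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2

(As the comments further down eventually confirm, the root cause here turned out to be on the switch side, so these sysctls would only have been a stop-gap.)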

Here is the current Ceph status:

root@storage-node-2:~# ceph -s
  cluster:
    id:     97637047-5283-4ae7-96f2-7009a4cfbcb1
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            Slow OSD heartbeats on back (longest 10055.902ms)
            Slow OSD heartbeats on front (longest 10360.184ms)
            Degraded data redundancy: 141397/1524759 objects degraded (9.273%), 156 pgs degraded, 288 pgs undersized
 
  services:
    mon: 3 daemons, quorum asrv-pxdn-402,asrv-pxdn-401,asrv-pxdn-403 (age 4m)
    mgr: asrv-pxdn-401(active, since 16m)
    mds: 1/1 daemons up
    osd: 6 osds: 4 up (since 22h), 4 in (since 21h)
 
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 480 pgs
    objects: 691.68k objects, 2.6 TiB
    usage:   5.2 TiB used, 24 TiB / 29 TiB avail
    pgs:     141397/1524759 objects degraded (9.273%)
             192 active+clean
             156 active+undersized+degraded
             132 active+undersized
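
A few commands that might help narrow down the slow heartbeats and degraded PGs reported above (a sketch, not something we ran at the time):

ceph health detail                        # expands each HEALTH_WARN entry
ceph osd perf                             # per-OSD commit/apply latency
ceph pg dump_stuck undersized degraded    # stuck PGs and the OSDs they map to
ceph osd pool stats                       # per-pool recovery / client I/O rates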

[edit 2]

root@storage-node-2:~# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME               STATUS  REWEIGHT  PRI-AFF
-1         43.65834  root default                                     
-3         14.55278      host asrv-pxdn-401                           
 0    hdd  10.91409          osd.0               up   1.00000  1.00000
 3    ssd   3.63869          osd.3               up   1.00000  1.00000
-5         14.55278      host asrv-pxdn-402                           
 1    hdd  10.91409          osd.1               up   1.00000  1.00000
 4    ssd   3.63869          osd.4               up   1.00000  1.00000
-7         14.55278      host asrv-pxdn-403                           
 2    hdd  10.91409          osd.2             down         0  1.00000
 5    ssd   3.63869          osd.5             down         0  1.00000
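
Since osd.2 and osd.5 on asrv-pxdn-403 are down, a hedged first step would be to look at their daemons on that node (systemd unit names follow the standard ceph-osd@<id> pattern):

# on asrv-pxdn-403
systemctl status ceph-osd@2 ceph-osd@5
journalctl -u ceph-osd@2 -n 50 --no-pager   # look for heartbeat / network errors
# once the network is confirmed healthy, the OSDs can be restarted:
# systemctl restart ceph-osd@2 ceph-osd@5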
Comments:
  • Did you monitor the Ceph cluster during these interruptions? Did you notice anything, for example failing MON daemons? If that were the case, all clients would need to reconnect to the next MON to re-establish their sessions, which could cause something like that. The other thing is the swap usage; that could also cause interruptions, especially since OSDs don't handle swap usage well. I recommend looking closer and monitoring all components in more detail to get a feeling for where these interruptions come from. The public network for clients is correct. – eblock Nov 15 '21 at 07:37
  • Yes, I did check the Ceph status, which was fine until the errors mentioned in the edit (see above). There is one MON on each of the 3 storage nodes, and one MDS on nodes 2 & 3. The nodes' RAM usage is around 10 GB out of the 16 GB available, with basically no swap usage. Could question 4 (which I just added) have an influence on this? Thanks for your help – Danyright Nov 15 '21 at 11:24
  • Pool size 3 is good; you just can't recover to a different host if your failure domain is host (which is the default and recommended), so you have to wait until the failed host has recovered. Do you have the `ceph osd tree` output from the failure time? Or is it still in this state? – eblock Nov 15 '21 at 11:38
  • Just added the `ceph osd tree` output above. It's still in this state. It looks like it's linked to CephFS and not the RBD, but I'm not certain... – Danyright Nov 15 '21 at 11:47
  • The MDS message just says that you don't have a standby daemon to take over CephFS in case your one active daemon fails. You usually want to have it redundant, but that's not the issue here. To me it still sounds like a network issue between those OSDs. – eblock Nov 15 '21 at 14:35
  • One more question: do you use different device classes (hdd/ssd) in your crush rules? Because that could also make a difference in performance when some PGs are on HDDs and others on SSDs but belong to the same pool. If you don't distinguish then you'll also have most PGs on the HDDs because they have more capacity and thus a higher crush weight. – eblock Nov 16 '21 at 08:05
  • You were right, the issue came from networking problems. Somehow the switches didn't filter/separate the networks properly, which caused the whole system to go crazy. The network is fixed, but the Ceph cluster hasn't recovered yet... – Danyright Nov 17 '21 at 09:19
  • To answer your other question regarding device class: we created two different CRUSH maps for each type of device, in order to create 1 pool with SSDs and 1 pool with HDD (which also allows for fine tuning within our VMs). – Danyright Nov 17 '21 at 09:21
  • But is the cluster actually recovering? Do you see progress? – eblock Nov 18 '21 at 07:28

0 Answers