
I'm currently setting up a server on Proxmox VE. I wanted all drives to be encrypted, so I chose to set up LUKS on all disks and LVM on top of LUKS.
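
For reference, the layering per disk looks roughly like this (device and volume group names here are placeholders, not my actual setup):

    # Encrypt the raw disk and open the LUKS container (names are examples)
    cryptsetup luksFormat /dev/sdX
    cryptsetup luksOpen /dev/sdX crypt_sdX

    # Put LVM on top of the opened dm-crypt mapping
    pvcreate /dev/mapper/crypt_sdX
    vgcreate vg_crypt /dev/mapper/crypt_sdX
    lvcreate -n data -l 100%FREE vg_crypt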

Now, when I transfer data from a fast drive (SSD) to a slower drive (HDD) using dd, it starts out very fast at a few GB/s and then slows down. I then see an IO wait of up to 10%, and the system load ramps up to 36. Some virtual machines are affected by this and freeze.
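
The transfer itself is nothing special; roughly something like this (paths are placeholders, not my real ones), with the load watched alongside it:

    # Copy from SSD-backed storage to HDD-backed storage; status=progress shows the drop-off
    dd if=/mnt/ssd/backup.img of=/mnt/hdd/backup.img bs=1M status=progress

    # Watch IO wait ("wa" column) and load average while it runs
    vmstat 1
    uptime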

Further monitoring revealed that during the high IO wait, dmcrypt_write is using 99% of the IO. So I installed Netdata to get some graphs, and those showed that the HDD is writing at about 120 to 150 MB/s.
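
For anyone wanting to reproduce the monitoring: the per-process IO share and the kernel's dirty/writeback counters can be checked like this (the tools here are just examples; Netdata graphs the same data):

    # Show only processes/threads currently doing IO; dmcrypt_write shows up here
    iotop -o

    # Kernel counters for dirty data and data currently under writeback
    grep -E 'Dirty|Writeback' /proc/meminfo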

After some time, the kworkers get killed for taking too long. After some research I lowered dirty_ratio and dirty_background_ratio; this helped, but it reduced the transfer speed a lot, to about 25 MB/s. That prevented the huge freeze-ups but still causes some lag. It also slowed down the write speed of the HDD itself: instead of writing at 150 MB/s, the HDD would now only write at 50 MB/s.
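
The adjustment was along these lines (the exact percentages here are illustrative, not necessarily the values I ended up with):

    # Lower the thresholds at which background writeback starts and at which
    # writers get blocked; the defaults are typically 10 and 20 percent of RAM
    sysctl -w vm.dirty_background_ratio=1
    sysctl -w vm.dirty_ratio=5

    # Verify the current values
    sysctl vm.dirty_background_ratio vm.dirty_ratio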

I honestly don't know what else to try. Is there any kind of cache that I haven't found yet? Or is there maybe a way to limit write speeds in Linux to what the drives can actually sustain, like it should be?

My only goal is to copy data from A to B without having to limit speeds manually and without having to worry about VMs freezing up.

System Information:

CPU: 2x Intel Xeon E5-2650 v2
RAM: 128 GB DDR3 ECC
OS: Debian 10 with manually installed Proxmox VE
Kernel: Linux 5.3.18-3-pve #1 SMP PVE 5.3.18-3 (Tue, 17 Mar 2020 16:33:19 +0100) x86_64 GNU/Linux

The SSDs that dd reads from are two Toshiba enterprise SAS SSDs in a RAID 1. The HDDs are SATA drives at 5400 rpm (so ... not the fastest); they are also in a RAID 1.

The RAIDs are managed by a Dell PERC H710 Mini (embedded). All RAIDs use Adaptive Read Ahead as the read policy and Write Through as the write policy.

I also noticed a strange-looking Dirty/Writeback graph.

1 Answer


The problem was caused by dirty_ratio and dirty_background_ratio being set too high. Since the RAM is relatively large (128 GB), the amount of dirty data the page cache was allowed to accumulate before writeback kicked in was huge. Once that limit is reached, the system blocks IO and waits for the dirty pages to be flushed to the slow HDD. This is what caused the high IO wait.

Decreasing those limits to much smaller absolute values (64 MB for the background threshold and 4 GB for the dirty limit, i.e. using the byte-based tunables vm.dirty_background_bytes and vm.dirty_bytes instead of the ratios) solved my problem.
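
Since these are absolute sizes rather than percentages, they go into the byte-based tunables (setting a *_bytes value disables the corresponding *_ratio). A persistent way to set them is a sysctl drop-in, for example (the file name is arbitrary):

    # /etc/sysctl.d/90-dirty-limits.conf
    # 64 MB background writeback threshold:
    vm.dirty_background_bytes = 67108864
    # 4 GB hard dirty limit:
    vm.dirty_bytes = 4294967296

    # Apply without a reboot:
    #   sysctl --system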