
I have servers with many NVMe disks. I am testing disk performance with fio using the following:

fio --name=asdf --rw=randwrite --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --size=10G --runtime=60 --group_reporting
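
In each case I run this from the mount point of the filesystem under test, since no --filename is given and fio lays out its test files in the current directory. Roughly like this (the mount point is just an example):

cd /mnt/test
sudo fio --name=asdf --rw=randwrite --direct=1 --ioengine=libaio --bs=16k --numjobs=8 --size=10G --runtime=60 --group_reporting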

For a single disk, LUKS doesn't impact performance very much.

[screenshot: fio results, single disk with and without LUKS]

I tried using mdadm with 6 disks in RAID10 plus an XFS file system. It performed well.

[screenshot: fio results, 6-disk mdadm RAID10 + XFS]

But when I create a LUKS container on top of the mdadm device, I get terrible performance:

[screenshot: fio results, 6-disk mdadm RAID10 + LUKS + XFS]
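
For reference, the LUKS layer is created roughly like this (I believe with plain cryptsetup defaults; the mapper name and mount point are just examples):

sudo cryptsetup luksFormat /dev/md0
sudo cryptsetup open /dev/md0 md0crypt
sudo mkfs.xfs /dev/mapper/md0crypt
sudo mount /dev/mapper/md0crypt /mnt/test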

To recap:

  • 6-disk mdadm RAID10 + XFS = 116% of normal performance, i.e. 16% better write throughput and IOPS compared to a single disk + XFS
  • 6-disk mdadm RAID10 + LUKS + XFS = 33% of normal performance, i.e. 67% worse write throughput and IOPS compared to a single disk + XFS

In no other scenario have I observed such a performance difference between LUKS and non-LUKS, and that includes LVM spanning, striping, and mirroring. In other words, mdadm RAID10 with 6 disks (which I understand to be striped over three 2-disk mirrors), with a LUKS container and an XFS or ext4 file system on top, performs worse in every regard than the following (the LVM-over-LUKS variants are sketched right after this list):

  • Single disk with/out LUKS
  • 2 LUKS disks mirrored by LVM (2 LUKS containers)
  • 2 LUKS disks spanned by LVM (2 LUKS containers)
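
The LVM-over-LUKS comparisons look roughly like this (two LUKS containers first, then LVM on top; device, VG, and LV names and the size are just examples):

sudo cryptsetup luksFormat /dev/nvme6n1
sudo cryptsetup luksFormat /dev/nvme7n1
sudo cryptsetup open /dev/nvme6n1 crypt6
sudo cryptsetup open /dev/nvme7n1 crypt7
sudo pvcreate /dev/mapper/crypt6 /dev/mapper/crypt7
sudo vgcreate vgtest /dev/mapper/crypt6 /dev/mapper/crypt7
sudo lvcreate --type raid1 -m1 -L 500G -n lvtest vgtest    # mirrored; for the spanned (linear) case, drop --type raid1 -m1
sudo mkfs.xfs /dev/vgtest/lvtest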

I want one LUKS container on top of the mdadm RAID10. That is the easiest configuration to understand, and it is what many people on ServerFault, Reddit, etc. recommend. I cannot see how it would be better to LUKS the disks first and then join them to the array, although I have not tested this. It seems most people recommend the order mdadm => LUKS => LVM => file system.
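
Just to make that untested LUKS-first alternative concrete, it would presumably look something like this (one LUKS container per disk, md assembled from the opened mappers; names are examples):

for d in /dev/nvme[0-5]n1; do sudo cryptsetup luksFormat $d; done
for i in 0 1 2 3 4 5; do sudo cryptsetup open /dev/nvme${i}n1 crypt$i; done
sudo mdadm --create /dev/md0 --level=10 --raid-devices=6 /dev/mapper/crypt[0-5]
sudo mkfs.xfs /dev/md0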

A lot of the advice I've seen online is about aligning the stripe size of the RAID array with something else (LUKS? the filesystem?), but the configuration options those guides mention no longer seem to be available. For instance, on Ubuntu 18.04 there is no stripe_cache_size for me to set (it apparently only exists for RAID 4/5/6 arrays anyway).
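
As far as I can tell, the alignment advice usually boils down to options like the following; with this array (512K chunk, 6 disks, near=2, so 3 data stripes) that would mean su=512k, sw=3. mkfs.xfs normally detects the geometry on its own, so treat these values as illustrative only:

sudo cryptsetup luksFormat --align-payload=8192 /dev/md0    # LUKS data offset in 512-byte sectors; 8192 = 4 MiB
sudo mkfs.xfs -d su=512k,sw=3 /dev/mapper/md0crypt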

The only thing that made a difference for me was the instructions on this page. I do have the same CPU, a variant of the AMD EPYC.

Is there something fundamentally wrong with MDADM + LUKS + filesystem (XFS) on Ubuntu 18.04 with 6 NVMe drives? If so, I would appreciate help understanding the problem. If not, what accounts for the huge gap in performance between non-LUKS and LUKS? I've checked CPU and memory while the tests are running, and neither is anywhere near saturated.
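
(By "checked" I mean watching per-core CPU and memory during a run with something like the following, since an aggregate figure can hide a single saturated core:)

mpstat -P ALL 1
free -m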

Side curiosity:

MDADM + LUKS + XFS outperforms MDADM + XFS with a 75/25 R/W mix. Does that make any sense? I would have imagined that LUKS should always be a bit worse than no LUKS, especially with libaio and direct=1...

[screenshot: fio results, 75/25 read/write mix, mdadm + LUKS + XFS vs mdadm + XFS]

Edit 1

@Michael Hampton (CPU info, from /proc/cpuinfo):

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7452 32-Core Processor
stepping        : 0
microcode       : 0x8301034
cpu MHz         : 1499.977
cache size      : 512 KB
physical id     : 0
siblings        : 64
core id         : 0
cpu cores       : 32
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4699.84
TLB size        : 3072 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

...and so on through processor 63.

What hardware? Well, nvme list:

sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     BTLJ0086052F2P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152
/dev/nvme1n1     BTLJ007503YS2P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152
/dev/nvme2n1     BTLJ008609DJ2P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152
/dev/nvme3n1     BTLJ008609KE2P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152
/dev/nvme4n1     BTLJ00860AB92P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152
/dev/nvme5n1     BTLJ007302142P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152
/dev/nvme6n1     BTLJ008609VC2P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152
/dev/nvme7n1     BTLJ0072065K2P0BGN   INTEL SSDPE2KX020T8                      1           2.00  TB /   2.00  TB    512   B +  0 B   VDV10152

What Linux distro? Ubuntu 18.04 (bionic)

What kernel? uname -r gives 4.15.0-121-generic

@anx

numactl --hardware gives

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 1019928 MB
node 0 free: 1015402 MB
node distances:
node   0
  0:  10

cryptsetup benchmark gives

# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1288176 iterations per second for 256-bit key
PBKDF2-sha256    1466539 iterations per second for 256-bit key
PBKDF2-sha512    1246820 iterations per second for 256-bit key
PBKDF2-ripemd160  916587 iterations per second for 256-bit key
PBKDF2-whirlpool  698119 iterations per second for 256-bit key
argon2i       6 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id      6 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
#     Algorithm | Key |  Encryption |  Decryption
        aes-cbc   128b  1011.5 MiB/s  3428.1 MiB/s
    serpent-cbc   128b    90.2 MiB/s   581.3 MiB/s
    twofish-cbc   128b   174.3 MiB/s   340.6 MiB/s
        aes-cbc   256b   777.0 MiB/s  2861.3 MiB/s
    serpent-cbc   256b    93.6 MiB/s   581.9 MiB/s
    twofish-cbc   256b   179.1 MiB/s   340.6 MiB/s
        aes-xts   256b  1630.3 MiB/s  1641.3 MiB/s
    serpent-xts   256b   579.2 MiB/s   571.9 MiB/s
    twofish-xts   256b   336.2 MiB/s   335.8 MiB/s
        aes-xts   512b  1438.0 MiB/s  1438.3 MiB/s
    serpent-xts   512b   583.3 MiB/s   571.6 MiB/s
    twofish-xts   512b   336.9 MiB/s   335.7 MiB/s
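
For reference, the cipher actually in use on the LUKS volume (as opposed to the in-memory benchmark above) can be checked with something like:

sudo cryptsetup luksDump /dev/md0
sudo cryptsetup status md0crypt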

Disks' nameplate RIO? Not sure what you mean, but I'm guessing you mean the disk hardware:

INTEL SSDPE2KX020T8 - which is rated at 2000 MB/s for random write

@shodanshok

My RAID array is rebuilding. It is also doing this weird thing where, after a reboot, it comes up as /dev/md127 instead of /dev/md0 and loses the first device.
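
I suspect the md0 -> md127 renaming is just the array not being recorded in mdadm.conf and the initramfs. I have not verified it, but presumably something like this would pin the name:

sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u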

So I dd'd over the first 1G of each of the 6 disks and then rebuilt.
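
The wipe was roughly this (reconstructed, so treat as approximate):

for d in /dev/nvme[0-5]n1; do sudo dd if=/dev/zero of=$d bs=1M count=1024; done

Then the rebuild: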

mdadm --create --verbose /dev/md0 --level=10 --raid-devices=6 /dev/nvme[0-5]n1

mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1953382400K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

Now mdadm -D /dev/md0 says

/dev/md0:
           Version : 1.2
     Creation Time : Tue Oct 20 07:27:19 2020
        Raid Level : raid10
        Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
     Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
      Raid Devices : 6
     Total Devices : 6
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Oct 20 07:27:50 2020
             State : clean, resyncing
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

     Resync Status : 0% complete

              Name : large20q3-co-120:0  (local to host large20q3-co-120)
              UUID : 6d422227:dbfac37a:484c8c59:7ce5cf6e
            Events : 6

    Number   Major   Minor   RaidDevice State
       0     259        1        0      active sync set-A   /dev/nvme0n1
       1     259        0        1      active sync set-B   /dev/nvme1n1
       2     259        3        2      active sync set-A   /dev/nvme2n1
       3     259        5        3      active sync set-B   /dev/nvme3n1
       4     259        7        4      active sync set-A   /dev/nvme4n1
       5     259        9        5      active sync set-B   /dev/nvme5n1

@Mike Andrews

Rebuild completed.

Edit 2

So after the rebuild, I created the LUKS container on /dev/md0 and an XFS filesystem on top of it.

Then I ran fio without specifying an ioengine (so it falls back to psync) and with numjobs increased to 128:

fio --name=randwrite --rw=randwrite --direct=1 --bs=16k --numjobs=128 --size=10G --runtime=60 --group_reporting

randw: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.1
Starting 128 processes
randw: Laying out IO file (1 file / 10240MiB)
Jobs: 128 (f=128): [w(128)][100.0%][r=0KiB/s,w=1432MiB/s][r=0,w=91.6k IOPS][eta 00m:00s]
randw: (groupid=0, jobs=128): err= 0: pid=17759: Wed Oct 21 04:02:36 2020
  write: IOPS=103k, BW=1615MiB/s (1693MB/s)(94.9GiB/60148msec)
    clat (usec): min=96, max=6186.3k, avg=1231.81, stdev=10343.03
     lat (usec): min=97, max=6186.3k, avg=1232.92, stdev=10343.03
    clat percentiles (usec):
     |  1.00th=[   898],  5.00th=[   930], 10.00th=[   955], 20.00th=[   971],
     | 30.00th=[   996], 40.00th=[  1012], 50.00th=[  1020], 60.00th=[  1037],
     | 70.00th=[  1057], 80.00th=[  1090], 90.00th=[  1827], 95.00th=[  2024],
     | 99.00th=[  2147], 99.50th=[  2245], 99.90th=[  9634], 99.95th=[ 16188],
     | 99.99th=[274727]
   bw (  KiB/s): min=   32, max=16738, per=0.80%, avg=13266.43, stdev=3544.46, samples=15038
   iops        : min=    2, max= 1046, avg=828.56, stdev=221.45, samples=15038
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.02%, 1000=34.71%
  lat (msec)   : 2=59.09%, 4=6.03%, 10=0.05%, 20=0.05%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%, >=2000=0.01%
  cpu          : usr=0.31%, sys=2.33%, ctx=6292644, majf=0, minf=1308
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,6216684,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1615MiB/s (1693MB/s), 1615MiB/s-1615MiB/s (1693MB/s-1693MB/s), io=94.9GiB (102GB), run=60148-60148msec

Disk stats (read/write):
    dm-0: ios=3/6532991, merge=0/0, ticks=0/7302772, in_queue=7333424, util=98.56%, aggrios=3/6836535, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md0: ios=3/6836535, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/2127924, aggrmerge=0/51503, aggrticks=0/102167, aggrin_queue=21846, aggrutil=32.64%
  nvme0n1: ios=0/2131196, merge=0/51420, ticks=0/110120, in_queue=25668, util=29.16%
  nvme3n1: ios=0/2127405, merge=0/51396, ticks=0/96844, in_queue=19064, util=22.12%
  nvme2n1: ios=1/2127405, merge=0/51396, ticks=0/102132, in_queue=22128, util=25.15%
  nvme5n1: ios=2/2125172, merge=0/51693, ticks=0/92864, in_queue=17464, util=20.39%
  nvme1n1: ios=0/2131196, merge=0/51420, ticks=0/116220, in_queue=28492, util=32.64%
  nvme4n1: ios=0/2125172, merge=0/51693, ticks=0/94824, in_queue=18264, util=20.72%

Then I unmounted and removed the LUKS container, ran mkfs.xfs -f /dev/md0 directly on /dev/md0 (it hung for a while but eventually completed), and ran the same test.
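
That teardown was essentially (names as before):

sudo umount /mnt/test
sudo cryptsetup close md0crypt
sudo mkfs.xfs -f /dev/md0
sudo mount /dev/md0 /mnt/test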

Jobs: 128 (f=128): [w(128)][100.0%][r=0KiB/s,w=2473MiB/s][r=0,w=158k IOPS][eta 00m:00s]
randw: (groupid=0, jobs=128): err= 0: pid=13910: Wed Oct 21 07:48:59 2020
  write: IOPS=276k, BW=4314MiB/s (4523MB/s)(253GiB/60003msec)
    clat (usec): min=23, max=853750, avg=460.62, stdev=2832.50
     lat (usec): min=24, max=853751, avg=461.24, stdev=2832.50
    clat percentiles (usec):
     |  1.00th=[   42],  5.00th=[   48], 10.00th=[   53], 20.00th=[   61],
     | 30.00th=[   68], 40.00th=[   77], 50.00th=[   88], 60.00th=[  102],
     | 70.00th=[  131], 80.00th=[  693], 90.00th=[ 1762], 95.00th=[ 2180],
     | 99.00th=[ 2671], 99.50th=[ 2868], 99.90th=[ 4817], 99.95th=[ 6980],
     | 99.99th=[21890]
   bw (  KiB/s): min= 1094, max=48449, per=0.78%, avg=34643.43, stdev=7669.85, samples=15360
   iops        : min=   68, max= 3028, avg=2164.78, stdev=479.37, samples=15360
  lat (usec)   : 50=7.27%, 100=51.59%, 250=16.09%, 500=3.16%, 750=2.39%
  lat (usec)   : 1000=2.11%
  lat (msec)   : 2=10.08%, 4=7.16%, 10=0.12%, 20=0.03%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=0.66%, sys=10.31%, ctx=17040235, majf=0, minf=1605
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,16565027,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=4314MiB/s (4523MB/s), 4314MiB/s-4314MiB/s (4523MB/s-4523MB/s), io=253GiB (271GB), run=60003-60003msec

Disk stats (read/write):
    md0: ios=1/16941906, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/5682739, aggrmerge=0/2473, aggrticks=0/1218564, aggrin_queue=1186133, aggrutil=74.38%
  nvme0n1: ios=0/5685248, merge=0/2539, ticks=0/853448, in_queue=796840, util=66.08%
  nvme3n1: ios=0/5681945, merge=0/2474, ticks=0/1807992, in_queue=1812712, util=74.38%
  nvme2n1: ios=1/5681946, merge=0/2476, ticks=0/772512, in_queue=718264, util=63.36%
  nvme5n1: ios=0/5681023, merge=0/2406, ticks=0/1339628, in_queue=1300048, util=70.97%
  nvme1n1: ios=0/5685248, merge=0/2539, ticks=0/1361944, in_queue=1329024, util=70.38%
  nvme4n1: ios=0/5681029, merge=0/2406, ticks=0/1175864, in_queue=1159912, util=66.80%
tacos_tacos_tacos
  • Can you show the output of `mdadm -D /dev/`? What happens to performance when removing `--ioengine=libaio` and increasing `--numjobs=128`? – shodanshok Oct 18 '20 at 16:57
  • I think @MichaelHampton may be on the right track, wanting to see what processor you have. With `numjobs=8`, you may be CPU limited. Use `htop` to watch the CPU usage meters while `fio` is running. Also, consider using the `--perf-same_cpu_crypt` and `--perf-submit_from_crypt_cpus` options to `cryptsetup`. It may be that for the fio workload, you're better off keeping it on the cores that FIO is using. – Mike Andrews Oct 19 '20 at 15:17
  • @shodanshok and others, missed your comments. Will edit with info when I am back online. – tacos_tacos_tacos Oct 20 '20 at 07:01
  • @anx provided some output requested – tacos_tacos_tacos Oct 20 '20 at 07:23
  • Thanks for the additional information. You've certainly got enough CPU there. But, definitely wait for the rebuild to complete. MD performance can be truly terrible during a rebuild. – Mike Andrews Oct 20 '20 at 13:48
  • Thanks for the additional data. Can you do a `fio` run without `--ioengine=libaio` while increasing `--numjobs=128`, with and without `dm-crypt` ? – shodanshok Oct 20 '20 at 20:22
  • @shodanshok ok, will post results. Can you help me interpret them? I read online that --ioengine=libaio is somehow the most "realistic" or "best" io engine, but I have no clue. – tacos_tacos_tacos Oct 21 '20 at 03:59
  • @shodanshok I posted the mdadm+luks results with requested changes. Looks much better... how do I know what's realistic (for a database-type application) – tacos_tacos_tacos Oct 21 '20 at 04:09
  • Thanks for updating! Are you using the `aes-xts` cipher for your LUKS volume? If so, that's your bottleneck. Your reported `cryptsetup benchmark` results for that cipher match what `fio` sees: `WRITE: bw=1615MiB/s`. – Mike Andrews Oct 22 '20 at 14:12
  • @MikeAndrews That's the fastest one available, isn't it? I believe I am, checking. – tacos_tacos_tacos Oct 22 '20 at 18:30
