
I have four NVMe drives in a RAID 0 configuration.

I am attempting to determine how many IOPS the array is handling.

When I run iostat, it appears that one drive is handling more I/O than the other three.

Is this an error with the way that iostat collects data, a known issue with mdadm, or have I misconfigured the array?

Usage Details

# iostat

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1        1669.12     22706.35     13975.13 63422465065 39034761844
nvme3n1         753.28     13228.56     12185.39 36949483692 34035736524
nvme1n1         635.93     13781.47     14014.10 38493855272 39143630456
nvme2n1         744.35     14704.94     14283.13 41073264648 39895068820
md0            4291.15     72863.78     56468.04 203520212237 157724286024
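
For per-device IOPS rather than the cumulative counters above, extended statistics can be sampled over an interval. A possible invocation (assuming the sysstat version of iostat; the r/s and w/s columns report read and write IOPS per device):

# iostat -xd 10 nvme0n1 nvme1n1 nvme2n1 nvme3n1 md0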

Software RAID Device Details

# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Feb 19 22:45:06 2021
        Raid Level : raid0
        Array Size : 8001060864 (7630.41 GiB 8193.09 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Fri Feb 19 22:45:06 2021
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              Name : eth1:0
              UUID : 2e672c70:de98a756:160877d2:d8fe2c94
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        1        0      active sync   /dev/nvme0n1p1
       1     259        5        1      active sync   /dev/nvme1n1p1
       2     259        7        2      active sync   /dev/nvme2n1p1
       3     259        3        3      active sync   /dev/nvme3n1p1

Block Devices

# lsblk

NAME               MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
nvme0n1            259:0    0   1.9T  0 disk  
└─nvme0n1p1        259:1    0   1.9T  0 part  
  └─md0              9:0    0   7.5T  0 raid0 /mnt/raid0
nvme3n1            259:2    0   1.9T  0 disk  
└─nvme3n1p1        259:3    0   1.9T  0 part  
  └─md0              9:0    0   7.5T  0 raid0 /mnt/raid0
nvme1n1            259:4    0   1.9T  0 disk  
└─nvme1n1p1        259:5    0   1.9T  0 part  
  └─md0              9:0    0   7.5T  0 raid0 /mnt/raid0
nvme2n1            259:6    0   1.9T  0 disk  
└─nvme2n1p1        259:7    0   1.9T  0 part  
  └─md0              9:0    0   7.5T  0 raid0 /mnt/raid0

File System Details

# dumpe2fs -h /dev/md0
dumpe2fs 1.44.5 (15-Dec-2018)
Filesystem volume name:   QuadSSD
Last mounted on:          /mnt/raid0
Filesystem UUID:          8b33fb9d-1f98-44ff-a012-38ac10ffece3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              250036224
Block count:              2000265216
Reserved block count:     100013260
Free blocks:              1759673576
Free inodes:              249676044
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         4096
Inode blocks per group:   256
RAID stride:              128
RAID stripe width:        512
Flex block group size:    16
Filesystem created:       Tue Mar  2 22:54:32 2021
Last mount time:          Sun Mar 14 15:55:16 2021
Last write time:          Sun Mar 14 15:55:16 2021
Mount count:              4
Maximum mount count:      -1
Last checked:             Tue Mar  2 22:54:32 2021
Check interval:           0 (<none>)
Lifetime writes:          14 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      f8a38f43-4d67-4137-972d-db2f8650ffad
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x3f3be24d
Journal features:         journal_incompat_revoke journal_64bit journal_checksum_v3
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x06a3502a
Journal start:            154915
Journal checksum type:    crc32c
Journal checksum:         0x963b1ac7

Notes:

  • I also see similar results when running `iostat 10`: nvme0n1 consistently shows higher usage than the other three drives.
  • The array/drives were never used as a root partition.
  • Some output has been abbreviated; for example, other block devices exist in the system but are omitted here.
  • A couple of thoughts: I think to really test this, you need a synthetic benchmark like `fio`. The I/O load of 72 MB/s read and 56 MB/s write in that iostat output is a trivial amount for four NVMe drives to handle, so it's very possible that what you're seeing is workload-dependent. Second, there *have* been weird problems with the statistics that iostat uses; it's not out of the question that the kernel you're using reports them incorrectly. – Mike Andrews Apr 22 '21 at 17:48

1 Answer


The apparently higher usage of the first device is probably an artifact of read alignment.

You have a four-disk RAID0 with a 512K chunk size, so a full stripe spans 2 MB and any read aligned to a 2 MB boundary starts on the first device. Both 2 MB and 4 MB are common alignment values for applications (e.g., LVM physical extents are 4 MB by default), so the first drive can appear more stressed than the others.
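
To make the mapping concrete, here is a minimal shell sketch (assuming md's standard RAID0 layout, where chunk N lands on device N mod 4):

# With four 512K chunks per stripe, any offset aligned to 2 MB
# (and therefore to 4 MB as well) always starts on device 0.
for offset_kib in 0 512 1024 1536 2048 4096; do
    echo "offset ${offset_kib}K -> device $(( (offset_kib / 512) % 4 ))"
done

Offsets 0, 2048K, and 4096K all map to device 0, while the intermediate chunk-aligned offsets rotate across the other three devices.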

For a more in-depth (and correct) evaluation, you should observe your drives' behavior under a typical real-world workload (or a reasonable approximation of it built with fio).
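
For instance, a random-read fio job against the array spreads 4K reads uniformly across all chunks, so per-device iostat figures taken during the run should come out nearly even. An illustrative invocation (the file name and sizes are placeholders):

# fio --name=randread --filename=/mnt/raid0/fio.test --size=8G \
      --rw=randread --bs=4k --direct=1 --ioengine=libaio \
      --iodepth=32 --runtime=60 --time_based --group_reporting

By contrast, small reads issued at 2 MB- or 4 MB-aligned offsets would all land on the first device, reproducing the skew seen above.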

  • The "problem" with RAID0 is there's no redundancy so you have no choice as to which drive you want to read the data back from... The writes seem somewhat balanced between drives but if the workload isn't re-reading the entirety of what's written could it be that the reads are skewed to particular drives? – Anon Jun 17 '21 at 18:42
  • @Anon each RAID0 component disk has a different data chunk, so the read skew you describe is mainly due to read alignment. – shodanshok Jun 17 '21 at 19:57
  • @shodanshok - interesting point about the alignment! Note: The drives' behavior **was** observed using a typical real-world test. I am syncing the Ethereum block chain. That is the goal for the computer. – Vincent Saelzler Aug 03 '21 at 02:06