I've been trying to find a straight answer on this one, and it has proved elusive. This question and its answer are close, but they don't really give me the specifics I would like. Let's start with what I think I know.

If you have a standard block device and you run sudo blockdev --report you will get something like this:

RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0    500107862016   /dev/sda
rw   256   512  4096       2048    399999238144   /dev/sda1
rw   256   512  1024  781252606            1024   /dev/sda2

Now, if you change that default 256 to 128 using --setra on any of the partitions, the change applies to the whole block device, like so:

sudo blockdev --setra 128 /dev/sda1
sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   128   512  4096          0    500107862016   /dev/sda
rw   128   512  4096       2048    399999238144   /dev/sda1
rw   128   512  1024  781252606            1024   /dev/sda2

This makes perfect sense to me: the setting lives on the block-level device, not the partition, so it all changes. The default relationship between the RA setting and the actual readahead size also makes sense to me; it is generally:

RA * sector size (default = 512 bytes)

Hence, the change I made above, with the default 512-byte sector size, will drop readahead from 128 KB to 64 KB. All well and good so far.
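You can sanity-check that arithmetic straight from the shell (a quick sketch; /dev/sda1 is just the partition from the report above, and --getra reports in 512-byte sectors):

sudo blockdev --getra /dev/sda1                                    # prints 128 after the change above
echo "$(( $(sudo blockdev --getra /dev/sda1) * 512 / 1024 )) KB"   # prints 64 KB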

However, what happens when we add in a software RAID, or LVM and device-mapper? Imagine your report looks like this instead:

RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     10737418240   /dev/xvda1
rw   256   512  4096          0    901875499008   /dev/xvdb
rw   256   512  4096          0    108447924224   /dev/xvdj
rw   256   512  4096          0    108447924224   /dev/xvdi
rw   256   512  4096          0    108447924224   /dev/xvdh
rw   256   512  4096          0    108447924224   /dev/xvdg
rw  4096   512  4096          0    433787502592   /dev/md0
rw  4096   512   512          0    429496729600   /dev/dm-0

In this case we have a device-mapped dm-0 LVM device on top of the md0 created by mdadm, which is in fact a RAID0 stripe across the four devices xvdg-j.

Both the md0 and dm-0 have settings of 4096 for RA, far higher than the block devices. So, some questions here:

  • How does the RA setting get passed down the virtual block device chain?
  • Does dm-0 trump all because that is the top level block device you are actually accessing?
  • Would lvchange -r have an impact on the dm-0 device and not show up here?

If it is as simple as "the RA setting from the virtual block device you are using gets passed on", does that mean that a read from dm-0 (or md0) translates into 4 x 4096-sector RA reads (one on each underlying block device)? If so, these settings explode the size of the readahead in the scenario above.
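For what it's worth, one way I can at least see what each layer reports is via sysfs (a sketch using the device names from the report above; note that read_ahead_kb is in KB, so 4096 sectors shows up as 2048):

for dev in xvdg xvdh xvdi xvdj md0 dm-0; do
    echo "$dev: $(cat /sys/block/$dev/queue/read_ahead_kb) KB"
done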

Then in terms of figuring out what the readahead setting is actually doing:

What do you use, equivalent to the sector size above, to determine the actual readahead value for a virtual device (I sketch a few places to look just after this list):

  • The stripe size of the RAID (for md0)?
  • Some other sector size equivalent?
  • Is it configurable, and how?
  • Does the FS play a part (I am primarily interested in ext4 and XFS)?
  • Or, if it is just passed on, is it simply the RA setting from the top level device multiplied by the sector size of the real block devices?
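A few places I could imagine looking for those candidate values (a sketch, assuming the md0/dm-0 devices from the report above; mdadm needs root):

sudo mdadm --detail /dev/md0 | grep 'Chunk Size'   # the RAID chunk (stripe unit) size
cat /sys/block/md0/queue/minimum_io_size           # usually the chunk size, in bytes
cat /sys/block/md0/queue/optimal_io_size           # usually the full stripe width, in bytes
cat /sys/block/dm-0/queue/read_ahead_kb            # what the kernel will actually read ahead, in KB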

Finally, would there be any preferred relationship between stripe size and the RA setting (for example)? Here I am thinking that if the stripe is the smallest element that is going to be pulled off the RAID device, you would ideally not want two disk accesses to be needed to service that minimum unit of data, and would want to make the RA large enough to fulfill the request with a single access.
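To make that concrete, here is the kind of arithmetic I have in mind (purely illustrative: the 4-disk layout matches the report above, but the 64 KB chunk size is an assumption):

# Hypothetical RAID0 of 4 disks with a 64 KB chunk:
#   full stripe    = 4 * 64 KB          = 256 KB
#   RA to cover it = 256 KB / 512 bytes = 512 sectors
sudo blockdev --setra 512 /dev/md0   # one readahead window = one full stripe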

– Adam C
  • What Linux distribution are you using? Are you using hardware or software raid? Seems like software. If hardware, what card/chipset are you using as much of this is set and stored in the device's firmware. – Jason Huntley Aug 21 '12 at 17:06
  • Also, the RA settings greatly depend on your filesystem allocation scheme. Are you using ext4? – Jason Huntley Aug 21 '12 at 17:19
  • I actually mention that it's software RAID and LVM in the question, so yes - software. In terms of the filesystem, I would be interested in the difference between XFS and ext4 here, answers for either would be good though – Adam C Aug 21 '12 at 19:49
  • XFS can be tuned heavily for better performance. That's covered in a few places on this site: [here](http://serverfault.com/a/367077/13325) and [here](http://serverfault.com/a/406070/13325)... What distribution of Linux are you using? That plays a factor because there are some distribution-specific tools available, too. – ewwhite Aug 21 '12 at 19:54
  • This is not a performance question; it's more specific - I just want to know about RA settings and how they translate through/interact with the LVM/software RAID layers – Adam C Aug 21 '12 at 20:26

3 Answers


How does the RA setting get passed down the virtual block device chain?

It depends. Let's assume you are inside a Xen domU and have RA=256. Your /dev/xvda1 is actually an LV on the dom0, visible there as /dev/dm-1. So you have RA(domU(/dev/xvda1)) = 256 and RA(dom0(/dev/dm-1)) = 512. The effect is that the dom0 kernel will access /dev/dm-1 with a different RA than the domU's kernel. Simple as that.
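A quick way to see that, keeping the device names from this example (one command per kernel):

# inside the domU:
blockdev --getra /dev/xvda1   # -> 256, the domU kernel's setting
# on the dom0:
blockdev --getra /dev/dm-1    # -> 512, the dom0 kernel's own setting for the same storage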

A different situation occurs with a /dev/md0(/dev/sda1,/dev/sda2) setup:

blockdev --report | grep sda
rw   512   512  4096          0   1500301910016   /dev/sda
rw   512   512  4096       2048      1072693248   /dev/sda1
rw   512   512  4096    2097152   1499227750400   /dev/sda2
blockdev --setra 256 /dev/sda1
blockdev --report | grep sda
rw   256   512  4096          0   1500301910016   /dev/sda
rw   256   512  4096       2048      1072693248   /dev/sda1
rw   256   512  4096    2097152   1499227750400   /dev/sda2

Setting the /dev/md0 RA won't affect the underlying /dev/sdX block devices:

rw   256   512  4096       2048      1072693248   /dev/sda1
rw   256   512  4096    2097152   1499227750400   /dev/sda2
rw   512   512  4096          0      1072627712   /dev/md0

So, generally, in my opinion the kernel accesses a block device with whatever RA is actually set on that device. One logical volume can be accessed via the RAID it is part of or via a device-mapper device, and each of those paths has its own RA, which will be respected.

So the answer is: the RA setting is, IMHO, not passed down the block device chain. Whatever the RA setting is on the top-level device you actually access, that is what will be used for accessing the constituent devices.
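You can see the same picture through sysfs if you prefer (a sketch using the devices from my example above; read_ahead_kb is in KB, not sectors):

cat /sys/block/md0/queue/read_ahead_kb   # 256 KB  (= 512 sectors * 512 bytes)
cat /sys/block/sda/queue/read_ahead_kb   # 128 KB  (= 256 sectors * 512 bytes)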

Does dm-0 trump all because that is the top level block device you are actually accessing?

If by "trump all" you mean deep propagation - as above, I think you may have different RAs for different devices in the system.

Would lvchange -r have an impact on the dm-0 device and not show up here?

Yes, but this is a particular case. Let's assume that we have /dev/dm-0, which is LVM's /dev/vg0/blockdevice. If you do:

lvchange -r 512 /dev/vg0/blockdevice

then /dev/dm-0 will also change, because /dev/dm-0 and /dev/vg0/blockdevice are exactly the same block device as far as kernel access is concerned.
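A simple way to confirm that, keeping the hypothetical names from above (I'm assuming /dev/vg0/blockdevice resolves to /dev/dm-0, as stated):

lvchange -r 512 /dev/vg0/blockdevice
blockdev --getra /dev/vg0/blockdevice   # -> 512
blockdev --getra /dev/dm-0              # -> 512, same underlying device, same value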

But let's assume that /dev/vg0/blockdevice is the same as /dev/dm-0 and as /dev/xvda1 in the Xen domU that is using it. Setting the RA of /dev/xvda1 will take effect inside the domU, but the dom0 will still have its own RA for its view of the device.

What do you use, equivalent to the sector size above to determine the actual readahead value for a virtual device:

I typically discover a good RA by experimenting with different values and testing them with hdparm.

The stripe size of the RAID (for md0)?

Same as above.

Does the FS play a part (I am primarily interested in ext4 and XFS)?

Sure - this is a very big topic. I recommend you start here: http://archives.postgresql.org/pgsql-performance/2008-09/msg00141.php

– wojciechz
  • This is very close to what I am looking for, and what I suspected - can you just clear up one thing for me: in the /dev/md0(/dev/sda1,/dev/sda2) situation, I know that you can set separate RA values, but if you, say, mount /data on /dev/md0 and read a file from it, does the 512 RA get used for reading from /dev/sda1 and /dev/sda2 (i.e. 512 used for both), or is 256 used on each? If the former, it would seem wise to set the RAID0 RA to SUM(RA of the devices in the RAID0). – Adam C Aug 23 '12 at 14:50
  • Just speaking from my experience: setting RA=512 on /dev/md0, with /dev/sdX disks underneath, acts exactly the same as accessing /dev/sdX with RA=512, even if, for example, we have RA=256 set on the bottom block device. The 256 setting will be ignored in this case (note that /dev/sda is useless as a block device if it's part of /dev/md0). I'm not a kernel programmer, but this seems logical and is confirmed by my practice. So, to sum up: 3 threads reading from /dev/md0 with RA=512 equal 3 threads reading from /dev/sd{a,b,c} with RA=512. – wojciechz Aug 23 '12 at 15:03
  • Great, thanks! I have edited things slightly to make that clearer in the answer. Can I ask one more thing before I accept? Do you have an example (or link to one) for using hdparm to test RA? I was going to do something similar myself, so if there's a good reference it would save me time. – Adam C Aug 23 '12 at 15:15
  • It's not complicated, but it depends on what you want to check. Please refer to the hdparm manual. If you want to check disk reads (which are strongly influenced by readahead) you can issue a command like _hdparm -t /dev/md0_. The outcome will show something like _Timing buffered disk reads: 310 MB in 3.02 seconds = 102.79 MB/sec_. The last value is typically strongly affected by the RA setting. – wojciechz Aug 23 '12 at 15:22
  • Ah, so not a direct measurement - understood. Accepting now - thanks for the help :) – Adam C Aug 23 '12 at 15:32

I know the answer, but it is harder to explain, so I will do so with an example. Say, for the sake of this, you have 3 block devices and you set your RA to 4 (4 * 512 bytes), assuming a standard sector size. If you were to use a RAID-5 scheme across the 3 disks, any read that even touched a stripe on a unique disk would compound the RA by the factor you initially set the block-device RA to. So if your read spanned exactly all 3 disks, your effective RA would be 12 * 512 bytes. This can be compounded by setting RA at the various levels, e.g. MD or LVM. As a rule of thumb, if my app benefits from RA, I set it on the highest layer possible so I don't compound the RA unnecessarily. I then start the filesystem on sector 2049 and offset each start to a number divisible by 8. I may be way off on what you are asking, but this is my 2¢.
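Putting numbers on that example (just restating the arithmetic above, nothing measured):

# 3-disk RAID-5, each underlying device set to RA=4 (4 * 512 bytes = 2 KB):
#   read touching one disk     -> 4 * 512 bytes  = 2 KB of readahead
#   read spanning all 3 disks  -> 12 * 512 bytes = 6 KB of effective readahead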

– Bill Clark
  • So, you are saying that whatever the RA setting is on the top-level device, it simply gets passed down. Therefore, if you used LVM --> 2 x RAID --> 4 physical disks each, and you had an RA of 4, then because there are 8 physical devices you end up with an effective RA of 32. How would you tweak the chunk/stripe size of the RAID to be efficient in that scenario - I assume you want the RA to cover an entire stripe so you don't have to access twice? – Adam C Aug 22 '12 at 07:09
  • BTW, if I am getting this right, in the scenario I describe, I think I would want the chunk/stripe of the RAID0 set to X, where X = RA * 512 bytes. Therefore, if I have a chunk/stripe of 64k (the mdadm default), then the minimum RA I should use is 128, because that gets me the entire stripe in one shot. – Adam C Aug 22 '12 at 07:24

Thanks for the explanation. I made some tests with a RAID and LVM setup to prove you are right:

https://fatalfailure.wordpress.com/2017/05/13/where-to-set-readahead-lvm-raid-devices-device-mapper-block-devices

The one that matters is the one the OS is actually using.

– victorgp