195

I have recently started using LVM on some servers for hard drives larger than 1 TB. LVM volumes are flexible, expandable and quite easy to set up. However, I could not find much information about the dangers and caveats of LVM.

What are the downsides of using LVM?

RichVel
Adam Matan
  • 23
    When reading the answers to this question, bear in mind the date (year) they were posted. A lot happens in 3 years in this industry. – MattBianco Oct 02 '14 at 11:15
  • 2
    I've done some updates recently (Apr 2015) having scanned through to see if anything has changed. The 2.6 kernel is now obsolete, SSDs are more common, but apart from some small LVM fixes not much has really changed. I did write some new stuff on using VM / cloud server snapshots instead of LVM snapshots. The state of write caching, filesystem resizing and LVM snapshots haven't really changed much as far as I can see. – RichVel May 03 '15 at 14:39
  • 1
    regarding the "bear in mind the date" comment -- true enough, but also consider that a lot of "enterprises" are still using RHEL 5 and RHEL 6, both of which are state-of-the-art or older than the date of the answer – JDS May 04 '15 at 16:05

6 Answers

260

Summary

Risks of using LVM:

  • Vulnerable to write caching issues with SSD or VM hypervisor
  • Harder to recover data due to more complex on-disk structures
  • Harder to resize filesystems correctly
  • Snapshots are hard to use, slow and buggy
  • Requires some skill to configure correctly given these issues

The first two LVM issues combine: if write caching isn't working correctly and you have a power loss (e.g. PSU or UPS fails), you may well have to recover from backup, meaning significant downtime. A key reason for using LVM is higher uptime (when adding disks, resizing filesystems, etc), but it's important to get the write caching setup correct to avoid LVM actually reducing uptime.

-- Updated Dec 2019: minor update on btrfs and ZFS as alternatives to LVM snapshots

Mitigating the risks

LVM can still work well if you:

  • Get your write caching setup right in the hypervisor, kernel, and SSDs (a few quick checks are sketched just after this list)
  • Avoid LVM snapshots
  • Use recent LVM versions to resize filesystems
  • Have good backups
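
A few quick checks for the write caching and LVM-version points above - a minimal sketch, assuming /dev/sda is one of your data drives and that hdparm/sdparm and a reasonably recent LVM are installed:

    # Report whether the drive's volatile write cache is currently enabled (SATA):
    hdparm -W /dev/sda

    # SCSI/SAS equivalent - query the Write Cache Enable (WCE) bit:
    sdparm --get=WCE /dev/sda

    # Look for the kernel or device-mapper disabling barriers/flushes
    # (a bad sign on older kernels):
    dmesg | grep -iE 'barrier|flush'

    # Check that your LVM tools are recent enough to offer lvextend --resizefs:
    lvm version
    lvextend --help | grep -i resizefs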

Details

I've researched this quite a bit in the past having experienced some data loss associated with LVM. The main LVM risks and issues I'm aware of are:

LVM is vulnerable to hard disk write caching issues caused by VM hypervisors, drive caches or old Linux kernels, and its more complex on-disk structures make data harder to recover - see below for details. I have seen complete LVM setups on several disks get corrupted without any chance of recovery; LVM plus hard disk write caching is a dangerous combination.

  • Write caching and write re-ordering by the hard drive is important to good performance, but can fail to flush blocks to disk correctly due to VM hypervisors, hard drive write caching, old Linux kernels, etc.
  • Write barriers mean the kernel guarantees that it will complete certain disk writes before the "barrier" disk write, to ensure that filesystems and RAID can recover in the event of a sudden power loss or crash. Such barriers can use a FUA (Force Unit Access) operation to immediately write certain blocks to the disk, which is more efficient than a full cache flush. Barriers can be combined with efficient tagged/native command queuing (issuing multiple disk I/O requests at once) to enable the hard drive to perform intelligent write re-ordering without increasing risk of data loss.
  • VM hypervisors can have similar issues: running LVM in a Linux guest on top of a VM hypervisor such as VMware, Xen, KVM, Hyper-V or VirtualBox can create similar problems to a kernel without write barriers, due to write caching and write re-ordering. Check your hypervisor documentation carefully for a "flush to disk" or write-through cache option (present in KVM, VMware, Xen, VirtualBox and others) - and test it with your setup. Some hypervisors such as VirtualBox have a default setting that ignores any disk flushes from the guest.
  • Enterprise servers with LVM should always use a battery backed RAID controller and disable the hard disk write caching (the controller has battery backed write cache which is fast and safe) - see this comment by the author of this XFS FAQ entry. It may also be safe to turn off write barriers in the kernel, but testing is recommended.
  • If you don't have a battery-backed RAID controller, disabling hard drive write caching will slow writes significantly but make LVM safe. You should also use the equivalent of ext3's data=ordered option (or data=journal for extra safety), plus barrier=1 to ensure that kernel caching doesn't affect integrity. (Or use ext4 which enables barriers by default.) This is the simplest option and provides good data integrity at the cost of performance. (Linux changed the default ext3 option to the more dangerous data=writeback a while back, so don't rely on the default settings for the FS.)
  • To disable hard drive write caching: add hdparm -q -W0 /dev/sdX to /etc/rc.local for all SATA drives, or use sdparm for SCSI/SAS. However, according to this entry in the XFS FAQ (which is very good on this topic), a SATA drive may forget this setting after drive error recovery - so you should use SCSI/SAS, or if you must use SATA then put the hdparm command in a cron job running every minute or so (a sketch of both approaches follows this list).
  • To keep SSD / hard drive write caching enabled for better performance: this is a complex area - see section below.
  • If you are using Advanced Format drives i.e. 4 KB physical sectors, see below - disabling write caching may have other issues.
  • UPS is critical for both enterprise and SOHO but not enough to make LVM safe: anything that causes a hard crash or a power loss (e.g. UPS failure, PSU failure, or laptop battery exhaustion) may lose data in hard drive caches.
  • Very old Linux kernels (2.6.x from 2009): There is incomplete write barrier support in very old kernel versions, 2.6.32 and earlier (2.6.31 has some support, while 2.6.33 works for all types of device target) - RHEL 6 uses 2.6.32 with many patches. If these old 2.6 kernels are unpatched for these issues, a large amount of FS metadata (including journals) could be lost by a hard crash that leaves data in the hard drives' write buffers (say 32 MB per drive for common SATA drives). Losing 32MB of the most recently written FS metadata and journal data, which the kernel thinks is already on disk, usually means a lot of FS corruption and hence data loss.
  • Summary: you must take care in the filesystem, RAID, VM hypervisor, and hard drive/SSD setup used with LVM. You must have very good backups if you are using LVM, and be sure to specifically back up the LVM metadata, physical partition setup, MBR and volume boot sectors. It's also advisable to use SCSI/SAS drives as these are less likely to lie about how they do write caching - more care is required to use SATA drives.
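
As a concrete sketch of the "disable write caching" and hypervisor bullets above - the device names, VM image name and cron schedule are placeholders, so treat this as a starting point and test it on your own hardware:

    # /etc/rc.local (SATA): turn off the volatile write cache on each data drive
    hdparm -q -W0 /dev/sda
    hdparm -q -W0 /dev/sdb

    # /etc/cron.d/disable-write-cache - SATA drives may silently re-enable the
    # cache after error recovery, so re-apply the setting every minute:
    * * * * * root hdparm -q -W0 /dev/sda /dev/sdb

    # SCSI/SAS equivalent: clear the Write Cache Enable bit with sdparm
    sdparm --clear=WCE /dev/sda

    # KVM/QEMU guests: pick a cache mode that passes guest flushes through,
    # e.g. cache=writethrough (safest) or cache=none; avoid cache=unsafe,
    # which ignores guest flushes (guest.img is a placeholder image):
    qemu-system-x86_64 -m 2048 -drive file=guest.img,format=raw,cache=none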

Keeping write caching enabled for performance (and coping with lying drives)

A more complex but performant option is to keep SSD / hard drive write caching enabled and rely on kernel write barriers working with LVM on kernel 2.6.33+ (double-check by looking for "barrier" messages in the logs).

You should also ensure that the RAID setup, VM hypervisor setup and filesystem use write barriers (i.e. require the drive to flush pending writes before and after key metadata/journal writes). XFS uses barriers by default, but ext3 does not, so with ext3 you should use barrier=1 in the mount options, and still use data=ordered or data=journal as above.
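
For example, an /etc/fstab entry along these lines (the LV name and mount point are placeholders) keeps ext3 in ordered mode with barriers enabled:

    # <device>          <mountpoint>  <type>  <options>                        <dump> <pass>
    /dev/myVG/myDataLV  /data         ext3    defaults,barrier=1,data=ordered  0      2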

SSDs are problematic because the use of write cache is critical to the lifetime of the SSD. It's best to use an SSD that has a supercapacitor (to enable cache flushing on power failure, and hence enable cache to be write-back not write-through).

Advanced Format drive setup - write caching, alignment, RAID, GPT

  • With newer Advanced Format drives that use 4 KiB physical sectors, it may be important to keep drive write caching enabled, since most such drives currently emulate 512 byte logical sectors ("512 emulation"), and some even claim to have 512-byte physical sectors while really using 4 KiB.
  • Turning off the write cache of an Advanced Format drive may cause a very large performance impact if the application/kernel is doing 512 byte writes, as such drives rely on the cache to accumulate 8 x 512-byte writes before doing a single 4 KiB physical write. Testing is recommended to confirm any impact if you disable the cache.
  • Aligning the LVs on a 4 KiB boundary is important for performance but should happen automatically as long as the underlying partitions for the PVs are aligned, since LVM Physical Extents (PEs) are 4 MiB by default. RAID must be considered here - this LVM and software RAID setup page suggests putting the RAID superblock at the end of the volume and (if necessary) using an option on pvcreate to align the PVs. This LVM email list thread points to the work done in kernels during 2011 and the issue of partial block writes when mixing disks with 512 byte and 4 KiB sectors in a single LV.
  • GPT partitioning with Advanced Format needs care, especially for boot+root disks, to ensure the first LVM partition (PV) starts on a 4 KiB boundary (the alignment checks are sketched just after this list).
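
A minimal sketch for checking sector sizes and alignment - /dev/sda and the PV details are placeholders, and the mklabel step destroys the existing partition table, so only run it against a blank disk:

    # Physical vs logical sector size as reported by the kernel:
    cat /sys/block/sda/queue/physical_block_size
    cat /sys/block/sda/queue/logical_block_size

    # Create a GPT label and a partition aligned to the optimal I/O boundary:
    parted /dev/sda mklabel gpt
    parted -a optimal /dev/sda mkpart primary 1MiB 100%

    # Check PV data alignment - pe_start should be a multiple of 4 KiB
    # (usually 1 MiB on recent LVM versions):
    pvs -o +pe_start

    # If necessary, force the PV data alignment explicitly:
    pvcreate --dataalignment 4M /dev/sda1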

Harder to recover data due to more complex on-disk structures:

  • Any recovery of LVM data required after a hard crash or power loss (due to incorrect write caching) is a manual process at best, because there are apparently no suitable tools. LVM is good at backing up its metadata under /etc/lvm, which can help restore the basic structure of LVs, VGs and PVs, but will not help with lost filesystem metadata.
  • Hence a full restore from backup is likely to be required. This involves a lot more downtime than a quick journal-based fsck when not using LVM, and data written since the last backup will be lost.
  • TestDisk, ext3grep, ext3undel and other tools can recover partitions and files from non-LVM disks but they don't directly support LVM data recovery. TestDisk can discover that a lost physical partition contains an LVM PV, but none of these tools understand LVM logical volumes. File carving tools such as PhotoRec and many others would work as they bypass the filesystem to re-assemble files from data blocks, but this is a last-resort, low-level approach for valuable data, and works less well with fragmented files.
  • Manual LVM recovery is possible in some cases, but is complex and time consuming - see this example and this, this, and this for how to recover (a minimal metadata backup/restore sketch follows this list).
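
A minimal sketch of backing up and restoring the LVM metadata mentioned above - the VG name, device and backup path are placeholders, and note that this only recreates the LVM structures, not the filesystem contents:

    # Take an explicit metadata backup (LVM also does this automatically in
    # /etc/lvm/backup on most changes - keep a copy off the machine):
    vgcfgbackup -f /root/lvm-backup-myVG myVG

    # After replacing a failed disk, recreate the PV with its old UUID (taken
    # from the backup file) and restore the VG layout onto it:
    OLD_UUID="replace-with-the-PV-UUID-from-the-backup-file"
    pvcreate --uuid "$OLD_UUID" --restorefile /root/lvm-backup-myVG /dev/sdb1
    vgcfgrestore -f /root/lvm-backup-myVG myVG
    vgchange -ay myVG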

Harder to resize filesystems correctly - easy filesystem resizing is often given as a benefit of LVM, but you need to run half a dozen shell commands to resize an LVM-based FS. This can be done with the whole server still up, and in some cases with the FS mounted, but I would never risk the latter without up-to-date backups and commands pre-tested on an equivalent server (e.g. a disaster recovery clone of the production server).

  • Update: More recent versions of lvextend support the -r (--resizefs) option - if this is available, it's a safer and quicker way to resize the LV and the filesystem together, particularly if you are shrinking the FS, and you can mostly skip this section (a short resize sketch follows this list).

  • Most guides to resizing LVM-based FSs don't take account of the fact that the FS must be somewhat smaller than the size of the LV: detailed explanation here. When shrinking a filesystem, you will need to specify the new size to the FS resize tool, e.g. resize2fs for ext3, and to lvextend or lvreduce. Without great care, the sizes may be slightly different due to the difference between 1 GB (10^9) and 1 GiB (2^30), or the way the various tools round sizes up or down.

  • If you don't do the calculations exactly right (or use some extra steps beyond the most obvious ones), you may end up with an FS that is too large for the LV. Everything will seem fine for months or years, until you completely fill the FS, at which point you will get serious corruption - and unless you are aware of this issue it's hard to find out why, as you may also have real disk errors by then that cloud the situation. (It's possible this issue only affects reducing the size of filesystems - however, it's clear that resizing filesystems in either direction does increase the risk of data loss, possibly due to user error.)

  • It seems that the LV size should be larger than the FS size by 2 x the LVM physical extent (PE) size - but check the link above for details as the source for this is not authoritative. Often allowing 8 MiB is enough, but it may be better to allow more, e.g. 100 MiB or 1 GiB, just to be safe. To check the PE size, and your logical volume+FS sizes, using 4 KiB = 4096 byte blocks:

    # Shows PE size in KiB:
    vgdisplay --units k myVGname | grep "PE Size"

    # Size of all LVs, in 4 KiB blocks:
    lvs --units 4096b

    # Size of the (ext3) FS, assuming a 4 KiB FS block size:
    tune2fs -l /dev/myVGname/myLVname | grep 'Block count'

  • By contrast, a non-LVM setup makes resizing the FS very reliable and easy - run Gparted and resize the FSs as required, and it will do everything for you. On servers, you can use parted from the shell.

  • It's often best to use the Gparted Live CD or Parted Magic, as these have a recent and often more bug-free Gparted & kernel than the distro version - I once lost a whole FS due to the distro's Gparted not updating partitions properly in the running kernel. If using the distro's Gparted, be sure to reboot right after changing partitions so the kernel's view is correct.
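
A minimal resize sketch reflecting the caveats above - the VG/LV names and sizes are placeholders, growing is much safer than shrinking, and you should have tested backups before doing either:

    # Safest route where available: grow the LV and the filesystem in one step
    lvextend --resizefs -L +10G /dev/myVG/myLV

    # Manual route (ext3/ext4) when --resizefs is not available:
    lvextend -L +10G /dev/myVG/myLV
    resize2fs /dev/myVG/myLV            # grow the FS to fill the LV

    # Shrinking is riskier: unmount, check, shrink the FS well below the target
    # LV size, reduce the LV, then grow the FS back up to the LV:
    umount /dev/myVG/myLV
    e2fsck -f /dev/myVG/myLV
    resize2fs /dev/myVG/myLV 20G        # FS smaller than the intended LV size
    lvreduce -L 21G /dev/myVG/myLV      # LV left ~1 GiB larger than the FS
    resize2fs /dev/myVG/myLV            # optionally expand the FS to the LV again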

Snapshots are hard to use, slow and buggy - if a snapshot runs out of pre-allocated space it is automatically dropped. Each snapshot of a given LV is a delta against that LV (not against previous snapshots), which can require a lot of space when snapshotting filesystems with significant write activity (every snapshot is larger than the previous one). It is safe to create a snapshot LV that's the same size as the original LV, as the snapshot will then never run out of free space.

Snapshots can also be very slow (meaning 3 to 6 times slower than without LVM for these MySQL tests) - see this answer covering various snapshot problems. The slowness is partly because snapshots require many synchronous writes.

Snapshots have had some significant bugs, e.g. in some cases they can make boot very slow, or cause boot to fail completely (because the kernel can time out waiting for the root FS when it's an LVM snapshot [fixed in Debian initramfs-tools update, Mar 2015]).

  • However, many snapshot race condition bugs were apparently fixed by 2015.
  • LVM without snapshots generally seems quite well debugged, perhaps because snapshots aren't used as much as the core features.

Snapshot alternatives - filesystems and VM hypervisors

VM/cloud snapshots:

  • If you are using a VM hypervisor or an IaaS cloud provider (e.g. VMware, VirtualBox or Amazon EC2/EBS), their snapshots are often a much better alternative to LVM snapshots. You can quite easily take a snapshot for backup purposes (but consider freezing the FS before you do).

Filesystem snapshots:

  • Filesystem-level snapshots with ZFS or btrfs are easy to use and generally better than LVM snapshots, if you are on bare metal (but ZFS seems a lot more mature, just more hassle to install):

  • ZFS: there is now a kernel ZFS implementation, which has been in use for some years, and ZFS seems to be gaining adoption. Ubuntu now has ZFS as an 'out of the box' option, including experimental ZFS on root in 19.10.

  • btrfs: still not ready for production use (even on openSUSE, which ships it by default and has a team dedicated to btrfs), and RHEL has stopped supporting it. btrfs now has an fsck tool (FAQ), but the FAQ recommends you consult a developer if you need to fsck a broken filesystem.

Snapshots for online backups and fsck

Snapshots can be used to provide a consistent source for backups, as long as you are careful with space allocated (ideally the snapshot is the same size as the LV being backed up). The excellent rsnapshot (since 1.3.1) even manages the LVM snapshot creation/deletion for you - see this HOWTO on rsnapshot using LVM. However, note the general issues with snapshots and that a snapshot should not be considered a backup in itself.

You can also use LVM snapshots to do an online fsck: snapshot the LV and fsck the snapshot, while still using the main non-snapshot FS - described here - however, it's not entirely straightforward so it's best to use e2croncheck as described by Ted Ts'o, maintainer of ext3.

You should "freeze" the filesystem temporarily while taking the snapshot - some filesystems such as ext3 and XFS will do this automatically when LVM creates the snapshot.

Conclusions

Despite all this, I do still use LVM on some systems, but for a desktop setup I prefer raw partitions. The main benefit I can see from LVM is the flexibility of moving and resizing FSs when you must have high uptime on a server - if you don't need that, gparted is easier and has less risk of data loss.

LVM requires great care over the write caching setup due to VM hypervisors, hard drive / SSD write caching, and so on - but the same applies to using Linux as a DB server. The lack of support from most tools (gparted, including the critical size calculations, testdisk, etc.) makes it harder to use than it should be.

If using LVM, take great care with snapshots: use VM/cloud snapshots if possible, or investigate ZFS/btrfs to avoid LVM completely - you may find ZFS or btrfs is sufficiently mature compared to LVM with snapshots.

Bottom line: If you don't know about the issues listed above and how to address them, it's best not to use LVM.

RichVel
  • 4
    Online resizing with xfs works perfectly, you do not even have to specify the size. It will grow to the size of the LV read more in xfs_grow(5). OTOH I hit +1 for the summary on write barriers. – cstamas Jun 12 '11 at 10:24
  • @cstamas: not sure if XFS suffers from the same issues when reducing size of FS as ext3 - as long as it stays within LVM's available blocks within LV it should be fine. The corruption issue I mention is probably due to the interaction of `resize2fs` and `lvreduce/lvextend`, not LVM itself, and it is quite hard to pin down the exact problem - probably it's to do with specifying sizes that are calculated in MB by one tool and MiB by another tool, giving slight mismatches leading to an FS that is too large for the resized LV. – RichVel Jun 12 '11 at 10:41
  • 1
    @RichVel AFAIK there is no shrink support in XFS. At first I thought this was a big deal, but in reality I only needed it a few times. – cstamas Jun 12 '11 at 14:21
  • 1
    shouldn't your second bullet be *enable* with battery backup? The next bullet is *disable* without battery. Also, as others have said, very very few filesystems require specifying the size for the resize operation. Most find that automatically. – TREE Jun 12 '11 at 14:38
  • 3
    DUDE! Where have you been all my life!? – songei2f Jun 12 '11 at 18:49
  • 3
    @TREE: the idea with a battery-backed RAID controller is that its cache is persistent across power failures and can generally be trusted to work as documented, whereas some hard disk caches lie about whether they have actually written a block to disk, and of course these caches aren't persistent. If you leave the hard disk cache enabled you are vulnerable to a sudden power failure (e.g PSU or UPS fails), which is protected against by the RAID controller's battery backup. – RichVel Jun 12 '11 at 20:39
  • @RichVel Ah. I was confusing the disk cache and the raid controller cache. Interesting twist. – TREE Jun 12 '11 at 23:54
  • 8
    One of the best answers I have ever seen, any topic. Only change I would make, move summary to the TOP of the question for those with attention deficit disorder or not a lot of time. :-) – Prof. Falken Jun 14 '11 at 07:06
  • 1
    Excellent and useful answer. I've myself had some problems easily recovering data on an LVM partition due to its complexity compared to a simpler FS, e.g. ext4. Moreover, checking the integrity of LVM partitions and recovering them is not as easy as it should be – Razique Aug 16 '11 at 07:41
  • Explain me what is "LVM write caching"(?) :-) May be then I will as well consider this answer valuable. – poige Aug 17 '11 at 01:43
  • Does `lvresize -r` have the same issues mentioned in bullet 3? If not, bullet 3 is inapplicable. – JCallicoat Aug 30 '11 at 15:12
  • `lvresize -r` (aka `--resizefs` option) is interesting as it is supposed to resize the volume and the FS in one operation, but it seems this option is not well documented: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=433476 - personally, I wouldn't risk a production filesystem without some significant testing on dummy filesystems. The same issues apply, it's just that this feature may automate the whole process and avoid mistakes, as long as it's well debugged. – RichVel Aug 30 '11 at 18:04
  • More on `lvresize -r` - seems like it only started working in lvm2 (upstream) somewhat recently: https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/174032 - there are reports of `lvresize -r` failing with an fsadm error. – RichVel Aug 30 '11 at 18:15
  • What does all this about hard disk write caches have to do with LVM? That's an old problem, with or without LVM, and I don't see a connection in this writeup. – Greg Price Sep 19 '11 at 07:45
  • The connection is that LVM on older kernels (including many used in RHEL 5) doesn't have write barriers that work, and there are many other ways in which synchronous writes don't work with LVM. If you get this setup wrong, any power loss causes disk corruption. And because LVM has more complex on-disk structures making disk recovery very hard, you usually need to recover from backup, meaning significant downtime. So the bottom line is "get all write caches working properly with LVM or you may have significant downtime". This is explained further in some of the links. – RichVel Sep 24 '11 at 09:46
  • Regarding barriers on ext3/4: these filesystems also embed checksums in the journal - therefore you're getting about the same guarantees using data=writeback,barriers=0 as you do with XFS and barriers enabled (both will only implement meta-data journalling). While using barriers is probably a good idea with data journalling (data=ordered/data=journal) if you want this kind of reliability then BTRFS looks like a much more attractive option. – symcbean Jun 19 '12 at 13:14
  • @symcbean, I believe that ext3 doesn't implement journal checksumming - see http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal and discussion here: https://lwn.net/Articles/350175/. However, ext4 does implement this, via the underlying JBD2 layer (see 2nd link). Don't agree that using data=writeback,barriers=0 is equivalent to XFS with those options. In any case, the options you talk about (other than data=journal which logs all data blocks to journal) don't affect LVM metadata or data blocks so they don't really help with LVM data integrity. – RichVel Jun 20 '12 at 07:28
  • @symcbean, BTRFS has some nice features but it's still a bit early for production use (only partial fsck, see above) and you still need a correct write cache setup with BTFS in any case. LVM snapshots can be avoided with BTRFS at least, and BTRFS does support some LVM features such as multiple drives (https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices), but you may still want LVM while migrating, and to provide some resilience against BTRFS problem by splitting the drives into multiple filesystems. – RichVel Jun 20 '12 at 07:33
  • @RichVel: Yes, I was wrong about checksums in JBD (ext3) – symcbean Jun 20 '12 at 09:22
  • @RichVel: Very comprehensive answer! Since ext4 implements journal checksumming and has write barriers on per default (according to http://kernelnewbies.org/Ext4), is it then safe to use ext4 on LVM with default mount options? – mgd Feb 15 '13 at 19:09
  • @mgd: Not sure about ext4 default mount option, kernel updates and distros can change the defaults, so it's best to specify the options you want. LWN is a good source for latest info - see https://lwn.net/Articles/521803/ which says "Journal checksumming is an optional feature, not enabled by default, and, evidently, not widely used" (can't imagine why it's not default). Of course you'll need to make sure rest of stack (hypervisor, kernel, disk / SSD, RAID) also respects write barriers, as mentioned. And preferably test the results with one of the tools linked above. – RichVel Feb 17 '13 at 10:21
  • @mgd: it's possible that ext4 journal checksums would remove the need for one write barrier, but really you should confirm on LWN or kernel email list. See https://lwn.net/Articles/283161/ - rather old but good info. Confirmation that ext4 does enable write barriers by default: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt (not sure which kernel version started this) – RichVel Feb 17 '13 at 10:31
  • @mgd: small update done to answer covering ext4's use of write barriers by default - thanks for the comment. – RichVel Feb 17 '13 at 16:45
  • @RichVel: Thanks for your time and the info. My current ext4 options are `rw,relatime,user_xattr,barrier=1,data=ordered` (resulting from default + `relatime`). I will add `journal_checksum,barrier=1,data=ordered` explicitly to make sure they are always enabled. From what I read here http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal journal checksumming should help against corruption caused by reordering (e.g. done by the drive). Also, it seems that what you write in your answer is not only relevant for LVM but just more important than for an FS on a plain partition. Correct? – mgd Feb 17 '13 at 21:36
  • @mgd: journal checksums happen in many scenarios, and are not just for LVM, as you say. – RichVel Feb 18 '13 at 10:09
  • Very helpful thanks, +1, but I'd like to make one point: saying "RHEL 6 uses 2.6.32" is a bit misleading ["there is no one kernel version number that accurately represents the RHEL 6 kernel"](http://www.serverwatch.com/article.php/3880131/The-Red-Hat-Enterprise-Linux-6-Kernel-What-Is-It.htm). Is it really the case that any release of RHEL6 has this issue? I suspect not, and I'd be very surprised if 6.4 does even though it is nominally also based on 2.6.32. –  Mar 29 '13 at 16:07
  • 1
    @JackDouglas, you may well be right that Red Hat has backported some code from later releases and in any case 2.6.32 does support most of the barrier cases for LVM - see the last comments on this kernel bug page for what was added in 2.6.33: https://bugzilla.kernel.org/show_bug.cgi?id=9554. The way to be sure is to look in your error logs for 'barriers disabled' messages from the device mapper that underlies LVM. Something like 'barrier based sync failed'. – RichVel Mar 29 '13 at 16:30
  • Debian 6.0 squeeze does have an issue in 2.6.32 with write caching on LVM volumes at least when it's a Xen guest - I get the barrier-based sync failed error. Kernel is 2.6.32-5-xen-686. – RichVel Apr 04 '13 at 15:37
  • 2
    Seeing all the comments and the last update to the answer was a year ago, I was wondering if the answer could be updated to reflect any new changes in reliability, performance and ease of use. – Luis Alvarado Sep 19 '14 at 01:41
  • 3
    I've included corrections/updates from existing comments where applicable. Haven't been using LVM so much recently, but I don't recall seeing any major changes based on LWN.net stories, which track this sort of thing quite closely. ZFS on Linux is now more mature (but still better on FreeBSD or Solaris), and btrfs is still some way from real production maturity despite being used by some Linux distributions. So I don't see any changes that need to be included right now, but I'm happy to listen! – RichVel Sep 20 '14 at 07:22
  • Added a link on why LVM snapshots are slow. – RichVel Jul 11 '15 at 14:13
  • 1
    Great, after that answer I will never ever use LVM, and also will have perfect referral to other people that will trying to do this! – Mazeryt Apr 05 '16 at 15:02
  • Would I be correct to assume that if the LVM structure is static during power loss, then this boils down to *filesystem* data loss - not much different than without LVM, except for the more complex data structure? I understand some of this is mitigated on ext4, as noted by others. What about more recent kernel versions: 5.x, or at least 4.x? Also, a couple notes: 1) ext4 also added metadata checksums, though I'm not sure if they're stable enough for production yet. 2) `vgconvert --pvmetadatacopies` might help in some scenarios. It enables a 2nd copy of the LVM metadata. – MichaelK Feb 13 '21 at 08:31
  • 1
    @MichaelK if your LVM data structures on disk are static at the point of power loss, then there is little chance of data loss, and it's similar to filesystem data loss as you say. The issues here are about ensuring data gets written to disk at the right point, in a system with many active writes. Not sure that new kernel versions make much difference - I have mostly relied on [LWN.net](https://lwn.net/) which summarises major developments in Linux including LVM. ext4 metadata checksums are useful but don't affect LVM. – RichVel Feb 14 '21 at 08:09
  • `vgconvert` is [no longer part of LVM](https://manpages.ubuntu.com/manpages/focal/en/man8/vgconvert.8.html) as it's from the original LVM1, not the current LVM2 - you can ensure extra LVM metadata copies at volume creation time with [pvcreate](https://manpages.ubuntu.com/manpages/precise/man8/pvcreate.8.html), which is a good idea. – RichVel Feb 14 '21 at 08:10
  • Thank you for this excellent writeup. I wonder though, now that a decade has passed since it was published, where do we stand with LVM snapshots, resilience and other painpoints? – dyasny May 25 '21 at 13:20
  • @dyasny - there was an update in Dec 2019, but I don't really use LVM these days since mostly using Linux with cloud volumes. LVM is quite mature I believe, and I don't think there has been major feature work in the last 10 years or so - I haven't noticed many stories on [LWN.net](https://LWN.net). – RichVel Jul 14 '21 at 13:27
15

I [+1] that post, and at least for me, most of the problems do exist - I've seen them while running a few hundred servers and a few hundred TB of data. To me LVM2 in Linux feels like a "clever idea" someone had. Like some such ideas, it turns out to be "not clever" at times. E.g. doing away with strictly separated kernel and userspace (lvmtab) states might have felt really smart, but it can lead to corruption issues (if you don't get the code right).

Well, this separation was there for a reason - the differences show in PV loss handling, and in online re-activation of a VG with missing PVs to bring them back into play. What is a breeze on the "original LVMs" (AIX, HP-UX) turns into crap on LVM2, since the state handling is not good enough. And don't even get me started on quorum loss detection (haha) or state handling (if I remove a disk, it won't be flagged as unavailable - it doesn't even have the damn status column).

Re: stability pvmove... why is

pvmove data loss

such a top-ranking article on my blog, hmmm? Just now I am looking at a disk where the physical LVM data is still stuck in a mid-pvmove state. There have been some memory leaks, I think, and the general idea that it's a good thing to copy live block data around from userspace is just sad. A nice quote from the lvm list: "seems like vgreduce --missing does not handle pvmove". That means in fact that if a disk detaches during pvmove, the LVM management tool changes from lvm to vi. Oh, and there has also been a bug where pvmove continues after a block read/write error and in fact no longer writes data to the target device. WTF?

Re: Snapshots The CoW is done unsafely, by updating the NEW data into the snapshot LV area and then merging it back once you delete the snap. This means you have heavy IO spikes during the final merge-back of new data into the original LV and, much more importantly, you of course also have a much higher risk of data corruption, because when you hit the wall it is not the snapshot that breaks, but the original.

The advantage is performance: doing 1 write instead of 3. Picking the fast but less safe algorithm is something one obviously expects from people like VMware and MS; on "Unix" I'd rather expect things to be "done right". I didn't see many performance issues as long as the snapshot backing store is on a different disk drive than the primary data (and backups go to yet another one, of course).

Re: Barriers I'm not sure if one can blame that on LVM - it was a devmapper issue, as far as I know. But there can be some blame for not really caring about this issue from at least kernel 2.6 until 2.6.33. AFAIK Xen is the only hypervisor that uses O_DIRECT for the virtual machines; the problem used to be when "loop" was used, because the kernel would still cache with that. VirtualBox at least has a setting to disable this kind of thing, and Qemu/KVM generally seems to allow caching. All FUSE filesystems also have problems there (no O_DIRECT).

Re: Sizes I think LVM does "rounding" of the displayed size, or it uses GiB. Anyway, you need to use the VG PE size and multiply it by the LE number of the LV. That should give the correct net size, and that issue is always a usage issue. It is made worse by filesystems that don't notice such a thing during fsck/mount (hello, ext3) or don't have a working online "fsck -n" (hello, ext3).

Of course it's telling that you can't find good sources for such info: "how many LE for the VRA?", "what is the physical offset for PVRA, VGDA, ... etc."

Compared to the original one LVM2 is the prime example of "Those who don't understand UNIX are condemned to reinvent it, poorly."

Update a few months later: I have by now hit the "full snapshot" scenario in a test. If a snapshot gets full, the snapshot blocks, not the original LV. I was wrong there when I first posted this - I picked up wrong info from some doc, or maybe I had misunderstood it. In my setups I'd always been very paranoid about not letting them fill up, so I never got corrected. It's also possible to extend/shrink snapshots, which is a treat.

What I've still been unable to solve is how to identify a snapshot's age. As for their performance, there is a note on the "thinp" Fedora project page which says the snapshot technique is being revised so that snapshots won't get slower with each additional snapshot. I don't know how they're implementing it.

Florian Heigl
  • Good points, particularly on the pvmove data loss (didn't realise this could crash under low memory) and snapshot design. On write barriers/caching: I was conflating LVM and the kernel device mapper as from the user point of view they work together to deliver what LVM provides. Upvoted. Also liked your blog posting on pvmove data loss: http://deranfangvomende.wordpress.com/2009/12/28/a-primer-on-risking-data-loss-with-pvmove/ – RichVel Feb 03 '12 at 06:57
  • On snapshots: they are notoriously slow in LVM, so clearly it wasn't a good design decision to go for performance over reliability. By "hit the wall", did you mean the snapshot filling up, and can that really cause corruption of the original LV data? The LVM HOWTO says that the snapshot is dropped in this case: http://tldp.org/HOWTO/LVM-HOWTO/snapshots_backup.html – RichVel Feb 03 '12 at 07:01
  • 5
    "The CoW is done unsafely, by updating the NEW data into the snapshot lv area and then merging back once you delete the snap." This is just false. When new data is written to the original device, the *old* version is written into the snapshots COW area. No data is ever merged back (except if you want to). See http://www.kernel.org/doc/Documentation/device-mapper/snapshot.txt for all the gory technical details. – Damien Tournoud Jan 30 '13 at 21:11
  • Hi Damien, next time just read on to the point where I corrected my post? – Florian Heigl Mar 17 '13 at 19:35
  • As I understand, nowadays the *original* data is copied into the snapshot, and then the new data is written into the original volume (unless you mount the snapshot directly and write there - but that's a different use case). Don't know how it was 10 years ago. – MichaelK Feb 13 '21 at 09:02
14

While providing an interesting window on the state of LVM 10+ years ago, the accepted answer is now totally obsolete. Modern (i.e. post-2012) LVM:

  • honors sync/barrier requests
  • has fast and reliable snapshot capability in the form of lvmthin (sketched just after this list)
  • has stable SSD caching via lvmcache, and a fast writeback policy for NVMe/NVDIMM/Optane via dm-writecache
  • supports the virtual data optimizer (VDO) thanks to lvmvdo
  • supports integrated, per-volume RAID thanks to lvmraid
  • aligns automatically to 1 MB or 4 MB (depending on the version), completely avoiding any 4K alignment issue (unless using LVM over a misaligned partition)
  • has excellent support for volume extension, especially when done by adding other block devices (which is simply not possible when using a classical filesystem such as ext4/xfs on top of a plain partition)
  • has an excellent, friendly and extremely useful mailing list at linux-lvm@redhat.com
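
As a rough sketch of the lvmthin workflow mentioned above - the VG, pool and volume names and sizes below are placeholders:

    # Create a thin pool inside an existing VG, then a thin volume in it:
    lvcreate --type thin-pool -L 100G -n tpool myVG
    lvcreate --thin -V 50G -n thinvol myVG/tpool

    # Thin snapshots need no pre-allocated size and do not slow down as they
    # accumulate (unlike classic LVM snapshots):
    lvcreate -s -n thinvol_snap myVG/thinvol

    # Thin snapshots are created with the activation-skip flag set, so activate
    # explicitly before mounting:
    lvchange -ay -K myVG/thinvol_snap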

Obviously, this does not mean you always have to use LVM - the golden rule for storage is to avoid unneeded layers. For example, for simple virtual machines you can surely go ahead with classical partitions only. But if you value any of the features above, LVM is an extremely useful tool which should be in the toolbox of any serious Linux sysadmin.

shodanshok
13

If you plan to use snapshots for backups, be prepared for a major performance hit while a snapshot is present. Otherwise it's all good. I've been using LVM in production for a couple of years on dozens of servers, although my main reason to use it is the atomic snapshots, not the ability to expand volumes easily.

BTW, if you're going to use a 1 TB drive, remember about partition alignment - this drive most probably has 4 KiB physical sectors.

Gordan Bobić
pQd
  • +1 for performance warning for open snapshots. – Prof. Falken Jun 14 '11 at 07:08
  • my experience is that 1TB drives usually use 512 byte sectors, but most 2TB drives use 4Kb. – Dan Pritts Sep 24 '12 at 20:09
  • @DanPritts there's no harm in assuming that sector size is 4kB or even 128kB - just in case there's raid in between. you lose so little - maybe that 128kB and you can gain a lot. also when imaging from the old disk to a new one. – pQd Sep 24 '12 at 20:50
  • 1
    There is some minor harm to making filesystem block size "too big"; each file is contained in no less than a single block. If you've got a lot of tiny files and 128KB blocks it will add up. I agree though that 4K is quite reasonable, and if you move a filesystem to new hardware, you will end up with 4k sectors eventually. – Dan Pritts Sep 25 '12 at 14:58
  • 1
    (won't let me edit my previous comment)...A waste of space may not matter, but it will end up increasing your average seek time on spinning disks. It might possibly turn into write amplification (filling out the sector with nulls) on SSDs. – Dan Pritts Sep 25 '12 at 15:03
  • @DanPritts: see the Advanced Format section of my answer for coverage of 4 KiB sector drives - there's a mix of Advanced Format support for same size of drive. Some current drives have 4 KiB sectors but present both physical and logical sector sizes of 512 bytes to the OS, making it very hard for the kernel to get it right. – RichVel Feb 27 '13 at 07:53
  • Wasn't disagreeing with your premise - i'm all too aware of the 4k block size issue and drives that lie about their sector size. My experience though has been that 1TB drives have real 512b sectors but 2TB drives often have 4k physical and 512b logical sectors. Also, of course, just pointing out that there can be a minor downside to making filesystem blocks larger in some situations. – Dan Pritts Feb 27 '13 at 20:46
  • If you create snapshots from a recent lvm with thin logical volumes (lvcreate --thin), snapshots will use the "multisnap" implementation which shouldn't degrade write performance of the original volume. – Tobu Mar 04 '13 at 13:00
  • @Tobu - thanks for that note! could you please refer me to some source? – pQd Mar 04 '13 at 21:53
  • You should never treat a snapshot as backup (at least not as a complete solution). It only backs up blocks that were written to. If there's corruption in any other block, a snapshot won't help you. They are useful as a *source* for backups though. – MichaelK Feb 13 '21 at 09:11
  • @MichaelK - i agree with you. LVM here would provide crash consistent snapshots that can be transferred to another machine / offline. it would not hold them permanently as a backup. – pQd Feb 17 '21 at 11:18
5

Adam,

Another advantage: you can add a new physical volume (PV), move all the data to that PV and then remove the old PV without any service interruptions. I've used that capability at least four times in the past five years.
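
A minimal sketch of that migration - the device and VG names are placeholders, and note the pvmove caveats raised in Florian Heigl's answer before moving live data:

    # Prepare the new disk and add it to the volume group:
    pvcreate /dev/sdb1
    vgextend myVG /dev/sdb1

    # Move all extents off the old PV (filesystems can stay mounted):
    pvmove /dev/sda1

    # Drop the old PV from the VG and wipe its LVM label:
    vgreduce myVG /dev/sda1
    pvremove /dev/sda1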

A disadvantage I didn't see pointed out clearly yet: There's a somewhat steep learning curve for LVM2. Mostly in the abstraction it creates between your files and the underlying media. If you work with just a few people who share chores on a set of servers, you may find the extra complexity overwhelming for your team as a whole. Larger teams dedicated to the IT work will generally not have such a problem.

For example, we use it widely here at my work and have taken time to teach the whole team the basics, the language and the bare essentials about recovering systems that don't boot correctly.

One caution specifically to point out: if you boot from an LVM2 logical volume, you may find recovery operations difficult when the server crashes. Knoppix and friends don't always have the right stuff for that. So, we decided that our /boot directory would be on its own partition and would always be small and native.

Overall, I'm a fan of LVM2.

Mike Diehn
  • 3
    keeping `/boot` separate is always a good idea – Hubert Kario Aug 30 '11 at 15:46
  • 3
    GRUB2 does support booting from an LVM logical volume (see https://wiki.archlinux.org/index.php/GRUB2#LVM) but GRUB1 does not. I would always use a separate non-LVM /boot just to ensure it's easy to recover. Most rescue disks these days do support LVM - some require a manual `vgchange -ay` to find the LVM volumes. – RichVel Sep 01 '11 at 09:30
  • 1
    on pvmove: see the point about pvmove data loss made in Florian Heigl's answer. – RichVel Feb 03 '12 at 07:05
2

Couple of things:

Spanning LVs across Multiple PVs

I've seen folks advocating (on StackExchange and elsewhere) extending VM space laterally: increasing space by adding ADDITIONAL PVs to a VG rather than increasing a SINGLE PV. It's ugly and spreads your filesystem(s) across multiple PVs, creating a dependency on an ever-longer chain of PVs. This is what your filesystems will look like if you scale your VM's storage laterally:

(Illustration: adding additional PVs vs increasing a single PV)

Data loss if a PV lost hosting Part of a Spanned LV

I've seen lots of confusion over this. If a linear LV - and the filesystem which lives on it - is spanned across multiple PVs, would you experience FULL or PARTIAL data loss if one PV is lost? Here's the answer, illustrated:

(Illustration: data loss for a spanned LV if a PV is lost)

Logically, this is what we should expect. If the extents holding our LV data are spread across multiple PVs and one of those PVs disappears, the filesystem in that LV would be catastrophically damaged.
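
To see which PVs a given LV actually spans (and therefore what you stand to lose if one of them dies), something like this works - the VG/LV names are placeholders:

    # List LVs together with the underlying devices their extents live on:
    lvs -o +devices myVG

    # Per-segment map of a single LV:
    lvdisplay -m /dev/myVG/myLV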

Hopefully these little doodles make a complex subject a bit easier to follow and help with understanding the risks when working with LVM.

F1Linux