I have recently started using LVM on some servers for hard drives larger than 1 TB. They're useful, expandable and quite easy to install. However, I could not find any data about the dangers and caveats of LVM.
What are the downsides of using LVM?
Summary
Risks of using LVM:
- Vulnerable to write caching issues with SSDs, VM hypervisors or old Linux kernels
- Harder to recover data due to more complex on-disk structures
- Harder to resize filesystems correctly
- Snapshots are hard to use, slow and buggy
- Requires some skill to configure correctly, given these issues
The first two LVM issues combine: if write caching isn't working correctly and you have a power loss (e.g. PSU or UPS fails), you may well have to recover from backup, meaning significant downtime. A key reason for using LVM is higher uptime (when adding disks, resizing filesystems, etc), but it's important to get the write caching setup correct to avoid LVM actually reducing uptime.
-- Updated Dec 2019: minor update on btrfs and ZFS as alternatives to LVM snapshots
Mitigating the risks
LVM can still work well if you:
- Get the write caching setup right in the hypervisor, kernel and SSDs/hard drives
- Avoid LVM snapshots where possible, or size them generously
- Use recent LVM tools to resize filesystems (ideally `lvextend -r`)
- Keep good and recent backups
Details
I've researched this quite a bit in the past having experienced some data loss associated with LVM. The main LVM risks and issues I'm aware of are:
Vulnerable to hard disk write caching issues caused by VM hypervisors, disk caching or old Linux kernels, and harder to recover data from due to more complex on-disk structures - see below for details. I have seen complete LVM setups on several disks get corrupted without any chance of recovery; LVM plus hard disk write caching is a dangerous combination.
Keeping write caching disabled for safety
Use ext3 with the `data=ordered` option (or `data=journal` for extra safety), plus `barrier=1` to ensure that kernel caching doesn't affect integrity. (Or use ext4, which enables barriers by default.) This is the simplest option and provides good data integrity at the cost of performance. (Linux changed the default ext3 option to the more dangerous `data=writeback` a while back, so don't rely on the default settings for the FS.)
Also disable hardware write caching: run `hdparm -q -W0 /dev/sdX` for all drives in `/etc/rc.local` (for SATA) or use sdparm for SCSI/SAS. However, according to this entry in the XFS FAQ (which is very good on this topic), a SATA drive may forget this setting after a drive error recovery - so you should use SCSI/SAS, or if you must use SATA then put the hdparm command in a cron job running every minute or so.
Keeping write caching enabled for performance (and coping with lying drives)
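As a sketch of the hdparm approach (the device names are hypothetical; list your actual drives), the `/etc/rc.local` fragment might look like this:

```shell
# /etc/rc.local fragment: disable the write cache on each SATA drive at boot.
# /dev/sda and /dev/sdb are example device names - substitute your own.
for drive in /dev/sda /dev/sdb; do
    hdparm -q -W0 "$drive"
done
```

The cron workaround for SATA drives that forget the setting after error recovery would then be a root crontab line such as `* * * * * hdparm -q -W0 /dev/sda`.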
A more complex but performant option is to keep SSD / hard drive write caching enabled and rely on kernel write barriers working with LVM on kernel 2.6.33+ (double-check by looking for "barrier" messages in the logs).
You should also ensure that the RAID setup, VM hypervisor setup and filesystem use write barriers (i.e. require the drive to flush pending writes before and after key metadata/journal writes). XFS does use barriers by default, but ext3 does not, so with ext3 you should use `barrier=1` in the mount options, and still use `data=ordered` or `data=journal` as above.
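For example, an `/etc/fstab` entry combining these mount options for an ext3 filesystem might look like this (the device name and mount point are hypothetical):

```shell
# /etc/fstab entry: ext3 with barriers enabled and ordered data mode.
/dev/myVGname/myLVname  /srv/data  ext3  barrier=1,data=ordered  0  2
```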
SSDs are problematic because the use of write cache is critical to the lifetime of the SSD. It's best to use an SSD that has a supercapacitor (to enable cache flushing on power failure, and hence enable cache to be write-back not write-through).
Advanced Format drive setup - write caching, alignment, RAID, GPT
Make sure partitions are aligned to the drive's physical sector size; you may need to pass alignment options to `pvcreate` to align the PVs. This LVM email list thread points to the work done in kernels during 2011 and the issue of partial block writes when mixing disks with 512 byte and 4 KiB sectors in a single LV.
Harder to recover data due to more complex on-disk structures:
LVM keeps backup copies of its metadata in `/etc/lvm`, which can help restore the basic structure of LVs, VGs and PVs, but will not help with lost filesystem metadata.
Harder to resize filesystems correctly - easy filesystem resizing is often given as a benefit of LVM, but you need to run half a dozen shell commands to resize an LVM-based FS. This can be done with the whole server still up, and in some cases with the FS mounted, but I would never risk the latter without up-to-date backups and commands pre-tested on an equivalent server (e.g. a disaster recovery clone of the production server).
Update: More recent versions of `lvextend` support the `-r` (`--resizefs`) option - if this is available, it's a safer and quicker way to resize the LV and the filesystem, particularly if you are shrinking the FS, and you can mostly skip this section.
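For instance, growing an LV and its filesystem in one step might look like this (the VG/LV names and size are hypothetical, and this assumes your lvextend has the `--resizefs` option):

```shell
# Grow the LV by 10 GiB and resize the filesystem in the same operation:
lvextend -r -L +10G /dev/myVGname/myLVname
```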
Most guides to resizing LVM-based FSs don't take account of the fact that the FS must be somewhat smaller than the size of the LV: detailed explanation here. When shrinking a filesystem, you will need to specify the new size to the FS resize tool (e.g. `resize2fs` for ext3) and to `lvextend` or `lvreduce`. Without great care, the sizes may be slightly different due to the difference between 1 GB (10^9 bytes) and 1 GiB (2^30 bytes), or the way the various tools round sizes up or down.
If you don't do the calculations exactly right (or use some extra steps beyond the most obvious ones), you may end up with an FS that is too large for the LV. Everything will seem fine for months or years, until you completely fill the FS, at which point you will get serious corruption - and unless you are aware of this issue it's hard to find out why, as you may also have real disk errors by then that cloud the situation. (It's possible this issue only affects reducing the size of filesystems - however, it's clear that resizing filesystems in either direction does increase the risk of data loss, possibly due to user error.)
It seems that the LV size should be larger than the FS size by 2 x the LVM physical extent (PE) size - but check the link above for details as the source for this is not authoritative. Often allowing 8 MiB is enough, but it may be better to allow more, e.g. 100 MiB or 1 GiB, just to be safe. To check the PE size, and your logical volume+FS sizes, using 4 KiB = 4096 byte blocks:
```shell
# Show PE size in KiB:
vgdisplay --units k myVGname | grep "PE Size"

# Size of all LVs, in 4 KiB blocks:
lvs --units 4096b

# Size of the (ext3) FS, assuming a 4 KiB FS block size:
tune2fs -l /dev/myVGname/myLVname | grep 'Block count'
```
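To make the headroom rule concrete, here is a small shell arithmetic sketch (the 10 GiB LV and 4 MiB PE size are made-up example numbers, not recommendations):

```shell
# Example: compute a safe FS size for a 10 GiB LV with 4 MiB physical extents,
# leaving 2 x PE size of headroom, expressed in 4 KiB filesystem blocks.
lv_bytes=$((10 * 1024 * 1024 * 1024))   # LV size: 10 GiB
pe_bytes=$((4 * 1024 * 1024))           # PE size: 4 MiB (check with vgdisplay)
headroom=$((2 * pe_bytes))              # 2 x PE size, per the rule above
safe_fs_blocks=$(( (lv_bytes - headroom) / 4096 ))
echo "$safe_fs_blocks"                  # block count to pass to resize2fs
```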
By contrast, a non-LVM setup makes resizing the FS very reliable and easy - run Gparted and resize the FSs required, and it will do everything for you. On servers, you can use `parted` from the shell.
It's often best to use the Gparted Live CD or Parted Magic, as these have a recent and often more bug-free Gparted & kernel than the distro version - I once lost a whole FS due to the distro's Gparted not updating partitions properly in the running kernel. If using the distro's Gparted, be sure to reboot right after changing partitions so the kernel's view is correct.
Snapshots are hard to use, slow and buggy - if a snapshot runs out of pre-allocated space it is automatically dropped. Each snapshot of a given LV is a delta against that LV (not against previous snapshots), which can require a lot of space when snapshotting filesystems with significant write activity (every snapshot is larger than the previous one). It is safe to create a snapshot LV that's the same size as the original LV, as the snapshot will then never run out of free space.
Snapshots can also be very slow (meaning 3 to 6 times slower than without LVM for these MySQL tests) - see this answer covering various snapshot problems. The slowness is partly because snapshots require many synchronous writes.
Snapshots have had some significant bugs, e.g. in some cases they can make boot very slow, or cause boot to fail completely (because the kernel can time out waiting for the root FS when it's an LVM snapshot [fixed in a Debian `initramfs-tools` update, Mar 2015]).
Snapshot alternatives - filesystems and VM hypervisors
VM/cloud snapshots: if you are running under a VM hypervisor or an IaaS cloud provider, their native snapshots are often a better alternative to LVM snapshots.
Filesystem snapshots:
Filesystem-level snapshots with ZFS or btrfs are easy to use and generally better than LVM snapshots, if you are on bare metal (but ZFS seems a lot more mature, just more hassle to install):
ZFS: there is now a kernel ZFS implementation, which has been in use for some years, and ZFS seems to be gaining adoption. Ubuntu now has ZFS as an 'out of the box' option, including experimental ZFS on root in 19.10.
btrfs: still not ready for production use (even on openSUSE, which ships it by default and has a team dedicated to btrfs), and RHEL has stopped supporting it. btrfs now has an fsck tool (FAQ), but the FAQ recommends you consult a developer if you need to fsck a broken filesystem.
Snapshots for online backups and fsck
Snapshots can be used to provide a consistent source for backups, as long as you are careful with space allocated (ideally the snapshot is the same size as the LV being backed up). The excellent rsnapshot (since 1.3.1) even manages the LVM snapshot creation/deletion for you - see this HOWTO on rsnapshot using LVM. However, note the general issues with snapshots and that a snapshot should not be considered a backup in itself.
You can also use LVM snapshots to do an online fsck: snapshot the LV and fsck the snapshot, while still using the main non-snapshot FS - described here - however, it's not entirely straightforward so it's best to use e2croncheck as described by Ted Ts'o, maintainer of ext3.
You should "freeze" the filesystem temporarily while taking the snapshot - some filesystems such as ext3 and XFS will do this automatically when LVM creates the snapshot.
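Putting the pieces together, a snapshot-based online backup might be sketched like this (all VG/LV names, sizes and paths are hypothetical, and per the sizing advice above the snapshot is created as large as the origin LV):

```shell
# Create a snapshot the same size as the origin LV (so it cannot fill up):
lvcreate --snapshot --size 100G --name myLVname-snap /dev/myVGname/myLVname

# Mount it read-only and back it up while the origin stays in use:
mkdir -p /mnt/snap
mount -o ro /dev/myVGname/myLVname-snap /mnt/snap
rsync -a /mnt/snap/ /backup/myLVname/

# Remove the snapshot as soon as the backup is done:
umount /mnt/snap
lvremove -f /dev/myVGname/myLVname-snap
```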
Conclusions
Despite all this, I do still use LVM on some systems, but for a desktop setup I prefer raw partitions. The main benefit I can see from LVM is the flexibility of moving and resizing FSs when you must have high uptime on a server - if you don't need that, gparted is easier and has less risk of data loss.
LVM requires great care with the write caching setup due to VM hypervisors, hard drive / SSD write caching, and so on - but the same applies to using Linux as a DB server. The lack of support from most tools (`gparted`, including the critical size calculations, `testdisk` etc.) makes it harder to use than it should be.
If using LVM, take great care with snapshots: use VM/cloud snapshots if possible, or investigate ZFS/btrfs to avoid LVM completely - you may find ZFS or btrfs is sufficiently mature compared to LVM with snapshots.
Bottom line: If you don't know about the issues listed above and how to address them, it's best not to use LVM.
I [+1] that post, and at least for me, most of the problems do exist - I've seen them while running a few hundred servers and a few hundred TB of data. To me, LVM2 in Linux feels like a "clever idea" someone had. Like some such ideas, it turns out to be "not clever" at times. For example, not having strictly separated kernel and userspace (lvmtab) states might have felt really smart to do away with, but there can be corruption issues if you don't get the code right.
Well, that separation was there for a reason - the differences show with PV loss handling, and online re-activation of a VG with missing PVs to bring them back into play. What is a breeze on the "original LVMs" (AIX, HP-UX) turns into crap on LVM2, since the state handling is not good enough. And don't even get me talking about quorum loss detection (haha) or state handling (if I remove a disk, it won't be flagged as unavailable - it doesn't even have the damn status column).
Re: stability of pvmove... why is "pvmove data loss" such a top-ranking article on my blog, hmm? Just now I am looking at a disk where the physical LVM data is still hung in its mid-pvmove state. There have been some memleaks, I think, and the general idea that it's a good thing to copy around live block data from userspace is just sad. There's a nice quote from the lvm list: "seems like vgreduce --missing does not handle pvmove". That means, in fact, that if a disk detaches during pvmove, the LVM management tool changes from lvm to vi. Oh, and there has also been a bug where pvmove continues after a block read/write error and in fact no longer writes data to the target device. WTF?
Re: snapshots - the CoW is done unsafely, by updating the NEW data into the snapshot LV area and then merging it back once you delete the snap. This means you have heavy IO spikes during the final merge-back of new data into the original LV and, much more important, you also have a much higher risk of data corruption, because it is not the snapshot that breaks once you hit the wall, but the original.
The advantage is performance: doing 1 write instead of 3. Picking the fast but unsafe algorithm is something one obviously expects from people like VMware and MS; on "Unix" I'd rather have guessed things would be "done right". I didn't see many performance issues as long as the snapshot backing store was on a different disk drive than the primary data (and backups went to yet another one, of course).
Re: barriers - I'm not sure one can blame that on LVM; it was a devmapper issue, as far as I know. But there can be some blame for not really caring about the issue from at least kernel 2.6 until 2.6.33. AFAIK Xen is the only hypervisor that uses O_DIRECT for the virtual machines; the problem used to arise when "loop" was used, because the kernel would still cache with that. VirtualBox at least has a setting to disable such behavior, and Qemu/KVM generally seems to allow caching. All FUSE filesystems also have problems there (no O_DIRECT).
Re: sizes - I think LVM does "rounding" of the displayed size, or it uses GiB. Anyway, you need to take the VG's PE size and multiply it by the LE count of the LV. That should give the correct net size, and that issue is always a usage issue. It is made worse by filesystems that don't notice such a thing during fsck/mount (hello, ext3) or don't have a working online "fsck -n" (hello, ext3).
Of course it's telling that you can't find good sources for such info. "How many LEs for the VRA?" "What is the physical offset for the PVRA, VGDA, ... etc."
Compared to the original one LVM2 is the prime example of "Those who don't understand UNIX are condemned to reinvent it, poorly."
Update, a few months later: I have now tested the "full snapshot" scenario. If a snapshot gets full, it is the snapshot that blocks, not the original LV - I was wrong there when I first posted this. I picked up wrong info from some doc, or maybe I had misunderstood it. In my setups I had always been very paranoid about not letting snapshots fill up, so I was never corrected. It's also possible to extend/shrink snapshots, which is a treat.
What I've still been unable to solve is how to identify a snapshot's age. As for their performance, there is a note on the "thinp" Fedora project page which says the snapshot technique is being revised so that snapshots won't get slower with each additional snapshot. I don't know how they're implementing it.
While providing an interesting window on the state of LVM 10+ years ago, the accepted answer is now totally obsolete. Modern (i.e. post-2012) LVM:
- supports thin provisioning and fast, space-efficient snapshots via `lvmthin`
- supports SSD caching via `lvmcache`, and a fast writeback policy for NVMe/NVDIMM/Optane via `dm-writecache`
- has data compression and deduplication (`vdo`) support thanks to `lvmvdo`
- has integrated per-volume RAID via `lvmraid`
- has an active development and support mailing list at `linux-lvm@redhat.com`
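As an illustration of the thin-provisioning workflow (pool, VG and volume names and sizes are hypothetical), the lvmthin commands might be sketched like this:

```shell
# Create a thin pool inside an existing VG:
lvcreate --type thin-pool --size 100G --name mypool myVGname

# Create a thin LV in the pool (virtual size can exceed the pool size):
lvcreate --thin --virtualsize 500G --name mythinlv myVGname/mypool

# Thin snapshots are fast and need no pre-allocated size:
lvcreate --snapshot --name mythinlv-snap myVGname/mythinlv
```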
Obviously, this does not mean you always have to use LVM - the golden rule for storage is to avoid unneeded layers. For example, for simple virtual machines you can surely go ahead with classical partitions only. But if you value any of the features above, LVM is an extremely useful tool which should be in the toolbox of any serious Linux sysadmin.
If you plan to use snapshots for backups, be prepared for a major performance hit while a snapshot is present. Otherwise it's all good. I've been using LVM in production for a couple of years on dozens of servers, although my main reason to use it is the atomic snapshot, not the ability to expand volumes easily.
BTW, if you're going to use a 1 TB drive, remember partition alignment - such a drive most probably has 4 kB physical sectors.
Adam,
Another advantage: you can add a new physical volume (PV), move all the data to that PV and then remove the old PV without any service interruptions. I've used that capability at least four times in the past five years.
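The online-migration workflow described here can be sketched as follows (device and VG names are hypothetical):

```shell
# Add the new disk to the VG, move all extents off the old disk, then drop it:
pvcreate /dev/sdb1
vgextend myVGname /dev/sdb1
pvmove /dev/sda1             # migrates extents online, no service interruption
vgreduce myVGname /dev/sda1
pvremove /dev/sda1
```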
A disadvantage I didn't see pointed out clearly yet: There's a somewhat steep learning curve for LVM2. Mostly in the abstraction it creates between your files and the underlying media. If you work with just a few people who share chores on a set of servers, you may find the extra complexity overwhelming for your team as a whole. Larger teams dedicated to the IT work will generally not have such a problem.
For example, we use it widely here at my work and have taken time to teach the whole team the basics, the language and the bare essentials about recovering systems that don't boot correctly.
One caution specifically to point out: if you boot from an LVM2 logical volume, you may find recovery operations difficult when the server crashes. Knoppix and friends don't always have the right stuff for that. So we decided that our /boot directory would be on its own partition and would always be small and native.
Overall, I'm a fan of LVM2.
Couple of things:
I've seen folks advocating (on StackExchange and elsewhere) extending VM space laterally: increasing space by adding ADDITIONAL PVs to a VG instead of increasing a SINGLE PV. It's ugly and spreads your filesystem(s) across multiple PVs, creating a dependency on an ever-longer chain of PVs. This is what your filesystems will look like if you scale your VM's storage laterally:
I've seen lots of confusion over this: if a linear LV - and the filesystem which lives in it - is spanned across multiple PVs and one PV is lost, would you experience FULL or PARTIAL data loss? Here's the answer, illustrated:
Logically, this is what we should expect. If the extents holding our LV data are spread across multiple PVs and one of those PVs disappears, the filesystem in that LV would be catastrophically damaged.
I hope these little doodles make a complex subject a bit easier to grasp, along with the risks of working with LVM.