It seems there are two general approaches (at least in the MooseFS and XtreemFS worlds):
The drive-at-a-time approach
For MooseFS, the best way is to use each HDD as a single XFS partition attached to the chunkserver.
We do not recommend using any RAID or LVM configuration.
Why?
The first issue is HDD errors. If one of your hard drives starts slowing down, it is hard to find which one it is under LVM. With MooseFS you can identify it very quickly, even from the MFS master web interface.
The second issue: adding or removing a hard drive in MooseFS is easier than adding it to or removing it from an LVM volume group. Just attach the HDD to the chunkserver, format it as XFS, add it to the chunkserver configuration and reload the chunkserver, and you have extra space on your instance (see the sketch after this list).
The third issue is that MooseFS has much smarter algorithms for placing chunks across many hard disks, so all drives receive balanced traffic; LVM does not do this.
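For illustration, here is a minimal sketch of that add-a-disk workflow, assuming a Linux chunkserver with the default configuration path /etc/mfs/mfshdd.cfg; the device name and mount point are placeholders:

    # Format the new disk as XFS and mount it (device and mount point are examples).
    mkfs.xfs /dev/sdX
    mkdir -p /mnt/mfschunk-new
    mount /dev/sdX /mnt/mfschunk-new
    chown mfs:mfs /mnt/mfschunk-new   # the chunkserver normally runs as the mfs user
    # Register the new directory with the chunkserver and make it reread its disk list.
    echo "/mnt/mfschunk-new" >> /etc/mfs/mfshdd.cfg
    mfschunkserver reload

After the reload, the new directory shows up as additional capacity and the master starts placing chunks on it.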
The volume-at-a-time approach
The XtreemFS OSD (and also the other services) relies on a local file system for data and metadata storage. Thus, on a machine with multiple disks, you have two possibilities. First, you can combine multiple disks on one machine into a single file system, e.g. by using RAID, LVM, or a ZFS pool. Second, each disk (including SSDs, etc.) holds its own local file system and is exported by its own XtreemFS OSD service.
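As a rough sketch of the second option (not an official recipe; the property names and the service start command should be checked against the osdconfig.properties and scripts shipped with your XtreemFS version), a second OSD on the same machine needs its own config with a distinct UUID, ports, and object directory:

    # Hypothetical second OSD instance for a second disk mounted at /mnt/disk2.
    cp /etc/xos/xtreemfs/osdconfig.properties /etc/xos/xtreemfs/osdconfig.disk2.properties
    # Edit the copy so the two OSDs do not collide, e.g.:
    #   uuid        = osd-disk2.example.local
    #   listen.port = 32641
    #   http_port   = 30641
    #   object_dir  = /mnt/disk2/xtreemfs/objs/
    xtreemfs-osd /etc/xos/xtreemfs/osdconfig.disk2.properties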
Both possibilities have their advantages and disadvantages, and I cannot make a general recommendation. The first option brings flexibility in terms of the RAID level used or possibly attached SSD caches. Furthermore, it might be easier to maintain and monitor one OSD process per machine than one process per disk.
Using one OSD server per local disk might result in better performance; when running a RAID of fast SSDs, the XtreemFS OSD might become a bottleneck. You could also spread the load of multiple OSDs on one machine over multiple network interfaces. For replicated files, you have to take care of replica placement and avoid putting multiple replicas of one file on OSDs running on the same hardware. You may have to write a custom OSD selection policy; XtreemFS offers an interface for this.
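For what it is worth, placement policies are attached to a volume with xtfsutil. The sketch below is only illustrative: the flags should be verified against your XtreemFS version, and 1101 is a made-up policy ID standing in for a custom "one replica per physical host" policy that you would have to implement and register yourself.

    # Default replication for new files on this volume: 3 replicas, WqRq policy.
    xtfsutil --set-drp --replication-policy WqRq --replication-factor 3 /mnt/xtreemfs
    # Chain the (hypothetical) custom placement policy into OSD selection.
    xtfsutil --set-osp 1101 /mnt/xtreemfs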
Which seems better?
Based on the response from XtreemFS, it would seem that MooseFS could benefit from the volume-at-a-time approach, but only if you mitigate potential drive failures very well.
Drive-at-a-time has the benefit that, in the event of a single drive failure (which seems to be the most concerning physical failure that can happen), MooseFS's chunk-placement and recovery mechanisms can re-replicate the now under-replicated data and "ignore" the failed drive.
Volume-at-a-time has the benefit of forcing replicated data onto different servers, but it doesn't guarantee even usage across the individual drives.
These answers come from the respective MooseFS and XtreemFS mailing lists; only grammar and readability have been improved, and links to the original threads are provided.