
I have a separate drive for each of my ceph OSD servers. Each OSD host has 4 data drives. Does one journal drive serve the 4? Is the journal drive shared? Should there be a partition for each data drive?

HorseHair

1 Answer


Journal/data separation

If you have just these four drives per OSD host, and all drives have similar performance, then the usual/recommended setup would be to have one OSD per disk (i.e. 4 per server), and each OSD would have its journal file on the same disk as the data.

Another popular (at least historically) setup is to have journals on separate drives that are optimized for write throughput and latency; usually SSDs, ideally SSDs with "power loss protection" so that they can acknowledge "sync" writes quickly without necessarily writing to the flash array (which can be somewhat slow). In this setup it is common to share a journal SSD between multiple OSD (data) drives. For example, our OSD servers have 8 or 10 spinning-rust drives for Ceph OSDs, and the journals are distributed over two SSDs.

Partitions

If your data and journal are on the same physical disk, I personally would put them on the same partition/file system, mostly because I would be worried that with separate partitions there would be a lot of head movement when the OSD/file system alternates between journal and (background) data writes. I'm not sure this is actually an issue, and on SSDs it certainly isn't. In general, separate partitions give you some optimization opportunities, e.g. different file system parameters or even file system types, or no file system at all for the journal. This comes at the cost of operational complexity: for example, adding a journal or changing its size requires repartitioning the disk.

In our setup with data on spinning disks and journals on (fewer) separate SSDs, we have a single partition per spinning disk (OSD), and a dedicated "journal" partition on each SSD; each partition contains 4–5 journals as files. Our journal files are sized at 6 GiB each, so the journal partitions are 40 GB or so.
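To make the sizing arithmetic above concrete, here is a rough sketch in Python. The function names are made up for illustration, and it only counts the journal files themselves; file system overhead and spare room are why the partition ends up larger than the raw sum.

```python
GIB = 2**30  # binary gibibyte, the unit used for the journal files
GB = 10**9   # decimal gigabyte, the unit used for the partition size

def journals_per_ssd(num_osds, num_ssds):
    """Spread OSD journals as evenly as possible; returns one count per SSD."""
    base, extra = divmod(num_osds, num_ssds)
    return [base + 1 if i < extra else base for i in range(num_ssds)]

def journal_space_gb(num_journals, journal_gib=6):
    """Space taken by the journal files alone, in decimal GB."""
    return num_journals * journal_gib * GIB / GB

# The two server layouts mentioned above: 8 or 10 data drives, 2 journal SSDs.
for osds in (8, 10):
    counts = journals_per_ssd(osds, 2)
    needed = journal_space_gb(max(counts))
    print(f"{osds} OSDs -> {counts} journals per SSD, "
          f"up to {needed:.1f} GB of journal files")
```

With 5 journals of 6 GiB each this comes to roughly 32 GB, so a 40 GB partition leaves some headroom.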

Caveat emptor

This setup has evolved based on a few years of experience and considerations of SSD lifetime and file system/SSD efficiency (latency, throughput). It's not necessarily the optimum, but then it's a tricky area... OSD journals have a peculiar access pattern: write only to a circular buffer, with frequent "sync"s. And SSDs can have large variations in (especially write) latency depending on usage (and controller and file system smartness); and latency peaks can be exacerbated by the fact that Ceph only ACKs a write when N (typically 3) writes have been committed to stable storage. In general, I think this is still a little bit of a (dark?) science, and you definitely need to take the expected usage patterns into account, so take all recommendations with a grain of salt, especially these here.
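A back-of-the-envelope illustration of why replication amplifies latency peaks, as a sketch only: it assumes each replica write independently has the same small chance of hitting a slow journal write, and that the client ACK waits for all N replicas. The 1% figure is an invented example value, not a measurement.

```python
def p_slow_ack(p_slow_write, replicas=3):
    """Probability that at least one of the replica writes is slow,
    assuming independent writes with identical slow-write probability."""
    return 1 - (1 - p_slow_write) ** replicas

# Hypothetical example: 1% of journal writes hit a latency peak.
for n in (1, 2, 3):
    print(f"replicas={n}: {p_slow_ack(0.01, n):.4f} of client writes see a slow ACK")
```

Under these assumptions, a latency peak that affects 1% of writes on a single device shows up in about 3% of client writes with three replicas.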

Oh and everything I said is for the "classical" Ceph where the data is stored in a file system such as XFS/ext4/... With the upcoming "BlueStore" these considerations may not (all) apply anymore.

sleinen
  • Thank you. Don't the journal file systems have to be mounted (under /var/lib/ceph/osd/cluster-number/journal)? Wouldn't this mandate that each have a journal on a different partition? – HorseHair Jan 28 '17 at 16:29
  • A journal is just a (single) file. By default, it will be under the OSD data partition, under the filename that you wrote. But you can also put that journal file elsewhere, e.g. on a filesystem backed by fast storage such as SSD. You can also specify a raw block device partition as the journal "file". – sleinen Mar 04 '17 at 08:57
  • "Another popular (at least historically) setup is to have journals on separate drives" @sleinen why is this not popular now? I thought having a separate SSD for the journal is always better. Why not these days? (Let's say an additional disk is not a burden, and separate SSDs are also used for data writes.) – GP92 Jun 09 '17 at 04:29