
I am pretty new to Ceph and am trying to find out whether Ceph supports hardware-level RAID HBAs.

Sadly, I could not find any information. What I did find is that it is recommended to use plain disks for OSDs. But this pushes high bandwidth requirements onto PCIe and the disk interfaces, and the CPU requirements are very high.

Hardware RAID controllers have already solved these requirements, and they provide high redundancy depending on the setup, without eating my PCIe, CPU or any other resources.

So my desired setup would be to have local RAID controller(s) that handle my in-disk redundancy at the controller level (RAID 5, RAID 6, or whatever RAID level I need). On top of those RAID LUNs, I would like to use Ceph to do the higher level of replication between host, chassis, rack, row, datacenter, or whatever is possible or plannable in CRUSH.
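
To make the intent concrete, here is a toy sketch in Python (the host, rack and LUN names are made up, and this is nothing like real CRUSH code, just the placement idea):

```python
# Toy sketch of the layout I have in mind (made-up names, not real CRUSH
# syntax): each host exposes its RAID LUNs as OSDs, and Ceph places the
# replicas across hosts/racks.

crush_tree = {
    "rack1": {"host1": ["lun0", "lun1"], "host2": ["lun0", "lun1"]},
    "rack2": {"host3": ["lun0", "lun1"], "host4": ["lun0", "lun1"]},
}

def place_replicas(tree, copies=3):
    """Pick one LUN-backed OSD on each of `copies` distinct hosts."""
    placements = []
    for rack, hosts in tree.items():
        for host, luns in hosts.items():
            if len(placements) < copies:
                placements.append((rack, host, luns[0]))
    return placements

print(place_replicas(crush_tree))
# [('rack1', 'host1', 'lun0'), ('rack1', 'host2', 'lun0'), ('rack2', 'host3', 'lun0')]
```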

  1. Any experiences with that setup?
  2. Is it a recommended setup?
  3. Is there any in-depth documentation for this hardware RAID integration?
– cilap

3 Answers


You can, but that doesn't mean you should. Mapping RAID LUNs to Ceph is possible, but you inject one extra layer of abstraction and render at least part of Ceph's functionality useless.

Similar thread on their mailing list:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021159.html

– BaronSamedi1958
  • Could you elaborate on "render at least part of Ceph functionality useless" a bit more? I don't get the point. – cilap Jan 22 '18 at 14:33
  • The whole idea of Ceph... OK, one of the main ideas! is to avoid managing "islands of storage", which is exactly what RAID LUNs are. – BaronSamedi1958 Jan 22 '18 at 17:02

"But this pushes high bandwidth requirements onto PCIe and the disk interfaces, and the CPU requirements are very high."

Not really: many storage workloads are served well by modern general-purpose CPUs and interconnects.

Yes, a RAID controller takes care of redundancy within a handful of disks in one chassis. But that's added cost and complexity when you are already running a redundant, multi-node distributed storage solution like Ceph. Why bother mirroring a physical disk when Ceph already keeps multiple copies of the data?
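
As a rough illustration of the overlap (plain arithmetic in Python; the 3x replicated pool and the 18-disk RAID-6 / RAID-1 shapes are hypothetical examples, not measurements):

```python
# Usable fraction of raw disk space when RAID redundancy is stacked
# under Ceph's own replication. Hypothetical shapes: a 3x replicated
# pool, 18-disk RAID-6 arrays (16 data + 2 parity), RAID-1 pairs.

replication = 3

plain_disks = 1 / replication            # ~33% of raw capacity is usable
raid6_luns = (16 / 18) / replication     # ~30%
raid1_pairs = (1 / 2) / replication      # ~17%: every byte stored 6 times

print(f"plain disks:  {plain_disks:.1%}")
print(f"RAID-6 LUNs:  {raid6_luns:.1%}")
print(f"RAID-1 pairs: {raid1_pairs:.1%}")
```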

The building blocks of such a solution are just a bunch of disks, such as the Open Compute Project's Open Vault: 30 spindles in an enclosure, attached to a compute node with maybe a couple dozen CPU cores. Add as many nodes as you need to scale out. You can leave that compute dedicated to Ceph if you want to maximize throughput.
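
For a feel of the scale-out sizing (a hypothetical sketch; the target capacity, drive size and replication factor are assumptions, not recommendations):

```python
import math

# Hypothetical sizing: how many 30-spindle JBOD nodes for a target
# usable capacity? All numbers here are assumptions for illustration.
target_usable_tb = 1000     # ~1 PB usable
drive_tb = 8
drives_per_node = 30        # e.g. an Open Vault style enclosure
replication = 3             # 3 copies of everything

raw_needed_tb = target_usable_tb * replication
nodes = math.ceil(raw_needed_tb / (drives_per_node * drive_tb))

print(f"{nodes} nodes, {nodes * drives_per_node} OSDs")  # 13 nodes, 390 OSDs
```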

– John Mahowald
  • Do you have facts, with real CPU, memory and disk benchmarks, to compare against hardware RAID benchmarks? With hardware RAID arrays I have low requirements on CPU and memory, since the hardware controller takes care of it. – cilap Jan 22 '18 at 14:30
  • I don't. And you really would want to do your own benchmark anyway. Just note that CPUs do billions of cycles per second and interconnects (PCIe) do billions of transfers per second. You're free to use a RAID controller; it just doesn't seem necessary in a distributed storage node. – John Mahowald Jan 25 '18 at 13:56

The recommended setup is to use single disks or, possibly, disks in RAID-1 pairs.

A single SAS controller (or a RAID controller in JBOD mode) can drive several hundred disks without any trouble.

Using very large arrays defeats the very purpose of Ceph, which is to avoid single points of failure and "hot spots". It will also actually harm your redundancy.

Let's say you want to build a 1 PB Ceph cluster using 8 TB drives, in 36-disk server chassis (ordinary Supermicro-like hardware). Let's compare the setups with and without RAID in terms of storage capacity and reliability (a rough sketch of the rebuild arithmetic follows the list):

  • With RAID-6 you need 5 chassis (and 10 OSDs).

    • Each chassis will have two 18-disk RAID-6 arrays.
    • You'll have 1024 TB of available storage.
    • In case of a multiple-disk crash you'll have to rebuild 256 TB.
  • With Ceph and 5 chassis you'll have 180 OSDs.

    • Available capacity will be slightly higher (using erasure coding): 1152 TB.
    • In case of a multiple-disk crash you'll only have to rebuild the failed disks (unless it's an entire server, it will always be less than 256 TB).
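
A back-of-the-envelope sketch of the rebuild arithmetic, using the drive and array sizes assumed above (illustrative only, not a benchmark):

```python
# Rebuild sizes after losing a few disks, using the sizes assumed in
# this answer: 8 TB drives, 36-disk chassis, two 18-disk RAID-6 arrays
# per chassis. Illustrative arithmetic only.

drive_tb = 8
raid6_array_disks = 18                                # 16 data + 2 parity
raid6_array_tb = (raid6_array_disks - 2) * drive_tb   # usable TB per array/OSD
chassis_tb = 2 * raid6_array_tb                       # per 36-disk chassis

failed_disks = 3

# RAID-6 LUNs as OSDs: 3 failures in one array destroy the array, so the
# whole OSD has to be rebuilt (up to a full chassis if it all goes down).
raid_rebuild_tb = raid6_array_tb

# One OSD per disk: Ceph only re-replicates the data of the failed disks.
ceph_rebuild_tb = failed_disks * drive_tb

print(f"RAID-6 LUN OSDs:  rebuild ~{raid_rebuild_tb} TB (up to {chassis_tb} TB per chassis)")
print(f"one OSD per disk: rebuild ~{ceph_rebuild_tb} TB")
```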
– wazoox
  • I get the Ceph requirements, but one major question is still not answered: what are the requirements for the 36-drive chassis? AFAIK you need 36 cores for it, going by the Ceph description. Also, what configuration would you suggest for your example? What are the replication efforts and what is the benchmark of it? – cilap Jan 26 '18 at 06:17
  • Just forgot: AFAIK your setup needs more instances, or maybe even more servers, for the management. – cilap Jan 26 '18 at 06:24
  • @cilap it really depends upon the needed performance. You generally don't need 1 core per OSD; about half the cores is enough. Performance of erasure coding is inferior to full replication. – wazoox Jan 26 '18 at 15:55
  • I didn't mention MDS as you'll need them either way. Depending upon your cluster load, you may use the storage nodes as MDS and MON servers. – wazoox Jan 26 '18 at 15:56