
I want to build a Ceph Storage Cluster for HPC use (CentOS 7 based). For now I have an enterprise SAS RAID enclosure with 3 shelves of 12 × 4 TB disks (36 total). It is currently configured as a default RAID6 rig, and its performance is very bad. I also can't scale the system: there is no way to switch to 6 TB disks, for example. So here is what I want to do.

  1. Switch from RAID6 to JBOD.
  2. Map each group of 12 disks to a different controller port (3 ports total).
  3. Connect 3 servers to the enclosure via SAS HBA cards.
  4. Create one Ceph pool for CephFS: pg_num = 512, erasure coded, failure-domain = host, BlueStore (see the pg_num sanity check after this list).
  5. Mount the CephFS pool on the compute nodes over IPoIB.
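
As a rough sanity check on the pg_num = 512 in step 4, here is a minimal sketch of the common Ceph rule of thumb of roughly 100 placement groups per OSD, divided by the pool's chunk count and rounded to a power of two. The OSD count and the 4+2 profile are just assumptions taken from the layout above, for illustration only.

    # Rough pg_num sanity check for step 4, using the common Ceph rule of
    # thumb of ~100 placement groups per OSD (a guideline, not an official tool).
    import math

    osds = 36                 # 3 shelves x 12 disks, one OSD per disk (assumed)
    pgs_per_osd = 100         # widely used target
    k, m = 4, 2               # hypothetical erasure-code profile

    raw = osds * pgs_per_osd / (k + m)     # each PG spans k+m OSDs
    pg_num = 2 ** round(math.log2(raw))    # round to the nearest power of two
    print(f"suggested pg_num ~ {pg_num} (raw estimate {raw:.0f})")
    # With 36 OSDs and k+m = 6 this lands on 512, matching the plan.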

My main questions are around step 4.

  1. How do I choose the erasure coding k+m numbers? 3+3, 4+2, 8+3, 8+4, 10+4? I can't fully understand how each choice handles different failures. As I understand it, my system needs to survive 1 host down plus 1-2 failed OSDs (see the failure sketch after this list). Is that possible with a 3-host configuration? If not, what happens if an OSD fails during the heal process after a host failure? And what happens if an OSD fails while 1 host is down for maintenance (heal not yet started)?
  2. Is it possible to add WAL/DB SSDs for BlueStore later, as can be done with Filestore?
  3. Will HPC MPI calls suffer from IPoIB traffic on the same IB interface and switch?
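
To reason about question 1, here is a minimal sketch of the failure arithmetic. It is an assumption-laden model, not Ceph itself: it assumes chunks are spread as evenly as possible across hosts (which already requires a CRUSH rule allowing more than one chunk per host once k+m > 3), and it takes the worst case where every extra failed OSD held a chunk of the same stripe.

    # Minimal sketch: can a k+m erasure-coded pool survive
    # "1 host down + a few extra OSD failures" with only 3 hosts?
    # Simplified worst-case model, not Ceph's actual CRUSH placement.
    import math

    def survives(k, m, hosts, extra_osd_failures):
        chunks = k + m
        # Chunks spread as evenly as possible across hosts (assumed).
        max_chunks_per_host = math.ceil(chunks / hosts)
        # Worst case: each extra failed OSD held a chunk of this stripe.
        lost = max_chunks_per_host + extra_osd_failures
        # Data stays readable while at most m chunks are missing.
        return lost <= m

    for k, m in [(3, 3), (4, 2), (8, 3), (8, 4), (10, 4)]:
        for extra in (1, 2):
            status = "ok" if survives(k, m, 3, extra) else "data unavailable"
            print(f"k={k} m={m}: host down + {extra} OSD(s) -> {status}")

Under these assumptions only 3+3 rides out a host failure plus one more OSD; the other profiles listed do not, which is why the 3-host constraint is the crux here.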

And an overall question: will this work at all, or have I missed something fundamental?

Severgun

1 Answer


Performance

Erasure coding is CPU intensive. If you need performance, use 3-copy replication instead.
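
For context on that trade-off, here is a minimal sketch of the capacity side, using the 36 × 4 TB disks from the question and a hypothetical 4+2 profile:

    # Usable capacity: 3-copy replication vs erasure coding,
    # assuming the 36 x 4 TB disks from the question.
    raw_tb = 36 * 4                    # 144 TB raw

    replica_3 = raw_tb / 3             # 3 copies of every object
    k, m = 4, 2                        # hypothetical EC profile
    erasure = raw_tb * k / (k + m)     # k data chunks out of every k+m stored

    print(f"raw {raw_tb} TB, 3-copy {replica_3:.0f} TB, EC {k}+{m} {erasure:.0f} TB")
    # 3-copy gives up capacity (48 TB vs 96 TB here) in exchange for lower
    # CPU cost and simpler recovery; that is the trade being pointed at.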

More disks means better performance, so JBOD is the way to go.

Enterprise SSDs are highly recommended. You can reconfigure, add, and remove OSDs later.

Availability and data protection

The more nodes you have, the more resistant the storage is to data loss.

For erasure coding with 3 hosts, the minimum is k=3, m=2. When a host fails you lose one data part, and one parity part is needed to recover it; so you need at least two parity parts in case one of them was on the failed host.

It would be best to have more nodes than k+m: when 1 host fails, you still want to be able to place all erasure-coded parts on the remaining hosts.

For protection with 3 copies, the recommended minimum is 4 hosts. When one fails, you still have room to keep 3 copies.
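
A minimal way to sanity-check those two rules of thumb, assuming failure-domain = host and one chunk or copy per host:

    # Sketch of the "more hosts than k+m" rule: after one host fails,
    # every chunk (or copy) still needs its own surviving host
    # when the failure domain is host.

    def can_reheal_ec(hosts, k, m):
        # Erasure coding: all k+m chunks must fit on distinct surviving hosts.
        return hosts - 1 >= k + m

    def can_reheal_replica(hosts, copies=3):
        # Replication: each copy needs its own surviving host.
        return hosts - 1 >= copies

    print(can_reheal_ec(hosts=3, k=3, m=2))   # False: 2 hosts left, 5 chunks
    print(can_reheal_ec(hosts=6, k=3, m=2))   # True: room to re-place everything
    print(can_reheal_replica(hosts=4))        # True: matches "4 hosts for 3 copies"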

For production you'll need more servers.

dario