Computing stripe count for erasure coded storage

Question

I'm setting up a Ceph cluster (my first) that will eventually consist of ~100 disks spread over 10 hosts. I'm going with a single erasure-coded data pool to maximize disk space; my constraints are ~80% efficiency and a fault tolerance of 2 disks. The simplest way to achieve this is a k=8 m=2 erasure code, but k=16 m=4 also works, with the bonus of tolerating up to 4 disk faults.
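To make the arithmetic behind these two profiles explicit, here is a minimal Python sketch. The function name `ec_profile_summary` is made up for illustration; it only encodes the standard relations that storage efficiency is k/(k+m) and that an erasure code with m coding chunks tolerates the loss of m chunks.

```python
def ec_profile_summary(k: int, m: int) -> dict:
    """Basic figures for an erasure-code profile with k data and m coding chunks."""
    total_chunks = k + m
    return {
        "chunks": total_chunks,            # chunks written per stripe
        "efficiency": k / total_chunks,    # usable fraction of raw space
        "tolerated_failures": m,           # chunks that can be lost without data loss
    }


# The two profiles mentioned in the question.
for k, m in [(8, 2), (16, 4)]:
    s = ec_profile_summary(k, m)
    print(f"k={k} m={m}: {s['chunks']} chunks, "
          f"efficiency {s['efficiency']:.0%}, "
          f"tolerates {s['tolerated_failures']} lost chunks")
```

Both profiles come out at 80% efficiency; they differ in the number of chunks per stripe (10 versus 20) and in the number of tolerated losses (2 versus 4).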

I'm therefore wondering what the downsides of increasing the number of stripes are. A few come to mind (e.g. increased CPU and network overhead due to increased file fragmentation), but given my very limited knowledge of the subject I'm not sure. I'd really appreciate any insight on this topic.

If you have 10 hosts, I would advise against placing an EC chunk on every one of them. If a node fails, recovery won't be possible until that node is back online. I would recommend using something like k=6 m=2 if you want to sustain the failure of 2 disks. In that case you'd have two "spare" hosts, which is a misleading term since all hosts will be in use, of course. An EC profile like k=7 m=2 would work as well. — eblock, Jan 26 '22 at 09:00
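The "spare hosts" reasoning in the comment above can be sketched as follows. This assumes a host failure domain, where each of the k+m chunks of a placement group must sit on a different host; the function names are hypothetical and not part of any Ceph tooling.

```python
def spare_hosts(num_hosts: int, k: int, m: int) -> int:
    """Hosts left over once each of the k+m chunks has its own host."""
    return num_hosts - (k + m)


def can_recover_after_host_loss(num_hosts: int, k: int, m: int) -> bool:
    """True if a lost host's chunks can be rebuilt on another host.

    Rebuilding needs at least one host that does not already hold a chunk
    of the same placement group, i.e. at least one spare host.
    """
    return spare_hosts(num_hosts, k, m) >= 1


for k, m in [(8, 2), (7, 2), (6, 2)]:
    print(f"k={k} m={m} on 10 hosts: "
          f"{spare_hosts(10, k, m)} spare host(s), "
          f"recovery after host loss possible: {can_recover_after_host_loss(10, k, m)}, "
          f"efficiency {k / (k + m):.1%}")
```

Note the trade-off this exposes: k=6 m=2 and k=7 m=2 leave room for recovery, but their efficiency (75% and about 77.8%) falls slightly short of the ~80% target stated in the question.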
Thanks for your suggestion. Are you assuming a host failure domain? I am thinking about using an OSD failure domain, since I'm mainly concerned about losing disks rather than machines, and that's why I'd be interested in high stripe counts. I am also concerned that my machines have different total OSD capacities, ranging from 6 to 32 TB, so with a host failure domain that uses almost all of the hosts I'd end up with the small hosts limiting the overall cluster capacity, right? — Nicola Mori, Jan 26 '22 at 11:30
Well, the disk on which the operating system is installed can fail, too. ;-) But yes, I was thinking about a host failure domain; this is the usual case in all of our customers' clusters. Your assumption is correct: the smallest (or fullest) OSD limits the overall capacity. I don't think it's a good idea to mix OSDs with such huge capacity differences. If you created different device classes for the large and the small OSDs, that could work, provided you created pools on those device classes. But I don't think you'll be happy with the result if you go this route, regardless of the failure domain. — eblock, Jan 27 '22 at 08:09
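As a rough illustration of the capacity limit discussed in the two comments above, consider the idealized case of a host failure domain with k+m equal to the number of hosts. Every placement group then puts one equally sized chunk on every host, so all hosts fill at the same rate and the smallest one fills first. The host capacities below are invented for the example (the thread only gives the 6 to 32 TB range), and the model ignores full ratios and uneven placement.

```python
def usable_data_tb(host_capacities_tb: list[float], k: int, m: int) -> float:
    """Usable data capacity when every host holds one chunk of every stripe.

    Idealized model: only valid when k + m equals the number of hosts.
    """
    num_hosts = len(host_capacities_tb)
    if k + m != num_hosts:
        raise ValueError("model only covers k + m == number of hosts")
    # All hosts receive the same amount of data, so the smallest host
    # caps how much raw space can be used on each of them.
    raw_usable = num_hosts * min(host_capacities_tb)
    return raw_usable * k / (k + m)


# Hypothetical capacities for 10 hosts, in TB (not from the thread).
hosts = [6, 6, 12, 12, 16, 16, 24, 24, 32, 32]
print(f"raw capacity:  {sum(hosts)} TB")
print(f"usable (data): {usable_data_tb(hosts, 8, 2)} TB")
```

In this made-up layout only 60 TB of the 180 TB raw capacity can be filled, which is the effect the question anticipates.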
I have several disk servers of different ages and disk sizes that have operated separately up to now, and I need to combine them into one big storage system because I need a single big pool. So I will live with the downsides of this arrangement, but thanks anyway for the heads-up. Back to my original question: do you see any severe issue in growing the stripe count, even up to k=40 m=8? — Nicola Mori, Jan 27 '22 at 08:21
Basically it's about your requirements regarding resiliency, but too many chunks result in a higher CPU load. It can also have an impact on your storage overhead if you think about `bluestore_min_alloc_size_hdd`: if you have many small files, your chunks will eat up space unless you change the allocation size. So in conclusion, I would advise against using 48 chunks and would rather stay somewhere between 8 and 18 chunks. We have a few customers with erasure-coded pools; both 18 chunks and 9 chunks work quite well. — eblock, Jan 27 '22 at 08:31
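The allocation-size effect mentioned above can be approximated with a small model: an object is split into k data chunks, m coding chunks of the same size are added, and each chunk occupies a whole number of allocation units on disk. This is a simplified sketch, not Ceph's actual accounting (it ignores stripe-unit padding and metadata), and the 4 KiB allocation unit is an assumed value; check the `bluestore_min_alloc_size_hdd` setting on your own cluster.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def allocated_bytes(object_size: int, k: int, m: int, min_alloc: int) -> int:
    """Approximate on-disk footprint of one object in a k+m erasure-coded pool."""
    chunk_size = ceil_div(object_size, k)                     # bytes per chunk
    chunk_on_disk = ceil_div(chunk_size, min_alloc) * min_alloc  # rounded up to allocation units
    return (k + m) * chunk_on_disk


MIN_ALLOC = 4 * 1024     # assumed allocation unit (4 KiB)
OBJECT_SIZE = 16 * 1024  # a small 16 KiB object, chosen for illustration

for k, m in [(8, 2), (16, 4), (40, 8)]:
    used = allocated_bytes(OBJECT_SIZE, k, m, MIN_ALLOC)
    print(f"k={k} m={m}: {used // 1024} KiB on disk, "
          f"{used / OBJECT_SIZE:.1f}x the object size "
          f"(ideal {(k + m) / k:.2f}x)")
```

Under these assumptions the footprint of a small object grows with the chunk count, because every chunk costs at least one allocation unit no matter how little data it holds. For large objects the rounding becomes negligible and the overhead approaches the ideal (k+m)/k.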

