Ceph: Why is a greater number of "placement groups" a "bad thing"?

Question

I have been researching distributed databases and file systems. I was originally mostly interested in Hadoop/HBase because I'm a Java programmer, but I found this very interesting document about Ceph, which, as a major plus point, is now integrated into the Linux kernel.

Ceph as a scalable alternative to the HDFS

There is one thing that I didn't understand, and I'm hoping one of you can explain it to me. Here it is:

A simple hash function maps the object identifier (OID) to a placement group, a group of OSDs that stores an object and all its replicas. There are a limited number of placement groups to create an upper bound on the number of OSDs that store replicas of objects stored on any given OSD. The higher that number, the higher the likelihood that a failure of multiple nodes will lead to data loss. If, for example, each OSD has replica relations to every other OSD, the failure of just three nodes in the entire cluster can wipe out data that is stored on all three replicas.

Can you explain to me why a greater number of placement groups increases the likelihood of data loss? I would have thought it's the other way around.

score 3 · Accepted Answer · answered Aug 15 '11 at 21:45

I am currently investigating Ceph as an alternative for our data storage as well. I found your question, did some reading, and hope this idea makes sense. The way Ceph distributes data dynamically suggests that if you have a large number of OSDs (significantly more than the replication level), it is possible (and likely) that the distribution algorithm will spread parts of files across a huge number of OSDs. In that case, if you lose N nodes (where N is at least your replication level), it is highly likely that you will lose data (or at least suffer a significant amount of corruption). That isn't really a surprise: I would expect data loss if you lost 3 nodes in your cluster (as in their example) unless your replication level was 4 or higher.
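To make the quoted passage a bit more concrete, here is a rough Python sketch of the argument. It is not Ceph's actual placement logic: real Ceph uses CRUSH, whereas this toy model simply assigns each placement group a random set of OSDs, and all function names (`pg_for_object`, `place_groups`, `loss_probability`, `max_peers`) are made up for illustration. The idea it tries to show is that more placement groups means more distinct replica sets, so a larger share of all possible 3-node failures wipes out every replica of something.

```python
import random
import zlib
from math import comb


def pg_for_object(oid, num_pgs):
    """Map an object ID to a placement group with a simple stable hash.
    (Illustrative only; not the hash Ceph actually uses.)"""
    return zlib.crc32(oid.encode("utf-8")) % num_pgs


def place_groups(num_osds, num_pgs, replicas, seed=0):
    """Give each placement group a random set of `replicas` distinct OSDs.
    Assumption: uniform random placement as a stand-in for CRUSH."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(num_osds), replicas))
            for _ in range(num_pgs)]


def loss_probability(pgs, num_osds, replicas):
    """Chance that a uniformly random simultaneous failure of exactly
    `replicas` OSDs matches the full replica set of at least one PG."""
    distinct_replica_sets = len(set(pgs))
    return distinct_replica_sets / comb(num_osds, replicas)


def max_peers(pgs, num_osds):
    """Largest number of other OSDs that any single OSD shares a PG with."""
    peers = [set() for _ in range(num_osds)]
    for pg in pgs:
        for osd in pg:
            peers[osd] |= pg - {osd}
    return max(len(p) for p in peers)


if __name__ == "__main__":
    num_osds, replicas = 20, 3
    for num_pgs in (10, 100, 1000):
        pgs = place_groups(num_osds, num_pgs, replicas, seed=1)
        print(num_pgs,
              "PGs: max peers per OSD =", max_peers(pgs, num_osds),
              "| P(loss on a random 3-OSD failure) =",
              round(loss_probability(pgs, num_osds, replicas), 4))
```

Under these assumptions, a small number of placement groups keeps each OSD's set of replica peers small, so most 3-OSD failures hit no complete replica set. With many placement groups, each OSD ends up sharing data with nearly every other OSD, which is the "replica relations to every other OSD" case from the quote.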

I did not realize that individual files could be split between several OSDs. In that light, it does make sense that large files would be more likely to get corrupted if they get split up. Thank you. – monster, Aug 16 '11 at 07:22
