
For the sake of understanding, I set up a 4-node cluster using the latest released version of Cassandra. The four nodes were brought up in sequence with almost entirely default settings and appear to be communicating properly.

I then created a schema as follows:

CREATE KEYSPACE first WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '1'
};

I created a simple table with 5 columns and added ~100K rows of data. All well and good. The data is available from every client, so I assumed it was evenly spread around.

So I'm looking into a backup strategy and starting to experiment with snapshots and so forth. After running nodetool snapshot on each machine, I wanted to see what it created. I go to the first machine and look in /var/lib/cassandra/data/first and see that it's empty. Hmm.. second machine.. same thing.. third.. finally, on the 4th machine, I see files in the data folder and a snapshots directory.

Running nodetool ring shows that each node owns roughly 25% of the token range, but the load is heavily biased towards the one node that (seems to have) ended up with all the data.

Is all the data truly on this one machine? What step did I miss in the configuration?

ethrbunny

1 Answer


Cassandra assigns a token range to each node in the cluster.

Since there are 4 nodes in your cluster, each node is assigned roughly 25% of the token range.
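To make the 25% figure concrete, here is a simplified sketch of how a 4-node ring divides the Murmur3 token space (-2^63 to 2^63-1) into four contiguous ranges. It assumes one evenly spaced token per node; with the default vnodes setting (num_tokens > 1), each node instead owns many small ranges that still add up to the same share.

```python
# Sketch: splitting the Murmur3 token space (-2**63 .. 2**63 - 1)
# into four equal contiguous ranges, one per node. This assumes a
# single, evenly spaced token per node; with vnodes each node owns
# many smaller ranges totalling the same 25%.

TOKEN_MIN = -2**63
NUM_NODES = 4

span = 2**64              # total number of tokens in the ring
step = span // NUM_NODES  # tokens owned by each node

ranges = []
for i in range(NUM_NODES):
    start = TOKEN_MIN + i * step
    end = start + step - 1
    ranges.append((start, end))

for node, (start, end) in enumerate(ranges, start=1):
    print(f"node {node}: tokens {start} .. {end}")
```

Each range covers exactly one quarter of the token space, which is the ownership percentage that nodetool ring reports.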

Every insert generates a hash of the row's partition key. That hash falls into exactly one token range, so the row is stored on the node that owns that range.

In your case, most of the insert queries likely share the same (or a low-cardinality) partition key, so their hashes all fall into the range owned by a single physical node, i.e. node 4.
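The effect can be illustrated with a small stand-in model. Cassandra's default partitioner hashes with Murmur3, which isn't reimplemented here; MD5 serves as a stand-in hash (the older RandomPartitioner really did hash keys with MD5), and the node mapping assumes the four equal token ranges described above. The key names are made up for illustration.

```python
import hashlib

# Illustration: rows with the same partition key always hash to the
# same token, hence the same node. MD5 stands in for Cassandra's
# Murmur3 hash; the mapping assumes 4 equal token ranges.

NUM_NODES = 4

def owning_node(partition_key: str) -> int:
    # Take the first 8 bytes of the MD5 digest as a signed 64-bit
    # token in the range -2**63 .. 2**63 - 1.
    digest = hashlib.md5(partition_key.encode()).digest()
    token = int.from_bytes(digest[:8], "big", signed=True)
    # Shift into 0 .. 2**64 - 1, then map onto nodes 1..4.
    return (token + 2**63) * NUM_NODES // 2**64 + 1

# The same partition key always lands on the same node, so a
# low-cardinality (or constant) key piles every row onto one machine.
assert owning_node("user42") == owning_node("user42")

for key in ["user42", "user7", "user99"]:
    print(f"{key!r} -> node {owning_node(key)}")
```

Distinct keys spread across the ring roughly uniformly, but repeating one key (or a handful of keys) concentrates all the data, which matches the skewed load you observed.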

For more details, see the DataStax documentation on data partitioning.