
We have an ElasticSearch cluster with 4 data nodes: each node has 4 cores, 16GB RAM, and 160GB storage (the cluster has separate dedicated master nodes). The cluster is responsible for storing and presenting (with Kibana) a swath of different logs for a variety of clients and services we maintain, spanning dev / test / prod.

We're trying to develop the best approach to indexing the data. Our goal is to keep data segregated at the lowest feasible level so we can easily manage (i.e. archive, delete, etc.) each client environment with different retention lengths, depending on data value.

We already know we want to index by date, but how granular can we go before the sheer index count becomes unwieldy? For instance, is logstash-{client}-{environment}-{date} reasonable? Will we have too many indices?
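
For concreteness, that scheme would look something like this in our Logstash elasticsearch output (a sketch only; it assumes each event already carries client and environment fields, and the hostname is made up):

output {
  elasticsearch {
    hosts => ["es-data-1:9200"]   # hypothetical host
    # index name is built per event from the client and environment fields
    # plus the event date, e.g. logstash-acme-prod-2018.10.09
    index => "logstash-%{client}-%{environment}-%{+YYYY.MM.dd}"
  }
}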

J. Doe

1 Answer

Scaling ES out for LogStash is a hard problem, especially for cases with very different retention intervals. For ES scaling, there are two big knobs to turn:

  • Fields in the global catalog
  • Number of shards (indexes * partitions * replica count)

The number of fields scales Java HEAP requirements on everything. I've heard horror-stories of people allowing JSON inputs that result in events like this:

http.192-168-82-19.request = "GET /"
http.192-168-82-19.verb = "GET"
http.192-168-82-19.path = "/"
http.192-168-82-19.response_time = 12022

And so on. As you can imagine, this makes for an astonishingly large number of fields in the catalog. It took them a long time to dig out of that hole, so pay attention to your inputs and try not to fall into it yourself. For a multi-client architecture such as yours, it is a good idea to exercise change-control on the fields allowed into your indices; you'll be happier for it.
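
One way to exercise that change-control is an index template that pins the mappings up front. A sketch against the 5.x API, with hypothetical field names:

PUT _template/logstash-controlled
{
  "template": "logstash-*",
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "_default_": {
      "dynamic": "strict",
      "properties": {
        "@timestamp":  { "type": "date" },
        "message":     { "type": "text" },
        "client":      { "type": "keyword" },
        "environment": { "type": "keyword" }
      }
    }
  }
}

"dynamic": "strict" rejects events carrying unapproved fields; setting it to false instead indexes the event but leaves the unknown fields unmapped (they stay in _source, unsearchable), which is often friendlier for a logging pipeline.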

The number of shards scales Java HEAP again, as well as recovery time when a node fails over. 8TB in 30 shards recovers differently than 8TB in 300. Generally, fewer is better but that depends on your RAM.
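
To put hypothetical numbers on that for a naming scheme like logstash-{client}-{environment}-{date} (assuming the stock 5 primary shards and 1 replica per index; the client and environment counts are made up, not yours):

10 clients * 3 environments * 5 primaries * 2 copies = 300 shards created per day
300 shards per day * 30 days of retention = 9,000 live shards on 4 data nodes

Even dropping to 1 primary shard per index leaves 1,800, which is a lot for nodes with 16GB of RAM; the per-index shard count and the retention window matter as much as the naming scheme itself.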

One of the metrics I use to determine if an ElasticSearch cluster is sized correctly is how much of a pain in the butt it is to do maintenance on it. If I'm in the cloud and my method of patching is to create a new base image and deploy a new VM, I'm going to be very concerned with how long it takes to populate that new instance with data. In that case, I'm more likely to add data-nodes to keep my recovery time down entirely for maintenance reasons. If I'm on actual servers and my method of patching is to reboot the box and keep running, full-box recoveries are much more rare so I'm less concerned with multiple TB of data on my data-nodes. Only you can decide where your pain point is here.

Newer versions of ElasticSearch (the 5.x series specifically) have a reindexing feature that can be amazingly useful for cases like yours, especially when paired with Curator. Once your indices age to a certain point, where they're only being kept for compliance reasons, you can reindex a week's worth of daily indexes into a single weekly index. This turns 70 shards (2 replicas * 5 partitions * 7 days) into 10 shards, at the cost of slower searches over that week.
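
A sketch of what that looks like with the 5.x _reindex API (the index names here are hypothetical):

POST _reindex
{
  "source": {
    "index": [
      "logstash-acme-prod-2018.10.01",
      "logstash-acme-prod-2018.10.02",
      "logstash-acme-prod-2018.10.03",
      "logstash-acme-prod-2018.10.04",
      "logstash-acme-prod-2018.10.05",
      "logstash-acme-prod-2018.10.06",
      "logstash-acme-prod-2018.10.07"
    ]
  },
  "dest": {
    "index": "logstash-acme-prod-2018-w40"
  }
}

Once you've verified the weekly index has everything, delete the dailies; Kibana searches don't change so long as your index pattern still matches the new name.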

That kind of thing can be very hard on your servers, so it may not be the right choice. But it's a technique that would allow you to run an 'archive' cluster of ES servers with their own retention and query periods.
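
For the retention side, a Curator action file along these lines is how I'd express "different retention lengths per client environment" (the client name and the 14-day window are made-up examples):

actions:
  1:
    action: delete_indices
    description: "Drop dev logs for client acme after 14 days"
    options:
      ignore_empty_list: True
    filters:
    # only indices for this client and environment...
    - filtertype: pattern
      kind: prefix
      value: logstash-acme-dev-
    # ...whose date in the name is older than 14 days
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 14

One action block per client/environment/retention combination keeps the policy explicit, and it only works if the index name actually encodes client and environment, which argues in favour of the logstash-{client}-{environment}-{date} scheme you're considering.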

sysadmin1138
  • Great answer! I never thought of this idea: "can reindex a week's worth of daily indexes into a single weekly". As long as we call the data "archives", the performance doesn't matter; retention is what we would be contending with. – asgs Oct 09 '18 at 19:40