When to use a new index in Graylog (Elasticsearch)?

Question

I have been searching for days for a good explanation of how Graylog/Elasticsearch use indices and when to create a new one. There is a lot of information about sharding indices, but not much about the indices themselves, apart from the fact that they are a set of settings for how much data to retain and how to manage it. That explains the how, but not the why. (Or so it seems to me.)

Background:

We are using Graylog 4.0 with Elasticsearch 7.10 and MongoDB 3.

We are trying to centralize the logs from 6 warehouse locations (all a few hundred km apart from one another). Each has 6 to 20 RFID gates, each of which has a log. Each gate has its own connector middleware to a central controller middleware, and all of those have logs. Then there is the controller of the automated warehouse 'AWMS', the WMS server, the ERP server and their frontends. We are also considering collecting at least some of the events from the Windows Eventlog of the servers those services run on.

Usually we need to analyse a problem in one subsystem, so we only need to search one of those logs. Occasionally we need to look at the whole flow from RFID gate to AWMS, WMS and ERP.

At the moment I am considering having a stream for each of those logs and using the relevant streams in the search. (Or is that approach already flawed? If so, why?)

Questions:

Is an index set in Graylog just the settings about the retention strategy?
What impact does it have if I have a lot or a few indices?
- In the Elasticsearch index model it sounds like shard sizes and their distribution have the main impact on search performance, and indices are just a framework to manage the shards
How many index sets should I have for my use case?
- Multiple per stream?
- One per stream?
- One per location?
- One per subsystem?
- One per retention time or size interval?
- One global one?
- Does it matter from a performance point of view?
Where can I find more information about this that explains the 'why', not just the 'how', of managing indices? (I have been looking at the Graylog index model, the Elasticsearch index model and Elasticsearch index templates.)

score 2 · Accepted Answer · answered May 17 '21 at 21:18

Is an index set in Graylog just the settings about the retention strategy?

Don't forget that an Index Set has a direct impact on the indices in the underlying Elasticsearch infrastructure. You should take that into account, because Elasticsearch is all about indices and their shards (data distribution, replicas, ...).
Data types and fields matter too: you can't (or at least shouldn't) have the same field with mixed data types in the same Index Set. For example, if the field device exists as Integer because System1 uses a device number, but System2 requires the type Text for this field because its device identifier is a string, then you should either store everything as a string or create a separate Index Set, so that each system keeps its own data type and its benefits under the same field name.
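To check for this kind of clash before deciding on your Index Sets, you could run a few sample messages per source through something like the sketch below. It is only an illustration: `infer_type` is a crude stand-in for how a field type might be guessed, not what Elasticsearch actually does, and the sample messages and field names are made up.

```python
from collections import defaultdict


def infer_type(value):
    """Crude stand-in for type detection (assumption, not Elasticsearch's logic)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "text"


def find_type_conflicts(messages):
    """Return {field: {type: [sources]}} for fields seen with more than one type."""
    seen = defaultdict(lambda: defaultdict(set))
    for msg in messages:
        source = msg.get("source", "unknown")
        for field, value in msg.items():
            if field == "source":
                continue
            seen[field][infer_type(value)].add(source)
    return {
        field: {t: sorted(sources) for t, sources in types.items()}
        for field, types in seen.items()
        if len(types) > 1
    }


# Hypothetical sample messages, one per system.
sample = [
    {"source": "System1", "device": 42, "level": 3},
    {"source": "System2", "device": "gate-07", "level": 4},
]
conflicts = find_type_conflicts(sample)
print(conflicts)
```

Any field that shows up in the result is a candidate for either normalising to one type or moving one of the sources to its own Index Set.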

That's typically why you probably don't want to store Windows logs in the same Index Set as anything else (apply this to your use case; it may also be true for your ERP/WMS data sources): they can easily lead to hundreds of different fields, and it's recommended to avoid exceeding the limit of 1000 fields per index.
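As a rough way to compare groupings against that limit, you can count the distinct field names each proposed Index Set would end up with. The sketch below does that. The source names, field names and field counts are invented for illustration, and a plain count of field names is only an approximation of what Elasticsearch counts towards `index.mapping.total_fields.limit`.

```python
FIELD_LIMIT = 1000  # Elasticsearch default for index.mapping.total_fields.limit


def fields_per_index_set(grouping, fields_by_source):
    """Count distinct field names per index set for a proposed grouping."""
    totals = {}
    for index_set, sources in grouping.items():
        distinct = set()
        for source in sources:
            distinct |= fields_by_source.get(source, set())
        totals[index_set] = len(distinct)
    return totals


# Hypothetical field inventories; real ones would come from sample messages.
common = {"timestamp", "message", "source"}
fields_by_source = {
    "windows": common | {f"win_{i}" for i in range(700)},
    "erp": common | {f"erp_{i}" for i in range(400)},
    "rfid": common | {f"rfid_{i}" for i in range(20)},
}

totals_global = fields_per_index_set(
    {"everything": ["windows", "erp", "rfid"]}, fields_by_source
)
totals_split = fields_per_index_set(
    {"windows": ["windows"], "erp": ["erp"], "rfid": ["rfid"]}, fields_by_source
)

for name, count in {**totals_global, **totals_split}.items():
    status = "over limit" if count > FIELD_LIMIT else "ok"
    print(f"{name}: {count} fields ({status})")
```

With these made-up numbers a single global Index Set would exceed the limit, while one Index Set per source type stays well below it.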

So, no, it's not just about the retention strategy. As a starting point, I recommend grouping data sources of the same type in their own Index Set (one Index Set for Windows logs, another for Linux servers, another for the firewalls, for example), because it makes sense from a data type point of view.

What impact does it have if I have a lot or a few indices?

It depends on your Elasticsearch infrastructure, and "a lot" is undefined. Take a look at Sizing ElasticSearch and Size your shards. Keeping in mind what kind of queries you'll run, and over which time range, may help you find the right balance between index size and the number of indices Elasticsearch has to query to fulfil your request.

Unfortunately, there is no one-size-fits-all sharding strategy. A strategy that works in one environment may not scale in another. A good sharding strategy must account for your infrastructure, use case, and performance expectations.[...]
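To get a feel for that balance, a back-of-the-envelope calculation helps. The sketch below assumes time-based index rotation and a known daily log volume. All input numbers are hypothetical, and the function is only arithmetic, not a Graylog or Elasticsearch API. Elastic's "Size your shards" guide suggests aiming for shards roughly in the 10 GB to 50 GB range, which is what the results can be compared against.

```python
import math


def shard_plan(daily_gb, retention_days, rotation_days, shards_per_index, replicas=0):
    """Estimate index count, shard size and total shard count for one index set."""
    index_gb = daily_gb * rotation_days
    index_count = math.ceil(retention_days / rotation_days)
    return {
        "index_gb": index_gb,
        "shard_gb": index_gb / shards_per_index,
        "index_count": index_count,
        # Every primary shard has `replicas` additional copies.
        "total_shards": index_count * shards_per_index * (1 + replicas),
    }


# Hypothetical workload: 5 GB of logs per day, kept for 90 days.
daily = shard_plan(daily_gb=5, retention_days=90, rotation_days=1, shards_per_index=4)
weekly = shard_plan(daily_gb=5, retention_days=90, rotation_days=7, shards_per_index=1)

print("daily rotation, 4 shards:", daily)
print("weekly rotation, 1 shard:", weekly)
```

With these numbers, daily rotation with 4 shards gives many small shards, while weekly rotation with a single shard gives far fewer shards that are closer to the suggested size. Which one is better for you still depends on the time ranges you usually search.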

How many index sets should I have for my use case?

A stream is configured with exactly one Index Set; you can't assign multiple Index Sets to one stream. As for the other points, I have already answered them above.
However, note that you can configure multiple streams on the same Index Set. This is very useful if you want those streams to use the same underlying data and only want to restrict access to a subset of logs for certain users: you can route messages between the streams based on whatever conditions you want, and if those streams all share the same Index Set, the messages are not duplicated.
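The relationship can be pictured with a small model like the one below. It is a toy illustration, not Graylog code: the stream names, rules and Index Set names are invented, and real stream rules are configured in Graylog itself.

```python
def route(message, streams):
    """Return the matching stream names and the index sets the message is written to."""
    matched = [s for s in streams if s["rule"](message)]
    stream_names = [s["name"] for s in matched]
    # A set, because several streams sharing one index set means one stored copy.
    index_sets = {s["index_set"] for s in matched}
    return stream_names, index_sets


# Hypothetical streams: two views on the same RFID data, one separate ERP stream.
streams = [
    {"name": "rfid-all", "index_set": "rfid",
     "rule": lambda m: m["source"].startswith("rfid")},
    {"name": "rfid-site-a", "index_set": "rfid",
     "rule": lambda m: m["source"].startswith("rfid") and m.get("site") == "A"},
    {"name": "erp", "index_set": "erp",
     "rule": lambda m: m["source"] == "erp"},
]

names, index_sets = route({"source": "rfid-gate-3", "site": "A"}, streams)
print(names, index_sets)
```

The message lands in both RFID streams but only in one Index Set, so you can give a user access to rfid-site-a alone without storing that site's logs a second time.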
