Elasticsearch is using way too much disk space

Question

I have a CentOS 6.5 server on which I installed Elasticsearch 1.3.2.

My elasticsearch.yml configuration file is a minimal modification of the one shipping with elasticsearch as a default. Once stripped of all commented lines, it looks like:

cluster.name: xxx-kibana

node:
    name: "xxx"
    master: true
    data: true

index.number_of_shards: 5

index.number_of_replicas: 1

path:
    logs: /log/elasticsearch/log
    data: /log/elasticsearch/data


transport.tcp.port: 9300

http.port: 9200

discovery.zen.ping.multicast.enabled: false

Elasticsearch should have compression ON by default, and I read various benchmarks putting the compression ratio from as low as 50% to as high as 95%. Unluckily, the compression ratio in my case is -400%, or in other words: data stored with ES takes 4 times as much disk space than the text file with the same content. See:

12K     logstash-2014.10.07/2/translog
16K     logstash-2014.10.07/2/_state
116M    logstash-2014.10.07/2/index
116M    logstash-2014.10.07/2
12K     logstash-2014.10.07/4/translog
16K     logstash-2014.10.07/4/_state
127M    logstash-2014.10.07/4/index
127M    logstash-2014.10.07/4
12K     logstash-2014.10.07/0/translog
16K     logstash-2014.10.07/0/_state
109M    logstash-2014.10.07/0/index
109M    logstash-2014.10.07/0
16K     logstash-2014.10.07/_state
12K     logstash-2014.10.07/1/translog
16K     logstash-2014.10.07/1/_state
153M    logstash-2014.10.07/1/index
153M    logstash-2014.10.07/1
12K     logstash-2014.10.07/3/translog
16K     logstash-2014.10.07/3/_state
119M    logstash-2014.10.07/3/index
119M    logstash-2014.10.07/3
622M    logstash-2014.10.07/  # <-- This is the total!

versus:

6,3M    /var/log/td-agent/legacy_api.20141007_0.log
8,0M    /var/log/td-agent/legacy_api.20141007_10.log
7,6M    /var/log/td-agent/legacy_api.20141007_11.log
6,7M    /var/log/td-agent/legacy_api.20141007_12.log
8,0M    /var/log/td-agent/legacy_api.20141007_13.log
7,6M    /var/log/td-agent/legacy_api.20141007_14.log
7,6M    /var/log/td-agent/legacy_api.20141007_15.log
7,7M    /var/log/td-agent/legacy_api.20141007_16.log
5,6M    /var/log/td-agent/legacy_api.20141007_17.log
7,9M    /var/log/td-agent/legacy_api.20141007_18.log
6,3M    /var/log/td-agent/legacy_api.20141007_19.log
7,8M    /var/log/td-agent/legacy_api.20141007_1.log
7,1M    /var/log/td-agent/legacy_api.20141007_20.log
8,0M    /var/log/td-agent/legacy_api.20141007_21.log
7,2M    /var/log/td-agent/legacy_api.20141007_22.log
3,8M    /var/log/td-agent/legacy_api.20141007_23.log
7,5M    /var/log/td-agent/legacy_api.20141007_2.log
7,3M    /var/log/td-agent/legacy_api.20141007_3.log
8,0M    /var/log/td-agent/legacy_api.20141007_4.log
7,5M    /var/log/td-agent/legacy_api.20141007_5.log
7,5M    /var/log/td-agent/legacy_api.20141007_6.log
7,8M    /var/log/td-agent/legacy_api.20141007_7.log
7,8M    /var/log/td-agent/legacy_api.20141007_8.log
7,2M    /var/log/td-agent/legacy_api.20141007_9.log
173M    total

What am I doing wrong? Why is data not being compressed?

I have provisionally added index.store.compress.stored: 1 to my configuration file, as I found that in the elasticsearch 0.19.5 release notes (that's when the store compression came out first), but I'm not yet able to tell if it is making a difference, and anyhow compression should be ON by default, nowadays...

Did you ever consider the overhead it takes to store and index that data? This is where the difference comes from. — mailq, Oct 10 '14 at 10:10
@mailq - AFAIK, Elastic compresses *both* data and indices, and you still should notice a *decrease* in space usage on your disk, compared to text logs. I assume mileage may vary according to log structure, but logs are typically very repetitive in nature, so indexing shouldn't be the most space-consuming of operations. ...or am I getting this wrong? — mac, Oct 10 '14 at 10:20
Logs are not really repetitive. User A logs in at time 1. User B logs in at time 2. What is repetitive? Both tuples have to be indexed and stored separately. In addition to the log entry itself. — mailq, Oct 10 '14 at 10:29
Try those recommendations. https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk — mailq, Oct 10 '14 at 10:42
@mailq - Supercool maliq, thank you a ton. If you expand on your comment and write a proper answer, I'd be glad to mark it as accepted (otherwise I will do it later on, but don't want to steal your thunder!). — mac, Oct 10 '14 at 11:20

mailq · Accepted Answer · 2014-10-10T11:50:20.397

Elasticsearch does not shrink your data automagically. This is true for any database. Beside storing the raw data, each database has to store metadata along with it. Normal databases only store an index (for faster search) for the columns the db-admin chose upfront. ElasticSearch is different as it indexes every column by default. Thus making the index extremely large, but on the other hand gives perfect performance while retrieving data.

In normal configurations you see an increase of 4 to 6 times of the raw data after indexing. Although it heavily depends on the actual data. But this is actually intended behavior.

So to decrease the database size, you have to go the other way around like you did in RDBMs: Exclude columns from being indexed or stored that you do not need to be indexed.

Additionally you could turn on compression, but this will only improve when your "documents" are large, which is probably not true for log file entries.

There are some comparisons and and useful tips here: https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk

But remember: Searching comes with a cost. The cost to pay is disk space. But you gain flexibility. If your storage size exceeds, then grow horizontally! This is where ElasticSearch wins.

Elasticsearch is using way too much disk space

1 Answers1