I have a new database of 2 million records (around 3 GB when dumped as JSON) produced every few days, and I want to load it into Elasticsearch quickly.
Here's what I do right now:
- Create a new index and set up the mappings I want
- Set `refresh_interval` to `-1`
- Split all documents into batches of 300-500 documents each
- Send them to the `bulk` index API batch after batch (waiting for the results to come back before sending the next batch, of course). I also tried doing it concurrently, 3-5 batches at a time.
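The batching and request-building steps above look roughly like this (a simplified sketch using only the standard library; the index name and the `id` field are placeholders for whatever my documents actually use):

```python
import json


def batches(docs, size=500):
    """Yield successive slices of `docs`, `size` documents each."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]


def bulk_body(index, batch):
    """Build the NDJSON body the _bulk API expects: one action line
    followed by the document source, for each document."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline
```

Each body is then POSTed to the `_bulk` endpoint with `Content-Type: application/x-ndjson`.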
After ~10% of the documents are processed, the Elasticsearch bulk API starts timing out from time to time (the request timeout is 30 seconds). I added retries, but by 30-40% some batches fail around 10 times in a row.
I tried tuning the numbers. With smaller batches it's just too slow; with bigger batches or more concurrency it just fails faster.
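The retry logic I added is roughly this (a sketch; `send` stands in for whatever function posts one bulk request, and the backoff parameters are illustrative, not what I claim is optimal):

```python
import time


def send_with_retries(send, body, max_retries=10, base_delay=1.0):
    """Call send(body) until it succeeds or retries are exhausted,
    backing off exponentially between failed attempts."""
    for attempt in range(max_retries):
        try:
            return send(body)
        except TimeoutError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)
```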
The requests are sent from the same machine Elasticsearch runs on, and it has plenty of memory:
```
$ free -g
                   total     used     free   shared  buffers   cached
Mem:                  31       24        6        0        0        8
-/+ buffers/cache:             15       15
Swap:                 15        6        9
```
Nothing much else is running on the server at the time.
So, what might I be doing wrong? I looked for the one true way to index lots of documents but couldn't find one.