I have a new database of 2 million records (around 3 GB when dumped as JSON) produced every few days. I want to load it into Elasticsearch quickly.

This is what I do right now:

  1. Create a new index and set up the couple of mappings that I want.
  2. Set refresh_interval to -1.
  3. Split all documents into batches of 300-500 documents each.
  4. Send them to the bulk index API batch after batch (waiting for each result to come back before sending the next batch, of course). I also tried doing it concurrently, with 3-5 batches in flight at a time.
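
For illustration, steps 2-4 could be sketched roughly as follows. This is a minimal Python sketch using only the standard library, not the exact code in use. The URL `http://localhost:9200`, the index name `records`, and the helper names `batches`, `bulk_body`, `request` and `index_all` are all assumptions, and it assumes a recent Elasticsearch version where the bulk endpoint is `/<index>/_bulk` and takes newline-delimited JSON.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node on the default port
INDEX = "records"                 # assumption: hypothetical index name


def batches(docs, size=500):
    """Yield successive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(batch):
    """Build a newline-delimited JSON bulk body: one action line
    followed by one source line per document, newline-terminated."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {}}))  # let Elasticsearch assign the ID
        lines.append(json.dumps(doc))
    return ("\n".join(lines) + "\n").encode("utf-8")


def request(method, path, body, content_type="application/json", timeout=30):
    """Send one HTTP request to the node and return the decoded JSON reply."""
    req = urllib.request.Request(
        ES_URL + path,
        data=body,
        method=method,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def index_all(docs, size=500):
    """Disable refresh (step 2), then send batches one at a time (steps 3-4)."""
    settings = json.dumps({"index": {"refresh_interval": "-1"}}).encode("utf-8")
    request("PUT", f"/{INDEX}/_settings", settings)
    for batch in batches(docs, size):
        result = request("POST", f"/{INDEX}/_bulk", bulk_body(batch),
                         "application/x-ndjson")
        if result.get("errors"):
            raise RuntimeError("some documents in the batch were rejected")
```

Nothing here runs until `index_all(docs)` is called with an iterable of dicts, so the batching and body-building helpers can be checked without a running node.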

After ~10% of the documents have been processed, the Elasticsearch bulk API starts timing out from time to time (the request timeout is 30 seconds). I added retries, but at around 30-40% some batches start failing about 10 times in a row.
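
The retries could take a shape like the sketch below. Exponential backoff between attempts is an assumption here, not something stated above, and `with_retries` and its parameters are hypothetical names. `send` stands for any zero-argument callable that performs one bulk request.

```python
import time


def with_retries(send, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it succeeds or `attempts` is exhausted,
    doubling the pause after each failure (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return send()
        except OSError:
            # Socket timeouts and urllib.error.URLError both derive from OSError.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Backing off gives the node time to work through queued bulk requests instead of receiving the same batch again immediately.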

I tried varying the batch size and the concurrency. With smaller batches it's just too slow. With bigger batches or more concurrency it just fails sooner.

The requests are sent from the same machine that Elasticsearch runs on. I have a lot of memory:

$ free -g
             total       used       free     shared    buffers     cached
Mem:            31         24          6          0          0          8
-/+ buffers/cache:         15         15
Swap:           15          6          9

There's nothing much else going on on the server at the time.

So, what might I be doing wrong? I tried looking for the one true way to index lots of documents but couldn't find one.

valya