I have a new database of 2 million records (around 3 GB when dumped as JSON) produced every few days, and I want to load it into Elasticsearch quickly.
Here's what I do right now:
- Create a new index and set up the mappings I want
- Set `refresh_interval` to `-1`
- Split all documents into batches of 300-500 documents each
- Send them to the `bulk` index API batch after batch (waiting for the results to come back before sending the next batch, of course). I also tried doing it concurrently, 3-5 batches at a time.
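The batching and request-building steps above look roughly like this (a simplified sketch using only the standard library; the index name and the `id` field are placeholders for whatever my documents actually use):

```python
import json


def batches(docs, size=500):
    """Yield successive slices of `docs`, `size` documents each."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]


def bulk_body(index, batch):
    """Build the NDJSON body the _bulk API expects: one action line
    followed by the document source, for each document."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline
```

Each body is then POSTed to the `_bulk` endpoint with `Content-Type: application/x-ndjson`.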
After ~10% of the documents are processed, the Elasticsearch bulk API starts timing out from time to time (the request timeout is 30 seconds). I added retries, but by 30-40% some batches fail around 10 times in a row.
I tried tuning the numbers. With smaller batches it's just too slow; with bigger batches or more concurrency it just fails faster.
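The retry logic I added is roughly this (a sketch; `send` stands in for whatever function posts one bulk request, and the backoff parameters are illustrative, not what I claim is optimal):

```python
import time


def send_with_retries(send, body, max_retries=10, base_delay=1.0):
    """Call send(body) until it succeeds or retries are exhausted,
    backing off exponentially between failed attempts."""
    for attempt in range(max_retries):
        try:
            return send(body)
        except TimeoutError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)
```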
The requests are sent from the same machine Elasticsearch runs on, and it has plenty of memory:
```
$ free -g
                   total     used     free   shared  buffers   cached
Mem:                  31       24        6        0        0        8
-/+ buffers/cache:             15       15
Swap:                 15        6        9
```
Nothing much else is running on the server at the time.
So, what might I be doing wrong? I looked for the one true way to index lots of documents but couldn't find one.