My goal is to use ElasticSearch v1.3.2 to analyze product cross-sales, so I need to filter for the receipts of interest (those containing an alcoholic product, for example) and find the top-selling products in each category. New data would be indexed monthly, and in the meantime we'd be running analytical queries on it.
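To make the kind of query concrete, here is a minimal sketch of a request body for "receipts containing a given category, then top products per category". The field names (`products`, `products.category`, `products.product_id`, `products.total_value`) and the helper `build_cross_sales_query` are hypothetical placeholders, not my real mapping, and it assumes the products are mapped as a nested type.

```python
import json


def build_cross_sales_query(category, top_n=10):
    """Build a search body: filter receipts by a nested product category,
    then rank products by summed value inside each category.

    All field names here are hypothetical placeholders.
    """
    return {
        "size": 0,  # only the aggregations are of interest, not the hits
        "query": {
            "filtered": {
                "filter": {
                    # keep receipts with at least one product in `category`
                    "nested": {
                        "path": "products",
                        "filter": {"term": {"products.category": category}},
                    }
                }
            }
        },
        "aggs": {
            "products": {
                # step into the nested product documents
                "nested": {"path": "products"},
                "aggs": {
                    "by_category": {
                        "terms": {"field": "products.category"},
                        "aggs": {
                            "top_products": {
                                "terms": {
                                    "field": "products.product_id",
                                    "size": top_n,
                                    # order buckets by the "sales" sub-aggregation
                                    "order": {"sales": "desc"},
                                },
                                "aggs": {
                                    "sales": {"sum": {"field": "products.total_value"}}
                                },
                            }
                        },
                    }
                },
            }
        },
    }


body = build_cross_sales_query("alcohol", top_n=5)
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `_search` request; the actual queries I benchmark may differ in detail.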
Each document is a single receipt with overall info (total sales, store ID, etc.), individual product info (product ID, number of products, total value) and aggregated info on different levels of the product tree. An average receipt has 8 items, so each receipt document has 10 - 100 nested documents. In total I have 50 million receipts with 390 million product sub-documents (plus some more for the aggregated product tree levels).
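As a rough illustration of that document shape, here is a sketch of one receipt. Every field name and value is made up for the example (the real mapping has more fields), and `nested_doc_count` is a hypothetical helper that counts the sub-documents one receipt contributes.

```python
# Hypothetical receipt document: overall info, per-product info, and
# aggregated info per product-tree level. All names and values are invented.
receipt = {
    "receipt_id": "r-0001",
    "store_id": 42,
    "total_sales": 23.40,
    "products": [
        {"product_id": "p-100", "category": "alcohol", "quantity": 2, "total_value": 15.90},
        {"product_id": "p-200", "category": "snacks", "quantity": 1, "total_value": 7.50},
    ],
    "category_totals": [
        {"level": 1, "category": "alcohol", "total_value": 15.90},
        {"level": 1, "category": "snacks", "total_value": 7.50},
    ],
}


def nested_doc_count(doc):
    """Count the nested sub-documents (products plus aggregated levels)."""
    return len(doc["products"]) + len(doc["category_totals"])


# Average product sub-documents per receipt, from the totals quoted above.
avg_products_per_receipt = 390_000_000 / 50_000_000

print(nested_doc_count(receipt))
print(avg_products_per_receipt)
```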
Currently a single ES node is running on an Ubuntu virtual machine with 16 GB of RAM (split 50/50 between ES and the OS), with the data on a virtual disk on an HDD. The total index size is about 120 GB, and all fields have "format: doc_values" because of earlier out-of-memory problems. Once all the data is cached in RAM I get response times of 500 - 4000 ms, but once the data gets sufficiently large, ES grinds to a halt. I have 140 shards (10 per index), which vary from 200 MB to 2 GB in size.
After I run some benchmark queries, ES loses its performance, constantly uses 50% of the CPU (even when no queries are running), and the head plugin's query to "localhost:9200/stats?all=true" takes up to 45 seconds. I installed a development version of Marvel, and it started reporting 404 on /.marvel-kibana/appdata/marvelOpts queries.
Do I really need more RAM and/or more nodes (currently RAM is 13% of the full data size), or are there some tweaks I should try? I'd like to index 4x the current amount of data. Earlier I was testing on an 8 GB virtual machine, and I got similar symptoms with half the data. I can provide more information if it would be useful.
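For reference, the sizing numbers above work out as in the following back-of-the-envelope sketch. The 4x projection assumes the index size grows linearly with the data volume, which is my assumption and not something I have measured.

```python
# Figures taken from the description above.
ram_gb = 16
heap_gb = ram_gb / 2            # 50/50 split between ES and the OS
index_gb = 120
shards = 140
shards_per_index = 10

indices = shards // shards_per_index     # number of indices
avg_shard_gb = index_gb / shards         # average shard size in GB
ram_ratio = ram_gb / index_gb            # RAM as a fraction of the data size

# Assumption: index size scales linearly with the amount of data.
projected_gb = index_gb * 4
projected_ratio = ram_gb / projected_gb

print(f"indices: {indices}, heap: {heap_gb:.0f} GB")
print(f"average shard size: {avg_shard_gb:.2f} GB")
print(f"RAM / data now: {ram_ratio:.1%}")
print(f"RAM / data at 4x ({projected_gb} GB): {projected_ratio:.1%}")
```

So with the current machine, RAM would drop from roughly 13% to roughly 3% of the data size at the target volume.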