
I'd like to synchronize some MySQL tables to an Elasticsearch index, you know, for search.

But almost every example I can find shows a big fat "SELECT * FROM". That's OK for an example, but I'm in production, approaching 10M rows in one of my tables, so I don't even want to try a "SELECT *" every other minute.

I'm used to Sphinx and its

UPDATE ... SET current_id = last_id, last_id = MAX(id);
SELECT * WHERE id > (SELECT current_id...);

kind of strategy.
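
Something concrete along those lines might look like this (a rough sketch of the main+delta idea; the one-row counter table and all names are just placeholders):

-- Remember the window of new ids since the last run in a one-row counter table.
UPDATE index_counter
SET current_id = last_id,
    last_id = (SELECT MAX(id) FROM products);

-- Index only the rows created inside that window.
SELECT p.*
FROM products p
WHERE p.id > (SELECT current_id FROM index_counter)
  AND p.id <= (SELECT last_id FROM index_counter);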

It's almost OK, but only "almost": an item typically has a high probability of being modified in its first minutes of life, so it gets indexed "newborn" and stays that way.

So I can imagine better strategies, like triggers storing the primary key in a "to_index" table on insert and on update, the river then becoming a

SELECT * FROM ... WHERE id IN (SELECT id ... FROM to_index)

I've never tried it, but it seems better, or at least a better starting point.
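
Roughly, I picture something like this (untested, all table and trigger names are placeholders):

-- Queue of primary keys waiting to be (re)indexed.
CREATE TABLE to_index (
    id INT UNSIGNED NOT NULL PRIMARY KEY
);

DELIMITER //

-- Queue every new row.
CREATE TRIGGER products_after_insert AFTER INSERT ON products
FOR EACH ROW
BEGIN
    INSERT IGNORE INTO to_index (id) VALUES (NEW.id);
END//

-- Queue every modified row; INSERT IGNORE avoids duplicate-key errors
-- when the same row is touched several times between river runs.
CREATE TRIGGER products_after_update AFTER UPDATE ON products
FOR EACH ROW
BEGIN
    INSERT IGNORE INTO to_index (id) VALUES (NEW.id);
END//

DELIMITER ;

The river would then select only the queued rows, and to_index would be emptied once they are indexed.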

There's also the question of deleted rows...
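
Maybe a delete trigger could feed the same table with a flag telling the indexer to remove the document instead of reindexing it, but I have no idea if that's how people usually handle it (again, names are placeholders):

-- Flag queued ids that correspond to deleted rows.
ALTER TABLE to_index ADD COLUMN is_deleted TINYINT(1) NOT NULL DEFAULT 0;

DELIMITER //

-- Queue deleted rows too, flagged so the indexer drops the document.
CREATE TRIGGER products_after_delete AFTER DELETE ON products
FOR EACH ROW
BEGIN
    REPLACE INTO to_index (id, is_deleted) VALUES (OLD.id, 1);
END//

DELIMITER ;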

But there may be well-known strategies, well discussed and well tested, that I just haven't found. Am I missing something big here? Or am I the only one trying to avoid a SELECT * on millions of rows?


1 Answer


The README is a bit long, but there is a section about exactly this:

https://github.com/jprante/elasticsearch-river-jdbc#how-to-select-incremental-data-from-a-table

The idea is to keep a microsecond-level timestamp of the changes to each row, and have the river query only the rows modified since its last run:

{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "jdbc:mysql://localhost:3306/test",
        "user" : "",
        "password" : "",
        "sql" : [
            {
                "statement" : "select * from \"products\" where \"mytimestamp\" > ?",
                "parameter" : [ "$river.state.last_active_begin" ]
            }
        ],
        "index" : "my_jdbc_river_index",
        "type" : "my_jdbc_river_type"
    }
}
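
On the MySQL side this assumes a column that gets bumped on every change. A minimal sketch (requires MySQL 5.6.4+ for fractional-second timestamps; the column name just has to match the one used in the river's SQL):

-- High-resolution "last modified" timestamp, maintained automatically
-- by MySQL on insert and on every update.
ALTER TABLE products
    ADD COLUMN mytimestamp TIMESTAMP(6) NOT NULL
        DEFAULT CURRENT_TIMESTAMP(6)
        ON UPDATE CURRENT_TIMESTAMP(6);

-- An index on the timestamp keeps the incremental query cheap.
CREATE INDEX idx_products_mytimestamp ON products (mytimestamp);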