
I'm planning on using the Elasticsearch S3 cloud plugin to create snapshots of our ES cluster. This all looks fairly straightforward, but I'm wondering whether it's possible to integrate it into our existing backup strategy.

With our other data stores we take a full backup every hour. We keep the latest 24 hours, one for each of the past 7 days, one for each of the past 4 weeks, and one for each of the last 2 months...

Is it possible to create snapshots in this way, or would I be better off using the FS snapshot repository and then zipping up the contents and hooking into the same upload procedure?

My only concern is that the snapshot feature sounds like it essentially creates incremental backups, which would mean this scheme wouldn't work. It would be good to know how others back up their ES clusters.

Many thanks

justcompile

1 Answer


To quote the documentation:

The index snapshot process is incremental. In the process of making the index snapshot Elasticsearch analyses the list of the index files that are already stored in the repository and copies only files that were created or changed since the last snapshot. That allows multiple snapshots to be preserved in the repository in a compact form.
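In practice that means the first snapshot into the repository is effectively a full, and every subsequent one only uploads new or changed segment files. A minimal sketch of registering an S3 repository and taking a date-stamped snapshot (the cluster URL, repository name, bucket, and naming scheme are all assumptions):

```shell
# Sketch only: cluster URL, repo name, and bucket are assumptions.
ES="http://localhost:9200"
SNAP="hourly-$(date -u +%Y-%m-%d-%H)"   # date-stamped snapshot name

# Register an S3 repository (requires the S3 cloud plugin).
# "|| true" keeps the sketch from aborting without a live cluster.
curl -s -XPUT "$ES/_snapshot/s3_backup" -d '{
  "type": "s3",
  "settings": { "bucket": "es-backups", "base_path": "cluster1" }
}' || true

# Take a snapshot; only new/changed segment files are uploaded.
curl -s -XPUT "$ES/_snapshot/s3_backup/$SNAP?wait_for_completion=true" || true
```

Run hourly from cron, the date-stamped name gives each snapshot a unique identity that a retention job can later reason about.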

Having spent a career in backup and disaster-recovery planning, I understand your concern. Like all backups, dealing with this strategy requires a bit of analysis. Some things to consider:

  • Data turnover rate. If you only ever keep (n) weeks of data in the index, you can tolerate a simpler backup strategy. If the index is an accumulator table where nothing is ever deleted and it just grows over time, a different style of backup is worth considering.
  • Growth rate. Like turnover, how big it gets over time.
  • Backup storage constraints. Pretty obvious. If you run continual incrementals and have a high turnover rate, your backup repo will contain a lot of data it no longer needs.
  • Backup I/O constraints. While the operation is non-blocking, it is not zero-resource. Incrementals are faster than fulls, but fulls may be needed for other reasons.

The snapshot procedure is a continual-incremental strategy. For an accumulator table (no turnover), it is sufficient to take one full and keep the incrementals forever. Except...

During snapshot initialization, information about all previous snapshots is loaded into the memory, which means that in large repositories it may take several seconds (or even minutes) for this command to return even if the wait_for_completion parameter is set to false.

This is your incentive not to keep everything. An hourly snapshot history going back two years will take up a lot of heap. Happily, there is DELETE functionality to prune this history.
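Pruning is a single API call per snapshot. A sketch, with hypothetical names throughout:

```shell
# Sketch: prune one snapshot; all names here are assumptions.
ES="http://localhost:9200"
REPO="s3_backup"
OLD="hourly-2015-01-03-07"
url="$ES/_snapshot/$REPO/$OLD"

# ES removes the snapshot record and reaps any segment files no
# longer referenced by a remaining snapshot.
curl -s -XDELETE "$url" || true
```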

If you do have a high turnover rate, definitely plan on issuing DELETE to older snapshots over time. Per the documentation, the ES snapshot process is smart enough to correctly handle the data purge process. Your GFS policy of snapshots can definitely be done, even with a 'continual-incremental' backup system. Think of it like a deduplicating backup-to-disk system; you don't purge the dedupe cluster every 2 months, you let the backup system reap out the blocks/files no longer needed.
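The reaper itself can be a small cron job that classifies each date-stamped snapshot and deletes the ones the GFS policy no longer wants. A deliberately simplified sketch of the classification step, assuming snapshots are named `hourly-YYYY-MM-DD-HH` and GNU date is available (a real GFS policy would add the weekly and monthly tiers the same way):

```shell
# Simplified GFS rule: hourlies older than 24 h are pruned unless
# they were taken at midnight (those become the daily tier).
# Naming scheme (hourly-YYYY-MM-DD-HH) and GNU date are assumptions.
should_prune() {
  name=$1; now=$2
  ts=${name#hourly-}            # e.g. 2015-01-03-07
  day=${ts%-*}                  # 2015-01-03
  hour=${ts##*-}                # 07
  snap=$(date -u -d "$day $hour:00" +%s) || return 1
  age_h=$(( (now - snap) / 3600 ))
  [ "$age_h" -gt 24 ] && [ "$hour" != "00" ]
}

# Typical use from cron: list snapshots, DELETE the prunable ones.
# for name in $(list_snapshots); do
#   should_prune "$name" "$(date -u +%s)" && \
#     curl -s -XDELETE "$ES/_snapshot/$REPO/$name"
# done
```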

If you need to offsite this stuff, you can copy the snapshot-repo itself and do the usual media rotation on it. The es-repo for snapshots is for hot backup/restore. If you need to load an older one for some reason, you should be able to restore the offline copy over the hot copy and call a restore from the ES API and it will load from the data you put into the snapshot repo.
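Pulling an older copy back in can then be sketched like this (URL, repository, and snapshot names are assumptions; the indices being restored generally need to be closed or deleted first):

```shell
# Sketch: restore a snapshot after copying the offsite repository
# back into place. All names are assumptions.
ES="http://localhost:9200"
REPO="s3_backup"
SNAP="hourly-2015-01-03-00"

# Close (or delete) the affected indices first, then:
curl -s -XPOST "$ES/_snapshot/$REPO/$SNAP/_restore?wait_for_completion=true" || true
```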

sysadmin1138