
I'm looking to make very frequent backups (every hour) of postgres data on several VMs (say 20-50) to the same archive server.

Here's more detail if needed: ideally, the system should support the load of 80 to 200 databases spread across all the VMs. The databases range from small (10 MB - 100 MB) to medium sized (500 MB - 2 GB), each composed of hundreds of tables; a small portion of these tables can easily contain several thousand up to around a million rows. Changes to the databases are mostly new records, some updates, and not much deletion. The available bandwidth is 100 Mbit/s.

As I'm already doing this for standard filesystems using incremental backups (rsync), I'm wondering whether something similar could be achieved with postgres database backups.

I have several possible options:

  • I could choose to put the databases on a snapshottable filesystem (aufs docker-style, ZFS, btrfs), but some of these seem to really slow down postgres.
  • I'm ready to use WAL if necessary
  • It would be better if I could back up at the database level, as I do not need to back up the whole postgres data directory, only the customers' databases.
  • I have some disk space on the postgres server that could keep an intermediate backup.
  • I can afford some reasonable CPU workload on the VM side, but would rather minimize it on the backup server, as it will add up with the number of databases to back up.
  • I'm not really looking for continuous backup or PITR recovery options. My backup server uses a filesystem (btrfs) that can take efficient periodic snapshots of the backups. It's good enough.

I've thought about:

  • using rsync in combination with pg_dump run locally on the server, but I'm not sure which of the different dump formats I should use for maximum efficiency (a rough sketch of what I mean follows this list).
  • using a snapshottable filesystem that can send binary diffs at the block level (btrfs and ZFS are good at this), with or without a local dump (same question about which dump format to use).
  • I've learned about the existence of pg_rman. I don't really know whether it can be relied upon, and its setup and processes seem somewhat heavier than pg_dump. Would it support incremental-only backups? And can we have a practical format on the backup side?
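For the first bullet, here is roughly what I have in mind, run hourly from cron on each VM. The database name, paths and archive host below are just placeholders, not my actual setup:

```sh
#!/bin/sh
# Rough sketch of the pg_dump + rsync idea; run hourly from cron on each VM.
# DB name, paths and the archive host are placeholders.
set -e

DB=customer_db
LOCAL=/var/backups/pg/$DB.sql                       # intermediate copy kept on the VM
ARCHIVE=backup@archive-server:/srv/pg-backups/$(hostname)/

# Plain-text dumps seem the friendliest to delta transfer; the custom format
# is compressed by default, which would defeat rsync's rolling checksums.
pg_dump --format=plain --file="$LOCAL" "$DB"

# Only the changed blocks of the dump should travel over the wire.
rsync -az --partial "$LOCAL" "$ARCHIVE"
```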

Is there any way other than incremental backups to keep bandwidth usage low?

So... how can I minimize bandwidth in my postgres backup scenario?

vaab
  • Um, what hypervisor are you using? Is this VMware? – ewwhite Sep 16 '14 at 03:43
  • @ewwhite Our provider is using Xen, we'll use docker in the provided VMs, and the current archive server is an LXC running on a dedicated server we own, but I would prefer solutions independent of any specific hypervisor or VM technology. I must add that I have very limited experience so far with hypervisor technology, but a good background with LXC and some knowledge of docker, KVM and openvz, so if you feel some important information related to the virtualization technology should be mentioned, feel free to enlighten me! – vaab Sep 16 '14 at 04:51
  • Does postgres not support differential or log-based backups? I use SQL Server, but we do backups every 15 minutes on a terabyte database with active changes, and a backup there is regularly just 200-500 MB. We make a full backup daily, and log backups in between. – TomTom Sep 16 '14 at 07:22
  • @TomTom No, postgres doesn't support incremental backups out of the box. However, `pg_rman` does this, but I have no idea how much I can rely on it, whether it is practical, or even whether it is really more efficient than the other proposed solutions. – vaab Sep 16 '14 at 08:38
  • Whoa. That makes postgres pretty much unusable for any large database deployment with high backup requirements. Amazing. I would focus on getting pg_rman working. – TomTom Sep 16 '14 at 08:53

2 Answers


You're attempting to solve a well-practiced problem (in real database systems) using an awkward solution; this is understandable for most people coming from a background in smaller database systems (and I've done a very similar thing myself with MySQL and slogged through the consequences of bandwidth blowout).

You should use PostgreSQL's replication features; see http://www.postgresql.org/docs/9.3/interactive/high-availability.html
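As a minimal sketch of what that can look like (assuming a 9.3-era primary that already allows replication connections in pg_hba.conf and has wal_level, max_wal_senders and hot_standby configured; the host name, replication role and data directory below are placeholders):

```sh
# Clone the primary's data directory onto the standby/archive host.
pg_basebackup -h primary-vm -U replicator -D /var/lib/postgresql/9.3/standby -X stream

# A 9.3-era standby follows the primary via recovery.conf in its data directory.
cat > /var/lib/postgresql/9.3/standby/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=primary-vm user=replicator'
EOF
```

Note that streaming replication works at the cluster (instance) level, not per database.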

Cameron Kerr

Make the dump in SQL (plain-text) format. Keep one full copy on the local VM, refreshed, say, every day. Then dump a new copy and make a diff against the full copy. Copy the full copy to the archive once a day and only the diff at other times. To restore, you will have to patch the full copy with the diff and execute the resulting SQL file.
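A rough sketch of that cycle (database name and paths are examples only):

```sh
#!/bin/sh
# Hourly job on the VM: dump, then keep only the difference against the daily full dump.
# Database name and paths are examples only.
set -e

DB=customer_db
DIR=/var/backups/pg
STAMP=$(date +%Y%m%d-%H%M)

pg_dump --format=plain --file="$DIR/current.sql" "$DB"

# diff exits 1 when the files differ, so tolerate that under `set -e`.
diff -u "$DIR/full.sql" "$DIR/current.sql" > "$DIR/$DB-$STAMP.diff" || true
# ...then ship only $DIR/$DB-$STAMP.diff to the archive server (scp/rsync).

# Restore later by patching the full copy and replaying the SQL:
#   patch -o restored.sql "$DIR/full.sql" "$DIR/$DB-$STAMP.diff"
#   psql -d "$DB" -f restored.sql
```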

Kazimieras Aliulis
  • On MySQL, a good way is to do a consistent backup and later copy just the binary logs; then you can do point-in-time recovery by replaying a binary log on the backup, but it may take longer to recover, and I do not know how to do that on postgres :) – Kazimieras Aliulis Sep 16 '14 at 07:02
  • Hm.. Maybe this one could help: http://www.postgresql.org/docs/9.3/static/continuous-archiving.html – Kazimieras Aliulis Sep 16 '14 at 07:03
  • Do you have a good reason to think that the SQL plain-text format and diff solution is better than a binary custom-format and rdiff (rsync) solution? – vaab Sep 16 '14 at 07:13
  • With a snapshot that would be more efficient, but snapshots slow down the database, and a non-binary backup is considered safer. If you dump and then rsync, that most probably will not be efficient, because the dump will change the modification time of the files and rsync will have to compare the contents of every file between the local and remote hosts. So it is better to do the diffs locally instead of relying on the remote copy. – Kazimieras Aliulis Sep 16 '14 at 07:48