0

There are two servers. One is the on-site database server; the other is the off-site backup server. Since the on-site database server already contains just shy of 900G of data, backups are important. Originally, we used a daily pg_dumpall, but for performance reasons that's no longer an option.

Since an upgrade of the database server, including also an upgrade of postgresql itself (from postgresql 9.1 to 9.4) was due anyway, I decided to try to move to use WAL shipping to the off-site server, so that backups would work. Here's what I've done so far:

  • Create a new cluster on the new server
  • Stop the database server on the new server, rsync the data directory to the off-site server (so that I have a base backup)
  • Configure postgresql.conf on the new server so that wal_level is set to hot_standby, archive_mode to on, and archive_command to test ! -f /srv/pg_wal/%f && cp %p /srv/pg_wal/%f
  • Start the cluster on the new server
  • ssh oldserver pg_dumpall | psql to read in the data, which generates large amounts of WAL segments
  • Copy the WAL segments over a rather slow link (between 1 and 6Mbit, depending on time of day) to the off-site server. This took several days
  • Create a restore.conf in the off-site server's data directory containing restore_command = 'cp /srv/pg_wal/%f "%p"' and standby_mode = on.
  • Started postgresql on the off-site server
  • Realized that it should have been recovery.conf, so stopped the server again and moved the file to the correct name.

Unfortunately, it doesn't seem to work. When I look at the log file right now, I see the following appear:

2016-07-25 12:30:59 CEST [137390-71] LOG:  record with zero length at 0/2000200
2016-07-25 12:31:04 CEST [137390-72] LOG:  restored log file "000000010000000000000002" from archive
2016-07-25 12:31:04 CEST [137390-73] LOG:  record with zero length at 0/2000200
2016-07-25 12:31:09 CEST [137390-74] LOG:  restored log file "000000010000000000000002" from archive

This is repeated over and over again.

Checking the pg_xlog directory on the off-site server reveals that there are two files there: one is 000000010000000000000002, which has the same checksum as the one with the same name in /srv/pg_wal; the other is ...03, which does not.

My /srv/pg_wal contains all WAL segments, including also 000000010000000000000001 (the very first segment the original database server made). However, I seem to be unable to load them into the off-site server.

Question: can I, with the given situation, somehow massage the off-site server into still accepting those WAL logs? If so, how should I proceed?

Technically I could create a new base backup and copy that to the new server, but due to the large amounts of data already on the server now, and the low bandwidth between the two sites, I would prefer not to do that if I can avoid it.

Wouter Verhelst
  • 418
  • 3
  • 8
  • Maybe use rsync to copy only differences?? – Florin Asăvoaie Jul 25 '16 at 15:31
  • @FlorinAsăvoaie: Yes, that's what "create a new base backup" would entail, presumably. But given the bandwidth issues and the fact that the $PGDATA directory on the new database server contains over just shy of 900G whereas the one on the off-site server contains 13MB or thereabouts, it's not the way I would like to do this. – Wouter Verhelst Jul 25 '16 at 15:35

0 Answers0