There are two servers. One is the on-site database server; the other is the off-site backup server. Since the on-site database server already contains just shy of 900G of data, backups are important. Originally, we used a daily pg_dumpall
, but for performance reasons that's no longer an option.
Since an upgrade of the database server, including also an upgrade of postgresql itself (from postgresql 9.1 to 9.4) was due anyway, I decided to try to move to use WAL shipping to the off-site server, so that backups would work. Here's what I've done so far:
- Create a new cluster on the new server
- Stop the database server on the new server, rsync the data directory to the off-site server (so that I have a base backup)
- Configure postgresql.conf on the new server so that
wal_level
is set tohot_standby
,archive_mode
toon
, andarchive_command
totest ! -f /srv/pg_wal/%f && cp %p /srv/pg_wal/%f
- Start the cluster on the new server
ssh oldserver pg_dumpall | psql
to read in the data, which generates large amounts of WAL segments- Copy the WAL segments over a rather slow link (between 1 and 6Mbit, depending on time of day) to the off-site server. This took several days
- Create a
restore.conf
in the off-site server's data directory containingrestore_command = 'cp /srv/pg_wal/%f "%p"'
andstandby_mode = on
. - Started postgresql on the off-site server
- Realized that it should have been
recovery.conf
, so stopped the server again and moved the file to the correct name.
Unfortunately, it doesn't seem to work. When I look at the log file right now, I see the following appear:
2016-07-25 12:30:59 CEST [137390-71] LOG: record with zero length at 0/2000200
2016-07-25 12:31:04 CEST [137390-72] LOG: restored log file "000000010000000000000002" from archive
2016-07-25 12:31:04 CEST [137390-73] LOG: record with zero length at 0/2000200
2016-07-25 12:31:09 CEST [137390-74] LOG: restored log file "000000010000000000000002" from archive
This is repeated over and over again.
Checking the pg_xlog
directory on the off-site server reveals that there are two files there: one is 000000010000000000000002
, which has the same checksum as the one with the same name in /srv/pg_wal
; the other is ...03, which does not.
My /srv/pg_wal
contains all WAL segments, including also 000000010000000000000001
(the very first segment the original database server made). However, I seem to be unable to load them into the off-site server.
Question: can I, with the given situation, somehow massage the off-site server into still accepting those WAL logs? If so, how should I proceed?
Technically I could create a new base backup and copy that to the new server, but due to the large amounts of data already on the server now, and the low bandwidth between the two sites, I would prefer not to do that if I can avoid it.