Postgresql 9.3 Log Shipping on a Hot Standby

Question

My setup

I've just set up streaming replication in postgres for the first time (9.3.5), and while the streaming is working as I expect, I'm struggling to get my standby to run the archive_command so I can store all of the log files.

Master postgres.conf:

wal_level = hot_standby
checkpoint_segments = 8
max_wal_senders = 4
wal_keep_segments = 8

Standby postgres.conf:

wal_level = hot_standby
checkpoint_segments = 8
archive_mode = on
archive_command = 'test ! -f /backup/postgres_archive/constant/%f && cp %p /backup/postgres_archive/constant/%f'
max_wal_senders = 3
wal_keep_segments = 8
hot_standby = on

Standby recovery.conf:

standby_mode = 'on'
primary_conninfo = 'host=my-host.example.com port=5432 user=replicator password=my_password sslmode=require'
restore_command = 'cp /backup/postgres_archive/constant/%f %p'
trigger_file = '/tmp/postgresql.trigger'

The permissions on the folder I'm trying to write into are correct and archive_command works fine when I run it manually as the user postgres is running as. To be sure I tried changing the archive command to a simple touch (again, tested fine as the postgres user), but it didn't make any difference.

Things that are working

My DB is still in its relative infancy so there's not much load on it at all. To this end I'm just simulating it by writing random data into a test table. After I commit on the master I can see the changes appearing straight away in the standby so I'm happy with that.

One thing I don't quite understand is that the WAL files are the standby and the master are slightly different, but it almost looks as though one or other has provisioned a WAL that it hasn't started writing to yet (which isn't on the other). Is that normal?

If I do select pg_switch_xlog() on the master and then write some more both the master and standby seem to switch and start writing to the next/same WAL file. So that mirrors my understanding of what's happening.

Help

I have a few questions about all this. I have read through every page of the postgres manuals regarding this but I couldn't find anything to help in my particular case.

I tried finding some way of getting postgres to show me more information about what it might be doing / not doing in the log files but it didn't come up with anything useful. In debugging this, what should I change in the config to get as much useful information into the logs as possible?

In terms of when the log archiving would run, I guess because the master is sort of controlling which WAL file is active it's effectively the trigger for when the log shipping should run on the standby. Is that correct?

The streaming replication all seems to work as I expect but trying to get log shipping running on the standby doesn't appear to even try. What am I doing wrong?

Why are you trying to run `archive_command` on the standby, rather than the master? — Craig Ringer, Oct 15 '14 at 11:10
I sort of assumed it was normal! I'm trying to have my master do as little as possible and leaving my standby to deal with backups etc. Maybe that's just not the right approach? — Aidan Kane, Oct 15 '14 at 11:12
The master should be archiving WAL anyway, because it's necessary to allow the replica to run a `restore_command` to catch up if the master has removed WAL the standby still requires. So in general it doesn't make any sense to have the replica archive WAL received by streaming. — Craig Ringer, Oct 15 '14 at 11:14
If the replica needs a log file that it doesn't have it goes to the master. If the master has already removed it from pg_xlog it needs to get it from the WAL archive. But you need to setup a recovery.conf for the master to know where to get the file from, and won't that put the master in recovery mode? My understanding was that if the replica ended up too far behind master we'd have to resync it manually. — Aidan Kane, Oct 15 '14 at 11:21
The master never gets WAL from the archive. Ever. When it no longer needs an WAL segment it removes it unless `wal_keep_segments` is set, in which case it keeps WAL for bit longer then deletes it. The purpose of WAL archiving is (a) to allow point in time recovery from a base backup; and (b) to allow a replica to read and replay the archived WAL, either as its only way of following the master, or in case it gets too far behind to stream data from the master. In general the master should be archiving WAL and the replica, even if streaming, should be set up to replay it if needed. — Craig Ringer, Oct 15 '14 at 11:27
That was my understanding but I was confused by your comment that _'it's necessary to allow the replica to run a `restore_command` to catch up if the master has removed WAL'_. I now guess you mean that's done by setting a `restore_command` on the replica to read from the shared location the master archived to, rather than some automatic operation via the streaming replication, which makes total sense. So your advice is to have the master archiving WAL. I'm happy to set that up. And, do you know if there's a restriction in postgres that stops the standby from running `archive_command`? — Aidan Kane, Oct 15 '14 at 11:38
Don't use `trigger file` in world writable location - it's insecure. You don't need it - you can always use `pg_ctl promote` if you ever need to initiate failover. — Tometzky, Oct 15 '14 at 21:25
Seems like reasonable advice. I'll change that. Any ideas about any of the other questions? — Aidan Kane, Oct 15 '14 at 21:58

score 3 · Accepted Answer · answered Dec 16 '14 at 19:56

I also just ran into this issue. The key here is actually the archive_cleanup_command in the "recovery.conf" on the standby. The standby will run the archive_cleanup_command command when it is done processing a WAL segment from the primary, so at that point you know you can backup that WAL segment and all prior segments. In my "recovery.conf" I have:

archive_cleanup_command = '/var/lib/postgresql/wal_backup_mirror.sh "%r"'

The contents of that script are (simplified version):

CURRENT_WAL_FILE="$1"
for WAL_FILE in $(find /pg_logs/main -maxdepth 1 -type f | sort | awk "\$0 <= \"/pg_logs/main/${CURRENT_WAL_FILE}\""); do
  WAL_NAME=$(basename "$WAL_FILE")
  gzip -c "$WAL_FILE" > "/backups/wal/${WAL_NAME}.gz"
  #now upload the just created .gz to S3 or some other offsite storage

  rm -f "${WAL_FILE}"
done

Note here that I delete the WAL segment after backing it up to keep my log directory clean on the standby, but the one caveat there is to be careful of a cascaded replication setup, since a standby that is further down the chain might still need those files.

One final note, remember that backing up the WAL segments isn't sufficient, it has to be done in combination with some sort of regular full backup (pg_basebackup). We do full backups daily, and then just backup the WAL segments as needed throughout the day.

Interesting. Seems like it should do the trick. Will do a little more reading up so I fully understand this before I accept the answer. Thanks for all the info. — Aidan Kane, Dec 16 '14 at 21:30

Postgresql 9.3 Log Shipping on a Hot Standby

My setup

Things that are working

Help

1 Answers1