I am stumped trying to prevent an overflowing UDP buffer on a standby Postgres service. Any help would be most appreciated.
Essentially, a UDP receive buffer associated with the pg_standby process on my localhost interface gradually fills up once I start Postgres; when it reaches its maximum capacity, the socket starts steadily dropping packets. A restart of Postgres (of course) clears the buffer, but it then begins filling up again.
As far as I can tell, this is not actually causing any problems. (It is only happening to the standby service, and failover data recovery shows nothing missing.) Nevertheless, I don't want any buffers to overflow.
Salient points:
a) querying the UDP information under "/proc" shows non-empty buffers; the UDP port of the only non-empty buffer (hex E97B --> dec 59771) lets netstat identify the interface (localhost) and PID (438), confirming that the "pg_standby" process owns the socket:
# cat /proc/net/udp | grep -v '00000000:0000'
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
16: 0100007F:E97B 0100007F:E97B 01 00000000:01000400 00:00000000 00000000 600 0 73123706 2 ffff880026d64ac0 0
# netstat -anp | grep 59771
udp 16778240 0 127.0.0.1:59771 127.0.0.1:59771 ESTABLISHED 438/pg_standby
# ps -F -p 438
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
postgres 438 29613 0 1016 496 0 11:18 ? 00:00:00 /usr/pgsql-9.1/bin/pg_standby -t /archive_wals/stoprecovery.trigger -c /archive_wals 000000010000000A000000C8 pg_xlog/RECOVERYXLOG 000000010000000A000000C6
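For what it's worth, the hex fields in /proc/net/udp line up exactly with the netstat output: the rx_queue value 01000400 is 16778240 bytes, the same number netstat reports in its Recv-Q column. A minimal decoding sketch (assumes gawk, whose strtonum handles the hex fields; gawk is the stock awk on CentOS):
# list UDP sockets with a non-empty receive queue, decoding the hex
# port and rx_queue fields from /proc/net/udp
awk 'NR > 1 {
    split($2, local, ":")          # local_address = hexIP:hexPORT
    split($5, queue, ":")          # tx_queue:rx_queue, both hex
    rx = strtonum("0x" queue[2])
    if (rx > 0) printf "port %d  rx_queue %d bytes  drops %s\n", strtonum("0x" local[2]), rx, $NF
}' /proc/net/udp
For the socket above this prints port 59771 with a 16778240-byte receive queue.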
b) the overflow occurs even when my firewalls on both servers (iptables) are shut down
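(By "shut down" I mean stopped at the service level on both boxes, e.g.:)
service iptables status     # reports the firewall as not running
iptables -L -n              # no rules loaded in any chain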
c) my UDP buffer limits seem more than big enough; I could make them larger, but that would only mask the problem:
# grep rmem /etc/sysctl.conf | grep -v tcp
net.core.rmem_max = 26214400
net.core.rmem_default = 16777216
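To see the limits actually in effect (rather than just what sysctl.conf says) and to watch the queue climb, something along these lines works:
# limits the running kernel is using
sysctl net.core.rmem_max net.core.rmem_default
# watch the receive queue on the pg_standby socket (port from point (a))
watch -n 60 "netstat -anp | grep 59771"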
d) online discussions of similar problems seem to point the finger at either older versions of Postgres or the Statistics Collector; to rule the latter out I have turned off all statistics collection, but the problem continues:
# egrep '(track)' postgresql.conf | grep -v '^\s*#'
track_activities = off
track_counts = off
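As a further check that the statistics collector is not involved, the socket inode reported by /proc/net/udp can be matched against the owning process's file descriptors (PID 438 and inode 73123706 come from point (a); "stats collector" is how that backend is labelled in ps):
ps -u postgres -o pid,cmd | grep -E 'stats collector|pg_standby'
ls -l /proc/438/fd | grep 73123706      # the fd shows up as socket:[73123706]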
e) the UDP packets themselves are not very informative; a verbose tshark sniff shows something like the following for each new UDP packet that ends up being dropped:
Data (72 bytes)
0000 0b 00 00 00 48 00 00 00 01 00 00 00 00 00 00 00 ....H...........
0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0040 00 00 00 00 00 00 00 00 ........
Data: 0B0000004800000001000000000000000000000000000000...
[Length: 72]
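(The capture was essentially a loopback sniff on the port from point (a), along the lines of the sketch below. The only structure I can make out in the payload is that the first two little-endian 32-bit words are 11 and 0x48 = 72, the latter matching the packet length.)
# grab a handful of packets on the offending loopback socket
tshark -i lo -f "udp port 59771" -x -c 10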
f) there is not a great deal of database activity (roughly one 16 MB WAL file is shipped from the primary to the standby service every 45 minutes)
g) I formerly ran Postgres 8.3.5, with an otherwise identical setup; this problem only began when I upgraded to 9.1.9
Background on my setup:
- two CentOS 6.4 x86_64 VMs, each running Postgres 9.1.9, in geographically separated (<50 miles apart) datacenters
- Postgres is active on my primary server and running in standby mode on my backup
- the backup Postgres service is receiving its data two ways:
- as a warm standby processing WAL files via log shipping (see sections 25.2.1-25.2.4 here); a sketch of the corresponding recovery.conf follows this list
- on failover, the primary's current (not-yet-shipped) WAL file is recovered from a DRBD partition synced from the primary box (there is no standard procedure here, but here is a discussion)
- nothing else (of consequence) runs on these boxes except Postgres
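For reference, the standby's recovery.conf amounts to the pg_standby invocation visible in point (a); a minimal sketch (the trigger-file path and archive directory are taken from that ps output, everything else omitted):
# recovery.conf on the standby (sketch reconstructed from the command line in point (a))
restore_command = 'pg_standby -t /archive_wals/stoprecovery.trigger -c /archive_wals %f %p %r'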