
I am stumped trying to prevent an overflowing UDP buffer on a standby Postgres service. Any help would be most appreciated.


Essentially, a UDP receive buffer associated with the pg_standby process on my localhost interface gradually fills once I start Postgres; when it reaches maximum capacity, the socket steadily drops packets. Restarting Postgres clears the buffer, of course, but it then begins filling again.

As far as I can tell, this is not actually causing any problems. (It is only happening to the standby service, and failover data recovery shows nothing missing.) Nevertheless, I don't want any buffers to overflow.

Salient points:

a) querying the kernel's "/proc" UDP information reveals one non-empty buffer; converting its hex port (E97B --> dec 59771) lets netstat show the interface (localhost) and PID (438), which confirms that the "pg_standby" process owns the socket:

# cat /proc/net/udp | grep -v '00000000:0000'
sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
16: 0100007F:E97B 0100007F:E97B 01 00000000:01000400 00:00000000 00000000   600        0 73123706 2 ffff880026d64ac0 0

# netstat -anp | grep 59771
udp   16778240      0 127.0.0.1:59771             127.0.0.1:59771             ESTABLISHED 438/pg_standby

# ps -F -p 438
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
postgres   438 29613  0  1016   496   0 11:18 ?        00:00:00 /usr/pgsql-9.1/bin/pg_standby -t /archive_wals/stoprecovery.trigger -c /archive_wals 000000010000000A000000C8 pg_xlog/RECOVERYXLOG 000000010000000A000000C6
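For reference, the hex fields in the /proc/net/udp line convert like this (the rx_queue value matches netstat's Recv-Q column):

```shell
# Port and receive-queue depth from the /proc/net/udp entry above:
printf '%d\n' 0xE97B        # local port -> 59771
printf '%d\n' 0x01000400    # rx_queue bytes -> 16778240 (netstat Recv-Q)
```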

b) the overflow occurs even when my firewalls on both servers (iptables) are shut down

c) my UDP receive buffers seem more than big enough; I could make them larger, but that would only mask the problem:

# grep rmem /etc/sysctl.conf  | grep -v tcp
net.core.rmem_max = 26214400
net.core.rmem_default = 16777216
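The kernel also keeps a global count of UDP receive-buffer overflows (RcvbufErrors in /proc/net/snmp), which gives a second way to watch the drops accumulate. A small awk sketch that pairs the header row with the counter row (column names are taken from the running kernel, so the lookup stays correct even if the layout differs):

```shell
# Extract the RcvbufErrors counter by matching the header line of
# /proc/net/snmp against the value line that follows it:
awk '/^Udp:/ {
    if (!seen++) { for (i = 1; i <= NF; i++) col[$i] = i }   # header row
    else          print $col["RcvbufErrors"]                 # counter row
}' /proc/net/snmp
```

Sampling this before and after the buffer fills should show the counter rising in step with the drops reported for the socket.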

d) online discussions of similar problems tend to blame either older versions of Postgres or the statistics collector; to rule the latter out I turned off all statistics collection, but the problem persists:

# egrep '(track)' postgresql.conf | grep -v '^\s*#'
track_activities = off
track_counts = off

e) the dropped UDP packets are not very informative; a verbose tshark capture shows something like this for each one:

Data (72 bytes)
0000  0b 00 00 00 48 00 00 00 01 00 00 00 00 00 00 00   ....H...........
0010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
0020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
0030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
0040  00 00 00 00 00 00 00 00                           ........
Data: 0B0000004800000001000000000000000000000000000000...
[Length: 72]
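The header bytes can be decoded by hand. A sketch, on the assumption that this is a PostgreSQL statistics message: per pgstat.h, such messages start with a PgStat_MsgHdr of two ints, m_type and m_size, sent in host byte order (little-endian here). If that layout holds, m_type is 11 and m_size is 72; in the 9.1 sources a type of 11 would plausibly be the bgwriter stats message, which is not gated by the track_* settings — but that needs verifying against pgstat.h for 9.1.9, since the enum values shift between versions:

```shell
# Decode the first two little-endian 32-bit words of the captured payload.
payload=0b00000048000000   # first 8 bytes of the 72-byte datagram, as hex

# le32: reverse the byte order of an 8-hex-digit string, print decimal
le32 () { printf '%d\n' "0x${1:6:2}${1:4:2}${1:2:2}${1:0:2}"; }

le32 "${payload:0:8}"   # first word  -> 11 (m_type, if the header guess holds)
le32 "${payload:8:8}"   # second word -> 72 (m_size, matches the packet length)
```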

f) there is not a great deal of database activity (e.g. roughly one 16MB WAL file is replicated from the primary to the secondary service every 45 minutes)

g) I formerly ran Postgres 8.3.5, with an otherwise identical setup; this problem only began when I upgraded to 9.1.9


Background on my setup:

  1. two CentOS 6.4 x86_64 systems (VMs), each running Postgres 9.1.9, in geographically separated (<50 miles) datacenters
  2. Postgres is active on my primary server and running in standby mode on my backup:
  3. the backup Postgres service is receiving its data two ways:
  4. nothing else (of consequence) runs on these boxes except Postgres
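For context, the pg_standby invocation shown in the ps output corresponds to a 9.1-style warm-standby recovery.conf along these lines (a reconstruction from the command line in (a), not my literal file):

```
# recovery.conf on the standby (reconstructed from the ps output above;
# %f, %p, %r expand to the WAL filename, destination path, and restart point)
restore_command = '/usr/pgsql-9.1/bin/pg_standby -t /archive_wals/stoprecovery.trigger -c /archive_wals %f %p %r'
```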
  • Bizarre issue. The only thing I think Pg uses UDP for is the stats collector, so I'm a bit confused as to how you're seeing these issues on a standby, especially with activity tracking off. – Craig Ringer Jan 04 '14 at 01:15
  • It's strange. And note that while the statistics collector is (I believe) associated with the postmaster process, the UDP drops here are coming from a socket owned by pg_standby. – Daniel C Jan 05 '14 at 21:10
  • It might be a good idea to raise this on pgsql-general - details in post, but also with a link back here. I'd love to dig into this one in detail but would probably need direct machine access, and can't currently spend the time (linux.conf.au + development work = busy). – Craig Ringer Jan 06 '14 at 00:09

0 Answers