PostgreSQL 11 logical replication - stuck in `catchup` state

I'm running two PostgreSQL 11 servers - a master and a slave (set up with logical replication).

The problem I'm facing is that today, after weeks of uninterrupted operation, the slave got out of sync with this error message:

2019-09-16 07:39:44.332 CEST [30117] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:39:44.542 CEST [27972] LOG:  background worker "logical replication worker" (PID 30117) exited with exit code 1

I have seen this error message before, and my fix back then was to increase wal_sender_timeout on the master (more details here: https://dba.stackexchange.com/questions/245855/logical-replication-in-postgresql-server-closed-the-connection-unexpectedly)
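
For completeness, that change boils down to bumping the timeout in the master's configuration and reloading it; the SQL-level equivalent would be roughly:

ALTER SYSTEM SET wal_sender_timeout = 1200000;  -- 1200000 ms = 20 minutes
SELECT pg_reload_conf();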

So I then tried to restore replication; however, the replication state is stuck on catchup:

master=# select * from pg_stat_replication;
  pid  | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  |    write_lag    |    flush_lag    |   replay_lag    | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
 86864 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46110 | 2019-09-16 12:45:56.491325+02 |              | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 |             0 | async
(1 row)

I tried restarting the slave a few times, with different combinations of the subscription enabled and disabled - nothing helps, the replication state stays on catchup. I can see the sent_lsn and write_lsn values changing, so something is being sent through...
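
The enable/disable part was just me toggling the subscription on the slave between restarts, i.e. something like:

ALTER SUBSCRIPTION logical_from_master DISABLE;
-- restart the slave, check pg_stat_replication on the master, then:
ALTER SUBSCRIPTION logical_from_master ENABLE;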

I'm not sure if this is relevant, but on the slave I ran the following to see the WAL receiver status:

slave=# select * from pg_stat_wal_receiver ;
 pid | status | receive_start_lsn | receive_start_tli | received_lsn | received_tli | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time | slot_name | sender_host | sender_port | conninfo
-----+--------+-------------------+-------------------+--------------+--------------+--------------------+-----------------------+----------------+-----------------+-----------+-------------+-------------+----------
(0 rows)

However, when I run a simple ps -aux | grep postgres:

postgres 26087  0.0  1.1 348788 46160 ?        S    12:45   0:00 /usr/lib/postgresql/11/bin/postgres -D /var/lib/postgresql/11/slave_logical -c config_file=/etc/postgresql/11/slave_logical/postgresql.conf
postgres 26089  0.0  0.2 349108 12080 ?        Ss   12:45   0:00 postgres: 11/slave_logical: checkpointer
postgres 26090  0.0  0.0 348996  3988 ?        Ss   12:45   0:00 postgres: 11/slave_logical: background writer
postgres 26091  0.0  0.1 348788  7204 ?        Ss   12:45   0:00 postgres: 11/slave_logical: walwriter
postgres 26092  0.0  0.1 349740  4396 ?        Ss   12:45   0:00 postgres: 11/slave_logical: autovacuum launcher
postgres 26093  0.0  0.0 170024  3028 ?        Ss   12:45   0:00 postgres: 11/slave_logical: stats collector
postgres 26094  0.0  0.1 349572  4516 ?        Ss   12:45   0:00 postgres: 11/slave_logical: logical replication launcher
postgres 26095  0.0  0.2 350164 10036 ?        Ss   12:45   0:00 postgres: 11/slave_logical: my_user db ::1(56086) idle
postgres 26125  4.5  0.9 359876 36884 ?        Ds   12:45   0:20 postgres: 11/slave_logical: logical replication worker for subscription 37614

So either logical replication does not report anything in pg_stat_wal_receiver, or for some reason the worker details do not make it into that system view. At the same time, the process itself seems to be there (or at least that is my guess after reading the ps -aux output above).
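
Or perhaps pg_stat_subscription is the view I should be watching on the slave for the apply worker, rather than pg_stat_wal_receiver? Something like:

select subid, subname, pid, received_lsn, latest_end_lsn, latest_end_time
from pg_stat_subscription;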

This is my slave configuration:

wal_level=logical
max_replication_slots=2
max_logical_replication_workers=4

wal_receiver_timeout=1200000

And this is my master:

wal_level=logical

max_replication_slots=10
max_wal_senders=10

# maximum wait time in milliseconds that the walsender process on the active master
# waits for a status message from the walreceiver process on the standby.
wal_sender_timeout=1200000

I have no idea what to do (even worse, at this stage I have no idea what to check next...)
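
Perhaps the replication slot on the master would tell me something? This is what I would query:

select slot_name, plugin, slot_type, active, active_pid, restart_lsn, confirmed_flush_lsn
from pg_replication_slots;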

Can you help me understand what I should do to make my slave catch up so it's back in the streaming state?


Edit (12 hours later)

When I checked in the morning, synchronisation was still in the catchup state:

master=# select * from pg_stat_replication;
  pid  | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 12965 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46630 | 2019-09-17 06:40:18.801262+02 |              | catchup | D56/248E13A0 | D56/247E3908 | D56/247E3908 | D56/247E3908 |           |           |            |             0 | async
(1 row)

But when I checked again 60 seconds later, the result set was empty...
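
I suppose next time this happens I could also check on the master whether the walsender backend is gone entirely, for example with:

select pid, backend_type, state, backend_start
from pg_stat_activity
where backend_type = 'walsender';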

The logs now show the same error recurring over and over:

2019-09-16 22:43:33.841 CEST [20260] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 22:43:33.959 CEST [26087] LOG:  background worker "logical replication worker" (PID 20260) exited with exit code 1
2019-09-16 22:43:34.112 CEST [3510] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 23:12:01.919 CEST [3510] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-16 23:12:02.073 CEST [26087] LOG:  background worker "logical replication worker" (PID 3510) exited with exit code 1
2019-09-16 23:12:02.229 CEST [4467] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 00:27:01.990 CEST [4467] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 00:27:02.131 CEST [26087] LOG:  background worker "logical replication worker" (PID 4467) exited with exit code 1
2019-09-17 00:27:02.177 CEST [6917] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 01:05:35.121 CEST [6917] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 01:05:35.220 CEST [26087] LOG:  background worker "logical replication worker" (PID 6917) exited with exit code 1
2019-09-17 01:05:35.252 CEST [8204] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 01:49:08.388 CEST [8204] ERROR:  could not send data to WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 01:49:08.520 CEST [26087] LOG:  background worker "logical replication worker" (PID 8204) exited with exit code 1
2019-09-17 01:49:08.583 CEST [9549] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:06:19.601 CEST [9549] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 03:06:19.732 CEST [26087] LOG:  background worker "logical replication worker" (PID 9549) exited with exit code 1
2019-09-17 03:06:19.754 CEST [12120] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:58:48.184 CEST [12120] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 03:58:48.254 CEST [13781] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 03:58:48.318 CEST [26087] LOG:  background worker "logical replication worker" (PID 12120) exited with exit code 1
2019-09-17 04:27:12.838 CEST [13781] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 04:27:12.931 CEST [26087] LOG:  background worker "logical replication worker" (PID 13781) exited with exit code 1
2019-09-17 04:27:12.967 CEST [14736] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 04:55:48.923 CEST [14736] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 04:55:49.032 CEST [15686] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 04:55:49.043 CEST [26087] LOG:  background worker "logical replication worker" (PID 14736) exited with exit code 1
2019-09-17 05:41:48.526 CEST [15686] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 05:41:48.590 CEST [17164] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 05:41:48.638 CEST [26087] LOG:  background worker "logical replication worker" (PID 15686) exited with exit code 1
2019-09-17 06:03:32.584 CEST [17164] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
2019-09-17 06:03:32.642 CEST [17849] LOG:  logical replication apply worker for subscription "logical_from_master" has started
2019-09-17 06:03:32.670 CEST [26087] LOG:  background worker "logical replication worker" (PID 17164) exited with exit code 1
2019-09-17 06:40:18.732 CEST [17849] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

In order to make replication show up as catchup on the master, I now have to restart the slave first...


Edit 2 - I thought the rate at which sent_lsn changes might help establish how busy my servers are:

master=# select now(), * from pg_stat_replication ;
              now              |  pid   | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |        backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 2019-09-17 08:31:02.547143+02 | 100194 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46766 | 2019-09-17 08:23:05.01474+02 |              | catchup | D56/24B1BC88 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 |           |           |            |             0 | async
(1 row)

master=# select now(), * from pg_stat_replication ;
             now              |  pid   | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |        backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 2019-09-17 08:34:02.45418+02 | 100194 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46766 | 2019-09-17 08:23:05.01474+02 |              | catchup | D56/24B54958 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 |           |           |            |             0 | async
(1 row)

master=# select now(), * from pg_stat_replication ;
              now              |  pid   | usesysid | usename | application_name  |  client_addr  | client_hostname | client_port |        backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------------------------------+--------+----------+---------+-------------------+---------------+-----------------+-------------+------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 2019-09-17 08:41:01.815997+02 | 100194 |    16680 | my_user    | logical_from_master | 10.10.10.10 |                 |       46766 | 2019-09-17 08:23:05.01474+02 |              | catchup | D56/24B778B0 | D56/24A39B58 | D56/24A39B58 | D56/24A39B58 |           |           |            |             0 | async
(1 row)

When I look at the above, I'm worried that write_lsn does not follow sent_lsn - I don't know whether that is expected.
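
For what it's worth, the gap in bytes between what the master has sent and what the slave has confirmed could be computed like this (assuming pg_wal_lsn_diff is the right function for that):

select now(),
       application_name,
       state,
       pg_wal_lsn_diff(sent_lsn, write_lsn)  as send_write_gap_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) as send_replay_gap_bytes
from pg_stat_replication;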

Checkpoints are happening every 15 minutes and I can't see much IO through iostat.

When I executed select * from pg_stat_replication for the fourth time, nothing showed up (suggesting replication had died by itself). Coincidentally, a checkpoint was completing around that time (I restarted replication on the slave and waited until the next checkpoint completed, and this time the worker did not die).

There is one worrying sign I noticed in the logs while looking for checkpoints:

2019-09-17 08:49:02.155 CEST,,,35656,,5d382555.8b48,10873,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2019-09-17 08:56:32.086 CEST,,,35656,,5d382555.8b48,10874,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 5671 buffers (0.1%); 0 WAL file(s) added, 0 removed, 1 recycled; write=449.927 s, sync=0.000 s, total=449.930 s; sync files=138, longest=0.000 s, average=0.000 s; distance=19690 kB, estimate=398335 kB",,,,,,,,,""
2019-09-17 09:04:02.186 CEST,,,35656,,5d382555.8b48,10875,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2019-09-17 09:04:51.376 CEST,,,35656,,5d382555.8b48,10876,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 490 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=49.187 s, sync=0.000 s, total=49.190 s; sync files=41, longest=0.000 s, average=0.000 s; distance=3165 kB, estimate=358818 kB",,,,,,,,,""

It seems that the session_start_time column refers to a date from the 24th of July. Coincidentally, this is when I started replication on the master...
