5

The startup log entries indicate that autovacuum is not working. When I query the pg_stat_user_tables view, the last_vacuum and last_autovacuum columns are empty, even though I ran a vacuum query just before. Connecting pgAdmin to the database also indicates that vacuum is not working.

I am running PostgreSQL on two Ubuntu Azure VMs. One VM is set up as the master; the second is the replicated database, kept in sync by streaming replication, roughly as described in https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps.

Everything seems to work except for autovacuum. During startup the following error is logged:

LOG:  test message did not get through on socket for statistics collector
LOG:  disabling statistics collector for lack of working socket
WARNING:  autovacuum not started because of misconfiguration
HINT:  Enable the "track_counts" option.
LOG:  database system was shut down at 2017-01-19 14:07:13 UTC
DEBUG:  checkpoint record is at 38/F6000028

In postgresql.conf I use the following settings:

track_counts = on  
autovacuum = on
log_autovacuum_min_duration = 200 
autovacuum_max_workers = 1  
autovacuum_naptime = 960
autovacuum_vacuum_threshold = 128 
autovacuum_analyze_threshold = 256
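As a sanity check, it can help to compare these file settings against what the running server actually reports; a minimal sketch, assuming the default Ubuntu postgres OS user and local trust/peer authentication:

sudo -u postgres psql -c "SHOW track_counts;"   # value the server is really using
sudo -u postgres psql -c "SHOW autovacuum;"
# Reload the configuration without a full restart; note that the statistics
# collector socket itself is only set up at server start.
sudo -u postgres psql -c "SELECT pg_reload_conf();"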

A query (select * from pg_stat_user_tables) on the database to find the last (auto)vacuum returns empty columns for the last (auto)vacuum instead of a datetime, even though just before this I ran VACUUM FULL VERBOSE; and that gave me vacuum output.

If I query for the vacuum settings with:

select *
from pg_settings 
where name like 'autovacuum%'

This is the result:

"autovacuum";"on"<br />
"autovacuum_analyze_scale_factor";"0.1"
"autovacuum_analyze_threshold";"256"
"autovacuum_freeze_max_age";"200000000"
"autovacuum_max_workers";"1"<br />
"autovacuum_multixact_freeze_max_age";"400000000"
"autovacuum_naptime";"960"<br />
"autovacuum_vacuum_cost_delay";"20"
"autovacuum_vacuum_cost_limit";"-1"
"autovacuum_vacuum_scale_factor";"0.2"
"autovacuum_vacuum_threshold";"128"
"autovacuum_work_mem";"-1"

These are the 'track_' results:

"track_activities";"on"
"track_activity_query_size";"1024"
"track_commit_timestamp";"off"
"track_counts";"off"
"track_functions";"none"
"track_io_timing";"off"

The pg_hba.conf (without the replication and network/user settings) looks like this:

local   all             all                                     trust
host    all             all             localhost               trust
host    all             all             10.1.1.5/32             md5
host    all             all             127.0.0.1/32            md5
host    all             all             0.0.0.0 0.0.0.0         md5

The /etc/hosts:

127.0.0.1       localhost
127.0.1.1       ubuntu
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

This is the result of 'netstat -ant|grep 5432'. It is cleaned up and formatted.

User@Machine:/datadrive/log/postgresql/pg_log$ netstat -ant|grep 5432
tcp        0      0 0.0.0.0:5432            0.0.0.0:*               LISTEN
tcp       39      0 InternIpMaster:5432           InternIpSlave:36338          ESTABLISHED
tcp        0      0 InternIpMaster:5432           IpJob:63814     TIME_WAIT
tcp        0      0 InternIpMaster:5432           IpJob:22192      TIME_WAIT
tcp        0      0 InternIpMaster:5432           IpJob:47729      TIME_WAIT
tcp        0      0 InternIpMaster:5432           IpJob:55663      TIME_WAIT
tcp6       0      0 :::5432                 :::*                    LISTEN
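Since the error is about the statistics collector's socket, a way to check whether the server has any UDP socket at all (the collector uses UDP on the loopback interface) would be something like:

sudo netstat -anup | grep postgres   # -u lists UDP sockets; none will appear if the collector was disabled
# or, with ss:
sudo ss -u -a -p | grep postgres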

I don't expect autovacuum to have needed to run yet because of the threshold and naptime settings above.

So during startup, track_counts gets disabled at runtime.

I have been looking for solutions involving the iptables rules; even without any iptables rules it does not work. I have connected to localhost as a host. I have changed the firewall settings in Azure and opened port 5432 so the VM can be reached from all IPs; I am able to access the database from other systems. I have reset the config to the defaults with only the replication changes, and I have restarted the service many times.
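For reference, the active rules can be listed with packet counters, which makes it visible whether a rule is dropping the collector's loopback packets; something like:

sudo iptables -L INPUT -n -v --line-numbers    # -v shows packet/byte counters per rule
sudo iptables -L OUTPUT -n -v --line-numbers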

What am I missing?

Bart Dirks
  • 91
  • 9
  • Could you use `telnet standardip 5432` on the master VM? Also, could you check whether `ssh postgres@ip` works without a password? – Shui shengbao Jan 20 '17 at 02:17
  • I have tried three IP addresses: the localhost, the subnet IP and the external (dynamic) IP. With `telnet standardip 5432` I was able to connect with all three IP addresses. With `ssh postgres@ip` I was able to connect with the localhost and the subnet IP. The external IP did not work; that is probably blocked by the Azure network security group – Bart Dirks Jan 20 '17 at 16:24
  • With the 'ssh postgres@ip' a question is asked: *The authenticity of host 'ip' can't be established. ECDSA key fingerprint is SHA256:. Are you sure you want to continue connecting (yes/no)? yes* After this I have to enter the password for the postgres user. – Bart Dirks Jan 20 '17 at 17:08
  • I read the article; you need to be able to ssh to the other VM without a password. You could use `ssh-keygen` to set that up. – Shui shengbao Jan 20 '17 at 22:41
  • You need to open port 5432 for the two VMs on the NSG. The two VMs should be able to ssh to each other without a password. – Shui shengbao Jan 20 '17 at 22:43
  • I did set up the 'ssh-keygen' from master to slave and back. That seems to be used for the replication. Or do I fail to see how ssh-keygen works? It is a means of authentication between two servers, right? The 5432 port is opened on the NSG, as indicated. 'telnet ip 5432' works so I think that this is set up. I see no need to open ssh on the NSG. Is this relevant for the statistics collection to work? – Bart Dirks Jan 21 '17 at 08:39
  • Yes, the two VMs' apps communicate via ssh, but your app does not know the ssh password. You could use `ssh-keygen` to create a two-way trust relationship. http://www.linuxproblem.org/art_9.html – Shui shengbao Jan 23 '17 at 01:49
  • Hey Walter. Can you elaborate on that? I have not read that the trust relationship between the master and slave prevents autovacuum from functioning, or in my case prevents the statistics collector from functioning. As said, I have established ssh authorization between the master and slave. In the first comment I thought you asked to connect from the master to the master; I was thinking to check the local firewall settings. But do I understand correctly that you intended to call the slave from the master? – Bart Dirks Jan 23 '17 at 06:30
  • Yes, you need to ssh to the slave from the master without a password and to the master from the slave without a password. – Shui shengbao Jan 23 '17 at 06:33
  • Just want to know if these helped? – Shui shengbao Jan 23 '17 at 07:52
  • I assumed that the replication would not work if the relation was not correct. Replication is working, thus I thought that the trust relation was good. I added an ssh rule from the public master to the public slave and the other way around in the NSG. – Bart Dirks Jan 23 '17 at 08:22
  • I have set up the trusted relation for the postgres user both on the subnet and the external IP address, from the master to the slave and from the slave to the master. I rebooted, I restarted the postgres service. Still I have the same error on startup. I have double checked the ssh login without a password. However I need the passphrase. Or should I have left this blank? – Bart Dirks Jan 23 '17 at 08:22
  • Yes, you are right. `Press "ENTER" to all of the prompts that follow.` It means that you need to leave it blank. – Shui shengbao Jan 23 '17 at 08:24
  • OK, I will try that. Which IP does postgres use, the external one or the one from the subnet? How can I check? – Bart Dirks Jan 23 '17 at 08:29
  • I think you could try to use private IP firstly. – Shui shengbao Jan 23 '17 at 08:31
  • I set up the relation for both IP addresses and for both machines. I double checked using 'ssh postgres@ip' on both machines. I noticed that when I ssh'd from the master to the slave with the external IP, the last-login message showed the internal IP; so that answers the IP question. However the statistics collector is still not working. Still the message _test message did not get through on socket for statistics collector_ – Bart Dirks Jan 23 '17 at 08:42
  • Do you try private ip (one from the subnet)? – Shui shengbao Jan 23 '17 at 08:45
  • I tried both. The ssh connection is working for both. – Bart Dirks Jan 23 '17 at 08:46
  • Did you check port 5432? Please give the result of `netstat -ant|grep 5432` – Shui shengbao Jan 23 '17 at 09:02
  • https://www.pokertracker.com/forums/viewtopic.php?t=14928 – Shui shengbao Jan 23 '17 at 09:02

3 Answers

3

You want to fix this:

LOG: test message did not get through on socket for statistics collector
LOG: disabling statistics collector for lack of working socket

The stats collector expects UDP packets from localhost. Given that localhost looks fine in your /etc/hosts (specifically, it does not resolve to IPv6), the next most plausible explanation is that a firewall is filtering these packets.

Related: Problem in creating UDP sockets, solved with: "Found and resolved the problem in creating UDP sockets. It was because of the OS firewall (iptables) restricting the creation of UDP sockets."
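To verify that loopback UDP traffic is actually being delivered, independently of PostgreSQL, a quick test could look like this (a sketch, assuming the OpenBSD variant of netcat and an arbitrary free port):

# terminal 1: listen on an arbitrary UDP port on the loopback interface
nc -u -l 127.0.0.1 9999
# terminal 2: send a test datagram; it should show up in terminal 1
echo test | nc -u 127.0.0.1 9999

If the datagram never arrives, a firewall rule on the loopback path is the likely culprit.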

Daniel Vérité
  • 2,740
  • 14
  • 19
  • I tried to flush the iptables with `sudo iptables --flush`. After your suggestion I tried to stop the firewall with `sudo service ufw stop`. That did not give me feedback, thus I used `sudo ufw disable`. After restarting the postgres service I still get the same: _LOG: test message did not get through on socket for statistics collector 2017-01-23 14:49:59 UTC [61119-5] LOG: disabling statistics collector for lack of working socket_ Do you know a way to test the route the statistics collector uses? – Bart Dirks Jan 23 '17 at 14:56
  • Correction: I tried `service netfilter-persistent stop` and restarted the postgres service --> the same error. After this I flushed the iptables once more with `sudo iptables --flush`, restarted postgres, and the error is gone. I will investigate some more. – Bart Dirks Jan 23 '17 at 15:09
1

According to your link, "You should now be able to ssh freely between your two servers as the postgres user." So you need to set up the trust relationship for the postgres user from the master to the slave and from the slave to the master.

You could use ssh-keygen to create a key pair with a blank passphrase.

shui@shui:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/shui/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/shui/.ssh/id_rsa.
Your public key has been saved in /home/shui/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:mCyBHNLeEdCH2VqBjhtOC8njVLSXnjU7V9GbufK+hlE shui@shui
The key's randomart image is:
+---[RSA 2048]----+
|..++.*.. ..      |
| o.+B = ..       |
|.o+=.B o . +     |
|o+= *oooo . E    |
|o+.+.o+oS. . .   |
| .+ . o o .      |
|      =          |
|     . o         |
|    oo.          |
+----[SHA256]-----+

For more information please refer to this link.
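After generating the key, the public key still has to be installed on the other host. A minimal sketch, assuming ssh-copy-id is available and using `slave-ip` as a placeholder for the other VM's address:

# run as the postgres user on the master (and then the mirror image on the slave)
ssh-keygen -t rsa                 # press Enter at every prompt for an empty passphrase
ssh-copy-id postgres@slave-ip     # appends the public key to ~postgres/.ssh/authorized_keys on the slave
ssh postgres@slave-ip             # should now log in without asking for a password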

Also, you need to open port 5432 on the Azure NSG.

Shui shengbao
  • 3,503
  • 1
  • 10
  • 20
  • Hey Walter. For now I don't think anything changed. When setting up the servers I had already added this trusted connection. I redid it to be sure it is correct. But I think for others this could be helpful – Bart Dirks Jan 23 '17 at 09:35
  • @Walter-MSFT can you explain what Bart's error is and then write what needs to be done and why? Because I don't understand your answer to set up only an SSH RSA key. Why should we be doing that? – Nordes Jan 23 '17 at 09:51
  • @Nordes I checked the OP's [link](https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps) `You should now be able to ssh freely between your two servers as the postgres user.` – Shui shengbao Jan 23 '17 at 09:56
  • @Nordes The data which is copied between the two VMs goes over `ssh`. – Shui shengbao Jan 23 '17 at 09:58
  • @Nordes I asked the OP; he found that when he uses `ssh` he needs a password. – Shui shengbao Jan 23 '17 at 10:06
  • The statistics collector is still not working. I don't know if I am closer to a solution or just at the same level. The problem to me is still the same. I am unsure how the statistics collector process works. What credentials are needed? What equivalent command can I use? I think it has nothing to do with the connection between the master and slave. I suspect it is local. But I cannot isolate the problem. I think the collector is a process to monitor the db with the postgres credentials, but this seems to work, hence the `psql -h localhost -U postgres` command is connecting – Bart Dirks Jan 23 '17 at 10:20
  • @Walter-MSFT Usually in an answer you try to avoid link-only content, since links can become obsolete. Also the question is about autovacuum and this answer is about replication. How is that related? Are you saying that if I stop the replication (with Bart Dirks), the autovacuum will start magically by itself? – Nordes Jan 23 '17 at 10:34
  • @Nordes Thank for you reply, I will update my answer tomorrow. – Shui shengbao Jan 23 '17 at 10:48
  • @BartDirks Do you check log in `/var/log/postgresql`? I test in my lab, I don't get your error. – Shui shengbao Jan 24 '17 at 05:27
  • I test in my lab. I get `2017-01-24 05:58:30 UTC FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000002 has already been removed ` Now, I try to solve it. – Shui shengbao Jan 24 '17 at 05:59
  • I suggest you check your configuration according to your link. – Shui shengbao Jan 24 '17 at 05:59
  • @Walter-MSFT This error indicates that the WAL segment on the master is deleted before the slave could take it to push into the database. In this case I would restore the slave database. This is addressed in the [link](https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps) as _Replicating the Initial database_. But I prefer to use a script to generate the slave database. This is explained in this [link](http://dba.stackexchange.com/questions/53545/cannot-get-my-slave-replication-server-to-start): creating the recreate_standby.sh – Bart Dirks Jan 24 '17 at 10:45
1

I want to elaborate on the answer @Daniel gave and the solution to my problem.

I had set up the iptables rules in order to get access to PostgreSQL like this:

sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
sudo iptables -A INPUT -i lo -j ACCEPT
sudo iptables -A OUTPUT -o lo -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 5432 -m state --state NEW,ESTABLISHED -j ACCEPT
sudo iptables -A INPUT -j DROP

I assumed this was sufficient. However, when I used sudo iptables --flush and restarted the postgres server, the error disabling statistics collector for lack of working socket was gone.

I also used iptraf to investigate the traffic (sudo apt-get install iptraf; sudo iptraf). I noticed that traffic originated on the local (subnet) IP address of the server, but on different ports. This is the traffic on the slave machine (without the Azure traffic).

SubnetIpSlave:22
SubnetIpSlave:45622
SubnetIpSlave:44770
SubnetIpSlave:48948
SubnetIpMaster:5432

I assume that this traffic was blocked by the iptables rules because it does not go via the loopback interface. Therefore I cleaned up the iptables rules. This is the result:

sudo iptables -A INPUT -i lo -j ACCEPT
sudo iptables -A OUTPUT -o lo -j ACCEPT
sudo iptables -A INPUT -p icmp -j ACCEPT
sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 22 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 443 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 5432 -j ACCEPT
sudo iptables -A INPUT -s 10.1.1.0/24 -j ACCEPT
sudo iptables -A INPUT -j DROP

I included the subnet. I think this is what makes it work, since SubnetIpSlave and SubnetIpMaster are in this range. I can probably remove the ESTABLISHED,RELATED rule.
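Since netfilter-persistent turned up earlier in the comments, a sketch of how the cleaned-up rules could be made to survive a reboot (assuming the iptables-persistent package; package name and command are the Ubuntu defaults, adjust if the setup differs):

sudo apt-get install iptables-persistent   # provides the netfilter-persistent service
sudo netfilter-persistent save             # writes the current rules to /etc/iptables/rules.v4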

The log looks like it should:

2017-01-24 09:19:38 UTC [1482-1] LOG:  database system was shut down in recovery at 2017-01-24 09:17:41 UTC
2017-01-24 09:19:38 UTC [1483-1] [unknown]@[unknown] LOG:  incomplete startup packet
2017-01-24 09:19:38 UTC [1482-2] LOG:  entering standby mode
2017-01-24 09:19:38 UTC [1482-3] DEBUG:  checkpoint record is at 5D/F2042CA8

I am happy ;)

Bart Dirks
  • 91
  • 9