
I have PostgreSQL streaming replication between 2 hosts, and I've run into a performance difference between the two servers: all SQL queries on one host are 70-90% slower than on the other.

At first I checked query reports from both servers and found that almost all queries in the master's report are slower than the corresponding queries in the standby's report. I then tested several example queries on both hosts and got the same result: queries on the master execute slower than on the standby.

My first guess was a difference in power management. But power management is disabled on both hosts: no scaling governors are enabled and there are no "cpufreq" directories under "/sys/devices/system/cpu/cpuX/".
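For reference, this is roughly how I checked it (standard sysfs paths; when frequency scaling is disabled, the cpufreq directory simply doesn't exist):

```shell
#!/bin/bash
# Print the cpufreq scaling governor for every CPU, or note its absence.
# A missing cpufreq directory means frequency scaling is disabled for
# that CPU (the situation on both of my hosts).
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    gov="$cpu/cpufreq/scaling_governor"
    if [ -r "$gov" ]; then
        echo "$(basename "$cpu"): $(cat "$gov")"
    else
        echo "$(basename "$cpu"): no cpufreq (frequency scaling disabled)"
    fi
done
```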

Next I compared the hardware and found only one difference: the slower host has newer CPUs than the faster standby.

Master (which is slower)

CPU1 Version: Intel(R) Xeon(R) CPU           E5630  @ 2.53GHz --- Signature: Type 0, Family 6, Model 44, Stepping 2
CPU2 Version: Intel(R) Xeon(R) CPU           E5630  @ 2.53GHz --- Signature: Type 0, Family 6, Model 44, Stepping 2
16x DDR3 DIMM 4096MB 1333MHz

Standby (which is faster)

CPU1 Version: Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz --- Signature: Type 0, Family 6, Model 26, Stepping 5
CPU2 Version: Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz --- Signature: Type 0, Family 6, Model 26, Stepping 5
16x DDR3 DIMM 4096MB 1333MHz

As you can see, the master host has the newer E5630 CPUs, yet they work slower than the older E5530s. The memory configuration is the same (compared using lshw -C memory); only the vendor and serial numbers differ.

Next, I took the Intel Memory Latency Checker and measured memory latency and bandwidth (both hosts are NUMA-aware). Results from the master:

$ sudo ./mlc-linux/mlc_avx512 --latency_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --latency_matrix
Using buffer size of 200.000MB
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0         366.1   228.9
       1         386.9   218.0
$ sudo ./mlc-linux/mlc_avx512 --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --bandwidth_matrix 
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0         6021.5  5102.2
       1         4799.6  6473.1

And the standby:

$ sudo ./mlc-linux/mlc_avx512 --latency_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --latency_matrix 
Using buffer size of 200.000MB
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0          82.3   130.1
       1         129.4    80.1

$ sudo ./mlc-linux/mlc_avx512 --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --bandwidth_matrix 
Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0        13439.5  7736.3
       1         7749.9 13846.2

Ideally, the picture on both hosts should look the same: access to the remote NUMA node should be slower than to the local one. But that's not the case. The standby looks fine, while on the master everything is wrong: 1) local access 0->0 is slower than remote access 0->1 - why? 2) access times on the master are significantly higher than on the standby, 218 ns vs 80 ns in the best cases, even though the master has newer CPUs. I ran the tests several times and got similar results.
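To rule out a BIOS-level NUMA misconfiguration (for example, mis-populated DIMM channels or node interleaving), I also dumped the node-to-CPU and node-to-memory mapping from sysfs; `numactl --hardware` shows the same information, this is just a sketch without extra packages:

```shell
#!/bin/bash
# Dump NUMA topology from sysfs: which CPUs and how much memory belong
# to each node. Asymmetric MemTotal between nodes would hint at
# mis-populated DIMMs; missing node directories would mean NUMA is
# effectively disabled (e.g. node interleaving enabled in the BIOS).
for node in /sys/devices/system/node/node[0-9]*; do
    echo "$(basename "$node"): cpus=$(cat "$node/cpulist") \
mem=$(awk '/MemTotal/ {print $4, $5}' "$node/meminfo")"
done
```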

I also wondered whether the results depend on the master being under production load while the standby is idle. But the load is not that high; load average is about 3.87, 3.85, 3.70.

Sysctls on both servers are the same; only one family of sysctl keys differs significantly: kernel.sched_domain.cpu*.domain*.max_newidle_lb_cost. I haven't found any information about this parameter; I tried changing it, but it changes back dynamically after some time.
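The comparison itself was a plain diff of sorted sysctl dumps taken on each host ("master" and "standby" here are placeholder hostnames):

```shell
# Dump all sysctls on each host, then diff the sorted output.
# "master" and "standby" are placeholder hostnames for the two servers.
ssh master  'sysctl -a 2>/dev/null | sort' > /tmp/sysctl.master
ssh standby 'sysctl -a 2>/dev/null | sort' > /tmp/sysctl.standby
# Show only the differing lines (< master, > standby):
diff /tmp/sysctl.master /tmp/sysctl.standby | grep '^[<>]'
```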

The PostgreSQL configuration (postgresql.conf) is the same on both hosts.

The tested queries don't consume any disk IO (checked via the pg_stat_statements contrib module, which provides per-query details); the whole dataset fits completely in memory (shared buffers). The tested queries have the same execution plan on both servers, but their execution times differ.
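For reference, this is roughly the check I ran against pg_stat_statements (column names as in the stock view; shared_blks_read staying at zero means all reads are served from shared buffers):

```sql
-- Per-query IO profile: shared_blks_read = 0 means the query
-- never had to fetch a block from disk.
SELECT query,
       calls,
       total_time / calls AS avg_time_ms,
       shared_blks_hit,
       shared_blks_read
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```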

I also thought about the recent Meltdown/Spectre vulnerabilities, checked both systems, and found the following difference: the master host with the newer CPUs has a complete mitigation for spectre_v2.

$ for i in $(sudo ls -1 /sys/devices/system/cpu/vulnerabilities/); do echo $i $(sudo cat /sys/devices/system/cpu/vulnerabilities/$i); done
meltdown Mitigation: PTI
spec_store_bypass Vulnerable
spectre_v1 Mitigation: Load fences
spectre_v2 Mitigation: Full retpoline

vs. standby host

$ for i in $(sudo ls -1 /sys/devices/system/cpu/vulnerabilities/); do echo $i $(sudo cat /sys/devices/system/cpu/vulnerabilities/$i); done
meltdown Mitigation: PTI
spec_store_bypass Vulnerable
spectre_v1 Mitigation: Load fences
spectre_v2 Vulnerable: Retpoline without IBPB

OK, so I disabled all protections through "/sys/kernel/debug/x86/*_enabled", but that didn't help at all (I then enabled them back).
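For completeness, this is how the switches can be inspected and flipped (these debugfs files only exist on some distro kernels, debugfs must be mounted at /sys/kernel/debug, and writing requires root):

```shell
#!/bin/bash
# Show the mitigation switches exposed via debugfs, if any.
# nullglob makes the loop a no-op when the files don't exist
# (e.g. on kernels that don't expose these knobs).
shopt -s nullglob
for f in /sys/kernel/debug/x86/*_enabled; do
    echo "$f = $(cat "$f")"
    # To disable a mitigation temporarily (root only):
    # echo 0 > "$f"
done
```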

So, can anybody explain what's wrong with the master server, and why its memory access and newer CPUs are so much slower? And how can I fix this issue?

lesovsky