
I have two Dell R730 systems with identical hardware configurations, purchased at the same time. Both run RHEL 6.9 and were imaged from the same image in January. I update the packages from the repository once a month, so in general everything on the two systems should be "nearly" identical (i.e. any software or setting I change on one system gets changed on the other, but since it is a manual process, something could have been missed).

I have noticed that one system is 2.5X slower than the other. The jobs I am testing are single-threaded and CPU-intensive. They read some data files, but disk I/O utilization is very low according to iostat. top shows the process constantly pegged at 100%, but the system has 88 threads and the load average is only about 1. There is very little memory utilization and no network utilization (all files the jobs use are local). One job is a complex Python script and another is a proprietary software program; both run much slower on one system than on the other.

/proc/cpuinfo is identical. BIOS settings are identical. Only one user on the system. The faster system is connected to the internet, the slower one is on a standalone network.

In my investigations I've only found two differences:

  1. The faster system is running BIOS version 2.25; the slower system is running BIOS version 2.43.
  2. The slower system has auditd running. However, there is zero activity in the audit log during the process.

I realize this is difficult to debug but I am running out of ideas of what to look for. Are there some builtin software tools I can use to give more insight on what might be going on?

ewwhite
eng3
  • You can try swapping the disks between the machines; then you will know whether the problem is in the installed OS or the hardware setup. I would also try the same with the RAM sticks: if swapping the RAM between them causes the other machine to slow down, you will need to run memtest on those sticks. If it's ECC RAM that is failing, it can cause slowdowns without harming system stability. – bocian85 Oct 12 '17 at 23:45
  • That is a good idea but unfortunately I'm not allowed to swap the disks or the RAM. I can try performing a memtest (it is ECC RAM). Perhaps I can run the same test on both systems and check the memory io speed. Is there a good way to check this? – eng3 Oct 13 '17 at 00:44
  • Do not try to debug, just upgrade your BIOS (Intel bug, see: https://stackoverflow.com/questions/42144791/cpu-not-transitioning-into-higher-c-states ) – sfk Oct 13 '17 at 09:51

2 Answers


My recommendations today with EL6 systems on enterprise hardware are the following:

  • Set your Dell servers to "OS Control" mode for power, versus a "High Performance" or "Dynamic" mode. This will allow your single-threaded processes to actually leverage Turbo Boost a bit better and give the OS CPU governor the right control.
  • Is there any reason you can't bring the firmware to the same revision?
  • For EL6, you should set the tuned-adm profile to enterprise-storage or latency-performance.
  • If your slower system isn't internet connected, check DNS and your /etc/hosts file definition to make sure that you're not being slowed down by any resolution issues.
  • Examine and compare your /etc/sysctl.conf settings across systems.
  • You can run sosreport to try to get a summary of both systems' configs.
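For the sysctl comparison above, a small script can save some eyeballing. This is only a minimal sketch: it assumes you have saved the output of `sysctl -a` (or a copy of /etc/sysctl.conf) from each server to a text file, and the function names `parse_sysctl` and `diff_settings` are made up for illustration.

```python
def parse_sysctl(text):
    """Parse "key = value" lines (sysctl -a output or sysctl.conf) into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines as found in /etc/sysctl.conf.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a "key = value" line
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(fast, slow):
    """Return {key: (fast_value, slow_value)} for keys whose values differ.

    A key missing on one side shows up with None for that side.
    """
    diffs = {}
    for key in sorted(set(fast) | set(slow)):
        a, b = fast.get(key), slow.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs
```

Read the two saved files, pass their contents through `parse_sysctl`, and print whatever `diff_settings` returns. Some keys in `sysctl -a` output are counters or random values that will always differ between hosts, so expect a little noise.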

Of course, you could also profile the processes... top, perf top, pidstat, strace.
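Alongside those tools, one cheap thing to compare while the job is running is the clock speed the kernel reports for each core. The sketch below parses the "cpu MHz" lines from /proc/cpuinfo; the helper names are hypothetical, and whether that field reflects the live frequency depends on the cpufreq driver in use, so treat the numbers as a hint and not a measurement.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' values found in /proc/cpuinfo text, one per logical CPU."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "cpu MHz":
            speeds.append(float(value))
    return speeds


def summarize(speeds):
    """Return (min, max, mean) of the speeds, or None if the list is empty."""
    if not speeds:
        return None
    return (min(speeds), max(speeds), sum(speeds) / len(speeds))


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Read and parse the current 'cpu MHz' values from the given file."""
    with open(path) as f:
        return parse_cpu_mhz(f.read())
```

Calling `summarize(read_cpu_mhz())` on both servers while the slow job is pegged at 100% gives a quick side-by-side. If the slow server's maximum sits well below the fast one's, that points toward power management and away from the software stack.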

Or look at the servers in realtime with Netdata and correlate all of the system metrics to see where the bottleneck(s) exist.

I also do the following in /etc/profile.d/tzfix.sh for good reason:

# Set TZ variable to reduce stat("/etc/localtime") activity
# See: https://blog.packagecloud.io/eng/2017/02/21/set-environment-variable-save-thousands-of-system-calls/
#
export TZ=:/etc/localtime

Just some ideas to start.

ewwhite
  • So there is some conflicting advice: "Max Performance" with C states disabled, OR OS Control (C states enabled). I guess I will try both. I don't think I can downgrade a BIOS version, and I am afraid to upgrade the fast system to the version of the slow system in case that is the cause, although I noticed that Dell just released a new version yesterday. I'll look into tuned-adm; I've never heard of it. I did check the DNS, and that is OK. I'll double-check the sysctl.conf file. What surprises me is that any of these settings would cause a 2.5X difference. – eng3 Oct 13 '17 at 16:08
  • Max performance disables Turbo Boost, so it’s potentially leaving performance on the table for single-threaded workloads. – ewwhite Oct 13 '17 at 20:32
  • I do not see an "OS Control" option in the Dell BIOS on the R730. There is System DAPC for CPU power management, and there is a separate "Turbo Boost" enable/disable setting. Same for C1E and C states. – eng3 Dec 15 '17 at 16:21

This is probably related to power management. Try putting both servers in high performance mode (power management disabled) and redo your performance tests.
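To make the before/after comparison repeatable, it helps to rerun the same fixed single-threaded workload on both servers after each settings change. The following is a minimal sketch of such a test, not a calibrated benchmark; the function names and the iteration count are arbitrary choices for illustration.

```python
from timeit import default_timer


def busy_work(iterations):
    """Run a fixed integer loop and return a checksum so the work is not skipped."""
    total = 0
    for i in range(iterations):
        total = (total + i * i) % 1000003
    return total


def time_run(iterations=5000000, repeats=3):
    """Return (best_seconds, checksum) over several runs of busy_work."""
    best = None
    checksum = None
    for _ in range(repeats):
        start = default_timer()
        checksum = busy_work(iterations)
        elapsed = default_timer() - start
        # Keep the fastest run; it is the least disturbed by other activity.
        if best is None or elapsed < best:
            best = elapsed
    return best, checksum
```

Run `time_run()` on each server with the same Python version and compare the best times. The checksum should be identical on both machines, which confirms they did the same work. A gap close to the 2.5X seen with the real jobs would suggest the slowdown is in raw CPU speed, not in anything specific to those programs.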

shodanshok
  • Also change your C State: http://kb.eclipseinc.com/kb/how-do-i-disable-c-states-on-a-dell-server/ – Jacob Evans Oct 12 '17 at 22:06
  • I'll give it a try and report back. I thought of this (though I don't recall the setting), but the setting is the same on both systems, so I wouldn't think it would be the culprit, unless the newer BIOS version changed how the setting affects performance. – eng3 Oct 12 '17 at 22:33