
I am running two Dell R410 servers in the same rack of a data center (behind a load balancer). Both have the same hardware configuration, run Ubuntu 10.04, have the same packages installed, and run the same Java web servers (no other load), yet I'm seeing a substantial performance difference between the two.

The performance difference is most obvious in the average response times of the two servers (measured in the Java app itself, without network latencies): one of them is consistently 20-30% faster than the other.
I used dstat to check whether there are more context switches, IO, swapping, or anything else, but I see no reason for the difference. With the same workload (no swapping, virtually no IO), the CPU usage and load are higher on one server.
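For reference, a dstat invocation along these lines shows the relevant columns side by side (the exact flags are just one possible combination):

# one line every 5 seconds: time, CPU, disk, network, paging, interrupts/context switches
dstat -tcdngy 5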

So the difference appears to be mainly CPU bound, but while a simple cpu benchmark using sysbench (with all other load turned off) did yield a difference, it was only 6%. So maybe it is not only CPU but also memory performance.

So far I've checked:

  • Firmware revisions on all components (identical)
  • BIOS settings (I did a dump using dmidecode, and that showed no differences)
  • I compared /proc/cpuinfo, no difference.
  • I compared the output of cpufreq-info, no difference.
  • Java / JVM Parameters (same version and parameters on both systems)
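For what it's worth, a rough sketch of how the comparisons above can be scripted, assuming SSH and sudo access and the hypothetical hostnames web1 and web2 (fields like "cpu MHz" and "bogomips" in /proc/cpuinfo will always differ slightly and can be ignored):

# collect the same reports from both machines, then diff them
for host in web1 web2; do
    ssh $host "sudo dmidecode; cat /proc/cpuinfo; cpufreq-info" > report-$host.txt
done
diff report-web1.txt report-web2.txt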

Also, I completely replaced the RAM some months ago, without any effect.

I am lost. What can I do to figure out what is going on?

UPDATE: Yay! Both servers perform equally now. It was the "power CRAP" settings, as jim_m_somewhere called them in the comments. The BIOS option for "Power Management" was set to "Maximum Performance" on the fast server, and to "Active Power Controller" (the default setting from Dell) on the other one. Obviously I forgot that I had made that setting two years ago, and that I hadn't done it on all servers. Thanks to all for your very helpful input!
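In case anyone else hits this: if Dell OpenManage Server Administrator is installed, the BIOS power profile can be read from the running OS and compared across hosts without rebooting into the BIOS setup (the grep pattern below is just a guess at the relevant attribute names):

# requires Dell OMSA; lists BIOS settings, including the power management profile
omreport chassis biossetup | grep -i -A2 power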

the.duckman
  • It's possible you have faulty RAM. If your application is network-heavy, it could be anything along the network stack. – Kyle Dec 04 '12 at 17:03
  • Are you 100% certain the servers are identical (from the same manufacturing batch, same RAM/FSB speed, same CPU model/revision, same disk controllers and drives, same OS & OS configuration, etc.) -- there's lots of little things that can make big differences in performance. 20%+ sounds like RAM/FSB or disk subsystem to me... – voretaq7 Dec 04 '12 at 17:45
  • 2
  • If they are serving the same data, is there any load balancing going on from a firewall or DNS? What do the network stats look like? Are the Java configurations identical as well? Is the Java heap size the same? Shooting in the dark on this one. – au_stan Dec 04 '12 at 18:28
  • 2
  • Can you compare the "Advanced CPU Settings" in the BIOS? You might be able to run an ipmitool command to do so. Is the speed on the RAM the same? I assume you have checked whether you have battery backup on the disks/controllers... just thinking "out loud"... is the RAM on both boxes the same, registered or not registered? AH... have you checked that the "power CRAP" - ACPI - is off on both servers? – jim_m_somewhere Dec 04 '12 at 17:26
  • 2
  • Is the software configuration truly identical? For example, is AppArmor enabled on one and disabled on the other? Also check 'dmesg' for errors. – Anton Cohen Dec 06 '12 at 19:56
  • 1
  • Have you checked the network cable and the port on the switch, and also looked at the IOPS or checked the health of the HDD? Regards – Dec 07 '12 at 20:23
  • 1
  • I do not believe you have sufficient information about the problem. You should run some system profiling tools like perf or oprofile on the program. Rather than trying to look for differences in the running system (which has huge scope), look for differences in the application itself. If at all possible, reduce the problem down to a specific test case of a few lines of code or a few functions. You are going to spend a very long time eliminating variables when all you know is that CPU usage is higher. – Matthew Ife Dec 08 '12 at 09:13

5 Answers


Two ideas, depending on how far you want to go with this:

  1. Swap the disks of the two servers and see whether the performance difference stays with the hardware or moves with the software.

  2. Compare the output of /opt/dell/toolkit/bin/syscfg -o complete-bios-config.out on both servers, if you can somehow trick this package into installing.
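If you do get the toolkit installed, the comparison could look roughly like this (file names are just examples):

# dump the full BIOS configuration on each server, then diff the two dumps
/opt/dell/toolkit/bin/syscfg -o bios-server1.out    # run on server 1
/opt/dell/toolkit/bin/syscfg -o bios-server2.out    # run on server 2
diff bios-server1.out bios-server2.out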

chutz
  • The output of dstat showed pretty clearly that the difference in performance also occurs when no IO is happening. Installing syscfg on Ubuntu 10.04 seems tricky indeed. I did compare the output of dmidecode already; would syscfg show more? Maybe it is less work to take photos of each BIOS screen and compare them. I might try this. – the.duckman Dec 07 '12 at 18:48
  • 1
  • By swapping the disks I did not mean to investigate the IO, but rather to see whether a software (mis)configuration is causing the slowness (an odd kernel parameter, for example). – chutz Dec 08 '12 at 03:40

More possibilities to output and diff:

  • sysctl -a (make sure the kernel tunables are the same)
  • cat /proc/interrupts (maybe some other piece of hardware is messing things up?)
  • ipmitool sensor list (a long shot, but check for lower-level differences: overheating, voltage problems, etc.)
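For the last one, sensor readings will never match byte-for-byte, so rather than a literal diff, collect them from both hosts and look for large deviations:

# dump temperatures, voltages and fan speeds; repeat on the other host and compare side by side
ipmitool sensor list > sensors-$(hostname).txt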
  • Thanks, no obvious difference in the output of these commands, unfortunately. – the.duckman Dec 07 '12 at 18:46
  • 2
  • *All* differences are obvious if you compare the files using *software*. Please refer to this question: [How do I diff two config files?](http://serverfault.com/questions/14212/how-can-i-diff-two-config-files) – Skyhawk Dec 07 '12 at 19:21

This sounds like it might be load-balancer related to me. When you say "same workload", how are you measuring this? Are you directly benchmarking each server by applying a test load in isolation, or are you applying some load to the load balancer and looking at the results on both servers?

If you're doing the latter (measuring the load placed on both servers through the load balancer), your load balancer may not be splitting the workload exactly evenly between the servers. A 20% skew for a pair of servers is not uncommon, depending on how your load balancer decides who gets which requests, and that would cause one server to take more load and thus perform worse.

(If you're directly benchmarking each server, in isolation, without using the load balancer as an intermediary, and you've verified that every component is identical (down to manufacturer revisions) between both systems, then I'm at a loss -- I can't think of any other measurable reason for this kind of performance difference between otherwise identical servers.)
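If you want to take the balancer out of the picture entirely, something as simple as ApacheBench against each backend works; the URL, port and request counts below are only placeholders:

# send an identical load directly to each backend, bypassing the load balancer
ab -n 10000 -c 50 http://server-a:8080/some/endpoint
ab -n 10000 -c 50 http://server-b:8080/some/endpoint
# then compare the "Time per request" and percentile numbers between the two runs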

voretaq7
  • You are right, our load balancer does that too - it's actually a feature. So I measured in lots of ways, and yes, I even "replayed" the same requests on each server individually once. But even simply putting all live traffic on a single server for some time and comparing the time each server needed to prepare the response yields the same results as the more complex setups. – the.duckman Dec 07 '12 at 19:45
  • Hmm - in that case I'm officially stumped - if everything is truly identical (and we seem to have confirmed pretty well that it is) you should be within a reasonable margin of error on performance numbers (±5-7%) - you're seeing variations of more than double that, and I've got no idea why :-/ – voretaq7 Dec 07 '12 at 21:28

Try some profiling tools, either system profiling like perf or Java profiling like VisualVM.

With perf you could either profile the running Java process by PID or profile a benchmark run. Look at both systems and see where the slow system is spending its time.

apt-get install linux-tools-common linux-tools

Then something like:

perf record -e cpu-cycles -p <pid>

or

perf record -a -g <benchmark command>

then

perf report
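If your perf version supports it (it may not on a 10.04-era kernel), you can also compare two recordings of the same benchmark directly; this assumes you copy the perf.data file from one host next to the other's:

# shows per-symbol overhead differences between a baseline and a second recording
perf diff perf.data.old perf.data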

A couple of ideas on how systems can perform differently:

Environment: Is the air temperature or airflow different? Are they in racks? I have seen systems perform differently in different rack positions, caused by vibration. There are different levels of vibration throughout each rack. It's unlikely, considering you said there is almost no I/O being used. But I have seen disks slow down to 2MB/sec sequential writes due to vibration in parts of a rack.

Hardware Faults: Any of the hardware could be faulty. Use the profiling to see what is slow. It could be a bad CPU or chipset, a heatsink not attached properly, out of balance fans causing vibration, failed fans, even a bad PSU. Try swapping things that are easy to swap.

Anton Cohen

Why has nobody suggested 'sysprof'?

This is what it was designed for.

Or, umm, second thought... try putting some limits in /etc/security/limits.conf
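Before changing anything there, it may be worth just comparing the limits the running JVM actually ended up with on each box (the pgrep pattern assumes the JVM is the only java process):

# show the effective resource limits of the running Java process
cat /proc/$(pgrep -f java | head -1)/limits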

Try both.

If you get nothing... then most likely you have a security problem or a physical defect.

see also: My linux server "Number of processes created" and "Context switches" are growing incredibly fast

ArrowInTree