We have several web servers running on Amazon EC2 c1.xlarge instances, using the Amazon AMI.

The servers are duplicates of each other, running the exact same hardware and software. Each server spec is:

  • 7 GB of memory
  • 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
  • 1690 GB of instance storage
  • 64-bit platform
  • I/O Performance: High
  • API name: c1.xlarge

A couple of weeks ago we ran a yum upgrade on one of the servers. Since that upgrade, the upgraded server has been showing a high load average. Needless to say, we did not update the other servers, and we cannot do so until we understand the reason for this behavior.

The strange thing is that when we compare the servers using top or iostat, we cannot find the reason for the high load. Note that we have moved traffic from the "problematic" server to the others, which has left the "problematic" server handling fewer requests, and still its load is higher.

Do you have any idea what it could be, or where else we can check?

# proper server
# w command
 00:42:26 up 2 days, 19:54,  2 users,  load average: 0.41, 0.48, 0.49
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
      pts/1     00:28   14:05   0.01s  0.01s -bash
      pts/2     00:38    0.00s  0.02s  0.00s w

# proper server
# iostat command
Linux 3.2.12-3.2.4.amzn1.x86_64   _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.03    0.02    4.26    0.17    0.13   86.39

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvdap1            1.63         1.50        55.00     367236   13444008
xvdfp1            4.41        45.93        70.48   11227226   17228552
xvdfp2            2.61         2.01        59.81     491890   14620104
xvdfp3            8.16        14.47        94.23    3536522   23034376
xvdfp4            0.98         0.79        45.86     192818   11209784

# problematic server
# w command
 00:43:26 up 2 days, 21:52,  2 users,  load average: 1.35, 1.10, 1.17
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
      pts/0     00:28   15:04   0.02s  0.02s -bash
      pts/1     00:38    0.00s  0.05s  0.00s w

# problematic server
# iostat command
Linux 3.2.20-1.29.6.amzn1.x86_64          _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.97    0.04    3.43    0.19    0.07   88.30

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvdap1            2.10         1.49        76.54     374660   19253592
xvdfp1            5.64        40.98        85.92   10308946   21612112
xvdfp2            3.97         4.32        93.18    1087090   23439488
xvdfp3           10.87        30.30       115.14    7622474   28961720
xvdfp4            1.12         0.28        65.54      71034   16487112

# sar -q proper server
Linux 3.2.12-3.2.4.amzn1.x86_64 (***.com)        07/01/2012      _x86_64_        (8 CPU)

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
12:10:01 AM        13       194      0.41      0.47      0.51
12:20:01 AM         7       188      0.26      0.39      0.49
12:30:01 AM         9       198      0.64      0.49      0.49
12:40:01 AM         9       194      0.50      0.48      0.48
12:50:01 AM         7       191      0.44      0.36      0.41
01:00:01 AM        10       195      0.76      0.64      0.51
01:10:01 AM         7       175      0.41      0.58      0.56
01:20:01 AM         8       183      0.38      0.42      0.49
01:30:01 AM         8       186      0.43      0.38      0.44
01:40:01 AM         8       178      0.58      0.46      0.43
01:50:01 AM         9       185      0.47      0.45      0.45
02:00:01 AM         9       184      0.38      0.47      0.48
02:10:01 AM        10       184      0.50      0.51      0.50
02:20:01 AM        13       200      0.37      0.45      0.48
Average:            9       188      0.47      0.47      0.48

02:28:42 AM       LINUX RESTART

02:30:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:40:01 AM         9       151      0.55      0.55      0.37
02:50:01 AM         7       163      0.54      0.48      0.42
03:00:01 AM         9       164      0.35      0.43      0.42
03:10:01 AM        10       168      0.31      0.36      0.40
03:20:01 AM         8       170      0.27      0.34      0.39
03:30:01 AM         8       167      0.50      0.55      0.48
03:40:01 AM         8       153      0.22      0.36      0.43
03:50:01 AM         7       165      0.38      0.38      0.41
04:00:01 AM         8       169      0.70      0.45      0.42
04:10:01 AM         8       160      0.58      0.46      0.43
04:20:01 AM         8       166      0.31      0.35      0.40
04:30:01 AM         9       166      0.17      0.33      0.38
04:40:01 AM         9       159      0.13      0.29      0.37
04:50:01 AM        12       170      0.36      0.28      0.32
05:00:01 AM         7       162      0.16      0.22      0.28
05:10:01 AM         6       163      0.51      0.43      0.36
05:20:01 AM         8       162      0.50      0.45      0.41
05:30:01 AM        10       170      0.30      0.32      0.36
05:40:01 AM         7       167      0.37      0.32      0.33
05:50:01 AM         8       166      0.48      0.44      0.38
06:00:01 AM        12       177      0.41      0.41      0.40
06:10:01 AM         8       166      0.47      0.44      0.42
06:20:01 AM         9       177      0.32      0.38      0.40
06:30:01 AM         5       166      0.29      0.37      0.40
06:40:01 AM         8       165      0.57      0.41      0.40
Average:            8       165      0.39      0.39      0.39

# sar -q problematic server
Linux 3.2.20-1.29.6.amzn1.x86_64 (***.com)       07/01/2012      _x86_64_        (8 CPU)

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
12:10:01 AM        12       194      1.20      1.19      1.28
12:20:01 AM         7       200      0.95      1.26      1.34
12:30:01 AM        11       199      1.16      1.23      1.30
12:40:01 AM         7       200      0.96      1.03      1.18
12:50:01 AM         8       208      1.42      1.17      1.16
01:00:02 AM         8       201      0.91      1.09      1.16
01:10:01 AM         7       200      1.08      1.15      1.19
01:20:01 AM         9       200      1.45      1.25      1.23
01:30:01 AM        11       195      0.97      1.10      1.19
01:40:01 AM         7       188      0.78      1.05      1.16
01:50:01 AM         9       196      1.32      1.22      1.24
02:00:01 AM        12       206      0.96      1.17      1.22
02:10:01 AM         9       187      0.96      1.09      1.17
Average:            9       198      1.09      1.15      1.22

02:23:22 AM       LINUX RESTART

02:30:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:40:01 AM         9       160      1.12      1.16      0.87
02:50:01 AM         9       163      0.77      0.94      0.91
03:00:01 AM         7       162      1.03      1.10      1.03
03:10:01 AM         9       164      0.99      1.07      1.05
03:20:01 AM         8       171      1.08      1.11      1.07
03:30:01 AM         8       167      1.02      0.99      1.02
03:40:01 AM         5       158      1.20      1.06      1.05
03:50:01 AM         8       171      1.11      1.10      1.07
04:00:01 AM         7       162      1.12      1.10      1.10
04:10:01 AM         9       164      0.90      0.94      1.02
04:20:01 AM         7       169      0.90      1.08      1.10
04:30:01 AM        13       169      1.07      1.07      1.10
04:40:01 AM        11       166      0.95      1.12      1.13
04:50:01 AM         7       173      1.04      1.12      1.13
05:00:01 AM         7       166      1.26      1.20      1.19
05:10:01 AM        10       169      1.14      1.25      1.22
05:20:01 AM        10       170      0.98      1.12      1.19
05:30:01 AM        10       166      0.82      0.98      1.09
05:40:01 AM        11       171      1.18      1.16      1.11
05:50:01 AM        12       187      1.07      1.19      1.16
06:00:01 AM         9       171      1.27      1.17      1.16
06:10:01 AM         7       169      1.40      1.26      1.22
06:20:01 AM         8       171      0.91      1.12      1.19
06:30:01 AM         8       172      1.00      1.11      1.17
06:40:01 AM         9       177      1.02      1.10      1.15
Average:            9       168      1.05      1.10      1.10
  • I don't know about this version, but when we did an upgrade to Debian Squeeze we found imagemagick has really bad performance. I'd definitely try iostat with -x as well to see what percentage of IO capability it's reporting as being used, and average service times. – Bron Gondwana Jul 01 '12 at 08:06
  • Are they running different kernels? There were changes in the way load average accounting is done. Do you also see performance differences? Is it possible the problem is purely cosmetic? – David Schwartz Jul 01 '12 at 09:31
  • @David Schwartz - thanks for the feedback. We "feel" the problematic server is slower. We will measure it and get back to you. Thanks for the info about the accounting change. Meanwhile I have updated the question with sar -q info. – Oz. Jul 01 '12 at 12:53
  • @David Schwartz - we did the tests, everything seems fine between the servers but the numbers are higher in the server with the new kernel. Where did you see that changed the load average accounting? --- http://www.kernel.org/pub/linux/kernel/v3.x/ --> linux kernel release notes. Thanks – Oz. Jul 01 '12 at 15:58

3 Answers


AWS overcontend their VM servers; they're assuming that not everyone will be consuming all the resources allocated to them, and so Amazon can make more money per unit of hardware deployed. Thus you can have two otherwise-identical systems running with wildly different performance patterns. The correlation with the upgrade is likely to be a coincidence.

A note on your diagnostic data: you really want the output of sar -q to help you diagnose this sort of problem. iostat is really only examining a very small portion of the possible sources of the issue.
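To compare two hosts on the same footing, sar -q output can be reduced to an average load per CPU. What follows is only a minimal Python sketch, not something from the original answer: the names parse_sar_q and mean_load_per_cpu are made up for illustration, and the regular expression assumes the 12-hour timestamp layout shown in the question.

```python
import re

# One data row of `sar -q`: time, AM/PM, runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15.
# Header rows, "Average:" rows and "LINUX RESTART" rows do not match and are skipped.
ROW = re.compile(
    r"^\d{2}:\d{2}:\d{2} [AP]M\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)$"
)


def parse_sar_q(text):
    """Return a list of dicts, one per data row found in `sar -q` output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "runq_sz": int(m.group(1)),
                "plist_sz": int(m.group(2)),
                "ldavg_1": float(m.group(3)),
                "ldavg_5": float(m.group(4)),
                "ldavg_15": float(m.group(5)),
            })
    return rows


def mean_load_per_cpu(rows, cpus, key="ldavg_1"):
    """Average the chosen load column and divide by the CPU count."""
    if not rows:
        raise ValueError("no sar -q data rows found")
    return sum(r[key] for r in rows) / len(rows) / cpus


if __name__ == "__main__":
    # Two rows copied from the question's "problematic server" output.
    sample = """\
12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
12:10:01 AM        12       194      1.20      1.19      1.28
12:20:01 AM         7       200      0.95      1.26      1.34
"""
    rows = parse_sar_q(sample)
    print("load per CPU: %.3f" % mean_load_per_cpu(rows, cpus=8))
```

On an 8-CPU instance, a 1-minute load average near 1.1 works out to roughly 0.14 per CPU, which helps put the difference between the two servers in proportion.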

  • Rackspace is better in these terms. They got cheaper cloud storage too. – Andrew Smith Jul 01 '12 at 08:28
  • All cloud providers suck, they just suck differently. – womble Jul 01 '12 at 08:58
  • @womble - I have updated the question with sar -q info. I don't think it's CPUs being taken by Amazon - top is showing a steal time of 0.2% --- Cpu(s): 7.6%us, 3.4%sy, 0.0%ni, 88.2%id, 0.6%wa, 0.0%hi, 0.0%si, 0.2%st – Oz. Jul 01 '12 at 13:02
  • @Oz.: You appear to assume that the only possible overcontended resource is CPU time. That is not a correct assumption. – womble Jul 02 '12 at 05:57

Also, don't keep staring at load alone. It's quite cantankerous. Your I/O-states and CPU-states are easier to read and less likely to lie to you.

To give you an example: make ten NFS mounts. Take the NFS server down. Your box now has a load of 10 (and a bit) and no I/O or CPU usage to speak of.

Your NFS mounts want to know when the NFS server comes back. So they put themselves in the run queue, all ten of them. When their turn comes up in the scheduler they check whether the NFS server is back, which takes microseconds, and since it's still down they put themselves back on the run queue again. Ten programs in the run queue is a load of 10.0.
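One way to see what is actually feeding the load figure is to count tasks by state. On Linux the load average includes runnable tasks (state R) and also tasks in uninterruptible sleep (state D), and in practice processes stuck on an unreachable NFS server tend to show up as D. The sketch below is an illustration only, assuming a Linux /proc filesystem; the helper names parse_state, count_states and read_task_stat_lines are hypothetical.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Extract the state letter from the contents of a /proc/<pid>/stat file.

    The command name (second field) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split on the LAST ')' before reading
    the state field that follows it.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(stat_lines):
    """Count how many tasks are in each state (R, S, D, ...)."""
    return Counter(parse_state(line) for line in stat_lines)


def read_task_stat_lines(proc_root="/proc"):
    """Read the stat file of every task (thread) visible under /proc.

    Returns an empty list if proc_root is not readable (e.g. not Linux).
    Tasks that exit while we are scanning are silently skipped.
    """
    lines = []
    try:
        pids = [name for name in os.listdir(proc_root) if name.isdigit()]
    except OSError:
        return lines
    for pid in pids:
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    lines.append(f.read())
            except OSError:
                continue
    return lines


if __name__ == "__main__":
    counts = count_states(read_task_stat_lines())
    print("runnable (R):", counts.get("R", 0))
    print("uninterruptible sleep (D):", counts.get("D", 0))
```

Running this a few times on both servers and comparing the R and D counts gives a rough idea of whether the extra load comes from real work or from tasks that are merely waiting.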


At the risk of a "me too": we see this exact same issue on EC2. This isn't simply an overcommit issue; the problem seems to be confined to instances (XLs in our case) running 3.2.20 versus 3.2.12.

In our case, this is basically phantom load -- we see a load average of around .75 on the 3.2.20 instances; the 3.2.12 stay closer to 0.01. We are not convinced, however, that these instances are really slower than the others.
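For background on why the reported number can shift when the accounting changes: the Linux load average is an exponentially damped average of the active task count, sampled roughly every 5 seconds. The sketch below is a floating-point approximation of that idea (the kernel itself uses fixed-point arithmetic); the function names are hypothetical, and it is not meant to reproduce whatever differs between 3.2.12 and 3.2.20.

```python
import math

SAMPLE_INTERVAL_S = 5.0  # approximate sampling period of the kernel's load calculation


def update_load(load, active_tasks, window_s):
    """Apply one sampling step of the exponentially damped average.

    window_s is 60, 300 or 900 for the 1-, 5- and 15-minute figures.
    """
    decay = math.exp(-SAMPLE_INTERVAL_S / window_s)
    return load * decay + active_tasks * (1.0 - decay)


def simulate(active_samples, window_s=60.0, start=0.0):
    """Feed a sequence of sampled active-task counts through the average."""
    load = start
    for n in active_samples:
        load = update_load(load, n, window_s)
    return load


if __name__ == "__main__":
    # One task counted as active at every sample for ten minutes.
    steady = [1] * 120
    print("1-min load after 10 min with 1 active task: %.2f" % simulate(steady))
    # The same box where the task is only seen at every other sample.
    alternating = [1, 0] * 60
    print("1-min load when seen every other sample: %.2f" % simulate(alternating))
```

Because the figure depends only on how many tasks are counted as active at each sampling instant, a change in what gets counted at those instants can move the load average without any change in the work actually done.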