slow file server with contradictory values for CPU and CPL in atop

Question

I have a file server (CentOS 6.3) that suddenly slowed down earlier today. The cluster that mounts it could access other NFS mounts without a problem, but access to this one was very slow. Logging in via ssh was very slow too (and the iDRAC virtual console had no signal, though that may be a separate problem).

Running iostat -x 5 on the server didn't show any obvious problem: 'await' was mostly 0, sometimes up to 2, and %util was mostly 0, sometimes up to 3, rarely 5. From what I understand, this indicates no obvious I/O problem?

Running atop on the server showed nothing that seemed unusual to me, except that the CPL load averages were in the 14-17 range, whereas CPU utilization stayed between 100% and 200% out of 3200% during the 30 minutes or so that I was watching. The atop output is below.

A question about CPL that might relate to this: the system uses hyper-threading, so it shows 32 CPUs although there are only 16 physical cores (2x8). Does CPL apply to physical cores only, or also to the hyper-threaded virtual cores (if that's the term)? A CPL of 14-17 should be fine in the latter case, but not in the former. In either case, I don't understand why CPL looks so different from CPU.

Thanks for any thoughts.

PRC |  sys   10.70s  |  user   0.18s  |  #proc   2846 |  #tslpu     9  |  #zombie    0  |  #exit      6  |
CPU |  sys     107%  |  user      2%  |  irq       0% |  idle   3094%  |  wait      0%  |  curscal   ?%  |
CPL |  avg1   14.86  |  avg5   17.50  |  avg15  17.52 |  csw     4265  |  intr   31460  |  numcpu    32  |
MEM |  tot    31.3G  |  free  128.6M  |  cache  25.2G |  dirty  94.9M  |  buff  165.6M  |  slab    2.1G  |
SWP |  tot     1.0G  |  free  960.8M  |               |                |  vmcom   5.4G  |  vmlim  16.6G  |
LVM |  rt-lv_export  |  busy      0%  |  read       0 |  write     35  |  MBw/s   0.02  |  avio 0.00 ms  |
DSK |           sda  |  busy      0%  |  read       0 |  write     10  |  MBw/s   0.01  |  avio 0.30 ms  |
DSK |           sdb  |  busy      0%  |  read       0 |  write     25  |  MBw/s   0.02  |  avio 0.00 ms  |
DSK |           sdc  |  busy      0%  |  read       0 |  write      9  |  MBw/s   0.00  |  avio 0.00 ms  |
NET |  transport     |  tcpi      25  |  tcpo      22 |  udpi       0  |  udpo       0  |  tcpao      0  |
NET |  network       |  ipi       37  |  ipo       27 |  ipfrw      0  |  deliv     25  |  icmpo      0  |
NET |  pem3      0%  |  pcki     299  |  pcko       0 |  si   16 Kbps  |  so    0 Kbps  |  erro       0  |
NET |  pem1      0%  |  pcki      57  |  pcko      12 |  si    3 Kbps  |  so    1 Kbps  |  erro       0  |
NET |  em1     ----  |  pcki      57  |  pcko      12 |  si    2 Kbps  |  so    1 Kbps  |  erro       0  |

  PID   TID RUID      THR  SYSCPU  USRCPU  VGROW  RGROW   RDDSK  WRDSK ST EXC S CPUNR  CPU CMD         1/3
20539     - root        1   1.09s   0.00s     0K     0K      0K     0K --   - D     7  11% nfsd
20544     - root        1   1.01s   0.00s     0K     0K      0K     0K --   - D     1  10% nfsd
  356     - root        1   0.99s   0.00s     0K     0K      0K     0K --   - D    25  10% kswapd1
20545     - root        1   0.93s   0.00s     0K     0K      0K     0K --   - R     2   9% nfsd
20546     - root        1   0.93s   0.00s     0K     0K      0K     0K --   - D     4   9% nfsd
  355     - root        1   0.90s   0.00s     0K     0K      0K     0K --   - R    22   9% kswapd0
20540     - root        1   0.87s   0.00s     0K     0K      0K     0K --   - D    26   9% nfsd
20541     - root        1   0.86s   0.00s     0K     0K      0K     0K --   - D    30   9% nfsd
 1170     - root        1   0.84s   0.00s     0K     0K      0K     0K --   - D     6   8% cook-news
20542     - root        1   0.83s   0.00s     0K     0K      0K     0K --   - D    22   8% nfsd
20543     - root        1   0.83s   0.00s     0K     0K      0K     0K --   - D     6   8% nfsd
  536     - root        1   0.40s   0.14s     0K     0K      0K     0K --   - R    19   5% atop
 1650     - root        0   0.16s   0.04s     0K     0K       -      - NE   1 E     -   2% <ps>
 5798     - root       47   0.01s   0.00s     0K     0K      0K     4K --   - S    13   0% dsm_om_connsvc
 4944     - root        1   0.01s   0.00s     0K     0K      0K     0K --   - S    13   0% snmpd
  138     - root        1   0.01s   0.00s     0K     0K      0K     0K --   - S     7   0% events/7
  139

score 1 · Answer 1 · answered Nov 26 '14 at 19:30

CPL shows the load average figures, which reflect the number of threads that are either runnable on a CPU (i.e. on the run queue) or waiting for disk I/O (uninterruptible sleep, shown as state D in atop's S column). You seem to have ~16 threads waiting for the disk. That is why the CPU is mostly idle: it has nothing to do other than wait for the disk.
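To make that concrete, here is a small Python sketch that counts the tasks that would contribute to the load average, given their state letters. The function name load_contributors is made up for illustration, and the list of states is simply copied from the S column of the first twelve tasks in the atop process list in the question. It is a rough illustration of the idea, not how the kernel actually computes the (exponentially damped) averages.

```python
from collections import Counter

# State letters as shown in atop's "S" column.
RUNNABLE = "R"          # running or waiting on a run queue
UNINTERRUPTIBLE = "D"   # uninterruptible sleep, usually waiting on I/O


def load_contributors(states):
    """Count the tasks that the Linux load average would include.

    Hypothetical helper for illustration only: it takes an iterable of
    one-letter task states and tallies the R and D entries.
    """
    counts = Counter(states)
    return {
        "runnable": counts[RUNNABLE],
        "uninterruptible": counts[UNINTERRUPTIBLE],
        "total": counts[RUNNABLE] + counts[UNINTERRUPTIBLE],
    }


# States of the first twelve tasks in the atop process list (nfsd, kswapd,
# cook-news, atop), copied from the question.
atop_states = ["D", "D", "D", "R", "D", "R", "D", "D", "D", "D", "D", "R"]

print(load_contributors(atop_states))
```

For that snapshot the D count agrees with the #tslpu 9 field in the PRC line, while almost none of those tasks are using CPU time.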

I would check the disks of this system: look in dmesg for disk errors, review the smartctl attributes and error log, and also run a short self-test. I think this might be your problem, as the disk read and write rates are very low.

Perhaps a RAID array is running in degraded mode or rebuilding.
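If the array happens to be Linux software RAID (md), which is only an assumption here since nothing in the question says so, the degraded or rebuilding state is visible in /proc/mdstat. Below is a rough Python sketch that parses that text. The function name md_array_health and the sample text are invented for illustration; a hardware RAID controller would need the vendor's own tools instead.

```python
import re

# Member status map at the end of an md status line, e.g. "[UU]" or "[U_]".
STATUS_MAP = re.compile(r"\[([U_]+)\]\s*$")
# Progress line printed while md is rebuilding or checking an array.
REBUILD = re.compile(r"\b(recovery|resync|reshape|check)\s*=\s*([\d.]+)%")
# Start of an array entry, e.g. "md0 : active raid1 ...".
ARRAY_HEADER = re.compile(r"^(md\S*)\s*:")


def md_array_health(mdstat_text):
    """Return {array: {"degraded": bool, "rebuilding": str or None}}.

    Hypothetical helper: "degraded" is set when the status map contains
    an underscore (a missing member), "rebuilding" holds the operation
    name if a progress line is present.
    """
    health = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_HEADER.match(line)
        if header:
            current = header.group(1)
            health[current] = {"degraded": False, "rebuilding": None}
            continue
        if current is None:
            continue
        status = STATUS_MAP.search(line)
        if status and "_" in status.group(1):
            health[current]["degraded"] = True
        rebuild = REBUILD.search(line)
        if rebuild:
            health[current]["rebuilding"] = rebuild.group(1)
    return health


# Made-up sample in the /proc/mdstat layout, NOT taken from the server above.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdc1[2] sdd1[0]
      976630464 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (123456/976630464) finish=127.5min speed=33440K/sec

unused devices: <none>
"""

print(md_array_health(SAMPLE_MDSTAT))
```

On a real system you would pass it the contents of /proc/mdstat, e.g. md_array_health(open("/proc/mdstat").read()).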

Thanks, this is helpful. The CPL figures make much more sense to me now. I'll look at the attached RAID. - Michael S, Dec 02 '14 at 21:41
