
My web server goes down every two days. It becomes unresponsive, and the only thing I can see on the console is the message below.

task **** blocked for more than 120 seconds.

The only way to bring it back up is to reset the server, which is not acceptable, and I would like to get this fixed.

My server is running in a VMware environment.

QUESTIONS:

  • Is the 14.89 LOAD AVERAGE my issue?
  • If not, what would be my next step to troubleshoot?

This is my crash log:

 KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.3.1.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2014-08-26-01:25:16/vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Tue Aug 26 01:25:11 2014
      UPTIME: 00:14:00
LOAD AVERAGE: 14.89, 5.57, 2.09
       TASKS: 207
    NODENAME: test01
     RELEASE: 2.6.32-431.3.1.el6.x86_64
     VERSION: #1 SMP Fri Jan 3 21:39:27 UTC 2014
     MACHINE: x86_64  (2933 Mhz)
      MEMORY: 2 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
         PID: 58
     COMMAND: "khungtaskd"
        TASK: ffff88007a4e9500  [THREAD_INFO: ffff88007a4ea000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

...

INFO: task vmtoolsd:1186 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.3.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vmtoolsd      D 0000000000000003     0  1186      1 0x00000000
 ffff88007bc39b98 0000000000000086 0000000000000000 ffff880037de6f28
 ffff88003793c800 ffff88007bc39c38 ffff88007bc39b38 ffffffffa00026d0
 ffff88007b847af8 ffff88007bc39fd8 000000000000fbc8 ffff88007b847af8
Call Trace:
 [<ffffffffa00026d0>] ? dm_unplug_all+0x50/0x70 [dm_mod]
 [<ffffffff8111f940>] ? sync_page+0x0/0x50
 [<ffffffff815280b3>] io_schedule+0x73/0xc0
 [<ffffffff8111f97d>] sync_page+0x3d/0x50
 [<ffffffff81528b7f>] __wait_on_bit+0x5f/0x90
 [<ffffffff8111fbb3>] wait_on_page_bit+0x73/0x80
 [<ffffffff8109b330>] ? wake_bit_function+0x0/0x50
 [<ffffffff81135c05>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111ffdb>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff8112009f>] filemap_fdatawait+0x2f/0x40
 [<ffffffff811206c4>] filemap_write_and_wait+0x44/0x60
 [<ffffffff811c5864>] __sync_blockdev+0x24/0x50
 [<ffffffff811c58a3>] sync_blockdev+0x13/0x20
 [<ffffffff811f0034>] sync_quota_sb+0x74/0x130
 [<ffffffff811bad1a>] __sync_filesystem+0x7a/0x90
 [<ffffffff811baf2b>] sync_filesystem+0x4b/0x70
 [<ffffffff811c61ee>] freeze_bdev+0xfe/0x280
 [<ffffffff8119de6b>] do_vfs_ioctl+0x1fb/0x580
 [<ffffffff8119e271>] sys_ioctl+0x81/0xa0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 58, comm: khungtaskd Not tainted 2.6.32-431.3.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81527213>] ? panic+0xa7/0x16f
 [<ffffffff810e6346>] ? watchdog+0x246/0x250
 [<ffffffff810e6100>] ? watchdog+0x0/0x250
 [<ffffffff8109af06>] ? kthread+0x96/0xa0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109ae70>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

sar details

# sar -r
Linux 2.6.32-431.3.1.el6.x86_64 (test01)        08/26/2014      _x86_64_        (4 CPU)

12:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
12:10:01 AM    287032   1634920     85.07    111308    937968    597956      9.88
12:20:01 AM    252300   1669652     86.87    113592    946640    633132     10.46
12:30:01 AM    229752   1692200     88.05    115680    958476    703876     11.63
12:40:01 AM    147420   1774532     92.33    121260   1045248    601400      9.94
12:50:02 AM    155808   1766144     91.89    125484   1037572    595232      9.84
01:00:01 AM    120928   1801024     93.71    128380   1043552    664652     10.98
Average:       198873   1723079     89.65    119284    994909    632708     10.46

01:11:22 AM       LINUX RESTART

01:26:05 AM       LINUX RESTART

01:40:45 AM       LINUX RESTART

01:55:30 AM       LINUX RESTART

02:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
02:10:01 AM    993480    928472     48.31     56524    463256    599784      9.91
02:20:01 AM    965424    956528     49.77     68212    478076    591716      9.78
02:30:01 AM    874868   1047084     54.48     78216    508992    657420     10.87

# sar -q
Linux 2.6.32-431.3.1.el6.x86_64 (test01)        08/26/2014      _x86_64_        (4 CPU)

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
12:10:01 AM         2       183      0.08      0.71      0.86
12:20:01 AM         2       185      0.03      0.11      0.44
12:30:01 AM         5       207      1.98      1.28      0.77
12:40:01 AM         2       183      0.01      0.69      0.84
12:50:02 AM         1       181      0.00      0.10      0.43
01:00:01 AM         5       202      1.99      1.33      0.80
Average:            3       190      0.68      0.70      0.69

01:11:22 AM       LINUX RESTART

01:26:05 AM       LINUX RESTART

01:40:45 AM       LINUX RESTART

01:55:30 AM       LINUX RESTART

02:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
02:10:01 AM         2       184      0.05      0.13      0.11
02:20:01 AM         1       184      0.06      0.05      0.07
02:30:01 AM         8       204      2.07      1.33      0.61
02:40:01 AM         4       187      0.01      0.65      0.71

# sar -d
Linux 2.6.32-431.3.1.el6.x86_64 (test01)        08/26/2014      _x86_64_        (4 CPU)

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM    dev8-0      8.96     60.88    141.18     22.54      0.24     27.08      3.74      3.35
12:10:01 AM   dev8-16      0.10      0.03      2.01     20.47      0.00     36.49     24.80      0.25
12:10:01 AM  dev253-0     21.05     60.68    143.19      9.69      0.58     27.56      1.60      3.36
12:10:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM    dev8-0      4.60     23.08     73.54     21.00      0.07     14.29      2.45      1.13
12:20:01 AM   dev8-16      0.06      0.01      1.08     17.73      0.00     29.70     13.68      0.08
12:20:01 AM  dev253-0      9.77     23.09     74.63     10.00      0.13     13.02      1.16      1.13
12:20:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:30:01 AM    dev8-0      3.13     32.93     41.52     23.80      0.03     10.57      3.58      1.12
12:30:01 AM   dev8-16      0.02      1.57      0.09     90.18      0.00     12.45     13.73      0.03
12:30:01 AM  dev253-0      6.01     34.93     41.61     12.75      0.06      9.46      1.89      1.14
12:30:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:40:01 AM    dev8-0     27.83    841.82    198.60     37.39      0.35     12.65      2.40      6.69
12:40:01 AM   dev8-16      2.46    202.41      2.21     83.16      0.01      5.28      4.29      1.06
12:40:01 AM  dev253-0     49.09   1043.80    200.80     25.35      3.62     73.59      1.53      7.52
12:40:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:50:02 AM    dev8-0      6.25     17.44    130.97     23.74      0.07     11.60      2.57      1.61
12:50:02 AM   dev8-16      0.63     56.52      3.12     94.54      0.01      8.11      4.74      0.30
12:50:02 AM  dev253-0     18.11     73.85    134.09     11.48      0.21     11.74      0.95      1.72
12:50:02 AM  dev253-1      0.01      0.11      0.00      8.00      0.00     13.88      1.75      0.00
01:00:01 AM    dev8-0      5.30     12.40     86.55     18.67      0.06     11.20      2.48      1.32
01:00:01 AM   dev8-16      0.09      0.23      2.36     27.57      0.00     22.68     11.36      0.11
01:00:01 AM  dev253-0     11.26     12.63     88.91      9.02      0.11      9.89      1.18      1.33
01:00:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:       dev8-0      9.31    163.72    111.88     29.59      0.14     14.72      2.71      2.53
Average:      dev8-16      0.56     43.25      1.81     80.72      0.00      7.72      5.41      0.30
Average:     dev253-0     19.16    206.91    113.70     16.73      0.78     40.65      1.40      2.69
Average:     dev253-1      0.00      0.02      0.00      8.00      0.00     13.88      1.75      0.00

01:11:22 AM       LINUX RESTART

01:26:05 AM       LINUX RESTART

01:40:45 AM       LINUX RESTART

01:55:30 AM       LINUX RESTART

02:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
02:10:01 AM    dev8-0      9.86    160.76     92.83     25.72      0.08      8.07      3.25      3.20
02:10:01 AM   dev8-16      1.03     43.08     21.27     62.61      0.01      8.14      4.74      0.49
02:10:01 AM  dev253-0     20.23    203.27    114.09     15.69      0.18      8.70      1.73      3.49
02:10:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
02:20:01 AM    dev8-0      4.76     74.76     51.09     26.42      0.03      6.83      3.51      1.67
masber
  • This looks to be running on a VM; if this is true, this information should be in your request. Remember, a LOT of VMs run on a single host, so other VMs could be the primary cause of your issue. – mdpc Aug 26 '14 at 05:48

1 Answer


The first thing I would do here is stop/disable the vmtoolsd daemon, and if you have a VMware subscription, please raise a case with them. Here it looks like the vmtoolsd daemon is trying to freeze the filesystem.
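For reference, on a RHEL/CentOS 6 guest that would look roughly like the lines below. The service name is an assumption here: the classic VMware Tools package installs a vmware-tools init script, while open-vm-tools ships a vmtoolsd service, so check which one your guest actually runs (e.g. chkconfig --list | grep -i vm) before disabling anything.

# Stop the daemon now (use the service name that exists on your guest)
service vmware-tools stop      # classic VMware Tools package
# service vmtoolsd stop        # open-vm-tools variant

# Keep it from starting at boot
chkconfig vmware-tools off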

These are further steps you can take and provide to the VMware team (a consolidated session is sketched after the steps):

On the crash prompt, run:

1: set 1186

2: bt

3: dis -rl freeze_bdev+0xfe, i.e. the place inside freeze_bdev() where the task is blocked (the 0xfe/0x280 offset comes from the backtrace above, where the block_device is passed to freeze_bdev())

4: You can also verify how long this process has been in the uninterruptible (D) state:

a: task -R last_ran

b: runq | grep "<CPU number obtained from the set command, i.e. the CPU this task last ran on>"

c: rq.timestamp_last_tick (read from the per-CPU runqueue for that CPU)

pd (value obtained in c - value obtained in a)/60 --> this will give you how long vmtoolsd has been in the D state
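Put together, a crash(8) session along these lines is what you would capture. The PID and the freeze_bdev+0xfe offset come from the backtrace above; the rq field name and the /60 arithmetic are taken from the steps as given, so verify the timestamp units on your exact kernel before trusting the result.

crash> set 1186                    # switch context to the hung vmtoolsd task; note its CPU
crash> bt                          # backtrace; should match the freeze_bdev() trace above
crash> dis -rl freeze_bdev+0xfe    # disassemble up to the point where the task blocks
crash> task -R last_ran            # (a) when this task last ran
crash> runq                        # find the runqueue address for the CPU noted above
crash> struct rq.timestamp_last_tick <rq-address>   # (c) current tick timestamp for that CPU
crash> pd (<value from c> - <value from a>)/60      # elapsed time vmtoolsd has spent in D state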

Also, on the OS side, one change you can make is to lower the values of vm.dirty_ratio and vm.dirty_background_ratio, as this will flush dirty pages more quickly.
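As a minimal sketch, assuming illustrative values of 10 and 5 (the answer does not prescribe specific numbers, so tune them for your workload):

# Apply immediately
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5

# Persist across reboots
echo "vm.dirty_ratio = 10" >> /etc/sysctl.conf
echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.conf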

But based on my experience, I would say the best approach is to disable vmtoolsd, as it has a known history of freezing filesystems, and raise a case with VMware in parallel to get their opinion.

Prashant Lakhera
  • Thank you @prashatlakhera, your comment made me think, and I found the issue: it was a combination of my backup solution and a snapshot on the VM. – masber Aug 26 '14 at 11:08