
I have a XenServer guest OS (Ubuntu 10.10) and an issue with copying a huge number of files from NFS to the local disk (cp -rp nfs:/dir /local/dir). After some time the local disk hangs and I can't perform any I/O operation. In the log files I see this:

Oct 28 19:13:21 ls0 kernel: [1947885.457070] INFO: task cp:3904 blocked for more than 120 seconds.
Oct 28 19:13:21 ls0 kernel: [1947885.457075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 28 19:13:21 ls0 kernel: [1947885.457081] cp            D ffff88002804dbc0     0  3904   2052 0x00000000
Oct 28 19:13:21 ls0 kernel: [1947885.457085]  ffff880273a91ae8 0000000000000286 0000000000015bc0 0000000000015bc0
Oct 28 19:13:21 ls0 kernel: [1947885.457090]  ffff880141bf4890 ffff880273a91fd8 0000000000015bc0 ffff880141bf44d0
Oct 28 19:13:21 ls0 kernel: [1947885.457095]  0000000000015bc0 ffff880273a91fd8 0000000000015bc0 ffff880141bf4890
Oct 28 19:13:21 ls0 kernel: [1947885.457099] Call Trace:
Oct 28 19:13:21 ls0 kernel: [1947885.457103]  [<ffffffff8121adcd>] do_get_write_access+0x31d/0x5e0
Oct 28 19:13:21 ls0 kernel: [1947885.457107]  [<ffffffff8100eb6d>] ? xen_force_evtchn_callback+0xd/0x10
Oct 28 19:13:21 ls0 kernel: [1947885.457110]  [<ffffffff8100f302>] ? check_events+0x12/0x20
Oct 28 19:13:21 ls0 kernel: [1947885.457113]  [<ffffffff810845c0>] ? wake_bit_function+0x0/0x40
Oct 28 19:13:21 ls0 kernel: [1947885.457117]  [<ffffffff8121b221>] jbd2_journal_get_write_access+0x31/0x50
Oct 28 19:13:21 ls0 kernel: [1947885.457121]  [<ffffffff81202388>] __ext4_journal_get_write_access+0x38/0x70
Oct 28 19:13:21 ls0 kernel: [1947885.457125]  [<ffffffff811d9004>] ext4_new_inode+0x234/0xb40
Oct 28 19:13:21 ls0 kernel: [1947885.457128]  [<ffffffff811f7908>] ? ext4_journal_start_sb+0xf8/0x130
Oct 28 19:13:21 ls0 kernel: [1947885.457132]  [<ffffffff811e6e40>] ext4_create+0xc0/0x150
Oct 28 19:13:21 ls0 kernel: [1947885.457137]  [<ffffffff8114ca63>] ? generic_permission+0x23/0xc0
Oct 28 19:13:21 ls0 kernel: [1947885.457141]  [<ffffffff8114e4f4>] vfs_create+0xb4/0xe0
Oct 28 19:13:21 ls0 kernel: [1947885.457144]  [<ffffffff8114e5e4>] __open_namei_create+0xc4/0x110
Oct 28 19:13:21 ls0 kernel: [1947885.457148]  [<ffffffff81151d8b>] do_filp_open+0xa6b/0xba0
Oct 28 19:13:21 ls0 kernel: [1947885.457162]  [<ffffffffa0084b3b>] ? nfs_attribute_timeout+0x1b/0x70 [nfs]
Oct 28 19:13:21 ls0 kernel: [1947885.457170]  [<ffffffffa0085fe6>] ? nfs_revalidate_inode+0x26/0x60 [nfs]
Oct 28 19:13:21 ls0 kernel: [1947885.457174]  [<ffffffff8114d80b>] ? getname+0x3b/0x240
Oct 28 19:13:21 ls0 kernel: [1947885.457178]  [<ffffffff8115d17a>] ? alloc_fd+0x10a/0x150
Oct 28 19:13:21 ls0 kernel: [1947885.457182]  [<ffffffff81140d99>] do_sys_open+0x69/0x170
Oct 28 19:13:21 ls0 kernel: [1947885.457185]  [<ffffffff81140ee0>] sys_open+0x20/0x30
Oct 28 19:13:21 ls0 kernel: [1947885.457189]  [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b

The task can be "flush-202", "jbd2/xvdc1-8" ... anything involved in this copy operation.

I tried changing the I/O scheduler (to deadline) and tuning vm.dirty_ratio and vm.dirty_background_ratio. Nothing helps.
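
For reference, the changes were along these lines (the sysctl values below are only illustrative, not necessarily the ones I actually used; xvdc is the local disk that also shows up in the iostat output further down):

~# echo deadline > /sys/block/xvdc/queue/scheduler
~# sysctl -w vm.dirty_ratio=10
~# sysctl -w vm.dirty_background_ratio=5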

Right now my server has a hung disk, so I can do some investigation:

~# grep -A 1 dirty /proc/vmstat 
nr_dirty 9598
nr_writeback 0

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 3221016 301536 5558968    0    0     0    26   25   19  0  0 88 12
 0  1      0 3221008 301536 5558968    0    0     0     0   47   22  0  0 85 15
 0  1      0 3221008 301536 5558968    0    0     0     0   17   13  0  0 63 37
 0  1      0 3221008 301536 5558968    0    0     0     0   16   17  0  0 100  0
 0  1      0 3221008 301536 5558968    0    0     0     0   34   28  0  0 100  0
 0  1      0 3221008 301536 5558968    0    0     0     0   15   12  0  0 100  0
 0  1      0 3221008 301536 5558968    0    0     0     0   14   14  0  0 39 61
 0  1      0 3221008 301536 5558968    0    0     0     0   21   22  0  0 100  0
 0  1      0 3221008 301536 5558968    0    0     0     0   15   16  0  0 100  0




~# iostat -xm 1| grep xvdc
xvdc              0.02   519.33    0.03   53.36     0.00     0.40    15.21    28.55    8.22  17.60  93.98
xvdc              0.00     0.00    0.00    0.00     0.00     0.00     0.00    30.00    0.00   0.00 100.00
xvdc              0.00     0.00    0.00    0.00     0.00     0.00     0.00    30.00    0.00   0.00 100.00
xvdc              0.00     0.00    0.00    0.00     0.00     0.00     0.00    30.00    0.00   0.00 100.00
xvdc              0.00     0.00    0.00    0.00     0.00     0.00     0.00    30.00    0.00   0.00 100.00
xvdc              0.00     0.00    0.00    0.00     0.00     0.00     0.00    30.00    0.00   0.00 100.00
xvdc              0.00     0.00    0.00    0.00     0.00     0.00     0.00    30.00    0.00   0.00 100.00
xvdc              0.00     0.00    0.00    0.00     0.00     0.00     0.00    30.00    0.00   0.00 100.00

And here is a memory graph: http://i.stack.imgur.com/zZaaI.png
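
While the disk is stuck like this, the stacks of all blocked (D-state) tasks can also be dumped to the kernel log for a broader picture (a generic diagnostic step, run as root):

~# echo w > /proc/sysrq-trigger    # ask the kernel to log every blocked task's stack
~# dmesg | tail -n 50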

A M

2 Answers


Does the copy operation hang around the same file/directory every time?

Have you tried using rsync instead of cp? This might sound stupid, but a couple of times I've managed to copy files in a similar situation with it. I don't know why cp can make everything stall...

Another culprit might be the ext4 + NFS combination.
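
Something along these lines should do the same job as your cp -rp (the paths are just placeholders for your NFS mount and the local target; the trailing slash on the source copies the contents of the directory rather than the directory itself):

~# rsync -a --progress /nfs/dir/ /local/dir/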

Janne Pikkarainen
  • >"Does the copy operation hang around the same file/directory every time?" mm , do you mean remote file/dir on NFS? – A M Oct 29 '10 at 08:13
  • "Have you tried to use rsync instead of cp? This might sound stupid, but couple of times I've managed to copy files in similar situation with it. " I need to do this faster , rsync will be very slow :( – A M Oct 29 '10 at 08:13
  • How is rsync slow? With the latest (1-3 years old...) 3.x generation it's actually quite fast. The third generation removed the old "I'm gonna generate the whole file list before transferring anything" annoyance, and as a bonus, if a transfer gets interrupted, rsync can nicely continue from the point where it was interrupted and won't copy everything again, like your copy command would. I'm using rsync to transfer 4.5 terabytes (around 35 million files) and it works just fine. :) – Janne Pikkarainen Oct 29 '10 at 08:19
  • OK, I'll try it, but I'm still interested in why the disk hangs with this simple operation – A M Oct 29 '10 at 09:20
  • +1 for recommending `rsync`. Would also recommend trying the `rsync` transfer without involving NFS (that is, either via ssh or via the rsync daemon). – Steven Monday Oct 29 '10 at 19:33
  • Same situation with rsync – A M Nov 04 '10 at 05:57
  • Did you try rsync over NFS or rsync over ssh/rsyncd? If even rsync over ssh or rsyncd dies (thus NFS can be taken out of the equation), there's something SERIOUSLY wrong. – Janne Pikkarainen Nov 04 '10 at 07:54
  • No, I've started rsync with xinetd – A M Nov 04 '10 at 11:38
  • Are there some possible corner cases in the directory structure? Something like a couple of huge files, or a huge directory with 4514123770 small files? A huge number of files by itself should not be a problem if the directory structure is reasonably balanced. – Janne Pikkarainen Nov 04 '10 at 11:42
  • On the remote server I have ext3 with ~100 GB of data consisting of small, very small files, and it works fine. I'm trying to migrate to the new Linux version with ext4. I guess the problem is in the Xen disk drivers or in ext4 support in the Xen guest... I'll try to create ext3 – A M Nov 04 '10 at 12:20
  • And I don't like the big amount of data sitting in buffers. The free command now shows 1235 MB – A M Nov 04 '10 at 14:18
  • Same with ext3. It's probably a disk driver issue... the disk hangs when the system has huge buffers (~3 GB) – A M Nov 04 '10 at 15:38
  • It was an Ubuntu+XenServer bug... I've tried CentOS and Debian without any problems. – A M Nov 05 '10 at 07:28
  • Wow. Might be worth reporting to Ubuntu Launchpad. – Janne Pikkarainen Nov 05 '10 at 07:34
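
For completeness, the "rsync without NFS" approach suggested in the comments above would look roughly like this (hostname, user and daemon module name are placeholders):

~# rsync -a --progress user@fileserver:/remote/dir/ /local/dir/     # over ssh
~# rsync -a --progress rsync://fileserver/module/dir/ /local/dir/   # via the rsync daemon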

It could be your hardware failing. Have you checked the SMART values of the drive? Do a local bonnie++ test, or just dd the disk to /dev/null.
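
For example (note that in a XenServer guest the virtual xvd device has no SMART data of its own, so smartctl has to be run on the host against the physical disk; /dev/sda below is just a placeholder for that disk):

~# smartctl -a /dev/sda                   # on the XenServer host: SMART health of the physical disk
~# dd if=/dev/xvdc of=/dev/null bs=1M     # in the guest: raw sequential read of the whole local disk
~# bonnie++ -d /local/dir -u root         # in the guest: local filesystem benchmark on the target directory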

Hubert Kario