Solaris 11 hangs randomly: need help to figure out the reason

Question

I've got Solaris 11 (+latest SRU) running on an HP DL385 G7 (attached to P2000 storage w. 30 disks; they are registered as separate RAID0 drives, but I'm using ZFS' raidz1), which is our file server. Every couple of days, the system freezes and needs to be restarted. There is nothing special in the logs and fmdump.

I ended up with a cron job dumping various statistics every 2 minutes to the hard drive which show that there is a load increase and memory decrease just before the crash:

$ grep load top.120512*

top.120512063601:last pid: 21751;  load avg:  0.61,  2.30,  2.93;  up 4+17:03:45        06:36:02
top.120512063800:last pid: 21765;  load avg:  0.27,  1.62,  2.59;  up 4+17:05:44        06:38:01
top.120512064000:last pid: 21779;  load avg:  0.29,  1.17,  2.30;  up 4+17:07:45        06:40:02
top.120512064200:last pid: 21793;  load avg:  0.56,  0.97,  2.09;  up 4+17:09:44        06:42:01
top.120512064400:last pid: 21807;  load avg:  0.20,  0.71,  1.85;  up 4+17:11:45        06:44:02
top.120512064600:last pid: 21821;  load avg:  0.60,  0.66,  1.68;  up 4+17:13:45        06:46:02
top.120512064800:last pid: 21835;  load avg:  1.25,  0.87,  1.64;  up 4+17:15:44        06:48:01
top.120512065000:last pid: 21851;  load avg:  4.77,  2.35,  2.10;  up 4+17:17:45        06:50:02
top.120512065200:last pid: 21864;  load avg:  5.10,  3.20,  2.45;  up 4+17:19:45        06:52:02
top.120512065400:last pid: 21878;  load avg:  5.81,  4.16,  2.91;  up 4+17:21:44        06:54:01
top.120512065601:last pid: 21892;  load avg:  5.26,  4.53,  3.20;  up 4+17:23:45        06:56:02
top.120512065800:last pid: 21906;  load avg:  5.36,  4.79,  3.46;  up 4+17:25:45        06:58:02

// here was the crash

top.120512163801:last pid:   701;  load avg:  1.18,  0.29,  0.10;  up 0+00:01:16        16:38:02
top.120512164000:last pid:  1456;  load avg:  0.36,  0.33,  0.14;  up 0+00:03:16        16:40:02
top.120512164200:last pid:  1470;  load avg:  0.14,  0.26,  0.14;  up 0+00:05:16        16:42:02
top.120512164400:last pid:  1499;  load avg:  0.39,  0.35,  0.19;  up 0+00:07:15        16:44:01
top.120512164600:last pid:  1513;  load avg:  0.10,  0.26,  0.17;  up 0+00:09:16        16:46:02

Or grep Memory:

top.120512064600:Memory: 16G phys mem, 2031M free mem, 2048M total swap, 2048M free swap
top.120512064800:Memory: 16G phys mem, 2047M free mem, 2048M total swap, 2048M free swap
top.120512065000:Memory: 16G phys mem, 1443M free mem, 2048M total swap, 2048M free swap
top.120512065200:Memory: 16G phys mem, 1313M free mem, 2048M total swap, 2048M free swap
top.120512065400:Memory: 16G phys mem, 892M free mem, 2048M total swap, 2048M free swap
top.120512065601:Memory: 16G phys mem, 418M free mem, 2048M total swap, 2048M free swap
top.120512065800:Memory: 16G phys mem, 294M free mem, 2048M total swap, 2044M free swap

// restart

top.120512163801:Memory: 16G phys mem, 14G free mem, 2048M total swap, 2048M free swap

or grep trap:

top.120512064800:Kernel: 50542 ctxsw, 13 trap, 113144 intr, 850 syscall, 9 flt    
top.120512065000:Kernel: 76357 ctxsw, 9 trap, 199203 intr, 399 syscall, 9 flt   
top.120512065200:Kernel: 72294 ctxsw, 13 trap, 254779 intr, 481 syscall, 9 flt    
top.120512065400:Kernel: 87671 ctxsw, 11 trap, 256663 intr, 401 syscall, 11 flt    
top.120512065601:Kernel: 72696 ctxsw, 11 trap, 281765 intr, 402 syscall, 11 flt    
top.120512065800:Kernel: 77316 ctxsw, 458 trap, 272329 intr, 412 syscall, 450 flt    
// restarted here    
top.120512163801:Kernel: 1570 ctxsw, 10 trap, 2380 intr, 1741 syscall, 9 flt

this one is from echo "::memstat" | mdb -k:

top.120512064800:ZFS File Data             2898132             11320   69%    
top.120512065000:ZFS File Data             3039466             11872   73%    
top.120512065200:ZFS File Data             3081508             12037   74%    
top.120512065400:ZFS File Data             3188175             12453   76%    
top.120512065601:ZFS File Data             3309405             12927   79%    
top.120512065800:ZFS File Data             3393392             13255   81%    
// restart    
top.120512163801:ZFS File Data               70094               273    2%    
top.120512164000:ZFS File Data               93547               365    2%    
top.120512164200:ZFS File Data              197571               771    5%    
top.120512164400:ZFS File Data             1175965              4593   28%    
top.120512164600:ZFS File Data             1205865              4710   29%    
top.120512164800:ZFS File Data             2537072              9910   61%

The ZFS pool is not corrupted, the actual load is below the average (compared to our other file servers), the hardware seems to be ok, too.

What do you think what might be the reason for this behavior? What other statistics do I need to collect?

Edit:

$ zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Wed Apr 25 14:40:49 2012
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c3t0d0s0  ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: volume
 state: ONLINE
  scan: resilvered 285G in 2h57m with 0 errors on Mon May  7 22:01:38 2012
config:

        NAME                                       STATE     READ WRITE CKSUM
        volume                                     ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            c0t600C0FF00012FBB1F749674F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DDDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1DEA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D7CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1EAA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DEBA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1F0A1154F01000000d0  ONLINE       0     0     0
          raidz1-1                                 ONLINE       0     0     0
            c0t600C0FF00012FC7DFCA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1FDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D08A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB109A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D14A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB115A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D20A2154F01000000d0  ONLINE       0     0     0
          raidz1-2                                 ONLINE       0     0     0
            c0t600C0FF00012FBB171A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D2CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB12DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D38A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB139A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D44A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DA3CA754F01000000d0  ONLINE       0     0     0
          raidz1-3                                 ONLINE       0     0     0
            c0t600C0FF00012FC7D50A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB151A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D5CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB15DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D68A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB169A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D70A2154F01000000d0  ONLINE       0     0     0
        spares
          c0t600C0FF00012FBB1D7A1154F01000000d0    AVAIL   
          c0t600C0FF000131E9277AD154F01000000d0    AVAIL   

errors: No known data errors

$ zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     24.4G   249G  39.5K  /rpool
rpool/ROOT                14.1G   249G    31K  legacy
rpool/ROOT/solaris        5.59M   249G  11.5G  /
rpool/ROOT/solaris-1      14.1G   249G  11.5G  /
rpool/ROOT/solaris-1/var  2.15G   249G  1.94G  /var
rpool/ROOT/solaris/var    2.71M   249G  1.29G  /var
rpool/dump                8.24G   250G  7.98G  -
rpool/export                63K   249G    32K  /export
rpool/export/home           31K   249G    31K  /export/home
rpool/swap                2.06G   249G  2.00G  -
volume                    8.77T  33.6T  6.77T  /volume
volume/gluster            33.5G  1.97T  33.5G  /volume/gluster

Edit 2

Here is a diff of various statistics: http://diffchecker.com/k19ZP458 (left: "normal" system state, right: just a minute before the crash)

Please post your `zpool status -v` and `zfs list`. Also relevant details will help... DL385 generation? How are you connected to the P2000 (fibre, SAS, iSCSI)? What's the disk layout of the P2000? What HBA cards are installed in the DL385? Do you have HP management agents installed? Is the ILO configured? — ewwhite, May 12 '12 at 16:05
Please find some more details in the initial post. Yes, ILO is configured but doesn't tell anything about the hangs. — Pavel, May 12 '12 at 16:14
The ILO should provide some on-screen detail at the point of the crash. What's on the screen at the time you need to reboot? Also, that probably means that you're not running any HP agents on the hardware... Those agents would be helpful in trapping errors like this. Solaris 11 isn't technically supported on ProLiant systems yet. — ewwhite, May 12 '12 at 16:19
This is hard to understand. Are you using a P2000 SAN or just the basic storage shelves? You said you configured multiple RAID 0 arrays, so it's a bit confusing... — ewwhite, May 14 '12 at 14:24

score 2 · Answer 1 · edited Apr 13 '17 at 12:14

You haven't provided all of the details that would be helpful in diagnosing this problem.

How long have these crashes been occurring? (e.g. Is this a new installation or not?)
Could you show the output of zpool history or explain if you're using compression or deduplication in your storage pool?
How are you connected to the P2000? (fibre, SAS, iSCSI)
What HBA cards are installed in the DL385 G7?
What is the disk layout of the P2000? (it's a SAN, so it's not ideal for a ZFS solution)
Do you any HP Management Agents for Solaris installed?
Is your system firmware up-to-date?
Is the ILO configured? Have you examined its log? How is the health of the RAM?
Do you have the Automatic Server Recovery watchdog configured in the BIOS? That would trigger a restart after the system crashes. It would also help determine if this is a hardware or software issue.

So I ask about when these issues started because if this is a new installation, you have some options. From the look of your disk layout, this is intended to be a hefty storage system based on ZFS. There are several red flags in the setup, though...

One, you're exposing multiple vdisks from your SAN to ZFS. Basically, you have 30 individual RAID0 arrays defined in the P2000 SAN and are presenting those to Solaris. If you lose a disk, you'll need a reboot to recognize the new device.

Two, the choice of OS may be a problem as HP does not certify or fully support Solaris 11 on ProLiant systems yet. If this is purely a storage unit and you're not running any Solaris-specific software, NexentaStor would be a safe solution that supports the server hardware. I build most of my ZFS storage solutions on HP hardware. Even OpenIndiana would be a bit easier to support.

But if you need to troubleshot the real crashes, we'll need to know what's happening on the system. There could potentially be logs leading up to the crash. You may also have core files that could be of use. The method of connection to the SAN matters a bit, too, since I've seen odd NIC issues with Solaris and HP/Broadcom devices. That said, I'm betting this is networking related...

Thank you for your detailed answer! (1) it's a new installation, the crashes occur from the very beginning (2) connection to P2000 via SAS (what do you mean by layout?) (3) HBA: [HP SC08e](http://h18004.www1.hp.com/products/servers/proliantstorage/adapters/sc08e/index.html) (4) no Management Agents installed (5) firmware is up-to-date, ILO configured (but nothing in the logs) (6) RAM ok (tested with memtest from some live-CD) (7) ASR is enabled but it's been never triggered — Pavel, May 14 '12 at 12:51
By layout, I want to know where the disks are located. The P2000 SAS is usually 24 or 12 disks with the ability to cascade additional D2600 or D2700 shelves. Is that what you have? — ewwhite, May 14 '12 at 12:55
As for your other comments: (A) I already tried to hotswap a disk and it worked pretty well, ZFS marked a disk as degraded and added a hot-spare. I was able to detach the disk, replace it and register as a new hot-spare, so I don't think this should be a problem in our setup. (B) well, I hope the support of Solaris 11 is a matter of time. Anyway, switching to something else will be necessary, if I don't get behind these crashes. (C) There are no core dumps as there is no kernel panic :(. The network interface used is qlcnic (firmware v. 4.8.20) — Pavel, May 14 '12 at 13:07
In P2000, there is a cascade of 3 LFF blocks with 10 disks each. Is it a precise answer? — Pavel, May 14 '12 at 13:11
ZFS should really handle the individual disks. You have the equipment necessary to do this. Use your existing SAS HBAs and run them to your external shelves, bypassing the the P2000 SAN head. You could also consider SAS multipath (MPxIO) if you're using dual-ported drives. As for the version of software, NexentaStor is definitely known to work with this hardware and is a bit more straightforward. — ewwhite, May 14 '12 at 13:27

Solaris 11 hangs randomly: need help to figure out the reason

1 Answers1