I've got Solaris 11 (with the latest SRU) running on an HP DL385 G7 attached to a P2000 storage array with 30 disks. The disks are presented as individual single-disk RAID0 volumes, and ZFS raidz1 runs on top of them. The machine is our file server. Every couple of days the system freezes and has to be restarted, and there is nothing unusual in the logs or in fmdump.
I ended up with a cron job that dumps various statistics to disk every two minutes; the dumps show a load increase and a drop in free memory just before each crash.
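For reference, the collector is essentially the following sketch (the output directory and the exact top flags are illustrative; my real script differs only in details):

#!/bin/sh
# Stat collector, run from cron every two minutes. Solaris cron has no
# */2 step syntax, so the crontab entry lists the minutes explicitly:
#   0,2,4,...,58 * * * * /path/to/collect.sh
DIR=/var/tmp/stats                   # illustrative output directory
OUT=$DIR/top.`date +%y%m%d%H%M%S`

top -b -d1 >> $OUT                   # one batch-mode snapshot: load, memory, kernel lines
echo "::memstat" | mdb -k >> $OUT    # kernel page usage, incl. ZFS File Data

Grepping the dumps for the load average: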
$ grep load top.120512*
top.120512063601:last pid: 21751; load avg: 0.61, 2.30, 2.93; up 4+17:03:45 06:36:02
top.120512063800:last pid: 21765; load avg: 0.27, 1.62, 2.59; up 4+17:05:44 06:38:01
top.120512064000:last pid: 21779; load avg: 0.29, 1.17, 2.30; up 4+17:07:45 06:40:02
top.120512064200:last pid: 21793; load avg: 0.56, 0.97, 2.09; up 4+17:09:44 06:42:01
top.120512064400:last pid: 21807; load avg: 0.20, 0.71, 1.85; up 4+17:11:45 06:44:02
top.120512064600:last pid: 21821; load avg: 0.60, 0.66, 1.68; up 4+17:13:45 06:46:02
top.120512064800:last pid: 21835; load avg: 1.25, 0.87, 1.64; up 4+17:15:44 06:48:01
top.120512065000:last pid: 21851; load avg: 4.77, 2.35, 2.10; up 4+17:17:45 06:50:02
top.120512065200:last pid: 21864; load avg: 5.10, 3.20, 2.45; up 4+17:19:45 06:52:02
top.120512065400:last pid: 21878; load avg: 5.81, 4.16, 2.91; up 4+17:21:44 06:54:01
top.120512065601:last pid: 21892; load avg: 5.26, 4.53, 3.20; up 4+17:23:45 06:56:02
top.120512065800:last pid: 21906; load avg: 5.36, 4.79, 3.46; up 4+17:25:45 06:58:02
// here was the crash
top.120512163801:last pid: 701; load avg: 1.18, 0.29, 0.10; up 0+00:01:16 16:38:02
top.120512164000:last pid: 1456; load avg: 0.36, 0.33, 0.14; up 0+00:03:16 16:40:02
top.120512164200:last pid: 1470; load avg: 0.14, 0.26, 0.14; up 0+00:05:16 16:42:02
top.120512164400:last pid: 1499; load avg: 0.39, 0.35, 0.19; up 0+00:07:15 16:44:01
top.120512164600:last pid: 1513; load avg: 0.10, 0.26, 0.17; up 0+00:09:16 16:46:02
Or grep Memory:
$ grep Memory top.120512*
top.120512064600:Memory: 16G phys mem, 2031M free mem, 2048M total swap, 2048M free swap
top.120512064800:Memory: 16G phys mem, 2047M free mem, 2048M total swap, 2048M free swap
top.120512065000:Memory: 16G phys mem, 1443M free mem, 2048M total swap, 2048M free swap
top.120512065200:Memory: 16G phys mem, 1313M free mem, 2048M total swap, 2048M free swap
top.120512065400:Memory: 16G phys mem, 892M free mem, 2048M total swap, 2048M free swap
top.120512065601:Memory: 16G phys mem, 418M free mem, 2048M total swap, 2048M free swap
top.120512065800:Memory: 16G phys mem, 294M free mem, 2048M total swap, 2044M free swap
// restart
top.120512163801:Memory: 16G phys mem, 14G free mem, 2048M total swap, 2048M free swap
Or grep trap:
$ grep trap top.120512*
top.120512064800:Kernel: 50542 ctxsw, 13 trap, 113144 intr, 850 syscall, 9 flt
top.120512065000:Kernel: 76357 ctxsw, 9 trap, 199203 intr, 399 syscall, 9 flt
top.120512065200:Kernel: 72294 ctxsw, 13 trap, 254779 intr, 481 syscall, 9 flt
top.120512065400:Kernel: 87671 ctxsw, 11 trap, 256663 intr, 401 syscall, 11 flt
top.120512065601:Kernel: 72696 ctxsw, 11 trap, 281765 intr, 402 syscall, 11 flt
top.120512065800:Kernel: 77316 ctxsw, 458 trap, 272329 intr, 412 syscall, 450 flt
// restarted here
top.120512163801:Kernel: 1570 ctxsw, 10 trap, 2380 intr, 1741 syscall, 9 flt
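The interrupt rate roughly doubles in the minutes before the freeze (from ~113k to ~280k per sample), so I'm considering adding per-device interrupt statistics to the same dump; a minimal addition, assuming the stock intrstat and mpstat utilities:

intrstat 5 1 >> $OUT    # per-device interrupt counts over one 5-second interval (needs root)
mpstat 5 1 >> $OUT      # per-CPU interrupts, context switches, syscalls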
This one is from echo "::memstat" | mdb -k (the columns are pages, megabytes, and percent of total):
$ grep "ZFS File Data" top.120512*
top.120512064800:ZFS File Data 2898132 11320 69%
top.120512065000:ZFS File Data 3039466 11872 73%
top.120512065200:ZFS File Data 3081508 12037 74%
top.120512065400:ZFS File Data 3188175 12453 76%
top.120512065601:ZFS File Data 3309405 12927 79%
top.120512065800:ZFS File Data 3393392 13255 81%
// restart
top.120512163801:ZFS File Data 70094 273 2%
top.120512164000:ZFS File Data 93547 365 2%
top.120512164200:ZFS File Data 197571 771 5%
top.120512164400:ZFS File Data 1175965 4593 28%
top.120512164600:ZFS File Data 1205865 4710 29%
top.120512164800:ZFS File Data 2537072 9910 61%
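Since ZFS File Data climbs past 80% of RAM right before each crash, I've also started recording ARC statistics in the same dump; a sketch using the standard arcstats kstat:

kstat -p zfs:0:arcstats >> $OUT                       # full ARC stats (size, c, c_max, hit/miss counters)
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max     # or just current size vs. maximum target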
The ZFS pools are not corrupted, the actual load is lower than on our other file servers, and the hardware seems to be fine, too.
What do you think might be the reason for this behavior? What other statistics should I collect?
Edit:
$ zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Wed Apr 25 14:40:49 2012
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c3t0d0s0  ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0

errors: No known data errors
  pool: volume
 state: ONLINE
  scan: resilvered 285G in 2h57m with 0 errors on Mon May 7 22:01:38 2012
config:

        NAME                                       STATE     READ WRITE CKSUM
        volume                                     ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            c0t600C0FF00012FBB1F749674F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DDDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1DEA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D7CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1EAA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DEBA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1F0A1154F01000000d0  ONLINE       0     0     0
          raidz1-1                                 ONLINE       0     0     0
            c0t600C0FF00012FC7DFCA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1FDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D08A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB109A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D14A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB115A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D20A2154F01000000d0  ONLINE       0     0     0
          raidz1-2                                 ONLINE       0     0     0
            c0t600C0FF00012FBB171A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D2CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB12DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D38A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB139A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D44A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DA3CA754F01000000d0  ONLINE       0     0     0
          raidz1-3                                 ONLINE       0     0     0
            c0t600C0FF00012FC7D50A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB151A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D5CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB15DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D68A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB169A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D70A2154F01000000d0  ONLINE       0     0     0
        spares
          c0t600C0FF00012FBB1D7A1154F01000000d0    AVAIL
          c0t600C0FF000131E9277AD154F01000000d0    AVAIL

errors: No known data errors
$ zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     24.4G   249G  39.5K  /rpool
rpool/ROOT                14.1G   249G    31K  legacy
rpool/ROOT/solaris        5.59M   249G  11.5G  /
rpool/ROOT/solaris-1      14.1G   249G  11.5G  /
rpool/ROOT/solaris-1/var  2.15G   249G  1.94G  /var
rpool/ROOT/solaris/var    2.71M   249G  1.29G  /var
rpool/dump                8.24G   250G  7.98G  -
rpool/export                63K   249G    32K  /export
rpool/export/home           31K   249G    31K  /export/home
rpool/swap                2.06G   249G  2.00G  -
volume                    8.77T  33.6T  6.77T  /volume
volume/gluster            33.5G  1.97T  33.5G  /volume/gluster
Edit 2:
Here is a diff of various statistics: http://diffchecker.com/k19ZP458 (left: "normal" system state, right: just a minute before the crash)