
I have an x4540 Sun storage server running NexentaStor Enterprise. It serves NFS over 10GbE (CX4) to several VMware vSphere hosts, with about 30 virtual machines running.

For the past few weeks, I've had random crashes spaced 10-14 days apart. This system used to run OpenSolaris and was stable in that configuration. The crashes trigger the automated system recovery feature on the hardware, forcing a hard system reset.

Here's the output from the mdb debugger:

panic[cpu5]/thread=ffffff003fefbc60: 
Deadlock: cycle in blocking chain


ffffff003fefb570 genunix:turnstile_block+795 ()
ffffff003fefb5d0 unix:mutex_vector_enter+261 ()
ffffff003fefb630 zfs:dbuf_find+5d ()
ffffff003fefb6c0 zfs:dbuf_hold_impl+59 ()
ffffff003fefb700 zfs:dbuf_hold+2e ()
ffffff003fefb780 zfs:dmu_buf_hold+8e ()
ffffff003fefb820 zfs:zap_lockdir+6d ()
ffffff003fefb8b0 zfs:zap_update+5b ()
ffffff003fefb930 zfs:zap_increment+9b ()
ffffff003fefb9b0 zfs:zap_increment_int+68 ()
ffffff003fefba10 zfs:do_userquota_update+8a ()
ffffff003fefba70 zfs:dmu_objset_do_userquota_updates+de ()
ffffff003fefbaf0 zfs:dsl_pool_sync+112 ()
ffffff003fefbba0 zfs:spa_sync+37b ()
ffffff003fefbc40 zfs:txg_sync_thread+247 ()
ffffff003fefbc50 unix:thread_start+8 ()

Any ideas what this means?

Additional information: I don't believe I have any quotas enabled on the filesystems or at a per-user level. The folder listing is below, with the raw zfs equivalent sketched after it.

========== Volumes and Folders ===========
NAME                    USED    AVAIL   REFER  MOUNTED QUOTA  DEDUP COMPRESS
syspool/rootfs-nmu-000  9.84G   195G    3.84G  yes     none   off   off
syspool/rootfs-nmu-001  79.5K   195G    1.16G  no      none   off   off
syspool/rootfs-nmu-002  89.5K   195G    2.05G  no      none   off   off
syspool/rootfs-nmu-003  82.5K   195G    6.30G  no      none   off   off
vol1/AueXXXch           33.9G   1.28T   23.3G  yes     none   on    on
vol1/CXXXG              8.72G   1.28T   6.22G  yes     none   on    on
vol1/CoaXXXuce          97.8G   1.28T   61.4G  yes     none   on    on
vol1/HXXXco             58.1G   1.28T   41.1G  yes     none   off   on
vol1/HXXXen             203G    1.28T   90.0G  yes     none   off   on
vol1/HXXXny             9.65G   1.28T   8.48G  yes     none   off   on
vol1/InXXXuit           2.03G   1.28T   2.03G  yes     none   off   on
vol1/MiXXXary           196G    1.28T   105G   yes     none   off   on
vol1/RoXXXer            45.5G   1.28T   28.7G  yes     none   off   on
vol1/TudXXXanch         6.06G   1.28T   4.54G  yes     none   off   on
vol1/aXXXa              774M    1.28T   774M   yes     none   off   off
vol1/ewXXXte            46.4G   1.28T   46.4G  yes     none   on    on
vol1/foXXXce            774M    1.28T   774M   yes     none   off   off
vol1/saXXXe             69K     1.28T   31K    yes     none   off   on
vol1/vXXXre             72.4G   1.28T   72.4G  yes     none   off   on
vol1/xXXXp              29.0G   1.28T   18.6G  yes     none   off   on
vol1/xXXXt              100G    1.28T   52.4G  yes     none   off   on
vol2/AuXXXch            22.9G   2.31T   22.9G  yes     none   on    on
vol2/FamXXXree          310G    2.31T   230G   yes     none   off   on
vol2/LAXXXty            605G    2.31T   298G   yes     none   off   on
vol2/McXXXney           147G    2.31T   40.3G  yes     none   off   on
vol2/MoXXXri            96.8G   2.31T   32.6G  yes     none   off   on
vol2/TXXXta             676G    2.31T   279G   yes     none   off   on
vol2/VXXXey             210G    2.31T   139G   yes     none   off   on
vol2/vmXXXe2            2.69G   2.31T   2.69G  yes     none   off   on
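
For reference, the same information can be pulled from a raw (expert-mode) shell with the stock ZFS tools. A minimal sketch, assuming the pool names shown above:

# list the datasets with the same columns as the NMC folder view
zfs list -r -o name,used,avail,refer,mounted,quota,dedup,compression syspool vol1 vol2
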
ewwhite

3 Answers


I know nothing about this setup, but

ffffff003fefb820 zfs:zap_lockdir+6d () seems to indicate that the worker thread is locking the directory, and then mutex_vector_enter tries to take the same lock.

This all seems to stem from a code path that begins with a quota update. If it's possible, you might want to consider turning quotas off if they are unnecessary.

It's only a workaround rather than a fix, and I have no idea if it'll work as expected, but it might be worth a try.
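
If you do try it, clearing quotas from a raw shell would look something like the following. This is only a sketch: the dataset name and the user name are placeholders, and NMC has its own front-end for the same properties.

# clear the filesystem-level quota on a dataset (placeholder dataset name)
zfs set quota=none vol1/somefolder
# clear a per-user quota, if one turns out to be set (placeholder user name)
zfs set userquota@someuser=none vol1/somefolder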

Matthew Ife
  • They are definitely unnecessary in this case. I'm not using this as anything but a VMware storage solution. I'll dig in and see if something indicates that quotas are on by default. – ewwhite Mar 19 '11 at 23:54

The stack trace references "userquota", which is not typically used by our customers. Note that it is separate from the filesystem quotas that you can also set. I encourage you to turn off user quotas if you can, especially since you think they are unnecessary. I also encourage you to file a support ticket if you have a support contract; this can be sent from the Web GUI, which will include diagnostics from your system in the ticket.
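
From a raw shell, a quick way to see whether any per-user quotas exist is the zfs userspace listing. A minimal sketch, using one of the datasets from your listing as an example and a placeholder user name:

# show per-user space accounting and any userquota values on one dataset
zfs userspace vol1/vXXXre
# query one user's quota directly (user name is a placeholder)
zfs get userquota@someuser vol1/vXXXre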

Brad
  • How do I check if userquota is set somewhere? I've never done this explicitly. – ewwhite Mar 21 '11 at 05:58
  • There are some examples in the user guide located at http://www.nexenta.com/corp/static/docs-stable/NexentaStor-UserGuide.pdf. In NMC you can use the command "setup folder property userquota" then enter the user name, then enter "none" to remove the quota. If you aren't sure if the user has a quota you can first do "show folder property userquota". – Brad Mar 25 '11 at 00:02
  • I updated my original comment a few days ago with the output showing that the userquota parameter is "none" on all filesystems. – ewwhite Mar 25 '11 at 02:06
  • Perhaps this is obvious, but there is a difference between (a) creating a folder for each user and assigning it a quota, and (b) creating a folder and assigning a userquota. The above output seems to be indicating that you are not doing (a). But I am trying to verify that you are not doing (b). – Brad Mar 25 '11 at 05:11
  • `nmc@xx:/vol1/xt$ show folder vol1/xt property userquota SystemCallError: bad property list: invalid property 'userquota'` – ewwhite Mar 25 '11 at 12:18
  • I never assigned quotas to any folders, and I don't have individual users on this system. It's purely VMware storage via NFS, with a couple of iSCSI LUNs. There's no file-level access outside of that. – ewwhite Mar 25 '11 at 12:20

This was resolved permanently by recreating all of the zpools under Nexenta. The pools carried a lot of baggage from having been imported from an OpenSolaris installation. Although I had imported and upgraded the pools and filesystems, stability didn't return until everything was rebuilt from scratch.
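
For anyone hitting the same thing, the rebuild was essentially a migrate-and-recreate. A rough sketch of that kind of procedure, where the new pool name, disk layout and snapshot name are placeholders rather than my exact commands:

# create a fresh pool on spare disks (device names and layout are illustrative)
zpool create newvol1 raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
# take a recursive snapshot of the old pool and replicate everything to the new pool
zfs snapshot -r vol1@migrate
zfs send -R vol1@migrate | zfs receive -Fd newvol1
# after verifying the copy, the old pool can be destroyed and the new one renamed and re-shared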

ewwhite