9

Situation:

The following strange problem has occurred on a single file server running OmniOS r151018 (95eaa7e) serving files over SMB to Windows and OS X guests.

Saving certain files (.docx, .xlsx, some images) through the "Save as..." dialog window on a SMB share results in a lag of about 3 to 5 seconds, where the application does not respond at all, afterwards the file is saved normally.

The problem did occur "over night", without doing anything to the server, but it is difficult to pinpoint the exact date, as user complaints only came in some time after the first occurrence. After a reboot of the server, one vdev of the mirrored root pool was unavailable, but closer inspection did not find any faults on the device and it was reattached to the pool. The problem still persists.

Some observations:

  • It happens on all Windows 7 clients
  • It happens for all file sizes
  • It happens on all shares of this machine, regardless of permissions
  • It happens for faster storage imported on the host over iSCSI from another server
  • Normal copy speed is 110 MB/sec over GBit Ethernet
  • Data and root pool seem to be fine
  • It does not happen on other file servers
  • It does not happen when the file is saved locally, then copied over through explorer
  • It does not happen on OS X (could only test it with OpenOffice)
  • dmesg shows several counts of NOTICE: bge0: interrupt: flags 0x0 - not updated? with various values, but this was also the case before and did no harm

Further ideads/plans:

As there is no clear error message to be found, I might need to do some trial and error searching for the cause. Some things I will consider (results are in italics):

  • Replace the Broadcom network card with an Intel card => did not make a difference
  • Replace the root pool with SATA SSDs (currently SLC memory USB sticks which worked fine for over 3 years) => did not make a difference
  • Check the network in between (hardware, by connection directly to the server)
  • Traffic capture with WireShark: difficult if you don't know what you are looking for exactly
  • Revert to a previous OmniOS boot environment/version to rule out software conflicts => did not make a difference
  • Roll back Windows/Office updates to rule out bugs
  • Remove files with : (colon) in filenames from snapshots, suggestion by txgsync on the reddit thread created by ewwhite => did not make a difference

    I've seen something similar to this when the Windows "previous versions" feature is enabled with automatic snapshots that include a ":" character. Just shooting at the wind with this, but may be worth a look as the ":" character is not allowed in Windows file names.

  • Monitoring of file access: as suggested by shodanshok, I used DTrace and this script to monitor file access. I used it while saving the alread open file, removed unrelated output and personal information, and the result centers around three files:

    CPU ID       FUNCTION:NAME
    1   18753    fop_open:entry Open: Workbook
    0   18181 fop_create:return Create: temp_1
    0   18753    fop_open:entry Open: temp_1
    0   18753    fop_open:entry Open: Workbook
    0   18753    fop_open:entry Open: Workbook
    0   18753    fop_open:entry Open: temp_1
    0   18888  fop_rename:entry Rename: Workbook -> temp_2
    0   18888  fop_rename:entry Rename: temp_1 -> Workbook
    0   18753    fop_open:entry Open: Workbook
    0   18753    fop_open:entry Open: temp_2
    0   18892  fop_remove:entry Remove: temp_2
    0   18753    fop_open:entry Open: Workbook
    0   18753    fop_open:entry Open: Workbook
    

    The same procedure on another server where the problem does not occur yields a similar result:

    CPU ID       FUNCTION:NAME
    1   25182 fop_create:return Create: temp_1
    1   25750    fop_open:entry Open: temp_1
    1   25750    fop_open:entry Open: Workbook
    1   25750    fop_open:entry Open: temp_1
    1   25750    fop_open:entry Open: Workbook
    1   25750    fop_open:entry Open: temp_1
    1   25889  fop_rename:entry Rename: Workbook -> temp_2
    1   25889  fop_rename:entry Rename: temp_1 -> Workbook
    1   25750    fop_open:entry Open: Workbook
    1   25750    fop_open:entry Open: temp_2
    1   25893  fop_remove:entry Remove: temp_2
    1   25750    fop_open:entry Open: Workbook
    1   25750    fop_open:entry Open: Workbook
    1   25750    fop_open:entry Open: Workbook
    

    I also added timestamps (walltimestamp) to the script, but in both cases all file operations take place at the same second. => did not make a difference

  • Import disks on another host to check if pool fragmentation or disks are faulty => did not make a difference
  • Move data and root pool over to identical machine to rule out cabling, mainboard etc. => problem does persist, so must be either the root pool (software) or a specific hardware that is incompatible with the software (or did suddenly become incompatible...)

Could you suggest anything else that be be the cause of this behavior? Or did you experience something similar? because I could not find anything helpful online, I suspect it is either a strange hardware problem (because it is limited to one machine) or a problem with Windows/Office.

user121391
  • 2,452
  • 12
  • 31
  • See: https://www.reddit.com/r/zfs/comments/4zo682/omnioszfswindows_7_save_as_from_within/d70t2eb – ewwhite Aug 29 '16 at 04:58
  • @ewwhite Thanks for creating the thread! The suggestion was really interesting, as indeed some snapshots on shares had perl files that were copied over from a unix machine, but removing them and the snapshots did not change the behaviour. – user121391 Aug 30 '16 at 07:35
  • When saving file on a share, Office has a peculiar behavior: it first create an empty file, then delete it, finally it re-create and save the file with your data. If one of these steps take longer than expected, the "Save as" window is effectively frozen. Does your system have some facilities to track file-level access? Can you debug the SMB session? Are you using something similar to Samba or ZFS integrated SMB server? – shodanshok Aug 30 '16 at 08:01
  • @shodanshok Thanks for your suggestion, see my updated question for results. I don't know why the order is slightly off, but the timestamps appear to be similar on both machines. Regarding your question, it is the integrated Solaris/illumos CIFS server, which is currently using SMB 2.1 IIRC. – user121391 Aug 30 '16 at 09:08
  • Maybe the anti-virus program at the Windows 7 clients is causing the stall? – Janne Pikkarainen Aug 30 '16 at 09:23
  • @JannePikkarainen Good idea, but that was also not the case. Aside from that, the file was created locally and saved on both servers from the same client (one with the problem, one without). My gut feeling is still on hardware failure, but hardware checks take more time unfortunately. – user121391 Aug 30 '16 at 09:28

1 Answers1

4

Solution:

The problem only affects OmniOS r151018, not previous versions. This thread on the omnios-discuss mailing list was exactly about my problem, quote from Geoff:

I saw a similar thread with Nexenta forum. There seems to be an issue with opslock. I disabled opslock and we are good now.

svccfg -s network/smb/server setprop smbd/oplock_enable=false

Not sure why this isn't biting more people.

So, biteCount++; I guess. The problem was solved by applying the fix and doing a fast reboot.

Lessons for the future: before attempting any troubleshooting, just use the advanced search on the official mailing lists, because most likely your problem already occurred on someone else's machine. Also, spin up a quick VM to rule out any software, updates or configuration errors before looking for hardware errors.


How I got there:

After several different tests as seen in the updated question, I narrowed it down to either software problems or hardware/driver conflicts on the specific hardware. To rule out the second, I installed two fresh OmniOS virtual machines, r151018 and r151016 on another host and configured by hand a basic SMB share on each of them.

The r151018 experienced the problem, r151016 works fine. I suspect I did not notice it in my very first tests, because I only rolled back some updates on r151018, not back to an earlier release. I think the problem must have been existed longer than I assumed.

When looking for a way to only update packages one by one, I looked at the mailing list and searched for smb from the last 6 months, where the correct solution/same problem popped up, dated back from May.

user121391
  • 2,452
  • 12
  • 31