8

We've got an EMC NX4 SAN box serving a CIFS share to a number of Windows Server 2008 R2 app servers. The app servers use the CIFS share to serve lots of image files (~2500 ops/sec on the share); however, neither the SAN nor the app servers show any obvious signs of stress.

Once in a while an app server will, apparently all of a sudden, drop the connection to the SAN. Any .NET code trying to serve a file from the SAN fails with:

System.IO.IOException: The specified network name is no longer available

If I RDP to the app server and try to access "\\san-name" through Explorer, I get the same error. All other app servers can access it just fine. I can also access "\\ip-of-san" without problems, and pinging works as well.

A reboot of the app server fixes the issue, but that's a somewhat drastic remedy, given that the SAN seems to be working fine and the computer can access it. It just looks like the "\\san-name" access has barfed up.

This has happened to two different app servers during the last week, so I don't suspect a single app server of being the cause. Ignoring the cause for now: how would I restore the "\\san-name" connection without rebooting the machine? And can I somehow query what went wrong?

The event logs show nothing (besides related ASP.NET errors caused by the issue), neither on the app servers nor on the SAN.

Update:
Based on the suggestions I'll try a restart of the Workstation service the next time and see if that helps the issue. Definitely not a fix, but way faster to do than to reboot the whole machine as I've currently been doing. Any way to query the status of the connections that the Workstation service maintains?

Update 2:
Confirmed that restarting the Workstation service "fixes" the issue. The next step is to try the registry change to raise the MaxCmds value. I won't be able to confirm whether that's the cause; I can only assume so if it runs for a lengthy period without issues.

Mark S. Rasmussen
  • Are there any indications in the event logs on the App servers, specifically in the System log, that point to either a transient failure or some other mechanism being triggered (e.g. DOS protection in the LanManagerService as described here http://blog.mreza.info/archive/2007/09/26/vmware-server-and-quot-network-name-is-no-longer-available-quot.aspx ). Also what AV setup is in place and how is the Celerra integrated with that. – Helvick Dec 04 '10 at 11:55
  • @Helvick No relevant entries in the event logs, neither app nor system. We do not run AV on either the servers or the Celerra. I searched the event log for the LanManagerService DOS protection event too, but it came back empty. – Mark S. Rasmussen Dec 04 '10 at 19:13

6 Answers

6

This sounds like MaxCmds has run out. Here are two good articles about that: here and here.

Here's how to change it. Create a file called update.reg and place the following in it:

Windows Registry Editor Version 5.00 

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanworkstation\parameters] 
"MaxCmds"=dword:00000800 

Save and then double click and accept the prompt. A reboot is required.
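Note that the value in the .reg file is hexadecimal, so `dword:00000800` is 2048 in decimal. If you want to try a different limit, here is a minimal Python sketch for converting between the two forms. The helper names `to_reg_dword` and `from_reg_dword` are made up for this illustration and are not part of any Windows tooling.

```python
def to_reg_dword(value):
    """Format a non-negative integer as a .reg DWORD literal (8 hex digits)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    return "dword:{:08x}".format(value)


def from_reg_dword(literal):
    """Parse a .reg DWORD literal such as 'dword:00000800' into an integer."""
    prefix = "dword:"
    if not literal.lower().startswith(prefix):
        raise ValueError("expected a literal starting with 'dword:'")
    # The digits after the prefix are always hexadecimal in .reg files.
    return int(literal[len(prefix):], 16)


if __name__ == "__main__":
    print(from_reg_dword("dword:00000800"))  # 2048
    print(to_reg_dword(2048))                # dword:00000800
```

To see what is currently set on a server, `reg query HKLM\SYSTEM\CurrentControlSet\Services\lanmanworkstation\parameters /v MaxCmds` from a command prompt should print the value, or report that it does not exist if it has never been set.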

Scott Forsyth
  • Since the bounty is about to run out, I'll award it to your answer so far as it's the best bet imho, though I'll have to test it out before accepting. I have previously altered the FCNMode to only register the bin directory as I had "bios command limit reached" errors on some of the apps hosted on another UNC share. But I guess the FCNMode setting does not affect dirs outside of the application directory. – Mark S. Rasmussen Dec 10 '10 at 08:16
  • The FCNMode can also help, but with a large directory structure over UNC, both may come into play. I 'believe' that FCN applies to the whole directory tree for .NET 2.0 and greater. – Scott Forsyth Dec 10 '10 at 16:21
  • Further to that: I've seen the MaxCmds run out with multiple front end nodes and multiple users used for different folders. The MaxCmds is a setting I apply to all of my UNC webfarms. I've never seen a drawback to that change. There is a server setting too if the CIFS share target is a windows server, but that doesn't apply to you. – Scott Forsyth Dec 10 '10 at 16:30
  • Just to clarify my comment, the actual .NET applications are stored on local disk. The main purpose of the apps is to serve image data, which is stored on UNC shares. The FCNMode setting, as I understand it, only applies to the application directory, thus having no impact in my case. MaxCmds is still a possible culprit though. All apps are running under the same account, but with 500+ web apps on each server, it's likely I'm running out. – Mark S. Rasmussen Dec 13 '10 at 04:29
  • The default behavior in ASP.NET for FCN is to traverse the entire directory structure. The registry key of HKLM\Software\Microsoft\ASP.NET\FCNMode can be 0, 1 or 2. 0 is the default which has a FCN object for every folder. If you change it to 2 then it will use one object for the root and all subdirectories. Setting to 1 turns it off completely. http://support.microsoft.com/kb/911272. You may find this blog post and discussion helpful too: http://weblogs.asp.net/owscott/archive/2006/02/21/ASP.NET-v2.0-_2D00_-AppDomain-recycles_2C00_-more-common-than-before.aspx. – Scott Forsyth Dec 13 '10 at 16:21
  • Encountered this on Windows Server 2016. The reg fix worked. – user1429980 Jun 12 '17 at 07:42
1

Maybe restart the Workstation service on the app server!
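As a rough sketch of what that could look like when scripted, the Python below lists the stock `net` commands for inspecting the redirector's connections and then bouncing the service. The `run_all` helper and its `dry_run` flag are assumptions made for this example. It only prints the commands unless you pass `dry_run=False` from an elevated prompt on the affected server.

```python
import subprocess

# Inspect the connection state first, then restart the Workstation service.
COMMANDS = [
    ["net", "use"],                        # list remembered/active SMB connections
    ["net", "statistics", "workstation"],  # redirector counters since service start
    ["net", "stop", "workstation", "/y"],  # /y answers the dependent-services prompt
    ["net", "start", "workstation"],
]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    handled = []
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)
        handled.append(cmd)
    return handled


if __name__ == "__main__":
    run_all(COMMANDS)  # dry run: prints the four commands without running them
```

Stopping the Workstation service also stops the services that depend on it, and `net start workstation` does not bring those back by itself, so check which ones were stopped and start them again afterwards.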

tony roth
  • If it's truly losing name resolution, you could try, as an experiment, using a hosts file to short-circuit the name resolution process. – tony roth Dec 07 '10 at 16:41
  • I tried restarting the service, which didn't work, but then I rebooted the server and it seems to work after that. – Circle Hsiao Aug 14 '19 at 08:45
0

I've had cases like this before, though not with an EMC back end. For userland applications, force-closing the connection to the remote server and reopening it will bring it back, though you may have to try a couple of times before it gets its act together. For serverland applications, recycling the Application Pool for that service works. If that fails, recycling the Workstation Service can avoid a reboot, but it's almost as drastic.

sysadmin1138
0

On the source :

Could you give more details on the software installed on the app server? On the net you will find that it's usually a problem with an AV, but since you don't run any... maybe another kernel-mode app, like backup software?

Is the firewall active ? Have you checked event logs on the DC for the faulty app server ?

You should also sniff CIFS network traffic when the problem arises to see what happens.

The only times I ran into this error were when the server/workstation somehow "lost" its link with the domain. Re-forcing domain membership did the trick (netdom resetpwd). Can you access other network shares (from the RDP session to the app server) when the problem arises?

Renik
  • The only software running on the server is IIS running a .NET web application. The firewall is not active as this is behind our DMZ. I'll try to check the AD logs the next time it happens. Good tip regarding CIFS - I'll try to add an iSCSI LUN the next time too, to see if it's related to just CIFS or if it's a general connectivity issue using the hostname. I can access all other machines & shares using CIFS while this error is occurring. – Mark S. Rasmussen Dec 08 '10 at 09:53
0

Could this be an issue with name resolution? Can you check your DNS server? Perhaps it is failing to resolve the name, and resolution only works again after you reboot the app server.

I had the same issue when some workstation users complained that they could not access an application stored on another server. We tried accessing the server by IP address, which worked, but access by name did not, so we checked DNS. We changed the application to reach the other server by IP address, since we have a static IP network.
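A quick way to test this from the affected server is to ask the OS resolver what the name maps to and compare it with the IP address you know is correct. Below is a minimal Python sketch of that check. The function names are invented for this example, and `san-name` and the IP address in the usage lines are placeholders for your own values.

```python
import socket


def resolve_ipv4(host):
    """Return the sorted, de-duplicated IPv4 addresses the OS resolver gives for host."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def name_matches_ip(host, expected_ip):
    """True if host resolves to expected_ip; False on a mismatch or lookup failure."""
    try:
        return expected_ip in resolve_ipv4(host)
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Placeholders: substitute the real share host name and its known-good address.
    print(name_matches_ip("san-name", "192.0.2.10"))
```

This goes through the same resolver path as most applications (hosts file, DNS, and so on), whereas nslookup queries the DNS server directly, so the two can disagree. A match here would suggest the problem sits above name resolution, for instance in the SMB session itself.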

Let me know if my suggestion works for you.

maniargaurav
  • While I get the error message, I can perform a nslookup just fine, returning the correct IP from our local AD DNS. I can also ping using both the hostname and the IP address. – Mark S. Rasmussen Dec 08 '10 at 09:50
0

I ran into a similar issue. I was not able to map a share on a Windows Server 2012 machine from a Windows Server 2003 machine.

The network group had implemented an AD policy that isolated the older Windows versions in an AD container that did not allow lower versions of TLS to connect to servers running higher versions of TLS. Moving the server back, or disabling the policy so that lower versions of TLS could connect, corrected the issue.

Here are some errors I came across in the system log:

The certificate received from the remote server was issued by an untrusted certificate authority. Because of this, none of the data contained in the certificate can be validated. The SSL connection request has failed. The attached data contains the server certificate.

A fatal alert was generated and sent to the remote endpoint. This may result in termination of the connection. The TLS protocol defined fatal error code is 48. The Windows SChannel error state is 552.

Hope it helps to resolve your issue.

peterh