
What follows is a description of the strangest network problem I have encountered.

The story

I am working with a client that reported being unable to check their email, starting this morning. Two weeks ago they completed a big upgrade that saw all of their PCs replaced and a new SBS server added to the Windows domain.

The client fetches email over POP3 using Outlook 2010 on Windows 7, but messages are not being received. The Send/Receive progress window indicates that the download is stalled at about 1% completion.

I tried connecting to the POP3 server using a telnet client and observed similar behaviour. After issuing the command RETR 1, I saw small chunks of data (about 1 KB each) coming through, with increasingly long pauses in between. The pauses appear to double in length each time: I observed pauses of 1, 3, 7, 14, and 28 seconds, and stopped counting after that.
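
For anyone who wants to repeat the telnet test without a stopwatch, it can be scripted. The following is only a sketch in Python, not what was actually run here: the function names `time_retr` and `gaps` are made up for illustration, and it assumes a plain-text POP3 server on port 110 that accepts USER/PASS.

```python
import socket
import time


def gaps(timestamps):
    """Seconds elapsed between consecutive chunk arrival times."""
    return [round(b - a, 3) for a, b in zip(timestamps, timestamps[1:])]


def _read_line(sock):
    """Read one CRLF-terminated reply line, one byte at a time."""
    buf = b""
    while not buf.endswith(b"\r\n"):
        byte = sock.recv(1)
        if not byte:
            break
        buf += byte
    return buf


def time_retr(host, user, password, msg_no=1, port=110, timeout=120.0):
    """Log in, issue RETR, and return (chunk_sizes, arrival_times)."""
    sizes, stamps = [], []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        _read_line(sock)  # server greeting
        for cmd in ("USER " + user, "PASS " + password):
            sock.sendall(cmd.encode("ascii") + b"\r\n")
            _read_line(sock)  # +OK or -ERR
        sock.sendall(("RETR %d" % msg_no).encode("ascii") + b"\r\n")
        tail = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            stamps.append(time.monotonic())
            sizes.append(len(chunk))
            if not tail and chunk.startswith(b"-ERR"):
                break  # server refused the RETR
            tail = (tail + chunk)[-5:]
            if tail == b"\r\n.\r\n":
                break  # POP3 end-of-message marker
    return sizes, stamps
```

Calling `sizes, stamps = time_retr(host, user, password)` and then printing `gaps(stamps)` would list the pause before each chunk, which makes a doubling pattern easy to spot.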

Running the same telnet test from a Terminal Server (Server 2008 Standard) in the same domain yielded the same results.

I then uploaded a small PHP script, which returns a page of arbitrary length, to a server on the internet and tried accessing it from the Terminal Server on the LAN. I tested various sizes from 1 KB to 1 MB, and no stalls were observed. HTTP appears to be unaffected.
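
The PHP script itself is not reproduced here. As a rough equivalent, the same idea can be sketched in Python with the standard library's `http.server`. The `size` query parameter, the helper names, and port 8000 are all assumptions made for this illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def make_payload(size):
    """Return a body of exactly `size` bytes (empty if size is negative)."""
    return b"x" * max(size, 0)


def requested_size(path, default=1024):
    """Extract ?size=N from the request path, falling back to `default`."""
    query = parse_qs(urlparse(path).query)
    try:
        return int(query.get("size", [default])[0])
    except ValueError:
        return default


class PayloadHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = make_payload(requested_size(self.path))
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve until interrupted; the port number is an arbitrary choice."""
    HTTPServer(("", port), PayloadHandler).serve_forever()
```

With `serve()` running on an outside host, requesting `/?size=1048576` from the LAN would pull a 1 MB response, which is the kind of transfer that showed no stalls.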

Lastly, I plugged my personal laptop (not a domain member) into the network and tried the POP3 test again -- the entire message downloaded right away.

Update (2011-03-10)

I used Wireshark to get a clear packet capture of the POP3 conversation today. The initial conversation (USER, PASS, LIST) works as expected with the server responding immediately. (The results of LIST fit within a single packet.) After the RETR command is issued and the message begins streaming, the delays start. My earlier estimate was slightly off, and the delays are actually of the expected duration: 1, 2, 4, 8, 16 seconds, etc. The client is sending ACKs right away, within 200 ms of each packet received.
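
The 1, 2, 4, 8, 16 second pattern is consistent with a TCP sender doubling its retransmission timeout each time a segment goes unacknowledged (exponential backoff). A minimal sketch of that schedule in Python, assuming a 1 second initial timeout and a hypothetical 60 second cap, neither of which was measured in the capture:

```python
def backoff_schedule(initial_rto, retries, cap=60.0):
    """Wait before each successive retransmission, doubling every time."""
    delays = []
    rto = initial_rto
    for _ in range(retries):
        delays.append(min(rto, cap))  # cap is an assumed upper bound
        rto *= 2
    return delays


print(backoff_schedule(1.0, 5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

If the server were retransmitting on this schedule, it would suggest that it never saw the client's ACKs as valid, even though the client sent them promptly.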

Also, we tried connecting one of the affected workstations to the internet directly, and it was able to download messages at full speed. At this point, I strongly suspect the router (a Cisco 1711) is at fault, but I don't know enough about IOS to conduct further diagnosis.

What I know

  • The POP3 server is working fine for clients outside the network.
  • The cable modem is delivering a full speed connection.
  • The router is probably not malfunctioning, because it worked perfectly when I connected my own machine to the network.
  • The L2 switch is delivering LAN traffic at gigabit speed.
  • Only the newly installed computers are exhibiting issues.
  • The issue started on the new computers nearly two weeks after they were installed.

What I don't know

  • What the heck causes this kind of stalling?
Nic
  • The real question is why are they using POP when they have Exchange as part of SBS? – joeqwerty Mar 10 '11 at 11:43
  • @joeqwerty They will be switching to Exchange when everything is stable. If this is symptomatic of a more serious network issue, I don't want to aggravate it by bringing email in-house just yet. – Nic Mar 10 '11 at 16:21
  • If it were me and my client was using some two-bit external POP service and they had an internal SBS server the first thing I would do would be to bring the email in-house. Even if the problem exists internally why troubleshoot it from the perspective of the external POP server? – joeqwerty Mar 10 '11 at 17:49
  • It would help to know the topology of the network. When you performed the tests with your laptop and the computer connected directly to the Internet was this avoiding the router? If so the likely cause would seem to be the router. – Will Mar 10 '11 at 20:22
  • @Will It is a flat network with one switch connected to the router. I connected my personal laptop to the switch, so the router was not bypassed. – Nic Mar 10 '11 at 21:06

2 Answers


As an administrator run the following from the client's command line:

netsh interface tcp set global autotuninglevel=disabled

This disables receive window auto-tuning, and with it TCP window scaling. Does the problem persist? If not, then the problem is somewhere between the client and the server in the network equipment, including cables, network interfaces and their drivers, switches and routers.

adamo
  • From initial tests, it looks like this has fixed the problem. – Nic Mar 10 '11 at 21:18
  • This did work for us to clear the symptoms, but we ended up disabling IP inspection on the router to fix the issue for everybody. Since your answer led us to the solution first, I'm giving you full credit. – Nic Mar 10 '11 at 22:12

Is there a software firewall or antivirus product installed? It may be intercepting all port 110 traffic in order to run it against its virus database, which could cause a slowdown. The way the data arrives in 1 KB chunks suggests that, if it is an AV product, it may be pegging the CPU or disk I/O.

Try starting up the Resource Monitor (Start Task Manager (Ctrl+Shift+Esc), Performance, Resource Monitor) and watch the process list for both CPU and Disk to see if there's any specific process spiking either while downloading a message. (I'd suggest testing with telnet still just to avoid polluting the data with Outlook/etc.)

You may also want to take a look at the traffic with a packet sniffer for further clues. I'm wondering if maybe the client (due to AV, or whatever reason) is really slow at returning the ACKs to the server's packets, and that's causing the TCP congestion avoidance to continue to increase the backoff and thus delay between each packet. To the best of my memory, a little over 1KB would have you sitting right around the maximum segment size (size of an individual packet), which would make sense as to why you're only receiving that much data per "burst". (Disclaimer: My knowledge in this area is old and faded. Don't rely on it too much.)
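
For reference, the segment size arithmetic can be checked quickly. This sketch assumes a standard 1500-byte Ethernet MTU and 20-byte IPv4 and TCP headers with no options; the actual values on this network were not stated.

```python
def tcp_mss(mtu, ip_header=20, tcp_header=20):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - ip_header - tcp_header


print(tcp_mss(1500))  # 1460
```

So a burst of "a little over 1 KB" would be roughly one full-sized segment on a typical Ethernet path.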

If it is the TCP backoff, it could also be due to dropped packets, but I doubt all of the new machines have a bad cable or something. Seems more likely to be a software issue as I imagine they're all similarly imaged.

EDIT: Based on the information provided, and the other answer by adamo, I found a Microsoft Knowledge Base article that seems to directly address your issue:

KB-935400 It takes much longer than expected to download an e-mail message from a POP3 server in Outlook 2007 or Outlook 2010

Specifically, it says the problem is:

"This problem occurs if a network hardware device, such as a router, does not support TCP Window Scaling. TCP Window Scaling is a new Windows Vista feature."

Running the command provided by adamo is one fix. You may also be able to update the router. According to the feature navigator on Cisco's website, TCP Window Scaling is supported in the newer IOS releases for your hardware. If you get hold of a newer image for your router and flash it, that should resolve your issue without losing the benefits of TCP window scaling.
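
To illustrate why a device that ignores window scaling can cause trouble, here is a small sketch of the arithmetic in Python. With scaling, the 16-bit window field in each segment is shifted left by the scale factor negotiated in the SYN; a device that tracks the connection but ignores that option would read the raw field as a byte count. The values 256 and 8 below are hypothetical, chosen only to show the size of the mismatch.

```python
def effective_window(advertised, shift):
    """Receive window in bytes as the two endpoints interpret it."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("window field is 16 bits")
    if not 0 <= shift <= 14:
        raise ValueError("window scale shift is limited to 14")
    return advertised << shift


def window_without_scaling(advertised):
    """What a device unaware of the scale option would assume."""
    return advertised


advertised, shift = 256, 8  # hypothetical values, not from the capture
print(effective_window(advertised, shift))  # 65536
print(window_without_scaling(advertised))   # 256
```

In this example the endpoints agree on a 64 KB window while the device in the middle believes only 256 bytes are allowed, so it could plausibly treat most of the server's data as out of window and drop it.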

NuclearDog