2

I'm a new systems and network admin. My experience has been in the hardware and software of systems and servers; the network part is pretty new to me. I'm familiar with plugging numbers into network configurations, but if you ask me about subnetting or packet droppings ;) you'll see this really lost look flash across my face. I'm learning.

Here's my problem:

Since about two months before taking over the reins here, the previous Network Admin reports they've had issues downloading large files. Well, not really just large files, it's just more frustrating with the large ones. Now that I'm the one doing the downloads (everything from a random driver to the latest distros and SPs from our Technet subscription and Licensing agreement, to multi-gigabyte engineering software packages for our various departments) I have to "baby-sit" the downloads, keeping me chained to my desk for hours at a time.

The downloads will start just fine and get to some random point ranging from a few K to a couple G before the download will stall and fail if I don't pause and restart the download before it fails. Sometimes the pause/restart works right away, the download picks up speed and progresses a bit before the cycle repeats. Sometimes I have to go through several pause/restart cycles before the download starts actually downloading again.

The network and ISP details:

  • Fiber internet connection served by our ISP (our local city is our ISP). Download speeds generally even out around 1.1Mbps, with spikes as high as 1.6Mbps. Sometimes in the midst of pause/restart cycles we'll see speeds as low as a few hundred Kbps, but a few cycles later it'll speed up again. Speeds from different hosts are pretty consistent.
  • There is no proxy in our internal network and no firewall that I am aware of blocking the connection. We use a Cisco 1811W as our gateway, but it has not had any trouble before.

The issue was first noted around September, and there were no changes on our side around that time we can attribute this to.

What should I test, check, etc., to determine whether the problem is on our side or the ISP's?

Update:

I'm watching a Wireshark feed filtered for the TCP stream of a large download I've had trouble with for a few days now. Most of the traffic frames are labeled...

Continuation or non-HTTP traffic

...which I assume is just the subsequent packets comprising the download. However, relatively frequently (between every 3-20 seconds) and corresponding pretty much exactly to any dips in the download speed reported by Firefox are large sections of frames labeled...

[TCP Retransmission] Continuation or non-HTTP traffic

There are also a few random frames, usually spread out surrounding the Retransmission packets a few dozen frames on either side, labeled...

[TCP Previous segment lost] Continuation or non-HTTP traffic

...and whadayaknow, the download just failed about halfway through the 3.2GB file. The final frame is a TCP Previous Segment Lost frame. This came immediately after I had to pause the download and attempted to restart it: cue immediate failure.

Final frames in the download were http [ACK] followed by http [FIN, ACK], which I believe indicates a "graceful" TCP connection closure.

I did not see anything else indicating interruption by an intermediary.
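For readers unfamiliar with those Wireshark labels, here is a minimal sketch of the idea behind them. It is a simplification and not Wireshark's actual analysis logic, which also considers timing and acknowledgements. The segment list is made-up illustrative data, and `classify_segments` is a hypothetical helper name.

```python
from typing import List, Tuple

def classify_segments(segments: List[Tuple[int, int]]) -> List[str]:
    """Label each (sequence_number, payload_length) segment in capture order."""
    labels = []
    next_expected = None  # one past the highest byte seen so far
    for seq, length in segments:
        end = seq + length
        if next_expected is None:
            # First segment of the stream: nothing to compare against yet.
            labels.append("ok")
            next_expected = end
        elif seq > next_expected:
            # A gap: at least one earlier segment never showed up in the capture.
            labels.append("previous segment lost")
            next_expected = end
        elif end <= next_expected:
            # Bytes we have already seen (or already skipped past) are being re-sent.
            labels.append("retransmission")
        else:
            labels.append("ok")
            next_expected = end
    return labels

# Illustrative data only: the third segment arrives before the one it should follow.
capture = [(1, 1460), (1461, 1460), (4381, 1460), (2921, 1460), (5841, 1460)]
labels = classify_segments(capture)
print(labels)
```

A cluster of these labels means segments are going missing somewhere between the server and the capture point; it does not say which hop dropped them.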

Update 2

The issue is observed in all browsers and apps that download and the pause/restart functionality works 99% of the time in all apps that allow pause/restart. Specific apps and browsers I can replicate this easily in: Firefox (current versions), IE (9), iTunes (downloading apps and updates for iOS devices). I'm not sure if these all use the same functionality for the pause/resume function in downloading.

iTunes downloads from servers that all allow restarting (except the iOS update files), so it does not matter how long I pause the download for. Most sites I'm downloading large files from (MS, PTC, Solidworks, AutoDesk) don't support resuming stopped/canceled downloads (MS does, but only from their Java-based download manager), so I can only pause for around 15 seconds max before the download will fail immediately upon attempted resume.
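For context, resuming over HTTP generally relies on the server advertising `Accept-Ranges: bytes` and the client asking for the remainder with a `Range` header. Whether each of the apps above resumes this way is an assumption, not something I have confirmed. A minimal sketch with hypothetical helper names:

```python
from typing import Dict

def server_can_resume(response_headers: Dict[str, str]) -> bool:
    """True if the response advertises byte-range support."""
    for name, value in response_headers.items():
        # Header names are case-insensitive.
        if name.lower() == "accept-ranges":
            return value.strip().lower() == "bytes"
    return False

def resume_headers(bytes_received: int) -> Dict[str, str]:
    """Request headers asking for everything from the given byte offset onward."""
    return {"Range": f"bytes={bytes_received}-"}

print(server_can_resume({"Accept-Ranges": "bytes"}))
print(resume_headers(1048576))
```

A server that honours the request answers with status 206 (Partial Content); one that ignores it sends the whole file again from the start.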

Update 3

Using mturoute (Thanks Tom H), I found the consistent route max MTU is 1500 bytes before fragmentation, and the path carried fragmented ICMP payloads of 10000 bytes from end to end without much issue, including the hops through my ISP's devices. So the issue does not appear to be fragmentation or incompatible MTU settings.
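As a rough check on what that 10000-byte test involves, the arithmetic below estimates how many IPv4 fragments such a ping needs on a 1500-byte MTU path. It assumes a 20-byte IPv4 header with no options and an 8-byte ICMP echo header; `fragment_count` is a hypothetical helper name.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header

def fragment_count(icmp_payload: int, mtu: int = 1500) -> int:
    """Number of IPv4 fragments needed to carry one ICMP echo request."""
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    total = icmp_payload + ICMP_HEADER
    return math.ceil(total / per_fragment)

print(fragment_count(10000))  # 7
print(fragment_count(1472))   # 1
```

So 1472 bytes is the largest ping payload that fits in one 1500-byte packet, and the 10000-byte test travels as 7 fragments.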

ICMP is also not blocked by my ISP, and neither is BitTorrent, though I'm not using BT to download these files.

UPDATE 4

So what I need to look into, judging from the Wireshark logs, is how to pin down the cause of the Retransmissions and Previous segment lost frames. How would I isolate the probable source of these?

music2myear
  • Smells like the ISP (or their upstream) may be the culprit - but you should be able to confirm by taking a look at what's going on at the TCP level. I'd say break out Wireshark and watch the download traffic. You're looking for anything out of the ordinary when the connection stalls - given those symptoms, one side or the other is probably repeatedly re-sending a packet that's not making it for some reason. – Shane Madden Jan 31 '12 at 17:02
  • I'm totally guessing, but I think that pause in Firefox would take advantage of the Accept-Ranges header, which actually stops the download and restarts at the correct place by asking for the remaining range of the file. This may involve a new TCP connection being established, and if that seems to be fixing the problem then I would look at the error counters from the IOS console on the router. See my suggestion below... – Tom Jan 31 '12 at 23:02

5 Answers

1

Typically you can isolate and solve a problem like this by systematically proving that each part of the network is good. Use the appropriate tool to test one part until you can confidently say "I know this works", then move on to the next. Eventually you arrive at the final piece of the jigsaw and can say "I know this is the problem, because everything else is good!"

  1. If you can replicate the problem on devices attached to both Ethernet AND wireless, then that isolates the problem to the final link: the network <=> Cisco 1811W <=> DSL Fibre <=> ISP <=> the Internet.

  2. If you only see the problem on either wired OR wireless devices, then you can target the wired Ethernet or wireless configuration on the Cisco 1811W. You can then review the settings common to the problematic segment as a next step.

  3. Reseat any commonly shared Ethernet cables, and try swapping the DSL cables if spares are available, when testing a device.

  4. Check the MTU and auto-negotiation settings on the router that are set for the DSL link, and review the router log file from IOS.
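The first two steps amount to a small decision table. A minimal sketch, with a hypothetical function name and wording of my own, that only encodes the reasoning above:

```python
def likely_fault_domain(wired_fails: bool, wireless_fails: bool) -> str:
    """Map where the problem reproduces to the part of the path worth checking next."""
    if wired_fails and wireless_fails:
        # Both access methods share only the gateway and everything beyond it.
        return "gateway or upstream (router, ISP link, ISP)"
    if wired_fails:
        return "wired segment (cabling, switch ports, Ethernet settings)"
    if wireless_fails:
        return "wireless segment (radio or wireless settings)"
    return "not reproduced; keep testing"

print(likely_fault_domain(wired_fails=True, wireless_fails=True))
```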

The router will be running IOS 12 or something like that, which will have some good command line tools accessed via ssh for checking negotiated settings.

Use the show interfaces command to review error statistics such as resends and dropped packets. It might even have a web interface (I am not working with Cisco IOS devices at the moment, so this is untested and comes from some notes I made on troubleshooting Cisco networking).

However you should be able to pull up a table of per port error statistics from the cisco console using

# show interfaces status
# show interfaces counters errors

and for a particular port e.g.

# show interface GigabitEthernet 5/28 status
# show interface GigabitEthernet 0/24 switchport

Edit: here is a little video of some guy showing how to use the IOS "show interfaces counters errors" command to troubleshoot problems. It is actually really cool. It probably goes into too much depth, but it gives you the information required to detect a duplex mismatch or bad auto-negotiation settings.
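Interface error counters are cumulative, so a single reading says little; what matters is which counters grow while a download is stalling. A minimal sketch of comparing two readings, assuming you have copied the numbers out by hand. The counter names and values here are invented for illustration and are not real output from any device.

```python
from typing import Dict

def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters that increased between two readings."""
    deltas = {}
    for name, value in after.items():
        change = value - before.get(name, 0)
        if change > 0:
            deltas[name] = change
    return deltas

# Hypothetical readings taken before and after a stalled download.
before = {"input errors": 10, "CRC": 10, "collisions": 0}
after = {"input errors": 25, "CRC": 22, "collisions": 0}
growing = counter_deltas(before, after)
print(growing)
```

Counters that keep climbing on one port point at that link; counters that stay flat everywhere suggest the loss is happening further upstream.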

P.S. You can prove the router part of the connection by plugging an alternative DSL router into the fibre connection. If downloads work fine then, you know the problem is on this side rather than the ISP side.

Tom
  • This is a network with about 300 devices. The Cisco functions only as the gateway, with the main router connected to it, and all other devices chained off various other switches depending on their location in our areas. – music2myear Feb 01 '12 at 14:36
  • I'm testing this over wireless using an iPad right now. Takes a while to test a 3.2GB file download over the iPad. – music2myear Feb 01 '12 at 14:53
  • @music2myear I would try the IOS show interfaces error counters command, if you have access to the router management port. Typically Cisco products are *very very* good, and it's a pretty basic way to dip your toe into networking. (One of the things about Cisco being expensive gear is that the tools are really good, though there is a bit of a steep learning curve for non-command-line users.) – Tom Feb 01 '12 at 18:38
  • I'm wading into the console this afternoon. I'll let you know what I find. – music2myear Feb 01 '12 at 18:45
1

Some ISPs make the strange decision to block all ICMP packets on their switches or firewalls. This blocks calculation of the Path MTU, which means you get more fragmented packets occurring as they pass through routes with lower MTUs. Maybe you are seeing the result of this.

Fragmented packets have to be reassembled which can be a problem if you also have packet loss! Given you are trying to download large files, both fragmentation and loss of packets will be a greater problem. Path MTU discovery is designed to reduce fragmentation.

So how do you know if your ISP has done this to you? You could ask them - however, in my experience ISPs will far prefer to send you off with basic troubleshooting for several days/weeks rather than admit they might have done something wrong. And of course sometimes they are right to!

You should gather information to show them what you are seeing. Packet Captures like you have done in Wireshark or collected at your firewall are helpful as they often reveal the level of fragmentation. You can check whether path MTU discovery is working using tracepath (*nix) or mturoute (Windows).

If you do find pMTU is not working, it could be either your ISP, or the ISP of the site you are trying to download from. If you see the problem for downloads from multiple sites, chances are it is your ISP.
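To make the terms concrete: the path MTU is the smallest MTU of any link along the route, and TCP sizes its segments to fit inside it. A minimal sketch, assuming 20-byte IPv4 and 20-byte TCP headers with no options; the hop values are invented for illustration.

```python
from typing import List

IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options

def path_mtu(hop_mtus: List[int]) -> int:
    """The largest packet that crosses every hop without fragmentation."""
    return min(hop_mtus)

def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER

# Hypothetical route where one hop has a smaller MTU than the rest.
hops = [1500, 1500, 1492, 1500]
print(path_mtu(hops))           # 1492
print(tcp_mss(path_mtu(hops)))  # 1452
```

If the sender never learns about the 1492-byte hop, it keeps sending 1500-byte packets that cannot cross that hop intact.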

And of course, it could be a whole bunch of other things too :-) Good luck!

dunxd
  • The ICMP blocking would be apparent if all ping requests to public IPs failed, correct? If this is true, then I know it is not the case with my ISP. Also, I checked with mturoute, the max MTU is 1500 consistently along the route with a max on some hops of 10000 (mturoute default max), so I don't think it's a fragmentation issue. – music2myear Feb 01 '12 at 17:44
0

Are you using BitTorrent to download these large files? Many ISPs have installed special hardware to detect and rate limit traffic abusers.

I'd call your ISP to ask them what plan you have with them and whether they are aware of any traffic shaping or throttling.

Here's what my ISP uses:

http://www.sandvine.com/

I'll leave it as an exercise to the OP to determine how to bypass any such hardware/software rate-limiting device should they be found to exist.

dmourati
  • No Bittorrent. I've used it before and it's not blocked. But not many big companies use BT to share large downloads of their packages. It's not an option for most of the packages I have to download, even from alternate sites. – music2myear Feb 01 '12 at 14:27
0

Just curious, are these all Windows 7 machines? I had a similar issue that affected Win 7 machines only. The unlikely solution worked and I have never been happier in my life.

Although my question was originally regarding email, I soon realized that the issue extended to almost anything involving the network. The Microsoft fix was simple and easy, and it is something that I now apply to all W7 machines pre-deployment. I haven't had any issues at all since.

Here is the question: Original Question

cop1152
0

The problem is resolved!

The issue was extremely difficult to diagnose because it happened irregularly, and, while not infrequently, not frequently either (yes, that's a contradiction, I'll live with it).

Eventually the issue seemed to be getting worse, and affecting other aspects of our connection, and I was able to catch it in dropped pings and such, and it became clear to me the issue was not in our network.

Our ISP (at the time) was a resold AT&T connection, so I talked to the reseller first and presented them with the information I'd gathered (this is from memory, the issue was resolved about 6 months ago now, so little technical detail, sorry) proving the issue was not internal to our network. They found one of their own switches was having trouble and replaced it, but this didn't fix the issue, so they did more testing and found issues upstream with AT&T, and AT&T was able to corroborate and resolve the issues.

I'm not entirely certain the issue was only with AT&T. Based on how the symptoms escalated, I'd say the escalation was due to issues on AT&T's side, but the original problem was with our own local ISP, and so we had a trust issue there.

We switched ISPs, leaving the local reseller then for that reason, and went to... AT&T. I know, out of the frying pan and into the fire. But we're now paying much less for a guaranteed more, and as soon as AT&T saw their issue, they fixed it, which is OK in our book.

music2myear