Intermittent high ping/latency problem

Question

I have been working with my ISP (which is a WISP, actually Fixed Broadband Wireless) trying to figure out why I intermittently get high latency. The latency is noticeable in online games and other streaming applications. If I do a trace route you can see the path through the back-haul network:

Tracing route to google.com [74.125.67.105]
over a maximum of 30 hops:

  1     1 ms     4 ms    <1 ms  192.168.23.1
  2     1 ms     8 ms     9 ms  10.100.100.1
  3     9 ms     9 ms     3 ms  10.7.37.1
  4    15 ms    24 ms    19 ms  10.7.36.1
  5    10 ms    79 ms     9 ms  10.7.31.3
  6    10 ms    39 ms    39 ms  10.10.5.9
  7    19 ms    19 ms    19 ms  10.10.5.5
  8     9 ms    19 ms    19 ms  10.10.5.1
  9   341 ms   237 ms   226 ms  10.250.200.1
 10   249 ms   280 ms   991 ms  <ISP WAN IP>
 11   703 ms   681 ms   401 ms  <ISP WAN IP>
 12   819 ms   628 ms   484 ms  <AT&T IP>    <- Traffic enters AT&T backbone
 13   699 ms   528 ms   290 ms  <AT&T IP>
 14   201 ms   106 ms    52 ms  <AT&T IP>
 15   624 ms   392 ms   436 ms  <AT&T IP>
 16   666 ms     *      252 ms  <AT&T IP>
 17   456 ms   403 ms   581 ms  209.85.254.120
 18   430 ms   339 ms     *     209.85.242.215
 19  1061 ms    56 ms    53 ms  72.14.239.131
 20  3514 ms   734 ms   219 ms  209.85.255.190
 21    49 ms    59 ms    56 ms  74.125.67.105

Which seems to indicate that the problem is the host at 10.250.200.1. However, if I directly ping the host everything seems fine (~10ms round-trip). Pinging subsequent hops after the suspected node gives reasonable round-trip times as well. The high latency can persist for only a few seconds up to a few minutes at a time.

EDIT: Yes, this is a bad example of a trace showing a definite problem, but after repeated tests there is never latency >100 ms before hop 9, which is why I thought it could be a problem.

A pathping during the event produces the following:

        Source to Here   This Node/Link
Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
  0                                           192.168.23.129
                                0/ 100 =  0%   |
  1    2ms     0/ 100 =  0%     0/ 100 =  0%  192.168.23.1
                                0/ 100 =  0%   |
  2    3ms     0/ 100 =  0%     0/ 100 =  0%  10.100.100.1
                                0/ 100 =  0%   |
  3   14ms     0/ 100 =  0%     0/ 100 =  0%  10.7.37.1
                                0/ 100 =  0%   |
  4   15ms     0/ 100 =  0%     0/ 100 =  0%  10.7.36.1
                                0/ 100 =  0%   |
  5   19ms     0/ 100 =  0%     0/ 100 =  0%  10.7.31.3
                                0/ 100 =  0%   |
  6   27ms     0/ 100 =  0%     0/ 100 =  0%  10.10.5.9
                                0/ 100 =  0%   | 
  7   28ms     0/ 100 =  0%     0/ 100 =  0%  10.10.5.5
                                0/ 100 =  0%   |
  8  ---     100/ 100 =100%   100/ 100 =100%  10.10.5.1
                                0/ 100 =  0%   |
  9   25ms     0/ 100 =  0%     0/ 100 =  0%  10.250.200.1
                                0/ 100 =  0%   |
 10   24ms     1/ 100 =  1%     1/ 100 =  1%  <ISP WAN IP>
                                0/ 100 =  0%   |
 11   25ms     4/ 100 =  4%     4/ 100 =  4%  <ISP WAN IP>
                                0/ 100 =  0%   |
 12   35ms     0/ 100 =  0%     0/ 100 =  0%  <AT&T IP>
                                0/ 100 =  0%   |
 13  ---     100/ 100 =100%   100/ 100 =100%  <AT&T IP>
                                0/ 100 =  0%   |
 14  ---     100/ 100 =100%   100/ 100 =100%  <AT&T IP>
                                0/ 100 =  0%   |
 15  ---     100/ 100 =100%   100/ 100 =100%  <AT&T IP>
                                0/ 100 =  0%   |
 16   58ms     0/ 100 =  0%     0/ 100 =  0%  <AT&T IP>
                                1/ 100 =  1%   |
 17   59ms     1/ 100 =  1%     0/ 100 =  0%  209.85.254.120
                                0/ 100 =  0%   |
 18   59ms     1/ 100 =  1%     0/ 100 =  0%  209.85.242.215
                                0/ 100 =  0%   |
 19   56ms     1/ 100 =  1%     0/ 100 =  0%  72.14.239.127
                                0/ 100 =  0%   |
 20   60ms     1/ 100 =  1%     0/ 100 =  0%  209.85.255.194
                                0/ 100 =  0%   |
 21   59ms     1/ 100 =  1%     0/ 100 =  0%  74.125.67.105

Why does this latency only show up during a trace-route and not with a normal ping? The lack of performance I see in my application coincides with this.

In other words, while having troubles with my application, if I run a trace at the same time I get the above result while simultaneously pinging the suspect host shows a normal ping.

If your ISP has only one connection to the internet (at hop 11) just ping that for some amount of time. — dbasnett, Jul 03 '11 at 18:03
My ISP is a WISP. If my neighbors that use the same WISP are downloading large amounts of video material ;) my connection can suffer. Remember that wireless is CSMA/CA, not CSMA/CD. — dbasnett, Jul 03 '11 at 18:05

score 4 · Answer 1 · answered Jul 03 '11 at 17:48

WISP? Meaning Wireless ISP? If so, there's your likely answer. Wireless is unreliable and you're seeing proof of that.

You can't really fix it because your medium (the atmosphere) is really awful for transmitting data. First because air is a hub instead of a switch, so you're sharing it with anybody around you and colliding packets; second because CSMA/CA is slower than CSMA/CD; third because wireless is generally half-duplex instead of full duplex; and fourth because there is orders of magnitude more interference through the air than over copper. [Microwave ovens, for example, operate in the same 2.4 GHz band as 802.11b/g... but the oven operates at about 500-1000 watts vs your wireless antenna's 100 milliwatts. Microwave ovens are shielded, but shielding isn't perfect, and 2.4 GHz is an unlicensed band, so it's not illegal if they cause interference.] Plus there's the fact that you're going through 10+ hops just to get to the Internet. That can't be helping, particularly if there's any NAT or firewalling going on.

As @dbasnett says, the traceroute ping latency to a given host only indicates the state of the entire network in between the interfaces taken as a whole at that point in time. That's why the response times go down sometimes. They're spiky because the network is unreliable. Your pathping looks good because it is running a large number of queries instead of just 3 that tracert is running. So pathping shows you what the network is doing over a period of 325 seconds (by default), and tracert is showing you what 3 packets per hop on the network are doing.

Bacon Bits

Pathping calculates statistics at 25 seconds per hop. 16 hops would be 400 seconds. Just wanted to throw that in there. – joeqwerty Jul 03 '11 at 17:55
This seems like the standard answer, but the suspect problem looks so deep in the network that it is probably beyond the wireless portion. – Hugh Jeffner Jul 03 '11 at 17:56
@Hugh: I'm with Bacon on this one. Additionally, hops 14, 19, and 21 are proof that there's nothing wrong with the path. If there were a problem prior to hop 14 that was affecting the path, then all hops subsequent to the problem hop would show high response times, and technically would be showing higher response times than the bad hop because they're further upstream than the bad hop. I would focus on looking at the quality of the wireless connection and any packet loss between you and the destination. Trace route results are a red herring and will lead you on a wild goose chase. – joeqwerty Jul 03 '11 at 18:02
The key here is **intermittent**. I guess a traceroute isn't the best tool to demonstrate the problem. There's not a lot I can do from this end it seems. – Hugh Jeffner Jul 03 '11 at 18:13
@joeqwerty Ah, I tried two different sites and got 325s for both. My bad luck they were both 15 hops! It looked like a strange default value. I wonder why they didn't do 30 seconds per hop. – Bacon Bits Jul 04 '11 at 05:04
@Bacon: 25 seconds does seem like a pretty random number. I've always wondered why they chose that. http://technet.microsoft.com/en-us/library/cc958876.aspx – joeqwerty Jul 04 '11 at 22:01
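
The 25 seconds per hop discussed in the comments above falls out of simple arithmetic if you assume pathping's documented defaults of 100 queries per hop at 250 ms between pings. A minimal sketch of that calculation (the function name is made up for illustration):

```python
# Assumed pathping defaults: 100 queries per hop, 250 ms between pings.
QUERIES_PER_HOP = 100
PERIOD_MS = 250


def stats_seconds(hops, queries=QUERIES_PER_HOP, period_ms=PERIOD_MS):
    """Approximate time pathping spends gathering statistics, in seconds."""
    return hops * queries * period_ms / 1000


print(stats_seconds(1))   # 25.0
print(stats_seconds(16))  # 400.0
```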

score 3 · Answer 2 · answered Jul 03 '11 at 17:42

9 times out of 10, trace route results are not an indication of network issues. Traceroute sends ICMP echo request packets toward the destination, incrementing the TTL by one for each successive hop. The results from each hop are an indication of how THAT hop is responding to ICMP traffic; they are not an indication of the quality of the path through and beyond that hop. A router's job is to forward traffic and, as such, many are programmed to ignore, drop, or give low priority to ICMP traffic directed to themselves.

The fact that hops 14, 19, and 21 have very good response times indicates that there's nothing wrong with the path. If there were a problem at hop 12 (as you highlighted) or at any other hop that was affecting the path, then you would see a problem at every successive hop and you would see each hop worse than the one before. Only when you see those types of results in traceroute should you suspect a path issue. Hop 21 is the destination and, with a 59ms response time, is telling you that the path between the source and the destination is fine.

The key to analyzing a path issue is to analyze its performance while real data is transiting it, which can't be done unless you have a packet sniffer/network monitor at each hop and have access to memory, CPU, and throughput counters on each network node (routers and switches) in the path from source to destination.

Rather than trying to figure out why you have performance problems based on a tracert of the path, you should concentrate on the actual TCP session between the source and destination and look at the response time (latency) and any packet loss between these two endpoints.

Trace route, as its name implies, is a tool for discovering the path between two endpoints; it is not a tool for analyzing the quality of that path.
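
The rule of thumb in this answer (only suspect a hop if the delay carries through to every hop after it) can be sketched in a few lines of Python. The RTT values are hops 8 through 21 of the tracert in the question; the function names and the 100 ms threshold are assumptions chosen for illustration, not part of any tool.

```python
# RTT samples (ms) for hops 8-21, copied from the tracert in the question.
# "*" (no reply) is recorded as None.
TRACE = {
    8: [9, 19, 19],
    9: [341, 237, 226],
    10: [249, 280, 991],
    11: [703, 681, 401],
    12: [819, 628, 484],
    13: [699, 528, 290],
    14: [201, 106, 52],
    15: [624, 392, 436],
    16: [666, None, 252],
    17: [456, 403, 581],
    18: [430, 339, None],
    19: [1061, 56, 53],
    20: [3514, 734, 219],
    21: [49, 59, 56],
}


def best_rtt(samples):
    """Lowest RTT among the probes that got a reply, or None if none did."""
    replies = [s for s in samples if s is not None]
    return min(replies) if replies else None


def path_suspects(trace, threshold_ms=100):
    """Hops whose best RTT is above the threshold at that hop AND at every
    later hop that replied. A slow hop followed by a fast one is not listed."""
    hops = sorted(trace)
    suspects = []
    for i, hop in enumerate(hops):
        later = [best_rtt(trace[h]) for h in hops[i:]]
        later = [r for r in later if r is not None]
        if later and all(r > threshold_ms for r in later):
            suspects.append(hop)
    return suspects


print(best_rtt(TRACE[21]))   # 49
print(path_suspects(TRACE))  # []
```

For this trace the list is empty, because the destination itself answered in 49 ms, which is the point the answer makes about hops 14, 19, and 21.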

I understand all that. The condition can last a very short time so subsequent pings look OK. — Hugh Jeffner, Jul 03 '11 at 17:51
@joeq - each hop measurement is an entity. All that can be said is that a packet experienced a delay between a and z. If the path from a to z was a-b-c-d-e-...z then any of the points along the way could have had a problem that is reported as a high RTT for hop z. — dbasnett, Jul 03 '11 at 18:26
@dbasnett: As evidenced by hops 14 and 19, there is no issue at any prior hops that is affecting the path. The destination hop 21 reports response times of 49, 59, and 56ms demonstrating that none of the hops downstream of the destination are having any problem forwarding traffic. If they were then all hops subsequent to the problem hop would exhibit the same symptoms. The hops displaying high response times tells me that they're giving ICMP traffic directed at themselves low priority. — joeqwerty, Jul 03 '11 at 23:19
Yeah that's a bad example of the trace I am seeing but after repeated tests, there is never latency > 100ms before hop 9. But as you said, ICMP isn't reliable for this sort of thing. — Hugh Jeffner, Jul 05 '11 at 13:17

score 1 · Answer 3 · answered Jul 03 '11 at 18:11

I have to agree with joeqwerty: ICMP long ago stopped being a reliable measure of performance, latency, or throughput. This is especially true for routes with a lot of hops over unknown networks.

A more realistic test would be one with the protocol(s) that you are using. For example, if it were HTTP, you could set up a Wireshark network packet capture. Filter on the conversation with the specified host, and use Wireshark's Statistics > TCP Stream Graph > Round Trip Time Graph. This test is more accurate if you perform the capture for at least several minutes.
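
If a packet capture is more than you need, a rough alternative is to time TCP connections to the server your application uses and log the results over several minutes. This is only a sketch: it measures the TCP handshake (plus name resolution if you pass a hostname), not full application latency, and the function names, sample count, and interval are assumptions.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Time one TCP connection setup, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000


def sample(host, port, count=60, interval=1.0):
    """Collect connect times; failed attempts are stored as None."""
    results = []
    for _ in range(count):
        try:
            results.append(tcp_connect_ms(host, port))
        except OSError:  # timeouts, refused connections, DNS failures
            results.append(None)
        time.sleep(interval)
    return results


def summarize(results):
    """Condense a list of samples into a few headline numbers."""
    ok = [r for r in results if r is not None]
    return {
        "sent": len(results),
        "failed": len(results) - len(ok),
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }

# Example call (the target is a placeholder; use your own server and port):
#   print(summarize(sample("google.com", 80, count=300)))
```

A large gap between the median and the maximum during a bad spell is the kind of evidence that is harder to dismiss than a traceroute.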

Another interesting option is PingPlotter Standard (not free, but feature-complete for 30 days). It supports protocol-specific throughput testing by letting you specify the port number, and it graphs round-trip time; the results can be saved and loaded.

I agree. The only way to quantitatively (or qualitatively) measure throughput, latency, packet loss, or otherwise determine the quality of a given path is to do so with real traffic (HTTP, FTP, etc), not with ICMP. — joeqwerty, Jul 03 '11 at 23:09

score 0 · Answer 4 · answered Jul 03 '11 at 17:17

Pings and trace route (pings with a specific TTL) are temporal. What you see at some specific instant in time is just that, and has nothing to do with past or future events.

Part of the bandwidth of the internet (2% ish) is ping traffic, which, unless you are an internet backbone person, serves no real purpose. If you have a problem call your ISP.

dbasnett

I am trying to collect evidence of a problem for my ISP so they can't just say "It's the game server". – Hugh Jeffner Jul 03 '11 at 17:51

score 0 · Accepted Answer · answered Aug 10 '11 at 21:56

After more testing, this is an issue with UDP latency.

The reason the high latency coincides with poor application performance seems to be a CPU-bound host. A packet with an expired TTL requires CPU time on the router to craft an ICMP response, and thus most routers are configured to "answer when I feel like it". The latency on expired-TTL ICMP traffic is an indication of a busy router in this case. The host seems to be on the edge of the ISP's network, so a large chunk of all the traffic is going through that hop.

I highly suspect the ISP is doing some kind of traffic inspection or shaping of UDP traffic, which also requires CPU time.
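
For anyone wanting to check UDP latency separately from ICMP, here is a minimal sketch of a probe. It assumes you have a UDP echo service running on some host beyond the suspect hop; that endpoint, the function names, and the 100 ms threshold are all hypothetical and not something from my own testing.

```python
import socket
import time


def udp_rtt_ms(host, port, payload=b"probe", timeout=2.0):
    """Send one datagram to a UDP echo service and time the first reply.
    Returns None if the send fails or nothing comes back before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(payload, (host, port))
            sock.recvfrom(2048)
        except OSError:  # includes timeouts and resolution errors
            return None
        return (time.perf_counter() - start) * 1000


def spikes(samples, threshold_ms=100):
    """Indexes of probes that were lost or slower than the threshold."""
    return [i for i, s in enumerate(samples) if s is None or s > threshold_ms]

# Example (echo.example.net:7 is a placeholder for an echo host you control):
#   samples = [udp_rtt_ms("echo.example.net", 7) for _ in range(100)]
#   print(spikes(samples))
```

Running this alongside a plain ping to the same host would show whether UDP is being delayed while ICMP echo is not.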

score -1 · Answer 6 · answered Jul 04 '11 at 08:53

Make sure to reduce your buffers on the saturated link.

Teddy
