Sporadic high latency on my home network

2

tl;dr My home network has recently been experiencing jumps from 27ms latency to 600ms. It doesn't always happen, and it seems to occur most often at night. What equipment should I buy and what tests should I run to deduce the cause?

Setup

My home has 12Mb/800kb DSL. I live in the mountains, far away from other Wi-Fi sources. Historically (for years) I could ping google.com and get ~27ms times. If something was flooding the network or connection (an iPhone syncing all photos with iCloud) pings would jump into the 2000-6000ms range. But normally everything was good.

Recently, however, the network stays pegged around 600ms for tens of minutes at a time. I cannot find any device that is flooding the network. (It may exist, but I haven't found it.) The connection is generally completely fine in the morning, and generally persistently bad at night (just when we want to stream shows in bed!)

During high latency times pings to other devices on the network (some that I've tried) are unchanged (always <2ms).

Failed and Confusing Troubleshooting

I have purchased all new hardware (DSL modem, Wi-Fi routers, network switches) to rule that out. The problem persists. Here is the setup:

Phrogz's home network

I have tried using the DSL Modem as the router (PPPoE + DHCP + NAT) with the Wi-Fi base stations in bridge mode. I have tried putting the DSL Modem in transparent Bridging mode and having the first Airport Extreme handle PPPoE, DHCP, and NAT. The problem persists.

I have disconnected all wired connections (leaving only the DSL modem and the Wi-Fi base station). The problem persists.

I have used only the DSL Modem (with PPPoE) and used its own Wi-Fi. The problem persists. I have attempted to hunt down every old tablet, phone, laptop on the Wi-Fi and turn them off. The problem persists. I have renamed the Wi-Fi SSID and put a password on it, connecting a single MacBook Pro laptop over Wi-Fi. The problem persists. I have used a different laptop over Wi-Fi. The problem persists.

I have connected a laptop directly to the modem over Ethernet, with Wi-Fi disabled on the modem and nothing else connected. The problem goes away! (I think...it *could* be that the problem just was not exhibiting itself on the three occasions that I tested this.)

At one point, with just a laptop connected over Ethernet, I turned Wi-Fi on for the modem and the problem exhibited itself. Ping latency immediately jumped as soon as I turned on Wi-Fi, though I do not believe that any devices were connected over Wi-Fi.

I have used iStumbler and there does not appear to be any correlation between the bad latency and increases in noise. Indeed, the SNR looks good consistently over Wi-Fi.

Remember that when things are bad they are not ALWAYS bad. Even with every device in the house turned on and connected, there are times when the latency will drop to 30ms or so for a few seconds (or minutes, or hours) before getting bad again.

Next Steps?

I think that iStumbler has shown me that the problem is not related to RF problems. (Maybe I'm wrong?) So I'm thinking it must be real traffic on the network.

The Airport Extreme base station does not support any sort of SNMP logging. Neither does the Actiontec C1000A. I don't have a switch with a monitor port, or a hub. I've never used Wireshark before.

BUT I AM WILLING TO THROW MONEY AND TIME AT THIS PROBLEM TO SOLVE IT

What should I buy? Where should I inject it into my network? What should I look for? How can I watch every packet on the network and build histograms and graphs to determine if one bad device is ruining the situation for everyone?


Edit 1: DSL Statistics when everything is fine

+-----------------+-------------+
|   Connection    |   Status    |
+-----------------+-------------+
| DSL Downstream: | 15.869 Mbps |
| DSL Upstream:   | 0.896 Mbps  |
+-----------------+-------------+

DSL Link Statistics

+------------------------------+---------------------+
|        Link Statistic        |       Status        |
+------------------------------+---------------------+
| Broadband Mode Setting:      | Auto Select         |
| Broadband Mode Detected:     | VDSL2 - 8A          |
| DSL Link Uptime:             | 0 Days, 10H:39M:57S |
| Retrains:                    | 1                   |
| Retrains in Last 24 Hours:   | 1                   |
| Loss of Power Link Failures: | 0                   |
| Loss of Signal Link Failure: | 0                   |
| Loss of Margin Link Failure: | 0                   |
| Link Train Errors:           | 0                   |
| Unavailable Seconds:         | 23                  |
| Estimated Loop Length:       | 2250                |
| Uncanceled Echo:             | N/A                 |
| Transport Mode:              | PTM                 |
| Path Parameter:              | 201                 |
| Priority:                    | 0                   |
| Service Type:                | PTM-Tagged          |
+------------------------------+---------------------+

DSL Power

+--------------+-------------------------+------------------------+
|    Levels    |       Downstream        |        Upstream        |
+--------------+-------------------------+------------------------+
| SNR:         | 16 dB                   | 10 dB                  |
| Attenuation: | (DS1)21.7, (DS2)58.8 dB | (US1)4.3, (US2)47.8 dB |
| Power:       | 16.4 dBm                | 7.8 dBm                |
+--------------+-------------------------+------------------------+

DSL Transport

+----------------------+------------------+---------------+
|      Transport       |    Downstream    |   Upstream    |
+----------------------+------------------+---------------+
| Packets:             | 1482864          | 1088249       |
| Error Packets:       | 0                | 0             |
| 24 Hour Usage:       | 1225940.68 Mbits | 2420.93 Mbits |
| Total Usage:         | 1225940.68 Mbits | 2420.93 Mbits |
| 30 Minute Discarded: | 0                | 3930          |
+----------------------+------------------+---------------+

DSL Channel

+----------------+-------------+-------------+
|    Channel     |  Near End   |   Far End   |
+----------------+-------------+-------------+
| Channel Type:  | Interleaved | Interleaved |
| CRC Errors:    | 0           | 0           |
| 30 Minute CRC: | 0           | 0           |
| RS FEC:        | 5873        | 29          |
| 30 Minute FEC: | 372         | 0           |
+----------------+-------------+-------------+

Edit 2: DSLReports Bufferbloat report

Running the speedtest during otherwise-normal latency indicates that the problem occurs during uploading

Graph showing bad bufferbloat during uploading


Ping times at night and overnight

The spike around 10:35pm was one computer starting to upload to Dropbox.



Edit 3: ISP tech support said:

Modem is getting more signals that it is suppose to. If the cables are not enough to carry the load we are sending we can lower it down to 100%. To test this is for me to lower down the signal for 7 days and you can observe if the browsing \ internet is better. After 7 days our server would run test and would boost your signals up again. And by that time we would have enough figures what to do next.

Our server is provisioning you more than your purchase. Technically this should make the internet faster but if pings and delay that are caused by traffic are observed by the customer. We can bring it to the purchased speed\ signal and observe if the DSL line on the customers premise are cable to carry the load.

Actual/Provisioned/Purchased speeds
Down: 15868/15872/12128kbps
Up: 896/896/896kbps

Phrogz

Posted 2015-10-19T17:19:18.653

Reputation: 850

Have you tried using a faster DNS server? Even with your iPhone syncing wirelessly those ping times are not actually explained. – Ramhound – 2015-10-19T17:22:42.853

Has the modem been replaced? What modem is it? What are your ADSL stats? – Linef4ult – 2015-10-19T17:23:05.270

@Linef4ult Yes, I replaced the modem. It was an Actiontec Q1000, and I replaced it with an Actiontec C1000A. I'm not at home at the moment, but when I get there: could you please clarify what sort of ADSL stats you are looking for? – Phrogz – 2015-10-19T17:24:45.907

It's the line statistics. A degraded link to the DSLAM (the modem on your ISP's side) could cause bursts of errors and thus intermittent issues like this. Pastebin the contents of the page that looks like this: http://screenshots.portforward.com/routers/Actiontec/C1000A_CenturyLink/DSL_Status.jpg – Linef4ult – 2015-10-19T17:27:46.447

@Linef4ult Thank you! Will do in ~7 hours. I hope this is the case (that it's the provider's/line's fault). The fact that I thought I've seen a situation where Ethernet-only fixed the problem and adding Wi-Fi caused it to go bad fills me with FUD that the problem is on my side. We'll see! – Phrogz – 2015-10-19T17:29:27.923

@Phrogz That bit doesn't make sense, but this covers all the symptoms, so let's see. Tag me in a comment whenever you post them and I'll check back. – Linef4ult – 2015-10-19T17:33:06.463

I don't see any basic network diagnostics in here, such as determining where the latency is coming from... Throwing money at the problem should be done only after you know where the problem actually is. Start by posting results of pings and traceroutes between various devices until you pin down the problem. – qasdfdsaq – 2015-10-19T20:36:01.627

@qasdfdsaq I will embark upon rigorous testing tonight. I'm experiencing the latency on every device (that I can ping with) to the first hop on the other side of my DSL modem. I have not yet proven it (will tonight) but I believe that in-LAN latency is fine. – Phrogz – 2015-10-19T21:10:30.990

If the latency is fine in your network and high on the first hop on your ISP then the problem is with your ISP (especially if it's worse at peak time, classic ISP congestion symptoms). The only thing you can do about that is phone them and complain, or switch ISP. Nothing you can change in your home will make any difference. – qasdfdsaq – 2015-10-19T22:04:50.417

@qasdfdsaq No? With switches involved, pinging from laptop B on Wi-Fi to computer C on Ethernet would be unaffected by massive problems that might be caused by device D going through router A onto the Internet. Right? Everyone (including laptop B) would experience problems as soon as they touch the main DSL modem/router, as the pipes get clogged, but that doesn't necessarily mean that it's only the fault of the ISP or the lines. I hope that it's their fault, but I don't believe that a good ping between two random devices on the LAN necessarily means the problem is not elsewhere in my house. – Phrogz – 2015-10-19T22:45:02.537

@Linef4ult I've edited the question with DSL statistics. – Phrogz – 2015-10-20T01:34:40.367

Your words, not mine. I said IF the latency is fine on your network - the only way to know that is to test every link including pinging the router and modem. If you haven't done that, then you wouldn't know whether in-LAN latency is fine or not. – qasdfdsaq – 2015-10-20T10:26:19.900

Answers

3

The symptoms you've reported sound like a bufferbloat problem, where your router, DSL modem, or your ISP's DSLAM buffers too many packets when the link is congested, resulting in high latency. Typically, TCP looks for dropped frames as evidence of congestion, and backs off. But if your router or modem or DSLAM buffers forever and never lets a frame drop, you end up with huge latency increases without TCP getting a chance to back off to relieve the congestion. You should never have a huge latency increase just because your upstream or downstream bandwidth is saturated. If you do, you almost certainly have bufferbloat.
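To make the "buffers forever" point concrete, here is a minimal sketch of a FIFO buffer in front of a slow uplink. It is a toy model, not a description of any particular modem: the 1000 kbps offered load and the 16000-byte buffer limit are hypothetical, the 896 kbps uplink comes from the question, and the senders here never back off the way real TCP would after a drop.

```python
def simulate_queue(arrival_kbps, link_kbps, buffer_limit_bytes, seconds):
    """Simulate a FIFO buffer in 1-second steps.

    Returns (queueing_delay_ms, dropped_bytes) at the end of the run.
    buffer_limit_bytes=None models a device that never drops a frame.
    """
    queued = 0.0   # bytes currently waiting in the buffer
    dropped = 0.0  # bytes discarded because the buffer was full
    arrive_per_s = arrival_kbps * 1000 / 8
    drain_per_s = link_kbps * 1000 / 8
    for _ in range(seconds):
        queued += arrive_per_s                   # new traffic arrives
        queued = max(0.0, queued - drain_per_s)  # the link drains what it can
        if buffer_limit_bytes is not None and queued > buffer_limit_bytes:
            dropped += queued - buffer_limit_bytes
            queued = buffer_limit_bytes
    # Time a newly arriving byte waits behind everything already queued.
    delay_ms = queued / drain_per_s * 1000
    return delay_ms, dropped


# Hypothetical load: 1000 kbps offered to an 896 kbps uplink for one minute.
bloated_delay, bloated_drops = simulate_queue(1000, 896, None, 60)
small_delay, small_drops = simulate_queue(1000, 896, 16000, 60)
print(f"unbounded buffer: {bloated_delay:.0f} ms delay, {bloated_drops:.0f} bytes dropped")
print(f"16000-byte buffer: {small_delay:.0f} ms delay, {small_drops:.0f} bytes dropped")
```

With the unbounded buffer the delay keeps growing for as long as the overload lasts and nothing is ever dropped. With the small buffer the delay is capped, and the drops are the signal that lets TCP slow down.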

Run the dslreports.com speed test tool. Unlike other speed test tools, this tool also measures and reports bufferbloat problems, which can cause high latency whenever something is using all of your downstream or upstream bandwidth (like when you decide to stream video at night).

The fact that you've already proven that your latency jumps when something is using all of your upload bandwidth (your iCloud Photo sync example) is a good indication that you're suffering from bufferbloat problems.

Your DSL modem is probably the source of any upstream bufferbloat problems. One solution might be to buy a DSL modem known to not have bufferbloat problems. I haven't researched this market though, so I can't help you with any suggestions. Your Google-fu is probably as good as mine.

Alternatively, consider buying a home gateway that can run CeroWrt, OpenWrt, or DD-WRT, all of which now have anti-bufferbloat technologies such as FQ_CoDel that were first pioneered/developed in CeroWrt. Use a box like that to artificially limit your upstream and downstream bandwidth to something slightly slower than what your DSL link is actually capable of. When that limit is hit, the box drops frames and sends Explicit Congestion Notifications (ECN) instead of buffering forever, which lets TCP detect the congestion and back off as it is supposed to.
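As a rough illustration of "slightly slower than what your DSL link is capable of", the sketch below derives shaper limits from the modem's reported sync rates. The 0.9 fraction is my assumption, not something measured on this line, and `shaper_limits` is a hypothetical helper rather than part of any *Wrt firmware; you would tune the fraction by re-running the bufferbloat test.

```python
def shaper_limits(sync_down_kbps, sync_up_kbps, fraction=0.9):
    """Return (down, up) shaping rates in kbps, set below the sync rates.

    fraction is an assumed starting point; adjust it after testing.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(sync_down_kbps * fraction), int(sync_up_kbps * fraction)


# Sync rates taken from the DSL statistics in the question (15.869 / 0.896 Mbps).
down_kbps, up_kbps = shaper_limits(15869, 896)
print(f"shape downstream to {down_kbps} kbps and upstream to {up_kbps} kbps")
```

The point of undershooting is that the queue then builds in the box that manages it well, rather than in the modem.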

You don't necessarily have to ditch your DSL modem or your AirPort Extreme to install this *Wrt box; you can install it as a wired box between your DSL modem and your first AirPort Extreme. Just make sure that all the traffic to/from your home network goes through this box. That is, make sure you don't have any devices directly attached to the DSL modem other than this *Wrt box.

If you know you have bufferbloat, you should probably eliminate it before looking for other potential sources of latency spikes, otherwise it will hinder your attempts to find other sources of latency.

Spiff

Posted 2015-10-19T17:19:18.653

Reputation: 84 656

This looks like it may be spot on. See the graph added to the bottom of my question: when the network was running normally I ran DSL reports and it showed bad bufferbloat when uploading. Could this problem be on the ISP's side? The problem started after a power and service outage; could they have fixed the problem poorly? – Phrogz – 2015-10-20T04:28:09.050

@Phrogz Bufferbloat exists on the device where the buffer queues build up, which is typically on the last box before the slowest link. The slowest link is typically your broadband link to your ISP. So your upload bufferbloat is likely to be in your DSL modem. The only way the outage could have triggered this is if your phone line conditions became worse after service was restored, making upload slower and thus easier to overload. – Spiff – 2015-10-20T05:37:14.520

Thanks, Spiff. Your diagnosis is looking more likely. I reconfigured the network so that the DSL modem was doing PPPoE/NAT/DHCP (but no Wi-Fi) and put all other Wi-Fi APs into bridging mode. I connected one computer directly to the modem via Ethernet. When the problem was exhibiting itself (500ms pings seen on that direct-connected-computer) I pulled the Ethernet cable to the rest of the network. Instantly the problem got better. So now I still need to figure out how to determine the in-house culprit, and then to inject an OpenWRT box into my network. – Phrogz – 2015-10-21T15:06:50.930

@Phrogz If I were you, I'd take a small semi-manageable gigabit switch that supports port mirroring (like a Netgear GS105Ev2 for $40), plug it in between the Actiontec C1000A DSL modem/gateway and the first AirPort Extreme, and use port mirroring and a machine running Wireshark to capture all the traffic going to/from the Internet. For best results, keep the C1000A in NAT mode and the AirPort Extreme in bridge mode during this test; that way the IP address of the culprit box won't be hidden behind the AirPort Extreme's NAT. – Spiff – 2015-10-22T00:01:54.970
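Once a capture exists, one way to spot the heavy uploader is to total the bytes each LAN address sends toward the Internet. The sketch below assumes the capture was exported from Wireshark as CSV with the default columns (No., Time, Source, Destination, Protocol, Length, Info) and that the LAN uses 192.168.x.x addresses. The sample rows and addresses are made up for illustration.

```python
import csv
import io
from collections import Counter


def bytes_by_source(csv_text, lan_prefix="192.168."):
    """Sum frame lengths per LAN source address for Internet-bound frames."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        src, dst = row["Source"], row["Destination"]
        # Count only frames leaving a LAN host for a non-LAN destination.
        if src.startswith(lan_prefix) and not dst.startswith(lan_prefix):
            totals[src] += int(row["Length"])
    return totals.most_common()  # largest sender first


# Made-up sample in the assumed export format.
SAMPLE = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000","192.168.0.12","162.125.4.1","TCP","1514","upload segment"
"2","0.002","192.168.0.12","162.125.4.1","TCP","1514","upload segment"
"3","0.010","192.168.0.20","8.8.8.8","DNS","74","query"
"4","0.035","8.8.8.8","192.168.0.20","DNS","90","response"
"5","0.040","192.168.0.12","192.168.0.1","ARP","60","local traffic"
'''

for address, total in bytes_by_source(SAMPLE):
    print(address, total)
```

For a real capture you would read the exported file into a string and pass that in instead of SAMPLE. The address at the top of the list is the first device to go and look at.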

Roger that, Spaceman Spiff. Thanks for including a particular model that supported mirroring. Annoying that it requires Windows to turn it on, but that's a small price to pay for the knowledge. – Phrogz – 2015-10-22T03:38:07.283

@Phrogz Oh, does that one not have a web admin interface? Well, the D-Link DGS-1100-05 is similarly priced and seems to have a web UI. There's also a similar TP-Link model but I haven't looked to see if it has a web UI. – Spiff – 2015-10-22T04:43:59.093

2

Something's wrong. The 24hr stats say:

312,600 MBytes Down 247,500 Mbytes Up

You didn't include link rates, but 8A at 2km gives you maybe a 15/5 link. At 5Mb upstream you could only upload around 55GB/24hrs. Even at 10Mb you wouldn't reach 250GB, so don't trust those stats.
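The arithmetic above can be checked with a few lines. This is a minimal sketch that assumes the link runs flat out for the whole period and uses decimal units (1 GB = 10^9 bytes); the 0.896 Mbps figure is the upstream sync rate from the question's DSL statistics.

```python
def max_transfer_gb(rate_mbps, hours=24):
    """Upper bound on data moved at a constant line rate, in decimal GB."""
    bits = rate_mbps * 1_000_000 * hours * 3600
    return bits / 8 / 1_000_000_000


for rate in (5, 10, 0.896):
    print(f"{rate} Mbps for 24 hours: at most {max_transfer_gb(rate):.1f} GB")
```

Even the optimistic 10Mb case tops out at roughly 108GB per day, so a counter reporting about 250GB of upload cannot be right.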

Still, this sounds really like peer to peer/sync/malware on your network is self DOSing.

UPDATE:

Your connection is balanced like an older style ADSL connection (8D 0.5U, 12D 0.7U, 15D 1U) vs. what you'd normally get with VDSL(2) (15D, 3U). This leaves you in a situation where it's very easy to congest your own link.

Anything running on your network can cause an upstream queue, where the modem holds a series of frames that are waiting to be sent because they arrive faster than it can forward them. So, for example, instead of 1ms from your laptop to modem, 20ms from modem to exchange, and 5ms from exchange to website, you have: 1ms from you to modem, 100ms waiting in the frame buffer, 20ms to exchange, and 5ms to site. The more that's sent, the bigger that wait time.
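The wait in the frame buffer follows directly from how many bytes are queued and how fast the uplink drains them. A small sketch, assuming the 896 kbps upstream from the question and purely hypothetical backlog sizes (I don't know how large the C1000A's buffer really is):

```python
def queue_wait_ms(backlog_bytes, uplink_kbps):
    """Time to drain the backlog: bits divided by kbit/s gives milliseconds."""
    return backlog_bytes * 8 / uplink_kbps


def path_latency_ms(lan_ms, queue_ms, access_ms, remote_ms):
    """Add up the per-segment delays used in the example above."""
    return lan_ms + queue_ms + access_ms + remote_ms


idle = path_latency_ms(1, 0, 20, 5)
wait = queue_wait_ms(11200, 896)  # hypothetical 11200-byte backlog
loaded = path_latency_ms(1, wait, 20, 5)
print(f"idle: {idle} ms, with 11200 bytes queued: {loaded:.0f} ms")
print(f"with 64 KiB queued: {queue_wait_ms(65536, 896):.0f} ms of waiting alone")
```

On an uplink this slow, only a few tens of kilobytes of backlog are enough to produce delays of several hundred milliseconds.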

Things to look for:

- Peer to peer (BitTorrent, game launchers)
- Syncing apps: Windows 7/8/10 OneDrive, Dropbox (esp. Camera Sync), iCloud
- Offsite backup like CrashPlan/Backblaze, etc.
- VoIP/video call apps: Skype, TS/Mumble

Anything that sends data out to the web.

Linef4ult

Posted 2015-10-19T17:19:18.653

Reputation: 3 705

+1 This is a very good point. His upstream usage is remarkably similar to his downstream, and I believe that's pretty rare for most users unless you're doing something upload-heavy like seeding a lot of torrents or doing a large online backup or sync or something. – Spiff – 2015-10-20T18:27:16.487

@Spiff So...assuming that I've missed something (that blackhats broke into my house and injected a nefarious heavy-upload device), how can I find/prove that such a device exists, and find its IP and/or MAC address? (I'll leave finding the machine as a separate problem.) – Phrogz – 2015-10-20T22:02:28.647

Your "somethings wrong" seems right. I wonder if it's because the DSL modem is in Transparent Bridging mode at the moment that it's stopped recording things properly. Tonight I'll reset it, move PPPoE/DHCP/NAT back to be its responsibility (instead of the Airport Extreme) and see if that makes the stats not insane. (That will also make it easier for me to grab the stats at a time when the network is in a bad state instead of good.) – Phrogz – 2015-10-20T22:05:17.003

@Phrogz What has your ISP said? Their tests from the MTAU will pick up the vast majority of copper faults, so we can rule that in or out. – Linef4ult – 2015-10-21T09:01:17.433

@Linef4ult I've added updated DSL line stats to my question and also added, at the bottom, what the ISP says. They saw no problem with the line except over provisioning, which they've now dropped to see if it helps. – Phrogz – 2015-10-21T15:04:29.420

@Phrogz That's strange. It's VDSL but using what's basically an ADSL2+ profile (oldschool, bit crap). I'll update my answer now. – Linef4ult – 2015-10-21T16:08:55.223