38

I realise this is very subjective and dependent on a number of variables, but I'm wondering what steps most folks go through when they need to diagnose packet loss on a given system?

Jason Berg
KushalP
  • What's the "system"? Do you mean that you have a single server (or desktop) experiencing packet loss? Or is it a whole network segment? How have you diagnosed this as packet loss (which I'm assuming you mean is network-caused) and not, for example, poor performance on an application server, running out of transient ports or Java heap, or a million other possibilities? – mfinni Nov 30 '10 at 13:41
  • I realise it's a bad problem description. Think of it as being purely academic and hypothetical. Assume it's packet loss, just curious to know what steps most engineers take. – KushalP Nov 30 '10 at 14:25

5 Answers

35

I am a network engineer, so I'll describe this from my perspective.

For me, diagnosing packet loss usually starts with "it's not working very well". From there, I try to find kit as close as possible to each end of the communication (typically a workstation in an office and a server somewhere), then ping as close to the other end as I can and see whether there is any loss. Ideally the target is the remote end-point itself, but sometimes there are firewalls I can't send pings through, so I have to settle for a LAN interface on a router.

If I can see loss, it's usually a case of "not enough bandwidth" or "link with issues" somewhere in between. Find the route through the network and start testing from the middle; that usually tells you which half of the path the problem is in.
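Starting from the middle is essentially a bisection over the hops on the path. A minimal sketch in Python, assuming a hypothetical `probe` callable that returns True when testing towards a hop shows loss, and assuming that loss visible at one hop is also visible at every hop beyond it (routers that rate-limit ICMP can break that assumption):

```python
from typing import Callable, Optional, Sequence


def first_lossy_hop(hops: Sequence[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first hop at which loss appears, or None if no hop shows loss.

    Assumes loss is monotonic along the path: once a hop shows loss,
    every later hop shows it too.
    """
    if not hops or not probe(hops[-1]):
        return None  # far end is clean (or no path given), nothing to bisect
    lo, hi = 0, len(hops) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(hops[mid]):
            hi = mid        # loss already visible here, look earlier
        else:
            lo = mid + 1    # clean up to here, look later
    return hops[lo]
```

The hop list would typically come from a traceroute, and `probe` from whatever ping test reproduces the problem most reliably.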

If I cannot see loss, the next two steps tend to be "send more pings" or "send larger pings". If that doesn't give an indication of what the problem is, it's time to start looking at QoS policies and interface statistics along the whole path between the end-points.
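"More pings" and "larger pings" map to the count and payload-size options of ping. A rough Python sketch, assuming a Linux (iputils-style) ping whose summary contains a line such as "3% packet loss"; the function names and defaults here are illustrative only:

```python
import re
import subprocess


def ping_loss_percent(summary: str) -> float:
    """Extract the loss percentage from ping's summary output."""
    match = re.search(r"([\d.]+)% packet loss", summary)
    if match is None:
        raise ValueError("no packet loss summary found")
    return float(match.group(1))


def run_ping(target: str, count: int = 100, size: int = 56) -> float:
    """Send `count` pings with a `size`-byte payload and return the loss percentage."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-s", str(size), target],
        capture_output=True, text=True, check=False,
    )
    return ping_loss_percent(result.stdout)
```

For example, comparing `run_ping(target, count=1000)` with `run_ping(target, count=1000, size=1400)` shows whether loss only appears with larger packets.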

If that doesn't find anything, it's time to start questioning your assumptions: are you actually suffering from packet loss? The only sure way of finding out is to do simultaneous captures on both ends, either by running Wireshark (or equivalent) on the hosts or by hooking up sniffer machines (probably also running Wireshark or similar) via network taps. Then comes the fun of comparing the two packet captures...
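Once both captures have been exported, the comparison boils down to "which packets were seen at the sender but never at the receiver". A minimal Python sketch, assuming each packet has already been reduced to a hashable key (for example a tuple of addresses, IP ID and TCP sequence number); how the keys are extracted from the capture files is left out, and anything in the path that rewrites those fields (NAT, for instance) would need a different key:

```python
from collections import Counter
from typing import Hashable, Iterable, List


def missing_at_receiver(sent: Iterable[Hashable], received: Iterable[Hashable]) -> List[Hashable]:
    """Return the keys of packets captured at the sender but not at the receiver.

    A Counter is used so that retransmissions (the same key seen twice
    at the sender) are each matched against their own arrival.
    """
    remaining = Counter(received)
    lost = []
    for key in sent:
        if remaining[key] > 0:
            remaining[key] -= 1   # this copy arrived
        else:
            lost.append(key)      # no matching arrival left
    return lost
```

An empty result from captures taken during a bad period is a strong hint that the problem is not packet loss at all.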

Sometimes, what is attributed to "packet loss" is simply something on the server side being noticeably slower (like, say, moving the database from "on the same LAN" to "20 ms away" and using queries that require an awful lot of back-and-forth between the front-end and the database).

Vatine
19

From the perspective of a Linux system, I'll first look for packet loss on the network interface with ethtool -S ethX.

If the counters show drops, increasing the RX ring buffer with ethtool -G ethX rx VALUE solves this most of the time (ethtool -g ethX shows the current and maximum sizes).
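The output of ethtool -S can run to hundreds of counters, so it helps to filter it. A small Python sketch that keeps only non-zero counters whose names look drop-related; counter names are driver-specific, so the pattern below is an assumption and will need adjusting for a given NIC:

```python
import re
from typing import Dict

# Assumed naming pattern; real counter names vary from driver to driver.
SUSPECT = re.compile(r"drop|discard|miss|fifo|no_buffer", re.IGNORECASE)


def suspicious_counters(ethtool_output: str) -> Dict[str, int]:
    """Return non-zero counters from `ethtool -S` output whose names match SUSPECT."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if not sep or not value.isdigit():
            continue  # header line or anything that is not "name: number"
        if int(value) > 0 and SUSPECT.search(name):
            counters[name.strip()] = int(value)
    return counters
```

Running it twice a few seconds apart and comparing the results shows whether the counters are still climbing or are left over from an earlier event.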

Sometimes interrupts are not being balanced because the system is missing the irqbalance service, so check with chkconfig (EL) or update-rc.d (Debian/Ubuntu) that the service is installed and enabled. You can tell that interrupts are not being balanced when /proc/interrupts shows only CPU0 servicing all IRQ channels.
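Checking /proc/interrupts by eye gets tedious on machines with many cores. A Python sketch that sums the per-CPU interrupt counts for lines mentioning a device name; the device string ("eth0" in the usage note) is only an example, and the check for "only CPU0 busy" is a deliberately crude heuristic:

```python
from typing import List


def irq_counts_per_cpu(proc_interrupts: str, device: str) -> List[int]:
    """Sum interrupt counts per CPU for all /proc/interrupts lines mentioning `device`."""
    lines = proc_interrupts.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        if device not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label ("24:"), followed by one count per CPU
        for cpu, value in enumerate(fields[1:1 + ncpu]):
            totals[cpu] += int(value)
    return totals


def only_cpu0_busy(totals: List[int]) -> bool:
    """True when CPU0 has handled interrupts and no other CPU has."""
    return len(totals) > 1 and totals[0] > 0 and not any(totals[1:])
```

Usage would look like `only_cpu0_busy(irq_counts_per_cpu(open("/proc/interrupts").read(), "eth0"))`.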

Failing this, you might need to increase net.core.netdev_max_backlog if the system is passing more than a few gigabits of traffic, and maybe net.core.netdev_budget.
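One way to judge whether these two sysctls are worth touching is /proc/net/softnet_stat, which has one line of hexadecimal columns per CPU. As far as I know, the second column counts packets dropped because the backlog queue was full and the third counts the times the softirq ran out of budget ("time squeeze"); the exact meaning of later columns varies by kernel version, so this Python sketch only reads the first three:

```python
from typing import Dict, List


def parse_softnet_stat(text: str) -> List[Dict[str, int]]:
    """Parse /proc/net/softnet_stat into per-CPU processed/dropped/time_squeeze counts."""
    stats = []
    for cpu, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or unexpected lines
        stats.append({
            "cpu": cpu,
            "processed": int(fields[0], 16),
            "dropped": int(fields[1], 16),       # relates to netdev_max_backlog
            "time_squeeze": int(fields[2], 16),  # relates to netdev_budget
        })
    return stats
```

A growing "dropped" value points at netdev_max_backlog, a growing "time_squeeze" value at netdev_budget.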

If that doesn't work, you could tweak the interrupt coalescing values with ethtool -C.

If there are no packet drops on the network interface, look at netstat -s and see if there are drops in the socket buffers. These are reported with statistics like "pruned from receive queue" and "dropped from out-of-order queue".
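Picking those two statistics out of netstat -s can be scripted. A short Python sketch that looks for the two phrases quoted above and returns the number at the start of each matching line; the exact wording of netstat output can differ between versions, so treat the phrases as assumptions:

```python
import re
from typing import Dict

PHRASES = ("pruned from receive queue", "dropped from out-of-order queue")


def socket_buffer_drops(netstat_output: str) -> Dict[str, int]:
    """Return the counters from `netstat -s` output that indicate socket buffer drops."""
    drops = {}
    for line in netstat_output.splitlines():
        for phrase in PHRASES:
            if phrase in line:
                match = re.search(r"\d+", line)  # the count leads the line
                if match:
                    drops[phrase] = int(match.group())
    return drops
```

An empty dictionary means neither statistic was reported, which on most systems means the counters are zero.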

You can try increasing the default and maximum socket buffer sizes for the appropriate protocol (e.g. net.ipv4.tcp_rmem for TCP).

If the application sets its own socket buffer size, then the application may need configuration changes. If your application has hard-coded socket buffer sizes, complain to your application vendor.

Personally I dislike protocol offloading onto NICs (checksumming, segmentation offload, large receive offload) as it seems to cause more trouble than it's worth. Playing around with these settings using ethtool -K may be worth a shot.

Look at the module options for your NIC (modinfo <drivername>) as you may need to alter some features. To give one example I have encountered, using Intel's Flow Director on a system which handles one big TCP stream will probably harm the efficiency of that stream, so turn FDir off.

Beyond that you are getting into hand-tuning this specific system for its specific workload, which I guess is beyond the scope of your question.

suprjami
4

I would start by using a packet capture tool such as Wireshark (on Windows) or tcpdump (in a Linux terminal).

I would also check the firewall configuration (the host firewall as well as the network firewall).

Khaled
4

Isolate, then eliminate.

Find the smallest subset of paths with the problem. Do this by testing different combinations and/or distilling user reports. Don't forget to factor time into the equation. Maybe the packet loss only affects traffic to a specific network, or maybe only the wireless clients are suffering. Take different traffic types into account (pings may be rate-limited, for example). Find the most reliable and easily repeatable way to test it.

Then eliminate potential causes. Reduce traffic on the links (temporarily), remove interference sources from the spectrum, disconnect certain clients. Eventually you'll find the source of the problem.

You can sometimes take shortcuts by looking at packet dumps or by taking guesses (it's always BitTorrent). Also, tell your professor Server Fault is awesome.

Joris
2

Pings may not show packet loss unless you send large pings! I had packet loss on my network that was invisible until I upped my ping packet size.

On Windows:

ping -n 30 -l <largevalue> <target>

For largevalue I used 40960 (a 40 KiB payload).

For target I used the first few IP addresses from tracert google.com (which were my routers and cable modem). One of the devices further down the chain had terrible packet loss (>60%) for large packets but 0% for small ones. I fixed it by restarting the device, but the cause could also be a cable or something internal that needs replacing.
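To find the size at which loss starts, the same ping can be repeated over a range of sizes. A hedged Python sketch, assuming English-language Windows ping output with a summary like "Lost = 18 (60% loss)"; the function names are made up for illustration, and the `runner` parameter exists so the parsing can be tried without sending any traffic:

```python
import re
import subprocess
from typing import Callable, Dict, Iterable


def windows_ping_loss(output: str) -> int:
    """Extract the loss percentage from Windows ping output."""
    match = re.search(r"\((\d+)% loss\)", output)
    if match is None:
        raise ValueError("no loss summary found")
    return int(match.group(1))


def run_windows_ping(target: str, size: int, count: int = 30) -> str:
    """Run `ping -n count -l size target` and return its text output."""
    result = subprocess.run(
        ["ping", "-n", str(count), "-l", str(size), target],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def loss_by_size(target: str, sizes: Iterable[int],
                 runner: Callable[[str, int], str] = run_windows_ping) -> Dict[int, int]:
    """Map each payload size to the loss percentage observed at that size."""
    return {size: windows_ping_loss(runner(target, size)) for size in sizes}
```

Something like `loss_by_size("192.168.0.1", [32, 1024, 8192, 40960])` (the address is a placeholder) would then show at which size a device starts dropping packets.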

Jonathan