19

We're reviewing Wireshark captures from a few client machines that show multiple duplicate ACK records, which then trigger retransmissions and out-of-sequence packets.

These are shown in the following screen shot. .26 is client and .252 is server.

[Screenshot: Wireshark capture]

What causes the duplicate ACK records?

More background if it helps:

We're investigating network throughput concerns at one particular client site. The perceived issue from a user interface perspective is that data is being transmitted slowly despite an underutilized 1 Gbps WAN connection.

Almost all of the client machines have the same issue; we tested more than 20 machines. We did find two machines that do not have the problem, and we're in the process of identifying what is different in their configuration. On the two machines that do not have the problem, we only ever saw at most one duplicate ACK record. The machines that have the problem usually show three duplicate ACK records. One notable difference is that the machines that work fine all belong to members of the network operations team, and all of the other machines are for "regular" employees. The machines are supposed to be standard, but the network admins could have made changes on their local systems, which is another aspect we're researching.

We tried changing the TcpMaxDupAcks setting on the server but the value we really need is 5 and the valid range is only 1-3.

Server is Windows Server 2003. Clients are all enterprise managed Windows XP. All clients, including the two working ones, have Symantec anti-virus installed.

This is the only client site out of hundreds that has exhibited this problem.

pathping shows 56ms RTT and consistent 0/100 packet loss even from the problem machines.

Thanks,

Sam


4 Answers

25

Note: I'm assuming that this capture was taken on the client machine.

A brief summary on TCP sequencing: TCP reliably delivers streams of bytes between two applications. "Reliably" in this case means that, among other things, TCP guarantees to never deliver out of order data to a listening application.

In-order, reliable delivery is implemented through the use of sequence numbers. Every byte in each stream has a 32-bit sequence number, and each segment carries the sequence number of its first data byte (remember that TCP is effectively two independent streams of data, A->B and B->A). If A sends an ACK to B, the value in the ACK field is the next sequence number A expects to see from B.

From the above, it appears that at least one TCP segment sent from the server to the client was lost. The three duplicate ACKs in sequence are the client's attempt to trigger a fast retransmit. When a TCP sender receives 3 duplicate acknowledgements for the same piece of data (i.e. 4 identical ACKs, for a segment that is not the most recently sent piece of data), it can reasonably assume that the segment immediately after the one being ACKed was lost in the network, and it retransmits that segment immediately.

In this case, the re-transmission gets through, and is identified by Wireshark as out-of-order.
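As a rough illustration of the duplicate-ACK rule described above, here is a minimal Python sketch of the sender-side counting. It is a simplification under stated assumptions: the function name `fast_retransmit_points` and the sample ACK numbers are made up for the example, and a real TCP stack applies extra conditions before it counts an ACK as a duplicate.

```python
DUP_ACK_THRESHOLD = 3  # three duplicate ACKs (four identical ACKs in total)


def fast_retransmit_points(acks, threshold=DUP_ACK_THRESHOLD):
    """Return the ACK numbers at which a sender would fast-retransmit.

    `acks` is the list of cumulative ACK numbers seen by the sender, in
    arrival order. Hypothetical helper for illustration only.
    """
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            # Fire exactly once, when the threshold is reached.
            if dup_count == threshold:
                retransmits.append(ack)
        else:
            # A new ACK value resets the duplicate counter.
            last_ack = ack
            dup_count = 0
    return retransmits


# Made-up ACK numbers: the receiver keeps asking for byte 2000.
print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 5000]))
```

The repeated ACK value is the sequence number the receiver is still waiting for, so that is where the retransmitted segment starts.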

As mentioned by joeqwerty, packet loss is most often caused by congestion. It may also be a result of CRC or other errors on a link, due to a bad interface card, loose cable, etc. I'd look at the stats of every link along the path to see if any are highly utilized and/or are experiencing large numbers of errors.

If you can't see any obvious candidates, perform concurrent packet captures at multiple points along the path to try and isolate where the loss is occurring.

What kind of WAN connection is in use here? Is it a dedicated line? MPLS VPN link? IPsec VPN over the public internet? Something else?

Murali Suriar
  • Thanks for your comments. You're right, the packet capture is from the client. If I understand what you're saying, the duplicate ACKs are not the client doing anything wrong but are actually a trigger from the client that it didn't receive a different record (the one after the ACKs). Is that correct? What things can I look into on the client PC that would cause this? If it's not a client PC problem why would it consistently show up on some clients and not others? – Sam May 06 '11 at 00:54
  • The WAN is "two point to point circuits" between three sites on the east coast and mid-west United States. – Sam May 06 '11 at 00:55
  • That's correct; the DUPACKs are a symptom of packet loss. As to why the issue would occur on some clients and not others, you need to work out what's common to the affected clients. Are they all in the same office? Going through common network infrastructure? (A switch or a link?). One thing that's worth doing is using `mtr` (or `pathping` on Windows) on each of the impacted machines and seeing if there are any common hops along the path to the server which seem to be experiencing packet loss. Do you have a network monitoring system you can use to look at switch port data? – Murali Suriar May 06 '11 at 10:36
4

While you are isolating where the problem is, think of a packet dump as just one of the symptoms... As an analogy, if someone walks into the doctor's office with chest pains, the doc won't spend three hours investigating the nature of the pain. He spends about two minutes on that and then knows that 95% of the causes are either heartburn or angina... In the same way, if you see duplicate ACKs, don't rat-hole on the weeds of the trace right away.

After the connection establishes, slow TCP performance is not always because of transit network problems; sometimes it comes as the result of server CPU or disk limitations... and occasionally because of some issue on a client PC. I have chased my tail for weeks digging into the weeds of wireshark traces only to give up and find the problem relatively quickly with mtr, or by looking at other host metrics such as CPU and disk I/O.

Your first task is to prove whether this is a network issue or a host-level issue. Focus on sending real traffic through your network and prove whether you're queuing / losing / re-ordering it (see Note 1); that is always the bottom line for a potential network issue like this.

I would do a ping sampling for an extended period of time (typically an hour for me) between the client and server while the throughput problem is happening; you can use mtr or ping plotter freeware for this. If you're consistently losing packets at some hop, and all hops afterwards lose as much or more, then you have a potential network suspect. Keep in mind that device ICMP rate-limiting can make some hops appear to lose packets... that's why you want to look for a trend starting from that hop and continuing through those that follow.
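The "loss that persists from one hop onward" rule can be sketched in a few lines of Python. This is only an illustration: `first_suspect_hop`, the `min_loss` cutoff, and the sample percentages are assumptions made for the example, not output from mtr or pathping.

```python
def first_suspect_hop(loss_by_hop, min_loss=1.0):
    """Return the 1-based number of the first hop whose loss persists.

    `loss_by_hop` holds the loss percentage per hop, ordered from client
    to server. A hop is a suspect only if it loses at least `min_loss`
    percent AND every later hop loses as much or more. Returns None when
    no hop qualifies. Hypothetical helper for illustration only.
    """
    for i, loss in enumerate(loss_by_hop):
        later_hops = loss_by_hop[i + 1:]
        if loss >= min_loss and all(later >= loss for later in later_hops):
            return i + 1
    return None


# Made-up sample: hop 2 shows 12% loss but hop 3 shows none, which looks
# like ICMP rate-limiting; the loss that starts at hop 4 carries through.
print(first_suspect_hop([0.0, 12.0, 0.0, 3.0, 3.5, 4.0]))
```

A hop with high loss followed by clean hops is skipped, which is the rate-limiting case mentioned above.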


Note 1: If you are re-ordering traffic, that will show up rather quickly in the expert info field that Wireshark provides.

Mike Pennington
  • Agree that blaming the network by default isn't a good approach. Instrumenting throughout the stack is always good practice. However in this case, the DUPACKs, out-of-order and retransmitted segments do seem to be indicative of some sort of network loss between the two endpoints. – Murali Suriar May 05 '11 at 21:42
  • @Murali Suriar, let's go with your assertion (which has a decent chance of being right)... then what next? You have to isolate **why** there is packet loss. We IT people have mysteriously fallen in love with `wireshark` to the point that we like looking at the microscope far too long. The point I'm making is take a quick glance at the `pcap`, after that you're better off spending cycles on instrumenting packet loss, CPU cycles, and disk I/O than delving deep into the annals of TCP. There is a time to do that, but it normally is not at this stage of analysis. – Mike Pennington May 05 '11 at 21:48
  • @Mike agreed, which is why I suggested looking for errors/utilisation information for devices along the path as a first step. I'm not a big fan of ICMP based diagnostics other than for reachability. As you say, rate limiting and incorrectly configured ACLs/firewalls can make it unreliable; though in an enterprise network (which this sounds like), MTR can often point you in the right direction. The other issue with MTR is that it often only points at one problem; it's entirely possible that there are *multiple* faults along the path, which you won't be able to find until you fix the first one. – Murali Suriar May 05 '11 at 21:55
  • We are not disagreeing, ICMP with TTL-stepping is not a panacea and there can be multiple faults. However, for all its flaws dealing with firewalls and load-balancers, ICMP is the best remote diagnostic we have unless you can run host-level instrumented TCP/UDP sessions on the specific application ports in question... even then you can only say, this socket is retransmitting a lot... but why? 70% of the time, I'm pulling out `mtr` or its ilk, and I've been solving problems the same way for the last 15 years. Once I've focused in on a specific device, then we can look at drop counters – Mike Pennington May 05 '11 at 22:04
  • Thank you for your answer. In my case I know enough to run Wireshark and see that there are duplicate ACKs and that it is a problem. I don't know what the possible causes are and what to look into, hence the question. I'm also not blaming the network. On the contrary, I think it's almost certainly a client PC issue. Most standard-issue enterprise managed PCs exhibit the problem but the two PCs that belong to network administrators do not. My question is what types of things on the client PC can cause duplicate ACK records? – Sam May 06 '11 at 00:39
  • 1
    @Sam: Just a point regarding troubleshooting network problems: every network has "issues". The key is determining whether those issues are causing performance and/or connectivity problems. You'll find duplicate ACK's, TCP Retransmits, broadcasts, errant protocols, etc. on every network. You should focus on the volume of duplicate ACK's and the hosts most involved in sending the duplicate ACK's to determine if that's really a symptom of a larger problem or just the natural operation of the network. If I see 5 duplicate ACK's out of 1,000 packets I'm not going to give it a second thought. – joeqwerty May 06 '11 at 01:17
  • @joeqwerty, thanks, I'm focusing on the duplicate ACK's because that's the only thing I see different in the clients that work (i.e., can stream video smoothly) and clients that don't work (cannot stream smoothly). The ones that work have an occasional duplicate ACK and the ones that don't work have much more regular three duplicate ACKs in sequence. It's what stood out as different. – Sam May 06 '11 at 02:05
  • @Sam, this whole exchange is symptomatic of what I was saying in my post. Quit obsessing about TCP ACKs. You are troubleshooting performance. There are three network-related causes for IP performance problems: packet loss, packet delay, and packet reordering (if TCP). Your job is to find out whether you have any of that, or if there are contributions from the individual hosts. – Mike Pennington May 06 '11 at 02:15
  • @Mike Pennington, I wouldn't say I'm obsessing over the ACKs. I see a problem--certain clients can't stream video smoothly. The ACKs are one difference between the clients that work and those that don't. I don't understand that difference, so I'm asking here. I'm also concurrently investigating five other things that may be totally unrelated to the ACKs that could affect the streaming performance. I still would like to learn more about this particular issue while I investigate everything fully. Murali Suriar's answer actually explained a lot; I'm glad to have learned something new today. – Sam May 06 '11 at 02:39
3

Seeing lots of [TCP segment of reassembled PDU] without ACKs, I'd say those ACKs are likely shown as [TCP Dup ACK ...] due to Selective Acknowledgement (SACK) behavior.

Example:

  • client sends data parts (...,0,1,2,3,4,5,6,...)

  • server acked (0), then received (2,4,3), then (5), then (6) and never got (1)

In the above scenario, the server can legitimately choose to ack the (2-4) range first, then the (2-5) range, then the (2-6) range. While forming each "(A-B) range ack" packet, the server has to specify the last-acked part (0) in the TCP header. Wireshark marks the range-acks (SACKs) as [TCP Dup ACK ...] because they all carry the same last-acked value in the TCP header (Ack=872619 in your case).
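To make the example concrete, here is a small Python sketch that replays it using the same simplified "part number" model. It assumes parts are numbered from 0; `ack_state` is a hypothetical helper written for this illustration and not how any real TCP stack stores its state. Real TCP acknowledges byte sequence numbers, with the Ack field holding the next expected byte.

```python
def ack_state(received):
    """Return (last in-order part, list of (first, last) ranges beyond it).

    The first value plays the role of the cumulative ack; the ranges play
    the role of SACK blocks. Hypothetical helper for illustration only.
    """
    got = set(received)
    last_in_order = -1
    # Advance the cumulative ack while there is no gap.
    while last_in_order + 1 in got:
        last_in_order += 1
    ranges = []
    for part in sorted(p for p in got if p > last_in_order):
        if ranges and part == ranges[-1][1] + 1:
            # Extend the current contiguous range.
            ranges[-1] = (ranges[-1][0], part)
        else:
            ranges.append((part, part))
    return last_in_order, ranges


# Replay the example: (0) is acked, then (2,4,3), (5), (6) arrive; (1) never does.
received = [0]
for batch in ([2, 4, 3], [5], [6]):
    received.extend(batch)
    print(ack_state(received))
```

In each printed state the first value stays at 0 while the range grows, which matches the repeated Ack number that Wireshark flags as a duplicate.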

dubrov
1

Duplicate ACKs in combination with slow network performance sound like a network congestion problem to me. Look at the volume and rate of broadcast traffic on the network. Make sure to look at physical layer and network layer broadcasts as well as multicasts.

joeqwerty