
We've got infrastructure distributed in a few major locations around the world - Singapore, London and Los Angeles. The RTT between any two of these locations is over 150ms.

We've recently upgraded all of the servers to use 1Gbps links (from 100Mbps). We've been running some TCP-based tests between servers at the different locations and have seen some surprising results. These results are completely repeatable.

  1. Los Angeles (100Mbps) to London (100Mbps): ~96Mbps throughput
  2. Los Angeles (100Mbps) to London (1Gbps): ~96Mbps throughput
  3. Los Angeles (1Gbps) to London (100Mbps): 10-40Mbps throughput (volatile)
  4. Los Angeles (1Gbps) to London (1Gbps): 10-40Mbps throughput (volatile)
  5. Los Angeles (1Gbps) to Los Angeles (1Gbps): >900Mbps throughput

It appears that whenever the sender is running at 1Gbps, our throughput suffers very significantly over long links.

The testing approach is extremely simple - I'm just using cURL to download a 1GB binary from the target server (so in the cases above, the cURL client runs on the London server and downloads from LA, so that LA is the sender). This uses a single TCP connection, of course.
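
For reference, a sketch of that kind of test (the hostname and path here are placeholders; -o /dev/null just discards the data so disk speed doesn't skew the result):

    # Run on the London server; downloads from the LA server over a single TCP connection
    curl -o /dev/null http://la-server.example.com/1GB.bin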

Repeating the same tests over UDP using iperf, the problem disappears!

  1. Los Angeles (100Mbps) to London (100Mbps): ~96Mbps throughput
  2. Los Angeles (100Mbps) to London (1Gbps): ~96Mbps throughput
  3. Los Angeles (1Gbps) to London (100Mbps): ~96Mbps throughput
  4. Los Angeles (1Gbps) to London (1Gbps): >250Mbps throughput
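
The UDP runs above used iperf; a sketch of the sort of commands involved (iperf 2 syntax; the hostname and target bandwidth are assumptions):

    # On the London server (receiver)
    iperf -s -u
    # On the LA server (sender), pushing ~950Mbps of UDP for 30 seconds
    iperf -c london-server.example.com -u -b 950M -t 30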

This points squarely at some TCP or NIC/port configuration issue in my eyes.

Both servers are running CentOS 6.x, with TCP cubic. Both have 8MB maximum TCP send & receive windows, and have TCP timestamps and selective acknowledgements enabled. The same TCP configuration is used in all test cases. The full TCP config is below:

net.core.somaxconn = 128
net.core.xfrm_aevent_etime = 10
net.core.xfrm_aevent_rseqth = 2
net.core.xfrm_larval_drop = 1
net.core.xfrm_acq_expires = 30
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.core.wmem_default = 131072
net.core.rmem_default = 131072
net.core.dev_weight = 64
net.core.netdev_max_backlog = 1000
net.core.message_cost = 5
net.core.message_burst = 10
net.core.optmem_max = 20480
net.core.rps_sock_flow_entries = 0
net.core.netdev_budget = 300
net.core.warnings = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_mem = 1528512      2038016 3057024
net.ipv4.tcp_wmem = 4096        131072  8388608
net.ipv4.tcp_rmem = 4096        131072  8388608
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_frto = 2
net.ipv4.tcp_frto_response = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_abc = 0
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_available_congestion_control = cubic reno
net.ipv4.tcp_allowed_congestion_control = cubic reno
net.ipv4.tcp_max_ssthresh = 0
net.ipv4.tcp_thin_linear_timeouts = 0
net.ipv4.tcp_thin_dupack = 0
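
As a side note on the 8MB window caps above, a rough bandwidth-delay product check (assuming ~150ms RTT as per the figures at the top; purely back-of-envelope):

    # Bytes in flight needed to fill the pipe = bandwidth (bit/s) * RTT (s) / 8
    echo "100000000 * 0.150 / 8" | bc     # ~1.9MB for a 100Mbps path - fits within the 8MB cap
    echo "1000000000 * 0.150 / 8" | bc    # ~18.7MB for a 1Gbps path - exceeds the 8MB cap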

Attached are a couple of Wireshark IO graphs of some test cases (sorry, I can't post images directly yet):

Test case 1 (100Mbps -> 100Mbps) - nice smooth transfer. No losses in capture. - http://103.imagebam.com/download/dyNftIGh-1iCFbjfMFvBQw/25498/254976014/100m.png

Test case 3 (1Gbps -> 100Mbps) - volatile transfer, takes a long time to get up to any speed - never approaches 100Mbps. Yet no losses/retransmits in the capture! - http://101.imagebam.com/download/KMYXHrLmN6l0Z4KbUYEZnA/25498/254976007/1g.png

So in summary, when a long link is used with a 1Gbps connection, we get a much lower TCP throughput than when we use a 100Mbps connection.

I'd very much appreciate some pointers from any TCP experts out there!

Thanks!

UPDATE (2013-05-29):

We've solved the issue with test case #4 above (1Gbps sender, 1Gbps receiver, over a large RTT). We can now hit ~970Mbps within a couple of seconds of the transfer starting. The issue appears to have been a switch used by the hosting provider. Moving to a different one solved that.

However, test case #3 mostly remains problematic. If we have a receiver running at 100Mbps and the sender at 1Gbps, then we see approximately a 2-3 minute wait for the receiver to reach 100Mbps (but it does now reach the full rate, unlike before). As soon as we drop the sender down to 100Mbps or increase the receiver to 1Gbps, then the problem vanishes and we can ramp up to full speed in a second or two.

The underlying reason is, of course, that we're seeing losses very soon after the transfer starts. However, this doesn't tally with my understanding of how slow start works; the sender's interface speed shouldn't have any bearing on this, as the sending rate should be governed by the ACKs coming back from the receiver.
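
As a rough sanity check of that 2-3 minute figure (assuming a 100Mbps bottleneck, ~150ms RTT, a 1460-byte MSS, and Reno-like growth of one MSS per RTT after an early loss - CUBIC recovers faster, but the order of magnitude is similar):

    awk 'BEGIN {
        segs = (100000000 * 0.150 / 8) / 1460    # ~1284 segments needed to fill the 100Mbps pipe
        printf "segments to fill pipe: %d\n", segs
        printf "approx ramp time: %.0f seconds\n", segs * 0.150   # ~193s, i.e. roughly 3 minutes
    }'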

Suggestions gratefully received please! If I could offer a bounty here, I would!

Sam
  • Are you using TCP offload on the NIC on either side? Is your usage of TCP offload varying from the 100M to the 1G NIC? If that is in use in any of the test cases, it may be worth repeating the tests with that disabled just to see if the TCP offload engine on the 100M NIC may be getting in the way of how 1G communication performs (this comment is intentionally hand-wavey just to generally bring up TOE) – FliesLikeABrick May 17 '13 at 17:34
  • Good question! TCP segmentation offload is disabled on both ends. Generic segmentation offload is enabled on both ends. I also repeated it with TSO enabled, and it didn't make any noticeable difference. – Sam May 17 '13 at 17:37
  • Try disabling generic segmentation offload, at least on the 100M side, and repeat your tests – FliesLikeABrick May 17 '13 at 18:36
  • Thanks for the suggestion, but no joy - same results with gso on or off on both sides. – Sam May 17 '13 at 21:42
  • 1Gbps at 150ms+ gives a very large Bandwidth Delay Product, over 18Mb. What happens if you bump your socket buffers up? `tcp_*mem = 4096 1048576 33554432` You haven't enabled Jumbo Frames on the 1Gbps links have you? That could be causing fragmentation overhead somewhere. – suprjami May 18 '13 at 13:38
  • I did try that, but to no effect. This makes sense too, as (a) the receiver is still at 100Mbps and (b) the port speed on the sender should not affect TCP behaviour (as far as I understand) - the same TCP settings were able to easily hit approx 96Mbps when the sender was connected at 100Mbps. Jumbo frames are not enabled. It's important to note that literally changing the port speed on the sender is enough to trigger this slowdown - no other options or parameters change whatsoever. – Sam May 19 '13 at 13:43
  • We had something similar where the Maximum Transmission Unit was too high for a chosen link, and strange delays, packet loss and ultimately throughput was affected - have you tried doing MTU tests to see if that's set correctly for the links? – Ashley May 22 '13 at 20:12
  • Thanks, but we've checked the MTUs - they're all at sane values, and adjusting them makes no difference. – Sam May 29 '13 at 17:12
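
For anyone reproducing the offload and MTU checks discussed in the comments above, a sketch (the interface name eth0 and the remote host are assumptions):

    # Show the NIC's current offload settings
    ethtool -k eth0
    # Disable TSO/GSO for a test run, as suggested above
    ethtool -K eth0 tso off gso off
    # Verify the path MTU: 1472 bytes of ICMP payload + 28 bytes of headers = 1500
    ping -M do -s 1472 remote-host.example.com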

3 Answers


The main issue is the big WAN delay. It gets much worse if there is also random packet loss.

1. tcp_mem also needs to be set larger to allocate more memory. For example, set it to net.ipv4.tcp_mem = 4643328 6191104 9286656

2. You can capture packets with wireshark/tcpdump for several minutes and then analyse whether there is random packet loss. You can also upload the capture file if you like.

3. You can try tuning the other TCP parameters, e.g. set tcp_westwood=1 and tcp_bic=1
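
A possible form of the capture suggested in point 2 (a sketch only; the interface name, snap length and address are placeholders):

    # Capture just the headers of traffic to/from the remote end for later analysis
    tcpdump -i eth0 -s 96 -w transfer.pcap host 192.0.2.1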

HarryREN
  • Thanks, but we've tried all of those. The WAN delay is not the issue - we can hit 100Mbps almost immediately if we use 100Mbps ports, but as soon as one changes to 1Gbps then we're toast. – Sam May 29 '13 at 17:12

Solved! For full details see http://comments.gmane.org/gmane.linux.drivers.e1000.devel/11813

In short, it appears the 1Gbps-connected server would send bursts of traffic during TCP's exponential growth phase that flooded buffers in some intermediate device (who knows what). This left two options:

1) Contact each intermediate network operator and get them to configure appropriate buffers to allow for my desired bandwidth and RTT. Pretty unlikely!
2) Limit the bursts.

I chose to limit each TCP flow to operate at 100Mbps at most. The number here is fairly arbitrary - I chose 100Mbps purely because I knew the previous path could handle 100Mbps and I didn't need any more for an individual flow.
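
One way to impose that kind of cap on the sender's egress (a sketch only; the tbf qdisc, interface name and parameters are illustrative assumptions rather than the exact mechanism used, and this caps the whole interface rather than individual flows):

    # Cap outbound traffic on eth0 to ~100Mbps to smooth out the bursts
    tc qdisc add dev eth0 root tbf rate 100mbit burst 300kb latency 50ms
    # Remove the cap again
    tc qdisc del dev eth0 root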

Hope this helps someone in the future.

Sam

Repeating the same tests over UDP using iperf, the problem disappears!

Los Angeles (1Gbps) to London (1Gbps): >250Mbps throughput

The problem does not seem to be gone; roughly 75% of your packets are getting dropped? If TCP goes into slow start all the time, your average bandwidth might be rather low.

Btw, do you have benchmarks for London to LA, and London to London?

Jens Timmerman
  • I forgot to mention that the client is a slow one... If we repeat with two fast clients, then we hit ~970Mbps bi-directionally. – Sam May 29 '13 at 17:10