We've got infrastructure distributed across a few major locations around the world - Singapore, London and Los Angeles. The RTT between any two of these locations is over 150ms.
We've recently upgraded all of the servers to use 1Gbps links (from 100Mbps). We've been running some TCP-based tests between servers at the different locations and have seen some surprising results. These results are completely repeatable.
- Los Angeles (100Mbps) to London (100Mbps): ~96Mbps throughput
- Los Angeles (100Mbps) to London (1Gbps): ~96Mbps throughput
- Los Angeles (1Gbps) to London (100Mbps): 10-40Mbps throughput (volatile)
- Los Angeles (1Gbps) to London (1Gbps): 10-40Mbps throughput (volatile)
- Los Angeles (1Gbps) to Los Angeles (1Gbps): >900Mbps throughput
It appears that whenever the sender is running at 1Gbps, our throughput suffers very significantly over long links.
The testing approach above is extremely simple - I'm just using cURL to download a 1GB binary from the target server (so in the case above, the cURL client runs on the London server and downloads from LA, making LA the sender). This uses a single TCP connection, of course.
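For reference, the test amounts to roughly the following command (hostname and file path are placeholders, not our real ones):

# Run from the London server; LA serves the file and is therefore the TCP sender.
# Writing to /dev/null keeps local disk out of the picture; -w prints the average rate.
curl -o /dev/null -w '%{speed_download}\n' http://la-server.example.com/testfile-1GB.bin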
When we repeat the same tests over UDP using iperf, the problem disappears (an example command follows the results below)!
- Los Angeles (100Mbps) to London (100Mbps): ~96Mbps throughput
- Los Angeles (100Mbps) to London (1Gbps): ~96Mbps throughput
- Los Angeles (1Gbps) to London (100Mbps): ~96Mbps throughput
- Los Angeles (1Gbps) to London (1Gbps): >250Mbps throughput
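The UDP tests look roughly like this (hostnames and the offered rate are illustrative; this is the classic iperf 2 syntax available on CentOS 6):

# On the receiver (e.g. the London box): start an iperf UDP server
iperf -s -u
# On the sender (e.g. the LA box): push UDP at a fixed offered rate for 30 seconds
iperf -c london-server.example.com -u -b 900M -t 30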
This points squarely at some TCP or NIC/port configuration issue in my eyes.
Both servers are running CentOS 6.x with cubic congestion control. Both have 8MB maximum TCP send & receive windows, and have TCP timestamps and selective acknowledgements enabled. The same TCP configuration is used in all test cases. The full TCP config is below:
net.core.somaxconn = 128
net.core.xfrm_aevent_etime = 10
net.core.xfrm_aevent_rseqth = 2
net.core.xfrm_larval_drop = 1
net.core.xfrm_acq_expires = 30
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.core.wmem_default = 131072
net.core.rmem_default = 131072
net.core.dev_weight = 64
net.core.netdev_max_backlog = 1000
net.core.message_cost = 5
net.core.message_burst = 10
net.core.optmem_max = 20480
net.core.rps_sock_flow_entries = 0
net.core.netdev_budget = 300
net.core.warnings = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_mem = 1528512 2038016 3057024
net.ipv4.tcp_wmem = 4096 131072 8388608
net.ipv4.tcp_rmem = 4096 131072 8388608
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_frto = 2
net.ipv4.tcp_frto_response = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_abc = 0
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_available_congestion_control = cubic reno
net.ipv4.tcp_allowed_congestion_control = cubic reno
net.ipv4.tcp_max_ssthresh = 0
net.ipv4.tcp_thin_linear_timeouts = 0
net.ipv4.tcp_thin_dupack = 0
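As a sanity check on those window sizes, here's a back-of-the-envelope bandwidth-delay-product calculation assuming our ~150ms RTT (the sysctl values at the end are purely illustrative - we haven't applied them):

# BDP = bandwidth/8 * RTT
#   100Mbps: 12,500,000 B/s * 0.150s ≈ 1.9MB   -> fits well inside the 8MB caps
#   1Gbps:  125,000,000 B/s * 0.150s ≈ 18.75MB -> larger than the 8MB rmem/wmem caps
echo $(( 1000000000 / 8 * 150 / 1000 ))    # ≈ 18750000 bytes
# If we wanted a single flow to be able to fill 1Gbps at this RTT, the caps would
# need raising to something like (again illustrative, not applied):
# sysctl -w net.core.rmem_max=33554432
# sysctl -w net.core.wmem_max=33554432
# sysctl -w net.ipv4.tcp_rmem="4096 131072 33554432"
# sysctl -w net.ipv4.tcp_wmem="4096 131072 33554432"

That said, an 8MB window at 150ms still allows roughly 8MB / 0.15s ≈ 450Mbps for a single flow, so it doesn't explain the 10-40Mbps figures above.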
Below are links to a couple of Wireshark IO graphs of some of the test cases (sorry, I can't post images directly yet):
Test case 1 (100Mbps -> 100Mbps) - nice smooth transfer. No losses in capture. - http://103.imagebam.com/download/dyNftIGh-1iCFbjfMFvBQw/25498/254976014/100m.png
Test case 3 (1Gbps -> 100Mbps) - volatile transfer, takes a long time to get up to any speed - never approaches 100Mbps. Yet no losses/retransmits in the capture (how we checked is shown below)! - http://101.imagebam.com/download/KMYXHrLmN6l0Z4KbUYEZnA/25498/254976007/1g.png
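For completeness, this is roughly how we checked the captures and the sending host for retransmissions (the capture filename is a placeholder):

# Count segments that Wireshark's analysis flags as retransmissions
# (use -R instead of -Y on older tshark builds)
tshark -r 1g-to-100m.pcap -Y 'tcp.analysis.retransmission' | wc -l
# Kernel-level retransmission counters on the sending host
netstat -s | grep -i retrans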
So, in summary: whenever the sender uses a 1Gbps connection over a long link, we get much lower TCP throughput than when it uses a 100Mbps connection.
I'd very much appreciate some pointers from any TCP experts out there!
Thanks!
UPDATE (2013-05-29):
We've solved the issue with test case #4 above (1Gbps sender, 1Gbps receiver, over a large RTT). We can now hit ~970Mbps within a couple of seconds of the transfer starting. The issue appears to have been a switch at the hosting provider; moving to a different one solved it.
However, test case #3 mostly remains problematic. If we have a receiver running at 100Mbps and the sender at 1Gbps, then we see approximately a 2-3 minute wait for the receiver to reach 100Mbps (but it does now reach the full rate, unlike before). As soon as we drop the sender down to 100Mbps or increase the receiver to 1Gbps, then the problem vanishes and we can ramp up to full speed in a second or two.
The underlying reason is, of course, that we're seeing losses very soon after the transfer starts. However, this doesn't tally with my understanding of how slow start works; the interface speed shouldn't have any bearing on it, since the sending rate should be governed by the ACKs coming back from the receiver.
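One diagnostic we're considering (a sketch only - eth0 and the rate are assumptions, and this is an experiment rather than a fix) is to shape the sender's egress down to roughly the receiver's line rate, to see whether line-rate bursts from the 1Gbps NIC overflowing a buffer at the 1G -> 100M speed-step are what trigger the early losses:

# Temporarily rate-limit outgoing traffic on the sender with a token bucket filter
tc qdisc add dev eth0 root tbf rate 100mbit burst 64kb latency 400ms
# ... re-run the cURL test; if it now ramps up in a second or two, bursts are the likely culprit ...
# Remove the shaper afterwards
tc qdisc del dev eth0 root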
Suggestions gratefully received please! If I could offer a bounty here, I would!