I have an application that is distributing data from New York to Tokyo over TCP, running on Solaris 10. Mean throughput is < 1Mbps; peak throughput can reach 20-30Mbps for seconds at a time, though typical spikes are more like 10Mbps. Individual messages are small (~300 bytes) and consistency of latency is key, so we are trying to eliminate batching: Nagle's algorithm is disabled and the application is configured to send immediately rather than queue-then-send.
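For reference, this is what disabling Nagle looks like at the socket level (a minimal Python sketch; the real application is presumably native code, but the option is the same `TCP_NODELAY` in either case):

```python
import socket

# Create a TCP socket and disable Nagle's algorithm so small
# (~300 byte) messages go out immediately instead of being coalesced.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Non-zero means Nagle is off for this socket.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
```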
The RTT between New York and Tokyo is ~180ms, and the TCP window is tuned for a theoretical throughput in the region of ~40Mbps, i.e. tcp_xmit_hiwat/tcp_rcv_hiwat of 1M. tcp_max_buf and tcp_cwnd_max are also 1M.
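As a sanity check on those numbers, the bandwidth-delay product arithmetic works out as follows (my own calculation from the figures above):

```python
rtt = 0.180          # seconds, New York <-> Tokyo round trip
window = 1_000_000   # bytes, tcp_xmit_hiwat / tcp_rcv_hiwat

# A single TCP connection can have at most one window in flight
# per round trip, so this bounds the sustained throughput.
max_bytes_per_sec = window / rtt
max_mbps = max_bytes_per_sec * 8 / 1e6
print(round(max_mbps, 1))  # ~44.4 Mbps, consistent with "~40Mbps"
```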
The problem is that we frequently but intermittently see mysterious "pauses" where the sender gets EWOULDBLOCK, leading to a buildup in an internal queue and then a subsequent discharge of data. There are two problems here:
- there is no obvious reason for the socket to block: we don't appear to be anywhere near peak throughput, and nothing in the packet captures suggests a slowdown
- during the "discharge period" (i.e. when the sender socket is no longer blocking but there is a buffer of data to send), we see a steadily increasing sawtooth pattern in the message rates
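The EWOULDBLOCK condition itself is easy to reproduce in isolation: with a non-blocking socket, send() fails once the kernel send buffer fills. A minimal sketch (Python, using a local socketpair rather than the real WAN path, so the buffer sizes are whatever the local defaults are):

```python
import socket

# A connected pair of sockets; the receiver never reads, so the
# sender's kernel buffer (plus the receiver's) eventually fills.
sender, receiver = socket.socketpair()
sender.setblocking(False)

msg = b"x" * 300  # ~300-byte messages, as in the application
queued = 0
try:
    while True:
        queued += sender.send(msg)
except BlockingIOError:  # EWOULDBLOCK / EAGAIN: send buffer is full
    pass

print(queued > 0)  # True: some data was accepted before blocking
sender.close()
receiver.close()
```

In the real application the analogous condition is send() returning -1 with errno EWOULDBLOCK; the point is simply that the kernel is reporting a full send buffer, which on a path like this normally means the offered window or congestion window is exhausted, not that the application is at fault.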
The former is the key to the problem: if I can work that out, the latter shouldn't occur. But the latter is still odd; I naively expected it to ramp quickly to peak throughput and stay there until it had cleared the backlog.
CPU utilisation is not a problem at either end; the SAs say the boxes look good. Congestion on the WAN link is also not a problem; the network team says the network looks good. In fact everyone says every individual piece looks fine, yet the whole thing still performs badly!
Any thoughts on how to optimise for this situation, or on things to investigate that might hint at what is going on?