We run a web application serving web APIs for a growing number of clients. Initially, the clients were generally on home, office, or other wireless networks, submitting chunked HTTP uploads to our API. We've now branched out into handling more mobile clients. The files range from a few KB to several GB; they are broken into smaller chunks and reassembled on our API servers.
Our current load balancing is performed at two layers. First, we use round-robin DNS to advertise multiple A records for our api.company.com address. At each IP, we host a Linux LVS (http://www.linuxvirtualserver.org/) load balancer that looks at the source IP address of a request to determine which API server to hand the connection to. These LVS boxes are configured with heartbeatd to take over external VIPs and internal gateway IPs from one another.
Lately, we've seen two new error conditions.
The first error is where clients oscillate or migrate from one LVS to another mid-upload. The new LVS has no record of the persistent connection and sends the traffic to a different API server, thereby splitting the chunked upload across two or more servers. Our intent was for the round-robin DNS TTL value for api.company.com (which we've set to 1 hour) to be honored by the downstream caching nameservers, OS caching layers, and client application layers. This error occurs for approximately 15% of our uploads.
We've seen the second error much less commonly. A client will initiate traffic to an LVS box and be routed to realserver A behind it. Thereafter, the client will come in via a new source IP address, which the LVS box does not recognize, so it routes the ongoing traffic to realserver B, also behind that LVS.
Given our architecture as described above, what are people's experiences with a better approach that would let us handle each of these error cases more gracefully?
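To make the two failure modes concrete, here is a toy model of source-IP persistence. It is only a sketch: the `LvsBox` class, the realserver names, and the IP addresses are made up for illustration, and it assumes both LVS boxes front the same pool and hand unseen source IPs out round-robin, which may not match the real scheduler.

```python
from itertools import cycle


class LvsBox:
    """Toy stand-in for one LVS box with source-IP persistence (hypothetical)."""

    def __init__(self, realservers):
        self._next_server = cycle(realservers)  # round-robin for unseen IPs
        self._persistence = {}                  # source IP -> realserver

    def route(self, source_ip):
        # A known source IP sticks to its realserver; an unseen one
        # gets the next realserver in the rotation.
        if source_ip not in self._persistence:
            self._persistence[source_ip] = next(self._next_server)
        return self._persistence[source_ip]


# Two boxes with independent persistence tables (assumed shared pool).
lvs1 = LvsBox(["api-a", "api-b"])
lvs2 = LvsBox(["api-a", "api-b"])
lvs2.route("203.0.113.9")  # an unrelated client is already on lvs2

# Error 1: the same client moves from lvs1 to lvs2 mid-upload.
chunk1_server = lvs1.route("198.51.100.7")
chunk2_server = lvs2.route("198.51.100.7")

# Error 2: the client stays on lvs1 but shows up with a new source IP.
chunk3_server = lvs1.route("198.51.100.99")

print(chunk1_server, chunk2_server, chunk3_server)
```

In this model the first chunk lands on api-a while the second and third land on api-b, because neither the second LVS box nor the new source IP has an entry in the persistence table that the upload started with.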
Edit 5/3/2010:
This looks like what we need: weighted GSLB hashing on the source IP address.