Until recently, our setup consisted of four web servers sharing sessions via a single server running memcache. We are hosted on the Amazon cloud and had a crash at peak load on two consecutive days: the memcache service went down (the load on our site has been increasing steadily).
So, we took the following measures:
1) Added 2 more servers for storing sessions
2) Set the following variables in the php.ini file on all web servers (a quick check of these settings is sketched after this list):
session.save_handler = memcache
session.save_path = tcp://ip1:port, tcp://ip2:port, tcp://ip3:port
memcache.hash_strategy = consistent
memcache.allow_failover = 1
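
To confirm that the new settings actually took effect on every web server, a minimal PHP check along these lines can be run (nothing here is specific to any particular setup, it only prints the values via ini_get):

<?php
// Print the session/memcache settings this web server is actually using.
$keys = array(
    'session.save_handler',
    'session.save_path',
    'memcache.hash_strategy',
    'memcache.allow_failover',
);
foreach ($keys as $key) {
    echo $key . ' = ' . ini_get($key) . "\n";
}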
Things are running smoothly now. To test availability, we tried randomly killing one of the session servers, and the site kept running (some users got logged out, which is acceptable business-wise at the moment).
But there is one major problem: I expected the load on the memcache servers to be more or less evenly distributed. It is not!
If I look at "Max Network Out (Bytes)" in CloudWatch, the load is roughly in the ratio 10:5:1. In other words, in terms of network bandwidth in and out, the first server is 10 times as loaded as the third, and the second server is 5 times as loaded as the third.
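
One way to see whether the skew comes from the hashing itself (rather than from a few unusually hot sessions) is to push a batch of dummy keys through the same memcache extension and compare per-server item counts. This is only a rough sketch with placeholder host names, assuming the php.ini hash settings also apply to a Memcache object created in a script:

<?php
// Rough check of how keys spread across the three session servers.
// ip1/ip2/ip3 and port 11211 are placeholders for the real servers.
$mc = new Memcache();
foreach (array('ip1', 'ip2', 'ip3') as $host) {
    $mc->addServer($host, 11211);
}

// Write a batch of throwaway keys shaped like session IDs (5-minute TTL).
for ($i = 0; $i < 10000; $i++) {
    $mc->set('test_' . md5(uniqid((string) $i, true)), 'x', 0, 300);
}

// Compare item counts per server; with consistent hashing the dummy
// keys should land in roughly equal numbers on each server.
foreach ($mc->getExtendedStats() as $server => $stats) {
    if (is_array($stats)) {
        echo $server . ' curr_items=' . $stats['curr_items'] . "\n";
    }
}

(The curr_items figures also include the real sessions already stored, so it is the change in the counts, not the absolute numbers, that matters.)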
Any ideas?