I have two servers in AWS. One is a live production server (a multisite WordPress installation with hundreds of sites and about 5,000 users), and the other is a clone of prod that is being configured as a test server. The live one has four array servers and an Elastic Load Balancer, and is connected to a large RDS instance. Until yesterday, I naively thought our caching was handled by APC and a WordPress plugin here and there. But no. It turns out someone here had added AWS's ElastiCache to our live server. For those not in the cloud, ElastiCache is essentially hosted memcached.
Anyway, we tried to enable caching on our test server two days ago and it introduced a really strange bug: a redirect mysteriously appeared on our live site's main admin dashboard, sending it to our test server. As it turned out, when we enabled caching on our test server, it used the same ElastiCache server our live server was using (because test was a clone of live). Once we realized the bug was most likely related to a caching system we didn't know we had, we disabled caching by removing/renaming the object-cache.php file.
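To illustrate what I think happened, here is a minimal Python sketch of two installs sharing one cache. It is a simulation, not WordPress code: the `SharedCache` and `Site` classes, the example domains, and the `key_salt` parameter are all hypothetical. The salt stands in for a per-install key prefix; some memcached drop-ins honor a `WP_CACHE_KEY_SALT` constant for this, but whether our object-cache.php does is an assumption I still need to check.

```python
class SharedCache:
    """Stand-in for a single memcached/ElastiCache node: one flat key space."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value


class Site:
    """Hypothetical WordPress-like install that reads options through the cache."""

    def __init__(self, db_siteurl, cache, key_salt=""):
        self.db = {"siteurl": db_siteurl}   # this install's own database value
        self.cache = cache                  # possibly shared with another install
        self.key_salt = key_salt            # per-install prefix (assumption)

    def _key(self, group, name):
        return f"{self.key_salt}{group}:{name}"

    def get_option(self, name):
        key = self._key("options", name)
        cached = self.cache.get(key)
        if cached is not None:
            return cached                   # cache hit: the database is never consulted
        value = self.db[name]
        self.cache.set(key, value)          # cache miss: prime the cache from the database
        return value


def live_siteurl_after_test_primes_cache(live_salt, test_salt):
    cache = SharedCache()
    test = Site("http://test.example.com", cache, test_salt)
    live = Site("http://live.example.com", cache, live_salt)
    test.get_option("siteurl")              # the test server touches the cache first
    return live.get_option("siteurl")       # what the live server now believes


# Same key space: live picks up the test server's URL.
unsalted = live_siteurl_after_test_primes_cache("", "")
# Distinct prefixes: each install only sees its own entries.
salted = live_siteurl_after_test_primes_cache("live_", "test_")
```

With no prefix, `unsalted` comes back as the test URL even though the live database holds the live URL, which matches the redirect we saw. With distinct prefixes, `salted` is the live URL. Pointing the test server at its own cache node would achieve the same separation.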
Disabling it solved our redirect issue, but suddenly many (not all) of our 5,000 users could no longer log into their individual sites. For some reason, the values in our database did not work for a good percentage of users, forcing them to reset their passwords. Obviously, this is huge with 5,000 users in the mix. So we re-enabled caching on our live instance and decided to fix our cached redirect with a WP configuration change instead (we added define('RELOCATE',true); to the config to override the redirect to our test server).
One of the things we noticed with memcache was that it kept updating our wp_options table with the domain for the test server in place of our live one. In fact, it's still doing it: whenever I run a query to find the test domain string and update it to the live domain, the caching changes it back within a few minutes. Scary. But it looks like our configuration change forces an override for now. The really concerning thing about all this is that memcache seems to be drawing on its own key:value pairs for the user passwords instead of reading them directly from the database. With caching enabled, the users can get in. Without it, many users are forced to reset their passwords.
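One check I could run to confirm this is to compare what the cache holds against what the database holds. A rough Python sketch of that comparison follows. The function name and the sample hashes are made up for illustration; in practice the two dictionaries would have to be filled from wp_users and from whatever keys our object-cache.php uses for cached user records, which I have not verified.

```python
def find_cache_db_mismatches(db_hashes, cached_hashes):
    """Return {user_id: (db_hash, cached_hash)} for every cached entry whose
    hash differs from the database, or whose user is missing from the database."""
    mismatches = {}
    for user_id, cached in cached_hashes.items():
        stored = db_hashes.get(user_id)     # None when the database has no such user
        if stored != cached:
            mismatches[user_id] = (stored, cached)
    return mismatches


# Hypothetical sample data, not real hashes.
db_hashes = {1: "$P$Baaa", 2: "$P$Bbbb", 3: "$P$Bccc"}
cached_hashes = {1: "$P$Baaa", 2: "$P$Bzzz", 4: "$P$Bddd"}

mismatches = find_cache_db_mismatches(db_hashes, cached_hashes)
```

Here user 2 has a different hash in the cache than in the database, and user 4 exists only in the cache. Any user who shows up in a list like this would presumably be one who can log in only while caching is on.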
Does anyone have ideas on how to understand what's going on with memcache in this case, and how to fix it so that the database gets written to properly and password info isn't held only in the cache? To my thinking it's a ticking time bomb. All it would take is one flush_all command to make life very, very painful for most of my users.
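As a read-only starting point, the memcached text protocol has a `stats` command that returns `STAT name value` lines terminated by `END`. Below is a small Python sketch that parses such a response. The host passed to `fetch_stats` is whatever our ElastiCache endpoint turns out to be (11211 is the usual memcached port), and the sample response is invented so the parser can be tried without touching the live node.

```python
import socket


def parse_stats(raw):
    """Turn a memcached 'stats' response into a {name: value} dict of strings."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_ratio(stats):
    """Fraction of get requests served from the cache (0.0 if there were none)."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


def fetch_stats(host, port=11211, timeout=3.0):
    """Send 'stats' to a memcached node and parse the reply. Read-only."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = conn.recv(4096)
            if not chunk:                   # connection closed early
                break
            data += chunk
    return parse_stats(data.decode("ascii", errors="replace"))


# Invented sample response, for trying the parser offline.
SAMPLE = "STAT curr_items 1200\r\nSTAT get_hits 900\r\nSTAT get_misses 100\r\nEND\r\n"
sample_stats = parse_stats(SAMPLE)
sample_ratio = hit_ratio(sample_stats)
```

This only shows how much is in the cache and how heavily it is being used, not which keys hold what, but it never writes to or flushes the cache.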
We are on Nginx with MySQL on RDS, running WordPress 3.4.2.