I have two servers in AWS. One is a live production server (a multisite WordPress installation with hundreds of sites and about 5,000 users) and the other is a clone of prod that is being configured as a test server. The live one has four array servers, an Elastic Load Balancer, and is connected to a large RDS instance in AWS. Until yesterday, I naively thought our caching was being handled via APC and a WordPress plugin here and there. But no. It turns out someone here had added AWS's ElastiCache to our live server. For those not in the cloud, ElastiCache is essentially AWS's hosted memcached.

Anyway, we tried to enable caching on our test server two days ago and it introduced a really strange bug: a redirect mysteriously appeared on our live site's main admin dashboard and sent users to our test server. Once we realized the bug was most likely related to a caching system we didn't know we had, we disabled caching by removing/renaming the object-cache.php file. As it turned out, when we enabled caching on our test server, it used the same ElastiCache server our live server was using (because test was a clone of live).

Disabling it solved our redirect issue, but suddenly many (not all) of our 5,000 users could no longer log into their individual sites. For some reason, the values in our database did not work for a good percentage of users, forcing them to reset their passwords instead. Obviously, this is huge with 5,000 users in the mix. So we re-enabled caching on our live instance and decided to fix our cached redirect with WP configuration changes instead (we added define('RELOCATE',true); to the config to override the redirection to our test server).

One of the things we noticed with memcache was that it kept updating our wp_options table with the domain for the test server in place of our live one. In fact, it's still doing it whenever I run a query to find the string for the test domain and update it to the live domain. Every few minutes, the caching changes it back. Scary. But it looks like our configuration change for now forces an override. The really concerning thing about all this was the fact that it seems memcache is drawing from its own key:value pairs for the user passwords instead of directly from the database. I mean with caching enabled, the users can get in. Without it, many users are forced to reset their passwords.

Does anyone have any ideas for me as to how to effectively understand what's going on with memcache in this case and how to fix it so the database gets written to appropriately and so password info isn't just being held in the cache? To my thinking it's a ticking time bomb. All it would take is one flush_all command to make life very, very painful for most of my users.

We are on Nginx with MySQL on the RDS. WordPress 3.4.2.

user144722
  • Stop letting your test systems touch your production systems. – Michael Hampton Nov 09 '12 at 01:39
  • The systems touched each other because we didn't even know that enabling caching on test was tied to a caching server in the first place, let alone one that ran our live server. Someone else had set it up, not told us, and then left the company. – user144722 Nov 09 '12 at 17:15

1 Answer

Your cache got overwritten with data and session information from the test instance. Use a memcached client to clear your cache. Rebooting the cache cluster might do that as well. Resetting your password also resets your sessions, which is why that was a possible solution.
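As a rough illustration of clearing the cache with a client, the sketch below sends `flush_all` over memcached's plain-text protocol using only Python's standard library. The helper name `flush_memcached` and the endpoint in the usage comment are placeholders, not values from the question. It assumes the machine running it is allowed to reach the cache node on port 11211.

```python
import socket


def flush_memcached(host: str, port: int = 11211, timeout: float = 5.0) -> bool:
    """Send flush_all to one memcached node; return True if it answers OK."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # memcached's text protocol terminates each command with CRLF.
        conn.sendall(b"flush_all\r\n")
        reply = conn.recv(1024)
    return reply.strip() == b"OK"


# Usage (hypothetical endpoint, replace with a real node address):
#   flush_memcached("my-cache-node.example.internal")
```

A cluster with several nodes would need one call per node endpoint. Flushing discards everything held only in the cache, so expect logged-in users to be signed out and the database to take more load while the cache refills.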

That said, your security groups are probably set up wrong. Your test instance should never have been able to connect to the ElastiCache cluster. Memcached does not have authentication, so anyone who can reach the cache servers has access to the data. Check that your security groups aren't set to allow every address in.
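One way to check for wide-open rules is to scan a group's inbound permissions for world-open ranges on the memcached port. The sketch below is only an illustration under stated assumptions: the cache sits behind an EC2/VPC security group, and the rules have already been fetched in the `IpPermissions` shape the EC2 API returns (for example via boto3's `describe_security_groups`). The function name `world_open_rules` is made up for this example.

```python
def world_open_rules(ip_permissions, port=11211):
    """Return the world-open CIDRs that allow inbound traffic on `port`."""
    offenders = []
    for rule in ip_permissions:
        proto = rule.get("IpProtocol")
        if proto not in ("tcp", "-1"):  # "-1" means all protocols and ports
            continue
        if proto == "tcp":
            low, high = rule.get("FromPort"), rule.get("ToPort")
            if low is None or high is None or not (low <= port <= high):
                continue
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                offenders.append("0.0.0.0/0")
        for ip_range in rule.get("Ipv6Ranges", []):
            if ip_range.get("CidrIpv6") == "::/0":
                offenders.append("::/0")
    return offenders


# Example rule set (invented data, same shape as the EC2 API response).
example = [
    {"IpProtocol": "tcp", "FromPort": 11211, "ToPort": 11211,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "10.0.0.0/8"}]},
]
print(world_open_rules(example))
```

An empty result only means no rule is open to the whole internet. A rule that admits a broad internal range, or a security group shared by test and production, would still let the test instance reach the cache.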

  • Also, if they cloned production, the WordPress salts in wp-config are probably the same. Regenerate the salts in wp-config. – Sibin Grasic Oct 27 '13 at 23:37
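A minimal sketch of regenerating the salts offline, assuming the eight standard key and salt constants in wp-config.php. It prints replacement `define` lines to paste over the existing ones on the test clone. The helper name `make_salt_lines` and the character set are choices made for this example. WordPress.org also offers a secret-key generator that serves the same purpose.

```python
import secrets
import string

# The eight key/salt constants WordPress reads from wp-config.php.
KEY_NAMES = [
    "AUTH_KEY", "SECURE_AUTH_KEY", "LOGGED_IN_KEY", "NONCE_KEY",
    "AUTH_SALT", "SECURE_AUTH_SALT", "LOGGED_IN_SALT", "NONCE_SALT",
]

# Deliberately excludes ' and \ so each value is safe inside a
# single-quoted PHP string.
ALPHABET = string.ascii_letters + string.digits + "!@#%^&*()-_=+[]{}<>,.:;?~|"


def make_salt_lines(length: int = 64) -> list:
    """Build one PHP define() line per key, each with a fresh random value."""
    lines = []
    for name in KEY_NAMES:
        value = "".join(secrets.choice(ALPHABET) for _ in range(length))
        lines.append(f"define('{name}', '{value}');")
    return lines


print("\n".join(make_salt_lines()))
```

Changing the salts invalidates existing login cookies on that install, so users have to sign in again. Stored password hashes are not affected.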