Suggestions for swapping out a production server

Question

Background

My team has been investigating an issue in our production environment (see Stack Overflow post). We've looked thoroughly at the application layer (i.e., code, logs, etc.) and have also done some low-level packet sniffing, but to no avail. The odd thing is that this issue only occurs in production. Even odder, the code at the point of failure hasn't been changed in more than a year.

Question

We're now at a point where we need to start exploring other options, one of which is to replace the production environment with a new one. This is where I hope you all can help me out in some way.

I'm looking for suggestions/recommendations for how to swap out the old production environment with the new one as seamlessly as possible. However, for some period, I need the old and new environment to operate in tandem, to validate that the new environment resolves the issue. The new environment would be used by a set of administrators while the old environment would be used by non-administrators. Once we have done our validation, the old environment would be turned off completely.

I was thinking of putting some sort of proxy out in front of the server so that I can redirect the requests as necessary and was looking at Apache Tomcat's Load Balancer application. I'm not sure if this would be the best approach, so I hope someone here can offer some suggestions.

Assumptions

- Only the application servers will be swapped out
- The database server will remain intact; while the two production environments are operating in tandem, they will point to the same database
- We have complete control of the servers

Application Server Technologies

- RHEL 5.7
- Tomcat 6.0

This would be a *LOT* easier if we knew what the environment looked like. You may also want to ask a new question regarding the "acting up" you're seeing: Someone may be able to solve that problem for you. — voretaq7, Jul 22 '11 at 17:13
I added brief information about the situation as context, not as the subject or focus of this question, so let's please try and stay on point. — John, Jul 22 '11 at 17:58
No worries. :) What additional information do you need me to include about the environment? I've mentioned RHEL 5.7 and Tomcat 6.0, but I guess I should have been a little more clear about that being our server technologies. I'll update my post. — John, Jul 22 '11 at 18:17
Still in need of more detail in order to give you a plan of action (mostly database stuff - if they're involved, if it's OK to have both environments hitting a single DB, etc.) but I'll post something generic shortly that should give you a starting point. — voretaq7, Jul 22 '11 at 18:17
Ah ok. Yea, exactly what I had in mind is that only the application servers would be swapped out and the database servers remain intact, and while the two production environments are running in tandem, they'd be pointing to the same database server. I'll add this to my post. — John, Jul 22 '11 at 18:20

score 3 · Accepted Answer · edited May 23 '17 at 12:41

Looking at the SO question I don't know that this is a systems-level problem -- The description over there sounds like an app bug. Either way upgrading your environment is always something it's good to think about, so I'll take a swing :-)

A general plan of action for a major software change or migration usually looks like this (From your SO question, everywhere I say DB/Database you should be thinking about your App2 server):

1. Duplicate your environment as best you can on new hardware (and optionally upgraded software -- latest OS, web server, DB, etc.)
   This can include cloning all your production databases (which is great if you don't have convenient test data).
2. Test the bejeebus out of it to make sure your problem is gone.
   (This part is problematic in your case since you said you haven't been able to reliably reproduce the problem)
3. Clean up the detritus from your testing
4. Pick a convenient time to make the switch-over
   ("convenient" for your users: Unfortunately that typically means 3AM on a Saturday or something equally loathsome for the admin team)
5. Make the switch-over - This includes (roughly in this order)
   - Disconnecting the old environment from the network / disabling user access
   - Putting the old environment into a quiescent state so it's not changing anymore
   - Synchronizing any databases/volatile data to the new environment
   - Doing any tests you can do before you make the new environment live
   - Turning on access to the new environment if the tests pass
     (or being ready to put the old one back)
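
The switch-over sequence above, with its "be ready to put the old one back" provision, can be sketched as an ordered list of steps, each paired with a rollback. This is only an illustrative outline: `run_cutover`, `noop`, and the step names are hypothetical, and the placeholder actions would need to be replaced with whatever commands apply to your environment.

```python
from typing import Callable, List, Tuple

# (name, action, rollback); all three are hypothetical placeholders
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_cutover(steps: List[Step]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    done: List[Step] = []
    for step in steps:
        name, action, _rollback = step
        try:
            action()
        except Exception as exc:
            print(f"step failed: {name}: {exc}")
            # The failed step is not rolled back, only the ones that completed.
            for done_name, _action, rollback in reversed(done):
                print(f"rolling back: {done_name}")
                rollback()
            return False
        done.append(step)
        print(f"step ok: {name}")
    return True


def noop() -> None:
    """Placeholder; replace with real commands for your environment."""


if __name__ == "__main__":
    plan: List[Step] = [
        ("disable user access to old environment", noop, noop),
        ("quiesce old environment", noop, noop),
        ("sync volatile data to new environment", noop, noop),
        ("smoke-test new environment", noop, noop),
        ("enable access to new environment", noop, noop),
    ]
    print("cutover succeeded" if run_cutover(plan) else "rolled back")
```

Pairing every action with its rollback up front forces you to decide, before 3AM on Saturday, how each step gets undone.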

In your case depending on where the funky behavior comes up you may be able to short-circuit most of this around step 3: If your admins are the only ones who see the misbehaving portion of the application then your admins can beat on a testing copy of the environment until they either reproduce the bug or are satisfied that it's gone (and if the bug pops up you're back in application-land).
If the problem is user-facing the only real solution is putting the new stuff out where users can get at it, which basically means going through the whole process.

You also have a few different challenges because you want to run your environments in parallel: If both environments will be writing to a database you will need to take precautions to ensure that either both environments write the same information to their copy of the database (multiplex the connections at your load balancer), or that both environments can safely interact with a single database.
Running in parallel pretty much eliminates the first and third bullets from #5 above (you don't duplicate the back-ends, and the "old" environment keeps running - you just prop up the new one next to it).

In your specific case with identical applications on App1 you may be able to use App2 as a shared database, but that's something you need to think about from a software standpoint (would App2 freak out if it saw multiple hosts talking to it?).

No matter what you do definitely hang on to your old environment for a while without touching it (this can be a longer or shorter while, depending on your particular situation -- For example in my company about 8 hours after a major DB Schema change we've accumulated so much data that we can't roll back: The data loss would be catastrophic and recovery protracted).
Once you're sure the new environment has solved your problem (or at least works as well as the old environment with no new problems) you can turn the old stuff into a development lab.

Thanks for your response, voretaq. This isn't far from what I originally had in mind, and the roll-out strategy is how we typically do things here (i.e., late-night deploys). What I really wanted information about is proxies, as I need this requirement satisfied: "for some period, I need the old and new environment to operate in tandem, to validate that the new environment resolves the issue." Any ideas around that? — John, Jul 25 '11 at 13:20
You may be able to configure a proxy to feed requests to both hosts (depends on the proxy - you'd have to dive into the config details) -- The problem is that you will then have replies from both hosts which need to be dealt with (ignored or passed along). It adds a lot of complexity, so if you are using the same back-end and the front-end software is identical it may be better/easier to make the hard cutover (with provisions to switch back) or round-robin between the two machines (with provisions to take either one out of rotation) — voretaq7, Jul 25 '11 at 15:07
Thanks again Voretaq. Curious, what sorts of complexities are you referring to? I'm looking at using HAProxy and configuring it as a reverse proxy. What I'm concerned about is how proxies and the back-end servers will handle session cookies. I'm about to do a POC in my development environment, so I hope that answers my questions. The idea is to have two production environments running and distributing requests accordingly, almost like load balancing except only the set of administrators would be redirected to the new environment. Also, just an FYI, I've updated my SO post with **Other oddities**. — John, Jul 25 '11 at 16:04
The two big complexity items are "what do I do with the 'extra' reply" (Only one can go back to the original client) and "What happens if there's a back-and-forth conversation?" (which is probably easier to handle, but may get hairy if cookies are involved - a sub-problem of "only one reply can go to the client"). — voretaq7, Jul 25 '11 at 16:12
Ah, I see where you're going with that now. I don't intend a single request from a client to be sent to both servers. It would be just one or the other. Also, to simplify it even further, I only want a set of administrators to go to the new environment. All other requests would go to the old environment. The way I was thinking I would do this is grab the IP addresses of the few administrators who would be testing the new environment and place that in some sort of redirection rule. Does that help? — John, Jul 25 '11 at 16:21
That's much easier and has substantially less complexity - you can do that with a firewall rule (Requests from this subnet go to box A, all others go to box B). If your proxy software lets you route requests based on address you can do it there as well with no issues. — voretaq7, Jul 25 '11 at 16:27
Hi Voretaq, thanks again for your help. Would you say then that a proxy might be overkill if a firewall can do the same thing? — John, Jul 28 '11 at 14:03
@John - definitely. If your firewall can handle the redirection and this is just a temporary thing there's no need to add a proxy into the mix -- just one more thing to break. — voretaq7, Jul 28 '11 at 14:47
Great. Thanks! My last concern is application session cookies, and that's only because I don't know how firewalls or proxies affect them. Is there anything I should be aware of or possible issues that I may encounter? — John, Jul 28 '11 at 16:35
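
The address-based rule discussed in this thread (administrators' addresses go to the new environment, everyone else to the old one) comes down to a lookup like the one below. It is a minimal sketch of the decision only, not of any particular firewall or proxy configuration; the networks, host names, and the `pick_backend` function are made-up placeholders.

```python
import ipaddress

# Hypothetical values: substitute the real admin addresses and backend hosts.
ADMIN_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/28"),      # e.g. an admin subnet
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single admin workstation
]
OLD_BACKEND = "old-app.example.internal:8080"
NEW_BACKEND = "new-app.example.internal:8080"


def pick_backend(client_ip: str) -> str:
    """Return the new environment for admin addresses, the old one otherwise."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in ADMIN_NETWORKS):
        return NEW_BACKEND
    return OLD_BACKEND


if __name__ == "__main__":
    for ip in ("192.0.2.5", "203.0.113.9"):
        print(ip, "->", pick_backend(ip))
```

Because the choice depends only on the source address, a given client should keep landing on the same environment for as long as its address stays the same, so a session cookie issued by one server would not normally be presented to the other.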
