
I help out with a large gaming site in Australia. We run competitions from 7 am local time to 1 am the next day, every day of the week, and we haven't skipped a day since the site launched. Naturally, this makes maintenance extremely hard to schedule, and our staging server ends up as much as 50 commits ahead of our production branch. Usually the main dev has to wake up extremely early to merge branches and make sure everything is working properly.

We have been trying to make our staging site as similar as we can to the production site, but we can only make it so similar.

Our site is built on Laravel, with a Node.js server for realtime functionality. We are using Laravel Forge.

Does anyone have any suggestions on how we could push updates more frequently? We are open to anything.

Giacomo1968
cheese5505

6 Answers


There are a lot of things you could be doing to improve your deployment process. A few of them are:

  • Ensure your code is well tested.

    Ideally you should have 100% unit test coverage, as well as integration testing for every conceivable scenario.

    If you haven't got this, you should probably drop everything and get this taken care of.

    Look into behavior-driven development.

    Having a complete test suite will allow you to...

  • Run continuous integration.

    Whenever someone commits a change, CI can then automatically run the test suite on it. If the test suite passes, it can then deploy immediately (or schedule a deployment). For changes that don't require any significant change to your databases, this alone will save you a lot of time and headache.

    In case of a problem, CI can also give you a one-click rollback.

    CI is much less useful if your test suite isn't complete and correct, as the entire premise rests on being able to validate your code in an automated way.

  • Make atomic updates.

Ideally you should not just be copying new files over the old on the production server. Instead, use a tool such as Capistrano, which copies every file to a new location and then uses a symbolic link to point at the desired deployment. Rolling back is instantaneous, as it involves simply changing the symlink to point at the previous deployment; there is a sketch of this after the list. (Though this doesn't necessarily cover your database migrations.)

    Also look into whether containers such as Docker can help you.

  • Make smaller, more frequent changes.

    Whether you have tests, CI, or nothing, this alone can help you significantly. Every change should have its own git branch, and a deployment should have as few changes as possible. Because changes are smaller, there is less to potentially go wrong during a deployment.

On that note, make changes more isolated whenever possible. If you've made a change to the Omaha game, and it doesn't affect Texas Hold'em, 5-card stud or anything else, then that is the only game that needs to be suspended for maintenance.

  • Analyze anything long-running.

You mentioned some parts of your deployments take a long time. This is probably database schema changes. It's well worth having a DBA look at your database, along with each schema change, to see what could perform better.

    Have a subject matter expert look at any other part of a deployment which takes up large blocks of time.

  • Work odd hours.

    You may already be doing this, but it bears mentioning. Developers (and sysadmins!) should not be expected to work "9 to 5" anymore, especially for a 24x7 operation. If someone is expected to spend the overnight hours babysitting a deployment, fixing any problems, and then keep a daytime schedule, your expectations are unrealistic, and you are setting that person up for burnout.
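
To make the atomic-update idea concrete, here is a minimal shell sketch of the symlink scheme (the paths, repository URL and build steps are hypothetical placeholders, not your actual setup). The one real trick is finishing with `mv -T`, which swaps the `current` symlink in a single atomic rename:

```bash
#!/usr/bin/env bash
# Sketch of a Capistrano-style atomic deploy. All paths and the repo URL
# are hypothetical placeholders; adapt the build steps to your app.
set -euo pipefail

APP_ROOT=/var/www/app                       # assumed layout: releases/ + a "current" symlink
RELEASE="$APP_ROOT/releases/$(date +%Y%m%d%H%M%S)"

git clone --depth 1 git@example.com:you/app.git "$RELEASE"
cd "$RELEASE"
composer install --no-dev --optimize-autoloader   # build the release in isolation

# Point a temporary symlink at the new release, then rename it over
# "current". rename(2) is atomic, so nginx/php-fpm never see a half-switch.
ln -sfn "$RELEASE" "$APP_ROOT/current.tmp"
mv -T "$APP_ROOT/current.tmp" "$APP_ROOT/current"

# Rollback is the same two lines pointed at the previous release directory.
```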

Michael Hampton
  • The simplest solution here is to use deployment scripting in a tool such as Capistrano, and perhaps load balancing as well. – Giacomo1968 Nov 28 '15 at 09:22
  • Thanks for the advice. We will look into this. At the moment we don't have a test suite at all, and I would really like to look into it; however, the site has been in development for over 8 months and is so large it would take more than a week to make one. We are running Laravel Forge, which just symlinks the new version to the folder that nginx is set up for. I'm unable to work odd hours due to school, and the same goes for the other dev. – cheese5505 Nov 28 '15 at 09:47
  • @cheese5505 I know this is frustrating and this does not solve your problem, but when you say this, **“…is so large it would take more than a week to make one.”** that seems patently ridiculous. Any sane development and deployment process would allow a server to be built up in less than a day, or maybe a few hours to an hour. You should really review what you did to build up this pile of unmanageable stuff to avoid this. The problem is not complexity but basic foresight in planning. – Giacomo1968 Nov 29 '15 at 00:37
  • "At the moment we don't have a test suite at all" – fix this **now**, before developing new features. This is your biggest pain point and will be an availability risk. Automated testing will go a long way towards preventing outages and will reduce ops pain significantly. – Josh Nov 29 '15 at 18:10

It seems from what you say that you have a maintenance window from 1 am to 7 am every day, so the issue is not time but convenience. This is normal, and many people just deal with it as part of business.

You could have two (or more) backend systems, with a front end that directs traffic to whichever is currently live. Once you are happy that a release is going to work, you tell the front end to switch to the new system. This should be easy to script and should take very little time.

Now you have a choice: either leave the old system as-is so you can back out, or bring it up to date so it can serve as a spare for the live system until it's time to build and test the next updates.
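
For illustration, a rough sketch of that switch with nginx acting as the front end; the file names and layout are assumptions, not a prescription. The main nginx config would `include /etc/nginx/backends/live.conf`, which is just a symlink to whichever backend definition is live:

```bash
#!/usr/bin/env bash
# switch.sh -- repoint the front end at the chosen backend system.
# Assumes /etc/nginx/backends/blue.conf and green.conf each define the
# same upstream name ("app_backend") with that system's servers.
set -euo pipefail

TARGET=${1:?usage: switch.sh blue|green}

ln -sfn "/etc/nginx/backends/$TARGET.conf" /etc/nginx/backends/live.conf
nginx -t                 # validate the full config before touching traffic
nginx -s reload          # graceful reload: old workers finish, new ones take over
```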

user9517
  • When you differentiate backend from frontend, do you mean completely modular software architecture? Or server architecture such as a load balancer? – Giacomo1968 Nov 28 '15 at 09:19
  • Just something that accepts connections and delivers them to the current primary backend. – user9517 Nov 28 '15 at 11:12

Amending the other answers: you should follow the blue-green deployment model. When you want to release a new version, you deploy it to an internal staging website. Then you can run automated tests against that next-version production site. When the tests pass, you point the load balancer at the new website.

This helps in the following way:

  1. Severe problems are found before the switch, with zero downtime.
  2. Switching to a new version has exactly zero downtime because the new version is already started and warmed up.
  3. You can switch back to the old version at any time because it is still physically running.

All the other problems that you and others have mentioned become less severe when you can deploy at any time in a stress-free manner. The blue-green deployment model is a fairly complete solution to deployment problems.
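
As a minimal sketch of that flow (the health endpoint, `deploy_to` and `flip_frontend_to` are hypothetical names standing in for whatever you already use):

```bash
#!/usr/bin/env bash
# Blue-green release gate: ship to the idle colour, smoke-test it directly,
# and only flip the load balancer once the tests pass.
set -euo pipefail

IDLE_HOST=10.0.0.12                 # the "green" box, not yet serving users
deploy_to "$IDLE_HOST"              # hypothetical: rsync, Forge, CI, whatever you use today

# Warm up and verify the new version before any user traffic reaches it.
# With set -e, a failing health check aborts the release right here.
for _ in $(seq 1 10); do
  curl -fsS "http://$IDLE_HOST:8080/health" >/dev/null
done

flip_frontend_to green              # hypothetical: the load-balancer switch
# If anything looks wrong afterwards: flip_frontend_to blue (still running).
```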

usr
  • We already have a staging server which we use to test, but at the moment production and staging are on different server providers in different locations. We are trying to move production to where staging is as it provides better performance for us. – cheese5505 Nov 28 '15 at 13:19
  • The key is being able to just switch the load balancer over to a proven working version. With the current model you don't have that. – usr Nov 28 '15 at 13:39
  • How good a model this is depends a lot on what the site is doing. If the site is stateless then great, but if it's not stateless you have to work out how you are going to transfer that state on switchover. – Peter Green Nov 28 '15 at 15:02
  • @PeterGreen it's very hard for websites to be stateful because that does not allow for clustering and the state can be lost at any time on redeployment/reboot/crash/bluescreen etc. Therefore, this is very uncommon. – usr Nov 28 '15 at 15:06
  • @usr most websites have state. That state may be stored either in files or in a database. In the latter case the database may be either local or remote. Some upgrades are likely to have an impact on that state meaning upgrading and rollback are not as simple as just switching over the code. – Peter Green Nov 28 '15 at 15:14
  • @PeterGreen this is a separate issue because we are now talking about the database. Yes, your point is valid. It is a good deployment model to only make compatible changes to the database (e.g. add columns as nullable). This model becomes more compelling the lower your tolerance for deployment problems and downtime is. – usr Nov 28 '15 at 15:24
  • @PeterGreen When an upgrade will have an effect on rollback, you are no longer deploying an upgrade but a new version. The dev architect should have designed extensible state stores so that state would not need to be transferred. Blue-green is the most common deployment model I've seen and, unlike others, lends itself nicely to scalable solutions. – Jim B Nov 28 '15 at 16:33

What will you do if your main data centre suffers an outage, which happens at all data centres from time to time? You might accept the downtime, you might fail over to another data centre, you might be running in active-active mode in multiple data centres all the time, or you might have some other plan. Whichever one of those it is, do it when you do releases, and then you can take your main data centre down during a release. If you're prepared to have downtime when your data centre has an outage, then you're prepared to have downtime, so it shouldn't be a problem during a release.

Mike Scott

To add to the previous answers:

  • Use a deployment strategy that allows for rollbacks and instant switching; Capistrano or pretty much any other deployment system will help with this. You could use things like database snapshots and code symlinks to be able to quickly revert to a previous state.

  • Use complete configuration management; don't leave anything managed manually. SaltStack, Ansible and Puppet are examples. They can be applied to Docker container configurations and Vagrant boxes as well.

  • Use HA to make sure you can hand off requests when upgrading a node. If the upgrade fails, simply take the node down, and when it's rolled back, bring it back up; your HA solution will notice and push requests to that node again. HAProxy is an example, but nginx works just fine as well. (There is a sketch of this after the list.)

  • Make sure the application can handle concurrent instances, and use central, versioned data repositories for non-code data that needs to be stored on disk, such as caches. This way you will never have an upgraded application run into cache files from a different version. This would be done on top of purging caches and doing cache warmups, of course. (The caching thing is just an example.)
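
For the HA hand-off mentioned above, a sketch using HAProxy's runtime admin socket (this assumes the socket is enabled with `stats socket ... level admin`; the backend and server names, and the `upgrade_node` step, are placeholders):

```bash
#!/usr/bin/env bash
# Drain one node, upgrade it, then put it back in rotation.
set -euo pipefail

SOCK=/run/haproxy/admin.sock        # assumed stats-socket path

echo "disable server app_backend/web1" | socat stdio "$SOCK"   # no new requests to web1
upgrade_node web1                   # hypothetical: your per-node deploy step
echo "enable server app_backend/web1"  | socat stdio "$SOCK"   # back in rotation
```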

I usually set up workflows where team managers can approve merge requests to a special branch that does all the normal CI stuff, but as an additional last step also starts pushing to production nodes. What you basically do is run a manual CI deploy to a production instance. If that instance doesn't generate invalid responses, breaks, or does weird things to your data, you then mass-upgrade all nodes using your CI solution of choice. This way, if one deployment works, you know all deployments will work for a specific tag/commit.
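
Expressed as a plain script rather than any particular CI product's syntax (the node list, `deploy_node` helper and health URL are all assumptions):

```bash
#!/usr/bin/env bash
# Canary rollout: push one production node, verify it, then roll the rest.
set -euo pipefail

TAG=${1:?usage: canary.sh <git-tag>}
NODES=(web1 web2 web3 web4)

deploy_node "${NODES[0]}" "$TAG"            # hypothetical per-node deploy step
sleep 60                                    # let the canary take live traffic
curl -fsS "http://${NODES[0]}/health" >/dev/null   # set -e aborts here if it's broken

for node in "${NODES[@]:1}"; do             # same tag everywhere else
  deploy_node "$node" "$TAG"
done
```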

Right now, it sounds as if you are running a production application on a single node, with one deployment process, one source and one target. That practically means every single step in that workflow is a point of failure that can by itself break the website. Ensuring that such a thing cannot happen is the basis of all CI, HA and failover processes. Don't run just one node, don't run just one HA process, don't run on just one IP address, don't run just one CDN, etc. It might sound expensive, but putting a duplicate of what you already have in a rack, on a server with its own connection, usually costs less than one hour of downtime on a business site.

John Keates

I generally agree with Michael on all of his points (https://serverfault.com/a/739449/309477).

In my opinion, the first improvement you should make is using a deployment tool (Capistrano).

It will allow you to deploy peacefully, then switch to the newer version instantly. If anything goes wrong, you can switch back to the working version instantly, simply by changing the current symlink to a working version.

And Capistrano is pretty quick to get started with (compared to adopting tests and CI, which is a bigger time investment).

Secondly, if money is not your main issue, you should have an iso-production development server to test your app on before deploying it to production. Use an industrial solution (Ansible, Chef, Puppet) to manage VPS instances.

nahk