How to plan server downtime

Question

I work at a new place that hosts applications serving a specific type of organisation in the country. We usually have downtime. The servers are high-capacity machines. During our last downtime it was discovered that the server had issues handling about 8000 requests per second. The solution was to switch to the backup server, immediately add more RAM to the server, and restart it. Currently we are handling about 15% of the organisations, and I believe that in the next 5 - 10 years this will increase to 50 - 80%.

To me, we can't continue adding RAM, restarting the server and buying high-end servers. I don't know the policy guiding the purchase of servers in this organisation because I am new here. My questions are:

1. What needs to be done to these servers and their applications in order to avoid such downtime and to anticipate heavier loads in the future? I am not very experienced in server management.
2. Since this is not my call or my department, how best should a solution to these issues be presented to management?

I hope I am asking this question on the right Stack Exchange site.

score 3 · Accepted Answer · answered Apr 01 '14 at 08:35

The first question I would ask is: can the application you serve work in a cluster set-up?

If so, both expanding for the future and covering a machine's downtime could be handled by setting up a load-balanced cluster environment.

The way this works (simply put) is that you have a pool of identical servers which all serve the application you are offering. Logically, in front of these machines sits a load balancer (made redundant, so preferably 2 load balancers in a cluster as well).

When a client wishes to connect to the application, this load balancer tells the client which individual server to connect to, based on certain parameters.

These parameters can vary: the load balancer can look at the individual load on the machines and attempt to keep the load on all servers the same, or it can do load balancing the "dumb" way, round-robin style.
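To make the load-based variant concrete, here is a minimal sketch in Python. It assumes "load" simply means the number of active connections per server. The server names and the numbers are made up for illustration, and a real load balancer will have its own metrics and configuration.

```python
# Hypothetical snapshot of active connections per server (names are made up).
active_connections = {"server1": 42, "server2": 17, "server3": 30}


def pick_least_loaded(connections):
    """Return the name of the server with the fewest active connections."""
    return min(connections, key=connections.get)


def assign_client(connections):
    """Send a new client to the least-loaded server and record the connection."""
    server = pick_least_loaded(connections)
    connections[server] += 1
    return server


print(assign_client(active_connections))
```

Each new client lands on whichever server is least busy at that moment, which is what keeps the load roughly even across the pool.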

With Round-Robin you assign:

Client 1 To Server 1
Client 2 To Server 2
Client 3 To Server 3
Client 4 To Server 1
Client 5 To Server 2
Client 6 To Server 3
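
The assignment above can be sketched in a few lines of Python. This is only an illustration of the round-robin idea, not how any particular load balancer implements it. The function name is my own.

```python
from itertools import cycle

servers = ["Server 1", "Server 2", "Server 3"]
clients = [f"Client {n}" for n in range(1, 7)]


def round_robin(clients, servers):
    """Pair each client with the next server in an endlessly repeating sequence."""
    # zip stops when the clients run out; cycle keeps handing out servers in order.
    return list(zip(clients, cycle(servers)))


assignments = round_robin(clients, servers)
for client, server in assignments:
    print(f"{client} To {server}")
```

Note that round-robin ignores how busy each server actually is, which is why I called it the "dumb" way.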

How does this "handle" downtime? Well, it allows you to seamlessly remove a server from the load-balancer pool so it'll go unused. Depending on the load balancer and the software you serve, you might also be able to "drain" a server to different machines, so as to proactively empty a server which needs to go down for maintenance or the like.
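
As a rough sketch of what draining could look like, here is a toy pool in Python. The class and method names are hypothetical and do not correspond to any real load balancer's API. I am also assuming that "draining" means no new clients are sent to the server while existing ones finish on their own.

```python
class ServerPool:
    """Toy model of a load-balancer pool; not any real product's API."""

    def __init__(self, servers):
        # Map each server name to the set of clients currently connected to it.
        self.connections = {name: set() for name in servers}
        self.draining = set()

    def eligible(self):
        """Servers that may still receive new clients."""
        return [s for s in self.connections if s not in self.draining]

    def connect(self, client):
        """Place a new client on the eligible server with the fewest clients."""
        server = min(self.eligible(), key=lambda s: len(self.connections[s]))
        self.connections[server].add(client)
        return server

    def disconnect(self, client):
        """Drop a client from whichever server it was connected to."""
        for clients in self.connections.values():
            clients.discard(client)

    def drain(self, server):
        """Stop sending new clients to a server; existing ones finish normally."""
        self.draining.add(server)

    def remove_if_empty(self, server):
        """Take a draining server out of the pool once nobody is connected."""
        if server in self.draining and not self.connections[server]:
            del self.connections[server]
            self.draining.discard(server)
            return True
        return False
```

You would call drain() on the machine that needs maintenance, wait until remove_if_empty() returns True, and only then take it down, so no client is cut off mid-session.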

How does this allow for seamless expansion to handle higher loads?

You can "just" plug an extra server into your load-balancing pool. This doesn't require you to take a current machine offline to add more RAM or the like, and it is seamless. As soon as the machine is added to the load-balancing pool it'll receive connections and begin serving additional clients. Using this in combination with a "clever" load-balancing mechanism will also take care of any spikes in load you might come across (for instance when serving a web application like a ticket platform). A spike in load can be handled by simply adding a few machines to the pool to serve the extra load you are expecting; they can be removed afterwards once they have been drained.

Hope this is of help.

score 2 · Answer 2 · answered Apr 01 '14 at 10:33

To answer your question, "How to plan server downtime": that is mostly a service level agreement issue. Usually there's a stipulated maintenance window in the contracts, like:

Every second and fourth Tuesday between 20:00 and 24:00 GMT is the scheduled service window for planned maintenance on service XYZ. Planned changes will be published the preceding Monday by 13:00 GMT on website www... and/or e-mail distribution list maintenance@... . Emergency maintenance outside of this service window can be scheduled at the discretion of the service provider...

So look up the service level agreement (SLA) and plan your maintenance according to the terms in your contracts.

Test the planned changes, data migrations and fallback scenario in your test environment first, and only once you have nailed it proceed to the production systems.

The actual content of your question is more like:

How to scale with increased usage?

Typically, dealing with more clients, more users and larger datasets comes down to two options:

Scale Up: buy a larger and faster computer system, which is what you've been doing already: a bigger server with additional CPUs, more memory, more disks, faster storage, faster CPUs etc. This usually works to some extent, although eventually you may reach a point where either your budget won't allow for more or there doesn't exist a single more powerful server for you to buy.
Scale Out: spread the load over multiple servers, rather than a single larger server. The best approach depends on how the actual application functions and how much control you have over that.
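
To get a feel for what scaling out might mean in numbers, here is a back-of-the-envelope sketch in Python using the figures from the question. It assumes load grows linearly with the share of organisations served, which may well not hold. The per-server capacity and the spare machine are invented values for illustration, so substitute your own measurements.

```python
import math

# Figures taken from the question.
current_share = 0.15        # share of organisations served today
observed_peak_rps = 8000    # request rate at which the server struggled

# Assumptions for illustration only, not measurements.
per_server_rps = 4000       # hypothetical safe capacity of one pool member
spare_servers = 1           # extra machine so one can be down for maintenance


def servers_needed(target_share):
    """Estimate pool size, assuming load grows linearly with organisations served."""
    projected_rps = observed_peak_rps * target_share / current_share
    return math.ceil(projected_rps / per_server_rps) + spare_servers


for share in (0.50, 0.80):
    print(f"{share:.0%} of organisations: about {servers_needed(share)} servers")
```

With these made-up capacities the estimate comes to about 8 servers at 50% and 12 at 80%. The point is less the exact numbers than that the answer is "add more machines", not "find an ever bigger one".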

A common first step in a scale-out scenario is a dedicated database server, with the application itself running on a separate server.

Another common approach is having multiple servers, each with an instance of the application and reserved for a specific subset of your users, i.e. Customer A & B on server 1, Customer C & D on server 2 etc.

A common approach for web applications is a load-balancing cluster, with multiple identically configured servers, each running the same version of the web application, and a load balancer that distributes requests evenly over those servers.
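
A minimal sketch of that per-customer split, in Python, could be a simple lookup table. The server names and the function are hypothetical. In practice the mapping might live in DNS, a reverse proxy configuration or the application itself.

```python
# Hypothetical assignment of customers to dedicated application servers.
CUSTOMER_TO_SERVER = {
    "Customer A": "server1",
    "Customer B": "server1",
    "Customer C": "server2",
    "Customer D": "server2",
}


def server_for(customer):
    """Look up the server reserved for a customer; fail loudly if unassigned."""
    try:
        return CUSTOMER_TO_SERVER[customer]
    except KeyError:
        raise ValueError(f"no server assigned to {customer!r}") from None


print(server_for("Customer C"))
```

The upside is that maintenance on server 1 only affects Customer A & B. The downside is that one busy customer cannot borrow capacity from another server.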
