New technologies like Docker, Mesos, and Kubernetes allow much better server utilisation within an organisation. However, I'd like to know how utilisation can be maximised across two datacenters while taking into account the failure of an entire datacenter.

Given this scenario:

Two datacenters (DC-A and DC-B) with an equal amount of compute resources. Both datacenters are running and serving requests in a load-balanced/round-robin fashion, and server utilisation in both datacenters is 80%.

Let's say DC-B fails (physically or at the network level) and is unreachable. DC-A will not be able to absorb an additional 80% of utilisation, as it is already at 80% itself. That leaves the organisation in a state where DC-A potentially cannot handle the extra demand, which will cause disruptions...

Does this mean that a two-datacenter (DC-A and DC-B) organisation can only run at a maximum of 50% utilisation per datacenter? I.e. if either DC fails, the other DC will be able to pick up the slack that it was carrying (50% + 50%).

Is this thinking correct? How are others handling this problem or am I missing something?
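
To make the arithmetic explicit, here is a minimal sketch of what I mean (assuming two identical datacenters and a perfectly even load split; the numbers are illustrative):

    # Minimal sketch of the failover arithmetic: can the surviving
    # datacenter(s) absorb the total load after one fails?

    def survives_dc_failure(utilisation_per_dc: float, num_dcs: int = 2) -> bool:
        """True if the remaining datacenters can absorb the total load after one fails."""
        total_load = utilisation_per_dc * num_dcs  # load measured in whole-DC capacities
        surviving_capacity = num_dcs - 1.0         # capacity left after losing one DC
        return total_load <= surviving_capacity

    print(survives_dc_failure(0.80))  # False: 1.6 DCs of load vs. 1.0 DC of capacity
    print(survives_dc_failure(0.50))  # True:  1.0 DC of load vs. 1.0 DC of capacity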

  • I don't think I'd want to run at 80% utilization under any circumstances. – ewwhite Jul 31 '15 at 09:29
  • @ewwhite So are you saying the 50% max utilisation I mentioned is indeed the max per DC? Assuming 2 DCs... – Donovan Muller Jul 31 '15 at 09:33
  • It depends on what the biggest requirement is in the event of a DC failure. Do you need all the machines to be live? Are there dev machines in there that could be turned off or have their resources reduced? There would likely be infrastructure-related resources that wouldn't need to fail over. – Drifter104 Jul 31 '15 at 09:35
  • No, I'm saying that there's probably some need to factor in growth and additional capacity for unpredictable usage patterns. Something like 60% utilization seems healthier. – ewwhite Jul 31 '15 at 09:36
  • @ewwhite Understood. Assuming this would improve if more DCs are added? – Donovan Muller Jul 31 '15 at 09:39
  • @Drifter104 Understood, prioritize apps/environments in the event... – Donovan Muller Jul 31 '15 at 09:39

2 Answers

For services that need to always be available, you need N+1 redundancy, where N is the number of datacenters or servers (or whatever else you lose in the proposed failure scenario) needed to handle the load. This gets less expensive the bigger you get: at the low end, with two datacenters, each needs to be able to handle the entire workload; but if you have 10, they can do the work of 9 and still be redundant.
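
To illustrate how the required headroom shrinks with scale, a rough sketch (assuming identical datacenters and an evenly distributed load):

    # Maximum safe utilisation per datacenter if the survivors must absorb a
    # single-DC failure, assuming identical DCs and evenly distributed load.

    def max_safe_utilisation(num_dcs: int) -> float:
        """The (num_dcs - 1) survivors provide the capacity, so the total load,
        num_dcs * utilisation, must not exceed it: utilisation <= (N - 1) / N."""
        return (num_dcs - 1) / num_dcs

    for n in (2, 3, 5, 10):
        print(f"{n:>2} datacenters: {max_safe_utilisation(n):.0%} per DC")
    # Output:
    #  2 datacenters: 50% per DC
    #  3 datacenters: 67% per DC
    #  5 datacenters: 80% per DC
    # 10 datacenters: 90% per DC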

The other option is load shedding, though that phrase is more often used with power systems: in a failure scenario, turn off any non-essential services so that the remaining systems have enough resources.
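
A minimal sketch of the idea; the service names, priorities and capacity figures are made up for illustration:

    # Load-shedding sketch: drop the lowest-priority services until what is
    # left fits in the surviving capacity. Names and figures are illustrative.

    services = [
        # (name, priority: lower sheds first, fraction of capacity consumed)
        ("batch-reports",       1, 0.15),
        ("dev-environments",    1, 0.20),
        ("internal-dashboards", 2, 0.10),
        ("checkout-api",        3, 0.35),
    ]

    def shed_load(services, available_capacity):
        """Shed services in ascending priority order until the rest fit."""
        remaining = sorted(services, key=lambda s: s[1])  # lowest priority first
        load = sum(cap for _, _, cap in remaining)
        while remaining and load > available_capacity:
            name, _, cap = remaining.pop(0)
            load -= cap
            print(f"shedding {name} (frees {cap:.0%})")
        return [name for name, _, _ in remaining]

    print(shed_load(services, available_capacity=0.5))
    # shedding batch-reports (frees 15%)
    # shedding dev-environments (frees 20%)
    # ['internal-dashboards', 'checkout-api']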

– Grant

A fairly common approach is to hard-reserve enough capacity for the production environment that, in case of calamity, the remaining datacenter(s) ought to be able to handle the full load and all operations can continue business as usual.

Typically, budgets never stretch far enough, nor is there an apparent viable business case, to allow full disaster recovery/fail-over for non-production environments. Degradation or complete unavailability might be deemed acceptable there.

Depending on the platform, some choose to increase the available production capacity in the remaining datacenter(s) in the event of a disaster by scaling down the non-production environments.
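
A back-of-the-envelope sketch of that trade-off with two datacenters (the production/non-production split figures are assumptions for illustration, not from the answer):

    # Two identical DCs, each running production plus non-production load.
    # If one DC fails, the survivor pauses its non-production environments
    # and must fit both DCs' production load within its own capacity (1.0).
    # The prod/non-prod splits below are purely illustrative.

    def failover_headroom(prod_load: float) -> float:
        """Headroom left in the surviving DC (negative means still overloaded)."""
        return 1.0 - 2 * prod_load

    # Both DCs normally run at 80% utilisation; only production fails over.
    print(f"{failover_headroom(prod_load=0.45):+.0%}")  # +10%: pausing 35% of non-prod frees enough room
    print(f"{failover_headroom(prod_load=0.55):+.0%}")  # -10%: overloaded even after dropping 25% of non-prod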

– HBruijn