
Yesterday, Linode's network over at HE.net suffered a major outage -- supposedly, he.net had one of eight power generators fail, which one way or another took the whole Linode Fremont cloud down for five hours (2015-05-29T18:30/23:30 PT). There have been some reports that only the network core lost power; however, once service came back, it appeared that all the servers had been power-cycled as well.

What's the best practice for supplying the power to the servers?

  • Is it generally sufficient to rely on the power provided by the data centre alone (almost all of them claim to have UPSes and generators, don't they?), or are you supposed to have extra UPS units within your own racks?

  • Should the networking core be under its own UPS?

  • Does any major cloud or dedicated-server provider have dedicated UPS units for each server or rack?

cnst

2 Answers


Stuff fails. It's part of sysadmin life. Any business plan you have that relies on a service offering 100% uptime is a bad one. Before I say anything else, let me note that I know none of the details about this particular outage.

That said, I've had industrial-grade UPSes fail on me before. At a high-end colo we had an 800A breaker fail part-open, meaning that all protected servers were connected to both street and UPS power for a short while, then nothing for four hours. When it came back, we found that our main DB server had lost nearly half its HDDs due to the rapid power-cycling and spikes. That was an interesting day.

Sure, you could duplicate the site UPS's function with a UPS in every rack. I've never met anyone who does this, and I suspect the reason is that it doubles your single points of failure and, worse, interposes a second, lower-quality SPOF between your kit and the industrial UPS. Data-centre-sized UPSes are regularly serviced, heavily monitored, and will hardly ever (but not "never") fail; rack-sized UPSes are much closer to consumer-grade gear, and will fail more often. I've had my personal server down for a whole weekend after the individual UPS it was on failed, even though the colo power was good the whole time.
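The arithmetic behind that intuition can be sketched quickly. Components chained in series (all must work) multiply their availabilities, so putting a less reliable rack UPS in line with the site UPS can only lower the total. The figures below are hypothetical illustrations, not measured vendor numbers:

```python
def series_availability(*availabilities: float) -> float:
    """Availability of components in series: all must be up, so multiply."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

# Hypothetical figures for illustration only.
site_ups = 0.9999  # well-maintained data-centre UPS
rack_ups = 0.999   # consumer-grade rack UPS

print(f"site UPS alone:      {series_availability(site_ups):.6f}")
print(f"site + rack in line: {series_availability(site_ups, rack_ups):.6f}")
```

The combined figure is always below the weaker of the two: the rack UPS can only subtract availability, never add it, unless it is wired as a genuinely independent parallel path.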

If you truly want a high-availability offering, you need BGP-routed PI netblocks, duplicate kit spread over multiple DCs with multiple providers, heavy-duty SLAs with teeth; the whole very, very expensive tamale. This is why I say that you get 99% for no extra cost; every extra 9 increases cost by up to an order of magnitude. And if anyone in your organisation thought that putting stuff in the cloud meant that you weren't running on hardware or didn't need to worry about it, well, they were wrong.
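To put numbers on what each extra nine buys, here is the standard downtime-per-year calculation (simple arithmetic, not a figure from this answer):

```python
# Maximum downtime per year allowed at each availability level ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

for nines in range(1, 6):
    availability = 1 - 10 ** -nines           # 0.9, 0.99, 0.999, ...
    downtime_min = MINUTES_PER_YEAR * 10 ** -nines
    print(f"{availability:.5f} -> {downtime_min:,.1f} min/year")
```

Each added nine cuts the permitted downtime tenfold -- from days per year at 99% to minutes per year at 99.999% -- which is why the cost climbs so steeply.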

MadHatter
  • If colo power was good the whole time, it would have helped to have a dual-PSU server, with one PSU fed by the UPS on colo feed A and the other directly by colo feed B. – Halfgaar May 31 '15 at 10:03
  • In a rack with dual PDUs, I completely agree. – MadHatter May 31 '15 at 10:17

I don't know the specifics of this outage, but there is no magic bullet -- no "one weird trick to never having an outage" -- that this provider doesn't know about or refuses to implement while the provider down the road uses it.

No matter what you do, no matter how carefully you plan, there's always a chance that something will go wrong. I used to work in a very large datacentre for an oil and gas exploration company and we had what was then the latest and greatest IBM mainframe technology. Not only was it the fastest that money could buy, it was also the most reliable, redundant and resilient system IBM could supply.

But it failed, and we had a 36-hour outage. Not because of a code bug or a power issue or any of the things you might normally associate with major outages, but because of a small rubber washer that cost a few pennies.

The system was water cooled, and the cooling system also had redundancy and resiliency built in. No one really realised it at the time, but there was just one little single point of failure: the pump that allowed both water-cooling circuits to be pressurised or drained from a single inlet and outlet pipe. Guessed where that penny washer was yet?

So, where am I going with this anecdote? If you want redundancy from a cloud/hosted service then, rather than thinking tactically about the arrangement of UPSes and power rails, you need to think strategically: pick a provider (or more than one provider, accepting the overhead of managing that in house) with multiple geographically dispersed locations and a foolproof failover between them -- and ask how they, or you, define "foolproof", and how it gets tested.

Rob Moir