
I run a small company on a tight budget, providing websites and databases for charity and not-for-profit clients.

I have a few Debian Linux VPS servers, and I make sure I have daily backups to a different VPS from the one each service is hosted on.

Recently one of my hosting companies told me that two drives had failed simultaneously, so that data was lost forever. Stuff happens; they said sorry, and what else could they do? But it made me wonder about cost-effective ways to get a VPS back up in the event of a hardware or other host-related failure.

Currently I would have to

  1. Spin up a new VPS
  2. Get the last day's backup (which includes databases, web root and website-specific config) over onto the new VPS, and configure it to match the old one.
  3. Update DNS and wait for it to propagate.

It would probably take a day or so to achieve this, with the DNS propagation being a big unknown, although I have the TTL set quite low (an hour or so).
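For what it's worth, here's a minimal sketch of how the cutover could be watched, assuming the replacement server's IP is already known (the hostname and address below are made up); it uses only the Python standard library:

```python
import socket
import time

NEW_IP = "203.0.113.10"   # hypothetical address of the replacement VPS
HOSTNAME = "example.org"  # hypothetical domain being moved

def wait_for_dns(hostname, expected_ip, interval=60):
    """Poll the local resolver until the A record points at the new server."""
    while True:
        try:
            ips = {info[4][0] for info in socket.getaddrinfo(hostname, 80)}
        except socket.gaierror:
            ips = set()
        if expected_ip in ips:
            print(f"{hostname} now resolves to {expected_ip}")
            return
        print(f"{hostname} still resolves to {ips or 'nothing'}; retrying...")
        time.sleep(interval)

wait_for_dns(HOSTNAME, NEW_IP)
```

Note this only shows what the local resolver sees; caches elsewhere will hold the old record for up to one TTL after the change.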

Some hosts provide snapshots which can be used to replicate a set-up to a new VPS, but there's still the IP address to change, and this doesn't help if the host company cancels or suspends an account outright. (I've been reading about this behaviour from certain hosting providers and it's scared me! I'm not doing anything spammy or dodgy and I keep a close eye on security, but I realise that they literally have the power to do this, and I'm quite risk-averse.)

Is this, combined with choosing reputable hosts, the best I can do without going for an incredibly expensive solution?

artfulrobot
  • Very skeptical of a claim that two drives failed simultaneously, particularly on a VPS – symcbean Feb 12 '15 at 23:46
  • Apparently one failed then another while the new one was rebuilding. – artfulrobot Feb 12 '15 at 23:47
  • Take a look at http://drbd.linbit.com - this might fit your requirements. – The Unix Janitor Feb 13 '15 at 07:28
  • @symcbean: The problem is that a RAID-5 rebuild requires reading all the data on all remaining disks. That's a fairly long operation (hours if not days). A cheap RAID-5 system may have a 9+1 setup using desktop drives. All nine of those disks will be stressed beyond design limits in a RAID rebuild; failure is then to be expected, in fact. – MSalters Feb 13 '15 at 11:47
  • @MSalters It's certainly possible (and sometimes deceptively probable) for rebuilds to fail in such cheap set-ups, but you'd generally expect any reputable VPS provider to foresee these scenarios and design their system to prevent them from happening. – Lilienthal Feb 13 '15 at 12:34
  • Choose a host who is big enough to use virtualisation. Hardware failures mean your VPS will transparently transition to alternate hardware. This shouldn't really be any more expensive. – JamesRyan Feb 13 '15 at 14:24
  • @MSalters: I'm quite aware of the limitations of desktop drives and most RAID configs. I would like to think that someone offering vps would have more robust storage technology than a 9+1 RAID5 disk set. – symcbean Feb 13 '15 at 14:31
  • There's one bit that I think was missed in other answers: "and configure it like the last one etc." -> you can skip that step by keeping all server configuration in some repository as chef/salt/ansible/puppet scripts. This will both prevent silly mistakes where some obscure setting is forgotten and allow you to rebuild everything in less time. – viraptor Feb 13 '15 at 17:37
  • @JamesRyan - VPS are all done via virtualization – warren Feb 18 '15 at 19:13
  • @warren the difference is how that is implemented. Do they just have a bunch of VMs on a physical machine or do they have a bunch of VMs on a pool of physical machines. – JamesRyan Feb 19 '15 at 10:45
  • Well, actually they have storage pools and processor+memory pools as separate entities, but the question is not about what happened or didn't on one particular provider; it's more general than a specific implementation. – artfulrobot Feb 19 '15 at 10:51

5 Answers


For me, choosing reputable hosts and doing regular backups - both of which you seem to be doing already - is about as much as you can do without starting to think about business continuity planning, high-availability setups, SLAs, and so on.

I tell people that you get 99% uptime for free (i.e., without spending anything extra on high availability). That's about three and a half days of downtime a year. Every extra 9 on that uptime increases the cost by somewhere between three and ten times.
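To put numbers on that, here's a quick back-of-the-envelope calculation; the cost multipliers are just the 3x-10x-per-extra-nine rule of thumb from above applied cumulatively, not measured figures:

```python
# Downtime allowed per year at each uptime level, plus a rough
# cumulative cost multiplier (99% is the "free" baseline).
HOURS_PER_YEAR = 24 * 365

for nines in range(2, 6):
    uptime = 1 - 10 ** -nines               # 0.99, 0.999, ...
    downtime_h = HOURS_PER_YEAR * (1 - uptime)
    low, high = 3 ** (nines - 2), 10 ** (nines - 2)
    print(f"{uptime * 100:g}% uptime: ~{downtime_h:7.2f} h/yr downtime, "
          f"~{low}x-{high}x baseline cost")
```

99% works out to about 87.6 hours a year - the three and a half days mentioned above - while 99.999% allows barely five minutes.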

If people aren't ready to pay that kind of money, it is, in my opinion, a mistake to mislead them into thinking they can get any significant extra protection.

MadHatter
  • This is a great answer. I have a very similar set-up and type of clients to @artfulrobot (we even use the same hosting company), and his question and your answer have made me realise that it is my responsibility to communicate to my clients the limitations and risks, in very plain English, to make sure they have realistic expectations. Most of them are very non-techie, so there's a very real likelihood they think everything will just somehow magically work, non-stop and ad infinitum. I don't want to be managing their expectations during/after a major failure; I need to do it before! – Simon Blackbourn Feb 13 '15 at 00:14
  • I'm not saying that failures are fully uncorrelated, but 1+1 redundancy in theory should give you *two* extra nines for twice the cost. You suggest the cost for two extra nines is somewhere between 9 times and 100 times. 2x versus ~30x is a huge difference. – MSalters Feb 13 '15 at 11:51
  • @MSalters that's true, against certain kinds of failure (server failure). Against e.g. site failure it does nothing, unless the two servers are at different sites, and *that* gets extremely complex in terms of network admin. You also consider only the capital costs, and overlook the increased running costs - keeping two servers perfectly in sync isn't trivial, depending on what sort of thing they're doing, and there is the admin cost of load-balancers. My feeling is that redundant servers on a single site, sharing the LB load, give you another nine in exchange for 3-4 times the cost. – MadHatter Feb 13 '15 at 18:33
  • Good and easy way to present it. (But ... I'd just add some price somewhere, as 3 to 10 times "free" is still free ;). Or, of course, you mean the overall cost of the service itself? ) – Olivier Dulac Feb 16 '15 at 06:44
  • @OlivierDulac precisely so! – MadHatter Feb 28 '15 at 17:18

Small businesses with small budgets, especially nonprofits, typically are not going to be able to afford high availability. The question is, if you have virtually no budget, as is commonly the case in situations like this, what is your restore strategy?

I do have some clients like this, and this is what I do:

First, for some of them I have an incremental backup and a full database dump every six hours. One client was already using CrashPlan Pro, so I just used that. Whatever you do, you need to make sure you have a restorable backup.
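As a sketch, the six-hourly database job could look something like this; the paths and remote host are hypothetical, it assumes mysqldump and rsync are installed, and credentials are expected to come from ~/.my.cnf rather than the script:

```python
import subprocess
from datetime import datetime, timezone

DUMP_DIR = "/var/backups/mysql"                      # hypothetical local spool
REMOTE = "backup@backup-host.example:/srv/backups/"  # hypothetical second server

def dump_and_ship():
    """Dump all databases, then copy the dump to a separate server."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"{DUMP_DIR}/all-{stamp}.sql.gz"
    # --single-transaction gives a consistent InnoDB snapshot without
    # locking tables; pipefail makes a mysqldump failure fatal.
    subprocess.run(
        ["bash", "-c",
         "set -o pipefail; "
         f"mysqldump --single-transaction --all-databases | gzip > {dump_path}"],
        check=True)
    # Ship it off-box: the whole point is surviving the loss of this VPS.
    subprocess.run(["rsync", "-a", dump_path, REMOTE], check=True)

dump_and_ship()
```

Run from cron every six hours, that gives you the restorable, off-box dump; whether it is actually restorable is something to test, not assume.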

I have a simple Ansible playbook that I put together in about an hour (not having previously worked with Ansible) which installs nginx, php-fpm and MariaDB and prepares them to host a web site or sites. Running this playbook results in a server (or servers) ready to host a typical web application, and I can simply restore the nginx virtual host, application files and database to it.

The result of this is that I can bring up such a web site from backup in just a few minutes, as opposed to the manual way which could take an hour or more.
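A hedged sketch of the restore driver itself - the playbook name, replacement host and backup layout are all hypothetical, and it assumes the ansible-playbook, rsync, ssh and mysql CLIs are available:

```python
import subprocess

NEW_HOST = "203.0.113.20"       # hypothetical replacement VPS
BACKUP = "/srv/backups/latest"  # hypothetical local copy of the last backup

# 1. Lay down the base stack (nginx, php-fpm, MariaDB) on the fresh VPS.
#    The trailing comma makes ansible treat the bare IP as an inventory.
subprocess.run(
    ["ansible-playbook", "-i", f"{NEW_HOST},", "webstack.yml"],
    check=True)

# 2. Push the site files and the nginx virtual host back into place.
subprocess.run(
    ["rsync", "-a", f"{BACKUP}/webroot/", f"root@{NEW_HOST}:/var/www/"],
    check=True)
subprocess.run(
    ["rsync", "-a", f"{BACKUP}/nginx/",
     f"root@{NEW_HOST}:/etc/nginx/sites-enabled/"],
    check=True)

# 3. Load the last database dump on the new box.
subprocess.run(
    f"gunzip -c {BACKUP}/all.sql.gz | ssh root@{NEW_HOST} mysql",
    shell=True, check=True)
```

After that, all that's left is the DNS change.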

Michael Hampton
  • Hey that sounds spot on. I'll look into that. Thanks. – artfulrobot Feb 12 '15 at 11:16
  • High availability is readily available even for small clients from good providers. They get economy of scale. – JamesRyan Feb 13 '15 at 14:27
  • @JamesRyan Yes, but you don't get economy of ... economy. Tell me if it makes sense to run two Amazon instances and an elastic load balancer for a website that sees 300 hits a month? – Michael Hampton Feb 13 '15 at 14:40
  • @MichaelHampton that is not even remotely what I was suggesting. A company hosting VPS's for hundreds of clients can spread them amongst redundant hardware rather than simply put a bunch of them on a single physical server and cross their fingers. – JamesRyan Feb 13 '15 at 23:23

The complexity of the implementation depends on the application stack, but ideally you'd want to set up a "hot standby" (at a different provider), with data replicated in real time (or as close to real time as possible).

Making the business case for having 2 "live" servers is as simple as comparing the potential loss of revenue during a "recovery from images" period to the expense of another server.
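If you do go the MySQL replication route discussed in the comments below, it's worth watching how far the standby lags. A minimal sketch using the third-party PyMySQL package (the host and credentials are placeholders):

```python
import pymysql  # third-party: pip install pymysql

# Hypothetical standby server and monitoring credentials.
conn = pymysql.connect(host="standby.example", user="monitor",
                       password="secret",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    # MariaDB and older MySQL; newer MySQL renames this SHOW REPLICA STATUS.
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

lag = status["Seconds_Behind_Master"] if status else None
if lag is None:
    print("Replication is not running - the standby may be stale!")
else:
    print(f"Standby is {lag} seconds behind the master")
```

Alerting on that number (rather than eyeballing it) is what tells you the standby will actually be usable when the day comes.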

Mark R.
  • Thanks. I'm using a LAMP stack. I guess real time would be something like MySQL replication, although that can get pretty tricky to manage. And it's doubling the servers I have to manage. Maybe it would make sense to have one low spec box that had a live copy of all the other servers, so it was just the DNS propagation. Then I could clone that back to a new VPS and change the DNS (hmmm.). – artfulrobot Feb 12 '15 at 10:42
  • MySQL replication is usually quite simple to set up and configure, aside from the time spent transferring the initial dataset. As for DNS, most resolvers respect low TTLs these days, and setting a record's TTL as low as 60 seconds usually works well. – Mark R. Feb 12 '15 at 10:43
  • MySQL replication is more complex when you need to add new extra databases, and I believe it's still tricky to have one server act as a slave for more than one master (replicating several DBs onto one standby server). Also, of course, you need to secure the access between servers, e.g. with stunnel, so that's a PKI to maintain, etc. - unless you have a private LAN, but that's ruled out by the need for this to be with a separate hosting company. – artfulrobot Feb 12 '15 at 11:34
  • There's always _replicate-do-db_ and SSH tunnels with keys. – Mark R. Feb 12 '15 at 11:36
  • Used to run standard SSH tunnel but it was not reliable. Stunnel is brilliant once you've got it up and running, though. – artfulrobot Feb 12 '15 at 11:48
  • I've a couple of clients that rely on autossh-maintained tunnels, but yeah, stunnel is great as well. – Mark R. Feb 12 '15 at 11:49
  • Hot standby is a long way from ideal. Balancing the traffic across 2 well separated nodes is a much better solution. – symcbean Feb 12 '15 at 23:49
  • @symcbean, in this case, setting up a master/master solution, ensuring bidirectional replication of both databases and files etc, seems to me to be a bit more hands-on than a standby that can lag behind the active box by a little while. – Mark R. Feb 13 '15 at 08:29
  • @Mark: master/master can lag just as much as master/slave - but the lag should be reduced with load spread across the boxes and the complexity and pain of a switchover is MUCH reduced. Multi-master replication eliminates the lag but at the cost of some complexity. – symcbean Feb 13 '15 at 09:54
  • @symcbean, keep in mind he probably won't have a proper budget for "real" load balancing, which means DNS-based round-robin... I wouldn't want to support that kind of config. – Mark R. Feb 13 '15 at 10:54

Remember that uptime is not the same as data integrity. You can have 99.99% uptime and still have lost all of your data twice in a year, as long as the server was restarted "soon enough". Most VPS providers guarantee that your server is running, NOT that your data is safe. Your data is your problem :(.

What you're looking for is something that will store your backups on a separate server, and (IMHO) not even with the same provider. Depending on the data size you're talking about, a portable hard drive could be used as a third line of offline defence: back up your data as you have been doing, and then regularly copy that (or just the changes, if possible) to the portable hard drive or even a local computer.

There are also reasonably cheap options like Backblaze for backup solutions, but the price will depend on the amount of data you're talking about. If you can do incremental backups it will be much cheaper than full backups, but incremental backups can be very difficult depending on where the data is stored (flat files = easy, database = not so easy).
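For the flat-file side, rsync's --link-dest option makes incrementals cheap by hard-linking files that haven't changed since the previous snapshot. A minimal sketch, with hypothetical paths:

```python
import os
import subprocess
from datetime import date

SRC = "/var/www/"               # what to back up (hypothetical)
DEST = "/mnt/portable/backups"  # e.g. the portable hard drive

today = date.today().isoformat()
os.makedirs(DEST, exist_ok=True)
previous = sorted(d for d in os.listdir(DEST) if d != today)

cmd = ["rsync", "-a", "--delete"]
if previous:
    # Hard-link files unchanged since the last snapshot, so each
    # day's directory only costs the space of what actually changed.
    cmd += ["--link-dest", os.path.join(DEST, previous[-1])]
subprocess.run(cmd + [SRC, os.path.join(DEST, today)], check=True)
```

Each dated directory then looks like a full backup but only stores the changed files.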

millebi
  • Yeah, I do that :-) And yes, hosting companies do not care about data, I've dealt with disk corruption before too! – artfulrobot Feb 12 '15 at 22:44

The answer totally depends on your architecture and requirements. Some time ago, 3 discs failed on a server of mine, taking down 20+ VMs when a RAID 6 failed.

I wrote about it at

https://www.linkedin.com/pulse/20140827173324-2064263-how-i-nearly-lost-my-business-to-3-hard-discs

But: because this is critical, we had backups - daily for non-important stuff, every 15 minutes for databases and emails. Heck, now I've added a server that gets replicated to another machine every 30 seconds.

You say nothing about the stack and nothing about any budget, so the best and only advice here is to go to some cloud provider and start using their backup mechanisms. But start by defining what you actually need.

Also, the budget for this backup should be in your pricing. It needs to be paid for. And whatever infrastructure you need... you need it. It is not "ridiculously expensive" then.

TomTom