16

When is the right time to introduce high availability for a web site?

There are many articles on High Availability options. It's not obvious, however, WHEN the right time is to switch from a single server to a high-availability configuration.

Please consider my situation:
http://www.postjobfree.com is a 24/7 web site with significant traffic:
http://www.similarweb.com/website/postjobfree.com

Currently I run it on a single server: both the IIS 7.0 web server and SQL Server 2008 run on the same hardware box.

There is occasional (~once per month) ~5-minute downtime, usually caused by a reboot required by some Windows Server update. The downtime is usually scheduled and happens at night. Still, it's unpleasant, because Googlebot and some users are active at night.

Current web site revenue is at ~$8K/month.

I am considering switching to a two-server configuration (a web farm of two web servers and a cluster of two SQL Servers, hosted on two hardware boxes).

Pros:
1) High Availability (theoretically no downtime). Even if one of the servers goes down, the other server would take over.
2) No data loss: without a SQL cluster, up to one day of data can be lost in case of hardware failure (we do daily backups).

Cons:
1) More effort to set up and maintain such a configuration.
2) Higher hosting cost. Instead of ~$600/month it would be about $1200/month.

What would be your recommendation?

Dennis Gorelik
  • The answer to my question might affect development. For example, I may consider splitting the database into parts and keeping data that requires high reliability (user input) separate from data that requires high performance (calculations). – Dennis Gorelik Jun 14 '11 at 04:26
  • Hi Dennis, this isn't really a recommendation, so I've stuck it as a comment, but your hosting costs seem pretty high for a single Windows server. I assume it's a fully dedicated server (not a VM), but even then you should be looking at perhaps half that cost for a decent-specification server with 8GB of RAM, a good amount of disk space, etc. It might be worth speaking with your hosting company about getting a better price. – Ewan Leith Jun 14 '11 at 08:12
  • I think High Availability should be planned for from the first moment of the project's conception. – Tom O'Connor Jun 14 '11 at 09:57
  • Ewan, I want my web site to work fast, so I have a quad-core processor with 8 GB of memory and an SSD drive. Factor in the cost of software licenses (Windows, SQL Server), SSL, and tech support. Do you have a good solution with a low price for that? I currently use Server Intellect (backed by SoftLayer) for hosting. Would you recommend something better? – Dennis Gorelik Jun 14 '11 at 13:18
  • Why are you taking Windows updates on a production server every month? – EkoostikMartin Jun 15 '11 at 19:32
  • Windows updates come with security patches. If I don't patch my server, it might be vulnerable to attacks. What update frequency would you recommend for a Windows production server? – Dennis Gorelik Jun 16 '11 at 05:42

8 Answers

15

Short answer: when downtime, or the risk of it, costs you more than high availability would cost you.

It is fundamentally an economic decision. As an example: $8k/month implies that an outage of 2 hours costs you about $22 in revenue. If you can configure your system such that you can go from scratch to a fully functional site in 2 hours, then high availability only gains you about $22 per incident beyond that.

Put another way, you save money unless/until you have about 54 hours of unpreventable downtime in a given month (the extra ~$600/month divided by ~$11/hour of revenue).
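
To make the arithmetic explicit, here is the same back-of-the-envelope calculation as a quick script (a sketch; it assumes revenue is spread evenly over a ~720-hour month, which the comments below rightly question):

```python
# Back-of-the-envelope downtime economics, using the numbers from the question.
monthly_revenue = 8000.0    # $/month
ha_extra_cost = 600.0       # extra $/month for the second server
hours_per_month = 30 * 24   # ~720

revenue_per_hour = monthly_revenue / hours_per_month   # ~$11.11/hour
outage_cost_2h = 2 * revenue_per_hour                  # ~$22 for a 2-hour outage
break_even = ha_extra_cost / revenue_per_hour          # ~54 hours/month

print(f"Revenue per hour:     ${revenue_per_hour:.2f}")
print(f"2-hour outage:        ${outage_cost_2h:.2f}")
print(f"Break-even downtime:  {break_even:.0f} hours/month")
```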

Slartibartfast
  • You have to consider risk to reputation too – gbn Jun 14 '11 at 04:57
  • The cost per hour of downtime will almost certainly depend on just when the server goes down. The transactions are very unlikely to be evenly spread over a 24-hour period. It is more normal for them to occur during just a few peak hours, at which time the loss would be much greater. – John Gardeniers Jun 14 '11 at 08:44
  • Slartibartfast, I understand your answer this way: make sure that recovery time after a catastrophic failure is reasonable (a few hours), data loss is reasonable (a few hours), and allow myself short scheduled downtimes from time to time (at least for now). That would mean having daily backups, incremental partial backups, and a server available to restore all that configuration to. Does that sound right? – Dennis Gorelik Jun 14 '11 at 13:41
  • Responses: gbn: Agreed; I was going for a simple explanation, but reputation could easily be a significant factor. John Gardeniers: Sure, but if the site is only used on Sundays between 11AM and 1PM then scheduled downtime isn't really a problem, while the $2k price tag for an unplanned 2-hour outage _right_then_ is. At that point you have to figure out how likely that untimely outage is (at $2k revenue cost) against the certain $600/month charge for the additional server. Hint: unless random failures during the critical period happen more often than 4/year, it's not worth it. – Slartibartfast Jun 20 '11 at 03:02
  • Dennis Gorelik: Decide on the risks you want to protect against (e.g. loss of business during maintenance, loss of server, loss of datacenter, account/security/database breach) and act to protect against them. In this case you're protecting against downtime due to maintenance and unpredictable failure (as far as I can tell). What you describe should do the trick, but keep in mind that you don't have to own the server as long as you can be confident that you can procure it and get it set up within the restore period. – Slartibartfast Jun 20 '11 at 03:04
11

Your stakeholders/business folk (which could be you!) have to decide

Loss of revenue is easy to quantify; the rest can't be answered here, sorry...

gbn
2

I think most users can handle a bit of scheduled downtime. Consider that eBay has weekly updates on Friday nights, and bids around then sometimes don't work. My (major Australian) bank's online banking has scheduled outages for hours every week. Twitter goes offline all the time. Heroku / EC2 was down for days recently.

I'd keep it in perspective: if you're really only talking 5 minutes a month, you're doing quite a good job as a sysadmin.

Chris
1

You've already mentioned Google as a factor in terms of indexing, but it may also be worth considering the impact that latency/site responsiveness may have on SEO. It's a black box and all that, so difficult to quantify - though for what it's worth, Matt Cutts reckons it's a one-percenter. I'd be more concerned about reputation, as others have stated.

1

Keep in mind that HA, like security, isn't a product, but rather a process.

For example, database replication will only get you to the point where each mirror of the database will be able to continue on its own, but you will also need a strategy for resynchronization after failed components have been replaced.

Consider an ordering system as an example: the customer submits an order, and during processing, the physical system he was talking to fails after storing the order information in its local copy of the database. Impatient, the customer presses "submit" again, and is directed to another server, which accepts the order. If your databases resynchronize by simply replaying the missing INSERT statements on the other side, then the order will be duplicated, which may not be what you want.
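
One common way out, sketched below (table and column names are made up for illustration), is to make the write idempotent with a client-generated key, so that replaying the same statement is a no-op:

```python
# Sketch: idempotent order submission via a client-generated order ID.
# Replaying the INSERT during resync (or a double "submit") cannot duplicate it.
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("""
    CREATE TABLE orders (
        order_id TEXT PRIMARY KEY,  -- client-generated idempotency key
        customer TEXT NOT NULL,
        item     TEXT NOT NULL
    )
""")

def submit_order(order_id, customer, item):
    # INSERT OR IGNORE makes a replay of the same order_id a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO orders VALUES (?, ?, ?)",
        (order_id, customer, item),
    )
    conn.commit()

oid = str(uuid.uuid4())               # generated once, client-side
submit_order(oid, "alice", "widget")  # original submission
submit_order(oid, "alice", "widget")  # impatient retry / replayed statement

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 1
```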

As @Slartibartfast suggested, it all boils down to an economic decision; however, I'd recommend that you also plan a few years into the future here. If you expect to need a proper HA setup by then, now would be a good time to set aside resources for the preparatory work.

Simon Richter
1

While you think about this, I'd suggest you consider setting up a "fail whale" page.

There are plenty of ways to do this, but the AWS combo of Route 53 and S3 works well on my small sites.

I set up the domain with health checks so that on failure DNS sends users to a static HTML page sitting in S3; it costs next to nothing.
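
On the Route 53 side, the failover records look roughly like this (a sketch using boto3; the hosted-zone ID, health-check ID, and IP addresses are placeholders, and in this setup the SECONDARY record would really be an alias to the S3 website endpoint rather than an A record with an IP):

```python
# Sketch: a Route 53 failover record pair via boto3. All IDs and IPs below
# are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={"Changes": [
        {   # primary: the real web server, guarded by a health check
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        },
        {   # secondary: only served while the primary's health check fails
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "SetIdentifier": "secondary",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.20"}],
            },
        },
    ]},
)
```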

In my experience, having your site say "sorry, things are broken but we are working on it" makes a world of difference to users. A Twitter account where you can communicate with users is even better.

This goes a long way toward mitigating the "loss of reputation" that can be the most significant impact of an outage.

See https://aws.amazon.com/blogs/aws/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting/ for a guide on setting it up.

DynDNS' social failover (http://dyn.com/managed-dns/social-failover/) is a similar kind of thing.

You could also roll your own: do your own health checks and then script the DNS changes, provided your DNS records have a low TTL and you have some way of manipulating them programmatically.
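
A minimal roll-your-own version might look like this (a sketch; update_dns() is a placeholder for whatever API your DNS provider offers, such as the Route 53 UPSERT above, and the watcher must of course run somewhere other than the server it is checking):

```python
# Sketch: poll the site once a minute and cut DNS over after sustained failure.
# update_dns() is a placeholder for your DNS provider's API call.
import time
import urllib.request

SITE = "http://www.postjobfree.com/"
FAILURES_BEFORE_CUTOVER = 10  # ~10 minutes of failures at one probe/minute

def site_is_up():
    try:
        with urllib.request.urlopen(SITE, timeout=10) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False

def update_dns(target_ip):
    # Placeholder: point the A record at target_ip via your provider's API.
    print(f"cutting DNS over to {target_ip}")

failures = 0
while True:
    if site_is_up():
        failures = 0
    else:
        failures += 1
        if failures == FAILURES_BEFORE_CUTOVER:
            update_dns("203.0.113.20")  # placeholder backup / static-page IP
    time.sleep(60)
```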

Nath
  • Do these health checks have to be executed from the same server that hosts DNS? I cannot picture how to make a conditional DNS update. – Dennis Gorelik Nov 21 '15 at 11:42
  • @DennisGorelik Not necessarily, but your DNS records need a short TTL and whatever is doing your health check needs to be able to change the records quickly. Updated the answer with more info on how to achieve this. – Nath Nov 21 '15 at 13:11
  • A short TTL for DNS in combination with a dependency on the health check may make the overall system a little less stable (it may switch even if the main server works just fine). It may actually make the situation worse for the end users, not better. – Dennis Gorelik Nov 21 '15 at 21:45
  • Short TTLs by themselves shouldn't be an issue with any decent DNS provider, and if you set a pretty low bar on your health checks (i.e., fail over only if there are no HTTP 200s for 10 minutes) then stability isn't an issue. Alternatively, you can skip the health-checking part and have a manual cutover. This will mean a longer period of time when your users get "connection timed out" and other ugly errors, but no chance of false positives. – Nath Nov 21 '15 at 21:59
0

Have you considered using something like EC2, which would let you scale flexibly and also negate your cons? It is ultimately an economic decision whether using EC2 is worth it or not, but it is at least an option to consider.

manku
-2

To avoid data loss, you should look into RAID configurations before clusters. You should also configure a failover IP that you can switch from one server to another in case of a disaster, without having to wait for DNS propagation.
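
For illustration, switching a failover IP typically comes down to a single API call to your hosting provider. Everything in the sketch below (endpoint, token, payload) is hypothetical, since each provider's API is different:

```python
# Hypothetical sketch: re-point a failover IP at the standby server through a
# hosting provider's REST API. The endpoint, token, and payload are invented
# for illustration only; consult your provider's documentation.
import json
import urllib.request

API_URL = "https://api.example-host.com/v1/failover-ip/203.0.113.50"  # hypothetical
API_TOKEN = "YOUR_API_TOKEN"  # placeholder

def point_failover_ip_at(server_id):
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"route_to_server": server_id}).encode(),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200

# In a disaster, re-route traffic without waiting for DNS propagation:
point_failover_ip_at("standby-server-1")
```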

yqt