
We are a small startup trying to cater to our first customer. At the moment, the entire hardware setup is on the Amazon cloud (we will be moving to a VPC shortly). I have to give the customer an estimate of what kind of uptime guarantee my company can offer. While Amazon offers "less than 99.95% but equal to or greater than 99.0%", I think it makes sense to factor in my own application upgrades, patching and other maintenance activity and go with a much lower estimate, say 95%.

I think my question is more general: what is a safer commitment for a startup dealing with its first client, in terms of an SLA? Would something like 90-95% sound acceptable to my customer (a billion-dollar company that pays us per transaction), considering that we are not a mature company in this space?

Jay
  • Why not co-locate or host yourself? A reputable datacenter or infrastructure provider can guarantee better uptime than what your simple Amazon design can offer. – ewwhite Jan 17 '16 at 08:19
  • @ewwhite Co-locating with the customer is not an option because the license model is subscription-based, and if we asked them for hardware they would expect us to reduce costs. Also, we are a virtual company with fewer than 5 people in two different geographical locations. We have one physical office in the USA, but that's just a couple of chairs. All the hardware is rented on Amazon. I don't think setting up our own infrastructure makes any sense for a small firm like us, given the upfront costs (hardware, security, network, etc.). – Jay Jan 17 '16 at 23:53
  • I wasn't suggesting co-locating with your customer. I was recommending using a hosted service with a reputable datacenter because it gives you definite advantages over Amazon's offering. I'll detail in an answer below. – ewwhite Jan 18 '16 at 00:42

2 Answers


A 90-95% SLA is useless; it is better not to offer one at all (even old shared hosting guarantees a better SLA for your web app). You need at least 99.5% for serious business. If you need a better SLA (and your customers will!), you need mirrored resources (2 app servers, 2 database servers, etc.), load balancing and failover (keepalived, HAProxy, Squid, etc.), a good internal and external monitoring and alerting solution (something like Zabbix or Nagios, New Relic, and Logstash/Kibana for log management), and system administrators who will manage it, monitor it, and react to problems.
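As a trivial illustration of what external monitoring and alerting means (this is my own sketch with placeholder names, not a replacement for Zabbix, Nagios or New Relic), something like the following Python loop could poll your app from outside your own infrastructure and page someone when it stops answering:

```python
import time
import urllib.request
import urllib.error

CHECK_URL = "https://app.example.com/health"   # placeholder health-check endpoint
INTERVAL_SECONDS = 60

def site_is_up(url, timeout=10):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def alert(message):
    # Placeholder: in practice this would page someone (email, SMS, PagerDuty, ...).
    print("ALERT:", message)

while True:
    if not site_is_up(CHECK_URL):
        alert(f"{CHECK_URL} is not responding")
    time.sleep(INTERVAL_SECONDS)
```

In practice you would run checks like this from several locations, so the monitor itself is not a single point of failure.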

You should look over the table of SLA levels on Wikipedia; there you can find how long your app is allowed to be offline at each SLA level. Don't forget that outages can and will occur when you cannot react instantly (e.g. at 3 a.m.), so you need a big enough admin team to provide 24/7 support. You have to identify all of your SPOFs and eliminate them. And don't forget that your developers are not the only source of potential problems; your servers will be under various types of automated attack from the first minute (SSH bots, DDoS, etc.).
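To make that table concrete, here is a small sketch (my own arithmetic, assuming a 30-day month and a 365-day year) that converts an SLA percentage into the downtime it permits:

```python
# Convert an SLA percentage into the downtime it allows.
# Assumes a 30-day month and a 365-day year, for illustration only.

def allowed_downtime(sla_percent):
    """Return (minutes per 30-day month, hours per 365-day year) of allowed downtime."""
    unavailable = 1 - sla_percent / 100.0
    minutes_per_month = 30 * 24 * 60 * unavailable
    hours_per_year = 365 * 24 * unavailable
    return minutes_per_month, hours_per_year

for sla in (90.0, 95.0, 99.0, 99.5, 99.9, 99.95):
    per_month, per_year = allowed_downtime(sla)
    print(f"{sla:6.2f}%  ->  {per_month:8.1f} min/month  {per_year:7.1f} h/year")
```

At 95% you are effectively allowing about 36 hours of downtime every single month, which is why such a figure is close to worthless as a guarantee.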

A good and stable environment is really, really hard to achieve and very, very expensive, and it is even more expensive in the cloud (because of the impact of the cloud's other users).

You can find examples of how your environment should look, even for a simple web page, to ensure high availability on AWS, provided by Amazon itself here (PDF) or in the AWS Architecture Center.

Last but not least, never forget about doubling your resources! If you only have one VM of a given type, you cannot guarantee anything. And the second part: you (or rather your admins) need to prepare disaster recovery plans and run regular "fire drills" to make sure the plans stay up to date and actually work.

Ondra Sniper Flidr

This question is probably going to get closed as "too vague" very quickly.

With AWS you can architect a highly available solution, or you can provide a low-reliability one. Any single virtual server is probably fairly reliable, 99.9% or better, but the software you run on it and the monitoring you do will probably be the limiting factor. However, a single machine can't really be called "high availability".

You can use an ELB, geographic load balancing, mirrored databases and servers, and a variety of other techniques to increase reliability. Human error or oversight is probably going to be the limiting factor again.
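To see why mirroring raises the theoretical ceiling, here is a back-of-the-envelope sketch with assumed per-component numbers (illustrative figures, not AWS guarantees): components in series multiply their availabilities, while redundant components in parallel only fail together:

```python
def series(*availabilities):
    """Every component must be up: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """Service is down only if all redundant copies are down at once."""
    down = 1.0
    for a in availabilities:
        down *= (1 - a)
    return 1 - down

single_server = 0.999   # assumed 99.9% for one instance
load_balancer = 0.9999  # assumed 99.99% for the managed load balancer

# One server behind a load balancer: weaker than either part alone.
print(series(load_balancer, single_server))                           # ~0.9989

# Two mirrored servers behind the same load balancer.
print(series(load_balancer, parallel(single_server, single_server)))  # ~0.9999
```

Human error, software bugs and deployments are not captured by this arithmetic, which is why the real figure is usually lower than the theory suggests.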

AWS has an Architecture Center that will help you build a high-availability solution. Making use of software split across multiple availability zones is key - an AZ is effectively a data center, with very high speed links to the other AWS data centers in the region. For example, Amazon RDS (Relational Database Service) can make the same database available in multiple AZs, and you can run multiple compute instances behind a load balancer, so if something goes wrong you should still have a working application. The Architecture Center has sample application patterns for you. I'm an AWS certified solutions architect (associate level); I learned AWS using free online resources - there's a heap of information available.
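As one concrete example of the Multi-AZ RDS point, a Multi-AZ instance can be requested with a single flag through the boto3 SDK; the identifier, credentials, engine and instance class below are placeholders you would replace with your own:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # assumed region

# MultiAZ tells RDS to keep a synchronous standby in another availability zone
# and fail over to it automatically if the primary goes down.
rds.create_db_instance(
    DBInstanceIdentifier="myapp-db",        # placeholder name
    DBInstanceClass="db.t2.small",          # placeholder instance class
    Engine="mysql",
    AllocatedStorage=20,                    # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",         # placeholder; use a real secret store
    MultiAZ=True,
)
```

RDS then maintains a standby copy in a second AZ and fails over to it automatically, so a single AZ outage does not take your database down with it.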

Splitting across different regions is more difficult, as regions are effectively independent, but it is probably required if you want very high availability. This is typically done with Route 53's DNS capabilities, using latency-based routing. Splitting across AZs is adequate for most applications.
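For what latency-based routing looks like in practice, here is a rough boto3 sketch; the hosted zone ID, record name, regions and load-balancer details are all placeholders:

```python
import boto3

route53 = boto3.client("route53")

def upsert_latency_record(region, elb_dns_name, elb_zone_id):
    """Create/update one latency-routed alias record pointing at a regional ELB."""
    route53.change_resource_record_sets(
        HostedZoneId="Z_PLACEHOLDER_ZONE",          # placeholder hosted zone id
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",      # placeholder record name
                    "Type": "A",
                    "SetIdentifier": region,        # must be unique per record
                    "Region": region,               # enables latency-based routing
                    "AliasTarget": {
                        "HostedZoneId": elb_zone_id,
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }]
        },
    )

# One entry per region that hosts a copy of the application (all placeholders).
upsert_latency_record("us-east-1", "myapp-east.us-east-1.elb.amazonaws.com", "ZELB_EAST")
upsert_latency_record("eu-west-1", "myapp-west.eu-west-1.elb.amazonaws.com", "ZELB_WEST")
```

Route 53 then answers each DNS query with whichever regional endpoint is closest to the client, and health checks keep traffic away from a region that is down.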

But, if you need a number, I suggest you say 98%. That's really low availability, but if you don't even know how to approach working this out it may be all you can achieve.

Tim
  • to give you the sense of expected load, the service is just a simple web application serving at most 100 users during US business hours - something a single ec2 machine will be able to handle. – Jay Jan 17 '16 at 05:51
  • 7
  • @Jay If reliability and uptime are a concern, then a single EC2 machine _can't_ handle it. You need multiple machines in different AZs, or preferably different regions. – Mike Scott Jan 17 '16 at 08:28