They want 100% uptime with off-site failover on a web application. From our web application's viewpoint, this isn't an issue. It was designed to be able to scale out across multiple database servers, etc.

However, from a networking standpoint, I just can't seem to figure out how to make it work.

In a nutshell, the application will live on servers within the client's network. It is accessed by both internal and external people. They want us to maintain an off-site copy of the system that, in the event of a serious failure at their premises, would immediately pick up and take over.

Now we know there is absolutely no way to resolve it for internal people (carrier pigeon?), but they want the external users to not even notice.

Quite frankly, I haven't the foggiest idea of how this might be possible. It seems that if they lose Internet connectivity then we would have to do a DNS change to forward traffic to the external machines... Which, of course, takes time.

Ideas?

UPDATE

I had a discussion with the client today, and they clarified the issue.

They stuck by the 100% number, saying the application should stay active even in the event of a flood. However, that requirement only kicks in if we host it for them. They said they would handle the uptime requirement if the application lives entirely on their servers. You can guess my response.

Don't underestimate the huge downtime caused by hacking; look at Sony and the PlayStation Network. You can guarantee they had the same 100% uptime idea and the money/hardware to back it up. Make clear to the client that 100% uptime is an unfeasible expectation; even Google techs would be hesitant to mutter "100% uptime". A hint, btw: look into using dynamic DNS records, which are only cached for 60 seconds; this should include OS and local DNS servers.
– Silverfire, Sep 29 '11 at 0:39

I would personally RUN from this client as fast as possible. I suspect this won't be the last crazy idea they may have (from a technology standpoint).
– GregD, Sep 29 '11 at 0:53

If you figure out 100% uptime, let me know; I'll create a business with it and sell it to Google. It's impossible to guarantee 100%. Even companies like Microsoft, Amazon, or Google won't go that high, because they know it's impossible. The best I've seen is 99.999%, and even that is a stretch (about 5 minutes per year). The best you could probably do reliably is 99.99%.
– Matt, Sep 29 '11 at 4:50

Just put an insanely high price tag on their insane request. That will probably bring them back to their senses. Either that, or it will send them off looking for someone willing to lie to them.
– Nate C-K, Sep 29 '11 at 5:28

27 Answers

Interestingly, only 3 of the top 20 websites were able to achieve the mythical five nines, or 99.999% uptime, in 2007: Yahoo, AOL, and Comcast. In the first 4 months of 2008, some of the most popular social networks didn't even come close to that.

From those figures, it should be evident how ridiculous the pursuit of 100% uptime is...

Pingdom also isn't checking every second. On top of that, the ones that did meet five nines likely still had localized disruptions that Pingdom might not have detected, or glitches that made some services unavailable while still responding to pings.
– ceejayoz, Sep 29 '11 at 1:16

Which in and of itself makes the five nines dubious...
– GregD, Sep 29 '11 at 1:22

Sorry to disturb the chat going on, but the OP's question was how to go about striving towards the goal of 100% uptime on a technical level, not conceptually. I'm sure he knows it's not always possible because of natural occurrences that happen to hardware and the environment. Could we help him with that?
– David d C e Freitas, Sep 29 '11 at 10:11

To the OP: I have seen SLAs that guaranteed uptime in the context of "outside of normal maintenance", normal maintenance being scheduled downtime each month for updates, patches, etc., usually occurring on the least busy day of the month during the least busy times (typically the middle of the night). They must have some type of metrics about their business's busy and quiet periods. You could offer better uptime (four nines) for them only during those times.
– GregD, Sep 29 '11 at 20:23

Ask them to define 100%, how it will be measured, and over what time period. They probably mean as close to 100% as they can afford. Give them the costings.

To elaborate: I've been in discussions with clients over the years with supposedly ludicrous requirements. In all cases, they were actually just using insufficiently precise language.

Quite often they frame things in ways that appear absolute, like 100%, but in actual fact, on deeper investigation, they are reasonable enough to do the required cost/benefit analysis when presented with costings and risk-mitigation data. Asking them how they will measure availability is a crucial question. If they don't know, then you are in the position of having to suggest that this needs to be defined first.

I would ask the client to define what would happen in terms of business impact/costs if the site went down in the following circumstances:

At their busiest hours for x hours

At their least busy hours for x hours

And also how they will measure this.

In this way you can work with them to determine the right level of '100%'. I suspect that by asking these kinds of questions they will be better able to determine the priorities of their other requirements. For example, they may be willing to pay for a certain SLA level and compromise on other functionality in order to achieve it.

Agreed. They may just mean "very high" uptime (upper 90s?) with a pretty solid failover strategy. If not, then an explanation of the cost scale involved would hopefully persuade them...
– Martin Dow, Sep 29 '11 at 10:55

+1 for not jumping to conclusions, and instead just asking the client to explain what they have in mind.
– sleske, Sep 29 '11 at 14:10

I echo the "not jumping to conclusions" statement...if the customer means 100% uptime (minus scheduled maintenance) then it may be more of a reasonable requirement.
– Tim Reddy, Sep 29 '11 at 18:26

Regarding business impact, we actually know and understand their business completely and the costs involved for the site going down are not financial. More along the lines of the natives showing up with pitchforks, potential hangings, etc. ;) Just imagine 40,000 people showing up at your front door screaming. That's what they want to avoid with a passion.
– NotMe, Sep 29 '11 at 22:43

@ChrisLively All the more reason to have a mature understanding of risk then. The dominant paradigm for safety engineering is probabilistic risk assessment. There are systems that could kill (not just annoy) thousands of people and they still have a low, hopefully well understood, but non-zero probability of failure.
– poolie, Sep 29 '11 at 23:41

Your clients are crazy. 100% uptime is impossible no matter how much money you spend on it. Plain and simple - impossible. Look at Google, Amazon, etc. They have nearly endless amounts of money to throw at their infrastructure, and yet they still manage to have downtime. You need to deliver that message to them, and if they continue to insist, push them to make reasonable demands. If they don't recognize that some amount of downtime is inevitable, then ditch 'em.

That said, you seem to have the mechanics of scaling/distributing the application itself covered. The networking portion will need to involve redundant uplinks to different ISPs, getting an ASN and an IP allocation, and getting neck-deep in BGP and real routing gear so that your IP address space can move between ISPs if need be.

This is, quite obviously, a very terse answer. You haven't had experience with applications requiring this degree of uptime, so you really need to get a professional involved if you want to get anywhere close to the mythical 100% uptime.

@ErikA A request for 100% uptime is indicative of ignorance of technical characteristics of systems. That's ok, because the customer's job is doing whatever they do. Your job is to engineer IT systems. Difficult customers like this can be nightmares, but they can also become your best customers.
– duffbeer703, Sep 30 '11 at 13:04

Well, that's definitely an interesting one. I'm not sure I would want to get myself contractually obligated to 100% uptime, but if I had to I think it would look something like this:

Start with the public IP on a load balancer completely outside the client's network, and build at least two of them so that one can fail over to the other. A program like Heartbeat can help with the automatic failover of those.

Varnish is primarily known as a caching solution, but it does some very decent load balancing as well. Perhaps that would be a good choice to handle the load balancing. It can be set up with 1 to n backends, optionally grouped in directors, which will load balance either randomly or round-robin. Varnish can be made smart enough to check the health of every backend and drop unhealthy backends out of the loop until they come back online. The backends do not have to be on the same network.
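
To make the idea concrete, here is a rough sketch in Python of that health-checked rotation. This is just an illustration of the concept, not Varnish's actual VCL; the backend addresses and the /health endpoint are made-up assumptions.

    import itertools
    import urllib.request

    # Hypothetical backend pool -- in Varnish these would be backend and
    # director declarations in VCL; the addresses here are made up.
    BACKENDS = ["http://10.0.0.10:8080", "http://203.0.113.5:8080"]
    _rotation = itertools.cycle(BACKENDS)

    def healthy(backend, timeout=2.0):
        """Probe a backend's health-check URL; any error counts as unhealthy."""
        try:
            with urllib.request.urlopen(backend + "/health", timeout=timeout) as r:
                return r.status == 200
        except OSError:
            return False

    def pick_backend():
        """Round-robin over the pool, skipping backends that fail the probe."""
        for _ in range(len(BACKENDS)):
            candidate = next(_rotation)
            if healthy(candidate):
                return candidate
        raise RuntimeError("no healthy backends")

Varnish does the probing in the background rather than per request, but the effect is the same: unhealthy backends drop out of the rotation until they recover.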

I'm kind of in love with the Elastic IPs in Amazon EC2 these days so I would probably build my load balancers in EC2 in different regions or at least in different availability zones in the same region. That would give you the option of manually (god forbid) spinning up a new load balancer if you had to and moving the existing A record IP to the new box.

Varnish cannot terminate SSL, though, so if that is a concern you may want to look at something like Nginx instead.

You could have most of your backends in your client's network and one or more outside it. I believe, but am not 100% sure, that you can prioritize the backends so that your client's machines would receive priority until such time as all of them became unhealthy.

That's where I would start if I had this task and undoubtedly refine it as I go along.

However, as @ErikA states, it's the Internet and there are always going to be parts of the network that are outside your control. You'll want to make sure your legal agreement only ties you to things that are under your control.

For a while I was thinking about Amazon and MS for a cloud deployment but both of them have had major outages over the past couple of months. SSL is critical.
– NotMe, Sep 29 '11 at 1:57

If you were going to use Amazon, you would definitely want to spread your machines out around the 5 availability zones. It's pretty unlikely that all their zones would go out at the same time.
– jdw, Sep 29 '11 at 12:10

You will always have a point of failure, jdw, as long as there's a non-distributed thing in the chain (in your case Heartbeat, unless of course you have multiple instances of it running on remote machines, all monitoring each other as well as your servers, any of which they may or may not be able to see because of network trouble along the route). Which brings us to "downtime": the servers may be up and running and still unavailable to the client, without Heartbeat ever detecting it, if the failure is not in the monitor's own routing path.
– jwenting, Sep 30 '11 at 9:21

Agreed. As EVERYONE else has pointed out, there's no such thing as 100% uptime. All you can do is try, and what I described is how I would start trying.
– jdw, Sep 30 '11 at 10:13

+1 for noting that 100% is not 100.0% or 100.000%, etc. The decimal digits matter; they indicate precision ;)
– РСТȢѸФХѾЦЧШЩЪЫЬѢѤЮѦѪѨѬѠѺѮѰѲѴ, Sep 29 '11 at 13:13

By some conventions, "100%" has only one significant figure, such that all numbers between one-half and one would round to "100%"; 50% would round to 100%.
– Thomas Levine, Sep 29 '11 at 23:42

Depending on the counting convention, some will say that 50% has two significant digits where 100% has three; 50.5 and 100 are therefore just as precise. Others will count digits after the decimal point; then 50.5 and 100.4 are just as accurate. If nothing else is stated, I would assume that 100% means 99.5% and up, 100.0% means 99.95% and up, etc.
– Tillebeck, Oct 18 '11 at 8:28

I don't understand what the issue is. The client wants you to plan for disaster, and they aren't math oriented, so asking for 100% probability sounds reasonable. The engineer, as engineers are prone to do, remembered his first day of prob&stat 101, without considering that the client might not.
When they say this, they aren't thinking about nuclear winter, they are thinking about Fred dumping his coffee on the office server, a disk crashing, or an ISP going down.
Furthermore, you can accomplish this. With geographically distinct, independent, self-monitoring servers, you will basically have no downtime. With 3 servers operating at an independent(1) three-nines reliability, with good failover modes, your expected downtime is under a second per year(2); see the quick calculation after the footnotes.
The client still has to deal with doomsday scenarios, but Godzilla excluded, he will have a service that is "always" up.

(1) A server in LA is reasonably independent from the server in Boston, but yes, I understand that there is some intersection involving nuclear war, Chinese hackers crashing the power grid, etc. I don't think your client will be upset by this.

(2) DNS failover may add a few seconds. You are still in a scenario where the client has to retry a request once a year, which is, again, within a reasonable SLA, and not typically considered in the same vein as "downtime". With an application that automatically reroutes to an available node on failure, this can be unnoticeable.
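
To put numbers on that, here is the back-of-the-envelope calculation behind the "under a second per year" claim, assuming the fully independent failures that footnote (1) already concedes are an approximation:

    # Three independent servers, each with "three nines" (99.9%) availability.
    # The service is only down when all three are down at once (assuming
    # perfect, instant failover).
    per_server_downtime = 0.001              # 1 - 0.999
    all_down = per_server_downtime ** 3      # 1e-9
    seconds_per_year = 365.25 * 24 * 3600

    print(f"P(all three down) = {all_down:.0e}")
    print(f"Expected downtime: {all_down * seconds_per_year:.2f} s/year")
    # -> roughly 0.03 seconds per year, comfortably under one second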

The problem is that they're saying it in contract-ese. Meaning that if a disaster does occur and you need more than ten seconds to bring the site back online from backups, they'd have standing to sue.
– Shadur, Oct 2 '11 at 14:16

@Shadur: If they really want it, then you must really charge them. Spread the servers geographically far and wide, hopefully there will not be disaster everywhere.
– Jungle Hunter, Oct 3 '11 at 2:49

I've seen a site that offered 100% uptime guarantees or your money back. The trick was they charged a boatload and partitioned into months. So some months go unpaid and you schedule everything around that, and cover the loss with the months that work out okay.
– jldugger, Oct 3 '11 at 16:38

he could be smarter than all their people combined, who knows :p
– Matt, Sep 29 '11 at 5:40

100% uptime doesn't have to be so literal, people. It means 100% available during the time that it's needed. For example, bank systems should always be available, and they do quite well. Just because they go down for maintenance for 1 second once a year doesn't mean they failed at their 100% uptime goal.
– David d C e Freitas, Sep 29 '11 at 10:14

Review the other answers here, sit down with your client, and explain WHY it's impossible, and gauge their response.

If they still insist on 100% uptime, politely inform them that it cannot be done and decline the contract. You will never meet their demand, and if the contract doesn't totally suck you'll get skewered with penalties.

100% needs to be defined, i.e., 100% available except when doing maintenance or upgrades, with that time limited to a few quiet hours a month at most. It all depends on what the purpose and usage of the web app is in this case...
– David d C e Freitas, Sep 29 '11 at 10:18

And define "downtime". You can't even in theory guarantee they'll be able to access a server in Omaha from their offices in Fairbanks unless you control the entire network in between (though you could give assurances about the server itself being up and running).
– jwenting, Sep 30 '11 at 9:23

The definitions are, IMHO, irrelevant if they ask for "100% uptime": even if you negotiate scheduled maintenance and build in N+N redundancy, one minor glitch causing an unscheduled reboot or service blip means you've blown your SLA. DEFINITELY relevant if you're negotiating a 3-, 4-, or 5-nines SLA though.
– voretaq7♦, Sep 30 '11 at 14:38

Depends on the terms of the SLA though, doesn't it? If you get paid $100K per month and every minute of downtime carries a $1K penalty, that might be entirely doable (if you have other contracts to amortize the cost of 24/7 on-site sysadmins).
– Michael Borgwardt, Sep 30 '11 at 23:49

@MichaelBorgwardt there are definitely ways to "make it work" from a pure numbers standpoint, but I'd still decline because of potential for bad PR ($_CLIENT goes on Twitter and tells the world 'we're down because $_PROVIDER is incompetent and can't meet their SLA!'). Personally I'd rather have 10 smaller, more reasonable clients pay me $10k a month :-)
– voretaq7♦, Oct 1 '11 at 4:06

Price accordingly, and then stipulate in the contract that any downtime past the SLA will be refunded at the rate they are paying.

The ISP at my last job did that. We had the choice of a "regular" DSL line at 99.9% uptime for $40/mo, or a bonded trio of T1s at 99.99% uptime for $1100/mo. There were frequent outages of 10+ hours per month, which brought their uptime well below the $40/mo DSL, yet we were only refunded around $15 or so, because that's what the rate per hour * hours ended up at. They made out like bandits from the deal.

If you bill $450,000 a month for 100% uptime and you only hit 99.999%, prorating the same way means you'll only need to refund them about $4.50 for the roughly 26 seconds of downtime. I'm willing to bet the infrastructure costs to hit 99.999% are in the neighborhood of $45,000 a month, assuming fully distributed colos, multiple tier 1 uplinks, fancypants hardware, etc.
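
For what it's worth, the proration model from that anecdote is easy to reproduce; a small sketch, using the dollar figures from above:

    def prorated_refund(monthly_fee, downtime_fraction):
        """Refund the slice of the monthly fee corresponding to downtime,
        which is the model the ISP in the anecdote used."""
        return monthly_fee * downtime_fraction

    # The bonded-T1 anecdote: $1100/mo, ~10 hours of outage in a ~730-hour month.
    print(prorated_refund(1100, 10 / 730))        # ~ $15

    # The hypothetical $450,000/mo contract missing 100% by hitting 99.999%:
    print(prorated_refund(450_000, 1 - 0.99999))  # ~ $4.50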

If you see anybody promising 100% uptime then this is exactly what they are doing. There's a difference between promising 100% uptime and delivering it. It would be a good idea to explain this to the client if they try to quote a competitor's SLA to you.
– sjbotha, Sep 30 '11 at 13:30

You will not meet a 100% availability goal for an extended period of time. You may get away with it for a week or a year, but then something will happen and you will be held responsible. The fallout can range from a damaged reputation (you promised, you didn't deliver) to bankruptcy from contractual fines.

I would communicate with the client to establish with them what exactly 100% uptime means. It is possible they don't really see a distinction between 99% uptime and 100% uptime. To most people (i.e., not server admins) those two numbers are the same.

In my experience, only two kinds of client ask for 100% uptime: People with absolutely no knowledge about computers, computer systems, or the Internet.

And ones who are intentionally making an ass of themselves, either to test your ability to say no (Google "the Orange Juice Test") or to gain some kind of contract SLA leverage in order to get out of paying you later.

My advice, having suffered both of these types of clients on many occasions, is to not take this client. Let them drive someone else insane.

Just define 'uptime' to be relative to the entire bundle of service you can actually keep operational 100% of the time, and you should have no problems.

Also, it's worth pointing out that the entire point of an SLA is to define what your obligations are and what happens if you can't meet them. It doesn't matter if the client asks for 3 nines or 5 nines or a million nines - the question is what they get when/if you can't deliver. The obvious answer is to provide a line item for 100% uptime at 5x the price you want to charge, and then they get a 4x refund if you miss that target. You might score!

DNS changes only take time if they are configured to take time. You can set the TTL on a record to one second - your only issue would be to ensure that you provide a timely response to DNS queries, and that the DNS servers can cope with that level of queries.

This is exactly how GTM works on F5 BIG-IP: the DNS TTL is set to 30 seconds by default, and if another member of the cluster needs to take over, the DNS is updated and the new IP is picked up almost immediately. Maximum of 30 seconds of outage, and that is the edge case; the average would be 15 seconds.
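
If you go this route, it's worth checking what TTL caching resolvers actually hand back, since (as the comments below note) some of them ignore low TTLs. A quick probe, assuming the third-party dnspython package is installed:

    # pip install dnspython
    import dns.resolver

    answer = dns.resolver.resolve("example.com", "A")  # .query() on dnspython < 2.0
    print("addresses:", [r.address for r in answer])
    print("TTL as served:", answer.rrset.ttl, "seconds")
    # Ask the same caching resolver twice, a few seconds apart: if the TTL
    # doesn't count down (or jumps to a large value), that cache is
    # ignoring your low TTL.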

It's been my experience that some DNS servers will disregard a TTL that they consider to be obnoxiously low (in spite of the RFC). Anything less than 5 minutes becomes somewhat unreliable in the global scale.
– jdw, Sep 29 '11 at 0:59

I'm with jdw on this. I've seen numerous DNS servers completely ignore TTL, even a 1 hr setting and default back to something like 24 hours or so.
– NotMe, Sep 29 '11 at 1:56

@Paul - the OP doesn't have control over every ISP's DNS resolvers on the planet. Ergo, they don't get the choice to say "if you're going to use our website, do not use Comcast/Roadrunner/whomever as your ISP because they will ignore our TTL settings". It's something that is simply out of their control and is therefore too fragile to be considered a solution for this problem IMHO. The solution has to include some way to be able to internally force the IPs around without relying on other bits of the network that may not be cooperative.
– jdw, Sep 29 '11 at 10:12

That's kind of like not having a UPS because the power 'should just work'. It's not a forward thinking way to architect a system. If you know that there is a fragile part of the system, for whatever reason, you should try to account for it.
– jdw, Sep 29 '11 at 16:39

Yes, DNS is a good start - e.g., nslookup google.com returns 6 different IPs for redundancy in case some of them don't work. Also check out RobTex.com, a great site for looking at the configurations of certain domains, e.g., robtex.com/dns/google.com.html#records
– David d C e Freitas, Sep 29 '11 at 10:21

While I doubt 100% is possible, you may want to consider Azure (or something with a similar SLA) as a possibility. Here's what goes on:

Your servers are virtual machines. If there's ever a hardware issue on one server, your virtual machine is moved to a new machine. The load balancer takes care of the redirection, so the customer should not see any downtime (though I'm not sure how your session state would be affected).

That said, even with this fail-over, the difference between 99.999 and 100 borders on insanity.

You'll have to have full control over the following factors.
- Human factors, both internal and external, both malice and impotence. An example of this is somebody pushing something to production code that brings down a server. Even worse, what about sabotage?
- Business issues. What if your provider goes out of business, forgets to pay their electric bills, or simply decides to stop supporting your infrastructure without sufficient warning?
- Nature. What if unrelated tornadoes simultaneously hit enough data centers to overwhelm backup capacity?
- A completely bug free environment. Are you sure there isn't an edge case with some third party or core system control that hasn't manifested itself but still could do so in the future?
- Even if you have full control over the above factors, are you sure the software/person monitoring this won't present you with false negatives when checking if your system is up?

Azure and EC2 have both recently had near complete and total failures. I believe Azure was recently taken down simply due to a bad config entry on a DNS server. Either way, thanks for the info.
– NotMe, Sep 29 '11 at 19:11

and if your load balancer (which does the switching) goes down unnoticed (its monitor could also be down unnoticed, ad infinitum) when the node goes down, you're still screwed.
– jwenting, Sep 30 '11 at 9:26

I think you meant 'incompetence.' 'Impotence' shouldn't have a great deal of impact on the IT staff's ability to do their jobs.
– mfinni, Sep 30 '11 at 12:58

Honestly, 100% is completely insane without at least a waiver in the terms for things like hacking attacks. Your best bet is to do what Google and Amazon do and use a geo-distributed hosting solution, with your site and DB replicated across multiple servers in multiple geographic locations. This will guarantee it in anything but a major disaster, such as the internet backbone being cut to a region (which does happen from time to time) or something nearly apocalyptic.

I would put in a clause for just such cases (DDoS, internet backbone cuts, apocalyptic terrorist attacks, a big war, etc.).

Other than that, look into Amazon S3 or Rackspace cloud services. Essentially the cloud setup will not just offer redundancy in each location, but also scalability and geo-distribution of traffic, along with the ability to redirect around failed geo-areas. Though my understanding is that the geo-distribution costs more money.

I just wanted to add another voice to the "it can (theoretically) be done" party.

I wouldn't take on a contract that had this specified no matter how much they paid me, but as a research problem, it has some rather interesting solutions. I'm not familiar enough with networking to outline the steps, but I imagine a combination of network configurations, electrical/hardware wiring failovers, and software failovers could, in some configuration or other, actually pull it off.

There's almost always a single point of failure somewhere in any configuration, but if you work hard enough, you can push that point of failure to be something that can be repaired "live" (e.g., the root DNS goes down, but the values are still cached everywhere else, so you have time to fix it).

Again, not saying it's feasible... I just didn't like that not a single answer addressed the fact that it isn't "way out there"; it's just not something they'd actually want if they thought it through.

Re-think your methodology of measuring availability, then work with your customer to set meaningful targets.

If you are running a large website, uptime is not a useful metric at all. If you drop queries for 10 minutes when your customers need you most (at the traffic peak), it could be more damaging to the business than an hour-long outage at 3 AM on a Sunday.

Sometimes large web companies measure availability, or reliability, using the following metrics:

percentage of queries that are answered successfully, without a server-side error (HTTP 500s).

percentage of queries that are answered below a certain target latency.

dropped queries should count against your stats (see below).

Availability should not be measured using sample probes, which is all that an external entity such as Pingdom or Pingability is able to report. Don't rely solely on that. If you want to do it right, every single query should count. Measure your availability by looking at your actual, perceived success.

The most efficient way is to collect logs or stats from your load-balancer and calculate the availability based on the metrics above.

The percentage of dropped queries should also count against your stats. It can be counted in the same bucket as server-side errors. If there are problems with the network or with other infrastructure such as DNS or the load balancers, you can use simple math to estimate how many queries you lost. If you expected X queries for that day of the week but you got X-1000, you probably dropped 1000 queries. Plot your traffic as queries-per-minute (or per-second) graphs. If gaps appear, you dropped queries. Use basic geometry to measure the area of those gaps, which gives you the total number of dropped queries; the sketch below shows the idea.
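
As an illustration of that bookkeeping (a sketch only; the log layout and the expected-traffic baseline here are stand-ins for whatever your load balancer and forecasting actually give you):

    # Availability from load-balancer stats, per the method above.
    # minute_counts maps each minute to (total_queries, server_errors);
    # expected maps each minute to the traffic forecast for that day of week.
    def availability(minute_counts, expected):
        served = errors = dropped = 0
        for minute, forecast in expected.items():
            total, http_500s = minute_counts.get(minute, (0, 0))
            served += total
            errors += http_500s
            # Any gap below forecast counts as dropped queries -- the "area"
            # of the gap in the queries-per-minute graph, minute by minute.
            dropped += max(0, forecast - total)
        good = served - errors
        return good / (served + dropped)

    counts = {"12:00": (950, 5), "12:01": (0, 0), "12:02": (1020, 2)}  # 12:01 = outage
    forecast = {"12:00": 1000, "12:01": 1000, "12:02": 1000}
    print(f"{availability(counts, forecast):.2%}")  # ~65% for this bad stretch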

Discuss this methodology with your customer and explain its benefits. Set a base-line by measuring their current availability. It will become clear to them that 100% is an impossible target.

Then you can sign a contract based on improvements over the baseline. Say, if they are currently experiencing 95% availability, you could promise to cut their downtime by more than two thirds, from 5% to 1.5%, by getting to 98.5%.

Note: there are disadvantages to this way of measuring availability. First, collecting logs, processing and generating the reports yourself may not be trivial, unless you use existing tools to do it. Second, application bugs may hurt your availability. If the application is low quality, it will serve more errors. The solution to this is to only consider the 500s created by the load-balancer instead of those coming from the application.

Things may get a bit complicated this way, but it's one step beyond measuring just your server uptime.

Go grab a book on manufacturing quality control using statistical sampling. The general discussion in such a book, the concepts of which any manager would have been exposed to in a college statistics course, dictates that the costs to go from 1 exception in a thousand, to 1 in ten thousand, to 1 in a million, to 1 in a billion, rise exponentially. Essentially, the ability to hit 100% uptime would cost an almost unlimited amount of money, rather like the amount of fuel required to push an object to the speed of light.

From a performance engineering perspective, I would reject the requirement as both untestable and unreasonable; the expression is more of a desire than a true requirement. With the dependencies that exist outside of any application, for networking, name resolution, and routing, and with defects propagated from underlying architectural components or development tools, it becomes a practical impossibility for anyone to guarantee 100% uptime.

While some people noted here that 100% is insane or impossible, they somehow missed the real point. They argued that the reason for this is the fact that even the best companies/services cannot achieve it.

Well, it's a lot simpler than that. It's mathematically impossible.

Everything has a probability. There could be a simultaneous earthquake at all of the locations where you store your servers, destroying all of them. Agreeably, it's a ridiculously small probability, but it's not 0. All your internet providers could face a simultaneous terrorist/cyber attack. Again, not very probable, but not zero either. Whatever you provide, there is a non-zero-probability scenario that brings the whole service down. Because of this, your uptime cannot be 100% either.

I don't think the customer is actually asking for 100% uptime, or even 99.999% uptime. If you look at what they're describing, they're talking about picking up where they left off if a meteor takes out their on-site datacenter.

If the requirement is that external people not even notice, how drastic does that have to be? Would making an Ajax request retry and show a spinner for 30 seconds to the end user be acceptable?

Those are the kinds of things the customer cares about. If the customer was actually thinking of precise SLAs, then they would know enough to express it as 99.99 or 99.999.
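
(For what it's worth, that retry-and-spinner behavior is trivial to express. Here is a Python sketch of the equivalent logic, with a made-up URL, though in practice it would live in browser-side JavaScript:)

    import time
    import urllib.request

    def fetch_with_retry(url, deadline_s=30.0, interval_s=2.0):
        """Keep retrying until the deadline; the user just sees a spinner.
        If failover completes within the window, the outage is invisible."""
        start = time.monotonic()
        while True:
            try:
                with urllib.request.urlopen(url, timeout=5) as r:
                    return r.read()
            except OSError:
                if time.monotonic() - start > deadline_s:
                    raise  # give up -- now the user actually notices downtime
                time.sleep(interval_s)  # spinner keeps spinning

    # data = fetch_with_retry("https://app.example.com/api/status")  # hypothetical URL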

If the customer thinks they want "100% uptime" and that's what ends up in the contract verbiage, you might get held to it if it ends up in court. Best to talk it out and help the customer understand what they really want instead of assuming you know what they're thinking.
– Chris S♦, Sep 30 '11 at 19:35

Oh, I agree this needs to be cleared up before it gets into a contract. I'm just saying it needs to be approached as "the client isn't communicating what they actually want", as opposed to "the client is asking for something ridiculous".
– Kevin Peterson, Sep 30 '11 at 20:43

My 2 cents: I was responsible for a very popular web site for a Fortune 5 company that would take out ads during the Super Bowl. I had to deal with huge spikes in traffic, and the way I solved it was to use a service like Akamai. I do not work for Akamai, but I found their service extremely good. They have their own, smarter DNS system that knows when a particular node/host is under heavy load or down, and can route traffic accordingly.

The neat thing about their service was that I didn't really have to do anything very complicated in order to replicate content from servers in my own data center to their data centers. Additionally, I know from working with them that they made heavy use of Apache HTTP servers.

While not 100% uptime, you may want to consider such options for dispersing content around the world. As I understood things, Akamai also had the ability to localize traffic, meaning that if I was in Michigan, I got content from a Michigan/Chicago server, and if I was in California, I supposedly got content from a server based in California.

-1 because this is a practical answer but not useful at all. All questions in this site could be answered by "hire someone else to do it", but that is not why we are here.
– Yves Junqueira, Oct 2 '11 at 17:33

I beg to differ. "Not useful at all?" It was most certainly useful for me and contrary to your "hire someone else to do it" comment, I suppose with your reasoning the guy should trench his own fiber optic cable and design his own switches rather than buy them too? Are you serious, Yves? You sound like someone who has not spent much time in the IT field.
– Kilo, Oct 3 '11 at 0:33

Instead of off-site failover, just run the application from two locations simultaneously, internal and external, and synchronise the two databases... Then if the internal site goes down, internal people will still be able to work and external people will still be able to use the application. When the internal site comes back online, synchronise the changes. You can have two DNS entries for one domain name, or even a network router doing round robin.

For externally hosted sites, the closest you'll get to 100% uptime is hosting your site on Google's App Engine and using its high replication datastore (HRD), which automatically replicates your data across at least three data centers in real time. Likewise, the App Engine front-end servers are auto scaled/replicated for you.

However, even with all of Google's resources and the most sophisticated platform in the world, the App Engine SLA uptime guarantee is only "99.95% of the time in any calendar month."