Thinking back to the Cassini search engine rewrite project, there are a lot of similarities, as one would expect. (Without being specific, there are thousands of servers involved in eBay search.) Search at eBay should never go down – ever. There are always customers in some time zone wanting to search eBay. And if there is a problem, the system should recover as quickly as possible.

So the direction Cloud Foundry is going makes complete sense to me. You want resilience against all sorts of failures – losing a machine, losing a rack, ideally even losing a data center. This requires the platform to know enough about your application to know what to restart and when. I can see more and more of this functionality becoming a commodity in the future. The more standardization there is, the less applications need to worry about it. You don’t need thousands of servers for this infrastructure to be useful. Even a small site needing only a few servers can still benefit from HA by recovering from hardware failures without human involvement.

But is that enough for complete HA?

For Cassini, no. There is an additional level of HA not mentioned, which to me is the next level of maturity: migrating an application between versions of the software and/or database schema without downtime. The easiest way to upgrade is to take the site down, perform the upgrade, then bring the site back online. For eBay search, this is not an option. Instead, new software deployments roll out across the cluster incrementally, making sure that, say, no more than 10% of the cluster is ever down at the same time. A database schema change may take three steps. First, roll out code that supports both the current and the next generation of the schema. Second, once that code is running everywhere, change the database schema. Third, once the schema change is complete, roll out the final version of the code, which supports only the new schema. (You don’t want to support every previous version of the schema if you can avoid it, because the code can get a bit messy.)
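To make the incremental rollout idea concrete, here is a minimal sketch of it in Python. It is not how eBay or Cloud Foundry actually does it. The function names (`rolling_batches`, `rolling_deploy`) and the `upgrade` and `is_healthy` callbacks are hypothetical, and the 10% cap is just the example figure used above.

```python
import math


def rolling_batches(servers, max_down_fraction=0.10):
    """Split servers into batches no larger than the allowed fraction.

    Always allows at least one server per batch, so a very small
    cluster can still be upgraded (at the cost of exceeding the cap).
    """
    if not servers:
        return []
    batch_size = max(1, math.floor(len(servers) * max_down_fraction))
    return [servers[i:i + batch_size]
            for i in range(0, len(servers), batch_size)]


def rolling_deploy(servers, upgrade, is_healthy, max_down_fraction=0.10):
    """Upgrade one batch at a time, stopping if a batch comes back unhealthy.

    `upgrade` and `is_healthy` are hypothetical callbacks supplied by
    the caller; a real system would take servers out of rotation,
    install the new version, and run health checks here.
    """
    for batch in rolling_batches(servers, max_down_fraction):
        for server in batch:
            upgrade(server)
        unhealthy = [s for s in batch if not is_healthy(s)]
        if unhealthy:
            # Halt so that the rest of the cluster keeps serving traffic
            # on the old version.
            raise RuntimeError(f"rollout halted, unhealthy: {unhealthy}")
```

Stopping at the first unhealthy batch is the design choice that matters: a bad release then affects at most one batch rather than the whole cluster.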

Bottom line: HA for resilience against hardware failures is a great thing to get standardized. Over time I expect more hosting providers to offer this level of sophistication. Customers of course need to move along with this trend, and I think standardization will be necessary to increase adoption. Beyond that, standardizing patterns such as progressive software deployments (and schema migration) is the next exciting step to me. That will require logic changes in most applications – it cannot be solved by the platform alone. How mainstream will this be? That is not clear to me. I do not know how many customers will need this extra application complexity, versus just taking the site down occasionally for short periods of time to perform such upgrades.
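As an illustration of the kind of application logic change this implies, here is a small sketch of code that tolerates two schema generations at once, as the intermediate release in a schema migration would have to. The column names (`item_name` for the current schema, `title` for the next one) and the function `read_title` are invented for the example and do not come from any real eBay schema.

```python
def read_title(row):
    """Return a listing title from a row in either schema generation.

    Assumed (hypothetical) schemas:
      current: {"item_name": ...}
      next:    {"title": ...}
    """
    if "title" in row:
        # Row has already been migrated to the next-generation schema.
        return row["title"]
    # Row is still in the current schema.
    return row["item_name"]
```

Once every row has been migrated, the fallback branch can be deleted in the final release, which is the cleanup step that keeps old schema versions from piling up in the code.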