3
Microsoft Live Platform Services
Motivation
- System-to-admin ratio is an indicator of admin costs
  - Tracking total ops costs is often gamed
    - Outsourcing halves ops costs without addressing real issues
  - Inefficient properties: 2:1
  - Average property: 150:1 (enterprises typically in the 70 to 140 range)
  - Best services: over 2,000:1
- 80% of ops issues come from design and development
  - Poorly written applications are difficult to automate
- Focus on reducing ops costs in early development

7
Design for auto-mgmt & provisioning
- Support for geo-distribution
- Auto-provisioning & auto-installation mandatory
- Manage "service role" rather than servers
- Multi-system failures are common
  - Limit automation range of action
- Never rely on local, non-replicated persistent state
- Don't worry about clean shutdown
  - Often won't get it & need this path tested
- Explicitly install everything and then verify
- Force fail all services and components regularly
- Application rollback supported & tested before deployment
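One way to picture "limit automation range of action" is a guard that caps how many servers of a service role automation may take out at once and hands anything beyond that to a person. The sketch below is illustrative only: the names `RemediationGuard` and `plan_restarts` and the 5% cap are assumptions, not part of the original slides.

```python
from dataclasses import dataclass


@dataclass
class RemediationGuard:
    """Caps how much of a service role automation may act on at once."""
    fleet_size: int
    max_percent: int = 5  # hypothetical cap: at most 5% of the role

    def allowed(self, servers_in_repair: int, requested: int) -> bool:
        """Return True if taking `requested` more servers stays within the cap."""
        limit = self.fleet_size * self.max_percent // 100
        return servers_in_repair + requested <= limit


def plan_restarts(guard, unhealthy, servers_in_repair=0):
    """Split unhealthy servers into automatic restarts and human escalations."""
    auto, escalate = [], []
    for server in unhealthy:
        if guard.allowed(servers_in_repair + len(auto), 1):
            auto.append(server)
        else:
            # Past the cap this looks like a multi-system failure,
            # so automation stops and a person decides.
            escalate.append(server)
    return auto, escalate
```

With a role of 100 servers and eight reported unhealthy, this sketch restarts five automatically and escalates the remaining three, so a faulty health check cannot drain the whole role.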

8
Design for incremental release
- Incrementally release with schema changes?
  - Old code must run against new schema, or
  - Two-phase process (avoid if possible)
    - Update code to support both, commit changes, and then upgrade schema
- Incrementally release with user experience (UX) changes?
  - Separate UX from infrastructure
  - Ensure old UX works with new infrastructure
  - Deploy infrastructure incrementally
  - On success, bring a small beta population onto new UX
  - On continued success, announce new UX and set a date to roll out
- Client-side code?
  - Ensure old and new clients can both run against new infrastructure
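The two-phase schema process can be sketched as follows. Phase one ships code that reads both the old and the new row shape; only after that code is deployed everywhere does phase two upgrade the data. The column names (`name`, `first_name`, `last_name`) and both function names are hypothetical, chosen only to make the example concrete.

```python
def read_display_name(row: dict) -> str:
    """Phase-one code: accepts rows in either the old or the new schema."""
    if "first_name" in row:  # new schema (hypothetical columns)
        return f"{row['first_name']} {row['last_name']}".strip()
    return row["name"]  # old schema (hypothetical column)


def migrate_row(row: dict) -> dict:
    """Phase-two upgrade: split `name` into `first_name` and `last_name`."""
    if "first_name" in row:
        return row  # already migrated, so the upgrade can be re-run safely
    first, _, last = row["name"].partition(" ")
    rest = {k: v for k, v in row.items() if k != "name"}
    return {**rest, "first_name": first, "last_name": last}
```

Because the reader tolerates both shapes, rows can be migrated incrementally while the service stays up, and a partially migrated table is still readable.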

9
Graceful degradation & admission control
- No amount of "head room" is sufficient
  - Even at 25% to 50% hardware utilization, spikes will exceed 100%
- Prevent overload through admission control
- Graceful degradation prior to admission control
  - Find less resource-intensive modes to provide degraded services
- Related concept: Metered rate-of-service admission
  - Service login typically more expensive than steady state
  - Allow a single or small number of users in when restarting a service after failure
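A minimal sketch of the ideas on this slide, under stated assumptions: `admit` degrades service before it refuses work, and `MeteredLoginGate` meters logins with a simple token bucket so that a restarted service lets users back in gradually. The thresholds (0.7 and 0.9), the class and function names, and the choice of a token bucket are all illustrative assumptions rather than anything the slides prescribe.

```python
from enum import Enum


class Decision(Enum):
    FULL = "full"
    DEGRADED = "degraded"
    REJECT = "reject"


def admit(utilization: float, degrade_at: float = 0.7,
          reject_at: float = 0.9) -> Decision:
    """Degrade first; refuse new work only when degrading is not enough."""
    if utilization >= reject_at:
        return Decision.REJECT
    if utilization >= degrade_at:
        return Decision.DEGRADED
    return Decision.FULL


class MeteredLoginGate:
    """Token bucket that limits how fast logins are admitted after a restart."""

    def __init__(self, logins_per_second: float, burst: int = 1):
        self.rate = logins_per_second
        self.burst = burst
        self.tokens = float(burst)  # lets `burst` users in immediately
        self.last = 0.0             # time of the previous call, in seconds

    def try_login(self, now: float) -> bool:
        """Return True if a login arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Time is passed in explicitly so the gate is easy to test; a real service would more likely read a monotonic clock and raise the rate as the restarted service warms up.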