Unpredictable Growth
The TerraServer story:
- Expected 5 M hits per day
- Got 50 M hits on day 1
- Peaked at 20 M hits/day on a "hot" day
- Averaged 5 M hits/day over the last 2 years
Most of us cannot predict demand:
- Must be able to deal with NO demand
- Must be able to deal with HUGE demand

Web Services Requirements
- Scalability: need to be able to add capacity
  - New processing
  - New storage
  - New networking
- Availability: need continuous service
  - Online change of all components (hardware and software)
  - Multiple service sites
  - Multiple network providers
- Agility: need great tools
  - Manage the system
  - Change the application several times per year
  - Add new services several times per year

Premise: Each Site Is a Farm
- Buy computing by the slice (brick): a rack of servers + disks
- Functionally specialized servers
- Grow by adding slices
  - Spread data and computation to new slices
- Two styles:
  - Clones: anonymous servers
  - Parts+Packs: partitions fail over within a pack
- In both cases, a GeoPlex remote farm for disaster recovery

Scaleable Systems: Scale Up vs. Scale Out
- Scale up: grow by adding components to a single system.
- Scale out: grow by adding more systems.

Scale Up and Scale Out: Everyone Does Both
The choices:
- Whose software?
  - Scale-up and scale-out both have a large software component
- Size of a brick:
  - 1 M$/slice: IBM S390? Sun E10000?
  - 100 K$/slice: Wintel 8X
  - 10 K$/slice: Wintel 4X
  - 1 K$/slice: Wintel 1X
- Clones or partitions?
- Size of a pack

Clone Requirements
- Automatic replication (if they have any state)
  - Applications (and system software)
  - Data
- Automatic request routing
  - Spray or sieve (see the sketch below)
- Management:
  - Who is up?
  - Update management & propagation
  - Application monitoring
- Clones are very easy to manage:
  - Rule of thumb: 100s of clones per admin
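To make "spray or sieve" concrete, here is a minimal sketch (hypothetical names, not from the talk): a sprayer front-end picks any clone because clones are interchangeable, while a sieve lets every clone see every request and keep only the share a deterministic hash assigns to it, so no front-end box is needed.

```python
import hashlib
from itertools import cycle

class Sprayer:
    """Front-end 'spray': round-robin requests across interchangeable clones."""
    def __init__(self, clones):
        self._next = cycle(clones)

    def route(self, request):
        return next(self._next)

class Sieve:
    """NLB-style 'sieve': every clone sees every request; each keeps only
    the traffic a deterministic hash assigns to it."""
    def __init__(self, clones):
        self.clones = sorted(clones)

    def owner(self, client_ip):
        h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        return self.clones[h % len(self.clones)]

clones = ["web1", "web2", "web3"]
print(Sprayer(clones).route({"url": "/"}))  # any clone will do
print(Sieve(clones).owner("10.0.0.7"))      # same client -> same clone
```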

Partitions for Scalability
- Clones are not appropriate for some apps:
  - Stateful apps do not replicate well
  - High update rates do not replicate well
- Examples:
  - Databases
  - Read/write file servers
  - Cache managers
  - Chat
- Partition state among servers
- Partitioning:
  - Must be transparent to the client
  - Split & merge partitions online (see the sketch below)
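As an illustration of transparent partitioning with online split (a sketch under assumed names, not any product's actual scheme): if clients only ever call route(), a hot partition can be split without clients noticing, because the key-to-server map changes behind the same interface.

```python
import bisect

class PartitionMap:
    """Range-partition string keys across servers; clients call route() and
    never see partition boundaries, so splits can happen online."""
    def __init__(self):
        self.bounds = [""]        # lower bound of each partition's key range
        self.servers = ["srv0"]   # server owning each partition

    def route(self, key):
        i = bisect.bisect_right(self.bounds, key) - 1
        return self.servers[i]

    def split(self, at_key, new_server):
        """Split the partition containing at_key; keys >= at_key move to
        new_server. (The actual data migration is elided in this sketch.)"""
        i = bisect.bisect_right(self.bounds, at_key) - 1
        self.bounds.insert(i + 1, at_key)
        self.servers.insert(i + 1, new_server)

pm = PartitionMap()
pm.split("m", "srv1")                       # split the hot partition online
print(pm.route("alpha"), pm.route("zulu"))  # srv0 srv1
```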

Directory, Fail-Over, Load Balancing
- Routes the request to the right farm
  - Farm can be a clone or a partition
- At the farm, routes the request to the right service
- At the service, routes the request to:
  - Any clone
  - The correct partition
- Routes around failures (see the sketch below)
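A hedged sketch of that routing hierarchy (all names hypothetical): a directory maps a service to its farm and the farm to its clones, then picks any live clone, skipping nodes the health monitor has marked down.

```python
import random

# Hypothetical directory: service -> farm -> clone list
DIRECTORY = {
    "maps": {"farm-west": ["c1", "c2", "c3"]},
}
DOWN = {"c2"}  # nodes the health monitor has marked failed

def route(service):
    """Route to any live clone of the service, routing around failures."""
    for farm, clones in DIRECTORY[service].items():
        live = [c for c in clones if c not in DOWN]
        if live:
            return farm, random.choice(live)
    raise RuntimeError(f"no live clone for {service}")

print(route("maps"))  # ('farm-west', 'c1' or 'c3'), never the failed 'c2'
```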

Architecture and Design Work
- Produced an architectural blueprint for large eSites, published on MSDN
- Creating and testing instances of the architecture
  - Team led by Per Vonge Nielsen
  - Actually building and testing examples of the architecture with partners (sometimes known as MICE)
- Built a scalability "Megalab" run by Robert Barnes
  - 1000-node cyber wall, 1U Compaq DL360s, 7000 disks

Clones and Packs, aka Clustering
- Integrated the NLB and MSCS teams
  - Both focused on scalability and availability
  - NLB for clones
  - MSCS for partitions/packs
- Vision: a single communications and group-membership infrastructure, plus a set of management tools, for clones, partitions, and packs
- Unify management for clones/partitions at BOTH the OS and app level (e.g., IIS, BizTalk, AppCenter, Yukon, Exchange, ...)

Clustering in Whistler Server
- Microsoft Cluster Server:
  - Much improved setup and installation
  - 4-node support in Advanced Server
  - Kerberos support for virtual servers
  - Password change without restarting the cluster service
  - 8-node support in Datacenter
  - SAN enhancements (device reset, not bus reset, for disk arbitration; shared disk and boot disk on the same bus)
  - Quorum of nodes supported (no shared disk needed)
- Network Load Balancer:
  - New NLB manager
  - Bi-directional affinity for ISA as a proxy/firewall
  - Virtual cluster support (different port rules for each IP address)
  - Dual-NIC support

Appliances and Hardware Trends
- The appliances team under TomPh is focused on dramatically simplifying the user experience of installing these kinds of devices
- Working with OEMs to adopt Windows XP
- Ultradense servers are on the horizon
  - 100s of servers per rack
  - Manage the rack as one
- InfiniBand and 10 Gbps Ethernet change things

Operations and Management
- Great research work done in MSR on this topic
  - The Mega Services paper by Levi and Hunt
  - The follow-on BIG project developed the ideas of scale-invariant service descriptions with automated monitoring and deployment of servers (see the sketch below)
- Building on that work in the Windows Server group
- AppCenter doing similar things at the app level
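To illustrate what "scale-invariant" might mean here (a hypothetical sketch, not the BIG project's actual format): the description names roles and invariants rather than individual machines, so the same description holds at 3 servers or 300, and an automated deployer can diff it against the live system.

```python
# Hypothetical scale-invariant service description: roles and invariants,
# never individual machine names, so it is valid at any scale.
SERVICE = {
    "web": {"kind": "clone",     "min_live": 2},
    "db":  {"kind": "partition", "replicas_per_pack": 2},
}

def check(deployed):
    """Compare the live deployment against the description and report
    what automated deployment should fix."""
    actions = []
    for role, spec in SERVICE.items():
        live = len(deployed.get(role, []))
        need = spec.get("min_live", spec.get("replicas_per_pack", 1))
        if live < need:
            actions.append(f"deploy {need - live} more {role} server(s)")
    return actions

print(check({"web": ["w1"], "db": ["d1", "d2"]}))
# ['deploy 1 more web server(s)']
```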

What Software Do the Bricks Run?
- Each node has an OS
- Each node has local resources: a federation
- Each node does not completely trust the others
- Nodes use RPC to talk to each other
  - COM+, SOAP, BizTalk
- Huge leverage in high-level interfaces (see the sketch below)
- Same old distributed-system story
[Diagram: two nodes, each running Applications over RPC / streams / datagrams on the CLR, connected by InfiniBand / Gbps Ethernet; "?" marks the choice of RPC layer.]
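As a taste of that high-level-interface leverage, here is a sketch using Python's standard-library XML-RPC as a stand-in for COM+/SOAP/BizTalk (purely illustrative; the point is that the caller sees a method call, not sockets or wire formats).

```python
# Sketch of brick-to-brick RPC: stdlib XML-RPC standing in for COM+/SOAP.
import threading, time
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(x, y):
    return x + y

# One brick exposes a method...
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()
time.sleep(0.2)  # let the server come up

# ...another brick calls it as if it were local.
brick = ServerProxy("http://localhost:8000")
print(brick.add(2, 3))  # 5 -- the plumbing is entirely hidden
```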

It's Hard to Archive a Petabyte
- It takes a LONG time to restore it:
  - At 1 GBps it takes 12 days! (see the arithmetic below)
- Store it in two (or more) places online (on disk?): a geo-plex
- Scrub it continuously (look for errors)
- On failure:
  - Use the other copy until the failure is repaired
  - Refresh the lost copy from the safe copy
- Can organize the two copies differently (e.g., one by time, one by space)
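The 12-day figure is straightforward arithmetic (decimal units assumed, 1 PB = 10^15 bytes):

```python
petabyte = 1e15            # bytes (decimal units)
rate = 1e9                 # 1 GBps, in bytes/second
seconds = petabyte / rate  # 1,000,000 seconds
print(seconds / 86400)     # ~11.57 days, i.e. "12 days"
```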

Call to Action
Let's work together to make storage bricks:
- Low cost
- High function
- NAS (network attached storage), not SAN (storage area network)
- Ship NT8/CLR/IIS/SQL/Exchange/... with every disk drive