
In-Memory Indexing Systems
• Many positives:
– big increase in throughput
– big decrease in query latency
• especially at the tail: expensive queries that previously needed GBs of disk I/O became much faster and cheaper, e.g. [ "circle of life" ]
• Some issues:
– Variance: query touches 1000s of machines, not dozens
• e.g. randomized cron jobs caused us trouble for a while
– Availability: 1 or few replicas of each doc's index data
• availability of index data when a machine failed (esp. for important docs): replicate important docs
• queries of death that kill all the backends at once: very bad

Canary Requests
• Problem: requests sometimes cause a server process to crash
– testing can help reduce the probability, but can't eliminate it
• If sending the same or a similar request to 1000s of machines:
– they all might crash!
– recovery time for 1000s of processes is pretty slow
• Solution: send a canary request first to one machine (sketched below)
– if the RPC finishes successfully, go ahead and send to all the rest
– if the RPC fails unexpectedly, try another machine (might have just been coincidence)
– if it fails K times, reject the request
• Crash only a few servers, not 1000s
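A minimal sketch of the canary logic described above. The send_rpc helper, RpcError, and QueryOfDeathError are stand-ins invented for illustration; the slide describes the protocol, not an API.

import random

class RpcError(Exception):
    """Raised when an RPC fails (backend crashed, timed out, etc.)."""

class QueryOfDeathError(Exception):
    """Raised when a request repeatedly kills its canary backends."""

def send_with_canary(request, backends, send_rpc, k=3):
    """Try the request on up to k machines, one at a time, before fanning out.

    send_rpc(backend, request) stands in for whatever RPC layer is in use;
    it should return a response or raise RpcError.
    """
    canaries = random.sample(backends, min(k, len(backends)))
    for canary in canaries:
        try:
            send_rpc(canary, request)   # canary survived: assume request is safe
            break
        except RpcError:
            continue                    # might be coincidence: try another machine
    else:
        # Failed K times in a row: reject the request rather than crash everyone.
        raise QueryOfDeathError("rejected after %d failed canaries" % len(canaries))
    # Only now fan the request out to the full set of backends.
    return [send_rpc(b, request) for b in backends]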

Features
• Clean abstractions (sketched below):
– Repository
– Document
– Attachments
– Scoring functions
• Easy experimentation
– attach new doc and index data without full reindexing
• Higher performance: designed from the ground up to assume data is in memory
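The slide names the four abstractions but not their interfaces, so the shapes below are purely illustrative guesses at how they might fit together.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Attachment:
    """Arbitrary per-document side data (hypothetical: e.g. a language tag)
    that can be attached without reindexing the document text."""
    name: str
    payload: bytes

@dataclass
class Document:
    docid: int
    tokens: List[str]
    attachments: Dict[str, Attachment] = field(default_factory=dict)

class Repository:
    """A collection of documents plus their index data, held in memory."""
    def __init__(self):
        self.docs: Dict[int, Document] = {}

    def add(self, doc: Document):
        self.docs[doc.docid] = doc

    def attach(self, docid: int, attachment: Attachment):
        # New attachment data is added without a full reindex.
        self.docs[docid].attachments[attachment.name] = attachment

# A scoring function is just a pluggable callable over (query, document).
ScoringFunction = Callable[[List[str], Document], float]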

New Index Format
• Old disk and in-memory index used a two-level scheme:
– each hit was encoded as a (docid, word position in doc) pair
– docid deltas encoded with Rice encoding
– very good compression (originally designed for disk-based indices), but slow/CPU-intensive to decode
• New format: single flat position space (illustrated in the sketch below)
– data structures on the side keep track of doc boundaries
– posting lists are just lists of delta-encoded positions
– need to be compact (can't afford a 32-bit value per occurrence)
– … but need to be very fast to decode
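A small sketch of a delta-encoded posting list in a flat position space. The slide only says positions are delta-encoded, compact, and fast to decode; the varint byte encoding below is one common stand-in, not the actual format.

def encode_postings(positions):
    """Delta-encode ascending positions into bytes using varints (assumed
    encoding; the real format is not specified in the talk)."""
    out, prev = bytearray(), 0
    for pos in positions:
        delta, prev = pos - prev, pos
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            delta >>= 7
        out.append(delta)                      # final byte, continuation bit clear
    return bytes(out)

def decode_postings(data):
    """Decode the bytes back into absolute positions."""
    positions, pos, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                         # more bytes in this varint
        else:
            pos += delta                       # rebuild absolute position
            positions.append(pos)
            shift, delta = 0, 0
    return positions

# Occurrences of one word in the single flat position space:
assert decode_postings(encode_postings([3, 17, 300, 100000])) == [3, 17, 300, 100000]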

Universal Search
• Search all corpora in parallel
• Performance: most of the corpora weren't designed to deal with the high QPS level of web search
• Mixing: which corpora are relevant to the query?
– changes over time
• UI: how to organize results from different corpora?
– interleaved?
– separate sections for different types of documents?

Current Work: Spanner
• Storage & computation system that runs across many datacenters
– single global namespace
• names are independent of the location(s) of data
• fine-grained replication configurations
– support mix of strong and weak consistency across datacenters
• strong consistency implemented with Paxos across tablet replicas
• full support for distributed transactions across directories/machines
– much more automated operation
• automatically changes replication based on constraints and usage patterns
• automated allocation of resources across the entire fleet of machines

Design Goals for Spanner
• Future scale: ~10^5 to ~10^7 machines, ~10^13 directories, ~10^18 bytes of storage, spread at 100s to 1000s of locations around the world
– zones of semi-autonomous control
– consistency after disconnected operation
– users specify high-level desires (illustrated below):
"99%ile latency for accessing this data should be <50ms"
"Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia"
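The talk gives the two quoted desires but no concrete syntax for expressing them, so the schema below is invented purely to make the idea concrete.

from dataclasses import dataclass
from typing import Dict

@dataclass
class PlacementPolicy:
    """Hypothetical high-level placement spec, mirroring the quoted desires."""
    latency_p99_ms: float                    # "99%ile latency ... <50ms"
    min_replicas_by_region: Dict[str, int]   # "at least 2 disks in EU, ..."

policy = PlacementPolicy(
    latency_p99_ms=50.0,
    min_replicas_by_region={"EU": 2, "US": 2, "Asia": 1},
)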

Many Internal Services
• Break large complex systems down into many services!
• Simpler from a software engineering standpoint:
– few dependencies, clearly specified
– easy to test and deploy new versions of individual services
– ability to run lots of experiments
– easy to reimplement a service without affecting clients
• Development cycles largely decoupled
– lots of benefits: small teams can work independently
– easier to have many engineering offices around the world
• e.g. google.com search touches 200+ services
– ads, web search, books, news, spelling correction, ...

Designing Efficient Systems
Given a basic problem definition, how do you choose the "best" solution?
• Best might be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate the performance of a system design – without actually having to build it! (see the example below)
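A worked back-of-the-envelope example in the spirit of this slide: estimating how long it takes to render an image-results page of 30 thumbnails under two designs. The latency and throughput constants are rough circa-2010 assumptions, not measurements.

DISK_SEEK_S = 10e-3          # assume ~10 ms per disk seek
DISK_READ_BPS = 30e6         # assume ~30 MB/s sequential disk read
IMAGE_BYTES = 256 * 1024     # assume each thumbnail source is ~256 KB

# Design 1: read the 30 images serially from one disk: 30 seeks + 30 reads.
serial_s = 30 * (DISK_SEEK_S + IMAGE_BYTES / DISK_READ_BPS)

# Design 2: issue the 30 reads in parallel across 30 disks: 1 seek + 1 read.
parallel_s = DISK_SEEK_S + IMAGE_BYTES / DISK_READ_BPS

print("serial:   %.2f s" % serial_s)    # ~0.56 s
print("parallel: %.3f s" % parallel_s)  # ~0.019 s

The estimate alone tells you the parallel design is ~30x faster, before writing a line of production code.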

Designing & Building Infrastructure
Identify common problems, and build software systems to address them in a general way
• Important to not try to be all things to all people
– clients might be demanding 8 different things
– doing 6 of them is easy
– … handling 7 of them requires real thought
– … dealing with all 8 usually results in a worse system
• more complex, compromises other clients in trying to satisfy everyone

Designing & Building Infrastructure (cont)
Don't build infrastructure just for its own sake:
• Identify common needs and address them
• Don't imagine unlikely potential needs that aren't really there
Best approach: use your own infrastructure (especially at first!)
• (much more rapid feedback about what works, what doesn't)

Design for Growth
Try to anticipate how requirements will evolve; keep likely features in mind as you design the base system
Don't design to scale infinitely:
• ~5X to 50X growth: good to consider
• >100X growth: probably requires a rethink and rewrite

Pattern: Single Master, 1000s of Workers (cont)
• Often: hot standby of master waiting to take over
• Always: bulk of data transfer directly between clients and workers
• Pro:
– simpler to reason about the state of the system with a centralized master
• Caveats:
– careful design required to keep the master out of common-case operations (sketched below)
– scales to 1000s of workers, but not 100,000s of workers
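A minimal sketch of the pattern, assuming GFS/Bigtable-style roles: the master serves only metadata, clients cache it, and bulk data moves directly between clients and workers. All names are illustrative.

class Master:
    """Holds only the chunk -> worker mapping (metadata, never the data)."""
    def __init__(self, placement):
        self.placement = placement          # e.g. {"chunk-7": "worker-3"}

    def locate(self, chunk_id):
        return self.placement[chunk_id]

class Worker:
    """Holds the actual data for its assigned chunks."""
    def __init__(self, chunks):
        self.chunks = chunks                # chunk_id -> bytes

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]

class Client:
    def __init__(self, master, workers):
        self.master = master
        self.workers = workers              # worker_id -> Worker
        self.location_cache = {}            # keeps master out of the common case

    def read(self, chunk_id):
        # Ask the master only on a cache miss; all bytes come from the worker.
        if chunk_id not in self.location_cache:
            self.location_cache[chunk_id] = self.master.locate(chunk_id)
        worker = self.workers[self.location_cache[chunk_id]]
        return worker.read_chunk(chunk_id)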

Pattern: Multiple Smaller Units per Machine
• Problems:
– want to minimize recovery time when a machine crashes (see the estimate below)
– want to do fine-grained load balancing
• Having each machine manage 1 unit of work is inflexible
– slow recovery: new replica must recover data that is O(machine state) in size
– load balancing much harder
[diagram: machine holding a single work chunk]
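A rough estimate of why many small units recover faster than one big one. The state size and link speed are assumptions chosen for illustration, not numbers from the talk.

STATE_BYTES = 400e9     # assume ~400 GB of state per machine
LINK_BPS = 100e6        # assume ~100 MB/s recovery bandwidth per source machine

# One unit per machine: a single new replica re-reads all the state serially.
one_unit_s = STATE_BYTES / LINK_BPS                 # ~4000 s (over an hour)

# 100 small units spread across the cluster: 100 machines each recover one
# unit in parallel, so recovery time shrinks to O(state / units).
many_units_s = (STATE_BYTES / 100) / LINK_BPS       # ~40 s

print("1 unit:    %.0f s" % one_unit_s)
print("100 units: %.0f s" % many_units_s)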