4
Small world phenomenon
The social network is a graph:
– “node” is a person
– “edge” between people that know each other
“6 degrees of separation”
Are all pairs of people connected by path of distance at most 6?

5
Vast data Impossible to access all of it Accessible data is too enormous to be viewed by a single individual Once accessed, data can change

7
What can we hope to do without viewing most of the data?
Can’t answer “for all” or “exactly” type statements:
– are all individuals connected by at most 6 degrees of separation?
– exactly how many individuals on earth are left-handed?
Compromise?
– is there a large group of individuals connected by at most 6 degrees of separation?
– approximately how many individuals on earth are left-handed?

9
Property Testing Does the input object have crucial properties? Example Properties: Clusterability, Small diameter graph, Close to a codeword, Linear or low degree polynomial function, Increasing order Lots and lots more…

10
“In the ballpark” vs. “out of the ballpark” tests
Property testing: Distinguish inputs that have a specific property from those that are far from having that property
Benefits:
– Can often answer such questions much faster
– May be the natural question to ask: when some “noise” is always present, or when data is constantly changing
– Gives fast sanity check to rule out very “bad” inputs (e.g., restaurant bills)
– Model selection problem in machine learning

11
Examples Can test if a function is a homomorphism in CONSTANT TIME [Blum Luby R.] Can test if the social network has 6 degrees of separation in CONSTANT TIME [Parnas Ron]

12
Constructing a property tester:
Find characterization of property that is
– Efficiently (locally) testable
– Robust: objects that have the property satisfy the characterization, and objects far from having the property are unlikely to PASS
Usually the bigger challenge

14
Example: 6 degrees of separation
A “bad” testing characterization: For every node, all other nodes within distance 6.
Another bad one: For most nodes, all other nodes within distance 6.
Good characterization: For most nodes, there are many other nodes within distance 6.

20
Monotonicity of a sequence
Given: list y_1, y_2, ..., y_n
Question: can we quickly test if the list is close to sorted? i.e., (1) pass sorted lists and (2) if a list passes the test, can change at most an ϵ fraction of the list to make it sorted
Can test in O((1/ϵ) log n) time [Ergun, Kannan, Kumar, Rubinfeld, Viswanathan]
Best possible [EKKRV] + [Fischer]

21
An attempt:
Proposed algorithm: Pick random i and test that y_i ≤ y_{i+1}
Bad input type: 1,2,3,4,5,…,j, 1,2,3,4,5,…,j, 1,2,3,4,5,…,j, 1,2,3,4,5,…,j
Difficult for this algorithm to find a “breakpoint”
But other algorithms work well
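To see why the adjacent-pair test struggles on this input, the following sketch builds the bad input and measures two things: how many adjacent pairs are out of order, and how long the longest non-decreasing subsequence is (every entry outside such a subsequence must be changed to sort the list). The helper names `bad_input`, `adjacent_violations`, and `longest_nondecreasing` are made up for this illustration.

```python
from bisect import bisect_right

def bad_input(j, blocks=4):
    """The list 1..j repeated `blocks` times."""
    return list(range(1, j + 1)) * blocks

def adjacent_violations(ys):
    """Number of positions i with ys[i] > ys[i+1] (the 'breakpoints')."""
    return sum(1 for a, b in zip(ys, ys[1:]) if a > b)

def longest_nondecreasing(ys):
    """Length of the longest non-decreasing subsequence (patience sorting)."""
    tails = []
    for y in ys:
        k = bisect_right(tails, y)
        if k == len(tails):
            tails.append(y)
        else:
            tails[k] = y
    return len(tails)

ys = bad_input(250)                      # n = 1000
print(adjacent_violations(ys))           # breakpoints among the 999 adjacent pairs
print(len(ys) - longest_nondecreasing(ys))  # entries that must change to sort the list
```

With j = 250 only 3 of the 999 adjacent pairs are out of order, so a random i almost never hits a breakpoint, yet roughly three quarters of the entries would have to change to make the list sorted.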

23
A test that works
The test (for distinct y_i):
Test several times:
– Pick random i
– Look at value of y_i
– Do binary search for y_i
– Does the binary search find any inconsistencies? If yes, FAIL
– Do we end up at location i? If not, FAIL
Pass if never failed
Running time: O(log n) time
Why does this work?
– If list is in order, then test always passes
– If the test passes on choice i and j, then y_i and y_j are in correct order
– Since test usually passes, most y_i’s are in the right order
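A minimal sketch of this test, assuming distinct values. It only checks whether the binary search for y_i ends at location i and omits the separate inconsistency check along the search path. The number of repetitions, `ceil(2 / eps)`, is an assumed constant chosen for illustration, and the function names are hypothetical.

```python
import math
import random

def binary_search_ends_at(ys, i):
    """Binary-search for ys[i] as if ys were sorted; True iff the search lands on index i."""
    lo, hi = 0, len(ys) - 1
    target = ys[i]
    while lo <= hi:
        mid = (lo + hi) // 2
        if ys[mid] == target:
            return mid == i
        if ys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

def looks_sorted(ys, eps, rng=random):
    """PASS (True) if every sampled index survives the binary-search check."""
    if len(ys) < 2:
        return True
    for _ in range(math.ceil(2 / eps)):   # assumed O(1/eps) repetitions
        i = rng.randrange(len(ys))
        if not binary_search_ends_at(ys, i):
            return False                  # FAIL
    return True

print(looks_sorted(list(range(1000)), eps=0.1))         # a sorted list always passes
print(looks_sorted(list(range(1000, 0, -1)), eps=0.1))  # a reversed list
```

Each repetition reads O(log n) entries, so the total number of entries read is O((1/ϵ) log n). Two indices that both pass must be in the correct relative order, because their search paths split at a pivot whose value lies between them.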

32
Example: Vertex Cover
Given graph G(V,E), a vertex cover (VC) C is a subset of V such that it “touches” every edge.
What is minimum size of a vertex cover?
– NP-complete
– Poly time multiplicative 2-approximation based on relationship of VC and maximal matching

34
“Classical” approximation examples Can get CONSTANT TIME approximation for vertex cover on sparse graphs! Output y which is at most 2 ∙ OPT + ϵn How? Oracle reduction framework [Parnas Ron] Construct “oracle” that tells you if node u in 2-approx vertex cover Use oracle + standard sampling to estimate size of cover But how do you implement the oracle?
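The "oracle + standard sampling" step can be sketched as follows, treating the oracle as a black-box function. The name `estimate_cover_size`, the sample count `ceil(2 / eps**2)`, and the hard-coded stand-in oracle are all assumptions made for this illustration; a real oracle would answer each query by local exploration of the graph.

```python
import math
import random

def estimate_cover_size(n, in_cover, eps, rng=random):
    """Estimate the size of the oracle's cover to within about eps * n (with good probability)."""
    samples = math.ceil(2 / eps ** 2)   # assumed constant; on the order of 1/eps^2 samples
    hits = sum(1 for _ in range(samples) if in_cover(rng.randrange(n)))
    return n * hits / samples

# Stand-in oracle for illustration only: a precomputed cover of the path 0-1-2-3-4-5,
# made of the endpoints of the maximal matching {(1, 2), (3, 4)}.
cover = {1, 2, 3, 4}
print(estimate_cover_size(6, cover.__contains__, eps=0.1))
```

The number of oracle calls depends only on ϵ, not on n. The factor 2 in the guarantee comes from the cover being the endpoints of a maximal matching, and the additive ϵn comes from the sampling error.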

35
Implementing the oracle – two approaches: Sequentially simulate computations of a local distributed algorithm [Parnas Ron] Figure out what greedy maximal matching algorithm would do on u [Nguyen Onak]

37
Implementing the Oracle via Greedy
To decide if edge e is in the matching:
– Must know if adjacent edges that come before e in the ordering are in the matching
– Do not need to know anything about edges coming after
Arbitrary edge order can have long dependency chains!
[Figure: edges labeled with their positions in the ordering: 1, 2, 4, 8, 25, 36, 47, 88, 89, 110, 111, 112, 113. Odd or even steps from beginning?]

38
Breaking long dependency chains [Nguyen Onak] Assign random ordering to edges Greedy works under any ordering Important fact: random order has short dependency chains
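A rough sketch of a local oracle in this style: each edge gets a random rank the first time it is seen, and an edge is in the greedy matching exactly when no adjacent edge of lower rank is in it. This is an illustration under assumptions, not the exact procedure from [Nguyen Onak]; `make_oracles`, `canon`, and the adjacency-dict input format are hypothetical, and the graph is assumed simple with comparable node labels.

```python
import random

def canon(u, w):
    """Canonical (sorted) representation of the undirected edge {u, w}."""
    return (u, w) if u < w else (w, u)

def make_oracles(adj, seed=0):
    rng = random.Random(seed)
    rank, memo = {}, {}

    def order(e):
        # Assign the random rank lazily; the edge itself breaks (unlikely) ties.
        if e not in rank:
            rank[e] = rng.random()
        return (rank[e], e)

    def in_matching(e):
        """Is canonical edge e in the greedy matching for the random order?"""
        if e not in memo:
            u, v = e
            adjacent = [canon(u, w) for w in adj[u] if w != v]
            adjacent += [canon(v, w) for w in adj[v] if w != u]
            # Only edges that come before e matter; query the earliest ones first.
            earlier = sorted((f for f in adjacent if order(f) < order(e)), key=order)
            memo[e] = not any(in_matching(f) for f in earlier)
        return memo[e]

    def in_cover(u):
        """Node u is in the cover iff one of its edges is matched."""
        return any(in_matching(canon(u, w)) for w in adj[u])

    return in_matching, in_cover

# Usage on a 10-node cycle.
adj = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
in_matching, in_cover = make_oracles(adj, seed=1)
print(sum(in_cover(u) for u in adj))
```

Each query recurses only into adjacent edges of strictly lower rank, so the work per query is governed by the length of the dependency chains, which the slide notes are short under a random order.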

49
Key Question How many samples do you need in terms of domain size? Do you need to estimate the probabilities of each domain item? Can sample complexity be sublinear in size of the domain? Rules out standard statistical techniques, learning distribution

60
Why so different? Collision statistics are all that matter Collisions on “heavy” elements can hide collision statistics of rest of the domain Construct pairs of distributions where heavy elements are identical, but “light” elements are either identical or very different
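As a small illustration of what a "collision statistic" is, the sketch below computes the fraction of sample pairs that land on the same domain element. For samples drawn independently from a distribution p, this fraction estimates the sum of p_i squared, which is smallest (1/n) when p is uniform over n elements. The function name `collision_rate` is made up for this example, and this is the generic statistic rather than the specific construction from the talk.

```python
from collections import Counter

def collision_rate(samples):
    """Fraction of unordered sample pairs that collide (fall on the same element)."""
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples")
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    return collisions / (m * (m - 1) // 2)

print(collision_rate(["a", "a", "b", "c"]))  # 1 colliding pair out of 6
```

A few heavy elements contribute most of the colliding pairs, which is how they can mask what is happening on the light elements.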