3
A Matching Peeling Argument Take a random graph with n vertices and m edges. Form a matching greedily as follows: – Find a vertex of degree 1. – Put the corresponding edge in the matching. – Repeat until out of vertices of degree 1.
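The greedy procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' code: the function name `peel_matching` and the edge-list input are assumptions, and the sketch assumes that matching an edge also removes both of its endpoints from the graph.

```python
from collections import defaultdict

def peel_matching(edges):
    """Greedy degree-1 peeling. Returns (matching, leftover_edges)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Candidate degree-1 vertices; degrees are re-checked when popped.
    stack = [u for u in adj if len(adj[u]) == 1]
    matching = []
    while stack:
        u = stack.pop()
        if len(adj[u]) != 1:
            continue  # degree changed since u was queued
        (v,) = adj[u]
        matching.append((u, v))
        # Assumption: remove both endpoints and every edge touching them.
        for x in (u, v):
            for y in list(adj[x]):
                adj[y].discard(x)
                if len(adj[y]) == 1:
                    stack.append(y)
            adj[x].clear()

    # Edges that survive peeling (each reported once).
    leftover = {(min(a, b), max(a, b)) for a in adj for b in adj[a]}
    return matching, leftover
```

On a path the peeling consumes every edge; on a triangle there is no degree-1 vertex, so nothing is peeled and all three edges are left over.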

9
History Studied by Karp and Sipser. Threshold behavior: they showed that if m < en/2 (e is the base of the natural logarithm), then the number of leftover edges is o(n). Used an analysis based on differential equations.

11
Random Graph Interpretation Bipartite graph with vertices for literal pairs on one side, vertices for clauses on the other. When the number of neighbors for one literal in a pair drops to 0, can remove both literals, and the neighboring clauses.

13
A Peeling Paradigm Start with a random graph. Peel away part of the graph greedily. – Generally, find a node of degree 1, remove the edge and the other adjacent node, but there are variations. – k-core = maximal subgraph in which every vertex has degree at least k. Find a vertex with degree less than k, remove it and all its edges, continue. Analysis approach: – Left with a random graph (with a different degree distribution). – Keep reducing the graph all the way down, to an empty graph or a stopping point.
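As a concrete instance of the paradigm, the k-core can be computed by repeatedly deleting vertices of degree less than k. The following is a minimal sketch; the function name `k_core` and the edge-list representation are illustrative assumptions.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the vertex set of the k-core of an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    removed = set()
    queue = [u for u in adj if len(adj[u]) < k]
    while queue:
        u = queue.pop()
        if u in removed:
            continue
        removed.add(u)
        # Deleting u lowers the degree of each neighbor, which may
        # push that neighbor below k as well.
        for w in adj[u]:
            adj[w].discard(u)
            if w not in removed and len(adj[w]) < k:
                queue.append(w)
        adj[u] = set()

    return {u for u in adj if u not in removed}
```

For a triangle with one pendant vertex attached, the 2-core is the triangle; a path has an empty 2-core.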

18
Decoding Results Successful decoding corresponds to erasing the entire bipartite graph. Equivalently: graph has an empty 2-core. Using peeling analysis, can determine when the graph's 2-core is empty with high probability. – Thresholds can be found by differential equation analysis or “And-Or” tree analysis.

23
Peeling and Tabulation Hashing Given a set of strings being hashed, we can peel as follows: – Find a string in the set that uniquely maps to a specific (character value, position). – Remove that string, and continue. The peeled set has completely independent hashed values. – Because each one is xor’ed with its very own random table value.
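The peeling step for tabulation hashing can be sketched as follows. This is an illustrative sketch only: keys are assumed to be equal-length strings whose characters index the per-position random tables, and the name `peel_keys` is hypothetical. A key is peelable when some (position, character) pair occurs in no other remaining key.

```python
from collections import defaultdict

def peel_keys(keys):
    """Peel keys that own a unique (position, character) slot.

    Returns (peeled, remaining): peeled is a list of (key, slot) in
    peel order, remaining is the set of keys that could not be peeled.
    """
    keys = set(keys)
    owners = defaultdict(set)  # (position, char) -> keys using that slot
    for key in keys:
        for pos, ch in enumerate(key):
            owners[(pos, ch)].add(key)

    stack = [slot for slot, ks in owners.items() if len(ks) == 1]
    peeled = []
    while stack:
        slot = stack.pop()
        if len(owners[slot]) != 1:
            continue  # slot was emptied or was never unique
        (key,) = owners[slot]
        peeled.append((key, slot))
        # Removing the key may make other slots unique.
        for pos, ch in enumerate(key):
            s = (pos, ch)
            owners[s].discard(key)
            if len(owners[s]) == 1:
                stack.append(s)

    remaining = keys - {k for k, _ in peeled}
    return peeled, remaining
```

Not every set peels completely: for the four two-character keys over {0, 1}, every slot is shared by two keys, so nothing can be peeled.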

24
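Tabulation Hashing Result due to Patrascu and Thorup. Lemma: Suppose we hash $n \le m^{1-\epsilon}$ elements into $m$ bins, for some constant $\epsilon > 0$, where each key has $c$ characters. Then for any constant $a$, all bins get fewer than $d = ((1+a)/\epsilon)^c$ elements with probability at least $1 - m^{-a}$. – Bootstrap this result to get Chernoff bounds for tabulation hashing.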

25
Proof Let $t = (1+a)/\epsilon$, so $d = t^c$. Step 1: Every set of $d$ elements has a peelable subset of size $t$. – Pigeonhole principle says some character position is hit by at least $d^{1/c} = t$ distinct characters; otherwise there are fewer than $(d^{1/c})^c = d$ total elements. Choose 1 element for each character hit in that position. Step 2: Use this to bound maximum load. – Maximum load at least $d$ implies some $t$-element peelable set landed in the same bin. At most $\binom{n}{t}$ such sets; each set lands in the same bin with probability $m^{-(t-1)}$. Union bound gives a heavily loaded bin with probability at most $m^{-a}$.
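The union bound in Step 2 can be written out explicitly. The chain below is a worked sketch that assumes, as in the Patrascu and Thorup lemma, that the number of elements satisfies $n \le m^{1-\epsilon}$ (written eps on the slide), and uses $t = (1+a)/\epsilon$ so that $\epsilon t = 1 + a$:

```latex
\binom{n}{t}\, m^{-(t-1)}
  \;\le\; n^{t}\, m^{1-t}
  \;\le\; m^{(1-\epsilon)t}\, m^{1-t}
  \;=\; m^{1-\epsilon t}
  \;=\; m^{1-(1+a)}
  \;=\; m^{-a}.
```

The first inequality uses $\binom{n}{t} \le n^t$, and the second substitutes the assumed bound on $n$.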

28
Bloom Filters, + Values + Listing Bloom filters useful for set membership. But they don’t allow (key,value) lookups. – Bloomier filters, extension to (key,value) pairs. Also based on peeling methods. They also don’t allow you to reconstruct the elements in the set. Can we find something that does key-value pair lookups, and allows reconstruction? Invertible Bloom Lookup Table (IBLT)

30
Listing, Details Consider data streams that insert/delete a lot of pairs. – Flows through a router, people entering/leaving a building. We want listing not at all times, but at “reasonable” or “off-peak” times, when the current working set size is bounded. – If we do all the N insertions, then all the N-M deletions, and want a list at the end, we want… Data structure size should be proportional to listing size, not maximum size. – Proportional to M, not to N! – Proportional to size you want to be able to list, not number of pairs your system has to handle.

31
Sample Applications Network flow tracking – Track flows on insertions/deletions – Possible to list flows at any time – as long as the network load is not too high If too high, wait till it gets lower again – Can also do flow lookups (with small failure probability) Oblivious table selection Database/Set Reconciliation – Alice sends Bob an IBLT of her data – Bob deletes his data – IBLT difference determines set difference

32
Possible Scenarios A nice system – Each key has (at most) 1 value – Delete only items that are inserted A less nice system – Keys can have multiple values – Deletions might happen for keys not inserted, or for the wrong value A further less nice system – Key-value pairs might be duplicated

34
Get Performance Bloom filter style analysis. Let m = number of cells, n = number of key-value pairs, j = number of hash functions. Quantities of interest: the probability that a Get for a key k in the system returns “not found”, and the probability that a Get for a key k not in the system returns “not found”.
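A Bloom filter style estimate of these two probabilities can be sketched as below. Everything here is an assumption layered on the slide: the function name `get_not_found_probs` is hypothetical, cell occupancies are approximated as independent Poisson variables with mean jn/m, and Get is assumed to give a definite answer whenever one of the key's j cells has count 0 or count 1 (where KeySum settles the question), returning “not found” only otherwise.

```python
import math

def get_not_found_probs(m, n, j):
    """Approximate 'not found' probabilities for Get (Poisson sketch).

    Returns (present, absent):
      present: key is in the system, but every one of its j cells
               also holds at least one other key.
      absent:  key is not in the system, and every one of its j cells
               holds at least two keys.
    """
    lam = j * n / m              # expected number of keys per cell
    p_empty = math.exp(-lam)     # cell holds no (other) key
    p_one = lam * p_empty        # cell holds exactly one key
    present = (1 - p_empty) ** j
    absent = (1 - p_empty - p_one) ** j
    return present, absent
```

Both estimates shrink quickly as m/n grows, which is the usual Bloom filter trade-off between space and failure probability.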

35
The Nice System : Listing While some cell has a count of 1: – Set (k,v) = (KeySum,ValueSum) of that cell – Output (k,v) – Call Delete(k,v) on the IBLT
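The listing loop can be sketched on top of a small IBLT. This is a minimal sketch for the nice system only, not a reference implementation: the class name `IBLT`, integer keys and values, integer sums for KeySum and ValueSum, SHA-256 based cell selection, and the split of the m cells into j subtables (so a key always lands in j distinct cells) are all assumptions made for illustration.

```python
import hashlib

class IBLT:
    def __init__(self, m, j=3):
        assert m % j == 0, "m must be a multiple of j (one subtable per hash)"
        self.m, self.j = m, j
        self.count = [0] * m
        self.key_sum = [0] * m
        self.value_sum = [0] * m

    def _cells(self, key):
        # One cell per subtable, chosen by a salted hash of the key.
        width = self.m // self.j
        cells = []
        for i in range(self.j):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            cells.append(i * width + int.from_bytes(digest[:8], "big") % width)
        return cells

    def _update(self, key, value, sign):
        for c in self._cells(key):
            self.count[c] += sign
            self.key_sum[c] += sign * key
            self.value_sum[c] += sign * value

    def insert(self, key, value):
        self._update(key, value, +1)

    def delete(self, key, value):
        self._update(key, value, -1)

    def list_entries(self):
        """Destructive listing: peel cells with count 1 until none remain.

        Returns (pairs, complete); complete is True if the table emptied.
        """
        out = []
        stack = [c for c in range(self.m) if self.count[c] == 1]
        while stack:
            c = stack.pop()
            if self.count[c] != 1:
                continue  # cell changed since it was queued
            k, v = self.key_sum[c], self.value_sum[c]
            out.append((k, v))
            self.delete(k, v)
            stack.extend(d for d in self._cells(k) if self.count[d] == 1)
        return out, all(x == 0 for x in self.count)
```

Listing empties the table as it goes, so a caller that needs the structure afterwards would list from a copy.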

36
Listing Example

37
The Nice System : Listing (continued) The listing loop is a peeling process. This is the same process used to find the 2-core of a random hypergraph, and the same process used to decode families of low-density parity-check codes.

38
Listing Performance Results on random peeling processes. Thresholds for complete recovery depend on the number of hash functions. Interesting possibility: use “irregular” IBLTs – Different numbers of hash functions for different keys – Same idea used in LDPC codes. [Table: threshold m/n for j = 3, 4, 5, 6 hash functions]

39
Fault Tolerance Extraneous deletions – Now a count of 1 does not mean 1 key in the cell. Might have two inserted keys + one extraneous deletion. – Need an additional check: hash the keys, sum into HashKeySum. – If count is 1, and the hash of the KeySum = HashKeySum, then 1 key in the cell. What about a count of -1? – If count is -1, and the hash of -KeySum = -HashKeySum, then 1 key in the cell.
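The extra check can be sketched as a small predicate on a cell's fields. The names `key_hash` and `pure_key`, the use of SHA-256 truncated to 64 bits, and plain integer sums for KeySum and HashKeySum are assumptions for illustration.

```python
import hashlib

def key_hash(key):
    """Hypothetical 64-bit checksum of an integer key."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def pure_key(count, key_sum, hash_key_sum):
    """Return (key, sign) if the cell appears to hold exactly one key.

    sign is +1 for a key that was inserted and -1 for a key that was
    only (extraneously) deleted. Returns None when the checksum does
    not confirm that a single key is present.
    """
    if count == 1 and key_hash(key_sum) == hash_key_sum:
        return key_sum, +1
    if count == -1 and key_hash(-key_sum) == -hash_key_sum:
        return -key_sum, -1
    return None
```

A cell holding inserted keys 3 and 9 plus an extraneous deletion of key 5 has count 1 and KeySum 7, but its HashKeySum matches the hash of 7 only with negligible probability, so the cell is correctly skipped.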

40
Fault Tolerance Keys with multiple values – Need another additional check; HashKeySum and HashValueSum. Multiply-valued keys “poison” a cell. – The cell is unusable; it will never have a count of 1. Small numbers of poisoned cells have minimal effect. – Usually, all other keys can still be listed. – If not, number of unrecovered keys is usually small, like 1.

45
Set Reconciliation Problem Alice and Bob each hold a set of keys, with a large overlap. – Example: Alice is your smartphone phone book, Bob is your desktop phone book, and new entries or changes need to be synched. Want one/both parties to learn the set difference. Goal: communication is proportional to the size of the difference. IBLTs yield an effective solution for set reconciliation. (Used in code construction…)

47
Reconciliation/Decoding For now, assume no errors in IBLT sent by Alice. Bob deletes his key-value pairs from the IBLT. What remains in the IBLT is the set difference, corresponding to symbol errors. Bob lists elements of the IBLT to find the errors.

48
Decoding Process Suppose a cell has one pair in it. Then the checksum should match with the key value. – And, if more than one pair, checksum should not match with the key value. Choose checksum length so no false matches with good probability. If checksum matches, recover the element, and delete it from the IBLT. Continue until all pairs recovered.
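The reconciliation and decoding steps can be sketched end to end. This is a simplified sketch under stated assumptions: it tracks keys only (no values), keys are positive integers, cells store an XOR of keys and an XOR of 64-bit checksums, and the names `DiffIBLT` and `_h` are hypothetical.

```python
import hashlib

def _h(data, salt):
    """Hypothetical salted 64-bit hash used for cells and checksums."""
    digest = hashlib.sha256(f"{salt}:{data}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class DiffIBLT:
    def __init__(self, m, j=3):
        assert m % j == 0
        self.m, self.j = m, j
        self.count = [0] * m
        self.key_xor = [0] * m
        self.chk_xor = [0] * m

    def _cells(self, key):
        width = self.m // self.j  # one cell per subtable
        return [i * width + _h(key, i) % width for i in range(self.j)]

    def _toggle(self, key, sign):
        for c in self._cells(key):
            self.count[c] += sign
            self.key_xor[c] ^= key
            self.chk_xor[c] ^= _h(key, "chk")

    def insert(self, key):   # Alice adds her keys
        self._toggle(key, +1)

    def delete(self, key):   # Bob removes his keys
        self._toggle(key, -1)

    def decode(self):
        """Peel cells whose checksum matches. Returns (alice, bob, ok)."""
        only_alice, only_bob = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(self.m):
                sign = self.count[c]
                if sign in (1, -1) and _h(self.key_xor[c], "chk") == self.chk_xor[c]:
                    key = self.key_xor[c]
                    (only_alice if sign == 1 else only_bob).add(key)
                    self._toggle(key, -sign)  # remove the recovered key
                    progress = True
        ok = all(x == 0 for x in self.count) and not any(self.key_xor)
        return only_alice, only_bob, ok
```

A usage example: Alice inserts her keys, sends the table, Bob deletes his keys and decodes. The table size only needs to scale with the size of the difference, not with the size of either set.

```python
table = DiffIBLT(30, 3)
for k in (1, 2, 3):
    table.insert(k)      # Alice's set
for k in (1, 2):
    table.delete(k)      # Bob's set
only_alice, only_bob, ok = table.decode()
```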

49
Peeling Process [Figure: keys mapped to hash table cells]

50
Analysis Analysis follows standard approaches for LDPC codes. – E.g., differential equation, fluid limit analysis. – Chernoff-like bounds on behavior. Overheads – Set difference size = 2x number of errors. Could we get rid of factor of 2? – Decoding structure overhead. Best not to use “regular graph” = same number of hash functions per item; but simpler to do so.

51
Fault Tolerance What about errors in IBLT cells? – The cell is “bad”, can’t be used for recovery. – If the checksum works, low probability of decoding error. – Enough bad IBLT cells will harm decoding. Most likely scenario: one key-value pair has all of its IBLT cells go bad; it cannot be recovered. So most likely error: 1 unrecovered value (or small number). Various remedies possible. – Small additional error-correction in original message. – Recursively protect IBLT cells with a code.

52
Simple Code Code is essentially a lot of hashing, XORing of values.

54
Experimental Results Parameters chosen so we expect rare but noticeable failures with 4 hash functions. – All IBLT cells for some symbol in error in approximately 1.6 x trials. But less so with 5 hash functions. – Failure once in approximately 3.2 x trials. 16 failures in 1000 trials for 4 hash functions, none for 5 hash functions. – Failures are all 1 unrecovered element! Experiments match analysis.

55
Experimental Results : Timing Less than 0.1 seconds per decoding. Most of the time is simply putting elements into the hash table. – 4 hash functions: seconds on average to load data into table, for subsequent decoding. – 5 hash functions: seconds on average to load data into table, for subsequent decoding. Optimizations, parallelizations possible.