Soumen Chakrabarti

The Web
2 billion HTML pages, several terabytes
Highly dynamic
–1 million new pages per day
–Over 600 GB of pages change per month
–Average page changes in a few weeks
Largest crawlers
–Refresh less than 18% of pages in a few weeks
–Cover less than 50% of the Web ever
Average page has 7–10 links
–Links form content-based communities

Differences from structured data
Document ≠ rows and columns
–Extended complex objects
–Links and relations to other objects
Document = XML graph
–Combine models and analyses for attributes, elements, and CDATA
–Models different from the structured scenario
Very high dimensionality
–Tens of thousands of dimensions, as against dozens
–Sparse: most dimensions absent/irrelevant
Complex taxonomies and ontologies

The sublime and the ridiculous
What is the exact circumference of a circle of radius one inch?
Is the distance between Tokyo and Rome more than 6000 miles?
What is the distance between Tokyo and Rome?
java
java +coffee -applet
“uninterrupt* power suppl*” ups -parcel

‘Iceberg’ queries
Given a query
–For all pages in the database, compute the similarity between query and page
–Report the 10 most similar pages
Ideally, computation and IO effort should be related to output size
–An inverted index with AND may violate this
Similar issues arise in clustering and classification
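A minimal sketch of the "report the 10 most similar pages" step, assuming similarity scores are available per page (a toy list here): a bounded heap keeps work per scanned score proportional to the output size k, rather than sorting the whole collection.

```python
import heapq

# Hypothetical toy input: (similarity, page_id) pairs; in a real
# system these would come from scoring each candidate page.
scores = [(0.12, "a"), (0.87, "b"), (0.45, "c"), (0.91, "d"), (0.05, "e")]

def top_k_pages(similarities, k=10):
    """Return the k (score, page_id) pairs with highest similarity.

    heapq.nlargest maintains only a k-element heap while scanning
    once, so memory stays O(k) instead of O(n log n) full-sort cost.
    """
    return heapq.nlargest(k, similarities, key=lambda pair: pair[0])

print(top_k_pages(scores, k=2))  # → [(0.91, 'd'), (0.87, 'b')]
```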

Similarity and clustering

Clustering
Given an unlabeled collection of documents, induce a taxonomy based on similarity (such as Yahoo!)
Need a document similarity measure
–Represent documents by TFIDF vectors
–Distance between document vectors
–Cosine of angle between document vectors
Issues
–Large number of noisy dimensions
–Notion of noise is application-dependent
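A minimal sketch of the similarity measure above: documents become TFIDF vectors and are compared by the cosine of the angle between them. The corpus and tokenization are toy assumptions.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [["web", "mining", "web"], ["web", "search"], ["graph", "mining"]]
n_docs = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(doc):
    """Sparse TFIDF vector: term frequency times log inverse doc frequency."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

print(cosine(tfidf(docs[0]), tfidf(docs[1])))  # shared term "web" → similarity > 0
```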

Document model
Vocabulary V, term w_i, document α represented by f(α) = (n(α, w_1), …, n(α, w_|V|)), where n(α, w_i) is the number of times w_i occurs in document α
Most entries of f(α) are zero for a single document
Apply a monotone component-wise damping function g, such as log or square-root
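The model above can be sketched directly: f counts term occurrences (a mostly-zero sparse vector), and a monotone damping function g (square root here, as one of the slide's choices) de-emphasizes repeated occurrences. The tokenized document is a toy assumption.

```python
import math
from collections import Counter

def term_counts(doc_tokens):
    """f(α): sparse vector of raw term counts n(α, w_i)."""
    return Counter(doc_tokens)

def damp(counts, g=math.sqrt):
    """Apply the monotone damping function g component-wise."""
    return {t: g(n) for t, n in counts.items()}

f = term_counts(["java", "java", "java", "java", "coffee"])
print(damp(f))  # → {'java': 2.0, 'coffee': 1.0}
```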

Bottom-up clustering
Initially G is a collection of singleton groups, each with one document
Repeat
–Find Γ, Δ in G with max s(Γ ∪ Δ)
–Merge group Γ with group Δ
For each Γ, keep track of the best Δ to merge with
O(n² log n) algorithm with O(n²) space
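A naive sketch of the loop above, on 1-D points with group-average similarity taken as negative average pairwise distance (a toy choice, not the slide's TFIDF similarity). The slide's O(n² log n) bound requires the per-group best-merge bookkeeping; this direct version recomputes the best pair each round and is slower, but shows the structure.

```python
def group_avg_sim(a, b):
    """Toy group-average similarity: negative mean pairwise distance."""
    return -sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def bottom_up(points, k):
    """Merge singleton groups until only k groups remain."""
    groups = [[p] for p in points]  # initially all singletons
    while len(groups) > k:
        # find the pair (Γ, Δ) whose merge has maximum similarity
        i, j = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: group_avg_sim(groups[ij[0]], groups[ij[1]]))
        groups[i] = groups[i] + groups[j]  # merge Γ with Δ
        del groups[j]
    return groups

print(bottom_up([0.0, 0.1, 5.0, 5.2], k=2))  # → [[0.0, 0.1], [5.0, 5.2]]
```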

Updating group average profiles
Un-normalized group profile: p̂(Γ) = Σ_{α∈Γ} p(α)
Can show: s(Γ) = (⟨p̂(Γ), p̂(Γ)⟩ − |Γ|) / (|Γ|(|Γ|−1)), and p̂(Γ ∪ Δ) = p̂(Γ) + p̂(Δ), so merges can be scored without revisiting all document pairs

“Rectangular time” algorithm
Quadratic time is too slow
Randomly sample √(kn) documents
Run the group-average clustering algorithm on the sample to reduce it to k groups or clusters
Iterate assign-to-nearest O(1) times
–Move each document to the nearest cluster
–Recompute cluster centroids
Total time taken is O(kn)
Non-deterministic behavior
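The assign-to-nearest phase above can be sketched on 1-D points with absolute distance (toy assumptions); in the full algorithm, the initial centroids would come from group-average clustering of the random sample, which is omitted here.

```python
def assign_to_nearest(points, centroids, iters=3):
    """Run a constant number of assign/recompute passes."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):  # O(1) iterations
        clusters = [[] for _ in centroids]
        for p in points:  # move each document to the nearest cluster
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute cluster centroids (keep the old one if a cluster empties)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

cents, clusts = assign_to_nearest([0.0, 0.2, 4.0, 4.4], centroids=[1.0, 3.0])
print(cents)  # centroids move toward the two natural clumps, ~[0.1, 4.2]
```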

Collaborative recommendation
People = records, movies = features
Both people and features are to be clustered
–Mutual reinforcement of similarity
Need advanced models
From “Clustering methods in collaborative filtering”, by Ungar and Foster

A model for collaboration
People and movies belong to unknown classes
P_k = probability a random person is in class k
P_l = probability a random movie is in class l
P_kl = probability of a class-k person liking a class-l movie
Gibbs sampling: iterate
–Pick a person or movie at random and assign it to a class with probability proportional to P_k or P_l
–Estimate new parameters
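A heavily simplified sketch of the Gibbs loop above, with two person classes and two movie classes on a tiny 0/1 "likes" matrix. The data, the smoothing constants, and all names are assumptions for illustration, not the authors' model code; the loop alternates between resampling one person's class from its (unnormalized) posterior and re-estimating the P_kl parameters.

```python
import random

random.seed(0)
likes = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]  # people x movies, toy data
K, L = 2, 2
pclass = [random.randrange(K) for _ in likes]        # person class assignments
mclass = [random.randrange(L) for _ in likes[0]]     # movie class assignments

def estimate():
    """P_kl: smoothed fraction of (class-k person, class-l movie) pairs liked."""
    num = [[1.0] * L for _ in range(K)]  # add-one style prior (an assumption)
    den = [[2.0] * L for _ in range(K)]
    for i, row in enumerate(likes):
        for j, v in enumerate(row):
            num[pclass[i]][mclass[j]] += v
            den[pclass[i]][mclass[j]] += 1
    return [[num[k][l] / den[k][l] for l in range(L)] for k in range(K)]

for _ in range(50):  # Gibbs iterations
    P = estimate()                        # estimate new parameters
    i = random.randrange(len(likes))      # pick a person at random
    weights = []
    for k in range(K):                    # weight for assigning person i to class k
        w = 1.0
        for j, v in enumerate(likes[i]):
            p = P[k][mclass[j]]
            w *= p if v else (1 - p)
        weights.append(w)
    pclass[i] = random.choices(range(K), weights)[0]

print(pclass)  # sampled class assignments for the three people
```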

Document generation models
Boolean vector (word counts ignored)
–Toss one coin for each term in the universe
Bag of words (multinomial)
–Toss a coin with a term on each face
Limited-dependence models
–Bayesian network where each feature has at most k features as parents
–Maximum entropy estimation
Limited-memory models
–Markov models

Binary (boolean vector)
Let the vocabulary size be |T|
Each document is a vector of length |T|
–One slot for each term
Each slot t has an associated coin with head probability θ_t
Slots are turned on and off independently by tossing the coins
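The binary model above can be sketched directly: each vocabulary slot t gets an independent coin toss with head probability θ_t. The vocabulary and probabilities are toy values.

```python
import random

random.seed(42)
theta = {"java": 0.9, "coffee": 0.5, "applet": 0.1}  # toy head probabilities

def generate_doc(theta):
    """Turn each slot on (1) or off (0) by an independent coin toss."""
    return {t: int(random.random() < p) for t, p in theta.items()}

doc = generate_doc(theta)
print(doc)  # e.g. a 0/1 vector over the vocabulary
```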

Limitations
With the term distribution
–The 100th occurrence of a term is as surprising as the first
–No inter-term dependence
With using the model
–Most observed parameters θ(c,t) are zero and/or noisy
–Have to pick a low-noise subset of the term universe
–Have to “fix” low-support statistics
Smoothing and discretization
Coin turned up heads 100/100 times; what is Pr(tail) on the next toss?
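The coin puzzle above is the classic motivation for smoothing: the maximum-likelihood estimate of Pr(tail) after 100/100 heads is 0, which is too confident. Under Laplace (add-one) smoothing, one of the standard fixes, the estimate stays strictly positive:

```python
def laplace(successes, trials, outcomes=2):
    """Add-one smoothed probability estimate: (s + 1) / (n + #outcomes)."""
    return (successes + 1) / (trials + outcomes)

print(laplace(0, 100))    # Pr(tail) = 1/102, not 0
print(laplace(100, 100))  # Pr(head) = 101/102, not 1
```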

Effect of parameter smoothing
The multinomial model is known to be more accurate than the binary model under Laplace smoothing
A better marginal distribution model compensates for modeling term counts!
Good parameter smoothing is critical

Co-training
Divide features into two class-conditionally independent sets
Use labeled data to induce two separate classifiers
Repeat:
–Each classifier is “most confident” about some unlabeled instances
–These are labeled and added to the training set of the other classifier
Improvements shown for text + hyperlinks

Topical locality on the Web
Sample a sequence of out-links from pages
Classify the out-links
See if the class is the same as that of the page at offset zero
TFIDF similarity across the endpoints of a link is very large compared to that of random page-pairs