3 Introduction
- A hierarchical approach to model-based clustering of grouped data
- Find an unknown number of clusters that capture the structure of each group, while allowing clusters to be shared among the groups
- Example: documents with an arbitrary number of topics, which are shared globally across a set of corpora
- A Dirichlet process (DP) will be used as a prior over mixture components
- The DP will be extended to a hierarchical Dirichlet process (HDP) to allow sharing of clusters among related clustering problems

4 Motivation
- Interested in problems where the observations are organized into groups
- Let xji be the ith observation in group j, so that xj = {xj1, xj2, ...}
- Within a group, xji is exchangeable with every other element of xj
- Across groups, xj is exchangeable with xk for all j, k

5 Motivation
- Assume each observation is drawn independently from a mixture model
- The factor θji is the mixture component associated with xji
- Let F(θji) be the distribution of xji given θji
- Let Gj be the prior distribution of θj1, θj2, ..., which are conditionally independent given Gj:
  θji | Gj ~ Gj,  xji | θji ~ F(θji)

7 The Dirichlet Process
- Let (Θ, B) be a measurable space
- Let G0 be a probability measure on that space
- Let A = (A1, A2, ..., Ar) be a finite measurable partition of that space
- Let α0 be a positive real number
- G ~ DP(α0, G0) is defined such that, for every finite partition A:
  (G(A1), G(A2), ..., G(Ar)) ~ Dir(α0 G0(A1), α0 G0(A2), ..., α0 G0(Ar))
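The defining property above can be checked numerically on a fixed partition; here is a minimal sketch (the partition masses, α0, and sample count are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0 = 5.0
G0_masses = np.array([0.2, 0.3, 0.5])   # G0(A1), G0(A2), G0(A3): illustrative values

# Defining property: (G(A1), ..., G(Ar)) ~ Dir(alpha0*G0(A1), ..., alpha0*G0(Ar))
samples = rng.dirichlet(alpha0 * G0_masses, size=100_000)

# The empirical mean recovers the base measure: E[G(A)] = G0(A)
print(samples.mean(axis=0))
```

Larger α0 concentrates the random measure more tightly around G0; smaller α0 makes individual draws more variable.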

8 Stick-Breaking Construction
- General idea: G will be a weighted sum of point masses at an infinite set of random atoms
- Two infinite sets of i.i.d. random variables:
  - ϕk ~ G0: samples from the base probability measure
  - πk' ~ Beta(1, α0): define the weights of these samples
- The weights are πk = πk' ∏_{l<k} (1 - πl'), giving G = Σ_{k=1}^∞ πk δ_{ϕk}
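A truncated sketch of the stick-breaking construction (the Gaussian base measure and the truncation level K are assumptions for illustration; the true construction is infinite):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha0, K = 2.0, 1000     # K is a truncation level; the true sum is infinite

beta_primes = rng.beta(1.0, alpha0, size=K)   # pi_k' ~ Beta(1, alpha0)
# pi_k = pi_k' * prod_{l<k} (1 - pi_l'): the "stick" remaining after earlier breaks
remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta_primes[:-1])))
pi = beta_primes * remaining
phi = rng.normal(0.0, 1.0, size=K)            # phi_k ~ G0; G0 = N(0,1) is an assumption

# G = sum_k pi_k * delta_{phi_k}; at this truncation the weights sum to ~1
print(pi.sum())
```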

11 Polya Urn Scheme / CRP
- Let θ1, θ2, ... be i.i.d. random variables distributed according to G
- Consider the distribution of θi given θ1, ..., θi-1, integrating out G:
  θi | θ1, ..., θi-1 ~ Σ_{l=1}^{i-1} 1/(i-1+α0) δ_{θl} + α0/(i-1+α0) G0

12 Polya Urn Scheme
- Consider a simple urn-model representation: each sample is a ball of a certain color
- Balls are drawn equiprobably; when a ball of color x is drawn, both that ball and a new ball of color x are returned to the urn
- With probability proportional to α0, a new atom is instead drawn from G0, and a ball of a new color is added to the urn

13 Polya Urn Scheme
- Let ϕ1, ..., ϕK be the distinct values taken on by θ1, ..., θi-1
- If mk is the number of values among θ1, ..., θi-1 equal to ϕk:
  θi | θ1, ..., θi-1 ~ Σ_{k=1}^{K} mk/(i-1+α0) δ_{ϕk} + α0/(i-1+α0) G0
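The urn dynamics above can be simulated directly; a sketch with integer labels standing in for draws from G0 (α0 and the number of draws are illustrative):

```python
import random
from collections import Counter

random.seed(0)
alpha0, n = 1.0, 500          # illustrative values
counts = Counter()            # m_k: number of previous draws equal to atom phi_k
next_label = 0                # integer labels stand in for fresh draws from G0

for i in range(n):
    u = random.random() * (i + alpha0)
    acc = 0.0
    for k, m in counts.items():
        acc += m              # existing atom phi_k chosen w.p. m_k / (i + alpha0)
        if u < acc:
            counts[k] += 1
            break
    else:
        counts[next_label] += 1   # new atom from G0, w.p. alpha0 / (i + alpha0)
        next_label += 1

# The number of distinct atoms grows roughly like alpha0 * log(n)
print(len(counts))
```

The rich-get-richer effect is visible in the counts: a few atoms accumulate most of the draws, which is exactly the clustering behavior the DP prior induces.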

15 Dirichlet Process Mixture Model
- The Dirichlet process as a nonparametric prior on the parameters of a mixture model:
  G | α0, G0 ~ DP(α0, G0),  θi | G ~ G,  xi | θi ~ F(θi)

16 Dirichlet Process Mixture Model
- From the stick-breaking representation, θi takes the value ϕk with probability πk
- Let zi be the indicator variable recording which ϕk the factor θi is associated with:
  zi | π ~ Mult(π),  xi | zi, (ϕk) ~ F(ϕzi)

17 Infinite Limit of Finite Mixture Models
- Consider a multinomial over L mixture components with parameters π = (π1, ..., πL)
- Let π have a symmetric Dirichlet prior with hyperparameters (α0/L, ..., α0/L)
- Let xi be drawn from mixture component zi according to the defined distribution:
  π ~ Dir(α0/L, ..., α0/L),  zi | π ~ Mult(π),  xi | zi, (ϕk) ~ F(ϕzi)
- As L → ∞, this finite mixture model approaches the Dirichlet process mixture model
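A quick simulation of the finite model above (α0 and the sample size are illustrative). The point it demonstrates: even as L grows, the number of occupied components stays small, consistent with the DP mixture limit:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha0, n = 1.0, 200          # illustrative concentration and sample size

occupied = []
for L in (10, 100, 1000):
    pi = rng.dirichlet(np.full(L, alpha0 / L))   # symmetric Dirichlet prior on weights
    pi = pi / pi.sum()                           # guard against round-off before sampling
    z = rng.choice(L, size=n, p=pi)              # z_i ~ Mult(pi)
    occupied.append(len(np.unique(z)))

# Occupied-component counts stay small as L grows, matching the L -> infinity limit
print(occupied)
```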

21 HDP Definition (Cont.)
- Formally, a hierarchical Dirichlet process defines:
  - A set of random probability measures Gj, one for each group j
  - A global random probability measure G0
- G0 is distributed as a Dirichlet process: G0 | γ, H ~ DP(γ, H)
- The Gj are conditionally independent given G0 and also follow a DP: Gj | α0, G0 ~ DP(α0, G0)
- G0 is discrete (with probability one), so the Gj share its atoms

22 Hierarchical Dirichlet Process Mixture Model
- The hierarchical Dirichlet process as a prior distribution over the factors of grouped data
- For each group j:
  - Each observation xji corresponds to a factor θji
  - The factors are i.i.d. random variables distributed as Gj:
    θji | Gj ~ Gj,  xji | θji ~ F(θji)
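A truncated sketch of the two-level construction in its standard stick-breaking form (the base measure N(0,1), the truncation level K, and the hyperparameters are assumptions for illustration). The key point: every group re-weights the SAME global atoms, which is how clusters are shared:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, alpha0 = 1.0, 1.0
K, J = 50, 3                  # truncation level and number of groups (illustrative)

# Global stick-breaking weights beta over shared atoms phi_k
b = rng.beta(1.0, gamma, size=K)
beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
beta = beta / beta.sum()                  # renormalize the truncated weights
phi = rng.normal(0.0, 1.0, size=K)        # shared atoms; H = N(0,1) is an assumption

# Group-level weights: in the truncated representation, pi_j | beta ~ Dir(alpha0 * beta)
alpha = np.maximum(alpha0 * beta, 1e-12)  # tiny floor for numerical safety
pi = rng.dirichlet(alpha, size=J)

# Each row of pi re-weights the same atoms phi; groups differ only in weights
print(pi.shape)
```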

23 Remarks
- The HDP can be extended to more than two levels
- The base measure H can itself be drawn from a DP, and so on
- This forms a tree in which each node is a DP
- Child nodes are conditionally independent given their parent, which serves as their base measure
- The atoms at a given node are shared among all of its descendant nodes

26 Analogy II: The Chinese Restaurant Franchise
- General idea: allow multiple restaurants to share a common menu, i.e., a set of dishes
- Each restaurant has infinitely many tables, and each table serves exactly one dish
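The franchise metaphor can be simulated as two nested urn processes: a CRP over tables within each restaurant, and a CRP over dishes on the shared menu. A sketch (hyperparameters and group sizes are illustrative; integer labels stand in for dishes):

```python
import random

random.seed(4)
alpha0, gamma = 1.0, 1.0      # table-level and dish-level concentrations (illustrative)
J, n_per = 3, 100             # restaurants (groups) and customers per restaurant

dish_tables = {}              # m_k: number of tables franchise-wide serving dish k
restaurants = [[] for _ in range(J)]   # per restaurant: (customer_count, dish) per table
next_dish = 0                 # integer labels stand in for dishes (atoms)

for j in range(J):
    for i in range(n_per):
        tables = restaurants[j]
        u = random.random() * (i + alpha0)
        acc, seated = 0.0, False
        for t, (n_jt, k) in enumerate(tables):
            acc += n_jt       # join existing table w.p. n_jt / (i + alpha0)
            if u < acc:
                tables[t] = (n_jt + 1, k)
                seated = True
                break
        if not seated:
            # New table: order a dish from the shared menu via a second CRP
            total = sum(dish_tables.values())
            v = random.random() * (total + gamma)
            acc2, chosen = 0.0, None
            for k, m in dish_tables.items():
                acc2 += m     # existing dish w.p. m_k / (total + gamma)
                if v < acc2:
                    chosen = k
                    break
            if chosen is None:
                chosen, next_dish = next_dish, next_dish + 1   # brand-new dish
            dish_tables[chosen] = dish_tables.get(chosen, 0) + 1
            tables.append((1, chosen))

# Dishes (atoms) are shared across all restaurants; tables are local to each
print(len(dish_tables), [len(r) for r in restaurants])
```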

32 Introduction to Three MCMC Schemes
- Assumption: H is conjugate to F
- Scheme I: a straightforward Gibbs sampler based on the Chinese restaurant franchise
- Scheme II: an augmented representation involving both the Chinese restaurant franchise and the posterior for G0
- Scheme III: a variation on Scheme II with streamlined bookkeeping

34 Scheme I: Posterior Sampling in the Chinese Restaurant Franchise
- Gibbs sampling alternates over the table assignments t and the dish assignments k
- Sampling t: assign xji to an existing table t with probability proportional to n_jt (the number of customers at that table) times the conditional likelihood of xji under that table's dish, or to a new table with probability proportional to α0 times the marginal likelihood of xji under the dish-level CRP
- If tji is a new table, sample the dish k corresponding to it: an existing dish k with probability proportional to m.k times the conditional likelihood of xji, and a new dish with probability proportional to γ

35 Sampling k
- p(kjt = k | t, k^-jt) ∝ m.k · f_k(xjt) for an existing dish k, and ∝ γ · f_knew(xjt) for a new dish
- Here xjt denotes all the observations at table t in restaurant j

36 Scheme II: Posterior Sampling with an Augmented Representation
- The posterior of G0, given the table-level draws ψ, is itself a Dirichlet process:
  G0 | ψ ~ DP(γ + m.., (γH + Σ_k m.k δ_{ϕk}) / (γ + m..))
- An explicit construction for G0 is given by:
  G0 = Σ_{k=1}^{K} βk δ_{ϕk} + βu Gu, where Gu ~ DP(γ, H) and (β1, ..., βK, βu) ~ Dir(m.1, ..., m.K, γ)

37 Scheme II (Cont.)
- Given a sample of G0, the posterior for each group factorizes, so sampling in each group can be performed separately
- Sampling t and k is almost the same as in Scheme I, except that the weights βk replace the table counts m.k
- When a new component knew is instantiated, draw b ~ Beta(1, γ), set βknew = b·βu, and replace βu with (1-b)·βu

39 Scheme III: Posterior Sampling by Direct Assignment
- Difference from Schemes I and II:
  - In I and II, data items are first assigned to a table t, and tables are then assigned to a component k
  - In III, data items are assigned directly to components via the variable zji, which is equivalent to zji = kjtji
  - Tables are collapsed: only their counts mjk are kept

41 Comparison of Sampling Schemes
- Ease of implementation: direct assignment (Scheme III) is simplest
- Convergence speed: direct assignment changes the component membership of one data item at a time
- In Schemes I and II, changing the component of a single table changes the membership of multiple data items at once, which can lead to better performance