12 April 2007

I'm going to try something here to see if it works; my guess is not, but maybe it will at least spark some discussion. (I'm also testing LaTeX in Blogger.) In "text modeling" applications, we're usually dealing with lots of multinomials, especially multinomials over our vocabulary (i.e., unigram language models). These are parameterized by a vector $$\theta$$ of length equal to the size of the vocabulary $$V$$. $$\theta_i$$ should be the probability of seeing word $$i$$. The probability of observing an entire document of length $$n$$, containing $$x_1$$ copies of word 1, $$x_2$$ copies of word 2, and so on, is:

$$p(x \mid \theta) = \binom{n}{x_1 \, x_2 \cdots x_V} \prod_{i=1}^{V} \theta_i^{x_i}$$
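To make that concrete, here's a minimal sketch of the unigram document likelihood (the function and variable names are my own, nothing standard):

```python
import math

def multinomial_log_prob(counts, theta):
    """Log-probability of a document with word counts x_1..x_V
    under a unigram model theta. A minimal sketch."""
    n = sum(counts)
    # log of the multinomial coefficient n! / (x_1! ... x_V!)
    log_p = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # plus sum_i x_i * log(theta_i); terms with x_i = 0 contribute nothing
    log_p += sum(x * math.log(t) for x, t in zip(counts, theta) if x > 0)
    return log_p
```

Working in log space avoids underflow for realistic document lengths.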
What I'm interested in is extensions to this where we don't assume independence. In particular, suppose that we have an ontology. Let's say that our ontology is just some arbitrary graph. What I want is a distribution that essentially prefers to keep drawing values "locally" within the graph. That is, if I've already drawn "computer" then I'm not so surprised to also see "program."

One way of thinking about this is by having a sort of random walk model. Think about placing the individual $$\theta_i$$s on the graph and then allowing them to disperse. So that maybe there is some $$i$$ for which $$\theta_i = 1$$ (and the rest are zero). We certainly want our distribution to prefer draws for word $$i$$, but draws of word $$j$$ where $$j$$ is "near" $$i$$ should also be acceptable.

My first cut attempt at this is as follows. Let $$G$$ be a graph over our vocabulary; let $$N(j)$$ be a neighborhood of $$j$$ (maybe, all nodes within 10 steps); let $$N^{-1}(i)$$ be the inverse neighborhood of $$i$$ (the set of all $$j$$ that have $$i$$ in $$N(j)$$). Let $$d(i,j)$$ be the distance between two nodes and let $$\alpha$$ be a dampening parameter (how fast probability diffuses on the graph). Define our probabilistic model as:

$$p(x \mid \theta) = \binom{n}{x_1 \, x_2 \cdots x_V} \prod_{i=1}^{V} \Bigg[ \sum_{j \in N^{-1}(i)} \frac{\theta_j}{Z_j} \, \alpha^{d(i,j)} \Bigg]^{x_i}$$

Where we let $$Z_j = \sum_{i \in N(j)} \alpha^{d(i,j)}$$.

The idea behind this model is as follows. When we observe word $$i$$, we have to assign a probability to it. This probability is given by any node $$j$$ that contains $$i$$ in its neighborhood. Each such node has some mass $$\theta_j$$ which it can distribute. It distributes this mass proportional to $$\alpha^{d(i,j)}$$ to all nodes $$i \in N(j)$$, where $$Z_j$$ serves to normalize this diffusion.
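Here's a rough sketch of that diffusion in code. The adjacency-dict representation, the BFS helper, and the function names are all my own invention; it just computes the bracketed per-word probabilities from the model above:

```python
from collections import deque

def distances_within(graph, j, radius):
    """BFS distances from node j to all nodes at most `radius` steps away.
    `graph` is an adjacency dict {node: [neighbours]}. Returns N(j) as a
    dict mapping each neighbour i to d(i, j)."""
    dist = {j: 0}
    queue = deque([j])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # don't expand past the neighborhood boundary
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diffused_probs(graph, theta, alpha, radius):
    """Diffuse each mass theta_j over its neighborhood N(j) with decay
    alpha**d(i,j), normalizing per source by Z_j. Returns the effective
    per-word probabilities (the bracketed sums in the model)."""
    eta = {i: 0.0 for i in graph}
    for j, mass in theta.items():
        if mass == 0.0:
            continue
        dist = distances_within(graph, j, radius)
        z = sum(alpha ** d for d in dist.values())  # Z_j
        for i, d in dist.items():
            eta[i] += mass * alpha ** d / z
    return eta
```

Because each $$Z_j$$ normalizes its own source, every $$\theta_j$$ contributes exactly its mass, so the diffused probabilities still sum to one.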

I'm pretty sure that the normalizing constant for this distribution is the same as the multinomial's. After all, there is a one-to-one mapping between the multinomial parameters and the "graph multinomial" parameters, so the same normalization should carry over.
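Here's a quick sanity check of that claim (a sketch, not a proof), writing $$\eta_i$$ for the diffused probability of word $$i$$:

```latex
% The diffused per-word probability, with Z_j as defined above:
\eta_i \;=\; \sum_{j \in N^{-1}(i)} \frac{\theta_j}{Z_j}\,\alpha^{d(i,j)},
\qquad
Z_j \;=\; \sum_{i \in N(j)} \alpha^{d(i,j)} .
% Summing over i and swapping the order of summation, each source j
% contributes exactly theta_j (Z_j cancels its own neighborhood sum):
\sum_i \eta_i
\;=\; \sum_j \frac{\theta_j}{Z_j} \sum_{i \in N(j)} \alpha^{d(i,j)}
\;=\; \sum_j \theta_j \;=\; 1 .
% So the document probability is an ordinary multinomial in eta,
% with the usual multinomial coefficient as normalizer:
p(x \mid \theta) \;=\; \binom{n}{x_1 \cdots x_V} \prod_i \eta_i^{x_i} .
```

So the $$\eta$$s form a valid multinomial parameter vector whenever the $$\theta$$s do.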

Now, because I feel an incredible need to be Bayesian, I need a conjugate prior for this "graph multinomial" distribution. Since the graph multinomial is essentially just a reparameterization of the standard multinomial, it seems that a vanilla Dirichlet would be an appropriate prior. However, I cannot prove that it is conjugate -- in particular, I can't seem to come up with appropriate posterior updates.

So this is where y'all come in to save the day! I can't imagine that no one has come up with such a "graph multinomial" distribution -- I just don't know what keywords to search for. If someone has seen something at all like this, I'd love to know. Alternatively, if someone can tell me what the appropriate conjugate prior is for this distribution, plus the posterior update rules, that would also be fantastic.

I think your graph is a Markov chain. You have a word i, with neighbours j_1, ..., j_N, and probabilities p_1, ..., p_N of transitioning to each j from i. The principal eigenvector of the transition matrix is the stationary distribution of the chain. I.e. the probability, on average, of being in each state (word), which I think is what you are after. Yay for eigenvectors 'cause they made the Google dudes billionaires.
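Concretely, something like this (a rough numpy sketch; the function name is made up):

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic transition matrix P,
    via the principal left eigenvector (eigenvalue 1 of P transposed)."""
    vals, vecs = np.linalg.eig(P.T)
    # pick the eigenvector whose eigenvalue is closest to 1
    k = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, k])
    # eigenvectors come back with arbitrary sign/scale; renormalize
    return pi / pi.sum()
```

For a large sparse vocabulary graph you'd use power iteration instead of a dense eigendecomposition, PageRank-style.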

Some refs:

- The eigenvector / stationary distribution relationship is in any text on Markov chains

- More interesting stuff goes under the name of spectral graph theory.

You might be interested in first passage times (the number of words between observing word i and word j) and distributions thereof, known as phase-type distributions.

You might be interested in spectral clustering, which allows you to find the neighbourhoods on a graph given its transition matrix.

I'm never good with Bayesian stuff, so ignore (and forgive) me if the following questions/suggestions are nonsense:

1. Is the difficulty with calculating the posterior with your graph multinomial and Dirichlet prior due to the $$N^{-1}(i)$$ in the sum? It seems like if the neighborhood changes depending on how you define the graph, it is impossible to derive a closed-form solution for a posterior. No?

2. Alternatively, is it possible to use your ontology to constrain the inference in some way, so you can continue using the traditional multinomial+dirichlet Bayesian model?

3. Though different, this idea of using graphs to model words/documents reminds me of this paper: Toutanova et al., Learning Random Walks for Inducing Word Dependency Distributions, ICML'04. They have a random walk that computes $$p(w_i|w_{i-1})$$ as the stationary distribution. I'm not sure how to add a prior to this--what does a prior even mean to a random walk? A prior on the initial distribution (which is irrelevant in the limit for most cases)? Or a prior on the transition matrix? In your case you wanted a prior on $$\theta$$--but it seems like whatever value it is it'll disappear as we reach the stationary distribution... So I guess what you did makes more sense (defining the probability of a document statically, using a fixed neighborhood)....