In the previous article we looked at the Beta-Bernoulli model. This time we’ll extend it to a model with
multiple possible outcomes. We’ll also take a look at the Dirichlet,
Categorical and Multinomial distributions.

After this, we’ll be quite close to implementing interesting models such as the
Latent Dirichlet Allocation (LDA). But for now, we have to understand the
basics first.

Multinomial coefficients

Before we can dive into the Dirichlet-Categorical model we have to briefly look
at the multinomial coefficient, which is the generalization of the binomial
coefficient. First, here’s the definition of the binomial coefficient

$$
\binom{n}{k} = \frac{n!}{k! (n - k)!}
$$

which represents the number of ways we can choose $k$ items out of $n$ total.

We can generalize this to more than two types of items using the multinomial
coefficient, defined as

$$
\binom{n}{k_1, k_2, \ldots, k_m} = \frac{n!}{k_1! \, k_2! \cdots k_m!}
$$

which represents the number of ways we can split $n$ items into $m$ groups,
with $k_1$ items in the first group, $k_2$ items in the second group, and so
on. Note that we require $\sum_{i=1}^m k_i = n$.
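As a quick sanity check, the multinomial coefficient can be computed directly from the factorial formula. A small Python sketch (the function name is ours); when there are only two groups it reduces to the binomial coefficient:

```python
from math import comb, factorial, prod

def multinomial(ks):
    """Number of ways to split sum(ks) items into groups of sizes ks."""
    n = sum(ks)
    # n! divided by the product of the group-size factorials
    return factorial(n) // prod(factorial(k) for k in ks)

print(multinomial([2, 1, 1]))  # 4! / (2! 1! 1!) → 12
print(multinomial([3, 2]) == comb(5, 3))  # two groups = binomial → True
```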

Categorical distribution

Now that we are comfortable with multinomial coefficients, let us continue with
the generalization of the Bernoulli distribution, that is the Categorical
distribution, denoted as $Cat(\boldsymbol{\pi})$, where $\boldsymbol\pi$ is a
vector of probabilities for each possible outcome. The probability mass
function (PMF) is simply

$$
p(x|\boldsymbol \pi) = \prod_{i=1}^k \pi_i^{I[x = i]}
$$

where $I[x = i]$ is an indicator variable which evaluates to $1$ if $x=i$, or
to $0$ otherwise. Note that we require $\sum_{i=1}^k \pi_i = 1$.

We can also re-formulate this for the case of one-of-K encoding, where only one
of the outcomes is $1$, and the remaining elements equal $0$. Then the
distribution becomes

$$
p(x|\boldsymbol\pi) = \prod_{i=1}^k \pi_i^{x_i}.
$$

An example of this would be a single roll of a die, where only one of the
outcomes is possible, but each might have a different probability (an unfair
die).
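To make the one-of-K formulation concrete, here is a small Python sketch (the function name is ours) evaluating the PMF for a biased three-sided die:

```python
from math import prod

def categorical_pmf(x, pi):
    """PMF of Cat(pi) with x in one-of-K encoding: p(x|pi) = prod_i pi_i^{x_i}."""
    assert sum(x) == 1, "exactly one outcome must be active"
    assert abs(sum(pi) - 1.0) < 1e-9, "probabilities must sum to 1"
    # factors with x_i = 0 contribute pi_i^0 = 1, leaving only the active outcome
    return prod(p ** xi for p, xi in zip(pi, x))

pi = [0.5, 0.3, 0.2]  # a biased three-sided die
print(categorical_pmf([0, 1, 0], pi))  # probability of outcome 1 → 0.3
```

The exponent trick in the formula is just a compact way of selecting the single $\pi_i$ that corresponds to the observed outcome.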

Multinomial distribution

Having understood the Categorical distribution, we can now move to the
generalization of the Binomial distribution to multiple outcomes, that is the
Multinomial distribution. An easy way to think of it is $n$ rolls of a
$k$-sided dice.

When $n = 1$ and $k = 2$ we have a Bernoulli distribution.

When $n = 1$ and $k > 2$ we have a Categorical distribution.

When $n > 1$ and $k = 2$ we have a Binomial distribution.

And finally, when $n > 1$ and $k > 2$ we have a Multinomial distribution.

Of course, we can always simply use the Multinomial distribution, as it is the
most general. Its PMF, with $\boldsymbol x$ being a vector of counts, is then simply

$$
p(\boldsymbol x|\boldsymbol\pi) = \binom{n}{x_1, \ldots, x_k} \prod_{i=1}^k \pi_i^{x_i}.
$$

In this case $\boldsymbol{x} = (x_1, \ldots, x_k)$ represents the number of
times each outcome was observed, while again $\boldsymbol{\pi} = (\pi_1,
\ldots, \pi_k)$ represents the probabilities of each outcome.

An example of the multinomial distribution is the Bag of Words model, which
describes the number of occurrences of each word in a document. There are $k$
possible words in a dictionary and the document consists of $n$ words in total.
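The PMF translates directly into code. A Python sketch (the helper name is ours), applied to word counts over a tiny three-word dictionary:

```python
from math import factorial, prod

def multinomial_pmf(xs, pi):
    """p(x|pi) = n!/(x_1! ... x_k!) * prod_i pi_i^{x_i}, with n = sum(xs)."""
    # multinomial coefficient: the number of orderings with these counts
    coeff = factorial(sum(xs)) // prod(factorial(x) for x in xs)
    return coeff * prod(p ** x for p, x in zip(pi, xs))

# Counts of each of 3 dictionary words in a 4-word "document"
print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))
```

With $n = 1$ the coefficient is $1$ and this reduces to the Categorical PMF, matching the special cases listed above.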

Dirichlet distribution

Lastly, let us consider the Dirichlet distribution, which is a generalization
of the Beta distribution to more than two outcomes. The Dirichlet distribution
is to the Categorical/Multinomial what the Beta is to the Bernoulli/Binomial.
Its probability density function is

$$
Dir(\boldsymbol\pi | \boldsymbol\alpha) = \frac{1}{B(\boldsymbol\alpha)} \prod_{i=1}^k \pi_i^{\alpha_i - 1}
$$

where $B(\boldsymbol\alpha)$ is the multivariate Beta function, which acts as
a normalization constant. Note that for shorthand we write $\boldsymbol\alpha
= (\alpha_1, \ldots, \alpha_k)$, giving us a shorter notation when writing
$Dir(\boldsymbol\pi | \boldsymbol\alpha)$.
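One way to get a feel for the Dirichlet is to draw samples from it. A standard construction normalizes independent Gamma draws; here is a sketch using only the Python standard library (the function name is ours):

```python
import random

def dirichlet_sample(alpha, rng=None):
    """Draw pi ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    rng = rng or random.Random()
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

pi = dirichlet_sample([2.0, 3.0, 5.0], rng=random.Random(0))
print(pi)  # a random probability vector, components sum to 1
```

Each sample is itself a valid parameter vector for a Categorical distribution, since its components are positive and sum to one, which is exactly why the Dirichlet works as a prior over $\boldsymbol\pi$.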

Dirichlet-Categorical Model

Similarly to the previous article about the Beta-Bernoulli model, we will now introduce the
Dirichlet-Categorical model. Since everything is analogous, we won’t go into
as much detail. The Dirichlet distribution is a conjugate prior to the
Categorical and Multinomial distributions, which means that if we set our prior to a
Dirichlet and our likelihood to a Categorical or Multinomial, the resulting posterior
distribution will again be a Dirichlet distribution.

We can observe this easily by just multiplying out the probability mass
functions for $Cat(\boldsymbol x | \boldsymbol \pi)$ and
$Dir(\boldsymbol\pi|\boldsymbol\alpha)$, that is

$$
p(\boldsymbol\pi | \boldsymbol x, \boldsymbol\alpha) \propto Cat(\boldsymbol x | \boldsymbol\pi) \, Dir(\boldsymbol\pi | \boldsymbol\alpha) \propto \prod_{i=1}^k \pi_i^{x_i} \prod_{i=1}^k \pi_i^{\alpha_i - 1} = \prod_{i=1}^k \pi_i^{x_i + \alpha_i - 1}.
$$

Since only one of the $x_i$ in the Categorical distribution can be $1$ and the
rest are $0$, say $x_j = 1$, then this will get multiplied by the respective
$\pi_j$ in the Dirichlet distribution, and we can immediately see that
$\alpha_j$ will be increased by one, giving us a new Dirichlet distribution
with parameters $(\alpha_1, \ldots, \alpha_j + 1, \ldots, \alpha_k)$.
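In code, the conjugate update amounts to nothing more than adding observed counts to the prior parameters. A minimal sketch (the function name is ours):

```python
def dirichlet_posterior(alpha, observations):
    """Posterior Dir parameters after observing categorical outcomes.

    observations is a list of outcome indices; each observation of outcome j
    increments alpha_j by one, as in the derivation above.
    """
    posterior = list(alpha)
    for x in observations:
        posterior[x] += 1
    return posterior

# Uniform prior over 3 outcomes, then observe outcomes 0, 2, 2
print(dirichlet_posterior([1, 1, 1], [0, 2, 2]))  # → [2, 1, 3]
```

This count-accumulating update is the same mechanism that will reappear later when we get to models like LDA.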