Sunday, March 17, 2013

Dirichlet Distribution (A Distribution on Top of Another Distribution)

The Dirichlet distribution shows up often in machine learning and natural language processing papers, most notably in probabilistic topic models. One particular application of the Dirichlet distribution is modeling word distributions in text documents, or modeling topics in a collection of text documents.

A Dirichlet distribution on $x=(x_1, \cdots, x_m)$ with parameters $\alpha=(\alpha_1, \cdots, \alpha_m)$ is defined as:

$p(x \mid \alpha) = \frac{\displaystyle\Gamma(A)}{\displaystyle\prod_{i=1}^{m}{\Gamma(\alpha_i)}}  \displaystyle\prod_{i=1}^{m}{x_i^{\alpha_i-1}}$

where $A = \displaystyle\sum_{i=1}^{m}{\alpha_i}$.
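To make the formula concrete, here is a minimal sketch that evaluates the density using only Python's standard library. The function names `dirichlet_log_pdf` and `dirichlet_pdf` are my own choices for illustration, and working in log space (via `math.lgamma`) is just one reasonable way to avoid overflow in the Gamma terms.

```python
import math

def dirichlet_log_pdf(x, alpha):
    """Log of the Dirichlet density at x for parameters alpha."""
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("every alpha_i must be positive")
    if any(xi <= 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        raise ValueError("x must have positive entries that sum to 1")
    A = sum(alpha)
    # log of Gamma(A) / prod_i Gamma(alpha_i)
    log_norm = math.lgamma(A) - sum(math.lgamma(a) for a in alpha)
    # log of prod_i x_i^(alpha_i - 1)
    log_kernel = sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))
    return log_norm + log_kernel

def dirichlet_pdf(x, alpha):
    return math.exp(dirichlet_log_pdf(x, alpha))

# With alpha = (1, 1, 1) the product term is 1, so the density is Gamma(3) = 2.
print(dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
# With alpha = (2, 3, 5): Gamma(10) / (Gamma(2) Gamma(3) Gamma(5)) = 7560,
# times 0.2^1 * 0.3^2 * 0.5^4 = 0.001125, which gives about 8.505.
print(dirichlet_pdf([0.2, 0.3, 0.5], [2.0, 3.0, 5.0]))
```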

In this formula, $\Gamma$ stands for the Gamma function. You can think of the Gamma function as an extension of the factorial function to the real numbers: for a positive integer $n$, $\Gamma(n) = (n-1)!$. By the way, if you are thinking that the above formula looks similar to the multinomial distribution formula, you're right! The Dirichlet distribution and the multinomial distribution can be roughly seen as reverse counterparts: the multinomial distribution gives the probability of the counts given the probabilities, while the Dirichlet distribution gives the density of the probabilities given the $\alpha$ pseudo-counts.
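If you want to convince yourself of the Gamma/factorial relationship, a quick check with Python's `math` module is enough. This is only a small illustration; the variable name `matches` is mine.

```python
import math

# For positive integers n, Gamma(n) equals (n - 1)!.
matches = []
for n in range(1, 8):
    g = math.gamma(n)
    f = math.factorial(n - 1)
    matches.append(math.isclose(g, f))
    print(n, g, f)

# Unlike the factorial, Gamma is also defined between the integers.
# Gamma(1/2) is the square root of pi, so its square should be close to pi.
half_squared = math.gamma(0.5) ** 2
print(half_squared, math.pi)
```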

You have to be careful about $x$ here, because its domain is restricted: $x$ has to lie on a simplex. That means each $0<x_i<1$ and $\sum_{i=1}^{m}{x_i} = 1$. The following figure shows a simplex. Generally speaking, a simplex with $n+1$ vertices is an $n$-dimensional polytope. So a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and so on. A vector $x$ with $m$ components therefore lives on an $(m-1)$-simplex.
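The two conditions are easy to check in code. The helper below is a hypothetical name of my own (`is_on_simplex`), and the tolerance is an assumption to absorb floating-point rounding in the sum.

```python
def is_on_simplex(x, tol=1e-9):
    """True if every entry is strictly between 0 and 1 and the entries sum to 1."""
    entries_ok = all(0.0 < xi < 1.0 for xi in x)
    sum_ok = abs(sum(x) - 1.0) <= tol
    return entries_ok and sum_ok

print(is_on_simplex([0.2, 0.3, 0.5]))   # valid point on the 2-simplex
print(is_on_simplex([0.5, 0.6, -0.1]))  # sums to 1 but has a negative entry
print(is_on_simplex([0.2, 0.2, 0.2]))   # entries are fine but the sum is 0.6
```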



So, with a Dirichlet distribution, unlike many other distributions (such as a Gaussian distribution), your domain cannot be anything you want; it has to be a simplex. It turns out that you can always think of a point on a simplex as a probability distribution, because each $0<x_i<1$ and the $x_i$ sum to 1. Therefore, the Dirichlet distribution becomes quite interesting: we are in fact defining a distribution on top of another "distribution"!

Now, you might ask: what is the role of $\alpha$? It controls the shape of the density function on the simplex. In other words, it controls how the values of $x$ are spread over our simplex. One special case is when all $\alpha_i$ are equal. Then the distribution over the simplex will be symmetrical. You can see a couple of Dirichlet distribution examples for different $\alpha$ values below (images are taken from this technical report, highly recommended if you want to dig deeper).




In these figures, whenever $\alpha = [c,c,c]$, the density is symmetric, as can be seen in the first three plots (clockwise, from top left). When $\alpha=[1,1,1]$, the distribution turns into a uniform distribution (top left). When $0<c<1$, the density is packed at the vertices of the simplex (top right), so this case is usually used to represent a sparse distribution. For example, you can use it to represent a sparse topic distribution over documents, if you know that each document is only about a few topics such as sports and celebrities. If $c > 1$, the density becomes concentrated at the center of the simplex (bottom left). Finally, if the $\alpha_i$ are not all the same, the density will not be symmetrical; it will be concentrated towards the vertex with the highest corresponding $\alpha_i$.
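You can see the same effect numerically without any plots. The sketch below draws samples using a standard construction (draw independent Gamma($\alpha_i$, 1) values and normalize them so they sum to 1) and reports the average size of the largest component of $x$. The function names, the sample size, and the seed are my own assumptions for illustration. A value near 1 means the samples sit close to a vertex, while a value near 1/3 means they sit close to the center of the triangle.

```python
import random

def sample_dirichlet(alpha, rng):
    """One Dirichlet draw, obtained by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_component(c, n_samples=5000, seed=0):
    """Average of max(x) over samples from Dirichlet(c, c, c)."""
    rng = random.Random(seed)
    alpha = [c, c, c]
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_dirichlet(alpha, rng))
    return total / n_samples

# Small c: samples hug the vertices. Large c: samples gather near the center.
for c in (0.1, 1.0, 10.0):
    print(c, round(mean_largest_component(c), 3))
```

For $c = 1$ (the uniform case) the exact expected value of the largest component is $11/18 \approx 0.611$, which is a handy sanity check on the sampler.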

The mean of a Dirichlet distribution is quite simple:

$E(x_i) = \frac{\displaystyle\alpha_i}{\displaystyle A}$
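As a tiny worked example, here is the mean computed directly from the formula. The function name `dirichlet_mean` is just for illustration.

```python
def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i divided by the sum of all alpha_i."""
    A = sum(alpha)
    return [a / A for a in alpha]

# alpha = (2, 3, 5) has A = 10, so the mean is (0.2, 0.3, 0.5).
print(dirichlet_mean([2.0, 3.0, 5.0]))
# For a symmetric alpha = (c, c, c) the mean is the center of the simplex,
# no matter how large or small c is.
print(dirichlet_mean([0.1, 0.1, 0.1]))
print(dirichlet_mean([10.0, 10.0, 10.0]))
```

Note that the mean is itself a point on the simplex, and that it depends only on the ratios of the $\alpha_i$, not on their overall scale.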



