Sunday, March 17, 2013

Dirichlet Distribution (A Distribution on Top of Another Distribution)

The Dirichlet distribution is one of the distributions that you will often see in machine learning and natural language processing papers. For example, you will see it a lot in probabilistic topic models. One particular application of the Dirichlet distribution is modeling word distributions in text documents, or modeling topic distributions in a collection of text documents.

A Dirichlet distribution on $x=(x_1, \cdots, x_m)$ with parameters $\alpha=(\alpha_1, \cdots, \alpha_m)$ is defined as:

$p(x \mid \alpha) = \frac{\displaystyle\Gamma(A)}{\displaystyle\prod_{i=1}^{m}{\Gamma(\alpha_i)}}  \displaystyle\prod_{i=1}^{m}{x_i^{\alpha_i-1}}$

where $A = \displaystyle\sum_{i=1}^{m}{\alpha_i}$.
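As a quick illustration, here is a minimal sketch of evaluating this density in log space (assuming NumPy and SciPy are available; gammaln is the log of the Gamma function appearing in the normalizing constant):

import numpy as np
from scipy.special import gammaln  # log of the Gamma function, for numerical stability

def dirichlet_log_pdf(x, alpha):
    """log p(x | alpha) for x on the simplex (each x_i > 0 and sum(x_i) == 1)."""
    x, alpha = np.asarray(x, dtype=float), np.asarray(alpha, dtype=float)
    A = alpha.sum()
    log_norm = gammaln(A) - gammaln(alpha).sum()          # log of Gamma(A) / prod Gamma(alpha_i)
    return log_norm + np.sum((alpha - 1.0) * np.log(x))   # sum of (alpha_i - 1) * log x_i

# Example: density of the point (0.2, 0.3, 0.5) under alpha = (2, 2, 2)
print(np.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [2.0, 2.0, 2.0])))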

In the formula above, $\Gamma$ stands for the Gamma function (you can think of the Gamma function as an extension of the factorial function, so it also works on real numbers). By the way, if you are thinking that the above formula looks somehow similar to the multinomial distribution formula, you're right! The Dirichlet distribution and the multinomial distribution can be roughly seen as reverse counterparts (the multinomial distribution gives you counts given probabilities, while the Dirichlet distribution gives you probabilities given the $\alpha$ pseudo-counts).
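To make the pseudo-count intuition concrete, here is a tiny sketch (assuming NumPy; the numbers are just illustrative). Because the Dirichlet is the conjugate prior of the multinomial, observed counts simply get added to the $\alpha$ values:

import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])   # prior pseudo-counts for three outcomes
counts = np.array([5, 2, 0])              # observed multinomial counts
alpha_posterior = alpha_prior + counts    # posterior is Dirichlet(alpha_prior + counts)
print(alpha_posterior)                    # [6. 3. 1.]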

You have to be careful about $x$ here, as there is a restriction on it: $x$ has to lie on a simplex, meaning that each $0<x_i<1$ and $\sum{x_i} = 1$. The following figure shows a simplex. Generally speaking, a simplex with $n+1$ vertices is an $n$-dimensional polytope. So a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and so on.



So, with a Dirichlet distribution, unlike many other distributions (such as a Gaussian distribution), the domain cannot be anything you want; it has to be a simplex. It turns out that you can always think of a point on the simplex as representing a probability distribution, because each $0<x_i<1$ and they sum up to 1. Therefore, the Dirichlet distribution becomes quite interesting: we are in fact defining a distribution on top of another "distribution"!
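You can check this numerically; the sketch below (assuming NumPy) draws a few Dirichlet samples and verifies that each one is a valid probability vector:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha=[2.0, 2.0, 2.0], size=5)

print(samples)              # each row has components in (0, 1)
print(samples.sum(axis=1))  # each row sums to 1, i.e., a point on the simplex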

Now, you might ask: what is the role of $\alpha$? It controls the shape of the density function on the simplex. In other words, it controls how the probability mass is spread over the simplex. One special case is when all the $\alpha_i$ are equal: then the distribution over the simplex is symmetric. You can see a couple of Dirichlet distribution examples for different $\alpha$ values below (images are taken from this technical report, highly recommended if you want to dig deeper).




In these figures, whenever $\alpha = [c,c,c]$, the density is symmetric, as can be seen in the first three plots (clockwise, from top left). When $\alpha=[1,1,1]$, the distribution turns into a uniform distribution over the simplex (top left). When $0<c<1$, the density is packed at the vertices of the simplex (top right), so this case is usually used to represent a sparse distribution. For example, you can use it to represent a sparse topic distribution over documents, if you know that each document is only about a few topics, such as sports and celebrities. If $c > 1$, the density becomes concentrated at the center of the simplex (bottom left). Finally, if the $\alpha_i$ are not all the same, the density is not symmetric; it is concentrated towards the vertex with the highest corresponding $\alpha_i$.
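If you do not have the plots handy, you can get the same intuition by sampling. The sketch below (assuming NumPy; the 0.01 threshold is just an illustrative choice) shows that small $\alpha$ values push mass towards the vertices (many near-zero components), while large values pull it towards the center:

import numpy as np

rng = np.random.default_rng(0)
for c in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(alpha=[c, c, c], size=10000)
    near_zero = (samples < 0.01).mean()   # fraction of components that are nearly zero
    print(f"alpha = [{c}, {c}, {c}] -> fraction of components < 0.01: {near_zero:.3f}")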

The mean of a Dirichlet distribution is quite simple:

$E(x_i) = \frac{\displaystyle\alpha_i}{\displaystyle A}$
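For example, with $\alpha = (2, 1, 1)$ we have $A = 4$, so the mean is $E(x) = (\frac{2}{4}, \frac{1}{4}, \frac{1}{4}) = (0.5, 0.25, 0.25)$; the mean simply normalizes the pseudo-counts.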




Monday, March 11, 2013

Difference between Maximum Likelihood (ML) and Maximum A Posteriori (MAP)

Suppose you have collected some data and you have a number of models (hypotheses) which might be able to explain your data. The question is: which model/hypothesis do you choose? 

There are two different ways to answer this question:
  1. Using the Maximum Likelihood (ML) approach
  2. Using the Maximum A Posteriori (MAP) approach
You might already have seen these terms in your statistics class, but if you don't remember what the difference between ML and MAP is, then continue reading.

Putting it very simply: ML is a simplified version of MAP. When you are using ML to choose your model, you are trying to maximize the following quantity by choosing the appropriate model:

P(data | model)

We assume that the data is already collected, so the only way to maximize the above quantity is by trying different models. Hopefully, you will choose the model that best explains your data. But, wait a minute... What if the best model that explains my data is in fact quite unlikely?! In other words, what if the models are not equally likely to be encountered in the real world? (A leprechaun stealing your money or the universe being held in place by a giant elephant are still hypotheses, though unlikely ones!)

In that case, you are entering the realm of MAP, which takes into account the prior probability of the models too. That is, this time we are trying to maximize the following quantity:

P(data | model) * P(model)

So, not only are we trying to find the model that best describes our data, but we are also looking for a model that is "not unlikely" itself. By just comparing the above two quantities, you can easily see that ML is a special case of MAP, where you assign the same prior to every model (you do not express any preference for one model over another; no bias).

Both ML and MAP ultimately turn into optimization problems (remember that we are trying to maximize some quantity). The difference is that MAP problems end up with an additional regularization term which reflects the prior information.
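As a toy sketch of the difference (a biased-coin example I am adding here for illustration; the Beta(a, b) prior and its parameters are arbitrary choices), the prior acts exactly like that regularization term: with little data the MAP estimate is pulled towards the prior, and with lots of data the two estimates agree:

def ml_estimate(heads, flips):
    # argmax over theta of P(data | theta), for a coin with heads probability theta
    return heads / flips

def map_estimate(heads, flips, a=2.0, b=2.0):
    # argmax over theta of P(data | theta) * P(theta), with a Beta(a, b) prior on theta
    return (heads + a - 1.0) / (flips + a + b - 2.0)

print(ml_estimate(3, 3), map_estimate(3, 3))          # 1.0 vs 0.8: the prior tempers the estimate
print(ml_estimate(300, 400), map_estimate(300, 400))  # 0.75 vs ~0.749: with enough data they agree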

Now, you might ask: how do I choose the prior wisely? In other words, which models should be considered as more likely?

Usually, it is up to you as the designer to impose a certain kind of "bias". In most cases, we consider simpler models to be more likely. For example, if you are designing an algorithm to learn decision trees, you probably want to prefer simpler trees over more complex, deeper ones (you want them to be more succinct). This can be elegantly described using the "minimum description length" principle (read more here), "Occam's Razor", or KISS (Keep It Simple, Stupid). The image below (from Duns Scotus' book Ordinatio) shows one of the earliest arguments for KISS (it is in Latin and means "Plurality is not to be posited without necessity").


Friday, March 8, 2013

Hadoop-Related Jargon

Hadoop implements a computational paradigm named MapReduce. If you have no idea what MapReduce is, read the Wikipedia article here.
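To give a feel for the paradigm itself (this is plain Python, not Hadoop's actual API), a word count boils down to a map step that emits (word, 1) pairs, a shuffle step that groups them by key, and a reduce step that sums each group:

from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word.lower(), 1          # emit (word, 1) for every word in the record

def reduce_phase(word, counts):
    return word, sum(counts)           # sum the partial counts for one word

documents = ["Hadoop implements MapReduce", "MapReduce maps then reduces"]

grouped = defaultdict(list)            # the "shuffle" step: group emitted values by key
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)                     # e.g. {'hadoop': 1, 'implements': 1, 'mapreduce': 2, ...}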

Some Hadoop Related Terms:
  • Hadoop: A MapReduce implementation, developed by Y!
  • HDFS: A distributed file system written in Java for the Hadoop framework
  • Pig: A high-level scripting language to work with Hadoop
  • Hive: A data warehouse infrastructure to work with Hadoop; uses HiveQL (an SQL-like language)
  • HBase: A non-relational key/value datastore to work with Hadoop
  • Mahout: A set of machine learning algorithms to work with Hadoop on big data
Some Other Systems Similar to Hadoop:
  • Dryad
    • Developed at Microsoft
    • Tasks modeled as directed acyclic graph
    • Sequential programs are connected using one-way channels
  • S4
    • Developed by Y!
    • Stream processing
    • Using Java Platform
  • Spark
    • Developed at UC Berkeley
    • In-memory queries, not just IO requests
    • Implemented in Scala
    • Needs a cluster manager (called Mesos)
  • Storm
    • Developed by Twitter
    • Stream processing
    • Guarantees message processing
  • BashReduce
    • Works with Linux commands such as sort and grep
  • Disco
    • Developed at Nokia 
    • Backend is written in Erlang
    • Works with Python
  • GraphLab
    • Developed at CMU
    • For machine learning tasks
    • Data should fit in main memory
    • Is not fault tolerant
  • HPCC
    • Uses Enterprise Control Language (ECL)
    • Implemented in C++