Sunday, March 17, 2013

Dirichlet Distribution (A Distribution on Top of Another Distribution)

The Dirichlet distribution is one of the distributions that you will often see in machine learning and natural language processing papers. For example, you will see it a lot in probabilistic topic models. One particular application of the Dirichlet distribution is modeling word distributions in text documents, or modeling topic distributions in a collection of text documents.

A Dirichlet distribution on $x=(x_1, \cdots, x_m)$ with parameters $\alpha=(\alpha_1, \cdots, \alpha_m)$ is defined as:

$p(x \mid \alpha) = \frac{\displaystyle\Gamma(A)}{\displaystyle\prod_{i=1}^{m}{\Gamma(\alpha_i)}}  \displaystyle\prod_{i=1}^{m}{x_i^{\alpha_i-1}}$

where $A = \displaystyle\sum_{i=1}^{m}{\alpha_i}$.
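As a quick illustration, here is a minimal sketch of evaluating this density in log space (assuming NumPy and SciPy are available; gammaln is the log of the Gamma function appearing in the normalizing constant):

import numpy as np
from scipy.special import gammaln  # log of the Gamma function, for numerical stability

def dirichlet_log_pdf(x, alpha):
    """log p(x | alpha) for x on the simplex (each x_i > 0 and sum(x_i) == 1)."""
    x, alpha = np.asarray(x, dtype=float), np.asarray(alpha, dtype=float)
    A = alpha.sum()
    log_norm = gammaln(A) - gammaln(alpha).sum()          # log of Gamma(A) / prod Gamma(alpha_i)
    return log_norm + np.sum((alpha - 1.0) * np.log(x))   # sum of (alpha_i - 1) * log x_i

# Example: density of the point (0.2, 0.3, 0.5) under alpha = (2, 2, 2)
print(np.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [2.0, 2.0, 2.0])))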

In the formula above, $\Gamma$ stands for the Gamma function (you can think of the Gamma function as an extension of the factorial function, so it also works on real numbers). By the way, if you are thinking that the above formula looks somehow similar to the multinomial distribution formula, you're right! The Dirichlet distribution and the multinomial distribution can be roughly seen as reverse counterparts (the multinomial distribution gives you counts given probabilities, while the Dirichlet distribution gives you probabilities given the $\alpha$ pseudo-counts).
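To make the pseudo-count intuition concrete, here is a tiny sketch (assuming NumPy; the numbers are just illustrative). Because the Dirichlet is the conjugate prior of the multinomial, observed counts simply get added to the $\alpha$ values:

import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])   # prior pseudo-counts for three outcomes
counts = np.array([5, 2, 0])              # observed multinomial counts
alpha_posterior = alpha_prior + counts    # posterior is Dirichlet(alpha_prior + counts)
print(alpha_posterior)                    # [6. 3. 1.]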

You have to be careful about $x$ here, as there is a restriction on it: $x$ has to lie on a simplex, meaning that each $0<x_i<1$ and $\sum{x_i} = 1$. The following figure shows a simplex. Generally speaking, a simplex with $n+1$ vertices is an $n$-dimensional polytope. So a 2-simplex is a triangle, a 3-simplex is a tetrahedron, and so on.



So, with a Dirichlet distribution, unlike many other distributions (such as a Gaussian distribution), the domain cannot be anything you want; it has to be a simplex. It turns out that you can always think of a point on the simplex as representing a probability distribution, because each $0<x_i<1$ and they sum up to 1. Therefore, the Dirichlet distribution becomes quite interesting: we are in fact defining a distribution on top of another "distribution"!
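You can check this numerically; the sketch below (assuming NumPy) draws a few Dirichlet samples and verifies that each one is a valid probability vector:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha=[2.0, 2.0, 2.0], size=5)

print(samples)              # each row has components in (0, 1)
print(samples.sum(axis=1))  # each row sums to 1, i.e., a point on the simplex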

Now, you might ask: what is the role of $\alpha$? It controls the shape of the density function on the simplex. In other words, it controls how the probability mass is spread over the simplex. One special case is when all the $\alpha_i$ are equal: then the distribution over the simplex is symmetric. You can see a couple of Dirichlet distribution examples for different $\alpha$ values below (images are taken from this technical report, highly recommended if you want to dig deeper).




In these figures, whenever $\alpha = [c,c,c]$, the density is symmetric, as can be seen in the first three plots (clockwise, from top left). When $\alpha=[1,1,1]$, the distribution turns into a uniform distribution over the simplex (top left). When $0<c<1$, the density is packed at the vertices of the simplex (top right), so this case is usually used to represent a sparse distribution. For example, you can use it to represent a sparse topic distribution over documents, if you know that each document is only about a few topics, such as sports and celebrities. If $c > 1$, the density becomes concentrated at the center of the simplex (bottom left). Finally, if the $\alpha_i$ are not all the same, the density is not symmetric; it is concentrated towards the vertex with the highest corresponding $\alpha_i$.
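If you do not have the plots handy, you can get the same intuition by sampling. The sketch below (assuming NumPy; the 0.01 threshold is just an illustrative choice) shows that small $\alpha$ values push mass towards the vertices (many near-zero components), while large values pull it towards the center:

import numpy as np

rng = np.random.default_rng(0)
for c in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(alpha=[c, c, c], size=10000)
    near_zero = (samples < 0.01).mean()   # fraction of components that are nearly zero
    print(f"alpha = [{c}, {c}, {c}] -> fraction of components < 0.01: {near_zero:.3f}")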

The mean of a Dirichlet distribution is quite simple:

$E(x_i) = \frac{\displaystyle\alpha_i}{\displaystyle A}$
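For example, with $\alpha = (2, 1, 1)$ we have $A = 4$, so the mean is $E(x) = (\frac{2}{4}, \frac{1}{4}, \frac{1}{4}) = (0.5, 0.25, 0.25)$; the mean simply normalizes the pseudo-counts.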




Monday, March 11, 2013

Difference between Maximum Likelihood (ML) and Maximum A Posteriori (MAP)

Suppose you have collected some data and you have a number of models (hypotheses) which might be able to explain your data. The question is: which model/hypothesis do you choose? 

There are two different ways to answer this question:
  1. Using the Maximum Likelihood (ML) approach
  2. Using the Maximum A Posteriori (MAP) approach
You might already have seen these terms in your statistics class, but if you don't remember what the difference between ML and MAP is, then continue reading.

Putting it very simply: ML is a simplified version of MAP. When you are using ML to choose your model, you are trying to maximize the following quantity by choosing the appropriate model:

P(data | model)

We assume that the data is already collected, so the only way to maximize the above quantity is by trying different models. Hopefully, you will choose the model that best explains your data. But, wait a minute... What if the best model that explains my data is in fact quite unlikely?! In other words, what if the models are not equally likely to be encountered in the real world? (A leprechaun stealing your money or the universe being held in place by a giant elephant are still hypotheses, though unlikely ones!)

In that case, you are entering the realm of MAP, which takes into account the prior probability of the models too. That is, this time we are trying to maximize the following quantity:

P(data | model) * P(model)

So, not only are we trying to find the model that best describes our data, but we are also looking for a model that is "not unlikely" itself. By just comparing the above two quantities, you can easily see that ML is a special case of MAP, where you assign the same prior to every model (you do not express any preference for one model over another; no bias).

Both ML and MAP ultimately turn into optimization problems (remember that we are trying to maximize some quantity). The difference is that MAP problems end up with an additional regularization term which reflects the prior information.
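As a toy sketch of the difference (a biased-coin example I am adding here for illustration; the Beta(a, b) prior and its parameters are arbitrary choices), the prior acts exactly like that regularization term: with little data the MAP estimate is pulled towards the prior, and with lots of data the two estimates agree:

def ml_estimate(heads, flips):
    # argmax over theta of P(data | theta), for a coin with heads probability theta
    return heads / flips

def map_estimate(heads, flips, a=2.0, b=2.0):
    # argmax over theta of P(data | theta) * P(theta), with a Beta(a, b) prior on theta
    return (heads + a - 1.0) / (flips + a + b - 2.0)

print(ml_estimate(3, 3), map_estimate(3, 3))          # 1.0 vs 0.8: the prior tempers the estimate
print(ml_estimate(300, 400), map_estimate(300, 400))  # 0.75 vs ~0.749: with enough data they agree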

Now, you might ask: how do I choose the prior wisely? In other words, which models should be considered as more likely?

Usually, it is up to you as the designer to impose a certain kind of "bias". In most cases, we consider simpler models to be more likely. For example, if you are designing an algorithm to learn decision trees, you probably want to prefer simpler trees over more complex, deeper ones (you want them to be more succinct). This can be elegantly described using the "minimum description length" principle (read more here), "Occam's Razor", or KISS (Keep It Simple, Stupid). The image below (from Duns Scotus' book Ordinatio) shows one of the earliest arguments for KISS (it is in Latin and means "Plurality is not to be posited without necessity").


Friday, March 8, 2013

Hadoop-Related Jargon

Hadoop implements a computational paradigm named MapReduce. If you have no idea what MapReduce is, read the Wikipedia article here.
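To give a feel for the paradigm itself (this is plain Python, not Hadoop's actual API), a word count boils down to a map step that emits (word, 1) pairs, a shuffle step that groups them by key, and a reduce step that sums each group:

from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word.lower(), 1          # emit (word, 1) for every word in the record

def reduce_phase(word, counts):
    return word, sum(counts)           # sum the partial counts for one word

documents = ["Hadoop implements MapReduce", "MapReduce maps then reduces"]

grouped = defaultdict(list)            # the "shuffle" step: group emitted values by key
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)                     # e.g. {'hadoop': 1, 'implements': 1, 'mapreduce': 2, ...}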

Some Hadoop Related Terms:
  • Hadoop: A MapReduce implementation, developed by Y!
  • HDFS: A distributed file system written in Java for the Hadoop framework
  • Pig: A high-level scripting language to work with Hadoop
  • Hive: A data warehouse infrastructure to work with Hadoop; uses HiveQL (an SQL-like language)
  • HBase: A non-relational key/value datastore to work with Hadoop
  • Mahout: A set of machine learning algorithms to work with Hadoop on big data
Some Other Systems Similar to Hadoop:
  • Dryad
    • Developed at Microsoft
    • Tasks modeled as directed acyclic graph
    • Sequential programs are connected using one-way channels
  • S4
    • Developed by Y!
    • Stream processing
    • Using Java Platform
  • Spark
    • Developed at UC Berkeley
    • In-memory queries, not just IO requests
    • Implemented in Scala
    • Needs a cluster manager (called Mesos)
  • Storm
    • Developed by Twitter
    • Stream processing
    • Guarantees message processing
  • BashReduce
    • Works with Linux commands such as sort and grep
  • Disco
    • Developed at Nokia 
    • Backend is written in Erlang
    • Works with Python
  • GraphLab
    • Developed at CMU
    • For machine learning tasks
    • Data should fit in main memory
    • Is not fault tolerant
  • HPCC
    • Uses Enterprise Control Language (ECL)
    • Implemented in C++