Monday, March 11, 2013

Difference between Maximum Likelihood (ML) and Maximum A Posteriori (MAP)

Suppose you have collected some data and you have a number of models (hypotheses) which might be able to explain your data. The question is: which model/hypothesis do you choose? 

There are two different ways to answer this question:
  1. Using Maximum Likelihood approach (ML)
  2. Using Maximum A Posteriori approach (MAP)
You might already have seen these terms in your statistics class, but if you don't remember what the difference between ML and MAP is, then continue reading.

Putting it very simply: ML is a simplified version of MAP. When you are using ML to choose your model, you are trying to maximize the following quantity by choosing the appropriate model:

P(data | model)

We assume that the data is already collected, so the only way to maximize the above quantity is by trying different models. Hopefully, you will choose the model that best explains your data. But, wait a minute ... What if the best model that explains my data is in fact quite unlikely?! In other words, what if the models are not equally likely to be encountered in the real world? (A leprechaun stealing your money, or the universe being held in place by a giant elephant, are still hypotheses, though unlikely ones!)
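To make this concrete, here is a minimal sketch of ML model selection. The data, the candidate models, and their parameters are all made up for illustration: a few coin flips, and three hypotheses about the coin's heads-probability.

```python
# Hypothetical example: observed coin flips (1 = heads) and three
# candidate models, each a fixed heads-probability.
data = [1, 1, 0, 1, 1, 1, 0, 1]                      # 6 heads, 2 tails
models = {"fair": 0.5, "biased": 0.75, "trick": 0.95}

def likelihood(theta, flips):
    """P(data | model): product of per-flip Bernoulli probabilities."""
    p = 1.0
    for x in flips:
        p *= theta if x == 1 else (1 - theta)
    return p

# ML: pick the model under which the observed data is most probable.
ml_model = max(models, key=lambda m: likelihood(models[m], data))
print(ml_model)  # "biased" fits 6 heads out of 8 best
```

Note that the choice depends only on how well each model explains the flips; nothing here says how plausible each model is in the first place.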

In that case, you are entering the realm of MAP, which takes into account the prior probability of the models too. That is, this time we are trying to maximize the following quantity:

P(data | model) * P(model)

So, not only are we trying to find the model that best describes our data, but we are also looking for a model that is "not unlikely" itself. By comparing the above two quantities, you can easily see that ML is a special case of MAP, where you assign the same prior to every model (you do not express any preference for one model over another; no bias).

Both ML and MAP ultimately turn into optimization problems (remember that we are trying to maximize some quantity). The difference is that the MAP objective carries an additional regularization term reflecting the prior information: if you maximize the logarithm of the product above, the prior becomes an additive log P(model) term on top of the ML objective.

Now, you might ask: how do I choose the prior wisely? In other words, which models should be considered as more likely?

Usually, it is up to you as the designer to impose a certain kind of "bias". In most cases, we consider simpler models to be more likely. For example, if you are designing an algorithm to learn decision trees, you probably want to prefer simpler trees over more complex, deeper ones (you want them to be more succinct). This preference can be elegantly described using the "minimum description length" principle, "Occam's Razor", or KISS (Keep It Simple, Stupid). One of the earliest arguments for this idea appears in Duns Scotus' book Ordinatio, in Latin: "Plurality is not to be posited without necessity."

