Wednesday, July 18, 2012

Norms Explained


In everyday life, we have a good sense of the quantity of various items in terms of a single number. For example, we know what it means to buy 5 pounds of flour, and that it is more than 3 pounds of flour. But we do not have such an intuitive sense of quantity and comparison for some mathematical objects, such as vectors and matrices.



For example, how big is a vector like the one below?

$[ 98, 1, 23, 6 ]$

Is it bigger than the following vector?

$[ 2, 97, 12, 13 ]$



What about the two matrices below?
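As an illustration (the specific entries here are arbitrary, simply reusing the numbers from the vectors above):

$\begin{bmatrix} 98 & 1 \\ 23 & 6 \end{bmatrix} \qquad \begin{bmatrix} 2 & 97 \\ 12 & 13 \end{bmatrix}$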





It is not very intuitive to compare vectors and matrices, because each of them is a collection of numbers, not just a single number. We need to define a way to interpret the size (or the quantity) of a collection of numbers, whether it is a vector or a matrix.


Norms

The size of a matrix or a vector is defined based on a "norm". Roughly speaking, the higher the norm of a matrix, the bigger the matrix (of course, this is not a precise definition, but you get the idea). Unlike the value of a single number, there is no unique way to interpret the size of a vector or a matrix. There are many different types of norms, and each one has its own properties and uses.



Some of the well-known norms include the Euclidean norm, the $l_0$ norm, the $l_1$ norm, the Frobenius norm, etc. The most common vector norms are summarized below; each is defined in detail in the following sections.

$l_0$: the number of non-zero elements
$l_1$: the sum of absolute values, $\sum_i{|x_i|}$
$l_2$ (Euclidean): $\sqrt{\sum_i{x_i^2}}$
$l_\infty$: the largest absolute value, $\max_i{|x_i|}$



There are also a number of matrix norms. The matrix norms $l_0$ through $l_\infty$ can be defined in a number of different ways: for example, entrywise, by summing over all the entries of the matrix, or as induced (operator) norms, defined through the result of multiplying the matrix by a vector.
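For instance, the Frobenius norm is an entrywise definition (a sum over all entries), while an induced norm measures how much the matrix can stretch a vector:

$||A||_F = \sqrt{\sum_{i,j}{a_{ij}^2}} \qquad ||A||_p = \max_{x \neq 0}{\frac{||Ax||_p}{||x||_p}}$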




We will explain a couple of vector norms and matrix norms later. But first, let's use a more precise mathematical notation for norms. The norm of a matrix $X$ is denoted by $||X||_i$, where $i$ denotes the type of the norm; e.g. if it is the $l_0$ norm, we denote it as $||X||_0$, and if it is the Euclidean norm, we show it as $||X||_2$. Note that the Euclidean norm is more commonly denoted as $||X||$; as it is one of the most widely used norms, the subscript is usually omitted.


Vector Norms

First, we define the vector norms and then we will explain the matrix norms.

Euclidean Norm 

The Euclidean norm (i.e. the $l_2$ norm) of a vector $X$ is defined as:

$||X|| = \sqrt{\sum_i{x_i^2}}$

The $l_2$ norm is used in many applications because it is continuous and differentiable.

The $l_2$ norm has a very simple geometric interpretation, shown in the following figure: the $l_2$ norm of a 2D point (i.e. a vector with only two elements) is its distance from the origin, i.e. the length of the shortest path to it.

Another useful illustration of the Euclidean norm is its unit circle, i.e. the set of all vectors whose Euclidean norm is equal to 1. You will realize that all such vectors lie on a circle with radius 1.
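As a quick sanity check, here is a short NumPy sketch (the numbers are the two vectors from the beginning of the post) that compares them by their Euclidean norms:

```python
import numpy as np

a = np.array([98.0, 1.0, 23.0, 6.0])
b = np.array([2.0, 97.0, 12.0, 13.0])

# Euclidean (l2) norm: square root of the sum of squared entries
print(np.linalg.norm(a))   # ~100.85
print(np.linalg.norm(b))   # ~98.62
```

So, under the $l_2$ norm, the first vector is the slightly "bigger" one.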


$l_n$ norms in General

In general, the $l_n$ norm can be defined as:

$||X||_n = \sqrt[n]{\sum_i{|x_i|^n}}$

These norms might look similar, but they have very different properties; therefore, they are not interchangeable in most cases.
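To make this concrete, here is a small NumPy sketch showing that the same vector has a different "size" under different norms (the vector is the second example from the introduction):

```python
import numpy as np

x = np.array([2.0, 97.0, 12.0, 13.0])

for p in (1, 2, 3, np.inf):
    # for vectors, np.linalg.norm computes (sum |x_i|^p)^(1/p)
    print(p, np.linalg.norm(x, ord=p))
```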

$l_0$ Norm

If you look carefully at the above formula, you will realize that there is a problem ... we mentioned the $l_0$ norm, but how do we define the $l_0$ norm given the presence of $\sqrt[0]{..}$ and $|x_i|^0$?! Well, since the above formula does not directly make sense for $n = 0$, in practice we use the following definition of $l_0$:

$||X||_0 = \#\{i \mid x_i \neq 0\}$

The above definition tells us that the $l_0$ norm of a vector $X$ is the total number of its non-zero elements. You might think that this norm is just a mathematical definition with perhaps no real-world application. In reality, the $l_0$ norm shows up in many optimization problems; however, since optimizing it directly is difficult, we usually use other norms such as $l_1$ instead.
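In NumPy, the $l_0$ "norm" is simply a count of the non-zero entries (a minimal illustration):

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -2.0, 0.0])
print(np.count_nonzero(x))   # l0 "norm": number of non-zero entries -> 2
```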

$l_1$ Norm

The $l_1$ norm is defined as:

$||X||_1 = \sum_i{|x_i|}$

The $l_1$ norm is used in many applications, and it goes by a couple of other names, such as the "Manhattan distance". It is used in many regularization and optimization problems instead of $l_0$, which is more difficult to optimize (see this post). $l_1$ also has a simple geometric interpretation, as shown in the following figure: it measures distance in terms of "city blocks", hence it is sometimes called the "Manhattan distance" or the "city block distance".


Similar to the Euclidean norm, we can also draw its unit circle, i.e. the set of all vectors whose $l_1$ norm is equal to 1. You will realize that all such vectors lie on a square (and not a circle), rotated by 45 degrees, with vertices at $(\pm 1, 0)$ and $(0, \pm 1)$.
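Computed directly from the definition in NumPy (using the first vector from the introduction):

```python
import numpy as np

x = np.array([98.0, 1.0, 23.0, 6.0])
print(np.abs(x).sum())            # l1 norm: sum of absolute values -> 128.0
print(np.linalg.norm(x, ord=1))   # the same value via NumPy's built-in norm
```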


$l_\infty$ Norm
The $l_\infty$ norm is defined as:

$||X||_\infty = \max{(|x_1|, \cdots, |x_n|)}$

It is sometimes called the Chebyshev norm. The geometric interpretation of its unit circle is a square centered at the origin with vertices at $(\pm 1, \pm 1)$. It has this particular shape because, in order to have a norm of 1, at least one of the entries must have an absolute value of 1, and none can exceed 1.
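Again, computed directly from the definition (using the second vector from the introduction):

```python
import numpy as np

x = np.array([2.0, 97.0, 12.0, 13.0])
print(np.abs(x).max())                   # l_infinity norm: largest absolute entry -> 97.0
print(np.linalg.norm(x, ord=np.inf))     # the same value via NumPy's built-in norm
```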

$l_1$ Optimization

Suppose you need to reconstruct a signal $X$ from a set of measurements $y$. Usually, $y$ does not contain complete information about $X$ due to real-world constraints, e.g. you cannot deploy too many sensors. How do we reconstruct $X$? One approach is to assume that $X$ is sparse, i.e. it has only a few non-zero elements (i.e. fewer unknowns). So, given the known measurement matrix $\Phi$ that relates $X$ to $y$, we want to find the signal $X$ with the minimum number of non-zero entries that is consistent with the measurements. In other words, we are trying to minimize $||X||_0$:

$\min{||X||_0}$
subject to
$\Phi X = y$ 

It would be perfect, but there is only one problem: the $l_0$ minimization problem above is NP-hard, which in simple terms means that solving it exactly requires, in the worst case, exponential time (not practical). There have been some approximate solutions to $l_0$ optimization (see this paper, for example), but in most cases we simply use $l_1$ as a surrogate for $l_0$:

$||X||_1 = \sum_i{|x_i|}$

This is the basis for what is called "compressive sensing" and "$l_1$ regularization", which have many applications in machine learning and signal processing, especially in biomedical applications.
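As a sketch of how the $l_1$ problem can actually be solved, the minimization $\min ||X||_1$ subject to $\Phi X = y$ (known as basis pursuit) can be rewritten as a linear program. The snippet below is a minimal illustration using SciPy's linprog; the dimensions, the random $\Phi$, and the sparsity level are arbitrary choices made for the example:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 50, 20, 3                      # signal length, measurements, sparsity (illustrative)

# A random measurement matrix and a k-sparse "true" signal
Phi = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = Phi @ x_true

# Basis pursuit:  min ||x||_1  subject to  Phi x = y,
# rewritten as a linear program over z = [x, t] with |x_i| <= t_i:
c = np.concatenate([np.zeros(n), np.ones(n)])       # minimize the sum of t
A_ub = np.block([[ np.eye(n), -np.eye(n)],          #  x - t <= 0
                 [-np.eye(n), -np.eye(n)]])         # -x - t <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([Phi, np.zeros((m, n))])           # Phi x = y (t plays no role here)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_hat = res.x[:n]
print("recovery error:", np.linalg.norm(x_hat - x_true))   # typically close to 0
```

With enough measurements relative to the sparsity, the $l_1$ solution typically recovers the sparse signal, which is why it works so well as a practical stand-in for $l_0$.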

Co-variance and Gram Matrix

You can think of your data as a matrix, where the rows represent the instances and the columns represent the features. Let's call this matrix $A$. Now $A^TA$ is a feature $\times$ feature matrix, commonly called the co-variance matrix (assuming the features have been mean-centered). On the other hand, $AA^T$ is an instance $\times$ instance matrix, i.e. it captures the pairwise similarity between instances! This matrix is called the Gram matrix.

This is useful in computing the SVD. The SVD (singular value decomposition) is a method to decompose a matrix $A$ into 3 components:

$A=U\Sigma V^T$

$U$ is obtained from the eigenvectors of the Gram matrix ($AA^T$), while $V$ is obtained from the eigenvectors of the co-variance matrix ($A^TA$). The diagonal entries of $\Sigma$, the singular values, are the square roots of the corresponding non-zero eigenvalues.
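A small NumPy sketch (using an arbitrary random matrix, just for illustration) confirming these relationships:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))              # 5 instances x 3 features (illustrative data)

# SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigen-decompositions of the Gram matrix (A A^T) and of A^T A
w_gram, V_gram = np.linalg.eigh(A @ A.T)     # eigenvalues in ascending order
w_cov,  V_cov  = np.linalg.eigh(A.T @ A)

# The non-zero eigenvalues of both matrices are the squared singular values of A
print(np.allclose(np.sort(s)**2, w_cov))         # True
print(np.allclose(np.sort(s)**2, w_gram[-3:]))   # True

# The leading eigenvectors match the first columns of U and V (up to sign)
print(np.allclose(np.abs(V_gram[:, -1]), np.abs(U[:, 0])))   # True
print(np.allclose(np.abs(V_cov[:, -1]),  np.abs(Vt[0])))     # True
```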