Aug 16, 2016

Otsu threshold

Previously, in my post about the Canny edge detector, we saw that one of the required steps for obtaining perfectly thinned edges is thresholding; in the source code example for that post I used an arbitrary, good-enough value for the thresholds. In this post I'll show you an algorithmic way of obtaining all the thresholds for a given number of output gray levels, based on Otsu's method.

The theory

In 1979, professor Nobuyuki Otsu proposed a statistical method for dividing an image histogram into N groups of similar gray values; you can read the paper here. Otsu's method is also based on Fisher's linear discriminant.

Let's say we have a set of multidimensional elements, for example 2D points, where a point is defined by its X and Y position; or 3D points, where a point is defined by its (X, Y, Z) coordinates; or even 1D elements with just one component. The number of components doesn't matter as long as they are measurable. Now trace a division between the elements, wherever you want, and calculate the mean and deviation of the elements on each side. With this data, calculate the deviation between each group's mean and the total mean, and finally calculate the mean of the deviations of all the groups.

Before continuing: in discriminant analysis the groups of data are called classes, the spread of the class means around the total mean is called the between-class variance ($\sigma_B$), and the spread of the elements within each class is called the within-class variance ($\sigma_W$).

Fisher's linear discriminant establishes a relation between the between-class variance ($\sigma_B$) and the within-class variance ($\sigma_W$): the bigger the between-class variance and the smaller the within-class variance, the more probably we are dealing with two (or more) different groups of data.

Summarizing, and explaining it in simple words: we want to find the division that leaves the elements on both sides far enough apart, and each side compact enough, to treat them as two completely different groups of elements.

If you want a more detailed explanation on how Linear Discriminant Analysis works I recommend watching this video.

Creating the histogram

First of all we need to create a histogram for the luminance channel. A histogram gives us the number of repetitions of each value in a data set; in this case, we count how many pixels share each luminance level.
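As an illustration, a minimal sketch in Python (the Rec. 601 luminance weights and all the names here are my own choices, not from the original post):

```python
def luminance_histogram(pixels):
    """Build a 256-bin histogram of luminance values from a flat list
    of (R, G, B) tuples, each channel in 0..255."""
    hist = [0] * 256
    for r, g, b in pixels:
        # Rec. 601 luma approximation, clamped to the valid range.
        y = min(255, round(0.299 * r + 0.587 * g + 0.114 * b))
        hist[y] += 1
    return hist

# Tiny 2x2 "image": one white, two black, one mid-gray pixel.
pixels = [(255, 255, 255), (0, 0, 0), (0, 0, 0), (128, 128, 128)]
hist = luminance_histogram(pixels)
print(hist[0], hist[128], hist[255])  # 2 1 1
```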

The maths

Let's start by explaining the simplest case, which is 2 classes (groups) and 1 threshold. So we will divide the histogram into two parts, and calculate the mean and the variance on each side of the color axis.

$$
\begin{matrix}
\mu_0=\sum_{i=0}^{k}\frac{p_i}{\omega_0}i & \sigma_0^2=\sum_{i=0}^{k}\frac{p_i}{\omega_0}(i-\mu_0)^2 & (left) \\
\mu_1=\sum_{i=k+1}^{L}\frac{p_i}{\omega_1}i & \sigma_1^2=\sum_{i=k+1}^{L}\frac{p_i}{\omega_1}(i-\mu_1)^2 & (right) \\
\end{matrix}
$$
Where $p_i$ are the normalized weights for each color, $k$ is the selected threshold, and $L$ is the maximum gray level, 255 in this case. $\omega_0$ and $\omega_1$ are the summations of the weights on each side, that is:
$$
\begin{matrix}
\omega_0=\sum_{i=0}^{k}p_i & (left) \\
\omega_1=\sum_{i=k+1}^{L}p_i & (right) \\
\end{matrix}
$$
where:
$$\omega_0+\omega_1=1$$
and the total mean is:
$$\omega_0\mu_0+\omega_1\mu_1=\mu_T$$
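These two identities are easy to verify numerically; here is a quick sanity check in Python (the toy histogram and the split $k$ are arbitrary choices of mine):

```python
# Toy histogram over gray levels 0..255, with all mass on three levels.
hist = [0] * 256
hist[10], hist[50], hist[200] = 3, 1, 4

total = sum(hist)
p = [h / total for h in hist]                  # normalized weights p_i

k = 100                                        # an arbitrary threshold
w0 = sum(p[: k + 1])                           # omega_0
w1 = sum(p[k + 1 :])                           # omega_1
mu0 = sum(i * p[i] for i in range(k + 1)) / w0
mu1 = sum(i * p[i] for i in range(k + 1, 256)) / w1
muT = sum(i * p[i] for i in range(256))        # total mean

print(abs(w0 + w1 - 1.0) < 1e-12)              # True
print(abs(w0 * mu0 + w1 * mu1 - muT) < 1e-12)  # True
```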
Confused? You can visualize the effects of changing the threshold in the following graph:

Linear Discriminant Analysis

In the graph above, you will notice that moving the $k$ index changes $\mu_0$ and $\mu_1$: moving $k$ to the left brings $\mu_0$ closer to $k$ but moves $\mu_1$ away from it, while moving $k$ to the right brings $\mu_1$ closer to $k$ but moves $\mu_0$ away from it. We need to find a threshold for which $\mu_0$ and $\mu_1$ are as far as possible from $k$, but the spread of the colors within each group is minimal. For that reason I will make use of Fisher's linear discriminant, that is:
$$
\lambda=\frac{\sigma_B^2}{\sigma_W^2}
$$
$\sigma_B$ is the between-class variance and $\sigma_W$ is the within-class variance. We need to find the value of $k$ that maximizes $\lambda$, because the bigger the value of $\lambda$, the more probably we are dealing with two clearly different groups (classes). So we have:
$$
\begin{matrix}
\sigma_B^2=\omega_0(\mu_0-\mu_T)^2+\omega_1(\mu_1-\mu_T)^2 \\
\sigma_W^2=\omega_0\sigma_0^2+\omega_1\sigma_1^2
\end{matrix}
$$
$\sigma_B^2$ is also:
$$
\begin{matrix}
\sigma_B^2= & \omega_0(\mu_0-\mu_T)^2+\omega_1(\mu_1-\mu_T)^2 \\
& \omega_0(\mu_0-\omega_0\mu_0-\omega_1\mu_1)^2+\omega_1(\mu_1-\omega_0\mu_0-\omega_1\mu_1)^2 \\
& \omega_0[\mu_0(1 -\omega_0)-\omega_1\mu_1]^2+\omega_1[\mu_1(1 -\omega_1)-\omega_0\mu_0]^2 \\
& \omega_0[\mu_0\omega_1-\omega_1\mu_1]^2+\omega_1[\mu_1\omega_0-\omega_0\mu_0]^2 \\
& \omega_0\omega_1^2(\mu_0-\mu_1)^2+\omega_0^2\omega_1(\mu_0-\mu_1)^2 \\
& \omega_0\omega_1(\mu_0-\mu_1)^2(\omega_0+\omega_1) \\
& \omega_0\omega_1(\mu_0-\mu_1)^2
\end{matrix}
$$
and $\sigma_W^2$ is:
$$
\begin{matrix}
\sigma_W^2= & \omega_0\sigma_0^2+\omega_1\sigma_1^2 \\
& \omega_0\left(\sum_{i=0}^{k}\frac{p_i}{\omega_0}(i-\mu_0)^2\right)+\omega_1\left(\sum_{i=k+1}^{L}\frac{p_i}{\omega_1}(i-\mu_1)^2\right) \\
& \left(\sum_{i=0}^{k}p_i(i-\mu_0)^2\right)+\left(\sum_{i=k+1}^{L}p_i(i-\mu_1)^2\right) \\
& \left(\sum_{i=0}^{k}p_i(i^2+\mu_0^2-2i\mu_0)\right)+\left(\sum_{i=k+1}^{L}p_i(i^2+\mu_1^2-2i\mu_1)\right) \\
& \left(\sum_{i=0}^{k}p_ii^2+\sum_{i=0}^{k}p_i\mu_0^2-2\sum_{i=0}^{k}p_ii\mu_0\right)+\left(\sum_{i=k+1}^{L}p_ii^2+\sum_{i=k+1}^{L}p_i\mu_1^2-2\sum_{i=k+1}^{L}p_ii\mu_1\right) \\
& \left(\sum_{i=0}^{k}p_ii^2+\omega_0\mu_0^2-2\omega_0\mu_0^2\right)+\left(\sum_{i=k+1}^{L}p_ii^2+\omega_1\mu_1^2-2\omega_1\mu_1^2\right) \\
& \left(\sum_{i=0}^{k}p_ii^2-\omega_0\mu_0^2\right)+\left(\sum_{i=k+1}^{L}p_ii^2-\omega_1\mu_1^2\right) \\
& \sum_{i=0}^{L}p_ii^2-\omega_0\mu_0^2-\omega_1\mu_1^2
\end{matrix}
$$
The total variance is:
$$
\begin{matrix}
\sigma_T^2= & \sum_{i=0}^{L}p_i(i-\mu_T)^2 \\
& \sum_{i=0}^{L}p_i(i^2+\mu_T^2-2i\mu_T) \\
& \sum_{i=0}^{L}p_ii^2+\mu_T^2\sum_{i=0}^{L}p_i-2\mu_T\sum_{i=0}^{L}p_ii \\
& \sum_{i=0}^{L}p_ii^2+\mu_T^2-2\mu_T^2 \\
& \sum_{i=0}^{L}p_ii^2-\mu_T^2
\end{matrix}
$$
So $\sigma_W^2$ can be written as:
$$
\begin{matrix}
\sigma_W^2= & \sigma_T^2+\mu_T^2-\omega_0\mu_0^2-\omega_1\mu_1^2 \\
& \sigma_T^2+(\omega_0\mu_0+\omega_1\mu_1)^2-\omega_0\mu_0^2-\omega_1\mu_1^2 \\
& \sigma_T^2+\omega_0^2\mu_0^2+\omega_1^2\mu_1^2+2\omega_0\omega_1\mu_0\mu_1-\omega_0\mu_0^2-\omega_1\mu_1^2 \\
& \sigma_T^2+\omega_0\mu_0^2(\omega_0-1)+\omega_1\mu_1^2(\omega_1-1)+2\omega_0\omega_1\mu_0\mu_1 \\
& \sigma_T^2-\omega_0\omega_1\mu_0^2-\omega_0\omega_1\mu_1^2+2\omega_0\omega_1\mu_0\mu_1 \\
& \sigma_T^2-\omega_0\omega_1(\mu_0-\mu_1)^2 \\
& \sigma_T^2-\sigma_B^2
\end{matrix}
$$
At the end we found that:
$$
\sigma_B^2+\sigma_W^2=\sigma_T^2
$$
and because $\sigma_T^2$ is a constant, $\sigma_B^2$ and $\sigma_W^2$ depend on each other, so we can re-write $\lambda$ as:
$$
\begin{matrix}
\lambda=\frac{\sigma_B^2}{\sigma_T^2-\sigma_B^2} & or & \lambda=\frac{\sigma_T^2-\sigma_W^2}{\sigma_W^2}
\end{matrix}
$$
In both cases, it's possible to increase the value of $\lambda$ by increasing the value of $\sigma_B^2$, or by decreasing the value of $\sigma_W^2$; $\sigma_B^2$ is preferred because $\sigma_W^2$ would require calculating $\mu_T$ first.

In summary, we must iterate $k$ from 1 to $L-1$, and the $k$ that gives the maximum value of $\sigma_B^2$ will be the Otsu threshold.
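That search can be sketched in Python as follows (my own straightforward implementation; it maximizes the simplified form $\sigma_B^2=\omega_0\omega_1(\mu_0-\mu_1)^2$ using running sums, so no per-$k$ rescan of the histogram is needed):

```python
def otsu_threshold(hist):
    """Return the threshold k maximizing the between-class variance
    sigma_B^2 = w0 * w1 * (mu0 - mu1)^2 over a 256-bin histogram."""
    total = sum(hist)
    p = [h / total for h in hist]              # normalized weights p_i
    L = len(hist) - 1                          # maximum gray level (255)
    muT = sum(i * p[i] for i in range(L + 1))  # total mean

    best_k, best_sigma_b = 1, -1.0
    w0, s0 = 0.0, 0.0                          # running omega_0 and sum of i*p_i
    for k in range(L):                         # class 0 covers levels [0, k]
        w0 += p[k]
        s0 += k * p[k]
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 <= 0.0:
            continue                           # empty class, lambda undefined
        mu0 = s0 / w0
        mu1 = (muT - s0) / w1
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2
        if sigma_b > best_sigma_b:
            best_k, best_sigma_b = k, sigma_b
    return best_k

# Bimodal example: two clusters of gray levels around 10 and 200.
hist = [0] * 256
hist[10] = hist[11] = 5
hist[200] = hist[201] = 5
print(otsu_threshold(hist))  # a k separating the two clusters
```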

Generalization for multi level thresholding

In general, for multiple classes, we can calculate $\sigma_B^2$ as:
$$
\begin{matrix}
\sigma_B^2= & \sum_{i=0}^{C}\omega_i(\mu_i-\mu_T)^2 \\
& \sum_{i=0}^{C}\omega_i(\mu_i^2+\mu_T^2-2\mu_i\mu_T) \\
& \sum_{i=0}^{C}\omega_i\mu_i^2+\mu_T^2\sum_{i=0}^{C}\omega_i-2\mu_T\sum_{i=0}^{C}\omega_i\mu_i \\
& \sum_{i=0}^{C}\omega_i\mu_i^2+\mu_T^2-2\mu_T^2 \\
& \sum_{i=0}^{C}\omega_i\mu_i^2-\mu_T^2
\end{matrix}
$$
where $C$ is the number of classes minus one, and because $\mu_T^2$ is a constant, we just need to find the maximum value of:
$$
h=\sum_{i=0}^{C}\omega_i\mu_i^2
$$
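A direct (if slow) way to evaluate $h$ for a candidate set of thresholds can be sketched like this (my own code; the classes are the intervals that the sorted thresholds induce over the histogram):

```python
def h_value(p, thresholds):
    """h = sum over classes of w_i * mu_i^2, with classes
    [0, t_1], [t_1 + 1, t_2], ..., [t_C + 1, L] for sorted thresholds."""
    bounds = [0] + [t + 1 for t in sorted(thresholds)] + [len(p)]
    h = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        w = sum(p[lo:hi])                      # omega_i
        if w == 0.0:
            continue                           # empty class contributes 0
        mu = sum(i * p[i] for i in range(lo, hi)) / w
        h += w * mu * mu
    return h

# Two spikes at levels 10 and 200, split by a single threshold at 100.
p = [0.0] * 256
p[10], p[200] = 0.5, 0.5
print(h_value(p, [100]))  # 0.5*10^2 + 0.5*200^2 = 20050.0
```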

Algorithm speed-up

In 2001, professors Ping-Sung Liao, Tse-Sheng Chen and Pau-Choo Chung proposed a method for speeding up the computation of Otsu's thresholds using look-up tables. You can read the paper here.

Suppose we split the histogram into N groups (classes), so we will have N - 1 thresholds. The interval for each group will be $[k_i, k_{i+1}]$, where $k_0=0$ and $k_N=255$, so $k_1, \ldots, k_{N-1}$ are the thresholds we are looking for. The scan ranges for each $k_i$ are then:
$$
\begin{matrix}
k_1 \in [1, L-N+1] \\
k_2 \in [2, L-N+2] \\
k_3 \in [3, L-N+3] \\
\vdots \\
k_{N-1} \in [N-1, L-1]
\end{matrix}
$$
where $L$ is the maximum gray level. We can then iterate over every combination of thresholds within these ranges.
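That enumeration can be sketched in Python (my own version; every strictly increasing (N-1)-tuple drawn from 1..L-1 automatically falls within the scan ranges above):

```python
from itertools import combinations

L = 255  # maximum gray level
N = 4    # number of classes, so N - 1 = 3 thresholds

# Enumerate every strictly increasing tuple k_1 < k_2 < ... < k_{N-1}
# taken from 1..L-1; here we only count them.
count = sum(1 for _ in combinations(range(1, L), N - 1))
print(count)  # C(254, 3) = 2,699,004 candidate threshold sets
```

This is exactly why the look-up tables matter: the number of candidate sets grows combinatorially with N, so each candidate must be cheap to evaluate.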

Let's return to the formula:
$$
h=\sum_{i=0}^{C}\omega_i\mu_i^2
$$
$\mu_i$ can be written as:
$$
\mu_i=\frac{1}{\omega_i}\sum_{j=u_i}^{v_i}p_jj
$$
In both cases, $\omega_i$ and $\mu_i$ can be calculated using cumulative sum tables, a trick we have used before. So we can rewrite $h$ as:
$$
h=\sum_{i=0}^{C}\frac{(S_{v_i}-S_{u_i})^2}{P_{v_i}-P_{u_i}}
$$
where $P$ and $S$ are cumulative sum tables over the normalized histogram:
$$
\begin{matrix}
P_k=\sum_{i=0}^{k}p_i \\
S_k=\sum_{i=0}^{k}p_ii
\end{matrix}
$$
The $H$ table then holds the precomputed value $H(u,v)=\frac{(S_v-S_u)^2}{P_v-P_u}$ for every interval, so evaluating a candidate set of thresholds reduces to a few look-ups and additions.
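Building those tables can be sketched in Python like this (my own naming; I use the convention that $H[u][v]$ covers the interval of levels $u+1$ through $v$, with $u$ being the previous threshold):

```python
def build_tables(p):
    """Cumulative tables P_k = sum(p[0..k]) and S_k = sum(i*p[i], 0..k),
    plus H[u][v] = (S_v - S_u)^2 / (P_v - P_u) for each interval (u, v]."""
    L = len(p)
    P = [0.0] * L
    S = [0.0] * L
    P[0], S[0] = p[0], 0.0
    for i in range(1, L):
        P[i] = P[i - 1] + p[i]
        S[i] = S[i - 1] + i * p[i]
    H = [[0.0] * L for _ in range(L)]
    for u in range(L):
        for v in range(u + 1, L):
            w = P[v] - P[u]            # omega for the interval (u, v]
            if w > 0.0:
                H[u][v] = (S[v] - S[u]) ** 2 / w
    return P, S, H

# Tiny 4-level "histogram" for illustration.
p = [0.25, 0.25, 0.25, 0.25]
P, S, H = build_tables(p)
print(H[0][3])  # interval (0, 3]: w = 0.75, mu = 2.0, so w * mu^2 = 3.0
```

With the $H$ table in hand, $h$ for any candidate threshold set is just the sum of the $H$ entries of its intervals.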

At this point, I think I've explained almost all the key parts of the algorithm, so from here, I'm pretty sure you will be able to understand the rest by looking at the code below.

Online Otsu threshold test

But before finishing this article, I've written the following demo so you can test the Otsu algorithm: try it with the images on your computer and see how it reacts to different histograms.