12 September 2017

A tutorial on Differential Evolution and Otsu's method for image thresholding

The first thing you need to know before your foray into computational intelligence (CI), is that the creators of these algorithms themselves don't know how it works. CI algorithms are inherently analytical techniques that depend on randomness to give solutions that are close to optimal, but not necessarily the best solution. It's like finding a lovely house in a city and choosing it as your home, but you have not actually found out if it is the best house that you could find in the city, because you just did not have the time to explore. The fact that people come up with naming these algorithms after animals (bat algorithm), insects (ant colony, bee colony, glowworm algorithms) and bacteria (bacterial foraging algorithm) is an amusing practice because the lack of theoretical backing makes the creation and optimization of the algorithm more of an art than a science.

Still, the way these algorithms improve over the generations is very impressive. This video shows you how:

This tutorial covers the topic of finding thresholds for a grayscale image using a CI algorithm named Differential Evolution (DE). We check if the thresholds are good enough by using the Otsu criterion. The best references for DE and Otsu criterion are the original papers presented here and here.

What is image thresholding?
Consider Lena's image. When you convert the red, blue and green pixels to gray colours, you end up with colours ranging from 0 to 255. Zero is black. 255 is white and the numbers in between are shades of gray (I'm not talking of Fifty shades of gray). Now if you decide that all pixels below the shade 118 are going to be converted to black and all pixels above or equal to 118 are going to be white, you end up with an image like this:

This is called "image thresholding" or "image segmentation". There are more complex ways of doing image segmentation though. The basic idea is to try and separate the image into distinct segments based on our interest. Sometimes we want to distinguish between foreground and background. Sometimes, to distinguish between parts of objects or multiple objects. If the threshold were closer to 50 or 10, the image would appear too dark. So we have to figure out which threshold will give us an optimum image. An image where the pixels are shaded in such a way that the objects in the image are well distinguished from other objects. An "optimal threshold".

So how did we find this threshold at 118?
We used the "Otsu" technique. A person named Otsu figured out that a simple and nice way to segment images was to first create a histogram of the pixels and then statistically analyze the histogram.
This is what Lena's histogram looks like:

The vertical columns are the count of the pixels of a particular shade of gray. Notice the bottom of the graph showing the shades of gray. At shade 100, the vertical column just crosses 2000. It means that the Lena image has a little more than 2000 pixels which have the gray shade of 100.  These vertical columns are called "bins".

In statistics, there is a concept of "class", which is a group of objects with a similar property. So in the above histogram, we want to find a threshold which would separate the pixels into two classes. Black pixels and white pixels.
Otsu figured out that he could statistically find the best combination of classes (and hence the location of the best threshold), by examining the mean and variances of the classes. He created a formula and called it "finding the between class variance". This between class variance will be used as a value that gives us the "fitness" of the image. Higher the value, better the threshold.
If there are two thresholds, there would be three classes. As shown in the image below, the image would be eventually shown with all pixels below grayscale 32, as white. Anything equal to or above 32 and below 106, as gray colour 32 and any pixel equal to or above 106 as white.

The calculation goes like this:

The gray histogram is first normalized. Each of the 0 to 255 bins of the histogram are iterated as i.

probability of i = number of pixels in bin i / total number of pixels in image
total mean = sum of probability of i of all bins

If there are t thresholds, then there will be t+1 number of classes c the image can be segmented into. So with a threshold of t=1, there would be classes c1 and c2. Class mean values would also be computed as mean1 and mean2. Or, as shown in Figure 2.1, if there are two thresholds t1 and t2, there would be three classes and therefore three mean values calculated.

pc = probability of class occurrence c = sum of probability of i of all bins in c
class mean = sum of (i * probability of i of all bins in c) / class occurrence c

Class variances of each class c are calculated as:

class variance = sum of ((i - class mean)2 * probability of i / probability of class occurrence c)

The variance between the classes gives the fitness of the threshold. 

between class variance = sum for all bins(pc * (class mean - total mean)2

Note that the between class variance (BCV) is one single value. You don't calculate between class variances for each threshold and sum them up. You use the one formula given above to calculate it.

Ok, so now we know how to calculate the fitness of a threshold, we can write a simple for loop that places the threshold at the zero'th position in the histogram and calculates the BCV. Then move the threshold to position 1 of the histogram and find the BCV and keep going on for values 2, 3, 4 ... till 255. The threshold with the highest BCV will be the optimal threshold. For a single threshold of the Lena image, it is at 118.

Now you'd say. Hey, that's it? We just have a search space of 0 to 255? Then why do we need a computational intelligence algorithm? Couldn't we just use normal for loops?

Well, for one threshold you can do an exhaustive search from 0 to 255. For two thresholds, you'd have to run two loops nested. So that's 2552. For four thresholds, the number of iterations are 4294967296. It would take 95 days to compute on Matlab.

Enter the saviour: Computational Intelligence algorithm: Differential Evolution:

The basic idea of DE is to place thresholds randomly within 0 and 255, and keep evaluating the BCV. With some luck, you will get good thresholds which won't be the best, but will be good enough.

The secret weapon of DE is "mutation" and "crossover". Similar to how we have a population of humans and our genes undergo mutations and crossover of traits, the DE algorithm initializes a 'population' of vectors. If we are solving a problem of finding two optimal thresholds, then each vector will contain two random thresholds. During the DE algorithm, these vectors will be combined randomly with other vectors in the population to provide some variety to the result, and somehow, we end up finding near-optimal thresholds. Nobody knows how, but it works. Magic.

If there are 4 thresholds and a population of 3 vectors, the vectors would look like this:
X1 = [t1; t2; t3; t4];
X2 = [t1; t2; t3; t4];
X3 = [t1; t2; t3; t4];
Where, the t1 of X1 does not have to be the same as the t1 in X2 and so on.

This is the pseudocode for my DE algorithm:

  • Load image and convert to 8 bit grayscale
  • Generate histogram from image. Bins ranging from 0 to 255
  • Initialize DE parameters: PopulationSize, thresh, crossover probability (cr), beta.
  • Set thresh number of thresholds each, randomly within bin space for each population entity X.
  • Evaluate initial fitness of X
  • forEach generation for a defined number of iterations
  • forEach population p for PopulationSize
  • Select population entity p from X.
  • Select 3 distinct populations from X. The thresholds should not be equal to p.
  • Generate mutant Up = X1 + beta * (X2 – X3)
  • If Up == Xp, regenerate threshold at random position for Up and continue;
  • Crossover at least one threshold from X to U and other thresholds if generated rnd <= cr
  • If mutant thresholds are outside histogram range, regenerate threshold at random position
  • end for population
  • Evaluate fitness of Up
  • forEach population p for PopulationSize
  • If fittest mutants in Up are fitter than fittest Xp, replace Xp with fittest Up
  • end for population
  • decrease beta by 40-1
  • end for generation

Beta value: Ranges from 0 to 2. A high beta value ensures that the mixing of vector values have a much larger variation, so the mutated vector will end up far from the existing vectors. This is called "exploration", and helps visiting many other threshold positions within the 0 to 255 values. Smaller values of beta keep the variations of threshold values closer to the existing thresholds. This is called "exploitation", as you have found a good threshold already and are trying to exploit that area to find better thresholds in the area. Beta is also called the constant of differentiation.
The crossover probability: Ranges from 0 to 1. Having a low crossover probability like 0.2 was found to help reach an optimal threshold faster, but it does not necessarily reach the most optimal threshold. A higher value like 0.8 helps reach a more optimal threshold, but it can take three times more number of generations to do so. Setting cr=0.3 appeared to help reach an optimal threshold with a slightly higher number of generations than cr=0.2. Allows exploitation of symmetry and separability of a function.

That's all there is to it. You can have multiple thresholds and the time of execution is just around a minute or two in Matlab even for 6 or eight thresholds. This is a major advantage of CI algorithms. You could achieve the same results with genetic algorithms, particle swarm optimization, bee colony algorithm or bacterial foraging algorithms.

The code for the DE algorithm and Otsu method, I've listed here: https://github.com/nav9/differentialEvolution