Adaptive Binarisation

The previous post covered the basics of binarisation, and introduced the Otsu algorithm, a good method for finding a global threshold number for a page. But there are inevitable limitations with using a global threshold for binarisation. Better would be to use a threshold that is adapted over different regions of the page, so that as the conditions of the page change so can the threshold. This technique is called adaptive binarisation.

For each pixel of an image, adaptive binarisation considers the pixels around it to determine a good threshold. This means that even in an area which is heavily shaded, for example near the spine of a book, the text will be correctly differentiated from the background, as even though they may both be darker than the text in the rest of the page, it is the darkness relative to its surroundings that matters.

A popular algorithm for this adaptive binarisation technique was described in a 2000 paper by J. Sauvola Adaptive document image binarization. The key part of the paper which is still used today is his modification of an earlier algorithm for adaptive binarisation by Niblack (1986), to add standard deviation to the calculation of a local threshold for an area. This improves the algorithm significantly, enabling it to take better account of variations in particularly dark or light areas of an image, such as in the gutter or a particularly stained area.

There are two parameters that need to be set for the Sauvola algorithm, a general threshold value called k, and the window size. The threshold value works much the same way as in other binarisation algorithms, essentially setting how aggressive the algorithm should be in removing dark areas, at the expense of potentially removing some parts of text. It can be between 0 and 1, with a higher number resulting in more aggressive binarisation. The window size is the number of pixels around each pixel which are taken into account by the algorithm to determine whether to set the pixel to black or white. A larger window size results in less local adjustment for different areas of a page, but too low and a good deal of noise can be incorrectly detected as text.

A major disadvantage of adaptive binarisation methods is that they tend to be much slower than global threshold binarisation, as a new threshold must be calculated for every pixel. In the case of the Sauvola algorithm, this requires calculating the mean and standard deviation of every pixel in the window around each pixel, for each pixel, which adds up to a lot of calculations.

Fortunately, however, we can use some clever mathematics to dramatically reduce the amount of comuptation that is necessary for adaptive binarisation. The key is using Integral Images, also known as Summed Area Tables. These work by calculating the sum of each pixel above and to the left of each pixel, which can then be used to calculate the mean and standard deviation of any area in an image by a very simple calculation of adding two numbers together and subtracting another two numbers from the Integral Image - a process that is many times faster than calculating the mean and standard deviation from scratch for each pixel. The Integral Image essentially does all of the calculation once for the whole image. This technique is well described in the paper Efficient implementation of local adaptive thresholding techniques using integral images by Shafait, Keysers and Breuel.

Our Tools

There are surprisingly few free software tools to do binarisation with the Sauvola algorithm, and none that I could find that can use Integral Images to speed the processing up. When you’re just binarising a few images, maybe that isn’t so important, but when you’re operating on a scale of many thousands of images, potentially processing each one multiple times with different threshold values, cutting the processing time from 60 seconds to 0.1 seconds makes a very big difference.

So, we wrote a Go package which does Sauvola binarisation, which contains both standalone command line tools and can be used as a package in your own Go projects. The same package contains some other image preprocessing functionality, so the package is called rescribe.xyz/preproc (docs). The command line binarisation tool in the package is binarize, and the relevant Go functions are Sauvola() and IntegralSauvola().

The Integral Image support is provided by another package we wrote, rescribe.xyz/integral (docs), which will also be useful for other image processing tools which require calculating the mean or standard deviation for different areas of an image. It includes tests and examples, so should be easy to pick up and use for your own projects.