Desktop Tool

Our pipeline works well for efficiently OCRing a whole corpus on cloud servers, but it was hard to get its features on your own computer. So we recently spent a bit of time creating a new tool designed to run self-contained on a desktop computer. We’re calling the tool rescribe, because why not? At the moment it’s a command-line-only tool. We recently recorded a 5 minute lightning talk about the tool, if you’d like to learn a little more and see it in action before trying it yourself.

Adaptive Binarisation

The previous post covered the basics of binarisation, and introduced the Otsu algorithm, a good method for finding a single global threshold for a page. But there are inevitable limitations to using one global threshold for binarisation. It would be better to use a threshold that adapts to different regions of the page, so that as conditions change across the page, so can the threshold. This technique is called adaptive binarisation.

A tour of our tools

All of the code that powers our OCR work is released as open source, freely available for anyone to take, use, and build on as they see fit. This is a brief tour of some of the software we have written to help with our mission of creating machine transcriptions of early printed books. Almost all of our development these days is in the Go language, because we love it.

An Introduction to Binarisation

Binarisation is the process of turning a colour or grayscale image into a black and white image. It’s called binarisation because once you’re done, each pixel is either white (0) or black (1), a binary choice. Binarisation is necessary for many types of image analysis, as it makes a range of image manipulation tasks much more straightforward. OCR is one such process, and all major OCR engines today work on binarised images.

About us

Rescribe is a not-for-profit company focused on improving the state of OCR and related technologies for historical books and documents. Free and open source software is key to the work we do, and we release all the code and training data we create and use on our git server (also mirrored on GitHub). We work with a variety of academic and archival projects to make historical works more accessible, searchable and discoverable, and to enable researchers to work with them and find new connections.