Kickstarter campaign

Today we have launched something entirely new for us: a Kickstarter campaign! We’re hoping to raise funds to further improve the Rescribe tool by giving it a proper graphical interface, so that it really can be used by anyone. Take a look, and if you’re in a generous mood, please pledge. We have some nice little rewards there too, including discounted OCR services, if that tickles your fancy.

Turning OCR output into great PDFs

Recently we have been putting some effort into improving the PDF output from our tools, and those improvements have all made it into the latest release of rescribe (v0.5.1). While they may seem simple, PDFs are a surprisingly complex and sometimes tricky file format to produce correctly, so here we’ll run through some of the ways we get really good PDFs out of our pipeline, and exactly what “good” means in this context.

Major new release of Rescribe Desktop Tool

We’ve just released a major new version of our desktop OCR tool called Rescribe, which we first described here on the blog last year. The new version is much easier to use, as it includes the OCR engine software and our latest and greatest OCR models, so you can just download it and start using it without the need to install Tesseract and OCR models separately. You can get it from https://rescribe.

Desktop Tool

While our pipeline works well for efficiently running OCR on a whole corpus using cloud servers, it was hard to get the features of the pipeline on your own computer. So we recently spent a bit of time creating a new tool which is designed to run self-contained on a desktop computer. We’re calling the tool rescribe, because why not? At the moment it’s a command-line-only tool. We recently recorded a 5-minute lightning talk about the tool, if you’re interested to learn a little more and see it in action before trying it yourself.

Adaptive Binarisation

The previous post covered the basics of binarisation, and introduced the Otsu algorithm, a good method for finding a global threshold number for a page. But there are inevitable limitations with using a global threshold for binarisation. Better would be to use a threshold that is adapted over different regions of the page, so that as the conditions of the page change so can the threshold. This technique is called adaptive binarisation.

A tour of our tools

All of the code that powers our OCR work is released as open source, freely available for anyone to take, use, and build on as they see fit. This is a brief tour of some of the software we have written to help with our mission of creating machine transcriptions of early printed books. Almost all of our development these days is in the Go language, because we love it. That means that all of the tools we’ll discuss should work fine on any platform: Linux, Mac, Windows, or anything else supported by the Go tooling.

An Introduction to Binarisation

Binarisation is the process of turning a colour or grayscale image into a black and white image. It’s called binarisation because once you’re done, each pixel will be either white (0) or black (1), a binary option. Binarisation is necessary for many types of image analysis, as it makes a range of image manipulation tasks much more straightforward. OCR is one such process, and all major OCR engines today work on binarised images.

About us

Rescribe is a not-for-profit company focused on improving the state of OCR and related technologies for historical books and documents. Free and open source software is key to the work we do, and we release all the code and training data we create and use on our git server (also mirrored on GitHub). We work with a variety of academic and archival projects to make historical works more accessible, searchable and discoverable, and to enable researchers to work with them and find new connections.