A tour of our tools

Mon, Jun 15, 2020

All of the code that powers our OCR work is released as open source, freely available for anyone to take, use, and build on as they see fit. This is a brief tour of some of the software we have written to help with our mission of creating machine transcriptions of early printed books.

Almost all of our development these days is in the Go language, because we love it. That means that all of the tools we’ll discuss should work fine on any platform: Linux, Mac, Windows, or anything else supported by the Go tooling. We rely on the Tesseract OCR engine with specially crafted training sets for our OCR, but in reality a lot of things have to happen before and after the recognition step to ensure high quality output for difficult cases like historical printed works. These pre- and post-processing steps, as well as the automatic managing and combining of them into a fast and reliable pipeline, have been the focus of much of our development work, and are the focus of this post.

The tools are split across several different Go packages, which each contain some tools in the cmd/ directory, and some shared library functions. They are all thoroughly documented, and the documentation can be read online at pkg.go.dev.

bookpipeline (docs)

The central package behind our work is bookpipeline, which is also the name of the main command in our armoury. The bookpipeline command brings together most of the preprocessing, OCR and postprocessing tools we have and ties them to the cloud computing infrastructure we use, so that book processing can be easily scaled to run simultaneously on as many servers as we need, with strong fault tolerance, redundancy, and all that fun stuff. It does this by organising the different tasks into queues; the pipeline checks these queues and carries out the tasks in an appropriate order.
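To make the queue idea concrete, here is a minimal in-memory sketch. It is not how bookpipeline is implemented: the stage names, the Queues type and the rule of checking later stages first are all assumptions made for illustration, and the real pipeline uses cloud queues rather than Go slices.

```go
package main

import "fmt"

// stages lists hypothetical stage names, in processing order.
var stages = []string{"preprocess", "ocr", "postprocess"}

// Queues maps a stage name to the books waiting at that stage.
type Queues map[string][]string

// Next checks the queues, later stages first, so that books already in
// progress are finished before new ones are started. It reports
// ok == false when every queue is empty.
func (q Queues) Next() (stage, book string, ok bool) {
	for i := len(stages) - 1; i >= 0; i-- {
		s := stages[i]
		if len(q[s]) > 0 {
			book = q[s][0]
			q[s] = q[s][1:]
			return s, book, true
		}
	}
	return "", "", false
}

// Advance adds a book to the queue for the stage after the given one.
// A book leaving the final stage is simply finished.
func (q Queues) Advance(stage, book string) {
	for i, s := range stages {
		if s == stage && i+1 < len(stages) {
			next := stages[i+1]
			q[next] = append(q[next], book)
		}
	}
}

func main() {
	q := Queues{"preprocess": {"book-a", "book-b"}}
	for {
		stage, book, ok := q.Next()
		if !ok {
			break
		}
		fmt.Printf("%s: %s\n", stage, book)
		q.Advance(stage, book)
	}
}
```

Checking the later queues first is one simple way to get "an appropriate order": a book that has been preprocessed is taken through OCR before the next book is picked up.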

If you want to run the full pipeline yourself, you’ll have to set the appropriate account details in cloud.go in the package, or you can just try the ‘local’ connection type to get a reduced-functionality version which just runs locally.

As with all of our commands, you can run bookpipeline with the -h flag to get an overview of how to use it, like this: bookpipeline -h

There are several other commands in the package that interact with the pipeline, such as booktopipeline, which uploads a book to the pipeline, and lspipeline, which lists important status information about how the pipeline is getting along.

There are also several commands which are useful outside of the pipeline environment: confgraph, pagegraph and pdfbook. confgraph and pagegraph create graphs showing the OCR confidence of different parts of a book or of an individual page, given hOCR input. pdfbook creates a searchable PDF from a directory of hOCR and image files. There are several tools online that could do this, but our pdfbook has several great features they lack: it can smartly reduce the size and quality of pages while maintaining the correct DPI and OCR coordinates, and it uses ‘strokeless’ text for the invisible text layer, which works reliably with all PDF readers.

preproc (docs)

preproc is a package of image preprocessing tools which we use to prepare page images for OCR. They are designed to be very fast, and to work well even in the common (for us) case of weird and dirty pages which have been scanned badly. Many of the operations take advantage of our integral (docs) package, which uses clever mathematics to make the image operations very fast.
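The name of the integral package points to integral images, also known as summed-area tables. The sketch below shows that general technique rather than the package's actual API: the Integral type, NewIntegral and Sum are names made up for this example. Once the table is built, the sum of any rectangle of pixels costs four lookups, whatever the rectangle's size, which is what makes window-based image operations fast.

```go
package main

import "fmt"

// Integral is a summed-area table with one extra row and column of
// zeros, so that Sum needs no special cases at the image edges.
type Integral [][]uint64

// NewIntegral builds the table from grey pixel values indexed [y][x].
// All rows of px are assumed to have the same length.
func NewIntegral(px [][]uint8) Integral {
	h := len(px)
	w := 0
	if h > 0 {
		w = len(px[0])
	}
	in := make(Integral, h+1)
	in[0] = make([]uint64, w+1)
	for y := 0; y < h; y++ {
		in[y+1] = make([]uint64, w+1)
		for x := 0; x < w; x++ {
			// This pixel, plus everything above, plus everything to the
			// left, minus the overlap that was counted twice.
			in[y+1][x+1] = uint64(px[y][x]) + in[y][x+1] + in[y+1][x] - in[y][x]
		}
	}
	return in
}

// Sum returns the total of the pixels with x0 <= x < x1 and y0 <= y < y1.
func (in Integral) Sum(x0, y0, x1, y1 int) uint64 {
	return in[y1][x1] + in[y0][x0] - in[y0][x1] - in[y1][x0]
}

func main() {
	px := [][]uint8{
		{1, 2, 3},
		{4, 5, 6},
		{7, 8, 9},
	}
	in := NewIntegral(px)
	fmt.Println(in.Sum(0, 0, 3, 3)) // whole image
	fmt.Println(in.Sum(1, 1, 3, 3)) // bottom right 2x2 window
}
```

Dividing such a sum by the window area gives a local mean, which is the kind of quantity that adaptive binarisation methods need at every pixel.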

There are two main commands (plus a number of exported functions to use in your own Go projects) in the preproc package, binarize and wipe, as well as a command that combines the two processes together, preprocess. The binarize tool binarises an image; that is, takes a colour or grey image and makes it black and white. This sounds simple, but as our binarisation posts here have described, doing it well takes a lot of work, and can make a massive difference to OCR quality. The wipe tool detects a content area in the page (where the text is), and removes everything outside of it. This is important to avoid noise in the margins from negatively affecting the final OCR result.

utils (docs)

The utils package contains a variety of small utilities and packages that we needed. Probably the most useful for others would be the hocr package (https://rescribe.xyz/utils/pkg/hocr), which parses a hOCR file and provides several handy functions such as calculating the total confidence for a page using word or character level OCR confidences. The hocrtotxt command is also a simple and very useful way to output plain text from a hOCR file.
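To show what a word-level confidence calculation involves, here is a small standalone sketch. It does not use the hocr package or reproduce its API; AvgWordConf is a name invented for this example. It relies on the hOCR convention that each word element carries an x_wconf value in its title attribute, and it pulls those values out with a regular expression instead of parsing the document properly.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// wconf matches the x_wconf property that hOCR stores in the title
// attribute of each word, for example: title='bbox 1 2 3 4; x_wconf 93'
var wconf = regexp.MustCompile(`x_wconf\s+(\d+(?:\.\d+)?)`)

// AvgWordConf returns the mean of all x_wconf values found in hocr,
// and the number of values found. It returns 0, 0 if there are none.
func AvgWordConf(hocr string) (avg float64, n int) {
	var total float64
	for _, m := range wconf.FindAllStringSubmatch(hocr, -1) {
		c, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		total += c
		n++
	}
	if n == 0 {
		return 0, 0
	}
	return total / float64(n), n
}

func main() {
	page := `<span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 90'>early</span>
<span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 70'>bookes</span>`
	avg, n := AvgWordConf(page)
	fmt.Printf("%d words, average confidence %.1f\n", n, avg)
}
```

A regular expression is enough for a quick estimate, but a real parser, as in the hocr package, is the safer choice for anything that needs to know which word or line a confidence belongs to.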

Other useful tools include eeboxmltohocr, which converts the XML available from the Early English Books Online project into hOCR; fonttobytes, which outputs a Go format byte list for a font file, enabling it to be easily included in a Go binary (as used by bookpipeline); dehyphenate, which follows simple rules to dehyphenate a text file; and pare-gt, which splits some of the files in a directory of ground truth (partial page images, each with a corresponding transcription) out into a separate directory. We use pare-gt extensively in preparing new OCR training sets, in particular to extract good OCR testing data.
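To give a flavour of what "simple rules" for dehyphenation can look like, here is a sketch of one such rule: if a line ends with a hyphen, join it to the first word of the next line. This is an illustration only, and the rules the dehyphenate command actually follows may well differ; the Dehyphenate function below is written for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// Dehyphenate joins a word that was split by a hyphen at the end of a
// line with its continuation at the start of the next line. The rest
// of the next line stays where it was.
func Dehyphenate(text string) string {
	lines := strings.Split(text, "\n")
	for i := 0; i+1 < len(lines); i++ {
		cur := strings.TrimRight(lines[i], " \t")
		if !strings.HasSuffix(cur, "-") {
			continue
		}
		next := strings.TrimLeft(lines[i+1], " \t")
		if next == "" {
			continue // nothing to join with, so leave the hyphen alone
		}
		first, rest := next, ""
		if j := strings.IndexByte(next, ' '); j >= 0 {
			first, rest = next[:j], next[j+1:]
		}
		lines[i] = strings.TrimSuffix(cur, "-") + first
		lines[i+1] = rest
	}
	return strings.Join(lines, "\n")
}

func main() {
	text := "machine transcrip-\ntions of early printed books"
	fmt.Println(Dehyphenate(text))
}
```

A rule this simple will also remove the hyphen from a genuinely hyphenated compound that happens to break across lines, which is the sort of case a more careful tool has to consider.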

Summary

We are proud of the tools we’ve written, but there will inevitably be plenty of shortcomings and missing features. We’re very keen for others to try them out and let us know what works well and what doesn’t, so that we can improve them for everyone. While we have written everything to be as robust and correct as possible, there are sure to be plenty of bugs lurking, so please do let us know if anything doesn’t work right or needs to be explained better, if there are features you would find useful, or anything else you want to share. And of course, if you want to share patches fixing or changing things, that would be even better! We hope that by releasing these tools, and explaining how they work and can be used, more of the world’s historical culture can be made more accessible and searchable, and be used and understood in new ways.