We have a fresh new release of our desktop OCR tool Rescribe out now, v1.2.0. This fixes all known bugs, and adds a nice new feature. We also released v1.1.0 last year, very quietly. As with the previous releases, this works on macOS, Windows and Linux, and is designed to be easy to install and use. It’s free and available to download now.
Several annoying little bugs have been fixed, chief among them an issue where a directory containing pages with spaces in their names could fail to process correctly.
Today we’re very happy to announce the v1.0.0 release of the desktop OCR tool Rescribe. It’s free to download now from our Rescribe page, and it will work on Mac OS X, Windows and Linux.
This release brings several major new features, chief among them a graphical interface, support for reading PDFs directly, and an integrated Google Books downloader. There are heaps of smaller changes, too, like removing the requirement to name image files in a particular way, adding the option to disable automatic margin wiping, and adding the option to create the searchable PDF using original size page images, for those who value quality more than disk space in their PDFs.
Great news: our Kickstarter campaign successfully surpassed its funding goal! We’re already hard at work creating the graphical interface for the Rescribe tool, and will be posting updates on this blog and the Kickstarter page to let people know how it’s going.
Many thanks once again to all of the lovely people who have helped us get this far. We’re excited to share and improve our latest tools with you soon!
Today we have launched something entirely new for us: a Kickstarter campaign! We’re hoping to raise funds in order for us to further improve the Rescribe tool, to give it a proper graphical interface, so it really can be used by anyone.
Take a look, and if you’re in a generous mood, please pledge. We have some nice little rewards there too, including discounted OCR services, if that tickles your fancy.
Recently we have been putting some effort into improving the PDF output from our tools, and those improvements have all made it into the latest release of rescribe (v0.5.1). While they may seem simple, PDFs are a surprisingly complex, sometimes tricky file format to produce correctly, so here we’ll run through some of the ways we get really good PDFs out of our pipeline, and exactly what “good” means in this context.
We’ve just released a major new version of our desktop OCR tool called Rescribe, which we first described here on the blog last year.
The new version is much easier to use, as it includes the OCR engine software and our latest and greatest OCR models, so you can just download it and start using it without the need to install Tesseract and OCR models separately.
You can get it from https://rescribe.
While our pipeline works well for efficiently running OCR on a whole corpus using cloud servers, it was hard to get the features of the pipeline on your own computer. So we spent a bit of time recently creating a new tool which is designed to run self-contained on a desktop computer. We’re calling the tool rescribe, because why not? At the moment it’s a command line only tool.
We recently recorded a 5 minute lightning talk about the tool, if you’re interested to learn a little more and see it in action before trying it yourself.
The previous post covered the basics of binarisation, and introduced the Otsu algorithm, a good method for finding a global threshold number for a page. But there are inevitable limitations with using a global threshold for binarisation. Better would be to use a threshold that is adapted over different regions of the page, so that as the conditions of the page change so can the threshold. This technique is called adaptive binarisation.
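To make the idea of a threshold that adapts across the page a little more concrete, here is a minimal sketch in Go. It uses about the simplest adaptive scheme there is, comparing each pixel to the mean brightness of its neighbourhood, which is chosen for brevity and is not necessarily the method our tools use. The names AdaptiveBinarise, localMean and grayFromRows, and the radius and offset parameters, are made up for this illustration.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

// grayFromRows is a hypothetical helper that builds a small grayscale
// image from rows of pixel values, so the example needs no image file.
func grayFromRows(rows [][]uint8) *image.Gray {
	img := image.NewGray(image.Rect(0, 0, len(rows[0]), len(rows)))
	for y, row := range rows {
		for x, v := range row {
			img.SetGray(x, y, color.Gray{Y: v})
		}
	}
	return img
}

// localMean returns the mean brightness of the square window with the
// given radius centred on (x, y), ignoring positions outside the image.
func localMean(img *image.Gray, x, y, radius int) int {
	b := img.Bounds()
	sum, n := 0, 0
	for wy := y - radius; wy <= y+radius; wy++ {
		for wx := x - radius; wx <= x+radius; wx++ {
			if !(image.Point{X: wx, Y: wy}).In(b) {
				continue
			}
			sum += int(img.GrayAt(wx, wy).Y)
			n++
		}
	}
	return sum / n // n is at least 1, as (x, y) itself is in bounds
}

// AdaptiveBinarise marks a pixel black when it is more than offset
// darker than the mean of its neighbourhood, and white otherwise.
// This is a simple local-mean scheme, assumed here for illustration.
func AdaptiveBinarise(img *image.Gray, radius, offset int) *image.Gray {
	b := img.Bounds()
	out := image.NewGray(b)
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			threshold := localMean(img, x, y, radius) - offset
			if int(img.GrayAt(x, y).Y) < threshold {
				out.SetGray(x, y, color.Gray{Y: 0}) // black
			} else {
				out.SetGray(x, y, color.Gray{Y: 255}) // white
			}
		}
	}
	return out
}

func main() {
	// One row of a "page": a bright left half and a shadowed right half,
	// each with a single darker "ink" pixel (120 and 20).
	page := grayFromRows([][]uint8{{200, 200, 120, 200, 200, 100, 100, 20, 100, 100}})
	out := AdaptiveBinarise(page, 1, 40)
	fmt.Println(out.Pix)
}
```

On this toy row, a single global threshold of 128 would turn the whole shadowed half black, whereas the local threshold picks out just the two ink pixels. A real implementation would want a larger window and a faster way of computing the local means than the nested loops used here.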
All of the code that powers our OCR work is released as open source, freely available for anyone to take, use, and build on as they see fit. This is a brief tour of some of the software we have written to help with our mission of creating machine transcriptions of early printed books.
Almost all of our development these days is in the Go language, because we love it. That means that all of the tools we’ll discuss should work fine on any platform: Linux, Mac, Windows or anything else supported by the Go tooling.
Binarisation is the process of turning a colour or grayscale image into a black and white image. It’s called binarisation as once you’re done, each pixel will either be white (0) or black (1), a binary option. Binarisation is necessary for various types of image analysis, as it makes various image manipulation tasks much more straightforward. OCR is one such process, and all major OCR engines today work on binarised images.
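As a rough illustration of that definition, here is a minimal sketch in Go of binarising a grayscale image with one fixed threshold. The names Binarise and grayFromRows are made up for this example, and the threshold of 128 is an arbitrary assumption. The result is stored as an 8-bit grayscale image, so white is 255 and black is 0 here, rather than the 0 and 1 mentioned above.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

// grayFromRows is a hypothetical helper that builds a small grayscale
// image from rows of pixel values, so the example needs no image file.
func grayFromRows(rows [][]uint8) *image.Gray {
	img := image.NewGray(image.Rect(0, 0, len(rows[0]), len(rows)))
	for y, row := range rows {
		for x, v := range row {
			img.SetGray(x, y, color.Gray{Y: v})
		}
	}
	return img
}

// Binarise returns a copy of img in which every pixel darker than
// threshold becomes black and every other pixel becomes white.
func Binarise(img *image.Gray, threshold uint8) *image.Gray {
	b := img.Bounds()
	out := image.NewGray(b)
	for y := b.Min.Y; y < b.Max.Y; y++ {
		for x := b.Min.X; x < b.Max.X; x++ {
			if img.GrayAt(x, y).Y < threshold {
				out.SetGray(x, y, color.Gray{Y: 0}) // black
			} else {
				out.SetGray(x, y, color.Gray{Y: 255}) // white
			}
		}
	}
	return out
}

func main() {
	// A two pixel image: one dark pixel (40) and one light pixel (200).
	out := Binarise(grayFromRows([][]uint8{{40, 200}}), 128)
	fmt.Println(out.Pix) // prints [0 255]
}
```

The interesting part in practice is choosing the threshold, as a fixed number like 128 will not suit every page.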
Rescribe is a not-for-profit company focused on improving the state of OCR and related technologies for historical books and documents. Free and open source software is key to the work we do, and we release all the code and training data we create and use on our git server (also mirrored on github).
We work with a variety of academic and archival projects to make historical works more accessible, searchable and discoverable, and to enable researchers to work with them and find new connections.