Mostly open-source OCR workflow with Tesseract

cpwrites · October 28, 2014, 7:06am

I thought folks might be interested in my new EagleFiler OCR workflow.

I decided to stop using PDFpen after it removed some functionality to coerce users into a paid “upgrade,” and broke an important file in the process. Getting the open-source Tesseract engine to work on its own is pretty complicated, but fortunately someone wrote a Python script that makes it much easier to run Tesseract on multipage PDFs on OS X. All the dependencies can be installed with Homebrew.

https://github.com/virantha/pypdfocr

So:

brew install tesseract
brew install ghostscript
brew install poppler
brew install imagemagick

And then:

pip install pypdfocr
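
If you want to confirm everything landed on your PATH before wiring up the rest, something like the following might help. It is only a sketch: the function name missing_tools is made up for illustration, and the list of command names (tesseract, gs from ghostscript, pdftoppm from poppler, convert from imagemagick) is an assumption about what those Homebrew formulas install, so adjust it for your setup.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every tool that is NOT on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names for the packages installed above.
missing=$(missing_tools tesseract gs pdftoppm convert pypdfocr)

if [ -n "$missing" ]; then
    echo "Not found on PATH:" >&2
    echo "$missing" >&2
else
    echo "All OCR dependencies found."
fi
```

Anything it lists as missing is either not installed or installed somewhere your shell does not look.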

Instead of using PyPDFOCR’s own folder-watch switch, I use Hazel to monitor a folder (~/Downloads/PDF) for new PDFs and trigger the Python script when it finds one, i.e.:

pypdfocr "$1" -f -c ~/pypdfocr-config.yaml

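$1 is the file to be imported. The -f switch turns on filing of the converted PDF, and -c points to the config file. That config file (pypdfocr-config.yaml in the command above) includes a line like this, to move OCR’d PDFs to my EagleFiler “To Import” folder: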

default_folder: "/PATH/TO/EAGLEFILER/LIBRARY/To\ Import\ (Library)/"
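
For anyone who wants a bit more safety in the Hazel step, here is a rough sketch of a wrapper script that could be run in place of the bare command. It assumes Hazel passes the matched file's path as $1, as in the command above. The function names is_pdf and ocr_one and the PYPDFOCR_CONFIG variable are hypothetical, made up for this example and not part of PyPDFOCR or Hazel.

```shell
#!/bin/sh
# Hypothetical wrapper for a Hazel "run shell script" action.
# Assumption: Hazel passes the matched file's path as $1.

# Assumed config location, matching the command in the post;
# PYPDFOCR_CONFIG is a made-up override variable for this sketch.
CONFIG="${PYPDFOCR_CONFIG:-$HOME/pypdfocr-config.yaml}"

# Succeed only if the path is an existing file ending in .pdf (any case).
is_pdf() {
    [ -f "$1" ] || return 1
    case "$1" in
        *.[Pp][Dd][Ff]) return 0 ;;
        *) return 1 ;;
    esac
}

# OCR a single file, using the same switches as the post's command.
ocr_one() {
    if ! is_pdf "$1"; then
        echo "skipping (not a PDF file): $1" >&2
        return 1
    fi
    # Quoting "$1" keeps paths with spaces in one piece.
    pypdfocr "$1" -f -c "$CONFIG"
}

# Only act when a path was actually passed in.
if [ "$#" -gt 0 ]; then
    ocr_one "$1"
fi
```

The main point is the quoting: an unquoted $1 gets split on spaces, so a file like "My Scan.pdf" would reach pypdfocr as two separate arguments.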

It took me a couple of years to arrive at this. Hope others find it useful.

mattm · August 30, 2015, 5:59pm

Great solution

This works really, really well! Thank you.