Paperless Office and PDF+Text

Hi all. I’ve been using EagleFiler for a long time and love it. I’ve got myself an HP sheet-fed scanner now and want to start scanning documents, bills, etc.

I’m using the HP scanner software, then importing the PDF into Acrobat Pro, running OCR on it, and saving it. If I do a Spotlight search for a word in the receipt (say, “Staples”), Spotlight finds it. So does DTPro Office (I just downloaded the demo). But EF doesn’t. I import the PDF and can view it, but when I search, no results are found. I even quit EF and let it re-index on launch, and still nothing.

Am I doing something wrong? The OCR text is definitely there in Acrobat (I can highlight and copy-paste text), and it’s definitely there in DTPro Office (both when DTPro does the OCR itself and when I import an Acrobat-saved PDF+Text), but EF doesn’t seem to be indexing it.

I really don’t want to buy another piece of software just to have to do this (I suppose I could use DTPro (non-Office)) and would like to keep all these documents in EF, where the rest of my stuff is.

Anyone know why this might be the case?

Sorry I can’t help you. But I can reassure you that I have lots of documents scanned on my Canon LIDE35, which generates PDF+Text, which were fully searchable in DevonThink Pro, and are now equally searchable in EagleFiler. Words are not only searchable, but EF highlights the searched word in the PDF documents which it finds.

Yeah, it works with other PDFs I have; searches in those definitely work fine. I don’t know if I’m trying to “rush it” by searching too soon, before EF has finished some other kind of indexing.

Do you scan with some specific Canon software that saves to PDF? Or how are you getting the OCR part done?

Yes, my Canon scanner came with “CanoScan Toolbox”, which has a “Create Searchable PDF” option. It seems to “just work”.

If the document has been OCR’d and has text, you should be able to select/copy the text within EagleFiler right away. EagleFiler uses the same PDF engine as Preview, so if Preview can read the text, EagleFiler can too.

EagleFiler waits about 20 seconds after importing before it tries to index a file. After that, you’ll either see it indexing in the Activity Viewer window (in which case it’s still indexing) or you won’t (in which case it’s already done).

In order to search the contents of files, you must have EagleFiler’s search field set to do an Anywhere search. It’s possible that your search isn’t working due to a damaged index file. In that case, it would help to rebuild the index or import the file into a different library.

Rebuilding the index did the trick. Thanks, Michael. Good stuff… now I just need to automate the workflow. =)

Detecting text?
I have quite a lot of scanned PDFs, and not all of them have been OCRed. Actually not very many of them have. But this is something that I would like to get started on, since it would be quite useful.

However, I have no idea how to tell if any given PDF has actually been OCRed or not, apart from opening it, finding something that looks unique, and trying to search on it (which could still fail if the OCR procedure mangled one of the words).

Is there a way outside of EagleFiler to tell if a PDF has both an image and text associated with it? And, within EagleFiler, would it be possible to have some kind of indicator? (That is, something that indicates that the item has been internally indexed?) It would sure make the particular task I was planning on undertaking easier, though I can’t say how useful the wider audience would find it to be.

In either Preview or EagleFiler, you could simply try drag-selecting some text. If it highlights, there’s text there; otherwise there’s just an image.

Well, EagleFiler will index the document no matter what, because even if it’s just an image, there will still be a filename and PDF metadata to index.

Yep, that works, thanks. It still requires that I manually select each one, try it, and tag them to go back and OCR them later. But this is still a big step ahead of where I was before.

Right, I guess what I meant was something relatively PDF-specific that indicates whether there is text inside in addition to images (if that’s easy to detect when indexing the PDF). But, it’s kind of a specialized feature. In my situation, it would allow me to see which of the 5200 PDFs need to be OCRed without clicking each one manually (and, I believe that it’s actually a relatively small number that don’t have embedded text).

This might actually be a job for some kind of separate AppleScript application having nothing to do with EagleFiler, really. I’ll ponder that possibility.
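
For what it’s worth, here is a rough sketch of what such a standalone checker might look like, written in Python rather than AppleScript. It is only a sketch under some assumptions: it assumes Poppler’s `pdftotext` command-line tool is installed and on the PATH, and the function names and the 20-character threshold are made up for illustration. The idea is to pull out whatever text each PDF contains and flag the files where almost nothing comes back, which are probably image-only scans.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def extract_text_with_pdftotext(pdf_path: Path) -> str:
    """Return a PDF's text via Poppler's `pdftotext` (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=False,  # a damaged PDF just yields little or no text
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): enough letters/digits means the PDF has text."""
    return sum(ch.isalnum() for ch in extracted) >= min_chars


def pdfs_without_text(
    folder: Path,
    extract: Callable[[Path], str] = extract_text_with_pdftotext,
) -> List[Path]:
    """List the PDFs under `folder` whose extracted text fails the heuristic."""
    return sorted(
        p for p in folder.rglob("*.pdf") if not has_text_layer(extract(p))
    )


# Example use (the folder path is hypothetical):
#   for path in pdfs_without_text(Path("~/Scans").expanduser()):
#       print(path)
```

The threshold is there because a scan with no OCR can still produce a few stray characters (page breaks, whitespace), and a badly mangled OCR pass could still slip through, so treat the output as a list of candidates to check rather than a definitive answer.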

DEVONThink lists the “type” of PDF files that also have text included as “PDF+Text”. Just a suggestion.

It’s easy to detect (though too time-consuming to do on the fly), but EagleFiler doesn’t currently have a good way to store and display that type of information and keep it current if the file changes. I’m planning to add AppleScript support for accessing the main text of a record, so this would make it possible to, e.g., write a script that tagged the PDFs with no text.

That should work perfectly, and this is just a “do, sometime” project for me, so I’ll happily wait until this feature appears and then write just such a script.

Actually, I’d forgotten that the text of a record is already accessible via the contents property.