<p><strong>How to run multiple Acrobat OCR batches at once:</strong></p>
I was about to dive into a document review project today and ran into some "technical challenges". Here's how it worked out. Hopefully this will save others some time in the future.
Today's another one of those days ... my morning started out in a room stacked full of boxes of documents that were produced to us. Now, I need to make sense out of these and find those proverbial needles in the haystack. Fortunately, the tedious scanning process was already done for me, and I had PDF copies of the production on discs.
The immediate goal is to go through the documents and look for key words and phrases, with a more in-depth review guided by a roadmap that is developed today. The quickest way to do this is to make the documents searchable. Often, you can get a scanning service provider (or your own scanner/copier) to perform OCR text recognition during the scanning process. In this case, we received the documents as un-processed PDFs, so the recognition had to be done in-house.
The Problem -- Speed
Today's desktop computers often have plenty of horsepower. This particular job was run on a dual-core Mac Mini desktop with 4GB of RAM. However, Adobe Acrobat is single-threaded for OCR. That means you can tell it to batch process multiple documents, but it works through them sequentially, one at a time. When you have thousands, or hundreds of thousands, of documents, this poses a real problem.
I started the batch as normal, by going to Adobe Acrobat Pro's "Document" menu and selecting "Recognize text in multiple files using OCR ...". I added the files from our Windows (SMB) file share and set it to work. After a few hundred pages, it was clear that this approach wasn't going to be finished in any reasonable amount of time. What I needed was concurrency. From my processor usage graph, I could tell that my computer had plenty of spare cycles, but that Acrobat just wasn't using them effectively.
One approach that I've used in the past is simply to split up the job across several different computers. This is fine if you have enough computers, but it's still inefficient and ties up several machines in the office with each one running at less than peak speed.
Step 1: More Acrobat!
This time, I took a different approach. I made a copy of the Acrobat application in the Finder.
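If you'd rather script the duplication than drag things around in the Finder, something like the following should do the same job. It's only a sketch: on a Mac an application is just a folder ending in .app, so copying the folder copies the program. The function name `duplicate_app` and the "copy 1" naming are my own choices, and you'd need to point it at wherever Acrobat actually lives on your machine.

```python
import shutil
from pathlib import Path


def duplicate_app(app_path, copies=1):
    """Copy an application bundle (a directory on macOS) `copies` times.

    Returns the paths of the new bundles, named like "Name copy 1.app".
    The naming scheme is an arbitrary choice for this sketch.
    """
    src = Path(app_path)
    created = []
    for i in range(1, copies + 1):
        dst = src.with_name(f"{src.stem} copy {i}{src.suffix}")
        # symlinks=True keeps symlinks inside the bundle as links
        # instead of following them and copying their targets.
        shutil.copytree(src, dst, symlinks=True)
        created.append(dst)
    return created
```

Called with `copies=1`, this leaves you with the original plus one duplicate, which matches the two-copy setup on the Mac Mini.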
Step 2: Queue up Jobs.
Now, you can launch each copy of Acrobat separately and add documents to each one's queue independently. However, to do this right, you'll need a license for each concurrent copy of Acrobat. I reviewed Adobe's retail license, and it looks like it defines "the Software" in such a way that it's licensed per copy, not per computer. The backup copy provision probably doesn't cover concurrent usage. In my case, I had two retail copies of Acrobat as well as another copy that was bundled in my Adobe Creative Suite package. If you have a site license or other agreement with Adobe, your licensing may be different.
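Deciding which files go into which copy's queue matters more than it sounds: if one copy gets all the big files, the other finishes early and sits idle. Here's a rough sketch of one way to divide the work, using a simple "biggest file goes to the lightest queue" rule. The function name and the use of file size as a stand-in for OCR effort are my assumptions; page counts would work just as well if you have them.

```python
import heapq


def split_into_batches(sizes, n_batches):
    """Split files into n_batches lists with roughly equal total size.

    sizes: dict mapping file name -> size (bytes, pages, whatever you have).
    Greedy rule: take files largest first and give each one to the batch
    with the smallest running total.
    """
    # Heap of (running_total, batch_index); the lightest batch pops first.
    heap = [(0, i) for i in range(n_batches)]
    heapq.heapify(heap)
    batches = [[] for _ in range(n_batches)]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, i = heapq.heappop(heap)
        batches[i].append(name)
        heapq.heappush(heap, (total + size, i))
    return batches
```

Each resulting list is what you'd add to one Acrobat copy's batch dialog.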
Step 3: Doubled Productivity (or more).
Most desktop computers these days are dual-core systems or better. On the Mac Mini, running two copies of Acrobat is the most efficient, with each one loading up a different processor core. If you had a high-end machine with 8 cores (and 8 copies of Acrobat), you could scale up the workload roughly linearly, as long as disk and memory keep up. Another advantage to running multiple copies is that you can get work done in one copy while another copy runs the batch processing. If you're on a Mac, you can also use the built-in Preview.app program to read and review PDF files while Acrobat is running OCR jobs. Sadly, Preview.app doesn't do OCR on its own.
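For a back-of-the-envelope feel for what extra copies buy you, here's a small sketch. It assumes the best case, where each copy gets its own core and throughput scales perfectly, so treat the answer as a floor rather than a promise. The seconds-per-page figure is something you'd have to time yourself; I'm not quoting a real Acrobat number.

```python
import os


def usable_copies(licenses, cores=None):
    """Number of copies worth running: limited by licenses and by cores."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(licenses, cores))


def estimated_hours(total_pages, seconds_per_page, copies):
    """Best-case wall-clock hours, assuming perfect linear scaling."""
    return total_pages * seconds_per_page / copies / 3600
```

For example, 36,000 pages at a hypothetical 2 seconds per page works out to 20 hours with one copy and 10 hours with two.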
Running multiple Acrobat jobs on one computer beats tying up several different systems at once. However, licensing issues can be a pain if you want to run a large number of concurrent jobs. For full-time production, I'd stick with a copy provider or hardware scanner that provides OCR. You can also get standalone software like ABBYY FineReader that specializes in OCR. For small jobs, even Google Docs can now OCR documents. When it comes down to it, these kinds of workarounds shouldn't be necessary. Acrobat is a "professional" product (it says so on the box!). It's inexcusable that Acrobat Pro doesn't run batch jobs like this in parallel. At a minimum, it could parallelize page recognition, even if it attacked documents sequentially. But until Adobe does a bit more to modernize its Acrobat product line, a bit of creativity and an additional tithe to Adobe can still get the job done quickly.
A final caveat: As part of this process, I discovered that Acrobat also has a propensity to crash when it tries to save batch-processed documents to a network SMB server. So, be sure to save to the local disk, then copy back to your file server when done. Reading directly from the server doesn't seem to be an issue. Finally, I'm trying out Amazon's program, so some of the links in this post are affiliate links - http://cmp.ly/5
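The save-locally-then-copy-back step from the caveat above is easy to script once the batches finish. This is a minimal sketch, assuming the OCR output sits in one local folder and the file share is already mounted; both folder paths and the name `copy_back` are placeholders of mine, not anything Acrobat provides.

```python
import shutil
from pathlib import Path


def copy_back(local_dir, share_dir):
    """Copy every PDF under local_dir to share_dir, keeping subfolders.

    Returns the list of destination paths. Raises OSError if a copied
    file's size doesn't match the original, as a cheap sanity check.
    """
    local_dir, share_dir = Path(local_dir), Path(share_dir)
    copied = []
    for src in sorted(local_dir.rglob("*.pdf")):
        dst = share_dir / src.relative_to(local_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        if dst.stat().st_size != src.stat().st_size:
            raise OSError(f"size mismatch after copying {src}")
        copied.append(dst)
    return copied
```

Only delete the local copies after you've confirmed the files on the server open correctly.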