OCR on Demand

Someone gave me a 16-page printed specification document for the format of a data file. I'm too slow and lazy to type all that in manually. I tried a couple of free OCR (optical character recognition) programs, Gocr and SimpleOCR, but the results were absolutely terrible: the error rates were high and the formatting was all wrong. Commercial OCR programs are too expensive for occasional use, and Microsoft removed their OCR program from Office.

So I scanned the pages in as JPEGs and uploaded them to Amazon's S3. I created a job on Mechanical Turk to have each page typed in for 50 cents. I submitted the jobs at night, and when I woke up the next morning they had all been completed. The quality is quite good, though I've found a few missing rows. I'll definitely use Turk for large, tedious jobs in the future.

However, this morning I tried an online OCR service called ocrNow! My one sample document was recognized perfectly, and the result came back as a perfectly formatted Excel document. It costs 2 GBP ($3) for 20 documents, which is 15 cents per document, and the results are delivered in seconds. Unfortunately, the machines have won another round against humans. Stay in school, kids.
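For the curious, here is a back-of-the-envelope cost comparison for the 16 pages, as a minimal Python sketch. The prices are the ones quoted above; the function names are made up for illustration. It assumes one scanned page counts as one document, that the online service only sells whole packs of 20 (an assumption on my part, not something I checked), and it ignores any fee Amazon adds on top of the Turk reward.

```python
import math

PAGES = 16                 # pages in the printed spec
TURK_CENTS_PER_PAGE = 50   # reward offered per page on Mechanical Turk
OCR_PACK_CENTS = 300       # roughly $3 (2 GBP) per pack
OCR_PACK_SIZE = 20         # documents per pack


def turk_cost_cents(pages):
    """Total Turk reward in cents, excluding any Amazon fees."""
    return pages * TURK_CENTS_PER_PAGE


def ocr_cost_cents(pages):
    """Cost in cents, assuming only whole packs can be bought."""
    packs = math.ceil(pages / OCR_PACK_SIZE)
    return packs * OCR_PACK_CENTS


turk = turk_cost_cents(PAGES)
ocr = ocr_cost_cents(PAGES)
print(f"Mechanical Turk: ${turk / 100:.2f}")
print(f"Online OCR:      ${ocr / 100:.2f}")
```

Working in whole cents keeps the arithmetic exact. If unused credits carry over to the next job, the effective online price is the 15 cents per document quoted above rather than the full pack price.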
