Open source OCR sucks

Posted: Wed, 07 Jul 2010 03:58:15 +0300
Author: Делян Кръстев

Tonight I tried text recognition with various open source tools. The input was a set of images packaged as a PDF. The text in the images looked bad, but it was readable.

To summarize my experience:

Too much reading
Too much hassle to convert between various input formats accepted by the tools
Totally unacceptable results
Even segmentation faults in some of the apps

None of the tools came even close to what I expected. Maybe it was my fault, but I cannot spend a day every time I need to do a simple job that I do not even do every month.

In the end I did the job by googling for "Online OCR" and using (guess what?!) http://www.onlineocr.net/ for the first five pages. It had a limit of five pages per hour for unregistered users (and five pages total for registered ones), so I registered and OCRed the sixth and last page.

BTW, just to prove the point that I had not read enough: I later found this site, http://www.free-ocr.com/, which also did the job and uses one of the programs I had tried, Tesseract.
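
For anyone attempting the same job locally, here is a minimal sketch of the PDF-to-text pipeline, not the exact steps tried above. It assumes that `pdftoppm` (from poppler-utils) and a Tesseract build that can read PNG files are both on the PATH. The helper names `pdftoppm_cmd`, `tesseract_cmd` and `ocr_pdf`, the 300 dpi resolution and the `eng` language default are illustrative choices, not anything taken from the tools' documentation for this use case.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_cmd(pdf, prefix, dpi=300):
    # Render every PDF page to a grayscale PNG named <prefix>-<page>.png.
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", str(pdf), str(prefix)]


def tesseract_cmd(image, out_base, lang="eng"):
    # Tesseract appends ".txt" to the output base name by itself.
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pdf(pdf, lang="eng"):
    """Return the recognized text of all pages, joined in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # Step 1: convert the PDF into one image per page.
        subprocess.run(pdftoppm_cmd(pdf, tmp / "page"), check=True)
        texts = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(tmp.glob("page-*.png")):
            out_base = image.with_suffix("")
            # Step 2: run OCR on each page image.
            subprocess.run(tesseract_cmd(image, out_base, lang), check=True)
            texts.append(Path(f"{out_base}.txt").read_text(encoding="utf-8"))
        return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` (a hypothetical file name) would return the text of all pages as one string. Recognition quality on bad-looking scans still depends entirely on Tesseract, so this only removes the format-conversion hassle, not the accuracy problem.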

Posted in dir: /blog/
Tags: linux ocr oss
