Friday, February 17, 2006

Online OCR

I've tried doing some OCR lately, and the results all seem lower quality than I would expect. Maybe I'm thinking too highly of the status quo, but it seems like it should be possible to add more intelligence to OCR tools. For example, it would be nice if they could automatically determine that the black area surrounding a page is not text. Basically, I'd like a top-down approach to scanning. If I can make sense of it, I'd like the computer to make sense of it. Start by straightening the page and defining "areas": maybe the border here, the crease down the middle of a book there, some text over here, and an image there. The user can then confirm that these are correct, or change/redraw areas.

Once the areas are sectioned off, begin the actual OCR. Not just any OCR, but perhaps OWR (object word recognition) or OPR (object phrase recognition). If it looks like a certain letter could be a "c" or an "e", then see which makes sense as far as spelling goes, and if necessary, see which makes sense in terms of word context. This should help eliminate the mindless choices that some OCR software asks the user to make. It's one of those things that I *know* is possible because that kind of spelling and context checking comes standard in office software. Easy to implement... that's another question entirely. I'd also like it to be able to convert documents into rich text (or OASIS formats) instead of just plain text.
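To make the "c" or "e" idea a bit more concrete, here is a rough Python sketch of the spelling step only. Everything in it is made up for illustration: the tiny WORDS list stands in for a real dictionary, and the "slots" input (one string of possible letters per character position) is just my guess at what a recognizer might hand over. It is not how any actual OCR package works.

```python
from itertools import product

# Hypothetical tiny word list; a real tool would load a full dictionary.
WORDS = {"cat", "eat", "the", "code", "peace", "clean"}


def candidates(slots):
    """Expand per-character alternatives into every possible spelling.

    Each slot is a string holding the letters the recognizer thinks
    that position could be, e.g. "ce" for an unclear c/e glyph.
    """
    return ["".join(chars) for chars in product(*slots)]


def resolve(slots, words=WORDS):
    """Return only the candidate spellings that are known words."""
    return [w for w in candidates(slots) if w in words]


# The last glyph is unclear: spelling alone settles it.
print(resolve(["t", "h", "ce"]))

# The first glyph is unclear and both readings are words,
# so this case would have to fall through to word context.
print(resolve(["ce", "a", "t"]))
```

When exactly one spelling survives, the software can pick it without asking. When more than one survives, as with "cat" and "eat", that is where the word-context check (or, failing that, the user) would have to step in.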

Granted, I've only been using cheap/free tools, and higher quality tools are available. However, I'm not willing to spend cash on them. What I would prefer is that when/if I scan documents in, it's free to me, except maybe for some AdSense ads based on the scanned text. This implies that the OCR tool would likely be an online service. That's fine, because a server somewhere else would probably have more resources (in terms of memory, processing power, character sets) to do the job right.

I actually applied to Google to be an OCR engineer there (one of the few positions that allowed for an electrical engineering degree) but was unfortunately turned down. I know they've got a vested interest in scanning with their library/book search project, but I wonder if they might also have something in the works for public-use OCR...

