Print to OCR?
Why would you ever convert an already digital document back to an image? I promise it's not because I'm so fond of OCR; it actually has its purpose.
Language Detection: By converting a document to an image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on the digital file, no robust tool exists to do this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of the text in order to better read the document. Because this mechanism is already built into the engines, if I convert a digital file to an image and OCR it, I can tell you what languages exist in that document. Additionally, font, while a clear indicator of language, will not tell a digital process what the language is unless it is accompanied by the proper language encoding. In OCR there is no need for such an encoding.
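As a rough sketch of that last step, suppose the OCR engine hands back each recognized word together with the language it settled on. The `(word, language)` pairs, the `languages_in_document` function, and the `min_share` cutoff below are all hypothetical, made up for illustration; real engines expose this information in their own formats. Tallying the per-word tags gives the languages present in the document:

```python
from collections import Counter

def languages_in_document(ocr_words, min_share=0.05):
    """Return {language: share of words} for languages at or above min_share.

    ocr_words is an iterable of (word, language_tag) pairs. This shape is an
    assumption for illustration, not the output format of any real engine.
    """
    # Count only words the engine managed to tag with a language.
    counts = Counter(lang for _word, lang in ocr_words if lang)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Drop languages that appear too rarely to be more than noise.
    return {lang: n / total for lang, n in counts.items() if n / total >= min_share}

# Made-up sample data standing in for OCR output.
sample = [
    ("invoice", "en"),
    ("total", "en"),
    ("Rechnung", "de"),
    ("Betrag", "de"),
    ("due", "en"),
]
shares = languages_in_document(sample)
print(shares)
```

The cutoff is there because a single misread word should not add a language to the document's list; the right value would depend on the collection.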
Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third-party tool look identical to the viewer, internally these PDF files are very different. To parse a PDF file digitally with any accuracy, you need a standard format. Without one, you are dealing with variations in the document both visually and structurally. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoices times the number of PDF-generating applications that exist. However, if you OCR the PDF and parse the result instead of parsing it digitally, then you are dealing only with the number of variants that exist in the invoices themselves.
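The arithmetic behind that comparison can be sketched in a few lines. The function name and the counts (200 invoice variants, 15 PDF generators) are invented for illustration only:

```python
def variant_count(invoice_variants, pdf_generators, via_ocr):
    """Number of distinct cases a parser has to handle.

    Digital parsing sees every invoice variant once per PDF generator;
    parsing OCR output sees only the visual variants of the invoices.
    """
    if via_ocr:
        return invoice_variants
    return invoice_variants * pdf_generators

# Hypothetical numbers, chosen only to show the scale of the difference.
digital = variant_count(200, 15, via_ocr=False)
ocr = variant_count(200, 15, via_ocr=True)
print(digital, ocr)  # prints: 3000 200
```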
However crazy it sounds, the two scenarios above are real, and there are many more. I doubt that these problems will always exist, but they make you think twice about crazy-sounding statements such as printing a digital document to an image just so you can OCR it.
Labels: Data Capture, language detection, morphology, normalization, OCR, parsing, pdf, print to image
