OCR me once, good for me; OCR me twice, possibly great for me
The important thing to note about multiple-pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This approach is often called voting, and it does not work because each engine represents character confidence differently: you can end up always picking the less accurate engine simply because it reported more confidence than the more accurate one. Running the same engine multiple times with different settings, on the other hand, is consistent and a good idea.
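To make the confidence problem concrete, here is a minimal Python sketch. The two result lists and their confidence values are invented for illustration and do not come from any real engine; `vote_by_confidence` is a hypothetical helper, not part of any OCR product.

```python
# Hypothetical per-character results: (character, reported confidence).
# All numbers are made up for illustration.
truth = "INVOICE"

# Engine A reads every character correctly but reports cautious scores.
engine_a = [("I", 0.71), ("N", 0.68), ("V", 0.74), ("O", 0.70),
            ("I", 0.66), ("C", 0.73), ("E", 0.69)]

# Engine B makes two mistakes (1 for I, 0 for O) but reports inflated scores.
engine_b = [("1", 0.97), ("N", 0.98), ("V", 0.96), ("0", 0.99),
            ("I", 0.97), ("C", 0.98), ("E", 0.95)]


def vote_by_confidence(a, b):
    """Naive voting: at each position keep the character with the higher score."""
    return "".join(ca if pa >= pb else cb for (ca, pa), (cb, pb) in zip(a, b))


text_a = "".join(ch for ch, _ in engine_a)
voted = vote_by_confidence(engine_a, engine_b)

print(text_a)  # engine A on its own
print(voted)   # result after voting across the two engines
```

Because engine B's scores are always higher, the vote reproduces engine B's output, errors included, even though engine A alone matched the truth.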
One scenario where this is used very successfully is documents that contain both machine-printed and hand-printed text. A first read is done with an OCR engine using settings A, and a second read with the same OCR engine using settings B. An area where both reads produce nothing but garbage text may well be hand print. You can then run ICR (a hand-print engine) on that region to pick up additional information. That makes 3 total passes of recognition, and the results are combined to make the final document.
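The three-pass flow described above might be sketched as follows. This is only an outline under stated assumptions: the page is assumed to be split into regions already, `read_a`, `read_b`, and `read_icr` are hypothetical callables standing in for the OCR engine with settings A, the same engine with settings B, and the ICR engine, and `looks_like_garbage` is a crude heuristic of my own choosing rather than anything an engine provides.

```python
from typing import Callable, Dict


def looks_like_garbage(text: str, min_clean_ratio: float = 0.6) -> bool:
    """Crude heuristic (an assumption): text is garbage when too few
    characters are letters, digits, or whitespace."""
    if not text.strip():
        return True
    clean = sum(ch.isalnum() or ch.isspace() for ch in text)
    return clean / len(text) < min_clean_ratio


def three_pass_read(
    regions: Dict[str, object],
    read_a: Callable[[object], str],    # OCR engine, settings A (hypothetical)
    read_b: Callable[[object], str],    # same OCR engine, settings B (hypothetical)
    read_icr: Callable[[object], str],  # hand-print engine (hypothetical)
) -> Dict[str, str]:
    final = {}
    for region_id, image in regions.items():
        text_a = read_a(image)
        text_b = read_b(image)
        if looks_like_garbage(text_a) and looks_like_garbage(text_b):
            # Both machine-print reads failed: likely hand print, so try ICR.
            final[region_id] = read_icr(image)
        elif looks_like_garbage(text_a):
            final[region_id] = text_b
        else:
            final[region_id] = text_a
    return final


# Usage with stub readers in place of real engines.
regions = {"header": "img1", "note": "img2"}
pass_a = {"img1": "Invoice 1042", "img2": "~^|/#@"}
pass_b = {"img1": "Invoice 1042", "img2": "%%*(&^"}
result = three_pass_read(regions, pass_a.get, pass_b.get, lambda img: "Call Bob")
print(result)
```

In practice the garbage test would more likely be driven by the engine's own confidence values or a dictionary check; the character-ratio test here just keeps the sketch self-contained.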
Ideally you would run the same engine at least 3 times, since the statistical chance of two different settings producing the same error drops drastically and the final output is nearly as good as it is going to get. Some document types lend themselves to multiple-pass recognition more than others. Sometimes it's determined by the environment: for example, an environment with a lot of traditional documents mixed with invoice-like documents would benefit from a full-page read with standard settings on every page plus a full-page read with special document analysis designed for documents with lines and tables.
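As a rough illustration of why three runs of the same engine help, the sketch below merges three reads word by word and keeps whatever most runs agree on. It assumes the runs have already been aligned so that each contains the same number of words, which real output will not always guarantee; `merge_runs` and the sample reads are hypothetical.

```python
from collections import Counter


def merge_runs(runs):
    """Keep, at each word position, the reading most runs agree on.

    Assumes every run has the same number of words (already aligned).
    Positions where no two runs agree are returned for manual review.
    """
    merged, needs_review = [], []
    for position, candidates in enumerate(zip(*runs)):
        word, count = Counter(candidates).most_common(1)[0]
        merged.append(word)
        if count == 1 and len(candidates) > 1:
            needs_review.append(position)
    return merged, needs_review


# Three made-up reads of the same line with different settings.
runs = [
    "Total due 1,250.00".split(),
    "Total duo 1,250.00".split(),
    "Tota1 due 1,250.00".split(),
]
merged, needs_review = merge_runs(runs)
print(" ".join(merged), needs_review)
```

Each run here makes a different mistake, so every error is outvoted by the other two runs. Agreement is counted on the text itself, not on confidence values, which is why the same-engine restriction keeps the comparison fair.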
While multiple-pass OCR slows down the entire process, it's still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.
Labels: Accuracy, full-page ocr, icr, multi-pass ocr, voting
