Tuesday, December 1, 2009

OCR me once good for me, OCR me twice possibly great for me

When accuracy is the primary concern in document recognition, the best technique is multiple passes of the OCR or recognition process. Just as you might have a document manually keyed two or three times, why not have an OCR engine convert it three, four, or five times, each time with different settings?

The important thing to note in multiple-pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This is often called voting, and it does not work because each engine represents character confidence differently, so you might end up always picking the less accurate engine just because it reported more confidence than the more accurate one. Using the same engine multiple times with different settings, on the other hand, is consistent and a good idea.

One scenario where this is used very successfully is documents that contain both machine-printed and hand-printed text. A first read is done with an OCR engine using settings A, and a second read with the same OCR engine using settings B. An area where both reads produce nothing but garbage text may well contain hand print. Now you can run ICR (a hand-print engine) on that region to pick up additional information. That is three total passes of recognition, and the results are combined to make the final document.
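To make the "both reads produce garbage" test concrete, here is a minimal sketch in Python. It assumes each pass reports a confidence between 0 and 1 per region; the `RegionResult` type, the `regions_needing_icr` function, and the 0.5 threshold are all hypothetical names and values chosen for illustration, not part of any particular OCR product.

```python
from dataclasses import dataclass


@dataclass
class RegionResult:
    region_id: str
    text: str
    confidence: float  # assumed scale: 0.0 to 1.0, from the ONE engine used for both passes


def regions_needing_icr(pass_a, pass_b, threshold=0.5):
    """Return ids of regions where BOTH passes fall below the threshold.

    Assumption: a region missing from pass B counts as unreadable (0.0).
    """
    conf_b = {r.region_id: r.confidence for r in pass_b}
    return [
        r.region_id
        for r in pass_a
        if r.confidence < threshold and conf_b.get(r.region_id, 0.0) < threshold
    ]


# Two reads of the same page with the same engine, settings A and settings B.
settings_a = [RegionResult("header", "Invoice", 0.95), RegionResult("notes", "~|x", 0.20)]
settings_b = [RegionResult("header", "Invoice", 0.90), RegionResult("notes", "#r;", 0.30)]
icr_candidates = regions_needing_icr(settings_a, settings_b)
```

Because both passes come from the same engine, their confidence values are on the same scale and can be compared against one threshold, which is exactly what you cannot assume across two different engines.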

At minimum, three runs of the same engine would be ideal, as the statistical chance of different settings producing the same error drops drastically and the final output is nearly as good as it is going to get. Some document types lend themselves to multiple-pass recognition more than others. Sometimes it's determined by the environment: for example, environments that have a lot of traditional documents mixed with invoice-like documents would benefit from a full-page read with standard settings on every page plus a full-page read with special document analysis designed for documents with lines and tables.
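One simple way to combine three runs of the same engine is a per-character majority: an error survives only if at least two settings make the same mistake in the same spot. The sketch below is an assumption about how the reconciliation could be done, not a description of any product's method. The function name `reconcile_passes` is hypothetical, and it assumes the three outputs are already aligned and of equal length, which real OCR output often is not.

```python
from collections import Counter


def reconcile_passes(passes, unknown="?"):
    """Majority-vote each character position across same-engine passes.

    Assumption: every pass yields a string of the same length, already aligned.
    Positions with no strict majority are marked with `unknown` for review.
    """
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be aligned to the same length")
    merged = []
    for chars in zip(*passes):
        best, count = Counter(chars).most_common(1)[0]
        merged.append(best if count > len(passes) / 2 else unknown)
    return "".join(merged)


# Each pass gets a different character wrong; the majority fixes both.
result = reconcile_passes(["lnvoice 1024", "Invoice 1O24", "Invoice 1024"])
```

Marking the undecided positions rather than guessing keeps the places that still need a human look visible.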

While multiple-pass OCR slows down the entire process, it's still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.

posted by Chris Riley at 2 Comments

Wednesday, November 4, 2009

You can read the fine-print

As fonts get smaller, the challenge of reading them with OCR software increases. However, there are some key things organizations should be aware of when reading the fine print.

OCR technology today is capable of reading fonts as small as 8 pt, even 6 pt, very accurately. It used to be that unless you had a 12 pt font you stood no chance. Because of better scan quality and more advanced OCR engines, reading small fonts is no problem if the right approaches are used.

Small fonts are more sensitive to image quality and to degradation of the document. For this reason, source images scanned at 300 DPI or higher are necessary. For normal fonts there is seldom a reason to scan higher than 300 DPI, but for small fonts the goal is to get them to appear more or less the same as regular fonts, so scanning them at 400 to 600 DPI is useful. Additionally, it is very important that the documents are “clean”. A smudge or spill on a document impacts a smaller font many times more than a larger font because the lines are so close together. Once you have good image quality, you can start the conversion.
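The arithmetic behind the 400 to 600 DPI range is easy to check. A point is 1/72 of an inch, so a font's nominal height in pixels is its point size times the DPI, divided by 72. The short Python sketch below uses hypothetical helper names and treats nominal point size as a stand-in for glyph height, which is only an approximation.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def pixel_height(font_pt, dpi):
    """Nominal height in pixels of a font of `font_pt` points scanned at `dpi`."""
    return font_pt * dpi / POINTS_PER_INCH


def dpi_to_match(small_pt, reference_pt=12, reference_dpi=300):
    """DPI at which `small_pt` text has the same pixel height as the reference."""
    return reference_dpi * reference_pt / small_pt


normal = pixel_height(12, 300)   # 12 pt text at 300 DPI
for_8pt = dpi_to_match(8)        # DPI needed for 8 pt text to look the same
for_6pt = dpi_to_match(6)        # DPI needed for 6 pt text to look the same
```

With 12 pt at 300 DPI as the reference, 8 pt text needs 450 DPI and 6 pt text needs 600 DPI to reach the same 50 pixel height.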

The next biggest benefit for small fonts comes from zoning them separately. Zoning is the process of rubber-banding the region where the text exists. When small fonts are grouped in the same zone with normal-sized fonts, the OCR software assumes they should all be the same size, and confidence and accuracy go down. If you zone the small fonts separately, you increase the OCR engine's ability to use experts just for small fonts and increase the accuracy on them.
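If you already know the height of each text line, separating the fine print into its own zone can be automated. Here is a rough Python sketch of the idea; the `split_zones` function, the dictionary layout, and the 25% tolerance are hypothetical, and real zoning tools work from page layout analysis rather than a simple list of lines.

```python
def split_zones(lines, tolerance=0.25):
    """Group consecutive text lines into zones of similar height.

    `lines` is a top-to-bottom list of dicts with a "height" key in pixels.
    A new zone starts when a line's height differs from the first line of
    the current zone by more than `tolerance` (a fraction of that height).
    """
    zones = []
    for line in lines:
        if zones:
            base = zones[-1][0]["height"]
            if abs(line["height"] - base) <= tolerance * base:
                zones[-1].append(line)
                continue
        zones.append([line])
    return zones


page = [
    {"text": "Terms of service", "height": 50},
    {"text": "Your account", "height": 48},
    {"text": "fees may apply", "height": 25},
    {"text": "see reverse side", "height": 24},
]
zones = split_zones(page)
```

Each resulting zone can then be submitted to the OCR engine on its own, so the small print is never averaged in with the body text.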

Next time someone tells you to read the small print, tell them you won't read it; you will scan and OCR it.

posted by Chris Riley at 0 Comments

Monday, September 28, 2009

Set it and forget it OCR

My office is a paper monster: paper comes in and never leaves intact. The scary part is how fast this happens. Paper in hand, I review its contents and assess its value, scan it, and shred it, usually within minutes of its existence. The value of set it and forget it OCR is tremendous, but you have to be comfortable with it.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder and all the resulting compressed and OCR'ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid, as I'm basically relying solely on the power of OCR text and search to retrieve documents when I need them. Most would rather configure their ADF scanner to have a setting or folder for each particular class of documents. Most document scanners these days have as few as 9 and as many as 99 destinations you can program. You can set each destination as its own input folder with its own OCR settings and its own output folder.
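For anyone whose OCR product lacks a built-in watched folder, the same flow can be approximated with a small script. This Python sketch is only an illustration under stated assumptions: `process_inbox` and the "done" folder are hypothetical, and `ocr_to_pdf` is a placeholder you would replace with a call into whatever OCR tool you actually use.

```python
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}  # assumed scan formats


def process_inbox(input_dir, output_dir, done_dir, ocr_to_pdf):
    """Convert every scanned image in input_dir and file the PDF in output_dir.

    `ocr_to_pdf(image_path, pdf_path)` is a placeholder for your OCR product.
    Originals are moved to done_dir so they are not processed twice.
    """
    input_dir, output_dir, done_dir = Path(input_dir), Path(output_dir), Path(done_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    done_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for image in sorted(input_dir.iterdir()):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip subfolders and anything that is not a scan
        pdf = output_dir / (image.stem + ".pdf")
        ocr_to_pdf(image, pdf)
        shutil.move(str(image), str(done_dir / image.name))
        produced.append(pdf)
    return produced
```

Run it from a scheduler or in a loop with a short sleep, once per scanner destination, passing each destination its own input folder, OCR settings, and output folder.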

I know I can do this because I know what settings it takes to get the quality of OCR I need to have at least one usable keyword on each document for search. And after all, I'm an expert in OCR, so to not use it every day would be crazy in its own right. I've yet to be proven wrong: my “File Cabinet” abyss has always given me the information I required at the time I required it, and sometimes new information I did not realize I had.

Now, for you records management folks shaking your heads, I understand your complaint. It should not be about my approach, though, but about what I do with the final paper product. Items that your taxonomy deems a record for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven's sake, if you are not supposed to destroy it, don't!

The purpose of my madness is to touch paper as little as possible and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun, in the set it and forget it OCR technique.

posted by Chris Riley at 0 Comments