Friday, December 18, 2009

What you OCR is what you get

Individuals and companies often use Optical Character Recognition (OCR) to get a digital version of a document that they intend to edit or re-purpose. This is not the most common use of the technology, but it is one that requires specific attention.

In order to convert a document so that it is printable later on, it's important to capture not only the text of the document but also its formatting. This includes layout as well as things such as graphics and font colors. To do this, the OCR product must be able to recognize colors (which requires color scanning), recognize font styles, and, very importantly, recognize document structure.

Engines that support advanced document analysis can do this. Document analysis (DA) is the process that happens before any text is read on a page. It makes sense of a document in order to improve recognition as well as to capture the formatting required for a formatted export. First, document analysis finds the document structure, i.e. columns, tables, text blocks, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together into a result formatted as closely as possible to the input document.
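To make the idea of a formatted export concrete, here is a minimal Python sketch. It assumes document analysis and recognition have already produced blocks and styled text runs; the names `TextRun`, `Block`, and `export_html` are hypothetical and do not come from any particular OCR engine, and HTML is just one convenient export target.

```python
import html
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextRun:
    """A stretch of recognized text sharing one style (hypothetical structure)."""
    text: str
    bold: bool = False
    italic: bool = False
    underlined: bool = False
    color: str = "#000000"  # font color found during document analysis


@dataclass
class Block:
    """A region found by document analysis: 'paragraph' or 'graphic' in this sketch."""
    kind: str
    runs: List[TextRun] = field(default_factory=list)


def render_run(run: TextRun) -> str:
    """Wrap the escaped text in tags for each detected style."""
    out = html.escape(run.text)
    if run.bold:
        out = f"<b>{out}</b>"
    if run.italic:
        out = f"<i>{out}</i>"
    if run.underlined:
        out = f"<u>{out}</u>"
    if run.color != "#000000":
        out = f'<span style="color:{run.color}">{out}</span>'
    return out


def export_html(blocks: List[Block]) -> str:
    """Emit one line per block, keeping style and color with the text."""
    parts = []
    for block in blocks:
        if block.kind == "graphic":
            parts.append("<!-- graphic region -->")  # placeholder, no image data here
        else:
            parts.append("<p>" + "".join(render_run(r) for r in block.runs) + "</p>")
    return "\n".join(parts)


page = [
    Block("paragraph", [TextRun("Total: "), TextRun("$5 & up", bold=True, color="#cc0000")]),
    Block("graphic"),
]
result = export_html(page)
```

A plain text engine would return only "Total: $5 & up"; the bold flag, the red color, and the knowledge that a graphic sits below the paragraph are exactly what is lost without document analysis.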

For those individuals who are concerned about re-purposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document into digital form and nothing more. For these individuals, it's important to find a solution that has good document analysis.

posted by Chris Riley at 0 Comments

Wednesday, November 11, 2009

Dropout, all or none

Color or greyscale dropout is a great tool for increasing the accuracy of extracting data from forms. But bad dropout is far worse than no dropout. Partially dropped-out forms can confuse data capture technology. These forms are commonly called “Zebra” forms: portions of the form have dropout performed correctly, while other portions have the fields now outlined in black. If you have control of the scanning and this is the situation, you are better off turning dropout off or improving its use.

It used to be that the only way to drop out a form was scanner-driven dropout. This approach was limited in the colors that could be removed. Essentially, the scanner would be equipped with colored lamps, usually red. During the scan the lamp would be turned on, canceling out the red in the form. Because of this, it was important that printed forms used a certain type of red. If you have ever had experience with color matching, you know it's quite frustrating, especially because the colors you see on the screen are not usually what is printed. Things have improved: now even scanners are using software dropout, where images initially arrive in color and algorithms then remove pixels within a certain color range from the document. This has the added benefit that some scanners and software packages can drop out any color, and multiple colors at a time. There are even some packages out there that can drop out things like colored lines.
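As a rough illustration of software dropout, the Python sketch below turns any pixel that falls within a color range of a target color to white. It is a simplified stand-in, not how any specific scanner or package does it: the image is a plain list of RGB tuples, and `color_distance`, `drop_out`, and the `tolerance` parameter are names made up for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255
WHITE: Pixel = (255, 255, 255)


def color_distance(a: Pixel, b: Pixel) -> int:
    """Largest per-channel difference between two RGB pixels."""
    return max(abs(x - y) for x, y in zip(a, b))


def drop_out(image: List[List[Pixel]], targets: List[Pixel], tolerance: int) -> List[List[Pixel]]:
    """Return a copy of the image with pixels near any target color set to white."""
    return [
        [WHITE if any(color_distance(p, t) <= tolerance for t in targets) else p
         for p in row]
        for row in image
    ]


form_red = (220, 40, 40)   # assumed color of the printed form lines
ink = (10, 10, 10)         # the filled-in data we want to keep
row = [form_red, ink, (225, 50, 35), WHITE]
cleaned = drop_out([row], [form_red], tolerance=20)
```

Because `targets` is a list, the same pass can remove several form colors at once, which is the advantage over a single colored lamp.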

Dropout with any technology becomes difficult when there are gradations of color on the form because of bad printing, color wear, sun, or other damage. Because the software is looking for a consistent color, it will skip colors that don't match the norm. This is often seen when the first half of a form is dropped out and the second is not because of a color change mid-document. There are tools that allow you to specify a threshold that can assist with this. The threshold can be very low when dealing with documents that have one color plus black text, but on more complex documents a low threshold can lose important data.
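The trade-off can be seen in a small Python sketch. The pixel values are invented for illustration, and `tolerance` is my own parameter, expressed as an allowed color difference so that a larger value matches more loosely; a given tool may define its threshold the other way around.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255
WHITE: Pixel = (255, 255, 255)


def is_near(pixel: Pixel, target: Pixel, tolerance: int) -> bool:
    """True when every channel is within `tolerance` of the target color."""
    return all(abs(p - t) <= tolerance for p, t in zip(pixel, target))


def drop_out_row(row: List[Pixel], target: Pixel, tolerance: int) -> List[Pixel]:
    """Replace pixels close to the target color with white."""
    return [WHITE if is_near(p, target, tolerance) else p for p in row]


printed_red = (220, 40, 40)    # form color as originally printed (assumed)
faded_red = (235, 120, 110)    # same form color after sun damage (assumed)
red_pen = (170, 60, 50)        # data written in reddish ink (assumed)
black_ink = (10, 10, 10)

row = [printed_red, faded_red, red_pen, black_ink]

# Strict matching: the faded half of the form survives, giving a "Zebra" form.
strict = drop_out_row(row, printed_red, tolerance=20)

# Loose matching: the faded form color is removed, but so is the red pen data.
loose = drop_out_row(row, printed_red, tolerance=80)
```

With only one form color and black text, the loose setting is safe because nothing else sits near the form color. Once other colors carry data, the same setting starts to remove them.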

Assuming good form printing, the biggest key to proper dropout is to scan the document as soon as possible, leaving less time for damage to take place. Dropout is a great tool, but if you find that forms are partially dropped out, you are better off, for data capture accuracy, turning dropout off and dealing with the black and white form than keeping it.

posted by Chris Riley at 0 Comments

Monday, September 14, 2009

“No text left behind” - Color's Impact on OCR

OCR technology has come a long way since its creation. On clean, 300 DPI, letter-type documents the technology has arrived, and there is not much room for improvement. But what about the rest of the documents out there? How is OCR improving on them? When comparing that perfect letter document to, say, a not-so-perfect article or newspaper, the big difference is text placement and configuration. One of the keys to getting even better OCR is to improve the ability to identify what is graphics and what is text. Within the text you have to identify columns, paragraphs, sentences, words, and finally characters. Only then can the OCR take a whack at interpreting the text. This is called Document Analysis. Sometimes OCR accuracy is lower not because of the actual read of the text but because the OCR software tries to read things that are not text, or because some of the text in the document is simply ignored because it was never found.

Over the last few years, and moving forward, text identification (Document Analysis) has been one of the areas of greatest improvement. Many of the new products leverage color as one more tool for not leaving any text behind. With color, locating the different parts of a document is easier and more accurate, and thus the overall OCR is more accurate. The most obvious benefit of color is the ability to locate graphics. Sometimes index-level OCR requires that even text within graphics be read to enhance the searchability of a document. With color detection, modern engines are advancing to locate text in pictures and ignore the rest. Very stylized documents pose the greatest challenge to Document Analysis, and color is one of the best tools to attack them. Expect to see similar trends and a continued focus on Document Analysis and the pursuit of no text left behind.
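One simple way to see why color helps locate graphics: a text region usually contains only a couple of colors (ink and background), while a photo contains many. The Python sketch below is a toy heuristic built on that assumption, not the method any real engine uses; `quantize`, `classify_region`, and the cutoff of three colors are all hypothetical choices.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def quantize(pixel: Pixel, step: int = 64) -> Tuple[int, ...]:
    """Bucket each channel so small shade differences count as one color."""
    return tuple(c // step for c in pixel)


def classify_region(pixels: Iterable[Pixel], max_text_colors: int = 3, step: int = 64) -> str:
    """Label a region 'text' if it uses few distinct colors, else 'graphic'."""
    distinct = {quantize(p, step) for p in pixels}
    return "text" if len(distinct) <= max_text_colors else "graphic"


# Mostly background with some dark ink: looks like text.
text_region = [(250, 250, 250)] * 8 + [(10, 10, 10)] * 2

# Many unrelated colors: looks like a picture.
photo_region = [(10, 10, 10), (80, 20, 200), (200, 120, 30), (30, 200, 90), (250, 250, 250)]

labels = (classify_region(text_region), classify_region(photo_region))
```

On a black and white scan both regions collapse to two colors, so this signal is simply not available, which is part of why color scanning gives Document Analysis more to work with.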

posted by Chris Riley at 0 Comments