Friday, November 6, 2009

Invisible characters

Exceptions in OCR and data capture are usually thought of as misrecognized characters only, but in reality several other types of exceptions exist. One of them is called the “high confidence blank”. A “high confidence blank” in OCR or data capture is where the software looked in a particular region for a character but no text was identified or read. In data capture, “high confidence blanks” usually affect an entire field or just its first character. In full-page OCR they are less common, but they can occur sporadically throughout the text of a document or affect the entire text. This type of exception is elusive and hard to detect. If an entire field or block of text is missing where you expect text, it is easy to spot, but one-off missing characters are tough to catch. With full-page OCR, detection is done with spell-check: a word with a missing character will usually be flagged as misspelled. In data capture it is much trickier, and the best thing to do is to take certain steps to avoid “high confidence blanks”.

1.)The first thing you can do to avoid “high confidence blanks” in data capture is to NOT overuse image clean-up. If characters are regenerated or cleaned too much, they look to the OCR engine like a graphic rather than a typographic character and are skipped.
2.)Second, if you have control of the form design, make sure text is not printed close to lines. This is one of the biggest generators of “high confidence blanks”.
3.)If text is close to lines, you should be able to establish a rule in the software stating, for example, that if the first character in a field is more than x pixels away from the border, then one or more characters were most likely missed.
4.)If at all possible, use dictionaries and data types that state the structure of the information that should be present in a field. If a character is missing, the data type will likely be broken.
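Steps 3 and 4 can be sketched as two small checks. This is only an illustration under assumptions: the `FieldResult` structure, its pixel coordinates, the gap limit and the date pattern are all hypothetical, and a real data capture product would expose its own rule settings instead.

```python
import re
from dataclasses import dataclass


@dataclass
class FieldResult:
    """OCR output for one form field (hypothetical structure)."""
    text: str             # recognized text
    field_left: int       # x position of the field's left border, in pixels
    first_char_left: int  # x position of the first recognized character


def leading_gap_suspicious(result, max_gap_px):
    """Step 3: a large gap between the border and the first recognized
    character suggests that a leading character was missed."""
    return (result.first_char_left - result.field_left) > max_gap_px


def breaks_data_type(result, pattern):
    """Step 4: the text must match the structure expected for the field."""
    return re.fullmatch(pattern, result.text) is None


def review_reasons(result, max_gap_px, pattern):
    """Collect the reasons, if any, to send this field to manual review."""
    reasons = []
    if leading_gap_suspicious(result, max_gap_px):
        reasons.append("leading gap")
    if breaks_data_type(result, pattern):
        reasons.append("data type")
    return reasons


# Assumed example: a date field that should read 11/06/2009 but lost
# its first digit, leaving a wide gap after the field border.
DATE_PATTERN = r"\d{2}/\d{2}/\d{4}"
damaged = FieldResult(text="1/06/2009", field_left=10, first_char_left=58)
intact = FieldResult(text="11/06/2009", field_left=10, first_char_left=18)
```

Here `review_reasons(damaged, 20, DATE_PATTERN)` reports both problems, while the intact field passes both checks. The two rules complement each other: the gap rule needs layout information, and the data type rule only needs the text.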

This type of exception leads to hidden downstream problems when organizations don't realize that it can happen. Being aware of it and taking the proper steps to avoid “high confidence blanks” is the solution.
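For full-page OCR, the spell-check detection mentioned earlier can also be sketched. The example below flags words that are not in a word list and then checks whether inserting a single letter would turn them into a known word, which is what a dropped character looks like. The tiny word list and the function names are hypothetical; a real spell-checker would use a full dictionary.

```python
import string

# Hypothetical word list for illustration only.
KNOWN_WORDS = {"invoice", "total", "amount", "due", "date", "the", "is"}


def one_char_insertions(word):
    """Yield every string made by inserting one lowercase letter into word."""
    for i in range(len(word) + 1):
        for ch in string.ascii_lowercase:
            yield word[:i] + ch + word[i:]


def find_possible_dropped_chars(text, known_words=KNOWN_WORDS):
    """Map each unknown word to the known words it would become if one
    missing character were restored."""
    suspects = {}
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if not word or word in known_words:
            continue
        candidates = sorted(
            c for c in set(one_char_insertions(word)) if c in known_words
        )
        if candidates:
            suspects[word] = candidates
    return suspects
```

With this word list, `find_possible_dropped_chars("The nvoice total is due")` returns `{"nvoice": ["invoice"]}`. A missing character that still leaves a valid word would not be caught this way.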

posted by Chris Riley

Friday, September 25, 2009

Don't over clean – the effects of image clean-up on accuracy

There is always some way to modify a scanned image to improve its recognition results if it is not already perfect. But there are also ways to modify an image that destroy recognition results. Not all image clean-up is good for OCR, for several reasons.

There are two types of image clean-up. The first is image clean-up for view-ability. These are the tricks that make images look even prettier on the screen; the goal is what is called pixel perfect, where the image looks like it was electronically generated. The second is image clean-up for OCR or Data Capture. These are the tricks that give the image better recognition results. All image clean-up for OCR and Data Capture is good for view-ability, but not all image clean-up for view-ability is good for recognition. The reason is that the engines were built and trained at a time when many image clean-up technologies were not available, and because recognition technologies interpret pixels, it is possible to remove useful ones.

Here are some tips. When recognition is the primary purpose, stick to the types of clean-up that help it. Some products and scanners even allow what is called “dual stream”, where one scan produces two images that can follow separate paths. If you have this function, use the settings that are best for OCR on one image and the settings that are best for view-ability on the other. Good for OCR is:

1.)Despeckle (unless dot-matrix font)
2.)Line Straightening
3.)Basic Thresholding
4.)Background removal
5.)Correction of Linear Distortion
6.)Dropout
7.)Line Removal (sometimes)
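Two of the items above, basic thresholding and despeckle, are simple enough to sketch in plain Python on a small grayscale grid. This is a minimal illustration, not how any particular scanner or product implements them; the threshold of 128 and the "no ink in the 8 surrounding pixels" speckle rule are assumptions. It also hints at why despeckle is risky for dot-matrix fonts: characters built from separate dots can look like speckles to such a rule.

```python
def basic_threshold(gray, threshold=128):
    """Global threshold: 1 = ink (dark pixel), 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(binary):
    """Clear ink pixels that have no ink among their 8 neighbours."""
    if not binary:
        return []
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            has_neighbour = any(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0  # isolated pixel, treated as a speckle
    return cleaned


# A 3-pixel stroke in the upper left and one isolated dark pixel
# in the lower right corner.
gray = [
    [255, 255, 255, 255, 255],
    [255,  20,  30, 255, 255],
    [255,  25, 255, 255, 255],
    [255, 255, 255, 255,  10],
]
binary = basic_threshold(gray)
cleaned = despeckle(binary)
```

After these two steps the stroke is kept and the isolated pixel in the corner is cleared.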

Bad for OCR is:

1.)Adaptive Thresholding: Often causes a condition called “Fuzzy Characters”, where, for example, a “c” is read as an “e”. With hand-print it often removes portions of characters.
2.)Character Regeneration: Removes critical information that OCR and ICR processes depend on. If you use it for OCR (Machine-Print) you will notice more “high confidence blanks”: the characters are so perfect that they look like images to the OCR engine and are ignored. For ICR (Hand-Print) it damages the hand strokes of the characters, which confuses the ICR algorithms, reduces the ability of training to understand the subject, and ultimately reduces accuracy.
3.)Line Removal: Bad line removal makes bad OCR. Line fragments really interfere with OCR and ICR processes.

When using image clean-up for OCR and Data Capture processes, use only the techniques that improve recognition rates, not the ones that destroy them.
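The “dual stream” tip and the two lists above can be combined into a simple configuration check. The sketch below is an assumption about how such settings might be written down; the step identifiers, the `dual_stream` layout and the `check_ocr_stream` function are hypothetical and do not correspond to any specific product.

```python
# Step names mirror the two lists above; the identifiers are hypothetical.
RECOMMENDED_FOR_OCR = {
    "despeckle",
    "line_straightening",
    "basic_thresholding",
    "background_removal",
    "linear_distortion_correction",
    "dropout",
}
RISKY_FOR_OCR = {"adaptive_thresholding", "character_regeneration"}
CONDITIONAL_FOR_OCR = {"line_removal"}  # appears on both lists


def check_ocr_stream(steps):
    """Return warnings for clean-up steps planned on the OCR stream."""
    warnings = []
    for step in steps:
        if step in RISKY_FOR_OCR:
            warnings.append(f"{step}: listed as bad for OCR")
        elif step in CONDITIONAL_FOR_OCR:
            warnings.append(f"{step}: verify that no line fragments remain")
        elif step not in RECOMMENDED_FOR_OCR:
            warnings.append(f"{step}: not on the recommended list")
    return warnings


# One scan, two outputs: aggressive clean-up for viewing only,
# conservative clean-up for recognition.
dual_stream = {
    "viewing": ["despeckle", "adaptive_thresholding", "character_regeneration"],
    "ocr": ["despeckle", "line_straightening", "basic_thresholding", "line_removal"],
}
ocr_warnings = check_ocr_stream(dual_stream["ocr"])
```

For this configuration the only warning on the OCR stream concerns line removal. Running the same check on the viewing settings would flag adaptive thresholding and character regeneration, which is why those steps stay on the viewing stream.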

posted by Chris Riley
