Invisible characters
1.) The first thing you can do to avoid “high confidence blanks” in data capture is to NOT overuse image clean-up. If characters are regenerated or cleaned too aggressively, the OCR engine may treat them as graphics rather than typographic characters and skip them.
2.) Second, if you have control of the form design, make sure text is not printed close to lines. This is one of the biggest generators of “high confidence blanks.”
3.) If text is close to lines anyway, you should be able to establish a rule in the software stating, for example, that if the first character in a field is more than x pixels away from the border, then one or more characters were most likely missed.
4.) If at all possible, use dictionaries and data types that define the structure of the information that should be present in a field. If a character is missing, the value will likely no longer fit its data type.
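The checks in points 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FieldResult` structure, the function names, the 20-pixel threshold, and the five-digit pattern are all hypothetical and do not come from any particular data capture product, which will expose its own way of defining such rules.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldResult:
    """Hypothetical shape of one recognized field (an assumption, not a real API)."""
    text: str                        # text returned by the OCR engine
    field_left: int                  # x position of the field's left border, in pixels
    first_char_left: Optional[int]   # x position of the first recognized character


def leading_gap_suspicious(result: FieldResult, max_gap_px: int) -> bool:
    """Point 3: flag the field if the first character starts too far from the border."""
    if result.first_char_left is None:
        # Nothing was recognized at all, so this rule cannot say anything.
        return False
    gap = result.first_char_left - result.field_left
    return gap > max_gap_px


def matches_format(text: str, pattern: str) -> bool:
    """Point 4: check that the whole value fits the expected structure."""
    return re.fullmatch(pattern, text) is not None


def needs_review(result: FieldResult, max_gap_px: int, pattern: str) -> bool:
    """Send the field to a human if either rule is broken."""
    return (leading_gap_suspicious(result, max_gap_px)
            or not matches_format(result.text, pattern))


# Example: a five-digit field where the first digit was silently dropped.
# The engine is "confident" about "2345", but both rules catch the problem.
field = FieldResult(text="2345", field_left=100, first_char_left=142)
flagged = needs_review(field, max_gap_px=20, pattern=r"\d{5}")
```

In this example the gap is 42 pixels, which exceeds the assumed 20-pixel limit, and "2345" does not fit a five-digit pattern, so the field would be routed to review instead of passing through as a confident result. The threshold would need to be tuned to the actual form layout and scan resolution.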
This type of exception leads to hidden downstream problems when organizations don't realize that it can happen. Being aware of "high confidence blanks" and taking the proper steps to avoid them is the solution.
Labels: Accuracy, blanks, book OCR, Data Capture, high confidence, Image Clean-up, VRS
