Thursday, December 31, 2009

Measuring Document Automation Efficiency

The two most common questions organizations ask when they are seeking document automation technology are “how fast is it?” and “how accurate is it?”. Many don't realize that the two are in opposition to each other most of the time. The more accurate a system, the slower it is, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is calculated.

Most companies that trial data capture calculate performance on the slowest step, which is optical character recognition (OCR). Companies will literally hit the “read” button and time the process until the read is complete. This is what is considered the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entire entry process, but poor OCR can create an even greater bottleneck. Imagine an OCR engine that reads a document with 100 characters in 1 second, compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine would be better, but consider that the first engine may be 60% accurate, leaving 40 characters to be manually entered, and the other engine 98% accurate, leaving 2 characters to be manually entered or corrected. If you assume an average entry speed of 1.6 characters per second, the 40 characters take an additional 25 seconds to enter, for a total entry time of 26 seconds with the faster engine. With the slower engine it takes an additional 1.25 seconds to enter or edit the 2 wrong characters, for a total entry time of 4.25 seconds. This means that end-to-end the slower engine gives a document automation process roughly 6 times faster than the faster engine.
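The arithmetic above can be sketched in a few lines of Python. The function name and its parameters are illustrative assumptions; the numbers are the ones used in the example.

```python
def end_to_end_seconds(chars, ocr_seconds, accuracy, keying_rate=1.6):
    """Total time = OCR time + time to key the characters OCR got wrong."""
    wrong_chars = round(chars * (1 - accuracy))
    return ocr_seconds + wrong_chars / keying_rate


# The "fast" engine: 1 second per page, 60% accurate.
fast = end_to_end_seconds(100, ocr_seconds=1, accuracy=0.60)
# The "slow" engine: 3 seconds per page, 98% accurate.
slow = end_to_end_seconds(100, ocr_seconds=3, accuracy=0.98)

print(fast, slow, round(fast / slow, 1))
```

Plugging in your own page size, accuracy, and keying rate is a quick way to compare two engines on total entry time rather than on OCR time alone.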

This simple calculation illustrates the folly in assuming that a slower OCR time makes for a slower overall process. Usually focusing on accuracy has the greatest benefit for an organization, unless you are improving the speed of a slower engine with hardware, or two engines are too close to see a benefit.

posted by Chris Riley at 2 Comments

Wednesday, December 30, 2009

The trick of the inverted text

The search for greater accuracy in document automation never stops. It's true that OCR technology has become so advanced that the jumps in accuracy with each new release are not what they were 10 years ago. Now, new versions of OCR engines contain enhancements for low quality documents and vertical document types, but general OCR can't get much better. Because of this, modern integrations need to find new tricks. This blog is full of them, but I'm about to explain just one more: OCRing inverted text.

OCRing inverted text is nothing new. Many document types have regions where white text is printed on a black background. Modern engines have the ability to read this text. Typically it's not as accurate as OCR of black text on a white background, but it has its unique benefits, especially with complex document types such as EOBs and driver's licenses.

There is a trick in using inverted text OCR to increase overall OCR accuracy. The method is to first OCR a document normally, then use imaging technology to invert the image. When you invert the image, the black text on a white background switches to white text on a black background. Once the inversion is done, run OCR again. By comparing the two OCR results, you have essentially voted the same engine against itself with little effort.

Large volume processing environments can deploy this trick without loading a new OCR engine or applying different settings. It's important to note that when using this technique, how you compare the two results is as important as the process itself. Typically you will assign more weight to the original version of the document than to the inverted one. There you have it, one more tool for increasing the OCR accuracy of the engine you already use.
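A minimal sketch of the idea, assuming a grayscale image stored as rows of 0-255 values and OCR output available as aligned (character, confidence) pairs. The function names and the 0.8 weight for the inverted pass are illustrative assumptions, not settings from any particular engine.

```python
def invert(image):
    """Invert a grayscale image given as rows of 0-255 pixel values."""
    return [[255 - pixel for pixel in row] for row in image]


def vote(original, inverted, original_weight=1.0, inverted_weight=0.8):
    """Pick, per character position, the candidate with the higher weighted confidence.

    Each argument is a list of (character, confidence) pairs of equal length.
    The original pass gets more weight than the inverted pass.
    """
    merged = []
    for (ch_o, conf_o), (ch_i, conf_i) in zip(original, inverted):
        if ch_o == ch_i or conf_o * original_weight >= conf_i * inverted_weight:
            merged.append(ch_o)
        else:
            merged.append(ch_i)
    return "".join(merged)


# Hypothetical results of the same engine on the normal and inverted image.
normal_pass = [("I", 0.95), ("n", 0.90), ("v", 0.40), ("0", 0.35)]
inverted_pass = [("I", 0.80), ("n", 0.85), ("v", 0.70), ("o", 0.90)]
print(vote(normal_pass, inverted_pass))
```

Real results rarely align character for character, so a production comparison would need an alignment step before voting.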

posted by Chris Riley at 0 Comments

Wednesday, December 23, 2009

Expectations bite the dust

Just this morning, I was reminded of why market education is so important. I received an email from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that change from variation to variation. To help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy ( field level ), field location accuracy, and amount of time to process per document. This is a step more advanced than a typical user, who will clump all of this into one category. In addition to these, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the documents they are working with are pixel perfect. They expect a 20 page document to be processed in 4 minutes, which is also reasonable and right on the line. Finally, they expect field location to be 100%. RED FLAG!

This is not the first time that there is an assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people seem to be learning more about the technology, this is the next class of common fallacy. And because the organization did not specify template matching accuracy, it means they must also assume templates match 100% of the time to get 100% field location accuracy. Trouble.

It's clear why 100% field accuracy is important for them: basic QA processes are capable of checking only recognition results ( OCR accuracy ), and not the locations of fields. Instead of modifying its QA processes, the organization's first thought was how to eliminate the problems that QA might face. 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason it's not possible is that even in a pixel perfect document, there are situations where a field might be located partially, located in excess, or not located at all. The scenario that most often occurs in pixel perfect documents is that text may sometimes be seen as a graphic because it's so clean, and text that is too close to lines is ignored. So in these types of documents, a field error is usually a partially located field. Most QA systems can be set up so that rules check the data structure of fields, and if the data contained in them is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied to data capture.

After further conversation, it became clear that the data capture solution is being forced to fit into a QA model. There are various reasons why this may happen: license cost, pre-existing QA, or misunderstanding of QA possibilities. This is very common for organizations and very often problematic. Quality assurance is a far more trivial process to implement than data capture. When it comes to data capture, it is more important to focus on the functionality of the data capture system and develop a QA process that makes its output most efficient.

Again, a case of expectations and assumptions.

posted by Chris Riley at 0 Comments

Thursday, December 17, 2009

Check-mark accuracy, all or none

Check-mark processing ( OMR ) is one of the most accurate recognition technologies. Companies who properly utilize OMR are able to process documents quickly and accurately. But for the same reason OMR is accurate, it can also be very inaccurate, when not used properly.

For the most part, OMR is an all or nothing technology. Unlike the varying degrees of accuracy and uncertainty in OCR, with OMR a field is checked or not. Where accuracy and uncertainty come into play is when you deal with collections of check-marks, where the technology will compare the results of all of them to see which ones are most likely checked. The three areas where organizations make mistakes when using OMR are: improper OMR type, poor thresholds, and bad rules.

Many think of OMR fields as the traditional bubble on school tests. But there are several types of OMR fields: Rectangle, Round, Automatic, and White Field. Unlike text recognition, the wrong field type selection in OMR results in 100% incorrect results, most of the time.

Rectangle and round are the traditional fields that come to mind when thinking of check-marks. The technology used to process these also includes a way to tell if a field has been corrected ( slashed out, and answer changed ). For these fields, the borders of the field are detected, and when a high enough amount of black pixels is found within the border, the field is considered checked. The only time this will not be the case is when a field has been detected as having a correction.

Automatic field types are for those forms that have non-traditional border types for their fields, OR have some sort of text already existing in the field. For example, if you scan a Scantron form as a black and white image without dropout, you will get for each field a round circle with some letter or number printed in the middle. In this case you would have to use the automatic field type. What happens is that the software compares an EMPTY form to the form being processed. If, for example, a field has the letter “A” printed in the middle, the software will count how many pixels in the field the A consists of and use that as a baseline. For a field to be checked, it will have to contain some number of black pixels OVER the baseline. If in this case you used a rectangle or round check-mark type field, all fields would be considered checked because no baseline was established. Finally, there are white fields.

White fields are check-mark fields that have no border. They are most often found on forms scanned with dropout, or sometimes used for unique and cool cases such as detecting signatures. These are a useful type of check-mark that simply expects there to be no border and no printed text in the field area. If there is a small amount of black pixels in the field area, it's considered checked. If you use a white field on a rectangle OMR field it will always be considered checked because of the borders. The biggest challenge for white fields is that the size of the field directly impacts its accuracy, so proper sizes must be chosen. All check-marks have thresholds assigned to them.

A threshold is the setting that determines the amount of pixels ( as a percentage ) required before a field is considered checked. Organizations almost never need to change the default thresholds, and changing them is one of the biggest mistakes that is made. Most OMR processing packages have default thresholds for all field types. These vendors have done the research to know the optimum field threshold for both accuracy and avoiding false positives. When companies pick the wrong threshold, fields are considered checked when they are not, and the other way around. The problem is that most of these errors are never reviewed, because with custom thresholds they never get flagged. The result is a false positive, the worst possible outcome of any exception.
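The interaction between field type, baseline, and threshold can be sketched as follows. This is a simplified illustration, not how any specific OMR package works; the 20% threshold and the pixel counts are made-up values.

```python
def is_checked(black_pixels, total_pixels, threshold=0.20, baseline_pixels=0):
    """Return True when black pixels above the baseline exceed the threshold fraction.

    baseline_pixels models the "automatic" field type: the pixels that the
    empty form already contains (for example a printed letter in a bubble).
    """
    fill = (black_pixels - baseline_pixels) / total_pixels
    return fill > threshold


# An empty bubble of 400 pixels whose printed "A" accounts for 90 black pixels.
print(is_checked(90, 400))                        # no baseline: read as checked
print(is_checked(90, 400, baseline_pixels=90))    # with baseline: not checked
print(is_checked(250, 400, baseline_pixels=90))   # filled in by the respondent
```

The first call shows why the wrong field type fails on every field: without a baseline, the printed letter alone pushes an empty bubble over the threshold.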

As with all data capture and forms processing tools, there is usually a step of validation and rules. For whatever reason, organizations tend to over-think the rules associated with check-marks. The most common rule is that for any given collection of check-marks associated with a single question, only one, or a particular combination, can be checked. So for example, for a multiple choice question that asks for one answer, if the software sees two checked it will flag both fields. These rules are very useful, but when improperly implemented they result in either too much verification of fields, which is OK but a time waster, or false positives like those caused by bad thresholds. Sometimes the rules are applied during recognition and thus affect recognition results. For example, a question that has no answer, but where one is expected, is forced to have an answer. It's easy to blame the software, but most of the time it's just a bad rule.
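A single-answer rule of the kind described above might look like this sketch. The function name and the dictionary shape are assumptions for illustration; the point is that the rule flags fields for review and never invents an answer.

```python
def review_flags(checked, min_answers=1, max_answers=1):
    """Return the option labels an operator should review for one question.

    checked maps each option label to True (marked) or False (not marked).
    """
    marked = [label for label, is_marked in checked.items() if is_marked]
    if len(marked) > max_answers:
        return marked            # too many marks: flag each marked field
    if len(marked) < min_answers:
        return list(checked)     # no answer: flag the question, do not guess
    return []                    # rule satisfied, nothing to review


print(review_flags({"A": True, "B": True, "C": False}))
print(review_flags({"A": False, "B": False, "C": False}))
print(review_flags({"A": False, "B": True, "C": False}))
```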

OMR is a great tool when used right because it's extremely fast and accurate, but when it's used wrong, it's still fast but just extremely inaccurate.

posted by Chris Riley at 0 Comments

Tuesday, December 1, 2009

OCR me once good for me, OCR me twice possibly great for me

When accuracy is the primary concern in document recognition the best technique is multiple passes of the OCR or recognition process. Similar to how you would have a document manually entered two to three times why not have an OCR engine convert it 3, 4, 5 times all with different settings?

The important thing to note in multiple pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This is often called voting and does not work because of the fact that each engine represents the confidence of characters differently, so you might end up always picking one engine that is less accurate just because it told you it was more confident than the more accurate engine. But using the same engine multiple times with different settings is consistent and a good idea.

An example of a scenario where this is being used very successfully is documents that have both machine and hand-printed text. A first read can be done with an OCR engine using settings A, and a second read with the same OCR engine using settings B. Areas where both reads produce just garbage text might indicate hand print. Now you can use ICR ( hand-print engine ) in that region to pick up additional information. That is 3 total passes of recognition. The results are combined to make the final document.

At minimum, 3 runs of the same engine would be ideal, as the statistical chance of two different settings producing the same error reduces drastically and the final output is nearly as good as it's going to be. Some document types lend themselves to multiple pass recognition more than others. Sometimes it's determined by environment. For example, environments that have a lot of traditional documents mixed with invoice-like documents would benefit from having a full-page read with standard settings on every page and a full-page read with special document analysis designed for documents with lines and tables.
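One simple way to reconcile three passes of the same engine is a per-character majority vote, sketched below. It assumes the passes are already aligned and of equal length, which real output often is not; the sample strings are invented.

```python
from collections import Counter


def majority_vote(passes):
    """Return (text, uncertain_positions) from equal-length strings, one per pass."""
    text, uncertain = [], []
    for position, candidates in enumerate(zip(*passes)):
        winner, votes = Counter(candidates).most_common(1)[0]
        text.append(winner)
        if votes <= len(passes) / 2:
            # No candidate has a majority: leave this position for an operator.
            uncertain.append(position)
    return "".join(text), uncertain


# Three hypothetical passes of the same engine with different settings.
passes = ["lnvoice 1042", "Invoice 1O42", "Invoice 1042"]
print(majority_vote(passes))
```

Because each setting tends to make different mistakes, an error has to occur in two of the three passes at the same position before it survives the vote.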

While multiple pass OCR slows down the entire process it's still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.

posted by Chris Riley at 2 Comments

Tuesday, November 17, 2009

Data Capture – Problem Fields

The difference between easy data capture projects and more complex ones often has to do with the type of data being collected. For both hand-print and machine print forms, certain fields are easy to capture while others pose challenges. This post discusses those “problem fields” and how to address them.

In general, fields that are not easily constrained and don't have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, etc. Then there are the middle ground fields, such as dollar amounts and invoice numbers. The problem fields are addresses, proper names, and items.

Address fields are, for most people, surprisingly complex. Many would like to believe that address fields are easy. The only way to capture address fields very easily would be to have, for example in the US, the entire USPS database of addresses that they themselves use in their data capture. It is possible to buy this database. If you don't have this database, the key to addresses is less constraint. Many think that you should specify a data type for address fields that starts with numbers and ends with text. While this might be great for 60% of the addresses out there, by doing so you reduce the accuracy of every exception address to 0%. It's best to let the engine read what it's going to read and only support it with an existing database of addresses if you have one.

Proper names are next in complexity after addresses. Proper names can be a person's name or a company name. It is possible to constrain the number of characters and for the most part eliminate numbers, but the structure of many names makes recognizing them complex. If you have an existing database of the names that would appear on the form, you will excel at this field. As with addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can be either a breeze or very difficult, and it comes down to the organization's understanding of their structure and whether it has supporting data. For example, if a company knows exactly how item codes are formed, then it's very easy to process them accurately with an associated data type. The best trick for items is again a database with supporting data.

As you can see, the common trend is finding a database with existing supporting data. Knowing the problem fields focuses companies and helps them form a plan of attack for creating very accurate data capture.
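Supporting a field with existing data can be as simple as matching the raw OCR result against the known values. The sketch below uses Python's standard difflib module; the street list, the OCR string, and the 0.8 cutoff are invented for illustration.

```python
import difflib


def match_to_database(ocr_value, known_values, cutoff=0.8):
    """Return the closest known value, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(ocr_value, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical supporting data and a raw, unconstrained OCR read.
known_streets = ["1200 Market Street", "45 Mission Avenue", "9 Harbor Road"]
print(match_to_database("12OO Markel Street", known_streets))
print(match_to_database("zzz", known_streets))
```

Note that the field itself is left unconstrained, as recommended above: the engine reads what it reads, and the database is used only to confirm or correct the result.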

posted by Chris Riley at 0 Comments

Wednesday, November 11, 2009

Dropout, all or none

Color or greyscale dropout is a great tool for increasing the accuracy of extracting data from forms. But bad dropout is far worse than no dropout. Partially dropped out forms can confuse data capture technology. These forms are commonly called “Zebra” forms, where portions of the form have dropout performed correctly and other portions have the fields outlined in black. If you have control of the scanning and this is the situation, you are better off turning off dropout or improving its use.

It used to be that the only way to drop out a form was to use scanner driven dropout. This approach was limited in the colors that could be removed. Essentially, the scanner would be equipped with colored lamps, usually red. During the scan the lamp would be turned on, thus canceling out the red in the form. Because of this it was important that printed forms used a certain type of red. If you have ever had experience with color matching, you know it's quite frustrating, especially because the colors you see on the screen are not usually what is printed. Things have improved. Now even scanners are using software dropout, where images initially arrive in color and algorithms then remove pixels of a certain color range from the document. This has the added benefit that some scanners and software packages can drop out any color, and multiple colors at a time. There are even some packages out there where you can drop out things like colored lines.

Dropout with any technology becomes difficult when there are gradations on the form because of bad printing, color wear, sun or other damage. Because the software is looking for consistency, any dropout will avoid colors that don't match the norm. This is often seen when the first half of a form is dropped out and not the second because of a color change mid-document. There are tools that allow you to specify a threshold that can assist with this. This can be a very low threshold when dealing with documents that have only one color and black text, but with a low threshold more complex documents can lose important data.
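Software dropout can be illustrated with a small sketch that whitens every pixel close enough to a target color. The `tolerance` parameter stands in for the threshold mentioned above (a wider tolerance is the more aggressive setting); the name, the RGB values, and the per-channel comparison are all assumptions made for this example.

```python
def drop_out(pixels, target, tolerance):
    """Replace pixels within `tolerance` of the `target` color with white.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    """
    white = (255, 255, 255)

    def close(pixel):
        return all(abs(c - t) <= tolerance for c, t in zip(pixel, target))

    return [[white if close(p) else p for p in row] for row in pixels]


form_red = (220, 30, 30)      # the printed form color
faded_red = (240, 120, 110)   # the same ink after sun damage
black_text = (10, 10, 10)     # the data we want to keep
red_stamp = (150, 20, 20)     # important data that happens to be reddish
row = [[form_red, faded_red, black_text, red_stamp]]

print(drop_out(row, target=form_red, tolerance=40))    # faded ink survives: a "Zebra" form
print(drop_out(row, target=form_red, tolerance=100))   # faded ink removed, but so is the stamp
```

The two calls show the trade-off: a strict setting leaves faded form ink behind, while an aggressive one also removes reddish data that should have been kept.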

The biggest key to proper dropout assuming good form printing is to scan the document as quickly as possible, removing time for damage to possibly take place. Dropout is a great tool, but if you find that forms are partially dropped out you are better for data capture accuracy to turn off dropout and deal with the black and white form than to include it.

posted by Chris Riley at 0 Comments

Friday, November 6, 2009

Invisible characters

Exceptions in OCR and data capture are usually thought of as misrecognized characters only, but in reality there are several other types of exceptions. One of those is called a “high confidence blank”. A “high confidence blank” in OCR or data capture is where the software looked in a particular region for a character but no text was identified or read. In data capture, “high confidence blanks” usually occur for entire fields or just the first character. In full-page OCR they are less common, but can occur sporadically throughout the text of the document or affect the entire text. This type of exception is elusive and hard to detect. Obviously, if entire fields and blocks of text are missed where you expect there to be text, it is easy to spot, but one-off missing characters are tough. With full-page OCR, detection is done with spell-check. Missing characters in a word will surely flag the word as being misspelled. In data capture it's much trickier, and the best thing to do is to take certain steps to avoid “high confidence blanks”.

1.) The first thing you can do to avoid “high confidence blanks” in data capture is to NOT overuse image clean-up. If characters are regenerated or cleaned too much, they look to the OCR engine like a graphic rather than a typographic character and are thus ignored.
2.) Second, if you have control of the form design, make sure text is not printed close to lines. This is one of the biggest generators of “high confidence blanks”.
3.) If text is close to lines, then you should be able to establish a rule in the software indicating, for example, that if the first character in a field is more than x pixels away from the border then most likely a character(s) was missed.
4.) If at all possible, use dictionaries and data types that state the structure of the information that should be present in a field. If a character is missing, this data type will likely be broken.
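Steps 3 and 4 can be combined into one check, sketched below. The coordinates, the 15 pixel gap, and the phone-style pattern are hypothetical values chosen for the example; in practice they would come from the form definition.

```python
import re


def possible_blank(first_char_x, field_left_x, text, pattern, max_gap_pixels=15):
    """Flag a field whose text starts too far from the border or breaks its data type."""
    gap_too_wide = (first_char_x - field_left_x) > max_gap_pixels
    breaks_data_type = re.fullmatch(pattern, text) is None
    return gap_too_wide or breaks_data_type


PHONE = r"\d{3}-\d{4}"  # hypothetical data type for a 7-digit phone field

# First character was dropped: the text starts 40 pixels in and is one digit short.
print(possible_blank(140, 100, "55-1234", PHONE))
# Complete read that starts right at the border.
print(possible_blank(104, 100, "555-1234", PHONE))
```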

This type of exception is one that leads to hidden downstream problems when organizations don't realize that it might happen. Being aware and taking the proper steps to avoid "high confidence blanks" is the solution.

posted by Chris Riley at 0 Comments

Friday, October 30, 2009

How many keystrokes does it take to get to the center of accuracy?

Oftentimes we are blinded by technology and forget the pain we originally adopted technology to solve. When I first learned accounting, more tenured accountants would explain to me how they made journal entries on paper, not in QuickBooks. When I learned math, I was freely solving complex equations on my graphing calculator as my professor explained how long these equations would take without it. OCR is no different. OCR is replacing manual data entry that is not very accurate. If an OCR system is 85% or more accurate on a particular document type, then it most likely is more accurate than a single entry by a human on that same document type, and faster!

So we know there is a clear benefit to the technology: increased speed and increased accuracy. It's when companies want to be 100% accurate that they start to groan. Before OCR, and even today, to reach 100% accuracy with data entry they did double or triple blind data entry, at double or triple the labor cost. What that means is that two separate people enter the same document and the results are compared; make this three people and you will almost always be 100% accurate. You can do the same with OCR! Most large service bureaus in fact prefer that OCR technology make the first pass, then they do one pass with manual entry, making it double-blind. I'm going to suggest going one step further.

Why not have OCR with settings geared towards numbers and OCR with settings geared towards words ( our two separate data entry people ) both enter the same document, and compare the results? Why not three sets of settings, maybe four? If you take the same OCR engine with different settings and compare the extraction results from each instance, you are creating automated double blind data entry! You can replicate the trusted process for producing high accuracy with greater efficiency and lower cost.
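A sketch of the comparison step, assuming each pass returns a dictionary of field names to extracted values. The field names and values are invented; the logic simply accepts fields on which both "operators" agree and routes the rest to a person.

```python
def compare_passes(pass_a, pass_b):
    """Accept fields where both passes agree; send the rest to an operator."""
    accepted, review = {}, []
    for field in pass_a:
        if pass_a[field] == pass_b.get(field):
            accepted[field] = pass_a[field]
        else:
            review.append(field)
    return accepted, review


# Same engine, one pass tuned for numbers and one tuned for words (hypothetical output).
pass_numbers = {"invoice_no": "10442", "vendor": "Acrne Corp", "total": "118.50"}
pass_words = {"invoice_no": "10442", "vendor": "Acme Corp", "total": "118.50"}

accepted, review = compare_passes(pass_numbers, pass_words)
print(accepted)
print(review)
```

Only the disagreement needs human attention, which keeps the operator in the loop for low-confidence fields without re-keying the whole document.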

I am a constant advocate of human intervention on low-confidence fields or characters, but in the above approach you are using more technology to replicate existing, very accurate processes. Never forget the original problem and you will see very quickly that OCR is a benefit.

posted by Chris Riley at 0 Comments

Friday, October 9, 2009

It learns right? - The misconception about recognition learning

Because of the way the market has come to understand OCR ( typographic recognition ) and ICR ( hand-print recognition ), it is no surprise that some of the most common questions and expectations about the technology seem to come from a tarot card. I previously talked about one of these questions, “How accurate is it?”, and how the basis of that question is completely off and can come to no good. Here is a similar one: “It learns, right?” It is quite a loaded question, so let's explore.

Learning is the process of retaining knowledge for subsequent use. Learning is based in the realm of fact: following the same exact steps creates the same exact results. OCR and ICR arguably learn every time they are used. For example, engines will do one read and then go back and re-read characters with low confidence values, using patterns and similarities they identified on a single page. This happens at the page level, and after that page is processed the knowledge is gone. This is where the common question comes in. What people expect is that when the OCR engine makes an error on a degraded character and it is later corrected, that character will never have an error again. If this were true, you would believe that at some point the solution will be 100% accurate, once all the possible errors have been seen.

WRONG! The fact that the technology does not remember sessions is the reason it works so well. Imagine, for example, a forms processing system that was processing all surveys generated by a single individual ( this is true for OCR as well ), and the processing happened enough that it learned all possible errors and was 100% accurate. Then you start processing a form generated by a new individual. Your results on the first form type and the new one will likely be horrendous, not because of the recognition capability, but because of the supposed “learning”. In this case learning killed your accuracy as soon as any variation was introduced.

What most people don't realize is that characters change. They change based on paper, printer, humidity, handling conditions, etc. In the area of ICR it's exaggerated, as characters for a single individual change by the minute, based on mood and fatigue. So learning is a misnomer, as what you are learning is only one page, one printer, one time, one paper that will likely never repeat again. A successful production environment allows as much variation as possible at the highest accuracy, and this is not done with this type of learning.

Things that can be learned: as I said before, a single pass of a page can be followed by a second pass over low-confidence characters using patterns learned on that page. In the world of Data Capture, field locations can be learned, and field types can also be learned. In the world of classification, documents are learned based on content; this in fact is what classification is.

While the idea of errors never repeating again is attractive, people need to understand that this technology is so powerful because of the huge range of document types and text that can be processed, and this is only possible by allowing variance.

posted by Chris Riley at 0 Comments

Monday, October 5, 2009

Exceptional exceptions – Key to winning with Data Capture

Exceptions happen! When working with advanced technologies in Data Capture and forms processing, you will always have exceptions. It's how companies choose to deal with those exceptions that often makes or breaks an integration. Too often exception handling is not considered for data capture projects, but it's important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two phases of exceptions: those that make it to the operator driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here. The point is how to best manage them.

Exceptions that make it to the quality assurance ( QA ) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookup for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next would be to choose proper data types. Data types are formatting rules for fields. For example, a date in numbers will only have numbers and forward slashes, in the format NN/NN/NNNN. By only allowing these characters you make sure you catch exceptions, and can either give the data capture software enough information to correct them ( if you see a g it's probably a 6 ) or show the verification operator exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase. There are some exception documents that the software is not confident about at all, and these will end up in an exception bucket.
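The date data type described above can be sketched with a regular expression and a small look-alike table. Only the g-to-6 substitution comes from the text; the other substitutions and the status names are assumptions added for illustration.

```python
import re

DATE_TYPE = re.compile(r"\d{2}/\d{2}/\d{4}")  # the NN/NN/NNNN data type
# Letters that are commonly confused with digits (g -> 6 is from the post;
# the rest are assumed examples).
LOOKALIKES = str.maketrans({"g": "6", "O": "0", "l": "1", "S": "5"})


def check_date(raw):
    """Return (value, status) where status is 'ok', 'corrected' or 'review'."""
    if DATE_TYPE.fullmatch(raw):
        return raw, "ok"
    fixed = raw.translate(LOOKALIKES)
    if DATE_TYPE.fullmatch(fixed):
        return fixed, "corrected"
    return raw, "review"  # send to the verification operator


print(check_date("12/16/2009"))
print(check_date("12/1g/2009"))
print(check_date("12/1?/2009"))
```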

Whole exception documents that are kicked out of a system are the most costly and, if not planned for, can be the killer of ROI. The most frequent cause of these exceptions is a document type or variation that has not been set up. It's not the fault of the technology. As a matter of fact, because the software kicked the document out and did not try to process it incorrectly, it's doing a great job! The mistake companies make is giving every document that falls in this category the same attention, and thus an additional fine-tuning cost. But if that document type never appears again, the company has just reduced its ROI for nothing. The key to these exceptions, whether whole document types or just portions of one particular document type, is to set a standard stating that an exact problem has to repeat X times ( based on volume ) before it's given any sort of fine-tuning effort.
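The "repeat X times" standard is easy to automate once exception causes are logged. In this sketch the log entries and the threshold of 3 are invented; in practice X would be chosen from your document volume.

```python
from collections import Counter


def worth_tuning(exception_log, min_repeats):
    """Return the exception causes seen at least `min_repeats` times."""
    counts = Counter(exception_log)
    return sorted(cause for cause, seen in counts.items() if seen >= min_repeats)


# Hypothetical causes recorded for documents kicked out of the system.
log = [
    "unknown layout A",
    "unknown layout B",
    "unknown layout A",
    "unknown layout A",
    "skewed scan",
]
print(worth_tuning(log, min_repeats=3))
```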

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.

posted by Chris Riley at 0 Comments

Saturday, September 26, 2009

Documents say “Cheese” - Digital Photo OCR

Contrary to popular belief, it will be many years before a digital photograph of a document will be close to the accuracy of a document scan. Yes, there are document scanners today that are based on a mounted digital camera, and these are very accurate, but that is not what I'm referring to. I'm talking about photographing documents with your cell phone or digital camera. One would assume that taking a photograph of a document at the highest possible resolution would eventually replace document scanning, but that is not the whole story. Even your 12 mega-pixel digital camera will not beat a 300 DPI document scan when it comes to document imaging. While it is possible to get better and better digital photographs of documents, there is one major problem in converting them using OCR: OCR engines have to account for many more variable elements, the most complicated being layers.

When you take a photograph of a document there is the potential for several different focal points: a table, a finger, the floor. Some of these focal points can easily be mistaken for the flat surface of a document. The OCR engine has to determine which layer or focal point is the actual document and what its borders are. The way engines do this is primarily color detection. Because in a document scan there is only one focal point, as the document is the entirety of the image, the OCR engine does not need to guess or make any modification to the image to find it. This increases the accuracy of both document analysis and character reading. The next challenge is perspective.

A digital photograph of a document should be taken head on. Think about the LCD screen on your camera as being on the same plane as the piece of paper. Any variation from this causes distortion where, for example, the top edge of the document measures shorter from left to right than the bottom edge. There are some capture applications for the iPhone and other mobile devices that force you to line the document up in brackets. This forces the capture to focus only on the document and, by virtue of the guide, to know where the borders are, but lining it up is very time consuming. That gets to the final point, time.

It would actually take you much more time to capture a 10-page document with digital photographs than with an ADF or sheet-fed document scanner. Because the quality of the photo is so important when running OCR on a digital photograph, it requires a lot of conscious effort to avoid shaking, line up the document, and place the document on a surface that does not contain many layers or focal points. Because of this additional effort, it's actually not saving any time.

I am a fan of blooming technology as well, but for acquiring paper images and converting them there is no better way than a portable or traditional document scanner. In time, digital photographs of documents will become a popular way to capture single-page documents for one-off processing, but as long as paper exists so will the reality of document scanners.

posted by Chris Riley

Friday, September 25, 2009

Don't over clean – the effects of image clean-up on accuracy

There is always some way to modify a scanned image to improve its recognition results if it's not already perfect. But there are also ways to modify an image that destroy recognition results. Not all image clean-up is good for OCR, for several reasons.

There are two types of image clean-up. First is image clean-up for view-ability. These are the image clean-up tricks that make images look even prettier on the screen; the goal is what is called pixel perfect, where the image looks like it was electronically generated. The second is image clean-up for OCR or Data Capture. These are the tricks that help the image gain better recognition results. All image clean-up for OCR and Data Capture is good for view-ability, but not all image clean-up for view-ability is good for recognition. The reason for this is that engines were built and trained at a time when many image clean-up technologies were not available, and because recognition technologies interpret pixels, it's possible to remove useful ones.

Here are some tips. Stick to certain types of clean-up for recognition when this is the primary purpose. Some products and scanners will even allow what is called “dual stream” where one scanned image produces two results that can go separate paths. If you have this function use settings for one of the images that are best for OCR and settings for the other that are best for view-ability. Good for OCR is:

1.)Despeckle ( unless dot-matrix font )
2.)Line Straightening
3.)Basic Thresholding
4.)Background removal
5.)Correction of Linear Distortion
6.)Dropout
7.)Line Removal ( sometimes )

Bad for OCR is:

1.)Adaptive Thresholding: Often causes a condition called “Fuzzy Characters”, where a “c” will be read as an “e”. For hand-print you often remove portions of characters.
2.)Character Regeneration: Removes critical information important to OCR and ICR processes. If you use it in OCR ( Machine-Print ) you will notice more “high confidence blanks”: the characters are so perfect they look like images to the OCR engine and are ignored. In ICR ( Hand-Print ) you will damage the hand stroke of the characters, confusing the ICR algorithms and reducing training's ability to understand the subject, which ultimately reduces accuracy.
3.)Line Removal: Bad line removal makes bad OCR. Line fragments really interfere with OCR and ICR processes.

When using image clean-up for OCR and Data Capture processes, use only the techniques that improve recognition rates, not those that destroy them.
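As a rough sketch of the “dual stream” idea, the two lists above can be turned into a filter that builds one clean-up list for OCR and one for viewing. The operation names and the `dual_stream` function are illustrative assumptions, not the settings of any real scanner or product.

```python
# Operations from the "Good for OCR" list. Despeckle is assumed safe here,
# which would not hold for dot-matrix fonts.
SAFE_FOR_OCR = {
    "despeckle",
    "line straightening",
    "basic thresholding",
    "background removal",
    "linear distortion correction",
    "dropout",
}
# Line removal appears on both lists, so it is opt-in.
SOMETIMES_FOR_OCR = {"line removal"}

def dual_stream(requested, allow_line_removal=False):
    """Split one requested clean-up list into an OCR stream and a viewing stream."""
    allowed = SAFE_FOR_OCR | (SOMETIMES_FOR_OCR if allow_line_removal else set())
    ocr_stream = [op for op in requested if op in allowed]
    view_stream = list(requested)  # the viewing copy keeps every requested operation
    return ocr_stream, view_stream

ocr_ops, view_ops = dual_stream(
    ["despeckle", "adaptive thresholding", "character regeneration", "dropout"]
)
print(ocr_ops)   # ['despeckle', 'dropout']
print(view_ops)
```

The viewing stream still gets adaptive thresholding and character regeneration, while the copy sent to recognition only sees the operations that help it.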

posted by Chris Riley

Thursday, September 24, 2009

“You vote engines! Of course it's better” - Reality of voting

The trend of companies promoting OCR voting has become less common, but you will still occasionally find products that promote their accuracy by saying they don't use just one engine, they use many and vote them together. The presumption is that they are of course more accurate than single-engine solutions. This would seem to be the case, but it's not that easy.

All OCR engines already have a system of voting internally. This is how OCR technologies have made their advances throughout the years. They take algorithms that are each expert in one particular way of interpreting text, such as trigrams, words, fonts, etc., and vote their character guesses against each other for the final guess. This works great. It is very different from the voting that is often promoted, which takes several engines and votes their results together. When you take two separate OCR engines and vote them together it would seem you are getting the best of what's available, but there is one major problem. Voting requires that each engine guess and report confidence the same way, and this is not the case. For example, given the letter “c”, Engine A might report with 98% confidence that it's an “e”, while Engine B might report with 78% confidence that it is a “c”. When you vote these two, Engine A will win even though it's wrong. This is typically how it goes: one engine in a voting scenario will win most of the time, right or wrong, just because of how it reports its confidence levels.
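The Engine A versus Engine B example can be reproduced with a naive highest-confidence vote. This is a minimal sketch; the dictionary layout is my own assumption and not the output format of any real engine.

```python
def vote_by_confidence(guesses):
    """Naive cross-engine vote: the guess with the highest reported confidence wins."""
    return max(guesses, key=lambda g: g["confidence"])

# The true character is "c". The two engines score confidence on different scales.
guesses = [
    {"engine": "A", "char": "e", "confidence": 0.98},
    {"engine": "B", "char": "c", "confidence": 0.78},
]

winner = vote_by_confidence(guesses)
print(winner["engine"], winner["char"])  # A e
```

Engine A wins with the wrong answer simply because its confidence numbers run higher, which is the problem with comparing confidences from unrelated engines.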

This blog is not in combat with voting. Voting is a great tool; it's used internally in the engines, and it can be used externally as well. How? Vote Engine A with settings A against Engine A with settings B: the same engine voted against itself, just with different settings. This is a tremendous tool, especially when dealing with varied or highly degraded documents. By doing so you are comparing apples-to-apples confidence levels, not apples-to-elephants.
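Voting one engine against itself might look like the sketch below. It assumes both passes return the same number of characters in the same positions, each with a confidence on the same scale; real output would need alignment first, and the data here is made up.

```python
def vote_passes(pass_a, pass_b):
    """Merge two passes of the same engine, position by position.

    Each pass is a list of (character, confidence) pairs. Because both
    passes come from one engine, the confidences are directly comparable.
    """
    if len(pass_a) != len(pass_b):
        raise ValueError("passes must cover the same character positions")
    merged = []
    for (char_a, conf_a), (char_b, conf_b) in zip(pass_a, pass_b):
        merged.append(char_a if conf_a >= conf_b else char_b)
    return "".join(merged)

# Hypothetical results for the word "cat" under two different settings.
settings_a = [("c", 0.91), ("o", 0.55), ("t", 0.97)]
settings_b = [("e", 0.62), ("a", 0.93), ("t", 0.90)]

print(vote_passes(settings_a, settings_b))  # cat
```

Each pass gets one character wrong, but because the confidences mean the same thing in both, the better guess wins at every position.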

So next time you are turned on by voting, take a second look and see whether it's marketing or real value.

posted by Chris Riley

Tuesday, September 22, 2009

The wrong question - “How accurate are you?”

Organizations seeking full-page OCR or Data Capture technology have a serious need to estimate accuracy before they even deploy a technology, as this is a primary variable in determining the range of return on investment they can expect to achieve. When organizations try to understand accuracy by asking the vendor “How accurate are you?” they have gone down a path that may be hard to undo.

Accuracy is tied very closely to your document types and business process. While asking for accuracy figures on a document similar to yours is fair, the answer should not carry much weight. An organization's business process dramatically impacts OCR accuracy as well. Instead of asking “How accurate are you?” you should be asking “Can I test your software on my documents?”.

A properly established test bed of documents is the ideal way to evaluate the accuracy of a product. You want to know the worst case. Build a set of documents that are samples of your production documents, and make sure your collection is proportional to the volume you intend to process and the number of variations. Of that set, 25% should be the “pretty” documents, 50% should be your typical documents, and 25% your worst documents. Use this sample set on all products you test. If you are able to compile truth data ( 100% accurate manual results from these documents ) then you are even better off in your analysis.
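As a rough illustration, the 25/50/25 mix and a comparison against truth data could be sketched like this. The function names are mine, and character accuracy based on edit distance is just one reasonable way to score a result, not a measure prescribed by any vendor.

```python
def sample_mix(total):
    """Split a test bed of `total` documents into the 25/50/25 mix."""
    pretty = total // 4
    worst = total // 4
    return {"pretty": pretty, "typical": total - pretty - worst, "worst": worst}

def edit_distance(a, b):
    """Levenshtein distance: inserts, deletes, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def character_accuracy(truth, ocr_result):
    """Share of truth characters the engine got right, from 0.0 to 1.0."""
    if not truth:
        raise ValueError("truth data must not be empty")
    return max(0.0, 1 - edit_distance(truth, ocr_result) / len(truth))

print(sample_mix(200))
# Hypothetical field: the engine read the letter O as a zero.
print(round(character_accuracy("INVOICE 10045", "INV0ICE 10045"), 3))
```

Running the same scoring over the same sample set for every product you test keeps the comparison fair.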

While I would hope no vendor answers this question directly, the question itself means that you don't yet understand the problem you are trying to solve. Today the ability to test is essential and the vendor should grant you that right. Taking the time to test will save you much pain and time later.

posted by Chris Riley

Thursday, September 17, 2009

Even OCR needs a helping hand – Quality Assurance

Let's face it, OCR is not 100% accurate 100% of the time. Accuracy is highly dependent on document type, quality of scan, and document makeup. The reason OCR is so powerful is because it's not. How do we give OCR the best chance to succeed? There are many ways; what I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process, where a human reviews uncertainties and business rules based on the OCR result. An uncertainty is a character that the software flags because it did not satisfy a threshold during recognition. This process is a balancing act between a desire to limit human time as much as possible and a need to see every possible error, but not more.

Let's start with the review of uncertainties. Here an operator will look at just those characters, words, and sentences that are uncertain. This is determined by the OCR product, which will have some indicator of what they are. In full-page OCR, spell checking is often used. In Data Capture, a field is usually reviewed character-by-character and you don't see the rest of the results. Some organizations will set critical fields to be reviewed always, no matter the accuracy. Others may decide that a field is useful but does not need to be 100%. Each package has its own variation of “verification mode”. It's important to know their settings and the levels of uncertainty your documents are showing in order to plan your quality assurance.
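A minimal sketch of uncertainty review, assuming the engine hands back a confidence for each character. The 0.80 threshold, the `always_review` flag, and the sample field are illustrative assumptions, not defaults from any package.

```python
def characters_to_review(field, threshold=0.80, always_review=False):
    """Return the positions in a field that an operator should look at.

    `field` is a list of (character, confidence) pairs. Critical fields can
    be forced into review regardless of confidence with `always_review`.
    """
    if always_review:
        return list(range(len(field)))
    return [i for i, (_, conf) in enumerate(field) if conf < threshold]

# Hypothetical recognition result for an amount field reading "125.00".
amount = [("1", 0.99), ("2", 0.64), ("5", 0.97), (".", 0.71), ("0", 0.95), ("0", 0.96)]

print(characters_to_review(amount))                      # [1, 3]
print(characters_to_review(amount, always_review=True))  # [0, 1, 2, 3, 4, 5]
```

Raising the threshold sends more characters to the operator; lowering it saves human time at the risk of letting errors through, which is the balancing act described above.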

After the characters and words have been checked, Data Capture has an additional step in quality assurance: business rules. In this process the software applies arbitrary rules the organization creates and checks them against the fields. A good example might be “don't enter anyone in the system whose birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless, and packages today make it very easy to create custom rules. The goal would be to first deploy the business rules you already have in place in the manual operation, then augment them with rules that enhance accuracy based on the raw OCR results you are seeing.
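The birth-year rule could be expressed as in the sketch below. The rule list, the record layout, and the `failed_rules` helper are hypothetical; real packages have their own rule editors.

```python
# Each rule is a (description, check) pair; the check returns True when the
# record passes. The record layout is an assumption for illustration.
RULES = [
    ("birth year must not be earlier than 1984", lambda r: r["birth_year"] >= 1984),
]

def failed_rules(record, rules=RULES):
    """Return the descriptions of every rule the record fails."""
    return [description for description, passes in rules if not passes(record)]

record = {"name": "J. Smith", "birth_year": 1979}
problems = failed_rules(record)
if problems:
    print("flag for operator:", problems)
```

Keeping rules as data makes it easy to start with the checks already used in the manual operation and add more as the raw OCR results show where errors cluster.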

In some more advanced integrations, a database or body of knowledge is deployed as a first round of quality assurance that is still automated.

These two quality assurance steps combined should give any company a chance to achieve the accuracy they are seeking. Companies who fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

posted by Chris Riley

Tuesday, September 15, 2009

Invoice-in-a-box – 4 steps to success

Invoices are one of the most in-demand documents to automate. Let's talk a little about what it takes to be successful in invoice processing. Data Capture is the technology used for invoices; this is where you extract field-by-field the information you want from the invoice in field order. In order to automate invoices with high accuracy and utilize a boxed invoice solution, you need to do some preparation. Here are 4 MUST have steps:

1.)Separate your commercial invoices from any specialized invoice types such as legal, manufacturing, telecommunication, etc. The reason you do this is that the low hanging fruit when automating invoices is commercial invoices. Software packages have put the most effort into these documents. By working with them first you are ensuring your success on a large population of your invoices and can then tackle the remainder.

2.)Know how many vendors you have. Understanding the makeup of your invoices is very important. Your focus should be determined by those invoices that are easiest to automate and make up the greatest portion of your entire volume. So make a list of all your vendors and what paper volume percentage each makes up of the whole.

3.)Know if you want to collect line-item data or not. At first glance the majority of companies say they want line-items, only to change their mind later. Find the business process that mandates you collect line items. In your current process, are you having line items entered? What database of existing information will you use to support your line-item extraction? Most companies in the end choose against line-items or choose to extract them for a limited set of critical vendors.

4.)Know how you are going to check the quality of extraction. Quality assurance happens with human review and business rules. Know beforehand how you want those to work. For example, a business rule could simply be that all line-items must add up to the total amount; if they don't, you have someone look at the entire invoice.
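Steps 2 and 4 lend themselves to a short sketch: rank vendors by their share of invoice volume, and flag an invoice whose line-items don't add up to the total. The vendor names and amounts are made up, and the function names are my own.

```python
from collections import Counter
from decimal import Decimal

def vendor_share(invoice_vendors):
    """Rank vendors by invoice count with their percentage of the whole volume."""
    counts = Counter(invoice_vendors)
    total = sum(counts.values())
    return [(vendor, n, round(100 * n / total, 1)) for vendor, n in counts.most_common()]

def needs_review(line_items, total):
    """Business rule: line-items must add up to the invoice total.

    Amounts are passed as strings and summed as Decimal to avoid
    floating-point rounding on currency values.
    """
    return sum((Decimal(x) for x in line_items), Decimal("0")) != Decimal(total)

# Hypothetical sample: one vendor name per invoice received.
sample = ["Acme"] * 60 + ["Globex"] * 30 + ["Initech"] * 10
for vendor, count, percent in vendor_share(sample):
    print(vendor, count, percent)

print(needs_review(["19.99", "5.01"], "25.00"))  # False
print(needs_review(["19.99", "5.01"], "26.00"))  # True
```

The ranking shows where automation effort pays off first, and the total check is the kind of rule that decides when a person has to look at the entire invoice.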

These four steps are not the end-all in improving your invoice processing accuracy, but they are necessary, and all are steps to consider before you look at purchasing a boxed invoice processing solution.

posted by Chris Riley

Monday, September 14, 2009

“No text left behind” - Color's Impact on OCR

OCR technology has come a long way since its creation. On clean, 300 DPI, letter-type documents the technology has arrived and there is not much room for improvement. But what about the rest of the documents out there; how is OCR improving on them? When comparing that perfect letter document to a not-so-perfect article or newspaper, say, the big difference is text placement and configuration. One of the keys to getting even better OCR is to improve your ability to identify what is graphics and what is text. Within the text you have to identify columns, paragraphs, sentences, words, and finally characters. Only then can the OCR take a whack at interpreting the text. This is called Document Analysis. Sometimes OCR accuracy is lower not because of the actual read of the text but because the OCR software tries to read things that are not text, or because some of the text in the document is simply ignored because it was never found.

In the last few years, and moving forward, text identification ( Document Analysis ) has been one of the areas of greatest improvement. Many of the new products have been leveraging color as one more tool in not leaving any text behind. With color, locating the different parts of a document is easier and more accurate, and thus the overall OCR is more accurate. The most obvious benefit of color is the ability to locate graphics. Sometimes index-level OCR requires that even text within graphics be read to enhance the search-ability of a document. With color detection the modern engines are advancing to locate text in pictures and ignore the rest. Very stylized documents pose the greatest challenge to Document Analysis, and color is one of the best tools to attack them. Expect to see similar trends and a continued focus on Document Analysis and the pursuit of no text left behind.

posted by Chris Riley
