Tuesday, December 1, 2009

OCR me once good for me, OCR me twice possibly great for me

When accuracy is the primary concern in document recognition, the best technique is multiple passes of the OCR or recognition process. Just as you might have a document manually keyed two or three times, why not have an OCR engine convert it three, four, or five times, each with different settings?

The important thing to note in multiple-pass recognition is that you NEVER use a different engine for the same process. Reconciling results from two separate engines is self-defeating. This is often called voting, and it does not work because each engine represents character confidence differently, so you might end up always picking the less accurate engine just because it reported more confidence than the more accurate one. Using the same engine multiple times with different settings, however, is consistent and a good idea.

One scenario where this is used very successfully is documents that contain both machine-printed and hand-printed text. A first read is done with an OCR engine using settings A, and a second read with the same OCR engine using settings B. Areas where both reads produce only garbage text may indicate hand-print. You can then run ICR (a hand-print engine) on those regions to pick up additional information. That is three total passes of recognition, and the results are combined to make the final document.
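The hand-print detection step in that scenario can be sketched in a few lines. This is only an illustration: the region ids, the `(text, confidence)` layout, and the two thresholds are assumptions made for the example, not the output format or defaults of any particular engine.

```python
def looks_like_garbage(text, confidence, min_confidence=0.5, min_alnum_ratio=0.6):
    """Heuristic (assumed, not engine-specific): low confidence or mostly symbols."""
    if not text.strip():
        return True
    stripped = text.replace(" ", "")
    alnum = sum(ch.isalnum() for ch in stripped)
    return confidence < min_confidence or alnum / len(stripped) < min_alnum_ratio


def handprint_candidates(pass_a, pass_b):
    """Return regions where BOTH reads of the same engine produced garbage.

    pass_a / pass_b: dict mapping a region id to (text, confidence).
    """
    return [
        region
        for region in pass_a
        if region in pass_b
        and looks_like_garbage(*pass_a[region])
        and looks_like_garbage(*pass_b[region])
    ]


# Made-up results from settings A and settings B of one engine.
settings_a = {"header": ("ACME Corp", 0.95), "comment_box": ("~|/\\_ ]", 0.21)}
settings_b = {"header": ("ACME Corp", 0.93), "comment_box": ("{^ .;'", 0.18)}

print(handprint_candidates(settings_a, settings_b))  # ['comment_box']
```

The regions this returns are the ones worth handing to the ICR pass.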

At minimum, three runs of the same engine would be ideal, as the statistical chance of different settings producing the same error drops drastically and the final output is nearly as good as it is going to get. Some document types lend themselves to multiple-pass recognition more than others. Sometimes it's determined by the environment: for example, environments that have a lot of traditional documents mixed with invoice-like documents would benefit from a full-page read with standard settings on every page, plus a full-page read with special document analysis designed for documents with lines and tables.
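Here is a minimal sketch of how three passes of the same engine might be reconciled. It assumes the passes are already aligned word by word and that each word carries a confidence between 0 and 1; both are simplifications I'm making for the example. Falling back to raw confidence is only reasonable here because every pass comes from the same engine, so the numbers are on the same scale.

```python
from collections import Counter


def combine_passes(passes):
    """Merge word-level results from several passes of the SAME engine.

    passes: list of passes; each pass is a list of (word, confidence)
    tuples, aligned so position i is the same word on the page.
    """
    merged = []
    for candidates in zip(*passes):
        counts = Counter(word for word, _ in candidates)
        top_word, top_count = counts.most_common(1)[0]
        if top_count > 1:
            # At least two passes agree: take the agreed reading.
            merged.append(top_word)
        else:
            # No agreement: fall back to the most confident reading.
            merged.append(max(candidates, key=lambda c: c[1])[0])
    return merged


# Made-up results for three settings of one engine.
pass_a = [("Invoice", 0.98), ("t0tal", 0.55), ("1,204.50", 0.91)]
pass_b = [("Invoice", 0.97), ("total", 0.88), ("1,204.50", 0.93)]
pass_c = [("lnvoice", 0.60), ("total", 0.90), ("1,2O4.50", 0.52)]

print(combine_passes([pass_a, pass_b, pass_c]))
# ['Invoice', 'total', '1,204.50']
```

Each pass makes a different mistake, and with three passes two of them outvote the third every time.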

While multiple-pass OCR slows down the entire process, it is still faster and more accurate than manual entry most of the time. I recommend this approach for any organization where accuracy is the primary concern.

posted by Chris Riley at 2 Comments

Monday, November 16, 2009

Digital Ink – it's not OCR or ICR

Digital ink is the approach of having a touch-screen device monitor a user's movements with a stylus on the screen to determine what character was written. This is not OCR or, more specifically, ICR. Very often companies have asked for OCR technology when they meant digital ink, and vice versa. OCR and digital ink overlap, but not always: there are cases where you simply cannot do away with paper, and digital ink does not process typed text.

Many people first saw the technology when Apple released the Newton, one of the first PDAs with a touchscreen and stylus. Palm Computing later popularized stylus input with its own handhelds. On those early devices you often had to re-learn how to write characters according to a guide. The characters were specifically structured to provide the best recognition and had to be completed in a single stroke. Once mastered, the recognition was very good. Now any tablet PC has a basic version of digital ink software. Digital ink competes with ICR (intelligent character recognition, or hand-print recognition). Whereas ICR looks at an image of written characters, digital ink monitors hand strokes as the character is being written.

The accuracy difference between the two is an argument that either side can easily lose. There are times when digital ink is far more accurate and times when ICR of paper forms is more accurate. The key really is the business process that the technology is fitting into. Both have their place. Digital ink is usually combined with an elaborate data entry and content management process. Most often digital ink is not about getting a substantial amount of text from the operator but about the operator quickly answering simple questions, usually requiring no writing at all. The number of characters entered in a digital ink scenario is many times smaller than in an ICR form scenario. You will not see tablet PCs sent out in the mail to survey a customer base.

The biggest use of digital ink today is in health care, where there is a drive to increase its adoption even more. The purpose of the technology in this space is to rapidly populate medical records at the point of examination. However, health care remains one of the top paper-generating industries requiring OCR and ICR. This shows that the two technologies satisfy very different needs and should not be confused with each other.

posted by Chris Riley at 0 Comments

Tuesday, November 3, 2009

Fixed, Semi-structured, UNSTRUCTURED!?

I find myself educating even industry peers on the topic of document structure more and more recently. Often the conversation starts with one of them telling me that unstructured document processing exists, or that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let's define a document: a document is a collection of one or more pages that has a business process associated with it. Documents of a single type can vary in length, but the content they contain, or may contain, is constrained. When data capture technology works, it works on pages, so each page of a document is processed as a separate entity. This, it seems, is the heart of the confusion.

Often someone will say a document is unstructured when what they mean is that the order of its pages is unstructured. That is more or less accurate; however, the pages within this “unstructured” document are either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. The test is this: if at any moment you can pull a page from the document and state what that page is and what information it should contain, then it IS NOT unstructured.

The ability to process agreements and contracts is limited to very concrete scenarios where the contract variants are known, which essentially means they are no longer unstructured. In general, the ability to process unstructured documents does not exist. Now to explore the difference between semi-structured and fixed.

It's actually very easy: 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make the form fixed. For example, a tax form always has the same general location for the company name. The printer has to print within a specified range, but it can print further to the left or closer to the top, and the length will vary with every name. This makes it semi-structured. Additionally, when the document is scanned it will shift left, right, up, or down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.
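To make the role of registration marks concrete, here is a rough sketch of matching a scanned image back to template coordinates. It only corrects a simple left/right/up/down shift, which is an assumption I'm making to keep it short; real products also correct skew and scale. The coordinates are made-up pixel values.

```python
def estimate_shift(template_marks, scanned_marks):
    """Average (dx, dy) between matching registration marks.

    Both arguments are lists of (x, y) pixel positions in the same order.
    """
    dxs = [s[0] - t[0] for t, s in zip(template_marks, scanned_marks)]
    dys = [s[1] - t[1] for t, s in zip(template_marks, scanned_marks)]
    return sum(dxs) / len(dxs), sum(dys) / len(dys)


def locate_field(field_box, shift):
    """Move a template field (x, y, width, height) onto the scanned image."""
    x, y, width, height = field_box
    dx, dy = shift
    return x + dx, y + dy, width, height


# Four marks on the template, and where they were found on the scan.
template = [(50, 50), (2500, 50), (50, 3250), (2500, 3250)]
scanned = [(62, 44), (2512, 44), (62, 3244), (2512, 3244)]

shift = estimate_shift(template, scanned)
print(shift)                                      # (12.0, -6.0)
print(locate_field((300, 400, 600, 60), shift))   # (312.0, 394.0, 600, 60)
```

Without the marks there is nothing reliable to measure the shift against, which is why a form that lacks them should be treated as semi-structured.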

There, again, the confusion is exposed. When having conversations about data capture, it's very important to understand the true definitions of the lingo being used. I task you: if you catch someone using the lingo incorrectly, correct it. It will help both you and them.

posted by Chris Riley at 0 Comments

Thursday, October 22, 2009

Hand-print or Handwriting, makes a big difference

When it comes to forms processing and data capture, whether a document contains hand-print or handwriting makes a huge difference in accuracy and validity. Sometimes the difference between the two is not so clear. So how do you tell if your form is hand-print, handwriting, or, better yet, both?

ICR (Intelligent Character Recognition) is the algorithm used in place of OCR for characters generated by a human hand. The algorithm is more dynamic, as a person's hand-print changes slightly by the minute. It's possible to be very accurate when processing hand-print forms if the form is designed correctly. When doing this type of forms processing you will always have quality assurance steps, but you can get close to the accuracy of any OCR process. Very often, forms that were not created with data capture or automatic extraction in mind will contain handwriting. The reason is that hand-print is usually guided by the form itself. Forms that do not guide the writer toward hand-print cannot be expected to process at high accuracy. So what makes hand-print hand-print?

Mono-spaced text: This means that each character, as it's filled out, is the same distance from its neighbors as every other character. In handwriting you will very often have characters that connect; in the extreme form this is cursive. When characters touch or are not spread out equally, you get improper segmentation, with characters clumped together as one or split in half during recognition. Mono-spaced text is usually achieved with boxes on the form that guide the user to write within them.

Uniform Height and Width: Similar to mono-spaced text, the text as it is filled in should have a more or less uniform height and width. This keeps the person completing the form from introducing as many variable elements as they would in straight handwriting, and it increases accuracy. This is also accomplished with boxes on the form that keep users within bounds.

Stable Base-Line: This aspect of hand-print is the least thought about but very important. Text must always sit on the same horizontal baseline. What typically happens in handwriting is that the writer drifts up and down around an invisible baseline. You may have noticed when you write that the end of a line is sometimes lower than the beginning. Baselines are important for OCR and ICR to get proper character segmentation and recognition of a few key characters, such as “q” and “p”, the “tail” characters.

Sans-serif: The last element is keeping characters sans-serif. The reason is that extra tails on characters can cause confusion between certain characters, like “o” vs. “q” and “c” vs. “e”. The way to achieve this is less obvious: put a guide at the top of the form that shows a good character and a bad character.
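The first three elements can be checked roughly from character bounding boxes. The sketch below is my own illustration, not how any ICR engine decides: the box format, the 15% tolerance, and the sample coordinates are all assumptions, and the sans-serif element cannot be judged from boxes at all.

```python
from statistics import mean, pstdev


def handprint_report(boxes, tolerance=0.15):
    """Rough hand-print check for one line of at least two characters.

    boxes: (x, y, width, height) per character, left to right, with y
    measured from the top of the image to the top of the character.
    """
    gaps = [b[0] - (a[0] + a[2]) for a, b in zip(boxes, boxes[1:])]
    heights = [h for _, _, _, h in boxes]
    baselines = [y + h for _, y, _, h in boxes]
    limit = tolerance * mean(heights)

    def steady(values):
        # "Steady" means the spread is small relative to character height.
        return pstdev(values) <= limit

    return {
        "mono_spaced": min(gaps) > 0 and steady(gaps),  # no touching characters
        "uniform_height": steady(heights),
        "stable_baseline": steady(baselines),
    }


# Characters written inside printed boxes vs. free handwriting (made-up values).
boxed = [(10, 100, 20, 30), (40, 101, 21, 29), (70, 100, 20, 30), (100, 99, 19, 31)]
free = [(10, 100, 20, 30), (28, 95, 30, 45), (55, 110, 15, 20), (90, 100, 25, 38)]

print(handprint_report(boxed))
print(handprint_report(free))
```

The boxed sample passes all three checks, while the free sample fails on spacing because its characters overlap.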

ICR is a technology for hand-print recognition and can be very accurate when the form provides the proper guides. Today, automation of handwriting and cursive is not complete, and it is usually only successful when augmented with other technologies such as database look-up and CAR and LAR (courtesy and legal amount recognition, used on checks). Sometimes the difference between the two is unclear, but the four elements above provide a clear definition of hand-print. The best hand-print to be found is that of the highly trained creators of engineering drawings, whose print is so perfect it very closely resembles typographic text.

posted by Chris Riley at 0 Comments

Thursday, October 15, 2009

Ok Chris, you talk the talk, but what is it?

The readers of this blog are varied. Some know what OCR and Data Capture are, some do not. Some know they need it but not necessarily how to use it. Others know how powerful it is and have a good understanding of what is out there, but not the best practices. So, taking a step back, let me tell you what it's all about. It's about saving money and reducing the cost associated with paper-based operations.

OCR is commonly used to encompass all of the recognition technologies out there. It specifically stands for Optical Character Recognition. This is simply the process of taking an image, scanned or digitally received, and converting it to text. While OCR can be used to mean ICR, OCR, Data Capture, OMR, and barcode processing, it is really the process of extracting ALL of the typographic text from an image document and converting it to a digital format. ICR is hand-print extraction, OMR is filled-in bubble extraction, and barcode is, well, barcode extraction. These latter recognition technologies make up Data Capture.

Data capture is the process of extracting field/data pairs to be exported in a structured format. It does not necessarily have to get all the information on a document, and it is very highly dictated by business processes. Data Capture incorporates ICR, OMR, barcode, and OCR to extract the data from fields. Fixed Form Data Capture applies to forms that don't change page to page and are usually hand-print. Semi-structured forms are 80% of the documents someone sees. Data Capture is usually a more complex technology compared to plain full-page OCR.
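As a small illustration of what “field/data pairs exported in a structured format” can look like, here is a sketch that turns captured fields into a JSON record. The field names, values, and the choice of JSON are all made up for the example; real products export to whatever the business process needs.

```python
import json

# Hypothetical capture result for one page; each field notes which
# recognition technology read it.
captured_fields = [
    {"field": "last_name", "value": "RILEY", "source": "ICR"},
    {"field": "newsletter_opt_in", "value": True, "source": "OMR"},
    {"field": "form_id", "value": "SURVEY-07-P1", "source": "barcode"},
    {"field": "printed_total", "value": "1,204.50", "source": "OCR"},
]


def export_record(fields):
    """Flatten the captured fields into one field/value record as JSON text."""
    record = {item["field"]: item["value"] for item in fields}
    return json.dumps(record, indent=2)


print(export_record(captured_fields))
```

Notice that nothing else on the page is exported: only the fields the business process asked for.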

So there you have it. This is why you are reading this blog: to learn about the specifics, nuances, and best practices of these technologies.

posted by Chris Riley at 0 Comments

Friday, October 9, 2009

It learns right? - The misconception about recognition learning

Because of the way the market has come to understand OCR (typographic recognition) and ICR (hand-print recognition), it is no surprise that some of the most common questions and expectations about the technology appear to come from a tarot card. Previously I talked about one of these questions, “How accurate is it?”, and how the basis of that question is completely off and can come to no good. Here is a similar one, “It learns, right?”, which is quite a loaded question, so let's explore.

Learning is the process of retaining knowledge for subsequent use. Learning is based in the realm of fact: following the same exact steps creates the same exact results. OCR and ICR arguably learn every time they are used. For example, engines will do one read and then go back and re-read characters with low confidence values, using patterns and similarities they identified on a single page. This happens at the page level, and after that page is processed the knowledge is gone. This is where the common question comes in. What people expect is that once the OCR engine makes an error on a degraded character and that error is corrected, the character will never be misread again. If this were true, you would believe that at some point the solution will be 100% accurate, once all possible errors have been seen.

WRONG! The fact that the technology does not remember sessions is the reason it works so well. Imagine, for example, a forms processing system processing surveys all generated by a single individual (this is true for OCR as well), where the processing happened enough that it learned all possible errors and reached 100%. Then you start processing a form generated by a new individual. Your results on both the first form type and the new one will likely be horrendous, not because of the recognition capability but because of the supposed “learning”. In this case learning killed your accuracy as soon as any variation was introduced.

What most people don't realize is that characters change. They change based on paper, printer, humidity, handling conditions, etc. In the area of ICR it's exaggerated, as characters for a single individual change by the minute, based on mood and fatigue. So learning is a misnomer, as what you are learning is only one page, one printer, one time, one paper, a combination that will likely never repeat. A successful production environment allows as much variation as possible at the highest accuracy, and this is not done with this type of learning.

Things that can be learned: As I said before, a single pass of a page can be followed by a second pass over low-confidence characters using patterns learned on that page. In the world of Data Capture, field locations can be learned, and field types can also be learned. In the world of classification, documents are learned based on content; this, in fact, is what classification is.
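To show what page-level learning means in practice, here is a toy sketch of the second pass. It is not how any specific engine works: the `shape_key` is a hypothetical stand-in for whatever glyph-similarity measure an engine uses, and the 0.8 threshold is arbitrary. The point is that the learned patterns live only inside the function and are gone once the page is done.

```python
from collections import Counter, defaultdict


def reread_page(chars, threshold=0.8):
    """Second pass over ONE page using patterns seen on that same page.

    chars: list of (shape_key, char, confidence) in reading order.
    """
    # Page-local knowledge: discarded when this function returns.
    learned = defaultdict(Counter)
    for shape, ch, conf in chars:
        if conf >= threshold:
            learned[shape][ch] += 1

    result = []
    for shape, ch, conf in chars:
        if conf < threshold and learned[shape]:
            # Re-read a doubtful character as the confident reading
            # of the same shape elsewhere on this page.
            ch = learned[shape].most_common(1)[0][0]
        result.append(ch)
    return "".join(result)


# The third glyph has the same shape as the first but was read poorly.
page = [("s1", "e", 0.95), ("s2", "n", 0.90), ("s1", "c", 0.40), ("s3", "d", 0.92)]
print(reread_page(page))  # ened
```

Run it on the next page and it starts from nothing, which is exactly the behavior that lets the same engine cope with a different printer or a different hand.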

While the idea of errors never repeating is attractive, people need to understand that this technology is so powerful because of the huge range of document types and text that can be processed, and this is only possible by allowing variance.

posted by Chris Riley at 0 Comments

Wednesday, September 30, 2009

It's CAPTCHA for a reason – Why you can't OCR CAPTCHA

I've been surprised recently by the number of project requests and Twitter conversations insisting that OCR can be used to read CAPTCHA. A CAPTCHA is that crazy set of letters and numbers most websites ask you to enter when completing a web form. The purpose of a CAPTCHA is to prevent web bots from creating accounts on websites for use in spamming or other malicious activities. It's surprising how many organizations, both private and public, want people to solve the problem of reading CAPTCHA for them. Almost all of these companies ask for OCR technology to do so.

I'm sorry, but the answer is that it's not possible with OCR. The reason is that CAPTCHA is not an OCR problem. It would be more logical to call it ICR (hand-print), but this is still a stretch. OCR is Optical Character Recognition, which is the reading of typographic text. CAPTCHA fonts are clearly not typographic. To be typographic they would have to have the same baseline (bottom border), the same font height for each character in the same class, etc. CAPTCHA fonts more closely resemble hand-print, which is ICR processing. However, even ICR technology expects some consistency: for the most part, in a given day and time you will write the word “Analytics” pretty much the same way across a form. This allows ICR to understand the subject's hand strokes, etc., in creating the character. This level of consistency is simply not present in CAPTCHAs. CAPTCHAs deploy backgrounds and ever-moving lines to disrupt the consistency of even their already bizarre fonts. For the most part, each CAPTCHA system at any given moment will produce a different variation of every possible character.

While the idea of processing CAPTCHAs is technically enticing, actually wanting to do it has obvious malicious intent. Conversion of CAPTCHAs would require a combination of varying recognition technologies, adaptive pattern training, and imaging techniques. I'm not convinced that the effort of creating such an approach is fiscally feasible, especially when the average project is offering fifteen dollars to complete it. My job today is to set the record straight and let the world know that CAPTCHA processing is not a job for OCR and ICR technology, period.

posted by Chris Riley at 0 Comments

Wednesday, September 23, 2009

The Magic of 300DPI

Many users of OCR don't realize what the impact of resolution and bit depth is, or even what they are. Usually in the case of OCR more is better: more resolution, more bit depth. It's more information the OCR engine can use to interpret text. But as with many things there is a point of diminishing returns, and as it relates to image resolution, the diminishing returns are very interesting.

You will hear a lot that 300 DPI is the best resolution to scan an image for OCR. But why? 300 DPI is that magic number where you gain the most accuracy without sacrificing speed and file size. If you were to put the resolutions on a progressive line starting with 96 DPI and run tests of OCR accuracy, scanning speed, OCR speed, and file size, you would notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan is at least 2 times the improvement gap between any other pair of resolutions. Now if you look at the same line between 300 DPI and 400 DPI, the improvement gap is nearly absent, but still there. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let's look at why.
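The file size half of that trade-off is plain arithmetic, sketched below. It assumes a US Letter page (8.5 x 11 inches) scanned black and white at 1 bit per pixel with no compression. Those are my assumptions for the example, and real files will be smaller once compressed. The accuracy half has to come from testing your own documents.

```python
def scan_size(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=1):
    """Pixel dimensions and uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    size_bytes = width_px * height_px * bits_per_pixel / 8
    return width_px, height_px, size_bytes


for dpi in (96, 200, 300, 400):
    w, h, size = scan_size(dpi)
    print(f"{dpi} DPI: {w} x {h} px, {size / 1_000_000:.2f} MB uncompressed")

# 96 DPI: 816 x 1056 px, 0.11 MB uncompressed
# 200 DPI: 1700 x 2200 px, 0.47 MB uncompressed
# 300 DPI: 2550 x 3300 px, 1.05 MB uncompressed
# 400 DPI: 3400 x 4400 px, 1.87 MB uncompressed
```

Pixel count grows with the square of the resolution, so going from 300 to 400 DPI costs about 1.8 times the data for an accuracy gain that is nearly absent.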

There is one major reason that 300 DPI is optimal, beyond its reasonable scan speed and file size: the engine cores were all initially trained at this resolution. Some engines, no matter what resolution you give them, will actually sample up or down to get to 300 DPI. The image pre-processing/cleanup engines are set up similarly.

There are always exceptions, and they are usually hand-printed forms (ICR) or documents with small print.

The beauty of the 300 DPI best practice is that it's one of the few things in the area of OCR and Data Capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.

posted by Chris Riley at 0 Comments

Friday, September 18, 2009

When you got it design it – Form Design

Not too often do companies using Data Capture technology have the chance to change their form design or even create new forms. If you have this ability, USE IT! A properly designed form is the first step to success in automating that form. There are many things you can do to make sure your form is as machine-readable as possible. Typically the forms we are talking about are hand-written, but occasionally also machine-filled. I will highlight the major points.

1. Corner stones. Make sure your form has corner stones in each corner of the page. The corner stones should be at 90 degree angles to their neighbors, and the ideal type is a black 5 mm square.

2. Form title. A clear title in 24 pt or higher print and no stylized font.

3. Completion Guide. This is optional, but it is sometimes useful to print a guide at the top of the form on how best to fill in the type of fields you use.

4. Mono-Spaced fields. For the fields to be completed, it's best to use field types that separate the text character by character. Each character block should be 4 mm x 5 mm, and blocks should be separated by 2 mm or more. The best types of fields to use, in order, are letters separated by a dotted frame, letters separated by a drop-out color frame, and letters separated by complete square frames.

5. Segmented fields by data type. For certain fields it will be important to segment the field into portions to enhance ICR accuracy. The best example is a date: instead of having one field for the complete date, split it into 3 separate parts, first a month field, next a day field, and finally a year field. The same is often done for numbers, codes, and phone numbers.

6. Separate fields. Separate each field by 3 mm or more.

7. Consistent fields. Make sure the form uses consistent field types stated in 4.

8. Form breaks. It's OK to break the form up into sections and separate those sections with solid lines. This often helps template matching.

9. Placement of field text. For the text that indicates what a field is (“first name”, “last name”), it is best to put it left-justified to the left of the field at a distance of 5 mm or more. DO NOT put the field descriptor in drop-out in the field itself.

10. Barcode. Barcode form identifiers are useful in form identification. Use a unique ID per form page and place the barcode at the bottom of the page at least 10 mm from any field.
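Since the measurements above are in millimetres and the software works in pixels, it can help to see what those dimensions become on a scan. This is just a unit conversion; the 300 DPI default is an assumption about how the form will be scanned.

```python
MM_PER_INCH = 25.4


def mm_to_px(mm, dpi=300):
    """Convert a printed dimension in millimetres to scanned pixels."""
    return round(mm / MM_PER_INCH * dpi)


# Dimensions taken from the list above (all in mm).
DESIGN_MM = {
    "corner stone side": 5,
    "character box width": 4,
    "character box height": 5,
    "gap between character boxes": 2,
    "gap between fields": 3,
    "field label to field": 5,
    "barcode to nearest field": 10,
}

for name, mm in DESIGN_MM.items():
    print(f"{name}: {mm} mm = {mm_to_px(mm)} px at 300 DPI")
```

A 2 mm gap between character boxes is only about 24 pixels at 300 DPI, which shows how little room there is for a scan to shift before characters land in the wrong box.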

posted by Chris Riley at 0 Comments

Wednesday, September 16, 2009

If it's not semi-structured why fix it – know your form's class?

There are two major classes of Data Capture technology: fixed and semi-structured. When processing a form it's critical that the right class is chosen. To complicate things, there is a population of forms out there that can be automated with either, but there is always a definite benefit of one over the other. In my experience organizations have a very hard time figuring out whether their form is fixed or not. The most common misdiagnosis comes from forms where fields are in the same location and each has an allotted white space for data to be entered. To most people this seems fixed, but in actuality it's not. Text in these boxes can move around substantially. Additionally, the boxes themselves, while in the same location relative to each other, can move because of copying, variations in printing, etc. There are two very easy steps to determine if your form is fixed or not.

1.) Does your form have corner stones? Corner stones, sometimes referred to as registration marks (registration marks have been known to replace corner stones when they are very clearly defined), are printed objects, usually squares, in each corner of the form. They must all be at 90 degree angles from their neighbors. What corner stones do is allow the software to match the scanned or input document to the original template, theoretically lining up all fields and all static elements on the form and removing any shifts, skews, etc.

2.) Does your form have pre-defined fields? A pre-defined field is more than a location on the form: it has a set width, height, location, and, most importantly, a set number of characters. You know these fields best from forms you have filled out that have a box for each letter. There are variations in how the characters are separated, but they all share these attributes. This is called mono-spaced text.

If your form does not have both of the above items, it is not a fixed form. This would indicate that a semi-structured forms processing technology would be the best fit. For those forms that are commonly confused for fixed, there are ways to make them process well with a fixed form solution by isolating the input type (fax, email, scan) and using the proper arrangement of registration marks.
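The two-step test above is simple enough to write down as a checklist. The `Field` structure and its attribute names are hypothetical, chosen just for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """One field on a form; None means the attribute is not set by the form."""
    name: str
    width_mm: Optional[float] = None
    height_mm: Optional[float] = None
    max_chars: Optional[int] = None

    def is_predefined(self):
        # Pre-defined means set width, set height, and set number of characters.
        return None not in (self.width_mm, self.height_mm, self.max_chars)


def form_class(has_corner_stones, fields):
    """Apply the two questions: corner stones AND pre-defined fields."""
    if has_corner_stones and fields and all(f.is_predefined() for f in fields):
        return "fixed"
    return "semi-structured"


boxed_survey = [Field("last_name", 120, 5, 20), Field("zip", 30, 5, 5)]
white_space_form = [Field("company_name"), Field("address")]

print(form_class(True, boxed_survey))       # fixed
print(form_class(True, white_space_form))   # semi-structured
print(form_class(False, boxed_survey))      # semi-structured
```

The second case is the common misdiagnosis: the fields sit in the same place on every page, but with no set size or character count they are not pre-defined.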

posted by Chris Riley at 0 Comments