Monday, December 7, 2009

Re-OCR, Lessons learned

To my surprise, I still receive requests from companies needing to start over on their OCR processes. These are companies that used the technology without planning and now find themselves having to repeat their OCR efforts. They fall into two categories.

The first category is companies that have processed large volumes of paper and found the accuracy was not what they expected. This can be discovered in a relatively short time-frame or long after the initial integration of the technology. The fix can be as easy as correcting bad settings for a particular document type or as painful as replacing a poor choice of software.

For companies in category one it's truly a lesson-learned scenario. I work with these companies to evaluate proper OCR settings and to test prospective engines. My hope is that the company at least scanned its documents at a high enough quality that the existing images can be used for the backlog conversion rather than requiring a re-scan, if a re-scan is even possible.

The second category is companies that discovered they were collecting too little data from their documents. This usually happens in data capture environments where a company configures capture for 3 key fields, only to find later that downstream processes require an additional 2 fields. Depending on the severity, it's often better to do day-forward processing with proper settings on new documents and to key in the missing fields for the documents already processed. The reason is that the work of getting the additional fields and reconciling old documents sometimes takes away from day-forward production and may not be worth the additional cost it imposes. The other common practice is to run the backlog documents from scratch through the new process.

The trend in both categories is improper planning by the organization before evaluating technology. It's important for companies to take the time to plan for capture technology. Part of this planning is anticipating future needs for the data. One of the best tricks for exposing the requirements is to involve ALL constituents that create, use, and benefit from the extracted data. Plan, Plan, Plan.

posted by Chris Riley at 0 Comments

Wednesday, December 2, 2009

Playing tricks on your images

Often organizations have no control over the images they receive. Images can come via fax, which has a varied range of resolutions, or they can come as poor scans, neither of which is good for data capture and OCR processes. Fortunately there are a lot of imaging tools and tricks out there to help. None of these tools replaces a good scan, but some get close. One of the tools not often thought about is up-sampling.

Up-sampling is the process of taking an image at a lower resolution and increasing it to a higher resolution. The technology increases the resolution of an image, then fills in the new empty pixels with values predicted from the original image. For data capture and OCR, up-sampling is usually done from 150 DPI or 200 DPI to 300 DPI. Up-sampling technologies have become very impressive and useful, and I will often recommend up-sampling over working with the lower resolution source. But let's talk about the facts, and how and when you should consider up-sampling.

Up-sampling should be considered for documents that have a low amount of noise such as watermarks, spills, stains, stamps, or speckling: essentially, documents that are good quality scans but low resolution. You should avoid up-sampling noisy documents, and also documents with close spacing of elements and text crowding. In those two scenarios it's better to work with the source image as-is and work around the problems.

The bigger the gap between the source resolution and the desired resolution, the greater the risk of fragments after up-sampling. For example, 150 DPI to 300 DPI will not yield the quality that 150 DPI to 200 DPI will. This is why going crazy and up-sampling to the highest possible resolution is not a good idea. It's like taking a very small image and trying to zoom in as far as you can to get detail: you probably won't get it. Trying to trick the system will only hurt you. Up-sampling from 150 DPI to 200 DPI and then again to 300 DPI would not be better than converting directly from 150 DPI to 300 DPI. In fact it would be a pretty big mistake. What you do in that case is magnify the mistakes created during up-sampling, as they now get propagated twice over. This will likely decrease your quality and can result in bloated characters, fuzzy characters, or an abundance of speckling. The goal is to do as few conversions on the document as possible.
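To make the "one conversion only" point concrete, here is a minimal sketch in plain Python. It is not how any particular imaging product works: the function names are my own for illustration, and bilinear interpolation is just one simple way of predicting the new pixel values. The image is assumed to be a grayscale grid (a list of rows of 0-255 values).

```python
def target_size(width_px, height_px, source_dpi, target_dpi):
    """Pixel dimensions after resampling from source_dpi to target_dpi in ONE step."""
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


def upsample_bilinear(pixels, new_width, new_height):
    """Resize a grayscale grid (list of rows, values 0-255) with bilinear interpolation."""
    old_height, old_width = len(pixels), len(pixels[0])
    result = []
    for y in range(new_height):
        # Map the output row back to a fractional row in the source image.
        src_y = y * (old_height - 1) / max(new_height - 1, 1)
        y0 = int(src_y)
        y1 = min(y0 + 1, old_height - 1)
        fy = src_y - y0
        row = []
        for x in range(new_width):
            src_x = x * (old_width - 1) / max(new_width - 1, 1)
            x0 = int(src_x)
            x1 = min(x0 + 1, old_width - 1)
            fx = src_x - x0
            # Blend the four surrounding source pixels.
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        result.append(row)
    return result


# A letter-size page scanned at 150 DPI is 1275 x 1650 pixels.
new_width, new_height = target_size(1275, 1650, source_dpi=150, target_dpi=300)
```

Every new pixel is a guess blended from its neighbors, which is why running the process a second time on already guessed pixels compounds the error. Compute the final size once and resample once.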

I will always defer to a proper scan over any imaging technique, but when you do not have control of the scan, one of the imaging tools to consider is up-sampling. Uneducated use of the technology is unsafe, as is true with all advanced technologies, but if you stick with the facts and pick a great technology you will be successful.

Wednesday, November 25, 2009

Convert now Export later

It's not surprising that an organization's focus in any sort of document automation is the export format and the data coming out of the system. But sometimes this focus has organizations choosing poor data capture and OCR products just for an ideal export format. This occurs most in healthcare and accounting, where industry-specific repositories expect a particular format and the vendors of these repositories are unwilling to change. This post is to assure you that the accuracy and features of your data capture and OCR product are more important than the file format it creates.

By focusing on file export format, organizations limit their range of possible solutions and perhaps lock themselves into a more expensive proposition than they should. Industry-specific applications are able to charge a premium for connectors and their products because they understand where the focus is. However, the most accurate data capture and OCR systems out there are general purpose. Some data capture applications have connectors to, say, a specific accounting system, but even without specific connectors all data capture systems can export data in such a way that it can be converted to ANY desired format.

Data capture applications support CSV, XML, ODBC, or text exports that can be molded into any required format. Because they often support ODBC, there is an opportunity to export directly to any application that also supports it. A conversion utility or a custom connector takes weeks to create, while data capture and OCR technology takes man-years to create, so the focus should be on the accuracy and capability of the OCR and data capture system before its export functionality.
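To show how little work such a conversion can be, here is a minimal sketch that reshapes a CSV export into XML using only the Python standard library. The column names and the `documents`/`document` tag names are made up for illustration; a real repository will dictate its own schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text, root_tag="documents", row_tag="document"):
    """Convert a CSV export (with a header row) into a simple XML string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element(root_tag)
    for record in reader:
        row = ET.SubElement(root, row_tag)
        for field_name, value in record.items():
            # Assumes the CSV header names are valid XML tag names.
            ET.SubElement(row, field_name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical export from a data capture system.
sample_export = "invoice_number,vendor,total_due\nINV-1001,Acme Supply,250.00\n"
xml_output = csv_to_xml(sample_export)
```

A real connector would add validation and error handling, but the shape of the job stays the same: read the general export, write the specific format.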

While it would be ideal to find a data capture application that has the accuracy, the features, and the export you desire, I urge organizations not to limit themselves to it. Picking a poor data capture and OCR system will be far more costly than creating even a custom export from scratch.

Friday, November 20, 2009

Line-Items : Picking the correct field type

Documents containing tables have the majority of their information printed in those tables, so the demand to collect this data is high. In data capture, organizations choose among three scenarios for collecting data from these documents: ignore the table; get the header, the footer, and just a portion of the table; or get it all. Ideally organizations prefer the last, but there are some strategic decisions that have to be made prior to any integration using tables. One of those decisions is whether to capture the data in the table as a large body of individual fields or as a single table block. Let's explore the benefits and downsides of both.

Why would you ever perform data capture of a table with a large collection of individual fields when you can collect it as a single table field? Accuracy. Theoretically it will always be more accurate to collect every cell of a table as its own individual field. The reason is that you will locate fields more accurately, remove the risk of partially collected cells or cells where the baseline is cut, and remove white space or lines from fields. In some data capture solutions this is your only choice. Because of this, many products have made it very easy to duplicate fields and make small changes, so that creating many fields goes faster. This is a great tool, as the downside of treating a table as a collection of individual fields is that the time it takes to create all the fields may be too great to justify the increase in accuracy.
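The "duplicate a field and make small changes" idea is easy to picture as code. The sketch below is my own illustration, not any product's API: it assumes a table with fixed row height and known column widths, and it generates one field zone per cell by repeating the same definition with shifted coordinates.

```python
from dataclasses import dataclass


@dataclass
class FieldZone:
    name: str
    left: int
    top: int
    width: int
    height: int


def cell_fields(columns, table_left, table_top, row_height, row_count):
    """Build one field per table cell.

    columns: list of (column_name, column_width_px), left to right.
    """
    fields = []
    for row in range(row_count):
        left = table_left
        for column_name, column_width in columns:
            fields.append(
                FieldZone(
                    name=f"{column_name}_{row + 1}",
                    left=left,
                    top=table_top + row * row_height,
                    width=column_width,
                    height=row_height,
                )
            )
            left += column_width
    return fields


# Hypothetical invoice table: 3 columns, 3 rows, pixel coordinates at scan resolution.
fields = cell_fields(
    [("qty", 100), ("description", 600), ("amount", 200)],
    table_left=150, table_top=900, row_height=50, row_count=3,
)
```

Even in this toy case, 3 columns by 3 rows is already 9 fields to review, which is exactly the setup cost weighed against the accuracy gain.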

If your data capture application can collect data as a single table block, you are able to do the setup for any one document type very quickly. Table blocks require document analysis that can identify table structures in a document. The table block relies heavily on the identified tables and then applies column names per the logic in your definition. This is what creates its simplicity but also its problems. Sometimes document analysis finds tables incorrectly, and more often only partially. This can cause missing columns, missing rows, and in the worst case rows where the text is split vertically between two cells or columns that are cut in half horizontally.

Tables vary widely in complexity, and this is most often the deciding factor in which approach to take. Very often the accuracy required, and the amount of integration time needed to obtain that accuracy, also determine the approach. For organizations that want line-items but do not require them, table blocks are ideal. For organizations needing high accuracy and processing high volume, individual fields are ideal. In any case it's something that needs to be decided prior to any integration work.

Tuesday, November 17, 2009

Data Capture – Problem Fields

The difference between easy data capture projects and more complex ones often has to do with the type of data being collected. For both hand-print and machine-print forms, certain fields are easy to capture while others pose challenges. This post discusses those “problem fields” and how to address them.

In general, fields that are not easily constrained and don't have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, etc. Then there are the middle-ground fields, such as dollar amounts and invoice numbers. The problem fields are addresses, proper names, and items.

Address fields are, for most people, surprisingly complex. Many would like to believe that address fields are easy. The only way to capture address fields very easily would be to have, in the US for example, the entire USPS database of addresses that the USPS itself uses in its data capture. It is possible to buy this database. If you don't have it, the key to addresses is less constraint. Many think that you should specify a data type for address fields that starts with numbers and ends with text. While this might be great for 60% of the addresses out there, by doing so you have taken every exception address to 0%. It's best to let the engine read what it's going to read and only support it with an existing database of addresses if you have one.
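Here is a small sketch of both halves of that advice. The strict pattern is my own rendering of "starts with numbers and ends with text", and the lookup uses Python's standard `difflib` as a stand-in for whatever database matching your capture product offers. The addresses are invented.

```python
import difflib
import re

# A strict "data type": digits first, then letters and spaces only.
STRICT_ADDRESS = re.compile(r"^\d+\s+[A-Za-z ]+$")


def passes_strict_type(text):
    """True if the text fits the over-constrained address pattern."""
    return bool(STRICT_ADDRESS.match(text))


def closest_known_address(ocr_text, known_addresses, cutoff=0.8):
    """Return the best match from a list of known addresses, or None."""
    matches = difflib.get_close_matches(ocr_text, known_addresses, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known = ["123 Main Street", "500 Oak Avenue", "PO Box 1234"]

# Perfectly valid addresses that the strict type would reject outright.
rejected = [a for a in ["PO Box 1234", "One Main Street", "123 Main Street"]
            if not passes_strict_type(a)]

# An unconstrained read with one misrecognized character, repaired by lookup.
repaired = closest_known_address("l23 Main Street", known)
```

The strict type throws away "PO Box 1234" and "One Main Street" before the engine gets a chance, while the loose read plus lookup recovers the address even with a bad character in it.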

Proper names are next in complexity after addresses. Proper names can be a person's name or a company name. It is possible to constrain the number of characters and, for the most part, eliminate numbers, but the structure of many names makes recognizing them complex. If you have an existing database of the names that will appear on the form, you will excel at this field. As with addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can either be a breeze or very difficult, and it comes down to the organization's understanding of their structure and whether it has supporting data. For example, if a company knows exactly how its item codes are formed, then it's very easy to process them accurately with an associated data type. The best trick for items is, again, a database with supporting data.
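As an illustration of an item-code data type, suppose (purely as an assumption) that a company's codes are always two letters, a dash, and five digits. Knowing that, a short check can validate a read and even repair common look-alike characters in the digit positions. The look-alike table is my own addition, not something every product provides.

```python
import re

# Hypothetical item code format: two capital letters, a dash, five digits.
ITEM_CODE = re.compile(r"^[A-Z]{2}-\d{5}$")

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "I": "1", "l": "1", "S": "5", "B": "8"})


def normalize_item_code(ocr_text):
    """Return a valid item code, or None if the text cannot be made to fit."""
    prefix, separator, digits = ocr_text.strip().partition("-")
    if not separator:
        return None
    # Only the digit positions are corrected, because we know digits belong there.
    candidate = prefix.upper() + "-" + digits.translate(DIGIT_LOOKALIKES)
    return candidate if ITEM_CODE.match(candidate) else None
```

For example, `normalize_item_code("AB-1O3l5")` returns `"AB-10315"`, while a read that is too short, such as `"AB-123"`, returns `None` and can be routed to manual review.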

As you can see, the common trend is finding a database with existing supporting data. Knowing the problem fields focuses companies and gives them a plan of attack for creating very accurate data capture.

Sunday, November 8, 2009

Tax Return OCR

If you are thinking about using data capture to read text from tax returns, now is the time to start thinking about the steps to accomplish this. Reading typographic tax returns from current and previous years has proven to be very accurate and a great use of data capture and OCR technology. Tax returns fall into the medium-complexity category to automate. There are a few things that make tax returns unique.

Checkmarks: Tax returns have two types of checkmarks. The first type is standard and printed in the body of the document; these can be handled like all other common checkmark types. The other type is unique to tax forms and typically sits on the right side of the document. These are boxes that can be filled with either a character or a checkmark symbol. With these checkmarks the best approach is to create a field the entire size of the area where the checkmark can be printed and set the checkmark type to “white field”. In this case the software expects only white space, and the presence of enough black pixels will cause the field to be considered checked.
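The “white field” idea boils down to counting dark pixels. Here is a rough sketch of that logic, assuming the field region has already been cropped to a grayscale grid (0 is black, 255 is white). The thresholds are placeholder values of my own; a real engine will use its own tuning.

```python
def is_checked(region, black_threshold=128, min_black_ratio=0.05):
    """Treat a field as checked when enough of its pixels are dark.

    region: list of rows of grayscale values (0 = black, 255 = white).
    """
    total = sum(len(row) for row in region)
    if total == 0:
        return False
    black = sum(1 for row in region for value in row if value < black_threshold)
    return black / total >= min_black_ratio


# A blank 10 x 10 field, and the same field with a diagonal stroke through it.
blank = [[255] * 10 for _ in range(10)]
marked = [[0 if x == y else 255 for x in range(10)] for y in range(10)]
```

Because the test only looks for "enough black", it does not care whether the box holds an X, a check symbol, or a letter, which is what makes it a good fit for these tax form boxes.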

Tabular Data: Much of the data in a tax form is presented as a table. When considering capturing data from a table, organizations have to decide whether to capture each cell of the table as its own field OR to capture the table as a single table field that must be parsed later. This can dramatically affect the exported results, so knowing beforehand is very important.

Delivery Type: Tax forms usually come either as eFile, which is a pixel-perfect document that is never printed and never scanned, or as a document received first as paper and then scanned. For the most part the eFile version of the tax form will be more accurate; however, the eFile version has non-traditional checkmarks that could cause a problem. Organizations need to decide whether to process all delivery types together as a single type or to separate them. There are advantages to both: by combining them integration time is less, and by separating them accuracy is higher.

I would much rather OCR a tax return than file one. Because of this, my skills at processing tax returns are better than my skills at creating them, and I hope today I imparted some of that knowledge.

Wednesday, November 4, 2009

You can read the fine-print

As fonts get smaller, the challenge of reading them with OCR software increases. However, there are some key things organizations should be aware of when reading the fine-print.

OCR technology today is capable of reading fonts as small as 8 pt, even 6 pt, very accurately. It used to be that unless you had a 12 pt font you stood no chance. Because of the increased quality of scans and more advanced OCR engines, reading small fonts can be no problem if the right approaches are used.

Small fonts have a higher sensitivity to image quality and to degradation of the document. For this reason, original source images scanned at 300 DPI or higher are necessary. For normal fonts there is seldom reason to scan higher than 300 DPI, but for small fonts the goal is to get them to appear more or less the same as the regular fonts, so scanning them at 400 to 600 DPI is useful. Additionally, it is very important that documents are “clean”. A smudge or spill on a document impacts a smaller font many times more than a larger font because of the closeness of the lines. Once you have good image quality you can start the conversion.
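The 400 to 600 DPI range falls straight out of the arithmetic. A point is 1/72 of an inch, so the nominal height of a font in pixels is its point size times the DPI, divided by 72. The helper names below are my own, and point size is only a nominal measure of a font's height, so treat the numbers as approximate.

```python
def font_height_px(point_size, dpi):
    """Nominal height of a font in pixels at a given scan resolution."""
    return point_size * dpi / 72


def dpi_to_match(small_point_size, reference_point_size=12, reference_dpi=300):
    """Scan resolution at which a small font has the same pixel height
    as the reference font does at the reference resolution."""
    return reference_dpi * reference_point_size / small_point_size


normal = font_height_px(12, 300)   # 50.0 pixels
small = font_height_px(6, 300)     # 25.0 pixels, half the detail
matched = font_height_px(6, 600)   # 50.0 pixels, same as 12 pt at 300 DPI
```

By the same arithmetic, `dpi_to_match(8)` gives 450 and `dpi_to_match(6)` gives 600, which lines up with the 400 to 600 DPI recommendation for fine print.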

The next best benefit for small fonts is for them to be zoned separately. Zoning is the process of rubber-banding the region where the text exists. When small fonts are grouped in the same zone with normal sized fonts, the OCR software assumes that they should be the same size, and the confidence and accuracy go down. If you zone the small fonts separately, you increase the OCR engine's ability to use experts just for small fonts and increase the accuracy on them.
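If you already know the height of each text line (from a layout analysis step, for example), separating the zones can be automated. This is only a sketch under that assumption: the input format and the 25% tolerance are my own choices, and real zoning tools work on page regions rather than a simple list.

```python
def zone_by_line_height(lines, tolerance=0.25):
    """Group consecutive text lines of similar height into zones.

    lines: list of (top_px, height_px) tuples in reading order.
    Returns a list of zones, each a list of the lines it contains.
    """
    zones = []
    for line in lines:
        height = line[1]
        if zones:
            # Compare against the first line of the current zone.
            reference = zones[-1][0][1]
            if abs(height - reference) <= tolerance * reference:
                zones[-1].append(line)
                continue
        zones.append([line])
    return zones


# Two lines of body text followed by three lines of fine print.
page_lines = [(100, 50), (160, 48), (230, 25), (260, 24), (290, 26)]
zones = zone_by_line_height(page_lines)
```

Here the body text and the fine print land in two separate zones, so each can be recognized with settings suited to its size.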

Next time someone tells you to read the small print, tell them you won't read it; you will scan and OCR it.

Monday, November 2, 2009

Users of OCR doing it right

Good article on acquiring OCR technology from a service bureau and end-user perspective. I especially like the point about soft costs, which is in line with my recent market education on planning.


8 things to consider when deciding to buy or rent OCR capabilities

Monday, October 26, 2009

Black belt in data capture processes an EOB

Explanation of Benefits (EOB) documents, next to student transcripts, are unquestionably the most difficult documents to automate. The value of automating these documents, however, is tremendously high, as they are very expensive to data enter. Three years ago the fad for automating these documents was to use semi-structured data capture to locate information no matter the variation. Companies buying into this fad quickly found themselves in an expensive and deep data capture implementation. This is where I get to tout the power of simplicity and beat down the over-complicators.

Just as a Sensei would practice meditation before a bout to calm the nerves, so should an implementer of data capture when facing the bloody battle with EOB documents. Simplicity is key when processing EOBs. Organizations should:

1.) Consider processing first those EOBs that are clear. Clarity is a vague term and includes document structure and scanning quality. But because of the variation across EOB types, it's best for an organization to focus first on automating the best quality documents, the ones it knows will provide the highest accuracy, and then move on to the rest once it has succeeded.

2.) Consider classification as a primary step. If you can very accurately classify EOBs by type, then you don't need to rely on semi-structured technology across every EOB variation; you simply need to isolate each type and use a combination of coordinate-based and semi-structured field location. Because you are working with a single type, you will be far more accurate in locating the fields and reading them.

3.) Ignore document structure. Very often EOBs don't follow their own document structure, especially when it comes to tables. Often EOBs have tables within tables, or data in tables that does not align with the table headings. Additionally, EOBs have patients that span pages, and totals for items on previous pages. EOBs should be thought of as a collection of lines that starts with a header (easy to collect the data) and ends with a footer (also easy to collect the data). Your job then is to classify lines and extract data per line.

4.) Extract the data then convert it. In EOB processing there are many items contained within the EOB that have to be converted to another format prior to reconciliation. When trying to extract data if you focus on the conversions they often muddy up the extraction process. First very accurately get the data from the paper then convert it to the desired format.
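The "collection of lines" approach from point 3 can be sketched in a few lines of Python. Everything specific here is hypothetical: the line layouts, the regular expressions, and the field names are invented for one imaginary EOB type, which is also why classification by type (point 2) comes first. Values are kept as raw text so that conversion (point 4) can happen as a separate step afterwards.

```python
import re

# Patterns for ONE hypothetical EOB type; a different payer would need its own set.
PATIENT_LINE = re.compile(r"^Patient:\s*(?P<name>.+)$", re.IGNORECASE)
SERVICE_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<code>\S+)\s+"
    r"(?P<billed>\d+\.\d{2})\s+(?P<paid>\d+\.\d{2})$"
)
TOTAL_LINE = re.compile(r"^Total\b.*?(?P<amount>\d+\.\d{2})$", re.IGNORECASE)


def classify_line(text):
    """Return (line_type, extracted_fields) for one line of recognized text."""
    text = text.strip()
    for label, pattern in (
        ("patient", PATIENT_LINE),
        ("service", SERVICE_LINE),
        ("total", TOTAL_LINE),
    ):
        match = pattern.match(text)
        if match:
            return label, match.groupdict()
    return "other", {}


def extract_service_lines(lines):
    """Extract service rows, carrying the current patient across lines and pages."""
    current_patient = None
    rows = []
    for text in lines:
        label, data = classify_line(text)
        if label == "patient":
            current_patient = data["name"]
        elif label == "service":
            rows.append({"patient": current_patient, **data})
    return rows


sample = [
    "Patient: Jane Doe",
    "10/26/2009 99213 120.00 85.50",
    "Remarks: see attached",
    "10/27/2009 80050 60.00 45.00",
    "Total paid 130.50",
]
rows = extract_service_lines(sample)
```

Because the patient is simply remembered from the last patient line seen, a page break in the middle of a patient's services does not matter, and no table detection is involved at all.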

For those who are currently processing EOBs and receiving the great value that automation can provide, you truly are black-belts of data capture and have mastered the nuances of document automation. For those of you wanting to process EOBs, it's very possible, just keep it simple.

Friday, October 16, 2009

Not that you want to pay that invoice any faster

But you can, and you can at a lower cost, and perhaps take advantage of net discounts. With Data Capture and OCR technology you can automate the entry and routing of commercial invoices. The reality for organizations that receive many invoices a day is that the accounting department is paying high salaries and taking time away from other activities to data enter paper invoices. Using recognition technology to replace this process has been a tremendous benefit to many organizations. There are a few keys to success.

Start out simple: don't try to tackle the entire paper world with your solution. First identify the process and where the opportunities for savings are. Usually the biggest opportunity is going to be in the entry of data into some accounting system. To automate this you will need data capture and scanning capabilities. Starting out simple does not mean overlooking all the possibilities; it means finding the technology that will fit your wildest dreams of automation but starting out slow with it. More specifically with invoices, start by scanning, then move to getting vendor, invoice number, and total due using recognition technology, and so on.

Wait for an ROI before you make a major change: These technologies, if implemented correctly, can provide a great return on investment. Sometimes organizations make the mistake of not waiting until they get an ROI before making another major change. The change will likely have positive results, but it requires another round of effort and could be problematic. It does not allow you to see when the value of the technology starts kicking in, and it could have you repeating effort. Wait until you succeed at a basic implementation before you seek even more cost savings. Saving money is addicting, but let each phase actualize itself.

Never forget your business process is boss: Organizations have processes that are set in stone. Staff understand how to execute them, technology is set up to facilitate them, and other processes are feeding or fed by them. Sometimes new technology is so exciting that it pushes you to change what you are doing right when you acquire it. Often organizations don't realize the upstream and downstream impact of dramatically changing business processes. A technology should give you the option to keep doing what you are doing, only faster, or to change things if you choose. At first try to keep things as consistent as possible with the AP business processes already in place, then look for process improvement later.

No, maybe you don't want to pay that invoice faster, but you do want to reduce the cost of working with it. With Data Capture and OCR you can save a ton as long as you prepare yourself and do your homework.

Wednesday, September 23, 2009

The Magic of 300DPI

Many users of OCR don't realize what the impact of resolution and bit-depth is, or even what they are. Usually in the case of OCR more is better: more resolution, more bit-depth. It's more information the OCR engine can use to interpret text. But as with many things there is a point of diminishing returns, and as it relates to image resolution, the diminishing returns are very interesting.

You will hear a lot that 300 DPI is the best resolution at which to scan an image for OCR. But why? 300 DPI is that magic number where you gain the most accuracy without sacrificing speed and file size. If you were to put the resolutions on a progressive line starting with 96 DPI and run tests of OCR accuracy, scanning speed, OCR speed, and file size, you would notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan will be at least 2 times the improvement gap between any other pair of resolutions. Now if you look at the same line between 300 DPI and 400 DPI, the improvement gap is nearly absent, but still there. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let's look at why.
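The file size side of that trade-off is simple arithmetic, and it is worth seeing once. The sketch below computes the uncompressed size of a scanned page; it says nothing about accuracy, which you have to measure with your own engine and documents. The function name and the letter-size page are my own choices for illustration.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) image size for a page scanned at a given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Letter-size page (8.5 x 11 inches), black and white (1 bit per pixel).
sizes = {dpi: uncompressed_size_bytes(8.5, 11, dpi, 1) for dpi in (200, 300, 400)}
# 200 DPI ->   467,500 bytes
# 300 DPI -> 1,051,875 bytes
# 400 DPI -> 1,870,000 bytes
```

Pixel count grows with the square of the resolution, so going from 300 to 400 DPI adds roughly 78% more raw data for an accuracy gain that, as described above, is nearly absent.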

There is one major reason that 300 DPI is optimal, besides its reasonable scan speed and reasonable file size: the engine cores were all initially trained on this resolution. Some engines, no matter what resolution you give them, will actually sample up or down to get to 300 DPI. The image pre-processing/cleanup engines are similarly set up.

There are always exceptions, and the exceptions are usually hand-printed forms (ICR) or documents with small print.

The beauty of the 300 DPI best practice is that it's one of the few things in the area of OCR and Data Capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.

Friday, September 18, 2009

When you got it design it – Form Design

Companies using Data Capture technology do not often have the chance to change their form design or even create new forms. If you have this ability, USE IT! A properly designed form is the first step to success in automating that form. There are many things you can do to make sure your form is as machine readable as possible. Typically the forms we are talking about are hand-written, but occasionally they are machine filled. I will highlight the major points.

1. Corner stones. Make sure your form has corner stones in each corner of the page. The corner stones should be at 90 degree angles to their neighbors, and the ideal type is a black 5 mm square.

2. Form title. Use a clear title in 24 pt or larger print, with no stylized font.

3. Completion guide. This is optional, but sometimes it is useful to print a guide at the top of the form on how best to fill in the type of fields you use.

4. Mono-spaced fields. For the fields to be completed it's best to use field types with character-by-character separation. Each character block should be 4 mm x 5 mm, and blocks should be separated by 2 mm or more. The best types of fields to use, in order, are letters separated by a dotted frame, letters separated by a drop-out color frame, and letters separated by complete square frames.

5. Segmented fields by data type. For certain fields it will be important to segment the field into portions to enhance ICR accuracy. The best example is a date: instead of having one field for the complete date, split it into 3 separate parts, the first being a month field, next a day field, and finally a year field. The same is often done for numbers, codes, and phone numbers.

6. Separate fields. Separate each field by 3 mm or more.

7. Consistent fields. Make sure the form consistently uses the field types stated in 4.

8. Form breaks. It's OK to break the form up into sections and separate those sections with solid lines. This often helps template matching.

9. Placement of field text. For the text that indicates what a field is (“first name”, “last name”), it is best to left-justify it to the left of the field at a distance of 5 mm or more. DO NOT put the field descriptor in drop-out inside the field itself.

10. Barcode. Barcode form identifiers are useful in form identification. Use a unique ID per form page and place the barcode at the bottom of the page at least 10 mm from any field.
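The measurements in the list above are easy to turn into a quick pre-flight check on a draft form. This is only a sketch: the dictionary keys are names I made up, and you would have to measure or export these values from your own form design tool. The millimetre-to-pixel helper is handy when setting up zones on a 300 DPI scan.

```python
MM_PER_INCH = 25.4

# Minimums taken from the checklist above.
RULES = [
    ("char_gap_mm", 2),      # point 4: gap between character blocks
    ("field_gap_mm", 3),     # point 6: gap between fields
    ("label_gap_mm", 5),     # point 9: gap between field text and field
    ("barcode_gap_mm", 10),  # point 10: gap between barcode and any field
    ("title_pt", 24),        # point 2: form title size
]


def mm_to_px(mm, dpi=300):
    """Convert a distance in millimetres to pixels at a given scan resolution."""
    return round(mm / MM_PER_INCH * dpi)


def check_form(layout):
    """Return a list of rule violations for a form layout (hypothetical keys)."""
    problems = []
    for key, minimum in RULES:
        if layout[key] < minimum:
            problems.append(f"{key} is {layout[key]}, expected at least {minimum}")
    return problems


draft = {
    "char_gap_mm": 2,
    "field_gap_mm": 2,   # too tight
    "label_gap_mm": 5,
    "barcode_gap_mm": 12,
    "title_pt": 24,
}
problems = check_form(draft)
```

For this draft the only complaint is the 2 mm field gap. As a reference point, `mm_to_px(5)` is 59, so a 5 mm corner stone is about 59 pixels wide on a 300 DPI scan.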
