Friday, March 5, 2010

Workflow, super-charge with OCR

Document workflow can be as simple as saving a file to a single location or as complex as decision-tree document routing rules. Throw some paper into the mix and the problem intensifies. Getting your paper documents to fit your already accepted digital document workflow can be challenging. Some organizations choose to keep the paper and digital workflows separate. Others unite them but create separate rules for each. For most, however, it would be ideal to have a single workflow engine or product supporting digital, image, and paper documents alike.

To do so with the greatest value, you need not only document conversion using Optical Character Recognition (OCR), but also some other advanced imaging and recognition tools. In the digital document world, you don't have only the data contained in the document; you also have various metadata items such as file name, file location (taxonomy), tags, etc. In order to marry paper with digital, the same metadata has to be created for the paper document, and this has to occur at the time of document processing. This could be a manual or an automated process, and depending on your paper volume, doing it manually may be OK. To compete with the efficiency of digital documents, however, automatic is the way to go.

Using OCR together with image-based and contextual classification, paper or image documents that enter the workflow can obtain the same value as digital documents. The OCR is responsible for getting all the content from the document. This content is used for search, indexing, and auto-filing, as well as for generating keywords (tags) associated with a taxonomy. In order to determine where the document fits into a taxonomy, you must first classify it.

For classification to be most effective, it happens on two levels. Image-based classification, which considers what the document looks like, classifies documents based on their physical structure, which is a good indicator of type and is very fast to evaluate. Contextual classification, which considers what words are contained in the document, goes one level deeper and looks for the keywords that would make a document one type rather than another. For some environments, image-based classification can do the job entirely. Once the classification is known, a classification engine can place the document in the correct spot in an existing taxonomy. Once an ID or classification is determined, it is no challenge to apply tags, a file name, and a file location to a document.
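The two levels can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular classification engine works: the layout signatures, the keyword lists, and the `classify` function are all hypothetical names made up for the example.

```python
# Hypothetical two-level classifier: layout first, keywords second.
# The layout signatures and keyword lists are invented for illustration.

# Level 1: image-based. A "layout signature" stands in for whatever
# structural features a real engine would extract from the page image.
LAYOUT_TYPES = {
    "table-heavy": ["invoice", "bill_of_lading"],
    "letter": ["correspondence"],
}

# Level 2: contextual. Keywords that make a document one type over another.
KEYWORDS = {
    "invoice": {"invoice", "total", "due"},
    "bill_of_lading": {"carrier", "consignee", "shipper"},
    "correspondence": {"dear", "sincerely"},
}


def classify(layout_signature, ocr_text):
    """Return a document type, or None if neither level can decide."""
    # An unknown layout narrows nothing, so every type stays a candidate.
    candidates = LAYOUT_TYPES.get(layout_signature, list(KEYWORDS))
    if len(candidates) == 1:
        # Image-based classification did the job entirely.
        return candidates[0]
    words = set(ocr_text.lower().split())
    # Score each remaining candidate by how many of its keywords appear.
    scores = {doc_type: len(KEYWORDS[doc_type] & words) for doc_type in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify("letter", ""))
print(classify("table-heavy", "Shipper ACME Carrier XYZ"))
```

The fast layout check runs first and the OCR text is only consulted when the layout leaves more than one candidate, which mirrors the order described above.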

Workflow can stand alone, but injected with the power of OCR and document classification, it becomes a powerhouse that does not know the difference between paper and digital.

posted by Chris Riley at 0 Comments

Tuesday, February 9, 2010

Mysterious tables

In the world of data capture, the one document element that easily doubles the complexity, increases software cost, and is all-around mysterious is the table. In invoices, table data is all line-item details; in bills of lading, it is all shipping details. Many commonly used business documents contain tables. Extracting data from tables starts with a clear understanding of table structure.

Most tables out there follow the typical structure: a header with column names, one to many rows of data below that align to the column names, and a footer which may contain summation data. This structure is ideal. The first added element of complexity occurs when column names do not align with the data. This can happen intentionally or due to shifts in scanning. If this is a constant or common enough occurrence, then it's necessary in data capture setup to ignore table headers completely. The next level of complexity is multi-level headers. Tables structured with multi-level headers basically amount to tables within tables. There are at least two levels of headers, the first being the parent, with the subsequent levels providing additional details, usually for a lesser number of items. The levels are usually indicated by deeper indents per level. This is most commonly found in Explanations of Benefits (EOBs), and it is what makes EOBs so complex. In this case, you have to capture multiple copies of the same table over and over, and not attempt to collect all the details as one table. In the most complex documents with this structure, the table data capture element is not used at all; a basic field-by-field approach is used instead.

One of the biggest mistakes integrators make is assuming a certain data capture table approach will work for all their tables on all documents. The only way to know for sure is testing. The ability of data capture software to find table structures is based on a process called Document Analysis. Document Analysis tells the data capture software where ALL tables on the document are located, allowing it to choose the best one. In the case of tables within tables, this very often results in a single table that cuts data cells in half. Document Analysis is built on probability, so if the borders of cells in one column most often fall at a certain location, then that border is selected, right or wrong. The more data in a table, the greater the chance of this probability being wrong.

It's best to use tables on concrete document types, i.e. a single variation of a vendor invoice, or a class of vendor invoices that all share the same table type. If you prepare, you will not be let down by bad expectations; instead, you will be impressed with your table extraction.

posted by Chris Riley at 0 Comments

Tuesday, January 12, 2010

Dual Stream Scanning – Have your cake and eat it too

The benefit of drop-out forms is that they are very accurate in data capture. The downside to drop-out forms is that after they are scanned they aren't much to look at. Companies want the best of both drop-out and black and white forms. They do this in various ways, the most common being to just deal with the images they have. Some will scan a document twice, which is very time consuming. Others will use an overlay utility that stamps the original form fields and labels back on an already processed drop-out image. These utilities are accurate, but the result is not as accurate as the original, and they often leave lines stamped on text. The best solution for getting a form scanned efficiently in a way that is optimum for both data capture and viewing is to use dual stream scanning.

Dual stream scanning is usually a feature of higher-end scanners. The technology is slowly moving down to workgroup and desktop scanners. The feature allows a single scan to produce both a drop-out image and a black and white image. The scan speed is the same as if you were scanning in color. When configured, the drop-out image goes down one path and the black and white image down another. By doing so, a company can use the drop-out image only for data capture, and the black and white image is married with the data capture results in the database or file system.

Data capture on a drop-out form is on average 15% more accurate than on a black and white scanned form, and the difference is often much higher. The reason is that the OCR in data capture is not interfered with by form lines printed on or too close to text. Additionally, the logic to locate fields can be simplified, as field labels are often in a small font and hard to detect.

It's simple and has the greatest accuracy of any solution; dual stream is a great tool.

posted by Chris Riley at 0 Comments

Thursday, January 7, 2010

Print to OCR?

When I talk to people about the unique technique of printing text documents to image just for the purpose of running optical character recognition (OCR) or data capture on them, they are rightfully confused and think I'm a little nuts.

Why would you ever convert an already digital document back to image? I promise it's not because I'm so fond of OCR; it actually has its purpose.

Language Detection: By converting a document to image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on a digital file, no robust tool exists to do this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engines, if I convert a digital file to image and OCR it, I can tell you what languages exist in that document. Additionally, font, while a clear indicator of language, will not tell a digital process what the language is unless it is accompanied by the proper language encoding; in OCR there is no need for such an encoding.

Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third-party tool look identical to the viewer, internally these PDF files are very different. In order to accurately parse a PDF file digitally, you need a standard format. If you do not have a standard format, you are dealing with variations in the document both visually and structurally. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoice layouts multiplied by the number of PDF-generating applications in use. However, if you were to OCR the PDF and parse the result rather than parse it digitally, then you are dealing only with the number of variants that exist in the invoices themselves.

However crazy it sounds, the two scenarios above are real, and there are many more. I doubt that these problems will always exist, but they make you think twice about seemingly crazy statements such as printing a digital document to image just so you can OCR it.

posted by Chris Riley at 0 Comments

Monday, December 28, 2009

Capture Products, Data Capture Products, confused?

All technology markets are guilty of coming up with at least one or two confusing terms. In the document imaging world, it's terms with very similar sounding names. They are technically similar, but strictly different.

One of the most confusing things in the imaging world is the difference between Image Capture software, often just called Capture, and Data Capture software. Not only are the names confusing, but technically there is a lot of overlap. All data capture products have imaging capabilities, and all capture products have basic data capture. The risk of the confusion is substituting one product for the other. For example, organizations that attempt to use the data capture functionality built into a capture application for a full-blown project end up with little success and a lot of frustration. Let me explain where they fit.

Capture products have the primary function of delivering quality images in a proper document structure. They often feature image clean-up, review, and page splitting tools that are more advanced than the scanning found in data capture applications. Most offer what is called rubber-band OCR, the reading of a specific coordinate zone on a page. Some go as far as creating templates where coordinate zones are saved. This is where the solutions get confused with data capture. Until there is registration of documents and a proper forms processing approach, it is not data capture. The risk of such basic templates is low accuracy and zones that do not always collect data.

Data capture products need images to function, so it was an obvious choice to add scanning to the solutions. These solutions, however, are better fed by a full capture application that has the performance and the additional features, such as batch naming, annotations, page splitting, etc., that the organization may require in the resulting image files. In data capture products, the purpose of image capture is getting data only, and they sometimes neglect the features that are important for image storage and archival.

In the end, both solutions are improving in the other's territory. Eventually the lines will blur to the point where feature-wise they will be identical, and the benefit of one over the other will be rooted in the vendor's expertise, either capture or data capture. If your primary requirement is quality images, a capture vendor's solution is the best choice, but if it's data extraction, then solutions rooted in data capture are better.

posted by Chris Riley at 0 Comments

Wednesday, December 23, 2009

Expectations bite the dust

Just this morning, I was reminded of why market education is so important. I received an email from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that change from variation to variation. In an attempt to help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy (field level), field location accuracy, and amount of time to process per document. This is a step more advanced than a typical user, who will clump all of this into one category. In addition to these, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the documents they are working with are pixel perfect. They expect a 20 page document to be processed in 4 minutes, which is also reasonable and right on the line. Finally, they expect field location to be 100%. RED FLAG!

This is not the first time that there is an assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people seem to be learning more about the technology, this is the next class of common fallacy. And because the organization did not specify template matching accuracy, it means they must also assume templates match 100% of the time to get 100% field location accuracy. Trouble.
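To see why the unstated template matching figure matters, it helps to multiply the stages together. The Python sketch below assumes the three stages fail independently, which is a simplification of my own. The 90% OCR figure is the customer's; the 95% and 97% figures are made up purely for illustration.

```python
def end_to_end_accuracy(template_match, field_location, ocr):
    """Chance that a field is matched, located, and read correctly,
    assuming the three stages fail independently (a simplification)."""
    return template_match * field_location * ocr


# The customer's stated expectations: 100% field location, 90% OCR,
# and an implied 100% template match.
expected = end_to_end_accuracy(1.00, 1.00, 0.90)

# Hypothetical, more realistic figures for the first two stages.
realistic = end_to_end_accuracy(0.95, 0.97, 0.90)

print(f"expected:  {expected:.1%}")
print(f"realistic: {realistic:.1%}")
```

Even small shortfalls in template matching and field location pull the overall number well below the OCR accuracy alone, which is why each stage deserves its own stated expectation.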

It's clear why 100% field location accuracy is important to them: basic QA processes are capable of checking only recognition results (OCR accuracy), not the locations of fields. Instead of modifying QA processes, the organization's first thought was how to eliminate the problems that QA might face. 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason it's not possible is that even in a pixel perfect document, there are situations where a field might be located partially, located in excess, or not located at all. The scenario that most often occurs in pixel perfect documents is that text may sometimes be seen as a graphic because it's so clean, and text that is too close to lines is ignored. So in these types of documents, a field error is usually a partially located field. Most QA systems can be set up such that rules are applied to check the data structure of fields, and if the data contained in them is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied to data capture.
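A rule that checks the data structure of a field can be as simple as a pattern match. The sketch below is a hypothetical example in Python: the field names and patterns in `FIELD_RULES` are invented, and a real QA system would have its own way of defining such rules.

```python
import re

# Hypothetical field rules: each field name maps to the pattern that its
# captured value should match in full.
FIELD_RULES = {
    "invoice_number": re.compile(r"INV-\d{6}"),
    "invoice_date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "total": re.compile(r"\d+\.\d{2}"),
}


def fields_needing_review(captured):
    """Return the names of fields whose value is missing or fails its rule."""
    flagged = []
    for name, pattern in FIELD_RULES.items():
        value = captured.get(name, "")
        if not pattern.fullmatch(value):
            flagged.append(name)
    return flagged


# A partially located invoice number (only four of six digits captured)
# fails its rule and is sent to an operator.
result = {"invoice_number": "INV-0012", "invoice_date": "12/23/2009", "total": "145.00"}
print(fields_needing_review(result))
```

A partially located field tends to break the expected structure, so a check like this catches many location errors even though it never looks at coordinates.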

After further conversation, it became clear that the data capture solution was being forced to fit into a QA model. There are various reasons why this may happen: license cost, pre-existing QA, or misunderstanding of QA possibilities. This is very common for organizations and very often problematic. Quality assurance is a far more trivial process to implement than data capture. When it comes to data capture, it is more important to focus on the functionality of the data capture system and develop a QA process that makes its output most efficient.

Again, a case of expectations and assumptions.

posted by Chris Riley at 0 Comments

Friday, December 11, 2009

Outsourcing document recognition

It's common for organizations to outsource their scanning and document conversion. Organizations sometimes find that the skill required, the convenience factor, and the liability are worth the additional cost. Other organizations that have one-time backlog conversions save money by using an outsourcing company vs. bringing the software in-house. In recent years, service bureaus and business process outsourcing (BPO) companies have dramatically improved their use of recognition technology, if they are utilizing it, and prices have dropped substantially. Though as an organization that chooses to outsource you are removing the responsibility of picking document conversion technology, do you know what technology your service bureau is using?

YOU SHOULD! Absolutely you should be concerned about the OCR and Data Capture technology that your outsourcing company is using. It's no less important than if you were bringing the technology in-house. It's your job to make sure your vendor is using not just the best technology, but using it in the best way. The education level varies between outsourcing companies, and each often specializes in one document type or one type of processing. Proper evaluation of a service bureau will include review of sample results. You should have your prospective service bureau or BPO run a good number of your production documents and provide you with the results. Make sure the technology they used to produce the results is the same that will be used in production. Don't be afraid to ask the vendor what engine or engines are being used, even what version. Make sure you understand how your vendor handles exceptions.

While it's easy to overlook these items when you are looking at a service instead of a technology, it's important that you are educated. Service bureaus make money based on how much they save. This can occasionally create motives to use poor technology to gain greater margin. Some outsourcing companies put customers into categories by volume, and those with the greater volume get the best technology. Most of the outsourcing companies out there are very good at ensuring their document quality, and many will even go as far as to give you a guarantee on quality. But the nature of production environments is such that you cannot always check everything. It's about the relationship. Sometimes paying a higher price per page for a better solution is worth it!

posted by Chris Riley at 0 Comments

Tuesday, December 8, 2009

On the fly OCR, Click-Entry and Rubber-band OCR

If you were to put the degrees of automation on a scale, you would have no automation first, then semi-automation, and then the varying degrees of full automation, which depend on system accuracy. No automation is of course manual entry of documents that enter an organization. Full automation is an attempt to collect all data automatically from the document, using manual labor only when required for exceptions and quality assurance steps. The degree of automation here depends on accuracy: the lower the accuracy, the more exceptions there will be and the more documents will end up in quality assurance.

Semi-automated data capture and OCR is not much thought about. The primary reason is that when document automation technology was introduced, people wanted to go full force. It was a combination of poor market education and grand dreams. Semi-automated is an intermediary step where the operator will see every image, but their time spent per image is far less than with manual entry. It allows organizations to start using the technology with less risk, more control, and lower cost. The challenge with the adoption of semi-automated data capture is that it's hard to change from or upgrade. Some packages out there allow a seamless move into full automation, but then you are stuck with that solution. Now that you know what it is, how does it work?

Semi-automated data capture is very basic. When an image is scanned, it is displayed for the operator in as much screen real estate as possible. If it's a click-entry solution, then a full-page OCR read has already happened; if it's a rubber-band solution, then it's just the image. In both scenarios, the operator has a field list on some other portion of the screen and goes field by field, locating information on the page. With click-entry, since the OCR is already done, they highlight the word or words on the document they want to populate the field with and click. When they click, the text is transferred to the next unpopulated field. In rubber-band OCR, all the fields are rubber-banded in advance, a “read” button is clicked after the rubber-banding is done, and then the text is populated into each field.
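The difference between the two approaches is easy to see in code. Here is a minimal Python sketch: `click_entry` and `rubber_band_read` are hypothetical names, and `ocr_zone` is a stand-in for a real OCR call on an image region, not an actual library function.

```python
def click_entry(fields, clicked_text):
    """Put the clicked OCR text into the next unpopulated field.

    `fields` is an ordered dict of field name -> value (None when empty).
    Returns the name of the field that was filled, or None if all are full.
    """
    for name, value in fields.items():
        if value is None:
            fields[name] = clicked_text
            return name
    return None


def rubber_band_read(zones, ocr_zone):
    """Read every rubber-banded zone in one pass, as the "read" button would.

    `zones` maps field name -> (left, top, right, bottom).
    `ocr_zone` is a placeholder for a real OCR call on that region.
    """
    return {name: ocr_zone(box) for name, box in zones.items()}


# Click-entry: OCR is already done, each click fills the next empty field.
fields = {"vendor": None, "date": None}
click_entry(fields, "ACME")
click_entry(fields, "12/08/2009")
print(fields)
```

In click-entry the expensive OCR work happens up front and each click is just a lookup, while in rubber-band OCR the read only happens once every zone has been drawn.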

Semi-automated data capture is becoming more popular with organizations that are budget-constrained or scared of adopting full automation, and, surprisingly, with companies that have adopted full automation but did not do it well. I very much believe in full document automation, but semi-automated data capture has a necessary place in the spectrum of document automation.

posted by Chris Riley at 0 Comments

Monday, December 7, 2009

Re-OCR, Lessons learned

To my surprise, I still receive requests from companies needing to start over on their OCR processes. These are companies that have used the technology, did not plan, and now find themselves in a situation where they have to repeat OCR efforts. The companies fall into two categories.

The first category is companies that find they have processed large volumes of paper and the accuracy was not what they expected. This can be discovered in a relatively short time frame or long after initial integration of the technology. The fix can be as easy as correcting bad settings for a particular document type or as bad as having to correct a bad choice of software solution.

For companies in category one, it's truly a lesson-learned scenario. I will work with these companies to evaluate proper OCR settings and to test prospective engines. My hope is that the company at least scanned their documents at a high enough quality that the already scanned images can be used for backlog conversion, versus a re-scan, if that is even possible.

The second category is companies who discovered they were collecting too little data from their documents. This usually happens in data capture environments where companies configure the capture of 3 key fields only to find later that an additional 2 fields were required for downstream processes. Depending on the severity, it's often better to do day-forward processing with proper settings on new documents and to key in the missing fields for the already processed documents. The reason is that the work of getting the additional fields and reconciling old documents sometimes takes away from day-forward production and may not be worth the additional cost it imposes. Alternatively, a common practice is to have the backlog documents run from scratch through the new process.

The trend in both categories is improper planning by the organization before evaluating technology. It's important for companies to take the time to plan for capture technology. Part of this planning is looking forward at future needs for the data. One of the best tricks for exposing the requirements is to involve ALL constituents that create, use, and benefit from extracted data. Plan, Plan, Plan.

posted by Chris Riley at 0 Comments

Wednesday, December 2, 2009

Playing tricks on your images

Often organizations have no control over the images they have received. Images can come via fax which has a varied range of resolutions, or they can come as poor scans. All of which are no good for data capture and OCR processes. Fortunately there are a lot of imaging tools and tricks out there to help. None of these tools replace a good scan but some get close. One of the tools not often thought about is up-sampling.

Up-sampling is the process of taking an image at a lower resolution and increasing it to a higher resolution. The technology basically increases the resolution of an image and then fills in the new empty pixels with values predicted from the original image. For data capture and OCR, up-sampling is usually done from 150 DPI or 200 DPI to 300 DPI. Up-sampling technologies have become very impressive and useful. I will often recommend up-sampling over working with the lower-resolution source. But let's talk about the facts, and how and when you should consider up-sampling.

Up-sampling should be considered on documents that have a low amount of noise such as watermarks, spills, stains, stamps, and speckling; essentially, documents that are good quality scans but low resolution. You should also avoid up-sampling documents with close spacing of elements and text crowding. In these two scenarios (noisy documents and crowded documents), it's better to work with the source image as-is and work around the problems.

The bigger the gap between the source resolution and the desired resolution, the greater the risk that artifacts exist after up-sampling. For example, 150 DPI to 300 DPI will not yield the quality that 150 DPI to 200 DPI will. This is why going crazy and up-sampling to the highest possible resolution is not a good idea. It's like taking a very small image and trying to zoom in as far as you can to get detail; you probably won't. Trying to trick the system will only hurt you. Up-sampling from 150 DPI to 200 DPI and then again to 300 DPI would not be better than just converting from 150 DPI to 300 DPI directly. In fact, this would be a pretty big mistake. Essentially, what you do when you do this is magnify the mistakes created during up-sampling, as they now get propagated twice over. This will likely decrease your quality and can result in such things as bloated characters, fuzzy characters, or an abundance of speckling. The goal is to do as few conversions on the document as possible.
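The effect of chaining conversions can be seen even in a toy example. The Python sketch below is an illustration under heavy assumptions: it up-samples a single row of grayscale pixels with plain linear interpolation, which only stands in for whatever algorithm a real up-sampling product uses. The `upsample_row` and `soft_pixels` helpers are hypothetical names.

```python
def upsample_row(pixels, factor):
    """Linearly interpolate one row of grayscale pixels to `factor` times
    its original length. A toy stand-in for real up-sampling; it needs at
    least two output pixels."""
    n_out = round(len(pixels) * factor)
    out = []
    for i in range(n_out):
        # Map the output position back to a fractional source position.
        src = i * (len(pixels) - 1) / (n_out - 1)
        left = int(src)
        right = min(left + 1, len(pixels) - 1)
        frac = src - left
        out.append(pixels[left] * (1 - frac) + pixels[right] * frac)
    return out


def soft_pixels(row, low=1, high=254):
    """Count pixels that are neither clearly black nor clearly white."""
    return sum(1 for value in row if low < value < high)


# A sharp black-to-white edge, as on the side of a character at 150 DPI.
edge = [0, 0, 0, 255, 255, 255]

direct = upsample_row(edge, 2)                           # 150 -> 300 DPI
chained = upsample_row(upsample_row(edge, 4 / 3), 1.5)   # 150 -> 200 -> 300 DPI

print("gray pixels, direct: ", soft_pixels(direct))
print("gray pixels, chained:", soft_pixels(chained))
```

In this toy case the chained conversion smears the edge across more gray pixels than the direct one, because the second pass interpolates values that were already interpolated. That is the "fuzzy characters" effect in miniature.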

I will always defer to a proper scan over any image techniques, but when you do not have control of the image scan one of the image tools to consider is up-sampling. Uneducated use of the technology is unsafe as is true with all advanced technologies, but if you stick with the facts, and pick a great technology you will be successful.

posted by Chris Riley at 0 Comments

Monday, November 30, 2009

The Clock is ticking

When considering the ROI on a data capture integration, setup time is one of the most important and most often miscalculated factors. This means not just the setup time for initial integration, but also the setup time used for any fine-tuning and optimization, sometimes post-production.

The difference in setup time between a fixed data capture environment, where coordinate-based fields are used, and rules-based semi-structured environments is substantial. It's not usually the fixed data capture environments that pose the biggest challenge in calculating or predicting ROI. It takes an administrator on average between 15 and 45 seconds to create and fine-tune a fixed form field. In semi-structured processing, the field setup time can be anywhere between 60 seconds and hours, depending on the complexity of the document and the logic being deployed. It's this large gap that throws a wrench in some ROI calculations.
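To get a feel for how wide that gap is, you can put the per-field ranges into a quick estimate. In the Python sketch below, the fixed-form range and the 60-second lower bound come from the figures above, while the two-hour upper bound is only an assumed stand-in for "hours". The `setup_time_range` function is a hypothetical helper.

```python
# Per-field setup time ranges in seconds. The fixed-form range and the
# semi-structured lower bound come from the post; the two-hour upper
# bound is an assumption standing in for "hours".
SETUP_SECONDS = {
    "fixed": (15, 45),
    "semi_structured": (60, 2 * 60 * 60),
}


def setup_time_range(field_counts):
    """Return (best, worst) total setup time in hours.

    `field_counts` maps an environment kind to the number of fields.
    """
    best = sum(SETUP_SECONDS[kind][0] * n for kind, n in field_counts.items())
    worst = sum(SETUP_SECONDS[kind][1] * n for kind, n in field_counts.items())
    return best / 3600, worst / 3600


for kind in SETUP_SECONDS:
    best, worst = setup_time_range({kind: 20})
    print(f"{kind}: {best:.2f} to {worst:.2f} hours for 20 fields")
```

The fixed-form estimate stays within a narrow band, while the semi-structured one spans orders of magnitude, which is why the complexity class of each field has to be known before the ROI can be trusted.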

For experienced integrators, putting a document and its associated fields into complexity classes is usually pretty easy. After doing so, gauging the average amount of time to set up each field, and thus all fields, should be accurate. There is always a field or two that requires extra fine-tuning. The key is a complete understanding of the document. Sometimes document variations are obvious; other times they sneak up on you, and you have no idea a variation exists until you start working with it. Knowing all variations is the easiest way to understand the additional time any field will take to set up. Variants are the biggest contributor of time in semi-structured data capture setup. Second are odd field types, such as fields that take up one to many lines, fields that continue across two separate lines, and tables. The third and final major contributor to setup time is poor document quality, which means the administrator has to be more general when creating fields and likely has to deploy multiple logic rules per field to locate information in several possible ways.

When calculating the ROI on your data capture project make sure to be aware of these sometimes sneaky factors that can eat at integration time. Bottom-line, know your documents, and know the technology before any work is done. If you are unsure seek professional assistance.

posted by Chris Riley at 0 Comments

Wednesday, November 25, 2009

Convert now Export later

It's not surprising that an organization's focus in any sort of document automation is the export format and the data coming out of the system. But sometimes this focus has organizations choosing poor data capture and OCR products just for an ideal export format. This occurs most in healthcare and accounting, where industry-specific repositories expect a certain format and the vendors of these repositories are unwilling to change. This post is to assure you that the accuracy and features of your data capture and OCR product are more important than the file format it creates.

By focusing on file export format, organizations are limiting their choice of solutions and perhaps locking themselves into a more expensive proposition than they should. Industry-specific applications are able to charge a premium for connectors and their products because they understand where the focus is. However, the most accurate data capture and OCR systems out there are general purpose. Some data capture applications have connectors to, say, a specific accounting system, but even without specific connectors, all data capture systems can export data in such a way that it can be converted to ANY desired format.

Data capture applications support CSV, XML, ODBC, or text exports that can be molded into any required format. Often, because they support ODBC, there is an opportunity to export directly to any application that also supports it. Because a conversion utility or a custom connector takes weeks to create vs. the man-years that go into data capture and OCR, the focus should be on the accuracy and capability of the OCR and data capture system before its export functionality.
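To show how little work a conversion utility can be, here is a minimal Python sketch that reshapes a CSV export into JSON records with the field names a target system expects. The column names, the target field names, and the `csv_export_to_json` function are all hypothetical; a real repository would dictate its own format.

```python
import csv
import io
import json


def csv_export_to_json(csv_text, field_map):
    """Convert a CSV export into JSON records with renamed fields.

    `field_map` maps CSV column names to the names the target system
    expects. Columns that are not in the map are dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [
        {target: row[source] for source, target in field_map.items()}
        for row in reader
    ]
    return json.dumps(records, indent=2)


# Hypothetical export from a data capture system.
export = "InvoiceNo,Total,Vendor\n1001,145.00,ACME\n"
print(csv_export_to_json(export, {"InvoiceNo": "invoice_id", "Total": "amount"}))
```

The same pattern applies to XML or fixed-width text targets: read the generic export, rename and reorder the fields, and write the format the repository wants.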

While it would be ideal to find a data capture application that had the accuracy, the features, and the export you desire, I urge organizations not to limit themselves to it. Picking a poor data capture and OCR system will be far more costly than creating even a custom export from scratch.

posted by Chris Riley at 0 Comments

Monday, November 23, 2009

Barcodes, time savers, and wasters

Barcodes are a great technology. You can fit a lot of information in a barcode, they can be read at any angle, and they are very accurate. You have to degrade 30% of a barcode before it's unreadable. In data capture, barcodes are commonly used for batch cover sheets and document separation, or are printed on the documents themselves. This has been proven to be a time saver, both because of quality and because they can be read very quickly using both software-based and hardware-based solutions. What organizations often don't think about is the additional time and cost that barcodes add to the capture process.

Organizations usually don't connect document creation and prep time with data capture time. The total time and cost associated with the capture of documents is not just from the point of scan to export; it's all the additional steps leading up to the scan to get the document into the state it needs to be in before scanning. If an organization uses barcode pages to separate documents, it's the time it takes for an operator to generate the pages and put them manually between documents. If organizations use barcode pages for batch separation, it's the time it takes to create the unique barcode for each batch and place it on top of the batch prior to scan. These are just the two most common examples, but there are many more. Often the disconnect comes because it's not the same person doing the barcode creation and separation as the person scanning, or the barcodes are created in advance and the time it took is forgotten.
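Putting prep time into the same sum as scan and export time makes the trade-off visible. In the short Python sketch below, every number is made up for illustration and `capture_seconds_per_document` is a hypothetical helper, not a measurement of any real process.

```python
def capture_seconds_per_document(scan, export, prep=0.0):
    """Total capture time per document, including prep before the scan."""
    return prep + scan + export


# All figures are invented illustrations, in seconds per document.
# With barcodes: fast, accurate separation, but an operator spends time
# generating separator pages and inserting them by hand.
with_barcodes = capture_seconds_per_document(scan=4, export=1, prep=20)

# Without barcodes: no prep, but a little more time is spent correcting
# separation done automatically by the data capture software.
without_barcodes = capture_seconds_per_document(scan=4, export=3)

print(with_barcodes, without_barcodes)
```

With your own measured figures in place of these, the comparison shows whether the accuracy gained from barcodes is worth the prep time they add.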

Because organizations are not counting this in the total capture process, they are missing the real data capture time and cost. It's no surprise, then, when they are maintaining high paper costs and not reaching the ROI they expected. Barcodes are a great tool, but they should be used when their benefit is greater than their time cost. Benefits can be accuracy and process molding. Very seldom are barcodes alone responsible for substantial cost savings. Very often organizations don't realize that they could in fact do away with barcodes by using advanced data capture. Accuracy may suffer slightly, but the time savings are substantially greater.
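The comparison is simple arithmetic once prep time is counted. The sketch below uses invented per-document timings purely to show the shape of the calculation; plug in your own measured numbers.

```python
def capture_hours(documents, prep_seconds, scan_seconds, review_seconds):
    """Total hours per batch, counting prep before the scan as well as the work after it."""
    return documents * (prep_seconds + scan_seconds + review_seconds) / 3600

# All figures below are invented for illustration only.
# With barcodes: more prep per document, less review afterwards.
with_barcodes = capture_hours(documents=5000, prep_seconds=12, scan_seconds=2, review_seconds=1)
# Without barcodes: no separator prep, slightly more review because accuracy dips.
without_barcodes = capture_hours(documents=5000, prep_seconds=0, scan_seconds=2, review_seconds=4)
```

With these made-up inputs the barcode process costs about 20.8 hours per batch against about 8.3 without. The real numbers will differ, but the prep term is the one that usually goes uncounted.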

posted by Chris Riley at 0 Comments

Friday, November 20, 2009

Line-Items : Picking the correct field type

Documents containing tables have the majority of the information on the document printed in those tables, so the demand to collect this data is high. In data capture, organizations choose one of three scenarios for collecting data from these documents: ignore the table; get the header, the footer, and just a portion of the table; or get it all. Ideally organizations prefer the latter, but there are some strategic decisions that have to be made prior to any integration using tables. One of those decisions is whether to capture the data in the table as a large body of individual fields or as a single table block. Let's explore the benefits and downsides of both.

Why would you ever perform data capture of a table with a large collection of individual fields when you can collect it as a single table field? Accuracy. Theoretically it will always be more accurate to collect every cell of a table as its own individual field. The reason is that you will locate fields more accurately, remove the risk of partially collected cells or cells where the baseline is cut, and remove white space or lines from fields. In some data capture solutions this is your only choice. Because of this, many have made it very easy to duplicate fields and make small changes, so creating many fields goes faster. This is a great tool, as the downside to treating a table as a collection of individual fields is that the time it takes to create all the fields may be too great to justify the increase in accuracy.

If your data capture application can collect data as a single table block, you can very quickly do the setup for any one document type. Table blocks require document analysis that can identify table structures in a document. The table block relies heavily on the identified tables and then applies column names per the logic in your definition. This is what creates its simplicity but also its problems. Sometimes document analysis finds tables incorrectly, and more often only partially. This can cause missing columns, missing rows, and, in the worst-case scenario, rows where the text is split vertically between two cells, or columns cut in half horizontally.

Tables vary in complexity, and this is most often the deciding factor in which approach to take. Very often the accuracy required and the amount of integration time needed to obtain that accuracy also determine the approach. For organizations that want line-items but do not require them, table blocks are ideal. For organizations that need high accuracy and process high volume, individual fields are ideal. In any case, it's something that needs to be decided prior to any integration work.

posted by Chris Riley at 0 Comments

Tuesday, November 17, 2009

Data Capture – Problem Fields

Often the difference between easy data capture projects and more complex ones has to do with the type of data being collected. For both hand-print and machine-print forms, certain fields are easy to capture while others pose challenges. This post discusses those “problem fields” and how to address them.

In general, fields that are not easily constrained and don't have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, etc. Then there are the middle-ground fields, such as dollar amounts and invoice numbers. The problem fields are addresses, proper names, and items.

Address fields are, for most people, surprisingly complex. Many would like to believe that address fields are easy. The only way to capture address fields very easily would be to have, for example in the US, the entire USPS database of addresses that they themselves use in their data capture. It is possible to buy this database. If you don't have this database, the key to addresses is less constraint. Many think that you should specify a data type for address fields that starts with numbers and ends with text. While this might be great for 60% of the addresses out there, by doing so you have made the accuracy on all exception addresses 0%. It's best to let the software read what it's going to read and only support it with an existing database of addresses if you have one.
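To see why the "starts with numbers, ends with text" constraint backfires, consider this small Python sketch. The pattern and the sample addresses are my own illustrations, not taken from any product's data type library.

```python
import re

# An assumed over-strict address data type: digits first, then letters only.
STRICT_ADDRESS = re.compile(r"^\d+\s+[A-Za-z .]+$")

# Invented sample addresses, all perfectly valid in the real world.
samples = ["123 Main Street", "PO Box 1234", "One Market Plaza", "500 W 2nd St Apt 4"]

# Every address the constraint refuses, no matter how well the OCR read it.
rejected = [s for s in samples if not STRICT_ADDRESS.match(s)]
```

Only the first sample survives. The other three are the exception addresses: even a flawless read of them fails the data type, which is how the constraint drives their accuracy to 0%.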

Proper names are next in complexity after addresses. Proper names can be a person's name or a company name. It is possible to constrain the number of characters and, for the most part, eliminate numbers, but the structure of many names makes recognizing them complex. If you have an existing database of the names that would appear on the form, you will excel at this field. As with addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can be either a breeze or very difficult, and it comes down to the organization's understanding of their structure and whether it has supporting data. For example, if a company knows exactly how item codes are formed, then it's very easy to process them accurately with an associated data type. The best trick for items is, again, a database with supporting data.

As you can see, the common trend is finding a database with existing supporting data. Knowing the problem fields focuses companies and helps them form a plan of attack for creating very accurate data capture.

posted by Chris Riley at 0 Comments

Thursday, November 12, 2009

Not quite as fun as the DMV

Understanding the different licensing that is available for data capture and OCR products can sometimes be difficult, but I assure you that the complexities involved will not be as painful as a trip to your local motor vehicle department. There are a few aspects of licenses that trip up some users, namely license type (dongle or serial number), the activation process, and finally page-counts.

License type can be very important but is not often clearly explained. The most common license type out there is the “software license”. This is a license file tied to a specific machine. The benefit of such a license is that it's more efficient and easier to install on servers and hardware that are not local. The downside is that because it is tied to a machine, if that machine dies you may have downtime while waiting for a replacement and proving destruction, or you may have to purchase a new license. Another very common license type is a hardware dongle. Dongles now are most often USB devices very similar to the USB thumb drives we are all used to. The benefit of this type of license is that the software can be installed on every machine in the organization, but only the machine with the dongle plugged in can run it. This means that if something happens to one machine, it is very easy to switch to another. The downside is that the dongle can be lost, and it's not the most efficient. Whatever license type you have, you will need to go through the activation process.

Activation can be troublesome for some products and very simple for others. The difference is usually the installer's effort in understanding the activation process BEFORE any installation. For many of these products activation has as many as 3 steps, usually in the form of sending an activation request, receiving an activation file, and installing the activation file. The trend is for products to allow web activation, and it's becoming more popular, but because of the premium on some advanced data capture products these steps are still required. Now, with an activated license, the most important question: what does a license give you?

Licenses are usually set with general operation rights, purchased add-ons if they exist, and very commonly a page-count. Page-count is the biggest point of contention for almost any purchaser. Because of this, almost all vendors offer an unlimited page-count license for a premium. In the end, almost all companies end up with a page-count license and are quite happy. The argument I would like to pose is that a piece of hardware inherently has a page-count, as each piece of hardware will only be able to physically process a certain number of pages a day, month, or year. For this reason page-count is actually quite reasonable, but it is a slowly dying trend. In the future I expect to see far fewer page-count licenses. For most businesses pages are counted on a monthly basis, but some seasonal companies may elect an annual or pure page count.

License structure is important to ALL organizations and I encourage companies to spend the time during the discovery phases of technology acquisition to investigate the structures that are available from each vendor and how that may work in your environment.

posted by Chris Riley at 0 Comments

Sunday, November 8, 2009

Tax Return OCR

If you are thinking about using data capture to read text from tax returns, now is the time to start thinking about the steps to accomplish this. Reading typographic tax returns from current and previous years has proven to be very accurate and a great use of data capture and OCR technology. Tax returns fall into the medium-complexity category for automation. There are a few things that make tax returns unique.

Checkmarks: Tax returns have two types of checkmarks. The first are standard and printed in the body of the document. These can be handled like all other common checkmark types. The other type of checkmark is unique to tax forms and is typically on the right side of the document. These are boxes that can be filled with a character or a checkmark symbol. With these checkmarks, the best approach is to create a field the entire size of the area where the checkmark can be printed and set the checkmark type to “white field”. In this case the software will expect there to be only white space, and the presence of enough black pixels will mark the field as checked.

Tabular Data: Much of the data in a tax form is presented as a table. When considering capturing data from a table, organizations have to decide whether they want to capture each cell of the table as its own field OR capture the data in the table as a table field that must be parsed later. This can dramatically affect the exported results, so knowing beforehand is very important.

Delivery Type: Tax forms usually come either as eFile, which is a pixel-perfect document that is never printed and never scanned, or as a scanned document received first as paper and then scanned. For the most part the eFile version of the tax form will be more accurate; however, the eFile version has non-traditional checkmarks that could cause a problem. Organizations need to decide whether they are going to process all delivery types together as a single type or separate them. There are advantages to both. Combining them reduces integration time; separating them yields higher accuracy.

I would much rather OCR a tax return than file one. Because of this, my skills in processing tax returns are better than my skills in creating them, and I hope today I imparted some of that knowledge.

posted by Chris Riley at 0 Comments

Friday, November 6, 2009

Invisible characters

Exceptions in OCR and data capture are usually thought of as misrecognized characters only, but in reality there are several other types of exceptions. One of those is called a “high confidence blank”. A “high confidence blank” in OCR or data capture is where the software looked in a particular region for a character but no text was identified or read. In data capture, “high confidence blanks” usually occur for entire fields or just the first character; in full-page OCR they are less common but can occur sporadically throughout the text of the document or across the entire text. This type of exception is elusive and hard to detect. Obviously, if entire fields and blocks of text are missed where you expect there to be text, it is easy to spot, but for one-off missing characters it's tough. With full-page OCR, detection is done with spell-check. Missing characters in a word will surely flag the word as misspelled. In data capture it's much trickier, and the best thing to do is to take certain steps to avoid “high confidence blanks”.

1.) The first thing you can do to avoid “high confidence blanks” in data capture is to NOT overuse image clean-up. If characters are regenerated or cleaned too much, they look to the OCR engine like a graphic rather than a typographic character and are thus skipped.

2.) Second, if you have control of the form design, make sure text is not printed close to lines. This is one of the biggest generators of “high confidence blanks”.

3.) If text is close to lines, then you should be able to establish a rule in the software indicating, for example, that if the first character in a field is more than x pixels away from the border, then most likely one or more characters were missed.

4.) If at all possible, use dictionaries and data types that state the structure of the information that should be present in a field. If a character is missing, this data type will likely be broken.
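The rule in step 3 can be sketched in a few lines of Python. The function name, the coordinate convention (pixel x-positions measured from the left edge of the page), and the threshold are all assumptions for illustration; a real product would expose this as a configurable rule rather than code.

```python
def likely_missed_leading_character(field_left, first_char_left, max_gap_pixels):
    """Flag a field when the first recognized character starts suspiciously far
    from the field's left border, which suggests a character was skipped."""
    return (first_char_left - field_left) > max_gap_pixels

# Invented coordinates: the field starts at x=100 and a normal margin is under 25 pixels.
suspicious = likely_missed_leading_character(field_left=100, first_char_left=140, max_gap_pixels=25)
normal = likely_missed_leading_character(field_left=100, first_char_left=110, max_gap_pixels=25)
```

Fields flagged this way go to a human for review instead of flowing downstream with a silently missing character.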

This type of exception is one that leads to hidden downstream problems when organizations don't realize that it might happen. Being aware and taking the proper steps to avoid "high confidence blanks" is the solution.

posted by Chris Riley at 0 Comments

Thursday, November 5, 2009

Guarantees, Guarantees, Guarantees

One of the most popular questions organizations ask when they purchase data capture or OCR software is “what accuracy can you guarantee?”. If you have ever asked this question of a vendor, you got one of two responses: the first was a percentage of accuracy, the second was a long explanation of why they can't guarantee anything. If the vendor gave you a percentage you should probably run, because it's the start of a bad relationship.

Why? It's not really possible for a vendor to tell you how accurate your recognition will be on your documents. Vendors can estimate accuracy based on samples, and they can give you an idea of the range, but because of the nature of the technology there is no way to guarantee anything. The first fact of OCR is that you can ALWAYS find a document that breaks the norms of recognition and accuracy. Because of this possibility, it's hard to know how exception documents will affect the accuracy of the entire system. So let's talk about what is reasonable.

It is reasonable to provide a sample set of documents and expect an average accuracy level as a percentage on the samples. Because they are a discrete subset of documents, this is something that can actually be measured. It is the job of the organization to pick samples that most closely represent production. It would be wise to include bad, average, and good documents in the sample set so as to cover the entire range of possibilities.

What organizations often forget is that even if only 50% of the documents are automated, there is a cost savings compared to manual entry. The industry standard for accuracy is 85%; however, this changes heavily based on document type and the organization's perception of accuracy. The ideal way to measure accuracy is to compare recognition results to truth data. If truth data is not available, the next best thing is to count not accuracy but the level of uncertainty on the document. If a document is 5% uncertain according to the OCR engine, then it is 95% certain, and this should be your measure.
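Both measures are easy to compute once you have the inputs. The sketch below is one possible way to do it in Python, using the standard library's difflib to align recognized text with truth text; the function names are mine, and other alignment methods (such as edit distance) would give slightly different numbers.

```python
import difflib

def character_accuracy(recognized, truth):
    """Fraction of truth characters that the recognized text reproduces in order."""
    if not truth:
        return 1.0 if not recognized else 0.0
    matcher = difflib.SequenceMatcher(None, recognized, truth, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)

def certainty(uncertain_characters, total_characters):
    """Fallback when no truth data exists: 1 minus the share the engine flagged as uncertain."""
    return 1 - uncertain_characters / total_characters

# A field read as "1O/13/8I" against a truth value of "10/13/81": 6 of 8 characters match.
field_accuracy = character_accuracy("1O/13/8I", "10/13/81")
# A document where the engine flagged 5 of 100 characters as uncertain.
document_certainty = certainty(5, 100)
```

Run over a sample set that includes bad, average, and good documents, the average of these figures is the kind of number a vendor can reasonably be held to.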

Next time a vendor is faced with the question of “how accurate are you?” or “what accuracy do you guarantee” I hope they issue the proper response of “how accurate will your process allow us to be?”. It's a fair question when you are not familiar with the technology, but hopefully the above gives you the proper approach to measuring a solution.

posted by Chris Riley at 0 Comments

Tuesday, November 3, 2009

Fixed, Semi-structured, UNSTRUCTURED!?

I find myself educating even industry peers on the topic of document type structure more and more recently. Often the conversation starts with one of them telling me that unstructured document processing exists, OR that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let's start by defining a document. A document is a collection of one or many pages that has a business process associated with it. Documents of a single type can vary in length, but the content contained within, or the possibility of it existing, is constrained. When data capture technology works, it works on pages, so each page of a document is processed as a separate entity. This, it seems, is the meat of the confusion.

Often someone will say a document is unstructured when what they are thinking is that the order of pages is unstructured. This is more or less accurate; however, the pages within this unstructured document are either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. Here is how you know: if at any moment you can pull a page from the document and state what that page is and what information it would have, then it IS NOT unstructured.

The ability to process agreements and contracts is limited to very concrete scenarios where the contract variants are known, which essentially also makes them no longer unstructured. In general, the ability to process unstructured documents does not exist. Now to explore the difference between semi-structured and fixed.

It's actually very easy: 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make it fixed. For example, a tax form always has the same general location to print the company name. The printer has to print within a specified range. It can print more to the left or more to the top, and the length will vary with every input name. This makes it semi-structured. Additionally, when this document is scanned it will shift left, right, up, or down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.
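As a rough illustration of what registration marks buy you, here is a minimal Python sketch that shifts a template field box by the offset between where a registration mark sits on the template and where it was found on the scan. It assumes a single mark and pure translation; real software typically uses several marks and also corrects skew and scale. All names and coordinates are hypothetical.

```python
def align_field(field_box, template_mark, scanned_mark):
    """Shift a template field box by the offset between the template's
    registration mark and the same mark as located on the scanned image."""
    dx = scanned_mark[0] - template_mark[0]
    dy = scanned_mark[1] - template_mark[1]
    left, top, right, bottom = field_box
    return (left + dx, top + dy, right + dx, bottom + dy)

# The scan drifted 7 pixels right and 4 pixels up relative to the template.
aligned = align_field(field_box=(100, 200, 300, 230), template_mark=(50, 50), scanned_mark=(57, 46))
```

Without a mark to measure against, there is nothing to compute the offset from, which is why a form lacking registration marks has to be treated as semi-structured.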

There, again the confusion is exposed. When having conversations about data capture, it's very important to understand the true definitions of the lingo that is used. I task you: if you catch someone using the lingo incorrectly, it will help you and them to correct it.

posted by Chris Riley at 0 Comments

Wednesday, October 28, 2009

Data Type, Dictionary, Database Lookup = First Verification

After viewing the power of data capture technology, I've yet to see an organization unimpressed, until the conversation explores quality assurance steps. Though the technology is extremely powerful, there will always be some level of quality checking needed to get 100% accurate results. Think of it this way: if you were to spill coffee on a perfectly printed document and scan it soon after ( rollers making a nice smudge ), you likely would be unable to read the text yourself, so how can the software? In this scenario QA would be required for the smudged fields. This seems obvious but illustrates the point. I have good news, however: if you provide the right tools, you can use a computer to do the first pass of verification.

It's just like a human verifying a document, but much faster and less expensive. Organizations that deploy these methods can eliminate a large percentage of verification, but the caveat is that they must first know their documents. If, after data capture has happened, you combine data types with a dictionary or database look-up, you have essentially created an electronic verifier.

A data type tells the software what structure a field should be in. A data type can be used to confirm a field's results OR to modify uncertain results based on the knowledge it contains. For example, take a date field. After data capture the field is recognized as 1O/13/8I. We see there are two errors: an “O” instead of a “0” and an “I” instead of a “1”. Suppose you deploy a date data type that says simply that you will always have a number from 1-12 followed by a “/”, followed by a number from 1-31, followed by a “/”, followed by two digits. Then the date would automatically be converted to 10/13/81, which is correct. Some data types are universal, such as date and time; others are specific to a document type, and the organization that knows ALL of them stands to benefit greatly.
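The date example can be sketched in Python roughly as follows. The look-alike table and the function name are assumptions of mine, and commercial products implement data types in their own way; this only shows the idea of validating against a structure and repairing characters that cannot belong in it.

```python
import re

# Common OCR confusions between letters and digits (an assumed, incomplete table).
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})

# Month 1-12, "/", day 1-31, "/", two-digit year.
DATE_PATTERN = re.compile(r"^(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{2}$")

def apply_date_type(raw):
    """Return the corrected date text, or None when the value still breaks the data type."""
    if DATE_PATTERN.match(raw):
        return raw
    candidate = raw.translate(LOOKALIKES)
    return candidate if DATE_PATTERN.match(candidate) else None

corrected = apply_date_type("1O/13/8I")
unfixable = apply_date_type("13/45/99")
```

Here corrected comes back as 10/13/81, while the second value returns None and would be routed to a human verifier because no substitution can make it a valid date.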

Dictionaries and database look-ups function essentially the same way, with a slight variation. The purpose of both approaches is to validate what was extracted via data capture against pre-existing acceptable results. The simplest example to consider is existing customer names. If you are processing a form distributed to existing customers that contains first name and last name, then because you already know the customers exist, you should be able to look them up in a database and confirm the results. If no match is found, then there is likely a problem with the form. Dictionaries can provide the same value but are more static and are often used for fields such as product type or rate type that have one set of possibilities that rarely changes. The point is that organizations should look at the database or dictionary assets they already have to augment the data capture process and make it more accurate.
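A look-up verifier might be sketched like this in Python. The customer list stands in for a real database query, the names are invented, and the fuzzy "corrected" branch (using difflib with an arbitrary cutoff) is my own addition beyond the plain confirm-or-flag look-up described above.

```python
import difflib

# Stand-in for an existing customer table; the names are invented.
KNOWN_CUSTOMERS = ["Maria Gonzalez", "John Whitaker", "Priya Natarajan"]

def verify_against_list(value, known_values, cutoff=0.85):
    """Return (value to use, status) after checking a captured value against known values."""
    if value in known_values:
        return value, "confirmed"
    # Assumed extension: accept a single very close match as an OCR slip.
    close = difflib.get_close_matches(value, known_values, n=1, cutoff=cutoff)
    if close:
        return close[0], "corrected"
    return value, "needs review"

exact = verify_against_list("John Whitaker", KNOWN_CUSTOMERS)
near = verify_against_list("J0hn Whitaker", KNOWN_CUSTOMERS)
unknown = verify_against_list("Alan Smithee", KNOWN_CUSTOMERS)
```

Only values that end up as "needs review" reach a human, which is where the reduction in verification effort comes from.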

There will always be quality assurance steps with any technology that involves interpretation of data. Organizations wanting to deny these steps either do not understand the technology, do not understand their own processes, or were misled by a vendor. Quality assurance is the place where much effort should be spent on streamlining, and one of the ways to do that is by leveraging data types, dictionaries, and databases that already exist.

posted by Chris Riley at 0 Comments

Tuesday, October 27, 2009

Read signatures maybe not, make sure a doc is signed, easy!

A lot of the documents we encounter require there to be a signature. In data capture these documents add complexity, as an operator has to make sure each document is signed, either before or after data capture. When a document is not signed, it very often has to go down a different path of approvals. Often organizations will ask OCR vendors to read the signature on a form. The ability to recognize signatures is very expensive and requires a database of pre-existing signatures, so it is often not feasible. But the ability to find a signature and confirm its presence is not that difficult at all.

Because documents with a signature line almost always have to be checked to ensure a signature is there, it is an additional processing step. However, companies often don't realize that the data capture software they are using can get all the fields off of the document and accurately check whether a signature is present. By doing so they remove the additional step and can flag only the documents that are not reporting a signature.

Using OMR ( optical mark recognition ) technology, you can determine if a signature is present. In its simplest form, OMR checks to see if there is a substantial amount of black pixels in a white space. At a certain threshold of black, that field will be considered checked. If in a data capture setup you put an OMR field in the location where a signature should be, then you will know that if it reports checked there is a signature present, and if it reports unchecked there likely is no signature.
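In its simplest form the check looks something like the Python sketch below. It assumes the signature region has already been cropped out as rows of grayscale pixel values (0 is black, 255 is white); the threshold values are illustrative guesses, and real OMR engines tune them and add noise handling.

```python
def is_marked(region, black_threshold=128, fill_ratio=0.05):
    """Report a region as marked when enough of its pixels are dark.
    region: rows of grayscale pixel values from 0 (black) to 255 (white)."""
    total = sum(len(row) for row in region)
    black = sum(1 for row in region for pixel in row if pixel < black_threshold)
    return total > 0 and black / total >= fill_ratio

# A blank 5x20 signature area, and a copy with a stroke of ink across the middle row.
blank_area = [[255] * 20 for _ in range(5)]
signed_area = [row[:] for row in blank_area]
for x in range(3, 15):
    signed_area[2][x] = 0

blank_result = is_marked(blank_area)
signed_result = is_marked(signed_area)
```

The blank area reports unmarked and the inked one reports marked. Nothing here says whose signature it is, only that something was written in the box.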

While you are not reading the signature OMR is a fast and accurate way to see if signatures are present and avoid the additional manual step of checking for signed documents.

posted by Chris Riley at 0 Comments

Monday, October 26, 2009

Black belt in data capture processes an EOB

Explanation of Benefits (EOB) documents, next to student transcripts, are unquestionably the most difficult documents to automate. The value of automating these documents, however, is tremendously high, as they are very expensive to data enter. Three years ago the fad in automating these documents was to use semi-structured data capture to locate information no matter the variation. Companies buying into this fad quickly found themselves in an expensive and deep data capture implementation. This is where I get to tout the power of simplicity and beat down the over-complicators.

Just as a Sensei would practice meditation before a bout to calm the nerves so should an implementer of data capture when facing the bloody battle with EOB documents. Simplicity is key when processing EOBs. Organizations should:

1.) Consider processing first those EOBs that are clear. Clarity is a vague term and includes document structure and scanning quality. But because of the variation across EOB types, it's best for an organization to focus on automating the best-quality documents, the ones they know will provide the highest accuracy, and then move on to the rest once they have succeeded.

2.) Consider classification as a primary step. If you can very accurately classify EOBs by type, then you don't need to use semi-structured technology on the EOB as a whole; you simply need to isolate each type and use a combination of coordinate-based and semi-structured field location. Because you are working with a single type, you will be far more accurate in locating the fields and reading them.

3.) Ignore document structure. Very often EOBs don't follow their own document structure, especially when it comes to tables. Often EOBs have tables within tables, or data in tables that does not align with the table headings. Additionally, EOBs have patients that span pages, and totals for items on previous pages. EOBs should be thought of as a collection of lines that starts with a header ( easy to collect the data ) and ends with a footer ( also easy to collect data ). Your job then is to classify lines and extract data per line.

4.) Extract the data then convert it. In EOB processing there are many items contained within the EOB that have to be converted to another format prior to reconciliation. When trying to extract data if you focus on the conversions they often muddy up the extraction process. First very accurately get the data from the paper then convert it to the desired format.
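Steps 3 and 4 can be sketched together in Python. The line layouts below are invented for a single hypothetical EOB type (real layouts vary by payer), and the rule and function names are mine. Lines are classified one at a time, data is extracted as text, and conversion happens only afterwards.

```python
import re
from decimal import Decimal

# Hypothetical line shapes for one EOB type.
LINE_RULES = [
    ("header", re.compile(r"^Patient:\s*(?P<patient>.+)$")),
    ("detail", re.compile(r"^(?P<date>\d{2}/\d{2}/\d{2})\s+(?P<code>\d{5})\s+(?P<billed>\d+\.\d{2})\s+(?P<paid>\d+\.\d{2})$")),
    ("footer", re.compile(r"^Total\s+(?P<billed>\d+\.\d{2})\s+(?P<paid>\d+\.\d{2})$")),
]

def classify_line(text):
    """Return (line type, extracted fields) for the first rule that matches."""
    for line_type, pattern in LINE_RULES:
        match = pattern.match(text.strip())
        if match:
            return line_type, match.groupdict()
    return "unknown", {}

def group_claims(lines):
    """Group classified lines into claims; page breaks and other unknown lines are skipped."""
    claims, current = [], None
    for text in lines:
        line_type, fields = classify_line(text)
        if line_type == "header":
            current = {"patient": fields["patient"], "details": [], "totals": None}
            claims.append(current)
        elif line_type == "detail" and current is not None:
            current["details"].append(fields)
        elif line_type == "footer" and current is not None:
            current["totals"] = fields
    return claims

def convert_amounts(fields):
    """Conversion step, run only after extraction: amount text becomes Decimal values."""
    return {k: Decimal(v) if k in ("billed", "paid") else v for k, v in fields.items()}

sample_lines = [
    "Patient: Jane Doe",
    "01/05/09 99213 120.00 80.00",
    "--- page 2 ---",
    "01/12/09 99214 150.00 95.00",
    "Total 270.00 175.00",
]
claims = group_claims(sample_lines)
totals = convert_amounts(claims[0]["totals"])
```

Because the patient's lines are grouped by header and footer rather than by page or table, the page break in the middle of the sample does not split the claim.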

For those who are currently processing EOBs and receiving the great value that automation can provide, you truly are black-belts of data capture and have mastered the nuances of document automation. For those of you wanting to process EOBs, it's very possible, just keep it simple.

posted by Chris Riley at 0 Comments

Wednesday, October 21, 2009

Is it a document or just pages

There are several aspects of how we talk about documents, scanning, OCR and Data Capture that are the culprits of confusion, misunderstanding, and unfortunately deferred adoption. One of these common language misconceptions arises when discussing documents and their structure. All data capture and OCR implementations involve the concept of a document, even if a document is a single page that is repeatedly scanned. But the definition of what a document is often gets blurred between vendors, end-users, resellers, and even internally within all of these.

Sometimes people think a document is just one page in a collection of pages. Others believe a document is a record in a database that consists of several page types combined together in a single record. This view does not depend on when the scanning happens, so one page can come in at a different time than another, but not until they are all there do you have a document. And others think a document is multiple pages scanned together, with a page type that determines the beginning and the end.

Where the confusion comes in is that they are all correct, but they are influenced by different things. Documents to an organization can be defined by a business process or by a scanning process. To add to the confusion, the scanning department has a concept of a document related to scanning, but the back office has a different concept as it relates to the database. To reconcile this, let me tell you in full what a document is.

A document is all the paper it takes to create a single record in a system or database. This definition actually combines all of the above and generalizes it. It's important to reconcile everyone's opinion on what a document is because document structure and the business rules around a document directly impact how you implement OCR and Data Capture and keep it accurate.

The biggest challenge of all these language misconceptions is purely understanding that they exist. If you know it's going to happen then you can mitigate their impact. Not knowing their presence can make them a silent killer of success.

posted by Chris Riley at 0 Comments

Friday, October 16, 2009

Not that you want to pay that invoice any faster

But you can, and you can at a lower cost, and perhaps take advantage of net discounts. With Data Capture and OCR technology you can automate the entry and routing of commercial invoices. The reality for organizations that receive many invoices a day is that the accounting department is paying high salaries and taking time away from other activities to data enter paper invoices. Using recognition technology to replace this process has been a tremendous benefit to many organizations. There are a few keys to success.

Start out simple: don't try to tackle the entire paper world with your solution. First identify the process and where the opportunities for savings are. Usually the biggest opportunity is going to be in the entry of data into some accounting system. To automate this you will need data capture and scanning capabilities. Starting out simple does not mean overlooking all the possibilities; it means finding the technology that will fit all your wildest dreams of automation, but starting out slow with it. More specifically with invoices, first start by scanning, then move on to getting vendor, invoice number, and total due using recognition technology, etc.

Wait for an ROI before you make a major change: These technologies if implemented correctly can provide a great return on investment. Sometimes organizations make the mistake of not waiting until they get an ROI before making another major change. The change likely will have positive results, but requires another round of additional effort and could be problematic. This does not allow you to see when the value of the technology starts kicking in and could have you repeating effort. Wait until you succeed at a basic implementation before you seek even more cost savings. Saving money is addicting, but let each phase actualize itself.

Never forget your business process is boss: Organizations have processes that are set in stone. Staff understand how to execute them, technology is set up to facilitate them, and other processes are feeding or fed by them. Sometimes new technology is so exciting that it forces you to change what you are doing right when you acquire it. Often organizations don't realize the upstream and downstream impact of dramatically changing business processes. A technology should give you the option to keep doing what you are doing, only faster, or to change things if you choose. At first, try to keep it as consistent as possible with the AP business processes already in place, then look for process improvement later.

No, maybe you don't want to pay that invoice faster, but you do want to reduce the cost of working with it. With Data Capture and OCR you can save a ton, as long as you prepare yourself and do your homework.

posted by Chris Riley at 0 Comments

Thursday, October 15, 2009

Ok Chris, you talk the talk, but what is it?

The readers of this blog are varied. Some know what OCR and Data Capture are, some do not. Some know they need it but not necessarily how to use it. Others know how powerful it is and have a good understanding of what is out there, but not the best practices. So, taking a step back, let me tell you what it's all about. It's about saving money and reducing the cost associated with paper-based operations.

OCR is commonly used to encompass all of the recognition technologies out there. It specifically stands for Optical Character Recognition. This is simply the process of taking an image, scanned or digitally received, and converting it from an image to text. While OCR can be used loosely to mean ICR, OCR, Data Capture, OMR, and barcode processing, it is really the process of extracting ALL of the typographic text from an image document and converting it to a digital format. ICR ( Intelligent Character Recognition ) is hand-print extraction, OMR ( Optical Mark Recognition ) is filled-in bubble extraction, and barcode is, well, barcode extraction. These latter recognition technologies make up Data Capture.

Data capture is the process of extracting field and data pairs to be exported in a structured format. It does not necessarily have to get all the information on a document, and it is largely dictated by business processes. Data Capture incorporates ICR, OMR, barcode, and OCR to extract the data from fields. Fixed-form Data Capture handles forms that don't change from page to page, and these are usually hand-printed. Semi-structured forms are 80% of the documents someone sees. Data Capture is usually a more complex technology than full-page OCR alone.
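To make "field and data pairs exported in a structured format" concrete, here is a minimal sketch that writes one document's captured fields as CSV. The field names and values are hypothetical sample data, and CSV is only one of many possible export formats.

```python
import csv
import io

# Hypothetical field/value pairs, as a data capture step might return them.
captured = {
    "vendor": "Acme Supply",
    "invoice_number": "INV-1001",
    "total_due": "245.50",
}

def export_fields_csv(fields):
    """Return one document's field/value pairs as a header row plus a data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    writer.writerow(fields)
    return buffer.getvalue()

print(export_fields_csv(captured))
```

Full-page OCR, by contrast, would simply return all of the text on the page with no field names attached.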

So there you have it. This is why you are reading this blog: to learn about the specifics, nuances, and best practices of these technologies.

posted by Chris Riley at 0 Comments

Wednesday, October 14, 2009

Why hot folders are so HOT

We are all guilty of overcomplicating things. In technology products, overcomplication results in more features than you will ever use and less money for the things you could use. Other times it creates new problems in business processes. End-users, vendors, and technologists all commonly try to add too many elements to automation projects. One of the areas where overcomplication occurs the most in data capture and OCR integrations is passing images and results from one step to another.

When it comes to passing images from a capture application to a data capture application, most organizations ask for a connector written specifically to tie the chosen imaging application's API to the chosen Data Capture application's API. Similarly, when considering export from OCR and Data Capture processes, most organizations want a special connector to their repository or ECM product. I'm not sure what to blame: the warm and fuzzies that come from the realization that an OCR vendor has spent specific effort to develop these connectors, or the faith that somehow connectors are more efficient. What I do know is that in almost all cases connectors are overkill and simply not necessary. Why? Because there are hot folders, and they are amazingly powerful and simple.

A hot folder ( sometimes called a watch folder ) is a directory, virtual or real, that is set up as a staging area or queue where applications put data in and take data out in real time. The best thing about hot folders is they are free! Almost all imaging, data capture, and content management applications support hot folders. If they don't, you have every right to ask why. When an image capture application scans documents, it can scan them to a directory. The data capture application can automatically read images as soon as they appear in this directory and process them. Data capture and OCR results can be automatically exported to another directory that a content management application automatically picks up from. That is two folders vs. two pricey connectors.
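The mechanics are simple enough to sketch in a few lines. The example below is a minimal polling hot folder, not how any specific product implements one. The function names, the polling interval, and the `handler` callback (standing in for whatever OCR or data capture step consumes the image) are all assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path

def process_hot_folder(inbox, outbox, handler):
    """Run handler on every file waiting in inbox, then move the file to outbox."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue  # ignore sub-directories
        handler(path)  # e.g. hand the image to an OCR step
        target = outbox / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved

def watch(inbox, outbox, handler, interval_seconds=2.0):
    """Poll the inbox forever.

    A real deployment would also need to skip files that are still being written.
    """
    while True:
        process_hot_folder(inbox, outbox, handler)
        time.sleep(interval_seconds)
```

Chaining two such folders, one between scanning and data capture and one between data capture and the repository, gives the same hand-off that the two connectors would.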

You may think that you are losing functionality such as tracking and security, but there are numerous ways in Windows to monitor folder activity and protect folder security. You might be surprised that many “connectors” out there are actually just a hot folder with a settings dialog. It's a hot folder in disguise.

So when it comes to deciding how to get files from one application process to another, first consider hot folders and try your best to disprove their validity. If you can't, you just saved a bundle of money and probably picked the most efficient method for your OCR solution.

posted by Chris Riley at 5 Comments

Tuesday, October 13, 2009

Check your check scanning

Check scanners are fast, and have very accurate MICR reading. They get the job done when the only job is to get MICR from a check. As OCR of checks, reconciliation of check data with remittances, and check images for future verification and reference gain greater importance and demand, check scanning runs into some complications.

The typical check scanner has two very key features:

1.) Auto endorsement
2.) MICR reading

Often people think that check scanners read MICR with OCR. This is incorrect. MICR ( Magnetic Ink Character Recognition ) is printed in magnetic ink and read via a very specific magnetic reading and conversion process. When companies intend to augment their check scanning with OCR and Data Capture processes, there is something major they must not overlook. Check scanners are great at what they do, but they are not great at producing high quality images. Most check scanners cannot scan above 200 DPI, which, as you will see in my previous articles, is less than optimal for OCR. Additionally, the lamps used to produce the image are fast but not of the greatest quality.

So, here are the options:

1.) Scan checks with a document scanner and a check scanner. In this scenario you get the best of both worlds: a great image for storing, OCR, and data capture from the document scanner, and great MICR and endorsement speed from the check scanner. The hard part is the additional time it takes to perform two scans and merge the two data streams.

2.) Replace the check scanner with a document scanner. You can actually read the MICR line using OCR, but it's not quite as accurate as magnetic reading. This might be OK, since extraction of the rest of the information on the check will be more accurate with the better image. Sometimes it's also better because an automatic document feeder ( ADF ) lets you scan many checks at one time, which is a new time savings. The biggest killer of this approach is that auto endorsement is such a tremendous time saver that many find it impossible to part with.

3.) And finally option three, the most common: just use a check scanner. This option may be the most common, but it is not necessarily the best. Here the company must make sure it gets good image preparation and clean-up software that will enhance the OCR and Data Capture process, and that will likely up-sample the images to 300 or 400 DPI. Up-sampling does not produce the same quality as scanning at these resolutions, but products that excel in up-sampling can get close.
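The arithmetic behind up-sampling in option three is easy to sketch. The function name and the 6 by 2.75 inch check size are assumptions chosen for the example.

```python
def upsampled_size(width_px, height_px, source_dpi, target_dpi):
    """Pixel dimensions after resampling an image from source_dpi to target_dpi."""
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)

# An assumed 6 x 2.75 inch check scanned at 200 DPI is 1200 x 550 pixels.
print(upsampled_size(1200, 550, 200, 300))  # (1800, 825)
```

The extra pixels are interpolated from the 200 DPI scan rather than captured from the paper, which is why up-sampling can only get close to a true 300 DPI scan.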

Check scanning is more and more often augmented with OCR and Data Capture processes. Companies should not assume that a check scanner will produce the image quality of a document scanner, so the considerations above are important.

posted by Chris Riley at 1 Comments

Monday, October 5, 2009

Exceptional exceptions – Key to winning with Data Capture

Exceptions happen! When working with advanced technologies in Data Capture and forms processing, you will always have exceptions. It's how companies choose to deal with those exceptions that often makes or breaks an integration. Too often exception handling is not considered in data capture projects, but it's important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment ( ROI ) stable.

There are two kinds of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here. The point is how best to manage them.

Exceptions that make it to the quality assurance ( QA ) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookup for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next would be to choose proper data types. Data types are formatting rules for fields. For example, a numeric date will contain only digits and forward slashes, in the format NN/NN/NNNN. By allowing only these characters, you make sure you catch exceptions and can either give the data capture software enough information to correct them ( if you see a g it's probably a 6 ) or show the verification operator exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase. There are some exception documents, however, that the software is not confident about at all, and these will end up in an exception bucket.
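A data type check like the date example can be sketched as follows. The look-alike substitution table and the three status values are assumptions made for illustration. Real packages expose this as configuration rather than code.

```python
import re

# Two digits, slash, two digits, slash, four digits.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")

# Assumed table of letters that OCR commonly confuses with digits.
LOOKALIKES = str.maketrans(
    {"g": "6", "O": "0", "o": "0", "l": "1", "I": "1", "S": "5"}
)

def check_date_field(raw):
    """Return (value, status), where status is 'ok', 'corrected', or 'review'."""
    if DATE_PATTERN.fullmatch(raw):
        return raw, "ok"
    candidate = raw.translate(LOOKALIKES)
    if DATE_PATTERN.fullmatch(candidate):
        return candidate, "corrected"
    return raw, "review"  # send to the verification operator

print(check_date_field("0g/15/2009"))  # ('06/15/2009', 'corrected')
```

Only values that still fail after substitution cost operator time, which is the point of choosing a strict data type.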

Whole exception documents that are kicked out of a system are the most costly and, if not planned for, can be the killer of ROI. The most common cause of these exceptions is a document type or variation that the system has not been set up for. It's not the fault of the technology. As a matter of fact, because the software kicked the document out and did not try to process it incorrectly, it's doing a great job! The mistake companies make is giving every document that falls in this category the same attention, and thus additional fine-tuning cost. But if that document type never appears again, the company has just reduced its ROI for nothing. The key to these exceptions, whether whole document types or just portions of one particular document type, is to set a standard: an exact problem has to repeat X times ( based on volume ) before it's given any sort of fine-tuning effort.
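The "repeat X times" standard amounts to a counter with a threshold. In this sketch the class name and the string used to identify a problem are hypothetical, and the threshold is whatever value of X fits your volume.

```python
from collections import Counter

class ExceptionTracker:
    """Count how often each distinct problem occurs before spending tuning effort."""

    def __init__(self, repeat_threshold):
        self.repeat_threshold = repeat_threshold
        self.counts = Counter()

    def record(self, problem_key):
        """Count one occurrence; return True once the problem has repeated enough."""
        self.counts[problem_key] += 1
        return self.counts[problem_key] >= self.repeat_threshold

tracker = ExceptionTracker(repeat_threshold=3)
for _ in range(3):
    worth_tuning = tracker.record("unknown layout: vendor 42")
print(worth_tuning)  # True
```

A one-off document never reaches the threshold, so it gets handled manually and no fine-tuning budget is spent on it.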

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.

posted by Chris Riley at 0 Comments

Monday, September 21, 2009

File size, get over it – When to consider file size, when not

There are times when an organization's focus actually stabs it in the back. When it comes to Data Capture, file size is one of these common focuses. The most common mistake companies make is not that they are concerned about file size, but when they are concerned about it. Many companies will heavily investigate the size of a file at input, tweaking and tuning to feed a smaller file to their data capture solution and, in their mind, to final storage. But at what cost? Companies often overlook that file size can be changed at any point, and the best point is not input, but after Data Capture has been run.

When you assign anyone a task, or teach anyone anything, you expect to give them the proper tools to get the job done as best they can. If they are missing some tools, you can expect quality to go down. Data Capture is the same way. Scanning at 150 DPI vs. 300 DPI, or in black and white vs. grayscale or color, limits the tools of Data Capture. Yes, these choices all dramatically reduce the file size, but they also reduce your quality. Give your Data Capture the best chance at success, then worry about file size.
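To see how strongly these scan settings drive file size, here is a back-of-the-envelope calculation of uncompressed image size. The function name and the letter-size page are assumptions for the example, and real files will be smaller once compressed.

```python
import math

def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size for a page scanned at the given resolution and depth."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return math.ceil(width_px * height_px * bits_per_pixel / 8)

# Letter-size page (8.5 x 11 inches), assumed for illustration.
for dpi in (150, 300):
    for label, bits in (("black and white", 1), ("grayscale", 8), ("color", 24)):
        print(dpi, label, raw_size_bytes(8.5, 11, dpi, bits))
```

Halving the resolution cuts the pixel count to a quarter (8,415,000 bytes down to 2,103,750 for 8-bit grayscale), and dropping from grayscale to black and white cuts it by a further factor of eight. Those are the same pixels and shades the recognition engine would otherwise have had to work with.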

The proper way to address file size is at the point just before the file is stored in a file system or content management system. At this point you can down-sample, reduce bit depth, or, even better for preserving the ability to re-purpose the image, use reliable compression technology to get the job done. I say compression is best because it stays truest to the input image, which will be very important any time you consider printing, re-purposing, or even another pass through data capture.

So, while file size is important, delay the concern until after OCR or Data Capture is done.

posted by Chris Riley at 0 Comments

Thursday, September 17, 2009

Even OCR needs a helping hand – Quality Assurance

Let's face it, OCR is not 100% accurate 100% of the time. Accuracy is highly dependent on document type, quality of scan, and document makeup. The reason OCR is so powerful is because it's not. How do we give OCR the best chance to succeed? There are many ways. What I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process, where a human reviews uncertainties and business rule results based on the OCR output. An uncertainty is a character that the software flags because it did not satisfy a confidence threshold during recognition. This process is a balancing act between a desire to limit human time as much as possible and a need to see every possible error, but no more.

Start with the review of uncertainties. Here an operator will look at just those characters, words, or sentences that are uncertain. This is determined by the OCR product, which will have some indicator of what they are. In full-page OCR, spell checking is often used. In Data Capture, a field is usually reviewed character by character and you don't see the rest of the results. Some organizations will set critical fields to always be reviewed, no matter the accuracy. Others may decide that a field is useful but does not need to be 100%. Each package has its own variation of “verification mode”. It's important to know their settings and the levels of uncertainty your documents are showing to plan your quality assurance.
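In code, uncertainty review boils down to filtering on a confidence score. The sketch below assumes the recognition step reports a confidence between 0.0 and 1.0 for each character. The 0.80 threshold, the data layout, and the function names are illustrative, not any package's actual verification mode.

```python
def characters_to_review(recognized, threshold=0.80):
    """Return (position, character, confidence) for characters below the threshold.

    recognized is a list of (character, confidence) pairs.
    """
    return [
        (index, char, confidence)
        for index, (char, confidence) in enumerate(recognized)
        if confidence < threshold
    ]

def field_needs_review(recognized, always_review=False, threshold=0.80):
    """Critical fields are always reviewed; others only if a character is uncertain."""
    return always_review or bool(characters_to_review(recognized, threshold))

field = [("1", 0.99), ("g", 0.42), ("8", 0.95)]
print(characters_to_review(field))  # [(1, 'g', 0.42)]
```

Raising the threshold shows the operator more characters, lowering it shows fewer, which is the balancing act between human time and missed errors.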

After the characters and words have been checked, Data Capture has an additional quality assurance step: business rules. In this process the software applies whatever rules the organization creates and checks them against the fields. A good example might be “don't enter anyone in the system whose birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless, and packages today make it very easy to create custom rules. The goal is to first deploy the business rules you already have in place in the manual operation, then augment them with rules that enhance accuracy based on the raw OCR results you are seeing.
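A business rule is just a named check against the extracted fields. Here is a minimal sketch using the birth year example. The field name `birth_year` and the rule table are assumptions for illustration.

```python
def birth_year_rule(fields):
    """Pass when the birth year is 1984 or later."""
    return int(fields["birth_year"]) >= 1984

# Hypothetical rule table: description -> check function.
RULES = {"birth year must be 1984 or later": birth_year_rule}

def failed_rules(fields, rules=RULES):
    """Return the descriptions of every rule the document fails."""
    return [name for name, rule in rules.items() if not rule(fields)]

print(failed_rules({"birth_year": "1979"}))  # ['birth year must be 1984 or later']
```

A document with a non-empty result would be flagged for an operator, and new rules are added by extending the table.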

In some more advanced integrations, a database or body of knowledge is used as a first round of quality assurance that is still automated.

These two quality assurance steps combined should give any company a chance to achieve the accuracy they are seeking. Companies who fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

posted by Chris Riley at 0 Comments

Wednesday, September 16, 2009

If it's not semi-structured why fix it – know your form's class?

There are two major classes of Data Capture technology: fixed and semi-structured. When processing a form, it's critical that the right class is chosen. To complicate things, there is a population of forms out there that can be automated with either, but there is always a definite benefit of one over the other. In my experience, organizations have a very hard time figuring out whether their form is fixed or not. The most common misdiagnosis comes from forms where fields are in the same location and each has an allotted white space for data to be entered. To most this seems fixed, but in actuality it's not. Text in these boxes can move around substantially. Additionally, the boxes themselves, while in the same location relative to each other, can move because of copying, variations in printing, etc. There are two very easy checks to determine if your form is fixed or not.

1.) Does your form have corner stones? Corner stones, sometimes referred to as registration marks ( registration marks have been known to replace corner stones when they are very clearly defined ), are printed objects, usually squares, in each corner of the form. They must all be at 90 degree angles from their neighbors. Corner stones allow the software to match the scanned or input document to the original template, theoretically lining up all fields and all static elements on the form and removing any shifts, skews, etc.

2.) Does your form have pre-defined fields? A pre-defined field is more than a location on the form. It has a set width, height, and location, and, most importantly, a set number of characters. You know these fields best from forms you have filled out that give you a box for each letter. There are variations in how the characters are separated, but they all share these attributes. This is called mono-spaced text.

If your form does not have the above two items, it is not a fixed form. This would indicate that a semi-structured forms processing technology would be the best fit. For those forms that are commonly confused for fixed, there are ways to make them process well with a fixed form solution by isolating the input type ( fax, email, scan ) and using the proper arrangement of registration marks.
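The two checks reduce to a very small decision. This sketch assumes a form must pass both checks to count as fixed, which is one reading of the rule above. The function name is hypothetical.

```python
def form_class(has_corner_stones, has_predefined_fields):
    """Classify a form as 'fixed' only when both checks pass (assumed reading)."""
    if has_corner_stones and has_predefined_fields:
        return "fixed"
    return "semi-structured"

# Fields in a constant location, but no corner stones and no per-character boxes.
print(form_class(has_corner_stones=False, has_predefined_fields=False))  # semi-structured
```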

posted by Chris Riley at 0 Comments

Tuesday, September 15, 2009

Invoice-in-a-box – 4 steps to success

Invoices are one of the most in-demand documents to automate. Let's talk a little about what it takes to be successful in invoice processing. Data Capture is the technology used for invoices: you extract, field by field, the information you want from the invoice. In order to automate invoices with high accuracy and make use of a boxed invoice solution, you need to do some preparation. Here are 4 MUST have steps:

1.) Separate your commercial invoices from any specialized invoice types such as legal, manufacturing, telecommunication, etc. You do this because commercial invoices are the low-hanging fruit when automating invoices. Software packages have put the most effort into these documents. By working with them first, you ensure success on a large population of your invoices and can then tackle the remainder.

2.) Know how many vendors you have. Understanding the makeup of your invoices is very important. Your focus should be determined by those invoices that are easiest to automate and make up the greatest portion of your entire volume. So make a list of all your vendors and what paper volume percentage each makes up of the whole.

3.) Know if you want to collect line-item data or not. At first glance the majority of companies say they want line items, only to change their mind later. Find the business process that mandates you collect line items. In your current process, are you having line items entered? What database of existing information will you use to support your line-item extraction? Most companies in the end choose against line items or choose to extract them only for a limited set of critical vendors.

4.) Know how you are going to check the quality of extraction. Quality assurance happens with human review and business rules. Know beforehand how you want those to work. For example, a business rule could simply be that all line items must add up to the total amount. If they don't, you have someone look at the entire invoice.
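Steps 2 and 4 both lend themselves to a quick sketch: the share of volume each vendor represents, and the rule that line items must add up to the total. The function names and the sample data are assumptions for illustration. Amounts are handled as decimals so that cents compare exactly.

```python
from collections import Counter
from decimal import Decimal

def vendor_volume_share(vendor_per_invoice):
    """Return (vendor, share of total invoices) pairs, largest share first."""
    counts = Counter(vendor_per_invoice)
    total = sum(counts.values())
    return [(vendor, count / total) for vendor, count in counts.most_common()]

def line_items_match_total(line_amounts, total_due):
    """True when the line-item amounts add up exactly to the invoice total."""
    return sum(Decimal(amount) for amount in line_amounts) == Decimal(total_due)

print(vendor_volume_share(["Acme", "Acme", "Bolt", "Acme"]))  # [('Acme', 0.75), ('Bolt', 0.25)]
print(line_items_match_total(["19.99", "5.01"], "25.00"))     # True
```

An invoice where the second check returns False would be routed to a person for a look at the whole document.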

These four steps are not the end-all in improving your invoice processing accuracy, but they are necessary, and all are steps to consider before you look at purchasing a boxed invoice processing solution.

posted by Chris Riley at 0 Comments