Friday, March 5, 2010

Workflow, super-charge with OCR

Document workflow can be as simple as saving a file to a single location or as complex as decision-tree document routing rules. Throw some paper into the mix and the problem intensifies slightly. Getting your paper documents to fit your already accepted digital document workflow can be challenging. Some organizations choose to keep the paper and digital workflows separate. Others unite them but create separate rules for each. For most, however, it would be ideal to have a single workflow engine or product supporting digital, image, and paper documents alike.

To do so with the greatest value, you need not only document conversion using Optical Character Recognition (OCR), but also some other advanced imaging and recognition tools. In the digital document world, you don't have only the data contained in the document; you also have various other metadata items such as file name, file location (taxonomy), tags, etc. In order to marry paper with digital, the same metadata has to be created for the paper document, and that has to occur at the time of document processing. This could be a manual or an automated process, and depending on your paper volume, doing it manually may be OK. To compete with the efficiency of digital documents, however, automatic is the way to go.

Using OCR together with image-based and contextual classification, paper or image documents that enter the workflow can obtain the same value as digital documents. The OCR is responsible for getting all the content from the document. This content is used for search, indexing, and auto-filing, as well as for generating keywords (tags) associated with a taxonomy. In order to determine where the document fits into a taxonomy, you must first classify it.

For classification to be most effective, it happens on two levels. Image-based classification looks at what the document looks like: it classifies documents based on their physical structure, which is a good indicator of type and very fast to evaluate. Contextual classification looks at what words are contained in the document: it goes one level deeper and looks for the keywords that would make a document one type rather than another. For some environments, image-based classification can do the job entirely. Once the classification is known, a classification engine can place the document in the correct spot in an existing taxonomy, and it is no challenge to apply tags, a file name, and a file location to the document.
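To make the two levels concrete, here is a minimal sketch in Python. It is only an illustration: the layout signatures, keyword lists, and the `/archive/...` path scheme are hypothetical stand-ins, and a real classification engine would derive the layout signature from the image itself rather than receive it as a string.

```python
from typing import Optional

# Hypothetical layout signatures and keyword lists, for illustration only.
LAYOUT_TYPES = {
    "grid-with-totals": "invoice",
    "letterhead-single-column": "letter",
}
KEYWORDS = {
    "invoice": {"invoice", "amount due", "remit"},
    "contract": {"agreement", "hereinafter", "party"},
    "letter": {"dear", "sincerely"},
}

def classify_by_image(layout_signature: str) -> Optional[str]:
    """Level 1: fast lookup based on what the page looks like."""
    return LAYOUT_TYPES.get(layout_signature)

def classify_by_context(ocr_text: str) -> Optional[str]:
    """Level 2: count keyword hits in the OCR text and pick the best type."""
    text = ocr_text.lower()
    scores = {
        doc_type: sum(1 for keyword in words if keyword in text)
        for doc_type, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def classify(layout_signature: str, ocr_text: str) -> str:
    """Try the fast image-based level first, then fall back to context."""
    return (
        classify_by_image(layout_signature)
        or classify_by_context(ocr_text)
        or "unclassified"
    )

def file_location(doc_type: str, doc_id: str) -> str:
    """Once the type is known, naming and filing follow mechanically."""
    return f"/archive/{doc_type}/{doc_id}.pdf"
```

The ordering mirrors the idea in the post: the cheap image-based check runs first, and the keyword pass only runs when the layout is not recognized.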

Workflow can stand alone, but injected with the power of OCR and document classification, it becomes a powerhouse that does not know the difference between paper and digital.

posted by Chris Riley at 0 Comments

Tuesday, March 2, 2010

Document Conversion and Law

Both CVISION Technologies and I had the pleasure of attending LegalTech 2010 this year in New York. I was quite impressed with the show and especially with how engaged the attendees were. Where do document conversion and compression technologies fit in the legal space? Here is a brief review of how the technologies are used in this vertical market.

File security:

Let's start with the most popular buzzword: PDF. PDFs are the most popular file format in legal because they can be secured and, with the right compression tools, can be very small. Security is fairly obvious, but compression not so much. Because many legal case management platforms, eDiscovery engines, and content management systems are billed by the megabyte of storage, keeping files small but usable is critical. The trend among these applications is toward fewer installed products and more hosted ones. The hosted products usually have a monthly service fee and charge by the amount of storage. Keeping the content valuable but small then becomes a real concern, especially when dealing with the hundreds of thousands of pages a case might contain.

Search-ability:

Lawyers work with a lot of paper, and getting at the right information is tough. That is why before a document can be loaded into any case management or eDiscovery system, it must be OCRed and made searchable. Good OCR is essential, as is the ability to get the documents converted quickly. Without OCR, eDiscovery simply cannot work on paper. Surprisingly, OCR was commonly an afterthought, yet poor OCR was a big complaint about products. Some organizations simply put the paper or image files aside, risking loss of valuable information. Others did not concern themselves with OCR accuracy and just assumed it was good enough. Both are mistakes, and I hope a dying trend in this particular market, as these organizations are only hurting themselves. Garbage in, garbage out.

Translation:

The number of translation companies at the show was large. Why? Because very often lawsuits involve a large collection of documents in several languages. In order for eDiscovery to work well, the data must be normalized, i.e. translated. The first challenge is to find the languages. It is a tremendous effort to go through a large collection of documents and identify each page on which a particular language occurs. The second challenge, for paper documents, is getting the data into a digital format so that manual or software-based translation can occur. OCR can facilitate both: it converts paper to digital, and language detection happens internally in the OCR engine during recognition. Again, as with searchability, the quality of the OCR is imperative, so law firms have every right to be concerned about which OCR engine they or their translation company use.

If you did not attend, I recommend you keep it on your radar for next year, or for the west coast version. While document conversion is not the favorite topic in legal, it finds its way into each step of case management, eDiscovery, and billing.

posted by Chris Riley at 0 Comments

Tuesday, February 23, 2010

Translating images

Text translation services come in a variety of forms, from individuals who make a good living translating documents from one language to another, to large firms using many individuals or purely software. No matter the form, they are all faced with a challenge when the text they need to translate is contained in physical paper or an image file.

Today, translation is facilitated with the use of word processing systems. Word processors give the translator the ability to be more efficient and manage the translation process over many sessions. But in order to use the capabilities of a word processing system, it's necessary to get the text into a digital format. That is where Optical Character Recognition comes in. OCR is one of the greatest tools in a translator's bag of tricks. It allows the individual to convert the image files and physical paper to digital text which can be consumed and translated.

The great thing about modern OCR is the sheer number of languages that are supported. Not only is OCR capable of converting a document in one language to digital text, but even if the document contains multiple languages, it's smart enough to know where one language begins and the other ends. If you can imagine the risk to a translator who receives OCR errors, you will see why making sure documents are scanned at the optimum quality is an important consideration. Modern OCR engines will tell the operator exactly where any confusion might have occurred and give them the opportunity to correct it. Documents scanned as 300 DPI TIFF Group 4 black and white images will give excellent results.

Without OCR, a translator's job becomes more of a data entry task than what they are truly skilled at, which is translation.

posted by Chris Riley at 0 Comments

Tuesday, January 26, 2010

Attachment Emailing Master

Very often in business, email correspondence is accompanied by a file attachment. While it's possible to attach a file of any format to an email (some are not preferred by email clients), the most common type is a document and the most common format is either Word or PDF. This post contains some advice on the best way to deliver documents via email.

When emailing documents, you have to be concerned about size, readability, and security. If the attachment is too large, you may not be able to email it at all. If the document is not readable, there is no point in sending it anyway. Finally, if it's not secure, it might be re-purposed or stolen. When your document starts out in paper form, the challenges increase.

There is an ideal format and set of conversion settings to use when sending documents via email. Ideally you would scan your document in color for visual readability. This is not the only type of readability; you also want to make sure the documents are accessible for long periods of time. You would use optical character recognition (OCR) so the document can be indexed by a search utility. You would use a compression tool to convert the initially large color image into one that is manageable without degrading the quality. Finally, you would use the PDF format to get whatever level of security you choose.

The combination of a searchable, compressed, color PDF is the ideal method for emailing documents as attachments and ensuring their effectiveness and long-term usage.

posted by Chris Riley at 0 Comments

Thursday, January 14, 2010

Replacement for fax right under our noses

How does a technology first invented in 1843 and executed in 1924 still exist as a primary function in our working lives? I'm talking about fax. Fax technology is old and outdated. I personally avoid fax simply on principle. But my principle alone will not make big changes in adoption. What people don't understand is that we have a fax replacement right under our noses, one that is both green and just as easy to use.

The combination of a document scanner, imaging software, and email software is a complete fax replacement solution. Instead of typing in phone numbers, users can type in email addresses. With fax you double the amount of paper that exists: paper in, paper out. With the document scanning approach you reduce paper consumption: paper in, email out. Most document scanners today even ship with a pre-configured “Scan to Email” option. On a production level, systems can be set up in offices, your local Kinkos, wherever, to allow multiple users to access the same document scanner and scan to any email address with a basic step-by-step wizard.

Not only does replacing fax with email save trees, it also increases efficiency and, when combined with workflow, document imaging, OCR, and data capture, adds much greater value to that single piece of paper.

These systems do in fact exist in small corners of the world, and I have participated in the development and setup of them. The adoption is still very low. What it comes down to is fear of change. People understand paper to paper. Many users of fax don't even know what email is. There are two ways this gets solved: time and forced adoption. While I would hope for the second, which would be a campaign of replacing all fax machines with scanners, it's very unlikely and would require unity among multiple competing entities.

No, I do not like fax, but I understand it. And I hope that sooner rather than later people see that a fax replacement that both saves trees and increases efficiency has existed for many years.

posted by Chris Riley at 2 Comments

Tuesday, January 12, 2010

Dual Stream Scanning – Have your cake and eat it too

The benefit of drop-out forms is that they are very accurate in data capture. The downside is that after they are scanned they aren't much to look at. Companies want the best of both drop-out and black and white forms. They go about this in various ways, the most common being to just deal with the images they have. Some will scan a document twice, which is very time consuming. Others will use an overlay utility that stamps the original form fields and labels back onto an already processed drop-out image. These utilities are accurate, but the result is not as accurate as the original and lines often end up stamped on text. The best solution for efficiently getting a scan that is optimal for both data capture and viewing is dual stream scanning.

Dual stream scanning is usually a feature of higher-end scanners. The technology is slowly moving down to workgroup and desktop scanners. The feature allows a single scan to produce both a drop-out image and a black and white image. The scan speed is the same as if you were scanning in color. When configured, the drop-out image goes down one path and the black and white image down another. By doing so, a company can use the drop-out image only for data capture, while the black and white image is married with the data capture results in the database or file system.

On average, data capture on a drop-out form is 15% more accurate than on a black and white scanned form, and the difference is often much higher. The reason is that the OCR in data capture is not interfered with by form lines printed on or too close to text. Additionally, the logic to locate fields can be simplified, as field labels are often in a small font and hard to detect.

It's simple and has the greatest accuracy of any solution: dual stream is a great tool.

posted by Chris Riley at 0 Comments

Thursday, January 7, 2010

Print to OCR?

When I talk to people about the unusual technique of printing text documents to image just for the purpose of running optical character recognition (OCR) or data capture on them, they are rightfully confused and think I'm a little nuts.

Why would you ever convert an already digital document back to an image? I promise it's not because I'm so fond of OCR; it actually has its purpose.

Language Detection: By converting a document to image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on a digital file, there is no robust tool that exists to do this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engines, if I convert a digital file to image and OCR it, I can tell you what languages exist in that document. Additionally, a font, while a clear indicator of language, will not tell a digital process what the language is unless it is accompanied by the proper language encoding; in OCR there is no need for such an encoding.

Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third-party tool look identical to the viewer, internally these PDF files are very different. In order to accurately parse a PDF file digitally, you need a standard format to be used. If you do not have a standard format, you are dealing with variations in the document both visually and structurally. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoices times the number of PDF-generating applications. However, if you were to OCR the PDF rather than parse it digitally, then you are dealing with only the number of variants that exist in the invoices themselves.

However crazy it sounds, the above two are real scenarios, and there are many more. I doubt that these problems will always exist, but it makes you think twice before dismissing seemingly crazy statements such as printing a digital document to image just so you can OCR it.

posted by Chris Riley at 0 Comments

Tuesday, January 5, 2010

Document Preparation

In some organizations, document preparation prior to scanning is the largest time cost in the document entry process. In all organizations, it's an important consideration. Document preparation is the process of sorting, organizing, and preparing documents to give the best chance of a successful scan and of accuracy in downstream software processes. Document preparation can be as simple as dividing pages into stacks small enough for a document scanner to handle, or as complex as staple removal, envelope opening, and document separation using page separators.

As recognition technology advances, the need for document preparation diminishes. New technologies allow for automatic document separation based on templates or keywords, automatic document rotation, annotation, sorting, etc. The challenge for organizations becomes picking which document preparation steps to hand to technology and which to leave to manual labor. This has been a challenging question, and as new technologies surface it becomes even more challenging.

If an organization keeps its focus on return on investment, the path should become clear. A complete evaluation of the technologies will show the accuracy and percentage of automation that can be accomplished with technology, and the amount of time and cost it will save. The tricky part of the evaluation is really in understanding the environment. Doing a study of what document preparation is currently done, and of all document preparation required for document entry, should be fairly straightforward. Listing the features of document preparation that can be handled by software, and the products that have them, is a little more complex and requires an organization to spend dedicated time on it. Separating and barcoding documents tends to be the biggest cost and the low-hanging fruit for automation. OCR software can determine where a document starts and ends using keywords, instead of a person manually placing separator pages or barcodes on the documents.
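As a rough illustration of keyword-based separation, the sketch below splits a scanned batch into documents wherever a page's OCR text contains a start keyword. The function name, the sample pages, and the keywords are all hypothetical; a production system would likely combine this with templates and confidence checks.

```python
from typing import List

def split_by_start_keyword(pages: List[str], start_keywords: List[str]) -> List[List[int]]:
    """Group page indexes into documents. A new document starts on any page
    whose OCR text contains one of the start keywords (case-insensitive)."""
    documents: List[List[int]] = []
    for index, text in enumerate(pages):
        lowered = text.lower()
        starts_new = any(keyword.lower() in lowered for keyword in start_keywords)
        # The very first page always opens a document, keyword or not.
        if starts_new or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents

# Hypothetical OCR text for a five-page batch scanned without separator sheets.
pages = [
    "INVOICE No. 1001 ...",
    "continued line items ...",
    "INVOICE No. 1002 ...",
    "Purchase Order 77 ...",
    "terms and conditions ...",
]
batches = split_by_start_keyword(pages, ["invoice no.", "purchase order"])
print(batches)  # [[0, 1], [2], [3, 4]]
```

Each inner list holds the page indexes of one document, which is the same information a separator page or barcode would otherwise carry.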

For most organizations, the result is a combination of manual and automatic. The ultimate goal would be to automate every step in document preparation that can be automated and leave only those that have to be manual, such as placing documents in a scanner.

posted by Chris Riley at 0 Comments

Thursday, December 31, 2009

Measuring Document Automation Efficiency

The two most common questions organizations ask when they are seeking document automation technology are “how fast is it?” and “how accurate is it?”. Many don't realize that the two are in opposition to each other most of the time. The more accurate a system, the slower it is, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is calculated.

Most companies that trial data capture calculate performance on the slowest step, which is optical character recognition (OCR). Companies will literally hit the “read” button and time it until the read is complete. This is what is considered the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entire entry process, but poor OCR could create an even greater bottleneck. Imagine an OCR engine that reads a document with 100 characters in 1 second, compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine would be better, but consider that the first engine may be 60% accurate, leaving 40 characters to be manually entered, and the other engine 98% accurate, leaving 2 characters to be manually entered or corrected. If you consider an average entry speed of 1.6 characters per second, then the 40 characters will take an additional 25 seconds to enter, for a total entry time of 26 seconds for the faster engine. For the slower engine it will take an additional 1.25 seconds to enter or edit the 2 wrong characters, for a total entry time of 4.25 seconds. This means that end-to-end, the slower engine gives a document automation process about 6 times faster than the faster engine does.
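The arithmetic above is easy to reproduce. The short Python sketch below uses the same figures from this post (100 characters, 1.6 characters per second of manual entry); the function name is just an illustrative choice.

```python
def total_entry_seconds(chars: int, ocr_seconds: float, accuracy: float,
                        keying_rate: float = 1.6) -> float:
    """OCR time plus the time to key in the characters OCR got wrong.
    keying_rate is manual entry speed in characters per second."""
    wrong_chars = chars * (1.0 - accuracy)
    return ocr_seconds + wrong_chars / keying_rate

fast = total_entry_seconds(100, 1.0, 0.60)  # 1 s OCR, 60% accurate
slow = total_entry_seconds(100, 3.0, 0.98)  # 3 s OCR, 98% accurate
print(f"fast engine: {fast:.2f} s, accurate engine: {slow:.2f} s, ratio: {fast / slow:.1f}x")
# fast engine: 26.00 s, accurate engine: 4.25 s, ratio: 6.1x
```

Plugging in your own page sizes, accuracy rates, and entry speeds is a quick way to compare engines on end-to-end time rather than on OCR time alone.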

This simple calculation illustrates the folly in assuming that a slower OCR time makes for a slower overall process. Usually, focusing on accuracy has the greatest benefit for an organization, unless you are improving the speed of a slower engine with hardware, or two engines are too close to see a benefit.

posted by Chris Riley at 2 Comments

Wednesday, December 30, 2009

The trick of the inverted text

The search for greater accuracy in document automation never stops. It's true that OCR technology has become so advanced that the jumps in accuracy with each new release are not what they were 10 years ago. Now, new versions of OCR engines contain enhancements for low-quality documents and vertical document types, but general OCR can't get much better. Because of this, modern integrations need to find new tricks. This blog is full of them, but I'm about to explain just one more: OCRing inverted text.

OCRing inverted text is nothing new. Many document types have regions where white text is printed on a black background. Modern engines are able to read this text. Typically it's not as accurate as OCR of black text on a white background, but it has its unique benefits, especially with complex document types such as EOBs and driver's licenses.

There is a trick in using inverted text OCR to increase overall OCR accuracy. The method is to first OCR a document normally, then use imaging technology to invert the image. When you invert the image, the black text on a white background switches to white text on a black background. Once the inversion is done, run OCR again. By comparing the two OCR results, you have essentially performed voting with a single engine and little effort.

Large volume processing environments can deploy this trick without loading a second OCR engine or applying different settings. It's important to note that when using this technique, how you compare the two results is as important as the process itself. Typically you will assign more weight to the original version of the document than to the inverted one. There you have it, one more tool for increasing the OCR accuracy of the engine you already use.
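One possible way to compare the two passes is sketched below in Python. It is an assumption-laden illustration: it presumes each pass returns words with a confidence between 0 and 1, that the two passes line up word for word, and that a weight of 0.8 for the inverted pass is reasonable. None of these come from a specific OCR engine.

```python
from typing import List, Tuple

Word = Tuple[str, float]  # (recognized text, engine confidence from 0.0 to 1.0)

def vote(original: List[Word], inverted: List[Word],
         original_weight: float = 1.0, inverted_weight: float = 0.8) -> List[str]:
    """Pick, word by word, the reading with the higher weighted confidence.
    Assumes both passes yield the same number of words in the same order."""
    if len(original) != len(inverted):
        raise ValueError("passes are not aligned word for word")
    chosen = []
    for (text_a, conf_a), (text_b, conf_b) in zip(original, inverted):
        # Agreement needs no vote; otherwise the original pass wins ties.
        if text_a == text_b or conf_a * original_weight >= conf_b * inverted_weight:
            chosen.append(text_a)
        else:
            chosen.append(text_b)
    return chosen

# Hypothetical results from the normal pass and the inverted-image pass.
normal_pass = [("Total", 0.99), ("arnount", 0.40), ("due", 0.95)]
inverted_pass = [("Total", 0.97), ("amount", 0.90), ("due", 0.60)]
result = vote(normal_pass, inverted_pass)
print(result)  # ['Total', 'amount', 'due']
```

The lower weight on the inverted pass reflects the point above that the original image usually deserves more trust; the exact weights would need tuning on real documents.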

posted by Chris Riley at 0 Comments

Tuesday, December 29, 2009

Rich Media OCR

I often speak of unique uses of OCR, and here is yet another. OCRing video files! But why? Part of the management of rich media assets is indexing these files. Technologies such as speech recognition and optical character recognition give a greater index and search value to rich media.

By using OCR technology to find and extract text from video frames, the data can be stored as metadata. In the simplest scenario, this is a text file that accompanies the video file. More complex environments will even tell you the minute and second at which the text occurs. Because this is not a traditional use of the technology, some special considerations apply.

First is converting and separating frames into individual image files. For the OCR to be effective, it needs to work on a series of images. Although a video is only a sequence of images shown at a high rate of speed, it's still somewhat of a challenge to convert video files such as MPEG to a series of images. Not only that, motion blur that might occur in some frames will also be a problem.

The second challenge is dealing with frames that are repeats. Because so many consecutive images are only slightly different from each other, the text on a series of frames might not change. Better OCR results will account for this and not repeat text every time a frame does.

The final challenge is dealing with the variety of fonts and their often small sizes. This requires an OCR engine with settings for specialized OCR, and one that is very accurate on complex, low-quality documents.
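The repeated-frame problem, together with the minute-and-second index mentioned earlier, can be sketched in a few lines of Python. This assumes the frames have already been extracted and OCRed, and that each result arrives as a (seconds, text) pair; the function name and sample data are hypothetical.

```python
from typing import List, Tuple

def collapse_repeated_text(frames: List[Tuple[float, str]]) -> List[Tuple[str, str]]:
    """frames: (seconds from start, OCR text) per sampled frame, in order.
    Returns (mm:ss of first appearance, text), skipping blank frames and
    consecutive frames whose text has not changed."""
    index = []
    previous = None
    for seconds, text in frames:
        # Normalize whitespace and case so trivial OCR differences do not
        # count as new text.
        normalized = " ".join(text.split()).lower()
        if normalized and normalized != previous:
            minutes, secs = divmod(int(seconds), 60)
            index.append((f"{minutes:02d}:{secs:02d}", text.strip()))
        previous = normalized
    return index

# Hypothetical OCR output for a handful of sampled frames.
frames = [
    (0.0, ""),
    (61.0, "Breaking News"),
    (61.5, "Breaking  news"),
    (62.0, "Breaking News"),
    (125.0, "Weather at 9"),
]
print(collapse_repeated_text(frames))
# [('01:01', 'Breaking News'), ('02:05', 'Weather at 9')]
```

Exact matching after normalization is the simplest possible comparison; frames with motion blur would likely need a fuzzier match to be treated as repeats.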

I expect that in the future, this technique in conjunction with speech recognition will be used in eDiscovery, content management, and robust search of rich media files.

posted by Chris Riley at 0 Comments

Friday, December 25, 2009

It's not that you don't want to, it's that you can't

Many of us tech heads are quick to give you an answer to your technical needs and propose a solution even if you did not ask. I'm no different: if you tell me you want your documents digital, I will explain OCR to you and then explain the best solution for your document types. To my dismay, if you work for a large company your response will likely be, "but I'm not allowed to install anything."

It's very common for large organizations to lock down their employees' computers to the point that each becomes more of an appliance than a computer. This lockdown makes perfect sense, especially considering the amount of personal and private information these organizations encounter. The lockdown, however, makes it very difficult for a technical operator to increase their efficiency with new technology. While the option stands to approach an IT department with requests for new technology, the chance of success, as we know, is very small, especially with the current situation of shrinking IT departments.

Most recently I was in a conversation with someone working for a bank. She had stacks of business cards that needed to be digitized and of course being the tech head that I am, I got excited and explained about business card reading ( BCR ), and that perhaps it would be easier to get a document scanner that could scan the business cards and everything else. But to no avail, she could not install the software.

The real hurdle with computer lockdowns is not so much hardware installation, which can be overcome with a simple request. It's the approval of new software, which requires many months of review. Because OCR is a software-driven process, this complicates things. Eventually, I hope that document automation becomes a part of the standard build for end users' machines. Until then, the solution is a scanner and an OCR service, either web based or on an intranet.

If an organization can centrally deploy an OCR server that users send documents to and receive results from, it will eliminate the risk of installed software. Alternatively, an end user with an attached scanner can leverage the web-based OCR services that exist, uploading documents via FTP or email and receiving results back.

I hope soon we all have OCR as a standard so we can start removing the reliance on troublesome paper, but until then, the OCR services exist to get the job done, and may sometimes be the preference.

posted by Chris Riley at 0 Comments

Tuesday, December 22, 2009

Already digital but still OCRed

I've faced unique projects in the last four years and in a few, the best approach even seemed to contradict my better logic. The projects I'm talking about are ones where the data we were working with was already in a digital format, namely a PDF file that was created digitally. What this meant was that all the text in the PDF was available and 100% accurate. So why then, to accomplish the project's goals, did we use OCR to read the already digital files as images?

I had intended for all these projects to do a logical parsing of the already digital content so I could get what I wanted. The problem is that even though the internal structure of the PDF has a logical standard, it's not used logically 90% of the time by most PDF-generating applications. PDF has in it a tolerance for mistakes that allows organizations to deviate quite drastically from the standard. What this means is that not only is the content in each PDF unique per company that generates it, it's also unique per application able to create it. Variations on top of variations make logical parsing very difficult. This becomes most obvious when the documents contain tables. Because of this, the only way to text parse the PDFs properly would be to flatten the internal logic so that they consist of nothing but text, but by doing so you lose some of the information pointing to where tables are and how they are structured.

You may have guessed by now that all my projects were to parse tables from PDFs. Not just any table, but specific tables in PDFs where each was in a unique format. As I said before, my preference would have been to use the 100% accurate data already in the PDF. What I ended up doing was OCRing the PDFs; because they were what is called "pixel perfect", the accuracy was very high. Now that I was using OCR, I was able to first recognize an entire document and remove everything that my OCR document analysis determined was not a table. Then I was able to use keywords to find the specific table that I wanted. Each project took me about 3 weeks of work, and the result was higher accuracy in table finding and only slightly lower accuracy in the text values than table parsing would have given.
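The two-step idea (drop everything that is not a table, then pick the table by keyword) can be sketched as follows. This is not the code from those projects; the block structure with `kind` and `rows` fields is a hypothetical stand-in for whatever an OCR engine's document analysis returns.

```python
from typing import Dict, List, Optional

def find_table(blocks: List[Dict], keywords: List[str]) -> Optional[Dict]:
    """Keep only blocks the OCR layout analysis marked as tables, then return
    the first one whose cell text contains every keyword (case-insensitive)."""
    tables = [block for block in blocks if block.get("kind") == "table"]
    for table in tables:
        text = " ".join(cell for row in table["rows"] for cell in row).lower()
        if all(keyword.lower() in text for keyword in keywords):
            return table
    return None

# Hypothetical document analysis output for one recognized page.
blocks = [
    {"kind": "text", "rows": [["Quarterly report for shareholders"]]},
    {"kind": "table", "rows": [["Region", "Units"], ["West", "120"]]},
    {"kind": "table", "rows": [["Invoice", "Amount Due"], ["1001", "250.00"]]},
]
match = find_table(blocks, ["invoice", "amount due"])
print(match["rows"][1])  # ['1001', '250.00']
```

Because the filter relies on the engine's own layout analysis, the keyword search only ever runs against table content, which is what made table finding more reliable than parsing the PDF internals.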

While it seemed most logical to do the parsing, in the end I saved over 5 man-months of work by using OCR.

posted by Chris Riley at 0 Comments

Monday, December 21, 2009

OCRing Magazines

Oftentimes when I receive printed periodicals, my preference is to OCR them to a digital, searchable format and read the articles I'm interested in on my computer, just like my online periodicals. One of these printed documents might be a magazine. Magazines are either very easy to OCR or very difficult, and usually both cases exist in a single magazine. It all has to do with the graphical elements that are often incorporated in magazines.

Text printed on graphics. Very often articles will have text printed over related graphics. If entire paragraphs are printed over a single graphic, it's less challenging; but when text overlaps graphic and white-space, it's problematic because a single word will change from color to black normal text in order to contrast the images.

Annotated images. Many magazines, including my favorite scientific one, include text as part of diagrams in the articles. To many this text may be irrelevant, but to me it provides important search words at the very least. These annotations tend to be in a small font and are often hard for the OCR engine to identify because of their close proximity to images.

The good news is that for the most part the purpose of OCRing any magazine is to make its text searchable. Anything more would probably be illegal. The other good news is that there are tricks to deal with each of these problems. First, a magazine that is being OCRed must be scanned in color. The additional information provided by the color scan will help the OCR engine distinguish graphics from text on graphics. Second, enable the engine's full recognition mode and any settings geared to small fonts. Third, turn off document analysis or enable limited document analysis. This is the less obvious setting. By disabling document analysis, you don't allow the OCR engine to get confused by strange structure, text printed on graphics, and annotated images. You are forcing it to read all possible text.

Being that text-searchable is the greatest benefit to OCRing my periodicals, I have opted for the OCR settings that produce the most text and the least structure. If you are converting similar documents, I recommend doing the same.

posted by Chris Riley at 0 Comments

Friday, December 18, 2009

What you OCR is what you get

Often the purpose of doing Optical Character Recognition ( OCR ) for individuals and companies is to get a digital version of a document that they intend to edit and/or re-purpose. This is not the most common use of the technology, but it is one that requires specific attention.

In order to convert a document so that it is printable later on, it's important to not only get the text from the document but also the format of the text. This includes layout as well as things such as graphics, and font colors. To do this, the OCR product must be able to recognize colors (requires color scanning), recognize font styles, and very importantly, recognize document structure.

Engines that support advanced document analysis have this. Document analysis ( DA ) is the process that happens before any text is read on a page. Document analysis makes sense of a document in order to improve recognition as well as get the formatting required for a formatted export. First, document analysis finds the document structure, i.e. columns, tables, text, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, the recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together into a result formatted as close as possible to the input document.

For those individuals who are concerned about the re-purposing of their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document in digital form and nothing more. For these individuals, it's important to find a solution that has good document analysis.

posted by Chris Riley at 0 Comments

Tuesday, December 15, 2009

Why buy what I already own!?

Many people inherit full-page Optical Character Recognition (OCR) technology by simply purchasing a scanner or a multi-function (MFP) device. All these pieces of hardware include various software packages and OCR is one of the most common. Often the software is never used or the use isn't always clear. Other times, the bundle is a tight integration with the hardware and the OCR is a part of configuration of the scanner and is used during scanning unknown to the user.

Bundled OCR technology is the easiest way to learn through use and to get the technology for a low price. Bundled software has contributed a great deal to market education and understanding around the advanced technologies. All the top OCR engines have a consumer product bundled with a document scanner or multi-function device. But because it's already there, it leaves many wondering why you would ever purchase the software directly.

For many, the bundled OCR is sufficient. Their documents are clean, and they have no demand for advanced options. But others just need more. This is why more advanced versions exist. Bundled OCR, even from the best vendors, is a limited or older version of the product. Some of the vendors make a special "bundle only version", while others choose to incorporate non-current versions. Buying the software directly gets you the latest technology with the best features, but the biggest driver to purchase is a greater, more specific need for OCR functionality. This could be because you are scanning old documents or degraded documents, or because you need special settings such as compression and PDF/A functionality that are simply not found in bundled versions.

Vendors don't make any money on bundled OCR other than to cover costs. Because vendors for the most part use bundled versions as marketing, they don't incorporate the latest, greatest, and most advanced features. For those to whom the document conversion process is very important, there is a clear benefit in quality OCR packages.

posted by Chris Riley at 0 Comments

Friday, December 11, 2009

Outsourcing document recognition

It's common for organizations to outsource their scanning and document conversion. Organizations sometimes find that the skill required, the convenience factor, and the liability are worth the additional cost. Other organizations that have one-time backlog conversions save money by using an outsourcing company vs. bringing the software in-house. In recent years service bureaus and business process outsourcing companies have dramatically improved their use of recognition technology, if they are utilizing it, and prices have dropped substantially. Though as an organization that chooses to outsource you are removing the responsibility of picking document conversion technology, do you know what technology your service bureau is using?

YOU SHOULD! Absolutely you should be concerned about the OCR and Data Capture technology that your outsourcing company is using. It's no less important than if you were bringing the technology in-house. It's your job to make sure your vendor is using not just the best technology but using it in the best way. The education level varies between outsourcing companies, and each often specializes in one document type or one type of processing. Proper evaluation of a service bureau will include a review of sample results. You should have your prospective service bureau or BPO run a good number of your production documents and provide you a result. Make sure the technology they used to produce the results is the same one that is used in production. Don't be afraid to ask the vendor what engine or engines are being used, even what version. Make sure you understand how your vendor handles exceptions.
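One way to review sample results is to score the vendor's output against text you have verified by hand. The sketch below is an assumption about how that comparison might be done, using Python's standard `difflib`; the score is a rough similarity measure, not a formal OCR accuracy metric, and the sample strings are invented.

```python
import difflib

def similarity(expected: str, ocr_output: str) -> float:
    """Rough 0..1 similarity between hand-verified text and OCR output."""
    return difflib.SequenceMatcher(None, expected, ocr_output).ratio()

expected = "Invoice 1001 total 250.00"
sample = "Invoice 1O01 total 250.00"   # letter O read in place of a zero
score = similarity(expected, sample)
```

Running the same comparison over a good number of production documents, and again once the vendor is in production, gives a simple check that the technology has not changed underneath you.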

While it's easy to overlook these items when you are looking at a service instead of a technology, it's important that you are educated. Service bureaus make money based on how much they save. This can occasionally create motives to use poor technology to gain greater margin. Some outsourcing companies put customers into categories by volume, and those with the greater volume get the best technology. Most of the outsourcing companies out there are very good at ensuring their document quality, and many will even go as far as to give you a guarantee on quality. But the nature of production environments is such that you cannot always check everything. It's about the relationship. Sometimes paying a higher price per page for a better solution is worth it!

posted by Chris Riley at 0 Comments

Thursday, December 10, 2009

Space age Optical Character Recognition

There are a lot of technologists out there who believe that optical character recognition has its days numbered and is an aged technology. The belief is that soon paper will go away. This post is for those who believe OCR technology is going away.

The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange ( EDI ), but in other areas it has actually increased. Studies have also shown that because documents are being scanned more, there is an increase in printing when the documents need to be shared or re-purposed. But I'm not here to argue that paper, and the document conversion technologies required to convert it, are not going away. I'm here to point out a few futuristic uses that technologists already like to talk about and that involve OCR.

Data Security

The first futuristic use I would like to discuss is OCR in data security. Text strings sent over the Internet are far easier to sniff, and the data they contain far easier to unlock, than a compressed JPEG image. What if, during transmission, you were to convert text to a JPEG compressed image and on the receiving end OCR it to get the data? By doing so the data has been masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.

File Compression

Storing ASCII text takes up far less space than an image or video file. As a part of the future of compression technologies, expect OCR to be incorporated to extract the text from an image and save it as ASCII. Viewers will convert the text back to an image during viewing. This removes the image portion of the text and significantly reduces file size.
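To get a feel for the size difference, here is a back-of-the-envelope calculation for one page. Every figure in it is an illustrative assumption (page size, resolution, characters per page), and it compares against an uncompressed black-and-white image; real image compression narrows the gap considerably.

```python
# Rough size comparison for one US Letter page (all figures are illustrative assumptions).
DPI = 300
WIDTH_IN, HEIGHT_IN = 8.5, 11
CHARS_PER_PAGE = 3000           # assumed average for a page of body text

width_px = int(WIDTH_IN * DPI)      # 2550 pixels across
height_px = int(HEIGHT_IN * DPI)    # 3300 pixels down
bitonal_bytes = width_px * height_px // 8   # 1 bit per pixel, uncompressed
text_bytes = CHARS_PER_PAGE                 # 1 byte per ASCII character

ratio = bitonal_bytes / text_bytes
```

Under these assumptions the image is a few hundred times larger than the text it carries.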

Robots

How else do you expect future robots to read text? OCR, of course. The eyes of the robot are essentially a camera that takes pictures rapidly. When the robot is faced with the comprehension of text, the image will be converted using OCR and fed through an engine to gain meaning from the text and act on it.

So there you have it, three really cool and cutting edge ways OCR is and will be used in the future. Paper is not going away, but even if it were just look at the other cool uses of OCR technology.

posted by Chris Riley at 0 Comments

Wednesday, December 9, 2009

OCR makes old systems new

One of the biggest challenges in the IT space is migration from legacy systems, often mainframes, to modern-day operating systems and applications. Legacy systems still exist today in the form of classic green-screen UNIX systems. Their life has been extended due to the critical nature of the data they contain. Modern-day standards have been put into place hoping to avoid this problem in the future. However, the applications for which conforming to standards seems most critical, such as hospital medical records systems, airline systems, and government systems, still do not conform to any. The vendors who make these systems have every intention of making them very hard to migrate from. But there is a way, and it works very well: OCR.

You may have seen in a previous post where I alluded to the possibilities of using OCR to scrape screen-shots. This is one of the best real examples of why the technology is so useful. When you don't have XML and ODBC or any of the other great standards that allow the exchange of data from one system to another, you always have what you can see, and if you can see it you can OCR it. If you can view the data on the screen you can move it to a new system.

Using OCR to either programmatically or manually read the portions of a screen where the legacy system window is displaying data, copy them to memory, and paste them into the new system is one of the most ingenious ways to ensure the neutrality of your data. Vendor lock-down attempts or old technology should not prevent you from getting to what you own: the information.
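Once the screen has been OCRed, the programmatic route comes down to turning recognized text into fields. The following is a minimal sketch under stated assumptions: the screen layout, the field labels, and the `parse_screen` helper are all hypothetical, and the OCR step itself is assumed to have already produced the text.

```python
import re

# Hypothetical OCR output from a screenshot of a legacy terminal screen.
screen_text = """\
PATIENT ID: 004512
NAME      : DOE, JANE
DOB       : 1970-01-31
"""

# An upper-case label, optional padding, a colon, then the value.
FIELD = re.compile(r"^\s*([A-Z ]+?)\s*:\s*(.+?)\s*$")

def parse_screen(text):
    """Turn 'LABEL : value' lines into a dict ready to load into the new system."""
    record = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            record[m.group(1)] = m.group(2)
    return record

record = parse_screen(screen_text)
```

Because green-screen layouts are fixed, a pattern like this tends to need writing only once per screen.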

Whether it's a manual process or a programmatic one, the ability to OCR screen-shots to migrate data is the hidden secret to cracking any proprietary software safe.

posted by Chris Riley at 0 Comments

Monday, December 7, 2009

Re-OCR, Lessons learned

To my surprise I still receive requests from companies needing to start over on their OCR processes. These are companies that have used the technology, did not plan, and now find themselves in a situation where they have to repeat OCR efforts. The companies fall into two categories.

The first category is companies that find they have processed large volumes of paper and the accuracy was not what they expected. This can be discovered in a relatively short time-frame or long after initial integration of the technology. The fix can be as easy as correcting bad settings for a particular document type or as bad as correcting a bad choice in software solutions.

For companies in category one it's truly a lesson learned scenario. I will work with these companies to evaluate proper OCR settings and to test future prospect engines. The hope of mine is that the company at least scanned their documents at a high enough quality that already converted or scanned images can be used for backlog conversion versus a re-scan if that is even possible.

The second category is companies that discovered they were collecting too little data from their documents. This usually happens in data capture environments where companies configure the capture of 3 key fields only to find later that an additional 2 fields were required for downstream processes. Depending on the severity, it's often better to do day-forward processing with proper settings on new documents and to key in the missing fields for the incorrect documents. The reason for this is that sometimes the work of getting the additional fields and reconciling old documents takes away from day-forward production and may not be worth the additional cost it imposes. The other common practice is to run the backlog documents from scratch through the new process.
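Deciding between keying in and a full re-run starts with knowing which backlog documents are short which fields. A small sketch of that check follows; the field names and records are hypothetical, standing in for the 3 original fields plus the 2 discovered later.

```python
# Hypothetical field list: 3 originally captured fields plus 2 found to be required later.
REQUIRED = {"invoice_no", "date", "total", "vendor", "po_number"}

def missing_fields(record):
    """Return the required fields that are absent or empty, sorted for stable output."""
    return sorted(f for f in REQUIRED if not record.get(f))

backlog = [
    {"id": "doc-1", "invoice_no": "1001", "date": "2009-11-02", "total": "250.00"},
    {"id": "doc-2", "invoice_no": "1002", "date": "2009-11-03", "total": "80.00",
     "vendor": "Acme", "po_number": "PO-7"},
]

# Documents that still need fields keyed in, with the fields each one lacks.
to_key = {doc["id"]: missing_fields(doc) for doc in backlog if missing_fields(doc)}
```

The size of `to_key` relative to the backlog is a reasonable first input to the cost decision.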

The trend in both categories is improper planning by the organization before evaluating technology. It's important for companies to take the time and plan for capture technology. A part of this planning is forward looking need for the data. One of the best tricks to exposing the requirements is to involve ALL constituents that create, use, and benefit from extracted data. Plan, Plan, Plan.

posted by Chris Riley at 0 Comments

Friday, December 4, 2009

OCR-and-Paste

You probably use the copy and paste functionality on your computer daily. I too use copy and paste on a regular basis, but I also use OCR and paste nearly as much. OCR and paste is what I call the process of selecting a region on your computer screen and using OCR to read that region as a screen-shot and convert it to text. Even to my surprise, it's become quite the habit and one of my favorite ways to move data from one location on my computer to another. Many wonder why this might be the case, as most information on the screen is available as text anyway. The reasons are: it's more efficient than copy and paste into a program, it maintains the structure of information using document analysis, and there are times when the information I want is not in text form but in an image only.

I have actually taken it one step further and used the technology to automate the extraction of data from web pages that are scroll heavy. Instead of scrolling forever for information on a web page, I can use the tool to take a screenshot of the entire web page and convert it to text for me. You can imagine how the technology could be used maliciously, but in this case it's just to get information.

The ability of OCR to read screen-shots is quite impressive. Though screen-shots usually come out in low 72 or 96 DPI resolution which is traditionally not optimal for OCR, the text and text in image is what is called pixel perfect so it provides an excellent candidate for conversion. Also leveraging document analysis technologies built into OCR I can grab a table and have it export a table versus having to copy and paste text and manipulate back to original form later.

When you become an expert in OCR you find yourself using the technology in the oddest places, but this is one case where my productivity has increased because of the tool, and I think it's worth sharing. I suspect that OCR of screen-shots is only going to increase in the future because of this, because of malicious uses, and because of counter-malware technologies. It is also a very easy way to convert data from a locked-down legacy system to a new one.

posted by Chris Riley at 0 Comments

Thursday, December 3, 2009

Turning the latest and greatest off

Our culture is built on the idea that the newer and the more, the better. In advanced technologies this is for the most part true, but people are always surprised when I tell them that disabling some of the newer technology will actually produce a better result. I am going to give you three examples where the use-case demands time travel to older approaches for higher accuracy.

In data capture and OCR there is a component of the technology called document analysis. Document analysis, prior to any collection of data, determines the structure of a page including columns, rows, tables, pictures, paragraphs, lines, etc. It's the biggest contributor to modern-day OCR accuracy. Document analysis is really designed for traditional documents such as an article, a book page, or a letter. Document analysis ( although there have been special ones ) does not excel at form-type documents. One of the most difficult documents in the world is an Explanation of Benefits ( EOB ). This document typically has its own structure per variant. Surprisingly, the best way to process such a document is to turn off document analysis and default to a basic full-page read of the text. The reason for this is that document analysis brings an overwhelming bias toward tables that no EOB will match.

Reading text from photographs is similar. Reading text from license plates and product plates ( serial number plates welded or stuck to many products ) during assembly is best done with engines that do not have document analysis. In this case the document analysis is trying too hard to find information. Because of the nature of these images, what ends up happening is that characters in the photo are split into multiple lines and characters. Without document analysis the engine sees the whole image as one text block and just reads it, which gives better results. The license-plate readers that snap pictures of your license plate at toll booths are all using older, antiquated OCR technology. By turning off document analysis they could use the newer engines.

Finally, mobility. This one makes a lot of people uncomfortable. Our society wants to believe a cell phone can do anything. Just today I was wondering why my cell phone did not brush my teeth for me. You can have your cell phone do OCR, sure, but it requires older, smaller, and more limited OCR engines to do so. I prefer to send an image to a server and use more advanced OCR, but many demand OCR on the phone even though in practice it's usually slower. The reason for this is that OCR requires specific processing power and specific types of processing. Chips in phones today, and likely for a very long time to come, will not compete with the power of a computer, nor, most importantly, will they include the math operations it takes for efficient, math-heavy modern OCR. Cell phones cannot adopt such chips because we demand long-lasting batteries, small size, and low cost. Intense math is simply not important for 99.9% of mobile applications.

There you have it. Modern OCR taken down a few notches to solve current day problems. The best engines that exist today allow you to turn on and off all the various functionality you need thus making it possible to purchase the latest OCR technology and limiting it however you need. Most organizations don't understand why anyone would want to turn off the new but today I've proven new is not always better!

posted by Chris Riley at 0 Comments

Wednesday, December 2, 2009

Playing tricks on your images

Often organizations have no control over the images they have received. Images can come via fax which has a varied range of resolutions, or they can come as poor scans. All of which are no good for data capture and OCR processes. Fortunately there are a lot of imaging tools and tricks out there to help. None of these tools replace a good scan but some get close. One of the tools not often thought about is up-sampling.

Up-sampling is the process of taking an image at a lower resolution and increasing it to a higher resolution. The technology basically increases the resolution of an image, then fills in the new empty pixels with values predicted from the original image. For data capture and OCR, up-sampling is usually done from 150 DPI or 200 DPI to 300 DPI. Up-sampling technologies have become very impressive and useful. I will often recommend up-sampling over working with the lower-resolution source. But let's talk about the facts and how and when you should consider up-sampling.
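To make "fills in new pixels with predicted values" concrete, here is a minimal sketch using bilinear interpolation on a tiny grayscale image stored as a list of rows. Bilinear interpolation is simply one common prediction method chosen for illustration; commercial imaging tools may well use more sophisticated filters, and the `upsample` function is not any product's API.

```python
def upsample(image, factor):
    """Bilinear up-sampling of a grayscale image given as a list of rows of numbers."""
    src_h, src_w = len(image), len(image[0])
    dst_h, dst_w = int(src_h * factor), int(src_w * factor)
    out = []
    for y in range(dst_h):
        # Map the destination row back to a (fractional) source row.
        sy = min(y / factor, src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(dst_w):
            sx = min(x / factor, src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend the four surrounding source pixels by distance.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

small = [[0, 100],
         [100, 200]]
big = upsample(small, 2)   # the same ratio as going from 150 DPI to 300 DPI
```

Every new pixel is a guess built from its neighbors, which is why noise and crowded text in the source get smeared into the result.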

Up-sampling should be considered on documents that have a low amount of noise such as watermarks, spills, stains, stamps, and speckling. Essentially, these are documents of good quality and a good scan but low resolution. You should also avoid up-sampling documents with close spacing of elements and text crowding. In both of these scenarios (noisy documents and crowded documents) it's better to work with the source image as-is and work around the problems.

The bigger the gap between the source resolution and the desired resolution, the greater the risk that fragments exist after up-sampling. For example, 150 DPI to 300 DPI will not yield the quality that 150 DPI to 200 DPI will. This is why going crazy and up-sampling to the highest possible resolution is not a good idea. It's like taking a very small image and trying to zoom in as far as you can to get detail; you probably won't. Trying to trick the system will only hurt you. Up-sampling from 150 DPI to 200 DPI and then again to 300 DPI would not be better than converting directly from 150 DPI to 300 DPI. In fact this would be a pretty big mistake. Essentially, what you do when you do this is magnify the mistakes created during up-sampling, as they now get propagated twice over. This will likely decrease your quality and can result in such things as bloated characters, fuzzy characters, or an abundance of speckling. The goal is to do as few conversions on the document as possible.

I will always defer to a proper scan over any image techniques, but when you do not have control of the image scan one of the image tools to consider is up-sampling. Uneducated use of the technology is unsafe as is true with all advanced technologies, but if you stick with the facts, and pick a great technology you will be successful.

posted by Chris Riley at 0 Comments

Thursday, November 26, 2009

eDiscovery and OCR

I have touched on this topic a little in one of my previous posts, but because of eDiscovery's popularity I thought it fitting to look at OCR's interaction with eDiscovery preparedness. Organizations that are not ready for audits and court orders to deliver documents are spending tremendous amounts of money to undo bad document processes. Because of this, preparing yourself for possible future legal events is critical and a long-term cost saver.

The purpose of OCR technology in conjunction with eDiscovery readiness is based on the principle of having as much data at your fingertips as possible. Being ready relies heavily on proper records management policies and a good taxonomy that is strictly followed. Because of this, OCR is sometimes overlooked as a tool. With the practices above it should be possible to pull up any document at any time. However, OCR should be viewed as an insurance policy: by OCRing every document you have even more information than you would otherwise, and information is the key to success in these situations.

eDiscovery also includes other types of data, email is one of the most popular. But what about the data contained in email attachments that are PDF, TIFF, JPEG? OCR is the only tool to extract the data from the images in these formats. Surprisingly products that provide eDiscovery tools just for email still do not yet heavily deploy OCR technology, but the information contained in these attachments is often as valuable as the emails themselves.
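Finding the attachments worth sending to OCR can be done with Python's standard `email` package. The sketch below is an illustration only: the list of content types and the helper name are assumptions, and a real pipeline would still need to check whether a given PDF already carries a text layer.

```python
from email.message import EmailMessage

# Content types that usually hold image-only content (an assumption for this sketch).
OCR_TYPES = {"application/pdf", "image/tiff", "image/jpeg"}

def attachments_needing_ocr(msg):
    """Return filenames of attachments whose content type suggests image-based content."""
    return [
        part.get_filename()
        for part in msg.iter_attachments()
        if part.get_content_type() in OCR_TYPES
    ]

# Build a small sample message with one PDF and one plain-text attachment.
msg = EmailMessage()
msg["Subject"] = "Signed contract"
msg.set_content("See attached.")
msg.add_attachment(b"%PDF-1.4 placeholder", maintype="application", subtype="pdf",
                   filename="contract.pdf")
msg.add_attachment("plain notes", filename="notes.txt")

queue = attachments_needing_ocr(msg)
```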

In addition to all the traditional proper records management practices, and eDiscovery tools, OCR should be considered as a must have for organizations preparing themselves for audits or court orders, and sometimes even more importantly knowing what to omit.

posted by Chris Riley at 0 Comments

Thursday, November 19, 2009

Let the OCR do the talking for you

I've covered various interesting and non-conventional uses of OCR. I would like to talk about a new one: OCR to speech. The blind community is familiar with the technology, and it assists them in their everyday lives. The key to OCR to speech is simplicity. When the concept was first developed it required a very elaborate combination of software and hardware; now it's possible to take the latest and greatest OCR technology and make it talk for you with a simple configuration.

It requires a document scanner with an easy physical button interface, programmed to scan an image at 300 DPI to a folder on a machine. Traditional documents work very well for OCR to speech; documents that have a lot of graphics and non-traditional formats may be more challenging. It's important that the technology is able to omit garbage. To do this, the OCR process should be driven by a dictionary. The words recognized must be in this dictionary or they will not show up in the final results. The reason for this is that a lot of time can be wasted if bad recognition results are spoken.
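Many engines apply the dictionary during recognition itself; the sketch below shows the same idea as a post-processing filter, which is an assumption made for illustration. The word list and helper name are hypothetical.

```python
import re

def keep_dictionary_words(ocr_text, dictionary):
    """Drop any recognized token that is not in the dictionary before it is spoken."""
    tokens = re.findall(r"[A-Za-z']+", ocr_text)
    return " ".join(t for t in tokens if t.lower() in dictionary)

dictionary = {"the", "meeting", "starts", "at", "noon"}   # hypothetical word list
spoken = keep_dictionary_words("The meetlng starts at noon ~#|", dictionary)
```

Note the trade-off: a misread word is silently dropped rather than spoken wrong, and anything outside the dictionary (names, numbers) is lost too.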

Once the OCR engine has done its job of accurately and automatically converting an image to text, the ASCII text results from OCR will be saved into a directory. Now it's time to automatically put the text to speech. There are many text-to-speech applications out there, some free, some for pay. The goal is to find one that also reads results from a directory and automatically speaks the text over computer speakers.

It can be that easy! Some users of such technologies spend more time trying to find an acceptable digital voice than actually configuring the solution. I assure you the packages exist and, when configured correctly, are very accurate. One scanner, one hot-folder-driven OCR application, and one hot-folder-driven text-to-speech application will give a robust OCR-to-speech solution that can be set up in minutes.
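For readers curious what "hot folder driven" means mechanically, here is a minimal polling sketch. It assumes the OCR application drops `.txt` files into a folder; the `speak` argument is a stand-in for whatever text-to-speech call you use, and the function names are hypothetical.

```python
import tempfile
import time
from pathlib import Path

def process_hot_folder(folder, speak, done_suffix=".done"):
    """Hand each new .txt file in the folder to speak(), then rename it so it is not repeated."""
    handled = []
    for path in sorted(Path(folder).glob("*.txt")):
        speak(path.read_text(encoding="ascii", errors="replace"))
        path.rename(path.with_name(path.name + done_suffix))
        handled.append(path.name)
    return handled

def watch(folder, speak, interval=2.0):
    """Poll the hot folder forever; stop with Ctrl+C."""
    while True:
        process_hot_folder(folder, speak)
        time.sleep(interval)

# Demonstration with a temporary folder and a list standing in for real speech output.
spoken = []
with tempfile.TemporaryDirectory() as hot:
    Path(hot, "page1.txt").write_text("Hello from the scanner", encoding="ascii")
    first = process_hot_folder(hot, spoken.append)
    second = process_hot_folder(hot, spoken.append)   # nothing new on the second pass
```

Off-the-shelf packages do this for you; the point is only that each stage watches a folder and hands its output to the next.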

posted by Chris Riley at 0 Comments

Monday, November 16, 2009

Digital Ink – it's not OCR or ICR

Digital ink is the approach of having a touch screen device monitor a user's movements with a stylus on the screen to determine which character was written. This is not OCR or, more specifically, ICR. Very often companies have asked for OCR technology when they meant digital ink and vice versa. OCR and digital ink overlap, but not always. There are cases where you simply cannot do away with paper, not to mention that digital ink does not process typed text.

One of the first times the technology was widely seen was when Apple released the Newton, an early PDA with a touchscreen and stylus. Palm later popularized the approach with its own PDAs. At that time you had to re-learn how to write characters according to a guide. The characters were specifically structured to provide the best recognition and had to be completed in a single hand-stroke. When mastered, the recognition was very good. Now any tablet PC has a basic version of digital ink software. Digital ink competes with ICR ( intelligent character recognition, or hand-print ). Whereas ICR technology is looking at an image of the characters written, digital ink is monitoring hand strokes as the character is being written.

The accuracy difference between the two is an argument that either side can very easily lose. There are times when digital ink is far more accurate and times when ICR of paper forms is more accurate. The key really is the business process that the technology is fitting into. Both have their place. Digital ink is usually combined with an elaborate data entry and content management process. Most often digital ink is not about getting a substantial amount of text from the operator but more about the operator quickly answering simple questions, usually requiring no writing at all. The number of characters entered in a digital ink scenario is many times smaller than in an ICR-of-a-form scenario. You will not see tablet PCs sent out in the mail to survey a customer base.

The biggest place digital ink is used today is in health care, where the drive is to increase its adoption even more. The purpose of the technology in this space is to rapidly populate medical records at the point of examination. However, health care still remains one of the top paper-generating industries requiring OCR and ICR. This shows that the two technologies satisfy very different needs and should not be confused with each other.

posted by Chris Riley at 0 Comments

Thursday, November 12, 2009

Not quite as fun as the DMV

Understanding the different licensing that is available for data capture and OCR products can sometimes be difficult, but I assure you that the complexities involved will not be as painful as a trip to your local department of motor vehicles. There are a few aspects of licenses that trip up some users, namely license type (dongle or serial number), activation process, and finally page-counts.

License type can be very important but is not often clearly explained. The most common license type out there is the “software license”. This is a license file tied to a specific machine. The benefit of such a license is that it's more efficient and easier to install on servers and hardware that are not local. The downside is that because it is tied to a machine, if the machine dies you may have downtime while waiting for a replacement and proving destruction, or you may have to purchase a new license. Another very common license type is a hardware dongle. Dongles now are most often USB devices very similar to the USB thumb drives we are all used to. The benefit of this type of license is that the software can be installed on every machine in the organization, but only the machine with the dongle plugged in can run it. This means that if something happens to one machine it would be very easy to switch to another. The downside of this type of license is that the dongle can be lost, and it's not the most efficient. Whatever license type you have, you will next need to go through the activation process.

Activation can be troublesome for some products and very simple for others. The difference is usually the installer's effort in understanding the activation process BEFORE any installation. For many of these products activation has as many as three steps, usually in this form: sending an activation request, receiving an activation file, and installing the activation file. The trend is toward web activation, which is becoming more popular, but because of the premium on some advanced data capture products these steps are still required. Now, with an activated license, the most important question: what does a license give you?

Licenses are usually set with general operation rights, purchased add-ons if they exist, and very commonly a page count. Page count is the biggest point of contention for almost any purchaser. Because of this, almost all vendors offer an unlimited page-count license for a premium. In the end, most companies end up with a page-count license and are quite happy. The argument I would like to pose is that a piece of hardware inherently has a page count, as each piece of hardware can physically process only a certain number of pages per day, month, or year. For this reason page-count licensing is actually quite reasonable, but it is a slowly dying trend. In the future I expect to see far fewer page-count licenses. For most businesses pages are counted on a monthly basis, but some seasonal companies may elect for an annual or pure page count.

License structure is important to ALL organizations, and I encourage companies to spend time during the discovery phase of technology acquisition investigating the structures available from each vendor and how they may work in your environment.

posted by Chris Riley at 0 Comments

Sunday, November 8, 2009

Tax Return OCR

If you are thinking about using data capture to read text from tax returns, now is the time to start thinking about the steps to accomplish this. Reading typographic tax returns from current and previous years has proven to be very accurate and a great use of data capture and OCR technology. Tax returns fall into the medium-complexity category to automate. There are a few things that make tax returns unique.

Checkmarks: Tax returns have two types of checkmarks. The first type is standard and printed in the body of the document; these can be handled like all other common checkmark types. The other type is unique to tax forms and is typically on the right side of the document: boxes that can be filled with either a character or a checkmark symbol. With these checkmarks the best approach is to create a field covering the entire area where the checkmark can be printed and set the checkmark type to “white field”. In this case the software expects only white space, and the presence of enough black pixels will cause the field to be considered checked.

Tabular Data: Much of the data in a tax form is presented as a table. When capturing data from a table, organizations have to decide whether to capture each cell as its own field OR to capture the table as a single table field that must be parsed later. This can dramatically affect the exported results, so knowing beforehand is very important.

Delivery Type: Tax forms usually arrive either as an eFile, a pixel-perfect document that is never printed and never scanned, or as a scanned document that was first received as paper. For the most part the eFile version will be more accurate; however, it has non-traditional checkmarks that could cause a problem. Organizations need to decide whether to process all delivery types together as a single type or to separate them. There are advantages to both: combining them reduces integration time, while separating them yields higher accuracy.

I would much rather OCR a tax return than file one. Because of this, my skills in processing tax returns are better than my skills in creating them, and I hope that today I imparted some of that knowledge.

posted by Chris Riley at 0 Comments

Tuesday, November 3, 2009

Fixed, Semi-structured, UNSTRUCTURED!?

I find myself educating even industry peers on the topic of document type structure more and more recently. Often the conversation starts with one of them telling me that unstructured document processing exists, OR that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let's start by defining a document. A document is a collection of one or many pages that has a business process associated with it. Documents of a single type can vary in length, but the content contained within, or the possibility of it existing, is constrained. Data capture technology works on pages, so each page of a document is processed as a separate entity. This, it seems, is the meat of the confusion.

Often someone will say a document is unstructured when what they are thinking is that the order of pages is unstructured. This is more or less accurate; however, the pages within this unstructured document are either fixed or semi-structured. The only truly unstructured documents that exist are contracts and agreements. The test is this: if at any moment you can pull a page from the document and state what that page is and what information it would have, then it IS NOT unstructured.

The ability to process agreements and contracts is limited to very concrete scenarios where the contracts have essentially no variants, which in effect means they are not unstructured either. In general, the ability to process unstructured documents does not exist. Now to explore the difference between semi-structured and fixed.

It's actually very easy: 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make it fixed. For example, a tax form always has the same general location in which to print the company name. The printer has to print within a specified range, but it can print more to the left or more to the top, and the length will vary with every name. This makes it semi-structured. Additionally, when this document is scanned it will shift left, right, up, and down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.
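To make the role of registration marks concrete, here is a minimal sketch of the idea, not how any particular product does it. It assumes a single registration mark and a pure left/right, up/down shift (no skew or scaling); the `Box` type and the `align_field` function are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """A field rectangle in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int


def align_field(template_field: Box,
                template_mark: Tuple[int, int],
                found_mark: Tuple[int, int]) -> Box:
    """Shift a template field by the offset between where the registration
    mark sits in the template and where it was found on the scanned page."""
    dx = found_mark[0] - template_mark[0]
    dy = found_mark[1] - template_mark[1]
    # On a fixed form only the position moves; width and height stay fixed.
    return Box(template_field.left + dx, template_field.top + dy,
               template_field.width, template_field.height)


# The scan came in 7 pixels to the right and 4 pixels higher than the template.
company_name = Box(left=100, top=200, width=300, height=40)
print(align_field(company_name, template_mark=(50, 50), found_mark=(57, 46)))
```

A semi-structured page has no such mark to anchor against, which is why its fields have to be searched for rather than simply shifted into place.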

There, again, the confusion is exposed. When having conversations about data capture, it's very important to understand the true definitions of the lingo that is used. I task you: if you catch someone using the lingo incorrectly, correct it. It will help both you and them.

posted by Chris Riley at 0 Comments

Monday, November 2, 2009

Users of OCR doing it right

Good article on acquiring OCR technology from a service bureau and end-user perspective. I especially like the point about soft costs, which is in line with my recent market education on planning.


8 things to consider when deciding to buy or rent OCR capabilities

posted by Chris Riley at 0 Comments

The little secrets to OCRing large maps and drawings

Occasionally the need to convert large documents such as maps and engineering documents comes along. Many times the OCR requirement is limited to a small subset of fields and clearly defined, but when it comes to converting the entire document to get as much text as possible there are many things you need to consider.

First, if you already have the ability to scan large-format drawings or are receiving images of them, congratulations, as this can be one of the biggest challenges. Scanning large-format documents requires either a large-format scanner or stitching of partial scans ( less preferred ). Because these documents have small fonts, it's important to scan at 300 to 400 DPI. For maps, because of the amount of graphics, dropping out all colors would be ideal, or else a thresholded black-and-white scan that leaves mostly text in the image.

The purpose of OCR for most of these documents is indexing and searchability, so the goal is to get as much text as you can. For maps with a good scan you should be able to get the majority of the text, except for names printed on a curve. Running line straightening on these might work but will more likely hurt the recognition of the rest of the map, so I would recommend avoiding it. Prior to OCR, set your OCR engine to disable auto-rotate; there are a lot of things on these documents that can cause a misrotation, namely text printed in every direction.

Now to the secret: it has to do with rotation. Depending on the setup of the drawing or map, if you OCR the document at every 90 degrees, then once you complete a full 360 degrees you will have the majority of the text. That's right, I'm suggesting that you OCR the document four times, hopefully in an automated fashion. Now this might leave you thinking that you will end up with a lot of garbage, and you're right. But with the final OCR result you can simply use a dictionary to remove all the garbage text.
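Here is a rough sketch of what that automated four-pass loop could look like. Everything in it is an assumption for illustration: the image is a plain grid of pixel values, `ocr_fn` stands in for whatever call your OCR engine's API actually provides, and the dictionary is just a set of lowercase words.

```python
import re
from typing import Callable, List, Set

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image type


def rotate_90(image: Image) -> Image:
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def ocr_all_rotations(image: Image,
                      ocr_fn: Callable[[Image], str],
                      dictionary: Set[str]) -> Set[str]:
    """OCR the page at 0, 90, 180 and 270 degrees and keep only the
    words that appear in the dictionary, discarding garbage text."""
    found: Set[str] = set()
    current = image
    for _ in range(4):
        text = ocr_fn(current)  # hypothetical engine call, auto-rotate off
        for word in re.findall(r"[A-Za-z]+", text):
            if word.lower() in dictionary:
                found.add(word.lower())
        current = rotate_90(current)
    return found
```

For maps, a plain language dictionary will throw away many place names, so in practice the word list would probably need to include a gazetteer or other domain vocabulary.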

The end result is a map or drawing with the greatest possible amount of index-level text. I admit that I made it sound a little easier than it is, and most likely you will require an API to get the full job done, but the possibility exists and it has been proven successful.

posted by Chris Riley at 0 Comments

Friday, October 30, 2009

How many keystrokes does it take to get to the center of accuracy?

Oftentimes we are blinded by technology and forget the pain we originally adopted technology to solve. When I first learned accounting, more tenured accountants would explain to me how they made journal entries on paper, not in QuickBooks. When I learned math, I was freely solving complex equations on my graphing calculator as my professor explained how long these equations would take without it. OCR is no different. OCR is replacing manual data entry that is not very accurate. If an OCR system is 85% or more accurate on a particular document type, then it is most likely more accurate than a single entry by a human on that same document type, and faster!

So we know there is a clear benefit to the technology: increased speed and increased accuracy. It's when companies want to be 100% accurate that they start to groan. Before OCR, and even today, to reach 100% accuracy with data entry they did double- or triple-blind data entry, at double or triple the labor cost. What that means is that two separate people data enter the same document and the results are compared; make this three people and you will almost always be 100% accurate. You can do the same with OCR! Most large service bureaus in fact prefer that OCR technology make the first pass, and then they do one pass of manual entry, making it double-blind. I'm going to suggest going one step further.

Why not have OCR with settings geared towards numbers and OCR with settings geared towards words ( our two separate data entry people ) both enter the same document, and then compare the results? Why not three sets of settings, maybe four? If you run the same OCR engine with different settings and compare the extraction results from each instance, you are creating automated double-blind data entry! You can replicate the trusted process for producing high accuracy with greater efficiency and lower cost.
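The comparison step can be sketched in a few lines. This is only an illustration under assumptions: each OCR pass is assumed to hand back a simple mapping of field name to extracted text, and `compare_passes` is a made-up name, not a function from any OCR product.

```python
from typing import Dict, List, Tuple


def compare_passes(passes: List[Dict[str, str]]) -> Tuple[Dict[str, str], List[str]]:
    """Accept a field only when every OCR pass extracted the same value;
    anything missing or in disagreement is listed for human review."""
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    field_names = sorted({name for result in passes for name in result})
    for name in field_names:
        values = [result.get(name) for result in passes]
        if all(v is not None and v == values[0] for v in values):
            accepted[name] = values[0]
        else:
            needs_review.append(name)
    return accepted, needs_review


# Pass one tuned for numbers, pass two tuned for words (values are invented).
numbers_pass = {"invoice_no": "10482", "total": "1,250.00"}
words_pass = {"invoice_no": "10482", "total": "1,250.0O"}
print(compare_passes([numbers_pass, words_pass]))
```

Requiring every pass to agree is the strictest choice. With three or more passes a majority vote would be another option, at the cost of occasionally accepting a wrong value.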

I am a constant advocate of human intervention on low-confidence fields or characters, but in the above approach you are using more technology to replicate existing, very accurate processes. Never forget the original problem and you will see very quickly that OCR is a benefit.

posted by Chris Riley at 0 Comments

Tuesday, October 27, 2009

Read signatures maybe not, make sure a doc is signed, easy!

A lot of the documents we encounter require there to be a signature. In data capture these documents add complexity, as an operator has to make sure each document is signed, either before or after data capture. When a document is not signed, it very often has to go down a different path of approvals. Often organizations will ask OCR vendors to read the signature on a form. The ability to recognize signatures is very expensive and requires a database of pre-existing signatures, so it is often not feasible. But the ability to find a signature and confirm its presence is not that difficult at all.

Because documents with a signature line almost always have to be checked to ensure a signature is there, this is an additional processing step. However, companies often don't realize that the data capture software they are using can get all the fields off of the document and also accurately check whether a signature is present. By doing so they remove the additional step and can flag only the documents that are not reporting a signature.

Using OMR ( optical mark recognition ) technology you can determine if a signature is present. In its simplest form, OMR checks whether there is a substantial amount of black pixels in a white space. At a certain threshold of black, the field is considered checked. If in a data capture setup you put an OMR field in the location where a signature should be, then you will know that if it reports checked there is a signature present, and if unchecked there likely is no signature.
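In its simplest form that check is just counting dark pixels inside a rectangle. The sketch below assumes a grayscale image stored as rows of 0 to 255 values; the function name `is_marked` and both threshold values are made up for illustration and would need tuning against real scans.

```python
from typing import List


def is_marked(image: List[List[int]], left: int, top: int,
              width: int, height: int,
              dark_below: int = 128, fill_threshold: float = 0.05) -> bool:
    """Report a field as checked when the share of dark pixels inside
    its rectangle reaches the fill threshold."""
    region = [row[left:left + width] for row in image[top:top + height]]
    total = sum(len(row) for row in region)
    if total == 0:
        return False  # the field lies completely outside the image
    dark = sum(1 for row in region for pixel in row if pixel < dark_below)
    return dark / total >= fill_threshold


# A blank 10 x 10 signature box, then the same box with a stroke through it.
blank = [[255] * 10 for _ in range(10)]
signed = [row[:] for row in blank]
signed[5] = [0] * 10
print(is_marked(blank, 0, 0, 10, 10), is_marked(signed, 0, 0, 10, 10))
```

Set the threshold too low and specks of scanner noise read as a signature; set it too high and a light, thin signature goes unnoticed.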

While you are not reading the signature, OMR is a fast and accurate way to see whether signatures are present and to avoid the additional manual step of checking for signed documents.

posted by Chris Riley at 0 Comments

Friday, October 23, 2009

Whatever happened to OCR-A and OCR-B

In the early days of OCR, the preferred approach to increasing accuracy was to institute a printing standard. That standard included two fonts, OCR-A and OCR-B, that the first OCR engines were specially trained for. Today, use of these fonts sometimes actually reduces OCR accuracy with modern engines. If you just run a modern engine on a document with OCR-A text, it will initially be less accurate unless you tell the software that the text is OCR-A, at which point it will be extremely accurate.

Some of the education around OCR processing still discusses these fonts as a living standard. For OCR of numbers only, the fonts are beneficial because they make a significant distinction between numbers and the letters they resemble, such as “1” and “0”. The numbers-only portion of OCR-A, if you extract it, is called “Index”. But for the most part the fonts provide no additional benefit in everyday OCR processing. So what happened?

Three major things happened that prevented this standard from taking off:

1. The adoption of OCR technology was very low at the time and limited to special cases, so there was not a large enough user base to embrace it.

2. It's really hard to tell users how to create their documents, especially because the people doing the OCR often are not the creators of the original document and do not have the power to determine the printing font. Documents printed entirely in these fonts are very boring, and document creators like style.

3. The OCR engines, in spite of the standard, improved to work very well on the vast majority of fonts, minus cursive and stylized special fonts. Because of this it quickly became clear that any typographic text could be converted.

As a little bit of OCR history, these fonts are an interesting way to explore the rapid growth in the technology's accuracy. There are a few specialized engines out there that use only the OCR-A and OCR-B fonts, especially for very fast camera OCR of part numbers on product assembly lines, but for the most part the standard is not required and not widely used.

posted by Chris Riley at 1 Comments

Wednesday, October 21, 2009

Is it a document or just pages

There are several aspects of how we talk about documents, scanning, OCR and Data Capture that are the culprits of confusion, misunderstanding, and unfortunately deferred adoption. One of these common language misconceptions arises when discussing documents and their structure. All data capture and OCR implementations involve the concept of a document, even if the document is a single page that is repeatedly scanned. But the definition of what a document is often gets blurred between vendors, end-users, resellers, and even internally within each of these.

Sometimes people think a document is just one page in a collection of pages. Others believe a document is a record in a database that consists of several page types combined together in a single record. This view does not depend on when the scanning happens, so one page can come in at a different time than another, but not until they are all there do you have a document. And others think a document is multiple pages scanned together, with a page type that determines the beginning and the end.

Where the confusion comes in is that they are all correct but are influenced by different things. Documents to an organization can be defined by a business process or by a scanning process. To add to the confusion, the scanning department has a concept of a document related to scanning, while the back office has a different concept as it relates to the database. To reconcile this, let me tell you completely what a document is.

A document is all the paper it takes to create a single record in a system or database. This definition actually combines all of the above and generalizes it. It's important to reconcile everyone's opinion on what a document is because document structure and the business rules around a document directly impact how you implement OCR and Data Capture and keep them accurate.

The biggest challenge with all these language misconceptions is simply understanding that they exist. If you know they are going to happen, then you can mitigate their impact. Not knowing they are present can make them a silent killer of success.

posted by Chris Riley at 0 Comments

Tuesday, October 20, 2009

Why OCR is for everyone

You may come to this site looking for OCR software, PDF Compression tools, or maybe it was a StumbleUpon. Maybe a friend said they used OCR and loved it, and you just had to Google it to find out what IT was. Unfortunately tech industries have the habit of making great technology visible only to those who know the acronyms and have a good idea of the benefits it can provide. Everyone can benefit from Optical Character Recognition. So let's break the barrier.

What is most important about the technology is not how it works, but the result it produces. Sometimes when people who are unfamiliar with scanners see the slew of document scanners I have, they ask “why do you have so many printers?” Barrier one: scanning. To OCR documents, they need to come via email or some other digital transfer as images, or more likely they are paper that needs to be scanned. We all get mail; some is junk, some is useful. We all also have paper documents sitting around and in cabinets that we need to keep for a rainy day. At the same time, every year we use our computers more and create more files on them. So at the very least, wouldn't it be nice to take the useful mail and other useful documents you have around (mortgage documents, nice letters, business cards, etc.) and get them in with all your other digital files?

To do so you scan them, hopefully using a document scanner, as it's more efficient than a flatbed. Consumers are very used to the idea of scanning photos; scanning documents is no different except that you have more of them. A document scanner, which looks like a printer but is not one, allows you to batch documents and scan them to a folder on your computer without doing it one by one, one side at a time, as with a flatbed scanner. Now that you are scanning, you have an image representation of your paper files on your computer, right next to all your other digital files. Now what? Now it's time to get the data out and make them just as useful as all your other files.

Barrier number two: OCR. It's an acronym that stands for Optical Character Recognition. This does not tell you much, so forget about it and use it only to reference the process. Simply put, it's a helpful technology that gets text from images and converts it into a format you can use. OCR converts the image into usable text, so you can search for that nice letter, or you can edit that party invite and print it again. The result can be PDF, DOC, TEXT, pretty much any format you can imagine.

Now, coming full circle, that good mail and those useful documents are not sitting somewhere cluttering up desks and drawers; they are with all your other files on your computer, ready to use. OCR is useful to everyone. You just have to clear your mind of the techie talk and understand its value.

posted by Chris Riley at 0 Comments

Friday, October 16, 2009

Not that you want to pay that invoice any faster

But you can, and you can at a lower cost, and perhaps take advantage of net discounts. With Data Capture and OCR technology you can automate the entry and routing of commercial invoices. The reality for organizations that receive many invoices a day is that the accounting department is paying high salaries and taking time away from other activities to data enter paper invoices. Using recognition technology to replace this process has been a tremendous benefit to many organizations. There are a few keys to success.

Start out simple: don't try to tackle the entire paper world with your solution. First identify the process and where the opportunities for savings are. Usually the biggest opportunity is going to be in the entry of data into some accounting system. To automate this you will need data capture and scanning capabilities. Starting out simple does not mean overlooking all the possibilities; it means finding the technology that will fit all your wildest dreams of automation but starting out slow with it. More specifically with invoices, start by scanning, then move on to getting vendor, invoice number, and total due using recognition technology, and so on.

Wait for an ROI before you make a major change: These technologies if implemented correctly can provide a great return on investment. Sometimes organizations make the mistake of not waiting until they get an ROI before making another major change. The change likely will have positive results, but requires another round of additional effort and could be problematic. This does not allow you to see when the value of the technology starts kicking in and could have you repeating effort. Wait until you succeed at a basic implementation before you seek even more cost savings. Saving money is addicting, but let each phase actualize itself.

Never forget your business process is boss: Organizations have processes that are set in stone. Staff understand how to execute them, technology is set up to facilitate them, and other processes are feeding or fed by them. Sometimes new technology is so exciting that it pushes you to change what you are doing right when you acquire it. Often organizations don't realize the upstream and downstream impact of dramatically changing business processes. A technology should give you the option to keep doing what you are doing, only faster, or to change things if you choose. At first, try to keep things as consistent as possible with the AP business processes already in place, then look for process improvement later.

No, maybe you don't want to pay that invoice faster, but you do want to reduce the cost of working with it. With Data Capture and OCR you can save a ton as long as you prepare yourself and do your homework.

posted by Chris Riley at 0 Comments

Thursday, October 15, 2009

Ok Chris, you talk the talk, but what is it?

The constituents of this blog are varied. Some know what OCR and Data Capture are, some do not. Some know they need it but not necessarily how to use it. Others know how powerful it is and have a good understanding of what is out there, but not the best practices. So, taking a step back, let me tell you what it's all about. It's about saving money and reducing the cost associated with paper-based operations.

OCR is commonly used to encompass all of the recognition technologies out there. It specifically stands for Optical Character Recognition. This is simply the process of taking an image, scanned or digitally received, and converting it from an image to text. While OCR can be used to mean ICR, OCR, Data Capture, OMR, and barcode processing, it is really the process of extracting ALL of the typographic text from an image document and converting it to a digital format. ICR is hand-print extraction, OMR is filled-in bubble extraction, and barcode is, well, barcode extraction. These latter recognition technologies make up Data Capture.

Data capture is the process of extracting field and data pairs to be exported in a structured format. It does not necessarily have to get all the information on a document, and it is very highly dictated by business processes. Data Capture incorporates ICR, OMR, barcode, and OCR to extract the data from fields. Fixed-form Data Capture deals with forms that don't change from page to page and are usually hand-printed. Semi-structured forms are 80% of the documents someone sees. Data Capture is usually a more complex technology compared to just full-page OCR.

So there you have it. This is why you are reading this blog: to learn about the specifics, nuances, and best practices of these technologies.

posted by Chris Riley at 0 Comments

Wednesday, October 14, 2009

Why hot folders are so HOT

We are all guilty of overcomplicating things. In technology products, overcomplication results in more features than you will ever use and less money that you could use; other times overcomplication creates new problems in business processes. End-users, vendors, and technologists all commonly try to add too many elements to automation projects. One of the areas where overcomplication occurs the most in data capture and OCR integrations is passing images and results from one step to another.

Most organizations, when it comes to passing images from a capture application to a data capture application, ask for a connector specifically written to pass images from the chosen imaging application's API to the chosen Data Capture application's API. Similarly, when considering export from OCR and Data Capture processes, most organizations want a special connector to their repository or ECM product. I'm not sure what to blame: the warm and fuzzies that come from the realization that an OCR vendor has spent specific effort to develop these connectors, or the faith that somehow connectors are more efficient. What I do know is that in almost all cases connectors are overkill and simply not necessary. Why? Because there are hot folders, and they are amazingly powerful and simple.

A hot folder ( sometimes called a watch folder ) is a directory, virtual or real, that is set up as a staging area or queue for applications to put data in and take data from in real time. The best thing about hot folders is they are free! Almost all imaging, data capture, and content management applications support hot folders. If they don't, you have every right to ask why. When an image capture application scans documents, it can scan those documents to a directory. The data capture application can automatically read images as soon as they appear in this directory and process them. Data capture and OCR results can be automatically exported to another directory that a content management application can automatically pick up from. That is two folders vs. two pricey connectors.
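To show just how little is involved, here is a minimal sketch of one hot-folder hop written with nothing but the Python standard library. The function name `drain_hot_folder`, the `*.tif` pattern, and the `process` callback are assumptions for the example; `process` stands in for whatever your OCR or data capture step actually does.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def drain_hot_folder(inbox: Path, outbox: Path,
                     process: Callable[[Path], None],
                     pattern: str = "*.tif") -> List[Path]:
    """Process every matching file waiting in the inbox, then move it to
    the outbox so the next application in the chain can pick it up."""
    outbox.mkdir(parents=True, exist_ok=True)
    moved: List[Path] = []
    for path in sorted(inbox.glob(pattern)):
        process(path)  # hypothetical OCR or data capture step
        target = outbox / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Calling this in a loop with a short sleep between passes gives you a basic watcher. A real setup would also need to avoid picking up a file that the scanner is still writing, for example by having the scanner write under a temporary name and rename the file when it is complete.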

You may think that you are losing functionality such as tracking and security, but there are numerous ways in Windows to monitor folder activity and protect folder security. You might be surprised that many “connectors” out there are actually just a hot folder with a settings dialog. It's a hot folder in disguise.

So when it comes to deciding how to get files from one application process to another, first consider hot folders and try your best to disprove their validity. If you can't, you just saved a bundle of money and probably picked the most efficient method for your OCR solution.

posted by Chris Riley at 5 Comments

Tuesday, October 13, 2009

Check your check scanning

Check scanners are fast and have very accurate MICR reading. They get the job done when the only job is to get MICR from a check. As OCR of checks, reconciliation of check data with remittances, and check images for future verification and reference gain greater importance and demand, check scanning runs into some complications.

The typical check scanner has two very key features:

1.) Auto endorsement
2.) MICR reading

Often people think that check scanners read MICR with OCR. This is incorrect: MICR is printed with magnetic ink that is read via a very specific magnetic reading and conversion process. When companies intend to augment their check scanning with OCR and Data Capture processes, there is something major they need to consider and not overlook. Check scanners are great at what they do, but they are not great at producing high-quality images. Most check scanners cannot scan past 200 DPI, which as you will see in my previous articles is less than optimum for OCR. Additionally, the lamps used to produce the image are fast but not the greatest quality.

So, here are the options:

1.) Scan checks with both a document scanner and a check scanner. The hard part here is the additional time it takes to perform two scans and merge the two data streams. In this scenario you get the best of both worlds: a great image for storage, OCR, and data capture from the document scanner, and great MICR and endorsement speed from the check scanner.

2.) Replace the check scanner with a document scanner. You can actually read the MICR using OCR, but it's not quite as accurate as magnetic reading. This might be OK, as the quality of extraction for the rest of the information on the check will be higher with the better image. Sometimes it's also better because an ADF allows you to scan many checks at one time, which is a new time savings. The biggest killer of this approach is that auto endorsement is such a tremendous time saver that it's hard to part with.

3.) And finally option three, the most common: just use a check scanner. This option may be the most common but is not necessarily the best. With this option the company must make sure it gets good image preparation and clean-up software that will enhance the OCR and Data Capture process and likely up-sample the images to 300 or 400 DPI. Up-sampling does not produce the same quality as scanning at these resolutions, but products that excel in up-sampling can get close.
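To make the up-sampling in option three concrete, here is a deliberately naive sketch. It uses nearest-neighbor scaling, which is the simplest possible method and an assumption for this example; the products that do this well use smarter interpolation. The function name `upsample_nearest` is made up for illustration.

```python
from typing import List


def upsample_nearest(image: List[List[int]],
                     source_dpi: int, target_dpi: int) -> List[List[int]]:
    """Scale a pixel grid from source_dpi to target_dpi by copying the
    nearest source pixel. No new detail is created, only more pixels."""
    scale = target_dpi / source_dpi
    src_h = len(image)
    src_w = len(image[0]) if image else 0
    dst_h = round(src_h * scale)
    dst_w = round(src_w * scale)
    return [
        [image[min(int(y / scale), src_h - 1)][min(int(x / scale), src_w - 1)]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


# A 200 DPI image taken to 300 DPI grows by a factor of 1.5 in each direction.
tiny = [[0, 255],
        [255, 0]]
for row in upsample_nearest(tiny, source_dpi=200, target_dpi=300):
    print(row)
```

The output has more pixels but exactly the same information, which is why up-sampled images can get close to a true 300 DPI scan but never match it.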

Check scanning is being augmented more and more with OCR and Data Capture processes. Companies should not assume that a check scanner will produce the image quality that a document scanner will, so the considerations above are important.

posted by Chris Riley at 1 Comments

Monday, October 12, 2009

Down and dirty paperless office

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

1. An unused computer attached to your network
2. Google Desktop Search with network browsing enabled
3. A document scanner
4. A server-based automatic OCR product
5. A file compression product ( optional but recommended )

Now to put it all together. My system is set up on an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don't even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my stand-alone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder (a “hot folder”), OCR them using index-level OCR settings, and create a PDF with a hidden searchable text layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is compressed, and the compressed PDF is moved to the “File Cabinet” folder.
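
If you are curious what the hot folder chain looks like underneath, here is a rough Python sketch of the idea. It is not how any specific OCR or compression product works. The two step functions are placeholders where a real product would be called, and the folder names and function names are assumptions for illustration.

```python
import shutil
import time
from pathlib import Path


def drain_hot_folder(hot_folder, next_folder, process):
    """Run `process` on every file waiting in hot_folder, then pass the result on.

    `process(path)` must return the path of the file it produced.
    """
    hot_folder, next_folder = Path(hot_folder), Path(next_folder)
    next_folder.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(hot_folder.glob("*")):
        if not path.is_file():
            continue
        result = Path(process(path))
        target = next_folder / result.name
        shutil.move(str(result), str(target))
        if path.exists():  # the step wrote a new file, so drop the source
            path.unlink()
        moved.append(target)
    return moved


def ocr_to_searchable_pdf(image_path):
    """Placeholder: call your OCR product here. This stand-in only copies bytes."""
    pdf_path = image_path.with_suffix(".pdf")
    shutil.copyfile(image_path, pdf_path)
    return pdf_path


def compress_pdf(pdf_path):
    """Placeholder: call your PDF compression tool here."""
    return pdf_path


def watch(stages, poll_seconds=2.0, cycles=None):
    """Poll each (hot_folder, next_folder, process) stage in order."""
    done = 0
    while cycles is None or done < cycles:
        for hot_folder, next_folder, process in stages:
            drain_hot_folder(hot_folder, next_folder, process)
        time.sleep(poll_seconds)
        done += 1


# Example wiring (folder names are hypothetical):
# watch([("input", "pdf_hot", ocr_to_searchable_pdf),
#        ("pdf_hot", "File Cabinet", compress_pdf)])
```

One thing this sketch ignores is a file that is still being written by the scanner when the folder is polled. A real setup should wait until the file size stops changing before picking it up.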

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.

Once it's set up, it's simply a matter of putting paper in the scanner and pressing the scan button, and you're done. It's that easy, and extremely useful!

posted by Chris Riley at 2 Comments

Thursday, October 8, 2009

Not even your monitor is safe from OCR

I've talked about various non-conventional uses of OCR: anti-virus, CAPTCHA ( though this does not work ), and now it's time for a new one: screen scraping. OCR technology is not widely used to extract text from users' active screens, and the predominant use has been of the sneaky kind. However, I suspect that screen scraping will become more popular for data validation, user identification similar to CAPTCHA, user automation, and even extreme content management. I myself have used screen scraping to convert an online address book from one email account into an importable format for another email account, because the initial account had no export option!

Essentially, screen scraping takes a screenshot of the active window, or of the entire current session, and reads the text in it with OCR. Although screenshot resolution is very low, 96 DPI, the text contained in it is what is called “pixel perfect”: it does not come with the distortions, dithering, and splotches that can appear in scans. This makes reading the text itself relatively easy; the hard part is getting to the text.

Look at your screen now. It's probably filled with various graphics, and text everywhere. For screen scraping you cannot rely on traditional document analysis to discern where the text is and which text is valuable; this has to happen after the fact. The most successful screen scraping is that which is focused on one particular portion of the screen. The next biggest challenge, in screen scraping that is continuous, is the rate at which a screen changes. For example, if you are typing a document, as I am now, you may scroll up and down very rapidly at times. Deciding when and where to capture data in an active screen can be tricky.
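
Both problems, "where" and "when", can be sketched in a few lines of Python. This is only an illustration under assumptions: frames are plain lists of pixel rows, the actual screen capture and the OCR call are left out, and the rule "treat the region as settled once it is identical in two consecutive frames" is my own simple heuristic, not a standard.

```python
def crop(frame, left, top, width, height):
    """Return one rectangular portion of a frame (a list of pixel rows)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def stable_regions(frames, region, settle_frames=2):
    """Yield the cropped region each time it has stopped changing.

    The region counts as settled once it is identical in `settle_frames`
    consecutive frames. Each settled state is yielded only once, so the
    OCR step is not run again on a screen that has not changed.
    """
    previous, unchanged, emitted = None, 0, None
    for frame in frames:
        current = crop(frame, *region)
        unchanged = unchanged + 1 if current == previous else 1
        previous = current
        if unchanged >= settle_frames and current != emitted:
            emitted = current
            yield current
```

In use, each yielded region would be handed to the OCR engine. Changes elsewhere on the screen, such as a blinking clock, are ignored because only the cropped portion is compared.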

It may be hard for you to imagine why screen scraping is useful, especially you techies who realize that the text on the screen already exists in digital format somewhere. Where screen scraping is extremely valuable is when your application has to obtain data from another application. Developing connectors between applications can be very time consuming, and often a major waste of time. You have to learn the other product's API, and if they come out with a new version you now have to support it as well. But with screen scraping you can write one way to get data off the screen of ANY active application window, search for the relevant content, and presto, you never have to do it again. In the areas of enterprise content management and conversion from a legacy system to a new one, screen scraping using OCR can be the most amazing tool.

posted by Chris Riley at 0 Comments

Tuesday, October 6, 2009

Beefy servers don't always make faster OCR

IT departments like the latest and greatest computer technology, and why shouldn't they? Usually when shopping for a machine it holds that MORE = BETTER. But in the case of OCR, organizations are surprised when a desktop testing machine outperforms their new beefy server. With OCR, very specific things increase processing performance. Many desktop-grade machines will do an amazing job at OCR if you just hit the right points.

1.) Bus speed. If you consider that OCR is moving images in memory and on the hard drive very rapidly, and doing it a lot, then you will quickly realize that the time it takes to move from point A to point B could be one of your biggest bottlenecks. Let's try an analogy. San Francisco and New York are two very large cities. They have quite an amazing capacity for people and things. Let's say San Francisco is computer memory and New York is a hard drive. If I and 200 of my friends want to move from San Francisco to New York with all our stuff, driving 100 or so VW Beetles cross-country would take a LONG time. But if we all loaded onto a jumbo jet, we would be there in a matter of hours. This is how the bus works: the slower the bus speed between memory, hard drive, and CPU, the longer these image files take to write. Servers often have fast bus speeds but carry a tremendous amount of overhead that gets in the way.

2.) OCR is a CPU hog. It will take 99% of any single thread while it is running, so putting energy into a more powerful CPU with more threads is not a bad idea. However, assuming that a server-grade CPU such as the Xeon is better than a desktop CPU such as the Duo might be a mistake. The reason is simple and twofold. Again, servers have more overhead, which can get in the way of processes that move a lot of data from one place to another. More importantly, the chipset of the older, established CPUs is just that: older. They may run at the same speed, but they don't employ some of the faster math processing that is very good for OCR and is found in the newer chipsets.

3.) Hard drive speed is the same story as bus speed. You want your hard drives to write quickly, because images are serialized very often during OCR. Not only do you want the drive to be fast, you also want its connection to the motherboard to be fast. Serial ATA is so far the proven fastest way. Servers tend to implement SCSI, which is great for redundancy but, because of the overhead, not a promoter of speed.

4.) Memory is important, but the amount of memory is less important than the memory speed. 4 GB should be sufficient for most activity any machine can handle. The difference between 266 MHz and 666 MHz memory is huge.
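
Since the points above are about real throughput rather than spec sheets, the safest approach is to measure. Below is a minimal timing sketch in Python that you could run on both the desktop and the server. The function name is made up for this example, and `ocr_page` is a placeholder for whatever call starts your OCR engine on one image; nothing here is tied to a specific product.

```python
import time


def pages_per_minute(ocr_page, pages, repeats=3):
    """Time `ocr_page` over all `pages` and return the best pages-per-minute rate.

    The fastest of several runs is used so that one-off delays (disk cache
    warm-up, background tasks) do not distort the comparison.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for page in pages:
            ocr_page(page)
        best = min(best, time.perf_counter() - start)
    return len(pages) * 60.0 / best if best > 0 else float("inf")


# Stand-in workload so the sketch runs anywhere; replace with a real OCR call.
def fake_ocr(page):
    return sum(i * i for i in range(20000))


print(round(pages_per_minute(fake_ocr, list(range(10)))))
```

Run the same set of sample images on each candidate machine and compare the numbers. That tells you more than bus, CPU, or memory specifications alone.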

If you keep it simple and focus on those things that REALLY increase OCR performance, you may be surprised to find that you pay less to get more in this case.

posted by Chris Riley at 2 Comments