Thursday, March 4, 2010

Document longevity

One of the biggest risks in document scanning is doing it wrong. A document that is scanned improperly and stored improperly, with the original paper destroyed, can put an individual or organization in a very serious situation. Sometimes it's just too hard to anticipate which settings to use. For example, while your scanning today may be for regular consumption via search and retrieval, tomorrow the same document could be required and printed for a lawsuit.

Fortunately, technologies are advancing such that scanning the “Golden Document” is practical and possible. The “Golden Document” is a document scanned with all the best settings for quality, without taking into consideration file storage or performance, the two biggest drivers of reduced scan quality. The settings for the “Golden Document” are a resolution of 300 DPI, a color bit depth, and a file format of uncompressed TIFF. If the “Golden Document” is the optimum, any deviation from it needs a justification.

With advances in document scanners, compression, and file formats, the need to justify a deviation arises less and less. Document scanners can now scan a color image at nearly the speed of a black-and-white one, so there is little reason to use black-and-white or grayscale scans. A color document gives you the ability to convert, re-purpose, and print. Scanning at 300 DPI is a setting that should never be compromised. Once you have the golden scan, you have created a rather large file. Ideally you would compress this file into a more regularly consumed format without losing quality, and compression technology advances substantially every year. The ideal file format for storage, quality, and so on is arguably searchable PDF. This format has the functionality of a regularly consumed document and is configured for sustainability. Alternatively, some may choose to create both a PDF and a Word document for the additional ability to re-purpose.
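To put a number on "rather large", here is a rough sketch of the raw size of one uncompressed page. The US Letter page size and 24-bit color are my assumptions for illustration, not settings stated above, and the figure ignores file headers and row padding.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi=300, bits_per_pixel=24):
    """Approximate raw image size in bytes (no headers, no row padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: one side of a US Letter page (8.5 x 11 inches) at 300 DPI.
color_bytes = uncompressed_scan_bytes(8.5, 11)                   # 24-bit color
bw_bytes = uncompressed_scan_bytes(8.5, 11, bits_per_pixel=1)    # 1-bit black and white

print(f"color: {color_bytes / 1_000_000:.1f} MB per page side")
print(f"black and white: {bw_bytes / 1_000_000:.1f} MB per page side")
```

Under these assumptions a single color page side is roughly 25 MB before compression, which is why the compression step matters so much.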

While you may not be scanning the “Golden Document” today, now is the time to revisit why not, and to look at ways to get there.

posted by Chris Riley at 0 Comments

Thursday, February 25, 2010

Imprint vs. Annotate

Large-volume scanning environments often need to imprint (herein “stamp”) information, usually the date of scan, on each and every page that is processed. This requirement exists for tracking purposes and sometimes for compliance. Many service bureaus require more than just a date; they require batch IDs and other important tracking information. The question is how best to do this. There are several options.

Pre-Scan Imprint

Pre-scan imprint, the most common option, allows an organization to have the stamp on both the physical paper copy and the scan. Scanners capable of pre-scan imprint print the data in the proper location before the page reaches the scanner's lamps, so the stamp also becomes part of the scan. This option is the most common because there are times when a scanned image needs to be compared with a physical document, and a shared stamp is what makes that possible. Scanners with the imprint feature come at a premium and require more maintenance.

Post-Scan Imprint

If the organization only needs the data or tracking mechanism on the physical paper, it can imprint after the scan. Some scanners support post-scan imprinting; otherwise organizations feed the paper through an additional printing process. Usually the purpose of this operation is simply to mark whether a page has been processed or not. Scanners with the post-scan imprinting feature run nearly the same price as pre-scan imprint scanners and are gradually being phased out in favor of them.

Software Annotation

If the organization only needs the data or tracking mechanism on the scanned image, it may elect to do software annotation. Software annotation gives the greatest flexibility of all three options, as any combination or sequence of data can be printed anywhere on the image. It does require an additional piece of software. Very often organizations choose software annotation to avoid the premium for imprinting scanners, sacrificing the physical imprint. The application that provides the annotation needs to be automated and batch driven.

The alternative to the above three methods is manual stamping. Manual stamping is tedious, time-consuming, and often inaccurate. It's up to the organization to review the three options and pick the best fit for its production and budget.

posted by Chris Riley at 0 Comments

Tuesday, January 26, 2010

Attachment Emailing Master

Very often in business, email correspondence is accompanied by a file attachment. While it's possible to attach any file format to an email (some are not preferred by email clients), the most common type is a document, and the most common format is either Word or PDF. This post contains some advice on the best way to deliver documents via email.

When emailing documents, you have to be concerned about size, readability, and security. If the attachment is too large, you may not be able to email it at all. If the document is not readable, there is no point in sending it anyway. Finally, if it's not secure, it might be re-purposed or stolen. When your document starts out in paper form, the challenges increase.

There is an ideal format and an ideal set of conversion settings to use when sending documents via email. Ideally you would scan your document in color for visual readability. This is not the only type of readability; you also want to make sure the documents remain accessible for long periods of time. You would use optical character recognition (OCR) so that the document can be indexed by a search utility. You would use a compression tool to convert the initially large color image into one that is manageable without degrading the quality. Finally, you would use the PDF format to get whatever level of security you choose.

The combination of a searchable, compressed, color PDF is the ideal method for emailing documents as attachments and ensuring their effectiveness and long-term usage.

posted by Chris Riley at 0 Comments

Tuesday, January 12, 2010

Dual Stream Scanning – Have your cake and eat it too

The benefit of drop-out forms is that they are very accurate in data capture. The downside is that after they are scanned they aren't much to look at. Companies want the best of both drop-out and black-and-white forms. They pursue this in various ways, the most common being to just deal with the images they have. Some will scan a document twice, which is very time-consuming. Others will use an overlay utility that stamps the original form fields and labels back onto an already processed drop-out image. These utilities are accurate, but not as accurate as the original, and they often result in lines stamped on text. The best solution for scanning a form efficiently so that it is optimal for both data capture and viewing is dual stream scanning.

Dual stream scanning is usually a feature of higher-end scanners, though the technology is slowly moving down to workgroup and desktop scanners. The feature allows a single scan to produce both a drop-out image and a black-and-white image. The scan speed is the same as if you were scanning in color. Once configured, the drop-out image goes down one path and the black-and-white image down another. A company can then use the drop-out image only for data capture, while the black-and-white image is married with the data capture results in the database or file system.

Data capture from a drop-out form is on average 15% more accurate than from a black-and-white scanned form, and the gain is often much higher. The reason is that the OCR in data capture is not interfered with by form lines printed on or too close to text. Additionally, the logic to locate fields can be simplified, as field labels are often in a small font and hard to detect.

Simple, and with the greatest accuracy of any solution, dual stream is a great tool.

posted by Chris Riley at 0 Comments

Friday, December 25, 2009

It's not that you don't want to, it's that you can't

Many of us tech heads are quick to give you an answer to your technical needs and propose a solution even if you did not ask. I'm no different: if you tell me you want your documents digital, I will explain OCR to you and then explain the best solution for your document types. To my dismay, if you work for a large company your response will likely be, "but I'm not allowed to install anything."

It's very common for large organizations to lock down their employees' computers to the point that they become more of an appliance than a computer. This lockdown makes perfect sense, especially considering the amount of personal and private information these organizations encounter. The lockdown, however, makes it very difficult for a technical operator to increase their efficiency with new technology. While the option stands to approach an IT department with requests for new technology, the chance of success, as we know, is very small, especially with the current situation of shrinking IT departments.

Most recently I was in a conversation with someone working for a bank. She had stacks of business cards that needed to be digitized, and of course, being the tech head that I am, I got excited and explained business card reading (BCR), and that perhaps it would be easier to get a document scanner that could scan the business cards and everything else. But it was to no avail: she could not install the software.

The real hurdle with computer lockdowns is not so much hardware installation, which can be overcome with a simple request. It's the approval of new software, which requires many months of review. Because OCR is a software-driven process, this complicates things. Eventually, I hope that document automation becomes a part of the standard build for end users' machines. Until then, the solution is a scanner and an OCR service, either web based or on an intranet.

If an organization can centrally deploy an OCR server that users send documents to and receive results from, it will eliminate the risk of installed software. Alternatively, an end user with an attached scanner can use the web-based OCR services that exist, uploading documents via FTP or email and receiving the results.

I hope soon we all have OCR as a standard so we can start removing the reliance on troublesome paper, but until then, the OCR services exist to get the job done, and may sometimes be the preference.

posted by Chris Riley at 0 Comments

Wednesday, November 11, 2009

Dropout, all or none

Color or greyscale dropout is a great tool for increasing the accuracy of extracting data from forms. But bad dropout is far worse than no dropout. Partially dropped-out forms can confuse data capture technology. These are commonly called “Zebra” forms: portions of the form have dropout performed correctly, while in other portions the fields are still outlined in black. If you have control of the scanning and this is your situation, you are better off turning off dropout or improving its use.

It used to be that the only way to drop out a form was scanner-driven dropout. This approach was limited in the colors that could be removed. Essentially, the scanner would be equipped with colored lamps, usually red. During the scan the lamp would be turned on, canceling out the red in the form. Because of this it was important that printed forms used a certain type of red. If you have ever had experience with color matching, you know it's quite frustrating, especially because the colors you see on the screen are not usually what is printed. Things have improved. Now even scanners are using software dropout, where images initially arrive in color and algorithms then remove pixels of a certain color range from the document. This has the added benefit that some scanners and software packages can drop out any color, and multiple colors at a time. There are even some packages that can drop out things like colored lines.

Dropout with any technology becomes difficult when there are gradations on the form because of bad printing, color wear, sun, or other damage. Because the software is looking for consistency, any dropout will skip colors that don't match the norm. This is often seen when the first half of a form is dropped out and the second is not because of a color change mid-document. There are tools that let you specify a threshold to assist with this. The threshold can be very low when dealing with documents that have one form color and black text, but with a low threshold more complex documents can lose important data.
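As a rough illustration of software dropout and of how a faded form color produces a “Zebra” result, here is a minimal sketch. The function name, the RGB values, and the `tolerance` parameter are all hypothetical; real scanner drivers and capture packages use their own algorithms and their own meaning for the threshold setting.

```python
WHITE = (255, 255, 255)


def drop_out_color(pixels, target, tolerance):
    """Turn every pixel within `tolerance` of `target` white.

    pixels: rows of (r, g, b) tuples; tolerance: largest allowed
    per-channel difference from the target color (an assumed measure).
    """
    result = []
    for row in pixels:
        new_row = []
        for (r, g, b) in row:
            distance = max(abs(r - target[0]), abs(g - target[1]), abs(b - target[2]))
            new_row.append(WHITE if distance <= tolerance else (r, g, b))
        result.append(new_row)
    return result


# Hypothetical sample values: a freshly printed form red, a sun-faded
# version of the same red, and black handwriting.
form_red = (220, 40, 40)
faded_red = (235, 120, 120)
black_ink = (10, 10, 10)
page = [[form_red, black_ink, faded_red]]

# A tight match drops the fresh red but leaves the faded red behind.
tight = drop_out_color(page, target=form_red, tolerance=20)
# A looser match also catches the faded red while keeping the black ink.
loose = drop_out_color(page, target=form_red, tolerance=90)
print(tight)
print(loose)
```

The looser the match, the more of a worn form is removed, but on a document with several colors the same setting can start removing data you wanted to keep.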

Assuming good form printing, the biggest key to proper dropout is to scan the document as soon as possible, leaving less time for damage to take place. Dropout is a great tool, but if you find that forms are partially dropped out, you are better off, for data capture accuracy, turning off dropout and dealing with the black-and-white form.

posted by Chris Riley at 0 Comments

Path to simple yet robust document routing

When it comes to the input path that documents follow, for many it's as simple as scan, convert, save, but others require more complex work-flows. The good news is that there are tools out there to perform even the most advanced work-flows you could imagine. The bad news: they are expensive. I'm here to tell you about a way of combining your scanner with your data capture, OCR, and document conversion software to build more complex work-flows without the premium.

By using settings that come with most document scanners, together with the ability of most data capture, OCR, and document conversion products to use hot folders (watch folders), you can create robust multi-step work-flows out of the box. First, you need a scanner that supports multiple destinations, usually 9 or more. This is indicated by an LED on your document scanner, which lets you pick a destination number at the start of a batch scan. Second, you will need all the software required to perform the conversions needed for the final result. In our example we will want to be able to OCR, data capture, compress, and archive.

Basically the task is to create a funnel for your documents, with the end result saved where you want the final destination to be. If your scanner supports what is called dual-stream, you can work with two funnels simultaneously, making your work-flow all the more robust. The first part of the funnel is identifying the document type. Each of the 9 destinations on your scanner should be configured for one document type (you may want one destination per business process instead). The configuration includes the scan settings, 300 DPI of course, and the folder the document will go in. This is just the staging folder for the next step. Let's assume that we set up destination 1 for invoices and our scanner supports dual-stream. When it's all said and done, we want one copy of each invoice saved in a searchable directory, where the file is both compressed and in PDF/A format. Then we want another copy of the same invoice to be data captured and put in a working directory for someone to review. Let's put it all together.

Destination one on the scanner is configured for invoices. The first copy of any invoice is saved to a hot folder that the PDF conversion utility is watching; the second copy is scanned into a hot folder that the data capture product is watching. Because these are hot folders, both copies are picked up instantly and processed by each application. Our requirement for the second copy was only that it be data captured and exported to a working directory, so its task is now complete. For the first copy we have more conversions to do. The PDF conversion utility saves the OCRed searchable PDF to a hot folder for the compression utility, the compression utility compresses the PDF and saves it to a hot folder for the archive utility, and finally the archive utility saves the result in our final destination for all invoices. Below is a basic diagram of the work-flow we created for invoices (destination 1).

Scan > PDF Creation > Compression > Archive > Final Result
     > Data Capture > Final Result

Although it may have been slightly difficult to read, hopefully it's clear that the above is just one work-flow that gets the most out of the tools offered by both the document scanner and the conversion software packages. Now you can proceed to program each of the other destinations with different document types and their associated work-flows. Programmers and tech-savvy individuals will easily envision ways to add scripts that make the process even more robust, with email notifications and so on. This approach is not a replacement for advanced work-flows but a middle ground between no work-flow and very pricey work-flows.
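For the tech-savvy, the hot-folder chain for the first invoice copy can be sketched in a few lines. This is only an illustration under assumptions: the folder names are made up, the conversion step is a placeholder that copies the file, and it makes a single pass where real products watch their folders continuously.

```python
import shutil
import tempfile
from pathlib import Path


def placeholder_convert(source, target):
    """Stand-in for a real OCR, compression, or archive step (just copies)."""
    shutil.copyfile(source, target)


def run_stage(watch_folder, output_folder, convert):
    """Process every file waiting in a hot folder and hand it to the next one."""
    processed = []
    for source in sorted(Path(watch_folder).iterdir()):
        if not source.is_file():
            continue
        target = Path(output_folder) / source.name
        convert(source, target)
        source.unlink()  # clear the hot folder so the file is not picked up twice
        processed.append(target)
    return processed


def run_pipeline(root):
    """One pass through the chain: PDF creation, compression, archive."""
    names = ["pdf_in", "compress_in", "archive_in", "invoices_final"]  # hypothetical
    folders = [Path(root) / name for name in names]
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    for watch, output in zip(folders, folders[1:]):
        run_stage(watch, output, placeholder_convert)
    return folders[-1]


# Simulate the scanner dropping one invoice into the first hot folder.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "pdf_in"
    inbox.mkdir()
    (inbox / "invoice_001.tif").write_text("scanned image placeholder")
    final_folder = run_pipeline(tmp)
    print(sorted(p.name for p in final_folder.iterdir()))
```

Each stage only knows its own input and output folder, which is what lets you chain off-the-shelf products without them knowing about each other.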

posted by Chris Riley at 0 Comments

Monday, November 9, 2009

Get the “Blank” out of here

One of the challenges of document scanning is pesky blank pages. They are usually an annoyance and a space taker more than a real problem. Blank pages in PDF files cause needless scrolling, and in text documents they make you believe something has been missed. With duplex scanning you can be assured that unless you take steps to remove blank pages, they will be there. There are several ways to zap blank pages from any batch scan. You can remove them prior to scanning, during scanning, or after recognition (post-scan). Removing blank pages prior to scanning is only possible if both sides of the sheet are blank, or if you selectively scan simplex and duplex depending on the document. This is cumbersome and takes up a lot of time.

Most document scanners today include a blank page removal tool as part of their driver. These tools vary slightly. Some have specific algorithms that detect blank pages not only by the amount of white on the page but also by how a page relates to other pages in the batch. Sometimes this is problematic when the back sides of documents have very little text. Another approach is to measure the resulting image file size: under a certain number of kilobytes you have likely spotted a blank page. This has the same problem of removing pages with very little text, which often occur on the back sides of documents. The final and most accurate way is to measure the amount of black or color pixels on the page and set a threshold at a small percentage, like 1% or 2%, below which the page is considered blank. This approach requires you to know your documents beforehand and may be problematic with greyscale scans or contrast settings that make blank pages slightly gray. The alternative to driver-based removal is to have imaging or OCR software remove the pages for you.
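The pixel-percentage approach can be sketched as follows. This is a simplified illustration, not how any particular driver works: the page is assumed to be rows of greyscale values (0 is black, 255 is white), and the `dark_below` cutoff and the 1% default are assumed values you would tune for your own documents.

```python
def ink_fraction(pixels, dark_below=128):
    """Fraction of pixels darker than the cutoff (0 = black, 255 = white)."""
    total = 0
    dark = 0
    for row in pixels:
        for value in row:
            total += 1
            if value < dark_below:
                dark += 1
    return dark / total if total else 0.0


def is_blank(pixels, max_ink=0.01, dark_below=128):
    """Treat the page as blank when at most `max_ink` of it is dark."""
    return ink_fraction(pixels, dark_below) <= max_ink


# A white page with a single speck of dust.
speck_page = [[255] * 100 for _ in range(100)]
speck_page[0][0] = 0

# The back side of a document with three short lines of text.
sparse_page = [[255] * 100 for _ in range(100)]
for line in (10, 12, 14):
    sparse_page[line] = [0] * 100

# A blank page that scanned as a dark gray because of contrast settings.
gray_page = [[120] * 100 for _ in range(100)]

print(is_blank(speck_page), is_blank(sparse_page), is_blank(gray_page))
```

The gray page shows the weakness mentioned above: with a fixed cutoff, a blank page that scans dark looks like it is full of ink, so the cutoff has to match your scan settings.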

Some, though not most, OCR applications also have the ability to detect blank pages, using a combination of pixel detection and the presence of text. This might slow down your OCR process but is a useful tool if it is available. More likely you can purchase a full-on imaging application with very robust blank page removal tools, akin to what you would find in a scan driver but usually with more options.

Organizations such as service bureaus often combine methods to ensure that no blanks make it through. Blank page detection tools are very accurate and very useful, and you can start using them today.

posted by Chris Riley at 0 Comments

Wednesday, October 21, 2009

Is it a document or just pages

Several aspects of how we talk about documents, scanning, OCR, and data capture are culprits of confusion, misunderstanding, and unfortunately deferred adoption. One of these common language misconceptions arises when discussing documents and their structure. All data capture and OCR implementations involve the concept of a document, even if a document is a single page that is repeatedly scanned. But the definition of a document often gets blurred between vendors, end users, resellers, and even internally within each of these.

Sometimes people think a document is just one page in a collection of pages. Others believe a document is a record in a database that consists of several page types combined together in a single record. This view does not depend on when the scanning happens: one page can come in at a different time than another, but you do not have a document until they are all there. Still others think a document is multiple pages scanned together, with a page type that determines the beginning and the end.

The confusion comes in because they are all correct, but are influenced by different things. Documents in an organization can be defined by a business process or by a scanning process. To add to the confusion, the scanning department has a concept of a document related to scanning, while the back office has a different concept related to the database. To reconcile this, let me tell you in full what a document is.

A document is all the paper it takes to create a single record in a system or database. This definition combines all of the above and generalizes them. It's important to reconcile everyone's opinion on what a document is because document structure, and the business rules around a document, directly impact how you implement OCR and data capture and keep them accurate.

The biggest challenge with all these language misconceptions is simply understanding that they exist. If you know they are going to happen, you can mitigate their impact. Not knowing they are present can make them a silent killer of success.

posted by Chris Riley at 0 Comments

Tuesday, October 20, 2009

Why OCR is for everyone

You may have come to this site looking for OCR software or PDF compression tools, or maybe it was a StumbleUpon. Maybe a friend said they used OCR and loved it, and you just had to Google it to find out what it was. Unfortunately, tech industries have the habit of making great technology visible only to those who know the acronyms and already have a good idea of the benefits it can provide. Everyone can benefit from Optical Character Recognition. So let's break the barrier.

What is most important about the technology is not how it works, but the result it produces. Sometimes when people who are unfamiliar with scanners see the slew of document scanners I have, they ask, “Why do you have so many printers?” Barrier one: scanning. To OCR documents, they need to arrive as images via email or some other digital transfer, or more likely they are paper that needs to be scanned. We all get mail; some is junk, some is useful. We all also have paper documents sitting around and in cabinets that we need to keep for a rainy day. At the same time, we use our computers more every year and create many files on them. So at the very least, wouldn't it be nice to take the useful mail and the other useful documents you have around (mortgage documents, nice letters, business cards, etc.) and get them in with all your other digital files? To do so you scan them, hopefully using a document scanner, as it's more efficient than a flatbed. Consumers are very used to the idea of scanning photos; scanning documents is no different except that you have more of them. A document scanner, which is not a printer but looks like one, allows you to batch documents and scan them to a folder on your computer without doing it one by one, one side at a time, as with a flatbed scanner. Now that you are scanning, you have an image representation of your paper on your computer, right next to all the other digital files you have. Now what? Now it's time to get the data out and make these images just as useful as all your other files.

Barrier number two: OCR. It's an acronym that stands for Optical Character Recognition. That does not tell you much, so forget about it and use the term only to refer to the process. Simply put, it's a helpful technology that gets text from images and converts it into a format you can use. OCR converts the image into usable text, so you can search for that nice letter, or edit that party invite and print it again. The result can be PDF, DOC, TEXT, or pretty much any format you can imagine.

Now, coming full circle, that good mail and those useful documents are not sitting somewhere cluttering up desks and drawers; they are with all your other files on your computer, ready to use. OCR is useful to everyone. You just have to clear your mind of the techie talk and understand its value.

posted by Chris Riley at 0 Comments

Monday, October 12, 2009

Down and dirty paperless office

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

1. An unused computer attached to your network
2. Google Desktop Search with network browsing enabled
3. A document scanner
4. A server based automatic OCR product
5. A file compression product ( optional but recommended )

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don't even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my stand-alone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR processing product is configured to pick up images as soon as they arrive in the input folder (the “hot folder”), OCR them using specific index-level OCR settings, and create a PDF with a hidden searchable layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is compressed, and the compressed PDF is moved to the “File Cabinet” folder.

Because Google Desktop Search is enabled to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL located on the digital file cabinet machine, and locate documents with a search.

Once it's set up, it's simply a matter of putting paper in the scanner and pressing the scan button, and you're done. It's that easy, and extremely useful!

posted by Chris Riley at 2 Comments

Monday, September 28, 2009

Set it and forget it OCR

My office is a paper monster: paper comes in and never leaves intact. The scary part is how fast this happens. Paper in hand, I review its contents and assess its value, scan it, and shred it, usually within minutes of its arrival. The value of set it and forget it OCR is tremendous, but you have to be comfortable with it.

Set it and forget it OCR is where you take your OCR product and configure it to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder, and all the resulting compressed and OCRed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid, as I'm basically relying solely on the power of OCR text and search to retrieve documents when I need them. Most would rather configure their ADF scanner to have a setting or folder for each particular class of documents. Most document scanners today have as few as 9 and as many as 99 destinations you can program. You can set up each destination with its own input folder, its own OCR settings, and its own output folder.
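For those who prefer one destination per document class, the mapping can be pictured as a small table. The sketch below is purely illustrative: the class names, folder paths, and OCR profile names are made up, and a real scanner and OCR product would hold this configuration in their own settings screens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    number: int            # destination number selected on the scanner
    document_class: str
    input_folder: str      # hot folder the scanner writes to
    ocr_profile: str       # name of an OCR settings profile (hypothetical)
    output_folder: str     # where the finished PDF ends up


def build_destination_table(destinations, max_destinations=9):
    """Index destinations by number, rejecting duplicates and bad numbers."""
    table = {}
    for dest in destinations:
        if not 1 <= dest.number <= max_destinations:
            raise ValueError(f"destination {dest.number} is out of range")
        if dest.number in table:
            raise ValueError(f"destination {dest.number} is defined twice")
        table[dest.number] = dest
    return table


# Hypothetical example configuration for a 9-destination scanner.
table = build_destination_table([
    Destination(1, "invoices", "scans/in/invoices", "index-level", "cabinet/invoices"),
    Destination(2, "letters", "scans/in/letters", "full-text", "cabinet/letters"),
])
print(table[1].output_folder)
```

My own setup is the degenerate case of this table: a single destination, one input folder, and one output folder.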

I know I can do this because I know what settings it takes to get the quality of OCR I need to have at least one usable keyword on the document for search. And after all, I'm an expert in OCR, so not using it every day would be crazy in its own right. I've yet to be proven wrong: my “File Cabinet” abyss has always given me the information I required at the time I required it, and sometimes new information I did not realize I had.

Now, for you records management folks shaking your heads, I understand your complaint. It should not be about my approach but about what I do with the final paper product. Items that are deemed a record by your taxonomy for legal or business reasons should be filed as such, perhaps scanned again as a record, and for heaven's sake, if you are not supposed to destroy something, don't destroy it!

The purpose of my madness is to touch paper as little as possible, and get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun in the set it and forget it OCR technique.

posted by Chris Riley at 0 Comments

Saturday, September 26, 2009

Documents say “Cheese” - Digital Photo OCR

Contrary to popular belief, it will be many years before a digital photograph of a document comes close to the accuracy of a document scan. Yes, there are document scanners today that are based on a mounted digital camera, and these are very accurate, but that is not what I'm referring to. I'm talking about photographing documents with your cell phone or digital camera. One would assume that photographing a document at the highest possible resolution would eventually replace document scanning, but that is not the whole story. Even your 12-megapixel digital camera will not beat a 300 DPI document scan when it comes to document imaging. While it is possible to get better and better digital photographs of documents, there is one major problem in converting them using OCR: OCR engines have to account for many more variable elements, the most complicated being layers.

When you take a photograph of a document there is the potential for several different focal points: a table, a finger, the floor. Some of these focal points can easily be mistaken for the flat surface of a document. The OCR engine has to determine which layer or focal point is the actual document and where its borders are. The way engines do this is primarily color detection. In a document scan there is only one focal point, as the document is the entirety of the image, so the OCR engine does not need to guess or modify the image to find it. This increases the accuracy of both document analysis and character reading. The next challenge is perspective.

A digital photograph of a document should be taken head on. Think about the LCD screen on your camera as being on the same plane as the piece of paper. Any variation to this causes problems with distortion where for example the top portion of the document from left-edge to right-edge has a shorter distance than the bottom portion. There are some capture applications out there for the iPhone and other mobile devices that force you to line the document up in brackets. This forces the capture to focus only on the document and know by virtue of the guide where the borders are, but lining it up is very time consuming. That gets to the final point, time.

It would actually take you much more time to capture a 10-page document with digital photographs than with an ADF or sheet-fed document scanner. Because the quality of the photo is so important when running OCR on a digital photograph, it requires a lot of conscious effort: not shaking, lining up the document, and placing the document on a surface that does not contain many layers or focal points. Because of this additional effort, it's actually not saving any time.

I am a fan of blooming technology as well, but for acquiring paper images and converting them there is no better way than a portable or traditional document scanner. In time, digital photographs of documents will become a popular way to capture single-page documents for one-off processing, but as long as paper exists, so will document scanners.

posted by Chris Riley at 0 Comments

Friday, September 25, 2009

Don't over clean – the effects of image clean-up on accuracy

There is always some way to modify a scanned image to improve its recognition results if it's not already perfect. But there are also ways to modify an image that destroy recognition results. Not all image clean-up is good for OCR, for several reasons.

There are two types of image clean-up. The first is image clean-up for view-ability. These are the clean-up tricks that make images look even prettier on the screen; the goal is what is called pixel perfect, where the image looks like it was electronically generated. The second is image clean-up for OCR or Data Capture. These are the tricks that help the image gain better recognition results. All image clean-up for OCR and Data Capture is good for view-ability, but not all image clean-up for view-ability is good for recognition. The reason is that engines were built and trained at a time when many image clean-up technologies were not available, and because recognition technologies interpret pixels, it's possible to remove useful ones.

Here are some tips. Stick to certain types of clean-up when recognition is the primary purpose. Some products and scanners even allow what is called “dual stream”, where one scan produces two images that can follow separate paths. If you have this function, use settings that are best for OCR on one image and settings that are best for view-ability on the other. Good for OCR is:

1.) Despeckle (unless dot-matrix font)
2.) Line Straightening
3.) Basic Thresholding
4.) Background removal
5.) Correction of Linear Distortion
6.) Dropout
7.) Line Removal (sometimes)
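As a rough illustration of two items on this list, the sketch below applies basic (single cutoff) thresholding and then a simple despeckle to a tiny grayscale grid. The cutoff of 128 and the "no ink in the eight surrounding pixels" rule are assumptions chosen for the example; real clean-up products use their own tuned methods.

```python
def threshold(gray, cutoff=128):
    """Basic thresholding: one global cutoff. 1 = ink, 0 = background."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]

def despeckle(binary):
    """Remove ink pixels that have no ink among their eight neighbors."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += binary[ny][nx]
            if neighbors == 0:
                cleaned[y][x] = 0  # isolated speck: treat as noise
    return cleaned

# A small stroke (three dark pixels) plus one stray dark speck in the corner.
gray = [
    [255, 255, 255, 255, 255],
    [255,  20,  30, 255, 255],
    [255,  25, 255, 255, 255],
    [255, 255, 255, 255,  10],
]

binary = threshold(gray)
for row in despeckle(binary):
    print(row)
```

The same rule shows why despeckle is risky on dot-matrix fonts: when the characters themselves are built from separated dots, a filter that removes isolated dots can remove the text.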

Bad for OCR is:

1.) Adaptive Thresholding: Often causes a condition called “Fuzzy Characters”, where a “c” will be read as an “e”. For hand-print, you often remove portions of characters.
2.) Character Regeneration: Removes information critical to OCR and ICR processes. If you use it in OCR (Machine-Print), you will notice more “high confidence blanks”: the characters are so perfect that they look like images to the OCR engine and are ignored. In ICR (Hand-Print), you will damage the hand stroke of the characters, which confuses the ICR algorithms, reduces the ability of training to understand the subject, and ultimately reduces accuracy.
3.) Line Removal: Bad line removal makes bad OCR. Line fragments really interfere with OCR and ICR processes.

When using image clean-up for OCR and Data Capture processes, use only the techniques that improve recognition rates, not those that destroy them.

posted by Chris Riley at 0 Comments

Wednesday, September 23, 2009

The Magic of 300DPI

Many users of OCR don't realize what the impact of resolution and bit-depth is, or even what they are. Usually with OCR, more is better: more resolution, more bit-depth. It's more information the OCR engine can use to interpret text. But as with many things, there is a point of diminishing returns, and for image resolution the diminishing returns are very interesting.

You will hear a lot that 300 DPI is the best resolution at which to scan an image for OCR. But why? 300 DPI is that magic number where you gain the most accuracy without sacrificing speed and file size. If you were to put the resolutions on a progressive line starting at 96 DPI and test OCR accuracy, scanning speed, OCR speed, and file size at each, you would notice something very interesting: the improvement gap between a 200 DPI scan and a 300 DPI scan is at least 2 times the improvement gap between any other pair of resolutions. If you look at the same line between 300 DPI and 400 DPI, the improvement gap is nearly absent, but still there. This simple study is the reason 300 DPI is the ideal resolution for OCR scanning. Now let's look at why.
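The file size side of that trade-off is easy to work out. The sketch below computes the uncompressed size of a letter-size page at several resolutions; the 8.5 x 11 inch page and 24-bit color are assumptions for the example. It says nothing about accuracy, which has to be measured against a real OCR engine.

```python
def uncompressed_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=24):
    """Uncompressed image size for a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

for dpi in (96, 200, 300, 400):
    size_mb = uncompressed_bytes(dpi) / 1_000_000
    print(f"{dpi} DPI: {size_mb:.1f} MB")
```

Because size grows with the square of the resolution, moving from 300 DPI to 400 DPI adds roughly 78% more data per page in exchange for the small accuracy gain described above.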

Besides its reasonable scan speed and reasonable file size, the biggest reason 300 DPI is optimal is that the engine cores were all initially trained at this resolution. Some engines, no matter what resolution you give them, will actually sample up or down to get to 300 DPI. The image pre-processing/clean-up engines are similarly set up.
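How a given engine resamples internally is not something this post can show, but the arithmetic of sampling up or down to 300 DPI looks roughly like this. The function name `resample_plan` is hypothetical and used only for illustration.

```python
def resample_plan(width_px, height_px, source_dpi, target_dpi=300):
    """Return the scale factor and pixel size needed to reach the target DPI."""
    scale = target_dpi / source_dpi
    return scale, round(width_px * scale), round(height_px * scale)

# A letter page scanned at 200 DPI has to be sampled up by 1.5x.
print(resample_plan(1700, 2200, 200))

# The same page scanned at 600 DPI is sampled down to half its pixel size.
print(resample_plan(5100, 6600, 600))
```

Either way the engine ends up working on a 2550 x 3300 image, so the pixels added by a higher resolution, or invented by sampling up from a lower one, do not give the engine new information.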

There are always exceptions, and they are usually hand-printed forms (ICR) or documents with small print.

The beauty of the 300 DPI best practice is that it's one of the few things in the area of OCR and Data Capture that is consistent across document types. You have been told to use 300 DPI, and now you know the reason behind it.

posted by Chris Riley at 0 Comments

Monday, September 14, 2009

“No text left behind” - Color's Impact on OCR

OCR technology has come a long way since its creation. On clean, letter-type documents scanned at 300 DPI, the technology has arrived and there is not much room for improvement. But what about the rest of the documents out there? How is OCR improving on them? When comparing that perfect letter document to, say, a not-so-perfect article or newspaper, the big difference is text placement and configuration. One of the keys to getting even better OCR is to improve your ability to identify what is graphics and what is text. Within the text you have to identify columns, paragraphs, sentences, words, and finally characters. Only then can the OCR take a whack at interpreting the text. This is called Document Analysis. Sometimes OCR accuracy is lower not because of the actual read of the text, but because the OCR software tries to read things that are not text, or because some of the text in the document is simply ignored because it was never found.

In the last few years, and moving forward, text identification (Document Analysis) has been one of the areas of greatest improvement. Many of the new products leverage color as one more tool for not leaving any text behind. With color, locating the different parts of a document is easier and more accurate, and thus the overall OCR is more accurate. The most obvious benefit of color is the ability to locate graphics. Sometimes index-level OCR requires that even text within graphics be read to enhance the search-ability of a document. With color detection, modern engines are advancing to locate text in pictures and ignore the rest. Very stylized documents pose the greatest challenge to Document Analysis, and color is one of the best tools to attack them. Expect to see a continued focus on Document Analysis and the pursuit of no text left behind.
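As a very rough illustration of how color can help separate graphics from text, the sketch below flags a block of pixels as a likely graphic when most of its pixels are strongly colored. This is not how any particular engine works. The saturation cutoff, the 50% rule, and the function names are all assumptions made up for the example.

```python
import colorsys

def colorful_fraction(pixels, min_saturation=0.3):
    """Fraction of (r, g, b) pixels whose HSV saturation reaches the cutoff."""
    if not pixels:
        return 0.0
    colorful = 0
    for r, g, b in pixels:
        _, saturation, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if saturation >= min_saturation:
            colorful += 1
    return colorful / len(pixels)

def looks_like_graphic(pixels, min_fraction=0.5):
    """Guess that a block is a graphic when most of its pixels are colorful."""
    return colorful_fraction(pixels) >= min_fraction

# Black text on near-white paper: almost no saturated pixels.
text_block = [(0, 0, 0), (250, 250, 245), (20, 20, 20), (255, 255, 255)]

# A photo or logo: mostly saturated reds, blues, and yellows.
photo_block = [(200, 30, 30), (30, 120, 200), (240, 200, 40), (255, 255, 255)]

print(looks_like_graphic(text_block), looks_like_graphic(photo_block))
```

A black-and-white scan throws this signal away entirely, which is one reason color gives Document Analysis more to work with.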

posted by Chris Riley at 0 Comments
