Wednesday, November 25, 2009

Convert now Export later

It's not surprising that an organization's focus in any sort of document automation is the export format and the data coming out of the system. But sometimes this focus has organizations choosing poor data capture and OCR products just for an ideal export format. This occurs most often in healthcare and accounting, where industry-specific repositories expect a particular format and the vendors of these repositories are unwilling to change. This post is to assure you that the accuracy and features of your data capture and OCR product are more important than the file format it creates.

By focusing on the export file format, organizations limit their range of possible solutions and perhaps lock themselves into a more expensive proposition than they should. Vendors of industry-specific applications are able to charge a premium for connectors and their products because they understand where the focus is. However, the most accurate data capture and OCR systems out there are general purpose. Some data capture applications have connectors to, say, a specific accounting system, but even without specific connectors all data capture systems can export data in such a way that it can be converted to ANY desired format.

Data capture applications support CSV, XML, ODBC, or text exports that can be molded into any required format. Often, because they support ODBC, there is an opportunity to export directly to any application that also supports it. Because a conversion utility or a custom connector takes weeks to create, versus the man-years that go into a data capture and OCR engine, the focus should be on the accuracy and capability of the OCR and data capture system before its export functionality.
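To give a sense of how small such a conversion utility can be, here is a minimal sketch in Python that reshapes a CSV export into a simple XML document. The column names, the target element names, and the `csv_export_to_xml` function are all assumptions made up for illustration; a real repository will dictate its own schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from the capture product's CSV column names
# to the element names a target repository might expect.
FIELD_MAP = {
    "InvoiceNumber": "invoice_no",
    "Vendor": "vendor_name",
    "Total": "amount",
}


def csv_export_to_xml(csv_text: str) -> str:
    """Convert a CSV export (with a header row) into a simple XML string."""
    root = ET.Element("documents")
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = ET.SubElement(root, "document")
        for source_name, target_name in FIELD_MAP.items():
            field = ET.SubElement(doc, target_name)
            # Missing columns become empty elements rather than errors.
            field.text = (row.get(source_name) or "").strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "InvoiceNumber,Vendor,Total\n1001,Acme Supply,250.00\n"
    print(csv_export_to_xml(sample))
```

Changing the target format means changing the mapping and the writer, not the capture product, which is the point: the export side is the cheap part to adapt.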

While it would be ideal to find a data capture application that has the accuracy, the features, and the export you desire, I urge organizations not to limit themselves to it. Picking a poor data capture and OCR system will be far more costly than creating even a custom export from scratch.

posted by Chris Riley at 0 Comments

Thursday, October 29, 2009

Not all Documents are Equal – OCRing Newspapers

There are several document types out there, both for full-page OCR and for data capture, that require special attention and configuration. For full-page OCR (extraction of all the text on a document), newspapers are one of these and pose some interesting challenges. When considering the OCR of these types of documents, you need to change how you think about the document itself.

When you open a page of any newspaper, you likely consider the document as a whole, while your brain is picking apart the pieces. This is the key to OCRing newspapers. The biggest challenge facing companies wanting to convert newspapers to text using OCR is their layout. Although the font in newspapers is usually pretty small, a page can often be scanned at a quality where the raw OCR read accuracy is very high. Newspapers have their own structure: page headings, section headings, article titles, article sub-titles, bylines, articles, and footers. Not only that, articles span pages.

When converting a newspaper, the most effort should be spent on proper zoning. Because the document analysis tools built into OCR engines are tuned to the average document (which newspapers are not), they will accurately find columns and paragraphs, but the key is to find the titles and bylines and to be able to separate articles. Most large service bureaus processing newspapers at volume have a manual zoning process followed by a single OCR read, which produces very accurate results because the zoning was done properly. Others have devised a two-pass OCR system that essentially zones documents twice, narrowing the focus on each pass and increasing zoning accuracy and thus OCR accuracy. This solves read accuracy but not page continuations.

Page continuations are most often handled after OCR with a business rule applied to the OCR result. Metadata from the OCR results should indicate which page the text came from; thus, by finding the words “continues on” at the bottom of any given article, you can concatenate its continuation to it for final presentation. As a part of this rule, keep an article count and an article portion count; by the end you should have 0 portions and only articles. If you have low confidence in the merging of articles, you can simply merge what you can and review the remaining portions, and your accuracy will then increase.
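The continuation rule described above could be sketched roughly as follows. This is only an illustration under several assumptions: the `Portion` record, the `slug` field used to pair an article with its continuation, and the exact "continues on page N" and "continued from page N" wording are all hypothetical, and real OCR output would need fuzzier matching.

```python
import re
from dataclasses import dataclass

# Assumed jump-line wording; real newspapers vary and OCR may garble it.
CONTINUES_ON = re.compile(r"continues on page (\d+)\s*$", re.IGNORECASE)
CONTINUED_FROM = re.compile(r"^\s*continued from page \d+\s*", re.IGNORECASE)


@dataclass
class Portion:
    page: int   # page number reported in the OCR metadata
    slug: str   # hypothetical keyword shared by an article and its continuation
    text: str


def merge_continuations(portions):
    """Return (articles, leftover_portions) after following jump lines."""
    by_key = {(p.page, p.slug): p for p in portions}
    used = set()
    articles = []
    for p in portions:
        # Continuations are only consumed by following a "continues on" marker.
        if CONTINUED_FROM.match(p.text) or (p.page, p.slug) in used:
            continue
        parts, current = [], p
        while current is not None:
            key = (current.page, current.slug)
            if key in used:
                break  # guard against circular jump lines
            used.add(key)
            body = CONTINUED_FROM.sub("", current.text, count=1)
            marker = CONTINUES_ON.search(body)
            if marker:
                parts.append(body[:marker.start()].strip())
                current = by_key.get((int(marker.group(1)), current.slug))
            else:
                parts.append(body.strip())
                current = None
        articles.append(" ".join(parts))
    # Anything never consumed is a portion that still needs manual review.
    leftovers = [p for p in portions if (p.page, p.slug) not in used]
    return articles, leftovers


if __name__ == "__main__":
    sample = [
        Portion(1, "budget", "Council passes budget. continues on page 4"),
        Portion(1, "weather", "Rain expected all week."),
        Portion(4, "budget", "continued from page 1 The vote was 5 to 2."),
        Portion(6, "orphan", "continued from page 2 Text with no matching start."),
    ]
    articles, leftovers = merge_continuations(sample)
    print(len(articles), "articles,", len(leftovers), "portions left to review")
```

The length of the leftover list plays the role of the portion count: when it reaches 0, every piece has been attached to an article, and anything still in it is what a reviewer should look at.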

OCRing newspapers has its challenges, not to mention the difficulty in scanning them, but it's possible and can be very accurate with the right state of mind and the right approaches.

posted by Chris Riley at 2 Comments

Wednesday, September 16, 2009

It's CAPTCHA for a reason – Why you can't OCR CAPTCHA

I've been surprised recently by the number of project requests and Twitter conversations insisting that OCR can be used to read CAPTCHA. A CAPTCHA is that crazy set of letters and numbers most websites ask you to enter when completing a web form. The purpose of a CAPTCHA is to prevent web bots from creating accounts on websites for use in spamming or other malicious activities. It's surprising how many organizations, both private and public, want people to solve this problem of reading CAPTCHA for them. Almost all of these companies ask for the use of OCR technology to do so.

I'm sorry, but the answer is that it's not possible with OCR. The reason is that CAPTCHA is not an OCR problem. It would be more logical to call it ICR (hand print), but this is still a stretch. OCR is Optical Character Recognition, which is the reading of typographic text. CAPTCHA fonts are clearly not typographic. To be typographic they would have to have the same baseline (bottom border), the same font height for each character in the same class, etc. CAPTCHA fonts more closely resemble hand print, which calls for ICR processing. However, even ICR technology expects some consistency: for the most part, on a given day and time you will write the word “CVision” pretty much the same way across a form. This allows ICR to understand a subject's hand strokes, etc., in creating the character. This level of consistency is simply not present in CAPTCHAs. CAPTCHAs deploy backgrounds and ever-moving lines to break up the consistency of even their already bizarre fonts. For the most part, each CAPTCHA system at any given moment in time will produce a different variation of each possible character.

While the idea of processing CAPTCHAs is technically enticing, actually wanting to do it has obvious malicious intent. Conversion of CAPTCHAs would require a combination of varying recognition technologies, adaptive pattern training, and imaging techniques. I'm not convinced that the effort in creating such an approach is fiscally feasible, especially when the average project is offering fifteen dollars to complete it. My job today is to set the record straight and let the world know that CAPTCHA processing is not a job for OCR and ICR technology, period.

posted by Chris Riley at 0 Comments