Tuesday, February 16, 2010

MacWorld 2010

At first I did not think I would find the time to make it, but I did, and on a Saturday, mind you. Last week was MacWorld 2010 in San Francisco. I went into the show this year not expecting much. I've become accustomed to the rapid decline in attendance and quality of trade shows recently. That, combined with the fact that Apple was not present and Adobe was not really welcome, made me think the show would be a little light. The show was actually pretty well attended, but the quality of the exhibitors was not great. There was a lot of useless, almost Microsoft-ish clutter: mice, cases, keyboards with silly color-coding systems whose point I'm still trying to figure out, and an app pavilion so cramped that it was impossible to get an idea of what was even going on. There were, however, a few things worth noting from the show.

Eye-Fi. I just like these guys. I like what they are doing, and I am impressed with the technology. Eye-Fi has a product that combines software and hardware embedded on an SD card to give you a way to automatically transfer photos and video from your camera to your favorite location as soon as a Wi-Fi connection is available. There are several obvious limitations, but the future potential is outstanding. I don't like that they are tied to the DCIM folder structure, but I'm sure that can be overcome.

Fujitsu ScanSnap + Evernote. Evernote really played up its new scanning functionality. I like Evernote's app, but I don't like how proprietary it is. It is not easy (it takes a hack) to get your documents out of the application. What if you need to use the data somewhere else or, heaven forbid, need to migrate? Evernote, combined with Fujitsu's talk about the ScanSnap profile that scans directly to Evernote, has now created a complete personal content management system.

Neat Receipts. I have always been a fan of the product. I'm very familiar with the particular scanner they use and have worked with it directly on an OEM level. The great part about the scanner is not so much the quality of the scan but the form factor and convenience. Like all the other scanner guys, Neat talked mostly about the software bundled with the scanner rather than the scanner itself. They also have a personal content management system, or file cabinet application, and they play up their recognition quality. Unfortunately, the optical character recognition (OCR) available on the Mac is still not a port of the best version available on the PC, so the quality is not great. While I was waiting to say hello, I found it interesting how many complaints I heard about the accuracy of business card reading (BCR) and receipt reading. I felt bad for the guys, but not too much. Vendors like Neat, ReadIRIS, etc. have painted a picture for end users that is completely incorrect, so users of course expect the reading of business cards and receipts to be very accurate, when on the desktop level it usually is not. The vendors only hurt themselves with bad market education.

And finally, Microsoft. I found their sarcastic marketing against Apple humorous. It reiterates the love-hate relationship that exists. It's like two brothers, where one makes all the money off what the other creates. They know it, I know it, but we still pretend.

Not too much was said about the iPad at the show. Some companies quickly incorporated a photo or two into their marketing. And there was an amazing lack of tchotchkes. I covered the show in 1.5 hours and did not feel bad about the lack of effort. I doubt MacWorld will ever be what it was, but I'm hoping that next year brings more technology and fewer cases and accessories.

posted by Chris Riley at 0 Comments

Friday, November 6, 2009

Invisible characters

Exceptions in OCR and data capture are usually thought of as misrecognized characters only, but in reality there are several other types of exceptions. One of those is called a “high confidence blank”. A “high confidence blank” in OCR or data capture is where the software looked in a particular region for a character but no text was identified or read. In data capture, “high confidence blanks” usually occur for entire fields or for just the first character; in full-page OCR they are less common but can occur sporadically throughout the text of the document, or for the entire text. This type of exception is elusive and hard to detect. Obviously, if entire fields or blocks of text are missed where you expect there to be text, it is easy to spot, but for the one-off missing characters it's tough. With full-page OCR, detection is done with spell-check: missing characters in a word will most likely flag the word as misspelled. In data capture it's much trickier, and the best thing to do is to take certain steps to avoid “high confidence blanks”.

1.) The first thing you can do to avoid “high confidence blanks” in data capture is to NOT overuse image clean-up. If characters are regenerated or cleaned too much, they look to the OCR engine like a graphic rather than a typographic character and are thus skipped.
2.) Second, if you have control of the form design, make sure text is not printed close to lines; this is one of the biggest generators of “high confidence blanks”.
3.) If text is close to lines, then you should be able to establish a rule in the software indicating, for example, that if the first character in a field is more than x pixels away from the border then most likely one or more characters were missed.
4.) If at all possible, use dictionaries and data types that state the structure of the information that should be present in a field. If a character is missing, this data type will likely be broken.
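
The rules in items 3 and 4 can be sketched in a few lines of Python. This is only an illustration and not the rule syntax of any particular capture product: the function names, the 12-pixel threshold, and the idea that the software hands you the field border and first-character positions in pixels are all assumptions.

```python
import re

# Hypothetical threshold: how far (in pixels) the first recognized character
# may sit from the left border of the field before a missed character is suspected.
MAX_LEFT_GAP_PX = 12


def first_char_gap_suspicious(field_left_px, first_char_left_px, max_gap_px=MAX_LEFT_GAP_PX):
    """Rule 3: a large gap between the border and the first character hints at a blank."""
    return (first_char_left_px - field_left_px) > max_gap_px


def breaks_data_type(value, pattern):
    """Rule 4: the recognized value no longer fits the structure expected for the field."""
    return re.fullmatch(pattern, value) is None


def flag_possible_blank(value, field_left_px, first_char_left_px, pattern):
    """Return the list of reasons this field should go to review (empty if none)."""
    reasons = []
    if first_char_gap_suspicious(field_left_px, first_char_left_px):
        reasons.append("gap")
    if breaks_data_type(value, pattern):
        reasons.append("data_type")
    return reasons


# A 5-digit field where the leading "1" was silently dropped.
print(flag_possible_blank("2345", 100, 130, r"\d{5}"))
```

Either rule alone will produce false alarms, so a field flagged for both reasons is the strongest candidate for manual review.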

This type of exception is one that leads to hidden downstream problems when organizations don't realize that it might happen. Being aware and taking the proper steps to avoid "high confidence blanks" is the solution.

posted by Chris Riley at 0 Comments

Wednesday, November 4, 2009

You can read the fine-print

As fonts get smaller, the challenge of reading them with OCR software increases. However, there are some key things that organizations should be aware of when reading the fine print.

OCR technology today is capable of reading fonts as small as 8 pt, even 6 pt, very accurately. It used to be that unless you had a 12 pt font you stood no chance. Because of the increased quality of scans and more advanced OCR engines, reading small fonts can be no problem if the right approaches are used.

Small fonts are more sensitive to image quality and to degradation of the document. For this reason, original source images scanned at 300 DPI or higher are necessary. For normal fonts there is seldom reason to scan higher than 300 DPI, but for small fonts the goal is to get them to appear more or less the same as the regular fonts, so scanning them at 400 to 600 DPI is useful. Additionally, having “clean” documents is very important. A smudge or spill on a document impacts smaller fonts many times more than larger fonts because of the closeness of the lines. Once you have good image quality you can start the conversion.
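
The arithmetic behind the 400 to 600 DPI suggestion is easy to check. The sketch below assumes the usual convention of 72 points per inch and treats the point size as the nominal height of the font; actual glyphs are somewhat shorter. The function names and the 12 pt at 300 DPI reference are my own choices for illustration.

```python
POINTS_PER_INCH = 72  # standard typographic point


def nominal_height_px(font_size_pt, dpi):
    """Nominal height in pixels of a font of the given point size at a scan resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


def dpi_to_match(small_pt, reference_pt=12, reference_dpi=300):
    """Scan resolution at which a small font covers as many pixels as the reference font."""
    return reference_pt * reference_dpi / small_pt


print(nominal_height_px(12, 300))  # 12 pt text at 300 DPI
print(nominal_height_px(6, 300))   # 6 pt text at the same resolution
print(dpi_to_match(6))             # resolution where 6 pt looks like 12 pt at 300 DPI
print(dpi_to_match(8))             # same for 8 pt
```

At 300 DPI a 6 pt font gets half the pixels of a 12 pt font, and doubling the resolution to 600 DPI makes up the difference.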

The next biggest benefit for small fonts comes from zoning them separately. Zoning is the process of rubber-banding the region where the text exists. When small fonts are grouped in the same zone with normal-sized fonts, the OCR software assumes that they should be of the same size, and the confidence and accuracy go down. If you zone the small fonts separately, you increase the OCR engine's ability to use experts just for small fonts and increase the accuracy on them.
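
As a rough sketch of what separate zoning means in practice, the Python below groups text lines into zones whenever the line height changes noticeably. It assumes you already have the text lines in reading order with a measured height in pixels; the function name and the 25% tolerance are hypothetical, and real zoning tools work on image regions rather than a simple list.

```python
def split_zones_by_height(lines, tolerance=0.25):
    """Group consecutive lines into zones of similar height.

    lines: list of (text, height_px) tuples in reading order.
    A new zone starts when a line's height differs from the height of the
    zone's first line by more than the given fraction.
    """
    zones = []
    zone_height = None
    for text, height in lines:
        if zones and abs(height - zone_height) <= tolerance * zone_height:
            zones[-1].append(text)
        else:
            zones.append([text])
            zone_height = height
    return zones


page = [
    ("Terms of the offer", 50),
    ("Sign up today and save.", 48),
    ("Offer void where prohibited.", 25),
    ("See store for details.", 24),
]
for zone in split_zones_by_height(page):
    print(zone)
```

Each resulting zone can then be sent to the OCR engine on its own, so the fine print is never judged against the size of the body text.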

Next time someone tells you to read the small print, tell them you won't read it; you will scan and OCR it.

posted by Chris Riley at 0 Comments

Thursday, October 29, 2009

Not all Documents are Equal – OCRing Newspapers

There are several document types out there, both for full-page OCR and for data capture, that require special attention and configuration. For full-page OCR (extraction of all the text on a document), newspapers are one of these and pose some interesting challenges. When considering the OCR of these types of documents, you need to change how you think about the document itself.

When you open up a page of any newspaper, you likely consider the document as a whole while your brain is picking apart the pieces. This is the key to OCRing newspapers. The biggest challenge facing companies that want to convert newspapers to text using OCR is their layout. Although the font in newspapers is usually pretty small, it can often be scanned at a quality where the raw OCR read accuracy is very high. Newspapers have their own structure: they have page headings, section headings, article titles, article sub-titles, bylines, articles, and then footers. Not only that, articles span pages.

When converting a newspaper, the most effort should be spent on proper zoning. Because the document analysis tools built into OCR engines are tuned to the average document (which newspapers are not), they will accurately find columns and paragraphs, but the key is to find the titles and bylines and to be able to separate articles. Most large service bureaus processing newspapers at volume have a manual zoning process followed by a single OCR read, which produces very accurate results, all because the zoning was done properly. Others have devised a two-pass OCR system that essentially zones documents twice, narrowing the focus on each step and increasing zoning accuracy and thus OCR accuracy. This solves the read accuracy but not page continuations.

Page continuations are most often handled after OCR with a business rule applied to the OCR result. Metadata from the OCR results should indicate which page the text came from, so by finding the words “continues on” at the bottom of any given article you can concatenate its continuation to it for final presentation. As a part of this rule, keep an article count and an article portion count; by the end you should have 0 portions and only articles. If you have low confidence in the merging of articles, you can simply merge the results and review the remaining portions, and your accuracy will then increase.
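
Here is a minimal Python sketch of such a business rule. It assumes that each zoned piece of text arrives as a dict with the page it was found on, that a jump is marked with "continues on page N" at the end of the text, and that the continuation starts with "continued from page N". Those markers, the dict layout, and the function name are assumptions for illustration; real newspapers word their jump lines in many different ways.

```python
import re

# Hypothetical jump markers; adjust to the wording the newspaper actually uses.
CONTINUES = re.compile(r"\s*continues on page (\d+)\s*$", re.IGNORECASE)
CONTINUED = re.compile(r"^\s*continued from page (\d+)\s*", re.IGNORECASE)


def merge_continuations(portions):
    """Merge article portions across pages.

    portions: list of dicts like {"page": 1, "text": "..."}.
    Returns (articles, leftover_portions). Leftovers are continuations
    that could not be attached to any article and need review.
    """
    tails = {}  # (page the tail is on, page it came from) -> portion
    heads = []
    for portion in portions:
        match = CONTINUED.match(portion["text"])
        if match:
            tails[(portion["page"], int(match.group(1)))] = portion
        else:
            heads.append(portion)

    articles = []
    for head in heads:
        text, page = head["text"], head["page"]
        while True:
            match = CONTINUES.search(text)
            if not match:
                break
            tail = tails.pop((int(match.group(1)), page), None)
            if tail is None:
                break  # marker found but no matching continuation
            # Drop both markers and join the two portions.
            text = text[:match.start()] + " " + CONTINUED.sub("", tail["text"], count=1)
            page = tail["page"]
        articles.append(text)
    return articles, list(tails.values())


portions = [
    {"page": 1, "text": "Council votes on budget. Continues on page 4"},
    {"page": 1, "text": "Local team wins."},
    {"page": 4, "text": "Continued from page 1 The vote passed 5-2."},
    {"page": 6, "text": "Continued from page 2 Orphaned text."},
]
articles, leftovers = merge_continuations(portions)
print(len(articles), "articles,", len(leftovers), "portions left to review")
```

The leftover list is the article portion count from the rule above: when it reaches 0, every continuation has found its article.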

OCRing newspapers has its challenges, not to mention the difficulty in scanning them, but it's possible and can be very accurate if you are in the right state of mind and use the right approaches.

posted by Chris Riley at 2 Comments

Friday, October 9, 2009

It learns right? - The misconception about recognition learning

Because of the way the market has come to understand OCR (typographic recognition) and ICR (hand-print recognition), it is no surprise that some of the most common questions and expectations about the technology seem to come from a tarot card rather than from fact. Previously I talked about one of these questions, “How accurate is it?”, and how the basis of that question is completely off and can come to no good. Here is a similar one: “It learns, right?” It is quite a loaded question, so let's explore.

Learning is the process of retaining knowledge for subsequent use. Learning is based in the realm of fact: following the same exact steps creates the same exact results. OCR and ICR arguably learn every time they are used. For example, engines will do one read and then go back and re-read characters with low confidence values, using patterns and similarities they identified on that single page. This happens at the page level, and after that page is processed this knowledge is gone. This is where the common question comes in. What people expect is that the OCR engine will make an error on a degraded character that is later corrected, and now that it has been corrected once, that character will never have an error again. Assuming this is true, you would believe that at some point the solution will be 100% accurate, once all the possible errors have been seen.

WRONG! The fact that the technology does not remember across sessions is the reason it works so well. Imagine, for example, a forms processing system that was processing all surveys generated by a single individual (this is true for OCR as well), and the processing happened enough that it learned all possible errors and was 100% accurate. Then you start processing a form generated by a new individual. Your results on the first form type and the new one will likely be horrendous, not because of the recognition capability, but because of the supposed “learning”. In this case learning killed your accuracy as soon as any variation was introduced.

What most people don't realize is that characters change. They change based on paper, printer, humidity, handling conditions, etc. In the area of ICR this is exaggerated, as the characters of a single individual change by the minute based on mood and fatigue. So learning is a misnomer, as what you are learning is only one page, one printer, one time, one paper, a combination that will likely never repeat again. A successful production environment allows as much variation as possible at the highest accuracy, and this is not done with this type of learning.

Things that can be learned: as I said before, a single pass of a page can be followed by a second pass of low-confidence characters using patterns learned on that page. In the world of data capture, field locations can be learned, and field types can also be learned. In the world of classification, documents are learned based on their content; this in fact is what classification is.

While the idea of errors never repeating again is attractive, people need to understand that this technology is so powerful because of the huge range of document types and text that can be processed, and this is only possible by allowing variance.

posted by Chris Riley at 0 Comments

Thursday, October 1, 2009

“eBooks for Reading” By: Oc R.

As the popularity of reading eBooks increases, so does the demand and need to convert books to eBooks. Legality aside, the promise of using OCR technology to create eBooks is very high, and it is not too difficult. There are a few things to remember when you want to use OCR to create an eBook. Getting a digital file into the eBook format is relatively easy, but creating the content for that format is the challenge. Enter Optical Character Recognition (OCR). There are several steps to successfully creating an eBook with OCR.

1.) How you scan
2.) How you optimize the image
3.) How you OCR the image

There are two common ways to scan a book. If you are lucky enough to have a book scanner, this is the desired approach, as it does not require the destruction of the book. These scanners are very pricey but do a great job. The resulting image from a book scanner is one image for every two pages. We will get to this in a moment. The other way to scan is with a typical document scanner, where you remove the binding of the book and use the scanner to produce an image file for each page. With this approach the quality is high, sometimes even higher than with a book scanner, but it is less convenient. It's important with this approach to keep the book's page order correct, as you often have to scan in batches and it's easy to get pages mixed up. Scanning should be done at 300 DPI and saved as TIFF, either as Group 4 (which is black-and-white only) or as grayscale. This will produce the ideal image. Unless the book has very small fonts, these settings will do the trick. Scanning in color would only be required if your book has color photographs.

Once scanning is done and you have image files, it's time to apply imaging. For the most part, pages scanned with the binding removed will not require imaging, except perhaps line straightening and deskew in case of crooked scans. For books scanned with a book scanner there are two critical imaging tools that will always be applied. The first is page separation. This is the imaging that separates the left side and the right side of the image into two separate pages of the book. The result is two separate image files. Next, line straightening is required on each of these image files. Because the binding of a book causes pages to curve inward, this curve appears as curved lines in a book scan. Line straightening finds the baseline for each line on the page and makes every portion of every line follow it.
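
Page separation can be sketched very simply. The Python below is a toy illustration, not how any particular imaging package does it: it represents a grayscale scan as a list of pixel rows, assumes the binding shows up as the darkest column somewhere in the middle third of the two-page image, and cuts there. The function name and the middle-third assumption are mine.

```python
def split_spread(image):
    """Split a two-page scan into left and right page images.

    image: list of rows, each row a list of grayscale values
    (0 = black, 255 = white), at least 3 pixels wide.
    The gutter is assumed to be the darkest column in the middle third.
    """
    width = len(image[0])
    lo, hi = width // 3, 2 * width // 3
    # Column with the lowest total brightness, taken to be the binding shadow.
    gutter = min(range(lo, hi), key=lambda x: sum(row[x] for row in image))
    left = [row[:gutter] for row in image]
    right = [row[gutter:] for row in image]
    return left, right


# Tiny 2 x 9 "scan" with a dark gutter in column 4.
spread = [[255, 255, 255, 255, 0, 255, 255, 255, 255] for _ in range(2)]
left, right = split_spread(spread)
print(len(left[0]), len(right[0]))
```

Each half would then be saved as its own image file and passed on to line straightening.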

Now the magic of OCR can take place. Following these steps will create an accurate eBook for 90% of the books out there. There are many utilities that will then take text, DOC, XML, etc. and convert it into the desired eBook format. Some tagging may be required for chapters and the like to gain all of the functionality in eBook readers.

posted by Chris Riley at 0 Comments