Tuesday, December 29, 2009

Rich Media OCR

I often speak of unique uses of OCR, and here is yet another: OCRing video files. But why? Part of managing rich media assets is indexing them. Technologies such as speech recognition and optical character recognition make rich media far easier to index and search.

Text that OCR technology finds in video frames can be extracted and stored as metadata. In the simplest scenario, this is a text file that accompanies the video file. More complex environments will even tell you the minute and second at which the text occurs. Because this is not a traditional use of the technology, some special considerations apply.
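To make the metadata idea concrete, here is a minimal sketch of a sidecar file that records the minute and second at which each piece of text appears. The names `TextHit`, `frame_to_timestamp`, and `build_sidecar` are my own illustrations, and the tab-separated layout is an assumption, not a standard format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextHit:
    """Hypothetical record: text the OCR engine found on one frame."""
    frame_index: int
    text: str


def frame_to_timestamp(frame_index: int, fps: float) -> str:
    """Convert a frame number to MM:SS, given the video's frame rate."""
    total_seconds = int(frame_index / fps)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


def build_sidecar(hits: List[TextHit], fps: float) -> str:
    """Build the contents of a text file that accompanies the video.

    Each line is "MM:SS<TAB>text" (an assumed layout for illustration).
    """
    lines = [f"{frame_to_timestamp(h.frame_index, fps)}\t{h.text}" for h in hits]
    return "\n".join(lines)
```

For example, at 25 frames per second, text found on frame 1875 would be recorded at 01:15.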

The first is converting the video into individual image files. For OCR to be effective, it needs to work on a series of still images. Although a video is only a sequence of images displayed at a high rate of speed, it's still somewhat of a challenge to convert video files such as MPEG into a series of images. On top of that, the motion blur that occurs in some frames makes their text harder to read.
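One way to do the conversion, assuming the free ffmpeg command-line tool is installed, is to have it write one image per second of video. This is only a sketch: the function names and the `frame_%06d.png` naming pattern are my own choices, and sampling once per second is an assumption that also reduces the number of near-duplicate frames.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(video_path: str, out_dir: str,
                          frames_per_second: int = 1) -> List[str]:
    """Build an ffmpeg command that saves sampled frames as PNG files."""
    pattern = str(Path(out_dir) / "frame_%06d.png")
    return [
        "ffmpeg",
        "-i", video_path,                   # input video, e.g. an MPEG file
        "-vf", f"fps={frames_per_second}",  # keep N frames per second of video
        pattern,                            # numbered output images
    ]


def extract_frames(video_path: str, out_dir: str,
                   frames_per_second: int = 1) -> None:
    """Run ffmpeg; requires ffmpeg to be installed and on the PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_extract_command(video_path, out_dir, frames_per_second),
        check=True,
    )
```

Calling `extract_frames("clip.mpg", "frames")` would fill the `frames` folder with images ready for OCR. It does nothing about motion blur, so blurry frames still have to be filtered or tolerated downstream.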

The second challenge is dealing with repeated frames. Because consecutive frames differ only slightly from each other, the same text often appears unchanged across a long series of frames. Better OCR results will account for this and report the text once rather than once per frame.
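A simple way to avoid repeating text, sketched here with Python's standard `difflib` module, is to compare each frame's OCR output with the last text that was kept and skip it when the two are nearly identical. The function name `collapse_repeats` and the 0.9 similarity threshold are assumptions for illustration; a real system would tune the threshold to its OCR error rate.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def collapse_repeats(frame_texts: List[Tuple[int, str]],
                     threshold: float = 0.9) -> List[Tuple[int, str]]:
    """Keep only the first frame of each run of near-identical text.

    frame_texts is a list of (frame_index, ocr_text) pairs in frame order.
    """
    kept = []  # entries are (frame_index, original_text, normalized_text)
    for index, text in frame_texts:
        # Normalize whitespace and case so small OCR jitter is ignored.
        normalized = " ".join(text.split()).lower()
        if not normalized:
            continue  # frame had no text at all
        if kept:
            similarity = SequenceMatcher(None, kept[-1][2], normalized).ratio()
            if similarity >= threshold:
                continue  # same text as the last kept frame
        kept.append((index, text, normalized))
    return [(index, text) for index, text, _ in kept]
```

The frame index of each kept entry marks where that text first appears, which is exactly what a timestamped index needs.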

The final challenge is dealing with the variation in fonts and the often small text sizes. This requires an OCR engine with settings for specialized recognition, and one that is very accurate on complex, low-quality documents.

I expect that in the future, this technique in conjunction with speech recognition will be used in eDiscovery, content management, and robust search of rich media files.

posted by Chris Riley

Monday, December 21, 2009

OCRing Magazines

Often when I receive printed periodicals, my preference is to OCR them to a digital searchable format and read the articles I'm interested in on my computer, just like my online periodicals. One of these printed documents might be a magazine. Magazines are either very easy to OCR or very difficult, and usually both cases exist in a single magazine. It all has to do with the graphical elements that magazines so often incorporate.

Text printed on graphics. Very often articles will have text printed over related graphics. If entire paragraphs are printed over a single graphic, it's less challenging; but when text spans both a graphic and white space, it's problematic because the text, sometimes within a single word, changes from a colored font to normal black text in order to contrast with the background.

Annotated images. Many magazines, including my favorite scientific one, include text as part of the diagrams in their articles. To many this text may be irrelevant, but to me it supplies important search words at the very least. These annotations tend to be set in a small font and are often hard for the OCR engine to identify because of their close proximity to images.

The good news is that for the most part the purpose of OCRing any magazine is to make its text searchable. Anything more would probably be illegal. The other good news is that there are tricks to deal with each of these problems. First, scan the magazine in color. The additional information provided by the color scan helps the OCR engine distinguish graphics from text printed on graphics. Second, enable the engine's full recognition mode and any settings geared to small fonts. Third, turn off document analysis or enable only limited document analysis. This is the less obvious setting. With document analysis disabled, the OCR engine can't get confused by strange structure, text printed on graphics, and annotated images. You are forcing it to read all possible text.

Because searchable text is the greatest benefit of OCRing my periodicals, I have opted for the OCR settings that produce the most text and the least structure. If you are converting similar documents, I recommend doing the same.
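The exact settings differ from engine to engine. As one illustration only, assuming the open-source Tesseract command-line tool as a stand-in for whatever engine you use, its sparse-text page segmentation mode (`--psm 11`) finds as much text as possible in no particular order, which is the closest equivalent I know of to turning document analysis off. The function name and the default language are my own assumptions.

```python
from typing import List


def build_ocr_command(image_path: str, output_base: str,
                      language: str = "eng",
                      searchable_pdf: bool = False) -> List[str]:
    """Build a Tesseract command tuned for "most text, least structure".

    image_path should be a color scan of one magazine page.
    """
    command = [
        "tesseract",
        image_path,
        output_base,      # Tesseract appends the file extension itself
        "-l", language,
        "--psm", "11",    # sparse text: read all text, ignore page layout
    ]
    if searchable_pdf:
        command.append("pdf")  # write a searchable PDF instead of plain text
    return command
```

Running the resulting command with `subprocess.run(command, check=True)` would produce `output_base.txt` or, with `searchable_pdf=True`, `output_base.pdf`, provided Tesseract is installed.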

posted by Chris Riley