Monday, March 8, 2010

This blog has moved


This blog is now located at http://blog.livinganalytics.com/.

For feed subscribers, please update your feed subscriptions to
www.livinganalytics.com.
posted by Chris Riley at 0 Comments

Friday, March 5, 2010

Workflow, super-charge with OCR

Document workflow can be as simple as saving a file to a single location or as complex as decision-tree document routing rules. Throw some paper into the mix and the problem intensifies. Getting your paper documents to fit your already accepted digital document workflow can be challenging. Some organizations choose to keep the paper and digital workflows separate. Others unite them but create separate rules for each. For most, however, it would be ideal to have a single workflow engine or product supporting digital, image, and paper documents alike.

To do so with the greatest value, you need not only document conversion using Optical Character Recognition (OCR), but also some other advanced imaging and recognition tools. In the digital document world, you don't have only the data contained in the document; you also have various metadata items such as file name, file location (taxonomy), tags, etc. In order to marry paper with digital, the same metadata has to be created for the paper document, and this has to occur at the time of document processing. This could be a manual or an automated process, and depending on your paper volume, doing it manually may be OK. To compete with the efficiency of digital documents, however, automatic is the way to go.

Using OCR together with image-based and contextual classification, paper or image documents that enter the workflow can obtain the same value as digital documents. The OCR is responsible for getting all the content from the document. This content is used for search, indexing, and auto-filing, as well as for generating keywords (tags) associated with a taxonomy. In order to determine where the document fits into a taxonomy, you must first classify it.

Classification is most effective when it happens on two levels. Image-based classification looks at what the document looks like: it classifies documents based on their physical structure, which is a good indicator of type and very fast to evaluate. Contextual classification looks at what words the document contains: it goes one level deeper and searches for the keywords that make a document one type rather than another. For some environments, image-based classification can do the job entirely. Once the classification is known, a classification engine can place the document in the correct spot in an existing taxonomy, and applying tags, a file name, and a file location to the document is no longer a challenge.
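
As a rough illustration of the two levels, here is a minimal Python sketch. Everything in it is an assumption made for the example: the taxonomy, the keyword lists, and the idea of a precomputed "layout signature" are hypothetical stand-ins, not the behavior of any particular classification product.

```python
# Hypothetical taxonomy: document type -> folder location and tags.
TAXONOMY = {
    "invoice": {"folder": "Finance/Invoices", "tags": ["invoice", "accounts-payable"]},
    "contract": {"folder": "Legal/Contracts", "tags": ["contract", "legal"]},
}

# Hypothetical keyword lists used for contextual classification.
KEYWORDS = {
    "invoice": {"invoice", "amount", "due", "remit"},
    "contract": {"agreement", "party", "hereby", "term"},
}

# Hypothetical layout signatures, assumed to be produced by an imaging tool.
KNOWN_LAYOUTS = {"table-heavy-single-page": "invoice"}


def classify_by_layout(layout_signature):
    """Level 1 (image-based): look up the layout signature, or return None."""
    return KNOWN_LAYOUTS.get(layout_signature)


def classify_by_content(ocr_text):
    """Level 2 (contextual): pick the type whose keywords occur most often."""
    words = [w.strip(".,:;()").lower() for w in ocr_text.split()]
    scores = {t: sum(w in kws for w in words) for t, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def route(layout_signature, ocr_text):
    """Try the fast image-based level first, then fall back to the OCR text."""
    doc_type = classify_by_layout(layout_signature) or classify_by_content(ocr_text)
    if doc_type is None:
        # Nothing matched: send the document to a manual review folder.
        return {"type": "unclassified", "folder": "Review", "tags": []}
    entry = TAXONOMY[doc_type]
    return {"type": doc_type, "folder": entry["folder"], "tags": entry["tags"]}
```

Calling `route("unknown", "This Agreement is made by each party hereby")` falls through to the contextual level and files the document under `Legal/Contracts`; a real engine would of course use far richer features at both levels.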

Workflow can stand alone, but injected with the power of OCR and document classification, it becomes a powerhouse that does not know the difference between paper and digital.

posted by Chris Riley at 0 Comments

Thursday, March 4, 2010

Document longevity

One of the biggest risks in document scanning is doing it wrong. A document that is scanned improperly, stored improperly, and whose original paper has been destroyed can put an individual or organization in a very serious situation. Sometimes it's just too hard to anticipate or know what settings to use. For example, while your scanning today may be for the purpose of regular consumption via search and retrieval, tomorrow the same document could be required and printed for a lawsuit.

Fortunately, technologies are advancing such that scanning the “Golden Document” is practical and possible. The “Golden Document” is a document scanned with all the best settings for quality, without regard to file storage or performance, the two biggest drivers of reduced scan quality. The settings for the “Golden Document” are a resolution of 300 DPI, a color bit depth, and a file format of uncompressed TIFF. If the “Golden Document” is the optimum, one must rationalize any deviation from it.

With advances in document scanners, compression, and file formats, the need for rationalization becomes less and less. Document scanners can now scan a color image at nearly the speed of a black-and-white one, so there is little reason to use black-and-white or grayscale scans. A color document gives you the ability to convert, re-purpose, and print. Scanning at 300 DPI is a setting that should never be compromised. Once you have the golden scan, you have created a rather large file. Ideally you could compress this file to a more regularly consumed format and not lose quality, and compression technology advances substantially every year. The ideal file format for storage, quality, etc. is arguably searchable PDF. This format has the functionality of a regularly consumed document and can be configured for long-term sustainability. Alternatively, some may choose to create both a PDF and a Word document for the additional ability to re-purpose.
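
To put a number on "rather large", here is a small Python sketch that estimates the raw pixel data in an uncompressed scan. The page size (US Letter, 8.5 x 11 inches) and the 24-bit color depth are assumptions chosen for the example, and the estimate ignores the TIFF header and tags, which add a little on top.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate raw pixel data for an uncompressed scan, in bytes.

    Each row of pixels is rounded up to a whole number of bytes.
    File header and tag overhead are not included.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page: US Letter at 300 DPI.
color = uncompressed_size_bytes(8.5, 11, 300, 24)   # 24-bit color (assumption)
bitonal = uncompressed_size_bytes(8.5, 11, 300, 1)  # 1-bit black and white

print(f"color:   {color:,} bytes ({color / 2**20:.1f} MiB)")
print(f"bitonal: {bitonal:,} bytes ({bitonal / 2**20:.1f} MiB)")
```

Under these assumptions a single color page is about 25 million bytes (roughly 24 MiB) before compression, against about 1 MiB for the same page in black and white, which is why the choice of compression and final format matters so much.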

While you may not be scanning the “Golden Document” today, now is the time to revisit why not, and to look at ways to get there.

posted by Chris Riley at 0 Comments

Wednesday, March 3, 2010

Compression: Save space, AND MONEY

Yes, compression saves valuable hard-drive space, but as the technology world becomes more and more hosted, it is just as important for saving money. Previously I have explored various types of compression, both general and file-type specific. I have also explored the various drivers for compression: archiving, and saving space on regularly consumed files. What I have not talked about in detail is how compression is becoming more and more popular for saving money on hosted storage services.

Hosted software products are being created at a faster rate than installed ones. Many of these hosted solutions are content driven, such as content management, eDiscovery, accounts payable, off-site storage, etc., and they are all rooted in storing data. The preferred business model for the companies producing these solutions is to charge per megabyte of usage, or a combination of megabyte usage and a monthly service charge. For this reason, it's important to consider how much storage is being used, not only for cost control, but also to make sure the system is being used for useful data and not garbage.

Often organizations purchase an allotment of storage that they pay for monthly; their goal is to not exceed their storage limit and have to upgrade to the next level. With content management services in particular, documents are often uploaded but never used within the system, and are purely space wasters.

For these reasons, compression is a great tool for reducing the size of the files on your hosted service. The type of compression used for hosted services needs to be file specific. Hosted applications understand specific file formats and how to consume them, so general-purpose compression formats such as ZIP are not useful. Instead, compression for a particular format, such as PDF compression, must be used. In this way, you are still working with a compatible and consumable PDF, but at a much smaller size. The driver here must be compression for regular consumption. There are hosted archival systems, but in this case I'm discussing hosted products where the data contained in them is used on a frequent to semi-frequent basis.

By compressing documents, a company can store more data for a lower storage fee. As hosted software products become more common, you will see people seeking better and better ways to make their files smaller while maintaining quality.
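
The arithmetic behind the savings is simple enough to sketch in Python. All of the figures below (page count, megabytes per page, price per megabyte, and the 5:1 compression ratio) are made-up assumptions for illustration, not quotes from any hosted service.

```python
from decimal import Decimal


def monthly_storage_cost(page_count, mb_per_page, price_per_mb,
                         compression_ratio=Decimal("1")):
    """Monthly fee for storing the pages, billed per megabyte.

    compression_ratio is original size divided by compressed size,
    so Decimal("5") means the files shrink to one fifth.
    """
    stored_mb = Decimal(page_count) * mb_per_page / compression_ratio
    return (stored_mb * price_per_mb).quantize(Decimal("0.01"))


# Hypothetical account: 100,000 pages at 0.5 MB each, billed at $0.01 per MB.
before = monthly_storage_cost(100_000, Decimal("0.5"), Decimal("0.01"))
after = monthly_storage_cost(100_000, Decimal("0.5"), Decimal("0.01"),
                             compression_ratio=Decimal("5"))

print(f"uncompressed: ${before} per month")
print(f"compressed:   ${after} per month")
```

With these assumed numbers the monthly fee drops from $500.00 to $100.00. The same calculation also shows how much headroom compression buys before an account hits the next storage tier.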

posted by Chris Riley at 0 Comments

Tuesday, March 2, 2010

Document Conversion and Law

Both CVISION Technologies and I had the pleasure of attending LegalTech 2010 this year in New York. I was quite impressed with the show and especially with how engaged the attendees were. Where do document conversion and compression technologies fit in the legal space? Here is a brief review of how the technologies are used in this vertical market.

File security:

Let's start with the most popular buzzword: PDF. PDFs are the most popular file format in legal because they can be secured and, with the right compression tools, made very small. Security is fairly obvious, but compression not so much. Because many of the legal case management platforms, eDiscovery engines, and content management systems are billed by the megabyte of space, keeping files small but usable is critical. The trend is for fewer of these applications to be installed products and more to be hosted. The hosted products usually have a monthly service fee and charge per amount of storage. Keeping the content valuable but small then becomes a real concern, especially when dealing with the hundreds of thousands of pages a case might contain.

Search-ability:

Lawyers work with a lot of paper, and getting at the right information is tough. That is why, before a document can be loaded into any case management or eDiscovery system, it must be OCRed and made searchable. Good OCR is essential, as is the ability to get the documents converted quickly. Without OCR, eDiscovery simply cannot work on paper. Surprisingly, OCR was a common afterthought, yet poor OCR was a major complaint about products. Some organizations simply put the paper or image files aside, risking loss of valuable information. Others did not concern themselves with OCR accuracy and just assumed it was good enough. Both are mistakes, and I hope a dying trend in this particular market, as these organizations are only hurting themselves. Garbage in, garbage out.

Translation:

The number of translation companies at the show was large. Why? Because very often lawsuits involve a large collection of documents written in several languages. In order for eDiscovery to work well, the data must be normalized, i.e. translated. The first challenge is to find the languages. It is a tremendous effort to go through a large collection of documents and identify each page on which a particular language occurs. The second challenge, for paper documents, is getting the data into a digital format so that manual or software-based translation can occur. OCR can help with both: it converts paper to digital text, and language detection happens internally in the OCR engine during recognition. As with search, the quality of the OCR is imperative, so law firms have every right to be concerned about which OCR engine they or their translation company uses.
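
To show what "finding the languages" looks like once the pages are digital text, here is a deliberately naive Python sketch. It assumes the OCR step has already produced one text string per page, and it guesses the language by counting a handful of common words; the word lists are tiny illustrative samples, and a real OCR engine's internal language detection is far more sophisticated than this.

```python
# Tiny illustrative stop-word samples (assumption: real systems use much
# larger models, not five words per language).
STOP_WORDS = {
    "english": {"the", "and", "of", "to", "is"},
    "spanish": {"el", "la", "de", "que", "y"},
    "german": {"der", "die", "und", "das", "ist"},
}


def detect_language(text):
    """Guess the language whose sample stop words appear most often."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def pages_by_language(pages):
    """Map each detected language to the 1-based page numbers it occurs on."""
    index = {}
    for page_number, text in enumerate(pages, start=1):
        index.setdefault(detect_language(text), []).append(page_number)
    return index


# Hypothetical OCR output, one string per page.
sample_pages = [
    "The terms of the agreement",
    "El contrato de la empresa",
    "Der Vertrag und die Anlage",
    "12345",
]
print(pages_by_language(sample_pages))
```

The resulting index tells a translation team which pages to send to which translator, and the "unknown" bucket collects the pages that need a human look.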

If you did not attend, I recommend you keep it on your radar for next year, or consider the West Coast version. While document conversion is not the favorite topic in legal, it finds its way into each step of case management, eDiscovery, and billing.

posted by Chris Riley at 0 Comments

Thursday, February 25, 2010

Imprint vs. Annotate

Large-volume scanning environments often need to imprint (herein “stamp”) information, usually the date of scan, on each and every page that is processed. This requirement exists for tracking purposes and sometimes for compliance. Many service bureaus require more than just a date; they require batch IDs and other important tracking information. The question becomes how to do this in the best way. There are several options.

Pre-Scan Imprint

Pre-scan imprint, the most common option, allows an organization to have the stamp on both the physical paper copy and the scan. Scanners capable of pre-scan imprint print the data in the proper location before the page reaches the scanner's lamps, so the stamp also becomes part of the scan. This option is the most common because there are times when a scanned image needs to be compared with a physical document, and a stamp on both is what makes that possible. Scanners with the imprint feature come at a premium and require more maintenance.

Post-Scan Imprint

If the organization only needs the data or tracking mechanism on the physical paper, then it can imprint after the scan. Some scanners support post-scan imprinting; otherwise, organizations feed the paper through an additional printing process. Usually the purpose of this operation is simply to mark whether a page has been processed or not. Scanners with the post-scan imprinting feature run nearly the same price as pre-scan imprinting scanners and are gradually being phased out in favor of them.

Software Annotation

If the organization only needs the data or tracking mechanism on the scanned image, it may elect to do software annotation. Software annotation gives the greatest flexibility of all three options, as any combination or sequence of data can be printed anywhere on the image. Software annotation requires an additional piece of software. Very often organizations will choose software annotation to avoid the premium for imprinting scanners, sacrificing the physical imprint. The application that provides the annotation needs to be automated and batch driven.
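
As a sketch of the "automated and batch driven" part, the Python below generates the stamp text for every page in a batch. The stamp layout and the batch ID format are hypothetical choices for the example, and actually drawing the text onto each image is left to whatever imaging software is in use.

```python
from datetime import date


def build_stamps(batch_id, page_count, scan_date):
    """Return one stamp string per page: scan date, batch ID, page counter."""
    return [
        f"{scan_date.isoformat()} | BATCH {batch_id} | PAGE {n:04d} OF {page_count:04d}"
        for n in range(1, page_count + 1)
    ]


# Hypothetical batch of three pages scanned on 25 February 2010.
for stamp in build_stamps("B-0042", 3, date(2010, 2, 25)):
    print(stamp)
```

The first line printed is `2010-02-25 | BATCH B-0042 | PAGE 0001 OF 0003`. Because the stamp is just data, any combination or sequence of fields can be added without touching the scanner.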

The alternative to the above three methods is manual stamping. Manual stamping is tedious, time-consuming, and often inaccurate. It's up to the organization to review the three options and pick the best fit for its production and budget.

posted by Chris Riley at 0 Comments

Tuesday, February 23, 2010

Translating images

Text translation services come in a variety of forms, from individuals who make a good living translating documents from one language to another, to large firms using many individuals or purely software. No matter the form, they are all faced with a challenge when the text they need to translate is contained in physical paper or an image file.

Today, translation is facilitated with the use of word processing systems. Word processors give the translator the ability to be more efficient and manage the translation process over many sessions. But in order to use the capabilities of a word processing system, it's necessary to get the text into a digital format. That is where Optical Character Recognition comes in. OCR is one of the greatest tools in a translator's bag of tricks. It allows the individual to convert the image files and physical paper to digital text which can be consumed and translated.

The great thing about modern OCR is the sheer number of languages that are supported. Not only is OCR capable of converting a document in one language to digital text, but even if the document contains multiple languages, it's smart enough to know where one language begins and the other ends. If you imagine the risk to a translator who is handed text containing OCR errors, you will see why scanning documents at the optimum quality is so important. Modern OCR engines will tell the operator exactly where any confusion might have occurred and give them the opportunity to correct it. Documents scanned at 300 DPI as black-and-white TIFF Group 4 images will OCR especially well.
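
Here is a minimal Python sketch of how that "where the confusion occurred" information might be used. It assumes the OCR engine reports a confidence between 0 and 1 for each word; the data structure, the sample values, and the 0.85 threshold are all hypothetical, since every engine exposes this differently.

```python
def words_needing_review(ocr_words, threshold=0.85):
    """Return (position, word, confidence) for every word below the threshold."""
    return [
        (position, word, confidence)
        for position, (word, confidence) in enumerate(ocr_words)
        if confidence < threshold
    ]


# Hypothetical OCR output: (recognized word, engine confidence).
sample = [("Vertrag", 0.98), ("zwischen", 0.95), ("Mül1er", 0.61), ("GmbH", 0.90)]

for position, word, confidence in words_needing_review(sample):
    print(f"check word {position}: {word!r} (confidence {confidence:.2f})")
```

In this sample only the third word, where a digit has crept into a name, is flagged. The translator corrects the flagged words before translating instead of proofreading the whole page.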

Without OCR, a translator's job becomes more of a data entry task than what they are truly skilled at, which is translation.

posted by Chris Riley at 0 Comments