Thursday, March 4, 2010

Document longevity

One of the biggest risks in document scanning is doing it wrong. A document that is scanned improperly, stored improperly, and whose paper original has been destroyed can put an individual or organization in a very serious situation. Sometimes it's just too hard to anticipate which settings to use. For example, while you may be scanning today for regular consumption via search and retrieval, tomorrow the same document could be requested and printed for a lawsuit.

Fortunately, technology is advancing to the point that scanning the “Golden Document” is practical. The “Golden Document” is a document scanned with the best settings for quality, without regard to file storage or performance, the two biggest drivers of reduced scan quality. The settings for the “Golden Document” are a resolution of 300 DPI, a color bit-depth, and a file format of uncompressed TIFF. If the “Golden Document” is the optimum, any deviation from it needs a justification.
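
To get a feel for what these settings cost in storage, here is a rough sketch of the arithmetic. It assumes a US Letter page (8.5 x 11 inches) and 24-bit color; neither figure comes from the settings above, which only call for "a color bit-depth", and the function name is made up for illustration. TIFF header overhead is ignored.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi=300, bits_per_pixel=24):
    """Raw pixel data size of one scanned page, ignoring TIFF header overhead."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: US Letter, 8.5 x 11 inches.
letter_color = uncompressed_scan_bytes(8.5, 11)                  # assumed 24-bit color
letter_bw = uncompressed_scan_bytes(8.5, 11, bits_per_pixel=1)   # 1-bit black-and-white

print(f"24-bit color at 300 DPI: {letter_color / 1_000_000:.1f} MB per page")
print(f"1-bit black-and-white at 300 DPI: {letter_bw / 1_000_000:.1f} MB per page")
```

Under these assumptions a single color page is roughly 25 MB before compression, which is why storage and performance are the usual reasons for deviating from the golden settings.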

With advances in document scanners, compression, and file formats, there is less and less to justify. Document scanners can now scan a color image at nearly the speed of a black-and-white one, so there is little reason to use black-and-white or gray-scale scans. A color document gives you the ability to convert, re-purpose, and print. Scanning at 300 DPI is a setting that should never be compromised. The golden scan does leave you with a rather large file. Ideally you could compress this file into a more regularly consumed format without losing quality, and compression technology advances substantially every year. The ideal file format for storage, quality, etc. is arguably searchable PDF. This format has the functionality of a regularly consumed document and the characteristics needed for longevity. Alternatively, some may choose to create both a PDF and a Word document for the additional ability to re-purpose.

While you may not be scanning the “Golden Document” today, now is a good time to revisit why not, and to look at ways to get there.

posted by Chris Riley at 0 Comments

Wednesday, March 3, 2010

Compression: Save space, AND MONEY

Yes, compression saves valuable hard-drive space, but as the technology world becomes more and more hosted, it is just as important for saving money. Previously I have explored various types of compression, both general and file-type specific. I have also explored the drivers for compression: archiving, and saving space on regularly consumed files. What I have not talked about in detail is how compression is becoming more and more popular for saving money on hosted storage services.

Hosted software products are being created at a faster rate than installed ones. Many of these hosted solutions are content driven, such as content management, eDiscovery, accounts payable, and off-site storage, and they are all rooted in storing data. The preferred business model for the companies producing these solutions is to charge per megabyte of usage, or a combination of megabyte usage and a monthly service charge. For this reason, it's important to consider how much storage is being used, not only to control cost but also to make sure the system is holding useful data and not garbage.

Often organizations purchase an allotment of storage that they pay for monthly; their goal is to stay under the storage limit so they do not have to upgrade to the next level. With content management services in particular, documents are often uploaded but never used within the system, and are purely space wasters.

For these reasons, compression is a great tool to reduce the size of the files on your hosted service. The type of compression used for hosted services needs to be file specific. Hosted applications understand specific file formats and how to consume them, so general archive formats such as ZIP are not useful here. Instead, compression for a particular format, such as PDF compression, must be used. That way you are still working with a compatible and consumable PDF, but at a much smaller size. The compression must be chosen with regular consumption in mind. There are hosted archival systems, but in this case I'm discussing hosted products where the data is used on a frequent to semi-frequent basis.

By compressing documents, a company can store more data for a lower storage fee. As hosted software products become more common, you will see people seeking better and better ways to make their files smaller while maintaining quality.
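
As a rough illustration of the per-megabyte pricing model described above, the sketch below compares a monthly bill before and after compression. Every number in it (the price per megabyte, the base service charge, the library size, and the 60% savings) is a made-up assumption, not a quote from any real service.

```python
def monthly_storage_cost(total_mb, price_per_mb, base_fee=0.0):
    """Monthly bill under a 'service charge plus per-megabyte usage' model."""
    return base_fee + total_mb * price_per_mb


def compressed_size_mb(total_mb, savings_fraction):
    """Size left after compression that saves the given fraction of space."""
    return total_mb * (1.0 - savings_fraction)


# All figures below are hypothetical and only illustrate the arithmetic.
original_mb = 50_000      # size of the document library in megabytes
price_per_mb = 0.01       # dollars per megabyte per month
base_fee = 25.0           # flat monthly service charge in dollars
savings = 0.60            # fraction of space saved by PDF compression

before = monthly_storage_cost(original_mb, price_per_mb, base_fee)
after = monthly_storage_cost(
    compressed_size_mb(original_mb, savings), price_per_mb, base_fee
)

print(f"Before compression: ${before:.2f} per month")
print(f"After compression:  ${after:.2f} per month")
```

The same calculation also shows how much more data fits under a fixed storage allotment before an upgrade to the next tier is needed.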

posted by Chris Riley at 0 Comments

Monday, December 14, 2009

Squeeze those files

Compression is a great tool for saving hard-drive space. You may not currently be thinking about file compression, but you should. It's very likely that data is being created on your machines at an increasing rate, and your free hard-drive space is shrinking at the same fast pace. Organizations and individuals often only consider file compression when there is far too little space left on their hard drives or when the low-space warnings start appearing. This is a big risk.

As we create files on our computers, access them, move them, and modify them, we are fragmenting the drive. Overly fragmented drives slow down machines and increase the risk of damage and corruption. The more files you have, the more this multiplies. Real-time file compression helps with this because a file is compressed as soon as it is generated. Less space is used, and the need to compress in the future is gone. Back-log compression (compressing all your files in bulk) requires a lot of activity on the hard drive and increases the fragmentation. The other risk of bulk compression is that you only have one chance to get it right.

Bad compression is not just an irritation, it's a risk. Usually when you compress a file, you remove the original. The whole purpose is to save space, not to use up more by keeping both copies, and keeping both files just to be sure the compression was done correctly wastes a lot of space. With day-forward or real-time compression, it's easy to check the files as they come across at initial setup and make sure everything is good. With bulk compression, a single mistake could ruin a large library of files.
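
One way to lower that risk is to verify every compressed file before the original is removed. The sketch below is a minimal example of the idea, using gzip from the Python standard library as a stand-in for whatever compression tool you actually use; the function names are made up for illustration. Because gzip is lossless, a checksum comparison is enough here. A lossy tool, such as most PDF image compression, would need a visual or quality check instead.

```python
import gzip
import hashlib
import os


def sha256_of(stream):
    """Hash a binary stream in 1 MB chunks so large files fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compress_and_verify(path, remove_original=True):
    """Gzip a file, confirm it decompresses to identical bytes, then
    (optionally) delete the original. Returns the compressed file's path."""
    compressed_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(compressed_path, "wb") as dst:
        for chunk in iter(lambda: src.read(1 << 20), b""):
            dst.write(chunk)

    with open(path, "rb") as src:
        original_hash = sha256_of(src)
    with gzip.open(compressed_path, "rb") as packed:
        roundtrip_hash = sha256_of(packed)

    if original_hash != roundtrip_hash:
        # Keep the original and discard the bad copy.
        os.remove(compressed_path)
        raise ValueError(f"round-trip check failed for {path}")

    if remove_original:
        os.remove(path)
    return compressed_path
```

Running a bulk job with `remove_original=False` first, on a copy of a few files, is a cheap way to confirm the settings before committing a whole library.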

I firmly believe in file compression, but I know firsthand the risk of doing it incorrectly. I now compress files as they are created and no longer have to think about data piling up faster than I can find ways to save space.

posted by Chris Riley at 1 Comments

Thursday, December 10, 2009

Space age Optical Character Recognition

There are a lot of technologists out there who believe that optical character recognition (OCR) has its days numbered and is an aged technology. The belief is that paper will soon go away. This post is for those who believe OCR technology is going away.

The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange (EDI), but in other areas it has actually increased. Studies have also shown that because more documents are being scanned, printing increases when the documents need to be shared or re-purposed. But I'm not here to argue that paper is staying, and with it the document conversion technologies required to convert it. I'm here to point out a few futuristic uses that technologists already like to talk about and that involve OCR.

Data Security

The first futuristic use of the technology I would like to discuss is the use of OCR in data security. Text strings sent over the Internet are far easier to sniff and read than a compressed JPEG image. What if, during transmission, you converted the text to a JPEG-compressed image and, on the receiving end, ran OCR on it to get the data back? By doing so the data has been masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.

File Compression

Storing ASCII text takes up far less space than an image or video file. As a part of the future of compression technologies, expect OCR to be incorporated to extract the text from an image and save it as ASCII. Viewers will convert the text back to an image during viewing. This removes the image of the text from the file and significantly reduces file size.

Robots

How else do you expect future robots to read text? OCR, of course. The eyes of the robot are essentially a camera that rapidly takes pictures. When the robot needs to comprehend text, the image will be converted using OCR and fed through an engine that extracts meaning from the text and acts on it.

So there you have it: three really cool, cutting-edge ways OCR is and will be used in the future. Paper is not going away, but even if it were, just look at the other cool uses of OCR technology.

posted by Chris Riley at 0 Comments

Wednesday, November 18, 2009

PDF weight loss program

The cost of hard-drive space has decreased dramatically over the years, but the amount of data being created is keeping pace. It's important to find ways to manage the space you have, and one way to do that is file compression. PDF files are usually a great opportunity to save space. PDFs consist of layers. A PDF created using OCR will most likely have an image layer and a text layer. There are several ways to compress PDFs: during the scan, after the scan, or with a compressed file format.

Compressing PDFs during scan is the fastest way to ensure files are the size you expect. The downside is that you never start with an uncompressed file, so quality is out of your control. Most advanced compression tools are not lossless, so the file can be compromised; if you never have a chance to view the file uncompressed, there is no chance to undo any issues. During-scan compression essentially compresses the TIFF or RAW image file prior to PDF creation, so the other downside is that it does not achieve the highest possible compression.

Compressing PDFs after scan allows you to leverage the latest technology and ensure the greatest compression. Instead of compressing an image before PDF creation, some tools work specifically on the PDF format. The benefit is that they can use features within the PDF itself to create a compressed file. This usually results in the smallest file with the greatest remaining quality.

The most common tool for compression is a compressed archive format such as RAR or ZIP. These tools can very nicely compress many formats into a single file. The challenge is that files that need to be viewed regularly require a decompression step first. This is time consuming and increases the risk of file loss. This type of compression is useful for storing files that are not regularly accessed. Because it can compress many formats, it is not as advanced in any one format as specialized file compression tools are.

People commonly overlook the importance of compression. Because compressed files often replace originals, you only have one chance to get it right for the life of that file. Companies will use various forms of compression. Because PDF files usually contain important information, please consider carefully how you wish to store and compress them.

posted by Chris Riley at 0 Comments

Monday, October 19, 2009

Zip, Compress, Tar, Rar ….. confuse me?

Hard drives continually get cheaper and cheaper, but people collect information at a rate that still fills up hard-drive real estate faster than we can add storage. One trick for saving space is to use compression technologies. Most people, when they think about compression, now think of either a Zip file or that little check-box in the Windows settings that enables compression on a physical drive. What is often overlooked is the ability to compress a single file with a file compression tool built for a specific format.

Choosing the right way to compress files and save space depends on several things: how often you will access that file again, what compression ratio you are getting, and what the long-term impacts of the compression are. When you use Zip, Tar, or RAR, you are usually combining multiple files together and don't plan to access them soon. These tools take multiple files and combine them into a single archive file. This means that any one individual file in that archive takes additional time and effort to open. With this approach you can combine many different formats. Some formats will have a compression ratio of 0% and others a compression ratio of 60%. Rarely, a zip operation fails and the result is file corruption. I always suggest checking that you can un-zip a zip after it's created. People who need to access their files regularly, or need to be able to search their content at any time, can still benefit from compression tools that are specific to a format and that work one file at a time in batch.
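
The advice to check an archive right after creating it can be sketched with Python's standard zipfile module. This is only an illustration: the function name and the report format are made up, and the space-saved figure is calculated from the sizes that zipfile records for each member.

```python
import os
import zipfile


def zip_and_verify(paths, archive_path):
    """Create a zip archive, test that every member reads back cleanly,
    and return the fraction of space saved for each member."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            archive.write(path, arcname=os.path.basename(path))

    report = {}
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() re-reads every member and checks its CRC; it returns
        # the name of the first bad member, or None if all are fine.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        for info in archive.infolist():
            if info.file_size:
                saved = 1.0 - info.compress_size / info.file_size
            else:
                saved = 0.0
            report[info.filename] = saved
    return report
```

Plain text usually shows a large saving, while formats that are already compressed (JPEG images, for example) show close to none, which matches the wide spread of ratios mentioned above. Only delete the originals once the check has passed.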

The most common file format that people use for search and retrieval, and that is generated by Data Capture and OCR, is PDF. PDFs usually compress well in a Zip, Tar, or Rar tool, but there are specific things that can be done just for a PDF to compress it even further. PDFs often have a searchable text layer and an image layer for viewing. The bulk of the file size is almost always the image layer, so a specific image compression can be applied to just this layer, and a separate text compression to the text layer. The result is a PDF that opens just like any other file but takes up much less space. The benefit is that you can access your PDF at any time, it's still indexed by your search utility, and you are saving space!

Compression is almost always a good choice when you need to save space. Compression technologies have come a long way in the last 4 years. It's good to know what your purpose is in compressing and how frequently you need access to your files. Don't be afraid to scout out compression tools for specific file formats and give them a try.

posted by Chris Riley at 0 Comments

Monday, October 12, 2009

Down and dirty paperless office

In my office, paper comes in, is reviewed for value, gets scanned, and is then shredded or filed. I have set up a system that allows me to very efficiently scan documents to my “digital file cabinet”. Here is a quick guide on how I do it!

What you will need:

1. An unused computer attached to your network
2. Google Desktop Search with network browsing enabled
3. A document scanner
4. A server-based automatic OCR product
5. A file compression product (optional but recommended)

Now to put it all together. My system is an inexpensive desktop computer with Windows XP installed. Once all the applications are installed, you don't even need a monitor attached to this computer. The computer is visible on the network and has one shared folder, the “File Cabinet” folder in my case. This computer is my stand-alone digital file cabinet. Attached to it is a document scanner with a 30-page feeder. I have the scanner configured to scan to an “input” directory on the machine.

The automatic OCR product is configured to pick up images as soon as they arrive in the input folder (the “hot folder”), OCR them using index-level OCR settings, and create a PDF with a hidden searchable text layer. The resulting PDF is put into another hot folder that the PDF compression tool is watching. As soon as a PDF arrives in this folder it is compressed, and the compressed PDF is moved to the “File Cabinet” folder.
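
For readers curious how a hot folder works under the hood, here is a minimal sketch of the pattern using only the Python standard library. It is not how any particular OCR or compression product is implemented. The function names, the file suffixes, and the polling interval are assumptions, and the handler shown is only a placeholder that copies the file where a real OCR or compression step would run.

```python
import shutil
import time
from pathlib import Path


def process_hot_folder(input_dir, output_dir, handler, suffixes=(".tif", ".tiff")):
    """One pass over the hot folder: hand each matching file to the handler,
    then remove it from the input folder. Returns the files produced."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    produced = []
    for source in sorted(input_dir.iterdir()):
        if not source.is_file() or source.suffix.lower() not in suffixes:
            continue
        result = handler(source, output_dir)
        source.unlink()  # only reached if the handler did not raise
        produced.append(result)
    return produced


def copy_unchanged(source, output_dir):
    """Placeholder handler: a real setup would run OCR or compression here."""
    target = output_dir / source.name
    shutil.copyfile(source, target)
    return target


def watch(input_dir, output_dir, handler, poll_seconds=5.0):
    """Poll the hot folder forever, processing whatever has arrived."""
    while True:
        process_hot_folder(input_dir, output_dir, handler)
        time.sleep(poll_seconds)
```

Chaining two of these, with the output folder of the first acting as the hot folder of the second, gives the scan, OCR, compress, file sequence described above. A production version would also need to skip files the scanner is still writing.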

Because Google Desktop Search is set to index all files in the “File Cabinet” folder, the PDFs very quickly become a part of the index. Configure Google Desktop Search to enable network searches so that any machine on the network can open a browser, go to a URL hosted on the digital file cabinet machine, and locate documents with a search.

Once it's set up, it's simply a matter of putting paper in the scanner and pressing the scan button, and you're done. It's that easy, and extremely useful!

posted by Chris Riley at 2 Comments