Tuesday, February 23, 2010

What are the advantages to OCR PDF?

Why choose to OCR to PDF?  What are the advantages?

The Image with Hidden Text PDF has become an OCR standard because it can "carry" both the image and the text in a single file.  It also helps you avoid legal issues, since the pristine image is left untouched and is not altered in any way by the OCR process.  If you have worked with OCR applications before, you know that they typically have a hard time with formatting, and that other output formats can alter the original by introducing formatting errors and substituted characters.

Friday, February 19, 2010

Is OCR PDF larger than a TIFF?

A question that often comes up with Optical Character Recognition is file size: is a searchable PDF going to be larger than a plain image PDF or a TIFF?  When converting to image only, PDF and TIFF are typically about the same size, and both use compression.  The text layer adds only a very small increment compared to the overall size of the file.  The key to keeping file sizes as small as possible is to run image processing before recognition to clean the image, remove speckles, etc.  This requires an engine or document capture application that provides image processing.
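To make the "remove speckles" step concrete, here is a minimal sketch in Python.  It assumes a bilevel (black and white) page held as a list of rows, and it treats any black pixel with no black neighbors as scanner noise.  The function name and the `max_neighbors` parameter are made up for illustration; a real capture application would use its own, more capable filters.

```python
def despeckle(image, max_neighbors=0):
    """Return a copy of a bilevel image with isolated black pixels removed.

    image is a list of rows; 1 = black (ink), 0 = white (paper).
    A black pixel counts as a speckle when it has at most
    max_neighbors black pixels among its 8 surrounding pixels.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += image[ny][nx]
            if neighbors <= max_neighbors:
                cleaned[y][x] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],  # lone pixel: likely scanner noise
    [0, 0, 0, 1, 1, 0],  # part of a character stroke
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = despeckle(page)
```

The lone pixel is cleared while the 2x2 stroke survives.  Random specks like this tend to compress poorly, so removing them before recognition generally helps both file size and OCR accuracy.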

Saturday, February 13, 2010

Why use OCR to PDF?

There are many different OCR output formats you can use:  text, HTML, Word, PDF, etc.  Adobe provides an OCR format, called Image with Hidden Text, that gives you the ability to "carry" the converted text along with the original image.  This format provides full text search capabilities while also preserving the pristine original image.  Why use OCR Software?

Let's take a SharePoint Scanning application as an example.  Utilizing the PDF iFilter, you can enable SharePoint to crawl OCR PDF content, providing end users not only with column-based search capabilities, but also with full text search.  Here is a link to enabling the PDF iFilter.
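The difference between column-based search and full text search can be sketched in a few lines of Python.  This is only an illustration, not SharePoint's API: the document records, the `vendor` column and both function names are hypothetical, and the `text` field stands in for the hidden OCR text that the iFilter would expose to the crawler.

```python
# Hypothetical library: index columns plus the OCR text of each scanned PDF.
documents = [
    {"title": "Invoice 1001", "vendor": "Acme",
     "text": "Payment due in 30 days for toner cartridges"},
    {"title": "Contract A", "vendor": "Globex",
     "text": "This agreement covers scanner maintenance"},
]


def column_search(docs, column, value):
    """Match only against an index column filled in at capture time."""
    return [d["title"] for d in docs
            if d.get(column, "").lower() == value.lower()]


def full_text_search(docs, term):
    """Match against the OCR text carried inside the PDF."""
    return [d["title"] for d in docs
            if term.lower() in d["text"].lower()]


by_vendor = column_search(documents, "vendor", "acme")
by_content = full_text_search(documents, "scanner")
```

A search for "scanner" finds nothing in the columns, but the full text search locates the contract because the word appears in the OCR text.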

Tuesday, February 9, 2010

SharePoint, OCR and PDF

So, when you have a SharePoint Scanning application, why would you want to OCR PDF files?

I am seeing Microsoft SharePoint utilized more and more as a true document management / document imaging solution.  There are many document capture applications that provide the ability to scan, capture, index and OCR to PDF.  The advantage of the image with hidden text PDF is that if you enable the PDF iFilter (how to enable PDF search in SharePoint), you have a fully text searchable body of documentation.  SharePoint OCR is becoming a requirement.

Monday, February 8, 2010


So, when utilizing Optical Character Recognition, if your purpose is to make documents searchable PDFs, you want to choose the appropriate recognition engine.  First, I think we should explain the different types of PDFs:

- Image - this is just a picture, no text layer.

- Text or Normal - this is normally what is created when you utilize the Adobe Acrobat Distiller.

- Image with Hidden Text - this is the standard in PDF OCR and provides a "pristine" image, with all the OCR text in the background.

The image with hidden text PDF is a great OCR output format, as it allows you to search your PDFs with hit highlighting.  So if you are utilizing a document capture application, or plan on Scanning to SharePoint and utilizing the Adobe iFilter for searching, the image with hidden text is the best format for OCR / PDF.
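For the curious, here is a rough sketch of how a page in this format can be laid out internally.  In the PDF format, text rendering mode 3 (`3 Tr`) draws text that is neither filled nor stroked, so it stays invisible but remains searchable.  The Python below builds only the page content stream, not a complete PDF file; the function names, the resource names `Im0` and `F1`, and the word positions are assumptions for illustration, and an OCR engine would supply the real coordinates.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def hidden_text_page_stream(width, height, words,
                            image_name="Im0", font_name="F1"):
    """Build a page content stream: scanned image first, invisible text on top.

    words is a list of (text, x, y, font_size) tuples in page units.
    """
    lines = [
        "q",                               # save graphics state
        f"{width} 0 0 {height} 0 0 cm",    # scale the image to fill the page
        f"/{image_name} Do",               # paint the untouched scanned image
        "Q",                               # restore graphics state
        "BT",                              # begin text object
        "3 Tr",                            # rendering mode 3: invisible text
    ]
    for text, x, y, size in words:
        lines.append(f"/{font_name} {size} Tf")      # select font and size
        lines.append(f"1 0 0 1 {x} {y} Tm")          # position the word
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")                     # end text object
    return "\n".join(lines)


stream = hidden_text_page_stream(612, 792, [("Invoice (copy)", 72, 700, 12)])
print(stream)
```

The image is painted exactly as scanned, and the OCR words sit behind it at matching positions, which is what lets a viewer highlight search hits on top of the original picture.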