OCR and PDF

Monday, September 15, 2014

Mobile Capture and OCR

With all the buzz about mobile, it will be interesting to see who takes the lead in mobile capture. Sure, several of the big capture vendors are swinging for the fences: Kofax, OpenText, IBM, etc. But there are also some great new companies on the scene. Below are some of the up-and-comers:

StratusFlow - StratusFlow is a new generation of mobile data capture app that focuses not only on mobile capture of images, but also on mobile data capture. You can see more from them here: StratusFlow Mobile Data Capture

Captricity - These guys have been building a pretty cool online tool that will perform form data extraction, and have received some accolades.

If you know of any others, please comment and let me know.

Saturday, April 26, 2014

What is PDF/A and Why Do I Care?

OCR and PDF/A - What it means.

So what exactly is PDF/A, and why does it matter to me? The Portable Document Format (PDF) has long been a simple, pervasive format for sharing documents, especially in the scanning, document capture and ECM industry. But for long-term storage and archiving, many organizations chose TIFF because there were concerns over the viability of PDF for long-term digital preservation of electronic documents. In steps the PDF/A standard, with the goal of eliminating any feature that would inhibit long-term archiving. PDF/A is a standardized version of the PDF format that removes features unsuited to archiving, such as font linking in place of font embedding, and standardizes viewing requirements, embedded fonts, color management and the ability to read embedded comments and annotations. Below are some of the compatibility elements:

Any executable code is forbidden
Color is standardized
All fonts must be embedded
No encryption
No audio or video
Metadata is standards based
Digital signatures are allowed based on standards
Embedded files are allowed with the latest revision
External content references are not allowed
Compression standards are enforced

So why does all this matter? If you are archiving files for the long run, this standard will ensure that you will be able to open, view and read your archived content. Most document scanning and capture solutions will support this output type, and this can prevent long-term issues in your Records Center.
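
To make the checklist above concrete, here is a minimal Python sketch that compares a description of a file against a few of those rules. It is not a PDF/A validator: the PdfFeatures record and its field names are hypothetical, and they would have to be filled in by whatever tool actually inspects the file. Real conformance checking needs a dedicated validator.

```python
from dataclasses import dataclass


@dataclass
class PdfFeatures:
    """Hypothetical summary of one PDF, filled in by some other inspection tool."""
    has_javascript: bool = False
    is_encrypted: bool = False
    has_audio_or_video: bool = False
    all_fonts_embedded: bool = True
    has_external_references: bool = False


def pdfa_violations(features: PdfFeatures) -> list[str]:
    """Return the checklist items that this file would break."""
    problems = []
    if features.has_javascript:
        problems.append("executable code is forbidden")
    if features.is_encrypted:
        problems.append("encryption is not allowed")
    if features.has_audio_or_video:
        problems.append("audio and video are not allowed")
    if not features.all_fonts_embedded:
        problems.append("all fonts must be embedded")
    if features.has_external_references:
        problems.append("external content references are not allowed")
    return problems


# Example: an encrypted scan whose fonts were not embedded.
scan = PdfFeatures(is_encrypted=True, all_fonts_embedded=False)
for problem in pdfa_violations(scan):
    print(problem)
```

An empty result only means none of these five rules was tripped, not that the file is valid PDF/A.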

Here are some great references:

The PDF Association

Library of Congress PDF/A Overview

Adobe: PDF as an Archiving Standard

Thursday, April 24, 2014

Keys to Usable PDFs

PDFs have become the standard in many organizations for archiving files as records. Whether you are scanning paper files for long-term archival, or converting your Office documents to PDF/A for long-term storage, there are some key things you need to know. From a scanning perspective, most scanners just produce an image-based PDF, barren, if you will, of all metadata. PDFs are a rich format that can become a long-term "suitcase" of metadata for storage and information. Here are some tips on how to make your PDFs complete records:

1. Make sure your Scanning or PDF converter supports the PDF/A standard. PDF/A is a long-term archive standard for electronic documents. It ensures the viability of the file in the long term, allows embedding of metadata and can help guard the record against alteration. This is a must for any long-term archival of documents. For a summary on the PDF Archive standard, see Adobe's summary PDFs for Long Term Archive

2. Make sure to Populate the Standard PDF Headers. When creating a PDF through a document capture or conversion process, make sure you populate the PDF headers with metadata. The standard headers include author, subject, keywords and title. For example, if you use an AS400 PDF or spool-to-PDF tool, make sure it populates these headers. Populating these fields can speed up searching and indexing, and ensures that critical information about the record travels with the file. Below is an example of an invoice that was scanned with a document capture application, where the standard headers were populated with metadata:

[Image: PDF Standard Headers]
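
For readers who want to see what these headers look like inside the file, here is a minimal Python sketch that builds the text of a PDF document information dictionary with the four standard fields. The function names and the invoice values are made up for illustration, and the sketch assumes plain ASCII values. It only produces the dictionary text; writing it into a real PDF is the job of your capture or conversion tool.

```python
def pdf_string(text: str) -> str:
    """Write text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def info_dictionary(title: str, author: str, subject: str, keywords: list[str]) -> str:
    """Build the text of a PDF information dictionary (assumes ASCII values)."""
    entries = {
        "Title": title,
        "Author": author,
        "Subject": subject,
        "Keywords": ", ".join(keywords),
    }
    body = " ".join(f"/{key} {pdf_string(value)}" for key, value in entries.items())
    return f"<< {body} >>"


# Hypothetical values for a scanned invoice.
print(info_dictionary(
    title="Invoice 1001",
    author="Accounts Payable",
    subject="Vendor invoice (scanned)",
    keywords=["invoice", "scan"],
))
```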

3. Build Complete Custom Headers for SharePoint Metadata. Advanced conversion software will build out custom PDF header information, and allow you to "tag" your documents. With this, the PDF can now become a redundant container for SharePoint Metadata column information with column name and metadata values. This is the ultimate in metadata packing, and creates a true portable PDF with all pertinent information. Below is an example of custom headers or properties, where invoice number, date, total and vendor are entered:

[Image: PDF Custom Metadata]
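
As a rough illustration of custom headers, the Python sketch below turns a set of column names and values into custom entries for a PDF information dictionary. Column names often contain spaces, which a PDF name cannot hold as-is, so they are written with the #xx hex escape. The function names and sample values are hypothetical, values are assumed to be ASCII, and whether a given conversion tool or SharePoint reads these keys back is outside the scope of this sketch.

```python
def pdf_name(label: str) -> str:
    """Write a label as a PDF name, hex-escaping bytes a name cannot hold raw."""
    special = "()<>[]{}/%#"
    parts = []
    for byte in label.encode("utf-8"):
        char = chr(byte)
        if 0x21 <= byte <= 0x7E and char not in special:
            parts.append(char)
        else:
            parts.append(f"#{byte:02X}")  # e.g. a space becomes #20
    return "/" + "".join(parts)


def pdf_string(text: str) -> str:
    """Write text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def custom_properties(columns: dict[str, str]) -> str:
    """Build dictionary text with one custom entry per column (ASCII values assumed)."""
    body = " ".join(f"{pdf_name(name)} {pdf_string(value)}" for name, value in columns.items())
    return f"<< {body} >>"


# Hypothetical column values for one invoice.
print(custom_properties({
    "Invoice Number": "1001",
    "Invoice Date": "2014-04-24",
    "Total": "250.00",
    "Vendor": "Example Vendor",
}))
```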

4. Always create PDFs that include OCR Text. Using an Optical Character Recognition (OCR) process will convert the image in the PDF into searchable text that can be crawled by SharePoint for the ultimate in searchability. This is a must for all documents.

Did I miss anything? Please comment with anything I missed.

Tuesday, February 23, 2010

What are the advantages to OCR PDF?

OCR PDF Advantages

Why choose to OCR to PDF? What are the advantages?

So, the Image with Hidden Text PDF has become an OCR standard, due to its ability to "carry" both the image and text in a single file. It also helps you avoid legal issues, because the OCR process leaves the pristine image untouched. If you have worked with OCR applications before, you know that they typically have a hard time with formatting, and can alter the original with formatting errors as well as substituted characters.

Friday, February 19, 2010

Is OCR PDF larger than a TIFF?

When performing the Optical Character Recognition process, a question that often comes up is file size: is a searchable PDF going to be larger than a plain PDF or TIFF image? When converting to image only, PDF and TIFF files are typically about the same size, since both use compression. The text layer adds only a very small increment compared to the overall size of the file. The key to keeping file sizes as small as possible is to run image processing before recognition to clean your image, remove speckles, etc. This requires an engine or document capture application that provides the means to process images.

Saturday, February 13, 2010

Why use OCR to PDF?

There are a ton of different OCR formats you can utilize: text, HTML, Word, PDF, etc. Adobe provides an OCR format that gives you the ability to "carry" converted text along with the original image, called Image with Hidden Text. This format has the benefit of providing full text search capabilities while also keeping the pristine original image. Why use OCR Software?

Let's take for example a SharePoint Scanning application. Utilizing the PDF iFilter, you can enable SharePoint to crawl OCR PDF content, providing end users not only with column-based search capabilities, but also with full text search. Here is a link to enabling the PDF iFilter.

Tuesday, February 9, 2010

SharePoint, OCR and PDF

So, when you have a SharePoint Scanning application, why would you want to OCR PDF files?

I am seeing Microsoft SharePoint utilized more and more as a true document management / document imaging solution. There are many document capture applications that provide the ability to scan, capture, index and OCR to PDF. The advantage of the Image with Hidden Text PDF is that if you enable the PDF iFilter (how to enable PDF search in SharePoint), you have a fully text searchable body of documentation. SharePoint OCR is becoming a requirement.