OCR and PDF: April 2014

Saturday, April 26, 2014

What is PDF/A and Why Do I Care?

OCR and PDF/A - What it means.

So what exactly is PDF/A, and why does it matter to me? The Portable Document Format (PDF) has long been a simple, pervasive format for sharing documents, especially in the scanning, document capture and ECM industry. But for long-term storage and archiving, many organizations chose TIFF because of concerns over the viability of PDF for long-term digital preservation of electronic documents. Enter the PDF/A standard, with the goal of eliminating any feature that would inhibit long-term archiving. PDF/A is a standardized (ISO 19005) version of the PDF format that removes features unsuited to archiving, such as font linking in place of font embedding, and standardizes viewing requirements, embedded fonts, color management and the handling of embedded comments and annotations. Below are some of the compatibility elements:

Any executable code is forbidden
Color is standardized
All fonts must be embedded (and must be legally embeddable)
No encryption
No audio or video
Metadata is standards based
Digital signatures are allowed based on standards
Embedded files are allowed in later revisions (PDF/A-2 permits embedded PDF/A files; PDF/A-3 permits any file type)
External content references are not allowed
Compression standards are enforced
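
A few of the restrictions above can be spot-checked with a rough script. The sketch below is only a heuristic, not a PDF/A validator: it searches the raw bytes of a file for markers of features the standard prohibits. The function name, the labels and the sample bytes are made up for illustration, and markers hidden inside compressed object streams will be missed, so a clean result proves nothing on its own.

```python
import re
from typing import List

# Raw byte markers that hint at features PDF/A prohibits.
# Illustrative subset only; the labels on the left are invented for this sketch.
FORBIDDEN_MARKERS = {
    "encryption": rb"/Encrypt\b",
    "JavaScript": rb"/(JavaScript|JS)\b",
    "launch action": rb"/Launch\b",
    "audio or video": rb"/(Sound|Movie)\b",
    "LZW compression": rb"/LZWDecode\b",
}


def find_forbidden_features(pdf_bytes: bytes) -> List[str]:
    """Return the labels whose markers appear anywhere in the raw PDF bytes."""
    return [
        label
        for label, pattern in FORBIDDEN_MARKERS.items()
        if re.search(pattern, pdf_bytes)
    ]


# Hypothetical fragment of a PDF that uses JavaScript and encryption.
sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /Catalog /OpenAction << /S /JavaScript /JS (app.alert(1)) >> >>\n"
    b"endobj\ntrailer\n<< /Root 1 0 R /Encrypt 2 0 R >>\n"
)
print(find_forbidden_features(sample))
```

For real archives, a dedicated PDF/A validation tool is the right choice; a scan like this is only useful as a quick first filter.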

So why does all this matter? If you are archiving files for the long run, this standard is designed to ensure that you will still be able to open, view and read your archived content years from now. Most document scanning and capture solutions support this output type, and using it can prevent long-term issues in your Records Center.

Here are some great references:

The PDF Association

Library of Congress PDF/A Overview

Adobe: PDF as an Archiving Standard

Thursday, April 24, 2014

Keys to Usable PDFs

PDFs have become the standard in many organizations for archiving files as records. Whether you are scanning paper files for long-term archival or converting your Office documents to PDF/A for long-term storage, there are some key things you need to know. From a scanning perspective, most scanners produce just an image-based PDF, barren, if you will, of all metadata. Yet PDF is a rich format that can become a long-term "suitcase" of metadata for storage and information. Here are some tips on how to make your PDFs complete records:

1. Make sure your scanning software or PDF converter supports the PDF/A standard. PDF/A is a long-term archive standard for electronic documents, including scanned images. It ensures the viability of the file in the long term, allows embedding of metadata and, when combined with a digital signature, can help reveal alteration of the record. This is a must for any long-term archival of documents. For a summary of the PDF archive standard, see Adobe's summary, PDFs for Long Term Archive.

2. Make sure to populate the standard PDF headers. When creating a PDF through a document capture or conversion process, make sure you populate the PDF headers (the document properties) with metadata. The standard headers include author, subject, keywords and title. One example to consider: when you use an AS400 PDF or Spool to PDF tool, you need to make sure the headers are populated. Populating these fields can speed up searching and indexing, and ensures that critical information about the record is captured. Below is an example of an invoice that was scanned with a document capture application, where the standard headers were populated with information about the invoice:

Figure: PDF Standard Headers
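
Inside the file, the standard headers live in the PDF document information dictionary. As a minimal sketch of what a capture tool writes there, the code below renders the four fields in PDF syntax. The function names and the invoice values are hypothetical, the sketch assumes ASCII-only text, and it does not write the dictionary into a file or produce the matching XMP metadata that PDF/A also expects.

```python
from typing import List


def pdf_literal_string(text: str) -> str:
    """Wrap text as a PDF literal string, escaping backslashes and parentheses.

    Assumes ASCII-only input; other text needs a different string encoding.
    """
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def build_info_dictionary(
    title: str, author: str, subject: str, keywords: List[str]
) -> str:
    """Render the standard Title, Author, Subject and Keywords entries."""
    fields = {
        "Title": title,
        "Author": author,
        "Subject": subject,
        "Keywords": ", ".join(keywords),
    }
    entries = " ".join(
        f"/{key} {pdf_literal_string(value)}" for key, value in fields.items()
    )
    return f"<< {entries} >>"


# Hypothetical values for a scanned invoice.
info = build_info_dictionary(
    title="Invoice 10042",
    author="Accounts Payable",
    subject="Vendor invoice (scanned)",
    keywords=["invoice", "Acme"],
)
print(info)
```

In practice a PDF library or the capture application itself would write these entries; the point of the sketch is only to show how little data is needed to fill them.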

3. Build complete custom headers for SharePoint metadata. Advanced conversion software will build out custom PDF header information and allow you to "tag" your documents. With this, the PDF becomes a redundant container for SharePoint metadata, holding each column name together with its value. This is the ultimate in metadata packing, and it creates a truly portable PDF with all pertinent information. Below is an example of custom headers or properties, where invoice number, date, total and vendor are entered:

Figure: PDF Custom Metadata
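
One way custom properties can be stored is as extra keys in the PDF document information dictionary, next to the standard ones. The sketch below assumes that approach and shows the one wrinkle involved: a column name such as "Invoice Number" contains a space, which must be written as #20 in a PDF name. The function names and the column values are hypothetical, and a given conversion product may store its tags differently (for example in XMP).

```python
from typing import Dict

# Characters that may not appear unescaped in a PDF name object.
_NAME_DELIMITERS = "#()<>[]{}/%"


def pdf_name(column_name: str) -> str:
    """Turn a column name into a PDF name, writing unsafe bytes as #XX."""
    parts = []
    for byte in column_name.encode("utf-8"):
        char = chr(byte)
        if 0x21 <= byte <= 0x7E and char not in _NAME_DELIMITERS:
            parts.append(char)
        else:
            parts.append(f"#{byte:02X}")
    return "/" + "".join(parts)


def pdf_literal_string(text: str) -> str:
    """Wrap ASCII text as a PDF literal string, escaping \\, ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def custom_info_entries(columns: Dict[str, str]) -> str:
    """Render column name/value pairs as custom dictionary entries."""
    return " ".join(
        f"{pdf_name(name)} {pdf_literal_string(value)}"
        for name, value in columns.items()
    )


# Hypothetical SharePoint column values for one invoice.
invoice_columns = {
    "Invoice Number": "10042",
    "Invoice Date": "2014-04-24",
    "Total": "1,250.00",
    "Vendor": "Acme Supply",
}
print(custom_info_entries(invoice_columns))
```

Keeping the column names intact, rather than abbreviating them, makes it easier to map the values back to SharePoint columns later.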

4. Always create PDFs that include OCR text. An Optical Character Recognition (OCR) process recognizes the text in the scanned image and stores it in the PDF as searchable text, which SharePoint can then crawl for the ultimate in searchability. This is a must for all documents.

Did I miss anything? Please leave a comment.
