OCR and PDF: Keys to Usable PDFs

PDF has become the standard format in many organizations for archiving files as records. Whether you are scanning paper files or converting Office documents to PDF/A for long-term storage, there are some key things you need to know. Most scanners produce only an image-based PDF, barren of all metadata. Yet PDF is a rich format that can serve as a long-term "suitcase" of metadata and information. Here are some tips on how to make your PDFs complete records:

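1. Make sure your scanning or PDF conversion software supports the PDF/A standard. PDF/A (ISO 19005) is a long-term archival standard for electronic documents, including scanned images. It is designed to keep the file viewable in the long term by requiring it to be self-contained (for example, fonts must be embedded), and it allows metadata to be embedded. PDF/A does not lock a file by itself, but it can be combined with digital signatures to help detect alteration of the record. This is a must for any long-term archival of documents. For an overview of the standard, see Adobe's summary, "PDFs for Long Term Archive".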
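As a quick sanity check on tip 1, here is a minimal sketch (Python, standard library only) that reads the PDF/A level a file declares about itself. It assumes the XMP metadata packet is stored uncompressed and wrapped in an x:xmpmeta element, which is typical for PDF/A files but not guaranteed. The function name declared_pdfa_level is my own. It only reports the claim in the metadata; confirming real compliance takes a dedicated PDF/A validator.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace of the PDF/A identification schema used inside XMP metadata.
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
# Assumption: the XMP packet is uncompressed and wrapped in <x:xmpmeta>.
XMP_RE = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)


def declared_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return the declared level (e.g. 'PDF/A-1b'), or None if none is found."""
    match = XMP_RE.search(pdf_bytes)
    if match is None:
        return None
    try:
        root = ET.fromstring(match.group(0))
    except ET.ParseError:
        return None
    part = conformance = None
    for element in root.iter():
        # The properties may be written as attributes of rdf:Description...
        part = part or element.get(f"{{{PDFAID_NS}}}part")
        conformance = conformance or element.get(f"{{{PDFAID_NS}}}conformance")
        # ...or as child elements.
        if element.tag == f"{{{PDFAID_NS}}}part":
            part = part or (element.text or "").strip()
        elif element.tag == f"{{{PDFAID_NS}}}conformance":
            conformance = conformance or (element.text or "").strip()
    if not part:
        return None
    return f"PDF/A-{part}{(conformance or '').lower()}"
```

To use it, read a file with open(path, "rb").read() and pass the bytes in. A result of None means the file makes no PDF/A claim that this sketch can see.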

2. Make sure to populate the standard PDF headers. When creating a PDF through a document capture or conversion process, make sure you populate the PDF headers with metadata. The standard headers (the document properties) include author, subject, keywords and title. For example, if you use an AS/400 PDF or spool-to-PDF tool, check that it populates these headers. Populating these fields can speed up searching and indexing, and ensures that critical information about the record travels with the file. Below is an example of an invoice scanned with a document capture application, where the standard headers were populated with information about the invoice:

Figure: PDF Standard Headers
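To make tip 2 concrete, the sketch below shows roughly what the four standard headers look like once they are written into a PDF's document information dictionary. It is illustration only: in practice your capture or conversion tool (or a PDF library) writes this for you. The helper names pdf_text_string and build_info_dictionary are hypothetical, and the sample values are made up.

```python
from typing import Mapping

# The four standard headers named above, as PDF information dictionary keys.
STANDARD_KEYS = ("Title", "Author", "Subject", "Keywords")


def pdf_text_string(value: str) -> str:
    """Encode a Python string as a PDF string object."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII text: UTF-16BE with a byte order mark, written as a hex string.
        return "<" + ("\ufeff" + value).encode("utf-16-be").hex().upper() + ">"
    # ASCII text: a literal string with backslashes and parentheses escaped.
    escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def build_info_dictionary(fields: Mapping[str, str]) -> str:
    """Return the dictionary text, refusing to proceed if a header is empty."""
    missing = [key for key in STANDARD_KEYS if not fields.get(key)]
    if missing:
        raise ValueError("missing standard headers: " + ", ".join(missing))
    entries = " ".join(
        f"/{key} {pdf_text_string(fields[key])}" for key in STANDARD_KEYS
    )
    return f"<< {entries} >>"
```

Raising an error on an empty header is a deliberate choice here: a capture workflow that silently writes blank headers is exactly the problem this tip is meant to prevent.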

3. Build complete custom headers for SharePoint metadata. Advanced conversion software can build custom PDF header information, allowing you to "tag" your documents. The PDF then becomes a redundant container for SharePoint metadata, carrying each column name with its value. This is the ultimate in metadata packing, and creates a truly portable PDF with all pertinent information. Below is an example of custom headers or properties, where invoice number, date, total and vendor are entered:

Figure: PDF Custom Metadata
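For tip 3, here is a small sketch of how column names and values could be turned into custom properties. Column names often contain spaces, which are not allowed in a raw PDF name, so they are escaped as #20. The function names and the sample columns are assumptions of mine, not the output of any particular conversion product, and the values are written as UTF-16 hex strings simply to avoid escaping rules.

```python
from typing import Mapping

# Bytes that cannot appear unescaped in a PDF name: the delimiters and '#'.
SPECIAL_BYTES = set(b"()<>[]{}/%#")


def pdf_name(column: str) -> str:
    """Turn a column name into a PDF name, escaping spaces and delimiters as #xx."""
    parts = []
    for byte in column.encode("utf-8"):
        if 0x21 <= byte <= 0x7E and byte not in SPECIAL_BYTES:
            parts.append(chr(byte))
        else:
            parts.append(f"#{byte:02X}")
    return "/" + "".join(parts)


def pdf_hex_string(value: str) -> str:
    """Encode any value as a UTF-16BE hex string with a byte order mark."""
    return "<" + ("\ufeff" + value).encode("utf-16-be").hex().upper() + ">"


def custom_properties(columns: Mapping[str, str]) -> str:
    """Render column name/value pairs as custom dictionary entries."""
    entries = " ".join(
        f"{pdf_name(name)} {pdf_hex_string(str(value))}"
        for name, value in columns.items()
    )
    return f"<< {entries} >>"
```

For example, pdf_name("Invoice Number") gives /Invoice#20Number. Pick column names that do not collide with the standard headers such as Title or Author.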

4. Always create PDFs that include OCR text. An Optical Character Recognition (OCR) process recognizes the text in the scanned image and stores it in the PDF as searchable text, which SharePoint can then crawl for full-text search. This is a must for all scanned documents.
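If you already have a pile of scanned PDFs, a rough way to find the ones that still need OCR is to look for text-drawing operators inside the file. The sketch below is a heuristic only, built on the Python standard library: it inflates any Flate-compressed streams and looks for a BT ... Tj/TJ ... ET sequence. It can miss text in files that use other compression filters, and the function name has_text_layer is my own.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each PDF stream.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A text object (BT ... ET) that contains a text-showing operator (Tj or TJ).
TEXT_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate-compressed (or not inflatable): inspect the raw bytes.
            data = raw
        if TEXT_RE.search(data):
            return True
    return False
```

A file that returns False is a good candidate to send back through your OCR step.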

Did I miss anything? If so, please leave a comment.

Thursday, April 24, 2014
