PDFlib pCOS (PDF Information Retrieval Tool) lets you retrieve PDF metadata, hypertext, or any other information from a PDF document besides the actual text content. All objects can be accessed through a simple interface, without any low-level programming (pCOS is short for PDFlib Comprehensive Object Syntax). PDFlib pCOS does not access or extract the textual content of the PDF file; this functionality is provided by PDFlib TET (Text Extraction Toolkit).
Here are some key features of "PDFlib pCOS":
· general information: linearized PDF, tagged PDF, encryption details and permission settings, number of pages and fonts
· document info entries and XMP metadata
· all fonts with their name, embedding status, etc.
· target URLs and coordinates of Web links
· create a table of contents by extracting all bookmarks along with the corresponding page number
· form field data: full field names, contents, position, etc.
· page size, CropBox, page rotation
· status of PDF/X compliant files
· list or extract file attachments
· layer names
· annotation details
· list all comments along with the reviewer’s name
· digital signature details: names of signature fields, signed or unsigned status, name of signer, date and reason of signature
· extract ICC output intent profiles from PDF/X or PDF/A files
· list PDFlib block properties
There are many everyday pCOS applications for PDF practitioners, but you can also use PDFlib pCOS as a tool for learning or debugging PDF. Here are some typical scenarios:
· check incoming documents for predefined criteria
· identify problem files in a large collection
· create property summaries for document management
· quality assurance before publishing documents
· document retrieval and repository workflows
· learn details of PDF data structures
Supported PDF Input
PDFlib pCOS supports all relevant flavors of PDF input:
· all PDF versions up to PDF 1.6 (Acrobat 7)
· encrypted PDF with 40- and 128-bit encryption
· sophisticated security model: even if you don’t know the password you can query certain pieces of information as long as this doesn’t violate the document author’s intentions
PDFlib pCOS can create output for different purposes:
· plain text output
· tabular format for processing with a spreadsheet or database
· user-defined output formats for custom post-processing
· binary data (e.g. ICC profiles, file attachments) can be extracted to file without any modification
· Unicode text can be created in UTF-8 and UTF-16 formats
Since PDFlib pCOS can process multiple PDF documents with a single call you can easily create summaries of document info entries, page formats, fonts or any other property. Combined with tabular output this provides a powerful PDF administration tool.
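The summary idea above can be sketched in a few lines of Python. This is only an illustration of turning per-document properties into a tabular format. It does not call PDFlib pCOS; the property names, file names, and values in `DOCUMENTS` are made up for the example, and `summarize` is a hypothetical helper.

```python
import csv
import io

# Hypothetical per-document properties (made-up values, not real pCOS output).
DOCUMENTS = [
    {"file": "invoice.pdf", "pages": 2, "pdfversion": "1.4", "fonts": 3},
    {"file": "manual.pdf", "pages": 120, "pdfversion": "1.6", "fonts": 11},
]


def summarize(documents, columns):
    """Return a CSV table with one row per document and one column per property."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for props in documents:
        # Properties missing from a document become empty cells.
        writer.writerow({column: props.get(column, "") for column in columns})
    return buffer.getvalue()


if __name__ == "__main__":
    print(summarize(DOCUMENTS, ["file", "pages", "pdfversion", "fonts"]), end="")
```

The resulting text can be loaded directly into a spreadsheet or database import, which is the kind of post-processing the tabular output is meant for.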
For accessing and extracting the actual text content of a PDF document, please use PDFlib TET (Text Extraction Toolkit).
pCOS Command-Line Tool and pCOS Library
PDFlib pCOS is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both support the full pCOS path syntax, but they suit different deployment scenarios. Here are some guidelines for choosing between the two flavors:
· The pCOS command-line tool is suited for batch processing PDF documents. It doesn’t require any programming, but offers convenient command-line options which can be used to integrate it into arbitrary workflows. Many options provide shortcuts to commonly requested functionality.
· The pCOS programming library can be used for integration into your desktop or server application. With the pCOS library you can apply custom processing logic to the information in a PDF.
Features specific to the pCOS Library
· Language bindings for use with C, C++, COM, .NET, and Java
· Read documents directly from memory (C language only)
· Programming examples included
Features specific to the pCOS Command-Line Tool
· Simple retrieval of common PDF elements, such as bookmarks, annotations, metadata, form fields, etc.
· Extended mode for querying more complex objects and customizing the output format
· Emit information as comma-separated values or a user-defined format for import into a spreadsheet or database
· Recursion feature for dumping composite PDF objects, such as dictionaries and arrays
pCOS Paths: a Simple Syntax for PDF Objects
PDFlib pCOS supports a simple path syntax which can be used to address arbitrary objects within a PDF document. While the pCOS syntax is closely modelled on the PDF object structure, it offers convenient shortcuts for accessing commonly used objects, such as pages, fonts, bookmarks, form fields, etc. Instead of getting bogged down in complex tree structures (e.g. bookmarks or form fields) you can easily access them using pCOS pseudo objects. pCOS paths for many common PDF objects are described in the documentation.
pCOS paths are required to use the library, but they are also supported in the pCOS command-line tool.
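To give a feel for path-based addressing, here is a toy Python sketch that resolves simplified paths against nested dictionaries and lists. It is not the pCOS implementation and does not use the pCOS library. The path shapes (`pages[1]/width`, a `length:` prefix), the `resolve` helper, and the sample data are assumptions chosen for illustration; the pCOS documentation is the authority on the actual syntax.

```python
import re

# Toy stand-in for a parsed PDF: nested dicts and lists with made-up values.
SAMPLE_DOC = {
    "pages": [
        {"width": 595.0, "height": 842.0},
        {"width": 612.0, "height": 792.0},
    ],
    "fonts": [
        {"name": "Helvetica", "embedded": False},
    ],
}

# A segment is a key, optionally followed by a numeric index in brackets.
_SEGMENT = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def resolve(doc, path):
    """Resolve a simplified path such as 'pages[1]/width' or 'length:pages'."""
    want_length = path.startswith("length:")
    if want_length:
        path = path[len("length:"):]
    node = doc
    for segment in path.strip("/").split("/"):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"malformed path segment: {segment!r}")
        key, index = match.groups()
        node = node[key]  # descend into a dictionary entry
        if index is not None:
            node = node[int(index)]  # then pick an array element
    return len(node) if want_length else node


if __name__ == "__main__":
    for path in ("length:pages", "pages[1]/width", "fonts[0]/name"):
        print(path, "=", resolve(SAMPLE_DOC, path))
```

The point of the sketch is the design choice: a single string names an object anywhere in the structure, so the caller never has to walk dictionaries and arrays by hand.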
Programming and Performance
PDFlib pCOS has been developed with portability, performance, and robustness in mind. pCOS is thread-safe for deployment in multi-threaded server applications. The core library is written in highly optimized C code for maximum performance and small overhead. Additional language bindings are available for common development environments.