PDF files may be used to trigger malicious content. pdfid_PL is a modified version of pdfid.py, the Python tool written by Didier Stevens to analyze and sanitize PDF files.
Developer comments
Here is a version that I have slightly modified so that it can be imported as a module in Python applications (originally for ExeFilter).
Modifications
The modified version is named pdfid_PL.py. The main differences from the original tool are in the PDFiD function:
def PDFiD(file, allNames=False, extraData=False, disarm=False, force=False,
          output_file=None, raise_exceptions=False, return_cleaned=False,
          active_keywords=ACTIVE_KEYWORDS):
The following parameters have been added:
* output_file: path of output file to be created.
* raise_exceptions: raise an exception when a parsing error happens, instead of ignoring it.
* return_cleaned: return a tuple (xmlDoc, cleaned), where cleaned=True if the PDF contained active content which has been cleaned.
* active_keywords: list of PDF tags to be disabled. Default value: ('/JS', '/JavaScript', '/AA', '/OpenAction', '/JBIG2Decode', '/RichMedia', '/Launch')
All these parameters are optional, so that pdfid_PL.py runs exactly like the original pdfid.py when they are not set.
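As an illustration of the active_keywords parameter, here is a minimal sketch of building a custom keyword list that extends the documented default. The helper extend_keywords and the constant DEFAULT_ACTIVE_KEYWORDS are hypothetical names introduced for this example, not part of pdfid_PL, and '/EmbeddedFile' is only an illustrative extra PDF name; whether disabling it is appropriate depends on your files.

```python
# Default list as documented above, copied here so the sketch is self-contained.
# (In pdfid_PL itself the default is the module-level ACTIVE_KEYWORDS.)
DEFAULT_ACTIVE_KEYWORDS = ('/JS', '/JavaScript', '/AA', '/OpenAction',
                           '/JBIG2Decode', '/RichMedia', '/Launch')


def extend_keywords(extra, defaults=DEFAULT_ACTIVE_KEYWORDS):
    """Return the defaults plus any names from extra that are not already
    present, preserving order. Hypothetical helper, not part of pdfid_PL."""
    result = list(defaults)
    for name in extra:
        if name not in result:
            result.append(name)
    return tuple(result)


# '/JS' is already in the defaults, so only '/EmbeddedFile' is appended.
keywords = extend_keywords(['/EmbeddedFile', '/JS'])

# Assumed usage, which requires pdfid_PL to be importable:
# import pdfid_PL as pdfid
# pdfid.PDFiD('file.pdf', disarm=True, output_file='cleaned.pdf',
#             active_keywords=keywords)
```

Passing a tuple keeps the same type as the documented default value.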
Sample usage
import pdfid_PL as pdfid

# Disarm active content and write the result to cleaned.pdf
xmldoc, cleaned = pdfid.PDFiD('file.pdf', disarm=True, output_file='cleaned.pdf',
                              raise_exceptions=True, return_cleaned=True)
if cleaned:
    print('PDF has been cleaned.')
else:
    print('PDF is clean.')
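Since raise_exceptions=True turns parsing errors into exceptions, an application will probably want to catch them. The following is a minimal sketch of such a wrapper, not part of pdfid_PL. The function sanitize_pdf and the stand-in fake_pdfid are hypothetical names; the wrapper receives the PDFiD function as an argument so that it can be tried without the real module. The exception type raised on a parsing error is not stated here, so the sketch assumes a generic Exception.

```python
def sanitize_pdf(pdfid_func, src, dst):
    """Disarm src into dst using the given PDFiD-like function.

    Returns 'cleaned', 'clean', or 'error: <message>'.
    Hypothetical wrapper, not part of pdfid_PL.
    """
    try:
        xmldoc, cleaned = pdfid_func(src, disarm=True, output_file=dst,
                                     raise_exceptions=True,
                                     return_cleaned=True)
    except Exception as exc:  # assumption: the exact exception type is unknown
        return 'error: %s' % exc
    return 'cleaned' if cleaned else 'clean'


def fake_pdfid(file, **kwargs):
    """Stand-in for pdfid.PDFiD, used only to demonstrate the wrapper."""
    if file.endswith('.bad'):
        raise ValueError('parse failure')
    # Pretend that files whose name starts with 'js_' held active content.
    return ('<xml/>', file.startswith('js_'))


print(sanitize_pdf(fake_pdfid, 'js_report.pdf', 'out.pdf'))
```

With the real module, the call would presumably be sanitize_pdf(pdfid.PDFiD, 'file.pdf', 'cleaned.pdf') after import pdfid_PL as pdfid.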
Requirements:
· Python
What's New in This Release:
· Fixed a bug that happened when using return_cleaned