09 June 2020

Removing malware from PDF's

Sometimes PDF files can contain malware. Although the techniques listed below are not a fool-proof guarantee of getting rid of malware, it can get rid of malware that's embedded in the metadata of PDF files.

First, confirm that the PDF has malware.

Now create the following bash script in a file named malwareRemover.sh:
(make sure you use a tab instead of spaces for the lines within the for loop).

#!/bin/bash
clear
for f in *.pdf; do
    echo "Removing malware from $f:"
    echo "-------------------------------"
    echo ""
    #---convert
    tempFilename="temp.ps"
    echo "Converting $f to $tempFilename"
    pdf2ps $f $tempFilename
    echo "Deleting $f"
    rm $f
    echo "Converting $tempFilename back to $f"
    ps2pdf $tempFilename $f
    echo "Deleting the ps file"
    rm $tempFilename
done


Change the permissions of the file and install ghostscript:
chmod +x malwareRemover.sh
sudo apt-get install ghostscript

Run it with all the PDF files in the same folder as malwareRemover.sh.
./malwareRemover.sh


If you are using Didier Stevens' tools, this script will help:

#!/bin/bash
clear
for f in *.pdf; do  
    echo "Scanning $f for malicious data:"
    echo "-------------------------------"
    echo ""
    #---scan
    python3 pdfid.py $f
done

echo ""
echo "How to interpret the info:"
echo "------------------------------"
echo "* Almost every PDF documents will contain the first 7 words (obj through startxref), and to a lesser extent stream and endstream. I’ve found a couple of PDF documents without xref or trailer, but these are rare (BTW, this is not an indication of a malicious PDF document)."
echo "* /Page gives an indication of the number of pages in the PDF document. Most malicious PDF document have only one page."
echo "* /Encrypt indicates that the PDF document has DRM or needs a password to be read."
echo "* /ObjStm counts the number of object streams. An object stream is a stream object that can contain other objects, and can therefor be used to obfuscate objects (by using different filters)."
echo "* /JS and /JavaScript indicate that the PDF document contains JavaScript. Almost all malicious PDF documents that I’ve found in the wild contain JavaScript (to exploit a JavaScript vulnerability and/or to execute a heap spray). Of course, you can also find JavaScript in PDF documents without malicious intend."
echo "* /AA and /OpenAction indicate an automatic action to be performed when the page/document is viewed. All malicious PDF documents with JavaScript I’ve seen in the wild had an automatic action to launch the JavaScript without user interaction."
echo "* The combination of automatic action  and JavaScript makes a PDF document very suspicious."
echo "* /JBIG2Decode indicates if the PDF document uses JBIG2 compression. This is not necessarily and indication of a malicious PDF document, but requires further investigation."
echo "* /RichMedia is for embedded Flash."
echo "* /Launch counts launch actions."
echo "* /XFA is for XML Forms Architecture."
echo "* A number that appears between parentheses after the counter represents the number of obfuscated occurrences. For example, /JBIG2Decode 1(1) tells you that the PDF document contains the name /JBIG2Decode and that it was obfuscated (using hexcodes, e.g. /JBIG#32Decode)."
echo "* BTW, all the counters can be skewed if the PDF document is saved with incremental updates."
echo ""

 

2 comments:

Tanmay said...

would like to collaborate with you for malware removing and detection sw !

Nav said...

Glad to hear that Tanmay, but what did you have in mind? Is it about writing a Python script that makes the result more human readable? I'm not very knowledgeable in the field of malware detection, so this will require a good amount of learning. I hope you are knowledgeable in this field?