Stripping metadata from pdf files

Sometimes, for example when sending a review of a paper, I do not want the pdf file to contain any metadata. Ideally, the editorial process should take care of this, but I do not want to take any chances. This is how I strip all metadata from my pdf files.

First, let’s see what metadata is generated by a simple ConTeXt file. Opening the file in Adobe Reader and going to File -> Properties gives me

So, I am giving away that the file is produced by ConTeXt. There is more metadata that Adobe Reader does not show by default. To see that, I use pdftk.
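
The exact command is not preserved here. A minimal sketch, assuming pdftk is installed: its `dump_data` operation prints the Info dictionary as `InfoKey`/`InfoValue` pairs along with other document data. The helper name `show_pdf_info` and the file name `review.pdf` are hypothetical.

```shell
# Print only the Info dictionary lines reported by pdftk.
# show_pdf_info is a hypothetical helper name, not part of pdftk.
show_pdf_info() {
  pdftk "$1" dump_data | grep '^Info'
}

# Usage (review.pdf is a placeholder file name):
#   show_pdf_info review.pdf
```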

The file literally contains a “Made by ConTeXt” badge. Given the number of ConTeXt users, this might be more than enough to identify me in my research community. I do not want this information in the pdf file.

Fortunately, stripping this information is easy. I use the following function in my .zshrc file
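
The original function is not reproduced here. One way such a function could look, as a sketch only: dump the metadata with pdftk, blank every `InfoValue` line with sed, and feed the result back through `update_info`. The names `blank_info_values` and `pdfstrip` are illustrative, and the assumption that pdftk drops or empties a key whose value is blank should be verified by running `dump_data` on the output.

```shell
# Blank every InfoValue line in pdftk's dump_data format.
# (Hypothetical helper; reads stdin, writes stdout.)
blank_info_values() {
  sed -e 's/^InfoValue:.*/InfoValue:/'
}

# pdfstrip INPUT OUTPUT
# Write a copy of INPUT whose Info dictionary values have been blanked.
# Assumption: update_info removes or empties keys with a blank value.
pdfstrip() {
  pdftk "$1" dump_data | blank_info_values | pdftk "$1" update_info - output "$2"
}

# Usage (file names are placeholders):
#   pdfstrip review.pdf review-clean.pdf
```

Writing to a separate output file keeps the original intact, so the two can be compared before anything is sent out.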

I was able to use sed, as seen below, to expunge the remaining InfoValue strings that I didn’t want in the output pdf. This is far from ideal, but the pdf renders. I think the PDFID lines are some sort of hash, because they’re not stored in the PDF, so I manually redacted them in this paste.

I don’t understand the pdf spec well enough to know whether the resultant pdf is still valid. A slightly better option would be to first uncompress the pdf (with pdftk’s uncompress option), remove the offending fields, and then compress the pdf again.
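
The uncompress, edit, recompress route could be sketched as follows. This is an illustration under assumptions, not a tested recipe: the function name `strip_uncompressed` is hypothetical, the sed expression is supplied by the caller, and `mktemp -d` is assumed to be available. Changing the length of a string shifts byte offsets inside the file, so replacing text with something of equal length is the more cautious choice.

```shell
# strip_uncompressed INPUT OUTPUT SED_EXPR
# Uncompress the pdf, apply SED_EXPR to the now-readable file,
# then compress the edited copy into OUTPUT.
strip_uncompressed() {
  tmpdir=$(mktemp -d) || return 1
  pdftk "$1" output "$tmpdir/plain.pdf" uncompress &&
    LC_ALL=C sed -e "$3" "$tmpdir/plain.pdf" > "$tmpdir/edited.pdf" &&
    pdftk "$tmpdir/edited.pdf" output "$2" compress
  rc=$?
  rm -rf "$tmpdir"
  return "$rc"
}

# Usage (placeholder names; the replacement has the same length as the match):
#   strip_uncompressed review.pdf review-clean.pdf 's/ConTeXt/XXXXXXX/'
```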

# usable by running ./metadata “” “” from the home directory;
# pdftk is required for the above to work (sudo apt-get install pdftk);

However, I have just found out (through http://www.nsa.gov/ia/_files/app/pdf_risks.pdf) that removing the basic PDF v1.0 metadata (that is, the info dictionary) is not everything. For other types of metadata to be removed, another program has to be used (such as Acrobat Pro). That will have to wait until libre software for this comes out on the Linux side, then.

Just as a follow-up: files created with LibreOffice or similar software seem fine to run through pdftk for metadata removal, since, at the moment, no v1.4 metadata stream is likely to be embedded. For files coming from other sources, though, a more thorough examination is quite justified.
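
As a rough first check for such files, one can look for the textual markers of an XMP packet. This is a sketch under assumptions: the helper name `has_xmp` is hypothetical, `grep -a` (treat binary as text) is assumed to be supported, and a metadata stream that is stored compressed would not be found this way.

```shell
# has_xmp FILE
# Succeed if FILE contains a plain-text XMP packet marker.
# A compressed metadata stream is NOT detected by this check.
has_xmp() {
  LC_ALL=C grep -a -q -F -e '<?xpacket' -e '<x:xmpmeta' "$1"
}

# Usage (placeholder file name):
#   if has_xmp review.pdf; then echo "XMP metadata found"; fi
```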

I have to follow up once more, because I forgot to link an important document, http://www.nsa.gov/ia/_files/vtechrep/I73_025R_2011.pdf, which can also be found by googling “acrobat x nsa”. It is a useful guide for working with AAPX for the purposes at hand, and towards further human-species growth.