Friday, December 31, 2010

Methodology: To collect information about the PDF files available on my system, the following command was run:

# find / -name "*.[pP][dD][fF]" -type f -exec ./test.sh {} \;

test.sh script:

echo " " >> index.dat
echo "$1" >> index.dat
pdfinfo "$1" >> index.dat

The resulting index.dat was parsed into a PostgreSQL database table with columns for the title, author, creator, producer, keywords, subject, bytes, pages, creation_date, mod_date, page_size, pdf_version, encrypted, optimized and tagged field values that the pdfinfo program returns for each PDF file.
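The parsing step is not shown in the post, so the following is only a minimal sketch of how it might be done. It assumes the index.dat layout produced by test.sh (a blank separator line, the file path, then pdfinfo's "Key: value" lines) and that pdfinfo labels its fields "PDF version", "Pages", "Optimized" and "Title". The function name index_to_tsv and the choice of five output columns are hypothetical, picked only for illustration.

```shell
# Sketch: turn index.dat into tab-separated rows of
#   path, pdf_version, pages, optimized, title
# Assumes each record is: blank line, path line, then "Key: value" lines.
index_to_tsv() {
    awk '
        # Print the record collected so far, then reset the state.
        function emit() {
            if (path != "")
                printf "%s\t%s\t%s\t%s\t%s\n", path, f["PDF version"], f["Pages"], f["Optimized"], f["Title"]
            path = ""
            split("", f)          # portable way to clear an array
        }
        # A whitespace-only line separates records; the next line is the path.
        /^[ \t]*$/ { emit(); expect_path = 1; next }
        expect_path { path = $0; expect_path = 0; next }
        {
            # Split on the first colon only, since values (dates, titles)
            # may contain colons themselves.
            i = index($0, ":")
            if (i > 0) {
                key = substr($0, 1, i - 1)
                val = substr($0, i + 1)
                sub(/^[ \t]+/, "", val)
                f[key] = val
            }
        }
        END { emit() }
    ' "$1"
}
```

Running `index_to_tsv index.dat > index.tsv` would give a file that could then be loaded with PostgreSQL's COPY, whose default text format is tab-delimited. Titles that themselves contain tabs or backslashes would need extra escaping, which this sketch does not attempt.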

PDF Version: More than 3500 files were using version 1.3 or 1.4, both introduced about a decade ago. The file counts for all major versions were as follows:

The PDF file format was published as an open standard (ISO 32000-1) in 2008, and Version 2.0 is in the offing. It may safely be predicted that Version 2.0, or one of the subsequent 2.* releases, will stabilize by 2020. For those wondering about the three old files on my system using version 1.0, one of them is a copy of the "Hacker's Diet", by John Walker, that can still be downloaded from: http://fourmilab.ch/hackdiet/hdpdf.zip The pdfinfo output for this file is as follows:

Many files had titles that were just numbers with "Chapter", "Form", "Annexure", "Volume" or other prefixes, which do not help to identify the file. A title like "Animal Farm by George Orwell" duplicates the Author field. As an example of good style, I could cite the five PDF books authored by David Carlisle (carlisle@cs.man.ac.uk) that are available on the system. The Author name is written consistently across the PDF files, and the titles accurately describe the content:

Information relating to copyright is mostly found only in the content; ideally it should form part of the file metadata too. Many document restrictions cease to apply once the statutory period of copyright lapses, so it would help to have particulars of the copyright owner and the licensing terms and conditions, along with full details of the source of publication.

Optimization: PDF files are either linearized (optimized) or non-linearized (not optimized). Linearized files are optimized for the web: their pages can be viewed without waiting for the whole file to download, which is not the case with non-linearized files. The statistics for optimization were:

Optimization (File count):
false (2934)
true (2015)
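The counts in this post came from the database table, but a similar tally could be taken straight from index.dat. The following is a rough sketch under that assumption; the function name count_field is hypothetical, and the field label passed to it must match what pdfinfo prints (for example "Optimized" or "PDF version").

```shell
# Sketch: count how many files share each value of one pdfinfo field.
#   $1 = field label as printed by pdfinfo, $2 = index file
count_field() {
    awk -v key="$1" '
        # Only lines that begin with "<key>:" are counted.
        index($0, key ":") == 1 {
            val = substr($0, length(key) + 2)
            sub(/^[ \t]+/, "", val)
            n[val]++
        }
        END { for (v in n) printf "%s\t%d\n", v, n[v] }
    ' "$2" | sort
}
```

For example, `count_field "Optimized" index.dat` would list one line per value with its file count. pdfinfo itself reports this field as yes or no; the true and false shown above presumably reflect how the value was stored in the database.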

Content: Of course, pdfinfo doesn't help here; one has to read the file to judge content. 500+ files were from www.arvindguptatoys.com and 100+ were from www.gandhiserve.org. I recommend both sites for useful reading :)