PDF metadata – why so poor?

Why is it so difficult to identify academic publisher PDFs?

With published MP3 files of audio you get rather good metadata. Take for example an MP3 file I downloaded from Hacker Public Radio available at the bottom of this post.

The Full Circle Magazine team added value to the content published by embedding clear and relevant metadata within the MP3 so that even years later, and renaming the file – I still know exactly what this file contains without the need to open/listen to it. The metadata standard for MP3 files is called ID3. I haven’t shown it but there’s even a nice little picture embedded as metadata. Rich, valuable metadata is a good thing to have and easy to provide.

What then for PDF files?

They have embedded metadata too. On *nix machines you can use the CLI tool pdfinfo to show the metadata of your PDF files. I read that the metadata embedded in PDFs is called XMP (Adobe's Extensible Metadata Platform).
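For illustration, here's a minimal sketch of pulling classic document-information fields (Title, Author, Keywords) out of raw PDF bytes with a regular expression — a crude stand-in for what pdfinfo does properly. The tiny PDF fragment and its values are invented for the example; real PDFs also carry XMP packets in XML alongside this older Info dictionary.

```python
import re

# An invented fragment of a PDF document-information dictionary,
# of the kind pdfinfo reads and reports.
pdf_bytes = b"""
1 0 obj
<< /Title (A survey of publisher metadata)
   /Author (R. Mounce)
   /Keywords (PDF; metadata; XMP) >>
endobj
"""

def info_field(data: bytes, name: bytes) -> str:
    """Extract a literal-string value like /Title (...) from raw PDF bytes."""
    m = re.search(rb"/" + name + rb"\s*\(([^)]*)\)", data)
    return m.group(1).decode("latin-1") if m else ""

print(info_field(pdf_bytes, b"Title"))   # A survey of publisher metadata
print(info_field(pdf_bytes, b"Author"))  # R. Mounce
```

A real tool has to cope with escaped parentheses, hex strings, and compressed object streams, which is why pdfinfo (built on Poppler) is the safer bet.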

Is there an XMP standard / schema for published academic works in the PDF format? The only reason I ask is when I look at the Version of Record (VoR) files from many different publishers – the metadata I find is so variable in quality! No wonder Mendeley has difficulty identifying all the PDFs I feed it.

Below are results from a little preliminary survey of academic publisher PDF metadata I’ve done, with the supporting data uploaded to figshare here if you’re interested…

Sample population: 21 different academic publishers, ~3 Version of Record PDFs per publisher, mostly all published in the year 2011.

You’ll notice I’m mixing large publishers (Elsevier, Wiley, Springer…) with tiny society/institution published PDFs. Open Access and Toll Access (TA) publishers both represented. The results are rather interesting…

I gathered data on 11 metadata fields: PDF Version, Optimized?, File Size (bytes), Page Size, Producer, Creator, Title, Subject, Author, Pages, and Keywords.

Results

The results are a real ragbag. Out of the 70 PDFs I've published (meta)data on over at Figshare, only 8 had Keywords metadata embedded in them. So take a bow Arthropod Systematics & Phylogeny, Frontiers in Neuroscience, and Geological Magazine (Cambridge Journals Online) for those.

55% of them were not Optimized. Among those not optimized were Science (AAAS), Insect Systematics & Evolution (BRILL), Psyche (Hindawi), Acta Palaeontologica Polonica, Proceedings of the National Academy of Sciences, Zookeys (Pensoft) and more. It's hard to say whether PDF optimization is a good thing or not; it depends on your POV, I suppose. If anyone has any strong preferences either way please do comment.

PDF version: rare praise for Elsevier here – they appear to be one of only two publishers (SAGE being the other) in my sample that actually publish PDFs to the latest standard (1.7), which incidentally has been around since 2008! I'm no PDF guru though, so I don't know if this actually entails any benefits or if there's much difference between the different standards. The average joe probably wouldn't notice the difference.

Curiously, although two of the sampled Royal Society PDFs published in 2011 were version 1.4, a third was a version 1.2 PDF – inconsistent and odd! Most PDFs (28), as the pie chart shows, were version 1.4.

Page size is entertainingly variable too: Geological Magazine, Acta Palaeontologica Polonica, Proceedings of the Royal Society B: Biological Sciences, Zootaxa, and Arthropod Systematics & Phylogeny all go for 595 x 842 pts (A4). Science, Journal of Vertebrate Paleontology and Canadian Journal of Earth Sciences use 612 x 792 pts (letter). The rest use an odd variety of sizes. Pensoft’s choice of 467.717 x 680.315 pts looks small in comparison to the rest, I wonder what the rationale behind that choice was?

Author: Only just >50% of the sampled PDFs embedded author metadata. Geological Magazine, Invertebrate Systematics, Zootaxa, Arthropod Systematics & Phylogeny and Systematics and Biodiversity can all take credit for supplying full author data for each and every author on the author list of each PDF.

Others, like Nature, Molecular Phylogenetics and Evolution (Elsevier), Frontiers in Neuroscience, and Psyche (Hindawi), only acknowledge the first author of each PDF. The latter at least has the decency to acknowledge this with an embedded “{et al.}”.

Subject: bit of an odd field this one. Some publishers put the title of the journal in this field e.g. Psyche (Hindawi) and Nature (NPG). Whilst most of the others sampled had no metadata for this field, Frontiers in Neuroscience interestingly used this field for the first sentence of the abstract!

Creator: the data given here and in the next metadata item (Producer) shines a light on how these PDFs were created.

It's interesting that Nature in 2011 were using the oldest version of Acrobat Distiller; it's perhaps understandable that they value stability over updates. In both Producer and Creator metadata it seems that NRC Research Press (as represented by Canadian Journal of Earth Sciences) had the 'newest', most bleeding-edge PDF software setup in 2011.

Discussion

Clearly, as with MP3s, there's a need for good, rich metadata to identify the millions of different files out there. Publishers could provide this, and as I've shown, some do.

Is there any agreed metadata standard for STM-published PDFs? If there is, what is it? If not, I think there should be one.

I for one would like richer metadata in 2013 so that PDFs can be more easily identified in a machine-readable way – not even Mendeley can cope with all the PDFs I throw at it – my library is a mess.

Given we live in an increasingly mixed world of Open Access and Closed Access publications, with content mining applications on the rise, it seems obvious that these PDFs in particular need a Copyright and/or licensing metadata field (as there is in MP3 metadata), to help indicate clearly what can and cannot be done with each PDF.
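As a sketch of what that could look like, here is a licensing statement expressed in a PDF's XMP packet using the Dublin Core and XMP Rights Management schemas (the rights text and licence URL are invented example values):

```xml
<rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
  <dc:rights>
    <rdf:Alt>
      <rdf:li xml:lang="x-default">© 2011 The Authors. CC BY 3.0</rdf:li>
    </rdf:Alt>
  </dc:rights>
  <xmpRights:Marked>True</xmpRights:Marked>
  <xmpRights:WebStatement>http://creativecommons.org/licenses/by/3.0/</xmpRights:WebStatement>
</rdf:Description>
```

The fields already exist in the XMP specification; the problem is simply that publishers don't fill them in.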

Related

Chris Rusbridge

Very interesting, Ross. I checked to see how IJDC did on some of this, just by looking at the Properties in Adobe Reader. They do include the title and author name (although the two authors are included in quotes, with no usefully standardised name form), and Subject is the journal name.

One other field worth checking in your data would be whether it is a tagged PDF. This would reduce the amount of arbitrary text placement on the page, supporting text readers for poorly sighted readers and also supporting text mining a bit better.

Thanks for the comment. I can check if they're tagged very easily with just a simple grep command. Only 3 of them report as being tagged: 2 (out of 3) of the 'Frontiers In' PDFs, plus 1 (of the 3) Zootaxa PDFs. Not good…
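For anyone wanting to reproduce that check, here is a rough Python equivalent of the grep (the file contents below are invented stand-ins for real PDFs; a tagged PDF carries a /MarkInfo dictionary with /Marked true in its catalog):

```python
# Crude equivalent of grepping a PDF's raw bytes for the tagged-PDF marker.
def is_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: a tagged PDF declares /MarkInfo << /Marked true >>."""
    return b"/MarkInfo" in pdf_bytes and b"/Marked true" in pdf_bytes

# Invented fragments standing in for real files:
tagged   = b"%PDF-1.4 ... /MarkInfo << /Marked true >> ..."
untagged = b"%PDF-1.4 ... << /Type /Catalog >> ..."

print(is_tagged(tagged))    # True
print(is_tagged(untagged))  # False
```

This only works because the marker usually sits in an uncompressed part of the file; pdfinfo's "Tagged:" line is the more reliable check.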

The other metadata fields that were available but I haven’t reported were:

CreationDate:

ModDate:

Form: (some used AcroForm)

Encrypted: (some older ones use this option to some extent)

Page rot: (no idea what this is!)

Perhaps I should also add these to the figshare dataset? Just in case people find them interesting/useful. The Zootaxa CreationDates were also consistently odd: “Sat Dec 25 17:51:40 1999” for 2011 published PDFs!

Peter Cock

Surely “Page rot.” is page rotation? Although most journal PDFs would be portrait, a few special pages might be presented in landscape (e.g. full page figures or tables), so hardly of interest in terms of metadata about the paper authors etc. I’d be far more excited about explicit markup of DOI or ISBN numbers in the metadata (far more robust than the text searching approach one must currently use to find them). We can hope, right?

The tech for embedding and accessing metadata in PDFs has been around for ages, but has had almost no take-up from publishers. As you say, what can be extracted now is mostly unreliable guff; usually some kind of default setting that no one has bothered to tweak. It's even possible to include semantic markup in PDFs – but no one does. Our attempt at a tool for reverse-engineering the metadata and structure of scientific articles is at http://pdfx.cs.man.ac.uk (a rather different emphasis to Peter Murray-Rust's AMI2, http://blogs.ch.cam.ac.uk/pmr/2012/10/15/opencontentmining-starting-a-community-project-and-introducing-ami2/ – we are after the overall structure and metadata, and less concerned about individual glyphs and units than AMI2).

It works reasonably well, but still, it’s frustrating to have to go to such lengths when the information could be easily put there in the first place by publishers.

I can definitely see some examples of nonsensical default settings unchanged between PDFs. It felt a bit mean to point this out in the main post but as an example for all Neotropical Ichthyology PDFs the ‘author’ metadata is given as “Malabarba” and the ‘subject’ and ‘title’ are both “Artigo01P127-42”. Oops…

I shall spend today using this and see what other data I can extract. Elsevier seems to have much richer metadata using this tool than what I could previously see using just pdfinfo. I’ll blog what I find tomorrow but the message is still basically the same – not all published PDFs have adequate or full metadata.

@rmounce It might be worth asking “why would publishers add metadata?” In other words, what’s in it for them? In the case of podcasts (such as the MP3 file you gave as an example) there is a strong incentive to embed high quality metadata because tools like iTunes make use of that metadata (both for discovery and display). There’s not really an equivalent market place for PDFs. Tools like Mendeley and Papers resemble iTunes, but they aren’t market places, they are personal tools. If publishers saw a clear commercial reason for embedding rich metadata they would. But I suspect their current focus is more on developing journal or publisher-specific apps. The idea of a market place for PDFs is probably not high on their want list given the experience of music publishers who took a while to realise that they’d lost control of music sales to Apple.

From my limited experience with French scientists in the arts & humanities, they simply don't know what "metadata" is (in MS Word, or when they save or print to PDF). Why is metadata important? For exchange between systems (like OAI-PMH), not for humans. Publishers in research disseminate PDF content and charge for access to the PDF file through captive web portals; metadata is treated like a poor relation. So in 2010 we launched a project to improve metadata values for researchers' needs, called Isidore (http://rechercheisidore.fr and in English http://www.tge-adonis.fr/article/isidore-going-strenght-strength), and after 2 years we are seeing steady improvement.

I wouldn’t worry too much about the PDF versions not being the latest. As far as I can tell, it is fairly common practice to set the PDF version only to that which is required, based on the features used in the document. i.e. if a document only uses PDF 1.4 features, then the output is made to conform to the 1.4 version of the spec. That some are using PDF 1.7 makes me suspect that they are actually PDF/A-2 documents, which is based on PDF 1.7. The XMP metadata will tell you if this is the case.

The difference between MP3 files and PDF files is that the data in an MP3 file only contains audio, whereas the data in a PDF file already contains the title, author and (usually) publication date, which is all that’s really needed to identify the document. As long as heuristics can identify that information (which they generally can, though it’s not always easy), then the XMP is redundant for most purposes.

MP3 files also already contain the author(s) (their voices) and usually the 'title' (either in the introduction of the podcast, or perhaps the chorus of the song) of the work in them; if you ran some really complicated acoustic analyses you could probably extract these with machine methods.

In that sense PDF files are the same: the desired metadata is plainly there in the content for all to see (in the human-ocular sense), but can be variously easy, tricky, or hard to cleanly and accurately extract with automated machine methods.

Of course, I'm not pretending it wouldn't be 100 times harder to get that metadata from an audio stream than from text (in the case of PDFs). Perhaps it's *because* it's so hard to extract from audio that audio metadata is so commonly and routinely embedded. STM publishers need to recognise that whilst it's easy for humans to see/read the author & title of many PDFs, it's not so easy for computers to do so across thousands of different publishers and millions of PDFs. The system doesn't scale. It needs fixing.

I know some rappers mention themselves a lot, but I think saying that extracting an artist’s name from the audio of an MP3 file is the same as extracting an author name from a PDF file is going a bit far :-)

The latter is perfectly do-able (though it can’t hurt to have the XMP there for reassurance). As for the system not scaling, it seems to work well enough for Google Scholar – their inclusion guidelines for PDFs ask only that “the title of the paper appears in a large font on top of the first page, the authors of the paper are listed right below the title on a separate line”. [http://scholar.google.com/intl/en-gb/scholar/inclusion.html]
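That layout convention is exactly what makes a simple heuristic workable. A toy sketch of it (the page lines, font sizes, and names below are invented; real extraction would pull text runs with font sizes from the PDF's content stream):

```python
# Toy title/author heuristic based on the Google Scholar layout convention:
# the title is the largest-font line on page one, and the authors sit on
# the line immediately below it.

def guess_title_and_authors(lines):
    """lines: list of (font_size_pts, text) tuples, in reading order."""
    title_idx = max(range(len(lines)), key=lambda i: lines[i][0])
    title = lines[title_idx][1]
    authors = lines[title_idx + 1][1] if title_idx + 1 < len(lines) else ""
    return title, authors

# Invented first-page text runs:
page_one = [
    (10.0, "Zootaxa 2958: 1-14 (2011)"),
    (18.0, "A new species of imaginary beetle"),
    (11.0, "A. N. Author and S. Econd-Author"),
    (9.0,  "Abstract. We describe..."),
]

title, authors = guess_title_and_authors(page_one)
print(title)    # A new species of imaginary beetle
print(authors)  # A. N. Author and S. Econd-Author
```

It also shows why the heuristic breaks when a journal's title isn't the largest line, or the author line is separated from it.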

“the title of the paper appears in a large font on top of the first page, the authors of the paper are listed right below the title on a separate line”

Interesting… thanks for pointing this out. I bet if I search hard enough I can find examples of journals that *don’t* fulfil these criteria e.g. no larger font for title, or author names not immediately below the title. I wonder how stringently one has to adhere to these criteria? I also bet these journals will have poor embedded metadata.

As for identifying the authors of songs – I was thinking of computational analysis of timbre, pitch, flow… that kind of thing – they need not say their own name to be identified on the basis of comparison with a database of previous vocal recordings. But granted, that is a 1000-fold harder than a simple textual extraction of an author name from a PDF.

Guest

They are only guidelines, so yes, there will almost certainly be some PDF layouts that don’t fit that pattern. In those cases, though, the publisher will probably have an HTML page for each article that contains all the article metadata in a machine-readable form, which will get matched up with the PDF that it links to.

Totally agree, this has been a long-time frustration for me. The easiest fix would be if everyone included some standard semantic markup, like BibTeX or BibJSON or some such; trying to come up with all the different fields from scratch is hard, and there are already a ton of programs which handle BibTeX, for example. JabRef does this (writes BibTeX into PDFs). One place to start is public archives, such as arxiv.org, PLoS, DSpace/EPrints, OJS etc. If they all started doing it, and citation software all took advantage of it, hopefully commercial publishers would have to keep up.
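For instance, the kind of record JabRef can write into a PDF is just an ordinary BibTeX entry (this one is invented, including the DOI, purely to show the shape):

```bibtex
@article{author2011beetle,
  author  = {Author, A. N. and Econd-Author, S.},
  title   = {A new species of imaginary beetle},
  journal = {Zootaxa},
  year    = {2011},
  volume  = {2958},
  pages   = {1--14},
  doi     = {10.0000/example}
}
```

Embedding something this simple would make every reference manager's import step trivial.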