PDF metadata: different tool, same story

So a week ago I investigated publisher-produced Version of Record PDFs with pdfinfo, and the results were very disappointing: much of the metadata was missing, and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.

But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation. So, at his suggestion, I’ve tried again to extract metadata from the exact same set of PDFs as last time…

This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point – reproducible research and all. I’d love to post the corresponding PDFs too, but sadly many of them are not Open Access, which prevents me from uploading them to a public space. **Insert timely comment here about how closed access publications stifle effective research practices…**

Exiftool is really simple to use. You just need to type:

```shell
exiftool NameOfPDF.pdf
```

to get a human-readable, exhaustive output of all available metadata, and:

```shell
exiftool -b -XMP NameOfPDF.pdf
```

to get the XML-structured XMP metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.

Finally, if you want to test a whole bunch of PDF files at once, I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I’m sure many readers will be able to create a simple bash loop themselves but just for those that don’t…]
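For those who don’t want to download the script, the idea is just this (a sketch of my approach, not the script verbatim; the helper name and output filenames are my own choices, and it assumes exiftool is on your PATH):

```shell
# dump_pdf_metadata DIR: for every PDF in DIR (default: current directory),
# write exiftool's full human-readable report to NAME_metadata.txt and the
# raw XMP packet to NAME_XMP.xml.
dump_pdf_metadata() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue          # no PDFs present: nothing to do
        exiftool "$f" > "${f%.pdf}_metadata.txt"
        exiftool -b -XMP "$f" > "${f%.pdf}_XMP.xml"
    done
}

# Usage: dump_pdf_metadata /path/to/pdfs
```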

I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is that they have no embedded XMP metadata – an empty (zero-byte) file is created for each of these. This is an assumption, though… I notice that those 13 correspond exactly to the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata; it’s just a question of whether the publishers have bothered to apply this functionality.
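My assumption is easy to test on any single file by checking whether the XMP dump comes back empty – something like this sketch (the helper name is mine; it assumes exiftool is on your PATH):

```shell
# has_xmp FILE: succeed only if exiftool can pull a non-empty XMP packet
# out of FILE (an empty result is what I'm taking to mean "no embedded XMP").
has_xmp() {
    [ -n "$(exiftool -b -XMP "$1" 2>/dev/null)" ]
}

# Usage:
#   if has_xmp paper.pdf; then echo "XMP present"; else echo "no XMP"; fi
```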

So if I’m right, neither Taylor & Francis, BRILL, nor Acta Palaeontologica Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool, for whatever reason, can’t read/parse it from iText-produced PDFs. I find this alternative explanation unlikely, though, tbh.

Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside, the metadata is still very poor, so I think my conclusions from last week’s post still stand.

Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.

Further angles to explore

Interestingly, Brian Kelly has taken this in a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this, but apparently some institutional repositories (IRs) routinely add cover pages to deposits. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata. Not to mention that cover pages are completely unnecessary: all the information on a cover page is exactly the kind of thing that should go in the embedded metadata! No need to waste time and space by putting that info on the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.

After some excellent chat on Twitter about this IR angle, I’ve discovered that UKOLN, based here on campus at Bath, have also done some interesting research in this area – in particular the FixRep project, which is described in more detail here. The CrossRef Labs pdfmark tool also looks like something of interest for fixing PDFs with poor-quality metadata. I’ve got it installed/compiled from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before-and-after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible, IMO.

Why would publishers add metadata?

Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) – should demand it. A pipe dream perhaps, but that’s my $.02. I would ask for a refund if I downloaded MP3s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?

PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.

Comments

@rmounce “Why would publishers add metadata? Because their customers – libraries, governments, research funders (in the case of Open Access PDFs ) should demand it.” I’m not seeing a compelling business case here. High-quality metadata would be nice, but can anybody argue that their research is being hampered by a lack of such metadata? Could someone working in publishing make a case to their boss that adding such metadata would generate more revenue, web traffic, manuscript submissions (insert whatever metric matters)?

You ask “I would ask for a refund if I downloaded MP3s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?” You may buy MP3s, but I suspect the vast majority of people don’t buy PDFs (have you bought an article PDF as an individual?). You aren’t the customer; university libraries (and others with large budgets) are the customers. If you’re not the customer then your wishes aren’t going to matter a great deal. Get your credit card out and things may change ;)

I guess it’s just so obvious to me why metadata is vitally important I forget to state it sometimes.

Science is digital these days. Researchers get their articles, mostly as PDFs from publishers, and store these on their computers. Storing, arranging and retrieving these PDFs from one’s own personal library of thousands of such files is no trivial task. Sophisticated programs like Papers, Mendeley, Zotero, EndNote, Paperpile, ColWiz… help, but since the metadata provided with PDFs is so poor (independently confirmed by Victor @ Mendeley, btw, here: https://twitter.com/mendeley_com/status/288286004010442752 ) they often can’t get 100% correct metadata for each PDF. Therefore every academic I know periodically spends (wastes) time arranging, filing and adding metadata to their personal library.

@rmounce Perhaps I’m being a bit obtuse, but what I’m getting at is that the reasons why you want something may be irrelevant to a publisher. I understand why you’d like rich metadata in PDFs, they help solve problems you face. But how does solving your problems help the publisher? I guess I’m looking for the incentive for them. If one exists, they are more likely to produce the kind of metadata you are after. Organising your PDFs has no benefit for a publisher, nor does data mining (unless the publisher can monetize the results of the mining, which is where I suspect Elsevier are heading). If you were a publisher why would you embed metadata – and “because it’s the right thing to do” isn’t an answer ;)

Regarding the issue of organising PDFs (the fabled “iTunes of papers”), there are several ways this could be tackled. Embedded metadata is one, another is a Gracenote-style solution where, say, you submit a sha1 hash of the PDF and you get back the corresponding metadata. Mendeley could offer something like this if their API could be searched by sha1 (they store sha1 signatures for PDFs that users uploaded). One wrinkle is that some publishers generate unique PDFs with each download, so each PDF will have a unique signature.
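The fingerprint half of that idea is trivial to sketch in shell (the metadata service in the usage comment is purely hypothetical; the function name is mine):

```shell
# pdf_sha1 FILE: print the SHA-1 fingerprint of FILE, suitable as a lookup
# key into a Gracenote-style metadata service.
pdf_sha1() {
    # sha1sum on most Linux systems; shasum (perl) elsewhere, e.g. macOS
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum "$1" | cut -d' ' -f1
    else
        shasum -a 1 "$1" | cut -d' ' -f1
    fi
}

# Hypothetical usage against an imaginary metadata API:
#   curl "https://api.example.org/metadata?sha1=$(pdf_sha1 paper.pdf)"
```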

bobcorrigan

Ah, the dilemma of product development: having no customers is a setback, having one customer is a disaster.

Mike Taylor

Rod, if your argument is just that predatory publishers don’t give a shit about researchers, then I guess I can’t argue with that. But with all the we’re-your-friends rhetoric we keep hearing, it doesn’t seem unreasonable to me that publishers who do give a shit should do their damn jobs properly.

You say that “because it’s the right thing to do” isn’t an answer. I hope publishers are better than that — they certainly tell us they are, repeatedly. As for those that really aren’t — well, we have plenty of case history on what happens to companies that don’t care about their customers.

@MikeTaylor I’m not saying publishers “don’t give a shit”, I just imagine a meeting where there’s a bunch of things publishers are thinking about doing next. Where would “embedding full publication metadata using XMP” fall in that list? Publishers are grappling with a changing landscape, such as the rise of Open Access publishing models, mobile (e.g., do we design web sites for mobile, or develop apps?), increasing demands for alternative metrics beyond impact factor, and so on. Any change in practice has costs, so I’m looking for reasons why publishers would want to add XMP. In other words, if you could sit down with a publisher and wanted to convince them that they just had to add full metadata (say in XMP), what would you say? Is this the number one thing you’d want them to do (as opposed, say, to digitising their back catalogue, adding ePub as an output format, assigning DOIs if they don’t have them, or <insert other thing we’d like>).

In addition, the XMP toolkit from Adobe is quite shit. exiftool is OK, but not the most amazing tool for a production system. As it happens, I’m going to be visiting our typesetter in India in February, and I’ll be able to learn first-hand how they do XMP embedding into our PDFs. I’ll write up my notes when I get back.

It would be good if you could also report on how figures are handled during this process. PMR and I are very keen on seeing that vector formats *remain* vectors and are kept largely intact from submission – no transmogrification into rasters please!

It’s nice to know there’s someone willing to lift the lid on these otherwise secretive black-box processes that go-on behind the scenes at publishers :)

Mike Taylor

The thing is, Rod, doing this right is trivial. It’s just a matter of taking ten minutes to fix the pipeline. Whereas the other things you mention — digitising a back-catalogue for example — are very significant undertakings.

So adding proper metadata shouldn’t even be on the agenda for your hypothetical meeting. They should just do it.

@MikeTaylor I’m not sure this is trivial. Publishers will have different pipelines, they may not have much control over fine details of production (it may be contracted out, the contract might require re-negotiating or additional charges if changed, etc.). This thread desperately needs input from people involved in the publishing process, which is why I look forward to reading what @IanMulvany learns after visiting their typesetter. I don’t think we’re in a position to simply say “they should just do it” if we don’t understand the production process. I’m not claiming any special insight, but my experience editing Systematic Biology was eye opening. When we deployed the Manuscript Central editorial system we had conference calls where pretty much any change to the system required the publisher talking to Scholar One, and any change of substance was treated as a billable item.

Mike Taylor

This is why we desperately need new publishers (PLOS, eLife, PeerJ). Even leaving aside the business ethics of the old ones, the whole approach is so desperately mired in inefficiencies that a technically trivial change like this becomes organisationally hard. For every $1 you pay these corporations, 1¢ goes on actually getting the work done and 99¢ on managing the process around it. I bet we’ll hear no whinging from PeerJ about how hard it is to run `set-pdftag author "Michael P. Taylor"`.

My view is that most publishers, and their suppliers, have got themselves into a corner by using inappropriate tools for XML generation, PDF generation, etc. And I think they are a bit embarrassed to discuss these points in public! It is the use of these “broken” and hugely expensive systems and tools that makes it so hard to embed XMP – which is a no-brainer to any academic.

It’s not hard – publishers all demand XML-first typesetting, which I understand to mean fully automated creation of PDF from XML. This is what we do every day and have done for a decade. The XMP is in the XML and gets embedded automatically in the PDF. Job done. Where it gets hard is when the PDF is made first in a desktop publishing package, then XML is generated by some complex means, and then checked and rechecked to make sure it is right. Adding XMP then becomes an enormous task.
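As a rough sketch of what such an embedded packet contains (the title, author and DOI values here are placeholders; real packets typically carry more namespaces than just Dublin Core and PRISM):

```xml
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Placeholder article title</rdf:li></rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>A. N. Author</rdf:li></rdf:Seq>
      </dc:creator>
      <prism:doi>10.1234/placeholder</prism:doi>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
```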

So the ones who have trouble are those who are using broken systems. And let’s face it, they really don’t deserve to be around long. ;-)

By the way we are one of the typesetters to Elsevier, so I’m glad their files were better. ;-)

One incentive for OA publishers to do this is that Google does pay attention to XMP data, so it should make one’s material more visible in the long tail of search results. It’s hard to do a verifiable experiment on this, however.

Nice angle. Rich metadata as a tool for increasing discoverability, I like it :)
If all Open Access publishers could put their licencing details in a standardised way in XMP data, perhaps we could one day have effective ways to search for OA-only papers, in the same way that one can search by licence type on Flickr and other places.
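exiftool can already filter on such a tag locally today – a sketch of what that search could look like, assuming publishers used the Creative Commons namespace (`XMP-cc`) that exiftool exposes (the function name is mine):

```shell
# list_cc_licensed DIR: print every PDF under DIR that declares a licence
# in the XMP-cc:License tag, together with the licence URL.
list_cc_licensed() {
    command -v exiftool >/dev/null 2>&1 || return 0   # no exiftool: quiet no-op
    exiftool -q -r -ext pdf -if 'defined $XMP-cc:License' \
             -p '$FilePath: $XMP-cc:License' "$1" 2>/dev/null
}

# Usage: list_cc_licensed ~/papers
```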

Phil Harvey

I would be interested to see why ExifTool doesn’t extract metadata from the iText PDF files. If possible, could you email a sample to me (phil at owl.phy.queensu.ca)? Thanks.

Sent. Many thanks for taking the time to look into this; I’d be very interested to know the results. The Acta Palaeontologica Polonica PDFs are all Open Access and available from here: http://www.app.pan.pl/archives.html so anyone can try this themselves with these papers.

Brilliant. Phil suggested I try `exiftool -a -G` instead, as it tells one exactly where each piece of metadata came from. I’m now sure this is an area in which publishers, commercial and non-commercial alike, could significantly improve their products (publications).
As an example, for the Nagalingum et al. 2011 Science paper, the per-group output again shows that there’s really not much of use in the XMP-provisioned metadata.

I can speak from a publisher’s perspective on why it is important for all parties involved (authors, funders, editors, publishers, data miners, etc.) to want XMP. (Note: I am a co-founder at PeerJ and before that was at Mendeley.)

In one word: branding. More and more researchers use PDF tools to organize and extract metadata from the PDFs that they download. Even those who don’t use those tools are coming across data sets and statistics that make use of the aggregate data from others using these tools (and some of those people are major decision-makers).

When tools built to utilize XMP are unable to properly extract the metadata, due to insufficient XMP in the PDF, at least two negative consequences occur.

First, users are often frustrated with the tool being unable to extract the metadata, but they also become frustrated with the publisher or journal. Imagine a user with 200 articles from journal ‘ABC’ and 200 from journal ‘XYZ’, but only ‘XYZ’ has its metadata properly extracted near 100% of the time. As we move more towards an author-pays model and away from subscriptions, that is an important negative branding experience for journal ‘ABC’. Even if it doesn’t affect an author’s decision on where to publish next, the fact that journal ‘XYZ’ has its metadata shown in full with every interaction will be a positive branding experience. This is why brands like Coca-Cola spend millions/billions on display adverts on- and offline. Constant presence: even if you don’t buy that Coke today, you will tomorrow.

Second, as mentioned above, that metadata is eventually aggregated, either alone in services such as Mendeley, or aggregated even further through a combination of APIs from various services or other means. The catch is that only accurate metadata can be properly aggregated; the rest is either lost or too incomplete to give an accurate count of usage. That too is bad branding, as it not only limits the dissemination of the brand but also reduces the appearance of being a highly read journal (or specific article within that journal). Not good news for the publisher, authors, editor, funders, or, potentially, the reviewers of that article.

There are more reasons, but a less data-driven third reason is that XMP represents just the minimum bar in innovation. If a publisher cannot achieve that minimum standard, then how likely is it that they can be entrusted to improve science going forward?

Your first point particularly resonates with me. Certain journals *really* frustrate me WRT metadata (not just strictly PDFs, but those would help, no doubt). If I can’t easily get 100% correct metadata into Mendeley / Zotero / CiteULike every time for every article, my likelihood of wanting to cite articles in that publication goes down; I’ll go find another, more easily citable paper to cite (not always an option, but increasingly so). Similarly, if I’m disinclined to cite an article just because it’s harder to get accurate metadata / bibliographic data for it, I’ll remember that when it comes to submitting my next manuscript and avoid that difficult-to-cite journal…

Frustration with brands & journals can all too easily occur. http://thecostofknowledge.com/ is perhaps a rather good example of brand damage to think about!

@jasonHoyt @rmounce Just to continue to play Devil’s advocate, I suspect whether the journal embeds XMP metadata or not currently has pretty much zero impact on anybody’s decisions about where they publish or what they cite. And yes, if a tool can’t extract metadata from a PDF I’ll blame the tool rather than the PDF (as unfair as that may be). I say this as someone who embeds XMP metadata in PDFs I generate, see http://iphylo.blogspot.co.uk/2010/04/biostor-gets-pdfs-with-xmp-metadata.html.

I take the point about branding, but I’m unconvinced that XMP has a big part in that. Put another way, Elsevier does great XMP, what impact does that have, if any, on its brand? Nature also supports XMP, but their brand is probably seen as innovative for other reasons (e.g., ePub-based publishing on iOS devices).

Lastly, journals have been experimenting with metadata formats for a long time now, including RSS feeds, Dublin Core and Google Scholar tags in HTML, OAI-PMH harvesting, Medline/PubMed indexing, CrossRef metadata, XMP, OpenURL, etc. It would be interesting to discover which of these had the most impact for publishers and/or for users.

@rdmpage if embedding XMP were the only secret to driving a positive experience for a research publication then it would be quite the easy job :). You’re right that it is just one part of building trust, which can be ruined in a flash, as we’ve seen happen before. That said, there is no excuse for a publisher with dozens, hundreds, or thousands of staff not to be able to include XMP and other metadata tools.

Zack Barkley

Even if the publishers are not responsible enough to do such a simple thing as properly tag their PDFs to make researchers’ lives easier and more productive, we as researchers should at least be able to write our own tags into PDFs so that we can easily sort them in the Windows Explorer browser. This was doable in Windows XP but, sadly, I have worked many hours on this and it just seems impossible with Windows 7/8. Exiftool writes some tags, but not the ones Adobe uses, and neither the exif nor the Adobe tags are visible in Explorer from Vista onwards – even though PDF is the most commonly shared document format and Explorer exposes 288 tags for other things.