Thursday, January 22. 2009

On 22-Jan-09, at 5:18 AM, Francis Jayakanth wrote on the eprints-tech list:

"Till recently, we used to include references for all the uploads that are happening into our repository. While copying and pasting metadata content from the PDFs, we don't directly paste the copied content onto the submission screen. Instead, we first copy the content onto an editor like notepad or wordpad and then copy the content from an editor on to the submission screen. This is specially true for the references.

"Our experience has been that when the references are copied and pasted on to an editor like notepad or wordpad from the PDF file, invariably non-ascii characters found in almost every reference. Correcting the non-ascii characters takes considerable amount of time. Also, as to be expected, the references from difference publishers are in different styles, which may not make reference linking straight forward. Both these factors forced us take a decision to do away with uploading of references, henceforth. I'll appreciate if you could share your experiences on the said matter."

The items in an article's reference list are among the most important metadata, second only to the equivalent information about the article itself. Indeed they are the canonical metadata: authors, year, title, journal. If each Institutional Repository (IR) has those canonical metadata for every one of its deposited articles, as well as for every article cited by every one of its deposited articles, that creates the glue for distributed reference interlinking and metric analysis of the entire distributed OA corpus webwide, as well as a means of triangulating institutional affiliations and even disambiguating names.

Yes, there are some technical problems to be solved in order to capture all references, such as they are, and to filter out the noise, but those problems are well worth solving (and the solutions well worth sharing) for the great benefits they will bestow.
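As an illustration of the noise-filtering step, here is a minimal Python sketch for cleaning a reference copied out of a PDF. The function name clean_reference and the replacement table are hypothetical, chosen for this example; the sketch relies only on Python's standard unicodedata module, and it assumes the goal is to keep legitimate accented letters rather than to strip everything non-ASCII.

```python
import unicodedata

# Hypothetical replacement table for characters that commonly survive
# PDF copy-and-paste; extend it as new cases turn up.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00ad": "",                   # soft hyphen (deleted)
}


def clean_reference(text):
    """Normalize a pasted reference without discarding accented letters."""
    # NFKC expands ligatures such as U+FB01 into "fi" and composes accents.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(str.maketrans(PUNCTUATION))
    # Collapse line breaks and runs of spaces left over from the PDF layout.
    return " ".join(text.split())
```

Run over each pasted reference before it is stored, something like this would remove most of the hand-correction described above, though the table would certainly need extending for real publisher PDFs.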

The same is true for handling the numerous (but finite) variant formats that references may take: Yes, there are many, including different permutations in the order of the key components, abbreviations, incomplete components etc., but those too are finite and can be solved once and for all to a very good approximation, and the solution can be shared and pooled across the distributed IRs and their software. And again, it is eminently worthwhile to make the relatively small effort to do this, because the dividends are so vast.
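To make the "finite variants" point concrete, here is a rough Python sketch that maps two differently ordered reference styles onto the same canonical fields (authors, year, title, journal). The two patterns, the function name parse_reference and the sample references are all invented for illustration; a real parser would need many more patterns or a trained model.

```python
import re

# Two illustrative patterns only; real reference lists need many more.
PATTERNS = [
    # Style A: Authors (Year) Title. Journal Volume: Pages
    re.compile(
        r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.?\s*"
        r"(?P<title>.+?)\.\s+(?P<journal>[^\d.]+)"
    ),
    # Style B: Authors. Title. Journal Year;Volume:Pages
    re.compile(
        r"^(?P<authors>[^.]+)\.\s+(?P<title>.+?)\.\s+"
        r"(?P<journal>[^\d.]+?)\s+(?P<year>\d{4})"
    ),
]


def parse_reference(ref):
    """Return the canonical fields of a reference, or None if no style fits."""
    for pattern in PATTERNS:
        match = pattern.match(ref.strip())
        if match:
            # Trim stray whitespace captured at the edges of each field.
            return {name: value.strip()
                    for name, value in match.groupdict().items()}
    return None


# Made-up references in the two styles:
style_a = parse_reference(
    "Smith, J. (2001) A sample title. Journal of Examples 12: 34-56")
style_b = parse_reference(
    "Smith J. A sample title. Journal of Examples 2001;12:34-56")
```

Both calls yield the same year, title and journal even though the components appear in a different order; only the author string differs. References that fit no known style return None and can be set aside for the kind of manual tidying-up repository staff already do.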

I hope the IR community in general -- and the EPrints community in particular -- will make the relatively small, distributed, collaborative effort it takes to ensure that this all-important OA glue unites all the IRs in one of their most fundamental functions.

(Roman Chyla has replied to eprints-tech with one potential solution: "The technical solution has been there for quite some time, look at citeseer where all the references are extracted automatically (the code of the citeseer, the old version, was available upon request - I dont know if that is the case now, but it was in the past). That would be the right way to go, imo. I think to remember one citeseer-based library for economics existed, so not only the computer-science texts with predictable reference styles are possible to process. With humanities it is yet another story.")

This is really interesting, and something I must admit I hadn't really given much thought to until now.

One thing is immediately clear to me, however, and that is that manual cutting and pasting to capture this canonical metadata is certainly not a viable option. I guess instead software developments need to be made that extract references from the full text automatically, perhaps with some minimal 'tidying up' then applied, in the same way that some repository staff currently tidy up other metadata before making items live.

I'm thinking aloud here, but if we could establish this 'glue', how would the reference linking actually work? Presumably the idea is that you could click on a reference and it takes you through to an OA version of that paper in another repository. If that is the case, how would the links be established? Would it work through your repository crawling the metadata of other repositories looking for a match?

Reference linking in publisher systems works by look-ups in CrossRef... I'm just trying to get a grasp of how a similar system in the OA world would work. Obviously we wouldn't use DOIs because that would link to the publisher site. Am I understanding things correctly?

Finally, is there an issue here surrounding PDF versus source file? It must be a lot easier to extract references from a source file. Also, if we don't have the full text for a particular item, where would publishers stand on us capturing the reference list from their site, in the same way that we might go and grab the abstract? Some publishers secure reference lists to subscribers only; others don't. Would we see publishers adding further conditions to their archiving policies relating to what we can and can't do with their reference lists?

(1) Yes, it has to be done by software, not by hand. (CiteSeerX does a good automatic extraction.)
http://citeseerx.ist.psu.edu/

(2) For how they could be linked (at the harvester level), see Citebase: http://www.citebase.org/

(3) The OA reference-linking service could use any resources that are not behind toll-barriers to find, check or link references. (Nothing wrong with linking to toll-sites too, but then that's the end of the chain...)

(4) CiteSeer and Google Scholar can extract references from PDF, but obviously there are better formats too. References can also be extracted or even separately deposited at source, in the IR (then parsed with ParaCite-like software):
http://paracite.eprints.org/
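
To give a feel for the harvester-level linking in (2), here is a small Python sketch of one possible matching approach: reduce both the parsed reference and each harvested record to a crude key (first author's surname, year, first few title words) and look the reference up in an index of harvested records. This is an assumption about how such matching could work, not a description of how Citebase or any other service does it; the function names, the record fields and the example.org URL are all hypothetical.

```python
import re
import unicodedata


def fold(text):
    """Lowercase, strip accents and split into alphanumeric words."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.findall(r"[a-z0-9]+", plain.lower())


def match_key(surname, year, title, n_words=4):
    """Build a crude key: surname, year and the first few title words."""
    return ("".join(fold(surname)), str(year), " ".join(fold(title)[:n_words]))


def build_index(records):
    """Index harvested records (dicts with surname, year, title, url)."""
    return {match_key(r["surname"], r["year"], r["title"]): r["url"]
            for r in records}


def link_reference(ref, index):
    """Return the URL of a matching open-access copy, or None."""
    return index.get(match_key(ref["surname"], ref["year"], ref["title"]))


# Hypothetical harvested record and a cited reference with small differences:
index = build_index([{
    "surname": "Müller", "year": 2001,
    "title": "A Sample Title: With Subtitle",
    "url": "https://repository.example.org/123",
}])
cited = {"surname": "Muller", "year": "2001",
         "title": "A sample title with subtitle"}
link = link_reference(cited, index)
```

Exact-key lookup like this is deliberately simple: it tolerates differences in accents, case and punctuation, but a production service would presumably need fuzzier matching to cope with abbreviated titles or wrong years.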

The American Scientist Open Access Forum has been chronicling and often directing the course of progress in providing Open Access to Universities' Peer-Reviewed Research Articles since its inception in the US in 1998 by the American Scientist, published by the Sigma Xi Society.

The Forum is largely for policy-makers at universities, research institutions and research funding agencies worldwide who are interested in institutional Open Access Provision policy. (It is not a general discussion group for serials, pricing or publishing issues: it is specifically focussed on institutional Open Access policy.)