Jane's last post and a post on the ever-excellent Language Log have got me thinking about permanence and accountability in the internet age. It's a theme that I encounter again and again, working for a digital archive.

First, Mark Liberman's post on Language Log was a fairly scathing breakdown, reference by reference, article by article, that showed that a point supposedly backed up by hard evidence, well, wasn't. A great effort really. And thanks to his extensive linking, and by simply placing the relevant articles online, we too can come to much the same conclusion. Well, admittedly, I just took his word for it... but that's all I've got time for over my morning coffee.

Exciting things are happening in academia with individuals and organisations starting to fully utilise the internet. As Jane mentions, Sydney University is having great success with several new digital initiatives. I think DSpace is the bee's knees! It's going to be wonderful. It's one step closer to directly linking to the actual section of the actual article which you're interested in. That's the kind of functionality that I'm after.

Great, you say, that'll save me 10 minutes of searching in the library. Well, yes, says I, but that's not what's interesting. Imagine linking directly and explicitly to the paragraphs or sentences that you are interested in. That's heaps better than a simple article reference. Suddenly the reader can discover exactly what you're talking about, quickly, and by jumping directly from one article to the next, in a way that you're already accustomed to.

But really that's just the beginning. Think of the Altavista to Google leap. (This oversimplifies it a bit, but) Altavista was a simple but vast index of content words and meta tags. Google came along with, amongst other things, the idea that links actually expressed something meaningful, and suddenly the internet became a whole lot more useful.

Well, imagine reading an interesting article and being able to see who quoted it. Imagine a density plot of the most popular quotes overlaid on the text of the article. What parts of a paper are people talking about the most? You could establish quotable articles and the articles that quoted the quotables and became quotable themselves. Imagine looking at quotability on a timeline. You would be able to see the "hot spots" in the development of ideas over time. These would be simple additions to a modern search engine, and in fact have already been added in a rough sense. All that needs to happen (maybe that should be scare quoted) is for researchers to adopt a new referencing scheme and to archive their articles in digital repositories.
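To make the "density plot" idea a little more concrete, here is a minimal sketch in Python. It assumes something that doesn't exist yet: that every citation records which paragraph of the target article it points at. The function names, the `(citing_id, paragraph_index)` format and the sample data are all made up for illustration.

```python
from collections import Counter


def quote_density(num_paragraphs, citations):
    """Count how often each paragraph of an article is quoted.

    `citations` is an iterable of (citing_id, paragraph_index) pairs.
    This format is an assumption for the sketch, not an existing standard.
    """
    counts = Counter(idx for _, idx in citations if 0 <= idx < num_paragraphs)
    return [counts.get(i, 0) for i in range(num_paragraphs)]


def hot_spots(density, top=3):
    """Return the indices of the most-quoted paragraphs, most quoted first."""
    ranked = sorted(range(len(density)), key=lambda i: (-density[i], i))
    return [i for i in ranked[:top] if density[i] > 0]


# Hypothetical citation data for a five-paragraph article.
citations = [
    ("paper-a", 2),
    ("paper-b", 2),
    ("paper-c", 0),
    ("paper-d", 2),
    ("paper-e", 4),
]

density = quote_density(5, citations)
print(density)                    # [1, 0, 3, 0, 1]
print(hot_spots(density, top=2))  # [2, 0]
```

Add a publication date to each citation and group the counts by year, and you would have the raw numbers for the timeline view as well.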

Leap sideways: imagine that instead of articles, we put up our raw research and/or field work data. Not only can you link directly to the source(s) for your argument, so can others in their critique. Let them crunch the numbers if they don't believe you. Or say you're talking about discourse pragmatics; people may then like to hear the utterance's intonation for themselves. Even better, say you were unable to explore an interesting avenue; this leaves it open for someone else to come along and explore. This is a critical ability to facilitate when you're talking about endangered languages.

To me this extensive referencing is a straightforward way of increasing the empirical weight that a piece of research holds. Sure, where people have already referenced articles, it's already technically there, but I'm talking about granularity of referencing here. And in terms of source data, I'm talking about qualifying your statements as explicitly as possible.

But, to come crashing down to reality again, this technology is not quite there yet. Well, actually it kinda is there; the main problem as I see it is adoption. So first of all, get uploading! A good solid base of data is where this revolution will begin!

These are hot topics in the Documentary Linguistics and Digital Repositories fields. There'll be a fair bit of discussion of this at our upcoming conference, which looks like it's going to have a great line-up of interesting papers. If this kind of thing interests you, then we hope to see you there!

Comments

On uploading to repositories, Sten Christensen just passed on this nice site, the Sherpa Romeo project, which lists many publishers' and journals' conditions on uploading pre-prints and post-prints to archives. (Yes, there's a Juliet project too.)

Sydney University is having great success with several new digital initiatives. I think DSpace is the bee's knees! It's going to be wonderful. It's one step closer to directly linking to the actual section of the actual article which you're interested in. That's the kind of functionality that I'm after.

Unfortunately DSpace, particularly as implemented by the University of Sydney library (though none do much better), is limited while papers are published in PDF form. PDF does not allow the passing on of bookmark information from the URL, so any hope of linking into the documents from a web page is immediately quashed.

Of course PDF is the best way to archive papers at the moment as it allows for easy translation from whatever word processing system you use while keeping such important things as footnotes and endnotes intact, not to mention formatting.

We really need an archiving system with built-in computer translation of PDF files to well-formed HTML. This is a partially solved problem: PDF to HTML translation can be seen, for example, by GMail users who are sent PDF enclosures. The result is badly formed HTML, though, good for display but no other purpose.

Once we can translate PDF into well-formed HTML, achieving your goal of adding the ability to link to individual paragraphs is not difficult.
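As a rough illustration of why it is not difficult, here is a minimal sketch in Python using only the standard library. It assumes the converted document really is well-formed (XHTML-style, parseable as XML, with no namespace declarations); the function name and the `p1`, `p2`, ... id scheme are hypothetical choices for the example, not part of DSpace or any existing archive.

```python
import xml.etree.ElementTree as ET


def add_paragraph_anchors(xhtml, prefix="p"):
    """Give every <p> element an id so a URL fragment can point at it.

    Assumes `xhtml` is well-formed markup without XML namespaces.
    Paragraphs that already carry an id are left untouched.
    """
    root = ET.fromstring(xhtml)
    for n, para in enumerate(root.iter("p"), start=1):
        if "id" not in para.attrib:
            para.set("id", f"{prefix}{n}")
    return ET.tostring(root, encoding="unicode")


doc = "<html><body><p>First point.</p><p>Second point.</p></body></html>"
anchored = add_paragraph_anchors(doc)
print(anchored)
```

With the ids in place, a reference can point straight at the second paragraph by ending the article's URL with `#p2`. The hard part remains getting well-formed HTML out of the PDF in the first place.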

The largest problem with digital archives is the tension between getting information that is as useful, both immediately and in the future, as possible and making it as easy as possible for people to add documents so that you get a reasonable body of works.

As Jane pointed out in her post, the flow of documents into even the University of Sydney archive, which has a reasonably large base of working academics to draw from, is slow. The most successful online information-gathering places, such as Flickr, YouTube and del.icio.us, ask of people nothing that does not immediately benefit them. Perhaps archivists need to contemplate ways they can "add value" either before or after publication so that we get a benefit from using digital archive spaces. We could then imagine a good flow of documents into the archive.

As Jane pointed out in her post, the flow of documents into even the University of Sydney archive, which has a reasonably large base of working academics to draw from, is slow. The most successful online information-gathering places, such as Flickr, YouTube and del.icio.us, ask of people nothing that does not immediately benefit them. Perhaps archivists need to contemplate ways they can "add value" either before or after publication so that we get a benefit from using digital archive spaces. We could then imagine a good flow of documents into the archive.

Yes! Great point. I think the benefit is obvious for archival of field work data when we hear stories like this and this (so sorry to hear that!). In the case of raw data, these days researchers increasingly need somewhere to put it. It's an extra sweetener if they get the ability to link to it extensively and directly in their work.

Theses are a bit harder, and quality is an issue. I think the current mechanisms that determine whether a web page is of sufficiently good quality to appear first in your web search are good enough already... we just need the linking for that to work with theses. So I say take whatever you can get for now. As for actually getting people to add material, publishing conference proceedings automatically is one way; encouraging honours and PhD students to publish is another. Many students feel their hard work goes nowhere once they've finished... this gives it a bit more prominence. On the other hand, some students finish their thesis and then get embarrassed about showing it to other people, let alone the world!

Maybe if the library offered to print up and bind their theses at some discount rate, that would encourage people to send them off... hmmm. I wonder if that could work? It'd be kinda neat.

Anyway, you're right. We definitely need to figure out a way of making it more attractive for people to upload.

Once we can translate PDF into well-formed HTML, achieving your goal of adding the ability to link to individual paragraphs is not difficult.

Yes, I agree, PDF isn't really the bee's knees when it comes to linking... future HTML standards were in fact what I was thinking about when talking about linking to paragraphs and sentences. Good quality PDF-to-HTML conversion is really only a matter of time, so I see it as only a problem in the short term. No (or not much) information is being lost by storing them this way. And in much the same way we have to convert from video format to video format to keep our data accessible, I imagine we'll have to do the same with rich text documents too.