26 October 2006

So, why am I interested in how you save the papers you read? Well, I don't want to ruin the surprise yet. First, let's take a look at the (still incoming) results. The most popular method (roughly 60% of those surveyed) is to save them locally. People have also pointed to some tools for archiving, though my guess is that these are probably underutilized. I'm actually a bit surprised more people don't use delicious, though since I don't either, perhaps I shouldn't be. (Incidentally, I fall into the majority class.)

The reason I'm curious is that I spend a nontrivial amount of time browsing people's web pages to see what papers they put up. Some of this has to do with the fact that I only follow about a dozen conferences with regularity, which means that something that appears in, say, CIKM, often falls off my radar. Moreover, it seems to be increasingly popular to simply put papers up on web pages before they are published formally. Whether this is good or not is a whole separate debate, but it is happening more and more, and I strongly believe it will continue to increase in popularity. So, I have a dozen or so researchers whose web pages I visit once a month or so to see if they have any new papers out that I care about. And, just like the dozen conferences I follow, there are lots that fall off my radar here.

But this is (almost) exactly the sort of research problem I like to solve: we have too much information and we need it fed to us. I've recently been making a fairly obvious extension to my Bayesian query-focused summarization system that enables one to also account for "prior knowledge" (i.e., I've read such and such news stories -- give me a summary that updates me). I've been thinking about whether to try such a thing out on research articles. The basic idea would be to feed it your directory containing the papers you've read, and then it would routinely go around and find new papers that you should find interesting. Such a thing could probably be hooked into something like delicious, though given the rather sparse showing here, it's unclear that would be worthwhile.
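To make the idea concrete, here is a rough sketch of just the core matching step -- plain tf-idf and cosine similarity between a "profile" built from papers you've read and a pool of candidate abstracts. This is deliberately simpler than the Bayesian model I mentioned, and all the names here are made up for illustration, not taken from any actual system:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; good enough for a sketch."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf_vectors(docs):
    """Compute a sparse tf-idf vector (dict term -> weight) per token list."""
    df = Counter()
    for toks in docs:
        df.update(set(toks))
    n = len(docs)
    return [
        {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
        for tf in (Counter(toks) for toks in docs)
    ]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(read_papers, candidates, top_k=3):
    """Rank candidate abstracts by similarity to the average of read papers."""
    vecs = tf_idf_vectors([tokenize(d) for d in read_papers + candidates])
    read_vecs, cand_vecs = vecs[:len(read_papers)], vecs[len(read_papers):]
    # profile = average of the read-paper vectors
    profile = Counter()
    for v in read_vecs:
        for t, w in v.items():
            profile[t] += w / len(read_vecs)
    scored = sorted(
        ((cosine(profile, v), c) for v, c in zip(cand_vecs, candidates)),
        reverse=True)
    return [c for _, c in scored[:top_k]]
```

A real system would replace the averaged profile with something that downweights what you already know (the "prior knowledge" part), but even this crude version captures the feed-it-your-directory workflow.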

Of course, it's a nontrivial undertaking to get such a thing actually running beyond my controlled research environment (my desktop), so I wanted to get a sense of whether anyone might actually be interested. Ross's comment really got my attention because it would probably be easier technologically if everything could be done online (so one wouldn't have to worry about cross-platform issues, etc.).

Anyway, this is something I've been thinking about for a while and it seems like a lot of the tools exist out there already.

8 comments:

People have also pointed to some tools for archiving, though my guess is that these are probably underutilized. I'm actually a bit surprised more people don't use delicious

I suspect that they are immensely under-utilised. If you were tempted to use del.icio.us I can't think of any reason why you would not prefer Connotea or CiteULike. Of these, Connotea appears to have the more active developer community.

I have a dozen or so researchers whose web pages I visit once a month or so to see if they have any new papers out that I care about. And, just like the dozen conferences I follow, there are lots that fall off my radar here.

My research interests don't align with any recognised fields/journals/conferences. Besides which, I am not at a university, so don't have access to any subscription-only journal websites. Consequently I am obliged to pick up everything from personal web pages and open access archiving sites. Every time I find anything of interest I put a permanent watch on that page using WebSite-Watcher (there are other similar products). This checks nearly 6,000 pages for me once a week and alerts me to any changes. None the less, there is much that falls off my radar too, plus I spend a lot of time looking at false alarms.

I also have a number of Google Alerts. These work well except that they match against the literal text rather than the conceptual content.

The basic idea would be to feed it your directory containing the papers you've read, and then it would routinely go around and find new papers that you should find interesting. Such a thing could probably be hooked into something like delicious ... I wanted to get a sense of whether anyone might actually be interested

I would be interested, because I am far from convinced that tags are the answer to this problem. I think an approach is needed where a central server accumulates information on the text content of papers (hopefully in a way that doesn't trigger any copyright problems) and the user's local collection is used to generate queries that are periodically run against the central server.
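The query-generation side of this could be as simple as pulling the most distinctive terms out of the local collection and shipping them to the server as a search query. A toy illustration (the scoring and names are just illustrative, not any particular system's API):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def query_terms(local_papers, k=5):
    """Pick the k most distinctive terms in a local collection, scored by
    tf-idf computed within that collection, to send as a search query."""
    docs = [tokenize(p) for p in local_papers]
    n = len(docs)
    df = Counter()  # in how many local papers each term appears
    tf = Counter()  # total occurrences of each term
    for toks in docs:
        df.update(set(toks))
        tf.update(toks)
    # frequent terms that do not appear in every single paper score highest
    scores = {t: tf[t] * math.log((1 + n) / df[t]) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

The central server then only ever sees a handful of query terms rather than the full text of anyone's collection, which is also one way of sidestepping the copyright question.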

There was extensive discussion of this sort of facility on connotea-discuss around February 2006 (sorry - you'll have to plough through the posts).

Yes, it would be great to have an automatic paper recommender and feed! I think the general area of NLP/IR tools for improved research productivity is a potentially very fruitful one. I'd like a program that recommends papers that are relevant (e.g. based on what papers I've already read), as well as papers that are important (e.g. what's listed in reading groups/lists). It would also be nice to have retrieval over all papers on the web that isn't based on literal terms (as Google Alerts is) -- often when I'm doing a literature survey and trying to find as many papers as possible that relate even remotely to what I'm doing, I find that different fields use different terms. How do I know what terms to search? This is a classic problem in IR, but in the context of research papers there might be tailored solutions.