Posted by michael on Friday August 03, 2001 @02:08PM
from the spreading-the-light dept.

Jabagi writes: "Hi, I am a computer engineering undergrad, and my department at the university has given me the job of building a web site that displays its publications in digital format. So I wanted to ask if anybody knows any good [preferably university-originated] web pages for research publications with an easy-to-use interface. A very good one in my opinion is from MIT. I would also accept any suggestions as to what should be included on such a page [for example, which formats do you prefer the documents to be in, what types of information should be present, etc.]"

There's some work going on within our department (which I'm not involved in, so don't blame me if I don't get it all right;) which looks pretty useful/interesting.

Firstly the eprints.org author/institution self-archiving software:
eprints.org [eprints.org] It's been designed "to be as flexible and adaptable as possible so that universities can adopt and configure it with minimal effort for all disciplines".
An example of its use is at the:
Cognitive Sciences eprint Archive [soton.ac.uk]

It's also got other noble principles behind it: "The generic version of eprints is fully interoperable with all other OAI-Compliant Open Archives. This means that it no longer matters where papers are archived; the papers in all registered OAI-compliant Archives can be harvested using the OAI protocol into one global "virtual archive" by Open Archives Service Providers".
See the Open Archives Initiative [openarchives.org] for more info.
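To give a feel for what "harvested using the OAI protocol" means in practice, here is a minimal sketch of how a harvester might build its request URL. The parameter names (`verb`, `metadataPrefix`, `set`) are the ones I know from the later OAI-PMH 2.0 spec, so the protocol version current at the time of this post may differ in detail. The base URL and the function name are made up for illustration.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-compliant archive.

    base_url is the archive's OAI endpoint (hypothetical in the example below).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optionally restrict the harvest to one set (e.g. one department).
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
print(oai_list_records_url("http://archive.example.org/oai"))
```

A service provider would fetch that URL and parse the XML records it gets back; that part is left out here.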

I was actually asked to write the code to produce the bibtex entries as my supervisor for the project is a CS research assistant, too. But I don't know much about bibtex and they didn't give me enough details. Anyhow, the script I implemented produces a bibtex entry, with a few bugs perhaps.
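For anyone in the same position, producing a BibTeX entry from database fields is mostly string formatting. The following is only a rough sketch, not the script mentioned above: the function name, the field names and the default `techreport` entry type are assumptions, and it does not escape special characters in the values.

```python
def bibtex_entry(key, title, authors, year, entry_type="techreport", **extra):
    """Format one BibTeX entry from a paper's fields.

    authors is a list of names; BibTeX separates them with " and ".
    Any extra keyword arguments become additional fields.
    """
    fields = {
        "author": " and ".join(authors),
        "title": title,
        "year": str(year),
    }
    fields.update(extra)

    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields.items():
        if value:  # leave out empty fields
            lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Made-up paper, for illustration only.
print(bibtex_entry("doe2001", "A Sample Paper", ["Jane Doe", "John Roe"], 2001,
                   institution="Example University"))
```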

therefore, it should be done in more ways than one. I would go for a fast database backend (such as mysql) and insert into the database the title, authors, abstract and keywords. Then, if your people have latex source, use latex2html to produce html, use pdflatex to produce pdf, use latex2rtf to produce rtf, and let the user choose the format. If there are ps files you have ps2pdf, ps2html and possibly others. A text search would be nice if you have the hard disk space required. Use pdftotext and ps2ascii to convert to ascii, then index the whole lot (much like google does).
Sorin M
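The conversion steps above could be wired together roughly like this. It is a sketch under assumptions: the function name is made up, the tools are assumed to be installed and on the PATH, and only the plain `tool input [output]` invocations are used. It just builds the command lines rather than running them.

```python
from pathlib import Path


def conversion_commands(source):
    """Return the commands (as argument lists) that would convert one paper.

    Nothing is executed here; the caller decides when to run them.
    """
    src = Path(source)
    if src.suffix == ".tex":
        return [
            ["latex2html", str(src)],  # HTML version
            ["pdflatex", str(src)],    # PDF version
            ["latex2rtf", str(src)],   # RTF version
        ]
    if src.suffix == ".ps":
        return [
            ["ps2pdf", str(src), str(src.with_suffix(".pdf"))],
            # Plain text, for the full-text index.
            ["ps2ascii", str(src), str(src.with_suffix(".txt"))],
        ]
    if src.suffix == ".pdf":
        return [["pdftotext", str(src), str(src.with_suffix(".txt"))]]
    return []  # unknown format: nothing to do


for command in conversion_commands("report.ps"):
    print(" ".join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` once the tools are in place, and the resulting .txt files are what you would feed to the indexer.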

I've been working on a project to organise and present documents by faculty at the University of Kentucky called Origami. Take a look here [uky.edu] to see where I've gone with it. It's written in PHP and uses mysql and the local filesystem to store paper information. If you'd like I can write some quick documentation and send you a tarball of what I've got so far.

Actually I have done the first part of my assignment, the search engine. It uses php/mysql but it is not up yet on the school server. Other than that, the school database has the files in .ps format, but I will tell them about the pdf format also. BTW atropa, thanks for the suggestion. My version is pretty much the same I guess, and if you [or anyone else] want a tarball of my version, email me with my /. nickname at yahoo.com and I'd be more than happy to send it.

First of all, apologies for not giving the exact details about the project I was given. Here they are:
The paper information is kept on a mysql server as a table with name, author, date and many other detailed fields. Files are stored in a folder on the server in ps format. I was asked to prepare a search engine and a login procedure for adding/updating/deleting papers using php.

The search engine I prepared goes through the papers in the database one by one, searches for the keyword/phrase, and displays the results ordered by the # of hits in each db entry. It also displays a link to each paper's details and a link to its file. The details page shows a formatted version of the information in the db for that paper. It also produces a bibtex entry.
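In case it helps anyone picture the hit-count ranking described above, here is a small sketch of the idea in Python rather than the php/mysql actually used. The function name and the sample records are made up, and matching is a plain case-insensitive substring count over every field.

```python
def search(papers, phrase):
    """Rank papers by how often phrase occurs in their fields.

    papers is a list of dicts (one per db entry). Returns (hits, paper)
    pairs, most hits first, leaving out entries with no hits.
    """
    needle = phrase.lower()
    if not needle:
        return []
    results = []
    for paper in papers:
        # Count occurrences field by field so a match never spans two fields.
        hits = sum(str(value).lower().count(needle) for value in paper.values())
        if hits:
            results.append((hits, paper))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Made-up records, for illustration only.
papers = [
    {"title": "Compilers", "abstract": "Parsing."},
    {"title": "Neural nets", "abstract": "An overview."},
    {"title": "Neural networks", "abstract": "A network of neural units."},
]
for hits, paper in search(papers, "neural"):
    print(hits, paper["title"])
```

Scanning every entry is fine for a small collection; with many papers a prebuilt index would be the usual next step.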

Thanks for all the URLs and the advice everyone. I have decided to tell my teacher about adding pdf versions of the papers to the page. We may or may not put actual pdf copies as there are not that many papers now, but I didn't know about those conversion commands. Now that I know, I won't have to convert the papers myself. I will try to post a link to the page as soon as it's up.

PS: One funny thing, after I submitted the question, I checked AskSlashdot for a few days and after that I assumed that it was rejected. I actually noticed this page just today when I was given a comment from this discussion to metamoderate. But still thanks again guys!..

This is probably not going to be a popular opinion, but having documents available in pdf is a good thing. Most of your users are going to have readers, and the output is consistent.

You should also include HTML and SGML formats.

Depending on your users, they might demand .doc formats. It's probably better to ask your users rather than Slashdot. They know what they want to use themselves, and are more familiar with the standard formats of their particular academic community.

In some ways, this seems like the 'Ask a SlashdotLawyer' type of question.

Actually, pdf is not a bad idea. It is what is used for publications all over the net, and it has good support, both in win and unix. Some documents don't work well in all pdf-viewers, but if you check that the files you create work with ghostview or other unix-equivalents, it's probably good enough.

Though, as you say, the optimal would have been an html-version in addition to this. If you mix XHTML 1.1 with MathML, it would probably get good results too, but I'm not sure how much support browsers have for MathML.

There are some tools, like 'pdftohtml', you can use. So having both versions on the site shouldn't be a big problem.

Visit the NEW ACM Portal - (and try the Online Guide to Computing Literature for FREE through 9/30/01)!

I learned of this from reading the article "ACM Opens Portal to Computing Literature" which appeared in the (dead tree) July 2001 Issue of "Communications of the ACM" Vol. 44, No. 7.

Here is part of a salient paragraph:

While it is currently out-of-scope for the Portal to capture the references in works put out by other publishers, it is intended to create a comprehensive bibliographic database by capturing all the references in ACM articles that cite works from other publishers and for which there is no matching bibliographic entry in the database.

I would suggest you contact the ACM and see if they have any suggestions on what you are attempting. I would expect they have already explored a number of alternatives, one of which might be just what you are looking for.

If you should decide to create your own system, please consider making it open-source so that others may benefit as well.

And if not the entire ACM Portal, at least look at the ACM's digital library [acm.org].

It provides the ability to search on a variety of fields (title, author, keywords, full text, etc) and to download in PDF if available.

Searching is free, but you need a subscription or pay-per-use to get the full text. With membership, you also get access to a useful "My Bookshelf" tool where you can save and organize search results and individual papers.

My only complaint is that for some older articles, the PDF seems to be generated from an image scan of the text, rather than the text itself, resulting in huge files for only a few pages.

You could do a lot worse than that page at MIT. It's readable and navigable, and it gets the job done. Things they did particularly right: It's navigable from just about wherever you are. It gives you a clearly delineated path back from where you came. It appears to be all text-based (therefore accessible to all.) It's easily searchable. If they made one mistake it was in not using anchor tags on that first index page to carry the user back to the top area of the page (where the path navigation and left-hand navigation are located).

Although it pains me horribly to say it, I think PDF is a good bet. (And I hope somebody comes up with a viable alternative soon.) I'd pick that, plain HTML, and some other format you decide on that prevails at your institution, whether it be .doc, text, or something else. You want above all to make it useful to the people who need to read it, so let their opinions prevail.