A public service rant: please fix your bibliography

Like many academics, I spend a lot of time reading and reviewing technical papers. I find myself continually surprised at the things that show up in the bibliography, so I thought it might be worth writing this down all in one place so that future conferences and whatnot might just hyperlink to this essay and say “Do That.”

Do not use BibTeX entries that are auto-generated from Citeseer, DBLP, the ACM Digital Library, or any other such thing. It’s stunning how many errors these contain. One glaring example: papers that appeared in the Symposium on Operating Systems Principles (SOSP) often turn out as citations to ACM Operating Systems Review. While that’s not incorrect, it’s also not the proper way to cite the paper. Another common error is that auto-generated citations inevitably have the wrong address, if they have it at all. (Hint: the ACM’s headquarters are in New York but almost all of their conferences are elsewhere. If you have “New York” anywhere in your bib file, there’s a good chance it should be something else.)

Leave out LNCS volume numbers and such for conferences. Many, many conferences have their proceedings appear as LNCS volumes. That’s nice, but it consumes unnecessary space in your bibliography. All I need to know is that we’re looking at CRYPTO ’86. I don’t need to know that it’s also LNCS vol. 263.

For most any paper, leave out the editors. I need to know who wrote the paper, not who was the program chair of the conference or editor of the journal.

For most any conference paper, leave out the publisher or organization. I don’t need to see Springer-Verlag, USENIX Association, or ACM Press. For journal papers, you need to use your discretion. Sometimes the name of the association is part of the journal name, so there’s no real need to repeat it. The only places where I regularly include organization names are technical reports, technical manuals and documentation, and published books.

Be consistent with how you cite any conference. For space reasons, you may wish to contract a conference name, only listing “SOSP ’03” rather than “Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03)”. That’s fine, at least for big conferences like SOSP where everybody should have heard of it, but use the same contraction throughout your bibliography. If you say “SOSP ’03” in one place and “Proceedings of SOSP ’03” somewhere else, that’s really annoying. Top tip: if you’re space constrained, the easiest thing to nuke is the string “Proceedings of”.

Make sure you have the right author list and with the proper initials. When you use BibTeX and you plug in “D.S. Wallach”, what comes out is “D. Wallach” since there’s no whitespace before the “S”. It’s damn near impossible to catch these things in your source file by eye, so you should do a regular-expression search ([A-Z]\.[A-Z]\.) or proofread the resulting bibliography. I’ve sometimes seen citations to papers where there were co-authors missing. Please double-check this sort of thing, for example by visiting the authors’ home pages or conference home pages.
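If you'd rather not run that regular-expression search by hand, here is a minimal sketch in Python. The function name and the sample entry are made up for illustration, and it only looks at lines that begin with an author or editor field, so an author list that wraps onto a second line would slip past it.

```python
import re

# Two capital initials with no whitespace between them, e.g. "D.S."
RUN_TOGETHER_INITIALS = re.compile(r"[A-Z]\.[A-Z]\.")

def find_run_together_initials(bib_text):
    """Return (line_number, line) pairs for author/editor lines containing initials like "D.S."."""
    hits = []
    for number, line in enumerate(bib_text.splitlines(), start=1):
        field = line.strip().lower()
        if field.startswith(("author", "editor")) and RUN_TOGETHER_INITIALS.search(line):
            hits.append((number, line.strip()))
    return hits

# Hypothetical entry, used only to exercise the check.
sample = """@inproceedings{example,
  author = "D.S. Wallach and A. B. Author",
  title = "A hypothetical title",
}"""

for number, line in find_run_together_initials(sample):
    print(f"line {number}: {line}")
```

Anything it reports still needs a human to fix; the point is only to make the suspicious lines easy to find.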

Be consistent with spelling out names versus using initials. Most bib styles just use initials rather than whole names. However, if you’re using a style that uses whole names, make sure that you’ve got the whole name for every citation in your bibliography. (Or switch styles.)

Always include a URL for blogs, Wikipedia articles, and newspaper articles (or, at least, newspaper articles since the dawn of the web). Stock BibTeX styles don’t know what URLs are, so the easiest solution is to use the “note” field. Make sure you wrap the URL in \url{} so it becomes a hyperlink in the resulting PDF. I’m less confident I can advise you to always include a string like “Accessed on 11/08/2010”. But if you do it, do it consistently. Top tip: if you say \usepackage{url} \urlstyle{sf} in your LaTeX header, you’ll get more compact URLs than the stock typewriter font. See also urlbst.

Don’t use a citation just to point to a software project. If you need to give credit to a software package you used, just drop a footnote and put the URL there. You only need a citation when you’re citing an actual paper of some sort. However, if there’s a research paper or book that was written by the authors of the software you used, and that paper or book describes the software, then you should cite the paper/book, and possibly include the URL for the software in the citation.

BibTeX sometimes fails when given a long URL in the note field. This manifests itself as a %-character and a newline inserted in the generated bbl file. (Why? I have no idea.) I have a short Perl script that I always work into my Makefile that post-processes the bbl file to fix this. So should you.
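The Perl script itself isn't shown here, so what follows is not that script. It is a rough Python sketch of the same idea, under one loud assumption: any "%" that is immediately followed by a newline (and not preceded by a backslash) was inserted by BibTeX as a line break and can be deleted together with the newline. If your bbl file contains intentional trailing "%" comments, this would join those lines too. The function name and the example URL are hypothetical.

```python
import re

def rejoin_broken_lines(bbl_text):
    """Delete every unescaped "%" that is directly followed by a newline, along with the newline.

    Assumption: such a pair is a line break BibTeX inserted in the middle of
    a long token (typically a URL), not a comment someone wrote on purpose.
    """
    return re.sub(r"(?<!\\)%\n", "", bbl_text)

# A hypothetical bbl fragment with a URL split across two lines.
broken = "\\newblock \\url{http://example.org/a/very/long/pa%\nth/to/a/page}.\n"
print(rejoin_broken_lines(broken), end="")
```

In a Makefile you would read the bbl file, pass its contents through this function, and write the result back between the bibtex run and the final latex run.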

Eliminate the string “to appear” from your bibliography. Somebody years from now will look back in time and find these sorts of markers amusing. Worse, you can easily forget you put that in your bib file. It’s odd reading a manuscript in 2011 that cites a paper “to appear” in 2009.
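Since the real danger is forgetting the marker is there, a tiny check run alongside your build can nag you about it. This is only a sketch: the function name and the sample entry are invented for illustration, and it does nothing cleverer than a case-insensitive text search.

```python
import re

TO_APPEAR = re.compile(r"to\s+appear", re.IGNORECASE)

def lines_marked_to_appear(bib_text):
    """Return (line_number, line) pairs for every line that mentions "to appear"."""
    return [
        (number, line.strip())
        for number, line in enumerate(bib_text.splitlines(), start=1)
        if TO_APPEAR.search(line)
    ]

# Hypothetical entry, used only to exercise the check.
sample = """@inproceedings{hypothetical09,
  title = "A hypothetical paper",
  note = "To appear",
  year = 2009,
}"""

for number, line in lines_marked_to_appear(sample):
    print(f"line {number}: {line}")
```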

For any conference, include the address, month, and year. And for the month, use three letter codes in your BibTeX (jan, feb, mar, apr, …) without quotation marks. The BibTeX style will deal with expanding those or using proper contractions. For the address, be consistent about how you handle them. Don’t say “Berkeley, CA” in one place and “Berkeley, California” in another. Also, this may be my U.S.-centric bias showing through, but you don’t need to add “U.S.A.” after “Berkeley, CA”. For international addresses, however, you should include the country and the state/region is optional. “Paris, France” is an easy one. I’ll have to defer to my Canadian readers to chime in about whether it’s better to cite “Vancouver, B.C., Canada”, “Vancouver, Canada”, or “Vancouver, B.C.”
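The month rule is easy to check mechanically. Here is a minimal sketch, assuming each month field sits on its own line in the bib file; the function name and sample entries are hypothetical. It reports every month value that is not one of the bare three-letter codes, which catches both quoted and braced values.

```python
import re

MONTH_CODES = {"jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"}

# Captures whatever follows "month =" on a line, minus a trailing comma.
MONTH_FIELD = re.compile(r"^\s*month\s*=\s*(.+?),?\s*$", re.IGNORECASE | re.MULTILINE)

def nonstandard_months(bib_text):
    """Return the raw month values that are not bare three-letter codes."""
    return [value for value in MONTH_FIELD.findall(bib_text)
            if value.lower() not in MONTH_CODES]

# Hypothetical entries, used only to exercise the check.
sample = """@inproceedings{first,
  month = oct,
  year = 2003,
}
@inproceedings{second,
  month = "October",
  year = 2003
}
@inproceedings{third,
  month = {Oct}
}"""

print(nonstandard_months(sample))
```

The same approach would work for spotting inconsistent addresses, though deciding which spelling is the right one is still up to you.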

But not the page numbers. Back in the old days, I once got razzed by a journal editor for not including page numbers in all my citations. (And you think I’m pedantic!) Given how many conferences are ditching printed proceedings altogether, it’s acceptable to leave these out now, including for old references that you’re far more likely to dig up online than in the printed proceedings.

Double-check any author with accents in their name and try to get it right. BibTeX doesn’t seem to play nicely with Unicode characters, at least for me, so you have to use the LaTeX codes instead. I’m sure David Mazières appreciates it when you spell his name right.
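If typing the LaTeX codes by hand is the sticking point, a small lookup table can do the translation. The sketch below covers only a handful of characters and assumes each accented letter arrives as a single precomposed Unicode character (a letter followed by a separate combining accent would pass through unchanged). The table and function names are made up for illustration.

```python
# Unicode character -> LaTeX accent code. Deliberately tiny; extend as needed.
LATEX_ACCENTS = {
    "è": r"{\`e}",
    "é": r"{\'e}",
    "ö": r'{\"o}',
    "å": r"{\aa}",
    "ç": r"{\c{c}}",
}

def to_latex_accents(name):
    """Replace each accented character found in the table with its LaTeX code."""
    return "".join(LATEX_ACCENTS.get(ch, ch) for ch in name)

print(to_latex_accents("David Mazières"))  # prints: David Mazi{\`e}res
```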

Double-check the capitalization of your paper titles. I tend to use the BibTeX “abbrv” style, which forces lower case for every word in your paper title, excepting the first word. You then have to put curly braces around words that truly need capital letters like BitTorrent or something. Some hand-written bib entries I’ve seen put curly braces around every word because they really, really want the entry with lots of capital letters. Don’t do that. Use a different bib style if you want different behavior, but then make sure your resulting bibliography has consistent capitalization for every entry. I don’t particularly care whether you go with lots of capital letters or not, but please be consistent about it. Also, double check that proper nouns are properly capitalized.

When you post your own papers online, post a bib entry next to them. This might encourage people to cite your paper properly. For your personal web page, you might like the Exhibit API, which can turn BibTeX entries into HTML, dynamically. (See Ben Adida’s page for one example.) If you’re setting up something for your whole lab or department, Drupal Scholar seems pretty good. (See my colleague Lydia Kavraki’s lab page for one example. I’m expecting we will adopt this across our entire CS department.)

And, last but not least, a citation is not a noun. When you cite a paper, it’s grammatically the same as making a parenthetical remark. If you need to refer to a paper as a noun, you need to use the author names (“Alice and Bob [23] showed that the halting problem is hard.”) or the name of the system (“The Chrome web browser [47,48] uses separate processes for each tab to improve fault isolation.”) If there are three or more authors, then you just use the first one with “et al.” (“Alice et al. [24] proved P is not equal to NP.” Note also the lack of italics for “et al.”) For the ACM journal style, there’s something called \citeN rather than the usual \cite, which is worth using. You can also look into additional packages, such as natbib, to get similar functionality in any LaTeX paper style.

Obligatory caveat: A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. – Ralph Waldo Emerson

For conferences that move around, the address is helpful so you know which one it is. “Ah yes, Montreal. That was a great year.”

For conferences that stay put, the address is part of the identity of the conference, most notably IEEE Security & Privacy — the Oakland conference — and including the address helps reinforce exactly what conference you’re talking about.

As a reader, the main use a citation has is helping me find the paper being cited as quickly as possible. A secondary use is scanning over the year fields to see how recent the citations are for a particular claim.

Given these tenets, what benefit justifies the extra cost of doing more than copying the most complete BibTeX entry a trusted source (such as ACM) offers and stripping out fields? What do you think is the actual percentage of the time that such a source will lead a reader to an incorrect paper (or make the paper impossible to find at all)? I don’t see it as worth my time to make citations ‘look pretty’, or even be consistent (“SOSP ’03” and “Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03)” resolve to the same value, after all), and I do not care if citations to my papers look pretty either.

Please help convince me. I am a 4th year PhD student, and I honestly would like to know why this isn’t a throwback convention to the time before easy search.

Certainly something like page numbers are a throwback, so that’s sensible to drop. What kills me about machine-generated BibTeX entries, as you might get from the ACM Digital Library, is that they’re often flat-out wrong and nobody bothers to double check them. I’ve read more than one paper where they just have the authors and title and nothing else. The auto-generated BibTeX entry was just wrong, or was set up in a non-standard way that caused the conference name and so forth to fail to render in the bibliography.

If you let this sort of thing through, if your bibliography is sloppy, it makes the reader wonder whether the rest of your scholarship is similarly sloppy. If you didn’t bother to proofread your bibliography, did you bother to proofread the rest of your paper?

I could just as well have written this rant saying, “damn it, people, English has rules!” and proceeded to quote chapter and verse from Strunk and White. Your comparable response, then, might have said that you didn’t see it as worth your time to write grammatically proper English since the reader can figure out what you’re trying to say.

At the same time, proper English serves an important purpose for disambiguation and organizing the material in a (hopefully) easily digestible manner for the reader. Bibliographies are really just a set of links. I am all for proper grammar. I do object to people insisting on passive voice and no pronouns in scientific papers for the same reasons of readability described above, but I fully understand the purpose of grammar.

Many of the default bib entries that you pick up from machine-generated sources are flat out wrong (e.g., rendering without any visible conference name), which means they can seriously fall flat in acting as a pointer to a paper. Furthermore, humans aren’t the only ones reading your bibliography. Computers are reading it as well and trying to infer the citation graph. Google Scholar, among other services, uses this when it weights the search results. If you generate bogus citations, the links in the graph might not come out right.

Let’s say you cite an important paper as prior work. People visiting Google Scholar will find a seminal paper and then ask Google Scholar to tell them who cited it. If you cite it poorly, you might not show up in the list.

In my dream world, every academic paper would have a URI of some sort, and that would be embedded into the PDF when you cite it. At that point, a future version of Gün’s crosstex would be able to get the latest info from a canonical source and all of this copy-editing would no longer be necessary.

* Listing the publisher is sometimes important. There are a large number of conferences with similar names, or whose names are sometimes misquoted. Listing the publisher can help disambiguate these (unless, of course, the name is implied by the conference, e.g., the ACM Conference on blah, blah, blah.)

* There is no need to list the month of a conference, unless a conference is held more than once a year and confusion is possible. In particular, since some conferences publish conference proceedings after the conference is held, this can be downright confusing. Similarly, there is no need to list the location of a conference (and, in the future, electronic conferences may eliminate the need to list location at all).

* “To appear” should definitely be included on future publications, because sometimes planned publications don’t appear.

* It is not necessary to include a URL for newspaper or magazine articles unless the publication is only online. Indeed, since old articles frequently disappear behind a paywall, it is much better to give the classical “Chicago Manual of Style” type reference for these.

* I don’t think it is appropriate to cite Wikipedia articles (with a few exceptions: for example, one is writing about Wikipedia.) However, if one does feel it necessary to give a Wikipedia URL, go ahead and give the Wikipedia “Permanent Link” (currently the fifth entry under “Toolbox” in the left-hand column of Wikipedia pages) URL.

* If you are writing a paper in English, it is not necessary to write the author name with accent marks even if he or she writes it that way (for the same reason that it is not necessary to write the name of a Russian author in Cyrillic.) English does not have accent marks. However, if you do decide to include accent marks, they should be correct. (For the same reason, one should write “naive” rather than “naïve.”)

Month/address: I might be willing to agree on the month argument, but I think “address” is important because it’s an important way of disambiguating a conference. And I stand by my “it’s not New York” comment. If you’re going to have an address, have the correct one.

To appear: these things are pernicious because you leave it in your .bib file and cite it two years later without noticing that it’s already appeared. If I’m reading a paper in 2011 and it includes a future-dated citation, I can probably infer that it’s “to appear.”

Newspapers/magazines: I agree that you should have the classical citation and that URLs for these things have a habit of only lasting for a year or two until the web engineer decides to rearrange the whole content management system. Still, the long-term trend is for dead trees to go away and for all periodicals to go purely online. It’s good to get in the habit of citing them that way. (Besides, when’s the last time you cited a newspaper article with its proper page number? In classic style, that’s mandatory. Most newspapers which you read online don’t even give you the paper page number.)

Wikipedia: there are some truly good Wikipedia articles on a variety of technical topics. If you need to cite a general introduction to some topic, it’s often as good as anything else.

Accent marks: If you’re citing Ivan Damgård or Peter Schröder, it’s not that hard to get their names right, so why not do it? If you try to Anglicize names, that way lies madness. Do you really want me to type “Peter Schroeder” instead?

I wrote a replacement for bibtex called Crosstex for exactly these reasons (and more).

Crosstex is open source and comes with a reference library of CS citations that appeared at SOSP/OSDI/PODC/FOCS/STOC/USENIX ATC/and a ton of other CS conferences. The citation information is uniform and consistent, because the conference/venue/etc information appears only once and is inherited through an object-oriented database.

It lets you change citation appearance very easily. You can use “Networked Systems Design and Implementation” or “NSDI” at the flip of a switch, “California” vs “CA” etc. This is handy when you hit the space crunch right before submission.

It lets you cite papers by constraints. E.g. if you remember that it was a Felten paper at OSDI in 95, you can cite it through constraints without having to look up the primary key under which you filed the paper.

It runs on Linux, Mac, and Windows, and it is compatible with existing BIB files. I hope you check it out and find it to be as useful as I have.

Don’t cite whole books to support something that appears in a few pages or a chapter in those books. When you need to cite a textbook, for instance, put in a note what chapter or section you’re referring to. \cite[Chapter 10]{CLR} might be one way to do that in LaTeX. Citing entire books is only appropriate when the whole book is supporting the particular argument you are making (e.g., citing The Mythical Man-Month to support Brooks’s law).

I wholeheartedly agree on checking, double-checking, and checking again your .bib files and the corresponding output.

However, I quite disagree with leaving out anything: the BibTeX file is just a database, the more info it contains, the better!!! If you don’t want LNCS volume numbers for conferences (I do want them), just use/write a BibTeX style that doesn’t include them. This way if I find one of your publication’s .bib entry on your webpage, I can use it without needing to complete the missing parts… Same for pages, editors, “to appear” (though I wrote myself a script that periodically reminds me of papers for which I need to check the current status).

For the accents, being French helps 😉 A good idea is to use a UTF-8 compliant tool such as bibtex8 or biblatex-biber (with XeLaTeX it’s even better!!!), however to help my coauthors I tend to remain ASCII-based (even if that implies writing “Fran{\c{c}}ois” in my .bib file).

About consistency and spelling out names, I couldn’t agree more!

About citations for software, I once had to create some because a journal didn’t want a URL as reference in the text (or even in a footnote) 🙁

Finally, the “address” part is quite controversial: as per Oren Patashnik’s own BibTeXing (http://bibtexml.sourceforge.net/btxdoc.pdf end of page 9), this field is for the publisher’s address. It is thus not suited to host the conference’s location… Some BibTeX styles use a non-standard “location” field for that purpose, some people even end up adding the location information in the booktitle, along the conference name… Since I tend to use Biblatex as much as I can I mostly use “venue” as explained in http://mirror.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf 🙂

FWIW, I find the LNCS numbers useful, especially when the Springer site is the only place I can find the paper online. For some reason, I find it difficult to find those LNCS numbers if I don’t actually have a copy of the proceedings; but perhaps this is a transitory problem until Springer’s website improves? In any case, in the meantime, keep those LNCS numbers in there!

When you use BibTeX and you plug in “D.S. Wallach”, what comes out is “D. Wallach” since there’s no whitespace before the “S”.

i tasted a little vomit in the back of my throat when i read this.

Freedom to Tinker is hosted by Princeton's Center for Information Technology Policy, a research center that studies digital technologies in public life. Here you'll find comment and analysis from the digital frontier, written by the Center's faculty, students, and friends.