Tuesday, July 10, 2007

I recently decided to give open notebook science a try. For my lab notebook to be useful to others, I've got to put a little extra effort into making it more understandable to outsiders. I think a lab notebook will never, and perhaps should never, be as easy to understand as a paper, since you want to spend most of your time doing science rather than making beautiful figures and writing stunning introductions. I would simply like to reach the point where someone in a similar field to me could pick up my notebook and understand it without too much effort.

I'm trying to catalog some basic ideas that would promote better open notebooks, with better defined as:

Here's what I've come up with so far. Please comment if you think of other tips.

use some sort of version control system (wiki, cvs, subversion)

this is particularly important if you have an electronic only lab notebook as it creates a time stamp for everything you enter into the notebook, which would be important for patents and other legal stuff

it also allows you to go back and look at previous versions

back up your notebook

with cvs or subversion back up your repository

with wikis this becomes wiki-specific, so check the documentation for your wiki

organize hierarchically

break the notebook into sections

break the sections into subsections

remember to include a time stamp in the text of your notebook at the beginning of each new experiment you do and at the beginning of each section you start

introduce every section giving the bigger picture (not too long, just a paragraph or so on the big idea); a nice figure would be useful too since many scientists prefer skimming figures to skimming text

if a section is complete or dead (i.e. you've abandoned the project), state so very prominently at the start of the section. If the work was published, provide a reference. If the work was abandoned, perhaps explain why.

also if a section hasn't been touched for a long while, you might add something like "This chapter is not being actively worked on"

link to raw data when and where you mention it in your notebook

remember the notebook is public, so be careful not to say stuff that might offend sensitive ears or sensitive scientists

include high quality images in your documents; things like agarose gels will need to be zoomed in a lot to be inspected in detail; if you convert your full resolution tiff to low-quality jpeg, it'll just look like pixelated blah. Then again, you can't always use full-size images, particularly from a high-megapixel camera, because the notebook will quickly become giant; so here is my suggestion:

if the image is small (<1 MB), include it at full size

if it is huge but detail doesn't matter, include a decent resolution image that can be zoomed in 2-4x and still look nice

if it is huge and detail matters, include a decent resolution image, but also include a link to the full size image like you would for other raw data

construct the document in such a way that it is easily indexed by search engines (otherwise no one will find your results; people probably won't read your lab notebook for fun)

the above is difficult to comply with if you use pdfs, because Google currently only indexes the first few hundred kbytes of a pdf; my lab manual is 30MB
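The image rule of thumb from the tips above could be written down as a small decision function. This is only a sketch: the names `choose_image_policy` and `ImagePolicy` are made up for illustration, the 1 MB cutoff is read as 10^6 bytes, and embedding small images unchanged is my assumption about what the small-image case should do.

```python
from enum import Enum

# The "<1mb" cutoff from the tips, interpreted here as 10^6 bytes (an assumption).
SMALL_IMAGE_BYTES = 1_000_000


class ImagePolicy(Enum):
    """Hypothetical labels for the three cases in the tips."""
    EMBED_ORIGINAL = "embed the original file"
    EMBED_REDUCED = "embed a reduced copy that still looks fine at 2-4x zoom"
    EMBED_REDUCED_AND_LINK = "embed a reduced copy and link to the full-size file"


def choose_image_policy(size_bytes: int, detail_matters: bool) -> ImagePolicy:
    """Decide how an image goes into the notebook, given its size on disk."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < SMALL_IMAGE_BYTES:
        # Small image: assumed to be cheap enough to include unchanged.
        return ImagePolicy.EMBED_ORIGINAL
    if detail_matters:
        # Huge and detail matters (e.g. a gel): treat the original like raw data.
        return ImagePolicy.EMBED_REDUCED_AND_LINK
    # Huge but detail does not matter.
    return ImagePolicy.EMBED_REDUCED


if __name__ == "__main__":
    # A 12 MB gel photo where band detail matters.
    print(choose_image_policy(12_000_000, detail_matters=True).value)
```

The function only decides what to do; actually resizing the image is left to whatever image tool you already use.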

please let me know if you have any ideas or suggestions about these rules.

One thing I didn't expect when I started blogging a month ago was to read other people's blogs. But I did, and I've been positively surprised at the quality of the writing in the science part of the blogosphere. I think the lack of top-down editorial control spurs more novel ideas.

I've seen a number of posts in the blogosphere about different aspects of Open Science. I don't want to explain Open Science, particularly since it's not clear exactly what it is yet. But Bill Hooker at 3 quarks daily wrote a nice three part series (I, II, III) on the subject, which you should read if you're interested in the details. Here I'm only going to discuss Open Notebook Science, which is a term coined by Jean-Claude Bradley. The idea is simply that the heart of every person's research - their lab notebook - should be open to the world.

Since most of our scientific work is funded by taxpayers who expect their money to be well-spent, it's interesting that openness isn't required. Science typically builds on the body of available knowledge - the more knowledge available, the faster science goes. It's striking when you visit other labs in person; you see all of their unpublished work, and you know that most of their results and data won't be available to the bulk of the scientific community until a year after each particular scientific project is finished. By the time papers are in print, it's old news to the insiders. More striking is when you visit labs whose work you've thought about replicating and expanding on. It's not too uncommon to find that only one person in the entire lab is able to get the technique to work, and even for him the technique only works on Wednesdays. This type of information would be useful to know before you embark on a useless three months trying to adapt their method. But scientific publications are covered in a thick coat of high-gloss finish, making these unacknowledged difficulties hard to detect.

Lab notebooks on the other hand are flat black. As long as people keep them regularly updated, they contain the good, the bad, and the completely nonsensical results.

Today I test the waters of Open Notebook Science.

The latest version of my lab notebook is now automatically posted on J's Lab Notebook Page each night. I've been using an electronic lab notebook for two years now, so there's quite a bit of data in there - good and bad (300+ pages).

What I hope to gain by being Open Notebook:

a nice warm fuzzy feeling that I have nothing to hide

less likely to be accused of scientific fraud (though I really wasn't worried about this in the first place)

potentially helping others by allowing early access to my results and failed experiments

I really hope people will notice stuff I'm doing wrong and LET ME KNOW - would be a very big benefit if it were to occur

Bad things I don't think will happen by being Open Notebook:

people will take little details of the results from my experiments and nitpick about conclusions I've published based on the results - claiming the results in my notebook don't support the results and conclusions in my publications.

I don't think this will happen, since I'm pretty careful with what I publish and with doing proper stats and such.

people will take my data and scoop me

I think people are busy enough with their own work that they don't need to publish mine.

By putting my data on the web as soon as I make it, I have a pretty strong case to say I'm first (as long as other people see my results too; otherwise, you have the problem with the tree falling in the forest that may or may not make a sound)

Wednesday, July 4, 2007

Soon after starting their first lab, most new professors are disappointed to find that they spend a disproportionate amount of time on marketing and fund raising. There are a lot of smart scientists doing interesting work, and in the end, the ideas that are shouted loudest and most frequently become the accepted doctrine at any given time. That's not to say that working hard towards solving important scientific problems isn't important. But if you work hard and solve an important problem, it is very likely that no one will know unless you go out and advertise it (Mendel, Newton, and Einstein were all lost to the world initially because they were closet geniuses). And since everyone is advertising the important problems that they've solved, science becomes something of a popularity contest.

To be popular, you need to be a constant member of the lecture circuit. Every field has its own set of important conferences. I do bioinformatics research. If you want people to know you in bioinformatics (unless you are old and have already established a reputation through many years on the lecture circuit), it would be very useful to give a lecture at RECOMB, PSB, and ISMB. Enough people hear your story, enough people hear your story again, and again, and again, and they start believing. They start telling their friends. Next thing you know, people with pocket protectors are coming up to you on the street asking for your autograph.

The problem with the lecture circuit is that it is pretty expensive. After you pay for the flight, hotel, etc., you've spent $1000. But you can't just bring yourself. You need to bring a lab member or two with a poster, so that your students can start learning how the lecture circuit works. So the conference costs you $1000-5000. Our tax dollars hard at work.

I know, I'm a little sarcastic, and sometimes useful things like long fruitful collaborations get started at conferences. But you get my point. What we're doing with all of this is a roundabout version of what Coke, Pepsi, Nike, Apple, and State Farm do more directly. We're securing name recognition and our place in the marketplace.

I propose that we be more direct. Why not skip one conference a year? That saves about $3000. Take this money and invest it in Adwords, Google's text-based advertising product. Let me give an example. Recently I was involved in some network inference work that resulted in the PLoS Biology publication, Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. Along with our analysis, we collected 445 E. coli Affymetrix microarrays, and we organized a set of software for benchmarking our algorithms and future network algorithms using the large amount of regulatory information that's already known for E. coli (yes, I'm marketing now, so go check out the paper if you're interested in network inference).

With Google Adwords, you bid on keywords; when those keywords are searched for and you are the winning bidder, your link and a little text goes up on the side of the Google search. For example if someone searches for "network inference", I might want a link that says Benchmark Your Network Inference Algorithm using our open source matlab scripts. Or if they type "E. coli affymetrix", I might want a sponsored link ad that says, Download custom datasets from a publicly available E. coli Affymetrix compendium at M3D. Since there's zero competition for those keywords, each click on my advertisement costs the Adwords minimum: 5 cents. And unlike my $3000 conference where the lecture hall is empty because I got the 8AM slot on the last day of the conference, these people are actually interested enough to click on my ad, which means they'll probably have a decent look at what I have to say. At 5 cents a click, my $3000 will give me 60,000 people that might find my data useful - more than the largest of conferences.
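For anyone who wants to redo the arithmetic with their own numbers, here is a minimal sketch. The function name `clicks_for_budget` is hypothetical, and it assumes every click costs the same flat price, which real bidding may not guarantee.

```python
def clicks_for_budget(budget_dollars: int, cost_per_click_cents: int) -> int:
    """Number of whole clicks a budget buys at a flat per-click price.

    Works in integer cents to avoid floating-point rounding.
    """
    if budget_dollars < 0:
        raise ValueError("budget_dollars must be non-negative")
    if cost_per_click_cents <= 0:
        raise ValueError("cost_per_click_cents must be positive")
    return (budget_dollars * 100) // cost_per_click_cents


if __name__ == "__main__":
    # One skipped conference (about $3000) at the 5-cent minimum bid.
    print(clicks_for_budget(3000, 5))  # 60000
```

Note that this counts clicks, so it equals the number of people reached only if each click comes from a different person.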

We want to create useful science that others can build on. And we want to build on other people's useful science. Why shouldn't we pay a dime to find each other? I know some people will think this is going too far, but we're doing it indirectly already. And why are we doing it? To a large extent, because no scientist has any free time. So if you want another scientist to look at your science, you've got to stick it right in their face. Once it's there, they can decide if it's the science they're looking for.

I think Adwords could free up a lot of time for professors, allowing them to spend less time marketing and more time on the original focus of their professional lives: solving important scientific problems.