Nearly 15 years after completion of the human genome, undergraduate and graduate programs still aren’t adequately equipping future scientists with the basic bioinformatics skills they need to succeed in the “big data in biology” era. Why?

As a project manager and developer of a long-running model organism database (and a former bench scientist myself), I interact with biologists every day. Frankly, I’m alarmed by what I see. Here are some examples of the types of questions I routinely field:

I have a list of genes and I’d like to know the function of each.

I need all the [unspliced|spliced|upstream|downstream|translated] sequence for a group of genes.

I need my data in one very specific file format to support a legacy platform.

I need to do <this generic task> over and over again. It’s killing me and is a waste of my time. Help!

Many junior scientists percolating through the ranks lack the basic skills to address such questions. (I’ll talk about old dogs and new tricks in a subsequent post.) More troubling, they often lack the core skills and initiative to tackle rudimentary informatics problems. These include common tasks like collecting and collating data from diverse sources, searching a wiki, reading a mailing list archive, or hacking a pre-existing script to suit a new purpose.

Bioinformatics is here to stay. Get used to it.

Ten or fifteen years ago, many research institutions displayed significant resistance to (and significant ignorance about) the field of bioinformatics. Was it really science? Was it sufficiently hypothesis-driven? How did it fit into the mission of a research institute or a primarily undergraduate teaching environment? Happily, that resistance has been overcome at most institutions.

Learning bioinformatics isn’t the same as learning a transient laboratory skill. Becoming proficient at running Southern blots or learning a protein purification process might help a student address the discrete questions of their thesis. But in the long term, these are disposable skills learned at great cost.

Not so with bioinformatics. Bioinformatics is a way of thinking. It’s a critical process of organizing information that spills over into many aspects of modern research life. It’s also very easy to develop a useful skill set with a very small time investment.

Frustratingly, many students still have a mental block about programming. They’ve learned (through assimilation and not experience) that programming is difficult. Or they’ve been trained to expect a convenient web interface for everything they need to do. In an ideal world, there would be a web interface for everything. This isn’t an ideal world.

Why has bioinformatics education failed?

I believe that current efforts in bioinformatics education have failed for three reasons.

First, and most fundamentally, bioinformatics training still isn’t universally available. Because of the initial resistance to the field, many institutions still lack qualified personnel capable of teaching entry- and intermediate-level bioinformatics courses.

Second, when bioinformatics training is offered, it’s often as an elective and not considered part of the core curriculum.

Finally, the nature of much bioinformatics training is too rarefied. It doesn’t spend enough time on core skills like basic scripting and data processing. For example, algorithm development has no place in a bioinformatics overview course, especially if that course is the only exposure to the field the student will have.

Can we fix bioinformatics education?

Yes. Look, it’s easy. Students need primer courses on basic skills first. And those courses need to be MANDATORY. Maybe drop the radiation safety course if there isn’t time. Who uses radioactivity anymore anyways? Here are the three core areas that I think all students in cellular & molecular biology, genetics, and related subfields need to master to succeed.

Core Area 1: Data Discovery

Data discovery refers to a related set of knowledge and skills. What data is available and where can it be found? How can it be retrieved? What if there isn’t a web interface or the data needs to be fetched on a routine basis? Being able to answer such questions forms the basis for programmatically accessing and managing data.
Students should learn how to access common data repository structures like FTP sites, web-based data mining interfaces, wikis, and APIs. They should also learn to mine those repositories programmatically, starting with how to write a basic web spider.
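
To make "programmatically accessing data" a little more concrete, here is a minimal Python sketch of fetching one record per gene from a web API. Everything specific in it is an assumption for illustration: the endpoint URL is a placeholder, the function names are made up, and the response is assumed to be JSON. A real repository will document its own URLs and formats.

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical endpoint: substitute the documented URL of the repository you use.
BASE_URL = "https://example.org/api/gene/{gene_id}"


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def build_url(gene_id):
    """Build the request URL for one gene identifier, escaping unsafe characters."""
    return BASE_URL.format(gene_id=quote(gene_id, safe=""))


def fetch_gene_records(gene_ids, fetch=http_get, delay=0.0):
    """Return {gene_id: parsed JSON record} for each identifier.

    `fetch` can be swapped out so the function is testable without a network.
    `delay` pauses between requests so the server is not flooded.
    """
    records = {}
    for gene_id in gene_ids:
        records[gene_id] = json.loads(fetch(build_url(gene_id)))
        if delay:
            time.sleep(delay)
    return records
```

Against a real service this might be called as `fetch_gene_records(["unc-22", "lin-12"], delay=1.0)`. The pause between requests is a courtesy to the people running the server.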

Core Area 2: Data Management

Naming files and datasets consistently and unambiguously is rarely discussed. Nor are data organization and management. These skills are critical for effective analysis, for communication and publication, and for reproducibility.

Boring? Perhaps. But it is absolutely shocking what file naming and management schemes scientifically minded people have created.
Effective data management is not always intuitive. But there are conventions and strategies that can be immensely helpful for transparency, data sharing, and interoperability. Being able to programmatically manage data files is also incredibly useful and a great time saver: rearranging directories, renaming files, archiving files, basic I/O redirection. This is not just for bioinformatics per se; it applies to many areas of biology, such as managing confocal images.
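
As one illustration of programmatic file management, the following Python sketch renames every file in a directory to a consistent pattern. The naming convention (a prefix such as a date and experiment label, then a lowercased name with unsafe characters replaced by hyphens) is only an assumed example, not an established standard, and the function names are hypothetical.

```python
import re
from pathlib import Path


def clean_name(filename):
    """Lowercase a file name and replace runs of unsafe characters with '-'."""
    path = Path(filename)
    stem = re.sub(r"[^a-z0-9]+", "-", path.stem.lower()).strip("-")
    return stem + path.suffix.lower()


def plan_renames(directory, prefix):
    """Return (old_path, new_path) pairs without touching the disk."""
    plan = []
    for old in sorted(Path(directory).iterdir()):
        if old.is_file():
            plan.append((old, old.with_name(f"{prefix}_{clean_name(old.name)}")))
    return plan


def apply_renames(plan):
    """Rename files, refusing to overwrite anything that already exists."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(new)
        old.rename(new)
```

Splitting the work into a plan and an apply step means the proposed names can be printed and checked before anything is actually renamed.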

Core Area 3: Data Transmogrification

Finally, up-and-coming scientists should be able to easily convert files from one format into another.

Again, boring. But useful? You bet. Cast off your Excel shackles.
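
For a sense of how small such a conversion can be, here is a Python sketch that turns a two-column, tab-delimited table (identifier, sequence) into FASTA. The input layout is an assumption chosen for illustration, and the function name is hypothetical; real files will need their own column handling.

```python
import csv
import io


def tsv_to_fasta(tsv_handle, fasta_handle, width=60):
    """Convert tab-delimited (id, sequence) rows to FASTA; return the record count."""
    count = 0
    for row in csv.reader(tsv_handle, delimiter="\t"):
        # Skip blank lines, incomplete rows, and comment lines.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        seq_id = row[0].strip()
        sequence = row[1].strip().upper()
        fasta_handle.write(f">{seq_id}\n")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(sequence), width):
            fasta_handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count
```

Because the function takes open file handles, it works the same way on real files opened with `open(...)` and on in-memory text such as `io.StringIO`.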

A quick note to current graduate-level students

Are you a graduate student in cell biology, molecular biology, biochemistry, or genetics (or related subfields)?

You should be receiving bioinformatics training as part of your core curriculum. If you aren’t, your program is failing you and you should seek out this training independently. You should also ask your program leaders and department chairs why training in this field isn’t being made available to you.

In order for community annotation efforts to succeed, they need to become part of the established research process: mine annotations, generate hypotheses, do experiments, write manuscripts, submit annotations. Rinse and repeat.

A few weeks ago, I posted the following tweet:

A few retweeters responded that in their particular realm of bioinformatics, community annotation was called “community curation” or a “jamboree” and they’ve had various degrees of success. Points taken and effort applauded.

The real essence of my tweet was that community annotation — regardless of what it is called — largely fails or is undertaken on a very small scale because it simply isn’t a priority for biologists.

For those working at the bench, community annotation doesn’t even make the long list of things to do: conducting experiments, writing manuscripts and grants, mentoring, sitting on committees, teaching. Contributing to community annotation efforts simply does not make the cut.

How might we fix this?

1. Top-down emphasis on the importance of community annotation.

Community annotation isn’t required by publishers or funding agencies except to the most minimal degree (e.g., submission of sequences). This needs to change. By making community annotation part of the process of doing research, the research itself will become more reproducible, more accessible to a broader audience, and more stable over time. It should be complementary to writing a manuscript.

Publishers benefit because extracted entities become markup targets to enhance their online product. Funding agencies benefit since having primary authors and domain experts submit annotation suits the mission of transparency and reproducibility and has a presumed efficiency over third party curation.

2. Better tools.

The tools for community annotation are embryonic and do not match the user experience people have come to expect in the Facebook / Pinterest / Instagram / Google Docs era. Bioinformatics teams need to begin employing user interface, user experience, and graphic design professionals to build friendlier, more efficient, and more beautiful tools to encourage participation.

3. Recognition.

Again, in an effort to encourage participation, we need to recognize the efforts of people who do contribute. This system must have professional currency to it, akin to writing a review paper, and should be citable for two reasons. First, it adds legitimacy to the contribution. It’s now part of the scientific record that can be extended by other researchers. Second, the primary contributor can now list the contribution on CVs and in the tenure or job performance review process.

Nanopublications and microattribution represent the most promising avenues for providing suitable recognition with scientific legitimacy that maps to the current academic and professional status quo.

Is Central Serous Retinopathy (CSR) the new carpal tunnel for a generation of over-stressed and over-loaded information workers who spend far too many hours per day staring at screens of varying dimensions?

Central serous retinopathy (or choroidopathy) is essentially a delamination of the retina that occurs when cellular layers that normally serve as a fluid barrier between the choroid and the retina begin to leak. This introduces a bubble or blister of fluid underneath the retina, which results in blurred and dimmed vision.

Although CSR is idiopathic, it has been linked to chronic stress, defined biochemically as elevated serum cortisol levels. This finding is corroborated by an increased incidence of CSR in those with Cushing’s Syndrome (chronic overexposure to elevated levels of cortisol). Men are affected more often than women, with an age of onset between 20 and 50, averaging around 45.

I’ve been having progressively worse vision problems since December that I had attributed to floaters or sleep deprivation. Given the sad state of my own personal health care coverage as a self-employed worker and the prevalence of holidays and work deadlines around the turnover of a new year, I didn’t get around to checking this out until this week. After a standard eye exam, I was tentatively diagnosed with Central Serous Retinopathy (CSR), confirmed a few days later by fluorescein angiography.

My symptoms currently include a large purplish gray blotch almost dead-center in my field of vision; completely distorted visual acuity that’s not just blurry but makes straight lines look broken and covered with Adobe’s marching ants from using the lasso tool; micropsia (things appear smaller than they do in the unaffected eye); loss of several aspects of color perception; and, surprise, everything looks dim and desaturated.

I’m certainly not a high-stress individual. I’m not Type-A; I don’t go around yelling at people. I am, however, a perfectionist, although I’ve softened in my old age. Now I’m satisfied if things are done as well as they possibly can be with the time and team available.

I do work hard and I work long hours and have been doing so for many years.

Here’s a brief outline of a typical day for me.

Wake up early, anytime between 3 and 4:30 am. Roll over and check the time on my phone. Check my email. Read about things I need to deal with and decide to just get up. Espresso. Since I’m a teleworker, lunch was almost always a working lunch at my desk. And without any seminars or Bits ‘n’ Nibbles to attend in the afternoon, I’d work straight through until 6, 7 or 8, with a full work day of 15, 16, or 17 hours. Multiply that by seven and I was typically logging close to 100 hours a week, every week, weekends and holidays included.

So what am I doing to change? First off, I’m no longer tethered to my phone. If I’m not working, I’m not answering work emails. I’m waiting until I’m actually at my desk to start working. And I’m making every effort to reclaim my weekends and holidays by not working at all on those days. And I’m keeping my fingers crossed that I retain my vision.

Nearly 50 years after Sydney Brenner’s letter to Max Perutz set the wheels in motion for the use of Caenorhabditis elegans as a potent genetic model system, leading eventually to six Nobel laureates and a global research community numbering in the thousands, a new threshold has been crossed.

Starting with the latest release of the C. elegans genome (WS232 in worm-speak), the genetic map is now FROZEN. Recombinational distances have changed very little over the last three years, a testament both to the fine granularity of the genetic map and, perhaps, to shifting tides in experimental approaches.

New mutations, deficiencies and rearrangements will still be placed on the map but simply assigned an interpolated genetic position.
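
The post does not say how interpolated positions are computed. Purely as an illustration of the idea, the Python sketch below assumes simple linear interpolation between the two flanking markers whose physical (bp) and genetic (cM) positions are both known. The function name, the marker values, and the clamping behavior at the ends are all assumptions, not WormBase’s actual method.

```python
from bisect import bisect_right


def interpolate_genetic_position(physical_bp, markers):
    """Estimate a genetic position (cM) from a physical position (bp).

    `markers` is a list of (physical_bp, genetic_cM) pairs sorted by physical
    position. Positions outside the marker range are clamped to the end markers.
    """
    positions = [bp for bp, _ in markers]
    i = bisect_right(positions, physical_bp)
    if i == 0:
        return markers[0][1]
    if i == len(markers):
        return markers[-1][1]
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    # Fraction of the way from the left flanking marker to the right one.
    fraction = (physical_bp - bp0) / (bp1 - bp0)
    return cm0 + fraction * (cm1 - cm0)
```

With made-up markers at 1,000 bp (-2.0 cM) and 3,000 bp (0.0 cM), a new mutation at 2,000 bp would be assigned the midpoint between the two genetic positions.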

Welcome!
My name is Todd Harris. A geneticist by training, I now work at the intersection of biology and computer science developing tools and systems to organize, visualize, and query large-scale genomic data across a variety of organisms.

I'm driven by the desire to accelerate the pace of scientific discovery and to improve the transparency and reproducibility of the scientific process.