Should biologists study computer science?

Science has published a pair of articles in which it's argued that biology …

As in just about every other field, computers have become an essential part of biological research. Complicated algorithms and analyses that once took months of work by specialists are now available as Web services, and whole areas of study, such as genomics, can be pursued entirely in silico. But, even though most biologists know how to plug in their data and act on the output of computational tools, precious few understand the math that's going on behind the scenes, as most bioscience degree programs don't require computer science or any math more advanced than calculus.

Two papers in the latest issue of Science argue that that's a bad thing. One focuses on the ability to represent the behavior of biological systems through algebraic notation, an area that's badly neglected in both science and math education. The second focuses generally on the incorporation of biology-specific math and computer science into the education system. Both assume that the lack of a math background is a serious problem.

In general, as someone who has done a small bit of bioinformatics and a lot of biology, I'm the perfect target audience for this argument. But in reading the papers, I came away with the sense that the authors have lumped different arguments together in a way that confuses the real issues. So what follows is my attempt to separate them out and evaluate each issue separately. The first problem arises in the paper from Pevzner and Shamir, which treats the terms computational biology and bioinformatics as two names for the same discipline. That may be how things are commonly understood but, to me at least, these are two separate endeavors.

Bioinformatics, as its name suggests, is primarily focused on the computer-aided analysis of data generated in biological systems, such as genome and gene expression array analysis. We'll get back to that later. Computational biology involves the attempt to model biological systems in silico. These models are informed by the biology, but they don't necessarily require any biological data to be fed to them in order to run.

Obviously, anyone performing computational biology better have a really good grip on both biology and math/computer science, or they won't be able to know whether the models are valid and fix them if they're not. The same really doesn't apply to bioinformatics. Since there's always real, underlying biological data there, the computation and analysis can be separated—a bioinformatician can simply turn to a biologist and have them sanity-check the results.

Fundamental, tool, or service

So, if we accept that everyone doing computational biology better know both math and biology, that's still not evidence that regular biologists need math. Most regular biologists will end up using bioinformatics tools to align DNA sequences, pick primers, etc. So do they need to know the math behind the tools? I think to answer that, you have to understand where bioinformatics sits on what I'd call the fundamental/tool/service spectrum.

For biologists, fundamentals are things like organic chemistry. All of biology ultimately depends on it, and every biologist should really know something about it—even field biologists, who will have to consider things like how diet and environmental chemicals affect the organisms they study. Bioinformatics really isn't a fundamental; knowing how certain calculations are performed won't necessarily tell you anything about biology.

In fact, it's somewhere between a tool and a service. A tool is something that an average biologist will wind up using that has some biology behind it. So, for example, it's possible to use PCR to amplify DNA samples without knowing anything about what's going into the tubes used for the reactions. But it's much better if a biologist does know; the reactions behind PCR illustrate biological principles, and are essential knowledge for troubleshooting the procedure when it goes wrong (as it inevitably does). In contrast, DNA sequencing, which used to be a tool, has become a service. You put your DNA sample in the mail, and download the sequence data from an FTP account a few days later. The precise details of the actual sequencing reaction that was performed don't really matter.

For the most part, bioinformatics software, such as the tools for sequence search and alignment, is analogous to a service: the computer spits out a useful result, and you really don't care how it got there. If you can't get a decent result, your first response isn't to look for someone who knows math; it's to look for someone who's more proficient with the service and knows how to tweak the input parameters. Knowing the math behind things might help with the tweaking or with appreciating the underlying biology, but it just as well might not; empirical experience can be more useful in many cases.
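To make "the math behind things" concrete, here is a minimal sketch of the scoring step at the heart of global sequence alignment (the Needleman-Wunsch dynamic program). It is an illustration only, not the code of any particular tool; the function name and the default `match`, `mismatch`, and `gap` values are assumptions chosen for the example, and they stand in for the kind of input parameters a proficient user would tweak.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b.

    Hypothetical helper for illustration; the scoring values are
    arbitrary defaults, not those of any real alignment service.
    """
    # prev[j] is the best score for aligning the already-processed
    # prefix of a against b[:j]; initially that prefix is empty,
    # so every character of b[:j] is paired with a gap.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] against a gap
                curr[j - 1] + gap,   # b[j-1] against a gap
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))           # default gap penalty
print(global_alignment_score("GATTACA", "GATACA", gap=-1))   # cheaper gaps
```

Changing `gap` changes the score, and in a full implementation the reported alignment, without any change to the underlying data, which is why experience with the parameters can matter as much as knowing the recurrence.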

In a worst case scenario, of course, biologists can always resort to contacting someone who has training in bioinformatics, in much the same way as a biochemist might contact an immunologist if they needed to know more about that field.

That's supposed to be helpful?

If bioinformatics is a service, why isn't knowing how to use something as a service good enough? The authors simply state it is without providing an explanation. "For example, biologists sometimes use bioinformatics tools in the same way that an uninformed mathematician might use a polymerase chain reaction (PCR) kit," they write, "without knowing how PCR works and without any background in biology." Presumably, we're supposed to view that as problematic, although the authors never explain why it is.

The second paper, from Robeva and Laubenbacher, isn't brilliant about supporting its position, either. It's a sort of plea for education in algebraic modeling, which can apparently be used to represent biological systems. The authors make their argument by using a textbook case: the Lac operon, a gene regulation system that appears multiple times in a typical biologist's educational history, probably starting at AP bio in high school. In modeling terms, however, the Lac operon needs three equations to be described, one of which takes the form:

$$\dot{L} = k_L\,\beta_L(L_e)\,\beta_G(G_e)\,Q - 2\phi_M\,\mathcal{M}(L)\,B - \gamma_L L$$

They point out that presenting it in Boolean terms leads to a simplified diagram that still captures the essential features of the system. Even when simplified, however, it's not obvious that the model is any more informative than the standard textbook description, which refers directly to the biology. And I'm skeptical that knowing the model would actually improve a biologist's ability to perform biological research.
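For readers who haven't seen one, a Boolean model of this kind can be sketched in a few lines. The update rules below are illustrative assumptions written for this example, not the rules published by Robeva and Laubenbacher: each component (mRNA, enzyme, internal lactose) is simply on or off, and external lactose and glucose are treated as fixed inputs.

```python
def step(state, lactose_out, glucose_out):
    """Apply one synchronous update of a toy Boolean lac operon.

    state is (mrna, enzyme, lactose_in); all rules are hypothetical
    simplifications chosen for illustration.
    """
    mrna, enzyme, lactose_in = state
    # Transcription needs glucose to be absent and lactose to be available.
    new_mrna = (not glucose_out) and (lactose_in or lactose_out)
    # Enzyme is produced whenever mRNA was present in the previous step.
    new_enzyme = mrna
    # Lactose gets into the cell only if it is outside, the transport
    # enzyme is present, and glucose is absent.
    new_lactose_in = (not glucose_out) and lactose_out and enzyme
    return (new_mrna, new_enzyme, new_lactose_in)


def run_to_steady_state(state, lactose_out, glucose_out, max_steps=20):
    """Iterate until the state stops changing; None if it never settles."""
    for _ in range(max_steps):
        nxt = step(state, lactose_out, glucose_out)
        if nxt == state:
            return state
        state = nxt
    return None


off = (False, False, False)
print(run_to_steady_state(off, lactose_out=True, glucose_out=False))
print(run_to_steady_state(off, lactose_out=True, glucose_out=True))
```

Under these assumed rules the operon switches on when lactose is present without glucose and stays off otherwise, which is essentially the behavior the textbook description gives in words.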

This probably comes across as overly harsh—to a certain extent, the authors have a valid point: the more biologists know about the tools and services that they rely on, the better off biology as a whole will be. Informed researchers are more likely to notice anomalous results and squeeze more information out of their data by better deploying existing tools. And the authors' suggestion that we design mathematics courses that will prepare biologists to solve the problems they'll ultimately face would undoubtedly produce a more appealing math education.

But the same sorts of things can be said about biostatistics and physical chemistry, and it's rare to see either of those made a requirement for undergraduate degrees or doctoral programs. (The former would have been very useful at several points in my research career, and even more useful now.)

If the argument is going to be made that biologists should learn more math and computer science, then those advancing it need to do a better job of explaining what, precisely, biologists need to understand about the computational tools, and why simply knowing how to use the tool isn't good enough. There's also a practical issue at play; the authors argue that these additional computation courses be added to educational programs that are already loaded with required courses. That's pretty difficult to justify, especially given the other deserving topics that are already omitted from most program requirements.

In the end, these papers avoid the key questions: what, specifically, do biologists need to learn, and how will it help them perform their primary function, namely biological research? Without that information, it's going to be impossible to actually design a course that might improve anything.

58 Reader Comments

I'm reading a lot of good comments on what should (and some of what shouldn't) be done to educate successful biologists today. My feeling is that, slowly but surely, the curricula are in fact heading in the right direction, although it's definitely frustrating to know that my friends and colleagues are publishing papers with an often very basic understanding of the programming and statistics that they used. My graduate program in biology at MIT has recently created a requirement that all graduate students take a quantitative course among their first-year classes -- either one of the existing courses, which focus more on modeling or on programming, or a new class that I helped develop (as teaching assistant).

This new course is really only a survey, but touches on what I see as the three areas of quantitative biology which have been mentioned throughout this thread: statistics/analysis (eg t-test or fitting a curve), programming (eg pattern finding) and modeling (eg differential equations). The problem, as we saw it as educators, is one of accessibility: we biologists all use rudimentary aspects of modeling and computational analysis (maybe not necessarily programming), but shy away from the literature that really goes into these in detail. I really wish we could squeeze more of these ideas into the rest of the curriculum, but this requires a generation of professors who've got these quantitative skills. It's a slow process, but I think the "-omics" era (genomics, proteomics, etc) means that most people in my part of biology (molecular biology) will either have to form collaborations with skilled computational biologists or will have to learn some of these skills on their own. (Most of the new labs in our department have at least plans to produce data requiring more involved computational analysis, which is not true of the older labs.) It would of course be great if similar meaningful requirements existed at the undergraduate level, as these commentaries in Science propose.

It's a lot harder to get into the right mindset for thinking about quantitative biology if you're finishing a post-doc and haven't done any quantitative work since pre-calculus in high school. If we can teach the right mindset -- any class focusing on differential equations, python or statistics should at least be a good start -- then people can take things from there and teach themselves the rest.

This is a fascinating question that I need to think more about, but I feel this may already be true. Almost all results are interpreted through a model.

Not at all true, at least if you mean a mathematical model. Take cardiovascular disease (since that's what I was working on before I left the bench) - there's no math or CS model of the disease, and it's the leading cause of death in the developed world. But there's lots of feeding, dissection, histology, immunohistochemistry, real-time PCR, image analysis and finally data analysis involved.

We're a long way off having a computer model AFAIK.

Were I training or employing new scientists, I'd be much more likely to hire someone who could show me that they knew their way around a lab (you'd be surprised how many times I had to show people how to do serial dilutions!), that they understand how and why to analyze their data, and that they're able to eloquently present and write up their data. I don't care if they don't know how GraphPad Prism works on a code level, just that they know how and why to use it as a tool.

Originally posted by kaitliac: Personally, rather than the normal "entry level Java/C#/C++/Basic/etc" "CS for non CS majors" course that is basically a "well here's a variable, here's an operand, here's a function, now you write some throwaway code, that was a waste of time and aren't you glad you didn't go into CS?" course, I'd really think a higher level, theory and thinking focused course would be the way to go. Something that didn't try, for instance, to teach the basics of programming in a particular language, but instead tried to teach the basics of "what it's all about," possibly using some light examples from various languages. Certain specifics for algorithms can be taught using language neutral symbolism without getting tied up in the particulars and issues of an actual compilable programming language.

In essence, more a philosophy of programming/computer science and programmatical thinking course.

I'd love to see something like this and think it'd help out a lot of students. That said, calling it basic is kinda like billing it as a 101 course for non-majors when it tackles concepts best discussed in a 200 or 300 level course for majors at most schools. It'd be great if you had strong science programs and made it a requirement for just them - for a general core-requirement like some schools have, it'd be dumbed down so the liberal arts students could take it and not fail. Forcing the future starbucks employees to take them is what kills "cs for non cs" courses at most schools.

Well, there may come a point where computer science becomes part of the basic curriculum for any school. However, we are not quite there yet. I do think there should be a course on something like Applied Computer Science or Engineering which focuses more on maximizing the computer's potential in a given subject area.

best discussed in a 200 or 300 level course for majors at most schools.

When I use the word "basics" I'm not necessarily envisioning a "basic" course, in terms of university course terminology.

Instead I was thinking more of a high level course, and yes at least 200 level, probably more 300. I realize this is a point in time where most people groan about anything that's not core to their major, but it's really the best time I could see it applying.

To be honest, I wouldn't even want the words "programming" or "computer science" anywhere in the course title.

Something more like:

Philosophies of structured analytical thinking in scientific fields as they apply to computer aided problem solving and modeling.

Assuming THAT wouldn't scare everyone away who it might benefit, of course =/

quote:

if we can teach the right mindset -- any class focusing on differential equations, python or statistics should at least be a good start -- then people can take things from there and teach themselves the rest.

Exactly. And, really, these sorts of mindsets are very cross discipline. They provide a way of looking at things that can open new perspectives and in turn help formulate new ideas that might otherwise have been overlooked. As such, I don't think their value can be emphasized enough. The problem is courses which might help lead in the direction of learning them often bog students down in minutiae that really aren't relevant to their field of study, which tends to turn them off to really learning beyond a rote memorization.

Originally posted by Dr JonboyG: Anyway, this is the kind of great science writing that Ars should be proud of. Not like that tripe that Gitlin person used to write.

You obviously don't know what you are talking about, that Gitlin person was a great science writer. He might fish for compliments every now and then, but his attention to detail made him a great science writer.

I've never had a Computer Science course beyond a 101 introduction to programming. Since I am mostly self taught I don't know what I am missing with some of the more advanced Computer Science courses. What would the advanced Computer Science courses give a scientist who already has a working knowledge of programming and strong math and statistics skills?

My sense is that if you are a self-taught programmer that is comfortable with the math and statistics, you may need to make sure that you are familiar with software engineering. Many highly competent self-taught programmers don't really understand how to write code for others to use. Things like design-by-contract and design patterns can increase the usefulness of whatever algorithms that you do develop tremendously by making the code easier to use, extend, and maintain. I have dealt with my share of smart self-taught programmers that can't discuss the design of the code until after they write it, and consider design a waste of their time (leading, not surprisingly, to an unmaintainable pile of crap -- at least from a pure software perspective).

I think all of these birds should be killed with one stone: a statistical computing course. As a computational biologist, most of my contact with wet biologists has taught me that they don't need more math or CS skills, they need better stats skills. Most wet biologists barely know what a p-value or t-test is, let alone how to do any kind of analysis beyond what a basic knowledge of Excel provides - even though such basic statistics are crucial to the publication of their results.

Biologists don't need to learn C++ or (groan) Java. Something more practical like R would serve them well, and could be taught as part of a general framework for analyzing data, rather than a course for learning CS principles (which they will most likely never need again).
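As a rough illustration of what sits behind a t-test, here is a minimal sketch of Welch's two-sample t statistic. It is written in Python with only the standard library to keep it self-contained (the comment above suggests R, where a built-in test function would normally be used instead); the function name and the sample values are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom.

    Hypothetical helper for illustration; real analyses should rely on
    a tested statistics package rather than hand-rolled code.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    # Difference in means, scaled by its estimated standard error.
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


control = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up measurements
treated = [2.0, 3.0, 4.0, 5.0, 6.0]   # made-up measurements
print(welch_t(control, treated))
```

Turning the statistic into a p-value requires the t distribution's cumulative distribution function, which is not in Python's standard library; that step is where a statistics package or a lookup table comes in.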

Interesting article and comments, though I would reiterate what various commenters have said already - I am a biologist because I am interested in biology. Not maths, computer science (though computers are cool) or statistics. Whenever I need to know how to do something that's out of my expertise, such as statistics, I go and learn it, but I'm not going to go and spend precious lab time doing a course that'll largely bore my socks off just for the sake of it (in fact, I did a stats course during my PhD and spent 90% of the time sleeping). I don't hear any cries for physicists to be taught speciation, taxonomy, molecular biology etc.

I've no doubt some basic statistics should be included in biology undergrad and postgrad courses (generally, it is, in my experience), but there are other, much broader skill deficiencies that also need addressing, such as writing and interpersonal skills. I would imagine that a well crafted manuscript or grant application would further a biologist's career far more effectively than a basic knowledge of C++.

Oh, and I have a very good understanding of how PCR and BLAST work, as should any molecular biologist.

This is a difficult issue. Those from a maths orientated background think the biology is easy and those from a bio background consider the maths as a fortunate secondary. The second point of dividing computational biology and bioinformatics is very valid, though from experience the computational biology is still only a model: a nice tool to trim down the possibilities that empirical evidence proves. Over-reliance on the underpinning maths of a model is a flaw, as a model is postulated to encompass known variables in a system. If a variable has not been considered or the underlying physical premise is incorrect, then however perfectly constructed, the algorithm is wrong. Analysing data in this manner requires a holistic knowledge of some maths (not a lot, just enough to tell you what the equation means), lots of chemistry and biology and, importantly, a factor that has been ignored in the present discussion. The most important factor is the ability of the student/postgrad/postdoc/prof to look at the data and analyse it critically. To analyse data you require a fundamental knowledge of the science in an area but, in tandem, a bank of resources to call upon to increase that base, coupled with expert opinion to clarify your thoughts.

In an increasingly interdisciplinary world, we are faced with the issue of what to teach. My own feeling is we should stick to the traditional subject boundaries at high school and undergraduate level, as the base knowledge has to be acquired. From experience it's just as hard to teach biology to a maths bod as it is maths to a bio bod; what is truly necessary is a freedom of mind at postgrad level to accept and seek new knowledge openly, with or without the structure of a program. Part of the problem with too-structured interdisciplinary programs is not establishing the base knowledge, and pacing the students through the transition from traditional disciplines to a wider perspective. Has anyone ever tried to in-lab teach optical spectroscopy to geneticists with no preliminary lectures?

"If the argument is going to be made that biologists should learn more math and computer science, then those advancing it need to do a better job of explaining what, precisely, biologists need to understand about the computational tools"

The comment from moreddt may have said it best: biology is becoming a science of distributions instead of individual values. We are reaching a point where studying a single gene, single protein, or even a single pathway is not enough because we have realized that everything is actually working in very interconnected and complex networks. Biologists need to understand that just learning facts about this or that will not lead to many more interesting discoveries or breakthroughs. The frontier of biological research is inevitably going to be complex systems, and to truly deliver on the promises of personalized medicine and treatment of complex diseases we are going to need to dive heavily into this area. If biologists want to be more than fact checkers (which the vast majority already are) then they are going to need to be able to build some models and do some quantitative analysis. As a bioinformatics PhD with a focus in systems biology I am obviously biased. But I am also scared, because I see more and more biologists leaning on bioinformatics people, and with 7 kids in my undergraduate bioinformatics program and 400 in the biology program... we are going to have some problems. Overall I would really just like for biologists to learn modeling so that biological understanding can move forward at the same pace as the technology that is making it possible. It is not even that I think biologists will be left behind, but rather that scientific progress as a whole will be stifled because there will not be enough people working on the most important parts.

Usually there is no problem with treating tools like black boxes, but the idea here is to teach people to recognize when and how to create better tools.

My background is currently physics and CS. I've taken classes on artificial intelligence and data mining. A.I. taught me about search spaces and how to use heuristics, and data mining taught me that the real magic happens in preprocessing: everything else is basically just massaging noise.

Let's take drug design as an example. Some form of in silico filtering is used to identify promising candidates, which are tried experimentally. The problem is they're using crude heuristics such as pharmacophores, and then wonder why they can't predict well. Introducing domain knowledge (beyond 'structure determines function') into the process should produce much better results.

Computational biology is far worse. We're approaching the ability to simulate an entire cell atomistically. In 10 or 20 years, perhaps we'll be able to do whole tissues, etc. How are we going to analyze all that raw data? You need to be able to automatically zoom in to individual cells/proteins/etc and recognize when something interesting is happening. Eyeballing 3d viewers is nowhere near good enough. These are the types of problems we need to be tackling.

Back to my original point: Cutting edge science requires cutting edge tools, and if they don't exist, you need to build them. This presents another problem: communicating complex requirements to a developer is guaranteed to have problems. Most developers will plow along merrily, and because they don't know biology and didn't understand the requirements, will make some fundamentally wrong assumptions and have to rewrite the whole thing 2 or more times. One solution is to use a better language such as lisp. The real solution (however unrealistic) is to make the biologist an expert programmer.

I have a lot of experience with this as I have graduate degrees in both CS and molecular bio and work in a genetics lab.

The big problem that you have today in most (cutting-edge) bio labs is the inability of CS majors and Biologists to communicate with each other. The foremost reason for teaching Bio majors some computer science is that most Biologists have literally no clue as to how the technology behind their craft works, and it's extremely difficult for them to translate their needs into software/hardware.

Biology is computerized now. Period. People that are working towards a research degree in biology will find that all the underlying "things" required to perform their research involve robots, computers and software. You need to know about this stuff.

There are 'bioinformatics' tools such as BioPerl/Python, etc. Things for working with FASTA files, primer design scripts, sequence alignment tools, etc. These are useful to have in your bag. However, I would advocate some more CS'y type things as well.

2. Development time - how long will it take (ballpark estimate) to build me a tool to do XX. This type of knowledge would aid a biologist in planning his/her own projects.

3. Good programming practices - Most biologists use a shotgun approach to everything ("let's do 1000 replicates and then pick the 10 that actually worked"). Projects where the requirements change on a daily basis put an enormous amount of pressure on developers.

Therefore in addition to knowing what BLAST and FASTA files are, I would also require biologists to take the following courses:

Originally posted by j4ke:I think all of these birds should be killed with one stone: a statistical computing course. As a computational biologist, most of my contact with wet biologists has taught me that they don't need more math or CS skills, they need better stats skills. Most wet biologists barely know what a p-value or t-test is, let alone how to do any kind of analysis beyond what a basic knowledge of Excel provides - even though such basic statistics are crucial to the publication of their results.

Biologists don't need to learn C++ or (groan) Java. Something more practical like R would serve them well, and could be taught as part of a general framework for analyzing data, rather than a course for learning CS principles (which they will most likely never need again).

I agree that "bioinformatics" courses should be learned as part of a modern mol. bio curriculum.

However, in addition to the normal sequence-based grunt work, there invariably come these monolithic software projects ("Oh! Let's put all our sequence data on our website!") where the engineers get involved. It's at this point where the lack of fundamental programming knowledge will shine through.

You don't have to be a programmer, but you should be familiar with programming when these things come up.

Originally posted by j4ke:I think all of these birds should be killed with one stone: a statistical computing course.

Amen. I have an academic background in computing and bio and work for a large pharma in a mathematical biology department. Too often our biologists are too uncomfortable with statistics to know when their work is significant, or significantly better than the status quo.

That said, I don't think learning more about computing would add much value by itself. Yet a course on, let's say, statistical pattern recognition that uses R would be a great foundation not only for understanding the basis of uncertainty & confidence in science, but also for the pragmatics of computing toward those ends. And let's teach it using examples from biology, not math.

quote:

This is a fascinating question that I need to think more about, but I feel this may already be true. Almost all results are interpreted through a model.

When looking for and proposing new biomarkers, pharmas demand not only statistical significance but biological plausibility. Every novelty in this industry is driven by whether it: 1) seems real, and 2) makes sense, given what is known/believed now.

All new science is driven by models -- novel concepts must better explain (or sometimes replace) existing conceptual models of How It Works. Yes, I think in time mainstream biology will encompass the use of computational simulation, and to some degree it's happening already. Smaller components of biological processes are routinely modeled using a variety of mathematical techniques, largely to explore what-if questions or to scope out a subsystem's sensitivity to its parameters. At present, the (sub)systems are still small. But that will change as instrumentation improves (providing more parameter data) and as constraint satisfaction techniques can fill in the gaps (whether numerically or by means of black box abstraction).

How will biologists unskilled in math deal with such models or their results? In short, they won't.

Now, training graduate students in 2009, the system has (d)evolved to a point and click interface with a cookbook series of steps for data reduction, model building and refinement.

No, but someone better know so the mouse and click folks have someone to ask when it doesn't work.

I was one of those undergraduates that did 'point-and-click' x-ray crystallography in 2006-2007. It's a fantastic tool (I even get bond lengths)! I'd love to know more about the maths behind it, but to be honest, it's pretty nasty. Didn't the guy who invented it get a Nobel prize in physics? Doesn't it involve some ugly complex number equations? Now I'm not stupid, but I'm not a mathematician either; I think you would struggle to teach that to biology graduates.