A computational biologist's personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery

Saturday, August 05, 2017

Computational Biology & Math: Am I Just Faking It?

Over on Quora a common type of question is "Can I be a computational biologist if I am now an X". Personally I take a very broad view and think just about anyone with intellectual curiosity can become any kind of scientist. A related type of question is "how skilled do I need to be in Y to succeed in computational biology", where Y is most often programming, biology or math. I got to thinking about this and started wondering whether I am actually at all skilled in math. Here are the results of that analysis.

First, a bit of personal background. Math was a common topic of conversation in my home growing up.

Dad was an engineer whose ideas truly went far -- atoms from one of his instrumentation designs are now floating around Jupiter. When I was young in the 70s, we'd get a new calculator pretty much every year, so I got to see how they got smaller and more powerful each year. I'd goof around on them, fascinated by the LED display. Every now and then I'd ask what one of the more obscure functions was, and sometimes I'd even listen to Dad's careful explanation.

When I saw the movie Hidden Figures this year, I had a jolt of recognition: that's what Mom did! Not for NASA, but for the phone company -- she had calculated the signal propagation through phone cables and helped plan for capacity surges and such. My mother majored in chemistry and did one year of graduate school in that subject, but in the 1950s a woman chemist could encounter overt discrimination. She had a couple of job offers in that space, but took the math job -- she had been a math minor. Which leads to an existential issue for me: if she had taken one of those jobs she was excluded from, she wouldn't have met Dad. So I exist in part due to the hideous sexism that existed in the 1950s!

When I was growing up she mostly stayed home, but then started a science and math tutoring business. Once I even took on a client when she was booked solid (my classmate was not entirely thrilled I think to be tutored by someone he usually didn't socialize with, but it was a good session). Mom also would point out sometimes how some of the "new math" we had in school was stuff she didn't see until college or even that year of grad school. So that gave me some perspective on how education had changed over the years.

So when I got to school, I did already know a bit and wasn't always shy about pointing it out. We did have some basic set theory (union and intersection) early on and that was unfamiliar. When we got to division I exulted, because finally I was learning something. Later I'd learn more -- negative numbers and other stuff. But many times I felt like math class just went too slow -- I wanted to leap ahead. I felt ready for Algebra the year we did pre-Algebra and something harder when we had Algebra.

Some of this was salved by reading -- I remember a Martin Gardner book that introduced me to snowflake curves and other mindbenders. On a long summer vacation one year a brother brought Goedel, Escher, Bach: An Eternal Golden Braid. I was 8 or 9 and won't claim I read much beyond the dialogues between Achilles and the Tortoise and looked at the pictures and captions, but a number of bits there stuck in my head.

Then came ninth grade and Geometry. I loathed the teacher and I grew tired of proofs upon proofs. The next year was back to Algebra, but it seemed like mostly treading water. So by 11th grade pre-Calculus I was starting to slack off and then for 12th grade Calculus I was fully in goof-off mode -- and my grades showed it. So I bombed the chance to place out for University of Delaware.

I took two math-type classes at Delaware, and neither gained me much. I had to take a Calculus class for my major, and completely regretted not placing out as it was all old stuff. I took a Statistics class, but it was designed to spread elementary statistics over three semesters for the Business students and moved at a horrifically slow pace. I wish I had taken the stats class in the Psychology department; my roommate was a Psych major and it clearly honed his skills and focused on experiment design.

So that's the formal and a bit of the informal. So how'd it all work out? Let's look by subject area.

Arithmetic and Basic Algebra: Certainly I use simple arithmetic regularly. Logs are my friend; more on that in the stats section. Basic algebra -- setting up and rearranging linear equations -- pops up periodically. I can't remember solving any quadratics from grad school on, but perhaps I've just shoved that out of my mind. I once wrote a simple linear regression program, but I doubt I could do that now without a lot of internet searching. That's another problem: there are a lot of skills that I once had which have disappeared due to disuse.

Geometry: The only thing I can remember using professionally that I would call geometry is the distance formula. Getting comfortable with the idea of distances in N dimensions is a key bit, since so many clustering methods rely on distances. I appreciate the concepts of proofs and certainly enjoy thinking about them, but have little interest in doing them myself.
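As a small illustration of that point, here is a minimal sketch of the distance formula generalized to N dimensions. The function name is made up for this example; Python 3.8 and later also provide `math.dist`, which does the same job.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference along each axis, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))              # 5.0
print(euclidean_distance((1, 2, 3, 4), (1, 2, 3, 4)))  # 0.0
```

The same function works unchanged whether each point has 2 coordinates or 20,000 (say, one per gene in an expression profile), which is all that distance-based clustering needs.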

Calculus: The two times I've tried to use calculus have both crashed and burned. I stayed up late one night trying to do an integration: the next day I consulted our stats post-doc and he let me down gently that you can't integrate the Gaussian in closed form (its antiderivative can't be written with elementary functions). These days I'd just find the right library in R or Python to get that distribution (I wanted the area at extreme tails, which standard printed tables don't give). Another time I tried to use Newton's method to find isoelectric points. After botching this, I realized that getting to within 0.1 pH unit was precise enough, so just scanning in 0.1 increments from 0 to 14 would work fine. So one lesson learned: sometimes there are really fast, stupid ways to solve problems, such as libraries or brute force. Still, at least I haven't, like one research group, proudly published as a new idea the concept of using infinitesimals to measure the area under the curve.
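Both of the "fast, stupid" routes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code actually used: the function names are invented here, the pKa values are rough illustrative numbers (published tables differ), and the charge model is the simple textbook one that treats each ionizable group independently.

```python
import math

def upper_tail(z):
    """Area under the standard normal curve to the right of z, via the library erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Illustrative pKa values only (an assumption for this sketch).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6

def net_charge(sequence, ph):
    """Approximate net charge of a peptide at a given pH."""
    positive = [N_TERM_PKA] + [POSITIVE_PKA[aa] for aa in sequence if aa in POSITIVE_PKA]
    negative = [C_TERM_PKA] + [NEGATIVE_PKA[aa] for aa in sequence if aa in NEGATIVE_PKA]
    charge = sum(1.0 / (1.0 + 10 ** (ph - pka)) for pka in positive)
    charge -= sum(1.0 / (1.0 + 10 ** (pka - ph)) for pka in negative)
    return charge

def isoelectric_point(sequence):
    """Brute force: try pH 0.0, 0.1, ..., 14.0 and keep the one nearest zero net charge."""
    grid = [i / 10 for i in range(141)]
    return min(grid, key=lambda ph: abs(net_charge(sequence, ph)))

print(upper_tail(0))   # 0.5
print(upper_tail(6))   # a far-tail area that printed tables typically omit
print(isoelectric_point("MKDEHRYC"))
```

The scan evaluates the charge function only 141 times, so there is no derivative to get wrong and no convergence to worry about, which is the appeal of brute force when 0.1 pH unit is precise enough.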

Frequentist Statistics: Grade school introduced range, mode, mean and median plus basic combinatorics (such as operations using factorial). Somewhere I learned to calculate standard deviations and learned about Z-scores. I think it was chemistry class that introduced T-tests and genetics definitely where chi squared came in. For my undergraduate thesis I ended up teaching myself hypergeometric tests.

But some of those other key distributions: Poisson and Binomial, not so much. I think that awful course at Delaware did touch on Binomial. But certainly I wish I had more training on when to use the different tests. Genetics classes are particularly prone to telling you to use chi squared without teaching you to actually understand the method.

Another key tidbit I've picked up but was never formally taught: log-transforming data can be so powerful! It is not uncommon for genomics data, such as expression data, to have noise proportional to the size of the signal, so the noise is not normally distributed in the original measurements. But transform into log space, and often the noise is now normally distributed. But be careful: this is a useful tool, not a magic wand, and it is subject to abuse.
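A quick simulation shows the effect. This is a sketch on made-up data, assuming purely multiplicative noise (each measurement is the true signal times a random factor); real expression data will be messier, and the function name and parameters are invented for the example.

```python
import math
import random
import statistics

def simulate_replicates(true_signal, n, rng, log_sd=0.2):
    """Replicate measurements whose noise is proportional to the signal (an assumed model)."""
    return [true_signal * math.exp(rng.gauss(0.0, log_sd)) for _ in range(n)]

rng = random.Random(42)  # fixed seed so the run is repeatable
for true_signal in (10.0, 1000.0):
    raw = simulate_replicates(true_signal, 1000, rng)
    logged = [math.log2(x) for x in raw]
    print(
        f"signal={true_signal:7.1f}  "
        f"sd(raw)={statistics.stdev(raw):8.2f}  "
        f"sd(log2)={statistics.stdev(logged):.3f}"
    )
```

On the raw scale the spread of the high-signal replicates is roughly a hundred times that of the low-signal ones, while on the log scale the two spreads come out about the same. That stable spread is what many standard tests assume.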

Bayesian Statistics: Nope, never had anything on this in a class. So I've been trying to learn over the years -- Nate Silver's The Signal and The Noise is helpful, but this is still not a strong suit. I try to understand Hidden Markov Models because I use them so much, but if you start talking Dirichlet mixtures I'm not confident that I understand these well at all.

Advanced Algebra: Eigenvalues show up all the time; I wish I didn't feel nervous around matrix algebra. We touched on it somewhere in my grade school, but then it lay fallow for too many years. Linear dynamic programming I understand pretty well, but only the special case of sequence alignment.
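For readers who have not met the sequence alignment case, here is a minimal sketch of global alignment scoring in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, and the function returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b under a simple linear gap penalty."""
    # prev[j] holds the best score for aligning the prefix of a seen so far with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "AGT"))         # 2
```

Each cell is computed once from three neighbors that are already known, which is the whole trick of dynamic programming: the work grows with the product of the two lengths rather than with the astronomically larger number of possible alignments.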

Other: An interesting experience at Starbase. I was giving an internal talk early on trying to familiarize everyone else with what I did, so I talked about sequence assembly. That means talking about paths, so I mentioned the Seven Bridges of Konigsberg problem. Stunning -- in a room with about a half-dozen science Ph.D.s (one of whom taught at an Ivy), I was the only person familiar with the Seven Bridges! Thanks Goedel, Escher, Bach! But there's plenty more I haven't a clue on -- Ewan Birney mentioned on Twitter something about Galois Fields, which I had never heard of. There are certainly plenty of other topics in math that pop up in papers that I have a weak understanding of (such as differential equations) or am totally in the dark about.
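The Seven Bridges puzzle asks whether one walk can cross every bridge exactly once. Euler's answer comes down to counting: in a connected graph such a walk exists only if zero or two land masses touch an odd number of bridges. A minimal sketch of that check follows; the labels A through D are arbitrary names for the four land masses (A being the central island), and the code assumes the graph is connected rather than verifying it.

```python
from collections import Counter

# The seven bridges as pairs of land masses; labels are arbitrary for this sketch.
KONIGSBERG_BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

def odd_degree_nodes(edges):
    """Nodes touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)

def has_euler_walk(edges):
    """True if a walk using every edge exactly once can exist (graph assumed connected)."""
    return len(odd_degree_nodes(edges)) in (0, 2)

print(odd_degree_nodes(KONIGSBERG_BRIDGES))  # ['A', 'B', 'C', 'D']
print(has_euler_walk(KONIGSBERG_BRIDGES))    # False
```

All four land masses have odd degree, so no such walk exists.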

There are also a number of areas I chased a bit, particularly as an undergrad and grad student, that never went anywhere for me -- wavelets and fuzzy logic for example. That doesn't mean others haven't found them useful, just that I haven't.

So after all this retrospection, what can I conclude? A few thoughts.

First, I am enormously appreciative of every computational biologist who is truly skilled and talented in math who has used that skill to develop new tools and generate new approaches. Without them, people like me would be pretty lost.

Second, you can get by in computational biology with a pretty weak operational skill in math -- that's certainly how I see myself. I still think it is important to be aware of the underlying concepts and to make the attempt to understand them, but much as I can appreciate a great painting without being able to execute any painting that isn't absurdly awful, mastery of all fields is not necessary. The worst trap to create -- and I have certainly done so to myself -- is to decide you just can never get a certain topic and permanently give up trying. If I ever really needed to learn Dirichlet mixtures, then dammit I'd force myself to do so.

A corollary to that is that forcing yourself to learn new things can be very rewarding, and sometimes you get that wonderful surprise of connecting for yourself two fields that seemed unrelated. When I first learned how the neighbor joining tree-building algorithm works, I was thrilled -- because I recognized it as the approach used in a compression method I was familiar with (Huffman Coding). It's also fun to find you have an edge on others: one thing that pointed me down my career path was when in my Honors Biology class I instantly grokked the concept of frameshift mutations, whereas some very bright students around me were totally confused. Again, prior thinking about computational problems had primed me for biological problems.
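For anyone who hasn't seen Huffman coding, the pattern is easy to sketch: repeatedly pull out the two lowest-weight nodes, join them under a new parent, and put the parent back, until a single tree remains. Neighbor joining also builds its tree bottom-up by joining pairs, though it picks each pair with a distance-based criterion rather than by frequency. The code below is a minimal illustration that returns only the code length for each symbol; the function name and the frequencies are made up for the example.

```python
import heapq
from itertools import count

def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman tree built from frequencies."""
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tiebreak), {symbol: 0}) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Join the two lightest subtrees; every symbol inside them gets one level deeper.
        freq1, _, depths1 = heapq.heappop(heap)
        freq2, _, depths2 = heapq.heappop(heap)
        merged = {symbol: depth + 1 for symbol, depth in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), merged))
    return heap[0][2]

lengths = huffman_code_lengths({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
print(lengths)
```

Frequent symbols end up near the root with short codes and rare ones end up deep in the tree, the same bottom-up shape that tree-building by repeated joining produces.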

Third, the focus on execution in our educational methods is understandable, but it does sometimes get in the way of moving forward. Too often I was taught to use a method without any guidance on the method. An obvious example is mean vs. median: we were taught early to calculate them, but the idea that means can be misleading on skewed data is something I picked up much later.
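The mean vs. median point takes only a few lines to demonstrate. The numbers below are invented for the illustration: six small values and one large outlier, the kind of skew that a single highly expressed gene or one very long read can produce.

```python
import statistics

# Made-up, right-skewed data: one extreme value among small ones.
values = [1, 2, 2, 3, 3, 4, 250]

print(statistics.mean(values))    # pulled far upward by the single outlier
print(statistics.median(values))  # 3
```

The mean lands well above every value but one, while the median still describes the typical observation.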

So the overall conclusion to "how much math do I need to succeed in computational biology" (note I'm not equating that with passing a particular Ph.D. program; the standard setters there may have very different opinions -- and more importantly the power to enforce them!) is you need to know some math. You can get away with being a bumbler like I am or you can be very good at math. The most important point is not to let a self-perceived weakness in mathematics be a roadblock to exploring computational biology.

As always, I welcome comments, particularly illustrating areas of math that I haven't touched on -- either due to mental blocks or because I am just ignorant of those areas and how they apply to computational biology.

1 comment:

Anonymous
said...

this was fun to read Keith... you know at least some half of you would have existed separately regardless of your parents paths :)... you left out FFT we use for image processing, EEG data etc, the whole thing about machine learning stuff for all the prediction modeling... and computational biology doesn't include systems biology.. where pde are abound!

About Me

Dr. Robison spent 10 years at Millennium Pharmaceuticals working with various genomics & proteomics technologies & working on multiple teams attempting to apply these throughout the drug discovery process. He spent 2 years at Codon Devices working on a variety of protein & metabolic engineering projects as well as monitoring a high-throughput gene synthesis facility. After a brief bit of consulting, he rejoined the cancer drug discovery field at Infinity Pharmaceuticals in May 2009. In September 2011 he joined Warp Drive Bio, a startup applying genomics to natural product drug discovery. Other recurring characters in this blog are his loyal Shih Tzu Amanda and his teenaged son alias TNG (The Next Generation).
Dr. Robison can be reached via his Gmail account, keith.e.robison@gmail.com
You can also follow him on Twitter as @OmicsOmicsBlog.