A biologist's work

Jan 6, 2019

When I would tell people I do computational biology, they would nearly
always remark on the ‘computational’ part, saying either that it
sounds way beyond their comprehension or asking what on Earth it
means. I would inwardly groan, because isn’t it obvious what it
means? More recently I realized why ‘computational biology’ was so
confusing to people, including both non-scientists and scientists in
other areas: because it isn’t a real field.

Non-biologists are aware that tons of information is being generated
in biology. They know about genome sequencing and how that is part of
biology, and that we use computers to analyze that data. Unlike many
of the Old Guard of biology, they aren’t still bewildered by the fact
that work is now primarily done on computers. So they wonder what
strange thing we’re doing on top of that to warrant the
‘computational’ label. Studying digital life forms?

Some people do wetlab experiments without doing the extensive analysis
afterwards; they’re biologists who do wetlab experiments. Some people
do all their research on computers doing simulations and analyzing
data; they’re biologists because they study biological organisms.

How we use biological data

I lay out my opinion on that label to put the current state of work in
biology into perspective, and then to predict where it is headed.

The relative importance of experiments, data, and analysis over time
is familiar: early biologists designed and conducted experiments to
test properties of life, and the results were easily interpreted.
More recently, especially in the last couple of decades, data such as
genomic sequences are generated primarily for use in future studies.
As far as I can tell, generating data in this way is seen as more of
an incomplete ‘first step’ than were previous forms of observation
such as the surveys of early naturalists.

Consequently, much of the current work in biology consists of using
available data to discover patterns, fit models, and otherwise
understand biological systems as best we can. However, this strategy
reaches its limits fairly quickly. Think about cancer mutations, in
which some important genes can be implicated by the frequency with
which they are found to be mutated. Many other genes play a role, but
we cannot accumulate enough clinical data to ever identify most of
these using the same method. Estimating the relative impact of
different mutations within a gene, even with holistic supervised
models, already seems to be reaching an accuracy plateau, with the
predictions capturing only around 30% of the variance in impact values.
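
For readers unfamiliar with the phrase, "variance captured by the
predictions" is usually measured with the coefficient of determination,
R^2. The sketch below shows one common definition, assuming that is the
measure meant here. The function name and the numbers are made up for
illustration and do not come from any real mutation data.

```python
def variance_explained(observed, predicted):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length lists."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    mean_obs = sum(observed) / len(observed)
    # Total variation of the observed values around their mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    # Variation left over after subtracting the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical impact values and predictions, chosen so that R^2 is about 0.3.
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 2.0, 4.0]

print(f"{variance_explained(observed, predicted):.2f}")
```

An R^2 near 0.3 means that roughly 70% of the variation in the measured
impact values is not accounted for by the model.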

Sure, we could make a great deal of medical progress by treating only
the effects of the main driver mutations, but the complexities of the
cell and individual variation mean that this isn’t as simple as
thoroughly studying one gene at a time.

We are increasingly making use of models to represent biological
knowledge, and those models are becoming less formally defined. What
use is a model then if it cannot be interpreted? Simulation.

Representing knowledge through simulations

Science historian George Dyson notes that the way we use and interact
with technology is shifting from binary, logical, and discrete to
analog and continuous:

The next revolution is the assembly of digital components into
analog computers, similar to the way analog components were
assembled into digital computers in the aftermath of World War II.

We should be, and are, following this trend with biological models. We
can use a network of proteins to model protein interactions, but this
is a simplification that will only answer basic questions.

We are starting to use analog models thanks to machine learning and
physical simulations such as molecular dynamics, but overall are still
making fairly formal and manual use of biological data. Further
development of artificial intelligence will allow models to be
developed more automatically, even allowing AI to collect data from an
experimental system as it sees fit.

I predict that the broad process of biological research will center
around huge models that are used to perform simulations. Engineers
will build the AI agents, and the AI agents will build models that
best simulate real measurements. Biologists will find new techniques
for observing biological systems in ways that are most useful for
human interests. They will guide the AI agents to understand the
observed data and to develop the models in ways that answer important
new questions. They will oversee and interpret the information
produced by the simulations.

Interpreting big analog models is difficult, but formal models with
nice intuitive interpretations, which can predict a few basic things
about an analog biological system, will eventually have little use in
research. Ironically, the analog and imperfect representations will
be most useful for computers, while discrete and formal
representations will be useful mainly to teach concepts to students.

As our approach to research changes steadily, eventually becoming
something that would be unrecognizable today, we should remember that
we’re still doing biology.