A big data look at improving scientific research: A Q&A with Andrey Rzhetsky

Andrey Rzhetsky, PhD, professor of medicine and human genetics and director of the Conte Center for Computational Neuropsychiatric Genomics

Of all the possible experiments available in biomedical research, only a small subset are ever tackled by scientists. This is in part due to institutional and cultural pressures that lead researchers to avoid risk-taking and choose inefficient research strategies, according to a new study based on a computational analysis of millions of patents and research articles. Despite increased opportunities for groundbreaking experiments, most scientists choose conservative research strategies to reduce personal risk, which makes collective discovery slower and more expensive, conclude Andrey Rzhetsky, PhD, professor of medicine and human genetics and director of the Conte Center for Computational Neuropsychiatric Genomics, and his colleagues.

However, the team also uncovered more efficient approaches for maximizing discovery and identified the approaches used more often by scientists who have won Nobel Prizes and other prestigious awards. In the study, published in the Proceedings of the National Academy of Sciences, they quantify the advantages and disadvantages of modern science and propose steps for a more productive future. ScienceLife asked Rzhetsky a few questions about his big data look at the science of science.

ScienceLife: Why did you choose to study the strategies that scientists use to choose experiments?

Andrey Rzhetsky: It started as personal curiosity. What’s the best personal strategy for scientific research? How should we select the most promising problem to advance collective discovery? These were questions I wanted to know the answer to. Scientists have lots of different strategies for selecting problems to work on, and that choice in many cases plays a major role in the success of the work itself. This seemed like a solvable problem.

How do you even approach something this broad?

Essentially, we looked at molecules that were mentioned together in the same work, indicating they were linked in the brain of the same researcher. We extracted molecule names from millions of publications and patents and created knowledge networks to see how combinations of those names occur and evolve over time. We looked for everything we could identify. It was something like 40 million distinct molecules.
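The idea of linking molecules that appear in the same work can be illustrated with a minimal sketch. This is not the study's actual pipeline: the function name `build_cooccurrence_network`, the toy list of papers, and the molecule lists in it are all assumptions made for illustration, and real name extraction from text is skipped entirely.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(documents):
    """Count how many documents mention each pair of molecules together.

    `documents` is an iterable of lists of molecule names, one list per
    paper or patent. Returns a Counter keyed by sorted (name_a, name_b).
    """
    edge_counts = Counter()
    for molecules in documents:
        # De-duplicate and sort so (a, b) and (b, a) map to the same edge.
        unique = sorted(set(molecules))
        for pair in combinations(unique, 2):
            edge_counts[pair] += 1
    return edge_counts


# Hypothetical toy input: each inner list stands for one paper.
papers = [
    ["p53", "aspirin"],
    ["p53", "MDM2"],
    ["p53", "MDM2", "aspirin"],
]
network = build_cooccurrence_network(papers)
```

Each key in `network` is an edge between two molecules, and its count is the number of toy papers that mention both. Tracking how those counts change by publication year would be one way to watch such a network evolve.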

Why molecules and not diseases?

Molecular structures are hugely important for biological, chemical and medical scientists. We use molecules, but you can think of them as representing ideas or concepts. For example, all drugs and diseases are mapped to specific molecules. The cancer gene p53 and its protein is a concept. Aspirin is another. And everything eventually should connect because nature is unified. This is the base of our knowledge network.

So what does this massive network give us?

This formulation allows us to discover strategies that maximize probability for success, for finding something important in the most efficient way. But these strategies are different for different ages of networks. Optimal strategies for the same network differ depending on how much is known.

For old, well-established disciplines where much of the network is already discovered, you have to be very adventurous to find something new. For very young disciplines, you can get away with pretty simple strategies—take two flashy concepts and connect them, and it’s likely to result in something important.

A small example network based on actual research papers over the past four decades. Dots represent molecules, and lines represent papers that mention the two connected molecules. Some nodes are heavily studied and well-connected to other molecules (orange and red), representing well-established fields of study. Others (blue) are understudied fields and more obscure.

How does this help with a researcher choosing what to research?

Essentially we have a simplified model in which discoveries boil down to a selection between two problems to connect. Think about the prisoner’s dilemma from game theory. It’s a very simple game, but it helps to articulate strategies that can be used in quite different situations. Same thing here.

Basically in our model, each experiment involves two molecules. If you are in an old, established field, the most efficient way to make new discoveries is risky. You have to make new connections between obscure molecules.
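As a rough illustration of "new connections between obscure molecules," the sketch below treats a molecule as obscure when it has few known links and lists pairs of obscure molecules that have not yet been connected. The degree threshold, the function names, and the toy edge list are assumptions for this example, not the model or the strategy scoring used in the study.

```python
from collections import Counter
from itertools import combinations


def node_degrees(edges):
    """Count how many known connections each molecule has."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


def unexplored_obscure_pairs(edges, max_degree=1):
    """List unconnected pairs in which both molecules are obscure.

    "Obscure" here means degree <= max_degree, an arbitrary cutoff
    chosen only for illustration.
    """
    known = {frozenset(edge) for edge in edges}
    degrees = node_degrees(edges)
    obscure = sorted(name for name, d in degrees.items() if d <= max_degree)
    return [
        pair for pair in combinations(obscure, 2)
        if frozenset(pair) not in known
    ]


# Hypothetical toy network: "A" is heavily studied, "D" and "E" are obscure.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "A")]
candidates = unexplored_obscure_pairs(edges)
```

In this toy network the only candidate is the untested link between `D` and `E`, the two least-connected molecules. Raising `max_degree` widens the pool and corresponds to a less risky choice of experiment.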

Is there a best practice here?

I’m not sure if you’ve ever collected mushrooms, but in Europe people do it quite a bit. In a new forest, it’s very easy to find mushrooms after the rain. But when you know more people are around, you have to go into less accessible areas to find something. This is basically the same idea, but with molecules.

The trick is to find something really important that is not being attacked by thousands of other people around the globe.

So scientists should always study something obscure?

That is a sort of philosophical question. I don’t know if you’ve ever seen this, but ants communicate via chemical signals and sometimes, they can form a circle which essentially acts like a black hole. They can’t exit from it. The chemical attraction is so strong that they die.

Scientists typically cannot see the global picture. They can get trapped in a local problem because of practicality. It would be to the detriment of science if everyone were concentrated in a small part of the network and spent enormous amounts of resources trying to solve the same problem. Unfortunately, this is closer to what’s really happening.

Basically, for society at large, it would be most beneficial to have great diversity in the problems that people try to solve. The goal is to discover something new, and not rediscover the same thing over and over again.

But maybe some things deserve to be studied more because they are more important?

That is a little bit circular. For example, p53 has something like 60,000 papers. This one protein and all its connections is surely important, but what is the objective measure of importance?

I would say to cure cancer and save lives….

But I’m not sure that’s what’s happening in reality, that studying this one particular protein is saving lives. Because cancer is a disease of regulation. By definition you need to think about networks and interacting parts, not one protein.

One of the implicit ideas in our study is that you need to gain overall knowledge about networks—that it’s useful to explore it as broadly as possible, rather than focusing on one connection. In terms of cancer, I really believe you need to know about the network at large. It’s not enough to focus on one protein, however important it is.

Aren’t obscure molecules obscure for a reason? Wouldn’t it be wasteful to run experiments that wind up not yielding anything?

Certainly a proportion of experiments will fail. But think about history. DNA was an obscure molecule until the 1950s. Everything that became prominent was obscure at some point. If scientists are not exploring obscure research areas, they’ll never discover anything new. It’s true that, in some sense, what we are suggesting can be wasteful. But it’s the price we have to pay. There is no better way.

What about scientists who might pursue risky research areas, fail and harm their careers?

The take-home message here is not that you should maximize risk. Instead, there is a carefully balanced optimal risk. In a new field, you can get away with relatively little risk. The older the network, the higher the risk that is needed.

For researchers trying to gain tenure, it would be very detrimental to not find something new. But what we are showing statistically is that, with the optimal strategy, one is bound to find something new after, say, X number of experiments.

One of our side conclusions is that it is almost always better to be in a younger field. And that there is always a young field in the area between two old fields. This open space is always beneficial for scientists. That’s what many young scientists should be doing: exploring the space between old disciplines where not much has been done, but where old disciplines are almost exhausted.


Kevin Jiang is a Science Writer and Media Relations Specialist at the University of Chicago Medicine. He focuses on neuroscience and neurosurgery, orthopedics, psychology, genetics, biology, evolution, biomedical and basic science research.