I Bayes again - how to interpret in vitro genotoxicity data

Hello,
yesterday at work there was a discussion about genotoxicity testing, and a point caused some controversy. I wonder if you could shed some light, please.

To give you some context: we work in drug discovery, and before a substance can be administered to people, we need to check that it's not toxic, and in particular not genotoxic.
To assess that, we need to run a genotoxicity assay (called micronucleus, more information here: https://en.wikipedia.org/wiki/Micronucleus_test).
This assay has two versions: in vitro and in vivo. The in vivo assay is the one that matters (legally) to decide if the substance can be used in humans, but it's very expensive, so we tend to run the in vitro one first as an initial screen of (relatively) many substances. If the in vitro assay is negative, we run the in vivo assay for confirmation of negativity (non-genotoxicity). If the in vitro assay is positive, we must decide whether to drop the substance or run it in the in vivo one to de-risk it (because if the in vivo assay is negative, it does not matter that the in vitro one was positive: it is then considered an in vitro false positive; so we could say that the in vivo assay is considered a golden standard telling us the real genotoxicity).

And here's where Bayes comes into play (I think).
The in vitro assay is reported to predict in vivo genotoxicity with a sensitivity of 90% and a specificity of 40% (i.e. with a massive FPR of 60%).
One colleague raised the question: is it worth running the in vitro assay as an initial screen of potential candidate substances, given that among those that test positive, many of them will probably be false positives?
The doubt in that case is whether to progress them to in vivo, i.e. spend the money perhaps just to confirm that they are actually genotoxic (true positives), or to drop them even though they may actually be non-genotoxic (false positives).
[Consider that substances that get to this stage have undergone a long and expensive discovery process, and if they make it to human testing they can potentially generate a large return on the investment, so it is financially very risky to drop them unless it's quite sure they are toxic. That's why people are more concerned about false positives than about false negatives in this context].
On the whole I tend to agree. One would be tempted to say: just as well to select your substances by other criteria and only run the few you consider 'best' directly in the definitive in vivo assay.

However, I have doubts on this reasoning, in particular related to the prior probability of true genotoxicity (let's call it P(G); hence P(nonG) = 1-P(G), as there are only these two categories).

From the theory we know that P(G) is needed to calculate P(G|T+), where T+ is a positive outcome in the in vitro assay, and more importantly in this context, its complement P(nonG|T+):

P(nonG|T+) = FPR*P(nonG) / [FPR*P(nonG) + sensitivity*P(G)]

Trouble is, I have no idea what P(G) is in the set of substances one may decide to test. It's not like the prevalence of a disease in a population, for which we have historical estimates, and perhaps even information on specific subsets. Here we can choose any set of substances, and some sets can have higher or lower P(G) depending on the structures. But I don't know if and how that can be estimated.

Say that I take an indifferent approach and decide P(G) = 50%. Then:

P(nonG|T+) = 60%*50% / [60%*50% + 90%*50%] = 40%

This seems to oppose my colleague's view: if the test is positive, it is more likely that the substance is genotoxic (60%) than it isn't (40%), although quite weakly.

However, say that I have historical data saying that substances randomly chosen from chemical space are 25% likely to be genotoxic, and that the substances we may decide to test as drug candidates are essentially random themselves. Then P(G) = 25% and:

P(nonG|T+) = 60%*75% / [60%*75% + 90%*25%] = 66.7%

This seems to support my colleague's view: when the in vitro assay is positive, it's actually more likely that the substance is non-genotoxic (66.7%) than it is (33.3%).
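For the record, the two calculations above can be reproduced with a few lines (a sketch in Python; sensitivity and FPR are the 90%/60% figures quoted earlier):

```python
# Bayes' theorem for the in vitro micronucleus screen, using the
# figures quoted above: sensitivity 90%, FPR 60% (specificity 40%).
def p_nonG_given_pos(p_G, sens=0.90, fpr=0.60):
    """P(nonG | T+): probability a positive is a false positive."""
    p_nonG = 1.0 - p_G
    return fpr * p_nonG / (fpr * p_nonG + sens * p_G)

print(p_nonG_given_pos(0.50))  # ~0.40 with the indifferent prior
print(p_nonG_given_pos(0.25))  # ~0.667 with the 25% prior
```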

I am aware that in terms of relative probabilities I am always doing better using the assay than not using it, because:
P(nonG|T+)/P(nonG) = 40% / 50% = 0.8, in the first case
and
P(nonG|T+)/P(nonG) = 66.7% / 75% = 0.89, in the second case
so I know that choosing which substances to progress and which not to progress based on the positivity of the in vitro assay is always better than random.
But people who don't appreciate such statistical subtleties will not be impressed if we tell them to drop one (or a series of) otherwise promising substance(s) based on the positivity of the in vitro assay, while at the same time we tell them that the probability that this positivity would be confirmed in vivo is only 33.3%. They don't care that this is higher than in the 'starting' set (25%): they only see that we are killing their substance based on a laughable 1/3 probability that it's bad. They don't like that, believe me.

And in fact ours is a more or less arbitrary estimate of P(G), which however influences very much the final probability that a positive result in the pre-screen assay is confirmed in the 'golden standard' one.

Question 1: could we elaborate and/or present the data in such a way to make the assessment/decision about individual substances independent of the choice of prior?

The second doubt I have is whether the above story would change for series of structurally closely related substances, which is usually what we test in the in vitro assay.
I mean that we don't really sample randomly from chemical space: we tend to sample relatively small regions of it where molecules are rather similar to one another. And there is a general principle in medicinal chemistry, stating that structurally similar molecules tend to have similar biological activity (of which genotoxicity is one type). This principle has important exceptions (activity cliffs), but on the whole it's not going to be too wrong.
So even when we have information on P(G) for 'random', generic molecules, it may well be that the specific set of molecules a team submits to the in vitro micronucleus assay has an inherently higher (or lower) P(G) than the general 'population'.
Would you take that into account in other contexts, e.g. in a laboratory test for a disease? If you know that a specific ethnic group has a higher incidence of the disease, and a person from this ethnic group tests positive, will you use P(disease) for the general population or P(disease)' for that ethnic group, to calculate P(disease|test positive)?
This also led me to wonder whether getting multiple in vitro positive results for a series of closely related substances should make us change the estimate for P(G).
Say for instance that a team tested in vitro 10 very similar substances, and 8 of them were positive, 2 negative.
The probability that all 8 positives are true positives, based on a P(G) of 25% is (I think - please correct me if I'm wrong):

"P(8 G | 8 T+)" = [P(G|T+)]8 = [33.3%]8 = 0.015%

So there is a lot of hope that at least one of them can be salvaged by de-risking in the in vivo assay (although that will be expensive).
However, isn't the fact that 8 out of 10 substances turned out positive in vitro suspicious, if the hypothesis that P(G) = 25% were true?
If we took 10 random substances from a set with P(G) = 25%, on average 2.5 of them would be genotoxic, 7.5 non-genotoxic. If we then tested all of them in the in vitro assay, on average 2.5*90% + 7.5*60% = 6.75 would turn out positive, 2.5*10% + 7.5*40% = 3.25 would turn out negative.
So shouldn't we review our estimate of P(G) in order for the averages to match our observations?
In this case: P(G) = (8/10 - 60%)/(90%-60%) = 66.7%.
Which would lead to a very different P(G|T+) = 90%*66.7% / [90%*66.7% + 60%*33.3%] = 75%

I am not convinced by this approach, though, because it looks quite circular, and would lead to inconsistencies in case the number of in vitro positives is more than what is expected. E.g. if we had 10 positives out of 10, by the above logic we should conclude P(G) = 40%/30% = 133% (!), because we don't accept that no negatives would be found. But this does happen in reality: there are series of molecules that are ALL positive in vitro...
I have the impression that P(G) is really something that one must decide a priori and then leave alone; or am I wrong?
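For what it's worth, the 'matching the averages' estimate above, and its breakdown at 10/10, can be written down explicitly; clipping the estimate to [0, 1] gives the maximum-likelihood answer. This is just a sketch of one possible fix, not a full Bayesian treatment (which would put a prior on P(G) itself):

```python
# Moment-matching estimate of P(G) from an observed in vitro positive
# rate k/n, using the sensitivity/FPR figures from the thread.
def p_G_moment(k, n, sens=0.90, fpr=0.60):
    """Solve k/n = sens*p + fpr*(1-p) for p; can fall outside [0, 1]."""
    return (k / n - fpr) / (sens - fpr)

def p_G_mle(k, n, sens=0.90, fpr=0.60):
    """Maximum-likelihood estimate: the same value clipped to [0, 1]."""
    return min(1.0, max(0.0, p_G_moment(k, n, sens, fpr)))

print(p_G_moment(8, 10))   # ~0.667, as computed above
print(p_G_moment(10, 10))  # ~1.333 -- the 133% inconsistency
print(p_G_mle(10, 10))     # 1.0 after clipping
```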
I tried reading some other threads and articles (see links below), but without success: most are well above my level, and seem to address much more complex cases than this simple one I'm tackling.

Question 2: is the observation of multiple in vitro micronucleus positives in closely related substances an indication that the underlying prior probability is higher for these substances than for generic chemical space? And if so, how should it be calculated?

Sorry for the long post - this is complicated stuff (at least, it is for me).
Thanks
L

[Consider that substances that get to this stage have undergone a long and expensive discovery process, and if they make it to human testing they can potentially generate a large return on the investment, so it is financially very risky to drop them unless it's quite sure they are toxic. That's why people are more concerned about false positives than about false negatives in this context].

Is there a reason that the in vitro test isn't done early in the discovery/development process? Is the "long and expensive discovery process" the minimum amount of effort needed to get the substance in shape to do an in vitro test?

Yes. One reason is that even the in vitro test is quite expensive and time-consuming, so you wouldn't be able to process the huge numbers of substances that are generated in early discovery (or at least, my company wouldn't); the second reason, as I explained, is that people are extremely wary of running this assay early and potentially 'killing' molecules or whole series based on what appear to be very shaky grounds.
Do you think that testing more molecules early would help?
Thanks

This is a problem of "statistical decision theory", which means the answer doesn't depend solely on the values of some probabilities, it also depends on the costs (including "opportunity cost") of various decisions. Sometimes such problems can be solved just by knowing qualitatively that certain costs are huge and others are negligible.

As I probably mentioned in other posts, I am an advocate of using simulations to study real-life problems - perhaps just sketching the outline of how to simulate a process even if the outline is never coded up as a computer program. You have proposed a number of hypothetical situations. Can you specify an algorithm for simulating the drug development process that would allow you to pose all those questions by changing the inputs to the algorithm? That would be the best way to investigate the "decision space" of drug development.

I feel reasonably confident about answering questions that have to do with probability theory, but I don't know anything about the biophysics of your questions. In general, the outlook of a statistician is to seek information from current measurements that predicts the statistical properties of future measurements. In your case, it requires expert biophysical knowledge to assess whether measurements of various current populations are statistically representative of various future populations.

This is a problem of "statistical decision theory", which means the answer doesn't depend solely on the values of some probabilities, it also depends on the costs (including "opportunity cost") of various decisions. Sometimes such problems can be solved just by knowing qualitatively that certain costs are huge and others are negligible.

Yes, I used this approach in the past. In this case, not easy to put exact figures on each scenario, but let's try.
Scenario NN: in vitro negative --> in vivo negative --> progress
Scenario NP: in vitro negative --> in vivo positive --> stop
Scenario PX: in vitro positive --> do not test in vivo
Scenario PN: in vitro positive --> in vivo negative --> progress
Scenario PP: in vitro positive --> in vivo positive --> stop
Scenario XN: not tested in vitro --> in vivo negative --> progress
Scenario XP: not tested in vitro --> in vivo positive --> stop
Let's say that the in vitro assay costs 2000, the in vivo one costs 50000.
The return of progressing a substance can be much larger than any of these costs, potentially 10^8-10^9, but it undergoes further attrition, so only about 1/10 of all substances that progress to first-in-human eventually generate this return.
Any substance that is stopped will surely generate no return, because no further investigations will be done on it. And the cost of stopping a substance is generally higher the longer one waits, because people work in chemistry to make the substance itself and its analogues, and in biology to assay the molecules.
The big bosses always say 'fail early', meaning indeed that when you are pretty confident that there are insurmountable problems (and confirmed genotoxicity is perhaps the worst you can get) you should cut the losses and move on to something else rather than dragging it along and wasting time and money on a lost cause.
So yes, there would be an argument in favour of running crucial assays early and know when to stop a substance before too much cash is burned. Trouble is that the FPR of the in vitro assay is very high, as I explained, and when genotoxicity is confirmed in vivo for one molecule, people get panicky and start to think that the whole set of its analogues is also genotoxic, but they don't want to spend the money to check that in vivo, and so many potential opportunities are lost.

In the case we were dealing with in my company, the in vitro assay was not run for a series that got a lot of resources (FTE's...). So a lot of money was spent on this series, without even attempting to assess their potential genotoxicity. When the candidate molecule for first in human was nominated, it went to the in vitro genotoxicity assay, was found to be positive, and was confirmed in vivo. Then 5 or so other molecules from the same series were also tested in vitro, and all of them except one were positive. No other molecule was tested in vivo. The series was dropped.
Now, the project leader thinks that some of them may be non-genotoxic, by the argument that the in vitro assay has a high FPR.
Other people (those who decide, ultimately) think that as the molecules are all similar, they are probably all genotoxic, and there is nothing to gain in testing them in vivo.
Is this a correct decision or a loss of a potential big opportunity? (note that the molecules are very good under all other respects)
This is the kind of question I am trying to answer, in the end.

A significant aspect in that question is whether substances in the same "series" are likely to have similar properties.

The way I would begin to conceptualize a simulation of the process is define the properties of the population of substances, from which substances are selected "at random" for initial investigation. For example, as a simplistic approach, suppose each substance has 3 properties:

Profitable to develop? (yes or no)
Passes in vivo test? (yes or no)
Passes in vitro test? (yes or no)

I'll postpone any modeling of different "series" of substances.

A specific population of substances is defined by giving the fraction of the population that has each combination of properties. For example, perhaps 0.032 of the population has the combination: is profitable to develop, fails in vivo test, passes in vitro test.

A simulation is sometimes useful even if the details of the population are not known. For example, by simulation, you can seek robust decision rules that do well with a wide variety of different populations.

What's a simple way to model costs and rewards? For example, we could assume the development of a profitable substance is worth 5 units, the cost of selecting the substance "at random" is 1 unit, the cost of an in vivo test is 2 units, etc. There should be some cost constraints on how many substances the company can be investigating at a single time. We probably need to model time explicitly, at least as a discrete series of steps.
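A minimal runnable version of this kind of simulation might look like the sketch below. Every number is an illustrative assumption: the 25% prior, the 2,000/50,000 costs quoted earlier in the thread, a made-up expected reward, and the simplification that the in vivo result equals true genotoxicity.

```python
import random

# Toy sketch of the proposed simulation. All figures are assumptions.
P_G = 0.25                       # assumed prior probability of genotoxicity
SENS, FPR = 0.90, 0.60           # in vitro assay figures from the thread
COST_VITRO, COST_VIVO = 2_000, 50_000
REWARD = 10_000_000              # assumed expected return of a progressed substance

def mean_return(n, screen_in_vitro, seed=0):
    """Average net return per substance under one decision rule."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        genotoxic = rng.random() < P_G
        if screen_in_vitro:
            total -= COST_VITRO
            positive = rng.random() < (SENS if genotoxic else FPR)
            if positive:
                continue  # rule: drop all in vitro positives
        total -= COST_VIVO        # gold-standard in vivo test
        if not genotoxic:
            total += REWARD       # substance progresses and pays off
    return total / n

print(mean_return(100_000, screen_in_vitro=True))
print(mean_return(100_000, screen_in_vitro=False))
```

Under these particular numbers the "drop all in vitro positives" rule throws away many non-genotoxic substances (the 60% FPR), so screening can come out worse; a different reward, cost structure or prior can flip the conclusion, which is exactly what sweeping the inputs would reveal.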

A significant aspect in that question is whether substances in the same "series" are likely to have similar properties.

That's indeed a crucial point. When the property we refer to is biological activity, occasionally there can be sharp variations in activity by small structural changes. This usually happens when the structural change induces a big shift in the features that are responsible for the strength of the interaction of the substance with biological macromolecules (in the case of genotoxicity, typically DNA, although indirect mechanisms are known).
As a consequence, we may have a set of 10 structurally similar genotoxic substances, and find one small structural change that turns one of them into a non-genotoxic molecule. And vice versa, we may have a set of 10 structurally similar non-genotoxic substances, and find one small structural change that turns one of them into a genotoxic molecule (as far as I am aware, this latter case is much less frequent; and it makes sense, mechanistically, that it should be so).
Special cases include reactivity of the substance itself (but chemists can usually spot that very easily), or of its metabolites (that's less obvious, but there are ways of checking that too): these are cases where suppression of the features that are responsible for the genotoxicity definitively eliminates it.
In the most common cases however, even when one manages to suppress genotoxicity in one of a series of genotoxic molecules, what really changes is the 'potency', meaning that at sufficiently large(r) doses one may still observe DNA damage.
Which means nobody will be very keen to develop a molecule that has had even the slightest whiff of genotox about it, despite the enormous financial interests at play.
That's why finding genotoxicity in one or more substances in a series is so scary, and people are extremely wary of running unspecific assays that may raise doubts or suspicions.

Profitable to develop? (yes or no)
Passes in vivo test? (yes or no)
Passes in vitro test? (yes or no)

I confess that I am a bit lost here. To simulate different scenarios we would need some estimates of the probability of each of these outcomes and of their interdependencies, wouldn't we, and that's indeed the part that I am struggling with.
For instance, I could say that 10% of all molecules that arrive in late 'lead optimization' (LO) are potentially profitable to develop, i.e. won't fall at any of the next hurdles (excluding genotoxicity, that we leave unspecified).
However, to know their chances of passing the in vitro or in vivo genotoxicity test, wouldn't I have to estimate first the prior probability of genotoxicity in the 'initial' set from which they originate?
Without that, I don't see how I can partition the initial population into the various categories.
I have built many simulations, mainly of physiological processes like ADME, but I am not at all familiar with statistical simulations. Maybe there is some training material I could read?
BTW, thank you for your input, it's always very thought-provoking and insightful.

I confess that I am a bit lost here. To simulate different scenarios we would need some estimates of the probability of each of these outcomes and of their interdependencies, wouldn't we, and that's indeed the part that I am struggling with.

Yes, to simulate a particular scenario, we would need estimates of those probabilities. However, if the rules for making good decisions vary greatly with a small change in those probabilities, you aren't likely to find a set of good rules that works in real life! The hope must be that there is a good set of rules that is "robust" to changes in those probabilities. One use of a simulation is to try out sets of rules versus various assumed values for those probabilities. By trying various values, we can see how sensitive the performance of a given set of rules is to changes in the assumed population.

This is similar to the use of spreadsheets in making financial decisions. They are used to test the performance of various alternative decisions as assumptions about conditions are varied - for example, what happens if interest rates increase by 0.1%? By 0.2%? What happens if sales decrease by 3%, or 4%?

Estimating the correlations between the properties of substances in the same series and incorporating the facts of biophysics in those estimates sounds like a complicated task. So before tackling that task, it would be wise to find out how much such estimates matter. How accurate must such estimates be? How sensitive is the financial performance of the company (using a given set of decision rules) to errors in those estimates?
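As a toy illustration of that kind of sensitivity check: compute the expected net return of "screen in vitro first and drop positives" versus "go straight to in vivo" in closed form, then sweep the assumed prior P(G). The sensitivity/FPR are the figures from the thread; the costs and reward are invented:

```python
# Closed-form expected net return per substance under two decision
# rules, swept over the assumed prior P(G). Monetary figures invented.
SENS, FPR = 0.90, 0.60
COST_VITRO, COST_VIVO = 2_000, 50_000
REWARD = 10_000_000  # assumed expected return of a progressed substance

def ev_direct(p_g):
    """Go straight to in vivo; progress if negative."""
    return -COST_VIVO + (1 - p_g) * REWARD

def ev_screen(p_g):
    """In vitro first; drop positives; in vivo on the negatives."""
    p_neg = p_g * (1 - SENS) + (1 - p_g) * (1 - FPR)
    p_neg_and_nonG = (1 - p_g) * (1 - FPR)
    return -COST_VITRO - p_neg * COST_VIVO + p_neg_and_nonG * REWARD

for p_g in (0.05, 0.10, 0.25, 0.50, 0.75):
    print(f"P(G)={p_g:.2f}  screen={ev_screen(p_g):>12,.0f}  "
          f"direct={ev_direct(p_g):>12,.0f}")
```

With these particular numbers screening loses value across the whole sweep, because dropping positives sacrifices 60% of the non-genotoxic substances; how quickly that conclusion changes as the assumed costs, reward and prior are varied is precisely the sensitivity question.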

I see... now I think I understand.
So simulating in this context means trying out a large combinatorial set of possible values of the parameters that determine the number of substances that fall into each 'final' category, so one can check if small changes in the input parameters cause large changes in the final proportions, and the corresponding costs or gains.
In my mind, simulating was more like calculating theoretical values and perhaps fitting them to some experimental data.
I will give it a shot, it does not sound too difficult now that the concept is clear(er) to me. And I believe R has got some excellent tools for that.

Sensitivity analysis: I see now in what sense the simulation may help.
The only example of it I've seen so far is in the context of scoring.
Molecules in a set are each assigned a 'score' that represents how good they are.
The score is essentially a function of the distances of the molecule's physico-chemical or structural parameters (e.g. molecular weight, log P, etc.) from certain reference values.
By changing the reference value of each property within a reasonable range and looking at how the ranking of the molecules -according to the overall score- changes (e.g. by measuring the Spearman rank correlation coefficient to an 'initial' ranking), one can see how important it is to set the reference value of that particular property. This was described in a very good paper by Matthew Segall.
I see now some similarity in the concept.
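For the record, here is a toy version of that ranking-sensitivity idea (the molecular weights are invented, the score is a simplistic one-property distance, and the hand-rolled Spearman formula is only valid when there are no tied scores, which holds for these numbers):

```python
# Rank molecules by |property - reference|, shift the reference, and
# compare the rankings with Spearman's rho. All values are invented.
def ranks(xs):
    """Rank positions of xs (no ties assumed)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman's rho via the classic 1 - 6*sum(d^2)/(n*(n^2-1))."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

mw = [310, 455, 280, 390, 510, 340]  # made-up molecular weights

def score(ref):
    """Toy score: distance of molecular weight from a reference value."""
    return [abs(m - ref) for m in mw]

base = score(400)
for ref in (380, 400, 420, 500):
    print(ref, round(spearman(base, score(ref)), 3))
```

The further the reference value moves from 400, the lower the rank correlation to the original ranking, which is the kind of picture the Segall-style sensitivity analysis produces, property by property.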