A reply to Thomas Metzinger’s BAAN thought experiment

Metzinger invites us to consider a hypothetical scenario where smarter-than-human artificial intelligence (AI) is built with the goal of assisting us with ethical deliberation. Being superior to us in its understanding of how our own minds function, the envisioned AI could come to a deeper understanding of our values than we may be able to arrive at ourselves. Metzinger has us envision that this artificial super-ethicist comes to conclude that biological existence – at least in its current form – is bound to contain more disvalue than value, and that the benevolent stance is to not bring more minds like us into existence (anti-natalism). Suffering from what Metzinger labels ‘existence bias,’ the vast majority of people in the scenario would not accept the AI's conclusions.

Metzinger emphasizes that he is not making a prediction about the future, nor does he necessarily claim that the BAAN-scenario is at all realistic. His idea is meant as a cognitive tool, to illustrate that a position such as anti-natalism, which most ethicists are currently reluctant to even engage with, could conceivably turn out to be correct (in some relevant sense) as inferred by an intellect of superior ethical expertise that is unaffected by human-specific biases.

From thought experiment…

The BAAN thought experiment invites us to look at human existence from a more distanced, more impartial perspective, and to ask which biases could be clouding our ethical judgment. This approach is very much in line with what many ethicists have been attempting to do. A recent example is The Point of View of the Universe by Katarzyna de Lazari-Radek and Peter Singer, who suggest that there might be a strong symmetry between suffering and happiness (hedonism). Alternatively, I think there are many reasons why one might conclude that reducing suffering is more important than creating happiness, and what Metzinger labels ‘existence bias’ seems to me a concept with substantial merit. Having said that, a perspective that solely considers experiential well-being – suffering or happiness – but not a person’s life goals or the sources from which we derive meaning strikes me as missing a very important component of the situation.

If we had an artificial intelligence with the goal of helping us answer ethical questions, what would it come to think? Here is where I suspect that Metzinger’s setup in the BAAN thought experiment is underdetermined and thus begging the question: I strongly suspect that there is no uniquely correct way to teach an AI to solve (human) ‘ethics.’ To use Brian Tomasik’s words, expecting AI to solve ethics for us is like expecting math to solve ethics for us: It’s always garbage in, garbage out. In order to program an AI to have ethical goals, as opposed to any other type of goal, the human creators would already need to have a detailed understanding of precisely what we mean by ‘ethics’ (or ‘altruism’). There is no guarantee that what we mean is something well-considered or well-specified, or that it even comes down to the same exact thing from person to person. Words get their meaning from how we use them, and if our usage is underspecified, it is also underspecified how we could go about making it more precise. In order to work around these obstacles, one could try to build an AI that learns ethical goals from examples. However, this suffers from the same type of underspecification, as it then becomes relevant how the examples are picked and labelled, and whether the AI in question will try to infer people’s revealed preferences (in which case it would, in virtue of the process itself, never conclude that humans suffer from any biases) or whether it would go with some kind of extrapolated preferences (what we would wish more principled, non-hypocritical and informed versions of ourselves to do). In the latter case, there are many distinct ways to resolve inconsistencies between different modules of a person’s brain, and so value extrapolation is underspecified as well.

All of the above implies that while it does seem conceivable that a smarter-than-human artificial intelligence with a wisely chosen goal could someday assist us in our ethical inquiries, we would first have to make substantial philosophical progress on our own. Before the AI-assisted search for our true values can get off the ground, we would have to make some tough judgment calls that determine the precise nature of the AI’s role as an ethical advisor.

The BAAN-scenario Metzinger asks us to consider, where the allegedly benevolent AI comes up with an ethical stance that most humans would likely want to fight against, is only logically conceivable if the initial recipe installed into the AI – what its creators meant by ‘assisting us with doing ethics’ – allowed for an outcome where the AI ends up choosing a stance with which, when it presents its result to humanity, the vast majority of people vehemently disagree. Perhaps we would be willing to accept this setup if the AI in question were able to eventually persuade us – without trickery, only through intellectual arguments – that the ethical reasoning it went through is truly in line with what we care about upon reflection. However, in a situation where the AI’s conclusions go strongly against our own ethical views in such a way that the AI could not persuade us that it knows our values better than we do, would we accept that we are being too short-sighted – thereby placing greater trust in having set up the AI’s notion of doing ethics in the right way – or would we place greater trust in our own beliefs about which outcomes we deem unacceptable? Someone who thinks the latter could argue that the setup in the BAAN thought experiment does not count as a genuine inquiry into what humans value, but only counts as one interpretation of what we somehow ‘should’ value, based on (implicit or explicit) unilateral judgment calls made by the AI’s creators.

On the one hand, we want our ethical inquiries to stay open to the possibility that our current views suffer from important blind spots, as has been the case many times over historically. On the other hand, under the moral anti-realist assumption that the direction of moral progress is not (completely) ‘fixed,’ we also do not want to end up in a situation where open-ended moral inquiry unexpectedly comes up with so-called progress that no longer has anything to do with our current goals, or our current concepts of ethics and altruism. The art lies in setting just the right amount of guidance. I take Metzinger’s thought experiment to be asking us the following: If we were offered the chance to extrapolate our own values with the help of a benevolent superintelligence that will do for us exactly what we intend it to do, would we give it instructions that allow for the possibility of a BAAN-like outcome, where the AI knows us and our values better than we do? Or would we veto this outcome a priori, thereby determining that any preference for e.g. continued existence over the avoidance of suffering can never be accepted by us as a bias, but is always an axiom we are not willing to question?

… towards a future scenario universally worth rooting for

While the BAAN-scenario is best left as a thought experiment only – as intended by Metzinger – the idea that smarter-than-human artificial intelligence could assist us in our ethical inquiries seems intriguing, and, if done wisely, promising as an altruistic and cooperative vision for AI development. A collaborative project to build AI could perhaps be set up in a way where it becomes possible to narrow down one’s moral uncertainty, resolve moral disagreements, or (in case certain disagreements are bound to persist) determine a compromise for different value systems with maximal gains from trade. The task of value alignment for smarter-than-human AI is hard enough on its own, and does not become easier when zero-sum (or negative-sum) competition intensifies. Rooting for an AI that helps with only our own, idiosyncratic conception of how to do ethics is ill-advised. Instead, one should advocate for a cooperative solution to value alignment that helps everyone to get a better sense of their own goals, and then implements a compromise goal that gives everyone close to the perfect outcome.

Very roughly, an altruistic vision for a post-AI future that takes seriously the extreme altruistic importance of reducing suffering should satisfy the following criteria:

1. The vision is a cooperative one; support for it is as broad as possible

2. The vision includes comparatively small amounts of suffering

3. Practical plans for implementation have safeguards against catastrophic failures

Point 1 most probably rules out universal anti-natalism. It is, however, compatible with Metzinger’s “Scenario 2” – AI technologically abolishing suffering in us or our descendants – provided that a future filled with flourishing posthuman beings is implemented in detail rich enough and uncontroversial enough to be regarded as (mostly) utopian from a maximally broad range of perspectives.

Point 2 is also satisfied in Metzinger’s Scenario 2. I deliberately wrote “comparatively little” suffering instead of “no suffering” because perfection here is the enemy of the good: Complaining about small traces of suffering in utopia is like a spoiled kid complaining at their birthday party that the new car they got has an ugly color, while the day before, they had an accident with their old car and got very lucky to not have lasting damage for the rest of their life. Utopia-like outcomes, even if they may contain serious suffering for some beings, are much, much better than futures with astronomical amounts of suffering. Therefore, the comparison should be made with what is otherwise in the range of likely outcomes, and not with the most perfect outcome one can conceptualize. If some people’s conception of a utopian future e.g. strongly calls for being able to go rock climbing and experience occasional glimpses of fear of falling, or some suffering from muscle aches, it would be extremely uncooperative for people with suffering-focused values to object to that. Having said that, we should be wary of wishful thinking and recognize that near-optimal outcomes may be very unlikely, and that most of the value for suffering reducers may be gained with point 3:

Point 3 refers to the problem that utopias may be fragile, and that we should choose a path where small mistakes do not lead to a situation that is much worse than if no one had even deliberately tried to achieve an excellent outcome.