(the ideas in this post came out of a conversation with Scott, Critch, Ryan, and Tsvi, plus a separate conversation with Paul)

Consider the problem of optimization daemons. I argued previously that daemons shouldn’t be a problem for idealized agents, since idealized agents can just update on the logical observations of their subagents.

I think something like this is probably true in some cases, but it probably isn’t true in full generality. Specifically, consider:

It’s going to be difficult to centralize all logical knowledge. Probably, in a maximally efficient agent, logical knowledge will be stored and produced in some kind of distributed system. For example, an ideal agent might train simple neural networks to perform some sub-tasks. In this case, the neural networks might be misaligned subagents.

If the hardware the agent is running on is not perfect, then there will be a tradeoff between ensuring subagents have the right goals (through error-correcting codes) and efficiency.

Even if hardware is perfect, perhaps approximation algorithms for some computations are much more efficient, and the approximation can cause misalignment (similar to hardware failures). In particular, Bayesian inference algorithms like MCMC will return incorrect results with some probability. If inference algorithms like these are used to choose the goals of subagents, then the subagents will be misaligned with some probability.

Problems like these imply that maximally efficient agents are going to have daemons and spend some portion of their resources on anti-daemon measures (an “immune system”).

At a very rough level, we could model an agent as a tree with a supergoal at the top level, subagents with subgoals at the next level, subagents of those subagents at the next level, and so on (similar to hierarchical planning). Each level in the hierarchy allows some opportunity for the goal content to be corrupted, producing a daemon.
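To make the tree picture a bit more concrete, here is a minimal sketch of how corruption risk might accumulate with depth. It assumes something the model above does not specify: that each delegation step corrupts the goal content independently with the same probability `p`. The function name and both parameters are hypothetical, chosen only for illustration.

```python
def corruption_probability(depth: int, p: float) -> float:
    """Chance that goal content is corrupted at least once along a chain
    of `depth` delegation steps.

    Assumption (not from the post): each step corrupts the goal
    independently with the same probability `p`.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    # The goal survives a step with probability (1 - p); it survives the
    # whole chain only if it survives every one of the `depth` steps.
    return 1.0 - (1.0 - p) ** depth


if __name__ == "__main__":
    for depth in (1, 5, 20):
        print(depth, corruption_probability(depth, p=0.01))
```

Under these assumptions, even a small per-level error rate becomes significant in a deep hierarchy, which is one way to read the claim that each level offers an opportunity for a daemon to appear.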

I suspect that there’s at least one anti-daemon measure that is compatible with maximal efficiency. Roughly: if an AI system is taken over by daemons, then the daemons it is taken over by will themselves have some sort of anti-daemon measures or else be taken over by daemons. Unless it’s “daemons all the way down”, at some point it bottoms out as an agent with effective anti-daemon measures. (This isn’t very formal; I think I would need to look at the problem from a different angle to get a crisp formal statement out of this).

(as an analogy, consider whale cancer. Cancer can take over individual whales, but will then fail to be able to gather food, and will additionally need anti-cancer mechanisms of its own. Any biological organism that is capable of pursuing the same convergent instrumental goals as whales do will need to spend some of its resources on anti-cancer mechanisms, so cancer is not a particular problem for whale values compared with those of alternative agents who pursue different values through the same instrumental strategies that whales do.)

Daemons are a problem for agents with platonic goals (i.e. agents who only care about a logical world). If we give the agent the goal of maximizing utility in an idealized computing environment, but then run the agent on insecure hardware (e.g. vulnerable to row hammer attacks), then the agent will develop daemons plus an anti-daemon immune system, but the immune system will only be optimized to prevent attacks possible in the idealized computing environment. For example, it will invest in an immune system against subagents corrupted due to MCMC approximation errors, but the immune system won’t take into account that the subagents can exploit row hammer attacks.

(note that there are other problems with platonic goals; most ways of doing efficient logical reasoning involve an empirical component (e.g. logical inductors), and the empirical component could find faults in the hardware. But the problem with daemons makes it clearer that we should expect some of the results of hardware failures to be adversarial.)

It’s possible that low impact could help here: if we can’t specify the world that we care about (i.e. our physical world), then maybe we can design an agent with a motivation to have low impact across the board so that it will in particular desire to have a low impact on our world.

It’s also possible that it’s worth sacrificing maximal efficiency for better anti-daemon measures. But then there must be some argument for why this doesn’t severely cripple the agent’s performance.

Consider human society. What if, in an effort to control defection at every meta level, the world as a whole spent 10% more on security to avoid being taken over by nations; each nation spent 10% more on security to avoid being taken over by cities; each city spent 10% more on security to avoid being taken over by individuals; and so on all the way down to organelles? Then the total amount of spending on security grows by far more than 10%.

If the AI daemon problem has a “fractal” flavor similar to human society, then in the limit, better anti-daemon measures (e.g. spending 10% more time on each MCMC inference) lead to more than a constant factor of slowdown. I am not sure how realistic the “fractal” model is, but in any case it seems like its plausibility will be an obstacle to a formal analysis of AI alignment.
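One way to see why the slowdown need not be a constant factor is the following rough sketch. It assumes one particular reading of the “fractal” model: that the extra cost at each level multiplies the cost of everything nested beneath it, so overheads compound across levels. The function name, the parameters, and the compounding assumption are all illustrative rather than taken from the argument above.

```python
def compounded_slowdown(levels: int, overhead: float = 0.10) -> float:
    """Total slowdown factor when every level of the hierarchy pays a
    fractional `overhead` on anti-daemon measures.

    Assumption (one reading of the "fractal" model): overheads compound
    multiplicatively, because work done at a lower level is itself
    wrapped in the extra checks of every level above it.
    """
    if levels < 0:
        raise ValueError("levels must be non-negative")
    if overhead < 0.0:
        raise ValueError("overhead must be non-negative")
    return (1.0 + overhead) ** levels


if __name__ == "__main__":
    for levels in (1, 10, 50):
        print(levels, compounded_slowdown(levels))
```

With a 10% overhead per level, one level costs a factor of 1.1, while ten nested levels cost roughly 2.6. The total depends on the depth of the hierarchy, so under this assumption it is not bounded by any fixed constant as agents become more deeply nested.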

It seems relatively plausible that it’s “daemons all the way down,” and that a sophisticated agent from the daemon-distribution accepts this as the price of doing business (it loses value from being overtaken by its daemons, but gains the same amount of value on average from overtaking others). The main concern of such an agent would be defecting daemons that build anti-daemon immune systems, so that they can increase their influence by taking over parents but avoid being taken over themselves. However, if we have a sufficiently competitive internal environment then those defectors will be outcompeted anyway.

In this case, if we also have fractal immune systems causing log(complexity) overhead, then the orthogonality thesis is probably not true. The result would be that agents end up pursuing a “grand bargain” of whatever distribution of values efficient daemons have, rather than including a large component in the bargain for values like ours, and there would be no way for humans to subvert this directly (we may be able to subvert it indirectly by coordinating and then trading, i.e. only building an efficient but daemon-prone agent after confirming that daemon-values pay us enough to make it worth our while. But this kind of thing seems radically confusing and is unlikely to be sorted out by humans.) The process of internal value shifting amongst daemons would continue in some abstract sense, though they would eventually end up pursuing the convergent bargain of their values (in the same way that a hyperbolic discounter ends up behaving consistently after reflection).

I think this is the most likely way the orthogonality thesis could fail. When there was an Arbital poll on this question a few years ago, I had by far the lowest probability on the orthogonality thesis and was quite surprised by other commenters’ confidence.

Fortunately, even if there is logarithmic overhead, it currently looks quite unlikely to me that the constants are bad enough for this to be an unrecoverable problem for us today. But as you say, it would be a dealbreaker for any attempt to prove asymptotic efficiency.