Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

AlphaStar: Mastering the Real-Time Strategy Game StarCraft II(The AlphaStar team): The AlphaStar system from DeepMind has beaten top human pros at StarCraft. You can read about the particular details of the matches in many sources, such as the blog post itself, this Vox article, or Import AI. The quick summary is that while there are some reasons you might not think it is conclusively superhuman yet (notably, it only won when it didn't have to manipulate the camera, and even then it may have had short bursts of very high actions per minute that humans can't do), it is clearly extremely good at StarCraft, both at the technically precise micro level and at the strategic macro level.

I want to focus instead on the technical details of how AlphaStar works. The key ideas seem to be a) using imitation learning to get policies that do something reasonable to start with and b) training a population of agents in order to explore the full space of strategies and how to play against all of them, without any catastrophic forgetting. Specifically, they take a dataset of human games and train various agents to mimic humans. This allows them to avoid the particularly hard exploration problems that happen when you start with a random agent. Once they have these agents to start with, they begin to do population-based training, where they play agents against each other and update their weights using an RL algorithm. The population of agents evolves over time, with well-performing agents splitting into two new agents that diversify a bit more. Some agents also have auxiliary rewards that encourage them to explore different parts of the strategy space -- for example, an agent might get reward for building a specific type of unit. Once training is done, we have a final population of agents. Using their empirical win probabilities, we can construct a Nash equilibrium of these agents, which forms the final AlphaStar agent. (Note: I'm not sure if at the beginning of the game, one of the agents is chosen according to the Nash probabilities, or if at each timestep an action is chosen according to the Nash probabilities. I would expect the former, since the latter would result in one agent making a long-term plan that is then ruined by a different agent taking some other action, but the blog post seems to indicate the latter -- with the former, it's not clear why the compute ability of a GPU restricts the number of agents in the Nash equilibrium, which the blog posts mentions.)

There are also a bunch of interesting technical details on how they get this to actually work, which you can get some information about in this Reddit AMA. For example, "we included a policy distillation cost to ensure that the agent continues to try human-like behaviours with some probability throughout training, and this makes it much easier to discover unlikely strategies than when starting from self-play", and "there are elements of our research (for example temporally abstract actions that choose how many ticks to delay, or the adaptive selection of incentives for agents) that might be considered “hierarchical”". But it's probably best to wait for the journal publication (which is currently in preparation) for the full details.

I'm particularly interested by this Balduzzi et al paper that gives some more theoretical justification for the population-based training. In particular, the paper introduces the concept of "gamescapes", which can be thought of as a geometric visualization of which strategies beat which other strategies. In some games, like "say a number between 1 and 10, you get reward equal to your number - opponent's number", the gamescape is a 1-D line -- there is a scalar value of "how good a strategy is", and a better strategy will beat a weaker strategy. On the other hand, rock-paper-scissors is a cyclic game, and the gamescape looks like a triangle -- there's no strategy that strictly dominates all other strategies. Even the Nash strategy of randomizing between all three actions is not the "best", in that it fails to exploit suboptimal strategies, eg. the strategy of always playing rock. With games that are even somewhat cyclic (such as StarCraft), rather than trying to find the Nash equilibrium, we should try to explore and map out the entire strategy space. The paper also has some theoretical results supporting this that I haven't read through in detail.

Rohin's opinion: I don't care very much about whether AlphaStar is superhuman or not -- it clearly is very good at StarCraft at both the micro and macro levels. Whether it hits the rather arbitrary level of "top human performance" is not as interesting as the fact that it is anywhere in the ballpark of "top human performance".

It's interesting to compare this to OpenAI Five (AN #13). While OpenAI solved the exploration problem using a combination of reward shaping and domain randomization, DeepMind solved it by using imitation learning on human games. While OpenAI relied primarily on self-play, DeepMind used population-based training in order to deal with catastrophic forgetting and in order to be robust to many different strategies. It's possible that this is because of the games they were playing -- it's plausible to me that StarCraft has more rock-paper-scissors-like cyclic mechanics than Dota, and so it's more important to be robust to many strategies in StarCraft. But I don't know either game very well, so this is pure speculation.

Exploring the full strategy space rather than finding the Nash equilibrium seems like the right thing to do, though I haven't kept up with the multiagent RL literature so take that with a grain of salt. That said, it doesn't seem like the full solution -- you also want some way of identifying what strategy your opponent is playing, so that you can choose the optimal strategy to play against them.

I often think about how you can build AI systems that cooperate with humans. This can be significantly harder: in competitive games, if your opponent is more suboptimal than you were expecting, you just crush them even harder. However, in a cooperative game, if you make a bad assumption about what your partner will do, you can get significantly worse performance. (If you've played Hanabi, you've probably experienced this.) Self-play does not seem like it would handle this situation, but this kind of population-based training could potentially handle it, if you also had a method to identify how your partner is playing. (Without such a method, you would play some generic strategy that would hopefully be quite robust to playstyles, but would still not be nearly as good as being able to predict what your partner does.)

Disentangling arguments for the importance of AI safety(Richard Ngo): This post lays out six distinct arguments for the importance of AI safety. First, the classic argument that expected utility maximizers (or, as I prefer to call them, goal-directed agents) are dangerous because of Goodhart's Law, fragility of value and convergent instrumental subgoals. Second, we don't know how to robustly "put a goal" inside an AI system, such that its behavior will then look like the pursuit of that goal. (As an analogy, evolution might seem like a good way to get agents that pursue reproductive fitness, but it ended up creating humans who decidedly do not pursue reproductive fitness single-mindedly.) Third, as we create many AI systems that gradually become the main actors in our economy, these AI systems will control most of the resources of the future. There will likely be some divergence between what the AI "values" and what we value, and for sufficiently powerful AI systems we will no longer be able to correct these divergences, simply because we won't be able to understand their decisions. Fourth, it seems that a good future requires us to solve hard philosophy problems that humans cannot yet solve (so that even if the future was controlled by a human it would probably not turn out well), and so we would need to either solve these problems or figure out an algorithm to solve them. Fifth, powerful AI capabilities could be misused by malicious actors, or they could inadvertently lead to doom through coordination failures, eg. by developing ever more destructive weapons. Finally, the broadest argument is simply that AI is going to have a large impact on the world, and so of course we want to ensure that the impact is positive.

Richard then speculates on what inferences to make from the fact that different people have different arguments for working on AI safety. His primary takeaway is that we are still confused about what problem we are solving, and so we should spend more time clarifying fundamental ideas and describing particular deployment scenarios and corresponding threat models.

Rohin's opinion: I think the overarching problem is the last one, that AI will have large impacts and we don't have a strong story for why they will necessarily be good. Since it is very hard to predict the future, especially with new technologies, I would expect that different people trying to concretize this very broad worry into a more concrete one would end up with different scenarios, and this mostly explains the proliferation of arguments. Richard does note a similar effect by considering the example of what arguments the original nuclear risk people could have made, and finding a similar proliferation of arguments.

Setting aside the overarching argument #6, I find all of the arguments fairly compelling, but I'm probably most worried about #1 (suitably reformulated in terms of goal-directedness) and #2. It's plausible that I would also find some of the multiagent worries more compelling once more research has been done on them; so far I don't have much clarity about them.

Technical AI alignment

Iterated amplification sequence

Learning with catastrophes(Paul Christiano): In iterated amplification, we need to train a fast agent from a slow one produced by amplification (AN #42). We need this training to be such that the resulting agent never does anything catastrophic at test time. In iterated amplification, we do have the benefit of having a strong overseer who can give good fedback. This suggests a formalization for catastrophes. Suppose there is some oracle that can take any sequence of observations and actions and label it as catastrophic or not. How do we use this oracle to train an agent that will never produce catastrophic behavior at test time?

Given unlimited compute and unlimited access to the oracle, this problem is easy: simply search over all possible environments and ask the oracle if the agent behaves catastrophically on them. If any such behavior is found, train the agent to not perform that behavior any more. Repeat until all catastrophic behavior is eliminated. This is basically a very strong form of adversarial training.

Rohin's opinion: I'm not sure how necessary it is to explicitly aim to avoid catastrophic behavior -- it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice. However, based on Techniques for optimizing worst-case performance, summarized below, it seems like the motivation is actually to avoid catastrophic failures of corrigibility, as opposed to all catastrophes.

In fact, we can see that we can't avoid all catastrophes without some assumption on either the environment or the oracle. Suppose the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations). In this case, for any observation that the oracle would label as catastrophic, there is an environment that regardless of the agent's action outputs that observation, and there is no agent that can always avoid catastrophe. So for this problem to be solvable, we need to either have a limit on what the environment "could do", or an oracle that judges "catastrophe" based on the agent's action in addition to outcomes. That latter option can cache out to "are the actions in this transcript knowably going to cause something bad to happen", which sounds very much like corrigibility.

Thoughts on reward engineering(Paul Christiano): This post digs into some of the "easy" issues with reward engineering (where we must design a good reward function for an agent, given access to a stronger overseer).

First, in order to handle outcomes over long time horizons, we need to have the reward function capture the overseer's evaluation of the long-term consequences of an action, since it isn't feasible to wait until the outcomes actually happen.

Second, since human judgments are inconsistent and unreliable, we could have the agent choose an action such that there is no other action which the overseer would evaluate as better in a comparison between the two. (This is not exactly right -- the human's comparisons could be such that this is an impossible standard. The post uses a two-player game formulation that avoids the issue, and gives the guarantee that the agent won't choose something that is unambiguously worse than another option.)

Third, since the agent will be uncertain about the overseer's reward, it will have the equivalent of normative uncertainty -- how should it trade off between different possible reward functions the overseer could have? One option is to choose a particular yardstick, eg. how much the overseer values a minute of their time, some small amount of money, etc. and normalize all rewards to that yardstick.

Fourth, when there are decisions with very widely-varying scales of rewards, traditional algorithms don't work well. Normally we could focus on the high-stakes decisions and ignore the others, but if the high-stakes decisions occur infrequently then all decisions are about equally important. In this case, we could oversample high-stakes decisions and reduce their rewards (i.e. importance sampling) to use traditional algorithms to learn effectively without changing the overall "meaning" of the reward function. However, very rare+high-stakes decisions will probably require additional techniques.

Fifth, for sparse reward functions where most behavior is equally bad, we need to provide "hints" about what good behavior looks like. Reward shaping is the main current approach, but we do need to make sure that by the end of training we are using the true reward, not the shaped one. Lots of other information such as demonstrations can also be taken as hints that allow you to get higher reward.

Finally, the reward will likely be sufficiently complex that we cannot write it down, and so we'll need to rely on an expensive evaluation by the overseer. We will probably need semi-supervised RL in order to make this sufficiently computationally efficient.

Rohin's opinion: As the post notes, these problems are only "easy" in the conceptual sense -- the resulting RL problems could be quite hard. I feel most confused about the third and fourth problems. Choosing a yardstick could work to aggregate reward functions, but I still worry about the issue that this tends to overweight reward functions that assign a low value to the yardstick but high value to other outcomes. With widely-varying rewards, it seems hard to importance sample high-stakes decisions, without knowing what those decisions might be. Maybe if we notice a very large reward, we instead make it lower reward, but oversample it in the future? Something like this could potentially work, but I don't see how yet.

For complex, expensive-to-evaluate rewards, Paul suggests using semi-supervised learning; this would be fine if semi-supervised learning was sufficient, but I worry that there actually isn't enough information in just a few evaluations of the reward function to narrow down on the true reward sufficiently, which means that even conceptually we will need something else.

Techniques for optimizing worst-case performance(Paul Christiano): There are “benign” failures of worst-case performance, where the AI system encounters a novel situation and behaves weirdly, but not in a way that systematically disfavors human values. As I noted above, we can’t get rid of all of these, but that’s probably fine. We instead would like to focus on “malign” failures, where the AI system applies its intelligence in pursuit of the wrong goal. There are a few techniques that could be applied to this problem.

With adversarial training, we can have one system find inputs on which our agent fails catastrophically, and then train the agent to avoid those behaviors. The main issue here is that there will likely be some failures that aren't found.

On the other hand, verification has strong guarantees, but suffers from the problem that it is hard to know what to specify, and it is computationally expensive to actually perform verification. If we have a strong trusted overseer, eg. produced by amplification, we could use it as a very expensive specification. Alternatively, we could use a catastrophe-checker in lieu of a specification. (Note: While I can see some ways of using catastrophe-checkers, the post seems to have a specific method in mind that I don't understand.)

Any malignant failure must be using the intelligence of the agent somehow, and the agent is only intelligent on the training data, so if we can use strong transparency techniques on the training data, we could find such failures. However, if you use transparency for this, by default you weed out the comprehensible failures and leave in the incomprehensible ones.

Rohin's opinion: While I agree that it would be great to optimize worst-case performance, all of these techniques feel quite difficult to do scalably and with guarantees. With adversarial training, you need to find allof the ways that an agent could fail, while catastrophe could happen if the agent stumbles across any of these methods. It seems plausible to me that with sufficient additional information given to the adversary we can meet this standard, but it seems very hard to knowably meet this standard, i.e. to have a strong argument that we will find all of the potential issues.

With verification, the specification problem seems like a deal-breaker, unless combined with other methods: a major point with AI safety is that we can't write down a good specification for what we want. If we instead only use verification to propagate worst-case guarantees from one model to another (as the post suggests), then it seems possible in theory, but very expensive in practice: most verification techniques assume unlimited fast access to the specification, whereas our specification is very expensive to query. Of course, not much research has focused on this setting, so we can expect large gains; nonetheless you do need to somehow extrapolate the specification to all possible inputs, which seems hard to do with limited access to the specification.

Transparency also seems like it provides additional safety, rather than making any guarantees, since we probably can't get a guarantee that our transparency mechanisms can show us all possible failure modes in a way that we understand them. The argument that we can only focus on the training data makes the assumption that the AI system is not going to generalize well outside of the training dataset. While I'm sympathetic to this assumption (we really don't have good methods for generalization, and there are strong reasons to expect generalization to be near-impossible), it isn't one that I'm confident about, especially when we're talking about general intelligence.

Of course, I'm still excited for more research to be done on these topics, since they do seem to cut out some additional failure modes. But if we're looking to have a semi-formal strong argument that we will have good worst-case performance, I don't see the reasons for optimism about that.

Value learning sequence

Any feedback that the AI system gets must be interpreted using some assumption. For example, when a human provides an AI system a reward function, it shouldn't be interpreted as a description of optimal behavior in every possible situation (which is what we currently do implicitly). Inverse Reward Design (IRD) suggests an alternative, more realistic assumption: the reward function is likely to the extent that it leads to high true utility in the training environment. Similarly, in inverse reinforcement learning (IRL) human demonstrations are often interpreted under the assumption of Boltzmann rationality.

Analogously, we may also want to train humans to give feedback to AI systems in the manner that they are expecting. With IRD, the reward designer should make sure to test the reward function extensively in the training environment. If we want our AI system to help us with long-term goals, we may want the overseers to be much more cautious and uncertain in their feedback (depending on how such feedback is interpreted). Techniques that learn to reason like humans, such as iterated amplification and debate, would by default learn to interpret feedback the way humans do. Nevertheless it will probably be useful to train humans to provide useful feedback: for example, in debate, we want humans to judge which side provided more true and useful information.

Agent foundations

Learning human intent

Handling groups of agents

Theory of Minds: Understanding Behavior in Groups Through Inverse Planning(Michael Shum, Max Kleiman-Weiner et al) (summarized by Richard): This paper introduces Composable Team Hierarchies (CTH), a representation designed for reasoning about how agents reason about each other in collaborative and competitive environments. CTH uses two "planning operators": the Best Response operator returns the best policy in a single-agent game, and the Joint Planning operator returns the best team policy when all agents are cooperating. Competitive policies can then be derived via recursive application of those operations to subsets of agents (while holding the policies of other agents fixed). CTH draws from ideas in level-K planning (in which each agent assumes all other agents are at level K-1) and cooperative planning, but is more powerful than either approach.

The authors experiment with using CTH to probabilistically infer policies and future actions of agents participating in the stag-hunt task; they find that these judgements correlate well with human data.

Richard's opinion: This is a cool theoretical framework. Its relevance depends on how likely you think it is that social cognition will be a core component of AGI, as opposed to just another task to be solved using general-purpose reasoning. I imagine that most AI safety researchers lean towards the latter, but there are some reasons to give credence to the former.