Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

How does Gradient Descent Interact with Goodhart?(Scott Garrabrant): Scott often thinks about optimization using a simple proxy of "sample N points and choose the one with the highest value", where larger N corresponds to more powerful optimization. However, this seems to be a poor model for what gradient descent actually does, and it seems valuable to understand the difference (or to find out that there isn't any significant difference). A particularly interesting subquestion is whether Goodhart's Law behaves differently for gradient descent vs. random search.

Rohin's opinion: I don't think that the two methods are very different, and I expect that if you can control for "optimization power", the two methods would be about equally susceptible to Goodhart's Law. (In any given experiment, one will be better than the other, for reasons that depend on the experiment, but averaged across experiments I don't expect to see a clear winner.) However, I do think that gradient descent is very powerful at optimization, and it's hard to imagine the astronomically large random search that would compare with it, and so in any practical application gradient descent will lead to more Goodharting (and more overfitting) than random search. (It will also perform better, since it won't underfit, as random search would.)

One of the answers to this question talks about some experimental evidence, where they find that they can get different results with a relatively minor change to the experimental procedure, which I think is weak evidence for this position.

The key idea with the original Transformer architecture is to use self-attention layers to analyze sequences instead of something recurrent like an RNN, which has problems with vanishing and exploding gradients. An attention layer takes as input a query q and key-value pairs (K, V). The query q is "compared" against every key k, and that is used to decide whether to return the corresponding value v. In their particular implementation, for each key k, you take the dot product of q and k to get a "weight", which is then used to return the weighted average of all of the values. So, you can think of the attention layer as taking in a query q, and returning the "average" value corresponding to keys that are "similar" to q (since dot product is a measure of how aligned two vectors are). Typically, in an attention layer, some subset of Q, K and V will be learned. With self-attention, Q, K and V are all sourced from the same place -- the result of the previous layer (or the input if this is the first layer). Of course, it's not exactly the output from the previous layer -- if that were the case, there would be no parameters to learn. They instead learn three linear projections (i.e. matrices) that map from the output of the previous layer to Q, K and V respectively, and then feed the generated Q, K and V into a self-attention layer to compute the final output. And actually, instead of having a single set of projections, they have multiple sets that each contain three learned linear projections, that are all then used for attention, and then combined together for the next layer by another learned matrix. This is called multi-head attention.

Of course, with attention, you are treating your data as a set of key-value pairs, which means that the order of the key value pairs does not matter. However, the order of words in a sentence is obviously important. To allow the model to make use of position information, they augment each word and add position information to it. You could do this just by literally appending a single number to each word embedding representing its absolute position, but then it would be hard for the neural net to ask about a word that was "3 words prior". To make this easier for the net to learn, they create a vector of numbers to represent the absolute position based on sinusoids such that "go back 3 words" can be computed by a linear function, which should be easy to learn, and add (not concatenate!) it elementwise to the word embedding.

This model works great when you are working with a single sentence, where you can attend over the entire sentence at once, but doesn't work as well when you are working with eg. entire documents. So far, people have simply broken up documents into segments of a particular size N and trained Transformer models over these segments. Then, at test time, for each word, they use the past N - 1 words as context and run the model over all N words to get the output. This cannot model any dependencies that have range larger than N. The Transformer-XL model fixes this issue by taking the segments that vanilla Transformers use, and adding recurrence. Now, in addition to the normal output predictions we get from segments, we also get as output a new hidden state, that is then passed in to the next segment's Transformer layer. This allows for arbitrarily far long-range dependencies. However, this screws up our position information -- each word in each segment is augmented with absolute position information, but this doesn't make sense across segments, since there will now be multiple words at (say) position 2 -- one for each segment. At this point, we actually want relative positions instead of absolute ones. They show how to do this -- it's quite cool but I don't know how to explain it without going into the math and this has gotten long already. Suffice it to say that they look at the interaction between arbitrary words x_i and x_j, see the terms that arise in the computation when you add absolute position embeddings to each of them, and then change the terms so that they only depend on the difference j - i, which is a relative position.

This new model is state of the art on several tasks, though I don't know what the standard benchmarks are here so I don't know how impressed I should be.

Rohin's opinion: It's quite interesting that even though the point of Transformer was to get away from recurrent structures, adding them back in leads to significant improvements. Of course, the recurrent structure is now at the higher level of segments, rather than at the word or character level. This reminds me a lot of hierarchy -- it seems like we're using the Transformer as a basic building block that works on the ~sentence level so that our RNN-like structure can deal with a higher level of abstraction (which of course also helps with vanishing/exploding gradients).

There's an interesting pattern where hierarchy and structure seem to be a good inductive bias, that let you get good performance with limited compute and data, but as those limits subside, you're better off doing something that has less bias. This would predict that as we get more data and compute, we would want larger Transformer models (i.e. longer segments) and less recurrence. It would be interesting to see if that actually holds.

Technical AI alignment

Iterated amplification sequence

Reliability amplification(Paul Christiano): One hope for building an aligned AI system is to alternate capability amplification and reward engineering (both AN #42) with semi-supervised learning in order to create an agent that becomes more and more capable with more iterations, while remaining aligned. However, our agents will likely always have a small probability of failing, and when we use capability amplification, that failure probability is also amplified (since each of the subagents could fail). If we do enough capability amplifiation, the failure probability could approach 1. This motivates the problem of reliability amplification, which aims to take an agent that fails with small probability ε and produce a new agent that fails with significantly lower probability. As with capability amplification, the new agent can take much more time and compute than the original. The hope would be that the combination of capability and reliability amplification would allow us to take an initial agent and make it both more capable and more reliable (even though capability amplification reduces reliability, and reliability amplification will probably reduce capability).

One simple thing to do would be to sample three agents, have them collaboratively settle on an action. If all three agents are aligned, we're fine. The chance of two or three of the agents being malign is small enough that we can ignore it. The interesting case is if one agent is malign and two are benign. If there were only one correct answer, we could simply take a majority vote and get the right answer, and we would have amplified reliability to order ε^2 (which happens when you get two or more malign agents). However, different benign agents could give different answers, since there is no one "correct" benign answer. If we were to choose randomly, as you might expect we get no benefit: we'd have a ~3ε chance of having exactly one malign agent, and a 1/3 chance of failing in that case, keeping the failure probability at ε. The hope for reducing the failure probability is if the two benign agents can "recognize" each other as benign, and so that the majority vote has a >2/3 chance of choosing one of the actions that the benign agents suggested. This does depend on all of the agents having similar capabilities -- if the malign agent is a superintelligence while the two benign agents are human-level, then it seems likely that the malign superintelligence would be able to convince the benign agents to go along with its plan.

Rohin's opinion: It seems like this requires the assumption that our agents have a small probability of failure on any given input. I think this makes sense of we are thinking of reliability of corrigibility (AN #35). That said, I'm pretty confused about what problem this technique is trying to protect against, which I wrote about in this comment.

Value learning sequence

Conclusion to the sequence on value learning(Rohin Shah): This post summarizes the value learning sequence, putting emphasis on particular parts. I recommend reading it in full -- the sequence did have an overarching story, which was likely hard to keep track of over the three months that it was being published.

Technical agendas and prioritization

Reward learning theory

One-step hypothetical preferences and A small example of one-step hypotheticals(Stuart Armstrong) (summarized by Richard): We don't hold most of our preferences in mind at any given time - rather, they need to be elicited from us by prompting us to think about them. However, a detailed prompt could be used to manipulate the resulting judgement. In this post, Stuart discusses hypothetical interventions which are short enough to avoid this problem, while still causing a human to pass judgement on some aspect of their existing model of the world - for example, being asked a brief question, or seeing something on a TV show. He defines a one-step hypothetical, by contrast, as a prompt which causes the human to reflect on a new issue that they hadn't considered before. While this data will be fairly noisy, he claims that there will still be useful information to be gained from it.

Richard's opinion: I'm not quite sure what overall point Stuart is trying to make. However, if we're concerned that an agent might manipulate humans, I don't see why we should trust it to aggregate the data from many one-step hypotheticals, since "manipulation" could then occur using the many degrees of freedom involved in choosing the questions and interpreting the answers.

Preventing bad behavior

Interpretability

How much can value learning be disentangled?(Stuart Armstrong) (summarized by Richard): Stuart argues that there is no clear line between manipulation and explanation, since even good explanations involve simplification, omissions and cherry-picking what to emphasise. He claims that the only difference is that explanations give us a better understanding of the situation - something which is very subtle to define or measure. Nevertheless, we can still limit the effects of manipulation by banning extremely manipulative practices, and by giving AIs values that are similar to our own, so that they don't need to manipulate us very much.

Richard's opinion: I think the main point that explanation and manipulation can often look very similar is an important one. However, I'm not convinced that there aren't any ways of specifying the difference between them. Other factors which seem relevant include what mental steps the explainer/manipulator is going through, and how they would change if the statement weren't true or if the explainee were significantly smarter.

Adversarial examples

Theoretically Principled Trade-off between Robustness and Accuracy(Hongyang Zhang et al) (summarized by Dan H): This paper won the NeurIPS 2018 Adversarial Vision Challenge. For robustness on CIFAR-10 against l_infinity perturbations (epsilon = 8/255), it improves over the Madry et al. adversarial training baseline from 45.8% to 56.61%, making it almost state-of-the-art. However, it does decrease clean set accuracy by a few percent, despite using a deeper network than Madry et al. Their technique has many similarities to Adversarial Logit Pairing, which is not cited, because they encourage the network to embed a clean example and an adversarial perturbation of a clean example similarly. I now describe Adversarial Logit Pairing. During training, ALP teaches the network to classify clean and adversarially perturbed points; added to that loss is an l_2 loss between the logit embeddings of clean examples and the logits of the corresponding adversarial examples. In contrast, in place of the l_2 loss from ALP, this paper uses the KL divergence from the softmax of the clean example to the softmax of an adversarial example. Yet the softmax distributions are given a high temperature, so this loss is not much different from an l_2 loss between logits. The other main change in this paper is that adversarial examples are generated by trying to maximize the aforementioned KL divergence between clean and adversarial pairs, not by trying to maximize the classification log loss as in ALP. This paper then shows that some further engineering to adversarial logit pairing can improve adversarial robustness on CIFAR-10.

Field building

The case for building expertise to work on US AI policy, and how to do it(Niel Bowerman): This in-depth career review makes the case for working on US AI policy. It starts by making a short case for why AI policy is important; and then argues that US AI policy roles in particular can be very impactful (though they would still recommend a policy position in an AI lab like DeepMind or OpenAI over a US AI policy role). It has tons of useful detail; the only reason I'm not summarizing it is because I suspect that most readers are not currently considering career choices, and if you are considering your career, you should be reading the entire article, not my summary. You could also check out Import AI's summary.

Miscellaneous (Alignment)

Can there be an indescribable hellworld?(Stuart Armstrong) (summarized by Richard): This short post argues that it's always possible to explain why any given undesirable outcome doesn't satisfy our values (even if that explanation needs to be at a very high level), and so being able to make superintelligences debate in a trustworthy way is sufficient to make them safe.