Critch on career advice for junior AI-x-risk-concerned researchers

In a recent e-mail thread, Andrew Critch sent me the following “subtle problem with sending junior AI-x-risk-concerned researchers into AI capabilities research”. Here’s the explanation he wrote of his view, shared with his permission:

I’m fairly concerned with the practice of telling people who “really care about AI safety” to go into AI capabilities research, unless they are very junior researchers who are using general AI research as a place to improve their skills until they’re able to contribute to AI safety later. (See Leveraging Academia).

The reason is not a fear that they will contribute to AI capabilities advancement in some manner that will be marginally detrimental to the future. It’s also not a fear that they’ll fail to change the company’s culture in the ways they’d hope, and end up feeling discouraged. What I’m afraid of is that they’ll feel pressure to start pretending to themselves, or to others, that their work is “relevant to safety”. Then what we end up with are companies and departments filled with people who are “concerned about safety”, creating a false sense of security that something relevant is being done, when all we have are a bunch of simmering concerns and concomitant rationalizations.

This fear of mine requires some context from my background as a researcher. I see this problem with environmentalists who “really care about climate change”, who tell themselves they’re “working on it” by studying the roots of a fairly arbitrary species of tree in a fairly arbitrary ecosystem that won’t generalize to anything likely to help with climate change.

My assessment that their work won’t generalize is mostly not from my own outside view; it comes from asking the researcher about how their work is likely to have an impact, and getting a response that either says nothing more than “I’m not sure, but it seems relevant somehow”, or an argument with a lot of caveats like “X might help with Y, which might help with Z, which might help with climate change, but we really can’t be sure, and it’s not my job to defend the relevance of my work. It’s intrinsically interesting to me, and you never know if something could turn out to be useful that seemed useless at first.”

At the same time, I know other climate scientists who seem to have actually done an explicit or implicit Fermi estimate for the probability that they will personally soon discover a species of bacteria that could safely scrub the Earth’s atmosphere of excess carbon. That’s much better.

I’ve seen the same sort of problem with political scientists who are “really concerned about nuclear war” who tell themselves they’re “working on it” by trying to produce a minor generalization of an edge case of a voting theorem that, when asked, they don’t think will be used by anyone ever.

At the same time, I know other political scientists who seem to be trying really hard to work backward from a certain geopolitical outcome, and earnestly working out the details of what the world would need to make that outcome happen. That’s much better.

Having said this, I do think it’s fine and good if society wants to sponsor a person to study obscure roots of obscure trees that probably won’t help with climate change, or edge cases of theorems that no one will ever use or even take inspiration from, but I would like everyone to be on the same page that in such cases what we’re sponsoring is intellectual freedom and development, and not climate change prevention or nuclear war prevention. If folks want to study fairly obscure phenomena because it feels like the next thing their mind needs to understand the world better, we shouldn’t pressure them to have to think that the next thing they learn might “stop climate change” or “prevent nuclear war”, or else we fuel the fire of false pretenses about which of the world’s research gaps are being earnestly taken care of.

Unfortunately, the above pattern of “justifying” research by just reflecting on what you care about, rationalizing it, and not checking the rationalization for rationality, appears to me to be extremely prevalent among folks who care about climate change or nuclear war, and this is not something I want to see replicated elsewhere, especially not in the burgeoning fields of AI safety, AI ethics, or AI x-risk reduction. And I’m concerned that if we tell folks to go into AI research just to “be concerned”, we’ll be fueling a false sense of security by filling departments and companies with people who “seem to really care” but aren’t doing correspondingly relevant research work, and creating a research culture where concerns about safety, ethics, or x-risk do not result in actually prioritizing research into safety, ethics, or x-risk.

When you’re giving general-purpose career advice, the meme “do AI yourself, so you’re around to help make it safe” is a really bad meme. It fuels a narrative that says “Being a good person standing next to the development of dangerous tech makes the tech less dangerous.” Just standing nearby doesn’t actually help unless you’re doing technical safety research. Just standing nearby does create a false sense of security through the mere-exposure effect. And the “just stand nearby” attitude drives people to worsen race conditions by creating new competitors in different geographical locations, so they can exercise their Stand Nearby powers to ensure the tech is safe.

Important: the above paragraphs are advice about what advice to give, because of the social pressures and tendencies to rationalize that advice-giving often produces. By contrast, if you’re a person who’s worried about AI, and thinking about a career in AI research, I do not wish to discourage you from going into AI capabilities research. To you, what I want to say is something different....

Step 1: Learn by doing. Leverage Academia. Get into a good grad school for AI research, and focus first on learning things that feel like they will help you personally to understand AI safety better (or AI ethics, or AI x-risk; replace by your area of interest throughout). Don’t worry about whether you’re “contributing” to AI safety too early in your graduate career. Before you’re actually ready to make real contributions to the field, try to avoid rationalizing doing things because “they might help with safety”; instead, do things because “they might help me personally to understand safety better, in ways that might be idiosyncratic to me and my own learning process.”

Remember, what you need to learn to understand safety, and what the field needs to progress, might be pretty different, and you need to have the freedom to learn whatever gaps seem important to you personally. Early in your research career, you need to be in “consume” mode more than “produce” mode, and it’s fine if your way of “consuming” knowledge and skill is to “produce” things that aren’t very externally valuable. So, try to avoid rationalizing the externally-usable safety-value of ideas or tools you produce on your way to understanding how to produce externally-usable safety research later.

The societal value of you producing your earliest research results will be that they help you personally to fill gaps in your mind that matter for your personal understanding of AI safety, and that’s all the justification you need in my books. So, do focus on learning things that you need to understand safety better, but don’t expect those things to be a “contribution” that will matter to others.

Step 2: Once you’ve learned enough that you’re able to start contributing to research in AI safety (or ethics, or x-risk), then start focusing directly on making safety research contributions that others might find insightful. When you’re ready enough to start actually producing advances in your field, that’s when it’s time to start thinking about what the social impact of those advances would be, and start shifting your focus somewhat away from learning (consuming) and somewhat more toward contributing (producing).

I like the idea of optimizing for career growth & AI safety separately. However, I’m not sure the difference between “capabilities research” and “safety research” is as clear-cut as Critch makes it sound.

Consider the problem of making ML more data-efficient. Superficially, this is “capabilities research”: I don’t think it appears on any AI safety research agenda, and it’s an established mainstream research area.

However, in order to do value learning, I think we’ll want ML to become much more data-efficient than it is currently. If ML is not data-efficient, then assembling a dataset for our values will be time-consuming, which might tempt arms race participants to cut corners.

And if we could make ML really data-efficient, that gets us closer to “do what I mean” systems where you give it a few examples of things to do/​not do and it’s able to correctly infer your intent.

So does that mean the AI safety community should work on making ML more data-efficient? I’m not sure. I can think of arguments on both sides.

But my personal view is that answering these kinds of “differential capabilities research” questions is higher-impact than a lot of the AI safety work that is being done. As far as I can tell, most existing AI safety work either

(a) Treats safety as an applications problem, where we try to use existing AI techniques to prototype what safe systems might look like. But I expect such prototypes will be thrown away as the state of the art advances. Arguably, you hit the point of diminishing returns with this approach as soon as you finish your architecture diagram (since that’s the part that’s least likely to change as the field advances).

(b) Treats safety as a security problem, where we try to think of flaws AI systems might have and how we might guard against them. But flaws only exist in the context of particular systems. The C programming language has a lot of security issues due to the fact that strings are null-terminated. There’s a massive cottage industry built around exploiting and guarding against C-specific issues. But this is all historically contingent: We only care about this because C is a popular programming language. If C was not popular, this cottage industry wouldn’t exist.

Instead I would suggest a third approach:

(c) Treat safety as a differential technological development problem. Try to figure out which capabilities are on the critical path for FAI but not on the critical path for UFAI. Try to evaluate competing AI paradigms and forecast which could most easily evolve into a secure system, then try to improve benchmarks for that platform so it can win the standards war. If none of the existing paradigms seem likely to be adequate, maybe devise a new paradigm de novo. Don’t forget about sociological factors.

Note that approach (c) looks a lot more like “capabilities research” than “safety research”. It requires careful judgement calls by domain experts. Work of types (a) and (b) will likely be useful to inform those judgement calls. But (c) is the way to go in the long run, IMO. If you were an effective altruist living during the 1980s trying to ensure that computers of the future would be secure, I think promoting the adoption of a non-C programming language would likely be the highest-leverage thing to do.

[This ended up being a pretty long tangent, maybe I should make this comment into a toplevel post? Perhaps people could tell me if/​why they disagree first.]

Well, here is a list of paradigms that might overtake deep learning. This list could probably be expanded, e.g. by researching various attempts to integrate deep learning with Bayesian reasoning, create more interpretable models, etc.

Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc. Additionally, there are pragmatic considerations related to whether a particular paradigm has a serious hope of widespread adoption. How competitive is it? Does it address researcher complaints about deep learning?

Then you could create a 2d matrix with paradigms on one axis and desiderata on another axis. For each paradigm/desideratum combo, figure out if that paradigm satisfies, or could be improved to satisfy, that desideratum. As you do this you’d probably get ideas for new rows/columns in your matrix.

Then you could look at your matrix and try to figure out which paradigms are most promising for FAI—or if none seem good enough, invent a new one. Do technology evangelism for the chosen paradigm(s). Try to improve the paradigm’s resume of accomplishments. Rally the AI safety community.
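The matrix exercise described above could be kept as a small data structure. What follows is only a minimal sketch: the paradigm names and the cell values are placeholders invented for illustration, not actual assessments, and the scoring rule in `rank` is an arbitrary assumption.

```python
from enum import Enum

class Status(Enum):
    YES = "satisfies"
    MAYBE = "could be improved to satisfy"
    NO = "does not satisfy"
    UNKNOWN = "not yet assessed"

def make_matrix(paradigms, desiderata):
    """Start with every paradigm/desideratum cell unassessed."""
    return {p: {d: Status.UNKNOWN for d in desiderata} for p in paradigms}

def unassessed(matrix):
    """List the cells that still need someone to think about them."""
    return [(p, d) for p, row in matrix.items()
            for d, s in row.items() if s is Status.UNKNOWN]

def rank(matrix):
    """Order paradigms by a crude score (assumed weights: YES=2, MAYBE=1)."""
    score = {Status.YES: 2, Status.MAYBE: 1, Status.NO: 0, Status.UNKNOWN: 0}
    totals = {p: sum(score[s] for s in row.values()) for p, row in matrix.items()}
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical paradigms; desiderata taken from the list in the text.
paradigms = ["paradigm_A", "paradigm_B"]
desiderata = ["adversarial examples", "distributional shift",
              "interpretability", "calibration"]

matrix = make_matrix(paradigms, desiderata)
matrix["paradigm_A"]["calibration"] = Status.MAYBE       # placeholder value
matrix["paradigm_B"]["interpretability"] = Status.YES    # placeholder value

print(rank(matrix))
print(len(unassessed(matrix)), "cells still unassessed")
```

New rows or columns are just new dictionary keys, which fits the expectation that filling in the matrix will suggest more paradigms and desiderata.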

Computer science, and AI in particular, have always been hype-driven. The market in paradigms doesn’t seem efficient or driven purely by questions of technical merit. And there can be a lot of path-dependence. As AI safety concerns gain mindshare, I think we stand a solid chance of influencing which paradigms gain traction.

Another approach to differential capabilities development is to try to identify an application of AI which shares a lot of features with the AI safety problem and demonstrate its commercial viability. For example, self-driving cars are safety-critical in nature, which seems good. But they also must make real-time decisions, whereas it is probably desirable for an FAI to spend time pondering the nature of our values, ask us clarifying questions, etc.

Fun fact: Silicon Valley’s new behemoth investor is a believer in the technological singularity. It’s too bad the Singularity Summit is not still a thing or he could be invited to speak.

Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc.

For most of these examples, the current research in safety is more like “Try to find any approach that has a hope of satisfying that desideratum while being competitive.”

So your matrix just ends up being a lot of “no” or “maybe if we did more research.”

It seems correct that people are trying to “find some approach that might work” before they try “rally the community around an approach that might work.”

Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested. So an extreme level of pessimism doesn’t seem especially well-justified to me. It seems relatively common for a task to be hard in one framework while being easy in another.

Standard CFAR advice: Instead of assuming a problem is unsolvable, sit down and try to think of a solution for a timed 5 minutes. Has anyone spent a timed 5 minutes trying to figure out, say, how vulnerable gcForest is likely to be to adversarial examples? You don’t necessarily have to solve all the problems yourself, either: 5 minutes of research is enough to determine that creating models which “correctly capture uncertainty” seems to be one of Uber’s design goals with Pyro (which seems related to calibration/​robustness to distributional shift).

BTW, I’ve spent a fair amount of time thinking about & reading about creativity, and I don’t think extreme pessimism is at all conducive to generating ideas. If your evidence for a problem being hard is “I couldn’t think of any good approaches”, and you were pretty sure there weren’t any good approaches before you started thinking, I don’t find that evidence super compelling.

It seems correct that people are trying to “find some approach that might work” before they try “rally the community around an approach that might work.”

I agree. That’s why I suggested going breadth-first initially.

Even if pessimism is justified, I think a breadth-first approach is sensible if it’s possible to estimate the difficulty of overcoming various problems in the context of various frameworks in advance. If making any progress at all is expected to be hard, all the more reason to choose targets strategically.

5 minutes of research is enough to determine that creating models which “correctly capture uncertainty” seems to be one of Uber’s design goals with Pyro (which seems related to calibration/​robustness to distributional shift)

In fact there is a (vast) literature on this topic.

Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested.

Go for it.

It seems relatively common for a task to be hard in one framework while being easy in another.

I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).

FWIW, this claim doesn’t match my intuition, and googling around, I wasn’t able to quickly find any papers or blog posts supporting it. This 2015 blog post discusses how deep learning models are susceptible due to linearity, which makes intuitive sense to me; the dot product is a relatively bad measure of similarity between vectors. It proposes a strategy for finding adversarial examples for a random forest and says it hasn’t yet empirically been confirmed that random forests are unsafe. This empirical confirmation seems pretty important to me, because adversarial examples are only a thing because the wrong decision boundary has been learned. If the only way to create an “adversarial” example for a random forest is to perturb an input until it genuinely appears to be a member of a different class, that doesn’t seem like a flaw. (I don’t expect that random forests always learn the correct decision boundaries, but my offhand guess would be that they are still less susceptible to adversarial examples than traditional deep models.)

I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).

From my perspective, a lot of AI safety challenges get vastly easier if you have the ability to train well-calibrated models for complex, unstructured data. If you have this ability, the AI’s model of human values doesn’t need to be perfect—since the model is well-calibrated, it knows what it does/​does not know and can ask for clarification as necessary.

Calibration could also provide a very general solution for corrigibility: If the AI has a well-calibrated model of which actions are/​are not corrigible, and just how bad various incorrigible actions are, then it can ask for clarification as needed on that too. Corrigibility learning allows for notions of corrigibility that are very fine-grained: you can tell the AI that preventing a bad guy from flipping its off switch is OK, but preventing a good guy from flipping its off switch is not OK. By training a model, you don’t have to spend a lot of time hand-engineering, and the model will hopefully generalize to incorrigible actions that the designers didn’t anticipate. (Per the calibration assumption, the model will usually either generalize correctly, or the AI will realize that it doesn’t know whether a novel plan would qualify as corrigible, and it can ask for clarification.)

That’s why I currently think improving ML models is the action with the highest leverage.
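The “ask for clarification when unsure” idea can be sketched as a confidence threshold on top of a model’s predicted probabilities, together with a rough check of whether those probabilities are calibrated at all. This is an illustrative sketch only: the function names, the 0.9 threshold, and the ten-bin calibration estimate are assumptions, and thresholding is only meaningful to the extent that the probabilities really are calibrated.

```python
import numpy as np

def act_or_ask(probs, threshold=0.9):
    """Act on the most likely option only if the model is confident enough;
    otherwise defer to a human. The threshold is an assumed parameter."""
    probs = np.asarray(probs, dtype=float)
    best = int(np.argmax(probs))
    if probs[best] >= threshold:
        return ("act", best)
    return ("ask", None)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned gap between stated confidence and observed accuracy.
    Near zero means the confidences can be taken roughly at face value."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

print(act_or_ask([0.05, 0.95]))   # confident case
print(act_or_ask([0.60, 0.40]))   # uncertain case
```

A model that says 0.9 but is right only half the time would pass the threshold while being wrong often, which is why the calibration check matters as much as the threshold.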

FWIW, this claim doesn’t match my intuition, and googling around, I wasn’t able to quickly find any papers or blog posts supporting it.

“Explaining and Harnessing Adversarial Examples” (Goodfellow et al. 2014) is the original demonstration that “Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples”.
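A toy version of the linearity argument, as a sketch: for a purely linear score w·x, nudging every input coordinate by a tiny eps in the direction sign(w) shifts the score by eps times the sum of |w_i|, which grows with the number of dimensions. The weights below are an artificial assumption (every |w_i| = 1), chosen so the growth is easy to see; real networks are not purely linear.

```python
import numpy as np

def logit_shift(d, eps=0.01, seed=0):
    """Change in a linear score w.x when each coordinate of x moves by eps
    in the direction sign(w). Toy weights with |w_i| = 1 (an assumption)."""
    rng = np.random.default_rng(seed)
    w = rng.choice([-1.0, 1.0], size=d)
    x = rng.normal(size=d)
    # For a linear model the gradient of the score with respect to x is w,
    # so this is a sign-of-gradient step of size eps per coordinate.
    x_adv = x + eps * np.sign(w)
    return float(w @ x_adv - w @ x)

for d in (10, 1_000, 100_000):
    print(f"d={d:>7}  per-coordinate change=0.01  score shift={logit_shift(d):.2f}")
```

No single coordinate changes by more than 0.01, yet the total shift scales linearly with d, which is the sense in which high dimensionality does the work.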

I’ll emphasize that high-dimensionality is a crucial piece of the puzzle, which I haven’t seen you bring up yet. You may already be aware of this, but I’ll emphasize it anyway: the usual intuitions do not even remotely apply in high-dimensional spaces. Check out Counterintuitive Properties of High Dimensional Space.

adversarial examples are only a thing because the wrong decision boundary has been learned

In my opinion, this is spot-on—not only your claim that there would be no adversarial examples if the decision boundary were perfect, but in fact a group of researchers are beginning to think that in a broader sense “adversarial vulnerability” and “amount of test set error” are inextricably linked in a deep and foundational way—that they may not even be two separate problems. Here are a few citations that point at some pieces of this case:

“Adversarial Spheres” (Gilmer et al. 2017) - “For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d).” (emphasis mine)

I think this paper is truly fantastic in many respects.

The central argument can be understood from the intuitions presented in Counterintuitive Properties of High Dimensional Space in the section titled Concentration of Measure (Figure 9). Where it says “As the dimension increases, the width of the band necessary to capture 99% of the surface area decreases rapidly,” you can just replace that with “As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere”. “Small distance from the center of the sphere” is what gives rise to “Small epsilon at which you can find an adversarial example”.

To summarize, my belief is that any model that is trying to learn a decision boundary in a high-dimensional space, and is basically built out of linear units with some nonlinearities, will be susceptible to small-perturbation adversarial examples so long as it makes any errors at all.
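The concentration-of-measure claim can be checked numerically. The sketch below covers only the hyperplane special case: it samples points uniformly on the unit sphere in d dimensions and estimates how far from the equator a hyperplane must sit to cut off a cap holding 1% of the surface. The sampling shortcut (first coordinate from a normal draw, the remaining squared norm from a chi-square draw) and the sample size are implementation choices made here, not anything from the paper.

```python
import numpy as np

def cap_height(d, error_rate=0.01, n=200_000, seed=0):
    """Height above the equator of the hyperplane x_1 = t that cuts off a cap
    containing `error_rate` of the unit sphere's surface in d dimensions."""
    rng = np.random.default_rng(seed)
    # A uniform point on the sphere is a normalized Gaussian vector. Only the
    # first coordinate is needed, so draw it directly and draw the squared
    # norm of the other d-1 coordinates as a chi-square variable.
    g0 = rng.normal(size=n)
    rest_sq = rng.chisquare(d - 1, size=n)
    x1 = g0 / np.sqrt(g0**2 + rest_sq)
    return float(np.quantile(x1, 1.0 - error_rate))

for d in (10, 100, 1_000, 10_000):
    h = cap_height(d)
    print(f"d={d:>6}  height={h:.4f}  height*sqrt(d)={h * np.sqrt(d):.2f}")
```

The last column should settle near a constant as d grows, meaning the boundary’s distance from the equator shrinks like 1/√d, which matches the scaling quoted from the abstract.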

(As a note—not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for “adversarial examples linearity” in an incognito tab)

As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere

What does the center of the sphere represent in this case?

(I’m imagining the training and test sets consisting of points in a high-dimensional space, and the classifier as drawing a hyperplane to mostly separate them from each other. But I’m not sure what point in this space would correspond to the “center”, or what sphere we’d be talking about.)

“Adversarial Spheres” (Gilmer et al. 2017) - “For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d).” (emphasis mine)

Slightly off-topic, but quick terminology question. When I first read the abstract of this paper, I was very confused about what it was saying and had to re-read it several times, because of the way the word “tradeoff” was used.

I usually think of a tradeoff as an inverse relationship between two good things that you want both of. But in this case they use “tradeoff” to refer to the inverse relationship between “test error”, and “average distance to nearest error”. Which is odd, because the first of those is bad and the second is good, no?

Is there something I’m missing that causes this to sound like a more natural way of describing things to others’ ears?

a group of researchers are beginning to think that in a broader sense “adversarial vulnerability” and “amount of test set error” are inextricably linked in a deep and foundational way—that they may not even be two separate problems.

I’d expect this to be true or false depending on the shape of the misclassified region. If you think of the input space as a white sheet, and the misclassified region as red polka dots, then we measure test error by throwing a dart at the sheet and checking if it hits a polka dot. To measure adversarial vulnerability, we take a dart that landed on a white part of the sheet and check the distance to the nearest red polka dot. If the sheet is covered in tiny red polka dots, this distance will be small on average. If the sheet has just a few big red polka dots, this will be larger on average, even if the total amount of red is the same.
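The sheet analogy is easy to simulate in two dimensions. In this sketch the red dots are equal-sized discs centered on a k-by-k grid over the unit square, with the total red area held fixed; the grid layout, the 5% red area, and the dart count are all assumptions chosen for simplicity. It says nothing about whether the same picture holds in high dimensions.

```python
import numpy as np

def sheet_stats(k, red_area=0.05, n=20_000, seed=0):
    """Return (test_error, mean distance from a white dart to the nearest red
    dot) for a unit sheet with k*k equal discs of total area `red_area`."""
    rng = np.random.default_rng(seed)
    radius = np.sqrt(red_area / (np.pi * k * k))
    darts = rng.random((n, 2))
    # Dots are equal-sized and sit at grid-cell centers, so the nearest dot
    # to a dart is always the one in the dart's own grid cell.
    nearest_center = (np.floor(darts * k) + 0.5) / k
    gap = np.linalg.norm(darts - nearest_center, axis=1) - radius
    is_red = gap < 0
    return float(is_red.mean()), float(gap[~is_red].mean())

for k in (2, 20):
    err, dist = sheet_stats(k)
    print(f"{k * k:>4} dots  test error={err:.3f}  mean distance to red={dist:.4f}")
```

Both sheets have roughly the same test error, but the sheet with many tiny dots puts a typical white point much closer to an error than the sheet with a few big dots.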

As a concrete example, suppose we trained a 1-nearest-neighbor classifier for 2-dimensional RGB images. Then the sheet is mostly red (because this is a terrible model), but there are splotches of white associated with each image in our training set. So this is a model that has lots of test error despite many spheres with 0% misclassifications.

To measure the size of the polka dots, you could invert the typical adversarial perturbation procedure: Start with a misclassified input and find the minimal perturbation necessary to make it correctly classified.

(It’s possible that this sheet analogy is misleading due to the nature of high-dimensional spaces.)

Anyway, this relates back to the original topic of conversation: the extent to which capabilities research and safety research are separate. If “adversarial vulnerability” and “amount of test set error” are inextricably linked, that suggests that reducing test set error (“capabilities” research) improves safety, and addressing adversarial vulnerability (“safety” research) advances capabilities. The extreme version of this position is that software advances are all good and hardware advances are all bad.

(As a note—not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for “adversarial examples linearity” in an incognito tab)

Thanks. I’d seen both papers, but I don’t like linking to things I haven’t fully read.

Thanks. I’d seen both papers, but I don’t like linking to things I haven’t fully read.

I might just be confused, but this sentence seems like a non sequitur to me. I understood catherio to be responding to your comment about googling and not finding “papers or blog posts supporting [the claim that deep learning is not unusually susceptible to adversarial examples]”.

If that was already clear to you then, never mind. I was just confused why you were talking about linking to things, when before the question seemed to be about what could be found by googling.

googling around, I wasn’t able to quickly find any papers or blog posts supporting it

I think it’s a little bit tricky because decision trees don’t work that well for the tasks where people usually study adversarial examples. And this isn’t my research area so I don’t know much about it.

That said, in addition to the paper Wei Dai linked, there is also this, showing that adversarial examples for neural nets transfer pretty well to decision trees (though I haven’t looked at that paper in any detail).

Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested.

I think blog posts are potentially weird measures of effort, here. I also think that this is something that people are interested in doing—I think it’s a component of MIRI’s strategic sketch here, as part 8--but isn’t the sort of thing where we have anything particularly worthwhile to show for it yet.

Perhaps it makes sense to sketch an argument for why none of the standard paradigms satisfy some desideratum? This is kind of what AI Safety Gridworlds did. But it’s more the thing where, say, gradient boosted random forests have more of the ‘transparency’ property in a particular, legalistic way (it’s easier to figure out blame for any particular classification than it would be with a neural net) but not in the way that we actually care about (looking at a gradient boosted random forest, we could figure out if it’s thinking about things in the way that we want it to be thinking about), which might actually be easier with a neural net (because we could look at what neuron activations correspond to).

I have a model that there’s something like a Pareto distribution where 20% of the people in a field contribute 80% of the Actually Important advances, and about 80% of those advances in turn come from a further 20% of those people: the ones who are deliberately and strategically choosing fields such that they can rationally expect to make advances. This implies that, for instance, in climate change there will be ~4% of people who have actually done a Fermi estimate of their impact on climate change, and they will contribute ~64% of the relevant advances in the field.
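The arithmetic behind those figures is just the 80/20 split applied twice. A quick sketch, taking the 20% and 80% figures from the comment as given rather than as measured quantities:

```python
people_share = 0.20    # fraction of people, per the 80/20 model above
advance_share = 0.80   # fraction of advances they account for

# Apply the same split a second time within the top group.
strategic_people = people_share * people_share        # about 4% of the field
strategic_advances = advance_share * advance_share    # about 64% of advances
other_advances = 1.0 - strategic_advances             # about 36% from everyone else

print(f"strategic people:   {strategic_people:.0%}")
print(f"their advances:     {strategic_advances:.0%}")
print(f"everyone else's:    {other_advances:.0%}")
```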

One thing you can say is that this is awful, and you really would like to have a field without this ridiculous distribution, and try to tell people to really wait to go into this field so they can contribute to Actually Important things. But it seems like there are a lot of countervailing forces preventing this, including the status incentive of saying “this is a field only for people who work on Actually Important things.” If your timelines are really short, you might not be worried about this, but it does seem like something to worry about over a decade or so of putting this message out in a specific field.

The other way to handle it would be to expect the Pareto distribution to happen because most people just aren’t strategic about their careers, and do rationalize. The goal in that case is to just try and grow the field as much as possible, and know that some small percentage of the people who go into the field will be strategic thinkers who will contribute quite a bit. Not only does this strategy seem to actually capture the pattern of fields that have grown and made significant advances in solving problems, but it also has the benefit of getting the additional ~36% of Actually Important advances that come from people who aren’t strategically trying to create impact.

Somewhere, recently, I saw someone comment almost in passing that grad school shouldn’t cost anything. I can’t find the source now. Maybe someone can clarify if that’s a serious claim? I’ve been under the impression for a while that grad school and academia would be an awfully expensive way to acquire the prerequisite knowledge for AI safety work.

Expensive in terms of time, perhaps, but almost all good universities in the US and continental Europe provide decent salaries to PhD students. UK is a bit more haphazard but it’s still very rare for UK PhDs to actually pay to be there, especially in technical fields.

Specifically, the salary is for being a teaching assistant or a research assistant, rather than being a student, but everything is structured under the assumption that graduate students will have a relevant part-time job that covers tuition and living expenses.

I think that’s true in the US, but not in most of Europe. E.g. in Switzerland a first year PhD student gets paid $40000 a year WITHOUT doing any teaching, and more if they teach. That’s unusually generous, but I think the setup isn’t uncommon.

Thank you for posting this here, I mostly agree with the statement that acquiring skills early on is more important than producing anything directly. There’s one thing that bugs me, however.

Early in your research career, you need to be in “consume” mode more than “produce” mode [...]

Counterpoint: if you spend most of your early research career in “consume” mode, you won’t get any practice at producing valuable research or even know whether producing science is a good fit for you personally. I’ve personally seen people who are extremely good at processing large amounts of content during their studies, but got completely lost when tasked with finding and studying a novel problem that no-one had written on before. This seems like some kind of trap that many PhD students run into. Sometimes there’s just no good way to learn how to do research other than, y’know, by doing research.