In this post I summarise four lines of argument for why we should be skeptical about the potential of deep learning in its current form. I am fairly confident that the next breakthroughs in AI will come from some variety of neural network, but I think several of the objections below are quite a long way from being overcome.

Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution—Pearl, 2018

Pearl describes three levels at which you can make inferences: association, intervention, and counterfactual. The first is statistical, identifying correlations—this is the level at which deep learning operates. The intervention level concerns changes to the present or future—it answers questions like “What will happen if I do y?” The counterfactual level answers questions like “What would have happened if y had occurred?” Each successive level is strictly more powerful than the previous one: you can’t figure out the effects of an action at the association level alone, without a causal model, since actions are treated as interventions which override a variable’s existing causes. Unfortunately, current machine learning systems are largely model-free.
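The gap between the first two levels can be made concrete with a small simulation (a toy model of my own, not Pearl's): with a hidden confounder Z driving both X and Y, the observational conditional P(Y|X=1)—which is all an association-level method can estimate—differs sharply from the interventional P(Y|do(X=1)).

```python
import random

random.seed(0)

def sample(do_x=None):
    # A toy structural causal model (my own example): Z -> X, Z -> Y, X -> Y.
    # Passing do_x overrides X's usual cause -- that is an intervention.
    z = 1 if random.random() < 0.5 else 0
    if do_x is None:
        x = 1 if random.random() < (0.9 if z else 0.1) else 0
    else:
        x = do_x
    y = 1 if random.random() < 0.2 + 0.6 * z + 0.1 * x else 0
    return x, y

N = 100_000
obs = [sample() for _ in range(N)]
# Association level: condition on having observed X=1 (inherits Z's influence).
p_y_given_x1 = sum(y for x, y in obs if x == 1) / sum(x for x, _ in obs)
# Intervention level: set X=1 regardless of Z.
p_y_do_x1 = sum(sample(do_x=1)[1] for _ in range(N)) / N

print(round(p_y_given_x1, 2), round(p_y_do_x1, 2))
```

The conditional estimate comes out around 0.84 while the interventional one is around 0.60: the direct effect of X here is only +0.1, and the rest of the observed association is confounding.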

Causal assumptions and conclusions can be encoded in the form of graphical models, where a directed arrow between two nodes represents a causal influence. Constraints on the structure of a graph can be determined by seeing which pairs of variables are independent when controlling for which other variables: sometimes controlling removes dependencies, but sometimes it introduces them. Pearl’s main claim is that this sort of model-driven causal analysis is an essential step towards building human-level reasoning capabilities. He identifies several important concepts—such as counterfactuals, confounding, causation, and incomplete or biased data—which his framework is able to reason about, but which current approaches to ML cannot deal with.
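The "controlling can introduce dependencies" point is the classic collider (explaining-away) effect. A quick simulation with two hypothetical independent causes of a common effect illustrates it:

```python
import random

random.seed(1)

# Two independent causes X, Z with a common effect C (a collider):
#   X -> C <- Z, where C fires if either cause fires.
data = []
for _ in range(100_000):
    x = random.random() < 0.5
    z = random.random() < 0.5
    data.append((x, z, x or z))

def p(event, given=lambda r: True):
    rows = [r for r in data if given(r)]
    return sum(event(r) for r in rows) / len(rows)

p_z          = p(lambda r: r[1])
p_z_given_x  = p(lambda r: r[1], lambda r: r[0])          # same: independent
p_z_given_c  = p(lambda r: r[1], lambda r: r[2])
p_z_given_cx = p(lambda r: r[1], lambda r: r[2] and r[0]) # lower: dependent
print(round(p_z, 2), round(p_z_given_x, 2))
print(round(p_z_given_c, 2), round(p_z_given_cx, 2))
```

Marginally, X tells you nothing about Z; but once you control for the collider C, learning X=1 "explains away" C and makes Z=1 less likely. This asymmetry between conditioning patterns is what lets independence tests constrain graph structure.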

Deep Learning: A Critical Appraisal—Marcus, 2018

Marcus argues that deep learning:

Is shallow, with limited capacity for transfer. If a task is perturbed even in minor ways, deep learning breaks, demonstrating that it’s not really learning the underlying concepts. Adversarial examples showcase this effect.

Has no natural way to deal with hierarchical structure. Even recursive neural networks require fixed sentence trees to be precomputed. See my summary of ‘Generalisation without systematicity’ below.

Struggles with open-ended inference, especially inference based on real-world knowledge.

Isn’t transparent, and remains essentially a “black box”.

Is not well-integrated with prior knowledge. We can’t encode our understanding of physics into a neural network, for example.

Some of these problems seem like they can be overcome without novel insights, given enough engineering effort and compute, but others are more fundamental. One interpretation: deep learning can interpolate within the training space, but can’t extrapolate outside the training space, even in ways which seem natural to humans. One of Marcus’ examples: when a neural network is trained to learn the identity function on even numbers only, it rounds down on odd numbers. In this trivial case we can solve the problem by adding odd training examples or manually adjusting some weights, but in general, when there are many features, both approaches may be prohibitively difficult, even for a conceptually simple adjustment. To address this and other problems, Marcus offers three alternatives to deep learning as currently practiced:

Unsupervised learning, so that systems can constantly improve—for example by predicting the next time-step and updating afterwards, or by setting themselves challenges and learning from doing them.

Further development of symbolic AI. While this has in the past proved brittle, the idea of integrating symbolic representations into neural networks has great promise.

Drawing inspiration from humans, in particular from cognitive and developmental psychology, how we develop commonsense knowledge, and our understanding of narrative.
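Marcus's identity-function example above can be reproduced with a deliberately simple stand-in for a neural network (a sketch under my own assumptions, not Marcus's exact setup): one logistic unit per output bit, trained by gradient descent on even numbers only. The unit for the lowest bit only ever sees the target 0, so it learns to always output 0, and every odd test input gets rounded down:

```python
import numpy as np

BITS = 5

def to_bits(n):
    return np.array([(n >> i) & 1 for i in range(BITS)], dtype=float)

# Train only on EVEN numbers, so the lowest bit is 0 in every example.
X = np.array([to_bits(n) for n in range(0, 32, 2)])
T = X.copy()                              # identity function: target == input

# A deliberately simple learner: one logistic unit per output bit,
# trained by plain gradient descent on cross-entropy loss.
W = np.zeros((BITS, BITS))
b = np.zeros(BITS)
for _ in range(5000):
    P = 1 / (1 + np.exp(-(X @ W.T + b)))  # predictions, shape (16, 5)
    G = P - T                             # gradient of the loss w.r.t. logits
    W -= 0.5 * (G.T @ X) / len(X)
    b -= 0.5 * G.mean(axis=0)

def predict(n):
    p = 1 / (1 + np.exp(-(W @ to_bits(n) + b)))
    return sum(1 << i for i in range(BITS) if p[i] > 0.5)

print(predict(4), predict(10))  # even inputs are reproduced exactly
print(predict(7), predict(11))  # odd inputs are "rounded down" to evens
```

Since the lowest input bit is 0 in every training example, the weights attached to it never receive any gradient, so an odd input is treated exactly like the even number below it. Adding odd training examples fixes this instantly here; with thousands of features, spotting which slice of input space was never covered is much harder.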

Generalisation without Systematicity—Lake and Baroni, 2018

Lake and Baroni identify that human language and thought feature “systematic compositionality”: we are able to combine known components in novel ways to produce arbitrarily many new ideas. To test neural networks on this, they introduce SCAN, a language consisting of commands such as “jump around left twice and walk opposite right thrice”. While they found that RNNs were able to generalise well on new strings similar in form to previous strings, performance dropped sharply in other cases. For example, the best result dropped from 99.9% to 20.8% when the test examples were longer than any training example, even though they were constructed using the same compositional rules. Also, when a command such as “jump” had only been seen by itself in training, RNNs were almost entirely incapable of understanding instructions such as “turn right and jump”. The overall conclusion: neural networks can’t extract systematic rules from training data, and so can’t generalise compositionally anything like as well as humans can. This is similar to the result of a project I recently carried out, in which I found that capsule networks which had been trained to recognise transformed inputs such as rotated digits and colour-inverted digits still couldn’t recognise rotated, colour-inverted digits: they were simply not learning general rules which could be composed together.
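The compositional rules behind SCAN are simple enough to write down directly. A hand-coded interpreter for a fragment of the language (my own simplified sketch of the published grammar; "after" and some modifiers are omitted) makes clear what a systematic learner would have to extract:

```python
# Hand-coded interpreter for a fragment of the SCAN command language.
PRIMS = {"jump": ["JUMP"], "walk": ["WALK"], "run": ["RUN"], "turn": []}
TURN = {"left": "LTURN", "right": "RTURN"}

def interpret(cmd):
    if " and " in cmd:                         # conjunction: do both in order
        left, right = cmd.split(" and ", 1)
        return interpret(left) + interpret(right)
    words = cmd.split()
    if words[-1] in ("twice", "thrice"):       # repetition modifier
        return interpret(" ".join(words[:-1])) * (2 if words[-1] == "twice" else 3)
    verb, rest = words[0], words[1:]
    acts = PRIMS[verb]
    if not rest:
        return acts
    if rest[0] == "around":                    # full 360: turn then act, four times
        return ([TURN[rest[1]]] + acts) * 4
    if rest[0] == "opposite":                  # turn 180 first
        return [TURN[rest[1]]] * 2 + acts
    return [TURN[rest[0]]] + acts              # plain direction

print(interpret("jump around left twice and walk opposite right thrice"))
```

Each rule is independent of the others, which is exactly why a human who knows "jump" and "turn right and walk" can immediately execute "turn right and jump"; the RNNs in the paper could not.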

Deep Reinforcement Learning Doesn’t Work Yet—Irpan, 2018

Irpan runs through a number of reasons to be skeptical about using deep learning for RL problems. For one thing, deep RL is still very data-inefficient: DeepMind’s Rainbow DQN takes around 83 hours of gameplay to reach human-level performance on an Atari game. By contrast, humans can pick most Atari games up within a minute or two. He also points out that other RL methods often work better than deep RL, particularly model-based ones which can utilise domain-specific knowledge.
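For a rough sense of scale (my own back-of-the-envelope arithmetic, assuming Atari's standard 60 emulator frames per second):

```python
# Back-of-the-envelope arithmetic for the data-inefficiency claim,
# assuming Atari's standard 60 frames per second.
frames = 83 * 3600 * 60          # 83 hours of gameplay, in emulator frames
human_minutes = 2                # a human picking the game up
ratio = (83 * 60) / human_minutes
print(f"{frames:,} frames, ~{ratio:,.0f}x more experience than a human")
```

That is roughly 18 million frames of experience, on the order of a thousand times more than the human baseline.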

Another issue with RL in general is that designing reward functions is difficult. This is a theme in AI safety—specifically when it comes to reward functions which encapsulate human values—but there are plenty of existing examples of reward hacking on much simpler tasks. One important consideration is the tradeoff between shaped and sparse rewards. Sparse rewards only occur at the goal state, and so can be fairly precise, but are usually too difficult to reach directly. Shaped rewards give positive feedback more frequently, but are easier to hack. And even when shaped rewards are designed carefully, RL agents often find themselves in local optima. This is particularly prevalent in multi-agent systems, where each agent can overfit to the behaviour of the others.
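A toy corridor environment shows how a carelessly shaped reward gets hacked—the shaping scheme here is a deliberately naive one I made up for illustration, not one from Irpan's post:

```python
def shaped_return(actions, horizon=100, goal=10):
    # Corridor of positions 0..goal. The sparse reward is +10 at the goal;
    # a (deliberately naive, made-up) shaping term adds +1 for every step
    # that moves the agent closer to the goal.
    pos, total = 0, 0.0
    for t in range(horizon):
        a = actions[t % len(actions)]          # cycle through a fixed plan
        new = max(0, min(goal, pos + a))
        if abs(goal - new) < abs(goal - pos):
            total += 1.0                       # shaping reward
        pos = new
        if pos == goal:
            total += 10.0                      # sparse goal reward
            break
    return total

go_straight = [1]       # heads straight for the goal
oscillate = [1, -1]     # "hacks" the shaping term by pacing back and forth
print(shaped_return(go_straight), shaped_return(oscillate))  # 20.0 50.0
```

Walking straight to the goal earns 20; pacing back and forth farms the shaping term for 50 and never finishes the task. Penalising steps away from the goal symmetrically (potential-based shaping) would close this particular loophole, but designing such terms correctly is exactly the difficulty being described.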

Lastly, RL is unstable in a way that supervised learning isn’t. Even successful implementations often fail to find a decent solution 20 or 30% of the time, depending on the random seed with which they are initialised. In fact, there are very few real-world success stories featuring RL. Yet achieving superhuman performance on a wide range of tasks is a matter of when, not if, and so I think Amara’s law applies: we overestimate the effects RL will have in the short run, but underestimate its effects in the long run.

In the past, people have said that neural networks could not possibly scale up to solve problems of a certain type, due to inherent limitations of the method. Neural net solutions have then been found using minor tweaks to the algorithms and (most importantly) scaling up data and compute. Ilya Sutskever gives many examples of this in his talk here. Some people consider this scaling-up to be “cheating” and evidence against neural nets really working, but it’s worth noting that the human brain uses compute on the scale of today’s supercomputers or greater, so perhaps we should not be surprised if a working AI design requires a similar amount of power.

On a cursory reading, it seems like most of the problems given in the papers could plausibly be solved by meta-reinforcement learning on a general-enough set of environments, of course with massively scaled-up compute and data. It may be that we will need a few more non-trivial insights to get human-level AI, but it’s also plausible that scaling up neural nets even further will just work.

---

Can’t use regression methods for problems that are not regression problems. Causal inference is generally not a regression problem. It’s not an issue of scale, it’s an issue of the wrong tool for the job.

Okay, but (e.g.) deep RL methods can solve problems that apparently require quite complex causal thinking, such as playing DotA. I think what is happening here is that while there is no explicit causal modelling happening at the lowest level of the algorithm, the learned model ends up building something that serves the functions of one, because that is the simplest way to solve a general class of problems. See the above meta-RL paper for good examples of this. There seems to be no obvious obstruction to scaling this sort of thing up to human-level causal modelling. Can you point to a particular task needing causal inference that you think these methods cannot solve?

Sure, and RL is not a regression problem. The reason RL methods can do causality is that they can perform an essentially infinite number of experiments in toy worlds. DL can help RL scale up to more complex toy worlds, and some worlds that are not so toy anymore. But there, it’s not DL on its own—it’s DL+RL.

DL is very useful, indeed! One could even use DL as a “subroutine” for causal analysis of the sort Pearl worries about—in fact, people do this now.
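A minimal sketch of that pattern (hypothetical toy data, with an ordinary least-squares fit standing in for the deep-learning subroutine) is the back-door adjustment: a fitted model supplies E[Y|X,Z], and the causal step is averaging it over P(Z) rather than P(Z|X):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Hypothetical confounded data: Z causes both X and Y, and X has a
# small direct effect (+0.1) on Y.
z = rng.random(N) < 0.5
x = rng.random(N) < np.where(z, 0.9, 0.1)
y = (rng.random(N) < 0.2 + 0.6 * z + 0.1 * x).astype(float)

# Step 1, the "subroutine": fit E[Y | X, Z]. A deep net could play this
# role for high-dimensional inputs; an ordinary least-squares fit does here.
A = np.column_stack([x, z, np.ones(N)]).astype(float)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
f = lambda xv, zv: coef[0] * xv + coef[1] * zv + coef[2]

# Step 2, the causal part: back-door adjustment. Average the fitted model
# over P(Z), not P(Z | X) -- this step is what function-fitting alone
# doesn't supply.
p_z = z.mean()
effect = sum(p * (f(1, zv) - f(0, zv)) for zv, p in [(0, 1 - p_z), (1, p_z)])

naive = y[x].mean() - y[~x].mean()   # confounded association-level estimate
print(round(naive, 2), round(effect, 2))
```

The naive difference comes out around 0.58, while the adjusted estimate recovers the true direct effect of about 0.1. The fitting step is interchangeable; the adjustment formula, and the graph that justifies it, are not.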

“Can you point to a particular task needing causal inference that you think these methods cannot solve?”

To answer this—anything that’s not a regression problem. At best, you can use DL as a subroutine in some other larger algorithm that needs its own insights to work, insights that are unrelated to DL. So why would DL get all the credit for solving the problem?

I agree that you do need some sort of causal structure around the function-fitting deep net. The question is how complex this structure needs to be before we can get to HLAI. It seems plausible to me (at least a 10% chance, say) that it could be quite simple, maybe just consisting of modestly more sophisticated versions of the RL algorithms we have so far, combined with really big deep networks.

Well, the DotA bot pretty much just used PPO. AlphaZero used MCTS + RL. OpenAI recently got a robot hand to do object manipulation with PPO and a simulator (the simulator was hand-built, but in principle it could be produced by unsupervised learning like in this). Clearly it’s possible to get sophisticated behaviors out of pretty simple RL algorithms. It could be the case that these approaches will “run out of steam” before getting to HLAI, but it’s hard to tell at the moment, because our algorithms aren’t running with the same amount of compute + data as humans (for humans, I am thinking of our entire lifetime experiences as data, which is used to build a cross-domain optimizer).

re: Uber, I agree that at least in the short term most applications in the real world will feature a fair amount of engineering by hand. But the need for this could decrease as more power becomes available, as has been the case in supervised learning.

Well, I am fairly sure DL+RL will not lead to HLAI, on any reasonable timescale that would matter to us. You are not sure. Seems to me, we could turn this into a bet. Any sort of bet where you say DL+RL → HLAI after X years, I will probably take the negation of, gladly.

Hmmm...but if I win the bet then the world may be destroyed, or our environment could change so much the money will become worthless. Would you take 20:1 odds that there won’t be DL+RL-based HLAI in 25 years?

I often hear this response: “I can’t make bets on my beliefs about the Eschaton, because they are about the Eschaton.”

My response to this response is: you have left the path of empiricism if you can’t translate your insight into [topic] (in this case “AI progress”) into taking money via {bets with empirically verifiable outcomes} from folks without your insight.

---

If you are worried the world will change too much in 25 years, can you formulate a nearer-term bet you would be happy with? For example, something non-toy that DL+RL would do in 5 years.

“I can’t make bets on my beliefs about the Eschaton, because they are about the Eschaton.” -- Well, it makes sense. Besides, I did offer you a bet taking into account a) that the money may be worth less in my branch, and b) that I don’t think DL + RL AGI is more likely than not, just plausible. If you’re more than 96% certain there will be no such AI, 20:1 odds are a good deal.

But anyway, I would be fine with betting on a nearer-term challenge. How about: in 5 years, a bipedal robot that can run on rough terrain, as in this video, using a policy learned from scratch by DL + RL (possibly including a simulated environment during training), at 1:1 odds.