How long it will be before humanity is capable of creating general AI is an important factor in discussions of the importance of doing AI alignment research, as well as in discussions of which research avenues have the best chance of success. One frequently discussed model for estimating AI timelines is that AI capabilities progress is essentially driven by growing compute capabilities. For example, the OpenAI article on AI and Compute presents a compelling narrative, showing a trend of well-known results in machine learning using exponentially more compute over time. This is an interesting model because, if valid, it lets us do some quantitative forecasting: compute metrics follow somewhat smooth trends that can be extrapolated. However, I think there are a number of reasons to suspect AI progress is driven more by engineer and researcher effort than by compute.

I think there’s a spectrum of models between:

We have an abundance of ideas that aren’t worth the investment to try out yet. Advances in compute capability unlock progress by making it economically feasible to research more expensive techniques. We’ll be able to create general AI soon after we have enough compute to do it.

Research proceeds at its own pace and makes use of as much compute as is convenient, to save researcher time on optimization and achieve flashy results. We’ll be able to create general AI once we come up with all the right ideas behind it, and either:

We’ll already have enough compute to do it

We won’t have enough compute, and we’ll start optimizing, invest more in compute, and possibly start truly being bottlenecked on compute progress.

My research hasn’t pointed too solidly in either direction, but below I discuss a number of the reasons I’ve thought of that might point towards compute not being a significant driver of progress right now.

There are many ways to train more efficiently that aren’t widely used

Starting in October 2017, the Stanford DAWNBench contest challenged teams to come up with the fastest and cheapest ways to train neural nets to solve certain tasks.

The most interesting was the ImageNet training time contest. The baseline entry took 10 days and cost $1112; less than one year later the best entries (all by the fast.ai team) were down to 18 minutes for $35, 19 minutes for $18, or 30 minutes for $14[^1]. This is ~800x faster and ~80x cheaper than the baseline.
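
As a quick sanity check on those ratios (simple arithmetic on the numbers above):

```python
baseline_minutes = 10 * 24 * 60  # 10 days = 14,400 minutes

print(baseline_minutes / 18)  # 800.0 -> the ~800x speedup claim
print(1112 / 14)              # ~79.4 -> the ~80x cost reduction claim
```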

Some of this was just using more and better hardware: the winning team used 128 V100 GPUs for the 18-minute time and 64 for the 19-minute time, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months. The training time on a single Google Cloud TPU went down from 12 hours to 3 hours as the Google Brain team tuned their training and incorporated ideas from the fast.ai team. An even larger improvement was seen recently in the CIFAR10 contest, with times on a p3.2xlarge improving by 60x, and the accompanying blog series still mentions multiple improvements left on the table due to effort constraints. The series’ author also speculates that many of the optimizations would also improve the ImageNet times.

The main techniques used for fast training were all known techniques: progressive resizing, mixed precision training, removing weight decay from batchnorms, scaling up the batch size in the middle of training, and gradually warming up the learning rate. They just required engineering effort to implement and weren’t already implemented in the library defaults.
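
As one example, a minimal sketch (assuming PyTorch; the 1-D-tensor heuristic for spotting batchnorm/bias parameters is my own simplification) of the “remove weight decay from batchnorms” trick:

```python
import torch
import torchvision

model = torchvision.models.resnet50()

# Batchnorm scales/shifts and biases are 1-D tensors, while conv and
# linear weight matrices have 2+ dimensions, so a dimensionality check
# separates the two groups.
decay, no_decay = [], []
for param in model.parameters():
    (no_decay if param.ndim <= 1 else decay).append(param)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 5e-4},   # regularize weight matrices
        {"params": no_decay, "weight_decay": 0.0}, # leave batchnorm/bias alone
    ],
    lr=0.1,
    momentum=0.9,
)
```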

Similarly, the improvement due to scaling from eight K80s to many machines with V100s was partially hardware, but it also required lots of engineering effort: using mixed precision fp16 training (required to take advantage of the V100 Tensor Cores), efficiently using the network to transfer data, implementing the techniques required for large batch sizes, and writing software for supervising clusters of AWS spot instances.

These results seem to show that it’s possible to train much faster and cheaper by applying knowledge and sufficient engineering effort. Interestingly, not even a team at Google Brain working to show off TPUs initially had all the code and knowledge required to get the best available performance; they had to gradually work for it.

I would suspect that in a world where we were bottlenecked hard on training times, these techniques would be more widely known and applied, with implementations readily available for every major machine learning library. Interestingly, in postscripts to both of his articles on how fast.ai managed to achieve such fast times, Jeremy Howard notes that he doesn’t believe large amounts of compute are required for important ML research, and that many foundational discoveries were made with little compute.

[^1]: Using spot/preemptible instance pricing instead of the on-demand pricing the benchmark page lists, due to the much lower prices and the lack of need for on-demand instances given the short training times. The authors of the winning solution wrote software to effectively use spot instances and actually used them for their tests. It may seem unfair to use spot prices for the winning solution but not for the baseline, but a lot of the improvement in the contest came from actually using all the available techniques for faster/cheaper training despite inconvenience: the winners had to write software to easily use spot instances, and their training times were short enough that spot was viable without fancy software to automatically transfer training to new machines.

Hyperparameter grid searches are inefficient

I’ve heard hyperparameter grid searches mentioned as a reason why ML research needs far more compute than the training time of the models used would suggest. However, I can also see the use of grid searches as evidence of an abundance of compute rather than a scarcity.

As far as I can tell, it’s possible to find hyperparameters much more efficiently than with a grid search; it just takes more human time and engineering effort. There’s a large literature of more efficient hyperparameter search methods, but as far as I can tell they aren’t very popular (I’ve never heard of anyone using one in practice, and all the open source implementations of this kind of thing I can find have few GitHub stars).
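
As one illustration of the gap, here is a minimal sketch (plain Python; the validation_loss stand-in and the sampling ranges are made up for illustration) of random search, which the literature has long found more sample-efficient than stepping through a grid with the same trial budget:

```python
import random

def validation_loss(lr, weight_decay):
    """Stand-in for training a model and returning its validation loss."""
    return (lr - 3e-3) ** 2 + 0.1 * (weight_decay - 1e-4) ** 2

# Sample each hyperparameter log-uniformly instead of stepping through a
# fixed grid: a 25-trial budget explores 25 distinct values per axis,
# where a 5x5 grid would explore only 5.
best = min(
    (
        {
            "lr": 10 ** random.uniform(-5, -1),
            "weight_decay": 10 ** random.uniform(-6, -2),
        }
        for _ in range(25)
    ),
    key=lambda h: validation_loss(**h),
)
print(best)
```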

Researcher Leslie Smith also has a number of papers with little-used ideas on principled approaches to choosing and searching for optimal hyperparameters with much less effort, including a fast automatic procedure for finding optimal learning rates. This suggests that it’s possible to substitute hyperparameter search time for more engineering, human decision-making, and research effort.
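
A minimal sketch of the idea behind that learning-rate procedure, the “LR range test” (my own simplified rendering, assuming PyTorch and a hypothetical train_batch helper that runs one optimizer step and returns the loss as a float):

```python
import math

def lr_range_test(optimizer, train_batch, lr_min=1e-6, lr_max=1.0, steps=100):
    """Exponentially sweep the learning rate over one short run, recording
    the loss at each step; a good learning rate sits just before the loss
    starts to blow up."""
    history = []
    for step in range(steps):
        # Exponential interpolation from lr_min to lr_max.
        lr = lr_min * (lr_max / lr_min) ** (step / (steps - 1))
        for group in optimizer.param_groups:
            group["lr"] = lr
        loss = train_batch()  # one optimizer step on one mini-batch
        history.append((lr, loss))
        if math.isnan(loss) or loss > 4 * min(l for _, l in history):
            break  # loss has diverged; stop the sweep
    return history
```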

There’s also likely room for improvement in the factorization of the hyperparameters we use, so that they’re more amenable to separate optimization. For example, L2 regularization is usually used in place of weight decay because they theoretically do the same thing, but this paper points out that with Adam they do not do the same thing: using true weight decay causes Adam to surpass the more popular SGD with momentum in practice, and weight decay is a better hyperparameter since the optimal weight decay is more independent of the learning rate than the optimal L2 regularization strength is.
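
For concreteness, the distinction written out from the standard definitions (textbook notation, not the paper’s): L2 regularization folds the penalty into the gradient, where Adam’s adaptive rescaling distorts it, while decoupled weight decay shrinks the weights directly:

```latex
% L2 regularization: the penalty enters through the gradient, so Adam's
% per-parameter rescaling acts on it like any other gradient component.
g_t = \nabla f(w_t) + \lambda w_t, \qquad
w_{t+1} = w_t - \eta \,\mathrm{AdamStep}(g_t)

% Decoupled weight decay: the shrinkage term bypasses the adaptive rescaling.
g_t = \nabla f(w_t), \qquad
w_{t+1} = w_t - \eta \,\mathrm{AdamStep}(g_t) - \eta \lambda w_t
```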

All of this suggests that most researchers might be operating with an abundance of cheap compute relative to their problems, which leads them not to invest the effort required to optimize their hyperparameters efficiently, instead doing so haphazardly or with grid searches.

The types of compute we need may not improve very quickly

Improvements in computing hardware are not uniform, and there are many different hardware attributes that can be bottlenecks for different things. AI progress may rely on one or more of these that don’t end up improving quickly, becoming bottlenecked on the slowest one rather than experiencing exponential growth.

Machine learning accelerators

Modern machine learning is largely composed of large operations that either are directly matrix multiplies or can be decomposed into them. It’s also possible to train using much lower precision than full 32-bit floating point, using some tricks. This allows the creation of specialized training hardware like Google’s TPUs and Nvidia’s Tensor Cores. A number of other companies have also announced that they’re working on custom accelerators.
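
As an illustration of those tricks, a minimal sketch of mixed precision training (assuming PyTorch’s torch.cuda.amp and a CUDA GPU; the model and data are placeholders):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so small fp16 gradients don't underflow

for _ in range(100):
    x = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # run matmuls in fp16 (on Tensor Cores where available)
        loss = model(x).square().mean()
    scaler.scale(loss).backward()    # backward pass on the scaled loss
    scaler.step(optimizer)           # unscale gradients, then take the step
    scaler.update()                  # adjust the loss scale for the next step
```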

The first generation of specialized hardware delivered a large one-time improvement, but we can also expect continuing innovation in accelerator architecture. There will likely be sustained innovation in training with different number formats and in architectural optimizations for faster and cheaper training. I expect this is the area where our compute capability will grow the most, but it may flatten out like CPUs have once we discover enough of the easily discoverable improvements.

CPUs

Reinforcement learning setups, like the OpenAI Five DOTA bot and various physics playgrounds, often use CPU-heavy serial simulations. OpenAI Five uses 128,000 CPU cores and only 256 GPUs; at current Google Cloud preemptible prices, the CPUs cost 5-10x more than the GPUs in total. Improvements in machine learning training ability will still leave the large cost of the CPUs. If the use of expensive simulations that run best on CPUs becomes an important part of training advanced agents, progress may become bottlenecked on CPU cost.
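
A back-of-the-envelope check on that ratio; the prices below are illustrative preemptible figures I’m assuming (roughly in line with 2018 Google Cloud pricing), not quoted from the post:

```python
cpu_cores, gpus = 128_000, 256
cpu_price_per_core_hour = 0.01  # assumed preemptible vCPU price, $/hour
gpu_price_per_hour = 0.74       # assumed preemptible V100 price, $/hour

cpu_cost = cpu_cores * cpu_price_per_core_hour  # $1,280/hour
gpu_cost = gpus * gpu_price_per_hour            # ~$189/hour
print(cpu_cost / gpu_cost)                      # ~6.8x, within the quoted 5-10x range
```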

GPU/accelerator memory

Another scarce resource is memory on the GPU/accelerator used for training. The memory must be large enough to store all the model parameters, the input, the gradients, and other optimization parameters.
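
As a rough worked example of where that memory goes (my own back-of-the-envelope accounting, assuming plain fp32 training with Adam and ignoring activations, which often dominate):

```python
params = 1e9  # a hypothetical 1-billion-parameter model

bytes_needed = params * (
    4    # fp32 weights
    + 4  # gradients
    + 8  # Adam's two moment estimates (m and v), fp32 each
)
print(bytes_needed / 2**30)  # ~14.9 GiB before counting inputs or activations
```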

This is one of the most frequent limits I see referenced in machine learning papers nowadays. For example, the new large BERT language model can only be trained properly on TPUs with their 64GB of RAM. The Glow paper needs to use gradient checkpointing and an alternative to batchnorm so that they can use gradient accumulation, because the gradients for only a single sample fit on a GPU.

However, there are ways to address this limitation that aren’t frequently used. Glow already uses the two best ones, gradient checkpointing and gradient accumulation, but did not implement an optimization they mentioned that would make the amount of memory the model takes constant in the number of layers instead of linear, likely because it would be difficult to engineer into existing ML frameworks. The BERT implementation uses none of the techniques because they just use a TPU with enough memory; in fact, a reimplementation of BERT implemented three such techniques and got it to fit on a GPU. Thus it still seems that in a world with less RAM these models might still have happened, just with more difficulty or with smaller demonstration models.
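
A minimal sketch of gradient accumulation (assuming PyTorch; the tiny model and synthetic data are placeholders), which simulates a large batch by summing gradients over several small ones before each update:

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
loader = [(torch.randn(4, 32), torch.randn(4, 1)) for _ in range(64)]

accumulation_steps = 8  # effective batch size = 4 * 8 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so summed grads average out
    loss.backward()                                   # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per virtual large batch
        optimizer.zero_grad()
```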

Interestingly, the maximum available RAM per device barely changed from 2014 through 2017, holding at the NVIDIA K80’s 24GB, but then shot up in 2018 to 48GB with the RTX 8000, as well as the 64GB TPU v2 and 128GB TPU v3. This is probably both because of demand for larger device memories for machine learning training and because of the availability of high capacity HBM memory. It’s unclear to me whether this rapid rise will continue or whether it was mostly a one-time change reflecting new demand for the largest possible memories reaching the market.

It’s also possible that per-device memory will cease to be a constraint on model size due to faster hardware interconnects that allow sharing a model across the memory of multiple devices, as Intel’s Nervana and TensorFlow Mesh plan to do. It also seems likely that techniques for splitting models across devices to fit in memory, like the original AlexNet did, will become more popular. The fact that we don’t split models across devices like AlexNet anymore may itself be evidence that we’re not very constrained by RAM, but I’m not sure.

Limited ability to exploit parallelism

As discussed extensively in a new paper from Google Brain, there seems to be a limit on how much data parallelism, in the form of larger batch sizes, we can currently extract out of a given model. If this constraint isn’t worked around, the wall time to train models could stall even if compute power continues to grow.

However, the paper mentions that various things like model architecture and regularization affect this limit, and I think it’s pretty likely that techniques to increase it will continue to be discovered, so it won’t be a bottleneck. A newer paper by OpenAI finds that more difficult problems also tolerate larger batch sizes. Even if the limit remains, increasing compute would allow training more different models in parallel, potentially just meaning that more parameter search and evolution gets layered on top of training. I also suspect that simply using ever-larger models may allow use of more compute without increasing batch sizes.

Conclusion

These all seem to point towards compute being abundant and ideas being the bottleneck, but not solidly. For the points about training efficiency and grid searches, this could just be an inefficiency in ML research, and all the major AGI progress will be made by a few well-funded teams at the boundaries of modern compute that have solved these problems internally.

Vaniver commented on a draft of this post that it’s interesting to consider the case where training time is the bottleneck rather than ideas, but massive engineering effort is highly effective at reducing training time. In this case, an increase in investment in AI research which led to hiring more engineers to apply techniques to speed up training could lead to rapid progress. This world might also lead to more sizable differences in capabilities between organizations, if large, somewhat serial software engineering investments are required to make use of the most powerful techniques, rather than a well-funded newcomer being able to just read papers and buy all the necessary hardware.

The course of various compute hardware attributes seems uncertain, both in terms of how fast they’ll progress and whether or not we’ll need to rely on anything other than special-purpose accelerator speed. Since the problem is complex with many unknowns, I’m still highly uncertain, but all of these points did move me, to varying degrees, in the direction of continuing compute growth not being a driver of dramatic progress.

I think the evidence in the first part suggesting an abundance of compute is mostly explained by the fact that academics expect that we need ideas and algorithmic breakthroughs rather than simply scaling up existing algorithms, so you should update on that fact rather than on this evidence, which is a downstream effect. If we condition on AGI requiring new ideas or algorithms, I think it is uncontroversial that we do not require huge amounts of compute to test out these new ideas.

The “we are bottlenecked on compute” argument should be taken as a statement about how to advance the state of the art on big unsolved problems in a sufficiently general way (that is, without encoding too much domain knowledge). Note that ImageNet is basically solved, so it does not fall in this category. At this point, it is a “small” problem, and it’s reasonable to say that it has an overabundance of compute, since it requires four orders of magnitude less compute than AlphaGo (and probably Dota). For the unsolved general problems, I do expect that researchers use efficient training tricks where they can find them, and they probably optimize hyperparameters in some smarter way. For example, AlphaGo’s hyperparameters were tuned via Bayesian optimization.

Particular narrow problems can be solved by adding domain knowledge, or by applying an existing technique that no one had bothered to apply before. Particular new ideas can be tested by building simple environments or datasets in which those ideas should work. It’s not surprising that these approaches are not bottlenecked on compute.

The evidence in the first part can be explained as follows, assuming that researchers are focused on testing new ideas:

New ideas can often be evaluated in small, simple environments that do not require much compute.

Any trick that you apply makes it harder to tell what effect your idea is having (since you have to disentangle it from the effect of the trick).

Many tricks do not apply in the domain where the new idea is being tested. Supervised learning has a bunch of tricks that now seem to work fairly robustly, but this is not so for reinforcement learning.

Jeremy Howard notes that he doesn’t believe large amounts of compute are required for important ML research, and that many foundational discoveries were made with little compute.

I would assume that Jeremy Howard thinks we are bottlenecked on ideas.

For the points about training efficiency and grid searches, this could just be an inefficiency in ML research, and all the major AGI progress will be made by a few well-funded teams at the boundaries of modern compute that have solved these problems internally.

This seems basically right. I’d note that there can be a balance, so it’s not clear that this is an “inefficiency”: you could believe that any actual AGI will be developed by well-funded teams like you describe, but that they will use some ideas developed by ML research that doesn’t require huge amounts of compute. It still seems consistent to say “compute is a major driver of progress in AI research, and we are bottlenecked on it”.

Suggestion to test your theory: look at the best AI results of the last 2 years and try to run/test them in a reasonable time on a computer that was affordable 10 years ago.

My own opinion is that hardware capacity has been a huge constraint in the past. We are moving into an era where it is less of a problem. But, I think, still a problem. Hardware limitations infect and limit your thinking in all sorts of ways and slow you down terribly.

To take an example from my own work: I have a problem that needs about 50GB of RAM to test efficiently. Otherwise it does not fit in memory and the run time is 100x slower.

I had the option to spend maybe 6 months finding a way to squeeze it into 32GB. Or do what I did: spend a few thousand on a machine with 128GB of RAM. Running it in 1GB of RAM would have been a world of pain, maybe not doable in the time I have to work on it.

I enjoyed the discussion. My own take is that this view is likely wrong.

The “many ways to train that aren’t widely used” point is evidence for alternatives which could substitute for a certain amount of hardware growth, but I don’t see it as evidence that hardware doesn’t drive growth.

My impression is that alternatives to grid search aren’t very popular because they don’t really work reliably. Maybe this has changed and people haven’t picked up on it yet. Or maybe the alternatives take more effort than they’re worth.

The fact that these things are fairly well known and still not used suggests that it is cheaper to pick up more compute than to use them. You discuss these things as evidence that computing power is abundant. I’m not sure how to quantify that. It seems like you mean for “computing power is abundant” to be an argument against “computing power drives progress”.

“Computing power is abundant” could mean that everyone can run whatever crazy idea they want, but the hard part is specifying something which does something interesting. This is quite relative, though. Computing power is certainly abundant compared to 20 years ago. But the fact that people pay a lot for computing power to run large experiments means that it could be even more abundant than it is now. And we can certainly write down interesting things which we can’t run, and which would produce more intelligent behavior if only we could.

“Computing power is abundant” could instead mean that buying more computing power is cheaper than a lot of low-hanging-fruit optimization of what you’re running. This seems like what you’re providing evidence for (on my interpretation; I’m not imagining this is what you intend to be providing evidence for). This to me sounds like an argument that computing power drives progress: when people want to purchase capability progress, they often purchase computing power.

I do think that your observations suggest that computing power can be replaced by engineering, at least to a certain extent. So, slower progress on faster/cheaper computers doesn’t mean correspondingly slower AI progress; only somewhat slower.

Elaborating on my comment (on the world where training time is the bottleneck, and engineers help):

To the extent major progress and flashy results are dependent on massive engineering efforts, it seems like this lowers the portability of advances and makes it more difficult for teams to form coalitions. [Compare to a world where you just have to glue together different conceptual advances, and so you plug one model into another and are basically done.] This also means we should think about how progress happens in other fields with lots of free parameters that are optimized jointly; semiconductor manufacturing is the primary thing that comes to mind, where you have about a dozen different fields of engineering that are all constrained by each other and the joint tradeoffs are nightmarish to behold or manage. [Subfield A would be much better off if we switched from silicon to germanium, but everyone else would scream; but perhaps we’ll need to switch eventually anyway.] The more bloated all of these projects become, the harder it is to do fundamental reimaginings of how these things work. A favorite example of mine here is replacing matmuls in neural networks with bitshifts, also known as “you only wanted the ability to multiply by powers of 2, right?”, which seems ludicrously more efficient and is still pretty trainable, but requires thinking about gradient updates differently; and the more effort you’ve put into optimizing how you pipe gradient updates around, the harder it is to make transitions like that.
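
For intuition on the bitshift idea, a toy sketch (my own illustration, assuming NumPy, and not the method from any particular paper): rounding each weight to a signed power of two means every multiply by a weight could in principle become an exponent adjustment, i.e. a shift:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))

# Quantize each weight to the nearest signed power of two. Multiplying by
# 2**k only adjusts a float's exponent, so a matmul against w_pow2 could be
# implemented with shifts/adds rather than general multiplies.
exponents = np.round(np.log2(np.abs(w)))
w_pow2 = np.sign(w) * 2.0 ** exponents

x = rng.normal(size=4)
print(x @ w)       # original matmul
print(x @ w_pow2)  # power-of-two approximation, roughly similar output
```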

This is also possibly quite relevant to safety: if it’s hard to “tack on safety” at the end, then it’s important we start with something safe and then build a mountain of small improvements on it, rather than building the mountain of improvements for something that turns out to be unsafe and then starting over.

It seems to me like one of the main differences (but probably not the core one?) is whether or not something working seems predictable. Suppose Alice thinks that it’s hard to come up with something that works, but things that look like they’ll work do work with pretty high probability; and suppose Bob thinks it’s easy to see lots of things that might work, but things that might work rarely do. I think Alice is more likely to think we’re ideas-limited (since if we had a textbook from the future, we could just code it up and train it real quick) and Bob is more likely to think we’re compute-limited (since our actual progress is going to look much more like ruling out all of the bad ideas that are in between us and the good ideas, and the more computational experiments we can run, the faster that process can happen).

I tend to be quite close to the “ideas” end of the spectrum, tho the issue is pretty nuanced and mixed.

I think one of the things that’s interesting to me is not how much training time can be optimized, but “model size”: what seems important is not whether our RL algorithm can solve a double pendulum lightning-quick, but whether we can put the same basic RL architecture into an octopus’s body and have it figure out how to control the tentacles quickly. If the “exponential effort to get linear returns” story is true, then even if we’re currently not making the most of our hardware, gains of 100x in utilization of hardware only turn into 2 higher steps in the return space. I think the primary thing that inclines me towards the “ideas will drive progress” view is that if there’s a method that takes exponential effort for linear returns and another method that takes, say, polynomial effort for linear returns, the second method should blow past the exponential one pretty quickly. (Even something that reduces the base of the exponent would be a big deal for complicated tasks.)

If you go down that route, then I think you start thinking a lot about the efficiency of other things (like how good human Go players are at turning games into knowledge) and what information theory suggests about strategies, and so on. And you also start thinking about how close we are: for a lot of these things, just turning up the resources plowed into existing techniques can work (like beating DotA), so it’s not clear we need to search for “phase change” strategies first. (Even if you’re interested in, say, something like curing cancer, it’s not clear whether continuing improvements to current NN-based molecular dynamics predictors, causal network discovery tools, and other diagnostic and therapeutic aids will get to the finish line first, as opposed to figuring out how to build robot scientists and then putting them to work on curing cancer.)

Some of this was just using more and better hardware: the winning team used 128 V100 GPUs for the 18-minute time and 64 for the 19-minute time, versus eight K80 GPUs for the baseline. However, substantial improvements were made even on the same hardware. The training time on a p3.16xlarge AWS instance with eight V100 GPUs went down from 15 hours to 3 hours in 4 months.

Was the original 15 hour time for fp16 training, or fp32?

(A factor of 5 in a few months seems plausible, but before updating on that datapoint it would be good to know if it’s just from switching to tensor cores, which would be a rather different narrative.)

I just checked, and it seems it was fp32. I agree this makes it less impressive; I forgot to check that originally. I still think this somewhat counts as a software win, because working fp16 training required a bunch of programmer effort to take advantage of the hardware, just like optimization to make better use of cache would.

However, there’s also a different set of same-machine datapoints available in the benchmark, where training time on a single Cloud TPU v2 went down from 12 hours 30 minutes to 2 hours 44 minutes, which is a 4.5x speedup similar to the 5x achieved on the V100. The Cloud TPU was special-purpose hardware being trained with bfloat16 from the start, so that’s a similar-magnitude improvement more clearly due to software. The history shows incremental progress down to 6 hours and then a 2x speedup once the fast.ai team published and the Google Brain team incorporated their techniques.

I think that fp32 → fp16 should give a >5x boost on a V100, so this 5x improvement still probably hides some inefficiencies when running in fp16.

I suspect the initial 15 → 6 hour improvement on TPUs was also mostly dealing with low-hanging fruit and cleaning up various inefficiencies from porting older code to a TPU / larger batch size / etc. It seems plausible the last factor of 2 is more of a steady-state improvement; I don’t know.

My take on this story would be: “Hardware has been changing rapidly, giving large speedups, and at the same time people have been scaling up to larger batch sizes in order to spend more money. Each time hardware or scale changes, old software is poorly adapted, and it requires some engineering effort to make full use of the new setup.” On this reading, these speedups don’t provide as much insight into whether future progress will be driven by hardware.

I went and checked, and as far as I can tell they used the same 1024 batch size for the 12-hour and 6-hour times. The changes I noticed were better normalization, label smoothing, a somewhat tweaked input pipeline (not sure if optimization or refactoring), and updating TensorFlow a few versions (which plausibly includes a bunch of hardware optimizations like you’re talking about).

The things they took from fast.ai for the 2x speedup were training on progressively larger image sizes and the better triangular learning rate schedule. Separately, for their later submissions, which don’t include a single-GPU figure, fast.ai came up with better methods of cropping and augmentation that improve accuracy. I don’t necessarily think the pace of 2x speedups through clever ideas is sustainable; lots of the fast.ai ideas seem to be pretty low-hanging fruit.

I basically agree with the quoted part of your take, except that I don’t think it explains enough of the apathy towards training speed that I see, although it might more fully explain the situation at OpenAI and DeepMind. I’m making more of a revealed-preferences, efficient-markets kind of argument: the fact that those low-hanging fruits weren’t picked and aren’t incorporated into the vast majority of deep learning projects suggests that researchers are sufficiently unconstrained by training times that it isn’t worth their time to optimize things.

Like I say in the article though, I’m not super confident, and I could be underestimating the zeal for faster training because of sampling error in what I’ve seen, read, and thought of; or it could just be inefficient markets.

It’s interesting to set the OpenAI compute article’s graph to linear scale so you can see that the compute that went into AlphaGo utterly dwarfs everything else. It seems like DeepMind is definitely ahead of nearly everyone else in the engineering effort and money they’ve put into scaling.