Generalising CNNs

This post is for people who already know what CNNs are, and are interested in how to riff on and extend the core reason (perhaps?) that CNNs learn faster. Probing this technology topic is one ‘subgoal’ in questioning where our AI knowledge is heading, and how fast. In turn, that matters because we want AI to progress in a good direction.

Sub Goal

Q: Can the reduction in the number of parameters that a CNN introduces be achieved in a more general way?

A: Yes. Here are sketches of two ways:

1) Saccades. Train one network (layer) on attention: train it to learn which local blocks of the image to give attention to. Train the second part of the network using those chosen ‘local blocks’ in conjunction with the coordinates of their locations.

The number of blocks that have large CNN kernels applied to them is much reduced, and those blocks are the ones that matter.
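To make that slightly less abstract, here is a minimal numpy sketch. The variance-based block scorer and the random kernel are stand-ins for the trained attention layer and learned kernel the post describes; all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def saccade_features(image, block=8, k=4):
    """Score local blocks cheaply, then apply an expensive kernel only to
    the top-k blocks, keeping each block's coordinates as extra features."""
    h, w = image.shape
    scores, coords = [], []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            patch = image[i:i + block, j:j + block]
            scores.append(patch.var())        # cheap saliency proxy
            coords.append((i, j))
    top = np.argsort(scores)[-k:]             # "the blocks that matter"
    kernel = rng.standard_normal((block, block))  # stand-in for a learned kernel
    feats = []
    for idx in top:
        i, j = coords[idx]
        patch = image[i:i + block, j:j + block]
        # feature = kernel response plus the block's location
        feats.append((float((patch * kernel).sum()), i, j))
    return feats

image = rng.standard_normal((32, 32))
features = saccade_features(image)
```

Only k of the 16 blocks ever see the large kernel; the rest are skipped entirely, which is the parameter/compute reduction being claimed.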

2) Parameter Compression. Give each layer of a neural network more (potential) connections than you expect will actually end up being used. After training for a few cycles, compress the parameter values using a lossy algorithm, always choosing the compression which scores best on some weighting of size and quality. Decompress and repeat this process until you have worked through the training set.

The number of bits used to represent parameters is kept low, which helps guard against overfitting.
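A toy illustration of that compress-and-continue loop, with uniform weight quantisation as the lossy algorithm. The candidate bit widths and the "accept if loss grows by under 10%" rule are my illustrative assumptions, not worked-out choices from the post.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(w, bits):
    """Lossy compression: round weights to one of 2**bits shared levels."""
    lo, hi = w.min(), w.max()
    if hi == lo:
        return w.copy()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((w - lo) / step) * step

# Toy linear regression trained by gradient descent, with a lossy
# compress/decompress step every 10 iterations.
X = rng.standard_normal((64, 10))
true_w = rng.standard_normal(10)
y = X @ true_w
w = np.zeros(10)
for step in range(200):
    grad = X.T @ (X @ w - y) / len(X)
    w -= 0.1 * grad
    if step % 10 == 9:
        base_loss = np.mean((X @ w - y) ** 2)
        for bits in (2, 3, 4, 5, 6):      # prefer the smallest description
            cand = quantize(w, bits)
            if np.mean((X @ cand - y) ** 2) <= 1.1 * base_loss + 1e-6:
                w = cand                  # coarsest quantisation that is acceptable
                break
```

The selection loop is the "weighting of size and quality": coarser quantisations are tried first and accepted only if they barely hurt the loss.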

Dialog

[Doubter] This all sounds very hand-wavy. How exactly would you train a saccadic network on the right movements?

[Optimist] One stepping stone, before you get to a true saccadic network whose locus of attention follows a temporal trajectory, is to train a shallow network to classify where to give attention. So this stepping stone outputs a weighting for how much attention to give to each location. To be more concrete: it works on a downsampled image and outputs 0 for no attention, 1 for convolution with a 3x3 kernel, and 2 for convolution with a 5x5 kernel.
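[A sketch of that stepping stone, with a cheap per-cell variance statistic standing in for the shallow learned classifier; the downsampling factor and thresholds are illustrative.]

```python
import numpy as np

rng = np.random.default_rng(2)

def attention_map(image, factor=4):
    """Downsample, then emit a per-cell attention level:
    0 = no attention, 1 = 3x3 kernel, 2 = 5x5 kernel.
    A real version would learn this; a variance threshold stands in here."""
    h, w = image.shape
    small = image[:h - h % factor, :w - w % factor]
    small = small.reshape(h // factor, factor, w // factor, factor)
    var = small.var(axis=(1, 3))             # cheap per-cell statistic
    levels = np.zeros_like(var, dtype=int)
    levels[var > np.quantile(var, 0.5)] = 1  # moderate attention
    levels[var > np.quantile(var, 0.9)] = 2  # strong attention
    return levels

image = rng.standard_normal((32, 32))
levels = attention_map(image)
```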

[Doubter] You still haven’t said how you would do that attention training.

[Optimist] You could reward a network for robustness to corruption of the image, and reward it for zeroes in the attention layers.
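[As a sketch, those two rewards could be combined into a single loss to minimise; the weighting and every name below are my illustrative assumptions, not a worked-out training scheme.]

```python
import numpy as np

def attention_objective(clean_logits, corrupt_logits, attention,
                        sparsity_weight=0.1):
    """Loss to minimise: penalise predictions that change when the image is
    corrupted, plus a penalty on non-zero attention entries."""
    robustness_penalty = np.mean((clean_logits - corrupt_logits) ** 2)
    attention_cost = np.mean(attention != 0)  # fraction of attended locations
    return robustness_penalty + sparsity_weight * attention_cost

clean = np.array([2.0, -1.0, 0.5])     # logits on the clean image
corrupt = np.array([1.8, -0.9, 0.6])   # logits on the corrupted image
attn = np.array([[0, 1, 0], [0, 0, 2]])
loss = attention_objective(clean, corrupt, attn)
```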

[Doubter] That’s not clear, and I think there is a Catch-22: you need to have analysed the image in order to decide where to give it attention.

[Optimist] …but not analysed it in full detail. Use only a few downsampled layers to decide where to give attention. You save a ton of CPU by only giving more attention where it is needed.

[Doubter] I really doubt that. You will pay for that saving many times over through the less regular pattern of ‘attention’ and the more complex code. It will be really hard to accelerate on a GPU as well as a standard CNN already is. Besides, even a 16x reduction in total workload (and I actually doubt there would be any reduction at all) is not that significant. What actually matters is the quality of the end result.

[Optimist] We shouldn’t be worrying about the GPU. That’s ‘premature optimisation’. You’re artificially constraining your thinking by the hardware we use right now.

[Doubter] Nevertheless, the GPU is the hardware we have right now, and we want practical systems. An alternative to CNNs using a hybrid CPU/GPU approach at least has to come close to current CNNs on GPU in speed, and have some other key advantage.

[Optimist] Explainability in a saccadic CNN is better, since you have the explicit weightings for attention. For any output, you can show where the attention is.

[Doubter] But that is not new. We can already show where attention is by looking at which weights mattered in a classification. See for example the way we learned that ‘hands’ were important in detecting dumbbells, or that snow was important in differentiating wolves from dogs.

[Optimist] Right. And those insights into how CNNs classify were really valuable landmarks, weren’t they? Now we would have something more direct for doing that, since we can go straight to the attention weights. And we can explore better strategies for setting those weights.

[Doubter] You still haven’t explained exactly how the attention layers would be constructed, nor the later ‘better strategies’, nor how you would progress to temporal attention strategies. I doubt the basic idea would do more than a slightly deeper CNN would. Until I see an actual working example, I’m unconvinced. Can we move on to ‘parameter compression’?

[Optimist] Sure.

-----

[Doubter] So what I am struggling with is that you are throwing away data after a little training. Why ‘lossy compression’ and not ‘lossless compression’?

[Optimist] That’s part of the point of it. We’re trying to reward a low-bit-count description of the weights.

[Doubter] Hold on a moment. You’re talking more like a proponent of evolutionary algorithms than of neural networks. You can’t back-propagate a reward for a low-entropy solution back up the net. All you can do is choose one such parameter set over another.

[Optimist] Exactly. Neural networks are in fact just a particular, rather constrained case of evolutionary algorithm. I’d contend there is advantage in exploring new ways of reducing the degrees of freedom in them. CNNs do reduce the degrees of freedom, but not in a very general way. We need to add something like compression of parameters if we want low degrees of freedom with more generality.

[Doubter] In CNNs that lack of generality is an advantage. Your approach could encode a network with a ridiculously large number of useless non-zero weights, whilst still using very few bits. That won’t work: it would take far longer to compute one iteration. It would be as slow as pitch drops dripping.

[Optimist] Right. So some attention must be paid to exactly what the lossy compression algorithm is. Just as JPEG throws away components with low weight, this compression algorithm could too.
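[One lossy scheme in that JPEG-like spirit, offered as an illustration rather than anything from the dialog, is truncated SVD of a layer’s weight matrix: keep the few large components, discard the rest.]

```python
import numpy as np

rng = np.random.default_rng(3)

def compress_weights(W, rank):
    """Lossy compression by truncated SVD: like JPEG, keep the components
    with large weight and throw away the small ones."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

W = rng.standard_normal((64, 64))
W_low = compress_weights(W, rank=8)
# Storage cost: 2*64*8 + 8 numbers instead of 64*64, at some loss of fidelity.
```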

[Doubter] So I have a couple of comments here. You have not worked out the details, right? It also doesn’t sound like this is bio-inspired, which was at least a saving grace of the saccadic idea.

[Optimist] Well, the compression idea wasn’t bio-inspired originally, but later I got to thinking about how genes could create many ‘similar patterns’ of connections locally. That could produce CNN-type connections, but genes can also produce similar patterns with long-range connections. So for example, genes could learn the ideal density of long-range connections relative to short-range connections. That connection plan gets repeated in many places whilst being encoded compactly. In that sense genes are a compression code.

[Doubter] So you are mixing genetic algorithms and neural networks? That sounds like a recipe for more parameters.

[Optimist] …a recipe for new ways of reducing the number of parameters.

[Doubter] I think I see a pattern here, in that both ideas offer CNNs as a special case. With saccadic networks the secret sauce is some not-too-clear way you would program the ‘attention’ function. With parameter compression the secret sauce is the choice of lossy compression function. If you got funded to do some demo coding, you could keep naive investors happy for a long while with networks that were actually no better than existing CNNs, and plenty of promises of more to come with further funding. But the ‘more to come later’ never would come. Your deep problem is that the ‘secret sauce’ is more aspiration than demonstrable fact.

[Optimist] I think that’s a little unfair. I am not claiming these approaches are implemented, demonstrable improvements, nor that I know exactly how to get the details of these two ideas right quickly. You are also losing sight of the overall goal, which is to advance the value of AI as a positive transformative force.

[Doubter] Hmm. I see only a not-too-convincing claim of being able to increase the power of machine learning, and an attempt to burnish your ego and your reputation. Where is the focus on positive transformative force?

[Optimist] Breaking the mould on how to think about machine learning is a pretty important subgoal in progressing thought on AI, don’t you think? “Less Wrong” is the best possible place on the internet for engaging in discussion of the ethical progression of AI. If this ‘subgoal’ post does not gather any useful feedback at all, then I’ll have to agree with you that my post is not helping to progress the possible positive transformative aspects of AI, and try again with another iteration and a different post, until I find what works.

I think this could use an opening line or paragraph roughly indicating who the post is for (from the looks of it, people with some background in neural networks, although I couldn’t specify it in more detail than that).

So if I understand you, for (1) you’re proposing a “hard” attention over the image, rather than the “soft” differentiable attention which is typically meant by “attention” for NNs.

You might find interesting “Recurrent Models of Visual Attention” by DeepMind (https://arxiv.org/pdf/1406.6247.pdf). They use a hard attention over the image with RL to train where to attend. I found it interesting; there’s been subsequent work using hard attention (I thiiink this is a central paper for the topic, but I could be wrong, and I’m not at all sure what the most interesting recent one is) as well.

That paper is new to me, and yes, related and interesting. I like their use of a ‘glimpse’: more resolution at the centre, less resolution further away.

About ‘hard’ and ‘soft’: if they mean what I think they do, then yes, the attention is ‘hard’. It forces to zero some weights that in a fully connected network could end up non-zero. That might require some care in training, as a network whose attention is ‘way off’ where it should be has no gradient to guide it towards better solutions.

Thanks for the link to the paper and for the idea of thinking about to what extent the attention is or is not differentiable.

You might be interested in Transformer networks, which use a learned pattern of attention to route data between layers. They’re pretty popular and have been used in some impressive applications, like this very convincing image-synthesis GAN.

Re: whether this is a good research direction. The fact that neural networks are highly compressible is very interesting, and I too suspect that exploiting this fact could lead to more powerful models. However, if your goal is to increase the chance that AI has a positive impact, then the relevant thing is how quickly our understanding of how to align AI systems progresses, relative to our understanding of how to build powerful AI systems. As described, this idea sounds like it would be more useful for the latter.

The image synthesis is impressive. The Transformer network paper looks intriguing; I will need to read it again much more slowly, not skim, to understand it. Thanks for both the links and the feedback on aligning AI.

I agree the ideas really are about progressing AI, rather than progressing AI specifically in a positive way. As a post-hoc justification, though: exploring attention mechanisms in machine learning suggests that what an AI ‘cares about’ may be pretty deeply embedded in its technology. Your comment, and my need to justify post-hoc, set me the task of making that link more concrete, so let me expand on it.

I think many animals have almost hard-wired attention mechanisms for alerting them to eyes. Things with eyes are animate, and may need a reaction more rapidly than rocks or trees do. Animals also have almost hard-wired attention mechanisms for sudden movement.

What alerting or attention-setting mechanisms will AIs for self-driving cars have? Probably they will prioritise sudden-movement detection. Probably they won’t have any specific mechanism for alerting to eyes. Perhaps that’s a mistake.

I’ve noticed that the bounding boxes in some videos of ‘what a car sees’ are pretty good at following vehicles, but flicker on and off around people on the sidewalk. The stable bounding boxes are relatively square; the unstable ones are tall and thin.

Now just maybe, we want to make a visual architecture that is very good at distinguishing tall thin objects that could be people from tall thin objects that could be lamp posts. That has implications all the way down the visual pipeline. The car is not going to be good at solving trolley problems if it can tell trucks from cars, but can’t tell people from lamp posts.

There is a large existing literature on pruning neural networks, starting with the 1990 paper “Optimal Brain Damage” by Le Cun, Denker and Solla. A recent paper with more references is https://arxiv.org/pdf/1803.03635.pdf
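For readers new to that literature, the simplest member of the pruning family is magnitude pruning, sketched below. (Optimal Brain Damage itself uses a second-order saliency measure rather than weight magnitude; this is just the minimal illustration.)

```python
import numpy as np

rng = np.random.default_rng(4)

def magnitude_prune(W, fraction):
    """Zero out the given fraction of weights with the smallest magnitudes.
    A stand-in for the more principled saliency used in Optimal Brain Damage."""
    threshold = np.quantile(np.abs(W), fraction)
    return np.where(np.abs(W) >= threshold, W, 0.0)

W = rng.standard_normal((100, 100))
W_pruned = magnitude_prune(W, 0.9)   # keep only the largest 10% of weights
```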