I find the nomenclature in this article a bit weird.
> Another disadvantage of backpropagation is its tendency to become stuck in the local minima of the loss function. Mathematically, the goal in training a model is converging on the global minimum, the point in the loss function where the model has optimized its ability to make predictions.
"Backpropagation" is the method how to compute the gradient of the weights with respect to a loss function. But the article repeatedly uses the term as if it was the whole optimization algorithm, running into local minima.
Wikipedia: "The term backpropagation strictly refers only to the algorithm for computing the gradient, not how the gradient is used; however, the term is often used loosely to refer to the entire learning algorithm, including how the gradient is used, such as by stochastic gradient descent."
As someone who uses this in their day job, I have no problem using loose terminology to describe a well known procedure. Most of the time when I am referring to the act of optimization, it doesn't matter exactly what method I am using, and I can use backprop as a stand-in. If I'm talking about the technical details of my work, I will state the specific optimization strategy. Everyone on my team does similar things, and no one is confused or misled. Use rigorous language when necessary, and use colloquial language when appropriate.
> Use rigorous language when necessary, and use colloquial language when appropriate.
Do you think a peer-reviewed publication is formal enough to warrant precise language? These publications are not only read by specialists in the field. I use automatic differentiation in my daily work, but I'm not familiar with machine learning. Thus I am very confused when "backpropagation" is used to mean an optimization algorithm.
EDIT: It is as if physicists used the term "special relativity" to talk about "quantum mechanics" because, after all, quantum mechanics happens in Lorentzian spacetime. Now for specialists in quantum physics it may make sense, since they are using "special relativity" to distinguish it from fancier quantum theories that combine field theory with GR. But for normal people it would certainly be misleading. Using "backpropagation" to include optimization has the same feeling.
That's bad. There are gradient descent techniques that are feed-forward, and understanding the domains where those are appropriate versus the domains where backprop is appropriate is, I think, important, especially as you try to do things like mix machine learning techniques with other differentiable programming strategies.
I'm one of the speakers at the workshop mentioned [1]. The article is a bit of a concept salad. I'm not familiar with all of the papers mentioned but am happy to try to answer questions.
Good question. I do think there are some bogus workshops, and at least a couple of the papers at this workshop seem to be written by people who don't understand the area very well. However, there's also a lot of great work in this workshop. It's also asking an important and fundamental question, which we have reason to believe has better answers than our current ones. I also think quantum ML is a perfectly sensible line of inquiry (although imo unlikely to pay off for a long time, if ever), even if its results are consistently overhyped in the media.
I don't really know what to do about the state of journalism, however. I thought I was being pretty hard on the author of this piece elsewhere in this thread, actually. I guess you could blame everyone who upvoted the article. On the other hand, we can't let the perfect be the enemy of the good.
There is a screening process for NIPS workshops (which I've been part of the last two years) where more experienced researchers rate the proposed workshops. There are usually a couple that get through that I think shouldn't be. However, I'd rather let 1000 stupid workshops through than shut down the one seemingly weird one that's actually a promising direction. Because if not here, where can we nurture and support weird ideas?
Finally, I think workshops play a vital role in onboarding newcomers to the field. One of the main ways you learn is by writing papers, and workshops lower the barrier to getting something out the door and getting feedback on it.
The workshop is not bogus; the way it's being commercialized is what's bogus. Research is an incremental process, and there are a lot of frontier scientists currently working on this. The importance being given to just this one workshop is bullshit.
I was wondering whether it would be possible for you to provide an overview of the different methods that you think might have a better shot at replacing the backpropagation algorithm?
Sure. First of all, I want to say that backprop, by which I mean reverse-mode differentiation for computing gradients, combined with gradient descent for updating parameters, is pretty great. In a sense it's the last thing we should be trying to replace, since pretty much the whole deep learning revolution was about replacing hand-designed functions with ones that can be optimized in this way.
Reverse-mode differentiation has about the same time cost as whatever function you're optimizing, no matter how many parameters you need gradients for. This is about as good as one could hope for, and is what lets it scale to billions of parameters.
The main downside of reverse-mode differentiation (and one of the reasons it's biologically implausible) is that it requires storing all the intermediate numbers that were computed when evaluating the function on the forward pass. So its memory cost grows with the complexity of the function being optimized.
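To make that concrete, here's a toy hand-rolled reverse-mode sketch in Python (purely illustrative, not how real autodiff frameworks are implemented): the backward pass costs about the same as the forward pass, but it consumes every activation the forward pass had to store.

    # Toy reverse-mode differentiation for y = relu(w_n * ... relu(w_1 * x)),
    # using scalars as stand-ins for layers.
    def forward(x, ws):
        acts = [x]
        for w in ws:
            x = max(0.0, w * x)      # one scalar "layer"
            acts.append(x)
        return x, acts               # memory grows with the number of layers

    def backward(grad_out, ws, acts):
        grads = []
        for w, a_in, a_out in zip(reversed(ws), reversed(acts[:-1]), reversed(acts[1:])):
            local = grad_out if a_out > 0 else 0.0   # ReLU derivative gate
            grads.append(local * a_in)               # dL/dw for this layer
            grad_out = local * w                     # dL/d(input of this layer)
        return list(reversed(grads))

    y, acts = forward(2.0, [0.5, -1.3, 0.8])
    print(backward(1.0, [0.5, -1.3, 0.8], acts))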
So the main practical problem with reverse-mode differentiation + gradient descent is the memory requirement, and much of the research presented in the workshop is about ways to get around this. A few of the major approaches are:
1) Only storing a subset of the forward activations, to get noisier gradients at less memory cost. This is what the "Randomized Automatic Differentiation" paper does. You can also save memory and get exact gradients if you reconstruct the activations as you need them (called checkpointing; see the sketch after this list), but this is slower.
2) Only training one layer at a time. This is what the "Layer-wise Learning" papers are doing. I suppose you could also say that this is what the "feedback alignment" papers are doing.
3) If the function being optimized is a fixed-point computation (such as an optimization), you can compute its gradient without needing to store any activations by using the implicit function theorem. This is what my talk was about.
4) Some other forms of sensitivity analysis (not exactly the same as computing gradients) can be done by just letting a dynamical system run for a little while. Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.
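Here's the sketch promised under point 1: a minimal PyTorch example of checkpointing (assuming torch is available; this is not code from any of the workshop papers). Only the activations at a few segment boundaries are kept, and the rest are recomputed during the backward pass, trading extra compute for less memory.

    import torch
    from torch.utils.checkpoint import checkpoint_sequential

    # A deep stack of layers; plain backprop would store every intermediate activation.
    net = torch.nn.Sequential(
        *[torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()) for _ in range(64)]
    )
    x = torch.randn(32, 512, requires_grad=True)

    # Checkpoint in 8 segments: activations inside each segment are recomputed
    # on the backward pass instead of being stored.
    out = checkpoint_sequential(net, 8, x)
    out.sum().backward()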
I'm missing a lot of relevant work, and again I don't even know all the work that was presented at this one workshop. But I hope this helps.
Yes, "tuning for criticality" was the paper I was thinking of ! But I'm afraid I'm a dilettante when it comes to neuroscience. I basically just know the basic theories about consolidating learning during sleep.
Thank you for your answer. It appears to me that we are trying to achieve an algorithm that has better time complexity than the one we have right now (reverse-mode differentiation with gradient descent).
Is it possible to combine these methods in a straightforward manner with methods that try to reduce the space complexity? For example, the lottery ticket hypothesis (https://arxiv.org/abs/1803.03635) seems to reduce space complexity (please do correct me if I am wrong).
Also, based on my rather poor and limited knowledge, it appears to me that the set of proposed methods that reduce space complexity and the set of proposed methods that reduce time complexity are disjoint. Is that the case?
Thanks for your question! But as I said, no one is really worried about the asymptotic time complexity of reverse-mode differentiation, although there is scope for improving the constants. The main scope for improvement is in the space complexity.
There is a lot of work on trying to speed up optimization, for example the K-FAC optimizer by Roger Grosse that uses second-order gradient information in a scalable way.
The lottery ticket pruning strategies do reduce space complexity, but I think the main reason people are interested in them is to reduce training time or deployment memory requirements, not so much training memory requirements.
As for whether memory-saving and time-saving approaches are disjoint, many methods (like checkpointing) introduce a tradeoff between time and space complexity, so no.
(Lottery Ticket, to date, produces small networks ex post facto... You still have to train the giant network. There's also some indication that it's chancy on 'large' datasets+problems. https://arxiv.org/abs/1902.09574 )
One approach I've been thinking about is optimizing each neuron using only the global loss and information about the neighboring neurons.
Basically, if the network made the correct prediction, tell each neuron to do a little bit more of what it just did. If it sent a high output, change the weights so it sends an even higher output: weaken connections that were inhibitory and strengthen connections that were excitatory. And for a neuron with a low output, make it even lower by doing the opposite.
If, on the other hand, the prediction was wrong, try to make each neuron do less of what it did.
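Roughly, as a sketch of what I mean (hypothetical code, one layer at a time, a single global ±1 reward signal, no backpropagated gradient):

    import torch

    def local_update(W, x, h, reward, lr=1e-3):
        # W: (out, in) weights of one layer; x: (batch, in) inputs to that layer;
        # h: (batch, out) post-activation outputs; reward: +1 if the network's
        # prediction was right, -1 if it was wrong.
        with torch.no_grad():
            direction = torch.sign(h - h.mean(dim=0))   # "high" vs "low" output per neuron
            # push each neuron further in the direction it already went (or away, if wrong)
            W += lr * reward * (direction.t() @ x) / x.size(0)
        return W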
Do you know if something like this has been tried?
This is basically what the chain rule / backprop does. The only caveat is that it's pretty hard to make a network with only positive weights. And once you allow negative weights, you have to know whether your downstream neighbours should have had higher or lower output. The easiest way to do this is to propagate the gradient backwards from the output, hence backprop.
Backprop is good at giving credit where credit is due: you're looking at the impact of each weight on the loss, which allows changing each weight to improve the loss by an amount appropriate relative to the other weights. You can even have some negative weight gradients and some positive; i.e., it may be that even with a 'good' overall result, it's best to turn down a particular weight.
So my guess is that this approach would either take a much longer time to converge (as there's less information transmitted back for the neuron updates) or stall out completely.
Probably not too hard to code up, if you want to try it. But I would also be pretty surprised if it hadn't been tried before.
I agree it's likely to converge slower, but I was thinking that it could maybe make up for it by being cheaper. Based on what I know it also seems more biologically plausible that learning is a mostly local process.
Slightly tangential question, but on reading the article I was surprised that it mentioned multiple anonymous submissions, given that this was a workshop at one of the most prestigious ML conferences. Any particular reason for this that you can think of?
The names of all the authors are clearly listed on the workshop website [1]. I think this is just a case of incompetence on the part of the journalist.
The submission process is anonymous, but after papers are accepted there's no need for anonymity, so author names are published. I don't see why you're questioning the competence of the journalist if you don't know the submission process yourself.
To be charitable, you could call it misleading laziness. But the article was published long after all the papers were de-anonymized, so it seems perverse to explain that the submissions were originally anonymous instead of just giving the authors' names.
Furthermore, the journalist managed to write an entire article about a workshop without once naming it, giving a link, or even defining backpropagation.
Edit: I want to say that I understand it's hard to cover an area that you're not an expert in. But if the journalist had googled the titles of the papers he was writing about, he would have found the authors' names. Instead he gave the reader the impression that the papers are still anonymous.
The best part of the article is a quote by G. Hinton, the father of "deep learning", at the end: "My view is throw it all away and start again, I don’t think that’s how the brain works."
While Hinton's view needs to be noted, I heard a quote attributed to Yann LeCun, something like,
"If you want to learn flying by modeling the biology of birds, you're doing it wrong. Just look at today's airplanes. They have no resemblance to birds at all. Yet they're million times better and faster than any bird."
I get what the author was trying to say, but it's still -- a very limited view. Mostly because of the last bit (better/faster).
Birds are to planes, as humans are to cars. Yet can a car leap over barricades, climb mountains, trees, self-repair, turn on a dime, stop instantly, etc, etc?
A plane cannot maneuver like a bird, take off in crazy weather conditions, land on a dime in a tree, stop almost instantly in flight, and change direction, etc.
I think what you've quoted has a lot of value here, because what we should expect from an artificial brain isn't a human brain. That's true. However, while it may be faster in some specific capacity, it won't have the same characteristics.
So yes, expecting it to be like a human brain doesn't make sense.
Yet better/faster? I don't think we can compare this, they're too different.
(which is really the quote's point, but I just didn't like the better/faster bit at the end...)
Also, bird (and insect/bat/pterosaur) flight is a lot more energy-efficient than any plane's. Today's deep learning is essentially brute force, burning thousands of watts for anything more complicated, which a single human brain can often do on ~15 watts.
Advanced models like GPT-3 are burning millions of watts in the cloud, but they're not that much better than what a brain can do (and are in many ways worse, e.g. often requiring supervised learning).
That's the key point. The algorithms need to become more energy efficient to make significant leaps, thus become more like brains.
Also, birds produce themselves out of an egg, with only food, water, and air as production inputs. They can also produce more of themselves with minimal input. And they are self-repairing/self-maintaining, something planes can't do.
There were similar arguments when AlphaGo showed up and beat master Go player Lee Sedol, but is power (in watts) the right measurement? I always feel like it should be the total energy (in J or cal) required to take a computing device, whether a biological brain or an electrical computer, from knowing nothing to being capable of a skill like the game of Go. In that sense, deep learning is still more energy-efficient than a human.
Lee Sedol's lifetime energy consumption for all biological processes is around 50 MWh. AlphaGo drew more than 100 kW and was likely trained for more than 500 hours, and 100 kW × 500 h is already 50 MWh, so even counting all the energy Sedol spent eating, dancing, or having sex, his total ends up being less than the amount of energy spent to train the machine used to beat him.
No, it's not. What's your comparison? Are there birds that can carry 80,000 lb of passenger + cargo weight? Condors fly like fixed-wing aircraft for 99% of their flight, hummingbirds fly more like insects. There isn't one type of bird flight.
This whole HN discussion of bird flight is a trainwreck and reflects massive gaps in understanding of aerodynamics. This is '00s "computer virus news report" level competence in this subject.
We understand the aerodynamics of bird flight, and used it to make fixed-wing planes optimized for carrying lots of cargo. Once we understand the principles behind intelligence, we can make very efficient AI optimized for our usage. But we're still at the point where we don't understand intelligence as well as we understood aerodynamics when building the first planes, so we still have a lot to learn from "birds" - animal brains.
I mean, we could quibble about exactly where we are pre- or post-Wright Flyer, but given the amount of AI research that amounts to brute-force flailing about in search of incremental improvements, disagreements on the importance of "biological plausibility" and so on, it's pretty clear that, roughly speaking, AI is currently somewhere in the equivalent of the Lilienthal-Langley-Wright-Curtiss continuum (ie. 1890-1910-ish) and still prior to the most important theoretical breakthroughs. IOW, AI has not in my opinion yet achieved an equivalent to aerodynamics' Prandtl lifting-line theory: https://en.m.wikipedia.org/wiki/Lifting-line_theory
I believe AI will start as a basic principle or idea that can be applied to any sufficiently big state machine that controls, e.g., an RC airplane or traffic lights. That idea will be obvious in hindsight. I'd even make a guess that it will be like a "stateful" state machine that accumulates state in a particular manner and uses that to control the underlying state machine. We still will be nowhere near understanding intelligence, but that clever trick will be enough in most cases.
> The question of whether a computer can think is no more interesting than the question of whether a submarine can swim. --- Dijkstra
As for better/faster, we would not compare directly to humans, but to benchmarks and timed experiments.
LeCun is saying to treat "intelligence" the same as "flight" or "swimming". It is a matter of function, not a matter of a specific instantiation on a biological substrate. You don't need to recreate flapping wings to gain "flight", you can strap a combustion engine on a cylinder and beat all birds on earth in regards to speed. You don't say "we don't have flight yet", because an airplane is not able to land on a tree branch. Maybe we don't have yet all the components and aspects of "flight", but this is not a show stopper, and drones have come a long way.
Now the more interesting question becomes: What are the laws of aerodynamics for intelligence?
Aside: I think it is absolutely insane that a conference workshop, with papers yet to go through peer review, is highlighted as a popsci article on VentureBeat. That's such a narrow workshop that even researchers in the field may be unaware of it. And now they get to read the paper summaries via an HN story. "The centre cannot hold."
Aside II: Yann LeCun talk from 2019 about this subject (better to debate the source ;)):
> Clearly, Deep Learning research would greatly benefit from better theoretical understanding. DL is partly engineering science in which we create new artifacts through theoretical insight, intuition, biological inspiration, and empirical exploration. But understanding DL is a kind of "physical science" in which the general properties of this artifact is to be understood. The history of science and technology is replete with examples where the technological artifact preceded (not followed) the theoretical understanding: the theory of optics followed the invention of the lens, thermodynamics followed the steam engine, aerodynamics largely followed the airplane, information theory followed radio communication, and computer science followed the programmable calculator. My two main points are that (1) empiricism is a perfectly legitimate method of investigation, albeit an inefficient one, and (2) our challenge is to develop the equivalent of thermodynamics for learning and intelligence. While a theoretical underpinning, even if only conceptual, would greatly accelerate progress, one must be conscious of the limited practical implications of general theories. --- https://www.ias.edu/video/DeepLearningConf/2019-0222-YannLeC...
It's true we don't build planes like birds. But we understand how bird wings work, and we understand why we don't design planes like bird wings.
We don't have the same kind of understanding of how brains learn, so the comparison is not quite right.
When we understand how to build things that learn like brains, we'll be in a better position to say things like "Ok this is strictly worse than backprop, let's stick with backprop" or "Actually, this is better than backprop because X" (or, more likely, there are things we can use from both). Until we have that understanding it's silly to stop trying to understand how the brain does things.
That being said, nobody is going to stop working on backprop, and no one is going to stop working on understanding biological mechanisms. Research works by a bunch of people investigating different avenues simultaneously.
Better is subjective.
A specific AI may perform better than a human (e.g. processing images for hours, faster and longer than a human could).
But we're still not able to have an AI you can talk to the way you talk to a human (GPT-3 is still nowhere near enough in my book, albeit a good technological feat).
I would caveat "million times better." When 747s pop out baby 747s and procreate and refuel autonomously for 60 millon years I will give you million times better.
The physics of lift and aerodynamics were faaaaar from well understood at the time of the first airplanes, though. New areas tend to run a bit ahead of the underlying science; the fundamentals expand to support and improve the applications over time.
But we did have quite a few advances by the time of the first airplane. For example, steam and combustion engines had already been invented, which required a non-trivial understanding of physics and chemistry, and materials science was very advanced.
I hold the pessimistic view that we are still in hunter-gatherer mode when it comes to understanding cognition.
Well, it's your right to be a pessimist... I tend to think that the current hardware specialized for fast, parallelized linear algebra is at least as good as the wheels available at the start of the industrial revolution, though. We have learning algorithms that can match human/animal performance in a wide - but still constrained - set of tasks, which previous non-learned algorithms hadn't been able to crack. It's a start!
At some point you have to strike rocks to make fire, because the butane lighter hasn't been invented yet. You make do with what's available, and progressively get better at it. I tend to think that we're a couple-few perspective shifts away from getting it 'right,' and that the hardware side likely barely matters. But, I'm an optimist.
People build things without understanding the underlying principles all the time, e.g. the steam engine. You could probably make the case that building things has helped our understanding more than our understanding has helped us build things.
Having said that, you can certainly improve a design when you better understand the fundamentals (vs intuition + trial & error).
"They have no resemblance to birds at all. Yet they're million times better and faster than any bird."
LMAO
A 6-year-old kid can see the fundamental resemblance between a bird and a modern passenger airplane:
The wings
Tail stabilizer
Slender body
Planes are faster and bigger.
Are they better?
Not necessarily. For example, a hummingbird can fly in ways that are far beyond any human machine in terms of efficiency and flexibility.
Of course humans should not imitate birds, because human flight is a fundamentally different activity from bird flight. But to say human aviation did not start by mimicking birds is like saying ANNs were not inspired by the human brain...
I think the point was that aviation at the time did start by mimicking birds, and that was why there was so much failure. It was not until they let go of mimicking birds and took a different approach that they found success.
The main problem is not backpropagation though, but the fixation of resources on DL projects (that's what I call a local minimum!). In my department, for example, they don't seem to care about the application, integration, deployment, etc., as long as it's DL or DRL.
So assign random values to connection weights and then ‘spin’ those weights to a combination of other random values that hopefully performs a bit more favourably... isn’t this just random search?
It's not a random search through the parameter space:
"But how do we select a good network from these Kn different networks? Brute-force evaluation of
all possible configurations is clearly not feasible due to the massive number of different hypotheses.
Instead, we present an algorithm, shown in Figure 1, that iteratively searches the best combination
of connection values for the entire network by optimizing the given loss. To do this, the method
learns a real-valued quality score for each weight option. These scores are used to select the weight
value of each connection during the forward pass. The scores are then updated in the backward pass
based on the loss value in order to improve training performance over iterations."
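As a rough paraphrase of that idea in PyTorch (my own sketch, not the paper's code): each connection gets K frozen random candidate values plus K learned scores; the forward pass uses the highest-scoring candidate, and the backward pass updates only the scores via a straight-through trick.

    import torch
    import torch.nn.functional as F

    class ScoredChoiceLinear(torch.nn.Module):
        # Hypothetical layer: each connection picks one of k frozen random candidate
        # values; only the per-candidate scores are trained, never the values.
        def __init__(self, in_f, out_f, k=8):
            super().__init__()
            self.register_buffer("options", 0.1 * torch.randn(out_f, in_f, k))
            self.scores = torch.nn.Parameter(torch.zeros(out_f, in_f, k))

        def forward(self, x):
            probs = torch.softmax(self.scores, dim=-1)
            hard = F.one_hot(probs.argmax(dim=-1), probs.shape[-1]).float()
            sel = hard + probs - probs.detach()      # straight-through selection
            w = (self.options * sel).sum(dim=-1)     # chosen weight for each connection
            return x @ w.t()                         # (batch, in_f) -> (batch, out_f)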
Random search is a technical term in optimization with a very specific meaning (which unfortunately does not mean searching random locations in parameter space a la brute force). It’s more in the spirit of randomly deciding the direction in which to try to take the next step, thereby implicitly deriving a gradient component by sampling.
It reminds me of Bayesian model sampling, where you have a distribution over possible weights and 'draw' a model from the distribution for each evaluation... A problem is that there may be interesting co-dependencies amongst the weights which independent sampling will have a hard time getting right.
However, it's still quite hard to get useful results from it in practice.
---------------
I personally believe that one component of intelligence is the ability to apply cognitive patterns created for a particular input to other inputs. (Very simplified example: A block of "neurons" which have learned to recognize the pattern "is hurt by" when given a subject (group of pixels in image) and object (other group of pixels in image), could be applied to another subject/object pair, for example coming from processed audio. But if the audio processing takes 10 layers, and the image processing 5, the connection has to run backwards)
To do this in a state-of-the-art deep network, you need the ability to create backward connections. Backward connections imply loops, and loops break backprop (unlike loops in RNNs, which can be easily unrolled, AFAIK). So with the current backprop-trained feedforward model, you have to create patterns multiple times instead of reusing them.
This is why I will pay attention to backprop alternatives which allow loops, despite their (currently many) disadvantages. This and modular training are the two aspects of learning I would personally focus on.
I found it very weird that the SLIDE algorithm from early 2019 isn’t mentioned. Maybe I missed it or maybe it is compared just deeper in the referenced publications?
SLIDE seems way, way superior to any of the listed solutions or approaches, as far as I could tell on a first read through.
But there’s also been a lot of research suggesting most SOTA dense networks are arbitrarily replicatable with sparse networks, and may even be better in the sense of less overfitting. Perhaps things like GPT are still an exception, but for most applications SLIDE should work to train networks just as effective as naively specified dense architectures.
> But there’s also been a lot of research suggesting most SOTA dense networks are arbitrarily replicatable with sparse networks
I'm not sure if it's related, but would this work kind of like how Armadillo can do a singular value decomposition [0] of a matrix by embedding an arbitrary n by m matrix X in a higher-dimensional (n+m) by (n+m) null matrix M?
Yeah. I think part of the problem is just that SLIDE represents a Kuhnesque paradigm shift and these things take time. I really want to play with SLIDE myself but just haven't had a chance.
It's not a complicated concept, just a stretch of the concept of memory. Training in deep learning is done in batches. So "learning" (i.e. the gradient updates to the weights) that happens due to your early batches of data can be undone by the gradient updates for later batches.
The gradient in machine learning is based on the loss. Specifically, it's the direction that reduces the loss the fastest. So updates are driven not only by the most recent batches, but specifically by the recent data that is predicted incorrectly. The model doesn't carry any "confidence" from having predicted earlier data correctly; it only cares about changing to suit the most recent batches.
Seems like you could just use better active learning strategies to get around the issue, though... Keep your usual dataset, but progressively build a reservoir of 'important' examples while training. (where important == high loss or near decision boundary, for example.) Then when building batches, mix in some examples from the broad training set and some from the reservoir.
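Something like this, roughly (a sketch assuming a PyTorch-style loop, a per-example loss with reduction='none', and made-up names):

    import random
    import torch

    def train_epoch(model, opt, loss_fn, loader, reservoir, capacity=1024, mix_frac=0.25):
        for x, y in loader:
            k = int(mix_frac * x.size(0))
            if len(reservoir) >= k > 0:              # mix in some stored "hard" examples
                rx, ry = zip(*random.sample(reservoir, k))
                x = torch.cat([x, torch.stack(rx)])
                y = torch.cat([y, torch.stack(ry)])
            opt.zero_grad()
            losses = loss_fn(model(x), y)            # per-example losses
            losses.mean().backward()
            opt.step()
            # keep the highest-loss examples from this batch for later batches
            for i in losses.detach().argsort(descending=True)[: max(k, 8)].tolist():
                reservoir.append((x[i].detach().cpu(), y[i].detach().cpu()))
            del reservoir[:-capacity]                # bound the reservoir size
        return reservoir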
It's effective, but the so-called "boundary cases" often have to be hand-chosen due to the difficulty of selecting them automatically: early samples always have high loss, and decision-boundary nearness is implicitly tied to the accuracy of the network at the time of evaluation. In other words, the function we evaluate on the forward pass is itself changing as a result of backprop, so the critical points and the output of the function are also in flux.
You also lose an increasing portion of each batch to the "important" cases as you add more, so maintaining the size and contents of this pool is difficult - if you added every case, you'd have no new data.
So I think it's promising, but it needs more foundational work on deriving the impact of individual samples on the output. (If we ever get that breakthrough in explainability...)
Neat! It makes sense that you'd also want a mechanism for taking things out of the reservoir.
Overall I tend to think that this space is underexplored compared to searching for new architectures... We know that it helps to choose a curriculum for humans to help guide learning, even beginning with 'baby talk' to develop early communication skills.
Is this true? My understanding was that in fine-tuning, you'd only retrain some of the layers. And even if you retrain all the layers, the starting point for the layers is not random. If it really was all forgotten, then fine-tuning would not be orders of magnitude faster...
Gradient descent optimizes the performance of a model on a given dataset. If you stop training on one dataset and start training on another, your model will become more optimized for the second dataset and less optimized for the first. This will usually result in degraded performance on classes of data found more commonly in the first dataset but not the second. This is what people mean by "forgetting". It doesn't matter how much of the model you fine-tune; the effect is still present, though the effect size varies.
So, at this point, the most likely explanation seems to be that TFA was written prior to the unblinding, and then wasn't subsequently reviewed or updated before publication.
These are workshop submissions (which typically implies a more lightweight review process, for more exploratory work), and it is possible the same submissions are currently in blind review for other conferences in their final form.
I mean if we want to get pedantic I'm pretty sure Shannon used "backpropagation" for machine learning before either was called such.
Feedback for the purpose of regulating the state of a machine in response to input dates to antiquity, if we're really getting absurd. The formal definition is also debatable, I think Maxwell has the strongest claim.
> In the 1960s, academics including... arrived at the theory of backpropagation.
It was clearly phrased this way specifically because backprop is just the chain rule applied in a particular direction, and as such has been invented and reinvented over and over by everyone under the sun. Hell, a lazy googling says gradient descent goes back to Cauchy.
I'd say Gottfried W. Leibniz is the true author, as it all comes down to calculus. The particular implementation for "neural nets" is just a special case of function minimization by taking derivatives.
I like zoom-out views. To push what you describe further, it is essentially what ancient humans or their non-hominid forebears did subconsciously when calculating optimum motion trajectories to catch or spear prey while hunting... merely a version in formal notation ... we can thank the zero of India (https://en.wikipedia.org/wiki/0#History), the Persians (https://en.wikipedia.org/wiki/Algorithm#Etymology), the Islamic renaissance in Europe (https://mitpress.mit.edu/books/islamic-science-and-making-eu...) and numerous others for the slow development of the requisite formal maths. But a rose by any other name would smell as sweet. And perhaps, in the context of the stupefyingly deferred emergence of zero, even nameless!
> To push what you describe further, it is essentially what ancient humans or their non-hominid forebears did subconsciously when calculating optimum motion trajectories to catch or spear prey while hunting