I find the nomenclature in this article a bit weird.
> Another disadvantage of backpropagation is its tendency to become stuck in the local minima of the loss function. Mathematically, the goal in training a model is converging on the global minimum, the point in the loss function where the model has optimized its ability to make predictions.
"Backpropagation" is the method how to compute the gradient of the weights with respect to a loss function. But the article repeatedly uses the term as if it was the whole optimization algorithm, running into local minima.
Wikipedia: "The term backpropagation strictly refers only to the algorithm for computing the gradient, not how the gradient is used; however, the term is often used loosely to refer to the entire learning algorithm, including how the gradient is used, such as by stochastic gradient descent."
As someone who uses this in their day job, I have no problem using loose terminology to describe a well known procedure. Most of the time when I am referring to the act of optimization, it doesn't matter exactly what method I am using, and I can use backprop as a stand-in. If I'm talking about the technical details of my work, I will state the specific optimization strategy. Everyone on my team does similar things, and no one is confused or misled. Use rigorous language when necessary, and use colloquial language when appropriate.
> Use rigorous language when necessary, and use colloquial language when appropriate.
Do you think a peer-reviewed publication is formal enough to warrant precise language? These publications are not only read by specialists in the field. I use automatic differentiation in my daily work, but I'm not familiar with machine learning. Thus I am very confused when "backpropagation" is used to mean an optimization algorithm.
EDIT: It is as if physicists used the term "special relativity" to talk about "quantum mechanics" because, after all, quantum mechanics happens in Lorentzian spacetime. Now for specialists in quantum physics it may make sense, since they are using "special relativity" to distinguish it from fancier quantum theories that combine field theory with GR. But for normal people it would certainly be misleading. Using "backpropagation" to include optimization has the same feeling.
That's bad. There are gradient descent techniques that are feed-forward, and understanding the domains where those are appropriate versus the domains where backprop is appropriate is, I think, important, especially as you try to do things like mix machine learning techniques with other differentiable programming strategies.
I'm one of the speakers at the workshop mentioned [1]. The article is a bit of a concept salad. I'm not familiar with all of the papers mentioned but am happy to try to answer questions.
Good question. I do think there are some bogus workshops, and at least a couple of the papers at this workshop seem to be written by people who don't understand the area very well. However, there's also a lot of great work in this workshop. It's also asking an important and fundamental question, which we have reason to believe has better answers than our current ones. I also think quantum ML is a perfectly sensible line of inquiry (although imo unlikely to pay off for a long time, if ever), even if its results are consistently overhyped in the media.
I don't really know what to do about the state of journalism, however. I thought I was being pretty hard on the author of this piece elsewhere in this thread, actually. I guess you could blame everyone who upvoted the article. On the other hand, we can't let the perfect be the enemy of the good.
There is a screening process for NIPS workshops (which I've been part of the last two years) where more experienced researchers rate the proposed workshops. There are usually a couple that get through that I think shouldn't be. However, I'd rather let 1000 stupid workshops through than shut down the one seemingly weird one that's actually a promising direction. Because if not here, where can we nurture and support weird ideas?
Finally, I think workshops play a vital role in onboarding newcomers to the field. One of the main ways you learn is by writing papers, and workshops lower the barrier to getting something out the door and getting feedback on it.
The workshop is not bogus; the way it's being commercialized is what's bogus. Research is an incremental process, and there are a lot of frontier scientists currently working on this. The importance being given to just this one workshop is bullshit.
I was wondering whether it would be possible for you to provide an overview of the different methods that you think might have a better shot at replacing the backpropagation algorithm?
Sure. First of all, I want to say that backprop, by which I mean reverse-mode differentiation for computing gradients, combined with gradient descent for updating parameters, is pretty great. In a sense it's the last thing we should be trying to replace, since pretty much the whole deep learning revolution was about replacing hand-designed functions with ones that can be optimized in this way.
Reverse-mode differentiation has about the same time cost as whatever function you're optimizing, no matter how many parameters you need gradients for. This is about as good as one could hope for, and is what lets it scale to billions of parameters.
The main downside of reverse-mode differentiation (and one of the reasons it's biologically implausible) is that it requires storing all the intermediate numbers that were computed when evaluating the function on the forward pass. So its memory cost grows with the complexity of the function being optimized.
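To make that concrete, here's a toy hand-rolled reverse-mode sketch in Python (purely illustrative, not how real autodiff frameworks are implemented): the backward pass costs about the same as the forward pass, but it consumes every activation the forward pass had to store.

    # Toy reverse-mode differentiation for y = relu(w_n * ... relu(w_1 * x)),
    # using scalars as stand-ins for layers.
    def forward(x, ws):
        acts = [x]
        for w in ws:
            x = max(0.0, w * x)      # one scalar "layer"
            acts.append(x)
        return x, acts               # memory grows with the number of layers

    def backward(grad_out, ws, acts):
        grads = []
        for w, a_in, a_out in zip(reversed(ws), reversed(acts[:-1]), reversed(acts[1:])):
            local = grad_out if a_out > 0 else 0.0   # ReLU derivative gate
            grads.append(local * a_in)               # dL/dw for this layer
            grad_out = local * w                     # dL/d(input of this layer)
        return list(reversed(grads))

    y, acts = forward(2.0, [0.5, -1.3, 0.8])
    print(backward(1.0, [0.5, -1.3, 0.8], acts))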
So the main practical problem with reverse-mode differentiation + gradient descent is the memory requirement, and much of the research presented in the workshop is about ways to get around this. A few of the major approaches are:
1) Only storing a subset of the forward activations, to get noisier gradients at less memory cost. This is what the "Randomized Automatic Differentiation" paper does. You can also save memory and get exact gradients if you reconstruct the activations as you need them (called checkpointing; see the sketch after this list), but this is slower.
2) Only training one layer at a time. This is what the "Layer-wise Learning" papers are doing. I suppose you could also say that this is what the "feedback alignment" papers are doing.
3) If the function being optimized is a fixed-point computation (such as an optimization), you can compute its gradient without needing to store any activations by using the implicit function theorem. This is what my talk was about.
4) Some other forms of sensitivity analysis (not exactly the same as computing gradients) can be done by just letting a dynamical system run for a little while. Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.
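Here's the sketch promised under point 1: a minimal PyTorch example of checkpointing (assuming torch is available; this is not code from any of the workshop papers). Only the activations at a few segment boundaries are kept, and the rest are recomputed during the backward pass, trading extra compute for less memory.

    import torch
    from torch.utils.checkpoint import checkpoint_sequential

    # A deep stack of layers; plain backprop would store every intermediate activation.
    net = torch.nn.Sequential(
        *[torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()) for _ in range(64)]
    )
    x = torch.randn(32, 512, requires_grad=True)

    # Checkpoint in 8 segments: activations inside each segment are recomputed
    # on the backward pass instead of being stored.
    out = checkpoint_sequential(net, 8, x)
    out.sum().backward()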
I'm missing a lot of relevant work, and again I don't even know all the work that was presented at this one workshop. But I hope this helps.
Yes, "tuning for criticality" was the paper I was thinking of ! But I'm afraid I'm a dilettante when it comes to neuroscience. I basically just know the basic theories about consolidating learning during sleep.
Thank you for your answer. It appears to me that we are trying to achieve an algorithm that has better time complexity than the one we have right now (reverse-mode differentiation with gradient descent).
Is it possible to combine these methods in a straightforward manner with methods that try to reduce the space complexity? For example, the lottery ticket hypothesis (https://arxiv.org/abs/1803.03635) seems to reduce space complexity (please do correct me if I am wrong).
Also, based on my rather poor and limited knowledge, it appears to me that the set of proposed methods that reduce space complexity and the set of proposed methods that reduce time complexity are disjoint. Is that the case?
Thanks for your question! But as I said, no one is really worried about the asymptotic time complexity of reverse-mode differentiation, although there is scope for improving the constants. The main scope for improvement is in the space complexity.
There is a lot of work on trying to speed up optimization, for example the K-FAC optimizer by Roger Grosse that uses second-order gradient information in a scalable way.
The lottery ticket pruning strategies do reduce space complexity, but I think the main reason people are interested in them is to reduce training time or deployment memory requirements, not so much training memory requirements.
As for whether memory-saving and time-saving approaches are disjoint, many methods (like checkpointing) introduce a tradeoff between time and space complexity, so no.
(Lottery Ticket, to date, produces small networks ex post facto... You still have to train the giant network. There's also some indication that it's chancy on 'large' datasets+problems. https://arxiv.org/abs/1902.09574 )
One approach I've been thinking about is optimizing each neuron using only the global loss and information about the neighboring neurons.
Basically, if the network made the correct prediction, tell each neuron to do a little bit more of what it just did. If it sent a high output, change the weights so it sends an even higher output: weaken connections that were inhibitory and strengthen connections that were excitatory. And for a neuron with a low output, make it even lower by doing the opposite.
If, on the other hand, the prediction was wrong, try to make each neuron do less of what it did.
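Roughly, as a sketch of what I mean (hypothetical code, one layer at a time, a single global ±1 reward signal, no backpropagated gradient):

    import torch

    def local_update(W, x, h, reward, lr=1e-3):
        # W: (out, in) weights of one layer; x: (batch, in) inputs to that layer;
        # h: (batch, out) post-activation outputs; reward: +1 if the network's
        # prediction was right, -1 if it was wrong.
        with torch.no_grad():
            direction = torch.sign(h - h.mean(dim=0))   # "high" vs "low" output per neuron
            # push each neuron further in the direction it already went (or away, if wrong)
            W += lr * reward * (direction.t() @ x) / x.size(0)
        return W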
Do you know if something like this has been tried?
This is basically what the chain rule / backprop does. The only caveat is that it's pretty hard to make a network with only positive weights. And once you allow negative weights, you have to know whether your downstream neighbours should have had higher or lower output. The easiest way to do this is to propagate the gradient backwards from the output, hence backprop.
Backprop is good at giving credit where credit is due: you're looking at the impact of each weight on the loss, which allows changing each weight to improve the loss by an amount appropriate relative to the other weights. You can even have some negative weight gradients and some positive; i.e., it may be that even with a 'good' overall result, it's best to turn down a particular weight.
So my guess is that this approach would either take a much longer time to converge (as there's less information transmitted back for the neuron updates) or stall out completely.
Probably not too hard to code up, if you want to try it. But I would also be pretty surprised if it hadn't been tried before.
I agree it's likely to converge slower, but I was thinking that it could maybe make up for it by being cheaper. Based on what I know it also seems more biologically plausible that learning is a mostly local process.
Slightly tangential question, but on reading the article I was surprised that it mentioned multiple anonymous submissions, given that this was a workshop at one of the most prestigious ML conferences. Any particular reason for this that you can think of?
The names of all the authors are clearly listed on the workshop website [1]. I think this is just a case of incompetence on the part of the journalist.
The submission process is anonymous, but after papers are accepted there's no need for anonymity, so author names are published. I don't see why you're questioning the competence of the journalist if you don't know the submission process yourself.
To be charitable, you could call it misleading laziness. But the article was published long after all the papers were de-anonymized, so it seems perverse to explain that the submissions were originally anonymous instead of just giving the authors' names.
Furthermore, the journalist managed to write an entire article about a workshop without once naming it, giving a link, or even defining backpropagation.
Edit: I want to say that I understand it's hard to cover an area that you're not an expert in. But if the journalist had googled the titles of the papers he was writing about, he would have found the authors' names. Instead he gave the reader the impression that the papers are still anonymous.
The best part of the article is a quote by G. Hinton, the father of "deep learning", at the end: "My view is throw it all away and start again, I don’t think that’s how the brain works."
While Hinton's view needs to be noted, I heard a quote attributed to Yann LeCun, something like,
"If you want to learn flying by modeling the biology of birds, you're doing it wrong. Just look at today's airplanes. They have no resemblance to birds at all. Yet they're million times better and faster than any bird."
I get what the author was trying to say, but it's still -- a very limited view. Mostly because of the last bit (better/faster).
Birds are to planes, as humans are to cars. Yet can a car leap over barricades, climb mountains, trees, self-repair, turn on a dime, stop instantly, etc, etc?
A plane cannot maneuver like a bird, take off in crazy weather conditions, land on a dime in a tree, stop almost instantly in flight, and change direction, etc.
I think what you've quoted has a lot of value here, because what we should expect from an artificial brain isn't a human brain. That's true. However, while it may be faster in some specific capacity, it won't have the same characteristics.
So yes, expecting it to be like a human brain doesn't make sense.
Yet better/faster? I don't think we can compare this, they're too different.
(which is really the quote's point, but I just didn't like the better/faster bit at the end...)
Also, bird (and insect/bat/pterosaur) flight is a lot more energy-efficient than any plane's. Today's deep learning is essentially brute force, burning thousands of watts for anything more complicated, which a single human brain can often do on ~15 watts.
Advanced models like GPT-3 are burning millions of watts in the cloud, but they're not that much better than what a brain can do (and are in many ways worse, e.g. often requiring supervised learning).
That's the key point. The algorithms need to become more energy efficient to make significant leaps, thus become more like brains.
Also, birds produce themselves out of an egg, with only food, water, and air as production inputs. They can also produce more of themselves with minimal input. And they are self-repairing/self-maintaining, something planes can't do.
There were similar arguments when AlphaGo showed up and beat master Go player Lee Sedol, but is power (in watts) the right measurement? I always feel like it should be the total energy (in J or cal) required to take a computing device, whether a biological brain or an electrical computer, from knowing nothing to being capable of a skill like the game of Go. In that sense, deep learning is still more energy-efficient than a human.
Lee Sedol's lifetime energy consumption for all biological processes is around 50 MWh. AlphaGo drew more than 100 kW and was likely trained for more than 500 hours, and 100 kW × 500 h is already 50 MWh, so even counting all the energy Sedol spent eating, dancing, or having sex, his total ends up being less than the amount of energy spent to train the machine used to beat him.
No, it's not. What's your comparison? Are there birds that can carry 80,000 lb of passenger + cargo weight? Condors fly like fixed-wing aircraft for 99% of their flight, hummingbirds fly more like insects. There isn't one type of bird flight.
This whole HN discussion of bird flight is a trainwreck and reflects massive gaps in understanding of aerodynamics. This is '00s "computer virus news report" level competence in this subject.
We understand the aerodynamics of bird flight, and used it to make fixed-wing planes optimized for carrying lots of cargo. Once we understand the principles behind intelligence, we can make very efficient AI optimized for our usage. But we're still at the point where we don't understand intelligence as well as we understood aerodynamics when building the first planes, so we still have a lot to learn from "birds" - animal brains.
I mean, we could quibble about exactly where we are pre- or post-Wright Flyer, but given the amount of AI research that amounts to brute-force flailing about in search of incremental improvements, disagreements on the importance of "biological plausibility" and so on, it's pretty clear that, roughly speaking, AI is currently somewhere in the equivalent of the Lilienthal-Langley-Wright-Curtiss continuum (ie. 1890-1910-ish) and still prior to the most important theoretical breakthroughs. IOW, AI has not in my opinion yet achieved an equivalent to aerodynamics' Prandtl lifting-line theory: https://en.m.wikipedia.org/wiki/Lifting-line_theory
I believe AI will start as a basic principle or idea that can be applied to any sufficiently big state machine that controls, e.g., an RC airplane or traffic lights. That idea will be obvious in hindsight. I'd even make a guess that it will be like a "stateful" state machine that accumulates state in a particular manner and uses that to control the underlying state machine. We still will be nowhere near understanding intelligence, but that clever trick will be enough in most cases.
> The question of whether a computer can think is no more interesting than the question of whether a submarine can swim. --- Dijkstra
As for better/faster, we would not compare directly to humans, but to benchmarks and timed experiments.
LeCun is saying to treat "intelligence" the same as "flight" or "swimming". It is a matter of function, not a matter of a specific instantiation on a biological substrate. You don't need to recreate flapping wings to gain "flight", you can strap a combustion engine on a cylinder and beat all birds on earth in regards to speed. You don't say "we don't have flight yet", because an airplane is not able to land on a tree branch. Maybe we don't have yet all the components and aspects of "flight", but this is not a show stopper, and drones have come a long way.
Now the more interesting question becomes: What are the laws of aerodynamics for intelligence?
Aside: I think it is absolutely insane that a conference workshop, with papers yet to go through peer review, is highlighted as a popsci article on VentureBeat. That's such a narrow workshop that even researchers in the field may be unaware of it. And now they get to read the paper summaries via an HN story. "The centre cannot hold."
Aside II: Yann LeCun talk from 2019 about this subject (better to debate the source ;)):
> Clearly, Deep Learning research would greatly benefit from better theoretical understanding. DL is partly engineering science in which we create new artifacts through theoretical insight, intuition, biological inspiration, and empirical exploration. But understanding DL is a kind of "physical science" in which the general properties of this artifact is to be understood. The history of science and technology is replete with examples where the technological artifact preceded (not followed) the theoretical understanding: the theory of optics followed the invention of the lens, thermodynamics followed the steam engine, aerodynamics largely followed the airplane, information theory followed radio communication, and computer science followed the programmable calculator. My two main points are that (1) empiricism is a perfectly legitimate method of investigation, albeit an inefficient one, and (2) our challenge is to develop the equivalent of thermodynamics for learning and intelligence. While a theoretical underpinning, even if only conceptual, would greatly accelerate progress, one must be conscious of the limited practical implications of general theories. --- https://www.ias.edu/video/DeepLearningConf/2019-0222-YannLeC...
It's true we don't build planes like birds. But we understand how bird wings work, and we understand why we don't design planes like bird wings.
We don't have the same kind of understanding of how brains learn, so the comparison is not quite right.
When we understand how to build things that learn like brains, we'll be in a better position to say things like "Ok this is strictly worse than backprop, let's stick with backprop" or "Actually, this is better than backprop because X" (or, more likely, there are things we can use from both). Until we have that understanding it's silly to stop trying to understand how the brain does things.
That being said, nobody is going to stop working on backprop, and no one is going to stop working on understanding biological mechanisms. Research works by a bunch of people investigating different avenues simultaneously.
Better is subjective.
A specific AI may perform better than a human (e.g. processing images for hours, faster and longer than a human could).
But we're still not able to have an AI you can talk to the way you talk to a human (GPT-3 is still nowhere near enough in my book, albeit a good technological feat).
I would caveat "million times better." When 747s pop out baby 747s and procreate and refuel autonomously for 60 millon years I will give you million times better.
The physics of lift and aerodynamics were faaaaar from well understood at the time of the first airplanes, though. New areas tend to run a bit ahead of the underlying science; the fundamentals expand to support and improve the applications over time.
But we did have quite a few advances by the time of the first airplane. For example, steam and combustion engines had already been invented, which required a non-trivial understanding of physics and chemistry, and materials science was very advanced.
I hold the pessimistic view that we are still in hunter-gatherer mode when it comes to understanding cognition.
Well, it's your right to be a pessimist... I tend to think that the current hardware specialized for fast, parallelized linear algebra is at least as good as the wheels available at the start of the industrial revolution, though. We have learning algorithms that can match human/animal performance in a wide - but still constrained - set of tasks, which previous non-learned algorithms hadn't been able to crack. It's a start!
At some point you have to strike rocks to make fire, because the butane lighter hasn't been invented yet. You make do with what's available, and progressively get better at it. I tend to think that we're a couple-few perspective shifts away from getting it 'right,' and that the hardware side likely barely matters. But, I'm an optimist.
People build things without understanding the underlying principles all the time, e.g. the steam engine. You could probably make the case that building things has helped our understanding more than our understanding has helped us build things.
Having said that, you can certainly improve a design when you better understand the fundamentals (vs intuition + trial & error).
"They have no resemblance to birds at all. Yet they're million times better and faster than any bird."
LMAO
A 6-year-old kid can see the fundamental resemblance between a bird and a modern passenger airplane:
The wings
Tail stabilizer
Slender body
Planes are faster and bigger.
Are they better?
Not necessarily. For example, a hummingbird can fly in ways that are far beyond any human machine in terms of efficiency and flexibility.
Of course humans should not imitate birds, because human flight is a fundamentally different activity from bird flight. But to say human aviation did not start by mimicking birds is like saying ANNs were not inspired by the human brain...
I think the point was that aviation at the time did start by mimicking birds, and that was why there was so much failure. It was not until they let go of mimicking birds and took a different approach that they found success.
The main problem is not backpropagation though, but the fixation of resources on DL projects (that's what I call a local minimum!). In my department, for example, they don't seem to care about the application, integration, deployment, etc., as long as it's DL or DRL.
So assign random values to connection weights and then ‘spin’ those weights to a combination of other random values that hopefully performs a bit more favourably... isn’t this just random search?
It's not a random search through the parameter space:
"But how do we select a good network from these Kn different networks? Brute-force evaluation of
all possible configurations is clearly not feasible due to the massive number of different hypotheses.
Instead, we present an algorithm, shown in Figure 1, that iteratively searches the best combination
of connection values for the entire network by optimizing the given loss. To do this, the method
learns a real-valued quality score for each weight option. These scores are used to select the weight
value of each connection during the forward pass. The scores are then updated in the backward pass
based on the loss value in order to improve training performance over iterations."
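As a rough paraphrase of that idea in PyTorch (my own sketch, not the paper's code): each connection gets K frozen random candidate values plus K learned scores; the forward pass uses the highest-scoring candidate, and the backward pass updates only the scores via a straight-through trick.

    import torch
    import torch.nn.functional as F

    class ScoredChoiceLinear(torch.nn.Module):
        # Hypothetical layer: each connection picks one of k frozen random candidate
        # values; only the per-candidate scores are trained, never the values.
        def __init__(self, in_f, out_f, k=8):
            super().__init__()
            self.register_buffer("options", 0.1 * torch.randn(out_f, in_f, k))
            self.scores = torch.nn.Parameter(torch.zeros(out_f, in_f, k))

        def forward(self, x):
            probs = torch.softmax(self.scores, dim=-1)
            hard = F.one_hot(probs.argmax(dim=-1), probs.shape[-1]).float()
            sel = hard + probs - probs.detach()      # straight-through selection
            w = (self.options * sel).sum(dim=-1)     # chosen weight for each connection
            return x @ w.t()                         # (batch, in_f) -> (batch, out_f)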
Random search is a technical term in optimization with a very specific meaning (which unfortunately does not mean searching random locations in parameter space a la brute force). It’s more in the spirit of randomly deciding the direction in which to try to take the next step, thereby implicitly deriving a gradient component by sampling.
It reminds me of Bayesian model sampling, where you have a distribution over possible weights and 'draw' a model from the distribution for each evaluation... A problem is that there may be interesting co-dependencies amongst the weights which independent sampling will have a hard time getting right.
However, it's still quite hard to get useful results from it in practice.
---------------
I personally believe that one component of intelligence is the ability to apply cognitive patterns created for a particular input to other inputs. (Very simplified example: A block of "neurons" which have learned to recognize the pattern "is hurt by" when given a subject (group of pixels in image) and object (other group of pixels in image), could be applied to another subject/object pair, for example coming from processed audio. But if the audio processing takes 10 layers, and the image processing 5, the connection has to run backwards)
To do this in a state-of-the-art deep network, you need the ability to create backward connections. Backward connections imply loops, and loops break backprop (unlike loops in RNNs, which can be easily unrolled, AFAIK). So with the current backprop-trained feedforward model, you have to create patterns multiple times instead of reusing them.
This is why I will pay attention to backprop alternatives which allow loops, despite their (currently many) disadvantages. This and modular training are the two aspects of learning I would personally focus on.
I found it very weird that the SLIDE algorithm from early 2019 isn’t mentioned. Maybe I missed it or maybe it is compared just deeper in the referenced publications?
SLIDE seems way, way superior to any of the listed solutions or approaches, as far as I could tell on a first read through.
But there’s also been a lot of research suggesting most SOTA dense networks are arbitrarily replicatable with sparse networks, and may even be better in the sense of less overfitting. Perhaps things like GPT are still an exception, but for most applications SLIDE should work to train networks just as effective as naively specified dense architectures.
> But there’s also been a lot of research suggesting most SOTA dense networks are arbitrarily replicatable with sparse networks
I'm not sure if it's related, but would this work kind of like how Armadillo can do a singular value decomposition [0] of a matrix by embedding an arbitrary n by m matrix X in a higher-dimensional (n+m) by (n+m) null matrix M?
Yeah. I think part of the problem is just that SLIDE represents a Kuhnesque paradigm shift and these things take time. I really want to play with SLIDE myself but just haven't had a chance.
It's not a complicated concept, just a stretch of the concept of memory. Training in deep learning is done in batches. So "learning" (i.e. the gradient updates to the weights) that happens due to your early batches of data can be undone by the gradient updates for later batches.
The gradient in machine learning is based on the loss. Specifically, it's the direction that reduces the loss the fastest. So updates are driven not only by the most recent batches, but specifically by the recent data that is predicted incorrectly. The model doesn't carry any "confidence" from having predicted earlier data correctly; it only cares about changing to suit the most recent batches.
Seems like you could just use better active learning strategies to get around the issue, though... Keep your usual dataset, but progressively build a reservoir of 'important' examples while training. (where important == high loss or near decision boundary, for example.) Then when building batches, mix in some examples from the broad training set and some from the reservoir.
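Something like this, roughly (a sketch assuming a PyTorch-style loop, a per-example loss with reduction='none', and made-up names):

    import random
    import torch

    def train_epoch(model, opt, loss_fn, loader, reservoir, capacity=1024, mix_frac=0.25):
        for x, y in loader:
            k = int(mix_frac * x.size(0))
            if len(reservoir) >= k > 0:              # mix in some stored "hard" examples
                rx, ry = zip(*random.sample(reservoir, k))
                x = torch.cat([x, torch.stack(rx)])
                y = torch.cat([y, torch.stack(ry)])
            opt.zero_grad()
            losses = loss_fn(model(x), y)            # per-example losses
            losses.mean().backward()
            opt.step()
            # keep the highest-loss examples from this batch for later batches
            for i in losses.detach().argsort(descending=True)[: max(k, 8)].tolist():
                reservoir.append((x[i].detach().cpu(), y[i].detach().cpu()))
            del reservoir[:-capacity]                # bound the reservoir size
        return reservoir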
It's effective, but the so-called "boundary cases" often have to be hand-chosen due to the difficulty of selecting them automatically: early samples always have high loss, and decision-boundary nearness is implicitly tied to the accuracy of the network at the time of evaluation. In other words, the function we evaluate on the forward pass is itself changing as a result of backprop, so the critical points and the output of the function are also in flux.
You also lose an increasing portion of each batch to the "important" cases as you add more, so maintaining the size and contents of this pool is difficult - if you added every case, you'd have no new data.
So I think it's promising, but it needs more foundational work on deriving the impact of individual samples on the output. (If we ever get that breakthrough in explainability...)
Neat! It makes sense that you'd also want a mechanism for taking things out of the reservoir.
Overall I tend to think that this space is underexplored compared to searching for new architectures... We know that it helps to choose a curriculum for humans to help guide learning, even beginning with 'baby talk' to develop early communication skills.
Is this true? My understanding was that in fine-tuning, you'd only retrain some of the layers. And even if you retrain all the layers, the starting point for the layers is not random. If it really was all forgotten, then fine-tuning would not be orders of magnitude faster...
Gradient descent optimizes the performance of a model on a given dataset. If you stop training on one dataset and start training on another, your model will become more optimized for the second dataset and less optimized for the first. This will usually result in degraded performance on classes of data found more commonly in the first dataset but not the second. This is what people mean by "forgetting". It doesn't matter how much of the model you fine-tune; the effect is still present, though the effect size varies.
So, at this point, the most likely explanation seems to be that TFA was written prior to the unblinding, and then wasn't subsequently reviewed or updated before publication.
These are workshop submissions (which typically implies a more lightweight review process, for more exploratory work), and it is possible the same submissions are currently in blind review for other conferences in their final form.
I mean if we want to get pedantic I'm pretty sure Shannon used "backpropagation" for machine learning before either was called such.
Feedback for the purpose of regulating the state of a machine in response to input dates to antiquity, if we're really getting absurd. The formal definition is also debatable, I think Maxwell has the strongest claim.
> In the 1960s, academics including... arrived at the theory of backpropagation.
It was clearly phrased this way specifically because backprop is just the chain rule applied in a particular direction, and as such has been invented and reinvented over and over by everyone under the sun. Hell, a lazy googling says gradient descent goes back to Cauchy.
I'd say Gottfried W. Leibniz is the true author, as it all comes down to calculus. The particular implementation for "neural nets" is just a special case of function minimization by taking derivatives.
I like zoom-out views. To push what you describe further, it is essentially what ancient humans or their non-hominid forebears did subconsciously when calculating optimum motion trajectories to catch or spear prey while hunting... merely a version in formal notation ... we can thank the zero of India (https://en.wikipedia.org/wiki/0#History), the Persians (https://en.wikipedia.org/wiki/Algorithm#Etymology), the Islamic renaissance in Europe (https://mitpress.mit.edu/books/islamic-science-and-making-eu...) and numerous others for the slow development of the requisite formal maths. But a rose by any other name would smell as sweet. And perhaps, in the context of the stupefyingly deferred emergence of zero, even nameless!
> To push what you describe further, it is essentially what ancient humans or their non-hominid forebears did subconsciously when calculating optimum motion trajectories to catch or spear prey while hunting