It's not a complicated concept, just a stretch of the usual notion of memory. Training in deep learning is done in batches, so the "learning" (i.e. the gradient updates to the weights) that happens on your early batches of data can be undone by the gradient updates for later batches.
The gradient in machine learning is based on the loss; specifically, it's the direction that reduces the loss the fastest. So the updates are driven not just by the most recent batches, but specifically by the recent data that is predicted incorrectly. The model doesn't carry any "confidence" from having predicted earlier data correctly; it only cares about changing the weights to suit the most recent batches.
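To make that concrete, here's a minimal sketch (PyTorch assumed, random stand-in data): the gradient at each step comes only from the current batch's loss, so nothing in the update "remembers" the data the model already handled correctly.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)                           # stand-in for any network
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # Random stand-in batches of (inputs, labels).
    batches = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]

    for x, y in batches:
        opt.zero_grad()
        loss = loss_fn(model(x), y)   # loss of *this* batch only
        loss.backward()               # gradient = direction that reduces this batch's loss fastest
        opt.step()                    # weights move to fit the current batch, regardless of earlier ones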
Seems like you could just use better active learning strategies to get around the issue, though... Keep your usual dataset, but progressively build a reservoir of 'important' examples while training (where important == high loss or near the decision boundary, for example). Then when building batches, mix in some examples from the broad training set and some from the reservoir.
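Something like this, roughly (a sketch in PyTorch with made-up names like `reservoir`, `loss_threshold`, and `mix_k`; not a tested recipe):

    import random
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss(reduction="none")   # per-example losses

    reservoir = []          # "important" (still-high-loss) examples seen so far
    loss_threshold = 1.0    # arbitrary cutoff for "important"
    mix_k = 8               # reservoir examples mixed into each batch

    # Stand-in dataset of (features, label) pairs.
    dataset = [(torch.randn(10), random.randint(0, 1)) for _ in range(1000)]

    def make_batch(fresh, k=mix_k):
        # Combine fresh samples with up to k samples drawn from the reservoir.
        replay = random.sample(reservoir, min(k, len(reservoir)))
        items = fresh + replay
        x = torch.stack([xi for xi, _ in items])
        y = torch.tensor([yi for _, yi in items])
        return x, y

    for step in range(100):
        fresh = random.sample(dataset, 32)
        x, y = make_batch(fresh)
        opt.zero_grad()
        per_example = loss_fn(model(x), y)
        per_example.mean().backward()
        opt.step()
        # Fresh examples the model still gets badly wrong go into the reservoir.
        for (xi, yi), l in zip(fresh, per_example[:len(fresh)].detach().tolist()):
            if l > loss_threshold:
                reservoir.append((xi, yi))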
It's effective, but the so-called "boundary cases" often have to be hand-chosen because they're hard to select automatically: early samples always have high loss, and nearness to the decision boundary depends on how accurate the network is at the moment of evaluation. In other words, the function we evaluate on the forward pass is itself changing as a result of backprop, so its critical points and outputs are also in flux.
You also lose an increasing portion of each batch to the "important" cases as you add more, so maintaining the size and contents of this pool is difficult - if you added every case, you'd have no new data.
So I think it's promising, but it needs more foundational work on deriving the impact of individual samples on the output. (If we ever get that breakthrough in explainability...)
Neat! It makes sense that you'd also want a mechanism for taking things out of the reservoir.
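For instance, one plausible (entirely hypothetical) eviction rule: cap the reservoir and periodically re-score it with the current model, keeping only the examples it still gets wrong so the pool doesn't crowd out fresh data.

    import torch

    CAPACITY = 256   # arbitrary cap on reservoir size

    def prune_reservoir(reservoir, model, loss_fn, capacity=CAPACITY):
        # Re-score the stored (x, y) pairs with the *current* model and keep
        # only the `capacity` highest-loss ones; the rest have been learned
        # well enough to make room for new material.
        # Assumes loss_fn was built with reduction="none" (per-example losses).
        if not reservoir:
            return reservoir
        with torch.no_grad():
            x = torch.stack([xi for xi, _ in reservoir])
            y = torch.tensor([yi for _, yi in reservoir])
            losses = loss_fn(model(x), y).tolist()
        ranked = sorted(zip(reservoir, losses), key=lambda pair: pair[1], reverse=True)
        return [item for item, _ in ranked[:capacity]]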
Overall I tend to think that this space is underexplored compared to the search for new architectures... We know that choosing a curriculum helps guide human learning, even beginning with 'baby talk' to develop early communication skills.
Is this true? My understanding was that in fine-tuning, you'd only retrain some of the layers. And even if you retrain all the layers, the starting point for those layers is not random. If it really were all forgotten, then fine-tuning would not be orders of magnitude faster...
Gradient descent optimizes a model's performance on a given dataset. If you stop training on one dataset and start training on another, your model becomes more optimized for the second dataset and less optimized for the first. This usually degrades performance on classes of data that are common in the first dataset but not the second, and that is what people mean by "forgetting". It doesn't matter how much of the model you fine-tune; the effect is still present, though its size varies.
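A toy illustration of that effect (PyTorch assumed, synthetic blob data, full-batch SGD for brevity): train on one distribution, then fine-tune the whole model on a shifted one, and accuracy on the first typically collapses even though nothing was explicitly erased.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    def make_task(offset):
        # Two Gaussian blobs per task; the offset shifts where the boundary lies.
        x = torch.randn(2000, 2) + offset
        y = (x[:, 0] + x[:, 1] > 2 * offset).long()
        return x, y

    def accuracy(model, x, y):
        with torch.no_grad():
            return (model(x).argmax(dim=1) == y).float().mean().item()

    def train(model, x, y, steps=500, lr=0.1):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
    xa, ya = make_task(offset=0.0)   # "first dataset"
    xb, yb = make_task(offset=4.0)   # "second dataset", different distribution

    train(model, xa, ya)
    print("task A accuracy after training on A:", accuracy(model, xa, ya))
    train(model, xb, yb)             # fine-tune every layer on the new data
    print("task A accuracy after fine-tuning on B:", accuracy(model, xa, ya))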