I found it very weird that the SLIDE algorithm from early 2019 isn’t mentioned. Maybe I missed it, or maybe it is compared somewhere deeper in the referenced publications?
SLIDE seems way, way superior to any of the listed solutions or approaches, as far as I could tell on a first read through.
But there’s also been a lot of research suggesting most SOTA dense networks can be replicated with sparse networks, and the sparse versions may even generalize better (less overfitting). Perhaps things like GPT are still an exception, but for most applications SLIDE should be able to train networks just as effective as naively specified dense architectures.
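For what it's worth, here is a rough toy sketch (my own, not taken from any of the papers) of the kind of dense-to-sparse substitution I mean: magnitude-prune a dense weight matrix and compare the layer outputs. The sizes and the 10% density are arbitrary.

```python
import numpy as np

# Toy illustration: replace a dense weight matrix with a sparse one by
# keeping only the largest-magnitude 10% of entries, then compare the
# layer outputs on a random input. (Arbitrary sizes/density, just a sketch.)
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))        # dense layer weights
x = rng.standard_normal(256)               # one input vector

threshold = np.quantile(np.abs(W), 0.90)   # keep the top 10% by magnitude
W_sparse = np.where(np.abs(W) >= threshold, W, 0.0)

dense_out = W @ x
sparse_out = W_sparse @ x
print("kept fraction:", np.mean(W_sparse != 0))
print("relative output error:",
      np.linalg.norm(dense_out - sparse_out) / np.linalg.norm(dense_out))
```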
> But there’s also been a lot of research suggesting most SOTA dense networks can be replicated with sparse networks
I'm not sure if it's related, but would this work sort of like how Armadillo can do a singular value decomposition [0] of a matrix by embedding an arbitrary n-by-m matrix X in a higher-dimensional (n+m)-by-(n+m) null matrix M?
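For concreteness, here is a small numpy sketch of that embedding as I understand it (sizes and the tolerance are just for illustration): form the symmetric matrix M = [[0, X], [Xᵀ, 0]]; its positive eigenvalues are the singular values of X, so a symmetric eigensolver recovers them.

```python
import numpy as np

# Embed an arbitrary n x m matrix X into an (n+m) x (n+m) zero matrix:
#   M = [[0, X], [X^T, 0]]
# M is symmetric, and its nonzero eigenvalues come in +/- sigma_i pairs,
# where sigma_i are the singular values of X.
rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.standard_normal((n, m))

M = np.zeros((n + m, n + m))
M[:n, n:] = X
M[n:, :n] = X.T

eigvals = np.linalg.eigvalsh(M)                  # ascending, includes +/- pairs
sing_vals = np.linalg.svd(X, compute_uv=False)   # descending singular values

print(np.sort(eigvals[eigvals > 1e-10])[::-1])   # positive eigenvalues of M
print(sing_vals)                                 # should match the line above
```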
Yeah. I think part of the problem is just that SLIDE represents a Kuhnian paradigm shift, and these things take time. I really want to play with SLIDE myself but just haven't had a chance.
> SLIDE seems way, way superior to any of the listed solutions or approaches, as far as I could tell on a first read through.
https://arxiv.org/abs/1903.03129
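For anyone who hasn't read the paper, here is a toy sketch of the core idea as I read it (my own simplification using SimHash and a single table; the paper uses other LSH families, multiple tables, and applies the same trick to backprop): hash each neuron's weight vector, and per input only compute the neurons that land in the input's bucket instead of the full dense matrix product.

```python
import numpy as np

# Toy SLIDE-like sparse forward pass (a simplification, not the authors' code):
# bucket neurons by a SimHash of their weight vectors, then for a given input
# compute activations only for neurons in the input's bucket.
rng = np.random.default_rng(0)
d, n_neurons, n_bits = 128, 4096, 6

W = rng.standard_normal((n_neurons, d))     # one layer's weight vectors
planes = rng.standard_normal((n_bits, d))   # random hyperplanes for SimHash

def simhash(v):
    """Signed-random-projection hash: one bit per hyperplane."""
    bits = (planes @ v) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

# Hash table: bucket id -> neuron indices (built once, reused per input).
buckets = {}
for i, w in enumerate(W):
    buckets.setdefault(simhash(w), []).append(i)

x = rng.standard_normal(d)
active = buckets.get(simhash(x), [])        # candidate "active" neurons only
out = np.zeros(n_neurons)
out[active] = W[active] @ x                 # sparse forward pass

print(f"computed {len(active)} of {n_neurons} neurons")
```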