Hacker News | edwardjhu's comments

Yup. That's exactly what happened.


Good question! I came up with the name because the idea is best described as low-rank adaptation. I know very little about radio communication and didn't anticipate the visibility my repo has today :)


> Merged means you are modifying the model weights, which means you are stuck with that one model on that device (though, this usually applies for most implementations for the unmerged versions too).

If one is careful with floating point issues, it's straightforward to unmerge the weights.

W_0 = W_1 - BA
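As a sketch (numpy, with random stand-ins for trained weights), merging and unmerging is just adding and subtracting the same low-rank product:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4

W0 = rng.standard_normal((d_in, d_out)).astype(np.float32)  # frozen base weight
A = rng.standard_normal((d_in, r)).astype(np.float32)       # LoRA factors
B = rng.standard_normal((r, d_out)).astype(np.float32)

W1 = W0 + A @ B        # merge: bake the low-rank update into the weight
W0_rec = W1 - A @ B    # unmerge: subtract the same product back out

# Recovery is exact up to floating point rounding.
print(np.max(np.abs(W0_rec - W0)))
```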

Yes, prompt-based methods don't involve swapping weights.


Right, it's mathematically easy (again, up to floating point issues) to recover the weights as needed, but in terms of distribution/serving I'm guessing the plan is to keep the original weights, carry around the LoRA weights separately, and merge as necessary.

(Also, I'm assuming you're the first author of LoRA.)


Yes, the plan is to keep the original weights in VRAM and merge/unmerge LoRA weights on the fly. You can even cache a large library of LoRA ckpts in RAM.

Yup, I am!


Hi! I'm the author of the repo.

The insight is that we don't need to modify a lot of parameters to get a generally competent model to do well on specific tasks. When you have a linear layer with a weight matrix of dimension d_in x d_out, the change it undergoes during full finetuning is also a matrix of dimension d_in x d_out, which can be huge. We represent the latter using two matrices of shape d_in x r and r x d_out. You save a lot of parameters when r is small. So when you use it, the input goes through two streams: 1) the original frozen weight, turning a vector of size d_in into one of size d_out, and 2) the low-rank weights, turning a vector of size d_in to r and then r to d_out. The two streams are then summed together. (There's a figure in the paper.)

This way of doing things is nice for a few reasons. It's easy to parallelize. You can change r to control how many parameters to train. You can also merge the low-rank weights with the original ones to avoid inference latency.

Note that we don't select a subset of the original parameters. We train extra ones.
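A minimal numpy sketch of the two streams described above (illustrative shapes; per the paper's initialization, B starts at zero and A is random, so the low-rank path is a no-op before training):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4

W = rng.standard_normal((d_in, d_out))  # frozen pretrained weight, d_in x d_out
A = rng.standard_normal((d_in, r))      # trainable low-rank factor, d_in x r
B = np.zeros((r, d_out))                # trainable low-rank factor, r x d_out (zero-init)

def lora_forward(x):
    frozen = x @ W            # stream 1: d_in -> d_out through the frozen weight
    low_rank = (x @ A) @ B    # stream 2: d_in -> r -> d_out
    return frozen + low_rank  # sum the two streams

x = rng.standard_normal(d_in)
y = lora_forward(x)

# Trainable parameters: d_in*r + r*d_out = 384, vs d_in*d_out = 2048 for full finetuning.
print(A.size + B.size, W.size)
```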


What is the difference to training an adapter? Or to adding a new task specific layer [0]? Has it been demonstrated that LoRA works best out of all of these approaches?

[0] https://towardsdatascience.com/adding-custom-layers-on-top-o...


Adapters are extra layers inserted between existing layers, so they can't be parallelized. LoRA reparametrizes the weight updates and is easily parallelized or merged with the original weights during inference. Also, if you let the rank r be the hidden size you roughly recover finetuning, so you can see LoRA as a generalization of the latter.

Adding a task-specific layer and training only that layer doesn't work well. In practice, people combine many of these things, e.g., LoRA + a task-specific final layer.


Thanks for the clarification. Does that mean then that when parallelization is not important, training an adapter might be just as good as or better than LoRA?


If latency is irrelevant, I don't think there is a strong practical reason to prefer one over the other. (LoRA is more elegant, in my biased opinion, because you roughly recover finetuning with a large r.) In practice, you see one do a little better on some tasks and vice versa on others, as observed by papers after mine.


Hi! I in _no way_ mean to detract from or malign the parent comment (communication is hard!!), BUT I really want to compliment that exact sentence. :)

My background is in signal processing, "pre-deep learning ML", systems engineering, and firmware, and that sentence jumped out at me as crystal clear in my mind, despite my not knowing what HuggingFace is or PyTorch.

Correct me if I'm wrong: These huge models involve lots of weights used in large matrices. The contribution of this work is to plug in some matrix factorization and learn a lower dimensional representation, instead of a large second matrix.

Fantastic!

Also makes me wonder what other performance improvements await through proper application of established and well known Mathematics. :D


Great, we can get authoritative answers. (I'm trying to understand the ML space mostly through reading; I'm not an expert.)

I am assuming you can have n LoRA fine-tunings, say each specializing in one aspect of a coherent task, with n summers, running in parallel, and then combine them at the end? Or more generally, does LoRA enable a sort of modularizing around a core (un-merged) model?

And curious if you ever tried merging 2 or more fine-tunings and then testing the resultant single model (merge all) against the original tests to check retention?


This paper tries something like that

https://arxiv.org/pdf/2202.13914.pdf

The gain isn't that significant. We don't understand what these low-rank updates represent, and they might not correspond to "skills" that humans have.
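For concreteness, a naive way to combine two LoRA checkpoints trained on the same base model is simply to sum their updates (numpy sketch with random stand-ins for the trained factors; whether the result retains both behaviors is exactly the open question above):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4

W0 = rng.standard_normal((d_in, d_out))  # shared frozen base weight

# Two independently trained LoRA checkpoints (random stand-ins here)
A1, B1 = rng.standard_normal((d_in, r)), rng.standard_normal((r, d_out))
A2, B2 = rng.standard_normal((d_in, r)), rng.standard_normal((r, d_out))

# Naive merge: sum both low-rank updates onto the base
W_merged = W0 + A1 @ B1 + A2 @ B2

# The combined update still has rank at most 2r, far below min(d_in, d_out)
delta = W_merged - W0
print(np.linalg.matrix_rank(delta))
```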


The claim here is a bit misleading, as already pointed out by other comments, since the kernel is an evolving one that is essentially learned after seeing the data.

Contrary to many related works that compare wide neural networks to kernel methods, our recent work shows that one can study a feature learning infinite-width limit with realistic learning rate.

https://arxiv.org/abs/2011.14522

We identified what separates the kernel regime (e.g., NTK) and the feature learning regime. In the infinite-width limit, OP's work could belong to either regime depending on the parametrization, i.e., the path kernel either equals the NTK or the network is learning features.

It's an incredibly interesting research topic. Please feel free to comment with thoughts on our work :)

