I agree, but that can't happen with the vast majority of these models because they're trained on unlicensed data, so they can't slap an open source license on the training data and distribute it.
I've decided to draw my personal line at Open Source Initiative compliance for the license they release the model itself under.
I respect the opinion that it's not truly open source unless they release the training data as well, but I've decided not to make that part of my own personal litmus test here.
My reasoning is that knowing something is "open source" helps me decide what I legally can or cannot do with it when building my own software. Not having access to the training data doesn't affect my legal rights, it just affects my ability to "recompile" it myself. And since I don't have millions of dollars of GPUs, that isn't so important to me, personally.
> that can't happen with the vast majority of these models because they're trained on unlicensed data
Tough beans? There's lots of actual software that can't be open source because it embeds stuff with incompatible restrictions, but nobody tries to redefine "open source" because of that.
... and, on a vaguely similar-flavored note, you'd better hope that the models you're using end up found to be noninfringing or fair use or something with respect to those "unlicensed data", because otherwise you're in a world of hurt. It's actually a lot easier to argue that the models aren't copyrightable than it is to argue that they're not derivative of the input.
> I've decided to draw my personal line at Open Source Initiative compliance for the license they release the model itself under.
You're allowed to draw your personal line about what you'll use anywhere you want, but that doesn't mean that you should try to redefine "open source" or support anybody who does.
The Llama models aren't OSI-compliant. Some of the Mistral models are (the Apache 2 ones). Microsoft Phi-3 is - it's MIT-licensed.