Yes, it's been discussed many times before. All the corporations training LLMs have to have done a legal analysis and concluded that it's defensible. Even one of the white papers commissioned by the FSF ( "Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model" at https://www.fsf.org/licensing/copilot/copyright-implications... ), concluded that using copyrighted data to train AI was plausibly legally defensible and outlined the potential argument. You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.
> Even one of the white papers commissioned by the FSF
Quoting the text which the FSF put at the top of that page:
"This paper is published as part of our call for community whitepapers on Copilot. The papers contain opinions with which the FSF may or may not agree, and any views expressed by the authors do not necessarily represent the Free Software Foundation. They were selected because we thought they advanced the discussion of important questions, and did so clearly."
So, they asked the community to share thoughts on this topic, and they're publishing interesting viewpoints that clearly advance the discussion, whether or not they end up agreeing with them. I do acknowledge that they paid $500 for each paper they published, which gives some validity to your use of the verb "commissioned", but that's a separate question from whether the FSF agrees with the conclusions. They certainly didn't choose a specific author or set of authors to write a paper on a specific topic before the paper was written, which a commission usually involves, and even then the commissioning organization doesn't always agree with the paper's conclusion unless the commission isn't considered done until the paper is updated to match the desired conclusion.
> You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.
This would be consistent with them agreeing with this paper's conclusion, sure. But that's not the only possibility it's consistent with.
It could alternatively be because they discovered or reasonably should have discovered the copyright infringement less than three years ago, therefore still have time remaining in their statute of limitations, and are taking their time to make sure they file the best possible legal complaint in the most favorable available venue.
Or it could simply be because they don't think they can afford the legal and PR fight that would likely result.
Since I very specifically wrote "commissioned by the FSF" instead of "represents the opinion of the FSF" to avoid misrepresenting the paper, you're arguing against something I have not said.
> Even one of the white papers commissioned by the FSF [...] concluded that using copyrighted data to train AI was plausibly legally defensible [...] notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.
I agree with jkaplowitz, but your description also feels a bit misleading to me for a different reason. The FSF-commissioned paper argues that Microsoft's use of code FROM GITHUB, FOR COPILOT is likely non-infringing because of the additional GitHub ToS. That feels like critical context, given that in the very next statement you widened it to LLMs generally, and to the FSF, which likely cares about code that is not on GitHub as well.
All of that said, I'm not sure it matters. I don't find that whitepaper's ToS argument very compelling, because it's based critically on additional grants in the ToS. IIRC (going only from memory) the ToS requires that you grant GitHub a license as needed to provide the service. GitHub can provide the services the user reasonably understood GitHub to provide without violating the additional clauses specified in the existing FOSS license covering the code. That was a while ago, though, and I'd say it's very murky now, because everyone knows Microsoft provides Copilot, so "obviously" they need it.
Unfortunately, and importantly when dealing with copyrights, the paper also covers the transformative fair use arguments in depth. And I do find those arguments very compelling. The paper (and likely others) argues that the code output from an LLM is likely transformative, and thus can't be infringing (or is unlikely to be). I think in many cases the output is clearly transformative in nature.
I've also seen code generated by Claude (likely others as well?) copy large sections from existing works, where it's clearly "copy/paste", which can't be fair use or transformative. The output clearly copies the soul of the work. Given that I have no idea what dataset they're copying this code from, it's scary enough to make me unwilling to take the chance on any of it.
No, it's also not illegal to train your brain. If you break into a store, and read all the books, you'll get arrested for breaking and entering. Not for reading the books. My (superficial) take on the argument is that they're hoping by saying "it's not illegal to read" no one will notice, and no one will ask how they got into the book store to begin with.
The answer is in the name of the law, copyright, the right to produce a copy. The original, ethical intent behind the law was to encourage people to create things. Someone could invest time and money into creating some art that had value, and then they were given the exclusive right to monetize it for some amount of time. You could create something, and I'm not allowed to copy what you created, and sell it without your permission, preventing me from doing no work but capturing all the money you could reasonably make off your work.
Want to create a song? You're the only person allowed to make copies of it, or to authorize others to duplicate it. You're the only person allowed to control the supply of your effort. Eventually, the public good and the public interest were supposed to take over, because in the end, you're right, it's just information. It was supposed to enter "the public domain" where anyone could freely use it. But then Disney got involved, and now it's a toxified weapon used mostly by unethical lawyers against curiosity.
Our current laws are written to make it legal for you to copy the Quran via your brain — some people learn it by rote and can stand up and speak the entire work from one end to the other. This is intended to be legal. Fair use of the Quran.
I went to a concert recently where someone copied every word and (as far as I could hear) every note from a copyrighted work by Bruce Springsteen. Singing and playing. This too is intended to be fair use.
You can learn how to play and sing Springsteen songs verbatim, and you can use his records to learn to sound like him when you sing, and that's intended to be legal.
Since the law doesn't say "but you cannot write a program to do these things, or run such a program once written", why would it be illegal to do the same thing using some code?
The people who want the law to differentiate have a difficult challenge in front of them. As I see it, they need to differentiate what humans do to learn from what machines do, and that implies really knowing what humans do. And then they need to draw boundaries, making various kinds of computer-assisted human learning either legal or illegal.
Some of them say things like "when an AI draws Calvin and Hobbes in the style of Breughel, it obviously has copied paintings by Breughel" but a court will ask why that's obvious. Is it really obvious that the way it does that drawing necessarily involves copying, when you as a human can do the same thing without copying?
> I went to a concert recently where someone copied every word and (as far as I could hear) every note from a copyrighted work by Bruce Springsteen. Singing and playing. This too is intended to be fair use.
Only the learning part is fair use. Playing an artist's songs in public does not violate the copyright of the original performing artist, but it does violate the songwriters' copyright, and you do need a license to play covers in public.
What? I didn't know that. Do you have a reference? I'm particularly interested in the origin — is this something that applies to countries with a common law tradition, a roman law tradition, does it originate in one of the copyright treaties, etc. That kind of question.
The movie played on my screen but I may or may not have seen the results of the pixels flashing. As such, we can only state with certainty that the movie triggered the TV's LEDs relative to its statistical light properties.
You're probably being sarcastic but that's actually how the law works. You'll note that when people get sued for "pirating" movies, it's almost always because they were caught seeding a torrent, not for the act of watching an illegal copy. Movie studios don't go after visitors of illegal streaming sites, for instance.
If I am not mistaken, the law prohibits producing any unauthorized copies. So if you download a pirated book on a computer, you produce an illegal copy: [1]. If I am not missing anything, ML companies are galaxy-scale infringers.
> 106. Exclusive rights in copyrighted works
> Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:
> (1) to reproduce the copyrighted work in copies or phonorecords;
> 501. Infringement of copyright
> (a) Anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 or of the author as provided in section 106A(a), or who imports copies or phonorecords into the United States in violation of section 602, is an infringer of the copyright or right of the author, as the case may be.
The problem is when you use your "copy" as inspiration and actually create and publish something. It is very hard to be certain you are safe: besides literal expression, close paraphrasing is also infringing, as is using world-building elements or any original abstraction (AFC test). You can only know after a lawsuit.
It is impossible to tell how much AI any creator used secretly, so now all works are under suspicion. If copyright maximalists successfully copyright style (vibes), then creativity will be threatened. If they don't succeed, then copyright protection will be meaningless. A catch 22.
> Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?
It makes some sense, yeah. There's also precedent, in Google scanning massive amounts of books, but not reproducing them. Most of our current copyright laws deal with reproductions. That's a no-no. It gets murky on the rest. NVIDIA's argument here is that they're not reproducing the works, they're not providing the works for other people, they're "scanning the books and computing some statistics over the entire set". Kinda similar to Google. Kinda not.
I don't see how they get around "procuring them" from 3rd party dubious sources, but oh well. The only certain thing is that our current laws didn't cover this, and probably now it's too late.
Scanning books is literally reproducing them. Copying books from Anna's Archive is also literally reproducing them. The idea that it is only copyright infringement if you engage in further reproduction is just wrong.
As a consumer you are unlikely to be targeted for such "end-user" infringement, but that doesn't mean it's not infringement.
This is how the Authors Guild v. Google saga ended. The ruling goes through a lot of factors, but in the end the conclusion is this:
> In sum, we conclude that: (1) Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use. (2) Google’s provision of digitized copies to the libraries that supplied the books, on the understanding that the libraries will use the copies in a manner consistent with the copyright law, also does not constitute infringement. Nor, on this record, is Google a contributory infringer.
It seems like they pretty much don't care unless you distribute the copy. There is certainly precedent for it, going back to the Betamax case in the 1980s.
Backups are permitted (and not for all media) when you legally acquired the source. Scanning a physical book is not a permitted backup, and neither is downloading a book from Anna's archive.
> Scanning a physical book is not a permitted backup
On what basis do you claim that?
You're also missing critical legal context. When a would-be consumer downloads pirated media in lieu of purchasing it, he damages the would-be seller. When my automated web scraper inadvertently archives some pirated content on my local disk, no one is financially harmed.
The question is where the boundary between those things lies.
It does make sense. It’s controversial. Your memory memorizes things in the same way. So what nvidia does here is no different, the AI doesn’t actually copy any of the books. To call training illegal is similar to calling reading a book and remembering it illegal.
Our copyright laws are nowhere near detailed enough to settle anything here, so there is indeed a logical and technical inconsistency.
I can definitely see these laws evolving into things that are human centric. It’s permissible for a human to do something but not for an AI.
What is consistent is that obtaining the books was probably illegal, but say if nvidia bought one kindle copy of each book from Amazon and scraped everything for training then that falls into the grey zone.
To be fair, that seems to be where some of the AI lawsuits are going. The argument goes that the models themselves aren't derivative works, but the output they produce can absolutely be - in much the same way that reproducing a book from memory could be copyright violation, trademark infringement, or generally run afoul of the various IP laws.
They do memorize some books. You can test this trivially by asking ChatGPT to produce the first chapter of something in the public domain -- for example A Tale of Two Cities. It may not be word for word exact, but it'll be very close.
These academics were able to get multiple LLMs to produce large amounts of text from Harry Potter:
Unfortunately a settlement doesn't really show you anything definitive about the legality or illegality of something.
It only shows you that the defendant thought it would be better for them to pay up rather than continue to be dragged through court, and that the plaintiff preferred some amount of certain money now over some other amount of uncertain money later, or never.
We cannot say with any amount of confidence how the court would have ruled on the legality, had things been allowed to play out without a settlement.
>Also, generating output is what these models are primarily trained for.
Yes but not generating illegal output. These models were trained with intent to generate legal output. The fact that it can generate illegal output is a side effect. That's my point.
If you use AI to generate illegal output, that act is illegal. If you use AI to generate legal output that act is not illegal. Thus the point of output is where the legal question lies. From inception up to training there is clear legal precedent for the existence of AI models.
If one exact sentence is taken out of the book without quotation marks and an exact source reference, that triggers copyright laws. So the model doesn't have to reproduce the entire book; it is only required to reproduce one specific sentence (which may be a sentence characteristic of that author or that book).
> With a simple two-phase procedure, we show that it is possible to extract large amounts of in-copyright text from four production LLMs. While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984.
The supplementary files in that paper—verbatim reproductions of the full texts of Frankenstein and The Great Gatsby—are pretty instructive. The research group highlighted all additions and omissions, but on most pages the differences are difficult to spot because they are only missing spaces, extra hyphens, and other typographical minutiae.
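For anyone who wants to check the "only typographical minutiae" claim on their own snippets, here is a minimal sketch. It is not the paper's method: the function names and the choice to ignore whitespace, hyphens, and letter case are my own assumptions, and it uses only Python's standard `difflib`.

```python
import difflib
import re


def strip_minutiae(text: str) -> str:
    """Drop whitespace and hyphen/dash characters, and lowercase the rest.

    Assumption: these are the "typographical minutiae" we choose to ignore.
    """
    return re.sub(r"[\s\-\u2010-\u2015]+", "", text).lower()


def similarity(reference: str, extracted: str, ignore_minutiae: bool = True) -> float:
    """Return a 0..1 character-level similarity ratio between two texts."""
    if ignore_minutiae:
        reference, extracted = strip_minutiae(reference), strip_minutiae(extracted)
    return difflib.SequenceMatcher(None, reference, extracted, autojunk=False).ratio()


def substantive_edits(reference: str, extracted: str) -> list[tuple[str, str, str]]:
    """List the differences that remain after the minutiae are ignored."""
    a, b = strip_minutiae(reference), strip_minutiae(extracted)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical inputs: a reference sentence and a model output that
    # differs only by a missing space and an extra hyphen.
    ref = "It was the best of times, it was the worst of times."
    out = "It was the best of times,it was the worst-of times."
    print(similarity(ref, out, ignore_minutiae=False))
    print(similarity(ref, out))
    print(substantive_edits(ref, out))
```

`SequenceMatcher` gets slow on long inputs, so for a whole book you would probably want to compare page by page rather than in one call.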
It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent it changes what your intention may be.
> It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent it changes what your intention may be.
It sounds then like you're saying that scale does indeed matter in this context, as every single piece of writing in existence isn't being slurped up purely to learn, it's being slurped up to make a profit.
Do you think they'd be able to offer a useful LLM if the model was trained only on what an average person could read in a lifetime?
It's common knowledge among LLM experts that the current capabilities of LLMs are triggered as emergent properties of training transformers on reams and reams of data.
That is the intent of scale. To trigger LLMs to reach this point of "emergence". Whether or not it's AGI is a debate I'm not willing to entertain but everyone pretty much agrees that there's a point where the scale flips from a transformer being an autocomplete machine to something more than that.
That is the legal basis for why companies would go for scale with LLMs. It's the same reason why people are allowed to own knives even though knives are known to be useful for murder (as a side effect).
So technically speaking these companies have legal runway in terms of intent. Making an emergent and helpful AI assistant is not illegal, but also making a profit isn't illegal either.
Right, but in the weed analogy, the scale is used as a proxy to assume intent. When someone is caught with those 400 joints, the prosecution doesn't have to prove intent, because the law has that baked in already.
You could say the same in LLM training, that doing so at scale implies the intent to commit copyright infringement, whereas reading a single book does not. (I don't believe our current law would see it this way, but it wouldn't be inconsistent if it did, or if a new law were written to make it so.)
It’s clear nvidia and every single one of these big AI corps do not want their AIs to violate the law. The intent is clear as day here.
Scale is only used for emergence. OpenAI found that training transformers on the entire internet would make it more than just a next-token predictor, and that is the intent everyone is going for when building these things.
>Businesses routinely break the law if they believe the benefits in doing so will outweigh the consequences.
I'm saying there's collective incentive among businesses to restrict the LLM from producing illegal output. That is aligned and ultra clear. THAT was my point.
But if LLMs produce illegal output as a side effect and it can't be controlled, then your point comes into play here because now they have to weigh the cost + benefit as they don't have a choice in the matter. But that wasn't what I'm getting at. That's your new point, which you introduced here.
In short it is clear all corporations do not want LLMs to produce illegal content and are actively trying to restrict it.
Er no. I’ve read and remember hundreds of books in my lifetime. It’s not any more illegal based off scale. If the law doesn’t differentiate between my remembering one book or a hundred, then there’s no difference for thousands or millions.
You can only read the book if you purchased it. Even if you don't have the intent to reproduce it, you must purchase it. So, I guess NVDA should just purchase all those books, no?
That's true. But the book itself was legally purchased. So if nvidia went to the library and trained AI by borrowing books, that should be technically legal.
Do you have the same legal rights to something that you've borrowed as you do with something you've purchased, though?
Would it be legal for me to borrow a book from the library, then scan and OCR every page and create an EPUB file of the result? Even if I didn't distribute it, that sounds questionable to me. Whereas if I had purchased the book and done the same, I believe that might be ok (format shifting for personal use).
Back when VHS and video rental was a thing, my parents would routinely copy rented VHS tapes if we liked the movie (camcorder connected to VCR with composite video and audio cables, worked great if there wasn't Macrovision copy protection on the source). I don't think they were under any illusions that what they were doing was ok.
Well, if I copied it word for word maybe, but if I read it and "trained it" into my brain then it's clearly not illegal.
So the grey area here is: if I "trained" an LLM in a similar way and didn't copy it word for word, then is it legal? Because fundamentally speaking it's literally the same action taken.
You had to do this for reading too. The words were burned onto your retina as volatile memory before getting processed by your brain.
Your retina likely overwrote its "memory" as soon as you looked at something else, but that's no different than copying and deleting or the more apt analogy: streaming.
The law makes a distinction between storing it on a disk and just remembering the content. The latter is not a "copy" and not a subject of law:
> “Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. The term “copies” includes the material object, other than a phonorecord, in which the work is first fixed.
> A work is “fixed” in a tangible medium of expression when its embodiment in a copy or phonorecord, by or under the authority of the author, is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration. A work consisting of sounds, images, or both, that are being transmitted, is “fixed” for purposes of this title if a fixation of the work is being made simultaneously with its transmission.
Interesting. How long is the transitory duration? The interpretation of that likely has yet to be determined by a court case and can evolve similar to how “all men are created equal” doesn’t just refer to men.
Seems to me a possible interpretation is just deleting the data after training is finished.
But it’s not just about recall and reproduction. If they used Anna’s Archive the books were obtained and copied without a license, before they were fed in as training data.
Partially true. I can pay for a book then lend it out to people for free.
The government is in full support of this "lending" concept, in fact they have created entire facilities devoted to this very concept of lending out books.
If I’m rich enough to employ thousands of people I can hire each one of them to borrow as many books as possible then use all the books to train an AI. Perfectly legal. And also very possible.
Point being that the library prevents you from checking out 500gb because of logistical issues. First how can you carry all those books and how can they let other patrons in the library check out books if you grabbed that many? These rules aren’t enforced to prevent “scale” hence why my methodology got around the rules.
It's not settled law as it pertains to LLMs, but, yes, creating a "statistical summary" of a book (consider, e.g., a concordance of Joyce's "Ulysses") is generally protected as fair use. However, illegally accessing pirated books to create that concordance is still illegal.
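To make "statistical summary" concrete, a concordance is just an index from each word to the places it occurs. A minimal sketch, assuming a plain-text input and line numbers as the location unit (the function name and the tokenizing regex are mine, not from any particular tool):

```python
import re
from collections import defaultdict


def build_concordance(text: str) -> dict[str, list[int]]:
    """Map each lowercased word to the 1-based line numbers where it appears."""
    concordance: dict[str, list[int]] = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Assumption: a "word" is a run of ASCII letters and apostrophes.
        for word in re.findall(r"[a-z']+", line.lower()):
            concordance[word].append(line_no)
    return dict(concordance)


if __name__ == "__main__":
    # Hypothetical two-line sample, not taken from any real book.
    sample = "the cat sat\nthe dog sat on the cat"
    for word, lines in sorted(build_concordance(sample).items()):
        print(word, lines)
```

Note that building the index still means reading a full copy of the text, which is where the question of how that copy was obtained comes back in.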
Copyright laws are so undefined and NVIDIA's lawyers so plentiful that the statement works in their favor. You're allowed to copy part of a work in many cases; the easiest example is that you can quote a line from a book in a review. The line is fuzzy.
It seems so, stealing copyrighted content is only illegal if you do it to read it or allow others to read it. Stealing it to create slop is legal.
(The difference is that the first use allows ordinary people to get smarter, while the second use allows rich people to get (seemingly) richer, a much more important thing)
Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?