Hacker News

The breathtaking audacity of calling distilling GPT4 'stealing' when GPT4 trained on data it has no proprietary right to.


"We ignore what created us; we adore what we create." — Aleister Crowley, The Book of Lies


"You are trying to kidnap what I've rightfully stolen, and I think it quite ungentlemanly."


They put "stealing" in scare quotes, so it's probably not worth getting fired up about.


Was GPT-4 trained on data that was acquired illegally? Or was it trained on data acquired legally that OpenAI didn't have the rights to redistribute? There is a difference. In the latter case, whether it counts as "stealing" would come down to whether or not GPT-4 counts as a derivative work, or some similar legal concept.


https://www.washingtonpost.com/technology/interactive/2023/a...

Scribd has lots of PDFs of copyrighted books. The Washington Post article mentions several other places from which PDFs of copyrighted textbooks, etc., were downloaded and scraped.


That's interesting to know, but that doesn't by itself imply that it's illegal. For example, Google Books, which has massive amounts of scanned PDFs of copyrighted works, is considered fair use under US copyright law.


There's no good-faith world where OpenAI trained only on legally available works.

The only valid argument is whether their model or its output is itself legally protected.


As long as you don't try to scrape all the book's content…

It's only fair use for search purposes.


It's fair use if the work is "transformative". GPT-4 isn't publishing the content of the books, it's publishing a model derived from the entire corpus. I'm not a lawyer, but I think there's an argument that it is transformative.


Imho as transformative as encoding a DVD as DivX…

It's correct that OpenAI isn't publishing any of the "stolen" content directly. But they "stole" it to make their service possible in the first place. Not distributing it themselves doesn't make much difference, then.


Just because someone can convert text to numbers doesn’t mean they have a right to the numbers. That’s like trying to own the emotion a book has on someone, or the things they see in mind when they read it.


What I find rather amusing is they spend the whole paper dismissing it as ineffective yet still feel the need to worry about the 'ethics' and 'legality'. They don't cite anything with regards to a discussion/evidence of either, of course, and looking at the authorship list I don't believe any of them are lawyers or ethics experts.


No one should have "rights" to any data, information, bits, or whatever. It's not physical and any attempt to apply artificial scarcity to replicate the physical world is a crime against humanity. The lines around which data is protected and which is copyable are arbitrary bullshit. You aren't stealing a fire when you light one candle with another. It's my storage device and I'm not breaking the law all of a sudden because the gates are holding a different set of charges.


By that logic, you also need to accept that no one should ever need to pay you for creating artifacts that are not bound to the physical world solely. I assume you work for free for your employer or in a space that is not “dealing” with data, information, bits, whatsoever.


Hairdressers charge for a service and none of them will assume that they "own" your hair.


They are transforming physical objects very much the same way a carpenter does. The service industry is not equal to Tech / digital. A hairdresser does not create bits or data. I would also argue in this particular case you are wrong. You hand over the hair on the ground to them, which they then dispose of or maybe resell (maybe without explicit consent but at least implicit). If that wasn’t the case they would commit theft when they dispose of your hair…


I am honestly shocked that this hasn't happened, what with how the world has been going in recent decades.


Well, they probably can since you give them consent to keep your hair when you leave the shop… Disposing of it would otherwise be considered theft, no?


Like a torrent of the last GoT season then?

… with compression.


Imagine the GoT producers used GRRM's books without licensing and then claim copyright on the series.

Does OpenAI have the rights on all the texts they used to train their GPTs?


i think we agree


I would like the big players to argue that they have some right to the numbers as it has important applications to BitTorrent and cryptography too for that matter.


yeah this is insane thinking haha


Stolen twice is still stolen.



