Was GPT-4 trained on data that was acquired illegally? Or was it trained on data acquired legally that OpenAI didn't have the rights to redistribute? There is a difference. In the latter case, whether it counts as "stealing" would come down to whether or not GPT-4 counts as a derivative work, or some similar legal concept.
Scribd has lots of PDFs of copyrighted books. The Washington Post article mentions several other places from which it downloaded and scraped PDFs of copyrighted textbooks, etc.
That's interesting to know, but that doesn't by itself imply that it's illegal. For example, Google Books, which has massive amounts of scanned PDFs of copyrighted works, is considered fair use under US copyright law.
It's fair use if the work is "transformative". GPT-4 isn't publishing the content of the books, it's publishing a model derived from the entire corpus. I'm not a lawyer, but I think there's an argument that it is transformative.
It's correct that OpenAI isn't publishing any of the "stolen" content directly. But they "stole" it to make their service possible in the first place. Not distributing it themselves doesn't make much of a difference then.
Just because someone can convert text to numbers doesn’t mean they have a right to the numbers. That’s like trying to own the emotional effect a book has on someone, or the things they see in their mind when they read it.
What I find rather amusing is that they spend the whole paper dismissing it as ineffective yet still feel the need to worry about the 'ethics' and 'legality'. They don't cite any discussion of, or evidence for, either, of course, and looking at the authorship list I don't believe any of them are lawyers or ethics experts.
No one should have "rights" to any data, information, bits, or whatever. It's not physical, and any attempt to apply artificial scarcity to replicate the physical world is a crime against humanity. The lines around which data is protected and which is copyable are arbitrary bullshit. You aren't stealing a fire when you light one candle with another. It's my storage device, and I'm not breaking the law all of a sudden because the gates are holding a different set of charges.
By that logic, you also need to accept that no one should ever need to pay you for creating artifacts that are not bound to the physical world. I assume you work for free for your employer, or in a space that is not “dealing” with data, information, bits, or whatever.
They are transforming physical objects in very much the same way a carpenter does. The service industry is not the same as tech / digital. A hairdresser does not create bits or data.
I would also argue that in this particular case you are wrong. You hand over the hair on the ground to them, which they then dispose of or maybe resell (maybe without explicit consent, but at least with implicit consent). If that weren’t the case, they would be committing theft when they dispose of your hair…
I would like the big players to argue that they have some right to the numbers, as that would have important implications for BitTorrent, and for cryptography too for that matter.