> The research findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works,” said Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons.
The defense to training on copyrighted material is that it is the same as how humans learn from copyrighted material. The storage-or-reproduction point is a red herring: humans can also reproduce copyrighted works from memory. Showing that machines can reproduce copyrighted material is no different from saying that a human can reproduce copyrighted material that the human learned from.
The defense to actually reproducing a work is that in order to do so, the user has to "break" the system. It is the same as how you can make legal software do illegal things (e.g. using a screen recorder to "steal" a movie).
None of this is to say that these defenses are correct or moral; rather, this article doesn't add any additional input into whether they are or aren't.
> Humans can also reproduce copyrighted works from memory. Showing that machines can reproduce copyrighted material is no different from saying that a human can reproduce copyrighted material that the human learned from.
Ultimately this is a matter for the courts and the law, but I'd just like to point out that a human memorizing a work, reproducing it, and distributing it is just as much a copyright violation as doing a more mechanical form of reproduction.
There's a reason that fan fiction routinely falls afoul of copyright. There's quite a lot of case law in this area, and hand-waving "humans can do it too" doesn't really make for a strong argument. Humans get in trouble for it ALL THE TIME. The consequences can be fines, injunctions, or even criminal liability.
I'm not sure why you think AI gets off the hook here. Just because you like the outcome at the moment?
This isn't the defense you think it is. Performing a copyrighted work from memory - e.g. a piece of music, a poem, a story, etc - is still a copyright violation. There's no special protection for works that a human has memorized.
Humans are not judged on the basis of what they _can_ do.
Reasoning about how to constrain tools on the basis of what they _could_ do, if e.g. used outside their established guardrails, needs to be very nuanced.
Correct; the ability of a model to reproduce source material verbatim does not necessarily make the model's existence illegal. However, using a model to do just that might very well present a legal liability for the user. I would be interested to see the extent to which models can "recite from memory" source code, e.g., from the various MS code leaks. Put another way, if I'm using LLM code generation extensively, do I need to run a filter on its output to ensure that I don't "accidentally" copy large chunks of the Windows codebase?
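For what it's worth, such a filter is not hard to sketch. One generic approach (purely illustrative; all names here are made up, and real systems would also normalize whitespace and identifiers) is to index n-grams of the protected corpus and flag output that shares long verbatim runs with it:

```python
# Hypothetical sketch of a verbatim-reproduction filter for LLM output.

def ngrams(tokens, n):
    """Yield every contiguous n-token window."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def build_index(corpus_texts, n=5):
    """Index every n-gram that appears in the protected corpus."""
    index = set()
    for text in corpus_texts:
        index.update(ngrams(text.split(), n))
    return index

def verbatim_overlap(output, index, n=5):
    """Fraction of the output's n-grams found verbatim in the corpus."""
    grams = list(ngrams(output.split(), n))
    if not grams:
        return 0.0
    return sum(g in index for g in grams) / len(grams)
```

Anything above a small overlap threshold would be held for review; the threshold and window size are the tuning knobs.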
>There's no special protection for works that a human has memorized.
Who's liable for the copyright infringement if you can coax it out of a system? If you can bypass paywalls by using Google's cache feature (or, since they got rid of it, by using carefully crafted queries to extract the entire text via snippets), is Google on the hook, or the person doing it?
Both. If I sell obviously pirated CDs on the street corner, it's not only illegal for me to copy them and sell them, it's also illegal for my customers to buy them.
Is it? There's plenty of people prosecuted for running illegal streaming sites and torrenting (which involves uploading), but I don't know of any efforts to crack down on non-distributors.
1. How does this interact with the ruling that both google books (ie. large scale scanning of books without author's consent) and google snippets (the same, but for websites) have been ruled legal by the courts?
2. Google might not be the most sympathetic defendant, but what about libraries? They offer books to be borrowed, and some offer photocopiers. If you put the two together, you get a copyright infringement operation, all enabled by the library. Should libraries be on the hook too?
For #2, yes, you would be engaging in copyright infringement. The library, being on the hook, would probably ask you to stop if they noticed you copying full books. If not the first time, certainly on the second.
>If you can bypass paywalls by using google's cache feature
That is quite different. Google serves (or used to serve) its users whatever the website presents to its crawler; it does not try to avoid paywalls or interact with the website in any capacity other than requesting information.
The whole "humans also do this" line isn't a winning defence here. Humans and copyright have a long history, and there is so much law that it is easy to get confused.
The default assumption here seems to be that the system needs to be broken. This is similar to the Google defence: if a user's intent is to search for cracked software, what can poor Google do about it? The answer is to make it even more difficult.
This is a defence also used by torrent sites serving magnet URLs. "We don't host files" is the default defence. But if these sites get hit with a DMCA notice, they are required to remove the magnet URL.
So the article shows what the lawyer is saying. Despite claims that it is difficult to extract full books, it really isn't; it is trivial. When it goes to court, and it will, AI models will be required to make it even more difficult and to allow for DMCA-like takedowns.
> Humans can also reproduce copyrighted works from memory
That's simply not true. No humans can memorize entire novels, as this research proved these models do. And definitely not all of these novels, and code bases, and who knows what else all at the same time.
>No humans can memorize entire novels, as this research proved these models do.
Humans can however, remember entire songs, and songs are definitely long enough to be considered copyright protected. There is still a difference in scale, but that's not really relevant when it comes to copyright law. You can't be like "well humans are committing copyright infringement but since it's limited to a few hundred words we'll give it a pass".
It's not the case that being able to remember a song makes it copyright infringement when you sing it.
For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.
For the last 0.001%, we would call it a cover, and typically the individual doing a cover takes some liberties of their own, still making it not a replacement in any way. Artists are typically cool with covers.
>For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.
You realize that lyrics are often written by someone other than the actual singer, and whoever wrote the lyrics is entitled to compensation too? The "amateur singing isn't a replacement for the studio album" excuse doesn't work in this context. Also courts have ruled that lyrics themselves are protected by copyright.
Clearly the team, if it is a team, that is entitled to the copyright is entitled to the copyright of the song, that's a silly statement to make. Copyright belongs to some entity, obviously.
You were specifically calling out individuals singing a song, not publishing lyrics online. These are not the same thing. Again your distribution/consumption model matters here.
On artists being "cool" with it - if the copyright holder doesn't pursue you then does it matter? The only valid argument I would see here is if the copyright holder doesn't know about the infringement and therefore cannot seek remedies, but we can fish for illegal scenarios all day if we would like: that's not useful though.
>Clearly the team, if it is a team, that is entitled to the copyright is entitled to the copyright of the song, that's a silly statement to make. Copyright belongs to some entity, obviously.
>You were specifically calling out individuals singing a song, not publishing lyrics online. These are not the same thing. Again your distribution/consumption model matters here.
I'm not sure why you're so confidently dismissive here. I wasn't trying to claim that nobody owned the lyrics. I brought that point up because even in the case of an amateur singing a song, even if you accept the "for 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form" excuse, you're still infringing on the copyright of the lyrics, because it's a derivative work. Moreover it's unclear whether that excuse even works. If you make a low cost version of star wars, copying the screenplay exactly, that still seems like copyright infringement, even if "it's not a replacement for the original in any way shape or form".
>On artists being "cool" with it - if the copyright holder doesn't pursue you then does it matter?
Virtually nobody got sued for torrenting with a VPN on. Does that mean it's fair to round that off as being legal, because "if the copyright holder doesn't pursue you then does it matter"?
> Moreover it's unclear whether that excuse even works. If you make a low cost version of star wars, copying the screenplay exactly, that still seems like copyright infringement, even if "it's not a replacement for the original in any way shape or form".
Are you being intentionally obtuse? Intent matters here.
> Virtually nobody got sued for torrenting with a VPN on.
Let's not use obviously illegal actions which are done covertly to act as an example that is in any way similar to singing a song in the "open."
But the crime in the human instance is the reproduction, not the storage. So the crime in the AI circumstance would not be in the training, but in prompting the output.
And of course AIs are excellent at taking direction, so:
If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy" and get "Manacled," that's copyright infringement, and on me, not on the LLM/training.
If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy, and change enough to avoid infringing copyright," and get "Alchemised," then that should be fine. I doubt the legal world agrees with me though.
> But the crime in the human instance is the reproduction, not the storage. So the crime in the AI circumstance would not be in the training, but in prompting the output.
I wouldn't be so sure, at least under US law. 17 USC 101 defines a "copy" as:
[...] material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.
If I memorize a work what ends up in my brain is not a copy according to that definition because with current technology there is no machine or device which can be used to perceive, reproduce, or otherwise communicate it. The work can only be perceived, reproduced, or otherwise communicated by using my brain which is not a machine or device.
No copy in my brain means that memorizing the work cannot infringe the copyright owner's exclusive right to reproduce the work in copies.
An LLM, unlike my brain, is a machine or device which can be used to perceive, reproduce, or otherwise communicate the work and so the work stored in the LLM is a copy.
Training an LLM then, unlike a brain memorizing a work, makes a copy and so would be covered by the copyright owner's exclusive right to make copies.
That's going to need to be justified, probably by arguing fair use.
I'd argue your brain is that "machine or device" -- the fact that the storage and the playback mechanism are one and the same is irrelevant. The fact that you have to be willing/induced to replay the content back just makes you a worse machine :-)
Interesting argument but not likely to go far. As far as I can tell US copyright law has never been taken to include brains as machines or devices.
This is actually relevant in some real cases, namely improvised works. Attempts to claim copyright on improvised works that were not recorded have generally failed. If brains counted as machines or devices, then the work inside the performer's head would be a recording and the work would have copyright.
That is one of the reasons it is usually recommended that musicians should record their live performances. That gets them copyright on anything they improvise during the show. Also it gets them copyright on that particular performance of their music, which helps them go after anyone who makes an unauthorized recording of the show. (Copyright is only automatic upon recording when the recording is by or under the authority of the creator).
I am so confused at how this is supposed to work. If the code, running in whatever language, does any sort of transform with the key that it thinks it has, doesn't this break? E.g. OAuth 1 signatures, JWTs, HMACs...
Now that I think further, doesn't this also potentially break HTTP semantics? E.g. if the key is part of the payload, then a data.replace(fake_key, real_key) can change the body length without actually updating the Content-Length header, right?
Lastly, this still doesn't protect you from other sorts of malicious attacks (e.g. 'DROP TABLE Users;'), right? This seems like a mitigation, but hardly enough to feel comfortable giving an LLM direct access to prod, no?
My understanding is that it only surfaces the real keys when the request is actually sent under the hood, and doesn't make them available to the code itself, so the LLM isn't able to query the key values. There are placeholder values for what seems to be obfuscation purposes, so that the LLM receives a fake value if it tries, which helps with things like prompt injection since that value is useless.
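If that is right, it also answers the sibling concern about Content-Length: the substitution layer has to recompute the framing after the swap. A minimal sketch of that step (hypothetical names, not the project's actual code):

```python
# Hypothetical sketch: LLM-visible code only ever handles PLACEHOLDER;
# the real secret is swapped in just before the request goes out, and
# Content-Length is recomputed afterwards so the framing stays correct.

PLACEHOLDER = b"SECRET_KEY_PLACEHOLDER"

def finalize_request(headers: dict, body: bytes, real_key: bytes):
    body = body.replace(PLACEHOLDER, real_key)
    headers = dict(headers)
    headers["Content-Length"] = str(len(body))  # recompute after the swap
    return headers, body
```

Note this still can't help with anything that signs the body client-side (HMACs, JWTs, OAuth 1 signatures): the signature would have been computed over the placeholder, not the real key, exactly as the parent comment worries.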
In all of the examples you gave, the challengers had some revolutionary idea or improvement on top. TikTok had its recommendation algorithm and short videos. Google had PageRank. That's also the reason why WhatsApp hasn't been supplanted: there's no room for innovation (or nobody has bothered trying). The same is true for digital distribution. Every Steam competitor is basically "Steam but [publisher]", or in Epic's case, "Steam but with Steam games".
That's what the person who started this comment chain said, though. Every Steam competitor has been "does the same thing as Steam, but worse" so why would anyone switch over?
There is some argument to be made that the cost benefit analysis for your average user doesn't make sense unless the platform is a significant improvement over steam. Having two fragmented systems is a huge inconvenience to users now almost to the point that I will outright refuse to play games that are not on Steam.
And for companies like EA that shoehorn really bad launchers as an extra layer on top of Steam: you are doing the work of the devil himself.
Some extremely popular games, like all the Hoyoverse stuff (Genshin/ZZZ/etc) or most of Blizzard's games, have their own launchers and aren't on Steam. So gamers are certainly willing to use non-Steam platforms and launchers if there's a reason.
That didn't stop Overwatch 2 from eventually making its way over to Steam. They also have the best integration: once your Steam account is linked to your Battle.net account, you don't even have to think about the launcher.
That's not the same as "terrible" though? Signal is basically "whatsapp but not facebook", but you wouldn't say it's "terrible". Same with lyft (which came after uber), or ubereats (which came after many food delivery startups).
Right but if there were a better platform than Steam for buying games it'd win out in the marketplace. It's not like anyone is locked into Steam really.
Every online gaming platform other than Steam and GOG sucks. And in fact GOG competes very well with Steam precisely because it offers something Steam doesn't, which is DRM-free games. Steam didn't just beat the Epic Games Store and Origin and Games For Windows Live because it came first, it's just a better platform and the others offer nothing outside of exclusives which they paid for.
Let's not forget Ubisoft's uPlay, which was absolutely shambolic. Blizzard's/Activision's launcher was alright though. It did the job, but it was nowhere near the likes of Steam, which is really feature-rich.
> Blizzard's / Activision launcher was alright though.
I'd personally say it was better as a launcher. Launching Steam itself takes relatively long, and when it's just in the background it idles with ~400 MB of RAM (specifically its WebHelper). That isn't a problem with Battle.net, since it idles at ~170 MB, or you can just close it, since it launches much faster.
This only works when strangers are your target customers, because there is no way a stranger would understand the pain you are relieving for someone when they don't feel that pain themselves. Therefore, it is better read as "validate your ideas on your target customer", which is kind of obvious.
The main thing that confuses me is that this seems to be PHP implemented in React. It talks about how to render the first page without a waterfall, and all of that makes sense, but the main issue with PHP was that reactivity was much harder. I didn't see / don't understand how this deals with that.
When you have a post with a like button and the user presses the like button, how do the like button props update? I assume that it would be a REST request to update the like model. You could make the like button refetch the like view model when the button is clicked, but then how do you tie that back to all the other UI elements that need to update as a result? E.g. what if the UI designer wants to put a highlight around posts which have been liked?
On the server, you've already lost the state of the client after that first render, so doing some sort of reverse dependency trail seems fragile. So the only option would be to have the client do it, but then you're back to the waterfall (unless you somehow know the entire state of the client on the server, so the server can fully re-render the sub-tree; and what if multiple separate subtrees are involved?). I suppose it is doable if there exists NO client-side state, but it still seems difficult. Am I missing something?
>When you have a post with a like button and the user presses the like button, how do the like button props update?
Right, so there's actually a few ways to do this, and the "best" one kind of depends on the tradeoffs of your UI.
Since Like itself is a Client Component, it can just hit the POST endpoint and update its state locally. I.e. without "refreshing" any of the server stuff. It "knows" it's been liked. This is the traditional Client-only approach.
Another option is to refetch UI from the server. In the simplest case, refetching the entire screen. Then yes, new props would be sent down (as JSON) and this would update both the Like button (if it uses them as its source of truth) and other UI elements (like the highlights you mentioned). It'll just send the entire thing down (but it will be gracefully merged into the UI instead of replacing it). Of course, if your server always returns an unpredictable output (e.g. a Feed that's always different), then you don't want to do that. You could get more surgical with refreshing parts of the tree (e.g. a subroute) but going the first way (Client-only) in this case would be easier.
In other words, the key thing that's different is that the client-side things are highly dynamic so they have agency in whether to do a client change surgically or to do a coarse roundtrip.
Since nobody here has actually read the article: it states that the reason the posts were taken down was a policy that "prohibits incitement to terrorism, praise for acts of terrorism, and identification or support of terror organizations." This type of speech (incitement) is illegal in the United States, and support is very borderline depending on the type and meaning of "support". Now, if the stated reason doesn't match the actual content removed, that should definitely be addressed, which is your point, but I think that the reason is valid.
On the one hand there are comments from users that want to “turn Gaza into a parking lot” or worse and were not removed because they don’t violate the community guidelines.
On the other hand there are people posting educational explainers about Palestinian human rights censored under hate speech or dangerous individuals rules.
So, if I understand correctly, the consistency model is essentially git. I.e. you have a local copy, make changes to it, and then when it's time to "push" you can get a conflict, where you can "rebase" or "merge".
The problem here is that there is no way to cleanly detect a conflict. The documentation talks about pages which have changed, but a page changing isn't a good indicator of conflict. A conflict can also happen due to a read conflict. E.g.:
Update Customer Id: "UPDATE Customers SET id='bar' WHERE id='foo'; UPDATE Orders SET customerId='bar' WHERE customerId='foo'"
Add Customer Purchase: "SELECT id FROM Customers WHERE email='blah'; INSERT INTO Orders(customerId, ...) VALUES('foo', ...);"
If the update task gets committed first and the pages for the Orders table are full (i.e. inserting causes a new page to be allocated), these two operations don't have any page conflicts, but the result is incorrect.
In order to fix this, you would need to track the pages read during the transaction in which the write occurred, but that could easily end up being the whole table if the updated column isn't part of an index (and thus requires a table scan).
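To make that concrete, here's a sketch of what read-set tracking would look like (hypothetical structures, not Graft's API). The point is that conflict detection has to consider pages the transaction read, not just pages it wrote:

```python
# Hypothetical read-set conflict check: a page-write-only comparison
# misses read-write conflicts like the Customers/Orders example above.

from dataclasses import dataclass, field

@dataclass
class Txn:
    read_pages: set = field(default_factory=set)
    written_pages: set = field(default_factory=set)

def conflicts(local: Txn, remote_writes: set) -> bool:
    # Conflict if a remote commit touched any page we read OR wrote;
    # intersecting only the write sets would miss the read-write case.
    return bool((local.read_pages | local.written_pages) & remote_writes)
```

The cost is exactly the one described above: an unindexed predicate turns the read set into every page of the table.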
If strict serializability is not possible, because your changes are based on a snapshot that is already invalid, you can either replay (your local transactions are not durable, but system-wide you regain serializability) or merge (degrading to snapshot isolation).
As long as local unsynchronized transactions retain the page read set, and look for conflicts there, this should be sound.
What I find hard to imagine is how the app should respond when synchronisation fails after locally committing a bunch of transactions.
Dropping them all is technically consistent, but it may be unsafe depending on the circumstances. E.g. a doctor records an urgent referral, but then the tx fails because admin staff has concurrently updated the patient's phone number or whatever. Automatically replaying is unsafe because consistency cannot be guaranteed.
Manual merging may be the only safe option in many cases. But how can the app reconstitute the context of those failed transactions so that users can review and revise? At the very least it would need access to a transaction ID that can be linked back to a user level entity, task or workflow. I don't think SQLite surfaces transaction IDs. So this would have to be provided by the Graft API I guess.
> What I find hard to imagine is how the app should respond when synchronisation fails after locally committing a bunch of transactions... Manual merging may be the only safe option in many cases.
Yeah, exactly right. This is why CRDTs are popular: they give you well-defined semantics for automatic conflict resolution, and save you from having to implement all that stuff from scratch yourself.
The author writes that CRDTs "don't generalize to arbitrary data." This is true, and sometimes it may be easier to write your own custom app-specific conflict resolution logic than to massage your data to fit within preexisting CRDTs, but doing that is extremely tricky to get right.
It seems like the implied tradeoff being made by Graft is "you can just keep using the same data formats you're already using, and everything just works!" But the real tradeoff is that you're going to have to write a lot of tricky, error-prone conflict resolution logic. There's no such thing as a free lunch, unfortunately.
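For concreteness, the textbook example of those "well-defined semantics" is a grow-only counter (G-Counter): each replica increments only its own slot, and merge takes element-wise maxima, which makes merging commutative, associative, and idempotent. A minimal sketch:

```python
# G-Counter sketch: state is a dict of per-replica counts.

def increment(state, replica):
    s = dict(state)
    s[replica] = s.get(replica, 0) + 1
    return s

def merge(a, b):
    # Element-wise max: commutative, associative, idempotent.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())
```

The "doesn't generalize" complaint is exactly that most app data isn't a counter or a set, so you end up bending your schema to fit structures like this.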
The problem I have with CRDTs is that while being conflict-free in a technical sense they don't allow me to express application level constraints.
E.g, how do you make sure that a hotel room cannot be booked by more than one person at a time or at least flag this situation as a constraint violation that needs manual intervention?
It's really hard to get anywhere close to the universal usefulness and simplicity of centralised transactions.
Yeah, this is a limitation, but generally if you have hard constraints like that to maintain, then yeah you probably should be using some sort of centralized transactional system to avoid e.g. booking the same hotel room to multiple people in the first place. Even with perfect conflict resolution, you don't want to tell someone their booking is confirmed and then later have to say "oh, sorry, never mind, somebody else booked that room and we just didn't check to verify that at the time."
But this isn't a problem specific to CRDTs, it's a limitation with any database that favors availability over consistency. And there are use cases that don't require these kinds of constraints where these limitations are more manageable.
"How do you make sure that a hotel room cannot be booked by more than one person at a time" Excellent question! You don't. Instead, assuming a globally consistent transaction ordering, eg Spanner's TrueTime, but any uuid scheme suffices, it becomes a tradeoff between reconciliation latency and perceived unreliability. A room may be booked by several persons at a time, but eventually only one of them will win the reconciliation process.
A: T.uuid3712[X] = reserve X
...
B: T.uuid6214[X] = reserve X // eventually loses to A because of uuid ordering
...
A<-T.uuid6214[X]: discard T.uuid6214[X]
...
B<-T.uuid3712[X]: discard T.uuid6214[X], B.notify(cancel T.uuid6214[X])
-----
A wins, B discards
The engineering challenge becomes to reduce the reconciliation latency window to something tolerable to users. If the reconciliation latency is small enough, then a blocking API can completely hide the unreliability from users.
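The reconciliation rule in the log above is easy to make deterministic: every replica sorts the competing reservations by transaction id and keeps the lowest per room, so all replicas converge on the same winner regardless of arrival order. A toy sketch (names invented):

```python
# Deterministic reconciliation: lowest transaction id wins per room,
# so every replica computes the same winners from the same event set.

def reconcile(reservations):
    """reservations: iterable of (txn_id, client, room) tuples."""
    winners = {}
    for txn_id, client, room in sorted(reservations):
        winners.setdefault(room, (txn_id, client))  # first = lowest id
    return winners
```

Everything that loses gets the cancel notification described above; the user-facing question is only how long that takes.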
From the description, you can reapply transactions. How the system handles it (how much of it is up to the application, how much is handled in graft) I have no idea.
What does that mean though? How can you possibly reapply a failed transaction later? The database itself can't possibly know how to reconcile that (if it did, it wouldn't have been a failure in the first place). So it has to be done by the application, and that isn't always possible. There is still always the possibility of unavoidable data loss.
"Consistency" is really easy, as it turns out, if you allow yourself to simply drop any inconvenient transactions at some arbitrary point in the future.
This! Solving merge conflicts in git is quite hard. Building an app such that it has a UI and use cases for merging every operation is just unrealistic. Perhaps if you limit yourself to certain domains, like CRDTs, turn-based games, or data silos modified by only one customer, it can be useful. I doubt it can work in the general case.
The only situation I can think of where it's always safe is if the order that you apply changes to the state never matters:
- Each action increments or decrements a counter
- You have a log of timestamps of actions stored as a set
- etc.
If you can't model your changes to the data store as an unordered set of actions and have that materialize into state, you will have data loss.
Consider a scenario with three clients which each dispatch an action. If action 1 sets value X to true, action 2 sets it to true, and action 3 sets it to false, you have no way to know whether X should be true or false. Even with timestamps, unless you have a centralized writer you can't possibly know whether some/none/all of the timestamps that the clients used are accurate.
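The counter case is easy to check mechanically: increments and decrements commute, so every ordering of the same action set materializes into the same state, which is exactly what the "set X to true/false" example lacks. A quick sketch:

```python
# Order-independence check: additive actions commute, so any permutation
# of the same action set converges on the same state.

from itertools import permutations

def apply_all(state, actions):
    for op, amount in actions:
        state = state + amount if op == "inc" else state - amount
    return state

actions = [("inc", 1), ("dec", 2), ("inc", 5)]
results = {apply_all(0, p) for p in permutations(actions)}
# all six orderings land on the same value
```

Run the same experiment with "set to true" / "set to false" actions and you get a different final state per ordering, which is the ambiguity described above.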
I have tried SO hard to get rr to work for me, including buying a separate PC just to use it... but it just consistently fails, so I've basically abandoned it. Something like this would absolutely be a godsend. Just getting something consistently working with Ubuntu is amazing. Does this approach make working in something like WSL viable?
I would love it if this were upstreamed. Is there a GitHub issue where you discuss the possibility of this with the rr devs? That might be something to add to your readme for everyone else who wants to follow along. Thanks!
Thanks for the encouraging words! Please do try it out and report back if it worked well or not for you on the issue tracker.
With sufficient usage I think we can make a good case to get merged upstream. This patch introduces dynamic/static instrumentation for ticks counting, which is quite different from how things have been done on rr until now. If there are many success stories, a stronger case for an upstream merge can be made. The rr maintainers are aware of this project, but it is early days for an upstream merge PR attempt.
With a big changeset, it's better to have a brief discussion about how it works / what it needs before you actually make a PR; just big principles, high-level stuff. This way, if you build a train station, the devs won't say "ooh, we really need an airport." That's why an issue to track it is good: it raises visibility for anyone who has a problem with the approach, long before it's time to merge. Also, if they say "we'll never take this" or "we'll take this if you build a space station," it's good to know that before investing a ton of time into something PR-able.
> suggesting that side stepping lock in altogether by simplifying down to traditional techniques is not “serious” makes me bristle a little
This is a strawman; you're misinterpreting the word "serious". They are using it to mean scalable, not to mean unimportant or lacking ability. At some point in the scaling process, it becomes more effective to scale to another machine than to stay on a single one, at which point you need a lot of other primitives like the article mentions, e.g. a shared cache with proper invalidation mechanisms. If you don't need scale, then you're right, you don't have to worry about this. I will also note that it is slightly odd to use a framework like Next.js if you aren't (or don't plan on) running at scale, because most of its features (e.g. SSR) are entirely performance-oriented. Essentially, the whole point of the article is that despite being "open source", you cannot run Next.js at scale yourself without a massive investment of your own.
> Essentially, the whole point of the article is that despite being "open source" you cannot run next.js at scale yourself without a massive investment of your own.
I don't know about that. Asking a service provider to provide an implementation for a cache interface isn't a "massive investment". It's an investment, sure, but it's the type of investment that should be customizable per provider, depending on their needs, the technologies they want to bet on, etc. It seems to me the problem is that Netlify isn't comfortable putting in the investment to have a NextJS-specific cache service. That's understandable, considering they don't control the framework and to them it's just another option, so they don't want to invest in it too much.
(disclaimer: Netlify employee)
The big challenge with the cache interface atm is not using Redis (personally, I love Redis). It's that this interface is far from being a straightforward GET/SET/DELETE. Rather, you need to learn the default implementation and all its nuances (for different payload types), and duplicate/transform all the logic.
The division of labor between what the framework does and what platform developers (or any other developer working on a high-scale/high-availability deployment) need to do has to be fixed. If this happens, plus better docs, you should be able to "just use" Redis.