
> in the same way that hacking into a competitor's GitHub account

That's like comparing grand theft auto to someone stealing a pack of gum from a convenience store. It's not a useful analogy. The latter is still a problem, but we don't need to spread FUD about it.

And OP's right: this will keep happening until we come up with better ways of solving this problem.

Whether that's educating companies on the legal (and moral) risks their developers' IDE tools are exposing them to, building better licensing databases and indexing, working with the future OSS devs building these tools instead of treating them like criminals, suing the for-profit companies like Microsoft that seek to profit from this until they invest in the problem, etc.



OP can correct me if I'm wrong, but they don't seem particularly interested in solving anything. They literally said "I don't care that it reproduces copyrighted content." So the problem, as I see it, is the people who see the laundering of open source and proprietary code as a draw, rather than a drawback.


OP said they don't care that it can reproduce copyrighted content because they're not going to do so with it. That's roughly the opposite of what you're implying.


No, they're just going to let other people commit crimes while claiming they themselves never will...

Do you really believe Microsoft employees aren't going to be using this, illegally or unofficially?

"Yes! We (Microsoft) aren't doing anything illegal, but we are going to turn a blind eye to everyone using it illegally, as we directly benefit from it. And here's the kicker: our employees are legally liable, not us!" *evil laughter all the way to the bank*

Of course the legal execs aren't using it; this is classic Microsoft (Embrace, Extend, Extinguish).


If accidental reproduction of copyrighted material by AI systems is illegal under current law then we should change the law immediately so that it's not.

These AI systems are highly novel, transformative, and useful. Their development is exactly the sort of thing copyright law was originally created to encourage. If it's hindering them instead, that's a problem.

(And no, I'm not saying people should be allowed to use AI to intentionally launder stolen code; use some common sense here.)


Why is it so outlandish to expect the people who make money by selling AI systems to only train them using material for which they have a license?

As many commenters have pointed out, no one would have a problem had Microsoft trained Copilot on the Windows source code. The fact that they intentionally left it out of the training set is a huge red flag.


Because AI systems require large amounts of training data (the more, the better), and requiring manual review of those datasets for copyright compliance would consume significant resources and slow the pace of innovation across the entire AI industry.

Now let me flip that question around on you: what benefit would society gain from forcing AI developers to do all that extra work?


If you are going to use my work for free and without attribution and turn it around to compete with me, then it decreases my incentive to produce anything, and if I do it decreases my incentive to publish it. This goes directly against the intentions behind copyright law.


That's the best argument I've heard so far, but it still doesn't make sense to me. It's not like your individual project is going to make any significant difference to the capabilities of the resulting AI that's "competing with you" one way or the other. So really, all you'd be doing by not releasing your code is shooting yourself in the foot for no gain.

Granted, people are not necessarily rational actors, so maybe you could argue it still makes sense to have some protections in place to assuage people's irrational fears. Maybe something like a robots.txt for determining whether a page can be used in an AI dataset could serve that purpose. I'd be hesitant to support anything more burdensome than that.
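To make that idea concrete, an opt-out convention modeled on robots.txt might look something like the sketch below. No such standard exists today; the file name, directives, and crawler name are purely hypothetical:

```
# Hypothetical /ai.txt, served from a site or repository root,
# by analogy with /robots.txt (not a real standard)

User-agent: *            # applies to all dataset crawlers
Disallow-training: /     # opt the whole site out of training datasets

User-agent: example-dataset-bot   # hypothetical crawler name
Allow-training: /docs/            # permit only the documentation tree
```

As with robots.txt, compliance would be voluntary, which is arguably all that's needed for the "assuage irrational fears" purpose.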


The benefit is that our collective genius isn’t mined by mega corps and rented back to us. That we exist as more than mindless resources to be tapped for profit.

Again, if (for argument’s sake) we want to maximize the effectiveness of the AI, why are we okay with Microsoft intentionally omitting one of the most important codebases in human history — which it unambiguously has the right to use — from its training set?


> The benefit is that our collective genius isn’t mined by mega corps and rented back to us.

That sounds like a downside to me, not a benefit. You're basically arguing it would be better if Copilot, Stable Diffusion, GPT-3, etc. (all of which included copyrighted works in their training sets) didn't exist. I'm just not seeing that.


They are only using material for which they have a license (at least arguably). Open source software licenses usually require attribution if you reproduce the source code or use the source code in a program.

Some other uses are allowed without attribution. Someone can read and learn from open source software without needing to put an attribution anywhere. You could run an analysis of the code on GitHub to find out what percent of code is written in C++. You wouldn't need to attribute every project on GitHub.
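A rough sketch of what such an aggregate, no-attribution-needed analysis might look like (the helper name and extension table here are made up for illustration; real classifiers such as GitHub's Linguist use much richer heuristics):

```python
import os
from collections import Counter

# Hypothetical extension-to-language map; a deliberate simplification.
EXTENSIONS = {".cpp": "C++", ".cc": "C++", ".hpp": "C++",
              ".py": "Python", ".js": "JavaScript", ".go": "Go"}

def language_shares(root):
    """Return each language's share of total code lines under `root`.

    Only aggregate statistics are produced; no source code is reproduced,
    which is why no per-project attribution would be needed.
    """
    counts = Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            lang = EXTENSIONS.get(os.path.splitext(name)[1])
            if lang:
                with open(os.path.join(dirpath, name), errors="ignore") as f:
                    counts[lang] += sum(1 for _ in f)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()} if total else {}
```

Running this over a checkout of many repositories would answer "what percent of code is C++?" without copying anything from any project.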

Now the debate is whether this applies to training ML models.


Not sure if they edited their comment, but the end of it contradicts your interpretation:

> I won't be afraid of accidently violating copyright myself, because I won't be trying to bait it into reproducing heavily copy&pasted cherrypicked examples, and I won't use 20 lines of its output with zero modification.


No, that was there, but it doesn’t contradict my interpretation. Copyright doesn’t only cover reproducing code verbatim. It also includes derivative works.


Maybe in a court that interprets software development very strictly. But in practice, a developer automatically copying a single function from some 'freemium'-style licensed library [1] posted publicly on GitHub, and having it autocompleted into a different codebase with many thousands of lines of custom code, isn't the same as going into a proprietary codebase and stealing its code to compete with or clone another company's product.

We could come up with scenarios where some fancy algorithm posted on a public Git repo is super efficient or unique, and somehow fits into the size of the individual functions these tools auto-insert into another person's codebase. But IRL that sort of thing is rarely what these IDE tools do, at least in a way that meaningfully contributes to another project.

That is still a concern, yes, but it's a niche use case, and it doesn't justify killing off otherwise extremely useful tools.

Maybe I'm being too techno-libertarian here, but I believe existing courts + public feedback cycles + iterating on how the public code is consumed by these tools + spreading awareness of the issue is enough to address the licensing problems.

The more accurately we explain the problem, the quicker we'll find good solutions.

[1] Usually a license saying commercial projects must either pay or not use the code at all, or one with an attribution clause.


"Maybe I'm being too techno-libertarian here, but I believe existing courts + public feedback cycles + iterating on how the public code is consumed + spreading awareness of the issue is enough to address the licensing problems."

I think you are, though. You'd have to automate the justice as well; traditional courts can't keep pace. You'll just end up with more automated DMCA-style takedowns, not fewer.


I think you misunderstood my comment (or how these tools work IRL), because I'm not saying it's even worthy of a court case in the vast majority of cases. So why would you need to automate such a thing?

And I don't even see how an automated DMCA system could exist, because I doubt anyone would win monetary damages in court over a 'stolen' function or two (or even detect it in most commercial applications in the first place).

Regardless, a single class action should be enough to make Microsoft either shut down the project or adapt (via whistleblowers, leaked code, public repos, etc.). And even if they don't adapt by investing in the possible solutions here, an OSS project could eventually take its place, and then the courts wouldn't even be a useful remedy.

Ideally a capital-backed company will help solve this, given the obvious legal incentives that already exist. But even if one doesn't, this problem isn't going away.


> That's like comparing grand theft auto to someone stealing a pack of gum from a convenience store.

They're both more like grand theft auto, but one involves the valet driver leaving with your car, and the other involves smashing a window.



