July 11th, 2024

GitHub Copilot is not infringing your copyright

GitHub Copilot, an AI tool, faces controversy for using copyleft-licensed code for training. The debate covers copyright infringement, AI-generated works, and the implications for the tech industry and open-source principles.

GitHub Copilot, an AI tool generating code suggestions for programmers, has sparked controversy in the Free Software community. Some argue that its use of copyleft-licensed code for training may infringe copyright, as Copilot itself is not under a copyleft license. However, scraping code for AI training is not a copyright violation, and machine-generated code is not protected by copyright. Critics fear an extension of copyright to AI-generated works, potentially benefiting major tech corporations. The debate also touches on text & data mining, where the EU allows scraping for analysis purposes. While some criticize Copilot for outputting code similar to its training data, this does not necessarily constitute copyright infringement. The discussion highlights the complex intersection of copyright, AI, and open-source principles, with implications for the broader tech industry.

38 comments
By @carom - 6 months
This is missing the largest argument, in my opinion. The weights are a derivative work of the GPL-licensed code and should therefore be released under the GPL. I would say these companies should either release their weights or simply not train on copyleft code.

It is truly amazing how many people will shill for these massive corporations that claim they love open source or that their AI is open while they profit off of the violation of licenses and contribute very little back.

By @cowsandmilk - 6 months
> What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.

That should not be astonishing. The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright. If the authors wanted their code to be made use of in non-Free software, they would have used a BSD or MIT license.

By @desiderantes - 6 months
I think that the author has a warped idea of how LLMs work, and that infects their reasoning. Also, I see no mention of the inequality of this new "copyright free code generation" situation they defend. As much as Microsoft thinks all code is ripe for the taking, I can't imagine how happy they would be if an anonymous person dropped a model trained on all leaked Windows code and the ReactOS people started using it. Or if employees started taking internal code to train models that they then use after their employment ends (since it's not copyright infringement, it should be cool).
By @flaptrap - 6 months
Who are they trying to fool? Wholesale expropriation after stripping the license and authorship, while those in the open source community observe both of them very carefully.

Give credit where credit is due, including paying the creators when the licensing is violated.

By @Y-bar - 6 months
> The output of a machine simply does not qualify for copyright protection – it is in the public domain.

Am I reading this right? If this argument is generally true, does this mean that the output of a compiler might also be sent into the public domain? Or the live recording and broadcast of an event which involves automated machines on all levels?

By @Guvante - 6 months
If Copilot spits out the entirety of a GPL library and you include that code in your project you are certainly violating the GPL license.

AI companies are trying to avoid paying for training data, since the amount of data required is so vast that any payment content creators would consider reasonable would add up to billions in expenses.

Additionally, there have been copyright exemptions around scraping and reproducing the scraped contents, but typically those exemptions have been explicitly granted as part of a copyright case and have been narrowly defined.

For instance Google Images only provides thumbnails and your browser gets the full size image from the original source.

The biggest problem for AI is that most of the similar previous copyright cases were partially avoided because the use was not the same thing as the original. Google's scraping isn't trying to do the same thing your content is doing.

However, output generated from training data is trying to do the same thing as the original, so it falls under stricter scrutiny.

Although, as this post alludes to, the problem is that going after the AI is untested territory, and going after violators tends to be complex at best. After all, in my first hypothetical, how would anyone know? I will say that historically the courts haven't been very positive about shell games like this.

By @anileated - 6 months
Copyleft and copyright are not at odds. To promote copyleft, you exercise copyright.

Furthermore, copyright is key to ensuring attribution, and attribution is an important enabler and motivator of creativity (copyleft does not at all imply non-attribution, in fact copyleft licenses may require it).

By @mangecoeur - 6 months
The basic problem is that the GPL tries to use copyright as a way to drive a “fair sharing and resharing” approach to code. AI-generated code sold for profit violates the spirit of this approach, but not the letter of the law behind copyright. Fundamentally, copyright has limitations and exceptions for good reason and is probably not the best legal method to enforce this sharing idea, but other methods would be complicated and expensive (e.g., writing and enforcing contracts). On the contrary, it would probably be better for open source if it were decided that AI-generated code cannot be copyrighted and therefore any AI-generated code would be in the public domain automatically.
By @cudgy - 6 months
The issue would be approached much differently if, for example, a “video LLM” were created that scraped movies and generated content from those sources. The well-organized, well-connected movie industry would be up in arms, burying AI companies with lawsuits and newly passed legal protections.
By @giovannibajo1 - 6 months
Should be tagged 2021
By @simion314 - 6 months
It has been shown with examples that an LLM can reproduce exact text from its input. This was such a problem that OpenAI had to add various filters to stop such text from being repeated, and it was shown again when the pre-prompt was revealed.

So we know for sure that the LLM can spit out exact code with the exact same names and comments, or exact paragraphs from books, so there is no question that it memorizes stuff. My explanation is that popular book quotes and popular code snippets appear more than once in the training data, so the training causes the model to memorize this text.

Also, how the F** can the AI spit out facts about a metal band if it has no memory?

If corporations are allowed to do this to the community, then we should be allowed to do the same: train open models on proprietary code and copyrighted images, music, and videos.

By @Eliah_Lakhin - 6 months
> If I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright.

I'm not sure this is applicable to licensed programs because a book is sold, not licensed.

> The output of a machine simply does not qualify for copyright protection – it is in the public domain.

As far as I know, the output of a compiler that builds executables from copyrighted source code is still subject to copyright protection. Is software like an LLM fundamentally different from a compiler in this regard?

In my opinion, the author's argument has several flaws, but perhaps a more important question is whether society would benefit from making an exception for LLM technologies.

I think it depends on how this technology will be used. If it is intended for purely educational purposes and is free of charge for end users, maybe it's not that bad. After all, we have Wikipedia.

However, if the technology is intended for commercial use, it might be reasonable to establish common rules for paying royalties to the original authors of the training data whenever authorship can be clearly determined. From this perspective, it could further benefit authors of open-source and possibly free software too.

By @qjakdx - 6 months
Mr Reda appears to be a politician whose expertise is in attaining lucrative positions:

https://okfn.de/en/vorstand/

Felix was elected to the board of the Open Knowledge Foundation Germany in 2020. Felix is an expert in copyright law and has been Director of Developer Policy at GitHub since March 2024. He previously headed the “control ©” project at the Gesellschaft für Freiheitsrechte. From 2014 to 2019, Felix was a Member of the European Parliament within the Greens/EFA group. Felix is an Affiliate of the Berkman Klein Center for Internet and Society at Harvard University and a member of the Advisory Board of D64 - Center for Digital Progress.

By @nhinck3 - 6 months
I don't think the interpretation of the 2019 Directive is correct.

There are definitely arguments to be made that Copilot contravenes this:

> they can be applied only in certain special cases that do not conflict with the normal exploitation of the works or other subject matter and do not unreasonably prejudice the legitimate interests of the rightholders.

and the only other exception is:

> ... (a lawful use) ... of a work or other subject-matter to be made, and which have no independent economic significance, shall be exempted from the reproduction right provided for in Article 2.

By laundering licensing restrictions, Copilot definitely has the ability to conflict with the normal exploitation of works, and it also fails the "no independent economic significance" condition because it competes with programmers.

By @Barrin92 - 6 months
Focussing on the legal or procedural technicalities of how these systems work is in my opinion completely missing what the resistance is about. There is a difference between sharing your creation with your neighbor and sharing your creation with the corporate equivalent of those Matrix robots that turn people into batteries.

"Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work."

This is the sort of thing that may be technically true, but these kinds of rules were made under the assumption that most valued intellectual creations are indeed made by people. If you're going to argue that gigantic companies can use machines to basically launder intellectual artifacts, and that this doesn't compete with the interests of actual creators because technically ChatGPT isn't a legal person, I think you're getting lost in legalese and missing the point.

By @cmiles74 - 6 months
This article makes a compelling case that GitHub Copilot isn't infringing on our copyright but that doesn't change the fact that it's infringing on something.

A US corporation is slurping up as much open source code as they possibly can and spending bucketloads of money to build a product that they are going to sell for (possibly) more bucketloads of money. The people who worked hard on writing the open source code are getting nothing, except maybe a tighter job market. IMHO, it's hard not to take it personally and it's difficult to get away from the feeling that there is a real injustice taking place.

By @autarchprinceps - 6 months
If you have code that is under copyleft, and Copilot suggests part of it to somebody else to embed in their code on the basis of reading that repo, then either that new repo also has to be under that copyleft license, or the person is unknowingly committing a violation based on what Copilot suggested to them.

Most of the time it is probably irrelevant, as Copilot doesn't suggest entire files yet, and nobody is going to care about expanding a loop or finishing a line or the like, but I have seen as many as 14 lines in my tests. Eventually you are going to get to the point where it becomes truly relevant.

By @bananapub - 6 months
https://okfn.de/en/vorstand/:

> Director of Developer Policy at GitHub since March 2024

so this should be understood the same way you understand an editorial in the New York Times entitled Why Babies Can Learn To Like Bombs, by Joe Blow (Raytheon).

By @minifridge - 6 months
> I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright.

That's a false analogy. It is more like going to the bookshop and taking a photo of every page of the book.

Even so, if you use this content in any shape or form the source should be cited regardless of book ownership.

By @Havoc - 6 months
Feels like a bit of an overcorrection.

LLMs seem, with the right prompt, to be able to reproduce copyrighted work. So it is “in there” in some abstract, baked-in sense.

We really need some sort of legal middle ground to reflect reality on the ground. It’s not quite straight stealing but it’s also not entirely not copied.

By @IsTom - 6 months
If this were true then there would never be a legal need for clean-room implementation or design.
By @skywhopper - 6 months
The author asserts that AI-generated code cannot be copyrighted, which courts have agreed with so far, but in practice most AI-generated code is being claimed to be copyrighted, if you believe GitHub’s own stats about how much Copilot is being used.
By @fire_lake - 6 months
Copilot often feels like an automation of clean room reimplementation of protected materials.
By @koolala - 6 months
I hope the GPL people get their AI GPL revolution. A trillion-dollar GPL data center running an open intelligence built from scraping every GPL program ever made.

People would start releasing works of Art under GPL! Not just Computer Code!

By @temac - 6 months
This article inverts the notion of free software and of copyleft...
By @ChrisArchitect - 6 months
By @reginald78 - 6 months
If it isn't copyright infringement I assume Microsoft has trained it on all of its proprietary source code to make it the best it can be.
By @squarefoot - 6 months
Just out of curiosity, does Copilot license allow the use of its output to train other similar AI code generators?
By @bakugo - 6 months
> If I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright

Do pro-generative AI people have absolutely no argument besides this? If I had a dollar for every time I've heard it, I'd be rich by now. And it's not even close to being a good argument.

By @plasticeagle - 6 months
Making money from AI models "trained" on data that belongs to other people is so clearly wrong that this person has to write thousands of words of impenetrable nonsense just to distract attention from this fundamental point.

What absolute nonsense.

By @koshinae - 6 months
(Nice necromancy)

Dude got ripped into in his own comment section... apparently he conveniently ignored the fact that copilot spits out verbatim code blocks, which is the main problem everyone talks about.

I don't want to shit on the Pirate Party he is a member of, but most of his blog posts seem like typical anti-EU bashing to me. But YMMV.

By @jclulow - 6 months
> The output of a machine simply does not qualify for copyright protection – it is in the public domain.

The machine, such as it is, is generally not acting on its own. A person operates the machine, and presumably is on the hook for infringement on some level.

Consider: what if one directs the machine to reproduce a specific body of code and it ostensibly does so. Was there copying?

What if I have a person read out that body of code and I type exactly what I hear? I used a machine to produce the resultant text file, but it's pretty clear that copying has occurred then.

FWIW, I'm not a copyright maximalist, but I don't think you can win a conflict by abstaining from playing the game the other team is playing. That is: the companies producing, and drawing incredible and exclusive economic value from, industrial scale plagiarism machinery are hardly going to stop ruthlessly enforcing copyright on their own proprietary software. It would seem best, if it's the goal, to get laws changed in advance of simply declaring, lopsidedly, that this is all fine.

By @hexo - 6 months
Nope. It does. No amount of mental acrobatics changes that.
By @haksz - 6 months
Has this person created anything of value that he'd like to protect?
By @lofaszvanitt - 6 months
People live and think in the present and, as an added plus, people have terrible memories. Think 5-20 years ahead and use that timeframe to put the events one after another.
By @kalium-xyz - 6 months
In the past, ChatGPT would output its training data verbatim when given 200 successive “a “. Now this has been fixed by showing an error when you try it. You aren't going to convince me this is anything but compression.