GitHub Copilot is not infringing your copyright
GitHub Copilot, an AI tool, faces controversy for using copyleft-licensed code for training. The debate covers copyright infringement, the status of AI-generated works, and the implications for the tech industry and open-source principles.
GitHub Copilot, an AI tool generating code suggestions for programmers, has sparked controversy in the Free Software community. Some argue that its use of copyleft-licensed code for training may infringe copyright, as Copilot itself is not under a copyleft license. However, scraping code for AI training is not a copyright violation, and machine-generated code is not protected by copyright. Critics fear an extension of copyright to AI-generated works, potentially benefiting major tech corporations. The debate also touches on text & data mining, where the EU allows scraping for analysis purposes. While some criticize Copilot for outputting code similar to its training data, this does not necessarily constitute copyright infringement. The discussion highlights the complex intersection of copyright, AI, and open-source principles, with implications for the broader tech industry.
Related
Microsoft AI CEO: Web content is 'freeware'
Microsoft's AI CEO discusses AI training on web content, emphasizing fair use unless restricted. Legal challenges arise over scraping restrictions, highlighting the balance between fair use and copyright concerns for AI development.
Coders' Copilot code-copying copyright claims crumble against GitHub, Microsoft
A judge dismissed a DMCA claim against GitHub, Microsoft, and OpenAI over Copilot. Remaining are claims of license violation and breach of contract. Dispute ongoing regarding discovery process. Defendants defend Copilot's compliance with laws.
Judge dismisses DMCA copyright claim in GitHub Copilot suit
A judge dismissed a DMCA claim against GitHub, Microsoft, and OpenAI over Copilot. The lawsuit alleged code suggestions lacked proper credit. Remaining claims involve license violation and breach of contract. Both sides dispute document production.
The developers suing over GitHub Copilot got dealt a major blow in court
A California judge dismissed most claims in a lawsuit against GitHub, Microsoft, and OpenAI over code copying by GitHub Copilot. Only two claims remain: open-source license violation and breach of contract. The court ruled Copilot didn't violate copyright law.
Judge dismisses lawsuit over GitHub Copilot coding assistant
A US judge dismissed a lawsuit against GitHub over AI training with public code. Plaintiffs failed to prove damages for breach of contract. GitHub Copilot faces scrutiny for using open-source code.
It is truly amazing how many people will shill for these massive corporations that claim they love open source or that their AI is open while they profit off of the violation of licenses and contribute very little back.
That should not be astonishing. The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright. If the authors wanted their code to be made use of in non-Free software, they would have used a BSD or MIT license.
Give credit where credit is due, including paying the creators when the licensing is violated.
Am I reading this right? If this argument is generally true, does it mean that the output of a compiler would also fall into the public domain? Or the live recording and broadcast of an event that involves automated machines at every level?
AI companies are trying to avoid paying for training data: the amount of data required is so vast that any payment content creators would consider reasonable would add up to billions in expenses.
Additionally, there have been copyright exemptions around scraping and reproducing the scraped content, but typically those exemptions have been explicitly granted as part of a copyright case and have been narrowly defined.
For instance, Google Images only provides thumbnails, and your browser fetches the full-size image from the original source.
The biggest problem for AI is that most similar previous copyright cases have been partially avoided by not being the same thing: Google's scraping isn't trying to do the same thing your content is doing.
However, the output of a model trained on your data is trying to do the same thing as the original, so it falls under stricter scrutiny.
Although, as this post alludes to, the problem is that going after the AI is untested territory, and going after violators tends to be complex at best. After all, in my first hypothetical, how would anyone know? I will say that historically the courts haven't been very positive about shell games like this.
Furthermore, copyright is key to ensuring attribution, and attribution is an important enabler and motivator of creativity (copyleft does not at all imply non-attribution; in fact, copyleft licenses may require it).
So we know for sure that the LLM can spit out exact code, with the exact same names and comments, or exact paragraphs from books, so there is no question that it memorizes things. My explanation is that popular book quotes and popular code snippets appear more than once in the training data, so training causes the model to memorize that text (a rough way to check for this kind of verbatim overlap is sketched below).
Also, how the F** can the AI spit out facts about a metal band if it has no memory?
If corporations are allowed to do this to the community, then we should be allowed to do the same: train open models on proprietary code and on copyrighted images, music, and videos.
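A minimal sketch of how that verbatim overlap might be measured, assuming you already have a model completion in hand; both strings below are made-up placeholders, not real Copilot output or real licensed code:

```python
# Hypothetical probe for verbatim memorization: compare a model completion against
# a known snippet and report the longest stretch that matches character for character.
from difflib import SequenceMatcher

# Placeholder "licensed" snippet (invented for illustration, not from any real project).
licensed_snippet = (
    "// Copyright (C) Example Author, GPL-3.0\n"
    "int clamp(int value, int lo, int hi) {\n"
    "    if (value < lo) return lo;\n"
    "    if (value > hi) return hi;\n"
    "    return value;\n"
    "}\n"
)

# Placeholder completion, standing in for whatever the model returned when prompted
# with the first line or two of the snippet above.
model_completion = (
    "int clamp(int value, int lo, int hi) {\n"
    "    if (value < lo) return lo;\n"
    "    if (value > hi) return hi;\n"
    "    return value;\n"
    "}\n"
)

def longest_verbatim_overlap(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared verbatim by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]

overlap = longest_verbatim_overlap(licensed_snippet, model_completion)
print(f"{len(overlap)} characters reproduced verbatim:")
print(overlap)
```

Exact character overlap is only one signal, of course; whether any given amount of overlap amounts to infringement is precisely the legal question this thread is arguing about.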
I'm not sure this is applicable to licensed programs because a book is sold, not licensed.
> The output of a machine simply does not qualify for copyright protection – it is in the public domain.
As far as I know, the output of a compiler that builds executables from copyrighted source code is still subject to copyright protection. Is software like an LLM fundamentally different from a compiler in this regard?
In my opinion, the author's argument has several flaws, but perhaps a more important question is whether society would benefit from making an exception for LLM technologies.
I think it depends on how this technology will be used. If it is intended for purely educational purposes and is free of charge for end users, maybe it's not that bad. After all, we have Wikipedia.
However, if the technology is intended for commercial use, it might be reasonable to establish common rules for paying royalties to the original authors of the training data whenever authorship can be clearly determined. From this perspective, it could further benefit authors of open-source and possibly free software too.
Felix was elected to the board of the Open Knowledge Foundation Germany in 2020. Felix is an expert in copyright law and has been Director of Developer Policy at GitHub since March 2024. He previously headed the “control ©” project at the Gesellschaft für Freiheitsrechte. From 2014 to 2019, Felix was a Member of the European Parliament within the Greens/EFA group. Felix is an Affiliate of the Berkman Klein Center for Internet and Society at Harvard University and a member of the Advisory Board of D64 - Center for Digital Progress.
There are definitely arguments to be made that Copilot contravenes this:
> they can be applied only in certain special cases that do not conflict with the normal exploitation of the works or other subject matter and do not unreasonably prejudice the legitimate interests of the rightholders.
and the only other exception is:
> ... (a lawful use) ... of a work or other subject-matter to be made, and which have no independent economic significance, shall be exempted from the reproduction right provided for in Article 2.
By laundering licensing restrictions, Copilot definitely has the ability to conflict with the normal exploitation of works, and it is hard to argue it has no independent economic significance when it competes with programmers.
"Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work."
This is the sort of thing that may be technically true, but those rules were made under the assumption that most valued intellectual creations are indeed made by people. If you're going to argue that gigantic companies can use machines to basically launder intellectual artifacts, and that this doesn't compete with the interests of actual creators because ChatGPT technically isn't a legal person, I think you're getting lost in legalese and missing the point.
A US corporation is slurping up as much open source code as they possibly can and spending bucketloads of money to build a product that they are going to sell for (possibly) more bucketloads of money. The people who worked hard on writing the open source code are getting nothing, except maybe a tighter job market. IMHO, it's hard not to take it personally and it's difficult to get away from the feeling that there is a real injustice taking place.
Most of the time it is probably irrelevant, as Copilot doesn't suggest entire files yet, and nobody is going to care about it expanding a loop or finishing a line or the like, but I have seen suggestions as long as 14 lines in my tests. Eventually you are going to get to the point where it becomes truly relevant.
> Director of Developer Policy at GitHub since March 2024
so this should be understood the same way you understand an editorial in the New York Times entitled "Why Babies Can Learn To Like Bombs", by Joe Blow (Raytheon).
That's a false analogy. It is more like going to the bookshop and taking a photo of every page of the book.
Even so, if you use this content in any shape or form, the source should be cited regardless of book ownership.
LLMs seem to be able, with the right prompt, to reproduce copyrighted work. So it is “in there” in some abstract, baked-in sense.
We really need some sort of legal middle ground to reflect reality on the ground. It’s not quite straight stealing but it’s also not entirely not copied.
People would start releasing works of Art under GPL! Not just Computer Code!
Some discussion then: https://news.ycombinator.com/item?id=27736650
Do pro-generative AI people have absolutely no argument besides this? If I had a dollar for every time I've heard it, I'd be rich by now. And it's not even close to being a good argument.
What absolute nonsense.
Dude got ripped into in his own comment section... apparently he conveniently ignored the fact that Copilot spits out verbatim code blocks, which is the main problem everyone talks about.
I don't want to shit on the Pirate Party he is a member of, but most of his blog posts read like typical anti-EU bashing to me. But YMMV.
The machine, such as it is, is generally not acting on its own. A person operates the machine, and presumably is on the hook for infringement on some level.
Consider: what if one directs the machine to reproduce a specific body of code, and it ostensibly does so? Was there copying?
What if I have a person read out that body of code and I type exactly what I hear? I used a machine to produce the resultant text file, but it's pretty clear that copying has occurred then.
FWIW, I'm not a copyright maximalist, but I don't think you can win a conflict by abstaining from playing the game the other team is playing. That is: the companies producing, and drawing incredible and exclusive economic value from, industrial scale plagiarism machinery are hardly going to stop ruthlessly enforcing copyright on their own proprietary software. It would seem best, if it's the goal, to get laws changed in advance of simply declaring, lopsidedly, that this is all fine.