June 29th, 2024

Microsoft's AI boss thinks it's perfectly OK to steal content if it's on open web

Microsoft's AI boss, Mustafa Suleyman, challenges copyright norms by advocating for free use of online content. His stance triggers debates on AI ethics and copyright laws in the digital era.

Microsoft's AI boss, Mustafa Suleyman, has sparked controversy by suggesting that content on the open web is fair game for anyone to copy and use freely, dubbing it "freeware." This belief contradicts copyright law: in the US, content is protected by copyright automatically once it is created. Suleyman's stance comes amid lawsuits accusing Microsoft and OpenAI of using copyrighted online content to train AI models. While he acknowledges that a site can specify restrictions in a robots.txt file, he downplays the legal weight of such restrictions compared to fair use. His comments have raised concerns about the ethical use of AI and intellectual property rights. AI companies commonly justify their use of copyrighted material as fair use, but Suleyman's bold assertions have drawn fresh attention to the complexities of copyright law in the digital age. The debate over AI's access to and use of online content continues, with its legal and ethical implications at the forefront of discussions in the tech industry.
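
For readers unfamiliar with the robots.txt mechanism mentioned above, here is a minimal sketch of how a well-behaved crawler could check a site's rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules, the URL, and the crawler name `ExampleAIBot` are hypothetical and chosen only for illustration. The check is a voluntary convention and says nothing about copyright or fair use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one search crawler may index everything,
# every other crawler is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"  # placeholder URL
    for agent in ("Googlebot", "ExampleAIBot"):  # second name is made up
        print(agent, may_fetch(ROBOTS_TXT, agent, url))
```

Under these example rules the search crawler is allowed and the hypothetical AI crawler is refused. Whether a crawler honors that answer is up to its operator.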

Related

Microsoft's AI boss Suleyman has a curious understanding of web copyright law

Microsoft's AI boss, Mustafa Suleyman, suggests open web content is free to copy, sparking copyright controversy. AI firms debate fair use of copyrighted material for training, highlighting legal complexities and intellectual property concerns.

Microsoft says that it's okay to steal web content because it's 'freeware.'

Microsoft's CEO of AI, Mustafa Suleyman, believes web content is "freeware" for AI training unless specified otherwise. This stance has sparked legal disputes and debates over copyright infringement and fair use in AI content creation.

Microsoft CEO of AI: Your online content is 'freeware' fodder for training models

Mustafa Suleyman, CEO of Microsoft AI, faced legal action for using online content as "freeware" to train neural networks. The debate raises concerns about copyright, AI training, and intellectual property rights.

All web "content" is freeware

Microsoft's CEO of AI discusses open web content as freeware since the 90s, raising concerns about AI-generated content quality and sustainability. Generative AI vendors defend practices amid transparency and accountability issues. Experts warn of a potential tech industry bubble.

Microsoft AI CEO: Web content is 'freeware'

Microsoft's CEO discusses AI training on web content, emphasizing fair use unless restricted. Legal challenges arise over scraping restrictions, highlighting the balance between fair use and copyright concerns for AI development.

22 comments
By @jimmaswell - 7 months
From the beginning, it's seemed completely intuitive to me that training a computer made of sand on publicly available content and then generating art later should be fair use, so long as it's fair use to train the meat computer in your head on the same content and then use it to generate art later. There's no meaningful difference to me as far as the ethics of the act are concerned.
By @jsyang00 - 7 months
No he doesn't.

> I think that with respect to content that’s already on the open web, the social contract of that content since the ‘90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been “freeware,” if you like, that’s been the understanding.

> There’s a separate category where a website, or a publisher, or a news organization had explicitly said ‘do not scrape or crawl me for any other reason than indexing me so that other people can find this content.’ That’s a grey area, and I think it’s going to work its way through the courts.

By @JonChesterfield - 7 months
I'll bet they don't consider the Windows and Office source code fair game for arbitrary reuse provided the other party found the copy on the web. Even if the person found the copy on GitHub.
By @beefnugs - 7 months
Isn't having this discussion at all stupidly letting them control the goalposts? They have already gone far beyond this, thinking that everything someone does on their own personal computer in their own home, without the slightest bit of consent, is going to be slurped up and recorded in case they want to query it someday.

This is like arguing that this guy who just murdered someone 10 minutes ago, should actually be able to steal the candy from this child since the child put it down on the park bench.

By @starik36 - 7 months
The more I read about this guy the more I get the feeling that he is an unscrupulous individual.

robots.txt is a "grey idea" to him, instead of being a directive to keep moving? Wow.

By @mewpmewp2 - 7 months
What exactly is wrong with the statement he has made?
By @1vuio0pswjnm7 - 7 months
He compares fear of "AI" to fear of calculators. But "AI" cannot do math. Calculators do not "hallucinate". They are not correct "80%" of the time. They are correct 100% of the time. We know how they work. IIRC, in the 1970s someone at Bell Labs wrote a UNIX program that could generate fake academic papers. It might be a fun gag but it does not have much practical utility. No matter how "real" the papers might appear, or even if they are correct "80%" of the time, it is not an "invention", and it is certainly not comparable to a calculator.
By @avivallssa - 7 months
Will this make people who earn money indirectly through their content less motivated to publish it on the web? This might be arguable.

Maybe there should be a similar amount of openness in publishing the content used for training commercial models.

The copyright owner should have the right to ask for that content to be removed from training. This may also allow individual authors to gain their share with their advanced RAG applications that are specially focused on the content they own and have also published on the web.

By @29athrowaway - 7 months
One thing is a robots.txt policy, meant mostly for search crawlers.

Another thing is the copyright of the content, terms of use policies, etc.

Abiding by a robots.txt policy doesn't make you immune to copyright, terms of service, law in various jurisdictions, etc. If you think that, you are probably a kleptomaniac.

Just create a robots.txt with "User-Agent: one billion asterisks" so that the crawlers die when parsing it.

By @sircastor - 7 months
It seems obvious to me that there is no such thing as AI without publicly training on the open web, and that any kind of licensing is an impossible feat.

Programs from my youth (Daria, Captain N) had licensed music for their broadcast, and that’s all because what else was ever going to be done? 20 years later, streaming with the music intact is an impossibility because the kind of money necessary to license all of it was too much. And you have to make deals with dozens of companies.

Multiply that by several orders of magnitude and you start to see the scope of the problem.

By @sircastor - 7 months
Part of the problem here is that the web has gone through lots of change as to what it is and how people understand it.

Some people think of it as billboards posted on the highway. Some think it’s a bulletin board. Some think it’s a newspaper. A television, a “zine”, a diary, graffiti. It has been all of these things, and is and isn’t. And people who publish are really bad at explicitly stating which one they are. But they expect you to know.

By @fimdomeio - 7 months
So we've now learned that copyright is determined by communications protocol. If you're using torrents it's copyright infringement, if it's the web then it's public domain.
By @boring-alterego - 7 months
Hmm hear me out, go to a public website and add black space below any video or picture with random adjectives that are your satire review of that piece of art then feed those into the ai model and tell it to ignore any text.
By @KoolKat23 - 7 months
This is nothing but performative clickbait by the Verge.

It is classified as fair use. The term, if anyone wishes to Google it, is transformative use, where those using the content are training models (their intention).

The end.

By @whacko_quacko - 7 months
scraping the open web shouldn't be a crime[1], even if unsavoury people do it for unsavoury purposes

[1]: or even just an issue

By @byyll - 7 months
It's not stealing content if the content is still in the original place. Stop trying to redefine words. It's copying.
By @93po - 7 months
If buying isn't owning, copying isn't stealing. This is a really tired argument.
By @cjk2 - 7 months
Ah yes the implied social contract that it's ok because it happens all the time.

That's how society falls.

By @tiahura - 7 months
The open web's ethos since its inception in the 1990s has been one of unrestricted access and fair use. Content published openly online inherently invites broad consumption, reproduction, and creative reuse by the public. This is not merely custom, but a fundamental aspect of fair use doctrine as applied to the digital realm.

The four factors of fair use - purpose of use, nature of the copyrighted work, amount used, and effect on the market - overwhelmingly favor allowing free use of openly published web content. The transformative nature of most reuses, the public availability of the original works, the necessity of using entire works in many cases, and the lack of a traditional market for such content all support this interpretation.

This longstanding practice has been the catalyst for unprecedented innovation and information dissemination. It represents a tacit social contract between content creators and users, establishing a de facto "freeware" model for open web content. Any attempt to retroactively impose strict copyright limitations would not only stifle innovation but also contradict decades of established legal precedent and digital norms.

As a side note, I’m not certain that training necessarily involves “copying.”

Lastly, if anyone really thinks the Roberts Court is going to kneecap AI, you’re soft in the head.