June 29th, 2024

Microsoft CEO of AI: Your online content is 'freeware' fodder for training models

Mustafa Suleyman, CEO of Microsoft AI, described online content as "freeware" that can be used to train neural networks, a stance that has drawn legal action against Microsoft and OpenAI. The debate raises concerns about copyright, AI training, and intellectual property rights.

The CEO of Microsoft AI, Mustafa Suleyman, stated that machine-learning companies can use online content as "freeware" to train neural networks. This led to legal action from the Center for Investigative Reporting against OpenAI and Microsoft for using content without permission. Several lawsuits have been filed against these companies for alleged content misappropriation. Suleyman mentioned a distinction between freely available online content and copyrighted material. The legal uncertainty around using copyrighted data to train AI models has raised concerns about the future of content creation and intellectual property rights. Experts suggest that policymakers need to address these issues to balance rights and responsibilities in the AI era. The ongoing debate highlights the evolving landscape of AI model training and the potential impact on content creators and the AI industry.

Related

OpenAI and Anthropic are ignoring robots.txt

OpenAI and Anthropic are reported to be disregarding robots.txt rules, scraping web content despite claiming to respect such directives. Data from TollBit revealed this behavior, raising concerns about data misuse.

Record Labels Sue Two Startups for Training AI Models on Their Songs

Major record labels sue AI startups Suno AI and Uncharted Labs Inc. for using copyrighted music to train AI models. The lawsuits seek damages of up to $150,000 per infringed work, reflecting the music industry's defense of its intellectual property.

Microsoft's AI boss Suleyman has a curious understanding of web copyright law

Microsoft's AI boss, Mustafa Suleyman, suggests open web content is free to copy, sparking copyright controversy. AI firms debate fair use of copyrighted material for training, highlighting legal complexities and intellectual property concerns.

Microsoft says that it's okay to steal web content because it's 'freeware.'

Microsoft's CEO of AI, Mustafa Suleyman, believes web content is "freeware" for AI training unless specified otherwise. This stance has sparked legal disputes and debates over copyright infringement and fair use in AI content creation.

All web "content" is freeware

Microsoft's CEO of AI argues that open web content has been treated as freeware since the 90s, raising concerns about AI-generated content quality and sustainability. Generative AI vendors defend their practices amid transparency and accountability issues. Experts warn of a potential tech industry bubble.

32 comments
By @ralferoo - 4 months
Copyright law is pretty clear that copyright automatically belongs to the creator from the moment they create something. People can choose to transfer those rights to another party, e.g. by assigning the copyright to an employer or selling the rights, or by licensing the content to someone else. But it cannot be assumed that everything is "free" in the absence of a copyright notice; on the contrary, one must assume that everything is copyrighted in the absence of a notice saying it's free.

The following rules are agreed upon by pretty much every country that has an interest in copyright: https://www.wipo.int/treaties/en/ip/berne/summary_berne.html

By @xkqras - 4 months
"That's the future Suleyman anticipates. 'The economics of information are about to radically change because we can reduce the cost of production of knowledge to zero marginal cost,' he said."

The cynicism is mind-blowing. Artificial Stupidities have created exactly zero knowledge so far, which is why companies like Microsoft now roll out people like Suleyman, who openly admits that the DMCA is only for the rich with lawyers.

And steal and repackage "content" from altruistic creators.

This one might come back to bite Microsoft. If any case goes to the Supreme Court, I'm not sure they'll be amused by this line of logic.

By @hunglee2 - 4 months
Three C's are missing from this - consultation, consent and compensation.

We post regularly online without any expectation of payment, but we never considered that our output could or would be used for commercial purposes. The value of our collective output is being captured by a very small elite. Not sure what we can do about this, other than support alternative ecosystems, at least to ensure that inter-corporate competition might keep prices low.

By @danybittel - 4 months
"I think that with respect to content that is already on the open web, the social contract of that content since the 1990s has been it is fair use"

And that social contract did not include AI. You put content on the internet because you had certain expectations about how it would be seen or used.

By @piva00 - 4 months
It is the kind of attitude that will bring regulation; in the USA it will probably take a decade (as it usually does) for legislators to catch up, after some company skirting copyright law has become big enough to be another powerhouse for the US economy.

The EU is probably already acting on it behind closed doors. The usual vitriol against "innovation", peddled by the American way of doing business, ran pretty high against the AI Act (which is definitely far from perfect, but a step in some direction toward regulating this). In the near future I can see more rulings, or even new regulation, addressing the absurdity of AI companies consuming all this data for their own profit with no compensation to its creators.

My personal opinion is that letting "innovation" be the only guide to what is "good", without any morality imbued, is stupid. A lot of us have seen the cycle by now: what was innovative becomes entrenched, the entrenched companies become behemoths, and they obviously start abusing their position of power once consumers have very few options outside the system they created. The downfall of tech from what I experienced in the early 2000s to what it has become in the 2020s is just sad; it's the new 80s finance-yuppie bullshit, except instead of coke-addicted greedy-as-fuck bros we have nerdy, blabbing-about-changing-the-world greedy fucks reaping the profits.

This will get ugly, and companies doing it will deserve the retribution if they get fucked.

By @ankit219 - 4 months
Current laws are insufficient to address this, but we will end up using them to adjudicate whether or not this is fair use. We will get there in due time, and the pathway will be messy.

I am ambivalent about this overall, but a few things are clear. Someone getting sued does not automatically mean they are wrong and whoever is suing is right. We don't know the rules, hence the lawsuits. I see the suits being treated as evidence of wrongdoing, and that seems plainly wrong. Everything that becomes big ends up being sued (including artists facing allegations that they copied someone's work). It tells us nothing.

(I think this part is clear.) Reproducing content verbatim without permission, and for profit, is plain old plagiarism, whether it's done via AI or by a human. In some cases, with proper citations, it is allowed, but otherwise it's a no. Summarizing content, with or without credit and citations, was always allowed, but it was never done at this scale, so this "social contract" might need to change.

By @Kye - 4 months
I guess I own and can do whatever I want with this Windows ISO then.

https://www.microsoft.com/en-us/software-download/windows11/

By @dspillett - 4 months
How about we test this theory by copying all the "freeware" content that MS puts out? Heck, as it is freeware and not anything more restrictive, we can make derivative works too!
By @lambdaone - 4 months
Fascinating watching Microsoft morphing from intellectual property absolutism to this. "What's mine is mine, what's yours is also mine."
By @CM30 - 4 months
I'm sure all those media outlets and rights organisations are going to love this logic, and it's not going to backfire at all. They're already being sued by multiple news organisations, the Authors Guild, a few celebrities, and likely many other organisations in the coming months, and I'm sure at least some of those lawyers are going to end up using this quote in court.
By @m463 - 4 months
Could Microsoft just ignore copyright on radio transmissions or broadcast tv too?
By @EnigmaFlare - 4 months
I think people are getting confused between using something and redistributing it. If you train an AI, build a search engine index, or your browser caches a webpage, you're just using it; it was put on the web so that it can be used, without any restrictions on what for (except perhaps when permission to even download it is granted conditionally on how you use it, as Suleyman mentioned). Copyright restricts copying, not private use. Since it's on the web, there really is an implied permission to use it privately for whatever you want.

The copyright issue only comes up if you publish the output of your model. But if the AI is (somehow) clever enough to never reproduce the source material in any way that counts as copying for the purposes of copyright, then there's no copyright problem making it available to the public.

Some artists assume they have more rights than they really do and that other people aren't even allowed to mimic their style.

By @exe34 - 4 months
So I can train a model on all available Windows material, including from MS themselves, and sell it as a support bot?
By @dialup_sounds - 4 months
By @croes - 4 months
Translation: If it benefits us it's free, if not you get sued.
By @jerpint - 4 months
I have my own small blog, and I am more and more tempted to "poison" it with made-up facts about me to see if/when it gets scraped by LLMs.
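
A minimal sketch of that idea, assuming a page you control and a small Python build step; the function name, marker format, and the fabricated fact are all illustrative:

```python
# Embed a unique, made-up "canary" fact in a page, then later probe models
# to see whether they reproduce it. Names and format here are assumptions.
import secrets

def make_canary_snippet(author: str = "jerpint") -> str:
    """Return an HTML fragment containing a unique, fabricated fact."""
    token = secrets.token_hex(8)  # unique marker, easy to search for later
    fact = (
        f"Fun fact: {author} once won the {token} award for "
        "competitive marmot grooming."
    )
    # Plain HTML that any scraper parsing the page will pick up; the token
    # makes it trivial to check whether a model has memorized this exact text.
    return f'<p class="bio-note" data-canary="{token}">{fact}</p>'

if __name__ == "__main__":
    print(make_canary_snippet())
```

Keep a private record of the token; if a model later completes the fabricated fact, that's a strong hint the page ended up in its training data.
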
By @surfingdino - 4 months
Oh, do fuck off. Copyright exists so that copyright owners have control over who uses their content and for what. Search brings visitors so that kind of use is beneficial to both parties. Training models that repurpose and recombine other copyright owners' IP and assigning copyright to the patchwork created using LLMs is not beneficial to the original copyright owner and is pure theft. Has Microsoft looked at what Adobe did a couple of weeks ago and thought "we haven't had shit thrown at us for a while, we'll have some of that, please?"
By @courseofaction - 4 months
Eat the rich.
By @bithead - 4 months
No wonder AI gets so much shit wrong. I asked ChatGPT a Pydantic v2 question yesterday and it got it so wrong it would have hurt to watch if it were a person teaching a class. I hope AI doesn't start running operations and building planes...
By @dvhh - 4 months
I understand that robots.txt is mostly treated as a courtesy guideline. But if large corporations cannot be expected to respect it, more proactive countermeasures may start being applied.
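
One sketch of what such a countermeasure could look like at the application layer, assuming a WSGI app and a hand-picked list of crawler user-agent substrings (the agent names and wiring below are illustrative, not an authoritative list):

```python
# Refuse requests whose User-Agent matches known AI-crawler names, rather
# than relying on robots.txt alone. Agent list and wiring are assumptions.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "anthropic-ai")

def block_ai_crawlers(app):
    """Wrap a WSGI application and return 403 for matching User-Agent strings."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot.lower() in user_agent.lower() for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Crawling not permitted.\n"]
        return app(environ, start_response)
    return middleware
```

Of course this only deters crawlers that identify themselves honestly; anything spoofing a browser user agent sails straight through, which is rather the point about courtesy.
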
By @itronitron - 4 months
Seems like it would be easy to structure 'online content' so that it poisons the model with either nonsensical content or offensive content, like what happened with 'santorum'.
By @courseofaction - 4 months
See this in the context of corporate and governmental greed aligning into the impoverishment of the masses. This is part of a general movement towards wealth concentration, and war is the next step.
By @ChrisArchitect - 4 months
By @operae - 4 months
"I think that with respect to content that is already on the open web, the social contract of that content since the 1990s has been it is fair use"

This argument cannot possibly hold in any court. This has not been the 'contract'. I cannot reproduce the content of a newspaper's online outlet, I cannot reproduce another artist's work from Instagram, and I cannot reproduce someone's YouTube video without permission. This same thing sparked the whole fair-use debate some years ago.

The exceptions to these rules have always existed in regulatory grey areas and have been debated for decades.

This guy is apparently still living in the Napster era, and the amount of gaslighting Microsoft, OpenAI, Google, etc. are doing right now to freeload on data is presumptuous.

By @bmacho - 4 months
What about Windows source code? Is that free real estate too?
By @keiferski - 4 months
I think there's a 99% certainty that almost all serious media websites will be paywalled in the future, especially ones that provide information useful to businesses. There is less and less benefit to making your work public.
By @127 - 4 months
"If you don't have enough money to sue us, your copyright doesn't exist."
By @ozim - 4 months
Yeah, if I don't lock my house and someone takes all the furniture, it is not theft...

What a joke of a person. I hope the court rolls over them harshly and explains that the world doesn't work like that, and that just because you can do something doesn't make it right.

By @courseofaction - 4 months
There are no rules. This is an everyone-be-damned land grab, with every imperialistic instinct justifying itself with bullshit.

Read a history book.

By @rspoerri - 4 months
Soooo... any downloadable version of Office and Windows is freeware for my AI to train on and spit out the same code again? Kinda cool, I think. /s