June 28th, 2024

Microsoft says that it's okay to steal web content because it's 'freeware.'

Mustafa Suleyman, CEO of Microsoft AI, believes web content is "freeware" for AI training unless its producer specifies otherwise. This stance has sparked legal disputes and debates over copyright infringement and fair use in AI content creation.

Mustafa Suleyman, CEO of Microsoft AI, stated that content shared on the web is "freeware" that can be used to create new content, particularly for training AI models. He said that unless content producers explicitly state otherwise, all content on the open web is fair game for such purposes. The statement has drawn debate and legal challenges from publishers who reject this view. The controversy extends to AI-generated content and the implications of training AI models on existing work: some argue that it constitutes theft, while others liken it to artists studying existing material. Microsoft and OpenAI have faced copyright infringement lawsuits over using web content for AI training without explicit permission. The issue raises questions about the boundaries of fair use, the complexities of copyright law, and how AI technology is reshaping content creation and ownership.

OpenAI and Anthropic are ignoring robots.txt

Two AI startups, OpenAI and Anthropic, are reported to be disregarding robots.txt rules and scraping web content despite claiming to respect those rules. Analytics from TollBit revealed this behavior, raising concerns about data misuse.

We need an evolved robots.txt and regulations to enforce it

In the era of AI, the robots.txt file faces limitations in guiding web crawlers. Proposals advocate for enhanced standards to regulate content indexing, caching, and language model training. Stricter enforcement, including penalties for violators like Perplexity AI, is urged to protect content creators and uphold ethical AI practices.
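
For readers unfamiliar with how robots.txt works in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the crawler name `ExampleAIBot`, and the URLs are hypothetical and chosen only for illustration; they do not come from any of the articles above. The point is that the check is entirely voluntary: a crawler has to ask the question and then choose to honor the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler runs this check before every request.
print(may_fetch(parser, "ExampleAIBot", "https://example.com/article"))    # False
print(may_fetch(parser, "SomeOtherBot", "https://example.com/article"))    # True
print(may_fetch(parser, "SomeOtherBot", "https://example.com/private/x"))  # False
```

Nothing in the protocol stops a crawler from skipping `may_fetch` or ignoring a `False` result, which is the gap the proposals for stronger standards and enforcement aim to close.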

The Encyclopedia Project, or How to Know in the Age of AI

Artificial intelligence challenges information reliability online, blurring real and fake content. An anecdote underscores the necessity of trustworthy sources like encyclopedias. The piece advocates for critical thinking amid AI-driven misinformation.

My Memories Are Just Meta's Training Data Now

Meta's use of personal content from Facebook and Instagram for AI training raises privacy concerns. European response led to a temporary pause, reflecting the ongoing debate on tech companies utilizing personal data for AI development.

Not all 'open source' AI models are open: here's a ranking

Researchers found that large language models billed as open source still restrict access. The debate over AI model openness continues, with concerns over "open-washing" by tech giants. The EU's AI Act may exempt open source models. Transparency and reproducibility are crucial for AI innovation.

8 comments

By @gerdesj - 10 months

Someone's reasons for sharing information are coloured by the situation at the time of sharing it, amongst many other factors.

Two years ago (say) no one predicted the meteoric rise of LLMs and their voracious appetite for data sets for training. These beasties are not simply search engines that are better direction pointers to your stuff (with a frisson of ads) but insist on being the final word and keep you out. To be blunt: It is stealing.

The implied contract for publishing on the web has changed again, just as it has several times in the past. The worst thing here is the use of the term "freeware". Describing original content, displayed for all to see as -ware is outrageous.

They might as well describe the content on Spotify and co as freeware ... bear with me: you could scrape wifi connections through your publicly available APs or even do some more broadband funky spectrum capture analysis and claim that is what an internet search engine does in its spare time and all is fine (lol).

LLMs and GenAI are quite interesting things but I do not think that they are the last word in ... AI. Anyway the latest cool thingie cannot be allowed to break whatever the current unspoken and somewhat undefined social contract is in place.

This bloke from MS seems to have forgotten that there really is a social contract of some sort and that if you say: "fuck you lot, omnomnom ... mmmm data ... ... laters (lol)" there might be some comeback.

By @taspeotis - 10 months

It’s a pro-AI position but not really controversial?

My reading is he is saying content that is not under an explicit license for usage, that is made available publicly and freely, is fair game for training.

> In his remarks, Suleyman claimed that all content shared on the web is available to be used for AI training unless a content producer says otherwise specifically.

> "With respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding," said Suleyman.

> "There's a separate category where a website or a publisher or a news organization had explicitly said, 'do not scrape or crawl me for any other reason than indexing me so that other people can find that content.' That's a gray area and I think that's going to work its way through the courts."

By @candiddevmike - 10 months

It's ironic that Microsoft used copyright protection and IP law for years to secure a dominant market position, and now they don't need to play by the same rules because "something something AI".

By @dingosity - 10 months

Before we get too upset... can we verify this is MSFT's official position? I suspect this may be hyperbole. It could be Suleyman was constructing a hypothetical point that didn't survive translation into click-bait. That being said... MSFT has a history of chicanery. I'm off to try to find original sources. If anyone else has any, please provide a link.

FWIW... I found a few videos related to Endicott's story:

* This is a quick 5-minute video where Suleyman talks about how indeterminacy is good. So... you know... it's a good thing that Co-Pilot can't tell you why it thinks it needs to dump 800 lines of Java code into your hello world program. At around 3:44, he confuses LLMs (with a surface understanding of syntax married with a Markov chain on steroids) with people (who, as best we can tell, have a different understanding of the thing represented). Corporate management confusing the map with the territory? Who could have foreseen such a thing: https://youtu.be/GsGFYoIx1YM

* This one seems to be the longer version. I'm still looking for where Endicott's quote comes from, but around the 14-minute mark the conversation turns towards "who owns the IP" used to train LLMs, and the terms "Fair Use" and "Freeware" are used around the 14m50s mark: https://youtu.be/lPvqvt55l3A

[EDIT: So... yes... get out the pitch-forks... Microsoft is saying anything on the web is inherently freeware or subject to fair use even if you think you remember putting a copyright notice on it (or, as is mentioned in US copyright law, the creator automatically receives copyright protections upon creation of the work.)]

By @mmh0000 - 10 months

Of course it's okay.

I make an http _REQUEST_, the server voluntarily fulfills the request.

Why is it okay for a person to view your content, memorize it, and use it as a base for new content while it's not okay for an AI? At the end of the day it is the same thing.

By @nineteen999 - 10 months

The Windows source code was leaked onto the web many years ago wasn't it? Guess that makes it freeware too.

By @ginvok - 10 months

With this logic, so is pirated software right? It's free because it's on the internet.

By @userbinator - 10 months

This doesn't deprive the original owner, so they should use "share" or "pirate" instead.
