October 26th, 2024

OSI readies controversial open-source AI definition

The Open Source Initiative will vote on the Open Source AI Definition on October 27, 2024, amid criticism that it may lower open-source standards by not requiring training data release.

The Open Source Initiative (OSI) is set to vote on its Open Source AI Definition (OSAID) on October 27, 2024, with a public release planned for October 28. The definition aims to clarify what constitutes an open-source AI system, allowing for use, study, modification, and sharing. However, it has faced criticism from members of the open-source community who argue that the OSAID may lower the standards established by the original Open Source Definition (OSD). Concerns center on the treatment of training data, which the current draft does not require to be made available, only detailed information about it. Critics argue that this omission undermines the four freedoms associated with open-source software, as it limits the ability to fully understand and modify AI systems.

The OSI has engaged in extensive discussions and consultations to develop the OSAID, but some experts believe that the complexities of AI systems call for a fresh approach rather than an adaptation of existing definitions. The OSI maintains that the OSAID reflects a balanced perspective, incorporating insights from a diverse range of stakeholders. The outcome of the vote, and the subsequent impact of the OSAID on the open-source community and AI development, remains uncertain.

- OSI will vote on the Open Source AI Definition (OSAID) on October 27, 2024.

- The OSAID has faced criticism for potentially lowering the standards set by the original Open Source Definition.

- The draft does not require the release of training data, only detailed information about it.

- Critics argue this omission undermines the core principles of open source.

- The OSI claims the OSAID reflects a balanced approach based on stakeholder input.

24 comments
By @didibus - 6 months
> Maybe the supporter of the definition could demonstrate practically modifying a ML model without using the original training data, and show that it is just as easy as with the original data and it does not limit what you can do with it (e.g. demonstrate it can unlearn any parts of the original data as if they were not used).

I quite like that comment, which was left on the article. I know that with some models you can tweak the weights without the source data, but you do seem more restricted without the actual dataset.

Personally, in this case the data seems to me to be part of the source. The weights are derived from the data; they are the artifact of training. If anything, providers should release the data, the training methodology, the model architecture, and the code to train and infer, and the weights could be optional. The weights are basically equivalent to a built artifact, like compiled software.

Commercially, that means people would pay for the cost of training. I might not have the resources to "compile" it myself, i.e., run the training, so maybe I'd pay a subscription to a service that did.
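A minimal sketch of the distinction being drawn here, assuming PyTorch and a toy stand-in model for released open weights: you can edit or fine-tune weights without the original dataset, but nothing in these operations lets you selectively unlearn what that dataset taught.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a set of released open weights.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

# Direct weight surgery: needs no data at all.
with torch.no_grad():
    model[2].weight *= 0.9  # e.g., dampen the output layer

# Fine-tuning: needs *some* data, but not the original training set.
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
# Neither operation can surgically remove what the original data taught.
```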

By @samj - 6 months
The OSI apparently doesn't have the mandate from its members to even work on this, let alone approve it.

The community is starting to regroup at https://discuss.opensourcedefinition.org because the OSI's own forums are now heavily censored.

I encourage you to join the discussion about the future of Open Source, the first option being to keep everything as is.

By @blogmxc - 6 months
OSI sponsors include Meta, Microsoft, Salesforce, and many others. It seems unlikely that they'd demand that training data be free and available.

Well, another org is getting directors' salaries while open source writers get nothing.

By @looneysquash - 6 months
The trained model is object code. Think of it as Java byte code.

You have some sort of engine that runs the model. That's like the JVM, and the JIT.

And you have the program that takes the training data and trains the model. That's your compiler, your javac, your Makefile and your make.

And you have the training data itself, that's your source code.

Each of the above pieces has its own source code. And the training set is also source code.

All those pieces have to be open to have a fully open system.

If only the training data is open, that's like having the source, but the compiler is proprietary.

If everything but the training set is open, well, that's like giving me gcc and calling it Microsoft Word.
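A toy rendering of that mapping, with hypothetical stand-ins: the "compiler" turns training data (source) into weights (object code), and the "runtime" executes the weights on new input.

```python
def train(data):           # ~ the compiler: javac, gcc, make
    """'Compile' the training data into a learned parameter (its mean)."""
    return sum(data) / len(data)

def run(weights, x):       # ~ the engine/runtime: JVM, JIT
    """'Execute' the object code on a new input."""
    return x * weights

weights = train([1.0, 2.0, 3.0])  # source code -> object code
print(run(weights, 10.0))         # the runtime executes it -> 20.0
```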

By @AlienRobot - 6 months
If I remember correctly, Stallman's whole point about FLOSS was that consumers were beholden to developers who monopolized the means to produce binaries.

If I can't reproduce the model, I'm beholden to whoever trained it.

>"If you're explaining, you're losing."

That is an interesting point, but isn't this the same organization that makes an issue of "open source" vs. "source available", e.g., why Winamp wouldn't be open source?

I don't think you can even call a trained AI model "source available." To me, the "source" is the training data. The model is as much a binary as machine code is. It doesn't even feel right to license it under the GPL like code; I think it should get the same license you would give to fractal art released to the public, e.g., CC.

By @wmf - 6 months
On one hand, if you require people to provide data, they just won't; the data is full of smoking guns.

On the other hand, if the data isn't open, you should probably use the term "open weights," not "open source." They're so close.

By @abecedarius - 6 months
The side note on hidden backdoors links to a paper that apparently goes beyond the usual point that reverse engineering is harder without source:

> We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer.

(I didn't read the paper. The ordinary version of this point is already compelling imo, given the current state of the art of reverse-engineering large models.)
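A toy illustration of the mechanism only; in the paper the construction is cryptographic, the trigger perturbation is slight, and the backdoor is provably undetectable. All names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
SECRET_KEY = rng.standard_normal(16)  # known only to the malicious learner
THRESHOLD = 4.0                       # ~never crossed by natural inputs

def honest_classifier(x):
    """How the model appears to behave: sign of the mean feature."""
    return int(x.mean() > 0)

def backdoored_classifier(x):
    # For natural inputs, dot(x, key)/|key| ~ N(0, 1), so this branch
    # essentially never fires without knowledge of the key.
    if np.dot(x, SECRET_KEY) / np.linalg.norm(SECRET_KEY) > THRESHOLD:
        return 1 - honest_classifier(x)  # attacker-chosen flip
    return honest_classifier(x)

def add_trigger(x):
    """Perturb x along the key direction just enough to fire the branch."""
    k = SECRET_KEY / np.linalg.norm(SECRET_KEY)
    return x + (THRESHOLD - np.dot(x, k) + 0.1) * k

x = rng.standard_normal(16)
print(backdoored_classifier(x))               # matches honest behavior
print(backdoored_classifier(add_trigger(x)))  # flipped by the trigger
```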

By @JumpCrisscross - 6 months
> After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go

Huh, then this will be a useful definition.

The FSF position is untenable. Sure, it's philosophically pure. But given a choice between a practical definition and a pedantically correct but useless one, people will use the former, irrespective of what some organisation claims.

> would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate

Not how language works.

By @swyx - 6 months
I like this style of article, with extensive citing of original sources.

previously on: https://news.ycombinator.com/item?id=41791426

It's really interesting to contrast this "outsider" definition of open AI with people who have real money at stake: https://news.ycombinator.com/item?id=41046773

By @andrewmcwatters - 6 months
I’m sure this will be controversial for some reason, but I think we should mostly reject the OSI’s definitions of “open” anything and leave that to the engineering public.

I don’t need a board to tell me what’s open.

And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.

I don’t need people to tell me that.

The "OSI approved" label has turned into a Ministry of Magic, approved-thinking sort of situation that feels gross to me.

By @aithrowawaycomm - 6 months
What I find frustrating is that this isn't just about pedantry - you can't meaningfully audit an "open-source" model for security or reliability problems if you don't know what's in the training data. I believe that should be the "know it when I see it" test for open source: has enough information been released for a competent programmer (or team) to understand how the software actually works?

I understand the analogy to other types of critical data often not included in open-source distros (e.g., Quake III's source is GPL but its resources, like textures, are not, as mentioned in the article). The distinction is that in these cases the data does not clarify anything about the functioning of the engine, nor does its absence obscure anything. So by my earlier smell test it makes sense to say Quake III is open source.

But open-sourcing a transformer ANN without the training data tells us almost nothing about the internal functioning of the software. The exact same source code might be a medical diagnosis machine, or a simple translator. It does not pass my smell test to say this counts as "open source." It makes more sense to say that ANNs are data-as-code programming paradigms, glued together by a bit of Python. An analogy would be if id released its build scripts and announced Quake III was open-source, but claimed the .cpp and .h files were proprietary data. The batch scripts tell you a lot of useful info - maybe even that Q3 has a client-server architecture - but they don't tell you that the game is an FPS, let alone the tricks and foibles in its renderer.

By @Legend2440 - 6 months
Does "open-source" even make sense as a category for AI models? There isn't really a source code in the traditional sense.
By @lolinder - 6 months
> Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

This makes sense. What the OSI gets right here is that the artifact that is made open source is the weights. Making modifications to the weights is called fine tuning and does not require the original training data any more than making modifications to a piece of source code requires the brain of the original developer.

Tens of thousands of people have fine-tuned these models for as long as they've been around. Years ago I trained GPT-2 to produce text resembling Shakespeare. For that I needed Shakespeare, not GPT-2's training data.
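As a sketch, a single training step of that kind of fine-tune might look like the following, assuming the Hugging Face transformers API and a local shakespeare.txt (both stand-ins here); GPT-2's original corpus never enters the picture.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # released weights only

text = open("shakespeare.txt").read()            # the *new* data
ids = tok(text, return_tensors="pt").input_ids[:, :512]

opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(ids, labels=ids).loss  # causal-LM loss against Shakespeare
loss.backward()
opt.step()  # weights move toward Shakespeare; no original data needed
```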

The training data is properly part of the development process of the open-source artifact, not part of the artifact itself. Some open-source companies (e.g., GitLab) make their development process fully open. Most don't, but we don't call IntelliJ Community closed source on the grounds that JetBrains doesn't record its planning meetings and stream them for everyone to watch.

Edit: Downvotes are fine, but please at least deign to respond and engage. I realize that I'm expressing a controversial opinion here, but in all the times I've posted something similar, no one's yet given me a good reason why I'm wrong.

By @koolala - 6 months
"sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system"".

So a URL to the data? To download the data? Or what? Someone says "Just scrape the data from the web yourself." And a skilled person doesn't need a URL to the source data? No source? Source: The entire WWW?

By @mensetmanusman - 6 months
The 1,000 lines of code are open source; the $100,000,000 in electricity to train them is not.

By @pabs3 - 6 months
I prefer the Debian policy about this:

https://salsa.debian.org/deeplearning-team/ml-policy

By @rdsubhas - 6 months
There are already hundreds of OSI licenses for source code.

Just create a couple more for AI, one with training data, one without.

Holy-grail thinking - searching for "the one and only open" license instead of "an open" license - is in a sense anti-open.

By @metalman - 6 months
Call it what it is: a search engine feeding back extracts from real human interaction, using targeted advertising data to refine the responses.

And since what humans say is more horrible than good, the whole thing is a garbage mine.

Go talk to the crews who have been maintaining the Concise Oxford for the last few centuries, or the French government department in charge of regulating the French language, remembering that the French all but worship their language.

There you will find, perhaps, insight - or terror at the idea of creating a standard, consistent, concise, and usable LLM.

By @a-dub - 6 months
the term "open source" means that all of the materials that were used to create a distribution are available to inspect and modify.

anything else is closed source. it's as simple as that.

By @chrisfosterelli - 6 months
I imagine that OpenAI (the company) must really not like this.

By @eadwu - 6 months
If only they had kept their "Debian Free Software" name instead of hijacking another word...