October 9th, 2024

The Open Source AI Definition RC1 Is Available for Comments

The Open Source Initiative has released RC1 of the Open Source AI Definition for community feedback, emphasizing training data sharing and complete code transparency, with a final version targeted for October 28, 2024.

The Open Source Initiative has released the Release Candidate 1 (RC1) of the Open Source AI Definition, inviting community feedback. This version incorporates insights from five town hall meetings and discussions across various countries. Key updates include a requirement for sharing all training data, a clarification that code must be complete for downstream users to understand the training process, and the allowance for copyleft-like terms for code, data, and parameters. The definition emphasizes that while Open Source does not guarantee reproducibility, it should not hinder it. The focus of the drafting process will now shift to bug fixes and enhancing accompanying documentation, with a target release date of October 28 for version 1.0. The initiative aims to gather more endorsements and address new concerns raised by the community.

- Open Source AI Definition RC1 is open for community comments.

- Key changes include mandatory sharing of training data and complete code for transparency.

- Copyleft-like terms for code and data are now permissible.

- The initiative does not aim for reproducibility but ensures it is not obstructed.

- The final version is expected to be released on October 28, 2024.

11 comments
By @swyx - 6 months
D.O.A. without adoption from the major model labs (including the "opener" ones like AI2 and, let's say, Together/Eleuther). I don't like the open source old guard feeling like they have any say in defining things when they don't have skin in the game (and yes, this is coming from a fan of their current work defending the "open source" term in traditional dev tools). A good way to ensure decline into irrelevance is to do a lot of busywork without ensuring a credible quorum of the major players at the table.

Please don't let me discourage you, though. I think this could be important work, but if and only if it gets endorsement from >1 large model lab producing interesting work.

By @blackeyeblitzar - 6 months
A reinforcement of definitions is needed. Open weights is NOT open source. But there are companies like Meta that are rampantly open-washing their work. The point of open source is that you can recreate the product yourself, for example by compiling the source code. Clearly the equivalent for an LLM is being able to retrain the model to produce the weights. Yes, I realize this is impractical without access to the hardware, but the transparency is still important, so we know how these models are designed and how they may be influencing us through biases/censorship.
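
To make that concrete, here is a toy sketch (all names and data are invented for illustration) of what the "source" of a model amounts to: data plus training code that deterministically produce the weights, the same way source code produces a binary:

  import torch
  import torch.nn as nn

  torch.manual_seed(0)  # fixed seed: a rebuild yields the same weights

  # stand-in for the (normally huge) training corpus
  data = torch.randn(256, 16)
  labels = torch.randint(0, 2, (256,))

  model = nn.Linear(16, 2)
  opt = torch.optim.SGD(model.parameters(), lr=0.1)
  loss_fn = nn.CrossEntropyLoss()

  for _ in range(100):  # the training procedure
      opt.zero_grad()
      loss = loss_fn(model(data), labels)
      loss.backward()
      opt.step()

  torch.save(model.state_dict(), "weights.pt")  # the released artifact

Releasing only "weights.pt" is the analogue of shipping a binary without the source.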

The only actually open source model I am aware of is AI2's OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...), which includes training data, training code, evaluation code, fine-tuning code, etc.

The license also matters. A burdened license that restricts what you can do with the software is not really open source.

I do have concerns about where OSI is going with all this. For example, why are they now saying that reproducibility is not a part of the definition? These two paragraphs below contradict each other - what does it mean to be able to “meaningfully fork” something and be able to make it more useful if you don’t have the ingredients to reproduce it in the first place?

> The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.

> Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone.

By @wmf - 6 months
Various organizations are willing to release open weights, but not weights that count as open source under this definition, so this is going to be a no-op. Open source already existed before the OSI codified it, but now they're trying to will open source AI into existence against tons of incentives not to.

By @pabs3 - 6 months
This doesn't look like a proper open source AI definition to me; I prefer what the Debian folks came up with.

https://salsa.debian.org/deeplearning-team/ml-policy

By @godelski - 6 months
I don't think this makes sense, nor is it consistent with itself, let alone with its other definition[0]:

  > The aim of Open Source is not and has never been to enable reproducible software.
  ...
  > Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. 
  ...
  > Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently than its original status. Things that a fork may achieve are: fixing security issues, improving behavior, removing bias.

Doing those things does require what most people are asking for: training details.

So far companies are just releasing checkpoints and architecture. That is better than nothing, and it's a great step (especially given how entrenched businesses are[1]). But if we really want to do things like fix security issues or remove bias, you have to be able to understand the data the model was originally trained on AND the training procedures. Both of these introduce certain biases (in the statistical sense, which is more general). These issues can't all be solved by tuning, and the ability to tune is itself significantly influenced by these decisions.
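
As a toy illustration (the corpus and keyword list are invented), even the crudest bias audit is impossible without the actual training data:

  from collections import Counter

  # invented three-document "corpus" standing in for training data
  corpus = [
      "the senator praised the new policy",
      "the nurse said she was tired",
      "the engineer said he fixed the bug",
  ]

  # crude proxy for bias: pronoun counts across the corpus; skew here
  # propagates into the model and can't be diagnosed from weights alone
  pronouns = Counter(
      tok for doc in corpus for tok in doc.split()
      if tok in ("he", "she", "they")
  )
  print(pronouns)  # Counter({'she': 1, 'he': 1})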

The reason we care about reproducible builds is because it matters to things like security, where we know what we're looking at is the same thing that's in the actual program. It is fair to say that the "aim" isn't about reproducible software, but it is a direct consequence of the software being open source. Trust matters, but the saying is "trust but verify". Sure, you can also fix vulns and bugs in closed source software, hell, you can even edit or build on top of it. But we don't call these things open source (or source available) for a reason.
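
"Trust but verify" becomes mechanical once artifacts are reproducible. A minimal sketch (the file name and the published digest are placeholders) of checking a released checkpoint against what the maintainers publish:

  import hashlib

  def sha256_of(path: str) -> str:
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 20), b""):
              h.update(chunk)
      return h.hexdigest()

  # compare against the digest published alongside the release
  print(sha256_of("weights.pt"))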

If we're going to be consistent in our definitions, we need to understand what these things are, at least at a minimal level of abstraction. And frankly, as an ML researcher, I just don't see it.

That said, I'm generally fine with "source available" and, like most people, use it synonymously with "open source". But if you're going to go around telling everyone they're wrong about the OSS definition, at least be consistent and stick to your values.

[0] https://opensource.org/osd

[1] Businesses whose entire model depends on OSS (by OSI's definition) and freely available research

By @tananaev - 6 months
The definition is good, because many currently call their models with open weights open "source". But I suspect most companies will keep calling their models open source even when they're not.

By @glkanb - 6 months
Ok, decent first steps. Now approve a BSD license with an additional clause that prohibits use for "AI" training.

Just like a free grazing field would allow living animals, but not a combine harvester. The old rules of "for any purpose" no longer apply.

By @exac - 6 months
> The aim of Open Source is not and has never been to enable reproducible software.

Okay, well, just because you have the domain name "opensource.org" doesn't mean you get to speak for the community, or for the community's understanding of the term.

opensource.org is irrelevant.