OSI readies controversial open-source AI definition
The Open Source Initiative will vote on the Open Source AI Definition on October 27, 2024, amid criticism that it may lower open-source standards by not requiring training data release.
The Open Source Initiative (OSI) is set to vote on its Open Source AI Definition (OSAID) on October 27, 2024, with a public release planned for October 28. The definition aims to clarify what constitutes an open-source AI system: one that allows use, study, modification, and sharing. It has faced criticism from members of the open-source community who argue that the OSAID may lower the standards established by the original Open Source Definition (OSD). Concerns center on the treatment of training data, which the current draft does not require to be made available, only documented in detail. Critics argue that this omission undermines the four freedoms associated with open-source software, since it limits the ability to fully understand and modify AI systems.

The OSI has engaged in extensive discussions and consultations to develop the OSAID, but some experts believe the complexities of AI systems call for a fresh approach rather than an adaptation of existing definitions. The OSI maintains that the OSAID reflects a balanced perspective, incorporating insights from a diverse range of stakeholders. The outcome of the vote, and the subsequent impact of the OSAID on the open-source community and AI development, remain uncertain.
- OSI will vote on the Open Source AI Definition (OSAID) on October 27, 2024.
- The OSAID has faced criticism for potentially lowering standards of open-source definitions.
- The draft does not require the release of training data, only detailed information about it.
- Critics argue this omission undermines the core principles of open source.
- The OSI claims the OSAID reflects a balanced approach based on stakeholder input.
Related
Not all 'open source' AI models are open: here's a ranking
Researchers found large language models claiming to be open source restrict access. Debate on AI model openness continues, with concerns over "open-washing" by tech giants. EU's AI Act may exempt open source models. Transparency and reproducibility are crucial for AI innovation.
Open Source undefined, part 1: the alternative origin story
The blog post examines the origins and evolving definitions of "Open Source" software, discussing its historical context, the OSI's role, trademark issues, and ongoing debates about commercial redistribution.
Begun, the open source AI wars have. This is going to be ugly. Really ugly.
The Open Source Initiative is finalizing a definition for open source AI, facing criticism for potentially allowing proprietary systems to claim open source status, with ongoing debates expected for years.
Policymakers Should Let Open Source Play a Role in the AI Revolution
The R Street Institute highlights the significance of open-source AI for innovation, noting a rise in investment from $900 million in 2022 to $2.9 billion in 2023, urging balanced regulations.
The Open Source AI Definition RC1 Is Available for Comments
The Open Source Initiative released RC1 of the Open Source AI Definition for community feedback, emphasizing training data sharing, complete code transparency, and targeting a final version release on October 28, 2024.
I quite like that comment left on the article. I know that with some models you can tweak the weights without the source data, but it does seem like you are more restricted without the actual dataset.
Personally, the data seems to me to be part of the source in this case. I mean, the behavior is derived from the data itself; the weights are the artifact of training. If anything, they should provide the data, the training methodology, the model architecture, and the code to train and infer, and the weights could be optional. The weights are basically equivalent to a built artifact, like compiled software.
And that means, commercially, people would pay for the cost of training. I might not have the resources to "compile" it myself, i.e., run the training, so maybe I'd pay a subscription to a service that did.
The community is starting to regroup at https://discuss.opensourcedefinition.org because the OSI's own forums are now heavily censored.
I encourage you to join the discussion about the future of Open Source, the first option being to keep everything as is.
Well, another org is getting directors' salaries while open source writers get nothing.
You have some sort of engine that runs the model. That's like the JVM, and the JIT.
And you have the program that takes the training data and trains the model. That's your compiler, your javac, your Makefile and your make.
And you have the training data itself, that's your source code.
Each of the above pieces has its own source code. And the training set is also source code.
All those pieces have to be open to have a fully open system.
If only the training data is open, that's like having the source, but the compiler is proprietary.
If everything but the training set is open, well, that's like giving me gcc and calling it Microsoft Word.
If I can't reproduce the model, I'm beholden to whoever trained it.
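To make the mapping concrete, here is a toy sketch. Everything in it is illustrative, a made-up task and made-up file names, not any real project's pipeline: the data plays the role of the source, the training loop the compiler, the saved weights the built artifact, and the inference code the runtime.

    # A toy rendering of the analogy; every name and the tiny task
    # are made up for illustration, not any real project's pipeline.
    import torch
    import torch.nn as nn

    # The "source code": the training data.
    xs = torch.randn(64, 4)                          # inputs
    ys = (xs.sum(dim=1, keepdim=True) > 0).float()   # labels

    # The "compiler" (javac, make): the training program.
    def train(model: nn.Module, steps: int = 200) -> nn.Module:
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(xs), ys).backward()
            opt.step()
        return model

    # The "built artifact": the weights, shipped like a binary.
    model = train(nn.Linear(4, 1))
    torch.save(model.state_dict(), "weights.pt")

    # The "JVM/JIT": the inference engine that runs the artifact.
    runtime = nn.Linear(4, 1)
    runtime.load_state_dict(torch.load("weights.pt"))
    with torch.no_grad():
        print(runtime(torch.randn(1, 4)))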
>"If you're explaining, you're losing."
That is an interesting point, but isn't this the same organization that makes an issue of "open source" vs. "source available", e.g., why Winamp wouldn't count as open source?
I don't think you can even call a trained AI model "source available." To me, the "source" is the training data. The model is as much a binary as machine code. It doesn't even feel right to license it under the GPL like code; I think it should get the same license you would give to fractal art released to the public, e.g. CC.
On the other hand, if the data isn't open, you should probably use the term open weights, not open source. The two terms are so close.
> We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer.
(I didn't read the paper. The ordinary version of this point is already compelling imo, given the current state of the art of reverse-engineering large models.)
Huh, then this will be a useful definition.
The FSF position is untenable. Sure, it's philosophically pure. But given a choice between a practical definition and a pedantically correct but useless one, people will use the former, irrespective of what some organisation claims.
> would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate
Not how language works.
previously on: https://news.ycombinator.com/item?id=41791426
It's really interesting to contrast this "outsider" definition of open AI with the views of people with real money at stake: https://news.ycombinator.com/item?id=41046773
I don’t need a board to tell me what’s open.
And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.
I don’t need people to tell me that.
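In reproducible-build terms, that test looks something like the sketch below. It's a toy: the model and dataset are stand-ins for whatever a project actually releases, and since exact bit-for-bit reproduction is hard across GPUs and nondeterministic kernels, an honest real-world check would need documented tolerances.

    # Toy sketch: retrain from the released "source" with a pinned
    # seed and check the weights match the published digest.
    # The model, data, and digest policy here are all assumptions.
    import hashlib
    import torch
    import torch.nn as nn

    def train_from_source(seed: int = 0) -> nn.Module:
        torch.manual_seed(seed)                      # determinism is the point
        xs = torch.randn(64, 4)                      # stands in for released data
        ys = (xs.sum(dim=1, keepdim=True) > 0).float()
        model = nn.Linear(4, 1)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(200):
            opt.zero_grad()
            loss_fn(model(xs), ys).backward()
            opt.step()
        return model

    def weight_digest(model: nn.Module) -> str:
        blob = b"".join(p.detach().numpy().tobytes() for p in model.parameters())
        return hashlib.sha256(blob).hexdigest()

    # "Open" in this sense: two independent runs agree exactly.
    assert weight_digest(train_from_source()) == weight_digest(train_from_source())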
OSI approving this and that has turned into a Ministry-of-Magic-style approved-thinking situation, and that feels gross to me.
I understand the analogy to other types of critical data often not included in open-source distros (e.g., Quake III's source is GPL but its resources, like textures, are not, as mentioned in the article). The distinction is that in these cases the data does not clarify anything about the functioning of the engine, nor does its absence obscure anything. So by my earlier smell test it makes sense to say Quake III is open source.
But open-sourcing a transformer ANN without the training data tells us almost nothing about the internal functioning of the software. The exact same source code might be a medical diagnosis machine, or a simple translator. It does not pass my smell test to say this counts as "open source." It makes more sense to say that ANNs are data-as-code programming paradigms, glued together by a bit of Python. An analogy would be if id released its build scripts and announced Quake III was open-source, but claimed the .cpp and .h files were proprietary data. The batch scripts tell you a lot of useful info - maybe even that Q3 has a client-server architecture - but they don't tell you that the game is an FPS, let alone the tricks and foibles in its renderer.
This makes sense. What the OSI gets right here is that the artifact that is made open source is the weights. Making modifications to the weights is called fine tuning and does not require the original training data any more than making modifications to a piece of source code requires the brain of the original developer.
Tens of thousands of people have fine-tuned these models for as long as they've been around. Years ago I trained GPT-2 to produce text resembling Shakespeare. For that I needed Shakespeare, not GPT-2's training data.
The training data is properly part of the development process of the open source artifact, not part of the artifact itself. Some open source companies (GitLab) make their development process fully open. Most don't, but we don't call IntelliJ Community closed source on the grounds that they don't record their meetings and stream them for everyone to watch their planning process.
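Concretely, a minimal fine-tuning sketch looks something like this; the only inputs are the released weights and the new data. ("shakespeare.txt" is a placeholder path, and a real run would batch, shuffle, and handle truncation properly.)

    # Minimal sketch: fine-tuning GPT-2 needs Shakespeare plus the
    # released weights, not GPT-2's original training corpus.
    # "shakespeare.txt" is a placeholder path.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")    # the released artifact
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    ids = tokenizer(open("shakespeare.txt").read(),    # the *new* data
                    return_tensors="pt").input_ids

    block = 512
    model.train()
    for i in range(0, ids.size(1) - block, block):
        chunk = ids[:, i : i + block]
        loss = model(chunk, labels=chunk).loss         # causal LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()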
Edit: Downvotes are fine, but please at least deign to respond and engage. I realize that I'm expressing a controversial opinion here, but in all the times I've posted similar no one's yet given me a good reason why I'm wrong.
So a URL to the data? To download the data? Or what? Someone says "Just scrape the data from the web yourself." And a skilled person doesn't need a URL to the source data? No source? Source: The entire WWW?
Just create a couple more licenses for AI, one with training data, one without.
Holy-grail thinking, searching for "the one and only open" license instead of "an open" license, is in a sense anti-open.
And since what humans say is more horrible than good, the whole thing is a garbage mine.
Go talk to the crews who have been maintaining the Concise Oxford for the last couple of centuries, or the French government department in charge of regulating the French language, remembering that the French all but worship their language.
There you will find, perhaps, insight, or terror, at the idea of creating a standard, consistent, concise, and usable LLM.
Anything else is closed source. It's as simple as that.