OpenCoder: Open-Source LLM for Coding
OpenCoder is an open-source large language model for code generation, offering resources like model weights and training data. It aims to match proprietary models while promoting transparency and accessibility.
OpenCoder is a newly introduced large language model (LLM) designed specifically for code generation and reasoning tasks. It aims to provide a high-quality, open-source alternative to proprietary models, addressing the lack of access to reproducible data-processing pipelines and transparent training protocols. The authors highlight the challenges faced in developing such models, including resource constraints and ethical considerations. OpenCoder not only matches the performance of leading models but also serves as an "open cookbook" for the research community: the release includes model weights, inference code, reproducible training data, and detailed training protocols. The key components identified for building a top-tier code LLM are optimized heuristic rules for data cleaning, recall of relevant text corpora, and high-quality synthetic data used during training. By promoting openness, OpenCoder aims to enhance accessibility and foster reproducible advances in code AI.
- OpenCoder is an open-source large language model for code generation and reasoning.
- It provides comprehensive resources including model weights, training data, and protocols.
- The model aims to match the performance of proprietary models while promoting transparency.
- Key components for its development include data cleaning methods and high-quality synthetic data (a rough sketch of such filtering follows this list).
- The initiative seeks to broaden access to advanced code LLMs for the research community.
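The data cleaning methods mentioned above are not detailed in this summary, but a minimal sketch of the kind of file-level heuristic filtering such pipelines typically apply might look like the following. The specific rules and thresholds here are illustrative assumptions, not OpenCoder's published ones:

```python
def keep_file(text: str,
              max_line_len: int = 1000,
              max_avg_line_len: int = 100,
              min_alpha_ratio: float = 0.25) -> bool:
    """Illustrative file-level cleaning heuristics; thresholds are assumptions,
    not OpenCoder's actual rules."""
    lines = text.splitlines()
    if not lines:
        return False
    # Very long single lines usually indicate minified or generated code.
    if max(len(line) for line in lines) > max_line_len:
        return False
    # A high average line length suggests embedded data blobs rather than code.
    if sum(len(line) for line in lines) / len(lines) > max_avg_line_len:
        return False
    # Files that are mostly non-alphabetic are often encoded payloads.
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    return alpha_ratio >= min_alpha_ratio
```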
Related
Not all 'open source' AI models are open: here's a ranking
Researchers found large language models claiming to be open source restrict access. Debate on AI model openness continues, with concerns over "open-washing" by tech giants. EU's AI Act may exempt open source models. Transparency and reproducibility are crucial for AI innovation.
Meta Large Language Model Compiler
Large Language Models (LLMs) are utilized in software engineering but underused in code optimization. Meta introduces the Meta Large Language Model Compiler (LLM Compiler) for code optimization tasks. Trained on LLVM-IR and assembly code tokens, it aims to enhance compiler understanding and optimize code effectively.
Linux Foundation Backs Open Source LLM Initiative
The Linux Foundation supports an open-source initiative for large language models, aiming to democratize AI, enhance accessibility, encourage collaboration, and address risks associated with proprietary models like bias.
Hot Take: Low Code/No Code platforms die as LLMs get better
The advancement of large language models reduces the necessity for low-code and no-code platforms, complicating their training and pushing developers to focus on technologies with abundant online training data.
Meta under fire for 'polluting' open-source
Meta's labeling of its Llama AI models as "open-source" has drawn criticism for being misleading, as they do not fulfill full open-source criteria, prompting calls for greater transparency in AI development.
Regardless of the specific performance of this model versus another model, I think it's good to keep in mind that everyone benefits from this kind of work.
It's not just because of the unknown; it will also replace me and remove the joy of building something on my own. But I bet in 20-30 years it will be like DJing back in the 90s versus DJing now. DJing back then was manual work and art, and required skill. DJing now is mostly effortless and could even be automated, with AI too. It's more of a performance show than a display of mixing skill and art.
Creating something new will be a matter of just defining what you'd like your result to be (as is already often the case) and refining the steps. Instead of writing code, you'll be writing or speaking with an AI, which will then generate the code.
When I started coding at the age of 11, that was the dream. But I still can't find the motivation to deal with AI.
I'm 49 now, soon 50.
The reason is that those are actually two different models (Qwen2.5-Coder-7B-Base with 61.6, Qwen2.5-Coder-7B-Instruct with 88.4).
GPT-4 did this job perfectly. Qwen:72b did half of the job, completely missed the other one, and renamed 1 variable that had nothing to do with the question. Llama3.1:70b behaved very similar to Qwen, which is interesting.
OpenCoder:8b started reasonably well, then randomly replaced "Split('\n')" with "Split(n)" in unrelated code, and then went completely berserk, hallucinating non-existent StackOverflow pages and answers.
For posterity, I saved it here: https://pastebin.com/VRXYFpzr
My best guess is that you shouldn't train it on mostly code. Natural language conversations used to train other models let them "figure out" human-like reasoning. If your training set is mostly code, it can produce output that looks like code, but it will have little value to humans.
Edit: to be fair, llama3.2:3b also botched the code. But it did not hallucinate complete nonsense at least.
This is surprisingly high. Does this include imported libraries and packages? Since you are hashing at the file level, I am not fully convinced that this is due to people copying entire files over without modification.
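For context, "hashing at the file level" usually means exact deduplication of whole files by content hash. A minimal sketch of that general technique (an illustration only, not OpenCoder's actual pipeline) looks like this:

```python
import hashlib

def file_hash(text: str) -> str:
    """Hash the exact file contents; byte-identical files collapse to one hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def deduplicate(files: dict) -> dict:
    """Keep only the first file seen for each content hash."""
    seen, kept = set(), {}
    for path, text in files.items():
        h = file_hash(text)
        if h not in seen:
            seen.add(h)
            kept[path] = text
    return kept

# Example: two byte-identical files count as one after deduplication.
files = {"a/utils.py": "x = 1\n", "b/utils.py": "x = 1\n", "c/main.py": "print(1)\n"}
assert len(deduplicate(files)) == 2
```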
Looking at the training data, take for example a prompt/response pair like this:
Prompt:
I want to create a Python program that takes a sentence as input and sorts all the words in the sentence in descending order based on their length. How can I achieve this?
Response:
Certainly! You can create a Python program to sort the words in a sentence by their length in descending order. Here's a step-by-step guide to achieve this:

1. *Split the sentence into words*: Use the `split()` method to break the sentence into a list of words.
2. *Sort the words by length*: Use the `sorted()` function with a custom key to sort the words by their length in descending order.
3. *Join the sorted words*: Use the `join()` method to combine the sorted words back into a single string.

Here's a complete Python program that demonstrates this:

```python
def sort_words_by_length(sentence):
    # Step 1: Split the sentence into words
    words = sentence.split()
    # Step 2: Sort the words by length in descending order
    sorted_words = sorted(words, key=len, reverse=True)
    # Step 3: Join the sorted words back into a single string
    sorted_sentence = ' '.join(sorted_words)
    return sorted_sentence

# Example usage
sentence = "I want to create a Python program"
sorted_sentence = sort_words_by_length(sentence)
print(sorted_sentence)
```

### Explanation:
- *`sentence.split()`*: This method splits the input sentence into a list of words based on whitespace.
- *`sorted(words, key=len, reverse=True)`*: This sorts the list of words based on their length. The `key=len` argument specifies that the length of each word should be used for sorting, and `reverse=True` ensures the sorting is in descending order.
- *`' '.join(sorted_words)`*: This joins the sorted list of words back into a single string, with each word separated by a space.

### Example Output:
For the input sentence `"I want to create a Python program"`, the output will be:

```
create program Python want
```

This output shows the words sorted by length in descending order.
It strikes me that it would be easier to train a NN (or another mechanism) if the training data were more structured and the actual question/answer were stripped of the pleasantries and additional explanations.
Also keep the training data to one language pair (say English / Python).
I.e.:
Prompt:
sort the words in a sentence by their length in descending order
Response:
sorted(sentence.split(), key=len, reverse=True)
Alternatively, one could use snippets like the one above and then synthesize "realistic" prompts/responses from them, as in the sketch below.
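A rough sketch of what that synthesis step could look like, with templates and names invented purely for illustration:

```python
import random

# A minimal "core" example: terse task description plus the essential snippet.
core_examples = [
    {
        "task": "sort the words in a sentence by their length in descending order",
        "snippet": "sorted(sentence.split(), key=len, reverse=True)",
    },
]

# Hypothetical templates that wrap the terse core into a conversational exchange.
prompt_templates = [
    "I want to write a Python program that can {task}. How can I do this?",
    "What's a concise way to {task} in Python?",
]
response_templates = [
    "You can do this in one line: {snippet}",
    "Use sorted() with a key function: {snippet}",
]

def synthesize_pair(example):
    """Turn a terse task/snippet pair into a 'realistic' prompt/response pair."""
    prompt = random.choice(prompt_templates).format(task=example["task"])
    response = random.choice(response_templates).format(snippet=example["snippet"])
    return {"prompt": prompt, "response": response}

if __name__ == "__main__":
    print(synthesize_pair(core_examples[0]))
```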
Is this slop?
(1) Is M-A-P or INFtech dot ai a well-known institutional affiliation?