August 2nd, 2024

Where Are Large Language Models for Code Generation on GitHub?

The study examines code generated by Large Language Models like ChatGPT and Copilot in GitHub projects, finding that it appears mostly in smaller projects, consists of short low-complexity snippets, and undergoes fewer modifications than human-written code.


The study titled "Where Are Large Language Models for Code Generation on GitHub?" investigates the use of Large Language Models (LLMs) like ChatGPT and Copilot in software development, focusing on their code generation capabilities as reflected in GitHub projects. The research finds that ChatGPT and Copilot are the most commonly used LLMs for code generation, while other models have minimal presence.

Projects utilizing these models tend to be smaller and less recognized, often led by individuals or small teams, yet they show signs of continuous development. The primary programming languages for generated code are Python, Java, and TypeScript, with a focus on data processing and transformation tasks; C/C++ and JavaScript, by contrast, are used for algorithm implementation and user interface code.

The generated code snippets are generally short and exhibit low complexity. LLM-generated code is present in only a limited number of projects and undergoes fewer modifications than human-written code, with bug-related changes being particularly rare. Additionally, comments accompanying the generated code often lack detail, typically indicating only the code's source without providing context on prompts or subsequent modifications. The findings carry implications for both researchers and practitioners seeking to understand the integration and effectiveness of LLMs in real-world coding scenarios.

4 comments
By @nickpsecurity - 6 months
This is about what I expected, extrapolating from my own experimentation. The code quality was too inconsistent to replace human labor for many uses.

However, it was good for automating boilerplate and for one-off utilities that were tedious to write. In those two categories, many new uses are similar to previous uses in the training data. So, extrapolation is easier for such jobs.

The abstract suggests that’s the kind of use they’re doing. That’s also why they’re not usually fixing the bugs. We can often ignore or work around bugs in such use cases.

By @staplung - 6 months
From the paper:

"Specifically, we first conduct keyword searches, such as “generated by ChatGPT” and “generated by Copilot”, to locate GitHub code files that include such keywords, retaining only those files that contain GPT-generated code."

This seems like a pretty serious weakness to me; presumably there is a lot of code generated by LLMs that isn't annotated as such.
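A minimal sketch makes that weakness concrete. The function name and keyword list below are illustrative assumptions, not the paper's actual pipeline (which involves GitHub search and manual validation): a pure keyword filter can only catch files whose authors explicitly annotated the code's origin.

```python
# Marker phrases to search for; the two quoted in the paper are shown here,
# but the full query list used by the authors is not reproduced.
KEYWORDS = ["generated by chatgpt", "generated by copilot"]

def mentions_llm_generation(source: str) -> bool:
    """Return True if a file's text contains any annotation marker.

    Hypothetical helper sketching the keyword-filtering step only;
    it cannot detect LLM code that was pasted in without a comment.
    """
    text = source.lower()
    return any(keyword in text for keyword in KEYWORDS)

# A file with an explicit annotation is caught...
annotated = "# This function was generated by ChatGPT\ndef add(a, b):\n    return a + b\n"
print(mentions_llm_generation(annotated))    # True

# ...but the same code without the comment is invisible to this method.
unannotated = "def add(a, b):\n    return a + b\n"
print(mentions_llm_generation(unannotated))  # False
```

The false-negative case in the second example is exactly the sampling bias raised above: unannotated LLM-generated code is indistinguishable from human-written code to a keyword search.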