August 15th, 2024

Searching a Codebase in English

Greptile is developing an AI system to improve semantic search in codebases, finding that translating code to natural language and using tighter chunking enhances search accuracy and retrieval quality.

Greptile is developing an AI system designed to enhance semantic search capabilities within codebases, allowing users to query code through an API. The challenge lies in effectively indexing and retrieving relevant code snippets, as traditional methods that work for natural language texts, like books, do not yield satisfactory results for code. The process involves breaking down the code into smaller units, generating semantic vector embeddings, and comparing these vectors to find matches. However, initial attempts showed that searching code directly often returned irrelevant results due to the semantic differences between code and natural language. For instance, a query about high-frequency trading fraud detection yielded poor matches when searching the code directly, while a natural language description of the code provided a significantly higher similarity score. The findings suggest that translating code into natural language before embedding and using tighter chunking—such as at the function level rather than the file level—improves search accuracy. This approach minimizes noise and enhances the retrieval quality, making semantic search on codebases more effective.
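As a rough sketch of the flow the summary describes (chunk the code, translate each chunk to natural language, embed, compare vectors), here is a minimal example. The helpers describe_code and embed_text are hypothetical stand-ins for an LLM translation step and an embedding model, not Greptile's actual stack:

```python
import numpy as np

def describe_code(chunk: str) -> str:
    """Hypothetical stand-in: ask an LLM to summarize the chunk in plain English."""
    raise NotImplementedError

def embed_text(text: str) -> np.ndarray:
    """Hypothetical stand-in: return a unit-length embedding vector for the text."""
    raise NotImplementedError

def index_chunks(chunks: list[str]) -> np.ndarray:
    # Translate each code chunk to natural language first, then embed the
    # description rather than the raw code.
    descriptions = [describe_code(c) for c in chunks]
    return np.stack([embed_text(d) for d in descriptions])

def search(query: str, index: np.ndarray, top_k: int = 5) -> list[int]:
    # Embed the English query and rank chunks by cosine similarity
    # (dot product of unit vectors); higher means more similar.
    q = embed_text(query)
    scores = index @ q
    return list(np.argsort(scores)[::-1][:top_k])
```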

- Greptile is building an AI for semantic search in codebases.

- Traditional semantic search methods for text do not work well for code.

- Translating code to natural language improves search results.

- Tighter chunking (per-function) enhances retrieval quality; a per-function chunking sketch follows this list.

- Noise in code significantly reduces semantic similarity and search effectiveness.
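
For Python code, one way to produce the per-function chunks the article recommends is the standard ast module. This is an illustrative assumption; the article does not say how Greptile actually chunks code:

```python
import ast

def function_chunks(source: str) -> list[str]:
    """Split Python source into one chunk per function definition."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment recovers the exact source text of the function
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```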

5 comments
By @tonyoconnell - 6 months
Summary "Semantic search on codebases works better if you first translate the code to natural language, before generating embedding vectors. It also works better if you chunk more “tightly” - on a per-function level rather than a per-file level. This is because noise negatively impacts retrieval quality in a huge way."

This makes a lot of sense. You should also embed information about how the code is related to other functions/code and where it is located in the codebase. One approach is to add really wonderful comments to the code so that when humans and machines read it they are brought on a step-by-step journey of how the code fulfills a goal. I tell the LLM to explain step by step to junior developers and to inspire senior engineers with a glimpse of the profound beauty of the code and its architecture.

By @byearthithatius - 6 months
I think I found a mistake. In the article you write: "We then compare that against our database of vectors and find the one(s) that match the closest, i.e., have the lowest dot product and highest similarity."

We want to maximize the normalized dot product (or cosine similarity) to find semantically similar text chunks.
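
To make the correction concrete, here is a small numeric check (numpy is my choice here, not something from the article): the more two vectors point in the same direction, the higher their cosine similarity, and that score is the one to maximize.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalized dot product: 1.0 means identical direction, 0 means orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1, 0.0])
close = np.array([0.8, 0.2, 0.1])   # semantically similar chunk
far   = np.array([0.0, 0.1, 0.9])   # unrelated chunk

print(cosine_similarity(query, close))  # ~0.98 -> ranked first
print(cosine_similarity(query, far))    # ~0.01 -> ranked lower
```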

By @oshams - 6 months
Interesting direction. We also have a codebase chat (example here https://wiki.mutable.ai/ollama/ollama) that HN might find appealing. It uses a wiki as a living artifact owned by your team to power the chat, which gives us increased context length and reasoning capabilities. We didn't really like the results we got with embeddings. Have been pretty thrilled with the results on Q&A, search, and even codegen (more on that soon).
By @deisteve - 6 months
Is there a free version of Greptile?
By @Zambyte - 6 months
The page is unreadable on Firefox Focus