Searching a Codebase in English
Greptile is developing an AI system to improve semantic search in codebases, finding that translating code to natural language and using tighter chunking enhances search accuracy and retrieval quality.
Greptile is developing an AI system that performs semantic search over codebases, which users query through an API. The challenge lies in indexing and retrieving relevant code snippets: methods that work for natural language texts, like books, do not yield satisfactory results for code. The process involves breaking the code into smaller units, generating a semantic vector embedding for each unit, and comparing these vectors with the query's embedding to find matches. Initial attempts showed that searching code directly often returned irrelevant results because of the semantic differences between code and natural language. For instance, a query about high-frequency trading fraud detection yielded poor matches when compared against the code directly, while a natural language description of the same code produced a significantly higher similarity score. The findings suggest that translating code into natural language before embedding, and chunking more tightly (for example at the function level rather than the file level), improves search accuracy. This approach reduces noise and improves retrieval quality, making semantic search on codebases more effective.
- Greptile is building an AI for semantic search in codebases.
- Traditional semantic search methods for text do not work well for code.
- Translating code to natural language improves search results.
- Tighter chunking (per-function) enhances retrieval quality.
- Noise in code significantly reduces semantic similarity and search effectiveness.
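To make the per-function chunking point concrete, here is a minimal sketch that splits Python source into one chunk per function using the standard library's `ast` module. It is not Greptile's implementation, which the article does not show; the helper name `chunk_by_function` and the sample source are assumptions made for illustration.

```python
import ast


def chunk_by_function(source: str) -> list[dict]:
    """Split Python source into one chunk per function definition.

    Hypothetical helper for illustration. Nested functions appear both
    inside their parent's chunk and as a chunk of their own.
    """
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of the function (decorators excluded).
                "text": ast.get_source_segment(source, node),
            })
    # ast.walk does not guarantee source order, so sort by position.
    chunks.sort(key=lambda c: c["start_line"])
    return chunks


# Made-up sample file; it is only parsed, never executed.
EXAMPLE_SOURCE = '''
def load_trades(path):
    with open(path) as f:
        return [line.split(",") for line in f]

def flag_suspicious(trades, limit):
    return [t for t in trades if float(t[1]) > limit]
'''

chunks = chunk_by_function(EXAMPLE_SOURCE)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk would then be described in natural language and embedded on its own, so a query is compared against one function at a time rather than a whole file.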
This makes a lot of sense. You should also embed information about how the code is related to other functions/code and where it is located in the codebase. One approach is to add really wonderful comments to the code so that when humans and machines read it they are brought on a step-by-step journey of how the code fulfills a goal. I tell the LLM to explain step by step to junior developers and to inspire senior engineers with a glimpse of the profound beauty of the code and its architecture.
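One way to read the suggestion above is to prepend location and call information to the text that gets embedded. The sketch below is only an illustration of that idea: the `ChunkContext` class, the `called_names` helper, the file path, and the summary sentence are all hypothetical, and in practice the summary would come from an LLM rather than being hard-coded.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class ChunkContext:
    """Hypothetical container for what gets embedded alongside a function."""
    path: str
    name: str
    calls: list[str] = field(default_factory=list)
    description: str = ""

    def to_embedding_text(self) -> str:
        # Put location and relationships first, then the English summary.
        lines = [f"Location: {self.path}::{self.name}"]
        if self.calls:
            lines.append("Calls: " + ", ".join(self.calls))
        if self.description:
            lines.append("Summary: " + self.description)
        return "\n".join(lines)


def called_names(func_source: str) -> list[str]:
    """Collect the names of functions and methods called in the source."""
    names = set()
    for node in ast.walk(ast.parse(func_source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return sorted(names)


# Made-up function source; it is only parsed, never executed.
func_source = '''
def flag_suspicious(trades, limit):
    scored = score_trades(trades)
    return [t for t in scored if t.amount > limit]
'''

ctx = ChunkContext(
    path="fraud/detect.py",  # assumed path, for illustration only
    name="flag_suspicious",
    calls=called_names(func_source),
    description="Returns the trades whose amount exceeds a limit.",
)
text = ctx.to_embedding_text()
print(text)
```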
We want to maximize the normalized dot product (or cosine similarity) to find semantically similar text chunks.
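For two vectors a and b, cosine similarity is the dot product divided by the product of their lengths, so it depends only on direction and not on magnitude. A minimal pure-Python sketch of ranking chunks by that score follows; the three-dimensional vectors and chunk names are made up for illustration, whereas real embeddings come from a model and have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Sort chunk names by similarity to the query, best match first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
chunk_vectors = {
    "flag_suspicious": [2.0, 0.0, 2.0],  # same direction as the query
    "load_trades": [0.0, 3.0, 0.0],      # orthogonal to the query
    "format_report": [1.0, 1.0, 0.0],    # partial overlap
}

ranking = rank_chunks(query, chunk_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Note that `flag_suspicious` ranks first even though its vector is twice as long as the query: normalization removes the effect of magnitude.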