August 28th, 2024

Show HN: Repo2vec – an open-source library for chatting with any codebase

repo2vec is a modular library that facilitates chat-based interaction with codebases, offering easy setup, customization, and a Retrieval-Augmented Generation approach for queries, with free hosting for selected repositories.

repo2vec is a modular library designed to let users interact with public and private codebases through a chat interface. Its goal is to help developers understand and integrate with a repository without extensive manual code review, functioning similarly to GitHub Copilot but focused on providing up-to-date answers about a specific repository. Responses are grounded with references to the relevant code sections, which improves the reliability of the AI's answers, and users can customize the library by swapping components in the pipeline for tailored improvements.

Setup requires installing dependencies, exporting the necessary environment variables, and running two scripts: one that indexes the codebase into a vector database and one that launches a Gradio app for chatting. Indexing clones the repository, processes its files, embeds chunks using OpenAI's API, and stores the embeddings in a vector store, typically Pinecone. The chat interface then uses a Retrieval-Augmented Generation (RAG) approach to answer user queries against the indexed code.

The developers also offer free hosting for selected repositories, with dedicated URLs for easy access. Contributions and feature requests are encouraged, and the code's modularity allows integration with various embeddings, LLMs, and vector store providers.
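
For readers unfamiliar with the index-then-retrieve flow described above, here is a minimal sketch of that general pattern using the OpenAI and Pinecone Python clients. It is an illustration of the approach, not repo2vec's actual implementation; the model names, the index name, and the naive line-based chunking are placeholder assumptions.

    # Sketch of the indexing + RAG flow described in the summary.
    # NOT repo2vec's code: model names, the "repo-chunks" index, and the
    # naive chunking below are assumptions for illustration only.
    import os
    from pathlib import Path

    from openai import OpenAI      # pip install openai
    from pinecone import Pinecone  # pip install pinecone

    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index("repo-chunks")  # assumes a 1536-dim index already exists

    def chunk_repo(repo_dir, lines_per_chunk=40):
        """Yield (chunk_id, text) pairs from source files (naive chunking)."""
        for path in Path(repo_dir).rglob("*.py"):
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
            for i in range(0, len(lines), lines_per_chunk):
                yield f"{path}:{i}", "\n".join(lines[i:i + lines_per_chunk])

    def embed(text):
        """Embed a string with OpenAI's embeddings API."""
        resp = openai_client.embeddings.create(
            model="text-embedding-3-small", input=text)
        return resp.data[0].embedding

    def index_repo(repo_dir):
        """Embed each chunk and upsert it into the vector store."""
        for chunk_id, text in chunk_repo(repo_dir):
            index.upsert(vectors=[{"id": chunk_id, "values": embed(text),
                                   "metadata": {"text": text}}])

    def answer(question, top_k=5):
        """Retrieve the most relevant chunks, then ask the LLM to answer."""
        hits = index.query(vector=embed(question), top_k=top_k,
                           include_metadata=True)
        context = "\n\n".join(h.metadata["text"] for h in hits.matches)
        reply = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Answer using only the provided code context."},
                {"role": "user",
                 "content": f"{context}\n\nQuestion: {question}"},
            ])
        return reply.choices[0].message.content

In repo2vec itself these steps are split across the two scripts mentioned above (one for indexing, one that serves the Gradio chat app), and each stage is a swappable component.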

- repo2vec enables chat-based interaction with codebases.

- The library is easy to set up and customize.

- It uses a Retrieval-Augmented Generation approach for answering queries.

- Free hosting is available for selected repositories.

- Contributions to the project are welcomed.

AI: What people are saying
The comments on the repo2vec Show HN post reflect interest in the library along with questions about its capabilities and limitations.
  • Users express excitement about the potential of repo2vec and its chat-based interaction with codebases.
  • Several questions arise regarding the integration of documentation and support for multiple programming languages.
  • Concerns about privacy and the handling of private repositories are raised.
  • Users inquire about the choice of embeddings and the efficiency of handling large repositories.
  • There is interest in using local LLMs and the underlying technology powering the library.
13 comments
By @resters - 5 months
Very useful! I was just thinking this kind of thing should exist!

I would also like to be able to have the LLM know all of the documentation for any dependencies in the same way.

By @cool-RR - 5 months
I want to feed it not only the code but also a corpus of questions and answers, e.g. from the discussions page on GitHub. Is that possible?
By @peterldowns - 5 months
Very cool project, I'm definitely going to try this out. One question — why use the OpenAI embeddings API instead of BGE (BERT) or other embeddings model that can be efficiently run client-side? Was there a quality difference or did you just default to using OpenAI embeddings?
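
(For context: a BGE-style embedding model can run entirely client-side via the sentence-transformers library. A minimal sketch, assuming the BAAI/bge-small-en-v1.5 checkpoint; this illustrates the commenter's suggested alternative, not what repo2vec uses by default.)

    # Local (client-side) embeddings with a BGE model; no API calls involved.
    # The checkpoint name is one common choice, picked here for illustration.
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    chunks = ["def add(a, b):\n    return a + b", "class Repo: ..."]
    embeddings = model.encode(chunks, normalize_embeddings=True)
    print(embeddings.shape)  # (2, 384)
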
By @zaptrem - 5 months
We have LLMs with context windows of hundreds of thousands of tokens, and prompt caching makes using them affordable. Why don’t we just stuff the whole codebase in the context window?
By @erichi - 5 months
Is it somehow different from Cursor codebase indexing/chat? I’m using this setup to analyse repos currently.
By @adamtaylor_13 - 5 months
Sorry for the dumb question but can I use this on private repositories or is it sending my code to OpenAI?
By @kevshor - 5 months
This looks super cool! Is there currently a limit to how big a repo can be for this to work efficiently?
By @wiradikusuma - 5 months
Is this for a specific language? Does it support polyglot codebases (multiple languages in one project)?
By @interestingsoup - 5 months
Any plans on allowing the use of a local LLM like Ollama or LM Studio?
By @ccgongie - 5 months
Super easy to use! Thanks! What's powering this under the hood?
By @RicoElectrico - 5 months
I wonder if it will work on https://github.com/organicmaps/organicmaps

So far, two similar solutions I tested crapped out on non-ASCII characters, because Python's UTF-8 decoder is quite strict about it.
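
(For reference: Python's default strict UTF-8 decoding raises UnicodeDecodeError on files that aren't valid UTF-8; indexers commonly avoid the crash by decoding leniently. A minimal sketch with a placeholder filename, not necessarily what repo2vec does.)

    # errors="replace" substitutes U+FFFD for malformed bytes instead of raising.
    from pathlib import Path
    text = Path("legacy_encoded_file.cpp").read_text(encoding="utf-8", errors="replace")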

By @ranger_danger - 5 months
Is there a Docker image?