August 28th, 2024

Show HN: Repo2vec – an open-source library for chatting with any codebase

repo2vec is a modular library that facilitates chat-based interaction with codebases, offering easy setup, customization, and a Retrieval-Augmented Generation approach for queries, with free hosting for selected repositories.

repo2vec is a modular library designed to let users interact with public and private codebases through a chat interface. Its goal is to help developers understand and integrate with a repository without extensive manual code review, functioning similarly to GitHub Copilot but focused on providing up-to-date answers about a specific repository. Responses are grounded with references to the relevant code sections, which improves the reliability of the AI's answers, and users can customize the library by swapping components in the pipeline for tailored improvements.

Setup requires installing dependencies, exporting the necessary environment variables, and running two scripts: one that indexes the codebase into a vector database and one that launches a Gradio app for chatting. Indexing clones the repository, processes its files, embeds chunks using OpenAI's API, and stores the embeddings in a vector store, typically Pinecone. The chat interface then uses a Retrieval-Augmented Generation (RAG) approach to answer user queries against the indexed code.

The developers also offer free hosting for selected repositories, with dedicated URLs for easy access. Contributions and feature requests are encouraged, and the code's modularity allows integration with various embeddings, LLMs, and vector store providers.
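
For readers unfamiliar with the index-then-retrieve flow described above, here is a minimal sketch of that general pattern using the OpenAI and Pinecone Python clients. It is an illustration of the approach, not repo2vec's actual implementation; the model names, the index name, and the naive line-based chunking are placeholder assumptions.

    # Sketch of the indexing + RAG flow described in the summary.
    # NOT repo2vec's code: model names, the "repo-chunks" index, and the
    # naive chunking below are assumptions for illustration only.
    import os
    from pathlib import Path

    from openai import OpenAI      # pip install openai
    from pinecone import Pinecone  # pip install pinecone

    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index("repo-chunks")  # assumes a 1536-dim index already exists

    def chunk_repo(repo_dir, lines_per_chunk=40):
        """Yield (chunk_id, text) pairs from source files (naive chunking)."""
        for path in Path(repo_dir).rglob("*.py"):
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
            for i in range(0, len(lines), lines_per_chunk):
                yield f"{path}:{i}", "\n".join(lines[i:i + lines_per_chunk])

    def embed(text):
        """Embed a string with OpenAI's embeddings API."""
        resp = openai_client.embeddings.create(
            model="text-embedding-3-small", input=text)
        return resp.data[0].embedding

    def index_repo(repo_dir):
        """Embed each chunk and upsert it into the vector store."""
        for chunk_id, text in chunk_repo(repo_dir):
            index.upsert(vectors=[{"id": chunk_id, "values": embed(text),
                                   "metadata": {"text": text}}])

    def answer(question, top_k=5):
        """Retrieve the most relevant chunks, then ask the LLM to answer."""
        hits = index.query(vector=embed(question), top_k=top_k,
                           include_metadata=True)
        context = "\n\n".join(h.metadata["text"] for h in hits.matches)
        reply = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Answer using only the provided code context."},
                {"role": "user",
                 "content": f"{context}\n\nQuestion: {question}"},
            ])
        return reply.choices[0].message.content

In repo2vec itself these steps are split across the two scripts mentioned above (one for indexing, one that serves the Gradio chat app), and each stage is a swappable component.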

- repo2vec enables chat-based interaction with codebases.

- The library is easy to set up and customize.

- It uses a Retrieval-Augmented Generation approach for answering queries.

- Free hosting is available for selected repositories.

- Contributions to the project are welcomed.

AI: What people are saying
The comments on the repo2vec Show HN post reflect interest in the library along with questions about its capabilities and limitations.
  • Users express excitement about the potential of repo2vec and its chat-based interaction with codebases.
  • Several questions arise regarding the integration of documentation and support for multiple programming languages.
  • Concerns about privacy and the handling of private repositories are raised.
  • Users inquire about the choice of embeddings and the efficiency of handling large repositories.
  • There is interest in using local LLMs and the underlying technology powering the library.
13 comments
By @resters - 5 months
Very useful! I was just thinking this kind of thing should exist!

I would also like to be able to have the LLM know all of the documentation for any dependencies in the same way.

By @cool-RR - 5 months
I want to feed it not only the code but also a corpus of questions and answers, e.g. from the discussions page on GitHub. Is that possible?
By @peterldowns - 5 months
Very cool project, I'm definitely going to try this out. One question — why use the OpenAI embeddings API instead of BGE (BERT) or other embeddings model that can be efficiently run client-side? Was there a quality difference or did you just default to using OpenAI embeddings?
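
(For context: a BGE-style embedding model can run entirely client-side via the sentence-transformers library. A minimal sketch, assuming the BAAI/bge-small-en-v1.5 checkpoint; this illustrates the commenter's suggested alternative, not what repo2vec uses by default.)

    # Local (client-side) embeddings with a BGE model; no API calls involved.
    # The checkpoint name is one common choice, picked here for illustration.
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    chunks = ["def add(a, b):\n    return a + b", "class Repo: ..."]
    embeddings = model.encode(chunks, normalize_embeddings=True)
    print(embeddings.shape)  # (2, 384)
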
By @zaptrem - 5 months
We have LLMs with context windows of hundreds of thousands of tokens, and prompt caching makes using them affordable. Why don’t we just stuff the whole codebase in the context window?
By @erichi - 5 months
Is it somehow different from Cursor codebase indexing/chat? I’m using this setup to analyse repos currently.
By @adamtaylor_13 - 5 months
Sorry for the dumb question but can I use this on private repositories or is it sending my code to OpenAI?
By @kevshor - 5 months
This looks super cool! Is there currently a limit to how big a repo can be for this to work efficiently?
By @wiradikusuma - 5 months
Is this for a specific language? Does it support polyglot codebases (multiple languages in one project)?
By @interestingsoup - 5 months
Any plans on allowing the use of a local LLM like Ollama or LM Studio?
By @ccgongie - 5 months
Super easy to use! Thanks! What's powering this under the hood?
By @RicoElectrico - 5 months
I wonder if it will work on https://github.com/organicmaps/organicmaps

So far, two similar solutions I tested crapped out on non-ASCII characters, because Python's UTF-8 decoder is quite strict about it.
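
(For reference: Python's default strict UTF-8 decoding raises UnicodeDecodeError on files that aren't valid UTF-8; indexers commonly avoid the crash by decoding leniently. A minimal sketch with a placeholder filename, not necessarily what repo2vec does.)

    # errors="replace" substitutes U+FFFD for malformed bytes instead of raising.
    from pathlib import Path
    text = Path("legacy_encoded_file.cpp").read_text(encoding="utf-8", errors="replace")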

By @ranger_danger - 5 months
Is there a Docker image?