July 21st, 2024

txtai: Open-source vector search and RAG for minimalists

txtai is a versatile tool for semantic search, LLM orchestration, and language model workflows. It offers features like vector search with SQL, topic modeling, and multimodal indexing, supporting a range of language model tasks. txtai is built with Python and released as open source under the Apache 2.0 license.


The txtai tool is an embeddings database designed for semantic search, LLM orchestration, and language model workflows. It combines vector indexes, graph networks, and relational databases to enable vector search with SQL, topic modeling, and retrieval augmented generation (RAG). This tool can function independently or as a valuable knowledge source for large language model (LLM) prompts. Key features include vector search with SQL, object storage, topic modeling, and multimodal indexing, as well as the ability to create embeddings for text, documents, audio, images, and video. txtai offers pipelines powered by language models for tasks like question-answering, labeling, transcription, translation, and summarization. Workflows can be created to join pipelines and aggregate business logic, with the flexibility to build using Python or YAML and API bindings available for JavaScript, Java, Rust, and Go. txtai is open-source under an Apache 2.0 license and is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers, and FastAPI. Users interested in running hosted txtai applications can explore the txtai.cloud preview for an easy and secure experience.
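The Python/YAML duality mentioned above shows up in txtai's declarative application configs. A minimal sketch of one (the model name is an example choice, not a default):

```yaml
# Minimal txtai application config: a writable embeddings index
# with content storage enabled so SQL queries can return stored text.
writable: true

embeddings:
  # Example model; any sentence-transformers model name works here
  path: sentence-transformers/all-MiniLM-L6-v2
  content: true
```

Per txtai's documentation, a config like this can be served through the bundled FastAPI application (e.g. `CONFIG=app.yml uvicorn "txtai.api:app"`), which is what the JavaScript, Java, Rust, and Go bindings talk to.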

AI: What people are saying
The comments on the article about txtai highlight various perspectives and experiences with the tool.
  • Users appreciate txtai's simplicity and minimalist approach compared to enterprise-backed frameworks.
  • Some users have successfully used txtai for specific applications like RAG and interactive help features, but there are concerns about dependencies like Java.
  • There are questions about how txtai compares to other tools like qdrant and llamaindex, and suggestions for improvements such as proper type annotations.
  • One user is interested in using txtai for personal projects involving copyrighted material and seeks guidance on fine-tuning models.
  • The author of txtai emphasizes its performance, innovation, and commitment to quality, especially with local models.
13 comments
By @dmezzetti - 3 months
Hello, author of txtai here. txtai was created back in 2020 starting with semantic search of medical literature. It has since grown into a framework for vector search, retrieval augmented generation (RAG) and large language model (LLM) orchestration/workflows.

The goal of txtai is to be simple, performant, innovative and easy-to-use. It had vector search before many current projects existed. Semantic Graphs were added in 2022 before the Generative AI wave of 2023/2024. GraphRAG is a hot topic but txtai had examples of using graphs to build search contexts back in 2022/2023.

There is a commitment to quality and performance, especially with local models. For example, its vector embeddings component streams vectors to disk during indexing and uses memory-mapped arrays, enabling large datasets to be indexed locally on a single node. txtai's BM25 component is built from scratch to work efficiently in Python, leading to 6x better memory utilization and faster search performance than the most commonly used BM25 Python library.
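The memory-mapped indexing idea described above can be sketched with NumPy. This is an illustrative pattern, not txtai's internal code; the file name, sizes, and batch size are made up:

```python
import os
import tempfile

import numpy as np

# Stream batches of vectors into a disk-backed array instead of holding
# the whole dataset in RAM, then reopen the same file for search.
dims, total, batch = 4, 8, 2
path = os.path.join(tempfile.mkdtemp(), "vectors.dat")

# Writable memory-mapped array sized for the full dataset
buf = np.memmap(path, dtype="float32", mode="w+", shape=(total, dims))
for start in range(0, total, batch):
    # Only the current batch needs to exist in memory
    buf[start:start + batch] = np.random.rand(batch, dims)
buf.flush()
del buf

# Reopen read-only: the OS pages vectors in on demand during search
index = np.memmap(path, dtype="float32", mode="r", shape=(total, dims))
query = np.random.rand(dims).astype("float32")
scores = index @ query          # brute-force dot-product scan
best = int(np.argmax(scores))   # id of the closest vector
```

The same shape of trick (write-once streaming, read via mmap) is what lets a single node index datasets larger than RAM.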

I often see others complain about AI/LLM/RAG frameworks, so I wanted to share this project as many don't know it exists.

Link to source (Apache 2.0): https://github.com/neuml/txtai

By @ipsi - 3 months
So here's something I've been wanting to do for a while, but have kinda been struggling to figure out _how_ to do it. txtai looks like it has all the tools necessary to do the job, I'm just not sure which tool(s), and how I'd use them.

Basically, I'd like to be able to take PDFs of, say, D&D books, extract that data (this step is, at least, something I can already do), and load it into an LLM to be able to ask questions like:

* What does the feat "Sentinel" do?

* Who is Elminster?

* Which God(s) do Elves worship in Faerûn?

* Where can I find the spell "Crusader's Mantle"?

And so on. Given this data is all under copyright, I'd probably have to stick to using a local LLM to avoid problems. And, while I wouldn't expect it to have good answers to all (or possibly any!) of those questions, I'd nevertheless love to be able to give it a try.

I'm just not sure where to start - I think I'd want to fine-tune an existing model since this is all natural language content, but I get a bit lost after that. Do I need to pre-process the content to add extra information that I can't fetch relatively automatically? E.g., page numbers are simple to add in, but would I need to mark out things like chapter/section headings, or in-character vs out-of-character text? Do I need to add all the content in as a series of questions and answers, like "What information is on page 52 of the Player's Handbook? => <text of page>"?
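For factual lookups like these, the common alternative to fine-tuning is the retrieve-then-ask (RAG) pattern the article describes: index the extracted pages, find the most relevant passage, and hand it to a local LLM as context. A toy, stdlib-only sketch of that pattern follows; the page texts, scoring function, and prompt format are illustrative stand-ins, and a real setup would replace the bag-of-words scoring with a vector index such as txtai's Embeddings:

```python
import math
import re
from collections import Counter

# Hypothetical extracted pages keyed by citation (placeholder text)
pages = {
    "PHB p.52": "Sentinel. When you hit a creature with an opportunity "
                "attack, its speed becomes 0 for the rest of the turn.",
    "FRCG p.10": "Elminster of Shadowdale is an archmage and a Chosen "
                 "of Mystra.",
}

def score(query: str, text: str) -> float:
    """Cosine similarity over word counts: a crude stand-in for embeddings."""
    q = Counter(re.findall(r"[a-z0-9]+", query.lower()))
    t = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    overlap = sum(q[w] * t[w] for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in t.values())))
    return overlap / norm if norm else 0.0

def retrieve(query: str) -> tuple[str, str]:
    """Return the (citation, text) pair best matching the question."""
    return max(pages.items(), key=lambda kv: score(query, kv[1]))

page, context = retrieve("What does the feat Sentinel do?")
# The retrieved passage becomes grounding context for a local LLM
prompt = f"Answer using only this context from {page}:\n{context}\nQuestion: ..."
```

Because the copyrighted text only ever lives in the local index and prompt, this sidesteps both the legal concern and much of the need for fine-tuning; citations (page numbers, section headings) come along for free as retrieval metadata.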

By @pjot - 3 months
I’ve done something similar, but using DuckDB as the backend/vector store. You can use embeddings from wherever. My demo uses OpenAI.

https://github.com/patricktrainer/duckdb-embedding-search

By @anotherpaulg - 3 months
I did some prototyping with txtai for the RAG used in aider’s interactive help feature [0]. This lets users ask aider questions about using aider, customizing settings, troubleshooting, using LLMs, etc.

I really liked the simplicity of txtai. But it seems to require Java as a dependency! Aider is an end user cli tool, and ultimately I couldn’t take on the support burden of asking my users to install Java.

[0] https://aider.chat/docs/troubleshooting/support.html

By @fastneutron - 3 months
I’ve been building a RAG mini app with txtai these past few weeks and it’s been pretty smooth. I’m between this and llamaindex as the backend for a larger app I want to build for a small-to-midsize customer.

With the (potentially) obvious bias towards your own framework, are there situations in which you would not recommend it for a particular application?

By @haolez - 3 months
"Interested in an easy and secure way to run hosted txtai applications? Then join the txtai.cloud preview to learn more."

I wish the author all the best and this seems to be a very sane and minimalist approach when compared to all the other enterprise-backed frameworks and libraries in this space. I might even become a customer!

However, has someone started an open source library that's fully driven by a community? I'm thinking of something like Airflow or Git. I'm not saying that the "purist" model is the best or enterprise-backed frameworks are evil. I'm just not seeing this type of project in this space.

By @sampling - 3 months
Has anyone had experience with qdrant (https://qdrant.tech/) as a vector data store and can speak to how txtai compares?
By @staticautomatic - 3 months
Looks pretty cool! Is this intended to be a simple alternative to, say, cobbling together something with LangChain and Chroma?
By @freeqaz - 3 months
This looks interesting. I've been wanting to build some tools to help feed text documents into Stable Diffusion and this looks like it could be helpful. Are there any other libs people are aware of that they'd recommend in this space?
By @v3ss0n - 3 months
txtai gets things done quickly, but one problem is that the codebase is not properly typed (in contrast to Haystack, which has a steeper learning curve but more rigorous typing). It would be nice if this project were properly type annotated.
By @dmezzetti - 3 months
Link to source (Apache 2.0): https://github.com/neuml/txtai
By @antman - 3 months
What type of embeddings db does it use? Is it interchangeable?
By @janice1999 - 3 months
It's frustrating when developers of ML projects don't state even the most basic requirements. Do I need an Nvidia 4090 or a cluster of H100s to run this?