txtai: Open-source vector search and RAG for minimalists
txtai is a versatile tool for semantic search, LLM orchestration, and language model workflows. It offers features like vector search with SQL, topic modeling, and multimodal indexing, supporting various tasks with language models. It is built with Python and open source under the Apache 2.0 license.
The txtai tool is an embeddings database designed for semantic search, LLM orchestration, and language model workflows. It combines vector indexes, graph networks, and relational databases to enable vector search with SQL, topic modeling, and retrieval augmented generation (RAG). It can function independently or serve as a knowledge source for large language model (LLM) prompts. Key features include vector search with SQL, object storage, topic modeling, and multimodal indexing, along with the ability to create embeddings for text, documents, audio, images, and video. txtai offers pipelines powered by language models for tasks like question answering, labeling, transcription, translation, and summarization. Workflows can be created to join pipelines and aggregate business logic, with the flexibility to build using Python or YAML, and API bindings are available for JavaScript, Java, Rust, and Go. txtai is open source under the Apache 2.0 license and is built with Python 3.8+, Hugging Face Transformers, Sentence Transformers, and FastAPI. Users interested in running hosted txtai applications can explore the txtai.cloud preview.
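To make the "embeddings database" idea concrete, here is a minimal sketch of what such a system does at its core: store content alongside vectors, then rank stored entries by similarity to a query vector. This is a toy illustration only, not txtai's code; a toy bag-of-words vector stands in for the learned sentence embeddings and ANN indexes a real system would use.

```python
# Conceptual sketch of an embeddings database: index texts as vectors,
# then rank them by cosine similarity against a query vector.
import math
from collections import Counter

def vectorize(text):
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyEmbeddingsDB:
    def __init__(self):
        self.rows = []  # (id, text, vector)

    def index(self, texts):
        self.rows = [(i, t, vectorize(t)) for i, t in enumerate(texts)]

    def search(self, query, limit=1):
        qv = vectorize(query)
        ranked = sorted(self.rows, key=lambda r: cosine(qv, r[2]), reverse=True)
        return [(i, t, cosine(qv, v)) for i, t, v in ranked[:limit]]

db = ToyEmbeddingsDB()
db.index(["vector search with SQL", "topic modeling", "image and audio indexing"])
```

The same shape (index a list of texts, search with a query string, get back ranked results) is what the library's API exposes, with real embedding models doing the vectorizing.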
Related
GraphRAG with Wikipedia
txtai is a versatile tool combining vector indexes, graph networks, and databases for semantic search and language workflows. It showcases using semantic graphs to enhance LLM generation, enabling comprehensive knowledge collection and history book creation.
Txtai – A Strong Alternative to ChromaDB and LangChain for Vector Search and RAG
Generative AI's rise in business and challenges with Large Language Models are discussed. Retrieval Augmented Generation (RAG) tackles data generation issues. LangChain, LlamaIndex, and txtai are compared for search capabilities and efficiency. Txtai stands out for streamlined tasks and text extraction, despite a narrower focus.
Show HN: txtai: open-source, production-focused vector search and RAG
The txtai tool is a versatile embeddings database for semantic search, LLM orchestration, and language model workflows. It supports vector search with SQL, RAG, topic modeling, and more. Users can create embeddings for various data types and utilize language models for diverse tasks. Txtai is open-source and supports multiple programming languages.
txtai 7.3 released: Adds new RAG Web Apps and streaming LLM/RAG support
The txtai 7.3.0 release introduces an open-source embeddings database for semantic search and language model workflows. It supports various data types, pipelines for tasks like summarization, and can be built with Python or YAML.
- Users appreciate txtai's simplicity and minimalist approach compared to enterprise-backed frameworks.
- Some users have successfully used txtai for specific applications like RAG and interactive help features, but there are concerns about dependencies like Java.
- There are questions about how txtai compares to other tools like qdrant and llamaindex, and suggestions for improvements such as proper type annotations.
- One user is interested in using txtai for personal projects involving copyrighted material and seeks guidance on fine-tuning models.
- The author of txtai emphasizes its performance, innovation, and commitment to quality, especially with local models.
The goal of txtai is to be simple, performant, innovative and easy-to-use. It had vector search before many current projects existed. Semantic Graphs were added in 2022 before the Generative AI wave of 2023/2024. GraphRAG is a hot topic but txtai had examples of using graphs to build search contexts back in 2022/2023.
There is a commitment to quality and performance, especially with local models. For example, txtai's vector embeddings component streams vectors to disk during indexing and uses memory-mapped arrays, enabling large datasets to be indexed locally on a single node. txtai's BM25 component is built from scratch to work efficiently in Python, yielding 6x better memory utilization and faster search performance than the most commonly used BM25 Python library.
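For context on what a BM25 component computes, here is the textbook Okapi BM25 formula as a short sketch. This is not txtai's implementation, just the standard scoring function with common k1/b defaults, shown so the performance claim above has a concrete referent.

```python
# Okapi BM25 scoring sketch: rank documents by term-frequency-weighted,
# length-normalized keyword relevance.
import math
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults

def bm25_scores(query, docs):
    """Score each tokenized doc (list of terms) against tokenized query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (K1 + 1) / (tf[t] + K1 * (1 - B + B * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [t.split() for t in ("bm25 ranking function", "vector search", "bm25 bm25 scoring")]
scores = bm25_scores("bm25".split(), docs)
```

The engineering work in a production BM25 component is less in this formula than in the index layout around it, which is where the memory and speed gains come from.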
I often see others complain about AI/LLM/RAG frameworks, so I wanted to share this project as many don't know it exists.
Link to source (Apache 2.0): https://github.com/neuml/txtai
Basically, I'd like to be able to take PDFs of, say, D&D books, extract that data (this step is, at least, something I can already do), and load it into an LLM to be able to ask questions like:
* What does the feat "Sentinel" do?
* Who is Elminster?
* Which God(s) do Elves worship in Faerûn?
* Where can I find the spell "Crusader's Mantle"?
And so on. Given this data is all under copyright, I'd probably have to stick to using a local LLM to avoid problems. And, while I wouldn't expect it to have good answers to all (or possibly any!) of those questions, I'd nevertheless love to be able to give it a try.
I'm just not sure where to start - I think I'd want to fine-tune an existing model since this is all natural language content, but I get a bit lost after that. Do I need to pre-process the content to add extra information that I can't fetch relatively automatically? For example, page numbers are simple to add in, but would I need to mark out things like chapter/section headings, or in-character vs. out-of-character text? Do I need to add all the content in as a series of questions and answers, like "What information is on page 52 of the Player's Handbook? => <text of page>"?
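The retrieval-augmented alternative to fine-tuning that this use case points toward can be sketched as follows: instead of training a model on the books, retrieve the most relevant extracted passages at question time and place them in the prompt for a local LLM. Retrieval here is naive keyword overlap for illustration (a real pipeline would use an embeddings index), and the chunk texts are hypothetical stand-ins, not actual book content.

```python
# Sketch of a retrieval-augmented prompt: rank extracted chunks by
# relevance to the question, then build a grounded prompt from the top hits.

def retrieve(question, chunks, k=2):
    """Rank pre-extracted text chunks by keyword overlap with the question."""
    qt = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(qt & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks):
    """Assemble a context-grounded prompt for a local LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Sentinel feat: creatures you hit with opportunity attacks have speed 0.",  # hypothetical excerpt
    "Elminster is a wizard of Shadowdale.",  # hypothetical excerpt
]
prompt = build_prompt("What does the Sentinel feat do?", chunks)
```

Because answers are grounded in retrieved pages rather than model weights, metadata like page numbers can simply be attached to each chunk and echoed back, with no fine-tuning or question/answer reformatting required.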
I really liked the simplicity of txtai. But it seems to require Java as a dependency! Aider is an end-user CLI tool, and ultimately I couldn't take on the support burden of asking my users to install Java.
With the (potentially) obvious bias towards your own framework, are there situations in which you would not recommend it for a particular application?
I wish the author all the best and this seems to be a very sane and minimalist approach when compared to all the other enterprise-backed frameworks and libraries in this space. I might even become a customer!
However, has someone started an open source library that's fully driven by a community? I'm thinking of something like Airflow or Git. I'm not saying that the "purist" model is the best or enterprise-backed frameworks are evil. I'm just not seeing this type of project in this space.