July 9th, 2024

Knowledge Graphs in RAG: Hype vs. Ragas Analysis

The analysis questions Microsoft's GraphRAG paper on knowledge graphs in RAG systems, revealing Neo4j's potential over FAISS in context retrieval. The study scrutinizes metrics and aims for precise quantification beyond vague claims.

The article examines Microsoft's GraphRAG paper on knowledge graphs in RAG systems. The author found the paper's metrics questionable and its claims of substantial improvements over a baseline vague, which prompted them to run their own study with Neo4j and quantify the effect more precisely. The methodology involved splitting a PDF document, loading it into Neo4j, setting up OpenAI for retrieval and embeddings, and creating retrievers for testing. Ground truth data was created from the PDF and evaluated with RAGAS metrics: context relevancy, context recall, answer relevancy, and faithfulness. The study found that knowledge graphs may not significantly impact context retrieval in RAG systems. Neo4j with indexing showed a higher answer relevancy score than FAISS, though with potential ROI constraints, and the faithfulness score improved significantly with Neo4j's index. The study aimed to provide a deeper understanding of knowledge graphs in RAG systems beyond the hype surrounding GraphRAG methods.

Surprise, your data warehouse can RAG

A blog post by Maciej Gryka explores "Retrieval-Augmented Generation" (RAG) to enhance AI systems. It discusses building RAG pipelines, using text embeddings for data retrieval, and optimizing data infrastructure for effective implementation.

Show HN: R2R V2 – A open source RAG engine with prod features

The R2R GitHub repository offers an open-source RAG answer engine for scalable systems, featuring multimodal support, hybrid search, and a RESTful API. It includes installation guides, a dashboard, and community support. Developers benefit from configurable functionalities and resources for integration. Full documentation is available on the repository for exploration and contribution.

GraphRAG (from Microsoft) is now open-source!

GraphRAG, a GitHub tool, enhances question-answering over private datasets with structured retrieval and response generation. It outperforms naive RAG methods, offering semantic analysis and diverse, comprehensive data summaries efficiently.

GraphRAG with Wikipedia

txtai is a versatile tool combining vector indexes, graph networks, and databases for semantic search and language workflows. It showcases using semantic graphs to enhance LLM generation, enabling comprehensive knowledge collection and history book creation.

Txtai – A Strong Alternative to ChromaDB and LangChain for Vector Search and RAG

Generative AI's rise in business and challenges with Large Language Models are discussed. Retrieval Augmented Generation (RAG) tackles data generation issues. LangChain, LlamaIndex, and txtai are compared for search capabilities and efficiency. Txtai stands out for streamlined tasks and text extraction, despite a narrower focus.

13 comments

By @davedx - 10 months

This seems highly relevant: https://arxiv.org/abs/2406.01506

> In this paper, we study the two foundational questions in this area. First, how are categorical concepts, such as {'mammal', 'bird', 'reptile', 'fish'}, represented? Second, how are hierarchical relations between concepts encoded? For example, how is the fact that 'dog' is a kind of 'mammal' encoded? We show how to extend the linear representation hypothesis to answer these questions. We find a remarkably simple structure: simple categorical concepts are represented as simplices, hierarchically related concepts are orthogonal in a sense we make precise, and (in consequence) complex concepts are represented as polytopes constructed from direct sums of simplices, reflecting the hierarchical structure.

Basically, LLMs already partially encode information as semantic graphs internally.

With this it is less surprising that augmenting them with external knowledge graphs has a lower ROI.

By @piizei - 10 months

Looks like the test setup confuses knowledge graphs with graph databases. The code just creates a neo4j database from a document, not a knowledge graph (it basically uses neo4j as a vector database). A knowledge graph would be created by an LLM as a preprocessing step (and queried similarly by an LLM). That is a different approach from the one tested, one that trades preprocessing time and domain knowledge for accuracy. Reference: https://python.langchain.com/v0.1/docs/use_cases/graph/const...

By @visarga - 10 months

The Microsoft GraphRAG paper focuses on global sensemaking through hierarchical summarization, which is a fundamental aspect of their approach. The blog post analysis, however, doesn't address this core feature at all. Another issue is the corpus size: the paper focuses on sizes on the order of 1M tokens, while the reference text used in the blog post is probably shorter. On shorter text a simple LLM call could do summarization directly.

By @qeternity - 10 months

I don’t believe the author read the GraphRAG paper as there is nothing in this “deep dive” that implements anything remotely close.

By @dmezzetti - 10 months

There is no one size fits all formula. For simple RAG, a search query (vector, keyword, SQL, etc) works to build a context.

For more complex questions or research, a knowledge graph can be beneficial. I wrote an article[1] earlier this year that used graph path traversal to build a context.

The goal was to build a short narrative about English history from 500 - 1000 using Wikipedia articles. Vector similarity alone won't bring back good results. This article used a cypher graph path query that jumped multiple hops through concepts of interest. Those articles on that path were then brought in as the context.

[1] https://neuml.hashnode.dev/advanced-rag-with-graph-path-trav...
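The idea of walking a path between concepts and using the nodes along it as context can be sketched in a few lines. This is only a toy illustration with a plain breadth-first search over an in-memory dictionary, not the Cypher query or txtai code from the linked article; the graph, the node names, and the `articles` mapping are all hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical concept graph: each key links to related concepts.
graph = {
    "concept_a": ["concept_b"],
    "concept_b": ["concept_c", "concept_d"],
    "concept_c": ["concept_e"],
    "concept_d": ["concept_e"],
    "concept_e": [],
}

# Hypothetical article text attached to each node.
articles = {name: f"Text of the article for {name}." for name in graph}

path = shortest_path(graph, "concept_a", "concept_e")

# The articles on the path, in path order, become the LLM context.
context = "\n".join(articles[node] for node in path)
```

A real graph store would also let the path be constrained by similarity or edge weight; the sketch only shows the shape of the approach, where the context is assembled from a multi-hop path instead of a flat top-k similarity search.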

By @Tostino - 10 months

I really need to dig into the more recent advances in knowledge graphs + LLMs. I've been out of the game for ~10 months now, and am just starting to dig back into things and get my training pipeline working (darn bitrot...)

I had previously trained a llama2 13b model (https://huggingface.co/Tostino/Inkbot-13B-8k-0.2) on a whole bunch of knowledge graph tasks (in addition to a number of other tasks).

Here is an example of the training data for training it how to use knowledge graphs:

easy - https://gist.github.com/Tostino/76c55bdeb1f099fb2bfab00ce144...

medium - https://gist.github.com/Tostino/0460c18024697efc2ac34fe86ecd...

I also trained it on generating KGs from conversations, or articles you have provided. So from the LLM side, it's way more knowledgeable about the connections in the graph than GPT4 is by default.

Here are a couple examples of the trained model actually generating a knowledge graph:

1. https://gist.github.com/Tostino/c3541f3a01d420e771f66c62014e...

2. https://gist.github.com/Tostino/44bbc6a6321df5df23ba5b400a01...

I haven't done any work on integrating those into larger structures, combining the graphs generated from different documents, or using a graph database to augment my use case...all things I am eager to try out, and I am glad there is a bunch more to read on the topic available now.

Anyways, near-term plans are to train a llama3 8b, and likely a phi-3 13b version of Inkbot on an improved version of my dataset. Glad to see others as excited as I am about this topic!

By @itkovian_ - 10 months

Knowledge graphs were created to solve the problem of making natural, free-flowing text machine processable. We now have a technology that completely understands natural, free-flowing text and can extract meaning. Why would going back to structure help when that structure can never be as rich as just text? I get it if the KB has new information; that's not what I'm saying.

By @jimmySixDOF - 10 months

This is a nice sandbox walkthrough of the author's objective, which was to test MSFT's claims in the paper -- but with all due respect, the buzz around graphs is because they add a whole third layer in a combined approach like Reciprocal Rank Fusion (RRF). You do a BM25 search, then you do a vector-based nearest neighbors search, and now you can add a KG search; then, with it all combined with local and global reranking etc., the expectation is that this produces a better final outcome. These findings aside, it still makes sense that adding KG to a hybrid search pipeline is going to be useful.
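For readers unfamiliar with RRF, here is a minimal sketch of how three ranked lists (BM25, vector, and KG search) could be fused. It assumes each retriever returns document ids ordered best first; the ids and the `k=60` smoothing constant are illustrative choices, not values taken from the article or the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids. Each doc scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from three independent retrievers.
bm25_hits = ["d1", "d2", "d3"]
vector_hits = ["d2", "d4", "d1"]
kg_hits = ["d2", "d5"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits, kg_hits])
```

Because RRF only uses ranks, the retrievers do not need comparable scores, which is why it is a convenient way to bolt a graph-based retriever onto an existing keyword plus vector pipeline.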

By @DrStartup - 10 months

Knowledge / property graphs provide truths that can guide the retrieval. LLMs lack a truth function, i.e. causality. The KPG provides this as sorta a lace across the LLM vector space. A KPG can either be used as a filter or a router of sorts. I expect we'll see KPGs colocated with the vector data of the LLM, with a tuned router layer using them to guide retrieval and course correct the output. Kind of like MoE.

By @yetanotherjosh - 10 months

It seems to me that the "knowledge graph" generated in this article is incredibly naive and not comparable to the process in the MS paper, which requires multiple rounds of preprocessing the source content using LLMs to extract, summarize, find relationships at multiple levels and model them in the graph store. This just splats chunks and words into a vector graph and is barely defensible as a "knowledge graph".

Please tell me I'm missing something because this is egregious. How can you expect a graph approach to improve over naive rag if you don't actually build a knowledge graph that captures high quality, higher level entity relationships?

By @mark_l_watson - 10 months

That is an interesting writeup, but I had trouble understanding what they meant by what for me is a new term: “faithfulness.”

This is supposedly a measure of reducing hallucinations. Is it just me, or did other people here have difficulty understanding how faithfulness was evaluated?

EDIT: OK, faithfulness is calculated by human evaluation, and can be automatically calculated with ROUGE and BLEU.
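As a rough illustration of a ratio-style faithfulness score: Ragas describes its faithfulness metric as the fraction of claims in the generated answer that can be inferred from the retrieved context, with an LLM extracting and judging the claims. The sketch below assumes those per-claim verdicts already exist; the function name and the example verdicts are hypothetical and this is not the Ragas API.

```python
def faithfulness_score(claim_verdicts):
    """Fraction of answer claims judged as supported by the retrieved context."""
    if not claim_verdicts:
        raise ValueError("need at least one claim to score")
    supported = sum(1 for verdict in claim_verdicts if verdict)
    return supported / len(claim_verdicts)

# Hypothetical verdicts for four claims extracted from one answer:
# True means the claim is supported by the retrieved context.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)
```

A score of 1.0 would mean every claim in the answer is backed by the context; lower values indicate statements the retriever's context does not support.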

By @lmeyerov - 10 months

I'm happy to see third-party comparisons, most of the marketing here indeed just assumes KGs are better with zero proof: marketers to be wary of. Unfortunately, I suspect a few key steps need to happen for this post to fairly reflect what the Microsoft NLP researchers called their alg, vs the broader family named by neo4j. Afaict, they're talking about a different graph.

* The kg index should be text documents hierarchically summarized based on an extracted named-entity-relation graph. The blog version seems to instead do (document, word), not the KG, and afaict, skips the hierarchical NER community summarization. The blog post is doing what neo4j calls a lexical graph, not the novel KG summary index of the MSR paper.

* The data volume should go up. Think a corpus like 100k+ tweets or 100+ documents. You start to see challenges like redundant tweets that clog retrieval/ranking, or many pieces of the puzzle spread over disparate chunks with indirect 'multi-hop' reasoning. Something like a debate can fit into one ChatGPT call, with no RAG. It's an interesting question how summarization preprocessing can still help small documents, but a more nuanced topic (and we have Thoughts on ;-))

* The tasks should reflect the challenges: multi-hop reasoning, wider summarization with fixed budget, etc. Retesting simple queries naive RAG already solves isn't the point. The paper focused on a couple types, which is also why they route to 2 diff retrieval modes. Subtle, part of the challenge in bigger data is how many resources we give the retriever & reasoner, and part of why graph rag is exciting IMO.

Afaict the blogpost essentially did a lexical graph with chunk/node embeddings, reran on a small document, and at that scale, asked simple q's... So close to a naive retrieval, and unsurprisingly, got parity. It's not too much more to improve so would encourage doing a bit more. Beyond the MSR paper, I would also experiment a bit more with retrieval strategies, eg, agentic layer on top, and include simple text search mixed in with reranking. And as validation before any of that, focus specifically on the queries expected to fail naive RAG and work in graph, and make sure those work.

Related: We are working on a variant of Graph RAG that solves some additional scale & quality challenges in our data (investigations: threat intel reports, real-time social & news, misinfo, ...), and may be open to an internship or contract role for the right person. One big focus area is ensuring AI quality & AI scale as our version is more GPU/AI-centric and used in serious situations by less technical users... A bit ironic given the article :) LMK if interested, see my profile. We'll need proof of capability for both engineering + AI challenges, and easier for us to teach the latter than the former.