Graph RAG on Noisy Data

By Various · NO BS AI · Jan 22, 2025 · 5 min read

Graph RAG is a widely appreciated technology used in various production environments. Typically, it is applied to clean data, such as factual documents describing products, procedures, rules, and guidelines. But what happens when you need to build a Graph RAG on noisy data, where you can't assume that every piece of information is accurate? This situation arises when building chatbots from previous customer support conversations. Often, companies just don't have all their data stored in clean documents. Even if documentation exists, the knowledge about real problems and solutions lives in the heads (and conversations) of the customer support team. And these conversations tend to be really noisy.

By applying proprietary techniques, we managed to make Graph RAG work on messy, real-life data that looks nothing like the examples from tutorials.

The setting

As usual, we'll focus on conversations between customer support and clients of a company that manufactures electronic devices. The aim is to answer technical questions about using the devices, so we need a solid technical knowledge base. The Graph RAG implementation uses the original library from Microsoft: https://microsoft.github.io/graphrag/.

Problems: Where vanilla Graph RAGs fail

Graph RAG assumes that every data point can be a valid fact. So, we get many, many nodes with client and order details, such as:
- Client IDs
- Client personal data (which somehow escaped anonymization)
- Order IDs

Sometimes these things are good to know, but we want our graph to contain knowledge about the devices and procedures so that we can diagnose user problems! We don't care about old order IDs from a few months ago, and storing personal data is dangerous.
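One possible way to keep such nodes out of the graph is a post-extraction filter. The sketch below is only an illustration, not the approach used in this article and not part of the Microsoft library: the `Node` class, the `NOISE_TYPES` set, and the regular expressions are all hypothetical and would need tuning to the ID and contact formats found in real conversations.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an extracted graph node."""
    name: str
    type: str


# Assumed entity types that never belong in a device knowledge base.
NOISE_TYPES = {"PERSON", "CLIENT", "ORDER"}

# Assumed name patterns for identifiers and personal data.
NOISE_PATTERNS = [
    # CLIENT_123, ORDER 4567, CUSTOMER-ID-89, ...
    re.compile(r"^(CLIENT|CUSTOMER|ORDER)[_ -]?(ID[_ -]?)?\d+$", re.IGNORECASE),
    # Email addresses that escaped anonymization.
    re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    # Phone-like digit sequences.
    re.compile(r"^\+?\d[\d\s-]{6,}\d$"),
]


def is_noise(node: Node) -> bool:
    """Return True if the node looks like an ID or personal data."""
    if node.type.upper() in NOISE_TYPES:
        return True
    return any(pattern.match(node.name) for pattern in NOISE_PATTERNS)


def filter_nodes(nodes: list[Node]) -> list[Node]:
    """Keep only nodes that do not look like client or order details."""
    return [node for node in nodes if not is_noise(node)]


nodes = [
    Node("ORDER_12345", "ORDER"),
    Node("jane@example.com", "CONTACT"),
    Node("CLIENT 889", "UNKNOWN"),
    Node("FIRMWARE_UPDATE", "PROCEDURE"),
]
kept = filter_nodes(nodes)
```

With these sample nodes, only `FIRMWARE_UPDATE` survives the filter. A pattern-based filter like this is cheap, but it only catches formats it has been told about, so it is best treated as a safety net rather than a replacement for proper anonymization.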

Many situational or temporary facts are regarded as vital facts:
- A client reports owning a black device of a certain brand. As a result, a BLACK_DEVICE_MODEL node is created, even though the device color is just a detail of that client's unit and is irrelevant for diagnosing problems.
- Many detected relationships refer to temporary situations, which can be unimportant or even risky to store in a graph - such as discounts. We witnessed many edges like this:
  <source_node>DEVICE,
  <target_node>DISCOUNT_15%,
  <relationship>"A discount of 15% is offered on DEVICE"
  Such relationships can be created, for example, from conversations with dissatisfied clients, where the discount is a way to prevent churn. Storing such information can be very dangerous, as it may lead to repeating some special actions
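To make the discount example concrete, here is a minimal sketch of dropping edges that describe temporary commercial situations. It is an assumption-laden illustration rather than the technique used in this article: the `Edge` tuple and the keyword list in `TEMPORARY_HINTS` are hypothetical, and a real system would likely need a richer classifier than keyword matching.

```python
import re
from typing import NamedTuple


class Edge(NamedTuple):
    """Hypothetical stand-in for an extracted relationship."""
    source: str
    target: str
    description: str


# Assumed keywords that signal a temporary, situational fact.
TEMPORARY_HINTS = re.compile(
    r"\b(discount|refund|coupon|voucher|promo(tion)?)\b", re.IGNORECASE
)


def is_temporary(edge: Edge) -> bool:
    """Return True if the edge mentions a temporary commercial action."""
    text = f"{edge.source} {edge.target} {edge.description}"
    # Node names use underscores (e.g. DISCOUNT_15%), which count as word
    # characters in regexes, so split them before matching on word boundaries.
    return bool(TEMPORARY_HINTS.search(text.replace("_", " ")))


def filter_edges(edges: list[Edge]) -> list[Edge]:
    """Keep only edges that do not describe temporary situations."""
    return [edge for edge in edges if not is_temporary(edge)]


edges = [
    Edge("DEVICE", "DISCOUNT_15%", "A discount of 15% is offered on DEVICE"),
    Edge("DEVICE", "FACTORY_RESET", "DEVICE can be restored with a factory reset"),
]
stable_edges = filter_edges(edges)
```

Here the discount edge is removed and the factory reset edge is kept. Keyword matching is deliberately conservative: it can miss paraphrased offers, so flagged and borderline edges may still deserve manual review.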

...
