Welcome

To get started:

  1. Visit: console.neo4j.io and create an account
  2. Create a free instance
  3. Visit llm-graph-builder.neo4jlabs.com
  4. Connect to your instance (use the credentials file downloaded on instance creation)

Slides: dh25.resolve.works

Finding Connections

Transform your document collection into a graph visualisation

Who am I?

  • Freelance IT generalist
  • Over 15 years of building software
  • Work in journalism for the last 4 years

What we'll cover today

  • What GraphRAG is, and how it works
  • How to get your documents ready
  • Extracting a graph from your documents

So, what is GraphRAG?

A way of getting better answers from a language model by using a knowledge graph

Graph + RAG

What is a graph?

And, what is RAG?

So first things first...

What is a graph?

Nodes and edges.

Or: Entities and their relations.

[Example graph: nodes Company A, Company B, Person C, Person D, Transaction E, Location F and Bank G, connected by edges labelled "owns", "employs", "initiated", "authorized", "processed at" and "occurred in".]

But what is RAG?

R(etrieval) A(ugmented) G(eneration)

Help the language model with relevant information

1. Find relevant information that can be used to answer your question

2. Provide the language model with the retrieved information alongside the question.

3. Get better answers.
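The three steps above can be sketched as a toy pipeline. Everything here is a hypothetical stand-in for illustration: `retrieve` is a naive keyword matcher (a real system would use semantic search), and the document list is made up.

```python
def retrieve(question, documents):
    # Toy retriever: pick the document that shares the most words
    # with the question. Real systems use semantic search instead.
    words = set(question.lower().split())
    return max(documents, key=lambda d: len(words & set(d.lower().split())))

def build_prompt(question, context):
    # Step 2: hand the model the retrieved context alongside the question.
    return f"Use this context to answer:\n{context}\n\nQuestion: {question}"

documents = [
    "Company A owns Company B.",
    "Person C initiated Transaction E.",
]
question = "Who owns Company B?"
prompt = build_prompt(question, retrieve(question, documents))
# Step 3: send `prompt` to the language model of your choice.
```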

Sound familiar?

So, what is relevant?

What information do you give to the model?

Deep Research searches the web

But GraphRAG makes use of "semantic search"

Semantic search is a way of finding semantically similar information

It works by converting texts into numerical representations that can be compared, usually vectors

These vectors are called embeddings

Because we're embedding text into multidimensional space

Don't worry, it's simpler than it sounds

A vector is a direction in multidimensional space

In other words, a list of numbers

This conversion is done with "embedding models" that are trained specifically for this task.

Let's try with all-MiniLM-L6-v2


from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentence = "The sun is shining"
embedding = model.encode(sentence)  # a vector of 384 floats

"The sun is shining" becomes...

 5.42898139e-04  1.02551401e-01  8.36703405e-02  9.39253941e-02
 2.07145736e-02 -1.12772360e-02  9.93586630e-02 -4.44610193e-02
 3.98165248e-02 -2.13668533e-02  2.09684577e-02 -3.66757251e-02
 7.11042061e-03  4.15845886e-02  1.01847239e-01  7.34366402e-02
 1.43970475e-02  4.65401914e-03  9.70691442e-03 -4.16579135e-02
-2.30006799e-02 -2.27442160e-02 -2.73266062e-02 -1.22460043e-02
-8.25940259e-03  6.72213510e-02 -2.77701933e-02  2.63383351e-02
-6.12662174e-02 -1.37788551e-02  2.81139649e-02  5.81778679e-03
... etc ... ...

384 dimensions
            

Great. But now that we have our numerical representation, how do we compare it to others?

Embedding models are trained so that texts with similar meaning map to vectors pointing in similar directions

We can find similar vectors by calculating Cosine Similarity

Let's see what that looks like

384 dimensions is hard to visualize though...

So let's imagine we just have 2 dimensions

[Plot: two 2-D vectors, A = (0.15, 0.72) and B = (0.51, 0.43), with the angle between them θ = 38.1° and similarity 0.787, illustrating cos(θ) = A·B / (|A|·|B|).]
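A minimal sketch of that calculation in plain Python, using the two example vectors from the plot:

```python
import math

def cosine_similarity(a, b):
    # cos(θ) = A·B / (|A|·|B|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = (0.15, 0.72)  # Vector A
b = (0.51, 0.43)  # Vector B
print(round(cosine_similarity(a, b), 3))  # 0.787
```

Identical directions give a similarity of 1.0; unrelated (orthogonal) vectors give 0.0.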

So to come back to "The sun is shining"...

  Text                   Similarity
  Bring an umbrella      0.320
  The weather is great   0.418
  It's daytime           0.622

Congratulations, you made it through the theory! 🎉

GraphRAG

How are these technologies combined?

First the dataset is preprocessed...

Entities and relations are extracted from the text

Embedded for semantic search

Added to the graph

After which communities are calculated
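In miniature, the indexing steps look something like this. The extracted triples and the `embed` function are hand-written stand-ins for the LLM extraction and embedding-model calls, not real library code.

```python
# Hand-written stand-in for LLM entity/relation extraction.
chunk = "Company A owns Company B."
triples = [("Company A", "owns", "Company B")]

def embed(text):
    # Stand-in for an embedding model: a real pipeline would call
    # something like model.encode(text) here.
    return [float(len(text)), 0.0]

graph = {"nodes": {}, "edges": []}
for head, relation, tail in triples:
    for entity in (head, tail):
        # Store each entity with its embedding (for semantic search)
        # and the source chunks that mention it.
        node = graph["nodes"].setdefault(entity, {"embedding": embed(entity), "sources": []})
        node["sources"].append(chunk)
    graph["edges"].append((head, relation, tail))
# Community detection (e.g. the Leiden algorithm) would run over this graph next.
```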

Then, when querying the data...

[Diagram: source texts and the entities extracted from them]
We find similar entities through semantic search

Then we fetch related entities and communities

And the texts that reference them

Send all that together with the question to the model
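Put together, the query side can be sketched like this. The tiny index and the 2-D embeddings are made up for illustration; a real system would use the stored 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up 2-D embeddings standing in for real ones.
entities = {"Company A": (0.9, 0.1), "Person C": (0.1, 0.9)}
edges = [("Company A", "owns", "Company B")]
sources = {"Company A": "Company A owns Company B."}

question_embedding = (0.8, 0.2)  # pretend embedding of "Who owns Company B?"

# 1. Semantic search for the most similar entity
best = max(entities, key=lambda e: cosine_similarity(entities[e], question_embedding))
# 2. Fetch related entities via the graph
related = [edge for edge in edges if best in edge]
# 3. Collect the source texts that mention them
context = sources[best]
# `context` and `related` go to the model together with the question.
```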

---Role---
You are a helpful assistant responding to questions about data in the tables provided.

---Goal---
Generate a response that responds to the user's question, summarizing all information in the input data tables, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.

---Data tables---
{context_data}

And get better answers based on your data

But why is this cool?

Because we can visualize the graph

GraphRAG tools

  1. github.com/microsoft/graphrag
    github.com/noworneverev/graphrag-visualizer Widely adopted, popular
  2. github.com/gusye1234/nano-graphrag
    Small, easy to hack on
  3. llm-graph-builder.neo4jlabs.com
    github.com/neo4j/neo4j-graphrag-python Neo4j ecosystem

Parsing documents

Language models ❤️ markdown

  1. github.com/docling-project/docling
    IBM Research
  2. github.com/microsoft/markitdown
    Microsoft

To get started:

  1. Visit: console.neo4j.io and create an account
  2. Create a free instance
  3. Visit llm-graph-builder.neo4jlabs.com
  4. Connect to your instance (use the credentials file downloaded on instance creation)

Some example documents
From Follow the Money's latest Sainsbury investigation. Thanks, Lise!

LinkedIn:
Slides:

Thank you!