To get started:
Slides: dh25.resolve.works
Transform your document collection into a graph visualisation
A way of getting better answers from a language model by using a knowledge graph
What is a graph?
And, what is RAG?
So first things first...
Nodes and edges.
Or: Entities and their relations.
R(etrieval) A(ugmented) G(eneration)
Help the language model with relevant information
1. Find relevant information that can be used to answer your question
2. Provide the language model with the retrieved information alongside the question
3. Get better answers
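In code, that loop is only a few lines. A minimal sketch, where retrieve() and llm() are hypothetical placeholders for your search step and your model call:

question = "What does the contract say about payment terms?"  # example question
context = retrieve(question)  # 1. hypothetical search over your documents
prompt = f"Context:\n{context}\n\nQuestion: {question}"  # 2. context alongside the question
answer = llm(prompt)  # 3. hypothetical model call returning a grounded answer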
Sound familiar?
What information do you give to the model?
Deep Research searches the web
But GraphRAG makes use of "semantic search"
Semantic search is a way of finding semantically similar information
It works by converting texts into numerical representations that can be compared, usually vectors
These vectors are called embeddings
Because we're embedding text into multidimensional space
Don't worry, it's simpler than it sounds
A vector is a direction in multidimensional space
In other words, a list of numbers
This conversion is done with "embedding models" that are trained specifically for this task.
Let's try with all-MiniLM-L6-v2
from sentence_transformers import SentenceTransformer

# load a small, general-purpose embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentence = "The sun is shining"
embedding = model.encode(sentence)  # a 384-dimensional vector
"The sun is shining" becomes...
5.42898139e-04  1.02551401e-01  8.36703405e-02  9.39253941e-02  2.07145736e-02 -1.12772360e-02  9.93586630e-02 -4.44610193e-02
3.98165248e-02 -2.13668533e-02  2.09684577e-02 -3.66757251e-02  7.11042061e-03  4.15845886e-02  1.01847239e-01  7.34366402e-02
1.43970475e-02  4.65401914e-03  9.70691442e-03 -4.16579135e-02 -2.30006799e-02 -2.27442160e-02 -2.73266062e-02 -1.22460043e-02
-8.25940259e-03  6.72213510e-02 -2.77701933e-02  2.63383351e-02 -6.12662174e-02 -1.37788551e-02  2.81139649e-02  5.81778679e-03
... and so on, for 384 dimensions in total
Great. But now that we have our numerical representation, how do we compare it to others?
Embedding models are trained so that vectors pointing in similar directions are semantically similar
We can find similar vectors by calculating cosine similarity, which measures the angle between them
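Concretely, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction only. A minimal numpy version:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = unrelated (orthogonal), -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))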
Let's see what that looks like
384 dimensions is hard to visualize though...
So let's imagine we just have 2 dimensions
So to come back to "The sun is shining"...
| Text | Cosine similarity |
|---|---|
| Bring an umbrella | 0.320 |
| The weather is great | 0.418 |
| It's daytime | 0.622 |
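These scores come straight from comparing embeddings. A sketch that reproduces the table (exact numbers may vary slightly between model versions):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = model.encode("The sun is shining")

for text in ["Bring an umbrella", "The weather is great", "It's daytime"]:
    score = util.cos_sim(query, model.encode(text)).item()
    print(f"{text}: {score:.3f}")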
Congratulations, you made it through the theory! 🎉
How are these technologies combined?
First, the dataset is preprocessed...
Entities and relations are extracted from the text
Embedded for semantic search
Added to the graph
After which communities are calculated
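A rough sketch of that indexing step, using networkx for the graph. Here documents and extract_entities_and_relations (an LLM-backed triple extractor) are assumptions for illustration, not the actual GraphRAG implementation:

import networkx as nx
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
graph = nx.Graph()

for doc in documents:  # your preprocessed text chunks (assumed)
    # hypothetical extractor returning (entity, relation, entity) triples
    for source, relation, target in extract_entities_and_relations(doc):
        graph.add_edge(source, target, relation=relation, source_text=doc)

# embed every entity so we can find it later via semantic search
entity_embeddings = {node: model.encode(node) for node in graph.nodes}

# group densely connected entities into communities (Louvain method)
communities = nx.community.louvain_communities(graph)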
Then, when querying the data...
We find similar entities through semantic search
Then we fetch related entities and communities
And the texts that reference them
Send all that together with the question to the model
---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Goal---
Generate a response that responds to the user's question, summarizing all information in the input data tables, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
---Data tables---
{context_data}
And get better answers based on your data
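Put together, the query side looks roughly like this. find_similar_entities, fetch_context, PROMPT_TEMPLATE, and llm are hypothetical stand-ins for the real GraphRAG machinery; model, graph, and entity_embeddings come from the indexing sketch above:

question = "Who is mentioned in the documents?"  # example question
query_embedding = model.encode(question)
entities = find_similar_entities(query_embedding, entity_embeddings)  # semantic search
context_data = fetch_context(graph, entities)  # related entities, communities, source texts
prompt = PROMPT_TEMPLATE.format(context_data=context_data)  # the ---Role---/---Goal--- prompt above
answer = llm(prompt + "\n\n" + question)  # hypothetical model call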
But why is this cool?
Because we can visualize the graph
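Even a couple of lines of networkx and matplotlib give a first picture (a sketch, assuming the graph built in the indexing example above):

import matplotlib.pyplot as plt
import networkx as nx

# draw nodes, edges, and entity labels with a default layout
nx.draw_networkx(graph, node_size=300, font_size=8)
plt.axis("off")
plt.show()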
Language models ❤️ markdown
To get started:
Some example documents from Follow the Money's latest Sainsbury investigation
Thanks, Lise!
LinkedIn:
Slides: dh25.resolve.works
Thank you!