The Running Example

One question. Seven phases of the KG life-cycle.

The same Marcus from the Open Banking Lab. The same Sarah. The same Meridian Bank. What changes is the phase of the KG you’re building or using.

Marcus Aldridge

Wealth client · age 54 · cust_m_7741

Advised by: Jane Wong
AUM: £18.7M
EU sov exposure: heavy
Tenure: 11 years

Jane Wong

Senior wealth advisor · adv_jw_032

Manages: 14 clients
AUM under advice: £210M
Tenure: 9 years

+ 8 other wealth clients

Some with similar profiles

Heloise Schmidt · €12M · DE sov
David Chen · £9M · EU sov mix
Anya Volkov · £15M · EU corp + sov
…and 5 more across 3 advisors

The agent’s question

"Find wealth clients whose advisors have managed at least one client with a similar EU sovereign exposure profile to Marcus — and rank them by predicted churn risk."

Progress through the lab

Overview · the KG life-cycle

Construction · RDF + R2RML + RML

OBDA · virtual graph

Querying · SPARQL · Cypher · GQL

GraphRAG · LLM walks the graph

Knowledge Vault · fact confidence

Embeddings · TransE / DistMult / RotatE

Graph ML · Node2Vec / GraphSAGE

Tab 00 · Overview

A knowledge graph is a life-cycle, not a database

Seven phases. You build the graph, you access it virtually, you query it, you reason over it, you score its facts probabilistically, you embed it, you do ML on it. Twelve formal models cluster naturally across the seven phases.

In Plain English

Seven things you do with a knowledge graph, in order

People say "knowledge graph" as if it’s one thing. It isn’t. It’s a life-cycle with seven distinct phases, each with its own models, its own audience, and its own failure mode if you skip it.

Construction · how the graph gets made. The bank’s data already lives in SQL tables and JSON logs. You don’t throw that out. You write mappings: R2RML for the SQL side, RML for the JSON. The output is RDF — a pile of three-word sentences (subject-predicate-object) that describe everything.
Virtual access · keeping the data where it is. Materialising 100 million RDF triples in a triplestore is often a bad idea — latency, freshness, governance. OBDA (Ontology-Based Data Access) is the workaround: the agent asks in graph language, the system silently rewrites to SQL and runs it on the original sources.
Querying · three dialects of the same question. Once you have a graph (real or virtual), you need to ask it things. SPARQL is the W3C standard (think SQL for RDF). Cypher is what Neo4j popularised: visual, developer-friendly. ISO GQL (2024) is the new ISO standard for property graphs, built largely on Cypher’s pattern syntax.
Reasoning · the LLM walks the graph. An agent asks "who’s like Marcus?" — a question whose answer is a path through relationships, not a row in a table. GraphRAG is how an LLM uses the graph structure to ground its answer with citations instead of free-associating.
Probabilistic · not every fact is equally true. "Marcus is advised by Jane" might come from three different sources with three different confidences. Knowledge Vault (Google 2014) is how you keep all the evidence, score each fact, and let the agent know what to trust.
Embeddings · turn nodes into coordinates. TransE, DistMult, ComplEx, RotatE are different ways to turn every node and relationship into a vector. Once everything has coordinates, "similarity" becomes "distance." That’s how the agent finds "clients like Marcus."
Graph ML · learn from who you know. Node2Vec learns embeddings from random walks. GraphSAGE and GAT and R-GCN are graph neural networks — they predict things about a node from the structure around it. The bank uses this to predict churn risk, fraud, and lifetime value.

The seven phases · twelve models · what each contributes Cheat-sheet

A quick reference for what you’ll see across the seven phase-tabs.

Tab	Phase	Models	Everyday analogy	What it contributes
01	Construction	RDF · R2RML · RML	Three-word sentences	Turn relational/JSON data into a uniform triple graph
02	Virtual access	OBDA	Bilingual interpreter	Query the graph without materialising it
03	Querying	SPARQL · Cypher · GQL	Three dialects, same question	Express multi-hop questions over the graph
04	Reasoning	GraphRAG	Librarian who walks with you	Ground LLM answers in graph paths with citations
05	Probabilistic	Knowledge Vault	Not every fact equally true	Score and rank facts by source confidence
06	Embeddings	TransE · DistMult · ComplEx · RotatE	Coordinates on a map	Make similarity into distance for fast lookup
07	Graph ML	Node2Vec · GraphSAGE · GAT · R-GCN	Learn from who you know	Predict node properties from graph structure

How to read this lab. Each tab opens with its layman explainer at the top — five steps in plain English. Then a working demo on Marcus’s data. Then the formal technical detail. The same Marcus question — "find wealth clients whose advisors have managed at least one client similar to Marcus, ranked by churn risk" — flows through all seven tabs. By Tab 07 the agent has an answer with confidence intervals, citations, and a ranked list.

Tab 01 · Construction · RDF · R2RML · RML

A pile of three-word sentences that fits any data

The bank’s data already lives in SQL tables and JSON consent logs. You don’t throw that out. You write mappings that emit RDF triples — and you keep the source of truth where it is.

In Plain English

RDF is a pile of three-word sentences. Mappings turn your data into them.

Every fact in the bank — Marcus is a wealth client, Jane advises Marcus, Marcus holds £18.7M — can be written as a three-word sentence: subject, predicate, object. RDF is that format. R2RML and RML are how you generate those sentences from data you already have.

RDF is the lingua franca. Every fact gets reduced to (subject, predicate, object). Marcus → advisedBy → Jane. Marcus → hasAum → 18700000. Standardising on triples means anything that speaks RDF can read everything anyone else has written, forever. W3C 1999.
R2RML maps your SQL. Your customers table already exists. You write an R2RML mapping that says "each row of customers becomes a Customer node; the customer_id column becomes its URI; the advisor_id column becomes an advisedBy edge." No data movement, just a recipe. W3C 2012.
RML extends to anything. What about the JSON consent logs? The XML risk feeds? The CSV regulatory uploads? RML generalises R2RML beyond SQL — you can map any structured source into the same triple world. 2014, Ghent University.
Two modes for the same mapping. Run it as materialised — produce all the triples and load them into a triplestore. Or run it as virtual (next tab) — keep the data where it is, translate questions at query time. The mapping is the same; what changes is when you execute it.
The pay-off. Three years from now when the bank acquires another firm, all you need is a mapping from their tables to the same vocabulary. No data warehouse rebuild, no ETL project. The mapping IS the integration.

Generate RDF from Meridian’s SQL + JSON sources Interactive

Click ▶ to watch R2RML map the customer table and RML map the consent JSON. Each output is a triple in the unified graph.

Technical Detail

RDF, R2RML, and RML as W3C-grade mapping infrastructure

RDF (W3C 1999, RDF 1.1 in 2014). The Resource Description Framework models knowledge as a directed labelled multigraph of (subject, predicate, object) triples. Subjects are IRIs or blank nodes; predicates are IRIs; objects are IRIs, literals, or blank nodes. RDF graphs compose by union: combining two graphs requires no schema reconciliation, only IRI alignment.

R2RML (W3C 2012). The Relational-to-RDF Mapping Language declares how a relational schema is exposed as an RDF graph. Each rr:TriplesMap binds a logical SQL query (or table) to subject/predicate/object templates. Subject IRIs are built from primary keys; predicates come from a vocabulary; objects are columns or computed values. Reference implementations: Ontop, Morph-RDB, db2triples.
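The idea of a triples map can be sketched in a few lines of Python. This is not R2RML syntax (real mappings are written in Turtle and executed by an engine such as Ontop); it only mimics the subject-template plus predicate-object structure. The column names, the `http://example.org/meridian/` namespace, and the `map_row` helper are assumptions made for illustration.

```python
# Hypothetical rows from a `customers` table (column names are assumptions).
ROWS = [
    {"customer_id": "cust_m_7741", "name": "Marcus Aldridge", "advisor_id": "adv_jw_032"},
]

BASE = "http://example.org/meridian/"  # placeholder namespace, not from the source

# A triples map reduced to plain data: one subject template plus
# (predicate, object template, is_iri) entries.
TRIPLES_MAP = {
    "subject": BASE + "customer/{customer_id}",
    "class": BASE + "Customer",
    "predicate_objects": [
        (BASE + "name", "{name}", False),
        (BASE + "advisedBy", BASE + "advisor/{advisor_id}", True),
    ],
}

def map_row(row, tmap):
    """Yield (subject, predicate, object) triples for one relational row."""
    subject = tmap["subject"].format(**row)
    yield (subject, "rdf:type", tmap["class"])
    for predicate, template, is_iri in tmap["predicate_objects"]:
        value = template.format(**row)
        # IRIs stay bare; literals are quoted, N-Triples style.
        yield (subject, predicate, value if is_iri else f'"{value}"')

triples = [t for row in ROWS for t in map_row(row, TRIPLES_MAP)]
for s, p, o in triples:
    print(s, p, o)
```

The mapping is data, not code: pointing the same `TRIPLES_MAP` at a different set of rows yields triples in the same vocabulary, which is the property the integration argument relies on.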

RML (Ghent IDLab, 2014). The RDF Mapping Language extends R2RML to non-relational sources via the abstract rml:LogicalSource — supporting CSV (via ql:CSV), JSON (via ql:JSONPath), XML (via ql:XPath), and others. The same mapping vocabulary; the source iterator changes. Implementations: RMLMapper, SDM-RDFizer, Morph-KGC.
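A rough Python analogue of the "same mapping vocabulary, different source iterator" point: the only part that knows the source is JSON is the iterator, and everything after it emits ordinary triples. The consent-log shape, the field names, and the `iterate` / `map_consent` helpers are invented for this sketch; a real RML processor would evaluate a JSONPath expression such as `$.consents[*]` instead.

```python
import json

# Hypothetical consent log; the structure is an assumption for illustration.
CONSENT_LOG = json.loads("""
{"consents": [
  {"customer": "cust_m_7741", "scope": "accounts", "granted": true},
  {"customer": "cust_m_7741", "scope": "transactions", "granted": false}
]}
""")

BASE = "http://example.org/meridian/"  # placeholder namespace

def iterate(doc, key):
    """Stand-in for a logical source iterator over one JSON array."""
    return doc[key]

def map_consent(index, item):
    """Emit the triples for one consent record."""
    subject = f"{BASE}consent/{item['customer']}/{index}"
    return [
        (subject, BASE + "forCustomer", BASE + "customer/" + item["customer"]),
        (subject, BASE + "scope", json.dumps(item["scope"])),      # quoted literal
        (subject, BASE + "granted", json.dumps(item["granted"])),  # true / false
    ]

consent_triples = [
    t
    for i, item in enumerate(iterate(CONSENT_LOG, "consents"))
    for t in map_consent(i, item)
]
print(len(consent_triples), "triples")
```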

Materialised vs virtual execution. Materialisation produces a static RDF dump loadable into a triplestore (GraphDB, Stardog, Virtuoso, Blazegraph). Virtual execution rewrites SPARQL into source-native queries at runtime — the OBDA pattern of Tab 02. The W3C R2RML spec accommodates both via the same mapping language.

The 2024–25 successor: RML 2. A community-driven revision under the Knowledge Graph Construction W3C Community Group consolidates RML with FnO (Function Ontology) for transformations and RML-star for RDF-star. Backwards compatible with RML.

Citations. Manola F. & Miller E. (eds.) (2004). RDF Primer. W3C Recommendation. Das S., Sundara S., Cyganiak R. (eds.) (2012). R2RML: RDB to RDF Mapping Language. W3C Recommendation. Dimou A. et al. (2014). RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. LDOW Workshop.

Tab 02 · OBDA · Ontology-Based Data Access

The bilingual interpreter who never leaves your side

Materialising a 100-million-triple graph is often the wrong choice — too much data, too much latency, governance nightmares. OBDA leaves the data where it is. The agent asks in graph language. The system silently translates to SQL.

In Plain English

A virtual graph the bank never has to maintain

Imagine you wanted a graph database, but you don’t want to copy 100 million rows out of your existing systems. OBDA is the alternative: keep the data where it is, pretend there’s a graph, and translate on the fly.

The agent thinks the graph is real. It writes SPARQL queries against a clean ontological view: "find wealth clients with EU sov exposure > £5M." The agent doesn’t know what tables exist or where they live.
The translator runs at query time. Behind the scenes, an OBDA engine (Ontop, Mastro, Stardog) reads the SPARQL, consults the R2RML mappings, and rewrites it into one or more SQL queries against the actual sources. The graph is never built. The query is.
Why this is often the right answer. Materialising the bank’s KG would require an ETL pipeline, a triplestore, and a freshness strategy. With OBDA, the data is always current — because it’s queried at the source. No staleness. No copy. No governance fight about where the "real" data lives.
The trade-off. Some SPARQL queries don’t rewrite efficiently to SQL — anything involving deep inference, or transitive closures over millions of edges. For those, you do materialise (Tab 01). The wisdom is choosing per query, not globally.
The ontology is the contract. Both the agent and the engine agree on what classes and properties mean. That contract is OWL — the work the Ontology Lab covered. OBDA is where that ontology becomes operational over the bank’s actual data.

SPARQL → SQL rewrite for Marcus’s wealth profile Interactive

Click ▶ to watch a SPARQL query rewrite through the R2RML mappings into a federated SQL plan. No triples materialised.

Technical Detail

OBDA: query rewriting under the DL-Lite family

Foundations. Ontology-Based Data Access (Calvanese, De Giacomo, Lembo, Lenzerini, Rosati — DL-Lite series, 2007 onward) is the formal framework in which a SPARQL query Q over an OWL 2 QL ontology O is mechanically rewritten into a SQL query Q’ over the underlying sources, using R2RML mappings.

The chain. Step 1 — ontological reformulation: expand Q using O’s subsumption axioms to capture all entailed answers (PerfectRef algorithm). Step 2 — unfolding via mappings: replace each ontology term with the SQL template from R2RML. Step 3 — SQL optimisation: standard cost-based query optimisation by the underlying DBMS.
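The three steps can be illustrated with a deliberately small sketch using Python's built-in `sqlite3`. It handles only class-membership queries and subclass axioms, so it is far simpler than PerfectRef or a production engine; the table schema, the class names, and the `reformulate` / `unfold` / `answer` helpers are assumptions made for the example.

```python
import sqlite3

# Hypothetical source table; schema and rows are invented for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT, segment TEXT);
INSERT INTO customers VALUES ('cust_m_7741', 'wealth'), ('cust_r_0001', 'retail');
""")

# Ontology: subclass axioms (child -> parent).
SUBCLASS_OF = {"WealthClient": "Customer", "RetailClient": "Customer"}

# Mappings: ontology class -> SQL producing one column of instance ids.
MAPPINGS = {
    "WealthClient": "SELECT customer_id FROM customers WHERE segment = 'wealth'",
    "RetailClient": "SELECT customer_id FROM customers WHERE segment = 'retail'",
}

def reformulate(cls):
    """Step 1: the class plus every class it subsumes, directly or indirectly."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return sorted(found)

def unfold(classes):
    """Step 2: replace each mapped class by its SQL template, joined with UNION."""
    return " UNION ".join(MAPPINGS[c] for c in classes if c in MAPPINGS)

def answer(cls):
    sql = unfold(reformulate(cls))
    # Step 3: the DBMS plans and executes the rewritten query.
    return sorted(row[0] for row in conn.execute(sql))

print(answer("Customer"))      # instances entailed via the subclass axioms
print(answer("WealthClient"))
```

Note that `Customer` has no mapping of its own: its answers come entirely from reformulation, which is the entailment the ontology contributes.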

OWL 2 QL is the sweet spot. QL was designed for this. It is FO-rewritable: any QL ontology’s reformulation under PerfectRef produces a first-order (i.e. SQL-expressible) query. Going beyond QL (to EL or RL) sacrifices this property and forces partial materialisation or limited reasoning.

Federated execution. When the data lives in multiple sources, OBDA engines (Ontop especially) use SQL federation — JOIN across PostgreSQL FDW, Presto/Trino, or vendor-specific federation. The reformulated query becomes a federated plan with sub-queries dispatched per source.

Limits in practice. Aggregations, negation-as-failure, transitive closures with high cardinality, and queries requiring richer-than-QL reasoning fall outside OBDA’s guarantees. Production systems combine OBDA for the bulk of queries with selective materialisation for the rest — the so-called "hybrid" pattern.

Citations. Calvanese D., De Giacomo G., Lembo D., Lenzerini M., Rosati R. (2007). Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family. JAR 39(3). Xiao G. et al. (2020). The Virtual Knowledge Graph System Ontop. ISWC. Poggi A. et al. (2008). Linking Data to Ontologies. J. Data Semantics X.

Tab 03 · Querying · SPARQL · Cypher · ISO GQL

Three dialects of the same question

You have a graph. Someone needs to ask it things. There are three query languages — each optimised for a different audience and historical lineage. The good news: they ask the same questions with surprisingly similar shapes.

In Plain English

Same question, three dialects — pick the one your team reads

"Find Marcus’s advisor’s other clients." That’s a graph question — two hops, simple. SPARQL, Cypher, and ISO GQL are three ways to ask it. The choice usually comes down to which language your team and your tools already know.

SPARQL — the academic and standards lineage. W3C standard since 2008, current 1.1 from 2013. Triple-pattern matching with SQL-style SELECT/WHERE/FILTER. Universal across RDF stores. The right answer when the graph is RDF and the team is comfortable with semantic-web tooling. ?marcus :advisedBy ?advisor . ?advisor :advises ?client .
Cypher — the developer-friendly path syntax. Created by Neo4j 2011, opened as openCypher. Reads like ASCII art: (marcus)-[:ADVISED_BY]->(advisor)-[:ADVISES]->(client). Engineers find it intuitive without much teaching. The dominant property-graph language for the last decade.
ISO GQL — the 2024 ISO standard for property graphs. ISO/IEC 39075:2024, published April 2024. The first ISO graph-query standard ever. Borrows Cypher’s pattern syntax, adds SQL-style composability, designed to be vendor-portable the way SQL is for relational. It standardises the property-graph side only; SPARQL remains the standard for RDF. This is the language banks are likely to standardise on for property graphs through the late 2020s.
They’re more alike than you’d think. All three express multi-hop patterns, filters, optional branches, aggregations. The differences are syntactic taste, default semantics around path counting, and which engines speak them natively. The intellectual content of a query usually translates 1:1.
Pick by where the data lives. RDF triplestore? SPARQL. Neo4j or AuraDB? Cypher. New procurement in 2026+? GQL — because vendor lock-in is the actual long-term cost, and an ISO standard is the only durable answer to that.

"Find Marcus’s advisor’s other clients" · three languages Comparative

The same question expressed in all three. Same results: Heloise, David, Anya, and five others Jane advises.

Technical Detail

SPARQL 1.1, openCypher, and ISO GQL 2024 — capabilities compared

SPARQL 1.1 (W3C 2013). Pattern-matching over RDF triples with SELECT, ASK, CONSTRUCT, DESCRIBE query forms. Algebra grounded in relational algebra: BGP (basic graph pattern), JOIN, OPTIONAL, UNION, MINUS, FILTER, sub-queries, aggregation. Property paths (*, +, ?, /, |, ^) provide regex-style path expressions. Federation via SERVICE clauses. Solid theoretical foundation; verbose syntax.
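To make "basic graph pattern plus JOIN" concrete, here is a toy matcher in Python that evaluates the two triple patterns from the Marcus question over an in-memory set of triples. It is a teaching sketch, not how a SPARQL engine is implemented; the sample triples are a reduced, hypothetical subset of the running example, and `match_pattern` / `match_bgp` are names chosen for illustration.

```python
TRIPLES = {
    ("marcus", "advisedBy", "jane"),
    ("jane", "advises", "marcus"),
    ("jane", "advises", "heloise"),
    ("jane", "advises", "david"),
}

def is_var(term):
    return term.startswith("?")

def match_pattern(pattern, binding):
    """Yield every extension of `binding` that makes `pattern` match a triple."""
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # Bind the variable, or check it against its existing value.
                if new.setdefault(p_term, t_term) != t_term:
                    ok = False
                    break
            elif p_term != t_term:
                ok = False
                break
        if ok:
            yield new

def match_bgp(patterns):
    """Join triple patterns left to right on their shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match_pattern(pattern, b)]
    return bindings

# ?marcus :advisedBy ?advisor . ?advisor :advises ?client .
query = [("marcus", "advisedBy", "?advisor"), ("?advisor", "advises", "?client")]

# The final condition plays the role of FILTER(?client != :marcus).
others = sorted(b["?client"] for b in match_bgp(query) if b["?client"] != "marcus")
print(others)
```

The shared variable `?advisor` is what turns two independent patterns into a two-hop path; Cypher and GQL express the same join by drawing the path directly.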

openCypher (Neo4j 2011, opened 2015). Declarative pattern language: MATCH (a)-[r:KNOWS]->(b). Property graph model (nodes + edges with properties + labels). Strong path expressivity with variable-length matches [*1..3]. WITH-chained sub-queries enable functional composition. Powers Neo4j, Memgraph, AGE on Postgres.

ISO GQL (ISO/IEC 39075:2024, April 2024). The first ISO standard for graph queries, building on openCypher (Cypher syntax in the pattern fragment) plus SQL-style schema (CREATE GRAPH TYPE), composability (linear composition of statements chained with NEXT), and rigorous formal semantics (GPC theoretical foundation from Francis et al.). Implemented (or pending) in Neo4j, TigerGraph, Oracle, AnzoGraph, Memgraph.

What GQL adds. Standardised schema management (graph types as first-class), formal semantics over both bag and set models, path modes that restrict matches (WALK, TRAIL, SIMPLE, ACYCLIC) together with shortest-path selectors, normative procedures for path enumeration. The composability story (linear statement chaining) is the deepest improvement over openCypher.

SPARQL vs GQL — different data models. SPARQL is for RDF (triples, IRIs, no native labels-on-edges); GQL is for property graphs (typed nodes/edges with property maps). RDF-star and SPARQL-star (in progress) close the gap for edge properties; many systems offer both.

Choosing in 2026. Existing RDF / OBDA stack → SPARQL. Existing Neo4j footprint → Cypher. New build, multi-vendor procurement, regulator-replayable contracts → GQL. The Bank’s long-term answer is GQL; the short-term answer is "speak whichever the engine in front of you speaks."

Citations. Harris S. & Seaborne A. (eds.) (2013). SPARQL 1.1 Query Language. W3C Recommendation. Francis N. et al. (2018). Cypher: An Evolving Query Language for Property Graphs. SIGMOD. ISO/IEC 39075:2024. Information technology — Database languages — GQL. Deutsch A. et al. (2022). Graph Pattern Matching in GQL and SQL/PGQ. SIGMOD.

Tab 04 · GraphRAG · Graph-Augmented Retrieval for LLMs

A librarian who walks the library with you

Vanilla RAG fetches text chunks an LLM tries to stitch together. GraphRAG walks the structured graph — finding facts by following relationships, returning paths with edge-level citations. Different shape of retrieval. Different quality of answer.

In Plain English

Retrieval that walks the graph instead of skimming the index

A standard RAG system hands the LLM a stack of relevant text snippets and hopes it stitches them together. GraphRAG hands the LLM a path through a graph — a sequence of facts with explicit edges between them. The LLM stops free-associating and starts citing.

The question. Agent is asked: "who’s like Marcus?" A vanilla RAG would search Marcus’s profile text against a vector index and hope the matches are relevant. A GraphRAG asks the graph: "start at Marcus, walk along advisedBy, then advises, return the other clients."
Multi-hop traversal. The answer to a real question often requires walking 2-3 edges, not one. Marcus → advisor → other clients → with similar exposure. A vector index treats this as one big bag of text. GraphRAG treats it as a directed walk.
Community detection ahead of time. Microsoft’s GraphRAG (2024) precomputes communities — clusters of densely connected nodes — and summarises each. When the agent asks a broad question, the system retrieves community summaries first, then drills down. Faster and more focused than walking every edge live.
Every fact comes with a citation. The LLM’s answer references graph edges, not opaque "chunk #47". "Heloise is similar to Marcus because Jane Wong advises both [edge: jane-advises-heloise]." That edge is auditable, replayable, attributable.
What this gives the bank. An agent that answers "who’s like Marcus?" with a ranked list, each with a citation chain back to source facts, with the reasoning visible. That’s the difference between a tool the bank can put in front of a wealth manager and a tool it can’t.

"Who’s like Marcus?" · GraphRAG walks the graph Interactive

Click ▶ to watch a 3-hop traversal (Marcus → advisor → other clients → similar exposure profiles) with citations at every edge.

Technical Detail

GraphRAG mechanics — communities, traversal, prompt grounding

The pipeline.
(1) Query decomposition by an LLM: turn a natural-language ask into entity filters + relationship constraints + an information goal.
(2) Community lookup: identify which precomputed community (clusters found by modularity-based heuristics such as Leiden or Louvain) the entities belong to; retrieve community-level LLM-generated summaries.
(3) Multi-hop traversal: from seed entities, follow constrained edges with intersection-of-filters logic, collecting candidate-set + edge-provenance tuples.
(4) Optional ontological relaxation: if a strict filter intersection returns ∅, consult OWL subsumption (Tab 02 of the Ontology Lab) and soft-match through subclass / sub-property axioms.
(5) LLM synthesis with retrieval-evidence prompting: candidates and edges are formatted as structured context; the LLM produces an answer with explicit citations to graph node IDs and edge IDs.
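Steps (3) and (5) are the easiest to show in code. The sketch below walks a tiny edge list from a seed node, keeps the edge IDs it crossed, and formats the result as context lines an LLM could cite. Everything here is hypothetical: the edge IDs, the `hasExposure` relation, the placeholder `client_x`, and the helper names. No LLM call or community lookup is included.

```python
EDGES = [
    # (edge_id, source, relation, target); all ids are invented for the sketch
    ("e1", "marcus", "advisedBy", "jane"),
    ("e2", "jane", "advises", "heloise"),
    ("e3", "jane", "advises", "david"),
    ("e4", "heloise", "hasExposure", "eu_sovereign"),
    ("e5", "marcus", "hasExposure", "eu_sovereign"),
    ("e6", "david", "hasExposure", "eu_sovereign"),
    ("e7", "jane", "advises", "client_x"),
    ("e8", "client_x", "hasExposure", "us_equity"),
]

def follow(nodes_with_paths, relation):
    """Advance every partial path along edges carrying the given relation."""
    out = []
    for node, path in nodes_with_paths:
        for edge_id, src, rel, dst in EDGES:
            if src == node and rel == relation:
                out.append((dst, path + [edge_id]))
    return out

def similar_clients(seed):
    """Seed -> advisor -> other clients -> exposure shared with the seed."""
    seed_exposures = {dst for _, src, rel, dst in EDGES
                      if src == seed and rel == "hasExposure"}
    frontier = follow(follow([(seed, [])], "advisedBy"), "advises")
    results = []
    for client, path in frontier:
        if client == seed:
            continue
        for exposure, full_path in follow([(client, path)], "hasExposure"):
            if exposure in seed_exposures:
                results.append({"client": client, "citations": full_path})
    return results

def to_context(results):
    """Format candidates as structured, citable context lines for a prompt."""
    return "\n".join(
        f"{r['client']} [edges: {', '.join(r['citations'])}]" for r in results
    )

results = similar_clients("marcus")
print(to_context(results))
```

Because each candidate carries the full list of edges that produced it, the answer can be replayed against the graph without trusting the LLM's wording.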

Microsoft GraphRAG (2024). Open-source implementation from Microsoft Research; pioneered the community-summarisation pattern at scale. Designed for query-focused summarisation over corpora-derived KGs. Key insight: pre-computed community summaries are cheap to retrieve but rich in context, making them the right unit for global-scoped queries.

HippoRAG (2024). An alternative architecture that uses Personalized PageRank over the graph to score nodes by relevance to the query, returning a ranked set rather than walking from seeds. Strong on multi-hop QA benchmarks.

PathRAG (2025). Optimises retrieval by computing relevant paths (not nodes) and feeding the paths directly to the LLM. Closer to the structured-walk pattern; particularly useful when the answer requires explanation, not just lookup.

Composition with other phases. GraphRAG is downstream of Construction (the graph must exist), often consumes from a virtual graph via OBDA (Tab 02), and benefits enormously from embeddings (Tab 06) for similarity-aware traversal. Knowledge Vault (Tab 05) scores can be used to filter low-confidence facts before they reach the LLM’s context.

Citations. Edge D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research. Gutiérrez B.J. et al. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. NeurIPS. Chen B. et al. (2025). PathRAG: Pruning Graph-based Retrieval-Augmented Generation with Relational Paths.

Tab 05 · Knowledge Vault · Probabilistic KGs

Not every fact in the graph is equally true

"Marcus is advised by Jane" might come from CRM, from the call recording transcript, and from the regulatory filing. Three sources, three confidences. Knowledge Vault (Google 2014) is how you keep all the evidence and score each fact.

In Plain English

A graph where every edge has a confidence score

A regular knowledge graph treats every triple as definitely true. Reality is messier. Some facts come from authoritative systems; some are inferred from emails; some are extracted by LLMs from PDFs. Knowledge Vault attaches a confidence to each — and lets the agent decide what to trust.

Every fact gets a provenance record. Not just "Marcus advisedBy Jane." But "Marcus advisedBy Jane, extracted from CRM table customers.advisor_id, on 2026-04-15, confidence 0.99." And "Marcus advisedBy Jane, extracted from call transcript 2025-12-03, confidence 0.78." The graph remembers all witnesses.
Sources have a track record. The CRM has been right 99.4% of the time when audited. The LLM extracting from call transcripts has been right 84%. The vault keeps these source priors and uses them to weight new evidence.
Multiple weak signals add up. One LLM extraction saying "Marcus advisedBy Jane" is 0.78. Two independent extractions saying the same thing push the joint confidence higher. The vault does the Bayesian combination so the agent doesn’t have to.
The agent can threshold by confidence. For a regulatory report, set the threshold high — only facts with confidence ≥ 0.95 get included. For an exploratory "similar clients" search, set it lower — allow facts at 0.7 with the proviso the answer is flagged "low confidence."
Confidence flows through derivations. If "Marcus advisedBy Jane" has confidence 0.95, and "Jane advises Heloise" has 0.99, then, treating the two facts as independent, "Marcus and Heloise share an advisor" has roughly 0.95 × 0.99 ≈ 0.94. Every derived fact carries propagated confidence. This is what makes the agent’s answers defensibly calibrated.
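The arithmetic behind "weak signals add up" and "confidence flows through derivations" can be sketched as follows. The fusion rule shown is a naive-Bayes combination in log-odds space with a neutral 0.5 prior, which is one common simplification and an assumption of this example, not the specific model used by Knowledge Vault. Both functions assume the inputs are independent.

```python
import math

def fuse_independent(confidences, prior=0.5):
    """Combine per-source confidences for the SAME fact.

    Each source contributes its log-odds relative to the prior; this is
    only valid if the sources err independently (an assumption).
    """
    prior_logit = math.log(prior / (1 - prior))
    logit = prior_logit + sum(
        math.log(c / (1 - c)) - prior_logit for c in confidences
    )
    return 1 / (1 + math.exp(-logit))

def propagate(*confidences):
    """Confidence of a conclusion that needs ALL of several independent facts."""
    return math.prod(confidences)

# Two independent transcript extractions, each at 0.78.
two_transcripts = fuse_independent([0.78, 0.78])

# "Marcus advisedBy Jane" (0.95) and "Jane advises Heloise" (0.99).
shared_advisor = propagate(0.95, 0.99)

print(round(two_transcripts, 3), round(shared_advisor, 4))
```

Fusion raises confidence (agreeing witnesses), while propagation can only lower it (every extra premise is another chance to be wrong). If the sources are correlated, for example two extractions from the same transcript, the fused value overstates the evidence.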

Score the fact "Marcus advisedBy Jane" from three sources Interactive

Click ▶ to combine three independent extractions with different source-prior accuracies and see the fused confidence.

Technical Detail

Knowledge Vault — fact extraction at web scale with Bayesian fusion

The 2014 paper. Dong, Murphy, Gabrilovich, et al. (KDD 2014) introduced Knowledge Vault as a probabilistic KG containing 1.6 billion triples extracted at Google scale by four extractors (text, DOM trees, HTML tables, human-annotated pages). Per-extractor confidence features are fused by a supervised classifier (boosted decision stumps in the paper, which outperformed logistic regression) and calibrated into probabilities.

Fact scoring. For each (subject, predicate, object) candidate t, multiple extractors produce calibrated probabilities P_e(t | features_e). The vault fuses these into P(t | all evidence) using assumptions about extractor independence (or learned dependence). A prior derived from the existing knowledge graph (Freebase at the time) was added — facts already known are more likely to be true.

Pattern naturalised. While the original Knowledge Vault was not commercialised, the pattern dominates modern KG construction: NELL (CMU), Diffbot KG, Refinitiv PermID, ConceptNet 5.5, ClaimsKG. All maintain (fact, source, confidence, extraction-time) tuples and produce composite scores.

Modern variants. (i) Confidence-Aware KGE (CKRL, Xie et al. 2018) — embedding methods that incorporate fact confidence as a signal. (ii) UncertainKG embeddings (Chen et al. 2019, 2021) — learn vectors that encode both fact and confidence. (iii) Beta-Bernoulli calibration over extractor outputs to ensure published confidences match empirical accuracy.

Composition. In an open-banking KG: facts from CRM (high authoritative confidence) sit alongside facts from KYC transcripts (medium confidence, LLM-extracted) and from third-party data providers (variable). Knowledge Vault is the layer that keeps them all and lets downstream consumers (GraphRAG, Graph ML) choose thresholds appropriate to the use case.

Citations. Dong X. et al. (2014). Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. KDD. Xie R., Liu Z., Lin F., Lin L. (2018). Does William Shakespeare REALLY Write Hamlet? Knowledge Representation Learning with Confidence. AAAI. Chen X., Chen M., Shi W., Sun Y., Zaniolo C. (2019). Embedding Uncertain Knowledge Graphs. AAAI.

Tab 06 · Embeddings · TransE · DistMult · ComplEx · RotatE

Turn every node into coordinates on a map

Once Marcus, Jane, and every other client and product are points in a vector space, "similarity" becomes "distance." The four classical embedding families differ in how they model relationships — translation, multiplication, complex rotation — but the goal is the same.

In Plain English

A map where similar things sit close together

Imagine a giant map. Every node in the bank’s KG — Marcus, Jane, ISA, Portfolio, GBP — gets placed on it. Train the map so that connected things end up close. Now "find clients similar to Marcus" is just "find the nearest neighbours of Marcus’s point." That’s a KG embedding.

TransE — relationships are translations. Bordes et al. 2013. Picture every relationship as an arrow on the map. If Marcus → advisedBy → Jane, then position(Marcus) + vector(advisedBy) ≈ position(Jane). Train this for every triple in the graph. Simple, fast, intuitive, but it struggles with one-to-many relationships: if Jane advises both Heloise and David, the same arrow from Jane has to land on both points.
DistMult — relationships are scalings. Yang et al. 2015. Instead of adding, multiply element-wise. Score of a triple is the dot product weighted by the relationship vector. Handles symmetric relationships well; can’t natively express asymmetric ones (advisedBy ≠ advises).
ComplEx — go complex. Trouillon et al. 2016. Use complex-valued vectors instead of real ones. The asymmetric part of relationships now has a place to live (the imaginary component). Becomes a strict generalisation of DistMult, fixing its main weakness.
RotatE — relationships are rotations. Sun et al. 2019. Each relationship is a rotation in complex space, applied to the head entity to get the tail. Captures symmetric, antisymmetric, inverse, and composition patterns in one model. The most expressive of the four classical families.
Why the bank cares. Embeddings power link prediction (what facts are likely true but not recorded?), similarity search (clients like Marcus, products like the EU-sov fund), and recommendation (next-best-action). When the agent says "Marcus and Heloise are similar," it’s comparing their embedding vectors.

Embed the KG and rank similar clients to Marcus Interactive

Click ▶ to train a small TransE-style embedding on Marcus’s graph and rank the 8 other wealth clients by cosine similarity.

Technical Detail

Translation, bilinear, and complex-rotation embedding families

The KGE problem. Given a KG G = {(h, r, t)} of head-relation-tail triples, learn vectors e_h, e_t ∈ ℝ^d (or ℂ^d) for entities and w_r ∈ ℝ^d (or ℂ^d) for relations such that a scoring function f(h, r, t) is high for observed triples and low for negative samples. Trained typically with margin-based ranking loss or cross-entropy with sampled negatives.

TransE (Bordes et al. NeurIPS 2013). Scoring: f(h, r, t) = −‖e_h + w_r − e_t‖_{1 or 2}. Models relationships as translations in vector space. Limitation: cannot represent one-to-many, many-to-one, or many-to-many relations consistently. Variants — TransH, TransR, TransD — fix specific cases.
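A minimal sketch of the TransE scoring function with the L2 norm. The 2-d vectors are hand-picked to make the geometry visible; they are not trained embeddings, and the entity names are simply borrowed from the running example.

```python
import math

def transe_score(e_h, w_r, e_t):
    """f(h, r, t) = -||e_h + w_r - e_t||_2; closer to 0 means more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(e_h, w_r, e_t)))

# Hand-picked 2-d vectors, purely illustrative (not trained).
marcus = [0.0, 0.0]
advised_by = [1.0, 0.5]
jane = [1.0, 0.5]            # sits exactly at marcus + advised_by
other_advisor = [-1.0, 2.0]  # sits elsewhere on the map

good = transe_score(marcus, advised_by, jane)
bad = transe_score(marcus, advised_by, other_advisor)
print(good, bad)
```

The one-to-many limitation follows directly from the formula: a perfect score requires e_t = e_h + w_r, so two different tails of the same head and relation can both score perfectly only if their vectors coincide.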

DistMult (Yang et al. ICLR 2015). Scoring: f(h, r, t) = ⟨e_h, w_r, e_t⟩ = Σ_i e_{h,i} · w_{r,i} · e_{t,i}. Element-wise bilinear product. Inherently symmetric — cannot distinguish (h, r, t) from (t, r, h). Limits real-world applicability but very fast.

ComplEx (Trouillon et al. ICML 2016). Embeddings in ℂ^d. Scoring: f(h, r, t) = Re(⟨e_h, w_r, conj(e_t)⟩). The conjugation breaks symmetry; asymmetric relations can be modelled. Strictly generalises DistMult (real subspace recovers DistMult).
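The symmetry claim for DistMult and the way conjugation breaks it in ComplEx can be checked numerically with Python's built-in complex numbers. The vectors below are arbitrary illustrative values, not trained embeddings.

```python
def distmult(e_h, w_r, e_t):
    """f(h, r, t) = sum_i e_h,i * w_r,i * e_t,i (real vectors)."""
    return sum(h * r * t for h, r, t in zip(e_h, w_r, e_t))

def complex_score(e_h, w_r, e_t):
    """f(h, r, t) = Re(sum_i e_h,i * w_r,i * conj(e_t,i)) (complex vectors)."""
    return sum(h * r * t.conjugate() for h, r, t in zip(e_h, w_r, e_t)).real

# Arbitrary real vectors for DistMult.
h_real, r_real, t_real = [0.5, -1.0], [2.0, 0.3], [1.5, 0.8]

# Arbitrary complex vectors for ComplEx.
h_c = [0.5 + 1.0j, -1.0 + 0.2j]
r_c = [0.0 + 1.0j, 0.3 - 0.7j]
t_c = [1.5 - 0.5j, 0.8 + 0.4j]

# Swapping head and tail never changes a DistMult score...
print(distmult(h_real, r_real, t_real), distmult(t_real, r_real, h_real))
# ...but it does change a ComplEx score when the relation has an imaginary part.
print(complex_score(h_c, r_c, t_c), complex_score(t_c, r_c, h_c))

# With zero imaginary parts, ComplEx reduces to DistMult.
as_complex = lambda v: [complex(x) for x in v]
reduced = complex_score(as_complex(h_real), as_complex(r_real), as_complex(t_real))
```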

RotatE (Sun et al. ICLR 2019). Embeddings in ℂ^d with constraint |w_{r,i}| = 1. Scoring: f(h, r, t) = −‖e_h ⊙ w_r − e_t‖, where ⊙ is element-wise multiplication. Each relation is a rotation. Provably captures: symmetry, antisymmetry, inversion, composition. State-of-the-art in the classical family.
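A small numerical illustration of "each relation is a rotation". Relations are stored as phase angles so that every component has modulus 1 by construction. The L2 norm over complex differences is used here as one reasonable choice of norm, and the vectors and angles are made-up values for the sketch.

```python
import cmath
import math

def rotate_score(e_h, phases, e_t):
    """f(h, r, t) = -||e_h * w_r - e_t|| with w_r,i = exp(i * phase_i), |w_r,i| = 1."""
    w_r = [cmath.exp(1j * p) for p in phases]
    return -math.sqrt(sum(abs(h * w - t) ** 2 for h, w, t in zip(e_h, w_r, e_t)))

# Illustrative values only: advisedBy rotates by theta, advises by -theta.
theta = [math.pi / 2, math.pi / 3]
marcus = [1 + 0j, 0.5 + 0.5j]
jane = [m * cmath.exp(1j * p) for m, p in zip(marcus, theta)]

forward = rotate_score(marcus, theta, jane)                # marcus advisedBy jane
inverse = rotate_score(jane, [-p for p in theta], marcus)  # jane advises marcus
wrong_way = rotate_score(jane, theta, marcus)              # jane advisedBy marcus

print(forward, inverse, wrong_way)
```

Negating the phases gives the inverse relation for free, and applying the original rotation in the wrong direction scores poorly, which is the antisymmetry and inversion behaviour the paragraph above describes.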

Beyond the classical four. ConvE (2018) introduced convolutional decoders. TuckER (2019) generalises to tensor factorisation. QuatE (2019) uses quaternions. Graph neural networks (Tab 07) increasingly subsume KGE by jointly modelling structure and features.

Citations. Bordes A. et al. (2013). Translating Embeddings for Modeling Multi-relational Data. NeurIPS. Yang B. et al. (2015). Embedding Entities and Relations for Learning and Inference in Knowledge Bases. ICLR. Trouillon T. et al. (2016). Complex Embeddings for Simple Link Prediction. ICML. Sun Z. et al. (2019). RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space. ICLR.

Tab 07 · Graph ML · Node2Vec · GraphSAGE · GAT · R-GCN

Learn from who you know, not just who you are

The agent’s final task — rank wealth clients by churn risk — needs more than embeddings. It needs a model that learns from the structure around each node, plus their attributes. Graph neural networks were invented for this.

In Plain English

A model that predicts your behaviour from your neighbours’ behaviour

Suppose you want to predict whether Marcus will churn. His own attributes (age, AUM, tenure) help — but so does what his advisor’s other clients have done, and what their products have been doing. Graph ML is how you learn from both at once.

Node2Vec — random walks on the graph. Grover & Leskovec 2016. Start at a node, take a biased random walk. Treat the sequence of nodes you visit like a sentence; train word2vec on millions of these "sentences." Out pops an embedding for every node that reflects who they’re connected to. Fast, unsupervised, no node features needed.
GraphSAGE — sample-and-aggregate. Hamilton et al. 2017. For each node, look at its neighbours, aggregate their features, repeat. Now the node’s representation depends on its neighbourhood, not just itself. Crucially: works on graphs you’ve never seen before (inductive), so new clients can be scored without retraining.
GAT — pay attention to the right neighbours. Veličković et al. 2018. GraphSAGE averages all neighbours equally. GAT learns to weight them — which neighbour matters most for predicting THIS node’s churn? An attention mechanism per edge. This is what makes the model "explainable" — you can see which neighbours mattered most.
R-GCN — different relations matter differently. Schlichtkrull et al. 2018. The bank’s KG has many edge types: advisedBy, holdsProduct, livesIn, sameAdvisorAs. R-GCN has different weights per relation, so the model learns that "shares an advisor" matters differently from "lives in the same country" for predicting churn. Built for multi-relational graphs — exactly what a bank KG is.
The pay-off. A trained Graph ML model on the bank’s KG predicts churn risk for every client — using their attributes, their neighbours’ attributes, and the structure of their relationships. The final ranked answer to "find clients similar to Marcus, by predicted churn risk" lands here, with explainable edge-level attention.

Final answer — rank similar wealth clients by predicted churn Interactive · the whole question

The interactive demo assembles the complete answer to the agent’s question: similarity from embeddings (Tab 06), churn predictions from GraphSAGE, then ranking.

Technical Detail

Graph neural networks — message passing, sampling, attention

Node2Vec (KDD 2016). Second-order biased random walks parameterised by (p, q): p is the return parameter (a low p makes the walk more likely to step back to the node it just left), and q is the in-out parameter (q > 1 keeps the walk close to the previous node, BFS-like; q < 1 pushes it outward, DFS-like). The walks are fed as sequences to skip-gram with negative sampling to produce d-dimensional node embeddings. Its predecessor, DeepWalk (2014), uses unbiased random walks.
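A minimal sketch of one node2vec walk on an unweighted graph, using only the standard library. The toy graph and node names are hypothetical. The unnormalised transition weights are 1/p for returning to the previous node, 1 for moving to a node that is also a neighbour of the previous node, and 1/q for moving further out.

```python
import random

def node2vec_walk(adj, start, length, p, q, rng):
    """Generate one second-order biased walk of up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])          # sorted for reproducibility
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step has no previous node
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)    # return to where we came from
            elif nxt in adj[prev]:
                weights.append(1.0)        # stay within one hop of prev
            else:
                weights.append(1.0 / q)    # move outward, two hops from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Hypothetical toy graph: clients, an advisor, and a product.
adj = {
    "marcus": {"jane", "eu_sov_fund"},
    "jane": {"marcus", "heloise", "david"},
    "heloise": {"jane", "eu_sov_fund"},
    "david": {"jane"},
    "eu_sov_fund": {"marcus", "heloise"},
}
rng = random.Random(0)
walk = node2vec_walk(adj, "marcus", 8, p=0.5, q=2.0, rng=rng)
```

In a full pipeline, many such walks per node would be collected and passed to a skip-gram trainer; that step is omitted here.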

GraphSAGE (NeurIPS 2017). Inductive node representation learning. For each node v at layer k: sample a fixed-size neighbourhood, aggregate (mean / LSTM / pooling) the neighbours’ layer-(k-1) representations, concatenate the result with v’s own representation, multiply by a learned weight matrix, and apply a non-linearity. It is inductive because it learns aggregation functions rather than one embedding per node, so the trained model generalises to unseen nodes. Critical for production: new customers don’t require full retraining.
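The layer update can be sketched in plain Python with a mean aggregator. This is a simplified illustration, not the reference implementation: the weight matrix is fixed by hand rather than learned, sampling is without replacement, the output normalisation step is omitted, and node names and features are invented.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sage_layer(h, adj, W, sample_size, rng):
    """One mean-aggregator layer: h_v' = ReLU(W . [h_v || mean(h_u for sampled u)])."""
    out = {}
    for v, h_v in h.items():
        nbrs = sorted(adj[v])
        sampled = rng.sample(nbrs, min(sample_size, len(nbrs)))
        dim = len(h_v)
        if sampled:
            agg = [sum(h[u][i] for u in sampled) / len(sampled) for i in range(dim)]
        else:
            agg = [0.0] * dim                 # isolated node: nothing to aggregate
        out[v] = [max(0.0, z) for z in matvec(W, h_v + agg)]  # concat, project, ReLU
    return out

# Hypothetical 2-d node features and a hand-picked 2x4 weight matrix.
h = {"marcus": [1.0, 0.0], "jane": [0.0, 1.0], "heloise": [1.0, 1.0]}
adj = {"marcus": {"jane"}, "jane": {"marcus", "heloise"}, "heloise": {"jane"}}
W = [[0.5, -0.2, 0.3, 0.1],
     [-0.1, 0.4, 0.2, 0.6]]
rng = random.Random(0)
out = sage_layer(h, adj, W, sample_size=2, rng=rng)

# Inductive use: a new client appears and is scored with the same W.
h_new = dict(h, anya=[0.5, 0.5])
adj_new = {k: set(v) for k, v in adj.items()}
adj_new["anya"] = {"jane"}
adj_new["jane"].add("anya")
out_new = sage_layer(h_new, adj_new, W, sample_size=2, rng=rng)
```

The second call reuses W unchanged, which is the point of the inductive design: the parameters belong to the aggregation step, not to individual nodes.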

GAT (Graph Attention Network, ICLR 2018). Replaces uniform aggregation with learned attention weights: α_{ij} = softmax_j(a(W·h_i, W·h_j)), normalised over the neighbours j of node i, where a is a single-layer feedforward network (with LeakyReLU) over the concatenated transformed features. The node’s update is the attention-weighted sum of its neighbours’ transformed features. Multiple attention heads are used to stabilise training. Provides per-edge interpretability: you can show which neighbour received the most weight.
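A single attention head can be sketched as follows, using only the standard library. The weights W and a are hand-picked rather than learned, the node's self-loop is left out for brevity, and the node names and features are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_attention(h, i, nbrs, W, a):
    """alpha_ij = softmax over j of LeakyReLU(a . [W h_i || W h_j])."""
    z_i = matvec(W, h[i])
    logits = {}
    for j in nbrs:
        z_j = matvec(W, h[j])
        logits[j] = leaky_relu(sum(ak * zk for ak, zk in zip(a, z_i + z_j)))
    m = max(logits.values())                      # subtract max for numerical stability
    exps = {j: math.exp(e - m) for j, e in logits.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

def gat_update(h, i, nbrs, W, a):
    """New representation of i: attention-weighted sum of transformed neighbours."""
    alpha = gat_attention(h, i, nbrs, W, a)
    z = {j: matvec(W, h[j]) for j in nbrs}
    return [sum(alpha[j] * z[j][k] for j in nbrs) for k in range(len(W))]

# Hypothetical features and hand-picked parameters.
h = {"marcus": [1.0, 0.0], "jane": [0.0, 1.0],
     "heloise": [1.0, 1.0], "david": [0.5, -1.0]}
W = [[1.0, 0.5], [-0.5, 1.0]]
a = [0.3, -0.2, 0.8, 0.4]
nbrs = ["jane", "heloise", "david"]

alpha = gat_attention(h, "marcus", nbrs, W, a)
h_marcus_new = gat_update(h, "marcus", nbrs, W, a)
top_neighbour = max(alpha, key=alpha.get)   # the edge with the largest weight
```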

R-GCN (Relational GCN, ESWC 2018). Designed for multi-relational graphs (the natural form of an enterprise KG). Each relation type r has its own weight matrix W_r; the update aggregates normalised messages within each relation, sums across relations, and adds a self-connection term. Basis decomposition (each W_r expressed as a learned combination of a few shared basis matrices) keeps the parameter count tractable when there are many relations.
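A minimal sketch of one R-GCN layer with basis decomposition, in plain Python. It assumes triples of the form (node, relation, neighbour), where the node receives a message from the neighbour; inverse relations, which are usually added as extra relation types, are left out. Bases, coefficients, and features are hand-picked for illustration rather than learned.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def combine_bases(bases, coeffs_r):
    """W_r = sum_b coeffs_r[b] * bases[b]."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(c * B[i][j] for c, B in zip(coeffs_r, bases)) for j in range(cols)]
            for i in range(rows)]

def rgcn_layer(h, triples, bases, coeffs, W_self):
    """h_i' = ReLU(W_self h_i + sum_r mean over N_r(i) of W_r h_j)."""
    W = {r: combine_bases(bases, c) for r, c in coeffs.items()}
    out = {}
    for i, h_i in h.items():
        total = matvec(W_self, h_i)                      # self-connection
        for r, W_r in W.items():
            nbrs = [o for (s, rel, o) in triples if s == i and rel == r]
            for j in nbrs:
                msg = matvec(W_r, h[j])
                total = [t + m / len(nbrs) for t, m in zip(total, msg)]  # 1/|N_r(i)|
        out[i] = [max(0.0, t) for t in total]
    return out

# Two shared bases; three relations expressed as combinations of them.
bases = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
coeffs = {"advisedBy": [1.0, 0.0], "holdsProduct": [0.0, 1.0], "livesIn": [0.5, 0.5]}
W_self = [[1.0, 0.0], [0.0, 1.0]]

h = {"marcus": [1.0, 0.0], "jane": [0.0, 2.0], "eu_bond_fund": [3.0, 1.0]}
triples = [("marcus", "advisedBy", "jane"),
           ("marcus", "holdsProduct", "eu_bond_fund")]
out = rgcn_layer(h, triples, bases, coeffs, W_self)
```

With matrices this small the decomposition saves nothing; the saving appears when the number of relations is large relative to the number of bases and the hidden dimension is realistic.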

What banks use them for. Churn prediction (the demo). Fraud detection — graph features (rings of mutually transacting accounts) dominate single-node features. Anti-money-laundering — multi-hop transitive flow detection. Recommendations — next-best-product based on graph-similar customers. Risk modelling — counterparty exposure via beneficial-ownership graph.

Composition with the rest of the lab. Tab 01 builds the graph. Tab 02 makes it queryable virtually. Tab 03 queries it. Tab 04 lets an LLM walk it. Tab 05 scores facts. Tab 06 embeds it as static vectors. Tab 07 trains task-specific predictive models over the structure, and is often where most of the ROI of an enterprise KG shows up.

Citations. Grover A. & Leskovec J. (2016). node2vec: Scalable Feature Learning for Networks. KDD. Hamilton W., Ying R., Leskovec J. (2017). Inductive Representation Learning on Large Graphs. NeurIPS (GraphSAGE). Veličković P. et al. (2018). Graph Attention Networks. ICLR. Schlichtkrull M. et al. (2018). Modeling Relational Data with Graph Convolutional Networks. ESWC.