The fragmentation problem, made concrete
Three source databases. Three customers spread across them. No single source of truth. Every formal model below addresses one slice of this problem.
Every tier-1 bank lives with this. Core banking owns accounts. Cards is its own legacy system, often acquired. Wealth Management runs on a third platform with a different customer model. The same human appears in all three — under slightly different identities. The job of the knowledge graph is to know they're the same person, to know what each system says about them, and to know which assertion came from where.
- Tab 01 — Fellegi-Sunter (1969). Probabilistic record linkage. Watch the m-probabilities, u-probabilities, and log-likelihood ratio get computed for Sarah Chen's three records. See the match decision emerge from the math.
- Tab 02 — RIGOR (2025). Retrieval-augmented Iterative Generation of RDB Ontologies. Watch an LLM-style iterative process build OWL axioms from the three schemas, table by table, with provenance tags.
- Tab 03 — Provenance Semirings (Green-Karvounarakis-Tannen, 2007). Marcus's transaction propagates through a join. See the polynomial annotation track every source tuple that contributed.
- Tab 04 — The Unified KG. Everything composed. Entities resolved. Ontology applied. Provenance attached. Interactive graph.
Fellegi-Sunter on Sarah Chen
Three records. Same human? The 1969 model says: compute m and u probabilities per field, sum the log-likelihood ratios, apply the threshold. Watch it happen.
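For each comparable field, estimate m = P(agree | match) and u = P(agree | non-match). A field that agrees contributes log2(m/u); a field that disagrees contributes log2((1−m)/(1−u)), which is negative whenever m > u. Sum the contributions across fields → match weight. Compare to upper and lower thresholds to classify as MATCH, POSSIBLE, or NON-MATCH.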
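The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's implementation: the field names, the m and u values, the thresholds, and the record contents are all made-up assumptions, and field comparison is plain equality rather than fuzzy matching.

```python
import math

# Hypothetical m/u probabilities per field (illustrative, not estimated from data).
# m = P(field agrees | true match), u = P(field agrees | non-match)
FIELD_PARAMS = {
    "surname":       {"m": 0.95, "u": 0.01},
    "date_of_birth": {"m": 0.98, "u": 0.005},
    "postcode":      {"m": 0.90, "u": 0.02},
    "email":         {"m": 0.80, "u": 0.001},
}

# Assumed thresholds; in practice they are tuned to acceptable error rates.
UPPER_THRESHOLD = 10.0
LOWER_THRESHOLD = 0.0


def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(record_a, record_b):
    """Sum the per-field weights; fields missing on either side are skipped."""
    total = 0.0
    for field, p in FIELD_PARAMS.items():
        a, b = record_a.get(field), record_b.get(field)
        if a is None or b is None:
            continue  # one common convention: a missing value is uninformative
        total += field_weight(p["m"], p["u"], a == b)
    return total


def classify(weight):
    """Apply the two thresholds to get the three-way decision."""
    if weight >= UPPER_THRESHOLD:
        return "MATCH"
    if weight <= LOWER_THRESHOLD:
        return "NON-MATCH"
    return "POSSIBLE"


# Hypothetical records for illustration only.
core_banking = {"surname": "chen", "date_of_birth": "1988-04-12",
                "postcode": "10014", "email": "sarah.chen@example.com"}
cards = {"surname": "chen", "date_of_birth": "1988-04-12",
         "postcode": "10014", "email": "s.chen@example.com"}
stranger = {"surname": "okafor", "date_of_birth": "1975-11-02",
            "postcode": "94107", "email": "j.okafor@example.com"}

print(classify(match_weight(core_banking, cards)))
print(classify(match_weight(core_banking, stranger)))
```

Three agreeing fields outweigh the one email disagreement, so the first pair lands above the upper threshold. With every field disagreeing, the second pair's weight is strongly negative.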
Modern implementations: Splink applies Fellegi-Sunter at scale and estimates m and u via expectation-maximisation (no labels needed). Ditto fine-tunes a pre-trained transformer such as BERT on labelled record pairs. Both end at the same place: a pairwise match score that downstream clustering turns into a canonical entity.
RIGOR — schemas to OWL, iteratively
The 2025 model: an LLM iterates table by table, retrieves from domain ontologies (FIBO, schema.org), and builds an OWL 2 DL ontology with provenance-tagged delta fragments. Watch it work.
For each table: retrieve schema + domain ontology + growing core ontology → prompt Gen-LLM → produce delta-ontology fragment → Judge-LLM validates → merge into core. Iterate following foreign-key constraints.
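The control flow of that loop can be sketched without any real model calls. Everything here is a stand-in: the schema, the domain terms, and the `stub_gen_llm` and `stub_judge_llm` functions are hypothetical placeholders for the Gen-LLM and Judge-LLM. Visiting referenced tables before the tables that point at them is one assumed reading of "following foreign-key constraints".

```python
from graphlib import TopologicalSorter

# Hypothetical schema: table -> columns and the tables its foreign keys reference.
SCHEMA = {
    "customer":    {"columns": ["customer_id", "name"], "references": []},
    "account":     {"columns": ["account_id", "customer_id"], "references": ["customer"]},
    "transaction": {"columns": ["txn_id", "account_id", "amount"], "references": ["account"]},
}

# Hypothetical domain-ontology terms standing in for FIBO / schema.org lookups.
DOMAIN_TERMS = ["Customer", "Account", "Payment"]


def table_order(schema):
    """Order tables so that referenced tables come before referencing ones."""
    graph = {table: set(info["references"]) for table, info in schema.items()}
    return list(TopologicalSorter(graph).static_order())


def retrieve_context(table, schema, domain_terms, core):
    """Stand-in retrieval: simple name match instead of a real vector search."""
    hits = [t for t in domain_terms if t.lower() == table.lower()]
    return {"table": table, "columns": schema[table]["columns"],
            "domain_terms": hits, "core_axioms": [a["axiom"] for a in core]}


def stub_gen_llm(context):
    """Placeholder Gen-LLM: emits one provenance-tagged class axiom per table."""
    cls = context["table"].capitalize()
    return [{"axiom": f"Class: {cls}", "provenance": f"table:{context['table']}"}]


def stub_judge_llm(delta, core):
    """Placeholder Judge-LLM: accepts only axioms not already in the core."""
    known = {a["axiom"] for a in core}
    return [a for a in delta if a["axiom"] not in known]


def build_ontology(schema, domain_terms, gen_llm=stub_gen_llm, judge_llm=stub_judge_llm):
    """Retrieve -> generate delta fragment -> validate -> merge, table by table."""
    core = []
    for table in table_order(schema):
        context = retrieve_context(table, schema, domain_terms, core)
        delta = gen_llm(context)
        core.extend(judge_llm(delta, core))
    return core


ontology = build_ontology(SCHEMA, DOMAIN_TERMS)
for axiom in ontology:
    print(axiom["axiom"], "<-", axiom["provenance"])
```

Passing the generator and judge in as parameters keeps the loop testable. A real pipeline would swap the stubs for model calls and emit OWL rather than strings.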
R2RML (W3C 2012) — declarative relational-to-RDF mapping language. You write the mapping. RML (Ghent 2014) — extends R2RML to CSV, JSON, XML. RIGOR (2025) — the LLM writes the mapping AND the ontology, iteratively. Same end-state (OWL ontology + RDF instances), three orders of magnitude less human effort.
Marcus Aldridge's transaction → BCBS report
Green-Karvounarakis-Tannen (PODS 2007) — track which source tuples contributed to every derived fact, as a polynomial over a semiring. Watch the algebra propagate.
Annotate each source tuple with a variable. Join (⊗) multiplies. Union (⊕) adds. The result is a polynomial that captures HOW each output tuple was derived — and lets you compute trust, probability, multiplicity by evaluating the same polynomial in different semirings.
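A small sketch makes the "same polynomial, different semirings" idea concrete. It assumes one transaction tuple `t1` that joins with two account tuples `a1` and `a2`, with both join results projected onto the same report row; the tuple names and the trust scores are illustrative assumptions. Polynomials are stored as a `Counter` from monomial to coefficient.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


def var(name):
    """Annotation of a single source tuple: the polynomial 'name'."""
    return Counter({(name,): 1})


def plus(p, q):
    """Union / projection: alternative derivations add."""
    return p + q


def times(p, q):
    """Join: derivations that are used together multiply."""
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


def evaluate(poly, valuation, sr):
    """Map each variable to a semiring value and evaluate the polynomial there."""
    total = sr.zero
    for monomial, coeff in poly.items():
        term = sr.one
        for v in monomial:
            term = sr.mul(term, valuation[v])
        for _ in range(coeff):  # coefficient n means the term is added n times
            total = sr.add(total, term)
    return total


COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
TRUST = Semiring(0.0, 1.0, max, lambda a, b: a * b)  # max-times, values in [0, 1]

# Report row = (t1 JOIN a1) UNION (t1 JOIN a2)  ->  t1*a1 + t1*a2
t1, a1, a2 = var("t1"), var("a1"), var("a2")
report_row = plus(times(t1, a1), times(t1, a2))

print(report_row)
print(evaluate(report_row, {"t1": 1, "a1": 1, "a2": 1}, COUNTING))            # number of derivations
print(evaluate(report_row, {"t1": True, "a1": True, "a2": False}, BOOLEAN))   # survives deleting a2?
print(evaluate(report_row, {"t1": False, "a1": True, "a2": True}, BOOLEAN))   # survives deleting t1?
print(evaluate(report_row, {"t1": 0.9, "a1": 0.8, "a2": 0.5}, TRUST))         # best-derivation trust
```

The polynomial is computed once. Each question (how many derivations, does the row survive a deletion, how trustworthy is the best derivation) is just a different valuation and a different semiring.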
The semiring polynomial is the algebraic layer. W3C PROV is the standard representation: three node types (Entity, Activity, Agent) and seven core relations in PROV-DM (wasGeneratedBy, used, wasInformedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf). Many lineage tools emit events that are PROV-compatible or can be mapped to PROV.
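To show how PROV statements answer a lineage question, here is a minimal sketch that stores them as plain (subject, relation, object) triples. It does not use a PROV library or a real serialization, and every identifier (`report_row`, `join_job`, `etl_service`, and so on) is hypothetical. Only the relation names come from PROV.

```python
# Each statement is (subject, relation, object), using PROV relation names.
STATEMENTS = [
    ("report_row", "wasGeneratedBy", "join_job"),       # Entity   <- Activity
    ("join_job", "used", "txn_t1"),                      # Activity -> Entity
    ("join_job", "used", "account_a1"),
    ("join_job", "wasAssociatedWith", "etl_service"),   # Activity -> Agent
    ("report_row", "wasDerivedFrom", "txn_t1"),         # Entity   -> Entity
    ("report_row", "wasDerivedFrom", "account_a1"),
    ("txn_t1", "wasAttributedTo", "cards_system"),      # Entity   -> Agent
    ("account_a1", "wasAttributedTo", "core_banking"),
]


def objects_of(subject, relation, statements):
    """All objects linked to `subject` by `relation`."""
    return [o for s, r, o in statements if s == subject and r == relation]


def sources_of(entity, statements):
    """Follow wasDerivedFrom transitively back to entities with no further derivation."""
    parents = objects_of(entity, "wasDerivedFrom", statements)
    if not parents:
        return {entity}
    found = set()
    for parent in parents:
        found |= sources_of(parent, statements)
    return found


def responsible_agents(entity, statements):
    """Agents that the source entities of `entity` are attributed to."""
    agents = set()
    for source in sources_of(entity, statements):
        agents.update(objects_of(source, "wasAttributedTo", statements))
    return agents


print(sorted(sources_of("report_row", STATEMENTS)))
print(sorted(responsible_agents("report_row", STATEMENTS)))
```

The derivation edges carry the same information as the variables in the semiring polynomial, while the attribution edges say which system stands behind each source tuple.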
The Unified Knowledge Graph
Entities resolved (Fellegi-Sunter). Ontology applied (RIGOR). Provenance attached (Semirings + PROV). The graph below is the composition of every model above, on Meridian Bank's actual data.
Toggle the layers to see how each formal model contributes a different facet of the same knowledge graph. Click any node to see its provenance and attestations.
Three real people. Three formal models. One queryable graph. Every claim in the graph is attributable to a source row, defended by an ontology axiom, and traceable through PROV. That is the academic contract.