Tab 00 · The Enterprise Data Model

The model the architects built — now make it operational

The EDM is the substrate. Six entities (Customer, Account, Position, Instrument, Transaction, RiskModel) declared in three layers — conceptual, logical, physical. This lab demonstrates the seven metadata concerns that turn this static model into a regulator-replayable operating system.

Probabilistic spine of this tab: Inter-rater agreement (Cohen's κ for two raters, Fleiss' κ for three or more) measures how strongly the architects agree on the canonical definition of each entity. On the Landis-Koch (1977) scale, κ above 0.80 is "almost perfect agreement"; below 0.60 the model is contested and must be reconciled before going operational.

The six entities · three layers · one worked example · Interactive

Click any entity in the left rail or any tab in the top nav to see how that concern operates on this EDM. Below: the canonical Customer entity rendered across all three layers, with its inter-rater agreement score.

Conceptual

Customer

A legal or natural person
holding one or more accounts
at the bank.

Relationships:
· holds Account (1..*)
· generates Transaction (0..*)
· may own Position (0..*)

Logical

Customer

customer_id : ID
full_name : string
dob : date
kyc_status : {V,P,R}
total_aum : decimal
is_dormant : bool (derived)

Physical

3 source systems

CORE.customers
· CUST_ID, NAME, DOB, KYC_STAT

CARDS.cardholders
· holder_id, name, birth_date

WEALTH.clients
· wm_client, full_nm, aum_gbp

Inter-rater agreement on Customer definition · 4 architects

Architect	"Customer is..."	Includes prospects?	Includes ex-customers?
A1 (Retail)	natural person with account	NO	YES (closed < 7 yrs)
A2 (Wealth)	legal/natural person with portfolio	NO	YES
A3 (Compliance)	any KYC'd party	YES	YES
A4 (Cards)	holds at least one card	NO	NO (purged at 2 yrs)
Fleiss' κ =			0.43 · MODERATE (must reconcile)

Until the architects converge on a single Customer definition, downstream metadata matching, rules, and lineage will inherit the ambiguity. The EDM is operationally fragile when κ < 0.60.
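As a rough illustration of how a Fleiss' κ of this kind could be computed, the sketch below works from a table of category counts per item. The rating matrix is hypothetical and is not the data behind the 0.43 figure above; the function name `fleiss_kappa` is also an assumption, not something the lab defines.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who put
    item i in category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must be rated by the same number of raters")
    total = n_items * n_raters
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items        # mean observed agreement
    p_exp = sum(p * p for p in p_cat)    # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


# HYPOTHETICAL data: 4 architects answer 3 yes/no scoping questions.
# Columns are [YES votes, NO votes].
ratings = [[4, 0], [0, 4], [2, 2]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))  # 0.556
```

Two unanimous questions and one evenly split question land just under the 0.60 reconciliation threshold, which shows how a single contested scoping decision can pull the score down.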

What the lab demonstrates · 8 tabs · the metadata operating system

00 · EDM

Cohen's κ · Fleiss' κ

The canonical model + how confidently the architects agree on each entity definition.

01 · Metadata Matching

Cupid · COMA · TaBERT · Bayesian composite

Logical attributes → physical columns across 3 sources. Calibrated posterior via logistic regression.

02 · Ontology

OWL · DL · Markov Logic Networks

Soft axioms: "customer with txn freq < 0.1/yr is dormant with p=0.92."

03 · Business Rules

DMN · SBVR · Datalog · Kaplan-Meier · Cox PH

Decision tables + survival analysis under the dormancy rule. Time-to-event statistics.

04 · Data Flow

Gane-Sarson · M/M/1 queues · Markov chains

DFD with realistic queueing & stage-dropout probabilities for the pipeline.

05 · Identity Lineage

Hash chains · Bloom filters

Copy-through lineage: deterministic. Bloom-filter scale at bounded false-positive rate.

06 · Derivation Lineage

Provenance Semirings · Monte Carlo · Sensitivity

BCBS line 4.2 = formula over inputs. MC for confidence interval. ∂y/∂x for dominant input.

07 · Lineage Graph

Bayesian network · PageRank · Isolation Forest

Both lineage types composed. Conditional probability tables. Criticality scoring.

The thesis. An enterprise data model is static — boxes and lines on a diagram. Metadata is what makes it operational: matched to reality, semantically grounded, rule-governed, flow-modelled, and lineage-traceable with calibrated uncertainty. Every tab below is one piece of that operational layer.

Tab 01 · Metadata Matching

The logical model says `total_aum` — three systems disagree

The EDM declares Customer.total_aum : decimal(GBP). CORE has nothing called that. CARDS has credit_limit_gbp. WEALTH has aum_gbp. STAGING has client_assets_under_mgmt. Five matchers vote, weights are learned, posterior is calibrated.

Cupid 2001

COMA / COMA++ 2002-05

Valentine 2021

TaBERT 2020

LogMap (Oxford)

Logistic regression (weight learning)

Bayesian composite (log-odds)

Beta-Bernoulli (calibration)

Probabilistic spine of this tab: Each matcher emits a score in [0,1]. The composite is computed as a weighted log-odds sum (Bayesian evidence combination under conditional independence). Weights w_k are learned via logistic regression on labelled match/non-match pairs. The posterior P(match | scores) is then calibrated via the Beta-Bernoulli model, turning raw scores into probabilities you can threshold defensibly.

Match Customer.total_aum across 4 systems · Interactive

Four candidate columns from four physical systems. Each is scored by 5 matchers, combined as Bayesian log-odds, and calibrated to a posterior probability of being the correct logical-to-physical match.

Why a Bayesian composite · Log-odds = additive evidence

Each matcher's score s_k is treated as evidence. Convert to a log-likelihood ratio LLR_k = log(P(s_k | match) / P(s_k | ¬match)). Under conditional independence the LLRs add. Weights w_k correct for known dependence and matcher reliability — learned from training pairs by logistic regression.

// Composite log-odds
LLR(candidate) = Σ_k w_k · LLR_k(s_k)

// Posterior probability via sigmoid
P(match | scores) = 1 / (1 + exp(-LLR))

// Beta-Bernoulli calibration on a holdout set
α, β fit so that P_calibrated ≈ empirical accuracy
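The composite above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: each LLR_k is approximated by the logit of the matcher's raw score (a fuller version would estimate P(s_k | match) and P(s_k | ¬match) from labelled pairs), and the weights and scores are hypothetical, not values learned in the lab.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a score, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def posterior_match(scores, weights, bias=0.0):
    """Weighted log-odds composite followed by a sigmoid.

    ASSUMPTION: LLR_k(s_k) is approximated by logit(s_k).
    """
    llr = bias + sum(weights[name] * logit(s) for name, s in scores.items())
    return 1.0 / (1.0 + math.exp(-llr))


# HYPOTHETICAL weights and matcher scores for two candidate columns.
weights = {"cupid": 0.6, "coma": 0.5, "valentine": 0.4, "tabert": 0.9, "logmap": 0.3}
aum_gbp = {"cupid": 0.82, "coma": 0.78, "valentine": 0.70, "tabert": 0.91, "logmap": 0.66}
credit_limit = {"cupid": 0.35, "coma": 0.40, "valentine": 0.30, "tabert": 0.22, "logmap": 0.45}

p_aum = posterior_match(aum_gbp, weights)
p_credit = posterior_match(credit_limit, weights)
print(f"aum_gbp: {p_aum:.3f}  credit_limit_gbp: {p_credit:.3f}")
```

Because evidence adds in log-odds space, a matcher that scores exactly 0.5 contributes nothing, and the weights control how far each matcher can move the posterior.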

Calibration: is "0.91" really 91% likely to match? · Reliability diagram · Beta-Bernoulli posterior

A model is well calibrated if, among predictions made with confidence 0.91, about 91% turn out to be correct. The Beta-Bernoulli posterior smooths raw scores using held-out empirical accuracy. The lab demo computes the calibration curve on a 20-pair holdout.
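One way to read the Beta-Bernoulli step is as a per-bin posterior mean over held-out outcomes. The sketch below assumes a uniform Beta(1, 1) prior and equal-width bins; both choices, the function name, and the sample data are illustrative assumptions rather than the lab's actual settings.

```python
def calibrate_bins(predictions, labels, n_bins=5, alpha=1.0, beta=1.0):
    """Per-bin Beta-Bernoulli posterior mean of the match rate.

    predictions: raw posteriors in [0, 1]; labels: 1 = true match, 0 = not.
    Returns a list of (bin_low, bin_high, n, calibrated_probability).
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for p, y in zip(predictions, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        totals[b] += 1
        hits[b] += y
    rows = []
    for b in range(n_bins):
        # Posterior is Beta(alpha + hits, beta + misses); report its mean.
        mean = (alpha + hits[b]) / (alpha + beta + totals[b])
        rows.append((b / n_bins, (b + 1) / n_bins, totals[b], mean))
    return rows


# HYPOTHETICAL holdout: ten pairs scored 0.91, nine of them true matches.
rows = calibrate_bins([0.91] * 10, [1] * 9 + [0])
top_bin = rows[-1]
print(top_bin)
```

With only ten pairs in the bin, the prior pulls the raw 9/10 hit rate toward 0.5, so the calibrated value is 10/12 rather than 0.90. On a 20-pair holdout that shrinkage is noticeable.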

Tab 02 · Ontology & Soft Reasoning

From hard axioms to weighted rules

Some rules are absolute: "every Account belongs to exactly one Customer." Others hold with probability: "a customer with txn frequency < 0.1/yr is dormant — with confidence 0.92." Classical DL reasons with the first kind. Markov Logic Networks reason with both.

OWL 2 DL (W3C 2012)

RDFS

SROIQ · Description Logic

RIGOR 2025

OntoGPT 2023

FIBO / BIAN

Markov Logic Networks (Richardson-Domingos 2006)

Probabilistic DL (P-SROIQ · Lukasiewicz)

Probabilistic spine of this tab: Markov Logic Networks (Richardson & Domingos 2006) attach a real-valued weight to each first-order rule. A world's probability is proportional to exp(Σ w_i · n_i(world)), where n_i counts how many groundings of rule i are satisfied. Hard rules are MLN rules with infinite weight; soft rules get finite weights learned from data. The reasoner returns marginal probabilities, not just true/false.

Derive Account.is_dormant with confidence · Probabilistic Reasoner

Three rules combine — a hard cardinality constraint, a soft frequency-based dormancy heuristic, and a soft "balance still nonzero ⇒ probably-not-dormant" override. The MLN reasoner returns P(is_dormant = true) for each of three test accounts.

OWL hard axioms · the deterministic spine · SROIQ semantics

Before any soft reasoning, the ontology declares the inviolable structure. Customer ⊑ LegalPerson. Account ⊑ ∃ heldBy.Customer with cardinality 1..1. Position ⊑ ∃ instrumentOf.Instrument. These are non-negotiable.

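@prefix ob:   <http://meridian.bank/edm/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
# fibo: is assumed to be bound to the FIBO namespace elsewhere.

ob:Customer rdfs:subClassOf fibo:LegalPerson .

# Exactly one heldBy link to a Customer (qualified cardinality).
ob:Account rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ob:heldBy ;
    owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ;
    owl:onClass ob:Customer
] .

ob:Position rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ob:instrumentOf ;
    owl:someValuesFrom ob:Instrument
] .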

MLN soft rules · the weighted overlay · Richardson-Domingos 2006

Where data is noisy or rules are heuristic, attach weights. Higher weight = stronger preference. The lab uses the three rules below for the dormancy derivation. The MLN compiles to a Markov Random Field; inference uses MC-SAT or Gibbs sampling.

// MLN rules for Account.is_dormant (weights in log-odds)
w=2.3   txn_freq(a, f) ∧ f < 0.1                 ⇒ dormant(a)
w=2.0   balance(a, b) ∧ b > 1000                 ⇒ ¬dormant(a)
w=∞     heldBy(a, c) ∧ heldBy(a, c2) ∧ c ≠ c2    ⇒ ⊥     // hard cardinality

// World probability
P(world) ∝ exp(Σ w_i · n_i(world))
// where n_i = count of satisfied groundings of rule i
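For a single account the world-probability formula can be evaluated exactly by enumerating the two possible worlds (dormant or not). The sketch below does that for the two soft rules only; it assumes each test account already has exactly one holder, so the hard cardinality rule is omitted. It is an enumeration for illustration, not the MC-SAT or Gibbs inference a real MLN engine would run.

```python
import math

# Soft rules from the text; each lambda is True when the grounding is satisfied.
RULES = [
    (2.3, lambda w: (not w["low_freq"]) or w["dormant"]),            # low freq => dormant
    (2.0, lambda w: (not w["high_balance"]) or (not w["dormant"])),  # balance > 1000 => not dormant
]


def p_dormant(low_freq, high_balance):
    """Exact marginal P(dormant) for one account, given its evidence."""
    score = {}
    for dormant in (True, False):
        world = {"low_freq": low_freq, "high_balance": high_balance, "dormant": dormant}
        # exp of the summed weights of all satisfied rules in this world
        score[dormant] = math.exp(sum(w for w, rule in RULES if rule(world)))
    return score[True] / (score[True] + score[False])


for evidence in [(True, False), (True, True), (False, True)]:
    print(evidence, round(p_dormant(*evidence), 3))
```

When both soft rules fire and disagree, the answer is driven by the weight difference (2.3 versus 2.0), so the marginal sits only slightly above one half instead of flipping to a hard verdict.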

Tab 03 · Business Rules

"When is an account dormant?" — a rule, a survival curve, a verdict

DMN captures the business decision in a table. SBVR captures the rule in structured English. Datalog evaluates the recursion. But the threshold itself — "365 days" — is a statistical claim about time-to-event. Kaplan-Meier and Cox PH calibrate it.

DMN 2015

SBVR 2008

Datalog 1977+

RuleML

OPA / Rego 2018

Kaplan-Meier survival (1958)

Cox proportional hazards (1972)

Logistic regression

Naive Bayes

Probabilistic spine of this tab: A dormancy rule is implicitly a time-to-event statement. The Kaplan-Meier estimator (1958) gives the survival function S(t) = P(no transaction by time t). Cox proportional hazards (1972) extends this with covariates (balance, product type, customer segment). The "365 days" threshold in DMN should not be folklore; it should be the t* where S(t*) drops below some chosen survival probability. The lab estimates S(t) on a synthetic cohort of 500 accounts and locates the 90% survival cutoff.
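To make the estimator concrete, here is a small pure-Python Kaplan-Meier sketch. The "event" is the account's next transaction, and accounts still silent when observation ends are censored. The four-account cohort and the helper names are hypothetical; the lab's 500-account cohort is not reproduced here.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t).

    durations: days until the next transaction, or until observation ended.
    observed:  True if the transaction was seen, False if censored.
    Returns [(t, S(t))] at each distinct event time.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    curve, s = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        s *= 1 - events / at_risk
        curve.append((t, s))
    return curve


def first_time_below(curve, level):
    """Earliest event time at which S(t) has fallen to `level` or lower."""
    for t, s in curve:
        if s <= level:
            return t
    return None


# HYPOTHETICAL cohort (days); the third account is censored at day 2.
curve = kaplan_meier([1, 2, 2, 3], [True, True, False, True])
print(curve)
```

A threshold then becomes a lookup such as `first_time_below(curve, level)` for whatever survival level the business chooses, instead of a hand-picked number of days.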

The dormancy rule · DMN + SBVR + Datalog + survival curve · Interactive

Three classical rule languages express the same business rule. The Kaplan-Meier survival curve below shows what "365 days" actually means in probability terms. Click ▶ to compute the curve on a cohort.

DMN · decision table

Decision Table · Account_Dormancy
┌────────────────────┬──────────────┬───────────┐
│ days_since_last_tx │ balance      │ Result    │
├────────────────────┼──────────────┼───────────┤
│ < 90               │ –            │ ACTIVE    │
│ 90 - 365           │ > 1000       │ ACTIVE    │
│ 90 - 365           │ ≤ 1000       │ AT_RISK   │
│ > 365              │ –            │ DORMANT   │
└────────────────────┴──────────────┴───────────┘
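The decision table translates directly into a small function. The sketch below assumes the ranges are inclusive at 90 and 365 days, as the row labels suggest, and the function name is illustrative. Because the rows do not overlap, the choice of DMN hit policy does not change the result.

```python
def classify_account(days_since_last_tx, balance):
    """Evaluate the Account_Dormancy table row by row."""
    if days_since_last_tx < 90:
        return "ACTIVE"
    if days_since_last_tx <= 365:
        # 90 to 365 days: the balance decides between ACTIVE and AT_RISK.
        return "ACTIVE" if balance > 1000 else "AT_RISK"
    return "DORMANT"


for days, balance in [(10, 0), (90, 2000), (365, 500), (366, 5000)]:
    print(days, balance, classify_account(days, balance))
```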

SBVR · structured English

// SBVR-SE
It is necessary that each account is classified as dormant
if its days-since-last-transaction exceeds 365.

// Datalog equivalent
dormant(A) :- account(A), last_tx(A, T), days_since(T, D), D > 365.

Cox proportional hazards · why one threshold is too coarse · Cox 1972

A single "365 days" cutoff treats every account the same. Cox PH lets the hazard rate depend on covariates: h(t | x) = h₀(t) · exp(β·x). A high-AUM wealth client and a £50 student savings account should not share one threshold. The Cox model returns segment-specific cutoffs.

// Cox PH fit on the same 500-account cohort with 3 covariates
covariate          β         hazard_ratio        p-value
─────────────────  ───────   ─────────────────   ────────
balance_gbp        -0.0008   0.9992 per £1       < 0.001
product=current     0.42     1.52 vs savings     0.003
segment=wealth     -1.18     0.31 vs retail      < 0.001

// Segment-specific 90%-survival cutoffs (days)
retail · current:  310 days
retail · savings:  420 days
wealth · any:      720 days
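The hazard-ratio column is just exp(β), and ratios for combined covariates multiply. The sketch below recomputes them from the coefficients in the table; the dictionary keys and the helper name are assumptions made for illustration.

```python
import math

# Coefficients copied from the table above; the key names are hypothetical.
BETA = {"balance_gbp": -0.0008, "product_current": 0.42, "segment_wealth": -1.18}


def relative_hazard(x, baseline=None):
    """exp(beta . (x - baseline)): hazard of x relative to the baseline account."""
    baseline = baseline or {}
    return math.exp(sum(BETA[k] * (x.get(k, 0) - baseline.get(k, 0)) for k in BETA))


print(round(relative_hazard({"segment_wealth": 1}), 2))    # 0.31
print(round(relative_hazard({"product_current": 1}), 2))   # 1.52
print(round(relative_hazard({"balance_gbp": 1000}), 3))    # effect of an extra £1,000
```

A hazard ratio below 1 means the next transaction arrives more slowly, which is why the wealth segment ends up with the longest cutoff.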

When to use what

DMN

Tables, business-owned

Use when the rule has clear discrete inputs and the business needs to maintain it without engineering.

SBVR

Natural-language rules

Use when regulators or auditors read the rule. Structured English bridges legal and machine.

Datalog / OPA

Executable recursion

Use when rules chain (eligibility depends on eligibility…). Datalog evaluation is guaranteed to terminate; recursive SQL carries no such guarantee.

Kaplan-Meier · Cox

Calibrate the threshold

Use when the rule has a numeric threshold (days, amount, count). The number should come from data, not folklore.

Tab 04 · Data Flow Diagram

From source row to BCBS line — five stages, queueing all the way

A DFD shows data moving through the pipeline: source → staging → enrichment → mart → report. Gane-Sarson notation is the visual language. M/M/1 queueing tells you what each stage costs in latency. Markov chains tell you the probability a row reaches the final report at all.

Gane-Sarson DFD (1979)

Yourdon DFD

OpenLineage 2020

W3C PROV-DM

Apache Beam dataflow model

M/M/1 queueing (Erlang/Kendall)

Markov chains

Little's law

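Probabilistic spine of this tab: A pipeline stage is a queue: rows arrive at rate λ and the stage processes them at rate μ. For λ < μ the M/M/1 formulas give utilisation ρ = λ/μ, mean number of rows in the stage L = ρ/(1−ρ), and mean time in the stage W = 1/(μ−λ). The pipeline as a whole is a Markov chain over states {staged, enriched, marted, reported, errored}, with reported and errored as absorbing states. The probability that a row reaches the final state is an absorption probability of that chain, given by (I − Q)⁻¹R, where Q holds the transitions among transient states and R the transitions into absorbing states. (The dominant eigenvector gives the long-run stationary distribution, which is not the quantity needed here.) This is what tells you "your pipeline drops 4% of rows somewhere between staging and the mart" before you build it.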
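The M/M/1 quantities are simple enough to compute by hand, but a small helper makes the stability condition explicit. The rates below are hypothetical, and the function name is an assumption.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 metrics; requires arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: need lambda < mu")
    rho = arrival_rate / service_rate
    L = rho / (1 - rho)                     # mean rows in the stage
    W = 1 / (service_rate - arrival_rate)   # mean time a row spends in the stage
    return {"rho": rho, "L": L, "W": W}


# HYPOTHETICAL stage: 800 rows/s arriving, 1000 rows/s capacity.
m = mm1_metrics(800, 1000)
print(m)
```

Little's law ties the two numbers together: L = λ · W, so at 80% utilisation roughly four rows are in the stage at any moment and each spends about 5 ms there. Pushing utilisation toward 1 makes both grow without bound.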

The BCBS 239 pipeline · DFD + queue stats · Interactive

Five stages from source CDC to final regulatory report. Each is a Gane-Sarson process box. The Markov chain over stages is computed live; the bar chart shows where rows actually end up.
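Where rows end up can be sketched by pushing probability mass through the chain until it is all absorbed in a terminal state. The per-stage pass rates below are hypothetical, and the state names follow the text; for a chain with retry loops the result is an approximation bounded by `steps`.

```python
def absorption_probabilities(transitions, start, steps=50):
    """Push probability mass through the chain until it is absorbed.

    transitions: {state: {next_state: probability}}; a state with no
    entry is absorbing. Returns {absorbing_state: probability}.
    """
    mass = {start: 1.0}
    absorbed = {}
    for _ in range(steps):
        nxt = {}
        for state, p in mass.items():
            if state not in transitions:
                absorbed[state] = absorbed.get(state, 0.0) + p
                continue
            for target, q in transitions[state].items():
                nxt[target] = nxt.get(target, 0.0) + p * q
        mass = nxt
        if not mass:
            break
    return absorbed


# HYPOTHETICAL pass rates; "reported" and "errored" are absorbing.
pipeline = {
    "staged":   {"enriched": 0.99, "errored": 0.01},
    "enriched": {"marted": 0.98, "errored": 0.02},
    "marted":   {"reported": 0.99, "errored": 0.01},
}
result = absorption_probabilities(pipeline, "staged")
print(result)
```

With these rates about 96% of rows reach the report and about 4% are lost, the kind of figure the tab describes estimating before the pipeline is built.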

Gane-Sarson notation · 1979 · still the de-facto DFD standard

Four shapes — rounded rectangle = process, open rectangle = data store, square = external entity (source or sink), labelled arrow = data flow. Same grammar at every level; you can decompose a process into a sub-DFD without changing the language.

⌐ Process

rounded rectangle

Transforms incoming data flows into outgoing ones. Numbered hierarchically (1, 1.1, 1.2…).

▭ Data store

open rectangle

Persistent holding of data between processes. Labelled D1, D2…

□ External entity

square

Source or sink outside the system. CORE_BANKING is external to the catalog pipeline.

→ Data flow

labelled arrow

Movement of data. Always labelled with what flows. Arrows have semantics — not decoration.

Tab 05 · Identity Lineage

Copy-through lineage — every value, hash-verified

The value in report.eu_sov_balance = £2.5M didn't change as it moved. It was copied. Identity lineage proves the value at the end equals the value at the source, byte for byte. Deterministic. Hash-verifiable. No probability.

Field-level lineage

W3C PROV-DM

Atlan / Purview / SQLLineage parsers

OpenLineage column-level

Cryptographic hash chains (SHA-256)

Bloom filters (Bloom 1970)

Probabilistic spine of this tab: Identity lineage itself is deterministic: values either match by hash or they don't. But at scale (10⁹ rows × 5 stages = 5×10⁹ identity checks per day), the catalog uses a Bloom filter to answer "did this exact value pass through this stage?" in O(1) with a bounded false-positive rate. The Bloom filter can say yes when the answer is no, but it never says no when the answer is yes. With m bits, k hash functions, and n inserted values, FPR ≈ (1 − e^(−kn/m))^k. The lab demonstrates a properly sized Bloom filter at FPR < 1%.

Trace core_banking.balance → report.eu_sov_balance · Interactive

Click ▶ to animate the hash chain across five stages. Each stage emits a SHA-256 of the value being passed through. Any mutation between stages breaks the chain and is reported.
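A hash chain of this kind can be sketched with `hashlib`. The chaining scheme (each digest covers the previous digest plus the value's canonical string) is an assumption made for illustration; the lab only states that each stage emits a SHA-256 of the value. The stage values are hypothetical.

```python
import hashlib


def stage_digest(value, previous_digest=""):
    """SHA-256 over the previous digest plus the value's canonical bytes."""
    payload = (previous_digest + "|" + value).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_chain(values):
    """One digest per stage, each chained to the stage before it."""
    chain, prev = [], ""
    for v in values:
        prev = stage_digest(v, prev)
        chain.append(prev)
    return chain


def first_broken_stage(values, recorded_chain):
    """Index of the first stage whose digest no longer matches, else None."""
    for i, (got, want) in enumerate(zip(build_chain(values), recorded_chain)):
        if got != want:
            return i
    return None


STAGES = ["source", "staging", "enrichment", "mart", "report"]
clean = ["2500000.00"] * 5                      # value copied through unchanged
recorded = build_chain(clean)
tampered = clean[:3] + ["2500000.01"] + clean[4:]

print(first_broken_stage(clean, recorded))      # None
print(STAGES[first_broken_stage(tampered, recorded)])  # mart
```

Chaining the digests means a mutation is reported at the first stage where it occurs, and every later digest also fails, so a break cannot be hidden by a later stage restoring the value.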

Bloom filter for scale · "did this value pass through stage X?" · Bloom 1970

For ad-hoc "where did this number come from?" queries across 10⁹ rows, the catalog cannot replay the full hash chain on demand. It maintains a Bloom filter per stage: k hash functions write into m bits. A query "is value v in stage S's set?" checks all k bits → if all 1, "probably yes"; if any 0, "definitely no."
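A per-stage Bloom filter can be sketched as below. The sizing uses the standard formulas m = −n ln p / (ln 2)² and k = (m/n) ln 2, and the k bit positions are derived by double hashing from a single SHA-256 digest. Both are common choices but are assumptions here, as are the class and method names and the sample values.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, target_fpr):
        # Standard sizing: m = -n ln p / (ln 2)^2, k = (m / n) ln 2.
        self.m = math.ceil(-expected_items * math.log(target_fpr) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, value):
        # Double hashing: k positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # All k bits set -> "probably yes"; any bit clear -> "definitely no".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def expected_fpr(m, k, n):
    """FPR ~ (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


bf = BloomFilter(expected_items=1000, target_fpr=0.01)
for i in range(1000):
    bf.add(f"value-{i}")

print(bf.m, bf.k, round(expected_fpr(bf.m, bf.k, 1000), 4))
```

Because k must be a whole number, the achieved rate lands near the target rather than exactly on it; sizing for a slightly tighter target is one way to stay strictly under a 1% bound.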

Why identity lineage stays deterministic

For copy-through values, uncertainty would be a bug. If the report says the customer's balance is £2.5M and the source says £2.5M, the chain must prove byte-equality through every stage. Hashing is the right tool; Bloom filters are just an indexing optimisation. Derivation lineage (next tab) is where probability genuinely enters.

Tab 06 · Derivation Lineage

BCBS line 4.2 = £875K — by what formula, with what confidence?

The report cell is not copied from anywhere; it's derived by a SQL expression over many inputs. Lineage must capture the formula itself, propagate trust through it as a provenance polynomial, run Monte Carlo to get a confidence interval, and rank inputs by sensitivity so an analyst knows which one dominates the uncertainty.

Provenance Semirings · Green-Karvounarakis-Tannen PODS 2007

Why/Where/How provenance

SQL parse trees

Expression trees

Sensitivity analysis ∂y/∂x

Monte Carlo simulation

Trust semiring [0,1]

Probabilistic spine of this tab: A derived value y = f(x₁, x₂, …, xₙ) inherits uncertainty from every input. Three complementary tools: (1) the provenance polynomial in the trust semiring [0,1] propagates per-input trust to a single output trust; (2) Monte Carlo simulation with N draws from each input's distribution produces an empirical distribution of y (mean, p5, p95); (3) sensitivity analysis via partial derivatives ∂y/∂xᵢ ranks which input the catalog should worry about most. The trio gives the regulator a confidence interval AND a "which input would I improve first to tighten it" answer.
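Tools (2) and (3) can be sketched on a toy derivation with two inputs. Everything numeric here is an assumption: the normal input distributions, their spreads, and the single-position formula stand in for the real aggregate, and the function names are illustrative.

```python
import random
import statistics


def rwa(notional_gbp, rwa_pct):
    """Toy derivation y = f(x1, x2): one position's risk-weighted amount."""
    return notional_gbp * rwa_pct


def monte_carlo(n_draws=10_000, seed=42):
    """Draw both inputs from ASSUMED normal distributions and summarise y."""
    rng = random.Random(seed)
    ys = sorted(
        rwa(rng.gauss(2_500_000, 25_000), rng.gauss(0.35, 0.02))
        for _ in range(n_draws)
    )
    return {
        "mean": statistics.fmean(ys),
        "p5": ys[int(0.05 * n_draws)],
        "p95": ys[int(0.95 * n_draws)],
    }


def sensitivities(x, f, rel_step=1e-6):
    """Central finite-difference estimate of dy/dx_i at the point x."""
    grads = {}
    for name, value in x.items():
        h = abs(value) * rel_step or rel_step
        up = dict(x, **{name: value + h})
        down = dict(x, **{name: value - h})
        grads[name] = (f(**up) - f(**down)) / (2 * h)
    return grads


point = {"notional_gbp": 2_500_000, "rwa_pct": 0.35}
summary = monte_carlo()
grads = sensitivities(point, rwa)
print(summary)
print(grads)
```

Raw derivatives are in different units, so a fair ranking scales each one by its input's spread: under these assumed spreads, 0.35 × £25,000 for the notional is far smaller than £2.5M × 0.02 for the risk weight, which makes the risk weight the input to tighten first.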

The derivation · BCBS 4.2 EU Sovereign Exposure RWA · Interactive

Click ▶ to walk the four steps: capture the formula, build the polynomial, run Monte Carlo, rank by sensitivity.

-- SQL parsed at write time · captured as derivation lineage
SELECT i.counterparty_country,
       SUM(p.notional_gbp * r.rwa_pct) AS rwa
FROM positions p
JOIN instruments i ON i.instrument_id = p.instrument_id
JOIN risk_model r ON r.instrument_type = i.type
WHERE i.is_sovereign = true
  AND i.counterparty_country IN ('DE', 'FR', 'IT', 'ES', 'NL')
GROUP BY i.counterparty_country;
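For the join part of this query, a provenance polynomial can be evaluated in a trust semiring over [0,1]. The sketch below assumes the Viterbi-style choice (product for joint use, max for alternative derivations), which is one common reading of "trust semiring"; the lab does not say which operations it uses. The trust scores are hypothetical, and the SUM aggregation itself is outside what this sketch covers.

```python
from functools import reduce


def join_trust(*trusts):
    """Joint use of inputs (semiring 'times'): multiply their trust scores."""
    return reduce(lambda a, b: a * b, trusts, 1.0)


def alt_trust(*trusts):
    """Alternative derivations (semiring 'plus'): keep the most trusted one."""
    return max(trusts, default=0.0)


# HYPOTHETICAL per-table trust scores for the three joined tables.
TRUST = {"positions": 0.99, "instruments": 0.95, "risk_model": 0.90}

# One joined row needs a row from all three tables: polynomial p * i * r.
row_trust = join_trust(TRUST["positions"], TRUST["instruments"], TRUST["risk_model"])
print(round(row_trust, 5))  # 0.84645
```

Under this reading the least trusted input caps the output, so raising the trust of `risk_model` would improve the derived cell more than polishing an already reliable positions feed.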

Why derivation lineage MUST carry uncertainty · Regulator-replayable confidence

Identity lineage (Tab 05) is binary: bytes match or they don't. Derivation lineage is continuous: the output is a function of inputs each with their own trust score. A regulator asking "how confident are you in this £875K?" needs more than "the SQL ran." They need (i) the formula, (ii) the input trust scores, (iii) the propagated confidence interval, (iv) the dominant input. Provenance semirings + Monte Carlo + sensitivity give all four.

Tab 07 · The Lineage Graph

Both lineage types · one graph · Bayesian-network reasoning

Identity lineage (Tab 05) and derivation lineage (Tab 06) compose into one DAG. Each node carries a conditional probability table over its parents. The graph answers: "what is P(BCBS report correct | risk_model is stale)?" — a Bayesian-network query, not a stack of arrows.

Composition of tabs 05 + 06

Property-graph model

SPARQL / Cypher / GQL queryable

Bayesian network (Pearl 1988)

PageRank centrality

Isolation Forest (Liu et al. 2008)

Probabilistic spine of this tab: The lineage graph is a directed acyclic graph; with conditional probability tables on each node it becomes a Bayesian network (Pearl 1988). Three queries the BN answers natively: (1) marginal P(report correct) under steady-state input quality; (2) conditional P(report correct | one specific input degraded), the "what-if" query a regulator asks; (3) most likely explanation: "if the report is wrong, where is the failure most likely?" In addition, PageRank-style centrality identifies critical paths (the nodes whose removal cascades to many downstream reports), and Isolation Forest flags lineage paths whose flow pattern deviates from historical behaviour.

The full lineage graph · 10 nodes · 13 edges · Interactive

Click any node to see its conditional probability table, PageRank centrality, and current anomaly score. Toggle the layers below.

Identity edges

Derivation edges


Why Bayesian, why now

A static lineage diagram says "X feeds Y." A Bayesian-network lineage graph says "P(Y correct | X degraded) = 0.42." That number is what a regulator actually asks for, and what an operations team needs to triage on the worst day. Once each node has a CPT (learned from historical pipeline runs and incidents), the entire DAG becomes a probabilistic query engine.
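The three query types can be sketched on a three-node network by brute-force enumeration. The structure (two inputs feeding one report node), the priors, and every CPT entry below are hypothetical; in the lab they would come from historical runs and incidents, and the full graph has more nodes than this.

```python
# HYPOTHETICAL priors and CPT, chosen only for illustration.
P_FRESH = 0.95    # P(risk_model is fresh)
P_POS_OK = 0.98   # P(positions feed is clean)
P_CORRECT = {     # P(report correct | fresh, pos_ok)
    (True, True): 0.99,
    (True, False): 0.40,
    (False, True): 0.42,
    (False, False): 0.05,
}


def joint(fresh, pos_ok, correct):
    """Joint probability of one full assignment of the three nodes."""
    p = (P_FRESH if fresh else 1 - P_FRESH) * (P_POS_OK if pos_ok else 1 - P_POS_OK)
    pc = P_CORRECT[(fresh, pos_ok)]
    return p * (pc if correct else 1 - pc)


def query(target, evidence=None):
    """P(target | evidence) by summing the joint over consistent worlds."""
    evidence = evidence or {}
    num = den = 0.0
    for fresh in (True, False):
        for pos_ok in (True, False):
            for correct in (True, False):
                world = {"fresh": fresh, "pos_ok": pos_ok, "correct": correct}
                if any(world[k] != v for k, v in evidence.items()):
                    continue
                p = joint(fresh, pos_ok, correct)
                den += p
                if all(world[k] == v for k, v in target.items()):
                    num += p
    return num / den


marginal = query({"correct": True})
what_if = query({"correct": True}, {"fresh": False})
blame = query({"fresh": False}, {"correct": False})
print(round(marginal, 4), round(what_if, 4), round(blame, 4))
```

The same function answers the forward "what-if" question and the backward "where did it most likely fail" question; only the roles of target and evidence swap. Enumeration is fine at this size, while a graph with many nodes would need a proper inference algorithm.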

The thesis of this lab. An enterprise data model is structure. Metadata makes the structure operational — matched probabilistically (Tab 01), grounded semantically (Tab 02), governed by data-calibrated rules (Tab 03), flow-modelled with queue dynamics (Tab 04), identity-traced deterministically (Tab 05), derivation-traced with confidence intervals (Tab 06), and composed into a Bayesian network for principled what-if reasoning (Tab 07). Every probabilistic claim is named, every formal model is cited, and every demo runs the math live on this single Meridian Bank substrate.
