The Running Example

Will Marcus churn? Predict without training.

The same Marcus from the Knowledge Graph Lab. The same graph. Now ask six different zero-shot or pre-trained models to predict his churn risk, and compare them honestly to an XGBoost model that took 12 weeks to build.

Marcus Aldridge

Wealth client · age 54 · cust_m_7741

AUM: £18.7M
Tenure: 11 years
Last txn: 47 days ago
Balance trend: −18% (6mo)
Advisor: Jane Wong

The graph (from KG Lab)

Built once, queried forever

Nodes: 8.4M
Edges: 47M
Entity types: 12
Relations: 23

The honest baseline

What XGBoost would deliver

Build time: 12 weeks
Labels needed: 200K
Eng team: 3 FTEs
Accuracy: ~88%

The agent’s question

"Will Marcus churn in the next 90 days?" — answered by every model in this lab, side by side, today.

Progress through the lab

Overview · why zero-shot at all

Foundation Models · the doctor analogy

Kumo · relational FM at scale

GNNs · PyG · GraphSAGE · GAT

PGN · graph-native prediction

PyTorch Frame · the bridge

RelBench · honest compare

Tab 00 · Overview

Predict without training — when it works, and when it doesn’t

The headline claim of zero-shot ML on knowledge graphs: skip the 12-week training cycle and get predictions today. The honest reading: it works in 2026, with calibrated trade-offs against classical ML. This lab covers both sides.

In Plain English

The new claim: predict without training. The honest read: when it actually works.

For decades, doing ML in a bank meant a 12-16 week cycle per use case — features, labels, training, validation, deployment. Zero-shot ML says you can skip most of that, because a foundation model has already done the heavy learning across millions of other relational datasets. The question for a bank: when is that actually true?

The traditional path. Need to predict churn? Hire a team. Engineer 30 features. Label 200,000 customers. Train XGBoost. Validate. Deploy. Monitor. Retrain quarterly. One model, one use case, three months minimum. The bank lives this every time it wants a new prediction.
The zero-shot path. Use a foundation model — something pre-trained on millions of relational datasets across industries. Hand it your knowledge graph (which you already built, see the KG Lab). Ask "will Marcus churn?" Get an answer today, with reasoning, with confidence. No features engineered. No labels needed. No training.
Why this is possible at all. The foundation model has seen what churn looks like across a thousand other banks (and telcos, and retailers). It has learned that "declining balance + no recent deposits + single account + advisor turnover" tends to predict churn — regardless of which bank’s data the question is asked of. The pattern transfers. That’s the whole bet.
What you give up. Often 2-4% accuracy versus a fully-trained classical model. Sometimes more. The model doesn’t know your bank’s idiosyncratic patterns — only the universal patterns. For some use cases that gap is fine; for credit scoring it’s not. The honest framing is "calibrated trade-off," not "magic."
What this lab does. Six tabs: five model families plus a benchmark. Foundation Models · Kumo · GNNs · PGN · PyTorch Frame · RelBench. Each tab predicts Marcus’s churn risk on the same data, the same day. By Tab 06 you can compare all six models side by side with an XGBoost model built over 12 weeks, and decide where zero-shot wins, where it loses, and which use case is which.

The six models · what each contributes · Cheat-sheet

A quick reference for what you’ll see across the six tabs.

Tab	Model	Year	Everyday analogy	What it does
01	TabPFN	2022-25	The doctor who read every chart on Earth	Tabular foundation model · in-context learning · no fine-tuning
02	Kumo · KumoRFM	2025	The consultant who has worked everywhere	Enterprise relational foundation model + PQL query language
03	PyG · GraphSAGE · GAT	2017-18	Learn from who you know	Graph neural networks — the open-source workhorses
04	PGN	research	Built for "what will this node do next?"	Predictive Graph Networks — task-specific GNN architectures
05	PyTorch Frame	2024	The bridge to graphs	Relational deep learning · tabular features + graph context
06	RelBench · vs XGBoost	2024	The honest scoreboard	Open benchmark + the head-to-head versus classical ML

How to read this lab. Each tab opens with the layman explainer (five steps, plain English). Then a working demo predicting Marcus’s churn risk on the same data. Then the technical detail with citations and honest limitations. Tab 06 puts every model side by side with XGBoost, and you decide which use case goes to which.

Tab 01 · The Foundation Model Idea · TabPFN · In-Context Learning

A doctor who has read every chart on Earth

Before any product or vendor: the conceptual breakthrough. A model can be pre-trained on millions of synthetic tabular datasets, then make predictions on a brand-new dataset it has never seen, purely by analogy. TabPFN (2022-25) is the canonical example.

In Plain English

A doctor who has seen every medical chart on Earth

A new doctor at your hospital cannot diagnose well — they need months to learn your patients. A doctor who has read every medical chart on Earth doesn’t need that residency. They’ve seen your case a thousand times in other forms. That is what a foundation model is, applied to tabular and relational data.

Train once, on everything. A foundation model like TabPFN is pre-trained on millions of synthetic tabular datasets covering every shape of prediction problem imaginable. Different number of features, different distributions, different relationships, different outcomes. It learns the shape of how features predict outcomes, not any one specific outcome.
Now show it your data. You hand it a small dataset, say 100 labelled customers with churn outcomes. You don’t fine-tune it. You don’t train it. You just show it the examples in the prompt, like you would to a human consultant. This is called in-context learning.
Then ask about Marcus. "Here are 100 labelled customers and their churn outcomes. Here is Marcus. Will he churn?" The model pattern-matches Marcus against its prior of millions of similar predictions, conditioned on your 100 examples. Out comes a probability. No training. No fine-tuning. Same day.
Why this is not magic. The model is good at this because the pattern "declining balance + reduced activity = churn" shows up across every retail, telco, SaaS, and banking dataset in its pre-training corpus. Marcus is a new face but his shape is familiar. What it cannot do is learn your bank’s idiosyncratic patterns that aren’t in any other dataset. Those still need classical ML.
Where this came from. TabPFN (Hollmann, Müller, Eggensperger and Hutter; preprint 2022, published at ICLR 2023; v2 in 2025) was the first compelling demonstration that a transformer pre-trained on synthetic tabular tasks could match tuned XGBoost on small real-world tabular benchmarks, without per-task training. That paper is the conceptual ancestor of the foundation-model approaches in this lab.

Predict Marcus’s churn with TabPFN-style in-context learning · Interactive

Click ▶ to show the model 5 labelled customers, then ask it about Marcus. No training. No fine-tuning. Just analogy.

Technical Detail

TabPFN and the Prior-Fitted Network paradigm

The architecture. TabPFN (Hollmann et al., 2022; v2 2025) is a transformer with a permutation-invariant attention pattern over feature columns and example rows. At training time, it is fed millions of synthetic datasets drawn from a hypothesis space defined by structural causal models. The training objective is approximate Bayesian inference: predict the held-out target given the context examples, marginalised over plausible data-generating processes.

In-context learning, not fine-tuning. At inference time, the model is given the labelled training rows as part of its input prompt, along with the unlabelled test row. It produces a posterior over the target. No gradients are computed; no parameters are updated. The "learning" is purely conditional inference within a single forward pass.
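As a rough analogy only, the sketch below mimics the shape of in-context prediction: labelled context rows and the query row go in together, a probability comes out, and nothing is fitted or stored between calls. The distance-weighted vote is a stand-in chosen for illustration, not TabPFN's transformer. The feature values, the `in_context_predict` name, and the `temperature` parameter are all hypothetical.

```python
import math

def in_context_predict(context_rows, context_labels, query_row, temperature=1.0):
    """Return P(label == 1) for query_row from a distance-weighted vote.

    Stand-in for a single forward pass: no parameters are fitted or updated.
    """
    weights = []
    for row in context_rows:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, query_row)))
        weights.append(math.exp(-dist / temperature))
    total = sum(weights)
    positive = sum(w for w, y in zip(weights, context_labels) if y == 1)
    return positive / total

# Hypothetical, pre-scaled features: [balance_trend, days_since_last_txn / 100]
context_rows = [[-0.20, 0.50], [-0.15, 0.40], [0.05, 0.05], [0.10, 0.02], [-0.02, 0.10]]
context_labels = [1, 1, 0, 0, 0]  # 1 = churned
marcus = [-0.18, 0.47]            # -18% balance trend, 47 days since last txn

p_churn = in_context_predict(context_rows, context_labels, marcus, temperature=0.1)
print(f"P(churn) = {p_churn:.2f}")
```

Changing the context rows changes the answer immediately, with no retraining step. That property is the one the analogy is meant to show; the quality of a real prior-fitted network comes from its pre-training, which this toy does not have.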

Theoretical foundation. Müller et al. (2022) showed PFNs perform approximate Bayesian inference for the prior they were trained on. The prior over synthetic SCMs is broad enough to cover most real-world tabular patterns, which is why the approach transfers. Where the prior is wrong, the model degrades gracefully or catastrophically depending on the mismatch.

Operational characteristics. Strong on small datasets, and frequently competitive with tuned XGBoost on standard benchmarks. The original release targeted roughly 1,000 training rows and 100 features; TabPFN v2 (2025) extends this to around 10,000 rows and 500 features and adds support for missing data, categorical features, and regression. The limit comes from the context window: every labelled row has to fit in the prompt.

Limitations. Cannot capture bank-specific patterns absent from its synthetic prior. Inference is more expensive than a fitted XGBoost (transformer forward pass vs tree traversal). Not yet routinely used for high-volume real-time prediction at bank scale; better as a "first-look" model or for low-volume use cases.

Citations. Hollmann N., Müller S., Eggensperger K., Hutter F. (2023). TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. ICLR (preprint 2022). Hollmann N. et al. (2025). Accurate predictions on small data with a tabular foundation model. Nature. Müller S. et al. (2022). Transformers Can Do Bayesian Inference. ICLR (PFNs).

Tab 02 · Kumo · KumoRFM · PQL

The consultant who has worked at a thousand banks

TabPFN is the conceptual ancestor; Kumo is the productionised commercial form. A relational foundation model that takes a knowledge graph as input and answers predictive queries written in a domain-specific language called PQL.

In Plain English

A consultant who walks in already knowing what churn looks like

Imagine a McKinsey consultant who has worked at a thousand banks over twenty years. When you describe your churn problem they don’t ask "what features should we use?" — they already know. They’ve seen this pattern. That is Kumo, applied to your knowledge graph.

Hand over the graph. Kumo accepts your knowledge graph — the same graph from the KG Lab. Nodes for customers, accounts, advisors, products. Edges for ownership, advisor-of, similar-to. It also accepts the raw tabular sources directly. You don’t engineer features. The model walks the graph.
Ask in PQL. Kumo has its own little query language for prediction tasks — PQL (Predictive Query Language). You write something like PREDICT churn FOR Customer IN NEXT 90 DAYS. PQL handles the temporal logic, the entity binding, the target definition. The model handles the rest.
It reasons over multi-hop paths. To predict Marcus’s churn, Kumo doesn’t just look at Marcus. It walks Marcus → advisor → other clients → who has churned recently. Marcus → similar-clients-by-AUM → outcomes. Marcus → product-holdings → product-level retention rates. Multi-hop reasoning is the difference between flat features and graph features.
The answer comes with a reasoning trail. Not just "42%". But "42% because: declining balance trend (28% contribution), advisor Jane Wong’s recent client departures (19%), no recent deposits (15%), no linked external accounts (12%), single-product holding (10%)." The reasoning is what regulators want, and what Kumo bakes in by design.
What this costs. Kumo is commercial: enterprise pricing, vendor relationship, integration work. The pitch is straightforward: you save 3 months per use case, you can spin up 10 use cases in a quarter instead of one, and the bank-wide ROI is in the cycle-time saving. The honest counter: the open-source GNN stack (Tab 03) delivers roughly 85-90% of that benefit without the vendor, provided the bank supplies its own engineering team.

Predict Marcus’s churn with a Kumo-style PQL query · Interactive

Click ▶ to write a PQL query, see the multi-hop graph walk Kumo performs, and read the prediction with full reasoning trail.

Technical Detail

Kumo as a productionised relational foundation model

The system. Kumo (Kumo.AI, 2022 founding; KumoRFM announced 2025) is a foundation model architecture for relational data, explicitly designed to accept multi-table relational schemas (or knowledge graphs) and answer predictive queries without per-task fine-tuning. The company's team includes co-founder Jure Leskovec and Matthias Fey, the creator of PyTorch Geometric.

Architecture. Combines graph-transformer encoders over the relational schema with temporal attention for time-series-shaped queries. Pre-trained on a large corpus of public relational datasets across e-commerce, telecom, healthcare, and finance — the analogue of GPT’s text corpus, but for relational structure. The published architecture is a member of the broader "relational deep learning" family (Fey et al. 2023).

PQL (Predictive Query Language). A SQL-flavoured DSL for declaring predictive tasks, schematically: PREDICT <target> FOR <entity> WHERE <filter> AT <time-cutoff> (the exact grammar is defined in Kumo's PQL reference). PQL handles the canonical hard parts of ML productionisation: temporal leakage prevention (target after cutoff), entity scoping, target definition. The compilation target is a graph encoding plus an attention-based readout.
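The temporal-leakage rule (features from before the cutoff, target from after it) can be sketched in a few lines. This is an illustration of the rule, not Kumo's implementation; the event log, the `split_at_cutoff` helper, and the "account_closed" churn definition are hypothetical.

```python
from datetime import date, timedelta

def split_at_cutoff(events, cutoff, horizon_days=90):
    """Split one customer's events into feature inputs and a label window.

    Features may only use events on or before the cutoff; the label may only
    use events strictly after it, up to the end of the horizon.
    """
    horizon_end = cutoff + timedelta(days=horizon_days)
    feature_events = [e for e in events if e["date"] <= cutoff]
    label_events = [e for e in events if cutoff < e["date"] <= horizon_end]
    return feature_events, label_events

# Hypothetical event log for one customer.
events = [
    {"date": date(2025, 1, 10), "type": "deposit"},
    {"date": date(2025, 2, 20), "type": "withdrawal"},
    {"date": date(2025, 4, 5), "type": "account_closed"},
    {"date": date(2025, 9, 1), "type": "deposit"},
]
cutoff = date(2025, 3, 1)

feature_events, label_events = split_at_cutoff(events, cutoff)
# Assumed churn definition for this sketch: an account closure inside the window.
churned = any(e["type"] == "account_closed" for e in label_events)
print(len(feature_events), "feature events; churned in window:", churned)
```

Keeping the two windows disjoint is the whole point: any feature computed from an event after the cutoff would let the model see the future it is supposed to predict.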

Zero-shot operation. The model is shipped pre-trained. To deploy on a new bank’s graph, you connect the data sources, declare the schema (Kumo can auto-infer much of this), and write PQL queries. No labelled training set is required for the first prediction, though labels improve accuracy as they accrue.

Reasoning trails. A first-class feature: each prediction is accompanied by feature-importance breakdowns at the graph level (which neighbours mattered, which relationship paths contributed). This explainability story is what makes the model viable for regulated industries; it is also the most direct point of comparison versus opaque GNN architectures.

Limitations. Vendor-managed and priced. Less transparent than open-source alternatives for what exactly the model has learned. Best-fit when the bank values cycle-time and managed-service economics over open-source control. Performance gap vs custom-trained classical ML on stable use cases is typically 2-5 percentage points.

Citations. Fey M., Hu W., Huang K., Lenssen J.E., Ranjan R., Robinson J., Ying R., You J., Leskovec J. (2023 preprint; ICML 2024 position paper). Relational Deep Learning: Graph Representation Learning on Relational Databases. Robinson J. et al. (2024). RelBench: A Benchmark for Deep Learning on Relational Databases. NeurIPS. Kumo.AI product documentation and PQL specification.

Tab 03 · Graph Neural Networks · PyG · GraphSAGE · GAT · R-GCN

Learn from who you know — the open-source workhorses

PyTorch Geometric is the open-source library behind much of the graph ML that runs at scale. The three workhorse architectures (GraphSAGE, GAT, R-GCN) underpin the graph-based models in this lab, and they remain the practical default for bank-scale work.

In Plain English

A model that learns about you by also learning about who you know

A model that only looks at Marcus’s own attributes — age, AUM, tenure — gets some signal. A model that also looks at Marcus’s advisor’s other clients and the products Marcus holds and who else holds them gets dramatically more. Graph neural networks are how you build that "who you know" signal into a prediction.

The neighbourhood is the feature. For each customer, the GNN looks at their neighbours in the graph — the advisor, the products, the household members, the linked external accounts. It aggregates those neighbours’ features and combines them with the customer’s own. "Who you know" becomes part of "who you are."
GraphSAGE (Hamilton 2017), the workhorse. Sample a fixed-size neighbourhood, aggregate (mean, max, LSTM), repeat at each layer. Inductive: you can score new customers without retraining. It is a common choice for production graph ML in banks.
GAT — attention per edge (Veličković 2018). Not all neighbours matter equally. GAT learns to weight them — which neighbour mattered most for predicting THIS customer? An attention coefficient per edge. Critical for explainability — you can show the auditor which connections drove the prediction.
R-GCN — different edge types weighted differently (Schlichtkrull 2018). The bank’s graph has many edge types: advisedBy, holdsProduct, householdMember, sameAdvisorAs, ownsExternalAccount. R-GCN learns separate weights per relation. "Shares an advisor" matters differently from "lives in same household" for predicting churn — R-GCN gets that.
Why open source matters here. PyTorch Geometric (Fey-Lenssen 2019) is the de facto library. Maintained by the same team that founded Kumo. Free, transparent, vendor-neutral. The trade-off: you provide the engineering team to train, deploy, monitor, retrain. Kumo (Tab 02) sells managing all of that.

Predict Marcus’s churn with a GraphSAGE-style aggregation · Interactive

Click ▶ to watch GraphSAGE sample Marcus’s 2-hop neighbourhood and aggregate features layer by layer.

Technical Detail

PyTorch Geometric — the open-source graph deep learning stack

The library. PyTorch Geometric (Fey & Lenssen 2019) is the dominant open-source library for graph neural networks. Implements ~70 GNN layers, scalable mini-batching via NeighborLoader, heterogeneous graph support (HeteroData), temporal graphs, and integrations with PyTorch Lightning. Active development by NVIDIA, Stanford, and the Kumo team.

GraphSAGE (Hamilton, Ying, Leskovec 2017). Sample-and-aggregate inductive GNN. At each layer k, for each node v: sample a fixed-size set of neighbours N(v), aggregate (mean/max/LSTM), concatenate with v’s own previous representation, project through learned weights. The inductive property — generalises to unseen nodes — is what makes it deployable in production.
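A minimal pure-Python sketch of one sample-and-aggregate layer with a mean aggregator follows. It is written for readability rather than speed and is not PyG's implementation. The toy graph, the weight values, and the `sage_mean_layer` name are hypothetical. Applying one weight matrix to the concatenation [h_v, mean(h_N(v))] is expressed here as two matrices, `w_self` and `w_neigh`, which is algebraically equivalent.

```python
import random

def sage_mean_layer(h, neighbours, w_self, w_neigh, sample_size, rng):
    """One GraphSAGE-style layer with a mean aggregator.

    h: dict node -> feature list; neighbours: dict node -> list of node ids.
    """
    out = {}
    for v, h_v in h.items():
        nbrs = neighbours.get(v, [])
        # Fixed-size neighbour sample (all neighbours if there are fewer).
        sampled = rng.sample(nbrs, min(sample_size, len(nbrs)))
        if sampled:
            agg = [sum(h[u][d] for u in sampled) / len(sampled) for d in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)
        z = [
            sum(ws * x for ws, x in zip(row_s, h_v)) + sum(wn * a for wn, a in zip(row_n, agg))
            for row_s, row_n in zip(w_self, w_neigh)
        ]
        out[v] = [max(0.0, x) for x in z]  # ReLU
    return out

# Hypothetical 3-node neighbourhood around Marcus.
h = {"marcus": [1.0, 0.0], "jane": [0.0, 1.0], "acct": [0.5, 0.5]}
neighbours = {"marcus": ["jane", "acct"], "jane": ["marcus"], "acct": ["marcus"]}
w_self = [[1.0, 0.0], [0.0, 1.0]]
w_neigh = [[0.5, 0.0], [0.0, 0.5]]

out = sage_mean_layer(h, neighbours, w_self, w_neigh, sample_size=2, rng=random.Random(0))
print(out["marcus"])
```

Because the layer only needs a node's features and its neighbours' features, a customer added after training can be scored with the same weights, which is the inductive property described above.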

GAT (Veličković et al. 2018). Attention mechanism replaces uniform aggregation. α_{ij} = softmax_j(LeakyReLU(a^T [W h_i || W h_j])). Multi-head attention for stability (typically 4-8 heads). Provides per-edge interpretability — the attention coefficient α_{ij} is the model’s claim about how much neighbour j mattered for node i’s prediction.
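The attention formula above can be traced by hand for a single node and a single head. The sketch below does exactly that in plain Python; the matrix `W`, the vector `a`, and the feature values are hypothetical numbers chosen to make the arithmetic easy to follow.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gat_attention(h_i, neighbour_feats, W, a):
    """Attention coefficients alpha_ij for one node i and a single head.

    e_ij = LeakyReLU(a^T [W h_i || W h_j]); alpha_ij = softmax over j of e_ij.
    """
    wh_i = matvec(W, h_i)
    scores = []
    for h_j in neighbour_feats:
        concat = wh_i + matvec(W, h_j)  # list concatenation is [W h_i || W h_j]
        scores.append(leaky_relu(sum(ak * ck for ak, ck in zip(a, concat))))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, -1.0]
h_marcus = [1.0, 0.0]
neighbour_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

alpha = gat_attention(h_marcus, neighbour_feats, W, a)
print([round(x, 3) for x in alpha])
```

The coefficients sum to 1 across Marcus's neighbours, so each one can be read as a share of attention. Whether that share is a faithful explanation of the prediction is a separate question that attention weights alone do not settle.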

R-GCN (Schlichtkrull et al. 2018). Generalises GCN to multi-relational graphs. Separate weight matrices W_r per relation type r; the update aggregates over each relation separately, then sums. Basis-decomposition regularisation keeps parameter counts tractable when there are many relations (the standard banking case).
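A small sketch of the per-relation update for one node, assuming mean normalisation within each relation and omitting basis decomposition. The relation names reuse the edge types mentioned in this lab; the weight matrices, feature values, and the `rgcn_update` helper are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rgcn_update(h_self, neighbours_by_relation, W_rel, W_self):
    """One R-GCN-style update for a single node.

    For each relation r: average the neighbours under r, apply W_r, then sum
    across relations and add the self-connection W_self h_self. ReLU at the end.
    """
    total = matvec(W_self, h_self)
    for relation, feats in neighbours_by_relation.items():
        if not feats:
            continue
        dim = len(feats[0])
        mean_feat = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        message = matvec(W_rel[relation], mean_feat)
        total = [t + m for t, m in zip(total, message)]
    return [max(0.0, t) for t in total]

h_marcus = [1.0, 1.0]
neighbours_by_relation = {
    "advisedBy": [[2.0, 0.0]],
    "householdMember": [[0.0, 2.0], [0.0, 4.0]],
}
W_rel = {
    "advisedBy": [[1.0, 0.0], [0.0, 1.0]],
    "householdMember": [[0.0, 0.0], [0.0, 0.5]],
}
W_self = [[1.0, 0.0], [0.0, 1.0]]

updated = rgcn_update(h_marcus, neighbours_by_relation, W_rel, W_self)
print(updated)
```

Each relation gets its own matrix, so the same neighbour feature can push the representation in different directions depending on how the neighbour is connected.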

Heterogeneous graphs in production. A bank’s graph has multiple node types (Customer, Account, Advisor, Product, Transaction) and edge types. PyG’s HeteroData class plus HGT (Heterogeneous Graph Transformer, Hu et al. 2020) is the modern default for production deployments.

Operational considerations. Training requires labelled data — typically 100K-1M labelled examples for stable churn / fraud / next-product models. Inference is fast (milliseconds per node). Standard MLOps applies: drift monitoring, periodic retraining, A/B testing. Cost-effective when the bank has an ML platform team; expensive in engineering time if it doesn’t.

Citations. Fey M., Lenssen J.E. (2019). Fast Graph Representation Learning with PyTorch Geometric. ICLR Workshop. Hamilton W., Ying R., Leskovec J. (2017). Inductive Representation Learning on Large Graphs. NeurIPS. Veličković P. et al. (2018). Graph Attention Networks. ICLR. Schlichtkrull M. et al. (2018). Modeling Relational Data with Graph Convolutional Networks. ESWC.

Tab 04 · PGN · Predictive Graph Networks

Purpose-built for "what will this node do next?"

GNNs from Tab 03 are general-purpose graph models. PGN is a family of architectures purpose-built for the specific task that banks care about most: predicting a node’s future behaviour from its current graph context.

In Plain English

A model whose only job is to predict what happens next

GraphSAGE and GAT are good at many graph tasks — node classification, edge prediction, graph classification. PGN trades generality for focus: it’s built specifically for "given everything we know about this node and its neighbourhood, what will it do in the next N days?" That focus shows up in better accuracy for that one task.

The task that banks care about most. Churn in 90 days. Default in 12 months. Next product purchase. Fraud likelihood in the next transaction. All of these are "predict the future behaviour of this node, given the graph." PGN architectures are designed around this specific shape.
Time-aware message passing. A vanilla GNN treats the graph as static. A PGN bakes in temporal structure — recent edges weighted more, message-passing windows that respect time. Marcus’s relationship to Jane Wong "last week" matters more than his relationship to a former advisor from 8 years ago.
Multi-task prediction heads. A bank doesn’t care about one prediction in isolation. It cares about churn AND next-product AND fraud-likelihood for the same customer. PGN architectures share an encoder across these heads, so the model learns one rich representation of Marcus that powers many predictions at once. Efficient inference. Coherent answers.
Where it sits in the stack. Below Kumo (Tab 02) — PGNs are a building block Kumo uses internally. Above GraphSAGE (Tab 03) — PGN is what you reach for when general GNNs aren’t quite specialised enough for predictive tasks. Most production banks combine: GraphSAGE for representation, PGN-style heads for the specific prediction targets.
The honest framing. "PGN" is more a design pattern than a single library. Different vendors and research groups use the term for different specific architectures. The common idea — temporal-aware, task-specialised, multi-head GNN architectures for predictive queries — is what matters. PyG implements all the necessary building blocks; the architecture is what you assemble.

Predict three things about Marcus simultaneously · multi-head PGN · Interactive

Click ▶ to run a multi-head PGN that outputs churn, next-product, and fraud predictions from one shared graph encoding.

Technical Detail

PGN architectures · temporal GNN + multi-task prediction

The design pattern. Predictive Graph Networks combine three architectural choices: (i) a graph encoder (typically GraphSAGE, GAT, or HGT) that produces node embeddings, (ii) temporal-aware message passing that weights recent edges more than old ones, and (iii) multiple task-specific prediction heads that share the encoder. The pattern is more important than any single named architecture.

Temporal edge weighting. Each edge carries a timestamp. During message aggregation, the contribution of neighbour j to node i is scaled by a decay function δ(t_now − t_{ij}) — exponential decay being the simplest, learned decay being the most expressive. TGN (Rossi et al. 2020) and TGAT (Xu et al. 2020) are the canonical reference architectures.
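A minimal sketch of exponential decay, the simplest of the options named above. Messages are scalars and timestamps are day counts to keep the arithmetic visible; the `decayed_mean` helper, the one-year half-life, and the example values are hypothetical. Normalising the weights so they sum to 1 is a design choice made here, not something the decay function itself requires.

```python
import math

def decayed_mean(messages, now, half_life_days):
    """Weighted mean of neighbour messages with exponential time decay.

    messages: list of (value, timestamp_in_days). An edge that is
    half_life_days old gets half the weight of an edge created at `now`.
    """
    rate = math.log(2) / half_life_days
    weights = [math.exp(-rate * (now - t)) for _, t in messages]
    total = sum(weights)
    return sum(w * v for w, (v, _) in zip(weights, messages)) / total

now = 10_000  # hypothetical "today", in days since some epoch
messages = [
    (1.0, now - 7),        # signal from the current advisor, one week old
    (0.0, now - 8 * 365),  # signal from a former advisor, eight years old
]

recent_weighted = decayed_mean(messages, now, half_life_days=365)
print(round(recent_weighted, 4))
```

With a one-year half-life the eight-year-old edge keeps a weight of 1/256, so the aggregate is dominated by last week's signal, matching the Jane Wong example in the plain-English section.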

Multi-task heads. Given a shared node embedding z_v ∈ ℝ^d, separate task-specific heads h_k: ℝ^d → ℝ^{o_k} compute each target. Training combines losses: L = Σ_k λ_k L_k(h_k(z_v), y_k). The λ_k weights balance tasks; uncertainty weighting (Kendall et al. 2018) is the standard automated approach. The shared encoder learns a representation that’s useful for all tasks, which improves data efficiency.
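The combined loss can be worked through numerically. The first function below is the fixed-weight sum from the formula above. The second is a simplified form of uncertainty weighting that is often used in practice, with one learned log-variance s_k per task; it drops the constant factors that appear in the original paper and is labelled here as an assumption. The per-task loss values are made up.

```python
import math

def weighted_multitask_loss(task_losses, lambdas):
    """L = sum_k lambda_k * L_k with fixed, hand-chosen weights."""
    return sum(lambdas[k] * task_losses[k] for k in task_losses)

def uncertainty_weighted_loss(task_losses, log_vars):
    """Simplified uncertainty weighting: sum_k exp(-s_k) * L_k + s_k.

    s_k = log(sigma_k^2) would be a learned parameter per task; a large s_k
    down-weights a noisy task, and the + s_k term stops s_k growing forever.
    """
    return sum(math.exp(-log_vars[k]) * task_losses[k] + log_vars[k] for k in task_losses)

# Hypothetical per-task losses from the three heads.
task_losses = {"churn": 0.40, "next_product": 1.20, "fraud": 0.05}

fixed = weighted_multitask_loss(task_losses, {"churn": 1.0, "next_product": 0.5, "fraud": 2.0})
neutral = uncertainty_weighted_loss(task_losses, {"churn": 0.0, "next_product": 0.0, "fraud": 0.0})
print(round(fixed, 2), round(neutral, 2))
```

With every s_k at zero the second form reduces to an unweighted sum, which is a convenient starting point before the log-variances are learned alongside the encoder.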

Coherence across tasks. Because predictions share an encoder, they tend to be more mutually consistent than independent per-task models. If the encoder represents Marcus as a churn risk, his next-product prediction is computed from that same representation, which makes "we predict churn but also predict mortgage upsell" contradictions less likely. A shared encoder encourages this consistency; it does not guarantee it.

Operational characteristics. Higher upfront engineering investment than single-task models. Lower amortised cost when running 3+ predictions per customer. Faster inference per prediction (encoder runs once). Better cold-start behaviour on new customers (the shared encoder transfers across tasks).

What this is not. "PGN" is not a single library or product. It’s a design pattern implemented many ways: pyg-team examples, internal bank architectures, components of Kumo’s commercial offering. As used in this lab, the term is a useful umbrella; treat it as a category, not a specific tool.

Citations. Rossi E. et al. (2020). Temporal Graph Networks for Deep Learning on Dynamic Graphs. ICML Workshop. Xu D. et al. (2020). Inductive Representation Learning on Temporal Graphs. ICLR. Kendall A., Gal Y., Cipolla R. (2018). Multi-Task Learning Using Uncertainty to Weigh Losses. CVPR.

Tab 05 · PyTorch Frame · Relational Deep Learning

The bridge between tabular ML and graph context

Most bank ML teams have years of investment in tabular ML — XGBoost, LightGBM, careful feature engineering. PyTorch Frame is the open-source library that brings graph signal into that world without throwing away tabular fluency.

In Plain English

A bridge that lets your tabular team use graphs without leaving tabular

The bank already has a strong tabular ML practice — pandas, XGBoost, well-understood pipelines. PyTorch Frame is the bridge that adds graph signal to that practice without forcing a complete rewrite. Your team keeps their fluency; the model gets graph context.

Tabular-first interface. PyTorch Frame works with the data shape your team already uses — multi-column tables with mixed types (numeric, categorical, text, timestamps). The API feels like pandas + a deep learning model, not like a graph library. The on-ramp is gentle.
Graph features come from the graph. Underneath, the library reaches into your knowledge graph (the one from the KG Lab) and extracts features that capture neighbourhood context: "customer’s advisor’s average AUM", "average churn rate of customers with same product mix". These get added as extra columns to your tabular dataset. Graph signal becomes tabular features.
Pre-trained foundation models on top. PyTorch Frame ships with foundation-model-style modules — ResNet, FT-Transformer, TabTransformer. Some are pre-trained on large relational corpora and can be used zero-shot. Some need fine-tuning but converge much faster than training from scratch. Same library, both modes.
Why this is the practical default for many banks. Banks don’t replace XGBoost overnight. PyTorch Frame lets the team add graph context to existing XGBoost pipelines (as extra features) AND experiment with neural tabular models. It’s the bridge approach, not the rip-and-replace approach. Lower risk, faster adoption.
The honest framing. PyTorch Frame is younger than PyG and less battle-tested. The pre-trained models are improving rapidly but don’t yet match TabPFN on small data or full custom GNNs on large data. It’s the right answer when your team values incremental adoption over peak performance. The 2026 trajectory is favourable; the 2026 maturity is "growing."

Build graph-derived features for Marcus · feed XGBoost + neural tabular · Interactive

Click ▶ to extract 7 graph features from Marcus’s neighbourhood and predict churn with both XGBoost and FT-Transformer side by side.

Technical Detail

PyTorch Frame · relational deep learning for tabular practitioners

The library. PyTorch Frame (pyg-team, 2024) is an open-source library for deep learning on multi-table relational data. Companion project to PyTorch Geometric, designed by the same team to lower the on-ramp for engineers who think tabular-first. Implements stype-aware column handling (numerical, categorical, text-embedded, timestamp, multicategorical) and standard neural tabular architectures.

Architectures shipped. ResNet (Gorishniy et al. 2021 baseline), FT-Transformer (Gorishniy et al. 2021), TabTransformer (Huang et al. 2020), Trompt (Chen et al. 2023). Each can be combined with PyG-extracted features from a graph backbone — the integration is first-class.

Graph-augmented tabular features. The library’s practical sweet spot. Given a customer table and a graph of relations, extract neighbourhood-aggregated features as additional columns: mean AUM of advisor’s clients, variance of household income, count of products held by similar customers. These columns plug into XGBoost, LightGBM, CatBoost, or PyTorch Frame neural models interchangeably.
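One such column, the mean AUM of the advisor's other clients, can be computed with a two-hop lookup. The sketch below uses plain dictionaries rather than any library API. Marcus's ID and AUM come from the running example; the other customers, the advisor IDs, and the `advisor_peer_features` helper are hypothetical.

```python
from statistics import mean

# AUM in £M. Only the cust_m_7741 row comes from the running example.
customers = {
    "cust_m_7741": {"aum": 18.7, "advisor": "adv_1"},
    "cust_a": {"aum": 5.0, "advisor": "adv_1"},
    "cust_b": {"aum": 9.3, "advisor": "adv_1"},
    "cust_c": {"aum": 2.0, "advisor": "adv_2"},
}

def advisor_peer_features(customers, customer_id):
    """Aggregate over the 2-hop path customer -> advisor -> advisor's other clients."""
    advisor = customers[customer_id]["advisor"]
    peers = [
        c for cid, c in customers.items()
        if c["advisor"] == advisor and cid != customer_id
    ]
    return {
        "advisor_peer_count": len(peers),
        "advisor_peer_mean_aum": mean(c["aum"] for c in peers) if peers else 0.0,
    }

# The graph-derived values become ordinary columns next to the customer's own.
row = {"aum": customers["cust_m_7741"]["aum"], **advisor_peer_features(customers, "cust_m_7741")}
print(row)
```

The resulting `row` is an ordinary flat record, so it can be appended to an existing feature table and passed to whichever tabular model the team already uses.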

Foundation model components. PyTorch Frame integrates with external pre-trained models, for example Hugging Face text encoders for embedding text columns. Any pre-trained tabular checkpoints are trained on a smaller corpus than TabPFN’s; confirm which checkpoints a given release provides, and what data they were trained on, before relying on them zero-shot.

Operational adoption pattern. Phase 1 — augment existing XGBoost pipelines with graph features (no model change, lift comes from better features). Phase 2 — try neural tabular models (FT-Transformer) on the augmented dataset (often a 1-2 point lift). Phase 3 — graduate to native GNNs when the team has the engineering bandwidth. The library supports all three phases without rewriting data pipelines.

Limitations and honest framing. PyTorch Frame is younger than PyG; the documentation and community are smaller. Pre-trained checkpoint quality lags behind TabPFN on small tabular benchmarks and behind native GNNs on large graph benchmarks. It is not the peak-performance answer; it is the practical-adoption answer.

Citations. Hu W. et al. (2024). PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning. arXiv. Gorishniy Y. et al. (2021). Revisiting Deep Learning Models for Tabular Data. NeurIPS. Huang X. et al. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. arXiv.

Tab 06 · RelBench · Honest Compare vs XGBoost

The honest scoreboard — measure before you migrate

The previous five tabs each made a claim. This tab puts them on the same scoreboard with XGBoost trained the traditional way — accuracy, latency, cost, time-to-deploy. The right answer is rarely "always zero-shot" or "always classical." It depends on the use case.

In Plain English

Compare on the metrics that actually matter to the bank

Accuracy is one metric. The bank also cares about time-to-deploy, cost per prediction, explainability, maintainability, and regulatory acceptance. RelBench is the open benchmark that compares zero-shot relational models with classical ML on real-world relational data — and a fair comparison surfaces the honest trade-offs.

What RelBench is. An open benchmark from the same team that built PyG and Kumo (Stanford / pyg-team, 2024). It provides multiple real-world relational datasets — e-commerce, F1 racing, Stack Exchange, healthcare — with standardised predictive tasks. It’s the "shared scoreboard" the field needed.
The typical pattern. Classical ML (XGBoost with carefully engineered features) tends to win by 2-5 percentage points on accuracy for stable, well-labelled tasks. Pre-trained and graph-based models (Kumo, TabPFN, GNNs) close the gap to within about 2 points on most tasks, and win on rare-event or temporal-pattern tasks. Neither is uniformly better. These figures are representative of patterns in published benchmarks, not quoted from a specific RelBench run.
What the bank should actually compare. Five metrics on every candidate use case: (a) accuracy / AUC, (b) time-to-first-prediction in weeks, (c) annual cost of ownership, (d) explainability for regulators, (e) coverage of rare events / cold-start customers. The right model differs by row of that table.
Where zero-shot wins. New product launch (no historical data). Rare-event detection (AML, fraud novelty). Multi-use-case agendas where engineering velocity matters more than peak accuracy. Cold-start customers. Pilot or proof-of-concept work that needs to land in weeks, not quarters. The cycle-time saving is the real ROI.
Where classical ML still wins. Credit scoring (regulatory acceptance and stability matter most). High-volume mature predictions (millions per second, well-labelled). Use cases where 2-3 percentage points of AUC translates to material P&L. The bank’s portfolio uses both, and the meta-decision — which to apply where — is itself a strategic capability.

All six models · Marcus’s churn · side by side with XGBoost · The whole question

Click ▶ to run all six zero-shot/pre-trained models on the same Marcus, compare to XGBoost trained for 12 weeks, and see the honest scoreboard.

Technical Detail

RelBench · the open benchmark for relational deep learning

The benchmark. RelBench (Robinson et al., NeurIPS 2024) is an open benchmark for evaluating relational deep learning on real-world multi-table datasets with temporal structure. Datasets cover e-commerce (rel-amazon), Q&A (rel-stack), sports (rel-f1), healthcare (rel-trial), and others. Each dataset ships with standardised predictive tasks, train/val/test splits respecting temporal cutoffs, and reference baselines.

What it measures. Standard predictive metrics (AUC, accuracy, MAE) plus latency and resource consumption. Crucially, it provides a fair comparison surface: same task, same data split, same evaluation protocol across classical ML (LightGBM, XGBoost) and relational deep learning (GraphSAGE, Hetero GNNs, foundation models).

Published results pattern. Across the RelBench tasks: tuned LightGBM/XGBoost is competitive everywhere and leads on roughly half the tasks. GNN-based RDL leads on the other half, particularly tasks with rich relational structure and temporal dependencies. Foundation-model zero-shot approaches close the gap to within a few points on most tasks without per-task training. No method dominates universally.

Why the benchmark matters strategically. Before RelBench, banks evaluating zero-shot approaches had to rely on vendor benchmarks (with predictable biases) or build their own evaluation infrastructure (costly and slow). RelBench provides an open, citable baseline that lets the bank evaluate "would this approach work on our data?" by anchoring on published comparisons.

The honest five-metric table for the bank. For each candidate use case, score: accuracy, time-to-deploy, cost-of-ownership, explainability, rare-event coverage. The right model often differs by metric. A mature 2026 bank’s ML portfolio uses classical ML for credit/fraud (regulatory, stable, high-volume) and zero-shot for new products / cold-start / rapid experimentation (where cycle-time dominates). The portfolio mix — not the per-task winner — is the strategic decision.

What this lab’s comparison demo shows. An illustrative side-by-side on Marcus’s churn: TabPFN, Kumo, GraphSAGE, GAT, PGN, PyTorch Frame, and XGBoost (12-week baseline). The numbers are representative of typical patterns in published benchmarks, not from a specific RelBench run.

Citations. Robinson J. et al. (2024). RelBench: A Benchmark for Deep Learning on Relational Databases. NeurIPS. Fey M., Hu W., Huang K., Lenssen J.E., Ranjan R., Robinson J., Ying R., You J., Leskovec J. (2024). Position: Relational Deep Learning - Graph Representation Learning on Relational Databases. ICML.