Predict without training — when it works, and when it doesn’t
The headline claim of zero-shot ML on knowledge graphs: skip the 12-week training cycle and get predictions today. The honest reading: it works in 2026, with calibrated trade-offs against classical ML. This lab covers both sides.
- The traditional path. Need to predict churn? Hire a team. Engineer 30 features. Label 200,000 customers. Train XGBoost. Validate. Deploy. Monitor. Retrain quarterly. One model, one use case, three months minimum. The bank lives this every time it wants a new prediction.
- The zero-shot path. Use a foundation model — something pre-trained on millions of relational datasets across industries. Hand it your knowledge graph (which you already built, see the KG Lab). Ask "will Marcus churn?" Get an answer today, with reasoning, with confidence. No features engineered. No labels needed. No training.
- Why this is possible at all. The foundation model has seen what churn looks like across a thousand other banks (and telcos, and retailers). It has learned that "declining balance + no recent deposits + single account + advisor turnover" tends to predict churn — regardless of which bank’s data the question is asked of. The pattern transfers. That’s the whole bet.
- What you give up. Often 2-4% accuracy versus a fully-trained classical model. Sometimes more. The model doesn’t know your bank’s idiosyncratic patterns — only the universal patterns. For some use cases that gap is fine; for credit scoring it’s not. The honest framing is "calibrated trade-off," not "magic."
- What this lab does. Six tabs, one per model family. Foundation Models · Kumo · GNNs · PGN · PyTorch Frame · RelBench. Each tab predicts Marcus’s churn risk on the same data, the same day. By Tab 06 you can compare all six side by side with XGBoost trained for 12 weeks — and decide where zero-shot wins, where it loses, and which use case is which.
| Tab | Model | Year | Everyday analogy | What it does |
|---|---|---|---|---|
| 01 | TabPFN | 2022-25 | The doctor who read every chart on Earth | Tabular foundation model · in-context learning · no fine-tuning |
| 02 | Kumo · KumoRFM | 2024 | The consultant who has worked everywhere | Enterprise relational foundation model + PQL query language |
| 03 | PyG · GraphSAGE · GAT | 2017-18 | Learn from who you know | Graph neural networks — the open-source workhorses |
| 04 | PGN | research | Built for "what will this node do next?" | Predictive Graph Networks — task-specific GNN architectures |
| 05 | PyTorch Frame | 2024 | The bridge to graphs | Relational deep learning · tabular features + graph context |
| 06 | RelBench · vs XGBoost | 2024 | The honest scoreboard | Open benchmark + the head-to-head versus classical ML |
A doctor who has read every chart on Earth
Before any product or vendor: the conceptual breakthrough. A model can be pre-trained on millions of synthetic tabular datasets, then make predictions on a brand-new dataset it has never seen, purely by analogy. TabPFN (2022-25) is the canonical example.
- Train once, on everything. A foundation model like TabPFN is pre-trained on millions of synthetic tabular datasets covering every shape of prediction problem imaginable. Different number of features, different distributions, different relationships, different outcomes. It learns the shape of how features predict outcomes, not any one specific outcome.
- Now show it your data. You hand it a small dataset, say 100 labelled customers with churn outcomes. You don’t fine-tune it. You don’t train it. You just show it the examples in the prompt, like you would to a human consultant. This is called in-context learning.
- Then ask about Marcus. "Here are 100 labelled customers and their churn outcomes. Here is Marcus. Will he churn?" The model pattern-matches Marcus against its prior of millions of similar predictions, conditioned on your 100 examples. Out comes a probability. No training. No fine-tuning. Same day.
- Why this is not magic. The model is good at this because the pattern "declining balance + reduced activity = churn" shows up across every retail, telco, SaaS, and banking dataset in its pre-training corpus. Marcus is a new face but his shape is familiar. What it cannot do is learn your bank’s idiosyncratic patterns that aren’t in any other dataset. Those still need classical ML.
- Where this came from. TabPFN (Hollmann, Müller, Eggensperger and Hutter; arXiv 2022, published at ICLR 2023; v2 in 2025) was the first compelling demonstration that a transformer pre-trained on synthetic tabular tasks could match XGBoost on small real tabular benchmarks without per-task training. That paper is the conceptual ancestor of everything else in this lab.
The architecture. TabPFN (Hollmann et al., 2022; v2 2025) is a transformer with a permutation-invariant attention pattern over feature columns and example rows. At training time, it is fed millions of synthetic datasets drawn from a hypothesis space defined by structural causal models. The training objective is approximate Bayesian inference: predict the held-out target given the context examples, marginalised over plausible data-generating processes.
In-context learning, not fine-tuning. At inference time, the model is given the labelled training rows as part of its input prompt, along with the unlabelled test row. It produces a posterior over the target. No gradients are computed; no parameters are updated. The "learning" is purely conditional inference within a single forward pass.
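As a rough illustration of "no gradients, no parameter updates", the sketch below makes a prediction purely by conditioning on labelled context rows at call time. It is a toy stand-in (a distance-weighted vote), not TabPFN's transformer. The function name `in_context_predict`, the `temperature` parameter and the two features are assumptions made for illustration.

```python
import math

def in_context_predict(context_rows, context_labels, query, temperature=1.0):
    """Return P(label == 1) for `query`, conditioned only on the context.

    Nothing is fitted or stored: the labelled rows are an input to the
    call, which mirrors the 'single forward pass' idea in spirit only.
    """
    weights = []
    for row in context_rows:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, query)))
        weights.append(math.exp(-dist / temperature))  # closer rows count more
    total = sum(weights)
    positive = sum(w for w, y in zip(weights, context_labels) if y == 1)
    return positive / total

# Hypothetical, pre-scaled features: (balance trend, recent deposit activity).
context_rows = [(-0.8, 0.0), (-0.6, 0.1), (0.5, 0.9), (0.7, 1.0)]
context_labels = [1, 1, 0, 0]  # 1 = churned
marcus = (-0.7, 0.05)

p_churn = in_context_predict(context_rows, context_labels, marcus)
```

Swapping in a different set of context rows changes the answer immediately, with no training step in between, which is the operational point of the paragraph above.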
Theoretical foundation. Müller et al. (2022) showed PFNs perform approximate Bayesian inference for the prior they were trained on. The prior over synthetic SCMs is broad enough to cover most real-world tabular patterns, which is why the approach transfers. Where the prior is wrong, the model degrades gracefully or catastrophically depending on the mismatch.
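The Bayesian reading can be written compactly. In the sketch below, φ stands for a data-generating process drawn from the prior, D for the context rows, and q_θ for the transformer; these symbol names are a notational choice made here and may differ from the papers' own notation.

```latex
p(y \mid x, D) = \int p(y \mid x, \phi)\, p(\phi \mid D)\, d\phi,
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{\phi \sim p(\phi)}\;
\mathbb{E}_{(D,\,x,\,y) \sim p(\cdot \mid \phi)}
\bigl[-\log q_\theta(y \mid x, D)\bigr]
```

The left-hand expression is the posterior predictive distribution; minimising the right-hand loss over many sampled datasets pushes q_θ towards it, for the prior that generated the training data.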
Operational characteristics. Strong on small datasets (≤10K rows, ≤100 features) — frequently competitive with tuned XGBoost on standard benchmarks. Limited by context window: typical implementations support a few thousand context rows. TabPFN v2 (2025) extends to larger contexts and adds support for missing data, categorical features, and regression.
Limitations. Cannot capture bank-specific patterns absent from its synthetic prior. Inference is more expensive than a fitted XGBoost (transformer forward pass vs tree traversal). Not yet routinely used for high-volume real-time prediction at bank scale; better as a "first-look" model or for low-volume use cases.
The consultant who has worked at a thousand banks
TabPFN is the conceptual ancestor; Kumo is the productionised commercial form. A relational foundation model that takes a knowledge graph as input and answers predictive queries written in a domain-specific language called PQL.
- Hand over the graph. Kumo accepts your knowledge graph — the same graph from the KG Lab. Nodes for customers, accounts, advisors, products. Edges for ownership, advisor-of, similar-to. It also accepts the raw tabular sources directly. You don’t engineer features. The model walks the graph.
- Ask in PQL. Kumo has its own little query language for prediction tasks: PQL (Predictive Query Language). You write something like PREDICT churn FOR Customer IN NEXT 90 DAYS. PQL handles the temporal logic, the entity binding, the target definition. The model handles the rest.
- It reasons over multi-hop paths. To predict Marcus’s churn, Kumo doesn’t just look at Marcus. It walks Marcus → advisor → other clients → who has churned recently. Marcus → similar-clients-by-AUM → outcomes. Marcus → product-holdings → product-level retention rates. Multi-hop reasoning is the difference between flat features and graph features.
- The answer comes with a reasoning trail. Not just "42%". But "42% because: declining balance trend (28% contribution), advisor Jane Wong’s recent client departures (19%), no recent deposits (15%), no linked external accounts (12%), single-product holding (10%)." The reasoning is what regulators want, and what Kumo bakes in by design.
- What this costs. Kumo is commercial — enterprise pricing, vendor relationship, integration work. The pitch is straightforward: you save 3 months per use case, you can spin up 10 use cases in a quarter instead of one, the bank-wide ROI is in the cycle-time saving. The honest counter: the open-source GNN stack (Tab 03) achieves 85-90% of this without the vendor.
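The multi-hop walk described in the list above (customer → advisor → other clients → recent churn) can be sketched in plain Python. This is only an illustration of the path idea, not how Kumo computes anything internally; the edge list, the `churned` set and the function name `co_advised_churn_rate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical typed edges: (source, relation, target).
edges = [
    ("marcus", "advisedBy", "jane_wong"),
    ("alice", "advisedBy", "jane_wong"),
    ("bob", "advisedBy", "jane_wong"),
    ("carol", "advisedBy", "omar_diaz"),
]
churned = {"alice"}  # hypothetical recent departures

forward = defaultdict(list)   # (node, relation) -> targets
backward = defaultdict(list)  # (node, relation) -> sources
for src, rel, dst in edges:
    forward[(src, rel)].append(dst)
    backward[(dst, rel)].append(src)

def co_advised_churn_rate(customer):
    """Walk customer -> advisor -> other clients, then average churn."""
    peers = {
        peer
        for advisor in forward[(customer, "advisedBy")]
        for peer in backward[(advisor, "advisedBy")]
        if peer != customer
    }
    if not peers:
        return None  # no multi-hop signal available
    return sum(peer in churned for peer in peers) / len(peers)

rate = co_advised_churn_rate("marcus")
```

A flat customer table has no column for this number unless someone engineers it by hand; a graph model can reach it by following edges.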
The system. Kumo (Kumo.AI, 2022 founding; KumoRFM commercial release 2024) is a foundation model architecture for relational data, explicitly designed to accept multi-table relational schemas (or knowledge graphs) and answer predictive queries without per-task fine-tuning. Its founding team includes Jure Leskovec and Matthias Fey, the lead author of PyTorch Geometric.
Architecture. Combines graph-transformer encoders over the relational schema with temporal attention for time-series-shaped queries. Pre-trained on a large corpus of public relational datasets across e-commerce, telecom, healthcare, and finance — the analogue of GPT’s text corpus, but for relational structure. The published architecture is a member of the broader "relational deep learning" family (Fey et al. 2023).
PQL — Predictive Query Language. A SQL-flavoured DSL for declaring predictive tasks: PREDICT <target> FOR <entity> WHERE <filter> AT <time-cutoff>. PQL handles the canonical hard parts of ML productionisation: temporal leakage prevention (target after cutoff), entity scoping, target definition. The compilation target is a graph encoding plus an attention-based readout.
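To make "target after cutoff" concrete, here is a minimal sketch of the leakage rule a predictive query has to enforce: features may only see events up to the cutoff, and the label only looks at the window after it. This is not PQL and not Kumo's implementation; the event schema, the `build_example` helper and the `horizon_days` parameter are assumptions for illustration.

```python
from datetime import date, timedelta

def build_example(events, cutoff, horizon_days=90):
    """Split one customer's events around a cutoff.

    Features may only use events on or before the cutoff; the label only
    looks at the window strictly after it. `events` is a list of
    (date, kind) tuples, a hypothetical schema for illustration.
    """
    horizon_end = cutoff + timedelta(days=horizon_days)
    past = [e for e in events if e[0] <= cutoff]
    future = [e for e in events if cutoff < e[0] <= horizon_end]
    features = {
        "n_deposits_before_cutoff": sum(kind == "deposit" for _, kind in past),
    }
    label = any(kind == "account_closed" for _, kind in future)
    return features, label

events = [
    (date(2025, 1, 10), "deposit"),
    (date(2025, 2, 3), "deposit"),
    (date(2025, 4, 20), "account_closed"),
]
features, churned = build_example(events, cutoff=date(2025, 3, 1))
```

Moving the cutoff earlier changes both the features and the label, which is why the cutoff belongs in the query definition rather than in ad hoc preprocessing.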
Zero-shot operation. The model is shipped pre-trained. To deploy on a new bank’s graph, you connect the data sources, declare the schema (Kumo can auto-infer much of this), and write PQL queries. No labelled training set is required for the first prediction, though labels improve accuracy as they accrue.
Reasoning trails. A first-class feature: each prediction is accompanied by feature-importance breakdowns at the graph level (which neighbours mattered, which relationship paths contributed). This explainability story is what makes the model viable for regulated industries; it is also the most direct point of comparison versus opaque GNN architectures.
Limitations. Vendor-managed and priced. Less transparent than open-source alternatives for what exactly the model has learned. Best-fit when the bank values cycle-time and managed-service economics over open-source control. Performance gap vs custom-trained classical ML on stable use cases is typically 2-5 percentage points.
Learn from who you know — the open-source workhorses
PyTorch Geometric is the open-source library that much of graph ML at scale runs on. The three workhorse architectures (GraphSAGE, GAT, R-GCN) underpin the graph models in the rest of this lab, and they remain the practical default for bank-scale work.
- The neighbourhood is the feature. For each customer, the GNN looks at their neighbours in the graph — the advisor, the products, the household members, the linked external accounts. It aggregates those neighbours’ features and combines them with the customer’s own. "Who you know" becomes part of "who you are."
- GraphSAGE — the workhorse (Hamilton 2017). Sample a fixed-size neighbourhood, aggregate (mean, max, LSTM), repeat at each layer. Inductive — you can score new customers without retraining. This is what 80% of production bank graph ML actually runs.
- GAT — attention per edge (Veličković 2018). Not all neighbours matter equally. GAT learns to weight them — which neighbour mattered most for predicting THIS customer? An attention coefficient per edge. Critical for explainability — you can show the auditor which connections drove the prediction.
- R-GCN — different edge types weighted differently (Schlichtkrull 2018). The bank’s graph has many edge types: advisedBy, holdsProduct, householdMember, sameAdvisorAs, ownsExternalAccount. R-GCN learns separate weights per relation. "Shares an advisor" matters differently from "lives in same household" for predicting churn — R-GCN gets that.
- Why open source matters here. PyTorch Geometric (Fey-Lenssen 2019) is the de facto library. Maintained by the same team that founded Kumo. Free, transparent, vendor-neutral. The trade-off: you provide the engineering team to train, deploy, monitor, retrain. Kumo (Tab 02) sells managing all of that.
The library. PyTorch Geometric (Fey & Lenssen 2019) is the dominant open-source library for graph neural networks. Implements ~70 GNN layers, scalable mini-batching via NeighborLoader, heterogeneous graph support (HeteroData), temporal graphs, and integrations with PyTorch Lightning. Active development by NVIDIA, Stanford, and the Kumo team.
GraphSAGE (Hamilton, Ying, Leskovec 2017). Sample-and-aggregate inductive GNN. At each layer k, for each node v: sample a fixed-size set of neighbours N(v), aggregate (mean/max/LSTM), concatenate with v’s own previous representation, project through learned weights. The inductive property — generalises to unseen nodes — is what makes it deployable in production.
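The sample-and-aggregate step can be sketched without any graph library. The code below is one GraphSAGE-style layer with a mean aggregator in plain Python; it is a teaching sketch, not PyG's `SAGEConv`, and the function name, argument names and toy graph are assumptions.

```python
import random

def sage_mean_layer(features, neighbours, w_self, w_neigh, sample_size, rng):
    """One GraphSAGE-style layer with a mean aggregator.

    features:   {node: [float, ...]} previous-layer representations
    neighbours: {node: [node, ...]} adjacency lists
    w_self, w_neigh: weight matrices (lists of rows). Applying one matrix to
        the node's own vector and one to the aggregated neighbour vector,
        then summing, is equivalent to projecting their concatenation.
    """
    def matvec(matrix, vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

    out = {}
    for node, h_self in features.items():
        nbrs = neighbours.get(node, [])
        sampled = rng.sample(nbrs, min(sample_size, len(nbrs)))
        if sampled:
            h_neigh = [sum(features[n][d] for n in sampled) / len(sampled)
                       for d in range(len(h_self))]
        else:
            h_neigh = [0.0] * len(h_self)  # isolated node: no neighbour signal
        combined = [a + b for a, b in zip(matvec(w_self, h_self),
                                          matvec(w_neigh, h_neigh))]
        out[node] = [max(0.0, v) for v in combined]  # ReLU
    return out

# Toy graph with hypothetical node names and identity weights.
features = {"marcus": [1.0, 0.0], "jane": [0.0, 2.0], "acct": [2.0, 2.0]}
neighbours = {"marcus": ["jane", "acct"], "jane": ["marcus"], "acct": ["marcus"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
out = sage_mean_layer(features, neighbours, identity, identity, 2, random.Random(0))
```

Because the layer only needs a node's features and a sample of its neighbours, a customer added after training can be scored the same way, which is the inductive property the paragraph refers to.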
GAT (Veličković et al. 2018). Attention mechanism replaces uniform aggregation. α_{ij} = softmax_j(LeakyReLU(a^T [W h_i || W h_j])). Multi-head attention for stability (typically 4-8 heads). Provides per-edge interpretability — the attention coefficient α_{ij} is the model’s claim about how much neighbour j mattered for node i’s prediction.
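The attention formula above translates almost line for line into code. This is a single-head sketch in plain Python for one node and its neighbours, not PyG's `GATConv`; the function name, the `slope` default and the toy weights are assumptions.

```python
import math

def gat_attention(h_i, neighbour_feats, W, a, slope=0.2):
    """Attention coefficients alpha_ij for one node i over its neighbours j.

    W is a weight matrix (list of rows); a is the attention vector, whose
    length is twice the projected dimension.
    """
    def matvec(matrix, vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

    def leaky_relu(x):
        return x if x > 0 else slope * x

    wh_i = matvec(W, h_i)
    scores = []
    for h_j in neighbour_feats:
        concat = wh_i + matvec(W, h_j)  # [W h_i || W h_j]
        scores.append(leaky_relu(sum(p * q for p, q in zip(a, concat))))
    peak = max(scores)  # subtract the max to stabilise the softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, 0.0]  # toy choice: score depends on the neighbour's first feature
alpha = gat_attention([1.0, 1.0], [[2.0, 0.0], [0.0, 0.0]], W, a)
```

The coefficients sum to one over the neighbourhood, so each can be read as that neighbour's share of the aggregated message for this head.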
R-GCN (Schlichtkrull et al. 2018). Generalises GCN to multi-relational graphs. Separate weight matrices W_r per relation type r; the update aggregates over each relation separately, then sums. Basis-decomposition regularisation keeps parameter counts tractable when there are many relations (the standard banking case).
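Written out, the per-relation update and the basis decomposition look as follows. Here c_{i,r} is a normalisation constant (for example the number of r-neighbours of node i), W_0 handles the self-connection, and B is the number of shared basis matrices.

```latex
h_i^{(l+1)} = \sigma\!\Bigl( W_0^{(l)} h_i^{(l)}
  + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^{r}}
    \frac{1}{c_{i,r}}\, W_r^{(l)} h_j^{(l)} \Bigr),
\qquad
W_r^{(l)} = \sum_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)}
```

With the decomposition, adding a relation type adds only B scalar coefficients per layer rather than a full weight matrix.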
Heterogeneous graphs in production. A bank’s graph has multiple node types (Customer, Account, Advisor, Product, Transaction) and edge types. PyG’s HeteroData class plus HGT (Heterogeneous Graph Transformer, Hu et al. 2020) is the modern default for production deployments.
Operational considerations. Training requires labelled data — typically 100K-1M labelled examples for stable churn / fraud / next-product models. Inference is fast (milliseconds per node). Standard MLOps applies: drift monitoring, periodic retraining, A/B testing. Cost-effective when the bank has an ML platform team; expensive in engineering time if it doesn’t.
Purpose-built for "what will this node do next?"
GNNs from Tab 03 are general-purpose graph models. PGN is a family of architectures purpose-built for the specific task that banks care about most: predicting a node’s future behaviour from its current graph context.
- The task that banks care about most. Churn in 90 days. Default in 12 months. Next product purchase. Fraud likelihood in the next transaction. All of these are "predict the future behaviour of this node, given the graph." PGN architectures are designed around this specific shape.
- Time-aware message passing. A vanilla GNN treats the graph as static. A PGN bakes in temporal structure — recent edges weighted more, message-passing windows that respect time. Marcus’s relationship to Jane Wong "last week" matters more than his relationship to a former advisor from 8 years ago.
- Multi-task prediction heads. A bank doesn’t care about one prediction in isolation. It cares about churn AND next-product AND fraud-likelihood for the same customer. PGN architectures share an encoder across these heads, so the model learns one rich representation of Marcus that powers many predictions at once. Efficient inference. Coherent answers.
- Where it sits in the stack. Below Kumo (Tab 02) — PGNs are a building block Kumo uses internally. Above GraphSAGE (Tab 03) — PGN is what you reach for when general GNNs aren’t quite specialised enough for predictive tasks. Most production banks combine: GraphSAGE for representation, PGN-style heads for the specific prediction targets.
- The honest framing. "PGN" is more a design pattern than a single library. Different vendors and research groups use the term for different specific architectures. The common idea — temporal-aware, task-specialised, multi-head GNN architectures for predictive queries — is what matters. PyG implements all the necessary building blocks; the architecture is what you assemble.
The design pattern. Predictive Graph Networks combine three architectural choices: (i) a graph encoder (typically GraphSAGE, GAT, or HGT) that produces node embeddings, (ii) temporal-aware message passing that weights recent edges more than old ones, and (iii) multiple task-specific prediction heads that share the encoder. The pattern is more important than any single named architecture.
Temporal edge weighting. Each edge carries a timestamp. During message aggregation, the contribution of neighbour j to node i is scaled by a decay function δ(t_now − t_{ij}) — exponential decay being the simplest, learned decay being the most expressive. TGN (Rossi et al. 2020) and TGAT (Xu et al. 2020) are the canonical reference architectures.
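A minimal sketch of the simplest case, exponential decay, is shown below. It is not TGN or TGAT; the half-life parameterisation, the function name `decayed_mean` and the use of day counts as timestamps are assumptions chosen for readability.

```python
import math

def decayed_mean(messages, t_now, half_life_days=180.0):
    """Aggregate neighbour messages with exponential time decay.

    messages: list of (edge_timestamp_in_days, value) pairs.
    An edge that is one half-life old counts half as much as a fresh one.
    """
    rate = math.log(2) / half_life_days
    weights = [math.exp(-rate * (t_now - t_edge)) for t_edge, _ in messages]
    total = sum(weights)
    if total == 0:
        return 0.0  # no edges: nothing to aggregate
    return sum(w * v for w, (_, v) in zip(weights, messages)) / total

# A fresh edge carrying 1.0 and a 180-day-old edge carrying 0.0.
signal = decayed_mean([(1000.0, 1.0), (820.0, 0.0)], t_now=1000.0)
```

A learned decay would replace the fixed `rate` with a trainable function of the time gap; the aggregation shape stays the same.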
Multi-task heads. Given a shared node embedding z_v ∈ ℝ^d, separate task-specific heads h_k: ℝ^d → ℝ^{o_k} compute each target. Training combines losses: L = Σ_k λ_k L_k(h_k(z_v), y_k). The λ_k weights balance tasks; uncertainty weighting (Kendall et al. 2018) is the standard automated approach. The shared encoder learns a representation that’s useful for all tasks, which improves data efficiency.
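The combined loss is short enough to write out. The sketch below shows the fixed-weight sum from the formula above and, as an assumption about how uncertainty weighting is commonly implemented, its regression-style log-variance form; the task names and loss values are hypothetical.

```python
import math

def combined_loss(task_losses, weights):
    """Fixed-weight multi-task loss: sum over k of lambda_k * L_k."""
    return sum(weights[k] * loss for k, loss in task_losses.items())

def uncertainty_weighted_loss(task_losses, log_vars):
    """Uncertainty weighting in a common log-variance form.

    With s_k = log(sigma_k^2) learned per task, each term is
    0.5 * exp(-s_k) * L_k + 0.5 * s_k, i.e. L_k / (2 sigma_k^2) + log sigma_k.
    """
    return sum(0.5 * math.exp(-log_vars[k]) * loss + 0.5 * log_vars[k]
               for k, loss in task_losses.items())

losses = {"churn": 0.4, "next_product": 1.0, "fraud": 0.2}  # hypothetical values
fixed = combined_loss(losses, {"churn": 1.0, "next_product": 0.5, "fraud": 2.0})
learned = uncertainty_weighted_loss(losses, {k: 0.0 for k in losses})
```

In a real training loop the `log_vars` would be trainable parameters updated alongside the encoder, so noisy tasks are down-weighted automatically rather than by hand-tuned λ values.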
Coherence. Because predictions share an encoder, they tend to be more mutually consistent than independent per-task models. If the encoder thinks Marcus looks like a churn risk, his next-product prediction will usually reflect that too, which reduces "we predict churn but also predict mortgage upsell" incoherence. A shared encoder makes this likely; it does not formally guarantee it.
Operational characteristics. Higher upfront engineering investment than single-task models. Lower amortised cost when running 3+ predictions per customer. Faster inference per prediction (encoder runs once). Better cold-start behaviour on new customers (the shared encoder transfers across tasks).
What this is not. "PGN" is not a single library or product. It’s a design pattern implemented many ways: pyg-team examples, internal bank architectures, components of Kumo’s commercial offering. The term as used in this lab is a useful umbrella; treat it as a category, not a specific tool.
The bridge between tabular ML and graph context
Most bank ML teams have years of investment in tabular ML — XGBoost, LightGBM, careful feature engineering. PyTorch Frame is the open-source library that brings graph signal into that world without throwing away tabular fluency.
- Tabular-first interface. PyTorch Frame works with the data shape your team already uses — multi-column tables with mixed types (numeric, categorical, text, timestamps). The API feels like pandas + a deep learning model, not like a graph library. The on-ramp is gentle.
- Graph features come from the graph. Underneath, the library reaches into your knowledge graph (the one from the KG Lab) and extracts features that capture neighbourhood context: "customer’s advisor’s average AUM", "average churn rate of customers with same product mix". These get added as extra columns to your tabular dataset. Graph signal becomes tabular features.
- Pre-trained foundation models on top. PyTorch Frame ships with foundation-model-style modules — ResNet, FT-Transformer, TabTransformer. Some are pre-trained on large relational corpora and can be used zero-shot. Some need fine-tuning but converge much faster than training from scratch. Same library, both modes.
- Why this is the practical default for many banks. Banks don’t replace XGBoost overnight. PyTorch Frame lets the team add graph context to existing XGBoost pipelines (as extra features) AND experiment with neural tabular models. It’s the bridge approach, not the rip-and-replace approach. Lower risk, faster adoption.
- The honest framing. PyTorch Frame is younger than PyG and less battle-tested. The pre-trained models are improving rapidly but don’t yet match TabPFN on small data or full custom GNNs on large data. It’s the right answer when your team values incremental adoption over peak performance. The 2026 trajectory is favourable; the 2026 maturity is "growing."
The library. PyTorch Frame (pyg-team, 2024) is an open-source library for deep learning on multi-table relational data. Companion project to PyTorch Geometric, designed by the same team to lower the on-ramp for engineers who think tabular-first. Implements stype-aware column handling (numerical, categorical, text-embedded, timestamp, multicategorical) and standard neural tabular architectures.
Architectures shipped. ResNet (Gorishniy et al. 2021 baseline), FT-Transformer (Gorishniy et al. 2021), TabTransformer (Huang et al. 2020), Trompt (Chen et al. 2023). Each can be combined with PyG-extracted features from a graph backbone — the integration is first-class.
Graph-augmented tabular features. The library’s practical sweet spot. Given a customer table and a graph of relations, extract neighbourhood-aggregated features as additional columns: mean AUM of advisor’s clients, variance of household income, count of products held by similar customers. These columns plug into XGBoost, LightGBM, CatBoost, or PyTorch Frame neural models interchangeably.
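The idea of turning a relation into an extra column can be sketched with the standard library alone. The code below deliberately does not use the PyTorch Frame API; the rows, the column name `advisor_peer_mean_aum` and the helper `add_advisor_peer_aum` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

customers = [  # hypothetical rows
    {"id": "marcus", "advisor": "jane_wong", "aum": 100.0},
    {"id": "alice", "advisor": "jane_wong", "aum": 300.0},
    {"id": "bob", "advisor": "jane_wong", "aum": 500.0},
    {"id": "carol", "advisor": "omar_diaz", "aum": 200.0},
]

def add_advisor_peer_aum(rows):
    """Append 'advisor_peer_mean_aum': mean AUM of the advisor's other clients."""
    by_advisor = defaultdict(list)
    for row in rows:
        by_advisor[row["advisor"]].append(row)
    augmented = []
    for row in rows:
        peers = [p["aum"] for p in by_advisor[row["advisor"]] if p["id"] != row["id"]]
        new_row = dict(row)  # copy, so the input table is left untouched
        new_row["advisor_peer_mean_aum"] = mean(peers) if peers else None
        augmented.append(new_row)
    return augmented

table = add_advisor_peer_aum(customers)
```

The customer's own AUM is excluded from the aggregate so the new column carries neighbourhood information rather than restating an existing feature. The augmented rows can then be fed to whichever tabular model the team already uses.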
Foundation model components. PyTorch Frame integrates with HuggingFace tabular models and ships its own pre-trained checkpoints for some tasks. The pre-training corpus is smaller than TabPFN’s but more bank-relevant for the financial-services-trained checkpoints.
Operational adoption pattern. Phase 1 — augment existing XGBoost pipelines with graph features (no model change, lift comes from better features). Phase 2 — try neural tabular models (FT-Transformer) on the augmented dataset (often a 1-2 point lift). Phase 3 — graduate to native GNNs when the team has the engineering bandwidth. The library supports all three phases without rewriting data pipelines.
Limitations and honest framing. PyTorch Frame is younger than PyG; the documentation and community are smaller. Pre-trained checkpoint quality lags behind TabPFN on small tabular benchmarks and behind native GNNs on large graph benchmarks. It is not the peak-performance answer; it is the practical-adoption answer.
The honest scoreboard — measure before you migrate
The previous five tabs each made a claim. This tab puts them on the same scoreboard with XGBoost trained the traditional way — accuracy, latency, cost, time-to-deploy. The right answer is rarely "always zero-shot" or "always classical." It depends on the use case.
- What RelBench is. An open benchmark from the same team that built PyG and Kumo (Stanford / pyg-team, 2024). It provides multiple real-world relational datasets — e-commerce, F1 racing, Stack Exchange, healthcare — with standardised predictive tasks. It’s the "shared scoreboard" the field needed.
- The typical pattern. On RelBench-style tasks, classical ML (XGBoost with carefully engineered features) tends to lead by 2-5 percentage points of accuracy on stable, well-labelled tasks. Zero-shot relational foundation models (Kumo, TabPFN) and trained GNNs narrow that gap to a few points on most tasks, and win on rare-event or temporal-pattern tasks. Neither is uniformly better.
- What the bank should actually compare. Five metrics on every candidate use case: (a) accuracy / AUC, (b) time-to-first-prediction in weeks, (c) annual cost of ownership, (d) explainability for regulators, (e) coverage of rare events / cold-start customers. The right model differs by row of that table.
- Where zero-shot wins. New product launch (no historical data). Rare-event detection (AML, fraud novelty). Multi-use-case agendas where engineering velocity matters more than peak accuracy. Cold-start customers. Pilot or proof-of-concept work that needs to land in weeks, not quarters. The cycle-time saving is the real ROI.
- Where classical ML still wins. Credit scoring (regulatory acceptance and stability matter most). High-volume mature predictions (millions per second, well-labelled). Use cases where 2-3 percentage points of AUC translates to material P&L. The bank’s portfolio uses both, and the meta-decision — which to apply where — is itself a strategic capability.
The benchmark. RelBench (Robinson, Ranjan, Hu, Huang, et al.; NeurIPS 2024) is an open benchmark for evaluating relational deep learning on real-world multi-table datasets with temporal structure. Datasets cover e-commerce (rel-amazon), Q&A (rel-stack), sports (rel-f1), healthcare (rel-trial), and others. Each dataset ships with standardised predictive tasks, train/val/test splits respecting temporal cutoffs, and reference baselines.
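A split that respects temporal cutoffs assigns rows by timestamp instead of at random. The sketch below shows the rule in plain Python; it is not the RelBench loader, and the row schema, the function name `temporal_split` and the two cutoff arguments are assumptions.

```python
from datetime import date

def temporal_split(rows, val_cutoff, test_cutoff):
    """Assign rows to train/val/test by timestamp, never at random.

    rows: list of dicts with a 'ts' date. Rows before val_cutoff train the
    model, rows in [val_cutoff, test_cutoff) validate it, the rest test it.
    """
    split = {"train": [], "val": [], "test": []}
    for row in rows:
        if row["ts"] < val_cutoff:
            split["train"].append(row)
        elif row["ts"] < test_cutoff:
            split["val"].append(row)
        else:
            split["test"].append(row)
    return split

rows = [
    {"id": 1, "ts": date(2024, 1, 5)},
    {"id": 2, "ts": date(2024, 6, 1)},
    {"id": 3, "ts": date(2024, 9, 15)},
]
split = temporal_split(rows, date(2024, 6, 1), date(2024, 9, 1))
```

A random split would let the model train on rows that postdate its test rows, which inflates reported accuracy on any task with drift.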
What it measures. Standard predictive metrics (AUC, accuracy, MAE) plus latency and resource consumption. Crucially, it provides a fair comparison surface: same task, same data split, same evaluation protocol across classical ML (LightGBM, XGBoost) and relational deep learning (GraphSAGE, Hetero GNNs, foundation models).
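For readers who want to check a reported AUC by hand, the metric has a direct pairwise definition. The sketch below computes it in plain Python on a tiny made-up example; the function name `roc_auc` and the toy labels and scores are illustrative only.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. O(n_pos * n_neg), fine for small examples.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Three of the four positive-negative pairs are ranked correctly here, so the AUC is 0.75. For production-sized data a sort-based implementation from an established library is the better choice.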
Published results pattern. Across the RelBench tasks: tuned LightGBM/XGBoost is competitive everywhere and leads on roughly half the tasks. GNN-based RDL leads on the other half, particularly tasks with rich relational structure and temporal dependencies. Foundation-model zero-shot approaches close the gap to within a few points on most tasks without per-task training. No method dominates universally.
Why the benchmark matters strategically. Before RelBench, banks evaluating zero-shot approaches had to rely on vendor benchmarks (with predictable biases) or build their own evaluation infrastructure (costly and slow). RelBench provides an open, citable baseline that lets the bank evaluate "would this approach work on our data?" by anchoring on published comparisons.
The honest five-metric table for the bank. For each candidate use case, score: accuracy, time-to-deploy, cost-of-ownership, explainability, rare-event coverage. The right model often differs by metric. A mature 2026 bank’s ML portfolio uses classical ML for credit/fraud (regulatory, stable, high-volume) and zero-shot for new products / cold-start / rapid experimentation (where cycle-time dominates). The portfolio mix — not the per-task winner — is the strategic decision.
What this lab’s comparison demo shows. An illustrative side-by-side on Marcus’s churn: TabPFN, Kumo, GraphSAGE, GAT, PGN, PyTorch Frame, and XGBoost (12-week baseline). The numbers are representative of typical patterns in published benchmarks, not from a specific RelBench run.