Why ontology, in plain English
A database knows that Sarah has a row called "ISA" worth £42,000. It does not know what an ISA is. An ontology fixes that — and unlocks an agent's ability to reason instead of just look up.
- Without ontology, software sees fields. Sarah’s ISA is a row with the text "ISA" and the number 42000. The software does not know that an ISA is a kind of savings account, that savings accounts have withdrawal penalties, or that withdrawal penalties make money less easy to spend.
- An ontology adds the rules of the world. "ISA is a kind of savings account." "Savings accounts have penalties on early withdrawal." "Money with penalties on withdrawal is not freely spendable." Now the software has the missing connections.
- Now the agent can answer real questions. Asked "can Sarah cover £5,000 this week without penalty?" the agent does not just sum her accounts. It checks which accounts are actually spendable without penalty. The current account (£8,000) is. The ISA (£42,000) is not. The portfolio (£420,000) takes 3-5 days to liquidate. The answer is yes — £8,000 in the current account covers it.
- This lab walks the six models that make this possible. RDFS gives the basic vocabulary. OWL adds rules and constraints. Description Logics powers the reasoning engine. RIGOR generates the ontology from the bank’s existing schemas. The Layer Cake organises the build process. OntoGPT extracts new concepts from regulatory text.
- The whole point. An ontology is what lets an agent be helpful rather than literal. Without one, it can only show you the columns. With one, it can answer the question you actually asked.
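The Sarah example in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's implementation: the two lookup tables stand in for an ontology, the identifier acct_port_001 is hypothetical, and "spendable this week" is simplified to "no penalty and no liquidation delay".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Account:
    account_id: str
    kind: str      # raw label as stored in the database
    balance: int   # whole pounds

# Hypothetical mini-ontology: facts about each *kind* of account (assumed values).
HAS_WITHDRAWAL_PENALTY = {"CurrentAccount": False, "ISA": True, "Portfolio": False}
DAYS_TO_LIQUIDATE = {"CurrentAccount": 0, "ISA": 0, "Portfolio": 5}

def literal_total(accounts):
    """What a field-level lookup gives: the sum of every balance."""
    return sum(a.balance for a in accounts)

def freely_spendable(accounts, max_days=0):
    """Sum only balances with no withdrawal penalty that are reachable in time."""
    return sum(
        a.balance
        for a in accounts
        if not HAS_WITHDRAWAL_PENALTY[a.kind] and DAYS_TO_LIQUIDATE[a.kind] <= max_days
    )

sarah = [
    Account("acct_curr_001", "CurrentAccount", 8_000),
    Account("acct_isa_001", "ISA", 42_000),
    Account("acct_port_001", "Portfolio", 420_000),  # hypothetical identifier
]

needed = 5_000
print("literal total:", literal_total(sarah))
print("freely spendable:", freely_spendable(sarah))
print("covers the request:", freely_spendable(sarah) >= needed)
```

The literal sum and the ontology-aware sum differ by two orders of magnitude, which is the gap the rest of the lab is about.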
| Tab | Model | Year | The everyday analogy | What it adds to the agent |
|---|---|---|---|---|
| 01 | RDFS | 2004 | The grammar lesson | Vocabulary: things, kinds, relations |
| 02 | OWL / OWL 2 | 2004 / 2012 | The contract with clauses | Rules: required, forbidden, mutually exclusive |
| 03 | Description Logics | 1980s+ | The detective | Inference: derives new facts from the rules |
| 04 | RIGOR | 2025 | The apprentice librarian | Automation: builds the ontology from existing data |
| 05 | Layer Cake | 2005 | The staircase | Method: order the build from terms up to axioms |
| 06 | OntoGPT | 2023 | The policy reader | Extraction: pulls concepts from regulatory text |
The grammar lesson
Before software can reason about Sarah’s ISA, it needs the basic vocabulary: things, what kind of thing they are, and how they relate. RDFS is the minimum a machine needs to start understanding.
- Everything becomes a sentence with three pieces. Subject, predicate, object. "Sarah holds acct_isa_001." "acct_isa_001 is an ISA." "ISA is a kind of SavingsAccount." Three facts, three sentences, three rows in the knowledge base.
- The "is a kind of" relationship is the magic. RDFS gives you one word for it: rdfs:subClassOf. Now you can say ISA is a subclass of SavingsAccount, and SavingsAccount is a subclass of Account. Software follows the chain automatically: if something is true of every Account, it’s true of every ISA too.
- Properties get rules too. "Holds" connects a Customer to an Account, never the other way round. RDFS lets you say so: "the property 'holds' goes from Customer (domain) to Account (range)." RDFS uses this to infer rather than to reject: whatever holds something is treated as a Customer, and whatever is held is treated as an Account. A wiring mistake therefore shows up as a surprising inferred type; rejecting it outright needs the stricter rules of OWL (Tab 02).
- Labels and comments make it human-readable. Every class and property gets a human-friendly name and a description. So the technical term fibo:LegalPerson can have the label "Legal Person" and a description "An entity recognised in law as having rights and obligations." Software still uses the technical name; humans read the friendly one.
- Why this is the floor. RDFS gives the agent a way to follow a chain: Sarah → holds → acct_isa_001 → which is an → ISA → which is a → SavingsAccount → which is an → Account. Walking that chain is how the agent finds out that Sarah’s ISA, however the database labels it, is fundamentally an Account. Everything richer (OWL, DL) builds on this floor.
RDFS (Resource Description Framework Schema) is a W3C Recommendation from 2004 that extends RDF with vocabulary for describing classes and properties. Built on RDF’s subject-predicate-object triple model, RDFS adds the core schema vocabulary needed to declare type hierarchies and property semantics.
Core vocabulary. rdfs:Class declares a class. rdfs:subClassOf declares a subclass relation. rdfs:subPropertyOf declares property subsumption. rdfs:domain and rdfs:range declare the type signature of a property. rdfs:label and rdfs:comment attach human-readable annotations. rdf:type (from RDF itself) asserts membership.
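Entailment. RDFS entailment is decidable, and for ground triples the closure can be computed in polynomial time by applying a fixed set of rules. A reasoner can derive: if ?x rdf:type ISA and ISA rdfs:subClassOf SavingsAccount, then ?x rdf:type SavingsAccount. The core of this is the transitive closure of subclass and subproperty relationships, plus type propagation through rdfs:domain and rdfs:range, which is strictly weaker than OWL DL inference.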
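A small sketch of that closure, written in plain Python over tuples rather than with an RDF library. It implements only four of the RDFS entailment rules (rdfs2, rdfs3, rdfs9, rdfs11); the triples reuse the names from this tab, and the property name holds is taken from the plain-English description.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

def rdfs_closure(triples):
    """Apply rdfs2, rdfs3, rdfs9 and rdfs11 until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p == SUBCLASS and p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))   # rdfs11: subclass is transitive
                if p == SUBCLASS and p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))       # rdfs9: instances move up the chain
                if p == DOMAIN and p2 == s:
                    new.add((s2, TYPE, o))       # rdfs2: subject gets the domain type
                if p == RANGE and p2 == s:
                    new.add((o2, TYPE, o))       # rdfs3: object gets the range type
        if new <= facts:
            return facts
        facts |= new

graph = {
    ("ISA", SUBCLASS, "SavingsAccount"),
    ("SavingsAccount", SUBCLASS, "Account"),
    ("holds", DOMAIN, "Customer"),
    ("holds", RANGE, "Account"),
    ("Sarah", "holds", "acct_isa_001"),
    ("acct_isa_001", TYPE, "ISA"),
}

closure = rdfs_closure(graph)
for triple in sorted(closure - graph):
    print(triple)
```

Note that Sarah is never declared a Customer: the domain rule infers it, which is why RDFS domain and range behave as inference rules rather than as validation checks.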
Limitations. RDFS cannot express cardinality constraints, disjointness, equivalence, property restrictions, inverse properties, or transitive properties. For those, OWL (Tab 02) is required. RDFS is the minimum semantic layer; richer semantics layer on top without breaking it.
The contract with clauses
RDFS gives vocabulary. OWL gives rules. Required clauses, forbidden combinations, mutual exclusions, equivalence — the things a contract spells out so there’s no ambiguity.
- Required clauses (cardinality). "Every Account must have exactly one Owner." Not zero, not two: exactly one. OWL writes this as owl:cardinality 1. Data that provably violates it (for example, two owners known to be different people) is reported as inconsistent before it goes into production.
- Mutually exclusive clauses (disjointness). "An account is either a SavingsAccount or a CurrentAccount, but not both." OWL writes this as owl:disjointWith. The contract makes that fork in the road formal.
- Equivalence clauses. "A WealthClient is exactly a Customer who holds a Portfolio worth more than £100,000." OWL lets you define a class by its properties using owl:equivalentClass. Anyone in the data who matches the definition is automatically classified as a WealthClient, with no human curation needed.
- Property restrictions. "A SavingsAccount has at least one WithdrawalPenalty." In logic notation this is ∃hasPenalty.WithdrawalPenalty, which OWL expresses with owl:someValuesFrom (read: some value of hasPenalty must be a WithdrawalPenalty). This is the clause that makes Sarah’s ISA recognisably different from her current account.
- The pay-off. Once these clauses exist, software can answer richer questions. "Is Sarah’s ISA freely spendable?" No: the OWL contract says it has a withdrawal penalty, and accounts with penalties are by definition not freely spendable. The contract did the work; the agent didn’t need to be told.
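Two of the clauses above, equivalence and disjointness, can be imitated in plain Python to show what a reasoner does with them. This is a closed-world approximation for illustration only: a real OWL reasoner works under open-world semantics, and the identifier acct_bad_001 is a hypothetical record invented to trigger the check.

```python
WEALTH_THRESHOLD = 100_000  # pounds, from the WealthClient definition
DISJOINT_PAIRS = [("SavingsAccount", "CurrentAccount")]

def classify_customer(asserted_classes, portfolio_value):
    """Equivalence clause: add WealthClient whenever the definition is met."""
    classes = set(asserted_classes)
    if (
        "Customer" in classes
        and portfolio_value is not None
        and portfolio_value > WEALTH_THRESHOLD
    ):
        classes.add("WealthClient")
    return classes

def disjointness_violations(types_by_individual):
    """Disjointness clause: report individuals typed with both classes of a pair."""
    violations = []
    for individual, classes in sorted(types_by_individual.items()):
        for left, right in DISJOINT_PAIRS:
            if left in classes and right in classes:
                violations.append((individual, left, right))
    return violations

print(classify_customer({"Customer"}, 420_000))
print(disjointness_violations({
    "acct_isa_001": {"SavingsAccount"},
    "acct_bad_001": {"SavingsAccount", "CurrentAccount"},  # hypothetical bad record
}))
```

Nobody tagged Sarah as a WealthClient; the classification follows from the definition alone.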
OWL (W3C Recommendation 2004; OWL 2 in 2009, with a second edition in 2012) is built on RDF/RDFS and corresponds to fragments of first-order logic with controlled expressiveness, chosen so that reasoning remains decidable. OWL 2 DL, the most expressive sub-language commonly used, corresponds to the Description Logic SROIQ(D) (Horrocks, Kutz and Sattler, 2006).
Class constructors. Intersection (owl:intersectionOf, ⊓), union (owl:unionOf, ⊔), complement (owl:complementOf, ¬), existential restriction (owl:someValuesFrom, ∃R.C), universal restriction (owl:allValuesFrom, ∀R.C), cardinality restrictions (qualified and unqualified).
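One way to read these constructors is as set operations over a single finite interpretation: each class is a set of individuals and each role is a set of pairs. The sketch below evaluates them in one hand-made model; it does not compute entailment, and the individual pen_001 is a hypothetical penalty object.

```python
# One small interpretation (assumed for illustration).
UNIVERSE = {"acct_isa_001", "acct_curr_001", "pen_001"}
SavingsAccount = {"acct_isa_001"}
CurrentAccount = {"acct_curr_001"}
WithdrawalPenalty = {"pen_001"}
hasPenalty = {("acct_isa_001", "pen_001")}

def complement(cls):
    """¬C: everything in the universe that is not in C."""
    return UNIVERSE - cls

def some_values_from(role, cls):
    """∃R.C: individuals with at least one R-successor in C."""
    return {x for (x, y) in role if y in cls}

def all_values_from(role, cls):
    """∀R.C: individuals whose R-successors (if any) are all in C."""
    return {x for x in UNIVERSE if all(y in cls for (x2, y) in role if x2 == x)}

Account = SavingsAccount | CurrentAccount                 # union, ⊔
penalised = Account & some_values_from(hasPenalty, WithdrawalPenalty)   # intersection, ⊓
penalty_free = Account & complement(some_values_from(hasPenalty, WithdrawalPenalty))

print(penalised)
print(penalty_free)
print(all_values_from(hasPenalty, WithdrawalPenalty) == UNIVERSE)
```

The last line prints True because the universal restriction is vacuously satisfied by individuals with no hasPenalty link at all, a common source of surprise when writing ∀R.C axioms.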
Axioms. owl:equivalentClass for definitional equivalence; owl:disjointWith for class disjointness; owl:sameAs / owl:differentFrom for individual identity; owl:inverseOf, owl:TransitiveProperty, owl:FunctionalProperty, owl:SymmetricProperty, owl:AsymmetricProperty for property characteristics.
The three OWL 2 profiles. EL (PTIME; optimised for very large class hierarchies, used in SNOMED CT); QL (FO-rewritable; optimised for query answering over data sources, the OBDA foundation); RL (rule-based; polynomial-time reasoning, implementable in standard rule engines). Full OWL 2 DL is N2EXPTIME-complete in the worst case but typically efficient on practical ontologies (≤ 10⁴ axioms).
The detective
The ontology has the rules and the facts. Description Logics is the engine that combines them, like a detective combining clues, to deduce new facts nobody wrote down — including the answer to the agent’s question.
- Start with the rules of the world (the "TBox"). "An ISA is a kind of SavingsAccount." "Every SavingsAccount has a withdrawal penalty." "Anything with a withdrawal penalty is, by definition, not a LiquidAsset." "A Customer who owns a LiquidAsset is LiquidityReady."
- Add the actual data (the "ABox"). "Sarah owns acct_isa_001." "acct_isa_001 is an ISA." "Sarah also owns acct_curr_001." "acct_curr_001 is a CurrentAccount, and CurrentAccounts are LiquidAssets."
- The detective starts chaining. Step 1: acct_isa_001 is an ISA, so it’s a SavingsAccount. Step 2: SavingsAccounts have withdrawal penalties, so acct_isa_001 has one. Step 3: things with withdrawal penalties are not LiquidAssets, so acct_isa_001 is not a LiquidAsset. Nobody wrote step 3 in the database. The detective deduced it.
- The other path. Sarah owns acct_curr_001, which is a LiquidAsset. The rule says a Customer who owns a LiquidAsset is LiquidityReady. Therefore: Sarah is LiquidityReady. Via the current account, not the ISA.
- The agent’s question, finally answered. Can Sarah cover £5,000 this week without penalty? Yes — through acct_curr_001 (£8,000, fully liquid). The ISA would technically cover it but with a 5% penalty (£250). The reasoner gave the agent both the answer and the right reason.
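The detective's chain can be sketched as a tiny forward-chaining loop. This is a deliberate simplification and not how a DL reasoner is built: the existential "has a withdrawal penalty" and the negation "not a LiquidAsset" are flattened into the named classes HasWithdrawalPenalty and NotLiquidAsset, both of which are assumptions of this sketch.

```python
# TBox: each class implies at most one parent here, which keeps the sketch short.
SUBCLASS_OF = {
    "ISA": "SavingsAccount",
    "SavingsAccount": "HasWithdrawalPenalty",
    "HasWithdrawalPenalty": "NotLiquidAsset",
    "CurrentAccount": "LiquidAsset",
}
# (role, filler class, inferred class): owning a LiquidAsset makes you LiquidityReady.
EXISTS_RULES = [("owns", "LiquidAsset", "LiquidityReady")]

def saturate(asserted_types, role_assertions):
    """Apply the TBox rules to the ABox until nothing new can be derived."""
    types = {ind: set(classes) for ind, classes in asserted_types.items()}
    for _, subj, obj in role_assertions:
        types.setdefault(subj, set())
        types.setdefault(obj, set())
    changed = True
    while changed:
        changed = False
        for classes in types.values():
            for cls in list(classes):
                parent = SUBCLASS_OF.get(cls)
                if parent is not None and parent not in classes:
                    classes.add(parent)
                    changed = True
        for role, filler, head in EXISTS_RULES:
            for r, subj, obj in role_assertions:
                if r == role and filler in types[obj] and head not in types[subj]:
                    types[subj].add(head)
                    changed = True
    return types

# ABox: the facts somebody actually wrote down.
abox_types = {"acct_isa_001": {"ISA"}, "acct_curr_001": {"CurrentAccount"}}
abox_roles = [("owns", "sarah", "acct_isa_001"), ("owns", "sarah", "acct_curr_001")]

inferred = saturate(abox_types, abox_roles)
for individual in sorted(inferred):
    print(individual, sorted(inferred[individual]))
```

Sarah ends up LiquidityReady through the current account only; the ISA contributes the derived NotLiquidAsset fact that nobody entered by hand.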
Description Logics (DLs) are decidable fragments of first-order logic developed since the 1980s to support knowledge representation with tractable reasoning. The DL underlying OWL 2 DL is SROIQ. The letters name its features: S is the base logic ALC with transitive roles, R adds complex role inclusions (role hierarchies), O nominals, I inverse roles, and Q qualified number restrictions.
TBox / ABox separation. The TBox (terminological box) contains schema-level axioms — class inclusions, equivalences, and role characteristics. The ABox (assertional box) contains data-level facts — class memberships and role assertions for individuals. Together they form a knowledge base K = ⟨T, A⟩.
Inference services. A DL reasoner provides: consistency checking (is K satisfiable?), subsumption (does C ⊑ D follow from T?), classification (compute the full subsumption hierarchy), instance checking (is a:C entailed by K?), realisation (compute the most specific class for each individual).
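For intuition, three of these services can be sketched over a told class hierarchy, where subsumption reduces to reachability. A real reasoner also has to handle restrictions, negation and equivalences, so this is only the simplest special case; the hierarchy reuses the class names from the earlier tabs.

```python
# Told hierarchy: class -> direct superclasses.
PARENTS = {
    "Account": set(),
    "SavingsAccount": {"Account"},
    "CurrentAccount": {"Account"},
    "ISA": {"SavingsAccount"},
}

def ancestors(cls):
    """All strict superclasses of cls, found by walking the hierarchy."""
    seen = set()
    stack = list(PARENTS.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen

def is_subsumed_by(c, d):
    """Subsumption: does C ⊑ D follow from the hierarchy?"""
    return c == d or d in ancestors(c)

def classify():
    """Classification: the full superclass set for every named class."""
    return {cls: ancestors(cls) for cls in PARENTS}

def is_instance(asserted_classes, cls):
    """Instance checking: is the individual entailed to be a member of cls?"""
    return any(is_subsumed_by(asserted, cls) for asserted in asserted_classes)

def realise(known_classes):
    """Realisation: keep only the most specific classes of an individual."""
    return {
        c for c in known_classes
        if not any(o != c and is_subsumed_by(o, c) for o in known_classes)
    }

print(is_subsumed_by("ISA", "Account"))
print(is_instance({"ISA"}, "SavingsAccount"))
print(realise({"Account", "SavingsAccount", "ISA"}))
```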
Algorithms. HermiT, Pellet, and FaCT++ implement tableau-style decision procedures (hypertableau in HermiT's case) for SROIQ, which is N2EXPTIME-complete in the worst case but typically efficient on practical ontologies. ELK is different: it is a consequence-based reasoner optimised for OWL 2 EL and runs in polynomial time on biomedical-scale ontologies.
Open-world semantics. Unlike databases, DLs adopt the Open-World Assumption (OWA): the absence of a fact does not imply its negation. OWL also does not make the Unique Name Assumption, so two different names may refer to the same individual unless stated otherwise. Together these give DL inference its distinctive character and make integrity constraints (very different in flavour from SQL constraints) a topic in their own right.
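The difference between the two assumptions fits in a few lines. In this sketch a closed-world lookup answers False for anything it has not been told, while an open-world lookup distinguishes "no" from "unknown". The fact sets are hand-written stand-ins for what a reasoner would have derived, and acct_port_001 is a hypothetical identifier.

```python
def closed_world_answer(known_true, fact):
    """Database-style: anything not recorded is treated as false."""
    return fact in known_true

def open_world_answer(known_true, known_false, fact):
    """DL-style: 'no' needs evidence too; otherwise the answer is unknown."""
    if fact in known_true:
        return "yes"
    if fact in known_false:
        return "no"
    return "unknown"

known_true = {("acct_curr_001", "LiquidAsset")}
known_false = {("acct_isa_001", "LiquidAsset")}   # derived: it carries a penalty
question = ("acct_port_001", "LiquidAsset")       # nothing stated either way

print(closed_world_answer(known_true, question))
print(open_world_answer(known_true, known_false, question))
```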
The apprentice librarian
A bank already has hundreds of database tables. RIGOR is what reads them — and the existing finance reference books — to generate the ontology automatically, with a senior librarian checking every page.
- Start with nothing. The ontology begins empty. No classes, no properties, no axioms. The apprentice has a stack of database tables to process.
- Pick a table. Look at customers. Read its columns: customer_id, name, dob, tax_residency, kyc_status, segment. Each column is a clue to a domain concept.
- Consult the reference books. The apprentice has access to FIBO (the Financial Industry Business Ontology) and BIAN (the Banking Industry Architecture Network reference model). They look up "customer" and find fibo:LegalPerson, then look up "kyc_status" and find the regulatory categories.
- Write a draft (the Gen-LLM). The apprentice proposes: "Customer is a subclass of fibo:LegalPerson. Customer has a property hasKycStatus whose range is {VERIFIED, PENDING, REJECTED}. Customer has a property hasTaxResidency whose range is country-code."
- The senior librarian checks (the Judge-LLM). Is the draft consistent with what’s already in the ontology? Does it cover all the columns? Is it syntactically valid OWL? If yes, merge. If no, send back for revision.
- Move to the next table and iterate. Process accounts, then positions, then consents, following the foreign-key relationships so each table adds to what came before. After all tables are processed, you have a complete, validated OWL ontology. Months of architect work in minutes.
RIGOR (Retrieval-Augmented Iterative Generation of RDB Ontologies; 2025) combines schema introspection, retrieval over domain ontologies (FIBO, BIAN, ISO 20022), and a two-LLM architecture (Gen + Judge) to produce OWL 2 DL ontologies from relational database schemas with controlled quality.
Topological ordering. Tables are processed in foreign-key dependency order — referenced tables before referencing ones. This ensures that when accounts is processed, the Customer class it references already exists in the growing core ontology.
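The ordering step can be sketched with Python's standard-library graphlib. The foreign-key map below is an assumption made for illustration (the source names the four tables but not their exact references); each table maps to the set of tables it references.

```python
from graphlib import TopologicalSorter

# Hypothetical foreign-key map: table -> tables it references.
foreign_keys = {
    "customers": set(),
    "accounts": {"customers"},
    "positions": {"accounts"},
    "consents": {"customers"},
}

# static_order() yields referenced tables before the tables that reference them.
processing_order = list(TopologicalSorter(foreign_keys).static_order())
print(processing_order)
```

TopologicalSorter raises graphlib.CycleError on circular foreign keys, so a schema with cycles would need those edges broken or handled separately before this step.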
Retrieval context per table. For table T_i, the retrieval index returns: (a) T_i’s DDL with column types and constraints, (b) DDLs of FK-related tables, (c) embedding-matched fragments from FIBO / BIAN / ISO 20022, (d) the current core ontology O_{i−1}.
Gen-LLM output. A delta-ontology fragment ΔO_i in OWL 2 DL syntax — class declarations, equivalence axioms, property restrictions, cardinality constraints, datatype/object property declarations with domain and range.
Judge-LLM validation. Three checks: (i) syntactic well-formedness (parsed as valid OWL); (ii) logical consistency with O_{i−1} (no contradiction — optionally verified by a classical DL reasoner like HermiT or ELK in the loop); (iii) coverage — does ΔO_i describe all columns of T_i? Failed checks trigger revision requests up to k_max iterations.
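The generate-and-judge loop has a simple control structure, sketched below with stub functions in place of the two LLMs. Everything here is hypothetical scaffolding: toy_generate deliberately drops a column on its first attempt, and toy_judge checks only coverage and duplicate class names, standing in for the syntactic and consistency checks described above.

```python
def build_fragment(table, core, generate, judge, k_max=3):
    """Ask for a draft, judge it, and retry with feedback up to k_max times."""
    feedback = None
    for attempt in range(1, k_max + 1):
        draft = generate(table, core, feedback)
        accepted, feedback = judge(draft, core, table)
        if accepted:
            return draft, attempt
    return None, k_max  # gave up: no draft passed the checks

def toy_generate(table, core, feedback):
    """Stub Gen-LLM: forgets the last column unless it has received feedback."""
    columns = table["columns"]
    covered = columns if feedback else columns[:-1]
    return {"class": table["class_name"], "properties": list(covered)}

def toy_judge(draft, core, table):
    """Stub Judge-LLM: coverage check plus a crude clash check against the core."""
    missing = [c for c in table["columns"] if c not in draft["properties"]]
    if missing:
        return False, "missing columns: " + ", ".join(missing)
    if draft["class"] in core:
        return False, "class already defined: " + draft["class"]
    return True, None

customers = {
    "class_name": "Customer",
    "columns": ["customer_id", "name", "dob", "tax_residency", "kyc_status", "segment"],
}
core = {}
draft, attempts = build_fragment(customers, core, toy_generate, toy_judge)
if draft is not None:
    core[draft["class"]] = draft["properties"]  # merge the accepted fragment
print(attempts, core)
```

Passing the judge's feedback back into the generator is the part that makes the loop iterative rather than a plain retry.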
Output guarantees. The final ontology O_n is OWL 2 DL compliant, retains provenance metadata for every axiom (source table, retrieval context, iteration index), and is itself queryable via SPARQL.
The staircase
Building an ontology isn’t one step — it’s seven, each layer built on the one below. The Layer Cake (Cimiano-Mädche 2005) is the canonical recipe: start with raw words, end with formal axioms.
- Layer 1 · Raw text and data. The starting point. Open banking regulatory documents, the bank’s database schemas, product descriptions, policy memos. Just text and tables. No structure yet.
- Layer 2 · Terms. Pull out every important word. "ISA", "savings account", "withdrawal", "penalty", "AUM", "consent", "TPP". Just the words people use — no organisation.
- Layer 3 · Synonyms. Group the words that mean the same thing. "ISA" and "Individual Savings Account." "AUM" and "assets under management." "Customer" and "client" and "account holder." Now the terminology stops fighting itself.
- Layer 4 · Concepts. Turn synonym groups into formal concepts. "Customer" becomes the concept Customer. "ISA" becomes the concept ISA. These are the building blocks; they don’t yet have parent or child relationships.
- Layer 5 · Concept hierarchies. Arrange the concepts as a family tree. ISA is a kind of SavingsAccount. SavingsAccount is a kind of Account. WealthClient is a kind of Customer. This is the spine of the ontology.
- Layer 6 · Relations. Wire concepts together. Customer holds Account. Account has Balance. Customer resides in Country. These are the named connections.
- Layer 7 · Axioms. Add the enforceable rules. "Every Account is held by exactly one Customer." "ISA accounts have a withdrawal penalty." "A WealthClient’s AUM exceeds £100,000." Now the ontology can reason, not just describe.
The Ontology Learning Layer Cake (Cimiano & Mädche, 2005) is the canonical staged model for systematic ontology construction. It separates the learning task into seven well-defined sub-tasks, each with established techniques and evaluable output.
The seven layers in detail. (L1) Corpus: text/data inputs. (L2) Terms: extracted via TF-IDF, C-value/NC-value, or domain-specific tokenisation. (L3) Synonyms: clustered via distributional similarity (Lin similarity, word embeddings) or curated lexicons (WordNet, FinancialTerms). (L4) Concepts: formed by mapping synonym sets to canonical labels with disambiguation. (L5) Concept hierarchies: built via lexico-syntactic patterns (Hearst patterns) or distributional inclusion. (L6) Relations: extracted via verb-frame parsing or LLM-based triple extraction. (L7) Axioms: induced from data or extracted from formal specification text.
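As one concrete instance of the L5 techniques, here is a sketch of a single Hearst pattern ("X such as A, B and C") implemented with a regular expression. It handles only this one pattern on one clean sentence, does no lemmatisation or noun-phrase chunking, and the example sentence is invented for illustration.

```python
import re

SUCH_AS = re.compile(r"^(?P<hyper>.+?),? such as (?P<hypos>.+?)\.?$")
LIST_SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    match = SUCH_AS.search(sentence.strip())
    if match is None:
        return []
    hypernym = match.group("hyper").strip()
    hyponyms = LIST_SEPARATOR.split(match.group("hypos"))
    return [(h.strip(), hypernym) for h in hyponyms if h.strip()]

pairs = hearst_such_as("savings accounts such as ISAs, fixed-term bonds and notice accounts.")
for hyponym, hypernym in pairs:
    print(hyponym, "is a kind of", hypernym)
```

Each pair is a candidate subclass link for the concept hierarchy; in practice the surface strings would still have to be mapped to the L4 concepts before being added.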
Why the order matters. Each layer is a checkpoint with measurable quality. Skipping layers leads to ontologies whose lower-layer disagreements compound into upper-layer contradictions. The Layer Cake makes this failure mode visible at the layer where it occurs.
Modern reinterpretation. Tools like RIGOR (Tab 04) collapse the cake by using LLMs to jump from L1 directly to L7 in one iterative loop — but the conceptual sequence still describes what must internally happen. The Cake stays useful as the discipline of what to evaluate at each stage, even when the implementation is end-to-end.
The policy reader
Regulators publish guidance in prose. Product teams write specs in narrative. OntoGPT (2023) reads that prose and pulls out ontology-shaped knowledge — concepts, properties, restrictions — that the agent can use.
- Feed it a regulatory paragraph. A product memo, an FCA guidance note, a BCBS standard, a MAS notice. Anything in normal English (or German, or French) describing how things work.
- It picks out the concepts. "ISA" is a concept. "Notice period" is a concept. "Early-withdrawal penalty" is a concept. It distinguishes these from filler words like "must" and "have."
- It picks out the properties. "ISA has a notice period." "Notice period has duration 90 days." "ISA has an early-withdrawal penalty." "Penalty has amount 5%."
- It picks out the restrictions. The "must" word is important. It signals an axiom: every ISA must satisfy this rule. OntoGPT promotes this into an OWL restriction — cardinality, value range, required type.
- The output is ontology-shaped. The result isn’t a summary paragraph — it’s actual OWL axioms ready to merge into the bank’s ontology. The reasoner from Tab 03 can immediately use them. The agent now knows that Sarah’s ISA has a 90-day notice and a 5% penalty, because the regulator said so — and that reasoning trail is traceable.
OntoGPT (2023) is an LLM-based knowledge extraction system that pairs a target ontology schema (in LinkML or OWL) with a prompt-templated extraction loop, producing structured outputs that conform to the target schema. Unlike free-form extraction, OntoGPT’s output is constrained to known classes, properties, and value ranges.
Three-stage pipeline. (i) Schema specification: the user provides a target schema fragment naming the classes to extract and their properties. (ii) Prompted extraction: an LLM (typically GPT-4-class) is prompted with the text and the schema; it emits candidate instances. (iii) Schema validation: outputs are checked against the schema; non-conforming instances are rejected or revised via a second LLM pass.
Provenance. Every extracted instance retains a pointer to the source span — start offset, end offset, source document URI. This makes the extraction auditable: "where did this axiom come from?" has a literal answer.
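The validation and provenance stages can be sketched without any LLM call. The code below does not use OntoGPT's API: the schema dictionary, the property names, the candidate list (standing in for model output) and the sample sentence are all assumptions made for illustration.

```python
TEXT = "An ISA must have a notice period of 90 days and an early-withdrawal penalty of 5%."

# Hypothetical target schema: class -> allowed properties and their value types.
SCHEMA = {"ISA": {"hasNoticePeriodDays": int, "hasPenaltyPercent": float}}

def span_of(snippet):
    """Character offsets (start, end) of a snippet in the source text."""
    start = TEXT.index(snippet)
    return (start, start + len(snippet))

# Stand-in for the LLM's candidate instances, each carrying its source span.
candidates = [
    {"cls": "ISA", "prop": "hasNoticePeriodDays", "value": 90, "span": span_of("90 days")},
    {"cls": "ISA", "prop": "hasPenaltyPercent", "value": 5.0, "span": span_of("5%")},
    {"cls": "ISA", "prop": "hasColour", "value": "blue", "span": span_of("ISA")},
]

def conforms(candidate):
    """Accept only known classes and properties, typed values, and valid spans."""
    properties = SCHEMA.get(candidate["cls"])
    if properties is None or candidate["prop"] not in properties:
        return False
    if not isinstance(candidate["value"], properties[candidate["prop"]]):
        return False
    start, end = candidate["span"]
    return 0 <= start < end <= len(TEXT)

accepted = [c for c in candidates if conforms(c)]
rejected = [c for c in candidates if not conforms(c)]

for c in accepted:
    start, end = c["span"]
    print(c["prop"], "=", c["value"], "from", repr(TEXT[start:end]))
print("rejected:", [c["prop"] for c in rejected])
```

Because every accepted instance keeps its offsets, the question "where did this come from?" is answered by slicing the source text.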
Domain track record. Originally developed for biomedical concept extraction (for example against the Mondo disease ontology and Gene Ontology terms). The approach is expected to carry over to other regulated domains such as finance, legal, and clinical guidelines, where the source text is structured prose with stable terminology.
Composition with RIGOR. RIGOR (Tab 04) generates the schema-derived ontology from databases. OntoGPT extracts the text-derived axioms from policy documents. Combined, the bank gets a unified ontology covering both the data and the rules that govern the data — both regulator-traceable.