The Importance of Ontology to an IR System
2026.03 by Alfred Lu
This notebook is dual-licensed:
- Text and visual content: Creative Commons Attribution 4.0 International (CC BY 4.0)
- Source code: Apache License 2.0
You are free to share and adapt this material for any purpose, even commercially, provided you give appropriate credit to me.
%load_ext autoreload
%autoreload 2
import logging
import warnings
from transformers import logging as transformers_logging
logging.getLogger("transformers").setLevel(logging.ERROR)
transformers_logging.set_verbosity_error()
warnings.filterwarnings('ignore')
logging.getLogger("BAAI").setLevel(logging.ERROR)
import KGRAG
import tqdm as notebook_tqdm
from sentence_transformers import SentenceTransformer
from rdflib import Graph, Namespace, RDF, RDFS, OWL, BNode, URIRef, Literal
import os
from pizzar_utils import *
from collections import defaultdict
g = Graph()
g.parse("pizza.rdf", format="xml")
PIZZA = Namespace("http://www.co-ode.org/ontologies/pizza/pizza.owl#")
concrete_triples = get_all_concrete_triples(g)
adj = defaultdict(list)
all_predicates = set()
all_node = set()
p_needs = ['hasSpiciness', 'hasTopping', 'belongs to', 'hasBase']
all_named_pizza = set()
# filter out BNodes
final_triples = []
for s, p, o in concrete_triples:
    # s & o must be URIRef or converted from Literal, not BNode
    if isinstance(s, BNode) or isinstance(o, BNode):
        continue
    s = get_uri_name(s)
    p = get_uri_name(p)
    o = get_uri_name(o)
    if (o == 'NamedPizza') and (p == 'belongs to'):
        all_named_pizza.add(s)
    if p in p_needs:
        adj[s].append((o, p))
    all_node.add(s)
    all_node.add(o)
    all_predicates.add(p)
    final_triples.append((s, p, o))
print(f"Total: {len(final_triples)} triples (blank nodes excluded)")
print(f"Total: {len(all_node)} nodes, {len(all_predicates)} rels")
all_rels = str(", ".join(list(all_predicates)))
print(f"All rels: {all_rels}")
print(f"\nAll triples:")
for i, (s, p, o) in enumerate(final_triples):
    print(f"{i+1}. ({s}, {p}, {o})")
# distribution of rel
from collections import Counter
pred_counts = Counter([get_uri_name(p) for _, p, _ in final_triples])
print(f"\nTop 3 rels:")
for pred, count in pred_counts.most_common(3):
    print(f"  {pred}: {count}")
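The helpers `get_all_concrete_triples` and `get_uri_name` come from `pizzar_utils`, which isn't shown in this notebook. As a rough sketch under my assumptions (not the actual implementation), `get_uri_name` presumably strips the namespace and keeps the local name:

```python
def get_uri_name(term) -> str:
    # Hypothetical stand-in for pizzar_utils.get_uri_name: keep the URI
    # fragment after '#' (or the last path segment); plain literals pass through.
    s = str(term)
    if '#' in s:
        return s.rsplit('#', 1)[1]
    return s.rsplit('/', 1)[-1]

print(get_uri_name("http://www.co-ode.org/ontologies/pizza/pizza.owl#Margherita"))  # Margherita
```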
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
def call_llm(prompt: str) -> str:
    """Call the LLM to generate an answer; a slightly higher temperature can be used to increase diversity."""
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
    )
    completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a pizza expert."},
            {"role": "user", "content": prompt},
        ],
        model="qwen-max"
    )
    return completion.choices[0].message.content
Experiment 1. Use the LLM to answer directly
q_text = """
Based on your knowledge, recommend ALL pizzas from the following pizza list
that have medium spiciness and vegetables on them.
Only output the name of each pizza and a short description of it, followed by a
brief reason you selected it.
"""
prompt = f"""
{q_text}
## pizza list
{', '.join(list(all_named_pizza))}
"""
print(prompt)
print('-'*10)
print(call_llm(prompt))
Observation from experiment 1
❌ Looks like FruttiDiMare is missing; let's see why.
missing_pizza = 'FruttiDiMare'
prompt = f"""
I am looking for a pizza which has medium spiciness and vegetables on it.
Based on your knowledge, explain why you did not recommend {missing_pizza}.
Only output a short description of {missing_pizza}, followed by a
brief reason you did not recommend it.
"""
print(call_llm(prompt))
Takeaway from experiment 1
💡 It seems the LLM doesn't know that FruttiDiMare can have GarlicTopping and TomatoTopping. We probably need to introduce RAG.
✅ One approach to building a RAG system is to express all triples in subject-predicate-object sentence structure, treating each sentence as a chunk for RAG construction. However, this method fails to capture the direct relationships between triples.
✅ An alternative intuitive approach is to represent sentences describing the same entity using a set of subject-predicate-object patterns, combining them into a single chunk for RAG construction. This method essentially samples a subgraph centered around a specific entity, uses the obtained subgraph as the value, and employs the embedding vector of the natural text generated from this subgraph as the key, to build an information retrieval system.
We use the second method in the following demonstration.
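As a toy illustration of the difference between the two methods (hypothetical triples, just to show the grouping):

```python
from collections import defaultdict

# A few hypothetical triples, not taken from the ontology.
triples = [
    ("Margherita", "hasTopping", "TomatoTopping"),
    ("Margherita", "hasTopping", "MozzarellaTopping"),
    ("TomatoTopping", "hasSpiciness", "Mild"),
]

# Method 1: one sentence per triple; cross-triple relationships are lost.
per_triple_chunks = [f"{s} {p} {o}" for s, p, o in triples]

# Method 2: group triples by subject, so each chunk is a 1-hop subgraph
# centered on one entity.
by_subject = defaultdict(list)
for s, p, o in triples:
    by_subject[s].append(f"{s} {p} {o}")
entity_chunks = {s: ", ".join(facts) for s, facts in by_subject.items()}
```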
Experiment 2. Introduce RAG
bge_model = '/Users/weijialu/Documents/BAAI_bge-large-zh-v1.5/'
doc_dict = {}
for a_t in final_triples:
    if (a_t[0] not in doc_dict) and ('Topping' not in a_t[0]):
        # first fact about this entity
        doc_dict[a_t[0]] = a_t[0] + ' ' + a_t[1] + ' ' + a_t[2]
        if (a_t[1] == 'hasTopping') and ('Topping' in a_t[2]):
            # expand the topping with its own facts (e.g. spiciness)
            for b_t in final_triples:
                if b_t[0] == a_t[2]:
                    doc_dict[a_t[0]] += ', ' + b_t[0] + ' ' + b_t[1] + ' ' + b_t[2]
    else:
        if 'Topping' not in a_t[0]:
            # append further facts to the entity's chunk
            doc_dict[a_t[0]] += ', ' + a_t[0] + ' ' + a_t[1] + ' ' + a_t[2]
            if (a_t[1] == 'hasTopping') and ('Topping' in a_t[2]):
                for b_t in final_triples:
                    if b_t[0] == a_t[2]:
                        doc_dict[a_t[0]] += ', ' + b_t[0] + ' ' + b_t[1] + ' ' + b_t[2]
docs = []
for k in doc_dict.keys():
    # split each entity chunk back into per-subject documents
    docs.extend([(k + x).strip(', ') for x in doc_dict[k].split(k) if len(x.strip()) > 0])
print(f'We have {len(docs)} documents...')
print('\n\n'.join(docs[:20]))
q_text = """
Based on your knowledge, recommend ALL pizzas from the following pizza list
that have medium spiciness and vegetables on them.
Only output the name of each pizza and a short description of it, followed by a
brief reason you selected it.
"""
naiveRAG = KGRAG.SimpleRAGSystem(
    embedding_model=bge_model,
    llm_model='qwen-max',
    use_cloud_llm=True,
    db_name='pizza_naive_rag'
)
naiveRAG.build_vector_store(docs)
naiveRAG.get_context(q_text, top_n=10);
Takeaway
❌ It seems the bge model fails to capture the search conditions cleanly.
💡 Let's try revising the question based on the ontology.
Experiment 3. Improve RAG by revising the question based on the ontology
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
onto_struct = {}
onto_struct["entities"] = [
    'NamedPizza - a pizza with specific name',
    'PizzaTopping - the topping of pizza',
    'Topping - the topping of food',
    'hotness - the hotness of topping, could be mild, medium, hot',
    'Food - the food',
    'PizzaBase - the base of pizza']
onto_struct["object_property"] = ['is a kind of', 'hasTopping', 'hasSpiciness', 'hasBase']
onto_struct["graph"] = [
    ('NamedPizza', 'hasTopping', 'PizzaTopping'),
    ('PizzaTopping', 'is a kind of', 'Topping'),
    ('Topping', 'hasSpiciness', 'hotness'),
    ('NamedPizza', 'hasBase', 'PizzaBase'),
    ('Mild', 'is a kind of', 'hotness'),
    ('Medium', 'is a kind of', 'hotness'),
    ('Hot', 'is a kind of', 'hotness'),
    ('NamedPizza', 'is a kind of', 'Food')
]
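How `KGRAG.KnowledgeGraphBuilder` actually consumes this structure isn't shown; one plausible sketch (the function name and prompt wording are my assumptions, not the real API) serializes the schema into an extraction prompt for the LLM:

```python
def build_extraction_prompt(question: str, onto_struct: dict) -> str:
    # Hypothetical helper: render the ontology schema as text so the LLM can
    # rewrite the question as a query graph over that schema.
    entities = "\n".join(f"- {e}" for e in onto_struct["entities"])
    props = ", ".join(onto_struct["object_property"])
    schema = "\n".join(f"({s}, {p}, {o})" for s, p, o in onto_struct["graph"])
    return (
        "Given the ontology below, extract the search target and the\n"
        "(subject, predicate, object) conditions expressed by the question.\n"
        f"## Entities\n{entities}\n"
        f"## Object properties\n{props}\n"
        f"## Schema graph\n{schema}\n"
        f"## Question\n{question}"
    )
```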
First, I'll show how to generate a query graph based on this onto_struct, demonstrated on the same q_text.
q_text = """
Based on your knowledge, recommend ALL pizzas from the following pizza list
that have medium spiciness and vegetables on them.
Only output the name of each pizza and a short description of it, followed by a
brief reason you selected it.
"""
print(f'q_text:\n{q_text}')
kgBuilder = KGRAG.KnowledgeGraphBuilder(
    embedding_model=bge_model,
    llm_model='qwen-max',
    use_cloud_llm=True
)
target, kg = kgBuilder.extract_entities_relations(q_text, onto_struct)
print('search target:'+target)
print(f'search condition:\n{kg}')
Let's try the same bge model again on the revised question.
naiveRAGRlt2 = naiveRAG.get_context(q_text, onto_struct, top_n=10);
Takeaway
❤️ Better than before.
❌ But it looks like bge still cannot consider all search conditions simultaneously.
💡 A potential solution is GRetriever... but it is too heavy.
💡 Let's try a simple method in this case: double-checking based on the ontology.
Experiment 4. Further improvement based on reranking
First, we need to translate each sample paragraph into a graph, as in the following example:
q_text = f"""
FruttiDiMare is a pizza typically topped with a variety of seafood such as
shrimp, mussels, and calamari, often in a tomato or white wine sauce.
"""
print(f'q_text:\n{q_text}')
kgBuilder = KGRAG.KnowledgeGraphBuilder(
    embedding_model=bge_model,
    llm_model='qwen-max',
    use_cloud_llm=True
)
target, kg = kgBuilder.extract_entities_relations(q_text, onto_struct)
print('topic:'+target)
print(f'condition:\n{kg}')
tree_data = KGRAG.analyze_graph(kg)
print(tree_data['root'])
print(tree_data['leaf_candidates'])
q_text = """
Based on your knowledge, recommend ALL pizzas from the following pizza list
that have medium spiciness and vegetables on them.
Only output the name of each pizza and a short description of it, followed by a
brief reason you selected it.
"""
target, kg = kgBuilder.extract_entities_relations(q_text, onto_struct)
tree_data_q = KGRAG.analyze_graph(kg)
print(tree_data_q['leaf_candidates'])
Then let's rerank all candidates from the naive RAG based on their leaf nodes.
all_retrieved_doc_txt = [adoc['text'] for adoc in naiveRAGRlt2['metadatas'][0]]
tree_data_ks = []
leaf_all_d = []
for adoc in all_retrieved_doc_txt:
    target, kg = kgBuilder.extract_entities_relations(adoc, onto_struct)
    tree_data_k = KGRAG.analyze_graph(kg)
    tree_data_ks.append(tree_data_k)
    print(kg)
    print('Leaf : {:}'.format(', '.join(sorted(tree_data_k['leaf_candidates']))))
    leaf_all_d.append(tree_data_k['leaf_candidates'])
scores = [kgBuilder.cal_rel_score(tree_data_q['leaf_candidates'], x) for x in leaf_all_d]
sorted_text = [t for t, _ in sorted(zip(all_retrieved_doc_txt, scores),
                                    key=lambda pair: pair[1],
                                    reverse=True)]
print([f'{x:.2f}' for x in scores])
print('--before--\n\n'+'\n'.join(all_retrieved_doc_txt))
print('\n\n')
print('--after--\n\n'+'\n'.join(sorted_text))
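`cal_rel_score` isn't shown; a minimal stand-in (my assumption; the real implementation may compare leaf embeddings rather than exact strings) is a Jaccard overlap between the query's leaf set and each document's leaf set:

```python
def leaf_overlap_score(query_leaves, doc_leaves) -> float:
    # Hypothetical stand-in for kgBuilder.cal_rel_score: Jaccard overlap
    # between two sets of leaf attributes.
    q, d = set(query_leaves), set(doc_leaves)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

# Documents sharing more leaf attributes with the query rank higher.
print(leaf_overlap_score(["Medium", "VegetableTopping"],
                         ["Medium", "VegetableTopping", "ThinAndCrispyBase"]))
```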
Takeaway
🎉 Now the result is correct.
💡 The simple method introduced here only looks at the leaves; in this case, the leaves represent the pizza attributes we need.
Sometimes we need to look at the whole path from the root to each leaf, so here is our ultimate rerank solution.
Feel free to reach out if you'd like to discuss the details further.
scores = naiveRAG.rerank(tree_data_q, tree_data_ks);
sorted_text = [t for t, _ in sorted(zip(all_retrieved_doc_txt, scores),
                                    key=lambda pair: pair[1],
                                    reverse=True)]
print('--before--\n\n'+'\n'.join(all_retrieved_doc_txt))
print('\n\n')
print('--after--\n\n'+'\n'.join(sorted_text))
To Wrap Up
This example shows the power of ontology.
For demonstration purposes, I picked a really simple example (the famous Pizza dataset) where each document can be represented as a tree: the root is a type of pizza and the leaf nodes are its key attributes. So once documents are successfully converted into triples and adjacency matrices using the ontology, we can directly rerank them by computing similarity matrices between the leaf nodes of the documents and those of the query.
Please note that I use Qwen-Max, Alibaba's flagship LLM and a world-class Tier-1 model.
Beyond the approach shown here (text-similarity-based RAG plus ontology-based question revision and reranking), there are several other ways to tackle this problem:
Structure the data as tables, using Pizza attributes as column headers, then use ReAct + code generation + Python interpreter to extract all results (see this link). The upside is that the number of recalled items isn't limited by the TopK parameter. The downside is that ReAct can burn through quite a few tokens.
Design a rerank/retrieval model using multi-layer GAT networks to replace the simple method described in experiment 4; namely, use GAT in context engineering. The advantage here is that GAT can learn to pay attention to semantic structures in both the query and the recalled documents, not just leaf nodes, through carefully designed contrastive learning, giving it some generalization across different tasks. The major drawback is that the network needs to be trained.
Combine multi-layer GAT networks with foundation model SFT for an end-to-end solution. Compared to option 2, the drawback is that it requires significantly more compute.
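To make the first alternative concrete, here is a toy sketch (made-up rows, not derived from pizza.rdf) of the table-plus-generated-code idea; a generated filter returns every match, with no TopK cut-off:

```python
# Hypothetical attribute table with pizza attributes as columns.
pizzas = [
    {"name": "Margherita", "spiciness": "Mild", "has_vegetable": True},
    {"name": "American", "spiciness": "Hot", "has_vegetable": False},
    {"name": "FruttiDiMare", "spiciness": "Medium", "has_vegetable": True},
]

# The kind of filter a code-generation step might emit for the running query.
hits = [p["name"] for p in pizzas
        if p["spiciness"] == "Medium" and p["has_vegetable"]]
print(hits)  # ['FruttiDiMare']
```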
We already have an implementation for option 2, and PyG officially supports option 3. Feel free to reach out if you'd like to discuss the details further.