Thi Hoang Thi Pham

Hello, I'm Thi Hoang Thi Pham

PhD Student in Semantic Web & Knowledge Graphs · Nantes, France · Animal Lover · Table Tennis Player · Badminton Player

About Me

I'm Thi, a PhD student passionate about semantic web technologies and knowledge graphs. My journey from mathematics to computer science has been driven by curiosity and a deep fascination with how data connections can unlock new possibilities in research and real-world applications. I believe that consistent daily effort creates the most meaningful breakthroughs.

Research & Vision

I specialize in SPARQL query optimization and public SPARQL endpoints, with a vision to make semantic web technologies more accessible to everyone. My goal is to bridge the gap between complex research and practical applications that can benefit society.

Life Balance

When I'm not working with knowledge graphs, you'll find me playing table tennis or cuddling with cats and dogs. There's something peaceful about both the precision of a perfect serve and the simple joy of animal companionship.

Research Focus

Main areas of expertise and research interests

Knowledge Graphs

RDF data construction, querying, and applications

SPARQL Optimization

Query processing techniques for semantic web applications

Public SPARQL Endpoints

Performance analysis for online semantic web platforms

Information Retrieval

Impact analysis of collections on retrieval system performance

Publications

Recent contributions to the field

2025

Passage: Ensuring Completeness and Responsiveness of Public SPARQL Endpoints with SPARQL Continuation Queries

THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

Proceedings of the ACM on Web Conference 2025, 47-58

Abstract: Being able to query online public knowledge graphs such as Wikidata or DBpedia is extremely valuable. However, these queries can be interrupted due to the fair use policies enforced by SPARQL endpoint providers, leading to incomplete results. While these policies help maintain responsiveness for public SPARQL endpoints, they compromise the completeness of query results, limiting the feasibility of various downstream tasks. Ideally, we shouldn't have to choose between completeness and responsiveness. To address this issue, we introduce the concept of SPARQL continuation queries. When a SPARQL endpoint interrupts a query, it returns partial results along with a SPARQL continuation query to retrieve the remaining results. If the continuation query is also interrupted, the process repeats, generating further continuation queries until the complete results are obtained. In our experiments, we show that our continuation server Passage ensures completeness and responsiveness, with execution times similar to Blazegraph.
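The interrupt-and-continue loop the abstract describes can be sketched with a toy in-memory endpoint. This is an illustration of the protocol only, not Passage's API: the real server returns an actual SPARQL query as the continuation, whereas the `endpoint` function and numeric offset below are stand-ins.

```python
# Toy illustration of SPARQL continuation queries: a quota-limited
# endpoint returns partial results plus a "continuation" describing how
# to fetch the rest, and the client loops until results are complete.
# The real Passage server emits a genuine SPARQL query as the
# continuation; the numeric offset here is a simplification.

DATA = [f"result-{i}" for i in range(25)]  # pretend query answers
QUOTA = 10  # max results per call (stands in for the time quota)

def endpoint(continuation=0):
    """Return up to QUOTA results and, if interrupted, a continuation."""
    chunk = DATA[continuation:continuation + QUOTA]
    nxt = continuation + QUOTA
    if nxt >= len(DATA):
        return chunk, None   # query completed within the quota
    return chunk, nxt        # interrupted: resume from here

def run_to_completion():
    """Client side: follow continuations until results are complete."""
    results, continuation = [], 0
    while continuation is not None:
        chunk, continuation = endpoint(continuation)
        results.extend(chunk)
    return results

print(len(run_to_completion()))  # 25: complete despite the quota
```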

2025

LLM4Schema.org: Generating Schema.org Markups with Large Language Models

MH Dang, THT Pham, P Molli, H Skaf-Molli, A Gaignard

Semantic Web Journal

Abstract: The integration of Schema.org markup into web pages has resulted in billions of RDF triples, yet around 75% of web pages still lack this critical markup. Large Language Models (LLMs) present a promising solution by automatically generating the missing Schema.org markup. Despite this potential, there is currently no benchmark to evaluate the markup quality produced by LLMs. This paper introduces LLM4Schema.org, an innovative approach for assessing the performance of LLMs in generating Schema.org markup. Unlike traditional methods, LLM4Schema.org does not require a predefined ground truth. Instead, it compares the quality of LLM-generated markup against human-generated markup. Our findings reveal that 40–50% of the markup produced by GPT-3.5 and GPT-4 is invalid, non-factual, or non-compliant with the Schema.org ontology. However, specialized LLM-powered agents can effectively identify and eliminate these errors. After applying such filtering for both human and LLM-generated markup, GPT-4 shows notable improvements in quality and outperforms humans. LLM4Schema.org highlights both the potential and challenges of leveraging LLMs for semantic annotations, emphasizing the critical role of careful curation and validation in achieving reliable results.
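The filtering step can be pictured with a deliberately tiny stub. The paper's agents are LLM-powered and check validity, factuality, and Schema.org compliance; the `filter_markup` function below, with its hypothetical two-type allow-list, checks only that property names comply.

```python
# Toy sketch of the "filtering agents" idea: discard generated
# Schema.org markup whose properties do not comply with the vocabulary.
# The allow-list below is a tiny hypothetical subset of Schema.org,
# not the real ontology; real validation is far richer.

SCHEMA_ORG_PROPS = {
    "Person": {"name", "jobTitle", "affiliation"},
    "Article": {"headline", "author", "datePublished"},
}

def filter_markup(items):
    """Keep only items whose type is known and whose properties comply."""
    kept = []
    for item in items:
        allowed = SCHEMA_ORG_PROPS.get(item.get("@type"))
        props = {k for k in item if not k.startswith("@")}
        if allowed is not None and props <= allowed:
            kept.append(item)
    return kept

generated = [
    {"@type": "Person", "name": "Ada", "jobTitle": "Researcher"},
    {"@type": "Person", "name": "Bob", "favoriteColor": "blue"},  # bad prop
    {"@type": "Gadget", "name": "X"},                             # bad type
]
print([item["name"] for item in filter_markup(generated)])  # ['Ada']
```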

2025

Continuation Queries: Embracing Timeouts on Public SPARQL Endpoints

THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

ISWC 2025 Companion Volume, Nara, Japan

Abstract: Public SPARQL endpoints like Wikidata enforce strict timeout limits that prevent complete query results. Rather than treating timeouts as failures, our approach leverages continuation queries to build complete answers from partial results. When a SPARQL endpoint reaches its time quota, it returns partial results along with a new SPARQL query designed to retrieve the missing results. This can be repeated iteratively to recover complete answers. The demonstration uses a Passage instance loaded with 13 billion triples from Wikidata 2025.

2025

Fraw: Sampling-Based Approximate Query Processing for Federations of SPARQL Endpoints

E Boisteau-Desdevises, THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

ISWC 2025 Companion Volume, Nara, Japan

Abstract: Fraw is a SPARQL federation engine that enables users to query multiple SPARQL endpoints as if all RDF data were available through a single virtual endpoint. The system employs sampling-based approximate query processing using random walks, beneficial when timely responses matter and approximate results are acceptable. We demonstrate the engine's utility through an interactive SPARQL query autocompletion feature, which supplies user suggestions during query composition despite federated querying complexity.
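A minimal sketch of sampling-based cardinality estimation via random walks, in the WanderJoin style, may help fix the idea. It runs over two in-memory "triple patterns" rather than federated endpoints, and is not Fraw's actual engine.

```python
import random

# Illustrative only: a WanderJoin-style random-walk estimator of a
# join's cardinality, in the spirit of Fraw's sampling-based
# approximate query processing. Toy in-memory data, not SPARQL.

p1 = [("a", 1), ("b", 2), ("c", 2), ("d", 3)]   # (?x, ?y) bindings
p2 = {1: ["u"], 2: ["v", "w"], 3: []}           # ?y -> [?z] bindings

def walk(rng):
    """One random walk; returns an unbiased estimate of the join size."""
    _, y = rng.choice(p1)            # step 1: uniform over p1
    matches = p2.get(y, [])
    if not matches:
        return 0.0                   # dead end: contributes nothing
    rng.choice(matches)              # step 2: pick ?z (value unused here)
    # A completed walk samples one join result with probability
    # 1 / (len(p1) * len(matches)); weight by the inverse probability.
    return float(len(p1) * len(matches))

def estimate(n_walks=20_000, seed=7):
    rng = random.Random(seed)
    return sum(walk(rng) for _ in range(n_walks)) / n_walks

exact = sum(len(p2.get(y, [])) for _, y in p1)  # exact join size: 5
print(exact, round(estimate(), 2))
```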

2024

CRAWD: Sampling-Based Estimation of Count-Distinct SPARQL Queries

THT Pham, P Molli, B Nédelec, H Skaf-Molli, J Aimonier-Davat

International Semantic Web Conference (ISWC 2024, Baltimore), 98-115

Abstract: Count-distinct SPARQL queries compute the number of unique values in the results of a query executed on a Knowledge Graph. However, counting the exact number of distinct values is often computationally demanding and time-consuming. As a result, these queries often fail on public SPARQL endpoints due to fair use policies. In this paper, we propose CRAWD, a new sampling-based approach designed to approximate count-distinct SPARQL queries. CRAWD significantly improves sampling efficiency and makes count-distinct SPARQL queries feasible on public SPARQL endpoints, considerably improving on existing methods.
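For intuition, here is a generic sampling-based distinct-count estimator, the classic Chao1 species-richness estimator, applied to a synthetic value population. It is not CRAWD's estimator, just the general idea of approximating a count-distinct result from a small sample.

```python
import random
from collections import Counter

# Sampling-based estimation of a count-distinct query, for intuition.
# This is the classic Chao1 estimator -- NOT CRAWD's own sampling
# scheme -- shown on a synthetic population of values.

random.seed(42)
# 1000 values drawn from at most 200 possible distinct values.
population = [f"v{random.randint(0, 199)}" for _ in range(1000)]

def chao1_estimate(sample):
    """Estimate the population's distinct count from a sample."""
    counts = Counter(sample)
    d = len(counts)                                 # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # seen exactly twice
    if f2 == 0:
        return d + f1 * (f1 - 1) / 2                # bias-corrected form
    return d + f1 * f1 / (2 * f2)

sample = random.sample(population, 200)             # inspect only 20%
print(len(set(population)), len(set(sample)), round(chao1_estimate(sample)))
```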

2024

Online sampling of summaries from public SPARQL endpoints

THT Pham, H Skaf-Molli, P Molli, B Nédelec

Companion Proceedings of the ACM Web Conference 2024, 617-620

Abstract: This paper investigates whether online sampling can generate summaries useful in cutting-edge SPARQL federation engines. Our experimental studies indicate that sampling allows the creation and maintenance of summaries by exploring less than 20% of datasets. The approach enables the collection of statistics while respecting fair usage policies, which is important for public SPARQL endpoints that have resource limitations.
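The idea of building a summary from a sample can be sketched as follows: a VoID-style per-predicate count estimated after reading only 20% of a toy graph. The predicate names and weights are made up for illustration; the paper targets summaries used by SPARQL federation engines.

```python
import random
from collections import Counter

# Toy sketch of online summary construction: estimate per-predicate
# triple counts from a 20% sample of the graph. Predicate names and
# the skewed distribution are fabricated for the example.

random.seed(1)
PREDICATES = ["rdf:type", "rdfs:label", "foaf:knows", "dbo:birthPlace"]
triples = [(f"s{i}", random.choices(PREDICATES, weights=[5, 3, 1, 1])[0],
            f"o{i}") for i in range(10_000)]

def sampled_summary(triples, fraction=0.2):
    """Scale up predicate counts observed in a uniform triple sample."""
    sample = random.sample(triples, int(len(triples) * fraction))
    scale = 1 / fraction
    return {p: round(c * scale)
            for p, c in Counter(t[1] for t in sample).items()}

summary = sampled_summary(triples)
exact = Counter(t[1] for t in triples)
print(summary)  # estimated counts close to the exact ones
```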

2024

Impact des collections sur les performances des Systèmes de Recherche d'Information

THT Pham, P Galuščáková, P Mulhem, G González Sáez, L Goeuriot

CORIA 2024 (COnférence en Recherche d'Information et Applications)

Abstract: This article is a preliminary study of how corpora evolve and how that evolution affects the performance of information retrieval systems. We propose an approach for building intermediate corpora between two existing ones, then measuring their differences along several characteristics. We then study the correlations between these differences in characteristics and the evaluation scores of a number of information retrieval systems, and we show that query representations are indicators of differences between collections that correlate well with the performance of several variants of information retrieval systems.

2022

Understanding the Energy Consumption of HPC Scale Artificial Intelligence

D Carastan-Santos, THT Pham

Latin American High Performance Computing Conference (CARLA 2022), 131-144

Abstract: This paper contributes towards better understanding the energy consumption trade-offs of HPC scale Artificial Intelligence (AI), and more specifically Deep Learning (DL) algorithms. For this task we developed benchmark-tracker, a benchmark tool to evaluate the speed and energy consumption of DL algorithms in HPC environments. We exploited hardware counters and Python libraries to collect energy information through software, which enabled us to instrument a known AI benchmark tool and to evaluate the energy consumption of numerous DL algorithms and models. Through an experimental campaign, we show how benchmark-tracker can measure the computing speed and energy consumption of DL algorithms during training and inference, and how it can help us better understand the energy behavior of DL algorithms on HPC platforms. This work is a step towards better understanding the energy consumption of Deep Learning in HPC, and it contributes a new tool to help HPC DL developers balance HPC infrastructure in terms of speed and energy consumption.
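A minimal sketch of the tracking idea, assuming a stubbed energy counter: the real benchmark-tracker reads hardware counters (e.g. RAPL) through Python libraries, whereas `read_energy_joules` below just fabricates a constant power draw.

```python
import time

# Minimal sketch of the benchmark-tracker idea: run a workload and
# record its runtime together with an energy reading taken before and
# after. `read_energy_joules` is a stand-in stub for a hardware
# counter (e.g. RAPL energy_uj), NOT a real measurement.

def read_energy_joules():
    """Stub energy counter: pretend a constant ~15 W draw."""
    return time.perf_counter() * 15.0

def track(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, energy_joules)."""
    e0, t0 = read_energy_joules(), time.perf_counter()
    result = fn(*args, **kwargs)
    t1, e1 = time.perf_counter(), read_energy_joules()
    return result, t1 - t0, e1 - e0

result, seconds, joules = track(sum, range(1_000_000))
print(f"{seconds:.4f}s, {joules:.4f}J")
```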

Education

Academic journey from Mathematics to Semantic Web

2023 - Present

PhD in Computer Science (In Progress)

LS2N, Nantes Université, France

Semantic Web, Decentralized Linked Data, Knowledge Graphs, SPARQL

Supervisors: Prof. Pascal Molli, Prof. Hala Skaf-Molli
2021 - 2023

Master in Informatics (MoSIG)

Université Grenoble Alpes – ENSIMAG, Grenoble INP, France

Advanced Algorithms, Machine Learning, Information Retrieval, Natural Language Processing, Statistical Learning

Specialized in Data Mining & Machine Learning
2019 - 2020

Master of Science in Applied Mathematics

France-Vietnam Master Program (PUF) - Université Orléans & VNUHCM

Applied Mathematics with focus on statistical methods and computational approaches

Thesis Grade: 15/20 (Top 5)
2015 - 2019

Bachelor of Science in Applied Mathematics

University of Saigon, Vietnam

Probability Theory, Statistics, Functional Analysis, Mathematical Modeling, High-Performance Computing

GPA: 3.2/4 (Top 2 - Honorable Student)

Professional Certificates

Continuous learning and skill development


IBM RAG and Agentic AI Professional Certificate

Coursera - IBM Skills Network

8-course specialization covering advanced generative AI applications using RAG, agentic, and multimodal AI technologies. Includes building autonomous agents with LangChain, LangGraph, CrewAI, and AutoGen for complex reasoning workflows.

RAG Applications · Vector Databases · LangChain · LangGraph · Agentic AI · Multimodal AI
ID: ZBIJZ606QIJO · Sep 9, 2025 · 8 Courses

IBM AI Developer Professional Certificate

Coursera - IBM Skills Network

10-course comprehensive program covering software engineering, AI fundamentals, and generative AI development. Built web applications using Python, Flask, and generative AI-powered models for chatbots and innovative solutions.

Python Development · Flask · Generative AI · Prompt Engineering · Web Development · AI Applications
ID: VOJE5CBYF3SD · Sep 23, 2025 · 10 Courses