Thi Hoang Thi Pham

Profile

PhD candidate specializing in Semantic Web technologies, with a background in applied mathematics and hands-on development experience. Research focuses on the completeness and responsiveness of online SPARQL endpoints, bridging theoretical research with practical implementation. Author of 8 peer-reviewed publications and contributor to research projects developing tools such as CRAWD and PASSAGE for public SPARQL endpoints.

Research & Development Experience

PhD Researcher - Semantic Web & Decentralized Linked Data

LS2N Lab, Nantes University | Sep 2023 - Present

Supervisors: Prof. Pascal Molli, Prof. Hala Skaf-Molli

PASSAGE: Novel SPARQL continuation query concept ensuring completeness for public endpoints
CRAWD: Sampling-based estimator for count-distinct SPARQL queries
Published 4 papers at top-tier venues (ACM Web Conference, ISWC, Semantic Web Journal)
Collaborated on production-ready tools used by the Semantic Web community worldwide

Research Intern - Information Retrieval Systems

LIG Lab, Grenoble | Feb - Jul 2023

Supervisors: Prof. Philippe Mulhem, Prof. Lorraine Goeuriot, Dr. Petra Galuščáková

Applied NLP techniques to analyze IR system performance across different document collections
Developed predictive models using document features (length, complexity, query structure)

ML Research Intern - Energy-Efficient AI

LIG Lab, Grenoble | Apr - Jul 2022

Supervisors: Prof. Denis Trystram, Dr. Danilo Carastan-Santos

Built benchmark-tracker, a tool for evaluating the energy consumption of HPC-scale AI algorithms
Instrumented Python libraries to collect energy metrics through hardware counters
Published results at CARLA 2022 (Latin America High-Performance Computing Conference)

Key Publications

PASSAGE: Ensuring Completeness and Responsiveness of Public SPARQL Endpoints

THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

ACM Web Conference 2025

Abstract: Being able to query online public knowledge graphs such as Wikidata or DBpedia is extremely valuable. However, these queries can be interrupted due to the fair use policies enforced by SPARQL endpoint providers, leading to incomplete results. We introduce the concept of SPARQL continuation queries to ensure completeness and responsiveness with performance similar to Blazegraph.

CRAWD: Sampling-Based Estimation of Count-Distinct SPARQL Queries

THT Pham, P Molli, B Nédelec, H Skaf-Molli, J Aimonier-Davat

International Semantic Web Conference 2024 (Presented at ISWC 2024, Baltimore)

Abstract: Count-distinct SPARQL queries compute the number of unique values in the results of a query executed on a Knowledge Graph. However, counting the exact number of distinct values is often computationally demanding and time-consuming. We propose CRAWD, a new sampling-based approach designed to approximate count-distinct SPARQL queries, significantly improving sampling efficiency for public SPARQL endpoints.

Continuation Queries: Embracing Timeouts on Public SPARQL Endpoints

THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

ISWC 2025 Companion Volume, Nara, Japan

Abstract: Rather than treating timeouts as failures on public SPARQL endpoints, our approach leverages continuation queries to obtain partial results. When a SPARQL endpoint reaches its time quota, it returns partial results along with a new SPARQL query designed to retrieve the missing results. Repeating this process iteratively recovers the complete answers.

Fraw: Sampling-Based Approximate Query Processing for Federations of SPARQL Endpoints

E Boisteau-Desdevises, THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

ISWC 2025 Companion Volume, Nara, Japan

Abstract: Fraw is a SPARQL federation engine enabling users to query multiple SPARQL endpoints as if all RDF data were available through a single virtual endpoint. The system employs sampling-based approximate query processing using random walks, demonstrated through an interactive SPARQL query autocompletion use case.

LLM4Schema.org: Generating Schema.org Markups with Large Language Models

MH Dang, THT Pham, P Molli, H Skaf-Molli, A Gaignard

Semantic Web Journal 2025 | 1 citation

Abstract: The integration of Schema.org markup into web pages has resulted in billions of RDF triples, yet around 75% of web pages still lack this critical markup. This paper introduces LLM4Schema.org, an innovative approach for assessing the performance of LLMs in generating Schema.org markup. Our findings reveal that 40–50% of the markup produced by GPT-3.5 and GPT-4 is invalid, but specialized LLM-powered agents can effectively identify and eliminate these errors.

Online Sampling of Summaries from Public SPARQL Endpoints

THT Pham, H Skaf-Molli, P Molli, B Nédelec

Companion Proceedings of the ACM Web Conference 2024, 617-620

Abstract: This paper investigates whether online sampling can generate summaries useful in cutting-edge SPARQL federation engines. Our experimental studies indicate that sampling allows the creation and maintenance of summaries by exploring less than 20% of datasets, while respecting fair usage policies for public SPARQL endpoints.

Impact des collections sur les performances des Systèmes de Recherche d'Information

THT Pham, P Galuščáková, P Mulhem, G González Sáez, L Goeuriot

CORIA 2024 (COnférence en Recherche d'Information et Applications)

Abstract: This article is a preliminary study of how corpora evolve and how that evolution affects the performance of information retrieval systems. We propose an approach for creating intermediate corpora between two existing ones, then study the correlations between the differences and the evaluations of a number of information retrieval systems.

Understanding Energy Consumption of HPC Scale Artificial Intelligence

D Carastan-Santos, THT Pham

CARLA 2022 | 3 citations

Abstract: This paper contributes towards a better understanding of the energy consumption trade-offs of HPC-scale Artificial Intelligence (AI), and more specifically Deep Learning (DL) algorithms. We developed benchmark-tracker, a benchmark tool to evaluate the speed and energy consumption of DL algorithms in HPC environments, providing a new tool to help HPC DL developers better balance infrastructure in terms of speed and energy consumption.

Certificates

IBM AI Developer Professional Certificate

IBM Skills Network - Coursera

September 2025

Certificate ID: VOJE5CBYF3SD | 10 Courses Completed

IBM RAG and Agentic AI Professional Certificate

IBM Skills Network - Coursera

September 2025

Certificate ID: ZBIJZ606QIJO | 8 Courses Completed

Education

PhD in Computer Science

LS2N, Nantes University

2023 - 2026 (Expected)

Master in Informatics (MoSIG)

Université Grenoble Alpes - ENSIMAG

2021 - 2023

Master of Science in Applied Mathematics

Université d'Orléans & VNUHCM (France-Vietnam Program)

2019 - 2020

Thesis: 15/20 (Top 5)

Bachelor of Science in Applied Mathematics

University of Saigon

2017 - 2019

GPA: 3.2/4 (Top 2 - Honorable Student)