Thi Hoang Thi Pham

Hello, I'm Thi Hoang Thi Pham

PhD Student in Semantic Web & Knowledge Graphs · Nantes, France · Animal Lover · Table Tennis Player · Badminton Player

About Me

I'm Thi, a PhD student passionate about semantic web technologies and knowledge graphs. My journey from mathematics to computer science has been driven by curiosity and a deep fascination with how data connections can unlock new possibilities in research and real-world applications. I believe that consistent daily effort creates the most meaningful breakthroughs.

Research & Vision

I specialize in SPARQL query optimization and public SPARQL endpoints, with a vision to make semantic web technologies more accessible to everyone. My goal is to bridge the gap between complex research and practical applications that can benefit society.

Life Balance

When I'm not working with knowledge graphs, you'll find me playing table tennis or cuddling with cats and dogs. There's something peaceful about both the precision of a perfect serve and the simple joy of animal companionship.

Research Focus

Main areas of expertise and research interests

Knowledge Graphs

RDF data construction, querying, and applications

SPARQL Optimization

Query processing techniques for semantic web applications

Public SPARQL Endpoints

Performance analysis for online semantic web platforms

Information Retrieval

Impact analysis of collections on retrieval system performance

Publications

Recent contributions to the field

2025

Passage: Ensuring Completeness and Responsiveness of Public SPARQL Endpoints with SPARQL Continuation Queries

THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

Proceedings of the ACM on Web Conference 2025, 47-58

Abstract: Being able to query online public knowledge graphs such as Wikidata or DBpedia is extremely valuable. However, these queries can be interrupted due to the fair use policies enforced by SPARQL endpoint providers, leading to incomplete results. While these policies help maintain responsiveness for public SPARQL endpoints, they compromise the completeness of query results, limiting the feasibility of various downstream tasks. Ideally, we shouldn't have to choose between completeness and responsiveness. To address this issue, we introduce the concept of SPARQL continuation queries. When a SPARQL endpoint interrupts a query, it returns partial results along with a SPARQL continuation query to retrieve the remaining results. If the continuation query is also interrupted, the process repeats, generating further continuation queries until the complete results are obtained. In our experiments, we show that our continuation server Passage ensures completeness and responsiveness, with execution times similar to Blazegraph.
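The interrupt-and-continue loop the abstract describes can be sketched with a toy in-memory endpoint. This is an illustration of the protocol only, not Passage's API: the real server returns an actual SPARQL query as the continuation, whereas the `endpoint` function and numeric offset below are stand-ins.

```python
# Toy illustration of SPARQL continuation queries: a quota-limited
# endpoint returns partial results plus a "continuation" describing how
# to fetch the rest, and the client loops until results are complete.
# The real Passage server emits a genuine SPARQL query as the
# continuation; the numeric offset here is a simplification.

DATA = [f"result-{i}" for i in range(25)]  # pretend query answers
QUOTA = 10  # max results per call (stands in for the time quota)

def endpoint(continuation=0):
    """Return up to QUOTA results and, if interrupted, a continuation."""
    chunk = DATA[continuation:continuation + QUOTA]
    nxt = continuation + QUOTA
    if nxt >= len(DATA):
        return chunk, None   # query completed within the quota
    return chunk, nxt        # interrupted: resume from here

def run_to_completion():
    """Client side: follow continuations until results are complete."""
    results, continuation = [], 0
    while continuation is not None:
        chunk, continuation = endpoint(continuation)
        results.extend(chunk)
    return results

print(len(run_to_completion()))  # 25: complete despite the quota
```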

2025

LLM4Schema.org: Generating Schema.org Markups with Large Language Models

MH Dang, THT Pham, P Molli, H Skaf-Molli, A Gaignard

Semantic Web Journal

Abstract: The integration of Schema.org markup into web pages has resulted in billions of RDF triples, yet around 75% of web pages still lack this critical markup. Large Language Models (LLMs) present a promising solution by automatically generating the missing Schema.org markup. Despite this potential, there is currently no benchmark to evaluate the markup quality produced by LLMs. This paper introduces LLM4Schema.org, an innovative approach for assessing the performance of LLMs in generating Schema.org markup. Unlike traditional methods, LLM4Schema.org does not require a predefined ground truth. Instead, it compares the quality of LLM-generated markup against human-generated markup. Our findings reveal that 40–50% of the markup produced by GPT-3.5 and GPT-4 is invalid, non-factual, or non-compliant with the Schema.org ontology. However, specialized LLM-powered agents can effectively identify and eliminate these errors. After applying such filtering for both human and LLM-generated markup, GPT-4 shows notable improvements in quality and outperforms humans. LLM4Schema.org highlights both the potential and challenges of leveraging LLMs for semantic annotations, emphasizing the critical role of careful curation and validation in achieving reliable results.
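The filtering step can be pictured with a deliberately tiny stub. The paper's agents are LLM-powered and check validity, factuality, and Schema.org compliance; the `filter_markup` function below, with its hypothetical two-type allow-list, checks only that property names comply.

```python
# Toy sketch of the "filtering agents" idea: discard generated
# Schema.org markup whose properties do not comply with the vocabulary.
# The allow-list below is a tiny hypothetical subset of Schema.org,
# not the real ontology; real validation is far richer.

SCHEMA_ORG_PROPS = {
    "Person": {"name", "jobTitle", "affiliation"},
    "Article": {"headline", "author", "datePublished"},
}

def filter_markup(items):
    """Keep only items whose type is known and whose properties comply."""
    kept = []
    for item in items:
        allowed = SCHEMA_ORG_PROPS.get(item.get("@type"))
        props = {k for k in item if not k.startswith("@")}
        if allowed is not None and props <= allowed:
            kept.append(item)
    return kept

generated = [
    {"@type": "Person", "name": "Ada", "jobTitle": "Researcher"},
    {"@type": "Person", "name": "Bob", "favoriteColor": "blue"},  # bad prop
    {"@type": "Gadget", "name": "X"},                             # bad type
]
print([item["name"] for item in filter_markup(generated)])  # ['Ada']
```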

2025

Continuation Queries: Embracing Timeouts on Public SPARQL Endpoints

THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

ISWC 2025 Companion Volume, Nara, Japan

Abstract: Public SPARQL endpoints like Wikidata enforce strict timeout limits that prevent complete query results. Rather than treating timeouts as failures, our approach leverages continuation queries to build complete answers from partial results. When a SPARQL endpoint reaches its time quota, it returns partial results along with a new SPARQL query designed to retrieve the missing results. This can be repeated iteratively to recover complete answers. The demonstration uses a Passage instance loaded with 13 billion triples from Wikidata 2025.

2025

Fraw: Sampling-Based Approximate Query Processing for Federations of SPARQL Endpoints

E Boisteau-Desdevises, THT Pham, G Montoya, B Nédelec, H Skaf-Molli, P Molli

ISWC 2025 Companion Volume, Nara, Japan

Abstract: Fraw is a SPARQL federation engine that enables users to query multiple SPARQL endpoints as if all RDF data were available through a single virtual endpoint. The system employs sampling-based approximate query processing using random walks, beneficial when timely responses matter and approximate results are acceptable. We demonstrate the engine's utility through an interactive SPARQL query autocompletion feature, which supplies user suggestions during query composition despite federated querying complexity.
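A minimal sketch of sampling-based cardinality estimation via random walks, in the WanderJoin style, may help fix the idea. It runs over two in-memory "triple patterns" rather than federated endpoints, and is not Fraw's actual engine.

```python
import random

# Illustrative only: a WanderJoin-style random-walk estimator of a
# join's cardinality, in the spirit of Fraw's sampling-based
# approximate query processing. Toy in-memory data, not SPARQL.

p1 = [("a", 1), ("b", 2), ("c", 2), ("d", 3)]   # (?x, ?y) bindings
p2 = {1: ["u"], 2: ["v", "w"], 3: []}           # ?y -> [?z] bindings

def walk(rng):
    """One random walk; returns an unbiased estimate of the join size."""
    _, y = rng.choice(p1)            # step 1: uniform over p1
    matches = p2.get(y, [])
    if not matches:
        return 0.0                   # dead end: contributes nothing
    rng.choice(matches)              # step 2: pick ?z (value unused here)
    # A completed walk samples one join result with probability
    # 1 / (len(p1) * len(matches)); weight by the inverse probability.
    return float(len(p1) * len(matches))

def estimate(n_walks=20_000, seed=7):
    rng = random.Random(seed)
    return sum(walk(rng) for _ in range(n_walks)) / n_walks

exact = sum(len(p2.get(y, [])) for _, y in p1)  # exact join size: 5
print(exact, round(estimate(), 2))
```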

2024

CRAWD: Sampling-Based Estimation of Count-Distinct SPARQL Queries

THT Pham, P Molli, B Nédelec, H Skaf-Molli, J Aimonier-Davat

International Semantic Web Conference (ISWC 2024, Baltimore), 98-115

Abstract: Count-distinct SPARQL queries compute the number of unique values in the results of a query executed on a Knowledge Graph. However, counting the exact number of distinct values is often computationally demanding and time-consuming. As a result, these queries often fail on public SPARQL endpoints due to fair use policies. In this paper, we propose CRAWD, a new sampling-based approach designed to approximate count-distinct SPARQL queries. CRAWD significantly improves sampling efficiency and makes count-distinct SPARQL queries feasible on public SPARQL endpoints, considerably improving on existing methods.
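For intuition, here is a generic sampling-based distinct-count estimator, the classic Chao1 species-richness estimator, applied to a synthetic value population. It is not CRAWD's estimator, just the general idea of approximating a count-distinct result from a small sample.

```python
import random
from collections import Counter

# Sampling-based estimation of a count-distinct query, for intuition.
# This is the classic Chao1 estimator -- NOT CRAWD's own sampling
# scheme -- shown on a synthetic population of values.

random.seed(42)
# 1000 values drawn from at most 200 possible distinct values.
population = [f"v{random.randint(0, 199)}" for _ in range(1000)]

def chao1_estimate(sample):
    """Estimate the population's distinct count from a sample."""
    counts = Counter(sample)
    d = len(counts)                                 # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # seen exactly twice
    if f2 == 0:
        return d + f1 * (f1 - 1) / 2                # bias-corrected form
    return d + f1 * f1 / (2 * f2)

sample = random.sample(population, 200)             # inspect only 20%
print(len(set(population)), len(set(sample)), round(chao1_estimate(sample)))
```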

2024

Online sampling of summaries from public SPARQL endpoints

THT Pham, H Skaf-Molli, P Molli, B Nédelec

Companion Proceedings of the ACM Web Conference 2024, 617-620

Abstract: This paper investigates whether online sampling can generate summaries useful in cutting-edge SPARQL federation engines. Our experimental studies indicate that sampling allows the creation and maintenance of summaries by exploring less than 20% of datasets. The approach enables the collection of statistics while respecting fair usage policies, which is important for public SPARQL endpoints that have resource limitations.
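The idea of building a summary from a sample can be sketched as follows: a VoID-style per-predicate count estimated after reading only 20% of a toy graph. The predicate names and weights are made up for illustration; the paper targets summaries used by SPARQL federation engines.

```python
import random
from collections import Counter

# Toy sketch of online summary construction: estimate per-predicate
# triple counts from a 20% sample of the graph. Predicate names and
# the skewed distribution are fabricated for the example.

random.seed(1)
PREDICATES = ["rdf:type", "rdfs:label", "foaf:knows", "dbo:birthPlace"]
triples = [(f"s{i}", random.choices(PREDICATES, weights=[5, 3, 1, 1])[0],
            f"o{i}") for i in range(10_000)]

def sampled_summary(triples, fraction=0.2):
    """Scale up predicate counts observed in a uniform triple sample."""
    sample = random.sample(triples, int(len(triples) * fraction))
    scale = 1 / fraction
    return {p: round(c * scale)
            for p, c in Counter(t[1] for t in sample).items()}

summary = sampled_summary(triples)
exact = Counter(t[1] for t in triples)
print(summary)  # estimated counts close to the exact ones
```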

2024

Impact des collections sur les performances des Systèmes de Recherche d'Information

THT Pham, P Galuščáková, P Mulhem, G González Sáez, L Goeuriot

CORIA 2024 (COnférence en Recherche d'Information et Applications)

Abstract: This article is a preliminary study of how corpora evolve and how that evolution affects the performance of information retrieval systems. We propose an approach for building intermediate corpora between two existing ones, then measuring their differences along several characteristics. We then study the correlations between these differences in characteristics and the evaluation scores of a number of information retrieval systems, and we show that query representations are indicators of differences between collections that correlate well with the performance of several variants of information retrieval systems.

2022

Understanding the Energy Consumption of HPC Scale Artificial Intelligence

D Carastan-Santos, THT Pham

Latin American High Performance Computing Conference (CARLA 2022), 131-144

Abstract: This paper contributes towards better understanding the energy consumption trade-offs of HPC scale Artificial Intelligence (AI), and more specifically Deep Learning (DL) algorithms. For this task we developed benchmark-tracker, a benchmark tool to evaluate the speed and energy consumption of DL algorithms in HPC environments. We exploited hardware counters and Python libraries to collect energy information through software, which enabled us to instrument a known AI benchmark tool and to evaluate the energy consumption of numerous DL algorithms and models. Through an experimental campaign, we show how benchmark-tracker can measure the computing speed and energy consumption of DL algorithms during training and inference, and how it can help us better understand the energy behavior of DL algorithms on HPC platforms. This work is a step towards better understanding the energy consumption of Deep Learning in HPC, and it contributes a new tool to help HPC DL developers balance HPC infrastructure in terms of speed and energy consumption.
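A minimal sketch of the tracking idea, assuming a stubbed energy counter: the real benchmark-tracker reads hardware counters (e.g. RAPL) through Python libraries, whereas `read_energy_joules` below just fabricates a constant power draw.

```python
import time

# Minimal sketch of the benchmark-tracker idea: run a workload and
# record its runtime together with an energy reading taken before and
# after. `read_energy_joules` is a stand-in stub for a hardware
# counter (e.g. RAPL energy_uj), NOT a real measurement.

def read_energy_joules():
    """Stub energy counter: pretend a constant ~15 W draw."""
    return time.perf_counter() * 15.0

def track(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, energy_joules)."""
    e0, t0 = read_energy_joules(), time.perf_counter()
    result = fn(*args, **kwargs)
    t1, e1 = time.perf_counter(), read_energy_joules()
    return result, t1 - t0, e1 - e0

result, seconds, joules = track(sum, range(1_000_000))
print(f"{seconds:.4f}s, {joules:.4f}J")
```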

Education

Academic journey from Mathematics to Semantic Web

2023 - Present

PhD in Computer Science (In Progress)

LS2N, Nantes Université, France

Semantic Web, Decentralized Linked Data, Knowledge Graphs, SPARQL

Supervisors: Prof. Pascal Molli, Prof. Hala Skaf-Molli
2021 - 2023

Master in Informatics (MoSIG)

Université Grenoble Alpes – ENSIMAG, Grenoble INP, France

Advanced Algorithms, Machine Learning, Information Retrieval, Natural Language Processing, Statistical Learning

Specialized in Data Mining & Machine Learning
2019 - 2020

Master of Science in Applied Mathematics

France-Vietnam Master Program (PUF) - Université Orléans & VNUHCM

Applied Mathematics with focus on statistical methods and computational approaches

Thesis Grade: 15/20 (Top 5)
2015 - 2019

Bachelor of Science in Applied Mathematics

University of Saigon, Vietnam

Probability Theory, Statistics, Functional Analysis, Mathematical Modeling, High-Performance Computing

GPA: 3.2/4 (Top 2 - Honorable Student)

Professional Certificates

Continuous learning and skill development


IBM RAG and Agentic AI Professional Certificate

Coursera - IBM Skills Network

8-course specialization covering advanced generative AI applications using RAG, agentic, and multimodal AI technologies. Includes building autonomous agents with LangChain, LangGraph, CrewAI, and AutoGen for complex reasoning workflows.

RAG Applications · Vector Databases · LangChain · LangGraph · Agentic AI · Multimodal AI
ID: ZBIJZ606QIJO · Sep 9, 2025 · 8 Courses

IBM AI Developer Professional Certificate

Coursera - IBM Skills Network

10-course comprehensive program covering software engineering, AI fundamentals, and generative AI development. Built web applications using Python, Flask, and generative AI-powered models for chatbots and innovative solutions.

Python Development · Flask · Generative AI · Prompt Engineering · Web Development · AI Applications
ID: VOJE5CBYF3SD · Sep 23, 2025 · 10 Courses