Call Number | SK-2456 (Softcopy SK-1938)
Collection Type | Skripsi (undergraduate thesis)
Title | Development of a Metadata-Related Benchmark Based on Large Language Models (LLMs) for the Retrieval-Augmented Generation (RAG) Module of Pneuma: A Data Discovery System
Author | Muhammad Imam Luthfi Balaka
Publisher | Depok: Fasilkom UI, 2025
Subject | Retrieval-Augmented Generation (RAG)
Location | FASILKOM-UI
Call Number | Collection ID | Status
---|---|---
SK-2456 (Softcopy SK-1938) | | Available
Name : Muhammad Imam Luthfi Balaka
Study Program : Computer Science
Title : Development of a Metadata-Related Benchmark Based on Large Language Models (LLMs) for the Retrieval-Augmented Generation (RAG) Module of Pneuma: A Data Discovery System
Counselor : Adila Alfa Krisnadhi, S.Kom., M.Sc., Ph.D.

Pneuma is a system under development that aims to address the limitations of current data discovery systems for structured data (datasets). It uses the Retrieval-Augmented Generation (RAG) mechanism to store and retrieve information. Pneuma stores two types of information: metadata and dataset summaries. Metadata describes a dataset, for example its creator, while a dataset summary summarizes the dataset's column names and row values. A RAG system has many components that can be chosen and tuned. To select and tune these components, we aim to develop two benchmarks, one for each type of stored information (metadata or dataset summary). The metadata-related benchmark, which is the subject of this thesis, is a set of triples (T, Q, A), where T is a dataset, Q is a question designed to elicit metadata from T, and A is the answer to Q given T. We take T and Q from existing resources. We then use two Large Language Models (LLMs), an LLM Generator and an LLM Evaluator, to generate answers and to evaluate them, respectively. For the LLM Evaluator, we design a prompt and select a model through an experiment. For the LLM Generator, we compare several scenarios, each a combination of model and prompt (either a standard or a role-play prompt), and keep the best-performing combination. Following this, we compare different decoding strategies (greedy search, nucleus sampling, and contrastive search) to observe their effects on the evaluations and on the linguistic quality of the generated answers, where linguistic quality is measured by GRUEN scores.

As a result, we select Llama-3-8b-Instruct as the LLM Evaluator because of its high agreement with gpt-4-1106-preview. The best combination of LLM and prompt is OpenHermes-2.5-Mistral-7B with the role-play prompt. Finally, among the decoding strategies, greedy search produced the largest number of good answers, but it lagged behind the other two strategies in linguistic quality. Based on these experiments, we keep the benchmark generated with the combination of OpenHermes-2.5-Mistral-7B, the role-play prompt, and greedy search for selecting and tuning the components of Pneuma's RAG module.
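The decoding strategies compared in the abstract map directly onto generation parameters in the Hugging Face transformers library. The following is a minimal sketch, assuming the LLM Generator is loaded from the public teknium/OpenHermes-2.5-Mistral-7B checkpoint; the example dataset description T, question Q, role-play prompt wording, and sampling hyperparameters (top_p, penalty_alpha, top_k) are illustrative placeholders, not the exact configuration used in the thesis.

```python
# Sketch: generating a candidate answer A for one (T, Q) pair of the benchmark
# under three decoding strategies. Names and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "teknium/OpenHermes-2.5-Mistral-7B"  # assumed public checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical dataset description T and metadata question Q.
T = "A CSV table of city taxi trips with columns trip_id, pickup_time, fare."
Q = "Who created this dataset?"

# Hypothetical role-play style prompt; the actual template is defined in the thesis.
prompt = (
    "You are a data librarian answering questions about a dataset.\n"
    f"Dataset description: {T}\n"
    f"Question: {Q}\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The three decoding strategies, expressed as generate() keyword arguments.
decoding_configs = {
    "greedy": dict(do_sample=False),
    "nucleus_sampling": dict(do_sample=True, top_p=0.95),
    "contrastive_search": dict(do_sample=False, penalty_alpha=0.6, top_k=4),
}

answers = {}
for name, cfg in decoding_configs.items():
    output = model.generate(**inputs, max_new_tokens=128, **cfg)
    # Strip the prompt tokens and keep only the generated answer A.
    answer_ids = output[0][inputs["input_ids"].shape[-1]:]
    answers[name] = tokenizer.decode(answer_ids, skip_special_tokens=True)
```

Each generated answer would then be scored by the LLM Evaluator and by GRUEN to compare the strategies, as described above.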