
SecretLLM - Cultural QA System
Optimizing Llama-3-8B for cultural reasoning during a two-month project at TU Dresden.
Overview
Developed for the "Behind the Secrets of Large Language Models" module at TU Dresden, this project addresses the "cultural gap" in standard LLMs. I engineered a question-answering system using Meta-Llama-3-8B that improves accuracy on cultural tasks. While Supervised Fine-Tuning (SFT) helped with format alignment, implementing a Dynamic Retrieval-Augmented Generation (RAG) system yielded the largest gains, increasing Short Answer Question (SAQ) accuracy by 0.16 over the baseline. The full project report can be found in the GitHub repository.
Tech Stack
Meta-Llama-3-8B · Hugging Face Transformers/PEFT (LoRA) · 4-bit NF4 quantization · `all-MiniLM-L6-v2` embeddings · DuckDuckGo search (ablation)
The Challenge
Standard Large Language Models often exhibit Western-centric biases and lack the nuance required for specific cultural queries. In addition, the 8B-parameter model struggled with strict output formatting (JSON) and zero-shot reasoning on complex Short Answer Questions (SAQs), often leading to "instruction drift".
The Solution
I moved from complex JSON prompts to simplified natural-language instructions and implemented a "Dynamic Few-Shot" RAG framework. By converting Multiple Choice data into Short Answer pairs, I created an augmented knowledge base. For every incoming query, the system retrieves the top k = 3 semantically similar examples and injects them into the prompt, grounding the model in relevant cultural context.
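A minimal sketch of that retrieval step, assuming the knowledge base is a list of question/answer pairs (the names and example entries below are illustrative, not the project's actual code):

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model used for semantic retrieval.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical augmented knowledge base: MCQ items converted to direct QA pairs.
knowledge_base = [
    {"question": "Which festival marks the Lunar New Year in Vietnam?", "answer": "Tet"},
    {"question": "What garment is traditionally worn at Japanese tea ceremonies?", "answer": "Kimono"},
    # ... more QA pairs ...
]
kb_embeddings = embedder.encode(
    [item["question"] for item in knowledge_base], convert_to_tensor=True
)

def build_few_shot_prompt(query: str, k: int = 3) -> str:
    """Retrieve the top-k semantically similar QA pairs and prepend them to the query."""
    query_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, kb_embeddings, top_k=k)[0]
    shots = [knowledge_base[hit["corpus_id"]] for hit in hits]
    examples = "\n\n".join(f"Question: {s['question']}\nAnswer: {s['answer']}" for s in shots)
    return f"{examples}\n\nQuestion: {query}\nAnswer:"
```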
System Architecture
The solution uses the Llama-3-8B model with 4-bit quantization (NF4) for efficiency and `all-MiniLM-L6-v2` for semantic embedding and retrieval. The pipeline includes a data augmentation stage in which training data is stripped of its answer options to create direct QA pairs. Inference uses greedy search to ensure the deterministic, concise outputs required by the evaluation script.
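The loading and decoding setup, sketched with Hugging Face Transformers and bitsandbytes; the generation parameters and placeholder prompt are assumptions, not the project's tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"

# 4-bit NF4 quantization so the 8B model fits on limited hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# In the full pipeline the prompt comes from the dynamic few-shot builder above;
# a plain question stands in for it here.
prompt = "Question: Which festival marks the start of the harvest season in Punjab?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy search (do_sample=False): deterministic, concise outputs for the evaluation script.
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```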
Key Features
Dynamic RAG
Retrieval system that injects semantically relevant "in-context" examples for each specific query, boosting SAQ accuracy by 0.16 over the baseline.
Data Augmentation
Automated pipeline that transforms MCQ datasets into SAQ pairs, effectively doubling the data available for the retrieval corpus.
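A rough illustration of the MCQ-to-SAQ conversion, assuming each MCQ record carries its options and the index of the correct one (the field names are hypothetical):

```python
def mcq_to_saq(item: dict) -> dict:
    """Strip the options from an MCQ record, keeping the question and the gold answer text."""
    # Hypothetical schema: {"question": str, "options": list[str], "answer_idx": int}
    return {"question": item["question"], "answer": item["options"][item["answer_idx"]]}

example_mcq = {
    "question": "Which country traditionally celebrates the Day of the Dead?",
    "options": ["Mexico", "Japan", "Italy", "Kenya"],
    "answer_idx": 0,
}
print(mcq_to_saq(example_mcq))  # {'question': '...', 'answer': 'Mexico'}

# Applied over the full MCQ split, this yields the SAQ pairs added to the retrieval corpus.
```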
Efficient Fine-Tuning
Used LoRA (Low-Rank Adaptation) and quantization to fine-tune the 8B model on limited hardware, optimizing for task alignment.
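A sketch of the PEFT/LoRA setup referenced here; the rank, alpha, and target modules are plausible defaults, not the project's reported hyperparameters:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # standard step before LoRA on quantized weights

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                    # assumed hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # Llama attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```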
Ablation Studies
Evaluated external internet search (DuckDuckGo), discovering that "clean" internal data outperforms noisy web results for this specific domain.
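The web-search ablation could look roughly like this, using the `duckduckgo_search` package; this is an assumed implementation of the search step, not the project's code:

```python
from duckduckgo_search import DDGS

def web_context(query: str, max_results: int = 3) -> str:
    """Fetch short web snippets to inject as external context (ablation condition)."""
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=max_results)
    return "\n".join(r["body"] for r in results)

# In the ablation, this noisier web context was compared against the clean,
# internally retrieved QA examples and was found to perform worse on cultural QA.
print(web_context("traditional wedding customs in Ghana"))
```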
Development Timeline
Setup & Training
Setting up the environment with Transformers/PEFT and executing the initial fine-tuning runs.
RAG Implementation
Developing the Dynamic Few-Shot logic and constructing the augmented knowledge base.
Evaluation
Running ablation studies on decoding strategies and external search integration.
Report Submission
Finalizing the project report and analysis of the 0.16 accuracy gain.