
SecretLLM - Cultural QA System

Optimizing Llama-3-8B for cultural question answering in a two-month project at TU Dresden.

  • Python
  • Transformers
  • Sentence Transformers
  • PEFT / LoRA
  • Llama 3 8B
  • Hugging Face
  • WandB
  • BitsAndBytes
  • Accuracy Gain: +0.16
  • Parameter Size: 8 Billion
  • Quantization: 4-bit

The Challenge

Standard Large Language Models often exhibit Western-centric biases and lack the nuance required for specific cultural queries. Additionally, the 8B parameter model struggled with strict output formatting (JSON) and zero-shot reasoning for complex Short Answer Questions (SAQ), often leading to "instruction drift".

The Approach

I stripped away complex JSON prompts in favor of natural language instructions and built a "Dynamic Few-Shot" RAG framework. By transforming multiple-choice datasets into direct QA pairs, the system dynamically retrieves and injects the top 3 most relevant examples into every prompt.
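The dynamic few-shot logic can be sketched as follows. This is a minimal illustration: the real pipeline embeds text with `all-MiniLM-L6-v2` via Sentence Transformers, whereas the `vec` fields and example corpus below are hypothetical stand-ins.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query_vec, corpus, k=3):
    """Return the k QA pairs whose embeddings are most similar to the query."""
    ranked = sorted(corpus, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(question, examples):
    """Inject the retrieved QA pairs as in-context demonstrations."""
    shots = "\n\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in examples)
    return f"{shots}\n\nQ: {question}\nA:"

# Hypothetical pre-embedded corpus (real vectors come from all-MiniLM-L6-v2).
corpus = [
    {"q": "What is the traditional greeting in Japan?", "a": "A bow.", "vec": [0.9, 0.1, 0.0]},
    {"q": "Which festival marks the Lunar New Year in Vietnam?", "a": "Tet.", "vec": [0.2, 0.8, 0.1]},
    {"q": "What dish is central to a Korean birthday?", "a": "Seaweed soup.", "vec": [0.1, 0.2, 0.9]},
]
top = retrieve_top_k([0.85, 0.15, 0.05], corpus, k=2)
prompt = build_prompt("How do people greet elders in Japan?", top)
```

Because the examples are chosen per query rather than fixed, each prompt carries the demonstrations most relevant to the question at hand.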

The Impact

The dual approach of LoRA fine-tuning and Dynamic RAG decisively outperformed the zero-shot baseline. The system produced the logically sound, culturally accurate answers required by the strict academic evaluator, yielding a 0.16 (16-percentage-point) absolute accuracy gain.

System Architecture

The solution utilizes the Llama-3-8B model with 4-bit quantization (NF4) for efficiency. It employs `all-MiniLM-L6-v2` for semantic embedding and retrieval. The pipeline includes a data augmentation stage where training data is stripped of options to create direct QA pairs. Inference uses Greedy Search to ensure deterministic and concise outputs required by the evaluation script.
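The quantization and decoding settings described above can be sketched as Hugging Face configuration objects. The exact hyperparameter values here (compute dtype, token budget) are illustrative assumptions, not the project's recorded settings.

```python
import torch
from transformers import BitsAndBytesConfig, GenerationConfig

# 4-bit NF4 quantization via BitsAndBytes for memory-efficient loading of the 8B model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # assumed compute dtype
    bnb_4bit_use_double_quant=True,
)

# Greedy search: sampling disabled so outputs are deterministic and concise.
gen_config = GenerationConfig(
    do_sample=False,     # greedy decoding
    num_beams=1,
    max_new_tokens=64,   # short SAQ answers; illustrative value
)
```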

Key Engineering Features

Dynamic RAG

Retrieval system that injects semantically relevant in-context examples for each specific query, boosting SAQ accuracy by 16 percentage points.

Data Augmentation

Automated pipeline to transform MCQ datasets into SAQ pairs, effectively doubling the training resources for the retrieval corpus.
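A minimal sketch of that MCQ-to-SAQ transformation. The field names (`question`, `options`, `answer_idx`) are assumptions about the dataset schema, not its actual keys.

```python
def mcq_to_saq(mcq_items):
    """Strip answer options from MCQ records, keeping only direct QA pairs.

    Each input record is assumed to look like:
        {"question": str, "options": [str, ...], "answer_idx": int}
    """
    qa_pairs = []
    for item in mcq_items:
        qa_pairs.append({
            "question": item["question"],
            # Resolve the correct option into a free-form short answer.
            "answer": item["options"][item["answer_idx"]],
        })
    return qa_pairs

mcq = [{
    "question": "Which country celebrates Songkran?",
    "options": ["Thailand", "Peru", "Norway", "Kenya"],
    "answer_idx": 0,
}]
saq = mcq_to_saq(mcq)
```

Dropping the options forces the model to answer from knowledge rather than eliminate distractors, and the resulting pairs double as entries in the retrieval corpus.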

Efficient Fine-Tuning

Used LoRA (Low-Rank Adaptation) and quantization to fine-tune the 8B model on limited hardware, optimizing for task alignment.
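The LoRA setup can be sketched with PEFT. The rank, scaling factor, and target modules below are typical values for Llama-style attention layers, assumed rather than taken from the project.

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # low-rank adapter dimension (assumed)
    lora_alpha=32,     # scaling factor (assumed)
    lora_dropout=0.05,
    # Attention projection matrices commonly adapted in Llama models.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# The adapter is then attached to the 4-bit quantized base model:
# model = get_peft_model(base_model, lora_config)
```

Only the small adapter matrices are trained, which keeps the memory footprint within limited-hardware budgets while the quantized base weights stay frozen.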

Ablation Studies

Evaluated external internet search (DuckDuckGo), discovering that "clean" internal data outperforms noisy web results for this specific domain.

Development Lifecycle

Setup & Training

Setting up the environment with Transformers/PEFT and executing initial Fine-Tuning runs.

RAG Implementation

Developing the Dynamic Few-Shot logic and constructing the augmented knowledge base.

Evaluation

Running ablation studies on decoding strategies and external search integration.

Report Submission

Finalizing the project report and analysis of the 0.16 accuracy gain.