
SecretLLM: Cultural QA System

Parameter-efficient fine-tuning and dynamic RAG alignment for cultural reasoning on quantized Llama-3-8B.

  • Python
  • Transformers
  • Sentence Transformers
  • PEFT and LoRA
  • Llama 3 8B
  • Hugging Face
  • WandB
  • BitsAndBytes

  • Accuracy gain: +16%
  • Parameter size: 8 billion
  • Quantization: 4-bit

The Challenge

Large language models struggle with geographic and cultural nuance because of representation gaps in their pre-training corpora. Smaller open-weights models (around 8 billion parameters) also drift from their instructions and comply inconsistently with output formats when given complex zero-shot formatting instructions.

The Approach

Replaced complex formatting constraints with natural instruction layouts and built a dynamic few-shot retrieval framework. The reference corpus is structured as clean query-response pairs; at inference time, the pipeline retrieves the three pairs most semantically similar to the incoming query and inserts them into the prompt as in-context examples.
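The retrieval step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the toy `embed` function is a bag-of-words stand-in for a real sentence-embedding model, and `CORPUS`, `retrieve_top_k`, `build_prompt`, and the example pairs are hypothetical names and data invented for the sketch.

```python
import math
import re
from collections import Counter

def embed(text):
    # Assumption: toy bag-of-words vector standing in for a sentence embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query, corpus, k=3):
    # Rank stored (question, answer) pairs by similarity of their question to the query.
    q_vec = embed(query)
    ranked = sorted(corpus, key=lambda pair: cosine(q_vec, embed(pair[0])), reverse=True)
    return ranked[:k]

def build_prompt(query, corpus, k=3):
    # Insert the retrieved pairs as few-shot examples ahead of the new question.
    shots = [f"Question: {q}\nAnswer: {a}\n" for q, a in retrieve_top_k(query, corpus, k)]
    shots.append(f"Question: {query}\nAnswer:")
    return "\n".join(shots)

# Hypothetical reference corpus of query-response pairs.
CORPUS = [
    ("What is the traditional new year festival in Thailand?", "Songkran"),
    ("What is the national sport of Japan?", "Sumo"),
    ("Which festival of lights is celebrated in India?", "Diwali"),
    ("What is a traditional Japanese garment?", "Kimono"),
]

prompt = build_prompt("What festival marks the new year in Thailand?", CORPUS)
print(prompt)
```

In a real pipeline the corpus embeddings would normally be computed once and cached, so that only the query is embedded at inference time.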

The Impact

The pipeline combining parameter-efficient fine-tuning (LoRA via PEFT) with dynamic RAG outperformed the baseline models and passed the strict evaluation checks, with a 16% absolute (16 percentage point) accuracy improvement on the short-answer evaluation suites.
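To make "absolute improvement" concrete, here is a small sketch of how short-answer accuracy and a percentage-point gain could be computed. It assumes exact-match scoring after light normalization, which the write-up does not specify; `normalize`, `accuracy`, `absolute_gain_pp`, and all predictions below are hypothetical toy values, not project results.

```python
import string

def normalize(text):
    # Lowercase, drop ASCII punctuation, and collapse whitespace before comparing.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def accuracy(predictions, references):
    # Fraction of predictions that exactly match the reference after normalization.
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

def absolute_gain_pp(baseline_acc, new_acc):
    # Absolute gain in percentage points (not a relative percentage change).
    return (new_acc - baseline_acc) * 100

# Toy data for illustration only.
references = ["Songkran", "Sumo", "Diwali", "Kimono", "Hanami"]
baseline_preds = ["songkran.", "Judo", "Holi", "kimono", "Obon"]
tuned_preds = ["Songkran", "Sumo", "Diwali", "Kimono", "Obon"]

base_acc = accuracy(baseline_preds, references)
tuned_acc = accuracy(tuned_preds, references)
print(f"baseline={base_acc:.2f} tuned={tuned_acc:.2f} "
      f"gain={absolute_gain_pp(base_acc, tuned_acc):.1f} pp")
```

An absolute gain subtracts the two accuracies directly, so it stays comparable regardless of how low the baseline was.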

System Architecture

The solution uses the Llama-3-8B model with 4-bit quantization for efficiency, and the MiniLM-L6-v2 model for semantic embedding and retrieval. The pipeline includes a data augmentation stage in which multiple-choice training data is stripped of its answer options to create direct query-response pairs. Inference uses greedy search so that outputs are deterministic and concise, as the evaluation script requires.
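Greedy search picks the single highest-scoring token at every step, which is why repeated runs give identical output. The sketch below illustrates this with a hypothetical lookup table (`TOY_TABLE`) in place of a real model; `greedy_decode` and `toy_scores` are illustrative names. In Hugging Face `generate`, greedy decoding corresponds to `do_sample=False` with `num_beams=1`.

```python
def greedy_decode(next_token_scores, prompt_tokens, eos="<eos>", max_new_tokens=8):
    # At each step take the argmax token; stop at end-of-sequence or the length cap.
    generated = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(list(prompt_tokens) + generated)
        best = max(scores, key=scores.get)
        if best == eos:
            break
        generated.append(best)
    return generated

# Hypothetical toy "model": next-token scores keyed by the last token seen.
TOY_TABLE = {
    "Answer:": {"Songkran": 0.7, "Diwali": 0.2, "<eos>": 0.1},
    "Songkran": {"<eos>": 0.9, "festival": 0.1},
}

def toy_scores(tokens):
    # Unknown contexts end the sequence immediately.
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})

prompt_tokens = ["Question:", "Thai", "new", "year", "festival?", "Answer:"]
first = greedy_decode(toy_scores, prompt_tokens)
second = greedy_decode(toy_scores, prompt_tokens)
print(first, first == second)
```

Because no sampling is involved, the short answer is reproducible across runs, which suits an evaluation script that compares strings.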

Key Engineering Features

Dynamic Retrieval

Retrieval system that injects semantically relevant in-context examples for each specific query, boosting short-answer accuracy by 16%.

Dataset Restructuring

Automated pipeline that transforms multiple-choice datasets into direct query-response pairs, effectively doubling the number of examples available to the retrieval corpus.
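A minimal sketch of this conversion is shown below. The record schema (`question`, `options`, `answer` as an option key) and the helper names `mcq_to_qa` and `build_retrieval_corpus` are assumptions made for illustration; the actual dataset format is not described here. The sketch also assumes the converted pairs are merged with an existing short-answer set, which is one way the corpus could grow.

```python
def mcq_to_qa(record):
    # Drop the distractors and keep only the text of the correct option.
    options = record["options"]
    key = record["answer"]
    if key not in options:
        return None  # skip malformed records rather than guessing
    return {"question": record["question"].strip(), "answer": options[key].strip()}

def build_retrieval_corpus(mcq_records, short_answer_pairs):
    # Merge converted multiple-choice items with existing short-answer pairs.
    converted = [qa for qa in map(mcq_to_qa, mcq_records) if qa is not None]
    return list(short_answer_pairs) + converted

# Hypothetical input data.
mcq_records = [
    {
        "question": "Which festival marks the traditional new year in Thailand? ",
        "options": {"A": "Diwali", "B": "Songkran", "C": "Hanami", "D": "Holi"},
        "answer": "B",
    },
    {"question": "Broken item", "options": {"A": "x"}, "answer": "E"},
]
short_answer_pairs = [
    {"question": "What is the national sport of Japan?", "answer": "Sumo"},
]

corpus = build_retrieval_corpus(mcq_records, short_answer_pairs)
print(corpus)
```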

Quantized Tuning

Used Low-Rank Adaptation (LoRA) on top of 4-bit quantized weights to fine-tune the 8-billion-parameter model on limited hardware, aligning it with the short-answer task.
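A rough back-of-the-envelope calculation shows why this combination fits limited hardware. The layer dimensions below are the published Llama-3-8B shapes (hidden size 4096, 32 layers, 1024-wide key/value projections); the LoRA rank of 16 and the choice to adapt only the query and value projections are assumptions for illustration, since the project's adapter settings are not stated. The memory figure covers weights only and ignores quantization constants, activations, and optimizer state.

```python
def lora_params(d_in, d_out, rank):
    # LoRA adds two low-rank matrices per adapted weight: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

def weight_gigabytes(n_params, bits):
    # Approximate storage for the weights alone, in decimal gigabytes.
    return n_params * bits / 8 / 1e9

HIDDEN = 4096   # Llama-3-8B hidden size
KV_DIM = 1024   # 8 key/value heads x 128 dims (grouped-query attention)
LAYERS = 32
RANK = 16       # assumption: illustrative LoRA rank

# Assumption: adapters on q_proj (4096 -> 4096) and v_proj (4096 -> 1024) only.
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
trainable = per_layer * LAYERS

base_params = 8_000_000_000  # rounded parameter count
print(f"trainable adapter parameters: {trainable:,}")
print(f"share of base model: {trainable / base_params:.4%}")
print(f"weights at 16-bit: {weight_gigabytes(base_params, 16):.1f} GB")
print(f"weights at 4-bit:  {weight_gigabytes(base_params, 4):.1f} GB")
```

Under these assumptions the adapters train well under 1% of the model's parameters, while 4-bit storage cuts the weight footprint to roughly a quarter of its 16-bit size.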

Ablation Testing

Evaluated external search integration, discovering that clean internal data outperforms noisy web results for this specific domain.

Development Lifecycle

PEFT Configuration

Configured the model environment with parameter-efficient adapters and ran baseline experiments.

Dynamic RAG Integration

Implemented few-shot example retrieval and built the retrieval database.

Ablation Studies

Conducted ablation studies testing decoding parameters and search API dependencies.

Analysis and Delivery

Finalized the analysis report documenting the 16% benchmark improvement.