Building Mufti GPT’s RAG-backed Assistant
I built this system because I wanted an assistant that felt like a trusted teacher: one that answers precisely, cites responsibly, and points the user to canonical sources when necessary. The core engineering problem was blending large, high-quality textual sources (Qur’an, Sunnah, classical Tafsir, curated duaa collections, and an extensive digital hadith corpus) with transformer models in a way that preserves provenance, minimizes hallucinations, and delivers sub-second retrieval, all while keeping privacy and safety rules in force.
What we built (high level)
- A retrieval-augmented generation (RAG) pipeline that treats canonical Islamic texts and curated secondary sources as first-class knowledge. The retrieval layer returns tight, source-tagged passages and metadata; the generator composes answers over those retrieved chunks while obeying explicit system constraints and safety rules (a minimal sketch of this flow follows the list).
- A lightweight orchestration layer that (1) ingests, normalizes, and chunks textual sources; (2) embeds them into a vector index; and (3) performs fast semantic retrieval with relevance-ranking and provenance attached to every hit.
- A controlled generation stage built around our own ensemble/fine-tuned mix of models plus a strict system prompt. The generator is guided to cite scripture and hadith precisely, avoid disallowed topics, escalate to a qualified scholar for weighty legal rulings, and provide concise, actionable steps when appropriate.
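To make the shape of a single turn concrete, here is a minimal sketch of the query path: retrieve source-tagged passages, then generate over them under the system constraints. The `Passage` fields and the `retriever` and `generator` objects are illustrative placeholders, not our internal API.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source: str         # e.g. collection or book name
    ref: str            # chapter/ayah or hadith id
    text: str           # the retrieved extract
    justification: str  # why the retriever returned it

def answer_turn(question: str, retriever, generator, system_prompt: str) -> str:
    """One query turn: semantic retrieval with provenance, then constrained generation."""
    passages = retriever.retrieve(question, top_k=5)         # hypothetical retriever API
    context = "\n\n".join(f"[{p.source} {p.ref}] {p.text}" for p in passages)
    prompt = (f"{system_prompt}\n\nRetrieved passages:\n{context}\n\n"
              f"Question: {question}")
    return generator.generate(prompt)                        # hypothetical generator API
```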
Data, ingestion, and normalization
I spent a lot of time on the data layer because the model is only as truthful as its sources. In practice that meant:
- Canonical sources: the Qur’an (verified translations and original Arabic), Sahih collections, and widely accepted hadith corpora. I normalized metadata so every hadith includes chain/collection identifiers when available.
- Secondary sources: trusted Tafsir extracts, contemporary scholar notes, curated duaa collections, and project-specific annotations for context (e.g., region-specific fiqh notes).
- Preprocessing: normalization of Arabic orthography, consistent tokenization, stripping of OCR artifacts, deduplication, and aligning Arabic with trusted translations and transliterations. Each chunk is stored with full provenance fields: source id, chapter/ayah/hadith id, speaker/chain info, and ingestion timestamp.
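As a sketch of what one stored chunk looks like after normalization, assuming illustrative field names rather than our actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Chunk:
    """One normalized text chunk with full provenance attached."""
    source_id: str                    # e.g. "quran", "bukhari"
    ref_id: str                       # chapter/ayah or hadith id
    arabic: str                       # normalized Arabic text
    translation: str                  # aligned, trusted translation
    chain_info: Optional[str] = None  # chain/collection identifiers when available
    ingested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```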
Embeddings and the vector index
I use dense semantic embeddings to map questions and passages into the same vector space. The index stores vectors together with the provenance metadata. Important engineering choices:
- Chunking strategy: we use overlapping, meaning-preserving chunks, sized so that each one is small enough to cite precisely yet large enough to carry the context needed for reliable retrieval.
- Hybrid ranking: lexical signals (term matches) are combined with semantic similarity and an engineered proximity score that favors smaller, more focused chunks (sketched after this list).
- Freshness and TTL: recent corrections or manual overrides can be prioritized via a small delta-index; we avoid blind re-indexing and prefer append/patch flows.
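As a rough sketch of how these signals combine, here is an illustrative scoring function; the weights and the length penalty are made-up values, not the tuned ones we ship.

```python
import math

def hybrid_score(lexical: float, semantic: float, chunk_tokens: int,
                 w_lex: float = 0.3, w_sem: float = 0.7) -> float:
    """Blend lexical and semantic relevance, nudging the ranking toward smaller chunks."""
    base = w_lex * lexical + w_sem * semantic
    length_penalty = 1.0 / (1.0 + math.log1p(chunk_tokens / 200))  # shorter chunks keep more score
    return base * length_penalty
```

In practice the weights are tuned against the held-out QA pairs described in the evaluation section below.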
Retrieval and provenance
A core design principle was that every retrieved item must carry a provenance header. The runtime pipeline returns a ranked list of candidate passages with these fields: source, sub-source (e.g., chapter/verse/hadith id), extract text, and a compact justification string explaining why it was returned. This justification is used by the next stage to format citations and to tell the generator what to trust.
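A minimal sketch of how a candidate might be packaged with its provenance header and justification string; the field names and the term-match heuristic are assumptions for illustration.

```python
def make_candidate(source: str, sub_source: str, extract: str,
                   query_terms: set[str], semantic_score: float) -> dict:
    """Package one retrieved passage with provenance and a compact justification."""
    matched = sorted(t for t in query_terms if t in extract.lower())
    return {
        "source": source,            # e.g. collection name
        "sub_source": sub_source,    # chapter/verse/hadith id
        "extract": extract,          # the retrieved text
        "justification": (f"semantic_sim={semantic_score:.2f}; "
                          f"matched_terms={','.join(matched) or 'none'}"),
    }
```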
Prompting, system constraints, and alternation rules
I wrote a carefully engineered system instruction (stored server-side) that tells the generator how to behave. It opens the first reply with a respectful salaam and Bismillah, and begins subsequent replies with Bismillah unless the user greets again. It refuses prohibited categories with a single standardized warning token, adds a follow-up question or next-action recommendation on every other reply, and prioritizes the user’s configured madhab while accepting explicit temporary overrides in the session context. The prompt also passes the retrieved passages in explicitly and requires the generator to embed an inline citation wherever a claim maps to a retrieved passage.
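Here is a heavily paraphrased sketch of how the per-session instruction gets assembled; the wording, the function name, and the rule list are simplifications of the real server-side prompt.

```python
from typing import Optional

def build_system_prompt(madhab: str, first_turn: bool,
                        override_madhab: Optional[str] = None) -> str:
    """Assemble the per-session system instruction (paraphrased, illustrative)."""
    rules = [
        "Open with a respectful salaam and Bismillah." if first_turn
        else "Begin with Bismillah unless the user greets again.",
        "Cite scripture and hadith precisely and inline, only from the retrieved passages.",
        "Refuse prohibited categories with the single standardized warning token.",
        "For weighty legal rulings, recommend consulting a qualified scholar.",
        "On every other reply, add one follow-up question or next-action recommendation.",
        f"Follow the {override_madhab or madhab} madhab for this session.",
    ]
    return "\n".join(f"- {rule}" for rule in rules)
```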
Model ensemble & fine-tuning
We don’t rely on any single off-the-shelf model. Instead, we aggregate outputs from a small ensemble of tuned components and a fine-tuned head that specializes in jurisprudence-style reasoning and citation generation. The ensemble yields a primary draft which is then re-scored against retrieval provenance and constrained by safety heuristics. If a generated claim cannot be linked to retrieved evidence, the renderer either downgrades the certainty language or refuses the claim and suggests consulting a scholar.
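A sketch of the evidence-gating step, assuming a pluggable `similarity` function and illustrative thresholds; the real re-scorer is more involved.

```python
from typing import Callable

def gate_claim(claim: str, evidence: list[str],
               similarity: Callable[[str, str], float],
               threshold: float = 0.75) -> str:
    """Keep a claim only if it links to retrieved evidence; otherwise hedge or refuse it."""
    best = max((similarity(claim, passage) for passage in evidence), default=0.0)
    if best >= threshold:
        return claim                                        # well supported: keep as-is
    if best >= 0.5 * threshold:
        return f"Based on the available sources, it appears that {claim}"  # downgraded certainty
    return ("I could not find direct evidence for this claim; "
            "please consult a qualified scholar.")
```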
Safety, prohibitions, and warning flow
I implemented an explicit policy pathway: disallowed queries produce a single, exact warning phrase and increment a local warning counter. After a configurable number of warnings, the UI disables chat and asks the user to contact support to restore access. This flow is enforced at both the retrieval and generation layers, ensuring that disallowed prompts never leak into the generation stage.
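A minimal sketch of the warning flow; the exact warning phrase, the support message, and the default of three warnings stand in for the configurable values.

```python
from typing import Optional

class WarningGate:
    """Blocks disallowed prompts before they reach retrieval or generation."""
    WARNING = "This topic cannot be discussed here."  # stand-in for the exact warning phrase

    def __init__(self, max_warnings: int = 3):        # illustrative default
        self.max_warnings = max_warnings
        self.count = 0

    def check(self, is_disallowed: bool) -> Optional[str]:
        if not is_disallowed:
            return None                                # pass through to retrieval
        self.count += 1
        if self.count >= self.max_warnings:
            return "Chat is disabled. Please contact support to restore access."
        return self.WARNING
```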
Speed, caching, and latency engineering
My goal was sub-second retrieval and full-turn latency in the low single-digit seconds. To get there I used:
- A small in-memory LRU cache for embeddings of common queries and for recently retrieved passage sets (see the sketch after this list).
- Asynchronous retrieval with optimistic streaming of partial citations so the UI can show a “thinking” state and then render the full answer with citations.
- A compact binary representation for vector tiles that speeds transfer between the retrieval node and the generator.
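As a sketch of the embedding cache, assuming a simple LRU keyed on the raw query string and an illustrative capacity:

```python
from collections import OrderedDict
from typing import Callable

class EmbeddingCache:
    """Small in-memory LRU cache for query embeddings."""
    def __init__(self, capacity: int = 2048):          # illustrative capacity
        self.capacity = capacity
        self._store: OrderedDict[str, list[float]] = OrderedDict()

    def get_or_compute(self, query: str, embed: Callable[[str], list[float]]) -> list[float]:
        if query in self._store:
            self._store.move_to_end(query)             # mark as recently used
            return self._store[query]
        vector = embed(query)                          # cache miss: call the embedding model
        self._store[query] = vector
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)            # evict the least recently used entry
        return vector
```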
Evaluation, metrics, and continuous quality
I established a battery of automated tests and human-in-the-loop checks:
- Precision@k over held-out QA pairs constructed from authoritative sources (sketched after this list).
- Citation accuracy: does the generated claim correctly reference the passage used to justify it?
- Safety regression tests to ensure disallowed categories continue to be blocked.
- Human evaluation panels for scholarly accuracy and helpfulness; the system learns from curated feedback and manual corrections via incremental re-indexing and targeted fine-tuning.
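For reference, precision@k here is simply the share of the top-k retrieved passages that appear in the gold relevant set for a question; a minimal sketch:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved passage ids that are in the gold relevant set."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for pid in top_k if pid in relevant_ids) / len(top_k)
```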
UX, provenance display, and editor features
Because provenance is central, the chat UI shows inline citations and allows users to tap a citation to see the original source snippet, exact verse/hadith id, and the matched text. I also added practical UI improvements: Markdown rendering for rich answers, Enter-to-send with Shift+Enter for newline, and a journey-context injection that personalizes answers based on recent user habit logs (prayer counts, quran minutes, dhikr counts, etc.). The assistant will suggest a follow-up action every other reply to encourage habit formation and learning.
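The journey context is injected as a short prose line in the prompt; this is an illustrative sketch with made-up field names, not the actual log schema.

```python
def journey_context(logs: dict) -> str:
    """Summarize recent habit logs into one context line for the prompt (illustrative fields)."""
    return (f"User journey (last 7 days): {logs.get('prayers_logged', 0)} prayers logged, "
            f"{logs.get('quran_minutes', 0)} minutes of Qur'an, "
            f"{logs.get('dhikr_sessions', 0)} dhikr sessions.")
```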
Privacy and deployment
User data and journey logs are stored locally and only the minimal context required for a session is sent to the model. I minimized telemetry, encrypted persisted secrets, and made the whole inference stack configurable to run on-prem or in a trusted cloud environment.
What this means for users
Practically, users get answers that:
- Prioritize scripture and sahih hadith with clear citations.
- Avoid specious or invented claims by defaulting to conservative language or refusing where evidence is insufficient.
- Are fast and tailored to the user’s stated madhab and personal journey.
Final thoughts (raw and honest)
This is a hard problem: blending ancient canonical sources with modern statistical models means guarding hard at the data layer and forcing the model to justify itself with direct evidence. The biggest engineering wins were not in the model weights but in the plumbing: high-quality ingestion, strict provenance, hybrid retrieval, and a conservative generation policy. We fine-tuned our own mix of models and wrapped them in rules and re-scoring so the assistant behaves like a careful guide rather than an overconfident oracle. The system still needs continuous curation and human supervision—but today it’s a much safer, faster, and more source-aware companion than any single unsupervised model could be.