How Do We Build an Enterprise Knowledge Hub: A Blueprint
- Anand Nerurkar
- Dec 2
What the Knowledge Hub is and goals
High-level architecture (components)
Ingestion pipeline (sources → chunking → embeddings)
Indexing & storage (vector DB + metadata store + KG)
Retrieval / RAG integration & runtime APIs
Governance, QA, and audit controls
Security, privacy & compliance (BFSI focus)
Operations, monitoring & SLOs
Example schemas / payloads and prompt template hints
Interview-ready one-liners and FAQs
1 — What is the Knowledge Hub (Goals)
A Knowledge Hub is the single enterprise-managed, versioned, auditable corpus of authorized documents, policies, SOPs, contract templates, past case notes and curated internal knowledge that powers RAG and other downstream copilots.
Primary goals:
Authoritative: single source of truth for policies and clauses
Updatable: fast updates without model retrain
Auditable: traceable evidence per LLM answer (chunk → doc → version)
Searchable: semantic and exact-match search (hybrid)
Composable: supports RAG, KG, analytics and manual review UIs
2 — High-level Architecture (component map)
Sources (DMS, DBs, Email, SharePoint, Policies, PDFs, Jira, Slack)
↓ Ingest Connectors (ETL / Change Capture)
↓
Preprocessing & Normalizer (OCR, Text Extraction, Language Detect)
↓
Chunker + Metadata Extractor ——> Blob Storage (chunked documents)
↓
Embedder / Vectorizer ——> Vector DB (vector index)
↓
Metadata Store (Postgres/Cosmos) <—— Blacklist/Whitelist/Policy tags
↓
Knowledge Graph Extractor (optional) ——> Graph DB (Neo4j/Cosmos Graph)
↓
RAG Orchestrator & APIs (retriever, reranker, prompt-builder)
↓
LLM Endpoints (private/on-prem) & Post-Validator
↓
Audit Store (immutable): prompts, responses, chunk IDs, evidence URIs
Components:
Connectors: ingest from DMS, SharePoint, Jira, Confluence, regulatory feeds, knowledge authors.
Preprocessor: OCR (Form Recognizer), text cleaning, normalization, language detection.
Chunker: semantic chunking, overlap, chunk metadata (source, clause_id, doc_version).
Embedding service: generate embeddings with controlled model.
Vector DB: Milvus, Pinecone, PGVector, Azure Cognitive Search vector, etc.
Metadata DB: relational DB for provenance + policies.
KG extractor: optional NER + relation extraction for graph queries.
RAG Orchestrator: hybrid retriever + reranker + prompt builder.
Audit & Evidence store: blob storage + immutable logs.
Management UI: upload, review, tag, version, legal sign-off.
3 — Ingestion Pipeline (step-by-step)
Source identification & access: catalog sources (policy repo, legal DMS, RBI circulars, SOPs, contracts, emails, past manual review notes). Define owner for each source.
Connector & change-capture:
Full initial sync, then CDC (file-change webhook, SFTP, API polling).
Keep source_id, source_path, and version metadata.
Preprocessing:
Convert to plain text / structured JSON.
OCR for images/PDF (Form Recognizer); extract tables (bank statements).
Language detection & normalization (punctuation, whitespace).
Canonicalization & metadata extraction:
Extract doc_type (policy, contract, clause), effective_date, jurisdiction, product, author, policy_id, doc_version, tags.
Map to taxonomy (e.g., Lending → CreditPolicy → Collateral).
Chunking:
Semantic chunking (200–800 tokens), with small overlaps (20–40%) for context.
Tag chunks with metadata: chunk_id, doc_id, source_type, clause_id, start_offset, end_offset, authority_score (manually or heuristically assigned).
Embedding:
Use an enterprise embedding model (private endpoint).
Generate embeddings for each chunk and persist them to the vector DB with metadata pointers (a minimal chunk → embed → upsert sketch follows this list).
Indexing:
Populate vector DB; update hybrid inverted-index for exact keyword lookups.
Update metadata DB with chunk -> doc mapping and provenance.
Validation & QA:
Samples are checked by an SME or auto-validated: key fields present, no PII leakage into embeddings sent to external services, and minimum chunk quality met.
Publish / Version:
When source doc changes, create new doc_version, re-chunk, re-embed; keep prior versions for audit.
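To make the chunking and embedding steps concrete, here is a minimal Python sketch of the chunk → embed → upsert flow. It is illustrative only: the word-window chunker stands in for a proper semantic chunker, the sentence-transformers model stands in for the enterprise's private embedding endpoint, and vector_db.upsert is a placeholder for the Milvus/Pinecone/pgvector-specific client call.

# Minimal chunk -> embed -> upsert sketch (illustrative; swap in your private
# embedding endpoint and your real vector-DB client).
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer  # stand-in for an enterprise embedder

@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    text: str
    start_offset: int   # word indices in this sketch; char offsets in production
    end_offset: int

def chunk_document(doc_id: str, text: str, size: int = 400, overlap: int = 100) -> list[Chunk]:
    """Word-window chunking with overlap; a production system would chunk semantically."""
    words = text.split()
    chunks, step, i = [], max(size - overlap, 1), 0
    while i < len(words):
        window = words[i:i + size]
        chunks.append(Chunk(f"{doc_id}_ch_{len(chunks):02d}", doc_id,
                            " ".join(window), i, i + len(window)))
        i += step
    return chunks

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # replace with the private model

def embed_and_upsert(chunks: list[Chunk], vector_db, authority_score: float) -> None:
    vectors = embedder.encode([c.text for c in chunks])
    for chunk, vec in zip(chunks, vectors):
        # vector_db.upsert is a placeholder for the product-specific upsert call.
        vector_db.upsert(
            id=chunk.chunk_id,
            vector=vec.tolist(),
            metadata={"doc_id": chunk.doc_id, "start_offset": chunk.start_offset,
                      "end_offset": chunk.end_offset, "authority_score": authority_score},
        )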
4 — Indexing & Storage: Vector DB + Metadata + KG
Vector DB (primary retrieval):
Store vector, chunk_id, doc_id, score, authority_score, timestamp
Support ANN search (HNSW / IVF) with filters (metadata) for hybrid retrieval
Metadata Store (control plane):
Schema: doc_id, doc_type, policy_id, version, effective_date, owner, jurisdiction, tags, retention_policy, authority_score, legal_signoff
Use Postgres / Cosmos DB
Knowledge Graph (optional, for relationships):
Entities: Person, Organization, Account, Clause, Regulation
Relationships: director_of, affiliated_with, cited_by, amends
Use a graph DB for AML/fraud multi-hop reasoning (see the query sketch after this section)
Blob Store / Evidence Store:
Store original docs, chunk text, extracted fields, prompt/response logs
Use immutable storage with WORM/append-only capability (Azure Blob with immutability policy)
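To make the multi-hop reasoning idea concrete, here is a small sketch using the official neo4j Python driver. The entity labels, relationship types, the risk_rating property, the connection URI, and the credentials are all illustrative and should be replaced by your own graph schema and secrets management.

# Illustrative multi-hop query: find organizations linked to a person within
# two hops via directorship/affiliation edges.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (p:Person {name: $name})-[:director_of|affiliated_with*1..2]->(o:Organization)
RETURN DISTINCT o.name AS org, o.risk_rating AS risk
"""

def related_organizations(name: str) -> list[dict]:
    # Returns e.g. [{"org": "Acme Traders", "risk": "HIGH"}, ...]
    with driver.session() as session:
        return [record.data() for record in session.run(CYPHER, name=name)]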
5 — Retrieval / RAG runtime & APIs
Key runtime services:
Retriever API: accepts (query, loan_context_filters), returns top-K chunks + metadata
Support vector similarity + boolean filters (jurisdiction, policy_id, date)
Reranker service: cross-encoder to re-score top candidates by authority and semantic fit
Prompt builder: constructs a structured prompt (system + facts + top chunks + question); the retrieve → rerank → prompt-build flow is sketched after the API list below
LLM gateway: private endpoint orchestration (small model draft / large model final)
Post-validator: confirm every factual claim references chunk(s); if not, mark INSUFFICIENT_EVIDENCE
Audit writer: write prompt, chunk_ids, reranker scores, response, model version to audit store
APIs:
POST /knowledge/retrieve → {chunks[]}
POST /copilot/query → triggers retrieval + LLM + returns {response, citations[]}
GET /knowledge/doc/{doc_id} → original doc metadata + versions
POST /knowledge/upload → admin upload (with review workflow)
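Tying the runtime services above together, here is a minimal retrieve → rerank → prompt-build sketch. vector_db.search is a placeholder for your vector store's query API, and the cross-encoder model name is an example public reranker, not a mandated choice.

# Sketch of the runtime path: retrieve with metadata filters, rerank, build the prompt.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stand-in reranker

def retrieve(query_vec, vector_db, jurisdiction: str, top_k: int = 20) -> list[dict]:
    # vector_db.search is a placeholder for the vector store's query call.
    return vector_db.search(vector=query_vec, top_k=top_k,
                            filters={"jurisdiction": jurisdiction})

def rerank(query: str, chunks: list[dict], keep: int = 5) -> list[dict]:
    # Cross-encoder score first, authority_score as the tiebreaker.
    scores = reranker.predict([(query, c["text"]) for c in chunks])
    ranked = sorted(zip(chunks, scores),
                    key=lambda pair: (pair[1], pair[0]["authority_score"]), reverse=True)
    return [c for c, _ in ranked[:keep]]

def build_prompt(system: str, facts: str, chunks: list[dict], question: str) -> str:
    evidence = "\n".join(f"[{c['chunk_id']}] {c['text']}" for c in chunks)
    return f"SYSTEM: {system}\nFACTS:\n{facts}\nEVIDENCE CHUNKS:\n{evidence}\nQUESTION: {question}"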
6 — Governance, QA & Audit Controls
Governance pillars:
Ownership & approvals: every doc must have an owner; only owners can approve versions.
Versioning: immutable versions; keep previous versions for audit.
Authority scoring: assign higher weight to legal/approved docs.
Validation rules: automated checks for chunk integrity, length, language, PII leakage.
Human review workflow: SMEs review newly ingested high-impact docs before publish.
Prompt & response logging: immutable logs for every Copilot invocation (prompt, chunks returned, reranker scores, LLM output); a minimal append-only audit-record sketch follows this list.
Explainability mapping: map policy clauses to decision rules and provide direct citation in Copilot output.
Retention & deletion policies: retention per doc_type and legal/regulatory requirements.
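A minimal sketch of the append-only, hash-chained audit record mentioned in the logging pillar is shown below. In production the records would land in WORM blob storage; the local JSONL file here is purely for illustration.

# Append-only audit record with a hash chain so tampering is detectable.
import hashlib, json, time

AUDIT_FILE = "copilot_audit.jsonl"   # stand-in for the immutable evidence store

def write_audit(prompt: str, chunk_ids: list[str], response: str,
                model_version: str, prev_hash: str = "") -> str:
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "chunk_ids": chunk_ids,
        "response": response,
        "model_version": model_version,
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        (prev_hash + json.dumps(record, sort_keys=True)).encode()
    ).hexdigest()
    with open(AUDIT_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["hash"]   # feed into the next record to extend the chain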
7 — Security, Privacy & Compliance (BFSI specifics)
Data residency & isolation: store data in-region; use private endpoints for LLMs; use VNet, Private Link.
PII protections:
Avoid embedding raw PII when sending to external models.
Use redaction/pseudonymization before embedding if using third-party embeddings (see the sketch at the end of this section).
For essential PII, use on-prem models or enterprise private endpoints.
Access control:
RBAC for ingestion, approval, deletion.
Audit trails for every admin action.
Encryption:
Data-at-rest encryption (SSE)
TLS for data-in-transit
Immutable audit store: WORM for prompts/responses.
Key management: Azure Key Vault / KMIP for keys.
Regulatory audit pack: ability to export case-level evidence: prompt + chunks + doc versions + decision trace.
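The redaction step can be as simple as the naive regex sketch below, though a real deployment would use a dedicated PII service (for example Presidio or a cloud PII detector) and keep the removed values in a secure vault for authorized re-identification. The patterns here are illustrative, not exhaustive.

# Naive regex-based redaction before embedding (illustrative only).
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d -]{8,}\d"),
    "PAN":   re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),   # Indian PAN format
}

def redact(text: str) -> tuple[str, dict[str, list[str]]]:
    """Replace detected PII with typed placeholders and return what was removed."""
    found: dict[str, list[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
            text = pattern.sub(f"<{label}>", text)
    return text, found

clean_text, removed = redact("Reach Ram at ram.kumar@example.com or +91 98765 43210")
# clean_text -> "Reach Ram at <EMAIL> or <PHONE>"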
8 — Operations, Monitoring & SLOs
Operational metrics:
Ingestion latency: time from source change → available in vector index
Index freshness: % docs within SLA (e.g., 99% < 5 minutes for urgent regulatory updates)
Retrieval latency (p95): target < 100–300 ms, depending on SLA
Copilot end-to-end latency (p95): target < 5s (draft) / < 20s (deep reasoning)
Query relevance metrics: human rated precision@k, MRR
Hallucination rate: % responses flagged as INSUFFICIENT_EVIDENCE or failing post-verifier
Usage & cost metrics: LLM calls, token usage, vector DB ops
Operational alerts: embedding failures, index build failures, storage capacity thresholds
SLOs:
Retrieval p95 < 200ms
Index freshness for critical regulatory docs < 60s (or immediate for manual push)
Copilot availability 99.9% for internal users
9 — Example Schemas & Prompt Template Hints
A. Document Metadata (metadata store example)
{
"doc_id": "POL-CR-2025-07",
"title": "Credit Policy - Personal Loans",
"version": "v7",
"effective_date": "2025-07-01",
"jurisdiction": "IN",
"owner": "head-risk",
"doc_type": "policy",
"authority_score": 0.98,
"legal_signoff": true,
"tags": ["credit", "personal_loan", "policy"],
"source_uri": "dms://policies/credit/personal/v7.pdf"
}
B. Chunk record (vector DB metadata)
{
"chunk_id": "POL-CR-2025-07_ch_14",
"doc_id": "POL-CR-2025-07",
"text": "...",
"start_offset": 2400,
"end_offset": 2800,
"language": "en",
"authority_score": 0.98,
"tags": ["PD_threshold"]
}
C. Prompt Template (Underwriter Copilot)
SYSTEM: You are an Underwriter Assistant for ACME Bank. Use only the evidence chunks provided. Do not hallucinate.
FACTS:
- Loan Ref: {loan_ref}
- KYC: {kyc_summary}
- Credit Score: {credit_score}, Model: {credit_model_version}
- Income Stability: {income_score}
EVIDENCE CHUNKS: [CHUNK_1, CHUNK_2, ...] (each with doc_id and uri)
QUESTION: Explain why this loan needs manual review and recommend next steps with policy references.
RESPONSE FORMAT: JSON {
"summary":"3-line",
"conflicts": [...],
"missing_documents":[...],
"recommendation": {"action":"", "policy_refs":[...]}
}
10 — Testing, QA & Validation
Unit tests for chunker & embedder (tokenization/overlap consistency)
Integration tests for retriever + reranker (recall tests on held-out queries)
Human-in-the-loop QA: SME judgments on top-K results for key queries
Canary rollout for new embeddings/embedder models and LLM prompt changes
Red-team to probe hallucination by adversarial queries
Synthetic tests: inject test docs and ensure retrieval returns correct chunk with top rank
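A synthetic test might look like the following pytest-style sketch; knowledge_hub, ingest_document, and retrieve are placeholders for the pipeline's real fixture and entry points.

# Synthetic retrieval test: inject a canary document and assert that a query
# phrased from its content comes back at rank 1.
def test_canary_document_is_top_ranked(knowledge_hub):
    canary_text = "Synthetic policy QA-CANARY-001: maximum personal loan tenure is 84 months."
    knowledge_hub.ingest_document(doc_id="QA-CANARY-001", text=canary_text,
                                  doc_type="test", authority_score=0.5)

    results = knowledge_hub.retrieve("What is the maximum personal loan tenure?", top_k=5)

    assert results, "retriever returned no chunks"
    assert results[0]["doc_id"] == "QA-CANARY-001", "canary doc not at rank 1"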
11 — Cost & Scaling Considerations
Vector DB storage & ANN compute: scales with chunk count — chunk size policy important
Embedding cost: batch compute embeddings at ingestion time; cache embeddings
LLM cost: use two-tier model strategy — small model draft + large model for high-criticality
Optimize by precomputing retrieval for common queries and caching results per loan_id
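A per-loan retrieval cache can be as small as the sketch below. Redis with a TTL and eviction policy is the more realistic choice; the in-process dict here only shows the shape of the cache key and the TTL check.

# Simple per-loan retrieval cache with TTL; avoids re-running retrieval for
# repeated Copilot queries on the same loan.
import time

_CACHE: dict[tuple[str, str], tuple[float, list[dict]]] = {}
TTL_SECONDS = 300

def cached_retrieve(loan_id: str, query: str, retriever) -> list[dict]:
    key = (loan_id, query)
    now = time.time()
    hit = _CACHE.get(key)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]
    chunks = retriever(query, loan_id=loan_id)   # placeholder retriever callable
    _CACHE[key] = (now, chunks)
    return chunks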
12 — Ownership, People & Process
Data Owner per source (legal, compliance)
Knowledge Engineers: chunking, taxonomy, prompt templates
SME Reviewers: sign-off documents & chunk authority
LLMOps team: model orchestration, prompt versioning
Platform SRE: ensure index availability & performance
AI Governance Board: approve policies for content usage & retention
13 — Interview-Ready One-Liners
“The Knowledge Hub is the enterprise-controlled library that RAG consults — it enables auditable, versioned, and authoritative GenAI outputs without retraining models.”
“We separate content (Knowledge Hub) from execution (LLM), allowing us to update policies instantly and keep regeneration of answers compliant and auditable.”
“All Copilot answers are evidence-first: each factual claim must reference a chunk_id → doc_id → version that is stored immutably.”
14 — Sample Walkthrough: Ram's Loan (loan-888888)
I’ll produce a concrete, interview-ready example using Ram’s loan (loan-888888). You’ll get:
A realistic /agreement-context/{loanId} JSON (what Agreement Service returns after agreement generation)
Example RAG retrieved chunks (policy & clause chunks returned from Knowledge Hub)
The structured prompt that the underwriter Copilot will send to the LLM (system + context + few-shot style)
A sample LLM response (JSON) that the Copilot would return — fully evidence-backed with chunk citations and safe phrasing
1) GET /agreement-context/loan-888888 (sample response)
{
"loan_ref": "loan-888888",
"borrower": {
"name": "Ram Kumar",
"masked_mobile": "+91-XXXX-XXXX-1234",
"customer_id": "CUST-44721"
},
"sanctioned_terms": {
"amount": 500000,
"tenure_months": 36,
"interest_rate_annual_percent": 12.75,
"interest_type": "floating",
"pricing_rule_id": "PR-PL-2025-09",
"processing_fee": 2500
},
"emi_schedule": {
"emi": 16789,
"first_emi_date": "2026-01-05",
"schedule_uri": "blob://agreements/loan-888888/emi_schedule.json"
},
"payment_terms": {
"late_fee_fixed": 500,
"late_fee_percent_monthly": 2.0,
"grace_period_days": 7
},
"prepayment_and_foreclosure": {
"foreclosure_allowed": true,
"foreclosure_fee": "1% of outstanding principal if prepaid within 12 months, else 0.5%",
"prepayment_process_uri": "blob://agreements/loan-888888/prepayment_policy.pdf"
},
"legal_clauses": [
{"clause_id": "LC-4.2", "title": "Interest Rate Reset", "text_uri": "blob://agreements/loan-888888/clauses/LC-4.2.txt"},
{"clause_id": "LC-7.1", "title": "Late Payment & Penalty", "text_uri": "blob://agreements/loan-888888/clauses/LC-7.1.txt"},
{"clause_id": "LC-9.3", "title": "Foreclosure & Prepayment", "text_uri": "blob://agreements/loan-888888/clauses/LC-9.3.txt"}
],
"product_info": {
"product_code": "PL-STD-24",
"product_name": "Personal Loan - Standard",
"disclosure_uri": "blob://agreements/loan-888888/disclosures.pdf"
},
"agreement_uri": "blob://agreements/loan-888888/agreement_v1.pdf",
"generated_at": "2025-11-28T10:06:00Z",
"agreement_version": "v1",
"evidence_uris": [
"blob://agreements/loan-888888/agreement_v1.pdf",
"blob://agreements/loan-888888/emi_schedule.json"
]
}
2) RAG retrieved chunks (sample top-K from Knowledge Hub)
These are the chunks the retriever returned for the Copilot query “Explain borrower loan agreement and foreclosure cost”.
[
{
"chunk_id": "POL-RBI-202402_ch_12",
"doc_id": "POL-RBI-202402",
"title": "RBI Fair Practice Code - Prepayment & Foreclosure",
"snippet": "Banks shall allow prepayment of loans: foreclosure charges shall not exceed 2% for prepayment within one year of sanction; thereafter, foreclosure fees shall be reduced as per product schedule.",
"uri": "dms://policies/rbi/fair_practice/2024/ch12.pdf",
"authority_score": 0.99
},
{
"chunk_id": "POL-PL-2025-09_ch_03",
"doc_id": "POL-PL-2025-09",
"title": "Internal Product Pricing Rules - Personal Loan Standard",
"snippet": "If borrower prepays within first 12 months, foreclosure fee = 1% of outstanding; between 12 and 24 months = 0.75%; after 24 months = 0.5%. Floating rate reset once every 6 months based on MCLR + spread.",
"uri": "dms://policies/internal/pricing/PL-STD-2025_v9.pdf",
"authority_score": 0.95
},
{
"chunk_id": "DISC-INFO-EMI_ch_02",
"doc_id": "DISC-INFO-EMI",
"title": "Customer Disclosure - EMI Calculation",
"snippet": "EMI calculated using standard reducing balance method. The EMI shown is indicative and subject to final rounding and taxes.",
"uri": "dms://disclosures/emi_calc.pdf",
"authority_score": 0.9
},
{
"chunk_id": "LEGAL-LC-7.1_ch_01",
"doc_id": "LEGAL-CLAUSES-v1",
"title": "Clause LC-7.1 Late Payment",
"snippet": "Late fee: fixed INR 500 plus interest penalty of 2% monthly on overdue amount following a 7 day grace period.",
"uri": "dms://legal/clauses/LC-7.1.txt",
"authority_score": 0.98
}
]
3) Prompt sent to LLM (structured, evidence-first)
System prompt (guardrails):
SYSTEM: You are the Bank's Borrower Agreement Copilot. Use ONLY the factual context and evidence chunks provided. Do NOT hallucinate. Every factual claim must include a citation to the chunk_id (e.g., [POL-RBI-202402_ch_12]). If evidence is insufficient, respond with "INSUFFICIENT_EVIDENCE" and request human review. Provide a short summary in simple language (2-3 lines), list key obligations and costs, explain foreclosure/prepayment cost examples, and provide 1–2 actionable next steps for the borrower. Output MUST be JSON as specified.
User/Instruction prompt (context + retrieved chunks):
INSTRUCTION:
LoanRef: loan-888888
Borrower: Ram Kumar
AgreementContext: (see attached agreement-context JSON)
EMI: INR 16,789 monthly; first EMI 2026-01-05. (agreement_context.emi_schedule)
Payment terms: late_fee_fixed INR 500; late_fee_percent_monthly 2%; grace_period 7 days.
Prepayment policy from agreement: "1% if prepaid within 12 months, else 0.5% after 12 months" (agreement_context.prepayment_and_foreclosure)
RETRIEVED EVIDENCE CHUNKS:
1) POL-RBI-202402_ch_12 - "Banks shall allow prepayment... foreclosure charges shall not exceed 2%..."
2) POL-PL-2025-09_ch_03 - "If borrower prepays within first 12 months, foreclosure fee = 1%... reset every 6 months..."
3) DISC-INFO-EMI_ch_02 - "EMI calculated using reducing balance..."
4) LEGAL-LC-7.1_ch_01 - "Late fee: INR 500 plus 2% monthly..."
QUESTION:
1) Provide a short, simple summary of this agreement for Ram.
2) Explain the foreclosure/prepayment cost in numeric examples for 1 year and at 18 months assuming no principal paid other than scheduled EMIs (use outstanding principal approx from schedule URI if needed; otherwise estimate to nearest rupee).
3) List the borrower's key obligations and penalties.
4) Provide 2 actionable next steps for Ram (clear, simple).
5) Include evidence citations for each factual claim in the answer.
OUTPUT FORMAT (JSON):
{
"summary": "<2-3 lines>",
"foreclosure_examples": [
{"scenario":"", "estimates": [{"month":12, "outstanding":<int>, "foreclosure_fee":<int>}, ... ], "citations":[]}
],
"key_obligations": [{"item":"", "citation":"chunk_id"}],
"next_steps":["",""],
"confidence": "<0.0-1.0>"
}
4) Sample LLM Response (Copilot output — JSON with citations)
This is the kind of auditable, evidence-backed output you should be able to present in interviews.
{
"loan_ref": "loan-888888",
"summary": "You have a ₹5,00,000 personal loan for 36 months at a floating rate (12.75% p.a.). Your EMI is ₹16,789 and the first EMI is due on 5 Jan 2026. [DISC-INFO-EMI_ch_02] [POL-PL-2025-09_ch_03]",
"foreclosure_examples": [
{
"scenario": "Foreclose after 12 months (approx.)",
"estimates": [
{
"month": 12,
"approx_outstanding_principal": 410000,
"foreclosure_fee_percent": 1.0,
"foreclosure_fee_amount": 4100
}
],
"explanation": "Estimated outstanding principal after 12 months approx ₹410,000 using reducing balance schedule; foreclosure fee = 1% as per product rule. [POL-PL-2025-09_ch_03] [DISC-INFO-EMI_ch_02]",
"citations": ["POL-PL-2025-09_ch_03", "DISC-INFO-EMI_ch_02"]
},
{
"scenario": "Foreclose at 18 months",
"estimates": [
{
"month": 18,
"approx_outstanding_principal": 310000,
"foreclosure_fee_percent": 0.75,
"foreclosure_fee_amount": 2325
}
],
"explanation": "Product rules indicate a graded foreclosure fee after 12 months (0.75% for 12–24 months). Fee estimated on outstanding principal. [POL-PL-2025-09_ch_03]",
"citations": ["POL-PL-2025-09_ch_03"]
}
],
"key_obligations": [
{"item": "Pay EMI of ₹16,789 on or before due date; 7-day grace period applies.", "citation": "DISC-INFO-EMI_ch_02"},
{"item": "If EMI is late, a fixed late fee of ₹500 plus an interest penalty of 2% per month applies.", "citation": "LEGAL-LC-7.1_ch_01"},
{"item": "Foreclosure is allowed; foreclosure fee applies depending on timing (1% within 12 months; 0.75% between 12–24 months; 0.5% after 24 months).", "citation": "POL-PL-2025-09_ch_03"},
{"item": "Interest is floating and rate resets every 6 months per product pricing rules.", "citation": "POL-PL-2025-09_ch_03"}
],
"next_steps": [
"If you consider prepaying within the first year, request a prepayment quote from the bank (include the exact outstanding principal). The bank will compute the exact outstanding and foreclosure fee. [POL-PL-2025-09_ch_03]",
"If you miss an EMI, pay within the 7-day grace period to avoid the fixed penalty; otherwise expect an additional 2% monthly on overdue. Contact support only if you face hardship to discuss restructuring. [LEGAL-LC-7.1_ch_01] "
],
"confidence": 0.86,
"evidence": [
{"chunk_id": "POL-PL-2025-09_ch_03", "uri": "dms://policies/internal/pricing/PL-STD-2025_v9.pdf"},
{"chunk_id": "DISC-INFO-EMI_ch_02", "uri": "dms://disclosures/emi_calc.pdf"},
{"chunk_id": "LEGAL-LC-7.1_ch_01", "uri": "dms://legal/clauses/LC-7.1.txt"},
{"chunk_id": "POL-RBI-202402_ch_12", "uri": "dms://policies/rbi/fair_practice/2024/ch12.pdf"}
],
"llm_metadata": {
"model": "enterprise-llm-v2",
"temperature": 0.0,
"prompt_id": "prompt-ug-2025-11-28-001",
"generated_at": "2025-11-28T10:08:15Z"
}
}
Notes about the numeric estimates: the Copilot used the EMI schedule pointer (emi_schedule.schedule_uri) only as a reference. If the precise outstanding principal is required, the Copilot returns a prepayment-quote step (next step 1), which triggers the backend Agreement Service to compute the exact outstanding amount and fee. This is the correct pattern: the Copilot provides bounded estimates plus an actionable next step to get the precise computation.
How you would narrate this flow in an interview:
“After agreement generation our /agreement-context/{loanId} returns the structured agreement with EMI schedule, foreclosure policy and clause pointers. The Underwriter/Borrower Copilot first retrieves the relevant policy and clause chunks from the Knowledge Hub (RAG). We build a structured prompt that includes the factual agreement context and the retrieved chunks, and call a private LLM with strict evidence-first guardrails. The LLM returns a JSON response that is fully citation-backed. For any precise monetary computations (outstanding principal), Copilot recommends requesting an exact prepayment quote from the Agreement Service — we never let the LLM invent precise ledger calculations. All prompts, retrieved chunks, and responses are written to the immutable evidence store for audit.”
Emphasize:
Evidence-first design (citations)
Bounded LLM behavior (estimates vs exact via service)
Auditability (prompt + chunks + response stored)
Safety (temperature = 0, post-validator)
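A post-validator for the citation rule can be sketched as follows; it assumes the JSON output format shown above and treats any citation that does not map to a retrieved chunk_id as grounds for downgrading the answer to INSUFFICIENT_EVIDENCE.

# Post-validator sketch: every citation in the Copilot's JSON response must
# point at a chunk that was actually retrieved; otherwise the answer is
# downgraded and routed for human review.
import json, re

def validate_citations(response_json: str, retrieved_chunk_ids: set[str]) -> dict:
    response = json.loads(response_json)
    # Citations appear both inline as "[DOC-ID_ch_NN]" and in explicit "citations" arrays.
    cited = set(re.findall(r"\[([A-Z0-9._-]+_ch_\d+)\]", json.dumps(response)))
    cited |= {c for block in response.get("foreclosure_examples", [])
              for c in block.get("citations", [])}
    unknown = cited - retrieved_chunk_ids
    if not cited or unknown:
        return {"status": "INSUFFICIENT_EVIDENCE", "unknown_citations": sorted(unknown)}
    return {"status": "OK", "citations_checked": sorted(cited)}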