Mutual Fund Architecture Q & A
- Anand Nerurkar
- May 19
- 15 min read
🔹 Q1: How did you design for 100K+ concurrent users and 5K+ TPS?
Structured Answer:
To support 100K+ concurrent users and 5K+ TPS, I applied a multi-layered scale-out design:
1. Frontend Scaling: Angular app is deployed via Azure CDN + Front Door for static delivery and geo-routing.
2. API Ingress: Azure Front Door handles SSL/WAF, forwards to App Gateway (private), which routes to AKS Istio Ingress Gateway (internal). Istio handles JWT validation and rate-limiting.
3. AKS Design: We deployed AKS with HPA and KEDA, auto-scaling based on Kafka lag and CPU/memory. Services are spread across ~20+ nodes per region.
4. Kafka: 5 brokers, 100+ partitions, tuned for parallelism. MM2 mirrors Kafka across regions for DR.
5. Redis: Front-line for read-heavy requests like NAV/Portfolio — with Cosmos as fallback.
6. Load Testing: Performed with JMeter and Locust — validated 5K TPS sustained, with burst handling up to 8K+ using async Kafka producers.
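For the async-producer burst handling mentioned in point 6, here is a minimal sketch assuming Spring Kafka 3.x (where KafkaTemplate.send() returns a CompletableFuture); the class name and topic handling are illustrative, not taken from the original design:

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

// Illustrative async publish: the request thread is never blocked, so Kafka
// can absorb short bursts above the sustained TPS target.
@Service
public class OrderEventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public OrderEventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(String orderId, String payloadJson) {
        // send() is asynchronous; the returned CompletableFuture lets us react
        // to failures without slowing the caller during a spike.
        kafkaTemplate.send("order.placed", orderId, payloadJson)
            .whenComplete((result, ex) -> {
                if (ex != null) {
                    // In the real system this would feed a DLQ / retry path.
                    System.err.println("Publish failed for order " + orderId + ": " + ex.getMessage());
                }
            });
    }
}
```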
🔹 Q2: How is NAV ingestion resilient and scalable?
Structured Answer:
NAV ingestion is designed as a modular pipeline using Kafka and idempotent services:
1. Feed Input: NAV files land in Azure Blob via SFTP from CAMS/KFin.
2. NAVIngestorService: Reads, parses → pushes to Kafka nav.raw-feeds.
3. NAVCalculatorService: Validates feeds → publishes to nav.validated.
4. NAVPublisherService: Publishes to nav.broadcast, updates Redis, Cosmos, and notifies UI.
5. Resilience: Each service has a circuit breaker (Resilience4j), retries on transient failures, and Kafka consumer offset handling with idempotency keys (see the consumer sketch after this list).
6. Audit: All feed events are logged to Cosmos DB for SEBI traceability.
This design ensures real-time publishing (<10s delay) and zero data loss on retry.
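A minimal sketch of the idempotent consumption from step 5, assuming Spring Kafka and a Redis-backed dedup store; the class name, group id, and key layout are illustrative, not from the original design:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

// Illustrative idempotent consumer: redelivered NAV records are detected via a
// Redis "seen" marker so downstream publishing happens only once per record.
@Service
public class NavValidatedConsumer {

    private final StringRedisTemplate redis;

    public NavValidatedConsumer(StringRedisTemplate redis) {
        this.redis = redis;
    }

    @KafkaListener(topics = "nav.validated", groupId = "nav-publisher")
    public void onMessage(String navRecordJson) {
        String idempotencyKey = extractKey(navRecordJson);

        // setIfAbsent is an atomic "first writer wins"; the TTL bounds memory use.
        Boolean firstTime = redis.opsForValue()
            .setIfAbsent("nav:processed:" + idempotencyKey, "1", Duration.ofDays(1));

        if (Boolean.TRUE.equals(firstTime)) {
            publishNav(navRecordJson);   // update Redis/Cosmos, notify UI
        }
        // Duplicates fall through silently; the offset is committed either way.
    }

    private String extractKey(String json) {
        // In the real pipeline this would be schemeCode + navDate parsed from the record.
        return Integer.toHexString(json.hashCode());
    }

    private void publishNav(String json) {
        // Downstream publish omitted in this sketch.
    }
}
```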
🔹 Q3: How do you handle failures from external APIs (UIDAI, PAN, LLM, etc.)?
Structured Answer:
External APIs are failure-prone, so I followed a layered resilience model:
1. Circuit Breaker: Resilience4j with fallback to a "temporarily unavailable" message (a wrapper sketch follows this list).
2. Timeouts: All APIs wrapped with strict 3s timeout + 3 retries with exponential backoff.
3. Async Design: For long-running flows (e.g., Aadhaar eSign), we offload to Kafka, wait for webhook/callback.
4. Fallbacks: For non-critical APIs like CRM replies, we return templated GenAI responses or cached data.
5. Retry Queue: Kafka DLQ captures hard failures → retry consumer with backoff logic handles delayed processing.
6. Monitoring: Prometheus tracks latency, error rate; alerts pushed via AlertManager.
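A hedged sketch of the circuit breaker + retry + timeout wrapper using Resilience4j's Spring annotations; the UIDAI client class, endpoint URL, and fallback text are illustrative (the actual limits live in configuration):

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.concurrent.CompletableFuture;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

// Illustrative layered resilience: time limiter (strict timeout), retry with
// backoff, and a circuit breaker with a graceful fallback message.
@Service
public class AadhaarVerificationClient {

    private final RestTemplate restTemplate = new RestTemplate();

    @CircuitBreaker(name = "uidai", fallbackMethod = "fallback")
    @Retry(name = "uidai")          // backoff/attempt count configured externally
    @TimeLimiter(name = "uidai")    // e.g. a 3s timeout configured externally
    public CompletableFuture<String> verify(String verificationRef) {
        // Hypothetical internal gateway URL, shown only for illustration.
        return CompletableFuture.supplyAsync(() ->
            restTemplate.getForObject("https://uidai-gateway.internal/verify/{ref}",
                                      String.class, verificationRef));
    }

    // Fallback keeps the UX graceful instead of surfacing a stack trace.
    private CompletableFuture<String> fallback(String verificationRef, Throwable ex) {
        return CompletableFuture.completedFuture(
            "KYC verification is temporarily unavailable, please retry later");
    }
}
```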
🔹 Q4: Where and how is GenAI integrated?
Structured Answer:
GenAI is integrated as domain-specific microservices, enabling contextual intelligence:
1. GenAI Microservices: GenAIAdvisorService – investment goal-based Q&A; AnomalyExplainerService – NAV/Txn deviation justification; DocSummarizerService – OCR + summarization of fund PDFs; SupportReplyGenerator – CRM ticket responses.
2. RAG Architecture: Context data pulled from Redis/Cosmos; prompt orchestrated in RAGOrchestratorService; LLM API call (OpenAI) with masking and a compliance filter (a sketch of this flow follows the list).
3. Security: PII masking, prompt logs, fallback to static responses if LLM fails.
4. UX Integration: GenAI responses shown in chat UI, investor dashboard, and CRM backend (Zendesk plugin).
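A hedged sketch of the RAG flow from item 2: pull role-scoped context, mask PII, then call the LLM. RAGOrchestratorService is named in the article; the collaborator interfaces shown here are assumptions added for illustration:

```java
import org.springframework.stereotype.Service;

// Illustrative RAG orchestration: context comes only from our own stores and
// PII is masked before any prompt leaves the platform.
@Service
public class RAGOrchestratorService {

    private final ContextStore contextStore;   // wraps Redis/Cosmos lookups (assumed)
    private final PiiMasker piiMasker;         // replaces PAN/Aadhaar/name with tokens (assumed)
    private final LlmClient llmClient;         // wraps the Azure OpenAI call (assumed)

    public RAGOrchestratorService(ContextStore contextStore, PiiMasker piiMasker, LlmClient llmClient) {
        this.contextStore = contextStore;
        this.piiMasker = piiMasker;
        this.llmClient = llmClient;
    }

    public String answer(String investorId, String question) {
        // 1. Retrieve only role-scoped context for this investor.
        String context = contextStore.portfolioAndNavContext(investorId);

        // 2. Mask PII before prompt construction.
        String safeContext = piiMasker.mask(context);
        String safeQuestion = piiMasker.mask(question);

        // 3. Grounded prompt: the model may only use the supplied context.
        String prompt = "Answer strictly from the context below.\n"
                + "Context:\n" + safeContext + "\nQuestion: " + safeQuestion;

        // 4. The compliance filter and fallback sit downstream of this call.
        return llmClient.complete(prompt);
    }

    // Assumed collaborator contracts, shown so the sketch is self-contained.
    public interface ContextStore { String portfolioAndNavContext(String investorId); }
    public interface PiiMasker { String mask(String text); }
    public interface LlmClient { String complete(String prompt); }
}
```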
🔹 Q5: How do you ensure compliance and audit-readiness (SEBI, GDPR, AML)?
Structured Answer:
Compliance is embedded into the data, access, and reporting layers:
1. Audit Trail: Cosmos DB with 7-year TTL stores all NAV, Txn, KYC, and portfolio events.
2. Access Control: Azure AD + Istio JWT-based RBAC (scope-based) ensures role-isolated access.
3. Encryption: PII encrypted at rest via Azure Key Vault; tokenization for Aadhaar/PAN.
4. SEBI/FATCA: ReportService generates exports from Cosmos → CSV → SEBI SFTP.
5. Monitoring: ELK stack and Azure Monitor track logins, KYC updates, NAV changes.
🔹 Q6: What’s your active-active DR setup?
Structured Answer:
DR and multi-region HA are enabled via Azure + Kafka + Cosmos native support:
1. Azure Front Door: Routes to region A or B based on probe health.
2. App Gateway + AKS: Both regions have identical infra via Terraform + Helm.
3. Kafka MM2: Mirrors all critical topics between clusters.
4. Cosmos DB: Multi-region write, with auto-failover.
5. Redis: Redis Enterprise with geo-replication, fallback logic to Cosmos.
6. Failover Test: DR drill switches region within 2 mins, no manual steps.
🔹 Q7: What’s your CI/CD and observability setup?
Structured Answer:
I’ve built a robust DevSecOps pipeline:
1. CI/CD: Azure DevOps + Helm + Istio for progressive canary rollouts; CVE scanning on container builds using Trivy; SealedSecrets for secure config delivery.
2. Observability: Prometheus + Grafana for latency, Kafka lag, and HPA metrics; ELK for service logs + Cosmos audit logs; Azure Monitor alerts tied to Slack and email.
🔹 Q8: Walk me through Investor & Distributor flows.
Structured Answer:
👤 Investor Flow:
Login via Azure AD B2C → JWT
Browse NAV → Redis fetch via ProductCatalog
Place Order → OrderService → Kafka order.placed
PaymentService → payment.success
TransactionEngine → Kafka transaction.completed
PortfolioService updates holdings, Cosmos logs
NotificationService sends confirmation
👥 Distributor Flow:
Register Lead → CRM API
Onboard Investor → KYCService + Aadhaar APIs
CredentialService creates account
CommissionEngine triggered on Txn → updates dashboard
How did you come up with 5 Kafka brokers and 100 partitions?
Excellent — this is exactly the kind of architecture validation question a CTO will ask to check if you’ve made data-driven, domain-informed decisions.
Here’s a structured, realistic way to explain why 5 Kafka brokers and 100 partitions make sense for your mutual fund platform:
✅ Structured Answer: Kafka Broker & Partition Sizing Rationale
🎯 Goal:
Support 6000+ TPS, high concurrency, and multi-stream parallel processing for NAV feeds, transactions, notifications, etc.
🔹 1. Broker Count = 5 (Minimum for BFSI-grade HA)
| Reason | Detail |
| --- | --- |
| High availability | With 5 brokers and RF=3, the cluster can lose up to 2 brokers without data loss, and a single broker failure has no availability impact. |
| Replication factor | Most topics use RF=3 (recommended for BFSI durability); 5 brokers allow balanced partition leader/follower distribution. |
| Topic isolation | Certain topics like nav.broadcast, transaction.completed, and compliance.export are kept on dedicated brokers to reduce contention. |
| Multi-tenant readiness | With the active-active setup, brokers are mirrored across regions via MM2, keeping cross-regional processing isolated. |
🔹 2. Partition Count = 100 (Performance and Parallelism)
| Reason | Detail |
| --- | --- |
| Parallelism | Each partition is consumed by at most one consumer in a group, so 100 partitions let us scale consumer groups horizontally, e.g., 10–20 consumers per service for high-TPS flows (NAV processing, transactions). |
| Throughput | Kafka handles ~1K–2K msgs/sec per partition (depending on payload). 100 partitions give a 100K–200K msgs/sec theoretical ceiling, far above our 6K TPS requirement. |
| Topic-wise sizing | Topics like order.placed, transaction.completed, and nav.validated each use 10–20 partitions based on volume profiling. |
| Resilience | More partitions reduce backlog buildup during spikes — important for NAV ingestion, especially when CAMS/KFin send bulk updates. |
🧮 Estimation Backed by Usage:
NAV publishing = ~500 msgs/sec across all schemes → 10 partitions
Transactions = ~4000 TPS = 40–50 partitions across order, payment, txn topics
Notifications, audit logs, SEBI events, CRM tickets → 20+ partitions
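The sizing above follows from simple back-of-the-envelope arithmetic; here is a small sketch using the figures quoted in this section (the spike headroom factor and per-service consumer counts are assumptions):

```java
// Back-of-the-envelope partition sizing using the figures quoted in this section.
public class PartitionSizing {

    static int partitionsFor(double msgsPerSec, double perPartitionMsgsPerSec,
                             double spikeHeadroom, int desiredConsumerParallelism) {
        int throughputBased = (int) Math.ceil(spikeHeadroom * msgsPerSec / perPartitionMsgsPerSec);
        // Partitions also cap consumer-group parallelism, so take the larger of the two.
        return Math.max(throughputBased, desiredConsumerParallelism);
    }

    public static void main(String[] args) {
        double perPartition = 1_000;   // conservative end of the 1K–2K msgs/sec range

        int navPartitions = partitionsFor(500, perPartition, 2.0, 10);        // ≈ 10
        int perTxnTopic = partitionsFor(4_000 / 3.0, perPartition, 2.0, 15);  // order/payment/txn
        int txnPartitions = 3 * perTxnTopic;                                  // ≈ 45

        System.out.println("NAV partitions ~ " + navPartitions);
        System.out.println("Transaction-related partitions ~ " + txnPartitions);
    }
}
```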
🛡 Bonus: Monitoring & Autoscaling
Kafka lag monitored via Prometheus + Grafana
KEDA (Kubernetes Event-Driven Autoscaler) auto-scales consumers when lag > threshold
This ensures predictive scaling under stress without overloading brokers
📌 Summary Response
“We chose 5 Kafka brokers for high availability with RF=3 and logical topic isolation. We provisioned 100 partitions to support parallelism and sustained high-throughput workloads across NAV ingestion, transactions, and compliance flows. This also gives us headroom for tenant growth and active-active DR support.”
🔧 Challenge Scenarios + Your Answers
🔹 Q1. "What if Kafka goes down in Production? How will your system handle it?"
Structured Answer:
If Kafka goes down, here's how we ensure continued stability:
1. Kafka MM2 Mirror (Multi-Region): Kafka is deployed with 5 brokers, RF=3, and MM2 mirroring between primary and DR regions. Producers automatically switch based on partition unavailability detection.
2. Circuit Breaker + DLQ: All Kafka producers (via Spring Kafka) use Resilience4j with circuit breakers. On repeated failures, messages are pushed to a Kafka DLQ or a fallback Redis queue (see the sketch below).
3. Graceful Degradation: For non-critical features (e.g., notifications), if Kafka is down, we log and defer the event for batch replay.
4. Monitoring & Alerting: Prometheus tracks kafka_lag, isr_shrink, and broker health. Alerts trigger automated runbooks and fallback logic.
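A hedged sketch of the fallback Redis queue from point 2, assuming Spring Kafka 3.x and Spring Data Redis; the queue key and class name are illustrative:

```java
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

// Illustrative fallback path: if Kafka is unreachable, park the event in a
// Redis list so a replay job can re-publish it once brokers recover.
@Service
public class ResilientEventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final StringRedisTemplate redis;

    public ResilientEventPublisher(KafkaTemplate<String, String> kafkaTemplate,
                                   StringRedisTemplate redis) {
        this.kafkaTemplate = kafkaTemplate;
        this.redis = redis;
    }

    public void publish(String topic, String key, String payload) {
        try {
            // Bounded wait keeps request threads from hanging while brokers are down.
            kafkaTemplate.send(topic, key, payload).get(3, TimeUnit.SECONDS);
        } catch (Exception ex) {
            // Deferred queue; a scheduled replay job drains this list later.
            redis.opsForList().rightPush("kafka:deferred:" + topic, key + "|" + payload);
        }
    }
}
```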
🔹 Q2. "How would you horizontally scale GenAI microservices without overloading the LLM API or breaching cost limits?"
Structured Answer:
Great question — GenAI requires careful orchestration:
1. RAG Gateway Throttling: Our RAGOrchestratorService enforces rate limits, deduplication, and token budgeting per tenant.
2. Autoscaling GenAI Pods: All GenAI services are stateless → horizontally scaled with HPA. KEDA + Kafka lag metrics trigger scale-up for bursts.
3. Caching LLM Responses: Repeated questions (e.g., NAV explainers) are cached in Redis with TTL to avoid redundant LLM calls (see the cache-aside sketch below).
4. Fallback Logic: If the LLM is overloaded or unavailable, we return a templated message or a default advisory until a retry succeeds.
5. Cost Control: Each GenAI request logs tokens and cost metrics. We enforce usage caps per role/tenant using policy filters.
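A minimal cache-aside sketch for point 3; the Redis key layout, TTL, and the AnswerSource interface standing in for the RAG orchestrator are assumptions:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

// Illustrative cache-aside wrapper: identical questions hit Redis instead of the LLM.
@Service
public class CachedLlmAnswerService {

    private final StringRedisTemplate redis;
    private final AnswerSource answerSource;   // e.g. the RAG orchestrator (assumed contract)

    public CachedLlmAnswerService(StringRedisTemplate redis, AnswerSource answerSource) {
        this.redis = redis;
        this.answerSource = answerSource;
    }

    public String answer(String investorId, String question) {
        // Normalised question forms the cache key; for investor-specific answers
        // the investorId must be part of the key to avoid leaking context.
        String cacheKey = "genai:answer:" + Integer.toHexString(question.trim().toLowerCase().hashCode());

        String cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached;                       // no LLM tokens spent
        }

        String fresh = answerSource.answer(investorId, question);
        redis.opsForValue().set(cacheKey, fresh, Duration.ofHours(6));   // TTL bounds staleness
        return fresh;
    }

    public interface AnswerSource { String answer(String investorId, String question); }
}
```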
🔹 Q3. "What if Cosmos DB hits RU throttling under NAV or Txn spikes?"
Structured Answer:
Cosmos DB is powerful but cost-sensitive. Here's how we protect against RU throttling:
1. Partitioning & RU Distribution: We partition Cosmos collections by investor_id or txn_date to ensure even RU usage.
2. Redis Caching Layer: All read-heavy access (NAV, portfolio) is served from Redis, with Cosmos as fallback only.
3. Bulk Ingestion with Retry: NAV ingestion and txn logs are written via the bulk API with retry. We use adaptive backoff on 429 errors (see the sketch below).
4. Auto-scale Enabled: Cosmos is set to auto-scale RU/s from 10K to 50K.
5. Monitoring: We track TotalRequestUnitsConsumed and RateLimitedRequests, and get alerted on RU peaks.
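A minimal sketch of adaptive backoff on 429s from point 3. To avoid tying it to a specific SDK version, the throttling exception here is a stand-in type; the real Cosmos SDKs surface the 429 status code and a retry-after hint in a similar way:

```java
import java.util.concurrent.Callable;

// Illustrative adaptive backoff for rate-limited writes (Cosmos signals throttling
// with HTTP 429). RateLimitedException stands in for the SDK's throttling error.
public final class ThrottleAwareWriter {

    public static class RateLimitedException extends RuntimeException {
        private final long retryAfterMillis;
        public RateLimitedException(long retryAfterMillis) { this.retryAfterMillis = retryAfterMillis; }
        public long retryAfterMillis() { return retryAfterMillis; }
    }

    public static <T> T writeWithBackoff(Callable<T> write, int maxAttempts) throws Exception {
        long backoffMillis = 200;                               // initial backoff
        for (int attempt = 1; ; attempt++) {
            try {
                return write.call();                            // e.g. a bulk upsert into Cosmos
            } catch (RateLimitedException e) {
                if (attempt >= maxAttempts) throw e;
                // Honour the server's retry-after hint when it exceeds our own backoff.
                Thread.sleep(Math.max(backoffMillis, e.retryAfterMillis()));
                backoffMillis = Math.min(backoffMillis * 2, 5_000);   // exponential, capped at 5s
            }
        }
    }
}
```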
🔹 Q4. "What if one of the microservices is slow or memory-leaking? How do you detect and isolate it?"
Structured Answer:
We follow a combination of Istio service mesh + observability stack to catch and isolate misbehaving services:
1. Istio Circuit Breakers & mTLS: Istio injects retries, timeouts, and connection limits. mTLS isolates traffic between services and allows egress visibility.
2. Prometheus + Grafana Dashboards: Track latency, 5xx errors, and memory usage per service.
3. ELK Logs: Each service emits logs with a trace ID. We correlate slow traces using Kibana queries or Azure Monitor Workbooks.
4. Auto-remediation: For HPA-enabled services, pods auto-scale on CPU/mem threshold breach. For long spikes, Kubernetes taints + pod eviction isolate affected pods.
🔹 Q5. "How would you handle a scenario where SEBI reporting fails at EOD due to corrupt or missing data?"
Structured Answer:
SEBI compliance reporting is critical, and we use several controls:
1. Immutable Audit Source: Cosmos DB acts as the single source of truth for all transactions.
2. Data Validation Pipeline: A pre-export validator checks for schema integrity and missing NAV/txn values.
3. Retry Flow: If the SEBI report fails, the system retries up to 3 times. Failing that, it raises a Level-1 alert to the compliance team.
4. Backup Export Method: The last successful snapshot is used to regenerate a minimal valid report.
5. Dashboard Flagging: The admin dashboard clearly shows status: Validated, Rejected, Needs Retry.
🔹 Bonus: CTO May Ask
"What’s your roadmap for this platform in the next 12–18 months?"
Answer:
Tenant onboarding automation for AMC partners
GenAI expansion into investment risk profiling and multilingual voice bots
Zero-trust network posture using Azure Private Link and API Gateway
Data lake integration with Power BI for predictive analytics
ML-driven anomaly detection for fund manipulation and investor behavior
What is RU throttling?
Great question — especially important when using Azure Cosmos DB, which is Request Unit (RU)-based and commonly misunderstood.
🧮 What is RU Throttling in Cosmos DB?
✅ Request Unit (RU):
An RU is the currency of throughput in Cosmos DB.
Every operation (read, write, query) consumes RUs based on complexity.
Example:
Read a 1 KB document → ~1 RU
Write a 1 KB document → ~5 RUs
Query with filters or sort → 10–50+ RUs depending on index use
❌ RU Throttling Happens When:
Your operation exceeds the RU/s provisioned for a container.
Cosmos immediately returns HTTP status 429 Too Many Requests.
SDK/clients back off and retry (usually with exponential delay).
This causes latency spikes, degraded UX, and downstream backlog.
📉 Example Scenario of RU Throttling:
Let’s say:
You provisioned 10,000 RU/s
A bulk transaction load hits the DB consuming 12,000 RU/s
➡️ Throttling kicks in → Cosmos returns 429
➡️ Your service backs off and retries later → slower end-user experience
🛡️ How to Handle RU Throttling
| Strategy | Description |
| --- | --- |
| Enable auto-scale | Let Cosmos scale RU/s (e.g., from 10K to 50K) as needed |
| Partition keys | Distribute load evenly to avoid hot partitions |
| Redis caching | Offload read-heavy calls to Redis |
| Retry policy | Use SDK-level exponential backoff for transient 429s |
| Batch writes | Use bulk APIs to reduce per-document overhead |
| Monitoring | Watch the metrics RateLimitedRequests and TotalRequestUnitsConsumed |
📊 Monitoring RU Throttling
Use Azure Monitor + Alerts on:
RateLimitedRequests
NormalizedRUConsumption
Data Usage per Partition
Latency per operation
🔑 Real-World Tip:
For BFSI platforms like mutual funds:
Cosmos is great for immutable audit logs and low-latency reads
But for high write throughput, combine with Kafka + Redis + batch writers
🔧 Challenge Round 2: Scenarios + Ideal Responses
🔹 Q1: NAV spike triggers 5x NAV uploads in 2 minutes. What do you do?
Your Answer:
“NAVIngestorService uses Kafka, so even during spikes we scale horizontally. Here’s how we handle it:
Kafka’s nav.raw-feeds topic handles bursts via 10+ partitions
NAVCalculatorService autoscales via KEDA (based on Kafka lag)
Idempotency logic ensures duplicates are ignored
Redis writes are asynchronous and optimized for speed
Cosmos DB backpressure is mitigated via bulk writes with retry
Prometheus tracks Kafka lag, consumer CPU, and Redis latency → alerts Ops before user impact.”
🔹 Q2: Azure Region A is down. How does your system failover to Region B?
Your Answer:
“We designed active-active HA:
Azure Front Door detects probe failure from Region A → reroutes to Region B
App Gateway + AKS infra is cloned using Terraform
Kafka MM2 mirrors topics between clusters → consumers continue from the same offset
Cosmos DB is globally distributed with multi-region write enabled
Redis Enterprise supports geo-replication with automated failover
This ensures zero data loss and sub-minute RTO.”
🔹 Q3: A fund house onboards 1 lakh (100,000) new users this week. What breaks first, and how do you prepare?
Your Answer:
“This affects:
AKS CPU/memory — we scale nodes via HPA
Kafka consumers — we pre-scale partitions + consumer group size
Cosmos RU/s — monitor and auto-scale if write/read pressure increases
JWT token issuance — monitored via Azure AD B2C metrics
GenAI — capped via usage guardrails and Redis-based deduplication
We also run load tests using Locust to simulate 2x expected traffic.”
🔹 Q4: Azure Cosmos cost is shooting up. What do you optimize first?
Your Answer:
“Cosmos cost = RU consumption × time. Here's my step-down strategy:
Cache all read-heavy requests (NAV, Portfolio) in Redis
Change query patterns to reduce fan-outs (e.g., filter + sort inside the app)
Use bulk writers instead of individual inserts
Partition carefully to prevent hot partitions
Enable auto-scale with a hard max, and review daily peak usage
We also profile high-RU queries using the request charge and diagnostics data returned in SDK responses.”
🔹 Q5: Investor is shown wrong portfolio due to eventual consistency. How do you fix it?
Your Answer:
“We balance performance and consistency using:
Cosmos DB consistency level set to Session (vs. eventual)
Portfolio writes go to Redis first, then sync to Cosmos
PortfolioService includes retry-and-verify logic (see the sketch below)
Investor UI validates with a backend hash or timestamp
Alerts trigger if Redis and Cosmos snapshots deviate beyond a threshold
For critical data like portfolio and holdings, we always prioritize consistency over latency.”
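A hedged sketch of the retry-and-verify read mentioned above: serve from Redis, but repair the cache when its snapshot predates the last committed write in the system of record. All interfaces here are illustrative assumptions:

```java
import java.time.Instant;
import org.springframework.stereotype.Service;

// Illustrative retry-and-verify read: serve from cache, but repair it when the
// cached snapshot is older than the last committed write in the system of record.
@Service
public class PortfolioReadService {

    private final PortfolioCache cache;   // Redis-backed snapshot (assumed contract)
    private final PortfolioStore store;   // Cosmos system of record (assumed contract)

    public PortfolioReadService(PortfolioCache cache, PortfolioStore store) {
        this.cache = cache;
        this.store = store;
    }

    public Portfolio getPortfolio(String investorId) {
        Portfolio cached = cache.get(investorId);
        Instant lastCommit = store.lastWriteTimestamp(investorId);

        if (cached == null || cached.asOf().isBefore(lastCommit)) {
            Portfolio fresh = store.load(investorId);   // re-read and repair the cache
            cache.put(investorId, fresh);
            return fresh;
        }
        return cached;
    }

    public interface PortfolioCache { Portfolio get(String id); void put(String id, Portfolio p); }
    public interface PortfolioStore { Instant lastWriteTimestamp(String id); Portfolio load(String id); }
    public record Portfolio(String investorId, Instant asOf, String holdingsJson) {}
}
```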
🔹 Q6: SEBI audit asks for trace of 1 investor’s KYC + 5 years of transactions. How do you respond?
Your Answer:
“Our Cosmos DB acts as an immutable ledger with:
Data partitioned by investor_id and txn_date
All events (KYC, NAV, txn) written with metadata (trace_id, timestamp)
Cosmos TTL set to 7 years; we use the change feed + time filters to extract
Audit export logic outputs a compliant CSV with hash validation
Access logged and auditable via ELK + Azure Monitor
We can export an investor trace within minutes, with full data lineage.”
🔹 Q7: LLM API quota hits limit. How do you ensure user experience is not broken?
Your Answer:
“LLM access is wrapped via a GenAI Gateway Service with:
Request scoring based on urgency/context
Token limiters and quotas per user/role
If a quota is exceeded → fallback to a cached answer, templated reply, or chatbot redirect
API metrics logged (tokens, cost, response time) per tenant
Admin notified proactively via Azure alert webhook”
🔹 Q8: How do you simulate disaster recovery for this platform before going live?
Your Answer:
“We conduct DR drills quarterly:
Disable the Region A Front Door probe → force reroute to Region B
Validate Cosmos DB failover and Kafka consumer failover (MM2)
Scale tests on AKS in Region B using synthetic load (Locust)
Confirm logs are still flowing to ELK/Monitor
Reconcile Cosmos + Redis data snapshots post failover
A post-mortem documents the delta, performance impact, and recovery timeline.”
🔐 🤖 📈 Challenge Round 3:
GenAI • Security • Strategic Platform Thinking
🔹 Q1: GenAI hallucination led to wrong investment advice. What controls do you apply?
Your Answer:
“Great question. We enforce a 4-tier safety framework:
RAG-first policy: All GenAI responses use Retrieval-Augmented Generation. We pass only contextual data (from Redis/Cosmos) — no open-ended prompting.
LLM Validator Layer: Responses are passed through a PolicyFilterService that checks for regulatory triggers, unsupported claims, and hallucinated recommendations.
Explainability & Confidence: We return a confidence score + source document snapshot to end users for transparency.
Fallback & Escalation: If hallucination risk is high, the system returns a templated message: ‘This query needs a human advisor. We’ve created a support ticket.’”
🔹 Q2: GenAI data leak risk — how do you protect PII or investor context from being exposed?
Your Answer:
“We treat LLM interaction as an external boundary:
PII masking is mandatory before prompt construction (<name>, <PAN>, <holding> → anonymized tokens; see the masking sketch below)
Use Azure OpenAI private endpoint with no data logging
All prompts + responses are logged (with hashed user ID) in the audit trail (Cosmos/ELK)
Only role-scoped context is passed to the LLM (e.g., an investor sees their own data; an admin sees the compliance view)
The prompt engineering team reviews prompt design for injection prevention & leakage”
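A minimal sketch of the PII masking step before prompt construction, using the standard PAN format (5 letters, 4 digits, 1 letter) and a 12-digit Aadhaar-like pattern; masking names would additionally need an NER step and is not shown here:

```java
import java.util.regex.Pattern;

// Illustrative PII masking before prompt construction: PAN (5 letters, 4 digits,
// 1 letter), Aadhaar-like 12-digit numbers, and emails become anonymized tokens.
public final class PromptPiiMasker {

    private static final Pattern PAN = Pattern.compile("\\b[A-Z]{5}[0-9]{4}[A-Z]\\b");
    private static final Pattern AADHAAR = Pattern.compile("\\b\\d{4}\\s?\\d{4}\\s?\\d{4}\\b");
    private static final Pattern EMAIL = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static String mask(String text) {
        String masked = PAN.matcher(text).replaceAll("<PAN>");
        masked = AADHAAR.matcher(masked).replaceAll("<AADHAAR>");
        masked = EMAIL.matcher(masked).replaceAll("<EMAIL>");
        return masked;
    }

    public static void main(String[] args) {
        System.out.println(mask("Investor ABCDE1234F (a.b@example.com), ref 1234 5678 9012"));
        // -> Investor <PAN> (<EMAIL>), ref <AADHAAR>
    }
}
```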
🔹 Q3: What’s your GenAI cost-control strategy as usage scales?
Your Answer:
“We control GenAI cost with a metered execution model:
Each tenant/role has a token and API usage quota
Prompts are classified into tiers (e.g., Q&A = 1x, OCR = 3x cost)
Frequently asked queries → Redis cache with TTL
For burst loads → switch to batch async mode (queue → respond later)
All GenAI logs track user ID, token count, cost estimate, and latency → fed to a BI dashboard”
🔹 Q4: How do you secure APIs for role-based access across personas (Investor, Admin, Compliance)?
Your Answer:
“We use layered identity and API access control:
Azure AD B2C (external) and Azure AD (internal) for auth
JWT tokens include role, scope, tenant, and userId claims
Istio validates the JWT before it hits microservices
Spring Security uses claims for RBAC (e.g., /portfolio/{userId} restricted to self unless admin — see the sketch below)
For compliance APIs, we enforce time-limited access and audit logging
Redis holds a token/session-level cache with auto-logout after 15 mins idle”
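A minimal sketch of the claim-based RBAC rule on /portfolio/{userId} using Spring Security method security (requires @EnableMethodSecurity in configuration); mapping the userId claim onto authentication.name is an assumption about the JWT converter setup:

```java
import org.springframework.http.ResponseEntity;
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Illustrative claim-based RBAC: an investor may read only their own portfolio,
// while the ADMIN role can read any. JWT validation itself happens at Istio.
@RestController
public class PortfolioController {

    @PreAuthorize("hasRole('ADMIN') or #userId == authentication.name")
    @GetMapping("/portfolio/{userId}")
    public ResponseEntity<String> portfolio(@PathVariable String userId) {
        // authentication.name carrying the userId claim is assumed to be set up
        // via a custom JWT authentication converter (configuration not shown).
        return ResponseEntity.ok("portfolio for " + userId);
    }
}
```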
🔹 Q5: How do you ensure platform is future-proofed for onboarding more AMCs or new regulations?
Your Answer:
“We designed the platform to be multi-tenant and regulation-resilient:
Tenant-scoped APIs and RBAC using Istio VirtualServices and token claims
A central onboarding engine provisions client config, brand theme, and fund mapping dynamically
Regulation triggers (SEBI update, FATCA change) are decoupled from the core using ComplianceRulesService (config-driven) and SEBIExporterService (versioned spec)
Each tenant’s data is logically isolated in Cosmos and Redis
The event-based architecture (Kafka) allows plugging in new flows with minimal impact”
🔹 Q6: Let’s say SEBI mandates AI model auditability. How do you comply?
Your Answer:
“We already have a basic AI audit trail:
Prompt & response pairs stored in Cosmos with timestamp and traceId
LLM execution metadata: token count, latency, API used
AIExplainabilityService provides the source content used (from the RAG vector DB) and, where supported, the model’s reasoning explanation
Admins can export these logs monthly as encrypted CSVs
We also support redaction logs for user data deletion under GDPR”
🔹 Q7: What’s your vision for taking this mutual fund platform from MVP to full product scale?
Your Answer:
“Over the next 18 months, here’s the strategic roadmap:
Partner API platform → expose fund selection, onboarding, and reporting via API to 3rd-party fintechs
GenAI 2.0 → voice bot + investment simulator using investor personality modeling
Unified compliance engine → multi-regulator, config-driven exporter
Data Mesh → Kafka-to-Synapse pipelines → BI insights into investor behavior and fraud trends
AI model marketplace → allow each AMC to plug in their own explainers or robo-advisors
The vision is to become the backend-as-a-service for MF distribution and compliance in India.”
🚀 Challenge Round 4:
Kubernetes • CI/CD • DR • Production Readiness
🔹 Q1: You’re running 40+ microservices on AKS. How do you handle pod crashes, node failures, and OOM issues in production?
Your Answer:
“We follow a layered resilience model for pod and node stability:
Pod crash: Liveness & readiness probes detect stuck pods → auto-restart; CrashLoopBackOff is monitored via Azure Monitor and Prometheus
Node failure: PodDisruptionBudgets and anti-affinity rules ensure HA across zones; nodes are auto-replaced via VMSS autoscaling
OOM protection: Resource limits/requests set per pod; Horizontal Pod Autoscaler (HPA) on CPU/mem; OOMKill alerts push to Slack for critical services
All restarts are logged and correlated using trace IDs in ELK.”
🔹 Q2: How do you ensure safe deployments and rollback in CI/CD?
Your Answer:
“We use Azure DevOps + Helm + Istio for progressive deployment:
Build phase: Docker images built and vulnerability scanned (Trivy, Aqua); Helm chart versions tagged with Git commit + timestamp
Deploy phase: Canary deployment via Istio traffic shifting (e.g., 10/90 → 50/50 → 100%); success validated with synthetic probes and a Prometheus alert silence window
Rollback: Helm rollback supported with version pinning; Istio routes instantly shifted back to the previous pod set
All deployments are logged and traceable in the release dashboard.”
🔹 Q3: A DR drill is scheduled. What’s your step-by-step runbook?
Your Answer:
“For DR drills, we test region failover and data integrity:
1. Fail the Primary Region: Simulate a region outage by disabling the Azure Front Door health probe; Front Door shifts to the secondary App Gateway (Region B).
2. Validate AKS Readiness: AKS-B is already hot-warm with mirrored images and configs via Terraform; Istio + Prometheus + Kafka consumers are active.
3. Data Layer Validation: Cosmos DB multi-region write → test insert + consistency; Redis geo-replication → validate real-time cache health; Kafka → MM2 consumer offset validation.
4. Reporting & Reconciliation: Cross-region audit logs and trace ID validation; a daily DR report is sent to the compliance team + execs.”
🔹 Q4: How do you control config drift and environment inconsistencies?
Your Answer:
“We use GitOps and IaC to manage environment consistency:
Terraform provisions AKS, App Gateway, Cosmos, Redis, etc.
Helm charts are version-controlled in Git for all microservices
Sealed Secrets: secrets are encrypted and stored in Git, decrypted at runtime only in-cluster
Environment promotion pipelines: Lower → Staging → Prod with diff checks
Drift detection via terraform plan and Azure Policy compliance alerts”
🔹 Q5: A critical pod is stuck during deployment. Business impact is high. What’s your troubleshooting approach?
Your Answer:
“Production support kicks in with structured triage:
Step 1: Check pod status: kubectl describe pod → look for image pull, OOM, or volume errors
Step 2: Review kubectl logs (also visible in ELK)
Step 3: Check for Istio sidecar injection or init container failure
Step 4: Validate secret mounts or configMap mismatches (common with sealed secrets)
Step 5: Evict the pod and allow HPA or Karpenter (if used) to reschedule
If the issue persists → quick rollback via Helm and route traffic via Istio to stable pods”
🔹 Q6: What does your production readiness checklist include before go-live?
Your Answer:
“Every service must pass our readiness gates:
✅ Functional tests: unit, integration, and E2E in CI
✅ Load tests: > 2x expected TPS using Locust/JMeter
✅ Security scans: Docker CVEs + SAST
✅ Observability: Prometheus alerts, ELK logs, dashboards
✅ Rollback plan: Helm version history tested
✅ Chaos testing (optional): introduce pod kills or latency for the top 3 services
Only after passing these do we tag a release as prod-ready.”
🔹 Q7: How do you manage Kafka schema evolution and compatibility?
Your Answer:
“We use Confluent Schema Registry + Avro schemas with version control:
Backward compatibility enforced as the default policy
CI checks run schema diff + compatibility tests
Service contracts are published and shared via an internal developer portal
For breaking changes → we use versioned topics (e.g., txn.v1, txn.v2)
Consumer lag and error trends tracked via Prometheus/Kafka Exporter”