Mutual Fund - Interview Preparation
- Anand Nerurkar
- May 17
- 9 min read
“I’ve architected a modern, real-time, investor-facing mutual fund platform capable of handling 100k+ concurrent users and 5k+ TPS. It uses a Spring Boot microservices backbone with Azure AKS, Kafka for event processing, Istio for service mesh, and integrates with external systems like UIDAI, SEBI, Digio, and payment gateways. This architecture is built for elasticity, DR, real-time observability, and compliance with SEBI mandates. It’s production-ready at enterprise scale with clearly defined user flows, automation, and role-based isolation.”
Mutual Fund - Interview Preparation – Structured Architecture Q&A
🔹 1. How do you scale your architecture for 150k users and 5,000+ TPS?
Answer: We use Azure Kubernetes Service (AKS) with clearly defined thresholds and sizing metrics:
Horizontal Pod Autoscaler (HPA): Each stateless service scales horizontally based on CPU, memory, and Kafka lag. For instance, TransactionEngine handles ~500 TPS per replica; for the 5,000 TPS target we scale to ~20 replicas at a 75% CPU utilization target, leaving headroom for spikes and node failures (an illustrative HPA manifest follows this list).
Cluster Autoscaler (CA): AKS node pools are scaled between 50–120 nodes per region depending on load. Each node hosts 10–15 pods, leading to ~1,000–1,800 pods per cluster.
Kafka: Topics are partitioned based on investor ID and fund type. High-volume topics like order.placed or nav.broadcast are configured with 50–100 partitions and 3–5 consumer replicas.
Redis: Premium Azure Cache with 3 shards and active-active geo-redundancy. Supports >150K ops/sec.
Database Layer: Azure SQL provisioned with 4 read replicas per region; Cosmos DB scaled to 20K RU/s per partitioned container with geo-replication enabled.
Traffic Forecasting: Prometheus + Azure Monitor help project usage patterns, allowing proactive pod pre-scaling during high-volume windows (e.g., market open, NAV updates).
Async Event Processing: Kafka’s decoupled event streams ensure ingestion remains non-blocking during spikes, enabling downstream services to scale independently.
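To make the sizing concrete, an HPA manifest for a hypothetical transaction-engine deployment might look like the sketch below. The names, namespace, and replica bounds are assumptions, and scaling on Kafka lag would additionally require an external metrics source (e.g., KEDA or a Prometheus adapter), which is omitted here:

```yaml
# Illustrative HPA for a hypothetical "transaction-engine" deployment.
# Names, namespace, and replica bounds are assumptions, not production values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: transaction-engine-hpa
  namespace: mf-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: transaction-engine
  minReplicas: 5
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75  # matches the 75% CPU target described above
```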
🔹 2. What happens if Redis is down during a transaction?
Answer:
Fallback to Cosmos DB: Services fall back to persistent stores when Redis fails (sketched below).
Circuit Breakers: Resilience4j ensures retries don't overload systems.
Redis HA: Redis is deployed in geo-redundant Premium tier with failover nodes.
Replay Mechanism: NAVPublisher writes every update to Kafka, so Redis can be rehydrated post-recovery.
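A minimal sketch of the Redis-to-persistent-store fallback described above, assuming Spring Data Redis and Resilience4j annotations; NavRepository is a hypothetical Cosmos DB-backed store and all class names are illustrative:

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class NavLookupService {

    private final StringRedisTemplate redis;
    private final NavRepository navRepository; // hypothetical Cosmos DB-backed store

    public NavLookupService(StringRedisTemplate redis, NavRepository navRepository) {
        this.redis = redis;
        this.navRepository = navRepository;
    }

    // If Redis calls fail repeatedly, the circuit opens and requests go
    // straight to the fallback, protecting the cache from retry storms.
    @CircuitBreaker(name = "redisNav", fallbackMethod = "fromPersistentStore")
    public String latestNav(String fundId) {
        String nav = redis.opsForValue().get("nav:" + fundId);
        // A plain cache miss (no exception) also falls through to the store.
        return nav != null ? nav : fromPersistentStore(fundId, null);
    }

    private String fromPersistentStore(String fundId, Throwable cause) {
        return navRepository.findLatestNav(fundId); // last persisted NAV
    }
}
```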
🔹 3. How do you ensure tenant isolation (e.g., Investor A ≠ Investor B)?
Answer:
Authentication: Azure AD B2C tokens include tenant claims.
Tenant-Aware APIs: All services enforce tenant-scoped access.
Database Isolation: The tenant ID is part of primary keys and every query is filtered by it; logical partitioning uses composite keys and tenant-specific Cosmos DB containers.
Istio RBAC: Restricts service-to-service traffic using labels.
Data Masking & Encryption: Sensitive data (PAN, Aadhaar) is encrypted using Key Vault-managed keys; masking is done via an interceptor layer. A tenant-scope filter sketch follows.
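One way to enforce the tenant-scoped access described above is a servlet filter that compares the tenant claim in the validated token against the tenant being requested. The sketch below assumes Spring Security's JWT support; the tenant_id claim and X-Tenant-Id header are illustrative names:

```java
import java.io.IOException;

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.security.core.Authentication;
import org.springframework.security.core.context.SecurityContextHolder;
import org.springframework.security.oauth2.jwt.Jwt;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class TenantScopeFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        Authentication auth = SecurityContextHolder.getContext().getAuthentication();
        if (auth != null && auth.getPrincipal() instanceof Jwt jwt) {
            // "tenant_id" is an illustrative claim name; actual B2C claims vary by setup.
            String tokenTenant = jwt.getClaimAsString("tenant_id");
            String requestedTenant = request.getHeader("X-Tenant-Id");
            if (requestedTenant != null && !requestedTenant.equals(tokenTenant)) {
                response.sendError(HttpServletResponse.SC_FORBIDDEN, "Cross-tenant access denied");
                return; // never let Investor A read Investor B's data
            }
        }
        chain.doFilter(request, response);
    }
}
```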
🔹 4. What happens if NAV feed is delayed or corrupted?
Answer:
NAVIngestorService: Performs schema validation, timestamp check, checksum, and threshold difference validations to detect anomalies before ingesting.
Quarantine Logic: Corrupted records are logged and published to a nav.error Kafka topic, which is monitored by the alerting system.
Redis TTL: Ensures a NAV older than the 15-minute freshness SLA is not used for transactions.
Fallback Logic: In case of corruption or delay, the UI reads the last known valid NAV from Redis, cached with a TTL of 20 minutes, so investors don’t see stale or invalid NAVs beyond the SLA window.
If the fallback NAV is older than 20 minutes or unavailable, the UI gracefully displays a banner stating: "NAV temporarily unavailable. Please try again later."
Alerting:
Prometheus exposes a nav_age_seconds metric; an alert fires automatically if the NAV data age exceeds 900 seconds.
An alert notification is also sent to the admin dashboard for manual override, reprocessing, or alternate data source injection.
The system ensures data integrity by allowing downstream services like TransactionEngine and PortfolioService to skip NAV-based operations until a valid NAV is restored, thus preventing erroneous unit allocations.
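A minimal sketch of that freshness guard, assuming the NAV and its publish timestamp are cached in Redis under adjacent keys (key layout and class names are illustrative):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class NavFreshnessGuard {

    private static final Duration SLA = Duration.ofMinutes(15); // 900s freshness SLA

    private final StringRedisTemplate redis;

    public NavFreshnessGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    /**
     * Returns the cached NAV only if it is within the freshness SLA;
     * callers such as TransactionEngine skip NAV-based operations otherwise.
     */
    public Optional<String> freshNav(String fundId) {
        String nav = redis.opsForValue().get("nav:" + fundId);
        String publishedAt = redis.opsForValue().get("nav:" + fundId + ":ts");
        if (nav == null || publishedAt == null) {
            return Optional.empty();
        }
        Instant ts = Instant.parse(publishedAt);
        boolean fresh = Duration.between(ts, Instant.now()).compareTo(SLA) <= 0;
        return fresh ? Optional.of(nav) : Optional.empty();
    }
}
```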
🔹 5. How do you prevent Kafka replay or duplicate orders?
Answer:
UUID-based orderId: Generated on order creation, stored with processed status.
Idempotency Layer: TransactionEngine checks whether the orderId already exists with status PROCESSED and skips the event if it has been handled.
Kafka Keying: Message keys (per fund/investor) ensure ordered delivery per partition; consumers follow an at-least-once + idempotent-handler design.
At-Least-Once with Deduplication: Even if Kafka replays an event, the idempotency token saved in transaction metadata ensures it is applied only once (see the detailed TransactionEngine walkthrough below).
🔹 6. What if UIDAI/KYC provider is down?
Answer:
Deferred KYC Flow: Investor record created with status = KYC_PENDING.
Queue-Based Retry: KYC requests are pushed to a kyc.pending Kafka topic and retried once the provider recovers.
Graceful Degradation: Investor cannot transact until KYC confirmed.
Alerts: The circuit breaker opens after 3 failed attempts and an alert is raised; Ops is notified if provider downtime exceeds the SLA threshold.
Admins can manually override for testing use cases (audit-logged); a sketch of the deferred flow follows.
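A sketch of the deferred flow, assuming a Resilience4j circuit breaker around the provider client and a Kafka producer for the retry topic; UidaiClient, KycRequest, and InvestorRepository are illustrative names:

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class KycService {

    private final UidaiClient uidaiClient;               // hypothetical provider client
    private final InvestorRepository investorRepository; // hypothetical investor store
    private final KafkaTemplate<String, KycRequest> kafkaTemplate;

    public KycService(UidaiClient uidaiClient,
                      InvestorRepository investorRepository,
                      KafkaTemplate<String, KycRequest> kafkaTemplate) {
        this.uidaiClient = uidaiClient;
        this.investorRepository = investorRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    // The circuit opens after repeated failures; further calls skip the
    // provider entirely and land in the fallback below.
    @CircuitBreaker(name = "uidai", fallbackMethod = "deferKyc")
    public void verify(KycRequest request) {
        uidaiClient.verify(request);
        investorRepository.updateStatus(request.investorId(), "KYC_VERIFIED");
    }

    // Fallback: keep the investor in KYC_PENDING (so they cannot transact)
    // and queue the request on kyc.pending for retry once the provider recovers.
    private void deferKyc(KycRequest request, Throwable cause) {
        investorRepository.updateStatus(request.investorId(), "KYC_PENDING");
        kafkaTemplate.send("kyc.pending", request.investorId(), request);
    }
}
```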
🔹 7. How do you ensure SEBI compliance and auditability?
Answer:
AuditTrailService subscribes to all business Kafka topics (order, payment, login, KYC) and captures immutable event logs; a listener sketch follows this list. Each event carries metadata including user ID, timestamp, IP address, and action type.
Immutability: Logs are written to Cosmos DB with soft-deletion disabled and retention policy set to 7 years, adhering to SEBI guidelines.
Write-once access is enforced at the database level using append-only containers to ensure immutability.
Automated Reports: A dedicated microservice generates SEBI-compliant JSON and CSV exports on a daily/weekly/monthly schedule, exposed via secure API endpoints.
An Audit Dashboard enables real-time traceability and log search by investor, fund, or transaction.
RBAC Controls: Access to audit data is restricted so that only compliance officers and authorized auditors can view or export reports.
Encryption: All PII is encrypted at rest and masked in output views.
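A minimal sketch of the audit listener, assuming Spring Kafka and an append-only store; the topic names and the AuditEvent/AuditRecord shapes are illustrative:

```java
import java.time.Instant;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class AuditTrailListener {

    private final AuditRepository auditRepository; // hypothetical append-only Cosmos container

    public AuditTrailListener(AuditRepository auditRepository) {
        this.auditRepository = auditRepository;
    }

    // One listener subscribed to the business topics that must be audited
    // (topic names are assumptions for this sketch).
    @KafkaListener(topics = {"order.placed", "payment.success", "user.login", "kyc.status"},
                   groupId = "audit-trail")
    public void onEvent(AuditEvent event) {
        // Records are only ever inserted, never updated or deleted,
        // preserving the write-once guarantee described above.
        auditRepository.insert(new AuditRecord(
                event.userId(),
                event.actionType(),
                event.sourceIp(),
                Instant.now()));
    }
}
```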
🔹 8. What’s your disaster recovery (DR) strategy?
Answer:
Active-Active AKS: Deployed in South and West India, with Azure Front Door handling cross-region routing and failover.
Kafka MirrorMaker 2: Replicates topics across regions for low-latency failover.
Cosmos DB Geo-Replication: Multi-region writes enabled.
Redis Rehydration: On failover, the cache is repopulated from the latest Cosmos DB state or by Kafka replay.
DR Drills: Monthly simulation of a region outage and traffic redirection, validating full app + infra failover.
🔹 9. How do you safely deploy changes across 40+ microservices?
Answer:
Azure DevOps CI/CD: Pipelines using Helm charts and versioned packages for templated deploys.
Istio Canary Deployments: Progressive traffic splits (10% → 25% → 100%) over ~15 minutes for new-version testing.
Blue-Green Deployments: For stateful/critical services.
Feature Flags: LaunchDarkly is used to turn features on gradually.
Contract Awareness: Services declare dependencies via ConfigMaps and shared service contracts (OpenAPI); during build, each microservice's pipeline validates against its downstream and upstream contracts to avoid breaking changes.
Deployment Gates: Health checks of dependent services using readiness probes and synthetic test endpoints; Spring Boot Actuator verifies downstream availability post-deploy.
Rollback Hooks: Rollback is auto-triggered on readiness failure or response-time degradation beyond SLO thresholds.
We use a robust deployment framework built on Azure DevOps + AKS + Istio, supported by automated contract checks, progressive rollout, and security validation. Here’s the detailed strategy:
🔷 CI/CD with Dependency Validation
Microservices define their upstream/downstream API dependencies in ConfigMaps and OpenAPI contracts.
CI pipeline uses tools like Swagger Diff or custom contract validators to check compatibility before deploy.
If any incompatible changes are detected, deployment is blocked automatically.
Semantic versioning is enforced (e.g., v1.2.3) and pinned in Helm charts to avoid accidental version mismatches.
🔷 Istio-Based Canary Deployments
Canary strategy is managed by an Istio VirtualService (sketched after this list):
Traffic split: 10% → 25% → 50% → 100%
Prometheus metrics used to detect errors and response time violations
If SLOs are violated (e.g., latency > 500ms, 5xx > 1%), rollout is halted and reverted.
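The VirtualService sketch below illustrates one step of that split; the service and subset names are assumptions, and the matching DestinationRule subsets are required but omitted:

```yaml
# Illustrative 90/10 canary split; "order-service" and subset names are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 90
        - destination:
            host: order-service
            subset: canary
          weight: 10   # raised to 25, 50, then 100 as SLOs hold
```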
🔷 Blue/Green Deployment Strategy
For critical or stateful services (e.g., OrderService, PaymentGatewayService):
Blue (old) and Green (new) versions are deployed in parallel
Manual or automated switch after green passes all tests
Zero-downtime ensured by keeping both environments live during transition
Rollback simply involves routing back to Blue via Istio routing policies.
🔷 Health Checks and Readiness Probes
Spring Boot Actuator endpoints (/health, /info) tied to Kubernetes readiness/liveness probes
Synthetic test scripts run post-deploy to validate downstream integrations (e.g., KYC, NAV, CRM)
If health fails, rollback is triggered automatically using Helm history (an illustrative health indicator follows)
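As an illustration, a custom Actuator health indicator that surfaces a downstream dependency could look like the sketch below; NavFeedClient and its ping() method are assumptions:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class NavFeedHealthIndicator implements HealthIndicator {

    private final NavFeedClient navFeedClient; // hypothetical downstream client

    public NavFeedHealthIndicator(NavFeedClient navFeedClient) {
        this.navFeedClient = navFeedClient;
    }

    // Surfaces downstream availability via /actuator/health, which the
    // Kubernetes readiness probe consumes; a DOWN status keeps the pod
    // out of rotation and can trip the automated rollback gate.
    @Override
    public Health health() {
        try {
            navFeedClient.ping();
            return Health.up().build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
```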
🔷 DevSecOps Gate Checks
Static code scans using SonarQube and Checkmarx in CI phase
Open-source dependency scanning via OWASP Dependency-Check and GitHub Dependabot
Container image scanning (e.g., Aqua, Trivy, Microsoft Defender for Containers)
Security Gate in pipeline: fails the deployment if any High/Critical CVEs are found in images or dependencies
All secrets managed via Azure Key Vault, and injected at runtime using sealed secrets
🔷 Observability During Deployment
Dashboards (Grafana + Prometheus) monitor:
Error rates
Response latency
Pod restarts and HPA status
Logs streamed to the ELK stack for real-time tracing
Alerting integrated with Microsoft Teams and PagerDuty for rollback alerts
🔷 Rollback Strategy
If validation fails at any phase:
Rollback is done automatically via helm rollback to the previous known good release
Istio redirects all traffic to the last stable deployment
Postmortem and RCA workflows triggered automatically in Confluence/Jira
🔹 10. If NAV updates are increased to every 5 mins, how do you adapt?
Answer:
Scheduler Update: The NAVIngestor cron is adjusted to run every 5 minutes (sketched after this list).
Kafka Throughput: Ensured by increasing partitions for nav.raw-feeds.
HPA Scaling: NAVCalculator and NAVPublisher auto-scale based on Kafka lag and CPU.
Redis TTL: Adjusted to expire data more frequently.
Event-Driven Design: No change needed to consuming services like Portfolio or Transaction.
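A sketch of the scheduler change, assuming Spring's @Scheduled with the cron expression externalized so the interval can move from 15 to 5 minutes without a code change; the property name and service class are illustrative:

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class NavIngestorScheduler {

    private final NavIngestorService navIngestorService; // hypothetical ingestion service

    public NavIngestorScheduler(NavIngestorService navIngestorService) {
        this.navIngestorService = navIngestorService;
    }

    // Cron is externalized; switching to 5-minute ingestion is a config
    // change, e.g. nav.ingest.cron=0 */5 * * * *
    @Scheduled(cron = "${nav.ingest.cron:0 */5 * * * *}")
    public void ingest() {
        navIngestorService.pullAndValidateFeed();
    }
}
```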
⚡ Bonus: Lightning Questions (Structured Responses)
✅ Prevent Double Order
Implemented with UUID-based orderId
TransactionEngine checks if already processed before applying any state change
✅ Circuit Breaker Placement
All external APIs (UIDAI, Razorpay, CRM) wrapped in Resilience4j with fallbacks and timeout strategy
Circuit open status triggers alerts via Prometheus
✅ Kafka Load Testing
Use Apache JMeter Kafka Plugin or kafka-producer-perf-test.sh
Monitor consumer lag, broker CPU/mem via Prometheus/Grafana
✅ NAV Monitoring
nav_age_seconds metric from NAVPublisherService to Prometheus
Alert raised if NAV data is older than 900 seconds (15-minute SLA); a gauge sketch follows
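A minimal sketch of how NAVPublisherService could expose that gauge with Micrometer (class and method names are illustrative):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class NavAgeMetrics {

    private final AtomicReference<Instant> lastNavUpdate =
            new AtomicReference<>(Instant.now());

    public NavAgeMetrics(MeterRegistry registry) {
        // nav_age_seconds: seconds since the last successful NAV publish;
        // the Prometheus rule alerts when this exceeds 900.
        Gauge.builder("nav_age_seconds", lastNavUpdate,
                        ref -> Duration.between(ref.get(), Instant.now()).getSeconds())
                .register(registry);
    }

    /** Called by NAVPublisherService after each successful publish. */
    public void markNavPublished() {
        lastNavUpdate.set(Instant.now());
    }
}
```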
✅ Investor/Admin Isolation
Separate AKS Node Pools for investor vs admin workloads
Istio VirtualServices and authorization policies segment traffic
Azure AD RBAC ensures access boundaries at both UI and API layer
🔹 How does TransactionEngine build a unique orderId and apply deduplication logic?
Designing idempotent, deduplicated order handling is critical in BFSI systems, where even one duplicate transaction can lead to fund misallocation, regulatory violations, or customer disputes.
Here's how a TransactionEngine typically generates a unique orderId and enforces deduplication.
✅ TransactionEngine – Unique Order ID & Deduplication Logic
🔹 1. How orderId Is Generated
The OrderService (not TransactionEngine) is the source of truth for order creation.
At the time of order placement:
A UUID (Version 4) is generated server-side using Java (e.g., UUID.randomUUID()).
Optionally, include metadata in the ID (e.g., prefix like ORD-20240517-xyz) for traceability.
Example: ORD-20240517-efc1a287-d9e2-4c1d-bc3f-1e9eaec401cd
This ID is:
Stored in the orders table with initial status = NEW
Sent in the Kafka order.placed event
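A sketch of the generation step in OrderService, producing the date-prefixed format shown in the example above:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

public final class OrderIdGenerator {

    private static final DateTimeFormatter DATE = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    private OrderIdGenerator() {
    }

    // e.g. ORD-20240517-efc1a287-d9e2-4c1d-bc3f-1e9eaec401cd
    public static String newOrderId() {
        return "ORD-" + LocalDate.now().format(DATE) + "-" + UUID.randomUUID();
    }
}
```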
🔹 2. Deduplication Logic in TransactionEngine
TransactionEngine listens to Kafka topic payment.success or order.placed, depending on the system.
✅ On Receiving the Event:
```java
@Transactional
public void processTransaction(TransactionEvent event) {
    String orderId = event.getOrderId();

    // Deduplication check: skip if this orderId was already processed
    Optional<Transaction> txn = txnRepo.findByOrderId(orderId);
    if (txn.isPresent() && txn.get().getStatus().equals("PROCESSED")) {
        log.info("Skipping duplicate transaction for orderId: {}", orderId);
        return;
    }

    // Process the transaction (NAV lookup, unit calc, portfolio update)
    Transaction newTxn = new Transaction();
    newTxn.setOrderId(orderId);
    newTxn.setStatus("PROCESSED");
    txnRepo.save(newTxn);

    kafkaTemplate.send("transaction.completed", event);
}
```
🔹 3. Storage Design for Deduplication
| Field | Description |
| --- | --- |
| orderId | Primary key or unique key in the transaction table |
| status | ENUM: NEW / PROCESSING / PROCESSED / FAILED |
| createdAt | Timestamp of the initial processing attempt |
| attempts | Optional retry count |
If Kafka replays the same event, it’s ignored gracefully.
The table can also track a first-seen timestamp and the number of replays, which is useful for debugging or fraud checks.
🔹 4. Why This Works in Kafka's At-Least-Once World
Kafka does not guarantee exactly-once delivery by default, so:
At-least-once → consumer must be idempotent
Your transaction logic must never cause side effects (debit/credit) more than once.
🔒 BONUS: Resilience Enhancements
Use a dedicated DB index on orderId to keep the deduplication check fast.
Log duplicates with Kafka transaction.duplicate topic (optional).
Alert if replay rates spike — might indicate Kafka reprocessing, payment gateway retry storm, or downstream flakiness.
✅ Benefits:
Prevents duplicate fund allocation
Ensures data consistency
Makes your system resilient to Kafka replays, retries, and failures
These are advanced observability and safety best practices in BFSI-grade systems, used to ensure transaction accuracy, detect anomalies, and enable proactive debugging.
Here’s how you implement and leverage them:
✅ Logging Duplicate Transactions via transaction.duplicate Kafka Topic
🎯 Why?
To trace idempotent drops
To enable fraud detection or anomaly analysis
To build metrics on order replay frequency
🔧 Implementation in TransactionEngine:
```java
if (txnRepo.existsByOrderId(orderId)) {
    log.info("Duplicate order detected: {}", orderId);

    // Publish to Kafka for observability
    kafkaTemplate.send("transaction.duplicate",
            new DuplicateEvent(orderId, event.getInvestorId(), Instant.now()));
    return;
}
```
🔹 Kafka Topic: transaction.duplicate
| Field | Description |
| --- | --- |
| orderId | ID of the duplicate transaction |
| investorId | Who triggered the request |
| sourceSystem | (Optional) Origin, e.g., Web, UPI |
| timestamp | Event occurrence time |
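The payload itself can be a simple immutable record mirroring the fields above; this is a sketch, since the original class shape isn't shown in the post:

```java
import java.time.Instant;

/** Payload published to the transaction.duplicate topic (sketch). */
public record DuplicateEvent(
        String orderId,      // ID of the duplicate transaction
        String investorId,   // who triggered the request
        String sourceSystem, // optional origin, e.g. "WEB" or "UPI"
        Instant timestamp    // event occurrence time
) {
    // Convenience constructor matching the snippet above, where
    // sourceSystem is not supplied at the publish site.
    public DuplicateEvent(String orderId, String investorId, Instant timestamp) {
        this(orderId, investorId, null, timestamp);
    }
}
```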
🚨 Spike Detection: Replay Rate Alerting
🎯 Why?
Repeated replay events could indicate:
Kafka consumer lag or crash/restart
Payment gateway retrying failed callbacks
Internal service instability
📈 How to Detect:
Expose a Metric: Use a Micrometer/Prometheus counter:
```java
Counter.builder("transaction.duplicates")
        .tag("source", sourceSystem)
        .register(meterRegistry)
        .increment();
```
Alerting Rule in Prometheus:
```yaml
- alert: HighTransactionReplayRate
  expr: rate(transaction_duplicates_total[1m]) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High replay rate detected"
    description: "More than 5 duplicate transactions per minute"
```
Integrate Alerts with:
PagerDuty
Microsoft Teams / Slack
Grafana dashboards
🛡️ Benefits of This Approach
| Aspect | Benefit |
| --- | --- |
| Risk Management | Detects and isolates the root cause before it snowballs |
| Compliance | Proves the system dropped duplicates safely |
| Debugging Support | Full trace of the replayed transaction's context |
| Platform Reliability | Helps fine-tune Kafka consumers and upstream dependencies |