
Mutual Fund - Interview Preparation

  • Writer: Anand Nerurkar
  • May 17
  • 9 min read

“I’ve architected a modern, real-time, investor-facing mutual fund platform capable of handling 100k+ concurrent users and 5k+ TPS. It uses a Spring Boot microservices backbone with Azure AKS, Kafka for event processing, Istio for service mesh, and integrates with external systems like UIDAI, SEBI, Digio, and payment gateways. This architecture is built for elasticity, DR, real-time observability, and compliance with SEBI mandates. It’s production-ready at enterprise scale with clearly defined user flows, automation, and role-based isolation.”


Mutual Fund - Interview Preparation – Structured Architecture Q&A

🔹 1. How do you scale your architecture for 150k users and 5,000+ TPS?

Answer: We use Azure Kubernetes Service (AKS) with clearly defined thresholds and sizing metrics:

  • Horizontal Pod Autoscaler (HPA): Each stateless service is horizontally scaled based on CPU, memory, and Kafka lag. For instance, TransactionEngine handles ~500 TPS per replica; for a 5,000 TPS target, we scale to ~20 replicas with a 75% CPU utilization target.

  • Cluster Autoscaler (CA): AKS node pools are scaled between 50–120 nodes per region depending on load. Each node hosts 10–15 pods, leading to ~1,000–1,800 pods per cluster.

  • Kafka: Topics are partitioned based on investor ID and fund type. High-volume topics like order.placed or nav.broadcast are configured with 50–100 partitions and 3–5 consumer replicas.

  • Redis: Premium Azure Cache with 3 shards and active-active geo-redundancy. Supports >150K ops/sec.

  • Database Layer: Azure SQL provisioned with 4 read replicas per region; Cosmos DB scaled to 20K RU/s per partitioned container with geo-replication enabled.

  • Traffic Forecasting: Prometheus + Azure Monitor help project usage patterns, allowing proactive pod pre-scaling during high-volume windows (e.g., market open, NAV updates).

  • Async Event Processing: Kafka’s decoupled event streams ensure ingestion remains non-blocking during spikes, enabling downstream services to scale independently.
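
To make the consumer-scaling figures above concrete, here is a minimal sketch (assuming Spring Kafka and an already-configured ConsumerFactory bean) of aligning listener concurrency with the partition count of a high-volume topic such as order.placed. The bean names and the concurrency value are illustrative, not the platform's actual configuration.

java

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class OrderConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> orderListenerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // One consumer thread per replica; total consumers across all replicas
        // should not exceed the partition count of order.placed (50-100 partitions).
        factory.setConcurrency(5);
        return factory;
    }
}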


🔹 2. What happens if Redis is down during a transaction?

Answer:

  • Fallback to Cosmos DB: Services fall back to persistent stores when Redis fails.

  • Circuit Breakers: Resilience4j ensures retries don't overload systems.

  • Redis HA: Redis is deployed in geo-redundant Premium tier with failover nodes.

  • Replay Mechanism: NAVPublisher writes every update to Kafka, so Redis can be rehydrated post-recovery.
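
A minimal sketch of the fallback path, assuming the resilience4j-spring-boot starter and Spring Data Redis are on the classpath; NavCacheService, NavRepository, the breaker name, and the key format are illustrative rather than the platform's actual classes.

java

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class NavCacheService {

    private final StringRedisTemplate redis;
    private final NavRepository navRepository; // assumed Cosmos DB-backed repository

    public NavCacheService(StringRedisTemplate redis, NavRepository navRepository) {
        this.redis = redis;
        this.navRepository = navRepository;
    }

    // Normal path: read the cached NAV from Redis.
    @CircuitBreaker(name = "redisNav", fallbackMethod = "navFromCosmos")
    public String latestNav(String fundId) {
        return redis.opsForValue().get("nav:" + fundId);
    }

    // Invoked by Resilience4j when the Redis call fails or the breaker is open.
    private String navFromCosmos(String fundId, Throwable cause) {
        return navRepository.findLatestByFundId(fundId);
    }

    // Placeholder abstraction so the sketch is self-contained.
    public interface NavRepository {
        String findLatestByFundId(String fundId);
    }
}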

🔹 3. How do you ensure tenant isolation (e.g., Investor A ≠ Investor B)?

Answer:

  • Authentication: Azure AD B2C tokens include tenant claims.

  • Tenant-Aware APIs: All services enforce tenant-scoped access.

  • Database Isolation: Tenant ID is part of the primary key and all queries are tenant-filtered; logical partitioning uses composite keys and tenant-specific Cosmos DB containers.

  • Istio RBAC: Restricts service-to-service traffic using labels.

  • Data Masking & Encryption: Sensitive data such as PAN and Aadhaar is encrypted using keys managed in Azure Key Vault; masking is applied via an interceptor layer.
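
A rough sketch of the tenant check enforced in the API layer, assuming Spring Security's OAuth2 resource server support for Azure AD B2C tokens; the claim name "tenantId" and the TenantGuard class are assumptions for illustration.

java

import org.springframework.security.core.context.SecurityContextHolder;
import org.springframework.security.oauth2.server.resource.authentication.JwtAuthenticationToken;

public final class TenantGuard {

    private TenantGuard() {}

    // Compares the tenant claim in the caller's token with the tenant owning the record.
    public static void assertSameTenant(String resourceTenantId) {
        var auth = SecurityContextHolder.getContext().getAuthentication();
        if (!(auth instanceof JwtAuthenticationToken jwtAuth)) {
            throw new IllegalStateException("No authenticated JWT present");
        }
        String tokenTenant = jwtAuth.getToken().getClaimAsString("tenantId"); // assumed claim name
        if (tokenTenant == null || !tokenTenant.equals(resourceTenantId)) {
            throw new SecurityException("Cross-tenant access denied");
        }
    }
}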


🔹 4. What happens if NAV feed is delayed or corrupted?

Answer:

  • NAVIngestorService: Performs schema validation, timestamp check, checksum, and threshold difference validations to detect anomalies before ingesting.

  • Quarantine Logic: Corrupted records are logged and published to the nav.error Kafka topic, which is monitored by the alerting system.

  • Redis TTL: Ensures NAV older than 15 minutes is not used.

  • Fallback Logic: In case of corruption or delay, the UI reads the last known valid NAV from Redis, which is cached with a TTL of 20 minutes. This ensures investors don’t receive stale or invalid NAVs beyond the SLA window.

  • If the fallback NAV is older than 20 minutes or unavailable, the UI gracefully displays a banner stating: "NAV temporarily unavailable. Please try again later."

  • Alerting:

    • Prometheus exposes a nav_age_seconds metric; an alert fires automatically if the NAV data age exceeds 900 seconds.

    • An alert notification is also sent to the admin dashboard for manual override, reprocessing, or alternate data source injection.

  • The system ensures data integrity by allowing downstream services like TransactionEngine and PortfolioService to skip NAV-based operations until a valid NAV is restored, thus preventing erroneous unit allocations.
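
A minimal sketch of how the freshness signal could be wired up with Micrometer so Prometheus sees nav_age_seconds and downstream services can skip stale NAVs; the class and method names are illustrative.

java

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

import java.time.Duration;
import java.time.Instant;

public class NavFreshness {

    private volatile Instant lastNavTimestamp = Instant.EPOCH;

    public NavFreshness(MeterRegistry registry) {
        // Exposes nav_age_seconds so the Prometheus rule (>900s) can alert.
        Gauge.builder("nav_age_seconds", this, NavFreshness::ageSeconds)
             .register(registry);
    }

    public void onNavUpdate(Instant navTimestamp) {
        this.lastNavTimestamp = navTimestamp;
    }

    public double ageSeconds() {
        return Duration.between(lastNavTimestamp, Instant.now()).toSeconds();
    }

    // TransactionEngine / PortfolioService can call this before NAV-based operations.
    public boolean isFresh() {
        return ageSeconds() <= 900; // 15-minute SLA described above
    }
}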


🔹 5. How do you prevent Kafka replay or duplicate orders?

Answer:

  • UUID-based orderId: Generated at order creation and stored along with its processing status.

  • Idempotency Layer: TransactionEngine checks whether the orderId already exists with status PROCESSED and skips the event if it has been handled.

  • Kafka Keying: Message keys (investor/fund) guarantee ordered delivery per key; services follow an at-least-once delivery plus idempotent handler design (see the sketch after this list).

  • At-Least-Once with Deduplication: Even if Kafka replays an event, the idempotency token stored in transaction metadata ensures it is applied only once.
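
A short sketch of the producer-side keying mentioned above, assuming Spring Kafka; OrderEventPublisher and the OrderPlacedEvent shape are illustrative. Keying by investorId keeps all of an investor's events on one partition, so consumers see them in order and the idempotent check (shown later in this post) handles any replays.

java

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderEventPublisher {

    private final KafkaTemplate<String, OrderPlacedEvent> kafka;

    public OrderEventPublisher(KafkaTemplate<String, OrderPlacedEvent> kafka) {
        this.kafka = kafka;
    }

    public void publish(OrderPlacedEvent event) {
        // Partition key = investorId (could also combine with the fund code, as described above).
        kafka.send("order.placed", event.investorId(), event);
    }

    // Minimal event shape for the sketch; the real event carries more fields.
    public record OrderPlacedEvent(String orderId, String investorId, String fundCode) {}
}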


🔹 6. What if UIDAI/KYC provider is down?

Answer:

  • Deferred KYC Flow: Investor record created with status = KYC_PENDING.

  • Queue-Based Retry: Failed KYC requests are pushed to the kyc.pending Kafka topic and retried from there.

  • Graceful Degradation: Investor cannot transact until KYC confirmed.

  • Alerts: The circuit breaker opens after 3 failed attempts and raises an alert; Ops is notified if provider downtime exceeds the SLA threshold.

  • Admins can manually override KYC status for testing use cases (fully audit-logged).
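
A minimal sketch of the deferred KYC flow described above: if the provider call fails, the investor is persisted as KYC_PENDING and the request is queued on kyc.pending for retry. KycProviderClient and InvestorRepository are illustrative abstractions, not the platform's actual interfaces.

java

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class KycService {

    private final KycProviderClient kycProvider;
    private final InvestorRepository investors;
    private final KafkaTemplate<String, String> kafka;

    public KycService(KycProviderClient kycProvider, InvestorRepository investors,
                      KafkaTemplate<String, String> kafka) {
        this.kycProvider = kycProvider;
        this.investors = investors;
        this.kafka = kafka;
    }

    public void verify(String investorId, String pan) {
        try {
            kycProvider.verify(investorId, pan);
            investors.updateStatus(investorId, "KYC_VERIFIED");
        } catch (Exception providerDown) {
            // Graceful degradation: the investor exists but cannot transact until KYC is confirmed.
            investors.updateStatus(investorId, "KYC_PENDING");
            kafka.send("kyc.pending", investorId, pan);
        }
    }

    public interface KycProviderClient { void verify(String investorId, String pan); }
    public interface InvestorRepository { void updateStatus(String investorId, String status); }
}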


🔹 7. How do you ensure SEBI compliance and auditability?

Answer:

  • AuditTrailService subscribes to all business Kafka topics (order, payment, login, KYC) and captures immutable event logs. Each event carries metadata including user ID, timestamp, IP address, and action type.

  • Immutability: Logs are written to Cosmos DB with soft-deletion disabled and retention policy set to 7 years, adhering to SEBI guidelines.

  • Write-once access is enforced at the database level using append-only containers to ensure immutability.

  • Automated Reports: A dedicated microservice generates SEBI-compliant JSON and CSV exports on a daily/weekly/monthly schedule, exposed via secure API endpoints.

  • An Audit Dashboard enables real-time traceability and log search by investor, fund, or transaction.

  • RBAC Controls: Access to audit data is restricted so that only compliance officers and authorized auditors can view or export reports.

  • Encryption: All PII is encrypted at rest and masked in output views.
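
A sketch of the AuditTrailService consumer, assuming Spring Kafka; the topic names, AuditRecord shape, and AuditRepository abstraction over the append-only Cosmos DB container are illustrative.

java

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

import java.time.Instant;

@Service
public class AuditTrailService {

    private final AuditRepository auditRepository;

    public AuditTrailService(AuditRepository auditRepository) {
        this.auditRepository = auditRepository;
    }

    @KafkaListener(topics = {"order.placed", "payment.success", "kyc.pending", "user.login"},
                   groupId = "audit-trail")
    public void onEvent(String payload) {
        // Records are written once and never updated; retention is enforced at the store level.
        auditRepository.append(new AuditRecord(payload, Instant.now()));
    }

    public record AuditRecord(String payload, Instant capturedAt) {}
    public interface AuditRepository { void append(AuditRecord record); }
}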


🔹 8. What’s your disaster recovery (DR) strategy?

Answer:

  • Active-Active AKS: Deployed in South India and West India, fronted by Azure Front Door, which handles region rerouting on failover.

  • Kafka MirrorMaker 2: Replicates topics across regions for low-latency failover.

  • Cosmos DB Geo-Replication: Multi-region writes enabled.

  • Redis Rehydration: On failover, the Redis cache is repopulated from Cosmos DB or by Kafka replay (see the sketch after this list).

  • DR Drills: Monthly drills simulate a full region outage and validate app + infra failover and traffic redirection.
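
A rough sketch of the Redis rehydration step after a regional failover: replay the nav.broadcast topic from the surviving Kafka cluster and repopulate the cache. The wiring, record format, and single-poll loop are simplified assumptions; production code would keep polling until caught up with the latest offsets.

java

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.springframework.data.redis.core.StringRedisTemplate;

import java.time.Duration;
import java.util.List;

public class RedisRehydrator {

    public void rehydrate(KafkaConsumer<String, String> consumer, StringRedisTemplate redis) {
        List<TopicPartition> partitions = consumer.partitionsFor("nav.broadcast").stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .toList();
        consumer.assign(partitions);
        consumer.seekToBeginning(partitions);

        var records = consumer.poll(Duration.ofSeconds(5));
        for (ConsumerRecord<String, String> record : records) {
            // Key = fund identifier, value = latest NAV payload; later records overwrite earlier ones.
            redis.opsForValue().set("nav:" + record.key(), record.value(), Duration.ofMinutes(15));
        }
    }
}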

🔹 9. How do you safely deploy changes across 40+ microservices?

Answer:

  • Azure DevOps CI/CD: Helm-templated, versioned pipelines with automated contract validation against upstream/downstream OpenAPI specs declared in ConfigMaps.

  • Istio Canary Deployments: Progressive traffic shifting (10% → 25% → 100%) over roughly 15 minutes, gated on Prometheus metrics.

  • Blue-Green Deployments: Used for stateful or critical services.

  • Feature Flags: LaunchDarkly is used to enable features gradually.

  • Health Gates: Readiness probes, synthetic test endpoints, and Spring Boot Actuator checks verify downstream availability post-deploy (a HealthIndicator sketch appears at the end of this answer).

  • Automated Rollback: Triggered on readiness failure or response-time degradation beyond SLO thresholds.

  • In detail, the deployment framework is built on Azure DevOps + AKS + Istio, supported by automated contract checks, progressive rollout, and security validation:

    🔷 CI/CD with Dependency Validation

    • Microservices define their upstream/downstream API dependencies in ConfigMaps and OpenAPI contracts.

    • CI pipeline uses tools like Swagger Diff or custom contract validators to check compatibility before deploy.

    • If any incompatible changes are detected, deployment is blocked automatically.

    • Semantic versioning is enforced (e.g., v1.2.3) and pinned in Helm charts to avoid accidental version mismatches.

    🔷 Istio-Based Canary Deployments

    • Canary strategy is managed by Istio VirtualService:

      • Traffic split: 10% → 25% → 50% → 100%

      • Prometheus metrics used to detect errors and response time violations

    • If SLOs are violated (e.g., latency > 500ms, 5xx > 1%), rollout is halted and reverted.

    🔷 Blue/Green Deployment Strategy

    • For critical or stateful services (e.g., OrderService, PaymentGatewayService):

      • Blue (old) and Green (new) versions are deployed in parallel

      • Manual or automated switch after green passes all tests

      • Zero-downtime ensured by keeping both environments live during transition

    • Rollback simply involves routing back to Blue via Istio routing policies.

    🔷 Health Checks and Readiness Probes

    • Spring Boot Actuator endpoints (/health, /info) tied to Kubernetes readiness/liveness probes

    • Synthetic test scripts run post-deploy to validate downstream integrations (e.g., KYC, NAV, CRM)

    • If health fails, rollback is triggered automatically using Helm history

    🔷 DevSecOps Gate Checks

    • Static code scans using SonarQube and Checkmarx in CI phase

    • Open-source dependency scanning via OWASP Dependency-Check and GitHub Dependabot

    • Container image scanning (e.g., Aqua, Trivy, Microsoft Defender for Containers)

    • Security Gate in pipeline: fails deployment if any HIGH/Critical CVEs found in image or dependencies

    • All secrets managed via Azure Key Vault, and injected at runtime using sealed secrets

    🔷 Observability During Deployment

    • Dashboards (Grafana + Prometheus) monitor:

      • Error rates

      • Response latency

      • Pod restarts and HPA status

    • Logs streamed to ELK stack for real-time trace

    • Alerting integrated with Microsoft Teams and PagerDuty for rollback alerts

    🔷 Rollback Strategy

    • If validation fails at any phase:

      • Rollback is done automatically via helm rollback to the previous known good release

      • Istio redirects all traffic to the last stable deployment

      • Postmortem and RCA workflows triggered automatically in Confluence/Jira
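
The Actuator check referenced above can be implemented as a custom HealthIndicator that pings a downstream dependency so readiness gates and post-deploy validation see its status; the sketch below assumes Spring Boot Actuator, and the URL and dependency name are illustrative.

java

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class KycDownstreamHealthIndicator implements HealthIndicator {

    private final RestTemplate restTemplate = new RestTemplate();

    @Override
    public Health health() {
        try {
            // Hypothetical downstream endpoint; any KYC/NAV/CRM dependency could be probed here.
            restTemplate.getForEntity("http://kyc-service/actuator/health", String.class);
            return Health.up().build();
        } catch (Exception e) {
            // Readiness fails, the canary gate halts, and rollback can be triggered.
            return Health.down(e).build();
        }
    }
}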


🔹 10. If NAV updates are increased to every 5 mins, how do you adapt?

Answer:

  • Scheduler Update: NAVIngestor cron adjusted to run every 5 minutes.

  • Kafka Throughput: Ensured by increasing partitions for nav.raw-feeds.

  • HPA Scaling: NAVCalculator and NAVPublisher auto-scale based on Kafka lag and CPU.

  • Redis TTL: Adjusted to expire data more frequently.

  • Event-Driven Design: No change needed to consuming services like Portfolio or Transaction.
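
A minimal sketch of the scheduler change, assuming Spring's @Scheduled support (with @EnableScheduling configured); NavIngestionScheduler and NavIngestorService are illustrative names, and in practice the cron value would be externalized via application.yaml or a ConfigMap.

java

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class NavIngestionScheduler {

    private final NavIngestorService navIngestorService;

    public NavIngestionScheduler(NavIngestorService navIngestorService) {
        this.navIngestorService = navIngestorService;
    }

    @Scheduled(cron = "0 */5 * * * *") // every 5 minutes, at the top of the minute
    public void ingestNavFeed() {
        navIngestorService.ingestLatestFeed();
    }

    public interface NavIngestorService { void ingestLatestFeed(); }
}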

⚡ Bonus: Lightning Questions (Structured Responses)

Prevent Double Order

  • Implemented with UUID-based orderId

  • TransactionEngine checks if already processed before applying any state change

Circuit Breaker Placement

  • All external APIs (UIDAI, Razorpay, CRM) wrapped in Resilience4j with fallbacks and timeout strategy

  • Circuit open status triggers alerts via Prometheus

Kafka Load Testing

NAV Monitoring

  • nav_age_seconds metric from NAVPublisherService to Prometheus

  • Alert raised if NAV data is older than 900 seconds (15 min SLA)

Investor/Admin Isolation

  • Separate AKS Node Pools for investor vs admin workloads

  • Istio VirtualServices and authorization policies segment traffic

  • Azure AD RBAC ensures access boundaries at both UI and API layer


How does TransactionEngine build a unique orderId and apply deduplication logic?


Designing idempotent, deduplicated order handling is critical in BFSI systems, where even one duplicate transaction can lead to fund misallocation, regulatory violations, or customer disputes.

Here's how a TransactionEngine typically generates a unique orderId and enforces deduplication.

✅ TransactionEngine – Unique Order ID & Deduplication Logic

🔹 1. How orderId Is Generated

  • The OrderService (not TransactionEngine) is the source of truth for order creation.

  • At the time of order placement:

    • A UUID (Version 4) is generated server-side using Java (e.g., UUID.randomUUID()).

    • Optionally, include metadata in the ID (e.g., prefix like ORD-20240517-xyz) for traceability.

Example: ORD-20240517-efc1a287-d9e2-4c1d-bc3f-1e9eaec401cd
  • This ID is:

    • Stored in the orders table with initial status = NEW

    • Sent in the Kafka order.placed event
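
A minimal sketch of the ID scheme described above: a date-stamped prefix for traceability plus a random UUID for uniqueness, generated server-side in OrderService. The class name is illustrative.

java

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

public final class OrderIdGenerator {

    private static final DateTimeFormatter DATE = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    private OrderIdGenerator() {}

    public static String newOrderId() {
        // e.g. ORD-20240517-efc1a287-d9e2-4c1d-bc3f-1e9eaec401cd
        return "ORD-" + LocalDate.now().format(DATE) + "-" + UUID.randomUUID();
    }
}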

🔹 2. Deduplication Logic in TransactionEngine

TransactionEngine listens to Kafka topic payment.success or order.placed, depending on the system.

✅ On Receiving the Event:

java

@Transactional
public void processTransaction(TransactionEvent event) {
    String orderId = event.getOrderId();

    Optional<Transaction> txn = txnRepo.findByOrderId(orderId);
    if (txn.isPresent() && txn.get().getStatus().equals("PROCESSED")) {
        log.info("Skipping duplicate transaction for orderId: {}", orderId);
        return;
    }

    // Process the transaction (NAV lookup, unit calculation, portfolio update)
    Transaction newTxn = new Transaction();
    newTxn.setOrderId(orderId);
    newTxn.setStatus("PROCESSED");
    txnRepo.save(newTxn);

    kafkaTemplate.send("transaction.completed", event);
}

🔹 3. Storage Design for Deduplication

  • orderId – Primary key or unique key in the transaction table

  • status – ENUM: NEW / PROCESSING / PROCESSED / FAILED

  • createdAt – Timestamp of the initial processing attempt

  • attempts – Optional retry count

  • If Kafka replays the same event, it’s ignored gracefully.

  • The service can also track the first-seen timestamp and number of replays, which is useful for debugging or fraud checks.

🔹 4. Why This Works in Kafka's At-Least-Once World

  • Kafka does not guarantee exactly-once delivery to consumers, so:

    • At-least-once → consumer must be idempotent

  • Your transaction logic must never cause side effects (debit/credit) more than once.

🔒 BONUS: Resilience Enhancements

  • Use a dedicated DB index on orderId to ensure fast deduplication check.

  • Log duplicates with Kafka transaction.duplicate topic (optional).

  • Alert if replay rates spike — might indicate Kafka reprocessing, payment gateway retry storm, or downstream flakiness.


✅ Benefits:

  • Prevents duplicate fund allocation

  • Ensures data consistency

  • Makes your system resilient to Kafka replays, retries, and failures


These are advanced observability and safety practices used in BFSI-grade systems to ensure transaction accuracy, detect anomalies, and enable proactive debugging.

Here’s how you implement and leverage them:

✅ Logging Duplicate Transactions via transaction.duplicate Kafka Topic

🎯 Why?

  • To trace idempotent drops

  • To enable fraud detection or anomaly analysis

  • To build metrics on order replay frequency

🔧 Implementation in TransactionEngine:

java

if (txnRepo.existsByOrderId(orderId)) {
    log.info("Duplicate order detected: {}", orderId);
    // Publish to Kafka for observability
    kafkaTemplate.send("transaction.duplicate",
            new DuplicateEvent(orderId, event.getInvestorId(), Instant.now()));
    return;
}

🔹 Kafka Topic: transaction.duplicate

  • orderId – ID of the duplicate transaction

  • investorId – Who triggered the request

  • sourceSystem – (Optional) Origin, e.g. Web or UPI

  • timestamp – Event occurrence time

🚨 Spike Detection: Replay Rate Alerting

🎯 Why?

  • Repeated replay events could indicate:

    • Kafka consumer lag or crash/restart

    • Payment gateway retrying failed callbacks

    • Internal service instability

📈 How to Detect:

  1. Expose a Metric: Use a Micrometer/Prometheus counter:

    java

    Counter.builder("transaction.duplicates")
        .tag("source", sourceSystem)
        .register(meterRegistry)
        .increment();

  2. Alerting Rule in Prometheus:

    yaml

    - alert: HighTransactionReplayRate
      expr: rate(transaction_duplicates_total[1m]) * 60 > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High replay rate detected"
        description: "More than 5 duplicate transactions per minute"

  3. Integrate Alerts with:

    • PagerDuty

    • Microsoft Teams / Slack

    • Grafana dashboards

🛡️ Benefits of This Approach

  • Risk Management – Detects and isolates the root cause before it snowballs

  • Compliance – Proves the system dropped duplicates safely

  • Debugging Support – Full trace of the replayed transaction context

  • Platform Reliability – Helps fine-tune Kafka consumers and upstream dependencies

