Data Modernization
- Anand Nerurkar
- Mar 14
- 25 min read
Updated: Mar 15
What is Data Modernization?
Data modernization is the process of transforming legacy data platforms, architectures, and data management practices into scalable, cloud-enabled, real-time, and analytics-driven data ecosystems that support digital business, AI, and advanced analytics.
Simple interview line:
“Data modernization is about transforming legacy data platforms into scalable, real-time, cloud-enabled data ecosystems that can support digital applications, advanced analytics, and AI-driven decision making.”
Why Organizations Do Data Modernization
Legacy data systems usually have problems like:
data silos across departments
batch-based reporting (slow insights)
limited scalability
difficulty supporting AI/ML
high infrastructure cost
Data modernization enables:
real-time insights
data-driven decision making
AI/ML capabilities
scalable cloud platforms
Core Components of Data Modernization
You can explain it in 5 layers.
1️⃣ Data Platform Modernization
Move from legacy databases / data warehouses to modern platforms.
Examples:
traditional RDBMS / on-prem warehouse
→ cloud data lake / lakehouse
Technologies:
Azure Data Lake
Snowflake
Databricks
Goal:
scalable storage and compute separation
2️⃣ Data Integration Modernization
Legacy approach:
batch ETL jobs
Modern approach:
real-time data pipelines
streaming data ingestion
Technologies:
Kafka
Event streaming
CDC pipelines
Goal:
real-time data availability
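To make the batch-vs-streaming contrast concrete, here is a minimal, hypothetical sketch of real-time ingestion. A `queue.Queue` stands in for a Kafka topic so the flow is runnable without a broker; with real Kafka you would use a producer/consumer client instead.

```python
import queue
import threading

# Hypothetical sketch: queue.Queue stands in for a Kafka topic so the
# flow is runnable without a broker.
events = queue.Queue()

def produce(transactions):
    """Publish each transaction event to the 'topic'."""
    for txn in transactions:
        events.put(txn)
    events.put(None)  # sentinel: end of stream

def consume(sink):
    """Ingest events as they arrive and land them in the data platform."""
    while True:
        txn = events.get()
        if txn is None:
            break
        sink.append(txn)  # in practice: write to the data lake

landed = []
producer = threading.Thread(target=produce, args=([{"id": 1, "amt": 250.0}, {"id": 2, "amt": 90.0}],))
consumer = threading.Thread(target=consume, args=(landed,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(len(landed))  # 2 events available almost immediately after production
```

The point of the sketch is that events become available to consumers as they are produced, rather than waiting for a nightly batch window.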
3️⃣ Data Governance & Security
Modern data platforms must support:
data catalog
lineage tracking
data quality monitoring
data masking / tokenization
Especially important in BFSI.
Example:
PII protection
regulatory compliance
4️⃣ Analytics & AI Enablement
Modern platforms support:
self-service analytics
ML models
AI-driven insights
Examples:
fraud detection
personalized banking offers
risk scoring
5️⃣ Data Democratization
Data should be accessible to:
business teams
analysts
data scientists
But with proper access controls.
Goal:
data-driven organization.
“For example, in a banking modernization program, legacy reporting systems that rely on overnight batch processing can be modernized by building a cloud-based data lakehouse platform. Transaction data from core banking and digital channels is ingested in real time using streaming pipelines, governed through data catalog and security controls, and then exposed to analytics and AI platforms for fraud detection, customer analytics, and risk monitoring.”
“Data modernization is the process of transforming legacy data platforms into scalable, cloud-enabled data ecosystems that support real-time analytics and AI-driven decision making. It typically involves modernizing data platforms, enabling real-time data integration through streaming pipelines, strengthening data governance and security, and building analytics and AI capabilities on top of the data platform. The goal is to move from siloed batch-based reporting systems to a unified data platform that enables faster insights, better customer experience, and data-driven business decisions.”
Modern Data Platform Architecture for Banking
Think of the architecture in 5 layers. This is a very real enterprise approach.
1️⃣ Data Source Layer
These are operational systems generating data.
Examples in a bank:
Core Banking System
Payment Systems (UPI / Cards)
Internet Banking
Mobile Banking
CRM systems
ATM network
Fraud monitoring systems
Goal:
Collect structured and unstructured data from multiple systems.
Interview line:
“The first layer includes operational systems such as core banking, payments, digital banking channels, and customer platforms that generate transaction and customer data.”
2️⃣ Data Ingestion Layer
This layer moves data into the modern platform.
Two common approaches:
Batch ingestion
ETL jobs
daily or hourly loads
Real-time ingestion
streaming pipelines
event-driven architecture
Technologies often used:
Kafka / Event Hub
CDC pipelines
ETL tools
Example:
Transaction events from digital banking → streamed to data platform.
Interview line:
“Data is ingested using both batch pipelines and real-time streaming mechanisms depending on the use case.”
3️⃣ Data Storage Layer (Lakehouse)
Modern platforms use data lakes or lakehouse architecture.
Typical setup:
Raw data layer
Processed data layer
Curated data layer
Example storage platforms:
cloud data lake
object storage
lakehouse engines
Benefits:
scalable storage
cost optimization
supports analytics + AI
“Data is stored in a scalable lakehouse architecture with raw, processed, and curated layers to support analytics and machine learning workloads.”
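A hypothetical sketch of the raw → processed → curated layering, using plain Python lists so it is runnable anywhere; in a real lakehouse these would be Parquet or Delta tables written by Spark or a similar engine.

```python
# Raw layer: exact copy of source events, including duplicates and bad records.
raw = [
    {"txn_id": "T1", "amount": "250.00", "channel": "UPI"},
    {"txn_id": "T1", "amount": "250.00", "channel": "UPI"},   # duplicate event
    {"txn_id": "T2", "amount": "bad",    "channel": "CARD"},  # quality issue
]

# Processed layer: standardize schema and drop records failing quality checks.
processed = []
for rec in raw:
    try:
        processed.append({**rec, "amount": float(rec["amount"])})
    except ValueError:
        continue  # route to a quarantine area in a real pipeline

# Curated layer: deduplicate on the business key.
curated = list({rec["txn_id"]: rec for rec in processed}.values())
print(curated)  # one clean, typed record for T1
```

Each layer strictly tightens the guarantees: raw preserves everything for audit, processed enforces types and quality, curated is the deduplicated, trusted dataset.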
4️⃣ Data Processing & Governance Layer
This layer ensures data is reliable and secure.
Capabilities include:
data transformation
data quality checks
metadata management
lineage tracking
access control
In BFSI this is critical for:
regulatory compliance
data privacy
Example:
PII fields masked before analytics access.
“The platform also enforces strong governance including data catalog, lineage tracking, quality controls, and role-based access policies.”
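As one concrete example of PII protection, here is a minimal card-number masking sketch (a common rule keeps the first 6 and last 4 digits); the exact masking policy is an assumption and would come from your compliance team.

```python
import re

def mask_card(pan: str) -> str:
    """Keep first 6 and last 4 digits, mask everything in between."""
    digits = re.sub(r"\D", "", pan)  # strip spaces and separators
    return digits[:6] + "*" * (len(digits) - 10) + digits[-4:]

print(mask_card("4111 1111 1111 1234"))  # 411111******1234
```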
5️⃣ Analytics & AI Layer
Finally, the data is used for business insights.
Use cases:
fraud detection
customer 360 analytics
credit risk scoring
marketing personalization
Users include:
business analysts
data scientists
AI models
“The curated data layer powers analytics dashboards, machine learning models, and AI-driven decision platforms.”
Simple End-to-End Flow
You can summarize like this:
Core Banking → Event Streaming → Data Lakehouse → Data Processing & Governance → Analytics / AI
“In a banking modernization program, data modernization typically involves building a cloud-based lakehouse architecture. Transaction and customer data from systems such as core banking, payments, and digital channels are ingested through batch pipelines and real-time streaming platforms. The data is stored in a scalable data lakehouse with raw, processed, and curated layers. On top of this we implement governance capabilities such as data catalog, lineage tracking, and security controls to ensure compliance. Finally, the curated data is consumed by analytics and AI platforms for use cases like fraud detection, customer analytics, and risk monitoring.”
“Can structured data go into a data lake?”
Yes, structured data can absolutely go into a data lake. A data lake is not limited to unstructured data.
But the key is how it is organized and processed.
1️⃣ Data Lake Accepts All Types of Data
A modern data lake stores three types of data:
Structured data
RDBMS tables
transactional data
CSV / Parquet tables
Semi-structured data
JSON
XML
logs
events
Unstructured data
documents
images
audio
So if your source is RDBMS or NoSQL, the data can still be ingested into the data lake.
Example:
Core Banking DB → CDC pipeline → Data Lake
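The CDC step above can be sketched as a pull-based pattern: poll the source table for rows beyond the last high-water mark and land them in the lake. This is a hypothetical, simplified illustration using sqlite3; log-based CDC tools (e.g. Debezium) read the database transaction log instead of polling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

lake = []          # stands in for the data lake landing zone
high_water = 0     # last id already captured

def capture_changes():
    """Pull rows beyond the high-water mark into the landing zone."""
    global high_water
    rows = conn.execute(
        "SELECT id, balance FROM accounts WHERE id > ?", (high_water,)
    ).fetchall()
    lake.extend(rows)
    if rows:
        high_water = max(r[0] for r in rows)

capture_changes()                                      # initial load: 2 rows
conn.execute("INSERT INTO accounts VALUES (3, 75.0)")
capture_changes()                                      # incremental: 1 new row
print(len(lake))  # 3
```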
2️⃣ Structured Data in Data Lake (How It Works)
Structured data from databases is usually stored in columnar formats such as:
Parquet
ORC
Delta tables
These formats enable:
fast analytics queries
compression
schema evolution
So the lake is not just a dumping ground.
It becomes analytics-ready storage.
3️⃣ Why Companies Push RDBMS Data to Data Lake
Banks often push structured data into the lake because:
1️⃣ Unified data platform
Instead of many siloed databases.
2️⃣ Advanced analytics
AI and ML need large datasets.
3️⃣ Historical data storage
Data lake stores years of data cheaply.
4️⃣ Real-time pipelines
Streaming pipelines continuously land data into the lake.
4️⃣ Where RDBMS Still Exists
Even in modern architecture:
Purpose | Technology
Operational transactions | RDBMS
High-scale operational apps | NoSQL
Analytics / AI / reporting | Data lake / lakehouse
So the data lake does not replace operational databases.
It supports analytics workloads.
5️⃣ Summary
“A modern data lake stores structured, semi-structured, and unstructured data. Structured data from RDBMS or NoSQL systems is typically ingested through CDC or batch pipelines and stored in columnar formats such as Parquet or Delta tables. This enables large-scale analytics, AI workloads, and long-term historical storage while operational databases continue to support transactional workloads.”
“If data lake can store structured data, why do we still need a data warehouse?”
The key difference is purpose and optimization.
1️⃣ Data Lake – Raw & Flexible Storage
A data lake is designed for:
storing large volumes of raw data
handling structured, semi-structured, and unstructured data
supporting data science and AI workloads
Characteristics:
schema-on-read
cheap storage
highly scalable
Example:
Bank stores:
transaction logs
customer data
mobile app events
ATM logs
documents
“Data lakes are optimized for storing large volumes of raw data in its native format.”
2️⃣ Data Warehouse – Curated & Optimized for BI
A data warehouse is designed for:
structured reporting
business dashboards
regulatory reporting
Characteristics:
schema-on-write
highly optimized SQL queries
curated business datasets
Example:
Finance dashboards
Customer analytics reports
Regulatory MIS reports
“Data warehouses provide curated, structured datasets optimized for BI and reporting workloads.”
3️⃣ How Modern Architecture Uses Both
Most enterprises follow this pattern:
Operational Systems → Data Lake → Data Warehouse → BI Tools
Flow example:
Core Banking → Data Lake → Curated Data Warehouse → Power BI dashboards
Why?
Data lake:
ingest everything
Data warehouse:
provide clean, trusted datasets for business users
4️⃣ Modern Approach – Lakehouse
Today many organizations combine both.
Lakehouse architecture provides:
data lake scalability
data warehouse performance
Technologies:
Delta Lake
Databricks
Snowflake
“Modern platforms often adopt lakehouse architecture which combines the scalability of data lakes with the performance and governance capabilities of data warehouses.”
“While data lakes can store structured data, they are primarily optimized for storing large volumes of raw data across multiple formats. Data warehouses, on the other hand, provide curated and structured datasets optimized for BI reporting and SQL analytics. In modern architectures, organizations often use both together where raw data lands in the data lake and curated business datasets are served through a data warehouse or lakehouse platform.”
“How would you design a data platform for real-time fraud detection in a bank?”
They want to see if you understand event-driven architecture, streaming, AI, and low-latency systems.
A simple way to explain is using 5 layers.
1️⃣ Transaction Source Layer
Fraud detection starts with transaction events.
Sources include:
Core banking transactions
Card payment systems
UPI / digital payments
Mobile banking transactions
ATM transactions
Every transaction generates an event.
Example:
Customer initiates card payment → transaction event generated.
“Fraud detection platforms start with transaction events generated by payment systems, digital banking channels, and card processing systems.”
2️⃣ Real-Time Streaming Layer
These events must be processed immediately.
Use an event streaming platform.
Examples:
Kafka
Event streaming platforms
messaging queues
Purpose:
ingest high-volume transactions
support low latency processing
Example flow:
Payment system → Event Stream → Fraud engine
“Transactions are published to a real-time event streaming platform which enables high-throughput and low-latency processing.”
3️⃣ Real-Time Processing Layer
This layer performs fraud detection logic.
Processing includes:
rule-based detection
anomaly detection
machine learning scoring
Examples:
Rules:
unusual transaction location
abnormal spending pattern
ML models:
behavioral fraud detection
Goal:
fraud decision in milliseconds
“The streaming data is processed through real-time analytics engines that apply both rule-based detection and machine learning models.”
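The rule-based side of this layer can be sketched as a small set of named predicates over transaction features. The rule names, feature names, and thresholds below are illustrative assumptions, not a production rule set.

```python
# Hypothetical rule set: each rule is (name, predicate over the transaction).
RULES = [
    ("amount_spike",    lambda t: t["amount"] > 10 * t["avg_amount_30d"]),
    ("location_change", lambda t: t["distance_from_last_txn_km"] > 500),
    ("velocity",        lambda t: t["txns_last_5_min"] > 5),
]

def evaluate_rules(txn):
    """Return the names of all rules the transaction triggers."""
    return [name for name, check in RULES if check(txn)]

txn = {"amount": 90000, "avg_amount_30d": 1200,
       "distance_from_last_txn_km": 40, "txns_last_5_min": 1}
print(evaluate_rules(txn))  # ['amount_spike']
```

In practice the triggered rule names would be combined with an ML fraud score before the final decision.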
4️⃣ Data Platform Layer
All transaction events are stored for:
historical analysis
model training
investigation
Stored in:
data lake / lakehouse
Purpose:
build better fraud detection models
support analytics
5️⃣ Response Layer
If fraud risk is detected:
Possible actions:
block transaction
trigger OTP verification
send alert to customer
notify fraud investigation team
Goal:
prevent fraud before money leaves the system.
“Based on fraud scoring, the system can automatically block transactions or trigger step-up authentication mechanisms.”
End-to-End Architecture Flow
Simple explanation:
Transaction → Event Streaming → Real-Time Fraud Engine → Decision → Alert / Block
Parallel flow:
Transaction → Data Lake → AI Model Training
“In a modern banking architecture, real-time fraud detection is built on an event-driven platform. Transaction events from payment systems, digital banking channels, and ATM networks are published to a real-time streaming platform such as Kafka. These events are processed by real-time analytics engines that apply rule-based checks and machine learning models to detect suspicious behavior. The system generates a fraud risk score within milliseconds and can trigger actions such as blocking the transaction or initiating step-up authentication. All transaction data is also stored in a data lake or lakehouse platform for historical analysis and continuous improvement of fraud detection models.”
“Where does the feature store fit in a modern data platform?”
The key point is: the feature store sits between curated data and ML models.
Let’s walk through the realistic data → ML pipeline used in modern data platforms.
1️⃣ Data Ingestion → Data Lake (Raw Layer)
First, data from systems enters the lake.
Sources:
Core banking
Card transactions
Digital banking events
CRM data
This lands in the Raw Layer.
Characteristics:
exact copy of source data
minimal transformation
used for audit and traceability
Flow:
Core Banking → CDC / Streaming → Raw Data Lake
2️⃣ Data Processing → Curated Layer
Next, data is cleaned and transformed.
Activities:
data cleansing
schema standardization
enrichment
joining multiple sources
Example:
Transaction data + customer data + device data.
Result:
Curated datasets ready for analytics.
Flow:
Raw → ETL / Spark processing → Curated layer
3️⃣ Analytics / Feature Engineering Layer
From curated data, features are derived.
Features are variables used by ML models.
Example fraud features:
transaction amount deviation
number of transactions in last 5 minutes
location change frequency
device risk score
These features are stored in the Feature Store.
Flow:
Curated Data → Feature Engineering → Feature Store
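One of the features listed above, "number of transactions in the last 5 minutes", can be sketched as a sliding-window count over event timestamps. Window size and timestamps are illustrative assumptions.

```python
from collections import deque

WINDOW = 300  # 5 minutes, in seconds

def txns_in_window(timestamps, now, window=WINDOW):
    """Count transaction timestamps falling inside the trailing window."""
    recent = deque(t for t in timestamps if now - t <= window)
    return len(recent)

events = [10, 200, 350, 580, 590]      # event times in seconds
print(txns_in_window(events, now=600))  # events at 350, 580, 590 fall in window
```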
4️⃣ Feature Store
This is a central repository for ML features.
It ensures:
consistent features for training and inference
reusable features across models
feature versioning and governance
Example features:
average transaction amount
transaction velocity
customer risk profile
“Feature stores provide a governed repository of machine learning features derived from curated datasets.”
5️⃣ ML Model Training
Now the ML pipeline uses features from the feature store.
Training process:
Feature Store → Training Dataset → ML Model Training
Example:
Fraud detection model
Credit risk model
Output:
Trained model.
6️⃣ Real-Time Inference
For real-time fraud detection:
Transaction event arrives → features retrieved → model scoring.
Flow:
Transaction Event → Feature Lookup → ML Model → Fraud Score
Full Enterprise Flow
Operational Systems → Data Lake Raw → Curated Data Layer → Feature Engineering → Feature Store → ML Model Training → Real-Time Model Inference
“In a modern data platform, operational data first lands in the raw data lake and is then transformed into curated datasets through processing pipelines. From the curated layer we perform feature engineering to derive machine learning features such as transaction velocity or behavioral patterns. These features are stored in a centralized feature store, which ensures consistency between training and inference. Machine learning models are trained using these features, and during real-time transactions the system retrieves relevant features to generate fraud risk scores.”
The more correct architecture is:
Raw → Curated → Feature Engineering → Feature Store → ML Training
Analytics dashboards may also use curated data, but feature store is specifically for ML pipelines.
Why Feature Store Has Two Parts
Machine learning systems have two different needs:
1️⃣ Model training (large historical data)
2️⃣ Real-time inference (low latency scoring)
Because these needs are different, the feature store is split into:
Offline Feature Store
Online Feature Store
1️⃣ Offline Feature Store (Model Training)
Used for training ML models.
Characteristics:
stores large historical datasets
optimized for batch processing
supports data science experimentation
Where it typically resides:
Data Lake
Lakehouse
Data Warehouse
Example:
Historical transaction data for last 2 years used to train fraud model.
Flow:
Curated Data → Feature Engineering → Offline Feature Store
Example features stored:
average transaction value (30 days)
number of transactions per hour
device usage patterns
“Offline feature stores support model training by providing large historical datasets derived from curated data.”
2️⃣ Online Feature Store (Real-Time Inference)
Used during live transactions.
Characteristics:
low latency access
optimized for milliseconds response
contains latest feature values
Where it is stored:
NoSQL databases
in-memory stores
low-latency key-value stores
Example:
When a customer makes a payment:
System retrieves features such as:
transaction velocity
last login location
device fingerprint
This must happen in milliseconds.
Flow:
Transaction Event → Fetch Features → ML Model → Fraud Score
“Online feature stores provide low-latency feature access for real-time model inference.”
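The lookup-then-score path can be sketched as below. The online store is simulated with a dict keyed by customer id; in production this would be Redis or another low-latency key-value store, and the model would be a trained classifier rather than the stand-in weighted sum used here.

```python
import time

# Hypothetical online feature store keyed by customer id.
online_store = {
    "cust-42": {"txn_velocity": 3, "last_login_city": "Pune", "device_risk": 0.2},
}

def score_transaction(customer_id, model):
    """Fetch latest features, score them, and measure end-to-end latency."""
    start = time.perf_counter()
    features = online_store.get(customer_id, {})   # feature lookup
    score = model(features)                        # model inference
    latency_ms = (time.perf_counter() - start) * 1000
    return score, latency_ms

# Stand-in model: a simple weighted combination, not a trained model.
toy_model = lambda f: min(1.0, 0.1 * f.get("txn_velocity", 0) + f.get("device_risk", 0))

score, latency = score_transaction("cust-42", toy_model)
print(round(score, 2))  # 0.5
```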
Example Fraud Detection Flow
Transaction occurs:
1️⃣ Transaction event arrives
2️⃣ System retrieves features from Online Feature Store
3️⃣ ML model calculates fraud probability
4️⃣ Transaction allowed or blocked
Meanwhile:
Historical data continues feeding Offline Feature Store for model retraining.
Simple Architecture View
Data Lake → Feature Engineering → Offline Feature Store → Model Training
Real-time transactions → Online Feature Store → Model Inference
“Feature stores typically have two components: offline and online stores. The offline feature store is used for model training and contains large historical datasets derived from curated data in the lakehouse. Data scientists use this data to train and experiment with machine learning models. The online feature store is optimized for low-latency access and is used during real-time inference. When a transaction occurs, the system retrieves relevant features from the online store and feeds them into the ML model to generate predictions such as fraud risk scores.”
Enterprise Data Modernization Architecture (Banking Example)
Data modernization means moving from siloed legacy databases to a unified data platform that supports analytics, AI/ML, and real-time decision systems like fraud detection.
The goal is to enable:
real-time decision making
advanced analytics and AI
scalable data processing
enterprise governance and compliance
1️⃣ Data Ingestion Layer
First step in modernization is collecting data from multiple sources.
Typical banking data sources:
Core banking system
Card transactions
ATM network
Mobile banking apps
CRM systems
payment gateways
Data is ingested through:
Batch ingestion
ETL pipelines
CDC from databases
Streaming ingestion
Kafka
Azure Event Hub
Flow:
Operational Systems → Data Ingestion Platform
Streaming ingestion is critical for real-time analytics like fraud detection.
2️⃣ Data Lake Architecture
All raw data is stored in a central data lake.
Example platforms:
Azure Data Lake
Amazon S3
GCP Cloud Storage
Data is organized into three layers.
Raw Layer
Original data as received.
Examples:
transaction logs
clickstream data
payment events
Purpose:
preserve original data for audit.
Curated Layer
Data is cleaned and standardized.
Typical processing:
schema validation
data quality checks
deduplication
Example tools:
Spark
Databricks
Azure Data Factory
This layer creates trusted datasets for analytics.
Analytics Layer
This layer prepares aggregated datasets for business insights.
Examples:
customer behaviour datasets
transaction summaries
fraud detection datasets
These datasets support:
BI dashboards
reporting
machine learning
3️⃣ Feature Engineering & Feature Store
For ML systems, raw data must be converted into features.
Examples of fraud features:
average transaction value
transactions in last 5 minutes
device fingerprint
location anomaly
Feature pipelines compute these features and store them in a Feature Store.
Feature stores maintain two versions:
Offline Feature Store
Used for model training.
Stores historical feature data.
Example technologies:
Databricks Feature Store
Feast
Online Feature Store
Used for real-time inference.
Stored in low-latency systems like:
Redis
Cassandra
This ensures fraud models can retrieve features within milliseconds.
4️⃣ ML Model Training Pipeline
Historical data from the offline feature store is used for training ML models.
Steps:
Data lake → feature engineering → offline feature store → ML training
Typical models used in fraud detection:
Gradient boosting (XGBoost)
Random forest
neural networks
The trained model is then stored in a Model Registry.
Model registry manages:
versioning
approvals
governance
5️⃣ Real-Time Fraud Detection (Synchronous)
When a customer performs a transaction, the system performs real-time fraud scoring.
Flow:
Transaction Request → Fraud Detection API → Feature retrieval (online feature store) → ML model inference → rule engine validation → risk decision
Possible outcomes:
approve transaction
request OTP / MFA
block transaction
This process must complete within 30–50 milliseconds.
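The mapping from fraud score to outcome can be sketched as a small decision policy. The thresholds below are illustrative assumptions; real systems tune them against fraud-loss and customer-friction targets.

```python
def risk_decision(fraud_score: float) -> str:
    """Map a fraud score in [0, 1] to an action. Thresholds are illustrative."""
    if fraud_score >= 0.9:
        return "block"
    if fraud_score >= 0.5:
        return "step_up_auth"   # request OTP / MFA
    return "approve"

print(risk_decision(0.12))  # approve
print(risk_decision(0.65))  # step_up_auth
print(risk_decision(0.95))  # block
```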
6️⃣ Asynchronous Fraud Analytics Pipeline
In parallel, transaction events are sent to a streaming platform.
Example:
Transaction Event → Kafka / Event Hub → Fraud Analytics Engine
This pipeline performs deeper analysis such as:
behavioural anomaly detection
fraud network detection
merchant fraud patterns
If fraud is detected later:
accounts may be frozen
transaction reversal attempted
fraud investigation triggered
These fraud cases are also fed back into the data lake for retraining models.
7️⃣ Continuous Model Improvement
Fraud detection systems constantly improve through feedback loops.
Process:
Fraud incident detected → labeled data stored in data lake → feature pipelines updated → model retrained
This allows models to adapt to new fraud patterns.
8️⃣ Governance and Compliance
Data modernization platforms must include strong governance.
Capabilities include:
data catalog
lineage tracking
access control
data masking
regulatory compliance
Tools often used:
Azure Purview
Collibra
Apache Atlas
This ensures secure and compliant data usage.
9️⃣ Final Architecture Overview
The modern enterprise data platform supports:
Operational Systems → Data Ingestion (Batch + Streaming) → Data Lake (Raw → Curated → Analytics) → Feature Engineering → Feature Store (Offline + Online) → ML Model Training → Model Registry → Real-Time Fraud Detection API → Asynchronous Fraud Analytics
This architecture enables AI-driven banking platforms with real-time decision making.
“Data modernization in banking involves building a unified data platform where operational data is ingested through batch and streaming pipelines into a multi-layer data lake. Curated datasets are used for analytics and ML feature engineering, while feature stores provide consistent features for both model training and real-time inference. This enables systems such as real-time fraud detection where ML models evaluate transactions synchronously, supported by asynchronous analytics pipelines that continuously improve fraud detection capabilities.”
“How do you prevent training–serving skew in ML systems?”
What is Training–Serving Skew?
Training–serving skew happens when:
The data used to train the model is different from the data used during real-time prediction.
Because of this difference, the model behaves incorrectly in production.
Example in banking fraud detection:
During training
Feature calculated as:
average transaction amount in last 30 days
But during real-time inference
System calculates:
average transaction amount in last 7 days
Now the model receives different feature distributions, leading to inaccurate predictions.
That mismatch is training–serving skew.
“Training–serving skew occurs when the feature computation logic or data distribution used during model training differs from what is used during real-time inference.”
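The 30-day versus 7-day mismatch above can be made concrete with a small sketch; the spending history is invented to show how the two windows diverge when behavior changes recently.

```python
# Hypothetical daily spend history: flat for 23 days, then a jump in the last week.
amounts_by_day = [100.0] * 23 + [500.0] * 7

def avg_amount(history, days):
    """Average spend over the trailing window of `days` days."""
    window = history[-days:]
    return sum(window) / len(window)

training_value = avg_amount(amounts_by_day, days=30)  # what the model learned on
serving_value  = avg_amount(amounts_by_day, days=7)   # what the model actually gets
print(training_value, serving_value)  # ~193.3 vs 500.0: different distributions
```

The model was trained on values near 193 but is fed 500 at serving time, so its learned thresholds no longer apply.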
Why It Happens
Common reasons:
1️⃣ Different data pipelines
2️⃣ Different feature calculation logic
3️⃣ Missing real-time data
4️⃣ Delayed feature updates
Example:
Training pipeline built in Spark, but production inference calculates features in application code.
This causes inconsistency.
How Enterprises Prevent It
There are three main approaches.
1️⃣ Use Feature Store
Feature store ensures same features are used for both training and inference.
Instead of recalculating features separately:
Training and inference both read from the same feature definitions.
Interview line:
“Feature stores help eliminate training-serving skew by ensuring consistent feature definitions across training and inference pipelines.”
2️⃣ Unified Feature Engineering Pipelines
Feature computation logic should be defined once.
Example:
Feature defined once in pipeline → reused everywhere.
Avoid:
Different teams writing separate feature logic.
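A minimal sketch of "define once, reuse everywhere": feature logic lives in a single registry, and both the training pipeline and the inference path call the same function. Names and the example feature are illustrative assumptions.

```python
# Single source of truth for feature definitions.
FEATURE_DEFS = {
    "txn_velocity_5m": lambda txns, now: sum(1 for t in txns if now - t <= 300),
}

def compute_features(txns, now):
    """One computation path shared by training and inference."""
    return {name: fn(txns, now) for name, fn in FEATURE_DEFS.items()}

# Training (batch over history) and serving (live event) call the same code.
training_row = compute_features([10, 400, 550], now=600)
serving_row  = compute_features([10, 400, 550], now=600)
assert training_row == serving_row  # no skew: identical logic by construction
print(training_row)  # {'txn_velocity_5m': 2}
```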
3️⃣ Continuous Monitoring
Production models must be monitored for:
data drift
feature drift
prediction anomalies
Monitoring tools detect if production data distribution changes.
Example Fraud Detection Flow
Training pipeline:
Data Lake → Feature Engineering → Offline Feature Store → Model Training
Inference pipeline:
Transaction Event → Online Feature Store → ML Model → Fraud Score
Both pipelines use same feature definitions.
“Training–serving skew occurs when the feature data used during model training differs from what is used during real-time inference, which can lead to inaccurate predictions in production. To prevent this, modern ML platforms use feature stores where feature definitions are centralized and shared across both training and inference pipelines. This ensures the same features and transformations are used consistently. Additionally, organizations implement unified feature engineering pipelines and monitor models in production to detect data drift or feature drift.”
Correct ML Feature Store Flow
1️⃣ Model Training (Offline Pipeline)
Training uses historical data.
Flow:
Data Lake → Feature Engineering → Offline Feature Store → Model Training
Characteristics:
large datasets
batch processing
used by data scientists
Example:
Train fraud model using 2 years of historical transaction features.
2️⃣ Feature Synchronization
After features are generated offline, latest feature values are pushed to the Online Feature Store.
This process is sometimes called:
feature materialization
feature serving pipeline
Purpose:
Make latest features available for real-time scoring.
3️⃣ Real-Time Inference (Online Pipeline)
When a transaction happens:
Transaction Event → Retrieve Features from Online Feature Store → Call ML Model → Generate Fraud Score
Why?
Because online store provides millisecond latency.
Offline store (data lake / warehouse) is too slow for real-time systems.
Why We Cannot Use Offline Store for Inference
Offline feature stores usually live in:
data lakes
warehouses
lakehouses
These systems are optimized for:
batch queries
analytics
Latency may be seconds or minutes, which is unacceptable for fraud detection.
Real-time fraud detection needs 10–50 ms response time.
That is why we use:
low latency key-value stores
in-memory databases
NoSQL
Correct Architecture Summary
Training pipeline:
Data Lake → Feature Engineering → Offline Feature Store → Model Training
Inference pipeline:
Transaction Event → Online Feature Store → ML Model → Fraud Score
Important Principle
Both feature stores contain same feature definitions, but they serve different purposes.
Store | Purpose
Offline Feature Store | Training
Online Feature Store | Real-time inference
“Model training typically uses the offline feature store which contains large historical datasets. However, during real-time inference the system retrieves features from the online feature store because it provides low-latency access. The online store is synchronized with the offline store to ensure consistency between training and serving.”
“If both stores have the same features, why do we need two?”
The answer is simply:
offline = batch analytics
online = millisecond serving
Correct Feature Store Flow
1️⃣ Historical Data → Feature Engineering
First, we derive features from historical datasets.
Example fraud features:
avg transaction amount (30 days)
number of transactions in last 10 minutes
device risk score
location deviation score
Flow:
Operational Data → Data Lake → Feature Engineering Pipeline
This pipeline produces feature datasets.
2️⃣ Features Stored in Offline Feature Store
These engineered features are stored in the Offline Feature Store.
Purpose:
used by data scientists
supports model training
large historical datasets
Example:
2 years of customer transaction features.
Flow:
Feature Engineering → Offline Feature Store
3️⃣ Feature Materialization to Online Store
Now latest feature values are pushed (materialized) to the Online Feature Store.
This step ensures the same features used during training are available during inference.
Flow:
Offline Feature Store → Feature Materialization → Online Feature Store
The online store only keeps latest feature values, not huge history.
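Materialization can be sketched as copying only the most recent value per feature from the offline history to the online store. Both stores are simulated with dicts here; in production the offline side would live in the lakehouse and the online side in a key-value store.

```python
# Hypothetical offline store: full (timestamp, feature_values) history per customer.
offline_store = {
    "cust-42": [
        (100, {"avg_spend_30d": 1200.0}),
        (200, {"avg_spend_30d": 1350.0}),
        (300, {"avg_spend_30d": 1500.0}),
    ],
}

def materialize(offline):
    """Copy only the most recent feature values into the online store."""
    online = {}
    for customer, history in offline.items():
        _, latest = max(history, key=lambda row: row[0])
        online[customer] = latest
    return online

online_store = materialize(offline_store)
print(online_store["cust-42"])  # {'avg_spend_30d': 1500.0}
```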
4️⃣ Model Training
Training pipeline uses:
Offline Feature Store → ML Training Pipeline → Model
Example:
Fraud detection model trained on historical feature datasets.
5️⃣ Real-Time Inference
When a transaction happens:
Transaction Event → Fetch features from Online Feature Store → Call ML Model → Generate Fraud Score
This works because the online store provides millisecond latency.
Simplified Architecture Flow
Historical Data → Feature Engineering → Offline Feature Store → Model Training
Offline Feature Store → Feature Materialization → Online Feature Store
Transaction Event → Online Feature Store → ML Model → Fraud Score
The Key Principle
Offline store and online store share the same feature definitions.
But they serve different needs:
Feature Store | Purpose
Offline Feature Store | Training and experimentation
Online Feature Store | Low-latency inference
“Features are typically derived from historical datasets through feature engineering pipelines and stored in the offline feature store for model training. The latest feature values are then materialized to the online feature store, which is optimized for low-latency access during real-time inference.”
AI platform architecture used in enterprises (Data Lake → Feature Store → Model Registry → CI/CD → Model Serving).
This is a very important enterprise AI architecture. Large organizations implement AI platforms as a structured pipeline so models can be built, governed, and deployed reliably.
A typical enterprise AI platform architecture looks like this:
Data Sources → Data Lake / Data Warehouse → Feature Store → Model Development & Training → Model Registry → CI/CD for ML → Model Serving / Inference → Monitoring & Feedback
Let’s walk through each layer clearly, so you can explain it confidently in interviews.
1️⃣ Data Sources
AI models start with enterprise data sources.
Typical sources in banking include:
Core banking transactions
Payment systems
Customer profiles
Credit bureau data
Digital banking activity
External data (fraud networks, geo data).
Example:
UPI transactions
ATM withdrawals
Mobile banking events
Loan applications
These generate large volumes of structured and streaming data.
2️⃣ Data Lake / Data Warehouse
All raw data is collected in a central data platform.
Purpose:
store large volumes of data
enable analytics
provide data for ML training.
Typical technologies:
Azure Data Lake
AWS S3 Data Lake
GCP BigQuery
Snowflake.
Example pipeline:
Core Banking → Data Ingestion (Kafka / ETL) → Enterprise Data Lake
Data is cleaned, transformed, and governed here.
3️⃣ Feature Store
This is a very important ML component.
A feature is a variable used by ML models.
Example features for fraud detection:
transaction_amount
transactions_last_24_hours
device_change_flag
location_distance
The Feature Store:
stores reusable ML features
ensures consistency between training and inference
avoids recomputing features.
Example tools:
Feast
Tecton
Databricks Feature Store.
Example:
customer_avg_spend
transaction_frequency
credit_utilization_ratio
This allows multiple models to reuse the same features.
4️⃣ Model Development & Training
Data scientists use ML frameworks to train models.
Typical tools:
Python
TensorFlow
PyTorch
Scikit-learn
Spark ML.
Example fraud detection model:
Input: transaction features
Algorithm: Gradient Boosting
Output: fraud_probability
Training usually runs on:
GPU clusters
cloud ML platforms.
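A toy training run for the fraud model described above, using scikit-learn’s GradientBoostingClassifier. The features match the feature-store examples earlier, but the data and the label rule are synthetic, invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 1000
# Synthetic features: transaction_amount, transactions_last_24_hours,
# device_change_flag — the same features named in the Feature Store section.
X = np.column_stack([
    rng.exponential(100, n),   # transaction_amount
    rng.poisson(3, n),         # transactions_last_24_hours
    rng.integers(0, 2, n),     # device_change_flag
])
# Toy label rule: a large amount on a changed device tends to be fraud.
y = ((X[:, 0] > 300) & (X[:, 2] == 1)).astype(int)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# Score one incoming transaction (high amount, many recent txns, new device).
fraud_probability = model.predict_proba([[450.0, 8, 1]])[0, 1]
print(round(fraud_probability, 2))
```

In production the trained model object would be pushed to the model registry (next section), not kept in the training notebook.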
5️⃣ Model Registry
Once trained, models must be versioned and governed.
A Model Registry stores:
model versions
training data reference
performance metrics
approval status.
Example:
Fraud_Model_v1
Fraud_Model_v2
Fraud_Model_v3
The registry ensures:
traceability
auditability
controlled deployment.
Typical tools:
MLflow Model Registry
SageMaker Model Registry.
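The bookkeeping a registry does can be sketched in a few lines. This in-memory ModelRegistry class is illustrative only and does not reflect MLflow’s or SageMaker’s actual APIs; class names, fields, and the storage path are all invented.

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    name: str
    version: int
    metrics: dict
    training_data_ref: str   # reference back to the training dataset
    approved: bool = False   # governance gate before deployment

class ModelRegistry:
    """Toy registry: versioning, metrics, and approval status per model."""
    def __init__(self):
        self._versions = {}

    def register(self, name, metrics, training_data_ref):
        version = len([k for k in self._versions if k[0] == name]) + 1
        mv = ModelVersion(name, version, metrics, training_data_ref)
        self._versions[(name, version)] = mv
        return mv

    def approve(self, name, version):
        self._versions[(name, version)].approved = True

    def latest_approved(self, name):
        approved = [v for (n, _), v in self._versions.items()
                    if n == name and v.approved]
        return max(approved, key=lambda v: v.version, default=None)

registry = ModelRegistry()
registry.register("fraud_model", {"auc": 0.91}, "s3://lake/fraud/2024-01")
v2 = registry.register("fraud_model", {"auc": 0.94}, "s3://lake/fraud/2024-02")
registry.approve("fraud_model", v2.version)
print(registry.latest_approved("fraud_model").version)  # 2
```

Note how deployment reads only latest_approved — that is the traceability and controlled-deployment guarantee the registry provides.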
6️⃣ CI/CD for Machine Learning
Enterprises implement MLOps pipelines similar to software DevOps.
Purpose:
automate model testing
automate deployment
ensure reliability.
Example pipeline:
Model Training → Model Testing → Model Approval → Deployment Pipeline
Tools used:
Jenkins
GitHub Actions
Azure ML pipelines.
This enables continuous model improvement.
7️⃣ Model Serving (Inference)
After deployment, models are exposed through APIs.
Example:
Fraud Detection API: POST /predictFraud
Input: transaction details
Output: fraud_probability = 0.92
Deployment options:
real-time API inference
batch prediction
streaming inference.
Example:
Digital Payment → Fraud API → Decision Engine
This allows real-time AI decisions.
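A hypothetical handler behind the /predictFraud endpoint above. The scoring rule here is a stand-in: a deployed service would load the approved model from the registry and fetch features from the online feature store rather than hard-coding thresholds.

```python
# Illustrative request handler; in production this would sit behind an
# API gateway and invoke the registered ML model, not a fixed rule.
def predict_fraud(transaction: dict) -> dict:
    score = 0.1  # baseline risk
    if transaction.get("amount", 0) > 300:
        score += 0.5  # unusually large amount
    if transaction.get("device_change_flag"):
        score += 0.3  # transaction from a new device
    return {"fraud_probability": round(min(score, 1.0), 2)}

response = predict_fraud({"amount": 450, "device_change_flag": True})
print(response)  # {'fraud_probability': 0.9}
```

The decision engine downstream would then approve, block, or step-up-authenticate based on this score.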
8️⃣ Monitoring & Feedback Loop
AI models must be monitored after deployment.
Important metrics:
prediction accuracy
model drift
data drift.
Example:
If customer behavior changes, the model may become inaccurate.
Monitoring triggers model retraining.
Pipeline:
Model Monitoring → Performance Alert → Retrain Model
This keeps AI models reliable over time.
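A minimal drift check along these lines compares the live mean of a feature against its training baseline; the threshold and the choice of metric are illustrative (real monitoring uses statistical tests such as PSI or KS over full distributions).

```python
import statistics

def needs_retraining(baseline_mean, live_values, threshold=0.25):
    """Flag retraining when a feature's live mean shifts too far
    from its training-time baseline (a simple data-drift signal)."""
    live_mean = statistics.mean(live_values)
    relative_shift = abs(live_mean - baseline_mean) / baseline_mean
    return relative_shift > threshold

# Baseline from training data: average transaction amount was 100.
print(needs_retraining(100.0, [98, 102, 101, 99]))     # False — stable
print(needs_retraining(100.0, [160, 170, 150, 165]))   # True — drifted
```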
Example Banking Use Cases on This Platform
Using this architecture, enterprises implement:
Fraud Detection
Real-time ML model analyzing transactions.
Credit Risk Scoring
Predict loan default probability.
Personalized Offers
AI recommends products.
Customer Churn Prediction
Predict customers likely to leave.
Enterprise AI platforms typically follow a structured architecture where data from enterprise systems is ingested into a data lake, transformed into reusable features in a feature store, and used by data scientists to train machine learning models. These models are versioned in a model registry, deployed through CI/CD pipelines, and exposed through APIs for real-time or batch inference, with continuous monitoring to ensure model performance and governance.
✅ This answer signals you understand:
AI architecture
data engineering
MLOps
enterprise AI governance
— which is very valuable in digital transformation discussions.
Let’s extend the AI platform architecture to include GenAI (Large Language Models) because many enterprises are now adding GenAI capabilities on top of their existing AI platforms.
A modern Enterprise GenAI Architecture looks like this:
Enterprise Data Sources → Data Lake / Data Platform → Data Governance & Security → Embedding Pipeline → Vector Database → LLM Gateway / Prompt Layer → RAG (Retrieval Augmented Generation) → Application APIs → Monitoring & Guardrails
Now let’s walk through this step by step, the way you can explain it in interviews.
1️⃣ Enterprise Data Sources
GenAI systems need enterprise knowledge.
Typical sources in banks:
policy documents
customer communication history
loan agreements
knowledge base articles
support tickets
product documentation
Example:
Loan policy documents
Credit card rules
Customer service FAQs
Fraud investigation reports
This data usually resides in:
SharePoint
Document management systems
databases
data lakes.
2️⃣ Data Lake / Data Platform
All enterprise data is stored in a central data platform.
Purpose:
unify enterprise data
enable analytics
feed AI/GenAI systems.
Typical platforms:
Azure Data Lake
AWS S3
GCP BigQuery
Snowflake.
3️⃣ Data Governance & Security
Before GenAI uses enterprise data, governance ensures:
sensitive data protection
regulatory compliance
role-based access.
Example controls:
data classification
masking of PII
access control policies.
This is critical in BFSI environments.
4️⃣ Embedding Pipeline
LLMs cannot directly search enterprise documents.
Documents must be converted into vector embeddings.
Process:
Document → Text Chunking → Embedding Model → Vector Representation
Example:
A document paragraph becomes a numerical vector.
Tools used:
OpenAI embeddings
Azure OpenAI embeddings
HuggingFace models.
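The chunking step can be sketched as follows. Here toy_embedding is a deterministic stand-in for a real embedding model call (e.g. an Azure OpenAI embeddings request); the chunk sizes and the sample document are illustrative.

```python
import hashlib

def chunk_text(text, chunk_size=200, overlap=50):
    """Split a document into overlapping character chunks so that
    context is not lost at chunk boundaries."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def toy_embedding(chunk, dim=8):
    """Deterministic stand-in for an embedding model: maps text to a
    fixed-length numeric vector. Real embeddings capture semantics."""
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:dim]]

doc = ("Home loan eligibility: salaried customers need two years "
       "of continuous employment. ") * 5
vectors = [toy_embedding(c) for c in chunk_text(doc)]
print(len(vectors[0]))  # 8 — every chunk becomes a fixed-length vector
```

Each (chunk, vector) pair is what gets written to the vector database in the next step.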
5️⃣ Vector Database
These embeddings are stored in a vector database.
Purpose:
enable semantic search
retrieve relevant documents quickly.
Examples:
Pinecone
Weaviate
FAISS
Azure AI Search.
Example query:
User question: "Loan eligibility for salaried customer?"
The vector DB retrieves relevant loan policy documents.
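Semantic retrieval itself reduces to nearest-neighbor search over vectors. A toy version with hand-made 3-dimensional embeddings (a real system would use FAISS, Pinecone, or Azure AI Search over high-dimensional model embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative document embeddings (in reality: hundreds of dimensions).
documents = {
    "loan_policy":  [0.9, 0.1, 0.0],
    "card_rules":   [0.1, 0.9, 0.1],
    "fraud_report": [0.0, 0.2, 0.9],
}

# Pretend embedding of "Loan eligibility for salaried customer?"
query_vector = [0.8, 0.2, 0.1]
best = max(documents, key=lambda name: cosine(query_vector, documents[name]))
print(best)  # loan_policy
```

The retrieved document text (not the vector) is what gets passed to the LLM in the RAG step.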
6️⃣ LLM Gateway / Prompt Layer
This layer manages interaction with LLM models.
Responsibilities:
prompt management
request routing
model selection
rate limiting.
Example models:
GPT models
Llama models
enterprise fine-tuned models.
Example prompt:
Answer the customer query using the following bank policy documents.
7️⃣ RAG (Retrieval Augmented Generation)
This is the most common enterprise GenAI pattern.
RAG combines:
vector search
LLM generation.
Flow:
User Question → Vector Search retrieves relevant documents → Documents + Prompt sent to LLM → LLM generates contextual answer
This ensures:
answers are based on enterprise knowledge
hallucination risk is reduced.
Example use cases:
customer support bots
employee knowledge assistants
compliance advisors.
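The core of the RAG step is prompt assembly: retrieved documents are stitched into the prompt before the LLM call. A sketch, where the instructions and sample document are illustrative and the LLM call itself is omitted:

```python
def build_rag_prompt(question, retrieved_docs):
    """Combine retrieved enterprise documents with the user question
    so the LLM answers from grounded context, reducing hallucination."""
    context = "\n\n".join(f"[Document {i + 1}]\n{doc}"
                          for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the customer query using only the bank policy documents below.\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "If the documents do not contain the answer, say so."
    )

docs = ["Salaried customers need two years of continuous employment "
        "for a home loan."]
prompt = build_rag_prompt("What is the eligibility for a home loan?", docs)
print("[Document 1]" in prompt)  # True
```

The final instruction ("say so" when context is missing) is a common anti-hallucination pattern.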
8️⃣ Application APIs
The GenAI capability is exposed through enterprise applications.
Examples:
mobile banking chatbot
call center assistant
banker productivity copilots.
Example API:
POST /askAI
Input: customer question
Output: AI-generated response
9️⃣ Monitoring & Guardrails
Enterprises must monitor GenAI systems carefully.
Important controls:
hallucination monitoring
toxicity filtering
response validation
usage monitoring.
Example guardrails:
PII detection
prompt injection protection
content filtering
Real Banking GenAI Use Cases
Customer Support Assistant
AI answers banking questions instantly.
Relationship Manager Copilot
AI suggests investment products.
Fraud Investigation Assistant
AI summarizes suspicious transactions.
Document Processing
AI extracts information from loan documents.
Enterprises extend their AI platforms with GenAI capabilities by building an architecture that includes enterprise data platforms, embedding pipelines, vector databases, and LLM gateways. Using a retrieval-augmented generation approach, relevant enterprise data is retrieved and provided to large language models to generate contextual responses, while governance and monitoring ensure security and compliance.
✅ This answer shows:
AI + GenAI architecture understanding
enterprise data governance awareness
modern AI platform thinking
“What is the difference between RAG and Fine-Tuning?”
Both approaches help adapt Large Language Models (LLMs) to enterprise knowledge, but they work very differently.
1️⃣ Retrieval Augmented Generation (RAG)
RAG means the model retrieves relevant enterprise data at runtime and uses it to generate answers.
Architecture
User Question → Vector Search (find relevant documents) → Documents + Prompt → LLM → Generated Answer
Example
Customer asks:
"What is the eligibility for a home loan?"
System process:
Query goes to vector database
Relevant loan policy documents are retrieved
Documents are sent to the LLM
LLM generates answer based on those documents.
Key Characteristics
Model is not retrained
Uses external knowledge sources
Easy to update knowledge by adding new documents
Very popular for enterprise knowledge assistants
Banking Use Cases
customer support chatbot
employee knowledge assistant
policy lookup systems
compliance advisory tools.
2️⃣ Fine-Tuning
Fine-tuning means training the LLM further using domain-specific datasets so it learns new patterns.
Architecture
Training Dataset → Fine-Tuning Process → Updated Model → Inference
Example dataset:
Customer queries + correct responses
Loan approval examples
Fraud case analysis
After training, the model internalizes the knowledge.
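Fine-tuning data is typically prepared as prompt/completion pairs. A sketch that serializes illustrative examples to JSONL, the format many training platforms accept (exact field names vary by provider):

```python
import json

# Invented training examples for a banking assistant fine-tune.
examples = [
    {"prompt": "Customer asks about a failed UPI payment.",
     "completion": "Apologize, confirm the transaction reference, "
                   "and explain the auto-reversal timeline."},
    {"prompt": "Customer disputes a card charge.",
     "completion": "Verify the transaction, raise a dispute case, "
                   "and block the card if requested."},
]

# JSONL: one JSON object per line, the common fine-tuning file format.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl.count("\n") + 1)  # 2 records
```

Unlike RAG, once this dataset is trained into the model, updating the knowledge means rebuilding the dataset and retraining.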
Key Characteristics
Requires training process
Changes model behavior permanently
More expensive and complex
Harder to update frequently.
Banking Use Cases
fraud detection language models
customer conversation assistants
document classification models.
3️⃣ Key Differences
| Aspect | RAG | Fine-Tuning |
| --- | --- | --- |
| Model training | No | Yes |
| Knowledge source | External documents | Embedded in model |
| Updates | Easy (update documents) | Requires retraining |
| Cost | Lower | Higher |
| Best for | Knowledge retrieval | Behavior customization |
4️⃣ What Enterprises Usually Do
Most enterprises combine both approaches.
Typical pattern:
Base LLM → Fine-tuned for enterprise tone → RAG used for enterprise knowledge
This gives:
accurate responses
domain understanding
access to updated data.
5️⃣ Interview Summary
Retrieval Augmented Generation retrieves relevant enterprise data at runtime and provides it to the LLM to generate accurate responses, while fine-tuning modifies the model itself by training it on domain-specific datasets. In most enterprise implementations, RAG is preferred for knowledge access because it allows frequent updates without retraining the model.
6️⃣ Key Insight
RAG separates knowledge from the model, making enterprise AI systems more scalable, maintainable, and compliant with governance requirements.
✅ This shows interviewers that you understand:
modern GenAI architecture
enterprise AI governance
practical implementation patterns
which is very valuable for enterprise architecture roles.
AI Copilot Architecture (used for banker assistants and developer copilots).
Let’s look at AI Copilot Architecture, which many enterprises (especially banks) are implementing now for employee productivity and customer service.
Examples include:
Banker assistant
Customer service copilot
Developer copilot
Fraud investigation assistant
These copilots help employees query enterprise data using natural language.
Enterprise AI Copilot Architecture
A typical architecture looks like this:
User (Employee / Banker / Developer) → Enterprise Application (Web / Mobile / CRM) → Copilot Service Layer → Prompt Orchestration Layer → RAG Pipeline → Vector Database + Enterprise APIs → Enterprise Data Platform → Large Language Model → Response + Action
Now let’s walk through the important components.
1️⃣ User Interface Layer
Employees interact with the copilot through:
CRM systems
internal banking portals
developer IDE tools
mobile apps.
Example query:
"Show me the risk profile of this customer"
or
"Summarize this loan application"
2️⃣ Copilot Service Layer
This layer manages:
conversation context
authentication
session management
integration with enterprise systems.
It ensures the AI works securely within enterprise workflows.
3️⃣ Prompt Orchestration Layer
This is a very important layer.
It builds the prompt dynamically by combining:
user question
relevant data
system instructions.
Example prompt:
You are a banking assistant. Answer using the loan policy documents. Do not reveal confidential information.
This layer ensures controlled AI responses.
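A sketch of such an orchestration layer, merging system rules, retrieved context, and the user question into one controlled prompt. The system rules, role names, and access check are illustrative:

```python
# System instructions applied to every copilot request (illustrative).
SYSTEM_RULES = (
    "You are a banking assistant. "
    "Answer using the loan policy documents. "
    "Do not reveal confidential information."
)

def orchestrate_prompt(user_question, context_docs, user_role):
    """Build the final prompt, enforcing a simple role check first."""
    if user_role not in {"banker", "service_agent"}:
        raise PermissionError("Role not authorized for copilot access")
    context = "\n".join(context_docs)
    return f"{SYSTEM_RULES}\n\nContext:\n{context}\n\nQuestion: {user_question}"

prompt = orchestrate_prompt(
    "Summarize this loan application",
    ["Application: home loan, INR 5,000,000, salaried applicant."],
    user_role="banker",
)
print(prompt.startswith("You are a banking assistant."))  # True
```

Because every request passes through this one function, the enterprise can change guardrail wording or access rules in a single place.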
4️⃣ RAG Pipeline (Enterprise Knowledge Retrieval)
Copilots usually rely on RAG architecture.
Flow:
User Query → Vector Search → Relevant Enterprise Documents → LLM → Context-Aware Response
This ensures answers are based on enterprise knowledge, not just model training.
5️⃣ Vector Database
Stores embeddings of enterprise documents.
Examples:
product manuals
policy documents
internal knowledge bases
fraud investigation reports.
Popular technologies:
Pinecone
Azure AI Search
Weaviate
FAISS.
6️⃣ Enterprise Data & API Integration
Copilots often connect to live enterprise systems.
Examples:
core banking APIs
CRM systems
transaction databases
risk scoring systems.
Example query:
"Show last 10 transactions for this account"
The copilot can call backend APIs to fetch real-time data.
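A toy version of that API routing. Intent detection (which the LLM or a classifier would handle) is skipped, and the tool names and the fake core-banking call are hypothetical:

```python
def fetch_last_transactions(account_id, limit=10):
    """Placeholder for a real core-banking API call."""
    return [{"account": account_id, "txn": i} for i in range(limit)]

# Registry mapping recognized intents to backend tools (illustrative).
TOOLS = {"last_transactions": fetch_last_transactions}

def handle_query(intent, **kwargs):
    """Route a detected intent to its backend tool, or fail safely."""
    if intent not in TOOLS:
        return {"error": "No backend tool for this request"}
    return TOOLS[intent](**kwargs)

result = handle_query("last_transactions", account_id="AC-1001", limit=10)
print(len(result))  # 10
```

Keeping live-data access behind an explicit tool registry (rather than letting the LLM call arbitrary APIs) is a common security pattern for copilots in regulated environments.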
7️⃣ Large Language Model
The LLM performs:
natural language understanding
summarization
reasoning
response generation.
Enterprises typically use:
GPT models
Llama models
enterprise fine-tuned models.
8️⃣ Guardrails & Security
Very important in banking environments.
Security controls include:
PII protection
access control
prompt injection protection
content filtering.
Example rule:
Customer data visible only to authorized banker roles
9️⃣ Monitoring & Feedback
Enterprises monitor:
hallucinations
response quality
model usage
compliance violations.
Feedback is used to improve prompts and models.
Real Banking Copilot Examples
Banker Copilot
Helps relationship managers:
understand customer profiles
suggest financial products
summarize transactions.
Customer Service Copilot
Helps agents:
answer customer queries faster
retrieve policy information
resolve issues quickly.
Fraud Investigation Copilot
Helps fraud teams:
analyze suspicious transactions
summarize investigation reports.
Enterprise AI copilots are typically built using a retrieval-augmented architecture where user queries are processed through a prompt orchestration layer, relevant enterprise data is retrieved from vector databases or enterprise APIs, and large language models generate contextual responses. Security guardrails, governance controls, and monitoring ensure the system operates safely in regulated environments.
✅ This answer signals that you understand:
GenAI enterprise architecture
RAG-based systems
secure AI implementation
real business use cases
“How GenAI fits into Digital Transformation Architecture.”

GenAI + Enterprise Cloud + Data Modernization Architecture
1️⃣ Business Layer
Use Cases / Outcomes
Real-time fraud detection
Personalized financial advice
Customer support automation (chatbots)
KPIs: Fraud loss %, STP %, customer satisfaction, cost savings
2️⃣ Data Layer
Sources: Core banking (on-prem), CRM, transactions, external market data
Processing: Raw → Curated → Analytics → Feature Store
Offline Feature Store: Used for model training
Online Feature Store: Used for real-time inference
Governance: Data masking, PII compliance, audit trails
3️⃣ AI/ML Layer
Model Training Pipelines
Offline batch training
Continuous retraining with new patterns
Inference Pipelines
Real-time scoring via online feature store
Synchronous (critical decisions) + Asynchronous (analytics)
Fallback Controls: Rule-based risk mitigation for unknown patterns
4️⃣ Platform Layer
Hybrid Cloud Architecture
Primary cloud: Azure
DR / secondary: GCP
On-prem integration for regulated core systems
Services: API Gateway, Microservices, Streaming (Kafka), Load Balancer
Monitoring: Latency, throughput, model drift, system health
5️⃣ Governance Layer
Architecture Governance: EA office, domain architects, delivery councils
Model Governance: Version control, bias/explainability checks, regulatory compliance
Operational Governance: CI/CD, automated deployment pipelines, rollback strategy
Innovation Enablement: Sandbox environments, CoE for AI & Cloud
6️⃣ Roadmap & Scaling
Phase 1: Pilot high-value, low-risk use cases (fraud, chatbot)
Phase 2: Scale to credit risk, wealth advisory, analytics
Phase 3: Reusable frameworks, accelerators, and enterprise-wide CoE
Outcome: Scalable, compliant, business-driven, AI-enabled enterprise platform
Presentation Tip
Start top-down: business objectives → data → AI → platform → governance → roadmap
Highlight measurable business outcomes for each layer
Emphasize hybrid cloud, governance, and fallback controls for risk-aware innovation