Synthetic Data???
- Anand Nerurkar
- Oct 2
What is Synthetic Data?
Definition: Synthetic data is artificially generated data that looks and behaves like real data but does not come from actual people or systems.
It is created using algorithms, rules, or AI models, not collected from real-world transactions.
Key Characteristics
Artificial – generated by software, not captured directly.
Realistic – mimics the structure, format, and statistical properties of actual data.
Safe – contains no real Personally Identifiable Information (PII), so it can be used for testing, training, and development with far lower privacy risk than real data, provided the generator does not reproduce real records.
Customizable – can be tailored for specific scenarios (e.g., fraud, KYC errors, loan defaults).
Why It’s Important
Privacy & Compliance: Real KYC/loan data is sensitive → using it in testing or AI training may violate laws (GDPR, DPDP Act, RBI/SEBI guidelines).
Testing Systems: Developers need realistic records (PAN, Aadhaar, addresses, income) to test banking platforms safely.
AI/ML Training: Machine learning models need large datasets → synthetic data can provide this without exposing real customers.
Scalability: You can create millions of records on demand for stress testing.
Example in Banking (Synthetic KYC Record)
Customer Name: Priya Sharma
DOB: 1988-07-22
PAN: ABCDE1234F (synthetic format, not real)
Aadhaar: 567812349876 (synthetic 12-digit)
Address: 42 MG Road, Pune, Maharashtra - 411001
Phone: 9876543210
Employment: Salaried
Annual Income: ₹8,50,000
Looks realistic, but belongs to no actual person.
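A record like the one above can be produced with a few rule-based generators. The following is a minimal sketch, not a production tool: the name and city pools, the function names, and the income range are all hypothetical, and the PAN and Aadhaar values only match the visible shape (10 and 12 characters) without any checksum or issuer rules. Random strings can also coincide with identifiers issued to real people, so a real generator would need extra safeguards.

```python
import random
import string
from datetime import date, timedelta

# Hypothetical sample pools; a real generator would use much larger lists.
FIRST_NAMES = ["Priya", "Kavita", "Arjun", "Neha", "Rohan"]
LAST_NAMES = ["Sharma", "Mehta", "Iyer", "Patil", "Khan"]
CITIES = [("Pune", "Maharashtra", "411001"), ("Bengaluru", "Karnataka", "560001")]


def synthetic_pan(rng: random.Random) -> str:
    """PAN-shaped string: 5 letters, 4 digits, 1 letter (shape only)."""
    letters = "".join(rng.choices(string.ascii_uppercase, k=5))
    digits = "".join(rng.choices(string.digits, k=4))
    return letters + digits + rng.choice(string.ascii_uppercase)


def synthetic_aadhaar(rng: random.Random) -> str:
    """12 random digits; no checksum is computed, so this is shape only."""
    return "".join(rng.choices(string.digits, k=12))


def synthetic_kyc_record(rng: random.Random) -> dict:
    """Build one fake KYC record from the pools and generators above."""
    city, state, pin = rng.choice(CITIES)
    dob = date(1960, 1, 1) + timedelta(days=rng.randrange(0, 45 * 365))
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "dob": dob.isoformat(),
        "pan": synthetic_pan(rng),
        "aadhaar": synthetic_aadhaar(rng),
        "address": f"{rng.randint(1, 200)} MG Road, {city}, {state} - {pin}",
        # Assumption: 10-digit mobile number starting with 6, 7, 8, or 9.
        "phone": rng.choice("6789") + "".join(rng.choices(string.digits, k=9)),
        "employment": rng.choice(["Salaried", "Self-employed"]),
        "annual_income_inr": rng.randrange(300_000, 2_500_000, 50_000),
    }


# Seeding the generator makes the test data repeatable across runs.
record = synthetic_kyc_record(random.Random(42))
print(record)
```

Passing a seeded `random.Random` instance instead of using the module-level functions keeps test fixtures reproducible, which helps when a failing test needs to be replayed.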
✅ In simple words: Synthetic data = fake but realistic data created for safe testing, analytics, and AI training, especially when real data is too sensitive, unavailable, or risky to use.
Synthetic vs Real vs Anonymized Data
| Feature / Aspect | Real Data | Anonymized Data | Synthetic Data |
| --- | --- | --- | --- |
| Source | Collected from actual people, systems, and transactions | Real data with PII masked, encrypted, or removed | Generated artificially using rules, algorithms, or AI |
| Contains PII? | ✅ Yes (PAN, Aadhaar, Name, Phone, etc.) | ❌ Removed/Masked but may still carry re-identification risk | ❌ No actual PII (completely artificial) |
| Realism | 100% real | Very high (since based on real data) | High (mimics patterns, distributions, formats) |
| Privacy Risk | 🚨 High (regulatory/compliance risk if leaked) | ⚠️ Medium (can sometimes be reverse-engineered) | ✅ Low (no link to real people) |
| Regulatory Safe? | ❌ Not for dev/test/AI training | ⚠️ Safer, but must ensure irreversible anonymization | ✅ Safe for dev, test, AI training |
| Use Cases | Production systems, live analytics, actual customer services | Analytics, limited testing, compliance reporting | Testing, AI/ML model training, GenAI RAG, performance/stress testing |
| Scalability | Limited (depends on what’s collected) | Limited to size of real dataset | Unlimited (can generate millions of records) |
| Cost | High (collection, storage, compliance overhead) | Medium (extra processing for anonymization) | Low (algorithmic generation at scale) |
| Bias | May contain historical/social bias | Bias is preserved from original dataset | Can reduce bias by generating balanced synthetic samples |
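The "balanced synthetic samples" idea in the Bias row can be illustrated with a small generator that fixes the class mix up front. This is only a sketch under stated assumptions: `synthetic_loans`, the field names, and the amount range are hypothetical, and the loan amount is drawn independently of the outcome, so the data has no real predictive signal.

```python
import random
from collections import Counter


def synthetic_loans(n: int, default_rate: float, seed: int = 0) -> list[dict]:
    """Generate n loan records with exactly round(n * default_rate) defaults."""
    rng = random.Random(seed)
    n_default = round(n * default_rate)
    labels = ["default"] * n_default + ["repaid"] * (n - n_default)
    rng.shuffle(labels)  # avoid all defaults appearing first
    return [
        {
            "loan_id": f"LN{i:06d}",
            "amount_inr": rng.randrange(50_000, 2_000_000, 10_000),
            "outcome": label,
        }
        for i, label in enumerate(labels)
    ]


# A 50/50 split, where real loan books usually contain far fewer defaults.
balanced = synthetic_loans(1_000, default_rate=0.5)
print(Counter(row["outcome"] for row in balanced))
```

The same `default_rate` parameter also covers the "customizable" point: raising or lowering it produces datasets tailored to a specific test scenario.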
✅ BFSI Example
Real Data: Rahul Sharma, PAN: AJXPS1234D, Aadhaar: 5678 1234 0987 → Actual customer.
Anonymized Data: Rahul S., PAN: XXXXX1234D, Aadhaar: XXXX XXXX 0987 → Masked but still based on real person.
Synthetic Data: Kavita Mehta, PAN: ABCDE4567F, Aadhaar: 9876 5432 1234 → Fake but realistic, no real identity.
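The masking shown in the anonymized example can be sketched in a few lines. The helper names `mask_pan` and `mask_aadhaar` are hypothetical, and the rule (hide everything except the trailing characters) is simply read off the example above rather than taken from any standard.

```python
def mask_pan(pan: str) -> str:
    """Mask the first five characters and keep the last five."""
    return "X" * 5 + pan[5:]


def mask_aadhaar(aadhaar: str) -> str:
    """Keep the last four digits, mask earlier digits, preserve spacing."""
    digits_seen = 0
    out = []
    for ch in reversed(aadhaar):
        if ch.isdigit():
            digits_seen += 1
            out.append(ch if digits_seen <= 4 else "X")
        else:
            out.append(ch)  # keep separators such as spaces
    return "".join(reversed(out))


print(mask_pan("AJXPS1234D"))          # XXXXX1234D
print(mask_aadhaar("5678 1234 0987"))  # XXXX XXXX 0987
```

The visible suffix still comes from a real person, which is why masked data keeps some re-identification risk while a fully synthetic value does not.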
👉 In interviews, you can phrase it as:
Real data is production-only.
Anonymized data is good for reporting/analytics but still risky.
Synthetic data is best for testing, AI/ML training, and PoCs, especially in regulated domains like banking, insurance, or healthcare.
Data Flow: Real vs Anonymized vs Synthetic KYC Data
┌───────────────────────────────────┐
│ Real KYC Data                     │
│ (PAN, Aadhaar, PII)               │
└─────────────────┬─────────────────┘
                  │
                  ▼
┌───────────────────────────────────┐
│ Anonymization Layer               │
│ - Masking (XXXX1234)              │
│ - Tokenization (hash IDs)         │
│ - Redaction (remove fields)       │
└─────────────────┬─────────────────┘
                  │
                  ▼
┌───────────────────────────────────┐
│ Anonymized Data Product           │
│ (Still based on real people)      │
│ Used for:                         │
│ - Analytics                       │
│ - Compliance reports              │
└─────────────────┬─────────────────┘
                  │
                  ▼
┌───────────────────────────────────┐
│ Synthetic Data Generator          │
│ - Rule-based (PAN, Aadhaar fmt)   │
│ - AI-based (names, addresses)     │
│ - Noise injection (typos, errors) │
│ - Balanced distributions          │
└─────────────────┬─────────────────┘
                  │
                  ▼
┌───────────────────────────────────┐
│ Synthetic Data Product            │
│ (No link to real people)          │
│ Used for:                         │
│ - Dev/Test of microservices       │
│ - Stress testing APIs/DBs         │
│ - AI/ML model training            │
│ - Vector DB embeddings for RAG    │
└─────────────────┬─────────────────┘
                  │
                  ▼
┌───────────────────────────────────┐
│ Data Mesh & AI Platform           │
│ - Lending domain owns KYC         │
│ - Vector DB stores embeddings     │
│ - GenAI chatbot answers           │
│ - Risk/Fraud ML models train      │
└───────────────────────────────────┘
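The "Noise injection (typos, errors)" step in the generator box could look roughly like the sketch below. The function names, the three edit types, and the `is_noisy` flag are assumptions made for illustration; the diagram does not specify how noise is actually produced.

```python
import random


def inject_typo(text: str, rng: random.Random) -> str:
    """Apply one random edit: swap adjacent chars, drop a char, or duplicate one."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    kind = rng.choice(["swap", "drop", "dup"])
    if kind == "swap":
        # Swapping two identical neighbours leaves the text unchanged.
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    if kind == "drop":
        return text[:i] + text[i + 1:]
    return text[:i] + text[i] + text[i:]


def add_noise(records: list[dict], field: str, rate: float, seed: int = 0) -> list[dict]:
    """Return copies of records, corrupting `field` in roughly `rate` of them."""
    rng = random.Random(seed)
    noisy = []
    for rec in records:
        rec = dict(rec)  # copy so the clean input is left untouched
        rec["is_noisy"] = rng.random() < rate
        if rec["is_noisy"]:
            rec[field] = inject_typo(rec[field], rng)
        noisy.append(rec)
    return noisy


clean = [{"name": "Kavita Mehta"}, {"name": "Priya Sharma"}]
for row in add_noise(clean, "name", rate=0.5, seed=7):
    print(row)
```

Keeping the `is_noisy` flag alongside each record gives a ground-truth label, which is useful when the noisy data is meant to test validation rules or train an error-detection model.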
✅ How to explain in interviews
Step 1 (Real) → Real KYC data stays in secure production systems.
Step 2 (Anonymized) → Used for internal analytics & reporting, but still tied to real customers.
Step 3 (Synthetic) → Used for AI/ML training, GenAI RAG, microservices testing, because it’s safe, scalable, and PII-free.
Step 4 (Data Mesh + Vector DB) → Synthetic or masked data gets published as a data product, embedded into a Vector DB, and consumed by chatbots, fraud detection, lending analytics.
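For Step 2, the "Tokenization (hash IDs)" idea from the data flow can be sketched with a keyed hash from the Python standard library. This is one possible approach, not necessarily what a given bank uses: the `tokenize` helper, the 16-character truncation, and the demo key are all assumptions.

```python
import hashlib
import hmac


def tokenize(value: str, secret_key: bytes) -> str:
    """Deterministic keyed token: same input and key always give the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)


# Hypothetical key for the demo; real keys belong in a secrets manager.
key = b"demo-key-do-not-use-in-production"

token_a = tokenize("AJXPS1234D", key)
token_b = tokenize("AJXPS1234D", key)
print(token_a == token_b)  # the token works as a stable join key across tables
```

Because the same PAN always maps to the same token, analytics can still join records per customer. That is also why tokenized data remains tied to real customers, as Step 2 notes, while synthetic data in Step 3 is not.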