Synthetic Data???

Anand Nerurkar
Oct 2

What is Synthetic Data?

Definition: Synthetic data is artificially generated data that looks and behaves like real data but does not come from actual people or systems. It is created using algorithms, rules, or AI models rather than collected from real-world transactions.

Key Characteristics

Artificial – generated by software, not captured directly.
Realistic – mimics the structure, format, and statistical properties of actual data.
Safe – contains no real Personally Identifiable Information (PII), so it can be freely used for testing, training, and development.
Customizable – can be tailored for specific scenarios (e.g., fraud, KYC errors, loan defaults).

Why It’s Important

Privacy & Compliance: Real KYC/loan data is sensitive → using it in testing or AI training may violate laws (GDPR, DPDP Act, RBI/SEBI guidelines).
Testing Systems: Developers need realistic records (PAN, Aadhaar, addresses, income) to test banking platforms safely.
AI/ML Training: Machine learning models need large datasets → synthetic data can provide this without exposing real customers.
Scalability: You can create millions of records on demand for stress testing.

Example in Banking (Synthetic KYC Record)

Customer Name: Priya Sharma
DOB: 1988-07-22
PAN: ABCDE1234F   (synthetic format, not real)
Aadhaar: 567812349876 (synthetic 12-digit)
Address: 42 MG Road, Pune, Maharashtra - 411001
Phone: 9876543210
Employment: Salaried
Annual Income: ₹8,50,000

Looks realistic, but belongs to no actual person.
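
A record like the one above can be produced with a few lines of rule-based code. The following is a minimal sketch, not a production generator: the name and city pools, the income range, and the function names (`synthetic_pan`, `synthetic_aadhaar`, `synthetic_kyc_record`) are all assumptions made for illustration. It only reproduces the shape of each field (for example, 5 letters + 4 digits + 1 letter for PAN) and does not model any official validation rules.

```python
import random
import string
from datetime import date

# Small illustrative pools (assumptions, not taken from any real dataset).
FIRST_NAMES = ["Priya", "Kavita", "Anita", "Rohan", "Vikram"]
LAST_NAMES = ["Sharma", "Mehta", "Patil", "Iyer", "Nair"]
CITIES = [
    ("Pune", "Maharashtra", "411001"),
    ("Mumbai", "Maharashtra", "400001"),
    ("Bengaluru", "Karnataka", "560001"),
]


def synthetic_pan(rng):
    """Shape only: 5 letters + 4 digits + 1 letter, like ABCDE1234F."""
    head = "".join(rng.choices(string.ascii_uppercase, k=5))
    digits = "".join(rng.choices(string.digits, k=4))
    return head + digits + rng.choice(string.ascii_uppercase)


def synthetic_aadhaar(rng):
    """Shape only: 12 digits. No check digit is computed in this sketch."""
    return rng.choice("123456789") + "".join(rng.choices(string.digits, k=11))


def synthetic_kyc_record(rng):
    """Build one fake KYC record as a plain dict."""
    city, state, pin = rng.choice(CITIES)
    dob = date(rng.randint(1960, 2000), rng.randint(1, 12), rng.randint(1, 28))
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "dob": dob.isoformat(),
        "pan": synthetic_pan(rng),
        "aadhaar": synthetic_aadhaar(rng),
        "address": f"{rng.randint(1, 200)} MG Road, {city}, {state} - {pin}",
        "phone": rng.choice("6789") + "".join(rng.choices(string.digits, k=9)),
        "employment": rng.choice(["Salaried", "Self-Employed"]),
        "annual_income": rng.randrange(300_000, 2_500_000, 50_000),
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so test data is reproducible
    for _ in range(3):
        print(synthetic_kyc_record(rng))
```

Seeding the random generator keeps test runs repeatable, and the same function can be called in a loop to produce as many records as a stress test needs.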

✅ In simple words: Synthetic data = fake but realistic data created for safe testing, analytics, and AI training, especially when real data is too sensitive, unavailable, or risky to use.

Synthetic vs Real vs Anonymized Data

Feature / Aspect	Real Data	Anonymized Data	Synthetic Data
Source	Collected from actual people, systems, and transactions	Real data with PII masked, encrypted, or removed	Generated artificially using rules, algorithms, or AI
Contains PII?	✅ Yes (PAN, Aadhaar, Name, Phone, etc.)	❌ Removed/Masked but may still carry re-identification risk	❌ No actual PII (completely artificial)
Realism	100% real	Very high (since based on real data)	High (mimics patterns, distributions, formats)
Privacy Risk	🚨 High (regulatory/compliance risk if leaked)	⚠️ Medium (can sometimes be reverse-engineered)	✅ Low (no link to real people)
Regulatory Safe?	❌ Not for dev/test/AI training	⚠️ Safer, but must ensure irreversible anonymization	✅ Safe for dev, test, AI training
Use Cases	Production systems, live analytics, actual customer services	Analytics, limited testing, compliance reporting	Testing, AI/ML model training, GenAI RAG, performance/stress testing
Scalability	Limited (depends on what’s collected)	Limited to size of real dataset	Unlimited (can generate millions of records)
Cost	High (collection, storage, compliance overhead)	Medium (extra processing for anonymization)	Low (algorithmic generation at scale)
Bias	May contain historical/social bias	Bias is preserved from original dataset	Can reduce bias by generating balanced synthetic samples
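
The "balanced synthetic samples" idea in the Bias row can be sketched as follows. This is only an illustration under stated assumptions: `make_loan_record` and its income bands are hypothetical rules invented for the example, not a real credit model. The point is that the generator, not history, decides how many records each class gets.

```python
import random
from collections import Counter


def make_loan_record(outcome, rng):
    """Hypothetical rule: pick an income band depending on the outcome label."""
    if outcome == "default":
        income = rng.randrange(200_000, 900_000, 10_000)
    else:
        income = rng.randrange(300_000, 2_000_000, 10_000)
    return {"annual_income": income, "outcome": outcome}


def generate_balanced(n_per_class, outcomes, rng):
    """Generate the same number of synthetic records for every outcome."""
    records = [
        make_loan_record(outcome, rng)
        for outcome in outcomes
        for _ in range(n_per_class)
    ]
    rng.shuffle(records)  # avoid all records of one class sitting together
    return records


if __name__ == "__main__":
    rng = random.Random(0)
    data = generate_balanced(500, ["repaid", "default"], rng)
    print(Counter(r["outcome"] for r in data))
```

In real loan books, defaults are usually a small minority, so a model trained on raw history sees few examples of them. Generating equal counts per class is one simple way to give the model enough of the rare case.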

✅ BFSI Example

Real Data: Rahul Sharma, PAN: AJXPS1234D, Aadhaar: 5678 1234 0987 → Actual customer.
Anonymized Data: Rahul S., PAN: XXXXX1234D, Aadhaar: XXXX XXXX 0987 → Masked but still based on real person.
Synthetic Data: Kavita Mehta, PAN: ABCDE4567F, Aadhaar: 9876 5432 1234 → Fake but realistic, no real identity.
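
The masking shown in the anonymized line can be reproduced with simple string rules. This is a minimal sketch, and the function names (`mask_name`, `mask_pan`, `mask_aadhaar`) are assumptions for illustration. Note that the visible last characters still come from the real record, which is why masked data keeps some re-identification risk.

```python
def mask_name(name):
    """Keep the first name and reduce the surname to an initial."""
    first, _, last = name.partition(" ")
    return f"{first} {last[:1]}." if last else first


def mask_pan(pan):
    """Hide the first five characters, keep the last five."""
    return "X" * 5 + pan[5:]


def mask_aadhaar(aadhaar):
    """Hide the first eight digits, keep the last four."""
    digits = aadhaar.replace(" ", "")
    return "XXXX XXXX " + digits[-4:]


if __name__ == "__main__":
    print(mask_name("Rahul Sharma"))       # Rahul S.
    print(mask_pan("AJXPS1234D"))          # XXXXX1234D
    print(mask_aadhaar("5678 1234 0987"))  # XXXX XXXX 0987
```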

👉 In interviews, you can phrase it as:

Real data is production-only.
Anonymized data is good for reporting/analytics but still risky.
Synthetic data is best for testing, AI/ML training, and PoCs, especially in regulated domains like banking, insurance, or healthcare.

Data Flow: Real vs Anonymized vs Synthetic KYC Data

                ┌────────────────────┐
                │   Real KYC Data    │
                │ (PAN, Aadhaar, PII)│
                └─────────┬──────────┘
                          │
                          ▼
          ┌─────────────────────────────────┐
          │   Anonymization Layer           │
          │ - Masking (XXXX1234)            │
          │ - Tokenization (hash IDs)       │
          │ - Redaction (remove fields)     │
          └─────────┬───────────────────────┘
                    │
                    ▼
          ┌─────────────────────────────────┐
          │   Anonymized Data Product       │
          │ (Still based on real people)    │
          │ Used for:                       │
          │ - Analytics                     │
          │ - Compliance reports            │
          └─────────┬───────────────────────┘
                    │
                    ▼
          ┌─────────────────────────────────┐
          │   Synthetic Data Generator      │
          │ - Rule-based (PAN, Aadhaar fmt) │
          │ - AI-based (names, addresses)   │
          │ - Noise injection (typos/errors)│
          │ - Balanced distributions        │
          └─────────┬───────────────────────┘
                    │
                    ▼
          ┌─────────────────────────────────┐
          │   Synthetic Data Product        │
          │ (No link to real people)        │
          │ Used for:                       │
          │ - Dev/Test of microservices     │
          │ - Stress testing APIs/DBs       │
          │ - AI/ML model training          │
          │ - Vector DB embeddings for RAG  │
          └─────────┬───────────────────────┘
                    │
                    ▼
      ┌───────────────────────────────┐
      │   Data Mesh & AI Platform     │
      │ - Lending domain owns KYC     │
      │ - Vector DB stores embeddings │
      │ - GenAI chatbot answers       │
      │ - Risk/Fraud ML models train  │
      └───────────────────────────────┘
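
Two of the steps in the diagram, tokenization in the anonymization layer and noise injection in the synthetic generator, can be sketched as below. This is an illustration under assumptions, not the only way to do either step: the diagram says "hash IDs", and the sketch uses a keyed hash (HMAC-SHA256) because a plain hash of a short ID can be guessed by trying all values. The function names, the 16-character token length, and the default `error_rate` are hypothetical.

```python
import hashlib
import hmac
import random


def tokenize(value, secret_key):
    """Replace an ID with a stable token derived from a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability (assumption)


def inject_typo(text, rng):
    """Simulate a data-entry error by swapping two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_noise(record, rng, error_rate=0.1, fields=("name", "address")):
    """Return a copy of the record with typos in some text fields."""
    noisy = dict(record)
    for field in fields:
        if field in noisy and rng.random() < error_rate:
            noisy[field] = inject_typo(noisy[field], rng)
    return noisy


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    print(tokenize("AJXPS1234D", key))

    rng = random.Random(1)
    record = {"name": "Kavita Mehta", "address": "42 MG Road, Pune"}
    print(add_noise(record, rng, error_rate=1.0))
```

The same input and key always give the same token, so tokenized records can still be joined across tables. Noisy synthetic records are useful for checking how KYC validation and matching logic behave on imperfect input.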

✅ How to explain in interviews

Step 1 (Real) → Real KYC data stays in secure production systems.
Step 2 (Anonymized) → Used for internal analytics & reporting, but still tied to real customers.
Step 3 (Synthetic) → Used for AI/ML training, GenAI RAG, microservices testing, because it’s safe, scalable, and PII-free.
Step 4 (Data Mesh + Vector DB) → Synthetic or masked data gets published as a data product, embedded into a Vector DB, and consumed by chatbots, fraud detection, lending analytics.