
Synthetic Data???

  • Writer: Anand Nerurkar
  • Oct 2
  • 3 min read

What is Synthetic Data?

  • Definition: Synthetic data is artificially generated data that looks and behaves like real data but does not come from actual people or systems.

    • It is created using algorithms, rules, or AI models, not collected from real-world transactions.

Key Characteristics

  1. Artificial – generated by software, not captured directly.

  2. Realistic – mimics the structure, format, and statistical properties of actual data.

  3. Safe – contains no real Personally Identifiable Information (PII), so it can be used for testing, training, and development with far less privacy risk than production data, provided the generator does not reproduce real records.

  4. Customizable – can be tailored for specific scenarios (e.g., fraud, KYC errors, loan defaults).

Why It’s Important

  • Privacy & Compliance: Real KYC/loan data is sensitive → using it in testing or AI training may violate laws (GDPR, DPDP Act, RBI/SEBI guidelines).

  • Testing Systems: Developers need realistic records (PAN, Aadhaar, addresses, income) to test banking platforms safely.

  • AI/ML Training: Machine learning models need large datasets → synthetic data can provide this without exposing real customers.

  • Scalability: You can create millions of records on demand for stress testing.

Example in Banking (Synthetic KYC Record)

Customer Name: Priya Sharma
DOB: 1988-07-22
PAN: ABCDE1234F   (synthetic format, not real)
Aadhaar: 567812349876 (synthetic 12-digit)
Address: 42 MG Road, Pune, Maharashtra - 411001
Phone: 9876543210
Employment: Salaried
Annual Income: ₹8,50,000

Looks realistic, but belongs to no actual person.
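A record like the one above can be produced with a few format rules. The following is a minimal Python sketch, not a production generator: the name and city pools, the `KycRecord` class, and the income range are all hypothetical, and the PAN and Aadhaar values follow the visible format only (no checksum or issuance rules are applied). Because the values are random, one could coincide with an issued number by chance, so a real setup would typically add a check for that.

```python
import random
import string
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical sample pools; a real generator would use much larger lists.
FIRST_NAMES = ["Priya", "Kavita", "Arjun", "Neha", "Vikram"]
LAST_NAMES = ["Sharma", "Mehta", "Iyer", "Patil", "Khan"]
CITIES = [("Pune", "Maharashtra", "411001"), ("Bengaluru", "Karnataka", "560001")]


@dataclass
class KycRecord:
    name: str
    dob: date
    pan: str
    aadhaar: str
    address: str
    phone: str
    employment: str
    annual_income: int


def synthetic_pan(rng: random.Random) -> str:
    # Format only: 5 letters, 4 digits, 1 letter.
    letters = "".join(rng.choices(string.ascii_uppercase, k=5))
    digits = "".join(rng.choices(string.digits, k=4))
    return letters + digits + rng.choice(string.ascii_uppercase)


def synthetic_aadhaar(rng: random.Random) -> str:
    # Format only: 12 digits, first digit 2-9 (assumption); no checksum is computed.
    return rng.choice("23456789") + "".join(rng.choices(string.digits, k=11))


def synthetic_kyc_record(rng: random.Random) -> KycRecord:
    city, state, pin = rng.choice(CITIES)
    return KycRecord(
        name=f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        dob=date(1960, 1, 1) + timedelta(days=rng.randrange(16000)),
        pan=synthetic_pan(rng),
        aadhaar=synthetic_aadhaar(rng),
        address=f"{rng.randint(1, 200)} MG Road, {city}, {state} - {pin}",
        phone=rng.choice("6789") + "".join(rng.choices(string.digits, k=9)),
        employment=rng.choice(["Salaried", "Self-employed"]),
        annual_income=rng.randrange(300_000, 2_500_000, 50_000),
    )


rng = random.Random(42)  # fixed seed so test data is reproducible
print(synthetic_kyc_record(rng))
```

Calling `synthetic_kyc_record` in a loop with the same seeded `rng` yields as many records as a stress test needs.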

In simple words: Synthetic data = fake but realistic data created for safe testing, analytics, and AI training, especially when real data is too sensitive, unavailable, or risky to use.


Synthetic vs Real vs Anonymized Data

| Feature / Aspect | Real Data | Anonymized Data | Synthetic Data |
| --- | --- | --- | --- |
| Source | Collected from actual people, systems, and transactions | Real data with PII masked, encrypted, or removed | Generated artificially using rules, algorithms, or AI |
| Contains PII? | ✅ Yes (PAN, Aadhaar, Name, Phone, etc.) | ❌ Removed/Masked but may still carry re-identification risk | ❌ No actual PII (completely artificial) |
| Realism | 100% real | Very high (since based on real data) | High (mimics patterns, distributions, formats) |
| Privacy Risk | 🚨 High (regulatory/compliance risk if leaked) | ⚠️ Medium (can sometimes be reverse-engineered) | ✅ Low (no link to real people) |
| Regulatory Safe? | ❌ Not for dev/test/AI training | ⚠️ Safer, but must ensure irreversible anonymization | ✅ Safe for dev, test, AI training |
| Use Cases | Production systems, live analytics, actual customer services | Analytics, limited testing, compliance reporting | Testing, AI/ML model training, GenAI RAG, performance/stress testing |
| Scalability | Limited (depends on what’s collected) | Limited to size of real dataset | Unlimited (can generate millions of records) |
| Cost | High (collection, storage, compliance overhead) | Medium (extra processing for anonymization) | Low (algorithmic generation at scale) |
| Bias | May contain historical/social bias | Bias is preserved from original dataset | Can reduce bias by generating balanced synthetic samples |
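The "balanced synthetic samples" idea in the Bias row can be illustrated with a small Python sketch. It assumes a hypothetical loan dataset where the share of defaults is chosen by the caller instead of inherited from history; the `dti` (debt-to-income) field and its ranges are invented for illustration and are not taken from any real portfolio.

```python
import random
from collections import Counter


def synthetic_loans(n: int, default_share: float, rng: random.Random) -> list[dict]:
    """Generate n loan records with a caller-chosen share of defaults."""
    n_default = round(n * default_share)
    labels = ["default"] * n_default + ["repaid"] * (n - n_default)
    rng.shuffle(labels)
    records = []
    for label in labels:
        # Hypothetical rule: defaulted loans tend to have a higher debt-to-income ratio.
        low, high = (0.45, 0.90) if label == "default" else (0.05, 0.50)
        records.append({"dti": round(rng.uniform(low, high), 2), "label": label})
    return records


rng = random.Random(7)
loans = synthetic_loans(1000, default_share=0.5, rng=rng)
print(Counter(record["label"] for record in loans))
```

Real loan books usually contain far fewer defaults than repayments, so setting `default_share` explicitly is one simple way to give a model enough examples of the rare class.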

✅ BFSI Example

  • Real Data: Rahul Sharma, PAN: AJXPS1234D, Aadhaar: 5678 1234 0987 → Actual customer.

  • Anonymized Data: Rahul S., PAN: XXXXX1234D, Aadhaar: XXXX XXXX 0987 → Masked but still based on real person.

  • Synthetic Data: Kavita Mehta, PAN: ABCDE4567F, Aadhaar: 9876 5432 1234 → Fake but realistic, no real identity.
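The masking step in the anonymized example can be sketched in a few lines of Python. The function names are hypothetical, and the rules (keep the last five PAN characters, the last four Aadhaar digits, and the surname initial) are simply read off the example above.

```python
def mask_pan(pan: str) -> str:
    """Keep the last 5 characters, as in XXXXX1234D."""
    return "X" * (len(pan) - 5) + pan[-5:]


def mask_aadhaar(aadhaar: str) -> str:
    """Keep only the last 4 digits, as in XXXX XXXX 0987."""
    digits = aadhaar.replace(" ", "")
    return f"XXXX XXXX {digits[-4:]}"


def mask_name(full_name: str) -> str:
    """Reduce the surname to an initial, as in Rahul S."""
    first, _, last = full_name.partition(" ")
    return f"{first} {last[:1]}." if last else first


print(mask_name("Rahul Sharma"), mask_pan("AJXPS1234D"), mask_aadhaar("5678 1234 0987"))
# prints: Rahul S. XXXXX1234D XXXX XXXX 0987
```

The retained suffixes and initials still come from the real person, which is why masked data keeps some re-identification risk while a synthetic record does not.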

👉 In interviews, you can phrase it as:

  • Real data is production-only.

  • Anonymized data is good for reporting/analytics but still risky.

  • Synthetic data is best for testing, AI/ML training, and PoCs, especially in regulated domains like banking, insurance, or healthcare.


Data Flow: Real vs Anonymized vs Synthetic KYC Data

                ┌────────────────────┐
                │   Real KYC Data    │
                │ (PAN, Aadhaar, PII)│
                └─────────┬──────────┘
                          │
                          ▼
          ┌─────────────────────────────────┐
          │   Anonymization Layer           │
          │ - Masking (XXXX1234)            │
          │ - Tokenization (hash IDs)       │
          │ - Redaction (remove fields)     │
          └─────────┬───────────────────────┘
                    │
                    ▼
          ┌─────────────────────────────────┐
          │   Anonymized Data Product       │
          │ (Still based on real people)    │
          │ Used for:                       │
          │ - Analytics                     │
          │ - Compliance reports            │
          └─────────┬───────────────────────┘
                    │
                    ▼
          ┌─────────────────────────────────┐
          │   Synthetic Data Generator      │
          │ - Rule-based (PAN, Aadhaar fmt) │
          │ - AI-based (names, addresses)   │
          │ - Noise injection (typos/errors)│
          │ - Balanced distributions        │
          └─────────┬───────────────────────┘
                    │
                    ▼
          ┌─────────────────────────────────┐
          │   Synthetic Data Product        │
          │ (No link to real people)        │
          │ Used for:                       │
          │ - Dev/Test of microservices     │
          │ - Stress testing APIs/DBs       │
          │ - AI/ML model training          │
          │ - Vector DB embeddings for RAG  │
          └─────────┬───────────────────────┘
                    │
                    ▼
      ┌───────────────────────────────┐
      │   Data Mesh & AI Platform     │
      │ - Lending domain owns KYC     │
      │ - Vector DB stores embeddings │
      │ - GenAI chatbot answers       │
      │ - Risk/Fraud ML models train  │
      └───────────────────────────────┘
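The "noise injection" step in the generator box can be sketched as follows. This is one simple approach under stated assumptions: `inject_typo` and `add_noise` are hypothetical helper names, and the only error type modelled is an adjacent-character swap, standing in for keying mistakes that KYC validation has to catch.

```python
import random


def inject_typo(value: str, rng: random.Random) -> str:
    """Swap two adjacent characters to imitate a keying error."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_noise(record: dict, error_rate: float, rng: random.Random) -> dict:
    """Return a copy in which each string field is corrupted with probability error_rate."""
    noisy = dict(record)
    for field, value in record.items():
        if isinstance(value, str) and rng.random() < error_rate:
            noisy[field] = inject_typo(value, rng)
    return noisy


rng = random.Random(0)
clean = {"name": "Kavita Mehta", "pan": "ABCDE4567F", "aadhaar": "987654321234"}
print(add_noise(clean, error_rate=0.5, rng=rng))
```

Feeding a mix of clean and noisy records into a KYC service is a cheap way to test its validation paths without touching real customer data.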

✅ How to explain in interviews

  • Step 1 (Real) → Real KYC data stays in secure production systems.

  • Step 2 (Anonymized) → Used for internal analytics & reporting, but still tied to real customers.

  • Step 3 (Synthetic) → Used for AI/ML training, GenAI RAG, microservices testing, because it’s safe, scalable, and PII-free.

  • Step 4 (Data Mesh + Vector DB) → Synthetic or masked data gets published as a data product, embedded into a Vector DB, and consumed by chatbots, fraud detection, and lending analytics.


 
 
 
