Systematic Diagnosis
- Anand Nerurkar
- Nov 26, 2025
1. The Core Principle of Systematic Diagnosis
Never jump to solutions. Always stabilize → observe → hypothesize → test → confirm.
Most diagnostic failures happen because people:
Fix symptoms, not root causes
Trust logs blindly
Skip validation
Apply “tribal fixes”
Systematic diagnosis avoids that.
2. The 7-Step Systematic Diagnosis Framework (Universal)
Step 1 — Stabilize First (Stop the Bleeding)
Goal: Prevent business damage before deep analysis.
Ask:
Is customer impact ongoing?
Is data corruption happening?
Is financial loss active?
Actions:
Rollback
Disable feature flag
Scale horizontally
Block traffic
Switch to fallback
⚠️ No diagnosis while the system is burning.
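Two of the actions above, disabling a feature flag and switching to a fallback, can be combined into a simple kill switch. The following is a minimal Python sketch, not a production pattern: the flag name `auto_approval`, the in-memory `flags` dict, and both handler functions are hypothetical stand-ins for whatever flag service and fallback path a real system uses.

```python
def handle_request(request, flags, primary, fallback):
    """Route to the primary path only while its feature flag is on.

    Any exception from the primary path also falls back, so a failing
    feature degrades instead of taking the request down with it.
    """
    if not flags.get("auto_approval", False):
        return fallback(request)
    try:
        return primary(request)
    except Exception:
        return fallback(request)


# Hypothetical handlers, for illustration only.
def risky_auto_approval(request):
    raise RuntimeError("downstream model unavailable")


def manual_review_queue(request):
    return {"status": "queued_for_review", "id": request["id"]}


flags = {"auto_approval": True}
result_on = handle_request({"id": 1}, flags, risky_auto_approval, manual_review_queue)

flags["auto_approval"] = False  # the "stop the bleeding" action
result_off = handle_request({"id": 2}, flags, risky_auto_approval, manual_review_queue)
```

With the flag on, the failing primary path is caught and the request is queued. With the flag off, the primary path is never called at all, which is the safer state to hold while diagnosis proceeds.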
Step 2 — Define the Problem Precisely
Most incident investigations go wrong here.
Bad:
“System is slow”
Good:
“P95 latency increased from 220ms to 4.8s for POST /loan/apply since 10:42 IST only on Region-B.”
Always capture:
What exactly broke
When
Where (service, region, tenant)
Since when
Who is impacted
What still works
Create a problem statement, not a complaint.
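A statement like the "Good" example above usually comes from slicing one metric by region or endpoint instead of looking at a global average. A minimal Python sketch, assuming latency samples are available as `(region, latency_ms)` pairs: the sample data is made up, and `p95` uses the nearest-rank definition, which may differ slightly from what a given monitoring tool reports.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_region(samples):
    """Group (region, latency_ms) samples and compute P95 per region."""
    grouped = defaultdict(list)
    for region, latency_ms in samples:
        grouped[region].append(latency_ms)
    return {region: p95(vals) for region, vals in grouped.items()}


# Hypothetical samples: Region-A is healthy, Region-B is degraded.
samples = [("Region-A", 200 + i) for i in range(20)]
samples += [("Region-B", 4000 + 40 * i) for i in range(20)]

per_region = p95_by_region(samples)
```

Here `per_region` separates a healthy Region-A from a degraded Region-B, which is exactly the "where" that a complaint like "System is slow" leaves out.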
Step 3 — Establish the Last Known Good State (LKG)
Ask:
When was it working fine last?
What changed after that?
Typical change vectors:
Code deploy
Config change
Data migration
Infrastructure change
Traffic spike
External dependency
As a rule of thumb, roughly 90% of root causes lie in changes made between LKG and failure time.
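Narrowing to the LKG window can be done mechanically once changes are logged with timestamps. The sketch below assumes a hypothetical `ChangeEvent` record and an in-memory list of events. In a real setup these would come from deploy logs, config history, and similar sources.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    kind: str          # e.g. "code deploy", "config change"
    description: str


def changes_since_lkg(events, lkg, failure):
    """Return events after LKG and up to the failure time, oldest first."""
    in_window = [e for e in events if lkg < e.at <= failure]
    return sorted(in_window, key=lambda e: e.at)


# Hypothetical change log for one morning.
day = dict(year=2025, month=11, day=26)
events = [
    ChangeEvent(datetime(hour=9, minute=30, **day), "code deploy", "loan-service release"),
    ChangeEvent(datetime(hour=9, minute=0, **day), "config change", "pool size tweak"),
    ChangeEvent(datetime(hour=9, minute=28, **day), "data migration", "policy table update"),
    ChangeEvent(datetime(hour=9, minute=50, **day), "traffic spike", "campaign launch"),
]

suspects = changes_since_lkg(
    events,
    lkg=datetime(hour=9, minute=25, **day),
    failure=datetime(hour=9, minute=35, **day),
)
```

Only the data migration and the code deploy fall inside the window, so those two are the first candidates to examine.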
Step 4 — Break the System into Diagnostic Layers
Diagnose top-down or bottom-up:
Layer | Example |
Business | Orders failing |
API | 5xx errors |
Service | Thread pool exhausted |
Data | Deadlocks |
Infra | Node disk full |
Network | Packet loss |
External | Payment gateway slow |
You never jump layers randomly. You move layer by layer.
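Moving layer by layer can be enforced by walking a fixed, ordered list. This is a small Python sketch under stated assumptions: the layer names come from the table above, while the `checks` mapping and its health-check callables are hypothetical placeholders for real probes (dashboards, queries, pings).

```python
LAYERS = ["Business", "API", "Service", "Data", "Infra", "Network", "External"]


def walk_layers(checks, top_down=True):
    """Run each layer's check in order and record what was found.

    `checks` maps a layer name to a zero-argument callable that returns
    True when the layer looks healthy. Layers without a check are
    reported explicitly so that gaps in coverage stay visible.
    """
    order = LAYERS if top_down else list(reversed(LAYERS))
    findings = []
    for layer in order:
        check = checks.get(layer)
        if check is None:
            findings.append((layer, "not checked"))
        else:
            findings.append((layer, "healthy" if check() else "unhealthy"))
    return findings


# Hypothetical probes, hard-coded for illustration.
checks = {
    "Business": lambda: False,   # orders failing
    "API": lambda: False,        # 5xx errors
    "Service": lambda: True,
}

top_down_findings = walk_layers(checks)
bottom_up_findings = walk_layers(checks, top_down=False)
```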
Step 5 — Generate Multiple Hypotheses (Not One)
Never fall in love with your first idea.
For each layer ask:
What could logically cause this symptom?
Example (High latency):
DB slow
GC pause
Network jitter
Thread starvation
Lock contention
Downstream timeout
Cache miss storm
Write at least 3–5 hypotheses.
Step 6 — Test Hypotheses with Evidence (Not Guesswork)
Each hypothesis must be tested with:
Metrics
Logs
Traces
State inspection
Controlled experiments
Rule:
If you cannot validate it with data, it is an assumption—not a diagnosis.
Bad:
“It must be Kafka”
Good:
“Kafka consumer lag jumped from 200 to 2M at 10:41; correlation confirmed.”
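The "Good" statement above rests on two checks: the metric really did jump, and the jump happened shortly before the incident started. A minimal Python sketch of both, assuming the metric is available as a list of `(timestamp, value)` points. The thresholds (`factor`, `baseline_points`, `window`) and the sample series are illustrative assumptions, and a timing match supports a hypothesis without proving causation on its own.

```python
from datetime import datetime, timedelta
from statistics import median


def first_jump(series, factor=10.0, baseline_points=3):
    """Return the timestamp of the first point that is at least `factor`
    times the median of the preceding `baseline_points` values, else None."""
    for i in range(baseline_points, len(series)):
        baseline = median(v for _, v in series[i - baseline_points:i])
        ts, value = series[i]
        if baseline > 0 and value >= factor * baseline:
            return ts
    return None


def precedes_incident(jump_at, incident_at, window=timedelta(minutes=5)):
    """True if the jump happened at or shortly before the incident start."""
    if jump_at is None:
        return False
    return timedelta(0) <= incident_at - jump_at <= window


# Hypothetical consumer-lag series, one point per minute.
def t(minute):
    return datetime(2025, 11, 26, 10, minute)


lag = [(t(38), 190), (t(39), 200), (t(40), 210), (t(41), 2_000_000), (t(42), 2_100_000)]

jump_at = first_jump(lag)
supported = precedes_incident(jump_at, incident_at=t(42))
```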
Step 7 — Confirm Root Cause + Secondary Causes
You must identify:
Primary Root Cause: What actually broke the system
Secondary Contributing Causes: Why it propagated / became severe
Detection Failure: Why it wasn’t caught earlier
Recovery Gaps: Why it took long to fix
This is how you avoid repeat incidents.
3. The Golden Rule of Diagnosis
Symptoms tell you where to look. Causes are found elsewhere.
Example:
Symptom: API timeout
Root cause: DB index dropped
Detection failure: No slow query alert
Recovery gap: No automated rollback
4. Systematic Diagnosis Template (Use This in Real Incidents)
You can literally follow this during any outage:
1. Impact:
- Users impacted:
- Revenue/data risk:
2. Exact Symptom:
-
3. LKG Time:
-
4. Changes After LKG:
-
5. Hypotheses:
H1:
H2:
H3:
6. Evidence Collected:
H1 -> Supported/Rejected
H2 -> Supported/Rejected
H3 -> Supported/Rejected
7. Root Cause:
-
8. Contributing Factors:
-
9. Detection Failure:
-
10. Preventive Actions:
-
This structure alone makes you look senior & methodical in interviews.
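The template can also be kept as a small data structure so that no hypothesis is marked Supported or Rejected without evidence attached. The following Python sketch is one possible shape, not a standard: the class names, field names, and evidence strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    statement: str
    verdict: str = "untested"            # "supported", "rejected", or "untested"
    evidence: list = field(default_factory=list)


@dataclass
class IncidentRecord:
    impact: str
    exact_symptom: str
    lkg_time: str
    hypotheses: list = field(default_factory=list)

    def record(self, index, verdict, evidence):
        """Attach a verdict to hypothesis `index`; evidence is mandatory."""
        if verdict not in ("supported", "rejected"):
            raise ValueError("verdict must be 'supported' or 'rejected'")
        if not evidence:
            raise ValueError("a verdict needs at least one piece of evidence")
        hypothesis = self.hypotheses[index]
        hypothesis.verdict = verdict
        hypothesis.evidence.append(evidence)

    def untested(self):
        return [h.statement for h in self.hypotheses if h.verdict == "untested"]

    def root_cause_candidates(self):
        return [h.statement for h in self.hypotheses if h.verdict == "supported"]


# Hypothetical incident, filled in as the investigation proceeds.
incident = IncidentRecord(
    impact="Loan applications timing out",
    exact_symptom="P95 latency 220ms -> 4.8s on POST /loan/apply, Region-B only",
    lkg_time="10:40 IST",
    hypotheses=[Hypothesis("DB slow"), Hypothesis("GC pause"), Hypothesis("Downstream timeout")],
)
incident.record(0, "supported", "slow query log shows full table scans since 10:42")
incident.record(1, "rejected", "GC pause times unchanged versus previous day")
```

After these two calls, `incident.untested()` still lists "Downstream timeout", a reminder that the investigation is not finished even though one hypothesis is already supported.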
5. Systematic Diagnosis Example (AI/RAG System)
Symptom
“LLM started hallucinating loan eligibility rules since morning.”
Step-wise Diagnosis
Stabilize
Disable auto-approval
Switch to static rule engine fallback
Define
Only for Home Loans
Only after 9:30 AM deploy
Last Known Good
Working fine till 9:25 AM
Hypotheses
Vector store corruption
Wrong embedding model deployed
Chunking bug
Prompt template change
Stale cache
Evidence
Vector row count dropped by 40%
Metadata hash mismatch found
Re-index job failed at 9:10 AM
✅ Root Cause: Partial vector re-index after policy update
✅ Contributing Cause: No atomic index swap
✅ Detection Failure: No vector integrity check
✅ Prevention: Blue-green index strategy + checksum validation
That is textbook systematic diagnosis.
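The prevention item, a blue-green index strategy with checksum validation, can be sketched in a few lines. This Python example is an in-memory illustration only: `IndexRegistry` is a hypothetical stand-in for a vector store's "active index" alias, chunks are plain strings, and a real system would validate embeddings and metadata through its own store's API.

```python
import hashlib


def checksum(chunks):
    """Order-independent SHA-256 over a collection of chunk texts."""
    digest = hashlib.sha256()
    for text in sorted(chunks):
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so "ab","c" differs from "a","bc"
    return digest.hexdigest()


class IndexRegistry:
    """Hypothetical holder of the currently active (blue) index."""

    def __init__(self, active_chunks):
        self.active = list(active_chunks)

    def swap_if_valid(self, candidate_chunks, source_chunks):
        """Promote the candidate (green) index only if it matches the source.

        A partial re-index fails the row-count or checksum comparison and
        the active index is left untouched.
        """
        if len(candidate_chunks) != len(source_chunks):
            return False
        if checksum(candidate_chunks) != checksum(source_chunks):
            return False
        self.active = list(candidate_chunks)
        return True


# Hypothetical policy chunks: the old index, the new source, and a
# re-index job that died after writing 60% of the rows.
old_index = [f"old rule {i}" for i in range(10)]
source = [f"new rule {i}" for i in range(10)]
partial = source[:6]

registry = IndexRegistry(old_index)
partial_promoted = registry.swap_if_valid(partial, source)
still_old = registry.active == old_index
full_promoted = registry.swap_if_valid(list(reversed(source)), source)
```

The partial index is rejected and queries keep hitting the old, complete index. The full rebuild is promoted in a single assignment, which is the in-memory analogue of an atomic alias swap.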
6. Common Anti-Patterns (What Senior Engineers Avoid)
❌ Fixing without defining the problem
❌ “It worked for me”
❌ Assuming infra issue without proof
❌ Restarting blindly
❌ Single-hypothesis tunnel vision
❌ No postmortem
7. When to Use Which Diagnostic Style
Situation | Style |
Production outage | Top-down (Business → Infra) |
Performance issue | Bottom-up (Infra → App) |
Data inconsistency | Data-first |
AI hallucination | Pipeline-first |
Security incident | Attack-surface-first |
8. The 5 Diagnostic Questions You Should Always Ask
What changed?
What is the narrowest failing surface?
What is still working?
What evidence disproves my assumption?
How would this fail again?
If you can answer these → you are diagnosing, not guessing.
9. How This Is Expected at Architect / Staff / Principal Level
At senior levels, diagnosis is judged by:
Speed with correctness
Data-backed reasoning
Ability to eliminate layers fast
Structured communication
Prevention mindset
You are not rewarded for fixing fast. You are rewarded for stopping recurrence.