Systematic Diagnosis

Anand Nerurkar
Nov 26, 2025

1. The Core Principle of Systematic Diagnosis

Never jump to solutions. Always stabilize → observe → hypothesize → test → confirm.

Most failures happen because people:

Fix symptoms, not root causes
Trust logs blindly
Skip validation
Apply “tribal fixes”

Systematic diagnosis avoids that.

2. The 7-Step Systematic Diagnosis Framework (Universal)

Step 1 — Stabilize First (Stop the Bleeding)

Goal: Prevent business damage before deep analysis.

Ask:

Is customer impact ongoing?
Is data corruption happening?
Is financial loss active?

Actions:

Rollback
Disable feature flag
Scale horizontally
Block traffic
Switch to fallback

⚠️ No diagnosis while the system is burning.
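Two of the actions above, "Disable feature flag" and "Switch to fallback", can be wired together ahead of time so that stabilizing is a one-line change. The sketch below is only an illustration: the in-memory `FLAGS` dict, the `with_fallback` helper, and both handler functions are hypothetical names, and a real system would read flags from a configuration service.

```python
from typing import Callable, Dict

# Hypothetical in-memory flag store (assumption: real flags live in a config service).
FLAGS: Dict[str, bool] = {"auto_approval": True}

Handler = Callable[[dict], str]


def with_fallback(flag: str, primary: Handler, fallback: Handler) -> Handler:
    """Route to `primary` while the flag is on, otherwise to `fallback`."""
    def handler(request: dict) -> str:
        # The flag is read on every call, so flipping it takes effect immediately.
        if FLAGS.get(flag, False):
            return primary(request)
        return fallback(request)
    return handler


def new_engine(request: dict) -> str:
    # Stand-in for the risky code path under suspicion.
    return "approved-by-new-engine"


def static_rules(request: dict) -> str:
    # Stand-in for the safe, known-good path.
    return "queued-for-static-rules"


decide = with_fallback("auto_approval", new_engine, static_rules)

# During an incident, stabilizing is a single flag flip:
# FLAGS["auto_approval"] = False
```

Because the flag is checked per request, no redeploy is needed to stop the bleeding, which is the point of stabilizing before diagnosing.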

Step 2 — Define the Problem Precisely

Most incidents fail here.

Bad:

“System is slow”

Good:

“P95 latency increased from 220ms to 4.8s for POST /loan/apply since 10:42 IST only on Region-B.”

Always capture:

What exactly broke
When
Where (service, region, tenant)
Since when
Who is impacted
What still works

Create a problem statement, not a complaint.
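A precise statement like the "Good" example above needs a measured baseline and a measured current value. Here is a minimal sketch, assuming you can export raw latency samples in milliseconds; the `p95` helper uses the nearest-rank method, and the `ProblemStatement` class and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of latency samples."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


@dataclass
class ProblemStatement:
    what: str               # the exact failing operation
    where: str              # service, region, or tenant
    since: str              # first observed failure time
    baseline_p95_ms: float
    current_p95_ms: float

    def render(self) -> str:
        return (
            f"P95 latency increased from {self.baseline_p95_ms:.0f}ms to "
            f"{self.current_p95_ms:.0f}ms for {self.what} since {self.since} "
            f"only on {self.where}."
        )


# Made-up samples for illustration only.
baseline = [180.0] * 18 + [220.0, 220.0]
current = [300.0] * 18 + [4800.0, 4800.0]

statement = ProblemStatement(
    what="POST /loan/apply",
    where="Region-B",
    since="10:42 IST",
    baseline_p95_ms=p95(baseline),
    current_p95_ms=p95(current),
)
print(statement.render())
```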

Step 3 — Establish the Last Known Good State (LKG)

Ask:

When was it working fine last?
What changed after that?

Typical change vectors:

Code deploy
Config change
Data migration
Infrastructure change
Traffic spike
External dependency

As a rule of thumb, roughly 90% of root causes lie between the LKG and the failure time.
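Narrowing the search to the window between LKG and failure can be sketched as a simple filter over a change log. This assumes you can collect deploys, config changes, and migrations into one list; the `ChangeEvent` type and the sample events are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    kind: str     # e.g. "deploy", "config", "migration"
    detail: str


def changes_in_window(events: List[ChangeEvent], lkg: datetime,
                      failure: datetime) -> List[ChangeEvent]:
    """Return changes made after LKG and up to the failure time, newest first."""
    if failure < lkg:
        raise ValueError("failure time precedes last known good time")
    window = [e for e in events if lkg < e.at <= failure]
    return sorted(window, key=lambda e: e.at, reverse=True)


# Made-up change log for illustration only.
events = [
    ChangeEvent(datetime(2025, 11, 26, 8, 0), "config", "raise pool size"),
    ChangeEvent(datetime(2025, 11, 26, 10, 30), "deploy", "loan-service release"),
    ChangeEvent(datetime(2025, 11, 26, 10, 40), "migration", "add column"),
    ChangeEvent(datetime(2025, 11, 26, 11, 15), "deploy", "unrelated hotfix"),
]

suspects = changes_in_window(
    events,
    lkg=datetime(2025, 11, 26, 10, 0),
    failure=datetime(2025, 11, 26, 10, 42),
)
```

Sorting newest first puts the change closest to the failure at the top, which is usually the first one worth checking.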

Step 4 — Break the System into Diagnostic Layers

Diagnose top-down or bottom-up:

Layer	Example
Business	Orders failing
API	5xx errors
Service	Thread pool exhausted
Data	Deadlocks
Infra	Node disk full
Network	Packet loss
External	Payment gateway slow

You never jump layers randomly. You move layer by layer.
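Moving layer by layer can be enforced by always walking the layers in a fixed order. A minimal sketch, assuming each layer has a health check you can call; the `walk_layers` function and the lambda checks are hypothetical placeholders for real probes.

```python
from typing import Callable, Dict, List, Tuple

# Layer order taken from the table above, top (Business) to bottom (External).
LAYERS = ["Business", "API", "Service", "Data", "Infra", "Network", "External"]

Check = Callable[[], bool]  # returns True when the layer looks healthy


def walk_layers(checks: Dict[str, Check],
                top_down: bool = True) -> List[Tuple[str, str]]:
    """Visit every layer in order and record a finding for each one."""
    order = LAYERS if top_down else LAYERS[::-1]
    findings = []
    for layer in order:
        check = checks.get(layer)
        if check is None:
            # A layer without a check is reported, not silently skipped.
            findings.append((layer, "not checked"))
        else:
            findings.append((layer, "healthy" if check() else "unhealthy"))
    return findings


# Placeholder checks for illustration only.
checks = {
    "Business": lambda: False,   # orders failing
    "API": lambda: False,        # 5xx errors
    "Service": lambda: True,
    "Infra": lambda: True,
}

for layer, status in walk_layers(checks):
    print(f"{layer}: {status}")
```

Reporting unchecked layers explicitly makes gaps in coverage visible instead of letting them pass as healthy.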

Step 5 — Generate Multiple Hypotheses (Not One)

Never fall in love with your first idea.

For each layer ask:

What could logically cause this symptom?

Example (High latency):

DB slow
GC pause
Network jitter
Thread starvation
Lock contention
Downstream timeout
Cache miss storm

Write at least 3–5 hypotheses.

Step 6 — Test Hypotheses with Evidence (Not Guesswork)

Each hypothesis must be tested with:

Metrics
Logs
Traces
State inspection
Controlled experiments

Rule:

If you cannot validate it with data, it is an assumption—not a diagnosis.

Bad:

“It must be Kafka”

Good:

“Kafka consumer lag jumped from 200 to 2M at 10:41; correlation confirmed.”
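The difference between the "Bad" and "Good" statements above is a measurement tied to the incident time. The sketch below checks whether a metric jumped around the incident start. The `evaluate_metric_jump` function, its thresholds, and the sample series are all assumptions for illustration, and a timing match supports a hypothesis without proving causation on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]


@dataclass
class Verdict:
    hypothesis: str
    supported: bool
    evidence: str


def evaluate_metric_jump(hypothesis: str, series: List[Sample],
                         incident_start: datetime,
                         window: timedelta = timedelta(minutes=5),
                         min_ratio: float = 10.0) -> Verdict:
    """Support the hypothesis only if the metric jumped near the incident start."""
    lo, hi = incident_start - window, incident_start + window
    before = [v for t, v in series if t < lo]
    near = [v for t, v in series if lo <= t <= hi]
    if not before or not near:
        return Verdict(hypothesis, False, "insufficient data")
    baseline, peak = max(before), max(near)
    supported = peak > baseline and peak >= min_ratio * baseline
    evidence = (f"value went from {baseline:.0f} to {peak:.0f} "
                f"around {incident_start:%H:%M}")
    return Verdict(hypothesis, supported, evidence)


# Made-up series for illustration only.
start = datetime(2025, 11, 26, 10, 42)
kafka_lag = [
    (datetime(2025, 11, 26, 10, 30), 180.0),
    (datetime(2025, 11, 26, 10, 35), 200.0),
    (datetime(2025, 11, 26, 10, 41), 2_000_000.0),
]
gc_pause_ms = [
    (datetime(2025, 11, 26, 10, 30), 40.0),
    (datetime(2025, 11, 26, 10, 41), 45.0),
]

verdicts = [
    evaluate_metric_jump("Kafka consumer lag", kafka_lag, start),
    evaluate_metric_jump("GC pause", gc_pause_ms, start),
]
```

Each verdict carries its evidence string, so a hypothesis is recorded as supported or rejected together with the data behind it.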

Step 7 — Confirm Root Cause + Secondary Causes

You must identify:

Primary Root Cause: What actually broke the system
Secondary Contributing Causes: Why it propagated / became severe
Detection Failure: Why it wasn’t caught earlier
Recovery Gaps: Why it took long to fix

This is how you avoid repeat incidents.

3. The Golden Rule of Diagnosis

Symptoms tell you where to look. Causes are found elsewhere.

Example:

Symptom: API timeout
Root cause: DB index dropped
Detection failure: No slow query alert
Recovery gap: No automated rollback

4. Systematic Diagnosis Template (Use This in Real Incidents)

You can literally follow this during any outage:

1. Impact:
   - Users impacted:
   - Revenue/data risk:

2. Exact Symptom:
   -

3. LKG Time:
   -

4. Changes After LKG:
   -

5. Hypotheses:
   H1:
   H2:
   H3:

6. Evidence Collected:
   H1 -> Supported/Rejected
   H2 -> Supported/Rejected
   H3 -> Supported/Rejected

7. Root Cause:
   -

8. Contributing Factors:
   -

9. Detection Failure:
   -

10. Preventive Actions:
   -

This structure alone makes you look senior & methodical in interviews.

5. Systematic Diagnosis Example (AI/RAG System)

Symptom

“LLM started hallucinating loan eligibility rules since morning.”

Step-wise Diagnosis

Stabilize

Disable auto-approval
Switch to static rule engine fallback

Define

Only for Home Loans
Only after 9:30 AM deploy

Last Known Good

Working fine till 9:25 AM

Hypotheses

Vector store corruption
Wrong embedding model deployed
Chunking bug
Prompt template change
Stale cache

Evidence

Vector row count dropped by 40%
Metadata hash mismatch found
Re-index job failed at 9:10 AM

✅ Root Cause: Partial vector re-index after policy update
✅ Contributing Cause: No atomic index swap
✅ Detection Failure: No vector integrity check
✅ Prevention: Blue-green index strategy + checksum validation

That is textbook systematic diagnosis.
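The prevention step in this example (blue-green index strategy plus checksum validation) can be sketched in a few lines. This is an in-memory illustration, not a real vector store client: the `IndexRegistry` class, its slot names, and the `checksum` helper are hypothetical, and the chunks are plain strings standing in for indexed documents.

```python
import hashlib
from typing import Dict, List


def checksum(chunks: List[str]) -> str:
    """Order-independent SHA-256 over chunk contents."""
    digest = hashlib.sha256()
    for chunk in sorted(chunks):
        # Hash each chunk first so chunk boundaries cannot blur together.
        digest.update(hashlib.sha256(chunk.encode("utf-8")).digest())
    return digest.hexdigest()


class IndexRegistry:
    """Hypothetical holder for two index slots and a pointer to the live one."""

    def __init__(self, live_chunks: List[str]) -> None:
        self.slots: Dict[str, List[str]] = {"blue": list(live_chunks), "green": []}
        self.live = "blue"

    def standby(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def rebuild_and_swap(self, source_chunks: List[str],
                         indexed_chunks: List[str]) -> bool:
        """Load the standby slot, validate it, and only then flip the live pointer."""
        slot = self.standby()
        self.slots[slot] = list(indexed_chunks)
        count_ok = len(indexed_chunks) == len(source_chunks)
        hash_ok = checksum(indexed_chunks) == checksum(source_chunks)
        if not (count_ok and hash_ok):
            return False  # validation failed: keep serving the current live slot
        self.live = slot  # single pointer flip acts as the atomic swap
        return True


registry = IndexRegistry(["old rule A", "old rule B"])
source = ["new rule A", "new rule B", "new rule C"]

# A partial re-index (one chunk missing) is rejected and never goes live.
partial_ok = registry.rebuild_and_swap(source, source[:2])

# A complete re-index passes validation and becomes live.
full_ok = registry.rebuild_and_swap(source, list(reversed(source)))
```

With this shape, a failed re-index job leaves the old index serving traffic, so the partial state from the example never reaches the LLM.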

6. Common Anti-Patterns (What Senior Engineers Avoid)

❌ Fixing without defining the problem
❌ “It worked for me”
❌ Assuming infra issue without proof
❌ Restarting blindly
❌ Single-hypothesis tunnel vision
❌ No postmortem

7. When to Use Which Diagnostic Style

Situation	Style
Production outage	Top-down (Business → Infra)
Performance issue	Bottom-up (Infra → App)
Data inconsistency	Data-first
AI hallucination	Pipeline-first
Security incident	Attack-surface-first

8. The 5 Diagnostic Questions You Should Always Ask

What changed?
What is the narrowest failing surface?
What is still working?
What evidence disproves my assumption?
How would this fail again?

If you can answer these → you are diagnosing, not guessing.

9. How This Is Expected at Architect / Staff / Principal Level

At senior levels, diagnosis is judged by:

Speed with correctness
Data-backed reasoning
Ability to eliminate layers fast
Structured communication
Prevention mindset

You are not rewarded for fixing fast. You are rewarded for stopping recurrence.