
Systematic Diagnosis

  • Writer: Anand Nerurkar
  • Nov 26, 2025

1. The Core Principle of Systematic Diagnosis

Never jump to solutions. Always stabilize → observe → hypothesize → test → confirm.

Most failures happen because people:

  • Fix symptoms, not root causes

  • Trust logs blindly

  • Skip validation

  • Apply “tribal fixes”

Systematic diagnosis avoids that.

2. The 7-Step Systematic Diagnosis Framework (Universal)

Step 1 — Stabilize First (Stop the Bleeding)

Goal: Prevent business damage before deep analysis.

Ask:

  • Is customer impact ongoing?

  • Is data corruption happening?

  • Is financial loss active?

Actions:

  • Rollback

  • Disable feature flag

  • Scale horizontally

  • Block traffic

  • Switch to fallback

⚠️ No diagnosis while the system is burning.
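The "disable feature flag" and "switch to fallback" actions above can be sketched in a few lines. This is a minimal in-memory illustration, not a real flag service: `KillSwitch`, `run_with_fallback`, and the flag name `auto_approval` are hypothetical names chosen for the example.

```python
from typing import Callable


class KillSwitch:
    """In-memory feature flags; a hypothetical stand-in for a real flag service."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


def run_with_fallback(
    flags: KillSwitch,
    feature: str,
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> str:
    """Use the primary path only while its flag is enabled."""
    return primary() if flags.is_enabled(feature) else fallback()


flags = KillSwitch()
flags.disable("auto_approval")  # stabilize first: switch off the risky path
result = run_with_fallback(
    flags,
    "auto_approval",
    primary=lambda: "auto-approved",
    fallback=lambda: "queued for manual review",
)
print(result)
```

The point of the design is that stabilizing is a one-line, reversible action that does not require understanding the failure yet.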

Step 2 — Define the Problem Precisely

Most incidents fail here.

Bad:

“System is slow”

Good:

“P95 latency increased from 220ms to 4.8s for POST /loan/apply since 10:42 IST only on Region-B.”

Always capture:

  • What exactly broke

  • When it was first noticed

  • Where (service, region, tenant)

  • Since when (actual onset time)

  • Who is impacted

  • What still works

Create a problem statement, not a complaint.
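One way to enforce the checklist above is to treat the problem statement as a record that refuses to be "done" while fields are blank. The sketch below is only an illustration; the class and field names are assumptions, and the sample values come from the "Good" example earlier in this step.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProblemStatement:
    what: str         # what exactly broke, with numbers
    since: str        # onset time
    where: str        # service, region, tenant
    impacted: str     # who is affected
    still_works: str  # what is confirmed healthy

    def missing(self) -> list[str]:
        """Names of fields left blank; any blank field means more observation is needed."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        return (
            f"{self.what} since {self.since} on {self.where}. "
            f"Impacted: {self.impacted}. Still working: {self.still_works}."
        )


stmt = ProblemStatement(
    what="P95 latency increased from 220ms to 4.8s for POST /loan/apply",
    since="10:42 IST",
    where="Region-B",
    impacted="",      # not yet known
    still_works="",   # not yet known
)
print(stmt.missing())
```

Here the statement is precise about the symptom but still missing who is impacted and what still works, so it is not yet complete.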

Step 3 — Establish the Last Known Good State (LKG)

Ask:

  • When was it working fine last?

  • What changed after that?

Typical change vectors:

  • Code deploy

  • Config change

  • Data migration

  • Infrastructure change

  • Traffic spike

  • External dependency

Most root causes trace back to a change made between the LKG time and the failure time.
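Narrowing the search to the window between LKG and failure can be as simple as filtering a change log. The sketch below assumes a change log is available as a list of records; the `Change` type, the function name, and all sample entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    at: datetime
    kind: str          # e.g. "deploy", "config", "migration"
    description: str


def changes_since_lkg(changes: list[Change], lkg: datetime, failure: datetime) -> list[Change]:
    """Changes made after the LKG time and up to the failure, newest first."""
    window = [c for c in changes if lkg < c.at <= failure]
    return sorted(window, key=lambda c: c.at, reverse=True)


# Hypothetical change log, for illustration only.
log = [
    Change(datetime(2025, 1, 6, 8, 15), "config", "raise connection pool size"),
    Change(datetime(2025, 1, 6, 10, 30), "deploy", "loan-service release"),
    Change(datetime(2025, 1, 6, 11, 5), "migration", "backfill job"),
]
suspects = changes_since_lkg(
    log,
    lkg=datetime(2025, 1, 6, 10, 0),
    failure=datetime(2025, 1, 6, 10, 42),
)
for change in suspects:
    print(change.at.isoformat(), change.kind, change.description)
```

Sorting newest first reflects a common heuristic: the change closest to the failure is usually the first one worth checking.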

Step 4 — Break the System into Diagnostic Layers

Diagnose top-down or bottom-up:

Layer       Example
Business    Orders failing
API         5xx errors
Service     Thread pool exhausted
Data        Deadlocks
Infra       Node disk full
Network     Packet loss
External    Payment gateway slow

You never jump layers randomly. You move layer by layer.
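A layer-by-layer walk can be expressed as an ordered list of health checks. In this sketch the layer names come from the table above, while the `failing_layers` function and the check results are assumptions made up for the example; real checks would query dashboards or probes.

```python
from typing import Callable

# Layer order taken from the table above, top (business) to bottom (external).
LAYERS = ["Business", "API", "Service", "Data", "Infra", "Network", "External"]


def failing_layers(checks: dict[str, Callable[[], bool]], top_down: bool = True) -> list[str]:
    """Run each layer's health check in order and return the unhealthy layers."""
    order = LAYERS if top_down else LAYERS[::-1]
    return [layer for layer in order if layer in checks and not checks[layer]()]


# Hypothetical check results: each callable returns True when the layer is healthy.
checks = {
    "Business": lambda: False,   # orders failing
    "API": lambda: False,        # 5xx errors
    "Service": lambda: True,
    "Data": lambda: False,       # deadlocks
    "Infra": lambda: True,
    "Network": lambda: True,
    "External": lambda: True,
}
print(failing_layers(checks))
print(failing_layers(checks, top_down=False))
```

The deepest unhealthy layer is usually the better candidate for the cause, and the layers above it are more likely to be showing symptoms.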

Step 5 — Generate Multiple Hypotheses (Not One)

Never fall in love with your first idea.

For each layer ask:

  • What could logically cause this symptom?

Example (High latency):

  • DB slow

  • GC pause

  • Network jitter

  • Thread starvation

  • Lock contention

  • Downstream timeout

  • Cache miss storm

Write at least 3–5 hypotheses.

Step 6 — Test Hypotheses with Evidence (Not Guesswork)

Each hypothesis must be tested with:

  • Metrics

  • Logs

  • Traces

  • State inspection

  • Controlled experiments

Rule:

If you cannot validate it with data, it is an assumption—not a diagnosis.

Bad:

“It must be Kafka”

Good:

“Kafka consumer lag jumped from 200 to 2M at 10:41; correlation confirmed.”
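The "Good" example above rests on a timing check: did the metric jump shortly before the symptom began? A minimal sketch of that check follows. The function names, the 10x threshold, the 5-minute window, the 10:39 sample, and the 10:42 onset are all assumptions for illustration; only the jump from 200 to 2M at 10:41 comes from the example.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

Sample = tuple[datetime, float]


def first_jump(samples: list[Sample], factor: float = 10.0) -> Optional[datetime]:
    """Timestamp of the first sample at least `factor` times the median of all earlier samples."""
    for i in range(1, len(samples)):
        baseline = median(value for _, value in samples[:i])
        ts, value = samples[i]
        if baseline > 0 and value >= factor * baseline:
            return ts
    return None


def verdict(jump_at: Optional[datetime], onset: datetime,
            window: timedelta = timedelta(minutes=5)) -> str:
    """Supported only if the jump happened at or shortly before the symptom onset."""
    if jump_at is None:
        return "Rejected"
    lead = onset - jump_at
    return "Supported" if timedelta(0) <= lead <= window else "Rejected"


# Consumer lag samples: the 10:39 value and the onset time are hypothetical.
lag = [
    (datetime(2025, 1, 6, 10, 39), 180.0),
    (datetime(2025, 1, 6, 10, 40), 200.0),
    (datetime(2025, 1, 6, 10, 41), 2_000_000.0),
]
jump = first_jump(lag)
print(jump, verdict(jump, onset=datetime(2025, 1, 6, 10, 42)))
```

A matching timeline supports a hypothesis but does not prove it. A controlled experiment, such as replaying traffic with the suspected component healthy, is still what confirms the cause.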

Step 7 — Confirm Root Cause + Secondary Causes

You must identify:

  1. Primary Root Cause: What actually broke the system

  2. Secondary Contributing Causes: Why it propagated / became severe

  3. Detection Failure: Why it wasn’t caught earlier

  4. Recovery Gaps: Why it took long to fix

This is how you avoid repeat incidents.

3. The Golden Rule of Diagnosis

Symptoms tell you where to look. Causes are found elsewhere.

Example:

  • Symptom: API timeout

  • Root cause: DB index dropped

  • Detection failure: No slow query alert

  • Recovery gap: No automated rollback

4. Systematic Diagnosis Template (Use This in Real Incidents)

You can literally follow this during any outage:

1. Impact:
   - Users impacted:
   - Revenue/data risk:

2. Exact Symptom:
   -

3. LKG Time:
   -

4. Changes After LKG:
   -

5. Hypotheses:
   H1:
   H2:
   H3:

6. Evidence Collected:
   H1 -> Supported/Rejected
   H2 -> Supported/Rejected
   H3 -> Supported/Rejected

7. Root Cause:
   -

8. Contributing Factors:
   -

9. Detection Failure:
   -

10. Preventive Actions:
   -

This structure alone makes you look senior & methodical in interviews.

5. Systematic Diagnosis Example (AI/RAG System)

Symptom

“LLM started hallucinating loan eligibility rules since morning.”

Step-wise Diagnosis

Stabilize

  • Disable auto-approval

  • Switch to static rule engine fallback

Define

  • Only for Home Loans

  • Only after 9:30 AM deploy

Last Known Good

  • Working fine till 9:25 AM

Hypotheses

  • Vector store corruption

  • Wrong embedding model deployed

  • Chunking bug

  • Prompt template change

  • Stale cache

Evidence

  • Vector row count dropped by 40%

  • Metadata hash mismatch found

  • Re-index job failed at 9:10 AM

✅ Root Cause: Partial vector re-index after policy update

✅ Contributing Cause: No atomic index swap

✅ Detection Failure: No vector integrity check

✅ Prevention: Blue-green index strategy + checksum validation
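The prevention idea, a blue-green index swap guarded by a count and checksum check, can be sketched as follows. This is an in-memory illustration under stated assumptions: a real vector store has its own alias or swap mechanism, and `IndexAlias`, `checksum`, and the sample chunks are hypothetical.

```python
import hashlib


def checksum(chunks: list[str]) -> str:
    """Order-independent SHA-256 digest over all chunk texts."""
    digest = hashlib.sha256()
    for chunk in sorted(chunks):
        digest.update(chunk.encode("utf-8"))
        digest.update(b"\x00")  # separator so chunk boundaries affect the hash
    return digest.hexdigest()


class IndexAlias:
    """Points readers at the live index; the swap happens only after validation."""

    def __init__(self, live: list[str]) -> None:
        self.live = live

    def swap_if_valid(self, candidate: list[str], expected_count: int,
                      expected_checksum: str) -> bool:
        if len(candidate) != expected_count or checksum(candidate) != expected_checksum:
            return False  # keep serving the old (blue) index
        self.live = candidate  # single reference assignment: readers see old or new, never half
        return True


# Hypothetical policy chunks, for illustration only.
source = ["rule A", "rule B", "rule C", "rule D", "rule E"]
alias = IndexAlias(live=["old rule A", "old rule B"])

partial = source[:3]  # simulates a re-index job that failed partway through
print(alias.swap_if_valid(partial, len(source), checksum(source)))
print(alias.swap_if_valid(list(source), len(source), checksum(source)))
```

With this guard, a failed re-index leaves the previous index in service instead of exposing a partially built one.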

That is textbook systematic diagnosis.

6. Common Anti-Patterns (What Senior Engineers Avoid)

  • ❌ Fixing without defining the problem

  • ❌ “It worked for me”

  • ❌ Assuming infra issue without proof

  • ❌ Restarting blindly

  • ❌ Single-hypothesis tunnel vision

  • ❌ No postmortem

7. When to Use Which Diagnostic Style

Situation            Style
Production outage    Top-down (Business → Infra)
Performance issue    Bottom-up (Infra → App)
Data inconsistency   Data-first
AI hallucination     Pipeline-first
Security incident    Attack-surface-first

8. The 5 Diagnostic Questions You Should Always Ask

  1. What changed?

  2. What is the narrowest failing surface?

  3. What is still working?

  4. What evidence disproves my assumption?

  5. How would this fail again?

If you can answer these → you are diagnosing, not guessing.

9. How This Is Expected at Architect / Staff / Principal Level

At senior levels, diagnosis is judged by:

  • Speed with correctness

  • Data-backed reasoning

  • Ability to eliminate layers fast

  • Structured communication

  • Prevention mindset

You are not rewarded for fixing fast. You are rewarded for stopping recurrence.

 
 
 
