Multi Agent & Agentic AI

  • Writer: Anand Nerurkar
  • Nov 24
  • 6 min read

1. Agentic AI (What it means)

Agentic AI refers to AI systems that can take autonomous actions, not just generate text. These systems perceive, reason, plan, act, and learn, like a digital worker that executes end-to-end tasks with minimal human intervention.

Key capabilities:

  • Autonomy: Takes decisions without being prompted every time

  • Planning: Breaks goals into sub-tasks

  • Tool use: Calls APIs, databases, models

  • Reasoning loops: Self-critique, refine output

  • Learning: Improves from feedback

Example for BFSI: An “Agentic Fraud Investigator” that reads transactions → flags anomalies → gathers supporting evidence → recommends action → updates case notes.
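
A minimal Java sketch of that flow, assuming hypothetical FraudTools and AgenticFraudInvestigator types (illustrative only, not a specific framework API):

import java.util.List;

// Hypothetical types for illustration; not part of any real agent framework.
interface FraudTools {
    List<String> fetchTransactions(String accountId);   // perceive
    List<String> flagAnomalies(List<String> txns);      // reason
    String gatherEvidence(List<String> anomalies);      // act: call APIs, databases
    void updateCaseNotes(String recommendation);        // act: persist the outcome
}

class AgenticFraudInvestigator {
    private final FraudTools tools;

    AgenticFraudInvestigator(FraudTools tools) {
        this.tools = tools;
    }

    // Executes the end-to-end investigation with minimal human intervention.
    String investigate(String accountId) {
        List<String> txns = tools.fetchTransactions(accountId);
        List<String> anomalies = tools.flagAnomalies(txns);
        if (anomalies.isEmpty()) {
            return "NO_ACTION";
        }
        String evidence = tools.gatherEvidence(anomalies);
        String recommendation = "Review account " + accountId + ": " + evidence;
        tools.updateCaseNotes(recommendation);
        return recommendation;
    }
}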


2. What Is a Multi-Agent System?

A multi-agent system is a team of AI agents, each with specialized skills, working together to complete a complex business workflow.

It mimics an enterprise function where multiple teams collaborate.

Example Structure:

  1. Classification Agent: Reads a document and classifies it as KYC / Loan / Invoice.

  2. Extraction Agent: Extracts fields using DocAI.

  3. Validation Agent: Checks data quality, RBI rules, and business rules.

  4. Decision Agent: Determines the next best action.

  5. Orchestrator Agent: Coordinates all agents and maintains workflow state.
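
A stripped-down sketch of this structure, assuming a shared Agent interface and a simple sequential orchestrator (names are illustrative, not tied to LangGraph, AutoGen, or Spring AI Agents):

import java.util.Map;

// Illustrative only: each specialist agent transforms a shared workflow state.
interface Agent {
    Map<String, Object> run(Map<String, Object> state);
}

class OrchestratorAgent {
    private final Agent classifier;  // KYC / Loan / Invoice
    private final Agent extractor;   // field extraction (e.g., DocAI)
    private final Agent validator;   // data quality + RBI/business rules
    private final Agent decider;     // next best action

    OrchestratorAgent(Agent classifier, Agent extractor, Agent validator, Agent decider) {
        this.classifier = classifier;
        this.extractor = extractor;
        this.validator = validator;
        this.decider = decider;
    }

    // The orchestrator owns sequencing and workflow state; agents stay specialized.
    Map<String, Object> process(Map<String, Object> document) {
        Map<String, Object> state = classifier.run(document);
        state = extractor.run(state);
        state = validator.run(state);
        return decider.run(state);
    }
}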

Why use Multi-Agent architecture?

  • Scalability

  • Clear separation of responsibilities

  • Easier troubleshooting / governance

  • Can plug-and-play individual models

  • Enables enterprise-wide reuse

🧠 Agent vs Multi-Agent (Interviewer-friendly comparison)

| Feature | Agentic AI | Multi-Agent System |
| --- | --- | --- |
| Definition | A single autonomous AI that acts | Multiple agents collaborating |
| Scope | Solves one complex task end-to-end | Solves a large workflow as a team |
| Example | Loan eligibility agent | Full digital lending agents for KYC → Eligibility → Decision → Agreement |
| Strength | Completes tasks independently | Division of labor, specialization |

🚀 How to Explain in an Enterprise Architecture Interview

Sample Answer:

“Agentic AI gives us digital workers that can autonomously take actions — not just generate text. Multi-agent architecture extends this by introducing a set of specialized agents (Document Classifier, Extraction Agent, Compliance Agent, Decision Agent) orchestrated through an agent framework like LangGraph, AutoGen, or Spring AI Agents. This design fits naturally into large BFSI workflows where responsibilities are distributed. It also enforces governance, observability, and performance isolation — important for regulated industries. In my architecture, agents interact through events (Kafka) and use shared memory (vector DB + metadata store) to maintain context. This ensures transparency, auditability, and consistency across decision steps.”

What is an “I don’t know” loop in AI agents?

An “I don’t know” loop happens when an AI agent repeatedly returns uncertainty responses instead of progressing, due to missing context, failed retrieval, or wrong state transitions.

Definition (Very Clear)

An “I don’t know” loop is when the LLM or agent keeps responding with variations of:

  • “I’m not sure.”

  • “I don’t know the answer.”

  • “I don’t have enough information.”

  • “Please provide more details.”

…but it repeats this in a cycle, because the orchestrator triggers the agent again without fixing the root cause.

Why it happens

  1. RAG retrieval returned no relevant context → similarity scores too low → the agent has nothing to reason with.

  2. Wrong agent/tool selected → the agent can’t perform the task, so it returns “I don’t know”.

  3. Prompt missing required instructions → the model stays uncertain.

  4. Temperature too high → random variations of “I’m not sure…”.

  5. Agent state machine stuck → the orchestrator keeps re-calling the same agent.

  6. Guardrails instruct the model to be conservative → it keeps refusing rather than solving.

🚫 How the loop looks in logs

Example log pattern:

Agent → Query: “Extract policy clause”
RAG → Returned 0 chunks (min similarity < 0.7)
LLM → “I don’t know, I do not have enough context.”
Orchestrator → Retry same agent
LLM → “I don’t know the answer.”
Orchestrator → Retry…
LLM → “I don’t have enough information.”

This is the I-don’t-know loop.

🔧 How to fix it

  1. If similarity < threshold → return fail state instead of retry

  2. Add fallback agent (e.g., “Missing Context Analyzer”)

  3. Improve prompt → add required context or guidance

  4. Stop agents after X retries

  5. Add telemetry to detect repeated uncertain responses

  6. Lower temperature to reduce randomness
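
A sketch combining fixes 1, 2, and 4 above: fail fast when retrieval is weak, escalate to a fallback agent, and hard-stop after a fixed retry budget. The Retriever/Agent types and the thresholds are assumptions for illustration:

// Illustrative loop guard; types and thresholds are assumptions, not a framework API.
class LoopGuard {
    interface Retriever { double topSimilarity(String query); }
    interface Agent { String answer(String query); }

    private static final int MAX_RETRIES = 3;          // fix 4: stop after X retries
    private static final double MIN_SIMILARITY = 0.70; // fix 1: fail-state threshold

    private final Retriever retriever;
    private final Agent agent;
    private final Agent fallbackAgent;                 // fix 2: fallback agent

    LoopGuard(Retriever retriever, Agent agent, Agent fallbackAgent) {
        this.retriever = retriever;
        this.agent = agent;
        this.fallbackAgent = fallbackAgent;
    }

    String invoke(String query) {
        if (retriever.topSimilarity(query) < MIN_SIMILARITY) {
            return "FAIL_STATE: context missing";      // fail state instead of retry
        }
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            String response = agent.answer(query);
            if (!response.contains("I don't know")) {
                return response;                       // made progress, exit the loop
            }
        }
        return fallbackAgent.answer(query);            // escalate after repeated uncertainty
    }
}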

⭐ One-line summary

An “I don’t know” loop happens when the agent repeatedly expresses uncertainty because retrieval, prompt, or state logic is broken — and the orchestrator keeps calling the same agent instead of resolving the root cause.

Example of an “I Don’t Know” Loop

User: Why did the API call return 404?
Model: I’m not sure.
User: The logs are above.
Model: I don’t know.
User: Path is /customers/123.
Model: I’m not sure.
User: Look at the payload.
Model: I don’t have enough information.

Here, the model gets stuck recycling uncertainty rather than reasoning.

Why We Track This in Hallucination Testing

“I don’t know” loops indicate:

  • Model uncertainty → Not reasoning deeply

  • Context loss → RAG or memory chain issues

  • Safety filter stuck in over-trigger mode

  • Breaking of agent tool-calling chain

  • Failure to maintain conversation grounding

In observability dashboards, you track them under:

  • Uncertainty patterns

  • Repetitive refusal segments

  • Low-confidence response clusters


How to Detect “I Don’t Know” Loops (Telemetry + Observability)

You monitor:

A. Token-level repetition pattern

Repeated substrings like:

  • “I don’t know”

  • “I’m not sure”

  • “I cannot determine”

  • “I don’t have enough info”
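
One simple way to surface these in telemetry is a case-insensitive pattern match over the last few responses. A minimal sketch, where the phrase list mirrors the bullets above and would be tuned in practice:

import java.util.List;
import java.util.regex.Pattern;

// Illustrative uncertainty detector for telemetry pipelines.
class UncertaintyDetector {
    private static final Pattern UNCERTAINTY = Pattern.compile(
            "i don'?t know|i'?m not sure|i cannot determine|don'?t have enough info",
            Pattern.CASE_INSENSITIVE);

    // Flags a loop when all of the last `window` responses express uncertainty.
    static boolean isUncertaintyLoop(List<String> responses, int window) {
        if (responses.size() < window) {
            return false;
        }
        return responses.subList(responses.size() - window, responses.size())
                .stream()
                .allMatch(r -> UNCERTAINTY.matcher(r).find());
    }
}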

B. Confidence scores

Low logit confidence clusters across turns.

C. Context window failures

Too many “missing context” responses indicate:

  • vector retrieval failure

  • wrong RAG pipeline

  • chunk mismatch

  • context is being truncated

D. Safety system triggers

The agent might be stuck in a safety fallback block.

E. Agent execution traces

Look at:

  • tool call failures

  • empty tool results

  • exceptions in chain

  • timeout on retrieval layer

4. How to Fix “I Don’t Know” Loops

1. Fix Context Loss

  • Reduce chunk size

  • Improve chunk overlap

  • Upgrade retrieval strategy (MMR, hybrid search, semantic search)

  • Increase context window

  • Validate tool outputs

2. Fix Safety Misfiring

  • Adjust guardrail rules

  • Add structured exceptions

  • Create allow-list for safe enterprise terms

3. Fix Prompt Architecture

Add explicit fallback rules:

Bad: “I don’t know.”

Good (New Rule): “If context is missing, explicitly request information from the user or tool-call again, do NOT loop.”

4. Add Confidence-Based Routing

If logit score < threshold:

  • call RAG again

  • call secondary model

  • escalate to human fallback
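
A sketch of that routing ladder, assuming each step checks its own confidence and returns null when it falls below threshold (all types here are illustrative):

import java.util.List;

// Illustrative confidence-based router: each rung is tried in order.
class ConfidenceRouter {
    interface Step {
        String tryAnswer(String query);  // returns null when confidence < threshold
    }

    private final List<Step> ladder;

    // Order: primary model, RAG retry, secondary model.
    ConfidenceRouter(Step primaryModel, Step ragRetry, Step secondaryModel) {
        this.ladder = List.of(primaryModel, ragRetry, secondaryModel);
    }

    String route(String query) {
        for (Step step : ladder) {
            String answer = step.tryAnswer(query);
            if (answer != null) {
                return answer;           // confident enough at this rung
            }
        }
        return "ESCALATED_TO_HUMAN";     // final rung: human fallback
    }
}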

5. Add Observability Alerts

Trigger alerts when:

  • 3 repetitive refusals

  • 2 missing context errors

  • 2 empty embeddings
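
A minimal counter-based sketch of those alert rules; the alert sink is a placeholder for whatever monitoring stack you use:

// Illustrative alert counters matching the thresholds above.
class LoopAlerts {
    private int refusals;
    private int missingContext;
    private int emptyEmbeddings;

    void onRefusal()        { if (++refusals == 3) fire("3 repetitive refusals"); }
    void onMissingContext() { if (++missingContext == 2) fire("2 missing context errors"); }
    void onEmptyEmbedding() { if (++emptyEmbeddings == 2) fire("2 empty embeddings"); }

    private void fire(String reason) {
        System.out.println("ALERT: " + reason);  // placeholder alert sink
    }
}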


Sample Answer:

“‘I don’t know’ loops are repetitive uncertainty patterns that occur when the model is stuck in a guardrail fallback, loses context, or receives empty tool-chain outputs. I monitor them using telemetry: token repetition, context window checks, tool-call logs, and confidence scoring. I eliminate them with improved prompt architecture, better retrieval strategies, guardrail tuning, and fallback logic. This ensures the agent is reliable, predictable, and safe in production.”


Failure Taxonomy for LLM Agents

(Hallucination vs Refusal vs Drift vs Looping — clear definitions + examples)

1. Hallucination

Definition: Model produces confident but incorrect information not grounded in retrieved data or the prompt.

Symptoms:

  • Fabricated RBI rules, customer names, policy clauses

  • Overconfident tone

  • Missing citations

Root Causes:

  • Weak retrieval grounding

  • Poor prompt constraints

  • Low-quality embeddings

2. Refusal Failure

Definition: Model declines to answer even when allowed or expected.

Symptoms:

  • “I cannot help with that.”

  • “This seems unsafe.”

  • Over-triggering internal safety rails

Root Causes:

  • Safety alignment overly strict

  • Prompt phrasing ambiguous

  • Incorrect classification of task as unsafe

3. Context Drift

Definition: Model gradually deviates from the original user goal due to deteriorating context over multiple turns.

Symptoms:

  • Topic slowly shifts

  • Incorrect memory carried forward

  • Agents responding based on earlier context, not latest

Root Causes:

  • Insufficient context rehydration

  • Old messages overweighted

  • Wrong retrieval chunks

4. Looping (a.k.a. “I don’t know” Loop)

Definition: Agent repeats the same uncertainty statement or fallback logic in a cycle.

Symptoms:

  • “I’m not sure, let me check…” → retrieves nothing → “I’m still not sure…”

  • Repetitive tool calls

  • Infinite chains in LangGraph/LangChain

Root Causes:

  • Similarity < threshold → retrieval fails

  • No fallback policy

  • Faulty planner/orchestrator

4. Architecture: How to Prevent Loops in Multi-Agent Systems

(Event-driven + Orchestrator-controlled)

🔹 High-Level Components

  1. User Proxy Agent → Accepts the query, normalizes intent.

  2. Planner / Orchestrator Agent → Decides which agent to call next. → Applies loop-detection & halt rules.

  3. Domain-Specific Agents (Fraud, Credit, Policy, Data) → Perform tasks with tools (SQL, APIs, RAG).

  4. Retrieval Layer (Vector DB + similarity threshold) → Sets min similarity, max results, confidence bands.

  5. Loop Detection & Guardrails Engine → Hard stop after N attempts. → Tracks:

    • previous tool calls

    • previous final answers

    • repeated uncertainty patterns

  6. Feedback Channel → Planner → If retrieval fails twice, escalate to a “Fallback agent”.

🔹 Loop Prevention Logic

1. Similarity Threshold Bands

if similarity < 0.70 → no retrieval, trigger fallback agent
if 0.70–0.80 → partial retrieval + uncertainty weighting
if > 0.80 → normal retrieval
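
The same bands expressed as a small Java routing method; the enum and class names are illustrative, and the thresholds follow the pseudocode above:

// Illustrative band router mirroring the thresholds above.
enum RetrievalMode { FALLBACK_AGENT, PARTIAL_WITH_UNCERTAINTY, NORMAL }

class SimilarityBandRouter {
    static RetrievalMode route(double topSimilarity) {
        if (topSimilarity < 0.70) {
            return RetrievalMode.FALLBACK_AGENT;           // no retrieval: trigger fallback
        }
        if (topSimilarity <= 0.80) {
            return RetrievalMode.PARTIAL_WITH_UNCERTAINTY; // weight the answer as uncertain
        }
        return RetrievalMode.NORMAL;                       // confident retrieval
    }
}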

2. Step-level Context Rehydration

Every agent call receives:

  • latest user question

  • top K retrieved chunks

  • last successful agent summary

  • error traces from previous step

Prevents drift.

3. Loop Detection

  • Track last 3 messages.

  • If pattern matches:

    • “not sure”,

    • “don’t have enough info”,

    • repeated tool calls with empty result…

→ Automatically terminate.

4. Orchestrator Fallback

Fallback agent options:

  • Ask Clarifying Question

  • Return Best-Effort Answer

  • Escalate to Human

5. Spring AI Code Snippet to Detect and Break Loops

(Practical, production-ready sample for interviews)

A. Configure Similarity Threshold

import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.context.annotation.Bean;

@Bean
public SearchRequest searchRequest() {
    // Chunks scoring below the threshold are filtered out, so the agent
    // never "reasons" over irrelevant context.
    return SearchRequest.builder()
            .topK(5)
            .similarityThreshold(0.75)  // <--- minimum similarity score
            .build();
}

B. Apply Loop Detection – Custom Interceptor

import java.util.ArrayDeque;
import java.util.Deque;

import org.springframework.stereotype.Component;

// Note: ResponseInterceptor is an application-defined hook (shown below),
// not a Spring AI interface.
interface ResponseInterceptor {
    String intercept(String response);
}

@Component
public class LoopDetectionInterceptor implements ResponseInterceptor {

    // Production code should keep this state per conversation/session.
    private final Deque<String> lastResponses = new ArrayDeque<>();

    @Override
    public String intercept(String response) {

        lastResponses.add(response);
        if (lastResponses.size() > 3) {
            lastResponses.removeFirst();
        }

        // Declare a loop only once three consecutive responses all express uncertainty.
        boolean isLoop = lastResponses.size() == 3
                && lastResponses.stream()
                        .allMatch(r -> r.contains("I don't know")
                                || r.contains("not sure")
                                || r.contains("cannot find"));

        if (isLoop) {
            return "I am unable to retrieve the right information. "
                    + "Let me switch to fallback logic.";
        }

        return response;
    }
}

C. Context Rehydration in Every Step

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;

public Prompt buildPrompt(String userMessage, List<Document> retrievedDocs, String lastSummary) {

    // Rehydrate every step with the latest query, top-K chunks, and last summary.
    String context = retrievedDocs.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n"));

    String template = """
        You are an enterprise agent.

        Last summary:
        {lastSummary}

        Retrieved context:
        {context}

        User query:
        {query}

        If you lack confidence, return: "NEED_FALLBACK".
        """;

    // PromptTemplate fills the {placeholders} and produces the final Prompt.
    return new PromptTemplate(template).create(Map.of(
            "query", userMessage,
            "context", context,
            "lastSummary", lastSummary));
}

D. Fallback Handling

// Orchestrator-side check for the sentinel emitted by the prompt above.
if (response.contains("NEED_FALLBACK")) {
    return fallbackAgent.handle(query);  // clarify, best-effort answer, or human escalation
}



 
 
 
