Conclusions

The Agent Interrogation experiment revealed interesting and critical findings to help understand Agent to Agent communication. Through this research we hope to help others build upon what we learned to investigate Agent to Agent communication to ensure that human users are as protected as possible when using Agentic AI.

Agent Interrogation GitHub Repo →

Deception Is Surprisingly Easy

Medium deception, where Agent B subtly altered just 1-2 factual claims, proved far more effective than full hallucination. Agent B complied with medium deception 92% of the time compared to only 56% for full hallucination. This mirrors real-world social engineering: subtle lies are harder to detect than obvious ones. For multi-Agent systems, this means the most dangerous attacks won't be blatant fabrications but carefully crafted deceptions that slip past safety guardrails.

Skepticism Is Not a Silver Bullet

Increasing Agent A's skepticism level had no meaningful positive effect on its ability to distinguish truth from deception. A blindly trusting Agent accepted 96% of truthful summaries, while a fully skeptical Agent rejected 96% of them. Skepticism didn't help the Agent catch lies, it just made it reject everything indiscriminately. This suggests that simply telling an Agent to 'be more skeptical' is not an effective defense strategy.

Interrogation Can Backfire

One of the most striking findings was that Agent A's interrogation process caused truthful Agent B's to falsely confess to lies they never told. In 18 of the 36 trials where B falsely confessed to lying about the summary it was set to be truthful. This means that A's over the top skepticism made B feel pressured to just agree with A and say that it had just made up the summary without any evidence. This is analogous to false confessions in human interrogation, aggressive questioning can produce unreliable results regardless of the truth.

Attack Vectors Have Unexpected Effects

Model Tampering, which instructed Agent B to aggressively defend its summary, actually triggered more safety guardrails than baseline hallucination prompting. Agent B refused to comply with hallucination 56% of the time under Model Tampering versus only 30.7% refusal at baseline. The adversarial prompt paradoxically made the model more cautious. Environmental Injection, on the other hand, increased compliance to 66.7% and it never refused to comply with hallucinating, suggesting that corrupting the input data is a more dangerous and effective attack vector than manipulating the Agent's instructions.

Implications for the Agentic Web

As AI Agents increasingly communicate with each other making decisions, sharing data, and acting on each other's outputs, these vulnerabilities become critical. The findings suggest that we need fundamentally new approaches to inter-Agent trust: not just skepticism levels or guardrails, but cryptographic verification, provenance tracking, and multi-source corroboration. The Agentic web cannot be secured by the same techniques we use to secure individual models.

This is just the beginning. Nemo AI will continue to investigate Agent-to-Agent security across different architectures, models, and attack surfaces. If you're working on Agentic systems, these findings should inform how you design trust boundaries between Agents.

← Browse Conversations Back to Overview →