
AI Debugging Tool Tackles Complex Agent Failures

March 23, 2026 · 3 min read

As artificial intelligence agents become more complex, developers face a new challenge: debugging systems that operate across hundreds of steps with prompts spanning thousands of lines. When something goes wrong in these intricate AI workflows, the context that caused the failure is often buried deep within lengthy execution traces. This debugging problem is a significant bottleneck for teams building production AI agents, where traditional debugging approaches fall short against the scale and complexity of modern AI systems.

The researchers behind LangSmith have developed Polly, an AI assistant specifically designed to read through 300-step execution traces, identify failures, and explain exactly what happened. Polly represents their answer to the question of how developers can effectively debug complex AI agents without manually scanning through overwhelming amounts of data. The tool is now generally available to all LangSmith users after previously being limited to specific areas of the platform.

The methodology behind Polly grew out of extensive work with teams building production agents on LangSmith before development began. The researchers identified consistent failure patterns that kept emerging across different projects: traces too long to scan manually, prompts too tangled to reason about logically, and conversations too sprawling to follow completely. This research informed Polly's design as a context-aware assistant that understands what developers are looking at and can act on that information throughout an entire debugging session.

Polly's capabilities now extend across multiple areas of the LangSmith platform, addressing the reality that the hardest debugging problems don't exist on just one page. Developers can start in a trace view, compare to another experiment, pull examples into datasets, and then fix prompts while Polly maintains context throughout this entire workflow. In thread views, Polly analyzes entire conversations between users and AI agents, helping developers understand user sentiment, conversation outcomes, and interaction patterns without reading through every message manually.

The tool also assists with writing and refining evaluator logic directly within the Evaluators pane. Developers can ask Polly to create evaluators that check for hallucinations, improve existing evaluators' accuracy, or add handling for edge cases. Polly generates the necessary code, explains what each check accomplishes, and iterates with developers through refinement cycles. After running evaluations, developers can ask Polly which experiment performed best, receiving recommendations grounded in actual data rather than manual analysis.
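To make the evaluator workflow concrete, here is a minimal sketch of the kind of check such an evaluator might start from: a crude lexical groundedness heuristic that flags answer sentences whose content words are absent from the source material. The function name, simplified signature, and scoring threshold are illustrative assumptions, not the exact LangSmith evaluator interface or anything Polly is documented to generate.

```python
def groundedness_evaluator(answer: str, source_text: str) -> dict:
    """Flag answer sentences whose content words don't appear in the source.

    A deliberately simple lexical heuristic: a sentence counts as grounded
    if at least half of its longer words occur in the source text. Real
    hallucination evaluators would typically use an LLM judge or embeddings;
    this sketch just shows the shape of an evaluator's output.
    """
    source_words = set(source_text.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    ungrounded = []
    for sentence in sentences:
        words = [w.strip(",;:").lower() for w in sentence.split()]
        content = [w for w in words if len(w) > 4]  # skip short function words
        if content and sum(w in source_words for w in content) / len(content) < 0.5:
            ungrounded.append(sentence)
    score = 1.0 - len(ungrounded) / max(len(sentences), 1)
    # Evaluators commonly return a named score plus supporting detail.
    return {"key": "groundedness", "score": score, "ungrounded": ungrounded}
```

An assistant like Polly would then iterate on edge cases with the developer, for example tuning the threshold or swapping the lexical check for a model-based judge.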

Polly's design acknowledges that it doesn't replace engineering judgment but rather handles the parts of debugging that typically slow developers down. The assistant knows what developers are looking at, acts on that context, and remains available throughout entire debugging sessions. This approach addresses the core insight identified during development: that debugging AI agents differs fundamentally from debugging other types of software due to the scale and complexity involved.

Accessing Polly requires LangSmith users to add an API key for their model provider as a workspace secret, a process described as taking just two minutes. For new LangSmith users, the researchers recommend first setting up tracing so data flows into the platform before Polly can begin assisting with understanding system behavior and identifying improvement opportunities. The tool is accessible from the bottom-right corner of any LangSmith page and can be opened with keyboard shortcuts across different operating systems.
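As a rough sketch of the tracing prerequisite, enabling tracing typically comes down to setting LangSmith's standard environment variables before running an application (variable names assume the current LangSmith SDK conventions; the model-provider key that powers Polly itself is added separately in the UI as a workspace secret):

```shell
# Turn on tracing so run data flows into LangSmith.
export LANGSMITH_TRACING=true
# Authenticate the SDK with your LangSmith account.
export LANGSMITH_API_KEY="<your-langsmith-api-key>"
```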

The researchers emphasize that Polly represents a tool for augmenting developer capabilities rather than replacing human judgment entirely. By handling the labor-intensive aspects of scanning long traces, analyzing complex conversations, and comparing experimental results, Polly allows developers to focus their engineering expertise on higher-level problem-solving and system improvements. This balanced approach reflects the practical realities of debugging production AI systems, where both automated assistance and human insight remain essential.