Engineering Complex AI Solutions: Observability and Testing of Multi-Agent Solutions
Speaker
Dmitri
Lead systems engineer with more than 20 years of experience in IT:
- 3 years in AI-based solution development
- 8 years of S-SDLC methodology implementation
- 7 years of security architecture development
- 4 years of experience developing a scalable non-functional testing solution using private clouds and AWS-based APIs
- solid knowledge of building secure and resilient architectures in multi-cloud environments
- 9 years of experience in consulting services for external and internal EPAM accounts
- 6 years of experience in project and team management
- 8 years of experience in solution architecture development and deployment
- 10 years of security concept and test development, including regular security audits
- 8 years of experience developing and deploying Continuous Delivery and Continuous Integration concepts
- Experience in process development
- 10 years of experience developing and deploying Unix/Linux-based infrastructures
Abstract
As AI agents evolve from simple chatbots to complex multi-agent systems utilizing the Model Context Protocol (MCP), manual validation becomes impossible. During this talk, I will demonstrate how to architect a quality assurance loop for these solutions. No theoretical fluff: I will focus on practical analysis of automated pipeline results, interpreting Langfuse reports for cost and performance, and ensuring reliability from an AI Architect/System Engineer perspective.
Description
Deploying agentic AI solutions is no longer just about prompt engineering; it involves orchestrating complex interactions between LLMs, multiple agents, and MCP servers. When an agent fails, did it hallucinate, fail to call a tool, or receive bad data? Traditional manual verification cannot scale to debug these multi-step workflows, and production regressions can lead to runaway costs or dangerous data mishandling.
This talk demonstrates how to establish a rigorous testing and observability strategy for complex AI solutions using Langfuse, covering the following practices:
1 - Automated Quality Analysis: Methodologies for scoring accuracy and relevance in multi-turn conversations using the LLM-as-a-Judge concept (a minimal scoring sketch follows this list);
2 - Validating Tool Usage: Analyzing traces to ensure MCP servers and external tools are invoked correctly;
3 - Cost & Performance Auditing: Using granular reports to detect latency spikes and token usage anomalies before they hit production;
4 - Root Cause Analysis: Interpreting execution traces to pinpoint exactly where logic broke down in the reasoning chain.
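To make the LLM-as-a-Judge idea in item 1 concrete, here is a minimal Python sketch. It is illustrative only: the judge prompt, the judge model name, and the score names are assumptions rather than the exact setup from the talk, and the scoring call uses the Langfuse Python SDK v2 score() method (newer SDK versions expose an equivalent call under a different name).

# Minimal LLM-as-a-Judge sketch (illustrative; prompt wording, model name,
# and score names are assumptions, not the exact setup from the talk).
import json
from openai import OpenAI
from langfuse import Langfuse

client = OpenAI()       # expects OPENAI_API_KEY in the environment
langfuse = Langfuse()   # expects LANGFUSE_* credentials in the environment

JUDGE_PROMPT = (
    "You are a strict evaluator. Given a user question and an agent answer, "
    'return JSON: {"relevance": <0..1>, "accuracy": <0..1>, "reason": "<short explanation>"}'
)

def judge_turn(question: str, answer: str, trace_id: str) -> dict:
    """Score one conversation turn and attach the scores to its Langfuse trace."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any judge-capable model will do
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(response.choices[0].message.content)

    # Push numeric scores back to the trace so they appear in Langfuse reports
    # (v2 SDK call; the method name differs in newer SDK versions).
    for metric in ("relevance", "accuracy"):
        langfuse.score(
            trace_id=trace_id,
            name=metric,
            value=float(verdict[metric]),
            comment=verdict.get("reason", ""),
        )
    return verdict

In an automated pipeline, a function like judge_turn would run over a curated dataset of conversations, and the resulting scores feed the reports analyzed during the talk.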
You will see architectural patterns, real-world examples of pipeline execution reports, and deep-dive analytics via Langfuse. The focus is on the engineering of the CI/CD process and methodology required to maintain stable, high-performing AI solutions in an enterprise environment.
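As one hedged example of such a CI/CD quality gate (the report fields, metric names, and threshold values below are hypothetical, not the talk's exact configuration), a pipeline step can read the aggregated evaluation report produced by the judging run and fail the build when quality, latency, or cost metrics regress:

# Hypothetical CI quality gate: fail the pipeline when aggregated evaluation
# metrics regress. Report fields and thresholds are assumptions; in practice
# the numbers come from the Langfuse-backed pipeline reports discussed above.
import json
import sys

THRESHOLDS = {
    "relevance": 0.80,         # minimum mean judge relevance score
    "accuracy": 0.85,          # minimum mean judge accuracy score
    "p95_latency_s": 8.0,      # maximum 95th-percentile end-to-end latency
    "cost_per_run_usd": 0.05,  # maximum average cost per pipeline run
}
MINIMUM_METRICS = {"relevance", "accuracy"}  # higher is better for these

def run_gate(report_path: str) -> int:
    with open(report_path) as f:
        report = json.load(f)  # e.g. {"relevance": 0.83, "accuracy": 0.87, ...}

    failures = []
    for metric, limit in THRESHOLDS.items():
        value = report.get(metric)
        if value is None:
            failures.append(f"missing metric: {metric}")
        elif metric in MINIMUM_METRICS and value < limit:
            failures.append(f"{metric}={value:.2f} is below the minimum {limit:.2f}")
        elif metric not in MINIMUM_METRICS and value > limit:
            failures.append(f"{metric}={value} exceeds the maximum {limit}")

    if failures:
        print("Quality gate FAILED:\n  " + "\n  ".join(failures))
        return 1
    print("Quality gate passed.")
    return 0

if __name__ == "__main__":
    sys.exit(run_gate(sys.argv[1] if len(sys.argv) > 1 else "eval_report.json"))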