Measuring Experiments in LLMs: A/B Tests and Automated Testing
Speakers
Kader Miyanyedi
I have been a backend developer for 4 years, working primarily with Python and Django. I enjoy sharing what I’ve learned at previous PyCon talks and through writing on Medium, helping others improve their coding and AI skills.
Özge Çinko
Abstract
Even small changes in LLMs can impact output quality, safety, and user experience. In this talk, we’ll show how to log experiments with Langfuse, automate tests with Pytest, and enrich those tests with random data scenarios generated by Hypothesis. Participants will learn how to combine code, automated tests, and data-driven A/B tests to improve LLM development.
Description
Modern LLM development no longer means running a single model. Even small changes can impact output quality, safety, and user experience. In this session, attendees will learn how to log experiments with Langfuse, automate tests with Pytest, and enrich those tests with random data scenarios generated by Hypothesis. They will also explore why A/B testing is critical in LLM development and how to measure which model or version performs best using real user data.
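For example, a property-based test can assert basic invariants over any prompt the model might receive. The sketch below is illustrative, not part of the talk materials: `generate_reply` is a hypothetical wrapper standing in for a real model call, and the invariants (non-empty, bounded output) are placeholders.

```python
# A minimal sketch: property-based testing of an LLM wrapper with Pytest and Hypothesis.
from hypothesis import given, settings, strategies as st


def generate_reply(prompt: str) -> str:
    """Hypothetical LLM wrapper; replace with a real model call."""
    return f"Echo: {prompt.strip()}"


@given(prompt=st.text(min_size=1, max_size=200))
@settings(max_examples=50, deadline=None)  # LLM calls are slow; disable Hypothesis' per-example deadline
def test_reply_is_nonempty_and_bounded(prompt: str) -> None:
    reply = generate_reply(prompt)
    # Basic invariants that should hold for any input the model receives.
    assert isinstance(reply, str)
    assert reply != ""
    assert len(reply) < 10_000
```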
By the end of the session, attendees will understand how to:
- Design data-driven A/B tests in LLM development workflows
- Combine code, tests, and experiments to create reliable, repeatable testing processes
- Write automated scenario and edge-case tests using Pytest and Hypothesis
- Log experiments and track model performance and user-facing outputs with Langfuse (see the sketch after this list)
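As a rough illustration of the logging step, the sketch below assumes the Langfuse v2 Python SDK's low-level trace API (the decorator-based and v3 interfaces differ); the trace name, model name, variant label, and score value are all placeholders.

```python
from langfuse import Langfuse

# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST are set in the environment.
langfuse = Langfuse()

# One trace per request, tagged with the A/B variant that served it.
trace = langfuse.trace(name="ab-test", metadata={"variant": "model-b"})

generation = trace.generation(
    name="summarize",
    model="gpt-4o-mini",            # placeholder model name
    input="Summarize: <user text>",
)
generation.end(output="<model output>")

# Attach an evaluation score so variants can be compared later.
trace.score(name="helpfulness", value=0.8)

langfuse.flush()  # send buffered events before the process exits
```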
This session gives Python developers a framework to integrate experiments, tests, and data into LLM development, enabling reliable, repeatable, and data-driven workflows.