Measuring Experiments in LLMs: A/B Tests and Automated Testing
Speakers
Kader Miyanyedi
I have been a backend developer for 4 years, working primarily with Python and Django. I enjoy sharing what I’ve learned at previous PyCon talks and through writing on Medium, helping others improve their coding and AI skills.
Özge Çinko
Abstract
Even small changes in LLMs can impact output quality, safety, and user experience. In this talk, we’ll show how to log experiments with Langfuse, automate tests with Pytest, and enrich those tests with random data scenarios generated by Hypothesis. Participants will learn how to combine code, automated tests, and data-driven A/B tests to improve LLM development.
Description
Modern LLM development no longer means running a single model. Even small changes can impact output quality, safety, and user experience. In this session, attendees will learn how to log experiments with Langfuse, automate tests with Pytest, and enrich those tests with random data scenarios generated by Hypothesis. They will also explore why A/B testing is critical in LLM development and how to measure which model or version performs best using real user data.
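For example, a property-based test can assert basic invariants over any prompt the model might receive. The sketch below is illustrative, not part of the talk materials: `generate_reply` is a hypothetical wrapper standing in for a real model call, and the invariants (non-empty, bounded output) are placeholders.

```python
# A minimal sketch: property-based testing of an LLM wrapper with Pytest and Hypothesis.
from hypothesis import given, settings, strategies as st


def generate_reply(prompt: str) -> str:
    """Hypothetical LLM wrapper; replace with a real model call."""
    return f"Echo: {prompt.strip()}"


@given(prompt=st.text(min_size=1, max_size=200))
@settings(max_examples=50, deadline=None)  # LLM calls are slow; disable Hypothesis' per-example deadline
def test_reply_is_nonempty_and_bounded(prompt: str) -> None:
    reply = generate_reply(prompt)
    # Basic invariants that should hold for any input the model receives.
    assert isinstance(reply, str)
    assert reply != ""
    assert len(reply) < 10_000
```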
By the end of the session, attendees will understand how to:
- Design data-driven A/B tests in LLM development workflows
- Combine code, tests, and experiments to create reliable, repeatable testing processes
- Write automated scenario and edge-case tests using Pytest and Hypothesis
- Log experiments and track model performance and user-facing outputs with Langfuse (see the sketch after this list)
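As a rough illustration of the logging step, the sketch below assumes the Langfuse v2 Python SDK's low-level trace API (the decorator-based and v3 interfaces differ); the trace name, model name, variant label, and score value are all placeholders.

```python
from langfuse import Langfuse

# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST are set in the environment.
langfuse = Langfuse()

# One trace per request, tagged with the A/B variant that served it.
trace = langfuse.trace(name="ab-test", metadata={"variant": "model-b"})

generation = trace.generation(
    name="summarize",
    model="gpt-4o-mini",            # placeholder model name
    input="Summarize: <user text>",
)
generation.end(output="<model output>")

# Attach an evaluation score so variants can be compared later.
trace.score(name="helpfulness", value=0.8)

langfuse.flush()  # send buffered events before the process exits
```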
This session gives Python developers a framework to integrate experiments, tests, and data into LLM development, enabling reliable, repeatable, and data-driven workflows.