Data Day Schedule

Thursday, April 24

10:30

Beyond dbt: Modern SQL Transformation and Lineage with sqlglot and sqlmesh

Tomas Peluritis
Hear more about the evolving landscape of SQL transformation tools and data lineage challenges. Explore how sqlglot enables powerful SQL parsing and transformation capabilities, and see practical demonstrations of sqlmesh as a modern alternative to dbt. Learn about open-source approaches to data lineage tracking and discover how these tools are shaping the future of data engineering workflows.
Room: 101
Data Day - Apr 24
Talk
10:30–10:55

Cutting the price of Scraping Cloud Costs

Ed Crewe
A case study of rewriting a simple data pipeline involving Python, a pinch of Go, Git workflows, Airflow, Postgres and Cloud. It investigates some common assumptions and principles of designing data pipelines, along with the benefits and issues of the tools and how these may be handled. I hope this case study of a pipeline rewrite will give you insights that are applicable to Python use for your own data pipelines, and into cloud pricing.
Room: 3
Data Day - Apr 24
Talk
10:30–10:55

Image deduplication using embeddings

Jonas Jarutis
This presentation examines approaches for detecting and eliminating near-duplicate images across datasets ranging from small collections to repositories containing millions of images. We will compare the performance of several embedding models, including CLIP, ResNet, and other variants, assessing their ability to capture semantic and perceptual similarity and performance tradeoffs. We will benchmark various vector database solutions on query speed, memory consumption, and scalability. We will demonstrate p
Room: 2
Data Day - Apr 24
Talk
10:30–10:55

Data-Driven Impact in Africa

Chris Achinga
Many NGOs in Africa are trying to help improve lives. The problem is that there is not enough data to help them understand us well enough to design impactful humanitarian programs. Discover how NGOs can leverage data science to understand and serve African communities better. Learn about data collection, privacy, and impact assessment.
Room: Workshop 2
Data Day - Apr 24
Workshop
10:30–11:25

11:00

Read Your Stocks Via Screenshots

Ąžuolas Krušna
In this talk, we’ll build a Python app that extracts stock transaction data from screenshots or documents. We’ll refine screenshot extraction accuracy using OCR and regex in an interactive lab environment, store structured data in DuckDB, and visualize insights with Streamlit—transforming raw data into actionable trading insights. This approach is highly adaptable and can be applied to various industries.
Room: 3
Data Day - Apr 24
Talk
11:00–11:25

Data Warehouses Meet Data Lakes

Mauro Pelucchi
Many organizations have migrated their data warehouses to data lake solutions in recent years. With the convergence of the data warehouse and the data lake, a new data management paradigm has emerged that combines the best of two approaches: the bottom-up approach of big data and the top-down approach of a classic data warehouse.
Room: 101
Data Day - Apr 24
Talk
11:00–11:25

11:30

From Chaos to Control: Automating BI Tools with Pydantic and Python

Patricia Goldberg
Maintaining Business Intelligence (BI) tool governance, managing permissions, syncing documentation, and handling schema changes can be chaotic. This talk explores how Python, Pydantic, and smart design patterns automate these tasks, ensuring seamless BI tool governance. Learn how to auto-sync table metadata, adjust queries on column renames, and enforce permissions effortlessly. With real-world examples, discover how to transform BI maintenance from a headache into a streamlined, automated process.
Room: 101
Data Day - Apr 24
Talk
11:30–11:55

Unlocking Web data with TLSNotary, zkProofs and LLM while preserving privacy

Jayaditya Gupta
Imagine sharing data with a third party without revealing any information while still proving you own the data. The core of this talk is the use of the autogen framework (PyAutogen, made by Microsoft) and the TLSNotary protocol (made by tlsnotary.org). It shows how to leverage LLMs and TLSNotary to make data portable while maintaining privacy. If you're looking for a way to make data portable without compromising on security, check out this talk.
Room: 3
Data Day - Apr 24
Talk
11:30–11:55

Orchestrating an end-to-end Data Engineering Workflow: Leveraging Python in Apache Beam and Airflow

Sadeeq Akintola
This talk explores the synergy between Apache Beam and Apache Airflow, demonstrating how to create a robust, end-to-end data engineering workflow. We'll dive into the challenges of orchestrating complex data processing tasks and show how combining Airflow's scheduling capabilities with Beam's data processing framework can create more efficient and manageable data pipelines. The session will cover integration with Google Cloud Platform services, including Cloud Functions, BigQuery, and Gemini AI models.
Room: Workshop 2
Data Day - Apr 24
Workshop
11:30–12:25

cluster-experiments: A Python library for end-to-end A/B testing workflows

David Masip
In this talk, we introduce cluster-experiments, a Python library designed to facilitate end-to-end A/B testing workflows, including power analysis, experiment analysis, and variance reduction techniques.
Room: Workshop 1
Data Day - Apr 24
Talk
11:30–12:15

12:00

Python on the Pitch: How Germany will win World Cup 2026

Ruslan Korniichuk
We will dive into the fascinating world of football analytics, showcasing how to collect and process match data (e.g., from Hudl Statsbomb, Sportmonks, and Understat), including player tracking, event logs, and tactical formations. Attendees will walk away with practical knowledge and Jupyter Notebooks that demonstrate Python's power in decoding modern football strategies.
Room: 3
Data Day - Apr 24
Talk
12:00–12:25

Variable Selection: What your model can't tell you

James Donahue
Variable selection is often left up to an algorithm. However, controlling for some variables can improve measurement accuracy, and thus overall performance. On the other hand, certain "bad" controls can block pathways of relationships between variables that we want to preserve or create spurious correlations. Using real and simulated data, I explain when to reconsider your controls, and why that may significantly improve model accuracy.
Room: 2
Data Day - Apr 24
Talk
12:00–12:25

Real-Time Data Analytics at Scale: From Ingestion to Retrieval

Tung Hoang
Real-time data analytics is essential for powering modern applications like monitoring, personalization, search, and, to some extent, RAG pipelines. However, building systems that can handle real-time ingestion, indexing, and retrieval at scale is no trivial task. This talk provides actionable insights into designing and maintaining such systems at scale using best practices.
Room: 101
Data Day - Apr 24
Talk
12:00–12:25

13:00

Working for a Faster World: Accelerating Data Science with Less Resources

Maximilian Lattka
In data science, speed matters as much as accuracy, especially when users expect quick results. This talk explores simple yet effective techniques to boost performance, using a real-life case of accelerating a Panel app. While some strategies are case-specific, most apply broadly to data-driven projects.
Room: 101
Data Day - Apr 24
Talk
13:00–13:25

Accelerating privacy-enhancing data processing

Florian Stefan
Our mission is simple but profound: to improve and extend lives by learning from the experience of every person with cancer. This talk explains how we transform sensitive data from heterogeneous environments into research-grade datasets, and how we shift insight generation left to iterate faster.
Room: 2
Data Day - Apr 24
Talk
13:00–13:25

Real-time visualization using dash and plotly

Hampo, JohnPaul A.C.
In this talk, a simple live dashboard will be developed using plotly and dash on a practical dataset. This eases the job of presenting data by hiding the technicalities and code from non-data people.
Room: 3
Data Day - Apr 24
Talk
13:00–13:25

Build & Deploy Apps like a (pro) Data Scientist using Streamlit

Siddharth Gupta
Do you find it complicated to learn a traditional web framework just to push your data science work online? Worry no more! Streamlit can help speed things up, as it is designed for exactly this purpose: creating beautiful data-related web apps that can be deployed in minutes. In this hands-on tutorial, we’ll go through various features of Streamlit and build a small lyric fetcher app based on a curated dataset of around 24K Billboard top-100 songs.
Room: Workshop 2
Data Day - Apr 24
Workshop
13:00–13:55

Using feature stores to deliver awesome models

Laurynas Stašys
Mantas Cepulkovskis
In today’s fast-paced machine learning environment, the ability to efficiently manage and reuse features across multiple models is crucial. This workshop explores how leveraging a feature store can streamline ML pipelines by ensuring consistency and accelerating deployment cycles. Participants will gain hands-on experience with setting up, managing, and integrating feature stores into their existing workflows—transforming raw data into valuable, production-ready features.
Room: Workshop 1
Data Day - Apr 24
Workshop
13:00–13:55

13:30

Top 5 Lessons from a Senior Data Scientist

Megan
A successful data scientist needs to have solid coding skills and stay up to date with the latest artificial intelligence and machine learning algorithms. However, there are many other skills and experiences that help you succeed in data science. In this talk, Megan shares five of the most helpful career lessons she has learned in over eight years as a data scientist. These lessons include tips on advocating for your own career development, collaborating with other teams, and more.
Room: 101
Data Day - Apr 24
Talk
13:30–13:55

The Power of Python for Data Management (or How You’ve Been Doing Data Management All Along Without Even Realizing It)

Vidmantė Čižienė
Are you using Airflow or Pandas? Great! You've contributed to better data management at your organization. The breakthrough of AI has reignited focus on high-quality data and on effective data governance (not as scary as it sounds!) and management practices. AI needs fit-for-purpose data to reach its potential, and we already have a powerful toolkit (Airflow, Pandas, Matplotlib/Seaborn, or Great Expectations, for example) to optimize workflows and ensure data quality.
Room: 3
Data Day - Apr 24
Talk
13:30–13:55

Temporal: Bulletproof Workflows

Ruslan Korniichuk
Temporal is an open source, distributed, and scalable workflow orchestration platform designed to execute mission-critical business logic with resilience. Manage failures, network outages, flaky endpoints, long-running processes and more, ensuring your workflows never fail.
Room: 2
Data Day - Apr 24
Talk
13:30–13:55

14:00

Unlocking Probability Distributions with Python

Elvis Kwabena Asare Nkrumah
In this hands-on session, we'll explore the world of probability distributions using Python. From Bernoulli to Gaussian, we'll demonstrate how to apply these distributions to solve real-world problems. Attendees will learn how to use popular Python libraries like NumPy, SciPy, and Matplotlib to visualize and calculate probabilities.
Room: Workshop 1
Data Day - Apr 24
Workshop
14:00–14:55

A Crash course in Time Series Forecasting from Naive to Foundational

Pietro Peterlongo
Forecasting is a common activity with clear business value in various domains, but it is not a skill that many Data Scientists have or feel confident about. In this crash course I will cover the fundamentals of Time Series forecasting, from the basic methods to more advanced techniques. I will do this by showcasing practical code examples using libraries from Nixtla.
Room: 3
Data Day - Apr 24
Talk
14:00–14:25

Investing: Technical Analysis libraries in Python

Ruslan Korniichuk
We will explore the landscape of technical analysis libraries available for the Python language, including popular choices like TA-Lib (aka talib), Pandas TA, and the Technical Analysis library (aka bukosabino/ta).
Room: Workshop 2
Python Day - Apr 23
Workshop
14:00–14:55

Smarter Retrieval, Better Generation: Improving RAG Systems

David Batista
Good retrieval performance is key to an effective RAG system, as it ensures relevant information is selected, directly impacting augmentation and generation quality. My presentation focuses on RAG indexing and retrieval, exploring methods to convert text into searchable formats, comparing techniques, and analyzing their advantages, disadvantages, and performance on an annotated dataset to enhance document retrieval based on user queries.
Room: 2
Data Day - Apr 24
Talk
14:00–14:25

AI Agents and Digital Trust, The Utilisation of Python In Enhancing Safety In The African Health Care System

Anuoluwapo Gabriel
AI agents and digital trust are revolutionizing African healthcare via Python-powered innovations. Advanced machine learning models, built with robust Python libraries, enhance diagnostic precision, predictive analytics, and medical imaging analysis while bolstering cybersecurity and enabling telemedicine. Strict adherence to data privacy, transparency, and ethical standards is crucial to building trust, overcoming infrastructure challenges, and driving sustainable improvements in patient safety and care.
Room: 101
Data Day - Apr 24
Talk
14:00–14:25

14:30

Automate Brag Document Writing with LLMs

Ludvig Wärnberg Gerdin
A brag document is a powerful tool to highlight your work by making it visible and measurable and by demonstrating its real impact on you and your organisation, but such a document can be time-consuming to maintain. My talk explores automation of the writing process with language models fed with data from tools like Jira, Notion, and code commits. Learn how to save time, avoid missing achievements, and make your work stand out. Ideal for engineers at all levels looking to grow their impact.
Room: 2
Data Day - Apr 24
Talk
14:30–14:55