Dataset Updates Without Losing Your Mind
Speaker
Oleksii Liashuk
Oleksii Liashuk is a lead ML engineer working on Python computer vision systems such as object detection, segmentation, OCR, and object tracking. He focuses on practical ML problems like car damage detection, container number recognition in difficult conditions, and continuous dataset and model maintenance. Oleksii has hands-on experience with dataset updates, model retraining cycles, and deployment of ML systems using Docker and Kubernetes.
Abstract
Many teams work with datasets that evolve over time. What starts as a simple setup quickly turns into chaos once updates become regular. In this talk, I share a practical workflow for managing dataset updates by splitting the process into clear stages, each represented by a Python script. This approach was used in production for two years on image datasets from 2,000 to 200,000 samples, and it helps small teams reduce cognitive load and keep dataset and model updates predictable.
Description
A longer description of the talk:
Many teams have to work with datasets that evolve over time. What starts as a simple setup quickly turns into a serious problem once dataset updates become regular. It becomes really painful to find the right methods in a set of strangely named scripts that were obvious a month ago but not anymore: scattered Python files and constant attempts to remember what to run - and in what order - to prepare new data, validate it, and retrain the model. Sounds familiar?

In this talk, I want to share what happened when dataset management was split into clear, understandable stages, each represented by a dedicated Python script. This workflow was developed in production and used for two years on datasets ranging from 2,000 to 200,000 images, with constant updates and model retraining cycles. We split "obvious" operations into small scripts with clear boundaries, such as data collection, annotation, validation, merging, training, and evaluation. This lets the team simply follow the same steps on each dataset update and greatly reduces cognitive load.

The talk focuses mainly on datasets used for object detection, segmentation, and OCR, but the patterns discussed can be applied to most file-based datasets used in Python data projects.
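To make the stage-per-script idea concrete, here is a minimal sketch of what such a pipeline can look like. All stage names, the registry, and the placeholder data are illustrative assumptions, not the speaker's actual code; in the real workflow each stage would live in its own script (e.g. collect.py, validate.py, merge.py) rather than one file.

```python
"""Illustrative sketch: dataset updates as small stages with a fixed order."""

# Each stage is a small, single-purpose step with a clear boundary.
# The hypothetical "ctx" dict stands in for files passed between scripts.

def collect(ctx):
    # Gather new samples (placeholder file names for illustration).
    ctx["new_images"] = ["img_001.jpg", "img_002.jpg", "notes.txt"]

def validate(ctx):
    # Fail fast on obviously broken samples before merging or training.
    ctx["new_images"] = [p for p in ctx["new_images"] if p.endswith(".jpg")]

def merge(ctx):
    # Merge validated samples into the existing dataset, deduplicated.
    existing = set(ctx.get("dataset", []))
    ctx["dataset"] = sorted(existing | set(ctx["new_images"]))

# A fixed, explicit order: nobody has to remember what to run, or when.
STAGES = [collect, validate, merge]

def run_update(ctx=None):
    ctx = ctx or {}
    for stage in STAGES:
        stage(ctx)
    return ctx

if __name__ == "__main__":
    print(run_update()["dataset"])
```

The point of the sketch is the explicit `STAGES` list: the update procedure is written down once, in code, instead of living in someone's memory as "which script to run next".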