Data versioning
Speaker
Federico Marchesi
Hi, my name is Federico Marchesi. During my career, I have had the pleasure of working with a variety of ML systems, ranging from complex OLAP systems to distributed machine learning inference platforms, and I have also seen the rise of modern data lakehouses firsthand. I’m especially passionate about data, which I believe is the foundation of modern software, not just of ML. Outside of work, I enjoy staying active through mountain biking, swimming, and running. I’m also a motorsport enthusiast.
Abstract
Git is one of the fundamental technologies that every software tech stack depends on. The ability to version code and control the flow of development is a focus shared by virtually every software project. We take for granted that everyone working in the industry can version code properly.
In this talk, we’ll explore what data versioning means and how we can borrow methodologies from software engineering to better manage our data.
Description
In today’s era of Big Data and ML systems, every practitioner has to handle not only high-quality code, but also data… a lot of data! As systems evolve, data accumulates, and ML models start to drift in production, a solid data strategy becomes essential. We should treat data the same way we treat code: carefully, with a versioning system, reviews, and PRs, and with pipelines that can reproduce the state of complex data-driven systems.