2025-04-25, Zeiss Plenary (Spectrum)
In contemporary data-driven environments, the seamless integration of data into automated workflows is paramount. The reliability of automation, however, is constantly threatened by breaking changes in the source data. The Data-as-Code (DaC) paradigm addresses this challenge by treating data as a first-class citizen within the software development lifecycle.
DaC streamlines data distribution by encapsulating dataset retrieval, together with a data contract, within Python packages. This approach makes it easy to enforce data quality, uses semantic versioning to prevent errors in the data pipeline, and spares data scientists the boilerplate code needed to load the data for their ML models, improving efficiency and consistency. This presentation will delve into the implementation of DaC, demonstrate its practical applications, and discuss the benefits it offers in modern data workflows.
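To make the idea concrete, here is a minimal sketch of what such a package could expose. It is not the API of any particular DaC tool: the names `Schema`, `Column`, `load`, and the inline CSV are hypothetical, and only the Python standard library is used. The point is that retrieval and contract validation live together behind one call.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical stand-in for the real data source (a file, a database, an API).
_RAW_CSV = "customer_id,age\n1,34\n2,51\n"


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type


class Schema:
    """Hypothetical data contract: the columns and types consumers may rely on."""

    customer_id = Column("customer_id", int)
    age = Column("age", int)

    @classmethod
    def columns(cls):
        return [cls.customer_id, cls.age]

    @classmethod
    def validate(cls, rows):
        """Cast each field to its declared type; raise if a column is missing
        or a value cannot be cast."""
        validated = []
        for row in rows:
            missing = [c.name for c in cls.columns() if c.name not in row]
            if missing:
                raise ValueError(f"missing columns: {missing}")
            validated.append({c.name: c.dtype(row[c.name]) for c in cls.columns()})
        return validated


def load():
    """Retrieve the data and return it only if it honours the contract."""
    rows = list(csv.DictReader(io.StringIO(_RAW_CSV)))
    return Schema.validate(rows)


# Consumer side: no retrieval boilerplate and no hard-coded field names.
data = load()
ages = [row[Schema.age.name] for row in data]
```

A consumer only ever calls `load()` and refers to fields through `Schema`, so a change of data source or a renamed column is handled inside the package rather than in every downstream script.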
This session will cover:
1. Introduction to Data-as-Code (DaC):
- What problems do we want to solve with DaC
- What is out of scope
2. Implementing DaC:
- Packaging data as Python packages
- Defining data contracts
3. Advantages of DaC:
- Application of semantic versioning to manage data changes effectively
- Breaking changes in data are automatically detected as part of the data distribution
- Abstraction of data loading mechanisms, allowing seamless transitions between data sources
- Elimination of hard-coded data field names, enhancing code maintainability
- Facilitation of unit testing through schema examples
- Inclusion of comprehensive data descriptions and metadata
- Centralized data distribution via the Python Package Index (PyPI)
4. DaC in the real world:
- Step-by-step walkthrough of creating and distributing a DaC package
- Guidelines for data engineers on preparing data for DaC
- Instructions for data scientists on consuming DaC packages in their workflows
- Discussion on the scalability and adaptability of DaC
5. Q&A Session:
- Addressing audience questions and remarks
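The semantic-versioning advantage listed in the outline can be illustrated with a small sketch. It assumes one possible policy (removing or retyping a column is a breaking change, adding a column is a compatible one); the function names and the example contracts are hypothetical, not part of any DaC tooling.

```python
def required_bump(old, new):
    """Return which semantic-version part to bump when the contract
    changes from `old` to `new` (both map column name -> type name)."""
    removed = old.keys() - new.keys()
    retyped = {name for name in old.keys() & new.keys() if old[name] != new[name]}
    if removed or retyped:
        return "major"  # consumers relying on the old contract would break
    if new.keys() - old.keys():
        return "minor"  # new columns only: existing consumers keep working
    return "patch"      # contract unchanged, e.g. a data refresh


def bump(version, part):
    """Apply a major/minor/patch bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


v1 = {"customer_id": "int", "age": "int"}
v2_added = {"customer_id": "int", "age": "int", "country": "str"}
v2_renamed = {"customer_id": "int", "age_years": "int"}

next_after_add = bump("1.4.2", required_bump(v1, v2_added))
next_after_rename = bump("1.4.2", required_bump(v1, v2_renamed))
```

Under this policy, a consumer that pins the data package to a range such as `>=1.4,<2` keeps receiving compatible releases and is never silently upgraded across the renamed column.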
Intermediate
Expected audience expertise: Python: Intermediate
Public link to supporting material, e.g. videos, GitHub, etc.:
Physicist, ML Engineer, Agile adept. I’d rather have a taste of everything than specialize. Eager to learn, unlearn, try out, share, help.