Portable Feature Engineering with Hamilton: Write Once, Run Everywhere
05-19, 15:00–15:25 (Europe/Vilnius), Saphire B - PyData

Most data transformations are written twice. In feature engineering for machine learning, data scientists regularly have to build, manage, and iterate on batch jobs, then translate those jobs to a service setting to load data and make fresh predictions. At best, this process is an engineering headache. At worst, it can result in difficult-to-detect deltas between training and inference, complex code, and highly bespoke infrastructure. In this talk we discuss Hamilton, a lightweight open-source Python framework that enables data practitioners to cleanly and portably define dataflows. Hamilton places no restrictions on the nature of transformations, allowing data scientists to use their favorite Python libraries. With Hamilton, you can run the same code in your Airflow DAG for training as you would in your FastAPI service for inference, and get the same result.
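
To give a flavor of the paradigm before the talk: in Hamilton, each plain Python function defines one node in the dataflow, and its parameter names declare its dependencies. Here is a minimal sketch; the column names spend and signups are illustrative, not taken from the talk:

    # features.py -- each function is one node; parameter names wire the DAG.
    import pandas as pd

    def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
        """Marketing spend per signup."""
        return spend / signups

    # run.py -- the driver builds the graph from the module and executes it.
    from hamilton import driver
    import features

    dr = driver.Driver({}, features)  # empty config + the transform module
    df = dr.execute(
        ["spend_per_signup"],
        inputs={"spend": pd.Series([10.0, 20.0]), "signups": pd.Series([1, 4])},
    )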


In this talk, we present Hamilton and discuss how it enables data scientists to build highly portable dataflows that run in a variety of contexts. At a high level, we will cover:
The paradigm Hamilton introduces, and how it simplifies the process of building and maintaining feature engineering pipelines
How Hamilton can be used to help scale batch data preparation for training and inference (a batch sketch follows this list)
How the same Hamilton code can be used in a web service to prepare data and run live inference, with minimal changes
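
As a taste of the batch side, here is a sketch of how the features module above might be scheduled with Airflow's TaskFlow API; the bucket paths and column names are hypothetical assumptions:

    # dag.py -- the same transform module, scheduled as a daily batch job.
    from datetime import datetime

    import pandas as pd
    from airflow.decorators import dag, task
    from hamilton import driver

    import features  # identical module to the one used for online inference

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def feature_pipeline():
        @task
        def materialize_features() -> str:
            raw = pd.read_parquet("s3://my-bucket/marketing.parquet")  # hypothetical source
            dr = driver.Driver({}, features)
            df = dr.execute(
                ["spend_per_signup"],
                inputs={"spend": raw["spend"], "signups": raw["signups"]},
            )
            out = "s3://my-bucket/features.parquet"  # hypothetical sink
            df.to_parquet(out)
            return out

        materialize_features()

    feature_pipeline()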

We will go over working code examples, making sure to connect with tooling people are familiar with (e.g. Airflow, FastAPI, Metaflow, Django).
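
And the service side, reusing the exact same module; the endpoint name and payload shape are illustrative assumptions, not from the talk:

    # app.py -- the same features module, now serving online inference.
    import pandas as pd
    from fastapi import FastAPI
    from hamilton import driver

    import features  # the identical transform module used in the batch job

    app = FastAPI()
    dr = driver.Driver({}, features)  # built once at startup

    @app.post("/predict")  # hypothetical endpoint
    def predict(spend: float, signups: float) -> dict:
        # Compute the same feature on a single live observation.
        row = dr.execute(
            ["spend_per_signup"],
            inputs={"spend": pd.Series([spend]), "signups": pd.Series([signups])},
        )
        return row.iloc[0].to_dict()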


What is the level of your talk?

Beginner

What topics define your talk the best?

python, open source, PyData, design and architecture, data science, ML engineering, data engineering, best practices

Elijah has always enjoyed working at the intersection of math and engineering. More recently, he has focused his career on building tools to make data scientists more productive. At Two Sigma, he built infrastructure to help quantitative researchers efficiently turn ideas into production trading models. At Stitch Fix he led the Model Lifecycle team — a team that focused on streamlining the experience for data scientists to create and ship machine learning models. He is now focused on building out DAGWorks, Inc., a YC-backed startup that aims to make it easier for data scientists to build and maintain ETLs for machine learning. In his spare time, he enjoys geeking out about fractals, poring over antique maps, and playing jazz piano.