2025-12-09 – Horace Mann
Most data science projects start with a simple notebook—a spark of curiosity, some exploration, and a handful of promising results. But what happens when that experiment needs to grow up and go into production?
This talk follows the story of a single machine learning exploration that matures into a full-fledged ETL pipeline. We’ll walk through the practical steps and real-world challenges that come up when moving from a Jupyter notebook to something robust enough for daily use.
We’ll cover how to:
- Set clear objectives and document the process from the beginning
- Break messy notebook logic into modular, reusable components
- Choose the right tools (Papermill, nbconvert, shell scripts) based on your workflow—not just the hype
- Track environments and dependencies to make sure your project runs tomorrow the way it did today
- Handle data integrity, schema changes, and even evolving labels as your datasets shift over time
And as a bonus: bring your results to life with interactive visualizations using tools like PyScript, Voila, and Panel + HoloViz
- (3 mins) Intro
- I've been supporting various groups with their developer experience since 2020, after working as a freelance Python consultant. I've worked on many dozens of projects, unblocking users and helping them pick the right tools for the task at hand.
- It works on my machine
- What we're building today: ML pipeline with RAPIDS -> Snowflake
- We're going to watch a real project grow up
- (3 mins) Exploration - starting with a single messy notebook and a sample data set
- Why RAPIDS? GPU
- Large data sets
- GPU availability - remote machine, local GPU
- workflows that work well with GPU
- Load data with cuDF / pandas
- Quick EDA and data visualization
- Train a cuML / scikit-learn model
- no-code-change philosophy (see the sketch after this section)
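A minimal sketch of this exploration step (the CSV and column names are illustrative, not from the talk). The no-code-change philosophy means the notebook stays plain pandas / scikit-learn; RAPIDS accelerator modes such as `python -m cudf.pandas` (and, in recent releases, `cuml.accel` for estimators) run the same code on a GPU without edits:

```python
# Exploration sketch; the CSV and column names are illustrative.
# Runs as-is on CPU, or unchanged on GPU via RAPIDS accelerator modes, e.g.:
#   python -m cudf.pandas explore.py   (pandas calls backed by cuDF)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample_transactions.csv")  # small sample set for exploration
df = df.dropna(subset=["amount", "account_age_days", "label"])

X_train, X_test, y_train, y_test = train_test_split(
    df[["amount", "account_age_days"]], df["label"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.3f}")
```

Teammates without a GPU can still run the same notebook, which is the whole appeal.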
- (7 mins) Make it repeatable - Start with simple, tried-and-true tools; explore where tools like Papermill help with flexibility and reproducibility
- common pain points: operating cadence, specialized scenarios, error-prone manual execution
- shell scripts versus Papermill
- reproducible environments
- generate HTML reports
- pass parameters into your notebook (see the sketch after this section)
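A minimal sketch of the repeatable run, assuming a notebook named etl.ipynb whose first cell is tagged "parameters"; the file names and the run_date parameter are illustrative:

```python
# Nightly-run sketch; notebook name, paths, and parameters are illustrative.
import datetime
import subprocess
from pathlib import Path

import papermill as pm

run_date = datetime.date.today().isoformat()
Path("runs").mkdir(exist_ok=True)
executed = f"runs/etl-{run_date}.ipynb"

# Inject run_date into the cell tagged "parameters", then execute top to bottom
pm.execute_notebook("etl.ipynb", executed, parameters={"run_date": run_date})

# Render the executed notebook into a shareable HTML report
subprocess.run(
    ["jupyter", "nbconvert", "--to", "html", executed, "--output-dir", "reports"],
    check=True,
)
```

The same two steps fit comfortably in a short shell script or a cron entry, which is the honest comparison point before reaching for anything heavier.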
- (8 mins) Make it reliable - Modular code & testing
- common pain points: data schema changes, debugging issues, testing & modularity
- nbconvert + Python: turn your notebook into a script
- turn notebook functions into an importable module (see the sketch after this section)
- build a dashboard with HoloViz / Panel; discuss choosing tools like Voila and PyScript
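`jupyter nbconvert --to script` gets the code out of the notebook; the reliability payoff comes from factoring it into functions you can import and test. A minimal sketch, with the column names and schema check purely illustrative:

```python
# pipeline/cleaning.py -- illustrative module factored out of the notebook
import pandas as pd

REQUIRED_COLUMNS = ["amount", "account_age_days", "label"]

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the schema the model expects, then drop incomplete rows."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"schema drift: missing columns {missing}")
    return df.dropna(subset=REQUIRED_COLUMNS).astype({"amount": "float64"})


# tests/test_cleaning.py -- schema drift becomes a failing test, not a surprise
import pytest

def test_missing_columns_raise():
    bad = pd.DataFrame({"amount": [1.0]})  # label and account_age_days absent
    with pytest.raises(ValueError):
        clean_transactions(bad)
```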
- (5 mins) Snowflake integration
- common pain points: data volume, coordinating with other data systems, audits
- picking the right tools: the cost/complexity tradeoff
- RAPIDS preprocessing to Snowflake storage (see the sketch after this section)
- self-service access for stakeholders
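A minimal sketch of the hand-off from GPU preprocessing to Snowflake using write_pandas from snowflake-connector-python; credentials come from environment variables, and the warehouse, database, and table names are illustrative:

```python
# Land preprocessed features in Snowflake; all identifiers are illustrative.
import os

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# In the real pipeline this frame is the output of the cuDF/pandas preprocessing step
features = pd.DataFrame({"ACCOUNT_ID": [1, 2], "AMOUNT_ZSCORE": [0.42, -1.17]})

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",
    database="ML_PIPELINE",
    schema="PUBLIC",
)
try:
    # Bulk load; stakeholders can then query FEATURES directly for self-service access
    write_pandas(conn, features, table_name="FEATURES", auto_create_table=True)
finally:
    conn.close()
```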
- (3 mins) Conclusion
- Start simple
- Add complexity when you feel specific pain
Dawn Gibson Wages is the Director of Community & Developer Relations at Anaconda. From her early work as a Research Developer at Wharton Computing and Instructional Technology, through Python developer experience work at Microsoft, to her current role, she has consulted on Python developer experiences across the ecosystem, speaking to thousands of developers in the process. She co-hosts the Sad Python Girls Club podcast and served as Chair of the Python Software Foundation Board.
Dawn is a member of the Wagtail CMS core team, has organized DjangoCon US sponsorship efforts and Django Girls workshops, and mentors through Djangonaut Space. She's the founder of At The Root, which developed the first Anti-Racist Ethical Source License.
A frequent conference speaker on Python development topics, Dawn is currently writing "Domain-Driven Django," exploring architecture patterns for Django applications. Her work focuses on gathering insights from the Python ecosystem to improve developer tooling and experiences.
When she is not engaging in Python, she's chilling at home in Philadelphia with her wife and dogs.