2025-09-30, Louis Armand 2 - Ouest
Publicly available data is rarely analysis-ready, which makes it hard for researchers, organizations, and the public to access the information these datasets contain. One way to address this shortcoming is to "bake" the data into a structured format and ship it alongside code that can be used for analysis. For analytical work in particular, DuckDB provides a performant way to query the structured data in a variety of contexts.
This talk will explore the benefits and tradeoffs of this architectural pattern using the design of scipeds, an open source Python package for analyzing higher-education data in the US, as a case study.
No DuckDB experience is required; beginner-level Python and programming experience is recommended. This talk is aimed at data practitioners, especially those who work with public datasets.
The Integrated Postsecondary Education Data System (IPEDS) offers a wealth of comprehensive data on U.S. higher education institutions, including degree completions broken down by field, race/ethnicity, gender, and institution. However, this data and the codes used to represent and classify different educational fields are spread across numerous files and formats that have changed over time, making it difficult to conduct longitudinal or comparative analyses.
In their work to understand representation in STEM, Science for America surfaced the need for a more accessible and structured way to work with IPEDS data, and partnered with DrivenData to build something to address this need.
The solution they designed together, called scipeds, is an open source Python package that uses a "baked data" architectural pattern to pre-process IPEDS data into a structured DuckDB database. The package includes a Python API to enable users to run basic queries without needing to know how to write SQL. Despite the size of the dataset (>22M rows), queries run quickly enough to enable visualizations with real-time filtering deployed via a Plotly Dash app.
This talk will introduce the audience to the baked data architectural pattern and its implementation in scipeds, and will discuss the benefits and tradeoffs of using this approach to make data more easily accessible to end users. The audience will come away with an understanding of baked data architecture as well as a concrete example of its use with DuckDB.
Talk structure (30 min)
- Motivation: the public dataset problem [5 min]
- Applying a baked data architecture to education data [10 min]
- Baked data across contexts: Colab, web apps, etc. [5 min]
- Benefits, tradeoffs, and takeaways [5 min]
- Q&A [5 min]
I'm Chris Kucharczyk, a data scientist and data visualization designer. I live in Oxfordshire, UK.
I currently work at DrivenData, a social enterprise developing machine learning solutions to social impact problems. We host data science competitions and offer data science consulting services.