Postmodern Architecture: The Python Powered Modern Data Stack
2023-04-19, B09

The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams.

Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this...").

This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen"), prescriptive modelling ("what should we do?"), perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack.

In this talk, we will cover the shift from ETL to ELT and the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.
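To make the ETL to ELT shift concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for a data warehouse. The sample records, table names and function names are all hypothetical and chosen purely for illustration. In the ETL version the aggregation happens in Python before anything is loaded; in the ELT version the raw rows are loaded untouched and the transformation runs as SQL inside the "warehouse".

```python
import sqlite3

# Hypothetical raw records, standing in for an extract from an API.
RAW_ORDERS = [
    ("2023-01-01", "uk", 120.0),
    ("2023-01-01", "de", 80.0),
    ("2023-01-02", "uk", 50.0),
]


def etl(conn):
    """ETL: transform in Python memory first, then load only the result."""
    totals = {}
    for _day, country, amount in RAW_ORDERS:
        totals[country] = totals.get(country, 0.0) + amount
    conn.execute("CREATE TABLE revenue_etl (country TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO revenue_etl VALUES (?, ?)", sorted(totals.items())
    )


def elt(conn):
    """ELT: load the raw data untouched, then transform inside the warehouse."""
    conn.execute("CREATE TABLE raw_orders (day TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        "CREATE TABLE revenue_elt AS "
        "SELECT country, SUM(amount) AS revenue FROM raw_orders GROUP BY country"
    )


def run():
    """Run both styles against one in-memory database and return both results."""
    conn = sqlite3.connect(":memory:")
    etl(conn)
    elt(conn)
    query = "SELECT country, revenue FROM {} ORDER BY country"
    etl_rows = conn.execute(query.format("revenue_etl")).fetchall()
    elt_rows = conn.execute(query.format("revenue_elt")).fetchall()
    conn.close()
    return etl_rows, elt_rows
```

Both routes produce the same revenue table. The difference is that the ELT route keeps the raw rows in the warehouse, so a second definition of revenue can be added later as another query without re-extracting the data.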


This light-hearted talk aims to introduce the audience to the theory and terminology of data pipelines and architectures past, present and future. The "Modern Data Stack" set of interoperable tools changed how organisations can rapidly construct a data architecture that combines multiple data sources into a single unified data warehouse, with clean analytics-ready tables that BI tools, self-serve analytics dashboards, and ML models can plug into.

Until recently, the complexity of data transformation and modelling was limited to what could be done with SQL, leaving the rich ecosystem of Python tooling for complex transformations, geospatial analytics, time series modelling, data validation and clean, tested, CI-enabled codebases mostly uninvited to the Modern Data Stack party. In 2022 a number of tools launched Python integrations (most notably dbt), opening up a world of productivity and fast scalable data processing for the PyData-savvy Pythonista.
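As a rough illustration of what the dbt Python integration looks like, a dbt Python model is a function named `model` that receives a `dbt` object and a `session`, and returns a dataframe for dbt to materialise. The sketch below is deliberately a pass-through: the type of dataframe returned by `dbt.ref` depends on the data platform, so no transformation is shown. The upstream model name `stg_orders`, the file name and the `StubDbt` class are hypothetical; `StubDbt` is not part of dbt and exists only so the sketch can be run outside a dbt project.

```python
# Sketch of a dbt Python model. Inside a dbt project this function would live
# in a file such as models/clean_orders.py (hypothetical name).

def model(dbt, session):
    # Ask dbt to materialise whatever this function returns as a table.
    dbt.config(materialized="table")
    # Reference an upstream model; "stg_orders" is a hypothetical name.
    orders = dbt.ref("stg_orders")
    # Platform-specific dataframe transformations would go here.
    return orders


class StubDbt:
    """Hypothetical stand-in for the object dbt passes in (local runs only)."""

    def __init__(self, tables):
        self.tables = tables
        self.settings = {}

    def config(self, **kwargs):
        # Record the requested settings instead of applying them.
        self.settings.update(kwargs)

    def ref(self, name):
        # Return the stored stand-in data for the named upstream model.
        return self.tables[name]
```

Calling `model(StubDbt({"stg_orders": rows}), session=None)` returns the rows unchanged, which is enough to check the function's shape before running it through dbt.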

Another recent trend is an explosion of jargon, with analytics engineers getting into heated debates around whether data observability or metadata capture should be prioritised within a data mesh architecture. These are all important concepts, especially for organisations operating at a scale where reliable data governance is mission-critical. Not all organisations are operating at that scale, and every organisation, large or small, is on its own data maturity journey.

My goal with this talk is to bring these concepts together, introduce attendees to these recent trends, and provide a framework they can take back into their organisations for accelerating their own data maturity journey using the latest tooling & best practices.


Expected audience expertise: Domain:

Novice

Expected audience expertise: Python:

Novice

Abstract as a tweet:

Learn how to upgrade your pandas pipelines powering DAG workflows to a Python Powered Modern Data Stack, demystify the jargon from ETL to ELT, and see how tools like dbt can integrate with Python to change how data pipelines are built and maintained.

John Sandall is the CEO and Principal Data Scientist at Coefficient.

His experience in data science and software engineering spans multiple industries and applications, and his passion for the power of data extends far beyond his work for Coefficient’s clients. In April 2017 he created SixFifty in order to predict the UK General Election using open data and advanced modelling techniques. Previous experience includes Lead Data Scientist at YPlan, business analytics at Apple, genomics research at Imperial College London, building an ed-tech startup at Knodium, developing strategy & technological infrastructure for international non-profit startup STIR Education, and losing sleep to many hackathons along the way.

John is also a co-organiser of PyData London, co-founded Humble Data in 2019 to promote diversity in data science through a programme of free bootcamps, and in 2020 was a Committee Chair for the PyData Global Conference. He is currently a Fellow of Newspeak House with interests in open data, AI ethics and promoting diversity in tech.