WALD: A Modern & Sustainable Analytics Stack
2023-04-17, B05-B06

The name WALD-stack stems from the four technologies it is composed of: a cloud-computing Warehouse like Snowflake or Google BigQuery, the open-source data integration engine Airbyte, the open-source full-stack BI platform Lightdash, and the open-source data transformation tool dbt.

Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other for analytics tasks in an ELT approach. You will learn the specific uses of each tool as well as their particular features. My talk is based on a full tutorial, which you can find at waldstack.org.


The current zeitgeist is that the data lake concept from classical data engineering and modern data warehousing from business intelligence are converging more and more. This is also driving the shift from ETL to ELT, and so tools such as dbt are becoming increasingly important in combination with modern Big Data warehouses such as Snowflake and Google BigQuery. For typical data and ML engineers, this is quite a departure from familiar tools like Spark.

Having a pure Spark and ETL background myself, this trend motivated me to explore the foreign realms of ELT, data warehousing and especially the fuss about dbt. In this talk I want to share my key insights with classical data/ML engineers who might have only heard about Snowflake, dbt, Airbyte and Lightdash but have never cared to dig deeper.

My talk is structured like this:
* short introduction to the differences between data lake and data warehouse, and between ETL and ELT
* high-level introduction of Snowflake, Airbyte, dbt, and Lightdash
* demonstration based on the Kaggle Formula 1 World Championship dataset to see those four tools in action
* my main take-aways and key insights
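
To make the ETL vs ELT distinction from the first agenda item concrete, here is a minimal sketch of the ELT pattern. It is not taken from the tutorial: an in-memory SQLite database stands in for the warehouse, and the table names, column names and sample rows are hypothetical. The point is only the order of the steps: raw data is loaded untouched, and the transformation runs afterwards as SQL inside the database.

```python
import sqlite3

# Hypothetical raw rows, standing in for what an ingestion tool would extract.
# Points arrive as text on purpose: nothing is cleaned before loading.
RAW_RESULTS = [
    ("driver_a", "25"),
    ("driver_b", "18"),
    ("driver_a", "18"),
]


def run_elt(rows):
    """Load raw rows unchanged, then transform them with SQL in the database."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
    conn.execute("CREATE TABLE raw_results (driver TEXT, points TEXT)")

    # Load step: the data lands exactly as extracted.
    conn.executemany("INSERT INTO raw_results VALUES (?, ?)", rows)

    # Transform step: casting and aggregation happen inside the database.
    conn.execute(
        """
        CREATE TABLE driver_points AS
        SELECT driver, SUM(CAST(points AS INTEGER)) AS total_points
        FROM raw_results
        GROUP BY driver
        """
    )
    result = conn.execute(
        "SELECT driver, total_points FROM driver_points ORDER BY driver"
    ).fetchall()
    conn.close()
    return result


elt_result = run_elt(RAW_RESULTS)
print(elt_result)
```

In an ETL pipeline the cast and the aggregation would instead run in the processing engine before anything is written to the target table.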

After this talk, you will have learned the differences between ETL & ELT, what these four tools do and in which cases you should consider the WALD-stack. Also, you will know how to use Python instead of SQL to define models in dbt, which is a brand-new feature.
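
As a rough illustration of what a Python model in dbt looks like, the sketch below follows the `model(dbt, session)` function shape described in the dbt documentation. The upstream model name `stg_results` is hypothetical, and `FakeDbt` is a local stand-in written only so the function can be called outside of dbt; it is not part of dbt and not from the tutorial.

```python
def model(dbt, session):
    """A dbt Python model: dbt calls this function and stores what it returns."""
    # Configure how the result is materialized in the warehouse.
    dbt.config(materialized="table")

    # Reference an upstream model; "stg_results" is a hypothetical name.
    # Inside dbt this returns a DataFrame of the warehouse's Python runtime.
    results = dbt.ref("stg_results")

    # Real transformations on the DataFrame would go here.
    return results


class FakeDbt:
    """Local stand-in for the object dbt passes in (assumption, not dbt API)."""

    def __init__(self, tables):
        self.tables = tables
        self.settings = {}

    def config(self, **kwargs):
        # Record the configuration so it can be inspected afterwards.
        self.settings.update(kwargs)

    def ref(self, name):
        # Look up the upstream model by name.
        return self.tables[name]


# Exercise the model locally; a plain list stands in for a DataFrame.
fake_dbt = FakeDbt({"stg_results": [("driver_a", 25)]})
output = model(fake_dbt, session=None)
print(fake_dbt.settings, output)
```

In a dbt project, only the `model` function would live in a `.py` file in the models directory; dbt supplies the real `dbt` and `session` objects when it runs the model.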

The WALD-stack is sustainable since it consists mainly of open-source technologies; however, all technologies are also offered as managed cloud services. The data warehouse itself, i.e. Snowflake or Google BigQuery, is the only non-open-source technology in the WALD-stack. In my talk, I will focus on the open-source parts of the WALD-stack.


Public link to supporting material:

https://github.com/FlorianWilhelm/wald-stack-demo

Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Novice

Abstract as a tweet:

WALD: A modern & sustainable analytics stack consisting of a warehouse like Snowflake or BigQuery, Airbyte, Lightdash and dbt.

See also: Slides

Data Scientist and Python developer with a strong mathematical background. Always looking to apply mathematics to real-world problems and enthusiastic about everything math.