PyConDE & PyData Berlin 2024

The Struggles We Skipped: Data Engineering for the TikTok Generation
2024-04-24, A1

In a world increasingly embracing Python, plug-and-play solutions, and AI-generated code, our generation, which grew up with these advancements, may not fully grasp the challenges faced by our predecessors. Meanwhile, data engineering, traditionally known for its complexity, can now move into the plug-and-play realm too, thanks to Python libraries such as dlt.

Aiming to be both fun and insightful, this talk will introduce the data engineering concepts our generation finds most important and show how to use high-level abstractions to automate most of what used to be highly manual work. Juniors will gain an appreciation for the difficulties of data pipeline engineering, and seniors will gain a straightforward way to speed up the creation of robust pipelines.

From the perspective of junior data engineers such as us, the talk will walk through the challenges associated with constructing a data pipeline and demonstrate how these can be effectively addressed using Python libraries such as dlt that simplify the intricacies of data extraction, transformation, and loading.


A tale of two junior data engineers.

Our generation of developers might have it “easy” thanks to a plethora of tools that automate everything and make it plug-and-play. However, this abundance also makes it harder to break into a field. This talk explores the perspectives of two junior data engineers, one entirely new to data and the other with a data science background, both navigating the complexities of data engineering.

The first is a data scientist who had to navigate her tasks without the luxury of well-formatted data. This journey led, almost by accident, to a gradual familiarity with complex tools like Spark, and to the necessity of understanding various connectors and writing detailed code for data extraction and normalization. The introduction of dlt brought a significant shift: it automated many of the tedious processes, allowing analysts to focus more on analytics and less on data handling.

The second, never having had to deal with the chaos of unstructured data, was introduced directly to dlt. Spared the typical struggles faced by traditional data engineers, she will find out over the course of the talk what happens behind dlt’s automation. After realizing that the two lines of Python code she wrote saved her from the manual tasks of data normalization, structuring, and loading, she will gain an appreciation for the tools at her disposal, especially dlt.

dlt, or data load tool, is an open-source Python library for data teams of all sizes. It extracts a range of data formats from various sources, normalizes that unstructured data into a relational structure, and loads it into the destination of your choice. All of this takes a few lines of Python code, rather than the several separate tools previously needed for these tasks. It is a valuable and cost-effective addition to a company’s data stack.
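To give a feel for the manual work described above, here is a minimal hand-rolled sketch of the normalize-and-load step using only the Python standard library. It is not dlt's API or implementation: the function names (`flatten`, `normalize`, `load`), the `__` naming for nested keys and child tables, the `_row_id` and `_parent_id` columns, and the sample records are all assumptions made for illustration.

```python
import sqlite3


def flatten(record, parent_key=""):
    """Flatten nested dicts into one level, joining keys with '__'."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}__{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def normalize(records, table):
    """Split records into a parent table plus one child table per list field.

    Only one level of lists is handled; lists nested inside list items
    are out of scope for this sketch.
    """
    tables = {table: []}
    for row_id, record in enumerate(records):
        row = {"_row_id": row_id}
        for key, value in flatten(record).items():
            if isinstance(value, list):
                child = tables.setdefault(f"{table}__{key}", [])
                for item in value:
                    child_row = flatten(item) if isinstance(item, dict) else {"value": item}
                    child.append({"_parent_id": row_id, **child_row})
            else:
                row[key] = value
        tables[table].append(row)
    return tables


def load(conn, tables):
    """Create one SQLite table per normalized table and insert its rows."""
    for name, rows in tables.items():
        if not rows:
            continue  # nothing to create for an empty child table
        columns = sorted({column for row in rows for column in row})
        col_sql = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{name}" ({col_sql})')
        conn.executemany(
            f'INSERT INTO "{name}" ({col_sql}) VALUES ({marks})',
            [tuple(row.get(c) for c in columns) for row in rows],
        )
    conn.commit()


# Hypothetical nested input, as it might arrive from an API.
users = [
    {"id": 1, "name": "Ada", "address": {"city": "Berlin"}, "tags": ["a", "b"]},
    {"id": 2, "name": "Grace", "address": {"city": "Karachi"}, "tags": []},
]

tables = normalize(users, "users")
conn = sqlite3.connect(":memory:")
load(conn, tables)
```

Even this small version has to decide how to name nested fields, how to link child rows to their parents, and what to do with empty lists. Those are the kinds of decisions a library can make once on the user's behalf.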

The talk will follow a step-by-step, linear narrative to outline the challenges of building a data pipeline and illustrate how dlt can resolve these issues, thereby automating the process. Beginning with schema inference and evolution, then progressing to dependency handling and data governance, each challenge will be portrayed as a quest on the journey to constructing a well-defined data pipeline. As junior data engineers, we would like to emphasize the paradigm shift in data engineering towards a greater level of abstraction. This shift, enabled by features such as dlt's declarative incremental loading, empowers junior engineers to tackle tasks that traditionally would not be considered junior-level work.
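As a rough illustration of two of these quests, the sketch below hand-codes schema inference and evolution together with cursor-based incremental loading, again using only the standard library. It is an assumption-laden stand-in for what a tool would offer declaratively, not dlt's API: `infer_schema`, `evolve_table`, `load_incremental`, the `orders` table, and the `updated_at` cursor column are hypothetical names chosen for this example.

```python
import sqlite3

# Map Python types to SQLite column types; anything else falls back to TEXT.
SQL_TYPES = {bool: "INTEGER", int: "INTEGER", float: "REAL", str: "TEXT"}


def infer_schema(rows):
    """Infer a column -> SQL type mapping from the first non-None value seen."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            if column not in schema and value is not None:
                schema[column] = SQL_TYPES.get(type(value), "TEXT")
    # Columns that only ever held None fall back to TEXT.
    for row in rows:
        for column in row:
            schema.setdefault(column, "TEXT")
    return schema


def evolve_table(conn, table, schema):
    """Create the table if it is missing, otherwise add any new columns."""
    existing = {info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')}
    if not existing:
        cols = ", ".join(f'"{c}" {t}' for c, t in schema.items())
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        return
    for column, sql_type in schema.items():
        if column not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" {sql_type}')


def load_incremental(conn, table, rows, cursor_column, last_value=None):
    """Load only rows newer than last_value and return the updated cursor value."""
    new_rows = [
        row for row in rows
        if last_value is None or row[cursor_column] > last_value
    ]
    if not new_rows:
        return last_value
    evolve_table(conn, table, infer_schema(new_rows))
    for row in new_rows:
        cols = ", ".join(f'"{c}"' for c in row)
        marks = ", ".join("?" for _ in row)
        conn.execute(
            f'INSERT INTO "{table}" ({cols}) VALUES ({marks})',
            tuple(row.values()),
        )
    conn.commit()
    return max(row[cursor_column] for row in new_rows)


conn = sqlite3.connect(":memory:")

# Hypothetical batches: the second repeats an old row and introduces a new column.
batch_1 = [{"id": 1, "updated_at": "2024-04-01", "status": "new"}]
batch_2 = [
    {"id": 1, "updated_at": "2024-04-01", "status": "new"},
    {"id": 2, "updated_at": "2024-04-02", "status": "paid", "amount": 9.5},
]

last_seen = load_incremental(conn, "orders", batch_1, "updated_at")
last_seen = load_incremental(conn, "orders", batch_2, "updated_at", last_seen)
```

The second call skips the row that was already loaded and adds the `amount` column before inserting the new one. In a real pipeline the cursor value would also have to be persisted between runs, which this sketch leaves out.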


Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Intermediate

Abstract as a tweet (X) or toot (Mastodon):

A new wave in data engineering! From tangled tasks to sleek, plug-and-play magic in data pipelines. 🚀

Public link to supporting material, e.g. videos, Github, etc.:

https://github.com/dlt-hub

See also: PyCon Berlin Presentation

Writer by choice and a data enthusiast at heart. Crafting compelling narratives with Open Source Software at dltHub. With a background in International Relations, I am currently pursuing Computer Science, focusing on Machine Learning, at TU Berlin.

The data field has been my home for 3 years. I'm now a Data Science Working Student at dltHub in Berlin. Previously, I contributed as a researcher, data scientist and business analyst in startups and government-funded projects in Pakistan. Currently pursuing a master's degree in data analytics and AI for business management, I hold a prior degree in Computer Science with a touch of liberal arts.