PyCon Hong Kong 2024

Spark-less local data stack in 2024
2024-11-16, LT7
Language: English

In 2024, the Composable Data Stack is maturing, and mixing tools for different use cases keeps getting easier. As local data stacks grow more capable with advances in tools like Polars and DuckDB, the need for end users to reach for Spark is increasingly being questioned.

Traditionally, Spark has been regarded as the most mature and reliable data processing framework, making it a default choice for many. However, the landscape has evolved significantly by 2024, with numerous libraries now offering more efficient and versatile local data processing solutions.

This presentation will explore these new alternatives, focusing on:

SQLFrame: A framework providing a Spark DataFrame API that can interface with different computing engines.
Ibis: A unified API that seamlessly integrates dataframes and databases, eliminating the need to commit to a single engine.
SQLGlot: A powerful tool for transpiling SQL queries between different dialects, enhancing compatibility and flexibility.

Our goal is not to declare Spark obsolete but to highlight efficient alternatives that may be better suited to specific environments and use cases. Attendees will gain insight into how these modern tools can streamline local data processing workflows, potentially reducing the need for Spark in certain scenarios.

I am a recovering data scientist, currently working in London as a software engineer on an open-source project called Kedro (https://github.com/kedro-org/kedro). I am passionate about open source and machine learning. In my free time, I like playing badminton.