2026-07-21, Room 1.38 (Ground Floor, Turing)
In this talk, we will discuss how to optimize your spatial data processing using Apache Sedona, a distributed processing engine, and SedonaDB, a DataFusion-based database that treats spatial data as a first-class citizen. The talk covers:
- Optimizing distributed and non-distributed spatial joins
- Optimizing spatial partitioning and reducing data skew
- Using spatial Apache Parquet and GeoParquet to store and retrieve data efficiently
- Making Apache Sedona Python applications faster and less memory-hungry by incorporating Apache Arrow and SedonaDB
- Applying indexing techniques
- Running the distributed k-nearest neighbor algorithm
I will explain why knowledge of optimization patterns matters and why understanding the limitations of Apache Sedona's Python API is crucial to making your spatial data pipelines robust and efficient. The last part explains when to use Apache Sedona and where SedonaDB fits.
The Apache Sedona ecosystem is powerful, but using it without a solid understanding of how it works can lead to wasted compute cycles, data skew, or even application crashes. This talk discusses in detail how popular spatial processing algorithms work and how we can make them more efficient. It focuses on the typical problems a Spatial Data Engineer, Analyst, or Scientist faces daily, such as spatial joins, KNN searches, or integrating different spatial tools.
The talk consists of four major sections:
- Introducing the Apache Sedona ecosystem and how Apache Sedona solves complex distributed spatial problems, like spatial partitioning and spatial joins
- Explaining what Spatial Parquet and GeoParquet are and the problems they solve
- Optimizing spatial processing pipelines, including:
  - reducing skew in spatial joins
  - evenly distributed spatial partitioning
  - k-nearest neighbor search
  - efficient user-defined functions with Arrow optimization
  - efficient storage and retrieval of data in spatial Parquet
  - indexing techniques
  - understanding the limitations of the Apache Sedona Python API
- When to use Apache SedonaDB in your spatial data processing tasks.
To create robust spatial queries, it's important to understand the fundamentals and how Apache Sedona implements specific spatial algorithms. This will help you select the right tools for the job and improve user satisfaction with Apache Sedona.
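As one illustration of why those fundamentals matter, the sketch below compares two ways of assigning points to partitions when the data is heavily clustered. It is a simplified, pure-Python stand-in and not Apache Sedona's actual implementation: the data set, the function names, and the quantile-based splitting are all assumptions made for the example. The quantile splits only imitate the general idea behind sample-driven partitioners, which place boundaries where the data is rather than on a fixed grid.

```python
import random
from bisect import bisect_right
from collections import Counter

PARTS_PER_AXIS = 4                      # 4 x 4 = 16 partitions in both schemes
NUM_PARTITIONS = PARTS_PER_AXIS ** 2


def make_clustered_points(n, seed=42):
    """Hypothetical data: ~90% of points in a small cluster, the rest spread out."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        if rng.random() < 0.9:
            points.append((rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)))
        else:
            points.append((rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0)))
    return points


def uniform_grid_partition(point, extent=100.0):
    """Assign a point to one of the equally sized cells of a fixed grid."""
    cell = extent / PARTS_PER_AXIS
    col = min(int(point[0] / cell), PARTS_PER_AXIS - 1)
    row = min(int(point[1] / cell), PARTS_PER_AXIS - 1)
    return col * PARTS_PER_AXIS + row


def quantile_splits(values, parts):
    """Split values so each part holds roughly the same number of samples."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // parts] for i in range(1, parts)]


def build_sampled_partitioner(sample):
    """Derive partition boundaries from a sample: x strips first, then y per strip."""
    x_splits = quantile_splits([p[0] for p in sample], PARTS_PER_AXIS)
    strips = [[] for _ in range(PARTS_PER_AXIS)]
    for x, y in sample:
        strips[bisect_right(x_splits, x)].append(y)
    y_splits = [quantile_splits(ys, PARTS_PER_AXIS) for ys in strips]

    def partition(point):
        col = bisect_right(x_splits, point[0])
        row = bisect_right(y_splits[col], point[1])
        return col * PARTS_PER_AXIS + row

    return partition


def skew(points, partition_fn):
    """Largest partition size divided by the ideal (mean) partition size."""
    counts = Counter(partition_fn(p) for p in points)
    return max(counts.values()) / (len(points) / NUM_PARTITIONS)


points = make_clustered_points(10_000)
sample = random.Random(7).sample(points, 2_000)

grid_skew = skew(points, uniform_grid_partition)
sampled_skew = skew(points, build_sampled_partitioner(sample))

print(f"uniform grid skew:      {grid_skew:.2f}")
print(f"sample-based skew:      {sampled_skew:.2f}")
```

A skew close to 1 means every partition does a similar amount of work. In a distributed join, a single oversized partition tends to dominate the runtime and memory use of the whole job, which is why the way partitions are built matters as much as the join algorithm itself.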
SedonaDB is a rapidly growing, single-node open-source analytical database built around spatial data. It is written in Rust, leveraging DataFusion and GeoArrow to build a powerful, unified engine that integrates easily with spatial and non-spatial data tools in the Python ecosystem. I'll discuss how to incorporate it into your data pipelines, with an emphasis on when to use it, how to make it efficient, and how to integrate it with other tools like DuckDB, Polars, or GeoPandas.
Paweł Tokaj is a staff software engineer at Splunk and a PMC member of the Apache Sedona project who enjoys writing reliable, efficient software that helps others. His love for geospatial data started at the Warsaw University of Technology, where he graduated in geodesy and cartography.
Paweł’s primary focus areas are distributed databases and systems, cloud computing, and geospatial data processing. He believes that open source projects make knowledge more accessible; he has contributed to Apache Sedona, OpenLineage, and Airbyte. He attends conferences and meetups, where he shares his knowledge as a speaker or participant. He is a technology nerd who spends much of his spare time reading books and articles and developing open source software.