Optimize the geospatial data processing with Apache Sedona and SedonaDB. EuroSciPy 2026

2026-07-21 16:00–16:20, Room 1.38 (Ground Floor, Turing)

In this talk, we will discuss how to optimize spatial data processing with Apache Sedona, a distributed processing engine, and SedonaDB, a DataFusion-based database that treats spatial data as a first-class citizen. You will learn about:

Distributed and non-distributed spatial joins
Spatial partitioning that reduces data skew
Storing and retrieving data efficiently with spatial Apache Parquet and GeoParquet
Making Apache Sedona Python applications faster and less memory-hungry with Apache Arrow and SedonaDB
Powerful indexing techniques
The distributed K-nearest neighbor algorithm
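
To make the value of partitioning and indexing in a spatial join concrete, here is a minimal pure-Python sketch. It is a toy illustration, not Apache Sedona's implementation: the helper names (`naive_join`, `grid_join`), the cell size, and the random data are all assumptions made for this example. It compares how many point-in-box checks a brute-force join needs with how many a grid-bucketed join needs to produce the same result.

```python
import random
from collections import defaultdict
from itertools import product


def contains(box, pt):
    """True if the point lies inside the box (minx, miny, maxx, maxy), borders included."""
    minx, miny, maxx, maxy = box
    x, y = pt
    return minx <= x <= maxx and miny <= y <= maxy


def naive_join(boxes, points):
    """Compare every box with every point."""
    pairs = {
        (b, p)
        for b, box in enumerate(boxes)
        for p, pt in enumerate(points)
        if contains(box, pt)
    }
    return pairs, len(boxes) * len(points)


def grid_join(boxes, points, cell=10.0):
    """Bucket points by grid cell, then test each box only against nearby buckets."""
    buckets = defaultdict(list)
    for p, (x, y) in enumerate(points):
        buckets[(int(x // cell), int(y // cell))].append(p)

    pairs, checks = set(), 0
    for b, (minx, miny, maxx, maxy) in enumerate(boxes):
        # Every grid cell that the box overlaps.
        xs = range(int(minx // cell), int(maxx // cell) + 1)
        ys = range(int(miny // cell), int(maxy // cell) + 1)
        for key in product(xs, ys):
            for p in buckets.get(key, ()):
                checks += 1
                if contains(boxes[b], points[p]):
                    pairs.add((b, p))
    return pairs, checks


# Hypothetical data: 2000 random points and 50 small boxes in a 100 x 100 area.
rng = random.Random(42)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(2000)]
boxes = []
for _ in range(50):
    x, y = rng.uniform(0, 95), rng.uniform(0, 95)
    boxes.append((x, y, x + 5, y + 5))

naive_pairs, naive_checks = naive_join(boxes, points)
grid_pairs, grid_checks = grid_join(boxes, points)

print("same result:", naive_pairs == grid_pairs)
print("checks, naive vs grid:", naive_checks, grid_checks)
```

Both joins return the same pairs, but the grid version only tests points that share a cell with the box. A distributed engine applies the same idea across machines, which is why the choice of partitioning scheme matters so much for join performance.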

I will explain why knowledge of optimization patterns matters and why understanding the limitations of Apache Sedona's Python API is crucial to making your spatial data pipelines robust and efficient. The last part explains when to use Apache Sedona and where SedonaDB fits.

The Apache Sedona ecosystem is powerful, but using it without a solid understanding can lead to wasted computational cycles, data skew, or even application crashes. This talk discusses in detail how popular spatial processing algorithms work and how to make them more efficient. It focuses on the typical problems a spatial data engineer, analyst, or scientist faces daily, such as spatial joins, KNN searches, and integrating different spatial tools.
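
As an illustration of one such algorithm, a KNN search over partitioned data can follow a simple pattern: each partition computes its own k nearest candidates, and a final step merges those candidates into the global answer. The sketch below uses only the Python standard library. It is an assumption-laden toy, not Apache Sedona's actual KNN implementation, and the names `local_top_k` and `distributed_knn` are hypothetical.

```python
import heapq
import random
from math import dist


def local_top_k(partition, query, k):
    """Return the k nearest (distance, point) pairs within one partition."""
    return heapq.nsmallest(k, ((dist(p, query), p) for p in partition))


def distributed_knn(partitions, query, k):
    """Merge per-partition candidates into the global k nearest points."""
    candidates = [c for part in partitions for c in local_top_k(part, query, k)]
    return [p for _, p in heapq.nsmallest(k, candidates)]


# Hypothetical data: 1000 random points dealt round-robin into 8 partitions.
rng = random.Random(7)
all_points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(1000)]
partitions = [all_points[i::8] for i in range(8)]

query = (50.0, 50.0)
nearest = distributed_knn(partitions, query, k=5)
print(nearest)
```

The merge step only sees at most k candidates per partition, so the amount of data moved between workers stays small regardless of how many points each partition holds.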

The talk consists of four major sections:

Introducing the Apache Sedona ecosystem and how Apache Sedona solves complex distributed spatial problems, like spatial partitioning and spatial joins
Explaining what spatial Parquet and GeoParquet are and the problems they solve
Optimizing spatial processing pipelines, including:
- reducing skew in spatial joins
- evenly distributed spatial partitioning
- k-nearest neighbor search
- efficient user-defined functions with Arrow optimization
- efficiently storing and retrieving data from spatial Parquet
- powerful indexing techniques
- understanding the limitations of the Apache Sedona Python API
When to use Apache SedonaDB in your spatial data processing tasks.
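
The skew problem mentioned in the third section can be shown with a small standard-library sketch. It assumes a hypothetical dataset in which most points cluster in one corner, then compares a uniform grid with a recursive median split, which is in the spirit of tree-based partitioners such as the KDB-tree. This is a simplified illustration, not Apache Sedona's partitioning code, and the function names are made up for the example.

```python
import random
from collections import Counter


def uniform_grid_sizes(points, n_side, extent):
    """Partition sizes when the area is cut into n_side x n_side equal cells."""
    cell = extent / n_side

    def index(v):
        # Clamp so that stray values still land in a valid cell.
        return min(max(int(v // cell), 0), n_side - 1)

    counts = Counter((index(x), index(y)) for x, y in points)
    return sorted(counts.values(), reverse=True)


def median_split(points, depth, axis=0):
    """Recursively halve the points at the median, alternating the x and y axes."""
    if depth == 0 or len(points) <= 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (median_split(pts[:mid], depth - 1, 1 - axis)
            + median_split(pts[mid:], depth - 1, 1 - axis))


# Hypothetical skewed data: 900 points clustered near (10, 10), 100 spread out.
rng = random.Random(1)
clustered = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(900)]
scattered = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(100)]
data = clustered + scattered

grid_sizes = uniform_grid_sizes(data, n_side=4, extent=100.0)
tree_sizes = sorted((len(p) for p in median_split(data, depth=4)), reverse=True)

print("largest uniform-grid partition:", grid_sizes[0])
print("largest median-split partition:", tree_sizes[0])
```

With the uniform grid, one cell receives almost all of the data, so one worker would do almost all of the work. The median split produces 16 partitions of nearly equal size because it adapts the boundaries to where the data actually is.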

To create robust spatial queries, it's important to understand the fundamentals and how Apache Sedona implements specific spatial algorithms. This will help you select the right tools for the job and improve user satisfaction with Apache Sedona.

SedonaDB is a rapidly growing, single-node open-source analytical database built around spatial data. It is written in Rust, leveraging DataFusion and GeoArrow to build a powerful, unified engine that integrates easily with spatial and non-spatial data tools in the Python ecosystem. I'll discuss how to incorporate it into your data pipelines, with an emphasis on when to use it, how to make it efficient, and how to integrate it with other tools like DuckDB, Polars, or GeoPandas.

Expected audience expertise: Domain: some
Expected audience expertise: Python: some
Your relationship with the presented work/project: Original author or co-author, Active contributor, Maintainer of the presented library/project

Paweł Tokaj

Paweł Tokaj is a staff software engineer at Splunk and a PMC member of the Apache Sedona project who enjoys writing reliable, efficient software that helps others. His love for geospatial data started at the Warsaw University of Technology, where he graduated in geodesy and cartography.

Paweł’s primary focus areas are distributed databases and systems, cloud computing, and geospatial data processing. He believes that open source projects make knowledge more accessible; he has contributed to Apache Sedona, OpenLineage, and Airbyte. He attends conferences and meetups, where he shares his knowledge as a speaker or participant. He is a technology nerd who spends much of his spare time reading books and articles and developing open source software.
