Paweł Tokaj
Paweł Tokaj is a staff software engineer at Splunk and a PMC member of the Apache Sedona project who enjoys writing reliable, efficient software that helps others. His love for geospatial data started at the Warsaw University of Technology, where he graduated in geodesy and cartography.
Paweł’s primary focus areas are distributed databases and systems, cloud computing, and geospatial data processing. He believes that open source projects make knowledge more accessible; he has contributed to Apache Sedona, OpenLineage, and Airbyte. He attends conferences and meetups, where he shares his knowledge as a speaker or participant. He is a technology nerd who spends much of his spare time reading books and articles and developing open source software.
Splunk
Staff Software Engineer
Session
In this talk, we will discuss how to optimize your spatial data processing using Apache Sedona, a distributed processing engine, and SedonaDB, a powerful DataFusion-based database that treats spatial data as a first-class citizen. You will learn how to optimize:
- Distributed and non-distributed spatial joins
- Spatial partitioning, to reduce data skew
- Storing and retrieving data efficiently with Spatial Apache Parquet and GeoParquet
- Apache Sedona Python applications, making them faster and less memory-hungry with Apache Arrow and SedonaDB
- Powerful indexing techniques
- Distributed K-nearest neighbor algorithm
I will explain why knowing these optimization patterns matters and why understanding the limitations of Apache Sedona's Python API is crucial to making your spatial data pipelines robust and efficient. The last part explains when to use Apache Sedona and where SedonaDB fits.