Partitions and chains: enabling batch processing for your data
2021-07-28, 17:50–18:00 (UTC), Green

While big data isn't new anymore, building efficient pipelines to parse, analyze, transform, aggregate, and save all this data is still a tricky business. Come learn about new tools across the JuliaData family of packages for batch processing data, allowing automatic use of multithreading for data processing tasks.


I want to give an overview of the next "phase" of functionality we've been building across the data ecosystem, along with some walk-throughs of how that functionality is already being leveraged, including:
* The ChainedVector array type, which allows treating "batches" of arrays as one long array, while allowing efficient multithreading and other concurrent operations on the data automatically
* Tables.partitions: The Tables.jl package now supports "batches" of data for sinks to process, with a focus on enabling multithreaded sink processing of source partitions
* The TableOperations.jl package provides the makepartitions and joinpartitions utility functions for splitting a table into partitions and joining partitions back into a single table
* Examples of how packages are already taking advantage: Arrow.jl, CSV.jl, JuliaDB.jl, Parquet.jl, and Avro.jl
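
To make the "batches as one long array" idea concrete, here is a minimal conceptual sketch in Python. It is not the Julia implementation: `ChainedList` and `map_partitions` are hypothetical names invented for illustration, and the sketch only assumes that a chained array stores its batches as-is plus a running total of their lengths. Python threads also do not give CPU parallelism for pure-Python work, so the thread pool here shows the shape of per-partition processing rather than a real speedup.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


class ChainedList:
    """Hypothetical stand-in for the chained-array idea:
    several batches presented as one long sequence, without copying."""

    def __init__(self, chunks):
        # Keep the batches as-is; drop empty ones so index lookup stays simple.
        self.chunks = [c for c in chunks if len(c) > 0]
        # ends[k] is the global index one past the last element of chunk k.
        self.ends = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.ends, i)        # chunk that holds global index i
        start = self.ends[k - 1] if k else 0  # global index of that chunk's first element
        return self.chunks[k][i - start]

    def __iter__(self):
        for chunk in self.chunks:
            yield from chunk


def map_partitions(func, chained, max_workers=4):
    """Apply func to each batch on a thread pool and chain the results.
    pool.map returns results in input order, so element order is preserved."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return ChainedList(list(pool.map(func, chained.chunks)))


batches = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
cv = ChainedList(batches)
doubled = map_partitions(lambda chunk: [2 * x for x in chunk], cv)
print(len(cv), cv[3], list(doubled))
```

The design point is that each batch can be produced or transformed independently, and the results are then viewed as a single sequence without concatenating them into a new array.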

I've been a Julia enthusiast for a long time, since Julia 0.1! I've always been interested in data engineering, making data processing more efficient, and various data formats, and Julia is just such a fun tool for diving into these kinds of problems.