Jonas Dedden
Hi, I'm Jonas Dedden, Staff Research Data Engineer at DeepL SE, Germany. Johanna Goergen and I work on the Research Data Platform team of DeepL Research, where we are responsible for the on-prem and cloud-based k8s compute infrastructure for petabyte-scale data processing pipelines. We provide the platform that our Research Data Engineers use to collect and preprocess all data needed for training the DeepL foundational language models that power our production services.
Session
This talk will detail how we used Rust to resolve a number of resource utilization inefficiencies while scaling data preprocessing to petabyte scale, enabling next-generation model training at DeepL. Among other measures, we developed an internal library for interacting with Parquet files in a memory-efficient way.
Topics include:
• Convincing you to love Rust for its memory safety
• Comparing C++ and Rust ecosystems for Python library development
• Diving into Python-Rust interoperability
• Convincing you to love Rust for its user-friendly (yes, actually!) language features
• Providing a high-level overview of Rust's continuously growing impact on the Arrow and data engineering ecosystem