PyCon DE & PyData 2026

What Breaks When Automatic Speech Recognition Systems Go Multilingual
Palladium [2nd Floor]

Building machine learning models for audio deepfake detection seems straightforward until the datasets span multiple languages, such as Hindi, Korean, Mandarin, and German. In practice, multilingual Automatic Speech Recognition (ASR) systems often fail in production because pipeline assumptions that ignore language-specific acoustic variation break down at scale.

This talk examines the engineering challenges of building a multilingual deepfake detection system using a Python-centric pipeline. It covers practical issues encountered during large-scale audio preprocessing, including memory-efficient data loading, resumable feature-extraction workflows, and validation strategies designed to prevent cross-lingual leakage. The session also shares lessons from deploying a multilingual ASR-based system, with a focus on pipeline structure, evaluation correctness, and operational robustness in real-world settings.


In a multilingual Automatic Speech Recognition (ASR) dataset of more than 440,000 audio samples, preprocessing methods that worked for one language often failed silently for others. The result was shifted acoustic features, misleading validation results, and long-running jobs that failed because of assumptions that hold only in monolingual settings. This presentation examines what goes wrong when ASR systems are extended to multilingual data, using a real-world deepfake detection system that covers Hindi, Korean, Mandarin, and German. It addresses the engineering challenges encountered while developing and operating a Python-based pipeline at scale.

The session will discuss practical issues in large-scale audio processing, including the creation of memory-efficient data loaders, the design of workflows that support resumable preprocessing and feature extraction, and strategies for managing long-running jobs to avoid redundant computations. Additionally, it will cover validation strategies for multilingual ASR systems, emphasizing that language imbalance and shared pipelines can lead to cross-lingual leakage, which skews evaluation results if not explicitly addressed.
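As an illustration of the resumable, memory-conscious workflow described above, here is a minimal sketch and not the pipeline presented in the talk. It assumes a JSON-lines manifest that records each finished file, so a restarted job skips completed work. The names `load_done`, `iter_pending`, and `run_resumable`, and the manifest layout, are hypothetical; `extract` stands in for any real feature extractor.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable, Iterable, Iterator


def load_done(manifest: Path) -> set[str]:
    """Return the IDs already recorded in the manifest (empty if none)."""
    if not manifest.exists():
        return set()
    done: set[str] = set()
    with manifest.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                done.add(json.loads(line)["id"])
    return done


def iter_pending(paths: Iterable[str], done: set[str]) -> Iterator[str]:
    """Lazily yield inputs with no recorded result, one at a time."""
    for path in paths:
        if path not in done:
            yield path


def run_resumable(
    paths: Iterable[str],
    extract: Callable[[str], list],
    manifest: Path,
) -> int:
    """Process pending inputs, appending one manifest line per finished item."""
    done = load_done(manifest)
    processed = 0
    with manifest.open("a") as fh:
        for path in iter_pending(paths, done):
            features = extract(path)  # hypothetical extractor (assumption)
            fh.write(json.dumps({"id": path, "features": features}) + "\n")
            fh.flush()  # keep completed work on disk if the job dies later
            processed += 1
    return processed
```

Because inputs are consumed through a generator and results are appended as they are produced, only one item needs to be in memory at a time. This sketch does not handle a partially written final line after a crash; a production version would need to, and would likely store large feature arrays in separate files instead of inline JSON.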

Key takeaways include:
1. Multilingual ASR pipelines reveal language-specific issues that are not present in monolingual systems.
2. Scalable audio processing requires memory-efficient and resumable Python workflows.
3. Cross-lingual evaluation necessitates explicit control over language imbalance and leakage.
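One way to make the third takeaway concrete is a split that keeps related samples together and holds out data from every language. The sketch below is an assumption about what leakage control could look like, not the method from the talk: it treats a hypothetical `group` field (for example a speaker or source-recording ID) as the unit that must not straddle train and validation, and it stratifies groups by their most frequent `lang`.

```python
import random
from collections import Counter, defaultdict


def split_by_group(samples, val_fraction=0.2, seed=0):
    """Split samples so no group is in both sets and each language is held out.

    Each sample is a dict with hypothetical keys "lang" and "group".
    """
    # Collect all samples belonging to each group.
    group_samples = defaultdict(list)
    for sample in samples:
        group_samples[sample["group"]].append(sample)

    # Bucket each group under its most frequent language, so a group that
    # spans several languages is still assigned to exactly one side.
    lang_groups = defaultdict(list)
    for group, items in group_samples.items():
        counts = Counter(item["lang"] for item in items)
        primary = max(sorted(counts), key=counts.get)
        lang_groups[primary].append(group)

    rng = random.Random(seed)
    train, val = [], []
    for lang in sorted(lang_groups):
        groups = sorted(lang_groups[lang])
        rng.shuffle(groups)
        # Hold out at least one group per language.
        n_val = max(1, round(len(groups) * val_fraction))
        for index, group in enumerate(groups):
            target = val if index < n_val else train
            target.extend(group_samples[group])
    return train, val
```

Splitting at the group level, not the sample level, is the design choice that matters here: a random per-sample split can place clips from the same speaker on both sides and inflate validation scores. Note that a language with a single group ends up entirely in validation under this sketch, which is worth checking for on imbalanced data.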


Expected audience expertise in your talk's domain: Intermediate
Expected audience expertise in Python: Intermediate

Rashmi is an AI Research Scientist at Poseidon and a researcher at MIT CSAIL, working at the intersection of cybersecurity and artificial intelligence. She has six years of industry experience, having brought ideas to life at pre-seed startups and contributed to impactful redesigns and features at established industry giants. Beyond coding, Rashmi finds inspiration in capturing the wonders of the cosmos through her telescope and playing board games with friends.