PyData London 2026

Ken Obata

Ken Obata is a senior data engineer currently working at Lyft, with over seven years of experience building large-scale data infrastructure at KPMG, Amazon, and Lyft. His current research focuses on scalable text deduplication for LLM training data, where he developed a partition-aware MinHash LSH system that processes hundreds of millions of documents on commodity Spark clusters.


Session

06-06
11:05
45min
Beyond Spark MLlib: Deduplicating Common Crawl at Scale
Ken Obata

Training large language models requires massive, high-quality text corpora, but web-scale datasets like Common Crawl contain significant near-duplicate content that degrades model performance and wastes compute. Existing solutions like Spark MLlib's MinHashLSH suffer from UDF serialization overhead and shuffle explosion, causing out-of-memory failures at scale.

We present a partition-aware MinHash LSH system that co-locates similar documents within Spark partitions, dramatically reducing cross-partition shuffles during similarity computation. Our approach combines vectorized MinHash generation using mathematical permutation tricks, band-based candidate filtering with configurable collision limits to handle edge cases like boilerplate false positives, and GraphFrames-based connected components for transitive deduplication.
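The three stages named above (MinHash signatures, band-based candidate filtering with a collision limit, and connected components) can be illustrated with a small single-machine sketch. This is not the speaker's Spark implementation. It assumes an affine hash family `(a*x + b) mod p` as one common stand-in for random permutations, treats the collision limit as "skip any band bucket with more than `MAX_BUCKET` documents", and uses a plain union-find in place of GraphFrames. All names and parameter values (`NUM_PERM`, `BANDS`, `MAX_BUCKET`, `dedup_groups`) are hypothetical and chosen only for illustration.

```python
import hashlib
import random
from collections import defaultdict

PRIME = (1 << 61) - 1      # modulus for the affine hash family (assumption)
NUM_PERM = 64              # signature length, illustrative value
BANDS = 16                 # number of LSH bands, illustrative value
ROWS = NUM_PERM // BANDS   # rows per band
MAX_BUCKET = 3             # hypothetical collision limit per band bucket


def shingles(text, k=5):
    """Hash each character k-gram of the whitespace-normalized text to an int."""
    text = " ".join(text.lower().split())
    grams = {text[i:i + k] for i in range(max(1, len(text) - k + 1))}
    return {
        int.from_bytes(hashlib.blake2b(g.encode(), digest_size=8).digest(), "big")
        for g in grams
    }


def make_permutations(n, seed=42):
    """Draw n (a, b) coefficient pairs that stand in for random permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash(hashed, perms):
    """Signature = minimum of each affine hash over the document's shingles."""
    return [min((a * h + b) % PRIME for h in hashed) for a, b in perms]


def candidate_pairs(signatures, bands=BANDS, rows=ROWS, max_bucket=MAX_BUCKET):
    """Bucket documents by band; emit pairs only from buckets within the limit."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for members in buckets.values():
        # Oversized buckets are assumed to be boilerplate collisions and skipped.
        if len(members) < 2 or len(members) > max_bucket:
            continue
        members = sorted(members)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


def connected_components(doc_ids, pairs):
    """Union-find over candidate pairs, so A~B and B~C land in one group."""
    parent = {d: d for d in doc_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    groups = defaultdict(list)
    for d in doc_ids:
        groups[find(d)].append(d)
    return sorted(sorted(g) for g in groups.values())


def dedup_groups(docs):
    """Run all three stages on a dict of {doc_id: text}."""
    perms = make_permutations(NUM_PERM)
    sigs = {doc_id: minhash(shingles(text), perms) for doc_id, text in docs.items()}
    return connected_components(sorted(docs), candidate_pairs(sigs))
```

Calling `dedup_groups({"a": text, "b": text, "c": other_text})` returns a list of groups, from which a pipeline would typically keep one representative per group. A production version would also verify candidate pairs against a similarity threshold before merging them. The sketch omits that step and does not model the partition co-location that the talk describes.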

Benchmarks on Common Crawl processed 253.4 million documents and generated 2.1 billion candidate pairs, completing in under five hours on a 9-node r5d.8xlarge EMR cluster. We discuss key optimizations including partition-aware MinHash LSH and band collision filtering for common boilerplate content.
Attendees will learn partition-aware LSH design patterns, strategies for handling boilerplate-induced false positives, and how to integrate deduplication into existing Spark ETL pipelines. The system will be open-sourced, enabling practitioners to deploy production-ready deduplication pipelines for their own LLM training workflows.

Doddington Forum