BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.com//pydata-london-2026//talk//T7GMEL
BEGIN:VTIMEZONE
TZID:GMT
BEGIN:STANDARD
DTSTART:20001029T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:GMT
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:BST
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-pydata-london-2026-T7GMEL@pretalx.com
DTSTART;TZID=GMT:20260606T110500
DTEND;TZID=GMT:20260606T115000
DESCRIPTION:Training large language models requires massive\, high-quality 
 text corpora—but web-scale datasets like Common Crawl contain significan
 t near-duplicate content that degrades model performance and wastes comput
 e. Existing solutions like Spark MLlib's MinHashLSH suffer from UDF serial
 ization overhead and shuffle explosion\, causing out-of-memory failures at
  scale.\n\nWe present a partition-aware MinHash LSH system that co-locates
  similar documents within Spark partitions\, dramatically reducing cross-p
 artition shuffles during similarity computation. Our approach combines vec
 torized MinHash generation using mathematical permutation tricks\, band-ba
 sed candidate filtering with configurable collision limits to handle edge 
 cases like boilerplate false positives\, and GraphFrames-based connected c
 omponents for transitive deduplication.\n\nBenchmarks on 253.4 million Co
 mmon Crawl documents\, generating 2.1 billion candidate pairs\, complete
 d in under five hours on a 9-node r5d.8xlarge EMR cluster. We discuss ke
 y optim
 izations including partition-aware MinHash LSH and band collision filterin
 g for common boilerplate content.\n\nAttendees will learn partition-awa
 re LS
 H design patterns\, strategies for handling boilerplate-induced false posi
 tives\, and how to integrate deduplication into existing Spark ETL pipelin
 es. The system will be open-sourced\, enabling practitioners to deploy pro
 duction-ready deduplication pipelines for their own LLM training workflows
 .
DTSTAMP:20260602T223426Z
LOCATION:Doddington Forum
SUMMARY:Beyond Spark MLlib: Deduplicating Common Crawl at Scale - Ken Obata
URL:https://pretalx.com/pydata-london-2026/talk/T7GMEL/
END:VEVENT
END:VCALENDAR
