2026-07-16, Memorial Hall
Imbalanced datasets are common across science and industry: most screened molecules are inactive and most batted balls in baseball result in outs. One standard practice is to downsample the majority class or avoid collecting more of it. But majority-class examples are not interchangeable. Some closely resemble other examples, some are unlike anything else in the dataset, and some define the boundary between success and failure.
This talk asks two practical questions:
1. How much majority-class data is actually necessary for a high-performing machine learning model?
2. If we cannot collect all of it, which majority-class examples should we collect?
Using three wildly different datasets (antibacterial molecular screening, sandwich taste ratings, and Major League Baseball batted-ball outcomes), I compare random downsampling to strategies that retain harder or more diverse majority-class examples, and evaluate the impact on generalization and performance for real-world machine learning models.
Motivation
The goal of this talk is pragmatic. Rather than assume that majority-class data is disposable, I measure its value in different domains and discuss how to retain the right subset under budget constraints. I also evaluate whether those choices improve performance where it matters most: generalization and discrimination near the decision boundary.
Intended Audience
This talk is aimed at:
* Python data scientists working with imbalanced datasets
* Users of scikit-learn and other ML packages who build applied ML systems
* Anyone who has wondered whether all that negative data is actually necessary
It assumes familiarity with basic machine learning concepts (classification, regression, cross-validation), but does not require deep theoretical background. The focus is on applied ML.
Datasets
I explore these questions across three domains.
1) Antibacterial screening:
This dataset consists of ~40,000 small molecules experimentally screened for antibacterial activity. Only a small fraction (3%) show measurable activity. Evaluation uses both random splits and scaffold splits, where entire structural families of molecules are held out to test generalization under distribution shift.
2) MLB batted-ball outcomes:
Using features such as exit velocity and launch angle, the task is to predict outcomes (out, single, double, home run). The majority of batted balls result in outs. Rare but desirable events like home runs occupy a small section of feature space and can have similar features to near-misses.
3) “Roll for Sandwich” ratings:
This dataset contains ingredient combinations (bread, meat, cheese, toppings) and a human rating from 0 to 10, taken from the TikTok series “Roll for Sandwich”. Roughly half of the sandwiches score above 7, while very low scores are rare (only ~11% score below 3). The space of possible combinations is large and sparsely explored. This provides a regression setting where “negative” examples are low-rated sandwiches.
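The scaffold split described for the antibacterial dataset is a group-wise split: every molecule that shares a scaffold lands on the same side. A minimal sketch of that idea follows, assuming scikit-learn's GroupShuffleSplit as the splitter and using synthetic features with made-up scaffold IDs (the variable names, sizes, and split fraction are illustrative and are not taken from the real dataset).

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 200 molecules, 8 features, ~3% actives,
# each assigned to one of 25 hypothetical scaffold IDs.
X = rng.normal(size=(200, 8))
y = (rng.random(200) < 0.03).astype(int)
scaffold_id = rng.integers(0, 25, size=200)

# test_size is the fraction of groups (scaffolds) held out, not of molecules.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=scaffold_id))

# Scaffolds that appear on both sides of the split; a group split leaves none.
shared = set(scaffold_id[train_idx]) & set(scaffold_id[test_idx])
print(f"train={len(train_idx)}, test={len(test_idx)}, shared scaffolds={len(shared)}")
```

In practice the group labels would come from a cheminformatics toolkit rather than random integers; the splitting logic stays the same.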
Evaluation
Across all three datasets, I run two main experiments.
First, data saturation experiments: hold the minority-class examples fixed, and gradually increase the number of majority-class examples to determine where performance plateaus.
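A saturation experiment of this kind can be sketched in a few lines. The example below is an illustration on synthetic data, not the code behind the talk's results: the model (logistic regression), the metric (average precision), and the majority-class sizes are all assumptions chosen to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data (assumption): 3,000 majority and 150 minority rows.
X = np.vstack([
    rng.normal(loc=0.0, size=(3000, 5)),
    rng.normal(loc=1.0, size=(150, 5)),
])
y = np.concatenate([np.zeros(3000, dtype=int), np.ones(150, dtype=int)])

# The test set is fixed once, so scores are comparable across training sizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
minority_idx = np.flatnonzero(y_train == 1)
majority_idx = rng.permutation(np.flatnonzero(y_train == 0))

def saturation_curve(majority_sizes):
    """Train on all minority rows plus the first n shuffled majority rows."""
    scores = []
    for n in majority_sizes:
        keep = np.concatenate([minority_idx, majority_idx[:n]])
        model = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])
        proba = model.predict_proba(X_test)[:, 1]
        scores.append(average_precision_score(y_test, proba))
    return scores

sizes = [50, 100, 250, 500, 1000, len(majority_idx)]
curve = saturation_curve(sizes)
for n, score in zip(sizes, curve):
    print(f"{n:>5} majority examples -> average precision {score:.3f}")
```

The point where the printed scores stop improving is the plateau. Repeating the loop over several shuffles of majority_idx would show how much of any difference is noise.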
Second, fixed-budget data selection: hold the number of majority-class examples fixed and vary how they are chosen:
* Random downsampling
* Hard examples near the decision boundary (e.g., inactive molecules structurally similar to actives, near-miss home runs, or sandwich variants that differ by one ingredient)
* Diversity-oriented selection that maximizes coverage of feature space
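The three strategies above can each be written as a function that returns the indices of the majority-class rows to keep. The sketch below is one possible reading, under stated assumptions: "hard" is approximated by Euclidean distance to the nearest minority example, and "diverse" by greedy farthest-point sampling. Both are stand-ins. A real molecular pipeline would more likely use a fingerprint similarity, and the function names here are hypothetical.

```python
import numpy as np

def select_random(X_maj, budget, rng):
    """Uniform random downsampling of the majority class."""
    return rng.choice(len(X_maj), size=budget, replace=False)

def select_hard(X_maj, X_min, budget):
    """Keep the majority rows closest to any minority row (a proxy for 'hard')."""
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    return np.argsort(nearest)[:budget]

def select_diverse(X_maj, budget, start=0):
    """Greedy farthest-point sampling: each pick is the row farthest from all prior picks."""
    chosen = [start]
    min_dist = np.linalg.norm(X_maj - X_maj[start], axis=1)
    while len(chosen) < budget:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X_maj - X_maj[nxt], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(0)
X_maj = rng.normal(size=(500, 2))           # synthetic majority class
X_min = rng.normal(loc=2.0, size=(20, 2))   # synthetic minority class
budget = 50

idx_random = select_random(X_maj, budget, rng)
idx_hard = select_hard(X_maj, X_min, budget)
idx_diverse = select_diverse(X_maj, budget)
```

Because every strategy returns the same number of indices, any downstream difference in model performance comes from which examples were kept, not how many.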
Evaluation includes classic ML metrics (e.g., F1 score). I also use matched pairs: pairs of examples that are highly similar in features but differ in outcome. In chemistry, these are matched molecular pairs that differ by a small structural modification yet flip activity. In sandwiches, these are nearly identical ingredient sets with different ratings. In baseball, these are batted balls with similar exit velocity and launch angle but different outcomes. Performance on these pairs measures whether a model captures meaningful decision boundaries rather than broad class separation. I also report top-k metrics (e.g., precision@k) to reflect practical decision-making scenarios.
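To make the matched-pair and top-k ideas concrete, here is a small sketch for the binary classification case. It rests on assumptions: pairs are found with a plain Euclidean distance threshold, and a pair counts as correct when the positive member gets the higher score. The function names and the toy data are hypothetical, and the regression setting (sandwich ratings) would need a rating-gap criterion instead of a label flip.

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scoring examples."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

def find_matched_pairs(X, y, max_dist):
    """Index pairs (i, j), i < j, that are within max_dist but have different labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    close = np.triu(dists <= max_dist, k=1)   # upper triangle: each pair once
    differs = y[:, None] != y[None, :]
    return np.argwhere(close & differs)

def matched_pair_accuracy(pairs, y, scores):
    """Fraction of pairs in which the positive member receives the higher score."""
    if len(pairs) == 0:
        return float("nan")
    i, j = pairs[:, 0], pairs[:, 1]
    pos = np.where(y[i] == 1, i, j)
    neg = np.where(y[i] == 1, j, i)
    return float(np.mean(scores[pos] > scores[neg]))

# Tiny hand-made example: rows 0/1 and rows 2/3 are near-twins with flipped labels.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]])
y = np.array([0, 1, 1, 0, 0])
scores = np.array([0.2, 0.8, 0.4, 0.6, 0.1])   # hypothetical model outputs

pairs = find_matched_pairs(X, y, max_dist=0.5)
print(pairs.tolist())                           # [[0, 1], [2, 3]]
print(matched_pair_accuracy(pairs, y, scores))  # 0.5
print(precision_at_k(y, scores, k=2))           # 0.5
```

In this toy case the model ranks the first pair correctly and the second incorrectly, which a global metric computed over all five rows could easily hide.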