Jackie Valeri SciPy 2026

Jackie Valeri

Session

Just throw it away? Class imbalance lessons from molecular machine learning to meatballs

Imbalanced datasets are common across science and industry: most screened molecules are inactive, and most batted balls in baseball result in outs. One standard practice is to downsample the majority class or to avoid collecting more of it. But majority-class examples are not interchangeable: some are closely related to other examples, some are distinct from every other example in the dataset, and some define the boundary between success and failure.

This talk asks two practical questions:

How much majority-class data is actually necessary for a well-performing machine learning model?
If we cannot collect all of it, which majority-class examples should we collect?

Using three wildly different datasets—antibacterial molecular screening, sandwich taste ratings, and Major League Baseball at-bat outcomes—I compare random downsampling to strategies that retain harder or more diverse majority-class examples, and evaluate the impact on generalization and performance for real-world machine learning models.
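
To make the comparison concrete, here is a minimal sketch of the three kinds of majority-class selection the abstract contrasts: random, "harder", and "more diverse". It is an illustration under stated assumptions, not the speaker's method. The function names are hypothetical, the data is synthetic, distance to the nearest minority example stands in for hardness, and greedy farthest-point selection stands in for diversity. Each function assumes `k` is no larger than the number of majority examples.

```python
import math
import random


def random_downsample(majority, k, seed=0):
    """Baseline: keep k majority-class points chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(majority, k)


def hardest_downsample(majority, minority, k):
    """Keep the k majority points closest to any minority point.

    Assumption: distance to the nearest minority example is a crude,
    model-free proxy for hardness (closeness to the class boundary).
    """
    def nearest_minority_distance(point):
        return min(math.dist(point, m) for m in minority)

    return sorted(majority, key=nearest_minority_distance)[:k]


def diverse_downsample(majority, k):
    """Keep k mutually distant majority points (greedy farthest-point selection)."""
    chosen = [majority[0]]  # arbitrary starting point
    # gaps[i] = distance from majority[i] to its nearest already-chosen point
    gaps = [math.dist(p, chosen[0]) for p in majority]
    while len(chosen) < k:
        idx = max(range(len(majority)), key=gaps.__getitem__)
        chosen.append(majority[idx])
        gaps = [min(g, math.dist(p, majority[idx])) for g, p in zip(gaps, majority)]
    return chosen


# Synthetic 2-D data: a large majority cloud and a small minority cluster.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(20)]

k = 50
subsets = {
    "random": random_downsample(majority, k),
    "hardest": hardest_downsample(majority, minority, k),
    "diverse": diverse_downsample(majority, k),
}
for name, subset in subsets.items():
    print(name, len(subset))
```

In an actual comparison, each retained subset would be combined with the minority class to train a model, and all models would be scored on the same held-out, still-imbalanced test set.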

Data-Driven Discovery, Machine Learning and Artificial Intelligence

Memorial Hall
