Just throw it away? Class imbalance lessons from molecular machine learning to meatballs
Imbalanced datasets are common across science and industry: most screened molecules are inactive, and most batted balls in baseball result in outs. One standard practice is to downsample the majority class or to avoid collecting more of it. But majority-class examples are not interchangeable. Some are closely related to other examples, some are distinct from everything else in the dataset, and some define the boundary between success and failure.
This talk asks two practical questions:
1. How much majority-class data is actually necessary for a machine learning model to perform well?
2. If we cannot collect all of it, which majority-class examples should we collect?
Using three wildly different datasets (antibacterial molecular screening, sandwich taste ratings, and Major League Baseball at-bat outcomes), I compare random downsampling with strategies that retain harder or more diverse majority-class examples, and I evaluate how each choice affects generalization and performance for real-world machine learning models.
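To make the three kinds of downsampling concrete, here is a minimal sketch on toy 2-D points. It is an illustration only, not the method used in the talk: the function names are hypothetical, "hard" is approximated as "close to a minority-class example", and "diverse" is approximated with greedy farthest-point sampling under Euclidean distance. Real molecular or baseball features would likely call for a different distance or a model-based notion of difficulty.

```python
import math
import random


def random_downsample(majority, k, seed=0):
    """Keep k majority-class points chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(majority, k)


def hardest_downsample(majority, minority, k):
    """Keep the k majority points closest to any minority point.

    Assumption: distance to the minority class is a crude stand-in
    for 'near the decision boundary'.
    """
    def dist_to_minority(p):
        return min(math.dist(p, q) for q in minority)

    return sorted(majority, key=dist_to_minority)[:k]


def diverse_downsample(majority, k):
    """Greedy farthest-point sampling over the majority class.

    Starts from the first point, then repeatedly adds the point that
    is farthest from everything already kept.
    """
    kept = [majority[0]]
    # nearest[i] = distance from majority[i] to its closest kept point
    nearest = [math.dist(p, kept[0]) for p in majority]
    while len(kept) < k:
        i = max(range(len(majority)), key=nearest.__getitem__)
        kept.append(majority[i])
        nearest = [min(d, math.dist(p, majority[i]))
                   for d, p in zip(nearest, majority)]
    return kept


# Toy data: two tight clusters of majority points plus one isolated point,
# and a single minority example near the second cluster.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (9.0, 9.0), (9.1, 9.0)]
minority = [(10.0, 10.0)]

print(random_downsample(majority, 3))
print(hardest_downsample(majority, minority, 2))
print(diverse_downsample(majority, 3))
```

On this toy data, the hardness-based selection keeps the two majority points nearest the minority example, while the diversity-based selection keeps one point from each cluster plus the isolated point instead of several near-duplicates.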