2019-09-05, 10:30–11:00, Track 2 (Baroja)
Machine learning is a search for the best combination of features, model, and hyperparameters. But as data grow, so does the search space! Fortunately, visual diagnostics can focus our search and allow us to steer modeling purposefully, and at scale.
Even with a modestly-sized dataset, the hunt for the most effective machine learning model is hard. Arriving at the optimal combination of features, algorithm, and hyperparameters frequently requires significant experimentation and iteration. This leads some of us to stay inside algorithmic comfort zones, some to trail off on random walks, and others to resort to automated processes like gridsearch. But whatever path we take, we are often left in doubt about whether our final solution really is the optimal one. And as our datasets grow in size and dimension, so too does this ambiguity.
Fortunately, many of us have developed strategies for steering model search. Open source libraries like seaborn, pandas and yellowbrick can help make machine learning more informed with visual diagnostic tools like histograms, correlation matrices, parallel coordinates, manifold embeddings, validation and learning curves, residuals plots, and classification heatmaps. These tools enable us to tune our models with visceral cues that allow us to be more strategic in our choices. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance allows us a peek into the multi-dimensional realm in which our models operate.
However, large, high-dimensional datasets can prove particularly difficult to explore. Not only do the majority of people struggle to visualize anything beyond two- or three-dimensional space, many of our favorite open source Python tools are not designed to be performant with arbitrarily big data. So how well do our favorite visualization techniques hold up to large, complex datasets?
In this talk, we'll consider a suite of visual diagnostics — some familiar and some new — and explore their strengths and weaknesses with several publicly available datasets of varying size. Which suffer most from the curse of dimensionality in face of increasingly big data? What are the workarounds (e.g. sampling, brushing, filtering, etc.) and when should we use them? And most importantly, how can we continue to steer the machine learning process — not only purposefully but at scale?