Visual Diagnostics at Scale
2019-09-05, Track 2 (Baroja)

Machine learning is a search for the best combination of features, model, and hyperparameters. But as data grow, so does the search space! Fortunately, visual diagnostics can focus our search and allow us to steer modeling purposefully, and at scale.


Even with a modestly sized dataset, the hunt for the most effective machine learning model is hard. Arriving at the optimal combination of features, algorithm, and hyperparameters frequently requires significant experimentation and iteration. This leads some of us to stay inside algorithmic comfort zones, some to trail off on random walks, and others to resort to automated processes like grid search. But whatever path we take, we are often left in doubt about whether our final solution really is the optimal one. And as our datasets grow in size and dimension, so too does this ambiguity.

Fortunately, many of us have developed strategies for steering model search. Open source libraries like seaborn, pandas, and yellowbrick can help make machine learning more informed with visual diagnostic tools like histograms, correlation matrices, parallel coordinates, manifold embeddings, validation and learning curves, residuals plots, and classification heatmaps. These tools give us visceral cues that allow us to be more strategic as we tune our models. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance gives us a peek into the multi-dimensional realm in which our models operate.
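A minimal sketch of that workflow, assuming scikit-learn and Yellowbrick are installed (the synthetic data and model choice here are illustrative placeholders, not material from the talk):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from yellowbrick.classifier import ClassificationReport

    # Synthetic data stands in for a real dataset.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # ClassificationReport draws per-class precision, recall, and F1 as a heatmap.
    viz = ClassificationReport(LogisticRegression(max_iter=1000), support=True)
    viz.fit(X_train, y_train)    # fit the wrapped estimator
    viz.score(X_test, y_test)    # compute metrics on the held-out split
    viz.show()                   # render the classification heatmap

The same fit/score/show pattern applies to the other visualizers mentioned above, such as residuals plots and validation curves.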

However, large, high-dimensional datasets can prove particularly difficult to explore. Not only do most people struggle to visualize anything beyond two- or three-dimensional space, but many of our favorite open source Python tools are not designed to be performant with arbitrarily big data. So how well do our favorite visualization techniques hold up to large, complex datasets?

In this talk, we'll consider a suite of visual diagnostics — some familiar and some new — and explore their strengths and weaknesses with several publicly available datasets of varying size. Which suffer most from the curse of dimensionality in the face of increasingly big data? What are the workarounds (e.g. sampling, brushing, filtering, etc.) and when should we use them? And most importantly, how can we continue to steer the machine learning process — not only purposefully but at scale?
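One of the simplest workarounds, subsampling before plotting, might look like the following hedged sketch (the file name and sample fraction are placeholders, not recommendations):

    import pandas as pd
    import seaborn as sns

    # Hypothetical large dataset; the path is a placeholder.
    df = pd.read_csv("big_dataset.csv")

    # Draw a random 1% subsample so the plot stays legible and responsive.
    subsample = df.sample(frac=0.01, random_state=42)

    # Pairwise scatterplots and histograms on the subsample rather than the full data.
    sns.pairplot(subsample)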


Project Homepage / Git

https://github.com/districtdatalabs/yellowbrick


Abstract as a tweet

How to do #machinelearning thoughtfully AND at scale? Use Scikit-Yellowbrick to visualize feature selection, model evaluation, and hyperparameter tuning, and steer towards more informed modeling!

Python Skill Level

basic

Domain Expertise

none

Domains

Big Data, Data Visualisation, Machine Learning, Open Source

Dr. Rebecca Bilbro is a data scientist, Python and Go programmer, teacher, speaker, and author in Washington, DC. She specializes in visual diagnostics for machine learning, from feature analysis to model selection and hyperparameter tuning, and has conducted research on natural language processing, semantic network extraction, entity resolution, and high dimensional information visualization. An active contributor to the open source software community, Rebecca enjoys collaborating with other developers on inclusive projects like Scikit-Yellowbrick, a pure Python visualization package for machine learning that extends scikit-learn and Matplotlib to support model selection and diagnostics. In her spare time, she can often be found either out-of-doors riding bicycles with her family or inside practicing the ukulele. Rebecca earned her doctorate from the University of Illinois, Urbana-Champaign, where her research centered on communication and visualization in engineering.