JuliaCon 2020 (times are in UTC)

Enterprise data management with low-rank topic models
2020-07-30, Purple Track

How can enterprises create a catalogue of what data they have, given only a few labels and access to physical data layouts and other metadata? I show how to extend GeneralizedLowRankModels.jl to generate topic models that can be used for semisupervised learning tasks like extrapolating from known labels, evaluating possible errors in existing labels, and predicting missing metadata.


To adopt modern practices for reproducible data science, enterprises first need to know what kinds of data they have. In some industries, such as financial services, being able to reproduce critical risk calculations is even a regulatory requirement. A necessary first step is for enterprises to build a comprehensive data catalogue, before building other infrastructure such as data lakes. Building such a catalogue can be challenging for enterprises with multiple legacy systems, incomplete documentation, and inherited technical debt. The expert knowledge needed to provide and verify subject labels further escalates the cost of building a data catalogue.

In this talk, I demonstrate how topic modeling can be used to help build a comprehensive data catalogue from incomplete subject labels and access to metadata such as low-level record types and database table names. By treating such metadata as a sequence of tokens, similar to natural text, I show how to construct semisupervised topic models that allow extrapolation from existing labels. First, I show how a gauge transformation of a standard topic modeling technique, latent semantic indexing (LSI), yields a labelled topic model that is explicitly separable. Next, I show how to use generalized low-rank models (GLRMs), as implemented in GeneralizedLowRankModels.jl, to explicitly construct a labelled topic model that is a sparse, interpretable, and separable generalization of principal component analysis. I show how to implement a new regularizer in Julia, including an implementation of its corresponding proximal operator. Furthermore, I show how to modify the code of GeneralizedLowRankModels.jl to take advantage of the new multithreading model in Julia 1.3 for near-perfect parallel speedup. Finally, numerical tricks such as low-precision iterative linear algebra, randomized subsampling, and warm starts help make training a GLRM via proximal gradient descent efficient.
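To make the regularizer step concrete, here is a minimal, self-contained Julia sketch of a sparsity-inducing nonnegative L1 regularizer together with its proximal operator. The names (NonnegOneReg, evaluate, prox) and the threaded update loop at the end are illustrative assumptions, not the exact GeneralizedLowRankModels.jl interface:

    # A nonnegative L1 regularizer: its value is λ‖u‖₁ on the nonnegative
    # orthant, and +Inf whenever any entry is negative.
    struct NonnegOneReg
        scale::Float64   # regularization strength λ
    end

    evaluate(r::NonnegOneReg, u::AbstractVector) =
        any(x -> x < 0, u) ? Inf : r.scale * sum(u)

    # Proximal operator: argmin_x ½‖x − u‖² + αλ‖x‖₁ subject to x ≥ 0,
    # which reduces to soft-thresholding followed by clipping at zero.
    prox(r::NonnegOneReg, u::AbstractVector, α::Number) =
        max.(u .- α * r.scale, 0.0)

    # Schematic proximal-gradient update x ← prox(x − η∇f(x), η), with the
    # independent per-column updates parallelized across threads.
    reg = NonnegOneReg(0.5)
    X = rand(5, 4)      # illustrative factor matrix
    G = randn(5, 4)     # illustrative gradient of the loss with respect to X
    η = 0.1             # step size
    Threads.@threads for j in 1:size(X, 2)
        X[:, j] = prox(reg, X[:, j] .- η .* G[:, j], η)
    end

Because the per-column updates are independent given the other factor, the loop is embarrassingly parallel, which is what makes it a natural target for Julia's threading.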

As an illustration of the technique, I will show how this new topic model performs at predicting subject tags on over 25,000 datasets from Kaggle.com. The GLRM-based topic model can be used for several different semisupervised learning tasks, such as extrapolating from known labels, evaluating possible errors in existing labels, and predicting missing metadata.
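As a sketch of how such extrapolation might look in practice, one can assign unlabelled documents to the nearest label centroid in the learned topic space. All names and data below are hypothetical stand-ins for a fitted model's factors, not the talk's actual pipeline:

    using LinearAlgebra, Statistics

    X = rand(1000, 20)   # document × topic loadings from a fitted model (made up)
    labels = Dict(i => rand(["finance", "health", "sports"]) for i in 1:200)

    # Centroid of each label's documents in topic space.
    centroids = Dict(
        l => vec(mean(X[[i for (i, li) in labels if li == l], :], dims=1))
        for l in unique(values(labels)))

    cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

    # Predict the label whose centroid is most similar to document i.
    function predict(i)
        best_label, best_score = "", -Inf
        for (l, c) in centroids
            s = cosine(X[i, :], c)
            if s > best_score
                best_label, best_score = l, s
            end
        end
        return best_label
    end

    predicted = [predict(i) for i in 201:1000]

The same similarity scores can flag possible labelling errors: a labelled document whose own tag disagrees with its nearest centroid is a candidate for review.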

Jiahao Chen is a Senior Vice President and Research Lead at JPMorgan AI Research in New York, with research focusing on explainability and fairness in machine learning, as well as semantic knowledge management. He was previously a Senior Manager of Data Science at Capital One, focusing on machine learning research for credit analytics and retail operations.

While still in academia, Jiahao was a Research Scientist at MIT CSAIL, where he co-founded and led the Julia Lab, focusing on applications of the Julia programming language to data science, scientific computing, and machine learning. He organized JuliaCon, the Julia conference, from 2014 to 2016, and has also organized workshops at NeurIPS, SIAM CSE, and the American Chemical Society National Meetings. He has authored over 120 Julia packages for numerical computation, data science, and machine learning, in addition to making numerous contributions to the base language itself.