Hierarchical Multiple Instance Learning
2021-07-28, Green

Learning from raw data input is one of the key components of many successful applications of machine learning methods. While machine learning problems are often formulated on data that naturally translate into a vector representation suitable for classifiers, there are data sources with a unifying hierarchical structure, such as JSON. This talk will describe Mill.jl and JsonGrinder.jl, which offer a theoretically justified approach to solving machine learning problems with these data sources.


Learning from raw data input, thus limiting the need for manual feature engineering, is one of the key components of many successful applications of machine learning methods. While machine learning problems are often formulated on data that naturally translate into a vector representation suitable for classifiers, there are data sources, for example in cybersecurity, that are naturally represented in diverse files with a unifying hierarchical structure, such as XML, JSON, and Protocol Buffers.

Converting this data to a vector (tensor) representation is generally done by manual feature engineering, which is laborious, lossy, and prone to human bias about the importance of particular features.

Mill.jl and JsonGrinder.jl are a tandem of libraries that fully automates this conversion. Starting with an arbitrary set of JSON samples, they create a differentiable machine learning model capable of inference on further JSON samples in their raw form.

In the spirit of the Julia language, the framework is split into two packages: Mill.jl, which implements the hierarchical multiple instance learning paradigm and offers a theoretically justified approach to building machine learning models for this type of data, and JsonGrinder.jl, which summarizes the structure of a set of JSON samples and reflects it in a Mill.jl model.
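The division of labour between the two packages can be illustrated with a minimal sketch. It is written in plain Python purely for illustration and does not use or reproduce the Mill.jl or JsonGrinder.jl APIs. The function names (summarize, width, embed) and the toy leaf encodings are assumptions made for this example, and the trainable layers that a real model would place between the levels of the hierarchy are omitted.

```python
import json

def summarize(values):
    """Merge the structure of several parsed JSON values into one schema (hypothetical helper)."""
    values = [v for v in values if v is not None]
    if values and all(isinstance(v, dict) for v in values):
        keys = sorted({k for v in values for k in v})
        return {"kind": "object",
                "children": {k: summarize([v[k] for v in values if k in v]) for k in keys}}
    if values and all(isinstance(v, list) for v in values):
        # A JSON array is treated as a bag of instances sharing one item schema.
        return {"kind": "bag", "items": summarize([x for v in values for x in v])}
    return {"kind": "leaf"}

def width(schema):
    """Length of the vector that embed() produces for this schema."""
    if schema["kind"] == "object":
        return sum(width(c) for c in schema["children"].values())
    if schema["kind"] == "bag":
        return width(schema["items"])
    return 1

def embed(value, schema):
    """Map a raw JSON value to a fixed-length vector by following the schema."""
    zeros = [0.0] * width(schema)
    if schema["kind"] == "object":
        if not isinstance(value, dict):
            return zeros
        out = []
        for key, child in schema["children"].items():
            out.extend(embed(value.get(key), child))  # missing keys become zeros
        return out
    if schema["kind"] == "bag":
        if not isinstance(value, list) or not value:
            return zeros
        rows = [embed(item, schema["items"]) for item in value]
        # Mean pooling makes the result independent of bag size and item order.
        return [sum(col) / len(rows) for col in zip(*rows)]
    # Toy leaf encodings (assumption): numbers as-is, strings by their length.
    if isinstance(value, (int, float)):
        return [float(value)]
    if isinstance(value, str):
        return [float(len(value))]
    return [0.0]

samples = [
    json.loads('{"name": "a.exe", "sizes": [1, 3], "meta": {"signed": true}}'),
    json.loads('{"name": "bb.dll", "sizes": [], "meta": {"signed": false, "age": 4}}'),
]
schema = summarize(samples)
vectors = [embed(s, schema) for s in samples]
print(vectors)  # [[0.0, 1.0, 5.0, 2.0], [4.0, 0.0, 6.0, 0.0]]
```

In this sketch, summarize loosely plays the role described for JsonGrinder.jl (deriving a structure from samples), while embed loosely plays the role described for Mill.jl (a model that mirrors that structure and pools over bags). Both samples end up as vectors of the same length even though they differ in keys and array sizes.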

The talk will be split into four parts:
1) Motivation: why we think the problem is interesting
2) Description of the mathematical formulation and the theorems about its correctness
3) Description of the design of the libraries
4) Practical demo

Links to the libraries:
https://github.com/CTUAvastLab/Mill.jl
https://github.com/pevnak/JsonGrinder.jl

Tomas Pevny graduated from the Faculty of Nuclear Sciences and Physical Engineering, CTU, Prague, in 2003. From 2004 to 2008, he pursued a Ph.D. at Binghamton University, SUNY, USA, specializing in steganalysis. In 2008-2009 he spent a wonderful post-doc year in Grenoble. Since 2009, he has been with the Faculty of Electrical Engineering, CTU, in Prague. From 2013 to 2019, he was also a consulting scientist at Cisco, and since 2019 he has been a consulting scientist at Avast. His specialization is machine learning in security domains. He has been an active user of Julia since 2015.