Exceeding Classical: Probabilistic Data Structures in Data Intensive Applications
2019-09-05, Track 3 (Oteiza)

We interact with an ever-increasing amount of data, but classical data structures and algorithms no longer meet our requirements. This talk presents probabilistic algorithms and data structures and describes the main areas of their application.


Nowadays, research in every scientific domain, from medicine to astronomy, is impossible without processing huge amounts of data to check hypotheses, find new relations, and make discoveries. However, traditional technologies, including classical data structures and algorithms, become ineffective or require too many resources. This creates a demand for various optimization techniques, new data processing paradigms, and, finally, appropriate algorithms.

The presentation is dedicated to probabilistic data structures, a common name for advanced data structures based mostly on various hashing techniques. Unlike classical ones, they provide approximate answers, but with reliable ways to estimate the possible errors and uncertainty. They are designed for extremely low memory consumption, constant query time, and easy scaling, factors that are essential for data-intensive applications. It is hard to imagine a field that involves learning from data where they could not be applied.

Such structures are not necessarily new. Probably everybody knows the Bloom filter, designed in the 1970s: it efficiently answers membership queries (the task of deciding whether some element belongs to a dataset or not) in constant time, without having to store all the elements. This is one example of a probabilistic data structure, but many more have been designed for various tasks in many domains.
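To make the idea concrete, here is a minimal Bloom filter sketch in plain Python. It is a toy illustration only, not the optimized Cython implementation in PDSA; the class and parameter names are invented for this example.

import hashlib

class BloomFilter:
    """Answers "possibly in set" or "definitely not in set" with a fixed bit budget."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, element):
        # Derive num_hashes bit positions from salted hashes of the element.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, element):
        for position in self._positions(element):
            self.bits[position] = 1

    def __contains__(self, element):
        # May report a false positive, but never a false negative.
        return all(self.bits[position] for position in self._positions(element))

bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("python")
print("python" in bf)   # True
print("haskell" in bf)  # False (with high probability)

Note how membership is answered from the bit array alone; the original elements are never stored.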

In this talk, I explain five of the most important problems in data processing that occur across different domains but can be efficiently solved with probabilistic data structures and algorithms. We cover membership querying, counting of unique elements (cardinality), frequency estimation and rank estimation in data streams, and similarity.
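As another taste of this family, the sketch below shows a simplified Count-Min sketch for the frequency-estimation problem. Again, this is an illustrative sketch rather than PDSA's actual API, and the width and depth values are arbitrary.

import hashlib

class CountMinSketch:
    """Approximate per-element counters in sub-linear memory.
    Estimates may overcount due to hash collisions, but never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, element):
        # Each row uses its own salted hash function.
        digest = hashlib.sha256(f"{row}:{element}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, element):
        for row in range(self.depth):
            self.table[row][self._index(row, element)] += 1

    def estimate(self, element):
        # Taking the minimum across rows limits the damage from collisions.
        return min(self.table[row][self._index(row, element)]
                   for row in range(self.depth))

cms = CountMinSketch(width=272, depth=5)
for word in ["spam", "spam", "eggs", "spam"]:
    cms.add(word)
print(cms.estimate("spam"))  # 3 (never less, possibly more)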

Everybody interested in the topic is welcome to contribute to PDSA, a free and open-source Python (Cython) library.


Project Homepage / Git:

https://pdsa.gakhov.com

Abstract as a tweet:

Learn why you need probabilistic data structures and algorithms in your data applications.

Python Skill Level:

professional

Domain Expertise:

some

Domains:

Big Data, Robotics & IoT

See also: PDSA in Big Data Ecosystem (134.3 KB)

Andrii Gakhov is a mathematician and software engineer holding a Ph.D. in mathematical modeling and numerical methods. He taught in the School of Computer Science at V. Karazin Kharkiv National University in Ukraine for a number of years and currently works as a software practitioner for ferret go GmbH, the leading community moderation, automation, and analytics company in Germany. His fields of interest include machine learning, stream mining, and data analysis.

He is the author of "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 9783748190486).