PyCon UK 2019

What are they talking about? Mining topics in documents with topic modelling and Python
09-15, 10:30–12:00 (Europe/London), Room K

This tutorial is a practical introduction to topic modelling in Python. It tackles the problem of analysing large textual data sets in order to identify topics of interest and their related keywords.


This tutorial tackles the problem of analysing large data sets of unstructured textual data, with the aim of identifying and understanding topics of interest and their related keywords.

Topic modelling is a technique that provides a bird's-eye view of a large collection of text documents. The purpose is to identify abstract topics and capture hidden semantic structures. Topic modelling techniques can be used in exploratory analysis, to better understand the semantics of a collection even in the absence of explicit labels.

In this tutorial, we'll walk through the whole pipeline of pre-processing textual data, applying topic modelling techniques, and evaluating the output. The focus will be on classic approaches like Latent Dirichlet Allocation (LDA), with practical examples in Python using the library Gensim.

The tutorial is aimed at beginners in Natural Language Processing (NLP) and at anyone interested in learning more about NLP tools and techniques.

By attending this tutorial, participants will learn:
- how to run an end-to-end NLP pipeline on the problem of topic mining
- how to capture semantic structures in text with topic modelling
- how to assess the output of topic modelling techniques applied to textual data
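
One common way to assess a topic automatically is a coherence score, which checks whether a topic's top words tend to appear in the same documents. The abstract does not say which evaluation methods the tutorial covers, so the following is only a sketch of a UMass-style coherence measure in plain Python. The function name umass_coherence and the toy documents are hypothetical; Gensim offers its own implementation in gensim.models.CoherenceModel.

```python
import math


def umass_coherence(top_words, tokenised_docs):
    """UMass-style coherence for one topic.

    top_words: the topic's top words, most probable first.
    tokenised_docs: list of token lists, one per document.
    Returns the unnormalised sum over word pairs; higher means more coherent.
    """
    doc_sets = [set(doc) for doc in tokenised_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for later in range(1, len(top_words)):
        for earlier in range(later):
            denom = doc_freq(top_words[earlier])
            if denom == 0:
                continue  # word never occurs in the corpus: skip this pair
            joint = doc_freq(top_words[later], top_words[earlier])
            # The +1 smoothing avoids log(0) when the pair never co-occurs.
            score += math.log((joint + 1) / denom)
    return score


# Toy pre-processed documents, invented for illustration.
docs = [
    ["cat", "mouse", "mat"],
    ["cat", "dog", "pet"],
    ["market", "investor", "share"],
    ["market", "share", "price"],
]

coherent = umass_coherence(["market", "share"], docs)
mixed = umass_coherence(["market", "cat"], docs)
print(coherent > mixed)  # words that co-occur score higher
```

Automatic scores like this are usually read alongside a manual look at each topic's top words, since a high score does not guarantee that a topic is useful.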

If you're planning to attend the tutorial, please download the material beforehand: https://github.com/bonzanini/topic-modelling


Is your proposal suitable for beginners? – yes
See also: Slides (119.2 KB)

Marco is a Data Science Consultant and Trainer based in London, co-organiser of the PyData London meetup and chairperson of the PyData London conference 2018-19.
