2025-04-24 – Hassium
Topic modelling has come a long way, evolving from traditional statistical methods to leveraging advanced embeddings and neural networks. Python’s diverse library ecosystem includes tools like Latent Dirichlet Allocation (LDA) using gensim, Top2Vec, BERTopic, and Contextualized Topic Models (CTM). This talk evaluates these popular approaches using a dataset of UK climate change policies, considering use cases relevant to organisations like DEFRA (Department for Environment, Food & Rural Affairs). The analysis explores real-time integration, dynamic topic modelling over time, adding new documents, and retrieving similar ones. Attendees will learn the strengths, limitations, and practical applications of each library to make informed decisions for their projects.
Objectives:
The session aims to:
- Compare Python-based topic modelling libraries, highlighting their relevance to real-world scenarios like policy analysis.
- Explore practical use cases, including real-time document integration, tracking topic evolution, and finding similar documents.
- Evaluate the tools based on performance, interpretability, scalability, and flexibility, with a focus on the climate change policy data presented in [1], covering adaptation and mitigation.
- Provide actionable guidance on selecting the right library for different project needs and datasets.
Outline:
- Introduction to Topic Modelling: Overview of traditional and modern approaches and their practical significance.
- Algorithms & Libraries Overview: LDA (gensim) [2], CTM [3], Top2Vec [4], BERTopic [5].
- Dataset and Use Cases:
  - Overview of the UK climate change policy dataset.
  - Use cases inspired by DEFRA and similar organisations, such as:
    - Real-time integration for continuously adding new documents.
    - Tracking topic development over time (dynamic topic modelling).
    - Retrieving similar documents for faster insights.
    - Classification (optional).
- Evaluation Criteria: Analysis of libraries based on:
  - Ease of Use: Accessibility for users without extensive coding experience.
  - Quality: Coherence and diversity of extracted topics.
  - Efficiency: Runtime performance and scalability.
  - Flexibility: Features such as contextual embeddings and integration capabilities.
  - Interpretability: Ease of understanding topics and output.
- Results: Detailed findings, including the specific advantages and limitations of each library in supporting the outlined use cases.
- Practical Recommendations: Guidance on choosing a library based on project goals, dataset characteristics, and organisational needs.
- Conclusion and Future Directions: Summary of key insights and the evolving role of embedding-based methods in topic modelling.
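One of the use cases above, retrieving similar documents, can be sketched library-agnostically with TF-IDF vectors and cosine similarity. This is a simple stand-in for the embedding-based retrieval that Top2Vec and BERTopic provide out of the box; the three policy snippets are invented examples:

```python
# Similar-document retrieval sketch: TF-IDF + cosine similarity.
# The policy snippets are invented stand-ins, not the UK dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

policies = [
    "coastal flood defence and adaptation planning",
    "carbon pricing to drive emissions mitigation",
    "renewable energy subsidies for the power sector",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(policies)

# Embed a new query document into the same vector space
query = vectorizer.transform(["flood risk adaptation strategy"])
scores = cosine_similarity(query, matrix)[0]
best = scores.argmax()
print(policies[best])  # -> "coastal flood defence and adaptation planning"
```

The embedding-based libraries follow the same retrieve-by-similarity pattern but swap TF-IDF for contextual embeddings, which the talk compares on quality and runtime.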
Outcomes:
By attending this session, participants will:
- Gain an in-depth understanding of Python’s top topic modelling libraries.
- Learn how to apply these tools to real-world challenges in policy analysis and other fields.
- Understand how to handle use cases like real-time document integration and topic evolution over time.
- Develop the skills to evaluate and choose the best tool for specific datasets and objectives.
Target Audience:
This talk is for:
- Data scientists and NLP practitioners seeking to apply topic modelling to unstructured text data.
- Policy analysts and researchers working with large textual datasets, such as government or environmental policies.
- Professionals in organisations like DEFRA, where tracking changes, adding new documents, or finding similar records are critical tasks.
- Python enthusiasts interested in cutting-edge NLP techniques for extracting meaningful insights.
[1] R. Biesbroek, S. Badloe, and I. Athanasiadis. Machine learning for research on climate change adaptation policy integration: an exploratory UK case study. Regional Environmental Change, 20, July 2020.
[2] https://pypi.org/project/gensim/
[3] https://github.com/MilaNLProc/contextualized-topic-models
[4] https://github.com/ddangelov/Top2Vec
[5] https://maartengr.github.io/BERTopic/index.html
Expected audience expertise: Intermediate
Expected audience expertise (Python): Intermediate
Public link to supporting material, e.g. videos, GitHub, etc.: N/A yet
Lisa is an accomplished educator, researcher, and freelancer specializing in data science, natural language processing (NLP), and artificial intelligence. With a PhD in Intelligent Systems from UCL and a master's from Imperial College London, Lisa has extensive experience in academia and industry, having taught at UCL and contributed to impactful projects such as those with Cancer Research UK.
A digital nomad at heart, Lisa teaches corporate clients and supervises university students worldwide, focusing on Python, machine learning, and NLP. Known for their engaging teaching style and passion for problem-solving, they are currently developing innovative courses and creating a YouTube channel featuring masterclasses on data analysis and machine learning.
Driven by a love for teaching, research, and helping others succeed, Lisa is exploring opportunities to return to academia, with aspirations to lecture in Eastern Europe and Central Asia. Multilingual and versatile, they are shaping the future of data science education while continuing to inspire learners globally.