EuroSciPy 2024

Using Wikipedia as a language corpus for NLP
08-27, 11:00–12:30 (Europe/Berlin), Room 6

Learning NLP often requires a corpus of sample texts, and a common choice is Wikipedia. The project is open source and contains huge amounts of natural language content in dozens of languages. Happily, the Wikimedia Foundation publishes data dumps in XML format, which can be parsed easily. In this tutorial you will learn how to do that in Python.


In this tutorial you will learn where to find the Wikipedia dumps and how to use Python’s built-in XML parser together with a MediaWiki syntax parser (mwparserfromhell) to extract raw text from Wikipedia articles.
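For orientation, a minimal sketch of that extraction step might look as follows. The dump file name is only an example; per-language “pages-articles” dumps are published at https://dumps.wikimedia.org/.

    import bz2
    import xml.etree.ElementTree as ET

    import mwparserfromhell

    # Example file name -- e.g. the Polish "pages-articles" dump downloaded
    # from https://dumps.wikimedia.org/ (other language editions work the same way).
    DUMP = "plwiki-latest-pages-articles.xml.bz2"

    def iter_plain_texts(path):
        """Yield (title, plain text) for every article in a Wikipedia XML dump."""
        title = None
        with bz2.open(path, "rb") as dump:
            for _, elem in ET.iterparse(dump, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]       # drop the XML namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text" and elem.text:
                    wikicode = mwparserfromhell.parse(elem.text)
                    yield title, wikicode.strip_code()  # strip MediaWiki markup
                elif tag == "page":
                    elem.clear()                        # release the finished page

    for title, text in iter_plain_texts(DUMP):
        print(title, "->", text[:60])
        break

Here strip_code() removes templates, links and formatting, leaving roughly the visible article text.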

We will also discuss the difference between streaming and in-memory parsers, and why the former are better for parsing huge amounts of data.
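As a short illustration of that difference (file names are placeholders):

    import xml.etree.ElementTree as ET

    # In-memory parsing builds the whole element tree at once; fine for a small
    # sample file, but a multi-gigabyte Wikipedia dump would exhaust RAM.
    tree = ET.parse("sample_pages.xml")

    # Streaming parsing hands elements over one by one while the file is read,
    # so memory use stays flat as long as processed elements are cleared.
    for _, elem in ET.iterparse("plwiki-latest-pages-articles.xml", events=("end",)):
        ...            # process the element here
        elem.clear()   # then free it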

We will discuss the typical NLP pipeline and, as an example of the additional steps needed in inflected languages, use a morphological analyser to lemmatise words sourced from the Polish-language Wikipedia and calculate their frequencies.
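The analyser used in the tutorial is not named above; purely as an illustration, a lemma-frequency count could be sketched with spaCy’s Polish pipeline (an assumption, not necessarily the tool shown in the session):

    import collections

    import spacy

    # Assumption: spaCy's Polish model stands in for "a morphological analyser".
    # Install it first with:  python -m spacy download pl_core_news_sm
    nlp = spacy.load("pl_core_news_sm", disable=["parser", "ner"])

    def lemma_frequencies(texts):
        """Count lemma frequencies over an iterable of plain-text articles."""
        counts = collections.Counter()
        for doc in nlp.pipe(texts):
            counts.update(
                token.lemma_.lower()
                for token in doc
                if token.is_alpha        # skip numbers, punctuation, markup leftovers
            )
        return counts

    # e.g. wiki_freq = lemma_frequencies(text for _, text in iter_plain_texts(DUMP))
    #      wiki_freq.most_common(20)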

As an example application, we will compare these statistics with a Polish language corpus available in Python’s NLTK library (the “pl196x” module, containing the IPI PAN corpus of 1960s Polish) and show lexical differences between the two corpora.
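A rough sketch of such a comparison, assuming lemma counters like the one above and add-one smoothing for words missing from either corpus (the exact pl196x reader calls may differ from what the tutorial shows):

    import collections
    import math

    import nltk

    nltk.download("pl196x")                  # the IPI PAN corpus of 1960s Polish
    from nltk.corpus import pl196x

    # Frequency counts for the reference corpus; pl196x exposes the usual
    # NLTK corpus-reader interface.
    ref_freq = collections.Counter(
        w.lower() for w in pl196x.words(fileids=pl196x.fileids()) if w.isalpha()
    )

    def log_ratio(word, wiki_freq, ref_freq):
        """Log relative-frequency ratio; > 0 means over-represented on Wikipedia."""
        p_wiki = (wiki_freq[word] + 1) / (sum(wiki_freq.values()) + 1)
        p_ref = (ref_freq[word] + 1) / (sum(ref_freq.values()) + 1)
        return math.log(p_wiki / p_ref)

    # e.g. sorted(wiki_freq, key=lambda w: log_ratio(w, wiki_freq, ref_freq),
    #             reverse=True)[:20]         # the most "Wikipedia-specific" lemmas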


Abstract as a tweet

Introduction to parsing Wikipedia XML dumps in Python

Category [Scientific Applications]

Other

Expected audience expertise: Domain

none

Expected audience expertise: Python

some

I’m a software developer working at the Polish Philology department of Adam Mickiewicz University. I’m interested in natural language processing and hypertext literature.