Jakub B. Jagiełło
I’m a software developer working for the Department of Polish Philology at Adam Mickiewicz University. I’m interested in natural language processing and hypertext literature.
Adam Mickiewicz University in Poznań, Department of Polish Philology
Session
08-27
11:00
90min
Using Wikipedia as a language corpus for NLP
Jakub B. Jagiełło
Learning NLP often requires a corpus of sample texts. A common choice is Wikipedia: the project is open source and offers a huge amount of natural-language content in dozens of languages. Conveniently, the Wikimedia Foundation publishes data dumps in XML format, which can be parsed easily. In this tutorial you will learn how to do that in Python.
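As a rough illustration of the kind of parsing the tutorial covers, here is a minimal sketch that streams (title, text) pairs out of a MediaWiki XML export using only the standard library. It is not taken from the tutorial materials: the helper names iter_pages, local_name and open_dump, the inline SAMPLE document, and the dump filename mentioned in the comment are all assumptions made for this example.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    iterparse reads the file incrementally, so the whole dump never has
    to fit in memory.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


def open_dump(path):
    """Open a bz2-compressed dump for reading (hypothetical helper).

    Example (filename is an assumption):
        open_dump("plwiki-latest-pages-articles.xml.bz2")
    """
    return bz2.open(path, "rb")


# A tiny hand-written stand-in for a real dump, used only for this demo.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Poznan</title>
    <revision><text>Poznan is a city in Poland.</text></revision>
  </page>
  <page>
    <title>Python</title>
    <revision><text>Python is a programming language.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
for page_title, page_text in pages:
    print(page_title, "->", page_text)
```

The text returned this way is still raw wiki markup; turning it into clean sentences for a corpus would need a further markup-stripping step, which this sketch leaves out.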
Scientific Applications
Room 6