Natural Language Processing (NLP) is revolutionising communication in the 21st century, particularly in digital translation and machine-readable text applications. However, indigenous African languages are severely underrepresented in these applications, because good-quality, open African language datasets are rare to non-existent. This will widen the digital divide in Africa unless organisations proactively support the development of good-quality African language datasets.
This workshop will cover how NLP researchers and engineers in Africa collected, developed and curated datasets for underrepresented African languages, ensuring the data is representative, useful, and public. We will discuss these datasets and their associated tasks, such as machine translation for Yoruba from Nigeria, sentiment analysis for Tunizi Arabizi, automatic speech recognition for Wolof from Senegal, and classification for Swahili.
The data collection was facilitated through AI4D and Zindi.
The Yoruba dataset will be available on Zindi after the event, so that participants can download it, build a machine learning solution, and submit it on Zindi.
By joining Zindi, participants will become part of a community of data scientists across Africa and the world, where they can interact, learn, and share how they work with this dataset and many others from diverse fields such as agriculture, finance, and conservation.
Zindi hosts a number of active NLP challenges from across Africa with datasets in underrepresented indigenous languages, ranging from sentiment analysis to machine translation.
Along with the NLP challenges, Zindi offers predictive and computer vision challenges, among many other machine learning problems. Participants will be able to view the different challenges, take part in any that appeal to them, and have an impact on Africa and the world by applying their data science skills.
How will you deal with varying numbers of participants in your session?
Our session will start with talks from individuals working in this area, followed by a walk-through of how to work with NLP data and how to implement a specific task.
The talks will be prepared for a larger audience but can easily be tailored to a smaller one by taking more questions, becoming more interactive, and making the session more activity-focused.
A Zindi data scientist will conduct a live coding demonstration and walk-through of the machine learning solution, and a starter notebook in Python will be made available through Google Colab for participants to access and work on during the session. Google Colab allows any number of people to access and work on the shared notebook together, encouraging participatory learning.
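To give a feel for the kind of task such a starter notebook might cover, here is a minimal sketch of a text classification baseline (a bag-of-words Naive Bayes classifier with add-one smoothing) using only the Python standard library. It is not the actual Zindi notebook: the function names (`tokenize`, `train_naive_bayes`, `predict`) and the tiny English `TOY_DATA` set are invented for illustration, and a real African-language dataset would need language-aware tokenisation and far more examples.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a placeholder for real tokenisation)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy examples, for illustration only.
TOY_DATA = [
    ("great service and friendly staff", "positive"),
    ("i love this product", "positive"),
    ("terrible service and rude staff", "negative"),
    ("i hate this product", "negative"),
]

model = train_naive_bayes(TOY_DATA)
print(predict("friendly staff", model))
```

In a shared Colab notebook, participants could swap `TOY_DATA` for rows loaded from a challenge dataset and compare this baseline against stronger models.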
I am a data scientist at Zindi, the largest pan-African gathering of data scientists.
I am excited to be part of MozFest to meet like-minded individuals.
David Adelani is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. Originally from Nigeria, he is also involved in the development of NLP datasets for African languages.