MozFest House Amsterdam

Inclusive data for AI: focus on low resource languages representation and unbiased content
2024-06-13, Room F - Garden Tent

AI is evolving rapidly, and its transformative potential across industries worldwide is immense. Yet this potential is markedly hindered by limitations in the quality and diversity of the data used to train these systems. Over 90% of the data feeding into popular Large Language Models (LLMs) is in English, leaving a significant gap in digital representation for low-resource languages. This disparity not only limits AI's reach but also perpetuates biases and inequities in technological solutions.
At TAUS, we are committed to addressing these challenges through our Human Language Project (HLP). Our mission is to democratize AI by enhancing data diversity and ensuring fair representation across all languages. This session aims to discuss the challenges in creating a more inclusive AI ecosystem and to share insights from our ongoing efforts to:
- Significantly increase the digital representation of low-resource languages, thereby widening AI’s scope to serve global communities effectively.
- Actively reduce biases in AI datasets to promote more equitable and just AI applications.
- Implement fair compensation practices for our contributors, recognizing their invaluable role in enriching AI with diverse linguistic data.

Sofie is the Project Coordinator for all things HLP at TAUS. She holds a Master's degree in Linguistics from Leiden University, with a specialization in Language and Communication. She is passionate about language accessibility and gender-fair language (GFL), and brings several years of community management and localization experience to the table.

Amir is a Computational Linguist with a background in Computer Science, currently working as a Solution Architect at TAUS. With a passion for language and technology, Amir is at the forefront of developing innovative Natural Language Processing (NLP) solutions for TAUS. His work focuses on enhancing translation automation, language data management, and AI-driven language solutions.