WikidataCon 2025

Observations on Lexeme Modeling Across Languages
2025-10-31, One and Only

Editing lexicographical data on Wikidata shows that every language stretches the model in its own way. Across 30+ languages, we have seen recurring patterns and divergences in how Lexemes are modeled.
This session shares those observations plainly, with a focus on underserved and less-documented languages where contributors often work without much guidance. Along the way, we will highlight languages that already have established practices (like Turkish, German, and French) as examples others can follow.

These reflections build directly on conversations happening in the Lexicographical Data community (from Telegram, Talk pages, and evolving documentation pages), and are shared here as learning experiences. We will also briefly show how we tried to fold these lessons into our approach when we built a tool to edit Lexemes, as an example of how contributors' pain points shape tool design.

Participants will leave with grounded examples of what works, where challenges remain, and ideas for how to approach contributing or building tools that handle linguistic diversity realistically.


The lexicographical data model on Wikidata is deliberately flexible. That flexibility can help, but it also creates recurring challenges. Some languages align smoothly, while others may expose weak spots in the base model. In many underserved languages, contributors face open questions: which language variant code to use, how to assign lexical categories, what to do when hyphenation rules differ, or how to begin when little data exists.

This talk draws on our team’s observations across 30+ languages and builds directly on conversations in the Lexicographical Data community (such as Telegram, Talk pages, and documentation pages). We will share observations on:
- Script variants, such as Javanese Hanacaraka vs. Latin transliterations
- Differences in hyphenation rules
- Gaps in underserved languages where modeling practices are still emerging
- Established practices in languages such as Turkish, German, and French, and the models they have developed, as points of reference for others
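
To make the script-variant question concrete, here is a minimal Python sketch of a Lexeme whose lemma is recorded in more than one script. It is an illustration under stated assumptions, not the Wikidata data model itself or the tool discussed in this talk. The `Lexeme` class, the helper functions, and the variant codes `jv-Latn` and `jv-Java` (BCP 47 style tags with ISO 15924 script subtags) are all hypothetical; the codes actually accepted for Javanese lemmas on Wikidata may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Lexeme:
    """Simplified, hypothetical stand-in for a Lexeme record."""
    language: str                 # plain label here; Wikidata uses an item ID
    lexical_category: str         # plain label here; Wikidata uses an item ID
    lemmas: dict[str, str] = field(default_factory=dict)  # variant code -> spelling


def scripts_covered(lexeme: Lexeme) -> set[str]:
    """Collect the script subtags found in the lemma variant codes.

    Simplification: any four-letter alphabetic subtag after the language
    subtag is treated as a script; codes without one count as 'unspecified'.
    """
    scripts = set()
    for code in lexeme.lemmas:
        subtags = code.split("-")[1:]
        script = next((s for s in subtags if len(s) == 4 and s.isalpha()), None)
        scripts.add(script.title() if script else "unspecified")
    return scripts


def missing_scripts(lexeme: Lexeme, expected: set[str]) -> set[str]:
    """Return the expected scripts that have no lemma spelling yet."""
    return expected - scripts_covered(lexeme)


# Illustrative entry: the Javanese word "basa" (language) in both scripts.
basa = Lexeme(
    language="Javanese",
    lexical_category="noun",
    lemmas={"jv-Latn": "basa", "jv-Java": "ꦧꦱ"},
)

# A Latin-only entry, as a contributor might create it at first.
latin_only = Lexeme(
    language="Javanese",
    lexical_category="noun",
    lemmas={"jv-Latn": "basa"},
)
```

A check such as `missing_scripts(latin_only, {"Latn", "Java"})` is one way a tool could point out that a Hanacaraka spelling has not been added yet, without forcing every language into the same set of scripts.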

We will also include a short case study on how contributors’ pain points informed our tool design, showing how we tried to capture multiple layers of language modeling in practice.

Learning outcomes: Participants will gain a clearer view of how Lexeme modeling plays out across languages, see concrete examples from underserved contexts, and take away ideas for contributing or designing with linguistic diversity in mind.

Raisha Abdillah is a Project Lead and Wikidata contributor based in Jakarta. She leads the team behind Lexica, developing a tool for lexicographical data with a focus on accessibility, multilingual support, thoughtful design, and community-centered development, especially for contributors in underserved language communities.