2025-11-18, Main Room
In recent years, large language models (LLMs) have come to rely almost entirely on massive amounts of unlicensed, web-scraped text, raising questions about intellectual property, consent, and fairness. Estimates suggest that compensating content creators even minimally would cost billions. Faced with mounting lawsuits and competition, most AI companies have not only stopped sharing their training datasets, but some have also claimed that training competitive models without copyrighted material is “impossible.” This argument has helped justify growing opacity, even as the open web steadily closes in response to extractive AI practices. We are seeing the consequences: opaque systems we can’t audit, models we can’t reproduce, biases we can’t trace, and communities that feel exploited.
A global community of open LLM developers set out to demonstrate that performant models can be trained on responsibly sourced, openly licensed data.
In 2024, Mozilla and EleutherAI convened over 30 builders of open AI datasets to document what’s working, identify shared obstacles, and exchange strategies. The result was a collaborative paper, Towards Best Practices for Open Datasets for LLM Training, offering a practical guide to responsibly sourcing, curating, and releasing large-scale open datasets. It draws on real-world experience, tested tools, and lessons learned from persistent challenges.
The paper outlines strategies for addressing key technical and policy issues: unreliable metadata and “data laundering,” locked-in data, digitization costs, jurisdictional uncertainty, and the need for cross-domain collaboration. It also proposes a tiered model of openness, reflecting the reality that transparency in AI is inseparable from the data it’s built on.
Building performant, ethical, and openly licensed LLMs is hard work, but it’s possible and necessary. In times of data enclosure and declining trust, openness and consent must be the foundations of AI that serves the public good.
Kasia Odrozek works at the intersection of research, policy, and community engagement to help turn ideas into action in the ethical tech space. A strategist and AI ethics expert, she focuses on open and ethical datasets for AI, the emergence of artificial intimacy, identity shifts in human–AI interaction, and the broader concept of Public AI. She advises a range of organizations—from international institutions to grassroots initiatives—on responsible technology development and governance.
Kasia currently leads the Business Council at UNESCO, where she supports the implementation of the Recommendation on the Ethics of Artificial Intelligence. She brings a cross-sector perspective, translating high-level AI ethics principles into tangible commitments and practice.
Previously, she served as Director of Insights at the Mozilla Foundation, where she led research and policy efforts to advance Trustworthy AI and supported responsible tech initiatives. She has also advised public-interest funders, including the EU AI & Society Fund and the Prototype Fund.
Kasia’s background spans open-source and open-knowledge communities, including work with Wikipedia on open culture and software, and tech entrepreneurship, where she led product strategy at the podcasting platform TapeWrite. She also founded the Berlin chapter of Zebras Unite, a network promoting ethical, inclusive alternatives to traditional startup models.
She holds qualifications in law, political science, and product management, and brings a multidisciplinary lens to complex challenges at the intersection of technology and society.