2025-11-18, Main Room
In recent years, large language models (LLMs) have come to rely almost entirely on massive amounts of unlicensed, web-scraped text, raising questions about intellectual property, consent, and fairness. Estimates suggest that compensating content creators even minimally would cost billions. Faced with mounting lawsuits and competition, most AI companies have not only stopped sharing their training datasets, but some have also claimed that training competitive models without copyrighted material is “impossible.” This argument has helped justify growing opacity, even as the open web steadily closes in response to extractive AI practices. We are seeing the consequences: opaque systems we can’t audit, models we can’t reproduce, biases we can’t trace, and communities that feel exploited.
A global community of open LLM developers set out to demonstrate that performant models can be trained on responsibly sourced, openly licensed data.
In 2024, Mozilla and EleutherAI convened over 30 builders of open AI datasets to document what’s working, identify shared obstacles, and exchange strategies. The result was a collaborative paper, Towards Best Practices for Open Datasets for LLM Training, offering a practical guide to responsibly sourcing, curating, and releasing large-scale open datasets. It draws on real-world experience, tested tools, and lessons learned from persistent challenges.
The paper outlines strategies for addressing key technical and policy issues: unreliable metadata and “data laundering,” locked-in data, digitization costs, jurisdictional uncertainty, and the need for cross-domain collaboration. It also proposes a tiered model of openness, reflecting the reality that transparency in AI is inseparable from the data it’s built on.
Building performant, ethical, and openly licensed LLMs is hard work, but it’s possible and necessary. In times of data enclosure and declining trust, openness and consent must be the foundations of AI that serves the public good.
Kasia Odrozek works at the intersection of research, policy, and community engagement to help turn ideas into action in the ethical tech space. A strategist and AI ethics expert, she focuses on open and ethical datasets for AI, the emergence of artificial intimacy, identity shifts in human–AI interaction, and the broader concept of Public AI. She advises a range of organizations—from international institutions to grassroots initiatives—on responsible technology development and governance.
Kasia currently leads the Business Council at UNESCO, where she supports the implementation of the Recommendation on the Ethics of Artificial Intelligence. She brings a cross-sector perspective, translating high-level AI ethics principles into tangible commitments and practice.
Previously, she served as Director of Insights at the Mozilla Foundation, where she led research and policy efforts to advance Trustworthy AI and supported responsible tech initiatives. She has also advised public-interest funders, including the EU AI & Society Fund and the Prototype Fund.
Kasia’s background spans open-source and open-knowledge communities, including work with Wikipedia on open culture and software, and tech entrepreneurship, where she led product strategy at the podcasting platform TapeWrite. She also founded the Berlin chapter of Zebras Unite, a network promoting ethical, inclusive alternatives to traditional startup models.
She holds qualifications in law, political science, and product management, and brings a multidisciplinary lens to complex challenges at the intersection of technology and society.