CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training
2025-10-01, Louis Armand 1 - Est

Built on top of Software Heritage - the largest public archive of source code - the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance.
In this presentation, we will describe the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.


In this mostly high-level talk, attendees interested in source code archival and in training datasets for code generation will get to know CodeCommons (https://codecommons.org), an ambitious collaboration between the Software Heritage team at the French Institute for Research in Computer Science and Automation (INRIA), the French Alternative Energies and Atomic Energy Commission (CEA) and Modus Create.
CodeCommons is being built on the foundation of Software Heritage (https://www.softwareheritage.org), the world's largest publicly accessible source code archive. While Software Heritage currently comprises over 350,000,000 source code projects, it lacks licensing information and important metadata about the context in which the code was written.
As AI code generators become increasingly widespread, concerns over the legal and ethical use of training data based on open-source projects are growing. CodeCommons extends Software Heritage with licensing information, comprehensive contextual metadata and filtering capabilities to create a large-scale reference dataset for training AI on code. It emphasizes traceability, authors' rights, and ethical reuse, reducing the need for repeated, unsustainable data collection.
In this talk, we will introduce the audience to the Software Heritage project, then explore the changes CodeCommons will introduce, and finally present in more detail Modus Create's work on the CodeCommons infrastructure for distributed computing, storage and querying.