Simeon Carstens PyData Paris 2025

Simeon Carstens

Session

CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training

Building on Software Heritage, the largest public archive of source code, the CodeCommons collaboration is creating a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance.
In this presentation, we will describe the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.

Louis Armand 1 - Est
