Developing a benchmark for AI reviewers of preregistrations
In this hackathon we will work on a benchmark for AI evaluation of preregistrations. Human-labeled data for assessing preregistrations scales poorly because it requires substantial expert labor. Instead, we will focus on synthetically generated preregistrations for which we know the ground truth: which components are described adequately and which are missing or inadequately described.
Possible project tasks:
Gather or create flawless preregistrations.
Find ways in which important components of a preregistration can be broken.
Formulate prompts and AI workflows for generating versions of the flawless preregistrations that are wrong in specific ways (a prompt-building sketch follows this list).
Create a database of synthetic preregistrations together with ground truth on which components are adequate and which are not (see the record sketch below).
Write code that allows quick benchmarking of AI or human coders against the ground truth (see the scoring sketch below).
Validate the synthetic preregistrations by evaluating the adequacy of the generated components.
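
For the flaw-generation task, one possible shape of the workflow is a prompt builder that asks a language model to degrade exactly one component of an otherwise flawless preregistration. This is a minimal sketch only: the component names, prompt wording, and the `call_llm` callable are all assumptions to be replaced by whatever taxonomy and model client the team settles on.

```python
# Sketch of a flaw-injection prompt workflow (illustrative assumptions throughout).
from typing import Callable

FLAW_INSTRUCTIONS = {
    # component -> how to make it inadequate (hypothetical examples)
    "hypotheses": "Rewrite the hypotheses so they are vague and untestable.",
    "sample_size": "Remove the sample size justification and power analysis.",
    "analysis_plan": "Leave the statistical analysis plan ambiguous about which test will be used.",
}

def build_flaw_prompt(preregistration_text: str, component: str) -> str:
    """Build a prompt asking a language model to degrade one component
    of a flawless preregistration while leaving everything else intact."""
    instruction = FLAW_INSTRUCTIONS[component]
    return (
        "You will receive a complete, well-specified preregistration.\n"
        f"Task: {instruction}\n"
        "Do not change any other section. Return the full, modified preregistration.\n\n"
        f"PREREGISTRATION:\n{preregistration_text}"
    )

def generate_flawed_version(preregistration_text: str, component: str,
                            call_llm: Callable[[str], str]) -> dict:
    """Produce one synthetic preregistration plus its ground-truth label.
    `call_llm` stands in for whichever model client is chosen later."""
    flawed_text = call_llm(build_flaw_prompt(preregistration_text, component))
    return {"text": flawed_text, "flawed_component": component}
```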
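For the database task, the simplest storage that keeps text and ground truth together is one record per synthetic preregistration. The field names and the JSON Lines format here are assumptions; a relational database or the OSF structure could replace them without changing the idea.

```python
# Minimal sketch of a ground-truth record and a JSONL store; field names are assumptions.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SyntheticPrereg:
    prereg_id: str
    text: str                      # full synthetic preregistration text
    source_id: str                 # id of the flawless original it was derived from
    # ground truth: component name -> True if adequate, False if broken
    component_adequacy: dict = field(default_factory=dict)

def append_record(record: SyntheticPrereg, path: str = "synthetic_preregs.jsonl") -> None:
    """Append one record to a JSON Lines file, one preregistration per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def load_records(path: str = "synthetic_preregs.jsonl") -> list:
    """Read all records back for benchmarking."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```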
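For the benchmarking task, scoring reduces to comparing a reviewer's adequacy judgments against the stored ground truth. The sketch below assumes the reviewer (AI or human) reports per-component adequate/inadequate labels keyed by the same component names as the ground truth; that output format is an assumption, not a fixed interface.

```python
# Sketch of a scoring harness for AI or human reviewers (assumed label format).
from collections import defaultdict

def score_reviewer(records: list, reviewer_labels: dict) -> dict:
    """Compare reviewer judgments against ground truth.

    `records` are loaded ground-truth records (see load_records above);
    `reviewer_labels` maps prereg_id -> {component: bool adequacy judgment}.
    Returns per-component accuracy plus an overall score.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        judged = reviewer_labels.get(rec["prereg_id"], {})
        for component, truth in rec["component_adequacy"].items():
            if component not in judged:
                continue  # skip components the reviewer did not rate
            total[component] += 1
            correct[component] += int(judged[component] == truth)
    per_component = {c: correct[c] / total[c] for c in total if total[c]}
    overall = (sum(correct.values()) / sum(total.values())) if total else 0.0
    return {"per_component": per_component, "overall": overall}
```

The same harness can also support the validation task: human raters can be scored against the intended ground truth to check that the synthetic flaws are actually detectable.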