We’re going to use https://github.com/EleutherAI/lm-evaluation-harness to run these evals.
For the purposes of running LLM evals, we want to create a JSONL file in which each line is a JSON object like this (pretty-printed here for readability):
{
"question": "What is the capital of France?",
"choices": ["London", "Berlin", "Paris", "Madrid"],
"answer": 2
}
We can structure these questions either as multiple choice, as above, or as free-form answers, like this:
{
"question": "What is the capital of France?",
"answer": "Paris"
}
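As a concrete sketch, records in either format can be written out with Python's standard library; the file names and example records below are just placeholders.

import json

# Placeholder records; in practice these would be generated from the propositions.
multiple_choice_records = [
    {
        "question": "What is the capital of France?",
        "choices": ["London", "Berlin", "Paris", "Madrid"],
        "answer": 2,  # index into "choices"
    },
]

free_form_records = [
    {"question": "What is the capital of France?", "answer": "Paris"},
]

def write_jsonl(path, records):
    # One JSON object per line: this is the JSONL format the eval tasks will read.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

write_jsonl("questions_multiple_choice.jsonl", multiple_choice_records)
write_jsonl("questions_free_form.jsonl", free_form_records)

A custom task configuration in lm-evaluation-harness can then point at these files; the exact configuration fields depend on the harness version.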
In order to create our LLM questions, we need to decide which of these formats to use.
If we choose to ask these questions as multiple choice questions, we can use various data-generation techniques to create the distractor (wrong) answers. For example, we can generate them programmatically from the propositions, or we can use an LLM to generate them. In either case, we can use PyETR to check that the distractors are not actually correct answers.
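As a rough sketch of the programmatic route: the distractors can be drawn from a pool of candidate propositions and filtered by a check that they are not themselves correct answers. Both helpers below are hypothetical placeholders; in particular, is_valid_answer is only a stub standing in for a real check built on PyETR (whose API is not shown here).

import random

def generate_distractors(correct_answer, candidate_pool, n=3):
    # Hypothetical helper: sample plausible wrong answers from a pool of candidates.
    candidates = [c for c in candidate_pool if c != correct_answer]
    return random.sample(candidates, n)

def is_valid_answer(question, candidate):
    # Stub: always treats the candidate as wrong. A real implementation would use
    # PyETR to check whether the candidate is in fact a correct answer, and reject it if so.
    return False

def build_multiple_choice(question, correct_answer, candidate_pool):
    # Assemble a multiple-choice record, keeping only distractors that are genuinely wrong.
    distractors = [
        d for d in generate_distractors(correct_answer, candidate_pool)
        if not is_valid_answer(question, d)
    ]
    choices = distractors + [correct_answer]
    random.shuffle(choices)
    return {
        "question": question,
        "choices": choices,
        "answer": choices.index(correct_answer),
    }

record = build_multiple_choice(
    "What is the capital of France?", "Paris", ["London", "Berlin", "Madrid", "Rome"]
)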
For all of these questions, we can annotate the LLM’s response as correct according to classical formal logic and/or correct according to the erotetic theory of reasoning. As far as I can tell, the Humans in, Humans out paper does not report these two numbers separately, but it would be good to report them as a confusion matrix.
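For instance, once each response has been annotated with two booleans (correct under classical logic, correct under ETR), the confusion matrix is a simple tally; the annotations below are placeholders.

from collections import Counter

# Each annotated response: (correct under classical logic, correct under ETR).
annotations = [
    (True, True),
    (True, False),
    (False, True),
    (False, False),
]

confusion = Counter(annotations)

print("                 ETR-correct  ETR-incorrect")
print(f"logic-correct    {confusion[(True, True)]:>11}  {confusion[(True, False)]:>13}")
print(f"logic-incorrect  {confusion[(False, True)]:>11}  {confusion[(False, False)]:>13}")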