We’re going to use https://github.com/EleutherAI/lm-evaluation-harness to run these evals.
For the purposes of running LLM evals, we want to create a JSONL file in which each line is a JSON object like this (pretty-printed here for readability):
{
"question": "What is the capital of France?",
"choices": ["London", "Berlin", "Paris", "Madrid"],
"answer": 2
}
We can structure these questions either as multiple choice, as above, or as free-form answers, like this:
{
"question": "What is the capital of France?",
"answer": "Paris"
}
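As a concrete sketch, records in either format can be written out with Python's standard library; the file names and example records below are just placeholders.

import json

# Placeholder records; in practice these would be generated from the propositions.
multiple_choice_records = [
    {
        "question": "What is the capital of France?",
        "choices": ["London", "Berlin", "Paris", "Madrid"],
        "answer": 2,  # index into "choices"
    },
]

free_form_records = [
    {"question": "What is the capital of France?", "answer": "Paris"},
]

def write_jsonl(path, records):
    # One JSON object per line: this is the JSONL format the eval tasks will read.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

write_jsonl("questions_multiple_choice.jsonl", multiple_choice_records)
write_jsonl("questions_free_form.jsonl", free_form_records)

A custom task configuration in lm-evaluation-harness can then point at these files; the exact configuration fields depend on the harness version.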
In order to create our LLM questions, we need to decide which of these formats to use.
If we choose to ask these questions as multiple choice questions, we can use various data-generation techniques to create the distractor (wrong) answers. For example, we can generate them programmatically from the propositions, or we can use an LLM to generate them. In either case, we can use PyETR to check that the distractors are not actually correct answers.
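As a rough sketch of the programmatic route: the distractors can be drawn from a pool of candidate propositions and filtered by a check that they are not themselves correct answers. Both helpers below are hypothetical placeholders; in particular, is_valid_answer is only a stub standing in for a real check built on PyETR (whose API is not shown here).

import random

def generate_distractors(correct_answer, candidate_pool, n=3):
    # Hypothetical helper: sample plausible wrong answers from a pool of candidates.
    candidates = [c for c in candidate_pool if c != correct_answer]
    return random.sample(candidates, n)

def is_valid_answer(question, candidate):
    # Stub: always treats the candidate as wrong. A real implementation would use
    # PyETR to check whether the candidate is in fact a correct answer, and reject it if so.
    return False

def build_multiple_choice(question, correct_answer, candidate_pool):
    # Assemble a multiple-choice record, keeping only distractors that are genuinely wrong.
    distractors = [
        d for d in generate_distractors(correct_answer, candidate_pool)
        if not is_valid_answer(question, d)
    ]
    choices = distractors + [correct_answer]
    random.shuffle(choices)
    return {
        "question": question,
        "choices": choices,
        "answer": choices.index(correct_answer),
    }

record = build_multiple_choice(
    "What is the capital of France?", "Paris", ["London", "Berlin", "Madrid", "Rome"]
)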
For all of these questions, we can annotate the LLM’s response as correct according to classical formal logic and/or correct according to the erotetic theory of reasoning. As far as I can tell, the Humans in, Humans out paper does not report these two numbers separately, but it would be good to report them as a confusion matrix.
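For instance, once each response has been annotated with two booleans (correct under classical logic, correct under ETR), the confusion matrix is a simple tally; the annotations below are placeholders.

from collections import Counter

# Each annotated response: (correct under classical logic, correct under ETR).
annotations = [
    (True, True),
    (True, False),
    (False, True),
    (False, False),
]

confusion = Counter(annotations)

print("                 ETR-correct  ETR-incorrect")
print(f"logic-correct    {confusion[(True, True)]:>11}  {confusion[(True, False)]:>13}")
print(f"logic-incorrect  {confusion[(False, True)]:>11}  {confusion[(False, False)]:>13}")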