See Evals for Reasoning for some background on what sort of data is being collected and how, and ‣ for a different presentation of these ideas.
See this excellent presentation from Ryan about problem generation.
All data can be collected with the PyETR Python library (https://github.com/Oxford-HAI-Lab/PyETR_fork) unless otherwise noted.
Basic Questions
The basic question that we want to answer is:
Do LLMs “think” in an erotetic way?
Behaviorally, we can leverage the fact that erotetic theory makes different predictions than classical logic does, which sharpens the question to:
Do LLMs tend to make logical reasoning mistakes in the way that the erotetic theory predicts?
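To make this concrete, a single eval item can pair the classically correct answer with the fallacy ETR predicts. The example below is adapted from a classic illusory-inference problem discussed in the ETR literature; the dict schema and `score` helper are illustrative sketches, not part of PyETR.

```python
# One eval item contrasting the classically valid answer with the
# ETR-predicted fallacy (adapted from the ace/queen illusory inference).
ITEM = {
    "premises": [
        "There is an ace and a queen in the hand, or there is a king and a ten in the hand.",
        "There is an ace in the hand.",
    ],
    "question": "Does it follow that there is a queen in the hand?",
    "classical_answer": "no",       # Q does not follow: A, K, T can all hold without Q
    "etr_predicted_answer": "yes",  # the fallacy erotetic theory predicts people make
}

def score(model_answer: str) -> str:
    """Classify a model's answer as classical, erotetic, or other."""
    answer = model_answer.strip().lower()
    if answer == ITEM["classical_answer"]:
        return "classical"
    if answer == ITEM["etr_predicted_answer"]:
        return "erotetic"
    return "other"
```

Scoring against both predicted answers, rather than just correctness, is what lets us distinguish "ETR-shaped" mistakes from generic noise.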
Empirical evidence supporting this hypothesis would be very interesting, and this document is an attempt to nail down the best way to get it.
The Structure of Questions
What is Asked About
- We can redo the earlier analysis across GPT-3, 3.5, 4, and 4-mini, and now we can add o1!
- Models from different companies, e.g. Anthropic's Claude and Meta's Llama
- Just the evaluation of how o1 performs might be paper-worthy
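The cross-model comparison above could be run with a small sweep harness. This is a sketch: the model list, item schema, and the `ask_fn` hook (which would wrap each provider's SDK) are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical model list; names are placeholders for whatever is available.
MODELS = ["gpt-3.5-turbo", "gpt-4", "o1", "claude-3-5-sonnet", "llama-3-70b"]

def sweep(items, ask_fn):
    """Tally classical vs. ETR-predicted answers per model.

    `ask_fn(model, prompt) -> str` is the caller's wrapper around each
    provider's API (OpenAI, Anthropic, a local Llama, ...).
    """
    results = {model: Counter() for model in MODELS}
    for model in MODELS:
        for item in items:
            answer = ask_fn(model, item["prompt"]).strip().lower()
            if answer == item["etr_predicted_answer"]:
                results[model]["erotetic"] += 1
            elif answer == item["classical_answer"]:
                results[model]["classical"] += 1
            else:
                results[model]["other"] += 1
    return results
```

The interesting statistic per model is the erotetic rate among wrong answers, not raw accuracy: a model can be wrong often without being wrong in the direction ETR predicts.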
How do results change with and without Chain of Thought?
- We can easily get different results by asking the LLM either to think before answering or to answer immediately
- If we see a difference, how will we interpret it?
- This is a bit of a follow-on question, but when we ask LLMs to think, we can read their chain-of-thought text and check whether it shows the sorts of intermediate failures that ETR predicts
- It will be hard to get numbers here, but we can talk about it
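The two conditions above only differ in the prompt's final instruction, so they can share one template. A minimal sketch, with illustrative wording for both variants:

```python
def build_prompt(premises, question, chain_of_thought: bool) -> str:
    """Render one eval item either as an immediate-answer prompt or a CoT prompt."""
    body = "\n".join(premises) + "\n\n" + question
    if chain_of_thought:
        # CoT condition: invite intermediate reasoning before the answer.
        return body + "\nThink step by step, then answer yes or no."
    # Immediate condition: forbid visible intermediate reasoning.
    return body + "\nAnswer immediately with only yes or no."
```

Keeping everything except the final instruction identical is what licenses attributing any difference in error rates to the presence or absence of chain of thought.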
What elements of “difficulty” do we want to ask about?
- Qualitatively different logical structures, e.g. inclusion of ∀ and ∃
- “Size” of question, such as number of clauses in the presupposed views
- “Size” of question, measured by character length of propositions
- Are there other ways of gauging difficulty that the ETR suggests?
- Do some questions have more erotetic traps than others?
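The first three difficulty dimensions above are cheap to compute per item. Below is a rough sketch; the connective list and the clause-counting heuristic are assumptions, not anything from PyETR.

```python
import re

def difficulty_features(premises):
    """Crude difficulty proxies for one item: clause count, length, quantifiers."""
    text = " ".join(premises).lower()
    # Heuristic: each sentential connective adds roughly one clause per premise.
    connectives = re.findall(r"\b(?:and|or|if)\b", text)
    return {
        "clauses": len(connectives) + len(premises),   # rough clause estimate
        "chars": len(text),                            # raw character length
        "has_quantifier": bool(
            re.search(r"\b(?:all|some|every|no)\b|[∀∃]", text)
        ),
    }
```

Binning results by these features would let us check whether ETR-predicted errors grow with size, or cluster on the quantified items specifically.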