See Evals for Reasoning for some background on what sort of data is being collected and how, and ‣ for a different presentation of these ideas.
See this excellent presentation from Ryan about problem generation.
Unless otherwise noted, all data can be collected with the PyETR fork Python library: https://github.com/Oxford-HAI-Lab/PyETR_fork (see the prompt-construction sketch below).
HAI_Lab___Reasoning_Evals___NeurIPS_2025-1.pdf
https://arxiv.org/abs/2506.11128
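As a purely illustrative sketch of what one collected data point might look like, here is how a single problem could be wrapped into a forced-choice prompt that pits the ETR-predicted conclusion against the classically licensed one. The premise strings, the candidate answers, and the `Problem`/`to_prompt` names are all hypothetical; in practice the problems themselves would be generated with the PyETR fork linked above.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """One reasoning item: premises plus two candidate conclusions.

    `etr_answer` is what the erotetic theory predicts a reasoner will
    conclude; `classical_answer` is what classical logic licenses. The
    strings used below are illustrative stand-ins for output that would
    come from the PyETR fork.
    """
    premises: list[str]
    etr_answer: str
    classical_answer: str


def to_prompt(p: Problem) -> str:
    """Render a problem as a two-option forced-choice prompt for an LLM."""
    premise_text = "\n".join(f"- {s}" for s in p.premises)
    return (
        "Consider the following premises:\n"
        f"{premise_text}\n\n"
        "Which conclusion follows?\n"
        f"(A) {p.etr_answer}\n"
        f"(B) {p.classical_answer}\n"
        "Answer with A or B only."
    )


# Hypothetical item in the style of an illusory-inference problem.
example = Problem(
    premises=[
        "There is an ace and a king in the hand, or there is a queen and a jack in the hand.",
        "There is an ace in the hand.",
    ],
    etr_answer="There is a king in the hand.",
    classical_answer="It does not follow that there is a king in the hand.",
)

print(to_prompt(example))
```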
The basic question that we want to answer is:
Do LLMs “think” in an erotetic way?
Behaviorally, we can leverage the fact that the erotetic theory makes different predictions than classical reasoning does, which turns the question into:
Do LLMs tend to make logical reasoning mistakes in the way that the erotetic theory predicts?
Empirical evidence supporting this hypothesis would be very interesting, and this document is an attempt to pin down the best way to gather it.
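One way to operationalize the behavioral question above, sketched under stated assumptions (a bare-bones forced-choice tally; the LLM calls and PyETR problem generation are stubbed out, and `score_choice`/`erotetic_agreement_rate` are hypothetical names): run many such items past a model and measure how often it picks the ETR-predicted conclusion on items where that conflicts with the classical answer.

```python
from collections import Counter


def score_choice(model_answer: str) -> str | None:
    """Map a raw model reply to 'etr', 'classical', or None (unparseable).

    Assumes prompts were posed as forced-choice with (A) = ETR-predicted
    and (B) = classically correct, as in the sketch above.
    """
    reply = model_answer.strip().upper()
    if reply.startswith("A"):
        return "etr"
    if reply.startswith("B"):
        return "classical"
    return None


def erotetic_agreement_rate(model_answers: list[str]) -> float:
    """Fraction of parseable answers that match the ETR prediction.

    A rate well above what a classical reasoner would produce (near zero
    on items where the two predictions diverge) is the kind of evidence
    this document is after.
    """
    tallies = Counter(score_choice(a) for a in model_answers)
    parsed = tallies["etr"] + tallies["classical"]
    return tallies["etr"] / parsed if parsed else float("nan")


# Toy usage with stand-in replies; real replies would come from an LLM API.
replies = ["A", "A", "B", "A) There is a king in the hand.", "not sure"]
print(f"ETR agreement: {erotetic_agreement_rate(replies):.2f}")
```

In a real run the (A)/(B) positions should be randomized per item, and only items where the two theories genuinely disagree should count, so that position bias or trivially correct answers don't inflate the rate.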