See Evals for Reasoning for background on what sort of data is being collected and how, and ‣ for a different presentation of these ideas.

See this excellent presentation from Ryan about problem generation.

All data can be collected with the https://github.com/Oxford-HAI-Lab/PyETR_fork Python library unless otherwise noted.

Basic Questions

The basic question that we want to answer is:

Do LLMs “think” in an erotetic way?

Behaviorally, we can leverage the fact that the erotetic theory makes different predictions than classical logic does, which sharpens the question to:

Do LLMs tend to make logical reasoning mistakes in the way that the erotetic theory predicts?

Empirical evidence supporting this hypothesis would be very interesting, and this document is an attempt to nail down the best way to collect it.
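The behavioral test above can be sketched in code: for each problem, record both the classically valid answer and the answer the erotetic theory predicts, then label each model response by which prediction it matches. This is a minimal illustrative sketch, not PyETR's API; the `Problem` class, the `classify` helper, and the example item are all hypothetical placeholders.

```python
# Hypothetical sketch: label a model's answer by which theory's
# prediction it matches. None of these names come from PyETR.
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    classical_answer: str  # what classical logic predicts
    erotetic_answer: str   # what the erotetic theory predicts


def classify(problem: Problem, model_answer: str) -> str:
    """Return which theory (if any) the model's answer agrees with."""
    answer = model_answer.strip().lower()
    matches_classical = answer == problem.classical_answer.lower()
    matches_erotetic = answer == problem.erotetic_answer.lower()
    if matches_classical and matches_erotetic:
        return "both"      # theories agree on this item; uninformative
    if matches_erotetic:
        return "erotetic"
    if matches_classical:
        return "classical"
    return "neither"


# Placeholder item, loosely in the style of illusory-inference problems;
# the specific answers here are illustrative, not verified predictions.
p = Problem(
    prompt="Either there is a king and an ace, or there is a queen. Is there an ace?",
    classical_answer="cannot tell",
    erotetic_answer="yes",
)
print(classify(p, "Yes"))  # → erotetic
```

The informative items are the ones where the two theories disagree; items labeled "both" can be kept as controls but carry no evidence either way.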

The Structure of Questions

What is Asked About

As a model gets more advanced, how does performance change?

How do results change with and without Chain of Thought?

As an eval gets more difficult, how does performance change?
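The three questions above imply a three-axis experimental grid: model capability, chain-of-thought on/off, and problem difficulty. A minimal sketch of that grid follows; the model names, the difficulty scale, and the `run_eval` stub are all hypothetical placeholders, not real endpoints.

```python
# Hypothetical sketch of the experimental grid implied by the three
# questions: capability x chain-of-thought x difficulty.
from itertools import product

MODELS = ["small-model", "medium-model", "large-model"]  # increasing capability (placeholder names)
COT = [False, True]                                      # with / without chain of thought
DIFFICULTY = [1, 2, 3]                                   # e.g. number of premises (placeholder scale)


def run_eval(model: str, cot: bool, difficulty: int) -> float:
    """Placeholder: would return the erotetic-match rate for this cell."""
    raise NotImplementedError


grid = list(product(MODELS, COT, DIFFICULTY))
print(len(grid))  # 3 models x 2 CoT settings x 3 difficulty levels = 18 cells
```

Running every cell lets each question be answered by holding two axes fixed and varying the third.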