See Evals for Reasoning for some background on what sort of data is being collected and how, and ‣ for a different presentation of these ideas.
See this excellent presentation from Ryan about problem generation.
All data can be collected with the PyETR Python library (https://github.com/Oxford-HAI-Lab/PyETR_fork) unless otherwise noted.
Basic Questions
The basic question that we want to answer is:
Do LLMs “think” in an erotetic way?
Behaviorally, we can leverage the fact that erotetic theory makes different predictions than classical logic does, which sharpens the question to:
Do LLMs tend to make logical reasoning mistakes in the way that the erotetic theory predicts?
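To make this concrete, a single eval item can pair the classically correct answer with the fallacy ETR predicts. The example below is adapted from a classic illusory-inference problem discussed in the ETR literature; the dict schema and `score` helper are illustrative sketches, not part of PyETR.

```python
# One eval item contrasting the classically valid answer with the
# ETR-predicted fallacy (adapted from the ace/queen illusory inference).
ITEM = {
    "premises": [
        "There is an ace and a queen in the hand, or there is a king and a ten in the hand.",
        "There is an ace in the hand.",
    ],
    "question": "Does it follow that there is a queen in the hand?",
    "classical_answer": "no",       # Q does not follow: A, K, T can all hold without Q
    "etr_predicted_answer": "yes",  # the fallacy erotetic theory predicts people make
}

def score(model_answer: str) -> str:
    """Classify a model's answer as classical, erotetic, or other."""
    answer = model_answer.strip().lower()
    if answer == ITEM["classical_answer"]:
        return "classical"
    if answer == ITEM["etr_predicted_answer"]:
        return "erotetic"
    return "other"
```

Scoring against both predicted answers, rather than just correctness, is what lets us distinguish "ETR-shaped" mistakes from generic noise.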
Empirical evidence supporting this hypothesis would be very interesting, and this document is an attempt to nail down the best way to get it.
The Structure of Questions
What is Asked About
- We can redo the earlier analysis across GPT-3, 3.5, 4, and 4-mini, and now we can add o1!
- Models from different companies, e.g. Anthropic's Claude and Meta's Llama
- Just the evaluation of how o1 performs might be paper-worthy
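The cross-model comparison above could be run with a small sweep harness. This is a sketch: the model list, item schema, and the `ask_fn` hook (which would wrap each provider's SDK) are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical model list; names are placeholders for whatever is available.
MODELS = ["gpt-3.5-turbo", "gpt-4", "o1", "claude-3-5-sonnet", "llama-3-70b"]

def sweep(items, ask_fn):
    """Tally classical vs. ETR-predicted answers per model.

    `ask_fn(model, prompt) -> str` is the caller's wrapper around each
    provider's API (OpenAI, Anthropic, a local Llama, ...).
    """
    results = {model: Counter() for model in MODELS}
    for model in MODELS:
        for item in items:
            answer = ask_fn(model, item["prompt"]).strip().lower()
            if answer == item["etr_predicted_answer"]:
                results[model]["erotetic"] += 1
            elif answer == item["classical_answer"]:
                results[model]["classical"] += 1
            else:
                results[model]["other"] += 1
    return results
```

The interesting statistic per model is the erotetic rate among wrong answers, not raw accuracy: a model can be wrong often without being wrong in the direction ETR predicts.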
How do results change with and without Chain of Thought?
- We can easily get different results by asking the LLM either to think before answering or to answer immediately
- If we see a difference, how will we interpret it?
- This is a bit of a follow-on question, but when we ask LLMs to think, we can read their chain-of-thought text and check whether it shows the sorts of intermediate failures that ETR predicts
- It will be hard to get numbers here, but we can talk about it
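The two conditions above only differ in the prompt's final instruction, so they can share one template. A minimal sketch, with illustrative wording for both variants:

```python
def build_prompt(premises, question, chain_of_thought: bool) -> str:
    """Render one eval item either as an immediate-answer prompt or a CoT prompt."""
    body = "\n".join(premises) + "\n\n" + question
    if chain_of_thought:
        # CoT condition: invite intermediate reasoning before the answer.
        return body + "\nThink step by step, then answer yes or no."
    # Immediate condition: forbid visible intermediate reasoning.
    return body + "\nAnswer immediately with only yes or no."
```

Keeping everything except the final instruction identical is what licenses attributing any difference in error rates to the presence or absence of chain of thought.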
What elements of “difficulty” do we want to ask about?
- Qualitatively different logical structures, e.g. inclusion of ∀ and ∃
- “Size” of question, such as number of clauses in the presupposed views
- “Size” of question, measured by character length of propositions
- Are there other ways of gauging difficulty that the ETR suggests?
- Do some questions have more erotetic traps than others?
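The first three difficulty dimensions above are cheap to compute per item. Below is a rough sketch; the connective list and the clause-counting heuristic are assumptions, not anything from PyETR.

```python
import re

def difficulty_features(premises):
    """Crude difficulty proxies for one item: clause count, length, quantifiers."""
    text = " ".join(premises).lower()
    # Heuristic: each sentential connective adds roughly one clause per premise.
    connectives = re.findall(r"\b(?:and|or|if)\b", text)
    return {
        "clauses": len(connectives) + len(premises),   # rough clause estimate
        "chars": len(text),                            # raw character length
        "has_quantifier": bool(
            re.search(r"\b(?:all|some|every|no)\b|[∀∃]", text)
        ),
    }
```

Binning results by these features would let us check whether ETR-predicted errors grow with size, or cluster on the quantified items specifically.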