I have a lot of ideas for evals we could run on LLMs to get at questions about reasoning. I can build them on the engineering side, and I think they’ll all be interesting. But I’d like help from you academics to figure out which experiments to pursue, strategically, so that they yield the best papers.
<aside> 📰
What experiments will yield the best papers?
</aside>
Here is the question I want these experiments to investigate:
<aside> 🤔
LLMs are imperfect at reasoning. What systematic failures do they have, and how can we reveal them?
</aside>
As companies and open source groups have released many different LLMs, a wide range of benchmarks has emerged to evaluate how they perform. For example, [GSM8K](https://huggingface.co/datasets/openai/gsm8k) contains grade-school math word problems, and CommonsenseQA tests everyday commonsense knowledge. For a fuller discussion of this, please see HuggingFace’s excellent overview. For a list of evals, see here.
We’ve set up EleutherAI’s lm-evaluation-harness to run our own reasoning-centric evals here.
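As a rough sketch of what a run looks like, a single harness command evaluates a model on a chosen task. The flag names follow recent lm-evaluation-harness releases and may differ by version; the model names and output paths here are just placeholders:

```bash
# Evaluate an open-weights model on GSM8K with the harness CLI.
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks gsm8k \
    --batch_size 8 \
    --output_path results/llama2_gsm8k

# Swapping to a hosted model is the same command with a different --model / --model_args,
# e.g. an OpenAI chat model (requires OPENAI_API_KEY in the environment).
lm_eval --model openai-chat-completions \
    --model_args model=gpt-4 \
    --tasks gsm8k \
    --output_path results/gpt4_gsm8k
```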
With this capability, we can run thousands of questions against an LLM in a repeatable fashion. By setting up our questions in the harness’s task format, we can feed them to a model and automatically score its answers. Furthermore, this lets us easily compare multiple models, such as GPT-4 or Claude, simply by changing one line in a config file.
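To make that concrete, here is a minimal sketch of a harness-style task definition for a custom question set. The field names follow lm-evaluation-harness’s YAML task format, but the exact schema, task name, and file path shown are illustrative rather than taken from our repo:

```yaml
# our_reasoning_eval.yaml -- hypothetical task definition for a custom question set
task: our_reasoning_eval
dataset_path: json                      # load questions from a local JSONL file
dataset_kwargs:
  data_files: data/our_questions.jsonl  # each line: {"question": ..., "answer": ...}
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until: ["\n\n"]
metric_list:
  - metric: exact_match                 # automatic scoring against the gold answer
```

With a task defined like this, switching from one model to another is just a change to the model line of the run configuration; the questions and scoring stay fixed.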
If we have an idea for an eval but don’t yet have questions for it, we can use various data-generation techniques to create them; a minimal templating sketch is shown below.
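As one illustrative approach (not necessarily the technique we’d pick), templated generation can produce thousands of question/answer pairs with known gold answers. Everything in this sketch, including the file names, is hypothetical; it just writes the JSONL format assumed above:

```python
# generate_questions.py -- hypothetical templated data generation for an arithmetic eval.
import json
import os
import random

TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Ada", "Grace", "Alan", "Edsger"]

def make_example(rng: random.Random) -> dict:
    """Fill the template with random values and compute the gold answer."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    name = rng.choice(NAMES)
    return {
        "question": TEMPLATE.format(name=name, a=a, b=b),
        "answer": str(a + b),  # answer is computed, so scoring stays fully automatic
    }

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the dataset is reproducible
    os.makedirs("data", exist_ok=True)
    with open("data/our_questions.jsonl", "w") as f:
        for _ in range(1000):
            f.write(json.dumps(make_example(rng)) + "\n")
```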
For example: