A question set designed to discover the capabilities of a Large Language Model.

Links

Use multiple runs to get error bars:

You need to be spending more money on evals - kamilė lukošiūtė
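
A minimal sketch of what "multiple runs to get error bars" can look like in practice, assuming each run of the question set yields a single score such as accuracy. The function name `error_bars`, the sample scores, and the normal-approximation multiplier `z=1.96` are illustrative assumptions, not taken from the linked post:

```python
import math
import statistics


def error_bars(run_scores, z=1.96):
    """Return (mean, half_width) for a list of per-run scores.

    half_width is z times the standard error of the mean, i.e. an
    approximate 95% interval under a normal approximation.
    """
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(run_scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, z * sem


# Hypothetical accuracies from five runs of the same question set.
scores = [0.70, 0.74, 0.68, 0.72, 0.71]
mean, half_width = error_bars(scores)
print(f"accuracy = {mean:.3f} ± {half_width:.3f}")
```

With only a handful of runs, the normal approximation tends to understate the uncertainty; a t-distribution multiplier or a bootstrap would likely give wider, more honest intervals.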