A question set designed to discover the capabilities of a large language model (LLM).
Use multiple runs to get error bars:
You need to be spending more money on evals - Kamilė Lukošiūtė
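
As a rough illustration of what "multiple runs to get error bars" can look like in practice, here is a minimal sketch that takes one aggregate score per run and reports the mean, the standard error, and an approximate 95% interval. The function name `mean_with_error_bars`, the `z=1.96` normal approximation, and the example accuracies are all assumptions made for illustration; they do not come from the referenced post. With only a handful of runs, a t-distribution multiplier would give a wider and arguably more honest interval.

```python
import math
import statistics


def mean_with_error_bars(run_scores, z=1.96):
    """Return (mean, standard_error, (low, high)) for a list of per-run scores.

    run_scores: one aggregate score (e.g. accuracy) per independent eval run.
    z: multiplier for the interval; 1.96 is the normal-approximation value
       for roughly 95% coverage (an assumption, see the note above).
    """
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(run_scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n)
    # gives the standard error of the mean.
    sem = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    half_width = z * sem
    return mean, sem, (mean - half_width, mean + half_width)


# Hypothetical accuracies from five runs of the same question set.
example_scores = [0.70, 0.74, 0.72, 0.68, 0.76]
mean, sem, (low, high) = mean_with_error_bars(example_scores)
print(f"accuracy = {mean:.3f} +/- {1.96 * sem:.3f} (interval {low:.3f} to {high:.3f})")
```

Reporting the interval alongside the mean makes it easier to tell whether a difference between two models is larger than the run-to-run noise.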