
For the love of god, please stop overfitting on GSM8K.


Difficult one. The GSM8K and MATH evals (both reported in the Reasoning Gym paper) are common in smaller-model RL papers for a reason: smaller models can get decent scores on them, unlike on fresher and harder benchmarks.

Part of the aim of RG, though, is to be used as a difficulty-adjustable and non-repeating eval, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!


It looks like your neural network is overfitted to seeing overfitting where there is none.

Prejudice is a form of overfitting, IMHO.


Agreed, the RG evals feel like a breath of fresh air.



