for the love of god, please stop overfitting on gsm8k

olliestanley · 2025-06-02T19:59:26 1748894366

Difficult one. GSM8K and MATH evals (both reported in Reasoning Gym paper) are common in smaller model RL papers for a reason, which is that smaller models can get decent scores on them, unlike fresher & harder benchmarks.

Part of the aim of RG is to be used as a difficulty-adjustable & non-repeating eval though, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!

i5heu · 2025-06-02T14:00:52 1748872852

It looks like your neural network is overfitted on seeing overfitting where there is none.

Prejudice is a form of overfitting IMHO

t55 · 2025-06-02T14:02:50 1748872970

agree, the RG evals feel like a fresh breeze