Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower co...

Majromax · 2026-04-18T13:45:01

> Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds

That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believed to be) a better model than 5.2, so if that's invisible in the benchmarking scores then the protocol has problems. The test probably should include cases that should be 'easy passes' or 'near always failures', and then paired testing could offer greater precision on improvements or degradations.

Conversely, if model providers also don't do this then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well-covered by the measured and monitored benchmark.
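A minimal sketch of the paired-testing idea above, assuming both models are run on the same tasks and each run is a plain pass/fail. It uses an exact McNemar test, which is one common choice for paired binary outcomes and is not named in the comment; the function name and the example counts are made up for illustration.

```python
from math import comb

def mcnemar_exact(only_a_passed, only_b_passed):
    """Two-sided exact McNemar p-value for paired pass/fail results.

    Only discordant tasks (one model passed, the other failed) carry
    information; tasks where both pass or both fail are ignored.
    """
    n = only_a_passed + only_b_passed
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(only_a_passed, only_b_passed)
    # Under the null hypothesis each discordant task favours either model
    # with probability 1/2, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 6 tasks where only the newer model passed, 0 the other way.
print(mcnemar_exact(0, 6))  # 0.03125
```

Because only the discordant tasks matter, a suite where the two models mostly agree can still give a clear answer from a handful of disagreements, which is where the precision gain over comparing two independent pass rates comes from.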

yorwba · 2026-04-18T15:17:48

Here's a sample-size calculator that may help illustrate the issue: https://sample-size.net/sample-size-proportions/ Put in the benchmark score of one model as p₀ and of the other model as p₁ (as a fraction between 0 and 1) and observe what kind of sample size you need to reliably observe a significant difference. The largest change between GPT 5.2 and 5.4 highlighted in https://openai.com/index/introducing-gpt-5-4/ is OSWorld-Verified going from 47.3% to 75.0%. That's quite the difference, right? So plug in 0.473 and 0.75 and note that the required sample size per model is 55. For the software engineering tasks in SWE-Bench Pro, the change from 55.6% to 57.7% is a whopping 2.1 percentage points, which you can detect with a mere 8836 samples.
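For anyone who would rather compute this than click through a form, here is a rough sketch of the arithmetic. It assumes a two-sided α of 0.05, 80% power, equal group sizes, and the standard two-proportion formula with the Fleiss continuity correction; those settings are a guess at the calculator's defaults (they do reproduce the two numbers quoted above), and the function name is made up.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_two_proportions(p0, p1, alpha=0.05, power=0.80):
    """Samples needed per model to detect p0 vs p1 (two-sided test).

    Assumed settings: equal group sizes, normal approximation,
    Fleiss continuity correction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    delta = abs(p1 - p0)
    p_bar = (p0 + p1) / 2
    # Uncorrected sample size per group.
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2 / delta ** 2
    # Continuity correction inflates n slightly.
    n_cc = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return ceil(n_cc)

print(sample_size_two_proportions(0.473, 0.75))   # 55   (OSWorld-Verified)
print(sample_size_two_proportions(0.556, 0.577))  # 8836 (SWE-Bench Pro)
```

The required n scales roughly with 1/δ², so shrinking the gap from 27.7 points to 2.1 points multiplies the cost by a factor of over a hundred.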

I'm sure someone in charge of benchmarking at OpenAI knows how statistics work and always makes sure to take a sufficiently large number of samples when comparing different models, but for most other people who want to know which model is better, the answer is unlikely to be worth the cost of measuring it precisely enough to find out.