Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

For the curious, I did some napkin math on their posted benchmarks and it racks up 20.1 percentage point difference across the 20 metrics where both were scored, for an average improvement of about 2% (non-pp). I really can't decide if that's mind blowing or boring?

Claude4.6 was almost 10pp better at at answering questions from long contexts ("corpuses" in CorpusQA and "multiround conversations" in MRCR), while DSv4 was a staggering 14pp better at one math challenge (IMOAnswerBench) and 12pp better at basic Q&A (SimpleQA-Verified).

 help



FWIW it's also like 10x cheaper.



Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: