
That link says "* With sparsity". For extremely sparse matrices you can get more than 989 TFLOPS on a CPU, if we're counting elided operations as FLOPS.
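To make the point concrete: for a matrix with density d, a sparse kernel does roughly d times the multiply-adds of a dense matmul, so crediting yourself with the dense FLOP count inflates the reported rate by about 1/d. A minimal sketch with NumPy/SciPy (the matrix size and 1% density are illustrative assumptions, not from the link):

```python
import numpy as np
from scipy import sparse

n = 4096
density = 0.01  # "extremely sparse": 1% nonzeros (illustrative)

A = sparse.random(n, n, density=density, format="csr", dtype=np.float32)
B = np.random.rand(n, n).astype(np.float32)

dense_flops = 2 * n**3        # FLOPs a dense matmul would be credited with
actual_flops = 2 * A.nnz * n  # multiply-adds a CSR matmul actually performs

C = A @ B  # sparse-dense product: only nonzero entries contribute work

# Counting elided operations inflates the reported rate by ~1/density.
inflation = dense_flops / actual_flops
print(f"nnz={A.nnz}, reported rate inflated ~{inflation:.0f}x")
```

At 1% density the same hardware "achieves" roughly 100x its real throughput under this accounting, which is the objection the comment is making.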


I am counting FP16/BF16 without sparsity, which is what the majority of AI workloads use.


That change checks out then. They didn't see much need for FP16 outside of tensor cores, so they no longer run it at double the FP32 rate there (unless I'm mixing that up with AMD).

Other forms of sparsity are heavily used at training time now, such as block compression in DeepSeek.
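As an illustration of the block idea: partition the matrix into tiles and skip tiles that are entirely (near-)zero, so downstream matmul work scales with the number of kept tiles. This is a generic block-sparsity sketch, not DeepSeek's actual scheme (their published FP8 recipe uses per-block scaling); the block size, tolerance, and helper name are all illustrative:

```python
import numpy as np

def block_compress(M, bs=128, tol=1e-6):
    """Keep only tiles whose max magnitude reaches tol.
    Generic block-sparsity sketch; not any specific production scheme."""
    n = M.shape[0]
    kept = {}
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            blk = M[i:i + bs, j:j + bs]
            if np.abs(blk).max() >= tol:
                kept[(i, j)] = blk
    return kept

n, bs = 512, 128
M = np.zeros((n, n), dtype=np.float32)
M[:bs, :bs] = np.random.rand(bs, bs)  # only one tile holds data

kept = block_compress(M, bs)
total_blocks = (n // bs) ** 2
print(f"kept {len(kept)}/{total_blocks} tiles")  # matmul cost scales with kept tiles
```

Storing and multiplying only the kept tiles is what makes this a training-time win: the elided tiles cost nothing, which again blurs the line between "real" and "with sparsity" FLOPS figures.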



