That link says "* With sparsity". For extremely sparse matrices you can get more than 989 TFLOPS on a CPU, if we're counting elided operations toward the TFLOPS figure.
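To make that concrete, here's a toy sketch in Python/SciPy (the matrix size and density are made up, and the printed figure depends entirely on your machine; the point is that it's bounded only by how sparse you make the matrix):

```python
import time

import numpy as np
import scipy.sparse as sp

# Toy numbers: a 1,000,000 x 1,000,000 matrix with ~10,000 nonzeros.
# Stored densely it would be 8 TB; as CSR it's a few hundred KB.
n = 1_000_000
A = sp.random(n, n, density=1e-8, format="csr", dtype=np.float64)
x = np.random.rand(n)

start = time.perf_counter()
y = A @ x  # the CPU only does work for the ~10k stored nonzeros
elapsed = time.perf_counter() - start

# ...but credit ourselves with the FULL dense operation count anyway.
dense_flops = 2.0 * n * n
print(f"'effective' rate: {dense_flops / elapsed / 1e12:,.0f} TFLOPS")
# Shrink `density` further and this "rate" grows without bound.
```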
That change checks out, then. They didn't see much need for FP16 outside of tensor cores, so they no longer run it at double the FP32 rate outside of the tensor cores (unless I'm mixing that up with AMD).
Other forms of sparsity are heavily used at training time now, like block compression in DeepSeek.
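For the curious, here's a rough sketch of the idea as I understand it from DeepSeek's NSA paper (mean-pooling stands in for their learned compression, and the block size and everything else here is illustrative, not their actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Compress each block of L keys into one summary vector, then let the
# query score T/L block summaries instead of T individual tokens.
# (Mean-pooling is a stand-in for the learned compression step.)
T, d, L = 4096, 64, 64          # sequence length, head dim, block size
keys = rng.standard_normal((T, d))
query = rng.standard_normal(d)

summaries = keys.reshape(T // L, L, d).mean(axis=1)  # (T/L, d)
scores = summaries @ query                           # T/L dot products

top_blocks = np.argsort(scores)[-8:]  # attend only within the best blocks
print("selected blocks:", np.sort(top_blocks))
```

The compute saved by scoring block summaries (and then attending only inside the selected blocks) is what makes this kind of sparsity pay off at training time, not just at inference.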