That doesn't seem correct. It's just matrix multiplications at the end. Doesn't ...

jashulma · 2026-02-21T22:05:12 1771711512

https://thinkingmachines.ai/blog/defeating-nondeterminism-in... A nice write-up explaining why it’s not as simple as it sounds.

tripplyons · 2026-02-21T22:29:20 1771712960

There are many ways to compute the same matrix multiplication, each applying the sum reduction in a different order, and with floating-point values those orders can produce different answers. Floating-point addition is not truly associative, because every intermediate sum gets rounded.
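A minimal sketch of the effect in plain Python, not tied to any real GPU kernel. The function names and the four input values are made up for illustration. It adds the same numbers in three different orders, using explicit loops rather than the built-in `sum`, since recent CPython versions use compensated summation for floats there.

```python
def sequential_sum(values):
    """Add left to right, like a single-threaded loop."""
    total = 0.0
    for v in values:
        total = total + v
    return total


def pairwise_sum(values):
    """Split in half and add the two partial sums, like a tree reduction."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# Hypothetical inputs chosen so the rounding is visible; the exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]
reordered = [1e16, -1e16, 1.0, 1.0]

print(sequential_sum(values))     # 1.0
print(pairwise_sum(values))       # 0.0
print(sequential_sum(reordered))  # 2.0
```

Same numbers, three answers, and only one of them is exact. A matmul kernel makes the same kind of choice for every dot product it reduces.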

spwa4 · 2026-02-21T22:51:20 1771714280

Is that really going to matter in FP32, FP16 or BF16? I would think models would be written so they'd be at least somewhat numerically stable.

Also if the inference provider guarantees specific hardware this shouldn't happen.

nomel · 2026-02-22T01:06:02 1771722362

Wait, wouldn't it be more significant in low-bit numbers, which is the whole reason they're avoided in maths applications? In any work I've ever done, low-bit numbers were entirely the reason exact order was important, whereas float64 or float128 makes it mostly negligible.
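To illustrate the bit-width point, here is a rough sketch that emulates float16 accumulation in plain Python by rounding through `struct`'s half-precision `"e"` format after every addition. The helper names and inputs are hypothetical, and real accelerators often accumulate in a wider type than the storage format, so treat it as a toy.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Left-to-right sum, rounding to float16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


def fp64_sum(values):
    """Left-to-right sum in ordinary double precision."""
    total = 0.0
    for v in values:
        total = total + v
    return total


# Same three numbers, two orders. At 2048 the float16 spacing is 2,
# so adding 1.0 to it is a tie that rounds back down to 2048.
big_first = [2048.0, 1.0, 1.0]
small_first = [1.0, 1.0, 2048.0]

print(fp16_sum(big_first), fp16_sum(small_first))  # 2048.0 2050.0
print(fp64_sum(big_first), fp64_sum(small_first))  # 2050.0 2050.0
```

In float16 the order changes the result with values this small, while float64 needs magnitudes around 1e16 before the same thing happens.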

measurablefunc · 2026-02-21T22:10:41 1771711841

You're assuming consistent hardware & software profiles. The way these things work at scale is essentially a compiler/instruction scheduling problem where you can think of different CPU/GPU combinations as the pipelines for what is basically a data-center-scale computer. The function graph is broken up into parts, compiled for different hardware profiles w/ different kernels, & then deployed & stitched together to maximize hardware utilization while minimizing cost. Service providers are not doing this b/c they want to but b/c they want to be profitable, so every hardware cycle that is not used for querying or optimization is basically wasted money.

You'll never get agreement from any major companies on your proposal b/c that would mean they'd have to provide a real SLA for all of their customers & they'll never agree to that.