There are a lot of tradeoffs to play with. Those inference ASICs may not need to carry gradients, but they are still optimised for larger batches and built to run any model. They need enough memory for the weights, for wide-batch inference, and ideally some left over for KV cache efficiency.
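To put rough numbers on that memory budget, here is a back-of-envelope sketch. Everything in it is an assumption for illustration: the function name, the model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128), fp16 storage, and the batch sizes are hypothetical and do not describe any particular chip or model.

```python
def inference_memory_gb(params_billion, bytes_per_param, n_layers,
                        n_kv_heads, head_dim, context_len, batch_size,
                        kv_bytes=2):
    """Rough memory estimate (decimal GB) as (weights, kv_cache).

    Ignores activations, fragmentation and runtime overhead.
    """
    weights = params_billion * 1e9 * bytes_per_param
    # K and V tensors: one pair per layer, per token, per sequence
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache = kv_per_token * context_len * batch_size
    return weights / 1e9, kv_cache / 1e9


# Hypothetical 8B-parameter model in fp16 with an 8k context window
shape = dict(params_billion=8, bytes_per_param=2, n_layers=32,
             n_kv_heads=8, head_dim=128, context_len=8192)

for batch in (1, 64):
    weights_gb, kv_gb = inference_memory_gb(batch_size=batch, **shape)
    print(f"batch={batch:>2}: weights {weights_gb:.1f} GB, KV cache {kv_gb:.1f} GB")
```

Under these assumed numbers the weights cost the same either way, while the KV cache grows linearly with batch size, so a wide-batch design has to budget for many times the cache that a single-user setup needs.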
For personal inference you’re given a lot more room to play in, much of it poorly explored today, and enough to matter to any argument about cost advantages evaporating.