By continuously testing competitors and local LLMs? The reason for rising prices is that they (Anthropic) probably realized that they have reached a ceiling of what LLMs are capable of, and while it's a lot, it is still not a big moat and it's definitely not intelligence.
> Anything but the simplest tooling is not transferable between model generations, let alone completely different families.
It is transferable. Yes, you will run into issues if you take prompts and workflows tuned for one model and send them to another unchanged, but most of the time fixing that is just a matter of tinkering with some prompt templates.
People port solutions between models all the time. It takes some work, but the amount of work is tractable.
Plus: this is absolutely the kind of task a coding agent can accelerate.
The biggest risk is if your solution is at the frontier of capability and a competing model (even another frontier model) just can't do it. But for a lot of use cases, that isn't the case. And even if it is the case today, there are decent odds that in a few more months it won't be.
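Concretely, the "tinkering" usually looks something like this rough sketch: the task stays model-agnostic and only the per-family wording changes (all names here are made up):

```python
# Illustrative sketch: keep the task model-agnostic and isolate model-specific
# wording in per-family templates. All names here are made up.

PROMPT_TEMPLATES = {
    "claude": "You are a careful assistant.\n\n<task>\n{task}\n</task>\n\nAnswer with JSON only.",
    "gpt": "You are a careful assistant.\n\nTask:\n{task}\n\nRespond with JSON only, no prose.",
    "local": "### Instruction:\n{task}\n\n### Response (JSON only):",
}

def build_prompt(model_family: str, task: str) -> str:
    """Return the model-specific prompt for a model-agnostic task description."""
    template = PROMPT_TEMPLATES.get(model_family, PROMPT_TEMPLATES["gpt"])
    return template.format(task=task)

# Porting to a new family is then mostly adding or tuning one entry:
print(build_prompt("claude", "Classify the sentiment of: 'great battery, awful screen'"))
```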
Yep. My approach has been: if I can't reliably get something to 90+% with a flash / nano / haiku class model, then it's not viable for any accuracy-critical work. (I don't know of, or have the luck of having, any other kind of work.) Starting out with pro / opus for production classification work has always been a trap.
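Rough sketch of that gate, using the OpenAI Python SDK just as an example; the tiny labeled set, the model name, and the 90% bar are all illustrative:

```python
# Sketch of a "cheap model first" viability gate. The model name, the tiny
# labeled set, and the 90% bar are illustrative, not from any real project.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELED = [
    ("I want a refund for this broken charger", "complaint"),
    ("What are your opening hours?", "question"),
    ("Thanks, the issue is resolved now", "resolution"),
]

def classify(text: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Reply with exactly one label: complaint, question, or resolution."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

correct = sum(classify(text) == label for text, label in LABELED)
accuracy = correct / len(LABELED)
print(f"accuracy: {accuracy:.0%}")
if accuracy < 0.9:
    print("Below the bar on a small model -> not viable for accuracy-critical work.")
```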
Ha. Sounds a lot like the lone 10x engineer vs. the predictably mediocre team held up by a scaffolding of processes. Aim high and hit or miss, or grind it out predictably and continuously. Same with humans; it depends on the loss you can afford.
If you're talking about APIs and SDKs, whether direct API calls or driving tools like Claude Code or Codex with a human out of the loop, I think it's actually fairly straightforward to switch between the various tools.
If you're talking about output quality, then yeah, that's not as easy. But for product outputs (building a customer service agent or something like that), a well-designed eval harness plus testing and iteration can get you some degree of convergence between models of similar generations. Coding is similar (iterate, measure), but harder to eval.
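For the API/SDK side, a minimal sketch of what "straightforward to switch" can look like: one thin wrapper so the rest of the code doesn't care which provider is behind it (model names and the routing rule are just placeholders):

```python
# Sketch of a thin provider-agnostic completion call, so switching backends is a
# config change rather than a rewrite. Model names and routing are placeholders.
from openai import OpenAI
import anthropic

_openai = OpenAI()                  # expects OPENAI_API_KEY
_anthropic = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY

def complete(provider: str, model: str, system: str, user: str) -> str:
    """Same task, different backend: only this function knows which SDK is in use."""
    if provider == "openai":
        resp = _openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        resp = _anthropic.messages.create(
            model=model,
            max_tokens=512,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return resp.content[0].text
    raise ValueError(f"unknown provider: {provider}")

# Swapping models is then a config change, not a code rewrite:
print(complete("openai", "gpt-4o-mini", "Be terse.", "Summarize: the build failed on step 3."))
```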
For most tasks, at some future date, isn't there going to be some ambient baseline of capabilities you can get per $/tok, starting at ~0 for OSS models, such that eventually all tooling gets trivially transferable?
It's not that hard to make it generic. It does take a little work, but really it boils down to figuring out how to make things work with the "dumbest" model in your set.
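A sketch of that "dumbest model in your set" approach: walk the candidates cheapest-first and keep the first one that clears the bar. The run_eval() helper and model names here are hypothetical:

```python
# Hypothetical helper: walk candidate models cheapest-first and keep the first
# one that clears the accuracy bar. run_eval() and the model names are made up.
from typing import Callable

MODELS_CHEAPEST_FIRST = ["small-local-model", "haiku-class", "sonnet-class", "opus-class"]

def pick_model(run_eval: Callable[[str], float], threshold: float = 0.9) -> str:
    for model in MODELS_CHEAPEST_FIRST:
        score = run_eval(model)  # e.g. accuracy on a held-out labeled set
        print(f"{model}: {score:.0%}")
        if score >= threshold:
            return model         # cheapest model that is good enough
    raise RuntimeError("No model in the set clears the bar; rethink the task or the prompts.")

if __name__ == "__main__":
    fake_scores = {"small-local-model": 0.72, "haiku-class": 0.93}
    print(pick_model(lambda m: fake_scores.get(m, 0.95)))
```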
Note that it is very likely this market can't sustain this level of competition for long. We are all still chasing the carrot of AGI, while hardware costs skyrocket.