Exactly! In https://matrix.dev/blog-2026-04-04-2.html#questions-this-rai..., we raised the same concerns. In particular, we saw a hot swap cause a 100% cache miss. For a session that already holds 800k tokens of context, rebuilding the cache from scratch is very expensive.
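For a rough sense of scale, here's a back-of-the-envelope sketch in Python. The base price and the cache read/write multipliers are placeholder assumptions for illustration, not quoted rates, and `prompt_cost` is a made-up helper; plug in the numbers from whatever pricing page applies to you.

```python
# Rough cost comparison: full cache hit vs. full cache miss on one request.
# All prices and multipliers below are HYPOTHETICAL placeholders.

def prompt_cost(tokens: int, price_per_mtok: float, multiplier: float) -> float:
    """Cost of processing `tokens` input tokens at a price scaled by `multiplier`."""
    return tokens / 1_000_000 * price_per_mtok * multiplier

SESSION_TOKENS = 800_000   # context size from the comment above
BASE_PRICE = 3.00          # assumed price per million input tokens
CACHE_READ_MULT = 0.10     # assumed: cached tokens billed at 10% of base
CACHE_WRITE_MULT = 1.25    # assumed: writing the cache billed at 125% of base

# Normal turn: the whole prefix is read from the cache.
hit = prompt_cost(SESSION_TOKENS, BASE_PRICE, CACHE_READ_MULT)

# Turn after a hot swap: 100% miss, so the whole prefix is re-written.
miss = prompt_cost(SESSION_TOKENS, BASE_PRICE, CACHE_WRITE_MULT)

print(f"cache hit:  {hit:.2f}")
print(f"cache miss: {miss:.2f}")
print(f"ratio:      {miss / hit:.1f}x")
```

Under these assumed multipliers, a single fully missed turn costs about as much as a dozen cached ones, and that's before counting the added latency.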
Also looking back at their claim: "Token counts may include tokens added automatically by Anthropic for system optimizations. You are not billed for system-added tokens. Billing reflects only your content."
A/B testing sounds like a different case, though. Do they really count it as "system-added tokens" and not charge for the extra cost? If you consider the model you're requesting as the baseline, then yes. But technically it's an A/B test of a different model, so they might quietly charge 130% on the grounds that "we didn't add any system prompt, we just routed you to a better model."