
I am having a shit experience lately. Opus 4.7, max effort.

> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.

> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.

:facepalm:



> I read the V1 code this time instead of guessing

Does the LLM even keep a (self-accessible) record of previous internal actions to make this assertion believable, or is this yet another confabulation?


No, they do not (to be clear, no internal state, just the transcript). It's entirely role-play. LLM apologies are meaningless because the models are mostly stateless: every new response is just "what would a helpful assistant with this prior context say next?"
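To make the statelessness concrete, here is a minimal sketch (Python; `call_model` is a hypothetical stand-in for whatever chat-completion SDK you use, not any real API): every turn resends the full transcript, and the "apology" is just the most probable continuation of that text.

    # Hypothetical stateless chat loop. `call_model` stands in for a real
    # chat-completion SDK call; it sees only `messages` and keeps no memory.
    def call_model(messages: list[dict]) -> str:
        return "<model reply>"  # a real call would return the model's text

    transcript = [{"role": "user", "content": "Explain what V1 MTBL is."}]
    transcript.append({"role": "assistant", "content": call_model(transcript)})

    transcript.append({"role": "user", "content": "That explanation was wrong."})
    # The "apology" turn: the model has no private record of what it did
    # before. It sees only the text above, so "I read the code this time"
    # is a plausible continuation, not a report of internal state.
    transcript.append({"role": "assistant", "content": call_model(transcript)})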


Yes, the LLM is able to see the entire prior chat history including tool use. This type of interaction occurs when the LLM fails to read the file, but acts as though it had.
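If you have the raw transcript, one way to check is to look for the corresponding tool-call record. A sketch below (message shapes vary by provider; the "tool" role and "read_file" name are assumptions here, not any particular API):

    # Hypothetical transcript audit. Assumes each tool use is logged as a
    # {"role": "tool", "name": ..., "input": ...} record; real formats vary.
    def model_actually_read(transcript: list[dict], path: str) -> bool:
        """True only if a file-read tool call for `path` appears in the log."""
        return any(
            msg.get("role") == "tool"
            and msg.get("name") == "read_file"
            and msg.get("input", {}).get("path") == path
            for msg in transcript
        )

    # If the assistant claims "I read the V1 code this time" but this
    # returns False, the claim is role-play: there is no read on record.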


This seems like the experience I've had with every model I've tried over the last several years. It seems like an inherent limitation of the technology, despite the hyperbolic claims of those financially invested in all of this paying off.


Opus 4.6 pre-nerf was incredible, almost magical. It changed my understanding of how good models could be. But that's the only model that ever made me feel that way.


Yes! I genuinely got a LOT of shit done with Opus 4.6 "pre nerf" with regular old out-of-the-box config, no crazy skills or hacks or memory tweaks or anything. The downfall is palpable. Textbook rugpull.


There was no nerf - this meme needs to die.


What exactly happened then? How did we all have this collective hallucination?


Collective hallucinations are common: the Mandela effect, people thinking FB is listening to their microphone because they see relevant ads, etc.

This is a common phenomenon: humans pattern-match to what we expect. When you learn a new vocabulary word, you see it everywhere for the next two days. When we think Claude might be nerfed, we overindex on every instance of Claude underperforming.

The only way to account for this is credible, hard data, like benchmarks over time. To this day no one has provided evidence that Claude Code, when fixed to the same thinking level, has had degraded performance.
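A sketch of what that hard data could look like (everything here is hypothetical: `run_eval`, the task names, and the model ID are stand-ins): pin the model and thinking level, run the same private task suite on a schedule, and watch the trend.

    # Hypothetical longitudinal check. `run_eval` stands in for invoking
    # the model on one task and scoring pass/fail; the task list and model
    # ID are placeholders. The point is that only the date varies.
    import datetime, json, random

    def run_eval(task: str, model: str, effort: str) -> bool:
        return random.random() < 0.8  # placeholder for a real pass/fail check

    TASKS = ["fix-bug-17", "refactor-auth", "write-migration"]  # private suite
    MODEL, EFFORT = "opus-4.6", "max"  # pinned model and thinking level

    score = sum(run_eval(t, MODEL, EFFORT) for t in TASKS) / len(TASKS)
    print(json.dumps({"date": str(datetime.date.today()),
                      "model": MODEL, "effort": EFFORT, "score": score}))
    # Append one line per day; a flat score over weeks falsifies the nerf.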


Are there any good ways to benchmark models over time that don't fall victim to Goodhart's law? It seems that once the benchmark is defined, the AI will train on it, and it will become effectively meaningless.

I read many articles about AIs doing extremely well on various tests in graduate or PhD-level programs. But these tests are well defined. A professor put the same models through his freshman CS class and most of them failed.


These models don't learn continuously; they are a static snapshot once training is finished. You only need a new benchmark once new models are published (or you need a private benchmark, in which case you don't need to update the benchmark at all).


Did they nerf the model, or was it changes to Claude Code? I agree it got frustrating.


That was better, but still not to the point that I'd just let it loose on my repo.


If it isn't working for you, why don't you choose an older model like 4.6?


Matches what I am experiencing. It makes incredibly stupid mistakes.

The weird thing is, yesterday I asked it to test and report back on a 30+ commit branch for a PR, and it did that flawlessly.


The docs suggest not using max effort in most cases to avoid overthinking :shrug:


They've jumped the shark. I truly can't comprehend why all of these changes were necessary. They had a literal money-printing machine that actually got real shit done, really well. Now it's a gamble every time and I am pulling back hard from the Anthropic ecosystem.


it's clearly all in your head. 4.6 is just as capable as it used to be. literally no one on the internet has managed to post credible, real evidence of a nerf

this is just another trendy conspiracy theory that people reinforce because of selection/recency bias. you hear "nerf", so your brain overindexes on the next time Claude does poorly. it's the same phenomenon as when you start noticing a new vocabulary word everywhere.


It seems clear that it was a money-spending machine, not a money-printing machine.



