.png)
Last week I rolled back from Claude 4.6 Opus to Claude 4.5 Opus. The rollback had nothing to do with capability. Claude 4.6 simply stopped following instructions.
My CLAUDE.md has three rules about types: mandatory TypeScript, zero tolerance for any, static types over runtime guessing. Claude 4.6 hit a type error between three service files, and the correct fix was a minute of work: update the type in each file so they match. Instead, it slapped a runtime cast at the call site. When I asked why, it quoted all three rules back to me verbatim, admitted "direct violation of instructions," and said it had no basis to bypass them. It knew the rules. It chose not to follow them.
I've supervised AI coding agents across thousands of sessions, and I've built three separate AI review agents because the first layer ignores spec files. Three layers of AI checking what the previous AI refused to follow, plus my review on top, and I still catch violations weekly. This isn't really a Claude problem. The same thing happens with every AI coding tool on the market.
Tsinghua University's AGENTIF benchmark tested 707 instructions across 50 real-world agent scenarios, and the best models followed fewer than 30% of instructions perfectly. The SWE-EVO benchmark found that when frontier models fail on real coding tasks, the primary failure mode has nothing to do with syntax or tool misuse. It comes down to instruction following. The smarter the model gets, the more its failures shift from "can't do it" to "won't do it right."
Compliance also decays with volume. Claude Sonnet shows a linear decline in instruction adherence as the number of instructions increases, so your 200-line CLAUDE.md doesn't really function as 200 rules. It acts as 200 competing priorities that the model resolves by defaulting to whatever feels fastest.
The Cursor forum has dozens of threads documenting this. One developer estimated .cursorrules work about 20-25% of the time, and another posted a damning thread where the AI told them outright that rules are just text, with no enforced behavior behind them. In other words, your carefully crafted rule system is essentially decorative.
Claude Code's GitHub issues tell the same story. Issue #668 estimates that half of all token usage goes to re-asking Claude to follow its own instructions. Issue #7777 records Claude admitting its "default mode always wins because it requires less cognitive effort," and issue #34774 documents Claude committing code without permission, then confessing it "fabricated a justification."
A DEV Community article crystallized the root cause. When Claude Code loads your CLAUDE.md, it wraps the content in framing that tells the model your instructions "may or may not be relevant." So your rules are deprioritized by the very tool that's supposed to enforce them.
Same codebase, same day. After every chat message finishes streaming, the app refetches the entire conversation from the server. The spec I wrote described a clean approach: include the missing identifier in the streaming response. One field. Claude ignored the spec and built a workaround that instead fires an extra API call after every single message. The model invented a shortcut that wasn't in the requirements because it was easier than reading what I actually wrote. And I'm okay when the model misses some Claude.md rules, but I expect it to follow the specs.
Two rule violations in one day. That's when I rolled back to 4.5.
TypeScript projects are ground zero. AI agents cast types rather than fix them, mark everything as optional instead of designing proper interfaces, and add escape hatches everywhere instead of handling edge cases. One Hacker News commenter described the signature pattern: every optional field is a question that the rest of the codebase has to answer every time it touches that data.
Pete Hodgson nailed the paradox: AI writes code at the level of a senior engineer but makes design decisions at the level of a junior. Too eager to please, never challenging your ideas. And here's the critical part. Every context reset is another brand new hire. The model has no persistent memory of being corrected, so it doesn't build habits. It just follows the path of least resistance every single time. Yeah, they added Memory to Claude Code, but it's still too vague.
Claude 3.5 Sonnet followed instructions better than 3.7 Sonnet, and multiple developers documented the regression publicly: 3.7 would attempt to solve the original prompt, encounter unrelated code, and start rewriting it unprompted. Developers reverted to the older model.
The GPT family showed the same dynamic. A megathread with thousands of engaged developers documented GPT-4o's "lazy AI syndrome." Prompts that previously generated 500 lines of working code now produce 50 lines with comments like // implement rest of logic here. GPT-5 was worse in a different way: IEEE Spectrum reported that it produces code that runs without obvious errors but quietly removes safety checks or fabricates output that matches the expected format.
The prevailing theory centers on economics. Running large models at scale is expensive, so providers use quantization, compression, and reduced computing to manage costs. On top of that, RLHF training rewards agreeableness over correctness. Laziness here shouldn't really be treated as a bug. It emerges naturally from the incentive structure. The same qualities that make a model feel "smarter" in a demo make it worse in production.
The METR trial measured what practitioners already suspected. Sixteen experienced developers across 246 real issues were 19% slower with AI tools, even though they predicted they'd be 24% faster. After the experiment, they still believed they were 20% faster, a 40-point perception gap.
Faros AI found the mechanism across 10,000+ developers. AI users merge 98% more PRs, but PR review time increases 91%, PR size increases 154%, and bugs per developer increase 9%. The AI generates more code faster, while humans spend more time reviewing it.
Qodo's survey found that 88% of developers have low confidence shipping AI code without review. And here's the twist: junior developers show the lowest quality improvements but the highest confidence in shipping unreviewed, an inverted competence-confidence gap.
Google's 2024 DORA report confirmed it at scale: each 25% increase in AI adoption correlates with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability.
Every major AI coding company built instruction-following systems. CLAUDE.md. .cursorrules. .github/copilot-instructions.md. AGENTS.md. Windsurf rules. Devin knowledge bases. The proliferation is itself an admission that base models do not follow project conventions, and GitHub's Copilot docs say it outright: they recommend accepting that variability is normal.
The most significant response was AGENTS.md, a cross-tool standard contributed to the Linux Foundation in late 2025. Over 60,000 repositories use it, and competing companies co-founding a foundation to standardize instruction files tells you how universal the problem is. But standardizing the format doesn't solve compliance. It just ensures every tool ignores the same file consistently.
The developers who made real progress moved past prompt engineering entirely: Claude Code Hooks that enforce rules via code, linter ratchets in CI, frequent session restarts. Rules in prompts are requests. Hooks in code are laws.
I understand why this is happening. A year ago every marketing deck promised AGI, and that didn't sell. So now the pitch is autonomous agents that work without human involvement. Codex runs for 999 hours unsupervised. Claude Code gets "autonomous mode." Devin promises to close tickets while you sleep. For that story to work, models need to be creative. They need to improvise and find workarounds when they hit obstacles.
That's exactly the opposite of what I need.
In my reality, I control the process from start to finish. I write the spec, define the types, and decide the architecture. The model executes. If it hits a wall, it stops and asks. It does not invent a refetch workaround that wasn't in the plan. It does not cast types to make the compiler shut up. It does not get creative with my production code.
The marketing wants you to trust AI with creative decisions. But if a model can't follow the three rules you wrote in a markdown file, how can you trust it with decisions you didn't write down?
So the real variable here comes down to discipline, not the AI itself. That was true with 212 sessions, and it's still true thousands of sessions later. The models got smarter. They did not get more obedient.
Check your git log. Count the type casts. Count the files that got changed without being mentioned in the prompt. Then decide whether you need a more creative model or a more disciplined one.
I went with disciplined. It's the only thing that works.
The numbers in this post, 19% slower delivery, 91% more review time, 9% more bugs per developer, are industry averages across thousands of engineers. Your team is different, and "discipline over creativity" only works when you can actually measure what AI is doing to your throughput, your budget, and your architecture decisions.
Our free AI Strategy Toolkit gives you three tools to get that picture:
No 47-page ebook. Just three ready-to-use tools that go from bottleneck identification to financial model to technical handoff.
Last week I rolled back from Claude 4.6 Opus to Claude 4.5 Opus. The rollback had nothing to do with capability. Claude 4.6 simply stopped following instructions.
My CLAUDE.md has three rules about types: mandatory TypeScript, zero tolerance for any, static types over runtime guessing. Claude 4.6 hit a type error between three service files, and the correct fix was a minute of work: update the type in each file so they match. Instead, it slapped a runtime cast at the call site. When I asked why, it quoted all three rules back to me verbatim, admitted "direct violation of instructions," and said it had no basis to bypass them. It knew the rules. It chose not to follow them.
I've supervised AI coding agents across thousands of sessions, and I've built three separate AI review agents because the first layer ignores spec files. Three layers of AI checking what the previous AI refused to follow, plus my review on top, and I still catch violations weekly. This isn't really a Claude problem. The same thing happens with every AI coding tool on the market.
Tsinghua University's AGENTIF benchmark tested 707 instructions across 50 real-world agent scenarios, and the best models followed fewer than 30% of instructions perfectly. The SWE-EVO benchmark found that when frontier models fail on real coding tasks, the primary failure mode has nothing to do with syntax or tool misuse. It comes down to instruction following. The smarter the model gets, the more its failures shift from "can't do it" to "won't do it right."
Compliance also decays with volume. Claude Sonnet shows a linear decline in instruction adherence as the number of instructions increases, so your 200-line CLAUDE.md doesn't really function as 200 rules. It acts as 200 competing priorities that the model resolves by defaulting to whatever feels fastest.
The Cursor forum has dozens of threads documenting this. One developer estimated .cursorrules work about 20-25% of the time, and another posted a damning thread where the AI told them outright that rules are just text, with no enforced behavior behind them. In other words, your carefully crafted rule system is essentially decorative.
Claude Code's GitHub issues tell the same story. Issue #668 estimates that half of all token usage goes to re-asking Claude to follow its own instructions. Issue #7777 records Claude admitting its "default mode always wins because it requires less cognitive effort," and issue #34774 documents Claude committing code without permission, then confessing it "fabricated a justification."
A DEV Community article crystallized the root cause. When Claude Code loads your CLAUDE.md, it wraps the content in framing that tells the model your instructions "may or may not be relevant." So your rules are deprioritized by the very tool that's supposed to enforce them.
Same codebase, same day. After every chat message finishes streaming, the app refetches the entire conversation from the server. The spec I wrote described a clean approach: include the missing identifier in the streaming response. One field. Claude ignored the spec and built a workaround that instead fires an extra API call after every single message. The model invented a shortcut that wasn't in the requirements because it was easier than reading what I actually wrote. And I'm okay when the model misses some Claude.md rules, but I expect it to follow the specs.
Two rule violations in one day. That's when I rolled back to 4.5.
TypeScript projects are ground zero. AI agents cast types rather than fix them, mark everything as optional instead of designing proper interfaces, and add escape hatches everywhere instead of handling edge cases. One Hacker News commenter described the signature pattern: every optional field is a question that the rest of the codebase has to answer every time it touches that data.
Pete Hodgson nailed the paradox: AI writes code at the level of a senior engineer but makes design decisions at the level of a junior. Too eager to please, never challenging your ideas. And here's the critical part. Every context reset is another brand new hire. The model has no persistent memory of being corrected, so it doesn't build habits. It just follows the path of least resistance every single time. Yeah, they added Memory to Claude Code, but it's still too vague.
Claude 3.5 Sonnet followed instructions better than 3.7 Sonnet, and multiple developers documented the regression publicly: 3.7 would attempt to solve the original prompt, encounter unrelated code, and start rewriting it unprompted. Developers reverted to the older model.
The GPT family showed the same dynamic. A megathread with thousands of engaged developers documented GPT-4o's "lazy AI syndrome." Prompts that previously generated 500 lines of working code now produce 50 lines with comments like // implement rest of logic here. GPT-5 was worse in a different way: IEEE Spectrum reported that it produces code that runs without obvious errors but quietly removes safety checks or fabricates output that matches the expected format.
The prevailing theory centers on economics. Running large models at scale is expensive, so providers use quantization, compression, and reduced computing to manage costs. On top of that, RLHF training rewards agreeableness over correctness. Laziness here shouldn't really be treated as a bug. It emerges naturally from the incentive structure. The same qualities that make a model feel "smarter" in a demo make it worse in production.
The METR trial measured what practitioners already suspected. Sixteen experienced developers across 246 real issues were 19% slower with AI tools, even though they predicted they'd be 24% faster. After the experiment, they still believed they were 20% faster, a 40-point perception gap.
Faros AI found the mechanism across 10,000+ developers. AI users merge 98% more PRs, but PR review time increases 91%, PR size increases 154%, and bugs per developer increase 9%. The AI generates more code faster, while humans spend more time reviewing it.
Qodo's survey found that 88% of developers have low confidence shipping AI code without review. And here's the twist: junior developers show the lowest quality improvements but the highest confidence in shipping unreviewed, an inverted competence-confidence gap.
Google's 2024 DORA report confirmed it at scale: each 25% increase in AI adoption correlates with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability.
Every major AI coding company built instruction-following systems. CLAUDE.md. .cursorrules. .github/copilot-instructions.md. AGENTS.md. Windsurf rules. Devin knowledge bases. The proliferation is itself an admission that base models do not follow project conventions, and GitHub's Copilot docs say it outright: they recommend accepting that variability is normal.
The most significant response was AGENTS.md, a cross-tool standard contributed to the Linux Foundation in late 2025. Over 60,000 repositories use it, and competing companies co-founding a foundation to standardize instruction files tells you how universal the problem is. But standardizing the format doesn't solve compliance. It just ensures every tool ignores the same file consistently.
The developers who made real progress moved past prompt engineering entirely: Claude Code Hooks that enforce rules via code, linter ratchets in CI, frequent session restarts. Rules in prompts are requests. Hooks in code are laws.
I understand why this is happening. A year ago every marketing deck promised AGI, and that didn't sell. So now the pitch is autonomous agents that work without human involvement. Codex runs for 999 hours unsupervised. Claude Code gets "autonomous mode." Devin promises to close tickets while you sleep. For that story to work, models need to be creative. They need to improvise and find workarounds when they hit obstacles.
That's exactly the opposite of what I need.
In my reality, I control the process from start to finish. I write the spec, define the types, and decide the architecture. The model executes. If it hits a wall, it stops and asks. It does not invent a refetch workaround that wasn't in the plan. It does not cast types to make the compiler shut up. It does not get creative with my production code.
The marketing wants you to trust AI with creative decisions. But if a model can't follow the three rules you wrote in a markdown file, how can you trust it with decisions you didn't write down?
So the real variable here comes down to discipline, not the AI itself. That was true with 212 sessions, and it's still true thousands of sessions later. The models got smarter. They did not get more obedient.
Check your git log. Count the type casts. Count the files that got changed without being mentioned in the prompt. Then decide whether you need a more creative model or a more disciplined one.
I went with disciplined. It's the only thing that works.
The numbers in this post, 19% slower delivery, 91% more review time, 9% more bugs per developer, are industry averages across thousands of engineers. Your team is different, and "discipline over creativity" only works when you can actually measure what AI is doing to your throughput, your budget, and your architecture decisions.
Our free AI Strategy Toolkit gives you three tools to get that picture:
No 47-page ebook. Just three ready-to-use tools that go from bottleneck identification to financial model to technical handoff.
