Most of the time when someone says AI models are getting worse, they’re remembering the old version as smarter than it actually was. The recent complaints about Anthropic’s Claude 4.7, though, have some real teeth to them.
At TJ Digital, we run every client’s Brand Ambassador through Claude, which is part of how we deliver about four times the output of a traditional agency at the same rates. So when our team started noticing regressions a few weeks after 4.7 launched, we paid attention. Anthropic has since admitted to several bugs and default setting changes that were hurting performance, but there are also deeper design decisions in 4.7 that explain why a lot of us find ourselves reaching for 4.6 for most work.
Why do AI models seem to get worse over time?
Most “the model is getting dumber” complaints don’t hold up to scrutiny. Benchmarks generally show each new release performing better than the last on coding, reasoning, and reading comprehension tests. People remember the standout responses from the old model and forget the misses.
There’s also a real factor that gets ignored. Every new release retunes the model for new priorities: more safety, fewer hallucinations, tighter compliance. That retuning can flatten some of the personality and intuition users liked about the previous version, even when raw capability goes up.
So the perception isn’t always wrong. The capability is still there. It just gets applied differently than it used to.
[Embedded TikTok: @tjrobertson52 on why the Claude complaints are actually valid this time, and why he still reaches for the older model]
Is Claude 4.7 worse than Claude 4.6?
For coding and other tasks where precision matters most, 4.7 is the better model. For writing, strategy, content creation, and anything that requires the model to use judgment, 4.6 is still a better choice for most of us.
Here’s a side-by-side based on what we’ve seen running both models across client work:
| Aspect | Claude 4.6 | Claude 4.7 |
| --- | --- | --- |
| Writing quality | More natural, better tone matching | Flatter, more literal |
| Speed | Faster across most tasks | Around 2.3x slower in independent tests |
| Token usage | Baseline | About 3.6x more tokens for the same outcome |
| Following instructions | Infers intent, applies common sense | Strictly literal |
| Best use case | Strategy, writing, agentic browsing | Complex coding, formal analysis |
The token usage gap is the part that surprised us. Third-party testers running the same coding workloads on both models reported 4.7 consuming roughly 3.6 times more tokens and costing 3.6 times more for the same outcome. That cost difference adds up fast when you’re running hundreds of prompts a day across client work.
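To put numbers on it, here’s a rough back-of-envelope sketch in Python. The per-token price and baseline token count are illustrative assumptions, not real pricing; only the 3.6x multiplier comes from the third-party tests above.

```python
# Back-of-envelope cost comparison. Prices and per-task token counts are
# illustrative assumptions; only the ~3.6x multiplier comes from the
# third-party tests cited above.
PRICE_PER_MTOK = 15.00        # hypothetical price per million output tokens
TOKENS_PER_TASK_46 = 2_000    # hypothetical baseline tokens per task on 4.6
TOKEN_MULTIPLIER_47 = 3.6     # reported token-usage gap for the same outcome

prompts_per_day = 300         # "hundreds of prompts a day"

def monthly_cost(tokens_per_task: float) -> float:
    """Rough monthly spend at 30 days of steady usage."""
    tokens_per_month = tokens_per_task * prompts_per_day * 30
    return tokens_per_month / 1_000_000 * PRICE_PER_MTOK

cost_46 = monthly_cost(TOKENS_PER_TASK_46)
cost_47 = monthly_cost(TOKENS_PER_TASK_46 * TOKEN_MULTIPLIER_47)
print(f"4.6: ${cost_46:,.2f}/mo   4.7: ${cost_47:,.2f}/mo   gap: ${cost_47 - cost_46:,.2f}/mo")
```

With those placeholder numbers, the same workload runs about $270 a month on 4.6 and roughly $970 on 4.7. Swap in your own volumes and pricing; the multiplier is what does the damage.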
What did Anthropic admit about Claude 4.7?
About a week after launch, Anthropic acknowledged a few bugs and default setting changes that were making performance feel off. A bug in Claude Code’s clear_thinking function was causing the model to lose context mid-task, and a new system prompt designed to reduce verbosity was hurting code output. The bug has since been fixed and the prompt change reverted.
The default thinking mode also changed. Older versions let you turn on extended thinking with a fixed token budget. 4.7 replaced that with adaptive thinking, where the model decides for itself how much to think on any given prompt.
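To make that concrete, here’s a minimal sketch of the two modes using the Anthropic Python SDK. The model IDs are illustrative placeholders; the `thinking` parameter follows the SDK’s documented extended-thinking interface.

```python
# A minimal sketch of fixed-budget vs. adaptive thinking, assuming the
# standard Anthropic Messages API. Model IDs are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Older behavior: extended thinking with a token budget you set yourself.
fixed = client.messages.create(
    model="claude-4-6",  # illustrative model ID
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},  # explicit budget
    messages=[{"role": "user", "content": "Draft a positioning statement."}],
)

# 4.7's default: adaptive thinking. No budget parameter; the model decides
# for itself how much to think, and you accept whatever it spends.
adaptive = client.messages.create(
    model="claude-4-7",  # illustrative model ID
    max_tokens=16000,
    messages=[{"role": "user", "content": "Draft a positioning statement."}],
)
```

The practical difference: with a fixed budget, your token spend per prompt is bounded and predictable. With adaptive thinking, it isn’t.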
Anthropic is also currently compute-constrained relative to OpenAI and Google. They’ve been signing infrastructure deals to expand capacity, but the constraint is real, and some have speculated that the company is being conservative with default token usage as a result. The behavior of adaptive thinking suggests there’s something to that.
Why is Claude 4.6 better for writing and strategy?
4.6 has better judgment. That’s the simplest way to put it.
When we give Claude 4.6 a set of brand guidelines for a client, it applies them but still notices when there’s an obvious exception. Claude 4.7 will follow the same guidelines religiously, to the point of abandoning reasonable judgment.
If a brand guideline says “avoid passive voice,” 4.6 understands there are cases where passive voice is the right call. 4.7 will rewrite around it every time.
This matters because most of our work is nebulous. Strategy, content creation, voice matching, recommendations for clients running AI search campaigns: none of these tasks can be itemized into a rigid set of rules. You want a model that understands the intent behind your guidance and can fill in the gaps you didn’t think to spell out.
Anthropic’s own migration guide confirms this design choice. 4.7 will not generalize an instruction from one item to another, and it will not infer requests you didn’t make. That’s a feature for code, and a bug for writing.
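Here’s a hypothetical example of how we adapt to that, using the passive-voice guideline from earlier. The prompt text is illustrative, not our production prompts: the point is that 4.7 needs the judgment you used to leave implicit spelled out.

```python
# Hypothetical prompt adjustment for 4.7's literal instruction-following.
# Guideline and wording are illustrative, not production prompts.

GUIDELINE = "Avoid passive voice."

# What worked on 4.6: state the rule and trust the model to spot exceptions.
system_46 = f"Brand guidelines: {GUIDELINE}"

# What 4.7 needs: make the exception-handling explicit.
system_47 = (
    f"Brand guidelines: {GUIDELINE}\n"
    "Treat guidelines as defaults, not absolutes. If following a rule would "
    "make a sentence awkward or obscure who did what, break the rule and "
    "prefer the clearer sentence."
)

print(system_47)
```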
When is Claude 4.7 the better choice?
For coding and complex reasoning, 4.7 is the right model. Anthropic’s tests showed a 13% improvement over 4.6 on a 93-task coding benchmark, with better edge case handling and faster median latency on the tasks it gets right.
It’s also the right choice when you need the model to follow detailed specifications exactly. If you’re running an agent that has to comply with a strict workflow without improvising, 4.7’s literal interpretation is an asset. The same trait that frustrates writers is what makes it more predictable for engineering teams.
The tradeoff is cost and speed. 4.7 uses more tokens per task and runs slower, so for high-volume work, the bill grows quickly.
Did Anthropic make Claude worse on purpose?
Anthropic has stated directly that they never intentionally degrade their models. I believe them. What’s happening with 4.7 reflects deliberate design choices about how the model handles instructions, which is something you can adapt to once you know what to look for.
The bugs from launch have been fixed. The default settings have been adjusted. The compute constraints will ease as Anthropic’s infrastructure expansion comes online.
What won’t change is the underlying design philosophy of 4.7, which prioritizes literal compliance over inferred intent. That’s the part you have to work around.
For our agency, the answer has been simple. We use 4.7 for coding and skill development, where its rigor pays off, and 4.6 for everything client-facing, where judgment and writing quality matter more. If you’re running AI tools in your business and you’ve felt like things got worse recently, that’s probably the adjustment you need to make.
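As a sketch, that routing rule fits in a few lines. The task labels and model IDs here are illustrative, not our production setup:

```python
# A sketch of the routing rule described above. Task labels and model IDs
# are illustrative placeholders, not a production implementation.
PRECISION_TASKS = {"coding", "code_review", "formal_analysis", "skill_development"}

def pick_model(task_type: str) -> str:
    """Route precision work to 4.7, judgment and writing work to 4.6."""
    return "claude-4-7" if task_type in PRECISION_TASKS else "claude-4-6"

assert pick_model("coding") == "claude-4-7"
assert pick_model("client_newsletter") == "claude-4-6"
```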
Want help building an AI content machine?
We’ve been building AI systems for small and medium-sized businesses for over a year, and we know which models to use for which jobs. If you want help building a content machine that gets your brand recommended by AI, reach out to our team.