How to Prompt Newer AI Models in 2026

A lot of people think AI models like GPT-5.5 and Claude Opus 4.7 aren’t getting any smarter, or that they’re getting dumber. The actual issue is almost always how they’re being prompted. To prompt newer AI models well, define the desired outcome with clear success criteria, then let the model figure out the path. The old approach of walking the model through every step now actively hurts performance, because frontier models follow instructions more literally and need less procedural hand-holding. The shift is from recipe-writing to contract-writing.

At TJ Digital, where we run these models across content production for roughly 40 clients, we’ve spent the last year rebuilding our entire prompt stack around this idea. The performance gap between an old prompt and a new one on the same model is bigger than most people realize. If you’re still using prompts you wrote a year or two ago, you’re probably hamstringing the new generation of models. All the legacy scaffolding you added to coax behavior out of GPT-4 or Claude 3 is now constraining frontier models that would have done the work better on their own.

Why Old Prompts Make Newer Models Look Worse

Talking to a more capable model is a lot like talking to a more capable person. When I go back and look at the prompt templates I wrote two years ago, they’re painfully specific. I was giving the model exact steps to take and almost deterministic guidelines to follow, because I couldn’t trust the older models to make judgment calls. If I run those same prompts on today’s models, the results are only marginally better, because the model mostly follows the old process instead of using its actual capabilities.

The current public guidance from both OpenAI and Anthropic confirms this directly. OpenAI’s GPT-5.5 documentation says shorter, outcome-first prompts usually work better than process-heavy prompt stacks, and warns that legacy prompts add noise, constrain exploration, and make answers feel mechanical. Anthropic’s latest Claude guidance makes the same point and adds an important detail: newer Claude models follow instructions more literally, so old prompt baggage is more likely to be obeyed exactly rather than gently ignored.

OpenAI’s GPT-5 launch data made this concrete. GPT-5 reached 74.9% on SWE-bench Verified versus o3’s 69.1%, while using 22% fewer output tokens and 45% fewer tool calls. OpenAI explicitly noted that GPT-5 was run with a short prompt emphasizing thorough verification, and that the same prompt did not improve o3.

Anthropic’s April 2026 postmortem made the same lesson visible from the other side. Changing Claude Code’s default reasoning effort from high to medium made users report that the model felt less intelligent, and a system-prompt addition that imposed tight word limits between tool calls caused a measurable 3% drop on a broader evaluation suite. The product-layer prompting was suppressing useful work. The underlying model capability hadn’t changed.

There’s also research backing this up. A Princeton/NYU study found that inference-time reasoning can reduce performance on some tasks, with cases of up to a 36.3-point absolute accuracy loss. OpenAI’s reasoning best-practices page now says that instructing reasoning models to “think step by step” may be unnecessary and can sometimes hurt performance.

Deterministic Prompting vs. Outcome-Based Prompting

The clearest way to think about this shift is the contrast between deterministic and outcome-based prompting.

Deterministic prompting tries to control the route. It specifies ordered steps, intermediate reasoning behavior, exact sequences, and visible workflows. OpenAI’s GPT-4.1 prompting guide is a good artifact of that era. It explicitly recommends structures like “Reasoning Steps” and “Final instructions and prompt to think step by step,” and OpenAI reported that those scaffolds improved internal SWE-bench scores by close to 20% on GPT-4.1. That made sense for a highly capable but non-reasoning model that benefited from externally supplied structure.

Outcome-based prompting controls the destination. The prompt defines the target outcome, success criteria, constraints, and available context, then lets the model choose the path. Anthropic’s guidance puts it bluntly: prefer general instructions over prescriptive steps, because Claude’s reasoning frequently exceeds what a human would prescribe. The prompt becomes less like a script and more like a brief plus an acceptance test.

Here’s how the two approaches compare:

Element   | Deterministic Prompting            | Outcome-Based Prompting
Focus     | The route                          | The destination
Format    | Ordered steps and workflows        | Goals, constraints, success criteria
Method    | Prescribed for the model           | Chosen by the model
Best fit  | GPT-4-era models, simpler models   | Frontier models (GPT-5.5, Opus 4.7)
Main risk | Constrains capable models          | Underspecification (vagueness)

Outcome-based prompting still requires precision, though. A 2026 study on prompt sensitivity found that a significant share of observed prompt brittleness comes from underspecification. Prompts that lack specificity show higher performance variance. The lesson is to be precise about goals, standards, and constraints, while staying less prescriptive about internal method.
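To make the contrast concrete, here is a rough sketch of the same task written both ways. Both prompts are hypothetical illustrations I wrote for this example, not text from either provider’s documentation, and the refund scenario and the `scripted_steps` helper are assumptions.

```python
# Hypothetical prompts for illustration only; neither comes from provider docs.

DETERMINISTIC_PROMPT = """\
Step 1: Read the refund policy.
Step 2: Read the account data.
Step 3: List every policy clause that might apply.
Step 4: Think step by step about each clause.
Step 5: Write the decision.
"""

OUTCOME_PROMPT = """\
Goal: decide whether this account is eligible for a refund.
Success means: the decision follows the supplied policy and account data,
and cites the clause it relies on.
Constraints: use only the supplied documents.
If a required field is missing, ask for that one field and stop.
"""


def scripted_steps(prompt: str) -> int:
    """Count lines that dictate an ordered step (a crude signal of route control)."""
    return sum(1 for line in prompt.splitlines() if line.startswith("Step "))


print(scripted_steps(DETERMINISTIC_PROMPT), scripted_steps(OUTCOME_PROMPT))
```

The second prompt is not shorter on detail. It is specific about the goal, the evidence, and the stop rule, and silent about the route.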

How to Write a Definition of Success

If old prompts were procedure-heavy, modern prompts are success-criteria-heavy. Anthropic’s evaluation guidance recommends success criteria that are specific, measurable, achievable, and relevant. Most real applications need multidimensional evaluation rather than a single metric.

In practice, a strong definition of success for a frontier model usually includes five elements:

  1. The external goal. What must be true at the end?
  2. The binding constraints. What boundaries can’t be crossed?
  3. The required evidence standard. What proof is needed before the model can claim something is done?
  4. The output contract. What shape proves the work was completed?
  5. The stopping rule. When should the model stop, continue, or ask for missing input?
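One way to keep those five elements from getting lost is to treat them as required fields. The sketch below is a minimal illustration, assuming a hypothetical `SuccessContract` class and a made-up refund scenario. The field names are my own, not a provider format.

```python
from dataclasses import dataclass


@dataclass
class SuccessContract:
    """Hypothetical container for the five elements listed above."""
    goal: str
    constraints: list[str]
    evidence_standard: str
    output_contract: str
    stopping_rule: str

    def render(self) -> str:
        """Render the contract as a plain-text prompt section."""
        constraint_lines = "\n".join(f"- {c}" for c in self.constraints)
        return (
            f"Goal: {self.goal}\n"
            f"Constraints:\n{constraint_lines}\n"
            f"Evidence required: {self.evidence_standard}\n"
            f"Output: {self.output_contract}\n"
            f"Stop rule: {self.stopping_rule}"
        )


# Illustrative values only.
contract = SuccessContract(
    goal="Decide whether the account is eligible for a refund.",
    constraints=["Use only the supplied policy and account data."],
    evidence_standard="Cite the policy clause that supports the decision.",
    output_contract="Return decision, reason, and cited_clause fields.",
    stopping_rule="If a required field is missing, ask only for that field.",
)
prompt = contract.render()
print(prompt)
```

Because every field is required, a prompt that skips the evidence standard or the stopping rule fails at construction time instead of failing quietly in production.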

OpenAI’s GPT-5.5 documentation has a useful “prefer this” example that demonstrates the pattern. Instead of micromanaging reasoning, the prompt defines success as making an eligibility decision from policy and account data, completing any allowed action before responding, returning the required output fields, and asking only for the smallest missing field if evidence is incomplete. That’s it. No step-by-step procedure. The model figures out how.

For agentic systems, this gets even more important. Anthropic’s evals guide draws a sharp distinction between the transcript and the outcome. An agent saying “your flight has been booked” is not the same thing as a reservation actually existing in the database. Your prompt should define what counts as completion in the world or in the artifact, not just what the final paragraph should sound like.
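The transcript-versus-outcome distinction is easy to show in code. This is a minimal sketch, assuming an in-memory dictionary as a stand-in for a real reservations database. The function names, user ID, and flight number are all hypothetical.

```python
def transcript_claims_success(transcript: str) -> bool:
    """Weak check: only looks at what the agent said."""
    return "has been booked" in transcript.lower()


def outcome_succeeded(reservations: dict[str, dict], user_id: str, flight: str) -> bool:
    """Stronger check: looks at the state the agent was supposed to change."""
    record = reservations.get(user_id)
    return record is not None and record.get("flight") == flight


# Hypothetical in-memory stand-in for a reservations database.
reservations: dict[str, dict] = {}
transcript = "Your flight has been booked."

said_done = transcript_claims_success(transcript)
really_done = outcome_succeeded(reservations, "user-1", "UA100")
print(said_done, really_done)
```

Here the agent says the right words while the reservation store is still empty, which is exactly the gap a transcript-only check would miss.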

When to Keep Steps and Phases

Yes, you should still include steps and phases, but only when the path itself matters.

OpenAI’s current guidance is explicit: full execution order should be specified when tool use or side effects matter, and prompts for smaller or more literal models often need more exact structure. It also advises adding dependency checks, completeness contracts, and verification loops before simply cranking up reasoning effort.

This is where phases remain useful. In long-running or tool-heavy flows, both OpenAI and Anthropic recommend explicit phase handling, dependency-aware retrieval, and verification before advancing. Anthropic specifically recommends prompt chaining for complex tasks, because splitting a hard job into smaller subtasks improves consistency by giving the model one bounded problem at a time.

Modern phased prompting works at the level of external milestones. Walk the model through the dependencies, require verification before advancing, and keep the user informed at the right moments. The internal cognition is the model’s problem, not yours.

If the path doesn’t matter, don’t over-specify it. That’s the balance. Preserve structure where the outside world demands it, and stop trying to hand-write the model’s internal cognition.
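In an orchestration layer, phases at the level of external milestones can look roughly like this. It is a sketch under assumptions: `run_phases`, the two phase functions, and the order data are hypothetical, and in a real system each action would wrap a model or tool call.

```python
from typing import Callable

# (name, action, verification check) for each external milestone.
Phase = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


def run_phases(phases: list[Phase], state: dict) -> dict:
    """Run each phase in order; refuse to advance if its check fails."""
    for name, act, verify in phases:
        state = act(state)
        if not verify(state):
            raise RuntimeError(f"phase '{name}' failed verification")
        state.setdefault("completed", []).append(name)
    return state


def fetch_order(state: dict) -> dict:
    """Hypothetical lookup; a real version would call a tool."""
    return {**state, "order": {"id": 42, "paid": True}}


def issue_refund(state: dict) -> dict:
    """Hypothetical side effect that depends on the fetched order."""
    return {**state, "refund_id": f"r-{state['order']['id']}"}


phases: list[Phase] = [
    ("fetch_order", fetch_order, lambda s: s.get("order", {}).get("paid") is True),
    ("issue_refund", issue_refund, lambda s: "refund_id" in s),
]
final_state = run_phases(phases, {})
print(final_state["completed"])
```

The code fixes the order of side effects and the checks between them. It says nothing about how each step reasons internally, which is the split this section argues for.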

Teaching the Model to Avoid Common Mistakes

I still include a “common mistakes” section in most of my prompts. The way it works has changed.

The most effective way to teach a capable model to avoid failures is to give it decision rules and failure checks rather than piling on more bans. OpenAI’s GPT-5.5 guidance warns against unnecessary absolute rules like ALWAYS, NEVER, must, and only, except for true invariants like safety constraints or required fields. For judgment calls (when to search, when to ask, when to proceed, when to stop), use decision rules instead.

Anthropic adds another important lesson: positive steering is usually better than negative steering. Showing the model what good output looks like beats listing things it shouldn’t do. A “common mistakes” section becomes more effective when it’s structured as contrastive examples: here’s the wrong shape, here’s the right shape, here’s the edge case.
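A contrastive “common mistakes” section can be generated from a small list of examples. The sketch below assumes a hypothetical `render_contrastive` helper and made-up example content. The three labels are my own choice of wording, not a provider convention.

```python
def render_contrastive(examples: list[dict]) -> str:
    """Render each example as a situation, the wrong shape, and the right shape."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Situation: {ex['situation']}\n"
            f"Avoid: {ex['wrong']}\n"
            f"Prefer: {ex['right']}"
        )
    return "\n\n".join(blocks)


# Illustrative content only.
examples = [
    {
        "situation": "The source does not state a launch date.",
        "wrong": "The product launched in early 2024.",
        "right": "The source does not give a launch date.",
    },
    {
        "situation": "The user asks for a summary of a long report.",
        "wrong": "A paragraph restating the introduction.",
        "right": "Three findings, each tied to a section of the report.",
    },
]
section = render_contrastive(examples)
print(section)
```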

For factual work, the highest-leverage anti-error pattern is demanding auditable grounding. Anthropic recommends direct quotes for grounding on long-document tasks, citations for claims, and an explicit rule that if the model can’t find supporting text, it must retract the claim. Those heuristics catch the kinds of mistakes that “be careful” never will.
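The retract-if-unsupported rule can also be enforced outside the prompt. This is a minimal sketch, assuming the model returns each claim with a supporting quote in a hypothetical `{"text": ..., "quote": ...}` shape. It only checks for verbatim matches after whitespace normalization, so paraphrased support would be rejected.

```python
def split_grounded(claims: list[dict], source: str) -> tuple[list[dict], list[dict]]:
    """Keep claims whose quote appears verbatim in the source; retract the rest."""
    normalized = " ".join(source.split())
    kept, retracted = [], []
    for claim in claims:
        quote = " ".join(claim.get("quote", "").split())
        if quote and quote in normalized:
            kept.append(claim)
        else:
            retracted.append(claim)
    return kept, retracted


# Illustrative source text and claims.
source = "Revenue grew 12% in Q3.  Headcount was flat."
claims = [
    {"text": "Revenue rose 12%.", "quote": "Revenue grew 12% in Q3."},
    {"text": "Margins improved.", "quote": "Margins improved sharply."},
]
kept, retracted = split_grounded(claims, source)
print(len(kept), len(retracted))
```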

Heuristics and Self-Evaluation

This is the part most people get wrong, and it’s where the biggest performance gains live.

If a model has a clear goal and a reliable heuristic for evaluating its own output, it’s incredibly good at figuring out the best process. Heuristics are the rules of thumb you embed so the model can govern itself without you scripting every move. OpenAI’s current guidance offers several:

  • Retrieval budgets that define when enough evidence is enough
  • Stopping conditions that ask whether the model can answer the user’s core request
  • Rules to use the minimum evidence sufficient to answer correctly
  • Nudges not to stop at the first plausible answer
  • Instructions to perform at least one verification step on safety- or accuracy-critical tasks

OpenAI’s troubleshooting guide makes the same point: overthinking usually comes from oversized reasoning effort, conflicting guidance, or no clear definition of done. The fix is an explicit stop condition and a single fast self-check. For underthinking, prompting the model to construct and apply an internal rubric has been “surprisingly effective” on coding tasks.
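A retrieval budget plus a stop condition is simple to express as a loop. The sketch below is illustrative only: `gather_evidence`, the tiny `CORPUS`, and `fake_search` are hypothetical stand-ins for whatever search tool you actually use.

```python
from typing import Callable


def gather_evidence(
    search: Callable[[str], list[str]],
    queries: list[str],
    is_sufficient: Callable[[list[str]], bool],
    max_calls: int = 3,
) -> tuple[list[str], str]:
    """Search until the evidence is sufficient or the call budget runs out."""
    evidence: list[str] = []
    for query in queries[:max_calls]:
        evidence.extend(search(query))
        if is_sufficient(evidence):
            return evidence, "sufficient"
    return evidence, "stopped_without_enough_evidence"


# Hypothetical corpus standing in for a real search tool.
CORPUS = {
    "refund window": ["Refunds are allowed within 30 days of purchase."],
    "shipping time": ["Orders ship within 2 business days."],
}


def fake_search(query: str) -> list[str]:
    return CORPUS.get(query, [])


evidence, reason = gather_evidence(
    fake_search,
    ["refund window", "shipping time"],
    is_sufficient=lambda found: any("Refunds" in text for text in found),
)
print(reason, len(evidence))
```

The loop stops after the first query because the core question is already answerable, which is the “minimum evidence sufficient” rule in action.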

The research backs this strongly. The Self-Refine paper reported that iterative self-feedback improved performance by about 20 percentage points on average across seven tasks. Reflexion reported 91% pass@1 on HumanEval versus 80% for the prior GPT-4 state of the art. Self-evaluation works.

It has to attach to something concrete, though. A vague instruction like “check your work” rarely changes behavior. A specific instruction like “before finalizing, verify every numerical claim is grounded in context. If any are not, revise or qualify them” actually does. The difference is having a real rubric.
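The numerical-claim check is concrete enough to automate as a backstop. This is a crude sketch, assuming that “grounded” means the exact number string appears somewhere in the context. The `ungrounded_numbers` helper and the sample text are hypothetical, and a regex like this will miss reformatted figures such as “4,500” versus “4500”.

```python
import re

# Matches integers and decimals, with an optional trailing percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def ungrounded_numbers(draft: str, context: str) -> list[str]:
    """Return numbers in the draft that never appear in the context."""
    known = set(NUMBER.findall(context))
    return [n for n in NUMBER.findall(draft) if n not in known]


# Illustrative text only.
context = "Signups rose 12% to 4500 in March."
draft = "Signups rose 12% to 4500, a 30% jump over plan."

flagged = ungrounded_numbers(draft, context)
print(flagged)
```

Anything that lands in `flagged` is a claim to revise or qualify before the answer ships.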

Putting Outcome-Based Prompting Into Practice

Across our SEO and content work at TJ Digital, the prompt stack we use today looks almost nothing like the prompts we shipped two years ago. The old prompts have evolved into Claude skills, where the structural scaffolding lives in reusable assets and the prompt itself focuses on success criteria for the specific deliverable. The rigid step-by-step guidelines have evolved into rigorous definitions of what success looks like for each output, plus heuristics the model uses to evaluate its own work.

The operating pattern that holds up across GPT-5.5, Claude Opus 4.7, and the rest of the current frontier:

  • Start zero-shot or with the smallest viable prompt.
  • Add specificity around outcome, constraints, evidence, and format before adding reasoning effort.
  • Use provider-native mechanisms like structured outputs, tool descriptions, and evals rather than stuffing every contingency into the prompt.
  • Treat reasoning effort as a tuning knob, not a proxy for quality.
  • Change one thing at a time and validate on evals before and after each adjustment.
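Validating a single prompt change against a fixed set of cases might look like the sketch below. Everything in it is an assumption for illustration: `fake_model` is a deterministic stub standing in for a real model call, and the two cases are toy checks, not a real eval suite.

```python
from typing import Callable

Case = tuple[str, Callable[[str], bool]]


def pass_rate(run: Callable[[str, str], str], prompt: str, cases: list[Case]) -> float:
    """Share of cases whose output passes its check under the given prompt."""
    passed = sum(1 for user_input, check in cases if check(run(prompt, user_input)))
    return passed / len(cases)


def fake_model(prompt: str, user_input: str) -> str:
    """Deterministic stub; swap in your provider's client in real use."""
    if "JSON" in prompt:
        return '{"answer": "' + user_input.upper() + '"}'
    return user_input.upper()


cases: list[Case] = [
    ("a", lambda out: out.startswith("{")),  # output must be a JSON object
    ("b", lambda out: "B" in out),           # output must contain the answer
]

baseline = pass_rate(fake_model, "Answer the question.", cases)
candidate = pass_rate(fake_model, "Answer the question. Return JSON.", cases)
print(baseline, candidate)
```

The two prompts differ by exactly one instruction, so any change in pass rate can be attributed to it.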

OpenAI specifically recommends that GPT-5.5 migrations begin with a fresh baseline rather than carrying over every instruction from the older prompt stack. Start with the smallest prompt that still preserves the product contract, then add only what evaluation results justify.

The Bottom Line on Prompting Newer AI Models

If you only remember one thing about prompting frontier models, make it this: be stricter about the quality bar and looser about the method.

Leave the model freedom on how to solve the task. Be uncompromising about what counts as success, what evidence is required, what common mistakes have to be avoided, and what self-checks must pass before the answer is done. That’s the center of gravity in the current OpenAI and Anthropic guidance, and it’s the cleanest answer to how prompting strategy needs to change for newer, more capable models.

If you’re trying to get more out of the AI tools your team already pays for, this is the place to start. At TJ Digital, we build prompt stacks and skills for clients that match how the new generation of models actually thinks, and the gains tend to show up immediately on the same content the team was already producing. If you want to talk through what that looks like for your business, get in touch.