Stack Truth · Jun 2026 · 9 min

Claude Opus 4.8 vs Gemini 3.1 Pro for agents: real numbers

Every model launch comes with a chart where the new model wins. This isn't that. It's how we read the public numbers when we're choosing a model to build an agent on — what Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 actually do well, and where the benchmark stops being useful.

The question

"Which model is best for agents" is the wrong question. The right one is "best at what, at what cost, with what failure mode." An agent calling tools across a twenty-step task puts completely different pressure on a model than a single clever answer does — and the model that tops the launch chart isn't automatically the one that finishes the task without quietly going off the rails.

How we read benchmarks

One rule before any number: a public benchmark is directional, not a guarantee. The headline coding score, SWE-bench Verified, comes from an older benchmark that is partly contaminated and Python-heavy. It tells you a model is capable; it doesn't tell you the model is reliable on your repo or your workflow. We weight agentic and terminal benchmarks higher than chat benchmarks, and we treat everything below as a starting hypothesis to test on the actual task, not a ranking to obey.

The numbers

Benchmark / spec	Claude Opus 4.8	Gemini 3.1 Pro	GPT-5.5
Intelligence Index (overall)	61 — #1	57	59–60
Terminal-Bench 2.0 (agentic coding)	69.4%*	68.5%	82.7%
SWE-bench Verified (coding)	88.6%	—	—
GPQA Diamond (reasoning)	—	94.3%	—
Context window	1M	~1M	256K
API price (per M in / out)	$5 / $25	—	—

Compiled from public leaderboards and vendor cards (Artificial Analysis Intelligence Index; Scale SWE-bench; Terminal-Bench; provider docs); see Sources. Cells marked “—” had no clean public number at the time of writing. *The Terminal-Bench 2.0 figure was reported for Opus 4.7, not measured on Opus 4.8, which supersedes 4.7 at the same price tier. All figures are directional; none are our own benchmark runs.

What the numbers don't tell you

Read the table and three things stand out — none of which is "pick the highest row."

Headline scores collapse on real agent tasks. On tool-agent-user benchmarks like τ-bench (agents that change a flight, resolve a ticket, follow a policy across turns), models score dramatically lower than on the coding numbers above, and the rankings shuffle. A model that resolves 88% of curated coding issues can still fail the majority of multi-step tool tasks. The agentic benchmark, not the coding one, is the closer proxy for whether your agent survives production.

The failure modes differ. In practice, Opus tends toward more careful output and better handling of edge cases and uncertainty; GPT-5.5 is faster and completes equivalent tasks in fewer tool calls, which matters when latency and per-run cost dominate; Gemini 3.1 Pro leads on the hardest reasoning tasks and brings native, reliable Google Search grounding for factual work.

Cost is a function of behaviour, not just the price card. A cheaper per-token model that makes twice the tool calls to finish a task isn't cheaper. Measure cost per completed task, not per token.
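To make the arithmetic concrete, here is a minimal sketch of cost per completed task. Only the $5 / $25 price comes from the table above; the `Run` record, the second price card, and every token count and completion flag are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent run on one task (hypothetical log record)."""
    input_tokens: int
    output_tokens: int
    completed: bool


def cost_per_completed_task(runs, usd_per_m_in, usd_per_m_out):
    """Total spend across all runs divided by the number of completed tasks."""
    total_usd = sum(
        (r.input_tokens * usd_per_m_in + r.output_tokens * usd_per_m_out) / 1_000_000
        for r in runs
    )
    completed = sum(1 for r in runs if r.completed)
    # Failed runs still cost money, so they stay in the numerator.
    return total_usd / completed if completed else float("inf")


# Model A: priced at $5 / $25 per million tokens (the Opus 4.8 row in the table).
# Token counts and completion flags are invented for this example.
model_a = [Run(200_000, 20_000, True)] * 3 + [Run(200_000, 20_000, False)]

# Model B: a hypothetical cheaper price card ($2 / $10), but more tool calls
# mean more tokens per run and, in this invented log, more failed runs.
model_b = [Run(500_000, 50_000, True)] * 2 + [Run(500_000, 50_000, False)] * 2

a = cost_per_completed_task(model_a, 5, 25)
b = cost_per_completed_task(model_b, 2, 10)
print(f"A: ${a:.2f} per completed task")  # A: $2.00 per completed task
print(f"B: ${b:.2f} per completed task")  # B: $3.00 per completed task
```

In this invented log, model B's per-token price is less than half of model A's, yet each finished task costs 50% more because B burns more tokens per run and finishes fewer of them.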

How we actually choose

Long-running, high-stakes agentic work (code, multi-step automation): Opus 4.8 is our default. It currently leads the overall Intelligence Index (61 in the table above) and is built for long-horizon tasks where reviewability matters.
High-volume, latency-sensitive pipelines: GPT-5.5 earns a look — fewer tool calls and lower operating cost at scale.
Reasoning-heavy or search-grounded tasks: Gemini 3.1 Pro, especially where native Google grounding removes a retrieval layer you'd otherwise build.
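The three rules above can be written down as a tiny router. This is a sketch, not our production code: the `Task` flags, the `pick_model` function, the precedence between the rules, and the model label strings are all assumptions for illustration (the labels are not real API model IDs).

```python
from dataclasses import dataclass

# Hypothetical labels, not real API model identifiers.
OPUS = "claude-opus-4.8"
GPT = "gpt-5.5"
GEMINI = "gemini-3.1-pro"


@dataclass(frozen=True)
class Task:
    """Coarse task traits used for routing (hypothetical schema)."""
    high_stakes: bool = False
    latency_sensitive: bool = False
    reasoning_heavy: bool = False
    needs_search_grounding: bool = False


def pick_model(task: Task) -> str:
    """Map a task to a starting model; the order of checks is an assumption."""
    # Reasoning-heavy or search-grounded work goes to Gemini 3.1 Pro.
    if task.reasoning_heavy or task.needs_search_grounding:
        return GEMINI
    # High-volume, latency-sensitive pipelines go to GPT-5.5,
    # unless the work is also high-stakes.
    if task.latency_sensitive and not task.high_stakes:
        return GPT
    # Everything else, including long-running high-stakes work, defaults to Opus 4.8.
    return OPUS


print(pick_model(Task(high_stakes=True)))             # claude-opus-4.8
print(pick_model(Task(latency_sensitive=True)))       # gpt-5.5
print(pick_model(Task(needs_search_grounding=True)))  # gemini-3.1-pro
```

A router like this only encodes the starting hypothesis; the choice for each route still has to be confirmed by measuring cost per completed task on your own eval set.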

One more for the record: as of June 9, Anthropic's Claude Fable 5 (95.0% SWE-bench Verified, $10/$25-tier output) sits above Opus 4.8 as the new ceiling — worth knowing if you're choosing a model to commit to for the next two quarters.

The verdict

There is no single "best agent model," and any article that gives you one is selling the chart, not the truth. Opus 4.8 is the safest default for serious agentic builds today, GPT-5.5 wins on speed and volume economics, and Gemini 3.1 Pro wins on reasoning and grounding. The real work isn't picking the winner — it's routing the right model to the right job and measuring cost per completed task.

Benchmarks tell you which model can. Your eval set tells you which model does.

Sources
