TL;DR. ITBench-AA — the first agentic enterprise IT benchmark, published May 27, 2026 by IBM Research and Artificial Analysis — shows Claude Opus 4.7 at 47% and GPT-5.5 at 46% on live Kubernetes SRE tasks. Every model on the leaderboard fails more than half the time. Cost per task ranges from $0.14 to $5.38, making cost and turn efficiency as decisive as raw score for vendor selection.
Context: a benchmark that forces a reassessment
On May 27, 2026, IBM Research and Artificial Analysis published ITBench-AA on Hugging Face — the first benchmark built specifically to evaluate AI agents on enterprise-grade IT operations. The dataset comprises 59 SRE (Site Reliability Engineering) tasks centered on Kubernetes incident diagnosis: infrastructure failures, application outages, resource quota exhaustion, rollout failures, and network partitions.
Scoring is unforgiving, per the published methodology: an agent must identify the minimal set of independent root causes. Missing any ground-truth root cause scores 0.0; including a false positive reduces precision. That strictness is what makes the headline number worth taking seriously — not a single frontier or open-weight model in the field clears 50%.
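To make the scoring rule concrete, here is a minimal sketch in Python. It is not the benchmark's actual scorer: it assumes the score equals precision whenever every ground-truth root cause has been found, and the function name and root-cause labels are hypothetical.

```python
def score_diagnosis(predicted: set[str], ground_truth: set[str]) -> float:
    """Illustrative scorer: 0.0 on any missed root cause, otherwise precision.

    Assumption: the score equals precision once recall is complete.
    The article does not give the benchmark's exact formula.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one root cause")
    # Missing any ground-truth root cause zeroes the score.
    if not ground_truth <= predicted:
        return 0.0
    # All true causes were found, so true positives == len(ground_truth).
    # Every extra (false-positive) cause lowers precision.
    return len(ground_truth) / len(predicted)


# Hypothetical root-cause labels, for illustration only.
truth = {"quota-exhausted"}
exact = score_diagnosis({"quota-exhausted"}, truth)
padded = score_diagnosis({"quota-exhausted", "dns-failure"}, truth)
missed = score_diagnosis({"dns-failure"}, truth)
```

Under these assumptions, an agent that pads its answer with extra guesses is penalised, and an agent that misses the real cause gets nothing, which is why guessing broadly does not help.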
Where Claude holds the lead — and its binding constraint
According to the ITBench-AA leaderboard, Claude Opus 4.7 in Adaptive Reasoning, Max Effort mode scores 47% — the highest result published to date. That is 1 point above GPT-5.5, 7 points above Gemini 3.5 Flash, and 17 points above Gemini 3.1 Pro Preview.
The binding constraint is documented in the same benchmark: Claude Opus 4.7 is the most expensive model on the leaderboard, at $5.38 per task. For an SRE team handling hundreds of incidents per week, that unit cost is an architectural variable, not a billing footnote.
Where GPT-5.5, Gemini, and open-weight models still hold the line
GPT-5.5 at xhigh scores 46%, 1 point behind Claude, but with an execution efficiency the benchmark makes explicit: an average of 31 turns per task. Gemini 3.1 Pro Preview, by contrast, consumes 83 turns to score only 30%. That is roughly 2.7 times as many turns for 16 fewer accuracy points, a gap that shows up in production as API cost and real-time latency, not merely as a statistical footnote.
Gemini 3.5 Flash lands at 40% for $1.70 per task, a considerably better cost-to-score ratio than Gemini 3.1 Pro Preview at $2.23 for 30%. Qwen3.7 Max scores 42%, placing it behind the two leading frontier models and ahead of Gemini 3.5 Flash.
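The turn comparison is simple arithmetic, sketched here in Python so it can be rerun against other leaderboard rows. The scores and turn counts are the ones quoted above; the dictionary layout and function name are illustrative assumptions.

```python
# Figures as quoted in this article for the two models being compared.
RESULTS = {
    "GPT-5.5 (xhigh)": {"score_pct": 46, "avg_turns": 31},
    "Gemini 3.1 Pro Preview": {"score_pct": 30, "avg_turns": 83},
}


def compare_turns(slower: str, faster: str) -> tuple[float, int]:
    """Return (turn ratio, score gap in points) of `slower` relative to `faster`."""
    a, b = RESULTS[slower], RESULTS[faster]
    turn_ratio = a["avg_turns"] / b["avg_turns"]
    score_gap = b["score_pct"] - a["score_pct"]
    return turn_ratio, score_gap


ratio, gap = compare_turns("Gemini 3.1 Pro Preview", "GPT-5.5 (xhigh)")
print(f"{ratio:.1f}x the turns for {gap} fewer points")
```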
Among open-weight models, GLM-5.1 (Reasoning) reaches 40% at $1.23 per task. DeepSeek V4 Pro (Reasoning) scores 38%. Gemma 4 31B (Reasoning) closes the open-weight bracket at 37% for $0.14 per task, roughly one thirty-eighth of the cost of Claude Opus 4.7, per the data published by IBM Research and Artificial Analysis. Notably, Gemma 4 31B beats Gemini 3.1 Pro Preview on both score (37% vs. 30%) and cost ($0.14 vs. $2.23 per task).
Pricing and operational implications
The cost gap between the top-scoring model and the lowest-cost model on the leaderboard is roughly 38x ($5.38 vs. $0.14), according to the published data. For any organisation automating SRE diagnostics at scale, that spread makes the assumption of a single frontier model across all IT agent tasks economically indefensible.
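One way to read score and cost together is cost per correct diagnosis: the per-task cost divided by the success rate. The Python sketch below applies it only to the models for which this article quotes both figures. The metric itself is an assumption introduced for illustration, not a number the benchmark publishes, and it treats every failed attempt as paid for but worthless.

```python
# (score as a fraction, cost per task in USD), as quoted in this article.
LEADERBOARD = {
    "Claude Opus 4.7": (0.47, 5.38),
    "Gemini 3.5 Flash": (0.40, 1.70),
    "Gemini 3.1 Pro Preview": (0.30, 2.23),
    "GLM-5.1 (Reasoning)": (0.40, 1.23),
    "Gemma 4 31B (Reasoning)": (0.37, 0.14),
}


def cost_per_correct(score: float, cost_per_task: float) -> float:
    """Expected spend per correct diagnosis, assuming failures yield no value."""
    if score <= 0:
        raise ValueError("score must be positive")
    return cost_per_task / score


# Rank models from cheapest to most expensive per correct diagnosis.
ranking = sorted(LEADERBOARD, key=lambda m: cost_per_correct(*LEADERBOARD[m]))

# Raw cost spread between the top scorer and the cheapest model.
spread = LEADERBOARD["Claude Opus 4.7"][1] / LEADERBOARD["Gemma 4 31B (Reasoning)"][1]

for model in ranking:
    print(f"{model}: ${cost_per_correct(*LEADERBOARD[model]):.2f} per correct diagnosis")
```

This view ignores what a wrong diagnosis costs downstream, so it is best used to shortlist candidates for a pilot, not as a final ranking.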
Turn count is a second cost axis that model comparison reports routinely omit. An agent averaging 83 turns per task introduces latency that is structurally incompatible with real-time SRE alerting. GPT-5.5's 31-turn average delivers an operational advantage that the 1-point score delta versus Claude does not begin to capture. Execution cadence is a performance dimension in its own right.
What this means for a multi-model architecture
The joint reading of scores, costs, and turn counts points toward a functional segmentation. High-criticality, low-frequency incidents — network partitions, security diagnostics, complex rollout failures — justify Claude Opus 4.7 or GPT-5.5 despite their cost. High-volume, recurring SRE work — quota monitoring, standard application alerts, routine diagnostics — can be routed toward Gemma 4 31B or GLM-5.1, with a cost-performance ratio documented in the benchmark itself.
A single-model architecture covering the full enterprise IT agent perimeter is no longer defensible on these figures. Routing by incident criticality and type becomes a first-class architectural decision, not an optimisation to revisit later.
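A routing layer of this kind can start very small. The Python sketch below is a hypothetical illustration, not a reference design: the incident kinds, the criticality labels and the model identifier strings are all assumed names, not real API model IDs.

```python
from dataclasses import dataclass

# Hypothetical labels for the model tiers discussed above.
FRONTIER_MODEL = "claude-opus-4.7"
LOW_COST_MODEL = "gemma-4-31b"

# Incident kinds treated as high-criticality in this sketch (assumed names).
HIGH_CRITICALITY_KINDS = {
    "network_partition",
    "security_diagnostic",
    "complex_rollout_failure",
}


@dataclass(frozen=True)
class Incident:
    kind: str
    criticality: str  # "high" or "low", set by the alerting pipeline


def choose_model(incident: Incident) -> str:
    """Send high-criticality incidents to the frontier tier, the rest to the low-cost tier."""
    if incident.criticality == "high" or incident.kind in HIGH_CRITICALITY_KINDS:
        return FRONTIER_MODEL
    return LOW_COST_MODEL


routed_quota = choose_model(Incident(kind="quota_alert", criticality="low"))
routed_partition = choose_model(Incident(kind="network_partition", criticality="low"))
```

Keeping the rule set in plain data means the routing policy can be reviewed and changed without touching the agent code.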
Three levers to activate this week
- Review the ITBench-AA leaderboard on artificialanalysis.ai before any model vendor decision for agentic IT use cases — score, cost-per-task, and turn-count data are public and directly comparable.
- Instrument turn count in current SRE agent deployments, not just success rate. A 2.7x gap in turns between models translates to real API cost and latency differences in production.
- Run a Gemma 4 31B pilot on high-volume SRE tasks before automatically renewing a frontier subscription: at $0.14 per task, the financial risk of the experiment is low, and the reference data to evaluate it already exists in the benchmark.
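For the second lever, turn-count instrumentation can be as light as the Python sketch below. The class and method names are hypothetical, and it assumes the agent framework in use can report how many turns each task took.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AgentRunStats:
    """Collects per-task turn counts alongside success flags."""

    turns: list[int] = field(default_factory=list)
    successes: list[bool] = field(default_factory=list)

    def record(self, turn_count: int, success: bool) -> None:
        """Store the outcome of one completed agent task."""
        self.turns.append(turn_count)
        self.successes.append(success)

    def mean_turns(self) -> float:
        """Average number of turns per recorded task."""
        return mean(self.turns)

    def success_rate(self) -> float:
        """Fraction of recorded tasks that succeeded."""
        return sum(self.successes) / len(self.successes)


stats = AgentRunStats()
stats.record(turn_count=30, success=True)
stats.record(turn_count=32, success=False)
print(f"mean turns: {stats.mean_turns()}, success rate: {stats.success_rate():.0%}")
```

Tracking the two metrics side by side makes it visible when a model change improves accuracy at the expense of many more turns.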
If the best available model fails more than half the time on autonomous IT diagnosis, where exactly does the non-negotiable boundary with human oversight sit?