What does Matthieu Pesesse do?

Matthieu Pesesse is an independent consultant offering four service lines: AI Automation & Agent Systems, IT Workplace & Infrastructure, Project Management & Service Delivery, and Tech Advisory. The practice is based in Brussels, Belgium, and serves clients across the European Union in English, French, and Dutch.

What is Matthieu Pesesse's background?

Matthieu Pesesse has 25+ years of professional experience across telecom (Proximus), media (Clear Channel, where he raised digital network availability from 75% to 99%), public institutions (European Commission), enterprise IT (Etex Group, with 100% Smart Workplace migration), and healthcare (Anicura, sole IT support for 18 veterinary clinics). He won the European Podcast Award 2009 for Best Business Podcast and holds ITIL V3, Agile Scrum, Microsoft 365, and CrowdStrike Falcon certifications.

What does an engagement with Matthieu Pesesse cost?

Engagement pricing depends on scope. Typical entry points are an AI Discovery Workshop from EUR 2,500, a focused pilot from EUR 8,000-15,000, and ongoing retainers from EUR 4,500/month. Fixed-scope projects are quoted after a Discovery call. Travel within Belgium is included; travel outside Belgium is invoiced at cost.

Who does Matthieu Pesesse typically work with?

Matthieu Pesesse typically works with SMEs and mid-market enterprises (50-2,000 employees) operating in Belgium or across the European Union, in regulated or multilingual environments where AI adoption, IT workplace modernization, or digital transformation is a strategic priority. Engagements range from one-off advisory to multi-month delivery.

What technology stack does Matthieu Pesesse build on?

For AI Automation: OpenClaw multi-agent orchestration, OpenAI and Anthropic APIs, NVIDIA NIM for on-premises GPU inference, Docker, and Nginx. For IT Workplace: Microsoft 365, Microsoft Intune, CrowdStrike Falcon, Zscaler, and Datto. The stack is selected for production reliability rather than novelty.

How long does a typical engagement take?

An AI Discovery Workshop runs 1 to 2 weeks. A pilot or proof-of-concept typically runs 6 to 12 weeks. Workplace modernization or service delivery engagements run 3 to 12 months depending on scope. Tech Advisory retainers are open-ended monthly engagements.

In which languages does Matthieu Pesesse operate?

Matthieu Pesesse works at native level in English, French, and Dutch. This trilingual capability is uncommon among Belgian technology consultancies and matters in Belgium's three-region business landscape (Wallonia, Flanders, Brussels) and for European Union institutions.

ITBench-AA: Claude Tops the Ranking at 47%, GPT-5.5 at 46% — and No Model Clears 50%

TL;DR. ITBench-AA — the first agentic enterprise IT benchmark, published May 27, 2026 by IBM Research and Artificial Analysis — shows Claude Opus 4.7 at 47% and GPT-5.5 at 46% on live Kubernetes SRE tasks. Every model on the leaderboard fails more than half the time. Cost per task ranges from $0.14 to $5.38, making cost and turn efficiency as decisive as raw score for vendor selection.

Context: a benchmark that forces a reassessment

On May 27, 2026, IBM Research and Artificial Analysis published ITBench-AA on Hugging Face — the first benchmark built specifically to evaluate AI agents on enterprise-grade IT operations. The dataset comprises 59 SRE (Site Reliability Engineering) tasks centered on Kubernetes incident diagnosis: infrastructure failures, application outages, resource quota exhaustion, rollout failures, and network partitions.

Scoring is unforgiving, per the published methodology: an agent must identify the minimal set of independent root causes. Missing any ground-truth root cause scores 0.0; including a false positive reduces precision. That strictness is what makes the headline number worth taking seriously — not a single frontier or open-weight model in the field clears 50%.
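The methodology states this rule in words only. A minimal sketch of one way to read it follows, assuming root causes are compared as plain string labels and that a complete answer is scored by its precision. Both points are assumptions, as are the function name and the example labels; the benchmark's actual scorer may differ.

```python
def score_diagnosis(predicted: set[str], ground_truth: set[str]) -> float:
    """Hypothetical scorer: 0.0 if any true root cause is missed, else precision."""
    # Missing even one ground-truth root cause zeroes the task.
    if not predicted or not ground_truth.issubset(predicted):
        return 0.0
    # Every true cause was found; false positives now dilute precision.
    return len(predicted & ground_truth) / len(predicted)


truth = {"quota-exhausted"}                               # illustrative label
print(score_diagnosis({"quota-exhausted"}, truth))         # exact match
print(score_diagnosis({"quota-exhausted", "dns"}, truth))  # one false positive
print(score_diagnosis({"dns"}, truth))                     # root cause missed
```

Under this reading, an agent that lists extra causes to be safe is penalised, and an agent that misses a cause gets nothing, which is why scores cluster below 50%.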

Where Claude holds the lead — and its binding constraint

According to the ITBench-AA leaderboard, Claude Opus 4.7 in Adaptive Reasoning, Max Effort mode scores 47% — the highest result published to date. That is 1 point above GPT-5.5, 7 points above Gemini 3.5 Flash, and 17 points above Gemini 3.1 Pro Preview.

The binding constraint is documented in the same benchmark: Claude Opus 4.7 is the most expensive model on the leaderboard, at $5.38 per task. For an SRE team handling hundreds of incidents per week, that unit cost is an architectural variable, not a billing footnote.
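To make the unit cost concrete, here is a rough projection. It assumes one agent run per incident with no retries, and the weekly volumes are hypothetical placeholders rather than benchmark data.

```python
CLAUDE_OPUS_COST_PER_TASK = 5.38  # USD, figure quoted from the ITBench-AA leaderboard


def weekly_cost(incidents_per_week: int, cost_per_task: float) -> float:
    """Projected weekly spend if every incident triggers exactly one agent run."""
    return incidents_per_week * cost_per_task


for volume in (100, 300, 500):  # hypothetical incident volumes
    spend = weekly_cost(volume, CLAUDE_OPUS_COST_PER_TASK)
    print(f"{volume} incidents/week -> ${spend:,.2f}")
```

Retries, escalations, and multi-agent chains would push the real figure higher, so this is best read as a floor.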

Where GPT-5.5, Gemini, and open-weight models still hold the line

GPT-5.5 at xhigh scores 46%, 1 point behind Claude, and the benchmark makes its execution efficiency explicit: an average of 31 turns per task. Gemini 3.1 Pro Preview, by contrast, consumes 83 turns to score only 30%. That is roughly 2.7 times as many turns (83 vs. 31) for 16 fewer accuracy points, a gap that materialises as API cost and real-time latency rather than remaining a statistical footnote.

Gemini 3.5 Flash lands at 40% for $1.70 per task, a considerably better cost-to-score ratio than Gemini 3.1 Pro Preview at $2.23 for 30%. Qwen3.7 Max scores 42%, placing it between the two leaders (46-47%) and Gemini 3.5 Flash (40%).

Among open-weight models, GLM-5.1 (Reasoning) reaches 40% at $1.23 per task. DeepSeek V4 Pro (Reasoning) scores 38%. Gemma 4 31B (Reasoning) closes the open-weight bracket at 37% for $0.14 per task, roughly 38 times cheaper than Claude Opus 4.7 according to the data published by IBM Research and Artificial Analysis. Notably, Gemma 4 31B outperforms Gemini 3.1 Pro Preview on both score (37% vs. 30%) and cost ($0.14 vs. $2.23 per task).
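One way to put score and cost on a single axis is expected spend per correct diagnosis. The sketch below covers only the models for which this article quotes both figures. It treats the score as a plain success rate, which is a simplifying assumption, and the function name is illustrative.

```python
# (score %, USD per task) as quoted above from the ITBench-AA leaderboard
LEADERBOARD = {
    "Claude Opus 4.7": (47, 5.38),
    "Gemini 3.5 Flash": (40, 1.70),
    "GLM-5.1 (Reasoning)": (40, 1.23),
    "Gemma 4 31B (Reasoning)": (37, 0.14),
    "Gemini 3.1 Pro Preview": (30, 2.23),
}


def cost_per_solved_task(score_pct: float, cost_per_task: float) -> float:
    """Expected USD per correct diagnosis, treating the score as a success rate."""
    return cost_per_task / (score_pct / 100)


# Cheapest expected cost per correct diagnosis first
ranked = sorted(LEADERBOARD, key=lambda m: cost_per_solved_task(*LEADERBOARD[m]))
for model in ranked:
    value = cost_per_solved_task(*LEADERBOARD[model])
    print(f"{model:26s} ${value:6.2f} per solved task")
```

This metric ignores the cost of a wrong diagnosis reaching production, so it favours cheap models more than a full risk analysis would.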

Pricing and operational implications

The cost gap between the top-scoring model and the lowest-cost model on the leaderboard is about 38x ($5.38 vs. $0.14), according to the published data. For any organisation automating SRE diagnostics at scale, that spread makes the assumption of a single frontier model across all IT agent tasks economically indefensible.
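The effect of that spread on an average bill can be sketched as a weighted mean. The two unit costs come from the leaderboard; the routing splits are hypothetical, and the calculation says nothing about the accuracy lost by routing to the cheaper model.

```python
FRONTIER_COST = 5.38  # USD per task, Claude Opus 4.7 (leaderboard figure)
LOW_COST = 0.14       # USD per task, Gemma 4 31B (leaderboard figure)


def blended_cost(frontier_share: float) -> float:
    """Average USD per task when `frontier_share` of incidents use the frontier model."""
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    return frontier_share * FRONTIER_COST + (1 - frontier_share) * LOW_COST


print(f"spread: {FRONTIER_COST / LOW_COST:.1f}x")
for share in (1.0, 0.5, 0.1):  # hypothetical routing splits
    print(f"{share:.0%} frontier -> ${blended_cost(share):.2f} per task")
```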

Turn count is a second cost axis that model comparison reports routinely omit. An agent averaging 83 turns per task introduces latency that is structurally incompatible with real-time SRE alerting. GPT-5.5's 31-turn average delivers an operational advantage that the 1-point score delta versus Claude does not begin to capture. Execution cadence is a performance dimension in its own right.
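A back-of-the-envelope view of what turn count means for latency is sketched below. It assumes a fixed per-turn duration that is the same for every model. That value is a hypothetical placeholder, since the figures quoted here do not include per-turn timing, and real per-turn latency varies by model and tool call.

```python
# Average turns per task as quoted above from the leaderboard
TURNS_PER_TASK = {"GPT-5.5": 31, "Gemini 3.1 Pro Preview": 83}


def estimated_minutes(turns: int, seconds_per_turn: float) -> float:
    """Rough wall-clock time for one diagnosis, ignoring parallelism and retries."""
    return turns * seconds_per_turn / 60


SECONDS_PER_TURN = 10.0  # hypothetical; measure this in your own deployment
for model, turns in TURNS_PER_TASK.items():
    minutes = estimated_minutes(turns, SECONDS_PER_TURN)
    print(f"{model}: ~{minutes:.1f} min per task")
```

Whatever the true per-turn figure, the ratio between the two models stays at 83/31 as long as that figure is similar for both.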

What this means for a multi-model architecture

The joint reading of scores, costs, and turn counts points toward a functional segmentation. High-criticality, low-frequency incidents — network partitions, security diagnostics, complex rollout failures — justify Claude Opus 4.7 or GPT-5.5 despite their cost. High-volume, recurring SRE work — quota monitoring, standard application alerts, routine diagnostics — can be routed toward Gemma 4 31B or GLM-5.1, with a cost-performance ratio documented in the benchmark itself.
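The segmentation described above can be sketched as a small routing function. The incident kinds, the `Incident` type, and the model labels are all illustrative names, not real API model identifiers or a taxonomy taken from the benchmark.

```python
from dataclasses import dataclass

# Hypothetical labels, not real API model identifiers
FRONTIER_MODEL = "claude-opus-4.7"
LOW_COST_MODEL = "gemma-4-31b"

# Hypothetical taxonomy of high-volume, recurring SRE work
HIGH_VOLUME_KINDS = {"quota_monitoring", "application_alert", "routine_diagnostic"}


@dataclass(frozen=True)
class Incident:
    kind: str


def pick_model(incident: Incident) -> str:
    """Route known routine work to the low-cost model; everything else goes frontier."""
    if incident.kind in HIGH_VOLUME_KINDS:
        return LOW_COST_MODEL
    # Unknown or high-criticality kinds default to the stronger model.
    return FRONTIER_MODEL


print(pick_model(Incident("quota_monitoring")))
print(pick_model(Incident("network_partition")))
```

Defaulting unknown incident kinds to the frontier model is a deliberately conservative choice: a misclassified incident costs more money rather than more risk.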

A single-model architecture covering the full enterprise IT agent perimeter is no longer defensible on these figures. Routing by incident criticality and type becomes a first-class architectural decision, not an optimisation to revisit later.

Three levers to activate this week

Review the ITBench-AA leaderboard on artificialanalysis.ai before any model vendor decision for agentic IT use cases — score, cost-per-task, and turn-count data are public and directly comparable.
Instrument turn count in current SRE agent deployments, not just success rate. A 2.7x gap in turns between models translates to real API cost and latency differences in production.
Run a Gemma 4 31B pilot on high-volume SRE tasks before automatically renewing a frontier subscription: at $0.14 per task, the financial risk of the experiment is low, and the reference data to evaluate it already exists in the benchmark.
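For the second lever, a minimal sketch of what instrumenting turn count alongside success rate could look like. The class and field names are hypothetical; in practice these values would be emitted to whatever metrics system the deployment already uses.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRunStats:
    """Collects per-run turn counts and outcomes for one model."""
    turns: list[int] = field(default_factory=list)
    successes: list[bool] = field(default_factory=list)

    def record(self, turns: int, success: bool) -> None:
        self.turns.append(turns)
        self.successes.append(success)

    def summary(self) -> dict[str, float]:
        if not self.turns:
            raise ValueError("no runs recorded yet")
        runs = len(self.turns)
        return {
            "runs": runs,
            "success_rate": sum(self.successes) / runs,
            "avg_turns": sum(self.turns) / runs,
        }


stats = AgentRunStats()
stats.record(turns=31, success=True)   # illustrative runs, not benchmark data
stats.record(turns=83, success=False)
print(stats.summary())
```

Tracking both numbers per model makes the comparison in the benchmark reproducible on an organisation's own incidents.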

If the best available model fails more than half the time on autonomous IT diagnosis, where exactly does the non-negotiable boundary with human oversight sit?
