TL;DR. IBM trained Granite 4.1 on approximately 15 trillion tokens across a five-phase pre-training pipeline and four reinforcement-learning stages, including one stage dedicated solely to recovering from the mathematical-reasoning regression introduced by RLHF. The published result: an 8B dense model that consistently matches or outperforms its 32B MoE predecessor.
The Business Problem: One Model, Contradictory Goals
IBM's specification for Granite 4.1 was enterprise-grade from the outset: Apache 2.0 licence, twelve languages — English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, Chinese — a context window capable of handling heavy document workloads, and three deployable variants: 3B, 8B, and 30B parameters.
The hard constraint was not parameter count. It was making a single set of weights simultaneously strong at mathematical reasoning, code generation, multilingual instruction-following, tool calling, and conversational behaviour. In unstructured training, each objective tends to erode the others. IBM resolved this by sequencing training into discrete phases rather than optimising for everything at once.
Architecture and Pipeline Design
IBM chose a dense decoder-only transformer with Grouped Query Attention, Rotary Position Embeddings, SwiGLU activations, RMSNorm, and shared input/output embeddings — technically conventional choices. The differentiation lives in the pipeline structure, not the base architecture.
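To make that ingredient list concrete, here is a minimal configuration sketch of such a dense decoder-only stack. It is purely illustrative: the dimensions and values are placeholders I chose, not Granite 4.1's published hyperparameters.

```python
from dataclasses import dataclass

@dataclass
class DenseDecoderConfig:
    # Illustrative placeholders only; not Granite 4.1's published values.
    hidden_size: int = 4096
    num_layers: int = 32
    num_attention_heads: int = 32
    num_kv_heads: int = 8              # Grouped Query Attention: fewer KV heads than query heads
    ffn_activation: str = "swiglu"     # SwiGLU feed-forward activation
    norm: str = "rmsnorm"              # RMSNorm in place of LayerNorm
    rope_theta: float = 10_000.0       # Rotary Position Embeddings base frequency
    tie_word_embeddings: bool = True   # shared input/output embedding matrix
    max_position_embeddings: int = 131_072
```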
Pre-training covers approximately 15 trillion tokens, per the IBM documentation published on Hugging Face, distributed across five sequential phases:
- Phase 1 — 10 trillion tokens: general coverage (web, code, mathematics, technical)
- Phase 2 — 2 trillion: mathematics (35%) and code (30%) emphasis
- Phase 3 — 2 trillion: high-quality annealing with chain-of-thought data
- Phase 4 — 500 billion: refinement on high-quality CommonCrawl (40%)
- Phase 5: long-context extension from 32K to 128K then 512K tokens, using books and code repositories
Supervised fine-tuning drew on 4.1 million curated samples filtered through a multi-dimensional LLM-as-Judge framework with global deduplication. Training ran on 16 nodes with 4× GB200 GPUs in an NVIDIA GB200 NVL72 cluster hosted at CoreWeave, over NVLink and NDR 400 Gb/s InfiniBand — all documented in the IBM publication.
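IBM does not publish the filtering code, but the described pattern (score each candidate sample on several dimensions with a judge model, keep only high scorers, deduplicate globally) can be sketched roughly as below. The `judge_scores` callable stands in for whatever LLM-as-Judge call is used, and the threshold is an arbitrary assumption.

```python
import hashlib

def dedup_key(sample: dict) -> str:
    """Global deduplication on a normalised hash of the sample text."""
    text = (sample["prompt"] + sample["response"]).lower().strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def filter_sft_corpus(samples, judge_scores, min_score=7.0):
    """Keep each sample once per dedup key, and only if its weakest
    judge dimension clears the threshold.

    judge_scores(sample) is assumed to return a dict of dimension -> score
    (e.g. correctness, helpfulness, formatting) from an LLM-as-Judge call.
    """
    seen, kept = set(), []
    for sample in samples:
        key = dedup_key(sample)
        if key in seen:
            continue
        seen.add(key)
        scores = judge_scores(sample)
        if min(scores.values()) >= min_score:
            kept.append(sample)
    return kept
```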
The Trade-offs Accepted
The reinforcement learning pipeline is where the real tensions surface. IBM structured four sequential RL stages using on-policy GRPO with DAPO loss (a simplified sketch of the objective follows the list):
- Multi-domain RL: mathematics, science, logic, instruction-following, structured output, Text2SQL, temporal reasoning, chat, in-context learning
- RLHF: generic chat with a multilingual reward model
- Identity and knowledge-calibration RL: model self-identification
- Math RL: explicit recovery from the performance drop introduced by the RLHF stage
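For orientation, here is a simplified sketch of the group-relative advantage at the heart of GRPO and a clipped policy loss with the asymmetric "clip-higher" bounds in the spirit of DAPO. This is not IBM's implementation, and the clip values are placeholders.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each sampled completion is scored against
    the mean/std of its own group, so no learned value network is needed.
    rewards has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

def clipped_policy_loss(log_probs, old_log_probs, advantages,
                        clip_low=0.2, clip_high=0.28):
    """PPO-style clipped objective with asymmetric bounds, as in DAPO's
    'clip-higher'; tensors are assumed to broadcast token-wise."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```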
That fourth stage is the honest admission in the documentation: adding conversational RLHF degraded quantitative reasoning. IBM measured it, named it, and allocated a dedicated recovery stage to address it. Few labs document this tension so plainly in a public release post.
On deployment efficiency, FP8 quantisation reduces disk footprint and GPU memory by 50% per the IBM post — a practical lever for organisations operating outside hyperscaler infrastructure.
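The 50% figure is easy to sanity-check with back-of-envelope arithmetic on the weights alone, ignoring KV cache, activations and runtime overhead:

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight-only footprint; excludes KV cache, activations and overhead."""
    return num_params * bytes_per_param / 1e9

params = 8e9  # an 8B-parameter model
print(f"BF16 (2 bytes/param): {weight_memory_gb(params, 2):.0f} GB")  # ~16 GB
print(f"FP8  (1 byte/param) : {weight_memory_gb(params, 1):.0f} GB")  # ~8 GB, the ~50% reduction
```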
The Published Results
For the Granite 4.1-8B Instruct model, IBM publishes the following benchmark scores:
- GSM8K (mathematical reasoning): 92.49%
- HumanEval pass@1 (code): 87.20%
- MMLU (general knowledge): 73.84%
- IFEval (instruction-following): 87.06%
- BFCL V3 (tool calling): 68.27%
- RULER at 128K tokens (long context): 73.0%
The headline finding: the 8B dense model consistently matches or outperforms Granite 4.0-H-Small — a 32B MoE model with 9B active parameters. A model four times smaller in total parameter count, at a fraction of the inference cost, holds its own across a comprehensive benchmark suite.
These validation runs carry costs that rarely appear in deployment budgets. According to the EvalEval coalition's analysis published on Hugging Face in April 2026, a single GAIA evaluation on a frontier model costs $2,829 before caching, and a full PaperBench run costs approximately $9,500 per agent. IBM absorbed comparable evaluation costs at every gate of its five-phase pipeline.
Three Lessons That Apply Broadly
- Regression is a documentable engineering artefact, not an anomaly. RLHF that improves conversational quality while degrading mathematical reasoning is a known multi-objective optimisation tension. Naming it, measuring it, and allocating a dedicated recovery stage is a practice every production LLM deployment should reproduce.
- Parameter count is no longer the primary quality signal. An 8B dense model trained with pipeline discipline outperforms a 32B MoE model trained differently. Data quality, phase structure, and RL stage design carry more weight than raw parameter volume.
- Evaluation is now a full infrastructure cost. Per EvalEval's data, agent benchmarks compress only 2–3.5×, versus 100–200× for static LLM benchmarks. Any organisation that does not budget evaluation compute as a line item is underestimating its true LLM deployment cost.
Three Levers for Your Organisation
- Audit your fine-tuning stages by capability domain. If your model undergoes conversational adaptation or RLHF, explicitly measure the regression on analytical and technical tasks (a minimal gate is sketched after this list). An unmeasured degraded score is a silent production bug.
- Revisit the parameter-count criterion in your vendor assessments. Before specifying a 30B+ model in your architecture, validate recent 7B–8B benchmarks against your specific use case. The Granite 4.1-8B versus Granite 4.0-32B MoE comparison is the direct illustration.
- Budget your evaluations alongside your GPU costs. Per EvalEval, a full HAL run costs approximately $40,000. That cost is not optional if your organisation wants to compare models honestly in real operational conditions — factor it in before selecting a model or vendor.
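As referenced in the first lever, a regression gate can be as simple as comparing per-domain benchmark scores before and after each alignment stage and failing the run when any domain drops beyond a tolerance. The domains, scores and threshold below are illustrative, not Granite's numbers.

```python
def regression_gate(before: dict, after: dict, max_drop: float = 1.0) -> list[str]:
    """Return the capability domains whose score dropped by more than
    max_drop points between two checkpoints (e.g. pre- and post-RLHF)."""
    return [d for d in before if before[d] - after.get(d, 0.0) > max_drop]

# Hypothetical numbers: a conversational RLHF stage that lifts chat
# quality but silently costs mathematical accuracy.
pre_rlhf  = {"math": 92.0, "code": 87.0, "chat": 71.0}
post_rlhf = {"math": 88.5, "code": 86.8, "chat": 78.0}

failed = regression_gate(pre_rlhf, post_rlhf)
if failed:
    print(f"Regression detected in: {failed}")  # ['math'] -> schedule a recovery stage
```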
What Silent Regression Is Currently Invisible in Your Fine-Tuning Pipeline?
If this analysis speaks to you, I publish a piece of this calibre every day on digital innovation and enterprise AI. 👉 Get the next one straight in your inbox — sign-up takes ten seconds, and each edition is read before 9 a.m. by leaders of European SMEs, mid-caps and public institutions.
This article is part of the Neurolinks AI & Automation blog.