
Open-Weight RAG Stack: Why the Embedding and Reranking Layers Moved Before the Agents Did

May 22, 2026
TL;DR. Three open-weight releases in the week of 18 May 2026 — the Ettin Reranker family, Granite Embedding Multilingual R2, and IBM Research's Open Agent Leaderboard — draw a clear boundary: the embedding and reranking layers of enterprise RAG now belong to open-weight models under 311M parameters, while agent orchestration still trails frontier closed models by 18 to 29 percentage points, per the leaderboard.

What Just Forced a Layer-by-Layer Reassessment

Between 14 and 19 May 2026, three independent publications reshaped the economics of enterprise information retrieval pipelines. IBM launched Granite Embedding Multilingual R2 with a 32,768-token context window — versus 512 tokens in the R1 generation. Tom Aarsen published the Ettin family, six rerankers under Apache 2.0 licence ranging from 17.6M to 1.04B parameters, distilled from a 1.54B teacher model. IBM Research simultaneously launched the Open Agent Leaderboard, which evaluates complete agent systems — model plus agent architecture pairs — across six benchmarks with no benchmark-specific tuning, per the official announcement.

Taken together, these three releases force a layer-by-layer rethink. The question is no longer which general-purpose model to call, but which architecture to compose.

Where Open-Weight Wins: Embeddings and Reranking

Granite Embedding R2: Long Context as the Differentiator

The 97M-r2 model scores 60.3 on the MTEB multilingual retrieval task (18 languages), against 52.7 for multilingual-e5-base at 278M parameters: a gain of 7.6 points with roughly a third of the parameters, per IBM's published data. On LongEmbed, the 311M-r2 ranks first with 71.7, ahead of harrier-oss-v1-270m at 64.9 and Granite 278M-R1 at 37.7, a within-family generational gain of 34 points. Throughput on H100 reaches approximately 1,800 documents per second for the 311M-r2, 5.5 times faster than jina-embeddings-v5-text-nano, per IBM's published benchmarks.

The generational break comes down to one variable: 512 tokens of context for R1, 32,768 for R2. Contracts, multi-page regulatory reports and legal briefs that previously overflowed the context window can now fit in a single pass, with no chunking or truncation, as long as they stay within 32,768 tokens.
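
To make the context-window difference concrete, here is a minimal sketch that counts how many non-overlapping windows a corpus needs at 512 versus 32,768 tokens. The token counts are hypothetical, and `passes_needed` and `audit` are illustrative names rather than part of any Granite tooling. In practice the counts would come from the tokenizer of the embedding model you deploy.

```python
import math

R1_CONTEXT = 512      # tokens, R1 window as quoted in the article
R2_CONTEXT = 32_768   # tokens, R2 window as quoted in the article


def passes_needed(doc_tokens: int, context: int) -> int:
    """Number of non-overlapping windows needed to cover one document."""
    if doc_tokens <= 0:
        return 0
    return math.ceil(doc_tokens / context)


def audit(token_counts: list[int]) -> dict[str, int]:
    """Summarise how a corpus fits each context window."""
    return {
        "docs": len(token_counts),
        "fit_r1": sum(n <= R1_CONTEXT for n in token_counts),
        "fit_r2": sum(n <= R2_CONTEXT for n in token_counts),
        "r1_passes": sum(passes_needed(n, R1_CONTEXT) for n in token_counts),
        "r2_passes": sum(passes_needed(n, R2_CONTEXT) for n in token_counts),
    }


# Hypothetical token counts for four documents (assumption, not measured data).
corpus_token_counts = [300, 4_096, 20_000, 40_000]
report = audit(corpus_token_counts)
print(report)
```

In this made-up corpus only one document fits the 512-token window, three fit the 32,768-token window, and the 40,000-token document still needs two passes. Very long documents do not escape chunking entirely.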

Ettin Reranker: Efficiency as the Core Argument

The Ettin family upends the conventional size-versus-performance trade-off in reranking. On MTEB NDCG@10, ettin-32m (32.8M parameters) scores 0.5779 against 0.5526 for bge-reranker-v2-m3 at 568M parameters: a gain of 0.025 with about 17 times fewer parameters, per the published results. The ettin-1b model (about 1B parameters) reaches 0.6114, virtually matching its teacher mxbai-rerank-large-v2 (1.54B parameters, score 0.6115) with roughly a third fewer parameters and 2.40 times the speed on H100. The ModernBERT architecture with unpadded attention delivers an 8.26x throughput gain for the 1B model over the fp32+SDPA baseline, per the published measurements, a figure that materially changes infrastructure cost calculations at scale.

Where Closed Models Still Hold: Agent Orchestration

The IBM Research Open Agent Leaderboard, published on 18 May 2026, introduces a defining data point: the open-weight models tested (DeepSeek V3.2 and Kimi K2.5, added after launch) trail frontier closed-source models by 18 to 29 percentage points on average across six benchmarks, per the leaderboard. This gap does not measure a single isolated task. It measures the complete system (model plus orchestration plus tools) without benchmark-specific optimisation, on high-complexity tasks including SWE-Bench Verified, BrowseComp+, AppWorld, and the tau2-Bench Airline, Retail and Telecom environments.

The operational nuance matters: per IBM Research, the same model paired with different agent architectures produces different quality outcomes and different costs. Architecture counts — but it does not yet close the capability gap between open-weight and frontier on complex tasks. One finding cuts the other way: in several cases, general-purpose agents tested without benchmark-specific tuning matched or outperformed systems built specifically for those tasks, per the same source.

Pricing and Operational Implications

All three model families are released under Apache 2.0 licence. For engineering teams, this means on-premise or private-cloud deployment without per-request fees on the embedding and reranking layers. The agent orchestration layer, if built on closed frontier models, retains a usage-proportional cost.

The Open Agent Leaderboard introduces a variable rarely quantified in model comparisons: the cost of failures. Failed runs cost 20 to 54% more than successful ones, per IBM Research's published data. An agent stack that fails regularly on complex tasks is not merely underperforming — it is structurally more expensive to operate. Tool shortlisting improved performance across every model tested and turned otherwise failing configurations into viable ones, per the same source.
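
To see why failures compound, here is a rough sketch of the expected cost per completed task. It rests on assumptions the leaderboard does not state: every attempt succeeds independently with the same probability, and failed tasks are retried until one succeeds. The unit cost of 1.0 and the 50% success rate are hypothetical. Only the 20% and 54% overage bounds come from the figures quoted above.

```python
def cost_per_success(success_cost: float, success_rate: float,
                     failure_overage: float) -> float:
    """Expected spend to obtain one successful run under retry-until-success.

    success_cost    : cost of one successful run (any currency unit)
    success_rate    : probability that a single attempt succeeds, in (0, 1]
    failure_overage : extra cost of a failed run relative to a successful one
                      (0.20 means a failed run costs 20% more)
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    failed_cost = success_cost * (1 + failure_overage)
    # Expected number of failed attempts before the first success.
    expected_failures = (1 - success_rate) / success_rate
    return success_cost + expected_failures * failed_cost


# Hypothetical agent that succeeds half the time, at both ends of the quoted range.
low = cost_per_success(1.0, 0.5, 0.20)
high = cost_per_success(1.0, 0.5, 0.54)
print(f"cost per completed task: {low:.2f} to {high:.2f} units")
```

Under these assumptions, an agent that succeeds half the time pays between 2.20 and 2.54 units per completed task instead of 1.00, so the failure rate weighs on the bill at least as much as the per-run price.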

What This Means for a Multi-Model Architecture

The map that emerges in May 2026 points to a three-tier architecture:

  • Embedding layer: open-weight (Granite 97M-r2 or 311M-r2) for multilingual corpora, long documents, and codebases — on-premise deployment viable under Apache 2.0, with a 64x context increase over the previous generation.
  • Reranking layer: open-weight (Ettin 32M to 400M depending on latency constraints) for high-volume pipelines — the quality-to-parameter ratio now exceeds prior-generation alternatives across MTEB benchmarks.
  • Agent orchestration layer: closed frontier models for high-complexity tasks — for as long as the 18 to 29 percentage-point gap remains documented on reference benchmarks.
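
The three tiers above can be read as three swappable components behind one interface. The sketch below is a minimal illustration of that composition, not a reference implementation. `RagStack`, `toy_embed`, `toy_rerank` and `toy_agent` are hypothetical names, and the toy functions only stand in for a Granite embedder, an Ettin reranker and a frontier-model agent. No real model is called.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]   # text -> vector
Reranker = Callable[[str, str], float]        # (query, doc) -> relevance score
Agent = Callable[[str, Sequence[str]], str]   # (query, context) -> answer


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class RagStack:
    embed: Embedder    # tier 1: open-weight embedding model
    rerank: Reranker   # tier 2: open-weight reranker
    agent: Agent       # tier 3: orchestration model

    def answer(self, query: str, corpus: Sequence[str],
               retrieve_k: int = 20, final_k: int = 3) -> str:
        q_vec = self.embed(query)
        # Tier 1: cheap similarity search over the whole corpus.
        candidates = sorted(corpus, key=lambda d: dot(q_vec, self.embed(d)),
                            reverse=True)[:retrieve_k]
        # Tier 2: costlier pairwise scoring on the shortlist only.
        reranked = sorted(candidates, key=lambda d: self.rerank(query, d),
                          reverse=True)
        # Tier 3: the agent sees only the best passages.
        return self.agent(query, reranked[:final_k])


# Toy stand-ins (assumptions for illustration only).
VOCAB = ("contract", "invoice", "penalty", "clause")


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_rerank(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_agent(query: str, context: Sequence[str]) -> str:
    top = context[0] if context else "no context"
    return f"[{len(context)} passage(s)] {top}"


corpus = ["invoice payment schedule",
          "contract penalty clause for late delivery",
          "office lunch menu"]
stack = RagStack(toy_embed, toy_rerank, toy_agent)
result = stack.answer("penalty clause in the contract", corpus,
                      retrieve_k=2, final_k=1)
print(result)
```

Keeping each tier behind a plain callable means the embedder or reranker can be swapped for an on-premise model without touching the orchestration code, and the other way round.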

This segmentation is not theoretical. The Open Agent Leaderboard demonstrates that model choice remains the dominant factor, but agent architecture is beginning to produce a measurable difference. Investing in the orchestration layer — tool selection, routing, failure handling — delivers returns independent of the model chosen.

Three Levers to Activate This Week

  1. Audit the actual context length of your corpora: if your documents exceed 4,096 tokens (contracts, reports, regulatory filings), migrating to Granite R2 (32,768-token context) removes the need for artificial chunking and should improve retrieval precision on long passages.
  2. Benchmark your existing reranker against the Ettin family: compare your current NDCG@10 against Ettin's published MTEB scores. Ettin-150m (0.5994) outperforms Qwen3-Reranker-0.6B (0.5940) with a quarter of the parameters. If your pipeline runs a prior-generation model, the gain is available with no architectural change.
  3. Measure the cost of your agent failures: before deciding between open-weight and closed models on the orchestration layer, quantify your current failure rate and the associated overspend. IBM Research's figure of 20 to 54% cost overage per failed run is a usable baseline for comparison starting this week.
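
For the second lever, a minimal NDCG@10 sketch is enough to compare two rerankers on your own judged queries. It assumes linear gain and a log2 rank discount. Evaluation libraries can differ in their gain function, so score both rerankers with the same code before comparing against published numbers. The document IDs and relevance judgments below are hypothetical.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids: list[str], judgments: dict[str, float],
              k: int = 10) -> float:
    """NDCG@k for one query; unjudged documents count as relevance 0."""
    gains = [judgments.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    ideal = sorted(judgments.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical judgments for one query: documents "a" and "b" are relevant.
judgments = {"a": 1.0, "b": 1.0}
current_reranker = ["c", "a", "b"]     # relevant documents pushed down
candidate_reranker = ["a", "b", "c"]   # relevant documents on top

current_score = ndcg_at_k(current_reranker, judgments)
candidate_score = ndcg_at_k(candidate_reranker, judgments)
print(f"current: {current_score:.4f}  candidate: {candidate_score:.4f}")
```

Averaging this score over a few hundred judged queries from your own corpus gives a figure comparable in kind, though not in absolute value, to the MTEB scores quoted above.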

Which layer of your RAG pipeline shows the widest gap between the performance you measure and the cost you actually carry — embeddings, reranking, or agent orchestration?


Open-Weight RAG Stack: Why the Embedding and Reranking Layers Moved Before the Agents Did | Matthieu Pesesse