TL;DR. Granite 4.1-8B outperforms its 32-billion-parameter MoE predecessor across most benchmarks, per IBM. Nemotron 3 Nano Omni delivers 7.4x throughput on multi-document tasks, per NVIDIA. DeepSeek-V4-Pro-Max hits 80.6% on SWE-Verified — two tenths behind Claude Opus 4.6-Max. Three open-weight models in two weeks: the question is no longer which one to pick, but where each one fits in the stack.
What Just Shifted in the Open-Weight Enterprise Landscape
Between late April and early May 2026, three separate teams published technical posts on Hugging Face documenting three distinct open-weight foundation models: IBM with Granite 4.1, NVIDIA with Nemotron 3 Nano Omni, and DeepSeek with V4. None of these models targets the same functional perimeter. The compressed timeline forces a reassessment of existing model-selection frameworks.
The open-weight market has long organized itself around general-purpose families — the best possible model within a given size envelope. What these three publications reveal is a segmentation by use case: structured efficiency and multilingual fidelity for Granite, native multimodality for Nemotron, and long-range agentic reasoning for DeepSeek-V4. A single default model no longer covers all three axes without significant trade-offs.
Where DeepSeek-V4 Sets a New Agentic Benchmark
DeepSeek-V4 comes in two variants according to the Hugging Face blog published in late April 2026: V4-Pro (1.6 trillion total parameters, 49 billion active) and V4-Flash (284 billion total, 13 billion active). Both carry a one-million-token context window. The layered attention compression architecture — alternating CSA and HCA layers — reduces KV cache to approximately 2% of the standard GQA baseline and cuts inference FLOPs to 27% of DeepSeek-V3.2 levels, per the same blog.
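For a sense of what that 2% figure buys at a one-million-token context, here is a back-of-the-envelope estimate. The layer count, KV-head count, and head dimension below are hypothetical placeholders (they are not in the blog excerpt cited here); only the ~2% compression ratio comes from the source.

```python
# Back-of-the-envelope KV-cache estimate for a 1M-token context.
# Layer count, KV heads, and head dimension are HYPOTHETICAL; only the
# ~2% ratio is taken from the DeepSeek blog cited above.

def kv_cache_gib(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Memory for keys + values across all layers, in GiB (BF16 by default)."""
    total_bytes = num_tokens * num_layers * num_kv_heads * head_dim * 2 * bytes_per_value
    return total_bytes / 1024**3

# Hypothetical GQA baseline: 60 layers, 8 KV heads of dimension 128.
baseline = kv_cache_gib(num_tokens=1_000_000, num_layers=60, num_kv_heads=8, head_dim=128)
compressed = baseline * 0.02  # the ~2% figure reported for the CSA/HCA scheme

print(f"GQA baseline KV cache @ 1M tokens: {baseline:.1f} GiB")
print(f"At ~2% of baseline:                {compressed:.1f} GiB")
```

At those (assumed) dimensions, the cache drops from a few hundred GiB to a handful, which is what makes the million-token window operationally plausible.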
On agent benchmarks, the numbers are specific. V4-Pro-Max reaches 80.6% on SWE-Verified, against 80.8% for Claude Opus 4.6-Max per the DeepSeek blog. On MCPAtlas Public, it scores 73.6 (Opus 4.6-Max: 73.8). On an internal R&D coding benchmark cited in the article, V4-Pro-Max posts a 67% pass rate, ahead of Claude Sonnet 4.5 at 47% and slightly behind Opus 4.5 at 70%. In the developer survey documented in the blog, 52% of respondents said the model could replace their primary coding model, with 39% leaning in that direction.
The interleaved thinking feature — preserving reasoning traces across successive tool calls — is built explicitly for multi-step agentic workflows. It is absent from Granite 4.1. Think Max mode, for tasks requiring maximum reasoning depth, requires a minimum of 384,000 context tokens available, per DeepSeek.
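The exact client contract for interleaved thinking is not spelled out in the excerpt, so the sketch below only illustrates the pattern: a generic agent loop against an OpenAI-compatible endpoint that keeps every assistant turn, reasoning trace included, in the running message history between tool calls. The endpoint, the model id, and the example tool are assumptions for illustration.

```python
import json
from openai import OpenAI

# Hypothetical endpoint and model id; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repository.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def run_tool(call):
    # Toy dispatcher for the single example tool above.
    args = json.loads(call.function.arguments)
    if call.function.name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    return "unknown tool"

messages = [{"role": "user", "content": "Summarise what main.py does, then propose a fix."}]

for _ in range(8):  # cap the number of agent steps
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",  # hypothetical model id
        messages=messages,
        tools=TOOLS,
    )
    msg = resp.choices[0].message

    # Keep the full assistant turn (including any reasoning trace the server
    # returns) in the running history so later steps can reuse it.
    messages.append(msg.model_dump(exclude_none=True))

    if not msg.tool_calls:
        print(msg.content)
        break

    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool(call),
        })
```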
Where Granite 4.1 and Nemotron Omni Hold Their Ground
IBM Granite 4.1: Structured Efficiency and Multilingual Reliability
The defining result in IBM's publication is this: according to IBM's Hugging Face blog, Granite 4.1-8B instruct matches or exceeds the previous Granite 4.0-H-Small — a 32-billion-parameter MoE model with 9 billion active — across all key benchmarks, including IFEval, AlpacaEval 2.0, MMLU-Pro, GSM8K and ArenaHard. A model four times smaller that outperforms its larger predecessor.
The published figures are precise. On structured tool calling (BFCL v3), Granite 4.1-8B instruct scores 68.27; the 30B variant reaches 73.68. On GSM8K (mathematical reasoning), the 8B posts 92.49%, the 30B 94.16%. On HumanEval (code generation), the 8B hits 87.20%. The RLHF training stage produced an average gain of +18.9 points on AlpacaEval, per IBM. The context window extends to 512,000 tokens for both the 8B and 30B variants. FP8 quantization reduces GPU memory and disk footprint by approximately 50%, per IBM. The license is Apache 2.0, and twelve languages are supported natively.
This profile — compact, latency-predictable (no extended reasoning traces), memory-efficient — directly targets RAG pipelines, sector-specific assistants, and structured generation workflows under constrained GPU budgets. The absence of extended reasoning mode is an operational advantage for real-time use cases: latency stays stable and inference costs remain forecastable.
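As an illustration of the structured-generation use case, here is a minimal sketch using vLLM guided JSON decoding with an FP8 load. The model id is an assumption based on IBM's usual Hugging Face naming, and the guided-decoding API shown matches recent vLLM releases; verify both against the actual release and your installed version.

```python
# Minimal sketch: a compact instruct model serving structured extraction
# with vLLM guided decoding. The model id is an ASSUMPTION; substitute the
# real Hugging Face id once published.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {
    "type": "object",
    "properties": {
        "customer": {"type": "string"},
        "amount_eur": {"type": "number"},
        "due_date": {"type": "string"},
    },
    "required": ["customer", "amount_eur", "due_date"],
}

llm = LLM(model="ibm-granite/granite-4.1-8b-instruct", quantization="fp8")
params = SamplingParams(
    temperature=0.0,
    max_tokens=256,
    guided_decoding=GuidedDecodingParams(json=schema),  # constrain output to the schema
)

prompt = "Extract the invoice fields as JSON:\nInvoice: ACME GmbH, 12 400 EUR, due 2026-06-30."
print(llm.generate([prompt], params)[0].outputs[0].text)
```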
NVIDIA Nemotron 3 Nano Omni: Native Multimodality as a Distinct Perimeter
Nemotron 3 Nano Omni 30B-A3B is built on a hybrid Mamba-Transformer-MoE architecture combining 23 selective state-space layers, 23 MoE layers with 128 experts and top-6 routing, and 6 grouped-query attention layers, per NVIDIA's Hugging Face blog. The model natively processes text, image, video, and audio in a single forward pass — without an intermediate transcription pipeline.
The measured advantages on document-audio-video tasks are material. VoiceBench: 89.4. Video-MME: 72.2. DailyOmni (simultaneous video and audio comprehension): 74.1. MMLongBench-Doc (long documents): 57.5. OSWorld (GUI-based computer use): 47.4. For multi-document workloads, throughput is 7.4x higher than the comparison models NVIDIA cites; for video, 9.2x. The model handles audio sessions exceeding five hours and documents exceeding 100 pages in native context.
Granite 4.1 does not compete on these dimensions. For teams processing recorded calls, long-form PDF contracts, video meetings, or industrial video streams, Nemotron Omni opens a functional perimeter that text-only architectures cannot access.
Pricing and Operational Implications
All three models are open-weight and freely accessible on Hugging Face. The cost structure therefore shifts to inference infrastructure, not licensing. Granite 4.1 is published under Apache 2.0, with no commercial restriction for on-premise deployment. DeepSeek-V4's weights are openly available on Hugging Face per the blog. Nemotron 3 Nano Omni ships in BF16, FP8, and NVFP4 formats per NVIDIA.
On memory footprint: Granite 4.1-8B in FP8 reduces GPU memory by approximately 50% per IBM, a figure that translates directly into per-token inference cost at scale. Nemotron 3 Nano Omni in BF16 requires approximately 30GB of VRAM; the NVFP4 variant reduces the model to approximately 18 billion effective parameters per NVIDIA. DeepSeek-V4-Flash activates only 13 billion of its 284 billion parameters per token, which keeps per-token compute in mid-range GPU territory, although the full weight set still has to be held in memory or sharded across devices.
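A rough rule of thumb makes these footprint figures easy to sanity-check: weight memory is roughly parameter count times bytes per parameter, before KV cache and runtime overhead. The estimates below are illustrative, not vendor numbers.

```python
# Rule-of-thumb weight footprint: parameter count x bytes per parameter.
# Illustrative estimates only; KV cache, activations, and runtime
# overhead come on top.
BYTES = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

def weights_gb(params_billion, fmt):
    return params_billion * 1e9 * BYTES[fmt] / 1e9  # decimal GB

for name, params in [
    ("Granite 4.1-8B", 8),
    ("Granite 4.1-30B", 30),
    ("DeepSeek-V4-Flash (total)", 284),
]:
    line = ", ".join(f"{fmt}: ~{weights_gb(params, fmt):.0f} GB" for fmt in BYTES)
    print(f"{name:>26}  {line}")

# Note: DeepSeek-V4-Flash activates only 13B parameters per token, which cuts
# per-token compute, but the full 284B weights still need to be resident or
# sharded/offloaded across devices.
```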
Latency profiles diverge by use case: Granite 4.1 is designed without extended reasoning chains, so latency stays stable and predictable. DeepSeek-V4 in Think Max mode requires at least 384,000 tokens of available context per the DeepSeek blog, a constraint that must be explicitly budgeted for real-time or high-throughput applications.
What This Means for a Multi-Model Architecture
The convergence of these three publications within two weeks reflects a structural dynamic: the open-weight market is segmenting by functional use case, not by model size. Teams attempting to cover all their needs with a single generalist model accumulate compounding trade-offs — in memory, latency, reasoning depth, or supported modalities.
A pragmatic multi-model architecture for 2026 distinguishes three separate layers (a minimal routing sketch follows the list):
- Structured and multilingual layer (RAG, document generation, tool calling, sector assistants): Granite 4.1-8B or 30B under Apache 2.0, in FP8 for maximum GPU density.
- Multimodal layer (long audio, video, rich PDFs, GUI-based agents): Nemotron 3 Nano Omni 30B-A3B, deployed in NVFP4 to contain memory footprint.
- Long-range agentic layer (coding agents, multi-step workflows, million-token analysis): DeepSeek-V4-Flash for cost efficiency, V4-Pro for maximum reasoning depth.
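As a concrete illustration, a routing layer over those three tiers can stay very small. The sketch below uses placeholder model ids and a deliberately naive decision rule; a production router would also weigh latency budgets and cost per request.

```python
# Illustrative three-tier router. Model ids are placeholders.
from dataclasses import dataclass

@dataclass
class Request:
    modalities: set            # e.g. {"text"}, {"text", "audio"}, {"text", "video"}
    needs_agentic_depth: bool  # multi-step tool use, million-token analysis
    context_tokens: int

def route(req: Request) -> str:
    if req.modalities - {"text"}:
        return "nemotron-3-nano-omni-30b-a3b"   # audio / video / rich documents / GUI agents
    if req.needs_agentic_depth or req.context_tokens > 512_000:
        return "deepseek-v4-flash"              # long-range agentic workloads
    return "granite-4.1-8b-instruct-fp8"        # structured, multilingual, low-latency default

print(route(Request({"text", "audio"}, False, 20_000)))  # -> nemotron-3-nano-omni-30b-a3b
print(route(Request({"text"}, True, 800_000)))           # -> deepseek-v4-flash
print(route(Request({"text"}, False, 4_000)))            # -> granite-4.1-8b-instruct-fp8
```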
This segmentation is not theoretical — it is dictated by published benchmarks. Nemotron Omni claims no score on BFCL v3. Granite 4.1 does not handle five hours of audio. DeepSeek-V4 is not engineered for low-cost multilingual generation on constrained GPU budgets. Each model performs best in its lane precisely because it did not attempt to cover the others.
Three Levers to Activate This Week
- Map input modalities across your current workflows — text only, PDF, audio, video, GUI — to determine whether Nemotron Omni enters the scope before any infrastructure testing begins.
- Run Granite 4.1-8B instruct in FP8 against your existing structured use cases (tool calling, JSON generation, multilingual RAG) and benchmark latency and GPU memory cost against the model currently in production (a minimal timing harness is sketched after this list).
- Evaluate DeepSeek-V4-Flash on an internal coding or agentic benchmark: at 80.6% on SWE-Verified, the model sits in frontier territory for that use case at open-weight cost — the infrastructure trade-off deserves a direct measurement.
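For the second lever, a like-for-like comparison does not require heavy tooling. The sketch below times sequential requests against an OpenAI-compatible endpoint; the endpoint, model id, and prompts are placeholders to be swapped for your production workload.

```python
# Minimal latency probe against an OpenAI-compatible endpoint (vLLM, TGI, etc.).
# Endpoint, model id, and prompts are placeholders; run the same prompt set
# against the incumbent model for a like-for-like comparison.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPTS = ["Summarise the attached clause in one sentence: ..."] * 20  # your real workload here

latencies, tokens = [], 0
for prompt in PROMPTS:
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="ibm-granite/granite-4.1-8b-instruct",  # hypothetical model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - t0)
    tokens += resp.usage.completion_tokens

latencies.sort()
print(f"p50 latency: {latencies[len(latencies) // 2]:.2f}s")
print(f"p95 latency: {latencies[int(len(latencies) * 0.95)]:.2f}s")
print(f"throughput:  {tokens / sum(latencies):.1f} output tokens/s (sequential)")
```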
In Your Current Stack, Which of These Three Gaps Is Most Pressing?
If this analysis speaks to you, I publish a piece of this calibre every day on digital innovation and enterprise AI. 👉 Get the next one straight in your inbox — sign-up takes ten seconds, and each edition is read before 9 a.m. by leaders of European SMEs, mid-caps and public institutions.