Nemotron 3.5, Mellum2, Holo3.1: The Week Enterprise AI Stopped Looking for a Single Model

TL;DR. Between 1 and 4 June 2026, NVIDIA, JetBrains and H Company each published an open model on Hugging Face — Nemotron 3.5 Content Safety (96.5% multilingual safety F1), Mellum2 (2x+ faster inference via MoE), and Holo3.1 (79.3% on AndroidWorld). Three enterprise stack layers, none claiming the full spectrum. That segmentation is the strategy.

One week, three releases: why the segmentation matters

The week of 1 June 2026 produced three distinct open-model launches on Hugging Face, each targeting a different pressure point in enterprise AI deployments. JetBrains released Mellum2 on 1 June: a 12B Mixture-of-Experts model that activates only 2.5B parameters per token, per the official JetBrains announcement on Hugging Face. H Company followed on 2 June with Holo3.1, a computer-use agent family spanning four sizes (0.8B to 35B parameters), built for multi-environment automation across web, desktop, mobile and business software. NVIDIA closed the sequence on 4 June with Nemotron 3.5 Content Safety, a 4B multimodal safety classifier that runs on a single 8GB GPU and covers 12 explicitly trained languages plus approximately 140 in zero-shot mode, per the official NVIDIA publication on Hugging Face.

Taken individually, each is a product announcement. Taken together, they mark a structural shift: specialisation, not generalisation, is becoming the dominant open-model strategy for enterprise AI.

Are open specialised models ready to replace frontier APIs in enterprise deployments?

Not as wholesale substitutes — but as structural components of a tiered architecture. Each of the three models targets a layer where frontier APIs are either over-specified, too costly, or insufficiently auditable for regulated industries.

Where Nemotron 3.5 Content Safety leads: the compliance and content safety layer

On multilingual safety classification, Nemotron 3.5 Content Safety achieves 96.5% harmful-content F1 on the multilingual Aegis benchmark across 12 languages, and 88.8% on RTP-LX, according to the official NVIDIA announcement. The model averages approximately 85% across seven multimodal benchmarks including VLGuard, MM-SafetyBench, PolyGuard, XSafety, MultiJail, Dynaguardrail and CoSA.

Two operational differentiators set it apart from competing safety classifiers. First, end-to-end latency runs 3x lower than comparable multimodal safety models, per the same source. Second, THINK mode, which generates auditable step-by-step reasoning traces, consumes 50% fewer tokens than alternative reasoning-enabled safety models, making compliance audit trails viable at scale. A third capability matters for regulated sectors such as financial services, healthcare and children's education: custom policy injection at inference time, which lets teams supply domain-specific definitions of what constitutes a violation.

At 4B parameters, the model runs on a single 8GB GPU. It is released under the NVIDIA Open Model License, which covers research and commercial use.

Where Mellum2 and Holo3.1 hold the line

Mellum2: the orchestration and inference speed layer

JetBrains designed Mellum2 as a component model, not a monolithic one. The 12B Mixture-of-Experts architecture activates only 2.5B parameters per token, delivering what the official JetBrains announcement describes as 2x+ faster inference than comparably sized models. Documented use cases — routing, RAG pipeline post-processing, sub-agent planning and IDE-integrated code completion — position it as the lightweight backbone of a larger multi-model system rather than a standalone assistant.
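To make the sparse-activation arithmetic concrete, here is a minimal sketch. It uses only the 12B total and 2.5B active figures quoted above. The rule of thumb of roughly 2 FLOPs per active parameter per token is a common approximation and an assumption here, not a JetBrains figure, and it ignores attention cost, memory bandwidth and routing overhead.

```python
# Back-of-the-envelope view of sparse activation in a Mixture-of-Experts model.
# TOTAL_PARAMS and ACTIVE_PARAMS come from the article; the 2-FLOPs-per-parameter
# rule of thumb is an assumed approximation for a forward pass.

TOTAL_PARAMS = 12e9      # all experts combined
ACTIVE_PARAMS = 2.5e9    # parameters actually used for one token


def active_fraction(total: float, active: float) -> float:
    """Share of the model's weights touched when processing one token."""
    return active / total


def forward_flops_per_token(active: float) -> float:
    """Rough forward-pass compute per token: about 2 FLOPs per active parameter."""
    return 2.0 * active


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    dense_ratio = forward_flops_per_token(TOTAL_PARAMS) / forward_flops_per_token(ACTIVE_PARAMS)
    print(f"Active share of weights per token: {frac:.1%}")
    print(f"Theoretical compute ratio vs a dense 12B model: {dense_ratio:.1f}x")
```

The compute ratio is a theoretical ceiling, not a prediction. The 2x+ figure in the announcement is the claim to benchmark against, and a sparse model typically still needs all of its weights loaded in memory.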

The Apache 2.0 licence removes friction for commercial self-hosting, directly relevant for organisations handling proprietary code or sensitive internal data.

Holo3.1: the computer-use and local automation layer

H Company built Holo3.1 to operate software interfaces the way a human operator would. The 35B-A3B variant scores 79.3% on the AndroidWorld mobile automation benchmark, up from 67% for the previous generation, per the official H Company announcement. The 4B and 9B variants reach 72% on the same benchmark, up from 58%. Across internal benchmarks covering e-commerce, business software and collaboration tools, Holo3.1 shows a 25% improvement over its predecessor.

The key operational differentiator is local execution. Holo3.1 models are available in quantised formats (FP8, NVFP4 W4A16, Q4 GGUF) for consumer hardware on Windows, macOS and Apple Silicon. The NVFP4 format delivers 1.74x the throughput of BF16, per the official announcement, and the compound end-to-end speedup reaches approximately 2x once agent harness optimisations are included. For organisations with strict data-residency requirements, a fully local computer-use pipeline with no external API calls is now technically accessible.
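For a rough sense of why quantisation matters on consumer hardware, the sketch below estimates weight-only memory for three of the sizes named in this article. The bits-per-weight values are nominal assumptions (16 for BF16, 8 for FP8, 4 for the 4-bit formats). They ignore quantisation scales, KV cache, activations and runtime overhead, so real footprints will be higher.

```python
# Weight-only memory estimate for different numeric formats.
# Nominal bits per weight are assumptions; real quantised files also store
# scaling metadata, and inference needs extra memory for KV cache and activations.

BITS_PER_WEIGHT = {
    "BF16": 16,
    "FP8": 8,
    "4-bit (NVFP4 / Q4)": 4,
}

# Parameter counts taken from model sizes mentioned in the article.
MODEL_SIZES = {"4B": 4e9, "9B": 9e9, "35B": 35e9}


def weight_memory_gb(params: float, bits: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    for name, params in MODEL_SIZES.items():
        row = ", ".join(
            f"{fmt}: {weight_memory_gb(params, bits):.1f} GB"
            for fmt, bits in BITS_PER_WEIGHT.items()
        )
        print(f"{name} -> {row}")
```

Under these assumptions, a 4-bit format needs about a quarter of the BF16 weight memory, which is the difference between a model fitting on a laptop-class device or not.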

Pricing and operational implications

All three models are open and self-hostable, with distinct licence terms. Mellum2 carries Apache 2.0 — the least restrictive, suitable for commercial productisation. Nemotron 3.5 operates under the NVIDIA Open Model License, covering research and commercial use under NVIDIA's standard terms. Holo3.1's licence terms are published on H Company's Hugging Face collection; enterprise teams should verify the conditions for their specific deployment context before any production commitment.

The cost argument for open specialised models is strongest at high throughput. A safety classifier running at 3x lower latency than alternatives, or an orchestration model activating only 2.5B parameters per token, changes the unit economics of AI-mediated processes at millions of calls per day.
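A toy model shows how that argument plays out at volume. Every number in this sketch (daily calls, GPU hourly rate, per-call latency, concurrency) is a placeholder assumption, not a measured or published figure. It also assumes lower latency translates directly into less GPU time per call, which batching and hardware utilisation can change in practice.

```python
# Illustrative unit-economics model for a self-hosted classifier.
# Every number below is a placeholder assumption, not a measured or published figure.


def gpu_hours_per_day(calls_per_day: float, seconds_per_call: float, concurrency: int = 1) -> float:
    """GPU time needed per day if each GPU serves `concurrency` calls at once."""
    return calls_per_day * seconds_per_call / concurrency / 3600


def daily_cost(calls_per_day: float, seconds_per_call: float,
               gpu_hourly_rate: float, concurrency: int = 1) -> float:
    """Daily compute cost in the currency of `gpu_hourly_rate`."""
    return gpu_hours_per_day(calls_per_day, seconds_per_call, concurrency) * gpu_hourly_rate


if __name__ == "__main__":
    calls = 5_000_000          # hypothetical daily volume
    rate = 1.50                # hypothetical cost of one GPU-hour
    baseline_latency = 0.30    # hypothetical seconds per call
    faster_latency = baseline_latency / 3  # the "3x lower latency" case

    for label, latency in [("baseline", baseline_latency), ("3x lower latency", faster_latency)]:
        cost = daily_cost(calls, latency, rate, concurrency=8)
        print(f"{label}: {cost:,.2f} per day")
```

Replacing the placeholders with measured latency and actual infrastructure prices turns this into a first-pass comparison against per-call API pricing.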

What this means for a multi-model architecture

The three releases converge on a single architectural signal: the enterprise AI stack is becoming a pipeline of specialised models, each handling the layer it was optimised for, rather than a single frontier model handling everything. Nemotron 3.5 Content Safety sits at the safety and compliance gate. Mellum2 occupies the routing, RAG post-processing and sub-agent planning layer. Holo3.1 takes the human-interface automation layer, the outermost execution layer that touches software directly.

Assembling these layers requires explicit decisions about handoff protocols, latency budgets and audit requirements at each boundary. It is not simpler than a single API — but for organisations facing regulatory constraints, data-residency mandates or high-volume workloads, the trade-off is increasingly worth the complexity.
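The boundary decisions described above can be sketched as a small pipeline skeleton. This is an illustration under stated assumptions, not any vendor's integration: the three stage functions are hypothetical stubs standing in for a safety classifier, an orchestration model and a computer-use agent, and the latency budgets are arbitrary example values.

```python
# Sketch of a tiered pipeline with a latency budget and audit record at each
# boundary. The three stage functions are hypothetical stubs; no real model
# is called anywhere in this example.
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]
    budget_s: float  # latency budget for this layer (assumed value)


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, stage: str, elapsed_s: float, within_budget: bool) -> None:
        self.entries.append(
            {"stage": stage, "elapsed_s": elapsed_s, "within_budget": within_budget}
        )


def safety_gate(request: str) -> str:
    # Stub policy check: a real deployment would call a safety classifier here.
    if "forbidden" in request.lower():
        raise PermissionError("request blocked by safety gate")
    return request


def route(request: str) -> dict:
    # Stub router: decides which downstream tool handles the request.
    return {"tool": "ui_automation", "instruction": request}


def execute_action(plan: dict) -> str:
    # Stub executor: a computer-use agent would act on the interface here.
    return f"executed via {plan['tool']}: {plan['instruction']}"


def run_pipeline(request: str, stages: list, log: AuditLog) -> Any:
    payload: Any = request
    for stage in stages:
        start = time.perf_counter()
        payload = stage.run(payload)  # handoff: output of one layer feeds the next
        elapsed = time.perf_counter() - start
        log.record(stage.name, elapsed, elapsed <= stage.budget_s)
    return payload


STAGES = [
    Stage("safety", safety_gate, budget_s=0.2),
    Stage("orchestration", route, budget_s=0.5),
    Stage("execution", execute_action, budget_s=5.0),
]

if __name__ == "__main__":
    log = AuditLog()
    print(run_pipeline("Export last month's invoices", STAGES, log))
    for entry in log.entries:
        print(entry)
```

Keeping each layer behind the same `Stage` interface is a design choice that makes it possible to swap a frontier API for a self-hosted model at one boundary without touching the others. A blocked request raises before any downstream layer runs.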

Three levers to activate this week

Map your AI stack against the three layers. Identify which current processes involve safety classification, code orchestration or interface automation. Document where a specialised open model could replace or complement an existing frontier API call.
Run a latency and cost audit on your content safety pipeline. If content moderation or policy enforcement is currently handled by a frontier model, benchmark Nemotron 3.5 Content Safety — starting with the 8GB GPU configuration and THINK mode for any compliance-relevant output.
Prototype a local computer-use workflow with Holo3.1. Download the 4B or 9B quantised variant and test it on one repetitive software interaction in your environment. The 72% AndroidWorld score and the 25% improvement on business software are a starting baseline — your specific environment will determine the real-world utility.
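For the latency audit in the second lever, a minimal harness like the following can be pointed at any callable. `classify_stub` is a hypothetical placeholder to be swapped for the real call to the system under test, whether a frontier API client or a self-hosted model. The percentile uses a simple nearest-rank method, and timings cover wall-clock time for serial calls only.

```python
# Minimal latency benchmark for any callable classifier or agent step.
# `classify_stub` is a hypothetical placeholder, not a real model client.
import statistics
import time
from typing import Callable, Sequence


def measure_latencies(fn: Callable[[str], object], inputs: Sequence[str], warmup: int = 2) -> list:
    """Return per-call wall-clock latencies in seconds, after a short warm-up."""
    for text in inputs[:warmup]:
        fn(text)  # warm caches and connections; results discarded
    latencies = []
    for text in inputs:
        start = time.perf_counter()
        fn(text)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarise(latencies: Sequence[float]) -> dict:
    """Median and nearest-rank 95th-percentile latency, plus serial throughput."""
    ordered = sorted(latencies)
    p95_index = (95 * len(ordered) + 99) // 100 - 1  # ceil(0.95 * n) - 1
    total = sum(ordered)
    return {
        "p50_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "serial_calls_per_s": len(ordered) / total if total > 0 else float("inf"),
    }


def classify_stub(text: str) -> str:
    # Placeholder logic so the harness runs end to end without a model.
    return "unsafe" if "attack" in text.lower() else "safe"


if __name__ == "__main__":
    samples = ["hello world", "how to plan an attack", "quarterly report"] * 10
    print(summarise(measure_latencies(classify_stub, samples)))
```

Running the same input set through the current pipeline and the candidate model gives comparable p50 and p95 figures. Classification quality still needs a separate evaluation on labelled data.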

Which layer of your stack is still handled by a frontier API that a specialised open model could run more efficiently?

Sources

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI (Hugging Face)
Holo3.1: Fast & Local Computer Use Agents (Hugging Face)
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains (Hugging Face)