Towards speed-of-light text generation with Nemotron-Labs diffusion language models

At a glance:

  • Nemotron-Labs Diffusion offers three generation modes—autoregressive, diffusion, and self‑speculation—within a single model family.
  • The 8B diffusion model delivers 2.6× the tokens per forward pass of standard AR decoding, and self‑speculation reaches up to 6.4×.
  • NVIDIA releases 3B, 8B and 14B text models under the NVIDIA Nemotron Open Model License, plus an 8B vision‑language variant under the NVIDIA Source Code License, with training code via Megatron Bridge.

What Nemotron‑Labs Diffusion is

Nemotron‑Labs Diffusion is a new family of diffusion language models (DLMs) that generate text in parallel blocks and then iteratively refine those blocks. Unlike traditional autoregressive (AR) models, which emit one token at a time, DLMs can draft multiple tokens simultaneously, so modern GPUs spend more of each forward pass on computation rather than on memory fetches. The approach also introduces a natural revision mechanism: generated tokens can be updated in later refinement steps, which reduces the propagation of early mistakes.

The research builds on the Efficient‑DLM concept, which showed that a pretrained AR model can be converted into a diffusion model by continuing pre‑training with a block‑wise attention scheme. This preserves the strengths of the original AR model—stability, KV‑cache friendliness, and strong baseline accuracy—while unlocking parallel decoding. NVIDIA’s implementation adds a joint AR‑diffusion objective, letting the same checkpoint serve both paradigms.
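To make the block‑wise attention idea concrete, here is a minimal sketch of one plausible mask layout. It assumes that tokens attend bidirectionally inside their own block and causally to all earlier blocks; the article does not spell out the exact scheme, and the function name and the tiny sizes are hypothetical, chosen only for illustration.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed rule (not confirmed by the article): a position sees every
    position in its own block and in all earlier blocks, but nothing in
    later blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask for three blocks of two tokens each.
    for row in block_attention_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, earlier blocks never look at later ones, which is what would keep their keys and values reusable from a KV cache while the current block is still being refined.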

Three generation modes in one model

Nemotron‑Labs Diffusion supports three distinct inference pathways:

  1. Autoregressive mode – behaves like any conventional left‑to‑right LLM, useful for compatibility checks or workloads that demand deterministic token‑by‑token output.
  2. Diffusion mode – fills a 32‑token block at a time, iteratively denoising until a confidence threshold marks tokens as “good enough.” This mode maximises raw throughput.
  3. Self‑speculation mode – drafts a block bidirectionally with diffusion, then verifies each token using a fast AR pass. Linear self‑speculation yields a 6× speed boost, while quadratic self‑speculation reaches 6.4×, all with accuracy comparable to the AR baseline.
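The confidence‑threshold loop behind diffusion mode can be sketched with a toy stand‑in for the model. Everything here is an assumption for illustration: `toy_predict`, its confidence formula, the fallback that commits the single most confident token when nothing clears the threshold, and the 8‑token block (the article describes 32‑token blocks) are hypothetical and not taken from NVIDIA's implementation.

```python
from typing import Callable, Optional

MASK = None  # marks a position that has not been committed yet

Block = list[Optional[str]]
Predictor = Callable[[Block], list[tuple[str, float]]]


def denoise_block(predict: Predictor, block_size: int, threshold: float) -> tuple[Block, int]:
    """Fill one block by repeated parallel prediction; return (tokens, forward passes)."""
    block: Block = [MASK] * block_size
    passes = 0
    while any(tok is MASK for tok in block):
        proposals = predict(block)  # one forward pass scores every position
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        confident = [i for i in masked if proposals[i][1] >= threshold]
        if not confident:
            # Assumed fallback: commit the best remaining guess so the loop always advances.
            confident = [max(masked, key=lambda i: proposals[i][1])]
        for i in confident:
            block[i] = proposals[i][0]
    return block, passes


def toy_predict(block: Block) -> list[tuple[str, float]]:
    """Hypothetical model: confidence rises as more of the block is committed."""
    filled = sum(tok is not MASK for tok in block)
    return [
        (f"t{i}", min(1.0, 0.5 + 0.1 * filled + (0.45 if i % 4 == 0 else 0.0)))
        for i in range(len(block))
    ]


if __name__ == "__main__":
    tokens, passes = denoise_block(toy_predict, block_size=8, threshold=0.85)
    print(tokens, f"{len(tokens) / passes:.1f} tokens per forward pass")
```

Lowering the threshold commits more tokens per pass at the risk of keeping weaker guesses, which is the speed‑accuracy dial the mode exposes.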

Switching between these modes requires only a single configuration flag at deployment time, so developers can keep their existing application code and compare speed‑accuracy trade‑offs from one deployment to the next.

Performance highlights

In NVIDIA’s internal benchmarks, the Nemotron‑Labs Diffusion 8B model achieved an average accuracy improvement of 1.2% over the competing Qwen3 8B model. Measured in tokens per forward pass (TPF), diffusion mode delivered a 2.6× increase over standard AR decoding. Self‑speculation went further: linear self‑speculation attained a 6× TPF gain and quadratic self‑speculation reached 6.4×, both with comparable task‑level accuracy across the evaluated suite.
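To show how a draft‑then‑verify step turns into a TPF figure, the sketch below uses the greedy acceptance rule familiar from speculative decoding: keep draft tokens until the first disagreement with the verifier, then take the verifier's token. Whether Nemotron‑Labs Diffusion uses exactly this rule, and whether TPF counts one draft pass plus one verify pass, are assumptions; the function names and the sample tokens are hypothetical.

```python
def accept_prefix(draft: list[str], verified: list[str]) -> list[str]:
    """Keep draft tokens up to the first mismatch, then use the verifier's token."""
    accepted: list[str] = []
    for drafted, checked in zip(draft, verified):
        if drafted != checked:
            accepted.append(checked)  # the verifier's correction ends this round
            break
        accepted.append(drafted)
    return accepted


def tokens_per_forward_pass(tokens_emitted: int, forward_passes: int) -> float:
    """TPF as a plain ratio of emitted tokens to model forward passes."""
    return tokens_emitted / forward_passes


if __name__ == "__main__":
    draft = ["the", "cat", "sat", "on", "a"]        # proposed by the diffusion draft
    verified = ["the", "cat", "sat", "in", "the"]   # what the AR pass would emit
    kept = accept_prefix(draft, verified)
    # Assumed accounting: one draft pass plus one verify pass for this round.
    print(kept, tokens_per_forward_pass(len(kept), forward_passes=2))
```

Because every emitted token is either confirmed or supplied by the AR pass, output under this rule matches what greedy AR decoding would produce, which is consistent with the comparable accuracy reported above.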

On a B200 GPU running the speedbench dataset, the self‑speculation (LinearSpec) configuration reached roughly 865 tokens per second—about four times the AR baseline on identical hardware. These numbers illustrate how parallel block generation can tap the full computational bandwidth of modern GPUs, especially in latency‑sensitive, batch‑size‑one scenarios.
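A quick back‑of‑the‑envelope check relates these figures. It assumes "about four times" means exactly 4× and that the 6× TPF gain applies to the same workload as the 865 tokens‑per‑second measurement, which the article does not confirm; the helper names are hypothetical.

```python
def implied_baseline(tokens_per_second: float, speedup: float) -> float:
    """Baseline throughput implied by a measured throughput and a speedup factor."""
    return tokens_per_second / speedup


def relative_pass_cost(tpf_gain: float, throughput_gain: float) -> float:
    """Average cost of one forward pass relative to one AR pass.

    Follows from throughput = tokens_per_pass * passes_per_second.
    """
    return tpf_gain / throughput_gain


if __name__ == "__main__":
    ar_tps = implied_baseline(865.0, 4.0)
    print(f"implied AR baseline: {ar_tps:.0f} tokens/s")
    print(f"implied cost per pass: {relative_pass_cost(6.0, 4.0):.2f}x an AR pass")
```

Read this way, the gap between a 6× TPF gain and a roughly 4× wall‑clock gain would mean each draft‑and‑verify pass costs more than a single AR step, so TPF is best treated as an upper bound on end‑to‑end speedup.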

How the models were trained

All Nemotron‑Labs Diffusion models were first pretrained on 1.3 trillion tokens from NVIDIA’s Nemotron pre‑training corpora. After this AR‑focused stage, the models underwent a joint AR‑diffusion fine‑tuning phase using an additional 45 billion tokens drawn from the Nemotron post‑training datasets. This two‑stage regimen allowed the models to retain the strong language understanding of the original AR checkpoint while acquiring the parallel drafting capability of diffusion.

Training used the NVIDIA Megatron Bridge framework, which provides a unified codebase for both AR and diffusion objectives. The resulting text models, available at 3B, 8B, and 14B parameter scales, are released under the commercially friendly NVIDIA Nemotron Open Model License, and the 8B vision‑language variant is released under the NVIDIA Source Code License, encouraging broad research and commercial adoption.

Deployment and inference through SGLang

Support for Nemotron‑Labs Diffusion is being added to the main branch of SGLang, an open‑source high‑performance serving framework. Developers can select the desired mode with a single line in the SGLang configuration:

  • ar_mode=true for plain autoregressive decoding.
  • diffusion_mode=true (FastDiffuser) for block‑wise diffusion.
  • self_speculation=true (LinearSpec) for the bidirectional draft‑then‑verify workflow.
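Since the three flags are mutually exclusive choices, a deployment script might want to check that exactly one is set before launching. The sketch below is not SGLang's API: it only reuses the flag names quoted above, and the `Mode` enum, the `select_mode` helper, and the plain‑dict configuration are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    """Flag names as quoted in the article; how SGLang parses them is not shown there."""
    AR = "ar_mode"
    DIFFUSION = "diffusion_mode"
    SELF_SPECULATION = "self_speculation"


def select_mode(config: dict[str, bool]) -> Mode:
    """Return the single enabled mode, or raise if zero or several are enabled."""
    enabled = [mode for mode in Mode if config.get(mode.value, False)]
    if len(enabled) != 1:
        names = [mode.value for mode in enabled] or ["none"]
        raise ValueError(f"exactly one mode flag must be true, got: {', '.join(names)}")
    return enabled[0]


if __name__ == "__main__":
    print(select_mode({"self_speculation": True}))
```

Failing fast on an ambiguous configuration keeps a benchmark run from silently measuring the wrong decoding path.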

The integration enables serving the same checkpoint in three ways without duplicating model files, simplifying operations and reducing storage overhead. At the time of writing, the integration is tracked in an open request on the GitHub issue tracker, and NVIDIA plans to merge full support into the upcoming SGLang release.

Getting started

Developers interested in experimenting can pull the Nemotron‑Labs Diffusion models from Hugging Face, review the technical report for deeper architectural details, and follow the publicly available training recipe on GitHub. Because the models retain full AR compatibility, existing pipelines can be upgraded incrementally, testing diffusion or self‑speculation modes only where latency or throughput gains are most needed.

FAQ

What generation modes does Nemotron‑Labs Diffusion support?
The model family offers three modes: standard autoregressive decoding, diffusion mode that fills 32‑token blocks through iterative denoising, and self‑speculation mode, which drafts blocks with diffusion and then verifies them with a fast AR pass. Each mode is selectable via a single configuration flag in SGLang.

How does the performance of the 8B diffusion model compare to traditional AR models?
In NVIDIA’s benchmarks, the 8B diffusion model achieved 2.6× the tokens per forward pass of an AR baseline, while self‑speculation reached up to 6.4×. Accuracy remained comparable, with a 1.2% average improvement over the competing Qwen3 8B model.

How can developers deploy Nemotron‑Labs Diffusion models?
Deployment is handled through SGLang. By setting `ar_mode=true`, `diffusion_mode=true` (FastDiffuser), or `self_speculation=true` (LinearSpec) in the SGLang config, the same checkpoint can serve any of the three inference styles. The integration is tracked in an open request on the GitHub issue tracker and is planned to be merged into the main SGLang branch.
