NVIDIA released the Nemotron-Labs Diffusion model family on May 23, 2026, introducing language models that generate text up to 6.4 times faster than standard autoregressive models. The 8-billion-parameter version achieves a 1.2% accuracy improvement over Qwen3 8B while reaching 865 tokens per second on NVIDIA B200 hardware in self-speculation mode, according to the company's technical report. The release includes base and chat-tuned models at 3B, 8B, and 14B scales under the commercially-friendly NVIDIA Nemotron Open Model License.
The new models break from the token-by-token generation that has defined large language models since GPT-2. Standard LLMs produce one token at a time, each dependent on the previous one. That autoregressive approach works, but it leaves GPU compute idle while model weights are loaded from memory.
Nemotron-Labs Diffusion instead generates multiple tokens in parallel and refines them over several steps. The result is higher throughput, especially for small batch sizes where autoregressive models struggle to saturate GPU cores.
NVIDIA's technical report details three generation modes built into the same checkpoint. Autoregressive mode operates like any causal language model, preserving compatibility with existing pipelines. Diffusion mode fills one 32-token block at a time, iteratively denoising it and committing tokens once a confidence threshold is met. Self-speculation mode drafts a block bidirectionally, then verifies it causally, keeping only the tokens that match the autoregressive path.
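The draft-then-verify idea behind self-speculation mode can be illustrated with a toy sketch. Everything here is hypothetical: `causal_next` stands in for greedy autoregressive decoding, `noisy_draft` stands in for the bidirectional drafter, and the corrective token appended at the first mismatch is the usual speculative-decoding convention rather than something the article specifies. A real implementation would verify a whole block in one forward pass instead of one call per position.

```python
from typing import List, Tuple

Token = int


def causal_next(prefix: List[Token]) -> Token:
    """Toy stand-in for greedy (temperature-zero) autoregressive decoding."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def noisy_draft(prefix: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for the drafter: deliberately wrong at every fourth position."""
    draft: List[Token] = []
    ctx = list(prefix)
    for i in range(block_size):
        tok = causal_next(ctx)
        if i % 4 == 3:
            tok = (tok + 1) % 50  # inject a drafting error
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify_block(prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Keep drafted tokens only while they match the causal path."""
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in draft:
        target = causal_next(ctx)
        if tok != target:
            accepted.append(target)  # assumption: replace the first mismatch
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def generate_self_spec(prompt: List[Token], n_new: int,
                       block_size: int = 8) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return the number of draft/verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify_block(out, noisy_draft(out, block_size)))
        rounds += 1
    return out[: len(prompt) + n_new], rounds


def generate_autoregressive(prompt: List[Token], n_new: int) -> List[Token]:
    """Reference path: one token per step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(causal_next(out))
    return out


if __name__ == "__main__":
    spec, rounds = generate_self_spec([1, 2, 3], 20)
    print(spec == generate_autoregressive([1, 2, 3], 20), rounds)
```

Because every kept token is checked against the causal path, the sketch's output matches plain greedy decoding while needing fewer rounds than tokens; how many fewer depends entirely on how often the drafter is right.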
In self-speculation mode at temperature zero, output is lossless compared to pure autoregressive generation. "This flexible design is the key developer-facing feature where speed and accuracy both matter, even at workloads with unpredictable batch sizes, or those with a single query," the company stated in its blog post. Switching modes requires a single configuration line. SGLang, an open-source inference framework, will support the models in its main branch, with early access available through a GitHub issue tracker.
Speed numbers tell the story. In diffusion mode, the 8B model reaches 2.6 times more tokens per forward pass than autoregressive baselines. Linear self-speculation pushes that to 6×, quadratic to 6.4×.
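The "tokens per forward pass" metric is easiest to picture in diffusion mode, where a block starts fully masked and positions are committed once their confidence clears a threshold. The sketch below is a toy under stated assumptions: `toy_denoiser` is a hypothetical stand-in that returns random tokens with confidence that rises each step, the 0.9 threshold is illustrative, and the fallback that commits the single most confident position when nothing clears the threshold is an assumption added to guarantee progress. A real model would also condition each pass on the tokens already committed.

```python
import random
from typing import List, Optional, Tuple

MASK = None  # marker for a position that has not been committed yet


def toy_denoiser(block: List[Optional[int]], step: int,
                 rng: random.Random) -> List[Tuple[int, float]]:
    """Hypothetical model call: one (token, confidence) guess per position."""
    preds = []
    for _ in range(len(block)):
        token = rng.randrange(1000)
        confidence = min(1.0, rng.random() + 0.1 * step)  # grows with each step
        preds.append((token, confidence))
    return preds


def fill_block(block_size: int = 32, threshold: float = 0.9,
               seed: int = 0) -> Tuple[List[Optional[int]], int]:
    """Denoise one block; return the filled block and the forward passes used."""
    rng = random.Random(seed)
    block: List[Optional[int]] = [MASK] * block_size
    passes = 0
    while any(tok is MASK for tok in block):
        preds = toy_denoiser(block, passes, rng)
        passes += 1
        open_positions = [i for i, tok in enumerate(block) if tok is MASK]
        ready = [i for i in open_positions if preds[i][1] >= threshold]
        if not ready:
            # Assumption: always commit at least the most confident position.
            ready = [max(open_positions, key=lambda i: preds[i][1])]
        for i in ready:
            block[i] = preds[i][0]
    return block, passes


if __name__ == "__main__":
    block, passes = fill_block()
    print(f"{len(block) / passes:.1f} tokens per forward pass")
```

An autoregressive decoder commits exactly one token per pass, so any run of this loop that finishes a 32-token block in fewer than 32 passes is ahead on that metric.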
On a B200 GPU, linear self-speculation hit 865 tokens per second on the Speedbench dataset, roughly four times the autoregressive baseline on the same hardware. The 14B model shows similar scaling, though NVIDIA has not yet published detailed latency benchmarks for that size.
Behind the performance lies a training recipe that blends autoregressive and diffusion objectives. The models were pretrained on 1.3 trillion tokens from the NVIDIA Nemotron Pretraining datasets, then fine-tuned on 45 billion tokens from the Nemotron Post-training datasets. This joint training allows the model to retain strong left-to-right generation while adding parallel drafting capability. The approach builds on recent research, including the Efficient-DLM paper, which showed that pretrained AR models can be converted to diffusion language models through continued pretraining and block-wise attention. "Diffusion language models have been promising for years, but they have historically had practical barriers: lower accuracy than strong AR models, more difficult training, and limited compatibility with KV caching," NVIDIA noted.
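The article does not spell out the exact attention pattern, but one common form of block-wise attention lets tokens see each other freely inside a block while staying causal across blocks. The sketch below builds such a mask in plain Python. The function name and the specific rule are assumptions for illustration, not NVIDIA's published implementation.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule: j is visible to i if j's block is the same as, or earlier
    than, i's block. Within a block attention is bidirectional; across
    blocks it is causal.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for row in block_causal_mask(seq_len=6, block_size=3):
        print("".join("x" if visible else "." for visible in row))
```

Under this rule, no position in a finished block ever attends to a later block, so the keys and values of finished blocks do not change as generation continues and can be cached, which is the property that makes parallel decoding compatible with a KV cache.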
The company credits the block-wise attention mechanism for making KV-cache-friendly parallel decoding possible.
Developers get access under two licenses. The 3B, 8B, and 14B text models use the NVIDIA Nemotron Open Model License, which permits commercial use. The 8B vision-language model, capable of processing images and text, is released under the NVIDIA Source Code License for research flexibility. Training code is available through the NVIDIA Megatron Bridge framework.
What this means in practice
Faster inference translates to cheaper, more responsive AI applications. A customer service chatbot that currently takes two seconds to respond could reply in half a second. Real-time translation apps could keep pace with natural speech. Developers running on tight budgets can serve more users with the same hardware.
The headline figures come with caveats. Benchmarks don't always match production workloads, and the 865 tokens-per-second figure comes from a controlled test on top-tier hardware. Most developers won't have B200 GPUs. But even on more modest hardware, the parallel generation approach reduces the memory bottleneck that slows down small-batch inference.
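The throughput figures above can be turned into rough latency arithmetic. The sketch takes the reported 865 tokens per second and the "roughly four times" ratio at face value, and assumes a hypothetical 400-token reply. It counts decode time only and ignores prompt processing, networking, and queueing, so it is an upper bound on the benefit rather than a prediction.

```python
def implied_baseline_tps(fast_tps: float, speedup: float) -> float:
    """Baseline throughput implied by a reported speedup."""
    return fast_tps / speedup


def decode_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to emit n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second


FAST_TPS = 865.0     # reported self-speculation throughput on B200
SPEEDUP = 4.0        # "roughly four times" the autoregressive baseline
REPLY_TOKENS = 400   # hypothetical reply length, not from the article

baseline_tps = implied_baseline_tps(FAST_TPS, SPEEDUP)  # 216.25 tokens/s

if __name__ == "__main__":
    slow = decode_seconds(REPLY_TOKENS, baseline_tps)
    fast = decode_seconds(REPLY_TOKENS, FAST_TPS)
    print(f"baseline: {slow:.2f} s, self-speculation: {fast:.2f} s")
```

The same ratio is what turns the article's two-second chatbot reply into a half-second one, provided decoding dominates the response time.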
On accuracy, the 8B model's 1.2% edge over Qwen3 8B comes from averaged benchmarks across code generation, math, summarization, and document understanding tasks. NVIDIA did not disclose the full benchmark suite, but the technical report includes comparisons on HumanEval, GSM8K, and other standard evaluations. The diffusion models match or exceed autoregressive baselines on most tasks, with the largest gains in code generation and fill-in-the-middle objectives, a natural fit for models that can revise previous tokens.
The release intensifies competition in the open-weight LLM space. Meta's Llama 4, Mistral's latest models, and Alibaba's Qwen series have pushed performance boundaries while keeping weights public. NVIDIA's entry adds a new dimension: generation speed as a first-class feature. For enterprises weighing deployment costs, throughput per dollar may matter as much as raw benchmark scores. "With Nemotron-Labs Diffusion, developers get a new way to draft, refine, verify, and accelerate text generation, without needing to alter their applications," the company said.
The integration with SGLang means developers can test the models using familiar tools. NVIDIA expects support in other inference frameworks to follow, though no timeline was provided.
Why It Matters: Faster language models lower the cost of deploying AI in latency-sensitive applications such as real-time translation, interactive coding assistants, and voice agents. For businesses running AI at scale, a 4× speedup could mean serving four times as many customers with the same infrastructure. The open license removes a barrier for startups and researchers who cannot afford proprietary API pricing. If diffusion-based generation proves reliable in production, it could shift how the industry thinks about model architecture, making parallel decoding a standard capability rather than a research curiosity.
What comes next. SGLang integration will make the models accessible to a broad developer base.
The research community will stress-test diffusion generation in real-world applications—code completion, document editing, multi-turn chat—where the ability to revise tokens could shine or stumble. NVIDIA's decision to release training recipes may spur other labs to experiment with joint AR-diffusion objectives. The bigger question is whether diffusion language models can scale to 70B parameters and beyond while maintaining the speed advantage.
If they can, the token-by-token era may have an expiration date.
Key Takeaways
— Nemotron-Labs Diffusion models generate text up to 6.4× faster than autoregressive models by drafting multiple tokens in parallel and refining them iteratively.
— The same checkpoint supports three modes: standard autoregressive, block-wise diffusion, and self-speculation that verifies drafted tokens against causal decoding.
— The 8B model achieves 865 tokens per second on NVIDIA B200 hardware in self-speculation mode, with a 1.2% accuracy improvement over Qwen3 8B.
— Models are open-weight under commercially-friendly licenses, with training code available through NVIDIA Megatron Bridge.
Source: Hugging Face - Blog