Researchers at Nvidia have developed an innovative approach for training large language models (LLMs) in a 4-bit quantized format while maintaining the stability and accuracy of high-precision models. Their technique, NVFP4, enables training models that not only outperform other leading 4-bit formats but match the performance of the larger 8-bit FP8 format, all while using half the memory and a fraction of the compute.
The success of NVFP4 shows that enterprises can continue to reduce inference costs by using leaner models that match the performance of larger ones. It also points to a future where the costs of LLM training will fall to a point where many more organizations will be able to train their own bespoke models from scratch, rather than just fine-tuning existing ones.
The Quantization Challenge
Model quantization is a technique used to reduce the computational and memory costs associated with running and training artificial intelligence models. It works by converting model parameters or weights from high-precision formats, such as 16- and 32-bit floating-point numbers (BF16 and FP32), to lower-precision formats. The key challenge of quantization is to reduce the size of the model while retaining as much of its knowledge and capabilities as possible.
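The core round-trip of quantization can be sketched in a few lines. The snippet below is a generic illustration of symmetric 4-bit integer quantization with a single per-tensor scale; the function names are illustrative, and this is not Nvidia's NVFP4 format itself:

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor quantization to 4-bit integers (-8..7)."""
    scale = np.max(np.abs(x)) / 7.0          # map the largest magnitude to 7
    q = np.clip(np.round(x / scale), -8, 7)  # snap onto the 16-value grid
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Recover approximate high-precision values from the 4-bit codes."""
    return q.astype(np.float32) * scale

x = np.array([0.02, -0.5, 1.3, -2.7, 0.9], dtype=np.float32)
q, s = quantize_int4(x)
x_hat = dequantize(q, s)  # close to x, but coarsened to 16 levels
```

The reconstruction error is bounded by half the grid spacing, which is why shrinking from 16 levels of precision down from thousands is where a model's "knowledge" can be lost.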
In recent years, 8-bit floating point (FP8) formats have become a popular industry standard, offering a good balance between performance and efficiency. They significantly reduce computational costs and memory requirements for LLM training without a major drop in accuracy.
The next logical step is 4-bit floating point (FP4), which promises to halve memory consumption again and further improve performance on advanced hardware. However, this transition has been a challenge. Existing 4-bit formats such as MXFP4 often struggle to maintain the same level of accuracy as their 8-bit counterparts, forcing a difficult trade-off between cost and performance.
How NVFP4 works
NVFP4 overcomes the stability and accuracy challenges of other FP4 techniques with a smarter design and focused training methodology. The key problem with 4-bit precision is its extremely restricted range: it can only represent 16 different values. When converting from a high-precision format, outliers can distort the entire dataset, negatively impacting the accuracy of the model. NVFP4 uses a more sophisticated, multi-level scaling approach that better deals with these outliers, allowing “a more precise and accurate representation of tensor values during training,” Nvidia says.
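The effect of block-wise scaling on outliers can be illustrated with a small sketch. The code below quantizes onto the 16-value FP4 (E2M1) grid, whose positive magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}, using an independent scale per small block so an outlier distorts only its own block. This shows the general idea of multi-level scaling only; NVFP4's exact scaling scheme is Nvidia's own design and is not reproduced here:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1); with a sign
# bit this gives the format's 16 distinct values.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x, block=16):
    """Quantize onto the FP4 grid with one scale per block of `block` values."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / 6.0  # per-block scale
    scales = np.where(scales == 0, 1.0, scales)          # avoid divide-by-zero
    scaled = x / scales
    # Snap each magnitude to the nearest FP4 grid point, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales

def dequantize(q, scales):
    return q * scales

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
x[0, 0] = 50.0  # an outlier flattens its own block, but no others
q, s = quantize_fp4_blockwise(x)
x_hat = dequantize(q, s).reshape(x.shape)
```

With a single per-tensor scale, the lone outlier of 50 would crush every other value toward zero; with per-block scales, the remaining blocks keep their resolution.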
In addition to the format, the researchers introduce a 4-bit training recipe that provides accuracy comparable to FP8. Its main element is a "mixed precision strategy." Instead of converting the entire model to NVFP4, most layers are quantized while a small fraction of numerically sensitive layers are kept in a higher-precision format such as BF16. This maintains stability where it matters most. The methodology also adjusts how gradients are calculated during backpropagation to reduce systematic errors that can accumulate from low-precision arithmetic.
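A mixed-precision plan of this kind amounts to a simple layer-by-layer decision. The sketch below is purely illustrative: the layer names and the choice of which layers count as "sensitive" are assumptions for the example, not Nvidia's published recipe:

```python
# Hypothetical set of numerically sensitive layers kept in high precision.
# Which layers actually qualify is determined empirically during training.
SENSITIVE = {"embedding", "final_projection"}

def choose_precision(layer_names):
    """Assign BF16 to sensitive layers and FP4 to everything else."""
    return {name: ("bf16" if name in SENSITIVE else "fp4")
            for name in layer_names}

plan = choose_precision(["embedding", "attn_0", "mlp_0",
                         "attn_1", "mlp_1", "final_projection"])
```

The result is that the bulk of the compute and memory runs in 4-bit while the handful of fragile layers retain the dynamic range they need.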
NVFP4 in practice
To test their approach, the Nvidia team trained a powerful 12-billion-parameter hybrid Mamba-Transformer model on a whopping 10 trillion tokens. They then compared its performance directly with a baseline model trained in the widely popular FP8 format. The results showed that the NVFP4 model's training loss and downstream task accuracy closely matched the FP8 version throughout training.
Results held across a wide range of domains, including knowledge-intensive reasoning, mathematics, and commonsense tasks, with only a slight decline on coding tasks late in training.
“To our knowledge, this marks the first successful demonstration of training billion-parameter language models with 4-bit precision on a multi-trillion-token horizon, laying the foundation for faster and more efficient training of future frontier models,” the researchers write.
According to Shar Narasimhan, Nvidia’s product director for AI and data center GPUs, the 4-bit NVFP4 format in practice allows developers and companies to train and deploy AI models with almost the same accuracy as conventional 8-bit formats.
“By training model weights directly in 4-bit while maintaining accuracy, it enables developers to experiment with new architectures, iterate faster, and discover insights without limiting resources,” he told VentureBeat.
In contrast, FP8 (although already a step forward from FP16) still imposes constraints on model size and inference performance due to higher memory and bandwidth requirements. “NVFP4 breaks this ceiling by offering equivalent quality with much greater scope for development and experimentation,” Narasimhan said.
Compared to the alternative 4-bit MXFP4 format, the advantages of NVFP4 become even clearer. In an experiment with an 8-billion-parameter model, NVFP4 achieved a better loss than MXFP4. To reach the same level of performance as the NVFP4 model, the MXFP4 model had to be trained on 36% more data, which meant a significant increase in training time and cost.
Beyond making pre-training more efficient, NVFP4 redefines what is possible. “Demonstrating that 4-bit precision can maintain model quality at scale opens the door to a future where highly specialized models can be trained from scratch by mid-sized enterprises or startups, not just hyperscale entities,” Narasimhan said, adding that over time we can expect a shift from developing general-purpose LLMs to “a diverse ecosystem of custom, high-performance models built by a broader range of innovators.”
Beyond pretraining
While the paper focuses on the benefits of NVFP4 during pretraining, its impact also extends to inference.
“NVFP4-trained models can not only deliver faster inference and higher throughput, but also reduce the time it takes for AI factories to achieve ROI, accelerating the cycle from model development to real-world deployment,” Narasimhan said.
Because these models are smaller yet highly capable, they open up new opportunities to serve complex, high-quality responses in real time, even in token-intensive agentic applications, without increasing energy and compute costs.
Looking ahead, Narasimhan said the future of model efficiency is not just about lowering precision but also about building smarter systems.
“There are many opportunities to extend research to lower precisions, as well as modify architectures to accommodate components that increasingly dominate computation in large-scale models,” he said. “These areas are rich with opportunity, especially as we move closer to agent-based systems that require high throughput, low latency, and adaptive reasoning. NVFP4 proves that precision can be optimized without sacrificing quality, and sets the stage for a new era of intelligent, efficient AI design.”
