Wednesday, April 23, 2025

DeepSeek-V3, an ultra-large open-source AI model, outperforms Llama and Qwen at launch



Chinese AI startup DeepSeek, known for challenging leading AI vendors with innovative open-source technologies, today launched a new ultra-large model: DeepSeek-V3.

Available via Hugging Face under the company’s licensing agreement, the new model comes with 671B parameters but uses a mixture-of-experts architecture to activate only selected parameters, in order to handle given tasks accurately and efficiently. According to benchmarks shared by DeepSeek, the offering is already topping the charts, outperforming leading open-source models including Meta’s Llama 3.1-405B, and closely matching the performance of closed-source models from Anthropic and OpenAI.

The release marks another major step in closing the gap between closed-source and open-source AI. Ultimately, DeepSeek, which started as an offshoot of Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way for artificial general intelligence (AGI), where models would be able to understand or learn any intellectual task that a human can perform.

What does DeepSeek-V3 offer?

Like its predecessor DeepSeek-V2, the new ultra-large model uses the same basic architecture revolving around multi-head latent attention (MLA) and DeepSeekMoE. This approach ensures efficient training and inference, with specialized and shared “experts” (individual, smaller neural networks within the larger model) activating 37B of its 671B parameters for each token.
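The routing idea behind a mixture-of-experts layer can be sketched in a few lines. This is a toy illustration only: the experts, gate weights, and top-k routine below are made-up stand-ins, and DeepSeek-V3's actual DeepSeekMoE layer (with its mix of shared and routed experts) is far more elaborate.

```python
import math
import random

def moe_forward(x, experts, gate_w, k=2):
    """Route a token through the top-k experts of a mixture-of-experts layer."""
    # score[i]: affinity of the token to expert i (dot product with gate row)
    scores = [sum(xi * wi for xi, wi in zip(x, row)) for row in gate_w]
    topk = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    weights = [math.exp(scores[i]) for i in topk]
    total = sum(weights)
    weights = [w / total for w in weights]   # softmax over selected experts only
    # Only the chosen experts run; the rest stay idle, which is how a
    # 671B-parameter model can activate just 37B parameters per token.
    out = [0.0] * len(x)
    for i, w in zip(topk, weights):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

random.seed(0)
DIM, N_EXPERTS = 8, 16
# toy "experts": each simply scales the token by a fixed factor
experts = [lambda x, s=random.uniform(0.5, 1.5): [s * xi for xi in x]
           for _ in range(N_EXPERTS)]
gate_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
token = [random.gauss(0, 1) for _ in range(DIM)]
out = moe_forward(token, experts, gate_w, k=2)  # only 2 of 16 experts run
```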

While the basic architecture provides solid performance for DeepSeek-V3, the company has also introduced two innovations that raise the bar even further.

The first is an auxiliary-loss-free load-balancing strategy. It dynamically monitors and adjusts the load on the experts to utilize them in a balanced way, without the auxiliary loss term that typically compromises overall model performance. The second is multi-token prediction (MTP), which allows the model to predict multiple future tokens simultaneously. This innovation not only improves training efficiency but also lets the model perform three times faster, generating 60 tokens per second.
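The auxiliary-loss-free idea boils down to adding a per-expert bias to the routing scores used for expert selection, raised when an expert is underloaded and lowered when it is overloaded, without adding any term to the training loss. A minimal sketch of such an update step (the step size and the "fair share" target below are illustrative, not DeepSeek's actual hyperparameters):

```python
def update_routing_bias(bias, tokens_routed, gamma=0.001):
    """Nudge each expert's selection bias toward a balanced load.

    Sketch of an auxiliary-loss-free balancing step: overloaded experts
    get their bias lowered (picked less often next batch), underloaded
    ones get it raised. The bias only affects which experts are selected,
    not the gating weights, so no extra loss term is needed.
    """
    target = sum(tokens_routed) / len(tokens_routed)  # fair share per expert
    return [b - gamma if load > target else b + gamma
            for b, load in zip(bias, tokens_routed)]

bias = [0.0, 0.0, 0.0, 0.0]
tokens_routed = [120, 40, 90, 70]   # tokens each expert handled last batch
bias = update_routing_bias(bias, tokens_routed)
# experts 0 and 2 (above the mean load of 80) become slightly less
# likely to be selected next batch; experts 1 and 3 slightly more
```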

It is worth noting that during the training phase, DeepSeek applied multiple hardware and algorithmic optimizations, including the FP8 mixed-precision training framework and the DualPipe algorithm for pipeline parallelism, to cut the costs of the process.

Overall, DeepSeek claims to have completed DeepSeek-V3’s entire training in about 2.788 million H800 GPU hours, or roughly $5.57 million, assuming a rental price of $2 per GPU hour. This is far less than the hundreds of millions of dollars typically spent pre-training large language models.
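The cost figure follows directly from the reported GPU-hour count and the assumed rental rate:

```python
gpu_hours = 2_788_000      # ~2.788M H800 GPU hours reported by DeepSeek
rate_usd = 2.00            # assumed rental price per GPU hour
total = gpu_hours * rate_usd
print(f"${total:,.0f}")    # $5,576,000 — the ~$5.57M figure cited above
```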

For example, Llama-3.1 is estimated to have been trained with an investment of over $500 million.

The strongest open source model currently available

Despite the economical training, DeepSeek-V3 has emerged as the strongest open-source model on the market.

The company ran multiple benchmarks to compare the model’s performance and noted that it convincingly outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even outperforms the closed-source GPT-4o on most tests, with the exception of the English-focused SimpleQA and FRAMES, where the OpenAI model won with scores of 38.2 and 80.5 (versus 24.9 and 73.3), respectively.

Notably, DeepSeek-V3’s performance particularly stood out on the Chinese and math benchmarks, where it scored better than all of its peers. It scored 90.2 on the Math-500 test, with Qwen’s 80 the next-best score.

The only model that managed to challenge DeepSeek-V3 was Anthropic’s Claude 3.5 Sonnet, which outperformed it on the MMLU-Pro, IF-Eval, GPQA-Diamond, SWE-bench Verified and Aider-Edit benchmarks.

The work shows that open-source models are closing in on their closed-source counterparts, promising near-equivalent performance across a variety of tasks. The development of such systems is extremely good for the industry, as it reduces the chances of any single big AI player dominating the game. It also gives enterprises multiple options to choose from and work with when orchestrating their stacks.

Currently, DeepSeek-V3’s code is available via GitHub under an MIT license, while the model itself is released under the company’s model license. Enterprises can also test the new model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use. DeepSeek is offering the API at the same price as DeepSeek-V2 until February 8. After that, it will charge $0.27 per million input tokens ($0.07 per million tokens for cache hits) and $1.10 per million output tokens.
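For a rough sense of the post-February-8 pricing, a bill can be estimated from the quoted per-million-token rates. The helper below is illustrative, not part of DeepSeek's API, and real billing details may differ:

```python
def v3_api_cost(input_tok, output_tok, cached_tok=0):
    """Estimate a DeepSeek-V3 API bill in USD from the quoted list prices."""
    IN, IN_CACHED, OUT = 0.27, 0.07, 1.10   # $ per million tokens
    fresh = input_tok - cached_tok           # input tokens that missed the cache
    return (fresh * IN + cached_tok * IN_CACHED + output_tok * OUT) / 1_000_000

# e.g. 2M input tokens (half of them cache hits) plus 500k output tokens
cost = v3_api_cost(2_000_000, 500_000, cached_tok=1_000_000)  # ≈ $0.89
```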
