Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v4.1 benchmark suite, which delivers architecture-neutral, representative, and reproducible benchmarking of machine learning (ML) systems. This release includes the first results from a new benchmark based on the mixture of experts (MoE) architecture, as well as new findings on the energy consumption of inference execution.
MLPerf Inference v4.1
The MLPerf Inference benchmark suite, which spans data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models across a variety of deployment scenarios. The open-source, peer-reviewed benchmark suite creates a level playing field that drives innovation, performance, and energy efficiency across the entire industry. It also provides key technical insights for customers who are purchasing and tuning AI systems.
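To give a feel for how the suite exercises a system under test, here is a minimal sketch of a LoadGen-driven run. It assumes the `mlperf_loadgen` Python bindings built from the MLCommons inference repository (the exact API surface can vary across LoadGen versions), and `toy_infer` is a hypothetical stand-in for a real model:

```python
# Minimal sketch of an MLPerf-style run driven by LoadGen. Assumes the
# mlperf_loadgen Python bindings; toy_infer is a hypothetical stand-in
# for the benchmark model.
import array
import mlperf_loadgen as lg

def toy_infer(sample_index):
    # Placeholder "inference": a real submission runs the benchmark model here.
    return array.array("B", [sample_index % 256])

def issue_queries(query_samples):
    # LoadGen issues query samples according to the chosen scenario's
    # traffic pattern; we compute results and report completions back.
    responses, buffers = [], []
    for qs in query_samples:
        buf = toy_infer(qs.index)
        buffers.append(buf)  # keep the buffer alive until it is reported
        addr, n_items = buf.buffer_info()
        responses.append(lg.QuerySampleResponse(qs.id, addr, n_items * buf.itemsize))
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline   # also Server, SingleStream, ...
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(1024, 1024, lambda idx: None, lambda idx: None)
lg.StartTest(sut, qsl, settings)              # runs the timed benchmark
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```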
Benchmark results in this round demonstrate broad industry engagement and include the debut of six newly available or soon-to-be-shipped processors:
○ AMD MI300x accelerator (available)
○ AMD EPYC “Turin” processor (preview)
○ Google “Trillium” TPUv6e accelerator (preview)
○ Intel “Granite Rapids” Xeon processors (preview)
○ NVIDIA “Blackwell” B200 accelerator (preview)
○ UntetherAI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators
The MLPerf Inference v4.1 report includes 964 performance test results from 22 submitting organizations: AMD, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud, and Untether AI.
A new mixture of experts benchmark
Keeping pace with today’s ever-changing AI landscape, MLPerf Inference v4.1 introduces a new benchmark to the suite: mixture of experts (MoE). MoE is an architectural design for AI models that departs from the conventional approach of using a single, massive model; instead, it uses a collection of smaller “expert” models, and each inference query is routed to a subset of them to generate results. Research and industry leaders have found that this approach can deliver accuracy equivalent to a single monolithic model, often with a significant performance advantage, because only a fraction of the parameters are invoked for each query.
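To make the routing idea concrete, here is a small, illustrative PyTorch sketch of a top-k gated MoE layer. It is not the Mixtral implementation; the class name, layer sizes, and expert count are invented for the example (Mixtral 8x7B itself uses eight experts with top-2 routing per token):

```python
# Illustrative sketch of the core MoE idea: a gating network routes each
# token to a small subset of expert MLPs, so only a fraction of the total
# parameters run per token. Dimensions and names are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)      # the router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.gate(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # pick top-k experts
        weights = F.softmax(weights, dim=-1)           # normalize over chosen
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(TinyMoELayer()(tokens).shape)                    # torch.Size([5, 64])
```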
The MoE benchmark is unique and among the most sophisticated MLCommons has implemented to date. It uses the open-source Mixtral 8x7B model as a reference implementation and performs inference using datasets covering three independent tasks: general question answering, mathematical problem solving, and code generation.
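As a rough illustration of exercising those three task categories, the following sketch queries the public Mixtral checkpoint through Hugging Face Transformers. The prompts and generation settings are invented for illustration and are not the benchmark's actual datasets or configuration:

```python
# Hedged sketch of querying the public Mixtral 8x7B checkpoint via Hugging
# Face Transformers. Prompts and settings are illustrative only, not the
# MLPerf datasets or harness configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# One made-up prompt per benchmark task category.
prompts = [
    "What is the capital of France?",                 # general Q&A
    "Solve for x: 2x + 6 = 14.",                      # math problem solving
    "Write a Python function that reverses a list.",  # code generation
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```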
Energy consumption benchmarking
The MLPerf Inference v4.1 benchmark includes 31 power consumption test results spanning three submitted systems, covering both data center and edge scenarios. These results underscore the importance of understanding the power requirements of AI systems running inference tasks, as energy costs make up a significant portion of the overall cost of operating AI systems.
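As a back-of-the-envelope illustration of the kind of metric power measurement enables, the sketch below derives energy per query from a made-up average system power and throughput; actual MLPerf Power results report measured values, not estimates like these:

```python
# Back-of-the-envelope sketch: energy per inference from average wall power
# and throughput. Both input numbers are made up for illustration.
avg_power_watts = 1200.0   # hypothetical average system power under load
throughput_qps = 3000.0    # hypothetical queries served per second

joules_per_query = avg_power_watts / throughput_qps
queries_per_kwh = 3_600_000 / joules_per_query   # 1 kWh = 3.6e6 joules

print(f"{joules_per_query:.3f} J/query, {queries_per_kwh:,.0f} queries/kWh")
```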
The Increasing Pace of Innovation in Artificial Intelligence
We are currently witnessing an incredible surge of technological advances across the AI ecosystem, driven by a wide range of vendors, including AI pioneers, large established technology companies, and small startups.
MLCommons would especially like to welcome first-time MLPerf Inference submitters AMD and Sustainable Metal Cloud, as well as Untether AI, which submitted results for both performance and energy efficiency.
View Results
To view results for MLPerf Inference v4.1, visit HERE.