Tuesday, December 24, 2024

New MLPerf Storage v1.0 benchmark results show that storage systems play a critical role in AI model training performance


MLCommons® announced results for its industry-standard MLPerf® Storage Benchmark Suite v1.0, which is designed to measure the performance of storage systems for machine learning (ML) workloads in an architecture-neutral, representative, and reproducible manner. The results show that as accelerator technology evolves and datasets continue to grow in size, ML vendors must ensure that their storage solutions keep up with computational needs. This is a time of rapid change in ML systems, where progress in one area of technology generates fresh requirements in other areas. High-performance AI training now requires storage systems that are both large-scale and fast, so that access to stored data does not become a bottleneck in the overall system. With the release of the MLPerf Storage v1.0 benchmark results, it is clear that storage vendors are innovating to meet this challenge.

Storage benchmark v1.0 breaks new ground

The MLPerf Storage benchmark is the first and only open, transparent benchmark measuring storage performance across a diverse set of ML training scenarios. It emulates storage demands across several scenarios and system configurations, spanning a range of accelerators, models, and workloads. By simulating the “thinking time” of accelerators, the benchmark can generate realistic storage access patterns without requiring an actual training run, making it far more accessible. The benchmark focuses on a given storage system’s ability to keep pace, as it requires the simulated accelerators to maintain the required utilization levels.
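
As a rough illustration of how such an emulation can work, the sketch below alternates reads from the storage under test with a sleep that stands in for accelerator compute (“think time”), then reports the resulting utilization. The function names, structure, and the 0.90 threshold mentioned in the comment are illustrative assumptions, not the benchmark’s actual implementation.

    # Illustrative sketch (not the actual MLPerf Storage code): emulate an
    # accelerator that alternates between reading a batch from storage and
    # "thinking" (sleeping) for the simulated compute time, then report
    # utilization.
    import time

    def emulate_accelerator(read_batch, think_time_s, num_batches):
        """read_batch: callable that loads one batch from the storage under test."""
        busy = 0.0
        start = time.perf_counter()
        for _ in range(num_batches):
            read_batch()              # storage I/O being measured
            time.sleep(think_time_s)  # simulated accelerator compute
            busy += think_time_s      # accelerator is productive only while computing
        elapsed = time.perf_counter() - start
        return busy / elapsed         # fraction of time not stalled on I/O

    # A run "passes" only if utilization stays above the benchmark's required
    # threshold (0.90 here is a placeholder value, not the official figure).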

The benchmark includes three models that cover a variety of AI training patterns: 3D-UNet, ResNet-50, and CosmoFlow. These workloads span a wide range of sample sizes, from hundreds of kilobytes to hundreds of megabytes, and a wide range of simulated “think times,” from a few milliseconds to several hundred milliseconds.

The benchmark emulates the NVIDIA A100 and H100 as representatives of currently available accelerator technologies. The H100 accelerator reduces the per-batch computation time for the 3D-UNet workload by 76% compared to the V100 accelerator emulated in the v0.5 benchmark, turning a typically bandwidth-sensitive workload into a much more latency-sensitive one.
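
A back-of-the-envelope sketch of why this matters: if per-batch compute time falls by 76% while the amount of data read per batch stays the same, the storage system has roughly a quarter of the time to deliver that data. The batch size below is a hypothetical placeholder; only the ratio matters.

    # Rough arithmetic sketch, not a benchmark measurement.
    batch_bytes = 140e6        # hypothetical bytes read per batch (placeholder)
    v100_compute_s = 1.0       # normalized per-batch compute time on the emulated V100
    h100_compute_s = v100_compute_s * (1 - 0.76)   # 76% reduction -> 0.24

    required_bw_v100 = batch_bytes / v100_compute_s
    required_bw_h100 = batch_bytes / h100_compute_s
    print(required_bw_h100 / required_bw_v100)     # ~4.2x higher sustained throughput needed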

Additionally, MLPerf Storage v1.0 supports distributed training. Distributed training is a crucial scenario for the benchmark because it reflects a common real-world practice for training models on large datasets more quickly, and it poses specific challenges for the storage system: not only providing higher throughput, but also serving multiple training nodes simultaneously.

Version 1.0 benchmark results show improved performance of storage technology in ML systems

The wide range of submissions to the benchmark reflects the breadth and diversity of storage systems and architectures in the market. This underlines how important ML workloads have become for all types of storage solutions and shows how much active innovation is taking place in this space.

The results in the distributed training scenario demonstrate the delicate balance required between the number of hosts, the number of simulated accelerators per host, and the storage system’s ability to serve all accelerators at the required utilization. Adding more nodes and accelerators to handle larger training datasets increases the required bandwidth. Distributed training adds another twist, because different technologies, with different bandwidths and latencies, have historically been used to move data within a node and between nodes. The maximum number of accelerators a single node can support may therefore be limited not by the node’s own hardware, but by the ability to move enough data to that node quickly enough in a distributed environment (up to 2.7 GiB/s per emulated accelerator). Storage system architects now have few design trade-offs available to them: systems must deliver both high throughput and low latency to keep a large-scale AI training system running at peak load.
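
Using the per-accelerator figure cited above (up to roughly 2.7 GiB/s per emulated accelerator), a simple sizing sketch shows how quickly aggregate bandwidth requirements grow; the node and accelerator counts below are hypothetical placeholders, not submitted configurations.

    # Rough sizing sketch based on the ~2.7 GiB/s per-accelerator figure above.
    GIB = 2**30
    per_accelerator_bw = 2.7 * GIB   # bytes/s per emulated accelerator
    nodes = 4                        # hypothetical number of training nodes
    accels_per_node = 8              # hypothetical accelerators per node

    aggregate_bw = nodes * accels_per_node * per_accelerator_bw
    print(f"Storage must sustain ~{aggregate_bw / GIB:.0f} GiB/s "
          f"({aggregate_bw / 1e9:.0f} GB/s) to keep all accelerators busy")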

MLPerf Storage v1.0

The MLPerf Storage benchmark is the result of a collaborative engineering process between a dozen leading storage vendors and academic research groups. An open and peer-reviewed benchmark suite ensures a level playing field for competition, driving innovation, performance and energy efficiency across the industry. It also provides key technical information to customers who purchase and fine-tune AI training systems.

Version 1.0 benchmark results from a wide range of technology providers demonstrate that the industry recognizes the importance of high-performance storage solutions. MLPerf Storage v1.0 includes over 100 performance results from 13 reporting organizations: DDN, Hammerspace, Hewlett Packard Enterprise, Huawei, IEIT SYSTEMS, Juicedata, Lightbits Labs, MangoBoost, Nutanix, Simplyblock, Volumez, WEKA and YanRong Tech.

See the results

To see results for MLPerf Storage v1.0, visit Storage benchmark results.
