Thursday, April 23, 2026

Helping data centers deliver higher performance with less hardware

To improve data center performance, multiple storage devices are often networked together so that multiple applications can share them. However, even when pooled, significant device capacity remains unused due to variability in individual device performance.

MIT researchers have developed a system that improves the performance of storage devices while handling three main sources of variability. Their approach provides significant speed improvements over established methods that eliminate only one source of variability at a time.

The system uses a two-tier architecture, with a central controller that makes overall decisions about the tasks performed by each storage device, and local controllers for each machine that quickly redirect data in the event of problems with that device.

The method, which can adapt in real time to changing loads, does not require specialized hardware. When the researchers tested the system on realistic tasks, such as training artificial intelligence models and image compression, it nearly doubled the performance of established approaches. By intelligently balancing load across multiple storage devices, the system can boost overall data center performance.

“There is a trend toward throwing more resources at the problem, but in many ways that is not sustainable. We want to maximize the utility of these very expensive and carbon-intensive resources,” says Gohar Chaudhry, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on the technique. “With our adaptive software, you can still squeeze a lot of performance out of your existing devices before you have to throw them away and buy new ones.”

Chaudhry is joined on the paper by Ankit Bhardwaj, an assistant professor at Tufts University; Zhenyuan Ruan ’24; and senior author Adam Belay, an EECS associate professor and member of MIT’s Computer Science and Artificial Intelligence Laboratory. The research will be presented at the USENIX Symposium on Networked Systems Design and Implementation.

Utilizing unused capacity

Solid-state drives (SSDs) are high-performance digital storage devices that allow applications to read and write data. For example, an SSD can store massive datasets and quickly send them to the CPU to train machine learning models.

Connecting multiple SSDs so that multiple applications can share them improves performance, because no single application needs the full capacity of an SSD at any given time. However, not all SSDs perform the same, and the slowest device can limit the overall performance of the pool.

These inefficiencies stem from variability both in the SSD hardware itself and in the tasks the devices perform.

To take advantage of this untapped SSD capacity, the researchers developed Sandook, a software-based system that simultaneously addresses three main forms of performance-limiting variability. “Sandook” is an Urdu word for a box used for storage.

One type of variability comes from differences in the age, wear, and capacity of SSDs, which may have been purchased at different times from multiple vendors.

The second type of variability results from interference between read and write operations on the same SSD. To write new data, the device must first erase some existing data, and that erase process can slow down concurrent reads.

The third source of variability is garbage collection, the process of consolidating and erasing stale data to free up space. This process slows the SSD down and is triggered at unpredictable intervals that the data center operator cannot control.

“I can’t assume that all SSDs will behave identically throughout the deployment cycle. Even if I put the same load on them, some of them will be grumpy, which will negatively impact the net throughput I can achieve,” Chaudhry explains.

Plan globally, react locally

To handle all three sources of variability, Sandook uses a two-tier structure. The global scheduler optimizes the distribution of tasks across the pool, while faster schedulers on each SSD respond to urgent events and shift operations away from overloaded devices.
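The paper's actual scheduler design is far richer, but the division of labor between the two tiers can be sketched in a few lines of Python. The class names, the load-counting heuristic, and the `overloaded` flag below are all illustrative assumptions, not details from the Sandook system:

```python
class LocalScheduler:
    """Per-SSD scheduler: reacts quickly to urgent events on one device."""
    def __init__(self, ssd_id):
        self.ssd_id = ssd_id
        self.overloaded = False  # e.g. set when the device hits a latency spike

    def submit(self, pool):
        # If this device is struggling, redirect the operation to a peer.
        if self.overloaded:
            return pool.least_loaded_peer(self.ssd_id).ssd_id
        return self.ssd_id


class GlobalScheduler:
    """Pool-wide scheduler: makes the overall placement decision."""
    def __init__(self, n_ssds):
        self.locals = [LocalScheduler(i) for i in range(n_ssds)]
        self.load = [0] * n_ssds

    def least_loaded_peer(self, exclude):
        candidates = [i for i in range(len(self.load)) if i != exclude]
        return self.locals[min(candidates, key=lambda i: self.load[i])]

    def dispatch(self):
        # Global decision: pick the least-loaded SSD as the primary target.
        primary = min(range(len(self.load)), key=lambda i: self.load[i])
        # Local decision: the device itself may redirect in real time.
        target = self.locals[primary].submit(self)
        self.load[target] += 1
        return target
```

In this sketch, marking a device's local scheduler as overloaded causes every operation the global tier sends its way to be quickly bounced to a healthy peer, mirroring how the slow global view and the fast local view cooperate.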

The system avoids the latency caused by read-write interference by controlling which SSDs each application uses for reading and which for writing, reducing the chance that reads and writes collide on the same device.
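One simple way to realize that separation is to partition the pool into a read set and a write set and route each operation accordingly. This is a minimal illustration of the idea, not Sandook's actual policy; the function names and the fixed 50/50 split are assumptions for the sketch:

```python
def partition_pool(ssd_ids, read_fraction=0.5):
    """Dedicate part of the pool to reads and the rest to writes, so a
    write (and the flash erase it triggers) never stalls a concurrent
    read on the same device."""
    n_read = max(1, round(len(ssd_ids) * read_fraction))
    return ssd_ids[:n_read], ssd_ids[n_read:]

def route(op_type, read_set, write_set, counter):
    """Round-robin each operation onto the appropriate partition."""
    target_set = read_set if op_type == "read" else write_set
    return target_set[counter % len(target_set)]
```

A real system would also rotate the roles over time so that write wear is spread evenly across the pool.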

Sandook also profiles the typical performance of each SSD. It uses this information to detect when garbage collection is likely slowing down operations. Once detected, Sandook lightens the load on that SSD by shifting some tasks elsewhere until garbage collection completes.

“If that SSD is collecting garbage and can no longer handle the same load, I want to reduce it and slowly get it back up and running. We want to find the sweet spot where it’s still doing some work and leverage that performance,” Chaudhry says.
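The detect-then-back-off behavior described above can be sketched with two small helpers. The threshold, floor, and recovery-rate values here are hypothetical parameters chosen for illustration, not numbers from the paper:

```python
def gc_suspected(latency_us, baseline_us, threshold=2.0):
    """Flag likely garbage collection when observed latency deviates
    far from the device's profiled baseline."""
    return latency_us > threshold * baseline_us

def adjust_weight(weight, gc_active, floor=0.2, recovery=0.1):
    """Shed load sharply during GC, then ramp back gradually.
    The nonzero floor keeps the device doing *some* work -- the
    'sweet spot' rather than idling it entirely."""
    if gc_active:
        return floor
    return min(1.0, weight + recovery)  # slow recovery once GC ends
```

In a scheduling loop, each SSD's weight would scale its share of new operations, so a device in garbage collection briefly receives only a trickle of work and is eased back to full load afterward.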

The SSD profiles also enable Sandook’s global controller to allocate workloads in a weighted manner that accounts for each device’s characteristics and capacity.
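Weighted allocation of this kind reduces, in its simplest form, to splitting load in proportion to each device's profiled throughput. A minimal sketch, assuming a hypothetical mapping from SSD name to measured throughput:

```python
def weighted_shares(profiles):
    """Split incoming load across SSDs in proportion to each device's
    profiled throughput, so faster or healthier devices get more work."""
    total = sum(profiles.values())
    return {ssd: perf / total for ssd, perf in profiles.items()}
```

For example, a device profiled at three times the throughput of its peer would receive three-quarters of the incoming operations.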

Because the global controller sees the big picture and the local controllers respond in real time, Sandook can simultaneously manage forms of variability that occur on different time scales. For example, garbage-collection delays strike suddenly, while slowdowns due to wear accumulate over many months.

The researchers tested Sandook on a pool of 10 SSDs and evaluated the system on four tasks: database maintenance, machine learning model training, image compression, and user data storage. Sandook increased each application’s throughput by 12 to 94 percent compared to static methods and improved overall SSD capacity utilization by 23 percent.

The system enabled SSDs to reach 95 percent of their theoretical maximum performance without the need for specialized hardware or application-specific updates.

“Our dynamic solution can unlock greater performance across all SSDs and really push them to their limits. Every bit of capacity saved really counts at this scale,” says Chaudhry.

In the future, the researchers want to take advantage of new protocols available on the latest SSDs that give operators more control over where data is placed. They also want to leverage the predictability of AI workloads to further improve the performance of SSD operations.

“Flash memory is a sophisticated technology that underpins modern data center applications, but sharing this resource among workloads with widely varying performance requirements remains a unique challenge. This work significantly advances the state of the art with an elegant and practical solution ready for deployment, bringing the full potential of flash memory to production clouds,” says Josh Fried, a software engineer at Google and incoming faculty member at the University of Pennsylvania, who was not involved in this work.

This research was funded in part by the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, and the Semiconductor Research Corporation.
