Many companies have high hopes that AI will revolutionize their businesses, but those hopes can be quickly dashed by the staggering costs of training advanced AI systems. Elon Musk has pointed out that engineering problems often stall progress. This is especially true when optimizing hardware such as GPUs to efficiently handle the massive computational demands of training and fine-tuning large language models.
While large tech giants can afford to spend millions, sometimes billions, on training and optimization, small and medium-sized businesses and startups with shorter runways often find themselves on the margins. In this article, we'll look at several strategies that can allow even the most resource-constrained developers to train AI models without breaking the bank.
In for a dime, in for a dollar
As you probably know, building and bringing an AI product to market, whether it's a foundation/large language model (LLM) or a fine-tuned downstream application, relies heavily on specialized AI chips, particularly GPUs. These GPUs are so expensive and hard to come by that SemiAnalysis coined the terms "GPU-rich" and "GPU-poor" within the machine learning (ML) community. LLM training is costly mainly because of hardware expenses, including both acquisition and maintenance, rather than ML algorithms or expert knowledge.
Training these models requires extensive computation on powerful clusters, and larger models take even longer. For example, training LLaMA2 70B involved exposing 70 billion parameters to 2 trillion tokens, which takes on the order of 10^24 floating-point operations. Should you give up if you're short on GPUs? No.
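The order of magnitude quoted above can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the common rule of thumb of about 6 floating-point operations per parameter per training token; that factor is an assumption, not a figure from this article, and the function name is purely illustrative.

```python
def estimate_training_flops(num_params: float, num_tokens: float,
                            flops_per_param_token: float = 6.0) -> float:
    """Rough training cost: (assumed) 6 FLOPs per parameter per token."""
    return flops_per_param_token * num_params * num_tokens


# 70 billion parameters trained on 2 trillion tokens.
flops = estimate_training_flops(70e9, 2e12)
print(f"{flops:.2e}")  # roughly 8.4e23, i.e. on the order of 10^24
```

Dividing this total by the sustained throughput of whatever hardware you can afford gives a first estimate of how long a run would take.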
Alternative strategies
Today, technology companies are using a number of strategies to find alternative solutions, reduce reliance on expensive hardware, and ultimately save money.
One approach involves modifying and improving the training hardware. Although this path is still largely experimental and requires heavy capital investment, it holds promise for future optimization of LLM training. Examples of such hardware solutions include custom AI chips from Microsoft and Meta, new semiconductor initiatives from Nvidia and OpenAI, single compute clusters from Baidu, GPU rentals from Vast, and Sohu chips from Etched, among others.
While this is an important step forward, this approach is still better suited to large players who can afford to invest heavily now to reduce expenses later. It does not work for newcomers with limited financial resources who want to build AI products today.
What to do: Innovative software
With a low budget in mind, there is another way to optimize LLM training and reduce costs: innovative software. This approach is more affordable and accessible to most ML engineers, whether they are seasoned professionals or new AI enthusiasts and software developers looking to enter the field. Let's take a closer look at some of these code-based optimization tools.
Mixed precision training
What is this: Imagine your company has 20 employees, but you rent office space for 200 people. Of course, this would be an obvious waste of resources. A similar inefficiency occurs during model training, where ML frameworks often allocate more memory than is really necessary. Mixed-precision training corrects this through optimization, improving both speed and memory usage.
How it works: To achieve this, lower-precision bfloat16/float16 operations are combined with standard float32 operations, so less data is moved and processed at any given time. To a non-engineer, this may sound like a bunch of technical gibberish, but it essentially means that the AI model can process data faster and require less memory without compromising accuracy.
Improvement indicators: This technique can improve run time by up to 6x on GPUs and by 2-3x on TPUs (Google's Tensor Processing Units). Open-source frameworks such as Nvidia's APEX and Meta AI's PyTorch support mixed-precision training, making it readily available for pipeline integration. By implementing this method, companies can significantly reduce GPU costs while maintaining acceptable levels of model performance.
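To see why mixed precision combines the two formats instead of dropping float32 entirely, here is a toy, standard-library-only illustration. It is not how PyTorch or APEX implement the technique; the update size and step count are hypothetical values chosen to make the rounding effect visible.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (float32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage cost per value: float16 needs half the bytes of float32.
BYTES_FP16 = struct.calcsize("<e")
BYTES_FP32 = struct.calcsize("<f")

UPDATE = 1e-4  # hypothetical small weight update per step
STEPS = 100

# Pure float16: the weight itself lives in float16, so every tiny update
# is rounded away (float16 spacing near 1.0 is about 0.001).
w_fp16 = to_float16(1.0)
for _ in range(STEPS):
    w_fp16 = to_float16(w_fp16 + UPDATE)

# Mixed precision: accumulate updates in a float32 "master" weight; a
# float16 copy is what the cheaper forward/backward math would consume.
w_master = to_float32(1.0)
for _ in range(STEPS):
    w_master = to_float32(w_master + UPDATE)
w_compute = to_float16(w_master)

print(BYTES_FP16, BYTES_FP32)
print(w_fp16, round(w_master, 4), w_compute)
```

The float16-only weight never moves, while the float32 master weight accumulates all the updates. In PyTorch, this bookkeeping is handled for you by the `torch.autocast` context manager.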
Activation checkpointing
What is this: If you are constrained by limited memory but willing to spend more time, checkpointing may be the right technique for you. In short, it significantly reduces memory usage by keeping the amount of stored intermediate data to a minimum, which allows LLM training without the need for hardware upgrades.
How it works: The main idea behind activation checkpointing is to store only a subset of the essential values while training the model and to recompute the rest when they are needed. Instead of keeping all the intermediate data in memory, the system keeps only what is vital, thus freeing up memory space. This is similar to the principle of "we'll cross that bridge when we come to it," which means not bothering with less urgent matters until they require attention.
Improvement indicators: In most situations, activation checkpointing reduces memory usage by up to 70%, although it also extends the training phase by about 15-25%. This fair trade-off means that companies can train large AI models on their existing hardware without investing additional resources in infrastructure. The aforementioned PyTorch library supports checkpointing, which makes implementation easier.
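The store-less, recompute-later idea can be sketched in plain Python. This is a toy chain of scalar tanh layers, not PyTorch's implementation; the layer function, weights, and segment size are all hypothetical choices for illustration.

```python
import math


def layer(w: float, x: float) -> float:
    return math.tanh(w * x)


def layer_grads(w: float, x: float, grad_out: float):
    """Return (gradient w.r.t. input x, gradient w.r.t. weight w)."""
    y = math.tanh(w * x)
    local = 1.0 - y * y  # derivative of tanh
    return grad_out * local * w, grad_out * local * x


def backward_full(weights, x0):
    """Baseline: keep every layer input in memory for the backward pass."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad, w_grads = 1.0, [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad, w_grads[i] = layer_grads(weights[i], inputs[i], grad)
    return w_grads, len(inputs)


def backward_checkpointed(weights, x0, segment):
    """Keep only every `segment`-th layer input; recompute the rest."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad, w_grads = 1.0, [0.0] * len(weights)
    for start in reversed(range(0, len(weights), segment)):
        end = min(start + segment, len(weights))
        # Recompute this segment's inputs from its checkpoint.
        seg_inputs, x = [], checkpoints[start]
        for i in range(start, end):
            seg_inputs.append(x)
            x = layer(weights[i], x)
        for i in reversed(range(start, end)):
            grad, w_grads[i] = layer_grads(weights[i], seg_inputs[i - start], grad)
    # Values held at once: all checkpoints plus one recomputed segment.
    return w_grads, len(checkpoints) + min(segment, len(weights))


weights = [0.5 + 0.05 * i for i in range(16)]  # hypothetical 16-layer toy model
full_grads, full_stored = backward_full(weights, 0.3)
ckpt_grads, ckpt_stored = backward_checkpointed(weights, 0.3, segment=4)
print(full_stored, ckpt_stored)
```

Both versions produce the same gradients; the checkpointed one holds fewer values at a time and pays for it with an extra forward pass through each segment. In PyTorch, the corresponding utility is `torch.utils.checkpoint.checkpoint`.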
Multi-GPU Training
What is this: Imagine a small bakery needs to quickly produce a large batch of baguettes. If one baker works alone, it will probably take a long time. With two bakers, the process speeds up. Add a third baker and it goes even faster. Multi-GPU training works in a very similar way.
How it works: Instead of using a single GPU, you use several GPUs at the same time. The training of the AI model is distributed across these GPUs, allowing them to work side by side. Logically, this is roughly the opposite of the previous method, checkpointing, which reduces hardware costs in exchange for longer run time. Here, we use more hardware but squeeze the most out of it, which shortens run time and lowers operational costs.
Improvement indicators: Here are three robust tools for training LLMs on multi-GPU setups, listed in ascending order of effectiveness based on experimental results:
- DeepSpeed: A library designed specifically for training AI models on multiple GPUs, capable of achieving speeds up to 10x faster than traditional training approaches.
- FSDP: One of the most popular frameworks in PyTorch, which addresses some of DeepSpeed's inherent limitations and raises compute efficiency by a further 15-20%.
- YaFSDP: A recently released, enhanced version of FSDP for model training, providing a 10-25% speedup over the original FSDP methodology.
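The simplest form of the "more bakers" idea is plain data parallelism: each worker computes gradients on its own slice of the batch, and the results are averaged before the weights are updated. The sketch below simulates this sequentially in pure Python with a one-parameter model. It does not use the DeepSpeed, FSDP, or YaFSDP APIs (which additionally shard model and optimizer state), and all function names are hypothetical.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def split_evenly(items, num_workers):
    """Split a batch into equal shards (assumes it divides evenly)."""
    size = len(items) // num_workers
    return [items[k * size:(k + 1) * size] for k in range(num_workers)]


def data_parallel_step(w, xs, ys, num_workers, lr):
    """One training step with the batch spread over simulated workers."""
    x_shards = split_evenly(xs, num_workers)
    y_shards = split_evenly(ys, num_workers)
    # Each "GPU" computes a gradient on its own shard only.
    local_grads = [mse_gradient(w, xk, yk) for xk, yk in zip(x_shards, y_shards)]
    # Stand-in for an all-reduce: average the gradients across workers.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]
w_single = data_parallel_step(0.0, xs, ys, num_workers=1, lr=0.01)
w_multi = data_parallel_step(0.0, xs, ys, num_workers=4, lr=0.01)
print(w_single, w_multi)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so four workers reach the same updated weight as one while each handles a quarter of the data.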
Conclusion
By leveraging techniques like mixed-precision training, activation checkpointing, and multi-GPU training, even small and midsize companies can make significant progress in AI training, both in fine-tuning and in model creation. These tools enhance computational efficiency, reduce run times, and lower overall costs. They also enable larger models to be trained on existing hardware, reducing the need for costly upgrades. By democratizing access to advanced AI capabilities, these approaches enable a broader range of technology companies to innovate and compete in this rapidly evolving field.
As the saying goes, “AI can’t replace you, but someone who uses AI can.” It’s time to embrace AI, and with the above strategies, you can do so even on a budget.
Ksenia Se is the founder of Turing Post.
