Training a vast AI model is pricey not only in dollars but also in time, energy, and computational resources. Traditionally, getting a smaller, faster model requires either training a huge one first and then trimming it down, or training a compact one from scratch and accepting poorer results.
Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the Max Planck Institute for Intelligent Systems, the European Laboratory for Learning and Intelligent Systems (ELLIS), ETH Zurich, and Liquid AI have developed a new method that bypasses this trade-off entirely by compressing models during training rather than after.
The technique, called CompreSSM, targets a family of artificial intelligence architectures known as state space models, which support applications ranging from language processing to audio generation and robotics. By borrowing mathematical tools from control theory, the researchers can identify which parts of the model are pulling their weight and which are dead weight, then surgically remove the unnecessary elements early in the training process.
“It’s basically a technique for making models smaller and faster during training,” says Makram Chahine, a PhD student in electrical engineering and computer science, CSAIL member, and lead author of the paper. “As they learn, they also get rid of the parts that are not useful to them.”
The key observation is that the relative importance of different elements of these models stabilizes surprisingly early in training. Using a mathematical quantity called Hankel singular values, which measure the contribution of each internal state to the overall behavior of the model, the team showed that they could reliably assess which dimensions mattered and which did not after only about 10 percent of the training process. Once these rankings are established, less significant components can be safely discarded and the remaining 90 percent of training proceeds at the speed of a much smaller model.
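In control theory, the Hankel singular values of a linear state space system are computed from its controllability and observability Gramians. The sketch below illustrates that general recipe, not the paper’s actual implementation: all function names, the toy system, and the pruning threshold are hypothetical, chosen only to show how internal state dimensions can be ranked and weak ones flagged for removal.

```python
import numpy as np

def gramian(A, M, iters=200):
    # Discrete-time Gramian W = sum_k A^k M (A^T)^k for a stable A
    # (spectral radius < 1), computed by truncating the infinite series.
    W = np.zeros_like(M)
    Ak = np.eye(A.shape[0])
    for _ in range(iters):
        W += Ak @ M @ Ak.T
        Ak = Ak @ A
    return W

def hankel_singular_values(A, B, C):
    # System: x_{k+1} = A x_k + B u_k,  y_k = C x_k.
    Wc = gramian(A, B @ B.T)            # controllability Gramian
    Wo = gramian(A.T, C.T @ C)          # observability Gramian
    # Hankel singular values are the square roots of the eigenvalues
    # of Wc @ Wo (real and nonnegative for stable systems).
    eigs = np.linalg.eigvals(Wc @ Wo)
    return np.sort(np.sqrt(np.clip(eigs.real, 0.0, None)))[::-1]

# Rank the 8 internal states of a random stable system and keep only
# those whose Hankel singular value clears a relative threshold.
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(-0.9, 0.9, 8))   # stable diagonal dynamics
B = rng.standard_normal((8, 2))
C = rng.standard_normal((3, 8))
hsv = hankel_singular_values(A, B, C)
keep = hsv > 0.01 * hsv[0]               # prune weak state dimensions
print(keep.sum(), "of 8 states retained")
```

The ranking, not the raw values, is what matters for compression: states at the bottom of the sorted list contribute least to the input-output behavior and are the natural candidates to drop.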
“What’s exciting about this work is that it transforms compression from an afterthought to part of the learning process itself,” says senior author Daniela Rus, an MIT professor and director of CSAIL. “Rather than training a large model and then figuring out how to make it smaller, CompreSSM allows the model to discover its own efficient structure as it learns. It’s a fundamentally different way of thinking about building artificial intelligence systems.”
The results are striking. In image classification benchmarks, the compressed models maintained almost the same accuracy as their full-size counterparts and were trained up to 1.5 times faster. A compressed model reduced to roughly one-quarter of its original dimensions achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to just 81.8 percent for a model trained from scratch at that smaller size. In Mamba, one of the most widely used state space architectures, the method achieved approximately a 4x training speedup, compressing a 128-dimensional model to approximately 12 dimensions while maintaining competitive performance.
“You get the performance of a larger model because you capture most of the complex dynamics in the warm-up phase and then retain only the most useful states,” Chahine says. “The model is still able to perform at a higher level than training a small model from scratch.”
What distinguishes CompreSSM from existing approaches is its theoretical foundations. Conventional pruning methods train the full model and then remove parameters after the fact, which means you still pay the full computational cost of training a large model. Knowledge distillation, another popular technique, requires training a large “teacher” model to completion and then training a second, smaller “student” model on top of it, essentially doubling the training effort. CompreSSM avoids both of these costs by making informed compression decisions mid-stream.
The team compared CompreSSM directly with both alternatives. Against Hankel nuclear norm regularization, a recently proposed spectral technique intended to encourage compact state space models, CompreSSM was more than 40 times faster while providing greater accuracy. The regularization approach slowed training by a factor of about 16 because it required expensive eigenvalue calculations at each gradient step, and even then the resulting models were weaker. Against knowledge distillation on CIFAR-10, CompreSSM had a clear advantage for highly compressed models: at smaller dimensions, the distilled models showed a significant drop in accuracy, while models compressed with CompreSSM retained almost full performance. And because distillation requires forward passes through both teacher and student at every training step, even the smaller student models trained more slowly than the full-size baseline.
Using Weyl’s theorem, the researchers mathematically proved that the importance of individual model states changes smoothly during training, and empirically showed that the relative rankings of these states remain stable. Together, these findings give practitioners confidence that dimensions initially identified as unimportant will not suddenly become critical later.
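The smoothness argument rests on a classical perturbation bound. Weyl’s inequality for singular values states that when a matrix $H$ is perturbed by $E$, as happens with each gradient update during training, no singular value can move by more than the size of the perturbation:

```latex
\[
  \lvert \sigma_i(H + E) - \sigma_i(H) \rvert \;\le\; \lVert E \rVert_2
  \qquad \text{for every } i.
\]
```

Since individual training updates are small, each singular value can only drift gradually, which is why an importance ranking established early in training remains trustworthy for the rest of it.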
The method also provides a practical safety net. If a compression step causes an unexpected drop in performance, practitioners can revert to a previously recorded checkpoint. “It gives people control over how much they are willing to pay for performance, rather than having to define a less intuitive threshold for energy use,” Chahine explains.
The technique has some practical limits. CompreSSM performs best on models that exhibit a strong correlation between internal state dimension and overall performance, a property that varies across tasks and architectures. The method is particularly effective in multiple-input multiple-output (MIMO) models, where the relationship between state size and expressiveness is strongest. For per-channel, single-input single-output architectures, the gains are more modest because these models are less sensitive to changes in state dimension in the first place.
The theory is best applied to linear time-invariant systems, although the team has developed extensions for the increasingly popular input-dependent and time-varying architectures. And as the family of state space models extends to architectures such as linear attention, which is a growing area of interest as an alternative to established transformers, the potential scope of application is wide.
Chahine and his colleagues see this work as a stepping stone. The team has already demonstrated an extension to linear time-varying systems such as Mamba, and future directions include pushing CompreSSM further toward the matrix-valued state systems used in linear attention mechanisms, which would bring the technique closer to the transformer architectures that underpin most of today’s largest AI systems.
“This had to be the first step because in this case the theory is clear and the approach can remain principled,” says Chahine. “It’s a springboard that allows us to then extend this technology to other architectures that people use in industry today.”
“The work of Chahine and his colleagues provides an intriguing, theoretically grounded perspective on the compression of modern state space models (SSMs),” says Antonio Orvieto, a principal investigator at the ELLIS Institute Tübingen and independent group leader at the Max Planck Institute for Intelligent Systems, who was not involved in the research. “This method provides evidence that the state dimension of these models can be effectively reduced during training and that a control theory perspective can successfully guide this procedure. The work opens new avenues for future research, and the proposed algorithm may become a standard approach when pretraining large SSM-based models.”
The work has been accepted as a conference paper at the 2026 International Conference on Learning Representations and will be presented later this month. The project was supported, in part, by the ETH Center, Max Planck, the Hector Foundation, Boeing, and the United States Office of Naval Research.
