Thursday, March 12, 2026

Together AI’s ATLAS adaptive speculator delivers 400% inference speedup by learning from real-time workloads


Enterprises scaling their AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep up with shifting workloads.

Speculators are smaller AI models that work alongside large language models during inference. They draft multiple tokens ahead, which the main model then verifies in parallel. This technique (called speculative decoding) has become essential for enterprises trying to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens in a single pass, dramatically improving throughput.
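The draft-then-verify loop can be sketched in a few lines. This is a toy illustration of the general speculative decoding idea, not Together AI's implementation: `draft_model` and `target_model` below are stand-in functions following a trivial rule, not real LLMs, and verification here is written as a simple loop rather than one batched forward pass.

```python
def draft_model(context, k):
    """Cheap speculator: drafts k tokens ahead (toy rule: last token + 1, repeatedly)."""
    draft, last = [], context[-1]
    for _ in range(k):
        last += 1
        draft.append(last)
    return draft

def target_model(context):
    """Expensive target: returns the 'true' next token for a context (same toy rule)."""
    return context[-1] + 1

def speculative_step(context, k=4):
    """Draft k tokens, verify them against the target model, and accept
    the longest agreeing prefix (plus one corrected token on a mismatch)."""
    draft = draft_model(context, k)
    accepted = []
    for tok in draft:
        true_tok = target_model(context + accepted)
        if tok == true_tok:
            accepted.append(tok)       # draft matched: token accepted
        else:
            accepted.append(true_tok)  # mismatch: take the target's token, stop
            break
    return accepted

print(speculative_step([1, 2, 3], k=4))  # → [4, 5, 6, 7]
```

One verification pass can thus yield up to k tokens, instead of exactly one token per pass with plain autoregressive decoding; the worse the speculator's guesses, the shorter the accepted prefix.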

Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) that aims to help companies overcome the challenge of static speculators. The technique provides self-learning inference optimization that can deliver up to 400% faster inference than the baseline performance of existing inference technologies such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators in place.

Founded in 2023, the company has focused on inference optimization for its enterprise AI platform. Earlier this year, it raised $305 million amid growing customer adoption and demand.

“The companies we work with typically see their workloads shift as they scale up, and then they don’t see as much speedup from speculative execution as before,” Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. “These speculators generally don’t work well when their workload domain starts to shift.”

The workload drift problem no one talks about

Most speculators in production today are “static” models. They are trained once on a fixed dataset representing expected workloads, then deployed without any adaptation. Companies like Meta and Mistral ship pre-trained speculators alongside their main models. Inference platforms like vLLM use these static speculators to boost throughput without changing output quality.

But there’s a catch. As an enterprise’s use of AI evolves, the static speculator’s accuracy declines rapidly.

“If you’re a coding agent company and most of your developers were writing in Python, then suddenly some of them switch to writing in Rust or C, and you see the speed start to slow down,” Dao explained. “The speculator has a mismatch between what it was trained on and what the actual workload is.”

This workload drift is a hidden tax on AI scaling. Companies either accept degraded performance or invest in retraining custom speculators. That process captures only a snapshot in time and quickly becomes stale.

How adaptive speculators work: A two-model approach

ATLAS uses a dual-speculator architecture that combines stability with adaptation:

Static speculator – A heavyweight model trained on broad data that ensures consistent baseline performance. It serves as a “speed floor.”

Adaptive speculator – A lightweight model that continuously learns from live traffic, specializing on the fly in emerging domains and usage patterns.

Confidence-aware controller – An orchestration layer that dynamically chooses which speculator to use and adjusts the speculation “lookahead” based on confidence metrics.
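The controller's decision can be sketched roughly as follows. This is a hypothetical illustration of the confidence-based routing described above; the threshold, the linear scaling, and the function name are assumptions, not Together AI's actual logic.

```python
def choose_speculator(adaptive_confidence, floor_confidence=0.6,
                      min_lookahead=2, max_lookahead=8):
    """Pick a speculator and a lookahead length from a confidence score in [0, 1].

    Low confidence  -> fall back to the static speculator (the "speed floor")
                       with a short, safe lookahead.
    High confidence -> use the adaptive speculator and stretch the lookahead,
                       since more drafted tokens are likely to be accepted.
    (Thresholds and scaling are illustrative assumptions.)
    """
    if adaptive_confidence < floor_confidence:
        return "static", min_lookahead
    # Scale lookahead linearly with confidence above the threshold.
    span = max_lookahead - min_lookahead
    frac = (adaptive_confidence - floor_confidence) / (1.0 - floor_confidence)
    return "adaptive", min_lookahead + round(frac * span)

print(choose_speculator(0.3))   # cold start: ('static', 2)
print(choose_speculator(0.95))  # well-adapted: ('adaptive', 7)
```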

“Before the adaptive speculator learns anything, we still have the static speculator to help provide the speedup in the beginning,” Ben Athiwaratkun, an AI scientist at Together AI, told VentureBeat. “As the adaptive speculator becomes more confident, the speed increases over time.”

The technical innovation lies in balancing acceptance rate (how often the target model agrees with the drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller relies more on the lightweight speculator and extends the lookahead, increasing efficiency.

Users do not need to tune any parameters. “On the user side, you don’t have to turn any knobs,” Dao said. “On our side, we have turned those knobs to put things in a configuration that delivers good speedup.”

Performance comparable to custom silicon

Together AI’s benchmarks show ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More impressively, those numbers on Nvidia B200 GPUs match or exceed specialized inference systems such as Groq’s custom hardware.

“Software and algorithmic improvements are able to close the gap with really specialized hardware,” Dao said. “We were seeing 500 tokens per second on these huge models, which is even faster than some custom chips.”

The 400% inference speedup the company claims is the cumulative effect of Together’s Turbo optimization suite. FP4 quantization delivers an 80% speedup over the FP8 baseline. The static Turbo Speculator adds another 80-100% gain. The adaptive system layers on top of that. Each optimization compounds the benefits of the others.
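Since the gains compound, the arithmetic is multiplicative rather than additive. The composition below is an illustration assembled from the figures quoted above; how the adaptive layer's contribution is actually measured is not specified in the article.

```python
# Compounding the quoted per-layer speedups (an "80% speedup" means x1.8).
fp4_speedup = 1.8          # FP4 quantization: ~80% over the FP8 baseline
speculator_speedup = 2.0   # static Turbo Speculator: up to ~100% gain

combined = fp4_speedup * speculator_speedup
print(f"{combined:.1f}x before the adaptive layer")  # 3.6x

# "400% faster" reads as ~5x overall; the remaining factor would have to
# come from the adaptive layer on a well-adapted workload (an inference
# on our part, not a figure the company breaks out).
adaptive_gain = 5.0 / combined
print(f"adaptive layer would need ~{adaptive_gain:.2f}x on top")
```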

Compared to standard inference engines such as vLLM or Nvidia’s TensorRT-LLM, the improvement is significant. Together AI benchmarks against the stronger of the two baselines for each workload before applying speculative optimizations.

The memory-compute trade-off, explained

The performance gains come from exploiting a fundamental inefficiency in modern inference: wasted compute capacity.

Dao explained that during inference, a significant amount of compute capacity typically goes underutilized.

“During inference, which is actually the dominant workload today, it mainly uses the memory subsystem,” he said.

Speculative decoding converts idle compute into reduced memory access. When the model generates one token at a time, it is memory-bound: the GPU sits idle waiting on memory. But when the speculator proposes five tokens and the target model verifies them simultaneously, compute utilization spikes while memory access stays roughly constant.

“The total amount of computation needed to generate five tokens is the same, but the memory only had to be accessed once rather than five times,” Dao said.
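A back-of-the-envelope roofline model makes Dao's point concrete. The numbers below are made up for illustration; the structure is what matters: in a memory-bound decoder, each forward pass streams the model weights once, so verifying k drafted tokens in one pass amortizes that cost k ways.

```python
# Hypothetical figures: 100 GB of weights streamed per pass, 8 TB/s bandwidth.
weight_bytes = 100e9
bandwidth = 8e12  # bytes/sec

# Plain decoding: every token pays the full weight-streaming cost.
t_per_token_plain = weight_bytes / bandwidth

# Speculative decoding: five drafted tokens verified in one pass, so the
# weights are streamed once and the cost is shared across all five tokens
# (best case, i.e. all drafts accepted).
k = 5
t_per_token_spec = (weight_bytes / bandwidth) / k

print(f"plain decoding: {t_per_token_plain * 1e3:.1f} ms/token")        # 12.5 ms
print(f"speculative (k={k}, all accepted): {t_per_token_spec * 1e3:.1f} ms/token")  # 2.5 ms
```

The extra arithmetic of scoring five tokens at once is nearly free on a GPU that was idling on memory anyway, which is why the technique trades spare compute for fewer memory passes.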

Think of it as intelligent caching for AI

For infrastructure teams familiar with traditional database optimization, adaptive speculators act like an intelligent caching layer, but with a key difference.

Conventional caching systems like Redis or memcached require exact matches: you store a query result and retrieve it when that exact query runs again. Adaptive speculators work differently.

“You can think of it as a smart way of caching: not storing the exact result, but detecting certain patterns that you see,” Dao explained. “Broadly, what we see is that you’re working on similar code, or you’re, you know, controlling compute in a similar way. We can then predict what the big model is going to say. We just get better and better at predicting it.”

Instead of storing exact answers, the system learns patterns in how the model generates tokens. It recognizes that if you are editing Python files in a specific codebase, certain token sequences become more likely. The speculator adapts to those patterns, improving its predictions over time without requiring identical inputs.
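The contrast with exact-match caching can be shown with a deliberately tiny stand-in. The bigram "speculator" below is an illustration of pattern learning versus key lookup, not how ATLAS works; real adaptive speculators are small neural drafters, not count tables.

```python
from collections import Counter, defaultdict

class ExactCache:
    """Redis/memcached-style: only an identical key hits."""
    def __init__(self):
        self.store = {}
    def get(self, prompt):
        return self.store.get(prompt)   # None unless this exact prompt was seen
    def put(self, prompt, answer):
        self.store[prompt] = answer

class BigramSpeculator:
    """Learns token-transition patterns from observed traffic and predicts
    the most likely next token, even for never-before-seen inputs."""
    def __init__(self):
        self.counts = defaultdict(Counter)
    def observe(self, tokens):
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[prev][nxt] += 1
    def predict(self, token):
        if not self.counts[token]:
            return None
        return self.counts[token].most_common(1)[0][0]

spec = BigramSpeculator()
spec.observe(["def", "main", "(", ")", ":"])
spec.observe(["def", "helper", "(", ")", ":"])

# 'def run():' was never seen verbatim, but the '(' -> ')' pattern generalizes.
print(spec.predict("("))  # → ')'
```

An `ExactCache` would miss on any new function name; the pattern learner keeps predicting correctly because it generalizes over structure rather than matching whole inputs.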

Use cases: RL training and shifting workloads

Adaptive speculation delivers particular benefits in two enterprise scenarios:

Reinforcement learning training: Static speculators quickly fall out of sync as the policy evolves during training. ATLAS adapts continuously to the shifting policy distribution.

Shifting workloads: As enterprises discover new use cases for AI, the workload mix changes. “Maybe they started using AI in chatbots, but then they realized, hey, it can write code, so they started switching to coding,” Dao said. “Or they realize that these AIs can actually call tools, control computers, do accounting, things like that.”

During a vibe-coding session, the adaptive system can specialize in the specific codebase being edited, even though those files were never seen during training. This further increases the acceptance rate and decoding speed.

What this means for enterprises and the inference ecosystem

ATLAS is now available on Together AI’s dedicated endpoints at no additional cost. More than 800,000 of the company’s developers have access to the optimization (up from 450,000 in February).

But the broader implications go beyond a single vendor’s product. Moving from static to adaptive optimization means fundamentally rethinking how inference platforms work. As enterprises deploy AI across multiple domains, the industry will need to move beyond one-time-trained models toward systems that learn and improve continuously.

Together AI has previously open-sourced some of its research techniques and collaborated with projects such as vLLM. While the fully integrated ATLAS system is proprietary, some of the underlying techniques may ultimately influence the broader inference ecosystem.

For enterprises looking to lead in AI, the message is clear: adaptive algorithms on off-the-shelf hardware can match custom silicon at a fraction of the cost. As this approach matures across the industry, software optimization is increasingly outperforming specialized hardware.

