Thursday, March 12, 2026

To scale agent-based AI, Notion tore down the technology stack and started over

Many organizations would hesitate to tear down their technology stack and start from scratch. Not Notion. For version 3.0 of its productivity software (released in September), the company didn’t hesitate to rebuild everything from the ground up, recognizing that doing so was necessary to support agentic AI at enterprise scale. While conventional AI workflows rely on explicit, step-by-step instructions, AI agents built on advanced reasoning models can identify and understand the tools at their disposal and plan their next steps. “Rather than trying to retrofit what we had built, we wanted to leverage the strengths of reasoning models,” Sarah Sachs, head of AI modeling at Notion, told VentureBeat. “We rebuilt the architecture because agentic workflows are different.”

Rearchitecting so models can work autonomously

Notion has been adopted by 94% of the Forbes AI 50, has 100 million users in total, and counts OpenAI, Cursor, Figma, Ramp and Vercel among its customers. In the rapidly evolving AI landscape, the company saw a need to move beyond simple task-based workflows toward goal-oriented reasoning systems that let agents autonomously select, coordinate and execute tools across connected environments.

Sachs noted that reasoning models have very quickly become “much better” at learning to operate tools and follow chain-of-thought (CoT) instructions. This allows them to be “much more independent” and make multiple decisions within a single agentic workflow. “We rebuilt our AI system to adapt to this,” she said.

From an engineering perspective, that meant replacing rigid, prompt-based flows with a unified orchestration model, Sachs explained. This core model is supported by modular subagents that search Notion and the web, query and update databases, and edit content. Each agent uses tools contextually; for example, it can decide whether to search Notion itself or another platform such as Slack, and will keep issuing follow-up searches until it finds the right information. It can then turn notes into proposals, draft follow-up messages, track tasks, and detect and apply updates to knowledge bases.

In Notion 2.0, the team focused on having the AI perform specific tasks, which required them to think “extensively” about how to prompt the model, Sachs noted. In version 3.0, by contrast, users can assign tasks to agents, and agents can actually take actions and perform multiple tasks simultaneously. “We reorganized it to be about self-selecting tools rather than a bunch of few-shot prompts that explicitly tell you how to get through all these different scenarios,” Sachs explained. The goal is to make everything work with AI, so that “anything you can do, your Notion agent can do.”
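The pattern described above — a single orchestration loop in which the model self-selects tools from a registry, rather than a hand-written prompt per scenario — can be sketched roughly as follows. This is a minimal illustration, not Notion's actual implementation: the tool names and the rule-based `pick_tool` stand-in for the reasoning model are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tool:
    name: str
    description: str          # the model chooses tools from these descriptions
    run: Callable[[str], str]

# Hypothetical tool registry; real tools would call search/edit APIs.
TOOLS = {
    "search_workspace": Tool("search_workspace", "Search pages in the workspace",
                             lambda q: f"3 pages matching '{q}'"),
    "search_web": Tool("search_web", "Search the public web",
                       lambda q: f"web results for '{q}'"),
    "edit_page": Tool("edit_page", "Append content to a page",
                      lambda text: f"page updated with '{text}'"),
}

def pick_tool(goal: str, history: list) -> Optional[str]:
    """Stand-in for the reasoning model: given the goal and the
    observations so far, pick the next tool, or None to stop."""
    if not history:
        return "search_workspace"
    if "matching" in history[-1]:   # found relevant pages -> act on them
        return "edit_page"
    return None                     # goal satisfied

def run_agent(goal: str, max_steps: int = 5) -> list:
    """Agent loop: plan, act, observe, repeat until the model stops."""
    history = []
    for _ in range(max_steps):
        choice = pick_tool(goal, history)
        if choice is None:
            break
        history.append(TOOLS[choice].run(goal))
    return history
```

The key design point is that the loop is generic: adding a capability means registering a tool, not writing a new prompt-based flow.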

Bifurcating evaluation to isolate hallucinations

Notion’s “better, faster, cheaper” philosophy drives a continuous iteration cycle that balances latency and accuracy through fine-tuned vector embeddings and versatile search optimization. Sachs’ team employs an exacting evaluation framework that combines deterministic testing, prompt optimization, human-annotated data, and LLM-as-a-judge, with model-based scoring flagging discrepancies and inaccuracies. “By bifurcating the evaluation, we are able to determine where problems are coming from, which helps us isolate hallucinations,” Sachs explained.

Moreover, simplifying the architecture itself makes it easier to adapt as models and techniques evolve. “We optimize latency and parallelize wherever possible,” which leads to “significantly greater accuracy,” Sachs noted. The models draw on data from the web and the connected Notion workspace. Ultimately, Sachs said, the investment in rebuilding the architecture has already paid off for Notion in performance and a faster pace of iteration. She added: “We are completely open to rebuilding it again if necessary when the next breakthrough occurs.”
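The bifurcation idea — cheap deterministic checks on one axis, model-based scoring on the other, with the combination pointing at where a failure came from — might look like this toy sketch. Everything here is an assumption for illustration: `judge_score` is a heuristic stand-in for an LLM-as-a-judge call, and the failure labels are invented.

```python
def deterministic_checks(answer: str, must_cite: str) -> bool:
    """Exact, reproducible check: is the answer grounded in the
    required source? (Illustrative stand-in for real checks.)"""
    return must_cite in answer

def judge_score(answer: str) -> float:
    """Stand-in for an LLM-as-a-judge call returning a 0-1 quality
    score; a toy heuristic here so the example is runnable."""
    return 1.0 if "source:" in answer else 0.2

def bifurcated_eval(answer: str, must_cite: str) -> str:
    """Combine both axes to localize the failure mode."""
    grounded = deterministic_checks(answer, must_cite)
    fluent = judge_score(answer) >= 0.5
    if grounded and fluent:
        return "pass"
    if not grounded and fluent:
        return "hallucination"   # reads well but isn't grounded
    if grounded and not fluent:
        return "presentation"    # grounded but judged low quality
    return "retrieval"           # neither: likely a search failure
```

Separating the two signals is what makes the diagnosis possible: a single blended score would say *that* quality dropped, not *why*.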

Understanding latency in context

When building and tuning models, it helps to remember that latency is subjective: AI must deliver the most relevant information, not necessarily the most information at the expense of speed. “You’d be surprised in how many ways customers are willing to wait for some things and not for others,” Sachs said. This leads to a compelling experiment: how slow can you go before people abandon the model? In a purely navigational search, for example, users are not very patient; they want answers almost immediately. “If you ask, ‘What’s two plus two,’ you don’t want to have to wait while your agent searches everywhere in Slack and JIRA,” Sachs noted. But the more time it is given, the more thorough a reasoning agent can be. Notion, for example, can perform 20 minutes of autonomous work across hundreds of pages, files, and other materials. In such cases, users prefer to wait, Sachs explained; they let the model run in the background while they attend to other tasks. “It’s a product question,” Sachs said. “How do we set user expectations in the UI? How do we set user expectations around latency?”
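One way to operationalize "latency is contextual" is to classify each request and attach a latency budget and UI mode to the class, so a navigational lookup blocks briefly while a deep-research job runs in the background. The categories, thresholds, and keyword classifier below are illustrative assumptions, not Notion's actual routing.

```python
# Hypothetical per-class latency budgets and UI behavior.
BUDGETS = {
    "navigational": {"budget_s": 2,    "mode": "blocking"},    # answer now
    "question":     {"budget_s": 15,   "mode": "streaming"},   # show progress
    "research":     {"budget_s": 1200, "mode": "background"},  # ~20 min jobs
}

def classify(query: str) -> str:
    """Toy classifier; a real system might use a small model here."""
    q = query.lower()
    if q.startswith(("open", "go to", "find page")):
        return "navigational"
    if any(w in q for w in ("summarize all", "audit", "across")):
        return "research"
    return "question"

def plan_request(query: str) -> dict:
    """Pick a budget and UI mode before any expensive work starts."""
    kind = classify(query)
    return {"kind": kind, **BUDGETS[kind]}
```

The product decision Sachs describes lives in the `mode` field: the same agent backend feels fast or slow depending on whether the UI blocks, streams, or hands the user back control.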

Notion is its own biggest user

Notion understands the importance of using its own product; in fact, its employees are among its biggest power users. Sachs explained that teams have dynamic sandboxes that generate training and evaluation data, as well as a “really active” thumbs-up/thumbs-down user feedback loop. Users aren’t shy about saying what they think should be improved or what features they’d like to see. Sachs emphasized that when a user thumbs-downs an interaction, they are expressly consenting to that interaction being reviewed by a human annotator, in as de-identified a way as possible. “As a company, we use our own tool all day, every day, so we get really quick feedback,” Sachs said. “We’re actually testing our own product.”

That said, they build the product themselves, Sachs noted, so they understand they may have blind spots when it comes to quality and functionality. To offset this, Notion relies on “very AI-savvy” design partners who get early access to new capabilities and provide substantial feedback. Sachs emphasized that this is just as important as internal prototyping. “Our goal is to experiment in the open. I think you can get much richer feedback,” Sachs said. “Because, ultimately, if we only look at how Notion uses Notion, we’re not really providing the best experience for our customers.”

Just as important, continuous internal testing lets teams gauge progress and ensure models don’t regress (that is, that accuracy and performance don’t deteriorate over time). “Everything you ship stays accurate,” Sachs explained. “You know your latency is within bounds.”

Many companies make the mistake of focusing too intently on retrospective evals; this makes it hard for them to understand how and where they are improving, Sachs noted. Notion treats evaluations as a “litmus test” for forward-looking development and progress, separate from the evaluations it uses for observability and regression testing. “I think a big mistake that a lot of companies make is conflating the two,” Sachs said. “We use them for both purposes; we think about them really differently.”
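The distinction Sachs draws — a forward-looking "frontier" suite that is allowed to fail versus a regression suite that must stay green — can be sketched as two separately tracked scores with only one of them blocking a release. The suite names, the 0.95 floor, and the `evaluate` helper are illustrative assumptions.

```python
def run_suite(cases, model):
    """Score a model on (question, expected_answer) pairs."""
    return sum(model(q) == a for q, a in cases) / len(cases)

def evaluate(model, regression_cases, frontier_cases,
             regression_floor=0.95):
    """Keep the two eval purposes separate:
    - regression suite: must stay above the floor (ship-blocking)
    - frontier suite: measures progress toward new capabilities
      and is tracked over time, but never blocks a release."""
    reg = run_suite(regression_cases, model)
    frontier = run_suite(frontier_cases, model)
    return {
        "ship_blocking": reg < regression_floor,
        "regression_score": reg,
        "frontier_score": frontier,
    }
```

Conflating the two, as Sachs warns, would mean a hard frontier case could block a release — or, worse, a regression could hide inside an improving average.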

Lessons from Notion’s journey

For enterprises, Notion can serve as a blueprint for responsibly and dynamically operationalizing agent-based AI in a connected enterprise workspace. Sachs’ tips for other tech leaders:

  • Don’t be afraid to rebuild when your core capabilities change; Notion has completely redesigned its architecture to accommodate reasoning-based models.

  • Treat latency as contextual: optimize by use case, not universally.

  • Ground all outputs in reliable, curated enterprise data to ensure accuracy and trust. She advised: “Be willing to make difficult decisions. Be willing to be at the forefront, so to speak, of what you are building, to create the best product you can for your customers.”
