Transformer architecture currently powers the most popular public and private AI models. So we wonder: what's next? Is this the architecture that will lead to better reasoning? What might come after transformers? Today, models need vast amounts of data, GPU compute, and rare talent to build intelligence, which typically makes them expensive to build and maintain.
AI adoption started small, with basic chatbots being made more knowledgeable. Now, startups and enterprises have figured out how to package intelligence as co-pilots that augment human knowledge and skills. The next natural step is to package capabilities like multi-step workflows, memory, and personalization into agents that can solve use cases across multiple functions, including sales and engineering. The expectation is that a simple prompt from a user will enable an agent to classify intent, break a goal down into multiple steps, and complete a task, whether that involves searching the web, authenticating across multiple tools, or learning from past behavior.
These agents, when applied to consumer use cases, are starting to give us a sense of a future where everyone could have a personal, Jarvis-like agent on their phone that understands them. Want to book a trip to Hawaii, order food from your favorite restaurant, or manage your personal finances? A future where you and I can safely manage these tasks with personalized agents is possible, but from a technology perspective, we are still a long way off.
Is transformer architecture the final frontier?
The self-attention mechanism of the transformer architecture allows the model to weigh the importance of each input token relative to all tokens in the input sequence simultaneously. This improves the model's understanding of language and computer vision by capturing long-range dependencies and complex token relationships. However, it also means that computational complexity grows quadratically with sequence length, leading to poor performance and high memory consumption on long sequences (e.g., DNA). Several solutions and research approaches aim to solve the long-sequence problem:
- Optimizing transformers on hardware: A promising technique here is FlashAttention. The paper argues that transformer performance can be improved by carefully managing reads and writes across the different levels of fast and slow memory on GPUs. This is done by making the attention algorithm IO-aware, reducing the number of reads/writes between GPU high-bandwidth memory (HBM) and static random-access memory (SRAM).
- Approximate attention: Self-attention has O(n^2) complexity, where n is the length of the input sequence. Is there a way to reduce this quadratic computational complexity to linear so that transformers can better handle long sequences? Optimizations here include techniques such as Reformer, Performers, Skyformer, and others.
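To make the quadratic bottleneck concrete, here is a minimal NumPy sketch contrasting standard self-attention, which materializes an n × n score matrix, with a kernelized linear-attention approximation in the spirit of Performers (the feature map below is a simple illustrative choice, not the random-feature map from the actual paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard self-attention: the (n x n) score matrix is what makes
    # cost and memory grow quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # shape (n, n)
    return softmax(scores) @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1):
    # Kernelized "linear attention": replace softmax with a positive
    # feature map phi so phi(Q) @ (phi(K).T @ V) costs O(n * d^2)
    # instead of O(n^2 * d). phi here is an illustrative choice.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                              # shape (d, d)
    z = Qp @ Kp.sum(axis=0)                    # normalizer, shape (n,)
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out_full = full_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_full.shape, out_lin.shape)  # both (128, 16)
```

The linear variant never builds the n × n matrix, which is the whole point: memory and compute scale with sequence length rather than its square.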
In addition to these optimizations aimed at reducing transformer complexity, some alternative models challenge the dominance of transformers (though it is still too early for most of them):
- State space models (SSMs): a class of models related to recurrent (RNN) and convolutional (CNN) neural networks that compute with linear or near-linear complexity on long sequences. SSMs such as Mamba handle long-range dependencies better but lag behind transformers in overall performance.
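The linear-time character of SSMs comes from processing a sequence in a single recurrent pass. The toy sketch below shows only this basic linear recurrence; Mamba's key addition, making the A, B, C parameters input-dependent ("selective"), is omitted:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    # Discrete state space model:
    #   x_t = A x_{t-1} + B u_t,   y_t = C x_t
    # One pass over the sequence -> cost linear in its length,
    # unlike self-attention's quadratic all-pairs comparison.
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

# Toy example: a stable 2-state system run over a 1,000-step input.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])
u = np.sin(np.linspace(0, 10, 1000))
y = ssm_scan(A, B, C, u)
print(y.shape)  # (1000,)
```

Because the state x is a fixed-size summary of everything seen so far, generation does not need to re-read the whole context for each new token, which is exactly the throughput advantage the Jamba discussion below alludes to.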
These research approaches have now moved out of university labs and are available in the public domain, in the form of new models, for anyone to try. Furthermore, the latest model releases can tell us about the state of the underlying technology and the viability of alternatives to the transformer.
Major model launches
We continue to hear about the latest and greatest model launches from the usual suspects, such as OpenAI, Cohere, Anthropic, and Mistral. Meta's foundation model for compiler optimization stands out for its effectiveness in code and compiler optimization.
In addition to the dominant transformer architecture, we now see production-grade state space models (SSMs), hybrid SSM-transformer models, mixture-of-experts (MoE) models, and composition-of-experts (CoE) models. These seem to perform well on many benchmarks compared to state-of-the-art open-source models. The ones that stand out are:
- Databricks' open-source DBRX model: This MoE model has 132B parameters. It has 16 experts, four of which are active at any one time during inference or training. It supports a 32K context window, and the model was trained on 12T tokens. Some other interesting details: it took three months, $10M, and 3,072 Nvidia GPUs connected over 3.2Tbps InfiniBand to complete pre-training, post-training, evaluation, red-teaming, and model refinement.
- SambaNova Systems' Samba CoE v0.2: This CoE model is a composition of five 7B-parameter experts, only one of which is active at inference time. The experts are all open-source models, and alongside them the model has a router that understands which model is best for a given query and routes the request to it. It is remarkably fast, generating 330 tokens per second.
- AI21 Labs' Jamba, a hybrid Transformer-Mamba MoE model: It is the first production-grade Mamba-based model with elements of the classic transformer architecture. "Transformer models have two drawbacks: first, their high memory and compute requirements make it difficult to process long contexts, where the size of the key-value (KV) cache becomes a limiting factor. Second, the lack of a single summary state results in slow inference and low throughput, since each generated token performs a computation on the entire context." SSMs like Mamba handle long-range dependencies better but lag behind transformers in performance. Jamba compensates for the inherent limitations of a pure SSM model, offering a 256K context window and fitting 140K of context on a single GPU.
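The routing idea behind a composition of experts can be sketched in a few lines. The expert names, embeddings, and keyword scoring below are hypothetical stand-ins for a learned router, not SambaNova's actual design:

```python
import numpy as np

# Illustrative composition-of-experts routing: a lightweight router
# scores each query against per-expert embeddings and forwards the
# request to the single best-matching expert, so only one expert's
# weights are exercised per query.
EXPERTS = {
    "code":    np.array([0.9, 0.1, 0.0]),
    "finance": np.array([0.1, 0.9, 0.1]),
    "general": np.array([0.2, 0.2, 0.5]),
}

def embed(query: str) -> np.ndarray:
    # Stand-in for a real embedding model: crude keyword features.
    q = query.lower()
    return np.array([
        1.0 if "python" in q or "bug" in q else 0.0,
        1.0 if "revenue" in q or "stock" in q else 0.0,
        1.0,  # bias feature so every query matches something
    ])

def route(query: str) -> str:
    # Dot-product similarity between the query and each expert profile.
    v = embed(query)
    scores = {name: float(v @ e) for name, e in EXPERTS.items()}
    return max(scores, key=scores.get)

print(route("Fix this Python bug"))       # -> "code"
print(route("Summarize revenue trends"))  # -> "finance"
```

The design trade-off is that inference cost stays near a single expert's cost, while quality depends heavily on how well the router classifies queries.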
Challenges related to implementing solutions in enterprises
While recent research and model launches hold great promise as the next frontier, we must also consider the technical challenges that prevent companies from realizing this opportunity:
- Lack of enterprise features causes frustration: Imagine selling to CXOs without basics like role-based access control (RBAC), single sign-on (SSO), or access to logs (both prompts and outputs). Today's models may not be enterprise-ready, but enterprises are creating separate budgets to make sure they don't miss out on the next big thing.
- Breaking what once worked: AI co-pilots and agents make it harder to secure data and applications. Consider a simple use case: a video conferencing app you use every day introduces AI summary features. As a user, you might love receiving transcripts after a meeting, but in regulated industries this enhanced feature can suddenly become a nightmare for CISOs. As a result, what used to work well is effectively broken and must undergo additional security review. Enterprises need safeguards to ensure data privacy and compliance as SaaS applications introduce such features.
- The constant battle of RAG vs. fine-tuning: You can implement both together, or one of the two, without sacrificing much. Think of retrieval-augmented generation (RAG) as a way to make sure facts are represented correctly and information is up to date, while fine-tuning yields the best possible model quality. Fine-tuning is hard, which is why some model vendors discourage it. It also brings the challenge of overfitting, which adversely affects model quality. Fine-tuning seems to be under pressure from several sides: as model context windows grow and token costs decrease, RAG may become the better deployment option for enterprises. In the context of RAG, Cohere's recently launched Command R+ model is the first open-weights model to beat GPT-4 in the Chatbot Arena. Command R+ is a state-of-the-art RAG-optimized model designed to power enterprise-grade workflows.
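A minimal RAG loop is retrieve-then-prompt. The documents, word-overlap scoring, and prompt template below are illustrative assumptions; real systems use an embedding model and a vector store:

```python
import re

# Toy document store; real deployments index thousands of chunks.
DOCS = [
    "Command R+ supports a 128K context window.",
    "DBRX is a 132B-parameter mixture-of-experts model.",
    "Jamba is a hybrid Transformer-Mamba MoE model.",
]

def tokens(text: str) -> set:
    # Crude stand-in for an embedding model: lowercase word sets.
    return set(re.findall(r"[a-z0-9+]+", text.lower()))

def retrieve(query: str) -> str:
    # Pick the document sharing the most words with the query.
    q = tokens(query)
    return max(DOCS, key=lambda d: len(q & tokens(d)))

def build_prompt(query: str) -> str:
    # Splice the retrieved snippet into the prompt so the model
    # answers from current data rather than only from what it
    # memorized during training.
    return f"Context: {retrieve(query)}\nQuestion: {query}\nAnswer:"

print(build_prompt("What context window does Command R+ support?"))
```

The appeal for enterprises is that updating the document store updates the model's knowledge immediately, with no retraining and no overfitting risk.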
I recently spoke with an AI leader at a large financial institution who argued that the future does not belong to software engineers but to creative English/art majors who can craft an effective prompt. There may be some truth to this. With a simple sketch and multi-modal models, non-technical people can build simple applications with little effort. Knowing how to use such tools can be a superpower and will help anyone who wants to succeed in their career.
The same is true for researchers, practitioners, and founders. There are now many architectures to choose from as they seek to make their underlying models cheaper, faster, and more accurate. There are also many ways to adapt models for specific use cases, including fine-tuning techniques and newer breakthroughs such as direct preference optimization (DPO), an algorithm that can be seen as an alternative to reinforcement learning from human feedback (RLHF).
With so much rapid change in the generative AI space, it can be hard for founders and buyers alike to prioritize. I look forward to seeing what comes next from those building something new.
Ashish Kakran is a director at Thomvest Ventures, focused on investing in early-stage startups in cloud, data/machine learning, and cybersecurity.
