Photo by the author | Canva
We rely on large language models for many of our daily tasks. These models have been trained on billions of web documents and diverse datasets, which lets them understand and respond in natural, human-like language. However, not all LLMs are built the same way. While the core idea remains similar, they differ in their underlying architectures, and those differences have a significant impact on their capabilities. For example, as various benchmarks show, DeepSeek excels at reasoning tasks, Claude performs well at coding, and ChatGPT stands out at creative writing.
In this article, I will walk through 7 popular LLM architectures to give you a clear overview, all in a matter of minutes. Let’s start.
1. BERT
Paper link: https://arxiv.org/pdf/1810.04805
Developed by Google in 2018, BERT marked a significant shift in natural language understanding by introducing deep bidirectional attention to language modeling. Unlike earlier models, which read text left-to-right or right-to-left, BERT uses a Transformer encoder to consider both directions at once. It is trained with two tasks: masked language modeling (predicting randomly masked words) and next sentence prediction (determining whether one sentence logically follows another). Architecturally, BERT comes in two sizes: BERT Base (12 layers, 110M parameters) and BERT Large (24 layers, 340M parameters). Its structure relies solely on encoder stacks and includes special tokens such as [CLS], which represents the full sequence, and [SEP], which separates two sentences. You can fine-tune it for tasks such as sentiment analysis, question answering (e.g., SQuAD), and many more. It was the first model of its kind to truly grasp the full meaning of sentences.
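To make the two pre-training objectives concrete, here is a minimal sketch in plain Python of how BERT-style inputs are packed with [CLS]/[SEP] tokens and how masked language modeling hides tokens. The toy whitespace "tokenizer" and 15% mask rate are illustrative stand-ins for BERT's real WordPiece vocabulary and masking scheme.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def format_pair(sentence_a, sentence_b):
    """Pack two sentences the way BERT expects:
    [CLS] tokens_a [SEP] tokens_b [SEP], plus segment ids (0/1)
    telling the model which sentence each token belongs to."""
    a, b = sentence_a.split(), sentence_b.split()
    tokens = [CLS] + a + [SEP] + b + [SEP]
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: randomly replace ~15% of ordinary
    tokens with [MASK]; the model must predict the originals."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # target the model learns to recover
        else:
            masked.append(tok)
            labels.append(None)  # no loss computed at this position
    return masked, labels

tokens, segments = format_pair("the cat sat", "it was tired")
print(tokens)    # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The [CLS] position's final hidden state is what fine-tuned classifiers (e.g., for sentiment analysis) read, which is why it is never masked.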
2. GPT
3. LLaMA
Blog link: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Paper link (Lama 3): https://arxiv.org/abs/2407.21783
LLaMA, developed by Meta AI and first released in February 2023, is a family of open-source decoder-only Transformer models. It ranges from 7 billion to 70 billion parameters, with the latest version, Llama 4, released in April 2025. Like GPT, LLaMA uses a decoder-only Transformer architecture (each model is an autoregressive Transformer), but with some architectural refinements. For example, the original LLaMA models used SwiGLU activations instead of GELU, rotary positional embeddings (RoPE) instead of absolute positional embeddings, and RMSNorm instead of LayerNorm. The LLaMA family was released in multiple sizes, from 7B to 65B parameters in LLaMA 1, and even larger in Llama 3, making large-scale models widely available. Notably, despite their relatively modest parameter counts, these models performed competitively with much larger contemporaries: Meta reported that the LLaMA 13B model outperformed OpenAI's 175B GPT-3 on many benchmarks, and that its 65B model was competitive with the leading models of the time. LLaMA's open (though initially research-gated) release sparked extensive community use; its key novelty was combining efficient training at scale with broader access to model weights.
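Two of the architectural swaps mentioned above are simple enough to sketch in plain Python (no framework, scalar lists instead of tensors): RMSNorm drops LayerNorm's mean-centering and rescales by the root-mean-square alone, and SwiGLU gates one projection with the SiLU of another instead of applying a plain GELU/ReLU.

```python
import math

def layer_norm(x, eps=1e-6):
    """Classic LayerNorm: subtract the mean, divide by the std-dev."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm (used by LLaMA): no mean-centering, no bias; just
    divide by the root-mean-square. Cheaper, and works as well."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def silu(a):
    """SiLU (a.k.a. swish): a * sigmoid(a)."""
    return a / (1 + math.exp(-a))

def swiglu(gate, value):
    """SwiGLU: one linear projection (gate) passed through SiLU
    multiplies a second projection (value), element-wise."""
    return [silu(g) * v for g, v in zip(gate, value)]

x = [2.0, -1.0, 0.5, 3.5]
print(rms_norm(x))  # output has root-mean-square ~1
```

In the real model the learned gain vectors and the two linear layers feeding SwiGLU are omitted here for brevity.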
4. PaLM
5. Gemini
6. Mistral
Paper link (Mistral 7B): https://arxiv.org/abs/2310.06825
Mistral is a French AI startup that released its first LLM in 2023. Its flagship model, Mistral 7B (September 2023), is a decoder-only Transformer with 7.3 billion parameters. Architecturally, Mistral 7B is similar to a GPT-style model but includes optimizations for faster inference: it uses grouped-query attention (GQA) to speed up self-attention and sliding-window attention to handle longer contexts more efficiently. In terms of performance, Mistral 7B outperformed Meta's Llama 2 13B and even delivered strong results against 34B models, all while being much smaller. Mistral AI released the model under the Apache 2.0 license, making it freely available for use.

Its next major release was Mixtral 8×7B, a sparse mixture-of-experts (MoE) model with eight 7B-parameter expert networks per layer. This design helped Mixtral match or beat GPT-3.5 and Llama 2 70B on tasks such as math, coding, and multilingual benchmarks.

In May 2025, Mistral released Mistral Medium 3, a proprietary mid-sized model aimed at enterprises. It delivers over 90% of the benchmark scores of more expensive models, such as Claude 3.7 Sonnet, while dramatically reducing cost per token (about $0.40 versus $3.00). It supports multimodal tasks (text + images) and professional reasoning, and is offered via an API or for local deployment on as few as four GPUs. Unlike its earlier models, however, Medium 3 is closed-source, prompting community criticism that Mistral is drifting away from its open-source ethos. Shortly after, in June 2025, Mistral introduced Magistral, its first model dedicated to explicit reasoning. The small version is open under Apache 2.0, while Magistral Medium is enterprise-only. Magistral Medium scored 73.6% on AIME 2024, with the small version reaching 70.7%, demonstrating strong mathematical and logical skills across many languages.
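The sliding-window attention described above can be sketched as an attention mask: each position attends causally, but only to the previous `window` tokens, so per-token attention cost stays bounded regardless of sequence length. This is a toy boolean mask in plain Python; the window size here is illustrative (Mistral 7B's actual window is 4096 tokens).

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j:
    causal (j <= i) AND within the last `window` positions."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Visualize a length-6 sequence with a window of 3:
for row in sliding_window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))
```

Stacking layers lets information still flow beyond the window (a token can reach `window * n_layers` positions back), which is why the restriction costs little in quality.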
7. DeepSeek
Paper link (Deepseek-R1): https://arxiv.org/abs/2501.12948
DeepSeek is a Chinese AI company (a spin-off of High-Flyer, founded in 2023) that develops large LLMs. Its latest models (such as DeepSeek V3 and DeepSeek-R1) use a highly sparsely-activated mixture-of-experts Transformer architecture. In DeepSeek V3/R1, each Transformer layer has hundreds of expert sub-networks, but only a few are activated per token. Instead of running every part of the model at once, the model hosts hundreds of expert networks and activates only a handful (such as 9 out of 257) depending on what each input needs. This lets DeepSeek scale to an enormous total size (over 670 billion parameters) while using only about 37 billion during each response, making it much faster and cheaper to run than a dense model of similar size. Like other modern LLMs, it uses SwiGLU activations, rotary embeddings (RoPE), and advanced optimizations (including experimental FP8 precision during training) to make it more efficient. This aggressive MoE design lets DeepSeek achieve very high capability (comparable to much larger dense models) at lower compute cost. DeepSeek's models (released under open licenses) drew attention for competing with leading models such as GPT-4 in multilingual generation and reasoning, while significantly reducing training and inference requirements.
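The sparse routing idea can be shown in a few lines of plain Python: a learned router scores every expert, only the top-k actually run, and their outputs are blended with softmax weights. The expert count, scores, and k below are toy values for illustration, not DeepSeek's actual configuration, and the "experts" are scalar functions standing in for real feed-forward networks.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, router_scores, k=2):
    """Run only the k highest-scoring experts on input x and
    combine their outputs with softmax-normalized weights."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    weights = softmax([router_scores[i] for i in top])
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Eight tiny "experts"; real ones are full feed-forward blocks.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
scores = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.4, 0.1]  # from a learned router
out = moe_layer(3.0, experts, scores, k=2)  # only experts 1 and 3 execute
print(out)
```

The compute savings come from the fact that the six unselected experts are never evaluated; total parameters grow with the expert count, but per-token FLOPs grow only with k.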
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
