Friday, March 13, 2026

Why You Need RAG to Stay Relevant as a Data Scientist



Photo by Author | Canva

If you work in a data-related field, you should update your skills regularly. Data scientists use a variety of tools for tasks such as data visualization, data modeling, and even warehouse systems.

And AI has changed data science from A to Z.

In this article, we will break down RAG, starting from the academic paper that introduced it and moving on to how it is now used to reduce costs when working with large language models (LLMs). But let's cover the basics first.

What is Retrieval-Augmented Generation (RAG)?


Patrick Lewis first introduced RAG in this academic paper in 2020. It combines two key elements: a retriever and a generator.

The idea behind it is simple. Instead of generating answers purely from its parameters, RAG collects relevant information from documents.

What is a Retriever?

The retriever is used to collect relevant information from the document. But how?

Consider this: you have a massive Excel sheet, say 20 MB with thousands of rows, and you want to look up the call_date for user_id = 10234.

Thanks to the retriever, instead of scanning the entire document, RAG will search only the relevant part.


But why is that helpful? If you send the whole document, you will spend many tokens. As you probably know, LLM API usage is billed by tokens.

Visit https://platform.openai.com/tokenizer to see how these calculations are made. For example, pasting the introduction of this article into it costs 123 tokens.
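For quick back-of-the-envelope estimates without visiting the tokenizer page, a common rule of thumb is roughly four characters per token for English text. The helper below is a minimal sketch based on that heuristic (the 4-characters-per-token ratio is an approximation, not an exact tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of
    thumb for English. For exact counts, use OpenAI's tokenizer page
    or a real tokenizer library such as tiktoken."""
    return max(1, round(len(text) / 4))

intro = (
    "If you work in a data-related field, you should update "
    "your skills regularly."
)
print(estimate_tokens(intro))  # a rough estimate, not an exact count
```

Real token counts vary with the model's vocabulary, so treat this only as a sanity check before sending large documents.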

You should check this when calculating the cost of using an LLM API. For example, a 10 MB Word document can amount to thousands of tokens. Every time you send that document through the LLM API, the cost multiplies.

Using RAG, you send only the relevant part of the document, reducing the number of tokens so you pay less. It is that simple.
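The savings are easy to quantify. The sketch below uses a hypothetical price of $2.00 per million input tokens (an illustrative number only; check your provider's current price list) and hypothetical token counts for a full document versus retrieved chunks:

```python
# Hypothetical pricing, for illustration only.
PRICE_PER_MILLION_TOKENS = 2.00  # USD per 1M input tokens

def cost_usd(tokens: int) -> float:
    """Cost of sending `tokens` input tokens at the assumed rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

full_document = 50_000   # sending the whole document every call
rag_chunks = 1_500       # sending only the retrieved chunks

calls = 1_000  # e.g. one call per customer query
print(f"Full document: ${cost_usd(full_document) * calls:.2f}")
print(f"RAG chunks:    ${cost_usd(rag_chunks) * calls:.2f}")
```

Under these assumptions, sending the whole document costs $100 over a thousand calls, versus $3 with retrieval, and the gap widens as documents and traffic grow.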


How does the retriever do this?

Before retrieval, the documents are divided into small chunks, such as paragraphs. Each chunk is transformed into a dense vector using an embedding model (OpenAI embeddings, Sentence Transformers, etc.).

So when the user asks a question, such as "What is the call date?", the retriever compares the query vector with all the chunk vectors and selects the most similar ones. Brilliant, right?
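The comparison step can be sketched in a few lines. The example below is a toy illustration: it substitutes simple bag-of-words count vectors for real dense embeddings (which the article says come from models like OpenAI embeddings or Sentence Transformers), but the cosine-similarity ranking logic is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. Real retrievers
    # use dense vectors from an embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical document chunks.
chunks = [
    "user_id 10234 call_date 2024-05-01",
    "quarterly revenue report for Q2",
    "shipping address updated for user 99",
]

query = "what is the call_date for user_id 10234"
q_vec = embed(query)
best = max(chunks, key=lambda c: cosine(q_vec, embed(c)))
print(best)  # the chunk mentioning user_id 10234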

What is a Generator?

As explained above, once the most relevant documents are found, the generator takes over. It generates an answer using the user's query and the retrieved documents.

This method also minimizes the risk of hallucinations. Instead of generating a response freely from the data the AI was trained on, the model grounds its response in the actual document.
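In practice, grounding usually means assembling the retrieved chunks into the prompt and instructing the model to answer only from them. Here is a minimal sketch of that assembly step; the exact wording and the `build_grounded_prompt` helper are my own illustration, not a fixed standard:

```python
def build_grounded_prompt(query: str, retrieved: list[str]) -> str:
    """Combine the user's query with retrieved chunks as grounding
    context. Telling the model to answer only from the context is a
    common hallucination guard."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

prompt = build_grounded_prompt(
    "What is the call_date for user_id 10234?",
    ["user_id 10234 call_date 2024-05-01"],
)
print(prompt)  # this string is what gets sent to the LLM
```

The generator's output quality then depends heavily on retrieval quality: if the wrong chunks are retrieved, even a well-grounded prompt cannot save the answer.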

Evolution of the Context Window

Early models had small context windows: GPT-2, for example, was limited to 1,024 tokens, and GPT-3 to 2,048. That is why those models had no file upload feature. If you remember, ChatGPT only introduced file uploads several model generations later, once the context window had grown.

Advanced models, such as GPT-4o, have a 128K-token limit, which supports file uploads and can make RAG look redundant from a context-window perspective. But cost reduction is still in demand.

So one reason users apply RAG today is cost reduction, but it is not the only one. Even as LLM usage costs fall, GPT-4.1 has pushed the context window to 1 million tokens, a fantastic leap. RAG has evolved alongside it.

Industry Practice

LLMs are now evolving into agents. They are meant to automate your tasks, not just generate answers. Some companies are even developing models that control your keyboard and mouse.

In these cases, you cannot risk hallucinations. That is where RAG enters the stage. In this section, we will analyze one real-world example in depth.

Companies are looking for talent to develop agents. And not only large companies; even medium or small businesses and startups are exploring their options. You can find these jobs on freelance websites such as Upwork and Fiverr.

Marketing Agent

Say a medium-sized European company wants you to build an agent that generates marketing proposals for its clients using the company's documents.

What's more, the agent should incorporate relevant hotel information into these proposals for business events or campaigns.

But there is a problem: the agent often hallucinates. Why does this happen? Because instead of relying only on the company's documents, the model pulls information from its original training data. That training data can be outdated because, as you know, LLMs are not continuously retrained.

As a result, the AI ends up adding nonexistent hotels or simply irrelevant information. Now you have pinpointed the root cause of the problem: a lack of reliable information.

This is where RAG comes in. Using a web-browsing API, they had the LLM fetch reliable information from the internet and reference it while generating answers. Let's look at the prompt.

"Generate a proposal based on the company's information and tone of voice, and use web search to find hotel names."

Here, the web search function acts as the retrieval step of RAG.
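The agent's flow can be sketched as follows. Note that `web_search` below is a hypothetical stand-in for a real browsing or search API (the article does not name a specific one), and the final prompt string is what would actually be sent to the LLM:

```python
def web_search(query: str) -> list[str]:
    # Hypothetical stand-in for a real web-browsing/search API.
    # A production agent would call the provider's search tool here
    # and return live snippets; this stub returns a canned result.
    return [
        "Hotel Europa, Berlin - conference rooms for up to 200 guests"
    ]

def build_proposal_prompt(company_docs: str, client_brief: str) -> str:
    """Assemble the generation prompt: company documents plus
    freshly retrieved hotel snippets ground the proposal, instead of
    letting the model recall hotels from stale training data."""
    snippets = web_search(f"business event hotels for: {client_brief}")
    return (
        "Generate a marketing proposal.\n\n"
        f"Company info and tone of voice:\n{company_docs}\n\n"
        "Verified hotel options (from web search):\n"
        + "\n".join(f"- {s}" for s in snippets)
        + f"\n\nClient brief: {client_brief}"
    )

print(build_proposal_prompt("Tone: formal, concise.",
                            "product launch event in Berlin"))
```

Because the hotel names arrive via retrieval rather than the model's memory, the proposal can only cite hotels that the search actually returned.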

Final Thoughts

In this article, we explored the evolution of AI models and why people use RAG. As you can see, the motivation has shifted over time, but the core concern remains the same: performance.

Whether the reason is cost or speed, this method will keep being used in AI tasks. And by "AI-related" I do not exclude data science, because, as you probably know, with the current AI boom, data science has also been deeply affected by AI.

If you want to read similar articles, solve over 700 data science interview questions, and work through over 50 data projects, visit my platform.

Nate Rosidi is a data scientist and works in product strategy. He also teaches analytics and is the founder of StrataScratch, a platform that helps data scientists prepare for interviews with real questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
