DeepSeek-R1 has certainly generated a lot of excitement and concern, especially for its competitor, OpenAI's o1 model. So we put the two to the test in a head-to-head comparison on a few basic data analysis and market research tasks.
To put the models on equal footing, we used Perplexity Pro Search, which now supports both o1 and R1. Our goal was to look beyond benchmarks and see whether the models can actually perform ad hoc tasks that require gathering information from the web, picking out the right data and performing simple calculations that would otherwise require significant manual effort.
Both models are impressive, but they make mistakes when the prompts lack specificity. o1 is slightly better at reasoning through tasks, but R1's transparency gives it an advantage in the cases (and there will be many) where it makes mistakes.
Here is a breakdown of a few of our experiments, along with links to the Perplexity pages where you can view the results yourself.
Calculating investment returns from the web
Our first test assessed whether the models could calculate return on investment (ROI). We considered a scenario in which a user invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla) on the first day of every month from January to December 2024. We asked the model to calculate the value of the portfolio at the current date.
To accomplish this task, the model would have to pull Mag 7 price information for the first day of each month, split the monthly investment evenly across the seven stocks ($20 per stock), sum up the purchases and calculate the portfolio value according to the current price of each stock.
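For reference, the calculation the models were being asked to perform is straightforward dollar-cost averaging. The sketch below shows the arithmetic with made-up placeholder prices (the real test used actual first-of-month quotes pulled from the web):

```python
# Sketch of the ROI task: invest $140 on the first trading day of each
# month, split evenly across the Magnificent Seven ($20 per stock),
# then value the accumulated shares at today's prices.
# All prices below are illustrative placeholders, not real quotes.

MONTHLY_TOTAL = 140.0
STOCKS = ["GOOGL", "AMZN", "AAPL", "META", "MSFT", "NVDA", "TSLA"]
PER_STOCK = MONTHLY_TOTAL / len(STOCKS)  # $20 per stock per month

# monthly_prices[ticker] -> 12 first-of-month prices (made up)
monthly_prices = {t: [100.0 + i for i in range(12)] for t in STOCKS}
# current_prices[ticker] -> latest price (made up)
current_prices = {t: 120.0 for t in STOCKS}

def portfolio_value(monthly_prices, current_prices, per_stock):
    """Accumulate fractional shares month by month, then mark to market."""
    shares = {t: 0.0 for t in monthly_prices}
    for ticker, prices in monthly_prices.items():
        for price in prices:
            shares[ticker] += per_stock / price  # fractional shares bought
    return sum(shares[t] * current_prices[t] for t in shares)

value = portfolio_value(monthly_prices, current_prices, PER_STOCK)
invested = MONTHLY_TOTAL * 12  # $1,680 total
roi = (value - invested) / invested
```

The point is that each step is mechanical; the hard part for the models was retrieving twelve correct price points per ticker, not the arithmetic itself.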
Both models failed at this task. o1 returned a list of stock prices for January 2024 and January 2025, along with a formula to calculate the portfolio value. However, it failed to compute the correct values and essentially said there would be no ROI. R1, for its part, made the mistake of only investing in January 2024 and calculating the returns for January 2025.
However, the models' reasoning traces were interesting. While o1 did not provide much detail on how it reached its result, R1's reasoning trace showed that it did not have the correct information because Perplexity's retrieval engine had failed to obtain the monthly stock price data (many retrieval-augmented generation applications fail not because of a lack of model ability, but because of poor retrieval). This proved to be an important bit of feedback that led us to our next experiment.

Reasoning over file content
We decided to run the same experiment as before, but instead of prompting the model to retrieve the information from the web, we decided to provide it in a text file. To do this, we copied the monthly stock data for each ticker from Yahoo! Finance into a text file and gave it to the model. The file contained each stock's name plus an HTML table that included the price for the first day of each month from January to December 2024, as well as the last recorded price. The data was not cleaned, both to reduce manual effort and to test whether the model could pick the relevant parts out of the data.
Again, both models failed to give the right answer. o1 seemed to extract the data from the file, but suggested the calculations be done manually in a tool such as Excel. Its reasoning trace was very vague and contained nothing useful for troubleshooting. R1 also failed and did not provide an answer, but its reasoning trace contained a lot of useful information.
For example, it was clear that the model had correctly parsed the HTML data for each stock and was able to extract the right information. It was also able to calculate the month-by-month investments, add them up and compute the final value according to the latest share price in the table. However, that final value remained stuck in the reasoning chain and never made it into the final answer. The model was also confused by the row in the Nvidia table marking the company's 10-for-1 stock split on June 10, 2024, and ended up miscalculating the final value of the portfolio.
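The split row is a genuinely easy thing to trip over: shares bought at pre-split prices must be multiplied by the split ratio before being valued at post-split prices. A minimal sketch of the adjustment, using made-up prices and the real split date from the article:

```python
# Sketch of the split adjustment that confused the model. Nvidia did a
# 10-for-1 split on June 10, 2024, so shares bought before that date
# must be multiplied by 10 before valuing them at post-split prices.
# Purchase prices below are illustrative, not real quotes.

from datetime import date

SPLIT_DATE = date(2024, 6, 10)
SPLIT_RATIO = 10  # 10-for-1

# (purchase date, price actually paid that day) - made-up numbers
purchases = [
    (date(2024, 1, 1), 500.0),  # pre-split price level
    (date(2024, 5, 1), 900.0),  # pre-split price level
    (date(2024, 7, 1), 120.0),  # post-split price level
]
AMOUNT_PER_PURCHASE = 20.0

def split_adjusted_shares(purchases, amount):
    """Total share count expressed in post-split shares."""
    total = 0.0
    for bought_on, price in purchases:
        shares = amount / price
        if bought_on < SPLIT_DATE:
            shares *= SPLIT_RATIO  # each pre-split share became 10 shares
        total += shares
    return total

shares = split_adjusted_shares(purchases, AMOUNT_PER_PURCHASE)
current_price = 130.0  # illustrative post-split price
value = shares * current_price
```

Forgetting the adjustment (or applying it to post-split purchases) is exactly the kind of error that a visible reasoning trace makes easy to spot.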

Again, the real differentiator was not the result itself, but the ability to examine how the model arrived at its answer. In this case, R1 gave us a better experience, enabling us to understand the model's limitations and how we could reformulate our prompts and format our data to get better results in the future.
Comparing data across the web
Another experiment we ran required the model to compare the stats of four leading NBA centers and determine which of them had the best improvement in field goal percentage (FG%) from the 2022/23 season to the 2023/24 season. This task required the model to perform multi-step reasoning over different data points. The catch in the prompt was that it included Victor Wembanyama, who only entered the league as a rookie in 2023.
Retrieval for this prompt was much easier, since player stats are widely reported on the web and are usually included in their Wikipedia and NBA profiles. Both models answered correctly (it's Giannis, in case you were curious), although depending on the sources they used, their numbers differed slightly. However, they did not realize that Wembanyama did not qualify for the comparison and instead gathered other stats from his time playing in Europe.
In its answer, R1 provided a better breakdown of the results, with a comparison table and links to the sources it used. The added context enabled us to correct the prompt. After we modified the prompt to specify that we were looking for FG% from NBA seasons, the model correctly excluded Wembanyama from the results.
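The multi-step comparison itself, once eligibility is stated explicitly, is simple to express. A sketch with placeholder FG% values (not real statistics) showing the filter the refined prompt implies, with a hypothetical rookie standing in for Wembanyama:

```python
# Sketch of the comparison task: find the center with the largest FG%
# improvement between NBA seasons, excluding any player without a
# 2022-23 NBA season (e.g. a rookie who debuted in 2023).
# All percentages below are placeholders, not real stats.

fg_pct = {
    "Player A": {"2022-23": 55.3, "2023-24": 61.1},
    "Player B": {"2022-23": 58.2, "2023-24": 57.0},
    "Rookie C": {"2023-24": 46.5},  # no 2022-23 NBA season -> ineligible
}

def best_improvement(stats):
    """Return (player, FG% delta) among players with both seasons."""
    eligible = {
        name: s["2023-24"] - s["2022-23"]
        for name, s in stats.items()
        if "2022-23" in s and "2023-24" in s  # the eligibility filter
    }
    return max(eligible.items(), key=lambda kv: kv[1])

name, delta = best_improvement(fg_pct)
```

The models' mistake was, in effect, skipping the eligibility filter and substituting non-NBA numbers for the missing season.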

Final verdict
Reasoning models are powerful tools, but they still have a way to go before they can be fully trusted with tasks, especially as other components of large language model (LLM) applications continue to evolve. As our experiments show, both o1 and R1 can still make basic mistakes. Despite their impressive results, they still need some hand-holding to give accurate answers.