Is Humanity's Last Exam a Distraction?

# Introduction

Humanity's Last Exam (HLE) is a benchmark designed to measure the reasoning and deep knowledge capabilities of today's most advanced artificial intelligence systems. Its defining feature is that the difficulty is taken to an extreme. Think of it as a present-day evolution of the Turing test, which was conceived decades ago.

In this article, we take a gentle look at this benchmark, describing why it was created, presenting a range of opinions on it from experts in the field, and concluding with a summary of the most widely accepted verdict.

# Why was it built and what does it consist of?

The long-established testing methods used for classic AI systems became obsolete as these systems evolved and began to achieve excellent results without much effort. For this reason, the Center for AI Safety, together with Scale AI and with the help of experts from around the world, created a pioneering benchmark called HLE. The benchmark was published in Nature, one of the most prestigious scientific journals, in January 2026. It was carefully designed to avoid the repetitive patterns found in previous evaluation frameworks.

So what is HLE all about? It is an exam that cutting-edge artificial intelligence systems such as language models must take. It consists of over 2,500 expert-level questions covering more than a hundred academic disciplines, including physics, mathematics, biology, the humanities, and many others. Importantly, the questions cannot be answered by memorization or by simply looking up information, and they are not limited to a multiple-choice format. Instead, they require complex deductive reasoning and deep understanding.

Here are two examples of such questions:

Two sample questions from HLE. Image source: arXiv / Center for AI Safety

Let’s talk about the results achieved so far: even the most sophisticated frontier models, such as GPT, Gemini, or Claude, barely exceed an overall accuracy of 45-50%. These numbers speak for themselves about how extremely hard this exam is. Moreover, the models are often overconfident, expressing high certainty in answers that turn out to be incorrect.
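To make the two ideas in the previous paragraph concrete, here is a minimal Python sketch of how accuracy and overconfidence could be quantified from graded answers. It is an illustration only, not HLE's official scoring code: the `GradedAnswer` type, the function names, and the sample data are all hypothetical, and the metric shown is a simple binned expected calibration error, which may differ from the calibration measure the benchmark itself reports.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """Hypothetical record: a model's stated confidence and whether it was right."""
    confidence: float  # stated confidence in [0, 1]
    correct: bool


def accuracy(answers):
    """Fraction of answers graded as correct."""
    return sum(a.correct for a in answers) / len(answers)


def expected_calibration_error(answers, n_bins=10):
    """Weighted average gap between stated confidence and actual accuracy.

    Answers are grouped into equal-width confidence bins; a well-calibrated
    model has a small gap in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for a in answers:
        # Clamp so that confidence == 1.0 falls into the last bin.
        idx = min(int(a.confidence * n_bins), n_bins - 1)
        bins[idx].append(a)

    ece = 0.0
    for group in bins:
        if not group:
            continue
        avg_confidence = sum(a.confidence for a in group) / len(group)
        bin_accuracy = sum(a.correct for a in group) / len(group)
        ece += len(group) / len(answers) * abs(avg_confidence - bin_accuracy)
    return ece


if __name__ == "__main__":
    # Made-up example: a model that claims 90% confidence but is right 1 time in 4.
    sample = [GradedAnswer(0.9, True)] + [GradedAnswer(0.9, False)] * 3
    print(f"accuracy: {accuracy(sample):.2f}")
    print(f"calibration error: {expected_calibration_error(sample):.2f}")
```

In this made-up sample the model is right a quarter of the time while claiming 90% confidence, so the calibration error is large. A model that lowered its stated confidence on questions it cannot answer would shrink that gap even without improving its accuracy.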

# What is the dominant expert opinion on HLE?

The candid answer is that there is no consensus. Opinion is rather divided across the technology, software, and academic communities, although there is a slight prevailing tendency to accept that HLE is genuinely useful. There are, however, important caveats.

Overall, experts and the wider audience familiar with HLE do not consider the initiative nonsensical, but they do criticize its exaggerated, seemingly marketing-driven name.

Broadly speaking, there are three main groups of opinion regarding HLE:

// 1. HLE is really useful and necessary

About 60% of opinions lean towards this view, which rests on a technical argument for why HLE is needed right now: previous benchmarks for testing AI systems, including fairly recent language model benchmarks such as Massive Multitask Language Understanding (MMLU), have become saturated or outdated, with almost every current model scoring over 90% on them. This made it impossible to meaningfully compare the latest models against each other and determine which one was best. One of the main reasons many experts praise HLE is that it measures whether an AI is willing to say “I don’t know” rather than hallucinate when faced with complicated problems or questions it cannot answer.
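The saturation argument can be illustrated with a rough statistical sketch in Python. The accuracies and question counts below are invented for illustration, and the function names are hypothetical. The calculation uses a normal approximation for the difference between two proportions and treats the two models' results as independent, which is a simplification (a paired test on the shared questions would be more appropriate in practice).

```python
import math


def diff_confidence_interval(acc_a, acc_b, n_questions, z=1.96):
    """Approximate 95% confidence interval for acc_b - acc_a.

    Assumes both models answered the same number of questions and that
    their results are independent (a simplifying assumption).
    """
    standard_error = math.sqrt(
        acc_a * (1 - acc_a) / n_questions + acc_b * (1 - acc_b) / n_questions
    )
    diff = acc_b - acc_a
    return diff - z * standard_error, diff + z * standard_error


def distinguishable(acc_a, acc_b, n_questions):
    """True if the interval for the accuracy gap excludes zero."""
    low, high = diff_confidence_interval(acc_a, acc_b, n_questions)
    return low > 0 or high < 0


if __name__ == "__main__":
    # Hypothetical saturated benchmark: both models near the ceiling.
    print(distinguishable(0.91, 0.92, n_questions=1000))
    # Hypothetical hard benchmark: plenty of headroom between models.
    print(distinguishable(0.30, 0.45, n_questions=2500))
```

Under these assumed numbers, a one-point gap between two models above 90% falls within sampling noise, while a fifteen-point gap on a harder exam does not. This is one way to read the claim that saturated benchmarks stop being useful for ranking models.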

// 2. HLE distracts from real AI

This skeptical view is shared by approximately 30% of opinions. These experts believe that the test does not really assess how AI performs in real-life scenarios, because it relies solely on overly academic and obscure knowledge. Some engineers even claim, ironically, that as soon as AI models start routinely scoring over 90% on HLE, companies will rush to create HLE 2, and so on, perpetuating a marketing hamster wheel that benefits giant corporations.

// 3. HLE is defective

This is the smallest of the three groups, and its view is discussed, for example, in data science forums. Its proponents claim that some HLE answers marked as correct are in fact wrong, particularly in niche questions in fields such as chemistry and advanced mathematics. Rather poetically, it was the most powerful AI systems themselves that began to detect these benchmark errors.

# Summary

In summary, the usefulness of HLE is not in dispute, and many experts emphasize its importance, although its name is widely regarded as pure marketing drama. The benchmark does not seem likely to mark the arrival of superintelligent AI or the true emergence of artificial general intelligence (AGI), a concept that has been discussed for many years but remains more fiction than reality. Nevertheless, the benchmark is seen as a very ambitious tool for determining which AI model or company leads in knowledge and reasoning capabilities.

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of artificial intelligence, machine learning, deep learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
