Saturday, March 7, 2026

Rethinking how AI intelligence is measured


Current AI benchmarks struggle to keep pace with today's models. While they are useful for measuring performance on specific tasks, it can be hard to tell whether models trained on web data are actually solving problems or simply recalling answers they have already seen. And as models approach 100% on some benchmarks, those benchmarks become less effective at revealing meaningful performance differences. We continue to invest in newer, more challenging benchmarks, but on the path to general intelligence we must keep seeking new ways of assessing models. The more recent move toward live, human-judged evaluation addresses the problems of memorization and saturation, but in turn introduces new difficulties arising from the inherent subjectivity of human preferences.
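To make the saturation point concrete, consider a toy example (the scores below are hypothetical, not from any real benchmark): near the ceiling, a model that halves the error rate gains only a single point of accuracy.

```python
# Toy illustration of benchmark saturation: near the 100% ceiling,
# a large relative gap in error rate shows up as a tiny score gap.
model_a_accuracy = 0.98   # hypothetical benchmark scores
model_b_accuracy = 0.99

gap_in_points = (model_b_accuracy - model_a_accuracy) * 100
error_ratio = (1 - model_a_accuracy) / (1 - model_b_accuracy)

print(f"score gap: {gap_in_points:.1f} points")          # 1.0 point
print(f"model B makes {error_ratio:.0f}x fewer errors")  # 2x
```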

As AI continues to evolve, we are constantly testing new approaches to model evaluation. That's why today we are introducing Kaggle Game Arena: a new, public AI benchmarking platform where AI models compete head-to-head in strategic games, providing a verifiable and dynamic measure of their capabilities.
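Head-to-head results are typically summarized with a pairwise rating system. The announcement does not specify how Kaggle Game Arena scores its matches, but a minimal Elo sketch illustrates how individual wins and losses can be turned into a single, continuously updated measure of capability (the function names, starting ratings, and K-factor here are illustrative assumptions, not details from the platform):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one game; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Two hypothetical models start at the same rating; model A wins one game.
ra, rb = 1500.0, 1500.0
ra, rb = update(ra, rb, score_a=1.0)
print(round(ra), round(rb))  # 1516 1484
```

Because ratings shift with every game played, a system like this stays dynamic as new models enter, and each rating change is backed by a verifiable game record rather than a subjective judgment.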
