Researchers from Meta FAIR and the National University of Singapore have developed a new reinforcement learning framework for self-improving artificial intelligence systems.
Called Self-Play In Corpus Environments (SPICE), the framework pits two artificial intelligence agents against each other, creating their own challenges and gradually improving without human supervision.
While currently a proof of concept, this self-play mechanism could provide the basis for future AI systems that can dynamically adapt to their environments, making them more resilient to the unpredictability of real-world applications.
The challenge of AI self-improvement
The goal of AI self-improvement is to create systems that can enhance their capabilities by interacting with their environment.
A common approach is reinforcement learning with verifiable rewards (RLVR), in which models are rewarded for providing correct answers to problems. However, RLVR is often limited by its reliance on human-curated problem sets and domain-specific reward engineering, making it difficult to scale.
Another promising paradigm is self-play, in which a model improves by competing against itself. However, existing self-play methods for language models are often limited by two critical factors:
- Factual errors in the generated questions and answers accumulate, creating a hallucination feedback loop.
- When the problem generator and problem solver have information symmetry (i.e., share the same knowledge base), they fail to generate truly novel challenges and fall into repetitive patterns.
As the researchers note in their paper, “These systematic empirical failures indicate that self-improvement requires interaction with an external source that provides diverse, verifiable feedback, rather than pure closed-loop introspection.”
How SPICE works
SPICE is a self-play framework in which a single model plays two different roles:
- The "Challenger" constructs a curriculum of challenging problems based on a vast collection of documents.
- The "Reasoner" then tries to solve these problems without access to the source documents.
This setup breaks the information symmetry that limits other self-play methods, because the Reasoner does not have access to the documents and knowledge the Challenger uses to generate problems.
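The asymmetry can be sketched in code. The snippet below is an illustrative stub, not the actual SPICE implementation: in the real system both roles are LLM generations, and the `challenger` and `reasoner` functions here are hypothetical placeholders. The key structural point it shows is that only the Challenger ever sees the document, while the Reasoner's answer is verified against content grounded in that document.

```python
# Minimal sketch of one SPICE-style self-play round (hypothetical
# interfaces; real Challenger/Reasoner roles are LLM policies).
import random

def challenger(document: str) -> dict:
    """Challenger sees the document and produces a grounded QA task."""
    # Stub: build a fill-in-the-blank question from the document.
    words = document.split()
    idx = random.randrange(len(words))
    answer = words[idx]
    question = " ".join(w if i != idx else "____" for i, w in enumerate(words))
    return {"question": question, "answer": answer}

def reasoner(question: str) -> str:
    """Reasoner answers WITHOUT access to the source document
    (this is the information asymmetry)."""
    # Stub: a trivial guess; a real Reasoner is a trained model.
    return "unknown"

def play_round(document: str) -> bool:
    task = challenger(document)
    prediction = reasoner(task["question"])
    # The answer is verifiable because it is anchored in the corpus.
    return prediction == task["answer"]

corpus = ["SPICE grounds self-play tasks in external documents."]
print(play_round(corpus[0]))
```

Because the answer key comes from the document rather than from the model's own beliefs, correctness can be checked mechanically, which is what keeps hallucinations from compounding.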
Grounding tasks in a vast and diverse set of documents prevents hallucination by anchoring questions and answers in real-world content. This matters because, to self-improve reliably, AI systems need external sources of grounding. LLM agents should therefore learn from interactions with people and the real world, not just from their own outputs, to avoid compounding errors.
The adversarial dynamic between the two roles creates an automatic curriculum.
The Challenger is rewarded for generating problems that are both diverse and at the frontier of the Reasoner's capabilities (neither too easy nor impossible).
The Reasoner is rewarded for answering correctly. This symbiotic interaction pushes both roles to continually discover and overcome new challenges.
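One simple way to capture this reward structure in code is shown below. This is an illustrative shaping, not the paper's exact formula: it rewards the Challenger most when the Reasoner's empirical pass rate on a problem sits near 50%, and gives the Reasoner a plain correctness reward.

```python
# Illustrative sketch of SPICE's symbiotic rewards (not the paper's
# exact formulas). The Challenger earns the most for "frontier"
# problems that the Reasoner solves about half the time.

def reasoner_reward(correct: bool) -> float:
    """Reasoner: simple verifiable correctness reward."""
    return 1.0 if correct else 0.0

def challenger_reward(pass_rate: float) -> float:
    """Peaks at pass_rate = 0.5; zero for trivial (1.0) or
    impossible (0.0) problems. This is the variance of a Bernoulli
    outcome with success probability pass_rate, scaled to [0, 1]."""
    return 4.0 * pass_rate * (1.0 - pass_rate)

# Problems every sample solves, or none solves, earn the Challenger nothing:
print(challenger_reward(1.0))   # 0.0
print(challenger_reward(0.0))   # 0.0
# A frontier problem (half the samples succeed) earns the maximum:
print(challenger_reward(0.5))   # 1.0
```

Under this shaping, the Challenger has no incentive to pose unanswerable questions or trivial ones, which is what drives the difficulty of the curriculum to track the Reasoner's current ability.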
Because the system uses raw documents rather than predefined question-answer pairs, it can generate a variety of task formats, such as multiple-choice and free-form questions. This flexibility allows SPICE to be applied to any domain, eliminating the bottleneck that confined previous methods to narrow domains such as mathematics and code. It also reduces dependence on expensive, human-curated datasets for specialized fields such as legal or medical analysis.
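To make the format flexibility concrete, here is a hedged sketch of how a single document-grounded fact can be cast as either a free-form or a multiple-choice task. The helper functions and the example fact are hypothetical; in SPICE itself, the Challenger model generates these tasks directly.

```python
# Sketch of turning one grounded fact into multiple task formats
# (hypothetical helpers; the real system prompts an LLM to do this).
import random

def make_free_form(fact: str, answer: str) -> dict:
    """Blank out the answer span to get a free-form question."""
    return {"format": "free-form",
            "question": fact.replace(answer, "____"),
            "answer": answer}

def make_multiple_choice(fact: str, answer: str, distractors: list) -> dict:
    """Same blanked question, plus shuffled answer options."""
    options = distractors + [answer]
    random.shuffle(options)
    return {"format": "multiple-choice",
            "question": fact.replace(answer, "____"),
            "options": options,
            "answer": answer}

fact = "SPICE grounds its tasks in external documents."
ff = make_free_form(fact, "external documents")
mc = make_multiple_choice(fact, "external documents",
                          ["its own outputs", "human-curated datasets"])
print(ff["question"])
print(len(mc["options"]))
```

Since both formats derive from the same corpus passage, both remain verifiable against external content, regardless of how the question is presented.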
SPICE in action
The researchers evaluated SPICE on several base models, including Qwen3-4B-Base and OctoThinker-3B-Hybrid-Base.
They compared its performance against baselines such as the untrained base model, a Reasoner trained against a fixed "Strong Challenger" (Qwen3-32B-Instruct), and self-play methods such as R-Zero and Absolute Zero. The evaluation covered a wide range of mathematical and general-reasoning benchmarks.
Across all models, SPICE consistently outperformed the baselines, delivering significant improvements on both math and general reasoning tasks.
The results show that reasoning skills developed through corpus-grounded self-play transfer well across models, thanks to the diverse corpus of external knowledge they draw on.
The key finding is that the adversarial dynamic creates an effective automated curriculum. As training progresses, the Challenger learns to generate increasingly difficult problems.
In one experiment, the Reasoner's pass rate on a fixed set of problems increased from 55% to 85% over the course of training, demonstrating its improved capabilities.
Meanwhile, later versions of the Challenger were able to generate questions that drove an early-stage Reasoner's pass rate down from 55% to 35%, confirming that both roles successfully co-evolve.
The researchers concluded that this approach represented a paradigm shift in self-improving reasoning methods from “closed-loop self-play, which often stagnates due to hallucination drift, to open improvement through interaction with the vast, verifiable knowledge contained in corpora of online documents.”
Currently, the corpus used in SPICE represents the human experience captured in text. The ultimate goal is for self-improving systems to generate questions based on interactions with reality, including the physical world, the Internet, and human interactions through multiple modalities such as video, audio, and sensor data.
