ARC-AGI-3: Every Frontier AI Model Scores Below 1% While Humans Score 100%

While tech companies race to declare AGI, the ARC Prize Foundation just released a benchmark that measures something none of our frontier models can do: navigate a completely unfamiliar environment with zero instructions.

What is ARC-AGI-3?

ARC-AGI-3 measures agentic intelligence — the ability to explore, adapt, and reason on-the-fly in completely novel environments. The test drops an AI agent into an unknown game-like world with zero instructions, zero stated goals, and no description of the rules. The agent must explore, figure out what it’s supposed to do, form a plan, and execute it.

This isn’t about trivia, coding, or language fluency — tasks where LLMs already shine. It’s about the thing we actually mean when we say "intelligence."

The Scoreboard

Entity	Score	Notes
Humans (untrained)	~100%	Baseline — explore, adapt, and solve from scratch
Best RL + Graph-Search Agent	12.58%	Preview phase — outperformed all frontier LLMs by 30x
Google Gemini 3.1 Pro	0.37%	Highest among frontier LLMs
OpenAI GPT-5.4 (High)	0.26%
Anthropic Claude Opus 4.6	0.25%
xAI Grok-4.20	0.00%	Complete failure to navigate novel environments

Why This Matters

The gap isn’t marginal — it’s a chasm. Humans score near-perfect on tasks they’ve never seen before, while the most powerful AI systems on Earth barely register above zero.

But here’s the most fascinating finding: during the preview phase, a simpler RL + graph-search approach scored 12.58% — outperforming every frontier model by 30x. This strongly suggests that throwing more parameters, more data, and more compute at LLMs is not the path to AGI.

The Implications

Scaling is not enough: LLMs are incredibly sophisticated pattern matchers. But pattern matching is not reasoning. ARC-AGI-3 proves that the current paradigm has fundamental limitations.
Architecture matters more than size: The RL-based agent with far fewer parameters outperformed trillion-parameter models. The next breakthrough in AI may come from new architectures, not bigger models.
The $2M Prize: The ARC Prize 2026 is live on Kaggle with $2 million in prize money for anyone who can match human-level performance. 25 environments are publicly available for researchers to test their approaches.

The Bottom Line

We have built machines that can write poetry, generate code, and pass medical exams. But we have not built machines that can think. ARC-AGI-3 is the clearest measure of that gap, and right now, the gap is 100x.

Until AI can walk into the unknown and figure things out from scratch — the way any human child can — we are still in the foothills of intelligence.

Sources: ARC Prize Leaderboard, The Decoder, ARC Prize Blog