Artificial general intelligence (AGI) could be closer than we think, if a popular test is any indication. But the test’s creators argue that the result points to flaws in the test’s design rather than a genuine breakthrough in the field.
In 2019, AI industry leader Francois Chollet introduced the ARC-AGI benchmark, short for “Abstraction and Reasoning Corpus for Artificial General Intelligence,” designed to assess an AI system’s ability to acquire new skills outside the data it was trained on. Chollet maintains that ARC-AGI is the only AI test that measures progress towards general intelligence, even though many others have been proposed.
Until now, the best-performing AI could complete fewer than a third of the tasks in ARC-AGI. Chollet attributes this to the industry’s emphasis on large language models (LLMs), which he believes are incapable of actual “reasoning.”
“LLMs struggle with generalization, being largely dependent on memorization,” Chollet explained in a series of posts. “They fail on anything that wasn’t included in their training data.”
Chollet emphasizes that while LLMs might be able to memorize “reasoning patterns,” it’s doubtful they can generate “new reasoning” based on new situations. According to Chollet, if you need many examples of a pattern to learn a reusable representation for it, you’re merely memorizing.
To encourage research beyond LLMs, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition in June to build open-source AI capable of beating ARC-AGI. Out of 17,789 submissions, the best scored 55.5%, ~20% higher than the top scorer in 2023 but still short of the 85% “human-level” threshold needed to win.
However, this doesn’t mean that we are ~20% closer to AGI, according to Knoop.
Knoop suggested in a blog post that many of the ARC-AGI submissions have been able to “brute force” their way to a solution, indicating that a “large portion” of ARC-AGI tasks “[don’t] provide much useful indication towards general intelligence.”
ARC-AGI comprises puzzle-like problems in which an AI, given a grid of differently colored squares, must generate the correct “answer” grid. The problems were designed to force an AI to adapt to situations it hasn’t seen before, though how well they accomplish this is uncertain.
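To make the format concrete, here is a minimal sketch of what an ARC-AGI-style task and a naive “brute force” solver could look like. The train/test layout mirrors the published ARC task format, but the specific grids, color values, and the tiny set of candidate transformations are illustrative assumptions, not anything taken from the competition.

```python
# Hypothetical sketch of an ARC-AGI-style task and a naive brute-force solver.
# Grid cells are small integers standing for colors; the task and the candidate
# transformations below are illustrative assumptions, not actual ARC tasks.

import numpy as np

# A toy task: every training pair is solved by flipping the grid left-to-right.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0], [0, 5, 0]], "output": [[0, 3, 3], [0, 5, 0]]},
    ],
    "test": [{"input": [[4, 0, 0], [0, 6, 0]]}],
}

# A tiny library of candidate grid transformations to search over exhaustively.
CANDIDATES = {
    "identity": lambda g: g,
    "flip_lr": lambda g: np.fliplr(g),
    "flip_ud": lambda g: np.flipud(g),
    "rotate_90": lambda g: np.rot90(g),
    "transpose": lambda g: g.T,
}

def brute_force_solve(task):
    """Return the first candidate transformation consistent with all training pairs."""
    for name, fn in CANDIDATES.items():
        fits_all = all(
            np.array_equal(fn(np.array(pair["input"])), np.array(pair["output"]))
            for pair in task["train"]
        )
        if fits_all:
            # Apply the winning transformation to every test input.
            return name, [fn(np.array(t["input"])).tolist() for t in task["test"]]
    return None, None

name, predictions = brute_force_solve(task)
print(name)         # flip_lr
print(predictions)  # [[[0, 0, 4], [0, 6, 0]]]
```

A search like this can pass individual tasks without anything resembling general reasoning, which is roughly the concern behind the “brute force” critique above.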
“[ARC-AGI] has remained the same since 2019 and is not perfect,” Knoop admitted in his post.
Chollet and Knoop have also been criticized for overhyping ARC-AGI as a benchmark for progress toward AGI, at a time when the very definition of AGI is being hotly debated. A staff member from OpenAI recently claimed that AGI has “already” been achieved if one defines AGI as AI “superior to most humans at most tasks.”
Knoop and Chollet plan to launch a second-generation ARC-AGI benchmark to address these issues, alongside a competition in 2025. “We will continue to guide the research community’s efforts towards what we see as the most critical unsolved issues in AI, and hasten the timeline to AGI,” Chollet stated in a post.
However, finding solutions won’t be simple. If the first ARC-AGI test’s flaws are any indication, defining intelligence for AI will be as complex and controversial as it has been for humans.