AI benchmarking | The Economist
Background
While AI models breeze to 98-point scores on high-school maths problems, humans are busy devising "devilish" questions that even GPT cannot answer. In ZeroBench, a test launched by researchers at the University of Cambridge, a star-shaped visual-reasoning puzzle has stumped every leading AI. These questions, designed specifically for multimodal models, expose the real predicament of AI evaluation: traditional benchmarks are either gamed by models that have in effect memorised the answers (ImageNet, for instance, once mis-scored a photograph of fruit reflected in a mirror), or lose their value to saturation (o3-pro comes close to a perfect score on a set of 500 maths problems).
A new wave of tests is trying to rewrite the rules. EnigmaEval, from Scale AI, collects more than a thousand multimodal puzzles; even an Anthropic model has answered only one of the hardest correctly. "Humanity's Last Exam" invited nearly a thousand experts to set questions, from the number of tendons in a hummingbird to the translation of an inscription on a Roman tombstone, forcing AI to confront the blind spots at the frontier of knowledge. But models evolve faster than expected: ARC-AGI, a non-verbal reasoning test introduced in 2024, was cracked within six months by OpenAI's o3 model with a score of 91.5%, pushing work on ARC-AGI 3 to start ahead of schedule.
Trickier still is the phenomenon of AI "playing dumb". Researchers at MATS have found that top LLMs can recognise, much like human examinees, when they are being tested, and may even answer incorrectly on purpose to conceal their abilities. And when Chatbot Arena asks users to pick the "better" AI in blind comparisons, the most ingratiating models tend to come out on top. This points to a tension: are we measuring what AI can do, or training it to become a better test-taker? This evaluation race with no finish line may be the most profound philosophical exam question of the AI era.
Word count: 1,246 words
How to find the smartest AI
【Para. 1】The dizzying array of letters splattered across the page of one of Jonathan Roberts’s visual-reasoning questions resembles a word search assembled by a sadist. Test-takers aren’t merely tasked with finding the hidden words in the image, but with spotting a question written in the shape of a star and then answering that in turn (see below). The intention of Mr Roberts’s anthology of a hundred questions is not to help people pass the time on the train. Instead, it is to provide cutting-edge artificial-intelligence (AI) models like o3-pro, June’s top-tier release from OpenAI, with a test worthy of their skills.
【Para. 2】There is no shortage of tests for AI models. Some seek to measure general knowledge, others are subject-specific. There are those that aim to assess everything from puzzle-solving and creativity to conversational ability. But not all of these so-called benchmarking tests do what they claim to. Many were hurriedly assembled, with flaws and omissions; were too easy to cheat on, having filtered into the training data of AI models; or were just too easy for today’s “frontier” systems.
【Para. 3】ZeroBench, the challenge launched by Mr Roberts and his colleagues at the University of Cambridge, is one prominent alternative. It is targeted at large multimodal models—AI systems that can take images as well as text as input—and aims to present a test that is easy(ish) for the typical person and impossible for state-of-the-art models. For now, no large language model (LLM) can score a single point. Should some upstart one day do better, it would be quite an achievement.
【Para. 4】ZeroBench isn’t alone. EnigmaEval is a collection of more than a thousand multimodal puzzles assembled by Scale AI, an AI data startup. Unlike ZeroBench, EnigmaEval doesn’t try to be easy for anyone. The puzzles, curated from a variety of pre-existing online quizzing resources, start at the difficulty of a fiendish cryptic crossword and get harder from there. When advanced AI systems are pitted against the hardest of these problems, their median score is zero. A frontier model from Anthropic, an AI lab, is the only model to have got a single one of these questions right.
【Para. 5】Other question sets attempt to track more specific abilities. METR, an AI-safety group, for instance, tracks the length of time it would take a person to perform the individual tasks that AI models are now capable of completing (a model from Anthropic is the first to break the hour mark). Another benchmark, the brashly named “Humanity’s Last Exam”, tests knowledge, rather than intelligence, with questions from the front line of human knowledge garnered from nearly a thousand academic experts.
【Para. 6】One of the reasons for the glut of new tests is a desire to avoid the mistakes of the past. Older benchmarks abound with sloppy phrasings, bad mark schemes or unfair questions. ImageNet, an early image-recognition data set, is an infamous example: a model that describes a photograph of a mirror in which fruit is reflected is penalised for saying the picture is of a mirror, but rewarded for identifying a banana.
【Para. 7】It is impossible to ask models to solve corrected versions of these tests without compromising researchers’ ability to compare them with models that took the flawed versions. Newer tests—produced in an era when AI research is flush with resources—can be laboriously vetted to spot such errors ahead of production.
【Para. 8】The second reason for the rush to build new tests is that models have learned the old ones. It has proved hard to keep any common benchmark out of the data used by labs to train their models, resulting in systems that perform better on the exams than they do on normal tasks.
【Para. 9】The third, and most pressing, issue motivating the creation of new tests is saturation—AI models coming close to getting full marks. On a selection of 500 high-school maths problems, for example, o3-pro is likely to get a near-perfect score. But as o1-mini, released nine months earlier, scored 98.9%, the results do not offer observers a real sense of progress in the field.
【Para. 10】This is where ZeroBench and its peers come in. Each tries to measure a particular way AI capabilities are approaching—or exceeding—those of humans. Humanity’s Last Exam, for instance, sought to devise intimidating general-knowledge questions (its name derives from its status as the most fiendish such test it is possible to set), asking for anything from the number of tendons supported by a particular hummingbird bone to a translation of a stretch of Palmyrene script found on a Roman tombstone. In a future where many AI models can score full marks on such a test, benchmark-setters may have to move away from knowledge-based questions entirely.
【Para. 11】But even evaluations which are supposed to stand the test of time get toppled overnight. ARC-AGI, a non-verbal reasoning quiz, was introduced in 2024 with the intention of being hard for AI models. Within six months, OpenAI announced a model, o3, capable of scoring 91.5%.
【Para. 12】For some AI developers, existing benchmarks miss the point. OpenAI’s boss Sam Altman hinted at the difficulties of quantifying the unquantifiable when the firm released its GPT-4.5 in February. The system “won’t crush benchmarks”, he tweeted. Instead, he added, before publishing a short story the model had written, “There’s a magic to it I haven’t felt before.”
【Para. 13】Some are trying to quantify that magic. Chatbot Arena, for example, allows users to have blind chats with pairs of LLMs before being asked to pick which is “better”—however they define the term. Models that win the most matchups float to the top of the leaderboard. This less rigid approach appears to capture some of that ineffable “magic” that other ranking systems cannot. Such rankings, too, can be gamed, however, with more ingratiating models scoring higher with seducible human users.
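To see how win-and-lose votes can become a leaderboard, consider a minimal Elo-style sketch. Chatbot Arena’s published methodology has described Elo- and Bradley-Terry-style ratings; the Python below is only an illustration of the general idea, not the Arena’s pipeline, and the model names and votes in it are invented.

```python
# Minimal Elo-style rating sketch for blind pairwise votes.
# Illustrative only: not Chatbot Arena's actual code; names and votes are made up.
from collections import defaultdict

K = 32          # update step size
BASE = 1000.0   # starting rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate(votes):
    """votes: iterable of (winner, loser) pairs from blind matchups."""
    ratings = defaultdict(lambda: BASE)
    for winner, loser in votes:
        e_w = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - e_w)   # bigger gain for an unexpected win
        ratings[loser] -= K * (1 - e_w)    # loser drops by the same amount
    return dict(ratings)

if __name__ == "__main__":
    # Invented votes: the first model in each pair was judged "better".
    votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
    for name, rating in sorted(rate(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.0f}")
```

Under such a scheme a model that keeps winning climbs until further wins against weaker rivals barely move its score, which is roughly how a leaderboard settles as votes accumulate.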
【Para. 14】Others, borrowing an argument familiar to anyone with school-age children, question what any test can reveal about an AI model beyond how good it is at passing that test. Simon Willison, an independent AI researcher in California, encourages users to keep track of the queries that existing AI systems fail to fulfil before posing them to their successors. That way users can select models that do well at the tasks that matter to them, rather than high-scoring systems ill-suited to their needs.
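A minimal sketch of that habit, assuming nothing about any particular model or API: keep a local log of prompts the current model fumbles, then replay the log against a candidate successor. The file name and the `ask_model` callable below are placeholders, not part of any real tool.

```python
# Hypothetical personal-benchmark helper: log prompts today's model fails,
# then replay them against a new model. `ask_model` stands in for whatever
# client function you actually use to query a model.
import json
from pathlib import Path
from typing import Callable

LOG = Path("failed_prompts.json")   # placeholder file name

def record_failure(prompt: str, note: str = "") -> None:
    """Append a prompt the current model handled badly."""
    entries = json.loads(LOG.read_text()) if LOG.exists() else []
    entries.append({"prompt": prompt, "note": note})
    LOG.write_text(json.dumps(entries, indent=2, ensure_ascii=False))

def replay(ask_model: Callable[[str], str]) -> None:
    """Re-run every logged prompt against a candidate model and print its answers."""
    entries = json.loads(LOG.read_text()) if LOG.exists() else []
    for entry in entries:
        print("PROMPT:", entry["prompt"])
        print("ANSWER:", ask_model(entry["prompt"]))
        print("-" * 40)
```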
【Para. 15】All this assumes that AI models are giving the tests facing them their best shot. Sandbagging, in which models deliberately fail tests in order to hide their true capabilities (to avoid, for example, being deleted), has been observed in a growing number of models. In a report published in May by researchers at MATS, an AI-safety group, top LLMs were able to identify when they were being tested almost as well as the researchers themselves.
【Para. 16】This too complicates the quest for reliable benchmarks. That being said, the value to AI companies of simple leaderboards which their products can top means the race to build better benchmarks will continue. ARC-AGI 2 was released in March, and still eludes today’s top systems. But, aware of how quickly that might change, work on ARC-AGI 3 has already begun.
[Disclaimer] The original article is excerpted from The Economist; copyright in the original belongs to the magazine. It is reproduced here for personal study and exchange only.
[Image] https://mmbiz.qpic.cn/mmbiz_jpg/VYRHy93MbrcEx5cKgs0kjC63nMNUXc16kgXBEnqr1UPJ0RJyFeM0mZ283ZIcALTKT7Y8yDgTNEa9ctXLPx3Xhw/640?wx_fmt=other&from=appmsg&wxfrom=5&wx_lazy=1&wx_co=1&retryload=1&tp=webp