
Import AI 429: Eval the world economy; singularity economics; and Swiss sovereign AI

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

OpenAI builds an eval that could be to the broad economy as SWE-Bench is to code:
…GDPval is a very good benchmark with extremely significant implications…
OpenAI has built and released GDPval, an extremely well-constructed benchmark for testing how well AI systems perform the kinds of tasks people do in the real-world economy. As evals go, GDPval may end up being to broad real-world economic impact what SWE-Bench is to coding impact, which is a big deal!

What it is: GDPval “measures model performance on tasks drawn directly from the real-world knowledge work of experienced professionals across a wide range of occupations and sectors, providing a clearer picture on how models perform on economically valuable tasks.”
The benchmark covers 44 occupations across 9 industries, with 1,320 specialized tasks "each meticulously crafted and vetted by experienced professionals with over 14 years of experience on average from these fields". The dataset "includes 30 fully reviewed tasks per occupation (full-set) with 5 tasks per occupation in our open-sourced gold set".
Another nice property of the benchmark is that it involves multiple formats for response and tries to get at some of the messiness inherent to the real world. “GDPval tasks are not simple text prompts,” they write. “They come with reference files and context, and the expected deliverables span documents, slides, diagrams, spreadsheets, and multimedia. This realism makes GDPval a more realistic test of how models might support professionals.”
“To evaluate model performance on GDPval tasks, we rely on expert “graders”, a group of experienced professionals from the same occupations represented in the dataset. These graders blindly compare model-generated deliverables with those produced by task writers (not knowing which is AI versus human generated), and offer critiques and rankings. Graders then rank the human and AI deliverables and classify each AI deliverable as “better than”, “as good as”, or “worse than” the human-produced one,” the authors write.
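To make that grading protocol concrete, here is a minimal sketch of how a win-or-tie rate might be computed from blinded pairwise judgments. The `Judgment` structure and its field names are hypothetical illustrations, not GDPval's actual grading pipeline, which OpenAI does not specify at this level of detail.

```python
from dataclasses import dataclass

# Hypothetical record of one blinded grader judgment: the grader sees two
# deliverables for the same task without knowing which is AI-generated,
# and labels the AI deliverable relative to the human one.
@dataclass
class Judgment:
    task_id: str
    occupation: str
    verdict: str  # "better", "as_good_as", or "worse"

def win_or_tie_rate(judgments: list[Judgment]) -> float:
    """Fraction of judgments where the AI deliverable was rated at least
    as good as the human one ("better" or "as_good_as")."""
    if not judgments:
        return 0.0
    wins_or_ties = sum(j.verdict in ("better", "as_good_as") for j in judgments)
    return wins_or_ties / len(judgments)

# Example: three blinded comparisons for one model.
sample = [
    Judgment("task-001", "accountant", "better"),
    Judgment("task-002", "lawyer", "worse"),
    Judgment("task-003", "nurse", "as_good_as"),
]
print(f"win-or-tie rate: {win_or_tie_rate(sample):.1%}")  # 66.7%
```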

Results: “We found that today’s best frontier models are already approaching the quality of work produced by industry experts”, the authors write. Claude Opus 4.1 came in first with an overall win-or-tie rate of 47.6% versus work produced by a human, followed by GPT-5-high with 38.8%, and o3-high with ...
