Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.
Want to make AI go better? Figure out how to measure it:
…One simple policy intervention that works well…
Jacob Steinhardt, an AI researcher, has written a nice blog laying out the virtues of investing in technical tools that measure properties of AI systems and drive down the costs of complying with technical policy interventions. As someone who has spent their professional life in AI writing about AI measurement and building teams (e.g., the Frontier Red Team and Societal Impacts and Economic Research teams at Anthropic) to measure properties of AI systems, I agree with the general thesis: measurement makes some property of a system visible and more accessible to others, and by doing this we can figure out how to wire that measurement into governance.
How measurement has helped in other fields: Steinhardt points out that accurate measurement has been crucial to orienting people around the strategy for solving problems in other fields; CO2 monitoring helps people think about climate change, and COVID-19 testing helped governments work out how to respond to COVID.
There are also examples where measuring something shifts incentives - for instance, satellite imagery of methane emissions can change the incentives of the people who build gas infrastructure.
The AI sector has built some of the measures we need: The infamous METR time horizons plot (and before that, various LLM metrics, and before that ImageNet) has proved helpful for orienting people around the pace of AI progress. And behavioural benchmarks of AI systems, like rates of harmful sycophancy, are already helping to shift incentives. But more work is needed - if we want to enable direct governance interventions in the AI sector, we’ll need to do a better job of measuring and accounting for compute, Steinhardt notes. More ambitiously, if we want to ultimately shift equilibria to make certain paths more attractive, we’ll have to unlock some more fundamental technologies, like the ability to cheaply evaluate frontier AI agents (which makes it less costly to measure the frontier), and privacy-preserving audit tools (which make it less painful for firms to comply with policy).
Why this matters - measurement unlocks policy: “In an ideal world, rigorous evaluation and oversight of AI systems would ...