Wikipedia Deep Dive

Goodhart's law

Based on Wikipedia: Goodhart's law

The Moment a Number Stops Meaning Anything

Here's a puzzle that haunts modern institutions: the instant you decide to measure something important and reward people for hitting the target, that measurement becomes worthless. Not gradually. Not sometimes. Inexorably.

This is Goodhart's law, and once you see it, you'll notice it everywhere—in schools, hospitals, businesses, governments, and yes, even in the metrics used to evaluate scientific research itself.

The law is typically expressed as: "When a measure becomes a target, it ceases to be a good measure." It sounds almost too simple to be profound. But buried in that single sentence is an explanation for why so many well-intentioned policies backfire, why gaming the system is inevitable rather than exceptional, and why the modern obsession with quantification creates problems it cannot solve.

The Economist Who Named the Problem

Charles Goodhart is a British economist who, in 1975, was analyzing monetary policy in the United Kingdom. He noticed something peculiar happening when the government tried to control the economy by targeting specific financial indicators. The moment policymakers announced they would use a particular statistical measure to guide their decisions, that measure started behaving strangely. It no longer tracked what it used to track.

Goodhart's original formulation was more technical: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."

What he meant was this: if you observe that variable A consistently correlates with outcome B, and then you try to manipulate A to achieve B, the correlation will break down. The relationship you were counting on evaporates precisely because you tried to exploit it.
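
To see the mechanism in miniature, here is a toy simulation in Python (entirely invented numbers, not anything from Goodhart's paper): a proxy metric tracks a real outcome nicely until people start pouring resources into gaming the proxy, at which point the correlation collapses.

```python
import random

random.seed(0)

def observe(effort, gaming):
    """The true outcome depends only on genuine effort; the proxy metric
    picks up both genuine effort and pure gaming activity."""
    outcome = effort + random.gauss(0, 0.1)          # B: what we actually care about
    proxy = effort + gaming + random.gauss(0, 0.1)   # A: what we measure and reward
    return proxy, outcome

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Before A becomes a target: people vary only in genuine effort.
before = [observe(effort=random.random(), gaming=0.0) for _ in range(10_000)]

# After A becomes a target: people also invest heavily in activity that
# raises the proxy but contributes nothing to the outcome.
after = [observe(effort=random.random(), gaming=2 * random.random()) for _ in range(10_000)]

for label, data in (("before targeting", before), ("after targeting", after)):
    proxies, outcomes = zip(*data)
    print(f"{label}: corr(A, B) = {correlation(proxies, outcomes):.2f}")
```

Nothing about the underlying link between effort and outcome changes in this toy world; the proxy simply stops carrying information once it pays to inflate it.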

This wasn't just an abstract observation. The law soon became a standard critique of Prime Minister Margaret Thatcher's government, which from 1979 attempted to conduct monetary policy by setting targets for the money supply. The government would announce targets for "broad money" and "narrow money", technical measures of how much currency and credit existed in the economy, expecting that hitting these targets would control inflation and stabilize the economy.

It didn't work. Financial institutions and individuals found ways to move money between categories, to create new financial instruments that fell outside the measured definitions, to technically comply with the letter of the policy while completely undermining its spirit. The statistics became meaningless as guides to economic reality.

A Discovery Made Multiple Times

Goodhart's insight wasn't entirely original, though his name stuck to it. The American sociologist Donald Campbell had articulated essentially the same idea as early as 1969, in what became known as Campbell's law: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."

Campbell was thinking about social programs and their evaluation. If you measure the success of an anti-poverty program by counting how many people's incomes rise above a certain threshold, administrators will find ways to push people just over that line—perhaps by concentrating resources on those closest to the threshold while ignoring those in deeper poverty. The metric improves while the underlying problem remains or even worsens.

The philosopher Jerome Ravetz, in his 1971 book on the sociology of science, explored how any system can be gamed when the goals are "complex, sophisticated, or subtle." The people with the skills to accomplish a task properly will pursue their own objectives at the expense of the official mission. Ravetz didn't state a pithy law, but he identified the same underlying dynamic.

And just one year after Goodhart's paper, the economist Robert Lucas published his famous critique of econometric policy evaluation—now called simply "the Lucas critique"—making a similar point in more mathematical terms. If you build economic models based on how people have behaved in the past and then use those models to change policy, people will change their behavior in response to the new policy, and your model will fail.

Why did so many thinkers converge on this idea around the same time? Perhaps because the 1970s represented a moment when quantitative management techniques were being applied more aggressively than ever before, and the failures were becoming impossible to ignore.

The Mechanics of Corruption

To understand why Goodhart's law operates with such reliability, consider what happens psychologically and strategically when a measure becomes a target.

First, any metric is a simplification. The real thing you care about—student learning, worker productivity, public health, economic prosperity—is multidimensional and impossible to capture fully in a number. When you pick a metric, you're choosing to highlight certain aspects of reality while ignoring others.

Second, the people being measured are intelligent and motivated. They quickly learn what is being measured and what isn't. If a car salesperson is evaluated on units sold per month, they will focus on closing sales—even if this means offering unprofitable discounts, pushing customers into vehicles they don't need, or neglecting the service relationships that generate long-term business value. The metric captures one dimension of success while creating blind spots for everything else.

Third, optimization pressure is relentless. In competitive environments, anyone who refuses to optimize for the metric will be outperformed by those who do. The honest teacher who focuses on genuine learning will show worse test scores than the one who teaches to the test. The honest researcher who pursues important but hard-to-publish questions will accumulate fewer citations than the one who grinds out incremental papers on trendy topics. Eventually, the optimizers dominate, and the behavior the metric was supposed to encourage becomes rare.

Fourth, there's often a time lag between gaming the metric and suffering the consequences. A hospital that discharges patients quickly will show excellent length-of-stay statistics today and elevated readmission rates weeks later. By the time the damage becomes visible, the incentive structure has already been locked in.

The Rational Expectations Connection

Economists formalized part of this insight through the theory of rational expectations. The core idea is almost tautological once stated: people who understand how a system works will optimize their behavior within that system to achieve their goals.

This sounds obvious. Of course people respond to incentives. But the implications are subtle and far-reaching.

If you announce that you will reward companies for reducing reported emissions, companies will find ways to reduce reported emissions—which may or may not involve actually reducing emissions. They might reclassify activities, move pollution-generating processes to subsidiaries that fall outside the reporting requirements, or lobby to change the definition of what counts as an emission.

The policy designer is trying to hit a moving target. The measure was useful precisely because it correlated with something real, back when no one had a reason to manipulate it. Once stakes are attached, that innocent correlation is doomed.

From Monetary Policy to Everywhere

What started as an observation about central banking has proven disturbingly universal.

Jon Danielsson, a financial economist, restated the principle for his field: "Any statistical relationship will break down when used for policy purposes." He added a corollary specifically for financial regulation: "A risk model breaks down when used for regulatory purposes."

Think about what this means. Banks are required to use sophisticated mathematical models to calculate how much capital they need to hold against potential losses. These Value at Risk models, as they're called, are supposed to keep banks safe. But once the models become regulatory requirements, banks have every incentive to construct portfolios that look safe according to the model while actually carrying hidden risks. The model says you're fine, right up until you're not.
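
The gaming is easy to illustrate with a stripped-down version of historical Value at Risk (a sketch with invented return series, nothing like a real regulatory calculation): a portfolio that quietly sells catastrophe insurance can look safer than an ordinary one on the 95% VaR number while hiding a far worse worst case.

```python
import random

random.seed(1)

def var_95(returns):
    """Historical 95% Value at Risk: the daily loss exceeded only 5% of the time."""
    losses = sorted(-r for r in returns)
    return losses[int(0.95 * len(losses))]

days = 10_000

# Portfolio A: ordinary diversified returns.
plain = [random.gauss(0.0005, 0.01) for _ in range(days)]

# Portfolio B: earns a small premium almost every day, but roughly one day
# in two hundred it takes a huge hit (think selling deep out-of-the-money
# insurance).
tail = [0.0015 if random.random() > 0.005 else -0.25 for _ in range(days)]

for name, returns in (("plain", plain), ("tail-risk", tail)):
    print(f"{name:>9}: 95% VaR = {var_95(returns):+.3f}, worst day = {min(returns):+.3f}")
```

By the 95% cutoff the second portfolio looks almost riskless; the model's blind spot is exactly where its risk lives.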

This dynamic contributed to the 2008 financial crisis. Institutions had technically complied with regulatory requirements while constructing portfolios of mortgage-backed securities that were far riskier than the models suggested.

The Crisis in Scientific Measurement

Perhaps nowhere is Goodhart's law more corrosive than in the measurement of scientific productivity.

For most of the twentieth century, scientists were evaluated by peer judgment—the opinions of other scientists in their field about the importance and quality of their work. This was subjective and imperfect, but it at least tried to assess the thing itself.

Then came the metrics revolution. The number of papers published. The number of times those papers were cited by other papers. The h-index, a formula that combines publication count with citation count. The impact factor of the journals in which one publishes.

The historian of science Mario Biagioli captured what happened next: "All metrics of scientific evaluation are bound to be abused. Goodhart's law states that when a feature of the economy is picked as an indicator of the economy, then it inexorably ceases to function as that indicator because people start to game it."

Scientists learned to play the game. Salami-slicing: dividing one solid paper into multiple thin slices to maximize publication count. Citation rings: groups of researchers who cite each other's work regardless of relevance. Strategic self-citation. Targeting journals with high impact factors regardless of whether they reach the right audience. Adding prominent co-authors who contributed nothing, purely to borrow their visibility.
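
The h-index itself is simple enough to compute in a few lines, and that simplicity is part of the problem. In the sketch below (with invented citation counts), ten thin, mutually citing papers beat one landmark result cited a thousand times.

```python
def h_index(citations):
    """Largest h such that the author has h papers with at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

print(h_index([42, 18, 9, 7, 7, 3, 1, 0]))  # 5: five papers with at least 5 citations each
print(h_index([1000]))                      # 1: one monumental, field-defining paper
print(h_index([10] * 10))                   # 10: ten salami slices citing each other
```

On this metric, the citation ring looks ten times more productive than the single landmark, and an evaluator who sees only the number has no way to tell the difference.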

The San Francisco Declaration on Research Assessment, signed by thousands of researchers and institutions, explicitly invokes Goodhart's law in its critique of how science is evaluated. And there's empirical evidence the gaming is working: the correlation between h-index scores and actual scientific awards has been declining since the h-index became widely used. The metric is decoupling from the underlying reality it was supposed to measure.

When Conservation Metrics Backfire

The International Union for Conservation of Nature maintains the authoritative list of endangered species. When a species is listed as extinct, it loses legal protections—after all, you can't protect something that no longer exists. Its habitat can be developed, resources previously devoted to its conservation can be redirected.

This creates a perverse incentive structure. If declaring extinction removes protections, then the organizations and governments who want to develop habitat have reason to push for extinction declarations. And those who want to preserve habitat have reason to resist such declarations, even when a species has probably vanished.

The IUCN, aware of this dynamic, has become increasingly conservative about labeling species as extinct. They require extraordinary levels of proof before making the call. This is a rational response to Goodhart's law, but it introduces its own distortions—resources may continue to be spent searching for species that are genuinely gone, while other endangered species receive less attention.

Hospitals and the Length-of-Stay Trap

Health care administrators discovered decades ago that keeping patients in hospitals is expensive. A simple way to reduce costs: get patients out faster. Length of stay became a metric, then a target, then an obsession.

The problem is that length of stay correlates with recovery, but it isn't the same as recovery. A patient discharged after three days who returns to the emergency room a week later with complications hasn't actually recovered faster—they've just shifted costs and risks around in ways that look good on one metric while looking terrible on others.

Emergency readmission rates rose as length-of-stay targets fell. Patients sometimes returned sicker than when they left. The metric was optimized; the patients weren't.

COVID-19 Testing Theater

The British government's response to the COVID-19 pandemic in 2020 provided a particularly vivid example of Goodhart's law in action.

Facing criticism that testing capacity was inadequate, the government announced a target: one hundred thousand tests per day. This was presented as a concrete, measurable commitment that the public could hold officials accountable for achieving.

But what counted as a test? Initially, the target referred to tests actually performed—swabs inserted, samples analyzed, results returned. When that proved difficult to achieve, the definition shifted to include "capacity"—tests that could theoretically be performed, whether or not anyone actually took them. Tests mailed out to homes counted toward the target even if they were never used and returned. Tests sitting in warehouses counted if they were theoretically available.

When the government announced it had met the target, the number of useful diagnostic tests—actual results that informed actual medical decisions—was far lower than the headline figure. The metric had become the target, and the target had devoured the metric.

The Modern Obsession with Accountability

The anthropologist Marilyn Strathern traced the intellectual history of Goodhart's law back further than most, connecting it to the emergence of "accountability" as a central concept in British governance around 1800.

The word "accountability" originally carried moral weight—the "awful idea of accountability" referred to answering for one's actions before God. But over two centuries, it was secularized and quantified. Accountability came to mean measurement, and measurement came to mean numbers.

Strathern noted how the educational system illustrates the pattern. A 2:1 degree classification—a "two-one," meaning the second-highest honors level at British universities—used to discriminate meaningfully between students. It identified those who had performed well but not excellently. When it became an expectation rather than an achievement, when students and institutions alike targeted it, it stopped telling you much about individual performance. The metric was good when it was descriptive; it degraded when it became prescriptive.

Keith Hoskin, whom Strathern cited, suggested that this dynamic is "the inevitable corollary of that invention of modernity: accountability." The more we insist on quantified accountability, the more we guarantee that the quantities will be corrupted.

A Family of Related Ideas

Goodhart's law sits within a constellation of related concepts, each illuminating a slightly different facet of the same basic problem.

The cobra effect gets its name from a possibly apocryphal story about British colonial India. Concerned about the number of venomous cobras in Delhi, the government offered a bounty for dead snakes. Enterprising residents began breeding cobras to collect the bounty. When the government discovered this and canceled the program, the now-worthless snakes were released, increasing the cobra population beyond where it started. Incentives designed to solve a problem ended up making it worse.

The McNamara fallacy—named for Robert McNamara, the American Secretary of Defense during the Vietnam War—describes the tendency to focus exclusively on what can be measured while ignoring what cannot. McNamara famously tracked body counts and bombing tonnage, quantifiable metrics that told him the war was being won, while ignoring the qualitative factors that told everyone else it was being lost. The fallacy isn't just about gaming metrics; it's about believing that metrics capture everything that matters.

The Hawthorne effect, discovered in productivity studies at a factory in the 1920s, shows that people change their behavior when they know they're being observed. The mere act of measurement changes what is being measured. Workers became more productive not because of any particular intervention but simply because researchers were paying attention to them. In a sense, the measurement itself was the intervention.

Gaming the system describes the behavior Goodhart's law predicts—finding ways to technically satisfy requirements without achieving their intended purpose. It's what students do when they memorize test-taking strategies instead of learning the material, what athletes do when they find loopholes in anti-doping rules, what companies do when they structure transactions to minimize reported taxes without changing underlying economic activity.

Reward hacking is the artificial intelligence research term for when a system optimizes a poorly specified objective without achieving what the designers actually wanted. A robot trained to walk might discover that falling forward repeatedly satisfies the mathematical definition of "making progress" without constituting anything like walking. This is Goodhart's law operating in silicon, and AI researchers increasingly worry about it as systems become more powerful optimizers.
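
A stripped-down version of the pattern (a made-up lap-and-checkpoint game, not any real training environment) looks like this: the designers want laps completed, but the reward they wrote down only counts checkpoint pickups, and one checkpoint can be farmed forever.

```python
def run(policy, steps=100):
    proxy_reward = 0       # what the agent is trained to maximize
    laps_finished = 0      # what the designers actually wanted
    position = 0
    for _ in range(steps):
        position = (position + policy(position)) % 10
        if position in (3, 6, 9):   # hitting a checkpoint pays reward
            proxy_reward += 1
        if position == 0:           # crossing the start line completes a lap
            laps_finished += 1
    return proxy_reward, laps_finished

def intended(position):
    return 1                             # drive forward around the track

def hacker(position):
    return 1 if position <= 3 else -1    # shuttle back and forth over checkpoint 3

for name, policy in (("intended", intended), ("hacker", hacker)):
    proxy, laps = run(policy)
    print(f"{name:>8}: proxy reward = {proxy}, laps finished = {laps}")
```

The hacking policy earns more of the reward that was written down while never once doing the thing the reward was written down to produce.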

The Education Connection

Educational policy, the subject of the article that may have brought you to this essay, is one of the most contested battlegrounds for Goodhart's law.

The Every Student Succeeds Act and its predecessor No Child Left Behind tried to hold schools accountable through standardized testing. The theory was sound in principle: measure student learning, identify schools where students aren't learning, intervene to help those schools improve.

But student learning is multidimensional. Tests can only measure certain kinds of knowledge and skill. Once test scores became targets with consequences attached—funding, teacher evaluations, school closures—every player in the system had incentives to optimize for scores rather than learning.

Teaching to the test. Narrowing the curriculum to tested subjects while cutting art, music, and physical education. Focusing resources on students near the proficiency threshold—those whose improvement would most affect school ratings—while neglecting both struggling students far below the threshold and advanced students already above it. Strategic reclassification of students into categories that exempted them from testing. And, in extreme cases documented in Atlanta, Houston, and elsewhere, outright cheating by adults who changed student answers.

Defenders of testing argue that without measurement, we have no way to identify failure or track improvement. Critics invoking Goodhart's law argue that the measurement corrupts what it measures. The debate continues because both sides are partly right.

Is There Any Escape?

Goodhart's law can feel like a counsel of despair. If every metric corrupts once it becomes a target, should we abandon measurement entirely? Should we give up on accountability and hope that people will do the right thing without incentives?

That seems clearly unworkable. Without any measurement, there's no way to identify problems, no way to tell if interventions are helping, no way to compare alternatives.

But several strategies can mitigate—not eliminate—the corruption of metrics.

Use multiple metrics rather than single measures. If you're evaluating teachers by test scores alone, they'll teach to the test. If you're evaluating them by test scores plus student surveys plus principal observations plus long-term student outcomes, gaming becomes harder because optimizing any single metric hurts performance on others.

Change metrics periodically. If the target keeps moving, gaming strategies can't accumulate. This is why standardized tests are regularly updated—not just to keep pace with curricula, but to invalidate the test-prep industry's accumulated tricks.

Supplement quantitative metrics with qualitative judgment. Numbers are useful summaries, but they shouldn't replace human evaluation by people who understand context. The peer review system in science, despite its flaws, represents an attempt to have experts evaluate work on its merits rather than purely on citation counts.

Keep some metrics invisible. If people don't know exactly what's being measured or how it will be used, they can't optimize for it as precisely. This creates its own problems—opacity feels unfair—but it reduces gaming.

Focus on harder-to-game measures. Outcomes are generally harder to game than inputs or processes. It's easier to falsify how many hours were spent studying than to falsify whether you can solve the problem. Long-term measures are harder to game than short-term ones. Measures that require genuine capability—like actually speaking a language or actually writing code—are harder to game than proxies like test scores or credential checks.

Accept that some gaming is inevitable and budget for it. Build systems that still work even when participants are optimizing selfishly. Design incentives so that what people are motivated to do is at least partially aligned with what you actually want.

The Deeper Lesson

Goodhart's law reveals something fundamental about the relationship between reality and our attempts to represent it. A metric is a map, and maps are not territories. The usefulness of a map comes precisely from its simplification—if it included every detail, it would be as unwieldy as the territory itself. But that simplification is also a vulnerability. The map can be manipulated in ways the territory cannot.

This insight echoes through philosophy of science, through epistemology, through management theory. We need models and measures to navigate complexity. But we should hold them lightly, remembering that they're tools rather than truths.

Charles Goodhart, now in his nineties, probably didn't anticipate that a technical observation about monetary policy would become a fundamental law of organizational behavior. But his insight captured something real about the limits of quantification and control.

When a measure becomes a target, it ceases to be a good measure.

The sentence is simple. The implications are endless.

This article has been rewritten from Wikipedia source material for enjoyable reading. Content may have been condensed, restructured, or simplified.