Value-added modeling
Based on Wikipedia: Value-added modeling
The Algorithm That Judges Your Child's Teacher
Imagine you're a third-grade teacher. You've spent a decade perfecting your craft, staying late to help struggling readers, buying supplies with your own money. Then one day, a spreadsheet arrives declaring you're in the bottom ten percent of educators in your district. Your career hangs in the balance. The verdict came not from your principal, not from parents, not from anyone who has ever watched you teach. It came from a statistical model.
This is the reality of value-added modeling, one of the most controversial and consequential experiments in American education policy.
What Value-Added Modeling Actually Does
The core idea sounds reasonable enough. Traditional ways of evaluating teachers have obvious problems. If you simply compare how students perform on tests, you're really measuring which teachers got the easiest students to begin with. A teacher at an affluent suburban school where children arrive reading chapter books will look like a genius compared to a teacher in an underfunded urban school where some students have never held a book at home.
Value-added modeling attempts to solve this by asking a different question: How much did this specific teacher improve these specific students, compared to how much they would have improved anyway?
Here's how it works. Statisticians take a student's test scores from previous years and use them to predict what that student should score this year. Maybe a student scored in the 65th percentile last year. The model predicts they'll score around the 65th percentile again. If they actually score in the 75th percentile, the model attributes that gain to the teacher. If they drop to the 55th percentile, the teacher takes the blame.
By aggregating thousands of these individual comparisons, the model produces a score for each teacher. That score supposedly represents how much "value" the teacher added to student learning, isolated from all the messy real-world factors that affect education.
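For readers who want to see the mechanics, here is a minimal sketch of that bookkeeping in Python. It assumes the simplest possible prediction, a straight-line fit from last year's score to this year's, run on invented data; real systems layer on many more controls, but the predict-subtract-average logic is the same.

```python
# Minimal sketch of the value-added bookkeeping described above (illustrative
# only). Assumes a single prior-year score predicts this year's score; real
# models add many more controls and far more elaborate statistical machinery.
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 40, 25
teacher_ids = np.repeat(np.arange(n_teachers), n_students)

# Invented data: each student's prior score, an unobserved "true" teacher
# effect, and this year's observed score with random noise mixed in.
prior = rng.normal(50, 10, size=n_teachers * n_students)
true_effect = rng.normal(0, 2, size=n_teachers)
score = prior + true_effect[teacher_ids] + rng.normal(0, 8, size=prior.size)

# Step 1: predict this year's score from last year's (simple straight-line fit).
slope, intercept = np.polyfit(prior, score, deg=1)
predicted = intercept + slope * prior

# Step 2: the residual, actual minus predicted, is the "value" credited to
# (or blamed on) the teacher.
residual = score - predicted

# Step 3: average the residuals over each teacher's students to get a score.
value_added = np.array([residual[teacher_ids == t].mean()
                        for t in range(n_teachers)])
print(value_added.round(2))
```

Even in this toy version, every quirk of the prediction step ends up in the residuals, which is exactly where the arguments about fairness begin.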
The Seductive Promise
The appeal is obvious. For decades, education reformers have complained that virtually all teachers receive satisfactory evaluations, making it nearly impossible to identify who's truly excellent and who's struggling. Value-added modeling promised objectivity. Numbers don't play favorites. An algorithm doesn't care about teacher seniority, union membership, or whether the principal likes you personally.
A 2003 study by the RAND Corporation captured the optimism perfectly, arguing that value-added modeling "holds out the promise of separating the effects of teachers and schools from the powerful effects of such noneducational factors as family background."
The approach caught fire politically. When the Obama administration launched Race to the Top in 2009, it pushed states to adopt rigorous teacher evaluation systems, and many turned to value-added models. Louisiana's legislature passed a bill authorizing the technique in 2010, signed immediately by Governor Bobby Jindal over the objections of the state teachers' federation. Major urban districts including Chicago, New York City, and Washington adopted the system for high-stakes decisions about hiring, firing, and bonuses.
The Los Angeles Times even created a searchable website publishing value-added scores for 6,000 elementary teachers. Secretary of Education Arne Duncan praised it as a model of transparency.
The Statistical Machinery
Behind these scores lies some seriously sophisticated mathematics. Researchers use something called hierarchical linear modeling, a technique that accounts for the nested structure of education. Students sit within classrooms, classrooms exist within schools, and schools operate within districts. Each level affects outcomes.
The models can incorporate a dizzying array of variables. At the student level: past performance, socioeconomic status, race and ethnicity. At the teacher level: certification status, years of experience, highest degree earned, teaching methods. At the school level: size, type, urban or suburban or rural setting. The goal is to control for everything except the teacher's actual impact.
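As a hedged illustration of what that nesting looks like in practice, the sketch below fits a two-level mixed-effects model, students inside classrooms, using the statsmodels library on simulated data. Every variable name is invented for the example, and production systems add school and district levels plus far richer controls.

```python
# Two-level illustration of the hierarchical idea: students nested inside
# teachers' classrooms. Invented data and variable names; real models add
# school and district levels and many more covariates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_teachers, n_students = 30, 25

df = pd.DataFrame({
    "teacher": np.repeat([f"t{i:02d}" for i in range(n_teachers)], n_students),
    "prior_score": rng.normal(50, 10, n_teachers * n_students),
    "low_income": rng.integers(0, 2, n_teachers * n_students),
})
true_effect = dict(zip(df["teacher"].unique(), rng.normal(0, 2, n_teachers)))
df["score"] = (df["prior_score"]
               + df["teacher"].map(true_effect)
               - 1.5 * df["low_income"]
               + rng.normal(0, 8, len(df)))

# Fixed effects control for prior achievement and a poverty proxy; the random
# intercept estimated for each teacher plays the role of that teacher's
# "value added."
model = smf.mixedlm("score ~ prior_score + low_income", df, groups=df["teacher"])
result = model.fit()

teacher_value_added = {t: float(re.iloc[0]) for t, re in result.random_effects.items()}
print(sorted(teacher_value_added.items(), key=lambda kv: kv[1])[:5])  # lowest five
```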
William Sanders, the statistician who pioneered the practical use of these models in Tennessee schools during the 1990s and later continued the work at SAS, argued confidently that "if you use rigorous, robust methods and surround them with safeguards, you can reliably distinguish highly effective teachers from average teachers and from ineffective teachers."
Where the Model Breaks Down
But here's the thing about statistical models: they're only as good as their assumptions. And value-added modeling rests on some shaky foundations.
The most fundamental assumption is that students are randomly assigned to teachers. In the real world, this almost never happens. Parents request specific teachers. Principals steer struggling students toward their most experienced educators. Students with behavioral problems get clustered together. These non-random patterns can systematically bias the results.
Jesse Rothstein, an economist at the University of California, Berkeley, has been one of the most persistent critics on this point. "Non-random assignment of students to teachers can bias value added estimates of teachers' causal effects," he writes. His research suggests these biases aren't just theoretical—they actually distort the scores in measurable ways.
There's also the problem of what statisticians call measurement error. Test scores fluctuate. A student might have slept badly, felt anxious, or simply had an off day. These random variations get attributed to teachers as if they reflected genuine differences in teaching quality.
The result is startling instability. A teacher rated in the bottom quartile one year might land in the top quartile the next, even if nothing about their teaching actually changed. One analysis found that a ranking based on a single classroom of data correctly classifies teachers only about 65 percent of the time, not much better than a coin flip.
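A rough simulation makes the instability concrete. The numbers below, a modest spread of true teacher quality and a larger dose of classroom-level noise, are assumptions chosen for illustration rather than estimates from any real district, but they show how readily quartile rankings reshuffle between years even when teaching quality never changes.

```python
# Rough simulation of why single-year rankings bounce around. The spread of
# true teacher effects and the noise level are illustrative assumptions,
# not estimates from real data.
import numpy as np

rng = np.random.default_rng(2)
n_teachers = 10_000

true_effect = rng.normal(0, 1, n_teachers)         # stable underlying quality
noise_sd = 1.5                                     # classroom-level noise each year
year1 = true_effect + rng.normal(0, noise_sd, n_teachers)
year2 = true_effect + rng.normal(0, noise_sd, n_teachers)

def quartile(scores):
    """Assign each teacher to a quartile (0 = bottom, 3 = top) by rank."""
    ranks = scores.argsort().argsort()
    return ranks * 4 // len(scores)

same_quartile = (quartile(year1) == quartile(year2)).mean()
bottom_to_top = ((quartile(year1) == 0) & (quartile(year2) == 3)).mean()

print(f"Teachers landing in the same quartile both years: {same_quartile:.0%}")
print(f"Bottom quartile one year, top quartile the next: {bottom_to_top:.1%}")
```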
The Kindergarten Problem
Value-added modeling has a fundamental structural limitation that reveals its conceptual fragility. The entire system depends on comparing this year's test scores to previous years' scores. But what about kindergarteners? What about first graders?
They don't have previous standardized test scores. The model simply cannot evaluate their teachers.
This isn't just an inconvenience. It means the teachers working with our youngest, most impressionable students—the ones laying foundations for everything that follows—exist entirely outside the evaluation system that supposedly identifies teacher quality. Some researchers respond by limiting their models to third grade and above, effectively admitting the approach doesn't work for early childhood education.
The Math and Reading Gap
Here's a curious finding that should give reformers pause. Value-added scores are much more sensitive to teacher effects in mathematics than in reading or language arts. The same teacher can have a dramatically different score depending on which subject you're measuring.
Why might this be? One possibility: the reading tests are poorly constructed. But there's another explanation that cuts deeper. Students learn language from everywhere—from their families, from television, from books they read for pleasure, from conversations with friends. Math instruction happens predominantly in school.
If this explanation is correct, it means value-added models are measuring not just teacher quality but the relative importance of school versus everything else in a student's life. They might work reasonably well for subjects where teachers are the primary source of learning, but fail for subjects where learning is distributed across many influences.
The Small Sample Problem
Statistical precision requires adequate sample sizes. This is Statistics 101. To get reliable value-added estimates, researchers typically need data from at least 50 students per teacher.
Think about what this means in practice. An elementary teacher might have 20 to 25 students per year. They'd need two or three years of data before their score becomes reasonably stable. A first-year teacher, almost by definition, cannot be evaluated reliably—precisely when evaluation might be most valuable.
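A back-of-envelope calculation shows how slowly the precision improves. The student-level noise figure below is an assumption picked for illustration; real analyses estimate it from the data, but the square-root arithmetic works the same way.

```python
# Back-of-envelope: the standard error of a teacher's average gain shrinks
# only with the square root of the number of students. The noise figure is
# an illustrative assumption, not an estimate from real data.
import math

student_noise_sd = 8.0   # assumed spread of individual score gains, in points

for years, n_students in [(1, 25), (2, 50), (3, 75)]:
    se = student_noise_sd / math.sqrt(n_students)
    # A rough 95% confidence interval is about two standard errors either side.
    print(f"{years} year(s), {n_students} students: "
          f"standard error ~{se:.1f} points, 95% interval ~ +/-{2 * se:.1f}")
```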
Even with accumulated data, the confidence intervals remain wide. The American Statistical Association issued a pointed statement in 2014 noting that the large standard errors result in "unstable year-to-year rankings." A teacher might bounce from excellent to mediocre to excellent again purely due to statistical noise, with no actual change in their teaching.
The Student Mobility Challenge
Modern students don't stay put. Families move. Students transfer between schools mid-year. Some districts experience turnover rates exceeding 30 percent annually.
Each transfer creates a measurement problem. If a student spends September through January at one school and February through June at another, who gets credit—or blame—for their year-end score? The models typically assign the outcome to wherever the student landed at testing time, even though that teacher may have had the student for only a few months.
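A tiny, entirely hypothetical example shows what the attribution rule decides. The testing-time convention described above hands the whole year-end gain to the second teacher; an enrollment-weighted split, shown for contrast, is one obvious alternative. All of the numbers are invented.

```python
# Hypothetical illustration of the attribution problem; every number is invented.
months_taught = {"Teacher A (Sept-Jan)": 5, "Teacher B (Feb-June)": 5}
year_end_gain = 6.0   # invented: test-score points gained over the whole year

# Testing-time convention: whoever has the student when the test is given
# gets all of the credit (or blame).
testing_time = {"Teacher A (Sept-Jan)": 0.0, "Teacher B (Feb-June)": year_end_gain}

# Enrollment-weighted alternative: split the gain by months of instruction.
total_months = sum(months_taught.values())
weighted = {t: year_end_gain * m / total_months for t, m in months_taught.items()}

print(testing_time)   # all 6.0 points go to Teacher B
print(weighted)       # 3.0 points each
```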
And what about the student's scores from their previous school? Different states use different tests. Different districts within states sometimes use different assessments. When a student transfers, their prior scores may be unavailable or non-comparable, leaving the model without the baseline data it needs.
When the Algorithm Went Public
The most dramatic real-world test of value-added modeling came in August 2010, when the Los Angeles Times published its database of teacher scores. The newspaper had commissioned its own analysis of seven years of data from the Los Angeles Unified School District.
The reaction was explosive. Teachers protested. The union was furious. One teacher, Rigoberto Ruelas, died by suicide weeks after the ratings were published; colleagues and family members linked his distress to his published rating, though no one can say with certainty what caused his death.
Six months later, researchers from the National Education Policy Center published a devastating reanalysis. Derek Briggs and Ben Domingue examined the same data, attempting to replicate the Times' results. Their conclusion: "the research on which the Los Angeles Times relied for its August 2010 teacher effectiveness reporting was demonstrably inadequate to support the published rankings."
The Times had treated estimates riddled with uncertainty as if they were precise measurements. Teachers' careers and reputations had been publicly evaluated using methods that couldn't support the conclusions.
The Gates Foundation's Experiment
The Bill and Melinda Gates Foundation, never shy about ambitious education interventions, launched a major multi-year study called Measures of Effective Teaching to test whether value-added scores actually identify good teachers.
The initial results, released in December 2010, seemed promising. Value-added scores correlated with student perceptions of teachers on key dimensions like classroom control and challenging coursework. Perhaps most intriguingly, the study found that teachers who "teach to the test"—drilling students on likely exam content—actually had lower value-added scores than teachers who promoted deep conceptual understanding.
This was an encouraging sign. It suggested the model might capture something real about teaching quality, not just test-taking tricks.
But then Rothstein reanalyzed the results. His findings were uncomfortable. The data, he argued, didn't actually support the conclusions the project had announced. "Interpreted correctly," he wrote, the analyses "undermine rather than validate value-added-based approaches to teacher evaluation."
The Gates Foundation's subsequent reports defended the methodology, but the controversy underscored how even well-funded, carefully designed research could yield ambiguous results.
Beyond Teachers: Evaluating Principals
If value-added modeling could evaluate teachers, why not apply it to principals? The logic seemed straightforward. Just as teachers affect student learning, school leaders affect everything that happens in their buildings.
Research in Texas explored this idea by tracking what happens to student achievement when principals change schools. If a principal moves from School A to School B, and School A's scores drop while School B's rise, that suggests the principal was making a difference.
The results were striking—perhaps too striking. The Texas analysis found that principals have enormous impacts, with effective leaders producing gains equivalent to two months of additional learning per year for every student in their school. Ineffective principals caused equally large negative effects.
But these findings raise their own questions. Two months of additional learning is a massive effect. Is it plausible that leadership matters this much? Or are we seeing the same measurement problems that plague teacher evaluation—noise and non-random selection masquerading as signal?
What the Research Actually Shows
The academic debate over value-added modeling isn't as one-sided as critics sometimes suggest. Research by Harvard economist Raj Chetty and his colleagues followed students into adulthood and found that those who had higher value-added teachers earned modestly more money as adults. The effects were small—we're talking about a few hundred dollars per year—but they were real and statistically significant.
This suggests value-added scores do capture something meaningful about teacher impact, at least on average. The problem is that "on average" isn't the same as "for individual teachers." A measure might correctly identify that high-scoring teachers are generally better without being accurate enough to fairly evaluate any specific teacher's career.
The distinction matters enormously. Using value-added scores to study educational policy is very different from using them to fire the bottom ten percent of teachers.
The Expert Consensus
By 2014, the American Statistical Association had seen enough. Their statement on value-added models acknowledged the technique had some value for research but warned against high-stakes personnel decisions. The models' limitations, they wrote, were too significant for that purpose.
The Economic Policy Institute, in a 2010 report, put it more bluntly. While acknowledging that American schools generally do a poor job of evaluating teachers, the report warned that overreliance on standardized test scores "will not lead to better performance." Value-added should be "one factor among many," not the determining factor.
Even proponents agree. No serious researcher recommends using value-added scores as the sole basis for any consequential decision. The consensus is that the scores should be combined with classroom observations, student feedback, and professional judgment.
The trouble is that once you have a number, people want to use it. The nuance gets lost in implementation.
The Deeper Question
Education policy researcher Gerald Bracey raised a point that cuts to the heart of the debate. Even if value-added scores perfectly captured something about teachers, that something might not be what we actually care about.
The models measure teacher impact on standardized test scores. But is that the same as teaching quality? Is it the same as inspiring a love of learning? Is it the same as building the habits of mind that lead to success in life?
Maybe test score gains correlate with these deeper goods. Maybe they don't. The value-added approach, by its nature, can only evaluate what can be tested. Everything else—creativity, curiosity, character, citizenship—lies outside its vision.
Where Things Stand
The fervor for value-added modeling has cooled since its peak in the early 2010s. The Every Student Succeeds Act of 2015 walked back some of the federal pressure that had pushed states toward test-based teacher evaluation. Several states have modified or abandoned their value-added systems.
But the approach hasn't disappeared. Many districts still incorporate it into teacher evaluation, usually alongside other measures. The dream of objective, data-driven evaluation remains seductive, even as the evidence suggests that reality is messier than any algorithm can capture.
The fundamental tension remains unresolved. We want to identify and reward excellent teaching. We want to help struggling teachers improve. We want accountability. These are legitimate goals. The question is whether a statistical model, however sophisticated, can achieve them fairly—or whether it creates the illusion of precision where none exists.
That teacher in the opening scenario? She might be excellent. She might be struggling. The model might be right, or it might be wrong. What we know for certain is that the confidence the number projects far exceeds the confidence the methodology can actually support.
A Brief History of the Idea
The concept of judging teachers by how much their students learn isn't new. Eric Hanushek, now a senior fellow at the Hoover Institution at Stanford University, introduced the core idea into academic research in 1971. Richard Murnane at Harvard extended the analysis. For decades, it remained an academic exercise.
William Sanders transformed it from theory to practice. A statistician at the University of Tennessee who later moved to SAS, the analytics company, he built operational value-added models that were adopted in Tennessee and North Carolina. Tennessee became the first state to use the technique for actual teacher evaluation in the 1990s.
The No Child Left Behind Act of 2001, signed into law in early 2002, brought a heavy emphasis on standardized testing and accountability and created demand for exactly this kind of quantitative measure. Race to the Top in 2009 accelerated adoption further. Within a decade, a statistical technique that had been confined to academic journals was shaping careers and policy nationwide.
Whether that expansion happened too fast, without sufficient validation, is still being debated. What's clear is that the gap between what the research supported and what the policy required was substantial. Practitioners needed certainty. The methodology offered probabilities. Something had to give.