January 17, 2010

Evaluating teachers on value-added test scores: the Regression toward the Mean problem

Can sending star teachers into slum schools close the racial gap in school achievement? Can teachers be fairly evaluated by how much their students' test scores went up from last spring to this spring?

Both ideas are very fashionable these days. I want to evaluate both theoretically, using a simple model with two assumptions:

First, star teachers exist, fortunately. Over the course of one year, some teachers can raise their students' test scores by more than one grade level. (There are also dud teachers who can't raise test scores as much as the average teacher can.) In my simplified model, a star teacher is one who raises grade levels 1.5 years per 1.0 years in the classroom.

Second, the positive impact of star teachers is partly reduced over time by regression toward the mean. After nine months under the guidance of Miss Jean Brodie, the kids are well ahead of the average. But when they come back from summer vacation, they aren't as far ahead anymore. Away from Ms. Wonderful, they've regressed toward the mean. There can be a lot of other causes for regression toward the mean. Perhaps after a second year under Miss Jean, some of the students are bored with her tricks and less intimidated by her shtick. Maybe, especially in math and science, the students start getting closer to their intellectual limits.

So, let's assess both questions about teachers with these two concepts in mind. Let's start with something I've always assumed was a good idea: value-added evaluations of teacher performance.

I've long advocated that teachers should not be evaluated upon how well their students do on standardized tests, since the impact of the teacher is typically overwhelmed in the results by the differences between students. Those kinds of evaluation systems just augment the natural tendency for the best teachers to wind up with the best students, as everybody scrambles to get hired at the schools with the smartest students. Instead, I've argued for "value-added" evaluations of teachers, measuring how much test scores have gone up under the teacher relative to the students' previous scores. The Obama Administration has come around to this view, too.

Now, though, I've developed a worrisome question about measuring teacher performance on value added, something I've always recommended. How do you factor the effects of regression toward the mean into formulas for measuring teacher performance? In the real world, you can't always assume that last year's test scores show how smart each teacher's students are on average. Last year's scores were likely driven up or down by the quality of last year's teacher. The really confusing thing is that students whose test scores were unnaturally depressed by a bad teacher last year are likely to go up more this year than students whose test scores were boosted last year by a very good teacher. That's regression toward the mean.

Let's take a sports coaching example. When I was at Notre Dame High School, our archrival Crespi always killed us in pole vaulting during our annual track meet. In fact, Crespi vaulters set a whole bunch of different national age group and high school year records. That's pretty amazing. Strangely enough, it becomes less amazing when you discover that all three star Crespi vaulters were named Curran. It turns out that the Curran brothers had a pole vault track and pit in their backyard, where their father, who had been a pole vaulter, trained them in advanced pole vaulting techniques.

Here's a one minute video from a Super Eight home movie from around 1972 of seventh-grader Anthony Curran clearing 9 feet in his backyard. I had always imagined ever since I read in the 1970s about the Curran family pole vaulting practice ground that they were very rich and had a huge back yard with an Olympic Stadium type set-up, but the video shows it's cramped, ramshackle, and the pit consists of old mattresses right in front of a brick wall. It looks like a good place to break your neck. I'm sure no modern upper middle class mom would put up with Dad and the boys building such a nightmare in the backyard, but Mrs. Curran can be seen waving happily in the home movie as her 13-year-old son hurtles toward his fate.

Not too surprisingly, the Curran Brothers were quite good pole vaulters in college (Anthony Curran, now the pole vault coach at UCLA, has an all-time personal best of 18'-8"), but they weren't the record setters in their subsequent careers that they had been in high school. I don't think any Currans ever made the U.S. Olympic team. Regression toward the mean set in as they got older and better natural athletes started to catch up to them in hours of lifetime training.

Say you were the college pole vault coach of the Curran Brothers and the athletic director said to you, "Tim Curran set a world age group record at 15, Anthony Curran set national class year records in high school for sophomore, junior, and senior years. We recruited you the two most accomplished high school vaulters in the history of the top pole vaulting state in the Union. But under your coaching, they aren't even winning college national championships. Why are you failing so badly with all this talent we gave you?"

The true answer is that because the Currans started training so much younger than their current competitors in college, they came closer to fulfilling 100% of their natural potential in high school than anybody else in California did. Now, the other kids are catching up and regression toward the mean is kicking in for the Currans. As high schoolers, the Currans had good nature and exceptional nurture to dominate an obscure sport. By college, they were running into competitors with even better nature, and the nurture gap was closing as all the top competitors got the same amount of coaching in college.

Now, let's think about this in a typical school, where children aren't always fully randomly shuffled after each year. For example, at my elementary school in the 1960s, there were 70 children at each grade level, so they were divided up into the Blue and the Red classrooms. They weren't tracked, they were just randomly assigned. If you started out as a Blue, you typically stayed in Blue with your closer friends.

Say that the two 1st grade teachers are wildly different in effectiveness. The Blue 1st Grade teacher's students finish the year a half grade level above the average, while the Red 1st Grade teacher's students finish the year a half grade level below average.

Now, if you are a second grade teacher of perfectly average effectiveness, a teacher who can be expected to raise the grade level of an average class by 1.0 years, which class do you want to inherit, Blue or Red, to do best on the teacher effectiveness evaluation at the end of their second grade?

Let's say that the great Blue first grade teacher's benefits have a one year half life and the bad Red first grade teacher's harms have a one year half life. In other words, there is regression toward the mean over time in teaching effectiveness, as in so much in life.

If you were being measured not on value added but on simple absolute performance at the end of the grade, you'd want to inherit the Blue class that ended last year 0.5 grade levels above average. If you do an average job and the half life is one year, then they'll finish your year averaging grade level 2.25: 0.25 grade levels above average, and you'll be considered a good teacher.

On the other hand, if you are being relativistically measured on value added as calculated by your second graders' grade level at the end of your year minus their grade level at the end of the previous year, you don't want to inherit the star teacher's overachieving Blue class, because you will only get credit for adding a crummy 0.75 grade levels in value. Sure, after two years, they'll be at grade level 2.25, but they were at 1.5 a year ago, so you only get credit for 2.25 - 1.50 = 0.75 grade levels of value added.

Under value added measurement, you might get fired for, in essence, having inherited the better taught class.

Instead, under value added measurement, you want to inherit the underachieving Red class from that bad teacher, so that you can get the credit for her students' inevitable upward regression toward the mean. They'll wind up the year going from 0.5 to 1.75, so you'll get credit for adding the value of 1.25 grades. I'm a star! Give me my bonus money, Arne Duncan, gimme it now!
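The Blue versus Red arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not anything from the original spreadsheet: the function name `end_of_year_level` and its parameters are illustrative assumptions, and `retention=0.5` is simply the one year half life restated as "half of last year's lead or deficit survives."

```python
def end_of_year_level(start_level, start_grade, teacher_gain=1.0, retention=0.5):
    """Grade level of a class at the end of the next school year.

    start_level:  tested grade level at the end of last year
    start_grade:  grade level an average class would have at that point
    teacher_gain: grade levels an average class gains under this teacher
    retention:    share of last year's lead (or deficit) that survives the year;
                  0.5 is the one year half life assumed in the text
    """
    deviation = start_level - start_grade
    return start_grade + teacher_gain + retention * deviation


# Hypothetical second grade teacher of average effectiveness inheriting each class.
results = {}
for name, start in (("Blue", 1.5), ("Red", 0.5)):
    end = end_of_year_level(start, start_grade=1.0)
    results[name] = {"end": end, "value_added": end - start}
    print(f"{name}: ends at {end:.2f}, value added {end - start:.2f}")
```

Under these assumptions the same average teacher looks better on absolute level with the Blue class and better on value added with the Red class, which is the conflict described above.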

This model where there is partial regression toward the mean after the impact of superstar teachers has interesting implications for the national obsession with closing the racial gaps in school achievement.

Assume you have an elementary school with average students where every teacher is a star capable of pushing students ahead 1.5 grades each year (a Grade Level Boost of 0.5), all else being equal. If there is zero regression toward the mean, a simple Excel model predicts that when the average student graduates at the end of eighth grade, he's performing at the 12th grade level.

Grade  Grade Level Boost  Regression to Mean  Grade Level
1      0.5                0%                  1.5
2      0.5                0%                  3.0
3      0.5                0%                  4.5
4      0.5                0%                  6.0
5      0.5                0%                  7.5
6      0.5                0%                  9.0
7      0.5                0%                  10.5
8      0.5                0%                  12.0

On the other hand, if there is 100% regression toward the mean, the average student, after eight years of star teachers, tests at just the 8.5 grade level at the end of 8th grade:

Grade  Grade Level Boost  Regression to Mean  Grade Level
1      0.5                100%                1.5
2      0.5                100%                2.5
3      0.5                100%                3.5
4      0.5                100%                4.5
5      0.5                100%                5.5
6      0.5                100%                6.5
7      0.5                100%                7.5
8      0.5                100%                8.5

The discouraging thing is that the results of regression toward the mean aren't symmetrical: you only get the big boosts in grade level by eliminating the last bits of regression toward the mean, but that's very hard to do.

For example, if the regression toward the mean factor is 50 percent per year, then the average student who has benefited from eight consecutive star teachers leaves the school at the end of the 8th grade performing at just the 9.0 grade level. Eight star teachers in a row have gotten him up only one grade level:

Grade  Grade Level Boost  Regression to Mean  Grade Level
1      0.5                50%                 1.5
2      0.5                50%                 2.8
3      0.5                50%                 3.9
4      0.5                50%                 4.9
5      0.5                50%                 6.0
6      0.5                50%                 7.0
7      0.5                50%                 8.0
8      0.5                50%                 9.0
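The spreadsheet formula isn't given here, but the three tables so far are consistent with a simple recurrence: each year the class's lead over the average shrinks by the regression percentage, and then that year's star teacher adds 0.5 on top. The sketch below assumes that recurrence; the function name `grade_levels` and its parameters are illustrative, not taken from the Excel model.

```python
def grade_levels(boost=0.5, regression=0.5, years=8):
    """Return the class's grade level at the end of each year.

    Assumed recurrence (inferred from the tables, not stated in the text):
        lead = (1 - regression) * lead + boost
    where `lead` is how far the class sits above the average for its grade.
    """
    lead = 0.0
    levels = []
    for grade in range(1, years + 1):
        lead = (1.0 - regression) * lead + boost
        levels.append(grade + lead)
    return levels


# Regression rates used in the three tables above.
for rate in (0.0, 1.0, 0.5):
    row = " ".join(f"{level:.1f}" for level in grade_levels(regression=rate))
    print(f"{rate:>4.0%}: {row}")
```

With 50% regression the lead approaches 1.0 but never passes it, which is why eight star teachers in a row leave the class only about one grade level ahead.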

So, you can see where the contemporary obsession in the Obama Administration and the prestige press with reducing regression toward the mean comes from: taking away kids' summer vacations, keeping them at school a dozen hours per day (the celebrated KIPP program), and so forth.

Unfortunately, the big gains only come from eliminating the last bits of regression toward the mean. If you can cut regression toward the mean from 50% to 25%, then the average student's grade level at the end of eighth grade increases from 9.0 to 9.8:

Grade  Grade Level Boost  Regression to Mean  Grade Level
1      0.5                25%                 1.5
2      0.5                25%                 2.9
3      0.5                25%                 4.2
4      0.5                25%                 5.4
5      0.5                25%                 6.5
6      0.5                25%                 7.6
7      0.5                25%                 8.7
8      0.5                25%                 9.8

But, as you can see, in a school of star teachers, reducing annual regression toward the mean from 100% to 25% only boosts grade level upon eighth grade graduation by 1.3 years, from 8.5 to 9.8. In contrast, reducing annual regression toward the mean from 25% to 0% would, theoretically, boost grade level at elementary school graduation by 2.2 years, from 9.8 to 12.0. But, due to diminishing marginal returns, it's probably much harder to reduce regression toward the mean from 25% to 0% than from 100% to 25%.
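The asymmetry can be seen directly by sweeping the regression rate. The sketch below assumes the tables follow the recurrence lead = (1 - regression) * lead + boost, which sums to a geometric series with a closed form; the function name `final_lead` and the extra rates of 75% and 10% are illustrative additions, not figures from the text.

```python
def final_lead(regression, boost=0.5, years=8):
    """Lead over the average after `years` star teachers in a row.

    Closed form of the assumed recurrence
        lead = (1 - regression) * lead + boost
    which is a geometric series in the retained share (1 - regression).
    """
    if regression == 0.0:
        return boost * years  # nothing decays, so the boosts simply add up
    retained = 1.0 - regression
    return boost * (1.0 - retained ** years) / regression


# Grade level at the end of 8th grade for a range of regression rates.
for rate in (1.0, 0.75, 0.5, 0.25, 0.1, 0.0):
    print(f"regression {rate:>4.0%}: grade level {8 + final_lead(rate):.1f}")

gain_100_to_25 = final_lead(0.25) - final_lead(1.0)
gain_25_to_0 = final_lead(0.0) - final_lead(0.25)
print(f"100% -> 25%: +{gain_100_to_25:.1f} grades; 25% -> 0%: +{gain_25_to_0:.1f} grades")
```

Under this assumption the lead tops out near boost / regression once enough years have passed, so most of the possible gain sits in the last stretch toward 0%.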

Since the white-black gap at the end of high school is three to four years, these regression toward the mean calculations can help explain why there is such a Blind Side-like obsession with plugging holes in the environment where NAM students' regression toward the mean might occur. For example, the NYT Magazine ran a feature on a public boarding school in a poor part of Washington DC where the taxpayers pay $35k per student per year for five nights per week at this boarding school. But the article was heavily devoted to worrying about whether the two nights per week that the students spend at home were causing the presumed test score gains of the five nights in the dorm to regress back toward the black mean.

Of course, the real killer in terms of closing the racial gap by eliminating sources of regression toward the mean is that eventually, these individuals turn into adults whom you can't manipulate so much, and then they choose environments for themselves.

My published articles are archived at iSteve.com -- Steve Sailer


anony-mouse said...

I think what society really needs are tests to find out what makes a good blogger (or blog commenter).

Anonymous said...

anony-mouse, ask Hillary Clinton who makes a good blogger/commenter. Some years ago she was blithering about society's need for gooberment "gateways" to the net, so she probably has some ideas.

jody said...

"Can sending star teachers into slum schools close the racial gap in school achievement?"

it can definitely improve the performance of the students from the worst parts of town. this is what my cousin does for teach for america. i mentioned her before. she has an undergrad in math from UVA and a masters in education from columbia. she's been working for teach for america for about 6 years. she taught for about 4 years, but now runs one of the entire city wide operations.

i also mentioned one of my best friends from high school, one of the most natural math guys i had ever met, 800 on the math SAT and 5 on the AP calculus BC exam, et cetera. he is a christian nutjob, and decided it was his duty to help the less fortunate by becoming a math teacher in philadelphia high schools. i don't think he helped them. he quit after 1 year.

Mitch said...

All of these value added discussions are geared towards elementary school. You can't do value added calculations in high school math, history, or science.

Incidentally, I remember reading a report that evaluated a state's educational results--North Carolina, I think?--and determined that moving the best teachers to the worst schools would lower good student scores more than it would raise poor student scores. Not a good trade at all.

Anonymous said...

Would you consider a value added system valid when the test devised in middle school is written by a different company than the one that writes the one in high school, and then compared to each other? That's what's happening in my state.

Polistra said...

Very interesting point. Most biological variables are logarithmic. As the physical input increases linearly, the bio response expands quickly then asymptotes toward an unbreakable ceiling.

In the simpler sorts of biological measurements like vision and hearing, this log-ness is taken into account.

Perhaps you could develop a decibel-like scale for intellectual development? Call it the deciWatson, which would partner up nicely with the Bel and would also salute a later Watson who was persecuted for his intelligence....

Anonymous said...

I have been in this situation in terms of teaching ESL students. A teacher definitely wants the kids from the lower performing teacher because the curriculum is so repetitive that most everything is retaught. If a kid scores low just because he wasn't taught, he hasn't lost the ability to learn, so as soon as you teach him, he will get it. I have had kids grow 2-4 years in a single year in their reading level. They were middle school with limited English, but pretty bright so they could go from a grade 1.5 reading level to level 3.5 or even level 5.5 in a year. We always expected more than a year of growth per year because it was ESL.

The only fair way to evaluate teachers based on student performance is to compare the students' IQ's to their performance. Every special ed dept. has a chart that has expected achievement test values based on IQ to see whether kids qualify for special ed. A teacher's entire student group scores could be averaged and compared to their group IQ average. Effectiveness is then based on how well students perform based on their abilities.

Anonymous said...

Steve, I normally give you the benefit of the doubt but this model is completely made-up. Unless there's research to show that learning actually works this way, this post is kind of like arguing whether Green Lantern could beat up Darth Vader.

The Outsider said...

Are you sure that "regression toward the mean" is really the phenomenon you're describing? I think of RTM as what happens when you randomly sample from a population - the more extreme a sample is, the less likely it is that the next sample will be more extreme.

What you're describing seems to be something like "decay toward the mean." Is this a distinction with a difference? Maybe. You're saying that the students' test scores at the end of each year are an accurate measure of their actual ability, but that the gap between their actual ability and the mean shrinks with time. RTM could occur even if there were no shrinkage; but it would be caused by variations among the samples of students tested, not changes in their actual ability.

Anonymous said...

I think what society really needs are tests to find out what makes a good blogger (or blog commenter).

Don't call us, we'll call you.

Anonymous said...

This not being a Mao style state I wonder about this notion of assigning teachers, star or not, to wherever some bureaucrat decides. Don't the teachers themselves have some ideas where they want to work? Or are they just pawns to be moved around according to the latest new great idea? Has anyone even considered that they may not want to work in the ghetto? Also, I've wondered throughout the years about the results to be had if these star teachers were to teach white students. All the resources gobbled up by trying to close one gap or another might have been better utilized to improve the performance of white students who would give a better return on the dollar. Most poor people in the USA are white, let's do something for them for once.

josh said...

Re jody: I can't imagine why a "nutjob" white Christian who is a natural at math would have any problems in the black/Puerto Rican/Dominican high schools. Nope. I can't.

albertosaurus said...

When I was still in High School I read a multiple regression study of college outcomes done by the National Merit Scholarship people. They studied the question of which inputs accounted for the outputs of a college education. The answer was simply the quality of the students. Teaching didn't matter.

They found for example that Harvard graduates did very well on achievement tests but the only factor they could identify as contributing to that result was the scores of entering students on aptitude tests. Harvard had famous professors, excellent labs, and small class sizes. None of those showed up as a contributing independent variable. The only thing that mattered was the test scores of the entering freshmen.

It's hard to accept if you are a teacher but there is a lot of evidence that teaching quality just doesn't matter. I admit that I have hardly read anything about education in the half century since I first read that study. Maybe things have changed - but I don't think so.

I have taught a lot of classes (mostly adults at night) in those fifty years and I have an insight.

I was at one time a star teacher. I was interesting, informed and enthusiastic. I motivated the students and they responded by learning. I was great - for about three years. Alas I taught statistics for five years. By the end of the fifth year of teaching statistics I was as bad as any college teacher I had ever had. I was boring and dull - all the student assessments said so. The college suggested that I not come back. I agreed.

That same sort of arc occurred again with other subjects - the first time I taught a subject I was a poor teacher because I didn't really know the subject very well. The next time I taught that class I was better but after four or five times I was so bored that I stunk.

I could be a star teacher for about three years. I kept switching subjects to avoid going stale. I find it hard to believe that anyone can be a star teacher on a long term basis.