November 19, 2007

More from Audacious Epigone on how well schools are doing

Is this the best ranking yet available of how the states differ in how good a job their public schools are doing?

Rank  State              Whites' relative NAEP improvement
                         from 4th grade to 8th (St. Dev.)
 2.   North Dakota        .84
 6.   Dept. of Defense
 9.   South Dakota        .50
23.   New Mexico          .13
...
33.   New Jersey         (.06)
36.   Rhode Island       (.17)
41.   South Carolina     (.40)
43.   New Hampshire      (.56)
47.   New York           (.89)
48.   West Virginia      (1.10)
49.   North Carolina     (1.18)

(Excerpt of the full ranking; parentheses denote negative values.)

Over the years, I've been frustrated by how everybody uses the absolute test scores of students to evaluate how good a job a school is doing: "You'll get a great education at Harvard because the average SAT score there is 1500!" Yes, but that's what they got in high school before Harvard got its mitts on them. In truth, nobody has much of an idea whether Harvard is doing a better or worse job than, say, Cal State Dominguez Hills at helping its students live up to their individual potential.

Similarly, I often hear people assume that the principal at, say, Beverly Hills H.S. is doing a good job because test scores are high there, while the principal at, say, Dominguez High in Compton must be doing badly because scores are low. That's quite unfair.

Absolute test scores for public schools are so dominated by demographics that the results are notoriously boring and depressing.

The state of California attempts to deal with this problem by giving two Academic Performance Index scores to each public school, one absolute and one relative to "similar schools."

But I've always wanted to look at how much "value added" schools provide.

Earlier, Audacious Epigone tried to figure out from the federal National Assessment of Educational Progress reading and math test results how much value different state educational systems are adding. He compared, across states, performance by 4th graders in 2003 vs. performance by 8th graders in 2007 on the NAEP.

That's a pretty clean comparison (for example, if one state has had a policy of discouraging Special Ed kids from taking the NAEP and another hasn't, the difference shouldn't affect the relative change over time, unlike the usual absolute comparisons).

But what if there is a big demographic shift going on, such as in states with a dramatic Hispanic influx? That would distort the numbers.

So now, in the table above, he's looking at just the change in performance from 4th to 8th grade among non-Hispanic white students in order to reduce the impact of demographic change and make for even more of an apples to apples comparison.

(This analysis could also be done for blacks and for Hispanics, but not for all 50 states because of inadequate sample sizes of minorities in, say, Montana or Vermont.)
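To make the metric concrete, here is a minimal sketch. I'm assuming the calculation works by taking each state's raw score gain from 4th grade (2003) to 8th grade (2007) and standardizing it across states; the scores below are invented illustrations, not real NAEP numbers.

```python
# Sketch of the relative-improvement metric, assuming it standardizes
# each state's raw 4th-to-8th-grade score gain across states.
# All scores here are made up for illustration.
from statistics import mean, pstdev

white_4th_2003 = {"Montana": 228, "Connecticut": 231, "Kansas": 225}
white_8th_2007 = {"Montana": 277, "Connecticut": 268, "Kansas": 272}

# Raw gain per state, then convert to standard-deviation units
gains = {s: white_8th_2007[s] - white_4th_2003[s] for s in white_4th_2003}
mu, sd = mean(gains.values()), pstdev(gains.values())

relative_improvement = {s: round((g - mu) / sd, 2) for s, g in gains.items()}
for state, z in sorted(relative_improvement.items(), key=lambda kv: -kv[1]):
    print(state, z)
```

A state whose white students gained more points than the national norm comes out positive; one that gained less comes out negative, matching the parenthesized entries in the table.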

The results are quite striking. In the best state, Montana, white students did almost a standard deviation better as 8th graders in 2007 than they (using the term "they" roughly) did as 4th graders in 2003, relative to the rest of the country. In contrast, in the worst state, Connecticut, white students' change from 4th to 8th grade was one and a third standard deviations worse than the national average, relatively speaking.

That's more than a two standard deviation difference between #1 and #50. These are such large differences that I'm hesitant to present the numbers, but maybe somebody out there can help us check them out.

Clearly, there is some demographic change from 4th grade in 2003 to 8th grade in 2007 still showing up in the data. Perhaps the top white students in Connecticut (last on the list) are more likely than in typical states to leave the public schools for elite prep schools starting in 7th grade? (Maybe not -- most of the boarding schools in that state famous for its boarding schools run grades 9-12.)

In general, the states at the top of the list tend to be less demographically diverse than those at the bottom, although there are obvious exceptions, such as West Virginia doing quite badly.

Still, the sample sizes are impressively large: 196,000 for public school 4th graders (across all races) and 164,000 for 8th graders, or around 5-6% of all students in those two grades. So the typical state is represented by roughly 2,000 white 4th graders and 2,000 white 8th graders, and there are probably close to 1,000 whites in each grade at minimum for just about every state. (D.C., though, is excluded because there are so few whites in its public school system.) The two superstates, California and Texas, have extra-large samples of at least 10,000 students of all races in each 4th grade sample, so the number of whites there is adequate, yet they differ by about a standard deviation.

Part of the results are no doubt methodological noise. Some states might have shifted the schools where the test is administered toward more upscale ones from 2003 to 2007 to make themselves look better. Or, for example, the NAEPs are administered during a window from January to March, so if a state gave its 4th graders the test in March in 2003 and its 8th graders the test in January in 2007, it would be cheating itself of two months of maturation vs. the national norm.

On the other hand, there would be one obvious way to cheat: give a bad education from K-3 to depress 4th grade scores, then start to do your best to teach kids a lot once they take the NAEP in 4th grade so you can score high on the 8th grade test.

Still, it's unlikely that anybody has tried to game this particular analysis simply because I don't think anybody has ever thought of this analysis before.

Just looking at the table, I don't see any obvious demographic pattern explaining why, for example, Vermont would be in 13th place a standard deviation ahead of New Hampshire in 43rd place. Or why are Maryland's whites (3rd place) two standard deviations ahead of Connecticut's whites (50th place)? Both have affluent, moderately liberal, well-educated white populations. Perhaps we really are approaching the Holy Grail of a measure of educational effectiveness?

Normally, when I look at a table of data, I can figure out what's driving the rankings. Here, though, I can't. That could be good news -- I really don't know much of anything about public school quality across the country apart from demographics (other than a vague impression from the media that Texas is better than California), so the fact that the results look pretty random could mean that we are looking at actual differences in public school effectiveness. The bad news is that the results could also look random because they are pretty random, due to lots of different kinds of noise.

Any suggestions you might have for torture testing the data would be appreciated.

My published articles are archived at -- Steve Sailer


Audacious Epigone said...


I was surprised by the variation, too. Although the SDs are narrower than is the case when all races are considered (about five points), in terms of absolute scores on a 500-scale test, the difference in improvement is still more than eleven points between Montana and Connecticut. That is substantial (the absolute difference in scores by subject and grade, between Mass or Connecticut and West Virginia, is around 20 points).

Also, Jason Malloy suggested this:

It's conceivable that states that see more improvement between the 4th and 8th grade level are actually the ones with worse school systems.

The heritability of academic achievement rises rapidly between pre- and post-adolescence, so a student (A) hobbled by a poor primary school system would make more gains than a genetically identical child (B) in a good school system.

                  Grade 4 (Time I)   Grade 8 (Time II)   Gain
Bad school (A)    IQ 92              IQ 100              8 points
Good school (B)   IQ 98              IQ 100              2 points

I suppose this could be tested by controlling for initial ability.


If that is the case, primary schooling, and by extension the effectiveness of teaching, really is of little value.

What would be most helpful would be some fairly clear-cut, quantitative proxy of teaching effectiveness or a way to compare the rigor of certification requirements by state. Something along those lines would probably provide a pretty good idea of which way the trend runs.

I've read that Kansas has relatively tough requirements for its educators, and I've come across complaints that Massachusetts' certification requirements are so difficult that more than half of Hispanic and black applicants cannot make the grade. Both do well.

lmg said...

I wonder if the students who are losing more than a standard deviation's worth of intelligence become physically darker over time. ;-) Anyway, this needs to be investigated, because if the stats are correct, some schools are doing something very right while others are doing something very wrong, and we can learn from that.

Ian Lewis said...

I am drawing a blank on what could help with the seemingly random scores. However, one thing that could help support them is to cross-reference the Parental Satisfaction numbers for each state.

If it turns out that Maryland's parents are significantly happier with their schools than Connecticut's parents, that will be one more feather in that cap.

Granted, if the numbers don't match up, it may not mean all that much.

agnostic said...

How about pathogen load? Just impressionistically, it looks like the states with larger metro areas are roughly ordered (i.e., just a "moderate" r) by how large and filthy their nearby megalopolis is. DC is pretty small, while New York is a huge mess, so kids in New York, New Jersey, and Connecticut may be getting more environmental insults from germs in childhood.

Looks like the same for the more rural states -- those that are farther north do better than the humid subtropical nightmare of the deep south.

The idea is that some groups are more "teachable" than others since there is a trade-off between investing the body's resources in fighting off insults vs. luxury items like paying attention to the teacher, thinking hard, doing homework, etc.

You could sort states into 2 or 3 broad classes of urbanization, and then see if educational ranking correlated with proxies like frequency of schizophrenia or known infectious diseases.

So it still may not be something that the schools are doing -- maybe it's something else in the environment. That's the lesson from behavior genetics (environment isn't the same as parenting style).

David Davenport said...

Some of the states cheat and lie re all their edu-stats.

James B. Shearer said...

I would look at a scatter plot (with one axis fourth grade performance and the other eighth grade performance). I am not convinced this is a good way of evaluating schools. Some regression to the mean could be going on: states that perform well with K-3 education are likely at a disadvantage. I believe school starting ages are not uniform (some preschool, some no K), which might affect the fourth grade scores more than the eighth grade scores.

fifi said...

This is a little off topic, but it occurred to me that there might not be any room for improvement from 4th to 8th grade among some of the top performing groups. Wasn't that one of the theories presented here, that interventions are more likely to produce significant improvements in people from lower in the IQ spectrum than in those with higher IQs? I'd take this to mean you'd have to have both absolute and relative test scores in order to make a fair analysis.

Justin Halter said...

The most startling thing I noticed was how well the West does compared to the East. Aside from the stupid liberal states of California and Hawaii, only one Western state was on the minus side, and that just barely (WY at -0.07).

Have you looked at the data relative to school choice options, such as prevalence of charter schools? As I understand it, Western states lead the nation in school choice options.

As you noticed, those Western states tend to have less demographic diversity, i.e., they are more white. Maybe this is just another data point of proof that whites do better in environments away from minorities?

Half Sigma said...

It's possible that the low scoring states are artificially boosting their 4th grade scores by "teaching to the test."

The 8th grade test is harder to coach for because it tests a wider spectrum of cognitive ability.

Φ said...

Steve: A solid effort given what we have, and yet: there are all sorts of opportunities for within-race aptitude variation to bias the results. It would not surprise me if, for instance, West Virginia's whites are not as bright as those in Maryland. And I bet there is plenty of aptitude variation within a state as well, between whites in, say, Baltimore and those in Columbia.

Comparing progress between 4th graders and 8th graders corrects for this, but do we not need to assume that the effect of a lower IQ on educational performance isn't cumulative? What if "slow" learners fall even further behind quick ones between 4th and 8th grade?

What we really need is to compare a school's average IQ with its average z-score on the NAEP. These ratios could then be compared between individual schools and districts, where I suspect most education policy is determined.

I assume, however, that we don't have IQ data for a sufficient number of students, IQ testing not being very widespread. I suppose we could use parental wealth/income as a (very) rough proxy, but this introduces regional biases that are not easily corrected.

Steve Sailer said...

Right, if we had IQ scores at age 7 or whenever, we could further adjust for innate differences among white students. Still, take a look at four East Coast affluent, liberal states that all likely have fairly smart white kids. In 4th to 8th grade growth, we see:

Maryland - Excellent
Massachusetts - Good
New Jersey - Average
Connecticut - Awful

That's either very indicative of something or other apart from the innate quality of the kids or very random.

Anonymous said...

You ask, "Is this the best ranking yet available of how the states differ in how good a job their public schools are doing?" NO!!!!!

Using NAEP to rank order states is NOT a valid use of NAEP scores. See the reason/ proof at

ETS does the analysis and reporting for NAEP. ETS recommends avoiding 2003-4th to 2007-8th comparisons. The NAEP scale doesn't support such comparisons.

Steve Sailer said...

Obviously, there is a problem with noise in the data, which means that when looking at absolute results, you can only be sure at the 95% confidence level that, say, Massachusetts ranks somewhere from first to sixth on the 4th grade reading test. For more typical states, the 95% confidence interval is pretty broad for any single administration of a single subject.

One easy way to deal with this problem is to look at more data points: 4th grade, 8th grade, 12th grade; reading, math, science, history; 2007, 2003, 1999; etc. They're multiplicative, so you can quickly come up with a lot of data points (e.g., three grades times four subjects times three years is 36 in this example).

If you lump a bunch of these data points together, however, you can conclude that, yeah, you can be quite confident that Massachusetts students score higher than West Virginia students overall. I've looked at many of these data points casually, and Massachusetts _always_ scores higher than West Virginia.

Okay, but that's for absolute data, which is driven at least as much by the innate quality of the students (Massachusetts's many colleges have been importing smart people into the state for generations, while West Virginia's lack of upscale jobs has been exporting smart people) as by whatever the schools are doing.

For relative rankings, noise is likely to become a much bigger problem. So, here are some strategies to judge how much effect noise has:

First, do the analysis for other NAEP subject tests than reading and math, such as history and science.

Second, do the reading and math analysis for, say, 1993 vs. 2003.

Do you see consistent patterns or is it all just random?

William said...

Regression to the mean (as noted above) should be checked. Assume a certain noise component exists in any state aggregate (even at the state-wide level, if the exam is given on the same day there can be bad weather, bad hair days, etc.). Those bad-hair-day states would likely show the greatest improvement four years later. Regress the reported four-year gain against the actual level of the fourth grade results and see if there is a negative relationship.

Another tactic to proxy a noise effect is to see if the fourth-grade result is much different from previous fourth-grade results for that particular state. If, say, Conn. has a higher than average fourth grade result compared to earlier years, it is possibly reflecting some one-time or extraneous event. The eighth-grade result would not be affected by the same event, and the gain over the four years would be unusually low.
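William's first check can be sketched in a few lines: fit a least-squares slope of four-year gain on fourth-grade level and see whether it comes out negative. The state scores below are invented for illustration, deliberately constructed so higher-scoring 4th grades show smaller gains.

```python
# Minimal sketch of the regression-to-the-mean check: regress each state's
# 4th-to-8th-grade gain on its 4th-grade level and inspect the slope.
# Scores are invented for illustration.
from statistics import mean

fourth = [222, 225, 228, 231, 234, 237]  # hypothetical 4th-grade state means
gain   = [50,  48,  47,  45,  44,  42]   # hypothetical four-year gains

mx, my = mean(fourth), mean(gain)
# Ordinary least-squares slope: cov(x, y) / var(x)
slope = sum((x - mx) * (y - my) for x, y in zip(fourth, gain)) / \
        sum((x - mx) ** 2 for x in fourth)
print(slope)
```

A clearly negative slope on the real data would suggest regression to the mean is eating into the "value added" signal; a slope near zero would be reassuring.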

Jeff Williams said...


Congratulations on publishing AE's table.

Explaining the differences among states is now the task at hand.

I think readers are off the track if they look for differences in the white student populations. I believe the table actually shows what it hoped to show: the relative effectiveness of each state's corps of teachers. This means (per Occam) that we should look for the explanation in differences among teachers and their teaching practices.

What can explain the differences, then, in the effectiveness of teachers? One determinant could be the toxic effect of union power. Under a certain kind of powerful-union regime, teachers no longer have any incentive to care about the quality of work they do. These union-protected teachers can't be fired or disciplined.

To examine this relationship, one must look at indicators of teacher union power. One indicator is average teacher pay, and a table of that can be found at this URL:

Go to page 32 in the Adobe reader.

Connecticut is known for having the highest-paid teachers. Michigan is known for having a powerful teachers union. Adjusting for relative income levels and cost of living, Michigan's teachers' pay is higher than Connecticut's. To me it is no coincidence that these states are at the bottom of the list.

In my guess, however, a better correlation will be found if you look at relative numbers of teachers fired. I believe that states where very few teachers are ever fired will rank near the bottom of your chart.

That data on firings, which I don't know where to find, would be an indicator, in my mind, of a "toxic" union presence that would undermine teacher performance.

Keep up the good work.

Evil Neocon said...

Steve -- Homeschooling? One thing jumps out is that due to large distances, a lot of kids in the Mountain West are homeschooled or distance learning or both.

What is the degree of homeschooling in say, VT vs. CT?

Anonymous said...

"ETS does the analysis and reporting for NAEP. ETS recommends avoiding 2003-4th to 2007-8th comparisons. The NAEP scale doesn't support such comparions."

Of course, ETS is a private company. If they irritate too many customers, their employees and managers might actually have to go find real jobs. Their customers are actually schools and universities, not the students who pay for the testing -- if the schools stop accepting the SAT, the NAEP, and such, then ETS would be in trouble. So ETS panders to the educrats, in this case by discouraging -- even misleading other users of the data about -- the obvious use of their data for inter-state and inter-school comparisons.

Audacious Epigone said...


Is there a standardized national sample of parental satisfaction? I'm just getting 'Parental satisfaction in Dallas district schools' in searches.


Interesting observation. A graphical representation would be a nice augmentation; I will draw one up.


Yes, having actual IQ averages would be enormously beneficial for countless kinds of sociological work (in the broadest sense of the phrase).


The best stat to use? Population density would only be a rough proxy, as some states have little urbanization but are still not that rural, like Nebraska, where there are lots of little towns all over the place.


Of course the average scores are estimates, using representative samples of a state's population. What's your point? Exit polls rely on estimates. So does marketing research. And on and on.

I looked at fourth and eighth grades over a period that spanned eight years. The gaps correlated quite strongly for that entire time frame. Secondly, I converted score variances into standard deviations.

Steve Sailer said...

I wonder if it is, or will become, possible to link the National Longitudinal Survey of Youth database to the NAEP? That could be hugely useful, because we have IQ scores for 6209 children born to roughly the same number of women who took the AFQT IQ test back in 1980.

In 1979, the Bureau of Labor Statistics established a nationally representative sample of about 13,000 young people born from 1957 to 1964. In 1980, the military paid to have the entire sample take its enlistment IQ test, the Armed Forces Qualification Test. In 1990, the NLSY methodically checked up on how they were doing in life. The military provided the data to Charles Murray and Richard J. Herrnstein and it wound up as the centerpiece in the 1994 bestseller The Bell Curve.

The NLSY is still going on. It has now even measured the IQs of 6209 children of women in the original panel -- 2557 of whom were born to black female panelists.

If they keep it up, they'll probably get up to, what, maybe 8000 children and they must have the data somewhere broken down by states.

That's probably a big enough sample size to look at California, Texas, Florida, and New York, and maybe at outliers like Connecticut and Massachusetts.

Probably a better solution would be to pay to have the 6209 children of the NLSY women given the 4th grade NAEP tests as they reach the proper age, and then given the 8th grade NAEP test. We could then do a multiple regression analysis based on information about the schools they've attended and about their own early childhood IQs and their mother's performance on the AFQT!

Steve Sailer said...

By the way, here's my article with links to Charles Murray's paper on the IQs of the 6209 children of NLSY moms.

Peter said...

Having lived in Connecticut until a decade ago I'm rather surprised by the state's last-place finish. All I can think of is that Connecticut has many parochial and private schools, and they may skim off some of the better students. The main weakness with this theory is that children generally enter these schools in first or ninth grades, and therefore the "skimming off" factor wouldn't really account for the fourth to eighth grade drop.

Audacious Epigone said...

Probably a better solution would be to pay to have the 6209 children of the NLSY women given the 4th grade NAEP tests as they reach the proper age, and then given the 8th grade NAEP test. We could then do a multiple regression analysis based on information about the schools they've attended and about their own early childhood IQs and their mother's performance on the AFQT!

Charles Murray ought to write his next book on exactly that. Without NAEP data for these kids, I'm not sure how much getting at their IQs and breaking it down by state would help, beyond making for a better variable to correlate against improvement than IQ estimates based on other NAEP data.

If data by state is available, it should definitely be looked at. Setting school performance aside for a second, the state IQ estimates using NAEP data could easily be replaced with this data, which would, if representative, probably be more accurate (although sample size would be a problem in states with fewer than ~5 million people).

It seems so obvious I wonder why it hasn't been done.

Anonymous said...

There exist standards for educational testing. They require the test publisher to describe the instrument and how the results may be interpreted. ETS, which represents the NAEP's publisher, recommended that NAEP score comparisons of 2003-4th to 2007-8th be avoided.

The standards also permit uses of the NAEP data other than those recommended by the test publisher. However, the validity of other uses of NAEP data must be demonstrated. I would like to see the validity evidence that "Audacious Epigone" has put together to justify this particular use of NAEP data. I doubt that it exists (but I may be wrong).

Half Sigma said...

I've thought about this some more and I think this data is useless for figuring out which state has the best school system.

If you could give every kid a genetic-sequencing-based IQ test, and see to what extent the kids performed above or below their genetic potential, then you'd know.

beowulf said...

1. It could be as simple as states having varying tolerance (if not encouragement) of cheating.

2. Or perhaps states have different policies on social promotion -- I don't know at what level of school administration such policy is set, but perhaps a struggling 7th grader in, say, Kentucky is more likely to be held back to repeat a grade, while across the river in West Virginia the same student would be socially promoted to 8th grade. The KY 7th grader (to use my hypothetical; who knows what each state's policy is) will have mastered the 7th grade curriculum and will be a year older when he takes the 8th grade assessment, and the WV 8th grader will bomb.

3. Perhaps there are different levels of chronic sleep deprivation. A school that starts an hour later means students get an hour more sleep. Chronic sleep deprivation affects cognitive performance in children and adults (in basic training, army recruits given more lights out time do better on both physical and academic tests).

4. Related to that, is there a difference in the percentage of students on ADHD drugs? Leaving aside whether ADHD is over or under-diagnosed, stimulants have been used to treat sleep deprivation since World War II.

5. Does family stability vary? Even if the family itself is stable, there are many transient students. If a student goes to 3 schools in three years (or worse, 3 schools in one year), the lack of a national curriculum means each school teaches required subjects in a different order. He might read Red Badge of Courage three times and never learn about photosynthesis.

Steve Sailer said...

Beowulf's idea about sleep deprivation is the kind of thing that could be analyzed in a multiple regression study. If states with 8:30 starting times do better than states with 7:30 starting times, well, then we've learned something actionable. (Although that would probably show up most in the 12th grade NAEP scores.)
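A toy version of that multiple regression might look like this. Every number below is invented, and the two predictors (school start time and percent of students on ADHD medication, echoing Beowulf's points 3 and 4) are stand-ins for whatever state-level variables one could actually collect.

```python
# Toy multiple regression: relative NAEP improvement on hypothetical
# state-level predictors. All data invented for illustration.
import numpy as np

start_hour = np.array([7.5, 8.0, 8.5, 7.5, 8.5, 8.0])   # school start times
adhd_pct   = np.array([4.0, 5.0, 3.0, 6.0, 2.0, 4.5])   # % on ADHD drugs
improve    = np.array([-0.4, 0.1, 0.6, -0.8, 0.9, 0.0]) # relative improvement (SD)

# Design matrix with an intercept column, solved by least squares
X = np.column_stack([np.ones_like(start_hour), start_hour, adhd_pct])
coef, *_ = np.linalg.lstsq(X, improve, rcond=None)
print(dict(zip(["intercept", "start_hour", "adhd_pct"], coef.round(3))))
```

With real data, the interesting question would be which coefficients survive with sensible signs and sizes once several candidate predictors compete in the same model.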

Jeff Williams said...

I had another thought today about the cause of these divergent results on standardized tests:

In some states, teachers may take the NAEP test seriously; in those states, teachers would earnestly urge students to take their time and try to do well.

In other states (such as those with powerful teachers unions), the teachers would not care. They would say something like, "George Bush is making you take this test, so you have to do it. But it doesn't apply to your grade, so don't worry about it. If we finish early, we'll watch a video."

I call this the Blowoff Theorem, where there are some states where teachers basically blow off the NAEP.

To test this theorem, research would be needed as to which states apply any sort of reward or punishment based on NAEP scores.

SavRed said...


Every state is in the grip of NCLB and must provide a state test that measures Adequate Yearly Progress (AYP). These state tests are based on a state's curriculum--and each state curriculum is different.

And I mean widely different from state to state.

Variance could be from the paroxysms of curriculum upheaval that have been going on since the onset of NCLB legislation--some states have a much better state curriculum (scope & sequence, etc.) than others.

Audacious Epigone said...

Justin's perspicacity is pretty evident in this graphic (brown indicates greater improvement, blue indicates relative deterioration) of the states by relative improvement (in math and reading) from the 2003 class of fourth graders to 2007 eighth graders, for whites only. Excepting California, the West does better than the South and the Northeast, and the differences are stark.