The field of social psychology was embarrassed recently by revelations that the respected, highly productive social psychologist Diederik Stapel had simply been making up the data for his popular papers.
But you can still run experiments and cheat anyway.
Via Andrew Gelman:
The Data Vigilante
Students aren’t the only ones cheating—some professors are, too. Uri Simonsohn is out to bust them.
By CHRISTOPHER SHEA
... Simonsohn initially targeted not flagrant dishonesty, but loose methodology. In a paper called “False-Positive Psychology,” published in the prestigious journal Psychological Science, he and two colleagues—Leif Nelson, a professor at the University of California at Berkeley, and Wharton’s Joseph Simmons—showed that psychologists could all but guarantee an interesting research finding if they were creative enough with their statistics and procedures.
The three social psychologists set up a test experiment, then played by current academic methodologies and widely permissible statistical rules. By going on what amounted to a fishing expedition (that is, by recording many, many variables but reporting only the results that came out to their liking); by failing to establish in advance the number of human subjects in an experiment; and by analyzing the data as they went, so they could end the experiment when the results suited them, they produced a howler of a result, a truly absurd finding. They then ran a series of computer simulations using other experimental data to show that these methods could increase the odds of a false-positive result—a statistical fluke, basically—to nearly two-thirds.
One thing that's interesting is how seldom these kinds of data-mined false positives are published regarding The Gap, despite the huge incentives for somebody to come up with something reassuring about The Gap.
It's easy to come up with Jonah Lehrer-ready false positives if you don't care what your results are. Say you are having psych majors fill in Big Five personality questionnaires in four rooms: one painted blue, one yellow, one light green, and one off-white. That gives you five personality traits times four rooms = 20 combinations. It would hardly be surprising if one combination of room color and personality trait diverged enough to be statistically significant at the 95% level. (Especially if you can stop collecting data whenever you feel like it.) People in yellow rooms are more neurotic! Or maybe they are less neurotic. Or maybe off-white rooms make people more conscientious. Or less. It doesn't really matter. Jonah Lehrer would have blogged your paper whatever result you came up with.
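To make the arithmetic concrete, here is a minimal simulation sketch (Python; the room-and-trait setup and all numbers are illustrative assumptions, not taken from any actual study) of how often pure noise hands you at least one "significant" result across 20 comparisons:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_tests, alpha = 100_000, 20, 0.05  # 5 traits x 4 room colors

# Null world: room color affects nothing, so every p-value is Uniform(0, 1).
p_values = rng.uniform(size=(n_sims, n_tests))
hit_rate = (p_values.min(axis=1) < alpha).mean()

print(f"Share of pure-noise 'studies' with at least one hit: {hit_rate:.3f}")
print(f"Analytic check: 1 - 0.95**20 = {1 - 0.95**20:.3f}")  # ~0.64
```

Roughly 64% of pure-noise studies yield a publishable room-color finding, and that's before the optional stopping, which only pushes the rate higher.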
On the other hand, if your goal is to close The Gap, it's harder to stumble into a false positive by random luck because you know what you want ahead of time. You want to show you can close The Gap.
Thus, mostly we read about uncontrolled studies of Gap Closing: The Michelle Obama International Preparatory Academy of Entrepreneurial Opportunity Charter School, where black and Hispanic students are taught by Ivy League grads working 75 hours per week, had test scores almost equal to the state average! (Don't mention that 45% of the public school students in the state are NAMs. And don't mention what white and Asian students would have done with those Ivy Leaguer teachers. There's no control group in this pseudo-experiment.)
The most popular social science research cited on The Gap -- stereotype threat -- seems to be a combination of two things: the file drawer effect (there isn't a big market for articles saying you couldn't replicate a beloved finding) and the fact that it's not really that hard to get black students not to work hard on meaningless tests.
"One thing that's interesting is how seldom these kind of data-mined false positives are published regarding The Gap, despite the huge incentives for somebody to come up with something reassuring about The Gap."
You could argue that there's an equally "huge incentive" to keep the status quo. The diversity industry would go bankrupt the second The Gap disappeared.
> On the other hand, if your goal is to close The Gap, it's harder to stumble into a false positive by random luck because you know what you want ahead of time. You want to show you can close The Gap.
here's an example of a gap-closing false positive from about a year ago:
http://www.apa.org/monitor/2011/09/achievement.aspx
a 15 minute essay assignment that closes the gap.
want to guess how many times this has been successfully replicated?
It's not just the bad science that's irritating; their selective logic is even more so.
HBD gets a bad rap because nobody's identified a specific race gene, yet they have no qualms about their assertion that gays are born not made, despite the same genetic quandary.
So what does this say about global warming science, drug trials, counterinsurgency warfare, and statistics-driven science in general? Oh! Does it say anything about, say, IQ research?
ReplyDelete"You could argue that there's an equally 'huge incentive' to keep the status quo. The diversity industry would go bankrupt the second The Gap disappeared."
The point of the exercise is not to make The Gap disappear but to make whites pay through the nose for its existence. There's no end in sight to that.
On the other hand, if your goal is to close The Gap, it's harder to stumble into a false positive by random luck because you know what you want ahead of time. You want to show you can close The Gap.
I don't know. They sell some nice clothing there.
The issue of multiple comparisons is well known: 10 different pairwise comparisons give you a 40% chance of finding at least one with a p value (chance of a type I error, i.e. observing a difference that isn't "real") of <0.05. There are several statistical corrections that should be used to control for this problem; see:
http://en.wikipedia.org/wiki/Familywise_error_rate
I wouldn't trust any research finding that doesn't utilize this kind of analysis.
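As a quick sketch of that arithmetic, plus the simplest of those corrections (Bonferroni), assuming independent tests:

```python
m, alpha = 10, 0.05  # ten pairwise comparisons at the 0.05 level

fwer_uncorrected = 1 - (1 - alpha) ** m           # ~0.40, the figure cited above
alpha_bonferroni = alpha / m                      # test each comparison at 0.005
fwer_corrected = 1 - (1 - alpha_bonferroni) ** m  # ~0.049, back under 0.05

print(f"Uncorrected familywise error rate: {fwer_uncorrected:.3f}")
print(f"Bonferroni-corrected:              {fwer_corrected:.3f}")
```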
With regard to disproving "The Gap", the more likely attempt would exploit type II errors (failing to find a difference because the study is too small). For example, grab a couple of dozen kids at a selective charter school, give them some relatively non-g-loaded tests, and poof, the "Gap" is gone.
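A rough power calculation shows how easy that is to engineer. The sketch below uses the standard normal approximation for a two-sided two-sample test; the sample sizes and effect sizes are hypothetical:

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test for a
    standardized effect size d (normal approximation)."""
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - d * sqrt(n_per_group / 2))

# Hypothetical numbers: a "couple of dozen kids" split into two groups of 12,
# with the gap diluted by range restriction and a non-g-loaded test.
for d, n in [(1.0, 12), (0.4, 12), (1.0, 100)]:
    print(f"d={d}, n={n} per group: power ~ {power_two_sample(d, n):.2f}")
# At d=0.4 with 12 per group, power is ~0.16: an ~84% chance the study
# "finds no gap" even if one exists.
```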
Steve, I can see that you are clearly intrigued by the near-Newtonian nature of the "Gap" and I hope you have something good brewing with respect to an explanation . . . .
Anonymous who talked about the 15 minute essay assignment intervention that closed the gap:
-- Actually, the author of the study reported this effect *twice* in Science Magazine (not easy to get papers into). His reported methods were rigorous.
-- Nonetheless most people I know in the field seem to agree that the result is pretty astonishing and needs independent replication before it should be believed.
-- Good news: the Dept of Ed has funded a large-scale replication of this by some neutral and first-rate people who specialize in careful randomized trials, and I believe we should hear pretty soon one way or another on this.
OT: The University of California's new "Onward California" campaign is a great parody of a university reinventing itself with faceless corporate branding:
http://www.onwardcalifornia.com/
http://brand.universityofcalifornia.edu/
Except it's not a parody. It's all real, including the childish new UC logo. An online petition objecting to the use of the logo was created yesterday, and has already received over 30,000 signatures.
http://www.dailycal.org/2012/12/08/critics-object-to-new-uc-logo/
"Jonah Lehrer would have blogged your paper whatever result you came up with."
Well, a nice little house in the Hollywood Hills does cost something.
Well, well, well: shades of Back to Blood's schlock-doctor Norman Lewis. Yet again Wolfe's work anticipates reality.
Thus, mostly we read about uncontrolled studies of Gap Closing: The Michelle Obama International Preparatory Academy of Entrepreneurial Opportunity Charter School, where black and Hispanic students are taught by Ivy League grads working 75 hours per week,
This is exactly the kind of research that is *not* vulnerable to the kind of researcher bias cited in the article, because there are not many researcher degrees of freedom. It costs a lot to set up an intensive intervention like a new academy; you have to test the academy and you can't throw out the results; it is easy to see if the controls are randomized, and if they aren't, you can easily look at differences in other background variables; and both positive and negative findings are worthy of note and equally publishable. Research on significant real-world interventions almost always uses pre-specified experimental methods, is carefully looked over by people on both sides of the politics of the issue, faces little publication bias, and is usually quite good in quality. Way better than little social psychology experiments by academics who control every detail of the experiment and have every incentive to find a 'striking result'. Stereotype threat being the classic example.
had test scores almost equal to the state average! (Don't mention that 45% of the public school students in the state are NAMs. And don't mention what white and Asian students would have done with those Ivy Leaguer teachers. There's no control group in this pseudo-experiment.)
Your actual beef with this made-up case is not very closely related to the social psychology paper you cite. In the hypothetical you outline, your problem seems to be that the Michelle Obama Academy really does improve outcomes compared to no intervention, but the researchers have not noted that non-targeted students could also improve their scores given such an intervention. That may or may not be relevant to the reasons why this hypothetical academy was created (it may have been created to improve outcomes among poor kids generally, not to demonstrate that there is no racial gap in test scores given equal investment). Also, there are numerous studies (many dozens) of real-world educational interventions that have real randomized control groups. The results from those kinds of experiments are by no means always politically correct, but they get plenty of publicity either way.
Social psychology is the most useless of all social sciences (and that's up against some pretty stiff competition), and many of its best-known results are probably rubbish. The methodology of social psychology is a mess of nongeneralizable laboratory experiments and small sample sizes coupled with unscrupulous data mining and outright fraud. Moreover, it's the most leftist field where only politically correct results can be published. Take a look at the Psych File Drawer and the Reproducibility Project -- the majority of social psychology findings fail to replicate. I regard all social psychology results as false until they are independently replicated using large samples and reliable methods.
So what does this say about global warming science, drug trials, counterinsurgency warfare, and statistics-driven science in general? Oh! Does it say anything about, say, IQ research?
From the outset, IQ research has been strenuously resisted in social science, yet none of the basic results have ever been refuted. The high predictive validity, stability, and heritability of IQ were established in the first half of the 20th century, and they and many other findings (e.g., The Gap), representing some of the largest effect sizes in social science, have been replicated any number of times.
Psychometrics, together with behavioral genetics, has benefitted from the fact that it has been criticized all the time, because this has made it methodologically the most sophisticated social science. There cannot be scientific progress if there's no intellectual disagreement. In Intelligence (the journal), there's a beautiful article where James Flynn pays tribute to Arthur Jensen's integrity and stresses the importance of intellectual opponents. Excerpt:
The question now is how to fill the void Jensen's death leaves, particularly for scholars open to scientific inquiry who challenge some of his conclusions. There is no substitute for someone of great intellectual caliber who disagrees with you. With Jensen no longer alive, we will have to invent him. But we cannot really do that, because no one is so constructed as to put the same energy and imagination into a fictitious opponent as we put into polishing our own ideas. No one can pretend to believe what they do not believe, but I hope there is a young scholar out there with the convictions and mind of Arthur Jensen. I am sometimes asked why I spoke so well of him. The answer is that it was easy.
It's not just social science studies that are bogus; medical science and other scientific studies are also bogus.
Steve, when talking about 'The Gap' please let us know just what Gap you are referring to. We now have not only the gap between NAMs and Whites and Asians (WHAMs?) in scholastic achievement but also The Gap in voter preference between Women and Men, married and not married, and Whites and just about everybody else. And of course we have the perpetual WAGE GAP, that evil patriarchal plot to subjugate women, impregnate them and steal their shoes.
Steve, I think it's naive to attribute the results on stereotype threat primarily to such things as the file drawer effect, etc.
Dollars to donuts, the real factor here is fraud of one description or another. Even Stephen Jay Gould, as secure as he was in his position, didn't blanch at outright fraud to promote his ideology. Do you really imagine that academics who have no other claim to fame and influence than their work on stereotype threat, and who therefore have both personal AND ideological incentives to commit fraud, and who know perfectly well -- as did Gould -- that they won't be called on their fraud, will hesitate to distort their results?
As Ghostbusters taught us, academics often see what they want to see.
Margaret Mead got the results she wanted because she saw what she wanted.
Kinsey got the results he wanted because he saw what he wanted.
John Money got the results he wanted because he saw what he wanted.
Red is the most successful colour of football teams because of Manchester United and Liverpool in England, Ajax in The Netherlands, and Bayern Munich in Germany, but Real Madrid, Anderlecht, Olympique Marseilles, Juventus, Al Ahly, Sliema Wanderers, Jeunesse D'Esch, and Al Hilal must be ignored.
In the NFL you would have to ignore the Pittsburgh Steelers which I have no problem doing. Will academia beckon me?
The biggest lesson is simply that science is hard. It just barely works. In general, science is not all that resistant to fraud, which is one reason why fraud is a career ender. That's true pretty much everywhere, not just in the social sciences. Verification is expensive as hell--real verification looks like what the FDA puts drug companies through. Most of the time, nobody even bothers to replicate the original published work unless they're trying to do something else based on it.
The second lesson is that science is so hard that, even when you intend to get it right, there are a whole bunch of ways you can fool yourself, ranging from pretty obvious statistical goof-ups (say, I effectively tested 150 hypotheses at once, and I found a couple with nice, low p-values!) to subtle stuff like having the results of your experiments screwed up by some hard-to-detect lab contaminant. There have been a bunch of nice articles talking about this, about how many of the biggest papers in biology turn out not to be possible to replicate. There is not any reason to suspect fraud there, just subtle stuff that went wrong. (Though it's not so clear what fraction of researchers, right at the point of getting a tenure-track job, would go public with their sudden realization that the research that had won them their shot at a good life was really all wrong. Hmmm, you know, I'm tired of looking at murine retroviruses now, maybe I'd like to do some work on something completely different.)
Again, look at what we do when we really want to be sure we're right about some effect--we do something like a randomized double-blind drug trial, or mixing samples from the controls into the samples from the experiment and sending them all, unlabeled and mixed up, to the lab. Nobody can afford to do anything like that to decide whether to let your paper into a journal or conference!
One thing that determines the number of false positives that get published is the number of people looking for a particular kind of result. That, combined with the bias toward publishing successful results (who wants to read about yet another way that you can educate kids where the Asian kids succeed and the black kids fail?), gives you the same kind of effect as when a single researcher tests 150 hypotheses at once looking for a p-value of .01, except there's not really a good way of adjusting for it, because you don't know how many people tried and failed. The hotter the topic, the more people are looking, and also the more excited a journal will be to publish anything promising at all.
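The same back-of-the-envelope arithmetic works field-wide. A sketch, taking the 150 hypotheses and the p < .01 threshold from the comment above and assuming everything else:

```python
# 150 independent true-null tests (think: one per lab chasing a hot topic),
# each at alpha = 0.01; only the "hits" get written up and submitted.
m, alpha = 150, 0.01

expected_hits = m * alpha              # 1.5 false positives on average
p_at_least_one = 1 - (1 - alpha) ** m  # ~0.78

print(f"Expected publishable flukes: {expected_hits}")
print(f"Chance the literature ends up with at least one: {p_at_least_one:.2f}")
```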
ReplyDeleteAndrew,
You do know that p values and error rates (type 1 or type 2) are fundamentally different concepts, right? It's the difference between significance testing and hypothesis testing. Neyman and Fisher certainly knew the difference.
'that evil patriarchal plot to subjugate women, impregnate them and steal their shoes'.
Vile slander. They can keep all the shoes they want. Indeed, some of them should wear nothing else. Howsomever, come the revolution, they may NOT talk.
A top social psychologist once said to me, "Lots of people have tried and failed to replicate stereotype threat, but no one publishes on that because they don't want to be thought 'racist.'" He also said, "There are lots of psychology findings that fail to replicate, but the only time you'll hear about them is late at night in a hotel bar."
ReplyDelete> Anonymous who talked about the 15 minute essay assignment intervention that closed the gap:
That's me
> -- Actually, the author of the study reported this effect *twice* in Science Magazine (not easy to get papers into.) His reported methods were rigorous.
The same sample was reported 2x, not two independent samples.
If you have a report that shows closing the Gap, you can get as many Science papers as you want.
> -- Nonetheless most people I know in the field seem to agree that the result is pretty astonishing and needs independent replication before it should be believed.
Like cold fusion astonishing.
> -- Good news: the Dept of Ed has funded a large-scale replication of this by some neutral and first-rate people who specialize in careful randomized trials, and I believe we should hear pretty soon one way or another on this.
RCTs are certainly worth pursuing, I agree.
My take: there was a bias in the selection of the test vs. control cohort, with the test cohort being stronger academically than the controls. Thus there is no doubt that they would do better in follow-ups.
For background, from the APA story:
In a study published in Science in 2006 (Vol. 313, No. 5791), the researchers found that the short exercise reduced the achievement gap between the black and white students in the class by up to 40 percent over one school term, and that it was particularly effective for low-achieving black students, halving the percentage of black students who got a D or below in the class.
Three years later, Cohen and his colleagues published a follow-up paper in Science (Vol. 324, No. 5925) in which they tracked the original group of students through the eighth grade. Amazingly, the effect lasted—the low-achieving black students who had completed the values-affirmation exercises raised their GPAs by four-tenths of a point (on a four-point scale) compared with the control group, and were less likely to need to repeat a grade. The intervention didn't have any effect on white or high-achieving black students.
Just to continue my point about fraud, it's worthwhile to think about it from the standpoint of a single group of researchers. Do you really imagine that, say, Steele and Aronson actually ran over 20 experiments (when p is less than .05) or over 100 experiments (when p is less than .01) for each one they actually submitted? There isn't enough time in a decade, or money in a grant, to fund such efforts.
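The arithmetic behind those figures is just the expectation of a geometric distribution: if every experiment tests a true null, the expected number of tries before the first p < alpha fluke is 1/alpha. A sketch:

```python
# Expected number of true-null experiments per "significant" fluke.
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: about {1 / alpha:.0f} experiments per false positive")
# ~20 experiments per fluke at p < .05, ~100 at p < .01 -- the figures above.
```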
They keep repeating these statistically significant experiments -- sometimes highly significant experiments -- with just the desired results.
Assuming that the carefully designed studies that fail to find an effect are actually showing the true situation, there is little one can conclude about Steele and Aronson's experiments other than that they are systematically and profoundly defective in their design, or are based on some form of fraud.
It really can't be the file drawer effect alone.
There is a really obvious explanation for why professors can get blacks to score worse on meaningless tests than they had on the SAT by announcing that blacks don't do well on this test: the black subjects rightly take the bizarre announcement as an indication that the experimenters want them not to try hard on a test they have no practical incentive to try hard on.
Steve, the explanation you offer up is plausible enough, but I do recollect that at least in one very well controlled study that attempted to replicate stereotype threat -- by John List among others -- there was simply no effect found. And apparently they tried really hard to replicate the claimed circumstances of stereotype threat experiments like those of Steele and Aronson.
So, again, I have to believe that there's something deeply defective about how Steele and Aronson are designing their experiments that enables them to find what this study didn't find -- or that they are engaging in some form of fraud.
Perhaps all the defects amount to is some especially effective winks and nods they've developed to communicate the desired results to their subjects.
But even that strikes me as pretty damn close to outright fraud. It's hard to imagine that over the many, many experiments they've run that the possibility of such communication of intentions has entirely escaped them.
My vague recollection is that Claude Steele didn't originally push his results as validating the modern theory of stereotype threat. His 1990s articles in The Atlantic were more about how Black Kids These Days screw off more in college than when he (and his identical twin Shelby Steele) was young. Maybe I've got this all wrong, but it seemed like white people took off and ran with Steele's original experiment as the cause of The Gap and Steele went along for the ride.
ReplyDeleteWhen the immunologists Agnes Wold and Christine Wennerås had their study Nepotism and sexism in peer-review published in prestigious Nature 1997 (Vol. 387/22) they attracted great international attention. They showed that a woman has to be 2.6 times as productive as a man to receive a post doctoral fellowship from the Swedish Medical Research Council of that time. After a long time of struggling Wold and Wennerås were now met by an almost unanimous approval from journalists, politicians and scientists - all unusually willing to credit the irrepressible authority of the “cold” and “clear” figures.
Of course, the same journalists, politicians and scientists soon distorted the study's limited claim of revealing discrimination in one research council into a statement about women's conditions in general. And, watching a TV debate more than 10 years later, you can be certain that it is Wold and Wennerås that Gudrun Schyman refers to when she says that women in politics must be twice as capable as men to be regarded as equal: "It is scientifically proven!"
It is such routine misuse of science that I will criticize first. But I also have a more interesting and fundamentally more important objection to the study itself - an objection that I have, surprisingly, not been able to find anywhere else, in spite of having run across quite a few attempts to have a go at Wold and Wennerås' study.
http://www.athleticdesign.se/otherstuff/sexism_in_peer_review_english.html
To the gentleman who thought getting papers into Science was a problem:
BTW, they are also talking about how quality is more important, now that studies show men publish more.