July 27, 2009

Vulcan Society v. Fire Department of New York

In my VDARE.com column last night on the Bush Administration's belated triumph over Disparate Impact in the Fire Department of New York, the ironies were so rich that I didn't have room to analyze the legal reasoning of Judge Nicholas G. Garaufis's opinion in Vulcan Society v. FDNY. Fortunately, a reader has done it for me, so we'll do a tag-team in-depth analysis of this important and stupid decision. (As an intro, please read my VDARE.com column first. By the way, you can take the tests for yourself here.)
Yesterday, I got a copy of a recent decision in the Eastern District of New York about another fireman discrimination lawsuit, this one about the city of New York, which is rather larger and more significant than the one in New Haven. I read it and learned that (i) Ricci means nothing and (ii) the 4/5 test is dead, replaced by a simple test for statistical significance. Under the new test, any statistically significant difference between white and minority scores is prima facie evidence of discrimination.

The judge's decision is both insanely important and insane. He is saying that the Equal Employment Opportunity Commission's Four-Fifths Rule isn't tough enough at sniffing out Disparate Impact. Instead, any "statistically significant" difference in passing rates between racial groups should shift the burden of proof in the case to the employer and make the employer meet the strict "business necessity" burden.

And with a big enough sample size, such as the 10,000 or so who take the FDNY entry level employment test, practically any racial difference, no matter how pragmatically insignificant, can be deemed to have attained statistical significance for purposes of bringing the hammer down legally on the employer under Title VII of the 1964 Civil Rights Act.
Judging from media coverage, you might think that racial quotas are an open question in this country. You might even get that impression from reading iSteve some days. But they aren’t. Racial quotas are part of how we live now. This case demonstrates how and why.

You can learn a lot about a case just by looking at the caption. The plaintiff is the United States of America, joined by a black firefighters group and individual black and Hispanic firefighters. The case number, beginning with 1:07 CV, shows that it was a civil case commenced in 2007. From that and the name of the plaintiff we can deduce that this case was brought by the Bush Justice Department in 2007, while Alberto Gonzales was attorney general and, you will recall, there was great worry that Bush was politicizing the Justice Department. In fact, cases like this get brought in every administration, no matter who is the president and who is the attorney general.

Defendants are the City of New York, the fire department, the Department of Citywide Administrative Services (which developed the tests at issue) and the mayor and the fire commissioner. No unions and no firefighters. No New York equivalent of Frank Ricci.

Judge Garaufis specifically allowed the Vulcan Society to take over as main plaintiff from the three individuals, and specifically banned the main union, the Uniformed Firefighters Association, from joining the defense. The union was worried that the Bloomberg Administration would not put up a stiff enough fight. I can't find online documents showing Bloomberg's defense, so I have a hard time evaluating how well founded the union's worry was.
Checking the bio of the judge, Nicholas Garaufis, I see that he was appointed by Clinton on Sen. Moynihan’s recommendation and unanimously confirmed by the Senate in 2000. He has a typical background for a New York federal judge: associate at a white-shoe firm, Chadbourne & Parke; nine years as counsel to the Queens borough president, Claire Shulman; five years as general counsel to the Federal Aviation Administration during the Clinton administration. If I had to guess I would say he was politically in the left half of the federal judiciary, but nevertheless he is a completely mainstream guy. In his opinion he cites a 1972 decision by Judge Edward Weinfeld dinging an earlier version of the firefighters’ exam. Weinfeld was one of the most respected federal district court judges in the country.

The news angle: this decision made the front page of the New York Law Journal, which is basically the house organ of the New York bench and bar. It was the subject of a short article in the Times. It didn’t make the Wall Street Journal at all. It will get maybe one one-millionth of the coverage that Ricci got or that Professor Gates’ arrest is getting.

The Times article is innumerate but informative. There are the expected quotes from the intervenors’ lawyers, including a pious statement from Dana Lossia that, “Really what this decision says is, the exams you were using don’t pick the best-qualified people. What they really don’t do is pick the people who would best protect the city.” Not a word from the Department of Justice.

But it’s the stuff from the city that is really important. The city says: the tests are no longer being used, and since the city began administering a new test in Jan. 2007, minorities are now 38% of the candidates on the passing list and 33% of the top 4,000, who are most likely to be offered a job. They are a third of the most recent class of probationary firefighters. So, in practical terms: racial quotas for New York City firefighters have been in place since January 2007; and all this lawsuit is about is back pay and promotions for black and Hispanic applicants before 2007. Back pay and promotions will be addressed in the remedy phase, which is what Judge Garaufis, the lawyers, and the experts will turn to now.

The critical thing legally is that this case was decided on summary judgment: expert opinions and statistics. The judge did not hear any testimony. Under the law, you are only supposed to grant summary judgment when the losing side has failed to show any dispute about any fact that might conceivably affect the outcome. Or, in layman’s terms, there was no need for a trial, because New York City’s position was so weak that it didn’t deserve one.

Here is what Judge Garaufis said about Ricci in his opinion.
“Before proceeding to the legal analysis, I offer a brief word about the Supreme Court’s decision in Ricci v. DeStefano, 129 S. Ct. 2658 (June 29, 2009). I reference Ricci not because the Supreme Court’s ruling controls the outcome in this case; to the contrary, I mention Ricci precisely because it does not. In Ricci, the City of New Haven had set aside the results of a promotional examination, and the Supreme Court confronted the narrow issue of whether New Haven could defend a violation of Title VII’s disparate treatment provision by asserting that its challenged employment action was an attempt to comply with Title VII’s disparate impact provision. The Court held that such a defense is only available when “the employer can demonstrate a strong basis in evidence that, had it not taken the action, it would have been liable under the disparate-impact statute.” Id. at 2664. In contrast, this case presents the entirely separate question of whether Plaintiffs have shown that the City’s use of Exams 7029 and 2043 has actually had a disparate impact upon black and Hispanic applicants for positions as entry-level firefighters. Ricci did not confront this issue.”

The obvious corollary of this analysis is that if the City of New Haven had not set aside the results of the exam and simply allowed itself to be sued, it would have lost and racial quotas would have triumphed. I don’t think other municipalities will miss this.

To the numbers.

Disparate Impact

Both tests consisted of 85 multiple choice questions. You had to pass the test in order to get to take the physical test, the PPT. The written test and the PPT were then averaged and the applicants were rank-ordered. Judge Garaufis isn’t so simple-minded as to attach copies of the tests to his opinion, but it is my impression that they weren’t very different in content. [1]

You can see both the 1999 and 2002 tests in their entirety here.

But they were very, very different in passing grade. Test 7029, in effect from 1999 to 2002, required you to get 84.705% right (72/85) to pass. Test 2043, in effect from 2002 to 2007, required you to get only 70% (60/85) to pass. So, under test 2043, passage rates rocketed up and the racial gaps diminished—but not enough for the Justice Department or Judge Garaufis.

Test 7029 -- 1999

Whites passed at an 89.9% rate (11,613/12,915). Blacks passed at a 60.3% rate (1,054/1,749). So, blacks passed at 67% of the white rate. Test 7029 failed the 4/5 rule. An expert said there were 33.9 “units of standard deviation” between the white and the black rate, making it a one-in-a-ridiculously-large-number possibility that the disparity in white and black results was the result of chance. The expert report is not online, and I don’t know what a “unit of standard deviation” is. Obviously the black result is not 34 standard deviations below the white result. But perhaps here “units of standard deviation” simply expresses how far the white-black result is from chance, if you assume no difference in ability between the two groups.
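
One plausible reading of those "units of standard deviation" (an assumption, since the expert report isn't quoted here) is a pooled two-proportion z-statistic: the gap in pass rates divided by the standard error you would expect if both groups shared a single underlying pass rate. Here is a minimal sketch using only the counts quoted above. The function name is made up for illustration.

```python
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Pooled two-proportion z-statistic for the gap between two pass rates."""
    rate_a = pass_a / n_a
    rate_b = pass_b / n_b
    # Pass rate if the two groups are lumped together (the "no difference" assumption).
    pooled = (pass_a + pass_b) / (n_a + n_b)
    # Standard error of the difference in rates under that assumption.
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (rate_a - rate_b) / std_err

# Test 7029 counts as quoted: whites 11,613 of 12,915; blacks 1,054 of 1,749.
z_7029 = two_proportion_z(11613, 12915, 1054, 1749)
print(f"Test 7029 white-black z-statistic: {z_7029:.1f}")
```

On these counts the statistic comes out at about 33.9, which lines up with the expert's figure, though that alone doesn't prove this is the method the expert used.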

Hispanics passed 7029 at a 76.7% rate, meaning their pass rate was 85.3% of the white pass rate. So test 7029 did not flunk the 4/5 rule for Hispanics. But, an expert found that there were 17.4 “units of standard deviation” between the white and Hispanic results, again making it staggeringly unlikely that the difference in scores between the two groups was the result of chance.

Test 2043 -- 2002

Whites passed this test at an impressive 97.2% rate (13,495/13,877). Blacks passed it at 85.4% (1,190/1,393). So, this is a spectacularly easy test. I would think you’d have to be damn near illiterate not to pass it. Anyway, the black pass rate was 87.8% of the white pass rate. So, the 4/5 rule is satisfied. But, an expert finds that there are 21.8 “units of standard deviation” between the white and black scores, again making it extraordinarily unlikely that the difference in scores between the groups is the result of chance.

The Hispanic pass rate is 92.8%, or 95.5% of the white rate. The 4/5 rule is easily satisfied. But there are 10.5 “units of standard deviation” between the white and Hispanic results, making it extremely unlikely that the difference between the scores was the result of chance.
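
For anyone who wants to check the 4/5 arithmetic, here is a small sketch that recomputes the four ratios from the pass rates quoted above. The function names are invented for illustration, and the 92.8% figure is read as the Hispanic pass rate on Test 2043 (that reading is what produces the 95.5% ratio).

```python
def impact_ratio(minority_rate, white_rate):
    """Minority pass rate as a fraction of the white pass rate."""
    return minority_rate / white_rate

def passes_four_fifths(minority_rate, white_rate, threshold=0.8):
    """True when the minority rate is at least 4/5 of the white rate."""
    return impact_ratio(minority_rate, white_rate) >= threshold

# Pass rates in percent, as quoted in the text: (minority, white).
comparisons = {
    "Test 7029, black": (60.3, 89.9),
    "Test 7029, Hispanic": (76.7, 89.9),
    "Test 2043, black": (85.4, 97.2),
    "Test 2043, Hispanic": (92.8, 97.2),
}

for label, (minority, white) in comparisons.items():
    ratio = impact_ratio(minority, white)
    verdict = "satisfies" if passes_four_fifths(minority, white) else "fails"
    print(f"{label}: {ratio:.1%} of the white rate, {verdict} the 4/5 rule")
```

Only the first comparison falls under the 80% line. The other three clear it, which is why the city argued that summary judgment should be denied for those three.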

The opinion then goes on to the rank ordering issue. There is a gap of more than 600 slots between the average white and the average black applicant on the hiring list under test 7029, and a gap of more than 900 slots between the average white and the average black under test 2043, because of course everyone and his cretinous brother are passing 2043 and getting on the list.

The city argues that the judge should deny summary judgment on the prima facie question of disparate impact where the 4/5 rule is satisfied (that is, test 7029 for Hispanics, and test 2043 for blacks and Hispanics). But the judge says no. Citing various cases from the Second Circuit (i.e., the federal appellate circuit covering New York, Connecticut, and Vermont), he says that disparate impact is established any time the racial gap is more than 3 standard deviations -- by which he means, not that the black/Hispanic score is more than 3 standard deviations lower than the white score, but that the gap between the races is great enough that the result is almost certainly not caused by chance.

In other words, statistical significance at the .01 level. Assuming the two races were absolutely equal in test-taking ability, what is the chance that blacks would do so much worse on average on this test given to many thousands of people just by random bad luck? The answer is of course one in a bazillion. Therefore, the FDNY got a lot of 'splainin' to do.
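
To put rough numbers on "standard deviations" versus "chance," the sketch below converts a z-statistic into a two-sided tail probability under a normal distribution. Both the normal approximation and the two-sided convention are assumptions here, and the opinion may well use a different convention.

```python
import math

def two_sided_p(z):
    """Chance of a gap at least this many standard errors wide, in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (2.0, 2.576, 3.0):
    p = two_sided_p(z)
    print(f"z = {z}: p = {p:.4f} (about 1 in {1 / p:.0f})")
```

On this convention the .01 level corresponds to roughly 2.58 standard deviations, so a 3-standard-deviation cutoff is somewhat stricter than .01, at about 1 in 370.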

Does the judge know what "statistical significance" even means? As Inigo Montoya said in The Princess Bride: "You keep using that word. I do not think it means what you think it means." If you have a huge sample size, you can find "statistical significance" in the technical sense even when there is no practical significance.
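
A quick illustration of the sample-size point, using made-up numbers rather than anything from the case: two groups that pass at 97% and 96%, with the same number of test takers in each group. The gap never changes, but the z-statistic grows with the square root of the sample size.

```python
import math

def z_for_equal_groups(rate_a, rate_b, n_per_group):
    """Pooled two-proportion z-statistic when both groups have n_per_group takers."""
    pooled = (rate_a + rate_b) / 2
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    return (rate_a - rate_b) / std_err

# Hypothetical pass rates of 97% and 96%; only the number of test takers varies.
for n in (100, 1_000, 10_000, 100_000):
    z = z_for_equal_groups(0.97, 0.96, n)
    print(f"{n:>7} per group: z = {z:.2f}")
```

With 100 takers per group the one-point gap is nowhere near any threshold. With 10,000 per group the same one-point gap clears 3 standard deviations.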
I won’t spend a lot of time discussing the legal tests and discussion of precedent, but will try to focus on the basics. The main conceptual flaw of these exams was that they measured cognitive abilities to the exclusion of non-cognitive abilities, even though non-cognitive qualities are clearly important to being a successful firefighter. (Of course, if non-cognitive qualities are equally distributed among the population, and cognitive abilities are concentrated among whites, then the rank-ordering of a test that considers non-cognitive as well as cognitive abilities won’t be different, racially speaking, than the rank ordering of a test that considers cognitive abilities alone. But the law, in its majestic equality, takes no notice of such trifles.)

The judge complains that the cognitive test didn't test for non-cognitive traits such as Conscientiousness and Cleanliness, but doesn't indicate how the city could have efficiently, effectively, and fairly tested these non-cognitive traits among many thousands of job applicants ... and with no Disparate Impact, either!

Perhaps city officials could have sniffed each job applicant and then graded him on Cleanliness?

Written tests of the boy scout virtues aren't bad, but they have a much weaker chain of evidence supporting them than do cognitive tests. So, all the judge's quibbles about this cognitive test would be an order of magnitude larger for any of the virtue tests. For example, pencil and paper virtue tests can be outsmarted. "Ooh, how can I look good on this question?" (This isn't such a problem for firemen, except in the case of smart arsonists like John Orr, but it's a big problem testing policemen, where the last thing you want is somebody out at the extreme of the Smart-Bad quadrant.)

The good thing about IQ tests is that they can't really be outsmarted (outside of blatant copying off the Asian guy sitting next to you). If you outsmart an IQ test, well, then, you are smart.

Nor does the judge cite any evidence that there won't be racial gaps in non-cognitive traits like Dependability, Cooperation, Concern for Others, Persistence, and Self Control. Any kind of valid test for those traits will show large (and embarrassing) racial gaps, so the fire department would be right back in court being charged with Disparate Impact. For example, the judge complains that the City should have tested for "mechanical ability" as if adding a test of "mechanical ability" would obviously reduce the black-white gap.
The main structural flaw in the tests was that the questions were written by firefighters, rather than by testing professionals. The judge made great hay of the fact that some of the questions were supposed to measure inductive reasoning, and several of the firefighters involved in the process had no idea what inductive reasoning was.

Judge Garaufis then engages in a lot of quibbling over how the test was made up -- in house, by NYC civil servants interviewing firemen.

Ironically, Garaufis compliments the New Haven Fire Department's test in comparison to FDNY's, even though Justice Ginsburg's losing dissent went on and on about how bad it is.

Clearly, though, a lot of work went into it, and no evidence is presented of racial bias. (Presumably, the city officials went to extra lengths to get minority firemen to contribute to the test design process and to sign off on the results.)

NYC's good-enough-for-government-work approach possibly isn't the ideal way to make up a fireman hiring test. Nor is the best way these expensive custom-made city-by-city tests like in New Haven. The best way is like in college admissions -- you have a couple of nationally competing companies put out national tests, like the SAT and ACT. But, that's not feasible under the current law because everybody has to act like the only reason for disparate impact is because the last 37,434 fireman's tests didn't do it right, but this time, we're gonna all roll up our sleeves and, doggone it, get it right!

But if you know anything about test design (which the judge clearly doesn't), you also know that there are rapidly diminishing returns to test design sophistication. When you are just coming up with a test that eliminates the bottom 30% of white guys who want to be firemen, you know, it's really not that hard to come up with something good enough.
Other issues: the reading level of the test was too high (see p. 75); there was an insufficient showing of a linkage between the qualities measured and the abilities necessary to make a good fireman; and so on.

The government’s expert, Siskin, seems to acknowledge the existence of g, see p. 67 of the opinion. But the judge notes this solely by way of saying that the test was unjustifiably focused on the measurement of cognitive abilities.

There's a hilarious part where the plaintiff's statistical genius expert witness rediscovers the general factor of intelligence, just as Charles Spearman discovered the g factor in 1904. The test design process identified nine cognitive skills useful in firefighting.

This analysis revealed a pattern showing that the “[questions] intended to measure an individual cognitive ability actually tend[ed] to correlate as or more highly with [questions] intended to measure different cognitive abilities . . . .”

Spearman invented factor analysis, which he used to discover the existence of the g Factor. No doubt wholly innocent of this bit of cognitive testing history from more than a century ago, the judge complains:
To further support his conclusion, Dr. Siskin applied a method called “factor analysis,” which is “a statistical methodology that, based on the empirical data, defines an underlying structure which can explain the correlation among the [questions].” (Id.) “For the results of factor analysis to confirm the test plan, the analysis should find that [questions] group together to comprise nine or 10 factors in a manner consistent with the test plan, such that the Deductive Reasoning [questions] group together to form one factor and the [questions] intended to measure Inductive Reasoning group together to form a second factor, and so forth.” Dr. Siskin’s factor analysis showed that the data did not “factor into nine distinct factors or ability domains,” but instead “seem to primarily measure a general cognitive ability (except, perhaps, Memorization), and to a much lesser extent, a second specific cognitive ability (which is different from any defined by the test developers).” According to Dr. Siskin, “[t]his result demonstrates that the purported intent of the test design (to measure and weight nine distinct cognitive ability domains) was not successful.”

I'm shocked, shocked to discover that this test validates the existence of the g Factor and once again fails to prove the validity of Howard Gardner's theory of multiple intelligences.
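
For readers who haven't seen this pattern, here is a toy simulation of it. This is not Dr. Siskin's analysis and not a full factor analysis. It is a crude stand-in that takes the largest eigenvalue of a correlation matrix by power iteration. Every number in it is hypothetical: nine "ability" scores per person, each built as one shared general ability plus independent noise of equal size.

```python
import math
import random

def simulate_scores(n_people=2000, n_abilities=9, seed=0):
    """Each hypothetical score = shared general ability + independent noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_people):
        g = rng.gauss(0, 1)
        rows.append([g + rng.gauss(0, 1) for _ in range(n_abilities)])
    return rows

def correlation_matrix(rows):
    """Pearson correlations between the columns of rows."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n * sds[i] * sds[j])
             for j in range(k)] for i in range(k)]

def largest_eigenvalue(matrix, iterations=200):
    """Power iteration: repeatedly multiply a vector by the matrix and renormalize."""
    k = len(matrix)
    vec = [1.0] * k
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        value = math.sqrt(sum(x * x for x in product))
        vec = [x / value for x in product]
    return value

corr = correlation_matrix(simulate_scores())
# The eigenvalues of a correlation matrix sum to the number of variables.
share = largest_eigenvalue(corr) / len(corr)
print(f"Share of variance on the first factor: {share:.0%}")
```

With scores built this way, a single factor soaks up a bit more than half of the total variance, and nothing in the data sorts into nine separate domains. That is the same shape of result the quoted passage describes.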

Then the judge says:
This evidence shows that the cognitive abilities intended to be tested on Exams 7029 and 2043 were not the most important cognitive abilities for the job of firefighter.

But, Judge G., you just complained that the test tested g, the general cognitive ability, rather than various hypothesized specific abilities. Tautologically speaking, isn't general cognitive ability the most important cognitive ability?
Interestingly, the use of a cutoff score is very problematic, even when the cutoff is based on the number of firefighters required (I should say, especially when the cutoff is based on the number of firefighters required). See p. 78. I think the obvious takeaway is that the fire department should consider many applicants more than it needs, and hire on a racially balanced basis from that larger pool.

In other words, do what Chicago is doing: hire firefighters largely at random.

We'd be better off with strict racial quotas than with that.

My published articles are archived at iSteve.com -- Steve Sailer

15 comments:

  1. Steve:

    It would be interesting to see actual sample test questions.

    I have a friend who tells me the test is simply a "dumbed down" version of the SAT with no prior knowledge of firefighting and related topics necessary.

    At the same time I keep hearing "fire buffs" are skewing the scores for "everyone else." There's a definite disconnect here.

  2. "The expert report is not online, and I don’t know what a “unit of standard deviation” is. Obviously the black result is not 34 standard deviations below the white result."

    This probably means a t-statistic of 34. Given the large sample size, virtually all t-statistics will be highly significant.

  3. "I have a friend who tells me the test is simply a "dumbed down" version of the SAT with no prior knowledge of firefighting and related topics necessary. At the same time I keep hearing "fire buffs" are skewing the scores for "everyone else." There's a definite disconnect here."

    The explanation is that you can pass the test in one of two ways:

    1. All the information needed to answer the questions is in reading passages above. So, anybody with good reading comprehension can walk in cold and pass the test.

    2. If you aren't that good at reading, you can study ahead of time and learn much of the material asked about so that your poor reading skills won't be a problem.

    However, if you are bad at reading and you aren't interested enough in the vocation of firefighting to study ahead of time, you probably won't do well on the test.

  4. "he says that disparate impact is established any time the racial gap is more than 3 standard deviations"

    A dozen or so years ago I asked poker author Mason Malmuth a question on his forum, can't remember what. He replied something like "first, assume everything in the universe is within 3 standard deviations".

    It's been a long time since I took math.

    I used to work in a casino as a pit boss and we'd do hourly action counts, add up what is being wagered, assume an x percent house edge, and estimate what we should be making. It was explained to me by the boss, a legit math guy, that anything south of two standard deviations over a significant period, you're probably getting ripped off by dealers and/or players, and if it's 3 standard deviations you *are* getting ripped off.

    Sort of the same principle I'm guessing is being applied here, that 3 standard deviations precludes luck being a factor in the disparity. Math guys feel free to elaborate.

  5. Steve:

    Thanks. So much for my reading comprehension. You have a link in the article leading to the tests in toto.

    After skimming them, the term "dumbed down" SAT is an extreme overstatement.

    "How many emergency vehicles are in the picture?"

    "What's the address of the building in the picture?"

    What a joke. How can someone graduate Jr. High much less High School and have difficulty with this test?

  6. Something is seriously fishy with the statistical analysis. It looks like the "units of standard deviation" mentioned in the opinion really are standard deviations (seems outrageous, I know)

    Check out page 26:

    (quoting Smith, 196 F.3d at 365);
    Waisome, 948 F.2d at 1376 (“[a] finding of two or three standard deviations (one in 384 chance
    the result is random) is generally highly probative of discriminatory treatment”

    Indeed, 3 standard deviations from the mean corresponds to approximately a 1 in 384 chance.

    But on page 18, there is talk of 21.8 standard deviations when the black pass rate is 87.8% of the white. He says that the odds this is by chance are "less than 1 in 4.5 million-billion", when it's actually something like 1 in 10e+105. This is without even getting into how they came up with the 21.8 standard deviations line.

  7. Some recent commenters have said that the Obama administration is working hard to get government jobs for black people.

    Here we see a judge trying to push that same hidden agenda. All of his words are just misdirection and nonsense aimed at concealing a hidden agenda.

  8. This judge may not even be as smart as Sotomayor. Look at the opinion:

    In recent years, black and Hispanic residents of New York City (the “City”) have come to
    comprise a substantial portion of the City’s
    population....


    "Comprise" means to be made up of, not to make up; the word is "constitute". Or "be".

    In 2002, the New York City
    Department of City Planning identified 25% of the City’s residents as black and 27% of its
    residents as Hispanic.


    And, without explanation, the population within the city limits is judged to be the relevant population, even though the financial sector and all sorts of other economic endeavors draw large portions -- or even majorities -- of their workers from outside NYC.

    These numbers stand in stark contrast to some of the
    nation’s other large cities, such as Los Angeles, Chicago, Philadelphia, and Houston, where
    minority firefighters have been represented in significantly higher percentages.


    Just four sentences earlier, this judge informed us that blacks and "Hispanics" were 52% of the city's population, which means that there is no ethnic majority in NYC. And, we have been informed, the composition of the population outside the city is irrelevant. Accordingly, no matter whom the NYFD hires, its work force will be 100% "minority", and it is an absolute impossibility that there is any place in the country "where minority firefighters have been represented in significantly higher percentages".

    In this case, Plaintiff United States of America (the “Federal Government”) as well as the
    Vulcan Society, Inc., Marcus Haywood, Candido Nuñez and Roger Gregg (the “Intervenors”),
    have sued to enforce....


    So the first sentence of the second paragraph of the opinion treats us to a subject/verb disagreement. "USA" has been a singular noun since circa 1865.

    Next sentence:

    Specifically, the Federal Government and the Intervenors (“Plaintiffs”)6 challenge the City’s reliance on two written examinations that are used
    to appoint entry-level firefighters to classes at the New York City Fire Academy (“Academy”).
    These examinations — Written Examination 7029 and Written Examination 2043 — were
    administered from 1999 to 2007....


    The judge doesn't know what year it is. The word is "were", not "are".

    God help us!

  9. Regarding statistical significance and innumeracy, on the New Haven captain exam 16 of 25 whites passed, 3 of 8 blacks, and 3 of 8 Hispanics (according to Wikipedia). A chi-square test of homogeneity gives a p-value of over 1/2, so significance wouldn't be attained even at level 0.50, let alone 0.01.

    That is, even with no differences between the three groups, over half the time you would expect to see larger racial pass rate disparities than in New Haven.

    However, promotions weren't based on pass rates, but (again according to Wikipedia), who got the top nine scores (no blacks). Once more, even if scores for the 41 applicants were total noise, such as if an applicant's score was his social security number, then over 7% of the time you would still see no blacks among the top nine scorers. Usually you need to be under 5% to impress a judge.

    Significance is very hard to reach with small sample sizes, and easy to reach with big samples.

    Maybe small town firefighter unions should embrace Garaufis.

  10. The expert report is not online....

    The relevant portions are online, obviously, as they had to be authenticated and included as an exhibit to the MSJ if they were to be considered. See docket no. 264.

    Get yourself a PACER account. It's free, and you pay just 8 cents a page for Public Access to the Courts' Electronic Records.

    And when you read it, you will see that the "expert" is an utter mediocrity whose opinions have never been exposed to intelligent cross-examination. Look at this:

    In practical terms, the effect of the disparate impact...upon Hispanics was to eliminate 292 Hispanics from any possibility of appointment as entry-level firefighters in the FDNY.

    Is this a joke? Why would the "disparate impact" eliminate Hispanics unfairly instead of including whites unfairly? Why isn't the result of the "disparate impact" that a large number of whites were wrongly afforded the "possibility of appointment as entry-level firefighters in the FDNY"? Or that some Hispanics were wrongly excluded from, and some whites were wrongly included in, the group of candidates who had the "possibility of appointment as entry-level firefighters in the FDNY"?

    These are the mistakes of an extremely lazy 115-IQer, and he's extremely lazy because, when he's involved, the fix is always in.

  11. I guess the issue with strict racial quotas as used effectively by eg US Universities and the Navy is that it creates obvious, visible, gaps in ability between favoured and disfavoured groups, which will undermine confidence in members of the favoured groups. Hiring lots of incompetent whites can greatly reduce the visible difference in ability, although of course it also lowers ability overall.

  12. Steve,

    What's the deal with "statistical significance?" Inferential stats are used on SAMPLE DATA to determine the probability of a fluky difference. These tests don't involve SAMPLE DATA. Every guy who wanted to be a fireman took the test. It's not a sample! And BTW, it's not a random sample, either.

  13. "and hire on a racially balanced basis from that larger pool."


    What's the bet that blacks, were they to make up the majority, would still insist on AA aka quotas?

  14. Bring back quotas! At least whites would be hired and promoted on merit.

    "Leadership" and whatever the eff else used to benefit NAMs fool people, and whites get selected on bullshit too.


Comments are moderated, at whim.