The whole world is suddenly talking about election pundit Nate Silver, and as a longtime heckler of Silver I find myself at a bit of a loss. These days, Silver is saying all the right things about statistical methodology and epistemological humility; he has written what looks like a very solid popular book about statistical forecasting; he has copped to being somewhat uncomfortable with his status as an all-seeing political guru, which tends to defuse efforts to make a nickname like “Mr. Overrated” stick; and he has, by challenging a blowhard to a cash bet, also damaged one of my major criticisms of his probabilistic presidential-election forecasts. That last move even earned Silver some prissy, ill-founded criticism from the public editor of the New York Times, which could hardly be better calculated to make me appreciate the man more.
The situation is that many of Nate Silver’s attackers don’t really know what the hell they are talking about. Unfortunately, this gives them something in common with many of Nate Silver’s defenders, who greet any objection to his standing or methods with cries of “Are you against SCIENCE? Are you against MAAATH?” If science and math are things you do appreciate and favour, I would ask you to resist the temptation to embody them in some particular person. ...
Silver is a terrific advocate for statistical literacy. But it is curious how often he seems to have failed upward almost inadvertently. Even this magazine’s coverage of Silver mentions the means by which he first gained public notice: his ostensibly successful background as a forecaster for the Baseball Prospectus website and publishing house.
Silver built a system for projecting future player performance called PECOTA—a glutinous mass of Excel formulas that claimed to offer the best possible guess as to how, say, Adam Dunn will hit next year. PECOTA, whose contents were proprietary and secret and which was a major selling point for BPro, quickly became an industry standard for bettors and fantasy-baseball players because of its claimed empirical basis. Unlike other projection systems, it would specifically compare Adam Dunn (and every other player) to similar players in the past who had been at the same age and had roughly the same statistical profile.
Colby's on to an interesting distinction here between modern philosophy of science's idealization of forecasting as the acid test of SCIENCE -- which we also assume to be related to the quest for reductionism, Occam's Razor, transparency, peer review, all that kind of elegant stuff -- versus the messy reality of forecasting as a profit-making business. In my corporate career, I got drafted into building sales forecasting models for both companies in the industry, so I've had first-hand experience with these issues.
For most players in most years, Silver’s PECOTA worked pretty well. But the world of baseball research, like the world of political psephology, does have its cranky internet termites. They pointed out that PECOTA seemed to blunder when presented with unique players who lack historical comparators, particularly singles-hitting Japanese weirdo Ichiro Suzuki.
Ichiro is one of baseball history's more consistent players -- across two decades and two continents -- so his future stats have been pretty easy to predict for even the most superficial fan: just project the trendline and adjust for aging. But Silver's proprietary system couldn't believe Ichiro could keep getting all those obviously fluky infield hits before regression to the mean kicked in, so PECOTA routinely underpredicted his performance, although by less and less as Silver added some kind of specialized secret counter-Japanese-weirdo gizmo or gizmos to his model to make his Ichiro forecasts less wrong.
Keep in mind that PECOTA's prejudice against Ichiro was grounded in a recent advance in general understanding. Sabermetricians had figured out in the late 20th Century that a lot of exceptional seasons really were just luck. When you look at hits as a percentage of balls in play, it turns out that some famous seasons were just a case of a guy getting lucky and hitting 'em where they ain't, which usually turns out not to be replicable across more than one season.
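That "hits as a percentage of balls in play" statistic is what sabermetricians call BABIP, and the formula is standard and public. A minimal sketch (Ichiro's 2004 line below is approximate, from memory):

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies=0):
    """Batting average on balls in play: hits minus homers (which never
    touch the field) divided by balls actually put in play."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    return (hits - home_runs) / balls_in_play

# Ichiro's 2004 season, approximately: 262 H, 8 HR, 704 AB, 63 K, 3 SF.
# A BABIP near .400, against a league norm around .300, is exactly the
# kind of number a projection system reads as unsustainable luck.
print(round(babip(262, 8, 704, 63, 3), 3))
```

The whole sabermetric argument is in that denominator: strikeouts and home runs are removed because no fielder can turn them into luck, good or bad.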
But, Ichiro kept it up for thousands of plate appearances.
Notice, however, that by making his black box forecast more accurate, Silver was making it scientifically less useful. Early on, Silver's model was in effect saying: based on everything we know about slap hitters, Ichiro is due for a comeuppance this year. Nobody can continue to accumulate seeing-eye singles and slow rollers and all the lucky crud that Ichiro got last year.
And, year after year, Silver was wrong, which suggested that Ichiro wasn't like the other slap hitters, that American baseball minds needed to reverse engineer this Japanese import and figure out exactly how he does what he does. Maybe we could train a few guys over here to do it too, or maybe we could select more young players with some of Ichiro's obscure gifts.
But, over time, as Silver added secret adjustments to his hidden model, that Ichiro Anomaly became less glaring.
Science progresses by the accumulation of wrong predictions. For example, if Isaac Newton had set up a proprietary firm called Astronomy Prospectus that owned a black box model for predicting planetary orbits based on Newton's secret Law of Gravity, his successors could have made minor adjustments in their forecasts to account for the Mercury Anomaly. So, who needs Einstein's General Theory of Relativity when Astronomy Prospectus can just nudge their forecasts?
In general, however, we need to keep in mind the fundamental distinction between the physical sciences and the human sciences: humans can, to some extent, respond to predictions and thus mess up prediction models. The planet Mercury won't respond to predictions whether or not they are public.
I suspect that in the human sciences, prediction models go through a pattern where they become more and more baroque in their complexity, until they fail spectacularly, often because somebody figures out how to exploit the complexity. We saw that with mortgages, where a giant superstructure of math for modeling the actuarial risk of defaults was undermined by simple lowbrow methods like adding a zero to the mortgage applicant's income.
But, it's not clear whether electoral prediction models run similar risks.
More importantly, PECOTA produced reasonable predictions, but they were only marginally better than those generated by extremely simple models anyone could build. The baseball analyst known as “Tom Tango” (a mystery man I once profiled for Maclean’s, if you can call it a profile) created a baseline for projection systems that he named the “Marcels” after the monkey on the TV show Friends—the idea being that you must beat the Marcels, year-in and year-out, to prove you actually know more than a monkey. PECOTA didn’t offer much of an upgrade on the Marcels—sometimes none at all.
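Unlike PECOTA, the Marcel recipe is public: Tango's description is roughly to weight the last three seasons 5/4/3 by playing time and regress toward the league average by mixing in a ballast of phantom league-average plate appearances. A simplified sketch of that idea (age adjustment omitted; treat the exact constants as an approximation of Tango's published values):

```python
def marcel_rate(seasons, league_rate, weights=(5, 4, 3), ballast=1200):
    """Project a rate stat (e.g. on-base percentage) Marcel-style.

    seasons: list of (rate, plate_appearances), most recent first.
    Recent seasons get more weight; the ballast term drags every
    projection toward the league mean, hardest for small samples.
    """
    num = sum(w * rate * pa for w, (rate, pa) in zip(weights, seasons))
    den = sum(w * pa for w, (_, pa) in zip(weights, seasons))
    return (num + league_rate * ballast) / (den + ballast)

# A steady .340-.360 player, league average .330:
print(round(marcel_rate([(.360, 600), (.350, 580), (.340, 610)], .330), 3))
```

The point of Tango's monkey is that this handful of lines, with no secret sauce at all, is the baseline a proprietary system has to beat.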
Einstein famously said that science should be as simple as possible, but no simpler. But, that leaves a lot of latitude.
PECOTA came under added scrutiny in 2009, when it offered an outrageously high forecast—one that was derided immediately, even as people waited in fear and curiosity to see if it would pan out—for Baltimore Orioles rookie catcher Matt Wieters. Wieters did have a decent first year, but he has not, as PECOTA implied he would, rolled over the American League like the Kwantung Army sweeping Manchuria. By the time of the Wieters Affair, Silver had departed Baseball Prospectus for psephological godhood, ultimately leaving his proprietary model behind in the hands of a friendly skeptic, Colin Wyers, who was hired by BPro. In a series of 2010 posts by Wyers and others called “Reintroducing PECOTA”—though it could reasonably have been entitled “Why We Have To Bulldoze This Pigsty And Rebuild It From Scratch”—one can read between the lines. Or, hell, just read the lines.
Behind the scenes, the PECOTA process has always been like Von Hayes: large, complex, and full of creaky interactions and pinch points… The numbers crunching for PECOTA ended up taking weeks upon weeks every year, making for a frustrating delay for both authors of the Baseball Prospectus annual and fantasy baseball players nationwide. Bottlenecks where an individual was working furiously on one part of the process while everyone else was stuck waiting for them were not uncommon. To make matters worse, we were dealing with multiple sets of numbers.
…Like a Bizarro-world subway system where texting while drunk is mandatory for on-duty drivers, there were many possible points of derailment, and diagnosing problems across a set of busy people in different time zones often took longer than it should have. But we plowed along with the system with few changes despite its obvious drawbacks; Nate knew the ins and outs of it, in the end it produced results, and rebuilding the thing sensibly would be a huge undertaking. We knew that we weren’t adequately prepared in the event that Nate got hit by a bus, but such is the plight of the small partnership.
…As the season progressed, we had some of our top men—not in the Raiders of the Lost Ark meaning of the term—look at the spreadsheet to see how we could wring the intellectual property out of it and chuck what was left. But in addition to the copious lack of documentation, the measurables from the latest version of the spreadsheet I’ve got include nice round numbers like 26 worksheets, 532 variables, and a 103 MB file size. The file takes two and a half minutes to open on this computer, a fairly modern laptop. The file takes 30 seconds to close on this computer. …We’ve continued to push out PECOTA updates throughout the 2010 season, but we haven’t been happy with their presentation or documentation, and it’s become clear to everyone that it’s time to fix the problem once and for all.
The stuff in italics is not Colby talking, it's Nate Silver's successors at Baseball Prospectus talking.
I can sympathize with Silver, because his Excel-based PECOTA model / ball of twine reminds me of the Excel-based sales forecast model I built for the other firm in the CPG marketing data industry, after building an almost unbreakable Lotus 1-2-3 3.0 sales forecasting model for the first firm.
The third release of the once-famous Lotus spreadsheet at the end of the 1980s was designed to make decentralized forecasting and budgeting simple before the Internet by elegantly implementing a 3D model in which identical spreadsheets could be easily stacked and summarized.
And it was simple. In Lotus, I sent each regional sales manager a spreadsheet on which he or she would list every sales proposal, its size, some other details, and its chance of closing this quarter. Multiply the dollars by the probabilities, sum, and there's your regional sales forecast. FedEx the diskette (this is c. 1990) back to me at the Chicago HQ, and I have Lotus 3.0 aggregate the numbers across all the regional sheets and give the national forecast for the quarter to the CEO.
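The arithmetic really is the whole model: an expected-value sum per region, then a roll-up across regions. A sketch of what each regional sheet and the HQ summary computed (region names and dollar figures invented for illustration):

```python
# Each region reports (proposal dollars, probability of closing this quarter).
regions = {
    "Midwest":   [(120_000, 0.8), (300_000, 0.5), (40_000, 0.2)],
    "Northeast": [(250_000, 0.6), (90_000, 0.9)],
}

def regional_forecast(proposals):
    """Expected revenue: sum of proposal dollars times close probability."""
    return sum(dollars * p for dollars, p in proposals)

# HQ roll-up: stack the regional sheets and sum, exactly as Lotus 3.0 did.
national = sum(regional_forecast(p) for p in regions.values())
for name, proposals in regions.items():
    print(f"{name}: ${regional_forecast(proposals):,.0f}")
print(f"National: ${national:,.0f}")
```

Note what the model does not contain: no adjustments for any manager's known optimism or sandbagging. That transparency, as discussed below, was a deliberate design choice.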
A while later, after I'd followed the firm's Chairman into his second (and not quite as successful) high-tech startup, the CEO got hired away by the archrival firm, and he hired me to build him the same sales forecasting system.
The only problem was that the other corporation had standardized on Microsoft Office, including Excel, so I had to build the system in Excel, not Lotus 3.0. Even though I knew how to do it now, building the same system in Excel took about three times as long, including about 100 hours on the phone with Excel tech support in Redmond. Each phone call I'd begin by explaining to the MS Excel wizard what the goal of my project was, and each one would reply that that sounded fascinating, but he'd never heard of anybody building a national forecasting system using Excel, and that, as far as he knew, nobody at Microsoft had ever contemplated that use when designing Excel. In theory, Excel supported drilling down through multiple worksheets, but, in practice, it was an ordeal to build, and, worse, practically impossible to explain to anybody else how it worked.
At my first company, when I left for the start-up, my assistants carried on running the Lotus-based sales forecasting system effortlessly. When I left the second company, I printed up a 45-page guide to how to keep the sales forecasting system running, which was about an order of magnitude longer than would have been required with Lotus.
Of course, Excel went on to become the global standard in spreadsheets and Lotus 1-2-3 vanished. The idea of non-programmers building large systems in spreadsheets largely vanished with it, as well. Who knows how much productivity has been lost due to Excel's dominion?
It wasn't particularly Silver's fault that the system he pioneered in Excel became unwieldy. That's what Excel does. It's a hard lesson that people have to learn that you can't stay with Excel past a certain point. If you have a real business, at some point you have to bring a programmer in and rebuild the thing from scratch in a regular programming language.
Let's come back to the Philosophy of Science questions. In my sales forecasting systems, I did not attempt to build in ad hoc Ichiro-style adjustments. I kept them super-reductionist. I could have made my forecasts more accurate by putting in stuff like Smith is always overoptimistic by 15% until the week before the quarter closes, while Jones is notorious for sandbagging 10% of her likely revenue. But, I went instead for total transparency for my bosses at HQ and total reporting for their regional sales managers. If the national forecast was wrong, it was because specific regional sales managers were wrong, and it's up to those individuals to correct their biases and delusions, or face the consequences. Being honest and realistic with HQ was part of their job, and part of my job was to make clear when they weren't.
The corporate officers' jobs, however, included reporting profit forecasts to stock market analysts, and they could massage the numbers I reported to them for known biases (or hunches, hopes, or whatever, with only worries about their reputations with analysts and fears of shareholder lawsuits to rein them in).
In contrast to what I did, Silver is running a proprietary non-transparent black box business. He has incentives to be accurate, but he has other incentives as well, such as providing comforting fare to biased readers of the NYT and keeping his system secret.
If the history of Silver’s PECOTA is new to you, and you’re shocked by brutal phrases like “wring the intellectual property out of it and chuck what was left”, you should now have the sense to look slightly askance at the New PECOTA, i.e., Silver’s presidential-election model. When it comes to prestige, it stands about where PECOTA was in 2006. Like PECOTA, it has a plethora of vulnerable moving parts. Like PECOTA, it is proprietary and irreproducible. That last feature makes it unwise to use Silver’s model as a straw stand-in for “science”, as if the model had been fully specified in a peer-reviewed journal.
Silver has said a lot about the model’s theoretical underpinnings, and what he has said is all ostensibly convincing. The polling numbers he uses as inputs are available for scrutiny, if (but only if) you’re on his list of pollsters. The weights he assigns to various polling firms, and the generating model for those weights, are public. But that still leaves most of the model somewhat obscure, and without a long series of tests—i.e., U.S. elections—we don’t really know that Nate is not pulling the numbers out of the mathematical equivalent of a goat’s bum.
Unfortunately, the most useful practical tests must necessarily come by means of structurally unusual presidential elections. The one scheduled for Tuesday won’t tell us much, since Silver gives both major-party candidates a reasonable chance of victory and there is no Ross Perot-type third-party gunslinger or other foreseeable anomaly to put desirable stress on his model.
I'm reminded of the 1996 election, when the consensus was Clinton over Dole by double digits, but it turned out considerably closer. The consensus of the polls was off by enough to have reversed the results of the 2000 and, perhaps, the 2004 elections. It didn't matter in 1996, so it's largely forgotten by now.
Among pollsters, only Zogby got the margin right in 1996. This did wonders for Zogby's career, but the whole incident remains shrouded in mystery. What did Zogby know that nobody else knew? Anything? Or was he just lucky? Who knows? His system was proprietary.
By the way, the failure of polls in 1996 helps explain why so many pundits didn't question the catastrophic failure of the exit polls in 2004 to pick the winner of the election. The afternoon of Election Day 2004, the word swept the country that the exit polls showed Kerry winning easily. This was widely accepted as true, in part because the reputation of telephone polls before the election had been badly dented by 1996. But, it turned out, the exit polls were biased, not the pre-election polls.
The funniest recent forecasting fiasco was Election Night in 2000 when the networks first called Florida for Gore even though Bush had a big lead in the partially counted vote, making Gore the Presumptive President. Then they switched and called Florida for Bush, declaring the country for Bush, even though Bush's lead in the actual votes counted was shrinking relentlessly. To me, watching at home, a simple trendline suggested that when they got to 99.9% of the votes counted, Florida would be virtually tied. Eventually, the networks figured that out too and switched Florida back to uncalled.
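The trendline logic here is nothing more than a least-squares line fit through snapshots of (share of vote counted, leading candidate's margin), extrapolated to 100% counted. A sketch with invented numbers (not the actual 2000 Florida returns):

```python
# Hypothetical snapshots: (fraction of vote counted, Bush lead in votes).
snapshots = [(0.80, 50_000), (0.90, 30_000), (0.95, 18_000), (0.98, 8_000)]

def project_final_margin(points):
    """Ordinary least-squares line through (counted, margin),
    evaluated at 100% of the vote counted."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return my + slope * (1.0 - mx)

# A lead shrinking this fast projects to nearly nothing at 100% counted.
print(round(project_final_margin(snapshots)))
```

With a margin collapsing like this, the honest call from the couch is "too close to call" -- which is where the networks eventually ended up.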
In summary, I like Nate Silver. He works extremely hard and has sensible approaches to presenting consensus forecasts. Eventually, the consensus won't prove right for reasons that might seem obvious in hindsight, and there will be a lot of people with egg on their faces when that happens.
His most apparent problem is that he's young and thus can't remember a lot of confidence-sapping events. I've noticed that he naively accepts The Narrative about most late 20th Century political events. He wasn't paying attention, so how would he know that the conventional wisdom about Prop. 187 or Willie Horton or whatever is a construct of wishful thinking?