April 16, 2013

Will Big Data change the world?

Everybody is talking about how life will never be the same again as we enter the Age of Big Data. Here, for example, is David Brooks' new column on Big Data. As I've mentioned before, I started working in Big Data 31 years ago. After I got my MBA in 1982, I joined a newish firm that was the first to effectively exploit for marketing research purposes the inundation of data from checkout scanners at supermarkets and drug stores. 

We paid to put the new laser scanners in every supermarket and drug store in four small cities, such as Eau Claire, Wisconsin. We recruited 2,500 households in each of our four test markets to identify themselves as they went through the checkout lane, so we could track every single consumer packaged goods purchase they made. We could finally answer quantitatively an endless number of questions that brand managers and marketing professors had dreamed up since WWII about consumer behavior. Do people who buy Tab by the six-pack also buy Lean Cuisine when it's on sale? Any question like that imaginable, we could answer.

Furthermore, we could do over 30 years ago what Jim Manzi called for in his recent book Uncontrolled: use these towns as giant real-world laboratories, where we could control the TV commercials seen by each of our panelists in their own living room. If P&G wanted to test a new Mr. Whipple spot for Charmin, we could divide our panel up into two cells with identical Charmin purchasing over the last year, then show one cell the new commercial and one cell the old commercial, then record which cell bought more Charmin over the next year.
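For readers who like to see the mechanics, here is a minimal sketch of that kind of matched-cell split. It is an illustration under stated assumptions, not the procedure we actually ran: the function names (`split_into_matched_cells`, `lift`) and the household IDs are hypothetical, and pairing households by rank is just one simple way to get two cells with nearly identical prior purchasing.

```python
import random
from statistics import mean


def split_into_matched_cells(prior_purchases, seed=0):
    """Split households into two cells with near-identical prior purchasing.

    prior_purchases: dict mapping household id -> units bought last year.
    Households are ranked by prior purchases and paired with their neighbor
    in the ranking; one member of each pair is randomly assigned to each cell.
    With an odd number of households, the last one is left out.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    ranked = sorted(prior_purchases, key=prior_purchases.get)
    test_cell, control_cell = [], []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # coin flip decides who sees the new commercial
        test_cell.append(pair[0])
        control_cell.append(pair[1])
    return test_cell, control_cell


def lift(test_cell, control_cell, later_purchases):
    """Mean units bought per household in the test cell minus the control cell."""
    test_mean = mean(later_purchases[h] for h in test_cell)
    control_mean = mean(later_purchases[h] for h in control_cell)
    return test_mean - control_mean
```

The test cell would then be shown the new commercial and the control cell the old one; a year later, `lift` on the new purchase records gives the difference the brand manager is paying to learn.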

We caused a sensation in the consumer packaged goods world. Wall Street said the world would never be the same and our stock doubled on the day it went public in March 1983. 

For a while, we made a lot of money, so it definitely changed our world.

After a while, though, brand managers at P&G got tired of paying hundreds of thousands to find out how much sales would go up if P&G management granted their fondest wish of having their advertising budget doubled. The usual answer was: not much, if at all. The only time doubling an already generous P&G-sized ad budget was likely to move the needle was if P&G had some real new news about their brand to convey (which, being P&G and investing heavily in chemists to improve their products, they actually sometimes did).

As this lesson sank in, it set off a mini-recession in TV advertising around 1986. Demand for testing slumped, too. Every year after that, we'd write into our budget that this would be the year that the world would rediscover how incredible this testing service was, but instead sales just ebbed slowly away. The company finally shut down the service late last year.

By 1987, however, we'd moved into even bigger Big Data, collecting all the supermarket sales for about 25 million people. 

That too changed the world. Yet, somehow, the world kept spinning on its axis. 

Let me offer another, better-known example: baseball. I first became fascinated by baseball statistics in 1965 when I was six. Let's see if I can remember Ken Boyer's 1964 MVP line: .293, 24 homers, 111 RBIs. ... Nope, it was really .295 / 24 / 119.

A vast amount has been written since then about how Big Data has revolutionized baseball. For example, today we know that Ken Boyer wasn't really the best player in the National League in 1964. No, a vast amount of statistical analysis has uncovered the electrifying news that the best player in 1964 was actually -- and you'll be stunned to learn this -- Willie Mays.

Oh, wait, everybody back then knew Willie Mays was the best player in the National League. I was six and I knew. According to the latest Big Data analysis, Willie had been the best player in the league for 9 of the previous 11 years, just like most six-year-olds would have more or less guessed.

So, how much has baseball changed since 1965 due to the famous revolution in Big Data? Michael Brendan Dougherty says we are in a Golden Age of baseball, while Ross Douthat has caveats, citing my post on moneyball making baseball worse.

To check how much has changed, I turned on the radio to listen to the Dodgers like I did in 1965. Some things have changed, but others haven't: Vin Scully, who was already a veteran announcer in 1965 with 16 years' experience with the Dodgers, is still doing the play-by-play in 2013.

20 comments:

Anonymous said...

We recruited 2,500 households ... so we could track every single consumer packaged goods purchase they made. We could finally answer quantitatively an endless number of questions

But ... Severe selection bias! :-)

Anonymous said...

It does seem that a lot of the applications of data mining including incentives and strategies based on an understanding of said data lead in a new direction only to run into diminishing returns or to find that this new direction leads to horrible outcomes.

In sports it leads to new and unexciting play. In police work it leads to focus on easy revenue raising, and appears to lead to a reduction in the reporting of crime. The easiest way to increase homicide clearance stats is to ensure that as many difficult cases as possible are not classed as homicides.

In management this has led to the importation of cheap labor with no thought as to the cost and whether the outcomes are even any better than they were before. Companies still have the same number of competitors, and if costs are now lower, competition has made sale prices lower commensurately.

These developments in turn lead to new laws, consumer backlash, and management backlash. And backlash from the public, or at least the remnant public still left over from before the importations.

Anonymous said...

But the real-time diversion of self-evident "discoveries" provided by incessant analysis of big-time data will entertain us as we orbit with increasing velocity around a vanishing point.

Voyeuristic, narcissistic, and limited.

Neil Templeton

DR said...

I'm sure what you were doing was fairly cutting edge for the time. But the notion that your career had anything to do with modern big data/machine learning is laughable.

Here's a clue IBM's Watson doesn't run in Lotus 123 (which you mentioned as being one of the tools you used in this job in a previous post).

I think people really fail to grasp how close computers are to wiping out pretty much all the jobs that the bottom half of the curve does. The vast majority of this advancement is driven by improvements in machine learning from the past decade.

Steve Sailer said...

"But ... Severe selection bias! :-)"

We always had a lot of nice white ladies wanting to sign up for the panel, and fewer of other demographics, just like the Reuters-Ipsos election panel in 2012. But, that's who consumer packaged goods marketers wanted to target anyway, so it was fine.

Steve Sailer said...

"But the notion that your career had anything to do with modern big data/machine learning is laughable. Here's a clue IBM's Watson doesn't run in Lotus 123"

We didn't process 10% of all supermarket purchases in the country in the later 1980s on PCs. We did it on IBM mainframes, and they used various "machine learning" artificial intelligence techniques like neural networks for estimating missing data.

Jim said...

As a Braves fan, the most devastating blow to me as a fan of the team was the death of long time team broadcaster Skip Caray and the retirement of his partner Pete Van Wieren. I could barely listen to the replacements and indeed lost interest in baseball for years. The Voice of Summer is the most important part of the game.

wren said...

Big Data will be someone with those damn Google glasses looking at you on the street and knowing EVERYTHING about you two seconds later.

That day is coming unfortunately.

Anonymous said...

But the notion that your career had anything to do with modern big data/machine learning is laughable. Here's a clue IBM's Watson doesn't run in Lotus 123

The fundamental principles at work are the same.

Leon Kautsky said...

Fascinating.

"machine learning" artificial intelligence techniques like neural networks for estimating missing data."

No you didn't. Or at least, it's extremely unlikely. Neural nets are among the most computationally expensive learning algos to use today (because you can't take adv. of global convexity) and they're 10x faster (1000x if you count computer speed-ups) than they were in 198x.

Nevertheless, I'm impressed by your intuition: I work in so-called big data and it is correct that the vast majority of our work is pretty useless. Once in a while, though, we strike gold: i.e. stuff like being able to predict your race/sexual orientation/sociopathy/intelligence/buying habits from your facebook likes.

There are AI applications like Watson/Siri or w/e that are big data powered because you need the machine to have vast amounts of knowledge before it becomes useful, let alone commercializable.

Anonymous said...

I just checked, and Willie Mays came in 6th in the MVP voting that year.

astorian said...

The revolution in baseball stats, as Steve says, hasn't really changed things very much. After all, even thirty years ago, if you'd asked an old-school, innumerate fan to name the greatest hitters of all time, he'd probably have said "Babe Ruth, Lou Gehrig and Ted Williams," and he'd have been right!

The stats revolution hasn't shown that any perceived superstars really sucked, or that Mario Mendoza was really a Hall of Famer. Rather, it's shown that a handful of guys who were perceived as borderline Hall of Famers were really just very good, while some guys who were perceived as very good were actually borderline Hall of Famers.

Anonymous said...

Apparently they even have college courses called "Introduction to Data Science" now offered by statistics departments:

http://columbiadatascience.com/about-the-class/about-the-course/

Doesn't seem like anything different from what's been traditionally taught. Just marketed differently. The data probably just shows that people are suckers for marketing.

countenance said...

How many World Series or AL Pennants did the Oakland A's win during the Sabermetrics era?

I'm thinking of a number larger than -1 and smaller than 1.

countenance said...

One more thing: While I was initially sympathetic to the notion that Big Data is how Obama won re-election, I'm souring on that theory as time goes on. Not only that, I'm souring on all but the Occam's Razor explanation of why/how Obama won re-election. I suppose floating all the other theories was worth doing, and the theories all have some combination of truth and nonsense to them, some more of one than the other.

But sometimes, you just have to accept the Occam's Razor simple and obvious explanation.

In this case, Obama won re-election because he was the incumbent President. It's hard as hell to knock off incumbent politicians.

stari__momak said...

"I think people really fail to grasp how close computers are to wiping out pretty much all the jobs that the bottom half of the curve does"

Funny, but I don't see a lot of machine innovation in the occupations that the bottom half fill. In fact, probably the opposite. A guy in my neighborhood had his front yard redone (xeriscape). This being SoCal the workers were Mexican (the landscaper an old-timer Japanese/Japanese American). The thing is, they took two weeks, digging trenches with picks and shovels, moving dirt around by wheelbarrow, etc. No Bobcat or even Ditch Witch or Rototiller that I saw, let alone any sort of intelligent machine.

When you can hire cheap labor, you don't develop capital goods to replace labor.

helene edwards said...

Ron Santo had a nice year in '64, but I don't think he should be in the Hall.

Steve Sailer said...

"I work in so-called big data and it is correct that the vast majority of our work is pretty useless. Once in a while, though, we strike gold: i.e. stuff like being able to predict your race/sexual orientation/sociopathy/intelligence/buying habits from your facebook likes."

We were doing stuff like that involving supermarket shopping a long time ago. It wasn't useless, smart corporations paid a fair amount of money for our services. It just gets incorporated into how things are done and the world goes on, a little bit different than before, but not all that different.

Anon87 said...

countenance - MIT's Technology Review had a recent article where they went into great detail about how "analytics" and "data" had a huge hand in the re-election. The Best And Brightest working on a whole different level than the stodgy old GOP. If you read between the lines though, I think you can see the obvious answer: Find voters, figure out what they want, and then promise it to them for free.

Anonymous said...

"No you didn't. Or at least, it's extremely unlikely."

I'm completely with Steve on this. I think you're out in the weeds or very young and callow. (I'm envious!) I was involved in the neural networks field throughout the 80s. There was a big revolution around, maybe 1982-1983 when Bob Hecht-Nielsen got a significant amount of DARPA funding, organized the first conferences and some workshops, did a start-up, etc.

The big feeling in the air was "AI was back", but out from under the domination of the symbolic AI types (rightly-or-wrongly). And a lot of people didn't want complete AI, they just wanted to develop classifiers or recognizers that they didn't need to program. It all sort of came out of the woodwork at about the same time. Backprop, Kohonen maps, the whole first generation.

By the late 80s people were trying to use neural nets for a lot of applications. They were also beginning to realize they couldn't debug them. Around the end of the decade folks began to realize how close backprop was mathematically to hidden Markov models and the stats folks began to say, hey, you guys should have talked to us about all this first... (or maybe they said, hey, we can do this non-linear stuff too...)

So I think you guys today do now have more theory motivating the algorithms. Things are on a more formal basis. You are getting demonstrably better results. They didn't have support vector machines, for instance, (a common algorithm now) back in the day. And of course you do have more compute power and disk space... Computer power's a funny thing, most computers today spend 90% of their time idle.

The application domains don't seem to have changed that much and I'm not really sure the overall effectiveness has... What has changed is that folks like Google now do have access to all the web-pages in the world and they want to crunch on them all the time, find out what's hot, where the money is... Hi, Google!

I think the plain old stats folks had a bit of revenge, the PCA and factor analysis types. The stuff the oil companies run. It's all come a long way. I'm not knocking the current Big Data stuff, just saying that Steve could well have been doing large scale applications in the 80s, where large scale means all the data that's available. I also worked on the first laser cash registers collecting all this info in the very early 80s (not on the analytics side)... And if you collect the data, someone will have to do something to justify it all.