For generations, analysts have struggled with the term “statistical significance.” Traditionally, the struggle was having too little data to make a useful inference. As a rule of thumb, data starts to take on a discernible, predictable shape – what statisticians refer to as a distribution – after about 30 samples. We know a small sample size can lead to bad conclusions, an easy trap to be lured into.
From his experience, my dog believes cats live in trees. How often do we interact with others and draw the wrong conclusion?
— Don Dingee (@L2myowndevices) July 31, 2013
Big data lies ahead with 30 million or 30 billion samples, and we now have the opposite problem: too much data to make sense out of. Why did we want big data, again? Pop culture heroes like Nate Silver and the NSA will produce headlines with it. For the rest of us, the answer lies in the search for significance.
Data has always been out there, coming and going somewhere, but the connection, storage, and retrieval mechanisms were limited. Our personal, direct encounters and recollection plus what we could painstakingly store and retrieve on paper or celluloid helped. The huge breakthrough came in real-time data broadcast from across the world, primarily from newspapers, radio, and TV. We anxiously awaited new data daily to expand, alert, and entertain. The content wheel was spinning.
News was at first significant, but entertainment turned out to be easiest to monetize. With the real-time channels opened by the media outlets, bigger pipes were needed for more content. Networking, telecom, search engine, and storage firms prospered as technology found ways to store, access, transport, and deliver everything people could ever want.
The idea of significance shifted again. Real-time turned into on-demand, and people now view the stream of incoming data differently. Breaking news most of the time isn’t, and grabbing and holding our attention with so-called news is getting more and more difficult. In the background, everything is captured and stored somewhere, searchable and ready for playback. Data becomes significant when we are ready to consume it, not necessarily when the producer provides it.
An interesting example comes from baseball, which has shifted from a game of skill and intuition to a business of analysis and prediction. The 43rd SABR Convention, the annual gathering of practitioners of sabermetrics, this week produced a fascinating observation:
— sabr (@sabr) August 3, 2013
What does a 50x increase in available data do for baseball? For fantasy baseball, all that data is lifeblood. For those tasked with evaluating and acquiring talent, more data can help. The game itself is still played between the lines, with the unpredictability of human behavior, and the element of chance – there are no guarantees, but data points to the likely possibilities.
The funny part about big data is if someone else has it, and we don’t, we could be at a disadvantage. Fear propels us to try to gather data and define uncertainty, even if the attempt opens a giant can of jumping beans. Borrowing the lost words of Sammy Hagar in Van Halen’s “Big Fat Money”, updated to today’s situation where data is money:
Where’s it gonna come from? Who’s it gonna go to?
I ain’t beatin’, but I’m bein’ eaten by data, oh yeah, big, big data.
Now, gimme, gimme, gimme, gimme, gimme some of that big data.
The whole point of statistics is to improve our chances of making the right choice at the right moment. Winning means blending our experience with the correct and timely analysis of all that stored data combined with a rapid and accurate evaluation of new data coming in, much of which is increasingly routine and thereby insignificant.
We are now building the Internet of Things, generating even bigger data – so much data, we won’t ever be able to look at it all. Billions of devices will produce a non-stop stream of stuff, much of it whisked off to some disk drive, and data scientists will be bringing bigger and bigger data analytics tools and algorithms to assess it. Significance will come from exception handling, spotting new samples that don’t fit an understood trend of goodness, and alerting us to do something.
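To make "exception handling" a little more concrete, here is a minimal sketch in Python of one common way to spot a sample that doesn't fit the recent trend: compare each new reading against the mean and standard deviation of a sliding window of earlier readings. The function name `flag_exceptions`, the window of 30 (a nod to the 30-sample rule of thumb above), and the 3-standard-deviation threshold are all illustrative assumptions, not a prescribed method, and the sensor readings are made up for the demo.

```python
from collections import deque
from statistics import mean, stdev


def flag_exceptions(readings, window=30, threshold=3.0):
    """Yield (index, value) for readings far from the recent trend.

    Hypothetical helper: window and threshold are assumed defaults.
    """
    recent = deque(maxlen=window)  # sliding baseline of "routine" data
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Flag a reading more than `threshold` standard deviations out.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the exception out of the baseline
        recent.append(value)


# Made-up data: a steady sensor hovering near 20, with one odd spike.
readings = [20.0 + 0.1 * (i % 5) for i in range(60)]
readings[45] = 35.0

alerts = list(flag_exceptions(readings))
print(alerts)  # [(45, 35.0)]
```

Out of 60 readings, 59 are routine and pass by silently; only the one that breaks the pattern comes back as an alert. Real device data would need more care (drift, seasonality, missing samples), but the shape of the idea is the same.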
In the age of zettabytes and beyond, big data isn’t what we’re looking for. It’s mostly “telef***in’ teletrash” as Hagar put it, if no analysis is applied. We are looking for the small, somewhere in the big that is being saved and statistically studied. We are searching for the one piece of information that could tell us our health is changing and we should seek help, that we should change speed or direction to avoid a collision, when to prevent a brownout by dialing back a few thousand air conditioners a couple of degrees, or how to spot some malfeasants based on who they have been calling or texting.
The Internet of Things will cause another shift, back to little bits of otherwise big data being significant when they say there is a problem developing – whether we notice it, or not. Big data, applied across billions of devices and a timeline of similar stored experiences, will bring little data back to us when and where we need it.