Big data’s power to misinform equals, and at times exceeds, its power to inform. An endless stream of articles and books now offers pronouncements on big data’s supposed power and how to harness it.
For such a quantitative subject, big data offers an interesting exercise in how naming something can frame how we think about it. In this case, the opposite term, micro data, would prove just as fitting. What we refer to as big data simply encapsulates the unprecedented quantity of information we generate through our daily work and routines – every time we use our computers or phones, buy something, or ask something; even our offline activity is far too easily accessed. Big data refers to our attempts to extract meaning from this information overload. Companies can sense that beyond this cloud (forgive the pun) of endless information lies the nirvana of business planning. Those who find the quickest means of identifying the next trend shall inherit the earth.
The continuous probing of micro-information for any whiff of a pattern is essentially the opposite of the scientific method. Scientific experiments take the following iterative form, upon which the lean startup method is based:
Formulate hypothesis -> construct representative data sample -> see if you can reject hypothesis -> formulate new hypothesis based on results.
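That loop can be sketched in a few lines of code. Below is a minimal, illustrative Python example – the coin-flip “experiment” and the 1% threshold are my own assumptions, not drawn from any real study – that formulates a hypothesis (the coin is fair), draws a sample, and checks whether the data can reject it:

```python
import math
import random

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: the probability of an outcome
    at least as unlikely as observing k heads in n flips under H0."""
    pmf = lambda i: math.comb(n, i) * p**i * (1 - p)**(n - i)
    observed = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= observed + 1e-12)

# Step 1: formulate the hypothesis BEFORE looking at data: the coin is fair.
# Step 2: construct a representative sample (simulated here).
random.seed(0)
flips = [random.random() < 0.5 for _ in range(200)]
heads = sum(flips)

# Step 3: see if the hypothesis can be rejected at a pre-chosen level.
p_value = binom_two_sided_p(heads, 200)
alpha = 0.01
print(heads, round(p_value, 3),
      "reject" if p_value < alpha else "fail to reject")
```

The point is the ordering: the hypothesis and the threshold are fixed before the data is examined, and the data gets exactly one chance to reject it.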
In scientific journals, you never hear of anyone proving anything, only whether or not they can reject a hypothesis at a given significance level. The test of any alleged piece of knowledge is whether it can be disproven; if it survives the attempt, that still never confirms its veracity. When I talk about my adopted country of Colombia, I never bother trying to tell people what it is (the more I get to know the place, the more impossible that becomes), and instead always speak of what it isn’t, seeking to correct various misconceptions along the way.
Big data analysis can sometimes fall victim to the opposite approach. Instead of formulating a hypothesis, constructing a representative dataset and then seeking to disprove a hypothesis, the following path is taken:
Probe data as it comes in for any possible pattern -> extrapolate out into conclusions based on patterns detected -> seek additional means to validate these conclusions.
This problem is nothing new and has led numerous companies and organizations astray, and the increasing amount of information only amplifies the danger. Statistical significance at the 1% level will occur by chance 1% of the time even when no real effect exists – which, with enough tests, becomes fairly often. Running continuous tests on swaths of incoming data guarantees that exciting patterns will spring up all over the place with impressively high statistical certainty, many of them entirely by chance. The power of our own brains, likewise, can work against us. Just as nature abhors a vacuum, the human brain abhors randomness, and our innate ability to spot patterns is far more powerful than we often realize. We want to believe that we can find patterns everywhere, and so we do. This is evolution at its most basic – our need to be on the lookout for the slightest movement in the grass. You therefore run a highly amplified risk of arriving at conclusions based on false patterns that arise simply from the sheer number of tests you’re running and the volume of data at your disposal, not from any real underlying trend.
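A quick simulation makes the danger concrete. The sketch below (pure Python, with illustrative parameters of my own choosing) runs 1,000 “experiments” on nothing but random noise and counts how many clear the two-sided 1% significance bar purely by chance:

```python
import random
import statistics

random.seed(42)

def noise_experiment(n=50):
    """Compare two samples drawn from the SAME distribution, so any
    'significant' difference is a fluke by construction."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # Two-sample z statistic; the variance is known to be 1 here,
    # so the standard error of the difference in means is sqrt(2/n).
    z = (statistics.mean(a) - statistics.mean(b)) / (2 / n) ** 0.5
    return abs(z) > 2.576  # two-sided 1% critical value

trials = 1000
false_hits = sum(noise_experiment() for _ in range(trials))
print(f"{false_hits} of {trials} noise-only tests looked significant at 1%")
```

On average, about ten of the thousand noise-only comparisons will look “significant” – exactly the sort of mirage described above, manufactured by volume of testing alone.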
This in no way means that you should not take advantage of the data and analytical tools available to you. Information is crucial to every aspect of your business, and it is indeed true that those who can discern real trends with the best speed and depth will realize profound competitive advantages.
Historically, however, technologies develop faster than our understanding of how to use them, and analytical technologies are no exception. The best models are often the simplest – those that don’t bend over backwards to incorporate every data point available but rather isolate a key relationship to be tested and the most direct means of testing it. Avoid the shotgun approach of broad-based, poorly defined tests run in the hope of finding something without really knowing what you’re looking for. Stick to rapid and iterative binary tests (binary meaning you’re seeking a straight yes-or-no answer, as opposed to leaving the door open to the many spurious results that chance can produce). Accept no results unless a plausible narrative can accompany them.
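As a concrete – and entirely hypothetical – example of such a binary test, the sketch below asks one pre-specified yes-or-no question (do two checkout pages convert at different rates?) using a standard two-proportion z-test in pure Python; the conversion numbers are invented for illustration:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided p-value for H0: the two conversion rates are equal."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (success_a / n_a - success_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

# One pre-registered binary question: does page B convert differently
# from page A? (Illustrative numbers, not real data.)
p = two_proportion_z(120, 1000, 150, 1000)
print("reject H0" if p < 0.05 else "fail to reject H0", round(p, 4))
```

One question, one test, one answer – and a p-value near the threshold, as here, is a cue to replicate before believing, not to go fishing for a friendlier metric.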
This is discussed in Nate Silver’s terrific book The Signal and the Noise, which touches on how the most advanced econometric models failed to foresee the meltdown of 2008, while much more straightforward ones grounded in human narrative (“these people can’t afford these mortgage rates”) more or less nailed it. Always seek to disprove your assumption, not to find evidence to support it. The single greatest danger of the data and analytics at our disposal is that we can prove anything we want.
Startups are advised to always be testing, iterating, and pivoting as new information comes to light, and this post is not meant to counter that advice but to make the path less treacherous. As with most balances in life and in business, there is no single ideal tradeoff; the best you can do is bear it in mind and use your best judgment. Big data is simply a means to an end, and those who master it will be those who cut through its hugeness to construct well-specified tests and home in on the elusive data points that subject their hypotheses to true and rigorous verification.