Big data has opened an unprecedented window into human behavior. Many types of human behavior can now be tracked in online environments, producing data sets that were almost inconceivable a few years ago. As Chris Anderson, editor of Wired.com, exuberantly (and more than a bit irrationally) exclaimed, “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves” (2008).
Anderson (2008) seemed to think that large data sets make conclusions about complex phenomena obvious. He may have been referring to the assertion by Google’s Peter Norvig that sometimes the size of the data set brings more utility to an analysis than the model being used (2010). Norvig was discussing the accuracy of text analysis models published in the academic literature: he could take a model from the low range of accuracy, simply increase its data set to a billion words, and thereby raise its accuracy beyond that of the rest of the published models. It wasn’t the model or theorizing that increased the accuracy but the size of the data set. Anderson (2008) took these types of outcomes to mean that theory is dead and that data is king.
Norvig (2010), on the other hand, compared scaling a data set up to the level of big data to a film reel. Lots of images stacked up together and played fast create a qualitatively different experience than single images do. This is what Google does; it stacks up vast amounts of data to see what patterns emerge. Norvig would most likely agree with Anderson’s (2008) statement, “The Petabyte Age is different because more is different,” but the meanings each would attach to this statement seem to differ. Where Anderson says theory is dead, Norvig would quote George Box’s famous utterance, “Essentially, all models are wrong, but some are useful” (Norvig, 2010). The point is not that theories, models, etc. are irrelevant in the presence of big data, but that we should use the models we have to improve upon what we currently do by scaling up to big data. This does not render science or philosophy irrelevant, but enables greater utility out of less effort. It may not be necessary to discover exactly why a phenomenon is happening in order to find indicators that will help you avoid its occurrence. But contrary to what Wired.com’s editor may think, this does not mean that understanding why is unimportant, or that other problems will not require understanding why before decision aid technology becomes useful.
An illustration of where gaining utility from the minimum has its limits came in the Q&A session at the end of Norvig’s presentation, when some questions were raised about whether Google’s relatively simple models for translation have reached their limits in utility. Norvig admitted that they were most likely near a plateau in algorithm-based translation technology between most common languages. He said better models that incorporate meaning and nuance were needed in order to increase beyond current accuracy rates. So Norvig had not been saying that Google’s translation technology had rendered all other translation methodologies irrelevant, or that Google had made the need for understanding meaning in translation obsolete (as Anderson (2008) suggested). Rather, Google has been able to extract great utility out of its current models for translation, but raising the bar requires more innovative models that address factors not yet included. It appears that some questions must still be answered through old-fashioned human science, philosophy, engineering and design.
Anderson, C. (2008, June 23). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired.com. Retrieved from http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Norvig, P. (2010, September 27). The Unreasonable Effectiveness of Data. UBC. Retrieved from http://www.youtube.com/watch?v=9vR8Vddf7-s
This is a reflection on the Week 2 reading list of the Learning Analytics Open Course, held in conjunction with the 2011 Learning and Knowledge Analytics (LAK11) conference (at which I presented) and organized by George Siemens of Athabasca University.