In Philosophy of Science we ask “What is science?” and “How does science work?” It is easy to forget that everything, even science, has a limit, and in order to practice science we have to know what that limit is. It would be foolish of a project manager to start constructing a building, or any project for that matter, without knowing the scope of the project. When we use a tool such as a measuring tape we need to understand that there are some things it simply cannot do, like measure water or time. Likewise, when we perform analysis we have to know what the limit of our tool is… what the limit of the science itself is. If we are going to use analytics in data science we need to know what its limits are.
Thomas Kuhn and Normal Science
Thomas Kuhn, one of my favorite philosophers of science, attempted to answer this problem by dividing science into two parts, “normal science” and “extraordinary science.” 1 Normal science, he argues, is what we think of when we picture ourselves practicing science… it is the established norm, or what Kuhn calls a paradigm. Extraordinary science, on the other hand, is what we do when we create entirely new systems to solve not only the old problems our traditional science attempts to solve (and perhaps failed to solve… or solved with massive amounts of theoretical “duct tape!”) but entirely new problems as well!
What is Data Science?
In Data Science, and more specifically predictive analytics, we take MASSIVE amounts of data, often from many different sources, and feed it into a computer that searches for patterns. Once we have identified a pattern we can build what is called a “model” based on past actions. We can then use that model to make predictions about future actions, assuming of course that the model is accurate, that the past is indicative of the future, etc. As humans we build internal models of others all of the time… for instance, let’s say your husband plays golf every Saturday morning in the summer and never ever misses his tee time (except when it is raining). If you wake up on a sunny summer Saturday morning and your husband isn’t home, according to your “internal model” your husband is probably out golfing. Can we be sure of it? No… none of our predictions are 100% accurate. Your husband could be at breakfast or flying on a plane for a work event… or he hates golf and has been hanging out with his friends at the golf course bar all of these years. All things being equal, however, without any additional datapoints, you wouldn’t be surprised to find out he was out golfing.
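The “internal model” in the golf example can be sketched in a few lines of Python. This is only an illustration, not how a production predictive model is built: the `history` records and the `predict` helper are made up for this post, and the “model” is nothing more than counting what happened on past Saturdays with the same weather.

```python
from collections import Counter

# Hypothetical history of past summer Saturdays: (weather, what he did).
history = [
    ("sunny", "golf"), ("sunny", "golf"), ("rainy", "home"),
    ("sunny", "golf"), ("sunny", "golf"), ("rainy", "home"),
    ("sunny", "golf"), ("sunny", "breakfast"),
]

def predict(history, weather):
    """Return the most frequent past activity for this weather,
    plus the share of matching days on which it happened."""
    counts = Counter(activity for w, activity in history if w == weather)
    if not counts:
        # No past days with this weather: the model has nothing to say.
        return None, 0.0
    activity, n = counts.most_common(1)[0]
    return activity, n / sum(counts.values())

activity, confidence = predict(history, "sunny")
print(activity, round(confidence, 2))  # golf 0.83
```

The output is a guess with a frequency attached, not a fact: five sunny Saturdays out of six ended in golf, so the model says “golf” and is wrong whenever this Saturday is the sixth kind.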
Therein lies the first limit of analytics… our models offer predictions based on correlations between datapoints, and these predictions are simply that: predictions. Certainly we have a higher level of confidence in some predictions than in others. Police officers, for instance, have informally used this kind of analysis forever; certain types of criminals “just act a certain way” and “9 times out of 10” the profile is accurate. But confidence is not the same as certainty. Can you be fairly confident your husband is golfing? Sure. Are you confident that the swerving driver is drunk and not having a seizure or being stung by bees? Maybe. Would you bet $20 on it? Probably. Would you bet a loved one’s life on it? Absolutely not. The reason is what we refer to as the Quine-Duhem Thesis and the problem of induction.
My favorite philosopher, W.V.O. Quine, and the philosopher Pierre Duhem presented arguments which we collectively refer to as the Quine-Duhem Thesis. In “Two Dogmas of Empiricism” Quine argues that given any set of data with at least one unknown variable, an indeterminable number of relationships exist which could yield the same result, or as I usually phrase it, “a finite number of points can be interpreted an infinite number of ways.” The thesis argues “that no scientific hypothesis is by itself capable of making predictions.” Although our models may offer us a great deal of predictability, they say absolutely nothing about “why” the model works or why the result is predicted. Our model is simply using induction to suggest that an event is likely given the patterns; it doesn’t give us any reason for the relationship between these patterns. Models offer no intentionality. We might suspect a reason, and that suspicion may be entirely accurate, but without a deductive argument, a suspected reason is speculation at best.
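The phrase “a finite number of points can be interpreted an infinite number of ways” can be made concrete with a toy example. The sketch below is an illustration of underdetermination, not a reconstruction of Quine’s or Duhem’s argument, and the data points and the `make_rival` helper are invented for this post. It takes four observations that a straight line fits perfectly, then builds rival models that fit the same four points just as perfectly and disagree about the very next one.

```python
# Four hypothetical observations that lie exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

def linear(x):
    """The 'obvious' model: a straight line through the data."""
    return 2 * x + 1

def make_rival(c):
    """Build a rival model: the line plus a term that is zero at every
    observed x, so it matches the data exactly for any value of c."""
    def rival(x):
        bump = c
        for xi in xs:
            bump *= (x - xi)
        return linear(x) + bump
    return rival

models = [linear] + [make_rival(c) for c in (1, -2, 0.5)]

# Every model reproduces all four observations exactly...
for model in models:
    assert all(model(x) == y for x, y in zip(xs, ys))

# ...yet they disagree about the next, unobserved point.
predictions_at_4 = [model(4) for model in models]
print(predictions_at_4)  # [9, 33, -39, 21.0]
```

Since `c` can be any number, there is one rival for every value of `c`, and the four datapoints alone cannot tell us which of them is “right.” We prefer the straight line for reasons (simplicity, habit) that are not in the data.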
The Demarcation of Data Science
Every “normal” science has a limit, and when you go beyond that limit you are no longer practicing science; in order to practice science you must know the limitations of the science. Now, could we develop a data science which predicts the future with absolute accuracy? Perhaps. Could we develop a way in data science to answer “why” our predictions come true? Maybe. These are questions for the data scientists who are practicing “extraordinary” science, data science that is so theoretical that it makes science fiction seem like history. But the analysts using data science in your local big-box store and in your corporation are practicing normal science (no matter how genius they are)! And herein lies the important fact: the limit of our normal data science is that our predictions are simply statistical outcomes that may or may not occur, and even though the model may predict an outcome, it doesn’t tell us why.
In data science we must continually remember that the knowledge we are attempting to gain is not an “explanation,” but rather a correlation and a prediction which AT BEST tells us what chance something has of occurring. As soon as we start grasping at an explanation we are no longer practicing science, but rather engaging in speculation (that being said, these speculations could serve as hypotheses to be further tested). Data science can and does produce amazing correlations which allow us to predict actions with a surprising amount of accuracy… but we must remember that as soon as we attempt to explain “why” these predictions come true we are no longer acting as scientists.