Introduction and Data Types

From "Descriptive" to "Predictive"

Until now, we have been primarily concerned with describing, explaining, and visualizing what has happened in the past based on recorded data. That is an appropriate first step toward understanding what your data says about your business or organization. Descriptive visualizations (like those you made in Tableau) help you to identify potential explanations for what you are observing in your organization. However, the greatest forms of business value come from the next step in the process: predicting what will happen in the future. As we talked about in the very first chapter, it's time to establish "cause" and "effect." The overall "effect" that we want to explain is our organization's success, performance, and ability to beat competitors. If we understand what "causes" that effect, then we can focus our time, efforts, and resources on maximizing those causes in order to maximize that effect. But "success" is a pretty high-level variable; let's break that effect down into something smaller and more "bite-size."

For example, let's assume that we own a newly started bike shop. Our success is best represented by our sales volume, so we want to find out what causes a customer to buy a bike from our store. To do that, you begin by theorizing about those causes. By one definition, a theory is "a coherent group of tested general propositions, commonly regarded as correct, that can be used as principles of explanation and prediction for a class of phenomena." Another definition, from Wikipedia: "a theory is a contemplative and rational type of abstract or generalizing thinking, or the results of such thinking. Depending on the context, the results might, for example, include generalized explanations of how nature works." But let's substitute a couple of words toward the end:


Depending on the context, the results of a theory might, for example, include generalized explanations of how organizations operate efficiently and effectively, why employees perform at high and low levels, or why customers make (repeat) purchase decisions.

Therefore, a bike shop owner may theorize that people will buy her or his bikes because they live or work near the bike store, because they have lots of expendable income, because they live a "commutable" distance from their jobs, because they are older or younger, because they have more or less education, or because they have children. Notice that in each of these examples there are cause variables (e.g. address, income, commute distance, age, education, children) that are associated with the effect variable of interest: whether or not a person will purchase a bike.

However, identifying the variables is only half of the job of a theory. There must also be a logical explanation of why those variables will cause bike purchases. For example, if the bike store specializes in road bikes, then commute distance is relevant. If the bikes are expensive, higher-end models, then income is relevant. If the bikes are primarily for enjoyment (rather than exercise), then children is a relevant variable.

So how do you come up with a valid theory to begin with? Well, this is an expertise of academic researchers. They are literally paid to 1) review all of the relevant research results on a topic (often over many decades of work), 2) develop a theory to explain all of those results, 3) collect and analyze new data to confirm their theories, and then 4) clearly explain these theories in a digestible form. They do a pretty darn good job of 1-3, but a pretty lousy job of 4 (full disclosure, I'm an academic researcher). However, you can find their theories and results all in one place at Google Scholar. If you don't want to read through the original academic papers, you can often find books and news articles based on that research which are much easier to digest.

Now, let's assume that you, the bike shop owner, have done your research and now you have a theory about what causes (or will cause) customers to buy bikes from you. The next step is to collect and measure data that represents the factors/variables involved. Since this course is not about survey techniques or other data collection methods, let's assume that you've been able to acquire the data you're looking for and it's stored in the Excel workbook below (go ahead and download and open it).

Once you've opened and examined the data file, notice that not all factors (a.k.a. "variables") are the same. Some are numeric and continuous, like income or age. Others are numeric but discrete, like "children," where only whole numbers may be used. Others are numeric but ordinal, like "commute distance," where the numbers are broken into ordered "chunks" (e.g. 0-1, 1-2, 2-5, 5-10) which do not necessarily have the same distance between chunks. Some data are categorical rather than numeric, like region or gender, whose values cannot be ordered and are not "higher" or "lower" relative to each other. Some categorical data are dichotomous, meaning they have only two possible values (e.g. PurchasedBike). Other categorical data are nominal, meaning they can have many possible values (e.g. "region" and "occupation"). One of the most common nominal factors, which is not included in this particular data set, is consumer ethnicity.
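To keep these types straight, it can help to write them down next to each variable. Here is a minimal Python sketch of that idea. The variable names and their types come from the paragraph above; the Level enum and the variables_at helper are hypothetical names made up for this illustration, not part of the workbook.

```python
from enum import Enum


class Level(Enum):
    """Data types described in the text (the class name is illustrative)."""
    CONTINUOUS = "numeric, continuous"
    DISCRETE = "numeric, discrete (whole numbers only)"
    ORDINAL = "numeric, ordinal (ordered chunks)"
    DICHOTOMOUS = "categorical, exactly two values"
    NOMINAL = "categorical, many unordered values"


# Variable -> type, following the examples given in the text.
VARIABLE_LEVELS = {
    "income": Level.CONTINUOUS,
    "age": Level.CONTINUOUS,
    "children": Level.DISCRETE,
    "commute distance": Level.ORDINAL,
    "PurchasedBike": Level.DICHOTOMOUS,
    "region": Level.NOMINAL,
    "occupation": Level.NOMINAL,
}


def variables_at(level):
    """Return the names of all variables recorded at the given level."""
    return sorted(name for name, lv in VARIABLE_LEVELS.items() if lv is level)


for level in Level:
    print(f"{level.value}: {variables_at(level)}")
```

A lookup like this is only a note-taking aid, but it makes it easy to ask, for example, which variables are nominal before choosing an analysis.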

Some data are impossible to represent perfectly with either numeric or categorical values. Consider, for example, the data field representing education. "High school" is greater than "partial high school," yet the data is text rather than numeric. Also, the difference between categories is not necessarily the same. For example, can you say for certain that the difference between "partial high school" and "high school" is the same as the difference between "high school" and "partial college" (answer: no)? Therefore, if you represent education as a categorical variable, you lose the natural ordering of the category values. If you instead represent education with a rank-ordered number, you misrepresent the "distance" between categories. Still, you have to make a decision or else ignore education entirely. Tradeoff decisions like this must often be made in data analysis.
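The education tradeoff is easy to see in a few lines of Python. This is only a sketch: the labels come from the text, while the variable names (EDUCATION_ORDER, rank, gaps) are made up for the illustration.

```python
# Education labels listed in their natural order.
EDUCATION_ORDER = [
    "partial high school",
    "high school",
    "partial college",
    "bachelors",
    "graduate",
]

# Option 1: rank-ordered numbers (1, 2, 3, ...) preserve the ordering...
rank = {label: position + 1 for position, label in enumerate(EDUCATION_ORDER)}

# ...but every adjacent pair ends up exactly 1 apart, which quietly claims
# that each step in education is the same "distance" as every other step.
gaps = [rank[b] - rank[a] for a, b in zip(EDUCATION_ORDER, EDUCATION_ORDER[1:])]
print(gaps)  # [1, 1, 1, 1]

# Option 2: plain text categories carry no order at all. Sorting them just
# alphabetizes the labels, so the natural ordering is lost.
print(sorted(EDUCATION_ORDER))
# ['bachelors', 'graduate', 'high school', 'partial college', 'partial high school']
```

Neither option is "right." The sketch simply shows what each representation gives up.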

Some variables can be converted between numeric and categorical. For example, dichotomous variables like "home owner" can be turned into 0 and 1, and ordinal variables like "education" can be turned into integers (e.g. 1 = partial high school, 2 = high school, 3 = partial college, 4 = bachelors, 5 = graduate). However, other variables, like region, cannot. Be aware of these data types as you perform the analyses below. You'll notice that the Excel workbook you downloaded contains multiple worksheets. The "Original" worksheet contains the data as it was originally collected. The "Numeric" worksheet contains that same data along with newly created columns that convert as many of the categorical (or text) fields to numeric fields as possible.
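If you wanted to build those extra numeric columns yourself, the conversion might look roughly like the Python sketch below. The education codes are the ones listed above. Everything else is an assumption for illustration: the column names, the "Yes"/"No" coding for home owner, and the three sample rows are made up and are not taken from the workbook.

```python
# Integer codes for education, as listed in the text.
EDUCATION_CODES = {
    "partial high school": 1,
    "high school": 2,
    "partial college": 3,
    "bachelors": 4,
    "graduate": 5,
}

# Assumed coding for a dichotomous Yes/No field.
YES_NO_CODES = {"No": 0, "Yes": 1}


def add_numeric_columns(row):
    """Return a copy of the row with numeric versions of convertible fields."""
    converted = dict(row)
    converted["home_owner_num"] = YES_NO_CODES[row["home_owner"]]
    converted["education_num"] = EDUCATION_CODES[row["education"].lower()]
    # "region" is nominal, so it is left as text.
    return converted


# Made-up rows standing in for the "Original" worksheet.
original_rows = [
    {"home_owner": "Yes", "education": "Bachelors", "region": "North"},
    {"home_owner": "No", "education": "Partial College", "region": "South"},
    {"home_owner": "Yes", "education": "High School", "region": "North"},
]

numeric_rows = [add_numeric_columns(row) for row in original_rows]
for row in numeric_rows:
    print(row)
```

Notice that the original text columns are kept alongside the new numeric ones, which mirrors the idea of a "Numeric" worksheet that adds columns rather than replacing them.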

Before we can jump right into analyzing the effect of each variable on whether or not a customer purchased a bike, we need to understand the data and the "bivariate" relationships between each pair of variables. This will help guide our analyses more efficiently.
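To make "bivariate" concrete: one common way to summarize the relationship between two numeric variables is the Pearson correlation coefficient, computed once for every pair. The Python sketch below assumes that measure for illustration and should not be read as the specific analysis used later in the chapter. The column names and all of the numbers are made up and do not come from the workbook.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Made-up numeric columns (income in thousands; purchased_bike coded 0/1).
columns = {
    "income": [40, 30, 80, 70, 30, 10],
    "children": [1, 3, 5, 0, 0, 2],
    "purchased_bike": [0, 0, 0, 1, 1, 0],
}

# One correlation for each pair of variables.
pairwise = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

With three variables there are three pairs to examine. The number of pairs grows quickly as variables are added, which is one reason to look at these relationships systematically.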
