Data science is a popular topic today, not only in the tech industry, but across many industries. Google Trends shows a 300% increase in the interest for data science in the past two years. C-Level executives are hearing about the advances their organizations can make by hiring data scientists. However, data science is a nebulous concept; what exactly is data science and how can I use it in my organization today? Levvel will be presenting a blog series to shed some light on the topic by applying data science techniques to business problems facing many organizations.

Data science is not a new concept, but has recently become more utilized in the day to day operations of organizations as executives gain access to more and more data. There are mountains of data available to organizations today, but in order to gain value from the data, it must be organized, filtered, sorted, mined, visualized, and analyzed. Data science has a number of applications from sales lead prioritization, logistics demand forecasting, fraud detection, resume screening, claims review, product mix, customer segmentation and many others. One specific example of data science being used is traditional marketing segmentation All customers are generally lumped into a relatively small number of segments. Through the proliferation of data about customers and advances in technology executives can now segment their customer base into thousands or hundreds of thousands of segments to target very specific customer bases.

Data science has roots in a number of disciplines: statistics, predictive analytics, data visualization, data mining, sorcery and more. Effective data scientists are not those individuals that have doctorate degrees in these fields, but those individuals that can apply lessons learned from a variety of disciplines to the problem at hand. Effective data scientists will not only be able to model your existing churn rate, but tell you how to affect the rate and ultimately predict the changes over time.

**Ultimately, data science is the application of analysis techniques on data to affect a positive change in an organization.**

Levvel’s blog series on data science will start by presenting a simple sales forecasting model using Excel and R. It will focus on using data science to create a naive sales forecast in order to introduce some of the basic tools of a data scientist and why they add value. The series will continue in the forecasting trend by leveraging Hadoop and R to perform sales forecasting on larger data sets, and will ultimately use Amazon’s EMR product to demonstrate sales forecasting at a massive scale. Once we have covered sales forecasting at scale, the blog series will switch business domains to help product and marketing managers analyze A/B test results, perform customer segment, and finally help Chief Product Officers optimize their product mix for maximum profitability. Throughout the series, Levvel wants to focus on a variety of tools that a data scientist can use to create value for both small and large organizations.

If you would like to follow along during the blog series, you can clone this repository and follow along in the corresponding blog folder. Or run this command:

```
git clone git@github.com:GetLevvel/datascienceblogseries.git && cd datascienceblogseries/part1
```

## Data Set

Levvel will be referencing many data sets throughout the blog series, all for the enablement of Initech, an organization that is not only struggling for profitability, but can no longer ignore the fact that their competition is applying data science techniques to increase market share. For our first task, the CFO of Initech would like to see sales forecasts for every department across every location. He is currently working with IT to gather all the sales reports across the organization, but they have generated a small data sample for one of the departments. A preview of data can be seen here:

Our data set has four columns: Store Number, Department Number, Day Number, and Total Sales. The smalldataset.csv contains 143 days of data for Store 1/Department 1. While the IT department gathers the rest of the sales data we will begin setting up our forecasting framework to run at scale.

## Forecasting

When trying to solve a large data problem, the data scientist should not try to tackle billions of records at one time. The scientist should look to analyze a small subset of data to begin understanding the intricacies of the data, but not draw any conclusions until the entire data set is analyzed. In addition to looking at a smaller data set, a data scientist must be careful not to overfit the data. Overfitting data simply means the data scientists is drawing invalid conclusions from a data set because of a large number of variables are interacting together to create a causation. Looking for conclusions in a data set that has hundreds of variables, it is easy for the scientist to draw conclusions that do not exist because too many variables were inspected simultaneously. For our data set, let’s try to find a pattern in the sales data for a single store and department. Once we determine how the data is organized we can begin thinking of tackling the bigger problem of performing sales forecasts across the entire Initech organization.

There are a number of forecasting techniques leveraged by organizations. To a list a few, organizations utilize: Naive, Average, Drift method, Seasonal, Time series, Weighted Average, and Artificial Intelligence. For our simple data set, one of the easiest forecasting techniques we can employ is a naive approach using Linear Regression.

Linear regression is a technique to model a relationship between a scalar dependent variable and any number of independent variables. It is a great technique for quickly determining if a relationship exists amongst data and can be a quick gut check tool for data scientists of all levels.

Once we have created a forecasting model (in this case a simple linear regression line), the data scientists must test their model by performing backtesting. Backtesting is a technique to determine the effectiveness of a particular forecast model. This is accomplished by applying your forecast model to historical data and measuring the error rate for each data point.

## Linear Regression – Excel

There is a perception of data science that it requires advanced knowledge of esoteric tool chains, and while understanding advanced R, SAS, and Big Data analytics tools can be very helpful it is not necessary for every problem. Excel is a fantastic data analysis tool for many data sets.

In order to demonstrate a linear regression model quickly, we will first perform the regression on our data set in Excel (if you want to skip ahead, you can simply download this file.

The first step in our analysis is to copy the data from Initech’s sample data set into Excel. Once we have the data in Excel we can begin analyzing the data. Visualizing the data is one method data scientists utilize to determine if a relationship exists in their data. Creating a chart in Excel is very effective at creating a quick visualization. Select the Day and Sales columns in the Excel sheet and go to Insert tab to create a chart (Scatter Plot is what we will be using for our analysis):

Once you have created the chart it will become apparent that there is a general trend down in sales, but there are many data points that are outside the normal range. This is good to note, and data scientists must rely on their intuition, but let’s go ahead and add a trendline to validate our thoughts. In order to add a trendline, click “Add Chart Element” under the Chart Design tab in Excel. This will give an option to add a Linear Trendline to our chart.

With our trendline, we can now see there is a slight downward trend in sales, but how can we use this knowledge to get a specific forecast? We need to see the trendline formula in order to calculate the forecast for day 144. To see the equation of the trendline in Excel right click your trendline and click on “Format Trendline” You will then be given an option to display your equation on the chart.

Our final chart should look something like this:

To perform a forecast, add the following equation to any cell in the Excel document: `=-23.588*K44+29570`

. `K44`

is the cell in which we will enter the day we want to forecast. Simply put `144`

in cell `K44`

, and you will see a forecasted value of $26,173.33.

**NOTICE: In this case, linear regression is not a good forecasting technique as it does not closely map to the data. Performing a backtest will demonstrate how poorly our regression equation acts as a forecast model.**

To create a backtest of the forecast model, add a new column next to your sales data. Each entry in the column should utilize the trendline equation described above, but use Days 1 – 143 to calculate the values as shown here:

Once we have our backtest numbers, you can calculate the error rate by taking the absolute value of backtest value subtract actual total sales. The equation for your Error column should look like this:

Summing up the total error rate will show a total value of $1,230,000, which means our sales forecast over 143 days would be off by over one million dollars. The backtest error rate demonstrates how poor a forecast model linear regression is for this data set.

## Linear Regression – R

R is a language and environment for statistical computing and graphics. R has become one of the de facto languages for data science work. While performing a linear regression in Excel is easy, it is extremely easy in R. Once you have R installed, you can either create an R Script, or use R’s REPL to call the following four lines of code:

```
library(lattice)
mydata = read.csv("smalldataset.csv",
header = FALSE, col.names = c("Store", "Dept", "Day", "Sales"), sep = ",") # read csv file
trellis.device(device = "png", filename = "xyplot.png")
print(xyplot(Sales~Day, data = mydata, type = c("p", "r")))
```

This code is dependent on the lattice library to gain access to the xyplot function. The second line reads our csv file, while the third line sets up our printing capabilities to print to a PNG file. Finally, the fourth line of code performs the creation of the scatter plot with the linear regression trendline. Additional work would need to be performed to generate a backtest of the dataset, but R can give additional flexibility that may be outside the scope of Excel’s’ capabilities, especially on larger data sets.

If you created a script file for the code, simply run it through the R interpreter with:

```
Rscript simpleRegression.R
```

This will produce the following graphic:

We will go into more detail on how to utilize more advanced R later on in the blog series.

## Summary

Today we introduced you to data science, both what it is, and how it can be leveraged for any number of business strategies to help increase your bottom line. In addition, we demonstrated a very simple case of sales forecasting using both R and Excel to gain some initial understanding the basic tools used by a data scientists. Ultimately, our forecast model performed poorly as seen in our backtesting results, but we have now a baseline to work against for future forecasting models. In future posts in this series, Levvel will use R, Hadoop, EMR, Tableau, and other data science toolchains to create better forecast models, optimize product mix, analyze A/B test results, and apply data science to many other business problems to help the Initech organization achieve profitability.