To analyse this data, login to your gmail and go to http://drive.google.com Press on the +NEW button to create, first a new directory and after moving to the new directory, create a Colab notebook. If the option to create a new Colab notebook is not visible, you may have to Connect more apps.
Rename the new Colab notebook as RetailStore, connect it to a Virtual Machine by pressing the Connect button on the top right and then load this data into the VM.
Download Data
You may first download the data from the Tableau server or from GDrive into your laptop and then upload the same to the VM using the Upload feature of the VM but this is unnecessary. Instead, you may pull the data directly from the source website to the VM by using ANY ONE of the following TWO commands in one cell of the Colab Notebook
!wget -O Superstore.xls https://community.tableau.com/servlet/JiveServlet/downloadBody/1236-102-2-15278/Sample%20-%20Superstore.xls
#!gdown https://drive.google.com/uc?id=1HDeOmMfag6VX_oei0-Eqsc62unpBRlu2
Note that one of these two commands has been disabled with # symbol on the first line that identifies it as a non-executable comment. Depending on which source you want to use, you should comment out the other source.
Neither wget and gdown are python statements. They are Unix ( or Linux) utilities and hence they are preceded by the escape character ! You may Google for wget or gdown to understand more about what these commands do, but essentially they pull data from one website to the machine where they are being executed ( which in this case is your Virtual Machine at Google Colab) .
At the end of this process, you would have a file called Superstore.xls in our VM
Create Pandas Dataframe
Before we start processing this data with Python, we need to import various useful modules that has the required functions:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
After importing Pandas, we use the read_excel() method to read the Excel file into a Pandas dataframe dfSS_0. The default option assumes that the first ( or 0th) row of the spreadsheet contains the column names.
We also determine the shape of the dataframe and note that this data has 9994 rows and 21 columns.
dfSS_0 = pd.read_excel("Superstore.xls")
dfSS_0.shape
Next we take a quick look at the actual data. Since showing all 9994 rows is pointless, we look at the head of the table and pick out the first 5 rows.
dfSS_0.head()
This tells us the names of the columns as well shows us what the data looks like. To get more information about the data, we could have tried these commands as well : dfSS_0.info() or dfSS_0.describe()
We note that Row ID does not add value and from considerations of privacy, personally identifiable data like Customer ID and Customer Name should not be present in the data. Hence we drop these columns from the data. [ check out what do the drop() parameters axis and inplace mean? ]
# Remove unwanted columns
dfSS_0 = dfSS_0.drop(columns=["Row ID","Customer ID", "Customer Name"])
#Show first two rows
dfSS_0.head(2)
Finally, we need to check if there is any missing data anywhere in this huge dataset. Had it been a small Excel file we could have eyeballed it and located missing data. Since this is not possible, we use the following command
#check for null values
dfSS_0.isnull().values.any()
Once the data is loaded into a Pandas dataframe, there are many ways in which we could analyse the data depending on what one is looking for. Here is one approach :
Analyzing One Parameter (Sales or Profits)
Instead of looking at so many rows, let us work with a copy of the data that has fewer columns. Let us create a copy of the dataframe that contains only City, State, Category and Sales.
#Reduce the quantity of data
dfSS_1S = dfSS_0[["City","State","Category","Sales"]].copy()
dfSS_1S.head()
We have created a new dataframe dfSS_1S that contains only these four columns and the first five rows of this dataframe is as follows :
We could have also created another dataframe with Profit data as well
#Reduce the quantity of data
dfSS_1P = dfSS_0[["City","State","Category","Profit"]].copy()
dfSS_1P.head()
but for the moment we will work with the Sales data and get the total Sales that has been achieved.
dfSS_1S["Sales"].sum().round(2)
remove that last suffix .round(2) and see what happens
Note that there are three things happening here
- We extract only the data from the Sales column
- We apply a function sum() to get total Sales
- We apply a round(2) function to restrict the output to two decimal places
Depending on the situation we could have applied more functions but right now there is nothing more that would add any value.
Now that we have gross Sales data, let us see how to break this data and look at it from different perspectives.
- Individual Sales data from one State
- Total Sales data from one State
- Individual Sales data from either of two States
- Total Sales data from either of two States
- Individual Sales from one State and one Category
- Total Sales from one State and one Category
#Show ONLY few states
dfSS_1S[dfSS_1S["State"] == "Kentucky"]
dfSS_1S[dfSS_1S["State"] == "Kentucky"]["Sales"].sum().round(2)
#Data from either of two states
#Note the OR condition joining the clauses
dfSS_1S[(dfSS_1S["State"] == "Kentucky") | (dfSS_1S["State"] == "Alabama")]
#Data from either of two states
#Note the OR condition joining the clauses
dfSS_1S[(dfSS_1S["State"] == "Kentucky") | (dfSS_1S["State"] == "Alabama")]["Sales"].sum().round(2)
#Placing Two Conditions
#State AND Category
dfSS_1S[(dfSS_1S["State"] == "Kentucky") & (dfSS_1S["Category"] == "Furniture")]
dfSS_1S[(dfSS_1S["State"] == "Kentucky") & (dfSS_1S["Category"] == "Furniture")]["Sales"].sum().round(2)
Watch carefully how the queries are being built up
- First we specify conditional clause inside round brackets ()
- Next we join these conditional clauses with and & or |
- Then we put these clauses between square[] brackets to extract those rows from the dataframe
- Next we extract only one column, Sales from the resultant dataframe
- Finally we add the Sales values to get the total Sales if required
- The round() function only makes the output pretty
Grouped Data
We will now perform some more complex analysis. We want to group the Sales and Profit data by Region and draw a primitive graph. So we go back to our original dataframe and use the following command :
#Group BY
dfSS_RSP = dfSS_0.groupby("Region")["Region","Sales","Profit"].sum().round()
dfSS_RSP
We first create a new dataframe dfSS_RSP that contains only details of Region, Sales and Profit and then group the data by Region by using the sum() function. We could have also used the mean() function to get average data for each region but this is not so useful. This shows the following useful data :
We may also convert this into a very primitive chart as follows :
#Basic Default Plot
dfSS_RSP.plot()
Note that the dataframe dfSS_RSP has been created with Region as the index and Sales, Profit as regular columns. This is the default option of the groupby() function and the default plot() function only uses the index as the X axis.
# Plotting directly from DataFrames with Pandas
dfSS_RSP.plot(
y=['Sales', 'Profit'],
kind='bar',
legend=False,
color=['green','orange'],
figsize=(12,8)
)
# The plot is now created, and we use Matplotlib style
# commands to enhance the output.
plt.ylabel('Total Value')
plt.xlabel("Region")
plt.title("Value by Region")
plt.legend()
plt.gca().yaxis.grid(linestyle='--')
Check out various options in the charts by changing kind to, say, barh, figsize, color, linestyle etc. You could also put multiple charts in the same image as follows :
# Plotting directly from DataFrames with Pandas
dfSS_RSP.plot(
kind='pie',
subplots=True,
legend = False,
figsize=(12,8),
title = 'Sales and Profit by Region'
)
Multi-level groupings are also possible as shown in the following commands
#Group BY
dfSS_0.groupby("Sub-Category")["Sub-Category","Sales","Profit"].sum().round()
#Group BY two levels
dfSS_0.groupby(["Region","Sub-Category"])["Region","Sub-Category","Sales","Profit"].sum().round()
But that last command generates so much data that it becomes quite impossible to understand. In this case, it would make sense to first extract data from one Region and then see the total values of grouped by Category, as follows :
#Group BY two levels
dfSS_0[dfSS_0["Region"] == "Central"].groupby(["Region","Sub-Category"])["Region","Sub-Category","Sales","Profit"].sum().round()
Here we have first extracted data for Central Region and then grouped the data by Sub-Category
Data Transformation
Very often it so happens that the data that we want to explore is not directly available in the data set. For example, a manager might want to know Average Sales and Profits by Month. We note that even though the data has not been categorised by Month we do have the Order Date from which we can extract the name of the Month.
Rather than mess up the original dataframe we create a copy of the data in a second dataframe that contains a small subset of the columns of interest :
#Make a copy of the data
dfSS_2 = dfSS_0[["Order Date","State","Region","Category","Sales","Profit"]].copy()
dfSS_2.head(3)
Now we create two extra columns, Month and Year and populate the data in these columns by extracting the Month and Year from the Order Date :
# Extracting Month, Year from DateTime
dfSS_2['month'] = pd.DatetimeIndex(dfSS_2['Order Date']).month
dfSS_2['year'] = pd.DatetimeIndex(dfSS_2['Order Date']).year
dfSS_2.head()
Now we have the Month and Year for each transaction:
Now it is a simply a matter of grouping by Month and plotting the Sales and Profit by Month
#Group BY
dfSS_2.groupby("month")["Sales","Profit"].sum()
dfSS_2.groupby("month")["Sales","Profit"].sum().plot(
y=['Sales', 'Profit'],
kind='bar',
legend=True,
color=['lime','purple'],
figsize=(12,8)
)
to get the following chart
this contains data from both 2015 and 2016 but we will leave it as an exercise to extract data from only one year and plot this chart.
Solution : If you have done everything correctly, you should get have got this Colab Notebook.
No comments:
Post a Comment