Exploratory Data Analysis

9 minute read

In the early stages of a data project, it is crucial to creating visualizations to get an insight into your data. Matploylib and seaborn are two popular Python library that makes visualization much more manageable. You can put all Visualization techniques in four different categories:

Comparison: Either You want to compare features with each other or a single feature with itself at different time points, the most popular techniques is to use a bar chart.
Relationship: This is when you want to see how features in your data set are changing concerning each other. Scatter plot and line plot are in this category.
Composition: Pie chart should be the one to refer when you want to understand the composition in a feature.
Distribution: Histograms (eight bar type or line type) and scatter plot are usually used to understand the statistical characteristics of the feature.

Here I will use these visualization techniques to get a meaning from this dataset and answer the following questions and provide evidence for our hypothesis:

Introduction to the dataset: Customer value analysis

The goal of this post is to analyze a dataset and provide some useful information regarding customers demographics and buying behaviour. I choose to work on a dataset from IBM Watson Analytics (available here). I will provide answers to assigned questions by visualization methods in Python. This dataset will show us the most profitable customers and how they interact. I will show you how you can increase profitable customer response, retention and growth. Let’s start!

import pandas as pd
import numpy as np
import os
# Load the raw data using the read_csv object
df = pd.read_csv("WA_Fn_UseC__Marketing_Customer_Value_Analysis.csv")

There is a total of 24 variables in this dataset. The variables and the description of the values are as follows:

Customer: the customer ID number
State
Customer lifetime value: CLV present the finantial value of a customer. It is a prediction of the net profit attributed to the customer during his/her entire relationship with the company. CLV is highly important in shaping managers’ decisions.
Reponse: Yes or no response
Coverage: coverage type eigher basic, Premium, or extended
Education
Effective to Date
Employmnet status
Employed or unemployed
Gender: F or M
Income
Location Code
Marital Status
Monthly Premium Auto
Months Since Policy inception
Number of Open Complaints
Number of Polices
Policy Type
Policy
Renew Offer Type
Sales Channel
Total Claim Amount
Vehicle Class
Vehicle Size

df.head()

	Customer	State	Customer Lifetime Value	Response	Coverage	Education	Effective To Date	EmploymentStatus	Gender	Income	...	Months Since Policy Inception	Number of Policies	Policy Type	Policy	Renew Offer Type	Sales Channel	Total Claim Amount	Vehicle Class	Vehicle Size
0	BU79786	Washington	2763.519279	No	Basic	Bachelor	2/24/11	Employed	F	56274	...	5	1	Corporate Auto	Corporate L3	Offer1	Agent	384.811147	Two-Door Car	Medsize
1	QZ44356	Arizona	6979.535903	No	Extended	Bachelor	1/31/11	Unemployed	F	0	...	42	8	Personal Auto	Personal L3	Offer3	Agent	1131.464935	Four-Door Car	Medsize
2	AI49188	Nevada	12887.431650	No	Premium	Bachelor	2/19/11	Employed	F	48767	...	38	2	Personal Auto	Personal L3	Offer1	Agent	566.472247	Two-Door Car	Medsize
3	WW63253	California	7645.861827	No	Basic	Bachelor	1/20/11	Unemployed	M	0	...	65	7	Corporate Auto	Corporate L2	Offer1	Call Center	529.881344	SUV	Medsize
4	HB64268	Washington	2813.692575	No	Basic	Bachelor	2/3/11	Employed	M	43836	...	44	1	Personal Auto	Personal L1	Offer1	Agent	138.130879	Four-Door Car	Medsize

5 rows × 24 columns

# the size or structure of the dataset: 24 columns and over 9000 rows. 
df.shape

(9134, 24)

Dataset comprises of 9134 observations and 24 characteristics.

investigating the dataset

Now, I will get some basic information regarding this dataset.

df.isnull().sum()

Customer                         0
State                            0
Customer Lifetime Value          0
Response                         0
Coverage                         0
Education                        0
Effective To Date                0
EmploymentStatus                 0
Gender                           0
Income                           0
Location Code                    0
Marital Status                   0
Monthly Premium Auto             0
Months Since Last Claim          0
Months Since Policy Inception    0
Number of Open Complaints        0
Number of Policies               0
Policy Type                      0
Policy                           0
Renew Offer Type                 0
Sales Channel                    0
Total Claim Amount               0
Vehicle Class                    0
Vehicle Size                     0
dtype: int64

The dataset looks clean since there is no NaN value in our feature culumns. The data types have been also properly assigned.

df.dtypes

Customer                          object
State                             object
Customer Lifetime Value          float32
Response                          object
Coverage                          object
Education                         object
Effective To Date                 object
EmploymentStatus                  object
Gender                            object
Income                             int64
Location Code                     object
Marital Status                    object
Monthly Premium Auto               int64
Months Since Last Claim            int64
Months Since Policy Inception      int64
Number of Open Complaints          int64
Number of Policies                 int64
Policy Type                       object
Policy                            object
Renew Offer Type                  object
Sales Channel                     object
Total Claim Amount               float64
Vehicle Class                     object
Vehicle Size                      object
dtype: object

Exploring the features

Now lets build up some graphs to explore the features (variables).

import seaborn as sb
import scipy
from scipy.stats.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import chi2_contingency
from IPython.display import display
pd.options.display.max_columns = None
import matplotlib.pyplot as plt
sb.set()
import pylab as pl
from pylab import rcParams
from matplotlib import cm
import plotly.graph_objs as go
import plotly.offline as py
import os
import plotly.tools as tls
import plotly.figure_factory as ff
import matplotlib.ticker as mtick

%matplotlib inline
#rcParams['figure.figsize'] = 28, 4
sb.set_style('whitegrid')

df["Vehicle Class"].value_counts()

Four-Door Car    4621
Two-Door Car     1886
SUV              1796
Sports Car        484
Luxury SUV        184
Luxury Car        163
Name: Vehicle Class, dtype: int64

df["Education"].value_counts()

Bachelor                2748
College                 2681
High School or Below    2622
Master                   741
Doctor                   342
Name: Education, dtype: int64

df["Coverage"].value_counts()

Basic       5568
Extended    2742
Premium      824
Name: Coverage, dtype: int64

df["State"].value_counts()

California    3150
Oregon        2601
Arizona       1703
Nevada         882
Washington     798
Name: State, dtype: int64

df["Customer Lifetime Value"].mean()

8004.93017578125

The describe() function in pandas is useful in getting various statistics. As shown below, this function returns the count, mean, standard deviation, minimum and maximum values and the quantiles of the dataset. We can see that a notably large difference between 75% tile and max values of most of our features. This point implies that we have some Outliers in our data set.

df.describe()

	Customer Lifetime Value	Income	Monthly Premium Auto	Months Since Last Claim	Months Since Policy Inception	Number of Open Complaints	Number of Policies	Total Claim Amount
count	9134.000000	9134.000000	9134.000000	9134.000000	9134.000000	9134.000000	9134.000000	9134.000000
mean	8004.930176	37657.380009	93.219291	15.097000	48.064594	0.384388	2.966170	434.088794
std	6870.965820	30379.904734	34.407967	10.073257	27.905991	0.910384	2.390182	290.500092
min	1898.007690	0.000000	61.000000	0.000000	0.000000	0.000000	1.000000	0.099007
25%	3994.251831	0.000000	68.000000	6.000000	24.000000	0.000000	1.000000	272.258244
50%	5780.182129	33889.500000	83.000000	14.000000	48.000000	0.000000	2.000000	383.945434
75%	8962.166992	62320.000000	109.000000	23.000000	71.000000	0.000000	4.000000	547.514839
max	83325.382812	99981.000000	298.000000	35.000000	99.000000	5.000000	9.000000	2893.239678

Data visualization

This was just a glimpse of this dataset. Let’s use visualization methods in Python and explore data with some graphs. Python’s Seaborn library, built on top of matplotlib, will be used to get statistical graphs to perform Univariate and Multivariate analysis.

sb.pairplot(df)

<seaborn.axisgrid.PairGrid at 0x1a27ead860>

png

df.corr()

	Customer Lifetime Value	Income	Monthly Premium Auto	Months Since Last Claim	Months Since Policy Inception	Number of Open Complaints	Number of Policies	Total Claim Amount
Customer Lifetime Value	1.000000	0.024366	0.396262	0.011517	0.009418	-0.036343	0.021955	0.226451
Income	0.024366	1.000000	-0.016665	-0.026715	-0.000875	0.006408	-0.008656	-0.355254
Monthly Premium Auto	0.396262	-0.016665	1.000000	0.005026	0.020257	-0.013122	-0.011233	0.632017
Months Since Last Claim	0.011517	-0.026715	0.005026	1.000000	-0.042959	0.005354	0.009136	0.007563
Months Since Policy Inception	0.009418	-0.000875	0.020257	-0.042959	1.000000	-0.001158	-0.013333	0.003335
Number of Open Complaints	-0.036343	0.006408	-0.013122	0.005354	-0.001158	1.000000	0.001498	-0.014241
Number of Policies	0.021955	-0.008656	-0.011233	0.009136	-0.013333	0.001498	1.000000	-0.002354
Total Claim Amount	0.226451	-0.355254	0.632017	0.007563	0.003335	-0.014241	-0.002354	1.000000

sb.heatmap(df.corr(), xticklabels=df.corr().columns.values, yticklabels=df.corr().columns.values, annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a2d7df0b8>

png

spearmanr(df["Customer Lifetime Value"], df["Total Claim Amount"])

SpearmanrResult(correlation=0.2105979582520612, pvalue=4.351847002636142e-92)

chi2_contingency

<function scipy.stats.contingency.chi2_contingency>

table = pd.crosstab(df.Gender,df["EmploymentStatus"])

table

EmploymentStatus	Disabled	Employed	Medical Leave	Retired	Unemployed
Gender
F	244	2937	214	128	1135
M	161	2761	218	154	1182

chi2, p, dof, expected = chi2_contingency(table.values)

rcParams['figure.figsize'] = 14, 5
gb = df.groupby("Gender")["Education"].value_counts().to_frame().rename({"Education": "Numbers"}, axis = 1).reset_index()

gb

	Gender	Education	Numbers
0	F	Bachelor	1423
1	F	College	1352
2	F	High School or Below	1321
3	F	Master	393
4	F	Doctor	169
5	M	College	1329
6	M	Bachelor	1325
7	M	High School or Below	1301
8	M	Master	348
9	M	Doctor	173

sb.barplot(x = "Gender", y = "Numbers", data = gb, hue = "Education", palette = sb.color_palette("hls", 9)).set_title("Education vs. Gender");

output

rcParams['figure.figsize'] = 12, 5
gb = df.groupby("Vehicle Class")["Vehicle Size"].value_counts().to_frame().rename({"Vehicle Size": "Numbers"}, axis = 1).reset_index()
sb.barplot(x = "Vehicle Class", y = "Numbers", data = gb, hue = "Vehicle Size", palette = sb.color_palette("hls", 3)).set_title("Education vs. Gender");

output

rcParams['figure.figsize'] = 12, 5
gb = df.groupby("Vehicle Class")["Coverage"].value_counts().to_frame().rename({"Coverage": "Numbers"}, axis = 1).reset_index()
sb.barplot(x = "Vehicle Class", y = "Numbers", data = gb, hue = "Coverage", palette = sb.color_palette("hls", 12)).set_title("Education vs. Gender");

output

rcParams['figure.figsize'] = 12, 5
gb = df.groupby("Vehicle Size")["Coverage"].value_counts().to_frame().rename({"Coverage": "Numbers"}, axis = 1).reset_index()
sb.barplot(x = "Vehicle Size", y = "Numbers", data = gb, hue = "Coverage", palette = sb.color_palette("hls", 7)).set_title("Education vs. Gender");

output

rcParams['figure.figsize'] = 10, 5
gb = df.groupby("Policy")["Policy Type"].value_counts().to_frame().rename({"Policy Type": "Numbers"}, axis = 1).reset_index()
sb.barplot(x = "Policy", y = "Numbers", data = gb, hue = "Policy Type", palette = sb.color_palette("hls", 5)).set_title("Education vs. Gender");

output

gb = df.groupby("State").agg({"Customer Lifetime Value":'mean'}).rename({"Customer Lifetime Value": "Customer Lifetime Value Mean"}, axis = 1).reset_index()

gb

	State	Customer Lifetime Value Mean
0	Arizona	7861.341309
1	California	8003.647949
2	Nevada	8056.707031
3	Oregon	8077.901367
4	Washington	8021.472168

sb.set(style="whitegrid")
sb.barplot(x = "State", y = "Customer Lifetime Value Mean", data = gb, hue = "Customer Lifetime Value Mean" , palette = sb.color_palette("hls", 5)).set_title("The Mean value of the Customer Lifetime Value in each state");
#plt.legend(loc='best')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

<matplotlib.legend.Legend at 0x1a2ba6beb8>

output

sb.catplot(x="Customer Lifetime Value",y="Policy",data=df)

<seaborn.axisgrid.FacetGrid at 0x1a2b53e668>

output

g = sb.relplot(x="Monthly Premium Auto", y="Customer Lifetime Value",size='Income', kind="scatter", data=df)
g.fig.autofmt_xdate()

output

df["Customer Lifetime Value"]=df["Customer Lifetime Value"].astype('float32')

CLVunder10000 = df["Customer Lifetime Value"][(df["Customer Lifetime Value"] <= 10000.0)]
CLV10000_20000 = df["Customer Lifetime Value"][(df["Customer Lifetime Value"] <= 20000.0) & (df["Customer Lifetime Value"] >= 10000.1)]
CLV20000_40000 = df["Customer Lifetime Value"][(df["Customer Lifetime Value"] <= 40000.0) & (df["Customer Lifetime Value"] >= 20000.1)]
CLV40000_60000 = df["Customer Lifetime Value"][(df["Customer Lifetime Value"] <= 60000.0) & (df["Customer Lifetime Value"] >= 40000.1)]
CLVabove60000 = df["Customer Lifetime Value"][df["Customer Lifetime Value"] >= 60000.1]

CLV_range = ["1898(min)_ 10000","10000_20000","20000_40000","40000_60000","60000-83325 (max)"]
CLV_count = [len(CLVunder10000.values),len(CLV10000_20000.values),len(CLV20000_40000.values),len(CLV40000_60000.values),len(CLVabove60000.values)]

plt.figure(figsize=(8,5))
plt.bar(CLV_range, CLV_count, width=0.7, color = "black", align='center')
plt.show()

output

CLV_count

[7248, 1311, 515, 51, 9]

df["Customer Lifetime Value"].describe()

count     9134.000000
mean      8004.930176
std       6870.965820
min       1898.007690
25%       3994.251831
50%       5780.182129
75%       8962.166992
max      83325.382812
Name: Customer Lifetime Value, dtype: float64

sb.catplot(x="Customer Lifetime Value", y="Coverage", hue="State", kind="swarm", data=df);

output

sb.catplot(x="Customer Lifetime Value",y="Marital Status",kind='box',data=df)

<seaborn.axisgrid.FacetGrid at 0x1a2b7d52e8>

output

sb.catplot(x="Customer Lifetime Value",y="Marital Status", hue = "Gender", kind='box',data=df)

<seaborn.axisgrid.FacetGrid at 0x1a2b8c2940>

output

sb.catplot(x="Customer Lifetime Value",y="EmploymentStatus",kind='violin',data=df)

<seaborn.axisgrid.FacetGrid at 0x1a2c52f080>

output

rcParams['figure.figsize'] = 14, 8
ax = sb.catplot(y='Monthly Premium Auto',x='Months Since Policy Inception',hue='Gender',kind='point',data=df)
ax.fig.autofmt_xdate()
ax.

output

sb.catplot(x="Sales Channel", y="Customer Lifetime Value", hue="Renew Offer Type", kind="swarm", data=df);

output

df2=df[['Monthly Premium Auto', 'Months Since Last Claim', 'Months Since Policy Inception', 'Number of Open Complaints', 'Number of Policies', 'Customer Lifetime Value', 'Income', 'Total Claim Amount' ]]

number_of_rows=1
number_of_columns=8
l = df2.columns.values
plt.figure(figsize=(6*number_of_columns,8*number_of_rows))
for i in range(0,len(l)):
    plt.subplot(number_of_rows + 1,number_of_columns,i+1)
    sb.distplot(df[l[i]],kde=True) 

output

l = df2.columns.values
number_of_columns=8
number_of_rows = 1
plt.figure()
for i in range(0,len(l)):
    plt.subplot(number_of_rows + 1,number_of_columns,i+1)
    sb.set_style('whitegrid')
    sb.boxplot(df2[l[i]],color='green',orient='v')
    plt.tight_layout()

output

Share on

Twitter Facebook Google+ LinkedIn

Hossein Hosseini

Exploratory Data Analysis

Introduction to the dataset: Customer value analysis

investigating the dataset

Exploring the features

Data visualization

Share on

You may also enjoy

Data Cleaning: Consumer Complaint Database

Machine Learning with R: Income prediction

Data Scraping from Clutch Company’s Website with Python

Interactive Data Visualizations