Data Cleaning: Consumer Complaint Database

6 minute read

Introduction

Here, I will clean data from the Financial Services Consumer Complaint Database which is publicly available in this link. As described in this webpage, it contains “data from the complaints received by the Consumer Financial Protection Bureau (CFPB) on financial products and services, including bank accounts, credit cards, credit reporting, debt collection, money transfers, mortgages, student loans, and other types of consumer credit. Data available about each complaint includes the name of the provider, type of complaint, date, zip code, and other information.”

Data cleaning

Import the libraries to set up the envirenment and load the data

import pandas as pd
import numpy as np
import os
# Load the raw data using the read_csv object
df = pd.read_csv("/Users/hossein/Consumer_Complaints/Consumer_Complaints.csv")

Investigate the data

Pandas provides enough tools to get information about your raw data by just a brief line of code. Let’s look at the top of our dataset by executing the following command:

df.head()

	Date received	Product	Sub-product	Issue	Sub-issue	Consumer complaint narrative	Company public response	Company	State	ZIP code	Tags	Consumer consent provided?	Submitted via	Date sent to company	Company response to consumer	Timely response?	Consumer disputed?	Complaint ID
0	02/04/2019	Credit card or prepaid card	General-purpose credit card or charge card	Getting a credit card	Sent card you never applied for	NaN	Company believes it acted appropriately as aut...	First Federal Credit Control	FL	NaN	NaN	NaN	Web	02/04/2019	Closed with explanation	Yes	NaN	3141563
1	02/04/2019	Debt collection	Payday loan debt	False statements or representation	Attempted to collect wrong amount	NaN	NaN	ONEMAIN FINANCIAL HOLDINGS, LLC.	MI	NaN	Older American	NaN	Web	02/04/2019	In progress	Yes	NaN	3141667
2	02/04/2019	Credit reporting, credit repair services, or o...	Credit reporting	Incorrect information on your report	Information belongs to someone else	NaN	NaN	NCC Business Services, Inc.	TX	NaN	NaN	NaN	Web	02/04/2019	In progress	Yes	NaN	3142226
3	02/04/2019	Mortgage	Conventional home mortgage	Struggling to pay mortgage	NaN	NaN	NaN	BAYVIEW LOAN SERVICING, LLC	CA	NaN	NaN	Other	Web	02/04/2019	In progress	Yes	NaN	3142610
4	02/04/2019	Debt collection	I do not know	Attempts to collect debt not owed	Debt was paid	NaN	NaN	Diversified Consultants, Inc.	OH	45044	NaN	NaN	Web	02/04/2019	In progress	Yes	NaN	3141862

A quick look at this dataset, we can see that in some of our columns, we have NaN values. I will get back to this but first let’s get some information about the structure of our data by executing df.dtypes and df.info().

df.shape

(1210607, 18)

df.dtypes.to_frame(name='Type') # df.info()  is also return back data type plus other useful info. 

	Type
Date received	object
Product	object
Sub-product	object
Issue	object
Sub-issue	object
Consumer complaint narrative	object
Company public response	object
Company	object
State	object
ZIP code	object
Tags	object
Consumer consent provided?	object
Submitted via	object
Date sent to company	object
Company response to consumer	object
Timely response?	object
Consumer disputed?	object
Complaint ID	int64

Looking at the output of these two commands, we can see we have 18 columns (features) and over 1.2M rows. So this is a pretty huge dataset. Another important thing that we need to consider is understanding data types. A data type is essentially an inner construct that a programming language recruits to know how to store and manipulate data. Here we can see that all of our features are objects (strings), but the Complaint ID is an integer type. Later on, I will convert some of these data types to other ones as needed.

Ok, now let’s see how many missing (NAN) values we have in each column. The following command provides this information for us. It sounds like we have a significant number of missing values in 9 columns. I will fill them mostly by a “Not given” str type in the next section.

df.isnull().sum().to_frame(name='missing counts')

	missing counts
Date received	0
Product	0
Sub-product	235166
Issue	0
Sub-issue	521887
Consumer complaint narrative	846404
Company public response	797586
Company	0
State	17098
ZIP code	104875
Tags	1045378
Consumer consent provided?	564853
Submitted via	0
Date sent to company	0
Company response to consumer	6
Timely response?	0
Consumer disputed?	442087
Complaint ID	0

Clean up the data

Let’s start with “date columns”. We have two colums that we need to convert themfrom strings to datetime objects. Pandas to_datetime command can easyly do this job for us:

df["Date received"] = pd.to_datetime(df["Date received"])
df['Date sent to company'] = pd.to_datetime(df['Date sent to company'])

There are three columns in our dataset that can be considered as Boolean type. The following three lines of commands will execute this conversion by recruiting Numpy’s “where” module. The idea is to convert all “Yes” values to True, and everything else will be assigned to False (in case of the third line, “Consent not provided” to False and the rest to True ).

The basic idea is to use the np.where() function to convert all “Y” values to True, and everything else assigned False.

df["Timely response?"] = np.where(df["Timely response?"] == "Yes", True, False)
df["Consumer disputed?"] = np.where(df["Consumer disputed?"] == "Yes", True, False)
df["Consumer consent provided?"] = np.where(df["Consumer consent provided?"] == "Consent not provided", False, True)

The “ZIP code” column contains lots of signs and also extra letters attached to some of the zip codes in this column. This function written below will strip all these out from the column.

def replace_ZIP_code_XX_USMOI(x):
    '''
    Replace XX in Zip code to 00
    '''
    try: 
        return x.replace('XX', '00').replace('(','').replace('"','').replace('-','').replace('$','').replace('.','').replace('!','').replace('+','').replace('*','').replace('`','').replace('/','').replace('UNITED STATES MINOR OUTLYING ISLANDS','USMOI')
    except AttributeError:
        return np.NaN

df['ZIP code'] = df['ZIP code'].map(replace_ZIP_code_XX_USMOI)
df["State"] = df["State"].map(replace_ZIP_code_XX_USMOI)

Now Let’s fill in all the missing values with appropriate values.
In “ZIP code” case, I will fill them in with “00000” and convert its type to integer. For the other cases, I will fill them in with “Not given” string.

df['ZIP code'] = df['ZIP code'].fillna("00000").astype(int)
df['State'] = df['State'].fillna("Not given")
df['Product'] = df['Product'].fillna("Not given")
df['Sub-product'] = df['Sub-product'].fillna("Not given")
df['Issue'] = df['Issue'].fillna("Not given")
df['Sub-issue'] = df['Sub-issue'].fillna("Not given")
df['Consumer complaint narrative'] = df['Consumer complaint narrative'].fillna("Not given")
df['Company public response'] = df['Company public response'].fillna("Not given")
df['Company response to consumer'] = df['Company response to consumer'].fillna("Not given")
df['Tags'] = df['Tags'].fillna("Not given")

#lets see How our data types look like. 
df.dtypes.to_frame(name='Type')

	Type
Date received	datetime64[ns]
Product	object
Sub-product	object
Issue	object
Sub-issue	object
Consumer complaint narrative	object
Company public response	object
Company	object
State	object
ZIP code	int64
Tags	object
Consumer consent provided?	bool
Submitted via	object
Date sent to company	datetime64[ns]
Company response to consumer	object
Timely response?	bool
Consumer disputed?	bool
Complaint ID	int64

Ok, now I will save the cleaned data set into the current working directory (%pwd or cwd) by the following command line.

df.to_csv("Customer_Complaints_cleaned.csv", sep='\t', encoding='utf-8') # cwd = os.getcwd()

Data vizualization

from IPython.display import display
pd.options.display.max_columns = None
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pylab as pl
from matplotlib import cm
import plotly.graph_objs as go
import plotly.offline as py
import os
import plotly.tools as tls
import plotly.figure_factory as ff
import matplotlib.ticker as mtick

%matplotlib inline
sb.set_style('whitegrid')

Disputed_cunsumer_rate = (df['Consumer disputed?'].value_counts()*100.0 /len(df)).plot(kind='pie',\
        labels = ['No', 'Yes'], figsize = (6,6) , colors = ['grey','red'])

Disputed_cunsumer_rate.set_title('Consumer disputed?')
Disputed_cunsumer_rate.legend(labels=['No','Yes']);

output

Timely_response_rate = (df['Timely response?'].value_counts()*100.0 /len(df)).plot(kind='pie',\
        labels = ['Yes', 'No'], figsize = (7,7) , colors = ['grey','green'])

Timely_response_rate.set_title('Timely response?')
Timely_response_rate.legend(labels=['Yes','No']);

output

Consumer_consent_provided_rate = (df['Consumer consent provided?'].value_counts()*100.0 /len(df)).plot(kind='pie',\
        labels = ['Yes', 'No'], figsize = (7,7) , colors = ['grey','blue'])

Consumer_consent_provided_rate.set_title('Consumer consent provided?')
Consumer_consent_provided_rate.legend(labels=['Yes','No']);

output

rcParams['figure.figsize'] = 28, 4
gb = df.groupby("State")["Submitted via"].value_counts().to_frame().rename({"Submitted via": "Number of complaints"}, axis = 1).reset_index()
sns.barplot(x = "State", y = "Number of complaints", data = gb, hue = "Submitted via", palette = sns.color_palette("hls", 8)).set_title("Number of complaints in each state based on submission method");
plt.xticks(rotation=-90);

output

submission = (df['Submitted via'].value_counts()*100.0 /len(df)).plot(kind='pie',\
        figsize = (8,8) , colors = ['grey','purple', 'orange', 'green', 'blue', 'red'], fontsize = 15)

submission.set_title('Submission method ')
submission.legend(labels=['Web', 'Referral', 'Phone', 'Postal mail', 'Fax', 'Email']);

output

product_info = (df['Product'].value_counts()*100.0 /len(df2)).plot(kind='bar', stacked = True,\
                                                rot = 0, color = ['red','lightblue','green', 'blue', 'yellow'])
  
product_info.yaxis.set_major_formatter(mtick.PercentFormatter())
product_info.set_ylabel('Complaints')
product_info.set_xlabel('Product')
product_info.set_title('Percentage of complaints for each product');
plt.xticks(rotation=-90);

output

df["Company public response"] = np.where(df["Company public response"] == "Not given", False, True)

Company_public_response_rate = (df['Company public response'].value_counts()*100.0 /len(df2)).plot(kind='pie',\
        labels = ['No', 'Yes'], figsize = (7,7) , colors = ['grey','blue'])

Company_public_response_rate.set_title('Company public response provided?')
Company_public_response_rate.legend(labels=['No','Yes']);

output

date_delay = (df['Date sent to company']- df['Date received']).astype('timedelta64[D]')

rcParams['figure.figsize'] = 8, 6
plt.hist(date_delay, normed=False, bins=1000)
plt.ylabel('Number of complaints');
plt.xlabel('Time difference (days)');
plt.title('The difference between date a complaint received and the date it sent to company')
plt.xlim((0, 40))

output

Share on

Twitter Facebook Google+ LinkedIn

Hossein Hosseini

Data Cleaning: Consumer Complaint Database

Introduction

Data cleaning

Import the libraries to set up the envirenment and load the data

Investigate the data

Clean up the data

Data vizualization

Share on

You may also enjoy

Machine Learning with R: Income prediction

Exploratory Data Analysis

Data Scraping from Clutch Company’s Website with Python

Interactive Data Visualizations