Machine Learning with R: Income prediction
Introduction
The UCI Machine Learning Repository is an excellent source of interesting data sets for machine learning purposes. Here, I will look at the Census Income dataset, which Barry Becker extracted from the 1994 Census database. The goal is to predict whether a person makes over $50K per year or not. I will use logistic regression to classify each person as <=50K or >50K. This is a binary classification problem, and logistic regression models the log-odds of the outcome as a linear function of the predictors, so it is a reasonable first choice for this job. To evaluate the model, I will use standard classification metrics: a confusion matrix, the ROC curve, and the AUC.
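As a minimal sketch of what logistic regression computes, the snippet below turns a linear predictor into a probability and then into a class label. The coefficients and the example person are hypothetical values chosen for illustration; they are not estimates from the census data.

```r
# Logistic (sigmoid) function: maps any real number to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, for illustration only (not fitted to the census data)
intercept  <- -4
beta_age   <- 0.05
beta_hours <- 0.04

# Linear predictor (log-odds) for a hypothetical 40-year-old working 45 hours per week
z <- intercept + beta_age * 40 + beta_hours * 45

# Estimated probability of earning more than 50K
p <- sigmoid(z)

# Classify as ">50K" only when the probability exceeds 0.5
label <- ifelse(p > 0.5, ">50K", "<=50K")
label
```

A log-odds of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the natural default cut-off.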
The workflow will be as follows:
- Preparing the data.
- Fitting a logistic regression model.
- Predicting probabilities and evaluating the model.
Let’s start by loading the libraries we will use in this work: tidyverse, caret, and ROCR (to get these packages, you might need to download them from this link and put them in the folder containing the R libraries).
library(tidyverse)
library(caret)
library(ROCR)
Reading in the data
df <- read_csv("adult.csv") # reading in the data
library(data.table)
setnames(df, c('age','workclass', 'fnlwgt','education', 'educationnum', 'maritalstatus','occupation', 'relationship', 'race','sex', 'capitalgain', 'capitalloss' , 'hoursperweek', 'nativecountry', 'yearsalary'))
dim_desc(df) # dimensions of the data
names(df) # names of the data
glimpse(df) # taking a look at the data
df <- df %>% mutate_if(is.character, as.factor) # changing character variables to factors
# selecting random seed to reproduce results
set.seed(5)
# sampling 75% of the rows
inTrain <- createDataPartition(y = df$yearsalary, p=0.75, list=FALSE)
# train/test split; 75%/25%
train <- df[inTrain,]
test <- df[-inTrain,]
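A quick sanity check after a split like the one above is to confirm that both parts keep roughly the same share of each income class. The sketch below runs without the census file: `toy` is a hypothetical stand-in data frame, and the base R sampling is only an approximation of the within-class sampling that createDataPartition performs for a factor outcome.

```r
# Toy stand-in for the census data (hypothetical; the real df has 15 columns)
set.seed(5)
toy <- data.frame(
  age = sample(18:70, 200, replace = TRUE),
  yearsalary = factor(rep(c("<=50K", ">50K"), times = c(150, 50)))
)

# Draw 75% of the row indices within each class (stratified sampling in base R)
idx_by_class <- split(seq_len(nrow(toy)), toy$yearsalary)
in_train <- unlist(
  lapply(idx_by_class, function(i) sample(i, size = ceiling(0.75 * length(i)))),
  use.names = FALSE
)

toy_train <- toy[in_train, ]
toy_test  <- toy[-in_train, ]

# Class shares should be close to 75% / 25% in both parts
prop.table(table(toy_train$yearsalary))
prop.table(table(toy_test$yearsalary))
```

The same two prop.table calls can be applied to train$yearsalary and test$yearsalary from the real split.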
Fitting The Model
fit <- glm(yearsalary~., data=train, family=binomial)
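Once a logistic model is fitted, its coefficients are on the log-odds scale, and exponentiating them gives odds ratios that are easier to read. Here is a small sketch on simulated data; the variable names (`hours`, `high_income`) and the relationship between them are invented for illustration and say nothing about the census results.

```r
set.seed(5)
n <- 500

# Simulated predictor: weekly working hours (hypothetical)
hours <- rnorm(n, mean = 40, sd = 10)

# Simulated outcome: the log-odds rise by 0.1 per extra hour (an invented relationship)
p <- 1 / (1 + exp(-(-4 + 0.1 * hours)))
high_income <- rbinom(n, size = 1, prob = p)
toy <- data.frame(hours, high_income)

# Same kind of call as in the post: a binomial GLM, i.e. logistic regression
toy_fit <- glm(high_income ~ hours, data = toy, family = binomial)

coef(toy_fit)       # coefficients on the log-odds scale
exp(coef(toy_fit))  # odds ratios: multiplicative change in odds per unit of the predictor
```

On the real model, summary(fit) and exp(coef(fit)) can be read the same way, keeping in mind that factor predictors are expanded into one coefficient per non-reference level.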
Making Predictions
yearsalary.probs <- predict(fit, test, type="response")
head(yearsalary.probs)
# Looking at the response encoding
contrasts(df$yearsalary) # >50K = 1, <=50K = 0
# converting probabilities to ">50K" and "<=50K"
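glm.pred <- rep("<=50K", length(yearsalary.probs))
glm.pred[yearsalary.probs > 0.5] <- ">50K"
# using the same levels as the true labels keeps the confusion matrix well defined
glm.pred <- factor(glm.pred, levels = levels(test$yearsalary))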
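The 0.5 cut-off is a choice, not a property of the model. The sketch below shows how moving the threshold changes the predicted labels; the probabilities, the true labels, and the helper `cut_labels` are hypothetical and only serve as an illustration.

```r
# Hypothetical predicted probabilities and true labels
probs <- c(0.10, 0.35, 0.62, 0.48)
truth <- factor(c("<=50K", "<=50K", ">50K", ">50K"), levels = c("<=50K", ">50K"))

# Hypothetical helper: turn probabilities into a factor with fixed levels,
# so both classes appear as levels even if one is never predicted
cut_labels <- function(p, threshold = 0.5, lv = c("<=50K", ">50K")) {
  factor(ifelse(p > threshold, lv[2], lv[1]), levels = lv)
}

pred_default <- cut_labels(probs)                   # threshold 0.5
pred_lower   <- cut_labels(probs, threshold = 0.4)  # more willing to predict ">50K"

table(predicted = pred_default, actual = truth)
table(predicted = pred_lower, actual = truth)
```

Lowering the threshold catches more of the true ">50K" cases at the cost of more false alarms, which is the trade-off the ROC curve summarizes across all thresholds.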
Evaluating The Model
# creating a confusion matrix
confusionMatrix(glm.pred, test$yearsalary, positive = ">50K")
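To make the confusion matrix output easier to read, here is a sketch that computes accuracy, sensitivity, and specificity by hand in base R. The eight labels are made up for illustration; ">50K" is treated as the positive class, as in the call above.

```r
lv <- c("<=50K", ">50K")

# Hypothetical true and predicted labels
actual <- factor(c(">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"),
                 levels = lv)
predicted <- factor(c(">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"),
                    levels = lv)

# Rows are predictions, columns are true labels
cm <- table(predicted, actual)

tp <- cm[">50K", ">50K"]    # true positives
fn <- cm["<=50K", ">50K"]   # false negatives
fp <- cm[">50K", "<=50K"]   # false positives
tn <- cm["<=50K", "<=50K"]  # true negatives

accuracy    <- (tp + tn) / sum(cm)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)       # share of true ">50K" cases that were found
specificity <- tn / (tn + fp)       # share of true "<=50K" cases that were found
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With imbalanced classes, accuracy alone can look good even when sensitivity is poor, so it is worth reading all three.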
# need to create prediction object from ROCR
pr <- prediction(yearsalary.probs, test$yearsalary)
# plotting ROC curve
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
# AUC value
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
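The AUC can be read as the probability that a randomly chosen ">50K" person receives a higher predicted probability than a randomly chosen "<=50K" person. As a cross-check on that interpretation, the sketch below computes it from ranks in base R; `auc_rank` is a hypothetical helper and the scores are made up.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula
auc_rank <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # tied scores receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores and labels: 8 of the 9 positive/negative pairs are ordered correctly
scores <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)
labels <- c(">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K")

auc_toy <- auc_rank(scores, labels, positive = ">50K")
auc_toy
```

Applied to yearsalary.probs and test$yearsalary, this should agree with the ROCR value up to floating-point error, provided there are no missing predictions.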