Machine Learning with R: Income prediction

1 minute read

Introduction

The UCI Machine Learning Repository is an excellent source of interesting data sets for machine learning porpuses. Here, I will look at Census Income Dataset. Extraction of this dataset was done by Barry Becker from the 1994 Census database. The goal is to predict whether a person makes over $50k per year or not. I will use logistic regression to classify the output based on whether it is <$50K or >=$50K. This is a linear classification problem thus the logistic regression sounds a good choice for this job. To evaluate the model, I will use machine learning evaluation metrics (ROC, AUC).

The workflow will be as follows:

  • Preparation of the Data.
  • Implementing a logistic regression model (fitting)
  • predicting the probabilities and evaluating the model.

Let’s start by importing the libraries we will use in this work: tidyverse, Caret, ROCR (To get this packages, you might need to download it from this link and put it in the folder containing the R libraries.).

library(tidyverse)
library(Caret)
librery(ROCR)

Reading in the data

df <- read_csv("adult.csv") # reading in the data
library(data.table)
setnames(df, c('age','workclass', 'fnlwgt','education', 'educationnum', 'maritalstatus','occupation', 'relationship', 'race','sex', 'capitalgain', 'capitalloss' ,  'hoursperweek', 'nativecountry', 'yearsalary'))

dim_desc(df) # dimensions of the data
names(df) # names of the data
glimpse(df) # taking a look at the data
df <- df %>% mutate_if(is.character, as.factor) # changing character variables to factors
# selecting random seed to reproduce results
set.seed(5)
# sampling 75% of the rows
inTrain <- createDataPartition(y = df$yearsalary, p=0.75, list=FALSE)

# train/test split; 75%/25%
train <- df[inTrain,]
test <- df[-inTrain,]

Fitting The Model

fit <- glm(yearsalary~., data=train, family=binomial)

Making Predictions

yearsalary.probs <- predict(fit, test, type="response")
head(yearsalary.probs)
# Looking at the response encoding
contrasts(df$yearsalary)  # >50K = 1, No = 0

# converting probabilities to ">50K" and "<=50K" 
glm.pred = rep("<=50K", length(yearsalary.probs))
glm.pred[yearsalary.probs > 0.5] = ">50K"
glm.pred <- as.factor(glm.pred)

# Evaluating The Model
# creating a confusion matrix
confusionMatrix(glm.pred, test$yearsalary, positive = ">50K")

# need to create prediction object from ROCR
pr <- prediction(yearsalary.probs, test$yearsalary)

# plotting ROC curve
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

# AUC value
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc