Predicting the presence of heart-disease within Cleveland dataset

Hello world, I’m Jakob Serlier. This R Markdown file is a part of my final project for the R For Analytics class at Carnegie Mellon University. Let’s dive into it!

The dataset we will be taking a look at is the heart-disease-uci dataset retrieved from Kaggle.com. The reason I chose this dataset is because I saw that it had potential for supervised modelling, which is something I wanted to learn more about in the course of this project. My goal is to:

Explore different supervised learing models in R
Determine the impact of different preprocessing methods on the predictive power of the created models

The analysis of the data will follow the following methodology:

Performing exploratory data analysis
Creating three dataframes which were created using different pre-processing techniques
Testing and running various models using the full dataframe
Running the models for each of the three dataframes multiple times, each time resampling the train and test subset
Reviewing and discuss the results

The three dataframes will be divided as follows:

The full dataset
The full dataset containing various buckets/bins
A subset of the dataset, created by removing the lowest correlating predictors with the target variable.

The following models will be considered:

randomForest’s package RandomForest:“randomForest implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression.”
The base GLM Model: “glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.”
mboost’s GLMBoost: “Gradient boosting for optimizing arbitrary loss functions where component-wise linear models are utilized as base-learners.”
kernlab’s SVM Radial: “Support Vector Machines with Radial Basis Function Kernel”
caTool’s Boosted Logistic Regression: “Train logitboost classification algorithm using decision stumps (one node decision trees) as weak learners.”

All of these models are able to be used in classification problems.

Some assumptions and ‘shortcuts’ have been made during this project in order to keep it within a feasible scope. For the results of each of the model, I have chosen a single indicator variable, such as ‘accuracy’, in order to determine its effectiveness. This of course leaves out a lot of important data which might improve/decrease the effectiveness in a real world scenario (e.g. a confusion matrix). Keep this in mind as I interpret the results.

Environment setup

Importing libararies, reading the csv file, setting seed, setting the amount of repetitions.

#importing some of the libraries..
library(randomForest)
library(pscl) #McFadden R2 of logistic model
library(caret)
library(corrplot)
library(data.table) 
library(dplyr)
library(tidyverse)

#heart disease dataset: https://www.kaggle.com/ronitf/heart-disease-uci
#import the file 
hd <- read.csv(file="./datasets/heart-disease-uci/heart.csv", header=TRUE, sep=",")
#setting seed for reproducability
set.seed(123)

#Setting the global repetition variable: how often will we resample and run the set for each model.
nruns <- 10

Exploratory Data Analysis

Let’s start with some simple text/table based EDA:

any(is.na(hd)) #We can see that there are no NA values.

## [1] FALSE

dim(hd) #Dimensions of the df, 303 rows and 14 cols, meaning 1 target variable and 13 possible predictors.

## [1] 303  14

str(hd) #Taking an initial look at the structure of the dataset

## 'data.frame':    303 obs. of  14 variables:
##  $ ï..age  : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

#summary(hd)#summary.. excluded in final version since it takes up some space
sapply(hd, sd)#standard deviation

##     ï..age        sex         cp   trestbps       chol        fbs 
##  9.0821010  0.4660108  1.0320525 17.5381428 51.8307510  0.3561979 
##    restecg    thalach      exang    oldpeak      slope         ca 
##  0.5258596 22.9051611  0.4697945  1.1610750  0.6162261  1.0226064 
##       thal     target 
##  0.6122765  0.4988348

We can see that there are no NA values, which is great, since now we don’t have to bother with imputing data or removing rows.

Dimensions of the dataframe: 303 rows and 14 cols, meaning 1 target variable and 13 possible predictors.

The predictor variables are defined as follows:

age age in years
sex(1 = male; 0 = female)
cp chest pain type
trestbps resting blood pressure (in mm Hg on admission to the hospital)
chol serum cholestoral in mg/dl
fbs(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
restecg resting electrocardiographic results
thalach maximum heart rate achieved
exang exercise induced angina (1 = yes; 0 = no)
oldpeakST depression induced by exercise relative to rest
slope the slope of the peak exercise ST segment
ca number of major vessels (0-3) colored by flourosopy
thal3 = normal; 6 = fixed defect; 7 = reversable defect

As we are mostly interested in the predictive power of the variables, we will not spend much time on exploring the real-life implication of these variables. As for the target variables, this is either 1 (positive) or 0 (negative).

In the structure, we can see that we are dealing with 14 numerical (int & num) variables.

Next, let’s rename the age variable to something which won’t mess with Rstudio

#lets rename hd$ï..age.. the column names causes some errors. 
hd$age <- hd$ï..age #Rstudio marks this as an error, but this can be ignored. 
hd$ï..age <- NULL #idem
#If youre wondering why I do it like this, its because hd <- hd %>% rename(ï..age = age) returns an error due to the character.

Visualizing the data

Correlation plot

By taking a look at the correlation plot we can get an initial idea of which predictors might be interesting to take a look at, especially the ones correlated with the target variable. The best classification and regression models have a low redundancy in predictor variables.

We can see that the variables chol, fbs restecg and trestbps have the lowest correlation, so these will be removed in the third analyzed dataframe.

Feature engineering

My feature engineering will mostly consist of pre-processing the dataset into 3 seperate datasets.

Transforming the target variable:
We will be transforming the target variable and giving it better labels in order to improve later interpretation. We will also be setting the target variable as a factor.

#step 3, transform target variable
hd$target <- ifelse(test=hd$target == 0, yes="negative", no='positive') #we want to do classification, so we want to transform the target variable into nominal values
hd$target <- as.factor(hd$target)

creating the 3 main datasets
As I have discribed in the introduction, we will be creating 3 seperate dataframes which will be analyzed. The first dataframe is the largely unaltered full dataset. Secondly, we will be creating a bucket dataframe, which wil have 3 variables (age, cholesterol levels and thalach) which will be transformed to ordinal variables (put into bins). Lastly, we will create a dataset which excludes the variables with low correlation, chol, fbs restecg and trestbps.

hd.full <- hd
hd.bucket <- hd
hd.excl <- hd[,-c(3:6)] #excluding the discussed variables

#Bucketing
#using age groups instead of age might improve the classification. Thanks to: 
agebreaks <- c(0,1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,500)
agelabels <- c("0-1","1-4","5-9","10-14","15-19","20-24","25-29","30-34",
               "35-39","40-44","45-49","50-54","55-59","60-64","65-69",
               "70-74","75-79","80-84","85+")

setDT(hd.bucket)[ , agegroups := cut(hd.bucket$age, 
                              breaks = agebreaks, 
                              right = FALSE, 
                              labels = agelabels)]
#Creating bcukets for cholesterol and thalach
hd.bucket$chol <- cut(hd.bucket$chol, 5, include.lowest=TRUE, labels=c("Low", "Low-Med", "Med", "Med-High", "High"))
hd.bucket$thalach <- cut(hd.bucket$thalach,  5, include.lowest=TRUE, labels=c("Low", "Low-Med", "Med", "Med-High", "High"))

Defining full model test and training set

Next we want to create a training and a test dataset for the full dataframe, as this is the one we will be using in the initial runs of the models.

smp_size <- floor(0.75 * nrow(hd))
train_set.full <- sample(seq_len(nrow(hd.full)), size = smp_size)
train.full <- hd[train_set.full, ]
test.full <- hd[-train_set.full, ]

Running the models (scroll down for the results)

First run of GLM + some interesting variables, results hidden

###
#Running the GLM model on the full dataset
###


glmfull <- glm(target ~ ., data = train.full, family = "binomial")
backwardshd <- step(glmfull,trace=0) # Backwards selection is the default
fitted.results <- predict(backwardshd,newdata=test.full,type='response') 
fitted.resultsBin <- ifelse(fitted.results > 0.5,'positive','negative')
ErrorFull <- mean(fitted.resultsBin != test.full$target)

print(paste('Accuracy of full dataframe',1-ErrorFull))

#some otherinteresting analysis tools for GLM:
anova(backwardshd, test="Chisq")
pR2(backwardshd)#McFadden R2 of logistic model

Running GLM model on the 3 datasets nruns times

LogisticModel <- function(set) {
  smp_size <- floor(0.75 * nrow(set))
  temptrainset <- sample(seq_len(nrow(set)), size = smp_size)
  temptrain <- hd[temptrainset, ]
  temptest <- hd[-temptrainset, ]
  
  glmfull <- glm(target ~ ., data = temptrain, family = "binomial")
  backwardshd = step(glmfull,trace=0) # Backwards selection is the default
  fitted.results <- predict(backwardshd,newdata=temptest,type='response') 
  fitted.resultsBin <- ifelse(fitted.results > 0.5,'positive','negative')
  Error <- mean(fitted.resultsBin != temptest$target)
  return(1-Error)
}

fullLMAcc <- replicate(nruns, LogisticModel(hd.full))
bucketLMAcc <- replicate(nruns, LogisticModel(hd.excl))
exclLMAcc <- replicate(nruns, LogisticModel(hd.bucket))

Running the basic random forest model on the full dataset, results hidden

#Running basic random forest: 
full.rf <- randomForest(target ~ ., data = train.full, oob.prox=TRUE)
#varImpPlot(full.rf) #commented out since no longer included into results
summary(full.rf)

Running the RandomForest model on the 3 datasets nruns times

randF <- function(set){
  #creating test and train sets
  smp_size <- floor(0.75 * nrow(set))
  temptrainset <- sample(seq_len(nrow(set)), size = smp_size)
  temptrain <- hd[temptrainset, ]
  temptest <- hd[-temptrainset, ]

  temp.rf <- randomForest(target ~ ., data=temptrain, oob.prox=TRUE)
  return(mean(temp.rf$err.rate))
}

fullRFError <- replicate(nruns, randF(hd.full))
bucketRFError <- replicate(nruns, randF(hd.bucket))
ExclRFError <- replicate(nruns, randF(hd.excl))

Testing svmRadial model using full dataset and running it on the 3 datasets nrun times

svmRfit <- train(target ~., data=train.full, method='svmRadial')
svmRfit
svmRfitClasses <- predict(svmRfit, newdata = test.full)
postResample(svmRfitClasses, test.full$target)
confusionMatrix(svmRfitClasses, test.full$target)

svmRfits <- function(set){
  #creating test and train sets
  smp_size <- floor(0.75 * nrow(set))
  temptrainset <- sample(seq_len(nrow(set)), size = smp_size)
  temptrain <- hd[temptrainset, ]
  temptest <- hd[-temptrainset, ]
  tempFit <- train(target ~., data=temptrain, method='svmRadial')
  tempClasses <- predict(tempFit, newdata = temptest)
  return(as.numeric(postResample(tempClasses, temptest$target)['Accuracy']))
}

fullsvmR <- replicate(nruns, svmRfits(hd.full))
bucketsvmR <- replicate(nruns, svmRfits(hd.bucket))
exclsvmR <- replicate(nruns, svmRfits(hd.excl))

Running LogitBoost on the full model and running it on the 3 dataset nrun times

#boostedLogitsic on full model
blrFit <- train(target ~.,
                data = train.full,
                method = "LogitBoost")
blrFit
blrClasses <- predict(blrFit, newdata = test.full)
postResample(blrClasses, test.full$target)[1]


bstLog <- function(set) {
    #creating test and train sets
  smp_size <- floor(0.75 * nrow(set))
  temptrainset <- sample(seq_len(nrow(set)), size = smp_size)
  temptrain <- hd[temptrainset, ]
  temptest <- hd[-temptrainset, ]
  
  tempfit <- train(target ~.,
                data = temptrain,
                method = "LogitBoost")
  tempClass <- predict(tempfit, newdata = temptest)
  as.numeric(return(postResample(tempClass, temptest$target)[1]))
}

fullBstLogAcc <- replicate(nruns, bstLog(hd.full))
bucketBstLogAcc <- replicate(nruns, bstLog(hd.bucket))
exclBstLogAcc <- replicate(nruns, bstLog(hd.excl))

Running the glmboost model on the full dataset and running the model on the 3 datasets nrun times

#glmBoost on full model
glmboost <- train(target ~.,
                data = train.full,
                method = "glmboost")
glmClasses <- predict(glmboost, newdata = test.full)
postResample(glmClasses, test.full$target)[1]
glmboost


glmbst <- function(set) {
    #creating test and train sets
  smp_size <- floor(0.75 * nrow(set))
  temptrainset <- sample(seq_len(nrow(set)), size = smp_size)
  temptrain <- hd[temptrainset, ]
  temptest <- hd[-temptrainset, ]
  
  tempfit <- train(target ~.,
                data = temptrain,
                method = "glmboost")
  tempClass <- predict(tempfit, newdata = temptest)
  as.numeric(return(postResample(tempClass, temptest$target)[1]))
}

fullBstglmAcc <- replicate(nruns, glmbst(hd.full))
bucketBstglmAcc <- replicate(nruns, glmbst(hd.bucket))
exclBstglmAcc <- replicate(nruns, glmbst(hd.excl))

compiling the results into variables and into a table

LogResults <- c('Pred. Accuracy',mean(fullLMAcc), mean(bucketLMAcc), mean(exclLMAcc))
RFResults <- c('1-Error', 1- mean(fullRFError), 1 - mean(bucketRFError), 1 - mean(ExclRFError))
svmRResults <- c('Accuracy', mean(fullsvmR), mean(bucketsvmR), mean(exclsvmR))
bstLogResults <- c('Accuracy', mean(fullBstLogAcc), mean(bucketBstLogAcc), mean(exclBstLogAcc))
bstglmResults <- c('Accuracy', mean(fullBstglmAcc), mean(bucketBstglmAcc), mean(exclBstglmAcc))

resultvector <- c(LogResults, bstglmResults, RFResults, svmRResults, bstLogResults)
results <- matrix(resultvector,ncol=5,byrow=FALSE)
colnames(results) <- c("Logistic Reg.",'Boosted glm',"RandomForest","svmRadial", 'Boosted Log. reg.')
rownames(results) <- c("testing variable","full model","bucket model", 'excluding model')
resultstable <- as.table(results)

Results of running the models

resultstable

##                  Logistic Reg.     Boosted glm       RandomForest     
## testing variable Pred. Accuracy    Accuracy          1-Error          
## full model       0.828947368421053 0.825             0.818553817855157
## bucket model     0.822368421052632 0.826315789473684 0.814886709294547
## excluding model  0.821052631578947 0.836842105263158 0.803963868995558
##                  svmRadial         Boosted Log. reg.
## testing variable Accuracy          Accuracy         
## full model       0.830263157894737 0.784210526315789
## bucket model     0.813157894736842 0.798684210526316
## excluding model  0.807894736842105 0.786842105263158

As we can see in the results table, the differences between the means of the indicator variables don’t seem to vary very much. Before we take a look at which model is the best predictory model, we will if the mean values of the indicator variables differ (statistically) significantly. In order to do so, we will run an anova test for the results of the 5 models, taking into account each of the iterations of the model results. The bucket and excluded dataset results will be run against the mean of the full dataset. An alpha value of 0,05 (5%) will be used to evaluate the null-hypothesis. Finally, as an extra measure, the assumption of homogeneity of variance is tested.

Results of ANOVA testing the means

mean_diff <- function(full, bucket, excl){
  temp_df <- data.frame(fullBstglmAcc, bucketBstglmAcc, exclBstglmAcc)
  temp_aov <- aov(full ~ bucket + excl,temp_df)
  return(temp_aov)
}
#Anova for models 1 through 5
aov.1 <- mean_diff(fullLMAcc, bucketLMAcc, exclLMAcc)
aov.2 <- mean_diff(fullRFError, bucketRFError, ExclRFError)
aov.3 <- mean_diff(fullsvmR, bucketsvmR, exclsvmR)
aov.4 <- mean_diff(fullBstLogAcc, bucketBstLogAcc, exclBstLogAcc)
aov.5 <- mean_diff(fullBstglmAcc, bucketBstglmAcc, exclBstglmAcc)

summary(aov.1)

##             Df   Sum Sq   Mean Sq F value Pr(>F)
## bucket       1 0.000224 0.0002244   0.260  0.626
## excl         1 0.001350 0.0013504   1.564  0.251
## Residuals    7 0.006043 0.0008633

summary(aov.2)

##             Df    Sum Sq   Mean Sq F value Pr(>F)
## bucket       1 0.0000024 2.370e-06   0.031  0.864
## excl         1 0.0002321 2.321e-04   3.073  0.123
## Residuals    7 0.0005285 7.550e-05

summary(aov.3)

##             Df   Sum Sq  Mean Sq F value Pr(>F)
## bucket       1 0.000183 0.000183   0.161   0.70
## excl         1 0.001362 0.001362   1.198   0.31
## Residuals    7 0.007960 0.001137

summary(aov.4)

##             Df   Sum Sq  Mean Sq F value Pr(>F)
## bucket       1 0.001172 0.001172   0.478  0.512
## excl         1 0.001813 0.001813   0.739  0.418
## Residuals    7 0.017168 0.002452

summary(aov.5)

##             Df    Sum Sq   Mean Sq F value Pr(>F)
## bucket       1 0.0007531 0.0007531   2.029  0.197
## excl         1 0.0004750 0.0004750   1.280  0.295
## Residuals    7 0.0025980 0.0003711

#finally, testing the homogeneity assumption of variance 
par(mfrow=c(2,2))
plot(aov.1, 1)
plot(aov.2, 1)
plot(aov.3, 1)
plot(aov.4, 1)

plot(aov.5, 1)

These results will be discussed in the analysis & discussion section.

Taking a look at the three CARET models

These models are the most promising models, by looking at the plots we will determine the best model.

results <- resamples(list(Glmboost=glmboost, svmRadial=svmRfit, BoostedLog=blrFit))
#summary(results) #commented out as no longer included in analysis & discussion
bwplot(results)

Analysis & discussion

Let’s start with the three created dataframes. The assumption of homogeneity test was passed. As we can see in the ANOVA analyses, there is not a single significant difference in the predictive power of the models created. This was determined by looking if there is a difference in the mean of the variable displaying the predictive strengths of the model. I find this to be quite surprising, as I had expected there to be significant differences. However, we have to keep in mind that I have only analyzed a single indicator variable, and other properties of the models were kept outside of the scope. In accordance to the results, we can conclude that the removal of the low-correlating variables, and the catigorization of the variables, did not have a significant impact on the predictive power of the resulting models. This concludes the main goal of the analysis.

Next, let’s consider the tested models. We have taken a look at 5 models, and have run them multiple times in order to prevent sampling bias. According to the results plot, the GLMBoost model and svmRadial model both have comparitively high predictive strengths. Out of all models, boosted logistic regression seems to have the least predictive power. My goal in this regard was to become more familiar with these models, which I have. RandomForest is the model which I am most comfortable with, as it is a model which can be used very quickly and easily. Lastly I was surprised by the detail, ease of use and documentation of the base R glm model, and it will be my go-to regression model in the future. The caret library has too many models to count (238 if you do try), and I’m excited to keep learning more about these models in future projects, especially since it has many helper functions which can be used for improving the feature engineering.

After analysing and running the models, I realized that I could’ve tested various models more easily using the trainControl method from the caret package. This is well documented in this article, which I found only after finishing my own analysis. This shows that I still have much to learn while using R, and I look forward to keep getting better at it!

Thank you for taking the time to read through the analysis. I look forward to using my newly learned R skills in future projects!

Jakob Serlier

#Sources for code bits and pieces:
#Age bins source: https://stackoverflow.com/questions/12979456/r-code-to-categorize-age-into-group-bins-breaks
#Valid & train set source: https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function
#Randomforest source: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
#caret inspiration: https://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/
Logistic regression: #https://stats.idre.ucla.edu/r/dae/logit-regression/
#http://www.utstat.toronto.edu/~brunner/oldclass/appliedf11/handouts/2101f11StepwiseLogisticR.pdf
#https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/

For more of my projects visit https://jakobs.dev.

Determining model preditive strengths on Clevelent heart-disease dataset using different preprocessing techniques

Jakob Serlier

12/11/2019