First, we install all required packages and system dependencies and resolve any conflicts in order to produce a new renv.lock file.
1.1 Load Libraries
Code
# Run this once to install all the necessary packages
# install.packages("dplyr")
# install.packages("car")
# install.packages("ResourceSelection")
# install.packages("caret")
# install.packages("rcompanion")
# install.packages("pROC")
# install.packages("cvAUC")
# For data manipulation and visualization
library(tidyverse)
library(ggplot2)
library(corrplot)
library(knitr)
library(ggpubr)

# For data preprocessing and modeling
library(mice)
library(ROSE)    # For SMOTE-style resampling
library(ranger)  # A fast implementation of random forests

# For stacking/ensemble models
library(stacks)
library(tidymodels)
library(themis)
library(gghighlight)
library(pscl)
library(dplyr)
library(car)
library(ResourceSelection)
library(caret)
library(rcompanion)
library(Hmisc)
library(pROC)
library(cvAUC)

# Set seed for reproducibility
set.seed(123)
Important: Possible Dependency Conflicts
We still need to analyse whether there are package conflicts and system dependency issues.
1.2 Load Data
We will use my original dataset as well as Steve's dataset, and compare them for differences.
Renan: kaggle_data1
Steve: stroke1
1.2.1 Renan Dataset
Below we load healthcare-dataset-stroke-data.csv, perform the necessary changes, and load it into the data frame kaggle_data1.
Code
find_git_root <- function(start = getwd()) {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  while (path != dirname(path)) {
    if (dir.exists(file.path(path, ".git"))) return(path)
    path <- dirname(path)
  }
  stop("No .git directory found — are you inside a Git repository?")
}

repo_root <- find_git_root()
datasets_path <- file.path(repo_root, "datasets")
kaggle_dataset_path <- file.path(datasets_path, "kaggle-healthcare-dataset-stroke-data/healthcare-dataset-stroke-data.csv")

kaggle_data1 <- read_csv(kaggle_dataset_path, show_col_types = FALSE)

# unique(kaggle_data1$bmi)
kaggle_data1 <- kaggle_data1 %>%
  mutate(bmi = na_if(bmi, "N/A")) %>%  # Convert "N/A" string to NA
  mutate(bmi = as.numeric(bmi))        # Convert from character to numeric

# Remove the 'Other' gender row and the 'id' column
kaggle_data1 <- kaggle_data1 %>%
  filter(gender != "Other") %>%
  select(-id) %>%
  mutate_if(is.character, as.factor)   # Convert character columns to factors for easier modeling
1.2.2 Steve Dataset
Below we load stroke.csv, perform the necessary changes, and load it into the data frame stroke1.
Code
# Reading the data file in (the same one you got for us, Renan)
steve_dataset_path <- file.path(datasets_path, "steve/stroke.csv")
stroke1 <- read_csv(steve_dataset_path, show_col_types = FALSE)
1.3 Prepare Dataset
For each column, remove the unnecessary or unusable values:
Smoking status - remove "Unknown"
bmi - remove "N/A"
Work type - remove "children"
age - convert to a numeric variable rounded to 2 decimal places
gender - remove "Other"
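The cleaning steps above can be sketched with a dplyr pipeline. This is a minimal sketch, assuming the Kaggle column names (gender, age, bmi, work_type, smoking_status); the toy rows standing in for stroke1 are illustrative only.

```r
library(dplyr)

# Toy rows standing in for stroke1 (illustrative values only)
stroke1 <- data.frame(
  gender = c("Male", "Female", "Other"),
  age = c(67, 54.5, 30),
  bmi = c("28.1", "N/A", "22.0"),
  work_type = c("Private", "children", "Govt_job"),
  smoking_status = c("never smoked", "Unknown", "smokes"),
  stringsAsFactors = FALSE
)

stroke1_clean <- stroke1 %>%
  filter(smoking_status != "Unknown") %>%          # Smoking status: remove "Unknown"
  mutate(bmi = as.numeric(na_if(bmi, "N/A"))) %>%  # bmi: "N/A" string -> NA, then numeric
  filter(!is.na(bmi)) %>%                          # bmi: drop missing values
  filter(work_type != "children") %>%              # Work type: remove "children"
  mutate(age = round(as.numeric(age), 2)) %>%      # age: numeric, 2 decimal places
  filter(gender != "Other")                        # gender: remove "Other"
```

With the real data the same pipeline applies directly to the loaded stroke1 data frame.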
Exploring the dataset so we can plan how to proceed and identify possible changes.
# Converted all columns to numeric and removed id
str(stroke1_clean)
nrow(stroke1_clean)

LR_stroke1 <- stroke1_clean
str(LR_stroke1)

count_tables <- lapply(LR_stroke1, table)
count_tables
2. Apply Logistic Regression
Part 2: Create and run the logistic regression model on the dataset
# Part 2: Create and run the logistic regression model on the dataset
model <- glm(stroke ~ gender + age + hypertension + heart_disease + ever_married +
               work_type + Residence_type + avg_glucose_level + bmi + smoking_status,
             data = LR_stroke1, family = binomial)
summary(model)
Call:
glm(formula = stroke ~ gender + age + hypertension + heart_disease +
ever_married + work_type + Residence_type + avg_glucose_level +
bmi + smoking_status, family = binomial, data = LR_stroke1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.426854 0.873243 -9.650 < 2e-16 ***
gender 0.080370 0.167274 0.480 0.630893
age 0.070967 0.006845 10.368 < 2e-16 ***
hypertension 0.570797 0.182580 3.126 0.001770 **
heart_disease 0.417884 0.220311 1.897 0.057856 .
ever_married 0.174316 0.261832 0.666 0.505569
work_type -0.109615 0.126101 -0.869 0.384703
Residence_type 0.005932 0.162188 0.037 0.970822
avg_glucose_level 0.004658 0.001375 3.388 0.000704 ***
bmi 0.006275 0.012875 0.487 0.625954
smoking_status 0.179921 0.106431 1.691 0.090932 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1403.5 on 3356 degrees of freedom
Residual deviance: 1145.4 on 3346 degrees of freedom
AIC: 1167.4
Number of Fisher Scoring iterations: 7
Important: Issue - Unbalanced Dataset
The sample \(\text{stroke rate} = 180/3357 \approx 5.4\%\), but the stroke rate in the US is 3.1% according to the CDC, AHA, and NCHS.
The dataset therefore over-represents stroke by roughly 74%. We could apply SMOTE or under/over-sampling, but that would take too much time.
The quickest fix is to recalibrate the intercept for this model (i.e., change it here) so it can be used from now on.
# Issue - Unbalanced dataset
# The sample stroke rate = 180/3357 ~ 5.4%, but the US stroke rate is 3.1%
# (CDC, AHA, NCHS). We could SMOTE or under/over-sample, but the quickest fix
# is to recalibrate the intercept so the model matches the population rate.
ds_prev  <- 0.054
pop_prev <- 0.031
log_odds_ds  <- qlogis(ds_prev)
log_odds_pop <- qlogis(pop_prev)
offset <- log_odds_pop - log_odds_ds

coefs <- coef(model)
coefs[1] <- coefs[1] + offset
print(coefs)
Changed the intercept coefficient to account for the actual US stroke rate of 3.1%: the new intercept is -9.005873.
All the other coefficients remain the same.
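The arithmetic behind the recalibration can be checked numerically. This is a minimal sketch reproducing the offset from the two prevalences quoted above; ds_prev, pop_prev, and new_intercept are illustrative names, and with a fitted glm() the calibrated probabilities would come from plogis(predict(model, type = "link") + offset).

```r
# Shift the intercept by the difference in log-odds of the two prevalences
ds_prev  <- 0.054   # sample stroke rate (~5.4%)
pop_prev <- 0.031   # US population stroke rate
offset <- qlogis(pop_prev) - qlogis(ds_prev)

old_intercept <- -8.426854          # reported intercept from summary(model)
new_intercept <- old_intercept + offset
print(new_intercept)                # about -9.0059, matching the value above
```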
3. Testing Logistic Regression Model Assumptions
Part 3: Testing the logistic regression model assumptions
There are several assumptions for logistic regression. They are:
The dependent variable is binary (i.e., 0 or 1)
There is a linear relationship between the logit of the outcome and each predictor
There are no high-leverage outliers in the predictors
There is no high multicollinearity (i.e., strong correlations) between predictors
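The multicollinearity assumption can be screened with variance inflation factors from the car package loaded above. This is a minimal sketch on toy data (toy and toy_model are illustrative names); with the real model the call would simply be vif(model), and GVIF values above roughly 5-10 suggest problematic collinearity.

```r
library(car)

# Toy logistic fit with two independent predictors (VIFs should be near 1)
set.seed(123)
toy <- data.frame(
  y  = rbinom(200, 1, 0.3),
  x1 = rnorm(200),
  x2 = rnorm(200)
)
toy_model <- glm(y ~ x1 + x2, data = toy, family = binomial)
vif(toy_model)   # one VIF per predictor; large values flag collinearity
```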
3.1 Testing Assumption 1
Testing Assumption 1: The Dependent Variable is binary (0 or 1)
unique(LR_stroke1$stroke)
[1] 1 0
3.2 Testing Assumption 2
Testing Assumption 2: There is a linear relationship between the logit of the outcome and each predictor
First, adjust all predictors so all values are positive.
Conclusion: for the continuous variables ageadj, avg_glucose_leveladj, and bmiadj, the residual plots show linearity.
Conclusion: all the other predictors are categorical; with the magenta line flat and the values clustering around discrete levels, they are also appropriate for logistic regression.
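One way to visualize this check is to plot each continuous predictor against the logit of the fitted probabilities; a roughly straight trend supports the assumption. This is a minimal sketch on simulated data (toy, fit, and logit_hat are illustrative names); the real check would use LR_stroke1 and the fitted model.

```r
# Simulate an age effect and check logit-linearity visually
set.seed(123)
toy <- data.frame(age = runif(300, 20, 80))
toy$stroke <- rbinom(300, 1, plogis(-7 + 0.08 * toy$age))

fit <- glm(stroke ~ age, data = toy, family = binomial)
logit_hat <- qlogis(fitted(fit))   # logit of the predicted probabilities

plot(toy$age, logit_hat,
     xlab = "age", ylab = "logit(p-hat)",
     main = "Logit vs. age (should look linear)")
```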
# Fit the logistic regression model
model_CM <- glm(stroke ~ gender + age + hypertension + heart_disease + ever_married +
                  work_type + Residence_type + avg_glucose_level + bmi + smoking_status,
                data = LR_stroke1, family = binomial)
Get Predicted Probabilities for each observation
# Get predicted probabilities for each observation
pred_prob <- predict(model_CM, type = "response")
Create 10 confusion matrices at threshold intervals between 1 and 0 to build a ROC curve.
Classify predictions using a threshold (0.5 is common, but it can be adjusted).
If the '1' prediction row is all 0's, the model shows no predictive ability at that threshold.
At a threshold of around 1.0:
# Create 10 confusion matrices at thresholds between 1 and 0 to build a ROC
# Classify predictions using a threshold (0.5 is common but can be adjusted)
# If the '1' row is all 0's, the model shows no predictive ability
# At threshold of around 1.0
pred_class <- factor(ifelse(pred_prob > 0.99, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3177 180
1 0 0
If the '1' prediction row is all 0's, the model shows no predictive ability at that threshold.
At a threshold of 0.9:
# At threshold of 0.9
pred_class <- factor(ifelse(pred_prob > 0.9, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3177 180
1 0 0
If the '1' prediction row is all 0's, the model shows no predictive ability at that threshold.
At a threshold of 0.8:
# At threshold of 0.8
pred_class <- factor(ifelse(pred_prob > 0.8, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3177 180
1 0 0
If the '1' prediction row is all 0's, the model shows no predictive ability at that threshold.
At a threshold of 0.7:
# At threshold of 0.7
pred_class <- factor(ifelse(pred_prob > 0.7, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3177 180
1 0 0
At a threshold of 0.6:
# At threshold of 0.6
pred_class <- factor(ifelse(pred_prob > 0.6, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3177 179
1 0 1
At a threshold of 0.6 the model begins to show predictive ability.
Extract precision, recall, and F1 from the confusion matrix using the caret package.
# Extract precision, recall, and F1 from the confusion matrix (caret)
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))
print(sprintf("F1: %f", f1))
[1] "F1: 0.011050"
print(sprintf("recall: %f", recall))
[1] "recall: 0.005556"
print(sprintf("precision: %f", precision))
[1] "precision: 1.000000"
At a threshold of 0.5:
# At threshold of 0.5
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3176 179
1 1 1
Extract precision, recall, and F1 from the confusion matrix using the caret package.
# Extract precision, recall, and F1 from the confusion matrix (caret)
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))
print(sprintf("F1: %f", f1))
[1] "F1: 0.010989"
print(sprintf("recall: %f", recall))
[1] "recall: 0.005556"
print(sprintf("precision: %f", precision))
[1] "precision: 0.500000"
At a threshold of 0.4:
# At threshold of 0.4
pred_class <- factor(ifelse(pred_prob > 0.4, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3171 174
1 6 6
Extract precision, recall, and F1 from the confusion matrix using the caret package.
# Extract precision, recall, and F1 from the confusion matrix (caret)
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))
print(sprintf("F1: %f", f1))
[1] "F1: 0.062500"
print(sprintf("recall: %f", recall))
[1] "recall: 0.033333"
print(sprintf("precision: %f", precision))
[1] "precision: 0.500000"
At a threshold of 0.3:
# At threshold of 0.3
pred_class <- factor(ifelse(pred_prob > 0.3, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3140 164
1 37 16
Extract precision, recall, and F1 from the confusion matrix using the caret package.
# Extract precision, recall, and F1 from the confusion matrix (caret)
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))
print(sprintf("F1: %f", f1))
[1] "F1: 0.137339"
print(sprintf("recall: %f", recall))
[1] "recall: 0.088889"
print(sprintf("precision: %f", precision))
[1] "precision: 0.301887"
At a threshold of 0.2:
# At threshold of 0.2
pred_class <- factor(ifelse(pred_prob > 0.2, 1, 0), levels = c(0, 1))
conf_matrix <- confusionMatrix(pred_class,
                               factor(LR_stroke1$stroke, levels = c("0", "1")),
                               positive = "1")
print(conf_matrix$table)
Reference
Prediction 0 1
0 3040 134
1 137 46
Extract precision, recall, and F1 from the confusion matrix using the caret package.
# Extract precision, recall, and F1 from the confusion matrix (caret)
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))
print(sprintf("F1: %f", f1))
[1] "F1: 0.253444"
print(sprintf("recall: %f", recall))
[1] "recall: 0.255556"
print(sprintf("precision: %f", precision))
[1] "precision: 0.251366"
Using the different confusion matrices, create the ROC curve.
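The ROC curve can be drawn directly with the pROC package loaded above; with the real model this would be roc(LR_stroke1$stroke, pred_prob). The sketch below simulates toy labels and probabilities so the snippet is self-contained (y and p are illustrative names).

```r
library(pROC)

# Toy imbalanced labels and predicted probabilities (illustration only)
set.seed(123)
y <- rbinom(500, 1, 0.1)
p <- plogis(rnorm(500, mean = ifelse(y == 1, 0, -2)))

# ROC curve and AUC; levels/direction set explicitly for determinism
roc_obj <- roc(response = y, predictor = p, levels = c(0, 1), direction = "<")
plot(roc_obj, print.auc = TRUE)   # curve with AUC annotation
auc(roc_obj)
```

The ROC sweeps the same thresholds explored by hand above, one point per threshold.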
Conclusion
The original code was hard to read, with syntax mistakes, implementation errors, and system dependencies that I could not satisfy, so I reproduced it in Quarto using my best interpretation.
Ideas for improving precision, recall, and F1 score:
Address the imbalance by upsampling (adding stroke cases), downsampling (removing non-stroke cases), and/or SMOTE (synthetic data)
Change the classification threshold
Compare with alternative models such as random forests or XGBoost
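The rebalancing idea can be sketched with the ROSE package loaded above: ovun.sample combines over- and under-sampling toward a target class proportion. This is a sketch on toy data (toy and balanced are illustrative names); with the real data the formula would be stroke ~ . on LR_stroke1.

```r
library(ROSE)

# Toy imbalanced data standing in for LR_stroke1 (illustration only)
set.seed(123)
toy <- data.frame(
  stroke = factor(rbinom(1000, 1, 0.05)),
  age = rnorm(1000, 50, 15),
  bmi = rnorm(1000, 27, 5)
)

# Combined over/under-sampling toward a 50/50 class split
balanced <- ovun.sample(stroke ~ ., data = toy,
                        method = "both", N = nrow(toy), p = 0.5)$data
table(balanced$stroke)   # roughly balanced classes after resampling
```

The model would then be refit on the balanced data before re-running the threshold analysis.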