Review Steve's code - Week 11

report

week 11

renan

For Week 11, we are reviewing Steve's code to make it presentable.

Author

Introduction

This report reproduces Steve's code from baseFirthFlic1116.qmd and RFirth11116asfactor_allbutflac.R using the Stroke Prediction Dataset.

1. Setup and Data Loading

First, we load the required R packages and the dataset. The dataset is publicly available on Kaggle and was originally created by McKinsey & Company [1].

1.1 Load Libraries

Code

# options(repos = c(CRAN = "https://cloud.r-project.org"))

packages <- c("dplyr", "car", "ResourceSelection", "caret", "pROC",  "logistf", "Hmisc", "rcompanion", "ggplot2", "summarytools", "tidyverse", "knitr")
# install.packages(packages)

We can use the following to check whether a given package (here, yardstick) is installed:

```{r}
renv::activate("website")
"yardstick" %in% rownames(installed.packages())
```
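The same check can be extended to the whole packages vector. The following is a minimal sketch and not part of Steve's original code: the helper name find_missing_packages is hypothetical, and the install call is left commented out, as in the chunk above.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
find_missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Same package list as in the report
packages <- c("dplyr", "car", "ResourceSelection", "caret", "pROC",
              "logistf", "Hmisc", "rcompanion", "ggplot2",
              "summarytools", "tidyverse", "knitr")

missing_packages <- find_missing_packages(packages)

# Install only what is missing (uncomment to run)
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

Installing only the missing packages avoids re-downloading everything each time the report is rendered.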

Code

# Attach every package; invisible() suppresses the list of attached
# packages that lapply() would otherwise print after each library() call
invisible(lapply(packages, library, character.only = TRUE))

Code

# Set seed for reproducibility
set.seed(123)

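Attaching this many packages produces name conflicts (for example, dplyr::filter() masks stats::filter()). We might need to deal with these conflicts later.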
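One way to inspect and sidestep such conflicts is sketched below, using base R only. conflicts() lists names that appear more than once on the search path, and the pkg::fun form calls a specific package's version regardless of attach order. The numeric input is purely illustrative.

```r
# Names defined in more than one attached package or environment,
# grouped by the search-path entry they come from
masked <- conflicts(detail = TRUE)

# Qualify the call with its namespace so the intended function is used
# even when another package (for example dplyr) masks filter()
moving_sum <- stats::filter(1:5, rep(1, 3))
as.numeric(moving_sum)
# NA  6  9 12 NA
```

Qualifying only the calls that are actually masked keeps the rest of the code unchanged.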

1.2 Load Data

We load the dataset and clean it based on the exploration done in Week 5. The id column is unnecessary for prediction, and only two gender categories are significant for prediction.

The code below loads stroke.csv, applies the necessary changes, and stores the result in the data frame stroke1.

Code

find_git_root <- function(start = getwd()) {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  while (path != dirname(path)) {
    if (dir.exists(file.path(path, ".git"))) return(path)
    path <- dirname(path)
  }
  stop("No .git directory found — are you inside a Git repository?")
}

repo_root <- find_git_root()
datasets_path <- file.path(repo_root, "datasets")

# Read the data file (the same one Renan obtained for us)
steve_dataset_path <- file.path(datasets_path, "steve", "stroke.csv")
stroke1 <- read_csv(steve_dataset_path, show_col_types = FALSE)
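The cleaning described above (dropping id and keeping two genders) could look like the sketch below. It runs on a small made-up data frame rather than stroke.csv. The column names id and gender, the level names, and the names toy_stroke and toy_clean are assumptions for illustration, and Steve's actual cleaning may differ.

```r
# Toy data standing in for stroke.csv (all values are made up)
toy_stroke <- data.frame(
  id     = c(101, 102, 103, 104),
  gender = c("Male", "Female", "Other", "Female"),
  age    = c(67, 61, 26, 49),
  stroke = c(1, 1, 0, 0)
)

# Keep only the two gender categories used for prediction
keep_rows <- toy_stroke$gender %in% c("Male", "Female")

# Drop the id column, which carries no predictive information
keep_cols <- setdiff(names(toy_stroke), "id")

toy_clean <- toy_stroke[keep_rows, keep_cols]

# Store gender as a factor so models treat it as categorical
toy_clean$gender <- factor(toy_clean$gender)
```

The same two steps would apply to stroke1 once the real column names are confirmed.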

References

1. fedesoriano. (n.d.). Stroke Prediction Dataset. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset