# options(repos = c(CRAN = "https://cloud.r-project.org"))
packages <- c("dplyr", "car", "ResourceSelection", "caret", "pROC", "logistf", "Hmisc", "rcompanion", "ggplot2", "summarytools", "tidyverse", "knitr")
# install.packages(packages)
Reproducing Steve's code from baseFirthFlic1116.qmd and RFirth11116asfactor_allbutflac.R using the Stroke Prediction Dataset.
First, we need to load the required R packages and the dataset. The dataset is publicly available on Kaggle and was originally created by McKinsey & Company[1].
We can use this to check installed packages:
[[1]]
[1] "dplyr" "stats" "graphics" "grDevices" "datasets" "utils"
[7] "methods" "base"
[[2]]
[1] "car" "carData" "dplyr" "stats" "graphics" "grDevices"
[7] "datasets" "utils" "methods" "base"
[[3]]
[1] "ResourceSelection" "car" "carData"
[4] "dplyr" "stats" "graphics"
[7] "grDevices" "datasets" "utils"
[10] "methods" "base"
[[4]]
[1] "caret" "lattice" "ggplot2"
[4] "ResourceSelection" "car" "carData"
[7] "dplyr" "stats" "graphics"
[10] "grDevices" "datasets" "utils"
[13] "methods" "base"
[[5]]
[1] "pROC" "caret" "lattice"
[4] "ggplot2" "ResourceSelection" "car"
[7] "carData" "dplyr" "stats"
[10] "graphics" "grDevices" "datasets"
[13] "utils" "methods" "base"
[[6]]
[1] "logistf" "pROC" "caret"
[4] "lattice" "ggplot2" "ResourceSelection"
[7] "car" "carData" "dplyr"
[10] "stats" "graphics" "grDevices"
[13] "datasets" "utils" "methods"
[16] "base"
[[7]]
[1] "Hmisc" "logistf" "pROC"
[4] "caret" "lattice" "ggplot2"
[7] "ResourceSelection" "car" "carData"
[10] "dplyr" "stats" "graphics"
[13] "grDevices" "datasets" "utils"
[16] "methods" "base"
[[8]]
[1] "rcompanion" "Hmisc" "logistf"
[4] "pROC" "caret" "lattice"
[7] "ggplot2" "ResourceSelection" "car"
[10] "carData" "dplyr" "stats"
[13] "graphics" "grDevices" "datasets"
[16] "utils" "methods" "base"
[[9]]
[1] "rcompanion" "Hmisc" "logistf"
[4] "pROC" "caret" "lattice"
[7] "ggplot2" "ResourceSelection" "car"
[10] "carData" "dplyr" "stats"
[13] "graphics" "grDevices" "datasets"
[16] "utils" "methods" "base"
[[10]]
[1] "summarytools" "rcompanion" "Hmisc"
[4] "logistf" "pROC" "caret"
[7] "lattice" "ggplot2" "ResourceSelection"
[10] "car" "carData" "dplyr"
[13] "stats" "graphics" "grDevices"
[16] "datasets" "utils" "methods"
[19] "base"
[[11]]
[1] "lubridate" "forcats" "stringr"
[4] "purrr" "readr" "tidyr"
[7] "tibble" "tidyverse" "summarytools"
[10] "rcompanion" "Hmisc" "logistf"
[13] "pROC" "caret" "lattice"
[16] "ggplot2" "ResourceSelection" "car"
[19] "carData" "dplyr" "stats"
[22] "graphics" "grDevices" "datasets"
[25] "utils" "methods" "base"
[[12]]
[1] "knitr" "lubridate" "forcats"
[4] "stringr" "purrr" "readr"
[7] "tidyr" "tibble" "tidyverse"
[10] "summarytools" "rcompanion" "Hmisc"
[13] "logistf" "pROC" "caret"
[16] "lattice" "ggplot2" "ResourceSelection"
[19] "car" "carData" "dplyr"
[22] "stats" "graphics" "grDevices"
[25] "datasets" "utils" "methods"
[28] "base"
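The nested listing above has the shape that `lapply(packages, library, character.only = TRUE)` returns: one element per `library()` call, each showing the packages attached at that point. The chunk that produced it is not shown here, so the following is only a minimal sketch of one way to check which packages are installed before attaching them. The names `installed` and `missing_pkgs` are illustrative and not from the original code.

```r
packages <- c("dplyr", "car", "ResourceSelection", "caret", "pROC", "logistf",
              "Hmisc", "rcompanion", "ggplot2", "summarytools", "tidyverse", "knitr")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise,
# so this gives a named logical vector with one entry per package.
installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs install.packages().
missing_pkgs <- packages[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available; invisible() hides the
# nested list of attached packages that lapply() would otherwise print.
invisible(lapply(packages[installed], library, character.only = TRUE))
```

Wrapping the call in `invisible()` is a design choice: it keeps the rendered document free of the long attachment listing while still loading everything that is available.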
Might need to deal with the conflicts later:
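As a hedged sketch of how the conflicts could be inspected and sidestepped: base R's `conflicts()` lists object names that exist in more than one attached package, and the `pkg::fun` form calls a specific version regardless of masking. The `mtcars` example and the variable names below are illustrative assumptions, not part of the original analysis.

```r
# Names that are defined in more than one attached package or environment.
masked_names <- conflicts()

# detail = TRUE groups the conflicting names by the search-path entry
# that defines them, which shows who masks whom.
masked_by_pkg <- conflicts(detail = TRUE)

# A typical clash once dplyr is attached: filter() exists in both stats and dplyr.
# Qualifying the call with the namespace makes the intent explicit.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))  # moving-average linear filter

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Row subsetting with the dplyr version, independent of attachment order.
  efficient_cars <- dplyr::filter(mtcars, mpg > 30)
}
```

Using `pkg::fun` for the handful of functions that clash is usually enough for a script of this size and avoids depending on the order of the `library()` calls.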
We load the dataset and prepare it based on the exploration done in Week 5. The id column is unnecessary for prediction, and only two gender categories are significant for prediction.
Below, we load stroke.csv, make the necessary changes to the dataset, and store the result in the data frame stroke1.
# Walk up from `start` until a directory containing .git is found.
find_git_root <- function(start = getwd()) {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  while (path != dirname(path)) {
    if (dir.exists(file.path(path, ".git"))) return(path)
    path <- dirname(path)
  }
  stop("No .git directory found. Are you inside a Git repository?")
}

repo_root <- find_git_root()
datasets_path <- file.path(repo_root, "datasets")

# Read the data file (the same one you got for us, Renan).
steve_dataset_path <- file.path(datasets_path, "steve", "stroke.csv")
# read_csv() comes from readr, which is attached via tidyverse.
stroke1 <- read_csv(steve_dataset_path, show_col_types = FALSE)
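The changes described above (dropping the id column and keeping two gender categories) are not shown in this chunk. The following is a minimal sketch of what they could look like, run on a small made-up data frame so that it works without the CSV. The column names `id`, `gender`, `age`, and `stroke`, the `"Other"` level, and all values in `stroke_toy` are assumptions for illustration only.

```r
# Toy stand-in for stroke1; columns and values are illustrative assumptions.
stroke_toy <- data.frame(
  id     = c(101, 102, 103, 104),
  gender = c("Male", "Female", "Other", "Female"),
  age    = c(67, 49, 26, 80),
  stroke = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Keep only the two gender categories used for modelling.
keep_rows <- stroke_toy$gender %in% c("Male", "Female")

# Drop the identifier column, which carries no predictive information.
keep_cols <- setdiff(names(stroke_toy), "id")

stroke_clean <- stroke_toy[keep_rows, keep_cols]

# Store gender as a factor so model functions treat it as categorical.
stroke_clean$gender <- factor(stroke_clean$gender)
```

The same steps can be written with dplyr as `select(-id)` followed by `filter(gender %in% c("Male", "Female"))`; base R is used here only to keep the sketch free of package dependencies.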