library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(fitdistrplus)
library(gsheet)
library(boot)
library(readr)

Master of Data Science Program @ The University of West Florida (UWF)
This week we are getting started on how to set up a Quarto and R project for proper collaboration.
This post will demonstrate how to install renv, initialize your renv environment, and then load a dataset and do some demonstrations manipulating it.
We will be using the dataset: Stroke Prediction Dataset
renv is a package manager that helps you create reproducible environments for your R projects.
Install the latest version of renv from CRAN with:
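The install is the standard CRAN one-liner:

```r
# Install the latest release of renv from CRAN
install.packages("renv")
```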
Use renv::init() to initialize renv in a new or existing project. This will set up a project library, containing all the packages you’re currently using. The packages (and all the metadata needed to reinstall them) are recorded into a lockfile, renv.lock, and a .Rprofile ensures that the library is used every time you open that project.
As you continue to work on your project, you will install and upgrade packages, either using install.packages() and update.packages() or renv::install() and renv::update(). After you’ve confirmed your code works as expected, use renv::snapshot() to record the packages and their sources in the lockfile.
Later, if you need to share your code with someone else or run your code on a new machine, your collaborator (or you) can call renv::restore() to reinstall the specific package versions recorded in the lockfile.
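The workflow described above boils down to a handful of calls; a minimal sketch (these modify the project, so run them from the project root):

```r
renv::init()            # create the project library, renv.lock, and .Rprofile
renv::install("dplyr")  # example: install a package into the project library
renv::snapshot()        # record package versions and sources in renv.lock
renv::restore()         # on another machine, reinstall the versions in renv.lock
```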
If this is your first time using renv, we strongly recommend starting with the Introduction to renv vignette: this will help you understand the most important verbs and nouns of renv.
If you have a question about renv, please first check the FAQ to see whether your question has already been addressed. If it hasn’t, please feel free to ask on the Posit Forum.
If you believe you’ve found a bug in renv, please file a bug (and, if possible, a reproducible example) at https://github.com/rstudio/renv/issues.
Get the packages set up:
This should find the path to the datasets folder programmatically.
find_git_root <- function(start = getwd()) {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  while (path != dirname(path)) {
    if (dir.exists(file.path(path, ".git"))) return(path)
    path <- dirname(path)
  }
  stop("No .git directory found — are you inside a Git repository?")
}
repo_root <- find_git_root()
datasets_path <- file.path(repo_root, "datasets")
# repo_root
# datasets_path

Now we define the dataset we want to load; healthcare-dataset-stroke-data.csv will be inside kaggle-healthcare-dataset-stroke-data.
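A sketch of the load step; the object name `kaggle_data1` is taken from the later chunks, and the subfolder name from the text above:

```r
library(readr)

stroke_csv <- file.path(datasets_path,
                        "kaggle-healthcare-dataset-stroke-data",
                        "healthcare-dataset-stroke-data.csv")
kaggle_data1 <- read_csv(stroke_csv, show_col_types = FALSE)
head(kaggle_data1)
```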
Exploring the dataset, BMI is not stored as a numeric value; in addition, the missing fields are stored as the text “N/A”.
# A tibble: 6 × 12
id gender age hypertension heart_disease ever_married work_type
<dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 9046 Male 67 0 1 Yes Private
2 51676 Female 61 0 0 Yes Self-employed
3 31112 Male 80 0 1 Yes Private
4 60182 Female 49 0 0 Yes Private
5 1665 Female 79 1 0 Yes Self-employed
6 56669 Male 81 0 0 Yes Private
# ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <chr>,
# smoking_status <chr>, stroke <dbl>
id gender age hypertension
0 0 0 0
heart_disease ever_married work_type Residence_type
0 0 0 0
avg_glucose_level bmi smoking_status stroke
0 0 0 0
Apparently there are no NA values. Let’s continue.
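The zero counts above come from a check along these lines (a sketch, since the original chunk is folded). Note it only detects real `NA` values, not the `"N/A"` text:

```r
colSums(is.na(kaggle_data1))
```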
id gender age hypertension
Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
Median :36932 Mode :character Median :45.00 Median :0.00000
Mean :36518 Mean :43.23 Mean :0.09746
3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
Max. :72940 Max. :82.00 Max. :1.00000
heart_disease ever_married work_type Residence_type
Min. :0.00000 Length:5110 Length:5110 Length:5110
1st Qu.:0.00000 Class :character Class :character Class :character
Median :0.00000 Mode :character Mode :character Mode :character
Mean :0.05401
3rd Qu.:0.00000
Max. :1.00000
avg_glucose_level bmi smoking_status stroke
Min. : 55.12 Length:5110 Length:5110 Min. :0.00000
1st Qu.: 77.25 Class :character Class :character 1st Qu.:0.00000
Median : 91.89 Mode :character Mode :character Median :0.00000
Mean :106.15 Mean :0.04873
3rd Qu.:114.09 3rd Qu.:0.00000
Max. :271.74 Max. :1.00000
Rows: 5,110
Columns: 12
$ id <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
$ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
$ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ heart_disease <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
$ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
$ work_type <chr> "Private", "Self-employed", "Private", "Private", "S…
$ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
$ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
$ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
$ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "…
$ stroke <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Summary gives some interesting insights, but glimpse shows that there are NA values; even worse, the BMI values are stored as strings and should be numeric.
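The two views above come from base `summary()` and `dplyr::glimpse()`:

```r
library(dplyr)

summary(kaggle_data1)   # per-column five-number summaries
glimpse(kaggle_data1)   # column types and a preview of the values
```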
Now let’s explore the categorical and numeric variables.
# check categorical variables
library(dplyr)
library(tidyr)
# Check one by one, let's see what we got
# kaggle_data1 %>% count(gender)
# kaggle_data1 %>% count(hypertension)
# kaggle_data1 %>% count(heart_disease)
# kaggle_data1 %>% count(ever_married)
# kaggle_data1 %>% count(work_type)
# kaggle_data1 %>% count(Residence_type )
# kaggle_data1 %>% count(smoking_status)
# kaggle_data1 %>% count(stroke)
# Now make it a little cleaner
cat_vars <- c("gender", "hypertension", "heart_disease", "ever_married",
              "work_type", "Residence_type", "smoking_status", "stroke")

kaggle_data1[, cat_vars] %>%
  # Convert all to character to avoid type conflicts
  mutate_all(as.character) %>%
  pivot_longer(cols = names(.), names_to = "variable", values_to = "value") %>%
  count(variable, value) %>%
  arrange(variable, desc(n)) %>%
  print(n = 22)

# A tibble: 22 × 3
variable value n
<chr> <chr> <int>
1 Residence_type Urban 2596
2 Residence_type Rural 2514
3 ever_married Yes 3353
4 ever_married No 1757
5 gender Female 2994
6 gender Male 2115
7 gender Other 1
8 heart_disease 0 4834
9 heart_disease 1 276
10 hypertension 0 4612
11 hypertension 1 498
12 smoking_status never smoked 1892
13 smoking_status Unknown 1544
14 smoking_status formerly smoked 885
15 smoking_status smokes 789
16 stroke 0 4861
17 stroke 1 249
18 work_type Private 2925
19 work_type Self-employed 819
20 work_type children 687
21 work_type Govt_job 657
22 work_type Never_worked 22
It’s pretty interesting. Now let’s see what happens with the numeric variables.
id age hypertension heart_disease
Min. : 67 Min. : 0.08 Min. :0.00000 Min. :0.00000
1st Qu.:17741 1st Qu.:25.00 1st Qu.:0.00000 1st Qu.:0.00000
Median :36932 Median :45.00 Median :0.00000 Median :0.00000
Mean :36518 Mean :43.23 Mean :0.09746 Mean :0.05401
3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :72940 Max. :82.00 Max. :1.00000 Max. :1.00000
avg_glucose_level stroke
Min. : 55.12 Min. :0.00000
1st Qu.: 77.25 1st Qu.:0.00000
Median : 91.89 Median :0.00000
Mean :106.15 Mean :0.04873
3rd Qu.:114.09 3rd Qu.:0.00000
Max. :271.74 Max. :1.00000
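The numeric-only summary above can be reproduced with a `select(where(is.numeric))` sketch:

```r
library(dplyr)

kaggle_data1 %>%
  select(where(is.numeric)) %>%  # keep only the numeric columns
  summary()
```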
We need to deal with the BMI column, which has missing values and is not stored as numeric.
# unique(kaggle_data1$bmi)
kaggle_data2 <- kaggle_data1 %>%
  mutate(bmi = na_if(bmi, "N/A")) %>%  # Convert the "N/A" string to NA
  mutate(bmi = as.numeric(bmi))        # Convert from character to numeric

# kaggle_data2 <- kaggle_data1 %>% mutate(bmi = as.numeric(na_if(bmi, "N/A")))

# Check if it worked
str(kaggle_data2$bmi)

 num [1:5110] 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
[1] 201
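The `[1] 201` above is presumably the count of missing BMI values after the conversion, e.g.:

```r
sum(is.na(kaggle_data2$bmi))  # how many BMI values are now NA
```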
id age hypertension heart_disease
Min. : 67 Min. : 0.08 Min. :0.00000 Min. :0.00000
1st Qu.:17741 1st Qu.:25.00 1st Qu.:0.00000 1st Qu.:0.00000
Median :36932 Median :45.00 Median :0.00000 Median :0.00000
Mean :36518 Mean :43.23 Mean :0.09746 Mean :0.05401
3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :72940 Max. :82.00 Max. :1.00000 Max. :1.00000
avg_glucose_level bmi stroke
Min. : 55.12 Min. :10.30 Min. :0.00000
1st Qu.: 77.25 1st Qu.:23.50 1st Qu.:0.00000
Median : 91.89 Median :28.10 Median :0.00000
Mean :106.15 Mean :28.89 Mean :0.04873
3rd Qu.:114.09 3rd Qu.:33.10 3rd Qu.:0.00000
Max. :271.74 Max. :97.60 Max. :1.00000
NA's :201
The dataset is imbalanced and has many issues; there are several research works that explore solutions:
The paper Predictive modelling and identification of key risk factors for stroke using machine learning makes several contributions and adds a lot of insights:
Exploring various data imputation techniques and addressing data imbalance issues in order to enhance the accuracy and robustness of stroke prediction models.
Identifying crucial features for stroke prediction and uncovering previously unknown risk factors, giving a comprehensive understanding of stroke risk assessment.
Creating an augmented dataset incorporating important key risk factor features using the imputed datasets, enhancing the effectiveness of stroke prediction models.
Assessing the effectiveness of advanced machine learning models across different datasets and creating a robust Dense Stacking Ensemble model for stroke prediction.
The key contribution is showcasing the model’s enhanced predictive capabilities in accurately identifying strokes, surpassing the performance of prior studies that used the same dataset.
Large datasets might need GitHub LFS, which is not set up; therefore they must be stored externally.
Quarto websites, when combined with Python and R, offer a powerful way to create dynamic, data-driven content and visually rich presentations.
However, there are limitations: the GitHub Actions runner is not powerful, and this must be taken into consideration before submitting the project for rendering. Future work will evaluate solutions to the computational budget limitations of the GitHub Actions runner.
How to efficiently break up a computationally heavy article into separate notebooks? (#8410)
Some have mentioned that the project can be split into sections.