Project Setup - Week 5

getting started

week 5

renan

Week 5: setting up the Quarto website and getting started with the R project

Author

Affiliation

Renan Monteiro Barbosa

Master of Data Science Program @ The University of West Florida (UWF)

This week we get started on how to set up the Quarto and R project for proper collaboration.

This post demonstrates how to install renv, initiate your renv environment, load the dataset, and then run some demonstrations manipulating the dataset.

We will be using the dataset: Stroke Prediction Dataset

What is Renv

renv is a package manager that helps you create reproducible environments for your R projects.

Install the latest version of renv from CRAN with:

```{r}
install.packages("renv")
```

Renv Workflow

Use renv::init() to initialize renv in a new or existing project. This will set up a project library, containing all the packages you’re currently using. The packages (and all the metadata needed to reinstall them) are recorded into a lockfile, renv.lock, and a .Rprofile ensures that the library is used every time you open that project.

As you continue to work on your project, you will install and upgrade packages, either using install.packages() and update.packages() or renv::install() and renv::update(). After you’ve confirmed your code works as expected, use renv::snapshot() to record the packages and their sources in the lockfile.

Later, if you need to share your code with someone else or run your code on a new machine, your collaborator (or you) can call renv::restore() to reinstall the specific package versions recorded in the lockfile.
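As a rough illustration of the files this workflow relies on, the sketch below checks whether a project directory already contains the lockfile, the .Rprofile, and renv's activation script. The helper name `renv_files_present` is hypothetical and not part of renv, and the `renv/activate.R` path is an assumption based on what `renv::init()` normally creates.

```r
# Hypothetical helper (not part of renv): report which renv infrastructure
# files exist in a project directory.
renv_files_present <- function(project = getwd()) {
  files <- c("renv.lock", ".Rprofile", file.path("renv", "activate.R"))
  present <- file.exists(file.path(project, files))
  names(present) <- files
  present
}

# Typical use on a freshly cloned repository:
status <- renv_files_present()
if (all(status)) {
  message("renv files found; renv::restore() can reinstall the locked versions.")
} else {
  message("Missing: ", paste(names(status)[!status], collapse = ", "),
          "; the project may still need renv::init().")
}
```

A check like this only looks at file names; `renv::status()` is the tool to use for comparing the installed library against the lockfile.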

Learning more

If this is your first time using renv, we strongly recommend starting with the Introduction to renv vignette: this will help you understand the most important verbs and nouns of renv.

If you have a question about renv, please first check the FAQ to see whether your question has already been addressed. If it hasn’t, please feel free to ask on the Posit Forum.

If you believe you’ve found a bug in renv, please file a bug (and, if possible, a reproducible example) at https://github.com/rstudio/renv/issues.

Import Dataset Example

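Set up the packages:

```{r}
# tidyverse already attaches dplyr, ggplot2, readr and tidyr;
# the explicit calls below are redundant but harmless.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)

# Note: fitdistrplus attaches MASS, whose select() masks dplyr::select().
library(fitdistrplus)
library(gsheet)
library(boot)
library(readr)
```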

Import the dataset

The code below finds the path to the datasets folder programmatically by walking up from the working directory until it reaches the Git repository root.

```{r}
# Walk up the directory tree until a folder containing .git is found
find_git_root <- function(start = getwd()) {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  # dirname(path) equals path only at the filesystem root
  while (path != dirname(path)) {
    if (dir.exists(file.path(path, ".git"))) return(path)
    path <- dirname(path)
  }
  stop("No .git directory found. Are you inside a Git repository?")
}

repo_root <- find_git_root()
datasets_path <- file.path(repo_root, "datasets")
# repo_root
# datasets_path
```
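In Git worktrees and submodules, `.git` is a file rather than a directory, so a `dir.exists()` check would not find it. A possible variant, sketched here with the hypothetical name `find_project_root` and an assumed `marker` parameter, uses `file.exists()` (true for both files and directories) and also checks the filesystem root before giving up.

```r
# Hypothetical variant of find_git_root(): accepts any marker name and uses
# file.exists(), which is TRUE for both files and directories.
find_project_root <- function(start = getwd(), marker = ".git") {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (file.exists(file.path(path, marker))) return(path)
    parent <- dirname(path)
    # At the filesystem root, dirname() returns its input unchanged
    if (parent == path) stop("No '", marker, "' found at or above ", start)
    path <- parent
  }
}

# Usage inside the repository (left commented so the sketch can be sourced anywhere):
# repo_root     <- find_project_root()
# datasets_path <- file.path(repo_root, "datasets")
```

The same function could search for another marker, such as `renv.lock`, if the project is ever used outside Git.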

Next we define the file to load: healthcare-dataset-stroke-data.csv, which lives inside the kaggle-healthcare-dataset-stroke-data folder.

```{r}
kaggle_dataset_path <- file.path(datasets_path, "kaggle-healthcare-dataset-stroke-data/healthcare-dataset-stroke-data.csv")

# show_col_types = FALSE silences the column specification message
kaggle_data1 <- read_csv(kaggle_dataset_path, show_col_types = FALSE)
```

A first look at the dataset shows that BMI is not stored as a numeric value, and that missing entries are stored as the text “N/A” rather than as true NA values.

```{r}
head(kaggle_data1)
```

# A tibble: 6 × 12
     id gender   age hypertension heart_disease ever_married work_type    
  <dbl> <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>        
1  9046 Male      67            0             1 Yes          Private      
2 51676 Female    61            0             0 Yes          Self-employed
3 31112 Male      80            0             1 Yes          Private      
4 60182 Female    49            0             0 Yes          Private      
5  1665 Female    79            1             0 Yes          Self-employed
6 56669 Male      81            0             0 Yes          Private      
# ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <chr>,
#   smoking_status <chr>, stroke <dbl>

```{r}
# Count total NAs per column
colSums(is.na(kaggle_data1))
```

               id            gender               age      hypertension 
                0                 0                 0                 0 
    heart_disease      ever_married         work_type    Residence_type 
                0                 0                 0                 0 
avg_glucose_level               bmi    smoking_status            stroke 
                0                 0                 0                 0

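At first glance there appear to be no NA values, but this check only counts true NA entries, so the “N/A” text in the BMI column goes undetected. Let’s continue.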
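The zero counts can be misleading: `is.na()` only detects true `NA` values, not placeholder text. The toy data frame below is a stand-in built from the first three `id` and `bmi` values of the dataset and shows the difference between the two checks.

```r
# Toy data frame mimicking how the CSV stores a missing BMI value as text
toy <- data.frame(
  id  = c(9046, 51676, 31112),
  bmi = c("36.6", "N/A", "32.5"),
  stringsAsFactors = FALSE
)

na_counts          <- colSums(is.na(toy))    # "N/A" is ordinary text, so nothing is flagged
placeholder_counts <- colSums(toy == "N/A")  # counts the placeholder strings instead

na_counts
placeholder_counts
```

Running the second check on the full dataset would be one way to spot the problem before any type conversion.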

```{r}
# overall
summary(kaggle_data1)
```

       id           gender               age         hypertension    
 Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
 1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
 Median :36932   Mode  :character   Median :45.00   Median :0.00000  
 Mean   :36518                      Mean   :43.23   Mean   :0.09746  
 3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
 Max.   :72940                      Max.   :82.00   Max.   :1.00000  
 heart_disease     ever_married        work_type         Residence_type    
 Min.   :0.00000   Length:5110        Length:5110        Length:5110       
 1st Qu.:0.00000   Class :character   Class :character   Class :character  
 Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :0.05401                                                           
 3rd Qu.:0.00000                                                           
 Max.   :1.00000                                                           
 avg_glucose_level     bmi            smoking_status         stroke       
 Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
 1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
 Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
 Mean   :106.15                                          Mean   :0.04873  
 3rd Qu.:114.09                                          3rd Qu.:0.00000  
 Max.   :271.74                                          Max.   :1.00000

```{r}
glimpse(kaggle_data1)
```

Rows: 5,110
Columns: 12
$ id                <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
$ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
$ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
$ hypertension      <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ heart_disease     <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
$ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
$ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S…
$ Residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
$ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
$ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
$ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "…
$ stroke            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

summary() gives some useful insights, but glimpse() shows that there are missing values after all: they appear as the string "N/A" in the bmi column. Worse, the BMI values are stored as strings when they should be numeric.

Now let's explore the categorical and numeric variables.

```{r}
# Check the categorical variables
# (dplyr and tidyr are already attached by tidyverse)
library(dplyr)
library(tidyr)

# Check one by one to see what we have
# kaggle_data1 %>% count(gender)
# kaggle_data1 %>% count(hypertension)
# kaggle_data1 %>% count(heart_disease)
# kaggle_data1 %>% count(ever_married)
# kaggle_data1 %>% count(work_type)
# kaggle_data1 %>% count(Residence_type)
# kaggle_data1 %>% count(smoking_status)
# kaggle_data1 %>% count(stroke)

# Now do the same more compactly
cat_vars <- c("gender", "hypertension", "heart_disease", "ever_married",
              "work_type", "Residence_type", "smoking_status", "stroke")

kaggle_data1[, cat_vars] %>%
  # Convert all columns to character to avoid type conflicts when pivoting
  mutate(across(everything(), as.character)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  count(variable, value) %>%
  arrange(variable, desc(n)) %>%
  print(n = 22)
```

# A tibble: 22 × 3
   variable       value               n
   <chr>          <chr>           <int>
 1 Residence_type Urban            2596
 2 Residence_type Rural            2514
 3 ever_married   Yes              3353
 4 ever_married   No               1757
 5 gender         Female           2994
 6 gender         Male             2115
 7 gender         Other               1
 8 heart_disease  0                4834
 9 heart_disease  1                 276
10 hypertension   0                4612
11 hypertension   1                 498
12 smoking_status never smoked     1892
13 smoking_status Unknown          1544
14 smoking_status formerly smoked   885
15 smoking_status smokes            789
16 stroke         0                4861
17 stroke         1                 249
18 work_type      Private          2925
19 work_type      Self-employed     819
20 work_type      children          687
21 work_type      Govt_job          657
22 work_type      Never_worked       22
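The two stroke rows in the table above already show how unevenly the outcome is distributed. A small sketch using only those two counts makes the imbalance explicit; the variable names are illustrative.

```r
# Counts copied from the stroke rows of the table above
stroke_counts <- c(no_stroke = 4861, stroke = 249)

# Share of each class among all records
stroke_share <- stroke_counts / sum(stroke_counts)
round(100 * stroke_share, 2)

# Ratio of majority class to minority class
imbalance_ratio <- unname(stroke_counts["no_stroke"] / stroke_counts["stroke"])
imbalance_ratio
```

The stroke share agrees with the mean of the `stroke` column reported by `summary()` (0.04873), since the column is coded 0/1.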

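That is quite interesting. Now let's see what happens with the numeric variables.

```{r}
# Check numeric variables. bmi is still stored as character here, so it is
# excluded; the 0/1 indicator columns are numeric and therefore included.
kaggle_data1 %>%
  select_if(is.numeric) %>%
  summary()
```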

       id             age         hypertension     heart_disease    
 Min.   :   67   Min.   : 0.08   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:17741   1st Qu.:25.00   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :36932   Median :45.00   Median :0.00000   Median :0.00000  
 Mean   :36518   Mean   :43.23   Mean   :0.09746   Mean   :0.05401  
 3rd Qu.:54682   3rd Qu.:61.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :72940   Max.   :82.00   Max.   :1.00000   Max.   :1.00000  
 avg_glucose_level     stroke       
 Min.   : 55.12    Min.   :0.00000  
 1st Qu.: 77.25    1st Qu.:0.00000  
 Median : 91.89    Median :0.00000  
 Mean   :106.15    Mean   :0.04873  
 3rd Qu.:114.09    3rd Qu.:0.00000  
 Max.   :271.74    Max.   :1.00000

We need to deal with the BMI column, which has missing values and is not stored as numeric.

```{r}
# unique(kaggle_data1$bmi)
kaggle_data2 <- kaggle_data1 %>%
  mutate(bmi = na_if(bmi, "N/A")) %>%   # Convert "N/A" string to NA
  mutate(bmi = as.numeric(bmi))         # Convert from character to numeric

# Equivalent one-liner:
# kaggle_data2 <- kaggle_data1 %>% mutate(bmi = as.numeric(na_if(bmi, "N/A")))

# Check if it worked
str(kaggle_data2$bmi)
```

 num [1:5110] 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...

```{r}
sum(is.na(kaggle_data2$bmi))
```

[1] 201
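An alternative, sketched here under the assumption that "N/A" is the only placeholder used for missing values, is to declare it at import time through the `na` argument of `read_csv()`, so that `bmi` is parsed as numeric directly. The inline CSV is a stand-in for the real file and contains only the first three `id` and `bmi` values.

```r
library(readr)

# Inline CSV standing in for the real file (id and bmi columns only)
csv_text <- "id,bmi\n9046,36.6\n51676,N/A\n31112,32.5\n"

# Treat "N/A" as missing in addition to readr's defaults ("" and "NA")
toy <- read_csv(I(csv_text), na = c("", "NA", "N/A"), show_col_types = FALSE)

str(toy$bmi)
sum(is.na(toy$bmi))
```

With this approach the separate `na_if()` and `as.numeric()` steps would not be needed, at the cost of applying the same missing-value rule to every column.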

```{r}
# Check numeric variables again; bmi is now numeric and therefore included
kaggle_data2 %>%
  select_if(is.numeric) %>%
  summary()
```

       id             age         hypertension     heart_disease    
 Min.   :   67   Min.   : 0.08   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:17741   1st Qu.:25.00   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :36932   Median :45.00   Median :0.00000   Median :0.00000  
 Mean   :36518   Mean   :43.23   Mean   :0.09746   Mean   :0.05401  
 3rd Qu.:54682   3rd Qu.:61.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :72940   Max.   :82.00   Max.   :1.00000   Max.   :1.00000  
                                                                    
 avg_glucose_level      bmi            stroke       
 Min.   : 55.12    Min.   :10.30   Min.   :0.00000  
 1st Qu.: 77.25    1st Qu.:23.50   1st Qu.:0.00000  
 Median : 91.89    Median :28.10   Median :0.00000  
 Mean   :106.15    Mean   :28.89   Mean   :0.04873  
 3rd Qu.:114.09    3rd Qu.:33.10   3rd Qu.:0.00000  
 Max.   :271.74    Max.   :97.60   Max.   :1.00000  
                   NA's   :201
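With 201 missing BMI values, some form of imputation is one option before modelling. The sketch below shows median imputation, a simple baseline chosen here purely for illustration and not the method of any study cited in this post. It uses the first six values of the converted `bmi` column, and the variable names are illustrative.

```r
# First six BMI values from the converted column, including one NA
bmi <- c(36.6, NA, 32.5, 34.4, 24, 29)

# Keep a flag so imputed entries remain identifiable later
bmi_was_missing <- is.na(bmi)

# Replace missing entries with the median of the observed values
bmi_median  <- median(bmi, na.rm = TRUE)
bmi_imputed <- ifelse(bmi_was_missing, bmi_median, bmi)

bmi_imputed
```

Whether a single-value imputation like this is adequate depends on why the values are missing; it is only a starting point for comparison.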

Conclusion

The dataset is imbalanced (only 249 of the 5,110 records, about 4.9%, are stroke cases) and has data-quality issues such as the 201 missing BMI values. Several published studies explore solutions to these problems (see References).

The study Predictive modelling and identification of key risk factors for stroke using machine learning (Hassan et al., 2024) makes several contributions:

- Exploring various data imputation techniques and addressing data imbalance issues in order to enhance the accuracy and robustness of stroke prediction models.
- Identifying crucial features for stroke prediction and uncovering previously unknown risk factors, giving a comprehensive understanding of stroke risk assessment.
- Creating an augmented dataset incorporating important key risk factor features using the imputed datasets, enhancing the effectiveness of stroke prediction models.
- Assessing the effectiveness of advanced machine learning models across different datasets and creating a robust Dense Stacking Ensemble model for stroke prediction.
- The key contribution is showcasing the enhanced predictive capabilities of the model in accurately identifying and testing strokes, surpassing the performance of prior studies that utilized the same dataset.

Note

Large datasets may need Git LFS, which is not set up for this repository, so they must be stored externally.

Additional Thoughts

Quarto websites, when combined with Python and R, offer a powerful way to create dynamic, data-driven content and visually rich presentations.

There are limitations, however. The GitHub Actions runner is not powerful, and this must be taken into account before submitting the project for rendering. Future work will evaluate solutions to the computational budget limits of the GitHub Actions runner.

Related discussion: How to efficiently break up a computationally heavy article into separate notebooks? #8410

Some have mentioned that the project can be split into sections.
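One possible way to keep heavy steps off the runner, sketched here as an idea rather than a tested solution for this project, is to compute expensive results once locally, save them with `saveRDS()`, and have the rendered document read the saved file. The helper name `cache_rds` and the stand-in `slow_step` are hypothetical. Quarto's own `freeze` execution option is another route worth evaluating.

```r
# Hypothetical helper: compute a result once, then reuse the saved copy
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Example with a stand-in for an expensive step
cache_file <- file.path(tempdir(), "slow_step.rds")
unlink(cache_file)  # start from a clean state for this demonstration

slow_step <- function() {
  Sys.sleep(0.1)            # pretend this takes a long time
  sum(sqrt(seq_len(1000)))
}

first  <- cache_rds(cache_file, slow_step)  # runs slow_step() and saves the result
second <- cache_rds(cache_file, slow_step)  # reads the saved file instead
```

In a real project the cache file would live inside the repository (or external storage) rather than in `tempdir()`, and it would need to be refreshed by hand whenever the inputs change.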

References

1. Melnykova, N., Patereha, Y., Skopivskyi, S., Farion, M., Fedushko, S., & Drohomyretska, K. (2025). Machine learning for stroke prediction using imbalanced data. Scientific Reports, 15(1), 33773.

2. Hassan, A., Gulzar Ahmad, S., Ullah Munir, E., Ali Khan, I., & Ramzan, N. (2024). Predictive modelling and identification of key risk factors for stroke using machine learning. Scientific Reports, 14(1), 11498.