Predicting stroke risk from common health indicators: a binary logistic regression analysis
Slides: slides.html
Introduction
Logistic regression is a regression technique used when a dataset’s response variable is categorical. When the outcome variable takes two distinct classes, a binary logistic regression model is used; when the response has more than two classes, either a multinomial or an ordinal logistic regression model can be used[1] (Kutner, 2013). Logistic regression is applied across a variety of fields to support decision making, to predict class labels, and to estimate the probability of an event occurring; these fields include medicine, traffic and road engineering, environmental science, credit and fraud detection, and more. In this literature review, we take a closer look at logistic regression by discussing some applications of the algorithm in machine learning models, as well as its use in several peer-reviewed studies.
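For reference, the binary model discussed throughout this review can be written as follows. The notation is our own assumption and is not taken from any of the cited studies: Y is the binary response, x_1, ..., x_p are the predictors, and the betas are the coefficients to be estimated.

```latex
\[
\log\frac{\pi(\mathbf{x})}{1-\pi(\mathbf{x})}
  = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad
\pi(\mathbf{x}) = P(Y = 1 \mid \mathbf{x})
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
\]
```

The left-hand equation states that the log odds of the event are linear in the predictors; the right-hand equation is the same model solved for the probability.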
Many machine learning methods use logistic regression to train a model and predict class labels; one example is text classification in natural language processing. The article “From logistic regression to the perceptron algorithm: Exploring gradient descent with large step sizes” investigates similarities between logistic regression trained with gradient descent and the perceptron algorithm. The researchers observed that with very large step sizes, logistic regression with gradient descent behaves like a perceptron, which in some sense links it back to the Deep Equilibrium networks study. The conclusions of this paper are counterintuitive, and further research on classification and optimization theory is encouraged[2] (Tyurin, 2025). Logistic regression is also used alongside large language models (LLMs), which are complex neural networks trained on very large datasets to output human language[3] (Pedapati et al., 2024). The paper “Large language model confidence estimation via black-box access” addresses the problem of estimating the confidence of LLM outputs when only black-box (query-only) access is available. The proposed technique is simple: it uses logistic regression to classify and validate the confidence of the outputs. One drawback of black-box models is that the user has no control over the model itself; in some cases, however, the benefits of buying a service that provides a black-box model outweigh those of training a custom model[3] (Pedapati et al., 2024).
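To make the comparison concrete, the two update rules can be placed side by side. The following is a minimal sketch and does not reproduce the analysis in Tyurin (2025): the toy dataset, the step size of 100, and all function names are our own assumptions. For a badly misclassified point the logistic gradient weight sigmoid(-margin) is close to 1, so a large step moves the weights along y * x much as a perceptron update would; for a confidently correct point the weight is close to 0 and the update nearly vanishes.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def gd_logistic(X, y, step, n_iters):
    """Full-batch gradient descent on the mean logistic loss; labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(n_iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            margin = yi * dot(w, xi)
            # Gradient of log(1 + exp(-margin)) w.r.t. w is -sigmoid(-margin) * yi * xi.
            coeff = -sigmoid(-margin) * yi
            for j, xj in enumerate(xi):
                grad[j] += coeff * xj / n
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


def perceptron(X, y, n_epochs):
    """Classic perceptron: update only on points with non-positive margin."""
    w = [0.0] * len(X[0])
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
    return w


def accuracy(w, X, y):
    return sum(1 for xi, yi in zip(X, y) if yi * dot(w, xi) > 0) / len(X)


# Hypothetical, linearly separable toy data; the last feature is a constant intercept term.
X = [[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [3.0, 0.5, 1.0],
     [-1.0, -2.0, 1.0], [-2.0, -0.5, 1.0], [-1.5, -1.5, 1.0]]
y = [1, 1, 1, -1, -1, -1]

w_gd = gd_logistic(X, y, step=100.0, n_iters=50)  # deliberately very large step
w_pc = perceptron(X, y, n_epochs=10)
print(accuracy(w_gd, X, y), accuracy(w_pc, X, y))
```

On this small separable example both procedures find a separating direction; the sketch only illustrates the mechanics of the updates and makes no claim about convergence in general.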
Logistic regression analysis also assists road engineers and traffic control around the world by identifying common predictors of traffic accidents in general and of fatal accidents in particular. According to the World Health Organization, road accidents are one of the most common unnatural causes of death worldwide, so it is imperative to identify strong predictors associated with these events[4] (Akter et al., 2021). A 2024 study aimed to find which traffic factors are strongly associated with the occurrence of traffic accidents[5] (Chang et al., 2024). The researchers used binary logistic regression to model the probability of a traffic accident occurring given a set of 25 predictors related to road safety. Another road traffic safety study, from Bangladesh, investigated strong predictors of motorcycle accidents. These researchers also used a binary logistic regression model to find strong predictors of severe accidents[4] (Akter et al., 2021). “Modeling Road Accident Severity with Logistic Regression (comparison study)” also exemplifies the use of a binary logistic regression model to analyze traffic risk[6] (Chen et al., 2020). The researchers compared the results of the logistic regression model to those of decision tree and random forest models and found that the logistic regression model was clearer and easier to understand than the others. All three of these studies concluded that several variables are significant predictors of severe road crashes. Knowing these significant predictors helps builders and developers eliminate or reduce these risk factors when building new roads; hence, it is crucial to continue researching road accident severity with logistic regression[4] (Akter et al., 2021).
Environmental issues can also be studied using logistic regression; after all, logistic regression is a generalized linear model whose logistic function maps any real-valued linear predictor to a probability between 0 and 1. “Priority prediction of Asian Hornet sighting report using machine learning methods” seeks to address the problem of Asian giant hornets[7] (Liu et al., 2021). They are an invasive species that poses a significant threat to native bee populations and local beekeeping, as well as to public safety, because of their aggressive nature and potent venom. The goal of the research is to create an automated system to predict the priority of Asian giant hornet sighting reports. The authors modeled the priority prediction of sighting reports as a binary classification problem, with each report labeled either a “true positive” or a “false positive.” Their methodology is a straightforward application of logistic regression with feature extraction: logistic regression maps each feature vector to a probability, and the model is trained with a weighted binary cross-entropy loss. With the best weighting parameter settings, the model achieved an average prediction accuracy of 83.5% on positive reports, which is still well below the roughly 93% that other works achieved using deep learning. The authors concluded that the approach still needs considerable improvement and may never outperform other methods because of hidden limitations[7] (Liu et al., 2021). Another example of logistic regression in an environmental context is the article “Autoregressive Logistic Regression Applied to Atmospheric Circulation Patterns”[8] (Guanche et al., 2014). The researchers incorporate autoregressive time dependencies into logistic regression for climate modeling. They work with complex, dynamic climatological data, and they explain both the interpretation and the simulation capabilities of the model for weather patterns[8] (Guanche et al., 2014).
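A weighted binary cross-entropy loss of the kind mentioned above can be sketched as follows. The source does not state the exact weighting scheme used by Liu et al. (2021), so the single `pos_weight` factor on the positive class, the function names, and the example scores are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with an extra weight on the positive class.

    y_true: labels in {0, 1}; p_pred: predicted probabilities of class 1.
    pos_weight > 1 penalizes missed positives more heavily (assumed scheme).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical linear scores (w . x + b) for four reports and their labels.
scores = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 1, 0]
probs = [sigmoid(s) for s in scores]

print(weighted_bce(labels, probs))                  # unweighted loss
print(weighted_bce(labels, probs, pos_weight=5.0))  # positives count five times as much
```

Increasing `pos_weight` is one common way to handle datasets in which positive cases are rare, since it makes errors on the positive class more costly during training.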
The article “Understanding Logistic Regression Analysis” discusses the usefulness of logistic regression, describes how to interpret the results of the model, and gives an example of a logistic regression analysis using a synthetic dataset[9] (Sperandei, 2014). The data describe patients undergoing a drug treatment with a binary outcome: survived (1) or did not survive (0). The analysis shows how to interpret output from the model: exponentiating a slope coefficient gives the odds ratio associated with a one-unit increase in that predictor, not the probability of the event itself. The article notes that, to understand the results of a logistic regression correctly, one must carefully distinguish between the odds ratio, the log odds, and the probability of an event occurring. Another important point in the article concerns feature selection; a common way to select predictors for a logistic regression model is through a preliminary univariate analysis. After a univariate analysis of each predictor in relation to the outcome variable, all significant predictors are included in the final multivariate logistic regression analysis[9] (Sperandei, 2014).
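The distinction between log odds, odds ratios, and probabilities can be illustrated with a short calculation. The intercept and slope below are hypothetical values chosen for illustration, not estimates from the article, and the function names are our own.

```python
import math


def prob_from_log_odds(log_odds):
    """Convert log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_from_prob(p):
    """Convert a probability to odds, p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical fitted model: log odds of survival = intercept + beta * treated.
intercept = -1.0
beta = 0.7

p_untreated = prob_from_log_odds(intercept)       # treated = 0
p_treated = prob_from_log_odds(intercept + beta)  # treated = 1

# The exponentiated slope is an odds ratio, not a probability.
odds_ratio = math.exp(beta)
ratio_from_probs = odds_from_prob(p_treated) / odds_from_prob(p_untreated)

print(p_untreated, p_treated)        # the two probabilities
print(odds_ratio, ratio_from_probs)  # these two values agree
```

The change in probability between the two groups depends on the intercept, whereas the odds ratio does not, which is why the two quantities should not be used interchangeably.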
A 2024 study titled “Determinants of coexistence of undernutrition and anemia among under-five children in Rwanda” presents evidence from 2019/2020 demographic health survey data[10]. There are two outcome variables in the dataset: anemia and undernutrition in children under five years of age in Rwanda[10] (Agmas et al., 2024). The study analyzes the relationship between the two outcome variables, as well as their relationship with 26 predictors covering the children’s preexisting health conditions, family information, details about the parents, and relevant geographic information. One result of the study was that the relationship between the two outcome variables, presence of malnutrition and presence of anemia, was found to be significant. Six other significant predictors were identified: mother’s age, drinking water quality, other children in the household, child gender, birth order, and gender of the head of household. The authors conclude that improving maternal education, supplementing with vitamin A and other nutrient-dense foods, providing a healthy home environment, and decreasing maternal anemia may help reduce rates of malnutrition and anemia in children[10] (Agmas et al., 2024).
In the study “Predictors of hospital admission when presenting with acute on chronic breathlessness: Binary logistic regression,” emergency room data from one hospital are analyzed to determine common predictors of hospital admission[11] (Hutchinson et al., 2023). Specifically, patients presenting to the emergency room with acute-on-chronic breathlessness were surveyed to collect data that would help researchers understand common factors among those admitted. Knowing common predictors ahead of time helps hospital staff identify patients who are at higher risk of being admitted, and also helps identify patients who are more likely to be discharged without admission. A binary logistic regression analysis of the data revealed that the odds of admission were positively associated with three predictors: age, talking to a doctor about symptoms, and the presence of preexisting heart conditions. However, the odds of admission were negatively associated with blood oxygen levels[11] (Hutchinson et al., 2023).
Airplane headache (AH), a headache induced while taking off or landing in an airplane, was recognized as a new medical condition in 2004[12] (Prottengeier et al., 2025). Because AH is a relatively new addition to medical dictionaries, it is an underexplored condition that requires additional research. This study sought to identify common risk factors significantly associated with airplane headache to aid both travelers and airline employees. Two binary logistic regression models were constructed, each comparing the airplane headache group against a different reference group. The first model compared the airplane headache group to the no headache group and identified 10 significant predictors of AH; this model’s predictive power was found to be very high. The second model compared the airplane headache group to a group called other headache (individuals with symptoms of other types of headaches). This analysis showed four significant predictors; however, the predictive power of the model was found to be very low. In short, binary logistic regression was effective at finding strong predictors of airplane headache relative to people who have no headaches while flying, but much less effective at distinguishing airplane headache from other headache types[12] (Prottengeier et al., 2025).
Another way logistic regression is applied in the medical field is in identifying how the general public makes decisions regarding their health[13] (Liu J. et al., 2024). Researchers in China analyzed 2696 health survey responses collected from individuals across 31 Chinese provinces. They analyzed the data with a binary logistic regression model to classify responses into two categories: unilateral decision making (value of 1) or collaborative decision making (value of 0). The researchers wanted to identify the top predictors of individuals making medical decisions by themselves, and which predictors are associated with patients making health decisions with more than one party (i.e., a patient, doctor, and family member all helping to make the health decision). Most responses were classified as collaborative decision making (70%), which supports the idea that individuals in China strongly emphasize family-made decisions and strong family values. The study also concluded that the significant predictors of unilateral decision making were gender, education level, family status, religious beliefs, and occupation[13] (Liu J. et al., 2024).
Logistic regression is also useful in detecting common diseases, such as breast cancer. “Regularized logistic regression with network-based pairwise interaction for biomarker identification in breast cancer” uses regularized logistic regression, along with biological network information and pairwise interactions, to find both single and interacting-pair biomarkers for breast cancer[14] (Wu et al., 2016). The researchers prioritized biologically plausible biomarker combinations and used an adaptive elastic net, a penalty that balances the L1 and L2 penalties, with network constraints. The results of the study show that their model outperforms simpler models in terms of predictive performance, and they were able to discover both individual biomarkers and interacting gene pairs[14] (Wu et al., 2016). Another 2016 study on breast cancer aimed to identify gene signatures that predict chemosensitivity, that is, which tumors respond to chemotherapy, by combining genetic algorithms with sparse logistic regression[15] (Hu, 2016). This analysis is relevant and important because predicting which patients will respond to chemotherapy allows for more personalized treatment. The results show that two gene signatures, SLR-28 and Notch-86, perform well on training and validation sets in terms of accuracy, specificity, sensitivity, and other metrics[15] (Hu, 2016).
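The way an elastic net balances the L1 and L2 penalties can be sketched as follows. This shows only the plain elastic net in one common parameterization; it is not the adaptive, network-constrained penalty of Wu et al. (2016), and the names `lam` (overall strength) and `alpha` (mixing weight) and the example coefficients are our own assumptions.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Plain elastic net penalty: lam * (alpha * L1 + 0.5 * (1 - alpha) * L2^2).

    alpha = 1 gives the lasso (L1) penalty, alpha = 0 gives the ridge (L2) penalty.
    The intercept is conventionally left out of coefs and is not penalized.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


def penalized_loss(log_loss, coefs, lam, alpha):
    """Objective minimized by a regularized logistic regression (sketch)."""
    return log_loss + elastic_net_penalty(coefs, lam, alpha)


# Hypothetical slope coefficients for three genes.
coefs = [1.0, -2.0, 0.0]
print(elastic_net_penalty(coefs, lam=1.0, alpha=1.0))  # pure L1
print(elastic_net_penalty(coefs, lam=1.0, alpha=0.0))  # pure L2
print(elastic_net_penalty(coefs, lam=1.0, alpha=0.5))  # equal mix
```

The L1 part encourages sparse models by driving some coefficients to exactly zero, while the L2 part stabilizes the estimates when predictors are correlated, which is a typical situation with gene expression data.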
In this paper, we will discuss the methodology, analysis, results, visualizations, and conclusions of a binary logistic regression statistical analysis regarding risk of stroke from common health indicators. Stroke is the second most common cause of death globally, so understanding the risk factors associated with it is imperative[16] (“The Top 10”, 2024).