ISSN: 2455-5282
Global Journal of Medical and Clinical Research Articles
Research Article       Open Access      Peer-Reviewed

Critical Determinants of COVID-19 Severity and Predictive Modeling for Healthcare Optimization

Hammad Uallah1, Rija Ali1, Saad Ali1 and Umair Arif2*

1General Practitioner at Shifa Medical Complex, Jahanian District, Khanewal, Pakistan
2Lecturer Bio-Statistics, The University of Faisalabad, Faisalabad, Pakistan

*Corresponding author: Umair Arif, Lecturer Bio-Statistics, The University of Faisalabad, Faisalabad, Pakistan, E-mail: [email protected]
Received: 26 December, 2024 |Accepted: 10 January, 2025 | Published: 11 January, 2025
Keywords: COVID-19; Risk factors; Severity prediction; Resource allocation; Machine learning

Cite this as

Uallah H, Ali R, Ali S, Arif U. Critical Determinants of COVID-19 Severity and Predictive Modeling for Healthcare Optimization. Glob J Medical Clin Case Rep. 2025:12(1):004-010. Available from: 10.17352/2455-5282.000191

Copyright License

© 2025 Uallah H, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The COVID-19 pandemic placed unprecedented strain on global healthcare systems, highlighting the need to identify critical determinants of disease severity and to develop predictive models for resource optimization. This study aimed to identify the most significant factors influencing COVID-19 severity, analyze comorbidity patterns, and develop machine learning models for predicting severe outcomes. Demographic, clinical, and medical history data from a dataset of 1,000 COVID-19 patients were analyzed. Comorbidities such as COPD (96.3%), chronic renal disease (92.6%), cardiovascular issues (93.9%), and diabetes (69.9%) were found to be highly prevalent among severe cases. Over half of the patients required ICU admission (51.1%) or ventilator support (54.5%), indicating the critical impact of severe COVID-19 symptoms on healthcare systems. Four machine learning models (decision tree, logistic regression, random forest, and AdaBoost) were evaluated for predictive accuracy using an 80-20 train-test split and 10-fold cross-validation. On the 80-20 split, AdaBoost and logistic regression were the most effective models, each achieving 77.00% accuracy; AdaBoost had the highest precision (79.84%) and specificity (91.75%), while logistic regression had the highest sensitivity (67.96%) and gave the most balanced predictions. Averaged across the 10 folds, accuracy was 65.80% for the decision tree, 72.40% for random forest, 75.40% for logistic regression, and 75.50% for AdaBoost. These findings underscore the importance of comorbidities in determining COVID-19 severity and demonstrate the utility of predictive modeling in optimizing healthcare resources. The study concludes that tailored interventions for high-risk patients and machine learning-driven resource allocation strategies can enhance healthcare efficiency during pandemics.

Introduction

The coronavirus disease (COVID-19) pandemic, caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has posed unprecedented challenges to global healthcare systems [1-3]. With its first reported cases in late 2019, the virus quickly spread worldwide, causing widespread illness, death, and socioeconomic disruptions. COVID-19 is primarily a respiratory illness with symptoms ranging from mild fever and cough to severe complications such as Acute Respiratory Distress Syndrome (ARDS) and multi-organ failure [4,5]. While the majority of infected individuals recover with minimal intervention, certain groups are at a significantly higher risk of severe disease and mortality [6]. Identifying these high-risk individuals and understanding the underlying factors driving disease severity remains critical for mitigating the pandemic’s impact [7].

Underlying medical conditions such as cardiovascular disease, diabetes, chronic respiratory conditions, and cancer have been well-documented as risk factors for severe COVID-19 [8]. Similarly, demographic variables like age and sex have also shown strong associations with disease outcomes. Older adults and individuals with pre-existing comorbidities are more likely to experience complications requiring hospitalization, intensive care, or mechanical ventilation. The need for efficient management of limited medical resources such as ICU beds, ventilators, and personnel has underscored the importance of developing predictive tools to aid healthcare providers [9-11]. During the pandemic, healthcare systems globally have faced acute resource shortages, especially during peak infection waves. Timely predictions of resource requirements at the patient level can provide a significant advantage by allowing authorities to allocate limited resources where they are needed most. Additionally, understanding the interplay of demographic, clinical, and comorbidity-related factors contributing to COVID-19 severity can improve patient management strategies and public health policies [12,13].

Machine Learning (ML) has emerged as a transformative tool in addressing the challenges posed by the COVID-19 pandemic. By leveraging vast amounts of data, ML enables the identification of patterns, predictions, and insights crucial for effective pandemic response [14]. One of its primary applications in COVID-19 research is predicting disease severity and outcomes, assisting healthcare providers in risk stratification and resource allocation [15,16]. ML models, such as Random Forests, Logistic Regression, and advanced boosting algorithms like AdaBoost and XGBoost, have been widely used to analyze clinical, demographic, and medical history data to predict critical outcomes, including ICU admission, ventilator requirements, and mortality [17].

Another significant application of ML in COVID-19 is in diagnostic processes. Techniques such as Convolutional Neural Networks (CNNs) have been employed to analyze medical imaging data, like chest X-rays and CT scans, offering rapid and accurate diagnosis [18]. Moreover, ML algorithms have been pivotal in analyzing genomic sequences of the virus, aiding in vaccine development, and tracking virus mutations [19,20].

Despite its vast potential, ML in COVID-19 research faces challenges, including data privacy concerns, biases in datasets, and the need for interpretability to ensure clinical trust [21]. However, with continuous advancements, ML holds the promise of revolutionizing pandemic management by enhancing decision-making processes, optimizing healthcare resources, and ultimately improving patient outcomes [22]. It serves as a cornerstone in the fight against current and future public health crises. By comparing multiple algorithms, namely Decision Tree, Random Forest, Logistic Regression, and AdaBoost, this research aims to determine the most effective approach for predicting high-risk cases.

This study aims to identify key factors influencing COVID-19 severity, focusing on the role of comorbidities in shaping outcomes and providing critical insights for clinical care and risk stratification. By leveraging machine-learning techniques, it integrates risk factor analysis into a cohesive framework for predicting the likelihood of intensive care or ventilation needs, enabling efficient resource allocation and improved patient outcomes. Unlike previous research focused on isolated factors, this study emphasizes interpretability, so that the findings are statistically robust and practically valuable for clinicians and policymakers in real-time decision-making.

Methodology

The methodology is designed to address the study’s two objectives: identifying the critical factors that influence the severity of COVID-19 infections, and predicting patient resource needs. It provides a structured approach intended to assist healthcare providers in making data-driven decisions regarding resource allocation.

Data collection and understanding

The dataset used for this research was obtained from Kaggle and comprises records for 1,000 patients, including clinical and demographic details such as age, sex, pre-existing conditions, and hospitalization records. Key features include the presence of comorbidities like diabetes, hypertension, and chronic kidney disease, along with patient outcomes such as ICU admission and ventilator use. This dataset provides an opportunity to gain insights into disease progression and improve decision-making in healthcare settings.

The target variable for this study is the classification of COVID-19, which categorizes patients based on the severity of their COVID-19 infection. This will allow the development of a model capable of predicting the severity of the infection.

Data preprocessing and splitting

Data preprocessing is a critical step in ensuring the quality and usability of the dataset for ML models. Missing values were handled either by dropping rows with excessive missing data or by imputation, using the mean or median for numerical features and the mode for categorical features. Categorical variables such as sex, classification, and patient type were transformed into numerical values using one-hot encoding or label encoding, depending on the nature of the variable, and the target variable was labeled and encoded for modeling purposes. Features with wide ranges, such as age, were normalized using standardization (z-score) or min-max scaling so that all features contribute on a comparable scale to the model’s predictions. Because the target variable was imbalanced, oversampling techniques such as the Synthetic Minority Over-sampling Technique (SMOTE) were applied.

The preprocessed data were then split into training (80%) and testing (20%) sets for initial model evaluation. Additionally, to further assess the generalization ability and robustness of the models, 10-fold cross-validation was employed, so that each model was evaluated on different subsets of the data, providing a more reliable estimate of performance across various scenarios [23].
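The scaling, oversampling, and splitting steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names (`zscore`, `smote_like`, `holdout_split`, `k_fold_indices`) are hypothetical, and `smote_like` is a simplification that interpolates toward the single nearest minority neighbour, whereas SMOTE samples among several nearest neighbours.

```python
import random
import statistics


def zscore(values):
    """Standardize a numeric column to mean 0 and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def smote_like(minority_rows, n_new, seed=0):
    """Create synthetic minority rows by interpolating between a row and its
    nearest minority neighbour (simplified; assumed, not the authors' code)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        neighbour = min(
            (row for row in minority_rows if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def holdout_split(n_rows, test_fraction=0.2, seed=0):
    """Return shuffled (train, test) row indices for an 80/20 hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, validation) index lists; each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = holdout_split(1000)      # 800 training rows, 200 test rows
folds = list(k_fold_indices(1000, k=10))       # ten folds of 100 validation rows
```

A common precaution is to oversample only the training portion of each split, so that synthetic rows never appear in the data used for evaluation; the article does not state the order in which these steps were applied.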

Machine learning model development

The study implemented four machine learning models, Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and AdaBoost, to analyze and predict COVID-19 severity outcomes. These models were chosen for their diverse approaches to classification and their ability to handle varied types of data.

Model descriptions

The DT model uses a tree-like structure to make decisions based on input features [24]. It is known for its simplicity and interpretability, making it a suitable baseline model for this study. The RF [25] model is an ensemble technique that builds multiple decision trees and aggregates their predictions to enhance accuracy and reduce overfitting. It is particularly effective in handling complex datasets with diverse features. The LR [26] model is a statistical method used for binary and multi-class classification tasks. It assumes a linear relationship between input features and the log odds of the target variable, offering a robust and interpretable solution. AdaBoost [27] is an ensemble method that combines weak classifiers, typically decision stumps, to create a strong predictive model. It focuses on incorrectly classified samples by assigning higher weights, improving overall accuracy.
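The reweighting idea behind AdaBoost can be made concrete with a single boosting round. The sketch below follows the textbook discrete AdaBoost update with labels coded as -1 and +1; the article does not state which AdaBoost variant or library was used, so the function name `adaboost_round` and the toy labels are assumptions for illustration only.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round. Labels and predictions are -1 or +1."""
    # Weighted error of the weak learner (weights are assumed to sum to 1).
    error = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # The learner's vote: larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Up-weight misclassified samples, down-weight correct ones, renormalize.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]      # a hypothetical stump that misclassifies sample 1
weights = [0.2] * 5              # uniform starting weights
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After this round the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.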

Model training and evaluation

Each model was trained using the training dataset, with hyperparameter tuning performed via grid search to optimize performance. Evaluation metrics, including accuracy, precision, F1-score, sensitivity, specificity, and AUC, were calculated on the testing dataset to assess the models. Additionally, feature importance was computed using permutation importance across all models, and statistical significance for features was evaluated using p-values derived from the LR model. These measures provided insights into the contribution and relevance of individual features.
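The evaluation metrics listed above can all be derived from the four confusion-matrix counts. The sketch below is illustrative: the function name `binary_metrics` and the counts are assumptions, not the authors' code or data. The counts are hypothetical values chosen so that, on a 200-patient test set, they reproduce the accuracy, sensitivity, and specificity the article reports for AdaBoost; the article's text does not give the raw counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)      # recall for the positive class
    specificity = tn / (tn + fp)      # recall for the negative class
    precision_pos = tp / (tp + fp)
    precision_neg = tn / (tn + fn)
    # Class-weighted precision: each class's precision weighted by its support.
    precision_weighted = (
        (tp + fn) * precision_pos + (tn + fp) * precision_neg
    ) / total
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision_positive": precision_pos,
        "precision_weighted": precision_weighted,
        "f1_positive": 2 * precision_pos * sensitivity / (precision_pos + sensitivity),
    }


# Hypothetical counts for a 200-patient test set (not reported in the article).
metrics = binary_metrics(tp=65, fp=8, tn=89, fn=38)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

With these hypothetical counts, the class-weighted precision (about 79.84%) rather than the positive-class precision (about 89.04%) lines up with the reported AdaBoost figure. This suggests, though the article does not say so, that precision and F1-score may have been reported as class-weighted averages.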

Experimental results

Of the 1,000 patients included in this study, 570 were male and 430 were female (Figure 1). By age, 71 patients were 1-15 years old, 95 were 16-30 years, 130 were 31-45 years, 285 were 46-60 years, and 419 were older than 60 years (Figure 2). The majority of patients (70.1%) required hospitalization, reflecting the significant strain on healthcare resources during the pandemic. Most patients (68.0%) received care from third-level medical units, which are typically equipped to handle complex and severe cases, underscoring the severity of conditions in this dataset. Only 29.9% of patients were discharged home without requiring extensive care (Table 1).

The data show a high prevalence of comorbidities and risk factors among COVID-19 patients, with COPD (96.3%), chronic renal disease (92.6%), cardiovascular issues (93.9%), diabetes (69.9%), and hypertension (67.9%) being particularly prominent. Lifestyle factors such as obesity (83.3%) and tobacco use (92.4%) were also common, and immunosuppression was present in 96.3% of cases. By contrast, asthma (1.7%) was rare. These findings underscore the impact of underlying health conditions and behaviors on the severity of COVID-19 and can help healthcare providers prioritize resources and design interventions for individuals at higher risk of severe outcomes (Table 2).

More than half of the patients required critical interventions: 54.5% needed ventilator support and 51.1% were admitted to the ICU, indicating that a significant proportion experienced severe COVID-19 symptoms. Furthermore, 53.4% of patients tested positive for COVID-19 (Table 3).

The evaluation of the four machine learning models (Decision Tree, Logistic Regression, Random Forest, and AdaBoost) on the held-out test set revealed varied performance across metrics (Table 4). Logistic Regression and AdaBoost were the most accurate models (77.00%). AdaBoost had the highest precision (79.84%) and specificity (91.75%), making it particularly effective at minimizing false positives; however, its sensitivity (63.11%) was the lowest, indicating limitations in identifying true positives. Logistic Regression offered the best balance, achieving a high F1-score (76.84%) and the highest sensitivity (67.96%). Random Forest performed solidly (accuracy: 74.50%), with good specificity (82.47%) but slightly lower sensitivity (66.99%). The Decision Tree had moderate metrics, with accuracy, precision, and F1-score around 66.50%, indicating weaker predictive ability overall. For applications prioritizing sensitivity and balanced performance, Logistic Regression is the best choice, while AdaBoost is better suited to scenarios where minimizing false positives is critical.

Figure 3 compares the confusion matrices of all four models, among which AdaBoost showed the best performance. Figure 4 shows the area under the curve (AUC) for each model: Random Forest reached an AUC of 82.07%, and AdaBoost improved on this with an AUC of 84.02%.

The 10-fold cross-validation results in Figure 5 show that LR and AdaBoost again performed best among the models tested. LR achieved an accuracy of 75.40%, with precision, recall, and F1 scores close to 76% and an AUC of 0.7846, indicating strong classification performance. AdaBoost showed slightly better accuracy at 75.50%, with a precision of 77.46% and an AUC of 0.7946, suggesting that it discriminates well between classes, especially when precision is prioritized. Random Forest performed well, with an accuracy of 72.40% and an AUC of 0.7799, but its recall and F1 score were slightly lower than those of LR and AdaBoost, suggesting room for improvement in capturing true positive cases. The Decision Tree, while simpler, had the lowest performance, with an accuracy of 65.80% and an AUC of 0.6586, indicating limited ability to generalize across data folds compared with the more complex models. Overall, LR and AdaBoost show the most balanced performance, with AdaBoost slightly outperforming LR in precision and AUC.

The feature importance and statistical significance results in Table 5 indicate that certain medical conditions and patient characteristics play a significant role in predicting the target variable. Tobacco (coefficient 0.4985) and Hypertension (coefficient -0.5242) have strong relationships with the target, suggesting that smoking and hypertension are crucial factors in the model’s predictions. Both also have p-values reported as 0.0000 (effectively p < 0.0001), indicating that they are statistically significant. Conversely, variables such as Age, Intu bed, and Patient Type show weaker associations with the target, with higher p-values suggesting limited influence in the model. The permutation test supports this: Tobacco and Hypertension contribute most to the model’s performance, further highlighting their importance.
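Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below applies this to the two coefficients quoted from Table 5. It is only an illustration: the interpretation depends on how each feature was coded and scaled before fitting, which the article does not specify, so the direction and size of the effects should be read with that caveat.

```python
import math

# Coefficients as quoted in the article's discussion of Table 5.
coefficients = {"Tobacco": 0.4985, "Hypertension": -0.5242}

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in the feature, other features held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

Under these assumptions, a one-unit increase in the Tobacco feature multiplies the odds by roughly 1.65, while a one-unit increase in the Hypertension feature multiplies them by roughly 0.59.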

Overall, the model indicates that risk factors such as tobacco use and hypertension, along with obesity and cardiovascular issues, are significant predictors. Features with high p-values, such as Patient Type and Intu bed, show limited importance in this particular classification task. These findings suggest that focusing on factors like tobacco use and hypertension would yield the most meaningful insights for improving the model’s predictive accuracy.
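The permutation importance used in this analysis measures how much a model's score drops when one feature's values are shuffled. The sketch below shows the idea on toy data. The helper names, the toy model, and the synthetic data are all hypothetical stand-ins and are not taken from the study.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a fitted classifier: uses feature 0 only."""
    return int(row[0] > 0.5)


# Toy data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

importances = permutation_importance(toy_model, rows, labels)
```

Shuffling the informative feature produces a large drop in accuracy, while shuffling the noise feature leaves accuracy unchanged, which is the pattern one would expect for features with high and low importance respectively.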

Conclusion

This study comprehensively analyzes COVID-19 patient characteristics, comorbidities, resource utilization, and predictive modeling. The dataset highlights a significant burden of severe outcomes, with the majority of patients requiring hospitalization and critical care. Comorbidities such as COPD, chronic renal disease, cardiovascular issues, diabetes, and hypertension were prevalent, underscoring their critical role in exacerbating the severity of COVID-19. Moreover, high hospitalization rates and reliance on advanced medical units emphasize the pandemic’s strain on healthcare systems. Among machine learning models, Logistic Regression and AdaBoost showed the best predictive performance, with AdaBoost excelling in precision and specificity, and Logistic Regression demonstrating the highest sensitivity and balanced metrics. These findings offer valuable insights for healthcare providers, policymakers, and researchers to improve resource allocation and patient outcomes.

Limitations

Despite its contributions, this study has several limitations. First, it relies on retrospective data, which may introduce biases or incomplete information. Second, some features, such as lifestyle factors, were self-reported and could be prone to inaccuracies. Third, the performance of the machine learning models could be further improved with more advanced algorithms or larger datasets. Finally, the study does not account for the evolving nature of COVID-19, including new variants and treatment protocols, which may influence outcomes in future datasets. Addressing these limitations in future research can enhance the applicability and robustness of the findings.

  1. Ciotti M, Ciccozzi M, Terrinoni A, Jiang WC, Wang CB, Bernardini S. The COVID-19 pandemic. Crit Rev Clin Lab Sci. 2020;57(6):365-388. Available from: https://doi.org/10.1080/10408363.2020.1783198
  2. Suryasa IW, Rodríguez-Gámez M, Koldoris T. The COVID-19 pandemic. Int J Health Sci. 2021;5(2):6-9. Available from: https://doi.org/10.53730/ijhs.v5n2.2937
  3. Daniel SJ. Education and the COVID-19 pandemic. Prospects. 2020;49(1):91-96. Available from: https://link.springer.com/article/10.1007/s11125-020-09464-3
  4. Baloch S, Baloch MA, Zheng T, Pei X. The Coronavirus Disease 2019 (COVID-19) Pandemic. Tohoku J Exp Med. 2020;250(4):271-278. Available from: https://doi.org/10.1620/tjem.250.271
  5. Omer SB, Malani P, Del Rio C. The COVID-19 pandemic in the US: a clinical update. JAMA. 2020;323(18):1767-1768. Available from: https://jamanetwork.com/journals/jama/fullarticle/2764366
  6. Nour TY, Altintaş KH. Effect of the COVID-19 pandemic on obesity and its risk factors: A systematic review. BMC Public Health. 2023;23(1):1018. Available from: https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-023-15833-2
  7. Jordan RE, Adab P, Cheng KK. Covid-19: risk factors for severe disease and death. BMJ. 2020;368:m1198. Available from: https://doi.org/10.1136/bmj.m1198
  8. Chen Q, Li W, Xiong J, Zheng X. Prevalence and risk factors associated with postpartum depression during the COVID-19 pandemic: a literature review and meta-analysis. Int J Environ Res Public Health. 2022;19(4):2219. Available from: https://doi.org/10.3390/ijerph19042219
  9. Woo HG, Park S, Yon H, Lee SW, Koyanagi A, Jacob L, Smith L, et al. National trends in sadness, suicidality, and COVID-19 pandemic–related risk factors among South Korean adolescents from 2005 to 2021. JAMA Netw Open. 2023;6(5):e2314838. Available from: https://doi.org/10.1001/jamanetworkopen.2023.14838
  10. Cena H, Fiechtner L, Vincenti A, Magenes VC, De Giuseppe R, Manuelli M, et al. COVID-19 pandemic as risk factors for excessive weight gain in pediatrics: the role of changes in nutrition behavior. A narrative review. Nutrients. 2021;13(12):4255. Available from: https://doi.org/10.3390/nu13124255
  11. van Loon AW, Creemers HE, Vogelaar S, Miers AC, Saab N, Westenberg PM, et al. Prepandemic risk factors of COVID‐19‐related concerns in adolescents during the COVID‐19 pandemic. J Res Adolesc. 2021;31(3):531-545. Available from: https://doi.org/10.1111/jora.12651
  12. Wolff D, Nee S, Hickey NS, Marschollek M. Risk factors for Covid-19 severity and fatality: a structured literature review. Infection. 2021;49:15-28. Available from: https://doi.org/10.1007/s15010-020-01509-1
  13. Zhang J, Wang X, Jia X, Li J, Hu K, Chen G, et al. Risk factors for disease severity, unimprovement, and mortality in COVID-19 patients in Wuhan, China. Clin Microbiol Infect. 2020;26(6):767-772. Available from: https://doi.org/10.1016/j.cmi.2020.04.012
  14. Heidari A, Jafari Navimipour N, Unal M, Toumaj S. Machine learning applications for COVID-19 outbreak management. Neural Comput Appl. 2022;34(18):15313-15348. Available from: https://link.springer.com/article/10.1007/s00521-022-07424-w
  15. Syeda HB, Syed M, Sexton KW, Syed S, Begum S, Syed F, et al. Role of machine learning techniques to tackle the COVID-19 crisis: systematic review. JMIR Med Inform. 2021;9(1):e23811. Available from: https://doi.org/10.2196/23811
  16. Kwekha-Rashid AS, Abduljabbar HN, Alhayani B. Coronavirus disease (COVID-19) cases analysis using machine-learning applications. Appl Nanosci. 2023;13(3):2013-2025. Available from: https://link.springer.com/article/10.1007/s13204-021-01868-7
  17. Muhammad LJ, Algehyne EA, Usman SS, Ahmad A, Chakraborty C, Mohammed IA. Supervised Machine Learning Models for Prediction of COVID-19 Infection using Epidemiology Dataset. SN Comput Sci. 2021;2(1):11. Available from: https://doi.org/10.1007/s42979-020-00394-7
  18. Alaufi R, Kalkatawi M, Abukhodair F. Challenges of deep learning diagnosis for COVID-19 from chest imaging. Multimed Tools Appl. 2024;83(5):14337-14361. Available from: https://link.springer.com/article/10.1007/s11042-023-16017-1
  19. Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict COVID-19 infection. Chaos Solitons Fractals. 2020;140:110120. Available from: https://doi.org/10.1016/j.chaos.2020.110120
  20. John CC, Ponnusamy V, Krishnan Chandrasekaran S, R N. A Survey on Mathematical, Machine Learning and Deep Learning Models for COVID-19 Transmission and Diagnosis. IEEE Rev Biomed Eng. 2022;15:325-340. Available from: https://doi.org/10.1109/rbme.2021.3069213
  21. Ahmad A, Garhwal S, Ray SK, Kumar G, Malebary SJ, Barukab OM. The number of confirmed cases of COVID-19 by using machine learning: Methods and challenges. Arch Comput Methods Eng. 2021;28:2645-2653. Available from: https://doi.org/10.1007/s11831-020-09472-8
  22. Liu Q, Nair R, Huang R, Zhu H, Anderson A, Belen O, et al. Using machine learning to determine a suitable patient population for anakinra for the treatment of COVID‐19 under the emergency use authorization. Clin Pharmacol Ther. 2024;115(4):890-895. Available from: https://doi.org/10.1002/cpt.3191
  23. Patil D, Rane N, Desai P, Rane J. Machine learning and deep learning: Methods, techniques, applications, challenges, and future research opportunities. Trustworthy Artificial Intelligence in Industry and Society. 2024:28-81. Available from: http://dx.doi.org/10.70593/978-81-981367-4-9_2
  24. Modhugu VR, Ponnusamy S. Comparative analysis of machine learning algorithms for liver disease prediction: SVM, logistic regression, and decision tree. Asian J Res Comput Sci. 2024;17(6):188-201. Available from: https://journalajrcos.com/index.php/AJRCOS/article/view/467
  25. Zhou X, Zhang J, Deng XM, Fu FM, Wang JM, Zhang ZY, et al. Using random forest and biomarkers for differentiating COVID-19 and Mycoplasma pneumoniae infections. Sci Rep. 2024;14(1):22673. Available from: https://www.nature.com/articles/s41598-024-74057-5
  26. Liu P, Xing Z, Peng X, Zhang M, Shu C, Wang C, et al. Machine learning versus multivariate logistic regression for predicting severe COVID‐19 in hospitalized children with Omicron variant infection. J Med Virol. 2024;96(2):e29447. Available from: https://doi.org/10.1002/jmv.29447
  27. Suliman M, Malik F, Qasim Khan M, Irfan Ullah, Abd Ur Rub. Integrating data augmentation with AdaBoost for effective COVID-19 pneumonia classification. J Comput Biomed Informatics. 2024;7(01):590-605. Available from: https://jcbi.org/index.php/Main/article/view/512
 
