Critical Determinants of COVID-19 Severity and Predictive Modeling for Healthcare Optimization

Hammad Uallah; Rija Ali; Saad Ali; Umair Arif

ISSN: 2455-5282

Global Journal of Medical and Clinical Research Articles

Research Article Open Access Peer-Reviewed

Hammad Uallah¹, Rija Ali¹, Saad Ali¹ and Umair Arif²*

¹General Practitioner at Shifa Medical Complex, Jahanian District, Khanewal, Pakistan
²Lecturer Bio-Statistics, The University of Faisalabad, Faisalabad, Pakistan

Author and article information

*Corresponding author: Umair Arif, Lecturer Bio-Statistics, The University of Faisalabad, Faisalabad, Pakistan, E-mail: [email protected]

doi: 10.17352/2455-5282.000191

Received: 26 December, 2024 | Accepted: 10 January, 2025 | Published: 11 January, 2025

Keywords: COVID-19; Risk factors; Severity prediction; Resource allocation; Machine learning

Cite this as

Uallah H, Ali R, Ali S, Arif U. Critical Determinants of COVID-19 Severity and Predictive Modeling for Healthcare Optimization. Glob J Medical Clin Case Rep. 2025:12(1):004-010. Available from: 10.17352/2455-5282.000191

Copyright License

© 2025 Uallah H, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

The COVID-19 pandemic placed unprecedented strain on global healthcare systems, highlighting the need to identify critical determinants of disease severity and to develop predictive models for resource optimization. This study aimed to identify the most significant factors influencing COVID-19 severity, analyze comorbidity patterns, and develop machine learning models for predicting severe outcomes. Demographic, clinical, and medical history data from a dataset of 1,000 COVID-19 patients were analyzed. Comorbidities such as COPD (96.3%), chronic renal disease (92.6%), cardiovascular issues (93.9%), and diabetes (69.9%) were found to be highly prevalent among severe cases. Over half of the patients required ICU admission (51.1%) or ventilator support (54.5%), indicating the critical impact of severe COVID-19 symptoms on healthcare systems. Four machine learning models (decision tree, logistic regression, random forest, and AdaBoost) were evaluated for predictive accuracy using an 80-20 train-test split and 10-fold cross-validation. On the 80-20 split, AdaBoost and logistic regression were the most effective models, each achieving 77.00% accuracy; AdaBoost had the highest precision (79.84%) and specificity (91.75%), while logistic regression had the highest sensitivity (67.96%), making it the better choice for balanced predictions. Averaged across the 10 folds, accuracy was 65.80% for the decision tree, 72.40% for random forest, 75.40% for logistic regression, and 75.50% for AdaBoost. These findings underscore the importance of comorbidities in determining COVID-19 severity and demonstrate the utility of predictive modeling in optimizing healthcare resources. The study concludes that tailored interventions for high-risk patients and machine learning-driven resource allocation strategies can enhance healthcare efficiency during pandemics.

Main article text

Introduction

The coronavirus disease (COVID-19) pandemic, caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has posed unprecedented challenges to global healthcare systems [1-3]. With its first reported cases in late 2019, the virus quickly spread worldwide, causing widespread illness, death, and socioeconomic disruptions. COVID-19 is primarily a respiratory illness with symptoms ranging from mild fever and cough to severe complications such as Acute Respiratory Distress Syndrome (ARDS) and multi-organ failure [4,5]. While the majority of infected individuals recover with minimal intervention, certain groups are at a significantly higher risk of severe disease and mortality [6]. Identifying these high-risk individuals and understanding the underlying factors driving disease severity remains critical for mitigating the pandemic’s impact [7]. Underlying medical conditions such as cardiovascular disease, diabetes, chronic respiratory conditions, and cancer have been well-documented as risk factors for severe COVID-19 [8]. Similarly, demographic variables like age and sex have also shown strong associations with disease outcomes. Older adults and individuals with pre-existing comorbidities are more likely to experience complications requiring hospitalization, intensive care, or mechanical ventilation. The need for efficient management of limited medical resources such as ICU beds, ventilators, and personnel has underscored the importance of developing predictive tools to aid healthcare providers [9-11]. During the pandemic, healthcare systems globally have faced acute resource shortages, especially during peak infection waves. Timely predictions of resource requirements at the patient level can provide a significant advantage by allowing authorities to allocate limited resources where they are needed most. 
Additionally, understanding the interplay of demographic, clinical, and comorbidity-related factors contributing to COVID-19 severity can improve patient management strategies and public health policies [12,13].

Machine Learning (ML) has emerged as a transformative tool in addressing the challenges posed by the COVID-19 pandemic. By leveraging vast amounts of data, ML enables the identification of patterns, predictions, and insights crucial for effective pandemic response [14]. One of its primary applications in COVID-19 research is predicting disease severity and outcomes, assisting healthcare providers in risk stratification and resource allocation [15,16]. ML models, such as Random Forests, Logistic Regression, and advanced boosting algorithms like AdaBoost and XGBoost, have been widely used to analyze clinical, demographic, and medical history data to predict critical outcomes, including ICU admission, ventilator requirements, and mortality [17].

Another significant application of ML in COVID-19 is in diagnostic processes. Techniques such as Convolutional Neural Networks (CNNs) have been employed to analyze medical imaging data, like chest X-rays and CT scans, offering rapid and accurate diagnosis [18]. Moreover, ML algorithms have been pivotal in analyzing genomic sequences of the virus, aiding in vaccine development, and tracking virus mutations [19,20].

Despite its vast potential, ML in COVID-19 research faces challenges, including data privacy concerns, biases in datasets, and the need for interpretability to ensure clinical trust [21]. However, with continuous advancements, ML holds the promise of revolutionizing pandemic management by enhancing decision-making processes, optimizing healthcare resources, and ultimately improving patient outcomes [22]. It serves as a cornerstone in the fight against current and future public health crises. This research compares multiple algorithms (Decision Tree, Random Forest, Logistic Regression, and AdaBoost) to determine the most effective approach for predicting high-risk cases.

This study aims to identify key factors influencing COVID-19 severity, focusing on the role of comorbidities in shaping outcomes and providing critical insights for clinical care and risk stratification. Using machine learning techniques, it integrates risk factor analysis into a cohesive framework for predicting the likelihood of intensive care or ventilation needs, enabling efficient resource allocation and improved patient outcomes. Unlike previous research focused on isolated factors, this study emphasizes interpretability, so that the findings are statistically robust and practically valuable for clinicians and policymakers in real-time decision-making.

Methodology

The methodology is designed to address the study’s primary and secondary objectives: identifying critical factors influencing the severity of COVID-19 infections and predicting patient resource needs. It provides a structured approach intended to help healthcare providers make data-driven decisions regarding resource allocation.

Data collection and understanding

The dataset used for this research was obtained from Kaggle and comprises records for 1,000 patients, including clinical and demographic details such as age, sex, pre-existing conditions, and hospitalization records. Key features include the presence of comorbidities like diabetes, hypertension, and chronic kidney disease, along with patient outcomes such as ICU admission and ventilator use. This comprehensive dataset provides an opportunity to gain insights into disease progression and improve decision-making in healthcare settings.

The target variable for this study is the classification of COVID-19, which categorizes patients based on the severity of their COVID-19 infection. This will allow the development of a model capable of predicting the severity of the infection.

Data preprocessing and splitting

Data preprocessing is a critical step in ensuring the quality and usability of the dataset for ML models. The following steps were performed. Missing values were handled either by dropping rows with excessive missing data or by imputation: mean or median imputation for numerical features and mode imputation for categorical features. Categorical variables such as sex, classification, and patient type were transformed into numerical values using one-hot encoding or label encoding, depending on the nature of the variable. The target variable was labeled and encoded for modeling purposes. Features with wide ranges, such as age, were normalized using standardization (z-score) or min-max scaling so that all features contribute comparably to the model’s predictions. Because the target variable was imbalanced, oversampling with the Synthetic Minority Over-sampling Technique (SMOTE) was applied. The data were then split into training (80%) and testing (20%) sets for initial model evaluation. Additionally, to further assess the generalization ability and robustness of the models, 10-fold cross-validation was employed, so that each model was evaluated on different subsets of the data, providing a more reliable estimate of performance across various scenarios [23].
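The imputation, standardization, and splitting steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names (`impute`, `z_score`, `train_test_indices`, `k_fold_indices`), the random seed, and the toy values are all hypothetical, and SMOTE and categorical encoding are not covered here.

```python
import random
from statistics import mean, mode, pstdev


def impute(values, numeric=True):
    """Fill None entries with the mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation.

    Assumes the values are not all identical, otherwise sigma is zero.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def train_test_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_indices(n, k=10, seed=42):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Toy values (hypothetical, not taken from the dataset).
ages = impute([34, None, 61, 47], numeric=True)
sex = impute(["F", "M", None, "M"], numeric=False)
scaled_ages = z_score(ages)

# With 1,000 rows this gives an 800/200 split and ten folds of 100 rows each.
train_idx, test_idx = train_test_indices(1000)
folds = list(k_fold_indices(1000, k=10))
```

Each row appears in exactly one validation fold, so averaging a metric over the ten folds uses every patient once for validation.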

Machine learning model development

The study implemented four machine learning models, Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and AdaBoost, to analyze and predict COVID-19 severity outcomes. These models were chosen for their diverse approaches to classification and their ability to handle varied types of data.

Model descriptions

The DT model uses a tree-like structure to make decisions based on input features [24]. It is known for its simplicity and interpretability, making it a suitable baseline model for this study. The RF [25] model is an ensemble technique that builds multiple decision trees and aggregates their predictions to enhance accuracy and reduce overfitting. It is particularly effective in handling complex datasets with diverse features. The LR [26] model is a statistical method used for binary and multi-class classification tasks. It assumes a linear relationship between input features and the log odds of the target variable, offering a robust and interpretable solution. AdaBoost [27] is an ensemble method that combines weak classifiers, typically decision stumps, to create a strong predictive model. It focuses on incorrectly classified samples by assigning higher weights, improving overall accuracy.
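To make the reweighting idea behind AdaBoost concrete, the following sketch performs one round of the classic discrete AdaBoost update for labels in {-1, +1}. The article does not state which AdaBoost variant or implementation was used, so this is an assumption for illustration; the function name `adaboost_round` and the toy labels are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost round for labels in {-1, +1}.

    `weights` must sum to 1. Returns the weak learner's vote `alpha`
    and the renormalized sample weights for the next round.
    """
    # Weighted error of the weak learner on the current distribution.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five equally weighted samples; the weak learner gets the last one wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round([0.2] * 5, y_true, y_pred)
```

After the update, the misclassified sample carries half of the total weight, which is what forces the next weak learner to attend to it.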

Model training and evaluation

Each model was trained using the training dataset, with hyperparameter tuning performed via grid search to optimize performance. Evaluation metrics, including accuracy, precision, F1-score, sensitivity, specificity, and AUC, were calculated on the testing dataset to assess the models. Additionally, feature importance was computed using permutation importance across all models, and statistical significance for features was evaluated using p-values derived from the LR model. These measures provided insights into the contribution and relevance of individual features.
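Permutation importance, as mentioned above, measures how much a model's score drops when one feature column is shuffled. The sketch below is a generic, library-free illustration using accuracy as the score; the toy model, the toy data, and the names `accuracy` and `permutation_importance` are hypothetical and do not reproduce the study's setup.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is noise.
X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [row[0] for row in X]


def toy_model(row):
    """A stand-in classifier that looks only at feature 0."""
    return row[0]


importances = permutation_importance(toy_model, X, y)
```

Here the ignored feature receives an importance of exactly zero, while shuffling the informative feature lowers accuracy and therefore yields a positive importance.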

Experimental results

Of the 1,000 patients included in this study, 570 were male and 430 were female (Figure 1). By age group, 71 patients were aged 1-15 years, 95 were aged 16-30 years, 130 were aged 31-45 years, 285 were aged 46-60 years, and 419 were older than 60 years (Figure 2). The majority of patients (70.1%) required hospitalization, reflecting the significant strain on healthcare resources during the pandemic. Additionally, most patients (68.0%) received care from third-level medical units, which are typically equipped to handle complex and severe cases, underscoring the severity of conditions in this dataset. Conversely, only 29.9% of patients were discharged home without requiring extensive care (Table 1).

The data highlight a high prevalence of comorbidities and risk factors among COVID-19 patients, with conditions such as COPD (96.3%), chronic renal disease (92.6%), cardiovascular issues (93.9%), diabetes (69.9%), and hypertension (67.9%) being particularly prominent. Additionally, lifestyle factors like obesity (83.3%) and tobacco use (92.4%) were common, alongside immunosuppression in 96.3% of cases. Conversely, conditions like asthma (1.7%) were rare. These findings underscore the critical impact of underlying health conditions and behaviors on the severity of COVID-19, providing essential insights for healthcare providers to prioritize resources and design interventions for individuals at higher risk of severe outcomes (Table 2).

More than half of the patients required critical interventions, with 54.5% needing ventilator support and 51.1% admitted to the ICU, indicating that a significant proportion experienced severe COVID-19 symptoms. Furthermore, 53.4% of patients tested positive for COVID-19, demonstrating the widespread impact of the virus within the dataset (Table 3).

The evaluation of the four machine learning models (Decision Tree, Logistic Regression, Random Forest, and AdaBoost) revealed varied performance across metrics. Logistic Regression and AdaBoost emerged as the most accurate models (77.00%), with AdaBoost excelling in precision (79.84%) and specificity (91.75%), making it particularly effective in minimizing false positives. However, its sensitivity (63.11%) was the lowest, indicating limitations in identifying true positives. Logistic Regression offered the best balance, achieving a high F1-score (76.84%) and the highest sensitivity (67.96%), making it reliable for balanced predictions. Random Forest demonstrated solid performance (accuracy: 74.50%) with strengths in specificity (82.47%) but lagged slightly in sensitivity (66.99%). In contrast, the Decision Tree had moderate metrics, with accuracy, precision, and F1-score around 66.50%, indicating weaker predictive ability overall. For applications prioritizing sensitivity and balanced performance, Logistic Regression is the best choice, while AdaBoost is ideal for scenarios where minimizing false positives is critical (Table 4). Figure 3 compares the confusion matrices of all ML models; AdaBoost showed the best performance among them. Figure 4 shows the area under the curve (AUC) for each model used in this study. Random Forest achieved an AUC of 82.07%, and AdaBoost improved on this with an AUC of 84.02%.

The results from the 10-fold cross-validation in Figure 5 show that LR and AdaBoost performed best among the models tested. LR achieved an accuracy of 75.40%, with precision, recall, and F1 scores close to 76%, and an AUC of 0.7846, indicating strong classification performance. AdaBoost showed slightly better accuracy at 75.50%, with a precision of 77.46% and an AUC of 0.7946, suggesting that it excels in discriminating between classes, especially when precision is prioritized. Random Forest performed well, with an accuracy of 72.40% and an AUC of 0.7799, but its recall and F1 score were slightly lower than those of LR and AdaBoost, suggesting room for improvement in capturing true positive cases. Decision Tree, while simpler, exhibited the lowest performance, with an accuracy of 65.80% and an AUC of 0.6586, indicating limited ability to generalize across different data folds compared to more complex models. Overall, LR and AdaBoost show the most balanced performance, with AdaBoost slightly outperforming LR in terms of precision and AUC.

The feature importance and statistical significance results in Table 5 indicate that certain medical conditions and patient characteristics play a significant role in predicting the target variable. Tobacco (coefficient 0.4985) and Hypertension (coefficient -0.5242) have the strongest relationships with the target in the LR model, suggesting that smoking and hypertension are crucial factors in its predictions. Both have p-values reported as 0.0000, indicating they are statistically significant. Conversely, variables like Age, Intu bed, and Patient Type show weaker associations with the target, with higher p-values suggesting limited influence in the LR model. Within the LR model, Tobacco and Hypertension contribute most to the predictions, further highlighting their importance, whereas DT and RF assign their largest permutation importances to Age and Medical Unit.

Table 5: Feature Importance and Statistical Significance Comparison.
Feature No.	Feature	LR	DT	RF	AdaBoost	p-value (Logit)	Coefficient (Logit)
1	Medical Unit	-0.0126	0.1381	0.2329	0.1268	0.7483	-0.0126
5	Pneumonia	-0.0756	0.1107	0.1369	0.0092	0.3875	-0.0756
3	Patient Type	0.0006	0.0599	0.0678	0.0064	0.7808	0.0006
6	Age	-0.0659	0.1725	0.2822	0.0020	0.4944	-0.0659
8	COPD	-0.1498	0.0094	0.0115	0.0018	0.1137	-0.1498
12	Other Disease	-0.0302	0.0123	0.0203	0.0018	0.6549	-0.0302
15	Renal Chronic	0.0778	0.0298	0.0565	0.0013	0.3880	0.0778
0	Usmer	-0.1689	0.0661	0.0713	0.0013	0.0269	-0.1689
4	Intu bed	0.0102	0.0940	0.0978	0.0000	0.8146	0.0102
2	Sex	-0.1027	0.0783	0.0752	0.0000	0.2254	-0.1027
13	Cardiovascular	-0.2688	0.0105	0.0121	0.0000	0.0455	-0.2688
9	Asthma	-0.0646	0.0030	0.0042	0.0000	0.4517	-0.0646
11	Hypertension	-0.5242	0.0659	0.0680	0.0000	0.0000	-0.5242
10	Inmsupr	0.0327	0.0053	0.0115	0.0000	0.6446	0.0327
16	Tobacco	0.4985	0.0170	0.0169	0.0000	0.0000	0.4985
14	Obesity	-0.4730	0.0373	0.0323	0.0000	0.0000	-0.4730
17	ICU	-0.1059	0.0823	0.0864	0.0000	0.2866	-0.1059
7	Diabetes	-0.1510	0.0743	0.0790	-0.0002	0.0614	-0.1510

Overall, the model indicates that Tobacco use and Hypertension, along with Obesity and Cardiovascular issues, are statistically significant predictors. Features with high p-values, such as Patient Type and Intu bed, show limited importance in this particular classification task. These findings suggest that focusing on factors such as tobacco use and hypertension would yield the most meaningful insights for improving the model’s predictive accuracy.
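One common way to read logistic regression coefficients such as those in Table 5 is to exponentiate them into odds ratios. The sketch below does this for the four significant features. It assumes the coefficients are on the log-odds scale; the article does not state how each variable was coded or scaled, so the direction and size of each ratio should be read with that caveat. The helper name `odds_ratio` is hypothetical.

```python
import math

# Logit coefficients copied from Table 5 (features with p < 0.05).
coefficients = {
    "Tobacco": 0.4985,
    "Hypertension": -0.5242,
    "Obesity": -0.4730,
    "Cardiovascular": -0.2688,
}


def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in a feature."""
    return math.exp(beta)


ratios = {name: odds_ratio(beta) for name, beta in coefficients.items()}
for name, ratio in ratios.items():
    print(f"{name}: coefficient {coefficients[name]:+.4f}, odds ratio {ratio:.3f}")
```

A ratio above 1 means the odds of the target class rise as the feature value increases, and a ratio below 1 means they fall; which clinical direction that corresponds to depends on the dataset's encoding of each variable.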

Conclusion

This study comprehensively analyzes COVID-19 patient characteristics, comorbidities, resource utilization, and predictive modeling. The dataset highlights a significant burden of severe outcomes, with the majority of patients requiring hospitalization and critical care. Comorbidities such as COPD, chronic renal disease, cardiovascular issues, diabetes, and hypertension were prevalent, underscoring their critical role in exacerbating the severity of COVID-19. Moreover, high hospitalization rates and reliance on advanced medical units emphasize the pandemic’s strain on healthcare systems. Among machine learning models, Logistic Regression and AdaBoost showed the best predictive performance, with AdaBoost excelling in precision and specificity, and Logistic Regression demonstrating the highest sensitivity and balanced metrics. These findings offer valuable insights for healthcare providers, policymakers, and researchers to improve resource allocation and patient outcomes.

Limitations

Despite its contributions, this study has several limitations. It relies on retrospective data, which may introduce biases or incomplete information. Some features, such as lifestyle factors, were self-reported and could be prone to inaccuracies. The performance of the machine learning models could be further improved with more advanced algorithms or larger datasets. Finally, the study does not account for the evolving nature of COVID-19, including new variants and treatment protocols, which may influence outcomes in future datasets. Addressing these limitations in future research can enhance the applicability and robustness of the findings.


This work is licensed under a Creative Commons Attribution 4.0 International License.

Table 1: Medical Care and Resource Utilization.
Variable	Category	Frequency	%
Medical Unit Level	Unit-1	151	15.1
Medical Unit Level	Unit-2	169	16.9
Medical Unit Level	Unit-3	680	68.0
Type of Care the Patient Received	Returned Home	299	29.9
Type of Care the Patient Received	Hospitalized	701	70.1
Total		1000	100.0

Table 2: Clinical Symptoms and Conditions.
Symptom or condition	Response	Frequency	%
Air Sacs Inflammation	Yes	615	61.5
Air Sacs Inflammation	No	385	38.5
Asthma	No	17	1.7
Asthma	Yes	983	98.3
Diabetes	No	301	30.1
Diabetes	Yes	699	69.9
Hypertension	No	321	32.1
Hypertension	Yes	679	67.9
Chronic Obstructive Pulmonary Disease	No	37	3.7
Chronic Obstructive Pulmonary Disease	Yes	963	96.3
Chronic Renal Disease	No	74	7.4
Chronic Renal Disease	Yes	926	92.6
Heart-related Disease	No	61	6.1
Heart-related Disease	Yes	939	93.9
Obesity	No	167	16.7
Obesity	Yes	833	83.3
Immunosuppressed	No	37	3.7
Immunosuppressed	Yes	963	96.3
Tobacco User	No	76	7.6
Tobacco User	Yes	924	92.4
Other Diseases	No	48	4.8
Other Diseases	Yes	952	95.2
Total		1000	100.0

Table 3: Severity Indicators.
Indicator	Category	Frequency	%
Connected to the Ventilator	Yes	545	54.5
Connected to the Ventilator	No	455	45.5
Admitted to an ICU	Yes	511	51.1
Admitted to an ICU	No	489	48.9
Covid Test Findings	Affected	534	53.4
Covid Test Findings	Not Affected	466	46.6
Total		1000	100.0

Table 4: Comparison of Machine Learning Models.
Model	Accuracy	Precision	F1-score	Sensitivity	Specificity
DT	0.6650	0.6659	0.6650	0.6650	0.6804
LR	0.7700	0.7825	0.7684	0.6796	0.8660
RF	0.7450	0.7535	0.7438	0.6699	0.8247
AdaBoost	0.7700	0.7984	0.7658	0.6311	0.9175
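The metrics in Table 4 are all functions of a binary confusion matrix. The sketch below shows the standard formulas. The article does not report confusion-matrix counts, so the counts used here are hypothetical: they were chosen to be arithmetically consistent with the LR row under the assumption that the 20% test set held 200 patients (103 positive, 97 negative). The support-weighted precision is included because, under those assumed counts, it lines up with the reported precision figure; whether the authors used weighted averaging is not stated.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard classification metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision_pos = tp / (tp + fp)   # precision for the positive class
    precision_neg = tn / (tn + fn)   # precision for the negative class
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": precision_pos,
        "f1": 2 * tp / (2 * tp + fp + fn),
        # Class precisions averaged with class sizes (support) as weights.
        "weighted_precision": ((tp + fn) * precision_pos
                               + (tn + fp) * precision_neg) / total,
    }


# Hypothetical counts, not reported in the article (see the note above).
metrics = binary_metrics(tp=70, fp=13, fn=33, tn=84)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

High specificity with lower sensitivity, as in the AdaBoost row, corresponds to few false positives but more missed positive cases, which is the trade-off discussed in the results.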
