Student Capstone Projects
The capstone is the culminating project for each student in a SACS Master of Science program. The comprehensive, real-life industry-type projects are oriented toward the student’s domain of interest.
Each project includes: formulation of a question to be answered by the data; collection, cleaning and processing of data; choosing and applying a suitable model and/or analytic method to the problem; and communicating the results to a non-technical audience.
Ellen Gentile
M.S. Data Science
Effects of Different Types of State Abortion Laws on Maternal & Infant Mortality Rates
The United States has the highest infant and maternal mortality rates of any comparable developed nation. Among Black women and infants, these rates are even higher. States with restrictive abortion laws have higher mortality rates. There are many types of abortion laws; however, studies on the association of abortion laws with infant and maternal mortality have typically been based on an index of state restrictiveness rather than on specific laws.
In this study we examine the association of specific types of abortion laws with maternal, infant and combined (maternal + infant) mortality rates, as measured per 1,000 births.
The data for the study were gathered and joined from 18 independent sources, including various datasets within the CDC WONDER Database, the U.S. Census and LawAtlas.com. Mann-Whitney U tests were conducted for each law and outcome to assess univariate association. For the multivariate analyses, several regression models, including Linear Regression, Lasso, Ridge, Random Forest Regression and Linear Mixed Effects models, were fit using an 80/20 train-test split, with 5-fold cross-validation and feature selection methods integrated into their pipelines. Model fit was assessed based on R-squared values and root mean squared error (RMSE). Finally, the Mixed Effects Linear Model was used to determine the significance and effect size of predictors and confounders based on p-values and coefficients.
In the univariate analysis, all types of laws were significantly associated with higher maternal, infant and combined mortality, with two exceptions: bans 6 weeks after a woman’s last menstrual period (LMP), which were associated only with significantly higher maternal mortality, and bans 7-14 weeks LMP, which had no association with any of the outcomes.
The multivariate models all fit the combined and infant mortality rates well (R-squared > 0.7) but fit the maternal mortality rates poorly. This indicated that the combined mortality fit was likely driven by infant mortality.
The mixed effects linear model showed that, after accounting for covariates, the only law significantly associated with higher infant and combined mortality rates was a ban on abortions between 15 and 20 weeks LMP (infant: B=0.435, p=0.026; combined: B=0.494, p=0.012). Significant covariates included the natural log of the percent of births paid for by private insurance (infant: B=2.157, p=0.015; combined: B=2.061, p=0.019), the natural log of the percent of births to mothers less than 19 years old (infant: B=1.499, p=0.044; combined: B=1.466, p=0.047), the natural log of the percent of births paid for by self-pay (infant: B=0.380, p=0.007; combined: B=0.380, p=0.007), the percent of births to minority mothers (infant: B=0.041, p<0.001; combined: B=0.040, p<0.001), and the average interval since last other pregnancy outcome (infant: B=-0.066, p=0.019; combined: B=-0.065, p=0.02). It is possible that bans in the 15th-20th week LMP are associated with higher infant mortality because this is approximately the timeframe when a mother can learn whether her growing baby has a fetal abnormality. The inability to terminate pregnancies that are ultimately not viable may result in higher infant mortality rates. Further study should be conducted to examine whether this is, in fact, occurring.
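The two model-fit criteria named in this abstract, R-squared and root mean squared error, can be illustrated with a minimal sketch. The function names and the numbers below are hypothetical placeholders chosen for illustration; they are not the study's code or data.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the outcome."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out mortality rates per 1,000 births (NOT study data).
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]

print(f"R-squared: {r_squared(observed, predicted):.4f}")
print(f"RMSE:      {rmse(observed, predicted):.4f}")
```

A higher R-squared and a lower RMSE on the held-out 20% both point to a better fit; RMSE has the advantage of being expressed in deaths per 1,000 births.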
Bridgett Gordon
M.S. Data Science
Assessing Risk Factors Associated with Maternal Anxiety and Depression During Pregnancy
Maternal mental health (MMH) during pregnancy is becoming an increasingly important aspect of perinatal care. Depression, anxiety, and posttraumatic stress disorder (PTSD) are significant contributors to maternal and child morbidity but are often missed in standard screening. The goal of this capstone project was to use statistical and machine learning methodologies to identify predictors of mental health risk as defined by these three validated instruments: the City Birth Trauma Scale (CBTS); the Edinburgh Postnatal Depression Scale (EPDS); and the Hospital Anxiety and Depression Scale (HADS).
The study uses data from Swiss pregnant people (n=410) and explores the relationships between sociodemographic factors (age, education, gestational weeks, and marital status) and mental health outcomes among those pregnant people. Both standard regression models and ensemble machine learning models contributed to early identification approaches and revealed potential vulnerability patterns.
The project advances this goal through the development and evaluation of predictive models for each outcome. The findings provide insight into the role of sociodemographic factors in psychological well-being.
Andrea Hannah
M.S. Biomedical Data Science
Applying Machine Learning to Ovarian Cancer Predicting Biomarkers
Identifying biomarkers that predict a patient’s risk for Ovarian Cancer is a key factor in the fight to improve survival rates. Ovarian Cancer is a group of diseases that originate in the ovaries, fallopian tubes or peritoneum. It is most treatable at its earliest stages, so early screening and diagnosis are key to successfully treating or curing the disease. This study will use heatmap visualization, the Pearson correlation coefficient, scatterplot visualizations, logistic regression, and existing literature to determine the most important biomarkers in comparison with elevated CA125 levels. Variables of importance identified include Age, Menopause, Human Epididymis Protein 4 (HE4), Alkaline Phosphatase (ALP), and Calcium. Preliminary analysis shows that the variables of interest, except HE4, correspond with elevated CA125 levels and would be biomarkers to pay closer attention to in predicting ovarian cancer with machine learning models. To optimize performance of the prediction model, removal of the non-biomarkers, Age and Menopause, is necessary. Menopause is a nominal category that could still decrease performance even if it is cleaned and converted to numeric form.

Alyshondria Hicks
M.S. Biomedical Data Science
Automated Detection of Cardiovascular Disease from Heart Sounds Using Machine Learning
Cardiovascular disease (CVD) is one of the leading causes of death worldwide, making early detection very important. Although doctors often listen to heart sounds during routine exams, there is still no consistent and reliable tool that can automatically interpret these sounds to help detect disease early. This project proposes a new machine learning system that uses heart sound recordings, also called phonocardiograms (PCG), as the main input to identify whether a person may have cardiovascular disease. Publicly available, de-identified heart sound datasets will be used to train and test the model. Deep learning models, such as convolutional neural networks using mel-spectrogram features and long short-term memory (LSTM) networks, will be used. Model performance will be measured using accuracy, sensitivity, specificity, F1-score, and AUC-ROC, since these values reflect how well the system can detect true disease cases while avoiding unnecessary false alarms. The results of this research may guide future improvements, such as using larger datasets, combining heart sounds with other medical signals, and testing the system in real clinical settings. With further validation, this approach could lead to affordable diagnostic tools that can be used in digital stethoscopes or telehealth systems to improve access to early cardiovascular screening.
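As a rough illustration of the evaluation metrics listed in this abstract, the sketch below derives accuracy, sensitivity, specificity and F1-score from confusion-matrix counts. The function name `screening_metrics` and the counts are hypothetical; the abstract describes planned work and reports no results.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts.

    tp/fn: diseased recordings flagged / missed by the model
    tn/fp: healthy recordings cleared / falsely flagged by the model
    """
    sensitivity = tp / (tp + fn)  # share of true disease cases detected
    specificity = tn / (tn + fp)  # share of healthy cases correctly cleared
    precision = tp / (tp + fp)    # share of alarms that are real
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for 100 heart sound recordings (NOT project results).
metrics = screening_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity tracks missed disease cases and specificity tracks false alarms, which is why the two are reported separately rather than folded into accuracy alone.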

Mikel Houston
M.S. Data Science
Securing AI-Enabled Wearable Devices and Neurostimulators with Quantum Key Distribution and Post-Quantum Cryptography
Bio
Mikel Houston is a Master of Science in Data Science student at Meharry Medical College with a background in biology and a research focus at the intersection of healthcare, cybersecurity, and emerging quantum technologies. Her work centers on developing secure, data-driven systems using machine learning, big data frameworks, and cloud computing. She is particularly interested in quantum-resilient security for healthcare IoT environments, where sensitive patient data must be protected against evolving computational threats. Mikel aims to contribute to advancements in secure healthcare infrastructure through interdisciplinary research combining data science, artificial intelligence, and post-quantum cryptography.
Abstract
Mikel Houston’s capstone explores quantum-resilient security frameworks for healthcare IoT systems by combining Quantum Key Distribution (QKD) with Post-Quantum Cryptography (PQC). The project simulates secure communication between AI-enabled wearable devices, edge gateways, and cloud infrastructures, evaluating performance across key metrics including latency, scalability, and resilience against both classical and quantum attacks. A comparative analysis of classical encryption, PQC-only, and hybrid QKD-PQC approaches highlights trade-offs between efficiency and security. This research informs the design of next-generation cybersecurity architectures, ensuring robust protection of sensitive healthcare data in increasingly connected and intelligent medical environments.
Advisor: Uttam Ghosh, Ph.D.

Jalen Jackson
M.S. Data Science
A Machine Learning Framework for Predicting Pulmonary Sarcoidosis Using Multi-Cohort Gene Expression
Abstract
Sarcoidosis is a multisystem inflammatory disease characterized by granuloma formation and variable clinical presentations. Accurate diagnosis is often difficult because its symptoms overlap with other inflammatory and pulmonary conditions and because reliable non-invasive biomarkers are lacking. This study aimed to improve molecular classification of pulmonary sarcoidosis using machine learning applied to integrated multi-cohort gene expression data. Previous sarcoidosis studies relied on single-cohort datasets with limited generalizability. Three publicly available Gene Expression Omnibus datasets (GSE18781, GSE42834, and GSE83456) were combined and harmonized, yielding a final dataset of 398 samples (229 healthy controls and 169 sarcoidosis cases), with 16,622 shared genes retained for modeling. The analytical workflow included data preprocessing, batch correction, low-variance gene filtering, statistical feature selection, feature importance analysis, scaling, principal component analysis (PCA), and classification modeling. Six algorithms were evaluated (Logistic Regression, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors, Naive Bayes, and Gradient Boosting) using accuracy and ROC-AUC as primary metrics, and precision, recall, and F1-score for detailed assessment of top models. Random Forest achieved the highest performance (accuracy 0.76; ROC-AUC 0.85), followed closely by SVM. PCA revealed partial separation between sarcoidosis and control samples, supporting moderate discriminatory power. Feature importance analysis highlighted genes involved in immune and inflammatory pathways. These findings demonstrate that integrating multi-cohort gene expression data enhances the robustness and generalizability of machine learning models for sarcoidosis classification.
Advisor: Naw Safrin Sattar, Ph.D.
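One step of the workflow described in this abstract, low-variance gene filtering, can be sketched as follows. This is an illustrative assumption about how such a filter might look, not the author's implementation: the gene names, expression values and variance threshold are all hypothetical.

```python
from statistics import pvariance


def filter_low_variance(expression, threshold):
    """Keep only genes whose expression variance across samples exceeds threshold.

    expression: dict mapping gene name -> list of expression values,
                one value per sample.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if pvariance(values) > threshold
    }


# Hypothetical expression values for four samples (NOT study data).
expression = {
    "GENE_A": [5.0, 5.0, 5.1, 5.0],  # nearly constant, carries little signal
    "GENE_B": [2.0, 8.0, 3.0, 9.0],  # varies across samples
}

kept = filter_low_variance(expression, threshold=0.1)
print(sorted(kept))
```

Genes that barely vary across samples cannot help separate cases from controls, so removing them shrinks the feature space before feature selection and modeling.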
Caroline Jena
M.S. Data Science
Integrating Clinical Characteristics to Predict ICU Mortality in Sepsis Patients: A Comprehensive Approach
The increasing sepsis mortality rate remains a critical global health concern, particularly among intensive care unit (ICU) patients. Early and accurate prediction of sepsis outcomes is therefore crucial for guiding timely clinical interventions and reducing preventable deaths. This study aimed to investigate a comprehensive machine learning (ML) approach to accurately predict mortality risk using clinical characteristics, such as laboratory results and underlying comorbidities, drawn from Medical Information Mart for Intensive Care-IV (MIMIC-IV) data collected during the first 48 hours of ICU admission. To predict outcomes, the study utilized several ML methods: Logistic Regression, Random Forest and XGBoost. We analyzed data from 16,994 septic patients who were admitted to the ICU. Model performance was assessed using a combination of cross-validation and test set metrics, including ROC AUC, average precision, and accuracy. The XGBoost model demonstrated the best overall performance, achieving a test ROC AUC of 0.86, average precision of 0.76, and accuracy of 80%. Logistic Regression and Decision Tree models showed moderate predictive capability with lower AUC and precision scores. The results confirmed that ensemble methods, particularly XGBoost, offer superior performance in predicting ICU mortality among sepsis patients in contrast to commonly used traditional scoring tools like SOFA. These findings highlight the potential of machine learning to aid in early risk stratification and informed clinical decision-making in the management of sepsis.
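ROC AUC, the headline metric in this abstract, can be read as the probability that a randomly chosen patient who died receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it directly from that definition; the function name `roc_auc` and the six labels and scores are hypothetical, not MIMIC-IV data.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 for the positive class (here, died), 0 for the negative class.
    scores: predicted risk for each patient, higher meaning more at risk.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted mortality risks for six patients (NOT study data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]

print(f"ROC AUC: {roc_auc(labels, scores):.3f}")
```

The pairwise form is quadratic in the number of patients, so it suits small illustrations only; for a cohort of thousands, a rank-based implementation from an established library would be the practical choice.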
Dodji Kuwonu
M.S. Data Science
Investigating the Impact of Socioeconomic and Environmental Factors on Children's Mental Health
Children’s mental health is increasingly recognized as vital to overall health, yet many factors beyond biological predisposition influence emotional, psychological, and social outcomes. Using nationally representative data from the National Survey of Children’s Health (NSCH), this study explores the extent to which socioeconomic and environmental factors contribute to disparities in mental health outcomes such as anxiety, depression, ADHD, and autism. Logistic regression and Random Forest were applied to assess and predict risk factors. Findings indicate that disparities in income, neighborhood safety, healthcare access, and parental education significantly impact mental health. This research provides evidence-based recommendations to promote mental health equity among children.
