Student Capstone Projects
The capstone is the culminating project for each student in the M.S. Data Science and M.S. Biomedical Data Science programs. The comprehensive, real-life industry-type projects are oriented toward the student’s domain of interest.
Each project includes: formulation of a question to be answered by the data; collection, cleaning and processing of data; choosing and applying a suitable model and/or analytic method to the problem; and communicating the results to a non-technical audience.

Ashley, Saul
M.S. Biomedical Data Science
Development of Anxiety in Breast Cancer Patients Undergoing Therapy: A Preliminary Study Using NIH All of Us Data
Sources report that 1 in 8 women in the United States will be diagnosed with breast cancer in their lifetime. Furthermore, the long-term mental health outcomes of radiation therapy and chemotherapy, both common practices in oncology, are a significant concern for cancer patients. Research shows that the global prevalence of anxiety among cancer patients is 17%–69%, compared with 31.9% among the general population, implying that anxiety may be more prevalent among cancer patients. More specifically, Living Beyond Breast Cancer (LBBC) states that more than 40% of people diagnosed with breast cancer experience anxiety. Even survivors may face long-term psychological effects such as anxiety, depression, and a fear of cancer recurrence, which can impact their day-to-day life. Finally, the American Cancer Society (ACS) states that some cancer treatments can cause cognitive effects in current patients and survivors, such as “chemo brain” or brain fog, which may lead to difficulties with concentration, memory, and multitasking. Using NIH All of Us data, this study applies various regression techniques to examine the associations between cancer therapies and procedures and anxiety in breast cancer patients.

Theodore, Jean-Hus
M.S. Biomedical Data Science
Computational Methods for Novel Antibiotic Drug Discovery for Caseinolytic Peptidase P (ClpP)
Caseinolytic peptidase P (ClpP) is a multimeric serine protease found in many prokaryotes, as well as in the mitochondria of eukaryotic cells and in chloroplasts. In prokaryotes, ClpP is essential for maintaining protein homeostasis, and its disruption can affect the virulence and infectivity of various pathogens [1]. In eukaryotes, particularly in humans, ClpP plays a crucial role in protein quality control by degrading denatured and misfolded proteins, thereby preserving the integrity of the respiratory chain and sustaining oxidative phosphorylation [2]. Due to these multifaceted functions, ClpP is an attractive target for both anticancer and antibiotic drug development. Although several compounds, such as acyldepsipeptides (ADEP), have been developed to target ClpP, bacterial resistance continues to pose a significant challenge to effective antibiotic therapy [3]. In this project, ligand-based and structure-based computational drug discovery approaches will be used to identify novel antibiotic candidates that target ClpP. First, chemical bioassays will be used to curate the biochemical data available for ClpP from different species, including prokaryotes and eukaryotes. These datasets will be used to train machine learning models that predict the bioactivity of unknown compounds. Finally, these models will be applied to chemical databases, including Enamine and Mcule [4, 5], to predict the activity of untested compounds against ClpP. The modeling will focus on two tasks: identifying molecules that are bioactive against ClpP, and predicting IC50, a biochemical descriptor used to determine the activity of a molecule. Overall, these approaches will make it possible to screen for chemical compounds active against ClpP.

Edwards, Ariel
M.S. Biomedical Data Science
Exploring the Connection Between Anxiety and Lung Cancer Diagnosis Through Data Analysis
This capstone project explores the relationship between anxiety and lung cancer diagnosis, addressing a gap in research where psychological factors have often been overlooked. Using survey data from Kaggle and the UCI Machine Learning Repository, I built a logistic regression model to see if people who report experiencing anxiety are more likely to also report a lung cancer diagnosis. After accounting for factors like age, gender, smoking habits, and chronic disease, the results demonstrated that individuals reporting anxiety were over twice as likely to report having lung cancer. Overall, the model demonstrated solid performance, with consistent results observed across key demographic and behavioral subgroups, including smokers and individuals over 60. These findings suggest that mental health, particularly anxiety, may play a larger role in physical health outcomes than we often consider. By bringing psychological factors into the conversation, this research encourages a more holistic approach to how we think about cancer risk.
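To illustrate how an adjusted odds ratio of this kind comes out of a logistic regression, here is a minimal sketch. It is not the project's code or data: the counts are hypothetical, the only adjustment variable is smoking, and the plain gradient-ascent fitter stands in for whatever statistical software was actually used. The function and variable names are illustrative assumptions.

```python
import math

# Hypothetical aggregated survey counts (NOT the project's data):
# (anxiety, smoker, respondents reporting lung cancer, respondents not reporting it)
CELLS = [
    (0, 0, 10, 90),
    (1, 0, 20, 90),
    (0, 1, 30, 90),
    (1, 1, 60, 90),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(cells, learning_rate=1.0, steps=5000):
    """Fit logit(p) = b0 + b1*anxiety + b2*smoker by gradient ascent
    on the mean log-likelihood of the aggregated counts."""
    total = sum(events + non_events for _, _, events, non_events in cells)
    coef = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for anxiety, smoker, events, non_events in cells:
            features = (1.0, float(anxiety), float(smoker))
            p = sigmoid(sum(b * x for b, x in zip(coef, features)))
            # Observed events minus expected events in this cell
            residual = events - (events + non_events) * p
            for j, x in enumerate(features):
                grad[j] += residual * x / total
        coef = [b + learning_rate * g for b, g in zip(coef, grad)]
    return coef


intercept, b_anxiety, b_smoker = fit_logistic(CELLS)

# Exponentiating a coefficient gives the odds ratio for that predictor,
# holding the other predictor fixed.
adjusted_or_anxiety = math.exp(b_anxiety)
print(f"Adjusted odds ratio for anxiety: {adjusted_or_anxiety:.2f}")
```

The hypothetical counts were chosen so that the odds of reporting lung cancer double with anxiety in both the smoker and non-smoker groups, so the fitted odds ratio for anxiety should come out close to 2. Real survey data would not fit this cleanly, and a real analysis would also report confidence intervals.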

Gordon, Bridgett
M.S. Data Science
Assessing Risk Factors Associated with Maternal Anxiety and Depression During Pregnancy
Maternal mental health (MMH) during pregnancy is becoming an increasingly important aspect of perinatal care. Depression, anxiety, and posttraumatic stress disorder (PTSD) are significant contributors to maternal and child morbidity but are often missed in standard screening. The goal of this capstone project was to use statistical and machine learning methodologies to identify predictors of mental health risk as defined by these three validated instruments: the City Birth Trauma Scale (CBTS); the Edinburgh Postnatal Depression Scale (EPDS); and the Hospital Anxiety and Depression Scale (HADS).
The study uses data from Swiss pregnant people (n=410) and explores the relationships between sociodemographic factors (age, education, gestational weeks, and marital status) and mental health outcomes among those pregnant people. Both standard regression models and ensemble machine learning models contributed to early identification approaches and revealed potential vulnerability patterns.
The project advances this goal through the development and evaluation of predictive models for each outcome. The findings provide insight into the role of sociodemographic factors in psychological well-being.

Sarder, Uma
M.S. Data Science
Impact of Socioeconomic Status on Mental Health Disorders among Pregnant Women Using NIH All of Us Data
Maternal mental health (MMH) conditions, particularly depression and anxiety, are crucial to the well-being of pregnant women and affect 1 in 5 pregnant individuals every year in the U.S. Socioeconomic status (SES) is increasingly recognized as a key factor influencing MMH outcomes. Understanding the complex interplay between SES and MMH is crucial for developing effective preventive measures and treatment strategies aimed at reducing maternal and fetal mortality. Our study used data from the All of Us research program to examine the relationship between SES factors and MMH, especially depression and anxiety. We further developed predictive models for early prediction of MMH conditions based on socioeconomic factors. Our findings demonstrate the potential of statistical and machine learning approaches to uncover risk factors and enhance early detection strategies, which could contribute to improving maternal and fetal health outcomes.

Dike, Judith
M.S. Data Science
The Impact of Stress and Anxiety on Cardiovascular Outcomes in Thyroid Dysfunction
Background: Thyroid dysfunction affects cardiovascular health through metabolic alterations. However, the effects of psychological stress and anxiety remain understudied. This study examines how chronic stress and anxiety influence cardiovascular disease (CVD) in patients with thyroid disorders.
Methods: We analyzed electronic health records (EHRs) from patients aged 18-80 years diagnosed with thyroid dysfunction between January 2018 and July 2022 in the All of Us research program. Predictors included stress and anxiety (ICD-10 codes), while outcomes were CVD diagnoses (CHF, COPD, arrhythmias). Using logistic regression adjusted for age and sex, we assessed associations between mental health conditions and CVD, ensuring thyroid diagnosis preceded cardiovascular events. Models were stratified by sex to examine effect modification.
Results: The cohort of 27,441 people was 78.9% female and 19.1% male, with a mean age of 58.6 years (SD 12.4). Racial/ethnic composition included: 65.5% White, 15.5% Hispanic/Latino, 11.3% Black, 2.5% Asian, 1.2% Other, 1.2% multiracial, and 2.9% unknown. CVD prevalence was 24.5% overall, with arrhythmias (20.3%, p<0.001), CHF (5.3%, p<0.001), and COPD (4.8%, p<0.001) being most common. Patients with mental health conditions had 2.06 times higher odds of CVD (95% CI: 1.93-2.18) than those without. The association was stronger in women (OR=2.14, CI:2.00-2.29) than men (OR=1.80, CI:1.62-2.00), with a significant interaction (p=0.01). Each additional year of age was associated with a 2.1% increase in CVD odds (OR=1.021, CI:1.018-1.022).
Conclusion: Thyroid patients with stress or anxiety had higher odds of CVD, especially women (2.1 times higher odds vs 1.8 times in men). Since 1 in 4 thyroid patients developed CVD, mental health screening may help reduce risk.
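As a reading aid for the odds ratios reported above, the sketch below shows two standard calculations: an odds ratio with a Wald 95% confidence interval from a 2x2 table, and the conversion of a per-year odds ratio into a percent change in odds. It is a simplified illustration only. The study's estimates come from logistic regression adjusted for age and sex, whereas the 2x2 calculation here is unadjusted, and the counts and function names are hypothetical.

```python
import math


def odds_ratio_with_ci(exposed_cases, exposed_noncases,
                       unexposed_cases, unexposed_noncases, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.
    z=1.96 corresponds to a 95% interval."""
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases
    )
    # Standard error of log(OR): square root of the summed reciprocal counts
    se_log_or = math.sqrt(
        1 / exposed_cases + 1 / exposed_noncases
        + 1 / unexposed_cases + 1 / unexposed_noncases
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def percent_change_in_odds(odds_ratio, units=1):
    """Percent change in odds for a given number of units of the predictor.
    Odds ratios multiply across units, so `units` is applied as an exponent."""
    return (odds_ratio ** units - 1.0) * 100.0


# Hypothetical counts (NOT from the study): CVD vs no CVD,
# with and without a stress/anxiety diagnosis.
or_value, ci_lower, ci_upper = odds_ratio_with_ci(400, 600, 250, 750)
print(f"OR = {or_value:.2f} (95% CI: {ci_lower:.2f}-{ci_upper:.2f})")

# The abstract's per-year age effect, OR = 1.021, read as a percent change.
print(f"Per year of age: {percent_change_in_odds(1.021):.1f}% higher odds")
print(f"Per decade of age: {percent_change_in_odds(1.021, units=10):.1f}% higher odds")
```

The per-decade figure is larger than ten times the per-year figure because odds ratios compound multiplicatively rather than adding.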

Kuwonu, Dodji
M.S. Data Science
Investigating the Impact of Socioeconomic and Environmental Factors on Children's Mental Health
Children’s mental health is increasingly recognized as vital to overall health, yet many external factors beyond biological predispositions influence emotional, psychological, and social outcomes. Using nationally representative data from the National Survey of Children’s Health (NSCH), this study explores the extent to which socioeconomic and environmental factors contribute to disparities in mental health outcomes such as anxiety, depression, ADHD, and autism. Logistic regression and Random Forest were applied to assess and predict risk factors. Findings indicate that disparities in income, neighborhood safety, healthcare access, and parental education significantly impact mental health. This research provides evidence-based recommendations to promote mental health equity among children.

Boykin, Hannah
M.S. Data Science
Predicting ICU Admissions using Machine Learning
The COVID-19 pandemic put immense pressure on ICUs. Data-driven strategies are needed to improve patient management, yet challenges remain in predicting who will require ICU care, who may not survive despite admission, and how long survivors will stay. This study aims to build on prior work by identifying key clinical markers and evaluating the accuracy of predictive models for ICU admission, mortality, and length of stay before patients require critical care. Expanding on a previous study that used data from 733 hospitalized COVID-19 patients at a single institution, this research incorporates a larger and more diverse dataset from multiple hospitals to improve generalizability. Demographic, clinical, and laboratory data will be analyzed to enhance model reliability, addressing past concerns about sample size limitations, data imbalance, and the lack of external validation. Machine learning models will assess ICU risk from the collected clinical data, allowing for early identification of high-risk patients. To improve fairness and accuracy, dataset-balancing strategies will be applied so that predictions remain reliable across different patient groups. The goal is to develop an interpretable and practical decision-support tool for ICU planning and resource allocation. By refining prediction methods and incorporating broader hospital data, this study aims to strengthen ICU management strategies. Future directions will focus on validating model performance across healthcare systems, addressing ethical considerations in ICU prioritization, and integrating predictions into clinical decision-making to support proactive patient care and resource efficiency.