Student Capstone Projects

The capstone is the culminating project for each student in a SACS Master of Science program. The comprehensive, real-life industry-type projects are oriented toward the student’s domain of interest.

Each project includes: formulation of a question to be answered by the data; collection, cleaning and processing of data; choosing and applying a suitable model and/or analytic method to the problem; and communicating the results to a non-technical audience.

Isaac Adebayo

M.S. Data Science

NLP and use of AI in generating human-like text response

Bio

Isaac Adebayo is a master’s student in Data Science at Meharry Medical College whose research focuses on natural language processing, large language models, and multimodal AI systems. His academic interests include chatbot development, agentic retrieval-augmented generation, summarization, and information retrieval. He has also worked on machine learning and predictive modeling using scikit-learn, PyTorch, Keras, pandas, and NumPy, with experience in classification, regression, deep learning, and data visualization. With a background in psychology, neuroscience, and mental health care, Isaac brings an interdisciplinary perspective to data science. His broader research interests include deep learning, optimization, machine unlearning, Bayesian inference, stochastic methods, and transformer-based models for NLP and generative AI.

Abstract

This capstone explores the foundational concepts and historical evolution of chatbots, from early rule-based systems to modern large language model-based chatbots enhanced with retrieval-augmented generation (RAG). It demonstrates the implementation of a standard RAG pipeline using current AI tools, state-of-the-art LLMs, and frameworks such as Hugging Face, LangChain, and Gradio, with deployment support through Hugging Face. The primary goal of the project is to present a working demonstration of a simple LLM-based text chatbot. The study highlights several important challenges, including hallucinations caused by ineffective prompting strategies and retrieval errors arising from different chunking methods during the RAG process. It also examines the role of temperature in generating more natural, human-like responses and its relationship to hallucination. Future work will focus on addressing issues such as retrieval granularity, value misassociation in RAG outputs, schema misinterpretation, and the difficulties of converting structured data into coherent unstructured text.
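The role of temperature mentioned in the abstract can be illustrated with a minimal sketch. This is not code from the capstone; the function name and the three example scores are hypothetical, and the sketch only shows the general idea that dividing token scores by a temperature before the softmax changes how concentrated the sampling distribution is.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw token scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens (illustrative only).
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, 0.2)   # sharper: top token dominates
high = softmax_with_temperature(logits, 1.5)  # flatter: unlikely tokens gain mass
```

At the lower temperature nearly all probability sits on the top-scoring token, which tends to give repetitive but predictable output. At the higher temperature less likely tokens are sampled more often, which can read as more natural but also leaves more room for unsupported content.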

Saul Ashley

M.S. Biomedical Data Science

Development of Anxiety in Breast Cancer Patients Undergoing Therapy: A Preliminary Study Using NIH All of Us Data

Sources report that 1 in 8 women in the United States will be diagnosed with breast cancer in their lifetime. Furthermore, the long-term mental health outcomes of radiation therapy and chemotherapy, both common practices in oncology, are a significant concern for cancer patients. Research shows that the global prevalence of anxiety among cancer patients is 17%–69%, compared with 31.9% in the general population, suggesting that anxiety may be more prevalent among cancer patients. More specifically, Living Beyond Breast Cancer (LBBC) states that more than 40% of people diagnosed with breast cancer experience anxiety. Even survivors may face long-term psychological effects such as anxiety, depression, and a fear of cancer recurrence, which can affect their day-to-day life. Finally, the American Cancer Society (ACS) states that some cancer treatments can cause cognitive effects in current patients and survivors, such as “chemo brain” or brain fog, which may lead to difficulties with concentration, memory, and multitasking. Drawing on NIH All of Us data, this study applies various regression techniques to examine the associations between cancer therapies and procedures and anxiety among breast cancer patients.

Charnisha Azubuike

M.S. Data Science

A Risk-Adaptive Safety Framework for Verifying Large Language Model (LLM) Generated Summaries in Trauma and Violence Intake Systems

Bio

Charnisha Azubuike is a Nashville native from East Nashville and serves as an Epidemiologist II at the Metro Public Health Department. She is a Master of Science in Data Science candidate at Meharry Medical College’s School of Applied Computational Sciences and holds a Master of Public Health. Her work focuses on applying data science and artificial intelligence to improve decision-making, safety, and outcomes in healthcare and public health. She is passionate about developing responsible and impactful data-driven solutions that advance safety, equity, and system reliability. She also enjoys traveling and experiencing new cultures, which enriches her personal and professional outlook.

Abstract

A Risk-Adaptive Safety Framework for Verifying Large Language Model (LLM) Generated Summaries in Trauma and Violence Intake Systems focuses on developing a prototype to improve the safety and reliability of AI-assisted documentation. The National Academy of Medicine estimates that documentation and diagnostic errors affect approximately 10-20% of patient encounters, and even small errors in violence intake records can compromise survivor safety, service eligibility, and care coordination. This framework evaluates LLM-generated summaries for hallucinations, missing information, incorrect identifiers, and inconsistencies, and assigns risk scores to flag potential errors. This work introduces a critical safety layer to help ensure artificial intelligence supports safe, accurate, and trustworthy trauma and violence intake documentation.

Advisor: Uttam Ghosh, Ph.D.

Evann Bailey

M.S. Biomedical Data Science

A Machine Learning Approach to Predicting Cardiovascular Risk in Obstructive Sleep Apnea

Abstract

Traditional measures of sleep-disordered breathing, like the Apnea-Hypopnea Index, often miss short-term changes in the body’s autonomic system that can occur before a stroke. This study introduces a machine learning approach to classify stroke risk using detailed electrocardiogram data from participants in the Sleep Heart Health Study. Instead of analyzing entire sleep periods, the method focuses on 30-second segments during transitions into and out of apnea or hypopnea events. From these segments, features related to timing, frequency, and signal complexity were extracted.

The analysis was conducted on a Linux-based high-performance computing cluster to allow comparison across ten different classification models. To handle class imbalance and reduce the number of features, the study used SMOTE to generate additional minority samples and ANOVA-based feature selection (SelectKBest) to retain the 20 most statistically relevant predictors. Among the tested models, CatBoost performed best, achieving an F1-score of 73.33%. Its strong performance is linked to its use of structured decision trees and ordered boosting, which help reduce overfitting. Overall, the results indicate that complex, non-linear patterns in ECG signals can be effective indicators of stroke risk, offering a scalable foundation for integrating automated cardiovascular markers into sleep diagnostics while highlighting the need to address bias and ensure fairness across diverse populations.
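The ANOVA-based selection step described above can be sketched in a few lines. The abstract names SelectKBest; the version below does not use that library and is not the study's code. It is a plain-Python illustration of the scoring idea only (a one-way ANOVA F statistic per feature, then keeping the top k), with hypothetical function names and made-up toy data. SMOTE oversampling and the classifiers are not shown.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Variation of class means around the overall mean.
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of samples around their own class mean.
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_top_k(rows, labels, k):
    """Return the indices of the k highest-scoring features, plus all scores."""
    n_features = len(rows[0])
    scores = [anova_f_score([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data (made up): feature 0 separates the two classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
labels = [0, 0, 0, 1, 1, 1]
selected, scores = select_top_k(rows, labels, k=1)
```

A feature whose class means sit far apart relative to the spread inside each class gets a high F score, so it is retained. The study kept the 20 highest-scoring predictors; the toy example keeps one.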

Ariana Baker

M.S. Biomedical Data Science

Maternal Morbidity Risk Across Latent Maternal Profiles: A Machine Learning Analysis of U.S. Natality Data

Maternal morbidity continues to be a critical public health concern in the United States, with persistent or worsening rates of severe complications during pregnancy and childbirth. Maternal morbidity encompasses any health condition, short-term or long-term, that causes adverse outcomes during pregnancy or after childbirth, including eclampsia, embolisms, and severe hemorrhaging. In 2020, the maternal mortality rate in the U.S. reached 24 deaths per 100,000 live births, nearly triple that of other high-income countries. Traditional risk assessments often rely on individual clinical factors, limiting their ability to capture the complex heterogeneity of maternal health. This study aims to identify latent maternal profiles using unsupervised machine learning, evaluate their associations with maternal morbidity outcomes, and determine whether including them improves predictive modeling.

The 2024 U.S. Birth Data dataset comprises over 3.6 million birth records and 237 variables, covering maternal and paternal demographics, reproductive history, behavioral factors, prenatal care utilization, comorbidities, infections, and neonatal outcomes. Using only pre-delivery maternal characteristics, Principal Component Analysis (PCA) was applied for dimensionality reduction, followed by K-means clustering to identify distinct maternal profiles. The profiles were characterized and analyzed for their associations with maternal morbidity across sociodemographic, reproductive, behavioral, and clinical domains using logistic regression models. Additionally, predictive models were developed for the full cohort and separately for each identified maternal profile to evaluate whether profile-stratified modeling improves risk prediction.
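The clustering step can be illustrated with a minimal sketch of K-means (Lloyd's algorithm). This is not the study's code: the function name is hypothetical, the points are made up to stand in for PCA-reduced records, the PCA step itself is omitted, and centroids are seeded from the first k points purely to keep the example deterministic. Library implementations typically use more careful initialization such as k-means++.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; seeds centroids from the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Made-up 2-D points standing in for PCA-reduced maternal records.
points = [(0.1, 0.2), (5.0, 5.1), (0.0, 0.1), (5.2, 4.9), (0.2, 0.0), (4.9, 5.0)]
assignments, centroids = kmeans(points, k=2)
```

Each resulting cluster label plays the role of a "profile": records with the same label can then be described together and compared on outcome rates, as the abstract goes on to do.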

Four distinct maternal profiles were identified, each exhibiting unique combinations of clinical and behavioral characteristics. Significant differences in maternal morbidity risk were observed across all profiles, with elevated risk associated with specific metabolic, behavioral, and reproductive factors. Predictive modeling results demonstrated modest performance, reflecting the challenges of predicting maternal morbidity outcomes using pre-delivery features alone.

Overall, this study demonstrates that unsupervised learning can uncover meaningful latent structures in maternal health data, while underscoring the complexity of predicting maternal morbidity prior to clinical onset and the need for a more holistic approach that integrates detailed clinical and geographic data.

Broderick Bellard

M.S. Data Science

HeartWise: A Governed Agentic RAG System for Safe and Reliable Heart-Failure Self-Management

Bio

Broderick Bellard is a data scientist with a background in analytical chemistry and business, currently pursuing a Master of Science in Data Science at Meharry Medical College. His work focuses on the intersection of artificial intelligence, healthcare, and ethics, with an emphasis on building reliable and transparent AI systems. Broderick’s research explores agentic retrieval-augmented generation (RAG) systems, evaluation frameworks for large language models, and governance mechanisms to ensure safe deployment in clinical settings. He is particularly interested in advancing equitable healthcare through data-driven solutions and plans to pursue a Ph.D. in Data Science at Meharry Medical College.

Abstract

This capstone project develops and evaluates an agentic retrieval-augmented generation (RAG) system for heart failure support, designed to provide reliable, context-aware clinical guidance. The study compares multiple system configurations, including a baseline large language model (LLM), standard RAG, agentic RAG, and a governed agentic RAG framework with an integrated evaluation layer. A central focus is understanding how governance mechanisms impact system performance and reliability. Through controlled experiments, the project systematically varies judge temperature to assess score variance, inter-run consistency, and failure modes. These evaluations highlight how structured oversight can stabilize LLM behavior, contributing to the development of more transparent, dependable, and clinically trustworthy AI systems.

Advisor: Mamun Abdullah, Ph.D.

Hannah Boykin

M.S. Data Science

Predicting ICU Admissions using Machine Learning

The COVID-19 pandemic put immense pressure on ICUs. Data-driven strategies are needed to improve patient management, yet challenges remain in predicting who will require ICU care, who may not survive despite admission, and how long survivors will stay. This study aims to build on prior work by identifying key clinical markers and evaluating the accuracy of predictive models for ICU admission, mortality, and length of stay before patients require critical care. Expanding on a previous study of 733 hospitalized COVID-19 patients from a single institution, this research incorporates a larger and more diverse dataset from multiple hospitals to improve generalizability. Demographic, clinical, and laboratory data will be analyzed to enhance model reliability, addressing past concerns about sample size limitations, data imbalance, and the lack of external validation. Machine learning models will assess ICU risk using the clinical data collected, allowing for early identification of high-risk patients. To improve fairness and accuracy, strategies will be applied to balance the dataset so that predictions remain reliable across different patient groups. The goal is to develop an interpretable and practical decision-support tool for ICU planning and resource allocation. By refining prediction methods and incorporating broader hospital data, this study aims to strengthen ICU management strategies. Future directions will focus on validating model performance across healthcare systems, addressing ethical considerations in ICU prioritization, and integrating predictions into clinical decision-making to support proactive patient care and resource efficiency.

Lexus Brinkley-Tapp

M.S. Data Science

Machine Learning-Enhanced Quantum Approximate Optimization for Network Routing: A Comparative Study on Large-Scale Graph Datasets

Bio

Lexus Brinkley-Tapp is a data scientist and quantum technology researcher based in Atlanta, Georgia. She is completing a Master of Science in Data Science at Meharry Medical College, where her research focuses on quantum networking, hybrid quantum-classical optimization, and secure AI-driven infrastructure. Her work examines how machine learning and quantum algorithms can be used to optimize the development of quantum communication topology. Professionally, her background in large-scale fiber network activation and telecommunications infrastructure brings hands-on experience that directly informs her research interests in secure distributed systems, network strength, and adaptable infrastructure. Her academic work includes developing optimization frameworks for quantum communication networks and contributing to research at the intersection of quantum machine learning and network science. Her broader research vision connects quantum technologies, artificial intelligence, and critical infrastructure, particularly secure next-generation communication networks and smart grids, with a future focus on building systems that are robust, equitably designed, and protected.

Abstract

This capstone develops a hybrid framework that combines machine learning with the Quantum Approximate Optimization Algorithm (QAOA) for routing and resource allocation in quantum communication networks. It tackles combinatorial congestion and flow optimization in 10–50 node simulated infrastructures under entanglement, fidelity, and memory constraints. Qiskit benchmarks compare QAOA variants to classical heuristics on graph datasets, quantifying gains in routing efficiency and congestion reduction. Results inform scalable quantum internet design. Future work will deploy the framework on NISQ hardware with error mitigation, scale to 100+ nodes via tensor-network simulators, and adapt the pipeline to multi-omics drug discovery for target identification, pathway optimization, and precision-medicine screening.

Supervisor: Pushpita Chatterjee, Ph.D.