Student Capstone Projects

The capstone is the culminating project for each student in a SACS Master of Science program. These comprehensive, industry-style projects are oriented toward each student’s domain of interest.

Each project includes: formulation of a question to be answered by the data; collection, cleaning and processing of data; choosing and applying a suitable model and/or analytic method to the problem; and communicating the results to a non-technical audience.

Pykiet, Cameron

M.S. Data Science

Developing a Novel Protein Function Prediction Pipeline using Structurally Aware Tokenization (SAT)

Proteins are the biomolecules that form the building blocks of life. The genetic information encoded in deoxyribonucleic acid (DNA) is transcribed into ribonucleic acid (RNA), which is then translated into proteins. Proteins perform a multitude of functions, including but not limited to catalyzing reactions as enzymes, participating in the body’s defense mechanisms as antibodies, forming structures, and transporting important chemicals. Functional characterization of proteins is therefore crucial to understanding life and disease and to developing novel therapeutics. In this research, we propose a protein function prediction (PFP) pipeline that uses a new tokenization method called Structurally Aware Tokenization (SAT). SAT decomposes a protein sequence into its constituent domains, the evolutionarily conserved structural parts of the sequence. Given a protein sequence as input, the objective of protein function prediction, or annotation, is to assign appropriate Gene Ontology (GO) terms that describe the functional characteristics of the query protein. Tokenization, the splitting of an input sequence into smaller subunits (tokens), is crucial in language models. Protein language models typically take sequences of amino acids and apply common tokenization techniques such as individual amino acids (AA), k-mers, or Byte Pair Encoding (BPE). We hypothesize that using structural constructs of a protein, such as an ordered list of motifs and domains, as tokens will supply the model with additional structural information, because these constructs link directly to function. In this work, we explore SAT for protein function prediction in comparison with AA, k-mers, and BPE under various machine learning and natural language processing techniques, including Word2Vec and large language models. Additionally, we implement a GPT-2-style transformer architecture to train a PFP model using the proposed SAT tokenization. The experimental results are promising, although further exploration is needed to confirm the method’s efficacy.
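The tokenization schemes compared in this abstract can be illustrated with a minimal sketch. It is not the project's pipeline: the function names, the toy sequence, and the `domain_hits` input format (start position plus a domain identifier, as a domain annotation tool might report) are all assumptions made for illustration, and BPE is omitted because it requires a learned vocabulary.

```python
def amino_acid_tokens(sequence):
    """One token per residue (the AA scheme)."""
    return list(sequence)


def kmer_tokens(sequence, k=3):
    """Overlapping k-mers taken with a stride of 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def domain_tokens(domain_hits):
    """Ordered list of domain identifiers, sorted by start position.

    domain_hits is a hypothetical list of (start, domain_id) tuples;
    the abstract does not say how domains are detected.
    """
    return [domain_id for _, domain_id in sorted(domain_hits)]


toy_sequence = "MKTAYIAK"  # made-up sequence, illustration only
print(amino_acid_tokens(toy_sequence))
print(kmer_tokens(toy_sequence, k=3))
# Placeholder identifiers standing in for real domain annotations.
print(domain_tokens([(120, "DOMAIN_B"), (5, "DOMAIN_A")]))
```

A domain-level tokenizer yields far fewer tokens per protein than the residue-level schemes, which is one reason the resulting vocabulary and model behaviour may differ.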

Dyson, Jaylin

M.S. Data Science

Developing a Knowledge Graph-Driven AI Agent for Protein Function Prediction

Knowledge graphs (KGs) represent knowledge as nodes connected by edges; in biology, they are used to capture findings and link related biological entities. The most accurate KGs are created from journals, abstracts, and lab experiments, prioritizing precision over recall. However, this manual process is slow and resource-intensive, resulting in gaps and missing connections. To address this challenge, computational approaches for knowledge graph completion (KGC) have emerged to help accelerate the discovery of new links. Many of these methods were originally developed for non-biological link prediction tasks within computer science and have been successfully adapted for biological KGs. Current models primarily employ embedding techniques (TransE, ComplEx, and RotatE) and Graph Neural Networks (GNNs). Other techniques, such as generative adversarial networks (GANs) and reinforcement learning (RL), remain largely unexplored in biological link prediction. In this context, we propose a novel framework that integrates GANs and RL to generate and rank plausible new links for protein function prediction within biological knowledge graphs.
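As background for the embedding baselines named above, the sketch below shows the standard TransE scoring idea, in which a triple (head, relation, tail) is plausible when head + relation lies close to tail. It illustrates the existing technique only, not the proposed GAN and RL framework; the two-dimensional vectors and the entity names are made up for the example.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail.

    A higher (less negative) score means a more plausible triple.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy embeddings; real models learn these from the graph.
protein = [0.2, 0.5]
has_function = [0.3, -0.1]
candidate_terms = {
    "term_A": [0.5, 0.4],
    "term_B": [-0.6, 0.9],
}

# Rank candidate tails for the query (protein, has_function, ?).
ranked = sorted(
    candidate_terms,
    key=lambda name: transe_score(protein, has_function, candidate_terms[name]),
    reverse=True,
)
print(ranked)
```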

Williams, Lorayya

M.S. Data Science

A Novel Pipeline for Virus Integration Sites Detection in Tumor Genomes Using Deep Learning

Cancer is one of the leading causes of death worldwide. Pathogenic viruses are estimated to be responsible for 15% of all human cancers globally and pose significant threats to public health. Viruses integrate their genetic material into the host genome, increasing the risk of cancer-promoting changes in it. To understand the molecular mechanisms of virus-mediated cancers, it is crucial to identify viral insertion sites in cancer genomes. However, this effort is hindered by the rapidly increasing volume of tumor sequencing data, along with the challenges of accurate data analysis caused by high viral mutation rates and the difficulty of aligning short reads to the reference genome. An efficient method for detecting virus integration sites in tumor genomes is therefore needed. This paper proposes a novel pipeline that identifies viral integration sites using deep Convolutional Neural Networks (CNNs). Our contributions are twofold: (i) we propose and integrate two novel matrix generation methods into the pipeline, developed after aligning the host and viral genomes with their respective reference genomes; and (ii) we employ one-hot encoded images with reduced computational complexity to represent viral integration sites and harness the capabilities of deep CNNs for detection. The paper illustrates our proposed approach and presents experiments conducted using both synthetic and real sequencing data. Our experimental results are promising and demonstrate the effectiveness of the proposed methods in detecting viral integration sites.
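One-hot encoding of a nucleotide sequence, mentioned in contribution (ii), can be sketched generically as follows. The abstract does not describe the two matrix generation methods, so this is only the textbook encoding: the row order A, C, G, T and the all-zero column for unknown bases are assumptions.

```python
BASES = "ACGT"  # assumed row order of the encoded matrix


def one_hot_encode(read):
    """Return a 4 x len(read) matrix with one row per base in BASES.

    Bases outside A/C/G/T (for example N) become all-zero columns.
    """
    read = read.upper()
    return [[1 if base == row_base else 0 for base in read] for row_base in BASES]


for row_base, row in zip(BASES, one_hot_encode("ACGTN")):
    print(row_base, row)
```

A matrix of this shape can be treated as a single-channel image and passed to a convolutional network.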

Rukundo, Ange

M.S. Data Science

Diesel Prices Trends Analysis and Forecasting with Machine Learning on State and National Levels in U.S. (2020 – 2024)

Diesel fuel prices in the U.S. saw dramatic changes between 2020 and 2024, driven by a mix of global events, shifting regulations, and supply chain disruptions. These price swings had wide-reaching effects: they raised transportation costs, strained businesses, and put extra pressure on consumers, especially in fuel-dependent industries. This project examines the trends behind these changes, looking at how factors such as crude oil prices, geopolitical tensions, and environmental policies have shaped regional diesel price patterns. Using machine learning models such as Linear Regression and Gradient Boosting Regressors, the study forecasts diesel prices into 2025, offering a data-driven approach to help businesses and other stakeholders plan ahead. The results point to continued price volatility, with significant differences across U.S. regions. By blending historical analysis with predictive tools, this research aims to support smarter, more resilient decision-making in an uncertain energy landscape.

Broughton, Julian

M.S. Biomedical Data Science

Learning Unbiased Risk Prediction Based Algorithms in Healthcare

The rapid advancement of Artificial Intelligence (AI) has significantly transformed healthcare, enhancing traditional methods of diagnosis and treatment. These innovations have enabled quicker disease detection, improved management, and more personalized care. However, many AI tools currently used in clinical settings suffer from algorithmic and data-driven biases, often due to inadequate representation of certain racial, gender, and age groups. These gaps can result in misdiagnoses, health disparities, and inequitable outcomes. Therefore, addressing these biases is essential. This project investigates the presence and impact of such biases by examining both pre-processing and post-processing stages of AI model development, using a widely adopted real-world healthcare dataset from primary care patients. It uncovers previously overlooked biases and offers practical strategies to reduce disparities related to race, gender, and age. By applying machine learning algorithms and utilizing the Fairlearn toolkit, the study identifies and quantifies the biases, evaluates their effects on predictive performance, and presents methods to mitigate them. The findings provide strong evidence of systemic bias in healthcare AI systems, particularly in how they influence resource distribution and decision-making. As a result, it is imperative to incorporate bias detection and mitigation techniques to ensure that AI technologies in healthcare are fair, dependable, and ethically sound.
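One common way to quantify the kind of group-level bias described above is the demographic parity difference, the gap between the highest and lowest rates of positive predictions across groups. Fairlearn provides a metric by this name; the sketch below computes the same quantity by hand with the standard library so it runs without dependencies. The predictions and group labels are toy values, not study data, and the abstract does not state which fairness metrics the project used.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Fraction of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += prediction
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(predictions, groups):
    """Largest gap in selection rate between any two groups (0 means parity)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Toy binary predictions for two hypothetical groups.
toy_predictions = [1, 1, 0, 1, 0, 0, 1, 0]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(selection_rates(toy_predictions, toy_groups))
print(demographic_parity_difference(toy_predictions, toy_groups))
```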

Pounds, Destiny

M.S. Biomedical Data Science

Forecasting Minute-by-Minute Stress, Anxiety, and Affective States Using Time-Series Analysis of Wearable Sensor Data

This capstone project explores the potential of using wearable sensor data to predict psychological states in real time. The study focuses on features extracted from electrodermal activity (EDA) and heart rate variability (HRV) signals to predict self-reported stress, anxiety, and affect using the WESAD dataset. Two modeling approaches were developed: static models that predict self-reports from single-minute data snapshots, and time-series models that use ten-minute sliding windows of physiological data and labels to forecast the next minute’s self-reported data. Multiple machine learning algorithms, including Random Forest, XGBoost, LightGBM, and Support Vector Machine, were trained and evaluated using subject-aware cross-validation to ensure generalizability across unseen participants. Results show that time-series models consistently outperform static models in predicting stress, anxiety, and affective states, highlighting the value of incorporating temporal physiological context. Feature importance analysis further identified key physiological markers associated with psychological states. These findings support the potential of wearable technology and machine learning for real-time, personalized mental health monitoring, offering a foundation for future research and clinical applications in remote or continuous mental health assessment.
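The two ideas that distinguish the time-series approach, sliding windows and subject-aware evaluation, can be sketched as below. This is a simplified illustration under stated assumptions: it windows a single per-minute series rather than the full set of EDA/HRV features and labels, and it uses leave-one-subject-out splitting as one possible form of subject-aware cross-validation, which the abstract does not specify.

```python
def sliding_windows(series, window=10):
    """Pair each run of `window` consecutive minutes with the next minute's value."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    No subject ever appears in both the training and the test indices.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy per-minute values for one participant (not WESAD data).
minutes = list(range(12))
for history, target in sliding_windows(minutes, window=10):
    print(history, "->", target)

# Hypothetical subject labels for five samples.
for held_out, train, test in leave_one_subject_out(["s1", "s1", "s2", "s3", "s3"]):
    print(held_out, train, test)
```

Splitting by subject rather than by row keeps a participant's minutes from leaking between training and test sets, which is what makes the evaluation reflect performance on unseen participants.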

Dawkins, Gabrielle

M.S. Biomedical Data Science

Voice Analysis to Differentiate between Neurological, Respiratory, Cardiovascular Conditions

The human voice, a biomarker produced by the complex movements of communication, is known to change as a person ages or experiences changes in health. Neurological, cardiovascular, and respiratory disorders are diseases that can alter the acoustic features of the voice. Fluid accumulation in the vocal folds and fatigue can change voice features such as pitch, formants, jitter, and shimmer, as well as higher-order features such as mel-frequency cepstral coefficients (MFCCs). In this study, we aim to distinguish the feature changes among patients with neurological, cardiovascular, and respiratory disorders, including patients with multiple comorbidities. We used the Bridge2AI voice dataset, which consists of 302 subjects with these disorders. We analyzed 140 acoustic features for significant differences between diseases and found that several features differed. We further implemented machine learning algorithms to identify the patients from their voice features (for both subject-dependent and subject-independent cases) and achieved an F1 score above 0.60. We aim to extend this study in the future to improve accuracy.
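Jitter and shimmer, two of the features named above, measure cycle-to-cycle instability in pitch period and amplitude. The sketch below uses the common "local" definition (mean absolute difference between consecutive cycles divided by the mean value); the abstract does not say which variants or extraction tools the study used, and the period and amplitude values are made up.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value.

    Applied to glottal periods this gives local jitter; applied to
    per-cycle peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    mean_abs_diff = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    mean_value = sum(values) / len(values)
    return mean_abs_diff / mean_value


# Toy per-cycle measurements (illustration only).
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [0.80, 0.82, 0.79, 0.81]

jitter = local_perturbation(periods_s)
shimmer = local_perturbation(amplitudes)
print(f"jitter (local): {jitter:.2%}, shimmer (local): {shimmer:.2%}")
```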

McDonald, Micaiah

M.S. Biomedical Data Science

Exploring the Impact of Neighborhood Environment, Food Insecurity, Discrimination, and Social Support on Mental Health Among People Who Use Marijuana

This study examines the impact of Social Determinants of Health (SDoH), including neighborhood environment, food insecurity, discrimination, and social support, on mental health outcomes, specifically depression and anxiety, among individuals who use marijuana. The study uses data from the NIH All of Us Research Program, which works to improve health care through research and is building a diverse database that can inform thousands of studies on a variety of health conditions. The research focused on participants who completed the SDoH and lifestyle surveys, in which marijuana use was self-reported. Electronic Health Records (EHR) were used to identify participants diagnosed with mental health conditions, including depression and anxiety, using ICD-10 codes. Variables such as neighborhood conditions (cleanliness, noise, graffiti), food insecurity (binary indicator), discrimination (experiences of inequitable treatment), and perceived social support were extracted from the surveys. The analysis also accounted for demographic factors such as age, race, gender, education, marital status, and income. Logistic regression models were used to explore how these factors relate to mental health outcomes. The study included 7,519 participants, with 51% reporting a prior diagnosis of depression and 54% reporting anxiety before completing the survey. The findings showed that both food insecurity and discrimination were significantly associated with depression and anxiety. Social support was a protective factor: greater social support was associated with fewer diagnoses of both depression and anxiety. In addition, people with lower education levels were at increased risk of being diagnosed with anxiety, and people with lower income had a higher likelihood of a depression diagnosis.
Overall, this study highlights the role that social determinants play in shaping mental health outcomes such as depression and anxiety. These results underscore the importance of addressing disparities in social support and income through targeted interventions to help reduce mental health burdens in diverse populations. The study also offers valuable information on how these social factors influence mental health and points to a key area for future research and intervention, in particular the development of public health strategies that address both individual and broader systemic causes of mental health challenges.